
How to Scrape Twitter/X: Complete Guide for 2026
Learn how to scrape Twitter/X in 2026 with Python. Covers Cloudflare bypass, structured data extraction, and scaling your pipeline to millions of tweets.
March 30, 2026
Twitter/X holds some of the most valuable real-time signal data on the internet — public sentiment, trending discourse, influencer activity, and breaking narratives. The official API, however, is prohibitively expensive for most use cases (Basic tier caps at 10,000 tweets/month for $100), and the site's bot protections have become substantially more aggressive through 2025 and into 2026. This guide covers exactly how to scrape Twitter/X: what you're up against, the tools you need, and production-ready code to get started.
Why Scrape Twitter/X?
Before getting into implementation, three concrete use cases justify the engineering investment:
Brand sentiment monitoring. Marketing and comms teams track mentions, hashtag velocity, and sentiment shifts in real time. Scraping gives you raw signal without API cost caps or X's 30-day historical data ceiling — critical for retroactive analysis after a PR event.
Academic and investigative research. Researchers studying misinformation networks, political discourse, or public health communication regularly need large tweet corpora that exceed what the Academic Research tier provides. That tier is frequently oversubscribed, approval is slow, and access can be revoked without notice.
Competitive intelligence. Tracking competitor product announcements, executive statements, and community reactions at scale is a legitimate business intelligence use case. No official API product maps cleanly onto this workflow.
Anti-Bot Challenges on twitter.com
Twitter/X runs several protection layers simultaneously, and as of 2026 they've tightened across the board:
Cloudflare Turnstile. X migrated from in-house bot detection to Cloudflare's Turnstile challenge on login walls and rate-limited endpoints. Turnstile performs browser fingerprinting, behavioral analysis, and passive challenges invisible to legitimate users but fatal to naive scrapers. Unlike a traditional CAPTCHA, Turnstile doesn't present a puzzle — it silently fails bot sessions before they reach content.
JavaScript-rendered content. The entire frontend is a React SPA. A raw HTTP response from x.com contains almost no tweet data — meaningful content only appears after JavaScript hydration, authenticated GraphQL calls, and DOM rendering complete. requests + BeautifulSoup alone will not work here.
Login gates. As of late 2025, X requires authenticated sessions to view most profile timelines and all search results beyond a limited preview. Your scraper must manage session tokens — a significant attack surface that X monitors for anomalous usage patterns.
IP-based rate limiting. Even valid sessions are rate-limited by IP. A datacenter IP attempting meaningful volume will be blocked within minutes. Residential or ISP proxies are required.
Handling all of this in-house — rotating residential proxies, headless browser orchestration, Turnstile solving, session rotation — is weeks of engineering that will break on every X infrastructure update. AlterLab's Anti-bot bypass API handles this entire infrastructure layer, letting you focus on the data logic rather than the bypass mechanics.
Quick Start with the AlterLab API
Install the SDK and grab your API key from the Getting started guide, then run:
```bash
pip install alterlab beautifulsoup4
```

The minimum viable scrape of a Twitter/X profile:

```python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://x.com/OpenAI",
    render_js=True,
    wait_for="article[data-testid='tweet']",
    session_id="twitter-session-01",
    timeout=30,
)

soup = BeautifulSoup(response.html, "html.parser")
tweets = soup.select("[data-testid='tweetText']")

for tweet in tweets:
    print(tweet.get_text(strip=True))
```

render_js=True instructs the API to use a headless browser. The wait_for parameter accepts any CSS selector — the scraper holds until that element appears in the DOM before returning the response. Without it, you'll frequently receive the React loading skeleton rather than actual tweet content.
The equivalent cURL request:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://x.com/OpenAI",
    "render_js": true,
    "wait_for": "article[data-testid='\''tweet'\'']",
    "session_id": "twitter-session-01",
    "timeout": 30
  }'
```

Try scraping Twitter/X with AlterLab — no setup required
Extracting Structured Data
Twitter/X's DOM structure is more stable than most SPAs because it's driven by data-testid attributes tied to React component names rather than hashed CSS class names. These selectors survive most frontend deploys:
| Data Point | CSS Selector |
|---|---|
| Tweet text | [data-testid="tweetText"] |
| Tweet timestamp | time[datetime] |
| Display name | [data-testid="User-Name"] span:first-child |
| Handle | [data-testid="User-Name"] a[href] span |
| Like count | [data-testid="like"] [data-testid="app-text-transition-container"] span |
| Retweet count | [data-testid="retweet"] [data-testid="app-text-transition-container"] span |
| Reply count | [data-testid="reply"] [data-testid="app-text-transition-container"] span |
| Attached images | [data-testid="tweetPhoto"] img |
Here's a complete parser that returns structured Tweet objects from a rendered HTML response:
```python
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import Optional

import alterlab

@dataclass
class Tweet:
    text: str
    timestamp: Optional[str]
    display_name: Optional[str]
    handle: Optional[str]
    likes: str
    retweets: str
    replies: str

def parse_tweets(html: str) -> list[Tweet]:
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.select("article[data-testid='tweet']"):
        text_el = article.select_one("[data-testid='tweetText']")
        time_el = article.select_one("time[datetime]")
        name_els = article.select("[data-testid='User-Name'] span")
        like_el = article.select_one(
            "[data-testid='like'] [data-testid='app-text-transition-container'] span"
        )
        rt_el = article.select_one(
            "[data-testid='retweet'] [data-testid='app-text-transition-container'] span"
        )
        reply_el = article.select_one(
            "[data-testid='reply'] [data-testid='app-text-transition-container'] span"
        )
        results.append(Tweet(
            text=text_el.get_text(strip=True) if text_el else "",
            timestamp=time_el["datetime"] if time_el else None,
            display_name=name_els[0].get_text(strip=True) if name_els else None,
            handle=name_els[1].get_text(strip=True) if len(name_els) > 1 else None,
            likes=like_el.get_text(strip=True) if like_el else "0",
            retweets=rt_el.get_text(strip=True) if rt_el else "0",
            replies=reply_el.get_text(strip=True) if reply_el else "0",
        ))
    return results

# Scrape a live search feed
client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    url="https://x.com/search?q=%23MachineLearning&src=typed_query&f=live",
    render_js=True,
    wait_for="article[data-testid='tweet']",
)

tweets = parse_tweets(response.html)
for t in tweets:
    print(f"[{t.timestamp}] @{t.handle} ({t.likes} likes): {t.text[:120]}")
```

Note the f=live query parameter — this returns chronological results rather than X's ranked "Top" feed, which is almost always what you want for monitoring pipelines where recency matters.
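Building that search URL by hand invites encoding bugs — a hashtag must arrive as %23, not #. A small helper using the standard library keeps query construction safe (the helper name is ours, not part of any SDK):

```python
from urllib.parse import urlencode

def build_search_url(query: str, live: bool = True) -> str:
    """Build an x.com search URL; f=live requests the chronological feed."""
    params = {"q": query, "src": "typed_query"}
    if live:
        params["f"] = "live"  # omit to get X's ranked "Top" results instead
    return f"https://x.com/search?{urlencode(params)}"
```

urlencode handles percent-escaping, so queries with hashtags, spaces, or operators like from:OpenAI pass through correctly.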
Common Pitfalls
Empty responses despite HTTP 200. This happens when render_js=True is omitted, or when the wait_for selector never resolves within the timeout window. Twitter/X's React app serves a loading skeleton on initial response — if your scraper captures the page before hydration completes, you get the shell, not the data. Always use wait_for with a tweet-specific selector, and set a realistic timeout (20–30 seconds).
Session invalidation mid-pipeline. X expires session tokens aggressively and will silently begin returning truncated or redirected content rather than erroring explicitly. If you're relying on session_id for persistence across requests, build retry logic around session resets. Never hardcode a single session into a long-running pipeline without rotation logic.
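One way to sketch that retry logic: attempt the scrape, check the returned HTML for actual tweet nodes, and rotate to a fresh session_id if the content looks degraded. The function name and the "degraded" heuristic (no tweetText nodes in an otherwise successful response) are our assumptions — substitute whatever signal fits your pipeline:

```python
def scrape_with_session_rotation(client, url: str, max_attempts: int = 3):
    """Retry a scrape under a fresh session_id when the response looks degraded.

    `client` is any object exposing a scrape() method with the keyword
    arguments used earlier in this guide (e.g. an alterlab.Client).
    """
    for attempt in range(1, max_attempts + 1):
        response = client.scrape(
            url=url,
            render_js=True,
            wait_for="article[data-testid='tweet']",
            session_id=f"twitter-session-{attempt:02d}",  # fresh session per attempt
            timeout=30,
        )
        # X degrades silently: an HTTP 200 with no tweet nodes usually means
        # the session token died, not that the timeline is empty.
        if 'data-testid="tweetText"' in response.html:
            return response
    raise RuntimeError(f"{max_attempts} sessions returned degraded content for {url}")
```

Because the check runs on the response body rather than the status code, it catches the truncated-content failure mode that a naive retry-on-error loop misses.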
Engagement counts are strings, not integers. Like, retweet, and reply counts are displayed as abbreviated strings ("12.4K", "1.2M"). Parse these to numeric values on ingest — don't assume they're castable with int() directly.
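A minimal normalizer for those abbreviated counts might look like this (the helper name is ours; rounding avoids float-truncation errors on values like "8.2K"):

```python
def parse_count(raw: str) -> int:
    """Convert an abbreviated engagement count ('12.4K', '1.2M', '347') to an int."""
    raw = raw.strip().replace(",", "")
    if not raw:
        return 0  # X renders zero-engagement counters as empty strings
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    suffix = raw[-1].upper()
    if suffix in multipliers:
        # round() before int() so 8.2 * 1000 doesn't truncate to 8199
        return int(round(float(raw[:-1]) * multipliers[suffix]))
    return int(raw)
```

Run this at ingest time so your warehouse columns are numeric from day one; abbreviated counts are lossy, so store them as approximations if exact figures matter.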
Search results are geo-localized. Trending topics, search result ordering, and content visibility differ by region. If your pipeline targets a specific market, specify a matching proxy country in your request parameters to ensure consistent, representative results.
Infinite scroll vs. pagination. Twitter/X has no page number parameters. Additional content loads via scroll events triggering background GraphQL calls. To scrape beyond the first viewport's content, use the scroll option to simulate scroll events before extraction, or target the internal GraphQL API endpoints directly (more stable for high-volume work, but requires managing auth tokens).
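Successive scroll captures overlap — the last few tweets of one extraction reappear at the top of the next — so a scroll-based pipeline needs deduplication on merge. A sketch keyed on (handle, timestamp, text), using a reduced stand-in for the Tweet dataclass from the parser above so the example runs alone:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Tweet:
    """Reduced stand-in for the parser's Tweet dataclass (fields trimmed)."""
    text: str
    timestamp: Optional[str]
    handle: Optional[str]

def dedupe_batches(batches: list[list[Tweet]]) -> list[Tweet]:
    """Merge overlapping scroll captures, keeping the first copy of each tweet."""
    seen: set[tuple] = set()
    merged: list[Tweet] = []
    for batch in batches:
        for tweet in batch:
            key = (tweet.handle, tweet.timestamp, tweet.text)
            if key not in seen:
                seen.add(key)
                merged.append(tweet)
    return merged
```

Keying on content rather than DOM position keeps the merge correct even when X reorders or injects promoted tweets between captures.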
Scaling Up
For high-volume pipelines, sequential requests are too slow. Move to concurrent scraping with asyncio:
```python
import asyncio

import alterlab
from parse_tweets import parse_tweets  # from the previous example

async def scrape_profile(client: alterlab.AsyncClient, username: str) -> dict:
    try:
        response = await client.scrape(
            url=f"https://x.com/{username}",
            render_js=True,
            wait_for="article[data-testid='tweet']",
            timeout=30,
        )
        tweets = parse_tweets(response.html)
        return {
            "username": username,
            "tweet_count": len(tweets),
            "tweets": tweets,
        }
    except alterlab.ScraperError as e:
        return {"username": username, "error": str(e), "tweets": []}

async def main():
    targets = ["OpenAI", "AnthropicAI", "GoogleDeepMind", "MetaAI", "MistralAI"]
    async with alterlab.AsyncClient("YOUR_API_KEY") as client:
        tasks = [scrape_profile(client, handle) for handle in targets]
        results = await asyncio.gather(*tasks)
    for r in results:
        status = f"{r['tweet_count']} tweets" if "error" not in r else f"FAILED: {r['error']}"
        print(f"@{r['username']}: {status}")

asyncio.run(main())
```

For production scheduling, feed URLs into a job queue (Celery, ARQ, or Cloud Tasks) rather than running asyncio.gather across thousands of targets directly. A sustainable throughput for most pipelines is 10–50 concurrent requests with 1–3 seconds of jitter between batches to avoid triggering X's per-session rate limiting.
Storage strategy. Raw rendered HTML from a Twitter/X profile runs 400–900 KB per page. At any meaningful scale, parse and persist structured fields immediately — don't store raw HTML. Pipe output from parse_tweets() directly to PostgreSQL, BigQuery, or your warehouse of choice.
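As a concrete sketch of parse-then-persist, here's a writer that stores structured fields with a uniqueness constraint so re-scraped tweets don't duplicate. SQLite stands in for PostgreSQL or BigQuery purely to keep the example self-contained, and the Tweet dataclass is a trimmed stand-in for the parser's version:

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import Optional

@dataclass
class Tweet:
    """Reduced stand-in for the parser's Tweet dataclass (fields trimmed)."""
    text: str
    timestamp: Optional[str]
    handle: Optional[str]
    likes: str

def persist_tweets(conn: sqlite3.Connection, tweets: list[Tweet]) -> int:
    """Insert parsed tweets, skipping duplicates; returns total rows stored."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tweets (
               text TEXT, timestamp TEXT, handle TEXT, likes TEXT,
               UNIQUE (handle, timestamp, text)
           )"""
    )
    # OR IGNORE makes repeated scrapes of the same timeline idempotent
    conn.executemany(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)",
        [astuple(t) for t in tweets],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Swapping the connection and placeholder syntax for psycopg or a BigQuery client keeps the same shape; the important part is that only structured fields, never raw HTML, hit storage.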
Cost planning. JS-rendered pages consume more credits than static scrapes due to headless browser compute time. For large pipelines, review AlterLab pricing before you hit your plan's limits mid-run. Growth and Enterprise tiers include priority queuing and dedicated proxy pools, which meaningfully improve throughput on high-frequency jobs where queue latency adds up.
Key Takeaways
- Raw HTTP requests won't work. Twitter/X is a full React SPA with JavaScript-gated content, Cloudflare Turnstile, and login-walled endpoints. Headless browser rendering is non-negotiable.
- Lean on data-testid selectors. They're tied to React component names and survive CSS class hash rotations — far more stable than class-based selectors on this site.
- Build session rotation in from day one. X invalidates tokens aggressively and degrades content silently rather than returning errors. Your pipeline needs to detect and recover from this automatically.
- Parse immediately, don't store raw HTML. Structured field extraction at ingest time keeps storage costs manageable and query performance fast at scale.
- Proxy geography affects your data. Trending topics, search ranking, and content visibility are all geo-localized. Match your proxy region to your target market.
Related Guides
Working with other social platforms? These guides cover the same anti-bot patterns with site-specific implementation details:
- How to Scrape Reddit — API rate limits, comment thread traversal, subreddit monitoring pipelines
- How to Scrape Instagram — Login walls, story metadata, profile and hashtag scraping
- How to Scrape TikTok — Video metadata extraction, trending content feeds, creator analytics