How to Scrape Twitter/X: Complete Guide for 2026

Learn how to scrape Twitter/X in 2026 with Python. Covers Cloudflare bypass, structured data extraction, and scaling your pipeline to millions of tweets.

Yash Dubey

March 30, 2026

8 min read

Twitter/X holds some of the most valuable real-time signal data on the internet — public sentiment, trending discourse, influencer activity, and breaking narratives. The official API, however, is prohibitively expensive for most use cases (Basic tier caps at 10,000 tweets/month for $100), and the site's bot protections have become substantially more aggressive through 2025 and into 2026. This guide covers exactly how to scrape Twitter/X: what you're up against, the tools you need, and production-ready code to get started.

Why Scrape Twitter/X?

Before getting into implementation, three concrete use cases justify the engineering investment:

Brand sentiment monitoring. Marketing and comms teams track mentions, hashtag velocity, and sentiment shifts in real time. Scraping gives you raw signal without API cost caps or X's 30-day historical data ceiling — critical for retroactive analysis after a PR event.

Academic and investigative research. Researchers studying misinformation networks, political discourse, or public health communication regularly need large tweet corpora that exceed what the Academic Research tier provides. That tier is frequently oversubscribed, approval is slow, and access can be revoked without notice.

Competitive intelligence. Tracking competitor product announcements, executive statements, and community reactions at scale is a legitimate business intelligence use case. No official API product maps cleanly onto this workflow.

Anti-Bot Challenges on twitter.com

Twitter/X runs several protection layers simultaneously, and as of 2026 they've tightened across the board:

Cloudflare Turnstile. X migrated from in-house bot detection to Cloudflare's Turnstile challenge on login walls and rate-limited endpoints. Turnstile performs browser fingerprinting, behavioral analysis, and passive challenges invisible to legitimate users but fatal to naive scrapers. Unlike a traditional CAPTCHA, Turnstile doesn't present a puzzle — it silently fails bot sessions before they reach content.

JavaScript-rendered content. The entire frontend is a React SPA. A raw HTTP response from x.com contains almost no tweet data — meaningful content only appears after JavaScript hydration, authenticated GraphQL calls, and DOM rendering complete. requests + BeautifulSoup alone will not work here.
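You can see this without touching the network at all: parsing a pre-hydration skeleton (a stand-in for the kind of markup a raw HTTP client receives from x.com) finds zero tweet elements, even though the page would render fine in a browser.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a raw, pre-hydration response: a React mount
# point and loading placeholders -- no tweet markup anywhere.
SKELETON_HTML = """
<html><body>
  <div id="react-root">
    <div class="loading-placeholder" aria-busy="true"></div>
  </div>
</body></html>
"""

soup = BeautifulSoup(SKELETON_HTML, "html.parser")
tweets = soup.select("[data-testid='tweetText']")
print(len(tweets))  # 0 -- the content simply isn't in the raw HTML
```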

Login gates. As of late 2025, X requires authenticated sessions to view most profile timelines and all search results beyond a limited preview. Your scraper must manage session tokens — a significant attack surface that X monitors for anomalous usage patterns.

IP-based rate limiting. Even valid sessions are rate-limited by IP. A datacenter IP attempting meaningful volume will be blocked within minutes. Residential or ISP proxies are required.

Handling all of this in-house — rotating residential proxies, headless browser orchestration, Turnstile solving, session rotation — is weeks of engineering that will break on every X infrastructure update. AlterLab's Anti-bot bypass API handles this entire infrastructure layer, letting you focus on the data logic rather than the bypass mechanics.

Quick Start with the AlterLab API

Install the SDK and grab your API key from the Getting started guide, then run:

Bash
pip install alterlab beautifulsoup4

The minimum viable scrape of a Twitter/X profile:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://x.com/OpenAI",
    render_js=True,
    wait_for="article[data-testid='tweet']",
    session_id="twitter-session-01",
    timeout=30,
)

soup = BeautifulSoup(response.html, "html.parser")
tweets = soup.select("[data-testid='tweetText']")

for tweet in tweets:
    print(tweet.get_text(strip=True))

render_js=True instructs the API to use a headless browser. The wait_for parameter accepts any CSS selector — the scraper holds until that element appears in the DOM before returning the response. Without it, you'll frequently receive the React loading skeleton rather than actual tweet content.

The equivalent cURL request:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://x.com/OpenAI",
    "render_js": true,
    "wait_for": "article[data-testid='\''tweet'\'']",
    "session_id": "twitter-session-01",
    "timeout": 30
  }'

Extracting Structured Data

Twitter/X's DOM structure is more stable than most SPAs because it's driven by data-testid attributes tied to React component names rather than hashed CSS class names. These selectors survive most frontend deploys:

Data Point        CSS Selector
Tweet text        [data-testid="tweetText"]
Tweet timestamp   time[datetime]
Display name      [data-testid="User-Name"] span:first-child
Handle            [data-testid="User-Name"] a[href] span
Like count        [data-testid="like"] [data-testid="app-text-transition-container"] span
Retweet count     [data-testid="retweet"] [data-testid="app-text-transition-container"] span
Reply count       [data-testid="reply"] [data-testid="app-text-transition-container"] span
Attached images   [data-testid="tweetPhoto"] img

Here's a complete parser that returns structured Tweet objects from a rendered HTML response:

Python
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import Optional
import alterlab

@dataclass
class Tweet:
    text: str
    timestamp: Optional[str]
    display_name: Optional[str]
    handle: Optional[str]
    likes: str
    retweets: str
    replies: str

def parse_tweets(html: str) -> list[Tweet]:
    soup = BeautifulSoup(html, "html.parser")
    results = []

    for article in soup.select("article[data-testid='tweet']"):
        text_el = article.select_one("[data-testid='tweetText']")
        time_el = article.select_one("time[datetime]")
        name_els = article.select("[data-testid='User-Name'] span")
        like_el = article.select_one(
            "[data-testid='like'] [data-testid='app-text-transition-container'] span"
        )
        rt_el = article.select_one(
            "[data-testid='retweet'] [data-testid='app-text-transition-container'] span"
        )
        reply_el = article.select_one(
            "[data-testid='reply'] [data-testid='app-text-transition-container'] span"
        )

        results.append(Tweet(
            text=text_el.get_text(strip=True) if text_el else "",
            timestamp=time_el["datetime"] if time_el else None,
            display_name=name_els[0].get_text(strip=True) if name_els else None,
            handle=name_els[1].get_text(strip=True) if len(name_els) > 1 else None,
            likes=like_el.get_text(strip=True) if like_el else "0",
            retweets=rt_el.get_text(strip=True) if rt_el else "0",
            replies=reply_el.get_text(strip=True) if reply_el else "0",
        ))

    return results

# Scrape a live search feed
client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    url="https://x.com/search?q=%23MachineLearning&src=typed_query&f=live",
    render_js=True,
    wait_for="article[data-testid='tweet']",
)

tweets = parse_tweets(response.html)
for t in tweets:
    print(f"[{t.timestamp}] @{t.handle} ({t.likes} likes): {t.text[:120]}")

Note the f=live query parameter — this returns chronological results rather than X's ranked "Top" feed, which is almost always what you want for monitoring pipelines where recency matters.
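For arbitrary monitoring queries, build the search URL programmatically rather than hand-writing percent escapes like %23. A small helper (live_search_url is our name, not part of any SDK):

```python
from urllib.parse import quote

def live_search_url(query: str) -> str:
    """Build a chronological ('Latest') Twitter/X search URL for any query."""
    # safe='' forces encoding of '#', spaces, and other reserved characters
    return f"https://x.com/search?q={quote(query, safe='')}&src=typed_query&f=live"

print(live_search_url("#MachineLearning"))
# https://x.com/search?q=%23MachineLearning&src=typed_query&f=live
```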

Common Pitfalls

Empty responses despite HTTP 200. This happens when render_js=True is omitted, or when the wait_for selector never resolves within the timeout window. Twitter/X's React app serves a loading skeleton on initial response — if your scraper captures the page before hydration completes, you get the shell, not the data. Always use wait_for with a tweet-specific selector, and set a realistic timeout (20–30 seconds).

Session invalidation mid-pipeline. X expires session tokens aggressively and will silently begin returning truncated or redirected content rather than erroring explicitly. If you're relying on session_id for persistence across requests, build retry logic around session resets. Never hardcode a single session into a long-running pipeline without rotation logic.
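One way to structure that retry logic is to keep it vendor-agnostic: inject the scrape call and a content-validity check, and mint a fresh session_id whenever the check fails. This is a sketch; the session-naming scheme and the looks_valid predicate are illustrative, not part of any SDK.

```python
import uuid
from typing import Callable

def scrape_with_session_retry(
    scrape: Callable[[str], str],        # maps a session_id to rendered HTML
    looks_valid: Callable[[str], bool],  # e.g. "does the page contain tweets?"
    max_attempts: int = 3,
) -> str:
    """Retry under a fresh session_id whenever the content check fails.

    X degrades invalid sessions silently (truncated or redirected content,
    still HTTP 200), so validity must be judged from the body, not the status.
    """
    for _ in range(max_attempts):
        session_id = f"twitter-session-{uuid.uuid4().hex[:8]}"
        html = scrape(session_id)
        if looks_valid(html):
            return html
    raise RuntimeError(f"no valid content after {max_attempts} session rotations")

# Demo with a stubbed scrape call: the first session hits a login wall,
# the rotated one succeeds.
attempts = []
def stub_scrape(session_id):
    attempts.append(session_id)
    return "<div data-testid='tweetText'>hi</div>" if len(attempts) > 1 else "login wall"

html = scrape_with_session_retry(stub_scrape, lambda h: "tweetText" in h)
print(len(attempts))  # 2 -- one rotation was needed
```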

Engagement counts are strings, not integers. Like, retweet, and reply counts are displayed as abbreviated strings ("12.4K", "1.2M"). Parse these to numeric values on ingest — don't assume they're castable with int() directly.
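A small conversion helper handles the common cases (parse_count is our name; the suffix conventions are X's standard K/M/B abbreviations):

```python
def parse_count(raw: str) -> int:
    """Convert X's abbreviated engagement counts ('12.4K', '1.2M') to integers."""
    raw = raw.strip().replace(",", "")
    if not raw:
        return 0  # missing count element -> treat as zero
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    suffix = raw[-1].upper()
    if suffix in multipliers:
        return int(float(raw[:-1]) * multipliers[suffix])
    return int(raw)

print(parse_count("12.4K"))  # 12400
print(parse_count("1.2M"))   # 1200000
print(parse_count("847"))    # 847
```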

Search results are geo-localized. Trending topics, search result ordering, and content visibility differ by region. If your pipeline targets a specific market, specify a matching proxy country in your request parameters to ensure consistent, representative results.

Infinite scroll vs. pagination. Twitter/X has no page number parameters. Additional content loads via scroll events triggering background GraphQL calls. To scrape beyond the first viewport's content, use the scroll option to simulate scroll events before extraction, or target the internal GraphQL API endpoints directly (more stable for high-volume work, but requires managing auth tokens).
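Successive scroll captures overlap (tweets still in the viewport get re-extracted), so deduplicate on fields that are stable across re-renders before persisting. A sketch, using a namedtuple as a stand-in for the Tweet dataclass defined earlier:

```python
from collections import namedtuple

# Stand-in for the Tweet dataclass from the parser above
Tweet = namedtuple("Tweet", ["handle", "timestamp", "text"])

def dedupe_tweets(batches):
    """Merge overlapping scroll batches, keeping the first occurrence of
    each tweet. Engagement counts change between captures, so key only on
    fields that stay constant."""
    seen, merged = set(), []
    for batch in batches:
        for tweet in batch:
            key = (tweet.handle, tweet.timestamp, tweet.text)
            if key not in seen:
                seen.add(key)
                merged.append(tweet)
    return merged

t1 = Tweet("@a", "2026-01-01T00:00:00Z", "hello")
t2 = Tweet("@b", "2026-01-01T00:01:00Z", "world")
merged = dedupe_tweets([[t1, t2], [t2, t1]])  # two overlapping captures
print(len(merged))  # 2
```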

Scaling Up

For high-volume pipelines, sequential requests are too slow. Move to concurrent scraping with asyncio:

Python
import asyncio
import alterlab
from parse_tweets import parse_tweets  # from the previous example

async def scrape_profile(client: alterlab.AsyncClient, username: str) -> dict:
    try:
        response = await client.scrape(
            url=f"https://x.com/{username}",
            render_js=True,
            wait_for="article[data-testid='tweet']",
            timeout=30,
        )
        tweets = parse_tweets(response.html)
        return {
            "username": username,
            "tweet_count": len(tweets),
            "tweets": tweets,
        }
    except alterlab.ScraperError as e:
        return {"username": username, "error": str(e), "tweets": []}

async def main():
    targets = ["OpenAI", "AnthropicAI", "GoogleDeepMind", "MetaAI", "MistralAI"]

    async with alterlab.AsyncClient("YOUR_API_KEY") as client:
        tasks = [scrape_profile(client, handle) for handle in targets]
        results = await asyncio.gather(*tasks)

    for r in results:
        status = f"{r['tweet_count']} tweets" if "error" not in r else f"FAILED: {r['error']}"
        print(f"@{r['username']}: {status}")

asyncio.run(main())

For production scheduling, feed URLs into a job queue (Celery, ARQ, or Cloud Tasks) rather than running asyncio.gather across thousands of targets directly. A sustainable throughput for most pipelines is 10–50 concurrent requests with 1–3 seconds of jitter between batches to avoid triggering X's per-session rate limiting.
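The batch-with-jitter pattern can be sketched as a small asyncio helper (run_batched is illustrative, not an SDK function):

```python
import asyncio
import random

async def run_batched(coros, batch_size=25, jitter=(1.0, 3.0)):
    """Run awaitables in fixed-size batches, sleeping a random interval
    between batches so request timing doesn't look machine-regular."""
    results = []
    for i in range(0, len(coros), batch_size):
        results.extend(await asyncio.gather(*coros[i:i + batch_size]))
        if i + batch_size < len(coros):
            await asyncio.sleep(random.uniform(*jitter))
    return results

# Demo with trivial jobs and zero jitter; in practice each coroutine
# would be a scrape_profile(...) call like the one above.
async def fake_job(n):
    return n * 2

out = asyncio.run(run_batched([fake_job(n) for n in range(5)],
                              batch_size=2, jitter=(0.0, 0.0)))
print(out)  # [0, 2, 4, 6, 8]
```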

Storage strategy. Raw rendered HTML from a Twitter/X profile runs 400–900 KB per page. At any meaningful scale, parse and persist structured fields immediately — don't store raw HTML. Pipe output from parse_tweets() directly to PostgreSQL, BigQuery, or your warehouse of choice.
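A minimal sketch of parse-at-ingest persistence, with sqlite3 standing in for your actual warehouse; the schema and field names are illustrative, and the (handle, timestamp) primary key doubles as cross-batch deduplication:

```python
import sqlite3

def persist_tweets(conn, tweets):
    """Insert parsed tweet rows, silently skipping duplicates on (handle, ts)."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            handle TEXT, ts TEXT, text TEXT, likes INTEGER,
            PRIMARY KEY (handle, ts)
        )""")
    conn.executemany(
        "INSERT OR IGNORE INTO tweets VALUES (:handle, :ts, :text, :likes)",
        tweets,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
persist_tweets(conn, [
    {"handle": "@OpenAI", "ts": "2026-03-01T12:00:00Z", "text": "hi", "likes": 12400},
])
print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # 1
```

Re-ingesting the same rows is a no-op thanks to INSERT OR IGNORE, which keeps retry logic simple upstream.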

Cost planning. JS-rendered pages consume more credits than static scrapes due to headless browser compute time. For large pipelines, review AlterLab pricing before you hit your plan's limits mid-run. Growth and Enterprise tiers include priority queuing and dedicated proxy pools, which meaningfully improve throughput on high-frequency jobs where queue latency adds up.

Key Takeaways

  • Raw HTTP requests won't work. Twitter/X is a full React SPA with JavaScript-gated content, Cloudflare Turnstile, and login-walled endpoints. Headless browser rendering is non-negotiable.
  • Lean on data-testid selectors. They're tied to React component names and survive CSS class hash rotations — far more stable than class-based selectors on this site.
  • Build session rotation in from day one. X invalidates tokens aggressively and degrades content silently rather than returning errors. Your pipeline needs to detect and recover from this automatically.
  • Parse immediately, don't store raw HTML. Structured field extraction at ingest time keeps storage costs manageable and query performance fast at scale.
  • Proxy geography affects your data. Trending topics, search ranking, and content visibility are all geo-localized. Match your proxy region to your target market.

Working with other social platforms? These guides cover the same anti-bot patterns with site-specific implementation details:


Frequently Asked Questions

Is it legal to scrape Twitter/X?

Scraping publicly accessible Twitter/X data sits in a legal gray area. X's Terms of Service explicitly prohibit unauthorized scraping, and the company has pursued legal action against large-scale operators — most notably in *X Corp. v. Bright Data* (2024). For internal research, brand monitoring, or non-commercial analysis, the risk profile is substantially lower, but you should consult legal counsel before commercializing scraped data or operating at enterprise scale.
What's the most reliable way to get past Twitter/X's anti-bot defenses?

Twitter/X layers Cloudflare Turnstile, IP-based rate limiting, and JavaScript-gated content that make DIY bypass brittle and expensive to maintain — X updates its detection stack regularly and will silently block or serve stale content. The most reliable approach is a managed scraping API that handles residential proxy rotation, headless browser rendering, and challenge solving transparently. AlterLab's [Anti-bot bypass API](/anti-bot-bypass-api) covers all three layers with a single request flag, without requiring you to manage any browser infrastructure.
How much does scraping Twitter/X cost at scale?

The primary cost driver is JavaScript rendering — headless browser requests consume significantly more compute than static page fetches. At low volumes (under 10k requests/month), most starter plans comfortably cover Twitter/X workloads. For sustained pipelines at 100k+ pages/month, check the [AlterLab pricing](/pricing) page: Growth and Enterprise tiers offer the best per-request rate for JS-rendered pages and include priority proxy queuing, which meaningfully improves throughput on high-frequency jobs.