How to Scrape Twitter/X: Complete Guide for 2026

Learn how to scrape Twitter/X in 2026 with Python. Covers Cloudflare bypass, structured data extraction, and scaling your pipeline to millions of tweets.

Yash Dubey

March 30, 2026

8 min read

Twitter/X holds some of the most valuable real-time signal data on the internet — public sentiment, trending discourse, influencer activity, and breaking narratives. The official API, however, is prohibitively expensive for most use cases (Basic tier caps at 10,000 tweets/month for $100), and the site's bot protections have become substantially more aggressive through 2025 and into 2026. This guide covers exactly how to scrape Twitter/X: what you're up against, the tools you need, and production-ready code to get started.

Why Scrape Twitter/X?

Before getting into implementation, three concrete use cases justify the engineering investment:

Brand sentiment monitoring. Marketing and comms teams track mentions, hashtag velocity, and sentiment shifts in real time. Scraping gives you raw signal without API cost caps or X's 30-day historical data ceiling — critical for retroactive analysis after a PR event.

Academic and investigative research. Researchers studying misinformation networks, political discourse, or public health communication regularly need large tweet corpora that exceed what the Academic Research tier provides. That tier is frequently oversubscribed, approval is slow, and access can be revoked without notice.

Competitive intelligence. Tracking competitor product announcements, executive statements, and community reactions at scale is a legitimate business intelligence use case. No official API product maps cleanly onto this workflow.

Anti-Bot Challenges on twitter.com

Twitter/X runs several protection layers simultaneously, and as of 2026 they've tightened across the board:

Cloudflare Turnstile. X migrated from in-house bot detection to Cloudflare's Turnstile challenge on login walls and rate-limited endpoints. Turnstile performs browser fingerprinting, behavioral analysis, and passive challenges invisible to legitimate users but fatal to naive scrapers. Unlike a traditional CAPTCHA, Turnstile doesn't present a puzzle — it silently fails bot sessions before they reach content.

JavaScript-rendered content. The entire frontend is a React SPA. A raw HTTP response from x.com contains almost no tweet data — meaningful content only appears after JavaScript hydration, authenticated GraphQL calls, and DOM rendering complete. requests + BeautifulSoup alone will not work here.
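You can see this without touching the network at all: parsing a pre-hydration skeleton (a stand-in for the kind of markup a raw HTTP client receives from x.com) finds zero tweet elements, even though the page would render fine in a browser.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for a raw, pre-hydration response: a React mount
# point and loading placeholders -- no tweet markup anywhere.
SKELETON_HTML = """
<html><body>
  <div id="react-root">
    <div class="loading-placeholder" aria-busy="true"></div>
  </div>
</body></html>
"""

soup = BeautifulSoup(SKELETON_HTML, "html.parser")
tweets = soup.select("[data-testid='tweetText']")
print(len(tweets))  # 0 -- the content simply isn't in the raw HTML
```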

Login gates. As of late 2025, X requires authenticated sessions to view most profile timelines and all search results beyond a limited preview. Your scraper must manage session tokens — a significant attack surface that X monitors for anomalous usage patterns.

IP-based rate limiting. Even valid sessions are rate-limited by IP. A datacenter IP attempting meaningful volume will be blocked within minutes. Residential or ISP proxies are required.

Handling all of this in-house — rotating residential proxies, headless browser orchestration, Turnstile solving, session rotation — is weeks of engineering that will break on every X infrastructure update. AlterLab's Anti-bot bypass API handles this entire infrastructure layer, letting you focus on the data logic rather than the bypass mechanics.

Quick Start with the AlterLab API

Install the SDK and grab your API key from the Getting started guide, then run:

Bash
pip install alterlab beautifulsoup4

The minimum viable scrape of a Twitter/X profile:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://x.com/OpenAI",
    render_js=True,
    wait_for="article[data-testid='tweet']",
    session_id="twitter-session-01",
    timeout=30,
)

soup = BeautifulSoup(response.html, "html.parser")
tweets = soup.select("[data-testid='tweetText']")

for tweet in tweets:
    print(tweet.get_text(strip=True))

render_js=True instructs the API to use a headless browser. The wait_for parameter accepts any CSS selector — the scraper holds until that element appears in the DOM before returning the response. Without it, you'll frequently receive the React loading skeleton rather than actual tweet content.

The equivalent cURL request:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://x.com/OpenAI",
    "render_js": true,
    "wait_for": "article[data-testid='\''tweet'\'']",
    "session_id": "twitter-session-01",
    "timeout": 30
  }'

Extracting Structured Data

Twitter/X's DOM structure is more stable than most SPAs because it's driven by data-testid attributes tied to React component names rather than hashed CSS class names. These selectors survive most frontend deploys:

Data Point        CSS Selector
Tweet text        [data-testid="tweetText"]
Tweet timestamp   time[datetime]
Display name      [data-testid="User-Name"] span:first-child
Handle            [data-testid="User-Name"] a[href] span
Like count        [data-testid="like"] [data-testid="app-text-transition-container"] span
Retweet count     [data-testid="retweet"] [data-testid="app-text-transition-container"] span
Reply count       [data-testid="reply"] [data-testid="app-text-transition-container"] span
Attached images   [data-testid="tweetPhoto"] img

Here's a complete parser that returns structured Tweet objects from a rendered HTML response:

Python
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import Optional
import alterlab

@dataclass
class Tweet:
    text: str
    timestamp: Optional[str]
    display_name: Optional[str]
    handle: Optional[str]
    likes: str
    retweets: str
    replies: str

def parse_tweets(html: str) -> list[Tweet]:
    soup = BeautifulSoup(html, "html.parser")
    results = []

    for article in soup.select("article[data-testid='tweet']"):
        text_el = article.select_one("[data-testid='tweetText']")
        time_el = article.select_one("time[datetime]")
        name_els = article.select("[data-testid='User-Name'] span")
        like_el = article.select_one(
            "[data-testid='like'] [data-testid='app-text-transition-container'] span"
        )
        rt_el = article.select_one(
            "[data-testid='retweet'] [data-testid='app-text-transition-container'] span"
        )
        reply_el = article.select_one(
            "[data-testid='reply'] [data-testid='app-text-transition-container'] span"
        )

        results.append(Tweet(
            text=text_el.get_text(strip=True) if text_el else "",
            timestamp=time_el["datetime"] if time_el else None,
            display_name=name_els[0].get_text(strip=True) if name_els else None,
            handle=name_els[1].get_text(strip=True) if len(name_els) > 1 else None,
            likes=like_el.get_text(strip=True) if like_el else "0",
            retweets=rt_el.get_text(strip=True) if rt_el else "0",
            replies=reply_el.get_text(strip=True) if reply_el else "0",
        ))

    return results

# Scrape a live search feed
client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    url="https://x.com/search?q=%23MachineLearning&src=typed_query&f=live",
    render_js=True,
    wait_for="article[data-testid='tweet']",
)

tweets = parse_tweets(response.html)
for t in tweets:
    print(f"[{t.timestamp}] @{t.handle} ({t.likes} likes): {t.text[:120]}")

Note the f=live query parameter — this returns chronological results rather than X's ranked "Top" feed, which is almost always what you want for monitoring pipelines where recency matters.
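For arbitrary monitoring queries, build the search URL programmatically rather than hand-writing percent escapes like %23. A small helper (live_search_url is our name, not part of any SDK):

```python
from urllib.parse import quote

def live_search_url(query: str) -> str:
    """Build a chronological ('Latest') Twitter/X search URL for any query."""
    # safe='' forces encoding of '#', spaces, and other reserved characters
    return f"https://x.com/search?q={quote(query, safe='')}&src=typed_query&f=live"

print(live_search_url("#MachineLearning"))
# https://x.com/search?q=%23MachineLearning&src=typed_query&f=live
```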

Common Pitfalls

Empty responses despite HTTP 200. This happens when render_js=True is omitted, or when the wait_for selector never resolves within the timeout window. Twitter/X's React app serves a loading skeleton on initial response — if your scraper captures the page before hydration completes, you get the shell, not the data. Always use wait_for with a tweet-specific selector, and set a realistic timeout (20–30 seconds).

Session invalidation mid-pipeline. X expires session tokens aggressively and will silently begin returning truncated or redirected content rather than erroring explicitly. If you're relying on session_id for persistence across requests, build retry logic around session resets. Never hardcode a single session into a long-running pipeline without rotation logic.
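One way to structure that retry logic is to keep it vendor-agnostic: inject the scrape call and a content-validity check, and mint a fresh session_id whenever the check fails. This is a sketch; the session-naming scheme and the looks_valid predicate are illustrative, not part of any SDK.

```python
import uuid
from typing import Callable

def scrape_with_session_retry(
    scrape: Callable[[str], str],        # maps a session_id to rendered HTML
    looks_valid: Callable[[str], bool],  # e.g. "does the page contain tweets?"
    max_attempts: int = 3,
) -> str:
    """Retry under a fresh session_id whenever the content check fails.

    X degrades invalid sessions silently (truncated or redirected content,
    still HTTP 200), so validity must be judged from the body, not the status.
    """
    for _ in range(max_attempts):
        session_id = f"twitter-session-{uuid.uuid4().hex[:8]}"
        html = scrape(session_id)
        if looks_valid(html):
            return html
    raise RuntimeError(f"no valid content after {max_attempts} session rotations")

# Demo with a stubbed scrape call: the first session hits a login wall,
# the rotated one succeeds.
attempts = []
def stub_scrape(session_id):
    attempts.append(session_id)
    return "<div data-testid='tweetText'>hi</div>" if len(attempts) > 1 else "login wall"

html = scrape_with_session_retry(stub_scrape, lambda h: "tweetText" in h)
print(len(attempts))  # 2 -- one rotation was needed
```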

Engagement counts are strings, not integers. Like, retweet, and reply counts are displayed as abbreviated strings ("12.4K", "1.2M"). Parse these to numeric values on ingest — don't assume they're castable with int() directly.
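A small conversion helper handles the common cases (parse_count is our name; the suffix conventions are X's standard K/M/B abbreviations):

```python
def parse_count(raw: str) -> int:
    """Convert X's abbreviated engagement counts ('12.4K', '1.2M') to integers."""
    raw = raw.strip().replace(",", "")
    if not raw:
        return 0  # missing count element -> treat as zero
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    suffix = raw[-1].upper()
    if suffix in multipliers:
        return int(float(raw[:-1]) * multipliers[suffix])
    return int(raw)

print(parse_count("12.4K"))  # 12400
print(parse_count("1.2M"))   # 1200000
print(parse_count("847"))    # 847
```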

Search results are geo-localized. Trending topics, search result ordering, and content visibility differ by region. If your pipeline targets a specific market, specify a matching proxy country in your request parameters to ensure consistent, representative results.

Infinite scroll vs. pagination. Twitter/X has no page number parameters. Additional content loads via scroll events triggering background GraphQL calls. To scrape beyond the first viewport's content, use the scroll option to simulate scroll events before extraction, or target the internal GraphQL API endpoints directly (more stable for high-volume work, but requires managing auth tokens).
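Successive scroll captures overlap (tweets still in the viewport get re-extracted), so deduplicate on fields that are stable across re-renders before persisting. A sketch, using a namedtuple as a stand-in for the Tweet dataclass defined earlier:

```python
from collections import namedtuple

# Stand-in for the Tweet dataclass from the parser above
Tweet = namedtuple("Tweet", ["handle", "timestamp", "text"])

def dedupe_tweets(batches):
    """Merge overlapping scroll batches, keeping the first occurrence of
    each tweet. Engagement counts change between captures, so key only on
    fields that stay constant."""
    seen, merged = set(), []
    for batch in batches:
        for tweet in batch:
            key = (tweet.handle, tweet.timestamp, tweet.text)
            if key not in seen:
                seen.add(key)
                merged.append(tweet)
    return merged

t1 = Tweet("@a", "2026-01-01T00:00:00Z", "hello")
t2 = Tweet("@b", "2026-01-01T00:01:00Z", "world")
merged = dedupe_tweets([[t1, t2], [t2, t1]])  # two overlapping captures
print(len(merged))  # 2
```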

Scaling Up

For high-volume pipelines, sequential requests are too slow. Move to concurrent scraping with asyncio:

Python
import asyncio
import alterlab
from parse_tweets import parse_tweets  # from the previous example

async def scrape_profile(client: alterlab.AsyncClient, username: str) -> dict:
    try:
        response = await client.scrape(
            url=f"https://x.com/{username}",
            render_js=True,
            wait_for="article[data-testid='tweet']",
            timeout=30,
        )
        tweets = parse_tweets(response.html)
        return {
            "username": username,
            "tweet_count": len(tweets),
            "tweets": tweets,
        }
    except alterlab.ScraperError as e:
        return {"username": username, "error": str(e), "tweets": []}

async def main():
    targets = ["OpenAI", "AnthropicAI", "GoogleDeepMind", "MetaAI", "MistralAI"]

    async with alterlab.AsyncClient("YOUR_API_KEY") as client:
        tasks = [scrape_profile(client, handle) for handle in targets]
        results = await asyncio.gather(*tasks)

    for r in results:
        status = f"{r['tweet_count']} tweets" if "error" not in r else f"FAILED: {r['error']}"
        print(f"@{r['username']}: {status}")

asyncio.run(main())

For production scheduling, feed URLs into a job queue (Celery, ARQ, or Cloud Tasks) rather than running asyncio.gather across thousands of targets directly. A sustainable throughput for most pipelines is 10–50 concurrent requests with 1–3 seconds of jitter between batches to avoid triggering X's per-session rate limiting.
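The batch-with-jitter pattern can be sketched as a small asyncio helper (run_batched is illustrative, not an SDK function):

```python
import asyncio
import random

async def run_batched(coros, batch_size=25, jitter=(1.0, 3.0)):
    """Run awaitables in fixed-size batches, sleeping a random interval
    between batches so request timing doesn't look machine-regular."""
    results = []
    for i in range(0, len(coros), batch_size):
        results.extend(await asyncio.gather(*coros[i:i + batch_size]))
        if i + batch_size < len(coros):
            await asyncio.sleep(random.uniform(*jitter))
    return results

# Demo with trivial jobs and zero jitter; in practice each coroutine
# would be a scrape_profile(...) call like the one above.
async def fake_job(n):
    return n * 2

out = asyncio.run(run_batched([fake_job(n) for n in range(5)],
                              batch_size=2, jitter=(0.0, 0.0)))
print(out)  # [0, 2, 4, 6, 8]
```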

Storage strategy. Raw rendered HTML from a Twitter/X profile runs 400–900 KB per page. At any meaningful scale, parse and persist structured fields immediately — don't store raw HTML. Pipe output from parse_tweets() directly to PostgreSQL, BigQuery, or your warehouse of choice.
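A minimal sketch of parse-at-ingest persistence, with sqlite3 standing in for your actual warehouse; the schema and field names are illustrative, and the (handle, timestamp) primary key doubles as cross-batch deduplication:

```python
import sqlite3

def persist_tweets(conn, tweets):
    """Insert parsed tweet rows, silently skipping duplicates on (handle, ts)."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            handle TEXT, ts TEXT, text TEXT, likes INTEGER,
            PRIMARY KEY (handle, ts)
        )""")
    conn.executemany(
        "INSERT OR IGNORE INTO tweets VALUES (:handle, :ts, :text, :likes)",
        tweets,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
persist_tweets(conn, [
    {"handle": "@OpenAI", "ts": "2026-03-01T12:00:00Z", "text": "hi", "likes": 12400},
])
print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # 1
```

Re-ingesting the same rows is a no-op thanks to INSERT OR IGNORE, which keeps retry logic simple upstream.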

Cost planning. JS-rendered pages consume more credits than static scrapes due to headless browser compute time. For large pipelines, review AlterLab pricing before you hit your plan's limits mid-run. Growth and Enterprise tiers include priority queuing and dedicated proxy pools, which meaningfully improve throughput on high-frequency jobs where queue latency adds up.

Key Takeaways

  • Raw HTTP requests won't work. Twitter/X is a full React SPA with JavaScript-gated content, Cloudflare Turnstile, and login-walled endpoints. Headless browser rendering is non-negotiable.
  • Lean on data-testid selectors. They're tied to React component names and survive CSS class hash rotations — far more stable than class-based selectors on this site.
  • Build session rotation in from day one. X invalidates tokens aggressively and degrades content silently rather than returning errors. Your pipeline needs to detect and recover from this automatically.
  • Parse immediately, don't store raw HTML. Structured field extraction at ingest time keeps storage costs manageable and query performance fast at scale.
  • Proxy geography affects your data. Trending topics, search ranking, and content visibility are all geo-localized. Match your proxy region to your target market.

Working with other social platforms? These guides cover the same anti-bot patterns with site-specific implementation details:


Frequently Asked Questions

Is it legal to scrape Twitter/X?

Scraping publicly accessible Twitter/X data sits in a legal gray area. X's Terms of Service explicitly prohibit unauthorized scraping, and the company has pursued legal action against large-scale operators — most notably in *X Corp. v. Bright Data* (2024). For internal research, brand monitoring, or non-commercial analysis, the risk profile is substantially lower, but you should consult legal counsel before commercializing scraped data or operating at enterprise scale.
What's the most reliable way to get past Twitter/X's anti-bot defenses?

Twitter/X layers Cloudflare Turnstile, IP-based rate limiting, and JavaScript-gated content that make DIY bypass brittle and expensive to maintain — X updates its detection stack regularly and will silently block or serve stale content. The most reliable approach is a managed scraping API that handles residential proxy rotation, headless browser rendering, and challenge solving transparently. AlterLab's [Anti-bot bypass API](/anti-bot-bypass-api) covers all three layers with a single request flag, without requiring you to manage any browser infrastructure.
How much does scraping Twitter/X cost at scale?

The primary cost driver is JavaScript rendering — headless browser requests consume significantly more compute than static page fetches. At low volumes (under 10k requests/month), most starter plans comfortably cover Twitter/X workloads. For sustained pipelines at 100k+ pages/month, check the [AlterLab pricing](/pricing) page: Growth and Enterprise tiers offer the best per-request rate for JS-rendered pages and include priority proxy queuing, which meaningfully improves throughput on high-frequency jobs.