
How to Scrape Reddit: Complete Guide for 2026

Learn how to scrape Reddit in 2026 with Python. Handle rate limits, extract structured post data, bypass anti-bot measures, and scale your pipeline.

Yash Dubey

March 30, 2026

10 min read

Reddit generates millions of posts and comments daily across 100,000+ active communities. Unlike most platforms, a significant portion of that data is accessible without authentication — no OAuth dance, no login wall. That makes it one of the highest-signal public data sources available to engineers building monitoring, research, or intelligence pipelines.

This guide covers the full stack: what Reddit's defenses look like in practice, how to extract clean structured data using both the JSON API and HTML parsing, and how to scale to hundreds of subreddits without getting blocked.


Why Scrape Reddit?

Three use cases dominate production Reddit scrapers:

Brand and product sentiment monitoring — Reddit discussions are candid in a way that Twitter threads and review sites rarely are. Subreddits like r/personalfinance, r/wallstreetbets, and category-specific communities move faster than traditional media. Fintech teams, investor relations departments, and product orgs monitor these feeds in near-real-time to detect reputation events before they surface elsewhere.

Market research and trend detection — Academic researchers, journalists, and strategy teams use Reddit to map emerging narratives at scale. A longitudinal dataset of r/MachineLearning or r/datascience posts tells you what practitioners are actually evaluating — tools, frameworks, vendors — rather than what analyst reports say they should be.

Lead generation from high-intent communities — B2B teams mine niche subreddits for decision-makers asking "what's the best tool for X?" Those threads surface high-intent buyers actively comparing options. Automated monitoring of relevant search terms and post flairs generates a continuous stream of warm leads.


Anti-Bot Challenges on Reddit

Reddit's defenses are lighter than enterprise-grade targets like LinkedIn or Amazon, but they'll kill a naive scraper within minutes:

Rate limiting on the JSON API — The public JSON endpoint (reddit.com/r/sub.json) allows roughly one unauthenticated request per two seconds. Burst past that and you receive 429 responses. The HTML interfaces enforce similar limits, and Reddit's rate-limit windows are per-IP — rotating user agents alone doesn't help.
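
Client-side pacing is simple to sketch. The helper below is a hypothetical utility (not part of any Reddit or AlterLab API) that computes how long to sleep before the next request, given the ~2-second unauthenticated limit:

```python
MIN_INTERVAL = 2.0  # ~one unauthenticated request per two seconds, per IP

def wait_time(last_request_at: float, now: float,
              min_interval: float = MIN_INTERVAL) -> float:
    """Seconds to sleep before the next request is allowed."""
    return max(0.0, min_interval - (now - last_request_at))
```

In a crawl loop you would call it with `time.monotonic()` timestamps and sleep for the returned duration before each fetch.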

User-Agent enforcement — Reddit explicitly blocks requests with missing or generic User-Agent headers. The library default python-requests/2.31.0 will get you flagged faster than any other signal. Reddit's own API guidelines require a descriptive UA string in the format AppName/Version (context).
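
Building the header once and reusing it across a pipeline keeps the UA consistent. A tiny sketch (the helper name and contact string are placeholders, not an AlterLab API):

```python
def build_headers(app: str, version: str, contact: str) -> dict[str, str]:
    """Descriptive User-Agent in the AppName/Version (contact) shape Reddit expects."""
    return {"User-Agent": f"{app}/{version} ({contact})"}
```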

New Reddit's SPA architecture — www.reddit.com is a React single-page application. A raw HTTP request returns an HTML shell with almost no post data; the actual content arrives through asynchronous JavaScript calls to internal APIs. Without headless browser rendering, you get empty content. This is the single biggest trap engineers fall into when scraping Reddit for the first time.

TLS fingerprinting via CDN infrastructure — Reddit's CDN layer can fingerprint the TLS handshake to distinguish browser clients from HTTP libraries. Standard requests or httpx produce distinct TLS signatures that residential proxies alone won't obscure. You need full browser fingerprint spoofing at the TLS layer.

old.reddit.com as a practical escape hatch — The legacy interface serves fully server-rendered HTML. No JavaScript execution required, CSS selectors work reliably, and the DOM structure has been stable for years. For post and comment data, old.reddit.com is the correct default target.

Building rotation logic, fingerprint spoofing, session management, and retry handling yourself is a multi-week project that breaks whenever Reddit updates its stack. AlterLab's anti-bot bypass API handles all of it transparently at the infrastructure level.

99.2% — Scrape Success Rate
1.4s — Avg Response Time
~2s — Reddit's Unauthenticated Rate Limit
50M+ — Reddit Posts Indexed Daily

Quick Start with AlterLab

Install the SDK, then export your API key. The getting started guide covers environment setup, key management, and first-request verification in detail.

Bash
pip install alterlab beautifulsoup4
export ALTERLAB_API_KEY="YOUR_API_KEY"

The simplest working request — fetch old.reddit.com with a descriptive User-Agent:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Target old.reddit.com — server-rendered, no JS needed
response = client.scrape(
    "https://old.reddit.com/r/Python/hot/",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)

print(response.status_code)   # 200
print(response.text[:500])    # HTML content

The equivalent via cURL for quick terminal verification:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://old.reddit.com/r/Python/hot/",
    "headers": {
      "User-Agent": "ResearchPipeline/1.0 ([email protected])"
    }
  }'

Prefer the JSON API when you don't need HTML. Appending .json to any Reddit URL returns a fully structured JSON response — no parsing, no selectors, no BeautifulSoup:

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

# Reddit's built-in JSON endpoint — structured, fast, no JS rendering needed
response = client.scrape(
    "https://www.reddit.com/r/MachineLearning/hot.json?limit=25",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)

data = json.loads(response.text)
posts = data["data"]["children"]

for post in posts:
    p = post["data"]
    print(f"{p['score']:>6} upvotes | {p['title'][:75]}")

Extracting Structured Data

From the JSON API

The .json endpoint exposes the full post object. Key fields under data.children[].data:

Field            Type    Description
id               string  Base-36 post ID
title            string  Post title
author           string  Reddit username
score            int     Net upvote count
upvote_ratio     float   Fraction of upvotes (0.0–1.0)
url              string  Link URL, or Reddit permalink for text posts
selftext         string  Post body text (empty for link posts)
num_comments     int     Comment count, including removed
created_utc      float   Unix timestamp (UTC)
subreddit        string  Subreddit name
link_flair_text  string  Post flair label
Python
import alterlab
import json
from datetime import datetime, timezone

client = alterlab.Client("YOUR_API_KEY")

def scrape_subreddit(subreddit: str, sort: str = "hot", limit: int = 100) -> list[dict]:
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    response = client.scrape(
        url,
        headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
    )
    data = json.loads(response.text)

    posts = []
    for child in data["data"]["children"]:
        p = child["data"]
        posts.append({
            "id":           p["id"],
            "title":        p["title"],
            "author":       p["author"],
            "score":        p["score"],
            "upvote_ratio": p["upvote_ratio"],
            "comments":     p["num_comments"],
            "created":      datetime.fromtimestamp(p["created_utc"], tz=timezone.utc).isoformat(),
            "url":          p["url"],
            "body":         p.get("selftext", ""),
            "flair":        p.get("link_flair_text") or "",
        })

    return posts

results = scrape_subreddit("MachineLearning", sort="new", limit=50)
print(f"Extracted {len(results)} posts")

From old.reddit.com HTML

Use HTML parsing when you need data not exposed by the JSON API — sidebar content, wiki pages, or custom CSS flairs. old.reddit.com has a stable DOM structure that's been consistent for years:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://old.reddit.com/r/Python/hot/",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)
soup = BeautifulSoup(response.text, "html.parser")

posts = []
for thing in soup.select("div.thing"):
    title_tag     = thing.select_one("a.title.may-blank")
    score_tag     = thing.select_one("div.score.unvoted")
    author_tag    = thing.select_one("a.author")
    timestamp_tag = thing.select_one("time.live-timestamp")
    comments_tag  = thing.select_one("a.comments")

    posts.append({
        "title":     title_tag.get_text(strip=True) if title_tag else None,
        "score":     score_tag.get("title") if score_tag else None,
        "author":    author_tag.get_text(strip=True) if author_tag else None,
        "timestamp": timestamp_tag.get("datetime") if timestamp_tag else None,
        "permalink": thing.get("data-permalink"),
        "comments":  comments_tag.get_text(strip=True) if comments_tag else None,
    })

print(f"Parsed {len(posts)} posts")

CSS selector reference for old.reddit.com:

Element          Selector
Post container   div.thing
Post title       a.title.may-blank
Score            div.score.unvoted
Author           a.author
Subreddit tag    a.subreddit
Post timestamp   time.live-timestamp
Comments link    a.comments
Post flair       span.flair

Common Pitfalls

Scraping www.reddit.com without headless rendering — The new Reddit UI loads post data asynchronously. A raw HTTP request returns empty content shells. Use old.reddit.com for HTML scraping, or target the .json API endpoints directly. Only enable headless browser mode for www.reddit.com when you specifically need data exposed only in the new UI.
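
A cheap guard is to normalize every incoming URL before fetching. The helper below is our own sketch: it rewrites www.reddit.com listings to the .json endpoint, or to old.reddit.com when you want HTML, and leaves everything else untouched:

```python
from urllib.parse import urlsplit, urlunsplit

def to_scrapable(url: str, prefer_json: bool = True) -> str:
    """Rewrite a www.reddit.com URL to a form that needs no JS rendering."""
    parts = urlsplit(url)
    if parts.netloc in ("www.reddit.com", "reddit.com"):
        if prefer_json:
            path = parts.path.rstrip("/") or "/"
            if not path.endswith(".json"):
                path += ".json"  # Reddit's built-in JSON view of the same page
            return urlunsplit(("https", "www.reddit.com", path, parts.query, ""))
        # Server-rendered legacy interface for HTML parsing
        return urlunsplit(("https", "old.reddit.com", parts.path, parts.query, ""))
    return url
```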

Ignoring pagination — The JSON API returns 25 posts by default, capped at 100 per request. Complete subreddit crawls require paginating with the after cursor (the fullname of the last post in the previous page):

Python
import alterlab
import json
import time

client = alterlab.Client("YOUR_API_KEY")

def scrape_all_posts(subreddit: str, max_pages: int = 10) -> list[dict]:
    posts, after = [], None

    for _ in range(max_pages):
        qs  = f"limit=100&after={after}" if after else "limit=100"
        url = f"https://www.reddit.com/r/{subreddit}/new.json?{qs}"

        response = client.scrape(url, headers={"User-Agent": "Pipeline/1.0"})
        data     = json.loads(response.text)
        children = data["data"]["children"]

        if not children:
            break

        posts.extend(child["data"] for child in children)
        after = data["data"]["after"]

        if not after:
            break

        time.sleep(0.5)  # Polite crawl delay between pages

    return posts

Assuming num_comments matches the fetched comment tree — num_comments counts deleted and removed comments. When you fetch the comment tree separately, the retrievable count is often lower. Don't use this field for data-completeness checks.
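
If you need a real completeness signal, count the comments you actually fetched. A sketch against the comment-listing JSON shape (t1 children whose replies field is either an empty string or a nested listing):

```python
def count_fetched(listing: dict) -> int:
    """Count t1 comments actually present in a fetched comment listing."""
    n = 0
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":
            continue  # skip 'more' stubs and other kinds
        n += 1
        replies = child["data"].get("replies")
        if isinstance(replies, dict):  # empty replies come back as ""
            n += count_fetched(replies)
    return n
```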

Not handling deleted and removed content — Posts with [deleted] as author or [removed] as selftext are endemic to Reddit data. Filter or flag them explicitly rather than letting them corrupt downstream aggregations.
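
A minimal ingestion-time filter might look like this (helper names are ours):

```python
def is_tombstoned(post: dict) -> bool:
    """True for posts whose author or body was deleted or removed."""
    return (post.get("author") == "[deleted]"
            or post.get("selftext") in ("[removed]", "[deleted]"))

def clean(posts: list[dict]) -> list[dict]:
    """Drop tombstoned posts before they reach downstream aggregations."""
    return [p for p in posts if not is_tombstoned(p)]
```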

Session expiry on long-running crawls — For multi-hour crawls across hundreds of subreddits, cookies and auth tokens can expire mid-run. Build retry logic with exponential backoff and log failures per-URL so you can resume from the last successful page rather than restarting from scratch.
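
A retry wrapper with exponential backoff is a few lines; the sleep function is injectable so the backoff schedule can be verified without waiting (this is a sketch, not an AlterLab API):

```python
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            sleep(base_delay * (2 ** attempt))
```

In a crawler you would wrap each page fetch and log the URL on final failure, so the run can resume from the last successful page.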


Scaling Up

Async batching across subreddits

Python
import asyncio
import json
import alterlab

client = alterlab.Client("YOUR_API_KEY")

async def fetch_subreddit(subreddit: str) -> list[dict]:
    loop = asyncio.get_running_loop()
    # AlterLab client is sync; run it in a thread-pool executor from async code
    response = await loop.run_in_executor(
        None,
        lambda: client.scrape(
            f"https://www.reddit.com/r/{subreddit}/hot.json?limit=100",
            headers={"User-Agent": "Pipeline/1.0 ([email protected])"},
        ),
    )
    data = json.loads(response.text)
    return [child["data"] for child in data["data"]["children"]]

async def main():
    subreddits = ["Python", "MachineLearning", "datascience", "rust", "golang"]
    results    = await asyncio.gather(*[fetch_subreddit(s) for s in subreddits])
    all_posts  = [post for batch in results for post in batch]
    print(f"Scraped {len(all_posts)} posts across {len(subreddits)} subreddits")

asyncio.run(main())

Cost at scale

For Reddit specifically, almost all use cases can run on standard fetch mode — no JavaScript rendering required. The JSON API and old.reddit.com both return complete data in a single synchronous response. Reserve headless browser mode for edge cases: www.reddit.com award displays, embedded media metadata, or authenticated sessions.

A pipeline scraping 50 subreddits at 100 posts each, refreshed hourly, needs only 50 listing requests per hour (one limit=100 request per subreddit), roughly 1,200 requests/day. Fetching the comment tree for every post adds one request per post, bringing the total to about 5,000 requests/hour, or 120,000 requests/day. At standard tier rates on AlterLab's pricing plans, even the comment-inclusive workload fits comfortably within the mid-tier.
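
The request-budget arithmetic is worth sanity-checking in code. The helper below assumes one limit=100 listing request per subreddit per refresh, plus an optional comment-tree request per post (helper name is ours):

```python
def daily_requests(subreddits: int, posts_each: int, refreshes_per_day: int,
                   comment_trees: bool = False) -> int:
    """Listing requests plus optional per-post comment fetches, per day.

    Assumes posts_each <= 100, so each refresh of a subreddit is one request.
    """
    listing = subreddits * refreshes_per_day
    comments = subreddits * posts_each * refreshes_per_day if comment_trees else 0
    return listing + comments
```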

Try it yourself

Try scraping Reddit's Python subreddit JSON API live with AlterLab


Key Takeaways

  • JSON API first — Appending .json to any Reddit URL returns structured data without HTML parsing or JavaScript rendering. It's faster, cheaper, and more reliable than scraping HTML.
  • Target old.reddit.com for HTML scraping — The legacy interface is server-rendered with a stable DOM. The new Reddit SPA requires headless rendering; avoid it unless strictly necessary.
  • Paginate with after — Default responses cap at 25 posts. Use the after cursor field to walk through complete subreddit listings.
  • Filter deleted content explicitly — [deleted] authors and [removed] bodies are common; handle them at ingestion rather than letting them propagate into analytics.
  • Parallelize across subreddits, not within them — Concurrency at the subreddit level maximizes throughput without triggering per-endpoint rate limits.
  • Proxy rotation, TLS fingerprint spoofing, and rate-limit handling are solved infrastructure problems — don't build and maintain that stack when purpose-built tooling exists.



Frequently Asked Questions

Is it legal to scrape Reddit?
Scraping publicly accessible Reddit posts and comments is generally protected under U.S. case law (hiQ Labs v. LinkedIn), but Reddit's Terms of Service restrict automated access outside their official API. For read-only research on public data, legal exposure is low — but review Reddit's ToS before running production pipelines at scale, and respect rate limits to avoid IP-level enforcement.

Why use a scraping API instead of scraping Reddit directly?
Reddit uses rate limiting, User-Agent enforcement, and TLS fingerprinting to detect and block scrapers. Managing rotating residential proxies, browser fingerprint spoofing, and session headers yourself takes weeks to build and breaks whenever Reddit updates its stack. AlterLab's anti-bot bypass API handles all of this transparently — your scrape requests look indistinguishable from real browser traffic without any of the infrastructure overhead.

How much does it cost to scrape Reddit at scale?
Cost depends on request volume and rendering mode. Reddit's JSON API endpoints and old.reddit.com HTML pages require only standard fetch mode — no JavaScript rendering — which is the most cost-efficient tier. A pipeline scraping 50 subreddits at 100 posts each, refreshed hourly with per-post comment fetches, runs roughly 120,000 requests per day. See AlterLab's pricing page for current tier breakdowns and per-request rates.