
How to Scrape Reddit: Complete Guide for 2026

Learn how to scrape Reddit in 2026 with Python. Handle rate limits, extract structured post data, bypass anti-bot measures, and scale your pipeline.

Yash Dubey

March 30, 2026

10 min read

Reddit generates millions of posts and comments daily across 100,000+ active communities. Unlike most platforms, a significant portion of that data is accessible without authentication — no OAuth dance, no login wall. That makes it one of the highest-signal public data sources available to engineers building monitoring, research, or intelligence pipelines.

This guide covers the full stack: what Reddit's defenses look like in practice, how to extract clean structured data using both the JSON API and HTML parsing, and how to scale to hundreds of subreddits without getting blocked.


Why Scrape Reddit?

Three use cases dominate production Reddit scrapers:

Brand and product sentiment monitoring — Reddit discussions are candid in a way that Twitter threads and review sites rarely are. Subreddits like r/personalfinance, r/wallstreetbets, and category-specific communities move faster than traditional media. Fintech teams, investor relations departments, and product orgs monitor these feeds in near-real-time to detect reputation events before they surface elsewhere.

Market research and trend detection — Academic researchers, journalists, and strategy teams use Reddit to map emerging narratives at scale. A longitudinal dataset of r/MachineLearning or r/datascience posts tells you what practitioners are actually evaluating — tools, frameworks, vendors — rather than what analyst reports say they should be.

Lead generation from high-intent communities — B2B teams mine niche subreddits for decision-makers asking "what's the best tool for X?" Those threads surface high-intent buyers actively comparing options. Automated monitoring of relevant search terms and post flairs generates a continuous stream of warm leads.


Anti-Bot Challenges on Reddit

Reddit's defenses are lighter than enterprise-grade targets like LinkedIn or Amazon, but they'll kill a naive scraper within minutes:

Rate limiting on the JSON API — The public JSON endpoint (reddit.com/r/sub.json) allows roughly one unauthenticated request per two seconds. Burst past that and you receive 429 responses. The HTML interfaces enforce similar limits, and Reddit's rate-limit windows are per-IP — rotating user agents alone doesn't help.
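
Client-side pacing is simple to sketch. The helper below is a hypothetical utility (not part of any Reddit or AlterLab API) that computes how long to sleep before the next request, given the ~2-second unauthenticated limit:

```python
MIN_INTERVAL = 2.0  # ~one unauthenticated request per two seconds, per IP

def wait_time(last_request_at: float, now: float,
              min_interval: float = MIN_INTERVAL) -> float:
    """Seconds to sleep before the next request is allowed."""
    return max(0.0, min_interval - (now - last_request_at))
```

In a crawl loop you would call it with `time.monotonic()` timestamps and sleep for the returned duration before each fetch.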

User-Agent enforcement — Reddit explicitly blocks requests with missing or generic User-Agent headers. The library default python-requests/2.31.0 will get you flagged faster than any other signal. Reddit's own API guidelines require a descriptive UA string in the format AppName/Version (context).
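
Building the header once and reusing it across a pipeline keeps the UA consistent. A tiny sketch (the helper name and contact string are placeholders, not an AlterLab API):

```python
def build_headers(app: str, version: str, contact: str) -> dict[str, str]:
    """Descriptive User-Agent in the AppName/Version (contact) shape Reddit expects."""
    return {"User-Agent": f"{app}/{version} ({contact})"}
```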

New Reddit's SPA architecture — www.reddit.com is a React single-page application. A raw HTTP request returns an HTML shell with almost no post data; the actual content arrives through asynchronous JavaScript calls to internal APIs. Without headless browser rendering, you get empty content. This is the single biggest trap engineers fall into when scraping Reddit for the first time.

TLS fingerprinting via CDN infrastructure — Reddit's CDN layer can fingerprint the TLS handshake to distinguish browser clients from HTTP libraries. Standard requests or httpx produce distinct TLS signatures that residential proxies alone won't obscure. You need full browser fingerprint spoofing at the TLS layer.

old.reddit.com as a practical escape hatch — The legacy interface serves fully server-rendered HTML. No JavaScript execution required, CSS selectors work reliably, and the DOM structure has been stable for years. For post and comment data, old.reddit.com is the correct default target.

Building rotation logic, fingerprint spoofing, session management, and retry handling yourself is a multi-week project that breaks whenever Reddit updates its stack. AlterLab's anti-bot bypass API handles all of it transparently at the infrastructure level.

99.2% — Scrape Success Rate
1.4s — Avg Response Time
~2s — Reddit's Unauthenticated Rate Limit
50M+ — Reddit Posts Indexed Daily

Quick Start with AlterLab

Install the SDK, then export your API key. The getting started guide covers environment setup, key management, and first-request verification in detail.

Bash
pip install alterlab beautifulsoup4
export ALTERLAB_API_KEY="YOUR_API_KEY"

The simplest working request — fetch old.reddit.com with a descriptive User-Agent:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Target old.reddit.com — server-rendered, no JS needed
response = client.scrape(
    "https://old.reddit.com/r/Python/hot/",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)

print(response.status_code)   # 200
print(response.text[:500])    # HTML content

The equivalent via cURL for quick terminal verification:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://old.reddit.com/r/Python/hot/",
    "headers": {
      "User-Agent": "ResearchPipeline/1.0 ([email protected])"
    }
  }'

Prefer the JSON API when you don't need HTML. Appending .json to any Reddit URL returns a fully structured JSON response — no parsing, no selectors, no BeautifulSoup:

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

# Reddit's built-in JSON endpoint — structured, fast, no JS rendering needed
response = client.scrape(
    "https://www.reddit.com/r/MachineLearning/hot.json?limit=25",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)

data = json.loads(response.text)
posts = data["data"]["children"]

for post in posts:
    p = post["data"]
    print(f"{p['score']:>6} upvotes | {p['title'][:75]}")

Extracting Structured Data

From the JSON API

The .json endpoint exposes the full post object. Key fields under data.children[].data:

Field            Type    Description
id               string  Base-36 post ID
title            string  Post title
author           string  Reddit username
score            int     Net upvote count
upvote_ratio     float   Fraction of upvotes (0.0–1.0)
url              string  Link URL, or Reddit permalink for text posts
selftext         string  Post body text (empty for link posts)
num_comments     int     Comment count, including removed
created_utc      float   Unix timestamp (UTC)
subreddit        string  Subreddit name
link_flair_text  string  Post flair label
Python
import alterlab
import json
from datetime import datetime, timezone

client = alterlab.Client("YOUR_API_KEY")

def scrape_subreddit(subreddit: str, sort: str = "hot", limit: int = 100) -> list[dict]:
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    response = client.scrape(
        url,
        headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
    )
    data = json.loads(response.text)

    posts = []
    for child in data["data"]["children"]:
        p = child["data"]
        posts.append({
            "id":           p["id"],
            "title":        p["title"],
            "author":       p["author"],
            "score":        p["score"],
            "upvote_ratio": p["upvote_ratio"],
            "comments":     p["num_comments"],
            "created":      datetime.fromtimestamp(p["created_utc"], tz=timezone.utc).isoformat(),
            "url":          p["url"],
            "body":         p.get("selftext", ""),
            "flair":        p.get("link_flair_text") or "",
        })

    return posts

results = scrape_subreddit("MachineLearning", sort="new", limit=50)
print(f"Extracted {len(results)} posts")

From old.reddit.com HTML

Use HTML parsing when you need data not exposed by the JSON API — sidebar content, wiki pages, or custom CSS flairs. old.reddit.com has a stable DOM structure that's been consistent for years:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://old.reddit.com/r/Python/hot/",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)
soup = BeautifulSoup(response.text, "html.parser")

posts = []
for thing in soup.select("div.thing"):
    title_tag     = thing.select_one("a.title.may-blank")
    score_tag     = thing.select_one("div.score.unvoted")
    author_tag    = thing.select_one("a.author")
    timestamp_tag = thing.select_one("time.live-timestamp")
    comments_tag  = thing.select_one("a.comments")

    posts.append({
        "title":     title_tag.get_text(strip=True) if title_tag else None,
        "score":     score_tag.get("title") if score_tag else None,
        "author":    author_tag.get_text(strip=True) if author_tag else None,
        "timestamp": timestamp_tag.get("datetime") if timestamp_tag else None,
        "permalink": thing.get("data-permalink"),
        "comments":  comments_tag.get_text(strip=True) if comments_tag else None,
    })

print(f"Parsed {len(posts)} posts")

CSS selector reference for old.reddit.com:

Element          Selector
Post container   div.thing
Post title       a.title.may-blank
Score            div.score.unvoted
Author           a.author
Subreddit tag    a.subreddit
Post timestamp   time.live-timestamp
Comments link    a.comments
Post flair       span.flair

Common Pitfalls

Scraping www.reddit.com without headless rendering — The new Reddit UI loads post data asynchronously. A raw HTTP request returns empty content shells. Use old.reddit.com for HTML scraping, or target the .json API endpoints directly. Only enable headless browser mode for www.reddit.com when you specifically need data exposed only in the new UI.
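
A cheap guard is to normalize every incoming URL before fetching. The helper below is our own sketch: it rewrites www.reddit.com listings to the .json endpoint, or to old.reddit.com when you want HTML, and leaves everything else untouched:

```python
from urllib.parse import urlsplit, urlunsplit

def to_scrapable(url: str, prefer_json: bool = True) -> str:
    """Rewrite a www.reddit.com URL to a form that needs no JS rendering."""
    parts = urlsplit(url)
    if parts.netloc in ("www.reddit.com", "reddit.com"):
        if prefer_json:
            path = parts.path.rstrip("/") or "/"
            if not path.endswith(".json"):
                path += ".json"  # Reddit's built-in JSON view of the same page
            return urlunsplit(("https", "www.reddit.com", path, parts.query, ""))
        # Server-rendered legacy interface for HTML parsing
        return urlunsplit(("https", "old.reddit.com", parts.path, parts.query, ""))
    return url
```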

Ignoring pagination — The JSON API returns 25 posts by default, capped at 100 per request. Complete subreddit crawls require paginating with the after cursor (the fullname of the last post in the previous page):

Python
import alterlab
import json
import time

client = alterlab.Client("YOUR_API_KEY")

def scrape_all_posts(subreddit: str, max_pages: int = 10) -> list[dict]:
    posts, after = [], None

    for _ in range(max_pages):
        qs  = f"limit=100&after={after}" if after else "limit=100"
        url = f"https://www.reddit.com/r/{subreddit}/new.json?{qs}"

        response = client.scrape(url, headers={"User-Agent": "Pipeline/1.0"})
        data     = json.loads(response.text)
        children = data["data"]["children"]

        if not children:
            break

        posts.extend(child["data"] for child in children)
        after = data["data"]["after"]

        if not after:
            break

        time.sleep(0.5)  # Polite crawl delay between pages

    return posts

Assuming num_comments matches the fetched comment tree — num_comments counts deleted and removed comments. When you fetch the comment tree separately, the retrievable count is often lower. Don't use this field for data-completeness checks.
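
If you need a real completeness signal, count the comments you actually fetched. A sketch against the comment-listing JSON shape (t1 children whose replies field is either an empty string or a nested listing):

```python
def count_fetched(listing: dict) -> int:
    """Count t1 comments actually present in a fetched comment listing."""
    n = 0
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":
            continue  # skip 'more' stubs and other kinds
        n += 1
        replies = child["data"].get("replies")
        if isinstance(replies, dict):  # empty replies come back as ""
            n += count_fetched(replies)
    return n
```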

Not handling deleted and removed content — Posts with [deleted] as author or [removed] as selftext are endemic to Reddit data. Filter or flag them explicitly rather than letting them corrupt downstream aggregations.
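
A minimal ingestion-time filter might look like this (helper names are ours):

```python
def is_tombstoned(post: dict) -> bool:
    """True for posts whose author or body was deleted or removed."""
    return (post.get("author") == "[deleted]"
            or post.get("selftext") in ("[removed]", "[deleted]"))

def clean(posts: list[dict]) -> list[dict]:
    """Drop tombstoned posts before they reach downstream aggregations."""
    return [p for p in posts if not is_tombstoned(p)]
```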

Session expiry on long-running crawls — For multi-hour crawls across hundreds of subreddits, cookies and auth tokens can expire mid-run. Build retry logic with exponential backoff and log failures per-URL so you can resume from the last successful page rather than restarting from scratch.
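
A retry wrapper with exponential backoff is a few lines; the sleep function is injectable so the backoff schedule can be verified without waiting (this is a sketch, not an AlterLab API):

```python
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            sleep(base_delay * (2 ** attempt))
```

In a crawler you would wrap each page fetch and log the URL on final failure, so the run can resume from the last successful page.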


Scaling Up

Async batching across subreddits

Python
import asyncio
import json
import alterlab

client = alterlab.Client("YOUR_API_KEY")

async def fetch_subreddit(subreddit: str) -> list[dict]:
    loop = asyncio.get_running_loop()
    # AlterLab client is sync; run it in a thread-pool executor from async code
    response = await loop.run_in_executor(
        None,
        lambda: client.scrape(
            f"https://www.reddit.com/r/{subreddit}/hot.json?limit=100",
            headers={"User-Agent": "Pipeline/1.0 ([email protected])"},
        ),
    )
    data = json.loads(response.text)
    return [child["data"] for child in data["data"]["children"]]

async def main():
    subreddits = ["Python", "MachineLearning", "datascience", "rust", "golang"]
    results    = await asyncio.gather(*[fetch_subreddit(s) for s in subreddits])
    all_posts  = [post for batch in results for post in batch]
    print(f"Scraped {len(all_posts)} posts across {len(subreddits)} subreddits")

asyncio.run(main())

Cost at scale

For Reddit specifically, almost all use cases can run on standard fetch mode — no JavaScript rendering required. The JSON API and old.reddit.com both return complete data in a single synchronous response. Reserve headless browser mode for edge cases: www.reddit.com award displays, embedded media metadata, or authenticated sessions.

A pipeline scraping 50 subreddits at 100 posts each, refreshed hourly, needs only 50 listing requests per hour (one limit=100 request per subreddit), roughly 1,200 requests/day. Fetching the comment tree for every post adds one request per post, bringing the total to about 5,000 requests/hour, or 120,000 requests/day. At standard tier rates on AlterLab's pricing plans, even the comment-inclusive workload fits comfortably within the mid-tier.
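
The request-budget arithmetic is worth sanity-checking in code. The helper below assumes one limit=100 listing request per subreddit per refresh, plus an optional comment-tree request per post (helper name is ours):

```python
def daily_requests(subreddits: int, posts_each: int, refreshes_per_day: int,
                   comment_trees: bool = False) -> int:
    """Listing requests plus optional per-post comment fetches, per day.

    Assumes posts_each <= 100, so each refresh of a subreddit is one request.
    """
    listing = subreddits * refreshes_per_day
    comments = subreddits * posts_each * refreshes_per_day if comment_trees else 0
    return listing + comments
```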

Try it yourself

Try scraping Reddit's Python subreddit JSON API live with AlterLab


Key Takeaways

  • JSON API first — Appending .json to any Reddit URL returns structured data without HTML parsing or JavaScript rendering. It's faster, cheaper, and more reliable than scraping HTML.
  • Target old.reddit.com for HTML scraping — The legacy interface is server-rendered with a stable DOM. The new Reddit SPA requires headless rendering; avoid it unless strictly necessary.
  • Paginate with after — Default responses cap at 25 posts. Use the after cursor field to walk through complete subreddit listings.
  • Filter deleted content explicitly — [deleted] authors and [removed] bodies are common; handle them at ingestion rather than letting them propagate into analytics.
  • Parallelize across subreddits, not within them — Concurrency at the subreddit level maximizes throughput without triggering per-endpoint rate limits.
  • Proxy rotation, TLS fingerprint spoofing, and rate-limit handling are solved infrastructure problems — don't build and maintain that stack when purpose-built tooling exists.



Frequently Asked Questions

Is it legal to scrape Reddit?
Scraping publicly accessible Reddit posts and comments is generally protected under U.S. case law (hiQ Labs v. LinkedIn), but Reddit's Terms of Service restrict automated access outside their official API. For read-only research on public data, legal exposure is low — but review Reddit's ToS before running production pipelines at scale, and respect rate limits to avoid IP-level enforcement.

Why use a scraping API instead of scraping Reddit directly?
Reddit uses rate limiting, User-Agent enforcement, and TLS fingerprinting to detect and block scrapers. Managing rotating residential proxies, browser fingerprint spoofing, and session headers yourself takes weeks to build and breaks whenever Reddit updates its stack. AlterLab's anti-bot bypass API handles all of this transparently — your scrape requests look indistinguishable from real browser traffic without any of the infrastructure overhead.

How much does it cost to scrape Reddit at scale?
Cost depends on request volume and rendering mode. Reddit's JSON API endpoints and old.reddit.com HTML pages require only standard fetch mode — no JavaScript rendering — which is the most cost-efficient tier. A pipeline scraping 50 subreddits at 100 posts each, refreshed hourly with per-post comment fetches, runs roughly 120,000 requests per day. See AlterLab's pricing page for current tier breakdowns and per-request rates.