
How to Scrape Reddit: Complete Guide for 2026
Learn how to scrape Reddit in 2026 with Python. Handle rate limits, extract structured post data, bypass anti-bot measures, and scale your pipeline.
March 30, 2026
Reddit generates millions of posts and comments daily across 100,000+ active communities. Unlike most platforms, a significant portion of that data is accessible without authentication — no OAuth dance, no login wall. That makes it one of the highest-signal public data sources available to engineers building monitoring, research, or intelligence pipelines.
This guide covers the full stack: what Reddit's defenses look like in practice, how to extract clean structured data using both the JSON API and HTML parsing, and how to scale to hundreds of subreddits without getting blocked.
Why Scrape Reddit?
Three use cases dominate production Reddit scrapers:
Brand and product sentiment monitoring — Reddit discussions are candid in a way that Twitter threads and review sites rarely are. Subreddits like r/personalfinance, r/wallstreetbets, and category-specific communities move faster than traditional media. Fintech teams, investor relations departments, and product orgs monitor these feeds in near-real-time to detect reputation events before they surface elsewhere.
Market research and trend detection — Academic researchers, journalists, and strategy teams use Reddit to map emerging narratives at scale. A longitudinal dataset of r/MachineLearning or r/datascience posts tells you what practitioners are actually evaluating — tools, frameworks, vendors — rather than what analyst reports say they should be.
Lead generation from high-intent communities — B2B teams mine niche subreddits for decision-makers asking "what's the best tool for X?" Those threads surface high-intent buyers actively comparing options. Automated monitoring of relevant search terms and post flairs generates a continuous stream of warm leads.
Anti-Bot Challenges on Reddit
Reddit's defenses are lighter than enterprise-grade targets like LinkedIn or Amazon, but they'll kill a naive scraper within minutes:
Rate limiting on the JSON API — The public JSON endpoint (reddit.com/r/sub.json) allows roughly one unauthenticated request per two seconds. Burst past that and you receive 429 responses. The HTML interfaces enforce similar limits, and Reddit's rate-limit windows are per-IP — rotating user agents alone doesn't help.
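Client-side throttling plus backoff on 429s is the standard answer to a per-IP budget. The sketch below uses only the standard library; the two-second interval matches Reddit's unauthenticated limit, but the backoff base and cap are assumptions to tune against your own traffic:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter, for retrying after a 429 response."""
    return min(cap, base * (2 ** attempt)) * (0.5 + random.random() / 2)

class Throttle:
    """Enforce a minimum interval between requests (~2s for unauthenticated Reddit)."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough to keep requests min_interval apart
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call `throttle.wait()` before every request, and on a 429 sleep for `backoff_delay(attempt)` before retrying. The jitter keeps multiple workers from retrying in lockstep.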
User-Agent enforcement — Reddit explicitly blocks requests with missing or generic User-Agent headers. The library default python-requests/2.31.0 will get you flagged faster than any other signal. Reddit's own API guidelines require a descriptive UA string in the AppName/Version format, ideally with contact information.
New Reddit's SPA architecture — www.reddit.com is a React single-page application. A raw HTTP request returns an HTML shell with almost no post data — the actual content arrives via async JavaScript fetching internal API calls. Without headless browser rendering, you get empty content. This is the single biggest trap engineers fall into when scraping Reddit for the first time.
TLS fingerprinting via CDN infrastructure — Reddit's CDN layer can fingerprint the TLS handshake to distinguish browser clients from HTTP libraries. Standard requests or httpx produce distinct TLS signatures that residential proxies alone won't obscure. You need full browser fingerprint spoofing at the TLS layer.
old.reddit.com as a practical escape hatch — The legacy interface serves fully server-rendered HTML. No JavaScript execution required, CSS selectors work reliably, and the DOM structure has been stable for years. For post and comment data, old.reddit.com is the correct default target.
Building rotation logic, fingerprint spoofing, session management, and retry handling yourself is a multi-week project that breaks whenever Reddit updates its stack. AlterLab's anti-bot bypass API handles all of it transparently at the infrastructure level.
Quick Start with AlterLab
Install the SDK, then export your API key. The getting started guide covers environment setup, key management, and first-request verification in detail.
```bash
pip install alterlab beautifulsoup4
export ALTERLAB_API_KEY="YOUR_API_KEY"
```

The simplest working request — fetch old.reddit.com with a descriptive User-Agent:

```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Target old.reddit.com — server-rendered, no JS needed
response = client.scrape(
    "https://old.reddit.com/r/Python/hot/",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)

print(response.status_code)  # 200
print(response.text[:500])   # HTML content
```

The equivalent via cURL for quick terminal verification:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://old.reddit.com/r/Python/hot/",
    "headers": {
      "User-Agent": "ResearchPipeline/1.0 ([email protected])"
    }
  }'
```

Prefer the JSON API when you don't need HTML. Appending .json to any Reddit URL returns a fully structured JSON response — no parsing, no selectors, no BeautifulSoup:
```python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

# Reddit's built-in JSON endpoint — structured, fast, no JS rendering needed
response = client.scrape(
    "https://www.reddit.com/r/MachineLearning/hot.json?limit=25",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)

data = json.loads(response.text)
posts = data["data"]["children"]
for post in posts:
    p = post["data"]
    print(f"{p['score']:>6} upvotes | {p['title'][:75]}")
```

Extracting Structured Data
From the JSON API
The .json endpoint exposes the full post object. Key fields under data.children[].data:
| Field | Type | Description |
|---|---|---|
| id | string | Base-36 post ID |
| title | string | Post title |
| author | string | Reddit username |
| score | int | Net upvote count |
| upvote_ratio | float | Fraction of upvotes (0.0–1.0) |
| url | string | Link URL, or Reddit permalink for text posts |
| selftext | string | Post body text (empty for link posts) |
| num_comments | int | Comment count, including removed comments |
| created_utc | float | Unix timestamp (UTC) |
| subreddit | string | Subreddit name |
| link_flair_text | string | Post flair label |
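When you need approximate raw vote counts, score and upvote_ratio together are enough to back them out, since score = ups − downs and upvote_ratio = ups / (ups + downs). Treat the result as an estimate: Reddit fuzzes vote totals, and the math is undefined at a ratio of exactly 0.5.

```python
def estimate_votes(score: int, upvote_ratio: float) -> tuple[int, int]:
    """Estimate raw (ups, downs) from a post's score and upvote_ratio.

    Solving the two equations gives total = score / (2 * ratio - 1),
    which has no solution when ratio == 0.5 (score is 0, total unknown).
    """
    if abs(2 * upvote_ratio - 1) < 1e-9:
        raise ValueError("ratio of 0.5 carries no information about totals")
    total = score / (2 * upvote_ratio - 1)
    ups = round(upvote_ratio * total)
    return ups, ups - score
```

For example, a post with score 40 and a ratio of 0.8 works out to roughly 53 upvotes and 13 downvotes.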
```python
import alterlab
import json
from datetime import datetime, timezone

client = alterlab.Client("YOUR_API_KEY")

def scrape_subreddit(subreddit: str, sort: str = "hot", limit: int = 100) -> list[dict]:
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json?limit={limit}"
    response = client.scrape(
        url,
        headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
    )
    data = json.loads(response.text)
    posts = []
    for child in data["data"]["children"]:
        p = child["data"]
        posts.append({
            "id": p["id"],
            "title": p["title"],
            "author": p["author"],
            "score": p["score"],
            "upvote_ratio": p["upvote_ratio"],
            "comments": p["num_comments"],
            "created": datetime.fromtimestamp(p["created_utc"], tz=timezone.utc).isoformat(),
            "url": p["url"],
            "body": p.get("selftext", ""),
            "flair": p.get("link_flair_text") or "",
        })
    return posts

results = scrape_subreddit("MachineLearning", sort="new", limit=50)
print(f"Extracted {len(results)} posts")
```

From old.reddit.com HTML
Use HTML parsing when you need data not exposed by the JSON API — sidebar content, wiki pages, or custom CSS flairs. old.reddit.com has a stable DOM structure that's been consistent for years:
```python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

# Send a descriptive User-Agent here too — generic UAs get flagged
response = client.scrape(
    "https://old.reddit.com/r/Python/hot/",
    headers={"User-Agent": "ResearchPipeline/1.0 ([email protected])"},
)
soup = BeautifulSoup(response.text, "html.parser")

posts = []
for thing in soup.select("div.thing"):
    title_tag = thing.select_one("a.title.may-blank")
    score_tag = thing.select_one("div.score.unvoted")
    author_tag = thing.select_one("a.author")
    timestamp_tag = thing.select_one("time.live-timestamp")
    comments_tag = thing.select_one("a.comments")
    posts.append({
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "score": score_tag.get("title") if score_tag else None,
        "author": author_tag.get_text(strip=True) if author_tag else None,
        "timestamp": timestamp_tag.get("datetime") if timestamp_tag else None,
        "permalink": thing.get("data-permalink"),
        "comments": comments_tag.get_text(strip=True) if comments_tag else None,
    })

print(f"Parsed {len(posts)} posts")
```

CSS selector reference for old.reddit.com:
| Element | Selector |
|---|---|
| Post container | div.thing |
| Post title | a.title.may-blank |
| Score | div.score.unvoted |
| Author | a.author |
| Subreddit tag | a.subreddit |
| Post timestamp | time.live-timestamp |
| Comments link | a.comments |
| Post flair | span.flair |
Common Pitfalls
Scraping www.reddit.com without headless rendering — The new Reddit UI loads post data asynchronously. A raw HTTP request returns empty content shells. Use old.reddit.com for HTML scraping, or target the .json API endpoints directly. Only enable headless browser mode for www.reddit.com when you specifically need data exposed only in the new UI.
Ignoring pagination — The JSON API returns 25 posts by default, capped at 100 per request. Complete subreddit crawls require paginating with the after cursor (the fullname of the last post, exposed in its name field):
```python
import alterlab
import json
import time

client = alterlab.Client("YOUR_API_KEY")

def scrape_all_posts(subreddit: str, max_pages: int = 10) -> list[dict]:
    posts, after = [], None
    for _ in range(max_pages):
        qs = f"limit=100&after={after}" if after else "limit=100"
        url = f"https://www.reddit.com/r/{subreddit}/new.json?{qs}"
        response = client.scrape(url, headers={"User-Agent": "Pipeline/1.0"})
        data = json.loads(response.text)
        children = data["data"]["children"]
        if not children:
            break
        posts.extend(child["data"] for child in children)
        after = data["data"]["after"]
        if not after:
            break
        time.sleep(0.5)  # Polite crawl delay between pages
    return posts
```

Assuming num_comments matches the fetched comment tree — num_comments counts deleted and removed comments. When you fetch the comment tree separately, the actual retrievable count will always be lower. Don't use this field for data completeness checks.
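To measure the gap yourself, walk the comment tree from the permalink's .json endpoint and count only the t1 comments that actually arrived. A sketch, assuming the standard listing shape (comments are the second element of the returned list, nested replies live under data.replies, and unloaded branches appear as kind "more"):

```python
def count_comments(listing: dict) -> int:
    """Count retrievable t1 comments in a Reddit comment listing.

    Skips 'more' placeholder stubs (collapsed or unloaded branches) and
    recurses into nested replies. An empty replies value arrives as "".
    """
    count = 0
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":
            continue  # 'more' stubs and non-comment kinds don't count
        count += 1
        replies = child["data"].get("replies")
        if isinstance(replies, dict):
            count += count_comments(replies)
    return count
```

Compare the result against the post's num_comments to quantify how much of the thread was deleted, removed, or left unexpanded.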
Not handling deleted and removed content — Posts with [deleted] as author or [removed] as selftext are endemic to Reddit data. Filter or flag them explicitly rather than letting them corrupt downstream aggregations.
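A small predicate makes that filtering explicit at ingestion time. The removed_by_category check below is a hedge: that field shows up on removed posts in the JSON API, but verify its exact semantics against your own data before relying on it:

```python
def is_tombstoned(post: dict) -> bool:
    """Flag posts whose author or body has been deleted or removed."""
    return (
        post.get("author") in ("[deleted]", None)
        or post.get("selftext") in ("[deleted]", "[removed]")
        # Assumption: present on posts removed by moderators/admins
        or post.get("removed_by_category") is not None
    )

def clean(posts: list[dict]) -> list[dict]:
    """Drop tombstoned posts before they reach downstream aggregations."""
    return [p for p in posts if not is_tombstoned(p)]
```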
Session expiry on long-running crawls — For multi-hour crawls across hundreds of subreddits, cookies and auth tokens can expire mid-run. Build retry logic with exponential backoff and log failures per-URL so you can resume from the last successful page rather than restarting from scratch.
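One way to sketch that resume logic: persist the set of completed URLs after every success, and retry transient failures with exponential backoff. The helper below is a hypothetical outline, not part of the AlterLab API; fetch can be any callable, including a wrapper around client.scrape:

```python
import json
import time
from pathlib import Path

def crawl_with_checkpoint(fetch, urls, state_path="crawl_state.json", max_retries=4):
    """Resume-safe crawl: a restart skips URLs already recorded as done."""
    path = Path(state_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    results = {}
    for url in urls:
        if url in done:
            continue  # already fetched in a previous run
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                done.add(url)
                # Checkpoint after every success so a crash loses at most one URL
                path.write_text(json.dumps(sorted(done)))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s, ...
    return results
```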
Scaling Up
Async batching across subreddits
```python
import asyncio
import json
import alterlab

client = alterlab.Client("YOUR_API_KEY")

async def fetch_subreddit(subreddit: str) -> list[dict]:
    loop = asyncio.get_running_loop()
    # AlterLab client is sync; run it in the default thread pool executor
    response = await loop.run_in_executor(
        None,
        lambda: client.scrape(
            f"https://www.reddit.com/r/{subreddit}/hot.json?limit=100",
            headers={"User-Agent": "Pipeline/1.0 ([email protected])"},
        ),
    )
    data = json.loads(response.text)
    return [child["data"] for child in data["data"]["children"]]

async def main():
    subreddits = ["Python", "MachineLearning", "datascience", "rust", "golang"]
    results = await asyncio.gather(*[fetch_subreddit(s) for s in subreddits])
    all_posts = [post for batch in results for post in batch]
    print(f"Scraped {len(all_posts)} posts across {len(subreddits)} subreddits")

asyncio.run(main())
```

Cost at scale
For Reddit specifically, almost all use cases can run on standard fetch mode — no JavaScript rendering required. The JSON API and old.reddit.com both return complete data in a single synchronous response. Reserve headless browser mode for edge cases: www.reddit.com award displays, embedded media metadata, or authenticated sessions.
A pipeline scraping 50 subreddits at 100 posts each, refreshed hourly, needs only 50 listing requests per hour (about 1,200 per day), since a single request returns up to 100 posts. If you also need comment trees, factor in one additional request per post for the comment JSON endpoint — roughly 5,000 requests/hour and 120,000 requests/day at full refresh. At standard tier rates on AlterLab's pricing plans, either workload fits comfortably within the mid-tier.
Key Takeaways
- JSON API first — Appending .json to any Reddit URL returns structured data without HTML parsing or JavaScript rendering. It's faster, cheaper, and more reliable than scraping HTML.
- Target old.reddit.com for HTML scraping — The legacy interface is server-rendered with a stable DOM. The new Reddit SPA requires headless rendering; avoid it unless strictly necessary.
- Paginate with after — Default responses cap at 25 posts. Use the after cursor field to walk through complete subreddit listings.
- Filter deleted content explicitly — [deleted] authors and [removed] bodies are common; handle them at ingestion rather than letting them propagate into analytics.
- Parallelize across subreddits, not within them — Concurrency at the subreddit level maximizes throughput without triggering per-endpoint rate limits.
- Proxy rotation, TLS fingerprint spoofing, and rate-limit handling are solved infrastructure problems — don't build and maintain that stack when purpose-built tooling exists.
Related Guides
If you're building broader social media data pipelines, these guides cover comparable challenges on adjacent platforms:
- How to Scrape Twitter/X — API v2 authentication, rate limit windows, and full tweet object extraction
- How to Scrape Instagram — GraphQL endpoint reverse engineering and login-wall handling
- How to Scrape TikTok — Mobile-first API architecture, device fingerprinting, and dynamic content rendering