
Scrape Hacker News & Reddit for Market Intel
A practical guide to scraping Hacker News, Reddit, and developer forums for competitive market intelligence without triggering rate limits or bot detection.
March 30, 2026
Developer forums are the highest-signal source of unfiltered market intelligence available. HN comment threads, Reddit /r/devops posts, and Stack Overflow questions tell you what engineers want, hate, and are willing to pay for — faster than any survey, and without the social desirability bias that distorts NPS scores. Here's how to extract that data systematically, at scale, without getting blocked.
What You're Actually Mining
Define your signal categories before writing any code. Raw post volume without a taxonomy is noise.
- Competitor mentions — threads where users compare tools in your category
- Pain point extraction — "I wish X would just…" phrasing, feature request threads, [Ask HN] posts
- Pricing sensitivity — reactions to pricing changes, "too expensive" comments, tier comparisons
- Technology momentum — what stacks, libraries, and patterns are gaining adoption in your space
- Support gap analysis — questions that competitors' docs answer poorly, indicating switching opportunity
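As a starting point, the taxonomy above can be sketched as a keyword map. The category names and keyword lists here are illustrative, not a tuned classifier:

```python
# Illustrative keyword patterns for each signal category.
SIGNAL_TAXONOMY = {
    "competitor_mention": ["alternative to", " vs ", "switched from", "compared to"],
    "pain_point": ["i wish", "why can't", "frustrating", "feature request"],
    "pricing_sensitivity": ["too expensive", "pricing", "cost", "cheaper"],
    "tech_momentum": ["migrating to", "adopted", "rewrote in"],
    "support_gap": ["docs don't", "undocumented", "how do i"],
}

def classify(text: str) -> list[str]:
    """Return every taxonomy category whose keywords appear in the text."""
    lowered = text.lower()
    return [
        category
        for category, keywords in SIGNAL_TAXONOMY.items()
        if any(kw in lowered for kw in keywords)
    ]
```

Posts can match multiple categories — a "too expensive, migrating to X" thread is both a pricing and a momentum signal, and should be counted in both buckets.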
HN skews toward technical decision-makers and founders — strong signal for B2B and infrastructure products. Reddit's /r/devops, /r/sysadmin, and /r/programming are higher volume but noisier. Stack Overflow is the lowest volume but highest intent: someone asking "how do I migrate from X to Y" is announcing a purchase decision.
Technical Challenges by Platform
Each platform has a distinct anti-scraping posture. Choose your access method accordingly.
Reddit blocks datacenter IPs aggressively and often silently — you'll get a 200 OK with an empty children array instead of a proper 429. Stack Overflow runs behind Cloudflare and serves JS challenges to non-browser TLS fingerprints. Using anti-bot bypass for these platforms handles TLS normalization and challenge resolution automatically.
Scraping Hacker News
HN provides two official data endpoints. Use them before touching HTML.
- Algolia Search API (hn.algolia.com/api/v1) — full-text search across all HN content, JSON, paginated, no auth required
- Firebase REST API (hacker-news.firebaseio.com/v0) — fetch individual items and live feeds by ID
For market intelligence, the Algolia API is the right starting point. You can search by keyword, filter by content type (story, comment, ask_hn, show_hn), and sort by recency or relevance. The search_by_date endpoint is better for monitoring workflows than search — recency is more actionable than Algolia's relevance score for this use case.
```python
import alterlab
import json
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

client = alterlab.Client("YOUR_API_KEY")

def search_hn(
    query: str,
    tags: str = "story",
    days_back: int = 30,
    limit: int = 100,
) -> list[dict]:
    """
    Search HN via Algolia. Returns structured post data.
    tags: 'story' | 'comment' | 'ask_hn' | 'show_hn'
    """
    # Timezone-aware now() avoids the naive-utcnow() .timestamp() pitfall,
    # which would interpret the naive datetime as local time.
    since = int((datetime.now(timezone.utc) - timedelta(days=days_back)).timestamp())
    url = (
        f"https://hn.algolia.com/api/v1/search_by_date"
        f"?query={quote(query)}"
        f"&tags={tags}"
        f"&numericFilters=created_at_i>{since}"
        f"&hitsPerPage={limit}"
    )
    resp = client.scrape(url, {"render_js": False})
    data = json.loads(resp.text)
    return [
        {
            "id": hit["objectID"],
            "title": hit.get("title") or hit.get("comment_text", "")[:120],
            "url": hit.get("url") or f"https://news.ycombinator.com/item?id={hit['objectID']}",
            "points": hit.get("points", 0),
            "num_comments": hit.get("num_comments", 0),
            "author": hit.get("author"),
            "created_at": hit["created_at"],
        }
        for hit in data.get("hits", [])
    ]

if __name__ == "__main__":
    competitors = ["ScrapingBee", "Apify", "Bright Data", "Zyte"]
    for competitor in competitors:
        results = search_hn(competitor, tags="comment", days_back=90)
        top = sorted(results, key=lambda x: x["points"] or 0, reverse=True)[:3]
        print(f"\n{competitor}: {len(results)} mentions in 90 days")
        for r in top:
            print(f"  [{r['points'] or 0:>4}pts] {r['title'][:75]}")
            print(f"    {r['url']}")
```

Two implementation notes: comments don't have a title field — truncate comment_text instead. And the Algolia free tier allows roughly 10k requests per day per IP, which is sufficient for monitoring but insufficient for historical backfills — route bulk requests through a proxy pool to avoid exhausting the quota from a single source IP.
To fetch the full comment thread for a specific story and extract discussion-level sentiment, use the Firebase API:
```python
def fetch_story_comments(story_id: int, max_top_level: int = 50) -> list[dict]:
    """Fetch top-level comments for an HN story via Firebase REST API."""
    url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
    resp = client.scrape(url, {"render_js": False})
    story = json.loads(resp.text)
    comment_ids = story.get("kids", [])[:max_top_level]
    comments = []
    for cid in comment_ids:
        c_resp = client.scrape(
            f"https://hacker-news.firebaseio.com/v0/item/{cid}.json",
            {"render_js": False},
        )
        comment = json.loads(c_resp.text)
        if not comment or comment.get("deleted") or comment.get("dead"):
            continue
        comments.append({
            "id": cid,
            "text": comment.get("text", ""),
            "author": comment.get("by"),
            # Note: HN does not expose scores on comment items, so this
            # almost always defaults to 0. Rank by thread position instead.
            "score": comment.get("score", 0),
            "time": comment.get("time"),
        })
    return comments
```

Scraping Reddit
Reddit's situation is more complex. The official API now requires OAuth and charges for commercial-scale access (after the 2023 API pricing changes). The JSON endpoint still works for read-only access and requires no authentication — it's the right tool for this use case.
The .json trick: append .json to any Reddit URL and you receive the raw data structure the frontend renders from. https://www.reddit.com/r/devops/search.json?q=kubernetes&restrict_sr=1 returns the same posts as the page, without touching a line of JavaScript.
The equivalent cURL call through AlterLab's API:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.reddit.com/r/devops/search.json?q=web+scraping&restrict_sr=1&sort=new&t=month&limit=25",
    "render_js": false,
    "premium_proxy": true
  }'
```

For a full multi-subreddit sweep in Python:
```python
import alterlab
import json
import time
from urllib.parse import quote

client = alterlab.Client("YOUR_API_KEY")

SUBREDDITS = ["devops", "programming", "Python", "sysadmin", "webdev", "MachineLearning"]

def search_subreddit(
    subreddit: str,
    query: str,
    sort: str = "new",
    limit: int = 25,
    time_filter: str = "month",
) -> list[dict]:
    """
    Query Reddit's search JSON endpoint for a specific subreddit.
    sort: 'new' | 'relevance' | 'top' | 'comments'
    time_filter: 'hour' | 'day' | 'week' | 'month' | 'year' | 'all'
    """
    url = (
        f"https://www.reddit.com/r/{subreddit}/search.json"
        f"?q={quote(query)}&restrict_sr=1"
        f"&sort={sort}&t={time_filter}&limit={limit}"
    )
    resp = client.scrape(url, {
        "render_js": False,
        "premium_proxy": True,  # Required: Reddit blocks datacenter IPs
    })
    data = json.loads(resp.text)
    posts = data.get("data", {}).get("children", [])
    return [
        {
            "title": p["data"]["title"],
            "score": p["data"]["score"],
            "upvote_ratio": p["data"]["upvote_ratio"],
            "num_comments": p["data"]["num_comments"],
            "permalink": f"https://reddit.com{p['data']['permalink']}",
            "selftext": p["data"].get("selftext", "")[:800],
            "created_utc": p["data"]["created_utc"],
            "subreddit": subreddit,
        }
        for p in posts
    ]

def collect_signals(keywords: list[str]) -> list[dict]:
    """Cross-subreddit keyword sweep with deduplication."""
    all_posts = []
    for sub in SUBREDDITS:
        for keyword in keywords:
            posts = search_subreddit(sub, keyword, sort="new", time_filter="month")
            all_posts.extend(posts)
            time.sleep(1.5)  # Pause between requests — critical for staying under rate limits
    seen, unique = set(), []
    for p in all_posts:
        if p["permalink"] not in seen:
            seen.add(p["permalink"])
            unique.append(p)
    return sorted(unique, key=lambda x: x["score"], reverse=True)

if __name__ == "__main__":
    signals = collect_signals(["web scraping API", "proxy rotation", "anti-bot bypass"])
    print(f"Collected {len(signals)} unique posts\n")
    for s in signals[:15]:
        print(f"[{s['score']:>5}↑] [{s['subreddit']:<15}] {s['title'][:65]}")
```

The premium_proxy: true flag is non-negotiable for Reddit. Without it, you'll receive silent failures — 200 OK responses with empty children arrays — from datacenter IP blocks. See the Python SDK reference for the full options schema, including header injection and session persistence.
Fetching Comment Data
Post scores tell you signal volume. Comment bodies tell you signal content. Append .json to any post permalink:
```python
def fetch_post_comments(permalink: str, depth: int = 2, limit: int = 50) -> list[dict]:
    """
    Fetch top-level comments from a Reddit post.
    permalink: relative path, e.g. '/r/devops/comments/abc123/post_title/'
    """
    url = f"https://www.reddit.com{permalink}.json?depth={depth}&limit={limit}"
    resp = client.scrape(url, {"render_js": False, "premium_proxy": True})
    data = json.loads(resp.text)
    # Reddit returns a 2-element list: [0] post metadata, [1] comment listing
    if not isinstance(data, list) or len(data) < 2:
        return []
    comment_listing = data[1]["data"]["children"]
    comments = []
    for c in comment_listing:
        if c["kind"] != "t1":  # Skip "more comments" load-more objects
            continue
        body = c["data"].get("body", "")
        if body in ("[deleted]", "[removed]"):
            continue
        comments.append({
            "author": c["data"].get("author"),
            "body": body,
            "score": c["data"].get("score", 0),
            "created_utc": c["data"]["created_utc"],
        })
    return sorted(comments, key=lambda x: x["score"], reverse=True)
```

Anti-Detection Patterns That Actually Matter
Most scrapers fail not because of CAPTCHAs — they fail because of subtler fingerprinting.
User-Agent consistency: Use a realistic, fixed UA string across all requests to the same domain. Rotating UAs on every request increases suspicion; real browsers don't change UA mid-session.
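A minimal sketch of session-level UA pinning with the requests library (the UA string below is illustrative — pin any current mainstream browser UA):

```python
import requests

# One realistic desktop UA, set once and reused for every request in
# the session — mirroring how a real browser behaves mid-session.
UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

session = requests.Session()
session.headers.update({"User-Agent": UA})  # applied to every session.get()
```

Rotate the UA only when you rotate the exit IP — a new "identity" should change both together, never one without the other.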
Burst suppression: Burst patterns (10 requests in 500ms) trigger rate limiters even within hourly quotas. The time.sleep(1.5) in the Reddit example above is not courtesy — it's a functional requirement. Respect Retry-After headers unconditionally.
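A sketch of a request wrapper that honors Retry-After and backs off exponentially when the header is absent (the function name polite_get is ours, not part of any SDK):

```python
import random
import time
import requests

def polite_get(session, url: str, max_retries: int = 5):
    """GET with retry handling: honors Retry-After on 429/503 responses,
    falling back to exponential backoff with jitter when it's absent."""
    resp = None
    for attempt in range(max_retries):
        resp = session.get(url, timeout=30)
        if resp.status_code not in (429, 503):
            return resp
        retry_after = resp.headers.get("Retry-After")
        # Retry-After may also be an HTTP-date; we only handle the
        # numeric-seconds form here and fall back to backoff otherwise.
        if retry_after and retry_after.isdigit():
            delay = int(retry_after)
        else:
            delay = (2 ** attempt) + random.random()
        time.sleep(delay)
    return resp
```

Pair this with a fixed pause between successful calls (as in the Reddit sweep above) so successes and retries alike stay under the limiter.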
Silent block detection: Reddit and some forum platforms return 200 OK with empty data instead of 429. Always validate that children or hits arrays are non-empty, and implement exponential backoff on empty responses.
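The validation step can be sketched as follows (the helper names and the injected fetch callable are ours — wire in whatever client you use):

```python
import json
import time

def is_silent_block(raw: str) -> bool:
    """Detect Reddit-style silent blocks: a 200 OK whose JSON payload
    has no children. A non-JSON body on a JSON endpoint also counts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return True
    if not isinstance(data, dict):
        return True
    return not data.get("data", {}).get("children")

def fetch_with_backoff(fetch, url: str, max_retries: int = 4):
    """Retry empty payloads with exponential backoff (1s, 2s, 4s, 8s);
    return parsed JSON on success, None after exhausting retries."""
    for attempt in range(max_retries):
        raw = fetch(url)
        if not is_silent_block(raw):
            return json.loads(raw)
        time.sleep(2 ** attempt)
    return None
```

A burst of empty responses from one exit IP is a strong signal that the IP is burned — rotate it rather than continuing to retry through it.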
TLS fingerprinting: Platforms running Cloudflare (Stack Overflow, many niche forums) inspect TLS handshake parameters to detect non-browser clients. AlterLab's anti-bot bypass normalizes TLS fingerprints to match real browser profiles — no configuration required on your end.
JSON over HTML, always: JSON endpoints don't execute JavaScript-based bot detection. They're faster, more stable across UI changes, and produce cleaner data. Fall back to HTML scraping only when no JSON endpoint exists.
Building a Repeatable Pipeline
A one-off scrape is an audit. A scheduled pipeline is market intelligence. The minimum viable architecture:
- Scheduler — cron job or an Airflow DAG, hourly for Reddit, daily for HN historical sweeps
- Deduplication — SHA-256 hash of post URL, skip already-processed items before fetching comments
- Classification — keyword match on ["pricing", "cost", "expensive", "alternative to", "migrating from"] gets you 80% accuracy with zero latency; feed post bodies to a lightweight LLM for the remaining ambiguous cases
- Storage — PostgreSQL with a forum_posts table: (id, source, url, title, body, score, created_at, signal_category, processed_at)
- Alerting — Slack webhook triggered when a post crosses a score threshold (e.g., >150 upvotes) and matches a competitor keyword — these are the posts worth reading same-day
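The deduplication step above can be sketched in a few lines (the helper names are ours; in production the seen-hash set lives in your database, not in memory):

```python
import hashlib

def url_fingerprint(url: str) -> str:
    """Dedup key: SHA-256 hex digest of the post URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def filter_unprocessed(posts: list[dict], seen_hashes: set[str]) -> list[dict]:
    """Drop already-processed posts before the expensive comment-fetch step."""
    fresh = []
    for post in posts:
        h = url_fingerprint(post["url"])
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(post)
    return fresh
```

Hashing the URL rather than storing it raw keeps the dedup index small and fixed-width, which matters once the table holds months of crawls.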
Classification quality compounds with volume. After 90 days of labeled data, you'll have enough signal to train a fine-tuned classifier that outperforms keyword matching on nuanced cases like implicit pricing complaints or indirect competitor comparisons.
For the full API reference — including rate limits, available proxy tiers, and response schemas — see the API documentation.
Takeaway
Developer forums are high-signal, underutilized intelligence sources. The technical barriers are manageable with the right approach:
| Source | Best Endpoint | Key Requirement |
|---|---|---|
| Hacker News | Algolia API (search_by_date) | Proxy rotation for bulk |
| Reddit | .json suffix on any URL | Residential proxies; 1–3s jitter |
| Stack Overflow | Stack Exchange API v2.3 | Anti-bot bypass for HTML fallback |
| Dev.to | Public REST API (/api/articles) | No special handling needed |
The pipeline compounds in value over time. Thirty days of daily crawls establishes a baseline. Ninety days reveals trend lines. Six months and you're seeing sentiment shifts in your category before they surface in support tickets or churn metrics — with enough lead time to act on them.