
Scrape Hacker News & Reddit for Market Intel

A practical guide to scraping Hacker News, Reddit, and developer forums for competitive market intelligence without triggering rate limits or bot detection.

Yash Dubey

March 30, 2026

9 min read

Developer forums are the highest-signal source of unfiltered market intelligence available. HN comment threads, Reddit /r/devops posts, and Stack Overflow questions tell you what engineers want, hate, and are willing to pay for — faster than any survey, and without the social desirability bias that distorts NPS scores. Here's how to extract that data systematically, at scale, without getting blocked.

What You're Actually Mining

Define your signal categories before writing any code. Raw post volume without a taxonomy is noise.

  • Competitor mentions — threads where users compare tools in your category
  • Pain point extraction — "I wish X would just…" phrasing, feature request threads, [Ask HN] posts
  • Pricing sensitivity — reactions to pricing changes, "too expensive" comments, tier comparisons
  • Technology momentum — what stacks, libraries, and patterns are gaining adoption in your space
  • Support gap analysis — questions that competitors' docs answer poorly, indicating switching opportunity

HN skews toward technical decision-makers and founders — strong signal for B2B and infrastructure products. Reddit's /r/devops, /r/sysadmin, and /r/programming are higher volume but noisier. Stack Overflow is the lowest volume but highest intent: someone asking "how do I migrate from X to Y" is announcing a purchase decision.
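As a starting point, the taxonomy can live in code as a keyword map. A rough sketch, where the keyword lists are illustrative starting points rather than a tuned set:

```python
# Illustrative keyword map: the categories mirror the taxonomy above,
# but the keyword lists here are rough starting points, not a tuned set.
SIGNAL_TAXONOMY = {
    "competitor_mention": ["alternative to", "switched from", " vs ", "compared to"],
    "pain_point": ["i wish", "why can't", "frustrating", "feature request"],
    "pricing_sensitivity": ["too expensive", "pricing", "cheaper", "free tier"],
    "tech_momentum": ["migrating to", "adopted", "rewrote in"],
    "support_gap": ["how do i", "docs don't", "can't figure out"],
}

def classify_post(text: str) -> list[str]:
    """Return every signal category whose keywords appear in the text."""
    lowered = text.lower()
    return [
        category
        for category, keywords in SIGNAL_TAXONOMY.items()
        if any(kw in lowered for kw in keywords)
    ]
```

A post can legitimately hit multiple categories (a pricing complaint that names a competitor, for instance), which is why the function returns a list rather than a single label.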

How the Pipeline Works

At a high level, each run follows the same path: a scheduler fires keyword searches against each platform's JSON endpoint, requests route through a proxy pool, responses are parsed and deduplicated, posts are classified into the signal categories above, and results land in storage with alerts on high-scoring matches. The rest of this guide walks through each platform-specific piece.

Technical Challenges by Platform

Each platform has a distinct anti-scraping posture. Choose your access method accordingly.

Reddit blocks datacenter IPs aggressively and often silently — you'll get a 200 OK with an empty children array instead of a proper 429. Stack Overflow runs behind Cloudflare and serves JS challenges to non-browser TLS fingerprints. Using anti-bot bypass for these platforms handles TLS normalization and challenge resolution automatically.


Scraping Hacker News

HN provides two official data endpoints. Use them before touching HTML.

  • Algolia Search API (hn.algolia.com/api/v1) — full-text search across all HN content, JSON, paginated, no auth required
  • Firebase REST API (hacker-news.firebaseio.com/v0) — fetch individual items and live feeds by ID

For market intelligence, the Algolia API is the right starting point. You can search by keyword, filter by content type (story, comment, ask_hn, show_hn), and sort by recency or relevance. The search_by_date endpoint is better for monitoring workflows than search — recency is more actionable than Algolia's relevance score for this use case.

Python
import alterlab
import json
from datetime import datetime, timedelta
from urllib.parse import quote

client = alterlab.Client("YOUR_API_KEY")

def search_hn(
    query: str,
    tags: str = "story",
    days_back: int = 30,
    limit: int = 100,
) -> list[dict]:
    """
    Search HN via Algolia. Returns structured post data.
    tags: 'story' | 'comment' | 'ask_hn' | 'show_hn'
    """
    # Naive now() converts to epoch correctly; utcnow().timestamp() would
    # double-shift on machines whose local timezone isn't UTC.
    since = int((datetime.now() - timedelta(days=days_back)).timestamp())
    url = (
        f"https://hn.algolia.com/api/v1/search_by_date"
        f"?query={quote(query)}"
        f"&tags={tags}"
        f"&numericFilters=created_at_i>{since}"
        f"&hitsPerPage={limit}"
    )

    resp = client.scrape(url, {"render_js": False})
    data = json.loads(resp.text)

    return [
        {
            "id": hit["objectID"],
            "title": hit.get("title") or hit.get("comment_text", "")[:120],
            "url": hit.get("url") or f"https://news.ycombinator.com/item?id={hit['objectID']}",
            "points": hit.get("points", 0),
            "num_comments": hit.get("num_comments", 0),
            "author": hit.get("author"),
            "created_at": hit["created_at"],
        }
        for hit in data.get("hits", [])
    ]


if __name__ == "__main__":
    competitors = ["ScrapingBee", "Apify", "Bright Data", "Zyte"]

    for competitor in competitors:
        results = search_hn(competitor, tags="comment", days_back=90)
        top = sorted(results, key=lambda x: x["points"] or 0, reverse=True)[:3]
        print(f"\n{competitor}: {len(results)} mentions in 90 days")
        for r in top:
            print(f"  [{r['points'] or 0:>4}pts] {r['title'][:75]}")
            print(f"           {r['url']}")

Two implementation notes:

  • Comments don't have a title field — truncate comment_text instead, as the code above does.
  • The Algolia free tier allows roughly 10k requests per day per IP: sufficient for monitoring, insufficient for historical backfills. Route bulk requests through a proxy pool to avoid exhausting the quota from a single source IP.
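One way to run a backfill without hammering the quota is to slice history into fixed time windows and issue one search per window, spreading the windows across the proxy pool. A sketch of the window generation (`backfill_windows` is a helper introduced here; the 7-day window size is an arbitrary choice):

```python
from datetime import datetime, timedelta, timezone

def backfill_windows(days: int, window_days: int = 7) -> list[tuple[int, int]]:
    """Slice the last `days` days into contiguous (start, end) epoch-second
    windows, oldest first, sized for one Algolia query each."""
    now = datetime.now(timezone.utc)
    cursor = now - timedelta(days=days)
    windows = []
    while cursor < now:
        end = min(cursor + timedelta(days=window_days), now)
        windows.append((int(cursor.timestamp()), int(end.timestamp())))
        cursor = end
    return windows

# Each window becomes one numericFilters expression, e.g.:
#   numericFilters=created_at_i>START,created_at_i<END
filters = [
    f"created_at_i>{start},created_at_i<{end}"
    for start, end in backfill_windows(days=90)
]
```

Windowing also keeps each response small, since no single request has to page through months of history.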

To fetch the full comment thread for a specific story and extract discussion-level sentiment, use the Firebase API:

Python
def fetch_story_comments(story_id: int, max_top_level: int = 50) -> list[dict]:
    """Fetch top-level comments for an HN story via Firebase REST API."""
    url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
    resp = client.scrape(url, {"render_js": False})
    story = json.loads(resp.text)

    comment_ids = story.get("kids", [])[:max_top_level]
    comments = []

    for cid in comment_ids:
        c_resp = client.scrape(
            f"https://hacker-news.firebaseio.com/v0/item/{cid}.json",
            {"render_js": False},
        )
        comment = json.loads(c_resp.text)
        if not comment or comment.get("deleted") or comment.get("dead"):
            continue
        comments.append({
            "id": cid,
            "text": comment.get("text", ""),
            "author": comment.get("by"),
            "score": comment.get("score", 0),  # Firebase doesn't expose scores on comments; this stays 0
            "time": comment.get("time"),
        })

    return comments

Scraping Reddit

Reddit's situation is more complex. The official API now requires OAuth and charges for commercial-scale access (after the 2023 API pricing changes). The JSON endpoint still works for read-only access and requires no authentication — it's the right tool for this use case.

The .json trick: append .json to any Reddit URL and you receive the raw data structure the frontend renders from. https://www.reddit.com/r/devops/search.json?q=kubernetes&restrict_sr=1 returns the same posts as the page, without touching a line of JavaScript.

The equivalent cURL call through AlterLab's API:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.reddit.com/r/devops/search.json?q=web+scraping&restrict_sr=1&sort=new&t=month&limit=25",
    "render_js": false,
    "premium_proxy": true
  }'

For a full multi-subreddit sweep in Python:

Python
import alterlab
import json
import time
from urllib.parse import quote

client = alterlab.Client("YOUR_API_KEY")

SUBREDDITS = ["devops", "programming", "Python", "sysadmin", "webdev", "MachineLearning"]


def search_subreddit(
    subreddit: str,
    query: str,
    sort: str = "new",
    limit: int = 25,
    time_filter: str = "month",
) -> list[dict]:
    """
    Query Reddit's search JSON endpoint for a specific subreddit.
    sort:        'new' | 'relevance' | 'top' | 'comments'
    time_filter: 'hour' | 'day' | 'week' | 'month' | 'year' | 'all'
    """
    url = (
        f"https://www.reddit.com/r/{subreddit}/search.json"
        f"?q={quote(query)}&restrict_sr=1"
        f"&sort={sort}&t={time_filter}&limit={limit}"
    )

    resp = client.scrape(url, {
        "render_js": False,
        "premium_proxy": True,  # Required: Reddit blocks datacenter IPs
    })

    data = json.loads(resp.text)
    posts = data.get("data", {}).get("children", [])

    return [
        {
            "title": p["data"]["title"],
            "score": p["data"]["score"],
            "upvote_ratio": p["data"]["upvote_ratio"],
            "num_comments": p["data"]["num_comments"],
            "permalink": f"https://reddit.com{p['data']['permalink']}",
            "selftext": p["data"].get("selftext", "")[:800],
            "created_utc": p["data"]["created_utc"],
            "subreddit": subreddit,
        }
        for p in posts
    ]


def collect_signals(keywords: list[str]) -> list[dict]:
    """Cross-subreddit keyword sweep with deduplication."""
    all_posts = []

    for sub in SUBREDDITS:
        for keyword in keywords:
            posts = search_subreddit(sub, keyword, sort="new", time_filter="month")
            all_posts.extend(posts)
            time.sleep(1.0 + time.time() % 2)  # 1–3s jitter between requests — critical for staying under rate limits

    seen, unique = set(), []
    for p in all_posts:
        if p["permalink"] not in seen:
            seen.add(p["permalink"])
            unique.append(p)

    return sorted(unique, key=lambda x: x["score"], reverse=True)


if __name__ == "__main__":
    signals = collect_signals(["web scraping API", "proxy rotation", "anti-bot bypass"])
    print(f"Collected {len(signals)} unique posts\n")
    for s in signals[:15]:
        print(f"[{s['score']:>5}↑] [{s['subreddit']:<15}] {s['title'][:65]}")

The premium_proxy: true flag is non-negotiable for Reddit. Without it, you'll receive silent failures — 200 OK responses with empty children arrays — from datacenter IP blocks. See the Python SDK reference for the full options schema including header injection and session persistence.

Fetching Comment Data

Post scores tell you signal volume. Comment bodies tell you signal content. Append .json to any post permalink:

Python
def fetch_post_comments(permalink: str, depth: int = 2, limit: int = 50) -> list[dict]:
    """
    Fetch top-level comments from a Reddit post.
    permalink: relative path, e.g. '/r/devops/comments/abc123/post_title/'
    """
    # Strip the trailing slash so the .json suffix attaches cleanly
    url = f"https://www.reddit.com{permalink.rstrip('/')}.json?depth={depth}&limit={limit}"
    resp = client.scrape(url, {"render_js": False, "premium_proxy": True})

    data = json.loads(resp.text)
    # Reddit returns a 2-element list: [0] post metadata, [1] comment listing
    if not isinstance(data, list) or len(data) < 2:
        return []

    comment_listing = data[1]["data"]["children"]
    comments = []

    for c in comment_listing:
        if c["kind"] != "t1":  # Skip "more comments" load-more objects
            continue
        body = c["data"].get("body", "")
        if body in ("[deleted]", "[removed]"):
            continue
        comments.append({
            "author": c["data"].get("author"),
            "body": body,
            "score": c["data"].get("score", 0),
            "created_utc": c["data"]["created_utc"],
        })

    return sorted(comments, key=lambda x: x["score"], reverse=True)

Anti-Detection Patterns That Actually Matter

Most scrapers fail not because of CAPTCHAs — they fail because of subtler fingerprinting.

User-Agent consistency: Use a realistic, fixed UA string across all requests to the same domain. Rotating UAs on every request increases suspicion; real browsers don't change UA mid-session.

Burst suppression: Burst patterns (10 requests in 500ms) trigger rate limiters even within hourly quotas. The time.sleep(1.5) in the Reddit example above is not courtesy — it's a functional requirement. Respect Retry-After headers unconditionally.

Silent block detection: Reddit and some forum platforms return 200 OK with empty data instead of 429. Always validate that children or hits arrays are non-empty, and implement exponential backoff on empty responses.
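A small retry wrapper makes that pattern concrete. This sketch assumes a `fetch` callable that returns parsed JSON (for instance, a closure around the `client.scrape` call shown earlier) and treats an empty `children` array as a silent block:

```python
import time

def fetch_with_backoff(fetch, is_empty, max_retries: int = 4, base_delay: float = 2.0):
    """Retry `fetch()` with exponential backoff while `is_empty` flags the
    parsed response as a silent block (200 OK, no data)."""
    for attempt in range(max_retries):
        data = fetch()
        if not is_empty(data):
            return data
        time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    raise RuntimeError("Persistent empty responses; the IP is likely blocked")

def reddit_listing_empty(data) -> bool:
    """True when a Reddit listing came back with no children: the silent-block signature."""
    return not data.get("data", {}).get("children", [])
```

Once the final retry fails, rotating to a fresh proxy is usually more productive than continuing to back off from the same IP.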

TLS fingerprinting: Platforms running Cloudflare (Stack Overflow, many niche forums) inspect TLS handshake parameters to detect non-browser clients. AlterLab's anti-bot bypass normalizes TLS fingerprints to match real browser profiles — no configuration required on your end.

JSON over HTML, always: JSON endpoints don't execute JavaScript-based bot detection. They're faster, more stable across UI changes, and produce cleaner data. Fall back to HTML scraping only when no JSON endpoint exists.

  • ~1,000 HN submissions/day
  • 52M+ Reddit DAUs
  • 60 req/min Reddit unauthenticated limit
  • 10k req/day HN Algolia free tier

Building a Repeatable Pipeline

A one-off scrape is an audit. A scheduled pipeline is market intelligence. The minimum viable architecture:

  1. Scheduler — cron job or an Airflow DAG, hourly for Reddit, daily for HN historical sweeps
  2. Deduplication — SHA-256 hash of post URL, skip already-processed items before fetching comments
  3. Classification — keyword match on ["pricing", "cost", "expensive", "alternative to", "migrating from"] gets you 80% accuracy with zero latency; feed post bodies to a lightweight LLM for the remaining ambiguous cases
  4. Storage — PostgreSQL with a forum_posts table: (id, source, url, title, body, score, created_at, signal_category, processed_at)
  5. Alerting — Slack webhook triggered when a post crosses a score threshold (e.g., >150 upvotes) and matches a competitor keyword — these are the posts worth reading same-day
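Steps 2 and 5 fit in a few lines. The field names below match the Reddit collector earlier in this guide; the 150-point threshold is just the example value:

```python
import hashlib

seen: set[str] = set()

def is_new(post: dict) -> bool:
    """Step 2: SHA-256 dedup on the post URL; skip anything already processed."""
    key = hashlib.sha256(post["permalink"].encode("utf-8")).hexdigest()
    if key in seen:
        return False
    seen.add(key)
    return True

def should_alert(post: dict, competitors: list[str], threshold: int = 150) -> bool:
    """Step 5: flag posts that cross the score threshold AND mention a competitor."""
    text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
    return post["score"] > threshold and any(c.lower() in text for c in competitors)
```

In production the in-process `seen` set would be replaced by a check against the processed_at column in the forum_posts table.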

Classification quality compounds with volume. After 90 days of labeled data, you'll have enough signal to train a fine-tuned classifier that outperforms keyword matching on nuanced cases like implicit pricing complaints or indirect competitor comparisons.

For the full API reference — including rate limits, available proxy tiers, and response schemas — see the API documentation.


Takeaway

Developer forums are high-signal, underutilized intelligence sources. The technical barriers are manageable with the right approach:

Source           Best Endpoint                        Key Requirement
Hacker News      Algolia API (search_by_date)         Proxy rotation for bulk
Reddit           .json suffix on any URL              Residential proxies; 1–3s jitter
Stack Overflow   Stack Exchange API v2.3              Anti-bot bypass for HTML fallback
Dev.to           Public REST API (/api/articles)      No special handling needed

The pipeline compounds in value over time. Thirty days of daily crawls establishes a baseline. Ninety days reveals trend lines. Six months and you're seeing sentiment shifts in your category before they surface in support tickets or churn metrics — with enough lead time to act on them.


Frequently Asked Questions

How do you scrape Reddit without getting blocked?
Use Reddit's `.json` endpoint (append `.json` to any Reddit URL) and route requests through residential proxies. Reddit aggressively blocks datacenter IP ranges; residential proxies make traffic indistinguishable from real users. A 1–3 second jitter between requests keeps you well under rate limit thresholds.

Is it legal to scrape Hacker News?
HN's terms of service don't prohibit scraping, and Y Combinator provides an official Firebase REST API and an Algolia search API as preferred access methods. For bulk historical data, exhaust these official endpoints before falling back to HTML scraping. IP rate limiting still applies regardless of legal access, so proxy rotation remains necessary at scale.

What are Reddit's rate limits for unauthenticated scraping?
Reddit enforces approximately 60 requests per minute on unauthenticated JSON access. HTML scraping triggers blocks faster — typically after 10–20 rapid requests from the same IP. Burst patterns are the primary trigger; distributing requests with 1–3 seconds of jitter between calls keeps you safely within the limit even during large crawls.