
Scrape Hacker News & Reddit for Market Intel

A practical guide to scraping Hacker News, Reddit, and developer forums for competitive market intelligence without triggering rate limits or bot detection.

Yash Dubey

March 30, 2026

9 min read

Developer forums are the highest-signal source of unfiltered market intelligence available. HN comment threads, Reddit /r/devops posts, and Stack Overflow questions tell you what engineers want, hate, and are willing to pay for — faster than any survey, and without the social desirability bias that distorts NPS scores. Here's how to extract that data systematically, at scale, without getting blocked.

What You're Actually Mining

Define your signal categories before writing any code. Raw post volume without a taxonomy is noise.

  • Competitor mentions — threads where users compare tools in your category
  • Pain point extraction — "I wish X would just…" phrasing, feature request threads, [Ask HN] posts
  • Pricing sensitivity — reactions to pricing changes, "too expensive" comments, tier comparisons
  • Technology momentum — what stacks, libraries, and patterns are gaining adoption in your space
  • Support gap analysis — questions that competitors' docs answer poorly, indicating switching opportunity

HN skews toward technical decision-makers and founders — strong signal for B2B and infrastructure products. Reddit's /r/devops, /r/sysadmin, and /r/programming are higher volume but noisier. Stack Overflow is the lowest volume but highest intent: someone asking "how do I migrate from X to Y" is announcing a purchase decision.
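As a starting point, the taxonomy can live in code as a keyword map. A rough sketch, where the keyword lists are illustrative starting points rather than a tuned set:

```python
# Illustrative keyword map: the categories mirror the taxonomy above,
# but the keyword lists here are rough starting points, not a tuned set.
SIGNAL_TAXONOMY = {
    "competitor_mention": ["alternative to", "switched from", " vs ", "compared to"],
    "pain_point": ["i wish", "why can't", "frustrating", "feature request"],
    "pricing_sensitivity": ["too expensive", "pricing", "cheaper", "free tier"],
    "tech_momentum": ["migrating to", "adopted", "rewrote in"],
    "support_gap": ["how do i", "docs don't", "can't figure out"],
}

def classify_post(text: str) -> list[str]:
    """Return every signal category whose keywords appear in the text."""
    lowered = text.lower()
    return [
        category
        for category, keywords in SIGNAL_TAXONOMY.items()
        if any(kw in lowered for kw in keywords)
    ]
```

A post can legitimately hit multiple categories (a pricing complaint that names a competitor, for instance), which is why the function returns a list rather than a single label.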

How the Pipeline Works

At a high level, each run follows the same path: a scheduler fires keyword searches against each platform's JSON endpoint, requests route through a proxy pool, responses are parsed and deduplicated, posts are classified into the signal categories above, and results land in storage with alerts on high-scoring matches. The rest of this guide walks through each platform-specific piece.

Technical Challenges by Platform

Each platform has a distinct anti-scraping posture. Choose your access method accordingly.

Reddit blocks datacenter IPs aggressively and often silently — you'll get a 200 OK with an empty children array instead of a proper 429. Stack Overflow runs behind Cloudflare and serves JS challenges to non-browser TLS fingerprints. Using anti-bot bypass for these platforms handles TLS normalization and challenge resolution automatically.


Scraping Hacker News

HN provides two official data endpoints. Use them before touching HTML.

  • Algolia Search API (hn.algolia.com/api/v1) — full-text search across all HN content, JSON, paginated, no auth required
  • Firebase REST API (hacker-news.firebaseio.com/v0) — fetch individual items and live feeds by ID

For market intelligence, the Algolia API is the right starting point. You can search by keyword, filter by content type (story, comment, ask_hn, show_hn), and sort by recency or relevance. The search_by_date endpoint is better for monitoring workflows than search — recency is more actionable than Algolia's relevance score for this use case.

Python
import alterlab
import json
from datetime import datetime, timedelta
from urllib.parse import quote

client = alterlab.Client("YOUR_API_KEY")

def search_hn(
    query: str,
    tags: str = "story",
    days_back: int = 30,
    limit: int = 100,
) -> list[dict]:
    """
    Search HN via Algolia. Returns structured post data.
    tags: 'story' | 'comment' | 'ask_hn' | 'show_hn'
    """
    # Naive now() converts to epoch correctly; utcnow().timestamp() would
    # double-shift on machines whose local timezone isn't UTC.
    since = int((datetime.now() - timedelta(days=days_back)).timestamp())
    url = (
        f"https://hn.algolia.com/api/v1/search_by_date"
        f"?query={quote(query)}"
        f"&tags={tags}"
        f"&numericFilters=created_at_i>{since}"
        f"&hitsPerPage={limit}"
    )

    resp = client.scrape(url, {"render_js": False})
    data = json.loads(resp.text)

    return [
        {
            "id": hit["objectID"],
            "title": hit.get("title") or hit.get("comment_text", "")[:120],
            "url": hit.get("url") or f"https://news.ycombinator.com/item?id={hit['objectID']}",
            "points": hit.get("points", 0),
            "num_comments": hit.get("num_comments", 0),
            "author": hit.get("author"),
            "created_at": hit["created_at"],
        }
        for hit in data.get("hits", [])
    ]


if __name__ == "__main__":
    competitors = ["ScrapingBee", "Apify", "Bright Data", "Zyte"]

    for competitor in competitors:
        results = search_hn(competitor, tags="comment", days_back=90)
        top = sorted(results, key=lambda x: x["points"] or 0, reverse=True)[:3]
        print(f"\n{competitor}: {len(results)} mentions in 90 days")
        for r in top:
            print(f"  [{r['points'] or 0:>4}pts] {r['title'][:75]}")
            print(f"           {r['url']}")

Two implementation notes:

  • Comments don't have a title field — truncate comment_text instead, as the code above does.
  • The Algolia free tier allows roughly 10k requests per day per IP: sufficient for monitoring, insufficient for historical backfills. Route bulk requests through a proxy pool to avoid exhausting the quota from a single source IP.
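One way to run a backfill without hammering the quota is to slice history into fixed time windows and issue one search per window, spreading the windows across the proxy pool. A sketch of the window generation (`backfill_windows` is a helper introduced here; the 7-day window size is an arbitrary choice):

```python
from datetime import datetime, timedelta, timezone

def backfill_windows(days: int, window_days: int = 7) -> list[tuple[int, int]]:
    """Slice the last `days` days into contiguous (start, end) epoch-second
    windows, oldest first, sized for one Algolia query each."""
    now = datetime.now(timezone.utc)
    cursor = now - timedelta(days=days)
    windows = []
    while cursor < now:
        end = min(cursor + timedelta(days=window_days), now)
        windows.append((int(cursor.timestamp()), int(end.timestamp())))
        cursor = end
    return windows

# Each window becomes one numericFilters expression, e.g.:
#   numericFilters=created_at_i>START,created_at_i<END
filters = [
    f"created_at_i>{start},created_at_i<{end}"
    for start, end in backfill_windows(days=90)
]
```

Windowing also keeps each response small, since no single request has to page through months of history.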

To fetch the full comment thread for a specific story and extract discussion-level sentiment, use the Firebase API:

Python
def fetch_story_comments(story_id: int, max_top_level: int = 50) -> list[dict]:
    """Fetch top-level comments for an HN story via Firebase REST API."""
    url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
    resp = client.scrape(url, {"render_js": False})
    story = json.loads(resp.text)

    comment_ids = story.get("kids", [])[:max_top_level]
    comments = []

    for cid in comment_ids:
        c_resp = client.scrape(
            f"https://hacker-news.firebaseio.com/v0/item/{cid}.json",
            {"render_js": False},
        )
        comment = json.loads(c_resp.text)
        if not comment or comment.get("deleted") or comment.get("dead"):
            continue
        comments.append({
            "id": cid,
            "text": comment.get("text", ""),
            "author": comment.get("by"),
            "score": comment.get("score", 0),  # Firebase doesn't expose scores on comments; this stays 0
            "time": comment.get("time"),
        })

    return comments

Scraping Reddit

Reddit's situation is more complex. The official API now requires OAuth and charges for commercial-scale access (after the 2023 API pricing changes). The JSON endpoint still works for read-only access and requires no authentication — it's the right tool for this use case.

The .json trick: append .json to any Reddit URL and you receive the raw data structure the frontend renders from. https://www.reddit.com/r/devops/search.json?q=kubernetes&restrict_sr=1 returns the same posts as the page, without touching a line of JavaScript.

The equivalent cURL call through AlterLab's API:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.reddit.com/r/devops/search.json?q=web+scraping&restrict_sr=1&sort=new&t=month&limit=25",
    "render_js": false,
    "premium_proxy": true
  }'

For a full multi-subreddit sweep in Python:

Python
import alterlab
import json
import time
from urllib.parse import quote

client = alterlab.Client("YOUR_API_KEY")

SUBREDDITS = ["devops", "programming", "Python", "sysadmin", "webdev", "MachineLearning"]


def search_subreddit(
    subreddit: str,
    query: str,
    sort: str = "new",
    limit: int = 25,
    time_filter: str = "month",
) -> list[dict]:
    """
    Query Reddit's search JSON endpoint for a specific subreddit.
    sort:        'new' | 'relevance' | 'top' | 'comments'
    time_filter: 'hour' | 'day' | 'week' | 'month' | 'year' | 'all'
    """
    url = (
        f"https://www.reddit.com/r/{subreddit}/search.json"
        f"?q={quote(query)}&restrict_sr=1"
        f"&sort={sort}&t={time_filter}&limit={limit}"
    )

    resp = client.scrape(url, {
        "render_js": False,
        "premium_proxy": True,  # Required: Reddit blocks datacenter IPs
    })

    data = json.loads(resp.text)
    posts = data.get("data", {}).get("children", [])

    return [
        {
            "title": p["data"]["title"],
            "score": p["data"]["score"],
            "upvote_ratio": p["data"]["upvote_ratio"],
            "num_comments": p["data"]["num_comments"],
            "permalink": f"https://reddit.com{p['data']['permalink']}",
            "selftext": p["data"].get("selftext", "")[:800],
            "created_utc": p["data"]["created_utc"],
            "subreddit": subreddit,
        }
        for p in posts
    ]


def collect_signals(keywords: list[str]) -> list[dict]:
    """Cross-subreddit keyword sweep with deduplication."""
    all_posts = []

    for sub in SUBREDDITS:
        for keyword in keywords:
            posts = search_subreddit(sub, keyword, sort="new", time_filter="month")
            all_posts.extend(posts)
            time.sleep(1.0 + time.time() % 2)  # 1–3s jitter between requests — critical for staying under rate limits

    seen, unique = set(), []
    for p in all_posts:
        if p["permalink"] not in seen:
            seen.add(p["permalink"])
            unique.append(p)

    return sorted(unique, key=lambda x: x["score"], reverse=True)


if __name__ == "__main__":
    signals = collect_signals(["web scraping API", "proxy rotation", "anti-bot bypass"])
    print(f"Collected {len(signals)} unique posts\n")
    for s in signals[:15]:
        print(f"[{s['score']:>5}↑] [{s['subreddit']:<15}] {s['title'][:65]}")

The premium_proxy: true flag is non-negotiable for Reddit. Without it, you'll receive silent failures — 200 OK responses with empty children arrays — from datacenter IP blocks. See the Python SDK reference for the full options schema including header injection and session persistence.

Fetching Comment Data

Post scores tell you signal volume. Comment bodies tell you signal content. Append .json to any post permalink:

Python
def fetch_post_comments(permalink: str, depth: int = 2, limit: int = 50) -> list[dict]:
    """
    Fetch top-level comments from a Reddit post.
    permalink: relative path, e.g. '/r/devops/comments/abc123/post_title/'
    """
    # Strip the trailing slash so the .json suffix attaches cleanly
    url = f"https://www.reddit.com{permalink.rstrip('/')}.json?depth={depth}&limit={limit}"
    resp = client.scrape(url, {"render_js": False, "premium_proxy": True})

    data = json.loads(resp.text)
    # Reddit returns a 2-element list: [0] post metadata, [1] comment listing
    if not isinstance(data, list) or len(data) < 2:
        return []

    comment_listing = data[1]["data"]["children"]
    comments = []

    for c in comment_listing:
        if c["kind"] != "t1":  # Skip "more comments" load-more objects
            continue
        body = c["data"].get("body", "")
        if body in ("[deleted]", "[removed]"):
            continue
        comments.append({
            "author": c["data"].get("author"),
            "body": body,
            "score": c["data"].get("score", 0),
            "created_utc": c["data"]["created_utc"],
        })

    return sorted(comments, key=lambda x: x["score"], reverse=True)

Anti-Detection Patterns That Actually Matter

Most scrapers fail not because of CAPTCHAs — they fail because of subtler fingerprinting.

User-Agent consistency: Use a realistic, fixed UA string across all requests to the same domain. Rotating UAs on every request increases suspicion; real browsers don't change UA mid-session.

Burst suppression: Burst patterns (10 requests in 500ms) trigger rate limiters even within hourly quotas. The time.sleep(1.5) in the Reddit example above is not courtesy — it's a functional requirement. Respect Retry-After headers unconditionally.

Silent block detection: Reddit and some forum platforms return 200 OK with empty data instead of 429. Always validate that children or hits arrays are non-empty, and implement exponential backoff on empty responses.
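A small retry wrapper makes that pattern concrete. This sketch assumes a `fetch` callable that returns parsed JSON (for instance, a closure around the `client.scrape` call shown earlier) and treats an empty `children` array as a silent block:

```python
import time

def fetch_with_backoff(fetch, is_empty, max_retries: int = 4, base_delay: float = 2.0):
    """Retry `fetch()` with exponential backoff while `is_empty` flags the
    parsed response as a silent block (200 OK, no data)."""
    for attempt in range(max_retries):
        data = fetch()
        if not is_empty(data):
            return data
        time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    raise RuntimeError("Persistent empty responses; the IP is likely blocked")

def reddit_listing_empty(data) -> bool:
    """True when a Reddit listing came back with no children: the silent-block signature."""
    return not data.get("data", {}).get("children", [])
```

Once the final retry fails, rotating to a fresh proxy is usually more productive than continuing to back off from the same IP.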

TLS fingerprinting: Platforms running Cloudflare (Stack Overflow, many niche forums) inspect TLS handshake parameters to detect non-browser clients. AlterLab's anti-bot bypass normalizes TLS fingerprints to match real browser profiles — no configuration required on your end.

JSON over HTML, always: JSON endpoints don't execute JavaScript-based bot detection. They're faster, more stable across UI changes, and produce cleaner data. Fall back to HTML scraping only when no JSON endpoint exists.

  • ~1,000 HN submissions/day
  • 52M+ Reddit DAUs
  • 60 req/min Reddit unauthenticated limit
  • 10k req/day HN Algolia free tier

Building a Repeatable Pipeline

A one-off scrape is an audit. A scheduled pipeline is market intelligence. The minimum viable architecture:

  1. Scheduler — cron job or an Airflow DAG, hourly for Reddit, daily for HN historical sweeps
  2. Deduplication — SHA-256 hash of post URL, skip already-processed items before fetching comments
  3. Classification — keyword match on ["pricing", "cost", "expensive", "alternative to", "migrating from"] gets you 80% accuracy with zero latency; feed post bodies to a lightweight LLM for the remaining ambiguous cases
  4. Storage — PostgreSQL with a forum_posts table: (id, source, url, title, body, score, created_at, signal_category, processed_at)
  5. Alerting — Slack webhook triggered when a post crosses a score threshold (e.g., >150 upvotes) and matches a competitor keyword — these are the posts worth reading same-day
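Steps 2 and 5 fit in a few lines. The field names below match the Reddit collector earlier in this guide; the 150-point threshold is just the example value:

```python
import hashlib

seen: set[str] = set()

def is_new(post: dict) -> bool:
    """Step 2: SHA-256 dedup on the post URL; skip anything already processed."""
    key = hashlib.sha256(post["permalink"].encode("utf-8")).hexdigest()
    if key in seen:
        return False
    seen.add(key)
    return True

def should_alert(post: dict, competitors: list[str], threshold: int = 150) -> bool:
    """Step 5: flag posts that cross the score threshold AND mention a competitor."""
    text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
    return post["score"] > threshold and any(c.lower() in text for c in competitors)
```

In production the in-process `seen` set would be replaced by a check against the processed_at column in the forum_posts table.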

Classification quality compounds with volume. After 90 days of labeled data, you'll have enough signal to train a fine-tuned classifier that outperforms keyword matching on nuanced cases like implicit pricing complaints or indirect competitor comparisons.

For the full API reference — including rate limits, available proxy tiers, and response schemas — see the API documentation.


Takeaway

Developer forums are high-signal, underutilized intelligence sources. The technical barriers are manageable with the right approach:

Source           Best Endpoint                        Key Requirement
Hacker News      Algolia API (search_by_date)         Proxy rotation for bulk
Reddit           .json suffix on any URL              Residential proxies; 1–3s jitter
Stack Overflow   Stack Exchange API v2.3              Anti-bot bypass for HTML fallback
Dev.to           Public REST API (/api/articles)      No special handling needed

The pipeline compounds in value over time. Thirty days of daily crawls establishes a baseline. Ninety days reveals trend lines. Six months and you're seeing sentiment shifts in your category before they surface in support tickets or churn metrics — with enough lead time to act on them.


Frequently Asked Questions

How do you scrape Reddit without getting blocked?
Use Reddit's `.json` endpoint (append `.json` to any Reddit URL) and route requests through residential proxies. Reddit aggressively blocks datacenter IP ranges; residential proxies make traffic indistinguishable from real users. A 1–3 second jitter between requests keeps you well under rate limit thresholds.

Is it legal to scrape Hacker News?
HN's terms of service don't prohibit scraping, and Y Combinator provides an official Firebase REST API and an Algolia search API as preferred access methods. For bulk historical data, exhaust these official endpoints before falling back to HTML scraping. IP rate limiting still applies regardless of legal access, so proxy rotation remains necessary at scale.

What are Reddit's rate limits for unauthenticated scraping?
Reddit enforces approximately 60 requests per minute on unauthenticated JSON access. HTML scraping triggers blocks faster — typically after 10–20 rapid requests from the same IP. Burst patterns are the primary trigger; distributing requests with 1–3 seconds of jitter between calls keeps you safely within the limit even during large crawls.