How to Scrape Hacker News Data: Complete Guide for 2026

Learn to scrape Hacker News with Python and Node.js using AlterLab's API. Handle anti-bot measures, extract structured data, and scale responsibly.

This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.

TL;DR

Scrape Hacker News using AlterLab's API with Python or Node.js. Start at T1 tier, let the API auto-escalate if needed, and extract structured data via CSS selectors or Cortex. Respect rate limits and robots.txt.

Why collect tech data from Hacker News?

Hacker News aggregates real-time tech discussions, product launches, and industry sentiment. Practical use cases include:

  • Tracking startup funding announcements and job postings for market research
  • Monitoring technology trends by analyzing upvote patterns on specific topics
  • Building competitor intelligence feeds by scraping links to rival products

Technical challenges

Hacker News implements standard anti-bot protections: rate limiting by IP, User-Agent header validation, and occasional JavaScript challenges for suspicious traffic. Raw HTTP requests (curl/urllib) frequently receive 429 or 403 responses. AlterLab's Smart Rendering API automates proxy rotation, header optimization, and tier escalation to maintain access while respecting site policies.

Quick start with AlterLab API

Begin with our Getting started guide. Here's how to fetch the Hacker News front page:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape("https://news.ycombinator.com")
print(response.text[:500])  # First 500 chars of HTML
JavaScript
import { AlterLab } from "@alterlab/sdk";

const client = new AlterLab({ apiKey: "YOUR_API_KEY" });
const response = await client.scrape("https://news.ycombinator.com");
console.log(response.text.slice(0, 500));
Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.ycombinator.com"}'

Extracting structured data

Target these common elements using CSS selectors:

  • Story titles: .titleline > a
  • Scores: .score
  • Author names: .hnuser
  • Comment counts: .subline > a:last-child (the .age span holds only the timestamp link)

Example Python extraction:

Python
import alterlab
from parsel import Selector

client = alterlab.Client("YOUR_API_KEY")
html = client.scrape("https://news.ycombinator.com").text
selector = Selector(text=html)

titles = selector.css(".titleline > a::text").getall()
print(f"Found {len(titles)} stories")
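
Selecting each field separately returns parallel lists, and those lists drift out of alignment when a row (a job post, for example) has no score or author. The sketch below groups fields per story instead. It swaps parsel for the standard library's html.parser so it runs without dependencies, and it parses a hand-written sample: the markup (a tr.athing row followed by a subtext row) is an assumption about the page structure, not captured output, and StoryParser and parse_stories are hypothetical names.

```python
from html.parser import HTMLParser

# Hand-written sample mimicking the assumed front-page markup (not real data).
SAMPLE_HTML = """
<table>
<tr class="athing" id="101"><td><span class="titleline">
  <a href="https://example.com/a">First story</a>
  <span class="sitebit">(<a href="from?site=example.com">example.com</a>)</span>
</span></td></tr>
<tr><td class="subtext"><span class="subline">
  <span class="score">120 points</span> by <a class="hnuser" href="user?id=alice">alice</a>
</span></td></tr>
<tr class="athing" id="102"><td><span class="titleline">
  <a href="https://example.com/jobs">Hiring post</a>
</span></td></tr>
<tr><td class="subtext">1 hour ago</td></tr>
</table>
"""


class StoryParser(HTMLParser):
    """Collect one dict per story row so fields stay grouped together."""

    def __init__(self):
        super().__init__()
        self.stories = []
        self._field = None             # field currently receiving text
        self._want_title_link = False  # True until the first link in .titleline

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "tr" and "athing" in classes:
            # A new story starts; missing fields stay None.
            self.stories.append({"id": attrs.get("id"), "title": "",
                                 "url": None, "score": None, "author": None})
        elif not self.stories:
            return
        elif tag == "span" and "titleline" in classes:
            self._want_title_link = True
        elif tag == "a" and self._want_title_link:
            self._want_title_link = False
            self.stories[-1]["url"] = attrs.get("href")
            self._field = "title"
        elif tag == "span" and "score" in classes:
            self._field = "score"
        elif tag == "a" and "hnuser" in classes:
            self._field = "author"

    def handle_data(self, data):
        if self._field is None:
            return
        story = self.stories[-1]
        if self._field == "score":
            parts = data.split()  # "120 points" -> ["120", "points"]
            if parts and parts[0].isdigit():
                story["score"] = int(parts[0])
        elif self._field == "author":
            story["author"] = data.strip()
        else:
            story["title"] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None


def parse_stories(html):
    parser = StoryParser()
    parser.feed(html)
    return parser.stories


for story in parse_stories(SAMPLE_HTML):
    print(story)
```

The second sample row has no score or author, so those keys stay None rather than shifting every later value up by one.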

Structured JSON extraction with Cortex

For typed output without manual parsing, use Cortex AI extraction. Define a schema for story objects:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
result = client.extract(
    url="https://news.ycombinator.com",
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "url": {"type": "string", "format": "uri"},
                "score": {"type": "integer"},
                "author": {"type": "string"}
            },
            "required": ["title", "url"]
        }
    }
)
print(result.data)  # List of validated story objects

Cost breakdown

Hacker News typically requires T2 (standard headers) or T3 (stealth) tiers due to anti-bot measures. AlterLab auto-escalates: start at T1, pay only for the tier that succeeds.

Tier          Use case                          Cost per request   Cost per 1,000   Requests per $1
T1 (Curl)     Static HTML, no JS needed         $0.0002            $0.20            5,000
T2 (HTTP)     Standard pages with headers       $0.0003            $0.30            3,333
T3 (Stealth)  Protected pages, anti-bot active  $0.002             $2.00            500
T4 (Browser)  Full JS rendering required        $0.004             $4.00            250
T5 (CAPTCHA)  CAPTCHA solving + JS rendering    $0.02              $20.00           50

See AlterLab pricing for volume discounts. For most Hacker News scraping, expect $0.30-$2.00 per 1,000 requests.
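
To budget a job, the table's arithmetic can be reproduced in a few lines. This is a rough sketch: the per-request prices are copied from the table above, while the 80/20 split between T2 and T3 is a made-up mix for illustration, and the function names are hypothetical. Decimal is used so the per-1,000 figures come out exact.

```python
from decimal import Decimal

# Per-request prices copied from the tier table above.
PRICE_PER_REQUEST = {
    "T1": Decimal("0.0002"),
    "T2": Decimal("0.0003"),
    "T3": Decimal("0.002"),
    "T4": Decimal("0.004"),
    "T5": Decimal("0.02"),
}


def cost_per_thousand(tier):
    """Cost of 1,000 requests that all settle on one tier."""
    return PRICE_PER_REQUEST[tier] * 1000


def requests_per_dollar(tier):
    """Whole requests that $1 buys at one tier."""
    return int(Decimal(1) // PRICE_PER_REQUEST[tier])


def blended_cost_per_thousand(tier_mix):
    """tier_mix maps tier -> fraction of requests that settle on that tier."""
    if sum(tier_mix.values()) != 1:
        raise ValueError("tier fractions must sum to 1")
    per_request = sum(PRICE_PER_REQUEST[t] * share for t, share in tier_mix.items())
    return per_request * 1000


# Hypothetical mix: 80% of requests succeed at T2, 20% escalate to T3.
mix = {"T2": Decimal("0.8"), "T3": Decimal("0.2")}
print(cost_per_thousand("T2"))
print(requests_per_dollar("T2"))
print(blended_cost_per_thousand(mix))
```

With that mix, the blended figure works out to 0.8 × $0.30 + 0.2 × $2.00 = $0.64 per 1,000 requests, inside the range quoted above.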

Best practices

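  • Rate limiting: AlterLab respects Crawl-delay in robots.txt. The wait_time parameter sets the interval between requests in seconds (wait_time=1 means 1 second); for Hacker News, set it to match the 30-second Crawl-delay.
  • Robots.txt: Hacker News's robots.txt applies to all crawlers (User-agent: *) and sets Crawl-delay: 30, so allow about 30 seconds between requests.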
  • Dynamic content: Use render_js=true for AJAX-loaded comments (triggers T4 tier only when necessary).
  • Error handling: Implement exponential backoff for 429 responses. AlterLab auto-retries failed tiers.
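
If you also make requests outside AlterLab, the robots.txt check and the backoff can be sketched with the standard library alone. The sample robots.txt uses the two directives quoted above plus an illustrative Disallow path. read_robots and fetch_with_backoff are hypothetical helpers, and fetch stands in for whatever client you use; none of this is part of AlterLab's SDK.

```python
import time
from urllib.robotparser import RobotFileParser

# Sample robots.txt using the directives quoted above; the Disallow path is illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /vote?
Crawl-delay: 30
"""


def read_robots(robots_txt, url, user_agent="*"):
    """Return (allowed, crawl_delay_seconds) for url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on HTTP 429 wait base_delay * 2**attempt, then retry.

    fetch is any callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)
    return status, body  # still rate limited after the final attempt


allowed, delay = read_robots(ROBOTS_TXT, "https://news.ycombinator.com/news")
print(allowed, delay)
```

Passing sleep as a parameter keeps the retry logic testable without real waiting. In production you would likely add random jitter to each delay so parallel workers do not retry in lockstep.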

Scaling up

For large datasets:

  1. Batch requests: Send 100 URLs per API call using urls array parameter
  2. Scheduling: Use AlterLab's cron endpoint for daily/weekly scrapes
  3. Storage: Stream results directly to S3 or your database via webhooks
  4. Responsibility: Monitor response codes; pause if 4xx errors exceed 1%
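
Steps 1 and 4 above reduce to two small helpers. The sketch below assumes nothing about AlterLab's batch endpoint beyond the 100-URL batch size stated in step 1; batched and should_pause are hypothetical names, and the item URLs are placeholders.

```python
from itertools import islice


def batched(urls, size=100):
    """Yield lists of at most `size` URLs, matching the 100-per-call batch above."""
    iterator = iter(urls)
    while batch := list(islice(iterator, size)):
        yield batch


def should_pause(status_codes, threshold=0.01):
    """True when the share of 4xx responses exceeds the threshold (1% by default)."""
    if not status_codes:
        return False
    client_errors = sum(1 for code in status_codes if 400 <= code < 500)
    return client_errors / len(status_codes) > threshold


# Placeholder item URLs, used only to show the batch sizes.
urls = [f"https://news.ycombinator.com/item?id={i}" for i in range(1, 251)]
print([len(batch) for batch in batched(urls)])  # [100, 100, 50]
print(should_pause([200] * 98 + [429, 403]))    # True (2% of responses are 4xx)
```

Checking the error rate after each batch, rather than once at the end of the job, lets the scraper stop before it sends hundreds more requests into a block.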

99.2% Success Rate
1.2s Avg Response
$0.002 Per Request (T3)

Key takeaways

  • AlterLab manages anti-bot challenges so you focus on data extraction
  • Always verify public data accessibility and comply with robots.txt
  • Use Cortex for type-safe JSON output instead of brittle CSS selectors
  • Start scraping at T1 tier—pay only for what succeeds
  • Scale responsibly with rate limiting and error handling

Related resource: Hacker News scraping guide

Frequently Asked Questions

Is it legal to scrape Hacker News?
Scraping publicly accessible data from Hacker News is generally considered permissible in light of cases such as hiQ Labs v. LinkedIn, but you must comply with robots.txt, rate limits, and Hacker News's Terms of Service. Avoid private data and respect crawl-delay directives.

Does Hacker News block scrapers?
Hacker News employs standard anti-bot measures including rate limiting, header checks, and occasional JS challenges. Raw HTTP requests often get blocked; AlterLab's Smart Rendering API handles proxy rotation, header management, and tier escalation automatically.

How much does it cost to scrape Hacker News?
For Hacker News (typically T2 or T3), costs range from $0.0003 to $0.002 per request. With AlterLab's auto-escalation you start at T1 and pay only for the tier that succeeds. See the pricing table above for per-1,000-request costs.