How to Scrape Hacker News: Complete Guide for 2026


Learn how to scrape Hacker News (news.ycombinator.com) with Python. Get structured data from stories, comments, and user profiles at scale.

Yash Dubey

April 9, 2026

10 min read

Why scrape Hacker News?

Hacker News is one of the most concentrated sources of technical discussion on the internet. Every day, thousands of engineers, founders, and investors post links, share opinions, and debate technology trends. That data has value if you can extract it reliably.

Here is what teams actually build with this data:

Trend detection and market research. Track which technologies, frameworks, and startups are gaining traction. A sudden spike in mentions of a specific tool often precedes broader adoption. Product teams use this to validate roadmap decisions. Engineering leaders use it to evaluate which technologies their teams should invest in learning.

Competitive intelligence. Monitor when competitors get mentioned, what people say about their products, and which features generate the most discussion. This is not about vanity metrics. It is about understanding how your market perceives alternatives to your product.

Lead generation and outreach. Identify companies hiring by tracking "Who is hiring?" threads. Extract contact information from user profiles. Find engineers discussing problems your product solves. Sales teams use this to build targeted outreach lists based on actual expressed needs, not purchased contact databases.

300K+ daily stories · 99.2% success rate · 1.2s avg response · 24/7 monitoring

Anti-bot challenges on news.ycombinator.com

Hacker News looks simple. It is a mostly static HTML site with minimal JavaScript. That makes it tempting to write a quick requests script and pull the data yourself.

The reality is more complicated.

Y Combinator runs infrastructure that detects and blocks automated scraping. Their protections include:

Rate limiting by IP. Make too many requests from the same IP address and you will get blocked. The threshold is not published, and it varies based on their current load and detection rules. You might get 50 requests before a block, or you might get blocked on request 10.

IP reputation scoring. Cloud-hosted IPs from AWS, GCP, and Azure are flagged more aggressively than residential IPs. If your scraper runs on an EC2 instance, you are already starting with a disadvantage.

Request fingerprinting. They analyze headers, TLS fingerprints, and behavioral patterns to distinguish browsers from scripts. A Python requests call with default headers looks nothing like a real browser. Missing Accept-Language headers, unusual User-Agent strings, and absent Sec-Fetch headers all trigger detection.

Session-based challenges. Some pages serve JavaScript challenges that must be solved before content loads. These are not traditional CAPTCHAs, but they block any client that does not execute the challenge code.
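If you do hand-roll a requests-based scraper despite all this, sending the headers a real browser sends narrows the gap. A minimal sketch; the values below are illustrative (a plausible Chrome-on-Windows profile), not a guaranteed bypass:

```python
# Browser-like request headers for a hand-rolled scraper. The values
# are illustrative; real browsers vary them per version and platform.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}

# These would be passed on every request, e.g.:
# requests.get("https://news.ycombinator.com", headers=BROWSER_HEADERS)
```

Headers alone do not address TLS fingerprints or behavioral detection, which is why this approach degrades over time.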

Building infrastructure to handle all of this yourself means managing proxy pools, rotating user agents, solving challenges, and constantly updating your approach as their detection evolves. Most teams spend weeks building this before realizing it is not their core competency.

AlterLab handles this through its anti-bot bypass API. You send a URL, get back clean HTML. The proxy rotation, header management, and challenge solving happen automatically.

Quick start with AlterLab API

If you want to scrape Hacker News without managing proxies or solving challenges, the fastest path is the AlterLab API. Here is how to get your first request working.

First, install the Python SDK:

Bash
pip install alterlab

Then make your first request:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://news.ycombinator.com",
    formats=["html"]
)

soup = BeautifulSoup(response.html, "html.parser")
stories = soup.select("tr.athing")

for story in stories[:5]:
    title = story.select_one(".titleline > a").text
    url = story.select_one(".titleline > a")["href"]
    print(f"{title} - {url}")

This returns the front page as clean HTML, which you parse with BeautifulSoup. The formats=["html"] parameter ensures you get the raw markup. You can also request JSON or Markdown output depending on your pipeline.

The same request via cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["html"]
  }'

For Node.js developers, the equivalent:

JavaScript
import AlterLab from "alterlab";
import { load } from "cheerio";

const client = new AlterLab("YOUR_API_KEY");
const response = await client.scrape("https://news.ycombinator.com", {
  formats: ["html"]
});

const $ = load(response.html);
$("tr.athing").slice(0, 5).each((_, el) => {
  const title = $(el).find(".titleline > a").text();
  const url = $(el).find(".titleline > a").attr("href");
  console.log(`${title} - ${url}`);
});

If you are new to the platform, the getting started guide walks through account setup, API key generation, and your first scrape in under five minutes.

Extracting structured data

Hacker News uses a table-based layout that has remained largely unchanged for years. This stability makes it straightforward to write CSS selectors that will not break on every redesign.

Front page stories

Each story on the front page is wrapped in a <tr class="athing"> element. The structure looks like this:

HTML
<tr class="athing" id="12345678">
  <td class="title">
    <span class="rank">1.</span>
    <span class="titleline">
      <a href="https://example.com/article">Story Title</a>
      <span class="sitebit comhead">
        (<a href="from?site=example.com"><span class="sitestr">example.com</span></a>)
      </span>
    </span>
  </td>
</tr>
<tr>
  <td class="subtext">
    <span class="score">142 points</span>
    by <a href="user?id=username" class="hnuser">username</a>
    <span class="age"><a href="item?id=12345678">3 hours ago</a></span>
    <span class="comments"><a href="item?id=12345678">87 comments</a></span>
  </td>
</tr>

Here is a complete extraction script:

Python
import alterlab
from bs4 import BeautifulSoup
import json

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape("https://news.ycombinator.com", formats=["html"])

soup = BeautifulSoup(response.html, "html.parser")
stories = []

for athing in soup.select("tr.athing"):
    story_id = athing["id"]
    title_el = athing.select_one(".titleline > a")
    title = title_el.text if title_el else None
    url = title_el["href"] if title_el else None

    subtext = athing.find_next_sibling("tr").select_one(".subtext")
    score = subtext.select_one(".score").text if subtext.select_one(".score") else "0 points"
    author = subtext.select_one(".hnuser").text if subtext.select_one(".hnuser") else None
    comments_el = subtext.select_one(".comments a")
    comments = comments_el.text if comments_el else "0 comments"
    comment_url = comments_el["href"] if comments_el else None

    stories.append({
        "id": story_id,
        "title": title,
        "url": url,
        "score": score,
        "author": author,
        "comments": comments,
        "comment_url": comment_url
    })

print(json.dumps(stories, indent=2))

Individual story pages and comments

To scrape a specific story and its comments, target the item page:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
story_id = "12345678"
response = client.scrape(f"https://news.ycombinator.com/item?id={story_id}", formats=["html"])

soup = BeautifulSoup(response.html, "html.parser")

comments = []
for comment in soup.select("tr.comtr"):
    author_el = comment.select_one(".hnuser")
    text_el = comment.select_one(".commtext")
    if author_el and text_el:
        comments.append({
            "author": author_el.text,
            "text": text_el.get_text(strip=True),
            "time": comment.select_one(".age").text if comment.select_one(".age") else None
        })

print(f"Extracted {len(comments)} comments")

User profiles

User profile pages contain karma counts, account age, and submission history:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape("https://news.ycombinator.com/user?id=pg", formats=["html"])

soup = BeautifulSoup(response.html, "html.parser")
user_info = {}

for row in soup.select("table tr"):
    cells = row.select("td")
    if len(cells) == 2:
        key = cells[0].text.strip().rstrip(":")
        value = cells[1].text.strip()
        user_info[key] = value

print(user_info)

Common pitfalls

Even with anti-bot handling solved, there are specific challenges when scraping Hacker News that trip up most implementations.

Rate limiting on comment threads

Deep comment threads on popular stories can contain hundreds of nested replies. Loading the full page for a story with 500+ comments returns a large HTML document. If you scrape multiple stories in parallel, you will hit rate limits quickly.

The solution is pagination. Hacker News does not paginate comments on the item page, but you can limit your extraction to top-level comments and fetch child comments separately if needed. Use the max_depth parameter in your parser to control how many nesting levels you extract.
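Hacker News encodes comment nesting in an indent cell inside each tr.comtr row; recent markup exposes an indent attribute on td.ind (older markup encoded depth only as a spacer-image width, so verify against the HTML you actually receive). A sketch of depth-limited extraction against a synthetic snippet with that structure:

```python
from bs4 import BeautifulSoup

# Synthetic snippet mirroring the comment-tree markup: each tr.comtr
# carries a td.ind whose indent attribute encodes nesting depth.
html = """
<table>
  <tr class="comtr"><td class="ind" indent="0"></td><td class="commtext">top level</td></tr>
  <tr class="comtr"><td class="ind" indent="1"></td><td class="commtext">reply</td></tr>
  <tr class="comtr"><td class="ind" indent="2"></td><td class="commtext">nested reply</td></tr>
</table>
"""

def extract_comments(html, max_depth=0):
    """Keep only comments nested no deeper than max_depth."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for row in soup.select("tr.comtr"):
        ind = row.select_one("td.ind")
        depth = int(ind.get("indent", 0)) if ind else 0
        if depth <= max_depth:
            kept.append(row.select_one(".commtext").get_text(strip=True))
    return kept

print(extract_comments(html, max_depth=0))  # only top-level comments
```

Raising max_depth pulls in replies; keeping it at 0 on a 500-comment thread cuts the volume you store and process dramatically.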

Front page pagination

The front page shows 30 stories by default. Older stories are accessible through a "More" link at the bottom, which loads the next page at https://news.ycombinator.com/?p=2. If you need historical front page data, you must iterate through pages.

Be aware that pages beyond the first few may return different HTML structures for very old stories. The core tr.athing selector remains stable, but subtext formatting can vary slightly.
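Iterating through pages then reduces to appending the p parameter. A small sketch that builds the URLs to fetch; each would be passed to client.scrape exactly as in the earlier examples:

```python
BASE = "https://news.ycombinator.com/"

def front_page_urls(pages):
    """Build URLs for the first `pages` front-page views.
    Page 1 is the bare front page; later pages use ?p=N."""
    urls = []
    for p in range(1, pages + 1):
        urls.append(BASE if p == 1 else f"{BASE}?p={p}")
    return urls

urls = front_page_urls(3)
# Each URL would then be fetched in turn, e.g.:
# for url in urls:
#     response = client.scrape(url, formats=["html"])
print(urls)
```

Stay shallow here: stories fall off the listing quickly, so pages beyond the first few rarely add much beyond what an hourly schedule already captures.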

Session handling for authenticated content

Some Hacker News features require authentication. The "hidden" stories list, saved items, and certain user settings are only visible when logged in. If your use case requires authenticated scraping, you need to pass session cookies with your requests.

With AlterLab, you can include cookies in your scrape request:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://news.ycombinator.com/saved",
    formats=["html"],
    cookies={
        "user": "your_username",
        "key": "your_auth_cookie"
    }
)

Note that sharing authentication cookies carries security implications. Use dedicated API keys with spend limits and rotate credentials regularly.

HTML parsing edge cases

Hacker News allows a subset of HTML in comments. This means comment text can contain <a>, <p>, <pre>, and <code> tags. When extracting comment text, use get_text(" ", strip=True) to flatten it to plain text (the separator keeps words on either side of an inline tag from running together), or preserve the HTML if your application needs formatting.
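A small illustration of the two options, using a synthetic comment body (commtext is the class Hacker News puts on comment text):

```python
from bs4 import BeautifulSoup

# Synthetic comment body using the tag subset HN allows in comments.
html = '<span class="commtext"><p>Use <code>functools.cache</code> here.</p></span>'
commtext = BeautifulSoup(html, "html.parser").select_one(".commtext")

# Plain text: the separator keeps words on either side of an inline
# tag from running together.
plain = commtext.get_text(" ", strip=True)

# Preserved formatting: the inner HTML, tags and all.
rich = commtext.decode_contents()

print(plain)
print(rich)
```

Store the rich form if you ever plan to render comments; you can always flatten later, but you cannot recover stripped markup.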

Some comments are marked [dead] or [flagged]. These appear in the HTML with specific classes. Filter them out if your analysis requires only active content:

Python
active_comments = [
    c for c in all_comments
    if "[dead]" not in c["text"] and "[flagged]" not in c["text"]
]

Scaling up

Running a few scrapes manually is straightforward. Running thousands of scrapes daily on a schedule requires infrastructure planning.

Batch requests

When you need to scrape multiple URLs at once, batch them in a single API call rather than making individual requests. This reduces overhead and ensures consistent proxy rotation across your batch.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
urls = [
    "https://news.ycombinator.com",
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/ask",
    "https://news.ycombinator.com/show",
    "https://news.ycombinator.com/jobs"
]

results = client.scrape_batch(urls, formats=["html"])
for url, result in zip(urls, results):
    print(f"{url}: {len(result.html)} bytes")

Scheduling recurring scrapes

If you need fresh Hacker News data every hour, set up a scheduled scrape with a cron expression. AlterLab handles the timing, execution, and result delivery automatically.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
schedule = client.schedules.create(
    url="https://news.ycombinator.com",
    cron="0 * * * *",
    formats=["json"],
    webhook_url="https://your-server.com/hn-webhook"
)
print(f"Scheduled scrape ID: {schedule.id}")

This runs every hour and pushes results to your webhook endpoint. No polling required.

Handling large datasets

If you are archiving Hacker News data long-term, consider these patterns:

Deduplication. Stories appear on the front page for hours and may be scraped multiple times. Use the story ID (tr.athing[id]) as a unique key to avoid duplicate records.
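A sketch of set-based deduplication keyed on that ID (in production the seen set would live in a database or key-value store rather than in memory):

```python
seen_ids = set()  # persist this between runs in production

def dedupe(stories, seen_ids):
    """Keep only stories whose ID has not been seen before."""
    fresh = []
    for story in stories:
        if story["id"] not in seen_ids:
            seen_ids.add(story["id"])
            fresh.append(story)
    return fresh

batch1 = [{"id": "100", "title": "A"}, {"id": "101", "title": "B"}]
batch2 = [{"id": "101", "title": "B"}, {"id": "102", "title": "C"}]

print(len(dedupe(batch1, seen_ids)))  # 2 new stories
print(len(dedupe(batch2, seen_ids)))  # 1 new story; 101 was already seen
```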

Incremental extraction. Instead of re-scraping the entire front page, track the highest story ID you have seen and only extract new stories. This reduces data volume and processing time.
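Item IDs on Hacker News increase monotonically, so a simple high-water mark is enough:

```python
def new_stories(stories, highest_seen):
    """Return stories with an ID above the high-water mark,
    plus the advanced mark."""
    fresh = [s for s in stories if int(s["id"]) > highest_seen]
    mark = max((int(s["id"]) for s in stories), default=highest_seen)
    return fresh, max(mark, highest_seen)

stories = [{"id": "205"}, {"id": "203"}, {"id": "207"}]
fresh, mark = new_stories(stories, highest_seen=204)
print(len(fresh), mark)  # 205 and 207 are new; the mark advances to 207
```

Note this skips stories that climb back onto the page after the mark passes them; if you need score and comment updates for known stories, re-scrape their item pages separately.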

Storage format. Store extracted data in a columnar format like Parquet if you plan to run analytics. JSON is fine for ingestion, but Parquet gives you faster query performance at scale.

Cost management

Hacker News is a relatively simple target. The front page and item pages are mostly static HTML, which means they fall into lower pricing tiers. Headless browser rendering is not required for most use cases.

Cost scales linearly with request volume. A schedule that scrapes the front page hourly costs 24 requests per day. Adding comment extraction for the top 10 stories multiplies that by 11. Plan your request volume before setting up schedules.
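That arithmetic is worth writing down before enabling a schedule. A sketch with a placeholder per-request price (real tier rates are on the pricing page):

```python
# Requests per day for an hourly front-page scrape plus comment pages
# for the top 10 stories.
runs_per_day = 24
pages_per_run = 1 + 10          # front page + 10 item pages
requests_per_day = runs_per_day * pages_per_run

price_per_request = 0.001       # placeholder USD, not a real tier rate
monthly_requests = requests_per_day * 30

print(requests_per_day)                      # 264 requests per day
print(monthly_requests)                      # 7920 requests per month
print(monthly_requests * price_per_request)  # estimated monthly spend
```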

Review AlterLab pricing for current tier rates and volume discounts. Setting spend limits on your API keys prevents unexpected charges if a schedule runs more frequently than intended.

Monitoring for changes

Hacker News does not change often, but when it does, your selectors can break. Set up monitoring to detect when the HTML structure changes:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
monitor = client.monitors.create(
    url="https://news.ycombinator.com",
    check_interval="1h",
    diff_detection=True,
    alert_email="[email protected]"
)

This checks the page hourly and alerts you when the HTML structure changes significantly enough to affect your selectors.

Key takeaways

Hacker News is a valuable data source for trend detection, competitive intelligence, and lead generation. The site uses standard anti-bot protections that require proxy rotation and careful header management to bypass reliably.

The AlterLab API handles the infrastructure layer. You send a URL, get clean HTML, and parse it with your preferred tool. CSS selectors on Hacker News are stable because the layout has not changed significantly in years.

For production pipelines, batch your requests, schedule recurring scrapes with webhooks, and set up monitoring to catch structural changes before they break your extraction logic. Track story IDs to deduplicate, and use spend limits to control costs.



Frequently Asked Questions

Is it legal to scrape Hacker News?
Hacker News is a publicly accessible website, and scraping publicly available data is generally legal in most jurisdictions. However, you should review their terms of service, respect robots.txt directives, and avoid aggressive request rates that could impact their infrastructure.

Does Hacker News block scrapers?
Hacker News uses standard anti-bot protections including rate limiting and IP-based blocking. AlterLab's anti-bot bypass API handles proxy rotation, header management, and request fingerprinting automatically, so you get clean HTML without managing infrastructure.

How much does scraping Hacker News cost?
AlterLab uses a pay-as-you-go model with tiered pricing based on request complexity. Simple HTML scrapes of Hacker News fall into lower tiers, while headless browser requests for dynamic content cost more. Check the pricing page for current rates and volume discounts.