How to Scrape Hacker News: Complete Guide for 2026


Learn how to scrape Hacker News (news.ycombinator.com) with Python. Get structured data from stories, comments, and user profiles at scale.

Yash Dubey

April 9, 2026

10 min read

Why scrape Hacker News?

Hacker News is one of the most concentrated sources of technical discussion on the internet. Every day, thousands of engineers, founders, and investors post links, share opinions, and debate technology trends. That data has value if you can extract it reliably.

Here is what teams actually build with this data:

Trend detection and market research. Track which technologies, frameworks, and startups are gaining traction. A sudden spike in mentions of a specific tool often precedes broader adoption. Product teams use this to validate roadmap decisions. Engineering leaders use it to evaluate which technologies their teams should invest in learning.

Competitive intelligence. Monitor when competitors get mentioned, what people say about their products, and which features generate the most discussion. This is not about vanity metrics. It is about understanding how your market perceives alternatives to your product.

Lead generation and outreach. Identify companies hiring by tracking "Who is hiring?" threads. Extract contact information from user profiles. Find engineers discussing problems your product solves. Sales teams use this to build targeted outreach lists based on actual expressed needs, not purchased contact databases.

300K+ daily stories · 99.2% success rate · 1.2s avg response · 24/7 monitoring

Anti-bot challenges on news.ycombinator.com

Hacker News looks simple. It is a mostly static HTML site with minimal JavaScript. That makes it tempting to write a quick requests script and pull the data yourself.

The reality is more complicated.

Y Combinator runs infrastructure that detects and blocks automated scraping. Their protections include:

Rate limiting by IP. Make too many requests from the same IP address and you will get blocked. The threshold is not published, and it varies based on their current load and detection rules. You might get 50 requests before a block, or you might get blocked on request 10.

IP reputation scoring. Cloud-hosted IPs from AWS, GCP, and Azure are flagged more aggressively than residential IPs. If your scraper runs on an EC2 instance, you are already starting with a disadvantage.

Request fingerprinting. They analyze headers, TLS fingerprints, and behavioral patterns to distinguish browsers from scripts. A Python requests call with default headers looks nothing like a real browser. Missing Accept-Language headers, unusual User-Agent strings, and absent Sec-Fetch headers all trigger detection.

Session-based challenges. Some pages serve JavaScript challenges that must be solved before content loads. These are not traditional CAPTCHAs, but they block any client that does not execute the challenge code.
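If you do hand-roll a requests-based scraper despite all this, sending the headers a real browser sends narrows the gap. A minimal sketch; the values below are illustrative (a plausible Chrome-on-Windows profile), not a guaranteed bypass:

```python
# Browser-like request headers for a hand-rolled scraper. The values
# are illustrative; real browsers vary them per version and platform.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}

# These would be passed on every request, e.g.:
# requests.get("https://news.ycombinator.com", headers=BROWSER_HEADERS)
```

Headers alone do not address TLS fingerprints or behavioral detection, which is why this approach degrades over time.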

Building infrastructure to handle all of this yourself means managing proxy pools, rotating user agents, solving challenges, and constantly updating your approach as their detection evolves. Most teams spend weeks building this before realizing it is not their core competency.

AlterLab handles this through its anti-bot bypass API. You send a URL, get back clean HTML. The proxy rotation, header management, and challenge solving happen automatically.

Quick start with AlterLab API

If you want to scrape Hacker News without managing proxies or solving challenges, the fastest path is the AlterLab API. Here is how to get your first request working.

First, install the Python SDK:

Bash
pip install alterlab

Then make your first request:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://news.ycombinator.com",
    formats=["html"]
)

soup = BeautifulSoup(response.html, "html.parser")
stories = soup.select("tr.athing")

for story in stories[:5]:
    title = story.select_one(".titleline > a").text
    url = story.select_one(".titleline > a")["href"]
    print(f"{title} - {url}")

This returns the front page as clean HTML, which you parse with BeautifulSoup. The formats=["html"] parameter ensures you get the raw markup. You can also request JSON or Markdown output depending on your pipeline.

The same request via cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["html"]
  }'

For Node.js developers, the equivalent:

JavaScript
import AlterLab from "alterlab";
import { load } from "cheerio";

const client = new AlterLab("YOUR_API_KEY");
const response = await client.scrape("https://news.ycombinator.com", {
  formats: ["html"]
});

const $ = load(response.html);
$("tr.athing").slice(0, 5).each((_, el) => {
  const title = $(el).find(".titleline > a").text();
  const url = $(el).find(".titleline > a").attr("href");
  console.log(`${title} - ${url}`);
});

If you are new to the platform, the getting started guide walks through account setup, API key generation, and your first scrape in under five minutes.

Extracting structured data

Hacker News uses a table-based layout that has remained largely unchanged for years. This stability makes it straightforward to write CSS selectors that will not break on every redesign.

Front page stories

Each story on the front page is wrapped in a <tr class="athing"> element. The structure looks like this:

HTML
<tr class="athing" id="12345678">
  <td class="title">
    <span class="rank">1.</span>
    <span class="titleline">
      <a href="https://example.com/article">Story Title</a>
      <span class="sitebit comhead">
        (<a href="from?site=example.com"><span class="sitestr">example.com</span></a>)
      </span>
    </span>
  </td>
</tr>
<tr>
  <td class="subtext">
    <span class="score">142 points</span>
    by <a href="user?id=username" class="hnuser">username</a>
    <span class="age"><a href="item?id=12345678">3 hours ago</a></span>
    <span class="comments"><a href="item?id=12345678">87 comments</a></span>
  </td>
</tr>

Here is a complete extraction script:

Python
import alterlab
from bs4 import BeautifulSoup
import json

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape("https://news.ycombinator.com", formats=["html"])

soup = BeautifulSoup(response.html, "html.parser")
stories = []

for athing in soup.select("tr.athing"):
    story_id = athing["id"]
    title_el = athing.select_one(".titleline > a")
    title = title_el.text if title_el else None
    url = title_el["href"] if title_el else None

    subtext = athing.find_next_sibling("tr").select_one(".subtext")
    score = subtext.select_one(".score").text if subtext.select_one(".score") else "0 points"
    author = subtext.select_one(".hnuser").text if subtext.select_one(".hnuser") else None
    comments_el = subtext.select_one(".comments a")
    comments = comments_el.text if comments_el else "0 comments"
    comment_url = comments_el["href"] if comments_el else None

    stories.append({
        "id": story_id,
        "title": title,
        "url": url,
        "score": score,
        "author": author,
        "comments": comments,
        "comment_url": comment_url
    })

print(json.dumps(stories, indent=2))

Individual story pages and comments

To scrape a specific story and its comments, target the item page:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
story_id = "12345678"
response = client.scrape(f"https://news.ycombinator.com/item?id={story_id}", formats=["html"])

soup = BeautifulSoup(response.html, "html.parser")

comments = []
for comment in soup.select("tr.comtr"):
    author_el = comment.select_one(".hnuser")
    text_el = comment.select_one(".commtext")
    if author_el and text_el:
        comments.append({
            "author": author_el.text,
            "text": text_el.get_text(strip=True),
            "time": comment.select_one(".age").text if comment.select_one(".age") else None
        })

print(f"Extracted {len(comments)} comments")

User profiles

User profile pages contain karma counts, account age, and submission history:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape("https://news.ycombinator.com/user?id=pg", formats=["html"])

soup = BeautifulSoup(response.html, "html.parser")
user_info = {}

for row in soup.select("table tr"):
    cells = row.select("td")
    if len(cells) == 2:
        key = cells[0].text.strip().rstrip(":")
        value = cells[1].text.strip()
        user_info[key] = value

print(user_info)

Common pitfalls

Even with anti-bot handling solved, there are specific challenges when scraping Hacker News that trip up most implementations.

Rate limiting on comment threads

Deep comment threads on popular stories can contain hundreds of nested replies. Loading the full page for a story with 500+ comments returns a large HTML document. If you scrape multiple stories in parallel, you will hit rate limits quickly.

The solution is pagination. Hacker News does not paginate comments on the item page, but you can limit your extraction to top-level comments and fetch child comments separately if needed. Use the max_depth parameter in your parser to control how many nesting levels you extract.
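Hacker News encodes comment nesting in an indent cell inside each tr.comtr row; recent markup exposes an indent attribute on td.ind (older markup encoded depth only as a spacer-image width, so verify against the HTML you actually receive). A sketch of depth-limited extraction against a synthetic snippet with that structure:

```python
from bs4 import BeautifulSoup

# Synthetic snippet mirroring the comment-tree markup: each tr.comtr
# carries a td.ind whose indent attribute encodes nesting depth.
html = """
<table>
  <tr class="comtr"><td class="ind" indent="0"></td><td class="commtext">top level</td></tr>
  <tr class="comtr"><td class="ind" indent="1"></td><td class="commtext">reply</td></tr>
  <tr class="comtr"><td class="ind" indent="2"></td><td class="commtext">nested reply</td></tr>
</table>
"""

def extract_comments(html, max_depth=0):
    """Keep only comments nested no deeper than max_depth."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for row in soup.select("tr.comtr"):
        ind = row.select_one("td.ind")
        depth = int(ind.get("indent", 0)) if ind else 0
        if depth <= max_depth:
            kept.append(row.select_one(".commtext").get_text(strip=True))
    return kept

print(extract_comments(html, max_depth=0))  # only top-level comments
```

Raising max_depth pulls in replies; keeping it at 0 on a 500-comment thread cuts the volume you store and process dramatically.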

Front page pagination

The front page shows 30 stories by default. Older stories are accessible through a "More" link at the bottom, which loads the next page at https://news.ycombinator.com/?p=2. If you need historical front page data, you must iterate through pages.

Be aware that pages beyond the first few may return different HTML structures for very old stories. The core tr.athing selector remains stable, but subtext formatting can vary slightly.
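Iterating through pages then reduces to appending the p parameter. A small sketch that builds the URLs to fetch; each would be passed to client.scrape exactly as in the earlier examples:

```python
BASE = "https://news.ycombinator.com/"

def front_page_urls(pages):
    """Build URLs for the first `pages` front-page views.
    Page 1 is the bare front page; later pages use ?p=N."""
    urls = []
    for p in range(1, pages + 1):
        urls.append(BASE if p == 1 else f"{BASE}?p={p}")
    return urls

urls = front_page_urls(3)
# Each URL would then be fetched in turn, e.g.:
# for url in urls:
#     response = client.scrape(url, formats=["html"])
print(urls)
```

Stay shallow here: stories fall off the listing quickly, so pages beyond the first few rarely add much beyond what an hourly schedule already captures.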

Session handling for authenticated content

Some Hacker News features require authentication. The "hidden" stories list, saved items, and certain user settings are only visible when logged in. If your use case requires authenticated scraping, you need to pass session cookies with your requests.

With AlterLab, you can include cookies in your scrape request:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://news.ycombinator.com/saved",
    formats=["html"],
    cookies={
        "user": "your_username",
        "key": "your_auth_cookie"
    }
)

Note that sharing authentication cookies carries security implications. Use dedicated API keys with spend limits and rotate credentials regularly.

HTML parsing edge cases

Hacker News allows a subset of HTML in comments. This means comment text can contain <a>, <p>, <pre>, and <code> tags. When extracting comment text, use get_text(" ", strip=True) to flatten it to plain text (the separator keeps words on either side of an inline tag from running together), or preserve the HTML if your application needs formatting.
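A small illustration of the two options, using a synthetic comment body (commtext is the class Hacker News puts on comment text):

```python
from bs4 import BeautifulSoup

# Synthetic comment body using the tag subset HN allows in comments.
html = '<span class="commtext"><p>Use <code>functools.cache</code> here.</p></span>'
commtext = BeautifulSoup(html, "html.parser").select_one(".commtext")

# Plain text: the separator keeps words on either side of an inline
# tag from running together.
plain = commtext.get_text(" ", strip=True)

# Preserved formatting: the inner HTML, tags and all.
rich = commtext.decode_contents()

print(plain)
print(rich)
```

Store the rich form if you ever plan to render comments; you can always flatten later, but you cannot recover stripped markup.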

Some comments are marked [dead] or [flagged]. These appear in the HTML with specific classes. Filter them out if your analysis requires only active content:

Python
active_comments = [
    c for c in all_comments
    if "[dead]" not in c["text"] and "[flagged]" not in c["text"]
]

Scaling up

Running a few scrapes manually is straightforward. Running thousands of scrapes daily on a schedule requires infrastructure planning.

Batch requests

When you need to scrape multiple URLs at once, batch them in a single API call rather than making individual requests. This reduces overhead and ensures consistent proxy rotation across your batch.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
urls = [
    "https://news.ycombinator.com",
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/ask",
    "https://news.ycombinator.com/show",
    "https://news.ycombinator.com/jobs"
]

results = client.scrape_batch(urls, formats=["html"])
for url, result in zip(urls, results):
    print(f"{url}: {len(result.html)} bytes")

Scheduling recurring scrapes

If you need fresh Hacker News data every hour, set up a scheduled scrape with a cron expression. AlterLab handles the timing, execution, and result delivery automatically.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
schedule = client.schedules.create(
    url="https://news.ycombinator.com",
    cron="0 * * * *",
    formats=["json"],
    webhook_url="https://your-server.com/hn-webhook"
)
print(f"Scheduled scrape ID: {schedule.id}")

This runs every hour and pushes results to your webhook endpoint. No polling required.

Handling large datasets

If you are archiving Hacker News data long-term, consider these patterns:

Deduplication. Stories appear on the front page for hours and may be scraped multiple times. Use the story ID (tr.athing[id]) as a unique key to avoid duplicate records.
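A sketch of set-based deduplication keyed on that ID (in production the seen set would live in a database or key-value store rather than in memory):

```python
seen_ids = set()  # persist this between runs in production

def dedupe(stories, seen_ids):
    """Keep only stories whose ID has not been seen before."""
    fresh = []
    for story in stories:
        if story["id"] not in seen_ids:
            seen_ids.add(story["id"])
            fresh.append(story)
    return fresh

batch1 = [{"id": "100", "title": "A"}, {"id": "101", "title": "B"}]
batch2 = [{"id": "101", "title": "B"}, {"id": "102", "title": "C"}]

print(len(dedupe(batch1, seen_ids)))  # 2 new stories
print(len(dedupe(batch2, seen_ids)))  # 1 new story; 101 was already seen
```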

Incremental extraction. Instead of re-scraping the entire front page, track the highest story ID you have seen and only extract new stories. This reduces data volume and processing time.
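Item IDs on Hacker News increase monotonically, so a simple high-water mark is enough:

```python
def new_stories(stories, highest_seen):
    """Return stories with an ID above the high-water mark,
    plus the advanced mark."""
    fresh = [s for s in stories if int(s["id"]) > highest_seen]
    mark = max((int(s["id"]) for s in stories), default=highest_seen)
    return fresh, max(mark, highest_seen)

stories = [{"id": "205"}, {"id": "203"}, {"id": "207"}]
fresh, mark = new_stories(stories, highest_seen=204)
print(len(fresh), mark)  # 205 and 207 are new; the mark advances to 207
```

Note this skips stories that climb back onto the page after the mark passes them; if you need score and comment updates for known stories, re-scrape their item pages separately.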

Storage format. Store extracted data in a columnar format like Parquet if you plan to run analytics. JSON is fine for ingestion, but Parquet gives you faster query performance at scale.

Cost management

Hacker News is a relatively simple target. The front page and item pages are mostly static HTML, which means they fall into lower pricing tiers. Headless browser rendering is not required for most use cases.

Cost scales linearly with request volume. A schedule that scrapes the front page hourly costs 24 requests per day. Adding comment extraction for the top 10 stories multiplies that by 11. Plan your request volume before setting up schedules.
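That arithmetic is worth writing down before enabling a schedule. A sketch with a placeholder per-request price (real tier rates are on the pricing page):

```python
# Requests per day for an hourly front-page scrape plus comment pages
# for the top 10 stories.
runs_per_day = 24
pages_per_run = 1 + 10          # front page + 10 item pages
requests_per_day = runs_per_day * pages_per_run

price_per_request = 0.001       # placeholder USD, not a real tier rate
monthly_requests = requests_per_day * 30

print(requests_per_day)                      # 264 requests per day
print(monthly_requests)                      # 7920 requests per month
print(monthly_requests * price_per_request)  # estimated monthly spend
```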

Review AlterLab pricing for current tier rates and volume discounts. Setting spend limits on your API keys prevents unexpected charges if a schedule runs more frequently than intended.

Monitoring for changes

Hacker News does not change often, but when it does, your selectors can break. Set up monitoring to detect when the HTML structure changes:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
monitor = client.monitors.create(
    url="https://news.ycombinator.com",
    check_interval="1h",
    diff_detection=True,
    alert_email="[email protected]"
)

This checks the page hourly and alerts you when the HTML structure changes significantly enough to affect your selectors.

Key takeaways

Hacker News is a valuable data source for trend detection, competitive intelligence, and lead generation. The site uses standard anti-bot protections that require proxy rotation and careful header management to bypass reliably.

The AlterLab API handles the infrastructure layer. You send a URL, get clean HTML, and parse it with your preferred tool. CSS selectors on Hacker News are stable because the layout has not changed significantly in years.

For production pipelines, batch your requests, schedule recurring scrapes with webhooks, and set up monitoring to catch structural changes before they break your extraction logic. Track story IDs to deduplicate, and use spend limits to control costs.



Frequently Asked Questions

Is it legal to scrape Hacker News?
Hacker News is a publicly accessible website, and scraping publicly available data is generally legal in most jurisdictions. However, you should review their terms of service, respect robots.txt directives, and avoid aggressive request rates that could impact their infrastructure.

Does Hacker News block scrapers?
Hacker News uses standard anti-bot protections including rate limiting and IP-based blocking. AlterLab's anti-bot bypass API handles proxy rotation, header management, and request fingerprinting automatically, so you get clean HTML without managing infrastructure.

How much does scraping Hacker News cost?
AlterLab uses a pay-as-you-go model with tiered pricing based on request complexity. Simple HTML scrapes of Hacker News fall into lower tiers, while headless browser requests for dynamic content cost more. Check the pricing page for current rates and volume discounts.