Hacker News Data API: Extract Structured JSON in 2026

Extract structured Hacker News data via API using AlterLab's Extract AI. Get typed JSON output for title, author, date and more—no HTML parsing needed.

This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.

TL;DR

To get structured Hacker News data via API, use AlterLab's Extract endpoint with a JSON schema defining your desired fields (title, author, published_date, tags, url). Pass the schema and target URL to receive validated, typed JSON output, with no fragile HTML parsing. After setup, the extraction itself is a schema definition plus a single client call.

Why use Hacker News data?

Hacker News provides real-time insights into tech trends, making it valuable for:

  • AI training datasets: Collecting technical article titles and discussions for natural language processing models
  • Competitive intelligence: Monitoring emerging technologies and startup announcements mentioned in threads
  • Content aggregation: Building tech news feeds or trend analysis tools for developer communities

What data can you extract?

From public Hacker News pages, you can extract these structured fields:

  • title: The headline of the story or discussion
  • author: The username of the submitter
  • published_date: Timestamp when the item was posted
  • tags: Associated categories or keywords (if visible in the snippet)
  • url: Direct link to the external article or internal discussion

All fields are publicly visible on the news.ycombinator.com homepage and item pages. AlterLab's AI identifies and extracts them based on your schema definition.

The extraction approach

Raw HTTP requests combined with hand-written HTML parsing tend to be brittle on sites like Hacker News because of:

  • Content that is rendered or updated via JavaScript
  • Markup changes that break CSS selectors
  • Rate limiting and anti-bot measures that require session handling

A data API approach solves these by:

  • Handling JavaScript rendering and anti-bot challenges automatically
  • Returning structured data matching your schema instead of raw HTML
  • Providing built-in retry logic and rate limit management
  • Eliminating the need for maintenance-heavy parsing code

Quick start with AlterLab Extract API

First, install the AlterLab Python client and follow the Getting started guide. Then extract data with minimal code:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

schema = {
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Headline of the story or discussion"
    },
    "author": {
      "type": "string",
      "description": "Username of the submitter"
    },
    "published_date": {
      "type": "string",
      "description": "ISO 8601 timestamp of when the item was posted"
    },
    "tags": {
      "type": "array",
      "items": {"type": "string"},
      "description": "Associated categories or keywords, if any are visible"
    },
    "url": {
      "type": "string",
      "description": "Link to the external article or internal discussion"
    }
  }
}

result = client.extract(
    url="https://news.ycombinator.com/item?id=40000000",
    schema=schema,
)
print(result.data)

The equivalent cURL request:

Bash
curl -X POST https://api.alterlab.io/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com/item?id=40000000",
    "schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "url": {"type": "string"}
      }
    }
  }'
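
Where installing the client library is not an option, the same request can be assembled with Python's standard library. The sketch below mirrors the cURL call above (endpoint, X-API-Key header, JSON body). The helper name build_extract_request is hypothetical, and since the raw HTTP response shape is not documented in this guide, the send step is left commented out.

```python
import json
import urllib.request

# Endpoint and header name are copied from the cURL example above.
EXTRACT_ENDPOINT = "https://api.alterlab.io/v1/extract"


def build_extract_request(target_url, schema, api_key):
    """Build (but do not send) a POST request for the Extract endpoint.

    Hypothetical helper for illustration; not part of the AlterLab client.
    """
    body = json.dumps({"url": target_url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_ENDPOINT,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "author": {"type": "string"}},
}
request = build_extract_request(
    "https://news.ycombinator.com/item?id=40000000", schema, "YOUR_API_KEY"
)

# To send it (requires a valid key and network access):
# with urllib.request.urlopen(request, timeout=30) as response:
#     payload = json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect or log before any credits are spent.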

Either request returns structured JSON like:

JSON
{
  "title": "Example Tech Article",
  "author": "techblogger",
  "published_date": "2026-03-15T14:30:00Z",
  "tags": ["programming", "ai"],
  "url": "https://example.com/tech-article"
}
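
Downstream code is often easier to write against a typed record than a raw dictionary. The following is a minimal sketch, assuming the extracted data has the shape shown above; the HNItem class and parse_item helper are hypothetical names, not part of AlterLab's client. Missing fields become None (or an empty list for tags) rather than raising.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class HNItem:
    """Typed view of one extracted item (hypothetical, for illustration)."""

    title: Optional[str] = None
    author: Optional[str] = None
    published_date: Optional[datetime] = None
    tags: List[str] = field(default_factory=list)
    url: Optional[str] = None


def parse_item(data: dict) -> HNItem:
    """Convert an extracted dict into an HNItem, tolerating missing keys."""
    raw_date = data.get("published_date")
    # Normalise a trailing "Z" so fromisoformat also works before Python 3.11.
    parsed_date = (
        datetime.fromisoformat(raw_date.replace("Z", "+00:00")) if raw_date else None
    )
    return HNItem(
        title=data.get("title"),
        author=data.get("author"),
        published_date=parsed_date,
        tags=list(data.get("tags") or []),
        url=data.get("url"),
    )


sample = {
    "title": "Example Tech Article",
    "author": "techblogger",
    "published_date": "2026-03-15T14:30:00Z",
    "tags": ["programming", "ai"],
    "url": "https://example.com/tech-article",
}
item = parse_item(sample)
print(item.title, item.published_date.isoformat())
```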

Define your schema

The schema parameter drives AlterLab's extraction accuracy. Key principles:

  • Type safety: Define string, number, boolean, or array types for each field
  • Description hints: Help the AI understand context (e.g., "ISO 8601 timestamp")
  • Required fields: Omit "required" array to allow partial extraction when data is missing
  • Nested objects: Extract complex structures like comment threads using object types

AlterLab validates output against your schema, returning only matching fields. If the AI cannot find a field, it returns null for that key—never inventing data.
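
To make these principles concrete, here is a sketch of a schema that uses description hints, an array type, and a nested object, followed by a small standard-library check that a result contains only declared keys with compatible types. The comments field and the check_against_schema helper are illustrative assumptions. AlterLab performs its own validation; this shallow check covers only the JSON Schema types listed in the mapping.

```python
# Maps JSON Schema type names to Python types (only the ones used here).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Headline of the story"},
        "published_date": {
            "type": "string",
            "description": "ISO 8601 timestamp of when the item was posted",
        },
        "tags": {"type": "array", "items": {"type": "string"}},
        # Nested structure: hypothetical field for a comment thread.
        "comments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "author": {"type": "string"},
                    "text": {"type": "string"},
                },
            },
        },
    },
    # No "required" array, so partial results are acceptable.
}


def check_against_schema(data, schema):
    """Return a list of problems; an empty list means the data looks fine.

    Shallow check only: top-level keys and their types, with None allowed.
    """
    problems = []
    properties = schema.get("properties", {})
    for key, value in data.items():
        if key not in properties:
            problems.append(f"unexpected key: {key}")
            continue
        expected = JSON_TYPES[properties[key]["type"]]
        # None stands for a field the extractor could not find.
        if value is not None and not isinstance(value, expected):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    return problems


result = {"title": "Example Tech Article", "published_date": None, "tags": ["ai"]}
print(check_against_schema(result, schema))
```

A check like this is cheap insurance at the boundary of a pipeline: it catches a schema drift or an unexpected field before the data reaches storage.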

Handle pagination and scale

For extracting multiple Hacker News pages:

  1. Batch processing: Use async requests with alterlab.extract_batch() for concurrent processing
  2. Rate limiting: AlterLab automatically respects Hacker News's crawl-delay; adjust via max_concurrency parameter
  3. Error handling: Check result.success flag and result.error for failed extractions
  4. Cost optimization: See AlterLab pricing for volume discounts—pay only for successful extractions

Example async batch job:

Python
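import asyncio

import alterlab

client = alterlab.Client("YOUR_API_KEY")

# First three pages of the Hacker News front page
urls = [
    "https://news.ycombinator.com",
    "https://news.ycombinator.com/news?p=2",
    "https://news.ycombinator.com/news?p=3"
]

schema = {
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "url": {"type": "string"}
  }
}

async def extract_all():
    # client.extract is called synchronously in the quick start, so each call
    # runs in a worker thread to make it awaitable. This assumes the client
    # can safely be shared across threads.
    tasks = [
        asyncio.to_thread(client.extract, url=url, schema=schema)
        for url in urls
    ]
    results = await asyncio.gather(*tasks)
    for url, result in zip(urls, results):
        if result.success:
            print(result.data)
        else:
            # result.error describes why this extraction failed
            print(f"Failed: {url}: {result.error}")

asyncio.run(extract_all())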
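
When the page count grows, it helps to generate the URLs and cap how many requests run at once. The sketch below is one way to do that with asyncio.Semaphore. Here fake_extract is a stand-in for the real client call so the example runs offline, and page_urls, extract_many, and the limit of 2 are illustrative choices, not AlterLab defaults.

```python
import asyncio


def page_urls(pages):
    """Build front-page URLs using the ?p=N pattern from the example above."""
    root = "https://news.ycombinator.com"
    return [root if n == 1 else f"{root}/news?p={n}" for n in range(1, pages + 1)]


def fake_extract(url):
    """Stand-in for a blocking client.extract(url=..., schema=...) call."""
    return {"url": url, "success": True}


async def extract_many(urls, extract, limit=2):
    """Run a blocking extract function over urls, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            # to_thread keeps the event loop free while the blocking call runs.
            return await asyncio.to_thread(extract, url)

    # return_exceptions=True returns failures alongside successes
    # instead of raising the first one.
    return await asyncio.gather(*(worker(u) for u in urls), return_exceptions=True)


results = asyncio.run(extract_many(page_urls(3), fake_extract))
for outcome in results:
    print(outcome)
```

Results come back in the same order as the input URLs, so they can be zipped with the URL list when reporting failures.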

Key takeaways

  • AlterLab's Extract API converts public web pages into typed JSON without parsing fragility
  • Define your exact data needs via JSON schema for validated, consistent output
  • The service handles JavaScript, anti-bot measures, and rate limiting automatically
  • Start with a single endpoint call; scale to batches using async patterns
  • Always verify compliance with robots.txt and Terms of Service before extraction

Begin extracting structured Hacker News data today—visit the Extract API docs for full reference.

Frequently Asked Questions

Hacker News offers an official Firebase-backed API that returns JSON for stories, comments, and users, but it does not cover content beyond those records, such as the full text or metadata of linked articles. AlterLab fills this gap by extracting structured JSON from public pages using AI, respecting robots.txt and rate limits.

You can extract publicly available fields including title, author, published_date, tags, and url from Hacker News pages. Define your desired schema and AlterLab returns validated, typed JSON, with no CSS selectors or parsing required.

AlterLab uses pay-as-you-go pricing with no minimums or expiration. Costs scale with extraction volume; see /pricing for details. You pay only for successful extractions, making it efficient for data pipelines.