Beginner4 steps

How to Scrape News Articles

News sites serve article content dynamically, require JavaScript to reveal full article text behind paywalls or subscription prompts, and change their HTML structure frequently. Reliable news extraction requires a robust rendering pipeline and flexible selectors.

Step-by-Step Guide

Target the article URL or news feed

Identify the URLs of articles you want to monitor. Many news sites also publish RSS or Atom feeds — check for /feed or /rss URLs as a supplementary source of article URLs.

Fetch and extract article content

Send each article URL to AlterLab. Parse the HTML to extract the headline, author, publish date, and body text using standard article schema selectors.

Parse publish dates

Look for article publish timestamps in JSON-LD schema tags or in the HTML as `<time>` elements with a `datetime` attribute in ISO 8601 format.

Deduplicate and store

Use the article URL as a unique key to prevent storing duplicate articles across repeat collection runs.

Code Example

Python

import requests
import json
from bs4 import BeautifulSoup

def extract_article(url: str, api_key: str) -> dict:
    response = requests.post(
        "https://alterlab.io/api/v1/scrape",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json={"url": url},
    )
    html = response.json().get("html", "")
    soup = BeautifulSoup(html, "html.parser")

    # Extract JSON-LD Article schema
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
            if data.get("@type") == "NewsArticle":
                return {
                    "headline": data.get("headline"),
                    "author": data.get("author", {}).get("name"),
                    "datePublished": data.get("datePublished"),
                }
        except json.JSONDecodeError:
            pass

    return {
        "headline": soup.select_one("h1")?.get_text(strip=True),
        "datePublished": soup.select_one("time[datetime]")?.get("datetime"),
    }

Replace YOUR_API_KEY with your key from the . No credit card required.

Try this yourself with AlterLab

Run this tutorial on live websites with AlterLab's API. Free tier includes 5,000 requests — no credit card required.

View API docs

Frequently Asked Questions

How do I handle paywalled articles?

Many paywall implementations use JavaScript to hide content after initial load. Enabling render_js in your request often reveals the full article text if the paywall is a client-side overlay. Server-side paywalls (content not in the HTML at all) require a valid subscription to access.

Responsible Use

AlterLab is designed for extracting publicly available data. Always review the terms of service for any website you access, respect robots.txt directives, and ensure your use case complies with applicable laws in your jurisdiction.

Your first scrape.
Sixty seconds.

$1 free credit — up to 5,000 scrapes. No credit card.
Just a POST request.

terminal

curl -X POST https://api.alterlab.io/v1/scrape \

-H "X-API-Key: YOUR_KEY" \

-H "Content-Type: application/json" \

-d '{"url": "https://example.com", "formats": ["markdown"]}'

Start building free

No credit card required · $1 free credit, up to 5,000 scrapes · Balance never expires

How to Scrape News Articles

Step-by-Step Guide

Target the article URL or news feed

Fetch and extract article content

Parse publish dates

Deduplicate and store

Code Example

Try this yourself with AlterLab

Frequently Asked Questions

How do I handle paywalled articles?

How do I handle paywalled articles?

Responsible Use

More tutorials

Your first scrape. Sixty seconds.

Your first scrape.
Sixty seconds.