Beginner · 4 steps

How to Scrape News Articles

Many news sites load article content dynamically, require JavaScript rendering before the full article text appears (often behind paywall or subscription prompts), and change their HTML structure frequently. Reliable news extraction therefore requires a robust rendering pipeline and flexible selectors.

Step-by-Step Guide

1

Target the article URL or news feed

Identify the URLs of articles you want to monitor. Many news sites also publish RSS or Atom feeds — check for /feed or /rss URLs as a supplementary source of article URLs.
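As a rough illustration of using a feed as a URL source, the sketch below pulls article links out of an RSS 2.0 or Atom document with Python's standard library. The function name `article_urls_from_feed` is hypothetical, and the code assumes the feed XML has already been downloaded and is well-formed; real feeds vary, so treat this as a starting point rather than a complete parser.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (RFC 4287).
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def article_urls_from_feed(feed_xml: str) -> list:
    """Return article URLs found in an RSS 2.0 or Atom feed document."""
    root = ET.fromstring(feed_xml)
    urls = []

    # RSS 2.0: each <item> carries its article URL in a <link> child.
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if link:
            urls.append(link)

    # Atom: each <entry> has <link href="..."> elements; a missing rel
    # attribute is treated as rel="alternate".
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href and link.get("rel", "alternate") == "alternate":
                urls.append(href)
                break

    return urls
```

The returned list can then be fed into the fetch step one URL at a time.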

2

Fetch and extract article content

Send each article URL to AlterLab. Parse the HTML to extract the headline, author, publish date, and body text using standard article schema selectors.
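For the body text specifically, one possible approach is to collect the paragraphs nested inside an `<article>` element. The sketch below uses only Python's standard `html.parser` module so it runs without extra dependencies. The names `ArticleBodyParser` and `extract_body_text` are hypothetical, and the assumption that the body lives in `<p>` tags inside `<article>` will not hold for every site.

```python
from html.parser import HTMLParser


class ArticleBodyParser(HTMLParser):
    """Collect the text of <p> elements nested inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0      # greater than 0 while inside <article>
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Join the buffered text and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth:
            if self.in_paragraph:   # previous <p> was never closed
                self._flush()
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self._flush()
        elif tag == "article" and self.article_depth:
            if self.in_paragraph:
                self._flush()
            self.article_depth -= 1

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)


def extract_body_text(html: str) -> str:
    """Return the article paragraphs joined by blank lines."""
    parser = ArticleBodyParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Paragraphs outside `<article>` (navigation, footers) are ignored, which is the main reason for tracking the nesting depth.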

3

Parse publish dates

Look for article publish timestamps in JSON-LD schema tags or in the HTML as `<time>` elements with a `datetime` attribute in ISO 8601 format.
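Once a timestamp string has been found, it usually helps to normalize it before storage. The following is a minimal sketch, assuming the value is an ISO 8601 string as described above. The helper name `parse_publish_date` is hypothetical, and treating timestamps without a UTC offset as UTC is an assumption you may want to change per site.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_publish_date(value: str) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp into a timezone-aware UTC datetime."""
    text = value.strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None  # not a format this sketch understands
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

Storing every timestamp in UTC makes articles from different sites directly comparable and sortable.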

4

Deduplicate and store

Use the article URL as a unique key to prevent storing duplicate articles across repeat collection runs.
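One way to implement this is to make the URL the primary key of a database table, so a repeated insert is silently skipped. The sketch below uses Python's built-in `sqlite3` module. The table layout and the names `open_store` and `save_article` are illustrative assumptions, not part of any AlterLab API.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) an article store keyed by URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "url TEXT PRIMARY KEY, headline TEXT, date_published TEXT)"
    )
    return conn


def save_article(conn, url, headline, date_published) -> bool:
    """Insert an article; return False if its URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, headline, date_published) "
        "VALUES (?, ?, ?)",
        (url, headline, date_published),
    )
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when it was ignored.
    return cur.rowcount == 1
```

Calling `save_article` twice with the same URL stores one row; the second call returns `False`, which can double as a signal to skip reprocessing. In practice you may also want to normalize URLs (for example, stripping tracking parameters) before using them as keys.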

Code Example

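import json

import requests
from bs4 import BeautifulSoup


def extract_article(url: str, api_key: str) -> dict:
    """Fetch a rendered article through AlterLab and extract basic metadata."""
    response = requests.post(
        "https://alterlab.io/api/v1/scrape",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json={"url": url},
        timeout=60,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    html = response.json().get("html", "")
    soup = BeautifulSoup(html, "html.parser")

    # Prefer JSON-LD: look for a NewsArticle object in each ld+json script tag.
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        # A script tag may hold a single object or a list of objects.
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict) or item.get("@type") != "NewsArticle":
                continue
            author = item.get("author")
            if isinstance(author, list):  # several authors: take the first
                author = author[0] if author else None
            if isinstance(author, dict):
                author = author.get("name")
            return {
                "headline": item.get("headline"),
                "author": author,
                "datePublished": item.get("datePublished"),
            }

    # Fallback: read the headline and timestamp straight from the HTML.
    h1 = soup.select_one("h1")
    time_tag = soup.select_one("time[datetime]")
    return {
        "headline": h1.get_text(strip=True) if h1 else None,
        "datePublished": time_tag.get("datetime") if time_tag else None,
    }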

Pass your API key from the dashboard as the api_key argument. No credit card required.

Responsible Use

AlterLab is designed for extracting publicly available data. Always review the terms of service for any website you access, respect robots.txt directives, and ensure your use case complies with applicable laws in your jurisdiction.
