How to Scrape News Articles
News sites often load article content dynamically, hide the full text behind JavaScript-driven paywalls or subscription prompts, and change their HTML structure frequently. Reliable news extraction therefore needs a rendering pipeline that executes JavaScript and selectors flexible enough to survive markup changes.
Step-by-Step Guide
Target the article URL or news feed
Identify the URLs of the articles you want to monitor. Many news sites also publish RSS or Atom feeds, so check for /feed or /rss paths as a supplementary source of article URLs.
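As a rough sketch of using a feed as a source of article URLs, the snippet below parses RSS 2.0 or Atom XML that has already been downloaded and collects the per-article links. The function name extract_feed_links and the sample feed are illustrative assumptions, not part of AlterLab's API, and real feeds may need more defensive handling.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_feed_links(feed_xml: str) -> list[str]:
    """Return article URLs found in an RSS 2.0 or Atom feed document."""
    root = ET.fromstring(feed_xml)
    links = []

    # RSS 2.0: each <item> carries its URL as the text of a <link> child.
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        if url:
            links.append(url)

    # Atom: each <entry> carries its URL in <link href="..."/>.
    # A missing rel attribute means rel="alternate" (the article page itself).
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href and link.get("rel", "alternate") == "alternate":
                links.append(href)

    return links


# Hypothetical sample feed, used only to illustrate the call.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <link>https://example.com/</link>
  <item><title>Story A</title><link>https://example.com/news/a</link></item>
  <item><title>Story B</title><link>https://example.com/news/b</link></item>
</channel></rss>"""

print(extract_feed_links(SAMPLE_RSS))
```

The channel-level link (the site home page) is skipped on purpose, because only the links nested inside item or entry elements point to individual articles.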
Fetch and extract article content
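Send each article URL to AlterLab, which returns the page HTML. Parse that HTML to extract the headline, author, publish date, and body text, preferring schema.org NewsArticle JSON-LD metadata where it exists and falling back to HTML selectors such as h1 and time[datetime].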
Parse publish dates
Look for article publish timestamps in JSON-LD schema tags or in the HTML as `<time>` elements with a `datetime` attribute in ISO 8601 format.
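Once a timestamp string has been pulled from JSON-LD or a datetime attribute, it usually helps to normalize it before storing. The following is a minimal sketch using only the standard library; the helper name parse_publish_date is hypothetical, and treating timestamps without an offset as UTC is an assumption that may not hold for every site.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_publish_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp into a timezone-aware UTC datetime."""
    if not value:
        return None
    text = value.strip()
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if text.endswith(("Z", "z")):
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None  # not ISO 8601; the caller decides how to handle it
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(parse_publish_date("2024-03-05T09:30:00-05:00"))
```

Storing every date in UTC keeps articles from different publishers comparable and sortable regardless of the offset each site emits.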
Deduplicate and store
Use the article URL as a unique key to prevent storing duplicate articles across repeat collection runs.
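One way to enforce the URL as a unique key is to make it the primary key of a database table and let the database reject repeats. The sketch below uses SQLite from the Python standard library; the table layout and the helper names open_store and save_article are assumptions chosen for illustration.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the article store; the URL column is the unique key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               url TEXT PRIMARY KEY,
               headline TEXT,
               date_published TEXT
           )"""
    )
    return conn


def save_article(
    conn: sqlite3.Connection,
    url: str,
    headline: Optional[str],
    date_published: Optional[str],
) -> bool:
    """Store the article unless its URL already exists; True if a row was added."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, headline, date_published) "
        "VALUES (?, ?, ?)",
        (url, headline, date_published),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_store()
first = save_article(conn, "https://example.com/news/a", "Story A", "2024-03-05T14:30:00+00:00")
second = save_article(conn, "https://example.com/news/a", "Story A", "2024-03-05T14:30:00+00:00")
print(first, second)
```

Because the uniqueness check happens inside the database, repeat collection runs can call save_article for every fetched URL without first querying for existing rows.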
Code Example
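import json

import requests
from bs4 import BeautifulSoup


def extract_article(url: str, api_key: str) -> dict:
    response = requests.post(
        "https://alterlab.io/api/v1/scrape",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json={"url": url},
        timeout=60,
    )
    response.raise_for_status()  # fail early on HTTP errors
    html = response.json().get("html", "")
    soup = BeautifulSoup(html, "html.parser")

    # Extract JSON-LD Article schema
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue
        # A JSON-LD block may hold a single object or a list of objects
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "NewsArticle":
                # "author" may be an object, a list of objects, or a plain string
                author = item.get("author") or {}
                if isinstance(author, list):
                    author = author[0] if author else {}
                return {
                    "headline": item.get("headline"),
                    "author": author.get("name") if isinstance(author, dict) else author,
                    "datePublished": item.get("datePublished"),
                }

    # Fall back to plain HTML selectors when no NewsArticle schema is found
    h1 = soup.select_one("h1")
    time_tag = soup.select_one("time[datetime]")
    return {
        "headline": h1.get_text(strip=True) if h1 else None,
        "datePublished": time_tag.get("datetime") if time_tag else None,
    }


if __name__ == "__main__":
    # Placeholder article URL, for illustration only
    print(extract_article("https://example.com/news/article", "YOUR_API_KEY"))

Replace YOUR_API_KEY with your key from the dashboard. No credit card required.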
Responsible Use
AlterLab is designed for extracting publicly available data. Always review the terms of service for any website you access, respect robots.txt directives, and ensure your use case complies with applicable laws in your jurisdiction.