How to Extract Structured Data from HTML
The data you need is usually buried in raw HTML under nested tags, inconsistent formatting, and several possible locations. Extracting clean, structured output takes a systematic approach: JSON-LD parsing, CSS selectors, or table extraction.
Step-by-Step Guide
Fetch the page HTML
Use AlterLab to retrieve the fully rendered HTML, including any content loaded via JavaScript.
Check for embedded JSON-LD
Many product and article pages embed structured data in JSON-LD script tags. Parse these first, since they often contain exactly the fields you need as clean JSON.
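JSON-LD blocks are not always a single object: a script tag may hold a list of objects, wrap them in an "@graph" array, or give "@type" as a list. The sketch below flattens those shapes before filtering by type. The helper names iter_jsonld_items and find_by_type are illustrative and not part of any library.

```python
import json
from typing import Iterator, Optional, Set

from bs4 import BeautifulSoup


def iter_jsonld_items(html: str) -> Iterator[dict]:
    """Yield every JSON-LD object in the page, flattening lists and @graph."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip malformed or empty blocks
        # A block may hold one object or a list of objects
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            # Some sites nest the actual entities under "@graph"
            graph = item.get("@graph")
            if isinstance(graph, list):
                yield from (g for g in graph if isinstance(g, dict))
            else:
                yield item


def find_by_type(html: str, wanted: Set[str]) -> Optional[dict]:
    """Return the first JSON-LD object whose @type is in `wanted`."""
    for item in iter_jsonld_items(html):
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]
        elif not isinstance(types, list):
            types = []
        if any(t in wanted for t in types if isinstance(t, str)):
            return item
    return None
```

Calling find_by_type(html, {"Product", "Article"}) would then return the first matching object, or None when the page has no usable JSON-LD and selector extraction is needed instead.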
Fall back to CSS selector extraction
For pages without JSON-LD, use CSS selectors with BeautifulSoup or lxml to target specific elements. Inspect the page in browser DevTools to find reliable selectors.
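Because the same field can sit in different places across page templates, one option is to try an ordered list of selectors and keep the first that matches. This is a minimal sketch: the first_text helper, the sample markup, and every selector in it are hypothetical and would need to be replaced with ones found in DevTools for the target site.

```python
from typing import Optional, Sequence

from bs4 import BeautifulSoup


def first_text(soup: BeautifulSoup, selectors: Sequence[str]) -> Optional[str]:
    """Return the stripped text of the first selector that matches, else None."""
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:
            text = node.get_text(strip=True)
            if text:
                return text
    return None


# Hypothetical markup standing in for a fetched page
html = """
<div class="product">
  <h2 class="product-title"> Blue Widget </h2>
  <span itemprop="price" content="19.99">$19.99</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Most specific selector first, generic fallbacks last
name = first_text(soup, ["h1.product-title", "h2.product-title", "h1"])

# Prefer machine-readable attributes over visible text for prices
price_node = soup.select_one("[data-price], [itemprop='price']")
price = None
if price_node is not None:
    price = price_node.get("data-price") or price_node.get("content")
```

Returning None rather than raising keeps a missing field from aborting the whole extraction, so the caller can decide whether the record is still usable.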
Normalize and clean extracted values
Strip whitespace, decode HTML entities, and convert types (strings to numbers, date strings to datetime objects) before storing your data.
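A sketch of what this cleanup step might look like using only the standard library. The function names are illustrative, parse_price assumes "." is the decimal separator, and parse_date assumes ISO 8601 input; other locales or date formats would need their own handling.

```python
import html
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional


def clean_text(value: str) -> str:
    """Decode HTML entities and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", html.unescape(value)).strip()


def parse_price(value: str) -> Optional[Decimal]:
    """Parse strings like '$1,299.00'; assumes '.' is the decimal separator."""
    digits = re.sub(r"[^\d.\-]", "", value)
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None  # nothing numeric was left after stripping symbols


def parse_date(value: str) -> Optional[datetime]:
    """Parse an ISO 8601 date or datetime string, returning None on failure."""
    try:
        return datetime.fromisoformat(value.strip())
    except ValueError:
        return None
```

Decimal is used here instead of float so that prices are not subject to binary rounding; a float would also work if exact cents do not matter for your use case.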
Code Example
import json

import requests
from bs4 import BeautifulSoup


def extract_structured(url: str, api_key: str) -> dict:
    # Fetch the fully rendered HTML through the scrape endpoint
    response = requests.post(
        "https://alterlab.io/api/v1/scrape",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json={"url": url, "render_js": True},
        timeout=60,
    )
    response.raise_for_status()
    html = response.json().get("html", "")
    soup = BeautifulSoup(html, "html.parser")

    # Try JSON-LD first
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)
            if data.get("@type") in ("Product", "Article"):
                return data
        except (json.JSONDecodeError, AttributeError, TypeError):
            # Malformed JSON, a non-object payload, or an empty script tag
            continue

    # Fall back to CSS selectors; select_one returns None when nothing matches
    title = soup.select_one("h1")
    price = soup.select_one("[data-price]")
    return {
        "name": title.get_text(strip=True) if title else None,
        "price": price.get("data-price") if price else None,
    }

Pass the key from your dashboard as the api_key argument. No credit card required.
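The introduction also lists table extraction as an approach, which the function above does not cover. As a rough sketch, an HTML table with a header row can be turned into a list of dictionaries keyed by column name. The table_to_records helper is hypothetical, and it assumes a simple, non-nested table whose first row holds the headers and which has no rowspan or colspan cells.

```python
from typing import Dict, List

from bs4 import BeautifulSoup


def table_to_records(html: str, selector: str = "table") -> List[Dict[str, str]]:
    """Convert the first table matching `selector` into a list of row dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(selector)
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    # Treat the first row as the header, whether it uses <th> or <td>
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        # Skip rows that do not line up with the header (notes, spanning cells)
        if len(cells) != len(headers):
            continue
        records.append(dict(zip(headers, cells)))
    return records
```

The resulting values are still raw strings, so they would go through the same normalization step as any other extracted field.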
Responsible Use
AlterLab is designed for extracting publicly available data. Always review the terms of service for any website you access, respect robots.txt directives, and ensure your use case complies with applicable laws in your jurisdiction.