Beginner · 4 steps

How to Extract Structured Data from HTML

Raw HTML contains the data you need buried in nested tags, inconsistent formatting, and multiple possible locations. Extracting clean, structured output requires a systematic approach using CSS selectors, JSON-LD parsing, or table extraction.

Step-by-Step Guide

Fetch the page HTML

Use AlterLab to retrieve the fully rendered HTML, including any content loaded via JavaScript.

Check for embedded JSON-LD

Many product and article pages embed structured data in JSON-LD script tags. Parse these first — they often contain exactly the fields you need in clean JSON format.
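JSON-LD payloads are not always a single object. A script tag may hold a list of objects or wrap them in an `@graph` array, and `@type` may itself be a list. The following is a minimal sketch, using only the standard library, that flattens the raw script-tag contents and picks the first item of a wanted type. The function names `iter_jsonld_items` and `find_by_type` are illustrative and not part of any library.

```python
import json
from typing import Any, Iterable, Iterator, Optional


def iter_jsonld_items(payloads: Iterable[Optional[str]]) -> Iterator[dict[str, Any]]:
    """Yield every JSON-LD object found in the raw script-tag contents."""
    for raw in payloads:
        try:
            data = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            continue  # skip malformed or empty script tags
        # A payload may be one object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            # Some sites nest the real objects under "@graph".
            graph = item.get("@graph")
            if isinstance(graph, list):
                yield from (g for g in graph if isinstance(g, dict))
            else:
                yield item


def find_by_type(payloads, wanted=("Product", "Article")):
    """Return the first item whose @type matches one of `wanted`, else None."""
    for item in iter_jsonld_items(payloads):
        types = item.get("@type")
        if isinstance(types, str):
            types = [types]
        if isinstance(types, list) and any(t in wanted for t in types):
            return item
    return None
```

With BeautifulSoup, the payloads could be collected as `[s.string for s in soup.find_all("script", type="application/ld+json")]`. Empty tags yield `None`, which the sketch skips.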

Fall back to CSS selector extraction

For pages without JSON-LD, use CSS selectors with BeautifulSoup or lxml to target specific elements. Inspect the page in browser DevTools to find reliable selectors.
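`select_one` returns `None` when nothing matches, so each lookup needs a guard before reading text or attributes. The following is a minimal sketch with BeautifulSoup. The helper names and the selectors (`h1.product-title`, `[data-price]`) are hypothetical and would need to match the page you are actually scraping.

```python
from typing import Optional

from bs4 import BeautifulSoup


def text_of(soup: BeautifulSoup, selector: str) -> Optional[str]:
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def attr_of(soup: BeautifulSoup, selector: str, attr: str) -> Optional[str]:
    """Return an attribute of the first match, or None if it is missing."""
    node = soup.select_one(selector)
    if node is None:
        return None
    value = node.get(attr)
    return value if isinstance(value, str) else None


def first_text(soup: BeautifulSoup, selectors: list[str]) -> Optional[str]:
    """Try several selectors in order and return the first non-empty text."""
    for selector in selectors:
        value = text_of(soup, selector)
        if value:
            return value
    return None


# Hypothetical sample markup standing in for a fetched page.
html = """
<div class="product">
  <h1 class="product-title">  Ceramic Mug </h1>
  <span class="price" data-price="12.50">$12.50</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
name = first_text(soup, ["h1.product-title", "h1"])
price = attr_of(soup, "[data-price]", "data-price")
missing = text_of(soup, ".does-not-exist")
```

Listing a specific selector first and a generic one second lets the extraction survive small markup changes, at the cost of possibly matching the wrong element on unusual pages.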

Normalize and clean extracted values

Strip whitespace, decode any remaining HTML entities, and convert types (strings to numbers, date strings to datetime objects) before storing your data.
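The cleanup step can be sketched with the standard library alone. The helper names below are illustrative. The price parser assumes `.` is the decimal separator and `,` groups thousands, and the date formats are example guesses that would need adjusting to the site being scraped.

```python
import html
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Decode HTML entities and collapse runs of whitespace."""
    if value is None:
        return None
    text = html.unescape(value)
    # \s also matches the non-breaking space produced by &nbsp;
    return re.sub(r"\s+", " ", text).strip() or None


def parse_price(value: Optional[str]) -> Optional[Decimal]:
    """Parse a price such as '$1,299.00' into a Decimal (assumed US-style format)."""
    text = clean_text(value)
    if text is None:
        return None
    digits = re.sub(r"[^\d.\-]", "", text)  # drop currency symbols and commas
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None


def parse_date(value: Optional[str],
               formats=("%Y-%m-%d", "%d %B %Y", "%B %d, %Y")) -> Optional[datetime]:
    """Try each format in turn; return None if none match.

    Month names (%B) are matched according to the current locale.
    """
    text = clean_text(value)
    if text is None:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None
```

Returning `None` for unparseable values, rather than raising, keeps a single bad field from discarding the whole record. `Decimal` is used for prices to avoid floating-point rounding.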

Code Example

Python

import json

import requests
from bs4 import BeautifulSoup


def extract_structured(url: str, api_key: str) -> dict:
    response = requests.post(
        "https://alterlab.io/api/v1/scrape",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json={"url": url, "render_js": True},
        timeout=60,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    html = response.json().get("html", "")
    soup = BeautifulSoup(html, "html.parser")

    # Try JSON-LD first
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)
            if data.get("@type") in ("Product", "Article"):
                return data
        except (json.JSONDecodeError, AttributeError, TypeError):
            # Malformed JSON, a top-level list, or an empty script tag
            continue

    # Fall back to CSS selectors; select_one returns None when nothing matches
    title = soup.select_one("h1")
    price = soup.select_one("[data-price]")
    return {
        "name": title.get_text(strip=True) if title is not None else None,
        "price": price.get("data-price") if price is not None else None,
    }

Replace YOUR_API_KEY with your AlterLab API key. No credit card required.

Try this yourself with AlterLab

Run this tutorial on live websites with AlterLab's API. Free tier includes 5,000 requests — no credit card required.


Frequently Asked Questions

What is JSON-LD and why should I parse it first?

JSON-LD is structured data that website owners embed to help search engines understand their content. It typically exposes fields such as product name, price, and availability in the standard schema.org vocabulary, which is much easier to parse than the surrounding HTML.

How do I extract data from HTML tables?

Use `soup.find('table')` and then iterate over its `tr` and `td` elements. The `pandas.read_html()` function can also parse tables directly from raw HTML; it returns a list of DataFrames, one per table found.
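The BeautifulSoup approach might look like the sketch below. It assumes a simple table with a single header row, no `rowspan` or `colspan`, and no nested tables. The function name `table_to_rows` and the sample markup are illustrative.

```python
from bs4 import BeautifulSoup


def table_to_rows(table) -> list[dict[str, str]]:
    """Convert a simple <table> into a list of dicts keyed by header text."""
    rows = table.find_all("tr")
    if not rows:
        return []
    # Treat the first row as the header, whether it uses <th> or <td>.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        if len(cells) != len(headers):
            continue  # skip spacer or malformed rows
        records.append(dict(zip(headers, cells)))
    return records


# Hypothetical sample markup standing in for a fetched page.
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Mug</td><td>12.50</td></tr>
  <tr><td>Plate</td><td>8.00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
records = table_to_rows(soup.find("table"))
```

All values come back as strings, so the usual normalization step still applies before storing them.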

Responsible Use

AlterLab is designed for extracting publicly available data. Always review the terms of service for any website you access, respect robots.txt directives, and ensure your use case complies with applicable laws in your jurisdiction.

Your first scrape.
Sixty seconds.

$1 free credit — up to 5,000 scrapes. No credit card.
Just a POST request.

terminal

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'


No credit card required · $1 free credit, up to 5,000 scrapes · Balance never expires
