Data Extraction

The process of identifying and pulling specific fields from raw web content, converting loosely structured HTML into typed JSON records.

Data extraction is the step after web scraping that transforms raw HTML into usable structured data. Scraping retrieves the page content; extraction identifies and isolates the specific fields needed, such as product name, price, review count, and publication date, and outputs clean, typed records.

Extraction techniques range from CSS selectors and XPath expressions for simple, consistent DOM structures to AI-powered schema extraction for complex or variable layouts. CSS selectors like `div.price > span` and XPath expressions like `//div[@class='price']/span/text()` are precise and fast for stable page structures. The two are not strictly equivalent: `.price` matches any element whose class list includes `price`, while `@class='price'` matches only when the class attribute is exactly `price`. For pages where the DOM changes frequently or the target fields have no stable markup to anchor on, AI extraction accepts a JSON schema definition and uses a language model to locate and return the matching fields.
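To make the selector-based approach concrete, here is a minimal sketch using only Python's standard library. It assumes a small, well-formed sample page (the `PAGE` markup and the `extract_product` helper are hypothetical, not part of AlterLab). `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so real-world HTML would typically need a dedicated HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product snippet; real pages are rarely this tidy.
PAGE = """
<html><body>
  <h1 class="title">Example Widget</h1>
  <div class="price"><span>19.99</span></div>
  <div class="rating"><span>4.5</span></div>
</body></html>
"""


def _text(node):
    """Return the stripped text of an element, or None if it is absent or empty."""
    if node is None or node.text is None:
        return None
    return node.text.strip()


def _number(node):
    """Convert an element's text to float, or None if the element is missing."""
    text = _text(node)
    return float(text) if text is not None else None


def extract_product(markup: str) -> dict:
    """Pull title, price, and rating out of the markup as a typed record."""
    root = ET.fromstring(markup.strip())
    # ElementTree's XPath subset has no text() step, so we locate the
    # element with a path expression and read its .text attribute instead.
    return {
        "title": _text(root.find(".//h1[@class='title']")),
        "price": _number(root.find(".//div[@class='price']/span")),
        "rating": _number(root.find(".//div[@class='rating']/span")),
    }


if __name__ == "__main__":
    print(extract_product(PAGE))
```

The helpers return `None` for missing elements rather than raising, which keeps one absent field from discarding the whole record. That matters in practice because selector-based extraction fails silently when a page's structure shifts.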

AlterLab supports both approaches. For structured, predictable pages, pass CSS selectors or XPath in the request. For variable layouts, pass an `extract_schema` JSON schema and AlterLab returns the extracted fields directly in the response, so no HTML parsing is required on your end.

Examples

# AI-powered schema extraction
{
  "url": "https://example.com/product",
  "extract_schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "price": { "type": "number" },
      "rating": { "type": "number" }
    }
  }
}
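As a rough sketch, the request body above could be assembled in Python, and a returned record checked against the same schema before use. The API endpoint and the exact response shape are not specified here, so this example makes no network call. The helper names (`build_request_body`, `type_errors`) and the sample records are hypothetical.

```python
import json

# Same schema as the request example above.
EXTRACT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number"},
    },
}

# Maps JSON Schema primitive type names to the Python types json.loads produces.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def build_request_body(url: str, schema: dict) -> str:
    """Serialize the url and extract_schema fields into a JSON request body."""
    return json.dumps({"url": url, "extract_schema": schema}, indent=2)


def type_errors(record: dict, schema: dict) -> list:
    """List fields that are missing or have the wrong primitive type.

    This is a deliberately strict local check (assumption: every declared
    property is expected), not a full JSON Schema validator.
    """
    errors = []
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        declared = spec.get("type")
        expected = _JSON_TYPES.get(declared)
        if expected is None:
            continue  # type not covered by this sketch
        value = record[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            errors.append(f"{name}: expected {declared}, got {type(value).__name__}")
    return errors


if __name__ == "__main__":
    print(build_request_body("https://example.com/product", EXTRACT_SCHEMA))
    # Hypothetical extracted record, as it might look after json.loads().
    print(type_errors({"title": "Example Widget", "price": "19.99"}, EXTRACT_SCHEMA))
```

Checking types on the client side is optional, but it catches cases such as a price returned as the string "19.99" before the record reaches downstream code that expects a number.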
