AlterLab
Extract Structured Data from Websites Using AI Instead of CSS Selectors

Learn how to extract structured data from any website using AI-powered extraction. Skip fragile CSS selectors and get clean JSON with a single API call.

Yash Dubey

April 12, 2026

6 min read

The Problem with CSS Selectors

You write a scraper targeting .product-price .amount. It works. Two weeks later, the site ships a redesign and your selector returns null. You inspect the DOM, find the new class, patch your code, and move on. This repeats every few months for every site you scrape.

CSS selectors couple your extraction logic to implementation details you do not control. Class names change. DOM structures shift. A/B tests swap element order. Each change breaks your pipeline silently until you notice missing data downstream.

AI extraction removes this coupling. You describe the data you want in plain text. The model reads the page, understands the semantic structure, and returns clean JSON. No selectors to maintain. No DOM inspection when layouts change.

How AI Extraction Works

The process has three steps:

  1. Fetch the page content (rendered, with JavaScript executed)
  2. Pass the content and your extraction schema to a language model
  3. Return structured JSON matching your schema

The model does not guess. It reads the actual rendered DOM, identifies elements matching your description, and extracts their values. If a product page has a price, name, and rating, you describe those fields and get them back as typed JSON.

Setting Up

Install the Python SDK:

Bash
pip install alterlab

Or use the REST API directly with curl. Both approaches are covered below. You will need an API key from your dashboard.

Example: Extracting Product Data

Suppose you are scraping a product page on an e-commerce site. You need the product name, price, rating, and review count. With CSS selectors, you would inspect the DOM, write four selectors, and hope they survive the next deploy.

With AI extraction, you describe the fields:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://example-store.com/products/wireless-headphones",
    formats=["json"],
    cortex={
        "prompt": "Extract: product_name (string), price (float), rating (float out of 5), review_count (integer)"
    }
)

data = response.json["cortex"]
print(data)

Output:

JSON
{
  "product_name": "Sony WH-1000XM5 Wireless Headphones",
  "price": 348.00,
  "rating": 4.7,
  "review_count": 2841
}

The same request via curl:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-store.com/products/wireless-headphones",
    "formats": ["json"],
    "cortex": {
      "prompt": "Extract: product_name (string), price (float), rating (float out of 5), review_count (integer)"
    }
  }'

Structured Schemas with JSON Schema

For production pipelines, you want type guarantees. Pass a JSON Schema instead of a plain text prompt. The model validates its output against your schema before returning it.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "in_stock": {"type": "boolean"},
                    "sku": {"type": "string"}
                },
                "required": ["name", "price", "in_stock"]
            }
        }
    }
}

response = client.scrape(
    url="https://example-store.com/category/electronics",
    formats=["json"],
    cortex={"prompt": "Extract all products from this category page", "schema": schema}
)

for product in response.json["cortex"]["products"]:
    print(f"{product['name']}: ${product['price']}")

This returns an array of products with typed fields. Missing optional fields are omitted. Required fields are always present. If the model cannot confidently extract a required field, it returns an error you can handle in your pipeline.

Handling Dynamic Content

Many sites load data client-side. A product listing might render empty HTML, then populate via JavaScript fetches. Traditional scrapers that only fetch raw HTML get nothing back.

AI extraction requires the rendered DOM. The platform handles this automatically: it launches a headless browser, waits for the page to stabilize, then passes the rendered content to the model. You do not need to configure wait times or detect network idle.

For sites with aggressive bot detection, the anti-bot bypass layer handles fingerprint rotation, TLS fingerprint matching, and challenge solving before the page ever reaches the extraction step.

When to Use AI Extraction vs CSS Selectors

AI extraction is not a replacement for every scraping pattern. It is a tool for specific scenarios.

Use AI extraction when:

  • The site changes its layout frequently
  • You are prototyping and need data fast
  • The page structure is complex or inconsistent
  • You need to extract from many different sites with one pipeline

Use CSS selectors when:

  • The page structure is stable and predictable
  • You are scraping at very high volume and cost matters
  • You need sub-second response times
  • The data is in simple, consistent locations

You can mix both approaches in the same pipeline. Use AI extraction for complex pages and selectors for stable ones. The Python SDK supports both patterns with the same client interface.
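As a sketch of that mixed pattern, the router below sends volatile sites through AI extraction and parses stable sites with a fixed selector. The per-site `use_ai` flag and the `PriceParser` helper are illustrative assumptions, not part of the SDK; the `client.scrape` call matches the earlier examples.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Tiny selector-style parser: grab the text inside class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()
            self.in_price = False

def extract_price(site, html, client=None):
    """Route per site: selectors for stable markup, AI for volatile markup."""
    if site.get("use_ai"):
        # Volatile layout: describe the field and let the model find it.
        response = client.scrape(
            url=site["url"],
            formats=["json"],
            cortex={"prompt": "Extract: price (float)"},
        )
        return response.json["cortex"]["price"]
    # Stable layout: a fixed selector is faster and cheaper.
    parser = PriceParser()
    parser.feed(html)
    return parser.price
```

The routing decision lives in your site config, so promoting a site from selectors to AI extraction is a one-line change rather than a code rewrite.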

Real-World Pattern: Monitoring Competitor Prices

Here is a practical pipeline that combines scheduling with AI extraction. You want to track prices for a list of competitor products daily.

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

competitors = [
    {"url": "https://competitor-a.com/product/123", "name": "Competitor A"},
    {"url": "https://competitor-b.com/p/abc", "name": "Competitor B"},
]

for competitor in competitors:
    response = client.scrape(
        url=competitor["url"],
        formats=["json"],
        cortex={
            "prompt": "Extract: product_name (string), price (float), availability (string)"
        }
    )

    data = response.json["cortex"]
    print(f"{competitor['name']}: {data['product_name']} @ ${data['price']} - {data['availability']}")

Wrap this in a scheduled job and store results in your database. When prices change, your pipeline detects the delta automatically. The monitoring feature can also handle this natively by watching pages for content changes and pushing diffs to your webhook endpoint.
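The delta-detection step of that job can be sketched as a pure function, with a dict standing in for your database:

```python
def price_deltas(previous, current):
    """Return {name: (old, new)} for every product whose price changed."""
    deltas = {}
    for name, new_price in current.items():
        old_price = previous.get(name)
        if old_price is not None and old_price != new_price:
            deltas[name] = (old_price, new_price)
    return deltas

# Yesterday's stored snapshot vs today's extracted prices.
yesterday = {"Competitor A": 348.00, "Competitor B": 299.99}
today = {"Competitor A": 329.00, "Competitor B": 299.99}
print(price_deltas(yesterday, today))  # changed prices only
```

Products seen for the first time produce no delta; they simply become part of the next snapshot.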

Error Handling

AI extraction can fail when the page does not contain the requested data, the model cannot parse the structure, or the schema validation fails. Handle these cases explicitly:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

try:
    response = client.scrape(
        url="https://example.com/page",
        formats=["json"],
        cortex={"prompt": "Extract: email (string), phone (string)"}
    )

    if "error" in response.json.get("cortex", {}):
        print(f"Extraction failed: {response.json['cortex']['error']}")
    else:
        print(response.json["cortex"])

except alterlab.APIError as e:
    print(f"API error: {e.status_code} - {e.message}")

Common errors include pages that require authentication, content behind CAPTCHAs that exceed your tier, and schemas with impossible constraints. The API returns structured error messages so you can retry, adjust your prompt, or skip the page.
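One way to act on those structured errors is a small retry wrapper. The split between transient and permanent status codes below is an assumption you would adjust to the error codes your account actually sees:

```python
import time

def scrape_with_retry(scrape, retries=3, base_delay=1.0, transient=(429, 502, 503)):
    """Call `scrape()` up to `retries` times, backing off between attempts.

    Transient failures (rate limits, upstream hiccups) are retried with
    exponential backoff; permanent ones (missing data, bad schema) are
    re-raised immediately so the caller can skip or adjust the prompt.
    """
    for attempt in range(retries):
        try:
            return scrape()
        except Exception as e:
            status = getattr(e, "status_code", None)
            if status not in transient or attempt == retries - 1:
                raise  # permanent error, or out of attempts
            time.sleep(base_delay * 2 ** attempt)
```

Usage: `scrape_with_retry(lambda: client.scrape(url=..., formats=["json"], cortex=...))`.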

Performance Considerations

AI extraction adds latency compared to raw HTML fetching. A typical request takes 3-8 seconds depending on page complexity and model load. For most pipelines, this is acceptable. Price monitoring, lead generation, and market research do not require sub-second responses.

If you need speed, use a two-tier approach:

  1. Fetch raw HTML with a basic tier (fast, cheap)
  2. Only escalate to AI extraction when the raw response is insufficient

Set min_tier in your request to skip lower tiers for known-difficult sites. This avoids the retry loop and gets you to the rendering tier on the first attempt.
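A sketch of that two-tier escalation, with `fetch_raw` and `fetch_ai` standing in for the basic-tier and AI-extraction requests, and an emptiness heuristic you would tune per site:

```python
def looks_insufficient(html):
    """Heuristic: client-rendered shells come back short or with an empty body."""
    return html is None or len(html) < 500 or "<body></body>" in html

def fetch_with_escalation(url, fetch_raw, fetch_ai):
    """Tier 1: fast, cheap raw HTML. Tier 2: rendered page + model extraction."""
    html = fetch_raw(url)
    if looks_insufficient(html):
        return fetch_ai(url)  # escalate only when the raw response is a shell
    return html
```

The threshold of 500 characters is an arbitrary placeholder; in practice you would calibrate it against known-good responses for each target site.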

Check the pricing page for current tier costs and rate limits.

Takeaway

CSS selectors tie your scraping logic to markup you do not control. AI extraction breaks that dependency. Describe the data you need, get back typed JSON, and stop maintaining selectors every time a site redesigns.

Use AI extraction for dynamic pages, prototyping, and multi-site pipelines. Use selectors for stable, high-volume targets. Mix both in the same pipeline based on each site's characteristics.

The quickstart guide covers installation and your first request in under five minutes.


Frequently Asked Questions

What is AI-powered web data extraction?

AI-powered web data extraction uses large language models to understand page content and return structured data without requiring CSS selectors or XPath expressions. You describe what you need in plain text, and the model locates and extracts it.

When should I use AI extraction instead of CSS selectors?

Use AI extraction when pages have dynamic class names, frequent layout changes, or complex nested structures that make selectors brittle. CSS selectors work well for stable pages with consistent markup.

Does AI extraction cost more than CSS selectors?

AI extraction costs slightly more per request due to model inference, but it eliminates the ongoing maintenance of broken selectors. For high-volume stable pages, CSS selectors remain more cost-effective.