AlterLab
Extract Structured Data from Websites Using AI Instead of CSS Selectors

Learn how to extract structured data from any website using AI-powered extraction. Skip fragile CSS selectors and get clean JSON with a single API call.

Yash Dubey

April 12, 2026

6 min read

The Problem with CSS Selectors

You write a scraper targeting .product-price .amount. It works. Two weeks later, the site ships a redesign and your selector returns null. You inspect the DOM, find the new class, patch your code, and move on. This repeats every few months for every site you scrape.

CSS selectors couple your extraction logic to implementation details you do not control. Class names change. DOM structures shift. A/B tests swap element order. Each change breaks your pipeline silently until you notice missing data downstream.

AI extraction removes this coupling. You describe the data you want in plain text. The model reads the page, understands the semantic structure, and returns clean JSON. No selectors to maintain. No DOM inspection when layouts change.

How AI Extraction Works

The process has three steps:

  1. Fetch the page content (rendered, with JavaScript executed)
  2. Pass the content and your extraction schema to a language model
  3. Return structured JSON matching your schema

The model does not guess. It reads the actual rendered DOM, identifies elements matching your description, and extracts their values. If a product page has a price, name, and rating, you describe those fields and get them back as typed JSON.

Setting Up

Install the Python SDK:

Bash
pip install alterlab

Or use the REST API directly with curl. Both approaches are covered below. You will need an API key from your dashboard.

Example: Extracting Product Data

Suppose you are scraping a product page on an e-commerce site. You need the product name, price, rating, and review count. With CSS selectors, you would inspect the DOM, write four selectors, and hope they survive the next deploy.

With AI extraction, you describe the fields:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://example-store.com/products/wireless-headphones",
    formats=["json"],
    cortex={
        "prompt": "Extract: product_name (string), price (float), rating (float out of 5), review_count (integer)"
    }
)

data = response.json["cortex"]
print(data)

Output:

JSON
{
  "product_name": "Sony WH-1000XM5 Wireless Headphones",
  "price": 348.00,
  "rating": 4.7,
  "review_count": 2841
}

The same request via curl:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-store.com/products/wireless-headphones",
    "formats": ["json"],
    "cortex": {
      "prompt": "Extract: product_name (string), price (float), rating (float out of 5), review_count (integer)"
    }
  }'

Structured Schemas with JSON Schema

For production pipelines, you want type guarantees. Pass a JSON Schema instead of a plain text prompt. The model validates its output against your schema before returning it.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "in_stock": {"type": "boolean"},
                    "sku": {"type": "string"}
                },
                "required": ["name", "price", "in_stock"]
            }
        }
    }
}

response = client.scrape(
    url="https://example-store.com/category/electronics",
    formats=["json"],
    cortex={"prompt": "Extract all products from this category page", "schema": schema}
)

for product in response.json["cortex"]["products"]:
    print(f"{product['name']}: ${product['price']}")

This returns an array of products with typed fields. Missing optional fields are omitted. Required fields are always present. If the model cannot confidently extract a required field, it returns an error you can handle in your pipeline.

Handling Dynamic Content

Many sites load data client-side. A product listing might render empty HTML, then populate via JavaScript fetches. Traditional scrapers that only fetch raw HTML get nothing back.

AI extraction requires the rendered DOM. The platform handles this automatically: it launches a headless browser, waits for the page to stabilize, then passes the rendered content to the model. You do not need to configure wait times or detect network idle.

For sites with aggressive bot detection, the anti-bot bypass layer handles fingerprint rotation, TLS fingerprint matching, and challenge solving before the page ever reaches the extraction step.

When to Use AI Extraction vs CSS Selectors

AI extraction is not a replacement for every scraping pattern. It is a tool for specific scenarios.

Use AI extraction when:

  • The site changes its layout frequently
  • You are prototyping and need data fast
  • The page structure is complex or inconsistent
  • You need to extract from many different sites with one pipeline

Use CSS selectors when:

  • The page structure is stable and predictable
  • You are scraping at very high volume and cost matters
  • You need sub-second response times
  • The data is in simple, consistent locations

You can mix both approaches in the same pipeline. Use AI extraction for complex pages and selectors for stable ones. The Python SDK supports both patterns with the same client interface.
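As a sketch of that mixed pattern, the router below sends volatile sites through AI extraction and parses stable sites with a fixed selector. The per-site `use_ai` flag and the `PriceParser` helper are illustrative assumptions, not part of the SDK; the `client.scrape` call matches the earlier examples.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Tiny selector-style parser: grab the text inside class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price and self.price is None:
            self.price = data.strip()
            self.in_price = False

def extract_price(site, html, client=None):
    """Route per site: selectors for stable markup, AI for volatile markup."""
    if site.get("use_ai"):
        # Volatile layout: describe the field and let the model find it.
        response = client.scrape(
            url=site["url"],
            formats=["json"],
            cortex={"prompt": "Extract: price (float)"},
        )
        return response.json["cortex"]["price"]
    # Stable layout: a fixed selector is faster and cheaper.
    parser = PriceParser()
    parser.feed(html)
    return parser.price
```

The routing decision lives in your site config, so promoting a site from selectors to AI extraction is a one-line change rather than a code rewrite.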

Real-World Pattern: Monitoring Competitor Prices

Here is a practical pipeline that combines scheduling with AI extraction. You want to track prices for a list of competitor products daily.

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

competitors = [
    {"url": "https://competitor-a.com/product/123", "name": "Competitor A"},
    {"url": "https://competitor-b.com/p/abc", "name": "Competitor B"},
]

for competitor in competitors:
    response = client.scrape(
        url=competitor["url"],
        formats=["json"],
        cortex={
            "prompt": "Extract: product_name (string), price (float), availability (string)"
        }
    )

    data = response.json["cortex"]
    print(f"{competitor['name']}: {data['product_name']} @ ${data['price']} - {data['availability']}")

Wrap this in a scheduled job and store results in your database. When prices change, your pipeline detects the delta automatically. The monitoring feature can also handle this natively by watching pages for content changes and pushing diffs to your webhook endpoint.
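The delta-detection step of that job can be sketched as a pure function, with a dict standing in for your database:

```python
def price_deltas(previous, current):
    """Return {name: (old, new)} for every product whose price changed."""
    deltas = {}
    for name, new_price in current.items():
        old_price = previous.get(name)
        if old_price is not None and old_price != new_price:
            deltas[name] = (old_price, new_price)
    return deltas

# Yesterday's stored snapshot vs today's extracted prices.
yesterday = {"Competitor A": 348.00, "Competitor B": 299.99}
today = {"Competitor A": 329.00, "Competitor B": 299.99}
print(price_deltas(yesterday, today))  # changed prices only
```

Products seen for the first time produce no delta; they simply become part of the next snapshot.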

Error Handling

AI extraction can fail when the page does not contain the requested data, the model cannot parse the structure, or the schema validation fails. Handle these cases explicitly:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

try:
    response = client.scrape(
        url="https://example.com/page",
        formats=["json"],
        cortex={"prompt": "Extract: email (string), phone (string)"}
    )

    if "error" in response.json.get("cortex", {}):
        print(f"Extraction failed: {response.json['cortex']['error']}")
    else:
        print(response.json["cortex"])

except alterlab.APIError as e:
    print(f"API error: {e.status_code} - {e.message}")

Common errors include pages that require authentication, content behind CAPTCHAs that exceed your tier, and schemas with impossible constraints. The API returns structured error messages so you can retry, adjust your prompt, or skip the page.
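One way to act on those structured errors is a small retry wrapper. The split between transient and permanent status codes below is an assumption you would adjust to the error codes your account actually sees:

```python
import time

def scrape_with_retry(scrape, retries=3, base_delay=1.0, transient=(429, 502, 503)):
    """Call `scrape()` up to `retries` times, backing off between attempts.

    Transient failures (rate limits, upstream hiccups) are retried with
    exponential backoff; permanent ones (missing data, bad schema) are
    re-raised immediately so the caller can skip or adjust the prompt.
    """
    for attempt in range(retries):
        try:
            return scrape()
        except Exception as e:
            status = getattr(e, "status_code", None)
            if status not in transient or attempt == retries - 1:
                raise  # permanent error, or out of attempts
            time.sleep(base_delay * 2 ** attempt)
```

Usage: `scrape_with_retry(lambda: client.scrape(url=..., formats=["json"], cortex=...))`.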

Performance Considerations

AI extraction adds latency compared to raw HTML fetching. A typical request takes 3-8 seconds depending on page complexity and model load. For most pipelines, this is acceptable. Price monitoring, lead generation, and market research do not require sub-second responses.

If you need speed, use a two-tier approach:

  1. Fetch raw HTML with a basic tier (fast, cheap)
  2. Only escalate to AI extraction when the raw response is insufficient

Set min_tier in your request to skip lower tiers for known-difficult sites. This avoids the retry loop and gets you to the rendering tier on the first attempt.
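A sketch of that two-tier escalation, with `fetch_raw` and `fetch_ai` standing in for the basic-tier and AI-extraction requests, and an emptiness heuristic you would tune per site:

```python
def looks_insufficient(html):
    """Heuristic: client-rendered shells come back short or with an empty body."""
    return html is None or len(html) < 500 or "<body></body>" in html

def fetch_with_escalation(url, fetch_raw, fetch_ai):
    """Tier 1: fast, cheap raw HTML. Tier 2: rendered page + model extraction."""
    html = fetch_raw(url)
    if looks_insufficient(html):
        return fetch_ai(url)  # escalate only when the raw response is a shell
    return html
```

The threshold of 500 characters is an arbitrary placeholder; in practice you would calibrate it against known-good responses for each target site.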

Check the pricing page for current tier costs and rate limits.

Takeaway

CSS selectors tie your scraping logic to markup you do not control. AI extraction breaks that dependency. Describe the data you need, get back typed JSON, and stop maintaining selectors every time a site redesigns.

Use AI extraction for dynamic pages, prototyping, and multi-site pipelines. Use selectors for stable, high-volume targets. Mix both in the same pipeline based on each site's characteristics.

The quickstart guide covers installation and your first request in under five minutes.


Frequently Asked Questions

What is AI-powered web data extraction?

AI-powered web data extraction uses large language models to understand page content and return structured data without requiring CSS selectors or XPath expressions. You describe what you need in plain text, and the model locates and extracts it.

When should I use AI extraction instead of CSS selectors?

Use AI extraction when pages have dynamic class names, frequent layout changes, or complex nested structures that make selectors brittle. CSS selectors work well for stable pages with consistent markup.

Does AI extraction cost more than CSS selectors?

AI extraction costs slightly more per request due to model inference, but it eliminates the ongoing maintenance of broken selectors. For high-volume stable pages, CSS selectors remain more cost-effective.