Structured Extraction
Extract structured data from web pages using pre-built profiles or custom JSON schemas. Turn messy HTML into clean, predictable JSON.
How It Works

Extraction Methods

- Pre-built Profiles: Ready-to-use templates for common data types
- JSON Schema: Define the exact structure with types and validation
- Natural Language: Describe what you want in plain English
Pre-built Profiles
Use pre-defined extraction profiles for common page types. These are optimized schemas that work out of the box.
| Profile | Extracted Fields | Best For |
|---|---|---|
| `product` | name, price, description, images, ratings, availability | E-commerce product pages |
| `article` | title, author, date, content, summary | News, blogs, documentation |
| `job_posting` | title, company, location, salary, requirements | Job boards, career pages |
| `faq` | questions, answers, categories | FAQ pages, help centers |
| `recipe` | name, ingredients, instructions, time, servings | Recipe websites |
| `event` | name, date, location, description, organizer | Event pages, calendars |
| `auto` | Automatically detected | Unknown page types |
```python
import requests

# Extract product data using the "product" profile
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/product/123",
        "extraction_profile": "product"
    }
)

data = response.json()
product = data["extracted"]

print(f"Name: {product['name']}")
print(f"Price: {product['price']}")
print(f"Rating: {product['rating']}")
```

Custom JSON Schema
Define exactly what data you want using JSON Schema. This gives you full control over field names, types, and structure.
```python
import requests

# Define a custom schema for competitor pricing
schema = {
    "type": "object",
    "properties": {
        "product_name": {
            "type": "string",
            "description": "The full product name"
        },
        "current_price": {
            "type": "number",
            "description": "Current price in USD"
        },
        "original_price": {
            "type": "number",
            "description": "Original price before discount, if any"
        },
        "discount_percent": {
            "type": "number",
            "description": "Discount percentage if on sale"
        },
        "in_stock": {
            "type": "boolean",
            "description": "Whether the product is currently in stock"
        },
        "shipping_info": {
            "type": "string",
            "description": "Shipping time or availability"
        }
    },
    "required": ["product_name", "current_price", "in_stock"]
}

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://competitor.com/product/xyz",
        "extraction_schema": schema
    }
)

data = response.json()
pricing = data["extracted"]

print(f"Product: {pricing['product_name']}")
print(f"Price: ${pricing['current_price']}")
print(f"In Stock: {pricing['in_stock']}")
```

Schema Tips
- Use `description` fields to guide the AI
- Mark important fields as `required`
- Use specific types (`number` vs `string`) for proper formatting
Natural Language Prompts
Feature In Development
Just describe what you want in plain English. Great for quick extraction or when you're not sure of the exact structure.
```python
import requests

# Simple natural language extraction
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://techblog.example.com/article/123",
        "extraction_prompt": "Extract the article title, author name, publication date, and a 2-sentence summary of the main points"
    }
)
data = response.json()
print(data["extracted"])

# More complex extraction
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/category/laptops",
        "extraction_prompt": """
        Extract all laptops listed on this page. For each laptop, get:
        - Product name
        - Price (as a number)
        - Key specs (RAM, storage, processor)
        - Whether it's on sale
        Return as a list of objects.
        """
    }
)
```

Combining Methods
Use a schema for structure and a prompt for additional guidance:
```python
import requests

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "headquarters": {"type": "string"},
        "founded_year": {"type": "integer"},
        "key_products": {
            "type": "array",
            "items": {"type": "string"}
        },
        "leadership": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "title": {"type": "string"}
                }
            }
        }
    }
}

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://company.example.com/about",
        "extraction_schema": schema,
        "extraction_prompt": "Focus on the 'About Us' and 'Leadership' sections. Only include C-level executives in the leadership array."
    }
)
```

Evidence Mode
Enable evidence mode to see exactly where each extracted value came from in the source HTML:
```python
import requests

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/product/123",
        "extraction_profile": "product",
        "evidence": True  # Enable evidence tracking
    }
)
data = response.json()

# Each field now includes provenance
for field, value in data["extracted"].items():
    if isinstance(value, dict) and "evidence" in value:
        print(f"{field}: {value['value']}")
        print(f"  Source: {value['evidence'][:100]}...")
    else:
        print(f"{field}: {value}")
```

Use Cases for Evidence
- Debugging extraction issues
- Verifying data accuracy
- Audit trails for compliance
- Building training data for ML models
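For the audit-trail case, one approach is to flatten each field's value and source snippet into rows and persist them as CSV. This is a sketch assuming the per-field `{"value": ..., "evidence": ...}` shape shown above:

```python
import csv
import io

def audit_rows(extracted):
    """Flatten evidence-mode fields into (field, value, evidence) rows."""
    rows = []
    for field, value in extracted.items():
        if isinstance(value, dict) and "evidence" in value:
            # Truncate long evidence snippets for readable audit logs
            rows.append((field, value["value"], value["evidence"][:200]))
        else:
            # Fields without evidence pass through with an empty source
            rows.append((field, value, ""))
    return rows

def audit_csv(extracted):
    """Serialize the audit rows as a CSV string with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["field", "value", "evidence"])
    writer.writerows(audit_rows(extracted))
    return buf.getvalue()

sample = {
    "name": {"value": "Widget", "evidence": "<h1>Widget</h1>"},
    "price": 9.99,
}
rows = audit_rows(sample)
```

Timestamping each row and appending to a file would turn this into a simple compliance log.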
Real-World Examples
E-commerce Price Monitoring
```json
{
  "url": "https://amazon.com/dp/B0123456789",
  "extraction_schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "price": {"type": "number"},
      "rating": {"type": "number"},
      "review_count": {"type": "integer"},
      "availability": {"type": "string"},
      "seller": {"type": "string"}
    }
  }
}
```

News Article Analysis
```json
{
  "url": "https://news.example.com/article/123",
  "extraction_prompt": "Extract the headline, author, publication date, main topics covered, sentiment (positive/negative/neutral), and a 100-word summary"
}
```

Job Board Scraping
```json
{
  "url": "https://jobs.example.com/listing/456",
  "extraction_profile": "job_posting",
  "extraction_prompt": "Also extract: required years of experience, remote work policy, and tech stack mentioned"
}
```

Best Practices
1. Start with Profiles
Pre-built profiles are optimized and tested. Use them when they match your use case, then customize with prompts if needed.
2. Be Specific in Prompts
"Extract the price" is less effective than "Extract the current sale price in USD as a number, ignoring shipping costs."
3. Use Schemas for Consistency
When scraping many pages, schemas ensure consistent field names and types across all results.
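For example, a batch run can reuse a single schema across many URLs so every result comes back with the same field names. The builder below is pure (no network); each payload would then be POSTed to the `/scrape` endpoint with `requests.post` exactly as in the earlier examples:

```python
def build_payloads(urls, schema):
    """One request payload per URL, all sharing the same extraction schema."""
    return [{"url": url, "extraction_schema": schema} for url in urls]

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
}

payloads = build_payloads(
    ["https://shop.example.com/p/1", "https://shop.example.com/p/2"],
    schema,
)
# Every payload carries the identical schema, so results can be loaded
# straight into one table without per-page field mapping.
```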
4. Handle Missing Data
Not all pages have all fields. Check for null values and handle gracefully.
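For instance, reading fields defensively with `dict.get` avoids a `KeyError` when a page lacks a field; the field names here follow the earlier pricing-schema example:

```python
def normalize_product(extracted):
    """Fill safe defaults for fields a page may not provide."""
    return {
        "name": extracted.get("product_name", "unknown"),
        "price": extracted.get("current_price"),      # None when absent
        "in_stock": extracted.get("in_stock", False),
        "discount": extracted.get("discount_percent", 0),
    }

# A page that omitted stock and discount info still yields a full row
row = normalize_product({"product_name": "Widget", "current_price": 19.99})
```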
5. Test on Sample Pages First
Before running large batches, test your extraction on a few pages to verify the output matches expectations.