Structured Extraction
Turn raw content into clean, typed data using schemas, profiles, and natural language prompts. Works with the standalone /v1/extract endpoint or inline with /v1/scrape.
Extract vs Scrape
AlterLab offers two ways to get structured data. Choose based on whether you already have the content.
| Scenario | Use | Why |
|---|---|---|
| You have a URL, need data | POST /v1/scrape with extraction params | Scrapes the page and extracts in one call. Costs scrape credits + extraction. |
| You already have HTML/text content | POST /v1/extract | No scraping needed — just extraction. Cheaper ($0.0025 per call). |
| Processing cached/archived content | POST /v1/extract | Re-extract from previously scraped content without re-scraping the URL. |
| Processing LLM or OCR output | POST /v1/extract with content_type: "text" | Structure unstructured text from any source. |
| Batch processing many documents | POST /v1/extract in parallel | Feed content from your database or file system without network overhead. |
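The two request shapes can be compared side by side as plain payloads. A minimal sketch: the /v1/extract body mirrors the examples later on this page, while the extraction parameter name on /v1/scrape is an assumption based on the table above.

```python
# A schema shared by both calls (see Extraction Schemas below).
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
}

# Option 1: you only have a URL. Scrape and extract in one call
# (the extraction parameter name on /v1/scrape is an assumption here).
scrape_payload = {
    "url": "https://example.com/product",
    "extraction_schema": schema,
}

# Option 2: you already have the content. Extraction only, no scrape credits.
extract_payload = {
    "content": "<html>...</html>",
    "content_type": "html",
    "extraction_schema": schema,
}

# Either body is POSTed with your X-API-Key header, e.g.:
# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=extract_payload)
```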
Extraction Schemas
An extraction schema is a JSON Schema object that defines the fields you want. AlterLab matches fields using exact matching, case-insensitive matching, field aliases, and type coercion.
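As a rough local illustration of the matching behavior described above (a sketch only, not AlterLab's actual implementation):

```python
def match_field(raw: dict, name: str):
    """Sketch of case-insensitive field matching."""
    lowered = {key.lower(): value for key, value in raw.items()}
    return lowered.get(name.lower())

def coerce(value, expected_type: str):
    """Sketch of the type coercion described on this page."""
    if expected_type == "number":
        # "$99.99" -> 99.99: strip the currency symbol and separators
        return float(str(value).strip().lstrip("$").replace(",", ""))
    if expected_type == "boolean":
        return str(value).strip().lower() in ("true", "yes", "1")
    return str(value)

raw = {"Price": "$99.99", "In_Stock": "true"}
price = coerce(match_field(raw, "price"), "number")        # 99.99
in_stock = coerce(match_field(raw, "in_stock"), "boolean") # True
```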
Basic Schema
Define the fields and their types. The extractor maps them from the content automatically.
{
"type": "object",
"properties": {
"title": { "type": "string" },
"price": { "type": "number" },
"in_stock": { "type": "boolean" },
"tags": {
"type": "array",
"items": { "type": "string" }
}
}
}
Field Descriptions for Better Matching
Add a description to each field to guide the extractor (especially useful with LLM prompts).
{
"type": "object",
"properties": {
"headline": {
"type": "string",
"description": "The main product name or page title"
},
"cost": {
"type": "number",
"description": "The price in USD, without currency symbol"
},
"verdict": {
"type": "string",
"description": "One-sentence summary of the review"
}
}
}
Nested Objects and Arrays
Schemas support nesting for complex data structures.
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": job_listing_html,
"content_type": "html",
"extraction_schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"company": {"type": "string"},
"salary": {
"type": "object",
"properties": {
"min": {"type": "number"},
"max": {"type": "number"},
"currency": {"type": "string"}
}
},
"requirements": {
"type": "array",
"items": {"type": "string"}
},
"benefits": {
"type": "array",
"items": {"type": "string"}
}
}
}
}
)
job = response.json()["formats"]["json"]
print(f"{job['title']} at {job['company']}")
print(f"Salary: {job['salary']['min']}-{job['salary']['max']} {job['salary']['currency']}")
Type Coercion
"$99.99" is coerced to 99.99 when the schema expects a number. Strings like "true" become true for boolean fields.
Extraction Profiles
Profiles are pre-built schemas for common page types. They save you from writing a schema when the data follows standard patterns.
# Extract product data — no custom schema needed
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": product_page_html,
"extraction_profile": "product"
}
)
product = response.json()["formats"]["json"]
# Returns: name, price, currency, images, rating, availability, brand, description
Profile + Schema
A profile can be combined with a custom extraction_schema to narrow the output to just the fields you need; see Combining Methods below.
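A minimal sketch of pairing a profile with a schema, following the same request shape as the profile-only example above:

```python
# The profile supplies the extraction strategy; the schema filters the
# profile's output down to just these typed fields.
payload = {
    "content": "<html>...</html>",   # your product page HTML
    "content_type": "html",
    "extraction_profile": "product",
    "extraction_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
        },
    },
}

# POSTed the same way as the other examples on this page:
# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=payload)
```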
Extraction Prompts
When algorithmic extraction is not enough, use an extraction_prompt to invoke an LLM. This is ideal for unstructured text, complex reasoning, or custom transformations.
When to Use Prompts
| Scenario | Method | Why |
|---|---|---|
| HTML with Schema.org / Open Graph | Schema or profile (algorithmic) | Faster, cheaper, deterministic. No LLM needed. |
| Plain text or unstructured content | Prompt + schema | LLM understands natural language context. |
| Summarization or interpretation | Prompt only | Schema optional — prompt tells the LLM what output to produce. |
| Complex multi-field extraction | Prompt + schema | Schema ensures typed output. Prompt provides reasoning context. |
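For the prompt-only row above, the schema can simply be omitted. A sketch of that request body (the sample content is a placeholder):

```python
# Prompt-only extraction: no schema, so the prompt alone tells the LLM
# what output to produce.
payload = {
    "content": "Quarterly revenue rose 12% on strong cloud demand...",
    "content_type": "text",
    "extraction_prompt": (
        "Summarize this passage in one sentence and note the overall tone."
    ),
}

# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=payload)
```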
Writing Effective Prompts
# Summarize and extract key points from an article
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": long_article_text,
"content_type": "text",
"extraction_prompt": (
"Summarize this article in 2-3 sentences. "
"Extract the 3 most important claims and whether "
"evidence is provided for each."
),
"extraction_schema": {
"type": "object",
"properties": {
"summary": {"type": "string"},
"claims": {
"type": "array",
"items": {
"type": "object",
"properties": {
"claim": {"type": "string"},
"has_evidence": {"type": "boolean"}
}
}
}
}
}
}
)
Evidence Mode
Evidence mode tracks where each extracted value came from in the source content. This is useful for auditing, debugging extraction quality, and building trust in extracted data.
# Enable evidence tracking
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": html_content,
"content_type": "html",
"extraction_schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"rating": {"type": "number"}
}
},
"evidence": True
}
)
data = response.json()
# The json output includes provenance metadata for each field
print(data["formats"]["json"])
Combining Methods
You can combine schemas, profiles, and prompts in a single request. The extraction pipeline processes them in order:
Content Transformation
Content is converted to the requested formats (text, json, markdown, etc.), using the profile, if one is provided, to choose the extraction strategy.
Schema Filtering
If extraction_schema is provided, the JSON output is filtered to match your schema with type coercion.
LLM Extraction
If extraction_prompt is provided, the LLM processes the text and replaces the JSON output with its structured result.
# Profile + schema + prompt — full pipeline
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": product_review_html,
"content_type": "html",
"extraction_profile": "product", # Step 1: use product strategy
"extraction_schema": { # Step 2: filter to these fields
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"sentiment": {"type": "string"}
}
},
"extraction_prompt": ( # Step 3: LLM adds sentiment
"Extract the product name and price. "
"Also analyze the overall review sentiment "
"as 'positive', 'negative', or 'mixed'."
),
"formats": ["json", "text"]
}
)
Bulk Extraction
Process multiple documents by calling the Extract endpoint in parallel. Since there is no network fetching, responses are fast.
import asyncio
import aiohttp
API_KEY = "YOUR_API_KEY"
EXTRACT_URL = "https://api.alterlab.io/api/v1/extract"
async def extract_one(session, content, schema):
async with session.post(
EXTRACT_URL,
headers={"X-API-Key": API_KEY},
json={
"content": content,
"content_type": "html",
"extraction_schema": schema,
},
) as resp:
return await resp.json()
async def extract_bulk(documents, schema, max_concurrent=10):
semaphore = asyncio.Semaphore(max_concurrent)
async with aiohttp.ClientSession() as session:
async def limited(doc):
async with semaphore:
return await extract_one(session, doc, schema)
return await asyncio.gather(*[limited(doc) for doc in documents])
# Usage
schema = {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
},
}
documents = [html_1, html_2, html_3] # Your HTML content
results = asyncio.run(extract_bulk(documents, schema))
for r in results:
print(r["formats"]["json"])
Best Practices
Prefer algorithmic extraction over LLM
For HTML with Schema.org or Open Graph metadata, use extraction_schema or extraction_profile without a prompt. Algorithmic extraction is faster, cheaper, and deterministic.
Use the right content_type
Setting content_type correctly helps the pipeline parse your content optimally. HTML gets full DOM analysis; text and markdown get simpler processing.
Add descriptions to schema fields
When using extraction_prompt, field descriptions in the schema help the LLM understand what each field should contain.
Keep prompts specific and concise
Short, clear prompts produce better results than verbose ones. Tell the LLM exactly what to extract — not what the content is about. Max 2,000 characters.
Mind the content size limit
Content over 200K characters incurs double cost. For LLM extraction, content is truncated to 30K characters. If your content is very large, consider extracting the relevant section before calling the API.
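One way to pre-trim content client-side using the thresholds above; the helper itself is illustrative, not part of any SDK.

```python
def clamp_content(content: str, llm: bool = False) -> str:
    """Illustrative pre-trim using the documented limits: content over
    200K chars doubles the cost, and LLM extraction only sees 30K."""
    limit = 30_000 if llm else 200_000
    if len(content) <= limit:
        return content
    # Cut at the last paragraph break before the limit so we don't
    # split mid-sentence (a simple heuristic, not part of the API).
    cut = content.rfind("\n\n", 0, limit)
    return content[: cut if cut > 0 else limit]

big = "para\n\n" * 50_000   # ~300K characters
trimmed = clamp_content(big)
```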
Use source_url for context
When extracting from content you scraped earlier, pass the original URL as source_url. The LLM uses this for context — it is not fetched.
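A sketch of passing the original URL alongside previously scraped content (the URL and content here are placeholders):

```python
cached_html = "<html>...</html>"     # content you scraped earlier

# source_url gives the LLM context about where the content came from;
# AlterLab does not fetch it.
payload = {
    "content": cached_html,
    "content_type": "html",
    "extraction_prompt": "Summarize the page in one sentence.",
    "source_url": "https://example.com/blog/post-42",
}

# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=payload)
```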