
    Structured Extraction

    Turn raw content into clean, typed data using schemas, profiles, and natural language prompts. Works with the standalone /v1/extract endpoint or inline with /v1/scrape.

    Prerequisite

    This guide covers extraction patterns and workflows. For the full parameter reference, see the Extract API Reference.

    Extract vs Scrape

    AlterLab offers two ways to get structured data. Choose based on whether you already have the content.

    | Scenario | Use | Why |
    | --- | --- | --- |
    | You have a URL and need data | POST /v1/scrape with extraction params | Scrapes the page and extracts in one call. Costs scrape credits + extraction. |
    | You already have HTML/text content | POST /v1/extract | No scraping needed — just extraction. Cheaper ($0.0025 per call). |
    | Processing cached/archived content | POST /v1/extract | Re-extract from previously scraped content without re-scraping the URL. |
    | Processing LLM or OCR output | POST /v1/extract with content_type: "text" | Structure unstructured text from any source. |
    | Batch processing many documents | POST /v1/extract in parallel | Feed content from your database or file system without network overhead. |
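As a concrete sketch, a minimal /v1/extract request body uses only the parameters described in this guide (the sample content is illustrative):

```json
{
  "content": "<h1>Acme Widget</h1><p>Price: $19.99</p>",
  "content_type": "html",
  "extraction_schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "price": { "type": "number" }
    }
  }
}
```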

    Extraction Schemas

    An extraction schema is a JSON Schema object that defines the fields you want. AlterLab matches fields using exact matching, case-insensitive matching, field aliases, and type coercion.

    Basic Schema

    Define the fields and their types. The extractor maps them from the content automatically.

    JSON
    {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "price": { "type": "number" },
        "in_stock": { "type": "boolean" },
        "tags": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }
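Because extraction_schema is plain JSON Schema, you can sanity-check responses against it locally. The helper below is an illustrative stdlib-only sketch, not part of the AlterLab SDK:

```python
import json

# Map JSON Schema type names to Python types (shallow check only)
PY_TYPES = {"string": str, "number": (int, float), "boolean": bool, "array": list}

def matches(schema: dict, data: dict) -> bool:
    """Shallow check that each field present in data has the declared JSON type."""
    for name, spec in schema["properties"].items():
        if name in data and not isinstance(data[name], PY_TYPES[spec["type"]]):
            return False
    return True

schema = json.loads("""{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "price": {"type": "number"},
    "in_stock": {"type": "boolean"},
    "tags": {"type": "array", "items": {"type": "string"}}
  }
}""")

print(matches(schema, {"title": "Acme Widget", "price": 19.99}))  # True
```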

    Field Descriptions for Better Matching

    Add a description to each field to guide the extractor (especially useful with LLM prompts).

    JSON
    {
      "type": "object",
      "properties": {
        "headline": {
          "type": "string",
          "description": "The main product name or page title"
        },
        "cost": {
          "type": "number",
          "description": "The price in USD, without currency symbol"
        },
        "verdict": {
          "type": "string",
          "description": "One-sentence summary of the review"
        }
      }
    }

    Nested Objects and Arrays

    Schemas support nesting for complex data structures.

    Python
    import requests
    
    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": job_listing_html,
            "content_type": "html",
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "company": {"type": "string"},
                    "salary": {
                        "type": "object",
                        "properties": {
                            "min": {"type": "number"},
                            "max": {"type": "number"},
                            "currency": {"type": "string"}
                        }
                    },
                    "requirements": {
                        "type": "array",
                        "items": {"type": "string"}
                    },
                    "benefits": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                }
            }
        }
    )
    
    job = response.json()["formats"]["json"]
    print(f"{job['title']} at {job['company']}")
    print(f"Salary: {job['salary']['min']}-{job['salary']['max']} {job['salary']['currency']}")

    Type Coercion

    AlterLab automatically converts types where possible. A price like "$99.99" is coerced to 99.99 when the schema expects a number. Strings like "true" become true for boolean fields.
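The coercion rules can be illustrated with a small local sketch. This mimics the behavior described above for demonstration; it is not AlterLab's actual implementation:

```python
import re

def coerce(value, target_type):
    """Best-effort conversion mirroring the coercion examples above."""
    if target_type == "number" and isinstance(value, str):
        cleaned = re.sub(r"[^\d.\-]", "", value)  # strip "$", ",", and other symbols
        return float(cleaned)
    if target_type == "boolean" and isinstance(value, str):
        return value.strip().lower() in ("true", "yes", "1")
    return value  # already the right type, or no rule applies

print(coerce("$99.99", "number"))   # 99.99
print(coerce("true", "boolean"))    # True
```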

    Extraction Profiles

    Profiles are pre-built schemas for common page types. They save you from writing a schema when the data follows standard patterns.

    Python
    import requests
    
    # Extract product data — no custom schema needed
    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": product_page_html,
            "extraction_profile": "product"
        }
    )
    
    product = response.json()["formats"]["json"]
    # Returns: name, price, currency, images, rating, availability, brand, description

    Profile + Schema

    You can combine a profile with a custom schema. The profile determines the extraction strategy, and the schema filters the output to only the fields you need.
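For instance, the request body below (a sketch using the parameters from this guide) applies the product profile but keeps only two fields:

```json
{
  "content": "<html>…</html>",
  "content_type": "html",
  "extraction_profile": "product",
  "extraction_schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "price": { "type": "number" }
    }
  }
}
```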

    Extraction Prompts

    When algorithmic extraction is not enough, use an extraction_prompt to invoke an LLM. This is ideal for unstructured text, complex reasoning, or custom transformations.

    When to Use Prompts

    | Scenario | Method | Why |
    | --- | --- | --- |
    | HTML with Schema.org / Open Graph | Schema or profile (algorithmic) | Faster, cheaper, deterministic. No LLM needed. |
    | Plain text or unstructured content | Prompt + schema | LLM understands natural language context. |
    | Summarization or interpretation | Prompt only | Schema optional — prompt tells the LLM what output to produce. |
    | Complex multi-field extraction | Prompt + schema | Schema ensures typed output. Prompt provides reasoning context. |

    Writing Effective Prompts

    Python
    import requests
    
    # Summarize and extract key points from an article
    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": long_article_text,
            "content_type": "text",
            "extraction_prompt": (
                "Summarize this article in 2-3 sentences. "
                "Extract the 3 most important claims and whether "
                "evidence is provided for each."
            ),
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "summary": {"type": "string"},
                    "claims": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "claim": {"type": "string"},
                                "has_evidence": {"type": "boolean"}
                            }
                        }
                    }
                }
            }
        }
    )

    Evidence Mode

    Evidence mode tracks where each extracted value came from in the source content. This is useful for auditing, debugging extraction quality, and building trust in extracted data.

    Python
    import requests
    
    # Enable evidence tracking
    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": html_content,
            "content_type": "html",
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "price": {"type": "number"},
                    "rating": {"type": "number"}
                }
            },
            "evidence": True
        }
    )
    
    data = response.json()
    # The json output includes provenance metadata for each field
    print(data["formats"]["json"])

    When to Use Evidence

    Evidence mode is most valuable when you need to verify extraction accuracy, debug mismatched fields, or provide an audit trail for compliance-sensitive data.

    Combining Methods

    You can combine schemas, profiles, and prompts in a single request. The extraction pipeline processes them in order:

    1. Content Transformation: content is converted to the requested formats (text, json, markdown, etc.), using the profile to pick the extraction strategy.

    2. Schema Filtering: if extraction_schema is provided, the JSON output is filtered to match your schema, with type coercion applied.

    3. LLM Extraction: if extraction_prompt is provided, the LLM processes the text and replaces the JSON output with its structured result.

    Python
    import requests
    
    # Profile + schema + prompt — full pipeline
    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": product_review_html,
            "content_type": "html",
            "extraction_profile": "product",       # Step 1: use product strategy
            "extraction_schema": {                  # Step 2: filter to these fields
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "sentiment": {"type": "string"}
                }
            },
            "extraction_prompt": (                  # Step 3: LLM adds sentiment
                "Extract the product name and price. "
                "Also analyze the overall review sentiment "
                "as 'positive', 'negative', or 'mixed'."
            ),
            "formats": ["json", "text"]
        }
    )

    Bulk Extraction

    Process multiple documents by calling the Extract endpoint in parallel. Since there is no network fetching, responses are fast.

    Python
    import asyncio
    import aiohttp
    
    API_KEY = "YOUR_API_KEY"
    EXTRACT_URL = "https://api.alterlab.io/api/v1/extract"
    
    async def extract_one(session, content, schema):
        async with session.post(
            EXTRACT_URL,
            headers={"X-API-Key": API_KEY},
            json={
                "content": content,
                "content_type": "html",
                "extraction_schema": schema,
            },
        ) as resp:
            return await resp.json()
    
    async def extract_bulk(documents, schema, max_concurrent=10):
        semaphore = asyncio.Semaphore(max_concurrent)
        async with aiohttp.ClientSession() as session:
            async def limited(doc):
                async with semaphore:
                    return await extract_one(session, doc, schema)
            return await asyncio.gather(*[limited(doc) for doc in documents])
    
    # Usage
    schema = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "price": {"type": "number"},
        },
    }
    
    documents = [html_1, html_2, html_3]  # Your HTML content
    results = asyncio.run(extract_bulk(documents, schema))
    
    for r in results:
        print(r["formats"]["json"])
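If you have many thousands of documents, you may also want to process them in fixed-size chunks so you can persist results between batches. This is a generic helper, not part of the AlterLab API:

```python
def chunked(items, size):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

batches = list(chunked(["a", "b", "c", "d", "e"], 2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Call extract_bulk on each chunk and write results out before moving to the next one.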

    Best Practices

    Prefer algorithmic extraction over LLM

    For HTML with Schema.org or Open Graph metadata, use extraction_schema or extraction_profile without a prompt. Algorithmic extraction is faster, cheaper, and deterministic.

    Use the right content_type

    Setting content_type correctly helps the pipeline parse your content optimally. HTML gets full DOM analysis; text and markdown get simpler processing.

    Add descriptions to schema fields

    When using extraction_prompt, field descriptions in the schema help the LLM understand what each field should contain.

    Keep prompts specific and concise

    Short, clear prompts produce better results than verbose ones. Tell the LLM exactly what to extract — not what the content is about. Max 2,000 characters.

    Mind the content size limit

    Content over 200K characters incurs double cost. For LLM extraction, content is truncated to 30K characters. If your content is very large, consider extracting the relevant section before calling the API.
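A pre-flight check can catch oversized content before it costs you. The thresholds below are the figures quoted in this guide; the helper itself is a hypothetical sketch:

```python
LARGE_CONTENT = 200_000   # characters; billed at double cost above this (per this guide)
LLM_TRUNCATION = 30_000   # characters; LLM extraction truncates here (per this guide)

def size_warnings(content: str, uses_prompt: bool = False) -> list[str]:
    """Return warnings for content that will cost extra or be truncated."""
    warnings = []
    if len(content) > LARGE_CONTENT:
        warnings.append("content exceeds 200K chars: billed at double cost")
    if uses_prompt and len(content) > LLM_TRUNCATION:
        warnings.append("content exceeds 30K chars: LLM will see a truncated version")
    return warnings

print(size_warnings("x" * 250_000, uses_prompt=True))
```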

    Use source_url for context

    When extracting from content you scraped earlier, pass the original URL as source_url. The LLM uses this for context — it is not fetched.

    Last updated: March 2026
