
    Extract API

    Extract structured data from raw HTML, text, or markdown content without scraping a URL. Bring your own data and let AlterLab handle the extraction.

    Bring Your Own Data

    The Extract endpoint is designed for users who already have HTML or text content and want to pull structured data from it — no scraping needed. For scraping + extraction in one call, see the REST API.

    Overview

    1. Send Content: POST raw HTML, text, or markdown to /api/v1/extract. Specify the content type so the pipeline knows how to parse it.

    2. Define What to Extract: Use an extraction_schema (JSON Schema), an extraction_profile (pre-built template), or an extraction_prompt (natural language) to describe the output structure.

    3. Get Structured Data: Receive clean, typed JSON matching your schema, plus optional text, markdown, or RAG-ready output formats.

    POST /api/v1/extract

    Extract structured data from raw content. Returns the extraction result synchronously.

    Bash
    curl -X POST https://api.alterlab.io/api/v1/extract \
      -H "X-API-Key: your_api_key" \
      -H "Content-Type: application/json" \
      -d '{
        "content": "<html><body><h1>Widget Pro</h1><span class=\"price\">$49.99</span></body></html>",
        "content_type": "html",
        "extraction_schema": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"}
          }
        }
      }'

    Request Body

    | Parameter | Type | Required | Description |
    | --- | --- | --- | --- |
    | content | string | Yes | Raw content to extract from. Max 5 MB. Must not be blank. |
    | content_type | string | No | Type of content: html (default), text, or markdown. |
    | extraction_schema | object | No | JSON Schema defining the output structure. Fields are mapped from content using algorithmic matching, type coercion, and field aliases. |
    | extraction_profile | string | No | Pre-built extraction template. One of: auto, product, article, job_posting, faq, recipe, event. |
    | extraction_prompt | string | No | Natural language instructions for LLM extraction. Max 2,000 characters. When provided, an LLM processes the content. |
    | formats | string[] | No | Output formats. Options: json (default), text, markdown, html, json_v2, rag. |
    | source_url | string | No | Original URL of the content. Used as context for the LLM; not fetched. |
    | evidence | boolean | No | When true, include field provenance tracking showing where each extracted value came from. Default: false. |
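    The limits above can be enforced client-side before spending a network round trip. The sketch below is illustrative (validate_extract_request is a hypothetical helper, not part of any AlterLab SDK); it mirrors only the constraints documented in the table.

```python
# Hypothetical client-side check mirroring the documented limits.
ALLOWED_CONTENT_TYPES = {"html", "text", "markdown"}
MAX_CONTENT_BYTES = 5 * 1024 * 1024  # 5 MB
MAX_PROMPT_CHARS = 2_000             # extraction_prompt limit

def validate_extract_request(body: dict) -> list[str]:
    """Return a list of validation errors (empty list means the body looks OK)."""
    errors = []
    content = body.get("content", "")
    if not content or not content.strip():
        errors.append("content must not be blank")
    elif len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        errors.append("content exceeds the 5 MB limit")
    if body.get("content_type", "html") not in ALLOWED_CONTENT_TYPES:
        errors.append("content_type must be html, text, or markdown")
    prompt = body.get("extraction_prompt")
    if prompt is not None and len(prompt) > MAX_PROMPT_CHARS:
        errors.append("extraction_prompt exceeds 2,000 characters")
    return errors
```

    Failing fast on these checks avoids paying for a request the API would reject with a 400 or 422 anyway.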

    Content Types

    | Type | Use For | Notes |
    | --- | --- | --- |
    | html | Full or partial HTML pages, scraped content | Best extraction quality. Schema.org, Open Graph, and DOM structure are all used for matching. |
    | text | Plain text, OCR output, transcripts | Works best with extraction_prompt for LLM-based extraction. |
    | markdown | Markdown documents, LLM output, wiki pages | Preserves markdown formatting in the markdown output format. |
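    The API expects you to pass content_type explicitly; it does no guessing. If your pipeline mixes sources, a rough client-side heuristic can pick a value before calling the endpoint. This is a sketch with hypothetical logic, not anything the API provides:

```python
import re

def guess_content_type(content: str) -> str:
    """Rough heuristic: prefer html, then markdown, falling back to text."""
    stripped = content.lstrip()
    # Looks like HTML if it starts with '<' and contains at least one tag.
    if stripped.startswith("<") and re.search(r"</?[a-zA-Z][^>]*>", stripped):
        return "html"
    # Common markdown markers: headings, list bullets, fenced code, links.
    if re.search(r"(?m)^(#{1,6} |[-*] |```)|\[[^\]]+\]\([^)]+\)", content):
        return "markdown"
    return "text"
```

    A heuristic like this will misclassify edge cases (e.g., markdown that embeds raw HTML), so when you know the source, set content_type yourself.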

    Extraction Profiles

    Profiles are pre-built extraction templates that know which fields to look for. Use them when you want structured data without writing a custom schema.

    | Profile | Fields Extracted | Best For |
    | --- | --- | --- |
    | auto | Detects page type automatically | General-purpose extraction |
    | product | name, price, currency, images, rating, availability, brand, description | E-commerce product pages |
    | article | title, author, published_date, content, summary, images | News articles, blog posts |
    | job_posting | title, company, location, salary, description, requirements | Job listing pages |
    | faq | questions, answers (as question/answer pairs) | FAQ and help pages |
    | recipe | name, ingredients, instructions, cook_time, servings, nutrition | Recipe pages |
    | event | name, date, location, description, organizer, price | Event listing pages |
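    Because profiles extract a known field list, you can check results for completeness after the call. The helper below is hypothetical (not part of the API); the field lists are copied from the table above, shown here for three profiles:

```python
# Expected fields per profile, taken from the profile table above.
PROFILE_FIELDS = {
    "product": ["name", "price", "currency", "images", "rating",
                "availability", "brand", "description"],
    "article": ["title", "author", "published_date", "content",
                "summary", "images"],
    "job_posting": ["title", "company", "location", "salary",
                    "description", "requirements"],
}

def missing_profile_fields(profile: str, extracted: dict) -> list[str]:
    """Flag expected fields that came back absent, null, or empty.
    Many misses usually mean a sparse or mismatched source page."""
    expected = PROFILE_FIELDS.get(profile, [])
    return [f for f in expected if extracted.get(f) in (None, "", [])]
```

    Running this on each response is a cheap way to catch pages where the chosen profile does not fit the content.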

    Response

    | Field | Type | Description |
    | --- | --- | --- |
    | extract_id | string | Unique identifier for this extraction (e.g., ext_a1b2c3d4...) |
    | formats | object | Extraction results keyed by requested format (e.g., {"json": {...}, "text": "..."}) |
    | credits_used | integer | Credits consumed in microcents (e.g., 2500 = $0.0025) |
    | model_used | string or null | LLM model used, if extraction_prompt was provided. Null for algorithmic extraction. |
    | extraction_method | string | Method used: algorithmic, llm, playbook, or lossless. |
    | content_size_chars | integer | Size of the input content in characters. |
    JSON
    {
      "extract_id": "ext_a1b2c3d4e5f6a7b8c9d0e1f2",
      "formats": {
        "json": {
          "name": "Widget Pro",
          "price": 49.99
        }
      },
      "credits_used": 2500,
      "model_used": null,
      "extraction_method": "algorithmic",
      "content_size_chars": 1234
    }
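    The credits_used field is denominated in microcents. Following the table's own example (2500 = $0.0025), a one-line converter (the function name is ours, not the API's) turns it into dollars:

```python
def credits_to_usd(credits_used: int) -> float:
    """Convert credits_used (microcents) to US dollars.
    Per the response table, 2500 microcents corresponds to $0.0025."""
    return credits_used / 1_000_000
```

    This is handy when aggregating spend across many extractions, e.g. summing credits_used over a batch before converting once.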

    Credit Model

    | Scenario | Cost | Notes |
    | --- | --- | --- |
    | Base extraction | 2,500 microcents ($0.0025) | Equivalent to a Tier 3.5 scrape; applies to all extractions. |
    | Large content (>200K chars) | 5,000 microcents ($0.005) | Double cost for content exceeding ~50K tokens. |

    No Scraping Cost

    Since you provide the content, there is no scraping tier cost. Extract is typically 2-10x cheaper than scrape + extract for the same data.
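    The credit model above depends only on content length, so cost can be estimated before sending anything. The helper below is a sketch based solely on the two rows of the table (the function name is hypothetical):

```python
def estimate_extract_cost(content: str) -> int:
    """Estimate extraction cost in microcents from the documented credit model:
    2,500 base, doubled to 5,000 when content exceeds 200K characters."""
    return 5_000 if len(content) > 200_000 else 2_500
```

    For batch jobs, summing this estimate over all inputs gives a quick upper bound to compare against your remaining credit balance.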

    Error Codes

    | Status | Error | Description |
    | --- | --- | --- |
    | 400 | validation_error | Invalid request body: missing content, invalid content_type, or malformed schema. |
    | 401 | unauthorized | Missing or invalid API key. |
    | 402 | insufficient_credits | Not enough credits for the extraction. |
    | 422 | unprocessable_entity | Content is blank or exceeds the 5 MB limit. |

    Examples

    Schema-Based Extraction

    Define a JSON Schema to extract specific fields from HTML content.

    Python
    import requests
    
    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": """<html>
              <body>
                <h1>Widget Pro</h1>
                <span class="price">$49.99</span>
                <p class="desc">The ultimate widget for professionals.</p>
                <span class="rating">4.8 out of 5</span>
              </body>
            </html>""",
            "content_type": "html",
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "description": {"type": "string"},
                    "rating": {"type": "number"}
                }
            }
        }
    )
    
    data = response.json()
    print(data["formats"]["json"])
    # {"name": "Widget Pro", "price": 49.99, "description": "The ultimate...", "rating": 4.8}

    Profile-Based Extraction

    Use a pre-built profile to extract common data types without writing a schema.

    Python
    # Extract product data using the built-in profile
    import requests

    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": scraped_html,  # HTML you already have
            "content_type": "html",
            "extraction_profile": "product",
            "formats": ["json", "text"]
        }
    )
    
    data = response.json()
    product = data["formats"]["json"]
    print(f"{product['name']} - {product['price']} {product['currency']}")

    LLM Extraction with Prompt

    Use natural language to tell the LLM what to extract. Combine with a schema for typed output.

    Python
    # Extract with natural language instructions
    import requests

    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": article_text,
            "content_type": "text",
            "extraction_prompt": "Extract the main argument, key evidence points, and the author's conclusion.",
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "main_argument": {"type": "string"},
                    "evidence": {
                        "type": "array",
                        "items": {"type": "string"}
                    },
                    "conclusion": {"type": "string"}
                }
            }
        }
    )
    
    data = response.json()
    print(f"Method: {data['extraction_method']}")  # "llm"
    print(f"Model: {data['model_used']}")
    print(data["formats"]["json"])

    Evidence Mode

    Enable evidence tracking to see where each extracted value came from in the source content.

    Python
    # Track field provenance with evidence mode
    import requests

    response = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": html_content,
            "content_type": "html",
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "price": {"type": "number"}
                }
            },
            "evidence": True
        }
    )
    
    data = response.json()
    # Evidence is included in the json output alongside extracted values
    print(data["formats"]["json"])
    Last updated: March 2026
