
    Structured Extraction

    Extract structured data from web pages using pre-built profiles or custom JSON schemas. Turn messy HTML into clean, predictable JSON.

    How It Works

    AlterLab extracts data according to your schema specifications. Pre-built profiles work out of the box for common page types, or define custom schemas for exact control over field names and types.

    Extraction Methods

    • Pre-built Profiles: ready-to-use templates for common data types. Easiest.
    • JSON Schema: define the exact structure with types and validation. Most control.
    • Natural Language: describe what you want in plain English. Most flexible.

    Pre-built Profiles

    Use pre-defined extraction profiles for common page types. These are optimized schemas that work out of the box.

    Profile       Extracted Fields                                          Best For
    product       name, price, description, images, ratings, availability   E-commerce product pages
    article       title, author, date, content, summary                     News, blogs, documentation
    job_posting   title, company, location, salary, requirements            Job boards, career pages
    faq           questions, answers, categories                            FAQ pages, help centers
    recipe        name, ingredients, instructions, time, servings           Recipe websites
    event         name, date, location, description, organizer              Event pages, calendars
    auto          (automatically detected)                                  Unknown page types
    Python
    import requests
    
    # Extract product data using the product profile
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://shop.example.com/product/123",
            "extraction_profile": "product"
        }
    )
    
    data = response.json()
    product = data["extracted"]
    
    print(f"Name: {product['name']}")
    print(f"Price: {product['price']}")
    print(f"Ratings: {product['ratings']}")

    Custom JSON Schema

    Define exactly what data you want using JSON Schema. This gives you full control over field names, types, and structure.

    Python
    import requests
    
    # Define a custom schema for competitor pricing
    schema = {
        "type": "object",
        "properties": {
            "product_name": {
                "type": "string",
                "description": "The full product name"
            },
            "current_price": {
                "type": "number",
                "description": "Current price in USD"
            },
            "original_price": {
                "type": "number",
                "description": "Original price before discount, if any"
            },
            "discount_percent": {
                "type": "number",
                "description": "Discount percentage if on sale"
            },
            "in_stock": {
                "type": "boolean",
                "description": "Whether the product is currently in stock"
            },
            "shipping_info": {
                "type": "string",
                "description": "Shipping time or availability"
            }
        },
        "required": ["product_name", "current_price", "in_stock"]
    }
    
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://competitor.com/product/xyz",
            "extraction_schema": schema
        }
    )
    
    data = response.json()
    pricing = data["extracted"]
    
    print(f"Product: {pricing['product_name']}")
    print(f"Price: ${pricing['current_price']}")
    print(f"In Stock: {pricing['in_stock']}")

    Schema Tips

    • Use description fields to guide the AI
    • Mark important fields as required
    • Use specific types (number vs string) for proper formatting
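    The tips above can be checked locally before you send a schema. This is a minimal sketch in plain Python; lint_schema is a hypothetical helper for illustration, not part of any AlterLab SDK:

    ```python
    def lint_schema(schema: dict) -> list[str]:
        """Return warnings for common schema mistakes."""
        warnings = []
        props = schema.get("properties", {})
        for name, spec in props.items():
            if "type" not in spec:
                warnings.append(f"{name}: missing 'type'")
            if "description" not in spec:
                warnings.append(f"{name}: no 'description' to guide extraction")
        for name in schema.get("required", []):
            if name not in props:
                warnings.append(f"required field '{name}' not defined in properties")
        return warnings

    schema = {
        "type": "object",
        "properties": {
            "price": {"type": "number", "description": "Current price in USD"},
            "title": {},  # missing both type and description
        },
        "required": ["price", "sku"],  # 'sku' is never defined
    }
    for warning in lint_schema(schema):
        print(warning)
    ```

    Running a check like this on each schema catches typos in required lists before they cost a scrape.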

    Natural Language Prompts

    LLM-Powered Extraction

    Natural language extraction uses the extraction_prompt parameter (max 2,000 characters). Results are returned in the extraction_result field. You can combine it with extraction_schema for structured output.

    Just describe what you want in plain English. Great for quick extraction or when you're not sure of the exact structure.

    Python
    import requests
    
    # Simple natural language extraction
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://techblog.example.com/article/123",
            "extraction_prompt": "Extract the article title, author name, publication date, and a 2-sentence summary of the main points"
        }
    )
    
    data = response.json()
    print(data["extraction_result"])
    
    # More complex extraction
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://shop.example.com/category/laptops",
            "extraction_prompt": """
            Extract all laptops listed on this page. For each laptop, get:
            - Product name
            - Price (as a number)
            - Key specs (RAM, storage, processor)
            - Whether it's on sale
            Return as a list of objects.
            """
        }
    )

    Combining Methods

    Use a schema for structure and a prompt for additional guidance:

    Python
    import requests
    
    schema = {
        "type": "object",
        "properties": {
            "company_name": {"type": "string"},
            "headquarters": {"type": "string"},
            "founded_year": {"type": "integer"},
            "key_products": {
                "type": "array",
                "items": {"type": "string"}
            },
            "leadership": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "title": {"type": "string"}
                    }
                }
            }
        }
    }
    
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://company.example.com/about",
            "extraction_schema": schema,
            "extraction_prompt": "Focus on the 'About Us' and 'Leadership' sections. Only include C-level executives in the leadership array."
        }
    )

    Evidence Mode

    Enable evidence mode to see exactly where each extracted value came from in the source HTML:

    Python
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://shop.example.com/product/123",
            "extraction_profile": "product",
            "evidence": True  # Enable evidence tracking
        }
    )
    
    data = response.json()
    
    # Each field now includes provenance
    for field, value in data["extracted"].items():
        if isinstance(value, dict) and "evidence" in value:
            print(f"{field}: {value['value']}")
            print(f"  Source: {value['evidence'][:100]}...")
        else:
            print(f"{field}: {value}")

    Evidence Response Format

    When evidence: true is set, each extracted field includes a provenance reference showing the source HTML element it was extracted from:

    JSON
    {
      "extracted": {
        "name": {
          "value": "Wireless Headphones Pro",
          "evidence": "<h1 class=\"product-title\">Wireless Headphones Pro</h1>"
        },
        "price": {
          "value": 79.99,
          "evidence": "<span class=\"price\">$79.99</span>"
        },
        "in_stock": {
          "value": true,
          "evidence": "<div class=\"stock-status\">In Stock</div>"
        }
      }
    }
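    When post-processing evidence responses, it often helps to collapse the wrapped values back into a plain dict so the rest of your pipeline doesn't need to care whether evidence mode was on. A sketch assuming the response shape shown above; flatten_evidence is a hypothetical helper:

    ```python
    def flatten_evidence(extracted: dict) -> dict:
        """Collapse {field: {"value": v, "evidence": html}} into {field: v}."""
        flat = {}
        for field, item in extracted.items():
            if isinstance(item, dict) and "value" in item and "evidence" in item:
                flat[field] = item["value"]
            else:
                flat[field] = item  # field returned without evidence wrapping
        return flat

    extracted = {
        "name": {"value": "Wireless Headphones Pro",
                 "evidence": "<h1 class=\"product-title\">Wireless Headphones Pro</h1>"},
        "price": {"value": 79.99, "evidence": "<span class=\"price\">$79.99</span>"},
    }
    print(flatten_evidence(extracted))
    # {'name': 'Wireless Headphones Pro', 'price': 79.99}
    ```

    Keep the original response around if you need the evidence later for audits; flattening is lossy by design.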

    Use Cases for Evidence

    • Debugging extraction issues
    • Verifying data accuracy
    • Audit trails for compliance
    • Building training data for ML models

    Real-World Examples

    E-commerce Price Monitoring

    JSON
    {
      "url": "https://amazon.com/dp/B0123456789",
      "extraction_schema": {
        "type": "object",
        "properties": {
          "title": {"type": "string"},
          "price": {"type": "number"},
          "rating": {"type": "number"},
          "review_count": {"type": "integer"},
          "availability": {"type": "string"},
          "seller": {"type": "string"}
        }
      }
    }

    News Article Analysis

    JSON
    {
      "url": "https://news.example.com/article/123",
      "extraction_prompt": "Extract the headline, author, publication date, main topics covered, sentiment (positive/negative/neutral), and a 100-word summary"
    }

    Job Board Scraping

    JSON
    {
      "url": "https://jobs.example.com/listing/456",
      "extraction_profile": "job_posting",
      "extraction_prompt": "Also extract: required years of experience, remote work policy, and tech stack mentioned"
    }

    Best Practices

    1. Start with Profiles

    Pre-built profiles are optimized and tested. Use them when they match your use case, then customize with prompts if needed.

    2. Be Specific in Prompts

    "Extract the price" is less effective than "Extract the current sale price in USD as a number, ignoring shipping costs."

    3. Use Schemas for Consistency

    When scraping many pages, schemas ensure consistent field names and types across all results.

    4. Handle Missing Data

    Not all pages have all fields. Check for null values and handle gracefully.
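    For example, a product page might omit original_price or discount_percent when an item isn't on sale; reading with defaults keeps downstream code from raising KeyError. A minimal sketch using the pricing schema fields from earlier on this page:

    ```python
    def normalize_pricing(extracted: dict) -> dict:
        """Fill in safe defaults for fields a page may not have."""
        return {
            "product_name": extracted.get("product_name", "(unknown)"),
            "current_price": extracted.get("current_price"),   # None if absent
            "original_price": extracted.get("original_price"),
            "discount_percent": extracted.get("discount_percent", 0),
            "in_stock": extracted.get("in_stock", False),
        }

    row = normalize_pricing({"product_name": "Widget", "current_price": 19.99})
    print(row["discount_percent"], row["in_stock"])
    # 0 False
    ```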

    5. Test on Sample Pages First

    Before running large batches, test your extraction on a few pages to verify the output matches expectations.
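    One lightweight way to do that is to compare each sample result against the fields you expect before scaling up. A sketch; check_fields is a hypothetical helper and the field list is illustrative:

    ```python
    def check_fields(extracted: dict, expected: list[str]) -> list[str]:
        """Return expected fields that are missing or empty in a sample result."""
        return [f for f in expected if extracted.get(f) in (None, "", [])]

    sample = {"name": "Laptop X", "price": 999.0, "rating": None}
    missing = check_fields(sample, ["name", "price", "rating", "availability"])
    print(missing)
    # ['rating', 'availability']
    ```

    If the missing list is non-empty across several sample pages, adjust the schema or prompt before launching the batch.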

    Last updated: March 2026
