
    JSON Schema Filtering

    Filter and restructure already-extracted data to match your desired output format.

    Pure Data Transformation

    JSON Schema filtering is not LLM extraction. It filters existing structured data (Schema.org, Open Graph, etc.) to match your desired schema. Think of it as a smart field mapper.

    How It Works

    1. Automatic Extraction

    We extract structured data from the page using Schema.org, Open Graph, readability, and other sources.

    2. Schema Matching

    Your JSON Schema tells us which fields you want. We use exact matching, case-insensitive matching, field aliases, and nested search.

    3. Type Coercion

    We automatically convert types (string→number, string→boolean) and parse prices ($99.99 → 99.99).

    4. Filtered Result

    You receive a clean, structured response with only the fields you requested in filtered_content.

    Zero Additional Cost

    Schema filtering happens in milliseconds after extraction at no extra charge. It's pure data transformation.

    Basic Example

    Add extraction_schema to your request with a standard JSON Schema:

    Bash
    curl -X POST https://api.alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/products/watch",
        "extraction_schema": {
          "type": "object",
          "properties": {
            "title": {"type": "string"},
            "price": {"type": "number"},
            "image": {"type": "string"},
            "available": {"type": "boolean"}
          }
        }
      }'
    Response
    JSON
    {
      "success": true,
      "content": { ... },           // Full extraction (unchanged)
      "filtered_content": {         // Your filtered data
        "title": "Patek Philippe Calatrava",
        "price": 22500,
        "image": "https://example.com/watch.jpg",
        "available": true
      },
      "credits_used": 1
    }

    Field Aliases

    Schema filtering automatically handles common field name variations. You don't need to know the exact field names in the source data.

    Your Schema Field    Auto-Matched Source Fields
    title                name, product_name, productName, heading, headline
    price                amount, value, cost, priceAmount
    image                thumbnail, imageUrl, img, photo, picture, mainImage
    author               writer, byline, authorName, creator
    published            publishedAt, datePublished, date, publishDate
    available            availability, inStock, in_stock, stock
    in_stock             availability, available, inStock, stock, isAvailable
    sku                  asin, productId, product_id, identifier, item_id
    image_urls           images, imageUrls, photos, pictures, gallery

    4-Level Matching Strategy

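    Each schema field is resolved by trying these levels in order (examples read as source field → schema field):

    1. Exact match (case-sensitive): price → price
    2. Case-insensitive: Price → price
    3. Aliases: amount → price
    4. Nested search: jsonLd.price → price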
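The four levels can be pictured as a chain of lookups. The sketch below is only an illustration, not the service's implementation: `ALIASES`, `_lookup`, and `find_field` are hypothetical names, the alias table copies just three rows from the table above, and the nested step is simplified to a single level of containers.

```python
from typing import Any

# Hypothetical alias table (schema field -> source field names), copied from
# three rows of the alias table in this guide. Not the service's real table.
ALIASES: dict[str, list[str]] = {
    "title": ["name", "product_name", "productName", "heading", "headline"],
    "price": ["amount", "value", "cost", "priceAmount"],
    "sku": ["asin", "productId", "product_id", "identifier", "item_id"],
}

_MISSING = object()  # sentinel so that None, False and 0 count as real values


def _lookup(data: dict[str, Any], field: str) -> Any:
    """Levels 1-3: exact, case-insensitive, then alias match on one dict."""
    if field in data:  # 1. exact match
        return data[field]
    lowered = {key.lower(): value for key, value in data.items()}
    if field.lower() in lowered:  # 2. case-insensitive match
        return lowered[field.lower()]
    for alias in ALIASES.get(field, []):  # 3. alias match
        if alias in data:
            return data[alias]
    return _MISSING


def find_field(data: dict[str, Any], field: str) -> Any:
    """Return the first match for `field`, or None if nothing matches."""
    found = _lookup(data, field)
    if found is not _MISSING:
        return found
    for value in data.values():  # 4. nested search (one level only in this sketch)
        if isinstance(value, dict):
            found = _lookup(value, field)
            if found is not _MISSING:
                return found
    return None


extracted = {"name": "Calatrava", "Price": "22500", "jsonLd": {"asin": "B08XB8P9GW"}}
title = find_field(extracted, "title")  # matched through the alias "name"
price = find_field(extracted, "price")  # matched case-insensitively from "Price"
sku = find_field(extracted, "sku")      # matched inside the jsonLd container
```

Trying the cheapest, least ambiguous lookup first means an exact field name always wins over an alias or a nested value with the same meaning.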

    Type Coercion

    Schema filtering automatically converts types when possible:

    String → Number

    "100.50" → 100.5

    Handles currency symbols, thousands separators: "$1,234.56" → 1234.56

    String → Boolean

    "in_stock" → true
    "yes" → true
    "out_of_stock" → false

    Understands common truthy/falsy values: true/false, yes/no, 1/0, in_stock/out_of_stock

    String → Integer

    "42"→42

    Parses numeric strings to integers when the schema specifies "integer"

    Graceful Fallback

    If type coercion fails, the original value is returned unchanged. Your request won't fail due to conversion errors.
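To make the coercion rules and the fallback concrete, here is a minimal sketch of how such a converter could behave. It is an assumption-laden illustration, not the service's code: `coerce`, `TRUTHY`, and `FALSY` are hypothetical names, and the exact vocabularies and number parsing the service uses may differ.

```python
import re
from typing import Any

# Assumed truthy/falsy vocabularies, based on the values listed in this guide.
TRUTHY = {"true", "yes", "1", "in_stock"}
FALSY = {"false", "no", "0", "out_of_stock"}


def coerce(value: Any, target_type: str) -> Any:
    """Convert a string to the JSON Schema `target_type`, or return it unchanged."""
    if not isinstance(value, str):
        return value  # only strings are coerced in this sketch
    text = value.strip()
    try:
        if target_type == "number":
            # Drop currency symbols and thousands separators: "$1,234.56" -> 1234.56
            return float(re.sub(r"[^0-9.\-]", "", text))
        if target_type == "integer":
            return int(text)
        if target_type == "boolean":
            # Assumption: spaces are treated like underscores, so "In Stock" matches.
            key = text.lower().replace(" ", "_")
            if key in TRUTHY:
                return True
            if key in FALSY:
                return False
    except ValueError:
        pass  # graceful fallback: fall through to the original value
    return value


parsed_price = coerce("$1,234.56", "number")
parsed_stock = coerce("out_of_stock", "boolean")
unparsed = coerce("n/a", "number")  # cannot be parsed, so it stays a string
```

Returning the original value instead of raising keeps one unparseable field from failing the whole response, which matches the fallback behaviour described above.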

    Nested Objects & Arrays

    Schema filtering supports complex nested structures and arrays of objects:

    JSON
    {
      "extraction_schema": {
        "type": "object",
        "properties": {
          "product": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "price": {"type": "number"}
            }
          },
          "seller": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "rating": {"type": "number"}
            }
          }
        }
      }
    }

    Nested Search

    Fields are automatically searched in nested locations like jsonLd, openGraph, and metadata containers (up to 2 levels deep).
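A depth-limited search of this kind can be sketched as a breadth-first walk over nested dictionaries. The function name `nested_search` and its `max_depth` parameter are hypothetical, and the sketch assumes "2 levels deep" means two levels of containers below the top-level object; the service may count levels differently.

```python
from typing import Any


def nested_search(data: dict[str, Any], field: str, max_depth: int = 2) -> Any:
    """Breadth-first search for `field`, at most `max_depth` levels below the top."""
    level = [data]
    for _ in range(max_depth + 1):  # depth 0 is the top-level object itself
        next_level: list[dict[str, Any]] = []
        for node in level:
            if field in node:
                return node[field]
            # Queue child containers for the next, deeper pass.
            next_level.extend(v for v in node.values() if isinstance(v, dict))
        level = next_level
    return None


page = {
    "openGraph": {"title": "Luxury Watch"},
    "jsonLd": {"offers": {"price": "99.99"}},
}
og_title = nested_search(page, "title")              # found one level down
offer_price = nested_search(page, "price")           # found two levels down
too_shallow = nested_search(page, "price", max_depth=1)  # None: limit reached first
```

Searching level by level means a shallow match is preferred over a deeper one with the same name.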

    Real-World Examples

    E-commerce Product Scraping

    Python
    import requests
    
    # Scrape product with custom schema
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://store.example.com/products/luxury-watch",
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "cost": {"type": "number"},
                    "thumbnail": {"type": "string"},
                    "available": {"type": "boolean"},
                    "brand": {"type": "string"},
                    "rating": {"type": "number"}
                }
            }
        }
    )
    
    response.raise_for_status()  # surface HTTP errors before parsing the body
    product = response.json()["filtered_content"]
    
    # Clean, structured output:
    # {
    #   "title": "Patek Philippe Calatrava",
    #   "cost": 22500,
    #   "thumbnail": "https://example.com/watch.jpg",
    #   "available": true,
    #   "brand": "Patek Philippe",
    #   "rating": 4.8
    # }

    Amazon Product Scraping

    Extraction from Amazon product pages returns specific field names. The aliases below map them to your preferred schema fields:

    Your Schema Field    Amazon Returns
    title                name
    in_stock             availability
    sku                  asin
    image_urls           images
    price                price (direct match)
    Python
    import requests
    
    # Scrape Amazon product with your preferred field names
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://www.amazon.com/dp/B08XB8P9GW",
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},      # maps from 'name'
                    "price": {"type": "number"},      # direct match
                    "in_stock": {"type": "boolean"},  # maps from 'availability'
                    "sku": {"type": "string"},        # maps from 'asin'
                    "image_urls": {"type": "array", "items": {"type": "string"}}  # maps from 'images'
                }
            }
        }
    )
    
    response.raise_for_status()  # surface HTTP errors before parsing the body
    product = response.json()["filtered_content"]
    
    # Result with YOUR field names:
    # {
    #   "title": "Children's Lunch Box with Compartments",
    #   "price": 24.99,
    #   "in_stock": true,         # coerced from "In Stock"
    #   "sku": "B08XB8P9GW",       # mapped from asin
    #   "image_urls": ["https://m.media-amazon.com/images/..."]
    # }

    News Article Extraction

    Python
    import requests
    
    # Extract article metadata
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://news.example.com/article/breaking-news",
            "extraction_schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "author": {"type": "string"},
                    "published": {"type": "string"},
                    "description": {"type": "string"}
                }
            }
        }
    )
    
    response.raise_for_status()  # surface HTTP errors before parsing the body
    article = response.json()["filtered_content"]
    
    # Result:
    # {
    #   "title": "Breaking News: Major Discovery",
    #   "author": "John Doe",
    #   "published": "2024-01-15T10:30:00Z",
    #   "description": "Scientists announce breakthrough..."
    # }

    Best Practices

    1. Use User-Friendly Field Names

    Prefer title over name for products and price over amount. Aliases will find the right source field.

    2. Specify Types for Coercion

    Always specify "type" in your schema. This enables automatic type conversion (string→number, string→boolean).

    3. Handle Missing Fields

    Not every field will be present on every page. Read fields defensively rather than indexing directly, for example: filtered.get('field', default_value)

    4. Use Default Values

    Specify defaults in your schema: {"price": {"type": "number", "default": 0}}

    5. Keep Full Extraction Available

    filtered_content is separate from content. The full extraction is always available if you need additional fields.

    6. Test Your Schema

    Use the Interactive Playground to test your schema on sample URLs before integrating into production.
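Practices 3 and 4 can also be combined on the client side. The sketch below assumes a `filtered_content` dictionary that is missing some fields and fills in any "default" declared in the schema; `apply_defaults` is a hypothetical helper, and the sample data stands in for a real API response.

```python
from typing import Any


def apply_defaults(filtered: dict[str, Any], schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `filtered` with schema "default" values for absent fields."""
    result = dict(filtered)
    for name, spec in schema.get("properties", {}).items():
        if name not in result and "default" in spec:
            result[name] = spec["default"]
    return result


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number", "default": 0},
        "rating": {"type": "number"},
    },
}

# Stand-in for response.json()["filtered_content"] on a page with no price or rating.
filtered = {"title": "Patek Philippe Calatrava"}

product = apply_defaults(filtered, schema)
rating = product.get("rating")  # None, because no default was declared for rating
```

Fields without a declared default stay absent, so `.get()` remains the safe way to read them.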

    Last updated: March 2026
