AlterLab
Guide

JSON Schema Filtering

Filter and restructure already-extracted data to match your desired output format.

Pure Data Transformation

JSON Schema filtering is not LLM extraction. It filters existing structured data (Schema.org, Open Graph, etc.) to match your desired schema. Think of it as a smart field mapper.

How It Works

1. Automatic Extraction

We extract structured data from the page using Schema.org, Open Graph, readability, and other sources.

2. Schema Matching

Your JSON Schema tells us which fields you want. We use exact matching, case-insensitive matching, field aliases, and nested search.

3. Type Coercion

We automatically convert types (string→number, string→boolean) and parse prices ($99.99 → 99.99).

4. Filtered Result

You receive a clean, structured response with only the fields you requested in filtered_content.

Zero Additional Cost

Schema filtering happens in milliseconds after extraction at no extra charge. It's pure data transformation.

Basic Example

Add extraction_schema to your request with a standard JSON Schema:

curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/watch",
    "extraction_schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "image": {"type": "string"},
        "available": {"type": "boolean"}
      }
    }
  }'
Response
{
  "success": true,
  "content": { ... },           // Full extraction (unchanged)
  "filtered_content": {         // Your filtered data
    "title": "Patek Philippe Calatrava",
    "price": 22500,
    "image": "https://example.com/watch.jpg",
    "available": true
  },
  "credits_used": 1
}

Field Aliases

Schema filtering automatically handles common field name variations. You don't need to know the exact field names in the source data.

Your Schema Field | Auto-Matched Source Fields
title             | name, product_name, productName, heading, headline
price             | amount, value, cost, priceAmount
image             | thumbnail, imageUrl, img, photo, picture, mainImage
author            | writer, byline, authorName, creator
published         | publishedAt, datePublished, date, publishDate
available         | availability, inStock, in_stock, stock
in_stock          | availability, available, inStock, stock, isAvailable
sku               | asin, productId, product_id, identifier, item_id
image_urls        | images, imageUrls, photos, pictures, gallery

4-Level Matching Strategy

  1. Exact match (case-sensitive): price → price
  2. Case-insensitive: Price → price
  3. Aliases: amount → price
  4. Nested search: jsonLd.price → price
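
The four levels above can be sketched in Python as follows. This is only an illustration of the lookup order, not AlterLab's actual implementation: the function name `find_field`, the `ALIASES` dictionary (a subset of the alias table), and the `CONTAINERS` tuple are all assumptions made for the example.

```python
from typing import Any

# Hypothetical alias table: two rows copied from the alias table above.
ALIASES = {
    "title": ["name", "product_name", "productName", "heading", "headline"],
    "price": ["amount", "value", "cost", "priceAmount"],
}

# Assumed names of the containers checked in the nested-search step.
CONTAINERS = ("jsonLd", "openGraph", "metadata")


def find_field(data: dict[str, Any], field: str) -> Any:
    """Return the first value found for `field`, or None if nothing matches."""
    # 1. Exact match (case-sensitive)
    if field in data:
        return data[field]
    # 2. Case-insensitive match
    lowered = {key.lower(): value for key, value in data.items()}
    if field.lower() in lowered:
        return lowered[field.lower()]
    # 3. Aliases
    for alias in ALIASES.get(field, []):
        if alias in data:
            return data[alias]
    # 4. Nested search: look one level into the known containers
    for container in CONTAINERS:
        nested = data.get(container)
        if isinstance(nested, dict) and field in nested:
            return nested[field]
    return None
```

Each level is tried only if the previous one found nothing, so a top-level `price` always wins over an alias or a nested value in this sketch.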

Type Coercion

Schema filtering automatically converts types when possible:

String → Number

"100.50" → 100.5

Handles currency symbols, thousands separators: "$1,234.56" → 1234.56

String → Boolean

"in_stock" → true
"yes" → true
"out_of_stock" → false

Understands common truthy/falsy values: true/false, yes/no, 1/0, in_stock/out_of_stock

String → Integer

"42" → 42

Parses numeric strings to integers when the schema specifies "type": "integer"

Graceful Fallback

If type coercion fails, the original value is returned unchanged. Your request won't fail due to conversion errors.
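
A minimal sketch of the coercion rules and the fallback behavior described above, assuming a hypothetical helper named `coerce`. The truthy and falsy sets come from the list in this section; the regular expression used to strip currency symbols and separators is an assumption, not the service's actual parser.

```python
import re
from typing import Any

TRUTHY = {"true", "yes", "1", "in_stock"}
FALSY = {"false", "no", "0", "out_of_stock"}


def coerce(value: Any, schema_type: str) -> Any:
    """Convert `value` to `schema_type` if possible; otherwise return it unchanged."""
    if not isinstance(value, str):
        return value  # only strings are converted in this sketch
    text = value.strip()
    try:
        if schema_type == "number":
            # Drop currency symbols and thousands separators before parsing.
            return float(re.sub(r"[^\d.\-]", "", text))
        if schema_type == "integer":
            return int(text)
        if schema_type == "boolean":
            key = text.lower()
            if key in TRUTHY:
                return True
            if key in FALSY:
                return False
    except ValueError:
        pass  # graceful fallback: fall through to the original value
    return value
```

Because a failed conversion returns the original string, callers should still check the type of a field before doing arithmetic on it.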

Nested Objects & Arrays

Schema filtering supports complex nested structures and arrays of objects:

{
  "extraction_schema": {
    "type": "object",
    "properties": {
      "product": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"}
        }
      },
      "seller": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "rating": {"type": "number"}
        }
      }
    }
  }
}

{
  "extraction_schema": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "available": {"type": "boolean"}
      }
    }
  }
}
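
To make the effect of nested and array schemas concrete, here is a rough sketch of how a schema could be applied recursively to already-extracted data. The function `apply_schema` is hypothetical and matches keys by exact name only; the real service also applies aliases and type coercion.

```python
from typing import Any


def apply_schema(data: Any, schema: dict[str, Any]) -> Any:
    """Keep only the parts of `data` that the schema describes."""
    kind = schema.get("type")
    if kind == "object" and isinstance(data, dict):
        props = schema.get("properties", {})
        # Recurse into each requested property that is present in the data.
        return {
            name: apply_schema(data[name], sub_schema)
            for name, sub_schema in props.items()
            if name in data
        }
    if kind == "array" and isinstance(data, list):
        item_schema = schema.get("items", {})
        return [apply_schema(item, item_schema) for item in data]
    return data  # scalar values are returned as-is


schema = {
    "type": "object",
    "properties": {
        "product": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "price": {"type": "number"}},
        }
    },
}
data = {"product": {"name": "Watch", "price": 10, "sku": "X1"}, "seller": {"name": "S"}}
result = apply_schema(data, schema)
```

In this example `result` keeps `product.name` and `product.price` and drops both `sku` and `seller`, because neither appears in the schema.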

Nested Search

Fields are automatically searched in nested locations like jsonLd, openGraph, and metadata containers (up to 2 levels deep).
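
One way to picture a depth-limited nested search is a breadth-first walk over dictionaries. The sketch below is an assumption about how "up to 2 levels deep" might be interpreted (the top level counts as depth 0), and `nested_find` is a hypothetical name.

```python
from typing import Any


def nested_find(data: Any, field: str, max_depth: int = 2) -> Any:
    """Breadth-first search for `field`, descending at most `max_depth` levels."""
    level = [data]
    for _ in range(max_depth + 1):
        next_level = []
        for node in level:
            if not isinstance(node, dict):
                continue
            if field in node:
                return node[field]
            # Queue child dictionaries for the next, deeper pass.
            next_level.extend(v for v in node.values() if isinstance(v, dict))
        level = next_level
    return None
```

Searching level by level means a shallower match is always preferred over a deeper one.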

Real-World Examples

E-commerce Product Scraping

import requests

# Scrape product with custom schema
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://store.example.com/products/luxury-watch",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "cost": {"type": "number"},
                "thumbnail": {"type": "string"},
                "available": {"type": "boolean"},
                "brand": {"type": "string"},
                "rating": {"type": "number"}
            }
        }
    }
)

product = response.json()["filtered_content"]

# Clean, structured output:
# {
#   "title": "Patek Philippe Calatrava",
#   "cost": 22500,
#   "thumbnail": "https://example.com/watch.jpg",
#   "available": true,
#   "brand": "Patek Philippe",
#   "rating": 4.8
# }

Amazon Product Scraping

Amazon returns specific field names. Use these aliases for clean mapping:

Your Schema Field | Amazon Returns
title             | name
in_stock          | availability
sku               | asin
image_urls        | images
price             | price (direct match)

import requests

# Scrape Amazon product with your preferred field names
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://www.amazon.com/dp/B08XB8P9GW",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},      # maps from 'name'
                "price": {"type": "number"},      # direct match
                "in_stock": {"type": "boolean"},  # maps from 'availability'
                "sku": {"type": "string"},        # maps from 'asin'
                "image_urls": {"type": "array", "items": {"type": "string"}}  # maps from 'images'
            }
        }
    }
)

product = response.json()["filtered_content"]

# Result with YOUR field names:
# {
#   "title": "Children's Lunch Box with Compartments",
#   "price": 24.99,
#   "in_stock": true,         # coerced from "In Stock"
#   "sku": "B08XB8P9GW",       # mapped from asin
#   "image_urls": ["https://m.media-amazon.com/images/..."]
# }

News Article Extraction

import requests

# Extract article metadata
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://news.example.com/article/breaking-news",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "published": {"type": "string"},
                "description": {"type": "string"}
            }
        }
    }
)

article = response.json()["filtered_content"]

# Result:
# {
#   "title": "Breaking News: Major Discovery",
#   "author": "John Doe",
#   "published": "2024-01-15T10:30:00Z",
#   "description": "Scientists announce breakthrough..."
# }

Best Practices

1. Use User-Friendly Field Names

Prefer title over name for products, price over amount. Aliases will find the right source field.

2. Specify Types for Coercion

Always specify "type" in your schema. This enables automatic type conversion (string→number, string→boolean).

3. Handle Missing Fields

Not all fields will be present on every page. Check if fields exist: filtered.get('field', default_value)
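
As a small illustration of defensive access, the sketch below builds a summary line from a `filtered_content` dictionary without assuming any field is present. The function name `describe_product` and the placeholder strings are invented for the example.

```python
from typing import Any


def describe_product(filtered: dict[str, Any]) -> str:
    """Build a one-line summary without assuming any field is present."""
    title = filtered.get("title", "Untitled")
    price = filtered.get("price")  # None when the page had no price
    # A price may also arrive as an unconverted string if coercion failed.
    price_text = f"{price:.2f}" if isinstance(price, (int, float)) else "n/a"
    return f"{title}: {price_text}"
```

The type check on `price` also covers the graceful-fallback case, where a value that could not be coerced is returned as its original string.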

4. Use Default Values

Specify defaults in your schema: {"price": {"type": "number", "default": 0}}
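
The effect of a schema default can be pictured with the following sketch, which fills in missing fields from the schema on the client side. This is an illustration of the idea under the assumption that defaults apply only to absent fields; `with_defaults` is a hypothetical helper, not part of the API.

```python
from typing import Any


def with_defaults(filtered: dict[str, Any], schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `filtered` with schema defaults added for missing fields."""
    result = dict(filtered)
    for name, spec in schema.get("properties", {}).items():
        if name not in result and "default" in spec:
            result[name] = spec["default"]
    return result


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number", "default": 0},
    },
}
filled = with_defaults({"title": "Watch"}, schema)
```

Here `filled` contains the original `title` plus `price` set to `0`; a price that was actually extracted would be left untouched.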

5. Keep Full Extraction Available

filtered_content is separate from content. The full extraction is always available if you need additional fields.
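
Because both keys are returned together, a client can fall back to the full extraction when a field is missing from the filtered result. The sketch below assumes `content` is a dictionary with the wanted field at its top level, which may not hold for every page; `get_field` is a hypothetical helper.

```python
from typing import Any


def get_field(body: dict[str, Any], field: str, default: Any = None) -> Any:
    """Prefer filtered_content, then fall back to the full content."""
    filtered = body.get("filtered_content") or {}
    if field in filtered:
        return filtered[field]
    content = body.get("content") or {}
    if isinstance(content, dict):
        return content.get(field, default)
    return default


body = {
    "content": {"title": "Full title", "brand": "Patek Philippe"},
    "filtered_content": {"title": "Patek Philippe Calatrava"},
}
```

With this `body`, `get_field(body, "title")` reads from `filtered_content`, while `get_field(body, "brand")` falls back to `content`.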

6. Test Your Schema

Use the Interactive Playground to test your schema on sample URLs before integrating into production.