Guide

JSON Schema Filtering

Filter and restructure already-extracted data to match your desired output format.

Pure Data Transformation

JSON Schema filtering is not LLM extraction. It filters existing structured data (Schema.org, Open Graph, etc.) to match your desired schema. Think of it as a smart field mapper.

How It Works

Automatic Extraction

We extract structured data from the page using Schema.org, Open Graph, readability, and other sources.

Schema Matching

Your JSON Schema tells us which fields you want. We use exact matching, case-insensitive matching, field aliases, and nested search.

Type Coercion

We automatically convert types (string→number, string→boolean) and parse prices ($99.99 → 99.99).

Filtered Result

You receive a clean, structured response with only the fields you requested in filtered_content.

Zero Additional Cost

Schema filtering happens in milliseconds after extraction at no extra charge. It's pure data transformation.

Basic Example

Add extraction_schema to your request with a standard JSON Schema:

Bash

curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/watch",
    "extraction_schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "image": {"type": "string"},
        "available": {"type": "boolean"}
      }
    }
  }'


Response

JSON

{
  "success": true,
  "content": { ... },           // Full extraction (unchanged)
  "filtered_content": {         // Your filtered data
    "title": "Patek Philippe Calatrava",
    "price": 22500,
    "image": "https://example.com/watch.jpg",
    "available": true
  },
  "credits_used": 1
}
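Reading this response on the client can be sketched as follows. The helper name `read_filtered` and the choice to raise when `success` is false are assumptions for illustration. The sample payload mirrors the response above, with the elided `content` left empty.

```python
def read_filtered(payload: dict) -> dict:
    """Return filtered_content from a parsed scrape response (hypothetical helper)."""
    # Assumption: a falsy "success" flag means the response carries no usable data.
    if not payload.get("success"):
        raise RuntimeError("scrape request did not succeed")
    # Fall back to an empty dict if no filtered data was returned.
    return payload.get("filtered_content") or {}


# Sample payload shaped like the documented response.
sample = {
    "success": True,
    "content": {},  # full extraction, elided here
    "filtered_content": {
        "title": "Patek Philippe Calatrava",
        "price": 22500,
        "image": "https://example.com/watch.jpg",
        "available": True,
    },
    "credits_used": 1,
}

filtered = read_filtered(sample)
print(filtered["title"], filtered["price"])
```

In real code, `payload` would come from `response.json()` as in the Python examples later in this guide.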

Field Aliases

Schema filtering automatically handles common field name variations. You don't need to know the exact field names in the source data.

Your Schema Field	Auto-Matched Source Fields
`title`	name, product_name, productName, heading, headline
`price`	amount, value, cost, priceAmount
`image`	thumbnail, imageUrl, img, photo, picture, mainImage
`author`	writer, byline, authorName, creator
`published`	publishedAt, datePublished, date, publishDate
`available`	availability, inStock, in_stock, stock
`in_stock`	availability, available, inStock, stock, isAvailable
`sku`	asin, productId, product_id, identifier, item_id
`image_urls`	images, imageUrls, photos, pictures, gallery

4-Level Matching Strategy

Each example reads source field → your schema field:

1. Exact match (case-sensitive): price → price
2. Case-insensitive: Price → price
3. Aliases: amount → price
4. Nested search: jsonLd.price → price
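The four levels can be pictured with a minimal Python sketch. This is an illustration, not the service's actual implementation. The alias table contains only two rows copied from the table above, the container names come from the Nested Search note in this guide, and `find_field` is a hypothetical name.

```python
# Hypothetical alias table: two rows copied from the alias table above.
ALIASES = {
    "title": ["name", "product_name", "productName", "heading", "headline"],
    "price": ["amount", "value", "cost", "priceAmount"],
}
# Container names taken from the Nested Search note; the real list may differ.
CONTAINERS = ("jsonLd", "openGraph", "metadata")

_MISSING = object()  # sentinel so that a stored None still counts as a match


def _match_flat(source: dict, field: str):
    """Levels 1-3: exact, case-insensitive, then alias lookup."""
    if field in source:
        return source[field]
    lowered = {key.lower(): value for key, value in source.items()}
    if field.lower() in lowered:
        return lowered[field.lower()]
    for alias in ALIASES.get(field, []):
        if alias in source:
            return source[alias]
    return _MISSING


def find_field(source: dict, field: str, default=None):
    """Return the first value matching `field`, trying the four levels in order."""
    value = _match_flat(source, field)
    if value is not _MISSING:
        return value
    # Level 4: look one level down inside the known containers.
    for container in CONTAINERS:
        nested = source.get(container)
        if isinstance(nested, dict):
            value = _match_flat(nested, field)
            if value is not _MISSING:
                return value
    return default


print(find_field({"jsonLd": {"amount": "19.99"}}, "price"))
```

The sketch stops at one container level for brevity. Values come back uncoerced, since type conversion is a separate step.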

Type Coercion

Schema filtering automatically converts types when possible:

String → Number

"100.50" → 100.5

Handles currency symbols and thousands separators: "$1,234.56" → 1234.56

String → Boolean

"in_stock" → true

"yes" → true

"out_of_stock" → false

Understands common truthy/falsy values: true/false, yes/no, 1/0, in_stock/out_of_stock

String → Integer

"42" → 42

Parses numeric strings to integers when the schema specifies "type": "integer"

Graceful Fallback

If type coercion fails, the original value is returned unchanged. Your request won't fail due to conversion errors.
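The coercion rules and the fallback can be sketched in a few lines of Python. This is a rough approximation under stated assumptions, not the service's code. The function name `coerce`, the regular expression used to strip currency symbols, and the normalization of spaces to underscores are all illustrative choices.

```python
import re

# Truthy and falsy strings listed in this section.
TRUTHY = {"true", "yes", "1", "in_stock"}
FALSY = {"false", "no", "0", "out_of_stock"}


def coerce(value, schema_type: str):
    """Convert a string to the schema type, or return it unchanged on failure."""
    if not isinstance(value, str):
        return value  # only strings are coerced in this sketch
    text = value.strip()
    try:
        if schema_type == "number":
            # Drop currency symbols and thousands separators before parsing.
            return float(re.sub(r"[^\d.\-]", "", text))
        if schema_type == "integer":
            return int(text)
        if schema_type == "boolean":
            # Assumption: "In Stock" is treated like "in_stock".
            key = text.lower().replace(" ", "_")
            if key in TRUTHY:
                return True
            if key in FALSY:
                return False
    except ValueError:
        pass  # graceful fallback: fall through and return the original value
    return value


print(coerce("$1,234.56", "number"))
print(coerce("n/a", "number"))
```

Because failed conversions return the original string, client code should still check the type of a field before doing arithmetic on it.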

Nested Objects & Arrays

Schema filtering supports complex nested structures and arrays of objects:

JSON

{
  "extraction_schema": {
    "type": "object",
    "properties": {
      "product": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"}
        }
      },
      "seller": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "rating": {"type": "number"}
        }
      }
    }
  }
}
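Assuming `filtered_content` mirrors the shape of the nested schema above (the guide does not show a nested response, so this is an assumption), nested values can be read defensively with a small helper. The name `dig` and the sample data are hypothetical.

```python
def dig(data, *path, default=None):
    """Follow `path` through nested dicts, returning `default` if any key is missing."""
    for key in path:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


# Hypothetical filtered_content for the product/seller schema; seller.rating is absent.
filtered = {
    "product": {"name": "Patek Philippe Calatrava", "price": 22500},
    "seller": {"name": "Example Store"},
}

print(dig(filtered, "product", "price"))
print(dig(filtered, "seller", "rating", default=0.0))
```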

Nested Search

Fields are automatically searched in nested locations like jsonLd, openGraph, and metadata containers (up to 2 levels deep).

Real-World Examples

E-commerce Product Scraping

Python

import requests

# Scrape product with custom schema
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://store.example.com/products/luxury-watch",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "cost": {"type": "number"},
                "thumbnail": {"type": "string"},
                "available": {"type": "boolean"},
                "brand": {"type": "string"},
                "rating": {"type": "number"}
            }
        }
    }
)

product = response.json()["filtered_content"]

# Clean, structured output:
# {
#   "title": "Patek Philippe Calatrava",
#   "cost": 22500,
#   "thumbnail": "https://example.com/watch.jpg",
#   "available": true,
#   "brand": "Patek Philippe",
#   "rating": 4.8
# }

Amazon Product Scraping

Data extracted from Amazon pages uses specific field names. The aliases below map them to your schema fields:

Your Schema Field	Amazon Returns
`title`	name
`in_stock`	availability
`sku`	asin
`image_urls`	images
`price`	price (direct match)

Python

import requests

# Scrape Amazon product with your preferred field names
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://www.amazon.com/dp/B08XB8P9GW",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},      # maps from 'name'
                "price": {"type": "number"},      # direct match
                "in_stock": {"type": "boolean"},  # maps from 'availability'
                "sku": {"type": "string"},        # maps from 'asin'
                "image_urls": {"type": "array", "items": {"type": "string"}}  # maps from 'images'
            }
        }
    }
)

product = response.json()["filtered_content"]

# Result with YOUR field names:
# {
#   "title": "Children's Lunch Box with Compartments",
#   "price": 24.99,
#   "in_stock": true,         # coerced from "In Stock"
#   "sku": "B08XB8P9GW",       # mapped from asin
#   "image_urls": ["https://m.media-amazon.com/images/..."]
# }

News Article Extraction

Python

import requests

# Extract article metadata
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://news.example.com/article/breaking-news",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "published": {"type": "string"},
                "description": {"type": "string"}
            }
        }
    }
)

article = response.json()["filtered_content"]

# Result:
# {
#   "title": "Breaking News: Major Discovery",
#   "author": "John Doe",
#   "published": "2024-01-15T10:30:00Z",
#   "description": "Scientists announce breakthrough..."
# }

Best Practices

1. Use User-Friendly Field Names

Prefer title over name for products and price over amount. Aliases will find the right source field.

2. Specify Types for Coercion

Always specify "type" in your schema. This enables automatic type conversion (string→number, string→boolean).

3. Handle Missing Fields

Not all fields will be present on every page. Read them defensively instead of indexing directly, for example: filtered.get('field', default_value)

4. Use Default Values

Specify defaults in your schema: {"price": {"type": "number", "default": 0}}

5. Keep Full Extraction Available

filtered_content is separate from content. The full extraction is always available if you need additional fields.

6. Test Your Schema

Use the Interactive Playground to test your schema on sample URLs before integrating into production.
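Practices 3 and 4 can also be combined on the client. The sketch below fills missing fields from the schema's own "default" values and then reads the result with `.get`. The guide does not describe how the service itself applies defaults, so `apply_defaults` is a hypothetical client-side helper and the sample data is made up.

```python
def apply_defaults(filtered: dict, schema: dict) -> dict:
    """Return a copy of `filtered` with missing fields filled from schema defaults."""
    result = dict(filtered)
    for name, spec in schema.get("properties", {}).items():
        if name not in result and "default" in spec:
            result[name] = spec["default"]
    return result


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number", "default": 0},
        "available": {"type": "boolean"},
    },
}

# Hypothetical filtered_content from a page that exposed only a title.
filtered = apply_defaults({"title": "Patek Philippe Calatrava"}, schema)

price = filtered.get("price")          # filled from the schema default
available = filtered.get("available")  # no default, so this is None
print(price, available)
```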


Last updated: March 2026
