JSON Schema Filtering
Filter and restructure already-extracted data to match your desired output format.
Pure Data Transformation
How It Works
Automatic Extraction
We extract structured data from the page using Schema.org, Open Graph, readability, and other sources.
Schema Matching
Your JSON Schema tells us which fields you want. We use exact matching, case-insensitive matching, field aliases, and nested search.
Type Coercion
We automatically convert types (string→number, string→boolean) and parse prices ($99.99 → 99.99).
Filtered Result
You receive a clean, structured response with only the fields you requested in filtered_content.
Zero Additional Cost
Schema filtering happens in milliseconds after extraction at no extra charge. It's pure data transformation.
Basic Example
Add extraction_schema to your request with a standard JSON Schema:
curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/watch",
    "extraction_schema": {
      "type": "object",
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "image": {"type": "string"},
        "available": {"type": "boolean"}
      }
    }
  }'

Response:
{
  "success": true,
  "content": { ... },          // Full extraction (unchanged)
  "filtered_content": {        // Your filtered data
    "title": "Patek Philippe Calatrava",
    "price": 22500,
    "image": "https://example.com/watch.jpg",
    "available": true
  },
  "credits_used": 1
}

Field Aliases
Schema filtering automatically handles common field name variations. You don't need to know the exact field names in the source data.
| Your Schema Field | Auto-Matched Source Fields |
|---|---|
| title | name, product_name, productName, heading, headline |
| price | amount, value, cost, priceAmount |
| image | thumbnail, imageUrl, img, photo, picture, mainImage |
| author | writer, byline, authorName, creator |
| published | publishedAt, datePublished, date, publishDate |
| available | availability, inStock, in_stock, stock |
| in_stock | availability, available, inStock, stock, isAvailable |
| sku | asin, productId, product_id, identifier, item_id |
| image_urls | images, imageUrls, photos, pictures, gallery |
4-Level Matching Strategy
- Exact match (case-sensitive): price→price
- Case-insensitive: Price→price
- Aliases: amount→price
- Nested search: jsonLd.price→price
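The four levels above can be sketched roughly as follows. This is an illustration only, not the service's actual implementation: `ALIASES` copies a few rows from the alias table, `match_field` and `_lookup` are hypothetical names, and the order in which levels are tried is assumed to follow the list. For brevity the nested step descends a single level.

```python
from typing import Any

# A few rows copied from the alias table (hypothetical in-code representation).
ALIASES = {
    "title": ["name", "product_name", "productName", "heading", "headline"],
    "price": ["amount", "value", "cost", "priceAmount"],
    "sku": ["asin", "productId", "product_id", "identifier", "item_id"],
}

_MISSING = object()  # sentinel so falsy values such as 0 or "" still count as found


def _lookup(data: dict, key: str) -> Any:
    """Exact match first, then a case-insensitive match on the same key."""
    if key in data:
        return data[key]
    lowered = key.lower()
    for existing_key, value in data.items():
        if existing_key.lower() == lowered:
            return value
    return _MISSING


def match_field(data: dict, field: str) -> Any:
    """Resolve one schema field against extracted data; None if nothing matches."""
    candidates = [field, *ALIASES.get(field, [])]
    # Levels 1-3: the field name itself, then each alias, at the top level
    for candidate in candidates:
        value = _lookup(data, candidate)
        if value is not _MISSING:
            return value
    # Level 4: repeat the same search inside nested dicts (one level deep here)
    for child in data.values():
        if isinstance(child, dict):
            for candidate in candidates:
                value = _lookup(child, candidate)
                if value is not _MISSING:
                    return value
    return None


extracted = {"Name": "Watch", "jsonLd": {"amount": "99.99"}}
title = match_field(extracted, "title")   # found via alias "name", case-insensitively
price = match_field(extracted, "price")   # found via alias "amount" inside jsonLd
```

Here `title` resolves to `"Watch"` and `price` to the string `"99.99"`; converting that string to a number is a separate step.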
Type Coercion
Schema filtering automatically converts types when possible:
String → Number
Handles currency symbols and thousands separators: "$1,234.56" → 1234.56
String → Boolean
Understands common truthy/falsy values: true/false, yes/no, 1/0, in_stock/out_of_stock
String → Integer
Parses numeric strings to integers when the schema specifies "integer".
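A minimal sketch of these conversions, to make the rules concrete. The function names (`to_number`, `to_boolean`, `to_integer`, `coerce`) are hypothetical, the truthy/falsy sets contain only the values listed above, and returning the original value when conversion fails is an assumption about fallback behaviour rather than documented behaviour.

```python
import re
from typing import Any

TRUTHY = {"true", "yes", "1", "in_stock"}
FALSY = {"false", "no", "0", "out_of_stock"}


def to_number(value: Any) -> Any:
    """Strip currency symbols and thousands separators, then parse a float."""
    if isinstance(value, str):
        # Assumes "." is the decimal separator and "," only groups thousands
        cleaned = re.sub(r"[^\d.\-]", "", value)
        try:
            return float(cleaned)
        except ValueError:
            return value  # assumed fallback: leave the value untouched
    return value


def to_integer(value: Any) -> Any:
    """Parse to an int only when the numeric value has no fractional part."""
    number = to_number(value)
    if isinstance(number, float):
        return int(number) if number.is_integer() else value
    return number


def to_boolean(value: Any) -> Any:
    """Map the listed truthy/falsy spellings to True/False."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int) and value in (0, 1):
        return bool(value)
    if isinstance(value, str):
        # "In Stock" -> "in_stock" so that display text matches the listed tokens
        key = value.strip().lower().replace(" ", "_")
        if key in TRUTHY:
            return True
        if key in FALSY:
            return False
    return value


COERCERS = {"number": to_number, "integer": to_integer, "boolean": to_boolean}


def coerce(value: Any, schema_type: str) -> Any:
    """Dispatch on the JSON Schema "type"; unknown types pass through unchanged."""
    return COERCERS.get(schema_type, lambda v: v)(value)


amount = coerce("$1,234.56", "number")
in_stock = coerce("In Stock", "boolean")
quantity = coerce("42", "integer")
unparsed = coerce("n/a", "number")
```

With these definitions, `amount` is `1234.56`, `in_stock` is `True`, `quantity` is the integer `42`, and `unparsed` stays `"n/a"`.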
Graceful Fallback
Nested Objects & Arrays
Schema filtering supports complex nested structures and arrays of objects:
Nested objects:
{
  "extraction_schema": {
    "type": "object",
    "properties": {
      "product": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"}
        }
      },
      "seller": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "rating": {"type": "number"}
        }
      }
    }
  }
}

Array of objects:
{
  "extraction_schema": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "available": {"type": "boolean"}
      }
    }
  }
}

Nested Search
Fields are automatically searched in nested locations like jsonLd, openGraph, and metadata containers (up to 2 levels deep).
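As a rough illustration of a depth-limited search, the sketch below looks for a field at the top level and then inside nested containers, stopping two levels down. The name `find_nested`, the sample data, and the depth-first traversal order are assumptions made for the example; the service may order its search differently.

```python
from typing import Any, Optional


def find_nested(data: dict, field: str, max_depth: int = 2) -> Optional[Any]:
    """Return the first value stored under `field`, descending at most max_depth levels."""
    if field in data:
        return data[field]
    if max_depth == 0:
        return None
    # Depth-first: each nested dict is searched fully before the next one
    for child in data.values():
        if isinstance(child, dict):
            found = find_nested(child, field, max_depth - 1)
            if found is not None:
                return found
    return None


# Made-up extraction result with the container names mentioned above
page = {
    "title": "Watch",
    "jsonLd": {"offers": {"price": "22500"}},
    "openGraph": {"image": "https://example.com/watch.jpg"},
}

price = find_nested(page, "price")   # two levels down, inside jsonLd.offers
image = find_nested(page, "image")   # one level down, inside openGraph
too_deep = find_nested({"a": {"b": {"c": {"price": 1}}}}, "price")  # three levels down
```

`price` and `image` are found, while `too_deep` is `None` because the value sits below the depth limit.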
Real-World Examples
E-commerce Product Scraping
import requests

# Scrape product with custom schema
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://store.example.com/products/luxury-watch",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "cost": {"type": "number"},
                "thumbnail": {"type": "string"},
                "available": {"type": "boolean"},
                "brand": {"type": "string"},
                "rating": {"type": "number"}
            }
        }
    }
)
product = response.json()["filtered_content"]

# Clean, structured output:
# {
#   "title": "Patek Philippe Calatrava",
#   "cost": 22500,
#   "thumbnail": "https://example.com/watch.jpg",
#   "available": true,
#   "brand": "Patek Philippe",
#   "rating": 4.8
# }

Amazon Product Scraping
Amazon returns specific field names. Use these aliases for clean mapping:
| Your Schema Field | Amazon Returns |
|---|---|
| title | name |
| in_stock | availability |
| sku | asin |
| image_urls | images |
| price | price (direct match) |
import requests

# Scrape Amazon product with your preferred field names
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://www.amazon.com/dp/B08XB8P9GW",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},      # maps from 'name'
                "price": {"type": "number"},      # direct match
                "in_stock": {"type": "boolean"},  # maps from 'availability'
                "sku": {"type": "string"},        # maps from 'asin'
                "image_urls": {"type": "array", "items": {"type": "string"}}  # maps from 'images'
            }
        }
    }
)
product = response.json()["filtered_content"]

# Result with YOUR field names:
# {
#   "title": "Children's Lunch Box with Compartments",
#   "price": 24.99,
#   "in_stock": true,      # coerced from "In Stock"
#   "sku": "B08XB8P9GW",   # mapped from asin
#   "image_urls": ["https://m.media-amazon.com/images/..."]
# }

News Article Extraction
import requests

# Extract article metadata
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://news.example.com/article/breaking-news",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "published": {"type": "string"},
                "description": {"type": "string"}
            }
        }
    }
)
article = response.json()["filtered_content"]

# Result:
# {
#   "title": "Breaking News: Major Discovery",
#   "author": "John Doe",
#   "published": "2024-01-15T10:30:00Z",
#   "description": "Scientists announce breakthrough..."
# }

Best Practices
1. Use User-Friendly Field Names
For products, prefer title over name and price over amount. Aliases will find the right source field.
2. Specify Types for Coercion
Always specify "type" in your schema. This enables automatic type conversion (string→number, string→boolean).
3. Handle Missing Fields
Not every field will be present on every page. Read fields defensively instead of indexing directly: filtered.get('field', default_value)
4. Use Default Values
Specify defaults in your schema: {"price": {"type": "number", "default": 0}}
5. Keep Full Extraction Available
filtered_content is separate from content. The full extraction is always available if you need additional fields.
6. Test Your Schema
Use the Interactive Playground to test your schema on sample URLs before integrating into production.
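Practices 3 to 5 can be combined on the client side. The sketch below is one possible approach, not a prescribed pattern: `read_field` is a hypothetical helper, the sample payload is made up in the shape of the Basic Example response, and it assumes that `content` exposes fields as top-level keys, which may not hold for every page.

```python
from typing import Any


def read_field(payload: dict, field: str, default: Any = None) -> Any:
    """Prefer filtered_content, fall back to the full extraction, then to a default."""
    filtered = payload.get("filtered_content") or {}
    if filtered.get(field) is not None:
        return filtered[field]
    # Assumption: the full extraction is a dict with fields at the top level
    content = payload.get("content") or {}
    if isinstance(content, dict) and content.get(field) is not None:
        return content[field]
    return default


# Sample payload shaped like the Basic Example response (values are illustrative)
payload = {
    "success": True,
    "content": {"title": "Patek Philippe Calatrava", "brand": "Patek Philippe"},
    "filtered_content": {"title": "Patek Philippe Calatrava", "price": 22500},
    "credits_used": 1,
}

title = read_field(payload, "title")             # present in filtered_content
brand = read_field(payload, "brand", "unknown")  # only in the full extraction
rating = read_field(payload, "rating", 0.0)      # missing everywhere, default used
```

Checking `is not None` rather than truthiness keeps legitimate values such as `0` or `False` from being replaced by the default.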