Structured Extraction
Extract structured data from web pages using pre-built profiles or custom JSON schemas. Turn messy HTML into clean, predictable JSON.
How It Works

Extraction Methods

- Pre-built Profiles: Ready-to-use templates for common data types
- JSON Schema: Define the exact structure with types and validation
- Natural Language: Describe what you want in plain English
Pre-built Profiles
Use pre-defined extraction profiles for common page types. These are optimized schemas that work out of the box.
| Profile | Extracted Fields | Best For |
|---|---|---|
| `product` | name, price, description, images, ratings, availability | E-commerce product pages |
| `article` | title, author, date, content, summary | News, blogs, documentation |
| `job_posting` | title, company, location, salary, requirements | Job boards, career pages |
| `faq` | questions, answers, categories | FAQ pages, help centers |
| `recipe` | name, ingredients, instructions, time, servings | Recipe websites |
| `event` | name, date, location, description, organizer | Event pages, calendars |
| `auto` | Automatically detected | Unknown page types |
```python
import requests

# Extract product data using the "product" profile
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/product/123",
        "extraction_profile": "product"
    }
)

data = response.json()
product = data["extracted"]

print(f"Name: {product['name']}")
print(f"Price: {product['price']}")
print(f"Rating: {product['rating']}")
```

Custom JSON Schema
Define exactly what data you want using JSON Schema. This gives you full control over field names, types, and structure.
```python
import requests

# Define a custom schema for competitor pricing
schema = {
    "type": "object",
    "properties": {
        "product_name": {
            "type": "string",
            "description": "The full product name"
        },
        "current_price": {
            "type": "number",
            "description": "Current price in USD"
        },
        "original_price": {
            "type": "number",
            "description": "Original price before discount, if any"
        },
        "discount_percent": {
            "type": "number",
            "description": "Discount percentage if on sale"
        },
        "in_stock": {
            "type": "boolean",
            "description": "Whether the product is currently in stock"
        },
        "shipping_info": {
            "type": "string",
            "description": "Shipping time or availability"
        }
    },
    "required": ["product_name", "current_price", "in_stock"]
}

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://competitor.com/product/xyz",
        "extraction_schema": schema
    }
)

data = response.json()
pricing = data["extracted"]

print(f"Product: {pricing['product_name']}")
print(f"Price: ${pricing['current_price']}")
print(f"In Stock: {pricing['in_stock']}")
```

Schema Tips
- Use `description` fields to guide the AI
- Mark important fields as `required`
- Use specific types (`number` vs `string`) for proper formatting
Natural Language Prompts
Feature In Development
Just describe what you want in plain English. Great for quick extraction or when you're not sure of the exact structure.
```python
import requests

# Simple natural language extraction
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://techblog.example.com/article/123",
        "extraction_prompt": "Extract the article title, author name, publication date, and a 2-sentence summary of the main points"
    }
)
data = response.json()
print(data["extracted"])

# More complex extraction
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/category/laptops",
        "extraction_prompt": """
        Extract all laptops listed on this page. For each laptop, get:
        - Product name
        - Price (as a number)
        - Key specs (RAM, storage, processor)
        - Whether it's on sale
        Return as a list of objects.
        """
    }
)
```

Combining Methods
Use a schema for structure and a prompt for additional guidance:
```python
import requests

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "headquarters": {"type": "string"},
        "founded_year": {"type": "integer"},
        "key_products": {
            "type": "array",
            "items": {"type": "string"}
        },
        "leadership": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "title": {"type": "string"}
                }
            }
        }
    }
}

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://company.example.com/about",
        "extraction_schema": schema,
        "extraction_prompt": "Focus on the 'About Us' and 'Leadership' sections. Only include C-level executives in the leadership array."
    }
)
```

Evidence Mode
Enable evidence mode to see exactly where each extracted value came from in the source HTML:
```python
import requests

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/product/123",
        "extraction_profile": "product",
        "evidence": True  # Enable evidence tracking
    }
)
data = response.json()

# Each field now includes provenance
for field, value in data["extracted"].items():
    if isinstance(value, dict) and "evidence" in value:
        print(f"{field}: {value['value']}")
        print(f"  Source: {value['evidence'][:100]}...")
    else:
        print(f"{field}: {value}")
```

Use Cases for Evidence
- Debugging extraction issues
- Verifying data accuracy
- Audit trails for compliance
- Building training data for ML models
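For the audit-trail case, one approach is to flatten each field's value and source snippet into rows and persist them as CSV. This is a sketch assuming the per-field `{"value": ..., "evidence": ...}` shape shown above:

```python
import csv
import io

def audit_rows(extracted):
    """Flatten evidence-mode fields into (field, value, evidence) rows."""
    rows = []
    for field, value in extracted.items():
        if isinstance(value, dict) and "evidence" in value:
            # Truncate long evidence snippets for readable audit logs
            rows.append((field, value["value"], value["evidence"][:200]))
        else:
            # Fields without evidence pass through with an empty source
            rows.append((field, value, ""))
    return rows

def audit_csv(extracted):
    """Serialize the audit rows as a CSV string with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["field", "value", "evidence"])
    writer.writerows(audit_rows(extracted))
    return buf.getvalue()

sample = {
    "name": {"value": "Widget", "evidence": "<h1>Widget</h1>"},
    "price": 9.99,
}
rows = audit_rows(sample)
```

Timestamping each row and appending to a file would turn this into a simple compliance log.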
Real-World Examples
E-commerce Price Monitoring
```json
{
  "url": "https://amazon.com/dp/B0123456789",
  "extraction_schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "price": {"type": "number"},
      "rating": {"type": "number"},
      "review_count": {"type": "integer"},
      "availability": {"type": "string"},
      "seller": {"type": "string"}
    }
  }
}
```

News Article Analysis
```json
{
  "url": "https://news.example.com/article/123",
  "extraction_prompt": "Extract the headline, author, publication date, main topics covered, sentiment (positive/negative/neutral), and a 100-word summary"
}
```

Job Board Scraping
```json
{
  "url": "https://jobs.example.com/listing/456",
  "extraction_profile": "job_posting",
  "extraction_prompt": "Also extract: required years of experience, remote work policy, and tech stack mentioned"
}
```

Best Practices
1. Start with Profiles
Pre-built profiles are optimized and tested. Use them when they match your use case, then customize with prompts if needed.
2. Be Specific in Prompts
"Extract the price" is less effective than "Extract the current sale price in USD as a number, ignoring shipping costs."
3. Use Schemas for Consistency
When scraping many pages, schemas ensure consistent field names and types across all results.
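For example, a batch run can reuse a single schema across many URLs so every result comes back with the same field names. The builder below is pure (no network); each payload would then be POSTed to the `/scrape` endpoint with `requests.post` exactly as in the earlier examples:

```python
def build_payloads(urls, schema):
    """One request payload per URL, all sharing the same extraction schema."""
    return [{"url": url, "extraction_schema": schema} for url in urls]

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
}

payloads = build_payloads(
    ["https://shop.example.com/p/1", "https://shop.example.com/p/2"],
    schema,
)
# Every payload carries the identical schema, so results can be loaded
# straight into one table without per-page field mapping.
```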
4. Handle Missing Data
Not all pages have all fields. Check for null values and handle gracefully.
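For instance, reading fields defensively with `dict.get` avoids a `KeyError` when a page lacks a field; the field names here follow the earlier pricing-schema example:

```python
def normalize_product(extracted):
    """Fill safe defaults for fields a page may not provide."""
    return {
        "name": extracted.get("product_name", "unknown"),
        "price": extracted.get("current_price"),      # None when absent
        "in_stock": extracted.get("in_stock", False),
        "discount": extracted.get("discount_percent", 0),
    }

# A page that omitted stock and discount info still yields a full row
row = normalize_product({"product_name": "Widget", "current_price": 19.99})
```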
5. Test on Sample Pages First
Before running large batches, test your extraction on a few pages to verify the output matches expectations.