Extraction Profiles
Pre-built extraction templates for common page types. Use a profile to extract structured data without writing a custom schema — AlterLab knows which fields to look for.
No LLM required
Add extraction_prompt to layer LLM reasoning on top of any profile.
Available Profiles
| Profile | Primary Use Case | Key Fields |
|---|---|---|
| auto | Unknown page type | Detects page type, applies best profile |
| product | E-commerce product pages | name, price, currency, images, rating, availability |
| article | News articles, blog posts | title, author, published_date, content, summary |
| job_posting | Job listing pages | title, company, location, salary, requirements |
| faq | FAQ and help pages | question/answer pairs array |
| recipe | Recipe and cooking pages | name, ingredients, instructions, cook_time, servings |
| event | Event listing pages | name, date, location, description, price |
auto
The auto profile analyzes the page structure and selects the most appropriate extraction strategy. Use it when you are processing mixed content types or do not know the page type in advance.
import requests

# unknown_page_html is the HTML string of a page you fetched earlier
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": unknown_page_html,
        "content_type": "html",
        "extraction_profile": "auto"
    }
)
data = response.json()
# Returns the fields appropriate for the detected page type
print(data["formats"]["json"])
product
Extracts structured product data from e-commerce pages. Combines Schema.org Product markup, Open Graph data, and DOM parsing for maximum coverage.
Fields Extracted
| Field | Type | Description |
|---|---|---|
| name | string | Product name / title |
| price | number | Numeric price (currency symbol stripped) |
| currency | string | ISO 4217 currency code (e.g., USD) |
| images | string[] | Product image URLs |
| rating | number \| null | Numeric rating (normalized 0–5) |
| availability | string \| null | in_stock, out_of_stock, limited, preorder |
| brand | string \| null | Brand or manufacturer name |
| description | string \| null | Product description text |
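As a minimal sketch of how the field table above might be consumed, the hypothetical Product dataclass below mirrors the documented types and treats the nullable fields as optional. Product and product_from_dict are illustrative names, not part of AlterLab's API, and the sketch assumes the profile output has already been parsed into a Python dict.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Product:
    """Hypothetical container mirroring the documented product fields."""
    name: str
    price: float
    currency: str
    images: list[str] = field(default_factory=list)
    # Fields documented as nullable default to None
    rating: Optional[float] = None
    availability: Optional[str] = None
    brand: Optional[str] = None
    description: Optional[str] = None

    @property
    def in_stock(self) -> bool:
        # "in_stock" is one of the documented availability values
        return self.availability == "in_stock"


def product_from_dict(data: dict) -> Product:
    """Build a Product from profile output, tolerating absent nullable keys."""
    return Product(
        name=data["name"],
        price=float(data["price"]),
        currency=data["currency"],
        images=list(data.get("images") or []),
        rating=data.get("rating"),
        availability=data.get("availability"),
        brand=data.get("brand"),
        description=data.get("description"),
    )


sample = {"name": "Widget Pro Max", "price": 49.99, "currency": "USD",
          "images": [], "rating": None, "availability": "in_stock"}
product = product_from_dict(sample)
print(product.name, product.currency, product.price, product.in_stock)
```

Reading nullable fields with .get() keeps downstream code from raising KeyError when a page lacks, say, a rating or brand.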
Example Output
{
  "name": "Widget Pro Max",
  "price": 49.99,
  "currency": "USD",
  "images": [
    "https://example.com/img/widget-pro-1.jpg",
    "https://example.com/img/widget-pro-2.jpg"
  ],
  "rating": 4.7,
  "availability": "in_stock",
  "brand": "WidgetCo",
  "description": "The ultimate widget for professionals. Water-resistant, 5-year warranty."
}
import requests

# product_page_html is the HTML string of a product page you fetched earlier
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": product_page_html,
        "content_type": "html",
        "extraction_profile": "product",
        "formats": ["json"]
    }
)
product = response.json()["formats"]["json"]
print(f"{product['name']} - {product['currency']} {product['price']}")
print(f"In stock: {product['availability'] == 'in_stock'}")
article
Extracts editorial content from news articles, blog posts, and long-form pages. Uses article-specific signals including byline, dateline, and body copy detection.
Example Output
{
  "title": "Breaking: Market Hits Record High",
  "author": "Jane Smith",
  "published_date": "2026-05-10T14:30:00Z",
  "content": "Full article body text...",
  "summary": "Markets surged on strong employment data, reaching...",
  "images": ["https://example.com/img/market-chart.jpg"]
}
job_posting
Extracts structured job listing data from career pages, LinkedIn posts, and job board listings. Handles both Schema.org JobPosting markup and unstructured listings.
Example Output
{
  "title": "Senior Software Engineer",
  "company": "Acme Corp",
  "location": "San Francisco, CA (Hybrid)",
  "salary": {
    "min": 180000,
    "max": 230000,
    "currency": "USD",
    "period": "yearly"
  },
  "description": "We are looking for a senior engineer to join...",
  "requirements": [
    "5+ years of backend experience",
    "Proficiency in Python or Go",
    "Experience with distributed systems"
  ],
  "employment_type": "FULL_TIME",
  "remote": true
}
import requests

# Extract job listings at scale; job_page_html_list holds HTML strings fetched earlier
jobs = []
for html in job_page_html_list:
    resp = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": html,
            "content_type": "html",
            "extraction_profile": "job_posting"
        }
    )
    jobs.append(resp.json()["formats"]["json"])
# All jobs now have: title, company, location, salary, requirements
for job in jobs:
    # Fall back to an empty dict when salary is missing or null
    salary = job.get("salary") or {}
    print(f"{job['title']} at {job['company']}: {salary.get('min')}–{salary.get('max')} {salary.get('currency')}")
faq
Extracts question/answer pairs from FAQ sections, help center pages, and support articles. Handles both Schema.org FAQPage markup and header/paragraph patterns.
Example Output
{
  "faqs": [
    {
      "question": "How do I reset my password?",
      "answer": "Click the 'Forgot Password' link on the login page and enter your email..."
    },
    {
      "question": "Is there a free trial?",
      "answer": "Yes, all new accounts receive $5 in free credits upon signup."
    }
  ]
}
recipe
Extracts structured recipe data from cooking sites. Handles Schema.org Recipe markup as well as DOM-based ingredient and instruction list detection.
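The time fields this profile returns are ISO 8601 duration strings, such as the PT15M value in the example output for this profile. A minimal sketch for converting them to timedelta objects follows. parse_duration is a hypothetical helper, not part of AlterLab, and it covers only the day/hour/minute/second subset of ISO 8601, which is an assumption about what the profile emits.

```python
import re
from datetime import timedelta

# Matches durations like PT15M, PT1H30M, or P1DT2H (no weeks, months, or years)
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(value: str) -> timedelta:
    """Convert a simple ISO 8601 duration string into a timedelta."""
    match = _DURATION.match(value)
    # Reject non-matching input and empty designators such as "P" or "PT"
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"Unsupported ISO 8601 duration: {value!r}")
    parts = {key: int(val) for key, val in match.groupdict().items() if val is not None}
    return timedelta(**parts)


recipe = {"prep_time": "PT15M", "cook_time": "PT11M", "total_time": "PT26M"}
total = parse_duration(recipe["prep_time"]) + parse_duration(recipe["cook_time"])
print(total == parse_duration(recipe["total_time"]))
```

Working with timedelta values makes it straightforward to sort or filter recipes by total time.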
Example Output
{
  "name": "Classic Chocolate Chip Cookies",
  "ingredients": [
    "2 1/4 cups all-purpose flour",
    "1 tsp baking soda",
    "2 sticks unsalted butter, softened",
    "3/4 cup granulated sugar",
    "2 large eggs",
    "2 cups chocolate chips"
  ],
  "instructions": [
    "Preheat oven to 375°F.",
    "Cream butter and sugar until fluffy.",
    "Beat in eggs one at a time.",
    "Gradually blend in flour mixture.",
    "Stir in chocolate chips.",
    "Drop rounded tablespoons onto ungreased baking sheets.",
    "Bake for 9–11 minutes."
  ],
  "prep_time": "PT15M",
  "cook_time": "PT11M",
  "total_time": "PT26M",
  "servings": 60,
  "nutrition": {
    "calories": 110,
    "fat": "6g",
    "sugar": "8g"
  }
}
event
Extracts event metadata from event listing pages, ticketing sites, and venue calendars. Uses Schema.org Event markup and text-based date/location detection.
Example Output
{
  "name": "AI Summit 2026",
  "date": "2026-09-15T09:00:00",
  "end_date": "2026-09-16T18:00:00",
  "location": {
    "name": "Moscone Center",
    "address": "747 Howard St, San Francisco, CA 94103"
  },
  "description": "Two-day summit bringing together AI researchers and practitioners...",
  "organizer": "AI Alliance",
  "price": {
    "min": 299,
    "max": 1499,
    "currency": "USD"
  },
  "url": "https://example.com/ai-summit-2026",
  "online": false
}
Profile + Custom Schema
Combine a profile with a custom extraction_schema to filter the profile output to only the fields you need. The profile determines the extraction strategy; the schema determines the output shape.
import requests

# Use the product profile, but only keep name, price, and availability
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": product_html,
        "content_type": "html",
        "extraction_profile": "product",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "availability": {"type": "string"}
            }
        }
    }
)
# Output contains only name, price, availability
data = response.json()["formats"]["json"]
print(data)  # {"name": "...", "price": 49.99, "availability": "in_stock"}
Profile + Prompt
Add extraction_prompt to any profile request. The profile handles standard fields algorithmically; the LLM processes the prompt to add derived or computed fields. See the BYOK Extraction guide for setup instructions.
Profile vs Schema vs Prompt
These three extraction methods can be used independently or together. Here is when to use each.
| Method | Speed | Cost | Best For |
|---|---|---|---|
| extraction_profile | Fast | $0.0025 / call | Known page types with standard structure |
| extraction_schema | Fast | $0.0025 / call | Custom fields from HTML with semantic markup |
| extraction_prompt | Slower (LLM) | $0.0035 + tokens | Plain text, reasoning, summarization, classification |
| Profile + Schema | Fast | $0.0025 / call | Known page type, but only a subset of fields needed |
| Profile + Prompt + Schema | Slower (LLM) | $0.0035 + tokens | Standard fields algorithmically + derived fields via LLM |
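The combinations in the table can be sketched as a small request builder. The field names extraction_profile, extraction_schema, and extraction_prompt come from this page; the helper names build_extract_payload and base_cost_usd are hypothetical. The cost estimate covers only the per-call base price listed above, not LLM token charges, and any LLM key setup described in the BYOK Extraction guide is not shown.

```python
from typing import Optional

# Per-call base prices taken from the comparison table (token charges excluded)
ALGORITHMIC_RATE_USD = 0.0025
LLM_BASE_RATE_USD = 0.0035


def build_extract_payload(
    content: str,
    profile: Optional[str] = None,
    schema: Optional[dict] = None,
    prompt: Optional[str] = None,
    content_type: str = "html",
) -> dict:
    """Assemble a request body that uses any mix of profile, schema, and prompt."""
    if profile is None and schema is None and prompt is None:
        raise ValueError("Provide at least one of profile, schema, or prompt")
    payload = {"content": content, "content_type": content_type}
    if profile is not None:
        payload["extraction_profile"] = profile
    if schema is not None:
        payload["extraction_schema"] = schema
    if prompt is not None:
        payload["extraction_prompt"] = prompt
    return payload


def base_cost_usd(payload: dict, calls: int = 1) -> float:
    """Estimate the base cost; requests with a prompt use the LLM rate."""
    rate = LLM_BASE_RATE_USD if "extraction_prompt" in payload else ALGORITHMIC_RATE_USD
    return round(rate * calls, 4)


# Product profile plus a prompt that asks for a derived field
payload = build_extract_payload(
    "<html>...</html>",
    profile="product",
    prompt="Classify the product into a category.",
)
print(sorted(payload))
print(base_cost_usd(payload, calls=1000))
```

The resulting dict can be passed as the json argument to requests.post, in the same way as the request examples earlier on this page.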