Extraction Profiles

Extraction profiles are pre-built extraction templates for common page types. Use a profile to extract structured data without writing a custom schema: AlterLab already knows which fields to look for on each page type.

No LLM required

Profiles use algorithmic extraction by default: they are fast, deterministic, and cost the base $0.0025 per call. To layer LLM reasoning on top of any profile, add an `extraction_prompt` to the request.

Available Profiles

Profile	Primary Use Case	Key Fields
auto	Unknown page type	Detects page type, applies best profile
product	E-commerce product pages	name, price, currency, images, rating, availability
article	News articles, blog posts	title, author, published_date, content, summary
job_posting	Job listing pages	title, company, location, salary, requirements
faq	FAQ and help pages	question/answer pairs array
recipe	Recipe and cooking pages	name, ingredients, instructions, cook_time, servings
event	Event listing pages	name, date, location, description, price

`auto`

The auto profile analyzes the page structure and selects the most appropriate extraction strategy. Use it when you are processing mixed content types or do not know the page type in advance.

Python

import requests

# unknown_page_html holds the raw HTML string of the page to analyze
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": unknown_page_html,
        "content_type": "html",
        "extraction_profile": "auto"
    }
)

data = response.json()
# Returns the fields appropriate for the detected page type
print(data["formats"]["json"])

`product`

Extracts structured product data from e-commerce pages. Combines Schema.org Product markup, Open Graph data, and DOM parsing for maximum coverage.

Fields Extracted

Field	Type	Description
name	string	Product name / title
price	number	Numeric price (currency symbol stripped)
currency	string	ISO 4217 currency code (e.g., USD)
images	string[]	Product image URLs
rating	number | null	Numeric rating (normalized 0–5)
availability	string | null	in_stock, out_of_stock, limited, preorder
brand	string | null	Brand or manufacturer name
description	string | null	Product description text

Example Output

JSON

{
  "name": "Widget Pro Max",
  "price": 49.99,
  "currency": "USD",
  "images": [
    "https://example.com/img/widget-pro-1.jpg",
    "https://example.com/img/widget-pro-2.jpg"
  ],
  "rating": 4.7,
  "availability": "in_stock",
  "brand": "WidgetCo",
  "description": "The ultimate widget for professionals. Water-resistant, 5-year warranty."
}
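
The field table maps naturally onto a typed record. The sketch below shows one possible way to load the profile's JSON output into a Python dataclass. The `Product` class and `parse_product` helper are hypothetical names for illustration, not part of the AlterLab API, and the code assumes that nullable fields may arrive either as null or be absent altogether.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Availability values listed in the field table.
KNOWN_AVAILABILITY = {"in_stock", "out_of_stock", "limited", "preorder"}


@dataclass
class Product:
    """Hypothetical container mirroring the documented product fields."""
    name: str
    price: float
    currency: str
    images: List[str] = field(default_factory=list)
    rating: Optional[float] = None
    availability: Optional[str] = None
    brand: Optional[str] = None
    description: Optional[str] = None


def parse_product(payload: dict) -> Product:
    """Build a Product from a profile result, tolerating missing nullable fields."""
    availability = payload.get("availability")
    if availability is not None and availability not in KNOWN_AVAILABILITY:
        raise ValueError(f"unexpected availability value: {availability!r}")
    return Product(
        name=payload["name"],
        price=float(payload["price"]),
        currency=payload["currency"],
        images=list(payload.get("images") or []),
        rating=payload.get("rating"),
        availability=availability,
        brand=payload.get("brand"),
        description=payload.get("description"),
    )


# Partial result: the nullable fields are null or absent.
sample = {"name": "Widget Pro Max", "price": 49.99, "currency": "USD", "rating": None}
product = parse_product(sample)
print(product.name, product.price, product.currency, product.availability)
```

Rejecting unknown availability values is a design choice for this sketch; a production pipeline might prefer to log and pass them through instead.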

Python

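import requests

# product_page_html holds the raw HTML string of the product page
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": product_page_html,
        "content_type": "html",
        "extraction_profile": "product",
        "formats": ["json"]
    }
)
response.raise_for_status()  # surface HTTP errors before parsing the body

product = response.json()["formats"]["json"]
# Print the amount followed by the ISO 4217 code, e.g. "49.99 USD"
print(f"{product['name']}: {product['price']} {product['currency']}")
# availability may be null, in which case this prints False
print(f"In stock: {product['availability'] == 'in_stock'}")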

`article`

Extracts editorial content from news articles, blog posts, and long-form pages. Uses article-specific signals including byline, dateline, and body copy detection.

Example Output

JSON

{
  "title": "Breaking: Market Hits Record High",
  "author": "Jane Smith",
  "published_date": "2026-05-10T14:30:00Z",
  "content": "Full article body text...",
  "summary": "Markets surged on strong employment data, reaching...",
  "images": ["https://example.com/img/market-chart.jpg"]
}
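
The `published_date` in the example is an ISO 8601 timestamp with a `Z` suffix. Here is a minimal sketch of converting it to a timezone-aware `datetime`, assuming the field always follows that format (the example shows only one value, so treat this as an assumption). `parse_published_date` is a hypothetical helper, not an AlterLab function.

```python
from datetime import datetime, timezone


def parse_published_date(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2026-05-10T14:30:00Z' into an aware datetime."""
    # datetime.fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
    # so normalize it to an explicit UTC offset for older versions.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    # Treat timestamps without an offset as UTC (an assumption, not documented behavior).
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


article = {
    "title": "Breaking: Market Hits Record High",
    "published_date": "2026-05-10T14:30:00Z",
}
published = parse_published_date(article["published_date"])
print(published.isoformat())  # 2026-05-10T14:30:00+00:00
```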

`job_posting`

Extracts structured job listing data from career pages, LinkedIn posts, and job board listings. Handles both Schema.org JobPosting markup and unstructured listings.

Example Output

JSON

{
  "title": "Senior Software Engineer",
  "company": "Acme Corp",
  "location": "San Francisco, CA (Hybrid)",
  "salary": {
    "min": 180000,
    "max": 230000,
    "currency": "USD",
    "period": "yearly"
  },
  "description": "We are looking for a senior engineer to join...",
  "requirements": [
    "5+ years of backend experience",
    "Proficiency in Python or Go",
    "Experience with distributed systems"
  ],
  "employment_type": "FULL_TIME",
  "remote": true
}

Python

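import requests

# Extract structured data from a batch of job listing pages
# (job_page_html_list is a list of raw HTML strings)
jobs = []
for html in job_page_html_list:
    resp = requests.post(
        "https://api.alterlab.io/api/v1/extract",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "content": html,
            "content_type": "html",
            "extraction_profile": "job_posting"
        }
    )
    resp.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    jobs.append(resp.json()["formats"]["json"])

# Each result follows the job_posting shape: title, company, location, salary, requirements
for job in jobs:
    # Guard against a missing or null salary so .get() below never fails
    salary = job.get("salary") or {}
    print(f"{job['title']} at {job['company']}: {salary.get('min')}–{salary.get('max')} {salary.get('currency')}")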

`faq`

Extracts question/answer pairs from FAQ sections, help center pages, and support articles. Handles both Schema.org FAQPage markup and header/paragraph patterns.

Example Output

JSON

{
  "faqs": [
    {
      "question": "How do I reset my password?",
      "answer": "Click the 'Forgot Password' link on the login page and enter your email..."
    },
    {
      "question": "Is there a free trial?",
      "answer": "Yes, all new accounts receive $5 in free credits upon signup."
    }
  ]
}
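
Because the output is an array of pairs, a common follow-up step is turning it into a question-to-answer lookup. The sketch below assumes the `faqs` key and pair shape shown in the example; `faq_lookup` is a hypothetical helper, and skipping incomplete pairs is a choice made for this illustration.

```python
def faq_lookup(result: dict) -> dict:
    """Map each question to its answer, skipping pairs with an empty side."""
    lookup = {}
    for pair in result.get("faqs") or []:
        question = (pair.get("question") or "").strip()
        answer = (pair.get("answer") or "").strip()
        if question and answer:
            lookup[question] = answer
    return lookup


result = {
    "faqs": [
        {
            "question": "Is there a free trial?",
            "answer": "Yes, all new accounts receive $5 in free credits upon signup.",
        },
        {"question": "Entry with no answer", "answer": ""},
    ]
}
answers = faq_lookup(result)
print(len(answers))  # 1
```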

`recipe`

Extracts structured recipe data from cooking sites. Handles Schema.org Recipe markup as well as DOM-based ingredient and instruction list detection.

Example Output

JSON

{
  "name": "Classic Chocolate Chip Cookies",
  "ingredients": [
    "2 1/4 cups all-purpose flour",
    "1 tsp baking soda",
    "2 sticks unsalted butter, softened",
    "3/4 cup granulated sugar",
    "2 large eggs",
    "2 cups chocolate chips"
  ],
  "instructions": [
    "Preheat oven to 375°F.",
    "Cream butter and sugar until fluffy.",
    "Beat in eggs one at a time.",
    "Gradually blend in flour mixture.",
    "Stir in chocolate chips.",
    "Drop rounded tablespoons onto ungreased baking sheets.",
    "Bake for 9–11 minutes."
  ],
  "prep_time": "PT15M",
  "cook_time": "PT11M",
  "total_time": "PT26M",
  "servings": 60,
  "nutrition": {
    "calories": 110,
    "fat": "6g",
    "sugar": "8g"
  }
}
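
The time fields in the example (`PT15M`, `PT11M`, `PT26M`) are ISO 8601 durations. A minimal sketch of converting them to `timedelta` objects follows. It only handles the time-only form (hours, minutes, seconds) seen in the example; whether the profile can emit other duration forms is not stated here, so `parse_duration` is a hypothetical helper that rejects anything else.

```python
import re
from datetime import timedelta

# Matches time-only ISO 8601 durations such as PT15M or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value: str) -> timedelta:
    """Convert a time-only ISO 8601 duration into a timedelta."""
    match = _DURATION.match(value)
    # Reject non-matching strings and the bare prefix "PT" with no components.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


recipe = {"prep_time": "PT15M", "cook_time": "PT11M", "total_time": "PT26M"}
prep = parse_duration(recipe["prep_time"])
cook = parse_duration(recipe["cook_time"])
total = parse_duration(recipe["total_time"])
print(prep + cook == total)  # True
```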

`event`

Extracts event metadata from event listing pages, ticketing sites, and venue calendars. Uses Schema.org Event markup and text-based date/location detection.

Example Output

JSON

{
  "name": "AI Summit 2026",
  "date": "2026-09-15T09:00:00",
  "end_date": "2026-09-16T18:00:00",
  "location": {
    "name": "Moscone Center",
    "address": "747 Howard St, San Francisco, CA 94103"
  },
  "description": "Two-day summit bringing together AI researchers and practitioners...",
  "organizer": "AI Alliance",
  "price": {
    "min": 299,
    "max": 1499,
    "currency": "USD"
  },
  "url": "https://example.com/ai-summit-2026",
  "online": false
}
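
The `date` and `end_date` values in the example carry no UTC offset, and `price` is a min/max range. The sketch below shows one way to work with both, assuming every result includes `end_date` and a complete `price` object as in the example. `event_duration_hours` and `format_price` are hypothetical helpers.

```python
from datetime import datetime


def event_duration_hours(event: dict) -> float:
    """Return the event length in hours from its naive start and end timestamps."""
    start = datetime.fromisoformat(event["date"])
    end = datetime.fromisoformat(event["end_date"])
    return (end - start).total_seconds() / 3600


def format_price(price: dict) -> str:
    """Render a price range, collapsing it when min and max are equal."""
    low, high, currency = price["min"], price["max"], price["currency"]
    if low == high:
        return f"{low} {currency}"
    return f"{low}-{high} {currency}"


event = {
    "name": "AI Summit 2026",
    "date": "2026-09-15T09:00:00",
    "end_date": "2026-09-16T18:00:00",
    "price": {"min": 299, "max": 1499, "currency": "USD"},
}
print(event_duration_hours(event))   # 33.0
print(format_price(event["price"]))  # 299-1499 USD
```

Both timestamps are parsed as naive datetimes, so the duration is only meaningful when start and end share the same local time zone.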

Profile + Custom Schema

Combine a profile with a custom extraction_schema to filter the profile output to only the fields you need. The profile determines the extraction strategy; the schema determines the output shape.

Python

import requests

# Use the product profile, but only keep name, price, and availability
# (product_html holds the raw HTML string of the product page)
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": product_html,
        "content_type": "html",
        "extraction_profile": "product",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "availability": {"type": "string"}
            }
        }
    }
)

# Output contains only name, price, availability
data = response.json()["formats"]["json"]
print(data)  # {"name": "...", "price": 49.99, "availability": "in_stock"}

Profile + Prompt

You can also add an extraction_prompt to any profile request. The profile handles standard fields algorithmically; the LLM processes the prompt to add derived or computed fields. See the BYOK Extraction guide for setup instructions.
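
As a rough illustration, a combined request body might be assembled as below. This is a sketch under assumptions: the text confirms that `extraction_prompt` can accompany a profile, but not its exact form, so it is assumed here to be a plain string. `build_profile_prompt_request` and the prompt wording are hypothetical.

```python
import json


def build_profile_prompt_request(html: str, profile: str, prompt: str) -> dict:
    """Assemble a request body that pairs a profile with an LLM prompt."""
    return {
        "content": html,
        "content_type": "html",
        "extraction_profile": profile,
        # Assumed to be a free-text instruction; see the BYOK Extraction guide.
        "extraction_prompt": prompt,
    }


payload = build_profile_prompt_request(
    "<html>...</html>",
    "product",
    "Classify the product as 'budget', 'mid-range', or 'premium'.",
)
print(json.dumps(payload, indent=2))

# Sending it works like the other examples (needs the requests package and BYOK setup):
# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=payload)
```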

Profile vs Schema vs Prompt

These three extraction methods can be used independently or together. Here is when to use each.

Method	Speed	Cost	Best For
`extraction_profile`	Fast	$0.0025 / call	Known page types with standard structure
`extraction_schema`	Fast	$0.0025 / call	Custom fields from HTML with semantic markup
`extraction_prompt`	Slower (LLM)	$0.0035 + tokens	Plain text, reasoning, summarization, classification
Profile + Schema	Fast	$0.0025 / call	Known page type, but only a subset of fields needed
Profile + Prompt + Schema	Slower (LLM)	$0.0035 + tokens	Standard fields algorithmically + derived fields via LLM
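
To compare the options at volume, the per-call prices in the table can be turned into a quick estimate. The sketch below uses only the two base prices listed above. Token charges are not specified in this guide, so `token_cost_per_call` is a hypothetical placeholder you would fill in yourself, and `estimate_cost` is an illustrative helper.

```python
PROFILE_COST_PER_CALL = 0.0025  # profile and/or schema, from the table
PROMPT_COST_PER_CALL = 0.0035   # any method involving extraction_prompt, before tokens


def estimate_cost(calls: int, uses_prompt: bool = False,
                  token_cost_per_call: float = 0.0) -> float:
    """Estimate total spend in USD for a batch of extraction calls."""
    base = PROMPT_COST_PER_CALL if uses_prompt else PROFILE_COST_PER_CALL
    # Round to 4 decimals to hide binary floating-point noise.
    return round(calls * (base + token_cost_per_call), 4)


print(estimate_cost(10_000))                    # 25.0
print(estimate_cost(10_000, uses_prompt=True))  # 35.0, excluding token charges
```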


Last updated: June 2026
