Structured Extraction
Turn raw content into clean, typed data using schemas, profiles, and natural language prompts. Works with the standalone /v1/extract endpoint or inline with /v1/scrape.
Extract vs Scrape
AlterLab offers two ways to get structured data. Choose based on whether you already have the content.
| Scenario | Use | Why |
|---|---|---|
| You have a URL, need data | POST /v1/scrape with extraction params | Scrapes the page and extracts in one call. Costs scrape credits + extraction. |
| You already have HTML/text content | POST /v1/extract | No scraping needed — just extraction. Cheaper ($0.0025 per call). |
| Processing cached/archived content | POST /v1/extract | Re-extract from previously scraped content without re-scraping the URL. |
| Processing LLM or OCR output | POST /v1/extract with content_type: "text" | Structure unstructured text from any source. |
| Batch processing many documents | POST /v1/extract in parallel | Feed content from your database or file system without network overhead. |
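The two request shapes can be compared side by side as plain payloads. A minimal sketch: the /v1/extract body mirrors the examples later on this page, while the extraction parameter name on /v1/scrape is an assumption based on the table above.

```python
# A schema shared by both calls (see Extraction Schemas below).
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
}

# Option 1: you only have a URL. Scrape and extract in one call
# (the extraction parameter name on /v1/scrape is an assumption here).
scrape_payload = {
    "url": "https://example.com/product",
    "extraction_schema": schema,
}

# Option 2: you already have the content. Extraction only, no scrape credits.
extract_payload = {
    "content": "<html>...</html>",
    "content_type": "html",
    "extraction_schema": schema,
}

# Either body is POSTed with your X-API-Key header, e.g.:
# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=extract_payload)
```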
Extraction Schemas
An extraction schema is a JSON Schema object that defines the fields you want. AlterLab matches fields using exact matching, case-insensitive matching, field aliases, and type coercion.
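As a rough local illustration of the matching behavior described above (a sketch only, not AlterLab's actual implementation):

```python
def match_field(raw: dict, name: str):
    """Sketch of case-insensitive field matching."""
    lowered = {key.lower(): value for key, value in raw.items()}
    return lowered.get(name.lower())

def coerce(value, expected_type: str):
    """Sketch of the type coercion described on this page."""
    if expected_type == "number":
        # "$99.99" -> 99.99: strip the currency symbol and separators
        return float(str(value).strip().lstrip("$").replace(",", ""))
    if expected_type == "boolean":
        return str(value).strip().lower() in ("true", "yes", "1")
    return str(value)

raw = {"Price": "$99.99", "In_Stock": "true"}
price = coerce(match_field(raw, "price"), "number")        # 99.99
in_stock = coerce(match_field(raw, "in_stock"), "boolean") # True
```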
Basic Schema
Define the fields and their types. The extractor maps them from the content automatically.
{
"type": "object",
"properties": {
"title": { "type": "string" },
"price": { "type": "number" },
"in_stock": { "type": "boolean" },
"tags": {
"type": "array",
"items": { "type": "string" }
}
}
}
Field Descriptions for Better Matching
Add a description to each field to guide the extractor (especially useful with LLM prompts).
{
"type": "object",
"properties": {
"headline": {
"type": "string",
"description": "The main product name or page title"
},
"cost": {
"type": "number",
"description": "The price in USD, without currency symbol"
},
"verdict": {
"type": "string",
"description": "One-sentence summary of the review"
}
}
}
Nested Objects and Arrays
Schemas support nesting for complex data structures.
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": job_listing_html,
"content_type": "html",
"extraction_schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"company": {"type": "string"},
"salary": {
"type": "object",
"properties": {
"min": {"type": "number"},
"max": {"type": "number"},
"currency": {"type": "string"}
}
},
"requirements": {
"type": "array",
"items": {"type": "string"}
},
"benefits": {
"type": "array",
"items": {"type": "string"}
}
}
}
}
)
job = response.json()["formats"]["json"]
print(f"{job['title']} at {job['company']}")
print(f"Salary: {job['salary']['min']}-{job['salary']['max']} {job['salary']['currency']}")
Type Coercion
"$99.99" is coerced to 99.99 when the schema expects a number. Strings like "true" become true for boolean fields.
Extraction Profiles
Profiles are pre-built schemas for common page types. They save you from writing a schema when the data follows standard patterns.
# Extract product data — no custom schema needed
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": product_page_html,
"extraction_profile": "product"
}
)
product = response.json()["formats"]["json"]
# Returns: name, price, currency, images, rating, availability, brand, description
Profile + Schema
A profile can be combined with a custom extraction_schema to narrow the output to just the fields you need; see Combining Methods below.
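A minimal sketch of pairing a profile with a schema, following the same request shape as the profile-only example above:

```python
# The profile supplies the extraction strategy; the schema filters the
# profile's output down to just these typed fields.
payload = {
    "content": "<html>...</html>",   # your product page HTML
    "content_type": "html",
    "extraction_profile": "product",
    "extraction_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
        },
    },
}

# POSTed the same way as the other examples on this page:
# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=payload)
```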
Extraction Prompts
When algorithmic extraction is not enough, use an extraction_prompt to invoke an LLM. This is ideal for unstructured text, complex reasoning, or custom transformations.
When to Use Prompts
| Scenario | Method | Why |
|---|---|---|
| HTML with Schema.org / Open Graph | Schema or profile (algorithmic) | Faster, cheaper, deterministic. No LLM needed. |
| Plain text or unstructured content | Prompt + schema | LLM understands natural language context. |
| Summarization or interpretation | Prompt only | Schema optional — prompt tells the LLM what output to produce. |
| Complex multi-field extraction | Prompt + schema | Schema ensures typed output. Prompt provides reasoning context. |
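For the prompt-only row above, the schema can simply be omitted. A sketch of that request body (the sample content is a placeholder):

```python
# Prompt-only extraction: no schema, so the prompt alone tells the LLM
# what output to produce.
payload = {
    "content": "Quarterly revenue rose 12% on strong cloud demand...",
    "content_type": "text",
    "extraction_prompt": (
        "Summarize this passage in one sentence and note the overall tone."
    ),
}

# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=payload)
```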
Writing Effective Prompts
# Summarize and extract key points from an article
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": long_article_text,
"content_type": "text",
"extraction_prompt": (
"Summarize this article in 2-3 sentences. "
"Extract the 3 most important claims and whether "
"evidence is provided for each."
),
"extraction_schema": {
"type": "object",
"properties": {
"summary": {"type": "string"},
"claims": {
"type": "array",
"items": {
"type": "object",
"properties": {
"claim": {"type": "string"},
"has_evidence": {"type": "boolean"}
}
}
}
}
}
}
)
Evidence Mode
Evidence mode tracks where each extracted value came from in the source content. This is useful for auditing, debugging extraction quality, and building trust in extracted data.
# Enable evidence tracking
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": html_content,
"content_type": "html",
"extraction_schema": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"rating": {"type": "number"}
}
},
"evidence": True
}
)
data = response.json()
# The json output includes provenance metadata for each field
print(data["formats"]["json"])
Combining Methods
You can combine schemas, profiles, and prompts in a single request. The extraction pipeline processes them in order:
Content Transformation
Content is converted to the requested formats (text, json, markdown, etc.), using the profile, if one is provided, to choose the extraction strategy.
Schema Filtering
If extraction_schema is provided, the JSON output is filtered to match your schema with type coercion.
LLM Extraction
If extraction_prompt is provided, the LLM processes the text and replaces the JSON output with its structured result.
# Profile + schema + prompt — full pipeline
response = requests.post(
"https://api.alterlab.io/api/v1/extract",
headers={"X-API-Key": "YOUR_API_KEY"},
json={
"content": product_review_html,
"content_type": "html",
"extraction_profile": "product", # Step 1: use product strategy
"extraction_schema": { # Step 2: filter to these fields
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"sentiment": {"type": "string"}
}
},
"extraction_prompt": ( # Step 3: LLM adds sentiment
"Extract the product name and price. "
"Also analyze the overall review sentiment "
"as 'positive', 'negative', or 'mixed'."
),
"formats": ["json", "text"]
}
)
Bulk Extraction
Process multiple documents by calling the Extract endpoint in parallel. Since there is no network fetching, responses are fast.
import asyncio
import aiohttp
API_KEY = "YOUR_API_KEY"
EXTRACT_URL = "https://api.alterlab.io/api/v1/extract"
async def extract_one(session, content, schema):
async with session.post(
EXTRACT_URL,
headers={"X-API-Key": API_KEY},
json={
"content": content,
"content_type": "html",
"extraction_schema": schema,
},
) as resp:
return await resp.json()
async def extract_bulk(documents, schema, max_concurrent=10):
semaphore = asyncio.Semaphore(max_concurrent)
async with aiohttp.ClientSession() as session:
async def limited(doc):
async with semaphore:
return await extract_one(session, doc, schema)
return await asyncio.gather(*[limited(doc) for doc in documents])
# Usage
schema = {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
},
}
documents = [html_1, html_2, html_3] # Your HTML content
results = asyncio.run(extract_bulk(documents, schema))
for r in results:
print(r["formats"]["json"])
Best Practices
Prefer algorithmic extraction over LLM
For HTML with Schema.org or Open Graph metadata, use extraction_schema or extraction_profile without a prompt. Algorithmic extraction is faster, cheaper, and deterministic.
Use the right content_type
Setting content_type correctly helps the pipeline parse your content optimally. HTML gets full DOM analysis; text and markdown get simpler processing.
Add descriptions to schema fields
When using extraction_prompt, field descriptions in the schema help the LLM understand what each field should contain.
Keep prompts specific and concise
Short, clear prompts produce better results than verbose ones. Tell the LLM exactly what to extract — not what the content is about. Max 2,000 characters.
Mind the content size limit
Content over 200K characters incurs double cost. For LLM extraction, content is truncated to 30K characters. If your content is very large, consider extracting the relevant section before calling the API.
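One way to pre-trim content client-side using the thresholds above; the helper itself is illustrative, not part of any SDK.

```python
def clamp_content(content: str, llm: bool = False) -> str:
    """Illustrative pre-trim using the documented limits: content over
    200K chars doubles the cost, and LLM extraction only sees 30K."""
    limit = 30_000 if llm else 200_000
    if len(content) <= limit:
        return content
    # Cut at the last paragraph break before the limit so we don't
    # split mid-sentence (a simple heuristic, not part of the API).
    cut = content.rfind("\n\n", 0, limit)
    return content[: cut if cut > 0 else limit]

big = "para\n\n" * 50_000   # ~300K characters
trimmed = clamp_content(big)
```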
Use source_url for context
When extracting from content you scraped earlier, pass the original URL as source_url. The LLM uses this for context — it is not fetched.
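A sketch of passing the original URL alongside previously scraped content (the URL and content here are placeholders):

```python
cached_html = "<html>...</html>"     # content you scraped earlier

# source_url gives the LLM context about where the content came from;
# AlterLab does not fetch it.
payload = {
    "content": cached_html,
    "content_type": "html",
    "extraction_prompt": "Summarize the page in one sentence.",
    "source_url": "https://example.com/blog/post-42",
}

# requests.post("https://api.alterlab.io/api/v1/extract",
#               headers={"X-API-Key": "YOUR_API_KEY"}, json=payload)
```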