Extract API
Extract structured data from raw HTML, text, or markdown content without scraping a URL. Bring your own data and let AlterLab handle the extraction.
Bring Your Own Data
Overview
Send Content
POST raw HTML, text, or markdown to /api/v1/extract. Specify the content type so the pipeline knows how to parse it.
Define What to Extract
Use an extraction_schema (JSON Schema), an extraction_profile (pre-built template), or an extraction_prompt (natural language) to describe the output structure.
Get Structured Data
Receive clean, typed JSON matching your schema — plus optional text, markdown, or RAG-ready output formats.
POST /api/v1/extract
Extract structured data from raw content. Returns the extraction result synchronously.
```shell
curl -X POST https://api.alterlab.io/api/v1/extract \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "<html><body><h1>Widget Pro</h1><span class=\"price\">$49.99</span></body></html>",
    "content_type": "html",
    "extraction_schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"}
      }
    }
  }'
```
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| content | string | Yes | Raw content to extract from. Max 5 MB. Must not be blank. |
| content_type | string | No | Type of content: html (default), text, or markdown. |
| extraction_schema | object | No | JSON Schema defining the output structure. Fields are mapped from content using algorithmic matching, type coercion, and field aliases. |
| extraction_profile | string | No | Pre-built extraction template. One of: auto, product, article, job_posting, faq, recipe, event. |
| extraction_prompt | string | No | Natural language instructions for LLM extraction. Max 2,000 characters. When provided, an LLM processes the content. |
| formats | string[] | No | Output formats. Options: json (default), text, markdown, html, json_v2, rag. |
| source_url | string | No | Original URL of the content. Used as context for the LLM — not fetched. |
| evidence | boolean | No | When true, include field provenance tracking — shows where each extracted value came from. Default: false. |
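The limits in the table above (non-blank content, 5 MB maximum, 2,000-character prompt, three valid content types) can be checked client-side before a request is sent. A minimal sketch; the helper name and error messages are our own, not part of the API:

```python
MAX_CONTENT_BYTES = 5 * 1024 * 1024   # 5 MB content limit
MAX_PROMPT_CHARS = 2_000              # extraction_prompt limit
VALID_CONTENT_TYPES = {"html", "text", "markdown"}

def validate_extract_request(body: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the body looks OK."""
    problems = []
    content = body.get("content", "")
    if not content or not content.strip():
        problems.append("content must not be blank")
    elif len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        problems.append("content exceeds the 5 MB limit")
    if body.get("content_type", "html") not in VALID_CONTENT_TYPES:
        problems.append("content_type must be html, text, or markdown")
    prompt = body.get("extraction_prompt")
    if prompt is not None and len(prompt) > MAX_PROMPT_CHARS:
        problems.append("extraction_prompt exceeds 2,000 characters")
    return problems
```

Catching these locally avoids spending a round trip on a request that would come back as a 400 or 422.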
Content Types
| Type | Use For | Notes |
|---|---|---|
| html | Full or partial HTML pages, scraped content | Best extraction quality. Schema.org, Open Graph, and DOM structure are all used for matching. |
| text | Plain text, OCR output, transcripts | Works best with extraction_prompt for LLM-based extraction. |
| markdown | Markdown documents, LLM output, wiki pages | Preserves markdown formatting in the markdown output format. |
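If you handle mixed inputs, you can pick a content_type heuristically before calling the API. This is a client-side convenience only (the API itself defaults to html when the field is omitted), and the heuristic below is our own sketch, not part of AlterLab:

```python
import re

def guess_content_type(content: str) -> str:
    """Heuristically choose a content_type value for the Extract API."""
    stripped = content.lstrip()
    # A leading angle-bracket tag strongly suggests HTML.
    if stripped.startswith("<"):
        return "html"
    # Headings, list markers, or fenced code blocks suggest markdown.
    if re.search(r"^(#{1,6} |[-*] |```)", stripped, flags=re.MULTILINE):
        return "markdown"
    return "text"
```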
Extraction Profiles
Profiles are pre-built extraction templates that know which fields to look for. Use them when you want structured data without writing a custom schema.
| Profile | Fields Extracted | Best For |
|---|---|---|
| auto | Detects page type automatically | General-purpose extraction |
| product | name, price, currency, images, rating, availability, brand, description | E-commerce product pages |
| article | title, author, published_date, content, summary, images | News articles, blog posts |
| job_posting | title, company, location, salary, description, requirements | Job listing pages |
| faq | questions, answers (as question/answer pairs) | FAQ and help pages |
| recipe | name, ingredients, instructions, cook_time, servings, nutrition | Recipe pages |
| event | name, date, location, description, organizer, price | Event listing pages |
Response
| Field | Type | Description |
|---|---|---|
| extract_id | string | Unique identifier for this extraction (e.g., ext_a1b2c3d4...) |
| formats | object | Extraction results keyed by requested format (e.g., {"json": {...}, "text": "..."}) |
| credits_used | integer | Credits consumed in microcents (e.g., 2500 = $0.0025) |
| model_used | string or null | LLM model used, if extraction_prompt was provided. Null for algorithmic extraction. |
| extraction_method | string | Method used: algorithmic, llm, playbook, or lossless. |
| content_size_chars | integer | Size of the input content in characters. |
```json
{
  "extract_id": "ext_a1b2c3d4e5f6a7b8c9d0e1f2",
  "formats": {
    "json": {
      "name": "Widget Pro",
      "price": 49.99
    }
  },
  "credits_used": 2500,
  "model_used": null,
  "extraction_method": "algorithmic",
  "content_size_chars": 1234
}
```
Credit Model
| Scenario | Cost | Notes |
|---|---|---|
| Base extraction | 2,500 microcents ($0.0025) | Equivalent to a Tier 3.5 scrape — applies to all extractions. |
| Large content (>200K chars) | 5,000 microcents ($0.005) | Double cost for content exceeding ~50K tokens. |
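The pricing table above maps directly to a small cost estimator. A sketch, assuming the two documented tiers are the only ones; the function names are our own:

```python
MICROCENTS_PER_DOLLAR = 1_000_000       # 2,500 microcents = $0.0025
BASE_COST = 2_500                       # flat cost per extraction, in microcents
LARGE_CONTENT_COST = 5_000              # double cost for large content
LARGE_CONTENT_THRESHOLD = 200_000       # characters (~50K tokens)

def estimate_extract_cost(content_size_chars: int) -> int:
    """Return the expected charge in microcents for a given content size."""
    if content_size_chars > LARGE_CONTENT_THRESHOLD:
        return LARGE_CONTENT_COST
    return BASE_COST

def microcents_to_dollars(microcents: int) -> float:
    """Convert a credits_used value to US dollars."""
    return microcents / MICROCENTS_PER_DOLLAR
```

For example, the 1,234-character request in the response sample above falls in the base tier: 2,500 microcents, or $0.0025.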
No Scraping Cost

Because you supply the content yourself, no URL is fetched and no scraping charge applies; you pay only the extraction cost above.
Error Codes
| Status | Error | Description |
|---|---|---|
| 400 | validation_error | Invalid request body — missing content, invalid content_type, or malformed schema. |
| 401 | unauthorized | Missing or invalid API key. |
| 402 | insufficient_credits | Not enough credits for the extraction. |
| 422 | unprocessable_entity | Content is blank or exceeds the 5 MB limit. |
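A caller can turn the table above into explicit exceptions rather than silently reading a missing formats key. A minimal sketch; the wrapper function and exception class are our own, not part of an official SDK:

```python
DOCUMENTED_ERRORS = {
    400: "validation_error",
    401: "unauthorized",
    402: "insufficient_credits",
    422: "unprocessable_entity",
}

class ExtractError(Exception):
    """Raised when /api/v1/extract returns a documented error status."""
    def __init__(self, status: int, error: str):
        super().__init__(f"HTTP {status}: {error}")
        self.status = status
        self.error = error

def check_extract_response(status: int, body: dict) -> dict:
    """Return the parsed body on success; raise ExtractError on a documented error."""
    if status in DOCUMENTED_ERRORS:
        raise ExtractError(status, body.get("error", DOCUMENTED_ERRORS[status]))
    return body
```

Usage: `check_extract_response(response.status_code, response.json())` after any of the request examples below.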
Examples
Schema-Based Extraction
Define a JSON Schema to extract specific fields from HTML content.
```python
import requests

response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": """<html>
<body>
  <h1>Widget Pro</h1>
  <span class="price">$49.99</span>
  <p class="desc">The ultimate widget for professionals.</p>
  <span class="rating">4.8 out of 5</span>
</body>
</html>""",
        "content_type": "html",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "description": {"type": "string"},
                "rating": {"type": "number"}
            }
        }
    }
)

data = response.json()
print(data["formats"]["json"])
# {"name": "Widget Pro", "price": 49.99, "description": "The ultimate...", "rating": 4.8}
```
Profile-Based Extraction
Use a pre-built profile to extract common data types without writing a schema.
```python
import requests

# Extract product data using the built-in profile
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": scraped_html,  # HTML you already have
        "content_type": "html",
        "extraction_profile": "product",
        "formats": ["json", "text"]
    }
)

data = response.json()
product = data["formats"]["json"]
print(f"{product['name']} - {product['price']} {product['currency']}")
```
LLM Extraction with Prompt
Use natural language to tell the LLM what to extract. Combine with a schema for typed output.
```python
import requests

# Extract with natural language instructions
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": article_text,
        "content_type": "text",
        "extraction_prompt": "Extract the main argument, key evidence points, and the author's conclusion.",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "main_argument": {"type": "string"},
                "evidence": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "conclusion": {"type": "string"}
            }
        }
    }
)

data = response.json()
print(f"Method: {data['extraction_method']}")  # "llm"
print(f"Model: {data['model_used']}")
print(data["formats"]["json"])
```
Evidence Mode
Enable evidence tracking to see where each extracted value came from in the source content.
```python
import requests

# Track field provenance with evidence mode
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": html_content,
        "content_type": "html",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"}
            }
        },
        "evidence": True
    }
)

data = response.json()
# Evidence is included in the json output alongside extracted values
print(data["formats"]["json"])
```