Extract API
Extract structured data from raw HTML, text, or markdown content without scraping a URL. Bring your own data and let AlterLab handle the extraction.
Bring Your Own Data
Overview
Send Content
POST raw HTML, text, or markdown to /api/v1/extract. Specify the content type so the pipeline knows how to parse it.
Define What to Extract
Use an extraction_schema (JSON Schema), an extraction_profile (pre-built template), or an extraction_prompt (natural language) to describe the output structure.
Get Structured Data
Receive clean, typed JSON matching your schema — plus optional text, markdown, or RAG-ready output formats.
POST /api/v1/extract
Extract structured data from raw content. Returns the extraction result synchronously.
```shell
curl -X POST https://api.alterlab.io/api/v1/extract \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "<html><body><h1>Widget Pro</h1><span class=\"price\">$49.99</span></body></html>",
    "content_type": "html",
    "extraction_schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"}
      }
    }
  }'
```
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| content | string | Yes | Raw content to extract from. Max 5 MB. Must not be blank. |
| content_type | string | No | Type of content: html (default), text, or markdown. |
| extraction_schema | object | No | JSON Schema defining the output structure. Fields are mapped from content using algorithmic matching, type coercion, and field aliases. |
| extraction_profile | string | No | Pre-built extraction template. One of: auto, product, article, job_posting, faq, recipe, event. |
| extraction_prompt | string | No | Natural language instructions for LLM extraction. Max 2,000 characters. When provided, an LLM processes the content. |
| formats | string[] | No | Output formats. Options: json (default), text, markdown, html, json_v2, rag. |
| source_url | string | No | Original URL of the content. Used as context for the LLM — not fetched. |
| evidence | boolean | No | When true, include field provenance tracking — shows where each extracted value came from. Default: false. |
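The limits in the table above (non-blank content, 5 MB maximum, 2,000-character prompt, three valid content types) can be checked client-side before a request is sent. A minimal sketch; the helper name and error messages are our own, not part of the API:

```python
MAX_CONTENT_BYTES = 5 * 1024 * 1024   # 5 MB content limit
MAX_PROMPT_CHARS = 2_000              # extraction_prompt limit
VALID_CONTENT_TYPES = {"html", "text", "markdown"}

def validate_extract_request(body: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the body looks OK."""
    problems = []
    content = body.get("content", "")
    if not content or not content.strip():
        problems.append("content must not be blank")
    elif len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        problems.append("content exceeds the 5 MB limit")
    if body.get("content_type", "html") not in VALID_CONTENT_TYPES:
        problems.append("content_type must be html, text, or markdown")
    prompt = body.get("extraction_prompt")
    if prompt is not None and len(prompt) > MAX_PROMPT_CHARS:
        problems.append("extraction_prompt exceeds 2,000 characters")
    return problems
```

Catching these locally avoids spending a round trip on a request that would come back as a 400 or 422.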
Content Types
| Type | Use For | Notes |
|---|---|---|
| html | Full or partial HTML pages, scraped content | Best extraction quality. Schema.org, Open Graph, and DOM structure are all used for matching. |
| text | Plain text, OCR output, transcripts | Works best with extraction_prompt for LLM-based extraction. |
| markdown | Markdown documents, LLM output, wiki pages | Preserves markdown formatting in the markdown output format. |
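If you handle mixed inputs, you can pick a content_type heuristically before calling the API. This is a client-side convenience only (the API itself defaults to html when the field is omitted), and the heuristic below is our own sketch, not part of AlterLab:

```python
import re

def guess_content_type(content: str) -> str:
    """Heuristically choose a content_type value for the Extract API."""
    stripped = content.lstrip()
    # A leading angle-bracket tag strongly suggests HTML.
    if stripped.startswith("<"):
        return "html"
    # Headings, list markers, or fenced code blocks suggest markdown.
    if re.search(r"^(#{1,6} |[-*] |```)", stripped, flags=re.MULTILINE):
        return "markdown"
    return "text"
```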
Extraction Profiles
Profiles are pre-built extraction templates that know which fields to look for. Use them when you want structured data without writing a custom schema.
| Profile | Fields Extracted | Best For |
|---|---|---|
| auto | Detects page type automatically | General-purpose extraction |
| product | name, price, currency, images, rating, availability, brand, description | E-commerce product pages |
| article | title, author, published_date, content, summary, images | News articles, blog posts |
| job_posting | title, company, location, salary, description, requirements | Job listing pages |
| faq | questions, answers (as question/answer pairs) | FAQ and help pages |
| recipe | name, ingredients, instructions, cook_time, servings, nutrition | Recipe pages |
| event | name, date, location, description, organizer, price | Event listing pages |
Response
| Field | Type | Description |
|---|---|---|
| extract_id | string | Unique identifier for this extraction (e.g., ext_a1b2c3d4...) |
| formats | object | Extraction results keyed by requested format (e.g., {"json": {...}, "text": "..."}) |
| credits_used | integer | Credits consumed in microcents (e.g., 2500 = $0.0025) |
| model_used | string or null | LLM model used, if extraction_prompt was provided. Null for algorithmic extraction. |
| extraction_method | string | Method used: algorithmic, llm, playbook, or lossless. |
| content_size_chars | integer | Size of the input content in characters. |
```json
{
  "extract_id": "ext_a1b2c3d4e5f6a7b8c9d0e1f2",
  "formats": {
    "json": {
      "name": "Widget Pro",
      "price": 49.99
    }
  },
  "credits_used": 2500,
  "model_used": null,
  "extraction_method": "algorithmic",
  "content_size_chars": 1234
}
```
Credit Model
| Scenario | Cost | Notes |
|---|---|---|
| Base extraction | 2,500 microcents ($0.0025) | Equivalent to a Tier 3.5 scrape — applies to all extractions. |
| Large content (>200K chars) | 5,000 microcents ($0.005) | Double cost for content exceeding ~50K tokens. |
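The pricing table above maps directly to a small cost estimator. A sketch, assuming the two documented tiers are the only ones; the function names are our own:

```python
MICROCENTS_PER_DOLLAR = 1_000_000       # 2,500 microcents = $0.0025
BASE_COST = 2_500                       # flat cost per extraction, in microcents
LARGE_CONTENT_COST = 5_000              # double cost for large content
LARGE_CONTENT_THRESHOLD = 200_000       # characters (~50K tokens)

def estimate_extract_cost(content_size_chars: int) -> int:
    """Return the expected charge in microcents for a given content size."""
    if content_size_chars > LARGE_CONTENT_THRESHOLD:
        return LARGE_CONTENT_COST
    return BASE_COST

def microcents_to_dollars(microcents: int) -> float:
    """Convert a credits_used value to US dollars."""
    return microcents / MICROCENTS_PER_DOLLAR
```

For example, the 1,234-character request in the response sample above falls in the base tier: 2,500 microcents, or $0.0025.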
No Scraping Cost

Because you supply the content yourself, no URL is fetched and no scraping charge applies; you pay only the extraction cost above.
Error Codes
| Status | Error | Description |
|---|---|---|
| 400 | validation_error | Invalid request body — missing content, invalid content_type, or malformed schema. |
| 401 | unauthorized | Missing or invalid API key. |
| 402 | insufficient_credits | Not enough credits for the extraction. |
| 422 | unprocessable_entity | Content is blank or exceeds the 5 MB limit. |
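A caller can turn the table above into explicit exceptions rather than silently reading a missing formats key. A minimal sketch; the wrapper function and exception class are our own, not part of an official SDK:

```python
DOCUMENTED_ERRORS = {
    400: "validation_error",
    401: "unauthorized",
    402: "insufficient_credits",
    422: "unprocessable_entity",
}

class ExtractError(Exception):
    """Raised when /api/v1/extract returns a documented error status."""
    def __init__(self, status: int, error: str):
        super().__init__(f"HTTP {status}: {error}")
        self.status = status
        self.error = error

def check_extract_response(status: int, body: dict) -> dict:
    """Return the parsed body on success; raise ExtractError on a documented error."""
    if status in DOCUMENTED_ERRORS:
        raise ExtractError(status, body.get("error", DOCUMENTED_ERRORS[status]))
    return body
```

Usage: `check_extract_response(response.status_code, response.json())` after any of the request examples below.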
Examples
Schema-Based Extraction
Define a JSON Schema to extract specific fields from HTML content.
```python
import requests

response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": """<html>
<body>
  <h1>Widget Pro</h1>
  <span class="price">$49.99</span>
  <p class="desc">The ultimate widget for professionals.</p>
  <span class="rating">4.8 out of 5</span>
</body>
</html>""",
        "content_type": "html",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "description": {"type": "string"},
                "rating": {"type": "number"}
            }
        }
    }
)

data = response.json()
print(data["formats"]["json"])
# {"name": "Widget Pro", "price": 49.99, "description": "The ultimate...", "rating": 4.8}
```
Profile-Based Extraction
Use a pre-built profile to extract common data types without writing a schema.
```python
import requests

# Extract product data using the built-in profile
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": scraped_html,  # HTML you already have
        "content_type": "html",
        "extraction_profile": "product",
        "formats": ["json", "text"]
    }
)

data = response.json()
product = data["formats"]["json"]
print(f"{product['name']} - {product['price']} {product['currency']}")
```
LLM Extraction with Prompt
Use natural language to tell the LLM what to extract. Combine with a schema for typed output.
```python
import requests

# Extract with natural language instructions
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": article_text,
        "content_type": "text",
        "extraction_prompt": "Extract the main argument, key evidence points, and the author's conclusion.",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "main_argument": {"type": "string"},
                "evidence": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "conclusion": {"type": "string"}
            }
        }
    }
)

data = response.json()
print(f"Method: {data['extraction_method']}")  # "llm"
print(f"Model: {data['model_used']}")
print(data["formats"]["json"])
```
Evidence Mode
Enable evidence tracking to see where each extracted value came from in the source content.
```python
import requests

# Track field provenance with evidence mode
response = requests.post(
    "https://api.alterlab.io/api/v1/extract",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "content": html_content,
        "content_type": "html",
        "extraction_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "price": {"type": "number"}
            }
        },
        "evidence": True
    }
)

data = response.json()
# Evidence is included in the json output alongside extracted values
print(data["formats"]["json"])
```