Output Formats
Get scraped data in exactly the shape your application needs. AlterLab supports 6 output formats — from plain text to RAG-optimized chunks — and you can request multiple formats in a single API call.
Default Behavior
When you omit the formats parameter, AlterLab returns ["markdown", "json"] by default — optimized for LLM workflows. Markdown preserves document structure while JSON provides structured data extraction.
Quick Comparison
| Format | Output | Best For | Preserves Structure |
|---|---|---|---|
| text | Plain text, zero HTML | NLP, search indexing, diff | No |
| html | Sanitized, readable HTML | Re-rendering, archival | Full |
| json | Structured key-value data | Products, articles, recipes | Semantic |
| json_v2 | Section tree, tables, classified links | Universal extraction, analytics | Full + semantic |
| markdown | Headings, tables, lists, links | LLM context, documentation | Yes |
| rag | Chunked markdown with token counts | Vector DBs, RAG pipelines | Per-chunk |
text — Plain Text
Extracts readable content with all HTML tags stripped. Uses Readability for article extraction, then converts to clean text with normalized whitespace. Ideal when you need raw content for NLP pipelines, full-text search, or text comparison.
When to Use
- Full-text search indexing
- Sentiment analysis and NLP pipelines
- Content diffing between scrape runs
- Word count and readability scoring
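The content-diffing use case can be sketched with the standard library alone. This assumes you have already pulled content["text"] from two scrape runs; the payloads below are illustrative stand-ins:

```python
import difflib

# Two sample `text` payloads from the same URL, scraped a day apart
# (illustrative strings; real payloads come from content["text"]).
run_1 = "Widget Pro\nPrice: $49\nIn stock"
run_2 = "Widget Pro\nPrice: $54\nIn stock"

diff = list(difflib.unified_diff(
    run_1.splitlines(), run_2.splitlines(),
    fromfile="yesterday", tofile="today", lineterm="",
))
# Keep only added/removed content lines, dropping the file headers.
changed = [line for line in diff
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
print(changed)  # ['-Price: $49', '+Price: $54']
```

Because the text format strips all markup, diffs like this stay stable across cosmetic HTML changes on the target page.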
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["text"]
}'
Example Response
{
"content": {
"text": "How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites. In this guide, we'll walk through building a production-ready scraper using Python and BeautifulSoup.\n\nStep 1: Install Dependencies\n\nFirst, install the required packages:\n\npip install requests beautifulsoup4\n\nStep 2: Fetch the Page\n\nUse the requests library to download the HTML content..."
}
}
html — Cleaned HTML
Returns sanitized HTML with navigation, ads, scripts, and boilerplate removed. The output preserves the document structure — headings, paragraphs, images, tables, and links remain intact. Useful when you need to re-render the content or preserve rich formatting.
When to Use
- Re-rendering content in your own UI
- Web archival and caching
- Email newsletter generation from scraped articles
- Custom post-processing with your own HTML parser
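Since the html payload keeps headings and structure intact, post-processing needs nothing beyond the standard library. A minimal sketch that pulls the heading outline (simplified: it assumes each heading is a single text node):

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1-h6 from an `html` payload."""
    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
    def handle_data(self, data):
        if self._level is not None and data.strip():
            self.headings.append((self._level, data.strip()))
            self._level = None  # one text node per heading (simplification)
    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._level = None

# Sample payload shaped like the response below.
html_payload = ("<article><h1>How to Build a Web Scraper in Python</h1>"
                "<p>Intro...</p><h2>Step 1: Install Dependencies</h2></article>")
parser = HeadingCollector()
parser.feed(html_payload)
print(parser.headings)
```

A real pipeline would likely reach for a full parser such as BeautifulSoup, but the point is that the sanitized output is regular enough for simple tooling.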
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["html"]
}'
Example Response
{
"content": {
"html": "<article><h1>How to Build a Web Scraper in Python</h1><p>Web scraping is the process of extracting data from websites. In this guide, we'll walk through building a production-ready scraper.</p><h2>Step 1: Install Dependencies</h2><p>First, install the required packages:</p><pre><code>pip install requests beautifulsoup4</code></pre></article>"
}
}
json — Structured JSON
Extracts structured data using Schema.org, Open Graph, JSON-LD, microdata, and page-specific playbooks. The output schema depends on the page type — articles return headline, author, and body; products return name, price, and availability; recipes return ingredients and steps. Works best on pages with rich structured data or supported domain playbooks.
When to Use
- Extracting product data from e-commerce sites
- Parsing article metadata (author, date, headline)
- Collecting recipe ingredients and instructions
- Any page with Schema.org or JSON-LD markup
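Because the json schema varies by page type, consumers typically dispatch on the type field before reading any other keys. A sketch of that pattern, with field names taken from the examples in this section:

```python
def summarize(payload: dict) -> str:
    """Route a `json` payload by its `type` field, since the schema
    differs between Product, Article, Recipe, and other page types."""
    kind = payload.get("type")
    if kind == "Product":
        return f'{payload["name"]} - {payload["price"]} {payload["currency"]}'
    if kind == "Article":
        return f'{payload["headline"]} by {payload["author"]}'
    return f"unhandled page type: {kind}"

# Sample payload shaped like the product response below.
product = {"type": "Product", "name": "Wireless Noise-Canceling Headphones",
           "price": 299.99, "currency": "USD"}
print(summarize(product))  # Wireless Noise-Canceling Headphones - 299.99 USD
```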
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/product/widget",
"formats": ["json"]
}'
Example Response (Product Page)
{
"content": {
"json": {
"type": "Product",
"name": "Wireless Noise-Canceling Headphones",
"price": 299.99,
"currency": "USD",
"availability": "InStock",
"description": "Premium over-ear headphones with active noise cancellation...",
"image": "https://example.com/images/headphones.jpg",
"rating": 4.7,
"review_count": 2341,
"brand": "AudioTech",
"sku": "AT-WNC-500"
}
}
}
Schema Depends on Page Type
The json output schema varies by content type. Articles return headline, author, datePublished. Products return name, price, availability. Use json_v2 for a consistent schema across all page types.
json_v2 — Universal Deterministic Extraction
A consistent, deterministic extraction format that works on any page type — no LLM required. Returns a hierarchical section tree, structured tables, classified links (navigation, content, social, CTA), media items, contact info, and rich metadata. The schema is stable across all websites, making it ideal for building pipelines that need reliable, predictable output.
No LLM Costs
json_v2 uses purely algorithmic extraction — no LLM calls, no token costs, no latency variance. You get structured data at scraping speed with deterministic results.
When to Use
- Building data pipelines that need a consistent output schema
- Extracting tables, links, and media without writing custom parsers
- Content analytics and competitive intelligence
- When you need structured data but want to avoid LLM extraction costs
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["json_v2"]
}'
Example Response
{
"content": {
"json_v2": {
"version": "1.0",
"extraction_method": "universal",
"metadata": {
"title": "How to Build a Web Scraper in Python",
"description": "A complete guide to building production web scrapers",
"language": "en",
"author": { "name": "Jane Doe", "url": "/authors/jane" },
"dates": {
"published": "2026-01-15T10:00:00Z",
"modified": "2026-03-01T14:30:00Z"
}
},
"sections": [
{
"id": "section-0",
"heading": { "text": "How to Build a Web Scraper", "level": 1 },
"content": [
{ "type": "paragraph", "text": "Web scraping is the process of..." }
],
"children": [
{
"id": "section-1",
"heading": { "text": "Install Dependencies", "level": 2 },
"content": [
{ "type": "code", "text": "pip install requests", "language": "bash" }
],
"children": []
}
]
}
],
"tables": [
{
"id": "table-0",
"caption": "Comparison of HTTP Libraries",
"headers": ["Library", "Async", "Speed"],
"rows": [
["requests", "No", "Fast"],
["httpx", "Yes", "Faster"],
["aiohttp", "Yes", "Fastest"]
],
"row_count": 3,
"col_count": 3,
"has_header": true
}
],
"links": {
"navigation": [{ "text": "Home", "url": "/" }],
"content": [{ "text": "BeautifulSoup docs", "url": "https://..." }],
"social": [{ "text": "Twitter", "url": "https://twitter.com/...", "platform": "twitter" }],
"cta": [],
"external": [],
"resource": []
}
}
}
}
Schema Reference
The json_v2 response always contains these top-level fields:
| Field | Type | Description |
|---|---|---|
| version | string | Schema version (currently "1.0") |
| metadata | object | Title, description, language, author, dates |
| structured_data | object | JSON-LD, Open Graph, Twitter Card, microdata, meta tags |
| sections | Section[] | Hierarchical content tree with headings, paragraphs, lists, code blocks |
| tables | Table[] | Structured tables with headers, rows, and captions |
| links | ClassifiedLinks | Links classified as navigation, content, social, CTA, external, resource |
| contacts | ContactInfo? | Emails, phones, addresses, social profiles |
| dates | DateTimeline? | Published, modified, created dates with source attribution |
| media | MediaItem[]? | Images, videos, audio with context (hero, content, thumbnail) |
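The sections field is a recursive tree, so consumers usually walk it depth-first. A minimal sketch that builds a table of contents from the tree, using the shapes shown in the example response above:

```python
def flatten_sections(sections, depth=0):
    """Depth-first walk of the json_v2 `sections` tree, yielding
    (depth, heading text) pairs — handy for building a table of contents."""
    for section in sections:
        heading = section.get("heading") or {}
        yield depth, heading.get("text", "(no heading)")
        yield from flatten_sections(section.get("children", []), depth + 1)

# Miniature tree shaped like the example response above.
tree = [{"heading": {"text": "How to Build a Web Scraper", "level": 1},
         "children": [{"heading": {"text": "Install Dependencies", "level": 2},
                       "children": []}]}]
toc = ["  " * depth + text for depth, text in flatten_sections(tree)]
print("\n".join(toc))
```

Because the schema is stable across sites, a walker like this works unchanged on any page you scrape.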
markdown — Structured Markdown
Converts page content to clean Markdown that preserves document structure — headings, tables, lists, links, and code blocks are all retained. This is the recommended format for feeding content into LLMs because it provides rich context in a token-efficient encoding.
When to Use
- LLM context windows — Markdown is more token-efficient than HTML
- Documentation generation and knowledge base building
- Content migration between platforms
- Human-readable output that preserves tables and formatting
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["markdown"]
}'Example Response
{
"content": {
"markdown": "# How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites.\n\n## Step 1: Install Dependencies\n\nFirst, install the required packages:\n\n```bash\npip install requests beautifulsoup4\n```\n\n## Step 2: Fetch the Page\n\n| Library | Async | Speed |\n|---------|-------|-------|\n| requests | No | Fast |\n| httpx | Yes | Faster |\n\n> **Tip**: Use httpx for async scraping workloads."
}
}
rag — RAG-Optimized Chunks
Purpose-built for Retrieval-Augmented Generation (RAG) pipelines. Splits content into semantically meaningful Markdown chunks with pre-computed token counts, per-chunk metadata, and link extraction. Chunks are sized for embedding models (target: 500 tokens max, 50 tokens min) and split on heading boundaries to preserve context.
Built for AI Pipelines
rag format saves you from building your own chunking pipeline. Chunks are pre-sized for popular embedding models (text-embedding-3-small, Cohere embed-v3), include token counts using the cl100k_base tokenizer, and preserve heading hierarchy for better retrieval.
When to Use
- Ingesting web content into vector databases (Pinecone, Weaviate, Qdrant, ChromaDB)
- Building RAG applications with LLMs
- Knowledge base construction from scraped content
- Semantic search over scraped data
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["rag"]
}'
Example Response
{
"content": {
"rag": {
"metadata": {
"title": "How to Build a Web Scraper in Python",
"description": "A complete guide to building production web scrapers",
"author": "Jane Doe",
"published_at": "2026-01-15T10:00:00Z",
"language": "en",
"url": "https://example.com/blog/post",
"domain": "example.com",
"content_type": "Article",
"content_type_confidence": 0.95,
"total_tokens": 1847,
"total_chunks": 5,
"word_count": 1420
},
"chunks": [
{
"index": 0,
"heading": "How to Build a Web Scraper in Python",
"heading_level": 1,
"content": "# How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites. In this guide...",
"token_count": 387,
"links": [
{ "text": "BeautifulSoup docs", "url": "https://..." }
]
},
{
"index": 1,
"heading": "Install Dependencies",
"heading_level": 2,
"content": "## Install Dependencies\n\nFirst, install the required packages:\n\n```bash\npip install requests beautifulsoup4\n```",
"token_count": 142,
"links": []
}
]
}
}
}
Chunk Structure
| Field | Type | Description |
|---|---|---|
| index | number | Sequential chunk index (0-based) |
| heading | string? | Section heading this chunk belongs to |
| heading_level | number | Heading depth (1-6), 0 for preamble |
| content | string | Markdown content of this chunk |
| token_count | number | Token count using cl100k_base tokenizer |
| links | Link[] | Links found within this chunk |
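A typical consumer turns these chunks into vector-DB records. The sketch below is illustrative and not tied to any particular database client; the 512-token cutoff is an assumed embedding-model limit, and the record shape is hypothetical:

```python
def to_records(rag_payload: dict, max_tokens: int = 512):
    """Turn rag chunks into id/text/metadata records ready for a
    vector-DB upsert, skipping chunks over the embedding model's limit."""
    meta = rag_payload["metadata"]
    records = []
    for chunk in rag_payload["chunks"]:
        if chunk["token_count"] > max_tokens:
            continue  # would need re-splitting before embedding
        records.append({
            "id": f'{meta["domain"]}-{chunk["index"]}',  # hypothetical ID scheme
            "text": chunk["content"],
            "metadata": {"url": meta["url"],
                         "heading": chunk.get("heading"),
                         "tokens": chunk["token_count"]},
        })
    return records

# Sample payload shaped like the rag response above.
sample = {"metadata": {"domain": "example.com",
                       "url": "https://example.com/blog/post"},
          "chunks": [{"index": 0, "heading": "Intro",
                      "content": "# Intro...", "token_count": 387},
                     {"index": 1, "heading": "Huge section",
                      "content": "...", "token_count": 900}]}
print([r["id"] for r in to_records(sample)])  # ['example.com-0']
```

Since chunks already carry token counts, no tokenizer call is needed at ingestion time.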
Multi-Format Requests
Request multiple formats in a single API call. The page is scraped once and the content is transformed into each requested format — no extra cost, no redundant network requests.
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["text", "markdown", "json"]
}'
Response Structure
{
"content": {
"text": "How to Build a Web Scraper in Python...",
"markdown": "# How to Build a Web Scraper in Python\n\n...",
"json": {
"type": "Article",
"headline": "How to Build a Web Scraper in Python",
"author": "Jane Doe",
"datePublished": "2026-01-15"
}
},
"billing": {
"credits_used": 1,
"tier": "tier_1"
}
}
One Scrape, Multiple Formats
Requesting ["text", "markdown", "json"] costs the same as requesting a single format.
Choosing the Right Format
| Use Case | Recommended Format | Why |
|---|---|---|
| LLM context / summarization | markdown | Token-efficient, preserves headings and tables |
| RAG / vector DB ingestion | rag | Pre-chunked with token counts, ready to embed |
| E-commerce product data | json | Extracts price, name, availability from Schema.org |
| Generic structured extraction | json_v2 | Consistent schema, no LLM cost, works on any page |
| Full-text search indexing | text | Clean plaintext, no markup to strip |
| Content re-rendering | html | Preserves all formatting and media tags |
| Multi-purpose pipeline | markdown + json | Default combo — structure + semantic data |
| AI agent / MCP tool | markdown | Best for tool-use context in Claude, GPT, etc. |
Pricing Impact
Output format selection does not affect cost. Cost is determined by the scraping tier (complexity of the target site), not by how many formats you request. Requesting one format or all six costs the same.
| What Affects Cost | What Does NOT Affect Cost |
|---|---|
| Scraping tier (1-4) | Number of formats requested |
| Add-ons (screenshot, PDF, OCR) | Which specific formats you choose |
| LLM extraction (extraction_prompt) | json_v2 (purely algorithmic, no LLM) |
Cost-Free Structured Data
json_v2 is the only format that provides structured extraction without LLM costs. If you need structured data but want to keep costs predictable, prefer json_v2 over LLM-based extraction_prompt.
Best Practices
1. Request only what you need
While multi-format requests are free, each format adds to the response payload size and server-side processing time. Request only the formats your application actually consumes.
2. Use rag instead of custom chunking
If you are building a RAG pipeline, use the rag format instead of requesting markdown and splitting it yourself. The built-in chunker respects heading boundaries, pre-computes token counts with cl100k_base, and extracts per-chunk links and metadata.
3. Prefer json_v2 for new projects
The json format has a variable schema that depends on the page type. For new integrations, prefer json_v2 which provides a stable, consistent schema across all page types. Use json when you specifically need type-aware extraction (e.g., product price from Schema.org).
4. Combine markdown + json_v2 for maximum utility
For applications that need both human-readable content and structured data, request ["markdown", "json_v2"]. Use markdown for display and LLM context, and json_v2 for programmatic data access — tables, links, metadata — without parsing markdown.
5. Check for extraction errors
Individual formats can fail while others succeed. Always check for an error key in each format's output. For example, rag returns {"error": "extraction_failed"} if chunking fails, while other requested formats may still contain valid data.
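That per-format check can live in a small helper. This sketch assumes only what is described above: a failed format's payload is a dict carrying an error key, while successful formats carry their normal output:

```python
def split_by_status(content: dict) -> dict:
    """Split a response's `content` map into usable formats and failed
    ones. A failed format carries an `error` key, e.g. the rag
    extraction_failed case described above."""
    ok, failed = {}, {}
    for fmt, payload in content.items():
        if isinstance(payload, dict) and "error" in payload:
            failed[fmt] = payload["error"]
        else:
            ok[fmt] = payload
    return {"ok": ok, "failed": failed}

# Sample multi-format response where rag failed but markdown succeeded.
content = {"markdown": "# How to Build a Web Scraper in Python...",
           "rag": {"error": "extraction_failed"}}
result = split_by_status(content)
print(result["failed"])  # {'rag': 'extraction_failed'}
```

Running this check on every response lets a pipeline fall back gracefully, for example by chunking the markdown output itself when rag fails.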