Output Formats
Get scraped data in exactly the shape your application needs. AlterLab supports 6 output formats — from plain text to RAG-optimized chunks — and you can request multiple formats in a single API call.
Default Behavior
When you omit the formats parameter, AlterLab returns ["markdown", "json"] by default — optimized for LLM workflows. Markdown preserves document structure while JSON provides structured data extraction.
Quick Comparison
| Format | Output | Best For | Preserves Structure |
|---|---|---|---|
| text | Plain text, zero HTML | NLP, search indexing, diff | No |
| html | Sanitized, readable HTML | Re-rendering, archival | Full |
| json | Structured key-value data | Products, articles, recipes | Semantic |
| json_v2 | Section tree, tables, classified links | Universal extraction, analytics | Full + semantic |
| markdown | Headings, tables, lists, links | LLM context, documentation | Yes |
| rag | Chunked markdown with token counts | Vector DBs, RAG pipelines | Per-chunk |
text — Plain Text
Extracts readable content with all HTML tags stripped. Uses Readability for article extraction, then converts to clean text with normalized whitespace. Ideal when you need raw content for NLP pipelines, full-text search, or text comparison.
When to Use
- Full-text search indexing
- Sentiment analysis and NLP pipelines
- Content diffing between scrape runs
- Word count and readability scoring
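The content-diffing use case can be sketched with the standard library alone. This assumes you have already pulled content["text"] from two scrape runs; the payloads below are illustrative stand-ins:

```python
import difflib

# Two sample `text` payloads from the same URL, scraped a day apart
# (illustrative strings; real payloads come from content["text"]).
run_1 = "Widget Pro\nPrice: $49\nIn stock"
run_2 = "Widget Pro\nPrice: $54\nIn stock"

diff = list(difflib.unified_diff(
    run_1.splitlines(), run_2.splitlines(),
    fromfile="yesterday", tofile="today", lineterm="",
))
# Keep only added/removed content lines, dropping the file headers.
changed = [line for line in diff
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
print(changed)  # ['-Price: $49', '+Price: $54']
```

Because the text format strips all markup, diffs like this stay stable across cosmetic HTML changes on the target page.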
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["text"]
}'
Example Response
{
"content": {
"text": "How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites. In this guide, we'll walk through building a production-ready scraper using Python and BeautifulSoup.\n\nStep 1: Install Dependencies\n\nFirst, install the required packages:\n\npip install requests beautifulsoup4\n\nStep 2: Fetch the Page\n\nUse the requests library to download the HTML content..."
}
}
html — Cleaned HTML
Returns sanitized HTML with navigation, ads, scripts, and boilerplate removed. The output preserves the document structure — headings, paragraphs, images, tables, and links remain intact. Useful when you need to re-render the content or preserve rich formatting.
When to Use
- Re-rendering content in your own UI
- Web archival and caching
- Email newsletter generation from scraped articles
- Custom post-processing with your own HTML parser
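Since the html payload keeps headings and structure intact, post-processing needs nothing beyond the standard library. A minimal sketch that pulls the heading outline (simplified: it assumes each heading is a single text node):

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for h1-h6 from an `html` payload."""
    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None
    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
    def handle_data(self, data):
        if self._level is not None and data.strip():
            self.headings.append((self._level, data.strip()))
            self._level = None  # one text node per heading (simplification)
    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._level = None

# Sample payload shaped like the response below.
html_payload = ("<article><h1>How to Build a Web Scraper in Python</h1>"
                "<p>Intro...</p><h2>Step 1: Install Dependencies</h2></article>")
parser = HeadingCollector()
parser.feed(html_payload)
print(parser.headings)
```

A real pipeline would likely reach for a full parser such as BeautifulSoup, but the point is that the sanitized output is regular enough for simple tooling.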
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["html"]
}'
Example Response
{
"content": {
"html": "<article><h1>How to Build a Web Scraper in Python</h1><p>Web scraping is the process of extracting data from websites. In this guide, we'll walk through building a production-ready scraper.</p><h2>Step 1: Install Dependencies</h2><p>First, install the required packages:</p><pre><code>pip install requests beautifulsoup4</code></pre></article>"
}
}
json — Structured JSON
Extracts structured data using Schema.org, Open Graph, JSON-LD, microdata, and page-specific playbooks. The output schema depends on the page type — articles return headline, author, and body; products return name, price, and availability; recipes return ingredients and steps. Works best on pages with rich structured data or supported domain playbooks.
When to Use
- Extracting product data from e-commerce sites
- Parsing article metadata (author, date, headline)
- Collecting recipe ingredients and instructions
- Any page with Schema.org or JSON-LD markup
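Because the json schema varies by page type, consumers typically dispatch on the type field before reading any other keys. A sketch of that pattern, with field names taken from the examples in this section:

```python
def summarize(payload: dict) -> str:
    """Route a `json` payload by its `type` field, since the schema
    differs between Product, Article, Recipe, and other page types."""
    kind = payload.get("type")
    if kind == "Product":
        return f'{payload["name"]} - {payload["price"]} {payload["currency"]}'
    if kind == "Article":
        return f'{payload["headline"]} by {payload["author"]}'
    return f"unhandled page type: {kind}"

# Sample payload shaped like the product response below.
product = {"type": "Product", "name": "Wireless Noise-Canceling Headphones",
           "price": 299.99, "currency": "USD"}
print(summarize(product))  # Wireless Noise-Canceling Headphones - 299.99 USD
```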
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/product/widget",
"formats": ["json"]
}'
Example Response (Product Page)
{
"content": {
"json": {
"type": "Product",
"name": "Wireless Noise-Canceling Headphones",
"price": 299.99,
"currency": "USD",
"availability": "InStock",
"description": "Premium over-ear headphones with active noise cancellation...",
"image": "https://example.com/images/headphones.jpg",
"rating": 4.7,
"review_count": 2341,
"brand": "AudioTech",
"sku": "AT-WNC-500"
}
}
}
Schema Depends on Page Type
The json output schema varies by content type. Articles return headline, author, datePublished. Products return name, price, availability. Use json_v2 for a consistent schema across all page types.
json_v2 — Universal Deterministic Extraction
A consistent, deterministic extraction format that works on any page type — no LLM required. Returns a hierarchical section tree, structured tables, classified links (navigation, content, social, CTA), media items, contact info, and rich metadata. The schema is stable across all websites, making it ideal for building pipelines that need reliable, predictable output.
No LLM Costs
json_v2 uses purely algorithmic extraction — no LLM calls, no token costs, no latency variance. You get structured data at scraping speed with deterministic results.
When to Use
- Building data pipelines that need a consistent output schema
- Extracting tables, links, and media without writing custom parsers
- Content analytics and competitive intelligence
- When you need structured data but want to avoid LLM extraction costs
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["json_v2"]
}'
Example Response
{
"content": {
"json_v2": {
"version": "1.0",
"extraction_method": "universal",
"metadata": {
"title": "How to Build a Web Scraper in Python",
"description": "A complete guide to building production web scrapers",
"language": "en",
"author": { "name": "Jane Doe", "url": "/authors/jane" },
"dates": {
"published": "2026-01-15T10:00:00Z",
"modified": "2026-03-01T14:30:00Z"
}
},
"sections": [
{
"id": "section-0",
"heading": { "text": "How to Build a Web Scraper", "level": 1 },
"content": [
{ "type": "paragraph", "text": "Web scraping is the process of..." }
],
"children": [
{
"id": "section-1",
"heading": { "text": "Install Dependencies", "level": 2 },
"content": [
{ "type": "code", "text": "pip install requests", "language": "bash" }
],
"children": []
}
]
}
],
"tables": [
{
"id": "table-0",
"caption": "Comparison of HTTP Libraries",
"headers": ["Library", "Async", "Speed"],
"rows": [
["requests", "No", "Fast"],
["httpx", "Yes", "Faster"],
["aiohttp", "Yes", "Fastest"]
],
"row_count": 3,
"col_count": 3,
"has_header": true
}
],
"links": {
"navigation": [{ "text": "Home", "url": "/" }],
"content": [{ "text": "BeautifulSoup docs", "url": "https://..." }],
"social": [{ "text": "Twitter", "url": "https://twitter.com/...", "platform": "twitter" }],
"cta": [],
"external": [],
"resource": []
}
}
}
}
Schema Reference
The json_v2 response always contains these top-level fields:
| Field | Type | Description |
|---|---|---|
| version | string | Schema version (currently "1.0") |
| metadata | object | Title, description, language, author, dates |
| structured_data | object | JSON-LD, Open Graph, Twitter Card, microdata, meta tags |
| sections | Section[] | Hierarchical content tree with headings, paragraphs, lists, code blocks |
| tables | Table[] | Structured tables with headers, rows, and captions |
| links | ClassifiedLinks | Links classified as navigation, content, social, CTA, external, resource |
| contacts | ContactInfo? | Emails, phones, addresses, social profiles |
| dates | DateTimeline? | Published, modified, created dates with source attribution |
| media | MediaItem[]? | Images, videos, audio with context (hero, content, thumbnail) |
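The sections field is a recursive tree, so consumers usually walk it depth-first. A minimal sketch that builds a table of contents from the tree, using the shapes shown in the example response above:

```python
def flatten_sections(sections, depth=0):
    """Depth-first walk of the json_v2 `sections` tree, yielding
    (depth, heading text) pairs — handy for building a table of contents."""
    for section in sections:
        heading = section.get("heading") or {}
        yield depth, heading.get("text", "(no heading)")
        yield from flatten_sections(section.get("children", []), depth + 1)

# Miniature tree shaped like the example response above.
tree = [{"heading": {"text": "How to Build a Web Scraper", "level": 1},
         "children": [{"heading": {"text": "Install Dependencies", "level": 2},
                       "children": []}]}]
toc = ["  " * depth + text for depth, text in flatten_sections(tree)]
print("\n".join(toc))
```

Because the schema is stable across sites, a walker like this works unchanged on any page you scrape.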
markdown — Structured Markdown
Converts page content to clean Markdown that preserves document structure — headings, tables, lists, links, and code blocks are all retained. This is the recommended format for feeding content into LLMs because it provides rich context in a token-efficient encoding.
When to Use
- LLM context windows — Markdown is more token-efficient than HTML
- Documentation generation and knowledge base building
- Content migration between platforms
- Human-readable output that preserves tables and formatting
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["markdown"]
}'Example Response
{
"content": {
"markdown": "# How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites.\n\n## Step 1: Install Dependencies\n\nFirst, install the required packages:\n\n```bash\npip install requests beautifulsoup4\n```\n\n## Step 2: Fetch the Page\n\n| Library | Async | Speed |\n|---------|-------|-------|\n| requests | No | Fast |\n| httpx | Yes | Faster |\n\n> **Tip**: Use httpx for async scraping workloads."
}
}
rag — RAG-Optimized Chunks
Purpose-built for Retrieval-Augmented Generation (RAG) pipelines. Splits content into semantically meaningful Markdown chunks with pre-computed token counts, per-chunk metadata, and link extraction. Chunks are sized for embedding models (target: 500 tokens max, 50 tokens min) and split on heading boundaries to preserve context.
Built for AI Pipelines
rag format saves you from building your own chunking pipeline. Chunks are pre-sized for popular embedding models (text-embedding-3-small, Cohere embed-v3), include token counts using the cl100k_base tokenizer, and preserve heading hierarchy for better retrieval.
When to Use
- Ingesting web content into vector databases (Pinecone, Weaviate, Qdrant, ChromaDB)
- Building RAG applications with LLMs
- Knowledge base construction from scraped content
- Semantic search over scraped data
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["rag"]
}'
Example Response
{
"content": {
"rag": {
"metadata": {
"title": "How to Build a Web Scraper in Python",
"description": "A complete guide to building production web scrapers",
"author": "Jane Doe",
"published_at": "2026-01-15T10:00:00Z",
"language": "en",
"url": "https://example.com/blog/post",
"domain": "example.com",
"content_type": "Article",
"content_type_confidence": 0.95,
"total_tokens": 1847,
"total_chunks": 5,
"word_count": 1420
},
"chunks": [
{
"index": 0,
"heading": "How to Build a Web Scraper in Python",
"heading_level": 1,
"content": "# How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites. In this guide...",
"token_count": 387,
"links": [
{ "text": "BeautifulSoup docs", "url": "https://..." }
]
},
{
"index": 1,
"heading": "Install Dependencies",
"heading_level": 2,
"content": "## Install Dependencies\n\nFirst, install the required packages:\n\n```bash\npip install requests beautifulsoup4\n```",
"token_count": 142,
"links": []
}
]
}
}
}
Chunk Structure
| Field | Type | Description |
|---|---|---|
| index | number | Sequential chunk index (0-based) |
| heading | string? | Section heading this chunk belongs to |
| heading_level | number | Heading depth (1-6), 0 for preamble |
| content | string | Markdown content of this chunk |
| token_count | number | Token count using cl100k_base tokenizer |
| links | Link[] | Links found within this chunk |
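A typical consumer turns these chunks into vector-DB records. The sketch below is illustrative and not tied to any particular database client; the 512-token cutoff is an assumed embedding-model limit, and the record shape is hypothetical:

```python
def to_records(rag_payload: dict, max_tokens: int = 512):
    """Turn rag chunks into id/text/metadata records ready for a
    vector-DB upsert, skipping chunks over the embedding model's limit."""
    meta = rag_payload["metadata"]
    records = []
    for chunk in rag_payload["chunks"]:
        if chunk["token_count"] > max_tokens:
            continue  # would need re-splitting before embedding
        records.append({
            "id": f'{meta["domain"]}-{chunk["index"]}',  # hypothetical ID scheme
            "text": chunk["content"],
            "metadata": {"url": meta["url"],
                         "heading": chunk.get("heading"),
                         "tokens": chunk["token_count"]},
        })
    return records

# Sample payload shaped like the rag response above.
sample = {"metadata": {"domain": "example.com",
                       "url": "https://example.com/blog/post"},
          "chunks": [{"index": 0, "heading": "Intro",
                      "content": "# Intro...", "token_count": 387},
                     {"index": 1, "heading": "Huge section",
                      "content": "...", "token_count": 900}]}
print([r["id"] for r in to_records(sample)])  # ['example.com-0']
```

Since chunks already carry token counts, no tokenizer call is needed at ingestion time.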
Multi-Format Requests
Request multiple formats in a single API call. The page is scraped once and the content is transformed into each requested format — no extra cost, no redundant network requests.
curl -X POST https://alterlab.io/api/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post",
"formats": ["text", "markdown", "json"]
}'
Response Structure
{
"content": {
"text": "How to Build a Web Scraper in Python...",
"markdown": "# How to Build a Web Scraper in Python\n\n...",
"json": {
"type": "Article",
"headline": "How to Build a Web Scraper in Python",
"author": "Jane Doe",
"datePublished": "2026-01-15"
}
},
"billing": {
"credits_used": 1,
"tier": "tier_1"
}
}
One Scrape, Multiple Formats
Requesting ["text", "markdown", "json"] costs the same as requesting a single format.
Choosing the Right Format
| Use Case | Recommended Format | Why |
|---|---|---|
| LLM context / summarization | markdown | Token-efficient, preserves headings and tables |
| RAG / vector DB ingestion | rag | Pre-chunked with token counts, ready to embed |
| E-commerce product data | json | Extracts price, name, availability from Schema.org |
| Generic structured extraction | json_v2 | Consistent schema, no LLM cost, works on any page |
| Full-text search indexing | text | Clean plaintext, no markup to strip |
| Content re-rendering | html | Preserves all formatting and media tags |
| Multi-purpose pipeline | markdown + json | Default combo — structure + semantic data |
| AI agent / MCP tool | markdown | Best for tool-use context in Claude, GPT, etc. |
Pricing Impact
Output format selection does not affect cost. Cost is determined by the scraping tier (complexity of the target site), not by how many formats you request. Requesting one format or all six costs the same.
| What Affects Cost | What Does NOT Affect Cost |
|---|---|
| Scraping tier (1-4) | Number of formats requested |
| Add-ons (screenshot, PDF, OCR) | Which specific formats you choose |
| LLM extraction (extraction_prompt) | json_v2 (purely algorithmic, no LLM) |
Cost-Free Structured Data
json_v2 is the only format that provides structured extraction without LLM costs. If you need structured data but want to keep costs predictable, prefer json_v2 over LLM-based extraction_prompt.
Best Practices
1. Request only what you need
While multi-format requests are free, each format adds to the response payload size and server-side processing time. Request only the formats your application actually consumes.
2. Use rag instead of custom chunking
If you are building a RAG pipeline, use the rag format instead of requesting markdown and splitting it yourself. The built-in chunker respects heading boundaries, pre-computes token counts with cl100k_base, and extracts per-chunk links and metadata.
3. Prefer json_v2 for new projects
The json format has a variable schema that depends on the page type. For new integrations, prefer json_v2 which provides a stable, consistent schema across all page types. Use json when you specifically need type-aware extraction (e.g., product price from Schema.org).
4. Combine markdown + json_v2 for maximum utility
For applications that need both human-readable content and structured data, request ["markdown", "json_v2"]. Use markdown for display and LLM context, and json_v2 for programmatic data access — tables, links, metadata — without parsing markdown.
5. Check for extraction errors
Individual formats can fail while others succeed. Always check for an error key in each format's output. For example, rag returns {"error": "extraction_failed"} if chunking fails, while other requested formats may still contain valid data.
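That per-format check can live in a small helper. This sketch assumes only what is described above: a failed format's payload is a dict carrying an error key, while successful formats carry their normal output:

```python
def split_by_status(content: dict) -> dict:
    """Split a response's `content` map into usable formats and failed
    ones. A failed format carries an `error` key, e.g. the rag
    extraction_failed case described above."""
    ok, failed = {}, {}
    for fmt, payload in content.items():
        if isinstance(payload, dict) and "error" in payload:
            failed[fmt] = payload["error"]
        else:
            ok[fmt] = payload
    return {"ok": ok, "failed": failed}

# Sample multi-format response where rag failed but markdown succeeded.
content = {"markdown": "# How to Build a Web Scraper in Python...",
           "rag": {"error": "extraction_failed"}}
result = split_by_status(content)
print(result["failed"])  # {'rag': 'extraction_failed'}
```

Running this check on every response lets a pipeline fall back gracefully, for example by chunking the markdown output itself when rag fails.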