
    Output Formats

    Get scraped data in exactly the shape your application needs. AlterLab supports 6 output formats — from plain text to RAG-optimized chunks — and you can request multiple formats in a single API call.

    Default Behavior

    When you omit the formats parameter, AlterLab returns ["markdown", "json"] by default — optimized for LLM workflows. Markdown preserves document structure while JSON provides structured data extraction.

    Quick Comparison

    Format   | Output                                  | Best For                        | Preserves Structure
    text     | Plain text, zero HTML                   | NLP, search indexing, diff      | No
    html     | Sanitized, readable HTML                | Re-rendering, archival          | Full
    json     | Structured key-value data               | Products, articles, recipes     | Semantic
    json_v2  | Section tree, tables, classified links  | Universal extraction, analytics | Full + semantic
    markdown | Headings, tables, lists, links          | LLM context, documentation      | Yes
    rag      | Chunked markdown with token counts      | Vector DBs, RAG pipelines       | Per-chunk

    text — Plain Text

    Extracts readable content with all HTML tags stripped. Uses Readability for article extraction, then converts to clean text with normalized whitespace. Ideal when you need raw content for NLP pipelines, full-text search, or text comparison.

    When to Use

    • Full-text search indexing
    • Sentiment analysis and NLP pipelines
    • Content diffing between scrape runs
    • Word count and readability scoring
    Bash
    curl -X POST https://alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/blog/post",
        "formats": ["text"]
      }'

    Example Response

    JSON
    {
      "content": {
        "text": "How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites. In this guide, we'll walk through building a production-ready scraper using Python and BeautifulSoup.\n\nStep 1: Install Dependencies\n\nFirst, install the required packages:\n\npip install requests beautifulsoup4\n\nStep 2: Fetch the Page\n\nUse the requests library to download the HTML content..."
      }
    }
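    The content-diffing use case above can be sketched with Python's standard-library difflib. The two text values here are hypothetical stand-ins for the `content.text` field of two successive scrape runs:

    ```python
    import difflib

    # Hypothetical "text" outputs from two successive scrape runs.
    old_text = "Widget Pro\nPrice: $299.99\nIn stock"
    new_text = "Widget Pro\nPrice: $279.99\nIn stock"

    # unified_diff yields header, context, and changed lines;
    # keep only real additions/removals for change detection.
    diff = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""
    ))
    changed = [line for line in diff
               if line.startswith(("+", "-"))
               and not line.startswith(("+++", "---"))]
    print(changed)  # ['-Price: $299.99', '+Price: $279.99']
    ```

    Because the text format strips all markup, a diff like this reflects real content changes rather than template or attribute churn.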

    html — Cleaned HTML

    Returns sanitized HTML with navigation, ads, scripts, and boilerplate removed. The output preserves the document structure — headings, paragraphs, images, tables, and links remain intact. Useful when you need to re-render the content or preserve rich formatting.

    When to Use

    • Re-rendering content in your own UI
    • Web archival and caching
    • Email newsletter generation from scraped articles
    • Custom post-processing with your own HTML parser
    Bash
    curl -X POST https://alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/blog/post",
        "formats": ["html"]
      }'

    Example Response

    JSON
    {
      "content": {
        "html": "<article><h1>How to Build a Web Scraper in Python</h1><p>Web scraping is the process of extracting data from websites. In this guide, we'll walk through building a production-ready scraper.</p><h2>Step 1: Install Dependencies</h2><p>First, install the required packages:</p><pre><code>pip install requests beautifulsoup4</code></pre></article>"
      }
    }

    json — Structured JSON

    Extracts structured data using Schema.org, Open Graph, JSON-LD, microdata, and page-specific playbooks. The output schema depends on the page type — articles return headline, author, and body; products return name, price, and availability; recipes return ingredients and steps. Works best on pages with rich structured data or supported domain playbooks.

    When to Use

    • Extracting product data from e-commerce sites
    • Parsing article metadata (author, date, headline)
    • Collecting recipe ingredients and instructions
    • Any page with Schema.org or JSON-LD markup
    Bash
    curl -X POST https://alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/product/widget",
        "formats": ["json"]
      }'

    Example Response (Product Page)

    JSON
    {
      "content": {
        "json": {
          "type": "Product",
          "name": "Wireless Noise-Canceling Headphones",
          "price": 299.99,
          "currency": "USD",
          "availability": "InStock",
          "description": "Premium over-ear headphones with active noise cancellation...",
          "image": "https://example.com/images/headphones.jpg",
          "rating": 4.7,
          "review_count": 2341,
          "brand": "AudioTech",
          "sku": "AT-WNC-500"
        }
      }
    }
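    Because the json schema varies by page type, it is worth guarding on the extracted `type` before reading type-specific fields. A minimal sketch, using a response dict shaped like the product example above:

    ```python
    # Hypothetical scrape response mirroring the product example above.
    response = {
        "content": {
            "json": {
                "type": "Product",
                "name": "Wireless Noise-Canceling Headphones",
                "price": 299.99,
                "currency": "USD",
                "availability": "InStock",
            }
        }
    }

    product = response["content"]["json"]
    # Guard on the extracted type before trusting product-specific fields.
    if product.get("type") == "Product":
        in_stock = product.get("availability") == "InStock"
        print(f'{product["name"]}: {product["price"]} {product["currency"]}'
              f' ({"in stock" if in_stock else "unavailable"})')
    ```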

    Schema Depends on Page Type

    The json output schema varies by content type. Articles return headline, author, datePublished. Products return name, price, availability. Use json_v2 for a consistent schema across all page types.

    json_v2 — Universal Deterministic Extraction

    A consistent, deterministic extraction format that works on any page type — no LLM required. Returns a hierarchical section tree, structured tables, classified links (navigation, content, social, CTA), media items, contact info, and rich metadata. The schema is stable across all websites, making it ideal for building pipelines that need reliable, predictable output.

    No LLM Costs

    json_v2 uses purely algorithmic extraction — no LLM calls, no token costs, no latency variance. You get structured data at scraping speed with deterministic results.

    When to Use

    • Building data pipelines that need a consistent output schema
    • Extracting tables, links, and media without writing custom parsers
    • Content analytics and competitive intelligence
    • When you need structured data but want to avoid LLM extraction costs
    Bash
    curl -X POST https://alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/blog/post",
        "formats": ["json_v2"]
      }'

    Example Response

    JSON
    {
      "content": {
        "json_v2": {
          "version": "1.0",
          "extraction_method": "universal",
          "metadata": {
            "title": "How to Build a Web Scraper in Python",
            "description": "A complete guide to building production web scrapers",
            "language": "en",
            "author": { "name": "Jane Doe", "url": "/authors/jane" },
            "dates": {
              "published": "2026-01-15T10:00:00Z",
              "modified": "2026-03-01T14:30:00Z"
            }
          },
          "sections": [
            {
              "id": "section-0",
              "heading": { "text": "How to Build a Web Scraper", "level": 1 },
              "content": [
                { "type": "paragraph", "text": "Web scraping is the process of..." }
              ],
              "children": [
                {
                  "id": "section-1",
                  "heading": { "text": "Install Dependencies", "level": 2 },
                  "content": [
                    { "type": "code", "text": "pip install requests", "language": "bash" }
                  ],
                  "children": []
                }
              ]
            }
          ],
          "tables": [
            {
              "id": "table-0",
              "caption": "Comparison of HTTP Libraries",
              "headers": ["Library", "Async", "Speed"],
              "rows": [
                ["requests", "No", "Fast"],
                ["httpx", "Yes", "Faster"],
                ["aiohttp", "Yes", "Fastest"]
              ],
              "row_count": 3,
              "col_count": 3,
              "has_header": true
            }
          ],
          "links": {
            "navigation": [{ "text": "Home", "url": "/" }],
            "content": [{ "text": "BeautifulSoup docs", "url": "https://..." }],
            "social": [{ "text": "Twitter", "url": "https://twitter.com/...", "platform": "twitter" }],
            "cta": [],
            "external": [],
            "resource": []
          }
        }
      }
    }
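    Since `sections` is a recursive tree, consuming it usually means a small recursive walk. A sketch that flattens the tree into an indented outline, using sample data shaped like the response above:

    ```python
    def iter_headings(sections, depth=0):
        """Yield (depth, heading text) pairs from a json_v2 section tree."""
        for section in sections:
            heading = section.get("heading") or {}
            if heading.get("text"):
                yield depth, heading["text"]
            # Recurse into nested subsections.
            yield from iter_headings(section.get("children", []), depth + 1)

    # Sample tree shaped like the json_v2 response above.
    sections = [
        {
            "heading": {"text": "How to Build a Web Scraper", "level": 1},
            "children": [
                {"heading": {"text": "Install Dependencies", "level": 2},
                 "children": []},
            ],
        }
    ]

    outline = [f'{"  " * d}{t}' for d, t in iter_headings(sections)]
    print(outline)  # ['How to Build a Web Scraper', '  Install Dependencies']
    ```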

    Schema Reference

    The json_v2 response always contains these top-level fields:

    Field           | Type           | Description
    version         | string         | Schema version (currently "1.0")
    metadata        | object         | Title, description, language, author, dates
    structured_data | object         | JSON-LD, Open Graph, Twitter Card, microdata, meta tags
    sections        | Section[]      | Hierarchical content tree with headings, paragraphs, lists, code blocks
    tables          | Table[]        | Structured tables with headers, rows, and captions
    links           | ClassifiedLinks | Links classified as navigation, content, social, CTA, external, resource
    contacts        | ContactInfo?   | Emails, phones, addresses, social profiles
    dates           | DateTimeline?  | Published, modified, created dates with source attribution
    media           | MediaItem[]?   | Images, videos, audio with context (hero, content, thumbnail)
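    The pre-structured `tables` entries can be exported without any HTML parsing. A sketch that writes one table to CSV with the standard library, using a sample shaped like the json_v2 example above:

    ```python
    import csv
    import io

    # A table shaped like the json_v2 `tables` entries shown above.
    table = {
        "caption": "Comparison of HTTP Libraries",
        "headers": ["Library", "Async", "Speed"],
        "rows": [["requests", "No", "Fast"], ["httpx", "Yes", "Faster"]],
    }

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table["headers"])  # header row first
    writer.writerows(table["rows"])    # then data rows
    print(buf.getvalue())
    ```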

    markdown — Structured Markdown

    Converts page content to clean Markdown that preserves document structure — headings, tables, lists, links, and code blocks are all retained. This is the recommended format for feeding content into LLMs because it provides rich context in a token-efficient encoding.

    When to Use

    • LLM context windows — Markdown is more token-efficient than HTML
    • Documentation generation and knowledge base building
    • Content migration between platforms
    • Human-readable output that preserves tables and formatting
    Bash
    curl -X POST https://alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/blog/post",
        "formats": ["markdown"]
      }'

    Example Response

    JSON
    {
      "content": {
        "markdown": "# How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites.\n\n## Step 1: Install Dependencies\n\nFirst, install the required packages:\n\n```bash\npip install requests beautifulsoup4\n```\n\n## Step 2: Fetch the Page\n\n| Library | Async | Speed |\n|---------|-------|-------|\n| requests | No | Fast |\n| httpx | Yes | Faster |\n\n> **Tip**: Use httpx for async scraping workloads."
      }
    }

    rag — RAG-Optimized Chunks

    Purpose-built for Retrieval-Augmented Generation (RAG) pipelines. Splits content into semantically meaningful Markdown chunks with pre-computed token counts, per-chunk metadata, and link extraction. Chunks are sized for embedding models (target: 500 tokens max, 50 tokens min) and split on heading boundaries to preserve context.

    Built for AI Pipelines

    The rag format saves you from building your own chunking pipeline. Chunks are pre-sized for popular embedding models (text-embedding-3-small, Cohere embed-v3), include token counts using the cl100k_base tokenizer, and preserve heading hierarchy for better retrieval.

    When to Use

    • Ingesting web content into vector databases (Pinecone, Weaviate, Qdrant, ChromaDB)
    • Building RAG applications with LLMs
    • Knowledge base construction from scraped content
    • Semantic search over scraped data
    Bash
    curl -X POST https://alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/blog/post",
        "formats": ["rag"]
      }'

    Example Response

    JSON
    {
      "content": {
        "rag": {
          "metadata": {
            "title": "How to Build a Web Scraper in Python",
            "description": "A complete guide to building production web scrapers",
            "author": "Jane Doe",
            "published_at": "2026-01-15T10:00:00Z",
            "language": "en",
            "url": "https://example.com/blog/post",
            "domain": "example.com",
            "content_type": "Article",
            "content_type_confidence": 0.95,
            "total_tokens": 1847,
            "total_chunks": 5,
            "word_count": 1420
          },
          "chunks": [
            {
              "index": 0,
              "heading": "How to Build a Web Scraper in Python",
              "heading_level": 1,
              "content": "# How to Build a Web Scraper in Python\n\nWeb scraping is the process of extracting data from websites. In this guide...",
              "token_count": 387,
              "links": [
                { "text": "BeautifulSoup docs", "url": "https://..." }
              ]
            },
            {
              "index": 1,
              "heading": "Install Dependencies",
              "heading_level": 2,
              "content": "## Install Dependencies\n\nFirst, install the required packages:\n\n```bash\npip install requests beautifulsoup4\n```",
              "token_count": 142,
              "links": []
            }
          ]
        }
      }
    }
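    A typical ingestion step maps each chunk to a vector-DB record with a stable ID and metadata. A minimal sketch, with hypothetical chunk data shaped like the response above (the embedding call itself, which depends on your provider, is omitted):

    ```python
    # Chunks shaped like the rag response above (abbreviated content).
    chunks = [
        {"index": 0, "heading": "Intro", "content": "# Intro\n...", "token_count": 387},
        {"index": 1, "heading": "Install", "content": "## Install\n...", "token_count": 142},
        {"index": 2, "heading": "Stub", "content": "## Stub", "token_count": 8},
    ]

    MIN_TOKENS = 20  # skip near-empty chunks before paying for embeddings

    records = [
        {
            "id": f'doc-1#chunk-{c["index"]}',  # stable ID: source doc + chunk index
            "text": c["content"],               # pass this to your embedding model
            "metadata": {"heading": c["heading"]},
        }
        for c in chunks
        if c["token_count"] >= MIN_TOKENS
    ]
    print([r["id"] for r in records])  # ['doc-1#chunk-0', 'doc-1#chunk-1']
    ```

    The `doc-1` prefix and `MIN_TOKENS` threshold are illustrative choices, not part of the API; stable IDs let re-scrapes upsert over stale vectors instead of duplicating them.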

    Chunk Structure

    Field         | Type    | Description
    index         | number  | Sequential chunk index (0-based)
    heading       | string? | Section heading this chunk belongs to
    heading_level | number  | Heading depth (1-6), 0 for preamble
    content       | string  | Markdown content of this chunk
    token_count   | number  | Token count using the cl100k_base tokenizer
    links         | Link[]  | Links found within this chunk

    Multi-Format Requests

    Request multiple formats in a single API call. The page is scraped once and the content is transformed into each requested format — no extra cost, no redundant network requests.

    Bash
    curl -X POST https://alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/blog/post",
        "formats": ["text", "markdown", "json"]
      }'

    Response Structure

    JSON
    {
      "content": {
        "text": "How to Build a Web Scraper in Python...",
        "markdown": "# How to Build a Web Scraper in Python\n\n...",
        "json": {
          "type": "Article",
          "headline": "How to Build a Web Scraper in Python",
          "author": "Jane Doe",
          "datePublished": "2026-01-15"
        }
      },
      "billing": {
        "credits_used": 1,
        "tier": "tier_1"
      }
    }

    One Scrape, Multiple Formats

    Multi-format requests do not cost extra. The page is fetched once and transformed into each requested format server-side. Requesting ["text", "markdown", "json"] costs the same as requesting a single format.

    Choosing the Right Format

    Use Case                      | Recommended Format | Why
    LLM context / summarization   | markdown           | Token-efficient, preserves headings and tables
    RAG / vector DB ingestion     | rag                | Pre-chunked with token counts, ready to embed
    E-commerce product data       | json               | Extracts price, name, availability from Schema.org
    Generic structured extraction | json_v2            | Consistent schema, no LLM cost, works on any page
    Full-text search indexing     | text               | Clean plaintext, no markup to strip
    Content re-rendering          | html               | Preserves all formatting and media tags
    Multi-purpose pipeline        | markdown + json    | Default combo: structure plus semantic data
    AI agent / MCP tool           | markdown           | Best for tool-use context in Claude, GPT, etc.

    Pricing Impact

    Output format selection does not affect cost. Cost is determined by the scraping tier (complexity of the target site), not by how many formats you request. Requesting one format or all six costs the same.

    What Affects Cost                  | What Does NOT Affect Cost
    Scraping tier (1-4)                | Number of formats requested
    Add-ons (screenshot, PDF, OCR)     | Which specific formats you choose
    LLM extraction (extraction_prompt) | json_v2 (purely algorithmic, no LLM)

    Cost-Free Structured Data

    json_v2 is the only format that provides structured extraction without LLM costs. If you need structured data but want to keep costs predictable, prefer json_v2 over LLM-based extraction_prompt.

    Best Practices

    1. Request only what you need

    While multi-format requests are free, each format adds to the response payload size and server-side processing time. Request only the formats your application actually consumes.

    2. Use rag instead of custom chunking

    If you are building a RAG pipeline, use the rag format instead of requesting markdown and splitting it yourself. The built-in chunker respects heading boundaries, pre-computes token counts with cl100k_base, and extracts per-chunk links and metadata.

    3. Prefer json_v2 for new projects

    The json format has a variable schema that depends on the page type. For new integrations, prefer json_v2 which provides a stable, consistent schema across all page types. Use json when you specifically need type-aware extraction (e.g., product price from Schema.org).

    4. Combine markdown + json_v2 for maximum utility

    For applications that need both human-readable content and structured data, request ["markdown", "json_v2"]. Use markdown for display and LLM context, and json_v2 for programmatic data access — tables, links, metadata — without parsing markdown.

    5. Check for extraction errors

    Individual formats can fail while others succeed. Always check for an error key in each format's output. For example, rag returns {"error": "extraction_failed"} if chunking fails, while other requested formats may still contain valid data.
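    The error check described above can be sketched as a single pass over the content object. The response dict here is hypothetical, shaped like a multi-format response where `rag` failed:

    ```python
    # A hypothetical response where one format failed while others succeeded.
    response = {
        "content": {
            "markdown": "# How to Build a Web Scraper...",
            "rag": {"error": "extraction_failed"},
        }
    }

    ok, failed = {}, {}
    for fmt, payload in response["content"].items():
        # A dict carrying an "error" key signals a per-format failure.
        if isinstance(payload, dict) and "error" in payload:
            failed[fmt] = payload["error"]
        else:
            ok[fmt] = payload

    print(sorted(ok), failed)  # ['markdown'] {'rag': 'extraction_failed'}
    ```

    Splitting results this way lets you consume the formats that succeeded while logging or retrying the ones that failed.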

    Last updated: March 2026

    © 2026 RapierCraft Inc. All rights reserved.