
    RAG & AI Pipelines

    Scrape, extract, and structure web content for retrieval-augmented generation, vector stores, and AI agents.

    The Problem

    LLMs need clean, structured web data to ground their responses in facts. Building a RAG pipeline means solving several challenges at once:

    • Getting past anti-bot protection to access source content
    • Extracting clean text from complex HTML — stripping nav, ads, and boilerplate
    • Returning content in formats LLMs can consume (markdown, plain text, structured JSON)
    • Processing many pages for knowledge base construction without managing browser infrastructure

    Solution Architecture

    AlterLab provides the data ingestion layer for your AI pipeline:

    1. Scrape & Extract

    POST /scrape with formats: ["markdown", "text"] to get clean content ready for chunking and embedding.

    2. Structure

    Use extraction_schema to pull specific fields (title, author, date, body) into typed JSON for metadata enrichment.
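
    A minimal sketch of an extraction call, assuming extraction_schema accepts a flat field-to-type map; the exact schema format is covered in the JSON Schema Filtering guide:

    Python
    import requests

    # Sketch only: the extraction_schema shape below is an assumption;
    # consult the JSON Schema Filtering guide for the real format.
    resp = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://news.example.com/article",
            "extraction_schema": {
                "title": "string",
                "author": "string",
                "date": "string",
                "body": "string"
            }
        }
    )
    article = resp.json()  # typed JSON matching the schema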

    3. Scale

    POST /batch to process hundreds of source documents in parallel. Use crawl to discover and ingest entire sites.
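
    And a sketch of a batch submission; the urls field name and the job-style response are assumptions here, so check the Batch Scraping Guide and Job Polling docs for the exact contract:

    Python
    import requests

    # Sketch only: parameter names and the job response shape are
    # assumptions; see the Batch Scraping Guide for the real contract.
    resp = requests.post(
        "https://api.alterlab.io/api/v1/batch",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "urls": [
                "https://docs.example.com/page1",
                "https://docs.example.com/page2"
            ],
            "formats": ["markdown"]
        }
    )
    job = resp.json()
    print(job)  # typically a job ID to poll until results are ready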

    Quick Example

    Scrape a page and get LLM-ready markdown content in one call:

    Python
    import requests
    
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://docs.example.com/api-reference",
            "formats": ["markdown", "text"]
        }
    )
    
    data = response.json()
    markdown_content = data.get("markdown", "")
    plain_text = data.get("text", "")
    
    # Chunk for embedding (split_into_chunks, embed_chunks, and
    # vector_store are placeholders for your own pipeline components)
    chunks = split_into_chunks(plain_text, max_tokens=512)
    embeddings = embed_chunks(chunks)
    vector_store.upsert(embeddings)

    Advanced Patterns

    Chunking Strategy

    Use the markdown output with headings to create semantically meaningful chunks. AlterLab preserves document structure so you can split on heading boundaries:

    Python
    import re
    import requests
    
    def scrape_and_chunk(url, api_key):
        """Scrape a page and split into heading-based chunks."""
        resp = requests.post(
            "https://api.alterlab.io/api/v1/scrape",
            headers={"X-API-Key": api_key},
            json={"url": url, "formats": ["markdown"]}
        )
        markdown = resp.json().get("markdown", "")
    
        # Split on H2/H3 headings for semantic chunks
        sections = re.split(r'(?=^#{2,3}\s)', markdown, flags=re.MULTILINE)
        chunks = []
        for section in sections:
            section = section.strip()
            if section:
                chunks.append({
                    "content": section,
                    "source_url": url,
                    "char_count": len(section)
                })
        return chunks
    
    chunks = scrape_and_chunk(
        "https://docs.example.com/guide",
        "YOUR_API_KEY"
    )
    print(f"Created {len(chunks)} chunks")

    LangChain Integration

    Use AlterLab as a document loader in your LangChain pipeline. Scrape content and pass it directly to LangChain's text splitters and retrievers:

    Python
    import requests
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.schema import Document
    
    def load_documents(urls, api_key):
        """Load web pages as LangChain Documents via AlterLab."""
        documents = []
        for url in urls:
            resp = requests.post(
                "https://api.alterlab.io/api/v1/scrape",
                headers={"X-API-Key": api_key},
                json={"url": url, "formats": ["text"]}
            )
            data = resp.json()
            if data.get("text"):
                documents.append(Document(
                    page_content=data["text"],
                    metadata={"source": url}
                ))
        return documents
    
    # Load and split
    docs = load_documents(
        ["https://docs.example.com/page1", "https://docs.example.com/page2"],
        "YOUR_API_KEY"
    )
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    chunks = splitter.split_documents(docs)
    print(f"Created {len(chunks)} chunks from {len(docs)} pages")

    MCP Server

    AlterLab has an official MCP server that lets AI agents call the scraping API directly. Install via npx alterlab-mcp-server — it exposes scrape, extract, screenshot, balance, and estimate tools.
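
    As a sketch, wiring the server into an MCP-compatible client is usually a single config entry like the one below; the exact file location, top-level key, and the ALTERLAB_API_KEY variable name are assumptions that depend on your client:

    JSON
    {
      "mcpServers": {
        "alterlab": {
          "command": "npx",
          "args": ["alterlab-mcp-server"],
          "env": {
            "ALTERLAB_API_KEY": "YOUR_API_KEY"
          }
        }
      }
    }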

    Related Guides

    Structured Extraction Tutorial

    Use AI to extract structured JSON data from any web page.

    Batch Scraping Guide

    Process hundreds of URLs in parallel for knowledge base construction.

    Python SDK

    Official Python SDK with async support and type hints.

    JSON Schema Filtering

    Define extraction schemas for structured output.

    Last updated: March 2026
