
    LangChain

    Use AlterLab as a document loader in LangChain to build RAG pipelines with live web data. Scrape pages, chunk content, embed into vector stores, and query with natural language.

    Why AlterLab + LangChain?

    Standard web loaders fail on JavaScript-heavy sites and get blocked by anti-bot systems. AlterLab handles JS rendering, anti-bot bypass, and returns clean text — ideal for LLM consumption.

    Installation

    Bash
    pip install alterlab langchain langchain-community langchain-openai chromadb

    You need the AlterLab Python SDK and LangChain community package. The examples below also use ChromaDB for the vector store and OpenAI for embeddings, but you can substitute any LangChain-compatible alternatives.

    Basic Usage

    Document Loader

    The simplest way to use AlterLab with LangChain is as a custom document loader. Each URL becomes a LangChain Document with page content and metadata.

    Python
    from langchain.schema import Document
    from alterlab import AlterLab
    
    client = AlterLab(api_key="your_api_key")
    
    def load_url(url: str, mode: str = "auto") -> Document:
        """Load a URL as a LangChain Document via AlterLab."""
        result = client.scrape(url, formats=["markdown"], mode=mode)
        return Document(
            page_content=result.get("markdown", result.get("text", "")),
            metadata={
                "source": url,
                "title": result.get("metadata", {}).get("title", ""),
                "status_code": result.get("metadata", {}).get("status_code"),
                "credits_used": result.get("cost", {}).get("credits_charged"),
            },
        )
    
    # Load a single page
    doc = load_url("https://example.com/blog/ai-trends")
    print(f"Loaded {len(doc.page_content)} chars from {doc.metadata['source']}")

    Loading Options

    Pass AlterLab parameters to control how pages are scraped:

    Python
    # JavaScript-heavy SPA — use JS rendering
    doc = load_url("https://app.example.com/dashboard", mode="js")
    
    # Get multiple formats at once
    result = client.scrape(
        "https://example.com",
        formats=["markdown", "text", "json"],
    )
    
    # Use cost controls for budget-conscious pipelines
    result = client.scrape(
        "https://example.com",
        formats=["markdown"],
        cost_controls={"max_tier": "2", "fail_fast": True},
    )

    RAG Pipeline

    Build a complete Retrieval-Augmented Generation pipeline in three steps: load and chunk web content, embed into a vector store, then query with natural language.

    Step 1: Load & Chunk

    Python
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    # Load multiple pages
    urls = [
        "https://docs.example.com/getting-started",
        "https://docs.example.com/api-reference",
        "https://docs.example.com/tutorials",
    ]
    
    documents = [load_url(url) for url in urls]
    print(f"Loaded {len(documents)} documents")
    
    # Split into chunks for embedding
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "],
    )
    
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")

    Step 2: Embed & Store

    Python
    from langchain_openai import OpenAIEmbeddings
    from langchain_community.vectorstores import Chroma
    
    # Create embeddings and store in ChromaDB
    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name="web_docs",
        persist_directory="./chroma_db",
    )
    
    print(f"Stored {len(chunks)} chunks in vector store")

    Step 3: Query

    Python
    from langchain_openai import ChatOpenAI
    from langchain.chains import RetrievalQA
    
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    )
    
    answer = qa_chain.invoke({"query": "How do I authenticate with the API?"})
    print(answer["result"])
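
    Recent LangChain releases mark RetrievalQA as legacy. If you prefer the newer chain constructors, an equivalent sketch using create_retrieval_chain (reusing llm and vectorstore from above) looks like this:

    Python
    from langchain.chains import create_retrieval_chain
    from langchain.chains.combine_documents import create_stuff_documents_chain
    from langchain_core.prompts import ChatPromptTemplate
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer using only this context:\n\n{context}"),
        ("human", "{input}"),
    ])
    
    rag_chain = create_retrieval_chain(
        vectorstore.as_retriever(search_kwargs={"k": 4}),
        create_stuff_documents_chain(llm, prompt),
    )
    
    response = rag_chain.invoke({"input": "How do I authenticate with the API?"})
    print(response["answer"])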

    Full RAG Example

    A complete end-to-end example that scrapes a documentation site and answers questions about it:

    Python
    from alterlab import AlterLab
    from langchain.schema import Document
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_community.vectorstores import Chroma
    from langchain.chains import RetrievalQA
    
    # 1. Configure clients
    scraper = AlterLab(api_key="your_alterlab_key")
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    
    # 2. Scrape pages into Documents
    urls = [
        "https://docs.stripe.com/api/charges",
        "https://docs.stripe.com/api/customers",
        "https://docs.stripe.com/api/refunds",
    ]
    
    documents = []
    for url in urls:
        result = scraper.scrape(url, formats=["markdown"])
        documents.append(
            Document(
                page_content=result.get("markdown", ""),
                metadata={"source": url},
            )
        )
    
    # 3. Chunk
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(documents)
    
    # 4. Embed and store
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=OpenAIEmbeddings(),
    )
    
    # 5. Query
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    )
    
    questions = [
        "How do I create a charge?",
        "What parameters does the refund endpoint accept?",
        "How do I list all customers?",
    ]
    
    for q in questions:
        answer = qa.invoke({"query": q})
        print(f"Q: {q}")
        print(f"A: {answer['result']}\n")

    Batch Loading

    For large document sets, use AlterLab's batch endpoint to load many pages concurrently:

    Python
    import time
    from alterlab import AlterLab
    from langchain.schema import Document
    
    client = AlterLab(api_key="your_api_key")
    
    urls = [
        "https://example.com/page-1",
        "https://example.com/page-2",
        "https://example.com/page-3",
        # ... up to 100 URLs per batch
    ]
    
    # Submit batch
    batch = client.batch_scrape(
        urls=[{"url": u, "formats": ["markdown"]} for u in urls],
    )
    
    # Poll until done
    while True:
        status = client.get_batch_status(batch["batch_id"])
        if status["status"] != "processing":
            break
        time.sleep(2)
    
    # Convert to LangChain Documents
    documents = []
    for item in status["items"]:
        if item["status"] == "succeeded":
            documents.append(
                Document(
                    page_content=item["result"].get("markdown", ""),
                    metadata={"source": item["url"]},
                )
            )
    
    print(f"Loaded {len(documents)} documents from batch")

    Structured Extraction

    Combine AlterLab's AI extraction with LangChain for type-safe structured data:

    Python
    # Use AlterLab's built-in extraction profiles
    result = client.scrape(
        "https://example.com/product",
        formats=["json"],
        extraction_profile="product",
    )
    
    # The JSON output contains structured product data
    product = result.get("json", {})
    print(f"Product: {product.get('name')}")
    print(f"Price: {product.get('price')}")
    
    # Or use a custom schema
    result = client.scrape(
        "https://example.com/article",
        formats=["json"],
        extraction_schema={
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "published_date": {"type": "string"},
                "summary": {"type": "string"},
            },
        },
    )
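
    To make the result genuinely type-safe, validate the extracted JSON against a Pydantic model before passing it downstream. A sketch using the article schema above (Pydantic v2):

    Python
    from pydantic import BaseModel, ValidationError
    
    class Article(BaseModel):
        title: str
        author: str
        published_date: str
        summary: str
    
    try:
        article = Article.model_validate(result.get("json", {}))
        print(f"{article.title} by {article.author}")
    except ValidationError as e:
        print(f"Extraction did not match the schema: {e}")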

    Tips & Best Practices

    • Use markdown format for RAG pipelines. It preserves headings, lists, and structure while being cleaner than HTML for chunking.
    • Set cost controls with max_tier to limit per-page costs in large crawls. Tier 1-2 is sufficient for most documentation sites.
    • Enable caching when re-running pipelines during development. Cache hits are free and return instantly. A local memoization layer also helps; see the sketch after this list.
    • Use batch scraping for loading 10+ pages. It is faster than sequential scraping and handles concurrency for you.
    • Chunk on markdown headers by including "\n## ", "\n### " in your text splitter separators. This creates semantically meaningful chunks.
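
    Independent of AlterLab's server-side cache, a small local memoization layer around load_url avoids repeated network calls while you iterate on chunking and prompts. A minimal sketch, reusing load_url and Document from the Document Loader section:

    Python
    import hashlib
    import json
    import pathlib
    
    CACHE_DIR = pathlib.Path(".scrape_cache")
    CACHE_DIR.mkdir(exist_ok=True)
    
    def load_url_cached(url: str, mode: str = "auto") -> Document:
        """Disk-cached wrapper around load_url for development runs."""
        key = hashlib.sha256(f"{url}:{mode}".encode()).hexdigest()
        path = CACHE_DIR / f"{key}.json"
        if path.exists():
            return Document(**json.loads(path.read_text()))
        doc = load_url(url, mode=mode)
        path.write_text(json.dumps(
            {"page_content": doc.page_content, "metadata": doc.metadata}
        ))
        return doc
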
    Last updated: March 2026
