Integration

AI Framework

LangChain

Use AlterLab as a document loader in LangChain to build RAG pipelines with live web data. Scrape pages, chunk content, embed into vector stores, and query with natural language.

Why AlterLab + LangChain?

Standard web loaders fail on JavaScript-heavy sites and get blocked by anti-bot systems. AlterLab handles JS rendering, anti-bot bypass, and returns clean text — ideal for LLM consumption.

Installation

Bash

pip install alterlab langchain langchain-community langchain-openai chromadb

You need the AlterLab Python SDK and LangChain community package. The examples below also use ChromaDB for the vector store and OpenAI for embeddings, but you can substitute any LangChain-compatible alternatives.

Basic Usage

Document Loader

The simplest way to use AlterLab with LangChain is as a custom document loader. Each URL becomes a LangChain Document with page content and metadata.

Python

from langchain.schema import Document
from alterlab import AlterLab

client = AlterLab(api_key="your_api_key")

def load_url(url: str, mode: str = "auto") -> Document:
    """Load a URL as a LangChain Document via AlterLab."""
    result = client.scrape(url, formats=["markdown"], mode=mode)
    return Document(
        page_content=result.get("markdown", result.get("text", "")),
        metadata={
            "source": url,
            "title": result.get("metadata", {}).get("title", ""),
            "status_code": result.get("metadata", {}).get("status_code"),
            "credits_used": result.get("cost", {}).get("credits_charged"),
        },
    )

# Load a single page
doc = load_url("https://example.com/blog/ai-trends")
print(f"Loaded {len(doc.page_content)} chars from {doc.metadata['source']}")

Loading Options

Pass AlterLab parameters to control how pages are scraped:

Python

# JavaScript-heavy SPA — use JS rendering
doc = load_url("https://app.example.com/dashboard", mode="js")

# Get multiple formats at once
result = client.scrape(
    "https://example.com",
    formats=["markdown", "text", "json"],
)

# Use cost controls for budget-conscious pipelines
result = client.scrape(
    "https://example.com",
    formats=["markdown"],
    cost_controls={"max_tier": "2", "fail_fast": True},
)

RAG Pipeline

Build a complete Retrieval-Augmented Generation pipeline in three steps: load and chunk web content, embed into a vector store, then query with natural language.

Step 1: Load & Chunk

Python

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load multiple pages
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
]

documents = [load_url(url) for url in urls]
print(f"Loaded {len(documents)} documents")

# Split into chunks for embedding
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)

chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

Step 2: Embed & Store

Python

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Create embeddings and store in ChromaDB
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="web_docs",
    persist_directory="./chroma_db",
)

print(f"Stored {len(chunks)} chunks in vector store")

Step 3: Query

Python

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

answer = qa_chain.invoke("How do I authenticate with the API?")
print(answer["result"])

Full RAG Example

A complete end-to-end example that scrapes a documentation site and answers questions about it:

Python

from alterlab import AlterLab
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 1. Configure clients
scraper = AlterLab(api_key="your_alterlab_key")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# 2. Scrape pages into Documents
urls = [
    "https://docs.stripe.com/api/charges",
    "https://docs.stripe.com/api/customers",
    "https://docs.stripe.com/api/refunds",
]

documents = []
for url in urls:
    result = scraper.scrape(url, formats=["markdown"])
    documents.append(
        Document(
            page_content=result.get("markdown", ""),
            metadata={"source": url},
        )
    )

# 3. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 4. Embed and store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(),
)

# 5. Query
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

questions = [
    "How do I create a charge?",
    "What parameters does the refund endpoint accept?",
    "How do I list all customers?",
]

for q in questions:
    answer = qa.invoke(q)
    print(f"Q: {q}")
    print(f"A: {answer['result']}\n")

Batch Loading

For large document sets, use AlterLab's batch endpoint to load many pages concurrently:

Python

import time
from alterlab import AlterLab
from langchain.schema import Document

client = AlterLab(api_key="your_api_key")

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
    # ... up to 100 URLs per batch
]

# Submit batch
batch = client.batch_scrape(
    urls=[{"url": u, "formats": ["markdown"]} for u in urls],
)

# Poll until done
while True:
    status = client.get_batch_status(batch["batch_id"])
    if status["status"] != "processing":
        break
    time.sleep(2)

# Convert to LangChain Documents
documents = []
for item in status["items"]:
    if item["status"] == "succeeded":
        documents.append(
            Document(
                page_content=item["result"].get("markdown", ""),
                metadata={"source": item["url"]},
            )
        )

print(f"Loaded {len(documents)} documents from batch")

Structured Extraction

Combine AlterLab's AI extraction with LangChain for type-safe structured data:

Python

# Use AlterLab's built-in extraction profiles
result = client.scrape(
    "https://example.com/product",
    formats=["json"],
    extraction_profile="product",
)

# The JSON output contains structured product data
product = result.get("json", {})
print(f"Product: {product.get('name')}")
print(f"Price: {product.get('price')}")

# Or use a custom schema
result = client.scrape(
    "https://example.com/article",
    formats=["json"],
    extraction_schema={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "published_date": {"type": "string"},
            "summary": {"type": "string"},
        },
    },
)

Tips & Best Practices

Use markdown format for RAG pipelines. It preserves headings, lists, and structure while being cleaner than HTML for chunking.
Set cost controls with max_tier to limit per-page costs in large crawls. Tier 1-2 is sufficient for most documentation sites.
Enable caching when re-running pipelines during development. Cache hits are free and return instantly.
Use batch scraping for loading 10+ pages. It is faster than sequential scraping and handles concurrency for you.
Chunk on markdown headers by including "\n## ", "\n### " in your text splitter separators. This creates semantically meaningful chunks.

Last updated: March 2026

Integration

AI Framework

LangChain

Use AlterLab as a document loader in LangChain to build RAG pipelines with live web data. Scrape pages, chunk content, embed into vector stores, and query with natural language.

Why AlterLab + LangChain?

Standard web loaders fail on JavaScript-heavy sites and get blocked by anti-bot systems. AlterLab handles JS rendering, anti-bot bypass, and returns clean text — ideal for LLM consumption.

Installation

Bash

pip install alterlab langchain langchain-community langchain-openai chromadb

Basic Usage

Document Loader

The simplest way to use AlterLab with LangChain is as a custom document loader. Each URL becomes a LangChain Document with page content and metadata.

Python

from langchain.schema import Document
from alterlab import AlterLab

client = AlterLab(api_key="your_api_key")

def load_url(url: str, mode: str = "auto") -> Document:
    """Load a URL as a LangChain Document via AlterLab."""
    result = client.scrape(url, formats=["markdown"], mode=mode)
    return Document(
        page_content=result.get("markdown", result.get("text", "")),
        metadata={
            "source": url,
            "title": result.get("metadata", {}).get("title", ""),
            "status_code": result.get("metadata", {}).get("status_code"),
            "credits_used": result.get("cost", {}).get("credits_charged"),
        },
    )

# Load a single page
doc = load_url("https://example.com/blog/ai-trends")
print(f"Loaded {len(doc.page_content)} chars from {doc.metadata['source']}")

Loading Options

Pass AlterLab parameters to control how pages are scraped:

Python

# JavaScript-heavy SPA — use JS rendering
doc = load_url("https://app.example.com/dashboard", mode="js")

# Get multiple formats at once
result = client.scrape(
    "https://example.com",
    formats=["markdown", "text", "json"],
)

# Use cost controls for budget-conscious pipelines
result = client.scrape(
    "https://example.com",
    formats=["markdown"],
    cost_controls={"max_tier": "2", "fail_fast": True},
)

RAG Pipeline

Build a complete Retrieval-Augmented Generation pipeline in three steps: load and chunk web content, embed into a vector store, then query with natural language.

Step 1: Load & Chunk

Python

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load multiple pages
urls = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/tutorials",
]

documents = [load_url(url) for url in urls]
print(f"Loaded {len(documents)} documents")

# Split into chunks for embedding
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)

chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

Step 2: Embed & Store

Python

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Create embeddings and store in ChromaDB
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="web_docs",
    persist_directory="./chroma_db",
)

print(f"Stored {len(chunks)} chunks in vector store")

Step 3: Query

Python

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

answer = qa_chain.invoke("How do I authenticate with the API?")
print(answer["result"])

Full RAG Example

A complete end-to-end example that scrapes a documentation site and answers questions about it:

Python

from alterlab import AlterLab
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 1. Configure clients
scraper = AlterLab(api_key="your_alterlab_key")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# 2. Scrape pages into Documents
urls = [
    "https://docs.stripe.com/api/charges",
    "https://docs.stripe.com/api/customers",
    "https://docs.stripe.com/api/refunds",
]

documents = []
for url in urls:
    result = scraper.scrape(url, formats=["markdown"])
    documents.append(
        Document(
            page_content=result.get("markdown", ""),
            metadata={"source": url},
        )
    )

# 3. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 4. Embed and store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(),
)

# 5. Query
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

questions = [
    "How do I create a charge?",
    "What parameters does the refund endpoint accept?",
    "How do I list all customers?",
]

for q in questions:
    answer = qa.invoke(q)
    print(f"Q: {q}")
    print(f"A: {answer['result']}\n")

Batch Loading

For large document sets, use AlterLab's batch endpoint to load many pages concurrently:

Python

import time
from alterlab import AlterLab
from langchain.schema import Document

client = AlterLab(api_key="your_api_key")

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
    # ... up to 100 URLs per batch
]

# Submit batch
batch = client.batch_scrape(
    urls=[{"url": u, "formats": ["markdown"]} for u in urls],
)

# Poll until done
while True:
    status = client.get_batch_status(batch["batch_id"])
    if status["status"] != "processing":
        break
    time.sleep(2)

# Convert to LangChain Documents
documents = []
for item in status["items"]:
    if item["status"] == "succeeded":
        documents.append(
            Document(
                page_content=item["result"].get("markdown", ""),
                metadata={"source": item["url"]},
            )
        )

print(f"Loaded {len(documents)} documents from batch")

Structured Extraction

Combine AlterLab's AI extraction with LangChain for type-safe structured data:

Python

# Use AlterLab's built-in extraction profiles
result = client.scrape(
    "https://example.com/product",
    formats=["json"],
    extraction_profile="product",
)

# The JSON output contains structured product data
product = result.get("json", {})
print(f"Product: {product.get('name')}")
print(f"Price: {product.get('price')}")

# Or use a custom schema
result = client.scrape(
    "https://example.com/article",
    formats=["json"],
    extraction_schema={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "published_date": {"type": "string"},
            "summary": {"type": "string"},
        },
    },
)

Tips & Best Practices

Use markdown format for RAG pipelines. It preserves headings, lists, and structure while being cleaner than HTML for chunking.
Set cost controls with max_tier to limit per-page costs in large crawls. Tier 1-2 is sufficient for most documentation sites.
Enable caching when re-running pipelines during development. Cache hits are free and return instantly.
Use batch scraping for loading 10+ pages. It is faster than sequential scraping and handles concurrency for you.
Chunk on markdown headers by including "\n## ", "\n### " in your text splitter separators. This creates semantically meaningful chunks.

Last updated: March 2026