LangChain
Use AlterLab as a document loader in LangChain to build RAG pipelines with live web data. Scrape pages, chunk content, embed into vector stores, and query with natural language.
Installation
pip install alterlab langchain langchain-community langchain-openai chromadb

You need the AlterLab Python SDK and the LangChain community package. The examples below also use ChromaDB for the vector store and OpenAI for embeddings, but you can substitute any LangChain-compatible alternatives.
Basic Usage
Document Loader
The simplest way to use AlterLab with LangChain is as a custom document loader. Each URL becomes a LangChain Document with page content and metadata.
from langchain.schema import Document
from alterlab import AlterLab
client = AlterLab(api_key="your_api_key")
def load_url(url: str, mode: str = "auto") -> Document:
"""Load a URL as a LangChain Document via AlterLab."""
result = client.scrape(url, formats=["markdown"], mode=mode)
return Document(
page_content=result.get("markdown", result.get("text", "")),
metadata={
"source": url,
"title": result.get("metadata", {}).get("title", ""),
"status_code": result.get("metadata", {}).get("status_code"),
"credits_used": result.get("cost", {}).get("credits_charged"),
},
)
# Load a single page
doc = load_url("https://example.com/blog/ai-trends")
print(f"Loaded {len(doc.page_content)} chars from {doc.metadata['source']}")Loading Options
Loading Options

Pass AlterLab parameters to control how pages are scraped:
# JavaScript-heavy SPA — use JS rendering
doc = load_url("https://app.example.com/dashboard", mode="js")
# Get multiple formats at once
result = client.scrape(
"https://example.com",
formats=["markdown", "text", "json"],
)
# Use cost controls for budget-conscious pipelines
result = client.scrape(
"https://example.com",
formats=["markdown"],
cost_controls={"max_tier": "2", "fail_fast": True},
)RAG Pipeline
Build a complete Retrieval-Augmented Generation pipeline in three steps: load and chunk web content, embed into a vector store, then query with natural language.
Step 1: Load & Chunk
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load multiple pages
urls = [
"https://docs.example.com/getting-started",
"https://docs.example.com/api-reference",
"https://docs.example.com/tutorials",
]
documents = [load_url(url) for url in urls]
print(f"Loaded {len(documents)} documents")
# Split into chunks for embedding
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")Step 2: Embed & Store
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Create embeddings and store in ChromaDB
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="web_docs",
    persist_directory="./chroma_db",
)
print(f"Stored {len(chunks)} chunks in vector store")
Step 3: Query

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
answer = qa_chain.invoke("How do I authenticate with the API?")
print(answer["result"])
Full RAG Example

A complete end-to-end example that scrapes a documentation site and answers questions about it:
from alterlab import AlterLab
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
# 1. Configure clients
scraper = AlterLab(api_key="your_alterlab_key")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# 2. Scrape pages into Documents
urls = [
"https://docs.stripe.com/api/charges",
"https://docs.stripe.com/api/customers",
"https://docs.stripe.com/api/refunds",
]
documents = []
for url in urls:
result = scraper.scrape(url, formats=["markdown"])
documents.append(
Document(
page_content=result.get("markdown", ""),
metadata={"source": url},
)
)
# 3. Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
# 4. Embed and store
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=OpenAIEmbeddings(),
)
# 5. Query
qa = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
questions = [
"How do I create a charge?",
"What parameters does the refund endpoint accept?",
"How do I list all customers?",
]
for q in questions:
answer = qa.invoke(q)
print(f"Q: {q}")
print(f"A: {answer['result']}\n")Batch Loading
For large document sets, use AlterLab's batch endpoint to load many pages concurrently:
import time
from alterlab import AlterLab
from langchain.schema import Document
client = AlterLab(api_key="your_api_key")
urls = [
"https://example.com/page-1",
"https://example.com/page-2",
"https://example.com/page-3",
# ... up to 100 URLs per batch
]
# Submit batch
batch = client.batch_scrape(
urls=[{"url": u, "formats": ["markdown"]} for u in urls],
)
# Poll until done
while True:
status = client.get_batch_status(batch["batch_id"])
if status["status"] != "processing":
break
time.sleep(2)
# Convert to LangChain Documents
documents = []
for item in status["items"]:
if item["status"] == "succeeded":
documents.append(
Document(
page_content=item["result"].get("markdown", ""),
metadata={"source": item["url"]},
)
)
print(f"Loaded {len(documents)} documents from batch")Structured Extraction
Structured Extraction

Combine AlterLab's AI extraction with LangChain for type-safe structured data:
# Use AlterLab's built-in extraction profiles
result = client.scrape(
"https://example.com/product",
formats=["json"],
extraction_profile="product",
)
# The JSON output contains structured product data
product = result.get("json", {})
print(f"Product: {product.get('name')}")
print(f"Price: {product.get('price')}")
# Or use a custom schema
result = client.scrape(
"https://example.com/article",
formats=["json"],
extraction_schema={
"type": "object",
"properties": {
"title": {"type": "string"},
"author": {"type": "string"},
"published_date": {"type": "string"},
"summary": {"type": "string"},
},
},
)Tips & Best Practices
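To make the extracted data type-safe on the LangChain side, validate the JSON output against a Pydantic model (Pydantic ships as a LangChain dependency). The ArticleData model below is ours and simply mirrors the custom schema above:

from pydantic import BaseModel

class ArticleData(BaseModel):
    """Mirrors the custom extraction schema above."""
    title: str
    author: str
    published_date: str
    summary: str

# Raises pydantic.ValidationError if fields are missing or mistyped (Pydantic v2 API)
article = ArticleData.model_validate(result.get("json", {}))
print(f"{article.title} by {article.author}")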
Tips & Best Practices

- Use markdown format for RAG pipelines. It preserves headings, lists, and structure while being cleaner than HTML for chunking.
- Set cost controls with max_tier to limit per-page costs in large crawls. Tier 1-2 is sufficient for most documentation sites.
- Enable caching when re-running pipelines during development. Cache hits are free and return instantly.
- Use batch scraping for loading 10+ pages. It is faster than sequential scraping and handles concurrency for you.
- Chunk on markdown headers by including "\n## ", "\n### " in your text splitter separators. This creates semantically meaningful chunks; a header-based alternative is sketched after this list.