RAG & AI Pipelines
Scrape, extract, and structure web content for retrieval-augmented generation, vector stores, and AI agents.
The Problem
LLMs need clean, structured web data to ground their responses in facts. Building a RAG pipeline means solving several challenges at once:
- Getting past anti-bot protection to access source content
- Extracting clean text from complex HTML — stripping nav, ads, and boilerplate
- Returning content in formats LLMs can consume (markdown, plain text, structured JSON)
- Processing many pages for knowledge base construction without managing browser infrastructure
Solution Architecture
AlterLab provides the data ingestion layer for your AI pipeline:
1. Scrape & Extract
POST /scrape with formats: ["markdown", "text"] to get clean content ready for chunking and embedding.
2. Structure
Use extraction_schema to pull specific fields (title, author, date, body) into typed JSON for metadata enrichment; a sketch follows this list.
3. Scale
POST /batch to process hundreds of source documents in parallel. Use crawl to discover and ingest entire sites.
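As a rough sketch of step 2, the call below requests schema-based extraction. The schema shape shown here is an assumption for illustration only; the JSON Schema Filtering guide documents the exact format extraction_schema expects.

import requests

# Hypothetical field -> type schema; check the JSON Schema Filtering
# guide for the format extraction_schema actually expects.
schema = {
    "title": "string",
    "author": "string",
    "date": "string",
    "body": "string"
}

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://blog.example.com/post",
        "extraction_schema": schema
    }
)
response.raise_for_status()
print(response.json())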
Quick Example
Scrape a page and get LLM-ready markdown content in one call:
import requests
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://docs.example.com/api-reference",
        "formats": ["markdown", "text"]
    }
)
response.raise_for_status()

data = response.json()
markdown_content = data.get("markdown", "")
plain_text = data.get("text", "")
# Chunk for embedding (split_into_chunks is sketched below)
chunks = split_into_chunks(plain_text, max_tokens=512)
embeddings = embed_chunks(chunks)  # Your embedding model
vector_store.upsert(embeddings)  # Your vector database client
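The example assumes three pieces of your own stack: split_into_chunks, embed_chunks, and vector_store. As a minimal sketch of the first, a splitter that approximates token counts with whitespace-delimited words might look like this:

def split_into_chunks(text, max_tokens=512):
    """Group whitespace-delimited words into fixed-size chunks.

    This only approximates token counts; in production, count tokens
    with your embedding model's tokenizer instead.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]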
Advanced Patterns

Chunking Strategy
Use the markdown output with headings to create semantically meaningful chunks. AlterLab preserves document structure so you can split on heading boundaries:
import re
import requests
def scrape_and_chunk(url, api_key):
    """Scrape a page and split into heading-based chunks."""
    resp = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": api_key},
        json={"url": url, "formats": ["markdown"]}
    )
    resp.raise_for_status()
    markdown = resp.json().get("markdown", "")

    # Split on H2/H3 headings for semantic chunks
    sections = re.split(r'(?=^#{2,3}\s)', markdown, flags=re.MULTILINE)

    chunks = []
    for section in sections:
        section = section.strip()
        if section:
            chunks.append({
                "content": section,
                "source_url": url,
                "char_count": len(section)
            })
    return chunks
chunks = scrape_and_chunk(
    "https://docs.example.com/guide",
    "YOUR_API_KEY"
)
print(f"Created {len(chunks)} chunks")
LangChain Integration

Use AlterLab as a document loader in your LangChain pipeline. Scrape content and pass it directly to LangChain's text splitters and retrievers:
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
def load_documents(urls, api_key):
    """Load web pages as LangChain Documents via AlterLab."""
    documents = []
    for url in urls:
        resp = requests.post(
            "https://api.alterlab.io/api/v1/scrape",
            headers={"X-API-Key": api_key},
            json={"url": url, "formats": ["text"]}
        )
        resp.raise_for_status()
        data = resp.json()
        if data.get("text"):
            documents.append(Document(
                page_content=data["text"],
                metadata={"source": url}
            ))
    return documents
# Load and split
docs = load_documents(
    ["https://docs.example.com/page1", "https://docs.example.com/page2"],
    "YOUR_API_KEY"
)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} pages")
MCP Server

Run npx alterlab-mcp-server to start AlterLab's MCP server; it exposes scrape, extract, screenshot, balance, and estimate tools to any MCP-compatible client.

Related Guides
Structured Extraction Tutorial
Use AI to extract structured JSON data from any web page.
Batch Scraping Guide
Process hundreds of URLs in parallel for knowledge base construction.
Python SDK
Official Python SDK with async support and type hints.
JSON Schema Filtering
Define extraction schemas for structured output.