RAG & AI Pipelines
Scrape, extract, and structure web content for retrieval-augmented generation, vector stores, and AI agents.
The Problem
LLMs need clean, structured web data to ground their responses in facts. Building a RAG pipeline means solving several challenges at once:
- Getting past anti-bot protection to access source content
- Extracting clean text from complex HTML — stripping nav, ads, and boilerplate
- Returning content in formats LLMs can consume (markdown, plain text, structured JSON)
- Processing many pages for knowledge base construction without managing browser infrastructure
Solution Architecture
AlterLab provides the data ingestion layer for your AI pipeline:
1. Scrape & Extract
POST /scrape with formats: ["markdown", "text"] to get clean content ready for chunking and embedding.
2. Structure
Use extraction_schema to pull specific fields (title, author, date, body) into typed JSON for metadata enrichment; a sketch follows this list.
3. Scale
POST /batch to process hundreds of source documents in parallel. Use crawl to discover and ingest entire sites.
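As a rough sketch of step 2, the call below requests schema-based extraction. The schema shape shown here is an assumption for illustration only; the JSON Schema Filtering guide documents the exact format extraction_schema expects.

import requests

# Hypothetical field -> type schema; check the JSON Schema Filtering
# guide for the format extraction_schema actually expects.
schema = {
    "title": "string",
    "author": "string",
    "date": "string",
    "body": "string"
}

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://blog.example.com/post",
        "extraction_schema": schema
    }
)
response.raise_for_status()
print(response.json())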
Quick Example
Scrape a page and get LLM-ready markdown content in one call:
import requests
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://docs.example.com/api-reference",
        "formats": ["markdown", "text"]
    }
)
response.raise_for_status()

data = response.json()
markdown_content = data.get("markdown", "")
plain_text = data.get("text", "")
# Chunk for embedding (split_into_chunks is sketched below)
chunks = split_into_chunks(plain_text, max_tokens=512)
embeddings = embed_chunks(chunks)  # Your embedding model
vector_store.upsert(embeddings)  # Your vector database client
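The example assumes three pieces of your own stack: split_into_chunks, embed_chunks, and vector_store. As a minimal sketch of the first, a splitter that approximates token counts with whitespace-delimited words might look like this:

def split_into_chunks(text, max_tokens=512):
    """Group whitespace-delimited words into fixed-size chunks.

    This only approximates token counts; in production, count tokens
    with your embedding model's tokenizer instead.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]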
Advanced Patterns

Chunking Strategy
Use the markdown output with headings to create semantically meaningful chunks. AlterLab preserves document structure so you can split on heading boundaries:
import re
import requests
def scrape_and_chunk(url, api_key):
    """Scrape a page and split into heading-based chunks."""
    resp = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": api_key},
        json={"url": url, "formats": ["markdown"]}
    )
    resp.raise_for_status()
    markdown = resp.json().get("markdown", "")

    # Split on H2/H3 headings for semantic chunks
    sections = re.split(r'(?=^#{2,3}\s)', markdown, flags=re.MULTILINE)

    chunks = []
    for section in sections:
        section = section.strip()
        if section:
            chunks.append({
                "content": section,
                "source_url": url,
                "char_count": len(section)
            })
    return chunks
chunks = scrape_and_chunk(
    "https://docs.example.com/guide",
    "YOUR_API_KEY"
)
print(f"Created {len(chunks)} chunks")
LangChain Integration

Use AlterLab as a document loader in your LangChain pipeline. Scrape content and pass it directly to LangChain's text splitters and retrievers:
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
def load_documents(urls, api_key):
    """Load web pages as LangChain Documents via AlterLab."""
    documents = []
    for url in urls:
        resp = requests.post(
            "https://api.alterlab.io/api/v1/scrape",
            headers={"X-API-Key": api_key},
            json={"url": url, "formats": ["text"]}
        )
        resp.raise_for_status()
        data = resp.json()
        if data.get("text"):
            documents.append(Document(
                page_content=data["text"],
                metadata={"source": url}
            ))
    return documents
# Load and split
docs = load_documents(
    ["https://docs.example.com/page1", "https://docs.example.com/page2"],
    "YOUR_API_KEY"
)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} pages")
MCP Server

Run npx alterlab-mcp-server to start AlterLab's MCP server; it exposes scrape, extract, screenshot, balance, and estimate tools to any MCP-compatible client.

Related Guides
Structured Extraction Tutorial
Use AI to extract structured JSON data from any web page.
Batch Scraping Guide
Process hundreds of URLs in parallel for knowledge base construction.
Python SDK
Official Python SDK with async support and type hints.
JSON Schema Filtering
Define extraction schemas for structured output.