AlterLab
Tutorials

Web Scraping Pipeline for RAG: Clean Data for LLMs

Build a 5-stage scraping pipeline that delivers token-efficient, clean text to your RAG system. Python code for extraction, chunking, and embedding included.

Yash Dubey

March 19, 2026

9 min read

Raw HTML is poison for RAG. A typical news article page is 45,000 characters—roughly 11,000 tokens. The actual article is 800 words, or about 1,100 tokens. You are paying 10× to embed navigation menus, cookie banners, footer links, and inline scripts that actively dilute your embeddings and degrade retrieval quality.
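
That 10× figure is just the ~4 characters-per-token heuristic applied to both sizes (the 4,400-character article length here is inferred from the stated ~1,100 tokens):

```python
def approx_tokens(n_chars: int, chars_per_token: int = 4) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return n_chars // chars_per_token


page_tokens = approx_tokens(45_000)    # full raw page: 11250 tokens
article_tokens = approx_tokens(4_400)  # the 800-word article body: 1100 tokens
waste_ratio = page_tokens / article_tokens

print(page_tokens, article_tokens, round(waste_ratio, 1))  # 11250 1100 10.2
```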

The fix is a five-stage pipeline: reliable fetch → content extraction → normalization → semantic chunking → embed and index. Each stage has a single responsibility. Each failure is isolated and debuggable. This post walks through a production implementation in Python.

  • ~90% token reduction
  • Improved retrieval precision
  • 5 discrete pipeline stages

Pipeline Architecture

Reliable fetch → content extraction → normalization → semantic chunking → embed and index.


Stage 1: Reliable Fetching

The hardest part of scraping at scale is not parsing—it is getting the HTML. Bot detection blocks requests. JavaScript-rendered SPAs return skeleton HTML to static fetches. IP ranges accumulate blocks.

AlterLab's scraping API handles this in a single POST: rotating residential proxies, automatic CAPTCHA bypass, and optional headless rendering without managing a browser fleet yourself.

Python:

Python
import httpx

ALTERLAB_API_KEY = "YOUR_API_KEY"
ALTERLAB_BASE_URL = "https://api.alterlab.io/v1"


def fetch_page(url: str, render_js: bool = False) -> str:
    """Fetch fully-rendered HTML from any URL."""
    response = httpx.post(
        f"{ALTERLAB_BASE_URL}/scrape",
        headers={"X-API-Key": ALTERLAB_API_KEY, "Content-Type": "application/json"},
        json={
            "url": url,
            "render_js": render_js,
            "wait_for": "networkidle" if render_js else None,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html"]

cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs/getting-started",
    "render_js": true,
    "wait_for": "networkidle"
  }'
Try it yourself: fetch Hacker News with AlterLab and inspect the raw HTML response before extraction.


Stage 2: Content Extraction

trafilatura is the most accurate open-source library for pulling article body text from HTML. It outperforms readability-lxml and newspaper3k on structured documentation and blog content because it uses both DOM heuristics and text-density scoring.

Python
import json
import trafilatura
from trafilatura.settings import use_config

# Disable per-document timeout—let your own retry logic own the clock
config = use_config()
config.set("DEFAULT", "EXTRACTION_TIMEOUT", "0")


def extract_content(html: str, url: str) -> dict:
    """
    Extract main content from HTML.
    Returns dict with keys: text, title, author, date, description.
    Raises ValueError if no content can be extracted.
    """
    result = trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        config=config,
        output_format="json",
        with_metadata=True,
    )

    if result is None:
        raise ValueError(f"Extraction returned no content for {url}")

    return json.loads(result)

Set no_fallback=False to allow trafilatura to fall back to its secondary heuristic if the primary DOM analysis returns nothing—useful for pages with unconventional layouts.


Stage 3: Normalization

After extraction, text still contains artifacts: Unicode non-breaking spaces (\u00a0), zero-width joiners, smart quotes, triple-newline runs from CMS templates, and stub lines that are purely punctuation.

Python
import re
import unicodedata


def normalize_text(text: str) -> str:
    # Compatibility normalization: folds ligatures, fullwidth forms, and NBSP variants
    text = unicodedata.normalize("NFKC", text)

    # Replace invisible/non-breaking whitespace variants
    text = re.sub(r"[\u00a0\u200b\u200c\u200d\ufeff]", " ", text)

    # Collapse horizontal whitespace, preserve single newlines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)

    # Drop lines shorter than 4 chars (nav artifacts: "›", "|", "»")
    lines = [ln for ln in text.split("\n") if len(ln.strip()) > 3 or ln.strip() == ""]

    return "\n".join(lines).strip()

This pass runs in microseconds per document and prevents garbage tokens from reaching your embedding model.
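
A quick sanity check, using a self-contained copy of the normalization steps above on a deliberately messy sample:

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    # Same steps as the pipeline's normalization pass
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[\u00a0\u200b\u200c\u200d\ufeff]", " ", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    lines = [ln for ln in text.split("\n") if len(ln.strip()) > 3 or ln.strip() == ""]
    return "\n".join(lines).strip()


# NBSP, zero-width space, a 4-newline run, and a stray "›" nav stub
messy = "Caf\u00e9\u00a0menu\u200b items\n\n\n\n\u203a\nNext   paragraph"
print(repr(normalize_text(messy)))  # 'Café menu items\n\nNext paragraph'
```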


Stage 4: Chunking Strategy

Three mistakes that kill retrieval quality:

  • Fixed character splits break sentences mid-clause. The embedding for a sentence fragment does not represent a complete thought.
  • Whole documents as single vectors average all content into one point in embedding space. Specific queries retrieve nothing useful.
  • Zero overlap means a concept bridging two chunks never matches a query that references it as a unit.
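
The first mistake takes two lines to demonstrate (the sample text is made up):

```python
text = "The cache invalidation step runs last. It must never run first."

# Naive 40-character split: the first chunk cuts into the second sentence,
# so its embedding encodes a dangling fragment rather than a complete thought.
fixed_chunks = [text[i : i + 40] for i in range(0, len(text), 40)]
print(fixed_chunks[0])  # The cache invalidation step runs last. I
```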

Use recursive sentence-aware chunking with configurable overlap:

Python
from __future__ import annotations
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    url: str
    chunk_index: int
    total_chunks: int
    metadata: dict = field(default_factory=dict)


def split_sentences(text: str) -> list[str]:
    """Sentence-boundary split on terminal punctuation followed by uppercase."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z\"])", text)


def chunk_document(
    text: str,
    url: str,
    max_tokens: int = 400,
    overlap_sentences: int = 2,
    chars_per_token: float = 4.0,
) -> list[Chunk]:
    """
    Split text into token-bounded chunks with sentence-level overlap.

    Args:
        max_tokens: Approximate token ceiling per chunk.
        overlap_sentences: Sentences carried over to the next chunk.
        chars_per_token: Heuristic for English prose (4.0 is reliable).
    """
    max_chars = int(max_tokens * chars_per_token)
    sentences = split_sentences(text)

    raw_chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for sentence in sentences:
        slen = len(sentence)
        if current_len + slen > max_chars and current:
            raw_chunks.append(" ".join(current))
            current = current[-overlap_sentences:]
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += slen

    if current:
        raw_chunks.append(" ".join(current))

    total = len(raw_chunks)
    return [
        Chunk(text=t, url=url, chunk_index=i, total_chunks=total)
        for i, t in enumerate(raw_chunks)
    ]

Token ceiling guidelines by model:

  • Smaller models (7B–13B): ~256 tokens per chunk
  • Most embedding models: 300–500 tokens with 1–2 sentence overlap
  • Large-context models (GPT-4, Claude): 800–1,000 tokens


Stage 5: Embedding and Indexing

Batch your embedding calls. The OpenAI embeddings API accepts up to 2,048 inputs per request—sending one chunk per call is 100× slower and burns rate limit quota unnecessarily.

Python
import asyncio
from openai import AsyncOpenAI
import pinecone

openai_client = AsyncOpenAI()


def init_index(api_key: str, index_name: str) -> pinecone.Index:
    pc = pinecone.Pinecone(api_key=api_key)
    return pc.Index(index_name)


async def embed_texts(texts: list[str]) -> list[list[float]]:
    """Batch embed up to 2048 texts in a single API call."""
    response = await openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        encoding_format="float",
    )
    return [item.embedding for item in response.data]


async def index_chunks(
    chunks: list["Chunk"],
    index: pinecone.Index,
    batch_size: int = 100,
) -> None:
    """Embed and upsert chunks into Pinecone with source metadata preserved."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        vectors = await embed_texts([c.text for c in batch])

        upserts = [
            {
                "id": f"{c.url}::{c.chunk_index}",
                "values": vectors[j],
                "metadata": {
                    "url": c.url,
                    "chunk_index": c.chunk_index,
                    "total_chunks": c.total_chunks,
                    "text": c.text,  # store inline—avoids a separate fetch at query time
                },
            }
            for j, c in enumerate(batch)
        ]

        index.upsert(vectors=upserts)

Store text in the vector metadata. Fetching the source document at query time adds latency and a failure point; paying a few extra bytes per vector is worth it.


Full Pipeline

Python
import asyncio
from fetch import fetch_page
from extract import extract_content
from normalize import normalize_text
from chunker import Chunk, chunk_document
from embed import init_index, index_chunks

PINECONE_API_KEY = "YOUR_PINECONE_KEY"
PINECONE_INDEX = "rag-knowledge-base"


async def ingest_url(url: str, render_js: bool = False) -> dict:
    """
    End-to-end pipeline: URL → indexed, retrievable chunks.
    Returns a summary dict for logging and monitoring.
    """
    # Stage 1: Fetch. fetch_page is synchronous; run it in a worker thread so
    # it does not block the event loop during concurrent ingestion.
    html = await asyncio.to_thread(fetch_page, url, render_js=render_js)

    # Stage 2: Extract
    extracted = extract_content(html, url)
    raw_text = extracted.get("text", "")
    title = extracted.get("title", "untitled")

    if not raw_text:
        return {"url": url, "status": "no_content", "chunks": 0}

    # Stage 3: Normalize
    clean_text = normalize_text(raw_text)

    # Stage 4: Chunk
    chunks = chunk_document(
        text=clean_text,
        url=url,
        max_tokens=400,
        overlap_sentences=2,
    )

    # Filter degenerate chunks before embedding
    chunks = [c for c in chunks if len(c.text.split()) >= 15]

    # Stage 5: Embed + index
    index = init_index(PINECONE_API_KEY, PINECONE_INDEX)
    await index_chunks(chunks, index)

    return {
        "url": url,
        "title": title,
        "status": "indexed",
        "chunks": len(chunks),
        "approx_tokens": sum(len(c.text) // 4 for c in chunks),
    }


async def ingest_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Ingest multiple URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:
            try:
                return await ingest_url(url)
            except Exception as e:
                return {"url": url, "status": "error", "error": str(e)}

    return await asyncio.gather(*[bounded(u) for u in urls])


if __name__ == "__main__":
    urls = [
        "https://docs.python.org/3/library/asyncio-task.html",
        "https://platform.openai.com/docs/guides/embeddings",
        "https://www.pinecone.io/docs/upsert-data/",
    ]
    results = asyncio.run(ingest_batch(urls, concurrency=3))
    for r in results:
        print(r)

Handling Edge Cases

Deduplication

The same content appears under multiple URLs: www vs. bare domain, query parameters, pagination suffixes. Hash normalized text before indexing:

Python
import hashlib


_seen_hashes: set[str] = set()


def is_duplicate(text: str) -> bool:
    """Return True if this exact content has already been indexed this run."""
    digest = hashlib.sha256(text.encode()).hexdigest()[:16]
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False

Call is_duplicate(clean_text) after Stage 3 and skip to the next URL if it returns True.

Pagination and Crawling

For documentation sites spanning dozens of pages, discover internal links before ingesting. A simple same-domain BFS over <a href> tags prevents you from missing chapters or API reference sections. Keep a visited-URL set to avoid cycles.
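
A minimal sketch of that discovery step using only the standard library (function and class names here are illustrative, not part of the pipeline modules above):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html: str, base_url: str) -> list[str]:
    """Resolve links against base_url, keep same-domain ones, dedupe in order."""
    parser = LinkCollector()
    parser.feed(html)
    domain = urlparse(base_url).netloc
    seen: set[str] = set()
    links: list[str] = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == domain and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Feed each returned URL back into the BFS queue if it is not already in the visited set.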

Retries with Backoff

Your embedding API has rate limits even when your scraper does not. Wrap async calls in exponential backoff:

Python
import asyncio
from typing import TypeVar, Callable, Awaitable

T = TypeVar("T")


async def with_retry(fn: Callable[[], Awaitable[T]], attempts: int = 3) -> T:
    for i in range(attempts):
        try:
            return await fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(min(2 ** i, 30))
    raise RuntimeError("unreachable")

Wrap index_chunks calls: await with_retry(lambda: index_chunks(chunks, index)).


Production Checklist

Before running this at scale, verify:

  • Freshness TTL: Set an expiry on indexed documents. Re-scrape on a schedule. Stale RAG context is worse than no context—your LLM will confidently cite outdated information.
  • Minimum chunk length: Filter out chunks with fewer than 15 words. Stubs from tables or code snippets without context are noise at query time.
  • Metadata completeness: Always store scraped_at, source_url, and section_title in vector metadata. Your LLM needs these to generate citations users can verify.
  • Extraction failure rate: Monitor the share of URLs returning no_content. Above 5% means your source sites have unusual structure and need custom extraction rules.
  • Concurrency limits: Do not set concurrency above what your scraping tier supports. Queue excess work with Redis or a task runner rather than hammering with retries.
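
The freshness TTL check can be as simple as comparing a scraped_at timestamp against a cutoff (the 30-day default here is an assumption, not a recommendation from the checklist):

```python
from datetime import datetime, timedelta, timezone


def is_stale(scraped_at_iso: str, ttl_days: int = 30) -> bool:
    """True if a document's scraped_at metadata timestamp is older than the TTL."""
    scraped_at = datetime.fromisoformat(scraped_at_iso)
    return datetime.now(timezone.utc) - scraped_at > timedelta(days=ttl_days)
```

Run this over index metadata on a schedule and re-enqueue stale URLs through ingest_batch.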

Takeaway

A five-stage pipeline—fetch, extract, normalize, chunk, embed—is not over-engineering. It is the minimum required to produce input that a retrieval system can actually use.

Token waste is a symptom, not the root problem. The root problem is that HTML is a rendering format, not a content format. Every step in this pipeline exists to close that gap.

The fetching layer is where most teams cut corners and regret it. Flaky HTML from failed bot-bypass attempts or unrendered JS propagates bad data through every downstream stage. Eliminating that variable with a dedicated scraping API means your engineering time goes where it compounds: extraction heuristics, chunking strategy, and retrieval evaluation.


Frequently Asked Questions

How do I clean scraped HTML before sending it to an LLM?
Use a purpose-built extraction library like trafilatura to strip navigation, ads, and boilerplate before passing text to your model. Follow extraction with a normalization pass to handle Unicode oddities, non-breaking spaces, and redundant whitespace. Never send raw HTML directly to an embedding model.

What chunk size works best for RAG?
For most embedding models, 300–500 tokens per chunk with 1–2 sentence overlap is a reliable starting point. Smaller models (7B–13B) benefit from 256-token chunks; larger-context models like GPT-4 or Claude can handle 800–1,000 tokens without averaging away signal. Always prefer sentence-aware splitting over fixed character splits.

How do I scrape JavaScript-rendered pages?
You need a headless browser or a scraping API with rendering support. Static HTTP fetches return skeleton HTML for SPAs and React/Next.js apps, missing the actual content. Set `render_js: true` and `wait_for: networkidle` to ensure the DOM is fully populated before extraction.

How do I deduplicate scraped content?
Hash the normalized text (SHA-256, first 16 chars is enough) before indexing and maintain a seen-hashes set across your ingestion run. The same article appears at multiple URLs due to query parameters, trailing slashes, and CDN aliases—deduplication at the content level is more reliable than URL normalization.