Web Scraping Pipeline for RAG: Clean Data for LLMs
Build a 5-stage scraping pipeline that delivers token-efficient, clean text to your RAG system. Python code for extraction, chunking, and embedding included.
March 19, 2026
Raw HTML is poison for RAG. A typical news article page is 45,000 characters—roughly 11,000 tokens. The actual article is 800 words, or about 1,100 tokens. You are paying 10× to embed navigation menus, cookie banners, footer links, and inline scripts that actively dilute your embeddings and degrade retrieval quality.
The fix is a five-stage pipeline: reliable fetch → content extraction → normalization → semantic chunking → embed and index. Each stage has a single responsibility. Each failure is isolated and debuggable. This post walks through a production implementation in Python.
Pipeline Architecture
Stage 1: Reliable Fetching
The hardest part of scraping at scale is not parsing—it is getting the HTML. Bot detection blocks requests. JavaScript-rendered SPAs return skeleton HTML to static fetches. IP ranges accumulate blocks.
AlterLab's scraping API handles this in a single POST: rotating residential proxies, automatic CAPTCHA bypass, and optional headless rendering without managing a browser fleet yourself.
Python:
import httpx
ALTERLAB_API_KEY = "YOUR_API_KEY"
ALTERLAB_BASE_URL = "https://api.alterlab.io/v1"
def fetch_page(url: str, render_js: bool = False) -> str:
    """Fetch fully-rendered HTML from any URL."""
    response = httpx.post(
        f"{ALTERLAB_BASE_URL}/scrape",
        headers={"X-API-Key": ALTERLAB_API_KEY, "Content-Type": "application/json"},
        json={
            "url": url,
            "render_js": render_js,
            "wait_for": "networkidle" if render_js else None,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html"]

cURL:
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs/getting-started",
    "render_js": true,
    "wait_for": "networkidle"
  }'

Try fetching Hacker News with AlterLab and inspect the raw HTML response before extraction.
Stage 2: Content Extraction
In benchmark comparisons, trafilatura is consistently among the most accurate open-source libraries for pulling article body text from HTML. It outperforms readability-lxml and newspaper3k on structured documentation and blog content because it combines DOM heuristics with text-density scoring.
import json
import trafilatura
from trafilatura.settings import use_config
# Disable per-document timeout—let your own retry logic own the clock
config = use_config()
config.set("DEFAULT", "EXTRACTION_TIMEOUT", "0")
def extract_content(html: str, url: str) -> dict:
    """
    Extract main content from HTML.

    Returns dict with keys: text, title, author, date, description.
    Raises ValueError if no content can be extracted.
    """
    result = trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        config=config,
        output_format="json",
        with_metadata=True,
    )
    if result is None:
        raise ValueError(f"Extraction returned no content for {url}")
    return json.loads(result)

Set no_fallback=False to allow trafilatura to fall back to its secondary heuristic if the primary DOM analysis returns nothing—useful for pages with unconventional layouts.
Stage 3: Normalization
After extraction, text still contains artifacts: Unicode non-breaking spaces (\u00a0), zero-width joiners, smart quotes, triple-newline runs from CMS templates, and stub lines that are purely punctuation.
import re
import unicodedata
def normalize_text(text: str) -> str:
    # Compatibility normalization: folds ligatures, fullwidth forms, and
    # non-breaking spaces. (Curly quotes and em-dashes have no compatibility
    # decomposition, so NFKC leaves them alone; handle quotes explicitly below.)
    text = unicodedata.normalize("NFKC", text)
    # Map typographic quotes to their ASCII equivalents
    text = text.translate(str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}))
    # Strip zero-width characters and the BOM outright: replacing them
    # with spaces would split the words they sit inside
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)
    # Collapse horizontal whitespace, preserve single newlines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop lines shorter than 4 chars (nav artifacts: "›", "|", "»")
    lines = [ln for ln in text.split("\n") if len(ln.strip()) > 3 or ln.strip() == ""]
    return "\n".join(lines).strip()

This pass runs in microseconds per document and prevents garbage tokens from reaching your embedding model.
Stage 4: Chunking Strategy
Three mistakes that kill retrieval quality:
- Fixed character splits break sentences mid-clause. The embedding for a sentence fragment does not represent a complete thought.
- Whole documents as single vectors average all content into one point in embedding space. Specific queries retrieve nothing useful.
- Zero overlap means a concept bridging two chunks never matches a query that references it as a unit.
Use recursive sentence-aware chunking with configurable overlap:
from __future__ import annotations
import re
from dataclasses import dataclass, field
@dataclass
class Chunk:
    text: str
    url: str
    chunk_index: int
    total_chunks: int
    metadata: dict = field(default_factory=dict)

def split_sentences(text: str) -> list[str]:
    """Sentence-boundary split on terminal punctuation followed by uppercase."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z\"])", text)

def chunk_document(
    text: str,
    url: str,
    max_tokens: int = 400,
    overlap_sentences: int = 2,
    chars_per_token: float = 4.0,
) -> list[Chunk]:
    """
    Split text into token-bounded chunks with sentence-level overlap.

    Args:
        max_tokens: Approximate token ceiling per chunk.
        overlap_sentences: Sentences carried over to the next chunk.
        chars_per_token: Heuristic for English prose (4.0 is reliable).
    """
    max_chars = int(max_tokens * chars_per_token)
    sentences = split_sentences(text)
    raw_chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        slen = len(sentence)
        if current_len + slen > max_chars and current:
            raw_chunks.append(" ".join(current))
            current = current[-overlap_sentences:]
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += slen
    if current:
        raw_chunks.append(" ".join(current))
    total = len(raw_chunks)
    return [
        Chunk(text=t, url=url, chunk_index=i, total_chunks=total)
        for i, t in enumerate(raw_chunks)
    ]
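To see the overlap mechanics in isolation, here is a stripped-down re-statement of the accumulation loop above, kept as sentence lists rather than joined strings so the carry-over between consecutive chunks is directly visible:

```python
def chunk_sentences(sentences: list[str], max_chars: int, overlap: int) -> list[list[str]]:
    # Same logic as chunk_document: fill a chunk up to the character
    # budget, then reopen the next one with the last `overlap` sentences.
    chunks, current, current_len = [], [], 0
    for s in sentences:
        if current_len + len(s) > max_chars and current:
            chunks.append(current)
            current = current[-overlap:]
            current_len = sum(len(x) for x in current)
        current.append(s)
        current_len += len(s)
    if current:
        chunks.append(current)
    return chunks

sents = [f"Sentence number {i} ends here." for i in range(8)]
chunks = chunk_sentences(sents, max_chars=120, overlap=2)
# Each chunk reopens with the last two sentences of its predecessor,
# so a concept spanning a boundary still lands whole inside one chunk.
```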
Stage 5: Embedding and Indexing
Batch your embedding calls. The OpenAI embeddings API accepts up to 2,048 inputs per request—sending one chunk per call is 100× slower and burns rate limit quota unnecessarily.
import asyncio
from openai import AsyncOpenAI
import pinecone
openai_client = AsyncOpenAI()

def init_index(api_key: str, index_name: str) -> pinecone.Index:
    pc = pinecone.Pinecone(api_key=api_key)
    return pc.Index(index_name)

async def embed_texts(texts: list[str]) -> list[list[float]]:
    """Batch embed up to 2048 texts in a single API call."""
    response = await openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        encoding_format="float",
    )
    return [item.embedding for item in response.data]

async def index_chunks(
    chunks: list["Chunk"],
    index: pinecone.Index,
    batch_size: int = 100,
) -> None:
    """Embed and upsert chunks into Pinecone with source metadata preserved."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        vectors = await embed_texts([c.text for c in batch])
        upserts = [
            {
                "id": f"{c.url}::{c.chunk_index}",
                "values": vectors[j],
                "metadata": {
                    "url": c.url,
                    "chunk_index": c.chunk_index,
                    "total_chunks": c.total_chunks,
                    "text": c.text,  # store inline—avoids a separate fetch at query time
                },
            }
            for j, c in enumerate(batch)
        ]
        index.upsert(vectors=upserts)

Store text in the vector metadata. Fetching the source document at query time adds latency and a failure point; paying a few extra bytes per vector is worth it.
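The inline text also pays off at query time: retrieved matches can be formatted straight into an LLM prompt with no second fetch. A sketch of a hypothetical build_context helper; the match dicts mirror the metadata shape upserted above, as a Pinecone query with include_metadata=True would return them:

```python
def build_context(matches: list[dict], max_chars: int = 6000) -> str:
    # Concatenate retrieved chunks into a citation-friendly context block,
    # stopping before the character budget is exceeded.
    parts, used = [], 0
    for m in matches:
        meta = m["metadata"]
        block = f"[{meta['url']} #{meta['chunk_index']}]\n{meta['text']}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```

The bracketed url#index prefix on each chunk gives the model something concrete to cite when it answers.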
Full Pipeline
import asyncio
import httpx
from fetch import fetch_page
from extract import extract_content
from normalize import normalize_text
from chunker import Chunk, chunk_document
from embed import init_index, index_chunks
PINECONE_API_KEY = "YOUR_PINECONE_KEY"
PINECONE_INDEX = "rag-knowledge-base"
async def ingest_url(url: str, render_js: bool = False) -> dict:
    """
    End-to-end pipeline: URL → indexed, retrievable chunks.

    Returns a summary dict for logging and monitoring.
    """
    # Stage 1: Fetch
    html = fetch_page(url, render_js=render_js)

    # Stage 2: Extract
    extracted = extract_content(html, url)
    raw_text = extracted.get("text", "")
    title = extracted.get("title", "untitled")
    if not raw_text:
        return {"url": url, "status": "no_content", "chunks": 0}

    # Stage 3: Normalize
    clean_text = normalize_text(raw_text)

    # Stage 4: Chunk
    chunks = chunk_document(
        text=clean_text,
        url=url,
        max_tokens=400,
        overlap_sentences=2,
    )
    # Filter degenerate chunks before embedding
    chunks = [c for c in chunks if len(c.text.split()) >= 15]

    # Stage 5: Embed + index
    index = init_index(PINECONE_API_KEY, PINECONE_INDEX)
    await index_chunks(chunks, index)

    return {
        "url": url,
        "title": title,
        "status": "indexed",
        "chunks": len(chunks),
        "approx_tokens": sum(len(c.text) // 4 for c in chunks),
    }

async def ingest_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Ingest multiple URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:
            try:
                return await ingest_url(url)
            except Exception as e:
                return {"url": url, "status": "error", "error": str(e)}

    return await asyncio.gather(*[bounded(u) for u in urls])

if __name__ == "__main__":
    urls = [
        "https://docs.python.org/3/library/asyncio-task.html",
        "https://platform.openai.com/docs/guides/embeddings",
        "https://www.pinecone.io/docs/upsert-data/",
    ]
    results = asyncio.run(ingest_batch(urls, concurrency=3))
    for r in results:
        print(r)

Handling Edge Cases
Deduplication
The same content appears under multiple URLs: www vs. bare domain, query parameters, pagination suffixes. Hash normalized text before indexing:
import hashlib
_seen_hashes: set[str] = set()
def is_duplicate(text: str) -> bool:
    """Return True if this exact content has already been indexed this run."""
    digest = hashlib.sha256(text.encode()).hexdigest()[:16]
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False

Call is_duplicate(clean_text) after Stage 3 and skip to the next URL if it returns True.
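Content hashing catches exact duplicates after extraction; canonicalizing URLs before fetching avoids paying for the duplicate request in the first place. A sketch using only the standard library; the tracking-parameter list and the www-stripping (which assumes both hosts serve identical content) are illustrative choices, not a standard:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}

def canonicalize_url(url: str) -> str:
    # Lowercase the host, drop fragments and tracking params, strip a
    # leading "www.", and sort the remaining query params so equivalent
    # URLs map to the same string (and the same visited-set key).
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower().removeprefix("www.")
    params = sorted((k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS)
    return urlunsplit((scheme.lower(), netloc, path.rstrip("/") or "/", urlencode(params), ""))
```

Run URLs through this before both the visited-set check and the fetch, so www/bare-domain and parameter-order variants collapse to one entry.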
Pagination and Crawling
For documentation sites spanning dozens of pages, discover internal links before ingesting. A simple same-domain BFS over <a href> tags prevents you from missing chapters or API reference sections. Keep a visited-URL set to avoid cycles.
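A minimal sketch of that BFS using only the standard library; fetch is any callable returning HTML, so you can pass fetch_page from Stage 1. The max_pages cap and fragment-stripping are illustrative defaults:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkParser(HTMLParser):
    """Collect raw href values from <a> tags."""
    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def same_domain_links(html: str, base_url: str) -> list[str]:
    # Resolve relative hrefs against the page URL, drop fragments,
    # and keep only links on the same host.
    parser = LinkParser()
    parser.feed(html)
    host = urlsplit(base_url).netloc
    seen: set[str] = set()
    links: list[str] = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href.split("#")[0])
        if urlsplit(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links

def crawl(start_url: str, fetch, max_pages: int = 50) -> list[str]:
    # BFS with a visited set to avoid cycles; returns URLs in crawl order.
    visited: set[str] = set()
    queue = deque([start_url])
    order: list[str] = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in same_domain_links(fetch(url), url):
            if link not in visited:
                queue.append(link)
    return order
```

Feed the returned URL list into ingest_batch; same-host filtering keeps the crawl from wandering onto external domains.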
Retries with Backoff
Your embedding API has rate limits even when your scraper does not. Wrap async calls in exponential backoff:
import asyncio
from typing import TypeVar, Callable, Awaitable
T = TypeVar("T")
async def with_retry(fn: Callable[[], Awaitable[T]], attempts: int = 3) -> T:
    for i in range(attempts):
        try:
            return await fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(min(2 ** i, 30))
    raise RuntimeError("unreachable")

Wrap index_chunks calls: await with_retry(lambda: index_chunks(chunks, index)).
Production Checklist
Before running this at scale, verify:
- Freshness TTL: Set an expiry on indexed documents. Re-scrape on a schedule. Stale RAG context is worse than no context—your LLM will confidently cite outdated information.
- Minimum chunk length: Filter out chunks with fewer than 15 words. Stubs from tables or code snippets without context are noise at query time.
- Metadata completeness: Always store scraped_at, source_url, and section_title in vector metadata. Your LLM needs these to generate citations users can verify.
- Extraction failure rate: Monitor the share of URLs returning no_content. Above 5% means your source sites have unusual structure and need custom extraction rules.
- Concurrency limits: Do not set concurrency above what your scraping tier supports. Queue excess work with Redis or a task runner rather than hammering with retries.
Takeaway
A five-stage pipeline—fetch, extract, normalize, chunk, embed—is not over-engineering. It is the minimum required to produce input that a retrieval system can actually use.
Token waste is a symptom, not the root problem. The root problem is that HTML is a rendering format, not a content format. Every step in this pipeline exists to close that gap.
The fetching layer is where most teams cut corners and regret it. Flaky HTML from failed bot-bypass attempts or unrendered JS propagates bad data through every downstream stage. Eliminating that variable with a dedicated scraping API means your engineering time goes where it compounds: extraction heuristics, chunking strategy, and retrieval evaluation.