
Web Scraping Pipeline for RAG: Clean Data for LLMs
Build a 5-stage scraping pipeline that delivers token-efficient, clean text to your RAG system. Python code for extraction, chunking, and embedding included.
March 19, 2026
Raw HTML is poison for RAG. A typical news article page is 45,000 characters—roughly 11,000 tokens. The actual article is 800 words, or about 1,100 tokens. You are paying 10× to embed navigation menus, cookie banners, footer links, and inline scripts that actively dilute your embeddings and degrade retrieval quality.
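The 10× figure is easy to sanity-check with the usual chars-per-token heuristic (~4 characters per token for English; the exact ratio varies by tokenizer, and the 5.5 chars/word figure below is an assumption):

```python
# Back-of-envelope token math behind the 10x figure (~4 chars per token).
raw_html_chars = 45_000
article_chars = 800 * 5.5          # ~800 words at roughly 5.5 chars/word incl. spaces
raw_tokens = raw_html_chars / 4
article_tokens = article_chars / 4
print(round(raw_tokens), round(article_tokens), round(raw_tokens / article_tokens, 1))
# → 11250 1100 10.2
```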
The fix is a five-stage pipeline: reliable fetch → content extraction → normalization → semantic chunking → embed and index. Each stage has a single responsibility. Each failure is isolated and debuggable. This post walks through a production implementation in Python.
Pipeline Architecture
Stage 1: Reliable Fetching
The hardest part of scraping at scale is not parsing—it is getting the HTML. Bot detection blocks requests. JavaScript-rendered SPAs return skeleton HTML to static fetches. IP ranges accumulate blocks.
AlterLab's scraping API handles this in a single POST: rotating residential proxies, automatic CAPTCHA bypass, and optional headless rendering without managing a browser fleet yourself.
Python:

```python
import httpx

ALTERLAB_API_KEY = "YOUR_API_KEY"
ALTERLAB_BASE_URL = "https://api.alterlab.io/v1"


def fetch_page(url: str, render_js: bool = False) -> str:
    """Fetch fully-rendered HTML from any URL."""
    response = httpx.post(
        f"{ALTERLAB_BASE_URL}/scrape",
        headers={"X-API-Key": ALTERLAB_API_KEY, "Content-Type": "application/json"},
        json={
            "url": url,
            "render_js": render_js,
            "wait_for": "networkidle" if render_js else None,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["html"]
```

cURL:

```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs/getting-started",
    "render_js": true,
    "wait_for": "networkidle"
  }'
```

Try fetching Hacker News with AlterLab and inspect the raw HTML response before extraction.
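One failure mode worth guarding against here: a static fetch of a JavaScript-heavy SPA returns HTTP 200 with a near-empty shell, and everything downstream silently indexes nothing. A cheap text-density check can decide when to retry with render_js=True. This is a sketch; the 200-character threshold and the tag-stripping regex are assumptions to tune per source:

```python
import re

def looks_like_skeleton(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: SPA shells carry markup but almost no visible text."""
    # Strip script/style blocks first, then any remaining tags
    text = re.sub(r"<script[\s\S]*?</script>|<style[\s\S]*?</style>|<[^>]+>", " ", html)
    return len(re.sub(r"\s+", "", text)) < min_text_chars

# Usage sketch: if a static fetch looks like a shell, refetch with rendering:
#   html = fetch_page(url)
#   if looks_like_skeleton(html):
#       html = fetch_page(url, render_js=True)
print(looks_like_skeleton('<div id="root"></div><script>window.app()</script>'))  # → True
```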
Stage 2: Content Extraction
trafilatura is the most accurate open-source library for pulling article body text from HTML. It outperforms readability-lxml and newspaper3k on structured documentation and blog content because it uses both DOM heuristics and text-density scoring.
```python
import json

import trafilatura
from trafilatura.settings import use_config

# Disable per-document timeout: let your own retry logic own the clock
config = use_config()
config.set("DEFAULT", "EXTRACTION_TIMEOUT", "0")


def extract_content(html: str, url: str) -> dict:
    """
    Extract main content from HTML.

    Returns dict with keys: text, title, author, date, description.
    Raises ValueError if no content can be extracted.
    """
    result = trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        config=config,
        output_format="json",
        with_metadata=True,
    )
    if result is None:
        raise ValueError(f"Extraction returned no content for {url}")
    return json.loads(result)
```

Set no_fallback=False to allow trafilatura to fall back to its secondary heuristic if the primary DOM analysis returns nothing; this is useful for pages with unconventional layouts.
Stage 3: Normalization
After extraction, text still contains artifacts: Unicode non-breaking spaces (\u00a0), zero-width joiners, smart quotes, triple-newline runs from CMS templates, and stub lines that are purely punctuation.
```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    # NFKC folds compatibility characters: ligatures, fullwidth forms, NBSP.
    # (It does not straighten smart quotes; map those explicitly if needed.)
    text = unicodedata.normalize("NFKC", text)
    # Replace any remaining invisible/non-breaking whitespace variants
    text = re.sub(r"[\u00a0\u200b\u200c\u200d\ufeff]", " ", text)
    # Collapse horizontal whitespace, preserve single newlines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop lines shorter than 4 chars (nav artifacts: "›", "|", "»")
    lines = [ln for ln in text.split("\n") if len(ln.strip()) > 3 or ln.strip() == ""]
    return "\n".join(lines).strip()
```

This pass runs in microseconds per document and prevents garbage tokens from reaching your embedding model.
Stage 4: Chunking Strategy
Three mistakes that kill retrieval quality:
- Fixed character splits break sentences mid-clause. The embedding for a sentence fragment does not represent a complete thought.
- Whole documents as single vectors average all content into one point in embedding space. Specific queries retrieve nothing useful.
- Zero overlap means a concept bridging two chunks never matches a query that references it as a unit.
Use recursive sentence-aware chunking with configurable overlap:
```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    url: str
    chunk_index: int
    total_chunks: int
    metadata: dict = field(default_factory=dict)


def split_sentences(text: str) -> list[str]:
    """Sentence-boundary split on terminal punctuation followed by uppercase."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z\"])", text)


def chunk_document(
    text: str,
    url: str,
    max_tokens: int = 400,
    overlap_sentences: int = 2,
    chars_per_token: float = 4.0,
) -> list[Chunk]:
    """
    Split text into token-bounded chunks with sentence-level overlap.

    Args:
        max_tokens: Approximate token ceiling per chunk.
        overlap_sentences: Sentences carried over to the next chunk.
        chars_per_token: Heuristic for English prose (4.0 is reliable).
    """
    max_chars = int(max_tokens * chars_per_token)
    sentences = split_sentences(text)
    raw_chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        slen = len(sentence)
        if current_len + slen > max_chars and current:
            raw_chunks.append(" ".join(current))
            # Guard the zero-overlap case: list[-0:] would copy the whole list
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += slen
    if current:
        raw_chunks.append(" ".join(current))
    total = len(raw_chunks)
    return [
        Chunk(text=t, url=url, chunk_index=i, total_chunks=total)
        for i, t in enumerate(raw_chunks)
    ]
```
Stage 5: Embedding and Indexing
Batch your embedding calls. The OpenAI embeddings API accepts up to 2,048 inputs per request—sending one chunk per call is 100× slower and burns rate limit quota unnecessarily.
```python
from openai import AsyncOpenAI
import pinecone

openai_client = AsyncOpenAI()


def init_index(api_key: str, index_name: str) -> pinecone.Index:
    pc = pinecone.Pinecone(api_key=api_key)
    return pc.Index(index_name)


async def embed_texts(texts: list[str]) -> list[list[float]]:
    """Batch embed up to 2048 texts in a single API call."""
    response = await openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        encoding_format="float",
    )
    return [item.embedding for item in response.data]


async def index_chunks(
    chunks: list["Chunk"],
    index: pinecone.Index,
    batch_size: int = 100,
) -> None:
    """Embed and upsert chunks into Pinecone with source metadata preserved."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        vectors = await embed_texts([c.text for c in batch])
        upserts = [
            {
                "id": f"{c.url}::{c.chunk_index}",
                "values": vectors[j],
                "metadata": {
                    "url": c.url,
                    "chunk_index": c.chunk_index,
                    "total_chunks": c.total_chunks,
                    "text": c.text,  # store inline: avoids a separate fetch at query time
                },
            }
            for j, c in enumerate(batch)
        ]
        index.upsert(vectors=upserts)
```

Store text in the vector metadata. Fetching the source document at query time adds latency and a failure point; paying a few extra bytes per vector is worth it.
Full Pipeline
```python
import asyncio

from fetch import fetch_page
from extract import extract_content
from normalize import normalize_text
from chunker import chunk_document
from embed import init_index, index_chunks

PINECONE_API_KEY = "YOUR_PINECONE_KEY"
PINECONE_INDEX = "rag-knowledge-base"


async def ingest_url(url: str, render_js: bool = False) -> dict:
    """
    End-to-end pipeline: URL → indexed, retrievable chunks.
    Returns a summary dict for logging and monitoring.
    """
    # Stage 1: Fetch
    html = fetch_page(url, render_js=render_js)

    # Stage 2: Extract
    extracted = extract_content(html, url)
    raw_text = extracted.get("text", "")
    title = extracted.get("title", "untitled")
    if not raw_text:
        return {"url": url, "status": "no_content", "chunks": 0}

    # Stage 3: Normalize
    clean_text = normalize_text(raw_text)

    # Stage 4: Chunk
    chunks = chunk_document(
        text=clean_text,
        url=url,
        max_tokens=400,
        overlap_sentences=2,
    )
    # Filter degenerate chunks before embedding
    chunks = [c for c in chunks if len(c.text.split()) >= 15]

    # Stage 5: Embed + index
    index = init_index(PINECONE_API_KEY, PINECONE_INDEX)
    await index_chunks(chunks, index)

    return {
        "url": url,
        "title": title,
        "status": "indexed",
        "chunks": len(chunks),
        "approx_tokens": sum(len(c.text) // 4 for c in chunks),
    }


async def ingest_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Ingest multiple URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:
            try:
                return await ingest_url(url)
            except Exception as e:
                return {"url": url, "status": "error", "error": str(e)}

    return await asyncio.gather(*[bounded(u) for u in urls])


if __name__ == "__main__":
    urls = [
        "https://docs.python.org/3/library/asyncio-task.html",
        "https://platform.openai.com/docs/guides/embeddings",
        "https://www.pinecone.io/docs/upsert-data/",
    ]
    results = asyncio.run(ingest_batch(urls, concurrency=3))
    for r in results:
        print(r)
```

Handling Edge Cases
Deduplication
The same content appears under multiple URLs: www vs. bare domain, query parameters, pagination suffixes. Hash normalized text before indexing:
```python
import hashlib

_seen_hashes: set[str] = set()


def is_duplicate(text: str) -> bool:
    """Return True if this exact content has already been indexed this run."""
    digest = hashlib.sha256(text.encode()).hexdigest()[:16]
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False
```

Call is_duplicate(clean_text) after Stage 3 and skip to the next URL if it returns True.
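Hashing catches exact duplicates, but it helps to canonicalize URLs first so www/bare-domain and tracking-parameter variants collapse to one key. A sketch; the tracking-parameter list is an example set, not exhaustive:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}

def canonicalize_url(url: str) -> str:
    """Collapse common duplicate-URL variants before hashing or visited-set checks."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    # Drop tracking params, sort the rest so param order does not matter
    query = urlencode(
        sorted((k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS)
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, host, path, query, ""))

print(canonicalize_url("https://www.example.com/docs/?utm_source=x&b=2&a=1"))
# → https://example.com/docs?a=1&b=2
```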
Pagination and Crawling
For documentation sites spanning dozens of pages, discover internal links before ingesting. A simple same-domain BFS over <a href> tags prevents you from missing chapters or API reference sections. Keep a visited-URL set to avoid cycles.
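A minimal sketch of that same-domain BFS, with the fetcher injected so it can sit on top of fetch_page; the stdlib HTMLParser link extraction and the max_pages cap are illustrative choices, not a prescribed design:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkParser(HTMLParser):
    """Collect raw href values from <a> tags."""
    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def discover_urls(start_url: str, fetch_html, max_pages: int = 50) -> list[str]:
    """Same-domain BFS; fetch_html(url) -> str is injected (e.g. fetch_page)."""
    domain = urlsplit(start_url).netloc
    queue, seen, order = deque([start_url]), {start_url}, []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(fetch_html(url))
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]  # resolve relative links, drop fragments
            if urlsplit(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order
```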
Retries with Backoff
Your embedding API has rate limits even when your scraper does not. Wrap async calls in exponential backoff:
```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retry(fn: Callable[[], Awaitable[T]], attempts: int = 3) -> T:
    for i in range(attempts):
        try:
            return await fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(min(2 ** i, 30))
    raise RuntimeError("unreachable")
```

Wrap index_chunks calls: await with_retry(lambda: index_chunks(chunks, index)).
Production Checklist
Before running this at scale, verify:
- Freshness TTL: Set an expiry on indexed documents. Re-scrape on a schedule. Stale RAG context is worse than no context—your LLM will confidently cite outdated information.
- Minimum chunk length: Filter out chunks with fewer than 15 words. Stubs from tables or code snippets without context are noise at query time.
- Metadata completeness: Always store scraped_at, source_url, and section_title in vector metadata. Your LLM needs these to generate citations users can verify.
- Extraction failure rate: Monitor the share of URLs returning no_content. Above 5% means your source sites have unusual structure and need custom extraction rules.
- Concurrency limits: Do not set concurrency above what your scraping tier supports. Queue excess work with Redis or a task runner rather than hammering with retries.
Takeaway
A five-stage pipeline—fetch, extract, normalize, chunk, embed—is not over-engineering. It is the minimum required to produce input that a retrieval system can actually use.
Token waste is a symptom, not the root problem. The root problem is that HTML is a rendering format, not a content format. Every step in this pipeline exists to close that gap.
The fetching layer is where most teams cut corners and regret it. Flaky HTML from failed bot-bypass attempts or unrendered JS propagates bad data through every downstream stage. Eliminating that variable with a dedicated scraping API means your engineering time goes where it compounds: extraction heuristics, chunking strategy, and retrieval evaluation.