Web Scraping Pipeline for LLM & RAG: Clean Markdown
Build a cost-effective web scraping pipeline that outputs clean markdown for LLM and RAG apps. Covers anti-bot bypass, heading-aware chunking, and ETag caching.
March 25, 2026
The biggest quality problem in RAG pipelines isn't the embedding model or the vector store — it's the input data. Raw HTML fed into a chunker produces token-heavy garbage: navigation menus, cookie banners, inline styles, and <script> blocks that dilute every embedding you generate. Clean markdown eliminates 60–80% of that noise before a single token reaches your LLM.
This post walks through a production-ready pipeline: fetch pages with bot-bypass-aware scraping, convert to structured markdown, chunk on heading boundaries, and cache aggressively to control cost.
Why Markdown Beats Raw HTML for LLM Inputs
A typical documentation page runs 8,000+ tokens as raw HTML and around 1,200 tokens as clean markdown. That gap matters at three stages:
- Chunking: HTML chunkers split on character count, slicing mid-tag, mid-sentence, and mid-function. Markdown respects the document's own semantic structure.
- Retrieval precision: Boilerplate (<nav>, <footer>, repeated header text) bleeds into embedding space and degrades cosine similarity scores on meaningful content.
- LLM context windows: Smaller, cleaner chunks mean more retrieved context fits in the prompt window without exceeding token limits.
The fix is to request markdown output at the fetch layer, not post-process HTML downstream.
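The gap is visible even at toy scale. Here the same section is expressed as HTML and as markdown (both snippets invented for illustration; real pages pile navigation, scripts, and styles on top of this baseline markup tax):

```python
# Toy illustration: identical content as raw HTML vs clean markdown.
html = (
    '<div class="content"><h2 id="auth">Authentication</h2>'
    '<p>Send the <code>X-API-Key</code> header with every request.</p></div>'
)
md = "## Authentication\n\nSend the `X-API-Key` header with every request.\n"

# The markup alone nearly doubles the size, before any boilerplate is counted.
print(len(html), len(md))
```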
Pipeline Architecture
Step 1: Fetching Pages with Anti-Bot Bypass
Most production scraping targets — documentation sites, knowledge bases, e-commerce, news publishers — run Cloudflare or similar bot detection. A raw requests.get() returns a 403 or a JS challenge page, neither of which contains your content.
The AlterLab anti-bot bypass API handles Cloudflare, Datadome, and CAPTCHA challenges transparently. You send a URL and receive content. No fingerprint maintenance, no proxy rotation code, no challenge-solving logic on your end.
Here's the cURL equivalent to verify the endpoint before writing any application code:
```shell
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/reference/api",
    "output_format": "markdown",
    "js_render": true
  }'
```

The output_format: "markdown" parameter is the critical lever. Instead of receiving an HTML blob, you get a pre-processed markdown document with headings, fenced code blocks, and lists intact — ready for a splitter.
Try scraping a documentation page and see the clean markdown output from AlterLab
Step 2: The Full Python Pipeline
The Python SDK ships with a batteries-included client. Install dependencies, then wire up the complete ingest flow:
```shell
pip install alterlab langchain-text-splitters langchain-openai chromadb tiktoken
```

```python
import os

import alterlab
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
import chromadb

# Initialize clients
scraper = alterlab.Client(os.environ["ALTERLAB_API_KEY"])
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # 5× cheaper than ada-002
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("rag_docs")  # persistent vector store


def fetch_as_markdown(url: str) -> str:
    """Fetch URL and return clean markdown. Handles JS, bot challenges, redirects."""
    response = scraper.scrape(  # anti-bot bypass + JS render
        url=url,
        output_format="markdown",
        js_render=True,
        wait_for_selector="main, article, [role='main']",  # target content, not chrome
        timeout=30,
    )
    if not response.success:
        raise RuntimeError(f"Scrape failed [{response.status_code}]: {url}")
    return response.text


def chunk_markdown(markdown: str, source_url: str) -> list[dict]:
    """
    Split on heading boundaries, not character count.
    Preserves the semantic unit of each documentation section.
    """
    splitter = MarkdownHeaderTextSplitter(  # respects document hierarchy
        headers_to_split_on=[
            ("#", "h1"),
            ("##", "h2"),
            ("###", "h3"),
        ],
        strip_headers=False,  # keep heading in chunk text
    )
    docs = splitter.split_text(markdown)
    return [
        {"content": doc.page_content, "metadata": {**doc.metadata, "source_url": source_url}}
        for doc in docs
        if len(doc.page_content.strip()) > 80  # discard stub/empty sections
    ]


def ingest_url(url: str) -> int:
    """Full pipeline: fetch → chunk → embed → upsert. Returns new chunk count."""
    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)
    texts = [c["content"] for c in chunks]  # extract text for batch embed
    vectors = embeddings.embed_documents(texts)  # single batched API call
    ids = [f"{url}#{i}" for i in range(len(chunks))]
    collection.upsert(  # idempotent — safe to re-run
        ids=ids,
        embeddings=vectors,
        documents=texts,
        metadatas=[c["metadata"] for c in chunks],
    )
    return len(chunks)


def query(question: str, n_results: int = 5) -> list[dict]:
    """Retrieve top-k chunks by cosine similarity."""
    q_vector = embeddings.embed_query(question)
    results = collection.query(query_embeddings=[q_vector], n_results=n_results)
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


if __name__ == "__main__":
    urls = [
        "https://docs.example.com/reference/authentication",
        "https://docs.example.com/reference/rate-limiting",
        "https://docs.example.com/guides/quickstart",
    ]
    for url in urls:
        n = ingest_url(url)
        print(f"Ingested {n} chunks ← {url}")

    hits = query("How do I authenticate API requests?")
    for h in hits:
        print(f"[{h['distance']:.3f}] {h['metadata'].get('h2', '')} — {h['text'][:100]}...")
```

Step 3: Chunking Strategy
Character-count chunking — RecursiveCharacterTextSplitter with chunk_size=1000 — is fine for prose but breaks code-heavy documentation mid-function and splits conceptually related content across chunk boundaries. The right splitter depends on content type.
For documentation ingestion, heading-based splitting is the default choice. Add a RecursiveCharacterTextSplitter as a fallback to cap any single chunk at 6,000 tokens — some reference pages have multi-page sections under a single heading.
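The heading-first-plus-fallback idea can be sketched in plain Python. This is a simplified stand-in for the MarkdownHeaderTextSplitter flow, assuming the rough heuristic of 4 characters per token (so the 6,000-token cap becomes 24,000 characters):

```python
import re

def split_by_headings(md: str, max_chars: int = 24_000) -> list[str]:
    """Split markdown on h1-h3 boundaries, then cap oversized sections.

    max_chars ≈ 6,000 tokens at the rough 4-chars-per-token heuristic.
    """
    # Split immediately before each #, ##, or ### heading line.
    sections = re.split(r"(?m)^(?=#{1,3} )", md)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Fallback: hard-wrap any single section that exceeds the cap.
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        chunks.append(section)
    return chunks

doc = "# API\nIntro.\n## Auth\nUse the X-API-Key header.\n## Errors\n" + "x" * 50_000
chunks = split_by_headings(doc)
print(len(chunks))  # → 5: two small sections, plus the oversized one split in three
```

A production version would wrap at sentence or paragraph boundaries rather than a hard character index, but the control flow is the same: structure first, size cap second.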
Step 4: Cost Optimization with Caching
The main cost levers are scraping requests, embedding tokens, and storage writes. All of them can be dramatically reduced with two caching layers:
Layer 1 — URL-level ETag caching: Most documentation and knowledge-base content is stable. Store (url → etag) after each fetch. On subsequent runs, issue a HEAD request first; if the ETag or Last-Modified header is unchanged, skip the scrape entirely.
Layer 2 — Chunk-level deduplication: Before embedding, SHA-256 hash each chunk's text. Check the vector store for that ID. If it exists, skip the embed call. This is the bigger cost saver for pipelines that re-ingest on a schedule.
```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path(".scrape_cache.json")


def load_cache() -> dict:
    """Load persisted ETag map from disk."""
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())  # url → etag mapping
    return {}


def save_cache(cache: dict) -> None:
    CACHE_FILE.write_text(json.dumps(cache, indent=2))


def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]


def ingest_url_cached(url: str) -> int:
    cache = load_cache()
    head = scraper.head(url)  # lightweight ETag check
    etag = head.headers.get("etag") or head.headers.get("last-modified", "")
    if cache.get(url) == etag and etag:
        print(f"Skip {url} — content unchanged")
        return 0
    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)
    new_chunks = []
    for chunk in chunks:
        h = chunk_hash(chunk["content"])
        existing = collection.get(ids=[h])  # check store before embedding
        if not existing["ids"]:
            new_chunks.append((h, chunk))
    if new_chunks:
        ids, texts, metas = zip(*[
            (h, c["content"], c["metadata"]) for h, c in new_chunks
        ])
        vectors = embeddings.embed_documents(list(texts))  # embed only new chunks
        collection.upsert(
            ids=list(ids),
            embeddings=vectors,
            documents=list(texts),
            metadatas=list(metas),
        )
    cache[url] = etag
    save_cache(cache)
    return len(new_chunks)
```

Step 5: Scaling with a Task Queue
For single-user tooling, the synchronous ingest above is sufficient. For pipelines ingesting thousands of URLs on a schedule, parallelize with Celery and Redis:
```python
from celery import Celery, group

from cached_ingest import ingest_url_cached

app = Celery("rag_ingest", broker="redis://localhost:6379/0")


@app.task(bind=True, max_retries=3, default_retry_delay=30)  # retry transient 429/503
def ingest_task(self, url: str) -> dict:
    try:
        count = ingest_url_cached(url)
        return {"url": url, "new_chunks": count}
    except Exception as exc:
        raise self.retry(exc=exc)  # re-queue with the 30 s delay, up to 3 times


# Dispatch a full sitemap (parse_sitemap is your sitemap-parsing helper)
urls = parse_sitemap("https://docs.example.com/sitemap.xml")
job = group(ingest_task.s(url) for url in urls)
result = job.apply_async()
```

Keep worker concurrency aligned to your scraping API plan. AlterLab's pricing plans scale with concurrent connections — over-parallelizing wastes retries; under-parallelizing wastes wall time.
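For smaller runs that don't justify a Celery deployment, the same concurrency cap can be enforced with a thread pool sized to the plan limit. A minimal sketch, where the worker function you'd actually pass is ingest_url_cached from Step 4:

```python
from concurrent.futures import ThreadPoolExecutor

PLAN_CONCURRENCY = 5  # match your API tier's concurrent-connection limit

def ingest_all(urls: list[str], ingest) -> dict[str, int]:
    """Run ingest over urls with at most PLAN_CONCURRENCY requests in flight."""
    with ThreadPoolExecutor(max_workers=PLAN_CONCURRENCY) as pool:
        return dict(zip(urls, pool.map(ingest, urls)))

# Demo with a stand-in worker; in practice: ingest_all(urls, ingest_url_cached)
print(ingest_all(["a", "bb"], len))  # → {'a': 1, 'bb': 2}
```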
Handling Edge Cases
JavaScript-heavy SPAs: Set js_render: true and use wait_for_selector targeting the content container (main, article, [role="main"]). Waiting on body fires before React or Vue hydrates the actual content.
Pagination: After fetching, parse rel="next" link tags from the response metadata and enqueue subsequent pages in the same Celery task group. Store (canonical_url, page_number) as the vector ID to avoid collisions.
PDF and binary content: Check the response's content_type field before processing. If it's not text/html, route to a dedicated PDF extraction path (pdfplumber, pymupdf) rather than the markdown pipeline.
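The routing decision itself is a few lines. The path names below are illustrative labels, not real modules:

```python
def route(content_type: str) -> str:
    """Dispatch on the response's content_type field."""
    if content_type.startswith("text/html"):
        return "markdown_pipeline"
    if content_type.startswith("application/pdf"):
        return "pdf_extractor"  # hand off to pdfplumber / pymupdf
    return "skip"  # images, archives, etc. have no text to embed

print(route("text/html; charset=utf-8"))  # → markdown_pipeline
print(route("application/pdf"))           # → pdf_extractor
```

Note the startswith check: real servers append charset parameters to the media type, so an exact string comparison against "text/html" would silently skip most pages.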
Oversized sections: A single API reference page converted to markdown can still produce chunks exceeding the 8,192-token embedding limit. Add a token-count guard after heading splitting and re-chunk any oversized section with RecursiveCharacterTextSplitter(chunk_size=6000) as a fallback.
Dynamic content that isn't indexed: Some content (login-gated pages, single-page apps loading data via authenticated XHR) won't yield useful markdown regardless of JS rendering. Identify these early in the pipeline and route them to session-based scraping or data-export APIs where they exist.
Choosing an Embedding Model
The embedding model choice has a larger cost impact than most engineers expect:
| Model | Dimensions | Cost / 1M tokens | MTEB Score |
|---|---|---|---|
| text-embedding-ada-002 | 1,536 | $0.10 | 61.0 |
| text-embedding-3-small | 1,536 | $0.02 | 62.3 |
| text-embedding-3-large | 3,072 | $0.13 | 64.6 |
| nomic-embed-text (local) | 768 | $0.00 | 62.4 |
| BGE-M3 (local) | 1,024 | $0.00 | 63.8 |
For most RAG use cases, text-embedding-3-small matches or beats ada-002 at 5× lower cost. For air-gapped or high-volume deployments, a locally-hosted model eliminates per-token costs entirely at the price of infrastructure overhead.
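A back-of-envelope calculation using the prices in the table above makes the trade-off concrete:

```python
# Monthly embedding spend at a given ingest volume (prices from the table).
COST_PER_M_TOKENS = {
    "text-embedding-ada-002": 0.10,
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def monthly_cost(model: str, tokens_per_month: int) -> float:
    """USD per month for a given model and token volume."""
    return COST_PER_M_TOKENS[model] * tokens_per_month / 1_000_000

# At the ~500M tokens/month threshold mentioned below:
print(monthly_cost("text-embedding-3-small", 500_000_000))  # → 10.0
print(monthly_cost("text-embedding-ada-002", 500_000_000))  # → 50.0
```

At these rates, hosted text-embedding-3-small stays cheap well past most teams' volumes; the local-model case is driven less by the API bill than by data-residency and infrastructure preferences.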
Takeaways
- Request markdown at the fetch layer. Post-processing HTML is expensive and lossy; let the scraping API handle conversion before the content hits your pipeline.
- Split on structure, not character count. MarkdownHeaderTextSplitter preserves the semantic unit of documentation. Add a token-limit fallback for oversized sections.
- Cache at two levels. ETag-based URL caching prevents unnecessary re-scraping; chunk-level SHA-256 deduplication prevents unnecessary re-embedding. Together they cut ongoing costs by 90%+ on stable knowledge bases.
- Match worker concurrency to your API tier. Excess parallelism hits rate limits and burns retries; it does not improve throughput.
- Pick the right embedding model. text-embedding-3-small is the default right choice for hosted inference. Local BGE-M3 or nomic-embed-text is the right choice above ~500M tokens/month.