
Web Scraping Pipeline for LLM & RAG: Clean Markdown
Build a cost-effective web scraping pipeline that outputs clean markdown for LLM and RAG apps. Covers anti-bot bypass, heading-aware chunking, and ETag caching.
The biggest quality problem in RAG pipelines isn't the embedding model or the vector store; it's the input data. Raw HTML fed into a chunker produces token-heavy garbage: navigation menus, cookie banners, inline styles, and <script> blocks that dilute every embedding you generate. Clean markdown eliminates 60–80% of that noise before a single token reaches your LLM.
This post walks through a production-ready pipeline: fetch pages with bot-bypass-aware scraping, convert to structured markdown, chunk on heading boundaries, and cache aggressively to control cost.
Why Markdown Beats Raw HTML for LLM Inputs
A typical documentation page runs 8,000+ tokens as raw HTML and around 1,200 tokens as clean markdown. That gap matters at three stages:
- Chunking: HTML chunkers split on character count, slicing mid-tag, mid-sentence, and mid-function. Markdown respects the document's own semantic structure.
- Retrieval precision: Boilerplate (<nav>, <footer>, repeated header text) bleeds into embedding space and degrades cosine similarity scores on meaningful content.
- LLM context windows: Smaller, cleaner chunks mean more retrieved context fits in the prompt window without exceeding token limits.
The fix is to request markdown output at the fetch layer, not post-process HTML downstream.
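To gauge how much of a given page is boilerplate before committing to a pipeline change, a rough check can be run on saved HTML. The sketch below uses only the standard library; the `NOISE_TAGS` list, the `ContentText` class, and the `noise_ratio` helper are illustrative names chosen here, not part of any SDK, and character counts are only a proxy for token counts.

```python
from html.parser import HTMLParser

# Tags whose text rarely belongs in a RAG chunk (illustrative, not exhaustive).
NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


class ContentText(HTMLParser):
    """Collect text that sits outside the noise tags listed above."""

    def __init__(self) -> None:
        super().__init__()
        self.noise_depth = 0  # greater than 0 while inside any noise tag
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        if self.noise_depth == 0 and data.strip():
            self.parts.append(data.strip())


def noise_ratio(html: str) -> float:
    """Fraction of the raw HTML's characters that is not content text."""
    parser = ContentText()
    parser.feed(html)
    parser.close()
    content_chars = sum(len(p) for p in parser.parts)
    return 1 - content_chars / len(html) if html else 0.0


sample = (
    "<html><head><style>body{margin:0}</style></head><body>"
    "<nav><a href='/'>Home</a><a href='/docs'>Docs</a></nav>"
    "<main><h1>Authentication</h1><p>Send the key in a header.</p></main>"
    "<footer>Copyright Example Inc.</footer>"
    "<script>track('pageview')</script></body></html>"
)
print(f"{noise_ratio(sample):.0%} of the sample is markup or boilerplate")
```

This is a measurement aid only. It does not replace requesting markdown at the fetch layer, and a real page will have noise that a fixed tag list misses.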
Pipeline Architecture
Step 1: Fetching Pages with Anti-Bot Bypass
Most production scraping targets — documentation sites, knowledge bases, e-commerce, news publishers — run Cloudflare or similar bot detection. A raw requests.get() returns a 403 or a JS challenge page, neither of which contains your content.
The AlterLab anti-bot bypass API handles Cloudflare, Datadome, and CAPTCHA challenges transparently. You send a URL and receive content. No fingerprint maintenance, no proxy rotation code, no challenge-solving logic on your end.
Here's the cURL equivalent to verify the endpoint before writing any application code:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/reference/api",
    "output_format": "markdown",
    "js_render": true
  }'
```
The output_format: "markdown" parameter is the critical lever. Instead of receiving an HTML blob, you get a pre-processed markdown document with headings, fenced code blocks, and lists intact, ready for a splitter.
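For readers who prefer to test from Python without installing an SDK, the same request can be assembled with the standard library. This is a minimal sketch that mirrors the cURL call above field for field; `build_scrape_request` is a name introduced here, and the request is built but not sent, because the response schema is not described in this post.

```python
import json
import urllib.request

API_URL = "https://api.alterlab.io/v1/scrape"  # endpoint taken from the cURL example


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    payload = {"url": url, "output_format": "markdown", "js_render": True}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_scrape_request("https://docs.example.com/reference/api", "YOUR_KEY")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To send it: urllib.request.urlopen(req, timeout=30).
# Inspect the response body before relying on any particular field names.
```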
Step 2: The Full Python Pipeline
The Python SDK ships with a batteries-included client. Install dependencies, then wire up the complete ingest flow:
```bash
pip install alterlab langchain-text-splitters langchain-openai chromadb tiktoken
```

```python
import os

import alterlab
import chromadb
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import MarkdownHeaderTextSplitter

EMBED_MODEL = "text-embedding-3-small"  # 5× cheaper than ada-002

# Initialize clients
scraper = alterlab.Client(os.environ["ALTERLAB_API_KEY"])
embeddings = OpenAIEmbeddings(model=EMBED_MODEL)
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("rag_docs")  # persistent vector store


def fetch_as_markdown(url: str) -> str:
    """Fetch URL and return clean markdown. Handles JS, bot challenges, redirects."""
    response = scraper.scrape(  # anti-bot bypass + JS render
        url=url,
        output_format="markdown",
        js_render=True,
        wait_for_selector="main, article, [role='main']",  # target content, not chrome
        timeout=30,
    )
    if not response.success:
        raise RuntimeError(f"Scrape failed [{response.status_code}]: {url}")
    return response.text


def chunk_markdown(markdown: str, source_url: str) -> list[dict]:
    """
    Split on heading boundaries, not character count.
    Preserves the semantic unit of each documentation section.
    """
    splitter = MarkdownHeaderTextSplitter(  # respects document hierarchy
        headers_to_split_on=[
            ("#", "h1"),
            ("##", "h2"),
            ("###", "h3"),
        ],
        strip_headers=False,  # keep heading in chunk text
    )
    docs = splitter.split_text(markdown)
    return [
        {"content": doc.page_content, "metadata": {**doc.metadata, "source_url": source_url}}
        for doc in docs
        if len(doc.page_content.strip()) > 80  # discard stub/empty sections
    ]


def ingest_url(url: str) -> int:
    """Full pipeline: fetch → chunk → embed → upsert. Returns the chunk count."""
    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)
    texts = [c["content"] for c in chunks]  # extract text for batch embed
    vectors = embeddings.embed_documents(texts)  # batched embedding call
    ids = [f"{url}#{i}" for i in range(len(chunks))]
    collection.upsert(  # idempotent: safe to re-run
        ids=ids,
        embeddings=vectors,
        documents=texts,
        metadatas=[c["metadata"] for c in chunks],
    )
    return len(chunks)


def query(question: str, n_results: int = 5) -> list[dict]:
    """Retrieve the top-k nearest chunks.

    Chroma's default distance is squared L2; configure the collection for
    cosine distance if you need cosine scores specifically.
    """
    q_vector = embeddings.embed_query(question)
    results = collection.query(query_embeddings=[q_vector], n_results=n_results)
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


if __name__ == "__main__":
    urls = [
        "https://docs.example.com/reference/authentication",
        "https://docs.example.com/reference/rate-limiting",
        "https://docs.example.com/guides/quickstart",
    ]
    for url in urls:
        n = ingest_url(url)
        print(f"Ingested {n} chunks ← {url}")
    hits = query("How do I authenticate API requests?")
    for h in hits:
        print(f"[{h['distance']:.3f}] {h['metadata'].get('h2', '')}: {h['text'][:100]}...")
```
Step 3: Chunking Strategy
Character-count chunking (RecursiveCharacterTextSplitter with chunk_size=1000) is fine for prose but breaks code-heavy documentation mid-function and splits conceptually related content across chunk boundaries. The right splitter depends on content type.
For documentation ingestion, heading-based splitting is the default choice. Add a RecursiveCharacterTextSplitter as a fallback to cap any single chunk at 6,000 tokens, since some reference pages have multi-page sections under a single heading. Note that chunk_size counts characters by default; to measure in tokens, construct the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder.
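The heading-first, size-cap-second strategy can be sketched without any third-party library. The code below is a simplified stand-in for the LangChain splitters, not their implementation: `estimate_tokens` assumes roughly four characters per token (swap in a real tokenizer for production), and the function names are illustrative. One known limitation is that the fallback splits on blank lines, so a fenced code block containing blank lines can still be cut.

```python
import re

HEADING = re.compile(r"^#{1,3} ")      # h1-h3, matching the pipeline's split levels
FENCE = re.compile(r"^(`{3}|~{3})")    # opening or closing code fence


def estimate_tokens(text: str) -> int:
    """Crude tokenizer stand-in: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def split_on_headings(markdown: str) -> list[str]:
    """Start a new section at each h1-h3 line, ignoring '#' lines inside code fences."""
    sections, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        if not in_fence and HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def cap_section(section: str, max_tokens: int) -> list[str]:
    """Re-split an oversized section on blank lines, packing paragraphs greedily."""
    if estimate_tokens(section) <= max_tokens:
        return [section]
    pieces, current = [], ""
    for para in section.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and estimate_tokens(candidate) > max_tokens:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces  # a single paragraph above max_tokens is left whole


def chunk(markdown: str, max_tokens: int = 6000) -> list[str]:
    """Heading split first, then the size cap as a fallback."""
    return [p for s in split_on_headings(markdown) for p in cap_section(s, max_tokens)]


fence = "`" * 3
doc = "\n".join([
    "# Auth",
    "Intro paragraph.",
    "## Tokens",
    fence + "python",
    "# this comment is not a heading",
    "print('hi')",
    fence,
    "## Limits",
    "First paragraph about limits.",
    "",
    "Second paragraph about limits.",
])
sections = split_on_headings(doc)
print(len(sections), "sections;", len(cap_section(sections[2], 10)), "pieces after capping")
```

Tracking fence state matters for documentation: without it, every Python or shell comment inside a code sample would start a new section.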
Step 4: Cost Optimization with Caching
The main cost levers are scraping requests, embedding tokens, and storage queries. The first two can be dramatically reduced with two caching layers:
Layer 1 — URL-level ETag caching: Most documentation and knowledge-base content is stable. Store (url → etag) after each fetch. On subsequent runs, issue a HEAD request first; if the ETag or Last-Modified header is unchanged, skip the scrape entirely.
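The skip decision for Layer 1 can be isolated as a pure function, which makes it easy to unit test separately from any HTTP client. This is a sketch under the assumption that response headers arrive as a plain dict; `pick_validator` and `should_skip` are names introduced here.

```python
def pick_validator(headers: dict[str, str]) -> str:
    """Return the ETag if present, else Last-Modified, else an empty string."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return lowered.get("etag") or lowered.get("last-modified", "")


def should_skip(cache: dict[str, str], url: str, headers: dict[str, str]) -> bool:
    """True only when a non-empty validator matches the one stored for this URL."""
    validator = pick_validator(headers)
    return bool(validator) and cache.get(url) == validator


cache = {"https://docs.example.com/guides/quickstart": '"abc123"'}
url = "https://docs.example.com/guides/quickstart"
print(should_skip(cache, url, {"ETag": '"abc123"'}))   # unchanged validator
print(should_skip(cache, url, {"ETag": '"def456"'}))   # validator changed
print(should_skip(cache, url, {}))                     # no validator: always re-fetch
```

Requiring a non-empty validator is deliberate: a server that sends neither header should never be treated as unchanged.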
Layer 2 — Chunk-level deduplication: Before embedding, SHA-256 hash each chunk's text. Check the vector store for that ID. If it exists, skip the embed call. This is the bigger cost saver for pipelines that re-ingest on a schedule.
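Layer 2 can be illustrated in isolation with a plain set standing in for the IDs already held by the vector store. The helper names below are hypothetical; the point of the sketch is that content-derived IDs make the "have I embedded this already?" check a set lookup, and that duplicates inside a single batch should be dropped too.

```python
import hashlib


def chunk_id(text: str) -> str:
    """Content-derived ID: identical text always maps to the same ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def select_new_chunks(chunks: list[str], known_ids: set[str]) -> list[tuple[str, str]]:
    """Return (id, text) pairs not yet stored, dropping duplicates within the batch."""
    fresh, seen = [], set(known_ids)
    for text in chunks:
        cid = chunk_id(text)
        if cid not in seen:
            seen.add(cid)
            fresh.append((cid, text))
    return fresh


store: set[str] = set()  # stands in for the IDs already in the vector store
first = select_new_chunks(["## Auth\nUse a key.", "## Limits\n10 req/s."], store)
store.update(cid for cid, _ in first)

# Second run: one section unchanged, one edited. Only the edited one needs embedding.
second = select_new_chunks(["## Auth\nUse a key.", "## Limits\n20 req/s."], store)
print(len(first), "chunks embedded on first run,", len(second), "on second run")
```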
```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path(".scrape_cache.json")


def load_cache() -> dict:
    """Load persisted ETag map from disk."""
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())  # url → etag mapping
    return {}


def save_cache(cache: dict) -> None:
    CACHE_FILE.write_text(json.dumps(cache, indent=2))


def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]


def ingest_url_cached(url: str) -> int:
    cache = load_cache()
    head = scraper.head(url)  # lightweight ETag check
    etag = head.headers.get("etag") or head.headers.get("last-modified", "")
    if cache.get(url) == etag and etag:
        print(f"Skip {url}: content unchanged")
        return 0
    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)
    new_chunks = []
    seen = set()  # identical sections on one page would otherwise produce duplicate IDs
    for chunk in chunks:
        h = chunk_hash(chunk["content"])
        if h in seen:
            continue
        seen.add(h)
        existing = collection.get(ids=[h])  # check store before embedding
        if not existing["ids"]:
            new_chunks.append((h, chunk))
    if new_chunks:
        ids, texts, metas = zip(*[
            (h, c["content"], c["metadata"]) for h, c in new_chunks
        ])
        vectors = embeddings.embed_documents(list(texts))  # embed only new chunks
        collection.upsert(
            ids=list(ids),
            embeddings=vectors,
            documents=list(texts),
            metadatas=list(metas),
        )
    cache[url] = etag
    save_cache(cache)
    return len(new_chunks)
```
Step 5: Scaling with a Task Queue
For single-user tooling, the synchronous ingest above is sufficient. For pipelines ingesting thousands of URLs on a schedule, parallelize with Celery and Redis:
```python
from celery import Celery, group

from cached_ingest import ingest_url_cached

app = Celery("rag_ingest", broker="redis://localhost:6379/0")


@app.task(bind=True, max_retries=3)  # retry on transient 429/503
def ingest_task(self, url: str) -> dict:
    try:
        count = ingest_url_cached(url)
        return {"url": url, "new_chunks": count}
    except Exception as exc:
        # Exponential backoff: 30 s, 60 s, 120 s
        raise self.retry(exc=exc, countdown=30 * 2 ** self.request.retries)


# Dispatch a full sitemap.
# parse_sitemap is not defined in this post; it stands for any helper that returns a list of URLs.
urls = parse_sitemap("https://docs.example.com/sitemap.xml")
job = group(ingest_task.s(url) for url in urls)
result = job.apply_async()
```
Keep worker concurrency aligned to your scraping API plan. AlterLab's pricing plans scale with concurrent connections: over-parallelizing wastes retries; under-parallelizing wastes wall time.
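The dispatch snippet above relies on a parse_sitemap helper that the post does not show. One possible shape for it, as a sketch: the function below parses sitemap XML that has already been downloaded and returns the listed URLs. `parse_sitemap_xml` is a hypothetical name; it handles a plain `<urlset>` document and does not follow sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_xml(xml_text: str) -> list[str]:
    """Return every <loc> URL from a sitemap <urlset> document, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/reference/authentication</loc></url>
  <url><loc>https://docs.example.com/guides/quickstart</loc></url>
</urlset>
""".strip()

for page_url in parse_sitemap_xml(sample_sitemap):
    print(page_url)
```

Fetching the XML is left out on purpose so the parsing logic can be tested offline; a thin wrapper that downloads the sitemap and calls this function would fill the parse_sitemap role.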
Handling Edge Cases
JavaScript-heavy SPAs: Set js_render: true and use wait_for_selector targeting the content container (main, article, [role="main"]). Waiting on body fires before React or Vue hydrates the actual content.
Pagination: After fetching, parse rel="next" link tags from the response metadata and enqueue subsequent pages in the same Celery task group. Store (canonical_url, page_number) as the vector ID to avoid collisions.
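A sketch of the pagination step, assuming the raw HTML of the page is available (the exact field in which the scraping response exposes link metadata is not specified here, so this works from HTML directly). `NextLinkFinder`, `find_next_url`, and `page_vector_id` are illustrative names.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <link> or <a> whose rel contains 'next'."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a") or self.next_href is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "next" in rels and attr.get("href"):
            self.next_href = attr["href"]


def find_next_url(html: str, base_url: str) -> Optional[str]:
    """Resolve the rel="next" target against the current page URL, if any."""
    finder = NextLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.next_href) if finder.next_href else None


def page_vector_id(canonical_url: str, page_number: int, chunk_index: int) -> str:
    """Compose a collision-free ID from (canonical_url, page_number) plus chunk index."""
    return f"{canonical_url}#page={page_number}&chunk={chunk_index}"


page_html = '<html><head><link rel="next" href="/docs/changelog?page=2"></head></html>'
print(find_next_url(page_html, "https://docs.example.com/docs/changelog"))
print(page_vector_id("https://docs.example.com/docs/changelog", 2, 0))
```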
PDF and binary content: Check the response's content_type field before processing. If it's not text/html, route to a dedicated PDF extraction path (pdfplumber, pymupdf) rather than the markdown pipeline.
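The routing check can be a small pure function. This sketch assumes the content type is available as a standard Content-Type header value; the route names ("markdown", "pdf", "skip") are placeholders for whatever handlers your pipeline defines.

```python
def route_for(content_type: str) -> str:
    """Map a Content-Type value to a processing path name."""
    media_type = content_type.split(";", 1)[0].strip().lower()  # drop "; charset=..."
    if media_type in ("text/html", "application/xhtml+xml"):
        return "markdown"
    if media_type == "application/pdf":
        return "pdf"
    return "skip"  # images, archives, and anything else unhandled


for value in ("text/html; charset=utf-8", "application/pdf", "image/png"):
    print(value, "->", route_for(value))
```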
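Oversized sections: A single API reference page converted to markdown can still produce chunks exceeding the 8,192-token embedding limit. Add a token-count guard after heading splitting and re-chunk any oversized section with a fallback RecursiveCharacterTextSplitter capped at 6,000 tokens. A plain RecursiveCharacterTextSplitter(chunk_size=6000) counts characters; build it with from_tiktoken_encoder(chunk_size=6000) to count tokens instead.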
Dynamic content that isn't indexed: Some content (login-gated pages, single-page apps loading data via authenticated XHR) won't yield useful markdown regardless of JS rendering. Identify these early in the pipeline and route them to session-based scraping or data-export APIs where they exist.
Choosing an Embedding Model
The embedding model choice has a larger cost impact than most engineers expect:
| Model | Dimensions | Cost / 1M tokens (USD) | MTEB Score |
|---|---|---|---|
| text-embedding-ada-002 | 1,536 | $0.10 | 61.0 |
| text-embedding-3-small | 1,536 | $0.02 | 62.3 |
| text-embedding-3-large | 3,072 | $0.13 | 64.6 |
| nomic-embed-text (local) | 768 | $0.00 | 62.4 |
| BGE-M3 (local) | 1,024 | $0.00 | 63.8 |
For most RAG use cases, text-embedding-3-small matches or beats ada-002 at 5× lower cost. For air-gapped or high-volume deployments, a locally hosted model eliminates per-token costs entirely at the price of infrastructure overhead.
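To turn the table into a monthly figure for your own volume, the arithmetic is a single multiplication. The sketch below copies the hosted-model prices from the table above; `monthly_cost` is an illustrative helper, and it covers API charges only, not the infrastructure cost of running a local model.

```python
# USD per 1M tokens, copied from the table above.
PRICE_PER_MILLION = {
    "text-embedding-ada-002": 0.10,
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Embedding API cost in USD for a given monthly token volume."""
    return round(PRICE_PER_MILLION[model] * tokens_per_month / 1_000_000, 2)


volume = 500_000_000  # 500M tokens per month
for model in PRICE_PER_MILLION:
    print(f"{model}: ${monthly_cost(model, volume):,.2f}/month")
```

Comparing these figures against the cost of hosting a GPU or CPU inference service is what should drive the hosted-versus-local decision.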
Takeaways
- Request markdown at the fetch layer. Post-processing HTML is expensive and lossy; let the scraping API handle conversion before the content hits your pipeline.
- Split on structure, not character count. MarkdownHeaderTextSplitter preserves the semantic unit of documentation. Add a token-limit fallback for oversized sections.
- Cache at two levels. ETag-based URL caching prevents unnecessary re-scraping; chunk-level SHA-256 deduplication prevents unnecessary re-embedding. Together they cut ongoing costs by 90%+ on stable knowledge bases.
- Match worker concurrency to your API tier. Excess parallelism hits rate limits and burns retries; it does not improve throughput.
- Pick the right embedding model. text-embedding-3-small is the default right choice for hosted inference. Local BGE-M3 or nomic-embed-text is the right choice above ~500M tokens/month.