
Web Scraping Pipeline for LLM & RAG: Clean Markdown

Build a cost-effective web scraping pipeline that outputs clean markdown for LLM and RAG apps. Covers anti-bot bypass, heading-aware chunking, and ETag caching.

Yash Dubey

March 25, 2026

8 min read

The biggest quality problem in RAG pipelines isn't the embedding model or the vector store — it's the input data. Raw HTML fed into a chunker produces token-heavy garbage: navigation menus, cookie banners, inline styles, and <script> blocks that dilute every embedding you generate. Clean markdown eliminates 60–80% of that noise before a single token reaches your LLM.

This post walks through a production-ready pipeline: fetch pages with bot-bypass-aware scraping, convert to structured markdown, chunk on heading boundaries, and cache aggressively to control cost.

Why Markdown Beats Raw HTML for LLM Inputs

A typical documentation page runs 8,000+ tokens as raw HTML and around 1,200 tokens as clean markdown. That gap matters at three stages:

  • Chunking: HTML chunkers split on character count, slicing mid-tag, mid-sentence, and mid-function. Markdown respects the document's own semantic structure.
  • Retrieval precision: Boilerplate (<nav>, <footer>, repeated header text) bleeds into embedding space and degrades cosine similarity scores on meaningful content.
  • LLM context windows: Smaller, cleaner chunks mean more retrieved context fits in the prompt window without exceeding token limits.

The fix is to request markdown output at the fetch layer, not post-process HTML downstream.
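To make the noise ratio concrete, here is a minimal stdlib sketch (the HTML snippet and the tag list are illustrative, not AlterLab output) that strips boilerplate containers the way a markdown converter would, then compares sizes:

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Drop text inside boilerplate containers; keep everything else."""
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

html = """<html><head><style>body{margin:0}</style></head>
<body><nav><a href='/'>Home</a><a href='/docs'>Docs</a></nav>
<main><h1>Authentication</h1><p>Use an API key in the X-API-Key header.</p></main>
<script>trackPageView();</script></body></html>"""

parser = ContentExtractor()
parser.feed(html)
text = "\n".join(parser.parts)
print(f"raw: {len(html)} chars, content: {len(text)} chars")
```

Even on this tiny page, the content is a small fraction of the raw markup; on a real documentation page the ratio is far worse.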

Pipeline Architecture

Step 1: Fetching Pages with Anti-Bot Bypass

Most production scraping targets — documentation sites, knowledge bases, e-commerce, news publishers — run Cloudflare or similar bot detection. A raw requests.get() returns a 403 or a JS challenge page, neither of which contains your content.

The AlterLab anti-bot bypass API handles Cloudflare, Datadome, and CAPTCHA challenges transparently. You send a URL and receive content. No fingerprint maintenance, no proxy rotation code, no challenge-solving logic on your end.

Here's the cURL equivalent to verify the endpoint before writing any application code:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/reference/api",
    "output_format": "markdown",
    "js_render": true
  }'

The output_format: "markdown" parameter is the critical lever. Instead of receiving an HTML blob, you get a pre-processed markdown document with headings, fenced code blocks, and lists intact — ready for a splitter.


Step 2: The Full Python Pipeline

The Python SDK ships with a batteries-included client. Install dependencies, then wire up the complete ingest flow:

Bash
pip install alterlab langchain-text-splitters langchain-openai chromadb tiktoken
Python
import os
import alterlab
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
import chromadb

# Initialize clients
scraper = alterlab.Client(os.environ["ALTERLAB_API_KEY"])
EMBED_MODEL = "text-embedding-3-small"                     # 5× cheaper than ada-002
embeddings = OpenAIEmbeddings(model=EMBED_MODEL)
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("rag_docs")   # persistent vector store


def fetch_as_markdown(url: str) -> str:
    """Fetch URL and return clean markdown. Handles JS, bot challenges, redirects."""
    response = scraper.scrape(                              # anti-bot bypass + JS render
        url=url,
        output_format="markdown",
        js_render=True,
        wait_for_selector="main, article, [role='main']",  # target content, not chrome
        timeout=30,
    )
    if not response.success:
        raise RuntimeError(f"Scrape failed [{response.status_code}]: {url}")
    return response.text


def chunk_markdown(markdown: str, source_url: str) -> list[dict]:
    """
    Split on heading boundaries, not character count.
    Preserves the semantic unit of each documentation section.
    """
    splitter = MarkdownHeaderTextSplitter(                  # respects document hierarchy
        headers_to_split_on=[
            ("#",   "h1"),
            ("##",  "h2"),
            ("###", "h3"),
        ],
        strip_headers=False,                               # keep heading in chunk text
    )
    docs = splitter.split_text(markdown)
    return [
        {"content": doc.page_content, "metadata": {**doc.metadata, "source_url": source_url}}
        for doc in docs
        if len(doc.page_content.strip()) > 80             # discard stub/empty sections
    ]


def ingest_url(url: str) -> int:
    """Full pipeline: fetch → chunk → embed → upsert. Returns new chunk count."""
    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)

    texts = [c["content"] for c in chunks]                 # extract text for batch embed
    vectors = embeddings.embed_documents(texts)            # single batched API call
    ids = [f"{url}#{i}" for i in range(len(chunks))]

    collection.upsert(                                     # idempotent — safe to re-run
        ids=ids,
        embeddings=vectors,
        documents=texts,
        metadatas=[c["metadata"] for c in chunks],
    )
    return len(chunks)


def query(question: str, n_results: int = 5) -> list[dict]:
    """Retrieve top-k chunks by cosine similarity."""
    q_vector = embeddings.embed_query(question)
    results = collection.query(query_embeddings=[q_vector], n_results=n_results)
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


if __name__ == "__main__":
    urls = [
        "https://docs.example.com/reference/authentication",
        "https://docs.example.com/reference/rate-limiting",
        "https://docs.example.com/guides/quickstart",
    ]
    for url in urls:
        n = ingest_url(url)
        print(f"Ingested {n} chunks ← {url}")

    hits = query("How do I authenticate API requests?")
    for h in hits:
        print(f"[{h['distance']:.3f}] {h['metadata'].get('h2', '')}: {h['text'][:100]}...")

Step 3: Chunking Strategy

Character-count chunking — RecursiveCharacterTextSplitter with chunk_size=1000 — is fine for prose but breaks code-heavy documentation mid-function and splits conceptually related content across chunk boundaries. The right splitter depends on content type.

For documentation ingestion, heading-based splitting is the default choice. Add a RecursiveCharacterTextSplitter as a fallback to cap any single chunk at 6,000 tokens — some reference pages have multi-page sections under a single heading.
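The two-stage idea (heading boundaries first, a size cap as fallback) can be sketched with the stdlib alone; the 200-character cap below stands in for a real token limit:

```python
import re

def split_on_headings(md: str) -> list[str]:
    """First pass: split at h1-h3 boundaries, keeping each heading with its body."""
    sections, current = [], []
    for line in md.splitlines():
        if re.match(r"^#{1,3} ", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def cap_sections(sections: list[str], max_chars: int = 200) -> list[str]:
    """Second pass: re-chunk any oversized section, preferring paragraph breaks."""
    chunks = []
    for s in sections:
        while len(s) > max_chars:
            cut = s.rfind("\n\n", 0, max_chars)   # prefer a paragraph boundary
            cut = cut if cut > 0 else max_chars    # else cut hard at the limit
            chunks.append(s[:cut])
            s = s[cut:].lstrip()
        chunks.append(s)
    return chunks

md = "# Auth\nUse a key.\n\n## Rate limits\n" + "Long reference section. " * 25
chunks = cap_sections(split_on_headings(md))
```

In production you would measure chunk size in tokens (e.g. with tiktoken) rather than characters, but the control flow is the same.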

Step 4: Cost Optimization with Caching

The main cost levers are scraping requests, embedding tokens, and storage queries. The first two can be dramatically reduced with two caching layers:

Layer 1 — URL-level ETag caching: Most documentation and knowledge-base content is stable. Store (url → etag) after each fetch. On subsequent runs, issue a HEAD request first; if the ETag or Last-Modified header is unchanged, skip the scrape entirely.

Layer 2 — Chunk-level deduplication: Before embedding, SHA-256 hash each chunk's text. Check the vector store for that ID. If it exists, skip the embed call. This is the bigger cost saver for pipelines that re-ingest on a schedule.

Python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path(".scrape_cache.json")


def load_cache() -> dict:
    """Load persisted ETag map from disk."""
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())   # url → etag mapping
    return {}

def save_cache(cache: dict) -> None:
    CACHE_FILE.write_text(json.dumps(cache, indent=2))

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def ingest_url_cached(url: str) -> int:
    cache = load_cache()

    head = scraper.head(url)                            # lightweight ETag check
    etag = head.headers.get("etag") or head.headers.get("last-modified", "")
    if cache.get(url) == etag and etag:
        print(f"Skip {url} — content unchanged")
        return 0

    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)

    new_chunks = []
    for chunk in chunks:
        h = chunk_hash(chunk["content"])
        existing = collection.get(ids=[h])              # check store before embedding
        if not existing["ids"]:
            new_chunks.append((h, chunk))

    if new_chunks:
        ids, texts, metas = zip(*[
            (h, c["content"], c["metadata"]) for h, c in new_chunks
        ])
        vectors = embeddings.embed_documents(list(texts))   # embed only new chunks
        collection.upsert(
            ids=list(ids),
            embeddings=vectors,
            documents=list(texts),
            metadatas=list(metas),
        )

    cache[url] = etag
    save_cache(cache)
    return len(new_chunks)

  • ~80% token reduction vs. raw HTML
  • 5× embedding cost saving (text-embedding-3-small vs. ada-002)
  • ~90% embedding cost saved via caching

Step 5: Scaling with a Task Queue

For single-user tooling, the synchronous ingest above is sufficient. For pipelines ingesting thousands of URLs on a schedule, parallelize with Celery and Redis:

Python
from celery import Celery, group
from cached_ingest import ingest_url_cached

app = Celery("rag_ingest", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3, default_retry_delay=30)   # retry transient 429/503
def ingest_task(self, url: str) -> dict:
    try:
        count = ingest_url_cached(url)
        return {"url": url, "new_chunks": count}
    except Exception as exc:
        raise self.retry(exc=exc)                             # re-queue after the retry delay

# Dispatch a full sitemap
urls = parse_sitemap("https://docs.example.com/sitemap.xml")  # your sitemap-parsing helper
job = group(ingest_task.s(url) for url in urls)
result = job.apply_async()

Keep worker concurrency aligned to your scraping API plan. AlterLab's pricing plans scale with concurrent connections — over-parallelizing wastes retries; under-parallelizing wastes wall time.
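If you drive ingestion from a single process rather than Celery, the same rule applies: bound in-flight work with a semaphore sized to your plan's connection limit. A minimal asyncio sketch (the fetch stub is a placeholder for the real scrape-and-ingest call):

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.01)                  # stand-in for a scrape + ingest call
    return url

async def ingest_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)   # match your plan's connection limit
    async def bounded(url: str) -> str:
        async with sem:                        # at most max_concurrency in flight
            return await fetch(url)
    return await asyncio.gather(*(bounded(u) for u in urls))

urls = [f"https://docs.example.com/page/{i}" for i in range(20)]
results = asyncio.run(ingest_all(urls))
```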

Handling Edge Cases

JavaScript-heavy SPAs: Set js_render: true and use wait_for_selector targeting the content container (main, article, [role="main"]). Waiting on body fires before React or Vue hydrates the actual content.

Pagination: After fetching, parse rel="next" link tags from the response metadata and enqueue subsequent pages in the same Celery task group. Store (canonical_url, page_number) as the vector ID to avoid collisions.
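A stdlib sketch of the rel="next" extraction (the snippet is illustrative; some sites expose the link in an HTTP Link header instead, which you would read from response metadata):

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Capture the href of a <link rel="next"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "next":
            self.next_url = a.get("href")

page = '<head><link rel="next" href="https://docs.example.com/reference/api?page=2"></head>'
finder = NextLinkFinder()
finder.feed(page)
```

If `finder.next_url` is set, enqueue it in the same task group and repeat until it comes back empty.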

PDF and binary content: Check the response's content_type field before processing. If it's not text/html, route to a dedicated PDF extraction path (pdfplumber, pymupdf) rather than the markdown pipeline.
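A minimal routing guard might look like this; the returned path names are placeholders for your own handlers:

```python
def route(content_type: str) -> str:
    """Pick a processing path from the response's media type."""
    mime = content_type.split(";")[0].strip().lower()  # drop charset parameters
    if mime in ("text/html", "text/markdown"):
        return "markdown_pipeline"
    if mime == "application/pdf":
        return "pdf_extraction"
    return "skip"
```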

Oversized sections: A single API reference page converted to markdown can still produce chunks exceeding the 8,192-token embedding limit. Add a token-count guard after heading splitting and re-chunk any oversized section with RecursiveCharacterTextSplitter(chunk_size=6000) as a fallback.

Dynamic content that isn't indexed: Some content (login-gated pages, single-page apps loading data via authenticated XHR) won't yield useful markdown regardless of JS rendering. Identify these early in the pipeline and route them to session-based scraping or data-export APIs where they exist.

Choosing an Embedding Model

The embedding model choice has a larger cost impact than most engineers expect:

Model                        Dimensions    Cost / 1M tokens    MTEB Score
text-embedding-ada-002       1,536         $0.10               61.0
text-embedding-3-small       1,536         $0.02               62.3
text-embedding-3-large       3,072         $0.13               64.6
nomic-embed-text (local)     768           $0.00               62.4
BGE-M3 (local)               1,024         $0.00               63.8

For most RAG use cases, text-embedding-3-small matches or beats ada-002 at 5× lower cost. For air-gapped or high-volume deployments, a locally-hosted model eliminates per-token costs entirely at the price of infrastructure overhead.
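To put the table in dollar terms, a quick back-of-the-envelope at an assumed 500M tokens per month, using the per-million prices above:

```python
PRICE_PER_M_TOKENS = {              # USD per 1M tokens, from the comparison table
    "text-embedding-ada-002": 0.10,
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

monthly_tokens = 500_000_000        # illustrative high-volume workload

for model, price in PRICE_PER_M_TOKENS.items():
    cost = monthly_tokens / 1_000_000 * price
    print(f"{model}: ${cost:,.2f}/month")
```

At that volume the spread between models is tens of dollars a month, which is the point where hosting a local model starts to look attractive.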

Takeaways

  • Request markdown at the fetch layer. Post-processing HTML is expensive and lossy; let the scraping API handle conversion before the content hits your pipeline.
  • Split on structure, not character count. MarkdownHeaderTextSplitter preserves the semantic unit of documentation. Add a token-limit fallback for oversized sections.
  • Cache at two levels. ETag-based URL caching prevents unnecessary re-scraping; chunk-level SHA-256 deduplication prevents unnecessary re-embedding. Together they cut ongoing costs by 90%+ on stable knowledge bases.
  • Match worker concurrency to your API tier. Excess parallelism hits rate limits and burns retries; it does not improve throughput.
  • Pick the right embedding model. text-embedding-3-small is the default right choice for hosted inference. Local BGE-M3 or nomic-embed-text is the right choice above ~500M tokens/month.

Frequently Asked Questions

What output format is best for scraped content in RAG pipelines?
Markdown is the best output format for RAG pipelines. It reduces token count by 60–80% compared to raw HTML while preserving document structure — headings, lists, and code blocks — that directly improves chunking quality and retrieval precision.

How do I scrape sites protected by Cloudflare or similar bot detection?
Use a scraping API with built-in anti-bot bypass rather than raw HTTP clients. Implementing bypass from scratch requires maintaining browser fingerprints, proxy rotation, and challenge solvers — a significant ongoing maintenance burden that grows as detection systems update.

How do I keep chunking and embedding costs under control?
For structured content like documentation or wikis, use heading-based chunking (MarkdownHeaderTextSplitter in LangChain). Combine it with ETag-based URL caching and SHA-256 chunk deduplication to avoid re-embedding unchanged content — this cuts ongoing embedding costs by 90%+ for stable knowledge bases.