Web Scraping Pipeline for LLM & RAG: Clean Markdown

Build a cost-effective web scraping pipeline that outputs clean markdown for LLM and RAG apps. Covers anti-bot bypass, heading-aware chunking, and ETag caching.

Yash Dubey

The biggest quality problem in RAG pipelines isn't the embedding model or the vector store — it's the input data. Raw HTML fed into a chunker produces token-heavy garbage: navigation menus, cookie banners, inline styles, and <script> blocks that dilute every embedding you generate. Clean markdown eliminates 60–80% of that noise before a single token reaches your LLM.

This post walks through a production-ready pipeline: fetch pages with bot-bypass-aware scraping, convert to structured markdown, chunk on heading boundaries, and cache aggressively to control cost.

Why Markdown Beats Raw HTML for LLM Inputs

A typical documentation page runs 8,000+ tokens as raw HTML and around 1,200 tokens as clean markdown. That gap matters at three stages:

  • Chunking: HTML chunkers split on character count, slicing mid-tag, mid-sentence, and mid-function. Markdown respects the document's own semantic structure.
  • Retrieval precision: Boilerplate (<nav>, <footer>, repeated header text) bleeds into embedding space and degrades cosine similarity scores on meaningful content.
  • LLM context windows: Smaller, cleaner chunks mean more retrieved context fits in the prompt window without exceeding token limits.

The fix is to request markdown output at the fetch layer, not post-process HTML downstream.
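
To make the token gap concrete, here is a minimal sketch that compares an HTML fragment with its markdown equivalent. The `approx_tokens` helper is a hypothetical stand-in (a regex count of words and punctuation marks, not a model tokenizer), and both sample strings are invented for illustration; swap in tiktoken for real measurements.

```python
import re

def approx_tokens(text: str) -> int:
    """Rough token estimate: runs of word characters plus single punctuation marks.

    Illustrative approximation only; a real tokenizer will give different counts.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))

# Invented sample: the same content as raw HTML and as markdown
HTML_SAMPLE = """<nav class="top"><a href="/">Home</a><a href="/docs">Docs</a></nav>
<div class="content" style="margin:0 auto"><h2 id="auth">Authentication</h2>
<p>Send your key in the <code>X-API-Key</code> header.</p></div>
<footer><script>track("pageview")</script></footer>"""

MARKDOWN_SAMPLE = """## Authentication

Send your key in the `X-API-Key` header."""

def reduction(html: str, markdown: str) -> float:
    """Fraction of approximate tokens removed by switching to markdown."""
    return 1 - approx_tokens(markdown) / approx_tokens(html)

if __name__ == "__main__":
    print(approx_tokens(HTML_SAMPLE), approx_tokens(MARKDOWN_SAMPLE))
    print(f"{reduction(HTML_SAMPLE, MARKDOWN_SAMPLE):.0%} fewer approximate tokens")
```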

Pipeline Architecture

Step 1: Fetching Pages with Anti-Bot Bypass

Many production scraping targets (documentation sites, knowledge bases, e-commerce, news publishers) run Cloudflare or similar bot detection. A raw requests.get() often returns a 403 or a JS challenge page, neither of which contains your content.

The AlterLab anti-bot bypass API handles Cloudflare, Datadome, and CAPTCHA challenges transparently. You send a URL and receive content. No fingerprint maintenance, no proxy rotation code, no challenge-solving logic on your end.

Here's a cURL call you can use to verify the endpoint before writing any application code:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/reference/api",
    "output_format": "markdown",
    "js_render": true
  }'

The output_format: "markdown" parameter is the critical lever. Instead of receiving an HTML blob, you get a pre-processed markdown document with headings, fenced code blocks, and lists intact — ready for a splitter.
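
The same request can be assembled in Python with only the standard library. This is a sketch: the endpoint, header name, and JSON fields are copied from the cURL call above, while `build_scrape_request` is a hypothetical helper name. The response shape is not assumed, so the function stops at building the request.

```python
import json
import urllib.request

API_URL = "https://api.alterlab.io/v1/scrape"  # endpoint from the cURL example

def build_scrape_request(url: str, api_key: str, js_render: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks for markdown output."""
    payload = {"url": url, "output_format": "markdown", "js_render": js_render}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_scrape_request("https://docs.example.com/reference/api", "YOUR_KEY")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it: urllib.request.urlopen(req, timeout=30).read()
```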

Step 2: The Full Python Pipeline

The Python SDK ships with a batteries-included client. Install dependencies, then wire up the complete ingest flow:

Bash
pip install alterlab langchain-text-splitters langchain-openai chromadb tiktoken
Python
import os
import alterlab
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
import chromadb

EMBED_MODEL = "text-embedding-3-small"                     # 5× cheaper than ada-002

# Initialize clients
scraper = alterlab.Client(os.environ["ALTERLAB_API_KEY"])
embeddings = OpenAIEmbeddings(model=EMBED_MODEL)
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(              # persistent vector store
    "rag_docs",
    metadata={"hnsw:space": "cosine"},                     # cosine distance (Chroma defaults to L2)
)


def fetch_as_markdown(url: str) -> str:
    """Fetch URL and return clean markdown. Handles JS, bot challenges, redirects."""
    response = scraper.scrape(                              # anti-bot bypass + JS render
        url=url,
        output_format="markdown",
        js_render=True,
        wait_for_selector="main, article, [role='main']",  # target content, not chrome
        timeout=30,
    )
    if not response.success:
        raise RuntimeError(f"Scrape failed [{response.status_code}]: {url}")
    return response.text


def chunk_markdown(markdown: str, source_url: str) -> list[dict]:
    """
    Split on heading boundaries, not character count.
    Preserves the semantic unit of each documentation section.
    """
    splitter = MarkdownHeaderTextSplitter(                  # respects document hierarchy
        headers_to_split_on=[
            ("#",   "h1"),
            ("##",  "h2"),
            ("###", "h3"),
        ],
        strip_headers=False,                               # keep heading in chunk text
    )
    docs = splitter.split_text(markdown)
    return [
        {"content": doc.page_content, "metadata": {**doc.metadata, "source_url": source_url}}
        for doc in docs
        if len(doc.page_content.strip()) > 80             # discard stub/empty sections
    ]


def ingest_url(url: str) -> int:
    """Full pipeline: fetch → chunk → embed → upsert. Returns new chunk count."""
    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)

    texts = [c["content"] for c in chunks]                 # extract text for batch embed
    vectors = embeddings.embed_documents(texts)            # single batched API call
    ids = [f"{url}#{i}" for i in range(len(chunks))]

    collection.upsert(                                     # idempotent — safe to re-run
        ids=ids,
        embeddings=vectors,
        documents=texts,
        metadatas=[c["metadata"] for c in chunks],
    )
    return len(chunks)


def query(question: str, n_results: int = 5) -> list[dict]:
    """Retrieve top-k chunks by cosine similarity."""
    q_vector = embeddings.embed_query(question)
    results = collection.query(query_embeddings=[q_vector], n_results=n_results)
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


if __name__ == "__main__":
    urls = [
        "https://docs.example.com/reference/authentication",
        "https://docs.example.com/reference/rate-limiting",
        "https://docs.example.com/guides/quickstart",
    ]
    for url in urls:
        n = ingest_url(url)
        print(f"Ingested {n} chunks ← {url}")

    hits = query("How do I authenticate API requests?")
    for h in hits:
        print(f"[{h['distance']:.3f}] {h['metadata'].get('h2', '')}: {h['text'][:100]}...")

Step 3: Chunking Strategy

Character-count chunking (RecursiveCharacterTextSplitter with chunk_size=1000) is fine for prose but breaks code-heavy documentation mid-function and splits conceptually related content across chunk boundaries. The right splitter depends on content type.

For documentation ingestion, heading-based splitting is the default choice. Add a RecursiveCharacterTextSplitter as a fallback to cap any single chunk at 6,000 tokens, because some reference pages have multi-page sections under a single heading. Note that chunk_size counts characters by default; build the fallback with RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) so the cap is measured in tokens.
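
To show what heading-based splitting does in principle, here is a minimal standard-library sketch. It is not LangChain's implementation: `split_on_headings` is a hypothetical helper that starts a new chunk at each `#`, `##`, or `###` heading, records the heading trail as metadata, and ignores `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")
FENCE = "`" * 3  # three backticks, built this way so the example contains no literal fence

def split_on_headings(markdown: str) -> list[dict]:
    """Split markdown into chunks that each start at an h1-h3 heading."""
    chunks: list[dict] = []
    path: dict[str, str] = {}          # current heading trail, e.g. {"h1": ..., "h2": ...}
    lines: list[str] = []
    in_fence = False

    def flush() -> None:
        """Emit the buffered lines as one chunk tagged with the current trail."""
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"content": text, "metadata": dict(path)})
        lines.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence    # toggle on every opening or closing fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                    # close the previous section first
            level = len(match.group(1))
            for stale in range(level, 4):
                path.pop(f"h{stale}", None)   # drop same-level and deeper headings
            path[f"h{level}"] = match.group(2).strip()
        lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading line in the text, mirroring strip_headers=False in the pipeline code, so the embedding sees the section title alongside the body.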

Step 4: Cost Optimization with Caching

The main cost levers are scraping requests, embedding tokens, and storage queries. The first two can be cut dramatically with two caching layers:

Layer 1 — URL-level ETag caching: Most documentation and knowledge-base content is stable. Store (url → etag) after each fetch. On subsequent runs, issue a HEAD request first; if the ETag or Last-Modified header is unchanged, skip the scrape entirely.

Layer 2 — Chunk-level deduplication: Before embedding, SHA-256 hash each chunk's text. Check the vector store for that ID. If it exists, skip the embed call. This is the bigger cost saver for pipelines that re-ingest on a schedule.

Python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path(".scrape_cache.json")


def load_cache() -> dict:
    """Load persisted ETag map from disk."""
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())   # url → etag mapping
    return {}

def save_cache(cache: dict) -> None:
    CACHE_FILE.write_text(json.dumps(cache, indent=2))

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def ingest_url_cached(url: str) -> int:
    cache = load_cache()

    head = scraper.head(url)                            # lightweight ETag check
    etag = head.headers.get("etag") or head.headers.get("last-modified", "")
    if cache.get(url) == etag and etag:
        print(f"Skip {url} — content unchanged")
        return 0

    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)

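    # Hash each chunk. The dict also drops repeats within the page,
    # since Chroma rejects duplicate IDs in a single upsert call.
    hashed = {chunk_hash(c["content"]): c for c in chunks}
    # One batched lookup instead of one store query per chunk
    existing = set(collection.get(ids=list(hashed))["ids"]) if hashed else set()
    new_chunks = [(h, c) for h, c in hashed.items() if h not in existing]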

    if new_chunks:
        ids, texts, metas = zip(*[
            (h, c["content"], c["metadata"]) for h, c in new_chunks
        ])
        vectors = embeddings.embed_documents(list(texts))   # embed only new chunks
        collection.upsert(
            ids=list(ids),
            embeddings=vectors,
            documents=list(texts),
            metadatas=list(metas),
        )

    cache[url] = etag
    save_cache(cache)
    return len(new_chunks)

  • ~80% token reduction vs. raw HTML
  • 5× cost saving: text-embedding-3-small vs. ada-002
  • ~90% embedding cost saved via caching
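
An alternative to the HEAD-then-GET pattern above is a conditional GET, which folds the freshness check and the fetch into one request: the server answers 304 Not Modified when the cached validator still matches. The sketch below uses only the standard library against plain HTTP. The cache entry shape (`etag` and `last_modified` keys) and both helper names are assumptions, and whether a given scraping API forwards these headers is something to verify before relying on it.

```python
import urllib.error
import urllib.request
from typing import Optional

def conditional_headers(cached: dict) -> dict:
    """Turn cached validators into If-None-Match / If-Modified-Since headers."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

def fetch_if_changed(url: str, cached: dict) -> Optional[tuple[bytes, dict]]:
    """Return (body, new_cache_entry), or None when the server replies 304."""
    req = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            entry = {
                "etag": resp.headers.get("ETag", ""),
                "last_modified": resp.headers.get("Last-Modified", ""),
            }
            return resp.read(), entry
    except urllib.error.HTTPError as err:
        if err.code == 304:            # urllib surfaces 304 as an HTTPError
            return None
        raise
```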

Step 5: Scaling with a Task Queue

For single-user tooling, the synchronous ingest above is sufficient. For pipelines ingesting thousands of URLs on a schedule, parallelize with Celery and Redis:

Python
from celery import Celery, group
from cached_ingest import ingest_url_cached

app = Celery("rag_ingest", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3)                           # retry on transient 429/503
def ingest_task(self, url: str) -> dict:
    try:
        count = ingest_url_cached(url)
        return {"url": url, "new_chunks": count}
    except Exception as exc:
        # Exponential backoff: 30 s, 60 s, 120 s across the three retries
        raise self.retry(exc=exc, countdown=30 * 2 ** self.request.retries)

# Dispatch a full sitemap (parse_sitemap is a helper you supply; it returns a list of page URLs)
urls = parse_sitemap("https://docs.example.com/sitemap.xml")
job = group(ingest_task.s(url) for url in urls)
result = job.apply_async()
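
The dispatch code calls `parse_sitemap` without defining it. One possible implementation, assuming a standard sitemaps.org `<urlset>` file, is sketched below. Both function names are hypothetical, nested sitemap index files are not followed, and a plain urllib fetch is assumed to be acceptable for the target site.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Union

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap_xml(xml_data: Union[str, bytes]) -> list[str]:
    """Extract every <loc> URL from a <urlset> sitemap document."""
    root = ET.fromstring(xml_data)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]

def parse_sitemap(sitemap_url: str) -> list[str]:
    """Download a sitemap and return its page URLs."""
    with urllib.request.urlopen(sitemap_url, timeout=30) as resp:
        return parse_sitemap_xml(resp.read())
```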

Keep worker concurrency aligned to your scraping API plan. AlterLab's pricing plans scale with concurrent connections — over-parallelizing wastes retries; under-parallelizing wastes wall time.
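
The same cap can be enforced in-process when a full task queue is more than you need. A minimal sketch, assuming a hypothetical plan limit `MAX_CONCURRENT` and a caller-supplied `ingest` function such as `ingest_url_cached`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

MAX_CONCURRENT = 5  # hypothetical: set this to your plan's concurrent-connection limit

def ingest_all(urls: list[str], ingest: Callable[[str], int],
               max_workers: int = MAX_CONCURRENT) -> dict[str, int]:
    """Run ingest(url) for every URL with at most max_workers calls in flight."""
    results: dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(ingest, url): url for url in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()   # re-raises any ingest error
    return results
```

The pool size is the only concurrency knob here, so keeping it at or below the plan limit means excess URLs wait in the executor's queue instead of triggering rate-limit retries.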

Handling Edge Cases

JavaScript-heavy SPAs: Set js_render: true and use wait_for_selector targeting the content container (main, article, [role="main"]). Waiting on body fires before React or Vue hydrates the actual content.

Pagination: After fetching, parse rel="next" link tags from the response metadata and enqueue subsequent pages in the same Celery task group. Store (canonical_url, page_number) as the vector ID to avoid collisions.
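
As a sketch of the pagination step, the helpers below pull the rel="next" target out of a page and build collision-free IDs. This assumes you have the raw HTML (or equivalent link metadata) available. `find_next_url` and `vector_id` are hypothetical names, and the ID format, including the added chunk index, is one possible encoding of (canonical_url, page_number).

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin

class _NextLinkParser(HTMLParser):
    """Records the href of the first <link> or <a> tag whose rel includes "next"."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.next_href is not None or tag not in ("link", "a"):
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "next" in rels and attr.get("href"):
            self.next_href = attr["href"]

def find_next_url(html: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    parser = _NextLinkParser()
    parser.feed(html)
    return urljoin(base_url, parser.next_href) if parser.next_href else None

def vector_id(canonical_url: str, page_number: int, chunk_index: int) -> str:
    """Encode (canonical_url, page_number) plus a chunk index into one ID string."""
    return f"{canonical_url}#p{page_number}-c{chunk_index}"
```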

PDF and binary content: Check the response's content_type field before processing. If it's not text/html, route to a dedicated PDF extraction path (pdfplumber, pymupdf) rather than the markdown pipeline.
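
A minimal routing sketch for that check. The `content_type` value is whatever your response object exposes (the field name comes from the paragraph above); the route labels are hypothetical placeholders for your own handlers, and treating XHTML like HTML is an added assumption.

```python
def route_for(content_type: str) -> str:
    """Pick a processing path from a Content-Type header value."""
    # Strip parameters such as "; charset=utf-8" and normalize case
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return "markdown"
    if mime == "application/pdf":
        return "pdf"
    return "skip"          # images, archives, and anything else unhandled
```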

Oversized sections: A single API reference page converted to markdown can still produce chunks exceeding the 8,192-token embedding limit. Add a token-count guard after heading splitting and re-chunk any oversized section with a token-aware fallback such as RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=6000). A plain RecursiveCharacterTextSplitter(chunk_size=6000) counts characters, not tokens.
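
The guard itself is small. In this sketch, `count_tokens` is any callable that returns a token count (for example one built on tiktoken); the default is a crude whitespace word count, used only so the example runs without dependencies. Oversized sections are re-split on blank lines and packed greedily, which can cut through a fenced code block that contains blank lines.

```python
from typing import Callable

def cap_chunk(text: str, max_tokens: int,
              count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> list[str]:
    """Return [text] if it fits, else pack its paragraphs into pieces under max_tokens.

    A single paragraph larger than max_tokens is emitted as-is; split it further
    (by sentence or by character) if your content can produce one.
    """
    if count_tokens(text) <= max_tokens:
        return [text]
    pieces: list[str] = []
    current: list[str] = []
    for para in text.split("\n\n"):
        candidate = "\n\n".join(current + [para])
        if current and count_tokens(candidate) > max_tokens:
            pieces.append("\n\n".join(current))   # close the full piece
            current = [para]
        else:
            current.append(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```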

Dynamic content that isn't indexed: Some content (login-gated pages, single-page apps loading data via authenticated XHR) won't yield useful markdown regardless of JS rendering. Identify these early in the pipeline and route them to session-based scraping or data-export APIs where they exist.

Choosing an Embedding Model

The embedding model choice has a larger cost impact than most engineers expect:

| Model | Dimensions | Cost / 1M tokens | MTEB Score |
|---|---|---|---|
| text-embedding-ada-002 | 1,536 | $0.10 | 61.0 |
| text-embedding-3-small | 1,536 | $0.02 | 62.3 |
| text-embedding-3-large | 3,072 | $0.13 | 64.6 |
| nomic-embed-text (local) | 768 | $0.00 | 62.4 |
| BGE-M3 (local) | 1,024 | $0.00 | 63.8 |

For most RAG use cases, text-embedding-3-small matches or beats ada-002 at 5× lower cost. For air-gapped or high-volume deployments, a locally hosted model eliminates per-token costs entirely, at the price of infrastructure overhead.
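
The arithmetic behind that comparison, using the per-million-token prices from the table (prices change, so treat them as a snapshot). The monthly volume is an arbitrary example figure.

```python
# Price per 1M tokens in USD, copied from the table above
PRICE_PER_MILLION = {
    "text-embedding-ada-002": 0.10,
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Embedding spend in USD for a given monthly token volume."""
    return PRICE_PER_MILLION[model] * tokens_per_month / 1_000_000

if __name__ == "__main__":
    volume = 500_000_000  # example: 500M tokens per month
    for model in PRICE_PER_MILLION:
        print(f"{model}: ${monthly_cost(model, volume):,.2f}")
```

For the 500M-token example this works out to about $50 for ada-002 and $10 for text-embedding-3-small, which is the 5× gap cited above.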

Takeaways

  • Request markdown at the fetch layer. Post-processing HTML is expensive and lossy; let the scraping API handle conversion before the content hits your pipeline.
  • Split on structure, not character count. MarkdownHeaderTextSplitter preserves the semantic unit of documentation. Add a token-limit fallback for oversized sections.
  • Cache at two levels. ETag-based URL caching prevents unnecessary re-scraping; chunk-level SHA-256 deduplication prevents unnecessary re-embedding. Together they cut ongoing costs by 90%+ on stable knowledge bases.
  • Match worker concurrency to your API tier. Excess parallelism hits rate limits and burns retries; it does not improve throughput.
  • Pick the right embedding model. text-embedding-3-small is the sensible default for hosted inference. Local BGE-M3 or nomic-embed-text is the right choice above ~500M tokens/month.

Frequently Asked Questions

Markdown is the best output format for RAG pipelines. It reduces token count by 60–80% compared to raw HTML while preserving document structure — headings, lists, and code blocks — that directly improves chunking quality and retrieval precision.
Use a scraping API with built-in anti-bot bypass rather than raw HTTP clients. Implementing bypass from scratch requires maintaining browser fingerprints, proxy rotation, and challenge solvers — a significant ongoing maintenance burden that grows as detection systems update.
For structured content like documentation or wikis, use heading-based chunking (MarkdownHeaderTextSplitter in LangChain). Combine it with ETag-based URL caching and SHA-256 chunk deduplication to avoid re-embedding unchanged content — this cuts ongoing embedding costs by 90%+ for stable knowledge bases.