
Web Scraping Pipeline for LLM & RAG: Clean Markdown

Build a cost-effective web scraping pipeline that outputs clean markdown for LLM and RAG apps. Covers anti-bot bypass, heading-aware chunking, and ETag caching.

Yash Dubey · March 25, 2026

The biggest quality problem in RAG pipelines isn't the embedding model or the vector store — it's the input data. Raw HTML fed into a chunker produces token-heavy garbage: navigation menus, cookie banners, inline styles, and <script> blocks that dilute every embedding you generate. Clean markdown eliminates 60–80% of that noise before a single token reaches your LLM.

This post walks through a production-ready pipeline: fetch pages with bot-bypass-aware scraping, convert to structured markdown, chunk on heading boundaries, and cache aggressively to control cost.

Why Markdown Beats Raw HTML for LLM Inputs

A typical documentation page runs 8,000+ tokens as raw HTML and around 1,200 tokens as clean markdown. That gap matters at three stages:

Chunking: HTML chunkers split on character count, slicing mid-tag, mid-sentence, and mid-function. Markdown respects the document's own semantic structure.
Retrieval precision: Boilerplate (<nav>, <footer>, repeated header text) bleeds into embedding space and degrades cosine similarity scores on meaningful content.
LLM context windows: Smaller, cleaner chunks mean more retrieved context fits in the prompt window without exceeding token limits.
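To get a rough feel for how much of a page is boilerplate, the sketch below strips a few noisy elements with the standard library and reports what share of the raw HTML is actual content text. The tag list and the sample page are illustrative assumptions, and the ratio is measured in characters rather than tokens, so treat the number as an approximation only.

```python
from html.parser import HTMLParser

# Tags whose contents rarely carry article text (an illustrative choice).
NOISE_TAGS = {"script", "style", "nav", "footer", "header"}


class ContentTextExtractor(HTMLParser):
    """Collect text that sits outside NOISE_TAGS elements."""

    def __init__(self) -> None:
        super().__init__()
        self._noise_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a noisy element.
        if self._noise_depth == 0 and data.strip():
            self.parts.append(data.strip())


def content_ratio(html: str) -> float:
    """Fraction of the raw HTML's characters that is content text."""
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    return len(text) / len(html) if html else 0.0


# Hypothetical sample page, small enough to read at a glance.
SAMPLE = (
    "<html><head><style>body{margin:0}</style></head><body>"
    "<nav><a href='/'>Home</a><a href='/docs'>Docs</a></nav>"
    "<main><h1>Auth</h1><p>Send the key in a header.</p></main>"
    "<footer>Copyright</footer><script>track()</script></body></html>"
)
print(f"content share of raw HTML: {content_ratio(SAMPLE):.0%}")
```

Real pages tend to carry far more markup, inline styles, and scripts than this toy sample, which is why the gap grows rather than shrinks in practice.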

The fix is to request markdown output at the fetch layer, not post-process HTML downstream.

Pipeline Architecture

Step 1: Fetching Pages with Anti-Bot Bypass

Most production scraping targets — documentation sites, knowledge bases, e-commerce, news publishers — run Cloudflare or similar bot detection. A raw requests.get() returns a 403 or a JS challenge page, neither of which contains your content.

The AlterLab anti-bot bypass API handles Cloudflare, Datadome, and CAPTCHA challenges transparently. You send a URL and receive content. No fingerprint maintenance, no proxy rotation code, no challenge-solving logic on your end.

Here's the cURL equivalent to verify the endpoint before writing any application code:

Bash

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com/reference/api",
    "output_format": "markdown",
    "js_render": true
  }'

The output_format: "markdown" parameter is the critical lever. Instead of receiving an HTML blob, you get a pre-processed markdown document with headings, fenced code blocks, and lists intact — ready for a splitter.
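The same request can be assembled in Python with only the standard library. This is a sketch that mirrors the cURL call above: the endpoint, header name, and body fields are taken from that example, while the helper name build_scrape_request is hypothetical. The response shape isn't documented in this post, so the sketch stops at building the request rather than guessing how to parse the reply.

```python
import json
import urllib.request

SCRAPE_ENDPOINT = "https://api.alterlab.io/v1/scrape"  # from the cURL example


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for markdown output."""
    payload = {"url": url, "output_format": "markdown", "js_render": True}
    return urllib.request.Request(
        SCRAPE_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_scrape_request("https://docs.example.com/reference/api", "YOUR_KEY")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=30).read()
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before spending a scrape credit.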

Step 2: The Full Python Pipeline

The Python SDK ships with a batteries-included client. Install dependencies, then wire up the complete ingest flow:

Bash

pip install alterlab langchain-text-splitters langchain-openai chromadb tiktoken

Python

import os
import alterlab
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
import chromadb

EMBED_MODEL = "text-embedding-3-small"                     # 5× cheaper than ada-002

# Initialize clients
scraper = alterlab.Client(os.environ["ALTERLAB_API_KEY"])
embeddings = OpenAIEmbeddings(model=EMBED_MODEL)
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection("rag_docs")   # persistent vector store


def fetch_as_markdown(url: str) -> str:
    """Fetch URL and return clean markdown. Handles JS, bot challenges, redirects."""
    response = scraper.scrape(                              # anti-bot bypass + JS render
        url=url,
        output_format="markdown",
        js_render=True,
        wait_for_selector="main, article, [role='main']",  # target content, not chrome
        timeout=30,
    )
    if not response.success:
        raise RuntimeError(f"Scrape failed [{response.status_code}]: {url}")
    return response.text


def chunk_markdown(markdown: str, source_url: str) -> list[dict]:
    """
    Split on heading boundaries, not character count.
    Preserves the semantic unit of each documentation section.
    """
    splitter = MarkdownHeaderTextSplitter(                  # respects document hierarchy
        headers_to_split_on=[
            ("#",   "h1"),
            ("##",  "h2"),
            ("###", "h3"),
        ],
        strip_headers=False,                               # keep heading in chunk text
    )
    docs = splitter.split_text(markdown)
    return [
        {"content": doc.page_content, "metadata": {**doc.metadata, "source_url": source_url}}
        for doc in docs
        if len(doc.page_content.strip()) > 80             # discard stub/empty sections
    ]


def ingest_url(url: str) -> int:
    """Full pipeline: fetch → chunk → embed → upsert. Returns new chunk count."""
    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)

    texts = [c["content"] for c in chunks]                 # extract text for batch embed
    vectors = embeddings.embed_documents(texts)            # single batched API call
    ids = [f"{url}#{i}" for i in range(len(chunks))]

    collection.upsert(                                     # idempotent — safe to re-run
        ids=ids,
        embeddings=vectors,
        documents=texts,
        metadatas=[c["metadata"] for c in chunks],
    )
    return len(chunks)


def query(question: str, n_results: int = 5) -> list[dict]:
    """Retrieve top-k nearest chunks (Chroma's default metric is squared L2, not cosine)."""
    q_vector = embeddings.embed_query(question)
    results = collection.query(query_embeddings=[q_vector], n_results=n_results)
    return [
        {"text": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


if __name__ == "__main__":
    urls = [
        "https://docs.example.com/reference/authentication",
        "https://docs.example.com/reference/rate-limiting",
        "https://docs.example.com/guides/quickstart",
    ]
    for url in urls:
        n = ingest_url(url)
        print(f"Ingested {n} chunks ← {url}")

    hits = query("How do I authenticate API requests?")
    for h in hits:
        print(f"[{h['distance']:.3f}] {h['metadata'].get('h2', '')} — {h['text'][:100]}...")

Step 3: Chunking Strategy

Character-count chunking (RecursiveCharacterTextSplitter with chunk_size=1000) is fine for prose, but it breaks code-heavy documentation mid-function and splits conceptually related content across chunk boundaries. The right splitter depends on content type.

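For documentation ingestion, heading-based splitting is the default choice. Add a RecursiveCharacterTextSplitter as a fallback to cap any single chunk at 6,000 tokens, because some reference pages have very long sections under a single heading. Note that chunk_size counts characters by default; build the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) if the cap should be measured in tokens.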

Step 4: Cost Optimization with Caching

The main cost levers are scraping requests, embedding tokens, and storage queries. The first two can be cut substantially with two caching layers:

Layer 1 — URL-level ETag caching: Most documentation and knowledge-base content is stable. Store (url → etag) after each fetch. On subsequent runs, issue a HEAD request first; if the ETag or Last-Modified header is unchanged, skip the scrape entirely.

Layer 2 — Chunk-level deduplication: Before embedding, SHA-256 hash each chunk's text. Check the vector store for that ID. If it exists, skip the embed call. This is the bigger cost saver for pipelines that re-ingest on a schedule.

Python

import hashlib
import json
from pathlib import Path

CACHE_FILE = Path(".scrape_cache.json")


def load_cache() -> dict:
    """Load persisted ETag map from disk."""
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())   # url → etag mapping
    return {}

def save_cache(cache: dict) -> None:
    CACHE_FILE.write_text(json.dumps(cache, indent=2))

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def ingest_url_cached(url: str) -> int:
    cache = load_cache()

    head = scraper.head(url)                            # lightweight ETag check
    etag = head.headers.get("etag") or head.headers.get("last-modified", "")
    if cache.get(url) == etag and etag:
        print(f"Skip {url} — content unchanged")
        return 0

    markdown = fetch_as_markdown(url)
    chunks = chunk_markdown(markdown, url)

    new_chunks = []
    for chunk in chunks:
        h = chunk_hash(chunk["content"])
        existing = collection.get(ids=[h])              # check store before embedding
        if not existing["ids"]:
            new_chunks.append((h, chunk))

    if new_chunks:
        ids, texts, metas = zip(*[
            (h, c["content"], c["metadata"]) for h, c in new_chunks
        ])
        vectors = embeddings.embed_documents(list(texts))   # embed only new chunks
        collection.upsert(
            ids=list(ids),
            embeddings=vectors,
            documents=list(texts),
            metadatas=list(metas),
        )

    cache[url] = etag
    save_cache(cache)
    return len(new_chunks)
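To see the chunk-level deduplication in isolation, here is a sketch that runs without any external services. A plain dict stands in for the vector store and CountingEmbedder is a hypothetical stand-in for the paid embedding call; both are assumptions made for illustration, not part of the pipeline above.

```python
import hashlib


def chunk_id(text: str) -> str:
    """Content-addressed ID: identical text always maps to the same ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


class CountingEmbedder:
    """Hypothetical stand-in for a paid embedding call; counts texts embedded."""

    def __init__(self) -> None:
        self.embedded = 0

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        self.embedded += len(texts)
        return [[float(len(t))] for t in texts]  # dummy one-dimensional vectors


def ingest(chunks: list[str], store: dict, embedder: CountingEmbedder) -> int:
    """Embed and store only chunks whose content hash is not already in the store."""
    new: dict[str, str] = {}
    for text in chunks:
        cid = chunk_id(text)
        if cid not in store and cid not in new:
            new[cid] = text
    if new:
        vectors = embedder.embed_documents(list(new.values()))
        for (cid, text), vec in zip(new.items(), vectors):
            store[cid] = (text, vec)
    return len(new)


store: dict = {}
embedder = CountingEmbedder()
first = ingest(["alpha section", "beta section"], store, embedder)
second = ingest(["alpha section", "beta section (edited)"], store, embedder)
print(first, second, embedder.embedded)  # prints: 2 1 3
```

On the second run only the edited chunk is embedded, which is where the savings come from when a scheduled re-ingest finds that most of a page is unchanged.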

~80% token reduction vs. raw HTML

5× cost saving: text-embedding-3-small vs. ada-002

~90% embed cost saved via caching

Step 5: Scaling with a Task Queue

For single-user tooling, the synchronous ingest above is sufficient. For pipelines ingesting thousands of URLs on a schedule, parallelize with Celery and Redis:

Python

from celery import Celery, group
from cached_ingest import ingest_url_cached

app = Celery("rag_ingest", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3)                           # retry on transient 429/503
def ingest_task(self, url: str) -> dict:
    try:
        count = ingest_url_cached(url)
        return {"url": url, "new_chunks": count}
    except Exception as exc:
        # exponential backoff: 30s, 60s, 120s
        raise self.retry(exc=exc, countdown=30 * 2 ** self.request.retries)

if __name__ == "__main__":
    # Dispatch a full sitemap. parse_sitemap is not defined in this post;
    # it is assumed to return a list of page URLs.
    urls = parse_sitemap("https://docs.example.com/sitemap.xml")
    job = group(ingest_task.s(url) for url in urls)
    result = job.apply_async()

Keep worker concurrency aligned to your scraping API plan. AlterLab's pricing plans scale with concurrent connections — over-parallelizing wastes retries; under-parallelizing wastes wall time.
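The dispatch snippet calls parse_sitemap, which this post doesn't define. A minimal standard-library sketch might look like the following. It assumes a plain urlset sitemap in the sitemaps.org namespace and does not follow sitemap index files; parse_sitemap_xml is a hypothetical helper split out so the parsing can be exercised without a network call.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_xml(xml_bytes: bytes) -> list[str]:
    """Return every <loc> URL listed in a <urlset> sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


def parse_sitemap(sitemap_url: str, timeout: float = 30.0) -> list[str]:
    """Fetch a sitemap over HTTP and return the page URLs it lists."""
    with urllib.request.urlopen(sitemap_url, timeout=timeout) as resp:
        return parse_sitemap_xml(resp.read())


# Inline sample so the parser can be checked offline.
SAMPLE_SITEMAP = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://docs.example.com/guides/quickstart</loc></url>"
    b"<url><loc> https://docs.example.com/reference/authentication </loc></url>"
    b"</urlset>"
)
print(parse_sitemap_xml(SAMPLE_SITEMAP))
```

If the sitemap sits behind the same bot protection as the pages themselves, fetching it through the scraping API instead of urllib is probably the safer route.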

Handling Edge Cases

JavaScript-heavy SPAs: Set js_render: true and use wait_for_selector targeting the content container (main, article, [role="main"]). Waiting on body fires before React or Vue hydrates the actual content.

Pagination: After fetching, parse rel="next" link tags from the response metadata and enqueue subsequent pages in the same Celery task group. Store (canonical_url, page_number) as the vector ID to avoid collisions.

PDF and binary content: Check the response's content_type field before processing. If it's not text/html, route to a dedicated PDF extraction path (pdfplumber, pymupdf) rather than the markdown pipeline.
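A small routing function keeps that check in one place. This is a sketch under assumptions: the route names are hypothetical, and treating application/xhtml+xml as HTML and skipping everything that is neither HTML nor PDF are choices made here for illustration, not behaviour described above.

```python
def route_for(content_type: str) -> str:
    """Pick a processing path from a Content-Type value such as 'text/html; charset=utf-8'."""
    # Drop parameters like charset and normalize case before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in {"text/html", "application/xhtml+xml"}:
        return "markdown"  # normal heading-aware pipeline
    if media_type == "application/pdf":
        return "pdf"       # hand off to a PDF extractor (pdfplumber, pymupdf)
    return "skip"          # images, archives, and other binaries


for value in ("text/html; charset=utf-8", "application/pdf", "image/png"):
    print(value, "->", route_for(value))
```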

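Oversized sections: A single API reference page converted to markdown can still produce chunks exceeding the 8,192-token embedding limit. Add a token-count guard after heading splitting and re-chunk any oversized section with a RecursiveCharacterTextSplitter capped at 6,000 tokens as a fallback. Because chunk_size counts characters by default, construct it with RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=6000) to enforce a token cap.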

Dynamic content that isn't indexed: Some content (login-gated pages, single-page apps loading data via authenticated XHR) won't yield useful markdown regardless of JS rendering. Identify these early in the pipeline and route them to session-based scraping or data-export APIs where they exist.

Choosing an Embedding Model

The embedding model choice has a larger cost impact than most engineers expect:

Model	Dimensions	Cost / 1M tokens	MTEB Score
`text-embedding-ada-002`	1,536	$0.10	61.0
`text-embedding-3-small`	1,536	$0.02	62.3
`text-embedding-3-large`	3,072	$0.13	64.6
`nomic-embed-text` (local)	768	$0.00	62.4
`BGE-M3` (local)	1,024	$0.00	63.8

For most RAG use cases, text-embedding-3-small matches or beats ada-002 at 5× lower cost. For air-gapped or high-volume deployments, a locally hosted model eliminates per-token costs entirely at the price of infrastructure overhead.
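The per-token prices translate into monthly spend with simple arithmetic. The sketch below uses the hosted-model prices from the table; the 100M tokens/month volume is an arbitrary example figure, and local models are left out because their cost is infrastructure rather than per-token.

```python
# USD per 1M tokens, copied from the table above.
PRICE_PER_MILLION_USD = {
    "text-embedding-ada-002": 0.10,
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Embedding spend in USD for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_USD[model]


TOKENS = 100_000_000  # example volume, not a figure from the article
for model in PRICE_PER_MILLION_USD:
    print(f"{model}: ${monthly_cost(model, TOKENS):,.2f}/month")
```

Any embedding calls avoided through caching reduce the token volume fed into this calculation, so the two savings multiply.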

Takeaways

Request markdown at the fetch layer. Post-processing HTML is expensive and lossy; let the scraping API handle conversion before the content hits your pipeline.
Split on structure, not character count. MarkdownHeaderTextSplitter preserves the semantic unit of documentation. Add a token-limit fallback for oversized sections.
Cache at two levels. ETag-based URL caching prevents unnecessary re-scraping; chunk-level SHA-256 deduplication prevents unnecessary re-embedding. Together they cut ongoing costs by 90%+ on stable knowledge bases.
Match worker concurrency to your API tier. Excess parallelism hits rate limits and burns retries; it does not improve throughput.
Pick the right embedding model. text-embedding-3-small is the default right choice for hosted inference. Local BGE-M3 or nomic-embed-text is the right choice above ~500M tokens/month.

Frequently Asked Questions

Markdown is the best output format for RAG pipelines. It reduces token count by 60–80% compared to raw HTML while preserving document structure — headings, lists, and code blocks — that directly improves chunking quality and retrieval precision.

Use a scraping API with built-in anti-bot bypass rather than raw HTTP clients. Implementing bypass from scratch requires maintaining browser fingerprints, proxy rotation, and challenge solvers — a significant ongoing maintenance burden that grows as detection systems update.

For structured content like documentation or wikis, use heading-based chunking (MarkdownHeaderTextSplitter in LangChain). Combine it with ETag-based URL caching and SHA-256 chunk deduplication to avoid re-embedding unchanged content — this cuts ongoing embedding costs by 90%+ for stable knowledge bases.
