
Feed Clean Web Data to RAG Pipelines Without Wasting LLM Tokens
Stop feeding raw HTML to your RAG pipeline. Learn how to extract clean, structured web data that cuts LLM token usage by 90% and improves retrieval accuracy.
April 4, 2026
Raw HTML is the worst possible input for a RAG pipeline. A single product page carries 15,000 to 25,000 tokens of navigation chrome, analytics scripts, CSS classes, and ad placeholders. Your embedding model processes all of it. Your vector store stores all of it. Your retrieval step searches through all of it.
You pay for every token.
The fix is straightforward: extract only the content that matters before it reaches your embedding model. Strip the noise. Keep the signal. Structure it so retrieval actually works.
Here is how to build that pipeline.
The Token Math Behind Dirty Web Data
A typical e-commerce product page breaks down like this:
- Product title, description, specs: ~800 tokens
- Navigation menus, footer, sidebar: ~3,000 tokens
- JavaScript bundles, tracking pixels, ad scripts: ~8,000 tokens
- CSS class names, inline styles, layout divs: ~4,000 tokens
- Schema markup, meta tags, Open Graph: ~1,200 tokens
Your RAG pipeline cares about the first line. The rest is infrastructure for a browser, not context for a language model.
When you embed raw HTML, the noise drowns out the signal. Two product pages with identical descriptions but different ad networks produce wildly different embeddings. Retrieval quality drops. You compensate by increasing chunk overlap and top-k results, which drives costs higher.
Extract clean content first. Embed only what matters.
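The breakdown above makes the waste easy to quantify. A quick sketch, using the per-section token estimates from the list (estimates, not measurements):

```python
# Approximate token breakdown of a typical product page (from the list above)
page_tokens = {
    "product_content": 800,       # title, description, specs
    "nav_footer_sidebar": 3_000,
    "scripts_and_tracking": 8_000,
    "css_and_layout": 4_000,
    "meta_and_schema": 1_200,
}

total = sum(page_tokens.values())
signal = page_tokens["product_content"]
waste_pct = round(100 * (total - signal) / total, 1)

print(f"{total} tokens total, {signal} useful -> {waste_pct}% waste")
```

Roughly 95 percent of what you embed from a raw page is noise, and every downstream stage pays for it again.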
Step 1: Get Clean Content at the Source
The most efficient place to strip noise is during extraction, not after. Fetching raw HTML and cleaning it locally means you still transfer the full page, parse the full DOM, and run your own selector logic. Doing it server-side through a scraping API cuts the work in half.
Here is the same operation using the Python SDK and a direct cURL call. Both request Markdown output instead of raw HTML.
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://example.com/product/12345",
    formats=["markdown"]
)

print(response.markdown)

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/12345",
    "formats": ["markdown"]
  }'

The response arrives as clean Markdown. No HTML tags. No script blocks. Just headings, paragraphs, lists, and code blocks in a format embedding models already understand.
For sites that require JavaScript rendering, set min_tier=3 to skip the basic HTTP fetcher and go straight to a headless browser. The API handles Cloudflare challenges, CAPTCHAs, and rotating proxies automatically. You get the rendered content without managing browser instances.
Step 2: Structure Data for Retrieval, Not Display
Markdown output works well for articles, documentation, and blog posts. But product pages, job listings, and pricing tables need structure. A flat text blob loses the relationships between fields.
Use Cortex AI extraction to pull structured data directly from the page. You describe what you want in plain English. The API returns JSON.
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://example.com/jobs",
    cortex={
        "prompt": "Extract all job listings. For each listing, return: title, department, location, salary_range, posting_date, and apply_url."
    },
    formats=["json"]
)

for job in response.json["listings"]:
    print(f"{job['title']} - {job['location']} ({job['salary_range']})")

The JSON output maps directly to your embedding pipeline. Each job listing becomes a single document with typed fields. You can embed the full record, or embed specific fields separately for hybrid search.
Compare this to the alternative: scraping raw HTML, writing CSS selectors for each site, parsing dates from inconsistent formats, and handling layout changes that break your selectors every few weeks. Cortex handles the variation. You get consistent JSON regardless of how the page renders.
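Before embedding, each extracted record needs to be flattened into a single text document. A minimal sketch, using the field names from the Cortex prompt above (the helper name and field ordering are illustrative):

```python
def job_to_document(job: dict) -> str:
    """Flatten one extracted job listing into a single embeddable document.

    Field names match the Cortex prompt above; missing fields are skipped
    rather than embedded as empty strings. apply_url is kept as metadata,
    not embedded, since URLs add noise to the vector.
    """
    order = ["title", "department", "location", "salary_range", "posting_date"]
    lines = [f"{field.replace('_', ' ')}: {job[field]}"
             for field in order if job.get(field)]
    return "\n".join(lines)

job = {
    "title": "Data Engineer",
    "department": "Platform",
    "location": "Remote",
    "salary_range": "$140k-$170k",
    "posting_date": "2026-03-30",
    "apply_url": "https://example.com/jobs/123",
}
print(job_to_document(job))
```

Keeping field labels in the embedded text ("location: Remote" rather than bare "Remote") helps retrieval distinguish a location match from a title match.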
Step 3: Chunk Strategically
Clean content solves the noise problem. Chunking strategy solves the retrieval problem.
Bad chunking cuts sentences in half. It splits tables across chunks. It separates a heading from the paragraphs it governs. Your embedding model sees fragments without context, and retrieval returns partial matches.
Good chunking respects document structure. Markdown makes this straightforward.
import re
from typing import List

def chunk_markdown(text: str, max_tokens: int = 500) -> List[str]:
    chunks = []
    # Split on level-2 headings so each section stays with its heading
    sections = re.split(r'\n## ', text)
    for section in sections:
        if not section.strip():
            continue
        if "\n" in section:
            heading, body = section.split("\n", 1)
        else:
            heading, body = section, ""
        current_chunk = f"## {heading}\n" if heading else ""
        for para in body.split("\n\n"):
            # ~4 characters per token; flush before the chunk overflows
            if len(current_chunk) + len(para) > max_tokens * 4:
                chunks.append(current_chunk.strip())
                # Re-attach the heading so every chunk carries its context
                current_chunk = f"## {heading}\n" if heading else ""
            current_chunk += para + "\n\n"
        if current_chunk.strip():
            chunks.append(current_chunk.strip())
    return chunks

This approach keeps headings attached to their content. It respects paragraph boundaries. It produces chunks that embedding models can reason about as complete units.
The token estimate uses a 4:1 character-to-token ratio for planning. Your embedding provider's tokenizer gives exact counts. Use that for production.
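Until you wire in your provider's tokenizer, the planning heuristic fits in one helper:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Planning-grade estimate: ~4 characters per English token.

    Swap this for your embedding provider's tokenizer in production;
    exact counts matter once you bill by the token.
    """
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("x" * 2000))  # → 500
```

The 4:1 ratio holds reasonably well for English prose and Markdown; code and non-Latin scripts tokenize denser, so re-check the ratio if your corpus is heavy on either.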
Step 4: Build the Ingestion Pipeline
Tie extraction, cleaning, chunking, and embedding together. The pipeline should handle three scenarios:
- Initial index: Scrape a list of URLs, extract clean content, chunk, embed, store.
- Incremental update: Monitor pages for changes. Re-extract and re-embed only what changed.
- Scheduled refresh: Run on a cron to catch pages that changed without triggering monitoring alerts.
from alterlab import AlterLab
from datetime import datetime, timezone

client = AlterLab(api_key="YOUR_API_KEY")

def ingest_page(url: str, embedding_fn):
    response = client.scrape(
        url=url,
        formats=["markdown"],
        min_tier=3
    )
    if not response.markdown:
        return
    chunks = chunk_markdown(response.markdown)
    for i, chunk in enumerate(chunks):
        vector = embedding_fn(chunk)
        # store_vector is your vector store's upsert, keyed on URL + chunk index
        store_vector(url, i, chunk, vector, datetime.now(timezone.utc))

def ingest_batch(urls: list, embedding_fn):
    for url in urls:
        try:
            ingest_page(url, embedding_fn)
        except Exception as e:
            print(f"Failed {url}: {e}")

For incremental updates, use the monitoring feature. Set up watchers on your indexed URLs. When content changes, the API notifies you via webhook. You re-run ingest_page for that URL only. No full re-index required.
client.monitor(
    url="https://example.com/pricing",
    schedule="0 9 * * 1",
    webhook="https://your-server.com/webhooks/alterlab",
    diff=True
)

The webhook payload includes a diff showing what changed. You can decide whether the change warrants a re-embedding. A price update does. A typo fix in the footer does not.
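That decision can be as small as a keyword filter over the changed lines. A sketch, assuming the payload exposes the changed lines under a `diff` key (check the webhook documentation for the actual payload shape):

```python
# Regions whose changes rarely justify re-embedding
BOILERPLATE_MARKERS = ("footer", "copyright", "cookie", "newsletter")

def should_reembed(payload: dict) -> bool:
    """Re-embed only when at least one changed line carries real content."""
    for line in payload.get("diff", []):
        text = line.lower()
        if not any(marker in text for marker in BOILERPLATE_MARKERS):
            return True  # substantive change found
    return False

print(should_reembed({"diff": ["- Pro plan: $49/mo", "+ Pro plan: $59/mo"]}))  # True
print(should_reembed({"diff": ["+ © 2026 Example Inc. (footer)"]}))            # False
```

A keyword filter is crude; if false positives get expensive, score the diff against the embedded chunks and re-embed only when similarity drops below a threshold.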
Step 5: Handle Anti-Bot Pages Without Infrastructure
Many sites you want to index block automated requests. Cloudflare challenges, CAPTCHAs, rate limits. Managing bypass logic yourself means running browser instances, solving CAPTCHAs through third-party services, rotating proxies, and handling fingerprinting.
That infrastructure costs more than the scraping itself.
Use tiered scraping to handle this automatically. Start with a lightweight HTTP request. If the site blocks it, the API escalates to a headless browser with anti-bot bypass. You set the floor with min_tier to skip the probing phase for sites you know are protected.
response = client.scrape(
    url="https://protected-site.com/data",
    min_tier=3,
    formats=["markdown"]
)

print(response.status)
print(response.markdown[:500])

Tier 1 handles simple static pages. Tier 3 adds JavaScript rendering and anti-bot bypass. Tier 5 includes CAPTCHA solving. The API picks the right tier for each URL. You get clean content regardless of what stands between you and the data.
Cost Breakdown
Token waste compounds across three stages of a RAG pipeline:
Embedding: You pay per token sent to the embedding model. Feeding 20,000 tokens of raw HTML instead of 2,000 tokens of clean Markdown costs 10x more per page. Index 10,000 pages and the difference is measurable.
Storage: Vector databases charge by dimension count and record volume. Storing embeddings for noise chunks wastes space. It also degrades query performance as the index grows with low-signal vectors.
Retrieval: Each query searches the entire index. A bloated index with noisy chunks returns worse results. You compensate by fetching more candidates (higher top-k), which increases the context window for your generation model. That costs more per query.
Clean extraction at the source addresses all three. Smaller chunks. Better embeddings. Faster retrieval. Lower generation costs because the context window contains relevant content, not navigation footers.
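The embedding stage alone is easy to put numbers on. A worked sketch, assuming a hypothetical embedding price of $0.02 per million tokens and the page sizes discussed above:

```python
PRICE_PER_M_TOKENS = 0.02   # assumed embedding price, $ per 1M tokens
PAGES = 10_000

raw_tokens_per_page = 20_000    # raw HTML
clean_tokens_per_page = 2_000   # clean Markdown

raw_cost = PAGES * raw_tokens_per_page / 1_000_000 * PRICE_PER_M_TOKENS
clean_cost = PAGES * clean_tokens_per_page / 1_000_000 * PRICE_PER_M_TOKENS

print(f"raw HTML: ${raw_cost:.2f}, clean Markdown: ${clean_cost:.2f}")
```

The 10x gap is only the embedding stage; storage and retrieval multiply the same ratio again, and re-indexing on every refresh multiplies it over time.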
When to Use Each Output Format
Markdown: Articles, documentation, blog posts, help centers. Any page where the content flows as prose with headings and lists. This is your default for knowledge base ingestion.
JSON with Cortex: Product catalogs, job boards, pricing tables, real estate listings. Any page with repeating structured elements. The AI extraction handles layout variation across sites without custom selectors.
Plain text: Simple pages with minimal formatting. API response pages. Status pages. Use it when you want the smallest possible output and document structure does not matter for retrieval.
HTML: Rarely. Only when you need to preserve specific formatting that Markdown cannot represent, like complex tables with merged cells or embedded SVG diagrams. Most RAG pipelines do not need this.
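The four rules above reduce to a small dispatch table. The page-type labels here are illustrative, not an API enum:

```python
# Output format per page type, following the guidance above
FORMAT_BY_PAGE_TYPE = {
    "article": ["markdown"],
    "documentation": ["markdown"],
    "product_catalog": ["json"],   # pair with a Cortex prompt
    "job_board": ["json"],
    "status_page": ["text"],
}

def choose_formats(page_type: str) -> list:
    # Markdown is the safe default for unknown page types
    return FORMAT_BY_PAGE_TYPE.get(page_type, ["markdown"])

print(choose_formats("job_board"))     # ['json']
print(choose_formats("landing_page"))  # ['markdown']
```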
Putting It Together
A production RAG ingestion pipeline looks like this:
- Maintain a URL registry with metadata (category, last indexed, change hash).
- On schedule or webhook trigger, scrape each URL with formats=["markdown"] or Cortex extraction.
- Chunk the output using structure-aware splitting.
- Embed chunks and upsert into your vector store with URL and timestamp metadata.
- Monitor URLs for changes. Re-index only what changed.
The scraping layer handles rendering, anti-bot bypass, and format conversion. Your pipeline handles chunking, embedding, and storage. Clean separation. Each layer does one job well.
Check the Python SDK documentation for the full API reference, including webhook configuration and scheduling options. The quickstart guide covers account setup and your first API call.
Takeaway
Raw HTML wastes tokens on infrastructure code that embedding models cannot use. Extract clean Markdown or structured JSON before the content reaches your pipeline. Chunk with respect to document boundaries. Monitor for changes and re-index incrementally.
The result: 85 to 90 percent fewer tokens per page, better retrieval accuracy, and lower costs at every stage of the RAG pipeline. The scraping API handles rendering and anti-bot bypass. You handle the data.