
Feed Clean Web Data to RAG Pipelines Without Wasting LLM Tokens
Stop feeding raw HTML to your RAG pipeline. Learn how to extract clean, structured web data that cuts LLM token usage by 90% and improves retrieval accuracy.
April 4, 2026
Raw HTML is the worst possible input for a RAG pipeline. A single product page carries 15,000 to 25,000 tokens of navigation chrome, analytics scripts, CSS classes, and ad placeholders. Your embedding model processes all of it. Your vector store stores all of it. Your retrieval step searches through all of it.
You pay for every token.
The fix is straightforward: extract only the content that matters before it reaches your embedding model. Strip the noise. Keep the signal. Structure it so retrieval actually works.
Here is how to build that pipeline.
The Token Math Behind Dirty Web Data
A typical e-commerce product page breaks down like this:
- Product title, description, specs: ~800 tokens
- Navigation menus, footer, sidebar: ~3,000 tokens
- JavaScript bundles, tracking pixels, ad scripts: ~8,000 tokens
- CSS class names, inline styles, layout divs: ~4,000 tokens
- Schema markup, meta tags, Open Graph: ~1,200 tokens
Your RAG pipeline cares about the first line. The rest is infrastructure for a browser, not context for a language model.
When you embed raw HTML, the noise drowns out the signal. Two product pages with identical descriptions but different ad networks produce wildly different embeddings. Retrieval quality drops. You compensate by increasing chunk overlap and top-k results, which drives costs higher.
Extract clean content first. Embed only what matters.
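The breakdown above makes the waste easy to quantify. A quick sketch, using the per-section token estimates from the list (estimates, not measurements):

```python
# Approximate token breakdown of a typical product page (from the list above)
page_tokens = {
    "product_content": 800,       # title, description, specs
    "nav_footer_sidebar": 3_000,
    "scripts_and_tracking": 8_000,
    "css_and_layout": 4_000,
    "meta_and_schema": 1_200,
}

total = sum(page_tokens.values())
signal = page_tokens["product_content"]
waste_pct = round(100 * (total - signal) / total, 1)

print(f"{total} tokens total, {signal} useful -> {waste_pct}% waste")
```

Roughly 95 percent of what you embed from a raw page is noise, and every downstream stage pays for it again.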
Step 1: Get Clean Content at the Source
The most efficient place to strip noise is during extraction, not after. Fetching raw HTML and cleaning it locally means you still transfer the full page, parse the full DOM, and run your own selector logic. Doing it server-side through a scraping API cuts the work in half.
Here is the same operation using the Python SDK and a direct cURL call. Both request Markdown output instead of raw HTML.
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://example.com/product/12345",
    formats=["markdown"]
)

print(response.markdown)

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/12345",
    "formats": ["markdown"]
  }'

The response arrives as clean Markdown. No HTML tags. No script blocks. Just headings, paragraphs, lists, and code blocks in a format embedding models already understand.
For sites that require JavaScript rendering, set min_tier=3 to skip the basic HTTP fetcher and go straight to a headless browser. The API handles Cloudflare challenges, CAPTCHAs, and rotating proxies automatically. You get the rendered content without managing browser instances.
Step 2: Structure Data for Retrieval, Not Display
Markdown output works well for articles, documentation, and blog posts. But product pages, job listings, and pricing tables need structure. A flat text blob loses the relationships between fields.
Use Cortex AI extraction to pull structured data directly from the page. You describe what you want in plain English. The API returns JSON.
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://example.com/jobs",
    cortex={
        "prompt": "Extract all job listings. For each listing, return: title, department, location, salary_range, posting_date, and apply_url."
    },
    formats=["json"]
)

for job in response.json["listings"]:
    print(f"{job['title']} - {job['location']} ({job['salary_range']})")

The JSON output maps directly to your embedding pipeline. Each job listing becomes a single document with typed fields. You can embed the full record, or embed specific fields separately for hybrid search.
Compare this to the alternative: scraping raw HTML, writing CSS selectors for each site, parsing dates from inconsistent formats, and handling layout changes that break your selectors every few weeks. Cortex handles the variation. You get consistent JSON regardless of how the page renders.
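Before embedding, each extracted record needs to be flattened into a single text document. A minimal sketch, using the field names from the Cortex prompt above (the helper name and field ordering are illustrative):

```python
def job_to_document(job: dict) -> str:
    """Flatten one extracted job listing into a single embeddable document.

    Field names match the Cortex prompt above; missing fields are skipped
    rather than embedded as empty strings. apply_url is kept as metadata,
    not embedded, since URLs add noise to the vector.
    """
    order = ["title", "department", "location", "salary_range", "posting_date"]
    lines = [f"{field.replace('_', ' ')}: {job[field]}"
             for field in order if job.get(field)]
    return "\n".join(lines)

job = {
    "title": "Data Engineer",
    "department": "Platform",
    "location": "Remote",
    "salary_range": "$140k-$170k",
    "posting_date": "2026-03-30",
    "apply_url": "https://example.com/jobs/123",
}
print(job_to_document(job))
```

Keeping field labels in the embedded text ("location: Remote" rather than bare "Remote") helps retrieval distinguish a location match from a title match.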
Step 3: Chunk Strategically
Clean content solves the noise problem. Chunking strategy solves the retrieval problem.
Bad chunking cuts sentences in half. It splits tables across chunks. It separates a heading from the paragraphs it governs. Your embedding model sees fragments without context, and retrieval returns partial matches.
Good chunking respects document structure. Markdown makes this straightforward.
import re
from typing import List

def chunk_markdown(text: str, max_tokens: int = 500) -> List[str]:
    chunks = []
    # Split on level-2 headings so each section stays with its heading
    sections = re.split(r'\n## ', text)
    for section in sections:
        if not section.strip():
            continue
        if "\n" in section:
            heading, body = section.split("\n", 1)
        else:
            heading, body = section, ""
        current_chunk = f"## {heading}\n" if heading else ""
        for para in body.split("\n\n"):
            # ~4 characters per token; flush before the chunk overflows
            if len(current_chunk) + len(para) > max_tokens * 4:
                chunks.append(current_chunk.strip())
                # Re-attach the heading so every chunk carries its context
                current_chunk = f"## {heading}\n" if heading else ""
            current_chunk += para + "\n\n"
        if current_chunk.strip():
            chunks.append(current_chunk.strip())
    return chunks

This approach keeps headings attached to their content. It respects paragraph boundaries. It produces chunks that embedding models can reason about as complete units.
The token estimate uses a 4:1 character-to-token ratio for planning. Your embedding provider's tokenizer gives exact counts. Use that for production.
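Until you wire in your provider's tokenizer, the planning heuristic fits in one helper:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Planning-grade estimate: ~4 characters per English token.

    Swap this for your embedding provider's tokenizer in production;
    exact counts matter once you bill by the token.
    """
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("x" * 2000))  # → 500
```

The 4:1 ratio holds reasonably well for English prose and Markdown; code and non-Latin scripts tokenize denser, so re-check the ratio if your corpus is heavy on either.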
Step 4: Build the Ingestion Pipeline
Tie extraction, cleaning, chunking, and embedding together. The pipeline should handle three scenarios:
- Initial index: Scrape a list of URLs, extract clean content, chunk, embed, store.
- Incremental update: Monitor pages for changes. Re-extract and re-embed only what changed.
- Scheduled refresh: Run on a cron to catch pages that changed without triggering monitoring alerts.
from alterlab import AlterLab
from datetime import datetime, timezone

client = AlterLab(api_key="YOUR_API_KEY")

def ingest_page(url: str, embedding_fn):
    response = client.scrape(
        url=url,
        formats=["markdown"],
        min_tier=3
    )
    if not response.markdown:
        return
    chunks = chunk_markdown(response.markdown)
    for i, chunk in enumerate(chunks):
        vector = embedding_fn(chunk)
        # store_vector is your vector store's upsert, keyed on URL + chunk index
        store_vector(url, i, chunk, vector, datetime.now(timezone.utc))

def ingest_batch(urls: list, embedding_fn):
    for url in urls:
        try:
            ingest_page(url, embedding_fn)
        except Exception as e:
            print(f"Failed {url}: {e}")

For incremental updates, use the monitoring feature. Set up watchers on your indexed URLs. When content changes, the API notifies you via webhook. You re-run ingest_page for that URL only. No full re-index required.
client.monitor(
    url="https://example.com/pricing",
    schedule="0 9 * * 1",
    webhook="https://your-server.com/webhooks/alterlab",
    diff=True
)

The webhook payload includes a diff showing what changed. You can decide whether the change warrants a re-embedding. A price update does. A typo fix in the footer does not.
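That decision can be as small as a keyword filter over the changed lines. A sketch, assuming the payload exposes the changed lines under a `diff` key (check the webhook documentation for the actual payload shape):

```python
# Regions whose changes rarely justify re-embedding
BOILERPLATE_MARKERS = ("footer", "copyright", "cookie", "newsletter")

def should_reembed(payload: dict) -> bool:
    """Re-embed only when at least one changed line carries real content."""
    for line in payload.get("diff", []):
        text = line.lower()
        if not any(marker in text for marker in BOILERPLATE_MARKERS):
            return True  # substantive change found
    return False

print(should_reembed({"diff": ["- Pro plan: $49/mo", "+ Pro plan: $59/mo"]}))  # True
print(should_reembed({"diff": ["+ © 2026 Example Inc. (footer)"]}))            # False
```

A keyword filter is crude; if false positives get expensive, score the diff against the embedded chunks and re-embed only when similarity drops below a threshold.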
Step 5: Handle Anti-Bot Pages Without Infrastructure
Many sites you want to index block automated requests. Cloudflare challenges, CAPTCHAs, rate limits. Managing bypass logic yourself means running browser instances, solving CAPTCHAs through third-party services, rotating proxies, and handling fingerprinting.
That infrastructure costs more than the scraping itself.
Use tiered scraping to handle this automatically. Start with a lightweight HTTP request. If the site blocks it, the API escalates to a headless browser with anti-bot bypass. You set the floor with min_tier to skip the probing phase for sites you know are protected.
response = client.scrape(
    url="https://protected-site.com/data",
    min_tier=3,
    formats=["markdown"]
)

print(response.status)
print(response.markdown[:500])

Tier 1 handles simple static pages. Tier 3 adds JavaScript rendering and anti-bot bypass. Tier 5 includes CAPTCHA solving. The API picks the right tier for each URL. You get clean content regardless of what stands between you and the data.
Cost Breakdown
Token waste compounds across three stages of a RAG pipeline:
Embedding: You pay per token sent to the embedding model. Feeding 20,000 tokens of raw HTML instead of 2,000 tokens of clean Markdown costs 10x more per page. Index 10,000 pages and the difference is measurable.
Storage: Vector databases charge by dimension count and record volume. Storing embeddings for noise chunks wastes space. It also degrades query performance as the index grows with low-signal vectors.
Retrieval: Each query searches the entire index. A bloated index with noisy chunks returns worse results. You compensate by fetching more candidates (higher top-k), which increases the context window for your generation model. That costs more per query.
Clean extraction at the source addresses all three. Smaller chunks. Better embeddings. Faster retrieval. Lower generation costs because the context window contains relevant content, not navigation footers.
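The embedding stage alone is easy to put numbers on. A worked sketch, assuming a hypothetical embedding price of $0.02 per million tokens and the page sizes discussed above:

```python
PRICE_PER_M_TOKENS = 0.02   # assumed embedding price, $ per 1M tokens
PAGES = 10_000

raw_tokens_per_page = 20_000    # raw HTML
clean_tokens_per_page = 2_000   # clean Markdown

raw_cost = PAGES * raw_tokens_per_page / 1_000_000 * PRICE_PER_M_TOKENS
clean_cost = PAGES * clean_tokens_per_page / 1_000_000 * PRICE_PER_M_TOKENS

print(f"raw HTML: ${raw_cost:.2f}, clean Markdown: ${clean_cost:.2f}")
```

The 10x gap is only the embedding stage; storage and retrieval multiply the same ratio again, and re-indexing on every refresh multiplies it over time.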
When to Use Each Output Format
Markdown: Articles, documentation, blog posts, help centers. Any page where the content flows as prose with headings and lists. This is your default for knowledge base ingestion.
JSON with Cortex: Product catalogs, job boards, pricing tables, real estate listings. Any page with repeating structured elements. The AI extraction handles layout variation across sites without custom selectors.
Plain text: Simple pages with minimal formatting. API response pages. Status pages. Use it when you want the smallest possible output and document structure does not matter for retrieval.
HTML: Rarely. Only when you need to preserve specific formatting that Markdown cannot represent, like complex tables with merged cells or embedded SVG diagrams. Most RAG pipelines do not need this.
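The four rules above reduce to a small dispatch table. The page-type labels here are illustrative, not an API enum:

```python
# Output format per page type, following the guidance above
FORMAT_BY_PAGE_TYPE = {
    "article": ["markdown"],
    "documentation": ["markdown"],
    "product_catalog": ["json"],   # pair with a Cortex prompt
    "job_board": ["json"],
    "status_page": ["text"],
}

def choose_formats(page_type: str) -> list:
    # Markdown is the safe default for unknown page types
    return FORMAT_BY_PAGE_TYPE.get(page_type, ["markdown"])

print(choose_formats("job_board"))     # ['json']
print(choose_formats("landing_page"))  # ['markdown']
```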
Putting It Together
A production RAG ingestion pipeline looks like this:
- Maintain a URL registry with metadata (category, last indexed, change hash).
- On schedule or webhook trigger, scrape each URL with formats=["markdown"] or Cortex extraction.
- Chunk the output using structure-aware splitting.
- Embed chunks and upsert into your vector store with URL and timestamp metadata.
- Monitor URLs for changes. Re-index only what changed.
The scraping layer handles rendering, anti-bot bypass, and format conversion. Your pipeline handles chunking, embedding, and storage. Clean separation. Each layer does one job well.
Check the Python SDK documentation for the full API reference, including webhook configuration and scheduling options. The quickstart guide covers account setup and your first API call.
Takeaway
Raw HTML wastes tokens on infrastructure code that embedding models cannot use. Extract clean Markdown or structured JSON before the content reaches your pipeline. Chunk with respect to document boundaries. Monitor for changes and re-index incrementally.
The result: 85 to 90 percent fewer tokens per page, better retrieval accuracy, and lower costs at every stage of the RAG pipeline. The scraping API handles rendering and anti-bot bypass. You handle the data.