
Reduce LLM Token Waste in RAG with Markdown
Stop wasting LLM tokens on raw HTML. Learn how to extract dynamically rendered web pages as clean Markdown for efficient, high-quality RAG pipelines.
TL;DR
Feeding raw HTML to Large Language Models wastes tokens on markup, scripts, and styling. By rendering dynamic web pages in a headless browser and converting the final DOM to clean Markdown, you can cut token consumption by 90% or more on markup-heavy pages while preserving semantic structure and improving retrieval accuracy in RAG pipelines.
The Problem: LLMs, Context Windows, and the HTML Tax
Building Retrieval-Augmented Generation (RAG) pipelines over web data introduces a specific data engineering problem. The web is built on HTML. Large Language Models operate on tokens.
When you pass raw HTML to an embedding model or an LLM context window, you pay a steep tax. You pay for <div class="mt-4 flex flex-col justify-center">, <script type="application/json">, SVG paths, and inline CSS. These non-semantic tokens dilute the actual content. They increase latency, exhaust context limits, and drive up API costs.
Worse, this noise degrades your embeddings. When an embedding model processes a chunk of text dominated by CSS classes and HTML attributes, the resulting vector represents the markup structure more heavily than the actual information. This leads to poor retrieval performance. When a user queries your RAG system, the vector database returns chunks based on matching HTML boilerplate rather than semantic relevance.
Why Markdown is the Ideal Intermediate Format
Markdown solves the HTML tax problem. It preserves semantic meaning without the syntactic overhead of HTML. It maintains hierarchical structure through headers, relationships through links, and tabular data through Markdown tables.
A standard product page or a long-form article converted from HTML to Markdown often drops from 50,000 tokens to roughly 3,000. This 94% reduction in token count directly translates to lower inference costs and higher context density.
When you feed clean Markdown into a context window, the LLM processes dense, high-signal information. It pays attention to the data you care about.
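The arithmetic behind that figure can be sketched in a few lines. The token counts come from the example above; the per-token price is a hypothetical placeholder chosen only to make the numbers concrete, not a quoted rate from any provider.

```python
def token_reduction(html_tokens: int, markdown_tokens: int) -> float:
    """Return the fraction of tokens removed by converting HTML to Markdown."""
    if html_tokens <= 0:
        raise ValueError("html_tokens must be positive")
    return (html_tokens - markdown_tokens) / html_tokens


def input_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `tokens` input tokens at a given price per million."""
    return tokens / 1_000_000 * usd_per_million_tokens


# Figures from the example above: 50,000 HTML tokens vs. roughly 3,000 Markdown tokens.
reduction = token_reduction(50_000, 3_000)
print(f"Reduction: {reduction:.0%}")  # Reduction: 94%

# HYPOTHETICAL price, used only for illustration.
PRICE_USD_PER_MILLION = 3.00
saved_per_page = input_cost(50_000, PRICE_USD_PER_MILLION) - input_cost(3_000, PRICE_USD_PER_MILLION)
print(f"Saved per page: ${saved_per_page:.3f}")
```

Multiply the per-page saving by the number of pages you ingest and re-embed to estimate the effect on your own pipeline.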
Consider this raw HTML snippet:
<div class="product-specs">
  <h2 class="text-xl font-bold mb-2">Specifications</h2>
  <ul class="list-disc pl-5">
    <li class="spec-item" data-id="123">Weight: 2.4 lbs</li>
    <li class="spec-item" data-id="124">Battery Life: 12 hours</li>
  </ul>
</div>
Converted to Markdown, it becomes:
## Specifications
- Weight: 2.4 lbs
- Battery Life: 12 hours
The Markdown version contains the exact same information but requires a fraction of the tokens. The LLM understands the header and the list items natively.
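To make the conversion concrete, here is a minimal sketch that uses only Python's standard library. It handles just headings and list items and discards every attribute; `TinyMarkdownConverter` and `html_to_markdown` are illustrative names, and a real pipeline would more likely rely on a full converter such as html2text or turndown.

```python
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Converts headings and list items to Markdown; drops all attributes."""

    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- "}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._prefix = None  # Markdown prefix of the element being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Attributes (class, data-id, ...) are ignored on purpose.
        if tag in self.PREFIXES:
            self._prefix = self.PREFIXES[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._prefix is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.PREFIXES and self._prefix is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.lines.append(self._prefix + text)
            self._prefix = None


def html_to_markdown(html: str) -> str:
    parser = TinyMarkdownConverter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


snippet = """
<div class="product-specs">
  <h2 class="text-xl font-bold mb-2">Specifications</h2>
  <ul class="list-disc pl-5">
    <li class="spec-item" data-id="123">Weight: 2.4 lbs</li>
    <li class="spec-item" data-id="124">Battery Life: 12 hours</li>
  </ul>
</div>
"""
markdown = html_to_markdown(snippet)
print(markdown)
print(f"{len(snippet)} characters of HTML -> {len(markdown)} characters of Markdown")
```

Character counts are only a proxy for token counts, but the gap between the two versions is visible even on a snippet this small.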
The Challenge of Modern Web Rendering
Converting static HTML to Markdown is straightforward using libraries like html2text or turndown. The challenge lies in modern web architecture. Most single-page applications (SPAs) ship an empty <div id="root"> and render content client-side via JavaScript.
If you fetch these pages with a standard HTTP client like requests in Python or curl in bash, your Markdown converter will output nothing. You capture the loading state, not the data.
You need a headless browser to execute the JavaScript, wait for the network to idle, and then extract the final computed DOM.
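One way to decide when a plain HTTP fetch is not enough is to measure how much visible text the response contains, and fall back to a browser only for empty shells. The sketch below is an assumption-laden illustration: the 200-character threshold and the function names are hypothetical, and `render_with_browser` assumes Playwright for Python is installed along with a Chromium build.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside <script>, <style>, and <noscript> elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def needs_browser_render(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests a client-rendered shell."""
    return len(visible_text(html)) < min_chars


def render_with_browser(url: str) -> str:
    """Return the DOM after JavaScript has run (requires Playwright)."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html


spa_shell = (
    '<html><body><div id="root"></div>'
    '<script>window.__DATA__ = {"a": 1};</script></body></html>'
)
print(needs_browser_render(spa_shell))  # True
```

Running the cheap check first keeps browser sessions for the pages that actually need them.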
Doing this at scale introduces significant infrastructure overhead. You must manage a fleet of headless Chrome instances. You have to handle memory leaks, process crashes, and concurrent execution limits.
Beyond browser management, you face access barriers. Many web servers employ strict rate limiting and automated traffic detection, even for publicly accessible data. Fetching the fully rendered DOM requires robust proxy rotation and systems capable of sophisticated anti-bot handling. If you fail to solve a CAPTCHA or trigger a firewall block, your RAG pipeline starves for data.
Cleaning the DOM Before Conversion
Before generating the Markdown, it is crucial to sanitize the HTML. Modern web pages contain elements like <nav>, <footer>, <aside>, and hidden modals that contribute no value to the core content.
If you convert the entire page blindly, your Markdown will include navigation links, newsletter signups, and related article previews. This reintroduces noise into your RAG pipeline.
A robust extraction pipeline evaluates DOM nodes based on text density, link-to-text ratios, and semantic HTML5 tags like <main> or <article>. It prunes the DOM tree of boilerplate, ensuring the resulting Markdown represents only the primary article or data payload.
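Two of these signals, semantic tags and link-to-text ratio, can be sketched with the standard library alone. The tag list and helper names below are illustrative assumptions rather than a description of any particular product's heuristics, and a production pipeline would score individual DOM nodes instead of whole fragments.

```python
from html.parser import HTMLParser

# Subtrees that rarely carry primary content (an assumed, non-exhaustive list).
BOILERPLATE_TAGS = frozenset({"nav", "footer", "aside", "form", "script", "style"})


class _TextCollector(HTMLParser):
    """Collects text, skipping `skip_tags` subtrees and tracking link text."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = skip_tags
        self.chunks = []
        self.text_chars = 0
        self.link_chars = 0
        self._skip_depth = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth > 0:
            return
        self.chunks.append(text)
        self.text_chars += len(text)
        if self._link_depth > 0:
            self.link_chars += len(text)


def _collect(html, skip_tags):
    collector = _TextCollector(skip_tags)
    collector.feed(html)
    collector.close()
    return collector


def strip_boilerplate(html: str) -> str:
    """Return the text that lies outside boilerplate subtrees."""
    return " ".join(_collect(html, BOILERPLATE_TAGS).chunks)


def link_density(html: str) -> float:
    """Share of visible characters that sit inside <a> elements (0.0 to 1.0)."""
    collected = _collect(html, frozenset({"script", "style"}))
    return collected.link_chars / collected.text_chars if collected.text_chars else 0.0


page = """
<body>
  <nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
  <main><article><h1>Battery Guide</h1>
    <p>The pack lasts 12 hours. See the <a href="/specs">spec sheet</a>.</p>
  </article></main>
  <aside>Subscribe to our newsletter</aside>
  <footer>Copyright Example Corp</footer>
</body>
"""
print(strip_boilerplate(page))
print(link_density('<div><a href="/">Home</a> <a href="/blog">Blog</a></div>'))  # 1.0
```

A fragment whose text is almost entirely links, like the menu above, is a strong candidate for pruning even when it is not wrapped in a <nav> element.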
When implementing custom conversion pipelines, you must build this sanitization step yourself using tools like Mozilla's Readability.js. Offloading the step to an extraction API eliminates the need to maintain complex DOM pruning rules across diverse web layouts.
Single-Step Markdown Extraction with AlterLab
Instead of building a complex pipeline with Puppeteer, proxy managers, HTML parsing libraries, and Markdown converters, you can request Markdown directly from the AlterLab API.
We built AlterLab to abstract this infrastructure away. Our systems handle the headless browser execution, manage the proxy rotation, sanitize the DOM, and return the data in your requested format.
You pass the target URL to the API. You specify that you want Markdown. AlterLab navigates to the page, waits for JavaScript execution to complete, parses the rendered HTML, strips navigation and footer boilerplate using heuristics, and returns a clean Markdown string.
Here is how to implement this using our Python SDK.
import alterlab
import os

client = alterlab.Client(api_key=os.getenv("ALTERLAB_API_KEY"))

response = client.scrape(
    url="https://example-news-site.com/article",
    formats=["markdown"],
    wait_for="networkidle"
)

markdown_content = response.markdown
print(markdown_content)
# The output is ready to be chunked and embedded
For systems where you prefer standard HTTP requests, the same configuration works via cURL. See the API reference for full parameter details.
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-news-site.com/article",
    "formats": ["markdown"],
    "wait_for": "networkidle"
  }'
The wait_for: "networkidle" parameter ensures the headless browser waits until all client-side rendering completes before extracting the DOM. The formats: ["markdown"] parameter handles the conversion pipeline internally.
Optimizing RAG Ingestion Pipelines with Markdown
Once you have clean Markdown, your chunking strategy improves drastically. Standard text chunking methods split text arbitrarily by character count. This often breaks paragraphs in half or separates a table header from its rows, destroying the context the LLM needs to answer queries.
With Markdown, you chunk by semantic boundaries using headers (#, ##, ###).
Markdown-aware text splitters read these headers to keep related concepts together. When a section exceeds your chunk size limit, the splitter drops down to the next header level.
# Newer LangChain releases also expose this class from the langchain_text_splitters package.
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_content)

for split in md_header_splits:
    print(split.metadata)
    print(split.page_content)
This ensures that every chunk sent to your vector database contains complete, logically grouped information. It also preserves the header hierarchy in the metadata, allowing you to filter or weight retrieval results based on section context.
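The idea of dropping down to the next header level when a section is too large can be sketched without any library. `split_by_headers` is a hypothetical helper written for illustration; unlike the LangChain splitter, it does not carry parent headers along as metadata, and it will misread `#` lines that appear inside fenced code blocks.

```python
import re


def split_by_headers(markdown: str, max_chars: int, level: int = 1, max_level: int = 3):
    """Split at `level` headers; recurse one level deeper for oversized sections."""
    # Matches lines that start with exactly `level` '#' characters and a space.
    pattern = re.compile(rf"^#{{{level}}} ", re.MULTILINE)
    starts = [match.start() for match in pattern.finditer(markdown)]
    bounds = sorted({0, *starts, len(markdown)})

    chunks = []
    for begin, end in zip(bounds, bounds[1:]):
        section = markdown[begin:end].strip()
        if not section:
            continue
        if len(section) > max_chars and level < max_level:
            # Too large: try again using the next header level as the boundary.
            chunks.extend(split_by_headers(section, max_chars, level + 1, max_level))
        else:
            chunks.append(section)
    return chunks


doc = """# Guide
Intro paragraph.

## Setup
Install the package and set the API key.

## Usage
Call the scrape endpoint and request Markdown.

# FAQ
Short answer.
"""

for chunk in split_by_headers(doc, max_chars=60):
    print(repr(chunk.splitlines()[0]), len(chunk))
```

With a 60-character limit the "Guide" section is split at its second-level headers, while the short "FAQ" section stays whole. A section that still exceeds the limit at the deepest header level is returned as-is, so a size-bounded fallback splitter is still needed for very long passages.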
Handling Tabular Data and Complex Structures
Tables present a notorious challenge for RAG systems. HTML tables (<table>, <tr>, <td>) confuse text embedding models. Flattening a table into plain text removes the row and column relationships, rendering the data incomprehensible.
Markdown tables maintain a rigid, predictable structure.
| Parameter | Type | Description |
|---|---|---|
| url | string | The target webpage URL |
| formats | array | Requested output formats |
LLMs parse Markdown tables natively. When a user asks a question requiring data aggregation across columns, the LLM correctly interprets the intersections of the Markdown table provided in the context window. Converting HTML directly to Markdown preserves this critical tabular structure without writing custom extraction logic.
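For cases where you already hold the cell values, for example after reading the text of each table cell from the rendered DOM, rendering them as a Markdown table takes only a few lines. `to_markdown_table` is an illustrative helper, not part of any library mentioned here.

```python
def to_markdown_table(headers, rows):
    """Render a header row and data rows as a Markdown table string."""

    def format_row(cells):
        # Escape pipes and flatten newlines so each cell stays in its column.
        escaped = [str(cell).replace("|", "\\|").replace("\n", " ") for cell in cells]
        return "| " + " | ".join(escaped) + " |"

    lines = [format_row(headers), "|" + "|".join(["---"] * len(headers)) + "|"]
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("every row must have one cell per header")
        lines.append(format_row(row))
    return "\n".join(lines)


table = to_markdown_table(
    ["Parameter", "Type", "Description"],
    [
        ["url", "string", "The target webpage URL"],
        ["formats", "array", "Requested output formats"],
    ],
)
print(table)
```

Escaping the pipe character matters in practice: an unescaped `|` inside a cell shifts every following column and silently corrupts the row.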
Beyond text and tables, web pages contain images and complex nested structures. Raw HTML encodes images with <img> tags, srcset attributes, and lazy-loading wrappers.
When converting to Markdown, the process extracts the primary src and alt text, formatting it as ![alt text](src). If your RAG system incorporates multimodal LLMs, you can parse these Markdown image tags to fetch and analyze the visual content. The LLM receives the semantic description via the alt text, maintaining context even if you choose not to download the image.
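Pulling image references out of the Markdown can be sketched with a regular expression. The pattern below is a simplification that covers the common inline form with an optional quoted title; it does not handle URLs containing parentheses or reference-style images, and the example URLs are placeholders.

```python
import re

# Inline image syntax: ![alt](src) or ![alt](src "title")
IMAGE_PATTERN = re.compile(r'!\[(?P<alt>[^\]]*)\]\((?P<src>[^)\s]+)(?:\s+"[^"]*")?\)')


def extract_images(markdown: str):
    """Return (alt_text, src) pairs for every inline Markdown image."""
    return [(m.group("alt"), m.group("src")) for m in IMAGE_PATTERN.finditer(markdown)]


sample = (
    "Intro text\n"
    "![Front view of the laptop](https://example.com/img/front.jpg)\n"
    'More text ![](https://example.com/a.png "Title") and [a link](https://example.com)\n'
)
for alt, src in extract_images(sample):
    print(repr(alt), src)
```

Ordinary links are ignored because they lack the leading `!`, and images with an empty alt text are still returned so you can decide whether to fetch them.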
For nested structures like accordions or tabbed interfaces, headless browser execution is paramount. SPAs often delay rendering the content of an inactive tab until the user clicks it. By using interaction features to simulate user clicks before triggering the Markdown extraction, you ensure all hidden content surfaces in the final DOM. This guarantees your RAG pipeline ingests the complete dataset, rather than missing critical information hidden behind UI components.
Real-world Data Engineering Considerations
Operating a RAG ingestion pipeline requires fault tolerance. When scraping dynamic websites, you must account for network timeouts, changing DOM structures, and temporary IP blocks.
By relying on an API to handle the extraction and conversion, you reduce your surface area for errors. You do not need to debug Puppeteer timeouts or update Chrome versions. Your error handling focuses entirely on your ingestion logic.
Implement exponential backoff for failed requests. Queue URLs for processing rather than executing them synchronously. Monitor the token count of the returned Markdown. If a site undergoes a major redesign, the heuristics stripping boilerplate might fail, resulting in a sudden spike in token count. Set up alerts for unexpected deviations in response size to catch these anomalies early.
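The retry and monitoring advice above can be sketched as two small helpers. The delay values, the 3x spike factor, and the function names are assumptions to tune for your own workload; the broad `except Exception` is for brevity and should be narrowed to the errors your client actually raises.

```python
import random
import time
from statistics import median


def retry_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # add up to 10% jitter


def is_token_spike(history, new_count, factor=3.0, min_samples=5):
    """Flag a page whose token count far exceeds the recent median."""
    if len(history) < min_samples:
        return False  # not enough data to judge
    return new_count > factor * median(history)


# Simulated fetch that times out twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "# Page title\nBody text"


waits = []  # record delays instead of actually sleeping
markdown = retry_with_backoff(flaky_fetch, sleep=waits.append)
print(markdown.splitlines()[0], [round(w, 1) for w in waits])

recent_token_counts = [3000, 3100, 2900, 3050, 2950]
print(is_token_spike(recent_token_counts, 12000))  # True
print(is_token_spike(recent_token_counts, 3200))   # False
```

Comparing against the median rather than the mean keeps one earlier outlier from masking the next one.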
Summary
Processing web data for AI requires minimizing noise. Extracting dynamically rendered pages directly as Markdown removes token bloat at the source. It simplifies your ingestion pipeline, lowers LLM API costs, and provides your embedding models with highly structured, high-signal text.
By offloading browser rendering, JavaScript execution, and Markdown conversion to an API, your engineering team can focus on improving embedding models and retrieval strategies rather than managing headless Chromium instances. Build data pipelines that scale reliably by treating web extraction as a solved infrastructure primitive.