
Reduce RAG Token Waste: Optimize Scraping to Markdown & JSON
Stop passing raw HTML to your LLMs. Cut RAG token costs and improve context quality by transforming scraped web pages into clean Markdown and structured JSON.
May 1, 2026
Raw HTML bloats Retrieval-Augmented Generation (RAG) pipelines. A typical web page is mostly markup, often on the order of 80% boilerplate to 20% actual content. Passing this raw Document Object Model (DOM) to a Large Language Model wastes tokens, increases latency, and severely degrades response quality.
If you chunk and embed raw HTML, your vector database index becomes polluted with CSS class names, SVG paths, and tracking scripts. A similarity search for specific domain knowledge might incorrectly return a chunk containing layout classes instead of the actual textual content.
The solution is moving the extraction and transformation logic to the edge. By converting raw web pages into clean Markdown or structured JSON at the scraping layer, you preserve semantic structure while eliminating token waste.
The Token Economy of Web Data
LLMs process text using tokenizers based on algorithms like Byte Pair Encoding (BPE). Common words might map to a single token, but random strings, minified JavaScript, and complex CSS selectors often break down into multiple tokens per character.
A single empty layout element like <div class="x-flex-container y-mt-4 z-hidden"></div> can consume 15 to 20 tokens. Multiply this by the thousands of nested elements in a modern web application, and a single page can easily exhaust an 8k or 16k context window before the LLM ever reaches the core content.
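To make the waste concrete, here is a rough stand-in experiment using only the standard library. Character count serves as a crude proxy for token count, since BPE tokenizers expand class names and attribute soup into many tokens; exact token counts depend on the tokenizer and are not measured here.

```python
import re

raw = '<div class="x-flex-container y-mt-4 z-hidden"><p>Hello world.</p></div>'

# Crude tag-stripping to isolate the visible text. In the raw string,
# the markup dominates the byte count; the content is a small fraction.
text = re.sub(r"<[^>]+>", "", raw)

print(len(raw), len(text))
```

Even in this tiny example, the markup accounts for most of the payload; on a real page with nested layout wrappers, the ratio is far worse.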
Why Markdown Wins for Unstructured Content
Markdown is the native language of modern language models. Models are heavily trained on Markdown files from repositories, technical documentation, and forums.
When you convert a page to Markdown before embedding it, you achieve two things. First, you strip away the syntax overhead of the DOM. Second, you preserve the hierarchical structure. Headings (#, ##) indicate document sections, bullet points group related items, and tables maintain tabular data structures. This context is critical for text chunking strategies. Advanced RAG systems use header-based chunking to split documents logically rather than arbitrarily cutting text every 500 words.
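A minimal sketch of that conversion, using only Python's standard library. A production pipeline would use a dedicated converter, but the structure-preserving idea is the same: headings become `#` prefixes, list items become bullets, and scripts, styles, and attributes are dropped entirely.

```python
from html.parser import HTMLParser

class MarkdownSketch(HTMLParser):
    """Minimal HTML-to-Markdown sketch: keeps heading levels and
    list bullets, drops scripts, styles, and all attributes."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.prefix = ""
        self.skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True
        elif tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "- "

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip:
            self.lines.append(self.prefix + text)
            self.prefix = ""

def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    return "\n".join(parser.lines)

md = html_to_markdown(
    "<h1>Guide</h1><p>Intro.</p><h2>Steps</h2><ul><li>First</li><li>Second</li></ul>"
)
print(md)
```

The output preserves exactly the hierarchy a header-based chunker needs, with none of the DOM overhead.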
Why JSON Wins for Structured Content
For highly structured public data, such as e-commerce product pages, real estate listings, or public business directories, Markdown is still too broad. You do not need the site navigation, the footer links, or the sidebar recommendations. You only need the core entities: product name, price, specifications, or company contact details.
In these cases, extracting data directly into structured JSON at the scraping layer is the most token-efficient approach. You feed the LLM a clean JSON object containing only the exact facts it needs.
Implementing the Transformation Pipeline
To implement this reliably, your scraping infrastructure must handle headless browser rendering, network interception, and data extraction in a single pass. If you attempt to fetch raw HTML with a standard HTTP client and parse it locally, you will fail on modern single-page applications (SPAs) that require JavaScript execution.
You can use a unified scraping platform to handle this entire flow. AlterLab provides native endpoints that return transformed formats directly, eliminating the need to maintain your own headless browser clusters and HTML parsing libraries.
Below is an example of requesting both Markdown and structured JSON in a single API call using the Python SDK.
```python
import os
from alterlab import Client

def fetch_clean_data(url: str):
    client = Client(api_key=os.getenv("ALTERLAB_API_KEY"))

    # We request both formats. The API handles the browser
    # rendering and the transformation internally.
    response = client.scrape(
        url=url,
        formats=["markdown", "json"],
        extract_rules={
            "title": "h1",
            "content": "article",
            "author": ".author-name"
        },
        wait_for_network_idle=True
    )

    return response.markdown, response.json

md_content, json_data = fetch_clean_data("https://example.com/public-docs")
print("Tokens saved. Ready for embedding.")
```

For teams integrating directly at the network layer or building services in Go or Rust, the same functionality is available via the REST API. You can review the complete schema specifications in the API docs.
```shell
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/public-docs",
    "formats": ["markdown", "json"],
    "extract_rules": {
      "title": "h1",
      "summary": ".post-excerpt"
    },
    "wait_for_network_idle": true
  }'
```

Navigating Dynamic Content and Access Protections
Web architectures are increasingly complex. Collecting public data often requires executing JavaScript, managing browser fingerprints, and handling dynamic content loading. If your scraper fails to load the page properly, your Markdown output will just be the loading screen or a generic "Please enable JavaScript" warning.
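One defensive pattern is to validate the transformed output before it enters the embedding queue. The sketch below uses hypothetical marker strings and a length threshold; both are assumptions you would tune for the sites you actually target.

```python
# Hypothetical placeholder strings that indicate a failed render;
# adjust for the sites in your pipeline.
FAILURE_MARKERS = (
    "please enable javascript",
    "checking your browser",
    "loading...",
)

def looks_like_failed_render(markdown: str, min_length: int = 200) -> bool:
    """Heuristic guard: flag output that is suspiciously short or
    contains a well-known placeholder instead of real content."""
    if len(markdown) < min_length:
        return True
    lowered = markdown.lower()
    return any(marker in lowered for marker in FAILURE_MARKERS)

print(looks_like_failed_render("Please enable JavaScript to view this page."))
```

Rejected documents can be re-queued for a retry with a longer wait or a different browser profile, instead of silently polluting the vector index.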
To ensure your RAG pipeline receives the actual content, the scraping layer must perfectly simulate a real user environment. This requires managing IP rotation, TLS fingerprinting, and browser environments. Offloading anti-bot handling to a specialized infrastructure layer ensures your data engineers spend their time building better vector search algorithms rather than debugging browser memory leaks.
Always ensure your collection methods target publicly accessible content and respect general rate limits to maintain ethical data pipelines.
Chunking and Embedding the Result
Once you have the clean Markdown, the next step in the pipeline is text chunking. Because you retained the Markdown formatting, you can use specialized splitters provided by frameworks like LangChain or LlamaIndex.
Instead of splitting text every 1000 characters, a Markdown text splitter divides the document based on headers. This ensures that a specific section, such as "Configuration Parameters", remains in a single contiguous chunk. When the vector database retrieves this chunk, the LLM receives the complete context for that specific topic, drastically reducing hallucination rates.
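A simplified version of header-based splitting can be sketched with a regular expression. Frameworks like LangChain ship more robust splitters; this sketch ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

def split_by_headers(md: str, max_level: int = 2):
    """Split Markdown into chunks at headings up to max_level,
    so each section stays contiguous."""
    pattern = re.compile(rf"^#{{1,{max_level}}}\s", re.M)
    starts = [m.start() for m in pattern.finditer(md)]
    if not starts:
        return [md]
    chunks = []
    if starts[0] > 0:
        chunks.append(md[:starts[0]].strip())  # preamble before first heading
    for begin, end in zip(starts, starts[1:] + [len(md)]):
        chunks.append(md[begin:end].strip())
    return chunks

chunks = split_by_headers("# Intro\nHello.\n## Setup\nRun it.\n# FAQ\nWhy?")
print(chunks)
```

Each chunk begins with its heading, so the section title travels with its body into the vector store and back into the prompt at retrieval time.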
For the JSON output, embedding is even simpler. You can serialize the JSON objects into key-value strings (e.g., "Product: Mechanical Keyboard, Switch Type: Cherry MX Red") and embed those directly. This deterministic structure is highly effective for building semantic search engines over product catalogs or public directories.
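One simple way to do that serialization, assuming a flat record with illustrative field names:

```python
import json

def record_to_text(record: dict) -> str:
    """Flatten a flat JSON record into an embeddable key-value string."""
    return ", ".join(
        f"{key.replace('_', ' ').title()}: {value}" for key, value in record.items()
    )

# Illustrative record; in practice this is the scraper's JSON output.
record = json.loads('{"product": "Mechanical Keyboard", "switch_type": "Cherry MX Red"}')
print(record_to_text(record))
```

Nested records would need a recursive flattener, but for catalog-style data a single level of keys is usually enough.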
Takeaways for Engineering Teams
RAG architectures are only as good as the data fed into them. By optimizing your scraping outputs, you directly improve the performance of your entire AI pipeline.
- Stop sending raw DOM trees to your LLM context windows.
- Extract public text into Markdown to preserve semantic hierarchy while stripping syntax noise.
- Target specific entities using JSON extraction for highly structured data sources.
- Move the transformation logic to the edge using a dedicated scraping API to reduce compute overhead on your own servers.
- Use header-based chunking on the resulting Markdown to maintain logical context in your vector database.