Reduce RAG Token Waste: Optimize Scraping to Markdown & JSON

Stop passing raw HTML to your LLMs. Cut RAG token costs and improve context quality by transforming scraped web pages into clean Markdown and structured JSON.

Yash Dubey

May 1, 2026

5 min read

Raw HTML bloats Retrieval-Augmented Generation (RAG) pipelines. An average web page consists of roughly 80% markup and 20% actual content. Passing this raw Document Object Model (DOM) to a Large Language Model (LLM) wastes tokens, increases latency, and severely degrades response quality.

If you chunk and embed raw HTML, your vector database index becomes polluted with CSS class names, SVG paths, and tracking scripts. A similarity search for specific domain knowledge might incorrectly return a chunk containing layout classes instead of the actual textual content.

The solution is moving the extraction and transformation logic to the edge. By converting raw web pages into clean Markdown or structured JSON at the scraping layer, you preserve semantic structure while eliminating token waste.

  • 85% average token reduction
  • 4x faster embedding
  • 0 DOM noise

The Token Economy of Web Data

LLMs process text using tokenizers based on algorithms like Byte Pair Encoding (BPE). Common words might map to a single token, but random strings, minified JavaScript, and complex CSS selectors often break down into multiple tokens per character.

A single empty layout element like <div class="x-flex-container y-mt-4 z-hidden"></div> can consume 15 to 20 tokens. Multiply this by the thousands of nested elements in a modern web application, and a single page can easily exhaust an 8k or 16k context window before the LLM ever reaches the core content.
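
You can measure this directly. Below is a minimal sketch using OpenAI's tiktoken tokenizer (not part of the pipeline described here, just a convenient way to count; exact counts vary by model):

Python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-class models.
enc = tiktoken.get_encoding("cl100k_base")

empty_layout = '<div class="x-flex-container y-mt-4 z-hidden"></div>'
print(len(enc.encode(empty_layout)))  # roughly 15-20 tokens of pure markup, zero content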

Why Markdown Wins for Unstructured Content

Markdown is the native language of modern language models. Models are heavily trained on Markdown files from repositories, technical documentation, and forums.

When you convert a page to Markdown before embedding it, you achieve two things. First, you strip away the syntax overhead of the DOM. Second, you preserve the hierarchical structure. Headings (#, ##) indicate document sections, bullet points group related items, and tables maintain tabular data structures. This context is critical for text chunking strategies. Advanced RAG systems use header-based chunking to split documents logically rather than arbitrarily cutting text every 500 words.
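
To see the transformation locally, libraries such as markdownify handle the basic conversion. This is a standalone sketch, not the hosted pipeline discussed below:

Python
# pip install markdownify
from markdownify import markdownify as md

html = """
<article>
  <h1>Configuration Parameters</h1>
  <p>Set the <code>timeout</code> value in seconds.</p>
  <ul><li>Default: 30</li><li>Maximum: 300</li></ul>
</article>
"""

# heading_style="ATX" emits #-style headings, which header-based
# chunkers rely on downstream.
print(md(html, heading_style="ATX"))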

Why JSON Wins for Structured Content

For highly structured public data, such as e-commerce product pages, real estate listings, or public business directories, Markdown is still too broad. You do not need the site navigation, the footer links, or the sidebar recommendations. You only need the core entities: product name, price, specifications, or company contact details.

In these cases, extracting data directly into structured JSON at the scraping layer is the most token-efficient approach. You feed the LLM a clean JSON object containing only the exact facts it needs.
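
Conceptually, the extraction layer applies one selector per field and discards everything else. Here is a local sketch with BeautifulSoup; the selectors are illustrative placeholders, and real pages need their own rules:

Python
# pip install beautifulsoup4
from bs4 import BeautifulSoup

def extract_product(html: str) -> dict:
    # Only the targeted entities survive; navigation, footer,
    # and sidebar content never enter the output.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one(".price").get_text(strip=True),
        "specs": [li.get_text(strip=True) for li in soup.select(".specs li")],
    }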

Implementing the Transformation Pipeline

To implement this reliably, your scraping infrastructure must handle headless browser rendering, network interception, and data extraction in a single pass. If you attempt to fetch raw HTML with a standard HTTP client and parse it locally, you will fail on modern single-page applications (SPAs) that require JavaScript execution.

You can utilize a unified scraping platform to handle this entire flow. AlterLab provides native endpoints that return transformed formats directly, eliminating the need to maintain your own headless browser clusters and HTML parsing libraries.

Below is an example of requesting both Markdown and structured JSON in a single API call using the Python SDK.

Python
import os
from alterlab import Client

def fetch_clean_data(url: str):
    client = Client(api_key=os.getenv("ALTERLAB_API_KEY"))
    
    # We request both formats. The API handles the browser 
    # rendering and the transformation internally.
    response = client.scrape(
        url=url,
        formats=["markdown", "json"],
        extract_rules={
            "title": "h1",
            "content": "article",
            "author": ".author-name"
        },
        wait_for_network_idle=True
    )
    
    return response.markdown, response.json

md_content, json_data = fetch_clean_data("https://example.com/public-docs")
print("Tokens saved. Ready for embedding.")

For teams integrating directly at the network layer or building services in Go or Rust, the same functionality is available via the REST API. You can review the complete schema specifications in the API docs.

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/public-docs",
    "formats": ["markdown", "json"],
    "extract_rules": {
      "title": "h1",
      "summary": ".post-excerpt"
    },
    "wait_for_network_idle": true
  }'

Web architectures are increasingly complex. Collecting public data often requires executing JavaScript, managing browser fingerprints, and handling dynamic content loading. If your scraper fails to load the page properly, your Markdown output will just be the loading screen or a generic "Please enable JavaScript" warning.

To ensure your RAG pipeline receives the actual content, the scraping layer must perfectly simulate a real user environment. This requires managing IP rotation, TLS fingerprinting, and browser environments. Offloading anti-bot handling to a specialized infrastructure layer ensures your data engineers spend their time building better vector search algorithms rather than debugging browser memory leaks.

Always ensure your collection methods target publicly accessible content and respect general rate limits to maintain ethical data pipelines.

Chunking and Embedding the Result

Once you have the clean Markdown, the next step in the pipeline is text chunking. Because you retained the Markdown formatting, you can use specialized splitters provided by frameworks like LangChain or LlamaIndex.

Instead of splitting text every 1000 characters, a Markdown text splitter divides the document based on headers. This ensures that a specific section, such as "Configuration Parameters", remains in a single contiguous chunk. When the vector database retrieves this chunk, the LLM receives the complete context for that specific topic, drastically reducing hallucination rates.
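
With LangChain, for example, a header-based split looks like this (a sketch assuming md_content holds the Markdown returned by the earlier scrape call):

Python
# pip install langchain-text-splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "section"), ("##", "subsection")]
)

# Each chunk keeps its header path as metadata, so a section like
# "Configuration Parameters" stays together and stays labeled.
for chunk in splitter.split_text(md_content):
    print(chunk.metadata, chunk.page_content[:60])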

For the JSON output, embedding is even simpler. You can serialize the JSON objects into key-value strings (e.g., "Product: Mechanical Keyboard, Switch Type: Cherry MX Red") and embed those directly. This deterministic structure is highly effective for building semantic search engines over product catalogs or public directories.
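
A minimal serializer matching that example:

Python
def serialize_for_embedding(record: dict) -> str:
    # Flatten {"switch_type": "Cherry MX Red"} into "Switch Type: Cherry MX Red".
    return ", ".join(
        f"{key.replace('_', ' ').title()}: {value}" for key, value in record.items()
    )

print(serialize_for_embedding(
    {"product": "Mechanical Keyboard", "switch_type": "Cherry MX Red"}
))
# Product: Mechanical Keyboard, Switch Type: Cherry MX Red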

Takeaways for Engineering Teams

RAG architectures are only as good as the data fed into them. By optimizing your scraping outputs, you directly improve the performance of your entire AI pipeline.

  • Stop sending raw DOM trees to your LLM context windows.
  • Extract public text into Markdown to preserve semantic hierarchy while stripping syntax noise.
  • Target specific entities using JSON extraction for highly structured data sources.
  • Move the transformation logic to the edge using a dedicated scraping API to reduce compute overhead on your own servers.
  • Use header-based chunking on the resulting Markdown to maintain logical context in your vector database.

Frequently Asked Questions

Why does raw HTML waste tokens in RAG pipelines?
HTML contains massive amounts of noise, including inline CSS, scripts, and nested layout tags. This consumes token limits rapidly, increases inference costs, and dilutes the semantic meaning of the content for embedding models.

Does Markdown actually improve LLM comprehension?
Yes. Markdown preserves the hierarchical structure of a document, such as headings, lists, and tables. This structural context helps LLMs understand the relationship between different sections of the text.

How do I handle JavaScript-heavy pages when scraping for RAG?
You must use a headless browser to execute the JavaScript and wait for network idle events before extracting the DOM. Modern scraping APIs handle this rendering layer automatically before converting the output to Markdown.