Reducing LLM Token Usage in RAG via Structured Extraction

Learn how to optimize RAG pipelines by converting raw HTML into clean Markdown and structured JSON to significantly reduce LLM token consumption and costs.



TL;DR

To reduce LLM token usage in RAG pipelines, replace raw HTML with clean Markdown or structured JSON. This removes non-semantic noise like <script> and <div> tags, lowering costs and improving retrieval accuracy.

In Retrieval-Augmented Generation (RAG) workflows, the quality of your context is directly tied to the density of semantic information. Many developers make the mistake of feeding raw HTML directly into their embedding models or LLMs. This is inefficient. HTML is noisy, filled with boilerplate, and heavily penalizes your token budget.

By implementing a transformation layer that converts web content into Markdown or structured JSON, you can achieve higher accuracy with significantly lower latency and cost.

The Problem: HTML Token Bloat

When you scrape a page and pass the source code to an LLM, you are paying for characters that carry zero semantic meaning. A single <div> nested deep within a complex layout, with its class names, inline styles, and data attributes, can consume dozens of tokens.

Consider the following comparison:

  • Raw HTML: Contains tags, attributes, scripts, and styles. Often 10x larger than the visible text.
  • Markdown: Retains semantic structure (headers, lists, links) using minimal characters.
  • JSON: Extracts only the specific data points required for your application.
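To make the comparison concrete, here is a minimal sketch that strips tags, scripts, and styles from a small HTML snippet and compares rough token estimates. It uses only the Python standard library. The sample HTML, the helper names, and the "about 4 characters per token" rule are illustrative assumptions; for real numbers, use your model's own tokenizer.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token (a heuristic for English text).
    return max(1, len(text) // 4)


# Hypothetical sample page, not taken from any real site.
SAMPLE_HTML = """<html><head><style>.card{margin:0;padding:16px}</style>
<script>window.analytics = {id: "abc", track: true};</script></head>
<body><div class="card card--wide" data-id="42" style="display:flex">
<h1>Quarterly Report</h1><p>Revenue grew 12% year over year.</p>
</div></body></html>"""

if __name__ == "__main__":
    text = visible_text(SAMPLE_HTML)
    print("raw HTML tokens (est.):", estimate_tokens(SAMPLE_HTML))
    print("visible text tokens (est.):", estimate_tokens(text))
```

Running this on your own pages gives a quick sense of how much of each document is markup rather than content.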

Strategy 1: Markdown for Semantic Context

Markdown is the "Goldilocks" format for LLMs. It preserves the hierarchy of a page (H1, H2, lists), which helps the model understand the relationship between different pieces of text, but it strips away the overhead of HTML tags and attributes.

If you are building a knowledge base where the LLM needs to understand the relationship between a heading and a paragraph, Markdown is your best choice.

You can automate this by using a Python web scraping API that handles the heavy lifting of rendering JavaScript before you perform the conversion.

Implementation Example

Here is how you can fetch a page and prepare it for an LLM using a Python client.

Python
import alterlab
import markdownify # Library to convert HTML to Markdown

client = alterlab.Client("YOUR_API_KEY")

# Fetch the page content
response = client.scrape("https://example-news-site.com/article")
html_content = response.text

# Convert to clean Markdown
md_content = markdownify.markdownify(html_content)

print(md_content[:500]) # View the first 500 characters

For high-scale production environments, you should use an extraction tool that performs this conversion server-side to minimize local processing.
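Once a page is in Markdown, the heading hierarchy can be used to split it into chunks that each keep their heading, which is one way to preserve the heading-to-paragraph relationship described above. The following is a minimal sketch, not a full Markdown parser. It assumes ATX-style headings (lines starting with `#`), and the function name `split_by_heading` is hypothetical. It also does not special-case fenced code blocks, so a `#` comment line inside a code fence would be treated as a heading.

```python
import re

# Matches ATX headings such as "# Title" or "### Details".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks of {"heading": ..., "text": ...}."""
    chunks = []
    heading, lines = None, []

    def flush():
        text = "\n".join(lines).strip()
        # Skip an empty preamble that has neither heading nor text.
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Title\n\nIntro text.\n\n## Pricing\n\nPlan A costs $10.\n"
    for chunk in split_by_heading(sample):
        print(chunk)
```

Each chunk can then be embedded with its heading prepended, so a retrieved paragraph still carries the section it came from.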

Strategy 2: Structured JSON for Targeted Extraction

When your RAG pipeline doesn't need the entire article, only specific data points like prices, product names, or dates, do not use Markdown. Use structured extraction.

Instead of asking an LLM to "Read this HTML and tell me the price," you should use an extraction engine to turn the HTML into a JSON object. This moves the complexity from the LLM to the scraping layer, which is significantly cheaper.

Automating Extraction with cURL

You can define your desired schema directly in your request. This ensures that what enters your database is already clean, structured, and token-optimized.

Bash
curl -X POST https://api.alterlab.io/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-store.com/product/123",
    "schema": {
      "product_name": "string",
      "price": "number",
      "availability": "boolean"
    }
  }'

By requesting JSON directly, you bypass the need for a separate "Cleanup LLM" pass. This single architectural change can reduce your LLM-related costs by 60-80%.
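The same request can be built from Python, and the returned record can be checked against the schema before it reaches your database. This is a minimal sketch using only the standard library. The endpoint, header, and schema are copied from the cURL example above; the shape of the response is not shown there, so `validate_record` simply assumes a flat JSON object whose keys match the schema. That assumption, along with the function names and the type mapping, is hypothetical.

```python
import json
import urllib.request

EXTRACT_URL = "https://api.alterlab.io/v1/extract"  # from the cURL example

SCHEMA = {"product_name": "string", "price": "number", "availability": "boolean"}

# Assumption: the schema's type names map onto these Python types.
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}


def build_extract_request(url: str, schema: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    body = json.dumps({"url": url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        EXTRACT_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, type_name in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int, so reject True/False where a number is expected.
        is_bool_as_number = type_name == "number" and isinstance(value, bool)
        if is_bool_as_number or not isinstance(value, TYPE_MAP[type_name]):
            problems.append(f"{field}: expected {type_name}")
    return problems


if __name__ == "__main__":
    request = build_extract_request(
        "https://example-store.com/product/123", SCHEMA, "YOUR_API_KEY"
    )
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request), then json.load() the response.
    print(validate_record({"product_name": "Widget", "price": "19.99"}, SCHEMA))
```

Validating at ingestion keeps malformed records out of the vector store, where they would otherwise cost tokens at query time.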


Comparing Approaches

To decide which method to use, consider your end-use case:

| Feature | Raw HTML | Markdown | Structured JSON |
| :--- | :--- | :--- | :--- |
| Token Usage | Extremely High | Low | Minimal |
| Semantic Value | High (but noisy) | High | Targeted |
| LLM Latency | High | Low | Minimal |
| Implementation | Easy | Moderate | Advanced |

When dealing with complex, dynamic sites, ensure your pipeline includes robust anti-bot handling to prevent scraping failures from breaking your RAG ingestion.

Summary of Best Practices

  1. Never embed raw HTML in prompts: It is a waste of money and increases the chance of hallucinations.
  2. Use Markdown for unstructured text: If the content is long-form (blogs, news), Markdown preserves the structure LLMs need.
  3. Use JSON for structured data: For data-driven RAG (e.g., product catalogs), always extract via a JSON schema.
  4. Pre-process before embedding: Clean your text (remove extra whitespace, boilerplate footers) before sending it to your embedding model.
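The pre-processing step in item 4 can be sketched as a small cleaning function. This is a minimal illustration, not a complete cleaner: the boilerplate patterns and the function name `clean_for_embedding` are hypothetical and should be adapted to the sites you ingest.

```python
import re

# Hypothetical boilerplate markers; adjust these to your own sources.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*(share|subscribe|cookie settings)\s*$", re.IGNORECASE),
    re.compile(r"^\s*all rights reserved\.?\s*$", re.IGNORECASE),
]


def clean_for_embedding(text: str) -> str:
    """Drop boilerplate lines and normalize whitespace before embedding."""
    kept = []
    for line in text.splitlines():
        if any(pattern.match(line) for pattern in BOILERPLATE_PATTERNS):
            continue
        # Collapse runs of spaces and tabs inside the line.
        kept.append(re.sub(r"[ \t]+", " ", line).strip())
    # Collapse three or more consecutive newlines into one blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


if __name__ == "__main__":
    raw = "Intro   paragraph.\n\n\n\nSecond\tparagraph.\nShare\nAll rights reserved.\n"
    print(clean_for_embedding(raw))
```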

For more advanced implementation details, check our API documentation or read our recent posts on the AlterLab blog.


Frequently Asked Questions

Raw HTML contains heavy boilerplate, tags, and scripts that consume thousands of tokens without providing semantic value. Converting HTML to Markdown or JSON reduces token count by up to 80% while preserving context.
Structured data like JSON removes noise and provides clear key-value relationships. This allows the LLM to focus on the actual data rather than parsing document structure.
Yes, by using specialized extraction tools or APIs that perform scraping and structured parsing in a single step. This ensures the data is clean before it reaches your vector database.