Optimizing Web Scraping Data to Reduce RAG Token Costs

Reduce LLM token costs in RAG pipelines by optimizing web scraping extraction. Learn to clean HTML, convert to Markdown, and structure data before embedding.

Yash Dubey

April 23, 2026


Feeding raw HTML into a Retrieval-Augmented Generation (RAG) pipeline is a fast way to burn through your LLM token budget. When building data pipelines that rely on publicly accessible web data, the difference between a cost-effective architecture and an expensive one often comes down to pre-processing.

A standard public news article or e-commerce product page can easily exceed 2MB of raw HTML. Run that through a tokenizer like tiktoken (used by OpenAI models), and you are looking at roughly 300,000 to 500,000 tokens per page. At scale, processing thousands of pages daily, this approach becomes financially unviable. The LLM spends valuable compute parsing navigation menus, inline CSS, base64 encoded tracking pixels, and minified JavaScript rather than the actual content.
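That estimate can be sanity-checked with the common rule of thumb of roughly four characters per token for cl100k_base-style tokenizers. This is a heuristic sketch, not a real tokenizer call; markup-heavy HTML often tokenizes even less efficiently than prose:

```python
def estimate_tokens(payload_bytes: int, chars_per_token: float = 4.0) -> int:
    # Rough back-of-the-envelope estimate; real counts require running
    # the actual tokenizer (e.g. tiktoken) over the decoded text
    return int(payload_bytes / chars_per_token)

raw_html_bytes = 2 * 1024 * 1024  # a 2MB product page
print(f"~{estimate_tokens(raw_html_bytes):,} tokens of raw HTML")
```

At two megabytes, the heuristic lands at roughly 524,000 tokens, squarely in the range above.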

To build an efficient RAG pipeline, you must aggressively filter, structure, and compress web data before it ever reaches your vector database or LLM context window.

The Data Extraction Pipeline

The most efficient architectures treat web scraping and LLM ingestion as distinct phases separated by a strict transformation layer. The goal is to maximize the signal-to-noise ratio.

Phase 1: Aggressive DOM Stripping

If you are managing your own scraping infrastructure, the first step is cleaning the Document Object Model (DOM) before doing anything else. Standard libraries like BeautifulSoup in Python allow you to strip out the heaviest, least useful tags.

The most egregious token-wasters are <script>, <style>, and <svg> tags. SVGs in particular can contain thousands of lines of mathematical paths for simple icons, which provide zero semantic value to an LLM.

Python
from bs4 import BeautifulSoup, Comment
import re

def clean_html_for_llm(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, 'lxml')
    
    # Elements that contain zero semantic value for text generation
    tags_to_remove = [
        'script', 'style', 'noscript', 'svg', 'canvas', 
        'video', 'audio', 'iframe', 'map', 'object'
    ]
    
    for tag in soup(tags_to_remove):
        tag.decompose()
        
    # Remove hidden elements often used for tracking or mobile menus
    for hidden in soup.find_all(style=re.compile(r'display:\s*none')):
        hidden.decompose()
        
    # Strip comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
        
    return str(soup)

This simple pre-processing step routinely reduces the payload size by 60 to 80 percent. However, the resulting HTML still contains <div> and <span> tags that add token overhead without adding meaning.
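One way to reclaim those tokens before format conversion, sketched here as an optional extra pass on the cleaned HTML, is BeautifulSoup's unwrap(), which drops a tag but keeps its children:

```python
from bs4 import BeautifulSoup

def unwrap_layout_tags(html: str) -> str:
    # div and span carry layout, not meaning: unwrap() removes the tag
    # itself but keeps its contents, unlike decompose() which deletes both
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(['div', 'span']):
        tag.unwrap()
    return str(soup)

print(unwrap_layout_tags('<div class="wrap"><span>Price:</span> $9</div>'))
# Price: $9
```

Use this with care: stripping all divs flattens any structure a later Markdown converter might rely on, so it fits best when the next step is plain-text or custom extraction.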

Phase 2: The Markdown Golden Ratio

After stripping the DOM, the next step is format conversion. You might be tempted to extract pure text using soup.get_text(). This is a mistake for RAG pipelines.

Plain text loses the structural hierarchy of the document. You lose the distinction between an H1 title, an H2 sub-section, and a data table. When you pass a massive block of plain text into a text splitter for vectorization, the chunking algorithm is forced to split by character count or whitespace, often cutting right through the middle of a related concept.

Markdown is the golden ratio. It removes all HTML bracket overhead while preserving semantic boundaries.

When your data is formatted in Markdown, you can use semantic splitters (like LangChain's MarkdownHeaderTextSplitter) to chunk your data by ## and ### headers. This ensures that the vector database stores complete, coherent thoughts.


Phase 3: Offloading Extraction to the API

Running BeautifulSoup and Markdown converters on your own infrastructure requires maintaining complex server fleets, especially when dealing with headless browsers needed to render JavaScript-heavy Single Page Applications (SPAs).

Instead of building and scaling this extraction layer yourself, you can offload it directly to the scraping API. AlterLab natively supports returning cleaned Markdown or structured JSON, bypassing the raw HTML entirely. This shifts the compute cost away from your infrastructure and drastically reduces the payload size traversing your network.

Here is how you request clean Markdown directly using cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://public-data-source.com/research-paper",
    "formats": ["markdown"]
  }'

For Python developers, the AlterLab Python SDK handles the connection and response parsing automatically. This is the recommended approach for integrating into data pipelines.

Python
import alterlab
import os

client = alterlab.Client(os.environ["ALTERLAB_API_KEY"])

def fetch_document_for_rag(url: str) -> str:
    # Request markdown directly to save tokens and skip local parsing
    response = client.scrape(
        url,
        formats=["markdown"],
        min_tier=3 # Ensure JS is rendered before markdown conversion
    )
    
    return response.markdown

markdown_content = fetch_document_for_rag("https://public-data-source.com/research-paper")
print(f"Retrieved {len(markdown_content)} characters of clean markdown.")

By requesting the markdown format, the API renders the JavaScript, waits for the network-idle state, strips the noise, and converts the remaining semantic structure to Markdown. A 2.5MB HTML payload becomes a 15KB Markdown string. When you compare that token reduction against your LLM costs, the efficiency gains are immediate. Check the pricing page to model the cost difference between handling extraction in-house and offloading it to the API.

Phase 4: Schema-Driven JSON Extraction

While Markdown is excellent for unstructured documents like articles and documentation, it is still too verbose for highly structured data. If you are scraping public directories, e-commerce product catalogs, or financial data tables, you do not need sentences. You need key-value pairs.

Passing Markdown tables into an LLM to answer queries about specific product prices or specifications is inefficient. The LLM has to read the entire table to find a single value.

For highly structured pages, bypass text entirely and extract raw JSON at the scraping layer.

Python
import alterlab
import os

client = alterlab.Client(os.environ["ALTERLAB_API_KEY"])

def extract_product_data(url: str) -> dict:
    # Use Cortex AI to extract specific fields directly
    # into JSON, entirely bypassing HTML/Markdown in your pipeline
    response = client.scrape(
        url,
        formats=["json"],
        extraction_schema={
            "product_name": "string",
            "price": "number",
            "availability": "string",
            "specifications": {
                "weight": "string",
                "dimensions": "string"
            }
        }
    )
    
    return response.json

data = extract_product_data("https://public-store.com/item/12345")

In this architecture, your RAG pipeline does not need to embed dense documents. You can store the JSON directly in a NoSQL database or a relational database, and use your LLM to generate SQL or query DSLs to retrieve exact answers. This hybrid approach (structured query generation plus unstructured vector search) yields the highest accuracy for data-dense applications. You can read more about structured extraction schemas in the API docs.
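The hybrid pattern above can be sketched with the standard library alone. Assume a record shaped like the extraction schema in the previous example; the SELECT statement stands in for a query an LLM would generate from a user question such as "What does the Widget Pro cost?":

```python
import json
import sqlite3

# Hypothetical record shaped like the extraction_schema above
product = {
    "product_name": "Widget Pro",
    "price": 49.99,
    "availability": "in_stock",
    "specifications": {"weight": "1.2kg", "dimensions": "10x10x4cm"},
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (name TEXT, price REAL, availability TEXT, specs TEXT)"
)
conn.execute(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    (
        product["product_name"],
        product["price"],
        product["availability"],
        json.dumps(product["specifications"]),  # nested fields kept as JSON text
    ),
)

# In production this SQL would be generated by the LLM, not hardcoded
price = conn.execute(
    "SELECT price FROM products WHERE name = ?", ("Widget Pro",)
).fetchone()[0]
print(price)  # 49.99
```

The LLM never reads the full table; it only ever sees the one exact value the query returns.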

Implementing Semantic Chunking

Once you have your clean Markdown, the final step before embedding is chunking. Standard recursive character splitters will break your data at arbitrary points. A semantic splitter reads the Markdown headers and groups the text logically.

Here is a practical implementation using LangChain to process the Markdown retrieved from the scraping API:

Python
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.schema import Document

def chunk_markdown_document(markdown_text: str) -> list[Document]:
    # Define which headers represent distinct sections
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    
    # Initialize the semantic splitter
    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on
    )
    
    # Split the document
    md_header_splits = markdown_splitter.split_text(markdown_text)
    
    return md_header_splits

When you inspect the output of this splitter, you will notice that the metadata for each chunk contains the hierarchy of headers above it. When the vector database returns a specific chunk to the LLM, the LLM immediately knows the exact context of the paragraph, drastically reducing hallucinations.
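One common way to exploit that metadata, sketched here with a plain dict standing in for the splitter's output, is to prepend the header path to the chunk text before embedding, so the hierarchy survives both retrieval and the trip into the prompt:

```python
def contextualize_chunk(text: str, metadata: dict) -> str:
    # Prepend the header hierarchy so the embedding (and later the LLM)
    # sees where this paragraph sits in the source document
    path = " > ".join(
        metadata[key] for key in ("Header 1", "Header 2", "Header 3")
        if key in metadata
    )
    return f"[{path}]\n{text}" if path else text

chunk_meta = {"Header 1": "Pricing", "Header 2": "Enterprise Tier"}
print(contextualize_chunk("Volume discounts start at 1M requests.", chunk_meta))
# [Pricing > Enterprise Tier]
# Volume discounts start at 1M requests.
```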

Cost Scaling in Production

Let us look at a practical cost model. Assume you process 10,000 pages per day.
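Using the figures from earlier in this article, roughly 400,000 tokens for a raw HTML page versus a 15KB Markdown equivalent at about four characters per token, the daily totals work out as follows (an illustrative model, not a price quote):

```python
PAGES_PER_DAY = 10_000

raw_tokens_per_page = 400_000        # raw HTML, per the estimate above
md_tokens_per_page = 15 * 1024 // 4  # ~3,840 tokens for 15KB of Markdown

raw_daily = PAGES_PER_DAY * raw_tokens_per_page  # 4 billion tokens/day
md_daily = PAGES_PER_DAY * md_tokens_per_page    # ~38.4 million tokens/day

print(f"Raw HTML:  {raw_daily:,} tokens/day")
print(f"Markdown:  {md_daily:,} tokens/day")
print(f"Reduction: {raw_tokens_per_page / md_tokens_per_page:.0f}x")
```

Whatever your per-token price, a two-orders-of-magnitude reduction in input volume dominates every other line item in the pipeline.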

The math is unambiguous. Processing raw HTML requires massive compute resources, bloats your vector database, and forces the LLM to waste its attention mechanism on DOM boilerplate.

By shifting the extraction burden to the scraping layer, converting to Markdown, and employing semantic chunking, you build a pipeline that is resilient, highly accurate, and exponentially cheaper to operate at scale. Stop passing <div> tags to your neural networks. Clean your data first.


Frequently Asked Questions

Why does raw HTML inflate RAG token costs?
Raw HTML contains massive amounts of non-semantic noise like inline styles, scripts, and SVG paths. This inflates your LLM context window with useless tokens, driving up costs and reducing retrieval accuracy.

What is the best format for feeding scraped data into a RAG pipeline?
Markdown is the optimal format. It strips out HTML boilerplate while preserving critical semantic boundaries like headers, lists, and tables, which are essential for accurate text chunking.

How should scraped documents be chunked for embedding?
Use a semantic chunking strategy that splits documents by headers (H1, H2, H3). This ensures that related concepts remain in the same vector embedding, improving the quality of your RAG answers.