How to Connect Local LLMs to Live Web Data Using Token-Efficient JSON and Markdown

Learn how to connect local LLMs to live web data using token-efficient JSON and Markdown extraction to reduce hallucination and save tokens.

Yash Dubey

May 19, 2026


TL;DR

Connecting local LLMs to live web data requires converting noisy HTML into token-efficient JSON or Markdown before injecting it into the context window. A purpose-built extraction API handles the heavy DOM parsing for you, so you can feed clean, structured context directly into models like Llama 3 or Mistral. This minimizes token usage, speeds up inference, and substantially reduces the risk of model hallucination.

The Problem with Raw HTML and Context Windows

When building Retrieval-Augmented Generation (RAG) pipelines or autonomous agents, the most common anti-pattern is passing raw HTML directly into a Large Language Model.

The DOM was designed for browsers, not neural networks. A standard public webpage—such as an e-commerce product listing or a real estate directory—contains hundreds of kilobytes of code. This includes base64-encoded SVG icons, tracking scripts, inline CSS styling, and deeply nested <div> structures that offer zero semantic value to an AI model.

Language models tokenize input text. Depending on the tokenizer (such as tiktoken for OpenAI models or the SentencePiece tokenizers used by Llama 2 and earlier Mistral models), a 1MB HTML file can easily translate into 250,000 to 400,000 tokens.

Feeding this into a local LLM creates three critical bottlenecks:

  1. Context Exhaustion: Most local models operate optimally within an 8k to 32k context window. Raw HTML immediately overflows these limits.
  2. Inference Latency: Processing 100,000 tokens of boilerplate code requires massive compute. Time-to-first-token (TTFT) skyrockets, making real-time applications impractical on local hardware.
  3. Attention Dilution: The "lost in the middle" phenomenon is amplified by structural noise. When the target data (e.g., a product price) is buried between 5,000 tokens of navigation menus and footer scripts, the model's attention mechanism fails to retrieve it reliably.
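To make the size of the problem concrete, here is a minimal sketch that compares an estimated token count for a raw HTML snippet against the visible text it actually contains. It uses only the Python standard library. The four-characters-per-token ratio is a rough rule of thumb (an assumption, since real tokenizers vary), and the sample page, `VisibleTextExtractor`, and `estimate_tokens` are hypothetical names invented for this illustration.

```python
from html.parser import HTMLParser

CHARS_PER_TOKEN = 4  # rough rule of thumb (assumption); real tokenizers vary


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: one useful fact wrapped in styling, tracking, and nesting.
sample_html = (
    "<html><head><style>.nav{color:red;margin:0 auto;padding:12px}</style>"
    "<script>window.dataLayer=window.dataLayer||[];function track(){}</script></head>"
    "<body><div class='wrapper'><div class='inner'><p>Price: $49.99</p></div></div></body></html>"
)

text = visible_text(sample_html)
print(text)
print(estimate_tokens(sample_html), "estimated tokens as raw HTML")
print(estimate_tokens(text), "estimated tokens as visible text")
```

Even in this tiny example, almost all of the estimated tokens belong to markup rather than to the one fact the model needs. For exact counts, swap the heuristic for the tokenizer that ships with your model.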

To build performant AI data pipelines, the extraction layer must decouple data retrieval from data formatting.

Token Efficiency: Markdown and JSON

The solution is transforming the raw DOM into LLM-native formats before inference. The two standard formats for this are Markdown and JSON.

Markdown for Unstructured Context

Markdown is well suited to article-like content, documentation, and forum threads. It strips away the visual presentation layer while preserving the document's semantic hierarchy (H1, H2, lists, bold emphasis, and hyperlinks).

Because most foundational models incorporate large amounts of Markdown in their pre-training data (via GitHub and Reddit datasets), they handle Markdown well. Converting a typical 500KB webpage into Markdown often yields a 15KB file, which is roughly a 97% reduction in size and a similarly large reduction in token consumption.
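The sketch below illustrates the idea of the conversion, not how any particular extraction service implements it. It is a toy HTML-to-Markdown converter built on the standard library's `html.parser`, handling only headings, paragraphs, list items, links, and bold text, and dropping `<nav>`, `<footer>`, `<script>`, and `<style>` outright. The class name `MiniMarkdownConverter`, the `to_markdown` helper, and the sample page are hypothetical.

```python
import re
from html.parser import HTMLParser


class MiniMarkdownConverter(HTMLParser):
    """Toy converter: headings, paragraphs, list items, links, and bold only."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    DROPPED = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._drop_depth = 0
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.DROPPED:
            self._drop_depth += 1
        if self._drop_depth:
            return  # ignore everything inside dropped regions
        if tag in self.HEADINGS:
            self.out.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.DROPPED:
            self._drop_depth = max(0, self._drop_depth - 1)
            return
        if self._drop_depth:
            return
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag == "a":
            self.out.append(f"]({self._href or ''})")
            self._href = None

    def handle_data(self, data):
        if not self._drop_depth:
            # Collapse runs of whitespace but keep word boundaries intact.
            self.out.append(re.sub(r"\s+", " ", data))


def to_markdown(html: str) -> str:
    converter = MiniMarkdownConverter()
    converter.feed(html)
    converter.close()
    lines = [line.strip() for line in "".join(converter.out).split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


sample_page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<h1>Release Notes</h1>"
    "<p>Version 2 is <strong>faster</strong>. See the <a href='/docs'>docs</a>.</p>"
    "<ul><li>Fewer tokens</li><li>Cleaner context</li></ul>"
    "<footer>Copyright</footer></body></html>"
)

markdown = to_markdown(sample_page)
print(markdown)
```

The navigation link and footer disappear, while the heading, emphasis, link target, and list structure survive. A production converter needs far more (tables, code blocks, nested lists, malformed markup), which is one reason to rely on a maintained tool instead of a hand-rolled one.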

JSON for Structured Entities

When the goal is extracting specific entities—such as a list of public company locations, pricing tiers, or tabular data—JSON is superior. JSON provides a rigid, key-value mapping that eliminates the need for the LLM to understand document flow.

By handling the DOM-to-JSON extraction outside the LLM (using CSS selectors or layout-aware heuristics), you only pass the exact data points the model needs to analyze.
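As a rough illustration of extraction happening outside the model, the sketch below pulls two fields out of a product listing and serializes them as compact JSON. It assumes the page marks each item with the class names `product`, `name`, and `price`; those class names, the `ProductExtractor` class, and the sample markup are hypothetical. A real pipeline would more likely use CSS selectors through a dedicated library.

```python
import json
from html.parser import HTMLParser


class ProductExtractor(HTMLParser):
    """Collects the first text node of 'name' and 'price' elements per 'product'."""

    FIELDS = ("name", "price")

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # field whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product" in classes:
            self.products.append({})  # start a new record
        for field in self.FIELDS:
            if field in classes and self.products:
                self._field = field

    def handle_data(self, data):
        # Toy behaviour: assumes the tagged element directly contains its text.
        if self._field and data.strip():
            self.products[-1][self._field] = data.strip()
            self._field = None


listing_html = """
<div class="product card"><h2 class="name">Standard</h2>
  <span class="price">$19</span><div class="ad">Buy now!</div></div>
<div class="product card"><h2 class="name">Pro</h2>
  <span class="price">$49</span></div>
"""

extractor = ProductExtractor()
extractor.feed(listing_html)
extractor.close()
products = extractor.products

# Compact separators avoid spending tokens on indentation and spaces.
compact = json.dumps(products, separators=(",", ":"))
pretty = json.dumps(products, indent=2)
print(compact)
print(len(compact), "characters compact vs", len(pretty), "pretty-printed")
```

The promotional "Buy now!" block never reaches the model, and the compact form carries the same data in fewer characters than the pretty-printed one.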

Setting Up the Pipeline

Rather than building a brittle pipeline of headless browsers, proxy rotators, and HTML parsing and conversion libraries (like BeautifulSoup or Turndown), you can offload the extraction step entirely. AlterLab provides native support for Markdown and JSON extraction, returning LLM-ready strings directly in the API response.

Fetching Data via API

Let's look at how to pull a page directly into Markdown format. First, we will use a standard cURL request to demonstrate the underlying HTTP interface.

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/latest-tech",
    "formats": ["markdown"]
  }'

For production applications, using the Python SDK is cleaner and handles retries automatically.

Python
import alterlab

# Initialize the client
client = alterlab.Client("YOUR_API_KEY")

# Request only the markdown format to save bandwidth
response = client.scrape(
    url="https://example.com/blog/latest-tech",
    formats=["markdown"]
)

# The response object contains the cleanly formatted markdown
web_content = response.markdown
print(f"Retrieved {len(web_content)} characters of clean text.")

By specifying formats=["markdown"], the API processes the DOM tree, removes navigation bars, footers, and sidebars using readability algorithms, and returns only the core content formatted as Markdown.

Parsing and Injecting into Local LLMs

Once you have the token-optimized text, you can feed it into a local model. For this example, we will use Ollama running a quantized version of Llama 3 (8B parameters).

Running models locally keeps inference on your own hardware and eliminates per-token API costs, which pairs well with an efficient extraction layer.

Python
import requests
import alterlab

def analyze_webpage(url: str, prompt: str) -> str:
    # 1. Fetch clean markdown via AlterLab
    client = alterlab.Client("YOUR_API_KEY")
    scrape_result = client.scrape(url=url, formats=["markdown"])
    clean_markdown = scrape_result.markdown
    
    # 2. Construct the prompt with the injected context
    system_prompt = "You are a data extraction assistant. Analyze the provided Markdown content and answer the user's prompt. Be concise."
    
    full_prompt = f"{prompt}\n\n### Web Context:\n{clean_markdown}\n\n### Answer:"
    
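    # 3. Feed to local Ollama instance
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",
            "system": system_prompt,
            "prompt": full_prompt,
            "stream": False,
            "options": {
                "temperature": 0.1,
                "num_predict": 256,
                # Ollama's default context length is small; raise it so the
                # injected Markdown is not silently truncated.
                "num_ctx": 8192,
            },
        },
        timeout=120,  # local inference on long prompts can be slow
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body

    return response.json().get("response", "Error generating response.")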

# Example Usage
url_to_analyze = "https://example.com/press-releases/q3-earnings"
query = "What were the total revenue and net income reported for Q3? Return as JSON."

result = analyze_webpage(url_to_analyze, query)
print(result)

In this architecture, the local LLM never sees a single <div> or <script> tag. It processes only the semantic Markdown, so the 8B parameter model spends its limited context on content rather than markup, which can noticeably improve answer accuracy compared with feeding it raw HTML.
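Clean Markdown can still exceed the context length your runtime is configured for (in Ollama, the `num_ctx` option), so it helps to cap the injected text before building the prompt. The sketch below keeps whole paragraphs from the top of the document until an estimated budget is used up. The four-characters-per-token ratio is an assumption, and `fit_to_budget` and the sample text are hypothetical names for illustration.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption); use a real tokenizer for exact counts


def estimate_tokens(text: str) -> int:
    """Round up so short blocks never count as zero tokens."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def fit_to_budget(markdown: str, max_tokens: int) -> str:
    """Keep whole paragraphs, in order, until the estimated budget is exhausted."""
    kept = []
    used = 0
    for block in markdown.split("\n\n"):
        cost = estimate_tokens(block) + 1  # +1 as a rough allowance for the separator
        if used + cost > max_tokens:
            break  # stop at the first block that does not fit
        kept.append(block)
        used += cost
    return "\n\n".join(kept)


sample_markdown = (
    "# Q3 Results\n\n"
    "Revenue was $10M.\n\n"
    + "Filler paragraph. " * 50
)

trimmed = fit_to_budget(sample_markdown, max_tokens=20)
print(trimmed)
```

Cutting at paragraph boundaries avoids handing the model a sentence that stops mid-thought. Truncating from the top is a simplification; if the answer tends to sit deep in the page, a retrieval step that ranks paragraphs against the query is a better fit.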

Handling Dynamic Content and SPAs

A major challenge in data extraction is Single Page Applications (SPAs) built with React, Vue, or Angular. If you send a standard HTTP GET request to these URLs, the server typically returns a skeletal HTML file containing little more than an empty root element and links to JavaScript bundles.

If you convert this skeletal HTML to Markdown, the output will be empty or nearly empty. The page must be fully rendered in a real browser environment before the DOM can be serialized and converted.
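One inexpensive way to notice this situation is to check whether a fetched page has scripts but almost no visible text. The sketch below is a heuristic only: the 200-character threshold is an arbitrary assumption, and `ShellProbe` and `looks_client_rendered` are hypothetical names. Server-rendered pages with very little copy will trigger false positives.

```python
from html.parser import HTMLParser


class ShellProbe(HTMLParser):
    """Counts <script> tags and visible text characters in a document."""

    IGNORED = {"script", "style", "noscript", "title"}

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in self.IGNORED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.IGNORED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: scripts are present but there is almost no visible text."""
    probe = ShellProbe()
    probe.feed(html)
    probe.close()
    return probe.script_count > 0 and probe.text_chars < min_text_chars


spa_shell = (
    '<!doctype html><html><head><title>App</title></head><body>'
    '<div id="root"></div>'
    '<noscript>You need to enable JavaScript to run this app.</noscript>'
    '<script src="/static/js/main.js"></script></body></html>'
)
static_page = (
    "<html><body><h1>Quarterly report</h1><p>"
    + "Revenue grew in every region. " * 10
    + "</p></body></html>"
)

print(looks_client_rendered(spa_shell))
print(looks_client_rendered(static_page))
```

A check like this can decide whether a page needs a rendering step at all, so browser time is spent only where a plain HTTP fetch comes back hollow.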

Managing headless Playwright or Puppeteer instances at scale is notoriously difficult. You must handle memory leaks, browser fingerprinting, and concurrent rendering queues. Modern target sites also deploy sophisticated request verification to ensure traffic originates from genuine browsers.

By leveraging an API with built-in anti-bot handling, the rendering phase is abstracted away. The infrastructure automatically provisions a headless browser, executes the necessary JavaScript, waits for network idle (ensuring asynchronous data fetches complete), and then performs the Markdown or JSON conversion on the final, fully-populated DOM.

This ensures your LLM always receives complete data context, regardless of how heavily the target site relies on client-side rendering.

Scaling to Multi-URL Contexts

Because Markdown is so compact, you can combine content from multiple URLs into a single prompt without blowing out the context window. This is critical for comparative analysis tasks, such as finding the difference between three distinct product pages.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
urls = [
    "https://example.com/models/standard",
    "https://example.com/models/pro",
    "https://example.com/models/ultra"
]

combined_context = ""

for i, url in enumerate(urls):
    resp = client.scrape(url=url, formats=["markdown"])
    combined_context += f"\n\n## Document {i+1} ({url})\n"
    combined_context += resp.markdown

# combined_context can now be passed to the LLM for comparison
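Compact does not mean unbounded, so a multi-document prompt benefits from an explicit budget. The sketch below splits an estimated token budget evenly across already-fetched Markdown documents and marks any that were cut. The four-characters-per-token ratio is an assumption, the even split is a design simplification, and `build_comparison_context` and the sample documents are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers vary


def build_comparison_context(documents: dict[str, str], max_tokens: int) -> str:
    """Give each document an equal share of the budget and label each section."""
    per_doc_chars = (max_tokens * CHARS_PER_TOKEN) // max(1, len(documents))
    sections = []
    for i, (url, markdown) in enumerate(documents.items(), start=1):
        body = markdown.strip()
        if len(body) > per_doc_chars:
            # Mark the cut so the model knows the document is incomplete.
            body = body[:per_doc_chars].rstrip() + "\n[truncated]"
        sections.append(f"## Document {i} ({url})\n{body}")
    return "\n\n".join(sections)


# Stand-ins for Markdown that was already retrieved for each URL.
fetched = {
    "https://example.com/models/standard": "Standard: $19/month",
    "https://example.com/models/pro": "Pro plan. " * 100,
}

context = build_comparison_context(fetched, max_tokens=50)
print(context)
```

The section headers and truncation markers add a little overhead on top of the budget, so leave headroom for them as well as for the question and the model's answer.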

For advanced usage, error handling, and parameter tuning, refer to the API docs, since the right settings depend on how the target site is built.

Conclusion

Building LLM-powered data pipelines requires treating the context window as your most precious resource. Passing raw HTML to local models leads to slow inference, wasted tokens, and poor retrieval accuracy. By strictly separating the extraction layer from the inference layer, and converting web data into LLM-friendly formats like JSON and Markdown, you can build systems that are significantly faster, more accurate, and capable of running entirely on local hardware.


Frequently Asked Questions

Why shouldn't raw HTML be passed directly to an LLM?
Raw HTML contains massive amounts of non-semantic noise like inline CSS, JavaScript, and structural tags that bloat the context window. This wastes tokens, slows down inference, and degrades the LLM's ability to accurately retrieve facts due to attention dilution.

How does converting to Markdown reduce token usage?
Markdown strips away stylistic and functional web bloat while preserving semantic relationships like headers, lists, and links. This conversion typically reduces the token payload by 80% to 95%, allowing more relevant context to fit within the model's limits.

Can structured JSON be extracted and fed to an LLM directly?
Yes. Modern scraping APIs handle the DOM parsing and layout analysis to output structured JSON directly. You can feed this pre-structured JSON to an LLM, drastically reducing the amount of parsing the model has to do compared to unstructured text.