
How to Connect Local LLMs to Live Web Data Using Token-Efficient JSON and Markdown
Learn how to connect local LLMs to live web data using token-efficient JSON and Markdown extraction to reduce hallucination and save tokens.
May 19, 2026
TL;DR
Connecting local LLMs to live web data requires converting noisy HTML into token-efficient JSON or Markdown before injecting it into the context window. A purpose-built extraction API lets you skip heavy DOM parsing and feed clean, structured context directly into models like Llama 3 or Mistral. This minimizes token usage, shortens inference time, and significantly reduces the risk of model hallucination.
The Problem with Raw HTML and Context Windows
When building Retrieval-Augmented Generation (RAG) pipelines or autonomous agents, the most common anti-pattern is passing raw HTML directly into a Large Language Model.
The DOM was designed for browsers, not neural networks. A standard public webpage—such as an e-commerce product listing or a real estate directory—contains hundreds of kilobytes of code. This includes base64-encoded SVG icons, tracking scripts, inline CSS styling, and deeply nested <div> structures that offer zero semantic value to an AI model.
Language models tokenize input text. Depending on the tokenizer (for example, OpenAI's tiktoken BPE encodings, which the Llama 3 tokenizer also builds on, or the SentencePiece tokenizers used by Llama 2 and Mistral), a 1MB HTML file can translate into roughly 250,000 to 400,000 tokens.
Feeding this into a local LLM creates three critical bottlenecks:
- Context Exhaustion: Most local models operate optimally within an 8k to 32k context window. Raw HTML immediately overflows these limits.
- Inference Latency: Processing 100,000 tokens of boilerplate code requires massive compute. Time-to-first-token (TTFT) skyrockets, making real-time applications impractical.
- Attention Dilution: The "lost in the middle" phenomenon is amplified by structural noise. When the target data (e.g., a product price) is buried among 5,000 tokens of navigation menus and footer scripts, the model's attention mechanism fails to retrieve it reliably.
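To make the context-exhaustion arithmetic concrete, here is a minimal sketch that estimates token counts from character counts and checks whether a prompt fits a model's window. The characters-per-token ratios are assumptions: about 4 characters per token is a common rule of thumb for English prose, while markup-heavy HTML usually tokenizes less efficiently. For exact numbers, run the model's own tokenizer. The 256-token output reservation mirrors the num_predict value used in the Ollama example in this article.

```python
# Rough heuristic (assumption): ~4 characters per token for English prose,
# fewer characters per token for markup-heavy HTML.
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Approximate a token count from a character count."""
    return round(num_chars / chars_per_token)


def fits_in_context(text: str, context_window: int = 8192,
                    reserved_for_output: int = 256,
                    chars_per_token: float = 4.0) -> bool:
    """True if the prompt text plus the reserved answer budget fits the window."""
    prompt_tokens = estimate_tokens(len(text), chars_per_token)
    return prompt_tokens + reserved_for_output <= context_window


# A 1MB HTML page at 4 and 2.5 characters per token brackets the range above
print(estimate_tokens(1_000_000, 4.0), estimate_tokens(1_000_000, 2.5))  # 250000 400000
print(fits_in_context("x" * 15_000))   # ~3,750 tokens of Markdown -> True
print(fits_in_context("x" * 500_000))  # ~125,000 tokens of HTML -> False
```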
To build performant AI data pipelines, retrieval and formatting must happen in a dedicated extraction layer, decoupled from inference.
Token Efficiency: Markdown and JSON
The solution is transforming the raw DOM into LLM-native formats before inference. The two standard formats for this are Markdown and JSON.
Markdown for Unstructured Context
Markdown is the ideal format for article-like content, documentation, and forum threads. It strips away the visual presentation layer while perfectly preserving the document's semantic hierarchy (H1, H2, lists, bold emphasis, and hyperlinks).
Because most foundation models include large amounts of Markdown in their pre-training data (for example, from GitHub and Reddit), they parse Markdown natively and efficiently. Converting a typical 500KB webpage into Markdown often yields around 15KB of text: a reduction of roughly 97% in size, with a comparable drop in token consumption.
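To see why this conversion is so effective, here is a deliberately small converter written against Python's standard html.parser module. It is a sketch, not how AlterLab or Turndown work internally: it keeps only headings, paragraphs, lists, links, and bold text, and it drops script, style, nav, footer, and svg elements wholesale instead of applying real readability scoring.

```python
import re
from html.parser import HTMLParser

# Elements dropped entirely: they carry layout or behavior, not content.
SKIP_TAGS = {"script", "style", "nav", "footer", "svg"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}


class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter: headings, paragraphs, lists, links, bold."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.href = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADING_PREFIX:
            self.parts.append("\n\n" + HEADING_PREFIX[tag])
        elif tag in ("p", "ul", "ol"):
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")  # ordered lists are also rendered as bullets
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag == "a":
            self.parts.append(f"]({self.href})")

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data)  # collapse whitespace, keep word boundaries
        if not self.skip_depth and text.strip():
            self.parts.append(text)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    # Tidy each line, then collapse runs of blank lines
    lines = [re.sub(r" +", " ", line).strip() for line in "".join(parser.parts).splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


page = ("<html><body><nav><a href='/'>Home</a></nav><h1>Title</h1>"
        "<p>Hello <b>world</b>, see <a href='https://example.com'>docs</a>.</p>"
        "<script>var x = 1;</script></body></html>")
markdown = html_to_markdown(page)
print(markdown)
print(f"{len(page)} characters of HTML -> {len(markdown)} characters of Markdown")
```

On real pages the savings come mostly from discarding inline scripts, styles, and SVG data, which this sketch removes in their entirety.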
JSON for Structured Entities
When the goal is extracting specific entities—such as a list of public company locations, pricing tiers, or tabular data—JSON is superior. JSON provides a rigid, key-value mapping that eliminates the need for the LLM to understand document flow.
By handling the DOM-to-JSON extraction outside the LLM (using CSS selectors or layout-aware heuristics), you only pass the exact data points the model needs to analyze.
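As an illustration of selector-based extraction, the sketch below uses only Python's standard html.parser to pull product names and prices into JSON. The class names product-name and price are hypothetical, and matching a single class attribute is a much-simplified stand-in for real CSS selectors (BeautifulSoup's select() method, for instance, supports full selector syntax). It also assumes each field's text sits directly inside the tagged element.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup: each product's name and price carry these CSS classes.
FIELD_CLASSES = {"name": "product-name", "price": "price"}


class FieldExtractor(HTMLParser):
    """Collect text from elements whose class matches a field in FIELD_CLASSES."""

    def __init__(self):
        super().__init__()
        self.records = []          # one dict per product
        self.current_field = None  # field whose element we are inside, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field, css_class in FIELD_CLASSES.items():
            if css_class in classes:
                if field == "name":
                    self.records.append({})  # a new name starts a new record
                self.current_field = field

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field
        self.current_field = None

    def handle_data(self, data):
        text = data.strip()
        if self.current_field and self.records and text:
            record = self.records[-1]
            record[self.current_field] = (record.get(self.current_field, "") + " " + text).strip()


def extract_records(html: str) -> str:
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps(parser.records)


html = (
    '<div class="card"><h2 class="product-name">Standard</h2><span class="price">$10</span></div>'
    '<div class="card"><h2 class="product-name">Pro</h2><span class="price">$25</span></div>'
)
print(extract_records(html))
# [{"name": "Standard", "price": "$10"}, {"name": "Pro", "price": "$25"}]
```

The model then receives a few dozen characters of JSON instead of the surrounding card markup.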
Setting Up the Pipeline
Rather than building a brittle pipeline of headless browsers, proxy rotators, and HTML parsers and converters (like BeautifulSoup or Turndown), you can offload the extraction step entirely. AlterLab provides native support for Markdown and JSON extraction, returning LLM-ready strings directly in the API response.
Fetching Data via API
Let's look at how to pull a page directly into Markdown format. First, we will use a standard cURL request to demonstrate the underlying HTTP interface.
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/blog/latest-tech",
    "formats": ["markdown"]
  }'
```

For production applications, using the Python SDK is cleaner and handles retries automatically.
```python
import alterlab

# Initialize the client
client = alterlab.Client("YOUR_API_KEY")

# Request only the Markdown format to save bandwidth
response = client.scrape(
    url="https://example.com/blog/latest-tech",
    formats=["markdown"],
)

# The response object contains the cleanly formatted Markdown
web_content = response.markdown
print(f"Retrieved {len(web_content)} characters of clean text.")
```
By specifying formats=["markdown"], the API processes the DOM tree, removes navigation bars, footers, and sidebars using readability algorithms, and returns only the core content formatted as Markdown.
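If you would rather not add the SDK as a dependency, the same request can be sent with Python's standard library. This is a minimal sketch built only from the endpoint, headers, and body in the cURL example; it does not replicate the SDK's automatic retries. Because the exact response schema is not covered here, it returns the decoded JSON unchanged, so check the API docs for field names. The function names build_scrape_request and scrape are hypothetical.

```python
import json
import urllib.request

# Endpoint and headers taken from the cURL example
API_URL = "https://api.alterlab.io/v1/scrape"


def build_scrape_request(api_key: str, url: str, formats: list[str]) -> urllib.request.Request:
    """Build the POST request without sending it (handy for testing)."""
    body = json.dumps({"url": url, "formats": formats}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def scrape(api_key: str, url: str, formats: list[str], timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON payload as-is."""
    request = build_scrape_request(api_key, url, formats)
    # No retries here: urlopen raises HTTPError/URLError on failure
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling scrape("YOUR_API_KEY", "https://example.com/blog/latest-tech", ["markdown"]) performs a live request, so it needs a valid key and network access.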
Parsing and Injecting into Local LLMs
Once you have the token-optimized text, you can feed it into a local model. For this example, we will use Ollama running a quantized version of Llama 3 (8B parameters).
Running local models keeps your data private and eliminates per-token API costs, which pairs well with an efficient extraction layer.
```python
import requests
import alterlab

OLLAMA_URL = "http://localhost:11434/api/generate"


def analyze_webpage(url: str, prompt: str) -> str:
    # 1. Fetch clean Markdown via AlterLab
    client = alterlab.Client("YOUR_API_KEY")
    scrape_result = client.scrape(url=url, formats=["markdown"])
    clean_markdown = scrape_result.markdown

    # 2. Construct the prompt with the injected context
    system_prompt = (
        "You are a data extraction assistant. Analyze the provided Markdown "
        "content and answer the user's prompt. Be concise."
    )
    full_prompt = f"{prompt}\n\n### Web Context:\n{clean_markdown}\n\n### Answer:"

    # 3. Feed to the local Ollama instance
    response = requests.post(
        OLLAMA_URL,
        json={
            "model": "llama3",
            "system": system_prompt,
            "prompt": full_prompt,
            "stream": False,
            "options": {
                "temperature": 0.1,
                "num_predict": 256,
                # Ollama's default context length may be smaller than Llama 3's
                # 8,192-token window; set it so long pages are not silently truncated
                "num_ctx": 8192,
            },
        },
        timeout=120,
    )
    response.raise_for_status()  # surface HTTP errors instead of masking them
    return response.json().get("response", "Error generating response.")


# Example usage
url_to_analyze = "https://example.com/press-releases/q3-earnings"
query = "What were the total revenue and net income reported for Q3? Return as JSON."
result = analyze_webpage(url_to_analyze, query)
print(result)
```

In this architecture, the local LLM never sees a single <div> or <script> tag. It only processes semantic Markdown, which can let an 8B-parameter model perform on par with much larger models that are forced to parse raw HTML.
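Because the example prompt asks for JSON, it helps to parse the reply defensively: small local models often wrap JSON in a Markdown code fence or add a sentence of preamble. The helper below is an illustrative, best-effort sketch; parse_json_answer is a hypothetical name, not part of Ollama or AlterLab. Ollama's generate endpoint also accepts a "format": "json" field that constrains output to JSON, which reduces, but does not remove, the need to validate what comes back.

```python
import json
import re


def parse_json_answer(text: str):
    """Return the first JSON object found in a model reply, or None."""
    # If the model wrapped its answer in a fenced block, look inside it
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Take the outermost {...} span to skip any surrounding prose
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None


# Made-up reply with illustrative values
reply = 'Here are the figures:\n{"revenue": 100, "net_income": 20}'
print(parse_json_answer(reply))  # {'revenue': 100, 'net_income': 20}
```

Returning None instead of raising lets the caller decide whether to retry the prompt or fall back to the raw text.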
Handling Dynamic Content and SPAs
A major challenge in data extraction is Single Page Applications (SPAs) built with React, Vue, or Angular. If you send a standard HTTP GET request to these URLs, the server typically returns a skeletal HTML file containing little more than a root element and a link to the JavaScript bundle.
If you convert this skeletal HTML to Markdown, the output will be essentially empty. The page must be fully rendered in a real browser environment before the DOM can be serialized and converted.
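Before converting a statically fetched page, you can flag responses that are probably unrendered SPA shells. The sketch below is a heuristic, not a guarantee: it treats a page as a shell when it contains at least one <script> tag but fewer than 200 characters of visible text, a threshold chosen arbitrarily for illustration. The names ShellDetector and looks_like_spa_shell are hypothetical.

```python
from html.parser import HTMLParser

# Elements whose text a visitor does not see on a rendered page
NON_VISIBLE = {"script", "style", "noscript", "template"}


class ShellDetector(HTMLParser):
    """Count visible text characters and <script> tags in a static HTML response."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.visible_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in NON_VISIBLE:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_VISIBLE:
            self.hidden_depth = max(0, self.hidden_depth - 1)

    def handle_data(self, data):
        if not self.hidden_depth:
            self.visible_chars += len(data.strip())


def looks_like_spa_shell(html: str, min_visible_chars: int = 200) -> bool:
    """Heuristic: scripts are present but there is almost no server-rendered text."""
    detector = ShellDetector()
    detector.feed(html)
    detector.close()
    return detector.script_tags > 0 and detector.visible_chars < min_visible_chars
```

Pages flagged this way need a browser rendering step before Markdown conversion; pages that pass can be converted directly.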
Managing headless Playwright or Puppeteer instances at scale is notoriously difficult. You must handle memory leaks, browser fingerprinting, and concurrent rendering queues. Modern target sites also deploy sophisticated request verification to ensure traffic originates from genuine browsers.
An API with built-in rendering and anti-bot handling abstracts the rendering phase away. The infrastructure automatically provisions a headless browser, executes the necessary JavaScript, waits for network idle (so asynchronous data fetches can complete), and then performs the Markdown or JSON conversion on the final, fully populated DOM.
This means your LLM receives complete data context, even when the target site relies heavily on client-side rendering.
Scaling to Multi-URL Contexts
Because Markdown is so compact, you can combine content from multiple URLs into a single prompt without blowing out the context window. This is critical for comparative analysis tasks, such as finding the differences among three distinct product pages.
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

urls = [
    "https://example.com/models/standard",
    "https://example.com/models/pro",
    "https://example.com/models/ultra",
]

combined_context = ""
for i, url in enumerate(urls, start=1):
    resp = client.scrape(url=url, formats=["markdown"])
    # Label each document so the model can attribute facts to their source
    combined_context += f"\n\n## Document {i} ({url})\n"
    combined_context += resp.markdown

# combined_context can now be passed to the LLM for comparison
```

For advanced usage, error handling, and parameter tuning, always refer to the API docs to ensure your requests are optimized for the specific target architecture.
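Concatenating documents keeps the comparison in one prompt, but nothing in the loop above stops three long pages from exceeding the model's window. The hypothetical helper combine_documents, sketched below, gives each document an equal slice of a character budget. The 24,000-character default is an assumption that works out to roughly 6,000 tokens at four characters per token, leaving headroom in an 8k window. Equal truncation is the bluntest possible policy; you may prefer to rank or summarize sections instead.

```python
def combine_documents(docs: dict[str, str], max_chars: int = 24_000) -> str:
    """Join {url: markdown} pairs under a total character budget.

    Each document gets an equal share of the budget and is truncated to fit.
    """
    if not docs:
        return ""
    separator = "\n\n"
    # Reserve room for separators; assumes each header is shorter than its share
    share = (max_chars - len(separator) * (len(docs) - 1)) // len(docs)
    sections = []
    for i, (url, markdown) in enumerate(docs.items(), start=1):
        header = f"## Document {i} ({url})\n"
        # Headers are kept whole; only the Markdown body is truncated
        body = markdown[: max(0, share - len(header))]
        sections.append(header + body)
    return separator.join(sections)
```

With the loop above, you would collect {url: resp.markdown} into a dict and pass it to combine_documents instead of concatenating strings directly.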
Conclusion
Building LLM-powered data pipelines requires treating the context window as your most precious resource. Passing raw HTML to local models leads to slow inference, wasted compute, and poor retrieval accuracy. By strictly separating the extraction layer from the inference layer, and converting web data into LLM-friendly formats like JSON and Markdown, you can build systems that are significantly faster, more accurate, and able to run inference entirely on local hardware.