
Extracting Markdown from JS-Heavy Sites for AI Agents
Learn how to reliably extract clean, token-efficient Markdown from JavaScript-heavy single-page applications to feed LLMs and autonomous AI agents.
April 30, 2026
Autonomous AI agents require structured, clean data to operate effectively. When an agent is tasked with researching an entity, summarizing news, or analyzing market trends, it needs to ingest web content. However, feeding raw HTML directly into a Large Language Model (LLM) is inefficient. Modern web pages are bloated with structural <div> tags, massive inline CSS styles, and tracking scripts. A typical e-commerce product page might weigh 2MB in HTML, consuming over 300,000 tokens, while the actual semantic content—the product name, description, price, and specifications—could be represented in under 1,000 tokens of Markdown.
The challenge compounds when dealing with single-page applications (SPAs) built on React, Vue, or Angular. A standard HTTP GET request to these endpoints returns an empty HTML shell containing only a <script> tag pointing to a massive JavaScript bundle. The actual data is fetched asynchronously and injected into the Document Object Model (DOM) post-load.
To feed AI agents efficiently, we must solve two distinct problems: executing the JavaScript to render the page, and converting the resulting chaotic HTML into clean, token-optimized Markdown.
The Cost of Raw HTML in LLM Workloads
Context windows are finite and expensive. Whether you are using open-weights models locally or querying commercial APIs, token count directly dictates your latency and financial cost.
Consider a typical news article page. The HTML contains:
- Navigation menus and mega-menus
- Sidebar advertisements
- Footer links
- Inline SVG icons
- Hidden tracking pixels
- Structured data (JSON-LD) meant for search engines
- Deeply nested structural elements (<div class="flex flex-col md:flex-row ...">)
If an AI agent attempts to process this raw HTML, it wastes cognitive capacity navigating the markup structure rather than analyzing the content. The model's attention mechanism must calculate weights across thousands of structural tokens that add zero semantic value to the actual text.
Markdown solves this by preserving the semantic hierarchy (headers, lists, links, tables, emphasis) while stripping away the presentation layer. Converting a rendered DOM to Markdown acts as a highly effective data compression step, often reducing token counts by 95% or more without losing the information necessary for reasoning.
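You can quantify the savings directly by tokenizing both representations. The sketch below uses OpenAI's tiktoken library on a toy snippet (the snippet and the exact counts are illustrative; any tokenizer shows the same trend):

```python
import tiktoken  # pip install tiktoken

# Illustrative: the same content as framework-generated HTML vs. Markdown
raw_html = (
    '<div class="flex flex-col md:flex-row p-4 m-2 shadow-lg">'
    '<h1 class="text-2xl font-bold tracking-tight">Acme Widget</h1>'
    '<p class="prose prose-sm text-gray-700">A durable widget for industrial use.</p>'
    '</div>'
)
markdown = "# Acme Widget\n\nA durable widget for industrial use.\n"

enc = tiktoken.get_encoding("cl100k_base")
print(f"HTML: {len(enc.encode(raw_html))} tokens")
print(f"Markdown: {len(enc.encode(markdown))} tokens")
```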
The Extraction Pipeline
Extracting clean Markdown from a JS-heavy site is a multi-stage process. You cannot simply run a regex over the network response. You must simulate a real user environment.
Stage 1: Headless Rendering
The first step requires a browser engine (Chromium, WebKit, or Firefox) instrumented via a protocol like CDP (Chrome DevTools Protocol). Tools like Puppeteer or Playwright are standard here.
When you navigate to a JS-heavy site, the DOMContentLoaded event is insufficient. The page has loaded, but the frontend framework is likely just beginning to fetch the actual data payload. You must configure the headless browser to wait for network idle states—typically defined as having no more than a specific number of active network connections for a set duration (e.g., networkidle2 in Puppeteer).
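As a concrete starting point, here is a minimal sketch using Playwright's Python API (the target URL is a placeholder). Playwright's networkidle wait state plays the same role as Puppeteer's networkidle2, pausing until no network connections remain open for a short window:

```python
from playwright.sync_api import sync_playwright  # pip install playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Wait until there have been no network connections for ~500 ms,
    # giving the SPA time to fetch and inject its data payload.
    page.goto(
        "https://example-spa-target.com/product/123",
        wait_until="networkidle",
        timeout=30_000,
    )
    rendered_html = page.content()  # snapshot of the hydrated DOM
    browser.close()
```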
Furthermore, modern sites often utilize lazy-loading. Images, comments, and sometimes entire sections of the page will not render until they enter the viewport. To capture a complete representation of the page, the rendering pipeline must programmatically scroll down the document, triggering IntersectionObserver callbacks and forcing the application to render deferred content before the DOM is snapshotted.
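A common approach, shown below as a sketch against the Playwright page handle from the previous example, is to scroll repeatedly until the document height stops growing:

```python
def scroll_to_bottom(page, pause_ms: int = 500, max_rounds: int = 20) -> None:
    """Scroll to the bottom repeatedly until the document stops growing,
    giving IntersectionObserver-driven lazy content a chance to render."""
    last_height = 0
    for _ in range(max_rounds):
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break  # nothing new rendered after the last scroll
        last_height = height
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # let deferred requests settle
```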
Stage 2: DOM Sanitization
Once the DOM is fully rendered, extracting document.body.innerHTML yields a string that is still too noisy. Before converting to Markdown, the HTML must be aggressively pruned.
A robust sanitization pass involves traversing the tree and removing nodes that provide no textual value:
- <script>, <noscript>, and <style> tags.
- <iframe> elements (unless handling specific embedded content is required).
- <svg> and <canvas> elements.
- Hidden elements. This requires evaluating computed styles, not just looking for display: none attributes, as elements might be hidden via CSS classes or zero-pixel dimensions.
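A static pass with BeautifulSoup can handle the tag-level pruning (hidden-element detection via computed styles still has to happen inside the browser). This is a minimal sketch, assuming beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

NOISE_TAGS = ["script", "noscript", "style", "iframe", "svg", "canvas"]
BOILERPLATE_TAGS = ["nav", "header", "footer", "aside"]

def sanitize(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(NOISE_TAGS + BOILERPLATE_TAGS):
        node.decompose()  # drop the node and its entire subtree
    # Catches only inline hiding; class-based or zero-pixel hiding requires
    # evaluating computed styles in the rendering browser itself.
    for node in soup.find_all(style=lambda s: s and "display:none" in s.replace(" ", "")):
        node.decompose()
    return str(soup)

sanitized_html = sanitize(rendered_html)
```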
You also need to isolate the main content. While heuristics like Mozilla's Readability.js algorithm can attempt to find the primary article text, an AI agent often needs access to other data on the page, such as reviews or related items. Sanitization must strike a balance between removing boilerplate (nav, footers) and preserving context.
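When only the primary article body is wanted, a Readability-style extractor handles the isolation. A sketch using readability-lxml, a Python port of the Mozilla algorithm (assumed installed), applied to the rendered HTML from earlier:

```python
from readability import Document  # pip install readability-lxml

doc = Document(rendered_html)
title = doc.title()           # best-guess page title
article_html = doc.summary()  # primary content subtree, as HTML
```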
Stage 3: HTML to Markdown Translation
The final stage translates the sanitized HTML tree into Markdown. This is typically done using tools conceptually similar to Turndown.js.
The translation engine walks the DOM node by node, applying rules based on the element type:
- <h1> through <h6> map to corresponding # prefixes.
- <ul>, <ol>, and <li> map to list formatting.
- <a> tags become [text](url).
- <img> tags become ![alt](src).
- <table> elements require careful parsing to construct valid Markdown tables, handling column spans and empty cells gracefully.
Crucially, text nodes must be processed to collapse excessive whitespace and escape characters that hold special meaning in Markdown (like *, _, or [), preventing formatting collisions.
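In Python, the markdownify package implements this kind of rule-driven walk, playing the role Turndown.js plays in JavaScript. A minimal sketch, assuming the package is installed:

```python
from markdownify import markdownify as md  # pip install markdownify

markdown = md(
    sanitized_html,        # output of the sanitization stage above
    heading_style="ATX",   # emit "# Heading" rather than underlined titles
    strip=["span"],        # drop <span> wrappers while keeping their text
)
```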
Handling the Anti-Bot Layer
Building this pipeline locally is straightforward for simple targets. However, when extracting data at scale, you will encounter significant resistance. Most high-value e-commerce sites, real estate aggregators, and travel platforms deploy sophisticated bot mitigation systems.
These systems do not simply look for a missing User-Agent. They analyze:
- TLS fingerprints (JA3/JA4 hashes) to verify the underlying network stack matches the claimed browser.
- JavaScript engine characteristics, looking for variables injected by automation frameworks, such as navigator.webdriver.
- Canvas rendering outputs, which differ slightly between real GPUs and headless virtual environments.
- IP reputation, quickly flagging datacenters.
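The navigator.webdriver probe is the easiest of these to observe yourself. In Playwright, an init script can mask it before any page code runs; note this defeats only the most basic check and does nothing about TLS, canvas, or IP signals:

```python
# Runs before every page script; masks the most obvious automation flag.
page.add_init_script(
    "Object.defineProperty(navigator, 'webdriver', { get: () => undefined })"
)
```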
When a mitigation system flags the connection, it intercepts the request. Instead of serving the application bundle, it serves a challenge page—often a CAPTCHA or a complex JavaScript proof-of-work calculation. If your headless browser cannot solve this challenge, the extraction pipeline fails entirely.
Maintaining a fleet of proxy servers, rotating browser fingerprints, and constantly updating evasion scripts requires dedicated engineering resources that detract from building the actual AI agent logic. Utilizing a managed anti-bot solution shifts this operational burden, allowing you to request a URL and receive the rendered content reliably.
Streamlining Extraction for Agents
Instead of orchestrating Playwright instances, handling proxy rotation, and writing custom DOM sanitizers, you can offload the entire pipeline. The AlterLab API natively supports rendering JS-heavy applications and returning clean Markdown directly.
By specifying the formats parameter, the API handles the underlying browser lifecycle, waits for the network to settle, strips boilerplate HTML, and returns a token-optimized string ready for ingestion by an LLM.
Example: Extracting via cURL
This example demonstrates how to request the Markdown format directly via the API. Note that the request includes configuration for both executing JavaScript and specifying the desired output format.
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-spa-target.com/product/123",
    "render_js": true,
    "formats": ["markdown"]
  }'
```

The response will contain the sanitized, converted Markdown in the markdown field of the JSON payload, bypassing the need to parse megabytes of raw HTML locally.
Example: Extracting via Python SDK
For production systems, the Python SDK provides a strongly typed interface for defining extraction tasks. The SDK handles connection pooling, retries, and schema validation automatically.
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://example-spa-target.com/product/123",
    render_js=True,
    formats=["markdown"]
)

# The resulting Markdown is token-efficient and ready for LLM context
agent_context = response.markdown
print(f"Extracted {len(agent_context)} characters of clean Markdown.")

# Feed agent_context directly into your OpenAI, Anthropic, or local model prompts
```

If you need further details on configuration options, such as injecting custom headers or geographic targeting, refer to the API docs for the complete parameter schema.
Takeaways
Feeding AI agents requires optimizing for both semantic clarity and token efficiency.
- Raw HTML from modern web applications is heavily polluted with structural noise and tracking scripts, wasting LLM context windows and increasing processing latency.
- Single-page applications require a full headless browser execution environment to trigger network requests and render the final DOM state.
- Converting the rendered DOM to Markdown reduces token consumption drastically while preserving the necessary hierarchical structure for accurate data extraction.
- Abstracting the rendering, sanitization, and evasion layers behind an API allows engineering teams to focus on agent orchestration rather than infrastructure maintenance.