
Extracting Markdown from JS-Heavy Sites for AI Agents
Learn how to reliably extract clean, token-efficient Markdown from JavaScript-heavy single-page applications to feed LLMs and autonomous AI agents.
April 30, 2026
Autonomous AI agents require structured, clean data to operate effectively. When an agent is tasked with researching an entity, summarizing news, or analyzing market trends, it needs to ingest web content. However, feeding raw HTML directly into a Large Language Model (LLM) is inefficient. Modern web pages are bloated with structural <div> tags, massive inline CSS styles, and tracking scripts. A typical e-commerce product page might weigh 2MB in HTML, consuming over 300,000 tokens, while the actual semantic content—the product name, description, price, and specifications—could be represented in under 1,000 tokens of Markdown.
The challenge compounds when dealing with single-page applications (SPAs) built on React, Vue, or Angular. A standard HTTP GET request to these endpoints returns an empty HTML shell containing only a <script> tag pointing to a massive JavaScript bundle. The actual data is fetched asynchronously and injected into the Document Object Model (DOM) post-load.
To feed AI agents efficiently, we must solve two distinct problems: executing the JavaScript to render the page, and converting the resulting chaotic HTML into clean, token-optimized Markdown.
The Cost of Raw HTML in LLM Workloads
Context windows are finite and expensive. Whether you are using open-weights models locally or querying commercial APIs, token count directly dictates your latency and financial cost.
Consider a typical news article page. The HTML contains:
- Navigation menus and mega-menus
- Sidebar advertisements
- Footer links
- Inline SVG icons
- Hidden tracking pixels
- Structured data (JSON-LD) meant for search engines
- Deeply nested structural elements (<div class="flex flex-col md:flex-row ...">)
If an AI agent attempts to process this raw HTML, it wastes cognitive capacity navigating the markup structure rather than analyzing the content. The model's attention mechanism must calculate weights across thousands of structural tokens that add zero semantic value to the actual text.
Markdown solves this by preserving the semantic hierarchy (headers, lists, links, tables, emphasis) while stripping away the presentation layer. Converting a rendered DOM to Markdown acts as a highly effective data compression step, often reducing token counts by 95% or more without losing the information necessary for reasoning.
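You can quantify the savings directly by tokenizing both representations. The sketch below uses OpenAI's tiktoken library on a toy snippet (the snippet and the exact counts are illustrative; any tokenizer shows the same trend):

```python
import tiktoken  # pip install tiktoken

# Illustrative: the same content as framework-generated HTML vs. Markdown
raw_html = (
    '<div class="flex flex-col md:flex-row p-4 m-2 shadow-lg">'
    '<h1 class="text-2xl font-bold tracking-tight">Acme Widget</h1>'
    '<p class="prose prose-sm text-gray-700">A durable widget for industrial use.</p>'
    '</div>'
)
markdown = "# Acme Widget\n\nA durable widget for industrial use.\n"

enc = tiktoken.get_encoding("cl100k_base")
print(f"HTML: {len(enc.encode(raw_html))} tokens")
print(f"Markdown: {len(enc.encode(markdown))} tokens")
```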
The Extraction Pipeline
Extracting clean Markdown from a JS-heavy site is a multi-stage process. You cannot simply run a regex over the network response. You must simulate a real user environment.
Stage 1: Headless Rendering
The first step requires a browser engine (Chromium, WebKit, or Firefox) instrumented via a protocol like CDP (Chrome DevTools Protocol). Tools like Puppeteer or Playwright are standard here.
When you navigate to a JS-heavy site, the DOMContentLoaded event is insufficient. The page has loaded, but the frontend framework is likely just beginning to fetch the actual data payload. You must configure the headless browser to wait for network idle states—typically defined as having no more than a specific number of active network connections for a set duration (e.g., networkidle2 in Puppeteer).
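As a concrete starting point, here is a minimal sketch using Playwright's Python API (the target URL is a placeholder). Playwright's networkidle wait state plays the same role as Puppeteer's networkidle2, pausing until no network connections remain open for a short window:

```python
from playwright.sync_api import sync_playwright  # pip install playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Wait until there have been no network connections for ~500 ms,
    # giving the SPA time to fetch and inject its data payload.
    page.goto(
        "https://example-spa-target.com/product/123",
        wait_until="networkidle",
        timeout=30_000,
    )
    rendered_html = page.content()  # snapshot of the hydrated DOM
    browser.close()
```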
Furthermore, modern sites often utilize lazy-loading. Images, comments, and sometimes entire sections of the page will not render until they enter the viewport. To capture a complete representation of the page, the rendering pipeline must programmatically scroll down the document, triggering IntersectionObserver callbacks and forcing the application to render deferred content before the DOM is snapshotted.
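A common approach, shown below as a sketch against the Playwright page handle from the previous example, is to scroll repeatedly until the document height stops growing:

```python
def scroll_to_bottom(page, pause_ms: int = 500, max_rounds: int = 20) -> None:
    """Scroll to the bottom repeatedly until the document stops growing,
    giving IntersectionObserver-driven lazy content a chance to render."""
    last_height = 0
    for _ in range(max_rounds):
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break  # nothing new rendered after the last scroll
        last_height = height
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # let deferred requests settle
```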
Stage 2: DOM Sanitization
Once the DOM is fully rendered, extracting document.body.innerHTML yields a string that is still too noisy. Before converting to Markdown, the HTML must be aggressively pruned.
A robust sanitization pass involves traversing the tree and removing nodes that provide no textual value:
- <script>, <noscript>, and <style> tags.
- <iframe> elements (unless handling specific embedded content is required).
- <svg> and <canvas> elements.
- Hidden elements. This requires evaluating computed styles, not just looking for display: none attributes, as elements might be hidden via CSS classes or zero-pixel dimensions.
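A static pass with BeautifulSoup can handle the tag-level pruning (hidden-element detection via computed styles still has to happen inside the browser). This is a minimal sketch, assuming beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

NOISE_TAGS = ["script", "noscript", "style", "iframe", "svg", "canvas"]
BOILERPLATE_TAGS = ["nav", "header", "footer", "aside"]

def sanitize(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(NOISE_TAGS + BOILERPLATE_TAGS):
        node.decompose()  # drop the node and its entire subtree
    # Catches only inline hiding; class-based or zero-pixel hiding requires
    # evaluating computed styles in the rendering browser itself.
    for node in soup.find_all(style=lambda s: s and "display:none" in s.replace(" ", "")):
        node.decompose()
    return str(soup)

sanitized_html = sanitize(rendered_html)
```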
You also need to isolate the main content. While heuristics like Mozilla's Readability.js algorithm can attempt to find the primary article text, an AI agent often needs access to other data on the page, such as reviews or related items. Sanitization must strike a balance between removing boilerplate (nav, footers) and preserving context.
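When only the primary article body is wanted, a Readability-style extractor handles the isolation. A sketch using readability-lxml, a Python port of the Mozilla algorithm (assumed installed), applied to the rendered HTML from earlier:

```python
from readability import Document  # pip install readability-lxml

doc = Document(rendered_html)
title = doc.title()           # best-guess page title
article_html = doc.summary()  # primary content subtree, as HTML
```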
Stage 3: HTML to Markdown Translation
The final stage translates the sanitized HTML tree into Markdown. This is typically done using tools conceptually similar to Turndown.js.
The translation engine walks the DOM node by node, applying rules based on the element type:
- <h1> through <h6> map to corresponding # prefixes.
- <ul>, <ol>, and <li> map to list formatting.
- <a> tags become [text](url).
- <img> tags become ![alt](src).
- <table> elements require careful parsing to construct valid Markdown tables, handling column spans and empty cells gracefully.
Crucially, text nodes must be processed to collapse excessive whitespace and escape characters that hold special meaning in Markdown (like *, _, or [), preventing formatting collisions.
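In Python, the markdownify package implements this kind of rule-driven walk, playing the role Turndown.js plays in JavaScript. A minimal sketch, assuming the package is installed:

```python
from markdownify import markdownify as md  # pip install markdownify

markdown = md(
    sanitized_html,        # output of the sanitization stage above
    heading_style="ATX",   # emit "# Heading" rather than underlined titles
    strip=["span"],        # drop <span> wrappers while keeping their text
)
```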
Handling the Anti-Bot Layer
Building this pipeline locally is straightforward for simple targets. However, when extracting data at scale, you will encounter significant resistance. Most high-value e-commerce sites, real estate aggregators, and travel platforms deploy sophisticated bot mitigation systems.
These systems do not simply look for a missing User-Agent. They analyze:
- TLS fingerprints (JA3/JA4 hashes) to verify the underlying network stack matches the claimed browser.
- JavaScript engine characteristics, looking for variables injected by automation frameworks, such as navigator.webdriver.
- Canvas rendering outputs, which differ slightly between real GPUs and headless virtual environments.
- IP reputation, quickly flagging datacenters.
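The navigator.webdriver probe is the easiest of these to observe yourself. In Playwright, an init script can mask it before any page code runs; note this defeats only the most basic check and does nothing about TLS, canvas, or IP signals:

```python
# Runs before every page script; masks the most obvious automation flag.
page.add_init_script(
    "Object.defineProperty(navigator, 'webdriver', { get: () => undefined })"
)
```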
When a mitigation system flags the connection, it intercepts the request. Instead of serving the application bundle, it serves a challenge page—often a CAPTCHA or a complex JavaScript proof-of-work calculation. If your headless browser cannot solve this challenge, the extraction pipeline fails entirely.
Maintaining a fleet of proxy servers, rotating browser fingerprints, and constantly updating evasion scripts requires dedicated engineering resources that detract from building the actual AI agent logic. Utilizing a managed anti-bot solution shifts this operational burden, allowing you to request a URL and receive the rendered content reliably.
Streamlining Extraction for Agents
Instead of orchestrating Playwright instances, handling proxy rotation, and writing custom DOM sanitizers, you can offload the entire pipeline. The AlterLab API natively supports rendering JS-heavy applications and returning clean Markdown directly.
By specifying the formats parameter, the API handles the underlying browser lifecycle, waits for the network to settle, strips boilerplate HTML, and returns a token-optimized string ready for ingestion by an LLM.
Example: Extracting via cURL
This example demonstrates how to request the Markdown format directly via the API. Note that the request includes configuration for both executing JavaScript and specifying the desired output format.
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-spa-target.com/product/123",
    "render_js": true,
    "formats": ["markdown"]
  }'
```

The response will contain the sanitized, converted Markdown in the markdown field of the JSON payload, bypassing the need to parse megabytes of raw HTML locally.
Example: Extracting via Python SDK
For production systems, the Python SDK provides a strongly typed interface for defining extraction tasks. The SDK handles connection pooling, retries, and schema validation automatically.
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://example-spa-target.com/product/123",
    render_js=True,
    formats=["markdown"]
)

# The resulting Markdown is token-efficient and ready for LLM context
agent_context = response.markdown
print(f"Extracted {len(agent_context)} characters of clean Markdown.")

# Feed agent_context directly into your OpenAI, Anthropic, or local model prompts
```

If you need further details on configuration options, such as injecting custom headers or geographic targeting, refer to the API docs for the complete parameter schema.
Takeaways
Feeding AI agents requires optimizing for both semantic clarity and token efficiency.
- Raw HTML from modern web applications is heavily polluted with structural noise and tracking scripts, wasting LLM context windows and increasing processing latency.
- Single-page applications require a full headless browser execution environment to trigger network requests and render the final DOM state.
- Converting the rendered DOM to Markdown reduces token consumption drastically while preserving the necessary hierarchical structure for accurate data extraction.
- Abstracting the rendering, sanitization, and evasion layers behind an API allows engineering teams to focus on agent orchestration rather than infrastructure maintenance.