
Optimizing AI Data Pipelines: JSON vs Markdown vs Text
Learn how to choose the right data format for LLM grounding and AI agents to minimize token costs and maximize extraction accuracy in your data pipelines.
TL;DR
Markdown is the optimal format for LLM grounding and RAG pipelines because it preserves structural hierarchy with minimal token overhead. Use JSON only when your agent requires strict schema adherence for tool-calling, and avoid raw text for complex pages where layout and table relationships are critical for reasoning.
The Token Cost of Structure
Building data pipelines for AI agents requires a fundamental shift in how we think about data serialization. In traditional ETL, JSON is the undisputed king. However, when feeding data into Large Language Models (LLMs), the primary constraint is not readability or parseability alone; it is token density and semantic preservation.
Every character you send to an LLM has a cost, both in terms of latency and actual spend. More importantly, the format of that data dictates how well the model "understands" the relationship between different pieces of information. If you strip too much structure, the model loses context. If you keep too much (like raw HTML), you waste the context window on boilerplate code.
Markdown: The Sweet Spot for LLMs
Markdown has emerged as the de facto standard for LLM grounding. There are three technical reasons for this dominance: training data alignment, structural semantics, and token efficiency.
1. Training Data Alignment
Most state-of-the-art LLMs were trained on massive repositories of technical documentation, GitHub READMEs, and specialized forums—all of which heavily utilize Markdown. As a result, models are exceptionally good at interpreting # for headers, > for blockquotes, and | for tables. When you provide data in Markdown, you are speaking the model's native language of structure.
2. Semantic Preservation in RAG
In a Retrieval-Augmented Generation (RAG) pipeline, you must split long documents into smaller "chunks" to fit within the retrieval context. Raw text chunking often breaks mid-sentence or mid-paragraph, losing the connection between a heading and its sub-points.
Markdown allows for "Header-based Chunking." You can split a document at every ## or ### heading, so that each chunk is a self-contained semantic unit that carries its own heading. This tends to improve the quality of the embeddings stored in a vector database.
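As a rough illustration of header-based chunking, the sketch below splits a Markdown string at every ## or ### heading using only the standard library. The function name `chunk_by_headers` and the chunk dictionary layout are assumptions made for this example, not part of any particular RAG framework.

```python
import re

# Matches level-2 and level-3 ATX headings ("## Title" or "### Title").
HEADING_RE = re.compile(r"^(#{2,3})\s+(.*)$")
FENCE = "`" * 3  # marker that opens and closes fenced code blocks


def chunk_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into chunks that each begin at a ## or ### heading."""
    chunks = []
    heading, lines = None, []
    in_fence = False

    def close_chunk():
        # Keep a chunk only if it has a heading or some non-blank text.
        if heading is not None or any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        # Lines starting with '#' inside code fences are comments, not headings.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            close_chunk()
            heading, lines = match.group(2).strip(), [line]
        else:
            lines.append(line)
    close_chunk()
    return chunks
```

Text that appears before the first heading is kept as a chunk whose `heading` is `None`, so nothing is silently dropped. A production splitter would likely also cap chunk length and prepend parent headings to ### chunks.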
3. Table Integrity
Consider a pricing table on an e-commerce site. In raw text, the relationship between a "Feature" and its "Price" is often lost as the table is flattened into a single stream of words. In JSON, a table can become incredibly verbose. Markdown tables preserve the grid structure with minimal token usage, allowing the LLM to perform accurate "lookups" within the context window.
JSON: When Schema Precision Matters
While Markdown is superior for grounding, JSON remains essential for "Extraction" tasks. If your goal is to populate a database or trigger a programmatic function, the LLM needs to emit a machine-parseable format, and JSON is the usual choice.
However, using JSON as the input format for an agent can be problematic. Consider this comparison:
JSON:

    [
      {"id": 1, "name": "Standard Plan", "price": "$10", "limit": "1000 requests"},
      {"id": 2, "name": "Pro Plan", "price": "$50", "limit": "Unlimited"}
    ]

Markdown:

| ID | Name | Price | Limit |
|---|---|---|---|
| 1 | Standard Plan | $10 | 1000 requests |
| 2 | Pro Plan | $50 | Unlimited |

The JSON version repeats its keys ("id", "name", "price", "limit") in every single row, while the Markdown table states each column name once. In a table with 50 rows, those repeated keys add up to hundreds of tokens that carry no new information, and the overhead grows with every extra row and column. For a high-volume pipeline, this inefficiency scales into significant costs.
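To make the key-repetition overhead concrete, here is a small sketch that serializes the same two plans both ways and compares them. The helper `rows_to_markdown_table` is a hypothetical name introduced for this example, and character counts are used only as a rough proxy for tokens.

```python
import json


def rows_to_markdown_table(rows: list[dict]) -> str:
    """Render flat dicts as a Markdown table; each key appears once, in the header."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "---|" * len(headers),
    ]
    for row in rows:
        # Note: values containing "|" would need escaping in real data.
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


plans = [
    {"id": 1, "name": "Standard Plan", "price": "$10", "limit": "1000 requests"},
    {"id": 2, "name": "Pro Plan", "price": "$50", "limit": "Unlimited"},
]

as_json = json.dumps(plans)
as_markdown = rows_to_markdown_table(plans)

# JSON mentions every key once per row; the table mentions each key once in total.
key_mentions_json = sum(as_json.count(f'"{key}"') for key in plans[0])

print("JSON characters:    ", len(as_json))
print("Markdown characters:", len(as_markdown))
print("Quoted keys in JSON:", key_mentions_json)
```

With only two rows the gap is small; the point of the sketch is that the JSON key count grows linearly with the number of rows, while the Markdown header cost stays fixed.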
Implementation: Transforming Web Data
The challenge for engineers is converting messy web content into clean Markdown or JSON. Most scrapers return raw HTML, which is a nightmare for LLMs due to the high noise-to-signal ratio.
When building your pipeline, you should use a Python SDK that handles the conversion at the edge. This reduces the payload size coming into your application and saves you from writing complex BeautifulSoup logic.
    import alterlab

    # Initialize the client
    client = alterlab.Client(api_key="YOUR_API_KEY")

    # Request specific formats to optimize for LLM grounding
    response = client.scrape(
        url="https://example-news-site.com/article",
        formats=["markdown", "json"],
        min_tier=3  # Ensure JS-heavy content is rendered
    )

    # Use Markdown for the RAG context
    markdown_content = response.markdown

    # Use JSON for metadata (author, date, tags)
    metadata = response.json

By requesting Markdown directly from the API, you avoid the overhead of local processing. The API performs the "DOM cleaning" (removing scripts, ads, and navbars) before converting the semantic structure to Markdown.
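How a hosted API performs its DOM cleaning is not something this article documents. Purely as a local approximation of the idea, the sketch below uses Python's standard `html.parser` to drop boilerplate subtrees and emit headings and paragraphs as Markdown. The `SKIP_TAGS` set, the `ContentExtractor` class, and `html_to_markdown` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: on most pages these tags hold scripts or navigation boilerplate.
SKIP_TAGS = {"script", "style", "noscript", "nav", "footer", "aside"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = set(HEADING_PREFIX) | {"p"}


class ContentExtractor(HTMLParser):
    """Skip boilerplate subtrees; collect headings and paragraphs as Markdown."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped subtree
        self.prefix = ""     # Markdown prefix of the block being read
        self.parts = []      # text fragments of the current block
        self.blocks = []     # finished Markdown blocks

    def flush(self):
        # Join fragments, collapse whitespace, and store non-empty blocks.
        text = " ".join("".join(self.parts).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.parts, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.flush()
            self.prefix = HEADING_PREFIX.get(tag, "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_markdown(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)
```

This sketch ignores tables, lists, links, and malformed markup, all of which a real converter has to handle; it is meant only to show why filtering happens before formatting.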
Raw Text: The Minimalist Approach
Raw text is only recommended when the structural relationship between data points is irrelevant. For example, if you are performing sentiment analysis on a 2,000-word product review, the headings and bullet points matter less than the prose itself.
However, even in these cases, we often find that "Clean Text" (text with boilerplate removed) is better than "Raw Text." A scraping layer that also handles content extraction helps keep "Cookie Policy" or "Sign Up for Newsletter" text out of the prompt, where it can lead to hallucinations.
Benchmarking Token Usage
To illustrate the difference, we serialized a sample 500-word technical blog post as Markdown, JSON, and raw HTML and measured the token count of each version with tiktoken (using the encodings for o1 and gpt-4o).
The Markdown version was 35% more efficient than JSON and 84% more efficient than raw HTML while retaining 100% of the structural hierarchy needed for grounding.
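A comparison of this kind can be reproduced with a small harness like the one below. It is a sketch under stated assumptions: `rough_token_count` is a crude stand-in tokenizer so the example runs with the standard library alone, the sample strings are invented for illustration, and the numbers it prints are not the figures reported above.

```python
import re
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def token_savings(
    samples: dict[str, str],
    baseline: str,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> dict[str, float]:
    """Return each format's token reduction relative to `baseline`, in percent."""
    counts = {name: count_tokens(text) for name, text in samples.items()}
    base = counts[baseline]
    return {
        name: round(100 * (base - count) / base, 1)
        for name, count in counts.items()
        if name != baseline
    }


# Invented sample: the same content as raw HTML and as Markdown.
samples = {
    "html": '<div class="post"><h2>Pricing</h2><p>Pro Plan costs $50.</p></div>',
    "markdown": "## Pricing\n\nPro Plan costs $50.",
}

print(token_savings(samples, baseline="html"))
```

To measure with tiktoken instead, pass a counter built from a real encoding, for example `enc = tiktoken.get_encoding("o200k_base")` and `count_tokens=lambda s: len(enc.encode(s))`.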
Strategy: Designing the Multi-Format Pipeline
The most robust AI agents don't rely on a single format. They use a hybrid approach:
- Markdown for Knowledge: The body of the page, tables, and lists are stored as Markdown in a vector database for RAG.
- JSON for Discovery: Metadata like page title, published date, and breadcrumbs are stored as JSON for filtering and sorting.
- Text for Summarization: Large blocks of prose can be simplified to text to maximize the context window for extremely long documents.
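One way to picture the hybrid approach is a single record per page that exposes all three views. The sketch below is an assumption-laden illustration: `PageRecord`, `markdown_to_text`, and the sample field values are hypothetical names and data, and the Markdown stripping covers only headings, links, and emphasis markers.

```python
import json
import re
from dataclasses import dataclass, field


def markdown_to_text(markdown: str) -> str:
    """Reduce Markdown to plain prose by stripping common markup."""
    text = re.sub(r"^#{1,6}\s+", "", markdown, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)            # [label](url) -> label
    text = re.sub(r"[*`]+", "", text)                                # emphasis and code marks
    return re.sub(r"\n{3,}", "\n\n", text).strip()


@dataclass
class PageRecord:
    """One scraped page, kept in the three forms used by the pipeline."""
    markdown: str                                 # body, tables, lists: chunked for RAG
    metadata: dict = field(default_factory=dict)  # title, date, breadcrumbs: for filtering

    def metadata_json(self) -> str:
        """Serialized metadata, with sorted keys for stable storage."""
        return json.dumps(self.metadata, sort_keys=True)

    def text(self) -> str:
        """Plain-text view for summarizing very long documents."""
        return markdown_to_text(self.markdown)


record = PageRecord(
    markdown="## Pricing\n\nThe **Pro Plan** costs $50. See [docs](https://example.com/docs).",
    metadata={"title": "Pricing", "breadcrumbs": ["Home", "Plans"]},
)
print(record.text())
print(record.metadata_json())
```

Deriving the text view from the Markdown, rather than storing it separately, keeps the three representations from drifting apart.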
Best Practices for AI Agent Pipelines
When configuring your data ingestion, follow these rules to ensure your agent remains accurate and cost-effective:
- Filter before format: Remove non-content elements (nav, footer, sidebars) before converting to Markdown. An LLM grounded on a sidebar's "Related Articles" list will likely hallucinate those titles as part of the primary content.
- Sanitize Markdown: Some converters produce "dirty" Markdown with excessive newlines or nested divs. Ensure your pipeline outputs standard CommonMark.
- Schema Validation: If you are using JSON, use Pydantic (Python) or Zod (TypeScript) to validate the structure before it reaches your agent logic. LLMs can occasionally "drift" from a schema if the input data is ambiguous.
- Monitor Token Density: Track the ratio of "Useful Characters" to "Total Tokens." If your JSON keys are longer than the values they hold, consider switching to a more compact representation or Markdown for that specific data segment.
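Two of the rules above lend themselves to small automated checks. The following sketch shows one possible shape for them using only the standard library; `sanitize_markdown` and `key_heavy_fields` are hypothetical helper names, and comparing character lengths is an assumption standing in for a real token count.

```python
import json
import re


def sanitize_markdown(markdown: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines to one.

    Caveat: this also removes the two-trailing-space form of hard line breaks.
    """
    lines = [line.rstrip() for line in markdown.splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip() + "\n"


def key_heavy_fields(records: list[dict]) -> list[str]:
    """Return keys whose quoted name is longer than their average serialized value."""
    flagged = []
    for key in records[0]:
        key_len = len(json.dumps(key))
        avg_value_len = sum(len(json.dumps(r[key])) for r in records) / len(records)
        if key_len > avg_value_len:
            flagged.append(key)
    return flagged


rows = [
    {"is_available_for_purchase": True, "name": "Standard Plan"},
    {"is_available_for_purchase": False, "name": "Pro Plan"},
]
print(key_heavy_fields(rows))
print(repr(sanitize_markdown("# Title  \n\n\n\nBody\n\n\n")))
```

Fields flagged by `key_heavy_fields` are candidates for shorter key names or for a Markdown table, where the column name is paid for only once.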
Takeaway
For developers building the next generation of AI-native applications, the choice of data format is a performance optimization. Markdown is the clear winner for grounding and RAG due to its balance of structural context and token efficiency. Reserve JSON for structured extraction and Text for the simplest of prose-only tasks. By optimizing your ingestion format, you reduce costs, lower latency, and significantly improve the reasoning capabilities of your agents.