
Get Clean JSON and Markdown Output from Any Website

Learn how to extract structured JSON and Markdown from any webpage without writing custom parsers. Practical examples using AlterLab's web scraping API.

Yash Dubey, April 15, 2026



HTML is messy. You send a request, get back 4,000 lines of nested divs, inline styles, and script tags, then spend hours writing XPath expressions that break when the site updates.

There is a better approach. Request the format you actually need.

The Problem with Raw HTML

When you scrape a product page, you do not want the HTML. You want:

Product name
Price
Availability
Description
Reviews

Extracting those fields means writing selectors for each site. Amazon uses different class names than Shopify stores. Shopify stores differ from WooCommerce. Every site is its own parsing problem.

The traditional approach looks like this:

Python

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example-store.com/product/123")
soup = BeautifulSoup(response.text, "html.parser")

# These selectors break when the site updates
name = soup.select_one(".product-title h1").text
price = soup.select_one(".price-current").text.strip("$")
description = soup.select_one(".product-description p").text

This works until the site redesigns. Then select_one() returns None, the .text access raises AttributeError, and your pipeline breaks.

Request the Format You Need

AlterLab's scraping API converts HTML to structured output server-side. You specify the format, get back clean data.

Python

import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://example-store.com/product/123",
    formats=["json"]
)
print(response.json)

The response contains extracted fields without any selector logic on your end:

JSON

{
  "title": "Wireless Bluetooth Headphones",
  "price": 49.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Over-ear headphones with active noise cancellation...",
  "reviews_count": 1247,
  "rating": 4.3
}

No BeautifulSoup. No XPath. No maintenance when the site changes its CSS classes.
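If you want the fields checked before they reach the rest of your pipeline, one option is to map the response onto a small dataclass. The sketch below assumes the payload has the fields shown in the sample response above; `Product` and `parse_product` are hypothetical names, not part of the AlterLab SDK, and real responses may carry different fields.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Product:
    title: str
    price: float
    currency: str
    availability: str
    reviews_count: int
    rating: float


def parse_product(payload: dict) -> Product:
    """Build a Product from a response dict, failing loudly on missing keys."""
    missing = [f.name for f in fields(Product) if f.name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Extra keys in the payload (such as "description") are simply ignored.
    return Product(
        title=str(payload["title"]),
        price=float(payload["price"]),
        currency=str(payload["currency"]),
        availability=str(payload["availability"]),
        reviews_count=int(payload["reviews_count"]),
        rating=float(payload["rating"]),
    )


# Sample payload copied from the response above; it stands in for response.json.
sample = json.loads("""
{
  "title": "Wireless Bluetooth Headphones",
  "price": 49.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Over-ear headphones with active noise cancellation...",
  "reviews_count": 1247,
  "rating": 4.3
}
""")

product = parse_product(sample)
print(product.title, product.price, product.currency)
```

Failing on a missing key at the boundary keeps a changed response shape from surfacing later as a confusing error deep inside your pipeline.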

JSON Output for Data Pipelines

JSON output works best when you are feeding data into a database, analytics system, or downstream API. The API extracts common structured data patterns automatically:

Product listings with prices and SKUs
Article content with titles, authors, and dates
Contact information from business pages
Table data converted to arrays of objects
Navigation links and metadata

Python

import alterlab
import psycopg2

client = alterlab.Client("YOUR_API_KEY")

# Scrape and get JSON directly
response = client.scrape(
    "https://news-site.com/articles/latest",
    formats=["json"]
)

# Insert directly into your database.
# "with conn" commits on success and rolls back if an insert raises;
# "with conn.cursor()" closes the cursor when the block exits.
conn = psycopg2.connect("dbname=news user=writer")
with conn, conn.cursor() as cur:
    for article in response.json["articles"]:
        cur.execute(
            "INSERT INTO articles (title, author, published) VALUES (%s, %s, %s)",
            (article["title"], article["author"], article["published_date"])
        )
conn.close()
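Scraped records are not always complete, so it can help to normalize them before inserting. The following is a minimal sketch of that step, assuming the `{"articles": [...]}` shape used in the example above. It uses the standard-library `sqlite3` module with an in-memory database so it runs without a server; note that sqlite3 uses `?` placeholders where psycopg2 uses `%s`. The `to_rows` helper and the sample payload are illustrative, not AlterLab output.

```python
import sqlite3

# Hypothetical payload in the shape the example above assumes for response.json
payload = {
    "articles": [
        {"title": "Raft explained", "author": "A. Writer", "published_date": "2026-04-01"},
        {"title": "Paxos notes", "published_date": "2026-04-02"},  # author missing
        {"author": "No Title", "published_date": "2026-04-03"},    # title missing
    ]
}


def to_rows(articles):
    """Turn article dicts into tuples, using None for absent optional fields."""
    return [
        (a["title"], a.get("author"), a.get("published_date"))
        for a in articles
        if a.get("title")  # skip entries that have no title at all
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, author TEXT, published TEXT)")

rows = to_rows(payload["articles"])
with conn:  # commits on success, rolls back if the insert raises
    conn.executemany(
        "INSERT INTO articles (title, author, published) VALUES (?, ?, ?)",
        rows,
    )

count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)
```

Using `.get()` for optional fields stores NULL instead of raising KeyError on the first incomplete record, and `executemany` sends the whole batch in one call.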

The same request works via curl if you are testing from a terminal or building in a non-Python language:

Bash

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news-site.com/articles/latest",
    "formats": ["json"]
  }'
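For reference, the same request can be assembled without the SDK using Python's standard library. This sketch reuses only the endpoint, header name, and body fields from the curl command above; `build_scrape_request` is a hypothetical helper. It stops at building the request, because the article does not specify the shape of the raw HTTP response.

```python
import json
import urllib.request

API_URL = "https://api.alterlab.io/v1/scrape"  # endpoint from the curl example


def build_scrape_request(url, formats, api_key):
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"url": url, "formats": formats}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_scrape_request(
    "https://news-site.com/articles/latest", ["json"], "YOUR_API_KEY"
)
print(req.get_method(), req.full_url)

# To send it, pass the request to urllib.request.urlopen(req)
# and decode the response body as JSON.
```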

Markdown Output for Content and LLMs

Markdown output strips everything except the readable content. Scripts, styles, navigation bars, footers, and ads disappear. What remains is the article text, properly formatted.

This matters for two use cases:

Content aggregation. You want the article text, not the surrounding chrome. Markdown gives you clean text with heading hierarchy preserved.

LLM context. Language models process Markdown more efficiently than HTML. Tokens spent on <div class="sidebar-widget"> are wasted tokens. Markdown removes the noise.

Python

import alterlab
from openai import OpenAI

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://tech-blog.com/post/understanding-distributed-systems",
    formats=["markdown"]
)

# Feed clean markdown to an LLM
llm = OpenAI()
completion = llm.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Summarize this technical article."},
        {"role": "user", "content": response.markdown}
    ]
)
print(completion.choices[0].message.content)

The Markdown output looks like this:

Markdown

# Understanding Distributed Systems

## Introduction

Distributed systems coordinate multiple independent processes to achieve a common goal. The key challenge is handling partial failures...

## Consensus Algorithms

The Raft algorithm provides leader election and log replication...

### Leader Election

Nodes vote for a leader in term-based elections...

No <script> tags. No cookie consent banners. No navigation menus. Just the content.
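Because the heading hierarchy survives, long articles can be split into sections before they go to a model with a limited context window. Below is a minimal sketch of that idea using only the standard library; `split_by_heading` is a hypothetical helper, and the sample text is the Markdown shown above. It does not account for fenced code blocks that happen to contain lines starting with `##`.

```python
import re


def split_by_heading(markdown: str, level: int = 2):
    """Split Markdown into (heading, body) pairs at headings of one level."""
    pattern = re.compile(rf"^{'#' * level} (.+)$", re.MULTILINE)
    matches = list(pattern.finditer(markdown))
    sections = []
    for i, match in enumerate(matches):
        # A section runs until the next heading of the same level, or to the end.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append((match.group(1).strip(), markdown[match.end():end].strip()))
    return sections


sample = """# Understanding Distributed Systems

## Introduction

Distributed systems coordinate multiple independent processes to achieve a common goal.

## Consensus Algorithms

The Raft algorithm provides leader election and log replication...

### Leader Election

Nodes vote for a leader in term-based elections...
"""

sections = split_by_heading(sample)
for heading, body in sections:
    print(heading, len(body))
```

Deeper headings such as "Leader Election" stay inside their parent section, so each chunk remains self-contained.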


Handling JavaScript-Rendered Sites

Many sites render content client-side. The initial HTML response contains almost nothing. The actual data loads via JavaScript after the page renders.

Raw HTTP requests cannot handle this. You need a headless browser.

AlterLab handles this automatically through its tiered rendering system. T1 handles static HTML. T3 and above execute JavaScript. You control the minimum tier:

Python

import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://spa-app.com/dashboard",
    formats=["json"],
    min_tier=3
)

The min_tier=3 parameter skips the static HTML attempt and goes straight to headless browser rendering. This costs more per request but guarantees you get the rendered content, not the empty shell.
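If you are unsure whether a site needs the higher tier, a rough heuristic is to measure how much visible text the static HTML contains. The sketch below is one way to do that with the standard library; `looks_like_empty_shell` and its 200-character threshold are assumptions chosen for illustration, not part of AlterLab's tiering logic.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """Return True when the static HTML carries very little visible text."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars


# A typical single-page-app shell: a mount point plus a script bundle
shell = (
    '<html><head><title>App</title></head><body><div id="root"></div>'
    '<script>window.__DATA__ = {"x": 1};</script></body></html>'
)
# A statically rendered article with real text in the markup
static = (
    "<html><body><article>"
    + "<p>Distributed systems coordinate multiple independent processes.</p>" * 5
    + "</article></body></html>"
)

shell_is_empty = looks_like_empty_shell(shell)
static_is_empty = looks_like_empty_shell(static)
print(shell_is_empty, static_is_empty)
```

A page flagged this way is a reasonable candidate for min_tier=3; pages with plenty of static text can stay on the cheaper tier.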

The anti-bot bypass system handles Cloudflare, Akamai, and other bot detection layers automatically. You do not need to configure proxies or solve CAPTCHAs manually.

Combining Cortex AI for Custom Extraction

Sometimes the automatic extraction does not capture exactly what you need. A site might have unusual data layouts or domain-specific fields.

Cortex AI adds LLM-powered extraction on top of the scraped page. You describe what you want in plain text:

Python

import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://real-estate-site.com/listings",
    formats=["json"],
    cortex={
        "prompt": "Extract each property listing: address, price, bedrooms, bathrooms, and square footage. Return as a JSON array."
    }
)

for listing in response.json["listings"]:
    print(f"{listing['address']}: ${listing['price']}")

Cortex reads the page like a human would and extracts the fields you specify. No selectors. No regex. Just describe the data you want.

This works well for:

Real estate listings with non-standard layouts
Job boards with varying card structures
Restaurant menus in image-heavy layouts
Government data in poorly structured tables
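LLM-based extraction can omit or leave blank a field on some records, so it is worth checking the shape of the result before using it. The following sketch separates complete listings from incomplete ones. The key names (including `square_footage`) are assumptions based on the prompt above, since the article does not specify how Cortex names fields, and `split_valid` is a hypothetical helper.

```python
# Assumed key names, derived from the fields requested in the prompt above
REQUIRED_FIELDS = ("address", "price", "bedrooms", "bathrooms", "square_footage")


def split_valid(listings, required=REQUIRED_FIELDS):
    """Separate listings that have every required field from those that do not."""
    valid, rejected = [], []
    for item in listings:
        missing = [f for f in required if item.get(f) in (None, "")]
        (rejected if missing else valid).append(item)
    return valid, rejected


# Hypothetical extraction result; it stands in for response.json["listings"]
listings = [
    {"address": "12 Elm St", "price": 450000, "bedrooms": 3,
     "bathrooms": 2, "square_footage": 1800},
    {"address": "9 Oak Ave", "price": 320000, "bedrooms": 2,
     "bathrooms": None, "square_footage": 1100},
]

valid, rejected = split_valid(listings)
print(len(valid), len(rejected))
```

Rejected records can be logged or retried with a more specific prompt instead of silently flowing into your database.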

Multiple Formats in One Request

You can request multiple formats simultaneously. Useful when different parts of your pipeline need different representations:

Python

import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://product-page.com/item/456",
    formats=["json", "markdown", "text"]
)

# save_to_db, summarize, and index are placeholders for your own pipeline functions

# JSON for your database
save_to_db(response.json)

# Markdown for your LLM pipeline
summarize(response.markdown)

# Plain text for search indexing
index(response.text)

One request. Three formats. No duplicate scraping.

Comparison: Traditional vs Format-Based Scraping

Performance and Cost

Format conversion happens server-side as part of the scrape. There is no additional charge for requesting JSON or Markdown instead of HTML. The cost is the same regardless of output format.

1 format param: all you need to change output

0 selectors: no CSS or XPath to maintain

3 formats: JSON, Markdown, and Text in one request

Pricing is pay-as-you-go. You pay per successful scrape, not per format requested. Check the pricing page for current rates.

When to Use Each Format

JSON when you need structured data for databases, APIs, or analytics. Best for product listings, pricing data, contact information, and any content with clear field structure.

Markdown when you need clean text for LLMs, content aggregation, or search indexing. Best for articles, blog posts, documentation pages, and any content where readability matters more than structure.

Text when you need the simplest possible output for full-text search or basic keyword extraction. Strips all formatting, leaves plain text.

Getting Started

The quickstart guide covers installation and your first scrape. The Python SDK is available via pip:

Bash

pip install alterlab

Full API documentation covers all parameters, tier options, and advanced features like scheduling and webhooks.

Takeaway

Stop writing parsers. Request the format you need directly. One parameter changes the output from raw HTML to clean JSON or Markdown. No selectors to maintain. No breakage when sites redesign. Your pipeline gets the data it actually needs.


Frequently Asked Questions

Use a web scraping API that supports format conversion. Send a POST request with a formats parameter set to ["json"] and receive structured data without writing CSS selectors or regex patterns.

Yes. Most scraping APIs support a formats parameter. Set formats=["markdown"] to convert HTML to clean Markdown, stripping scripts, styles, and navigation elements automatically.

Use a scraping API with headless browser support and automatic anti-bot bypass. Set min_tier=3 or higher to ensure JavaScript rendering, then request JSON or Markdown output formats.
