
Get Clean JSON and Markdown Output from Any Website
Learn how to extract structured JSON and Markdown from any webpage without writing custom parsers. Practical examples using AlterLab's web scraping API.
April 15, 2026
HTML is messy. You send a request, get back 4,000 lines of nested divs, inline styles, and script tags, then spend hours writing XPath expressions that break when the site updates.
There is a better approach. Request the format you actually need.
The Problem with Raw HTML
When you scrape a product page, you do not want the HTML. You want:
- Product name
- Price
- Availability
- Description
- Reviews
Extracting those fields means writing selectors for each site. Amazon uses different class names than Shopify stores. Shopify stores differ from WooCommerce. Every site is its own parsing problem.
The traditional approach looks like this:
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example-store.com/product/123")
soup = BeautifulSoup(response.text, "html.parser")

# These selectors break when the site updates
name = soup.select_one(".product-title h1").text
price = soup.select_one(".price-current").text.strip("$")
description = soup.select_one(".product-description p").text
```
This works until the site redesigns. Then your selectors return None and your pipeline breaks.
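The fragility is easy to reproduce without a live site. The sketch below simulates a redesign; it uses a regex extractor instead of BeautifulSoup only to stay dependency-free, and the failure mode is identical: the extractor silently returns None.

```python
import re

old_html = '<h1 class="product-title">Widget</h1>'
new_html = '<h1 class="pdp-heading">Widget</h1>'  # same data, renamed class

def extract_name(html):
    # Matches only the markup the scraper was originally written against
    m = re.search(r'class="product-title">([^<]+)<', html)
    return m.group(1) if m else None

print(extract_name(old_html))  # Widget
print(extract_name(new_html))  # None: downstream code crashes or stores nulls
```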
Request the Format You Need
AlterLab's scraping API converts HTML to structured output server-side. You specify the format, get back clean data.
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://example-store.com/product/123",
    formats=["json"]
)
print(response.json)
```
The response contains extracted fields without any selector logic on your end:
```json
{
  "title": "Wireless Bluetooth Headphones",
  "price": 49.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Over-ear headphones with active noise cancellation...",
  "reviews_count": 1247,
  "rating": 4.3
}
```
No BeautifulSoup. No XPath. No maintenance when the site changes its CSS classes.
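Because the payload is plain JSON, a few lines of standard-library code can validate it before it enters a pipeline. A sketch using the sample response above; the required-field set is our own choice, not part of the API:

```python
import json

payload = json.loads("""{
  "title": "Wireless Bluetooth Headphones",
  "price": 49.99,
  "currency": "USD",
  "availability": "in_stock",
  "reviews_count": 1247,
  "rating": 4.3
}""")

# Fail fast if an expected field is missing
REQUIRED = {"title", "price", "currency", "availability"}
missing = REQUIRED - payload.keys()
assert not missing, f"missing fields: {missing}"

print(payload["title"], payload["price"])  # Wireless Bluetooth Headphones 49.99
```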
JSON Output for Data Pipelines
JSON output works best when you are feeding data into a database, analytics system, or downstream API. The API extracts common structured data patterns automatically:
- Product listings with prices and SKUs
- Article content with titles, authors, and dates
- Contact information from business pages
- Table data converted to arrays of objects
- Navigation links and metadata
```python
import alterlab
import psycopg2

client = alterlab.Client("YOUR_API_KEY")

# Scrape and get JSON directly
response = client.scrape(
    "https://news-site.com/articles/latest",
    formats=["json"]
)

# Insert directly into your database
conn = psycopg2.connect("dbname=news user=writer")
cur = conn.cursor()
for article in response.json["articles"]:
    cur.execute(
        "INSERT INTO articles (title, author, published) VALUES (%s, %s, %s)",
        (article["title"], article["author"], article["published_date"])
    )
conn.commit()
```
The same request works via curl if you are testing from a terminal or building in a non-Python language:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news-site.com/articles/latest",
    "formats": ["json"]
  }'
```
Markdown Output for Content and LLMs
Markdown output strips everything except the readable content. Scripts, styles, navigation bars, footers, and ads disappear. What remains is the article text, properly formatted.
This matters for two use cases:
Content aggregation. You want the article text, not the surrounding chrome. Markdown gives you clean text with heading hierarchy preserved.
LLM context. Language models process Markdown more efficiently than HTML. Tokens spent on <div class="sidebar-widget"> are wasted tokens. Markdown removes the noise.
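A rough size comparison makes the point. Character counts are only a proxy for tokens (real tokenizers vary), but the markup overhead is obvious even in a toy example:

```python
html = ('<div class="sidebar-widget"><ul><li><a href="/promo">Big sale</a>'
        '</li></ul></div><article><h1>Title</h1><p>The actual content.</p></article>')
markdown = "# Title\n\nThe actual content."

# Every markup character is context-window budget spent on noise
print(len(html), len(markdown))
```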
```python
import alterlab
from openai import OpenAI

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://tech-blog.com/post/understanding-distributed-systems",
    formats=["markdown"]
)

# Feed clean markdown to an LLM
llm = OpenAI()
completion = llm.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Summarize this technical article."},
        {"role": "user", "content": response.markdown}
    ]
)
print(completion.choices[0].message.content)
```
The Markdown output looks like this:
```markdown
# Understanding Distributed Systems

## Introduction

Distributed systems coordinate multiple independent processes to achieve a common goal. The key challenge is handling partial failures...

## Consensus Algorithms

The Raft algorithm provides leader election and log replication...

### Leader Election

Nodes vote for a leader in term-based elections...
```
No <script> tags. No cookie consent banners. No navigation menus. Just the content.
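Because the heading hierarchy survives conversion, downstream code can work with it directly. A dependency-free sketch that turns Markdown like the sample above into an indented outline:

```python
import re

md = """# Understanding Distributed Systems
## Introduction
Distributed systems coordinate multiple independent processes...
## Consensus Algorithms
The Raft algorithm provides leader election and log replication...
### Leader Election
Nodes vote for a leader in term-based elections..."""

# Collect (level, title) pairs from ATX-style headings
outline = [(len(m.group(1)), m.group(2).strip())
           for m in re.finditer(r"^(#{1,6})\s+(.+)$", md, re.MULTILINE)]

for level, title in outline:
    print("  " * (level - 1) + title)
```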
Handling JavaScript-Rendered Sites
Many sites render content client-side. The initial HTML response contains almost nothing. The actual data loads via JavaScript after the page renders.
Raw HTTP requests cannot handle this. You need a headless browser.
AlterLab handles this automatically through its tiered rendering system. T1 handles static HTML. T3 and above execute JavaScript. You control the minimum tier:
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://spa-app.com/dashboard",
    formats=["json"],
    min_tier=3
)
```
The min_tier=3 parameter skips the static HTML attempt and goes straight to headless browser rendering. This costs more per request but guarantees you get the rendered content, not the empty shell.
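Since higher tiers cost more per request, one cost-aware pattern is to try the cheap default first and escalate only when extraction comes back empty. A sketch built on the `scrape` call and `min_tier` parameter shown above; the emptiness check is our own heuristic, not API behavior:

```python
def scrape_with_escalation(client, url):
    # Cheap attempt first: let the API start at static HTML
    response = client.scrape(url, formats=["json"])
    if not response.json:
        # An empty extraction usually means a client-rendered page;
        # retry with headless browser rendering
        response = client.scrape(url, formats=["json"], min_tier=3)
    return response
```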
The anti-bot bypass system handles Cloudflare, Akamai, and other bot detection layers automatically. You do not need to configure proxies or solve CAPTCHAs manually.
Custom Extraction with Cortex AI
Sometimes the automatic extraction does not capture exactly what you need. A site might have unusual data layouts or domain-specific fields.
Cortex AI adds LLM-powered extraction on top of the scraped page. You describe what you want in plain text:
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://real-estate-site.com/listings",
    formats=["json"],
    cortex={
        "prompt": "Extract each property listing: address, price, bedrooms, bathrooms, and square footage. Return as a JSON array."
    }
)

for listing in response.json["listings"]:
    print(f"{listing['address']}: ${listing['price']}")
```
Cortex reads the page like a human would and extracts the fields you specify. No selectors. No regex. Just describe the data you want.
This works well for:
- Real estate listings with non-standard layouts
- Job boards with varying card structures
- Restaurant menus in image-heavy layouts
- Government data in poorly structured tables
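Because an LLM-backed extractor can occasionally return incomplete entries, it is worth validating the shape before trusting it. A minimal sketch; the field names mirror the prompt above, and the helper is our own, not part of the SDK:

```python
def validate_listings(listings):
    """Keep only entries carrying every field the prompt asked for."""
    required = {"address", "price", "bedrooms", "bathrooms", "square_footage"}
    return [l for l in listings if isinstance(l, dict) and required <= l.keys()]

raw = [
    {"address": "12 Oak St", "price": 450000, "bedrooms": 3,
     "bathrooms": 2, "square_footage": 1850},
    {"address": "7 Elm Ave"},  # incomplete extraction, dropped
]

clean = validate_listings(raw)
print(len(clean))  # 1
```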
Multiple Formats in One Request
You can request multiple formats simultaneously. Useful when different parts of your pipeline need different representations:
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://product-page.com/item/456",
    formats=["json", "markdown", "text"]
)

# JSON for your database
save_to_db(response.json)

# Markdown for your LLM pipeline
summarize(response.markdown)

# Plain text for search indexing
index(response.text)
```
One request. Three formats. No duplicate scraping.
Comparison: Traditional vs Format-Based Scraping
With the traditional approach, you write per-site selectors, maintain your own parsing code, and fix it every time a target site redesigns. With format-based scraping, one request parameter selects the output, conversion happens server-side, and there is nothing to repair when a site changes its markup.
Performance and Cost
Format conversion happens server-side as part of the scrape. There is no additional charge for requesting JSON or Markdown instead of HTML. The cost is the same regardless of output format.
Pricing is pay-as-you-go. You pay per successful scrape, not per format requested. Check the pricing page for current rates.
When to Use Each Format
JSON when you need structured data for databases, APIs, or analytics. Best for product listings, pricing data, contact information, and any content with clear field structure.
Markdown when you need clean text for LLMs, content aggregation, or search indexing. Best for articles, blog posts, documentation pages, and any content where readability matters more than structure.
Text when you need the simplest possible output for full-text search or basic keyword extraction. Strips all formatting, leaves plain text.
Getting Started
The quickstart guide covers installation and your first scrape. The Python SDK is available via pip:
```bash
pip install alterlab
```
Full API documentation covers all parameters, tier options, and advanced features like scheduling and webhooks.
Takeaway
Stop writing parsers. Request the format you need directly. One parameter changes the output from raw HTML to clean JSON or Markdown. No selectors to maintain. No breakage when sites redesign. Your pipeline gets the data it actually needs.