Get Clean JSON and Markdown Output from Any Website

Learn how to extract structured JSON and Markdown from any webpage without writing custom parsers. Practical examples using AlterLab's web scraping API.

Yash Dubey

April 15, 2026


HTML is messy. You send a request, get back 4,000 lines of nested divs, inline styles, and script tags, then spend hours writing XPath expressions that break when the site updates.

There is a better approach. Request the format you actually need.

The Problem with Raw HTML

When you scrape a product page, you do not want the HTML. You want:

  • Product name
  • Price
  • Availability
  • Description
  • Reviews

Extracting those fields means writing selectors for each site. Amazon uses different class names than Shopify stores. Shopify stores differ from WooCommerce. Every site is its own parsing problem.

The traditional approach looks like this:

Python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example-store.com/product/123")
soup = BeautifulSoup(response.text, "html.parser")

# These selectors break when the site updates
name = soup.select_one(".product-title h1").text
price = soup.select_one(".price-current").text.strip("$")
description = soup.select_one(".product-description p").text

This works until the site redesigns. Then select_one returns None, the .text access raises AttributeError, and your pipeline breaks.

Request the Format You Need

AlterLab's scraping API converts HTML to structured output server-side. You specify the format and get back clean data.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://example-store.com/product/123",
    formats=["json"]
)
print(response.json)

The response contains extracted fields without any selector logic on your end:

JSON
{
  "title": "Wireless Bluetooth Headphones",
  "price": 49.99,
  "currency": "USD",
  "availability": "in_stock",
  "description": "Over-ear headphones with active noise cancellation...",
  "reviews_count": 1247,
  "rating": 4.3
}

No BeautifulSoup. No XPath. No maintenance when the site changes its CSS classes.
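Even without selectors, a pipeline usually benefits from checking the payload before storing it. The sketch below validates the fields shown in the sample response above. The names Product and parse_product are hypothetical helpers, not part of the AlterLab SDK, and the field names are assumed to match the sample.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    title: str
    price: float
    currency: str
    availability: str


def parse_product(payload: dict) -> Product:
    """Validate the fields a pipeline depends on before storing them."""
    required = ("title", "price", "currency", "availability")
    missing = [key for key in required if payload.get(key) is None]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return Product(
        title=str(payload["title"]).strip(),
        price=float(payload["price"]),  # accepts 49.99 or "49.99"
        currency=str(payload["currency"]).upper(),
        availability=str(payload["availability"]),
    )


# Stand-in for response.json, copied from the sample response
sample = {
    "title": "Wireless Bluetooth Headphones",
    "price": 49.99,
    "currency": "USD",
    "availability": "in_stock",
}
product = parse_product(sample)
print(product)
```

Failing fast with a ValueError keeps a partial extraction from silently landing in your database.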

JSON Output for Data Pipelines

JSON output works best when you are feeding data into a database, analytics system, or downstream API. The API extracts common structured data patterns automatically:

  • Product listings with prices and SKUs
  • Article content with titles, authors, and dates
  • Contact information from business pages
  • Table data converted to arrays of objects
  • Navigation links and metadata

Python
import alterlab
import psycopg2

client = alterlab.Client("YOUR_API_KEY")

# Scrape and get JSON directly
response = client.scrape(
    "https://news-site.com/articles/latest",
    formats=["json"]
)

# Insert the extracted articles into your database
conn = psycopg2.connect("dbname=news user=writer")
try:
    # "with conn" commits on success and rolls back on error
    with conn, conn.cursor() as cur:
        for article in response.json["articles"]:
            cur.execute(
                "INSERT INTO articles (title, author, published) VALUES (%s, %s, %s)",
                (article["title"], article["author"], article["published_date"])
            )
finally:
    conn.close()
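Automatic extraction may not find every field on every page, so it can help to tolerate missing values at insert time. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs without a server. The payload shape follows the example above, and the sample rows are made up for illustration. Note that sqlite3 uses ? placeholders where psycopg2 uses %s.

```python
import sqlite3

# Stand-in for response.json; the "articles" key and field names follow the example above
payload = {
    "articles": [
        {"title": "First post", "author": "A. Writer", "published_date": "2026-04-01"},
        {"title": "Unsigned post", "published_date": "2026-04-02"},  # no author extracted
    ]
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT NOT NULL, author TEXT, published TEXT)")

# .get() stores NULL for a field the extractor could not find instead of raising KeyError
rows = [
    (a["title"], a.get("author"), a.get("published_date"))
    for a in payload["articles"]
]
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO articles (title, author, published) VALUES (?, ?, ?)", rows
    )

stored = conn.execute("SELECT title, author FROM articles ORDER BY published").fetchall()
print(stored)
conn.close()
```

Batching with executemany also keeps the whole insert inside one transaction, so a failure partway through leaves the table unchanged.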

The same request works via curl if you are testing from a terminal or building in a non-Python language:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news-site.com/articles/latest",
    "formats": ["json"]
  }'
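If installing the SDK is not an option, the same request can be assembled with Python's standard library. This is a minimal sketch that mirrors the curl call above: the endpoint, header name, and body fields are taken from that example, while build_scrape_request is a hypothetical helper name. It builds the request without sending it.

```python
import json
import urllib.request


def build_scrape_request(api_key: str, url: str, formats: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"url": url, "formats": formats}).encode("utf-8")
    return urllib.request.Request(
        "https://api.alterlab.io/v1/scrape",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_scrape_request(
    "YOUR_API_KEY", "https://news-site.com/articles/latest", ["json"]
)
print(request.get_method(), request.full_url)

# To send it: urllib.request.urlopen(request, timeout=30), then json.load() the response.
# The response schema is not shown in this article, so inspect it before relying on any key.
```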

Markdown Output for Content and LLMs

Markdown output strips everything except the readable content. Scripts, styles, navigation bars, footers, and ads disappear. What remains is the article text, properly formatted.

This matters for two use cases:

Content aggregation. You want the article text, not the surrounding chrome. Markdown gives you clean text with heading hierarchy preserved.

LLM context. The same content takes fewer tokens as Markdown than as HTML. Tokens spent on <div class="sidebar-widget"> are wasted tokens, and Markdown removes that noise.

Python
import alterlab
from openai import OpenAI

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://tech-blog.com/post/understanding-distributed-systems",
    formats=["markdown"]
)

# Feed clean markdown to an LLM
llm = OpenAI()
completion = llm.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Summarize this technical article."},
        {"role": "user", "content": response.markdown}
    ]
)
print(completion.choices[0].message.content)

The Markdown output looks like this:

Markdown
# Understanding Distributed Systems

## Introduction

Distributed systems coordinate multiple independent processes to achieve a common goal. The key challenge is handling partial failures...

## Consensus Algorithms

The Raft algorithm provides leader election and log replication...

### Leader Election

Nodes vote for a leader in term-based elections...

No <script> tags. No cookie consent banners. No navigation menus. Just the content.
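Because the heading hierarchy survives, the Markdown can be split into sections before it goes to an LLM or a search index. The sketch below is one simple way to do that with a regular expression. split_sections is a hypothetical helper, it ignores any text before the first heading, and it would misread a # line inside a fenced code block as a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per ATX heading."""
    sections = []
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = {"level": len(match.group(1)), "title": match.group(2).strip(), "body": []}
            sections.append(current)
        elif current is not None and line.strip():
            current["body"].append(line.strip())
    # Join each section's lines into a single string
    for section in sections:
        section["body"] = " ".join(section["body"])
    return sections


sample = """# Understanding Distributed Systems

## Introduction

Distributed systems coordinate multiple independent processes to achieve a common goal.

## Consensus Algorithms

The Raft algorithm provides leader election and log replication.
"""
sections = split_sections(sample)
for section in sections:
    print(section["level"], section["title"])
```

Each section can then be summarized or embedded on its own, which keeps long articles within a model's context window.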


Handling JavaScript-Rendered Sites

Many sites render content client-side. The initial HTML response contains almost nothing. The actual data loads via JavaScript after the page renders.

A plain HTTP client does not execute JavaScript, so it only ever sees the empty shell. You need a headless browser.

AlterLab handles this automatically through its tiered rendering system. T1 handles static HTML. T3 and above execute JavaScript. You control the minimum tier:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://spa-app.com/dashboard",
    formats=["json"],
    min_tier=3
)

The min_tier=3 parameter skips the static HTML attempt and goes straight to headless browser rendering. This costs more per request but guarantees you get the rendered content, not the empty shell.
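One way to decide whether a site needs min_tier=3 is to check how much visible text a plain fetch returns. The sketch below is a rough heuristic built on the standard library, not AlterLab's tier logic: needs_rendering is a hypothetical helper, and the 200-character threshold is an arbitrary assumption you would tune for your own targets.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def needs_rendering(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests a client-rendered shell."""
    return visible_text_length(html) < min_chars


shell = '<html><body><div id="root"></div><script>window.__APP__ = {"big": "bundle"};</script></body></html>'
article = "<html><body><article><p>" + "Rendered content. " * 20 + "</p></article></body></html>"

print(needs_rendering(shell))
print(needs_rendering(article))
```

Pages that fail this check are candidates for the higher tier, while the rest can stay on the cheaper static path.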

The anti-bot bypass system handles Cloudflare, Akamai, and other bot detection layers automatically. You do not need to configure proxies or solve CAPTCHAs manually.

Combining Cortex AI for Custom Extraction

Sometimes the automatic extraction does not capture exactly what you need. A site might have unusual data layouts or domain-specific fields.

Cortex AI adds LLM-powered extraction on top of the scraped page. You describe what you want in plain text:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://real-estate-site.com/listings",
    formats=["json"],
    cortex={
        "prompt": "Extract each property listing: address, price, bedrooms, bathrooms, and square footage. Return a JSON object with a 'listings' array."
    }
)

for listing in response.json["listings"]:
    print(f"{listing['address']}: ${listing['price']}")

Cortex reads the page like a human would and extracts the fields you specify. No selectors. No regex. Just describe the data you want.

This works well for:

  • Real estate listings with non-standard layouts
  • Job boards with varying card structures
  • Restaurant menus in image-heavy layouts
  • Government data in poorly structured tables
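LLM-based extraction can return fields in inconsistent shapes, such as a price given as "$425,000" on one listing and a number on another. The sketch below normalizes and filters extracted listings before use. The key names (in particular square_footage), the sample rows, and the helpers to_number and clean_listings are assumptions for illustration, not the documented Cortex output.

```python
REQUIRED_FIELDS = ("address", "price", "bedrooms", "bathrooms", "square_footage")


def to_number(value):
    """Convert 425000, "425,000" or "$425,000" to a float; return None if not numeric."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        cleaned = value.replace("$", "").replace(",", "").strip()
        try:
            return float(cleaned)
        except ValueError:
            return None
    return None


def clean_listings(listings):
    """Split extracted listings into usable rows and rejects with a reason."""
    accepted, rejected = [], []
    for listing in listings:
        missing = [f for f in REQUIRED_FIELDS if listing.get(f) in (None, "")]
        price = to_number(listing.get("price"))
        if missing:
            rejected.append((listing, f"missing: {', '.join(missing)}"))
        elif price is None:
            rejected.append((listing, "price is not numeric"))
        else:
            accepted.append({**listing, "price": price})
    return accepted, rejected


# Made-up rows standing in for response.json["listings"]
extracted = [
    {"address": "12 Oak St", "price": "$425,000", "bedrooms": 3, "bathrooms": 2, "square_footage": 1800},
    {"address": "9 Elm Ave", "price": "Contact agent", "bedrooms": 2, "bathrooms": 1, "square_footage": 950},
    {"address": "4 Pine Rd", "price": 310000, "bedrooms": 2},
]
accepted, rejected = clean_listings(extracted)
print(len(accepted), "accepted;", [reason for _, reason in rejected])
```

Keeping the rejects with a reason makes it easier to refine the prompt when a site's layout trips up the extraction.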

Multiple Formats in One Request

You can request multiple formats simultaneously. Useful when different parts of your pipeline need different representations:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://product-page.com/item/456",
    formats=["json", "markdown", "text"]
)

# save_to_db, summarize, and index are placeholders for your own pipeline functions
# JSON for your database
save_to_db(response.json)

# Markdown for your LLM pipeline
summarize(response.markdown)

# Plain text for search indexing
index(response.text)

One request. Three formats. No duplicate scraping.

Performance and Cost

Format conversion happens server-side as part of the scrape, so requesting JSON or Markdown instead of HTML costs nothing extra.

  • 1 format parameter: all you need to change the output
  • 0 selectors: no CSS or XPath to maintain
  • 3 formats: JSON, Markdown, and Text in one request

Pricing is pay-as-you-go. You pay per successful scrape, not per format requested. Check the pricing page for current rates.

When to Use Each Format

JSON when you need structured data for databases, APIs, or analytics. Best for product listings, pricing data, contact information, and any content with clear field structure.

Markdown when you need clean text for LLMs, content aggregation, or search indexing. Best for articles, blog posts, documentation pages, and any content where readability matters more than structure.

Text when you need the simplest possible output for full-text search or basic keyword extraction. Strips all formatting, leaves plain text.

Getting Started

The quickstart guide covers installation and your first scrape. The Python SDK is available via pip:

Bash
pip install alterlab

Full API documentation covers all parameters, tier options, and advanced features like scheduling and webhooks.

Takeaway

Stop writing parsers. Request the format you need directly. One parameter changes the output from raw HTML to clean JSON or Markdown. No selectors to maintain. No breakage when sites redesign. Your pipeline gets the data it actually needs.


Frequently Asked Questions

To get JSON from a website without writing parsers, use a web scraping API that supports format conversion. Send a POST request with a formats parameter set to ["json"] and receive structured data without writing CSS selectors or regex patterns.

To convert a page to Markdown, set formats=["markdown"]. The HTML is converted to clean Markdown, with scripts, styles, and navigation elements stripped automatically.

To scrape JavaScript-rendered sites, use a scraping API with headless browser support and automatic anti-bot bypass. Set min_tier=3 or higher to ensure JavaScript rendering, then request JSON or Markdown output.