
Extract JSON from E-Commerce Sites Without CSS Selectors
Learn how to use AI and schema-based extraction to parse structured product data from e-commerce sites without writing or maintaining fragile CSS selectors.
April 29, 2026
The Problem with DOM-Based Extraction
E-commerce storefronts deploy dozens of front-end updates daily. If your data pipeline relies on traversing the DOM with target paths like div.product-price > span.current-price, it is fragile by design.
Modern web development relies on utility-first CSS frameworks and dynamic class generation. An element that looks like <div class="price"> today might render as <div class="flex mt-4 text-sm font-bold css-1x8z9"> tomorrow. Add constant A/B testing, localized layout variations, and personalized content blocks into the mix. Hardcoded CSS selectors require constant monitoring and immediate patching when they inevitably break.
Maintaining these selectors across hundreds of target domains costs engineering hours that should be spent analyzing the data.
Schema-Driven Extraction
Schema-driven extraction entirely replaces DOM traversal. You define the exact data structure you want to receive using a JSON schema. The extraction engine processes the raw page content and maps the semantic information to your requested structure.
You declare the desired output. The system handles the mapping.
Designing the Extraction Schema
Before making a request, define the data points required for your application. We will build a schema to extract a product's name, its current price as a float, the currency used, a list of available variants, and a boolean indicating stock availability.
Providing clear descriptions for each field improves the accuracy of the extraction model. The descriptions act as instructions for the parser.
{
"type": "object",
"properties": {
"product_name": { "type": "string", "description": "The main title of the product" },
"price": { "type": "number", "description": "The current numeric price, excluding currency symbols" },
"currency": { "type": "string", "description": "The 3-letter currency code, e.g., USD, EUR" },
"in_stock": { "type": "boolean", "description": "True if the item is currently available to purchase" },
"variants": {
"type": "array",
"items": { "type": "string" },
"description": "List of available sizes, colors, or configurations"
}
},
"required": ["product_name", "price", "currency", "in_stock"]
}
Implementing the Request
With the schema defined, you pass it to the AlterLab API alongside the target URL. The system will handle the network request, execute any necessary JavaScript to render the page, and pass the resulting DOM to the Cortex extraction engine.
Here is how to implement this using the Python SDK.
import alterlab
import json
client = alterlab.Client("YOUR_API_KEY")
schema = {
# Schema definition from above
"type": "object",
# ...
}
response = client.scrape(
"https://example-ecommerce-store.com/product/12345",
extract_schema=schema,
wait_for=".product-loaded"
)
# The response.data object strictly adheres to the requested schema
extracted_data = response.data
print(json.dumps(extracted_data, indent=2))
If you prefer to work directly over HTTP, the same operation can be performed with cURL.
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example-ecommerce-store.com/product/12345",
"extract_schema": {"type": "object", "properties": {"product_name": {"type": "string"}}}
}'
The resulting JSON output requires zero post-processing. The extraction engine normalizes the data types based on the schema definitions. Prices are cast to floats, text is stripped of extraneous whitespace, and booleans are properly evaluated based on the context of the page text.
{
"product_name": "Mechanical Keyboard Pro V2",
"price": 129.99,
"currency": "USD",
"in_stock": true,
"variants": ["Cherry MX Red", "Cherry MX Brown", "Cherry MX Blue"]
}
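Even though the response adheres to the schema, a lightweight sanity check before ingestion is cheap insurance for downstream systems. Here is a minimal sketch using only the standard library; the validate_extraction helper and its type map are illustrative, not part of the SDK:

```python
# Minimal post-extraction sanity check (illustrative helper, not part of the SDK).
# Maps JSON Schema type names to Python types and verifies required fields.
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool, "array": list}

def validate_extraction(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks sane."""
    problems = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in data or data[field] is None:
            problems.append(f"missing required field: {field}")
            continue
        expected = JSON_TYPES.get(props.get(field, {}).get("type"))
        if expected and not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return problems

schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["product_name", "price", "currency", "in_stock"],
}
payload = {"product_name": "Mechanical Keyboard Pro V2", "price": 129.99,
           "currency": "USD", "in_stock": True}
print(validate_extraction(payload, schema))  # []
```

An empty list means every required field is present, non-null, and of the declared type; anything else can be logged or routed to a dead-letter queue before it reaches your database.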
Handling Dynamic Rendering and Pagination
E-commerce sites are highly dynamic. Initial HTML payloads often contain placeholder skeletons. The actual product data, variants, and pricing usually load asynchronously via XHR requests after the page initializes.
To accurately extract this data, the headless browser must wait for the network activity to settle. Passing wait_for parameters ensures the DOM is fully hydrated before the extraction model begins parsing. You can wait for specific network events or for a generic lifecycle event like networkidle.
This rendering phase is critical. If the parser evaluates the DOM before the XHR requests resolve, the resulting JSON will contain null values for your required fields.
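When required fields do come back null, the usual remedy is to retry with a more patient wait strategy. A minimal sketch of that backoff logic, assuming any callable that accepts a wait value and returns a dict (the fake_scrape stub stands in for a real client.scrape call):

```python
# Sketch: retry with a longer rendering wait when required fields come back null.
# scrape_fn stands in for a real call like client.scrape(url, extract_schema=...);
# here it is any callable that takes a wait strategy and returns a dict.
def scrape_with_backoff(scrape_fn, required, waits=("domcontentloaded", "networkidle")):
    last = {}
    for wait in waits:
        last = scrape_fn(wait)
        if all(last.get(field) is not None for field in required):
            return last  # fully hydrated result
    return last  # best effort after exhausting wait strategies

# Stub that simulates a page whose XHR data only appears after networkidle.
def fake_scrape(wait):
    if wait == "networkidle":
        return {"product_name": "Keyboard", "price": 129.99}
    return {"product_name": "Keyboard", "price": None}

result = scrape_with_backoff(fake_scrape, required=["product_name", "price"])
print(result["price"])  # 129.99
```

Cheap wait strategies go first, so fully server-rendered pages return quickly and only skeleton-first pages pay the cost of the slower networkidle pass.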
Additionally, data collection pipelines must navigate pagination on category pages. Instead of writing complex logic to find "Next Page" buttons, you can expand your schema to extract pagination metadata.
{
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"url": { "type": "string" }
}
}
},
"has_next_page": { "type": "boolean", "description": "True if there is a next page button visible" },
"next_page_url": { "type": "string", "description": "The URL of the next page of results, if available" }
}
}
Your scraping script can then evaluate has_next_page and enqueue the next_page_url dynamically.
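That crawl loop can be sketched in a few lines. The fetch_page callable stands in for a real client.scrape(url, extract_schema=category_schema) call; anything returning the products/has_next_page/next_page_url shape above will do:

```python
# Sketch of a pagination crawl driven by the schema's pagination metadata.
# fetch_page is any callable returning
# {"products": [...], "has_next_page": bool, "next_page_url": str | None}.
def crawl_category(start_url, fetch_page, max_pages=50):
    products, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guard against pagination loops
        page = fetch_page(url)
        products.extend(page.get("products", []))
        url = page.get("next_page_url") if page.get("has_next_page") else None
    return products

# Two-page stub illustrating the contract.
PAGES = {
    "/laptops?page=1": {"products": [{"title": "A"}], "has_next_page": True,
                        "next_page_url": "/laptops?page=2"},
    "/laptops?page=2": {"products": [{"title": "B"}], "has_next_page": False,
                        "next_page_url": None},
}
items = crawl_category("/laptops?page=1", PAGES.get)
print([p["title"] for p in items])  # ['A', 'B']
```

The seen set and max_pages cap are worth keeping in production: a site that links page N back to page 1 would otherwise loop forever.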
Managing Network Blocks and Reliability
Extracting the data is only half the process. Accessing the data consistently at scale requires managing connection reputation. Retail sites employ sophisticated traffic analysis to block automated requests.
Schema extraction works seamlessly with automated anti-bot handling. The platform automatically manages proxy rotation, browser fingerprinting, and TLS handshakes before passing the successful response to the Cortex engine. You pay only for successful extractions.
This separation of concerns simplifies your architecture. The network layer handles access. The AI layer handles extraction. Your application code strictly handles data ingestion and business logic.
If you encounter sites with strict JavaScript rendering requirements, you can adjust your configuration tier. Setting min_tier=3 ensures the system utilizes a fully configured headless browser capable of rendering heavy client-side applications before extraction begins.
Webhooks for Asynchronous Pipelines
When executing thousands of extraction requests, holding open HTTP connections for each synchronous scrape introduces unnecessary overhead. AlterLab supports webhook deliveries for asynchronous processing.
You submit the target URLs and schemas to the API. The platform queues the jobs, handles all necessary retries, executes the extraction, and POSTs the resulting JSON to your server.
curl -X POST https://api.alterlab.io/v1/scrape/async \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example-ecommerce-store.com/category/laptops",
"webhook_url": "https://your-server.com/webhooks/alterlab",
"extract_schema": { ... }
}'
This pattern allows your infrastructure to scale horizontally without being constrained by concurrent connection limits or network timeouts.
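On the receiving side, your endpoint only needs to decode the delivery and hand the record to your queue. A minimal sketch; the payload shape (a top-level "data" key wrapping the extracted JSON) is an assumption, so confirm the delivery format in the AlterLab docs for your account:

```python
import json

# Sketch of the webhook receiver. The top-level "data" wrapper is an assumed
# payload shape, not confirmed API behavior.
def handle_webhook(raw_body: bytes) -> dict:
    """Decode a webhook delivery and hand the extracted record to ingestion."""
    payload = json.loads(raw_body)
    record = payload.get("data", payload)  # tolerate wrapped or bare JSON
    enqueue_for_ingestion(record)
    return record

def enqueue_for_ingestion(record: dict) -> None:
    # Stand-in for your real queue (SQS, Pub/Sub, a database insert, ...).
    print(f"ingesting {record.get('product_name', '<unknown>')}")

body = json.dumps({"data": {"product_name": "Mechanical Keyboard Pro V2",
                            "price": 129.99}}).encode()
handle_webhook(body)  # prints: ingesting Mechanical Keyboard Pro V2
```

Keeping the handler this thin matters: acknowledge the delivery fast and do the heavy ingestion work off the request path, so retries from the platform are never triggered by your own processing latency.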
Takeaway
Relying on CSS selectors for e-commerce data extraction introduces unacceptable technical debt. Front-end frameworks iterate too quickly for hardcoded DOM paths to remain reliable.
Schema-driven extraction shifts the paradigm. By defining a JSON schema, you instruct an AI model to map the visual and semantic context of a page directly to your required data structures. This approach normalizes data types, handles inconsistent DOM layouts automatically, and outputs production-ready JSON.
Combine this extraction method with managed proxy rotation and headless browser rendering to build resilient, low-maintenance data pipelines. Review our API docs for full schema configuration options and advanced usage patterns.