
Extract JSON from E-Commerce Sites Without CSS Selectors
Learn how to use AI and schema-based extraction to parse structured product data from e-commerce sites without writing or maintaining fragile CSS selectors.
April 29, 2026
The Problem with DOM-Based Extraction
E-commerce storefronts deploy dozens of front-end updates daily. If your data pipeline relies on traversing the DOM with target paths like div.product-price > span.current-price, it is fragile by design.
Modern web development relies on utility-first CSS frameworks and dynamic class generation. An element that looks like <div class="price"> today might render as <div class="flex mt-4 text-sm font-bold css-1x8z9"> tomorrow. Add constant A/B testing, localized layout variations, and personalized content blocks into the mix. Hardcoded CSS selectors require constant monitoring and immediate patching when they inevitably break.
Maintaining these selectors across hundreds of target domains costs engineering hours that should be spent analyzing the data.
Schema-Driven Extraction
Schema-driven extraction entirely replaces DOM traversal. You define the exact data structure you want to receive using a JSON schema. The extraction engine processes the raw page content and maps the semantic information to your requested structure.
You declare the desired output. The system handles the mapping.
Designing the Extraction Schema
Before making a request, define the data points required for your application. We will build a schema to extract a product's name, its current price as a float, the currency used, a list of available variants, and a boolean indicating stock availability.
Providing clear descriptions for each field improves the accuracy of the extraction model. The descriptions act as instructions for the parser.
{
"type": "object",
"properties": {
"product_name": { "type": "string", "description": "The main title of the product" },
"price": { "type": "number", "description": "The current numeric price, excluding currency symbols" },
"currency": { "type": "string", "description": "The 3-letter currency code, e.g., USD, EUR" },
"in_stock": { "type": "boolean", "description": "True if the item is currently available to purchase" },
"variants": {
"type": "array",
"items": { "type": "string" },
"description": "List of available sizes, colors, or configurations"
}
},
"required": ["product_name", "price", "currency", "in_stock"]
}
Implementing the Request
With the schema defined, you pass it to the AlterLab API alongside the target URL. The system will handle the network request, execute any necessary JavaScript to render the page, and pass the resulting DOM to the Cortex extraction engine.
Here is how to implement this using the Python SDK.
import alterlab
import json
client = alterlab.Client("YOUR_API_KEY")
schema = {
# Schema definition from above
"type": "object",
# ...
}
response = client.scrape(
"https://example-ecommerce-store.com/product/12345",
extract_schema=schema,
wait_for=".product-loaded"
)
# The response.data object strictly adheres to the requested schema
extracted_data = response.data
print(json.dumps(extracted_data, indent=2))
If you prefer to work directly over HTTP, the same operation can be performed with cURL.
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example-ecommerce-store.com/product/12345",
"extract_schema": {"type": "object", "properties": {"product_name": {"type": "string"}}}
}'
The resulting JSON output requires zero post-processing. The extraction engine normalizes the data types based on the schema definitions. Prices are cast to floats, text is stripped of extraneous whitespace, and booleans are properly evaluated based on the context of the page text.
{
"product_name": "Mechanical Keyboard Pro V2",
"price": 129.99,
"currency": "USD",
"in_stock": true,
"variants": ["Cherry MX Red", "Cherry MX Brown", "Cherry MX Blue"]
}
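Even though the response adheres to the schema, a lightweight sanity check before ingestion is cheap insurance for downstream systems. Here is a minimal sketch using only the standard library; the validate_extraction helper and its type map are illustrative, not part of the SDK:

```python
# Minimal post-extraction sanity check (illustrative helper, not part of the SDK).
# Maps JSON Schema type names to Python types and verifies required fields.
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool, "array": list}

def validate_extraction(data: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks sane."""
    problems = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in data or data[field] is None:
            problems.append(f"missing required field: {field}")
            continue
        expected = JSON_TYPES.get(props.get(field, {}).get("type"))
        if expected and not isinstance(data[field], expected):
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return problems

schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["product_name", "price", "currency", "in_stock"],
}
payload = {"product_name": "Mechanical Keyboard Pro V2", "price": 129.99,
           "currency": "USD", "in_stock": True}
print(validate_extraction(payload, schema))  # []
```

An empty list means every required field is present, non-null, and of the declared type; anything else can be logged or routed to a dead-letter queue before it reaches your database.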
Handling Dynamic Rendering and Pagination
E-commerce sites are highly dynamic. Initial HTML payloads often contain placeholder skeletons. The actual product data, variants, and pricing usually load asynchronously via XHR requests after the page initializes.
To accurately extract this data, the headless browser must wait for the network activity to settle. Passing wait_for parameters ensures the DOM is fully hydrated before the extraction model begins parsing. You can wait for specific network events or for a generic lifecycle event like networkidle.
This rendering phase is critical. If the parser evaluates the DOM before the XHR requests resolve, the resulting JSON will contain null values for your required fields.
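When required fields do come back null, the usual remedy is to retry with a more patient wait strategy. A minimal sketch of that backoff logic, assuming any callable that accepts a wait value and returns a dict (the fake_scrape stub stands in for a real client.scrape call):

```python
# Sketch: retry with a longer rendering wait when required fields come back null.
# scrape_fn stands in for a real call like client.scrape(url, extract_schema=...);
# here it is any callable that takes a wait strategy and returns a dict.
def scrape_with_backoff(scrape_fn, required, waits=("domcontentloaded", "networkidle")):
    last = {}
    for wait in waits:
        last = scrape_fn(wait)
        if all(last.get(field) is not None for field in required):
            return last  # fully hydrated result
    return last  # best effort after exhausting wait strategies

# Stub that simulates a page whose XHR data only appears after networkidle.
def fake_scrape(wait):
    if wait == "networkidle":
        return {"product_name": "Keyboard", "price": 129.99}
    return {"product_name": "Keyboard", "price": None}

result = scrape_with_backoff(fake_scrape, required=["product_name", "price"])
print(result["price"])  # 129.99
```

Cheap wait strategies go first, so fully server-rendered pages return quickly and only skeleton-first pages pay the cost of the slower networkidle pass.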
Additionally, data collection pipelines must navigate pagination on category pages. Instead of writing complex logic to find "Next Page" buttons, you can expand your schema to extract pagination metadata.
{
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"url": { "type": "string" }
}
}
},
"has_next_page": { "type": "boolean", "description": "True if there is a next page button visible" },
"next_page_url": { "type": "string", "description": "The URL of the next page of results, if available" }
}
}
Your scraping script can then evaluate has_next_page and enqueue the next_page_url dynamically.
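That crawl loop can be sketched in a few lines. The fetch_page callable stands in for a real client.scrape(url, extract_schema=category_schema) call; anything returning the products/has_next_page/next_page_url shape above will do:

```python
# Sketch of a pagination crawl driven by the schema's pagination metadata.
# fetch_page is any callable returning
# {"products": [...], "has_next_page": bool, "next_page_url": str | None}.
def crawl_category(start_url, fetch_page, max_pages=50):
    products, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guard against pagination loops
        page = fetch_page(url)
        products.extend(page.get("products", []))
        url = page.get("next_page_url") if page.get("has_next_page") else None
    return products

# Two-page stub illustrating the contract.
PAGES = {
    "/laptops?page=1": {"products": [{"title": "A"}], "has_next_page": True,
                        "next_page_url": "/laptops?page=2"},
    "/laptops?page=2": {"products": [{"title": "B"}], "has_next_page": False,
                        "next_page_url": None},
}
items = crawl_category("/laptops?page=1", PAGES.get)
print([p["title"] for p in items])  # ['A', 'B']
```

The seen set and max_pages cap are worth keeping in production: a site that links page N back to page 1 would otherwise loop forever.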
Managing Network Blocks and Reliability
Extracting the data is only half the process. Accessing the data consistently at scale requires managing connection reputation. Retail sites employ sophisticated traffic analysis to block automated requests.
Schema extraction works seamlessly with automated anti-bot handling. The platform automatically manages proxy rotation, browser fingerprinting, and TLS handshakes before passing the successful response to the Cortex engine. You pay only for successful extractions.
This separation of concerns simplifies your architecture. The network layer handles access. The AI layer handles extraction. Your application code strictly handles data ingestion and business logic.
If you encounter sites with strict JavaScript rendering requirements, you can adjust your configuration tier. Setting min_tier=3 ensures the system utilizes a fully configured headless browser capable of rendering heavy client-side applications before extraction begins.
Webhooks for Asynchronous Pipelines
When executing thousands of extraction requests, holding open HTTP connections for each synchronous scrape introduces unnecessary overhead. AlterLab supports webhook deliveries for asynchronous processing.
You submit the target URLs and schemas to the API. The platform queues the jobs, handles all necessary retries, executes the extraction, and POSTs the resulting JSON to your server.
curl -X POST https://api.alterlab.io/v1/scrape/async \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example-ecommerce-store.com/category/laptops",
"webhook_url": "https://your-server.com/webhooks/alterlab",
"extract_schema": { ... }
}'
This pattern allows your infrastructure to scale horizontally without being constrained by concurrent connection limits or network timeouts.
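On the receiving side, your endpoint only needs to decode the delivery and hand the record to your queue. A minimal sketch; the payload shape (a top-level "data" key wrapping the extracted JSON) is an assumption, so confirm the delivery format in the AlterLab docs for your account:

```python
import json

# Sketch of the webhook receiver. The top-level "data" wrapper is an assumed
# payload shape, not confirmed API behavior.
def handle_webhook(raw_body: bytes) -> dict:
    """Decode a webhook delivery and hand the extracted record to ingestion."""
    payload = json.loads(raw_body)
    record = payload.get("data", payload)  # tolerate wrapped or bare JSON
    enqueue_for_ingestion(record)
    return record

def enqueue_for_ingestion(record: dict) -> None:
    # Stand-in for your real queue (SQS, Pub/Sub, a database insert, ...).
    print(f"ingesting {record.get('product_name', '<unknown>')}")

body = json.dumps({"data": {"product_name": "Mechanical Keyboard Pro V2",
                            "price": 129.99}}).encode()
handle_webhook(body)  # prints: ingesting Mechanical Keyboard Pro V2
```

Keeping the handler this thin matters: acknowledge the delivery fast and do the heavy ingestion work off the request path, so retries from the platform are never triggered by your own processing latency.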
Takeaway
Relying on CSS selectors for e-commerce data extraction introduces unacceptable technical debt. Front-end frameworks iterate too quickly for hardcoded DOM paths to remain reliable.
Schema-driven extraction shifts the paradigm. By defining a JSON schema, you instruct an AI model to map the visual and semantic context of a page directly to your required data structures. This approach normalizes data types, handles inconsistent DOM layouts automatically, and outputs production-ready JSON.
Combine this extraction method with managed proxy rotation and headless browser rendering to build resilient, low-maintenance data pipelines. Review our API docs for full schema configuration options and advanced usage patterns.