Scraping E-Commerce Sites at Scale Without Getting Blocked
Amazon, Walmart, Target, and Best Buy all run aggressive anti-bot systems. Here is what works for extracting product data from these sites in 2026.
Yash Dubey
February 9, 2026
E-commerce sites are some of the hardest scraping targets on the internet. Amazon alone runs multiple layers of bot detection including CAPTCHA challenges, device fingerprinting, and behavioral analysis. Walmart uses DataDome. Target and Best Buy use PerimeterX.
If you need product data from these sites, here is what you are up against and what actually works.
Why E-Commerce Sites Are Hard
These sites have real financial incentive to block scrapers. Competitors use scraping for price monitoring. Counterfeit sellers use it to copy listings. Data brokers resell the data. The sites fight back aggressively.
Common Defenses
Rate limiting per session and IP. Amazon starts showing CAPTCHAs after 20-30 requests from the same IP in a short window. The threshold varies by product category and time of day.
Dynamic page structure. Class names and element IDs change between deployments. Sometimes they change per user session. A scraper that relies on specific CSS selectors breaks regularly.
Price and availability obfuscation. Some sites load pricing via separate API calls or embed it in JavaScript bundles that must be executed before the price can be extracted.
Login walls for full data. Reviews, seller information, and inventory counts are sometimes gated behind login or require specific cookies from browsing behavior.
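A scraper needs to recognize these defenses the moment it trips them. Here is a minimal detection-and-backoff sketch; the status codes and page markers are heuristic assumptions to tune per site, not official signals from any vendor:

```python
import random

# Heuristic block signals. Exact strings vary by site and change over
# time, so treat these as starting assumptions, not a definitive list.
BLOCK_STATUS = {403, 429, 503}
BLOCK_MARKERS = ("captcha", "robot check", "access denied")

def looks_blocked(status_code, body):
    """Return True if a response looks like a bot-detection page."""
    if status_code in BLOCK_STATUS:
        return True
    snippet = body[:5000].lower()
    return any(marker in snippet for marker in BLOCK_MARKERS)

def backoff_delay(attempt, base=5.0, cap=300.0):
    """Exponential backoff with jitter for retrying after a block."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.5)
```

When `looks_blocked` fires, sleep for `backoff_delay(attempt)` seconds and rotate the session before retrying; hammering the same IP after a CAPTCHA only digs the hole deeper.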
Product Data Architecture
Before scraping, understand how e-commerce product data is structured. Most major sites embed structured data in the page:
<!-- Most product pages include JSON-LD -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Apple AirPods Pro",
  "offers": {
    "price": "249.00",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
</script>

Extracting JSON-LD is more reliable than parsing HTML elements. The structured data follows the schema.org standard and changes less frequently than the visual layout.
import json
from bs4 import BeautifulSoup

def extract_product_jsonld(html):
    """Return the first Product JSON-LD object found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # malformed JSON-LD blocks are common in the wild
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
        if isinstance(data, list):
            for item in data:
                if isinstance(item, dict) and item.get("@type") == "Product":
                    return item
    return None

Site-Specific Notes
Amazon
Amazon is the hardest e-commerce target. Their bot detection uses:
- Device fingerprinting via JavaScript
- CAPTCHA challenges (image selection, text CAPTCHAs)
- Session-based rate limiting
- Dynamic HTML structure
What works: Residential proxies with full browser rendering. Rotate IPs after every 10-15 requests. Do not send requests faster than one every 3-5 seconds per session.
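Those pacing and rotation rules can be sketched in a few lines. The proxy pool is an assumption, and the 12-request session cap is one illustrative value within the 10-15 range:

```python
import random
import time

class SessionRotator:
    """Rotate to a fresh proxy after a capped number of requests and
    pace requests at 3-5 seconds each. The proxy list and the cap of
    12 requests per session are illustrative assumptions."""

    def __init__(self, proxies, max_requests_per_session=12):
        self.proxies = proxies
        self.max_requests = max_requests_per_session
        self.count = 0
        self.current = random.choice(proxies)

    def next_proxy(self):
        # Swap sessions once the current one has been used enough
        if self.count >= self.max_requests:
            self.current = random.choice(self.proxies)
            self.count = 0
        self.count += 1
        return self.current

    def pace(self):
        # One request every 3-5 seconds per session
        time.sleep(random.uniform(3.0, 5.0))
```

Call `next_proxy()` before each request and `pace()` after it; a fixed interval is itself a fingerprint, which is why the delay is randomized.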
The product API (PA-API 5.0) is an alternative if you qualify as an Amazon Associate, but it has strict rate limits and requires an active affiliate account.
Walmart
Uses DataDome for bot protection. Less aggressive than Amazon but still catches headless browsers easily.
What works: Playwright with stealth patches and ISP proxies. Walmart also exposes a lot of data through their taxonomy and search API endpoints that are less protected than product pages.
Target and Best Buy
Both use PerimeterX. The detection focuses heavily on mouse movement patterns and JavaScript execution timing.
What works: Full browser rendering with realistic viewport sizes and interaction delays. Best Buy is particularly strict about request timing.
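The viewport and timing advice translates into a small helper. The viewport list and delay distribution below are assumptions; the options dict is shaped for a browser automation library such as Playwright (`browser.new_context(**opts)`):

```python
import random

# Common desktop resolutions; the specific list is an assumption.
VIEWPORTS = [(1920, 1080), (1536, 864), (1440, 900), (1366, 768)]

def realistic_context_options():
    """Browser context options with a realistic, randomized viewport."""
    width, height = random.choice(VIEWPORTS)
    return {"viewport": {"width": width, "height": height}}

def interaction_delay(base=0.8, spread=1.7):
    """Human-like pause between interactions, in seconds: a floor plus
    an exponentially distributed tail, so pauses cluster short but
    occasionally run long, like a real user."""
    return base + random.expovariate(1.0 / spread)
```

Sleep for `interaction_delay()` between clicks and scrolls rather than a constant value; perfectly regular timing is exactly what PerimeterX-style behavioral checks look for.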
Scaling E-Commerce Scraping
The Numbers
Say you need to monitor prices on 50,000 product listings across Amazon, Walmart, and Target, updated daily.
That is 50,000 requests per day. With a 70% success rate on first attempt and retries for the rest, you are making roughly 65,000-70,000 total requests daily.
DIY cost estimate (per month, with daily updates):
- Residential proxies: 70K requests/day x 3 MB avg = 210 GB/day, about 6.3 TB/month; at a typical $10/GB rate, roughly $63,000/month
- Server (10 browser instances): $150/month
- CAPTCHA solving: $200/month (for Amazon)
- Engineering maintenance: 15 hrs/month
API cost estimate: With AlterLab, e-commerce pages with JS rendering and anti-bot bypass fall into the higher-cost tier:
- 50,000 requests/day x 30 days x $0.02-0.05 = $30,000-75,000/month
- No infrastructure to maintain
- Failed requests are not charged
The API approach is cheaper or comparable, and you skip the infrastructure headaches.
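Scaled to a full 30-day month of daily updates, the comparison works out roughly as follows. The $10/GB proxy rate and the 30-day month are assumptions; the other inputs come from the estimate above:

```python
# Back-of-envelope monthly cost model. The $10/GB residential proxy
# rate and 30-day month are assumptions; other figures are from the
# estimate above (70K requests/day including retries, 3 MB per page).
DAYS = 30
requests_per_day = 70_000
avg_page_mb = 3
proxy_usd_per_gb = 10.0

diy_proxy = requests_per_day * DAYS * avg_page_mb / 1000 * proxy_usd_per_gb
diy_total = diy_proxy + 150 + 200  # plus server and CAPTCHA solving

api_low = 50_000 * DAYS * 0.02
api_high = 50_000 * DAYS * 0.05

print(f"DIY: ${diy_total:,.0f}/month (plus ~15 hrs engineering)")
print(f"API: ${api_low:,.0f}-${api_high:,.0f}/month")
```

Bandwidth dominates the DIY side, which is why trimming average page weight (blocking images and fonts during rendering) moves the needle more than any other optimization.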
Freshness vs Cost
Not every product needs daily price updates. Most prices change weekly or less frequently. A smart approach:
- High-priority products (top sellers, competitive items): update every 6-12 hours
- Medium priority: update daily
- Low priority (long-tail catalog): update every 3-7 days
This can cut your request volume by 60-70% compared to blanket daily scraping.
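Here is how the savings pencil out under one assumed catalog split: 5% high priority at two updates a day, 15% medium at one a day, and 80% long tail on a 7-day cycle. The split itself is an assumption to tune against your catalog:

```python
# Daily request volume under tiered freshness for a 50,000-item catalog.
catalog = 50_000
tiers = [
    (0.05, 2.0),    # high priority: 5% of catalog, 2 updates/day
    (0.15, 1.0),    # medium: 15%, daily
    (0.80, 1 / 7),  # long tail: 80%, every 7 days
]

daily = sum(catalog * frac * rate for frac, rate in tiers)
savings = 1 - daily / catalog
print(f"{daily:,.0f} requests/day, {savings:.0%} fewer than blanket daily scraping")
```

With this split the daily volume drops from 50,000 to about 18,000 requests, in line with the 60-70% reduction mentioned above.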
Structured Data Extraction
Raw HTML is not what you want to store. Parse it into structured data immediately:
from datetime import datetime, timezone

def parse_product(html, source):
    # Try JSON-LD first (most reliable)
    product = extract_product_jsonld(html)
    if product:
        offers = product.get("offers") or {}
        return {
            "name": product.get("name"),
            "price": offers.get("price"),
            "currency": offers.get("priceCurrency"),
            "availability": offers.get("availability"),
            "source": source,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        }
    # Fall back to HTML parsing
    # (site-specific selectors here)
    return None

AlterLab can return structured JSON directly using schema-based extraction. You define the fields you want, and the API extracts them from the page. This saves you from maintaining parsing logic for each site.
The Practical Approach
Start with the JSON-LD and API endpoints. These are the least protected and give you the most structured data. Only fall back to full page rendering when the data you need is not available through easier channels.
Use a scraping API for the hard targets (Amazon, Walmart with bot protection). Build simple scripts for the easy targets (smaller retailers without serious protection).
Monitor your success rates. When they drop below 80%, something changed on the target site. Fix it immediately: stale data is worse than no data.
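That success-rate check is easy to automate with a sliding window. The 80% threshold comes from above; the window size and minimum-sample guard are assumptions:

```python
from collections import deque

class SuccessMonitor:
    """Track success rate over the last N requests and flag when it
    drops below a threshold (80% per the guidance above). The window
    of 500 and the 50-sample minimum are assumed defaults."""

    def __init__(self, window=500, threshold=0.8):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        self.results.append(bool(ok))

    @property
    def rate(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        # Require a reasonably full window before alerting,
        # so a couple of early failures do not page anyone.
        return len(self.results) >= 50 and self.rate < self.threshold
```

Call `record()` after every request and check `needs_attention()` on a schedule; wiring the alert into whatever paging or chat tool you already use is the easy part.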