AlterLab

Scraping E-Commerce Sites at Scale Without Getting Blocked

Amazon, Walmart, Target, and Best Buy all run aggressive anti-bot systems. Here is what works for extracting product data from these sites in 2026.

Yash Dubey


February 9, 2026

9 min read

E-commerce sites are some of the hardest scraping targets on the internet. Amazon alone runs multiple layers of bot detection including CAPTCHA challenges, device fingerprinting, and behavioral analysis. Walmart and Target use DataDome. Best Buy uses PerimeterX.

If you need product data from these sites, here is what you are up against and what actually works.

Why E-Commerce Sites Are Hard

These sites have real financial incentive to block scrapers. Competitors use scraping for price monitoring. Counterfeit sellers use it to copy listings. Data brokers resell the data. The sites fight back aggressively.

Common Defenses

Rate limiting per session and IP. Amazon starts showing CAPTCHAs after 20-30 requests from the same IP in a short window. The threshold varies by product category and time of day.

Dynamic page structure. Class names and element IDs change between deployments. Sometimes they change per user session. A scraper that relies on specific CSS selectors breaks regularly.

Price and availability obfuscation. Some sites load pricing via separate API calls, or embed it in JavaScript bundles that must be executed before the price can be extracted.

Login walls for full data. Reviews, seller information, and inventory counts are sometimes gated behind login or require specific cookies from browsing behavior.
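The dynamic-structure problem is usually handled with layered selector fallbacks: try the selector that worked after the last deployment, then fall back to older ones. A minimal sketch with BeautifulSoup (the snippet and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

def first_match(soup, selectors):
    """Return the text of the first CSS selector that matches, or None."""
    for sel in selectors:
        el = soup.select_one(sel)
        if el is not None:
            return el.get_text(strip=True)
    return None

# Hypothetical markup and class names, purely for illustration
html = '<div><span class="a-price">$249.00</span></div>'
soup = BeautifulSoup(html, "html.parser")
price = first_match(soup, ["span.priceblock", "span.a-price", "div.price"])
```

When a deployment breaks the first selector, the scraper degrades to the fallbacks instead of failing outright.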

Product Data Architecture

Before scraping, understand how e-commerce product data is structured. Most major sites embed structured data in the page:

html
<!-- Most product pages include JSON-LD -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Apple AirPods Pro",
  "offers": {
    "@type": "Offer",
    "price": "249.00",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
</script>

Extracting JSON-LD is more reliable than parsing HTML elements. The structured data follows a schema standard and changes less frequently than the visual layout.

python
import json
from bs4 import BeautifulSoup

def extract_product_jsonld(html):
    """Return the first JSON-LD Product object in the page, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # malformed or empty JSON-LD blocks are common in the wild
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
        if isinstance(data, list):
            for item in data:
                if isinstance(item, dict) and item.get("@type") == "Product":
                    return item
    return None

Site-Specific Notes

Amazon

Amazon is the hardest e-commerce target. Their bot detection uses:

  • Device fingerprinting via JavaScript
  • CAPTCHA challenges (image selection, text CAPTCHAs)
  • Session-based rate limiting
  • Dynamic HTML structure

What works: Residential proxies with full browser rendering. Rotate IPs after every 10-15 requests. Do not send requests faster than one every 3-5 seconds per session.
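The rotation and pacing rules above can be sketched as a small helper. The defaults mirror the numbers in this section; the proxy URLs and exact thresholds are placeholders to tune against your own block rates:

```python
import itertools
import random
import time

class SessionRotator:
    """Rotate to a fresh proxy every `rotate_every` requests and pace calls."""

    def __init__(self, proxies, rotate_every=12, min_delay=3.0, max_delay=5.0):
        self.pool = itertools.cycle(proxies)
        self.rotate_every = rotate_every
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.count = 0
        self.current = next(self.pool)

    def next_proxy(self):
        """Sleep a human-ish interval, then return the proxy for this request."""
        self.count += 1
        if self.count % self.rotate_every == 0:
            self.current = next(self.pool)  # new IP after every 10-15 requests
        time.sleep(random.uniform(self.min_delay, self.max_delay))
        return self.current
```

Each outgoing request asks `next_proxy()` for its proxy, which enforces both the 3-5 second pacing and the rotation cadence in one place.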

The product API (PA-API 5.0) is an alternative if you qualify as an Amazon Associate, but it has strict rate limits and requires an active affiliate account.

Walmart

Uses DataDome for bot protection. Less aggressive than Amazon but still catches headless browsers easily.

What works: Playwright with stealth patches and ISP proxies. Walmart also exposes a lot of data through their taxonomy and search API endpoints that are less protected than product pages.

Target and Best Buy

Both use PerimeterX. The detection focuses heavily on mouse movement patterns and JavaScript execution timing.

What works: Full browser rendering with realistic viewport sizes and interaction delays. Best Buy is particularly strict about request timing.
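One way to get realistic viewports is to randomize the browser context options before each session. A sketch; `human_context_options` is a made-up helper and the values are illustrative assumptions, not parameters PerimeterX is known to allow:

```python
import random

def human_context_options():
    """Context options that resemble a real desktop session (illustrative)."""
    width, height = random.choice([(1366, 768), (1440, 900), (1920, 1080)])
    return {
        "viewport": {"width": width, "height": height},
        "locale": "en-US",
        "timezone_id": "America/Chicago",
    }

# Usage with Playwright (requires `pip install playwright`):
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.chromium.launch(headless=True)
#     context = browser.new_context(**human_context_options())
#     page = context.new_page()
#     page.goto("https://www.bestbuy.com/")
#     page.mouse.move(200, 300)                         # some pointer activity
#     page.wait_for_timeout(random.randint(800, 2500))  # human-like pause
```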

Scaling E-Commerce Scraping

The Numbers

Say you need to monitor prices on 50,000 product listings across Amazon, Walmart, and Target, updated daily.

That is 50,000 requests per day. With a 70% success rate on first attempt and retries for the rest, you are making roughly 65,000-70,000 total requests daily.
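Worked out, with the retry assumption made explicit:

```python
listings = 50_000                     # products refreshed daily
first_pass_success = 0.70             # success rate on the first attempt
retries = round(listings * (1 - first_pass_success))
total_requests = listings + retries   # assumes most retries succeed second time
# A second retry round for stubborn pages pushes this toward 70,000
```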

DIY cost estimate:

  • Residential proxies: 70K requests x 3 MB avg = 210 GB; at roughly $10/GB, $2,100/month
  • Server (10 browser instances): $150/month
  • CAPTCHA solving: $200/month (for Amazon)
  • Engineering maintenance: 15 hrs/month

API cost estimate: With AlterLab, e-commerce pages with JS rendering and anti-bot bypass fall into the higher-cost tier:

  • 50,000 requests x $0.02-0.05 = $1,000-2,500/month
  • No infrastructure to maintain
  • Failed requests are not charged

The API approach is cheaper or comparable, and you skip the infrastructure headaches.

Freshness vs Cost

Not every product needs daily price updates. Most prices change weekly or less frequently. A smart approach:

  • High-priority products (top sellers, competitive items): update every 6-12 hours
  • Medium priority: update daily
  • Low priority (long-tail catalog): update every 3-7 days

This can cut your request volume by 60-70% compared to blanket daily scraping.
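As a sanity check, here is one hypothetical tier split for the 50,000-listing example; the shares and intervals are illustrative assumptions, not measured figures:

```python
# Hypothetical tier split for a 50,000-listing catalog
catalog = 50_000
tiers = {
    "high":   {"share": 0.05, "updates_per_day": 2},      # every 12 hours
    "medium": {"share": 0.15, "updates_per_day": 1},      # daily
    "low":    {"share": 0.80, "updates_per_day": 1 / 7},  # weekly long tail
}
daily_requests = sum(catalog * t["share"] * t["updates_per_day"]
                     for t in tiers.values())
savings = 1 - daily_requests / catalog   # vs. blanket daily scraping
```

With this split, daily volume drops from 50,000 to roughly 18,000 requests, a reduction in the 60-70% range the text describes.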

Structured Data Extraction

Raw HTML is not what you want to store. Parse it into structured data immediately:

python
from datetime import datetime, timezone

def parse_product(html, source):
    # Try JSON-LD first (most reliable)
    product = extract_product_jsonld(html)
    if product:
        offers = product.get("offers") or {}
        if isinstance(offers, list):  # some sites emit a list of offers
            offers = offers[0] if offers else {}
        return {
            "name": product.get("name"),
            "price": offers.get("price"),
            "currency": offers.get("priceCurrency"),
            "availability": offers.get("availability"),
            "source": source,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        }
    # Fall back to site-specific HTML selectors here
    return None

AlterLab can return structured JSON directly using schema-based extraction. You define the fields you want, and the API extracts them from the page. Saves you from maintaining parsing logic for each site.

The Practical Approach

Start with the JSON-LD and API endpoints. These are the least protected and give you the most structured data. Only fall back to full page rendering when the data you need is not available through easier channels.

Use a scraping API for the hard targets (Amazon, Walmart with bot protection). Build simple scripts for the easy targets (smaller retailers without serious protection).

Monitor your success rates. When they drop below 80%, something changed on the target site. Fix it immediately - stale data is worse than no data.
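A rolling success-rate monitor takes only a few lines. This sketch flags the pipeline once the window dips below the 80% threshold (the window size and minimum sample are assumptions to tune):

```python
from collections import deque

class SuccessMonitor:
    """Track a rolling success rate and flag when it drops below a threshold."""

    def __init__(self, window=500, threshold=0.80):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        self.results.append(bool(ok))

    @property
    def rate(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        # Require a minimum sample so a single early failure does not alert
        return len(self.results) >= 50 and self.rate < self.threshold
```

Call `record()` after every request and check `needs_attention()` on a schedule; when it fires, inspect the target site for changed selectors or tightened bot defenses.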
