Scraping E-Commerce Sites at Scale Without Getting Blocked
Amazon, Walmart, Target, and Best Buy all run aggressive anti-bot systems. Here is what works for extracting product data from these sites in 2026.
Yash Dubey
February 9, 2026
E-commerce sites are some of the hardest scraping targets on the internet. Amazon alone runs multiple layers of bot detection including CAPTCHA challenges, device fingerprinting, and behavioral analysis. Walmart uses DataDome. Target and Best Buy use PerimeterX.
If you need product data from these sites, here is what you are up against and what actually works.
Why E-Commerce Sites Are Hard
These sites have real financial incentive to block scrapers. Competitors use scraping for price monitoring. Counterfeit sellers use it to copy listings. Data brokers resell the data. The sites fight back aggressively.
Common Defenses
Rate limiting per session and IP. Amazon starts showing CAPTCHAs after 20-30 requests from the same IP in a short window. The threshold varies by product category and time of day.
Dynamic page structure. Class names and element IDs change between deployments. Sometimes they change per user session. A scraper that relies on specific CSS selectors breaks regularly.
Price and availability obfuscation. Some sites load pricing via separate API calls or embed it in JavaScript bundles that must be executed before the price can be extracted.
Login walls for full data. Reviews, seller information, and inventory counts are sometimes gated behind login or require specific cookies from browsing behavior.
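A scraper needs to recognize these defenses the moment it trips them. Here is a minimal detection-and-backoff sketch; the status codes and page markers are heuristic assumptions to tune per site, not official signals from any vendor:

```python
import random

# Heuristic block signals. Exact strings vary by site and change over
# time, so treat these as starting assumptions, not a definitive list.
BLOCK_STATUS = {403, 429, 503}
BLOCK_MARKERS = ("captcha", "robot check", "access denied")

def looks_blocked(status_code, body):
    """Return True if a response looks like a bot-detection page."""
    if status_code in BLOCK_STATUS:
        return True
    snippet = body[:5000].lower()
    return any(marker in snippet for marker in BLOCK_MARKERS)

def backoff_delay(attempt, base=5.0, cap=300.0):
    """Exponential backoff with jitter for retrying after a block."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.5)
```

When `looks_blocked` fires, sleep for `backoff_delay(attempt)` seconds and rotate the session before retrying; hammering the same IP after a CAPTCHA only digs the hole deeper.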
Product Data Architecture
Before scraping, understand how e-commerce product data is structured. Most major sites embed structured data in the page:
<!-- Most product pages include JSON-LD -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Apple AirPods Pro",
  "offers": {
    "price": "249.00",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  }
}
</script>

Extracting JSON-LD is more reliable than parsing HTML elements. The structured data follows the schema.org standard and changes less frequently than the visual layout.
import json
from bs4 import BeautifulSoup

def extract_product_jsonld(html):
    """Return the first Product JSON-LD object found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # malformed JSON-LD blocks are common in the wild
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
        if isinstance(data, list):
            for item in data:
                if isinstance(item, dict) and item.get("@type") == "Product":
                    return item
    return None

Site-Specific Notes
Amazon
Amazon is the hardest e-commerce target. Their bot detection uses:
- Device fingerprinting via JavaScript
- CAPTCHA challenges (image selection, text CAPTCHAs)
- Session-based rate limiting
- Dynamic HTML structure
What works: Residential proxies with full browser rendering. Rotate IPs after every 10-15 requests. Do not send requests faster than one every 3-5 seconds per session.
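Those pacing and rotation rules can be sketched in a few lines. The proxy pool is an assumption, and the 12-request session cap is one illustrative value within the 10-15 range:

```python
import random
import time

class SessionRotator:
    """Rotate to a fresh proxy after a capped number of requests and
    pace requests at 3-5 seconds each. The proxy list and the cap of
    12 requests per session are illustrative assumptions."""

    def __init__(self, proxies, max_requests_per_session=12):
        self.proxies = proxies
        self.max_requests = max_requests_per_session
        self.count = 0
        self.current = random.choice(proxies)

    def next_proxy(self):
        # Swap sessions once the current one has been used enough
        if self.count >= self.max_requests:
            self.current = random.choice(self.proxies)
            self.count = 0
        self.count += 1
        return self.current

    def pace(self):
        # One request every 3-5 seconds per session
        time.sleep(random.uniform(3.0, 5.0))
```

Call `next_proxy()` before each request and `pace()` after it; a fixed interval is itself a fingerprint, which is why the delay is randomized.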
The product API (PA-API 5.0) is an alternative if you qualify as an Amazon Associate, but it has strict rate limits and requires an active affiliate account.
Walmart
Uses DataDome for bot protection. Less aggressive than Amazon but still catches headless browsers easily.
What works: Playwright with stealth patches and ISP proxies. Walmart also exposes a lot of data through their taxonomy and search API endpoints that are less protected than product pages.
Target and Best Buy
Both use PerimeterX. The detection focuses heavily on mouse movement patterns and JavaScript execution timing.
What works: Full browser rendering with realistic viewport sizes and interaction delays. Best Buy is particularly strict about request timing.
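The viewport and timing advice translates into a small helper. The viewport list and delay distribution below are assumptions; the options dict is shaped for a browser automation library such as Playwright (`browser.new_context(**opts)`):

```python
import random

# Common desktop resolutions; the specific list is an assumption.
VIEWPORTS = [(1920, 1080), (1536, 864), (1440, 900), (1366, 768)]

def realistic_context_options():
    """Browser context options with a realistic, randomized viewport."""
    width, height = random.choice(VIEWPORTS)
    return {"viewport": {"width": width, "height": height}}

def interaction_delay(base=0.8, spread=1.7):
    """Human-like pause between interactions, in seconds: a floor plus
    an exponentially distributed tail, so pauses cluster short but
    occasionally run long, like a real user."""
    return base + random.expovariate(1.0 / spread)
```

Sleep for `interaction_delay()` between clicks and scrolls rather than a constant value; perfectly regular timing is exactly what PerimeterX-style behavioral checks look for.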
Scaling E-Commerce Scraping
The Numbers
Say you need to monitor prices on 50,000 product listings across Amazon, Walmart, and Target, updated daily.
That is 50,000 requests per day. With a 70% success rate on first attempt and retries for the rest, you are making roughly 65,000-70,000 total requests daily.
DIY cost estimate (per month, with daily updates):
- Residential proxies: 70K requests/day x 3 MB avg = 210 GB/day, about 6.3 TB/month; at a typical $10/GB rate, roughly $63,000/month
- Server (10 browser instances): $150/month
- CAPTCHA solving: $200/month (for Amazon)
- Engineering maintenance: 15 hrs/month
API cost estimate: With AlterLab, e-commerce pages with JS rendering and anti-bot bypass fall into the higher-cost tier:
- 50,000 requests/day x 30 days x $0.02-0.05 = $30,000-75,000/month
- No infrastructure to maintain
- Failed requests are not charged
The API approach is cheaper or comparable, and you skip the infrastructure headaches.
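Scaled to a full 30-day month of daily updates, the comparison works out roughly as follows. The $10/GB proxy rate and the 30-day month are assumptions; the other inputs come from the estimate above:

```python
# Back-of-envelope monthly cost model. The $10/GB residential proxy
# rate and 30-day month are assumptions; other figures are from the
# estimate above (70K requests/day including retries, 3 MB per page).
DAYS = 30
requests_per_day = 70_000
avg_page_mb = 3
proxy_usd_per_gb = 10.0

diy_proxy = requests_per_day * DAYS * avg_page_mb / 1000 * proxy_usd_per_gb
diy_total = diy_proxy + 150 + 200  # plus server and CAPTCHA solving

api_low = 50_000 * DAYS * 0.02
api_high = 50_000 * DAYS * 0.05

print(f"DIY: ${diy_total:,.0f}/month (plus ~15 hrs engineering)")
print(f"API: ${api_low:,.0f}-${api_high:,.0f}/month")
```

Bandwidth dominates the DIY side, which is why trimming average page weight (blocking images and fonts during rendering) moves the needle more than any other optimization.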
Freshness vs Cost
Not every product needs daily price updates. Most prices change weekly or less frequently. A smart approach:
- High-priority products (top sellers, competitive items): update every 6-12 hours
- Medium priority: update daily
- Low priority (long-tail catalog): update every 3-7 days
This can cut your request volume by 60-70% compared to blanket daily scraping.
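Here is how the savings pencil out under one assumed catalog split: 5% high priority at two updates a day, 15% medium at one a day, and 80% long tail on a 7-day cycle. The split itself is an assumption to tune against your catalog:

```python
# Daily request volume under tiered freshness for a 50,000-item catalog.
catalog = 50_000
tiers = [
    (0.05, 2.0),    # high priority: 5% of catalog, 2 updates/day
    (0.15, 1.0),    # medium: 15%, daily
    (0.80, 1 / 7),  # long tail: 80%, every 7 days
]

daily = sum(catalog * frac * rate for frac, rate in tiers)
savings = 1 - daily / catalog
print(f"{daily:,.0f} requests/day, {savings:.0%} fewer than blanket daily scraping")
```

With this split the daily volume drops from 50,000 to about 18,000 requests, in line with the 60-70% reduction mentioned above.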
Structured Data Extraction
Raw HTML is not what you want to store. Parse it into structured data immediately:
from datetime import datetime, timezone

def parse_product(html, source):
    # Try JSON-LD first (most reliable)
    product = extract_product_jsonld(html)
    if product:
        offers = product.get("offers") or {}
        return {
            "name": product.get("name"),
            "price": offers.get("price"),
            "currency": offers.get("priceCurrency"),
            "availability": offers.get("availability"),
            "source": source,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
        }
    # Fall back to HTML parsing
    # (site-specific selectors here)
    return None

AlterLab can return structured JSON directly using schema-based extraction. You define the fields you want, and the API extracts them from the page. This saves you from maintaining parsing logic for each site.
The Practical Approach
Start with the JSON-LD and API endpoints. These are the least protected and give you the most structured data. Only fall back to full page rendering when the data you need is not available through easier channels.
Use a scraping API for the hard targets (Amazon, Walmart with bot protection). Build simple scripts for the easy targets (smaller retailers without serious protection).
Monitor your success rates. When they drop below 80%, something changed on the target site. Fix it immediately: stale data is worse than no data.
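That success-rate check is easy to automate with a sliding window. The 80% threshold comes from above; the window size and minimum-sample guard are assumptions:

```python
from collections import deque

class SuccessMonitor:
    """Track success rate over the last N requests and flag when it
    drops below a threshold (80% per the guidance above). The window
    of 500 and the 50-sample minimum are assumed defaults."""

    def __init__(self, window=500, threshold=0.8):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        self.results.append(bool(ok))

    @property
    def rate(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def needs_attention(self):
        # Require a reasonably full window before alerting,
        # so a couple of early failures do not page anyone.
        return len(self.results) >= 50 and self.rate < self.threshold
```

Call `record()` after every request and check `needs_attention()` on a schedule; wiring the alert into whatever paging or chat tool you already use is the easy part.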