How to Scrape Amazon: Complete Guide for 2026
Learn how to scrape Amazon product data with Python in 2026. Bypass CAPTCHA and IP bans, extract structured data, and build production-ready scraping pipelines.
March 23, 2026
Amazon is one of the most data-rich e-commerce targets on the web — and one of the most aggressively defended. This guide covers what protections you'll hit, how to work around them, and how to build a reliable extraction pipeline that handles pricing, availability, ratings, and product metadata at scale.
Why Scrape Amazon?
Three use cases that justify the engineering investment:
Price monitoring: Amazon updates prices multiple times per day on high-velocity products. Scraping price history across competitor SKUs feeds dynamic pricing models, discount detection systems, and market intelligence dashboards. For retail analytics firms, this is the primary driver for Amazon scraping at scale.
Market research and product intelligence: Best Seller Rank (BSR) movements signal category momentum before it appears in any third-party dataset. Aggregating review sentiment, tracking new product launches, and analyzing feature bullet points across a category gives consumer goods teams and investors a ground-truth view of the market.
Inventory and availability monitoring: "Currently unavailable" status changes on high-demand ASINs serve as supply chain signals. Resellers and logistics teams use availability scraping to trigger procurement or repricing workflows automatically.
Anti-Bot Challenges on amazon.com
Amazon runs one of the most layered bot detection stacks in e-commerce. Here's a precise breakdown of what you're dealing with:
CAPTCHA on datacenter IPs: Any request originating from a known datacenter ASN (AWS, GCP, Azure, Hetzner) almost always lands on a CAPTCHA page. The challenge is served dynamically — you won't get a clean 403, you'll get a 200 with a CAPTCHA payload that looks like a product page until you parse it.
Browser fingerprinting: Amazon's page JavaScript inspects navigator.userAgent, screen resolution, timezone offset, WebGL renderer hash, Canvas fingerprint, and AudioContext output. Default headless Chrome — even with standard User-Agent spoofing — is trivially detected via the combination of these signals.
TLS and HTTP/2 fingerprinting: The TLS client hello and HTTP/2 SETTINGS frame expose client identity before any application-layer code runs. Python requests, httpx, and cURL all have distinct fingerprints that Amazon's edge layer flags. Matching a real browser's TLS fingerprint requires patching at the socket level.
IP reputation and per-IP rate limiting: Even residential IPs get rate-limited if the same IP makes repeated requests to the same product category. You need per-request IP rotation, not per-session rotation.
Session-gated pricing: Subscribe & Save prices, Prime-exclusive discounts, and some availability states require a valid Amazon session (logged in or authenticated guest) to render. Without proper session handling, you receive degraded HTML with placeholder prices.
Solving all of this from scratch is a multi-week infrastructure project with ongoing maintenance. The anti-bot bypass API abstracts the entire stack — proxy rotation, fingerprint management, JS rendering — into a single API call.
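Because of that CAPTCHA-behind-200 behavior, a status code check is not enough to confirm a successful fetch. A minimal content-based check might look like the sketch below; the marker strings are assumptions based on commonly seen Amazon challenge pages and should be verified against live responses:

```python
def looks_like_captcha(html: str) -> bool:
    """Heuristic check: does a 200 response actually contain a bot challenge?

    The marker strings below are assumptions drawn from commonly reported
    Amazon robot-check pages; verify them against live responses and update
    as Amazon changes its challenge templates.
    """
    markers = (
        "Enter the characters you see below",
        "api-services-support@amazon.com",
        "/errors/validateCaptcha",
    )
    return any(marker in html for marker in markers)
```

Gate your parsing on a check like this rather than on the HTTP status code alone.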
Quick Start with AlterLab API
Install the SDK and make your first request in under five minutes. Full setup is covered in the Getting started guide.
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://www.amazon.com/dp/B0CHX3TB1R",
    render_js=True,
    premium_proxy=True,
)

print(response.html[:1000])
```

The equivalent request via cURL:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.amazon.com/dp/B0CHX3TB1R",
    "render_js": true,
    "premium_proxy": true
  }'
```

`render_js: true` launches a headless browser instance with a randomized fingerprint profile. `premium_proxy: true` routes the request through a residential IP — this flag is not optional for Amazon. Without both, the majority of product page requests return bot challenge pages rather than product HTML.
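Even with both flags enabled, occasional transient failures are expected. A generic retry wrapper with exponential backoff handles them; here `fetch` stands in for whatever call you make (for example, a lambda around the `client.scrape` call above), and the backoff constants are illustrative defaults:

```python
import random
import time
from typing import Callable

def fetch_with_retry(fetch: Callable[[], str], max_attempts: int = 3,
                     base_delay: float = 2.0) -> str:
    """Call `fetch` until it succeeds, backing off exponentially with jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # base_delay, then 2x, 4x... plus proportional jitter so
            # concurrent workers don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")
```

Usage: `fetch_with_retry(lambda: client.scrape(url=url, render_js=True, premium_proxy=True).html)`.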
Extracting Structured Data
Once you have rendered HTML, parse it with BeautifulSoup. Amazon's DOM is inconsistent across product categories and changes frequently, but the core selectors below are stable across the majority of standard product pages.
```python
from bs4 import BeautifulSoup
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

def scrape_product(asin: str) -> dict:
    url = f"https://www.amazon.com/dp/{asin}"
    response = client.scrape(url=url, render_js=True, premium_proxy=True)
    soup = BeautifulSoup(response.html, "html.parser")

    # Product title
    title_el = soup.select_one("#productTitle")
    title = title_el.get_text(strip=True) if title_el else None

    # Current price — only present after JS execution
    price_el = soup.select_one(".a-price .a-offscreen")
    price = price_el.get_text(strip=True) if price_el else None

    # Original (struck-through) price
    original_price_el = soup.select_one(".basisPrice .a-offscreen")
    original_price = original_price_el.get_text(strip=True) if original_price_el else None

    # Star rating (returned as a string like "4.5 out of 5 stars")
    rating_el = soup.select_one("#acrPopover")
    rating = rating_el.get("title") if rating_el else None

    # Total review count
    reviews_el = soup.select_one("#acrCustomerReviewText")
    review_count = reviews_el.get_text(strip=True) if reviews_el else None

    # Availability
    avail_el = soup.select_one("#availability span")
    availability = avail_el.get_text(strip=True) if avail_el else None

    # Feature bullet points
    bullets = [
        li.get_text(strip=True)
        for li in soup.select("#feature-bullets ul li span.a-list-item")
    ]

    # Brand
    brand_el = soup.select_one("#bylineInfo")
    brand = brand_el.get_text(strip=True) if brand_el else None

    return {
        "asin": asin,
        "title": title,
        "brand": brand,
        "price": price,
        "original_price": original_price,
        "rating": rating,
        "review_count": review_count,
        "availability": availability,
        "bullets": bullets,
    }

if __name__ == "__main__":
    product = scrape_product("B0CHX3TB1R")
    print(json.dumps(product, indent=2))
```

CSS selector reference for Amazon product pages:
| Field | Selector |
|---|---|
| Product title | #productTitle |
| Current price | .a-price .a-offscreen |
| Original/list price | .basisPrice .a-offscreen |
| Star rating | #acrPopover[title] |
| Review count | #acrCustomerReviewText |
| Availability | #availability span |
| Bullet points | #feature-bullets ul li span.a-list-item |
| Brand | #bylineInfo |
| ASIN (hidden input) | input[name="ASIN"] |
| Main product image | #landingImage |
Price selector caveat: Amazon renders prices via XHR after the initial page load. If you request without `render_js: true`, the `#priceblock_ourprice` and `.a-price` elements are frequently absent or empty in the raw HTML. Always enable JS rendering when pricing data is required.
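The `.a-offscreen` selector yields display strings such as "$1,299.99", not numbers. Before storing prices for analysis, normalize them; this helper is a sketch that assumes US-style formatting (period as decimal separator):

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Convert a scraped price string like '$1,299.99' to a Decimal.

    Returns None for missing or unparseable input instead of raising,
    matching the defensive style of the extraction code above. Assumes
    US-style formatting; locales that use ',' as the decimal separator
    need different handling.
    """
    if not raw:
        return None
    # Strip currency symbols, thousands separators, and whitespace
    cleaned = re.sub(r"[^\d.]", "", raw)
    try:
        return Decimal(cleaned) if cleaned else None
    except InvalidOperation:
        return None
```

Using `Decimal` instead of `float` avoids rounding artifacts when you later aggregate price histories.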
Common Pitfalls
Dynamic price loading after initial render: Amazon's pricing is XHR-driven on most pages. Scraping without JavaScript rendering returns HTML where .a-price is either missing or contains a placeholder. This is the single most common reason a price scraper returns empty results — always use render_js: true.
Geographic price variation: Amazon prices differ by region. A residential proxy geolocated to us-east and one geolocated to us-west may return different prices for the same ASIN. If price consistency matters for your dataset, lock your proxy geography to a specific country or region in your API parameters.
A/B testing causes selector drift: Amazon continuously experiments on its product page UI. Selectors stable for 90% of traffic today may silently break on 10% of requests tomorrow. Build your parser defensively: always check element existence before accessing .get_text(), log when expected selectors return None, and set up anomaly detection on your extracted data (e.g., alert if price extraction starts returning empty strings at >5% rate).
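The anomaly-detection idea above can be a few lines of batch code. This sketch assumes records shaped like the dicts produced by the extraction example (field names `title`, `price`, `availability`) and flags any field whose miss rate exceeds a threshold:

```python
def null_rate(records: list[dict], field: str) -> float:
    """Fraction of records where `field` extracted as None or empty string."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not r.get(field))
    return missing / len(records)

def check_extraction_health(records: list[dict], threshold: float = 0.05) -> list[str]:
    """Return alert messages for fields whose miss rate exceeds the threshold."""
    alerts = []
    for field in ("title", "price", "availability"):
        rate = null_rate(records, field)
        if rate > threshold:
            alerts.append(f"{field}: {rate:.1%} of records missing, possible selector drift")
    return alerts
```

Run this after every batch and route non-empty results to whatever alerting channel you already use.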
Session-gated content: Subscribe & Save pricing and Prime-exclusive offers require a valid session. For these fields, pass session cookies in the request headers. Without them, you'll get the logged-out price — which may differ significantly.
Review pagination depth: The first two pages of product reviews are straightforward to scrape. Requests to page 8+ increasingly trigger bot challenges even with residential proxies. For deep review scraping, add randomized delays between 3–10 seconds per request and distribute your scraping across a longer time window.
Throttling on category-level crawls: Hitting 50+ ASINs in the same product category within a short window can trigger ASN-level rate limiting even with IP rotation. Interleave requests across different categories or add jitter between calls when scraping at volume.
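One simple way to interleave requests across categories is a round-robin over per-category ASIN lists:

```python
from itertools import chain, zip_longest

def interleave_by_category(category_asins: dict[str, list[str]]) -> list[str]:
    """Round-robin ASINs across categories so consecutive requests
    rarely land on the same product category."""
    sentinel = object()
    rounds = zip_longest(*category_asins.values(), fillvalue=sentinel)
    return [asin for asin in chain.from_iterable(rounds) if asin is not sentinel]
```

Feed the resulting order into your request loop, with jitter between calls as described above.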
Scaling Up
For production pipelines, move from sequential single-ASIN requests to concurrent batch execution with proper error handling and retry logic.
```python
import alterlab
import json
import time
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

ASINS = [
    "B0CHX3TB1R",
    "B09G9HD5R7",
    "B0BVZPGRNF",
    "B0CF4RLBN1",
    "B0BDJH7ZBC",
]

def fetch_product(asin: str) -> dict:
    try:
        # Random jitter between requests from the same thread
        time.sleep(random.uniform(1.5, 4.0))
        response = client.scrape(
            url=f"https://www.amazon.com/dp/{asin}",
            render_js=True,
            premium_proxy=True,
        )
        soup = BeautifulSoup(response.html, "html.parser")
        title_el = soup.select_one("#productTitle")
        price_el = soup.select_one(".a-price .a-offscreen")
        avail_el = soup.select_one("#availability span")
        return {
            "asin": asin,
            "title": title_el.get_text(strip=True) if title_el else None,
            "price": price_el.get_text(strip=True) if price_el else None,
            "availability": avail_el.get_text(strip=True) if avail_el else None,
            "scraped_at": int(time.time()),
            "status": "ok",
        }
    except Exception as exc:
        return {"asin": asin, "status": "error", "error": str(exc)}

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(fetch_product, asin): asin for asin in ASINS}
    results = [f.result() for f in as_completed(futures)]

successful = [r for r in results if r["status"] == "ok"]
failed = [r for r in results if r["status"] == "error"]
print(f"Scraped: {len(successful)} | Failed: {len(failed)}")

# Write results — include scraped_at for time-series reconstruction
with open("products.jsonl", "w") as f:
    for record in successful:
        f.write(json.dumps(record) + "\n")
```

Scheduling for price monitoring: For most use cases, a daily cron job provides sufficient granularity. For lightning deals and flash sale tracking, 15-minute intervals are common. Use Celery with Redis as the task broker to handle retries, dead-letter queuing, and concurrency limits without reimplementing that logic yourself.
Storage: JSONL files work at small scale. For production price history pipelines, write directly to PostgreSQL or ClickHouse. Include a scraped_at Unix timestamp on every record — without it, you can't reconstruct price time series reliably.
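As a schema sketch, here is a minimal price-history table using SQLite as a stand-in for PostgreSQL or ClickHouse; the table and column names are illustrative, not prescribed:

```python
import sqlite3
import time

# In-memory SQLite as a stand-in; swap for PostgreSQL/ClickHouse in production
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS price_history (
        asin         TEXT NOT NULL,
        price        TEXT,
        availability TEXT,
        scraped_at   INTEGER NOT NULL
    )
""")

def store(record: dict) -> None:
    """Append one scraped record; default scraped_at to now if absent."""
    conn.execute(
        "INSERT INTO price_history (asin, price, availability, scraped_at) "
        "VALUES (?, ?, ?, ?)",
        (record["asin"], record.get("price"), record.get("availability"),
         record.get("scraped_at", int(time.time()))),
    )
    conn.commit()

store({"asin": "B0CHX3TB1R", "price": "$29.99", "availability": "In Stock."})

# Reconstruct the price series for one ASIN, oldest first
rows = conn.execute(
    "SELECT scraped_at, price FROM price_history WHERE asin = ? ORDER BY scraped_at",
    ("B0CHX3TB1R",),
).fetchall()
```

The `ORDER BY scraped_at` query is the payoff of storing timestamps on every record: the series reconstructs itself.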
Concurrency ceiling: Keep max_workers at 8–12 for Amazon. Higher parallelism yields diminishing returns and increases the likelihood of triggering per-category throttling. Horizontal scaling via multiple independent workers (each with their own API key pool) is more reliable than maxing out thread count in a single process.
Cost optimization: JS-rendered requests cost more than plain HTML scrapes. For category index pages and search results — where prices aren't critical and you're only capturing ASINs — use render_js: false to cut costs. Reserve render_js: true for individual product page scrapes where pricing and availability are required. Review the AlterLab pricing tiers to find the plan that matches your request volume breakdown.
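When scraping search or category pages without JS rendering just to harvest ASINs, a regex over the raw HTML is often enough, since product links follow the `/dp/<ASIN>` pattern (ten uppercase alphanumeric characters):

```python
import re

ASIN_RE = re.compile(r"/dp/([A-Z0-9]{10})")

def extract_asins(html: str) -> list[str]:
    """Pull unique ASINs from raw listing HTML, preserving first-seen order."""
    seen: list[str] = []
    for match in ASIN_RE.finditer(html):
        asin = match.group(1)
        if asin not in seen:
            seen.append(asin)
    return seen
```

Feed the harvested ASINs into the `render_js: true` product-page pipeline only for the items you actually need priced.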
Try scraping an Amazon product page with AlterLab — paste any ASIN URL and get live HTML back.
Key Takeaways
- Residential proxies and JS rendering are non-negotiable for Amazon. Datacenter IPs return CAPTCHA pages. Plain HTTP requests return incomplete HTML without prices. Both flags — `render_js: true` and `premium_proxy: true` — are required for reliable product page scraping.
- The core CSS selectors are stable but not universal. `#productTitle`, `.a-price .a-offscreen`, `#acrPopover`, and `#availability span` cover the majority of standard product pages. Build defensive parsers that log missing elements rather than failing hard.
- Geographic proxy pinning matters for price data consistency. If you're building a price time series, lock your proxy geography — mixed geolocation across a dataset produces price anomalies that are hard to detect downstream.
- Add jitter and cap thread concurrency. Random delays between 1.5–4 seconds and a concurrency ceiling of 8–12 workers prevent category-level rate limiting more effectively than aggressive parallelism.
- Always store `scraped_at` timestamps. Price history is only useful as time-series data. Without timestamps, your dataset is a snapshot with no reconstruction path.