
How to Scrape Amazon: Complete Guide for 2026

Learn how to scrape Amazon product data with Python in 2026. Bypass CAPTCHA and IP bans, extract structured data, and build production-ready scraping pipelines.

Yash Dubey

March 23, 2026

9 min read

Amazon is one of the most data-rich e-commerce targets on the web — and one of the most aggressively defended. This guide covers what protections you'll hit, how to work around them, and how to build a reliable extraction pipeline that handles pricing, availability, ratings, and product metadata at scale.

Why Scrape Amazon?

Three use cases that justify the engineering investment:

Price monitoring: Amazon updates prices multiple times per day on high-velocity products. Scraping price history across competitor SKUs feeds dynamic pricing models, discount detection systems, and market intelligence dashboards. For retail analytics firms, this is the primary driver for Amazon scraping at scale.

Market research and product intelligence: Best Seller Rank (BSR) movements signal category momentum before it appears in any third-party dataset. Aggregating review sentiment, tracking new product launches, and analyzing feature bullet points across a category gives consumer goods teams and investors a ground-truth view of the market.

Inventory and availability monitoring: "Currently unavailable" status changes on high-demand ASINs serve as supply chain signals. Resellers and logistics teams use availability scraping to trigger procurement or repricing workflows automatically.

Anti-Bot Challenges on amazon.com

Amazon runs one of the most layered bot detection stacks in e-commerce. Here's a precise breakdown of what you're dealing with:

CAPTCHA on datacenter IPs: Any request originating from a known datacenter ASN (AWS, GCP, Azure, Hetzner) almost always lands on a CAPTCHA page. The challenge is served dynamically — you won't get a clean 403, you'll get a 200 with a CAPTCHA payload that looks like a product page until you parse it.
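
Because the challenge arrives with a 200 status, response validation has to inspect the body before trusting it. A minimal heuristic check — the marker strings below are drawn from Amazon's current challenge page and may change without notice:

```python
def looks_like_captcha(html: str) -> bool:
    """Best-effort check for Amazon's interstitial CAPTCHA page.

    These markers appear on the challenge page today; treat the list
    as a heuristic to extend, not an exhaustive specification.
    """
    markers = (
        "/errors/validateCaptcha",
        "Enter the characters you see below",
        "api-services-support@amazon.com",
    )
    return any(marker in html for marker in markers)
```

Run this on every response before parsing; a positive result should trigger a retry through a different IP, not a parse attempt.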

Browser fingerprinting: Amazon's page JavaScript inspects navigator.userAgent, screen resolution, timezone offset, WebGL renderer hash, Canvas fingerprint, and AudioContext output. Default headless Chrome — even with standard User-Agent spoofing — is trivially detected via the combination of these signals.

TLS and HTTP/2 fingerprinting: The TLS client hello and HTTP/2 SETTINGS frame expose client identity before any application-layer code runs. Python requests, httpx, and cURL all have distinct fingerprints that Amazon's edge layer flags. Matching a real browser's TLS fingerprint requires patching at the socket level.

IP reputation and per-IP rate limiting: Even residential IPs get rate-limited if the same IP makes repeated requests to the same product category. You need per-request IP rotation, not per-session rotation.

Session-gated pricing: Subscribe & Save prices, Prime-exclusive discounts, and some availability states require a valid Amazon session (logged in or authenticated guest) to render. Without proper session handling, you receive degraded HTML with placeholder prices.

Solving all of this from scratch is a multi-week infrastructure project with ongoing maintenance. AlterLab's anti-bot bypass API abstracts the entire stack — proxy rotation, fingerprint management, JS rendering — into a single API call.

Quick Start with AlterLab API

Install the SDK and make your first request in under five minutes. Full setup is covered in the Getting started guide.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://www.amazon.com/dp/B0CHX3TB1R",
    render_js=True,
    premium_proxy=True,
)

print(response.html[:1000])

The equivalent request via cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.amazon.com/dp/B0CHX3TB1R",
    "render_js": true,
    "premium_proxy": true
  }'

render_js: true launches a headless browser instance with a randomized fingerprint profile. premium_proxy: true routes the request through a residential IP — this flag is not optional for Amazon. Without both, the majority of product page requests return bot challenge pages rather than product HTML.

99.2% success rate on Amazon
1.4s average response time
195+ proxy countries
Zero infrastructure to manage

Extracting Structured Data

Once you have rendered HTML, parse it with BeautifulSoup. Amazon's DOM is inconsistent across product categories and changes frequently, but the core selectors below are stable across the majority of standard product pages.

Python
from bs4 import BeautifulSoup
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

def scrape_product(asin: str) -> dict:
    url = f"https://www.amazon.com/dp/{asin}"
    response = client.scrape(url=url, render_js=True, premium_proxy=True)
    soup = BeautifulSoup(response.html, "html.parser")

    # Product title
    title_el = soup.select_one("#productTitle")
    title = title_el.get_text(strip=True) if title_el else None

    # Current price — only present after JS execution
    price_el = soup.select_one(".a-price .a-offscreen")
    price = price_el.get_text(strip=True) if price_el else None

    # Original (struck-through) price
    original_price_el = soup.select_one(".basisPrice .a-offscreen")
    original_price = original_price_el.get_text(strip=True) if original_price_el else None

    # Star rating (returned as a string like "4.5 out of 5 stars")
    rating_el = soup.select_one("#acrPopover")
    rating = rating_el.get("title") if rating_el else None

    # Total review count
    reviews_el = soup.select_one("#acrCustomerReviewText")
    review_count = reviews_el.get_text(strip=True) if reviews_el else None

    # Availability
    avail_el = soup.select_one("#availability span")
    availability = avail_el.get_text(strip=True) if avail_el else None

    # Feature bullet points
    bullets = [
        li.get_text(strip=True)
        for li in soup.select("#feature-bullets ul li span.a-list-item")
    ]

    # Brand
    brand_el = soup.select_one("#bylineInfo")
    brand = brand_el.get_text(strip=True) if brand_el else None

    return {
        "asin": asin,
        "title": title,
        "brand": brand,
        "price": price,
        "original_price": original_price,
        "rating": rating,
        "review_count": review_count,
        "availability": availability,
        "bullets": bullets,
    }

if __name__ == "__main__":
    product = scrape_product("B0CHX3TB1R")
    print(json.dumps(product, indent=2))

CSS selector reference for Amazon product pages:

Field | Selector
Product title | #productTitle
Current price | .a-price .a-offscreen
Original/list price | .basisPrice .a-offscreen
Star rating | #acrPopover[title]
Review count | #acrCustomerReviewText
Availability | #availability span
Bullet points | #feature-bullets ul li span.a-list-item
Brand | #bylineInfo
ASIN (hidden input) | input[name="ASIN"]
Main product image | #landingImage

Price selector caveat: Amazon renders prices via XHR after initial page load. If you request without render_js: true, #priceblock_ourprice and .a-price elements are frequently absent or empty in the raw HTML. Always enable JS rendering when pricing data is required.

Common Pitfalls

Dynamic price loading after initial render: Amazon's pricing is XHR-driven on most pages. Scraping without JavaScript rendering returns HTML where .a-price is either missing or contains a placeholder. This is the single most common reason a price scraper returns empty results — always use render_js: true.

Geographic price variation: Amazon prices differ by region. A residential proxy geolocated to us-east and one geolocated to us-west may return different prices for the same ASIN. If price consistency matters for your dataset, lock your proxy geography to a specific country or region in your API parameters.

A/B testing causes selector drift: Amazon continuously experiments on its product page UI. Selectors stable for 90% of traffic today may silently break on 10% of requests tomorrow. Build your parser defensively: always check element existence before accessing .get_text(), log when expected selectors return None, and set up anomaly detection on your extracted data (e.g., alert if price extraction starts returning empty strings at >5% rate).
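
The anomaly-detection suggestion can be sketched as a rolling window over per-field extraction results; the window size and 5% threshold below are illustrative defaults:

```python
from collections import deque

class ExtractionMonitor:
    """Track the rolling empty-rate for one extracted field."""

    def __init__(self, window: int = 200, threshold: float = 0.05):
        self.results: deque[bool] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, value) -> None:
        # True when extraction produced a usable value
        self.results.append(value is not None and value != "")

    @property
    def empty_rate(self) -> float:
        if not self.results:
            return 0.0
        return 1 - sum(self.results) / len(self.results)

    def is_anomalous(self) -> bool:
        # Require a reasonably full window before alerting
        return len(self.results) >= 50 and self.empty_rate > self.threshold
```

Keep one monitor per field (price, title, availability); a spike in a single field's empty-rate usually means a selector drifted, while spikes across all fields point to bot challenges slipping through.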

Session-gated content: Subscribe & Save pricing and Prime-exclusive offers require a valid session. For these fields, pass session cookies in the request headers. Without them, you'll get the logged-out price — which may differ significantly.

Review pagination depth: The first two pages of product reviews are straightforward to scrape. Requests to page 8+ increasingly trigger bot challenges even with residential proxies. For deep review scraping, add randomized delays between 3–10 seconds per request and distribute your scraping across a longer time window.
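
The delay guidance above can be wired into a deep review crawl. A sketch — the /product-reviews/ path reflects amazon.com's standard review pagination, and the delay bounds mirror the 3–10 second recommendation:

```python
import random

def review_url(asin: str, page: int) -> str:
    """Build the paginated review URL for a product on amazon.com."""
    return f"https://www.amazon.com/product-reviews/{asin}?pageNumber={page}"

def polite_delays(pages: int, low: float = 3.0, high: float = 10.0) -> list[float]:
    """Pre-compute one randomized inter-request delay per review page."""
    return [random.uniform(low, high) for _ in range(pages)]
```

In the crawl loop, time.sleep() the corresponding delay before each request; pre-computing the schedule also lets you estimate total crawl duration up front and spread it across a longer window if needed.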

Throttling on category-level crawls: Hitting 50+ ASINs in the same product category within a short window can trigger ASN-level rate limiting even with IP rotation. Interleave requests across different categories or add jitter between calls when scraping at volume.

Scaling Up

For production pipelines, move from sequential single-ASIN requests to concurrent batch execution with proper error handling and retry logic.

Python
import alterlab
import json
import time
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

ASINS = [
    "B0CHX3TB1R",
    "B09G9HD5R7",
    "B0BVZPGRNF",
    "B0CF4RLBN1",
    "B0BDJH7ZBC",
]

def fetch_product(asin: str) -> dict:
    try:
        # Random jitter between requests from the same thread
        time.sleep(random.uniform(1.5, 4.0))
        response = client.scrape(
            url=f"https://www.amazon.com/dp/{asin}",
            render_js=True,
            premium_proxy=True,
        )
        soup = BeautifulSoup(response.html, "html.parser")
        title_el = soup.select_one("#productTitle")
        price_el = soup.select_one(".a-price .a-offscreen")
        avail_el = soup.select_one("#availability span")
        return {
            "asin": asin,
            "title": title_el.get_text(strip=True) if title_el else None,
            "price": price_el.get_text(strip=True) if price_el else None,
            "availability": avail_el.get_text(strip=True) if avail_el else None,
            "scraped_at": int(time.time()),
            "status": "ok",
        }
    except Exception as exc:
        return {"asin": asin, "status": "error", "error": str(exc)}

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(fetch_product, asin): asin for asin in ASINS}
    results = [f.result() for f in as_completed(futures)]

successful = [r for r in results if r["status"] == "ok"]
failed = [r for r in results if r["status"] == "error"]

print(f"Scraped: {len(successful)} | Failed: {len(failed)}")

# Write results — include scraped_at for time-series reconstruction
with open("products.jsonl", "w") as f:
    for record in successful:
        f.write(json.dumps(record) + "\n")

Scheduling for price monitoring: For most use cases, a daily cron job covers sufficient granularity. For lightning deals and flash sale tracking, 15-minute intervals are common. Use Celery with Redis as the task broker to handle retries, dead-letter queuing, and concurrency limits without reimplementing that logic yourself.
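
Both cadences can be expressed in a Celery beat configuration. A sketch assuming a local Redis broker; the task names and module paths are placeholders you would define in your own project:

```python
from celery import Celery
from celery.schedules import crontab

# Broker URL and task names are placeholders for illustration.
app = Celery("price_monitor", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    # Daily sweep of the full ASIN watchlist at 06:00
    "daily-price-sweep": {
        "task": "tasks.scrape_watchlist",
        "schedule": crontab(hour=6, minute=0),
    },
    # Lightning-deal tracker every 15 minutes (schedule in seconds)
    "deal-tracker": {
        "task": "tasks.scrape_deals",
        "schedule": 15 * 60.0,
    },
}

# Retries and concurrency limits come from Celery configuration
# rather than hand-rolled logic in the scraping code.
app.conf.task_acks_late = True
app.conf.worker_concurrency = 8
```

Run the scheduler with `celery -A price_monitor beat` alongside one or more workers; beat only enqueues tasks, so worker count scales independently of the schedule.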

Storage: JSONL files work at small scale. For production price history pipelines, write directly to PostgreSQL or ClickHouse. Include a scraped_at Unix timestamp on every record — without it, you can't reconstruct price time series reliably.
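
A minimal schema for that time-series point — sqlite3 stands in for PostgreSQL or ClickHouse here so the sketch stays self-contained, and prices are assumed to be normalized to floats before insert:

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS price_history (
    asin         TEXT    NOT NULL,
    price        REAL,
    availability TEXT,
    scraped_at   INTEGER NOT NULL,
    PRIMARY KEY (asin, scraped_at)
)
"""

def store_records(conn: sqlite3.Connection, records: list[dict]) -> None:
    """Append scrape results to the price_history time series."""
    conn.executemany(
        "INSERT OR REPLACE INTO price_history VALUES (?, ?, ?, ?)",
        [
            (r["asin"], r.get("price"), r.get("availability"), r["scraped_at"])
            for r in records
        ],
    )
    conn.commit()
```

The composite (asin, scraped_at) key makes repeated scrapes idempotent: re-running a batch for the same timestamp upserts rather than duplicating rows.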

Concurrency ceiling: Keep max_workers at 8–12 for Amazon. Higher parallelism yields diminishing returns and increases the likelihood of triggering per-category throttling. Horizontal scaling via multiple independent workers (each with its own API key pool) is more reliable than maxing out thread count in a single process.

Cost optimization: JS-rendered requests cost more than plain HTML scrapes. For category index pages and search results — where prices aren't critical and you're only capturing ASINs — use render_js: false to cut costs. Reserve render_js: true for individual product page scrapes where pricing and availability are required. Review the AlterLab pricing tiers to find the plan that matches your request volume breakdown.
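
The two-tier pattern looks like: harvest ASINs from cheap non-rendered index scrapes, then spend JS-rendered requests only on the product pages. A minimal ASIN harvester — the /dp/ link pattern covers standard product links, though sponsored placements can use other URL formats:

```python
import re

# Standard product links embed the 10-character ASIN after /dp/
ASIN_RE = re.compile(r"/dp/([A-Z0-9]{10})")

def harvest_asins(html: str) -> list[str]:
    """Pull unique ASINs from search/category HTML, preserving order."""
    seen: dict[str, None] = {}
    for match in ASIN_RE.finditer(html):
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Deduplication matters here: a single search result card typically links to the same ASIN several times (image, title, rating), so raw matches overcount by a large factor.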


Key Takeaways

  • Residential proxies and JS rendering are non-negotiable for Amazon. Datacenter IPs return CAPTCHA pages. Plain HTTP requests return incomplete HTML without prices. Both flags — render_js: true and premium_proxy: true — are required for reliable product page scraping.
  • The core CSS selectors are stable but not universal. #productTitle, .a-price .a-offscreen, #acrPopover, and #availability span cover the majority of standard product pages. Build defensive parsers that log missing elements rather than failing hard.
  • Geographic proxy pinning matters for price data consistency. If you're building a price time series, lock your proxy geography — mixed geolocation across a dataset produces price anomalies that are hard to detect downstream.
  • Add jitter and cap thread concurrency. Random delays between 1.5–4 seconds and a concurrency ceiling of 8–12 workers prevent category-level rate limiting more effectively than aggressive parallelism.
  • Always store scraped_at timestamps. Price history is only useful as time-series data. Without timestamps, your dataset is a snapshot with no reconstruction path.



Frequently Asked Questions

Is it legal to scrape Amazon?

Scraping publicly visible Amazon data sits in a legal gray area — US and EU courts have generally upheld the right to access public web data, but Amazon's Terms of Service explicitly prohibit automated access. The practical risk is IP bans and account termination rather than litigation, though you should consult legal counsel for commercial use cases that involve large-scale or personally identifiable data collection.

How does AlterLab get past Amazon's anti-bot protections?

Amazon stacks CAPTCHA, browser fingerprinting, TLS fingerprinting, and IP reputation checks — raw HTTP requests from datacenter IPs fail on most product pages before you even parse a byte. AlterLab's anti-bot bypass API handles residential proxy rotation, headless browser fingerprinting, and JS rendering in a single API call, so you never maintain that infrastructure yourself. Pass render_js: true and premium_proxy: true in your request and the bypass is handled transparently.

How much does it cost to scrape Amazon?

Cost scales with request volume and rendering tier — JS-rendered requests (required for Amazon prices and availability) cost more than plain HTML scrapes. AlterLab's pricing starts with a free tier large enough for evaluation, with pay-as-you-go and high-volume plans for production pipelines. Mixing rendering tiers strategically — JS rendering for product pages, plain HTML for category index pages — can significantly reduce per-scrape costs at scale.