
How to Scrape Shopify Stores: Complete Guide for 2026

Learn how to scrape Shopify stores in 2026 with Python — extract products, prices, and inventory using the JSON API and anti-bot bypass. Production-ready guide.

Yash Dubey

March 25, 2026

8 min read

Shopify powers over 4.5 million active online stores. Every one of them is a structured dataset: products, variants, pricing, inventory, collections, and vendor metadata — sitting behind a predictable URL schema. If you're building a price intelligence tool, a competitor monitoring pipeline, or a retail analytics system, Shopify is one of the most consistently structured targets you'll encounter in e-commerce scraping.

This guide covers the full stack: from Shopify's public JSON endpoints (faster and cleaner than HTML parsing) to handling anti-bot protections when those endpoints are locked down.

Why Scrape Shopify Stores?

Three use cases drive most Shopify scraping workloads:

Price and inventory monitoring. Brands and retailers track competitor pricing in near real-time to power dynamic repricing engines. Shopify's variant-level data includes price, compare-at price, and inventory quantity — everything needed to feed a pricing model without any HTML parsing.

Lead generation and market research. Aggregating store metadata — vendor names, product categories, brand positioning, SKU counts — gives agencies and SaaS tools a filtered view of which Shopify merchants are operating in a given niche.

Catalog aggregation. Marketplaces and comparison engines pull structured product data (title, description, images, tags) across thousands of stores to build searchable indexes.

  • 4.5M+ active Shopify stores
  • 250 products per JSON page
  • 99.2% AlterLab success rate
  • 1.4s average response time

Anti-Bot Challenges on Shopify Stores

Shopify itself doesn't deploy heavy anti-bot infrastructure at the platform level, but individual merchants do — and the stack is increasingly aggressive in 2026.

Cloudflare is the dominant layer. Most mid-to-large Shopify stores sit behind Cloudflare, which means your scraper faces browser integrity checks (JS challenges), managed challenge pages, and TLS fingerprinting. A plain requests session with a spoofed User-Agent fails immediately — Cloudflare scores TLS cipher suites and HTTP/2 frame ordering at the network level before any JavaScript runs.

Shopify's native bot protection. Shopify's checkout and storefront expose behavioral bot scoring via their fraud prevention tooling. High-frequency requests from a single IP or ASN trigger rate limiting (typically HTTP 429) or silent throttling where responses are served stale or incomplete.

JavaScript-gated storefronts. Headless Shopify storefronts built on Hydrogen or custom themes frequently render product grids client-side. The initial HTML payload is a shell; product data arrives via XHR after JavaScript executes. Standard HTTP scrapers get nothing useful.

The practical consequence: DIY scraping against protected Shopify stores requires managing residential proxy pools, implementing TLS fingerprint spoofing (via curl-impersonate or tls-client), and running headless Chromium for JS-rendered pages — each of which has its own maintenance overhead.
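
To make that maintenance overhead concrete, here is a minimal sketch of the DIY route using curl_cffi (Python bindings for curl-impersonate) to spoof the TLS fingerprint. The store URL and proxy address are placeholders, and this still leaves JS challenges and IP rotation for you to solve:

Python
# DIY sketch: mimic a real Chrome TLS/HTTP2 fingerprint with curl_cffi.
# This only addresses the network-level fingerprint; JS challenges and
# IP reputation still need separate handling.
from curl_cffi import requests

response = requests.get(
    "https://target-store.myshopify.com/products.json?limit=250",
    impersonate="chrome",  # impersonation target names depend on your curl_cffi version
    proxies={"https": "http://user:pass@residential-proxy.example:8080"},  # placeholder proxy
    timeout=30,
)

if response.status_code == 200:
    print(f"{len(response.json().get('products', []))} products")
else:
    print(f"Blocked or challenged: {response.status_code}")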

AlterLab's anti-bot bypass API consolidates all of this: residential proxies, real browser fingerprints, and automatic challenge solving, exposed through a single HTTP endpoint.

Quick Start with AlterLab

Install the SDK and grab your API key from the getting started guide.

Bash
pip install alterlab

The fastest path to Shopify product data is the /products.json endpoint — publicly accessible on most stores, no JavaScript required, returns clean JSON with full product and variant detail.

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

# Most Shopify stores expose /products.json with no auth required
response = client.scrape(
    url="https://target-store.myshopify.com/products.json?limit=250",
    anti_bot=True,
    render_js=False
)

data = json.loads(response.text)
products = data["products"]

for product in products:
    price = product["variants"][0]["price"]
    available = product["variants"][0]["available"]
    print(f"{product['title']} — ${price} ({'in stock' if available else 'OOS'})")

The equivalent with cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target-store.myshopify.com/products.json?limit=250",
    "anti_bot": true,
    "render_js": false
  }'

Set render_js: false explicitly for JSON endpoints — it halves latency since there's no browser spin-up cost.


Extracting Structured Data

The JSON API (Preferred)

The Shopify Storefront JSON API is the cleanest path. Each endpoint returns structured data without any HTML parsing:

Endpoint                                 Returns
/products.json?limit=250                 Paginated product catalog
/collections.json                        All store collections
/collections/{handle}/products.json      Products in a specific collection
/products/{handle}.json                  Single product with full variant detail
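
The single-product endpoint is useful for spot checks or change detection on known handles. A minimal sketch, with the store URL and product handle as placeholders:

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

# /products/{handle}.json wraps one product in a top-level "product" key
response = client.scrape(
    url="https://target-store.myshopify.com/products/example-handle.json",
    anti_bot=True,
    render_js=False
)

product = json.loads(response.text)["product"]
for variant in product["variants"]:
    print(variant["sku"], variant["price"], variant["available"])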

Here's a complete extractor that handles cursor-based pagination across the full catalog:

Python
import alterlab
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class Variant:
    id: int
    title: str
    price: str
    compare_at_price: str | None
    sku: str
    inventory_quantity: int
    available: bool

@dataclass
class Product:
    id: int
    title: str
    handle: str
    vendor: str
    product_type: str
    tags: List[str]
    variants: List[Variant] = field(default_factory=list)

def scrape_full_catalog(store_url: str, client: alterlab.Client) -> List[Product]:
    products: List[Product] = []
    since_id = 0

    while True:
        url = f"{store_url}/products.json?limit=250&since_id={since_id}"
        response = client.scrape(url=url, anti_bot=True, render_js=False)

        if response.status_code != 200:
            print(f"Got {response.status_code} — stopping pagination")
            break

        batch = json.loads(response.text).get("products", [])
        if not batch:
            break  # No more pages

        for p in batch:
            variants = [
                Variant(
                    id=v["id"],
                    title=v["title"],
                    price=v["price"],
                    compare_at_price=v.get("compare_at_price"),
                    sku=v.get("sku", ""),
                    inventory_quantity=v.get("inventory_quantity", 0),
                    available=v["available"],
                )
                for v in p["variants"]
            ]
            products.append(
                Product(
                    id=p["id"],
                    title=p["title"],
                    handle=p["handle"],
                    vendor=p["vendor"],
                    product_type=p["product_type"],
                    # tags may be a list (storefront JSON) or a comma-separated string
                    tags=list(p["tags"]) if isinstance(p.get("tags"), list)
                    else [t for t in (p.get("tags") or "").split(", ") if t],
                    variants=variants,
                )
            )

        since_id = batch[-1]["id"]
        print(f"Fetched {len(products)} products so far (last id: {since_id})")

    return products

The since_id cursor is the correct pagination approach in 2026. The old page=N parameter was deprecated by Shopify and no longer returns results beyond page 1 on most stores.

HTML Fallback (When JSON is Disabled)

Some stores restrict /products.json or return empty arrays. In that case, scrape the product listing pages with CSS selectors. Shopify's default Liquid themes follow consistent class conventions:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://target-store.myshopify.com/collections/all",
    anti_bot=True,
    render_js=True  # Required for headless storefronts
)

soup = BeautifulSoup(response.text, "html.parser")

# Selectors used by Shopify's stock themes (Dawn, Debut) and many derivatives
product_cards = soup.select(".product-item, .grid__item, [data-product-id]")

for card in product_cards:
    title_el = card.select_one(".product-item__title, .product__title, h2 a")
    price_el = card.select_one(
        ".price__regular .price-item, [data-product-price], .product-price"
    )
    link_el = card.select_one("a[href*='/products/']")

    title = title_el.get_text(strip=True) if title_el else "N/A"
    price = price_el.get_text(strip=True) if price_el else "N/A"
    href = link_el["href"] if link_el else "N/A"

    print(f"{title} | {price} | {href}")

Selector reliability caveat: custom themes break these selectors. When they do, look for a <script type="application/json" data-product-json> tag; Shopify injects the full product object as inline JSON on every product detail page regardless of theme.
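
A sketch of that fallback, assuming the page was fetched as in the example above and the theme emits the tag with this exact attribute (naming can vary slightly between themes, so confirm against the page source):

Python
import json
from bs4 import BeautifulSoup

# response.text is the HTML of a product detail page fetched via client.scrape()
soup = BeautifulSoup(response.text, "html.parser")

# The inline product JSON blob injected by Shopify themes
script = soup.find("script", attrs={"data-product-json": True})

if script and script.string:
    product = json.loads(script.string)
    print(product["title"], len(product.get("variants", [])), "variants")
else:
    print("No data-product-json tag found; inspect the theme's markup")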

Common Pitfalls

Rate limiting on /products.json. Most stores tolerate 2–4 requests per second before returning HTTP 429. Some implement silent throttling — the response is 200 but the products array is empty after a few pages. Add a 0.5–1s delay between paginated requests and respect Retry-After headers.
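
A small wrapper that applies both rules; a sketch, assuming the SDK's response object exposes headers (verify the response shape against your SDK version):

Python
import time

def polite_scrape(client, url, max_retries=3, delay=0.75):
    """Fetch a JSON page with a baseline delay and Retry-After handling."""
    response = None
    for _ in range(max_retries):
        response = client.scrape(url=url, anti_bot=True, render_js=False)

        if response.status_code == 429:
            # Honor the server's Retry-After header if present (value in seconds)
            retry_after = float(response.headers.get("Retry-After", 5))
            time.sleep(retry_after)
            continue

        time.sleep(delay)  # baseline 0.5-1s gap between paginated calls
        return response

    return response

Drop this in place of the bare client.scrape() call inside the pagination loop.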

The JSON API returns no results. Some merchants explicitly disable the JSON API in their Shopify settings or use a password-protected storefront. Check robots.txt first; if /products.json is disallowed, fall back to HTML scraping.
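
The robots.txt check takes a few lines with the standard library (the file is usually served without a challenge, so no anti-bot layer is needed here):

Python
from urllib.robotparser import RobotFileParser

store = "https://target-store.myshopify.com"

parser = RobotFileParser()
parser.set_url(f"{store}/robots.txt")
parser.read()  # fetches and parses the store's robots.txt

if parser.can_fetch("*", f"{store}/products.json"):
    print("JSON API allowed by robots.txt")
else:
    print("JSON API disallowed; fall back to HTML scraping")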

Variant data is incomplete. /products.json caps inventory quantities at 100 for stores using tracked inventory at the variant level. If you need exact counts, you need the Shopify Admin API — which requires OAuth and merchant consent, outside the scope of public data scraping.

Headless storefronts return skeleton HTML. Shopify Hydrogen and custom React storefronts render product grids entirely client-side. You'll get a <div id="main"> with nothing in it. Set render_js: true in your request and allow at least 2–3 seconds for hydration.
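
A sketch of that request; note that the wait parameter name below is hypothetical, so check the SDK reference for the actual option controlling post-render delay:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://headless-store.example.com/collections/all",
    anti_bot=True,
    render_js=True,
    wait=3000  # hypothetical: milliseconds to wait for hydration before capturing HTML
)

soup = BeautifulSoup(response.text, "html.parser")
print(len(soup.select("[data-product-id]")), "product cards after hydration")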

Session-gated pages. Flash sale storefronts, member-only collections, and age-gated stores require cookie-based sessions. Pass cookies in the request headers; do not attempt to bypass authenticated checkouts.
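
A sketch of forwarding an existing session cookie; the headers argument is an assumption about the SDK's request options, and storefront_digest is typically the cookie Shopify sets after a password-page login:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Forward a session cookie you obtained legitimately (e.g. after entering a
# storefront password). The "headers" kwarg is an assumed SDK option.
response = client.scrape(
    url="https://members-only-store.example.com/collections/vip",
    anti_bot=True,
    headers={"Cookie": "storefront_digest=YOUR_SESSION_COOKIE"}
)

print(response.status_code)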

Scaling Up

Here's an async multi-store scraper that handles a batch of stores concurrently:

Python
import asyncio
import json
import alterlab

client = alterlab.Client("YOUR_API_KEY")

STORES = [
    "https://store-a.myshopify.com",
    "https://store-b.myshopify.com",
    "https://store-c.myshopify.com",
    "https://store-d.myshopify.com",
]

async def scrape_store_products(store_url: str) -> tuple[str, list]:
    all_products = []
    since_id = 0

    while True:
        url = f"{store_url}/products.json?limit=250&since_id={since_id}"
        response = await client.async_scrape(url=url, anti_bot=True, render_js=False)

        if response.status_code != 200:
            break

        batch = json.loads(response.text).get("products", [])
        if not batch:
            break

        all_products.extend(batch)
        since_id = batch[-1]["id"]
        await asyncio.sleep(0.5)  # Polite delay between pages

    return store_url, all_products

async def main():
    tasks = [scrape_store_products(url) for url in STORES]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    total = 0
    # gather preserves input order, so zip results back onto their store URLs
    for store_url, outcome in zip(STORES, results):
        if isinstance(outcome, Exception):
            print(f"[ERROR] {store_url}: {outcome}")
        else:
            _, store_products = outcome
            print(f"[OK] {store_url}: {len(store_products)} products")
            total += len(store_products)

    print(f"\nTotal: {total} products across {len(STORES)} stores")

asyncio.run(main())

Throughput guidance:

  • JSON endpoint (no JS render): 10–20 concurrent requests is a safe ceiling before triggering rate limits across mixed store targets
  • JS-rendered pages: Cap at 5–8 concurrent — browser instances are CPU/memory-bound
  • For pipeline scheduling, tools like Prefect or Airflow work well for daily or hourly refresh cycles; a minimal sketch follows below
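
A minimal Prefect sketch for the hourly case, assuming the async main() from the multi-store example above lives in a module named scraper (the module name is hypothetical); Airflow follows the same shape with a DAG and a schedule:

Python
import asyncio
from prefect import flow

from scraper import main  # hypothetical module holding the async scraper above

@flow(retries=1, log_prints=True)
def hourly_shopify_refresh():
    # Reuse the async multi-store scraper defined earlier in this guide
    asyncio.run(main())

if __name__ == "__main__":
    # Lightweight local scheduler: trigger the flow every hour (3600 seconds)
    hourly_shopify_refresh.serve(name="shopify-refresh", interval=3600)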

On cost at scale: Request volume is the primary cost driver. For bulk catalog crawls where the JSON API is accessible, you avoid JavaScript rendering credits entirely. See AlterLab's pricing for credit costs at different volume tiers — most production price-monitoring pipelines land in the Growth or Business tiers.

Key Takeaways

  • Start with /products.json — it's faster, cheaper, and more structured than HTML scraping. Most Shopify stores leave it accessible.
  • Use since_id for pagination, not page=N. The legacy page parameter is deprecated and silently truncates results.
  • Set render_js: false for JSON endpoints. Only enable JS rendering for headless storefronts or when the JSON API is disabled.
  • Respect rate limits — a 500ms delay between paginated calls on the same store avoids silent throttling.
  • Extract data-product-json script tags when CSS selectors fail on custom themes — Shopify injects the full product object as inline JSON on every product page.
  • Anti-bot bypass is necessary for stores behind Cloudflare. TLS fingerprinting alone blocks naive requests-based scrapers before any challenge is even served.



Frequently Asked Questions

Is it legal to scrape Shopify stores?
Scraping publicly accessible product pages and pricing data is generally permissible in most jurisdictions, consistent with the hiQ v. LinkedIn precedent on public data. That said, Shopify's platform Terms of Service and individual merchant store policies vary — always check the store's robots.txt, avoid scraping authenticated sessions, and don't hammer endpoints at rates that degrade service. Use collected data for research, price monitoring, or competitive analysis rather than republishing content verbatim.

Why do I need an anti-bot bypass to scrape Shopify?
Most Shopify stores use Cloudflare or native bot scoring that fingerprints TLS signatures, browser headers, and JavaScript execution behavior. Rolling your own bypass — managing cipher suites, rotating residential IPs, and solving challenges — is brittle and requires constant maintenance. AlterLab's anti-bot bypass API handles all of this transparently, routing requests through real browser fingerprints and residential proxies so you get consistent success rates without building or maintaining bypass infrastructure yourself.

How much does it cost to scrape Shopify stores?
Cost depends on request volume and whether you need JavaScript rendering. For stores where the /products.json endpoint is accessible, standard (non-JS) requests are the cheapest tier. JavaScript-rendered requests cost more credits per call but are necessary when the JSON API is disabled or you need dynamic storefront data. AlterLab uses credit-based pricing with volume discounts at higher tiers — see the pricing page for a full breakdown and plan comparison.