
How to Scrape Shopify Stores: Complete Guide for 2026

Learn how to scrape Shopify stores in 2026 with Python — extract products, prices, and inventory using the JSON API and anti-bot bypass. Production-ready guide.

Yash Dubey

March 25, 2026

8 min read

Shopify powers over 4.5 million active online stores. Every one of them is a structured dataset: products, variants, pricing, inventory, collections, and vendor metadata — sitting behind a predictable URL schema. If you're building a price intelligence tool, a competitor monitoring pipeline, or a retail analytics system, Shopify is one of the most consistently structured targets you'll encounter in e-commerce scraping.

This guide covers the full stack: from Shopify's public JSON endpoints (faster and cleaner than HTML parsing) to handling anti-bot protections when those endpoints are locked down.

Why Scrape Shopify Stores?

Three use cases drive most Shopify scraping workloads:

Price and inventory monitoring. Brands and retailers track competitor pricing in near real-time to power dynamic repricing engines. Shopify's variant-level data includes price, compare-at price, and inventory quantity — everything needed to feed a pricing model without any HTML parsing.

Lead generation and market research. Aggregating store metadata — vendor names, product categories, brand positioning, SKU counts — gives agencies and SaaS tools a filtered view of which Shopify merchants are operating in a given niche.

Catalog aggregation. Marketplaces and comparison engines pull structured product data (title, description, images, tags) across thousands of stores to build searchable indexes.

  • 4.5M+ active Shopify stores
  • 250 products per JSON page
  • 99.2% AlterLab success rate
  • 1.4s average response time

Anti-Bot Challenges on Shopify Stores

Shopify itself doesn't deploy heavy anti-bot infrastructure at the platform level, but individual merchants do — and the stack is increasingly aggressive in 2026.

Cloudflare is the dominant layer. Most mid-to-large Shopify stores sit behind Cloudflare, which means your scraper faces browser integrity checks (JS challenges), managed challenge pages, and TLS fingerprinting. A plain requests session with a spoofed User-Agent fails immediately — Cloudflare scores TLS cipher suites and HTTP/2 frame ordering at the network level before any JavaScript runs.

Shopify's native bot protection. Shopify's checkout and storefront expose behavioral bot scoring via their fraud prevention tooling. High-frequency requests from a single IP or ASN trigger rate limiting (typically HTTP 429) or silent throttling where responses are served stale or incomplete.

JavaScript-gated storefronts. Headless Shopify storefronts built on Hydrogen or custom themes frequently render product grids client-side. The initial HTML payload is a shell; product data arrives via XHR after JavaScript executes. Standard HTTP scrapers get nothing useful.

The practical consequence: DIY scraping against protected Shopify stores requires managing residential proxy pools, implementing TLS fingerprint spoofing (via curl-impersonate or tls-client), and running headless Chromium for JS-rendered pages — each of which has its own maintenance overhead.
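
To make that maintenance overhead concrete, here is a minimal sketch of the DIY route using curl_cffi (Python bindings for curl-impersonate) to spoof the TLS fingerprint. The store URL and proxy address are placeholders, and this still leaves JS challenges and IP rotation for you to solve:

Python
# DIY sketch: mimic a real Chrome TLS/HTTP2 fingerprint with curl_cffi.
# This only addresses the network-level fingerprint; JS challenges and
# IP reputation still need separate handling.
from curl_cffi import requests

response = requests.get(
    "https://target-store.myshopify.com/products.json?limit=250",
    impersonate="chrome",  # impersonation target names depend on your curl_cffi version
    proxies={"https": "http://user:pass@residential-proxy.example:8080"},  # placeholder proxy
    timeout=30,
)

if response.status_code == 200:
    print(f"{len(response.json().get('products', []))} products")
else:
    print(f"Blocked or challenged: {response.status_code}")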

AlterLab's anti-bot bypass API consolidates all of this: residential proxies, real browser fingerprints, and automatic challenge solving, exposed through a single HTTP endpoint.

Quick Start with AlterLab

Install the SDK and grab your API key from the getting started guide.

Bash
pip install alterlab

The fastest path to Shopify product data is the /products.json endpoint — publicly accessible on most stores, no JavaScript required, returns clean JSON with full product and variant detail.

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

# Most Shopify stores expose /products.json with no auth required
response = client.scrape(
    url="https://target-store.myshopify.com/products.json?limit=250",
    anti_bot=True,
    render_js=False
)

data = json.loads(response.text)
products = data["products"]

for product in products:
    price = product["variants"][0]["price"]
    available = product["variants"][0]["available"]
    print(f"{product['title']} — ${price} ({'in stock' if available else 'OOS'})")

The equivalent with cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target-store.myshopify.com/products.json?limit=250",
    "anti_bot": true,
    "render_js": false
  }'

Set render_js: false explicitly for JSON endpoints — it halves latency since there's no browser spin-up cost.


Extracting Structured Data

The JSON API (Preferred)

The Shopify Storefront JSON API is the cleanest path. Each endpoint returns structured data without any HTML parsing:

Endpoint                                 Returns
/products.json?limit=250                 Paginated product catalog
/collections.json                        All store collections
/collections/{handle}/products.json      Products in a specific collection
/products/{handle}.json                  Single product with full variant detail
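
The single-product endpoint is useful for spot checks or change detection on known handles. A minimal sketch, with the store URL and product handle as placeholders:

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

# /products/{handle}.json wraps one product in a top-level "product" key
response = client.scrape(
    url="https://target-store.myshopify.com/products/example-handle.json",
    anti_bot=True,
    render_js=False
)

product = json.loads(response.text)["product"]
for variant in product["variants"]:
    print(variant["sku"], variant["price"], variant["available"])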

Here's a complete extractor that handles cursor-based pagination across the full catalog:

Python
import alterlab
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class Variant:
    id: int
    title: str
    price: str
    compare_at_price: str | None
    sku: str
    inventory_quantity: int
    available: bool

@dataclass
class Product:
    id: int
    title: str
    handle: str
    vendor: str
    product_type: str
    tags: List[str]
    variants: List[Variant] = field(default_factory=list)

def scrape_full_catalog(store_url: str, client: alterlab.Client) -> List[Product]:
    products: List[Product] = []
    since_id = 0

    while True:
        url = f"{store_url}/products.json?limit=250&since_id={since_id}"
        response = client.scrape(url=url, anti_bot=True, render_js=False)

        if response.status_code != 200:
            print(f"Got {response.status_code} — stopping pagination")
            break

        batch = json.loads(response.text).get("products", [])
        if not batch:
            break  # No more pages

        for p in batch:
            variants = [
                Variant(
                    id=v["id"],
                    title=v["title"],
                    price=v["price"],
                    compare_at_price=v.get("compare_at_price"),
                    sku=v.get("sku", ""),
                    inventory_quantity=v.get("inventory_quantity", 0),
                    available=v["available"],
                )
                for v in p["variants"]
            ]
            products.append(
                Product(
                    id=p["id"],
                    title=p["title"],
                    handle=p["handle"],
                    vendor=p["vendor"],
                    product_type=p["product_type"],
                    # tags may be a list (storefront JSON) or a comma-separated string
                    tags=list(p["tags"]) if isinstance(p.get("tags"), list)
                    else [t for t in (p.get("tags") or "").split(", ") if t],
                    variants=variants,
                )
            )

        since_id = batch[-1]["id"]
        print(f"Fetched {len(products)} products so far (last id: {since_id})")

    return products

The since_id cursor is the correct pagination approach in 2026. The old page=N parameter was deprecated by Shopify and no longer returns results beyond page 1 on most stores.

HTML Fallback (When JSON is Disabled)

Some stores restrict /products.json or return empty arrays. In that case, scrape the product listing pages with CSS selectors. Shopify's default Liquid themes follow consistent class conventions:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://target-store.myshopify.com/collections/all",
    anti_bot=True,
    render_js=True  # Required for headless storefronts
)

soup = BeautifulSoup(response.text, "html.parser")

# Selectors used by Shopify's stock themes (Dawn, Debut) and many derivatives
product_cards = soup.select(".product-item, .grid__item, [data-product-id]")

for card in product_cards:
    title_el = card.select_one(".product-item__title, .product__title, h2 a")
    price_el = card.select_one(
        ".price__regular .price-item, [data-product-price], .product-price"
    )
    link_el = card.select_one("a[href*='/products/']")

    title = title_el.get_text(strip=True) if title_el else "N/A"
    price = price_el.get_text(strip=True) if price_el else "N/A"
    href = link_el["href"] if link_el else "N/A"

    print(f"{title} | {price} | {href}")

Selector reliability caveat: custom themes break these selectors. When they do, look for a <script type="application/json" data-product-json> tag; Shopify injects the full product object as inline JSON on every product detail page regardless of theme.
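
A sketch of that fallback, assuming the page was fetched as in the example above and the theme emits the tag with this exact attribute (naming can vary slightly between themes, so confirm against the page source):

Python
import json
from bs4 import BeautifulSoup

# response.text is the HTML of a product detail page fetched via client.scrape()
soup = BeautifulSoup(response.text, "html.parser")

# The inline product JSON blob injected by Shopify themes
script = soup.find("script", attrs={"data-product-json": True})

if script and script.string:
    product = json.loads(script.string)
    print(product["title"], len(product.get("variants", [])), "variants")
else:
    print("No data-product-json tag found; inspect the theme's markup")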

Common Pitfalls

Rate limiting on /products.json. Most stores tolerate 2–4 requests per second before returning HTTP 429. Some implement silent throttling — the response is 200 but the products array is empty after a few pages. Add a 0.5–1s delay between paginated requests and respect Retry-After headers.
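
A small wrapper that applies both rules; a sketch, assuming the SDK's response object exposes headers (verify the response shape against your SDK version):

Python
import time

def polite_scrape(client, url, max_retries=3, delay=0.75):
    """Fetch a JSON page with a baseline delay and Retry-After handling."""
    response = None
    for _ in range(max_retries):
        response = client.scrape(url=url, anti_bot=True, render_js=False)

        if response.status_code == 429:
            # Honor the server's Retry-After header if present (value in seconds)
            retry_after = float(response.headers.get("Retry-After", 5))
            time.sleep(retry_after)
            continue

        time.sleep(delay)  # baseline 0.5-1s gap between paginated calls
        return response

    return response

Drop this in place of the bare client.scrape() call inside the pagination loop.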

The JSON API returns no results. Some merchants explicitly disable the JSON API in their Shopify settings or use a password-protected storefront. Check robots.txt first; if /products.json is disallowed, fall back to HTML scraping.
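
The robots.txt check takes a few lines with the standard library (the file is usually served without a challenge, so no anti-bot layer is needed here):

Python
from urllib.robotparser import RobotFileParser

store = "https://target-store.myshopify.com"

parser = RobotFileParser()
parser.set_url(f"{store}/robots.txt")
parser.read()  # fetches and parses the store's robots.txt

if parser.can_fetch("*", f"{store}/products.json"):
    print("JSON API allowed by robots.txt")
else:
    print("JSON API disallowed; fall back to HTML scraping")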

Variant data is incomplete. /products.json caps inventory quantities at 100 for stores using tracked inventory at the variant level. If you need exact counts, you need the Shopify Admin API — which requires OAuth and merchant consent, outside the scope of public data scraping.

Headless storefronts return skeleton HTML. Shopify Hydrogen and custom React storefronts render product grids entirely client-side. You'll get a <div id="main"> with nothing in it. Set render_js: true in your request and allow at least 2–3 seconds for hydration.
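
A sketch of that request; note that the wait parameter name below is hypothetical, so check the SDK reference for the actual option controlling post-render delay:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://headless-store.example.com/collections/all",
    anti_bot=True,
    render_js=True,
    wait=3000  # hypothetical: milliseconds to wait for hydration before capturing HTML
)

soup = BeautifulSoup(response.text, "html.parser")
print(len(soup.select("[data-product-id]")), "product cards after hydration")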

Session-gated pages. Flash sale storefronts, member-only collections, and age-gated stores require cookie-based sessions. Pass cookies in the request headers; do not attempt to bypass authenticated checkouts.
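
A sketch of forwarding an existing session cookie; the headers argument is an assumption about the SDK's request options, and storefront_digest is typically the cookie Shopify sets after a password-page login:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Forward a session cookie you obtained legitimately (e.g. after entering a
# storefront password). The "headers" kwarg is an assumed SDK option.
response = client.scrape(
    url="https://members-only-store.example.com/collections/vip",
    anti_bot=True,
    headers={"Cookie": "storefront_digest=YOUR_SESSION_COOKIE"}
)

print(response.status_code)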

Scaling Up

Here's an async multi-store scraper that handles a batch of stores concurrently:

Python
import asyncio
import json
import alterlab

client = alterlab.Client("YOUR_API_KEY")

STORES = [
    "https://store-a.myshopify.com",
    "https://store-b.myshopify.com",
    "https://store-c.myshopify.com",
    "https://store-d.myshopify.com",
]

async def scrape_store_products(store_url: str) -> tuple[str, list]:
    all_products = []
    since_id = 0

    while True:
        url = f"{store_url}/products.json?limit=250&since_id={since_id}"
        response = await client.async_scrape(url=url, anti_bot=True, render_js=False)

        if response.status_code != 200:
            break

        batch = json.loads(response.text).get("products", [])
        if not batch:
            break

        all_products.extend(batch)
        since_id = batch[-1]["id"]
        await asyncio.sleep(0.5)  # Polite delay between pages

    return store_url, all_products

async def main():
    tasks = [scrape_store_products(url) for url in STORES]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    total = 0
    # gather preserves input order, so zip results back onto their store URLs
    for store_url, outcome in zip(STORES, results):
        if isinstance(outcome, Exception):
            print(f"[ERROR] {store_url}: {outcome}")
        else:
            _, store_products = outcome
            print(f"[OK] {store_url}: {len(store_products)} products")
            total += len(store_products)

    print(f"\nTotal: {total} products across {len(STORES)} stores")

asyncio.run(main())

Throughput guidance:

  • JSON endpoint (no JS render): 10–20 concurrent requests is a safe ceiling before triggering rate limits across mixed store targets
  • JS-rendered pages: Cap at 5–8 concurrent — browser instances are CPU/memory-bound
  • For pipeline scheduling, tools like Prefect or Airflow work well for daily or hourly refresh cycles; a minimal sketch follows below
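
A minimal Prefect sketch for the hourly case, assuming the async main() from the multi-store example above lives in a module named scraper (the module name is hypothetical); Airflow follows the same shape with a DAG and a schedule:

Python
import asyncio
from prefect import flow

from scraper import main  # hypothetical module holding the async scraper above

@flow(retries=1, log_prints=True)
def hourly_shopify_refresh():
    # Reuse the async multi-store scraper defined earlier in this guide
    asyncio.run(main())

if __name__ == "__main__":
    # Lightweight local scheduler: trigger the flow every hour (3600 seconds)
    hourly_shopify_refresh.serve(name="shopify-refresh", interval=3600)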

On cost at scale: Request volume is the primary cost driver. For bulk catalog crawls where the JSON API is accessible, you avoid JavaScript rendering credits entirely. See AlterLab's pricing for credit costs at different volume tiers — most production price-monitoring pipelines land in the Growth or Business tiers.

Key Takeaways

  • Start with /products.json — it's faster, cheaper, and more structured than HTML scraping. Most Shopify stores leave it accessible.
  • Use since_id for pagination, not page=N. The legacy page parameter is deprecated and silently truncates results.
  • Set render_js: false for JSON endpoints. Only enable JS rendering for headless storefronts or when the JSON API is disabled.
  • Respect rate limits — a 500ms delay between paginated calls on the same store avoids silent throttling.
  • Extract data-product-json script tags when CSS selectors fail on custom themes — Shopify injects the full product object as inline JSON on every product page.
  • Anti-bot bypass is necessary for stores behind Cloudflare. TLS fingerprinting alone blocks naive requests-based scrapers before any challenge is even served.



Frequently Asked Questions

Is it legal to scrape Shopify stores?
Scraping publicly accessible product pages and pricing data is generally permissible in most jurisdictions, consistent with the hiQ v. LinkedIn precedent on public data. That said, Shopify's platform Terms of Service and individual merchant store policies vary — always check the store's robots.txt, avoid scraping authenticated sessions, and don't hammer endpoints at rates that degrade service. Use collected data for research, price monitoring, or competitive analysis rather than republishing content verbatim.

Why do I need an anti-bot bypass to scrape Shopify?
Most Shopify stores use Cloudflare or native bot scoring that fingerprints TLS signatures, browser headers, and JavaScript execution behavior. Rolling your own bypass — managing cipher suites, rotating residential IPs, and solving challenges — is brittle and requires constant maintenance. AlterLab's anti-bot bypass API handles all of this transparently, routing requests through real browser fingerprints and residential proxies so you get consistent success rates without building or maintaining bypass infrastructure yourself.

How much does it cost to scrape Shopify stores?
Cost depends on request volume and whether you need JavaScript rendering. For stores where the /products.json endpoint is accessible, standard (non-JS) requests are the cheapest tier. JavaScript-rendered requests cost more credits per call but are necessary when the JSON API is disabled or you need dynamic storefront data. AlterLab uses credit-based pricing with volume discounts at higher tiers — see the pricing page for a full breakdown and plan comparison.