How to Scrape Shopify Stores: Complete Guide for 2026
Learn how to scrape Shopify stores in 2026 with Python — extract products, prices, and inventory using the JSON API and anti-bot bypass. Production-ready guide.
March 25, 2026
Shopify powers over 4.5 million active online stores. Every one of them is a structured dataset: products, variants, pricing, inventory, collections, and vendor metadata — sitting behind a predictable URL schema. If you're building a price intelligence tool, a competitor monitoring pipeline, or a retail analytics system, Shopify is one of the most consistently structured targets you'll encounter in e-commerce scraping.
This guide covers the full stack: from Shopify's public JSON endpoints (faster and cleaner than HTML parsing) to handling anti-bot protections when those endpoints are locked down.
Why Scrape Shopify Stores?
Three use cases drive most Shopify scraping workloads:
Price and inventory monitoring. Brands and retailers track competitor pricing in near real-time to power dynamic repricing engines. Shopify's variant-level data includes price, compare-at price, and inventory quantity — everything needed to feed a pricing model without any HTML parsing.
Lead generation and market research. Aggregating store metadata — vendor names, product categories, brand positioning, SKU counts — gives agencies and SaaS tools a filtered view of which Shopify merchants are operating in a given niche.
Catalog aggregation. Marketplaces and comparison engines pull structured product data (title, description, images, tags) across thousands of stores to build searchable indexes.
Anti-Bot Challenges on Shopify Stores
Shopify itself doesn't deploy heavy anti-bot infrastructure at the platform level, but individual merchants do — and the stack is increasingly aggressive in 2026.
Cloudflare is the dominant layer. Most mid-to-large Shopify stores sit behind Cloudflare, which means your scraper faces browser integrity checks (JS challenges), managed challenge pages, and TLS fingerprinting. A plain requests session with a spoofed User-Agent fails immediately — Cloudflare scores TLS cipher suites and HTTP/2 frame ordering at the network level before any JavaScript runs.
Shopify's native bot protection. Shopify's checkout and storefront expose behavioral bot scoring via their fraud prevention tooling. High-frequency requests from a single IP or ASN trigger rate limiting (typically HTTP 429) or silent throttling where responses are served stale or incomplete.
JavaScript-gated storefronts. Headless Shopify storefronts built on Hydrogen or custom themes frequently render product grids client-side. The initial HTML payload is a shell; product data arrives via XHR after JavaScript executes. Standard HTTP scrapers get nothing useful.
The practical consequence: DIY scraping against protected Shopify stores requires managing residential proxy pools, implementing TLS fingerprint spoofing (via curl-impersonate or tls-client), and running headless Chromium for JS-rendered pages — each of which has its own maintenance overhead.
AlterLab's anti-bot bypass API consolidates all of this: residential proxies, real browser fingerprints, and automatic challenge solving, exposed through a single HTTP endpoint.
Quick Start with AlterLab
Install the SDK and grab your API key from the getting started guide.
```bash
pip install alterlab
```

The fastest path to Shopify product data is the `/products.json` endpoint — publicly accessible on most stores, no JavaScript required, returns clean JSON with full product and variant detail.
```python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

# Most Shopify stores expose /products.json with no auth required
response = client.scrape(
    url="https://target-store.myshopify.com/products.json",
    params={
        "limit": 250,
        "anti_bot": True
    }
)

data = json.loads(response.text)
products = data["products"]

for product in products:
    price = product["variants"][0]["price"]
    available = product["variants"][0]["available"]
    print(f"{product['title']} — ${price} ({'in stock' if available else 'OOS'})")
```

The equivalent with cURL:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target-store.myshopify.com/products.json?limit=250",
    "anti_bot": true,
    "render_js": false
  }'
```

Set `render_js: false` explicitly for JSON endpoints — it halves latency since there's no browser spin-up cost.
Try scraping a Shopify products.json endpoint with AlterLab
Extracting Structured Data
The JSON API (Preferred)
The Shopify Storefront JSON API is the cleanest path. Each endpoint returns structured data without any HTML parsing:
| Endpoint | Returns |
|---|---|
| `/products.json?limit=250` | Paginated product catalog |
| `/collections.json` | All store collections |
| `/collections/{handle}/products.json` | Products in a specific collection |
| `/products/{handle}.json` | Single product with full variant detail |
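As a quick reference, these endpoint URLs can be assembled with a few small helpers; the store domain below is a placeholder, not a real target:

```python
# Helpers for building the public Shopify JSON endpoint URLs listed above.
# The base URL in the usage example is a placeholder.

def products_url(base: str, limit: int = 250, since_id: int = 0) -> str:
    """Paginated product catalog endpoint."""
    return f"{base}/products.json?limit={limit}&since_id={since_id}"

def collection_products_url(base: str, handle: str) -> str:
    """Products within a single collection."""
    return f"{base}/collections/{handle}/products.json"

def product_url(base: str, handle: str) -> str:
    """Single product with full variant detail."""
    return f"{base}/products/{handle}.json"

base = "https://target-store.myshopify.com"
print(products_url(base, since_id=123))
# https://target-store.myshopify.com/products.json?limit=250&since_id=123
```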
Here's a complete extractor that handles cursor-based pagination across the full catalog:
```python
import alterlab
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class Variant:
    id: int
    title: str
    price: str
    compare_at_price: str | None
    sku: str
    inventory_quantity: int
    available: bool

@dataclass
class Product:
    id: int
    title: str
    handle: str
    vendor: str
    product_type: str
    tags: List[str]
    variants: List[Variant] = field(default_factory=list)

def scrape_full_catalog(store_url: str, client: alterlab.Client) -> List[Product]:
    products: List[Product] = []
    since_id = 0

    while True:
        url = f"{store_url}/products.json?limit=250&since_id={since_id}"
        response = client.scrape(url=url, anti_bot=True, render_js=False)

        if response.status_code != 200:
            print(f"Got {response.status_code} — stopping pagination")
            break

        batch = json.loads(response.text).get("products", [])
        if not batch:
            break  # No more pages

        for p in batch:
            variants = [
                Variant(
                    id=v["id"],
                    title=v["title"],
                    price=v["price"],
                    compare_at_price=v.get("compare_at_price"),
                    sku=v.get("sku", ""),
                    inventory_quantity=v.get("inventory_quantity", 0),
                    available=v["available"],
                )
                for v in p["variants"]
            ]
            # The storefront endpoint returns tags as a list; some payloads
            # serialize them as a comma-separated string, so handle both.
            raw_tags = p.get("tags") or []
            tags = raw_tags if isinstance(raw_tags, list) else raw_tags.split(", ")
            products.append(
                Product(
                    id=p["id"],
                    title=p["title"],
                    handle=p["handle"],
                    vendor=p["vendor"],
                    product_type=p["product_type"],
                    tags=tags,
                    variants=variants,
                )
            )

        since_id = batch[-1]["id"]
        print(f"Fetched {len(products)} products so far (last id: {since_id})")

    return products
```

The `since_id` cursor is the correct pagination approach in 2026. The old `page=N` parameter was deprecated by Shopify and no longer returns results beyond page 1 on most stores.
HTML Fallback (When JSON is Disabled)
Some stores restrict /products.json or return empty arrays. In that case, scrape the product listing pages with CSS selectors. Shopify's default Liquid themes follow consistent class conventions:
```python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://target-store.myshopify.com/collections/all",
    anti_bot=True,
    render_js=True  # Required for headless storefronts
)

soup = BeautifulSoup(response.text, "html.parser")

# Standard Shopify theme selectors (Dawn, Debut, and many custom themes)
product_cards = soup.select(".product-item, .grid__item, [data-product-id]")

for card in product_cards:
    title_el = card.select_one(".product-item__title, .product__title, h2 a")
    price_el = card.select_one(
        ".price__regular .price-item, [data-product-price], .product-price"
    )
    link_el = card.select_one("a[href*='/products/']")

    title = title_el.get_text(strip=True) if title_el else "N/A"
    price = price_el.get_text(strip=True) if price_el else "N/A"
    href = link_el["href"] if link_el else "N/A"
    print(f"{title} | {price} | {href}")
```

Selector reliability caveat: heavily customized themes break these selectors. When they do, look for `<script type="application/json" data-product-json>` — Shopify injects the full product object as inline JSON on every product detail page (PDP) regardless of theme.
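A sketch of that fallback, assuming the theme embeds a `data-product-json` script tag; the HTML here is a stand-in for a real product page:

```python
import json
from bs4 import BeautifulSoup

# Stand-in for a fetched PDP; real pages embed the full product object.
html = """
<div class="product">
  <script type="application/json" data-product-json>
    {"id": 123, "title": "Trail Shoe", "variants": [{"price": "89.00"}]}
  </script>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.select_one("script[data-product-json]")
if tag:
    product = json.loads(tag.string)
    print(product["title"], product["variants"][0]["price"])
```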
Common Pitfalls
Rate limiting on /products.json. Most stores tolerate 2–4 requests per second before returning HTTP 429. Some implement silent throttling — the response is 200 but the products array is empty after a few pages. Add a 0.5–1s delay between paginated requests and respect Retry-After headers.
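One way to implement that politeness, sketched as a small backoff helper; the `headers` dict is a stand-in for whatever your HTTP client exposes:

```python
def backoff_delay(status_code: int, headers: dict, attempt: int,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Return seconds to sleep before the next paginated request.

    Honors Retry-After on HTTP 429; otherwise uses capped exponential
    backoff starting from a polite 0.5s baseline between pages.
    """
    if status_code == 429:
        retry_after = headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            return min(float(retry_after), cap)
    return min(base * (2 ** attempt), cap)

print(backoff_delay(200, {}, attempt=0))                    # 0.5
print(backoff_delay(429, {"Retry-After": "4"}, attempt=2))  # 4.0
```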
The JSON API returns no results. Some merchants explicitly disable the JSON API in their Shopify settings or use a password-protected storefront. Check robots.txt first; if /products.json is disallowed, fall back to HTML scraping.
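That robots.txt check is easy to automate with the standard library; this example parses a robots.txt body you have already fetched (the rules shown are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body; in practice, fetch https://<store>/robots.txt first.
robots_txt = """
User-agent: *
Disallow: /products.json
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

if rp.can_fetch("*", "https://target-store.myshopify.com/products.json"):
    print("JSON API allowed: scrape it")
else:
    print("JSON API disallowed: fall back to HTML scraping")
```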
Variant data is incomplete. /products.json caps inventory quantities at 100 for stores using tracked inventory at the variant level. If you need exact counts, you need the Shopify Admin API — which requires OAuth and merchant consent, outside the scope of public data scraping.
Headless storefronts return skeleton HTML. Shopify Hydrogen and custom React storefronts render product grids entirely client-side. You'll get a <div id="main"> with nothing in it. Set render_js: true in your request and allow at least 2–3 seconds for hydration.
Session-gated pages. Flash sale storefronts, member-only collections, and age-gated stores require cookie-based sessions. Pass cookies in the request headers; do not attempt to bypass authenticated checkouts.
Scaling Up
Here's an async multi-store scraper that handles a batch of stores concurrently:
```python
import asyncio
import json
import alterlab

client = alterlab.Client("YOUR_API_KEY")

STORES = [
    "https://store-a.myshopify.com",
    "https://store-b.myshopify.com",
    "https://store-c.myshopify.com",
    "https://store-d.myshopify.com",
]

async def scrape_store_products(store_url: str) -> tuple[str, list]:
    all_products = []
    since_id = 0

    while True:
        url = f"{store_url}/products.json?limit=250&since_id={since_id}"
        response = await client.async_scrape(url=url, anti_bot=True, render_js=False)

        if response.status_code != 200:
            break

        batch = json.loads(response.text).get("products", [])
        if not batch:
            break

        all_products.extend(batch)
        since_id = batch[-1]["id"]
        await asyncio.sleep(0.5)  # Polite delay between pages

    return store_url, all_products

async def main():
    tasks = [scrape_store_products(url) for url in STORES]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    total = 0
    # With return_exceptions=True, a failed task yields the exception object
    # itself, not a (store_url, products) tuple — check before unpacking.
    for store_url, result in zip(STORES, results):
        if isinstance(result, BaseException):
            print(f"[ERROR] {store_url}: {result}")
        else:
            _, products = result
            print(f"[OK] {store_url}: {len(products)} products")
            total += len(products)

    print(f"\nTotal: {total} products across {len(STORES)} stores")

asyncio.run(main())
```

Throughput guidance:
- JSON endpoint (no JS render): 10–20 concurrent requests is a safe ceiling before triggering rate limits across mixed store targets
- JS-rendered pages: Cap at 5–8 concurrent — browser instances are CPU/memory-bound
- For pipeline scheduling, tools like Prefect or Airflow work well for daily or hourly refresh cycles
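Those concurrency ceilings are straightforward to enforce with an `asyncio.Semaphore`. A minimal sketch, with a dummy coroutine standing in for the real scrape call:

```python
import asyncio

MAX_CONCURRENT = 10  # JSON-endpoint ceiling; drop to 5-8 for JS-rendered pages

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    # Acquire a slot before doing any work, so no more than
    # MAX_CONCURRENT requests are ever in flight at once.
    async with sem:
        await asyncio.sleep(0.01)  # stands in for the real HTTP call
        return f"done: {url}"

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    urls = [f"https://store-{i}.myshopify.com/products.json" for i in range(25)]
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

results = asyncio.run(main())
print(len(results))  # 25
```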
On cost at scale: Request volume is the primary cost driver. For bulk catalog crawls where the JSON API is accessible, you avoid JavaScript rendering credits entirely. See AlterLab's pricing for credit costs at different volume tiers — most production price-monitoring pipelines land in the Growth or Business tiers.
Key Takeaways
- Start with `/products.json` — it's faster, cheaper, and more structured than HTML scraping. Most Shopify stores leave it accessible.
- Use `since_id` for pagination, not `page=N`. The legacy page parameter is deprecated and silently truncates results.
- Set `render_js: false` for JSON endpoints. Only enable JS rendering for headless storefronts or when the JSON API is disabled.
- Respect rate limits — a 500ms delay between paginated calls on the same store avoids silent throttling.
- Extract `data-product-json` script tags when CSS selectors fail on custom themes — Shopify injects the full product object as inline JSON on every product page.
- Anti-bot bypass is necessary for stores behind Cloudflare. TLS fingerprinting alone blocks naive `requests`-based scrapers before any challenge is even served.
Related Guides
Building a broader e-commerce scraping pipeline? These guides cover the other major platforms:
- How to Scrape Amazon — Handling Amazon's aggressive bot detection, product ASIN extraction, and review scraping
- How to Scrape eBay — Auction data, sold listings, and seller profile extraction
- How to Scrape Walmart — Walmart's API-backed frontend, price history, and inventory signals