
How to Scrape Zillow: Complete Guide for 2026
Learn how to scrape Zillow property listings with Python in 2026. Beat Cloudflare protection, handle JS rendering, and extract real estate data at scale.
March 28, 2026
Zillow blocks most scrapers within seconds. It runs Cloudflare's Enterprise Bot Management, renders all listing data client-side via React, and fingerprints TLS connections to identify non-browser clients. Standard tooling—requests, basic Selenium, unpatched Playwright—fails before the first listing loads.
This guide covers everything you need to extract property listings, prices, and details from Zillow reliably in 2026: what protections you're dealing with, how to bypass them, where the data actually lives in the page, and how to scale to thousands of requests without hitting rate limits.
Why Scrape Zillow?
Three high-value use cases drive most Zillow scraping pipelines:
Real estate price monitoring. Track listing prices, days on market, and price reductions across specific ZIP codes or neighborhoods. Feed this into dashboards or alerting systems that fire when a property hits a target price point or reduces by more than a threshold percentage.
Lead generation for agents and investors. Pull new listings as they appear, including seller context, listing agent details, and price history. Build automated CRM workflows or outreach pipelines that act on fresh inventory before it gets competitive.
Market research and academic analysis. Zillow covers over 100 million US properties with historical price data, Zestimate valuations, and tax records. This dataset underpins housing market studies, investment underwriting models, and economic research that would otherwise require expensive licensed data feeds.
Anti-Bot Challenges on zillow.com
Understanding the protection stack is necessary before writing a single line of scraping code.
Cloudflare Enterprise Bot Management. Every request passes through Cloudflare's bot score evaluation. Suspicious clients—those with mismatched TLS fingerprints, missing browser APIs, or mechanical request timing—receive JavaScript challenges or managed CAPTCHAs. This happens before any Zillow application code runs.
TLS and HTTP/2 fingerprinting. Cloudflare inspects the TLS handshake: cipher suite ordering, extension presence and order, ALPN negotiation values. Python's requests library (backed by urllib3) produces a fingerprint that differs measurably from Chrome or Firefox. Cloudflare maintains fingerprint databases and blocks known non-browser patterns.
JavaScript-rendered content. Zillow's search and detail pages are Next.js applications. The raw HTML from a basic HTTP fetch contains scaffolding and metadata but virtually no listing data. The actual property information is either embedded in a <script id="__NEXT_DATA__"> tag after JS execution or injected into the DOM during React hydration. You need a real browser context to get populated HTML.
Behavioral fingerprinting. Request velocity, scroll events, mouse movement patterns, and time-between-clicks are analyzed. Pipelines that hit pages too fast or with perfectly uniform intervals trigger soft blocks—you'll see 429 responses or silently empty result sets.
IP reputation. Datacenter IP ranges are blocked at the edge. Residential or ISP proxies, rotated per-request or per-session, are required for consistent access.
Building this stack yourself—custom TLS fingerprints, maintained residential proxy pools, behavioral simulation, and Cloudflare rule updates—is a months-long engineering project with ongoing maintenance overhead. The AlterLab anti-bot bypass API handles all of it transparently, including headless browser execution on demand.
Quick Start with AlterLab API
Install the SDK and make your first Zillow request in under two minutes. Full environment setup is in the getting started guide.
```shell
pip install alterlab beautifulsoup4
```

```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.zillow.com/homes/for_sale/Seattle-WA/",
    render_js=True,
    country="us"
)

print(response.status_code)  # 200
print(len(response.text))    # ~800KB rendered HTML
```

The render_js=True parameter routes the request through a headless browser that executes JavaScript and waits for the React application to hydrate before returning HTML. This is required for every Zillow page—search results and detail pages alike. country="us" ensures a US residential proxy is used; Zillow geo-blocks non-US IPs at the application layer.
For cURL:
```shell
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.zillow.com/homes/for_sale/Seattle-WA/",
    "render_js": true,
    "country": "us"
  }'
```

The response body is the fully rendered HTML. Status 200 with populated __NEXT_DATA__ means you have usable listing data. Status 403 or an empty listResults array usually indicates a session issue or incorrect country routing.
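Both conditions can be checked programmatically before you hand the HTML to a parser. The helper below is a sketch: the function name and regex are ours, and the JSON path mirrors the search-page path used later in this guide.

```python
import json
import re

def response_looks_usable(status_code: int, html: str) -> bool:
    """Heuristic check that a scraped Zillow search page contains usable data."""
    if status_code != 200:
        return False
    # Pull the embedded JSON out of the __NEXT_DATA__ script tag
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if not match:
        return False  # JS never rendered, or a challenge page came back
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return False
    # Same search-results path as the extraction code below
    results = (
        data.get("props", {})
        .get("pageProps", {})
        .get("searchPageState", {})
        .get("cat1", {})
        .get("searchResults", {})
        .get("listResults", [])
    )
    return bool(results)
```

Gate your parsing on this check so a soft block (valid HTML, empty listResults) gets retried instead of silently producing zero rows.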
Extracting Structured Data
Zillow embeds all listing and property data in a <script id="__NEXT_DATA__"> tag. Parsing this JSON is more reliable than targeting CSS selectors, which change with every React component update.
Search Results Pages
```python
import alterlab
import json
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

def get_zillow_listings(search_url: str) -> list[dict]:
    response = client.scrape(search_url, render_js=True, country="us")
    soup = BeautifulSoup(response.text, "html.parser")

    next_data_tag = soup.find("script", {"id": "__NEXT_DATA__"})
    if not next_data_tag:
        raise ValueError("__NEXT_DATA__ not found — JS may not have rendered")

    next_data = json.loads(next_data_tag.string)

    # Path for search result pages as of March 2026
    search_results = (
        next_data
        .get("props", {})
        .get("pageProps", {})
        .get("searchPageState", {})
        .get("cat1", {})
        .get("searchResults", {})
        .get("listResults", [])
    )

    listings = []
    for result in search_results:
        listings.append({
            "zpid": result.get("zpid"),
            "address": result.get("address"),
            "price": result.get("price"),
            "beds": result.get("beds"),
            "baths": result.get("baths"),
            "area_sqft": result.get("area"),
            "status": result.get("statusType"),
            "days_on_zillow": result.get("daysOnZillow"),
            "detail_url": result.get("detailUrl"),
            "latitude": result.get("latLong", {}).get("latitude"),
            "longitude": result.get("latLong", {}).get("longitude"),
        })
    return listings

listings = get_zillow_listings("https://www.zillow.com/homes/for_sale/Seattle-WA/")
print(f"Found {len(listings)} listings")
print(json.dumps(listings[0], indent=2))
```

Property Detail Pages
The detail page JSON uses a different path via gdpClientCache:
```python
import alterlab
import json
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

def get_property_detail(detail_url: str) -> dict:
    response = client.scrape(detail_url, render_js=True, country="us")
    soup = BeautifulSoup(response.text, "html.parser")
    next_data = json.loads(
        soup.find("script", {"id": "__NEXT_DATA__"}).string
    )

    # gdpClientCache is keyed by a composite ID; grab the first value
    gdp_cache = (
        next_data
        .get("props", {})
        .get("pageProps", {})
        .get("componentProps", {})
        .get("gdpClientCache", {})
    )
    property_data = next(iter(gdp_cache.values()), {}).get("property", {})

    return {
        "zpid": property_data.get("zpid"),
        "address": property_data.get("streetAddress"),
        "city": property_data.get("city"),
        "state": property_data.get("state"),
        "zip": property_data.get("zipcode"),
        "price": property_data.get("price"),
        "home_type": property_data.get("homeType"),
        "year_built": property_data.get("yearBuilt"),
        "lot_size": property_data.get("lotSize"),
        "zestimate": property_data.get("zestimate"),
        "tax_history": property_data.get("taxHistory", []),
        "price_history": property_data.get("priceHistory", []),
        "description": property_data.get("description"),
    }

detail = get_property_detail(
    "https://www.zillow.com/homedetails/123-Main-St-Seattle-WA-98101/12345678_zpid/"
)
print(json.dumps(detail, indent=2))
```

Common Pitfalls
__NEXT_DATA__ path changes
Zillow ships frontend updates frequently. The JSON path from props.pageProps down to listResults or gdpClientCache can change without notice. The paths in this guide are accurate as of March 2026, but you should build defensive traversal rather than chaining raw .get() calls:
```python
from typing import Any

def safe_get(data: dict, *keys: str, default: Any = None) -> Any:
    """Traverse a nested dict without raising KeyError."""
    for key in keys:
        if not isinstance(data, dict):
            return default
        data = data.get(key)
        if data is None:
            return default
    return data

# Resilient path access
listings = safe_get(
    next_data,
    "props", "pageProps", "searchPageState",
    "cat1", "searchResults", "listResults",
    default=[]
)

if not listings:
    # Log the full structure to diagnose path changes
    import logging
    logging.warning("Empty listResults — dumping keys: %s", list(next_data.keys()))
```

Logging the top-level keys when results are empty is the fastest way to identify a path change after a Zillow frontend deployment.
Pagination and cursor encoding
Zillow returns 20 listings per search page and uses searchQueryState URL parameters for pagination. Manually constructing page 2+ URLs requires modifying the pagination key in that parameter:
```python
import json
import urllib.parse

def build_page_url(base_url: str, page: int) -> str:
    parsed = urllib.parse.urlparse(base_url)
    params = urllib.parse.parse_qs(parsed.query)
    state = json.loads(params.get("searchQueryState", ["{}"])[0])

    state["pagination"] = {"currentPage": page}

    new_query = urllib.parse.urlencode(
        {"searchQueryState": json.dumps(state, separators=(",", ":"))},
        quote_via=urllib.parse.quote
    )
    return urllib.parse.urlunparse(parsed._replace(query=new_query))

page_3_url = build_page_url(
    "https://www.zillow.com/homes/for_sale/Seattle-WA/?searchQueryState=%7B%22pagination%22%3A%7B%7D%7D",
    page=3
)
```

Rate limiting and empty result sets
Zillow doesn't always return an obvious 429 when rate-limiting. Instead, listResults silently returns an empty array. If you're getting valid HTML with __NEXT_DATA__ present but listResults: [], slow your request rate—1 to 3 seconds between search page requests is a safe baseline. Per-request proxy rotation (the default) handles IP-level limits; the inter-request delay handles session-level behavioral analysis.
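A defensive way to apply that baseline is to wrap the search-page fetch with jittered delays and retry-on-empty. This is a sketch under our own naming (scrape_with_backoff and its parameters are not part of any SDK); fetch stands in for any callable that returns parsed listings, such as the get_zillow_listings function above.

```python
import random
import time

def scrape_with_backoff(fetch, url: str, max_attempts: int = 3,
                        base_delay: float = 2.0) -> list:
    """Retry a search-page fetch when the result set comes back empty.

    An empty-but-valid page is treated as a soft block: wait a jittered,
    growing delay before trying again rather than hammering the endpoint.
    """
    for attempt in range(1, max_attempts + 1):
        listings = fetch(url)
        if listings:
            return listings
        if attempt < max_attempts:
            # Grow the delay each attempt; jitter avoids perfectly
            # uniform intervals, which behavioral analysis flags
            time.sleep(attempt * base_delay + random.uniform(0, base_delay / 2))
    return []
```

With the default base_delay of 2.0, the first retry waits roughly 2–3 seconds, squarely inside the 1-to-3-second baseline above.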
Scaling Up
Async batch processing
For large pipelines, use bounded async concurrency rather than sequential requests:
```python
import alterlab
import asyncio
import json
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

async def scrape_detail(url: str) -> dict | None:
    try:
        response = await client.scrape_async(url, render_js=True, country="us")
        soup = BeautifulSoup(response.text, "html.parser")
        tag = soup.find("script", {"id": "__NEXT_DATA__"})
        return json.loads(tag.string) if tag else None
    except Exception as exc:
        print(f"Failed {url}: {exc}")
        return None

async def scrape_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str):
        async with sem:
            return await scrape_detail(url)

    results = await asyncio.gather(*[bounded(u) for u in urls])
    return [r for r in results if r is not None]

# Run
detail_urls = [
    "https://www.zillow.com/homedetails/...",
    # ... up to thousands of URLs
]
results = asyncio.run(scrape_batch(detail_urls, concurrency=5))
print(f"Successfully scraped {len(results)}/{len(detail_urls)}")
```

Keep concurrency at 3–5 for Zillow. Higher values don't improve throughput meaningfully and increase the probability of triggering behavioral rate limits even with proxy rotation.
Incremental updates for price monitoring
Re-scraping every listing on every run is expensive and unnecessary. Use daysOnZillow and priceHistory to build an incremental update strategy:
```python
from datetime import datetime, timezone

def needs_rescrape(zpid: str, last_scraped_at: datetime, status: str) -> bool:
    age_hours = (datetime.now(timezone.utc) - last_scraped_at).total_seconds() / 3600
    # Active listings: check daily. Off-market: check weekly.
    threshold = 24 if status in ("FOR_SALE", "FOR_RENT") else 168
    return age_hours >= threshold

def extract_price_change(stored_history: list, fresh_history: list) -> dict | None:
    if not fresh_history or not stored_history:
        return None
    latest = fresh_history[0]
    previous = stored_history[0]
    if latest.get("price") != previous.get("price"):
        return {
            "from": previous.get("price"),
            "to": latest.get("price"),
            "date": latest.get("date"),
            "event": latest.get("event"),
        }
    return None
```

Store the raw __NEXT_DATA__ JSON blob alongside your normalized records. When Zillow's JSON schema changes, you can re-parse historical raw payloads without re-hitting the site.
Cost planning
Zillow requires headless browser requests for every page type, which is priced higher than standard fetches. A typical real estate monitoring pipeline looks like:
- Discovery pass: ~500 search result pages (20 listings each = 10,000 listings) per metro area
- Detail enrichment: 10,000 detail page requests for full property data
- Daily delta: ~200–400 requests for price change detection on active inventory
The search-then-detail pattern—collect ZPIDs from search pages, then scrape only the detail pages that match your filter criteria—is the most cost-efficient approach. See AlterLab's pricing page for current per-request rates and volume discount tiers.
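That filter step can be sketched as follows, assuming the listing dicts built by the search-page parser earlier in this guide (fields price, beds, detail_url; the helper name is ours). Search pages report price as a display string like "$750,000", so it is normalized before comparison.

```python
def select_detail_targets(listings: list[dict], max_price: int,
                          min_beds: int) -> list[str]:
    """Pick which detail pages are worth a (more expensive) browser request."""
    targets = []
    for listing in listings:
        price = listing.get("price")
        if isinstance(price, str):
            # Strip "$" and thousands separators from the display string
            digits = "".join(ch for ch in price if ch.isdigit())
            price = int(digits) if digits else None
        if price is None or price > max_price:
            continue  # unpriced or over budget: skip the detail request
        if (listing.get("beds") or 0) < min_beds:
            continue
        targets.append(listing["detail_url"])
    return targets
```

Feeding only these URLs into the async batch scraper above keeps the expensive headless-browser spend proportional to the inventory you actually care about.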
Try scraping a Zillow search results page with AlterLab — see the raw __NEXT_DATA__ JSON in seconds
Key Takeaways
- requests and basic headless Chromium both fail. Zillow's Cloudflare layer blocks non-browser TLS fingerprints before serving any content. You need proper fingerprint spoofing, residential proxies, and JS execution—not just a user-agent header.
- Parse __NEXT_DATA__, not the DOM. The embedded JSON is structured, complete, and far more stable than CSS class selectors on a rapidly-deployed React frontend. Use safe_get wrappers and log raw payloads on empty results.
- Always pass country="us". Non-US IPs get geo-blocked at the application layer, returning a redirect or an empty state rather than listing data.
- Keep async concurrency at 3–5. Higher concurrency doesn't meaningfully improve throughput and risks triggering behavioral rate limits even with per-request proxy rotation.
- Store raw JSON alongside normalized records. Schema paths in __NEXT_DATA__ change with Zillow deployments. Raw payload storage lets you re-parse without re-scraping.
Related Guides
Scraping other real estate platforms or e-commerce sites? These guides cover the same techniques for adjacent targets:
- Selenium Bot Detection: Why You Get Caught and How to Avoid It
- Why Your Headless Browser Gets Detected (and How to Fix It)
- Web Scraping APIs vs DIY Scrapers: When to Stop Building Infrastructure
- Scraping E-Commerce Sites at Scale Without Getting Blocked
- Web Scraping with Node.js and Puppeteer: The Complete 2026 Guide