AlterLab

How to Scrape Zillow: Complete Guide for 2026

Learn how to scrape Zillow property listings with Python in 2026. Beat Cloudflare protection, handle JS rendering, and extract real estate data at scale.

Yash Dubey

March 28, 2026

9 min read

Zillow blocks most scrapers within seconds. It runs Cloudflare's Enterprise Bot Management, renders all listing data client-side via React, and fingerprints TLS connections to identify non-browser clients. Standard tooling—requests, basic Selenium, unpatched Playwright—fails before the first listing loads.

This guide covers everything you need to extract property listings, prices, and details from Zillow reliably in 2026: what protections you're dealing with, how to bypass them, where the data actually lives in the page, and how to scale to thousands of requests without hitting rate limits.


Why Scrape Zillow?

Three high-value use cases drive most Zillow scraping pipelines:

Real estate price monitoring. Track listing prices, days on market, and price reductions across specific ZIP codes or neighborhoods. Feed this into dashboards or alerting systems that fire when a property hits a target price point or reduces by more than a threshold percentage.

Lead generation for agents and investors. Pull new listings as they appear, including seller context, listing agent details, and price history. Build automated CRM workflows or outreach pipelines that act on fresh inventory before it gets competitive.

Market research and academic analysis. Zillow covers over 100 million US properties with historical price data, Zestimate valuations, and tax records. This dataset underpins housing market studies, investment underwriting models, and economic research that would otherwise require expensive licensed data feeds.
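To make the price-monitoring use case concrete, the alerting decision reduces to a small pure function over two observed prices. The function names and the 5% default threshold below are our own illustration, not part of any Zillow schema or SDK:

```python
def price_drop_pct(old_price: int, new_price: int) -> float:
    """Percentage drop from old_price to new_price (0.0 if no drop)."""
    if old_price <= 0 or new_price >= old_price:
        return 0.0
    return (old_price - new_price) / old_price * 100

def should_alert(old_price: int, new_price: int, threshold_pct: float = 5.0) -> bool:
    """Fire an alert when a listing drops by at least threshold_pct."""
    return price_drop_pct(old_price, new_price) >= threshold_pct

# A $500k listing cut to $470k is a 6% reduction, past the 5% threshold
should_alert(500_000, 470_000)
```

Feeding stored prices and freshly scraped prices through a check like this is all a basic alerting pipeline needs; dashboards can plot the same `price_drop_pct` series over time.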

  • 100M+ homes on Zillow
  • 99.2% success rate on Zillow
  • 1.2s average response time
  • 50+ proxy countries available

Anti-Bot Challenges on zillow.com

Understanding the protection stack is necessary before writing a single line of scraping code.

Cloudflare Enterprise Bot Management. Every request passes through Cloudflare's bot score evaluation. Suspicious clients—those with mismatched TLS fingerprints, missing browser APIs, or mechanical request timing—receive JavaScript challenges or managed CAPTCHAs. This happens before any Zillow application code runs.

TLS and HTTP/2 fingerprinting. Cloudflare inspects the TLS handshake: cipher suite ordering, extension presence and order, ALPN negotiation values. Python's requests library (backed by urllib3) produces a fingerprint that differs measurably from Chrome or Firefox. Cloudflare maintains fingerprint databases and blocks known non-browser patterns.

JavaScript-rendered content. Zillow's search and detail pages are Next.js applications. The raw HTML from a basic HTTP fetch contains scaffolding and metadata but virtually no listing data. The actual property information is either embedded in a <script id="__NEXT_DATA__"> tag after JS execution or injected into the DOM during React hydration. You need a real browser context to get populated HTML.

Behavioral fingerprinting. Request velocity, scroll events, mouse movement patterns, and time-between-clicks are analyzed. Pipelines that hit pages too fast or with perfectly uniform intervals trigger soft blocks—you'll see 429 responses or silently empty result sets.

IP reputation. Datacenter IP ranges are blocked at the edge. Residential or ISP proxies, rotated per-request or per-session, are required for consistent access.

Building this stack yourself—custom TLS fingerprints, maintained residential proxy pools, behavioral simulation, and Cloudflare rule updates—is a months-long engineering project with ongoing maintenance overhead. The AlterLab anti-bot bypass API handles all of it transparently, including headless browser execution on demand.


Quick Start with AlterLab API

Install the SDK and make your first Zillow request in under two minutes. Full environment setup is in the getting started guide.

Bash
pip install alterlab beautifulsoup4
Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.zillow.com/homes/for_sale/Seattle-WA/",
    render_js=True,
    country="us"
)

print(response.status_code)   # 200
print(len(response.text))     # ~800KB rendered HTML

The render_js=True parameter routes the request through a headless browser that executes JavaScript and waits for the React application to hydrate before returning HTML. This is required for every Zillow page—search results and detail pages alike. country="us" ensures a US residential proxy is used; Zillow geo-blocks non-US IPs at the application layer.

For cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.zillow.com/homes/for_sale/Seattle-WA/",
    "render_js": true,
    "country": "us"
  }'

The response body is the fully rendered HTML. Status 200 with populated __NEXT_DATA__ means you have usable listing data. Status 403 or an empty listResults array usually indicates a session issue or incorrect country routing.
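A small validation helper can classify a response before you hand it to a full parser. This sketch uses a stdlib regex rather than BeautifulSoup so it has no dependencies; the function name and return shape are our own convention, and the JSON path assumes the search-page layout described later in this guide:

```python
import json
import re

# Capture the JSON body of the __NEXT_DATA__ script tag
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)

def validate_zillow_html(html: str) -> dict:
    """Classify a Zillow response: rendered with results, rendered but empty, or unrendered."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return {"ok": False, "reason": "no __NEXT_DATA__; JS likely not rendered"}
    data = json.loads(match.group(1))
    results = (
        data.get("props", {}).get("pageProps", {}).get("searchPageState", {})
        .get("cat1", {}).get("searchResults", {}).get("listResults", [])
    )
    if not results:
        return {"ok": False, "reason": "empty listResults; possible rate limit or geo-block"}
    return {"ok": True, "count": len(results)}
```

Running this on every response and logging the `reason` field makes the two failure modes (missing render vs. empty results) immediately distinguishable in pipeline logs.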


Extracting Structured Data

Zillow embeds all listing and property data in a <script id="__NEXT_DATA__"> tag. Parsing this JSON is more reliable than targeting CSS selectors, which change with every React component update.

Search Results Pages

Python
import alterlab
import json
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

def get_zillow_listings(search_url: str) -> list[dict]:
    response = client.scrape(search_url, render_js=True, country="us")
    soup = BeautifulSoup(response.text, "html.parser")

    next_data_tag = soup.find("script", {"id": "__NEXT_DATA__"})
    if not next_data_tag:
        raise ValueError("__NEXT_DATA__ not found — JS may not have rendered")

    next_data = json.loads(next_data_tag.string)

    # Path for search result pages as of March 2026
    search_results = (
        next_data
        .get("props", {})
        .get("pageProps", {})
        .get("searchPageState", {})
        .get("cat1", {})
        .get("searchResults", {})
        .get("listResults", [])
    )

    listings = []
    for result in search_results:
        listings.append({
            "zpid":          result.get("zpid"),
            "address":       result.get("address"),
            "price":         result.get("price"),
            "beds":          result.get("beds"),
            "baths":         result.get("baths"),
            "area_sqft":     result.get("area"),
            "status":        result.get("statusType"),
            "days_on_zillow": result.get("daysOnZillow"),
            "detail_url":    result.get("detailUrl"),
            "latitude":      result.get("latLong", {}).get("latitude"),
            "longitude":     result.get("latLong", {}).get("longitude"),
        })

    return listings

listings = get_zillow_listings("https://www.zillow.com/homes/for_sale/Seattle-WA/")
print(f"Found {len(listings)} listings")
print(json.dumps(listings[0], indent=2))

Property Detail Pages

The detail page JSON uses a different path via gdpClientCache:

Python
import alterlab
import json
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

def get_property_detail(detail_url: str) -> dict:
    response = client.scrape(detail_url, render_js=True, country="us")
    soup = BeautifulSoup(response.text, "html.parser")

    next_data = json.loads(
        soup.find("script", {"id": "__NEXT_DATA__"}).string
    )

    # gdpClientCache is keyed by a composite ID; grab the first value
    gdp_cache = (
        next_data
        .get("props", {})
        .get("pageProps", {})
        .get("componentProps", {})
        .get("gdpClientCache", {})
    )
    property_data = next(iter(gdp_cache.values()), {}).get("property", {})

    return {
        "zpid":          property_data.get("zpid"),
        "address":       property_data.get("streetAddress"),
        "city":          property_data.get("city"),
        "state":         property_data.get("state"),
        "zip":           property_data.get("zipcode"),
        "price":         property_data.get("price"),
        "home_type":     property_data.get("homeType"),
        "year_built":    property_data.get("yearBuilt"),
        "lot_size":      property_data.get("lotSize"),
        "zestimate":     property_data.get("zestimate"),
        "tax_history":   property_data.get("taxHistory", []),
        "price_history": property_data.get("priceHistory", []),
        "description":   property_data.get("description"),
    }

detail = get_property_detail(
    "https://www.zillow.com/homedetails/123-Main-St-Seattle-WA-98101/12345678_zpid/"
)
print(json.dumps(detail, indent=2))

Common Pitfalls

__NEXT_DATA__ path changes

Zillow ships frontend updates frequently. The JSON path from props.pageProps down to listResults or gdpClientCache can change without notice. The paths in this guide are accurate as of March 2026, but you should build defensive traversal rather than chaining raw .get() calls:

Python
from typing import Any

def safe_get(data: dict, *keys: str, default: Any = None) -> Any:
    """Traverse a nested dict without raising KeyError."""
    for key in keys:
        if not isinstance(data, dict):
            return default
        data = data.get(key, default)
        if data is None:
            return default
    return data

# Resilient path access
listings = safe_get(
    next_data,
    "props", "pageProps", "searchPageState",
    "cat1", "searchResults", "listResults",
    default=[]
)

if not listings:
    # Log the full structure to diagnose path changes
    import logging
    logging.warning("Empty listResults — dumping keys: %s", list(next_data.keys()))

Logging the top-level keys when results are empty is the fastest way to identify a path change after a Zillow frontend deployment.

Pagination and cursor encoding

Zillow returns 20 listings per search page and uses searchQueryState URL parameters for pagination. Manually constructing page 2+ URLs requires modifying the pagination key in that parameter:

Python
import json
import urllib.parse

def build_page_url(base_url: str, page: int) -> str:
    parsed = urllib.parse.urlparse(base_url)
    params = urllib.parse.parse_qs(parsed.query)

    state = json.loads(params.get("searchQueryState", ["{}"])[0])
    state["pagination"] = {"currentPage": page}

    new_query = urllib.parse.urlencode(
        {"searchQueryState": json.dumps(state, separators=(",", ":"))},
        quote_via=urllib.parse.quote
    )
    return urllib.parse.urlunparse(parsed._replace(query=new_query))

page_3_url = build_page_url(
    "https://www.zillow.com/homes/for_sale/Seattle-WA/?searchQueryState=%7B%22pagination%22%3A%7B%7D%7D",
    page=3
)

Rate limiting and empty result sets

Zillow doesn't always return an obvious 429 when rate-limiting. Instead, listResults silently returns an empty array. If you're getting valid HTML with __NEXT_DATA__ present but listResults: [], slow your request rate—1 to 3 seconds between search page requests is a safe baseline. Per-request proxy rotation (the default) handles IP-level limits; the inter-request delay handles session-level behavioral analysis.


Scaling Up

Async batch processing

For large pipelines, use bounded async concurrency rather than sequential requests:

Python
import alterlab
import asyncio
import json
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

async def scrape_detail(url: str) -> dict | None:
    try:
        response = await client.scrape_async(url, render_js=True, country="us")
        soup = BeautifulSoup(response.text, "html.parser")
        tag = soup.find("script", {"id": "__NEXT_DATA__"})
        return json.loads(tag.string) if tag else None
    except Exception as exc:
        print(f"Failed {url}: {exc}")
        return None

async def scrape_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str):
        async with sem:
            return await scrape_detail(url)

    results = await asyncio.gather(*[bounded(u) for u in urls])
    return [r for r in results if r is not None]

# Run
detail_urls = [
    "https://www.zillow.com/homedetails/...",
    # ... up to thousands of URLs
]
results = asyncio.run(scrape_batch(detail_urls, concurrency=5))
print(f"Successfully scraped {len(results)}/{len(detail_urls)}")

Keep concurrency at 3–5 for Zillow. Higher values don't improve throughput meaningfully and increase the probability of triggering behavioral rate limits even with proxy rotation.

Incremental updates for price monitoring

Re-scraping every listing on every run is expensive and unnecessary. Use daysOnZillow and priceHistory to build an incremental update strategy:

Python
from datetime import datetime, timezone

def needs_rescrape(last_scraped_at: datetime, status: str) -> bool:
    """Return True when a stored listing is stale enough to re-fetch."""
    age_hours = (datetime.now(timezone.utc) - last_scraped_at).total_seconds() / 3600
    # Active listings: check daily. Off-market: check weekly.
    threshold = 24 if status in ("FOR_SALE", "FOR_RENT") else 168
    return age_hours >= threshold

def extract_price_change(stored_history: list, fresh_history: list) -> dict | None:
    if not fresh_history or not stored_history:
        return None
    latest = fresh_history[0]
    previous = stored_history[0]
    if latest.get("price") != previous.get("price"):
        return {
            "from": previous.get("price"),
            "to":   latest.get("price"),
            "date": latest.get("date"),
            "event": latest.get("event"),
        }
    return None

Store the raw __NEXT_DATA__ JSON blob alongside your normalized records. When Zillow's JSON schema changes, you can re-parse historical raw payloads without re-hitting the site.
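One way to follow that advice is a SQLite table that keeps the raw blob next to the normalized columns. The schema below is a sketch of one possible layout, not a prescribed format:

```python
import json
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS listings (
            zpid        TEXT PRIMARY KEY,
            price       INTEGER,
            status      TEXT,
            scraped_at  TEXT,
            raw_payload TEXT NOT NULL  -- full __NEXT_DATA__ JSON, for re-parsing
        )
    """)

def upsert_listing(conn: sqlite3.Connection, zpid: str, price: int,
                   status: str, scraped_at: str, next_data: dict) -> None:
    """Insert or refresh one listing, always replacing the raw payload."""
    conn.execute(
        "INSERT INTO listings (zpid, price, status, scraped_at, raw_payload) "
        "VALUES (?, ?, ?, ?, ?) "
        "ON CONFLICT(zpid) DO UPDATE SET price=excluded.price, "
        "status=excluded.status, scraped_at=excluded.scraped_at, "
        "raw_payload=excluded.raw_payload",
        (zpid, price, status, scraped_at, json.dumps(next_data)),
    )
```

When a schema path breaks, a single `SELECT raw_payload FROM listings` pass lets you re-run the new parser over every historical record without touching Zillow.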

Cost planning

Zillow requires headless browser requests for every page type, which is priced higher than standard fetches. A typical real estate monitoring pipeline looks like:

  • Discovery pass: ~500 search result pages (20 listings each = 10,000 listings) per metro area
  • Detail enrichment: 10,000 detail page requests for full property data
  • Daily delta: ~200–400 requests for price change detection on active inventory

The search-then-detail pattern—collect ZPIDs from search pages, then scrape only the detail pages that match your filter criteria—is the most cost-efficient approach. See AlterLab's pricing page for current per-request rates and volume discount tiers.
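The breakdown above can be turned into a rough per-metro request estimate. The defaults below mirror the bullet figures; actual per-request pricing varies, so cost itself is left to the caller:

```python
import math

def monthly_requests(listings: int, per_page: int = 20,
                     daily_delta: int = 300, days: int = 30) -> dict:
    """Estimate headless-browser requests for one metro: discovery + detail + deltas."""
    search_pages = math.ceil(listings / per_page)
    return {
        "discovery": search_pages,     # one search-page crawl over the metro
        "detail": listings,            # one detail request per discovered listing
        "deltas": daily_delta * days,  # daily price-change checks on active inventory
        "total": search_pages + listings + daily_delta * days,
    }
```

For the 10,000-listing example this yields 500 discovery requests, 10,000 detail requests, and 9,000 delta checks per month; filtering ZPIDs before the detail pass is where most of the savings come from.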

Try it yourself

Try scraping a Zillow search results page with AlterLab — see the raw __NEXT_DATA__ JSON in seconds


Key Takeaways

  • requests and basic headless Chromium both fail. Zillow's Cloudflare layer blocks non-browser TLS fingerprints before serving any content. You need proper fingerprint spoofing, residential proxies, and JS execution—not just a user-agent header.
  • Parse __NEXT_DATA__, not the DOM. The embedded JSON is structured, complete, and far more stable than CSS class selectors on a rapidly-deployed React frontend. Use safe_get wrappers and log raw payloads on empty results.
  • Always pass country="us". Non-US IPs get geo-blocked at the application layer, returning a redirect or an empty state rather than listing data.
  • Keep async concurrency at 3–5. Higher concurrency doesn't meaningfully improve throughput and risks triggering behavioral rate limits even with per-request proxy rotation.
  • Store raw JSON alongside normalized records. Schema paths in __NEXT_DATA__ change with Zillow deployments. Raw payload storage lets you re-parse without re-scraping.



Frequently Asked Questions

Is it legal to scrape Zillow?

Scraping publicly accessible Zillow data is generally considered low-risk under the Ninth Circuit's rulings in hiQ v. LinkedIn, which held that scraping public web pages does not violate the CFAA (hiQ ultimately lost on breach-of-contract grounds, so the rulings are not a blanket green light). Zillow's Terms of Service prohibit automated access, so violations can result in account termination or legal notices even absent criminal liability. Avoid scraping behind authentication, don't republish raw data commercially without reviewing Zillow's data licensing terms, and consult legal counsel for production deployments.

Why do standard Python scrapers fail on Zillow?

Zillow uses Cloudflare's Enterprise Bot Management, which inspects TLS fingerprints, JavaScript execution context, and behavioral signals; standard Python requests or basic headless Chromium are blocked within seconds. AlterLab's anti-bot bypass API handles TLS fingerprint spoofing, residential proxy rotation, and full JS rendering transparently, achieving a 99.2% success rate on Zillow without any manual fingerprint maintenance on your end.

How much does scraping Zillow at scale cost?

Zillow requires headless browser rendering for all search and detail pages, which carries a higher per-request rate than standard fetches. A pipeline scraping 10,000 Zillow detail pages daily runs at 10,000 headless requests per day; the search-then-detail pattern reduces this significantly by batching 20 listings per search page request. See AlterLab's pricing page for current rates and volume tiers; batch pricing substantially lowers the per-request cost for high-volume pipelines.