How to Scrape Redfin: Complete Guide for 2026
Tutorials


Learn to scrape Redfin property listings with Python in 2026. Covers anti-bot bypass, CSS selectors, JSON-LD parsing, and building scalable pipelines.

Yash Dubey

March 29, 2026

9 min read

Redfin exposes one of the most complete real estate datasets on the public web: active listings, price history, days on market, agent data, neighborhood stats, and walk scores — all attached to individual property records and updated multiple times per day. Getting that data out programmatically requires dealing with anti-bot protections that are meaningfully stricter than most content sites. This guide covers every layer: what protections exist, how to bypass them reliably, which selectors to target, and how to build a pipeline that scales.

Why Scrape Redfin?

Redfin's value isn't the listings themselves — it's the density of structured, frequently refreshed data attached to each one. Three workflows where this matters in practice:

Real-time price monitoring. Track asking price changes, reductions, and relisting events across specific zip codes or MLS regions. Redfin surfaces price history directly on the listing page with timestamps, giving you a dataset that most public MLS feeds don't expose at this granularity.

Investment lead generation. Investors screening for high days-on-market properties, recently price-reduced listings, or specific lot-size/price-per-sqft ratios can build automated pipelines that surface candidates before a human broker manually compiles a comparable list.

Housing market research and ML. Academics, data journalists, and engineers building price prediction models need labeled historical data with features like square footage, school district scores, walk score, and HOA status. Redfin exposes many of these as structured HTML or embedded JSON, making it one of the cleaner sources for feature engineering.

Anti-Bot Challenges on Redfin.com

Redfin runs several layers of protection that get naive scrapers blocked within minutes:

TLS fingerprinting. Redfin's CDN checks the TLS handshake profile of your HTTP client. Python's requests library produces a fingerprint that's trivially identified and blocked at the network edge — even with correct headers set, the handshake mismatch returns 403s before your request reaches the application server.

IP reputation scoring. Datacenter IP ranges from AWS, GCP, and DigitalOcean are blocked outright. Requests from these ranges return either a CAPTCHA challenge or a silent redirect to a bot detection page. Residential proxies with clean reputation histories are a hard requirement.

Behavioral analytics. Redfin tracks mouse movement, scroll velocity, and interaction timing for browser-based sessions. Headless Chromium without stealth patches triggers detection within a few page loads — well before you've collected anything useful.

Per-region rate limiting. Search result pages for high-demand markets (SF Bay Area, NYC, LA) appear to have tighter per-IP thresholds than lower-traffic markets. Burst patterns on these market endpoints trip limits faster than a naive rotating proxy setup can handle.

Building around all of this from scratch means maintaining residential proxy pools, patching TLS clients with curl_cffi or tls-client, implementing browser fingerprint spoofing, and writing CAPTCHA fallback logic — before you've written a single line of parsing code. AlterLab's anti-bot bypass API handles this infrastructure layer, so your code only needs to deal with the HTML.

Quick Start with AlterLab API

Install the SDK and parsing dependencies:

Bash
pip install alterlab beautifulsoup4 lxml

The minimal working example — fetch a Redfin search results page and confirm you got real listing HTML back:

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.redfin.com/city/30749/CA/San-Francisco/filter/max-price=1.5M",
    render_js=True,        # Required — search results load via React
    country_code="us",     # Use US residential proxies
)

soup = BeautifulSoup(response.text, "lxml")
cards = soup.select(".HomeCardContainer")
print(f"Found {len(cards)} listing cards")

The same request as a cURL call, useful for smoke-testing from a pipeline:

Bash
curl -s -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.redfin.com/city/30749/CA/San-Francisco/filter/max-price=1.5M",
    "render_js": true,
    "country_code": "us"
  }' | jq '.html | length'

If you're setting up AlterLab for the first time, the Getting started guide covers API key setup, SDK installation, and your first request in under five minutes.

Redfin request success rate: 99.1%
Avg JS render time: 1.8s
Max results per polygon query: 350
Credit savings (static vs JS render): ~45%

Extracting Structured Data

Redfin renders listing cards as React components. After JavaScript execution, the DOM exposes consistent class names and data-rf-test-id attributes you can target reliably.

Search Result Listing Cards

Python
import alterlab
from bs4 import BeautifulSoup
import json

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.redfin.com/zipcode/94105",
    render_js=True,
    country_code="us",
)

soup = BeautifulSoup(response.text, "lxml")
listings = []

def get_text(card, selector):
    """Return stripped text for the first match, or None if absent."""
    el = card.select_one(selector)
    return el.get_text(strip=True) if el else None

for card in soup.select(".HomeCardContainer"):
    link = card.select_one("a[href]")
    listings.append({
        "address": get_text(card, ".homeAddressV2"),
        "price":   get_text(card, ".homePriceV2"),
        "beds":    get_text(card, "[data-rf-test-id='abp-beds']"),
        "baths":   get_text(card, "[data-rf-test-id='abp-baths']"),
        "sqft":    get_text(card, "[data-rf-test-id='abp-sqft']"),
        "dom":     get_text(card, ".daysOnRedfin"),
        "url":     "https://www.redfin.com" + link["href"] if link else None,
    })

print(json.dumps(listings[:3], indent=2))
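The scraped values are display strings ("$1,495,000", "2,100 sq ft"), while most downstream steps want integers. A minimal normalizer — the helper name and the choice to treat digit-free strings (and the None a missing selector produces) as missing are our assumptions:

Python
```python
import re

def to_int(value):
    """'$1,495,000' -> 1495000; None or digit-free strings -> None."""
    if not value:
        return None
    digits = re.sub(r"\D", "", value)
    return int(digits) if digits else None
```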

JSON-LD from Property Detail Pages

Individual listing pages embed structured data in a <script type="application/ld+json"> block. This follows the schema.org RealEstateListing type and is faster to parse than walking the rendered DOM — and more stable across Redfin frontend deploys:

Python
import alterlab
from bs4 import BeautifulSoup
import json

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.redfin.com/CA/San-Francisco/123-Main-St-94105/home/12345678",
    render_js=True,
    country_code="us",
    wait_for_selector=".price-table",  # Wait until price history loads
)

soup = BeautifulSoup(response.text, "lxml")

# Extract JSON-LD structured data block
ld_scripts = soup.find_all("script", {"type": "application/ld+json"})
for script in ld_scripts:
    try:
        data = json.loads(script.string)
        if isinstance(data, list):
            data = next((d for d in data if d.get("@type") == "RealEstateListing"), None)
        if data and data.get("@type") == "RealEstateListing":
            print("Address:", data.get("address", {}).get("streetAddress"))
            print("Price:  ", data.get("offers", {}).get("price"))
            print("Beds:   ", data.get("numberOfRooms"))
            break
    except (json.JSONDecodeError, AttributeError):
        continue

# Supplemental facts from the property detail table
facts = {}
for row in soup.select(".facts-table .table-row"):
    label = row.select_one(".table-label")
    value = row.select_one(".table-value")
    if label and value:
        facts[label.get_text(strip=True)] = value.get_text(strip=True)

print(json.dumps(facts, indent=2))

CSS selector reference for common data points:

Data Point              Selector
Listing price           .homePriceV2
Address                 .homeAddressV2
Beds                    [data-rf-test-id="abp-beds"]
Baths                   [data-rf-test-id="abp-baths"]
Square footage          [data-rf-test-id="abp-sqft"]
Days on market          .daysOnRedfin
Price history rows      .price-table .price-table-row
Property facts table    .facts-table .table-row
Walk / Transit score    .walkscore-stats
Listing agent name      .agent-basic-details .agent-name

Prefer data-rf-test-id attributes over class names wherever they exist — test IDs are significantly more stable across Redfin's frontend deployments than utility class names, which have already incremented at least once in the past 12 months.

Common Pitfalls

Pagination varies by market URL structure. Redfin's search pages use ?page=N for city and neighborhood URLs, but map-polygon search endpoints use &start=N. On some market URLs neither parameter exists and the next page loads via an infinite scroll XHR trigger. Always verify that incrementing the page parameter actually returns a different listing set before building a loop around it.
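One way to build that verification into the loop itself is to stop as soon as a page contributes no new listing URLs. A minimal sketch, with the page-fetching function injected so it works for either pagination scheme (the function names are ours):

Python
```python
def collect_pages(fetch_listing_urls, max_pages=20):
    """fetch_listing_urls(page) -> list of listing URLs on that page.

    Stops when a page adds nothing new, which guards against pagination
    parameters that are silently ignored on some market URLs.
    """
    seen = set()
    for page in range(1, max_pages + 1):
        new = [u for u in fetch_listing_urls(page) if u not in seen]
        if not new:
            break
        seen.update(new)
    return seen
```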

Class names drift with frontend releases. .homePriceV2 superseded .homePrice sometime in 2024. The V2 suffix has appeared on several selectors. Build your parsers with a fallback chain — try the current selector first, then the previous generation:

Python
def get_price(card):
    # Try current generation first, fall back to previous
    for selector in [
        ".homePriceV2",
        ".homePrice",
        "[data-rf-test-id='listing-price']",
    ]:
        el = card.select_one(selector)
        if el:
            return el.get_text(strip=True)
    return None

Render timing on detail pages. Even with render_js=True, sections like price history charts and neighborhood score widgets load asynchronously after initial paint. If your parser finds empty containers for data you can see in a browser, add wait_for_selector targeting the last element to appear on the page — typically .price-table for the price history block.

Map polygon result caps. If you're scraping using Redfin's map bounding box URLs (useful for irregular geographic boundaries), the API caps results at 350 homes per query regardless of how many listings exist in the area. For dense urban markets, subdivide your bounding polygon into smaller quadrants and merge the results, deduplicating on the MLS ID extracted from each listing's URL path.
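The subdivision step can be sketched as a recursive split of the bounding box, recursing whenever a quadrant still hits the 350-result cap. The coordinate ordering and the `count_results` callable are assumptions to adapt to your own URL builder:

Python
```python
def split_until_under_cap(count_results, box, cap=350, depth=0, max_depth=6):
    """box: (lat_min, lat_max, lng_min, lng_max).
    count_results(box) -> number of listings reported for that box.
    Returns a list of boxes that each fall under the result cap.
    """
    if count_results(box) < cap or depth >= max_depth:
        return [box]
    lat_min, lat_max, lng_min, lng_max = box
    lat_mid = (lat_min + lat_max) / 2
    lng_mid = (lng_min + lng_max) / 2
    quadrants = [
        (lat_min, lat_mid, lng_min, lng_mid),
        (lat_min, lat_mid, lng_mid, lng_max),
        (lat_mid, lat_max, lng_min, lng_mid),
        (lat_mid, lat_max, lng_mid, lng_max),
    ]
    return [
        sub for q in quadrants
        for sub in split_until_under_cap(count_results, q, cap, depth + 1, max_depth)
    ]
```

Scrape each returned box, then deduplicate the merged results on MLS ID as described above.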

Address encoding edge cases. Redfin address fields occasionally include Unicode characters — directional markers, special apartment symbols, or non-breaking spaces — that cause issues when writing to CSV or comparing records across runs. Normalize with .encode("ascii", "ignore").decode() for ASCII-only pipelines, or store as TEXT in PostgreSQL and handle the full range there.
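A slightly gentler variant of that ASCII normalization first NFKD-decomposes the string, so compatibility characters like non-breaking spaces map to their ASCII equivalents instead of disappearing. A sketch (the helper name is ours):

Python
```python
import unicodedata

def normalize_address(raw: str) -> str:
    """ASCII-fold a Redfin address: NFKD-decompose so compatibility
    characters (e.g. non-breaking spaces) become ASCII, drop whatever
    remains non-ASCII, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", raw)
    ascii_only = decomposed.encode("ascii", "ignore").decode()
    return " ".join(ascii_only.split())
```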

Scaling Up

Once your single-request parser is stable, a production pipeline needs concurrency, scheduling, and deduplication.

Async Batch Scraping

Python
import asyncio
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

ZIP_CODES = [
    "94105", "94107", "94109", "94110", "94112",
    "94114", "94115", "94116", "94117", "94118",
]

async def scrape_zip(zip_code: str) -> list[dict]:
    url = f"https://www.redfin.com/zipcode/{zip_code}"
    response = await client.async_scrape(url, render_js=True, country_code="us")
    soup = BeautifulSoup(response.text, "lxml")
    return [
        {
            "zip":     zip_code,
            "address": c.select_one(".homeAddressV2") and
                       c.select_one(".homeAddressV2").get_text(strip=True),
            "price":   c.select_one(".homePriceV2") and
                       c.select_one(".homePriceV2").get_text(strip=True),
        }
        for c in soup.select(".HomeCardContainer")
    ]

async def main():
    tasks = [scrape_zip(z) for z in ZIP_CODES]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    listings = [
        item
        for batch in results
        if isinstance(batch, list)
        for item in batch
    ]
    print(f"Collected {len(listings)} listings across {len(ZIP_CODES)} zip codes")

asyncio.run(main())
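The asyncio.gather call above launches every request at once. On larger zip-code lists you will usually want to bound concurrency so burst patterns don't trip per-region rate limits — a hedged sketch, where the default limit is our assumption rather than a documented threshold:

Python
```python
import asyncio

async def gather_bounded(coro_factories, limit=5):
    """Run coroutines with at most `limit` in flight at once.

    coro_factories: zero-argument callables that each return a coroutine,
    so coroutines are only created when a slot frees up.
    """
    sem = asyncio.Semaphore(limit)

    async def run(make):
        async with sem:
            return await make()

    return await asyncio.gather(
        *(run(make) for make in coro_factories),
        return_exceptions=True,
    )
```

In main() above you would pass [lambda z=z: scrape_zip(z) for z in ZIP_CODES] instead of building the task list eagerly.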

Storage Schema with Deduplication

The MLS ID is embedded directly in the Redfin URL path: redfin.com/CA/San-Francisco/123-Main-St-94105/home/12345678 — extract the final numeric segment. Use it as the natural deduplication key:

SQL
CREATE TABLE redfin_listings (
    id            BIGSERIAL    PRIMARY KEY,
    mls_id        TEXT         NOT NULL,
    address       TEXT,
    zip_code      CHAR(5),
    price_usd     INTEGER,
    beds          NUMERIC(3,1),
    baths         NUMERIC(3,1),
    sqft          INTEGER,
    days_on_mkt   INTEGER,
    scraped_at    TIMESTAMPTZ  NOT NULL DEFAULT now()
);

-- One record per listing per day. PostgreSQL doesn't allow expressions in a
-- table-level UNIQUE constraint, so enforce this with a unique expression
-- index; the AT TIME ZONE cast keeps the expression IMMUTABLE, as indexes require.
CREATE UNIQUE INDEX uq_redfin_listing_day
    ON redfin_listings (mls_id, ((scraped_at AT TIME ZONE 'UTC')::DATE));

CREATE INDEX idx_redfin_zip_scraped ON redfin_listings (zip_code, scraped_at DESC);
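Extracting that final numeric segment from the listing URL is a one-line regex. A minimal sketch (the helper name is ours, not part of any SDK):

Python
```python
import re

def extract_mls_id(url: str):
    """Pull the trailing numeric ID from a Redfin listing URL,
    e.g. .../123-Main-St-94105/home/12345678 -> '12345678'."""
    match = re.search(r"/home/(\d+)", url)
    return match.group(1) if match else None
```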

Run daily scrapes with cron or a workflow tool like Prefect, and INSERT ... ON CONFLICT DO UPDATE to update fields like price and days-on-market while preserving the initial scraped-at timestamp for price change calculations.
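With one row per listing per day, price-change events fall out of a single pass over consecutive snapshots. A sketch of that calculation, operating on already-fetched rows (the tuple shape is our assumption, not a fixed schema):

Python
```python
def price_changes(snapshots):
    """snapshots: list of (scraped_date, price) tuples for one mls_id,
    ordered oldest first. Returns one event per day the price moved."""
    events = []
    for (_, prev), (day, curr) in zip(snapshots, snapshots[1:]):
        if curr != prev:
            events.append({"date": day, "old": prev, "new": curr, "delta": curr - prev})
    return events
```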

Cost Planning at Scale

JavaScript-rendered requests consume more credits than static HTML fetches because of the additional proxy bandwidth and browser compute required. For search result pages, test whether the target URL returns useful HTML without JS rendering enabled — many Redfin city and zip-code search URLs do render a server-side content layer. Only enable render_js=True for pages where the extra data justifies the cost. This pattern typically reduces credit consumption by 40–60% on search-heavy pipelines. See AlterLab pricing for current per-request credit rates across render types.
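That test can be automated: attempt the static fetch first and only re-request with rendering when the listing-card marker is absent. A hedged sketch — the `fetch` callable stands in for whichever client call you use, and the marker string is an assumption about the current markup:

Python
```python
def fetch_cheapest(fetch, url, marker="HomeCardContainer"):
    """fetch(url, render_js) -> HTML string.

    Returns (html, used_js). Tries the cheaper static fetch first and
    falls back to JS rendering only when the marker class is missing.
    """
    html = fetch(url, render_js=False)
    if marker in html:  # crude substring check; parsing the HTML is safer
        return html, False
    return fetch(url, render_js=True), True
```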

Key Takeaways

  • Redfin blocks datacenter IP ranges and standard Python TLS fingerprints at the network edge. Residential proxies and fingerprint spoofing are not optional.
  • render_js=True is required for search result pages and listing detail pages. Add wait_for_selector targeting late-loading sections like .price-table to avoid empty parser results.
  • Prefer data-rf-test-id attributes over class names — they survive Redfin frontend deploys more reliably than utility classes.
  • JSON-LD <script type="application/ld+json"> blocks on detail pages give you clean schema.org fields without DOM traversal for the most commonly needed listing attributes.
  • Map polygon queries cap at 350 results. Subdivide dense-market bounding boxes and deduplicate on MLS ID.
  • Enable async batch scraping from the start — sequential requests don't scale past a few hundred zip codes per hour.
  • Use static mode for search pages where possible to cut JS-render credit costs by ~40–60%.



Frequently Asked Questions

Is it legal to scrape Redfin?

Scraping publicly visible Redfin listing data is generally supported by U.S. case law (see hiQ Labs v. LinkedIn), but Redfin's Terms of Service prohibit automated access. In practice, limit request rates, avoid scraping behind authentication, and don't aggregate personal agent contact data for unsolicited outreach — consult legal counsel for any commercial use case.

How do you get around Redfin's anti-bot protections?

Redfin uses TLS fingerprinting, IP reputation scoring, and behavioral analytics that block standard Python HTTP clients and datacenter IP ranges within minutes. The most reliable path is routing requests through AlterLab's anti-bot bypass API, which handles residential proxy rotation, browser fingerprint spoofing, and challenge solving without requiring any custom middleware on your end.

How much does scraping Redfin cost?

Cost depends on request volume and render type — JavaScript-rendered requests consume more credits than static HTML fetches. AlterLab offers a free tier for prototyping and pay-as-you-go pricing for production pipelines. Using static mode for search result pages where possible can cut credit consumption by 40–60%. See the pricing page for current rates by request type.