
Scraping JavaScript-Heavy SPAs with Python: Dynamic Content, Infinite Scroll, and API Interception
March 16, 2026
Modern web applications rarely serve their data in the initial HTML response. React, Vue, and Angular SPAs render content client-side, fetch data from internal APIs, and load more content as users scroll. If you're trying to scrape JavaScript-heavy SPAs with Python using standard requests + BeautifulSoup pipelines, you'll fail immediately — by the time you parse the response, the meaningful content hasn't rendered yet.
This post covers three concrete techniques for extracting data from SPAs:
- Headless browser automation for rendered DOM extraction
- Network request interception to harvest raw API responses
- Programmatic infinite scroll handling
Why requests Fails Against SPAs
When you GET a typical SPA URL, the server returns a near-empty shell:
```html
<!DOCTYPE html>
<html>
  <head><title>My App</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.chunk.js"></script>
  </body>
</html>
```
All product listings, search results, and user data are loaded asynchronously after the browser executes those script bundles. `requests` never runs JavaScript — it only sees the shell.
The content you want lives in one of two places:
- The rendered DOM after JavaScript execution
- Raw JSON responses from the internal API calls that JavaScript makes
Your scraping strategy depends on which is easier to access.
Choose Your Approach Before Writing Code
Open DevTools → Network tab → filter by XHR/Fetch → reload the page. If you see clean JSON responses from readable endpoints like /api/v1/products?page=2, you can skip the browser entirely and call those endpoints directly with httpx or requests. This is almost always faster and more reliable than browser automation.
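As a sketch of that direct-HTTP path — assuming a hypothetical paginated endpoint that takes a `?page=N` query parameter and returns a JSON list — the pagination loop can be kept transport-agnostic, so the same logic works with `httpx`, `requests`, or a stub in tests:

```python
from typing import Callable

def paginate(
    fetch: Callable[[str], list[dict]],
    base_url: str,
    max_pages: int = 50,
) -> list[dict]:
    """Walk a ?page=N API until an empty page, collecting items."""
    items: list[dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch(f"{base_url}?page={page}")
        if not batch:
            break  # an empty page means we've walked off the end
        items.extend(batch)
    return items
```

With `httpx`, the fetcher is one line: `paginate(lambda u: httpx.get(u).json()["results"], "https://example.com/api/v1/products")` (endpoint and `results` key are assumptions — match whatever you saw in the Network tab).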
Only reach for a headless browser when:
- The API requires tokens generated client-side (complex HMAC signatures, rotating JWTs)
- Endpoints are obfuscated or dynamically constructed
- Data genuinely only exists in the rendered DOM with no backing API
| Scenario | Best Approach |
|---|---|
| Content rendered into DOM | Headless browser + DOM extraction |
| SPA fetches from internal API | Network interception → direct HTTP |
| Predictable paginated API | Direct HTTP (no browser needed) |
| Infinite scroll feed | Headless browser + scroll automation |
| Virtual scrolling list | Network interception (DOM won't hold all items) |
Approach 1: Headless Browser with Playwright
Playwright is the current standard for headless browser automation in Python. It supports Chromium, Firefox, and WebKit, has a clean async API, and handles modern JS frameworks well.
```bash
pip install playwright
playwright install chromium
```
Waiting for the Right Moment
The most common failure in SPA scraping is extracting the DOM before content has rendered. Playwright gives you several wait strategies:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_spa(url: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # "networkidle" waits until no network requests for 500ms
        # Use "domcontentloaded" when you'll wait on a selector anyway
        await page.goto(url, wait_until="networkidle")

        # Wait for the specific element you need — don't rely on networkidle alone
        await page.wait_for_selector("[data-testid='product-grid']", timeout=15000)

        products = await page.evaluate("""
            () => Array.from(
                document.querySelectorAll('[data-testid="product-card"]')
            ).map(el => ({
                title: el.querySelector('h2')?.textContent?.trim(),
                price: el.querySelector('[data-price]')?.dataset?.price,
                url: el.querySelector('a')?.href,
                image: el.querySelector('img')?.src
            }))
        """)

        await browser.close()
        return products

if __name__ == "__main__":
    results = asyncio.run(scrape_spa("https://example-shop.com/products"))
    print(f"Extracted {len(results)} products")
```
`wait_for_selector` is more reliable than a fixed timeout. It resolves as soon as the element exists in the DOM — which can be seconds earlier than a blanket `await asyncio.sleep(3)` — and won't fail when the sleep was too short.
evaluate() vs. Locators
page.evaluate() runs JavaScript directly in the browser context — useful for extracting many similar elements in a single round-trip. For targeted single-field reads, the locator API is cleaner:
```python
title = await page.locator("h1.product-title").text_content()
price = await page.locator("[data-price]").get_attribute("data-price")
```
Use `evaluate()` for mass extraction, locators for one-off field reads.
Approach 2: API Interception
Many SPAs load data from internal REST or GraphQL APIs that return clean, structured JSON. You can intercept these responses from within Playwright without touching the DOM at all.
```python
import asyncio
import json
from playwright.async_api import async_playwright

async def intercept_api_responses(url: str) -> list[dict]:
    captured: list[dict] = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        async def on_response(response):
            if "/api/v2/listings" in response.url and response.status == 200:
                content_type = response.headers.get("content-type", "")
                if "application/json" in content_type:
                    try:
                        data = await response.json()
                        items = data if isinstance(data, list) else data.get("results", [])
                        captured.extend(items)
                    except Exception as e:
                        print(f"Failed to parse {response.url}: {e}")

        page.on("response", on_response)
        await page.goto(url, wait_until="networkidle")
        await browser.close()

    return captured

if __name__ == "__main__":
    data = asyncio.run(intercept_api_responses("https://example-marketplace.com/search"))
    print(json.dumps(data[:2], indent=2))
```
Once you've identified the API pattern, replicate it directly with `httpx` for production. The browser is only needed to observe which endpoints are called and what authentication headers they carry.
Extracting Client-Side Auth Tokens
If the API requires a bearer token generated in the browser:
```python
import httpx

auth_token: str | None = None  # module-level so the handler can update it

async def on_request(request):
    global auth_token
    if "/api/v2/listings" in request.url:
        auth = request.headers.get("authorization", "")
        if auth.startswith("Bearer "):
            auth_token = auth.removeprefix("Bearer ")

# Inside the Playwright session:
page.on("request", on_request)
await page.goto(url, wait_until="networkidle")

# Now use auth_token directly with httpx for bulk pagination
async with httpx.AsyncClient() as client:
    for page_num in range(1, 50):
        resp = await client.get(
            f"https://example-marketplace.com/api/v2/listings?page={page_num}",
            headers={"Authorization": f"Bearer {auth_token}"},
        )
        items = resp.json().get("results", [])
        if not items:
            break
        captured.extend(items)
```
This hybrid pattern — use the browser once to capture tokens, then direct HTTP for bulk pagination — is 10–50× faster than routing every request through Playwright.
Approach 3: Infinite Scroll Automation
Infinite scroll triggers data loads when the user scrolls near the bottom of the page. The automation pattern is: scroll to the bottom, wait for new content to appear, extract, repeat.
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_infinite_scroll(url: str, max_items: int = 500) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_selector(".item-card", timeout=10000)

        seen_ids: set[str] = set()
        items: list[dict] = []
        stall_rounds = 0

        while len(items) < max_items:
            current = await page.evaluate("""
                () => Array.from(document.querySelectorAll('.item-card')).map(el => ({
                    id: el.dataset.id,
                    title: el.querySelector('h3')?.textContent?.trim(),
                    price: el.querySelector('.price')?.textContent?.trim()
                }))
            """)

            new_items = [i for i in current if i["id"] not in seen_ids]
            if not new_items:
                stall_rounds += 1
                if stall_rounds >= 3:
                    break  # End of feed or load failure
            else:
                stall_rounds = 0

            for item in new_items:
                seen_ids.add(item["id"])
                items.append(item)

            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(1500)  # Wait for new content to render

        await browser.close()
        return items
```
Key decisions in this pattern:
- **Track by ID, not count.** A `seen_ids` set prevents reprocessing items that stay in the DOM after scroll. Counting total DOM nodes is unreliable if the page removes old items as new ones load.
- **Stall detection.** Three consecutive scroll cycles with no new items means you've hit the end of the feed or a silent load failure.
- **Scroll target.** `document.body.scrollHeight` works when the document itself scrolls. If the scrollable container is a nested div, target it: `document.querySelector('.feed-container').scrollTo(0, 99999)`.
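The ID-tracking and stall logic is easy to get wrong when it's buried inside the scrape loop. Pulled out as a pure function — a sketch, independent of Playwright, with the `merge_batch` name being mine — it can be unit-tested directly:

```python
def merge_batch(items: list[dict], seen_ids: set[str], batch: list[dict]) -> int:
    """Append unseen items (keyed by stable id) to `items`; return how many were new."""
    new_items = [i for i in batch if i["id"] not in seen_ids]
    for item in new_items:
        seen_ids.add(item["id"])
        items.append(item)
    return len(new_items)
```

The caller bumps its stall counter whenever the return value is 0 and resets it otherwise, stopping after three consecutive stalls.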
Virtual Scrolling Is a Different Problem
React-window and similar virtualization libraries render only visible rows and recycle DOM nodes as you scroll. You cannot collect all items from the DOM simultaneously — items outside the viewport don't exist as DOM nodes.
For virtual scrolling, API interception is almost always the correct solution. The virtualized list is backed by data loaded from somewhere; intercept those API calls instead of fighting the DOM.
Anti-Bot Considerations
SPAs behind Cloudflare, Akamai, or PerimeterX fingerprint browser characteristics at the JavaScript level: canvas rendering, WebGL parameters, audio context, font enumeration, navigator properties. A stock Playwright instance fails these checks.
Mitigation strategies, in order of practical effectiveness:
- `playwright-stealth`: Patches the most common fingerprint detection vectors. Start here.
- Real Chrome with user data directory: Launch against a real Chrome install with an existing profile — closer to real browser state.
- Residential proxies: Many bot detectors block datacenter IP ranges regardless of browser fingerprinting. Fix IP reputation before spending time on JS patches.
- Managed scraping APIs: Services like AlterLab handle browser fingerprinting, proxy rotation, and bypass as infrastructure — you POST a URL and get back rendered HTML or a JSON payload without managing browser fleets.
```bash
pip install playwright-stealth
```
```python
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_with_stealth(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await stealth_async(page)  # Apply patches before navigation
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content
```
Performance at Scale
A single Chromium instance uses 200–400 MB RAM. For pipelines scraping thousands of pages:
Reuse browser instances, not contexts. browser.new_context() is cheap; browser.launch() is expensive. Create one browser, one context per isolated job.
Block unnecessary resources. Images, fonts, and stylesheets are irrelevant for data extraction and meaningfully slow down page loads.
```python
await page.route(
    "**/*",
    lambda route: route.abort()
    if route.request.resource_type in ("image", "font", "stylesheet", "media")
    else route.continue_(),
)
```
Blocking images alone cuts load time by 30–60% on image-heavy SPAs.
Run contexts in parallel. Use asyncio.gather() to run multiple page scrapes concurrently within one browser instance. Keep concurrency at 3–5 pages per browser; beyond that, CPU contention negates the gains.
```python
async def scrape_with_context(browser, url: str) -> list[dict]:
    context = await browser.new_context()  # cheap, isolated per job
    page = await context.new_page()
    await page.goto(url, wait_until="networkidle")
    items = await page.evaluate("() => []")  # extraction logic elided
    await context.close()
    return items

async def scrape_batch(urls: list[str]) -> list[list[dict]]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        tasks = [scrape_with_context(browser, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        await browser.close()
        return [r for r in results if isinstance(r, list)]
```
Summary
| Strategy | Use When | Skip When |
|---|---|---|
| DOM extraction (Playwright) | Data only in rendered HTML | API is accessible |
| API interception + direct HTTP | API exists, data is structured JSON | Token rotation is too complex |
| Infinite scroll automation | Feed-style pages with scroll triggers | Site uses virtual scrolling |
| Managed scraping API | High-volume, anti-bot protected targets | Simple unprotected targets |
The sequence that works for most SPA scraping projects:
- Open the Network tab before writing any code. If the SPA calls a clean API endpoint, skip the browser entirely.
- Use `wait_for_selector`, not `networkidle` alone. Wait for the specific element you need.
- Intercept requests to capture auth tokens. Use the browser once, then switch to direct HTTP for bulk pagination.
- Infinite scroll: track items by stable ID, not count. Stop when stall detection triggers.
- Block images and fonts in browser pipelines. Free 30–60% speed improvement.
- Fix IP reputation before fingerprinting patches. Residential proxies solve most bot blocks; stealth patches solve the rest.
The most common over-engineering mistake is defaulting to headless browsers when httpx and a couple of curl-derived headers would have worked. Start simple, escalate only when blocked.