
Scraping JavaScript-Heavy SPAs with Python: Dynamic Content, Infinite Scroll, and API Interception
March 16, 2026
Modern web applications rarely serve their data in the initial HTML response. React, Vue, and Angular SPAs render content client-side, fetch data from internal APIs, and load more content as users scroll. If you're trying to scrape JavaScript-heavy SPAs with Python using standard requests + BeautifulSoup pipelines, you'll fail immediately — by the time you parse the response, the meaningful content hasn't rendered yet.
This post covers three concrete techniques for extracting data from SPAs:
- Headless browser automation for rendered DOM extraction
- Network request interception to harvest raw API responses
- Programmatic infinite scroll handling
Why requests Fails Against SPAs
When you GET a typical SPA URL, the server returns a near-empty shell:
```html
<!DOCTYPE html>
<html>
  <head><title>My App</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.chunk.js"></script>
  </body>
</html>
```
All product listings, search results, and user data are loaded asynchronously after the browser executes those script bundles. `requests` never runs JavaScript — it only sees the shell.
The content you want lives in one of two places:
- The rendered DOM after JavaScript execution
- Raw JSON responses from the internal API calls that JavaScript makes
Your scraping strategy depends on which is easier to access.
Choose Your Approach Before Writing Code
Open DevTools → Network tab → filter by XHR/Fetch → reload the page. If you see clean JSON responses from readable endpoints like /api/v1/products?page=2, you can skip the browser entirely and call those endpoints directly with httpx or requests. This is almost always faster and more reliable than browser automation.
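As a sketch of that direct-HTTP path — assuming a hypothetical paginated endpoint that takes a `?page=N` query parameter and returns a JSON list — the pagination loop can be kept transport-agnostic, so the same logic works with `httpx`, `requests`, or a stub in tests:

```python
from typing import Callable

def paginate(
    fetch: Callable[[str], list[dict]],
    base_url: str,
    max_pages: int = 50,
) -> list[dict]:
    """Walk a ?page=N API until an empty page, collecting items."""
    items: list[dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch(f"{base_url}?page={page}")
        if not batch:
            break  # an empty page means we've walked off the end
        items.extend(batch)
    return items
```

With `httpx`, the fetcher is one line: `paginate(lambda u: httpx.get(u).json()["results"], "https://example.com/api/v1/products")` (endpoint and `results` key are assumptions — match whatever you saw in the Network tab).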
Only reach for a headless browser when:
- The API requires tokens generated client-side (complex HMAC signatures, rotating JWTs)
- Endpoints are obfuscated or dynamically constructed
- Data genuinely only exists in the rendered DOM with no backing API
| Scenario | Best Approach |
|---|---|
| Content rendered into DOM | Headless browser + DOM extraction |
| SPA fetches from internal API | Network interception → direct HTTP |
| Predictable paginated API | Direct HTTP (no browser needed) |
| Infinite scroll feed | Headless browser + scroll automation |
| Virtual scrolling list | Network interception (DOM won't hold all items) |
Approach 1: Headless Browser with Playwright
Playwright is the current standard for headless browser automation in Python. It supports Chromium, Firefox, and WebKit, has a clean async API, and handles modern JS frameworks well.
```bash
pip install playwright
playwright install chromium
```
Waiting for the Right Moment
The most common failure in SPA scraping is extracting the DOM before content has rendered. Playwright gives you several wait strategies:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_spa(url: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # "networkidle" waits until no network requests for 500ms
        # Use "domcontentloaded" when you'll wait on a selector anyway
        await page.goto(url, wait_until="networkidle")

        # Wait for the specific element you need — don't rely on networkidle alone
        await page.wait_for_selector("[data-testid='product-grid']", timeout=15000)

        products = await page.evaluate("""
            () => Array.from(
                document.querySelectorAll('[data-testid="product-card"]')
            ).map(el => ({
                title: el.querySelector('h2')?.textContent?.trim(),
                price: el.querySelector('[data-price]')?.dataset?.price,
                url: el.querySelector('a')?.href,
                image: el.querySelector('img')?.src
            }))
        """)

        await browser.close()
        return products

if __name__ == "__main__":
    results = asyncio.run(scrape_spa("https://example-shop.com/products"))
    print(f"Extracted {len(results)} products")
```
`wait_for_selector` is more reliable than a fixed timeout. It resolves as soon as the element exists in the DOM — which can be seconds earlier than a blanket `await asyncio.sleep(3)` — and won't fail when the sleep was too short.
evaluate() vs. Locators
page.evaluate() runs JavaScript directly in the browser context — useful for extracting many similar elements in a single round-trip. For targeted single-field reads, the locator API is cleaner:
```python
title = await page.locator("h1.product-title").text_content()
price = await page.locator("[data-price]").get_attribute("data-price")
```
Use `evaluate()` for mass extraction, locators for one-off field reads.
Approach 2: API Interception
Many SPAs load data from internal REST or GraphQL APIs that return clean, structured JSON. You can intercept these responses from within Playwright without touching the DOM at all.
```python
import asyncio
import json
from playwright.async_api import async_playwright

async def intercept_api_responses(url: str) -> list[dict]:
    captured: list[dict] = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        async def on_response(response):
            if "/api/v2/listings" in response.url and response.status == 200:
                content_type = response.headers.get("content-type", "")
                if "application/json" in content_type:
                    try:
                        data = await response.json()
                        items = data if isinstance(data, list) else data.get("results", [])
                        captured.extend(items)
                    except Exception as e:
                        print(f"Failed to parse {response.url}: {e}")

        page.on("response", on_response)
        await page.goto(url, wait_until="networkidle")
        await browser.close()

    return captured

if __name__ == "__main__":
    data = asyncio.run(intercept_api_responses("https://example-marketplace.com/search"))
    print(json.dumps(data[:2], indent=2))
```
Once you've identified the API pattern, replicate it directly with `httpx` for production. The browser is only needed to observe which endpoints are called and what authentication headers they carry.
Extracting Client-Side Auth Tokens
If the API requires a bearer token generated in the browser:
```python
import httpx

auth_token: str | None = None  # module-level so the handler can update it

async def on_request(request):
    global auth_token
    if "/api/v2/listings" in request.url:
        auth = request.headers.get("authorization", "")
        if auth.startswith("Bearer "):
            auth_token = auth.removeprefix("Bearer ")

# Inside the Playwright session:
page.on("request", on_request)
await page.goto(url, wait_until="networkidle")

# Now use auth_token directly with httpx for bulk pagination
async with httpx.AsyncClient() as client:
    for page_num in range(1, 50):
        resp = await client.get(
            f"https://example-marketplace.com/api/v2/listings?page={page_num}",
            headers={"Authorization": f"Bearer {auth_token}"},
        )
        items = resp.json().get("results", [])
        if not items:
            break
        captured.extend(items)
```
This hybrid pattern — use the browser once to capture tokens, then direct HTTP for bulk pagination — is 10–50× faster than routing every request through Playwright.
Approach 3: Infinite Scroll Automation
Infinite scroll triggers data loads when the user scrolls near the bottom of the page. The automation pattern is: scroll to the bottom, wait for new content to appear, extract, repeat.
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_infinite_scroll(url: str, max_items: int = 500) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_selector(".item-card", timeout=10000)

        seen_ids: set[str] = set()
        items: list[dict] = []
        stall_rounds = 0

        while len(items) < max_items:
            current = await page.evaluate("""
                () => Array.from(document.querySelectorAll('.item-card')).map(el => ({
                    id: el.dataset.id,
                    title: el.querySelector('h3')?.textContent?.trim(),
                    price: el.querySelector('.price')?.textContent?.trim()
                }))
            """)

            new_items = [i for i in current if i["id"] not in seen_ids]
            if not new_items:
                stall_rounds += 1
                if stall_rounds >= 3:
                    break  # End of feed or load failure
            else:
                stall_rounds = 0

            for item in new_items:
                seen_ids.add(item["id"])
                items.append(item)

            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(1500)  # Wait for new content to render

        await browser.close()
        return items
```
Key decisions in this pattern:
- **Track by ID, not count.** A `seen_ids` set prevents reprocessing items that stay in the DOM after scroll. Counting total DOM nodes is unreliable if the page removes old items as new ones load.
- **Stall detection.** Three consecutive scroll cycles with no new items means you've hit the end of the feed or a silent load failure.
- **Scroll target.** `document.body.scrollHeight` works when the document itself scrolls. If the scrollable container is a nested div, target it: `document.querySelector('.feed-container').scrollTo(0, 99999)`.
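The ID-tracking and stall logic is easy to get wrong when it's buried inside the scrape loop. Pulled out as a pure function — a sketch, independent of Playwright, with the `merge_batch` name being mine — it can be unit-tested directly:

```python
def merge_batch(items: list[dict], seen_ids: set[str], batch: list[dict]) -> int:
    """Append unseen items (keyed by stable id) to `items`; return how many were new."""
    new_items = [i for i in batch if i["id"] not in seen_ids]
    for item in new_items:
        seen_ids.add(item["id"])
        items.append(item)
    return len(new_items)
```

The caller bumps its stall counter whenever the return value is 0 and resets it otherwise, stopping after three consecutive stalls.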
Virtual Scrolling Is a Different Problem
React-window and similar virtualization libraries render only visible rows and recycle DOM nodes as you scroll. You cannot collect all items from the DOM simultaneously — items outside the viewport don't exist as DOM nodes.
For virtual scrolling, API interception is almost always the correct solution. The virtualized list is backed by data loaded from somewhere; intercept those API calls instead of fighting the DOM.
Anti-Bot Considerations
SPAs behind Cloudflare, Akamai, or PerimeterX fingerprint browser characteristics at the JavaScript level: canvas rendering, WebGL parameters, audio context, font enumeration, navigator properties. A stock Playwright instance fails these checks.
Mitigation strategies, in order of practical effectiveness:
- `playwright-stealth`: Patches the most common fingerprint detection vectors. Start here.
- Real Chrome with user data directory: Launch against a real Chrome install with an existing profile — closer to real browser state.
- Residential proxies: Many bot detectors block datacenter IP ranges regardless of browser fingerprinting. Fix IP reputation before spending time on JS patches.
- Managed scraping APIs: Services like AlterLab handle browser fingerprinting, proxy rotation, and bypass as infrastructure — you POST a URL and get back rendered HTML or a JSON payload without managing browser fleets.
```bash
pip install playwright-stealth
```
```python
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_with_stealth(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await stealth_async(page)  # Apply patches before navigation
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content
```
Performance at Scale
A single Chromium instance uses 200–400 MB RAM. For pipelines scraping thousands of pages:
Reuse browser instances, not contexts. browser.new_context() is cheap; browser.launch() is expensive. Create one browser, one context per isolated job.
Block unnecessary resources. Images, fonts, and stylesheets are irrelevant for data extraction and meaningfully slow down page loads.
```python
await page.route(
    "**/*",
    lambda route: route.abort()
    if route.request.resource_type in ("image", "font", "stylesheet", "media")
    else route.continue_(),
)
```
Blocking images alone cuts load time by 30–60% on image-heavy SPAs.
Run contexts in parallel. Use asyncio.gather() to run multiple page scrapes concurrently within one browser instance. Keep concurrency at 3–5 pages per browser; beyond that, CPU contention negates the gains.
```python
async def scrape_with_context(browser, url: str) -> list[dict]:
    context = await browser.new_context()  # cheap, isolated per job
    page = await context.new_page()
    await page.goto(url, wait_until="networkidle")
    items = await page.evaluate("() => []")  # extraction logic elided
    await context.close()
    return items

async def scrape_batch(urls: list[str]) -> list[list[dict]]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        tasks = [scrape_with_context(browser, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        await browser.close()
        return [r for r in results if isinstance(r, list)]
```
Summary
| Strategy | Use When | Skip When |
|---|---|---|
| DOM extraction (Playwright) | Data only in rendered HTML | API is accessible |
| API interception + direct HTTP | API exists, data is structured JSON | Token rotation is too complex |
| Infinite scroll automation | Feed-style pages with scroll triggers | Site uses virtual scrolling |
| Managed scraping API | High-volume, anti-bot protected targets | Simple unprotected targets |
The sequence that works for most SPA scraping projects:
- Open the Network tab before writing any code. If the SPA calls a clean API endpoint, skip the browser entirely.
- Use `wait_for_selector`, not `networkidle` alone. Wait for the specific element you need.
- Intercept requests to capture auth tokens. Use the browser once, then switch to direct HTTP for bulk pagination.
- Infinite scroll: track items by stable ID, not count. Stop when stall detection triggers.
- Block images and fonts in browser pipelines. Free 30–60% speed improvement.
- Fix IP reputation before fingerprinting patches. Residential proxies solve most bot blocks; stealth patches solve the rest.
The most common over-engineering mistake is defaulting to headless browsers when httpx and a couple of curl-derived headers would have worked. Start simple, escalate only when blocked.