Web Scraping with Playwright — Complete Guide
How Playwright works, how to set up a scraper, and when a cloud rendering API is more practical than running browsers yourself.
Playwright is a browser automation library from Microsoft — it controls Chromium, Firefox, and WebKit programmatically. It is a strong tool for scraping JavaScript-heavy pages that require real browser execution: single-page applications, pages with infinite scroll, forms that need to be filled, and content that only loads after user interactions. This guide covers setup, basic scraping patterns, and the practical tradeoffs of running browsers locally versus using a managed rendering API.
Installing Playwright
Playwright has official bindings for Python, Node.js, Java, and .NET; this guide uses Python, with the Node.js install shown for reference. Install the package, then download the browser binaries.
# Python
pip install playwright
playwright install chromium
# Node.js
npm install playwright
npx playwright install chromium

Your First Playwright Scraper
Playwright's API is straightforward: launch a browser, open a page, navigate to a URL, and query the DOM. The async API is the usual choice for production scrapers because a single process can drive several pages concurrently.
import asyncio
from playwright.async_api import async_playwright

async def scrape_page(url: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        products = await page.eval_on_selector_all(
            "div.product-card",
            """els => els.map(el => ({
                title: el.querySelector('h2')?.textContent?.trim() ?? '',
                price: el.querySelector('.price')?.textContent?.trim() ?? '',
            }))"""
        )
        await browser.close()
        return products

results = asyncio.run(scrape_page("https://example.com/products"))

Handling Dynamic Content and Waiting
The most common Playwright challenge: knowing when the page has loaded enough data to scrape. Playwright provides several wait strategies — use the most specific one for your target page.
# Wait for network to settle (no connections for at least 500 ms)
await page.goto(url, wait_until="networkidle")
# Wait for a specific element to appear
await page.wait_for_selector("div.product-card", timeout=10000)
# Wait until at least one matching element exists (raise the threshold to require more)
await page.wait_for_function("document.querySelectorAll('div.product-card').length > 0")
# Wait for an XHR response triggered by the navigation
async with page.expect_response(lambda r: "/api/products" in r.url) as response_info:
    await page.goto(url)
response = await response_info.value
data = await response.json()  # often easier than DOM scraping

Handling Infinite Scroll
Infinite-scroll pages load more content as you scroll down. Use Playwright to scroll the page incrementally and wait for new content to load before scrolling again.
async def scrape_infinite_scroll(url: str, max_scrolls: int = 10) -> list[str]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        all_items: set[str] = set()
        for _ in range(max_scrolls):
            items = await page.eval_on_selector_all(
                ".item-title",
                "els => els.map(el => el.textContent.trim())"
            )
            all_items.update(items)
            prev_count = len(all_items)
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)  # wait for new content
            new_items = await page.eval_on_selector_all(
                ".item-title",
                "els => els.map(el => el.textContent.trim())"
            )
            all_items.update(new_items)
            if len(all_items) == prev_count:
                break  # no new content loaded
        await browser.close()
        return list(all_items)

Practical Limitations of Running Playwright Locally
Playwright is powerful, but comes with significant operational costs for production scraping:
Memory: Each browser instance uses 300–500 MB. Scaling to 20 concurrent browsers requires 6–10 GB of RAM.
Speed: Browser startup takes 1–3 seconds. Page load takes 3–10 seconds per page. Throughput is low compared to API-based approaches.
Detection: Headless browsers can be identified by timing patterns, navigator properties, and rendering characteristics. Many sites with bot-detection layers identify and block automated browser traffic.
Infrastructure: You need to manage Chromium binaries, handle crashes, implement restarts, and configure proxy rotation yourself.
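The memory and speed figures above translate into a rough capacity estimate. The sketch below is back-of-envelope arithmetic only, using the 300–500 MB per browser and 3–10 seconds per page ranges quoted in this section. The function names are illustrative, and it assumes each browser handles one page at a time with no startup, retry, or blocking overhead.

```python
def ram_range_gb(browsers: int, mb_low: int = 300, mb_high: int = 500) -> tuple[float, float]:
    """Return (low, high) RAM in GB for a number of concurrent browsers."""
    return browsers * mb_low / 1000, browsers * mb_high / 1000


def pages_per_hour_range(
    browsers: int, sec_fast: float = 3.0, sec_slow: float = 10.0
) -> tuple[int, int]:
    """Return (worst, best) pages per hour, assuming one page at a time per browser."""
    worst = int(browsers * 3600 / sec_slow)
    best = int(browsers * 3600 / sec_fast)
    return worst, best


if __name__ == "__main__":
    print(ram_range_gb(20))
    print(pages_per_hour_range(20))
```

For 20 browsers this gives 6 to 10 GB of RAM and roughly 7,200 to 24,000 pages per hour as an upper bound; real throughput will be lower once browser startup and failed loads are counted.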
When to use Playwright locally: Complex interaction sequences (login flows, multi-step forms), testing/QA pipelines, or one-off data collection runs.
When to use a rendering API instead: Production scraping of JavaScript-heavy pages at scale, when you need reliable IP rotation, when you cannot maintain browser infrastructure.
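For the local case, two of the chores listed above, keeping memory bounded and recovering from crashes, are commonly handled with a concurrency limit and a retry wrapper. The following is a minimal sketch using only asyncio. Here `fetch` is a hypothetical coroutine standing in for whatever Playwright routine loads one URL, and the limit, attempt count, and backoff values are placeholder assumptions to tune.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],  # hypothetical: loads one URL, returns HTML
    max_concurrent: int = 5,
    attempts: int = 3,
    backoff_s: float = 1.0,
) -> dict[str, Optional[str]]:
    """Run fetch over urls with a concurrency cap; URLs that keep failing map to None."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def one(url: str) -> Optional[str]:
        async with semaphore:  # at most max_concurrent fetches in flight
            for attempt in range(1, attempts + 1):
                try:
                    return await fetch(url)
                except Exception:
                    # Crash or timeout: back off a little longer each time, then retry
                    if attempt < attempts:
                        await asyncio.sleep(backoff_s * attempt)
            return None

    results = await asyncio.gather(*(one(u) for u in urls))
    return dict(zip(urls, results))
```

In practice `fetch` would open a page, navigate to the URL, and return the rendered HTML. Choosing `max_concurrent` from the memory budget keeps the process within known limits, at the cost of holding a slot while a failed URL waits to retry.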
Extracting Data After Rendering — Playwright vs API
Both approaches produce the same outcome: rendered HTML you can parse. The difference is where the browser runs.
# Approach A: Local Playwright browser
import asyncio
from playwright.async_api import async_playwright

async def render_locally(url: str) -> str:
    # "async with" is only valid inside a coroutine, so wrap the steps in a function
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        html = await page.content()
        await browser.close()
        return html

html = asyncio.run(render_locally("https://example.com/spa-page"))

# Approach B: AlterLab rendering API (same result, no browser management)
import requests
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_KEY", "Content-Type": "application/json"},
    json={"url": "https://example.com/spa-page", "render_js": True},
)
html = response.json()["html"]

# Either way, parse with BeautifulSoup or lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")

Playwright Scraper — SPA with Pagination
Complete working Playwright scraper with pagination, a custom user agent and viewport, and timeout handling.
import asyncio
import json
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from playwright.async_api import async_playwright

async def scrape_spa(base_url: str, max_pages: int = 10) -> list[dict]:
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            context = await browser.new_context(
                user_agent="Mozilla/5.0 (compatible; DataBot/1.0)",
                viewport={"width": 1280, "height": 900},
            )
            page = await context.new_page()
            for page_num in range(1, max_pages + 1):
                url = f"{base_url}?page={page_num}"
                print(f"Scraping page {page_num}...")
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    await page.wait_for_selector(".product-card", timeout=10000)
                except PlaywrightTimeoutError:
                    # Navigation stalled or no product cards rendered: keep what we have
                    print(f"Timed out on page {page_num}, stopping")
                    break
                items = await page.eval_on_selector_all(
                    ".product-card",
                    """els => els.map(el => ({
                        title: el.querySelector('h2')?.textContent?.trim() ?? '',
                        price: el.querySelector('.price')?.textContent?.trim() ?? '',
                        url: el.querySelector('a')?.href ?? '',
                    }))"""
                )
                if not items:
                    print(f"No items on page {page_num} — stopping")
                    break
                results.extend(items)
                # Stop when the page has no link to a next page
                next_btn = await page.query_selector("a.next-page")
                if not next_btn:
                    break
        finally:
            # Close the browser even if a page raised an unexpected error
            await browser.close()
    return results

results = asyncio.run(scrape_spa("https://example.com/products"))
with open("products.json", "w") as f:
    json.dump(results, f, indent=2)
print(f"Saved {len(results)} products")

Same Result, No Browser Process
When you just need rendered HTML — not complex interactions — AlterLab handles the browser server-side. No Playwright install, no browser binary management, no memory overhead. From $0.0002/request with 5,000 free requests to start.
import requests
from bs4 import BeautifulSoup
import json

API_KEY = "YOUR_API_KEY"  # Get free at alterlab.io

def scrape_spa_page(url: str) -> list[dict]:
    """AlterLab renders the page server-side — no local browser required."""
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"url": url, "render_js": True},
        timeout=30,
    )
    response.raise_for_status()
    html = response.json().get("html", "")
    soup = BeautifulSoup(html, "lxml")
    return [
        {
            "title": card.select_one("h2").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
        }
        for card in soup.select(".product-card")
        if card.select_one("h2") and card.select_one(".price")
    ]

all_results = []
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    items = scrape_spa_page(url)
    if not items:
        break
    all_results.extend(items)

with open("products.json", "w") as f:
    json.dump(all_results, f, indent=2)
print(f"Saved {len(all_results)} products — no browser process running")

Playwright vs Alternatives
Playwright (local browser)
Pros
- Full browser interaction (clicks, forms, scroll)
- Free to run
- Direct DOM access
Cons
- 300–500 MB per browser instance
- 3–10 seconds per page
- Browser detection common
- Complex infrastructure management
- Crashes require restart logic
Playwright + proxy rotation (DIY)
Pros
- Handles IP-based rate limiting
- More reliable than plain browser
Cons
- Proxy cost + browser cost
- Complex integration
- Still slow and memory-heavy
- Detection still possible
AlterLab rendering API
Pros
- No browser management
- Automatic IP rotation
- 5-tier compatibility escalation
- From $0.0002/request
- No memory or CPU overhead
Cons
- Per-request cost
- Cannot perform complex interactions