Web Scraping with Puppeteer — Complete Guide
How Puppeteer works for web scraping, working code examples, and when a cloud rendering API replaces the need to run Chrome locally.
Puppeteer is a Node.js library from Google that provides a high-level API to control headless Chrome. It is a natural choice for scraping JavaScript-heavy pages — it executes JavaScript, handles dynamic content, and lets you interact with the page before extracting data. This guide covers installation, basic and advanced scraping patterns, common pitfalls, and the practical tradeoffs of running Chrome locally versus using a managed rendering API.
Installing Puppeteer
Puppeteer ships with a bundled version of Chrome. The full package downloads Chrome automatically on install.
npm install puppeteer
# or lightweight version (bring your own Chrome)
npm install puppeteer-core
Your First Puppeteer Scraper
The core Puppeteer pattern: launch a browser, open a page, navigate to a URL, and extract data. Always close the browser when done — leaving browser processes open leaks memory.
import puppeteer from "puppeteer";
const browser = await puppeteer.launch({ headless: true });
try {
  const page = await browser.newPage();
  await page.goto("https://example.com/products", {
    waitUntil: "networkidle2",
    timeout: 30000,
  });
  // Extract data from the DOM (this callback runs inside the page)
  const products = await page.evaluate(() => {
    return Array.from(document.querySelectorAll("div.product-card")).map((el) => ({
      title: el.querySelector("h2")?.textContent?.trim() ?? "",
      price: el.querySelector(".price")?.textContent?.trim() ?? "",
      url: el.querySelector("a")?.href ?? "",
    }));
  });
  console.log(`Extracted ${products.length} products:`, products);
} finally {
  // Close the browser even if navigation or extraction throws
  await browser.close();
}
Waiting for Dynamic Content
The most common Puppeteer challenge: knowing when the page has loaded the data you need. Puppeteer provides several wait strategies.
// Wait for a specific selector to appear
await page.waitForSelector("div.product-card", { timeout: 10000 });
// Wait for the network to settle (at most 2 open connections for 500 ms)
await page.goto(url, { waitUntil: "networkidle2" });
// Wait for an XHR/fetch response, often cleaner than DOM polling
const [response] = await Promise.all([
  page.waitForResponse((res) => res.url().includes("/api/products")),
  page.goto(url),
]);
const data = await response.json(); // structured data from the API
console.log("API response:", data);
Setting User Agent and Viewport
Sites with bot-detection layers can identify a default Puppeteer configuration; headless Chrome's default user agent, for example, typically contains "HeadlessChrome". Set a realistic user agent and viewport to more closely match a real browser.
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.setUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
);
await page.setViewport({ width: 1280, height: 900 });
await page.setExtraHTTPHeaders({
  "Accept-Language": "en-US,en;q=0.9",
});
Handling Pagination with Puppeteer
Navigate through paginated sites by clicking the next-page button or constructing page URLs.
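The click-based approach could look like the following sketch. The helper name scrapeByClicking, the extract callback, and the a.next-page default selector are illustrative assumptions, not part of Puppeteer. It also assumes each click triggers a full page navigation; with client-side pagination, waitForNavigation would time out, and a waitForSelector or waitForResponse wait fits better.

```javascript
// Sketch: follow a "next" link until it disappears or maxPages is reached.
// `page` is a Puppeteer Page. `extract` is a caller-supplied (hypothetical)
// async function that returns an array of items for the current page.
async function scrapeByClicking(
  page,
  extract,
  { nextSelector = "a.next-page", maxPages = 20 } = {}
) {
  const results = [];
  for (let n = 1; n <= maxPages; n++) {
    const items = await extract(page);
    if (items.length === 0) break; // empty page: nothing more to collect
    results.push(...items);

    const nextLink = await page.$(nextSelector);
    if (!nextLink) break; // no next-page link: last page reached

    // Register the navigation wait before clicking, so a fast navigation
    // cannot complete before the wait is in place.
    await Promise.all([
      page.waitForNavigation({ waitUntil: "networkidle2" }),
      nextLink.click(),
    ]);
  }
  return results;
}
```

The caller navigates to the first page with page.goto and then passes the page in. The URL-construction variant of the same loop follows.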
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
const allResults = [];
let currentPage = 1;
while (currentPage <= 20) {
  await page.goto(`https://example.com/articles?page=${currentPage}`, {
    waitUntil: "networkidle2",
  });
  const items = await page.$$eval(".article-card", (els) =>
    els.map((el) => ({
      title: el.querySelector("h2")?.textContent?.trim() ?? "",
      date: el.querySelector("time")?.getAttribute("datetime") ?? "",
    }))
  );
  if (items.length === 0) break;
  allResults.push(...items);
  const nextBtn = await page.$("a.next-page");
  if (!nextBtn) break;
  currentPage++;
  await new Promise((r) => setTimeout(r, 1000)); // polite delay
}
await browser.close();
console.log(`Collected ${allResults.length} articles`);
Practical Limitations for Production Scraping
Puppeteer is excellent for development and interactive scraping, but has significant costs at production scale:
Memory: Each Chrome instance uses 200–500 MB. Running 10 parallel scrapers requires 2–5 GB RAM.
Speed: Browser launch plus page render takes 3–15 seconds per page, so a 100-page scrape run sequentially takes 5–25 minutes.
Maintenance: Bundled Chrome updates require dependency updates. Chrome version mismatches cause failures.
Detection: Headless Chrome exposes signals through navigator properties, timing, and rendering characteristics. Sites with bot-detection layers often identify and restrict headless traffic.
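The memory and speed figures above are plain multiplication, which a small estimator makes explicit. This is a rough sketch: the ranges are the ones quoted in this guide rather than measurements, and the function names are made up for illustration.

```javascript
// Back-of-envelope estimator using the ranges quoted in this guide.
// Assumptions: every Chrome instance falls in the same memory range, and
// pages are scraped one after another (no parallelism).
const MEMORY_MB_PER_INSTANCE = [200, 500]; // [low, high]
const SECONDS_PER_PAGE = [3, 15]; // [low, high]

// Total RAM in GB (1 GB = 1000 MB) for a number of parallel instances.
function estimateMemoryGB(instances) {
  return MEMORY_MB_PER_INSTANCE.map((mb) => (mb * instances) / 1000);
}

// Wall-clock minutes to scrape `pages` pages sequentially.
function estimateSequentialMinutes(pages) {
  return SECONDS_PER_PAGE.map((s) => (s * pages) / 60);
}

console.log(estimateMemoryGB(10)); // [ 2, 5 ]
console.log(estimateSequentialMinutes(100)); // [ 5, 25 ]
```

Running pages in parallel trades one figure for the other: wall-clock time falls roughly with the number of instances, while memory grows with it.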
When Puppeteer is the right choice: Interaction-heavy flows (login, form submission, multi-step navigation), browser testing/QA, or low-volume one-off data collection.
When a rendering API is more practical: High-volume production scraping, when you cannot maintain browser infrastructure, or when you need reliable IP rotation without additional proxy setup.
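For the interaction-heavy flows listed above, a login step might be sketched as follows. The function name, the selectors, and the URLs are hypothetical and depend entirely on the target site; only goto, type, click, waitForNavigation, and url are Puppeteer page methods. The sketch assumes that submitting the form triggers a full navigation.

```javascript
// Sketch of a form login. `page` is a Puppeteer Page.
// All selectors and the login URL are placeholders supplied by the caller.
async function submitLoginForm(page, { loginUrl, username, password, selectors }) {
  await page.goto(loginUrl, { waitUntil: "networkidle2" });

  // Type into the fields the way a user would
  await page.type(selectors.username, username);
  await page.type(selectors.password, password);

  // Register the navigation wait before clicking submit, so the
  // post-login redirect is not missed.
  await Promise.all([
    page.waitForNavigation({ waitUntil: "networkidle2" }),
    page.click(selectors.submit),
  ]);

  // Return where the browser landed so the caller can verify success
  return page.url();
}
```

A caller would typically read the credentials from environment variables rather than hard-coding them, and compare the returned URL against the expected post-login page before scraping.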
Complete Puppeteer Scraper — Multi-Page Data Collection
The following is a complete Puppeteer scraper with pagination, realistic browser configuration, and error handling.
import puppeteer from "puppeteer";
import { writeFileSync } from "fs";
async function scrapeSite(baseUrl, maxPages = 10) {
  const browser = await puppeteer.launch({
    headless: true,
    // These flags disable Chrome's sandbox. They are often needed in containers
    // that run as root; drop them wherever the sandbox works.
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });
  const allResults = [];
  let pageNum = 1;
  try {
    const page = await browser.newPage();
    await page.setUserAgent(
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
    );
    await page.setViewport({ width: 1280, height: 900 });
    while (pageNum <= maxPages) {
      const url = `${baseUrl}?page=${pageNum}`;
      console.log(`Scraping ${url}...`);
      await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });
      try {
        await page.waitForSelector(".product-card", { timeout: 5000 });
      } catch {
        // The selector never appeared within 5 seconds: treat as end of results
        console.log("No products found, stopping");
        break;
      }
      const products = await page.$$eval(".product-card", (els) =>
        els.map((el) => ({
          title: el.querySelector("h2")?.textContent?.trim() ?? "",
          price: el.querySelector(".price")?.textContent?.trim() ?? "",
          url: el.querySelector("a")?.href ?? "",
        }))
      );
      if (products.length === 0) break;
      allResults.push(...products);
      const hasNext = await page.$("a.next-page");
      if (!hasNext) break;
      pageNum++;
      await new Promise((r) => setTimeout(r, 1500)); // polite delay
    }
  } finally {
    // Always release the Chrome process, even after an error
    await browser.close();
  }
  return allResults;
}
const results = await scrapeSite("https://example.com/products");
writeFileSync("products.json", JSON.stringify(results, null, 2));
console.log(`Saved ${results.length} products`);
Same Result, No Chrome Process
When you just need rendered HTML — not complex browser interactions — AlterLab handles the browser server-side. No Puppeteer install, no Chrome binary, no memory overhead. From $0.0002/request with 5,000 free requests to start.
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";
const API_KEY = "YOUR_API_KEY"; // Get free at alterlab.io
async function scrapeWithAlterLab(url, renderJs = false) {
  const response = await fetch("https://api.alterlab.io/api/v1/scrape", {
    method: "POST",
    headers: {
      "X-API-Key": API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, render_js: renderJs }),
    signal: AbortSignal.timeout(30000),
  });
  if (!response.ok) throw new Error(`API error: ${response.status}`);
  const data = await response.json();
  return data.html ?? "";
}
const allResults = [];
for (let pageNum = 1; pageNum <= 10; pageNum++) {
  const url = `https://example.com/products?page=${pageNum}`;
  const html = await scrapeWithAlterLab(url, true); // render_js: true
  const $ = cheerio.load(html);
  const products = [];
  $(".product-card").each((_i, el) => {
    const title = $(el).find("h2").text().trim();
    const price = $(el).find(".price").text().trim();
    if (title) products.push({ title, price });
  });
  if (products.length === 0) break;
  allResults.push(...products);
}
writeFileSync("products.json", JSON.stringify(allResults, null, 2));
console.log(`Saved ${allResults.length} products (no Chrome process running)`);
Puppeteer vs Alternatives
Puppeteer (local Chrome)
Pros
- Full browser interaction
- Intercept network requests
- Free to run
Cons
- 200–500 MB per browser instance
- 3–15 seconds per page
- Chrome detection is common
- Complex scaling and crash handling
Puppeteer + proxies
Pros
- Handles IP-based rate limiting
- More reliable on protected sites
Cons
- Proxy cost + browser cost
- Complex proxy rotation setup
- Still slow and memory-heavy
AlterLab rendering API
Pros
- No Chrome management
- Automatic IP rotation
- 5-tier compatibility escalation
- From $0.0002/request
- No CPU or memory overhead
Cons
- Per-request cost
- Cannot perform complex UI interactions