Puppeteer Guide · Node.js / JavaScript

Web Scraping with Puppeteer — Complete Guide

How Puppeteer works for web scraping, working code examples, and when a cloud rendering API replaces the need to run Chrome locally.

Puppeteer is a Node.js library from Google that provides a high-level API to control headless Chrome. It is a natural choice for scraping JavaScript-heavy pages — it executes JavaScript, handles dynamic content, and lets you interact with the page before extracting data. This guide covers installation, basic and advanced scraping patterns, common pitfalls, and the practical tradeoffs of running Chrome locally versus using a managed rendering API.

Installing Puppeteer

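The full puppeteer package downloads a compatible build of Chrome automatically during install. The puppeteer-core package is the same library without the download, so you point it at a browser you already have.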

npm install puppeteer
# or lightweight version (bring your own Chrome)
npm install puppeteer-core
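
With puppeteer-core nothing is downloaded, so the launch call has to be told where a browser lives. A minimal sketch, assuming the path arrives in a CHROME_PATH environment variable (a hypothetical name chosen for this example); executablePath is the launch option that receives it.

```javascript
// Build launch options for puppeteer-core, which does not download Chrome.
// CHROME_PATH is a hypothetical environment variable used for illustration.
function buildLaunchOptions(env = process.env) {
  const executablePath = env.CHROME_PATH;
  if (!executablePath) {
    throw new Error("Set CHROME_PATH to the Chrome or Chromium binary");
  }
  return { headless: true, executablePath };
}

// Usage (requires puppeteer-core to be installed):
//   import puppeteer from "puppeteer-core";
//   const browser = await puppeteer.launch(buildLaunchOptions());
```

Failing early with a clear message is a design choice here: a missing path otherwise surfaces later as a less obvious launch error.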

Your First Puppeteer Scraper

The core Puppeteer pattern: launch a browser, open a page, navigate to a URL, and extract data. Always close the browser when done — leaving browser processes open leaks memory.

import puppeteer from "puppeteer";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.goto("https://example.com/products", {
  waitUntil: "networkidle2",
  timeout: 30000,
});

// Extract data from the DOM
const products = await page.evaluate(() => {
  return Array.from(document.querySelectorAll("div.product-card")).map((el) => ({
    title: el.querySelector("h2")?.textContent?.trim() ?? "",
    price: el.querySelector(".price")?.textContent?.trim() ?? "",
    url: el.querySelector("a")?.href ?? "",
  }));
});

await browser.close();
console.log(`Extracted ${products.length} products:`, products);
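
The scraper above reaches browser.close() only if every earlier step succeeds. One way to guarantee cleanup is a small wrapper built on try/finally. This is a sketch, not part of Puppeteer: withBrowser is a hypothetical helper, and launch is any function that returns a browser, such as () => puppeteer.launch({ headless: true }).

```javascript
// Hypothetical helper: run a task with a browser and always close the
// browser afterwards, even when the task throws.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Usage with Puppeteer (assumes `puppeteer` is imported):
//   const title = await withBrowser(
//     () => puppeteer.launch({ headless: true }),
//     async (browser) => {
//       const page = await browser.newPage();
//       await page.goto("https://example.com");
//       return page.title();
//     }
//   );
```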

Waiting for Dynamic Content

The most common Puppeteer challenge: knowing when the page has loaded the data you need. Puppeteer provides several wait strategies.

// Wait for specific selector to appear
await page.waitForSelector("div.product-card", { timeout: 10000 });

// Wait for network to settle (no more than 2 open connections for 500 ms)
await page.goto(url, { waitUntil: "networkidle2" });

// Wait for XHR response — often cleaner than DOM polling
const [response] = await Promise.all([
  page.waitForResponse((res) => res.url().includes("/api/products")),
  page.goto(url),
]);
const data = await response.json(); // structured data from the API
console.log("API response:", data);
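
When neither a single selector nor network idleness captures "ready", page.waitForFunction can poll a condition inside the page. A sketch, assuming readiness means at least a given number of cards have rendered; waitForMinCount is a hypothetical helper name, and the selector in the usage note is borrowed from the examples above.

```javascript
// Hypothetical helper: resolve once at least `minCount` elements match
// `selector`. `page` is a Puppeteer Page; the predicate runs in the browser.
async function waitForMinCount(page, selector, minCount, timeout = 10000) {
  await page.waitForFunction(
    (sel, min) => document.querySelectorAll(sel).length >= min,
    { timeout },
    selector,
    minCount
  );
}

// Usage:
//   await page.goto(url);
//   await waitForMinCount(page, "div.product-card", 12);
```

The selector and count are passed as arguments rather than captured in a closure because the predicate is serialized and evaluated in the page, where Node.js variables are not visible.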

Setting User Agent and Viewport

Sites that screen for automated traffic can recognize Puppeteer's default configuration. Set a realistic user agent and viewport to match a real browser more closely.

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await page.setUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
);
await page.setViewport({ width: 1280, height: 900 });
await page.setExtraHTTPHeaders({
  "Accept-Language": "en-US,en;q=0.9",
});

Handling Pagination with Puppeteer

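Navigate through paginated sites by clicking the next-page button or by constructing page URLs. The example below constructs URLs and stops when a page returns no items or has no next-page link.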

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
const allResults = [];

let currentPage = 1;
while (currentPage <= 20) {
  await page.goto(`https://example.com/articles?page=${currentPage}`, {
    waitUntil: "networkidle2",
  });

  const items = await page.$$eval(".article-card", (els) =>
    els.map((el) => ({
      title: el.querySelector("h2")?.textContent?.trim() ?? "",
      date: el.querySelector("time")?.getAttribute("datetime") ?? "",
    }))
  );

  if (items.length === 0) break;
  allResults.push(...items);

  const nextBtn = await page.$("a.next-page");
  if (!nextBtn) break;
  currentPage++;
  await new Promise((r) => setTimeout(r, 1000)); // polite delay
}

await browser.close();
console.log(`Collected ${allResults.length} articles`);
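
The loop above builds page URLs. Where the next page is reachable only through the button, the same loop can click it and wait for the resulting navigation. A sketch under assumptions: collectByClickingNext is a hypothetical helper, the a.next-page selector comes from the example above, and the click is assumed to trigger a full navigation rather than an in-page update.

```javascript
// Hypothetical helper: collect items page by page, following the next-page
// link by clicking it. `page` is a Puppeteer Page that has already been
// navigated to the first page; `extractItems` returns that page's items.
async function collectByClickingNext(page, extractItems, maxPages = 20) {
  const all = [];
  for (let n = 1; n <= maxPages; n++) {
    all.push(...(await extractItems(page)));
    if (n === maxPages) break;

    const nextBtn = await page.$("a.next-page");
    if (!nextBtn) break; // last page reached

    // Start waiting before clicking so the navigation is not missed.
    await Promise.all([
      page.waitForNavigation({ waitUntil: "networkidle2" }),
      nextBtn.click(),
    ]);
  }
  return all;
}

// Usage:
//   await page.goto("https://example.com/articles");
//   const items = await collectByClickingNext(page, (p) =>
//     p.$$eval(".article-card", (els) => els.map((el) => el.textContent.trim()))
//   );
```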

Practical Limitations for Production Scraping

Puppeteer is excellent for development and interactive scraping, but has significant costs at production scale:

Memory: Each Chrome instance uses 200–500 MB. Running 10 parallel scrapers requires 2–5 GB RAM.

Speed: Browser launch plus page render takes 3–15 seconds per page, so a 100-page scrape run sequentially takes 5–25 minutes.

Maintenance: Bundled Chrome updates require dependency updates. Chrome version mismatches cause failures.

Detection: Headless Chrome exposes signals through navigator properties, timing, and rendering characteristics. Sites that screen for automated traffic often identify and restrict headless sessions.

When Puppeteer is the right choice: Interaction-heavy flows (login, form submission, multi-step navigation), browser testing/QA, or low-volume one-off data collection.

When a rendering API is more practical: High-volume production scraping, when you cannot maintain browser infrastructure, or when you need reliable IP rotation without additional proxy setup.
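
The memory and speed figures above are simple multiplication, which makes it easy to estimate a job before committing to infrastructure. A sketch using the ranges quoted in this section: estimateJob is a hypothetical helper, it assumes pages are fetched one after another, and it treats 1 GB as 1,000 MB, as the figures above do.

```javascript
// Ranges quoted in this guide.
const MB_PER_INSTANCE = { min: 200, max: 500 };
const SECONDS_PER_PAGE = { min: 3, max: 15 };

// Hypothetical helper: rough memory and time bounds for a scraping job.
function estimateJob(pageCount, parallelBrowsers) {
  return {
    memoryGb: {
      min: (MB_PER_INSTANCE.min * parallelBrowsers) / 1000,
      max: (MB_PER_INSTANCE.max * parallelBrowsers) / 1000,
    },
    sequentialMinutes: {
      min: (SECONDS_PER_PAGE.min * pageCount) / 60,
      max: (SECONDS_PER_PAGE.max * pageCount) / 60,
    },
  };
}

const estimate = estimateJob(100, 10);
console.log(estimate.memoryGb);          // { min: 2, max: 5 }
console.log(estimate.sequentialMinutes); // { min: 5, max: 25 }
```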

Complete Puppeteer Scraper — Multi-Page Data Collection

Complete working Puppeteer scraper with pagination, realistic browser configuration, and error handling.

import puppeteer from "puppeteer";
import { writeFileSync } from "fs";

async function scrapeSite(baseUrl, maxPages = 10) {
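  const browser = await puppeteer.launch({
    headless: true,
    // Disabling the sandbox is often needed when Chrome runs as root in a
    // container; drop these flags where the sandbox works.
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });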

  const page = await browser.newPage();
  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
  );
  await page.setViewport({ width: 1280, height: 900 });

  const allResults = [];
  let pageNum = 1;

  try {
    while (pageNum <= maxPages) {
      const url = `${baseUrl}?page=${pageNum}`;
      console.log(`Scraping ${url}...`);

      await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });

      try {
        await page.waitForSelector(".product-card", { timeout: 5000 });
      } catch {
        console.log("No products found — stopping");
        break;
      }

      const products = await page.$$eval(".product-card", (els) =>
        els.map((el) => ({
          title: el.querySelector("h2")?.textContent?.trim() ?? "",
          price: el.querySelector(".price")?.textContent?.trim() ?? "",
          url: el.querySelector("a")?.href ?? "",
        }))
      );

      if (products.length === 0) break;
      allResults.push(...products);

      const hasNext = await page.$("a.next-page");
      if (!hasNext) break;

      pageNum++;
      await new Promise((r) => setTimeout(r, 1500)); // polite delay
    }
  } finally {
    await browser.close();
  }

  return allResults;
}

const results = await scrapeSite("https://example.com/products");
writeFileSync("products.json", JSON.stringify(results, null, 2));
console.log(`Saved ${results.length} products`);
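
The scraper above closes the browser cleanly on failure, but a single navigation timeout still ends the whole run. A common addition is to retry page.goto a few times with a growing delay. A sketch, not part of Puppeteer: gotoWithRetry and its attempts and baseDelayMs parameters are hypothetical.

```javascript
// Hypothetical helper: retry page.goto with exponential backoff.
// `page` is a Puppeteer Page. Rethrows the last error if all attempts fail.
async function gotoWithRetry(page, url, { attempts = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt.
        const delay = baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Usage inside the scraping loop, in place of the plain page.goto call:
//   await gotoWithRetry(page, url, { attempts: 3, baseDelayMs: 1000 });
```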

Same Result, No Chrome Process

When you just need rendered HTML — not complex browser interactions — AlterLab handles the browser server-side. No Puppeteer install, no Chrome binary, no memory overhead. From $0.0002/request with 5,000 free requests to start.

import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const API_KEY = "YOUR_API_KEY"; // Get free at alterlab.io

async function scrapeWithAlterLab(url, renderJs = false) {
  const response = await fetch("https://api.alterlab.io/api/v1/scrape", {
    method: "POST",
    headers: {
      "X-API-Key": API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, render_js: renderJs }),
    signal: AbortSignal.timeout(30000),
  });
  if (!response.ok) throw new Error(`API error: ${response.status}`);
  const data = await response.json();
  return data.html ?? "";
}

const allResults = [];
for (let pageNum = 1; pageNum <= 10; pageNum++) {
  const url = `https://example.com/products?page=${pageNum}`;
  const html = await scrapeWithAlterLab(url, true); // render_js: true

  const $ = cheerio.load(html);
  const products = [];
  $(".product-card").each((_i, el) => {
    const title = $(el).find("h2").text().trim();
    const price = $(el).find(".price").text().trim();
    if (title) products.push({ title, price });
  });

  if (products.length === 0) break;
  allResults.push(...products);
}

writeFileSync("products.json", JSON.stringify(allResults, null, 2));
console.log(`Saved ${allResults.length} products — no Chrome process running`);

Puppeteer vs Alternatives

Puppeteer (local Chrome)

Pros

  • Full browser interaction
  • Intercept network requests
  • Free to run

Cons

  • 200–500 MB per browser instance
  • 3–15 seconds per page
  • Chrome detection is common
  • Complex scaling and crash handling

Puppeteer + proxies

Pros

  • Handles IP-based rate limiting
  • More reliable on protected sites

Cons

  • Proxy cost + browser cost
  • Complex proxy rotation setup
  • Still slow and memory-heavy

AlterLab rendering API

Pros

  • No Chrome management
  • Automatic IP rotation
  • 5-tier compatibility escalation
  • From $0.0002/request
  • No CPU or memory overhead

Cons

  • Per-request cost
  • Cannot perform complex UI interactions

