
Web Scraping with Node.js and Puppeteer: The Complete 2026 Guide

Yash Dubey

February 19, 2026

16 min read

Puppeteer is Google's official Node.js library for controlling Chrome and Chromium. Unlike HTTP-based scraping tools like Axios or Cheerio, Puppeteer runs a real browser. That means it can render JavaScript, click buttons, fill forms, scroll through infinite feeds, and screenshot pages exactly as a human would see them.

This guide walks through every technique you will need to build production-grade scrapers with Puppeteer in 2026 — from basic page fetches to handling anti-bot systems, proxy rotation, and browser pooling. Every code example is tested and ready to use.

90k+ GitHub Stars · 12M+ Weekly npm Downloads · Chrome 132 Latest Supported Engine

Why Puppeteer for Web Scraping

HTTP-only scrapers (Axios + Cheerio, Got + JSDOM) send a raw request and parse whatever HTML comes back. That works for static sites. But the modern web runs on JavaScript. React, Vue, and Angular apps render their content client-side. Product listings load via XHR calls after the initial page load. Prices appear only after a scroll event fires.

Puppeteer solves this by giving you a full Chromium instance. You get:

  • JavaScript execution — SPAs render completely before you extract data
  • User interaction — click buttons, fill forms, navigate pagination
  • Network control — intercept requests, block resources, capture API responses
  • Screenshots and PDFs — visual verification and archival
  • Cookie and session management — handle logins and authenticated scraping

The tradeoff is resource usage. Each Puppeteer instance launches a Chromium process that consumes 100-300 MB of RAM. For static HTML scraping, Cheerio is 50x lighter. Use Puppeteer when the target site requires a real browser to produce the data you need.

Setup and Installation

You need Node.js 18+ (LTS recommended). Puppeteer bundles its own Chromium binary, so there is nothing else to install.

bash
mkdir my-scraper && cd my-scraper
npm init -y
npm install puppeteer

If you are deploying to a server and want to use an existing Chrome installation instead:

bash
npm install puppeteer-core

With puppeteer-core, you point Puppeteer at the browser yourself via the executablePath launch option — useful for Docker containers where you install Chromium via apt.

Verify the installation:

javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const title = await page.title();
  console.log(`Page title: ${title}`);
  await browser.close();
})();

Save it as index.js and run node index.js. You should see "Page title: Example Domain" in your terminal.

Basic Page Scraping

Every Puppeteer scraper follows the same pattern: launch a browser, open a page, navigate to a URL, wait for content, extract data, and close.

javascript
const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  // Set a realistic viewport and user agent
  await page.setViewport({ width: 1920, height: 1080 });
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36'
  );

  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

  // Wait for a specific element before extracting
  await page.waitForSelector('.product-title', { timeout: 10000 });

  const product = await page.evaluate(() => {
    return {
      title: document.querySelector('.product-title')?.textContent.trim(),
      price: document.querySelector('.product-price')?.textContent.trim(),
      description: document.querySelector('.product-desc')?.textContent.trim(),
      inStock: !document.querySelector('.out-of-stock'),
    };
  });

  await browser.close();
  return product;
}

Key points:

  • waitUntil: 'networkidle2' waits until there are no more than 2 network connections for 500ms. This handles most SPA rendering.
  • waitForSelector ensures the element you need is actually in the DOM before you try to read it.
  • page.evaluate runs JavaScript inside the browser context. The function you pass has access to document, window, and the full DOM — but not your Node.js variables.

Handling Dynamic Content

Single-Page Applications

SPAs often load data after the initial HTML. The page source is just a <div id="root"></div> with a JavaScript bundle. Puppeteer handles this by default since it runs the JavaScript, but you need to wait for the right moment.

javascript
// Wait for data to render (not just the shell)
await page.waitForFunction(
  () => document.querySelectorAll('.product-card').length > 0,
  { timeout: 15000 }
);

Infinite Scroll

Many sites load content as you scroll down. You need to simulate scrolling and wait for new items to appear.

javascript
async function scrapeInfiniteScroll(page, maxItems = 100) {
  let items = [];
  let previousHeight = 0;

  while (items.length < maxItems) {
    // Scroll to bottom
    previousHeight = await page.evaluate(() => document.body.scrollHeight);
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    // Wait for new content to load
    try {
      await page.waitForFunction(
        `document.body.scrollHeight > ${previousHeight}`,
        { timeout: 5000 }
      );
    } catch {
      break; // No more content to load
    }

    // Small delay for rendering
    await new Promise(r => setTimeout(r, 1000));

    // Extract current items
    items = await page.evaluate(() =>
      [...document.querySelectorAll('.item')].map(el => ({
        name: el.querySelector('.name')?.textContent.trim(),
        price: el.querySelector('.price')?.textContent.trim(),
      }))
    );
  }

  return items.slice(0, maxItems);
}

Lazy-Loaded Images

Images with loading="lazy" or intersection observer patterns only load when they enter the viewport. Scroll them into view first:

javascript
async function loadLazyImages(page) {
  await page.evaluate(async () => {
    const images = document.querySelectorAll('img[loading="lazy"]');
    for (const img of images) {
      img.scrollIntoView({ behavior: 'instant' });
      await new Promise(r => setTimeout(r, 200));
    }
    // Scroll back to top
    window.scrollTo(0, 0);
  });
  // Wait for images to finish loading
  await new Promise(r => setTimeout(r, 2000));
}

Form Interaction

Login Flows

Many scraping targets require authentication. Puppeteer can fill forms and submit them just like a user.

javascript
async function loginAndScrape(url, username, password) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  await page.goto('https://example.com/login', { waitUntil: 'networkidle2' });

  // Type credentials with realistic delays
  await page.type('#username', username, { delay: 50 });
  await page.type('#password', password, { delay: 50 });

  // Click submit and wait for navigation
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click('#login-button'),
  ]);

  // Now scrape authenticated content
  await page.goto(url, { waitUntil: 'networkidle2' });
  const data = await page.evaluate(() => {
    return document.querySelector('.dashboard-data')?.textContent;
  });

  await browser.close();
  return data;
}

Search Forms

javascript
// Type into a search box, wait for autocomplete, select a result
await page.type('#search-input', 'web scraping api', { delay: 100 });
await page.waitForSelector('.autocomplete-results', { timeout: 5000 });
await page.click('.autocomplete-results li:first-child');
await page.waitForNavigation({ waitUntil: 'networkidle2' });

Handling Pagination

Most scraping jobs involve multiple pages. Here is a reliable pattern for numbered pagination:

javascript
async function scrapeAllPages(baseUrl) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  let allResults = [];
  let currentPage = 1;

  while (true) {
    const url = `${baseUrl}?page=${currentPage}`;
    await page.goto(url, { waitUntil: 'networkidle2' });

    const pageResults = await page.evaluate(() =>
      [...document.querySelectorAll('.result-item')].map(el => ({
        title: el.querySelector('h3')?.textContent.trim(),
        link: el.querySelector('a')?.href,
      }))
    );

    if (pageResults.length === 0) break;

    allResults.push(...pageResults);
    console.log(`Page ${currentPage}: ${pageResults.length} results`);

    // Check if there is a next page
    const hasNext = await page.$('.pagination .next:not(.disabled)');
    if (!hasNext) break;

    currentPage++;

    // Respectful delay between pages
    await new Promise(r => setTimeout(r, 1500 + Math.random() * 1000));
  }

  await browser.close();
  return allResults;
}

For "Load More" button pagination:

javascript
async function scrapeLoadMore(page) {
  let clickCount = 0;
  while (clickCount < 20) {
    const loadMoreBtn = await page.$('.load-more-button');
    if (!loadMoreBtn) break;

    // Capture the item count *before* clicking so the wait has a stable baseline
    const prevCount = await page.evaluate(
      () => document.querySelectorAll('.item').length
    );
    await loadMoreBtn.click();

    try {
      await page.waitForFunction(
        (count) => document.querySelectorAll('.item').length > count,
        { timeout: 8000 },
        prevCount
      );
    } catch {
      break; // No new items appeared — assume the end of the list
    }
    clickCount++;
  }
}

Network Request Interception

Blocking unnecessary resources makes your scraper faster and reduces bandwidth. Images, fonts, and CSS are rarely needed for data extraction.

javascript
async function setupRequestInterception(page) {
  await page.setRequestInterception(true);

  page.on('request', (request) => {
    const blockedTypes = ['image', 'stylesheet', 'font', 'media'];
    const blockedDomains = ['google-analytics.com', 'facebook.net', 'doubleclick.net'];

    if (blockedTypes.includes(request.resourceType())) {
      request.abort();
    } else if (blockedDomains.some(d => request.url().includes(d))) {
      request.abort();
    } else {
      request.continue();
    }
  });
}

This can reduce page load times by 40-60% and cut bandwidth by 70%+. Always block analytics and tracking scripts — they slow down scraping and serve no purpose for data extraction.
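The accept/block decision itself is plain logic, so it helps to factor it into a pure function you can unit-test without a browser. A small sketch (function and constant names are ours, not a Puppeteer API):

```javascript
// Pure block/allow decision, matching the interception handler above
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);
const BLOCKED_DOMAINS = ['google-analytics.com', 'facebook.net', 'doubleclick.net'];

function shouldBlock(resourceType, url) {
  return BLOCKED_TYPES.has(resourceType) ||
         BLOCKED_DOMAINS.some(d => url.includes(d));
}

console.log(shouldBlock('image', 'https://example.com/a.png'));           // true
console.log(shouldBlock('script', 'https://www.google-analytics.com/x')); // true
console.log(shouldBlock('document', 'https://example.com/'));             // false
```

Inside the request handler this becomes a one-liner: `shouldBlock(req.resourceType(), req.url()) ? req.abort() : req.continue()`.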

Screenshots and PDF Generation

Puppeteer can capture full-page screenshots and generate PDFs, which is useful for visual verification, archival, or monitoring changes.

javascript
// Full page screenshot
await page.screenshot({
  path: 'page.png',
  fullPage: true,
});

// Screenshot of a specific element
const element = await page.$('.product-card');
await element.screenshot({ path: 'product.png' });

// Generate PDF (works only in headless mode)
await page.pdf({
  path: 'page.pdf',
  format: 'A4',
  printBackground: true,
  margin: { top: '1cm', right: '1cm', bottom: '1cm', left: '1cm' },
});

A practical use case: take before/after screenshots of product pages to detect price changes or layout shifts.

Advanced Data Extraction

Scraping Tables

HTML tables are everywhere — product specs, financial data, comparison pages. Here is a generic table scraper:

javascript
async function scrapeTable(page, tableSelector) {
  return await page.evaluate((selector) => {
    const table = document.querySelector(selector);
    if (!table) return null;

    const headers = [...table.querySelectorAll('thead th')].map(
      th => th.textContent.trim()
    );

    const rows = [...table.querySelectorAll('tbody tr')].map(tr => {
      const cells = [...tr.querySelectorAll('td')].map(
        td => td.textContent.trim()
      );
      return Object.fromEntries(headers.map((h, i) => [h, cells[i]]));
    });

    return { headers, rows };
  }, tableSelector);
}

// Usage
const data = await scrapeTable(page, '#pricing-table');
// Returns: { headers: ['Plan', 'Price', 'Features'], rows: [{...}, {...}] }

Shadow DOM

Some modern web components use Shadow DOM, which hides elements from regular querySelector calls. You need to pierce through shadow roots:

javascript
const shadowData = await page.evaluate(() => {
  const host = document.querySelector('product-card');
  // shadowRoot is null for closed shadow roots — guard before querying
  const shadow = host?.shadowRoot;
  if (!shadow) return null;
  return {
    title: shadow.querySelector('.title')?.textContent.trim(),
    price: shadow.querySelector('.price')?.textContent.trim(),
  };
});

Intercepting XHR/Fetch Responses

Sometimes the cleanest approach is to intercept the API calls the page makes internally, rather than parsing the rendered HTML:

javascript
async function interceptApiData(page, url) {
  const apiData = [];

  page.on('response', async (response) => {
    const reqUrl = response.url();
    if (reqUrl.includes('/api/products') && response.status() === 200) {
      try {
        const json = await response.json();
        apiData.push(...json.results);
      } catch (e) {
        // Not JSON, skip
      }
    }
  });

  await page.goto(url, { waitUntil: 'networkidle2' });
  return apiData;
}

This technique is extremely powerful. Many sites fetch structured JSON from their own APIs and then render it into HTML. By intercepting the response, you get clean, structured data without parsing DOM elements at all.

Error Handling and Retry Patterns

Production scrapers need robust error handling. Network timeouts, selector changes, and rate limiting will all happen.

javascript
async function scrapeWithRetry(url, maxRetries = 3) {
  let browser;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      browser = await puppeteer.launch({ headless: 'new' });
      const page = await browser.newPage();
      page.setDefaultTimeout(15000);

      await page.goto(url, { waitUntil: 'networkidle2' });
      await page.waitForSelector('.data-container');

      const data = await page.evaluate(() => {
        return document.querySelector('.data-container')?.textContent.trim();
      });

      if (!data) throw new Error('Empty data extracted');
      return data;

    } catch (error) {
      console.error(`Attempt ${attempt}/${maxRetries} failed: ${error.message}`);

      if (attempt === maxRetries) {
        throw new Error(`All ${maxRetries} attempts failed for ${url}`);
      }

      // Exponential backoff: 2s, 4s, 8s...
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise(r => setTimeout(r, delay));

    } finally {
      if (browser) await browser.close();
    }
  }
}

Key practices:

  • Always close the browser in a finally block to prevent orphan Chromium processes
  • Validate extracted data — an empty result is as bad as an error
  • Use exponential backoff — hammering a rate-limited server with immediate retries makes things worse
  • Set explicit timeouts — the default 30-second timeout is too long for most scraping jobs
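The fixed 2s/4s/8s backoff in the example works, but when many workers retry against the same host, adding jitter spreads the retries out instead of synchronizing them. A small helper sketch (the function name and defaults are ours):

```javascript
// Exponential backoff with full jitter (hypothetical helper, not a Puppeteer API)
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  // Exponential growth: 2s, 4s, 8s... capped at maxMs
  const exp = Math.min(baseMs * Math.pow(2, attempt), maxMs);
  // Full jitter: a uniform delay in [0, exp) avoids synchronized retry storms
  return Math.floor(Math.random() * exp);
}
```

Swap it into the retry loop as `await new Promise(r => setTimeout(r, backoffDelay(attempt)))`.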

Anti-Bot Detection and Stealth

Out of the box, Puppeteer is trivially detectable. The navigator.webdriver flag is set to true, plugin arrays are empty, and WebGL reports SwiftShader instead of real GPU hardware. Every anti-bot service checks these in milliseconds.
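As a rough illustration, that first round of fingerprint checks boils down to a few property reads. This is a simplified stand-in — the field names mirror the signals described above, not any vendor's actual code:

```javascript
// Simplified sketch of a fingerprint check; `fp` is a plain object standing
// in for values a detector would read from navigator and WebGL
function looksAutomated(fp) {
  return fp.webdriver === true ||
         fp.pluginCount === 0 ||
         /SwiftShader/i.test(fp.webglRenderer);
}

// A default headless Puppeteer launch presents roughly this fingerprint:
console.log(looksAutomated({ webdriver: true, pluginCount: 0, webglRenderer: 'Google SwiftShader' })); // true
// A typical real user's browser:
console.log(looksAutomated({ webdriver: false, pluginCount: 5, webglRenderer: 'ANGLE (NVIDIA GeForce)' })); // false
```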

The puppeteer-extra-plugin-stealth package patches most of these signals:

bash
npm install puppeteer-extra puppeteer-extra-plugin-stealth

javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-blink-features=AutomationControlled',
    ],
  });

  const page = await browser.newPage();

  // Randomize viewport to look less bot-like
  const width = 1280 + Math.floor(Math.random() * 640);
  const height = 800 + Math.floor(Math.random() * 280);
  await page.setViewport({ width, height });

  await page.goto('https://bot.sannysoft.com', { waitUntil: 'networkidle2' });
  await page.screenshot({ path: 'stealth-test.png', fullPage: true });
  await browser.close();
})();

The stealth plugin handles:

  • Removing navigator.webdriver flag
  • Faking navigator.plugins with realistic entries
  • Spoofing WebGL vendor and renderer strings
  • Patching Chrome runtime objects (window.chrome)
  • Fixing navigator.permissions behavior
  • Spoofing navigator.languages properly
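Under the hood, most of these patches are property redefinitions injected before any page script runs. A browser-free sketch of the webdriver patch, using a plain object as a stand-in for the real navigator:

```javascript
// Stand-in object in place of the real navigator (so this runs in plain Node)
const fakeNavigator = { webdriver: true };

// The patch redefines the getter so detection scripts read undefined,
// which is what a real, non-automated browser reports
Object.defineProperty(fakeNavigator, 'webdriver', {
  get: () => undefined,
  configurable: true,
});

console.log(fakeNavigator.webdriver); // undefined
```

In a real page the plugin applies this kind of override via evaluateOnNewDocument, so it is in place before the site's own JavaScript executes.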

Even with stealth, advanced anti-bot systems like Cloudflare Turnstile, DataDome, and PerimeterX use behavioral analysis — mouse movement patterns, scroll velocity, and timing between actions. You can partially address this:

javascript
// Simulate human-like mouse movement
async function humanMove(page, x, y) {
  const steps = 10 + Math.floor(Math.random() * 15);
  await page.mouse.move(x, y, { steps });
  await new Promise(r => setTimeout(r, 100 + Math.random() * 300));
}

// Random delays between actions
async function humanDelay(min = 500, max = 2000) {
  const delay = min + Math.random() * (max - min);
  await new Promise(r => setTimeout(r, delay));
}

Proxy Rotation

Sending all requests from one IP address gets you blocked fast. Proxy rotation distributes your requests across many IPs.

javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

const proxies = [
  'http://user:[email protected]:8080',
  'http://user:[email protected]:8080',
  'http://user:[email protected]:8080',
];

async function scrapeWithProxy(url) {
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];
  const proxyUrl = new URL(proxy);

  const browser = await puppeteer.launch({
    headless: 'new',
    args: [`--proxy-server=${proxyUrl.host}`],
  });

  const page = await browser.newPage();

  // Authenticate with the proxy
  if (proxyUrl.username) {
    await page.authenticate({
      username: proxyUrl.username,
      password: proxyUrl.password,
    });
  }

  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
    return await page.content();
  } finally {
    await browser.close();
  }
}

A smarter approach rotates proxies based on failure rate. If a proxy gets blocked, move it to a cooldown list:

javascript
class ProxyPool {
  constructor(proxies) {
    this.available = [...proxies];
    this.cooldown = new Map(); // proxy -> cooldown expiry timestamp
  }

  getProxy() {
    // Move expired cooldowns back to available
    const now = Date.now();
    for (const [proxy, expiry] of this.cooldown) {
      if (now > expiry) {
        this.available.push(proxy);
        this.cooldown.delete(proxy);
      }
    }

    if (this.available.length === 0) {
      throw new Error('No proxies available');
    }

    const index = Math.floor(Math.random() * this.available.length);
    return this.available[index];
  }

  markFailed(proxy) {
    this.available = this.available.filter(p => p !== proxy);
    this.cooldown.set(proxy, Date.now() + 5 * 60 * 1000); // 5 min cooldown
  }
}

Scaling with Browser Pools

Running one browser at a time is fine for small jobs. For serious scraping, you need a pool of browsers running in parallel with concurrency limits.

javascript
const puppeteer = require('puppeteer');

class BrowserPool {
  constructor(maxBrowsers = 5) {
    this.maxBrowsers = maxBrowsers;
    this.activeBrowsers = 0;
    this.queue = [];
  }

  async acquire() {
    if (this.activeBrowsers < this.maxBrowsers) {
      this.activeBrowsers++;
      return await puppeteer.launch({
        headless: 'new',
        args: ['--no-sandbox', '--disable-dev-shm-usage'],
      });
    }

    // Wait for a browser to become available
    return new Promise((resolve) => {
      this.queue.push(resolve);
    });
  }

  async release(browser) {
    await browser.close();

    if (this.queue.length > 0) {
      const next = this.queue.shift();
      const newBrowser = await puppeteer.launch({
        headless: 'new',
        args: ['--no-sandbox', '--disable-dev-shm-usage'],
      });
      next(newBrowser);
    } else {
      this.activeBrowsers--;
    }
  }
}

// Usage: scrape 100 URLs with max 5 concurrent browsers
async function scrapeMany(urls) {
  const pool = new BrowserPool(5);
  const results = [];

  const tasks = urls.map(async (url) => {
    const browser = await pool.acquire();
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 20000 });
      const title = await page.title();
      results.push({ url, title });
    } catch (error) {
      results.push({ url, error: error.message });
    } finally {
      await pool.release(browser);
    }
  });

  await Promise.all(tasks);
  return results;
}

Scaling Puppeteer beyond a single machine gets complicated fast. Each browser instance needs 100-300 MB of RAM. A 4 GB server can realistically run 10-15 concurrent browsers. For higher throughput, you are looking at container orchestration, process monitoring, and failure recovery — which is essentially building your own scraping infrastructure.
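That RAM math can be sanity-checked with a quick back-of-envelope helper — our own sketch, using the 300 MB per-browser upper bound from above and reserving 1 GB for the OS and Node itself:

```javascript
// Rough concurrency ceiling for a server, given per-browser RAM cost
function maxConcurrentBrowsers(totalRamMb, perBrowserMb = 300, reserveMb = 1024) {
  return Math.max(0, Math.floor((totalRamMb - reserveMb) / perBrowserMb));
}

console.log(maxConcurrentBrowsers(4096));  // 10 — the low end of the 10-15 range for a 4 GB server
console.log(maxConcurrentBrowsers(16384)); // 51
```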

The full workflow, end to end:

  1. Launch Browser — Puppeteer starts a Chromium instance with stealth patches and proxy configuration
  2. Navigate and Wait — go to the target URL, wait for JavaScript rendering and dynamic content to load
  3. Extract Data — run page.evaluate() to pull structured data from the DOM or intercept API responses
  4. Handle Errors — retry on failure with exponential backoff, rotate proxies on blocks
  5. Store Results — save extracted data to JSON, CSV, or database — validate before writing

When to Use a Scraping API Instead

Building and maintaining a Puppeteer scraping pipeline is real engineering work. You are responsible for browser management, proxy infrastructure, anti-bot bypass, CAPTCHA solving, retry logic, and monitoring. For one-off projects or simple targets, that is fine.

But when you are scraping at scale against sites with serious bot protection, the maintenance burden grows fast. Cloudflare updates its detection every few weeks. Proxy providers rotate your IPs into already-burned ranges. Your stealth patches break when Chrome updates. You spend more time maintaining infrastructure than building the product that needs the data.

This is where a scraping API makes sense. Instead of managing browsers, proxies, and stealth patches yourself, you send an HTTP request and get back the rendered HTML or structured data.

AlterLab handles the hard parts — anti-bot bypass across Cloudflare, DataDome, and PerimeterX, automatic proxy rotation through residential and datacenter pools, and headless browser rendering. A single API call replaces hundreds of lines of Puppeteer infrastructure code:

bash
curl -X POST https://alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "formats": ["text", "markdown"]}'

No browser pools to manage. No proxy rotation to implement. No stealth plugin updates to track.

| Feature                | DIY Puppeteer                 | AlterLab API            |
| ---------------------- | ----------------------------- | ----------------------- |
| Setup Time             | Hours to days                 | 5 minutes               |
| Anti-Bot Bypass        | Manual (breaks often)         | Built-in (auto-updated) |
| Proxy Management       | Self-managed pool             | Included                |
| Browser Infrastructure | Your servers                  | Managed                 |
| Maintenance            | Ongoing                       | None                    |
| Scaling                | Complex orchestration         | Increase API calls      |
| Cost at 10K pages/day  | $200-500/mo servers + proxies | $49/mo                  |
| JavaScript Rendering   | Yes                           | Yes                     |

The decision is straightforward: use Puppeteer when you need full browser control (custom interactions, screenshots, specific workflows). Use a scraping API when you need reliable data extraction at scale without the infrastructure overhead.

Complete Example: E-Commerce Product Scraper

Here is a complete, production-ready scraper that extracts product data from an e-commerce listing page, handles pagination, retries on failure, and saves results to JSON.

javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs').promises;

puppeteer.use(StealthPlugin());

const CONFIG = {
  baseUrl: 'https://example-store.com/category/electronics',
  maxPages: 10,
  maxRetries: 3,
  concurrency: 3,
  outputFile: 'products.json',
  delayBetweenPages: [1500, 3000], // min/max ms
};

async function randomDelay([min, max]) {
  const delay = min + Math.random() * (max - min);
  await new Promise(r => setTimeout(r, delay));
}

async function extractProducts(page) {
  return await page.evaluate(() => {
    return [...document.querySelectorAll('.product-card')].map(card => ({
      name: card.querySelector('.product-name')?.textContent.trim() || '',
      price: card.querySelector('.product-price')?.textContent.trim() || '',
      rating: card.querySelector('.rating-value')?.textContent.trim() || '',
      reviewCount: card.querySelector('.review-count')?.textContent.trim() || '',
      url: card.querySelector('a.product-link')?.href || '',
      image: card.querySelector('img.product-image')?.src || '',
      availability: card.querySelector('.stock-status')?.textContent.trim() || '',
      scraped_at: new Date().toISOString(),
    }));
  });
}

async function scrapePage(browser, url, retries = CONFIG.maxRetries) {
  const page = await browser.newPage();

  try {
    // Block heavy resources
    await page.setRequestInterception(true);
    page.on('request', req => {
      if (['image', 'stylesheet', 'font', 'media'].includes(req.resourceType())) {
        req.abort();
      } else {
        req.continue();
      }
    });

    await page.setViewport({ width: 1366, height: 768 });
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 25000 });

    // Wait for product cards to render
    await page.waitForSelector('.product-card', { timeout: 10000 });

    const products = await extractProducts(page);
    console.log(`  Extracted ${products.length} products from ${url}`);
    return products;

  } catch (error) {
    if (retries > 0) {
      console.warn(`  Retry (${CONFIG.maxRetries - retries + 1}): ${error.message}`);
      await new Promise(r => setTimeout(r, 3000));
      return scrapePage(browser, url, retries - 1);
    }
    console.error(`  Failed after ${CONFIG.maxRetries} retries: ${url}`);
    return [];

  } finally {
    await page.close();
  }
}

async function main() {
  console.log('Starting e-commerce scraper...');

  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-blink-features=AutomationControlled',
    ],
  });

  const allProducts = [];

  try {
    for (let pageNum = 1; pageNum <= CONFIG.maxPages; pageNum++) {
      const url = `${CONFIG.baseUrl}?page=${pageNum}`;
      console.log(`Scraping page ${pageNum}/${CONFIG.maxPages}: ${url}`);

      const products = await scrapePage(browser, url);

      if (products.length === 0) {
        console.log('No products found — reached last page.');
        break;
      }

      allProducts.push(...products);

      // Respectful delay between pages
      if (pageNum < CONFIG.maxPages) {
        await randomDelay(CONFIG.delayBetweenPages);
      }
    }

    // Save results
    await fs.writeFile(
      CONFIG.outputFile,
      JSON.stringify(allProducts, null, 2),
      'utf-8'
    );

    console.log(`\nDone. Scraped ${allProducts.length} products across ${CONFIG.maxPages} pages.`);
    console.log(`Results saved to ${CONFIG.outputFile}`);

  } finally {
    await browser.close();
  }
}

main().catch(console.error);

This scraper includes everything covered in this guide: stealth mode, request interception, proper waits, retry logic, respectful delays, and clean resource management. To adapt it to a real target, you only need to update the CSS selectors in extractProducts and the base URL.

What to Remember

Puppeteer gives you complete control over a real browser, which makes it the right choice for scraping JavaScript-heavy sites, handling complex interactions, and extracting data that HTTP-only tools cannot reach. The cost is complexity — you are managing browser processes, memory, proxies, and stealth.

For projects where you need that level of control, the patterns in this guide will get you to production. For projects where you just need the data, consider whether the infrastructure overhead is worth it. A scraping API like AlterLab can reduce weeks of Puppeteer infrastructure work to a single HTTP call — letting you focus on what you are building instead of how you are scraping.
