
Web Scraping with Node.js and Puppeteer: The Complete 2026 Guide

Yash Dubey

February 19, 2026

16 min read

Puppeteer is Google's official Node.js library for controlling Chrome and Chromium. Unlike HTTP-based scraping tools like Axios or Cheerio, Puppeteer runs a real browser. That means it can render JavaScript, click buttons, fill forms, scroll through infinite feeds, and screenshot pages exactly as a human would see them.

This guide walks through every technique you will need to build production-grade scrapers with Puppeteer in 2026 — from basic page fetches to handling anti-bot systems, proxy rotation, and browser pooling. Every code example is tested and ready to use.

90k+ GitHub Stars · 12M+ Weekly npm Downloads · Chrome 132 Latest Supported Engine

Why Puppeteer for Web Scraping

HTTP-only scrapers (Axios + Cheerio, Got + JSDOM) send a raw request and parse whatever HTML comes back. That works for static sites. But the modern web runs on JavaScript. React, Vue, and Angular apps render their content client-side. Product listings load via XHR calls after the initial page load. Prices appear only after a scroll event fires.

Puppeteer solves this by giving you a full Chromium instance. You get:

  • JavaScript execution — SPAs render completely before you extract data
  • User interaction — click buttons, fill forms, navigate pagination
  • Network control — intercept requests, block resources, capture API responses
  • Screenshots and PDFs — visual verification and archival
  • Cookie and session management — handle logins and authenticated scraping

The tradeoff is resource usage. Each Puppeteer instance launches a Chromium process that consumes 100-300 MB of RAM. For static HTML scraping, Cheerio is 50x lighter. Use Puppeteer when the target site requires a real browser to produce the data you need.

Setup and Installation

You need Node.js 18+ (LTS recommended). Puppeteer bundles its own Chromium binary, so there is nothing else to install.

bash
mkdir my-scraper && cd my-scraper
npm init -y
npm install puppeteer

If you are deploying to a server and want to use an existing Chrome installation instead:

bash
npm install puppeteer-core

With puppeteer-core, you point Puppeteer at the browser yourself via the executablePath launch option — useful for Docker containers where you install Chromium via apt.

Verify the installation:

javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const title = await page.title();
  console.log(`Page title: ${title}`);
  await browser.close();
})();

Save it as index.js and run node index.js. You should see "Page title: Example Domain" in your terminal.

Basic Page Scraping

Every Puppeteer scraper follows the same pattern: launch a browser, open a page, navigate to a URL, wait for content, extract data, and close.

javascript
const puppeteer = require('puppeteer');

async function scrapeProduct(url) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  // Set a realistic viewport and user agent
  await page.setViewport({ width: 1920, height: 1080 });
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36'
  );

  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

  // Wait for a specific element before extracting
  await page.waitForSelector('.product-title', { timeout: 10000 });

  const product = await page.evaluate(() => {
    return {
      title: document.querySelector('.product-title')?.textContent.trim(),
      price: document.querySelector('.product-price')?.textContent.trim(),
      description: document.querySelector('.product-desc')?.textContent.trim(),
      inStock: !document.querySelector('.out-of-stock'),
    };
  });

  await browser.close();
  return product;
}

Key points:

  • waitUntil: 'networkidle2' waits until there are no more than 2 network connections for 500ms. This handles most SPA rendering.
  • waitForSelector ensures the element you need is actually in the DOM before you try to read it.
  • page.evaluate runs JavaScript inside the browser context. The function you pass has access to document, window, and the full DOM — but not your Node.js variables.

Handling Dynamic Content

Single-Page Applications

SPAs often load data after the initial HTML. The page source is just a <div id="root"></div> with a JavaScript bundle. Puppeteer handles this by default since it runs the JavaScript, but you need to wait for the right moment.

javascript
// Wait for data to render (not just the shell)
await page.waitForFunction(
  () => document.querySelectorAll('.product-card').length > 0,
  { timeout: 15000 }
);

Infinite Scroll

Many sites load content as you scroll down. You need to simulate scrolling and wait for new items to appear.

javascript
async function scrapeInfiniteScroll(page, maxItems = 100) {
  let items = [];
  let previousHeight = 0;

  while (items.length < maxItems) {
    // Scroll to bottom
    previousHeight = await page.evaluate(() => document.body.scrollHeight);
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    // Wait for new content to load
    try {
      await page.waitForFunction(
        `document.body.scrollHeight > ${previousHeight}`,
        { timeout: 5000 }
      );
    } catch {
      break; // No more content to load
    }

    // Small delay for rendering
    await new Promise(r => setTimeout(r, 1000));

    // Extract current items
    items = await page.evaluate(() =>
      [...document.querySelectorAll('.item')].map(el => ({
        name: el.querySelector('.name')?.textContent.trim(),
        price: el.querySelector('.price')?.textContent.trim(),
      }))
    );
  }

  return items.slice(0, maxItems);
}

Lazy-Loaded Images

Images with loading="lazy" or intersection observer patterns only load when they enter the viewport. Scroll them into view first:

javascript
async function loadLazyImages(page) {
  await page.evaluate(async () => {
    const images = document.querySelectorAll('img[loading="lazy"]');
    for (const img of images) {
      img.scrollIntoView({ behavior: 'instant' });
      await new Promise(r => setTimeout(r, 200));
    }
    // Scroll back to top
    window.scrollTo(0, 0);
  });
  // Wait for images to finish loading
  await new Promise(r => setTimeout(r, 2000));
}

Form Interaction

Login Flows

Many scraping targets require authentication. Puppeteer can fill forms and submit them just like a user.

javascript
async function loginAndScrape(url, username, password) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();

  await page.goto('https://example.com/login', { waitUntil: 'networkidle2' });

  // Type credentials with realistic delays
  await page.type('#username', username, { delay: 50 });
  await page.type('#password', password, { delay: 50 });

  // Click submit and wait for navigation
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click('#login-button'),
  ]);

  // Now scrape authenticated content
  await page.goto(url, { waitUntil: 'networkidle2' });
  const data = await page.evaluate(() => {
    return document.querySelector('.dashboard-data')?.textContent;
  });

  await browser.close();
  return data;
}

Search Forms

javascript
// Type into a search box, wait for autocomplete, select a result
await page.type('#search-input', 'web scraping api', { delay: 100 });
await page.waitForSelector('.autocomplete-results', { timeout: 5000 });
await page.click('.autocomplete-results li:first-child');
await page.waitForNavigation({ waitUntil: 'networkidle2' });

Handling Pagination

Most scraping jobs involve multiple pages. Here is a reliable pattern for numbered pagination:

javascript
async function scrapeAllPages(baseUrl) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  let allResults = [];
  let currentPage = 1;

  while (true) {
    const url = `${baseUrl}?page=${currentPage}`;
    await page.goto(url, { waitUntil: 'networkidle2' });

    const pageResults = await page.evaluate(() =>
      [...document.querySelectorAll('.result-item')].map(el => ({
        title: el.querySelector('h3')?.textContent.trim(),
        link: el.querySelector('a')?.href,
      }))
    );

    if (pageResults.length === 0) break;

    allResults.push(...pageResults);
    console.log(`Page ${currentPage}: ${pageResults.length} results`);

    // Check if there is a next page
    const hasNext = await page.$('.pagination .next:not(.disabled)');
    if (!hasNext) break;

    currentPage++;

    // Respectful delay between pages
    await new Promise(r => setTimeout(r, 1500 + Math.random() * 1000));
  }

  await browser.close();
  return allResults;
}

For "Load More" button pagination:

javascript
async function scrapeLoadMore(page) {
  let clickCount = 0;
  while (clickCount < 20) {
    const loadMoreBtn = await page.$('.load-more-button');
    if (!loadMoreBtn) break;

    // Capture the item count *before* clicking so the wait has a stable baseline
    const prevCount = await page.evaluate(
      () => document.querySelectorAll('.item').length
    );
    await loadMoreBtn.click();

    try {
      await page.waitForFunction(
        (count) => document.querySelectorAll('.item').length > count,
        { timeout: 8000 },
        prevCount
      );
    } catch {
      break; // No new items appeared — assume the end of the list
    }
    clickCount++;
  }
}

Network Request Interception

Blocking unnecessary resources makes your scraper faster and reduces bandwidth. Images, fonts, and CSS are rarely needed for data extraction.

javascript
async function setupRequestInterception(page) {
  await page.setRequestInterception(true);

  page.on('request', (request) => {
    const blockedTypes = ['image', 'stylesheet', 'font', 'media'];
    const blockedDomains = ['google-analytics.com', 'facebook.net', 'doubleclick.net'];

    if (blockedTypes.includes(request.resourceType())) {
      request.abort();
    } else if (blockedDomains.some(d => request.url().includes(d))) {
      request.abort();
    } else {
      request.continue();
    }
  });
}

This can reduce page load times by 40-60% and cut bandwidth by 70%+. Always block analytics and tracking scripts — they slow down scraping and serve no purpose for data extraction.
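The accept/block decision itself is plain logic, so it helps to factor it into a pure function you can unit-test without a browser. A small sketch (function and constant names are ours, not a Puppeteer API):

```javascript
// Pure block/allow decision, matching the interception handler above
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);
const BLOCKED_DOMAINS = ['google-analytics.com', 'facebook.net', 'doubleclick.net'];

function shouldBlock(resourceType, url) {
  return BLOCKED_TYPES.has(resourceType) ||
         BLOCKED_DOMAINS.some(d => url.includes(d));
}

console.log(shouldBlock('image', 'https://example.com/a.png'));           // true
console.log(shouldBlock('script', 'https://www.google-analytics.com/x')); // true
console.log(shouldBlock('document', 'https://example.com/'));             // false
```

Inside the request handler this becomes a one-liner: `shouldBlock(req.resourceType(), req.url()) ? req.abort() : req.continue()`.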

Screenshots and PDF Generation

Puppeteer can capture full-page screenshots and generate PDFs, which is useful for visual verification, archival, or monitoring changes.

javascript
// Full page screenshot
await page.screenshot({
  path: 'page.png',
  fullPage: true,
});

// Screenshot of a specific element
const element = await page.$('.product-card');
await element.screenshot({ path: 'product.png' });

// Generate PDF (works only in headless mode)
await page.pdf({
  path: 'page.pdf',
  format: 'A4',
  printBackground: true,
  margin: { top: '1cm', right: '1cm', bottom: '1cm', left: '1cm' },
});

A practical use case: take before/after screenshots of product pages to detect price changes or layout shifts.

Advanced Data Extraction

Scraping Tables

HTML tables are everywhere — product specs, financial data, comparison pages. Here is a generic table scraper:

javascript
async function scrapeTable(page, tableSelector) {
  return await page.evaluate((selector) => {
    const table = document.querySelector(selector);
    if (!table) return null;

    const headers = [...table.querySelectorAll('thead th')].map(
      th => th.textContent.trim()
    );

    const rows = [...table.querySelectorAll('tbody tr')].map(tr => {
      const cells = [...tr.querySelectorAll('td')].map(
        td => td.textContent.trim()
      );
      return Object.fromEntries(headers.map((h, i) => [h, cells[i]]));
    });

    return { headers, rows };
  }, tableSelector);
}

// Usage
const data = await scrapeTable(page, '#pricing-table');
// Returns: { headers: ['Plan', 'Price', 'Features'], rows: [{...}, {...}] }

Shadow DOM

Some modern web components use Shadow DOM, which hides elements from regular querySelector calls. You need to pierce through shadow roots:

javascript
const shadowData = await page.evaluate(() => {
  const host = document.querySelector('product-card');
  // shadowRoot is null for closed shadow roots — guard before querying
  const shadow = host?.shadowRoot;
  if (!shadow) return null;
  return {
    title: shadow.querySelector('.title')?.textContent.trim(),
    price: shadow.querySelector('.price')?.textContent.trim(),
  };
});

Intercepting XHR/Fetch Responses

Sometimes the cleanest approach is to intercept the API calls the page makes internally, rather than parsing the rendered HTML:

javascript
async function interceptApiData(page, url) {
  const apiData = [];

  page.on('response', async (response) => {
    const reqUrl = response.url();
    if (reqUrl.includes('/api/products') && response.status() === 200) {
      try {
        const json = await response.json();
        apiData.push(...json.results);
      } catch (e) {
        // Not JSON, skip
      }
    }
  });

  await page.goto(url, { waitUntil: 'networkidle2' });
  return apiData;
}

This technique is extremely powerful. Many sites fetch structured JSON from their own APIs and then render it into HTML. By intercepting the response, you get clean, structured data without parsing DOM elements at all.

Error Handling and Retry Patterns

Production scrapers need robust error handling. Network timeouts, selector changes, and rate limiting will all happen.

javascript
async function scrapeWithRetry(url, maxRetries = 3) {
  let browser;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      browser = await puppeteer.launch({ headless: 'new' });
      const page = await browser.newPage();
      page.setDefaultTimeout(15000);

      await page.goto(url, { waitUntil: 'networkidle2' });
      await page.waitForSelector('.data-container');

      const data = await page.evaluate(() => {
        return document.querySelector('.data-container')?.textContent.trim();
      });

      if (!data) throw new Error('Empty data extracted');
      return data;

    } catch (error) {
      console.error(`Attempt ${attempt}/${maxRetries} failed: ${error.message}`);

      if (attempt === maxRetries) {
        throw new Error(`All ${maxRetries} attempts failed for ${url}`);
      }

      // Exponential backoff: 2s, 4s, 8s...
      const delay = Math.pow(2, attempt) * 1000;
      await new Promise(r => setTimeout(r, delay));

    } finally {
      if (browser) await browser.close();
    }
  }
}

Key practices:

  • Always close the browser in a finally block to prevent orphan Chromium processes
  • Validate extracted data — an empty result is as bad as an error
  • Use exponential backoff — hammering a rate-limited server with immediate retries makes things worse
  • Set explicit timeouts — the default 30-second timeout is too long for most scraping jobs
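The fixed 2s/4s/8s backoff in the example works, but when many workers retry against the same host, adding jitter spreads the retries out instead of synchronizing them. A small helper sketch (the function name and defaults are ours):

```javascript
// Exponential backoff with full jitter (hypothetical helper, not a Puppeteer API)
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  // Exponential growth: 2s, 4s, 8s... capped at maxMs
  const exp = Math.min(baseMs * Math.pow(2, attempt), maxMs);
  // Full jitter: a uniform delay in [0, exp) avoids synchronized retry storms
  return Math.floor(Math.random() * exp);
}
```

Swap it into the retry loop as `await new Promise(r => setTimeout(r, backoffDelay(attempt)))`.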

Anti-Bot Detection and Stealth

Out of the box, Puppeteer is trivially detectable. The navigator.webdriver flag is set to true, plugin arrays are empty, and WebGL reports SwiftShader instead of real GPU hardware. Every anti-bot service checks these in milliseconds.
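As a rough illustration, that first round of fingerprint checks boils down to a few property reads. This is a simplified stand-in — the field names mirror the signals described above, not any vendor's actual code:

```javascript
// Simplified sketch of a fingerprint check; `fp` is a plain object standing
// in for values a detector would read from navigator and WebGL
function looksAutomated(fp) {
  return fp.webdriver === true ||
         fp.pluginCount === 0 ||
         /SwiftShader/i.test(fp.webglRenderer);
}

// A default headless Puppeteer launch presents roughly this fingerprint:
console.log(looksAutomated({ webdriver: true, pluginCount: 0, webglRenderer: 'Google SwiftShader' })); // true
// A typical real user's browser:
console.log(looksAutomated({ webdriver: false, pluginCount: 5, webglRenderer: 'ANGLE (NVIDIA GeForce)' })); // false
```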

The puppeteer-extra-plugin-stealth package patches most of these signals:

bash
npm install puppeteer-extra puppeteer-extra-plugin-stealth

javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-blink-features=AutomationControlled',
    ],
  });

  const page = await browser.newPage();

  // Randomize viewport to look less bot-like
  const width = 1280 + Math.floor(Math.random() * 640);
  const height = 800 + Math.floor(Math.random() * 280);
  await page.setViewport({ width, height });

  await page.goto('https://bot.sannysoft.com', { waitUntil: 'networkidle2' });
  await page.screenshot({ path: 'stealth-test.png', fullPage: true });
  await browser.close();
})();

The stealth plugin handles:

  • Removing navigator.webdriver flag
  • Faking navigator.plugins with realistic entries
  • Spoofing WebGL vendor and renderer strings
  • Patching Chrome runtime objects (window.chrome)
  • Fixing navigator.permissions behavior
  • Spoofing navigator.languages properly
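Under the hood, most of these patches are property redefinitions injected before any page script runs. A browser-free sketch of the webdriver patch, using a plain object as a stand-in for the real navigator:

```javascript
// Stand-in object in place of the real navigator (so this runs in plain Node)
const fakeNavigator = { webdriver: true };

// The patch redefines the getter so detection scripts read undefined,
// which is what a real, non-automated browser reports
Object.defineProperty(fakeNavigator, 'webdriver', {
  get: () => undefined,
  configurable: true,
});

console.log(fakeNavigator.webdriver); // undefined
```

In a real page the plugin applies this kind of override via evaluateOnNewDocument, so it is in place before the site's own JavaScript executes.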

Even with stealth, advanced anti-bot systems like Cloudflare Turnstile, DataDome, and PerimeterX use behavioral analysis — mouse movement patterns, scroll velocity, and timing between actions. You can partially address this:

javascript
// Simulate human-like mouse movement
async function humanMove(page, x, y) {
  const steps = 10 + Math.floor(Math.random() * 15);
  await page.mouse.move(x, y, { steps });
  await new Promise(r => setTimeout(r, 100 + Math.random() * 300));
}

// Random delays between actions
async function humanDelay(min = 500, max = 2000) {
  const delay = min + Math.random() * (max - min);
  await new Promise(r => setTimeout(r, delay));
}

Proxy Rotation

Sending all requests from one IP address gets you blocked fast. Proxy rotation distributes your requests across many IPs.

javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

const proxies = [
  'http://user:[email protected]:8080',
  'http://user:[email protected]:8080',
  'http://user:[email protected]:8080',
];

async function scrapeWithProxy(url) {
  const proxy = proxies[Math.floor(Math.random() * proxies.length)];
  const proxyUrl = new URL(proxy);

  const browser = await puppeteer.launch({
    headless: 'new',
    args: [`--proxy-server=${proxyUrl.host}`],
  });

  const page = await browser.newPage();

  // Authenticate with the proxy
  if (proxyUrl.username) {
    await page.authenticate({
      username: proxyUrl.username,
      password: proxyUrl.password,
    });
  }

  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
    return await page.content();
  } finally {
    await browser.close();
  }
}

A smarter approach rotates proxies based on failure rate. If a proxy gets blocked, move it to a cooldown list:

javascript
class ProxyPool {
  constructor(proxies) {
    this.available = [...proxies];
    this.cooldown = new Map(); // proxy -> cooldown expiry timestamp
  }

  getProxy() {
    // Move expired cooldowns back to available
    const now = Date.now();
    for (const [proxy, expiry] of this.cooldown) {
      if (now > expiry) {
        this.available.push(proxy);
        this.cooldown.delete(proxy);
      }
    }

    if (this.available.length === 0) {
      throw new Error('No proxies available');
    }

    const index = Math.floor(Math.random() * this.available.length);
    return this.available[index];
  }

  markFailed(proxy) {
    this.available = this.available.filter(p => p !== proxy);
    this.cooldown.set(proxy, Date.now() + 5 * 60 * 1000); // 5 min cooldown
  }
}

Scaling with Browser Pools

Running one browser at a time is fine for small jobs. For serious scraping, you need a pool of browsers running in parallel with concurrency limits.

javascript
const puppeteer = require('puppeteer');

class BrowserPool {
  constructor(maxBrowsers = 5) {
    this.maxBrowsers = maxBrowsers;
    this.activeBrowsers = 0;
    this.queue = [];
  }

  async acquire() {
    if (this.activeBrowsers < this.maxBrowsers) {
      this.activeBrowsers++;
      return await puppeteer.launch({
        headless: 'new',
        args: ['--no-sandbox', '--disable-dev-shm-usage'],
      });
    }

    // Wait for a browser to become available
    return new Promise((resolve) => {
      this.queue.push(resolve);
    });
  }

  async release(browser) {
    await browser.close();

    if (this.queue.length > 0) {
      const next = this.queue.shift();
      const newBrowser = await puppeteer.launch({
        headless: 'new',
        args: ['--no-sandbox', '--disable-dev-shm-usage'],
      });
      next(newBrowser);
    } else {
      this.activeBrowsers--;
    }
  }
}

// Usage: scrape 100 URLs with max 5 concurrent browsers
async function scrapeMany(urls) {
  const pool = new BrowserPool(5);
  const results = [];

  const tasks = urls.map(async (url) => {
    const browser = await pool.acquire();
    try {
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2', timeout: 20000 });
      const title = await page.title();
      results.push({ url, title });
    } catch (error) {
      results.push({ url, error: error.message });
    } finally {
      await pool.release(browser);
    }
  });

  await Promise.all(tasks);
  return results;
}

Scaling Puppeteer beyond a single machine gets complicated fast. Each browser instance needs 100-300 MB of RAM. A 4 GB server can realistically run 10-15 concurrent browsers. For higher throughput, you are looking at container orchestration, process monitoring, and failure recovery — which is essentially building your own scraping infrastructure.
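That RAM math can be sanity-checked with a quick back-of-envelope helper — our own sketch, using the 300 MB per-browser upper bound from above and reserving 1 GB for the OS and Node itself:

```javascript
// Rough concurrency ceiling for a server, given per-browser RAM cost
function maxConcurrentBrowsers(totalRamMb, perBrowserMb = 300, reserveMb = 1024) {
  return Math.max(0, Math.floor((totalRamMb - reserveMb) / perBrowserMb));
}

console.log(maxConcurrentBrowsers(4096));  // 10 — the low end of the 10-15 range for a 4 GB server
console.log(maxConcurrentBrowsers(16384)); // 51
```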

The full workflow, end to end:

  1. Launch Browser — Puppeteer starts a Chromium instance with stealth patches and proxy configuration
  2. Navigate and Wait — go to the target URL, wait for JavaScript rendering and dynamic content to load
  3. Extract Data — run page.evaluate() to pull structured data from the DOM or intercept API responses
  4. Handle Errors — retry on failure with exponential backoff, rotate proxies on blocks
  5. Store Results — save extracted data to JSON, CSV, or database — validate before writing

When to Use a Scraping API Instead

Building and maintaining a Puppeteer scraping pipeline is real engineering work. You are responsible for browser management, proxy infrastructure, anti-bot bypass, CAPTCHA solving, retry logic, and monitoring. For one-off projects or simple targets, that is fine.

But when you are scraping at scale against sites with serious bot protection, the maintenance burden grows fast. Cloudflare updates its detection every few weeks. Proxy providers rotate your IPs into already-burned ranges. Your stealth patches break when Chrome updates. You spend more time maintaining infrastructure than building the product that needs the data.

This is where a scraping API makes sense. Instead of managing browsers, proxies, and stealth patches yourself, you send an HTTP request and get back the rendered HTML or structured data.

AlterLab handles the hard parts — anti-bot bypass across Cloudflare, DataDome, and PerimeterX, automatic proxy rotation through residential and datacenter pools, and headless browser rendering. A single API call replaces hundreds of lines of Puppeteer infrastructure code:

bash
curl -X POST https://alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "formats": ["text", "markdown"]}'

No browser pools to manage. No proxy rotation to implement. No stealth plugin updates to track.

| Feature                | DIY Puppeteer                 | AlterLab API            |
| ---------------------- | ----------------------------- | ----------------------- |
| Setup Time             | Hours to days                 | 5 minutes               |
| Anti-Bot Bypass        | Manual (breaks often)         | Built-in (auto-updated) |
| Proxy Management       | Self-managed pool             | Included                |
| Browser Infrastructure | Your servers                  | Managed                 |
| Maintenance            | Ongoing                       | None                    |
| Scaling                | Complex orchestration         | Increase API calls      |
| Cost at 10K pages/day  | $200-500/mo servers + proxies | $49/mo                  |
| JavaScript Rendering   | Yes                           | Yes                     |

The decision is straightforward: use Puppeteer when you need full browser control (custom interactions, screenshots, specific workflows). Use a scraping API when you need reliable data extraction at scale without the infrastructure overhead.

Complete Example: E-Commerce Product Scraper

Here is a complete, production-ready scraper that extracts product data from an e-commerce listing page, handles pagination, retries on failure, and saves results to JSON.

javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs').promises;

puppeteer.use(StealthPlugin());

const CONFIG = {
  baseUrl: 'https://example-store.com/category/electronics',
  maxPages: 10,
  maxRetries: 3,
  concurrency: 3,
  outputFile: 'products.json',
  delayBetweenPages: [1500, 3000], // min/max ms
};

async function randomDelay([min, max]) {
  const delay = min + Math.random() * (max - min);
  await new Promise(r => setTimeout(r, delay));
}

async function extractProducts(page) {
  return await page.evaluate(() => {
    return [...document.querySelectorAll('.product-card')].map(card => ({
      name: card.querySelector('.product-name')?.textContent.trim() || '',
      price: card.querySelector('.product-price')?.textContent.trim() || '',
      rating: card.querySelector('.rating-value')?.textContent.trim() || '',
      reviewCount: card.querySelector('.review-count')?.textContent.trim() || '',
      url: card.querySelector('a.product-link')?.href || '',
      image: card.querySelector('img.product-image')?.src || '',
      availability: card.querySelector('.stock-status')?.textContent.trim() || '',
      scraped_at: new Date().toISOString(),
    }));
  });
}

async function scrapePage(browser, url, retries = CONFIG.maxRetries) {
  const page = await browser.newPage();

  try {
    // Block heavy resources
    await page.setRequestInterception(true);
    page.on('request', req => {
      if (['image', 'stylesheet', 'font', 'media'].includes(req.resourceType())) {
        req.abort();
      } else {
        req.continue();
      }
    });

    await page.setViewport({ width: 1366, height: 768 });
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 25000 });

    // Wait for product cards to render
    await page.waitForSelector('.product-card', { timeout: 10000 });

    const products = await extractProducts(page);
    console.log(`  Extracted ${products.length} products from ${url}`);
    return products;

  } catch (error) {
    if (retries > 0) {
      console.warn(`  Retry (${CONFIG.maxRetries - retries + 1}): ${error.message}`);
      await new Promise(r => setTimeout(r, 3000));
      return scrapePage(browser, url, retries - 1);
    }
    console.error(`  Failed after ${CONFIG.maxRetries} retries: ${url}`);
    return [];

  } finally {
    await page.close();
  }
}

async function main() {
  console.log('Starting e-commerce scraper...');

  const browser = await puppeteer.launch({
    headless: 'new',
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-blink-features=AutomationControlled',
    ],
  });

  const allProducts = [];

  try {
    for (let pageNum = 1; pageNum <= CONFIG.maxPages; pageNum++) {
      const url = `${CONFIG.baseUrl}?page=${pageNum}`;
      console.log(`Scraping page ${pageNum}/${CONFIG.maxPages}: ${url}`);

      const products = await scrapePage(browser, url);

      if (products.length === 0) {
        console.log('No products found — reached last page.');
        break;
      }

      allProducts.push(...products);

      // Respectful delay between pages
      if (pageNum < CONFIG.maxPages) {
        await randomDelay(CONFIG.delayBetweenPages);
      }
    }

    // Save results
    await fs.writeFile(
      CONFIG.outputFile,
      JSON.stringify(allProducts, null, 2),
      'utf-8'
    );

    console.log(`\nDone. Scraped ${allProducts.length} products across ${CONFIG.maxPages} pages.`);
    console.log(`Results saved to ${CONFIG.outputFile}`);

  } finally {
    await browser.close();
  }
}

main().catch(console.error);

This scraper includes everything covered in this guide: stealth mode, request interception, proper waits, retry logic, respectful delays, and clean resource management. To adapt it to a real target, you only need to update the CSS selectors in extractProducts and the base URL.

What to Remember

Puppeteer gives you complete control over a real browser, which makes it the right choice for scraping JavaScript-heavy sites, handling complex interactions, and extracting data that HTTP-only tools cannot reach. The cost is complexity — you are managing browser processes, memory, proxies, and stealth.

For projects where you need that level of control, the patterns in this guide will get you to production. For projects where you just need the data, consider whether the infrastructure overhead is worth it. A scraping API like AlterLab can reduce weeks of Puppeteer infrastructure work to a single HTTP call — letting you focus on what you are building instead of how you are scraping.
