AlterLab

Web Scraping with Node.js in 2026: Puppeteer, Playwright, and When to Use a Scraping API

Yash Dubey

February 12, 2026

8 min read

Most web scraping tutorials are written for Python. That is fine if Python is your stack, but if you are building a Node.js application, switching languages just to scrape data adds unnecessary complexity to your deployment.

Here is how to scrape effectively with JavaScript, which tools to use for which situations, and when to stop fighting browser automation entirely.

The Node.js Scraping Stack

Your tool choice depends on what you are scraping:

  • Static HTML pages → fetch + cheerio
  • JavaScript-rendered pages → Puppeteer or Playwright
  • Anti-bot protected sites → Scraping API (more on this later)

Do not reach for a headless browser when you do not need one. Most pages serve their content in the initial HTML response. Check first.
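A quick way to check: fetch the raw HTML and look for text you can see on the rendered page. A minimal sketch (the helper name, marker string, and HTML fixtures are made up for illustration):

```javascript
// Does the raw HTML (before any JavaScript runs) already contain the data?
// The marker should be text you can read on the fully rendered page.
function hasStaticContent(html, marker) {
  return html.includes(marker);
}

// Server-rendered: the product name is in the initial response
const serverRendered = '<div class="product-card">Acme Widget</div>';
// Client-rendered: an empty shell plus a JS bundle
const clientRendered = '<div id="root"></div><script src="/bundle.js"></script>';

console.log(hasStaticContent(serverRendered, 'Acme Widget')); // true
console.log(hasStaticContent(clientRendered, 'Acme Widget')); // false
```

If the marker is missing from the raw response, the content is rendered client-side and you need a browser (or the site's own API).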

Starting Simple: fetch + cheerio

For static pages, this is all you need:

javascript
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/products');
const html = await response.text();
const $ = cheerio.load(html);

const products = [];
$('.product-card').each((i, el) => {
  products.push({
    name: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim(),
    url: $(el).find('a').attr('href'),
  });
});

console.log(products);

This runs in milliseconds, uses almost no memory, and handles most documentation sites, blogs, directories, and simple product pages.

Common mistakes at this stage:

  • Not setting a User-Agent. Many sites block requests with no User-Agent or with the default Node.js one. Set it to something realistic.
  • Not handling encoding. Some sites use non-UTF-8 encoding. Check the Content-Type header and decode accordingly.
  • Fetching too fast. Even without anti-bot protection, hammering a server with hundreds of concurrent requests gets your IP blocked.
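The encoding point in particular trips people up, because `response.text()` always assumes UTF-8. A sketch of charset-aware decoding with the standard TextDecoder (the helper name and header values are made up; the commented fetch usage shows where it slots in):

```javascript
// Pull the charset out of a Content-Type header, defaulting to UTF-8
function charsetFromContentType(contentType) {
  const match = /charset=([^;\s"']+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : 'utf-8';
}

// In a real scraper you would apply it like this:
//   const buf = await response.arrayBuffer();
//   const charset = charsetFromContentType(response.headers.get('content-type'));
//   const html = new TextDecoder(charset).decode(buf);

console.log(charsetFromContentType('text/html; charset=ISO-8859-1')); // iso-8859-1
console.log(charsetFromContentType('text/html')); // utf-8

// TextDecoder handles legacy encodings: byte 0xE9 is "é" in ISO-8859-1
console.log(new TextDecoder('iso-8859-1').decode(new Uint8Array([0xe9]))); // é
```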

When You Need a Browser: Puppeteer vs Playwright

If the page content loads via JavaScript (React, Vue, Angular apps), you need a headless browser.

Puppeteer

Google's browser automation library. Chrome/Chromium only.

javascript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch(); // headless by default in modern Puppeteer
const page = await browser.newPage();

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
await page.goto('https://example.com/app', { waitUntil: 'networkidle2' });

// Wait for the specific element you need
await page.waitForSelector('.data-table');

const data = await page.evaluate(() => {
  const rows = document.querySelectorAll('.data-table tr');
  return Array.from(rows).map(row => {
    const cells = row.querySelectorAll('td');
    return {
      name: cells[0]?.textContent?.trim(),
      value: cells[1]?.textContent?.trim(),
    };
  });
});

await browser.close();

Playwright

Microsoft's alternative. Supports Chrome, Firefox, and WebKit.

javascript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

await page.goto('https://example.com/app');
await page.waitForSelector('.data-table');

const data = await page.$$eval('.data-table tr', rows =>
  rows.map(row => {
    const cells = row.querySelectorAll('td');
    return {
      name: cells[0]?.textContent?.trim(),
      value: cells[1]?.textContent?.trim(),
    };
  })
);

await browser.close();

Which One to Pick

Playwright is the better choice for most scraping projects in 2026:

  • Auto-wait built in. Playwright automatically waits for elements to be actionable before interacting. Puppeteer requires manual waitForSelector calls everywhere.
  • Better selectors. Playwright supports text=, role=, and CSS selectors out of the box.
  • Multi-browser. If a site blocks Chrome, try Firefox or WebKit without rewriting your code.
  • Network interception is cleaner. Intercept API calls the page makes and grab the JSON directly instead of parsing the DOM.

Puppeteer still makes sense if you are already deep in the Google ecosystem or need Chrome-specific DevTools protocol features.

The Network Interception Trick

Here is something most tutorials skip: many SPAs fetch their data from an API that returns JSON. Instead of parsing the rendered DOM, intercept the network request and grab the structured data directly.

javascript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Intercept API responses
const apiData = [];
page.on('response', async (response) => {
  const url = response.url();
  if (url.includes('/api/products')) {
    const json = await response.json();
    apiData.push(...json.results);
  }
});

await page.goto('https://example.com/products');
await page.waitForTimeout(3000); // Wait for API calls to complete

await browser.close();
console.log(apiData); // Clean JSON, no DOM parsing needed

This gives you cleaner data with less code. The tradeoff is that it breaks if the site changes its internal API endpoints, but the same is true for DOM selectors.

Handling Pagination

Most scraping projects need to handle pagination. Three patterns cover almost every site:

Pattern 1: URL-based pagination

javascript
const allProducts = [];

for (let page = 1; page <= 50; page++) {
  const response = await fetch(`https://example.com/products?page=${page}`);
  const html = await response.text();
  const $ = cheerio.load(html);

  const products = $('.product').map((i, el) => ({
    name: $(el).find('.name').text().trim(),
  })).get();

  if (products.length === 0) break; // No more pages
  allProducts.push(...products);

  // Be respectful: wait between requests
  await new Promise(r => setTimeout(r, 1000 + Math.random() * 2000));
}

Pattern 2: Click-to-load / Infinite scroll

javascript
const page = await browser.newPage();
await page.goto('https://example.com/feed');

let previousHeight = 0;
while (true) {
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(2000);

  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) break;
  previousHeight = currentHeight;
}

// Now extract all loaded items
const items = await page.$$eval('.feed-item', els =>
  els.map(el => el.textContent.trim())
);

Pattern 3: Cursor-based API pagination

javascript
let cursor = null;
const allItems = [];

do {
  const url = new URL('https://example.com/api/items');
  url.searchParams.set('limit', '100');
  if (cursor) url.searchParams.set('cursor', cursor);

  const response = await fetch(url);
  const data = await response.json();

  allItems.push(...data.items);
  cursor = data.next_cursor;
} while (cursor);
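To see how the loop terminates, you can drive the same do/while shape with a stubbed, in-memory API standing in for fetch (the page shape mirrors the example above; all data and names here are made up):

```javascript
// In-memory stand-in for a cursor-paginated API
const pages = new Map([
  ['start', { items: [1, 2], next_cursor: 'a' }],
  ['a',     { items: [3, 4], next_cursor: 'b' }],
  ['b',     { items: [5],    next_cursor: null }],
]);
async function fetchPage(cursor) {
  return pages.get(cursor ?? 'start');
}

async function collectAll() {
  let cursor = null;
  const allItems = [];
  do {
    const data = await fetchPage(cursor);
    allItems.push(...data.items);
    cursor = data.next_cursor; // null on the last page ends the loop
  } while (cursor);
  return allItems;
}

const items = await collectAll();
console.log(items); // [1, 2, 3, 4, 5]
```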

Concurrency Without Getting Blocked

Sending requests one at a time is slow. Sending them all at once gets you blocked. The sweet spot is controlled concurrency:

javascript
async function scrapeWithConcurrency(urls, maxConcurrent = 5) {
  const results = [];
  const executing = new Set();

  for (const url of urls) {
    const promise = scrapeUrl(url)
      .then(result => results.push(result))
      // Catch failures so one rejected promise does not crash Promise.race/all
      .catch(error => results.push({ url, error }))
      .finally(() => executing.delete(promise));

    executing.add(promise);

    if (executing.size >= maxConcurrent) {
      await Promise.race(executing);
    }
  }

  await Promise.all(executing);
  return results;
}

Start with 3-5 concurrent requests and increase only if the target server handles it without errors. Add jitter (random delays) between batches to look less like a bot.
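A quick way to convince yourself the limit actually holds: run the pool against a stubbed task that records how many copies are in flight at once. This is a self-contained version of the same pattern (the stub, its timings, and the function names are made up):

```javascript
async function runWithConcurrency(tasks, maxConcurrent = 3) {
  const results = [];
  const executing = new Set();
  for (const task of tasks) {
    const promise = task()
      .then((result) => results.push(result))
      .finally(() => executing.delete(promise));
    executing.add(promise);
    // At the cap: wait for any in-flight task before starting another
    if (executing.size >= maxConcurrent) await Promise.race(executing);
  }
  await Promise.all(executing);
  return results;
}

// Stub "scrape" task that tracks peak concurrency
let active = 0;
let peak = 0;
const makeTask = (id) => async () => {
  active += 1;
  peak = Math.max(peak, active);
  await new Promise((r) => setTimeout(r, 10)); // pretend network latency
  active -= 1;
  return id;
};

const results = await runWithConcurrency(
  Array.from({ length: 12 }, (_, i) => makeTask(i))
);
console.log(results.length, peak); // 12 3
```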

Error Handling That Actually Works

Scraping is inherently unreliable. Sites go down, layouts change, rate limits kick in. Build retry logic from the start:

javascript
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url);

      if (response.status === 429) {
        const retryAfter = parseInt(response.headers.get('retry-after') || '60', 10);
        console.log(`Rate limited. Waiting ${retryAfter}s...`);
        await new Promise(r => setTimeout(r, retryAfter * 1000));
        continue;
      }

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      return await response.text();
    } catch (error) {
      if (attempt === maxRetries) throw error;

      const backoff = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
      console.log(`Attempt ${attempt} failed. Retrying in ${Math.round(backoff / 1000)}s...`);
      await new Promise(r => setTimeout(r, backoff));
    }
  }
  // Only reachable if every attempt was rate limited and consumed its retry
  throw new Error(`Failed to fetch ${url} after ${maxRetries} attempts`);
}

Key points:

  • Exponential backoff prevents hammering a server that is already struggling.
  • Jitter (the random component) prevents all your retries from hitting at the same time.
  • Respect Retry-After headers. If a server tells you when to come back, listen.
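Both the backoff arithmetic and the Retry-After handling are easy to get subtly wrong, so they are worth pulling into small, testable helpers. A sketch (the function names are made up; the bounds mirror the retry example above, and Retry-After can legally be either a seconds value or an HTTP date):

```javascript
// Exponential backoff with jitter: 2^attempt seconds plus up to 1s of noise
function backoffMs(attempt) {
  return Math.pow(2, attempt) * 1000 + Math.random() * 1000;
}

// Parse a Retry-After header: either seconds ("120") or an HTTP date
function retryAfterMs(headerValue, fallbackSeconds = 60) {
  if (!headerValue) return fallbackSeconds * 1000;
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) return Math.max(0, seconds) * 1000;
  const date = Date.parse(headerValue);
  return Number.isNaN(date) ? fallbackSeconds * 1000 : Math.max(0, date - Date.now());
}

console.log(retryAfterMs('120')); // 120000
console.log(retryAfterMs(null));  // 60000
// backoffMs(3) lands somewhere in [8000, 9000)
```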

When to Stop Building and Use a Scraping API

There is a point where maintaining your own scraping infrastructure costs more than it saves. You have crossed that line when:

  • You are spending more time on infrastructure than on using the data. Proxy rotation, CAPTCHA solving, browser fingerprint management, and IP ban recovery are full-time problems.
  • Anti-bot systems keep winning. Cloudflare, DataDome, and PerimeterX update their detection weekly. Your bypass that worked last Tuesday is already flagged.
  • You need reliability. Production systems that depend on scraped data cannot afford a 20% failure rate because your proxy pool degraded overnight.
  • Scale changed. Scraping 100 pages a day with Puppeteer is fine. Scraping 100,000 pages a day means managing browser instances, memory limits, and concurrent connections across multiple servers.

A scraping API like AlterLab handles the entire infrastructure layer: proxy rotation, anti-bot bypass, browser rendering, and retry logic. You send a URL, you get back data.

javascript
const response = await fetch('https://alterlab.io/api/v1/scrape', {
  method: 'POST',
  headers: {
    'X-API-Key': 'your-api-key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://target-site.com/products',
    render_js: true,
  }),
});

const { content } = await response.json();
// Parse the returned HTML with cheerio as usual
const $ = cheerio.load(content);

Your scraping logic stays the same. The difference is that someone else handles the arms race with anti-bot systems.

Project Structure for Larger Scrapers

Once you go beyond a single script, organize your scraper like any other Node.js project:

code
scraper/
  src/
    scrapers/        # One file per site/source
      amazon.js
      competitor.js
    extractors/      # DOM parsing logic
      product.js
      pricing.js
    utils/
      retry.js       # Retry/backoff logic
      rateLimit.js   # Request throttling
    index.js         # Entry point and orchestration
  output/            # Scraped data (gitignored)
  package.json

Separate the fetching (getting the HTML) from the extracting (parsing the data). When a site changes its layout, you only update the extractor. When you switch from direct fetching to an API, you only update the scraper.
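One payoff of that separation: extractors become pure functions you can unit-test against saved HTML fixtures with no network involved. A toy sketch (a real extractor would use cheerio as earlier in this post; a regex stands in here only to keep the example dependency-free, and the class name is made up):

```javascript
// Pure extractor: HTML string in, structured data out
function extractTitles(html) {
  return [...html.matchAll(/<h2 class="title">([^<]*)<\/h2>/g)]
    .map((m) => m[1].trim());
}

// Fixture saved from a real page; when the layout changes,
// only this function and its fixtures need updating
const fixture = `
  <h2 class="title">Acme Widget</h2>
  <h2 class="title"> Roadrunner Trap </h2>
`;
console.log(extractTitles(fixture)); // [ 'Acme Widget', 'Roadrunner Trap' ]
```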

Quick Reference

  • Static HTML → fetch + cheerio. Fast, lightweight, no browser overhead.
  • JS-rendered pages → Playwright. Auto-wait, multi-browser, clean API.
  • Anti-bot protected → Scraping API. Infrastructure handled for you.
  • Internal APIs → fetch directly. Skip the browser entirely.
  • High volume (10k+ pages/day) → Scraping API. Proxy management at scale is a full-time job.

The best scraper is the simplest one that gets the job done. Start with fetch, add Playwright when you need it, and move to a scraping API when you would rather spend time on your product than on fighting bot detection.
