Web Scraping with Node.js in 2026: Puppeteer, Playwright, and When to Use a Scraping API
Yash Dubey
February 12, 2026
Most web scraping tutorials are written for Python. That is fine if Python is your stack, but if you are building a Node.js application, switching languages just to scrape data adds unnecessary complexity to your deployment.
Here is how to scrape effectively with JavaScript, which tools to use for which situations, and when to stop fighting browser automation entirely.
The Node.js Scraping Stack
Your tool choice depends on what you are scraping:
- Static HTML pages → `fetch` + cheerio
- JavaScript-rendered pages → Puppeteer or Playwright
- Anti-bot protected sites → Scraping API (more on this later)
Do not reach for a headless browser when you do not need one. Most pages serve their content in the initial HTML response. Check first.
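One way to check, sketched below: fetch the raw HTML and look for a string you expect in the data (a product name, a heading). The `marker` parameter and the probe function itself are illustrative, not a standard API.

```javascript
// A minimal probe, assuming you know a string (a product name, a heading)
// that should appear in the scraped data. If the raw HTML already contains
// it, fetch + cheerio is enough; if not, the page likely renders client-side.
function containsContent(html, marker) {
  return html.includes(marker);
}

async function probe(url, marker) {
  const response = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; probe)' },
  });
  const html = await response.text();
  return containsContent(html, marker)
    ? 'static: use fetch + cheerio'
    : 'dynamic: use a headless browser';
}
```

You can also open the page with View Source (not the DevTools inspector, which shows the rendered DOM) and search for the same string by hand.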
Starting Simple: fetch + cheerio
For static pages, this is all you need:
```javascript
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/products');
const html = await response.text();
const $ = cheerio.load(html);

const products = [];
$('.product-card').each((i, el) => {
  products.push({
    name: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim(),
    url: $(el).find('a').attr('href'),
  });
});

console.log(products);
```

This runs in milliseconds, uses almost no memory, and handles most documentation sites, blogs, directories, and simple product pages.
Common mistakes at this stage:
- Not setting a User-Agent. Many sites block requests with no User-Agent or with the default Node.js one. Set it to something realistic.
- Not handling encoding. Some sites use non-UTF-8 encoding. Check the `Content-Type` header and decode accordingly.
- Fetching too fast. Even without anti-bot protection, hammering a server with hundreds of concurrent requests gets your IP blocked.
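A fetch wrapper covering all three points might look like this (the User-Agent string and the delay values are illustrative starting points, not prescriptive):

```javascript
// Pull the charset out of a Content-Type header, defaulting to UTF-8.
function charsetFrom(contentType) {
  return /charset=([^;]+)/i.exec(contentType || '')?.[1]?.trim() ?? 'utf-8';
}

// Fetch with a realistic User-Agent, decode using the declared charset,
// and pause briefly so consecutive calls do not hammer the server.
async function politeFetch(url) {
  const response = await fetch(url, {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    },
  });

  const charset = charsetFrom(response.headers.get('content-type'));
  const html = new TextDecoder(charset).decode(await response.arrayBuffer());

  // Illustrative throttle: roughly 1-2s between requests.
  await new Promise(r => setTimeout(r, 1000 + Math.random() * 1000));
  return html;
}
```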
When You Need a Browser: Puppeteer vs Playwright
If the page content loads via JavaScript (React, Vue, Angular apps), you need a headless browser.
Puppeteer
Google's browser automation library, focused on Chrome/Chromium.
```javascript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
await page.goto('https://example.com/app', { waitUntil: 'networkidle2' });

// Wait for the specific element you need
await page.waitForSelector('.data-table');

const data = await page.evaluate(() => {
  const rows = document.querySelectorAll('.data-table tr');
  return Array.from(rows).map(row => {
    const cells = row.querySelectorAll('td');
    return {
      name: cells[0]?.textContent?.trim(),
      value: cells[1]?.textContent?.trim(),
    };
  });
});

await browser.close();
```

Playwright
Microsoft's alternative. Supports Chrome, Firefox, and WebKit.
```javascript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/app');
await page.waitForSelector('.data-table');

const data = await page.$$eval('.data-table tr', rows =>
  rows.map(row => {
    const cells = row.querySelectorAll('td');
    return {
      name: cells[0]?.textContent?.trim(),
      value: cells[1]?.textContent?.trim(),
    };
  })
);

await browser.close();
```

Which One to Pick
Playwright is the better choice for most scraping projects in 2026:
- Auto-wait built in. Playwright automatically waits for elements to be actionable before interacting. Puppeteer requires manual `waitForSelector` calls everywhere.
- Better selectors. Playwright supports `text=`, `role=`, and CSS selectors out of the box.
- Multi-browser. If a site blocks Chrome, try Firefox or WebKit without rewriting your code.
- Network interception is cleaner. Intercept API calls the page makes and grab the JSON directly instead of parsing the DOM.
Puppeteer still makes sense if you are already deep in the Google ecosystem or need Chrome-specific DevTools protocol features.
The Network Interception Trick
Here is something most tutorials skip: many SPAs fetch their data from an API that returns JSON. Instead of parsing the rendered DOM, intercept the network request and grab the structured data directly.
```javascript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Intercept API responses
const apiData = [];
page.on('response', async (response) => {
  const url = response.url();
  if (url.includes('/api/products')) {
    const json = await response.json();
    apiData.push(...json.results);
  }
});

await page.goto('https://example.com/products');
await page.waitForTimeout(3000); // Wait for API calls to complete
await browser.close();

console.log(apiData); // Clean JSON, no DOM parsing needed
```

This gives you cleaner data with less code. The tradeoff is that it breaks if the site changes its internal API endpoints, but the same is true for DOM selectors.
Handling Pagination
Most scraping projects need to handle pagination. Three patterns cover almost every site:
Pattern 1: URL-based pagination
```javascript
const allProducts = [];

for (let page = 1; page <= 50; page++) {
  const response = await fetch(`https://example.com/products?page=${page}`);
  const html = await response.text();
  const $ = cheerio.load(html);

  const products = $('.product').map((i, el) => ({
    name: $(el).find('.name').text().trim(),
  })).get();

  if (products.length === 0) break; // No more pages
  allProducts.push(...products);

  // Be respectful: wait between requests
  await new Promise(r => setTimeout(r, 1000 + Math.random() * 2000));
}
```

Pattern 2: Click-to-load / Infinite scroll
```javascript
const page = await browser.newPage();
await page.goto('https://example.com/feed');

let previousHeight = 0;
while (true) {
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(2000);

  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) break;
  previousHeight = currentHeight;
}

// Now extract all loaded items
const items = await page.$$eval('.feed-item', els =>
  els.map(el => el.textContent.trim())
);
```

Pattern 3: Cursor-based API pagination
```javascript
let cursor = null;
const allItems = [];

do {
  const url = new URL('https://example.com/api/items');
  url.searchParams.set('limit', '100');
  if (cursor) url.searchParams.set('cursor', cursor);

  const response = await fetch(url);
  const data = await response.json();

  allItems.push(...data.items);
  cursor = data.next_cursor;
} while (cursor);
```

Concurrency Without Getting Blocked
Sending requests one at a time is slow. Sending them all at once gets you blocked. The sweet spot is controlled concurrency:
```javascript
async function scrapeWithConcurrency(urls, maxConcurrent = 5) {
  const results = [];
  const executing = new Set();

  for (const url of urls) {
    const promise = scrapeUrl(url).then(result => {
      executing.delete(promise);
      results.push(result);
    });
    executing.add(promise);

    // Once the pool is full, wait for any in-flight request to finish
    if (executing.size >= maxConcurrent) {
      await Promise.race(executing);
    }
  }

  await Promise.all(executing);
  return results;
}
```

Start with 3-5 concurrent requests and increase only if the target server handles it without errors. Add jitter (random delays) between batches to look less like a bot.
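The jitter can be as simple as a helper like this (the base and spread defaults are arbitrary starting points, not recommendations):

```javascript
// Jittered delay: a fixed base plus a random spread, so retries and
// batches from multiple workers do not all land on the same instant.
function jitteredDelayMs(baseMs = 1000, spreadMs = 1000) {
  return baseMs + Math.floor(Math.random() * spreadMs);
}

// Usage between batches:
// await new Promise(r => setTimeout(r, jitteredDelayMs()));
```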
Error Handling That Actually Works
Scraping is inherently unreliable. Sites go down, layouts change, rate limits kick in. Build retry logic from the start:
```javascript
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url);

      if (response.status === 429) {
        const retryAfter = parseInt(response.headers.get('retry-after') || '60', 10);
        console.log(`Rate limited. Waiting ${retryAfter}s...`);
        await new Promise(r => setTimeout(r, retryAfter * 1000));
        continue;
      }

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      return await response.text();
    } catch (error) {
      if (attempt === maxRetries) throw error;

      const backoff = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
      console.log(`Attempt ${attempt} failed. Retrying in ${Math.round(backoff / 1000)}s...`);
      await new Promise(r => setTimeout(r, backoff));
    }
  }
}
```

Key points:
- Exponential backoff prevents hammering a server that is already struggling.
- Jitter (the random component) prevents all your retries from hitting at the same time.
- Respect Retry-After headers. If a server tells you when to come back, listen.
When to Stop Building and Use a Scraping API
There is a point where maintaining your own scraping infrastructure costs more than it saves. You have crossed that line when:
- You are spending more time on infrastructure than on using the data. Proxy rotation, CAPTCHA solving, browser fingerprint management, and IP ban recovery are full-time problems.
- Anti-bot systems keep winning. Cloudflare, DataDome, and PerimeterX update their detection weekly. Your bypass that worked last Tuesday is already flagged.
- You need reliability. Production systems that depend on scraped data cannot afford a 20% failure rate because your proxy pool degraded overnight.
- Scale changed. Scraping 100 pages a day with Puppeteer is fine. Scraping 100,000 pages a day means managing browser instances, memory limits, and concurrent connections across multiple servers.
A scraping API like AlterLab handles the entire infrastructure layer: proxy rotation, anti-bot bypass, browser rendering, and retry logic. You send a URL, you get back data.
```javascript
const response = await fetch('https://alterlab.io/api/v1/scrape', {
  method: 'POST',
  headers: {
    'X-API-Key': 'your-api-key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://target-site.com/products',
    render_js: true,
  }),
});

const { content } = await response.json();

// Parse the returned HTML with cheerio as usual
const $ = cheerio.load(content);
```

Your scraping logic stays the same. The difference is that someone else handles the arms race with anti-bot systems.
Project Structure for Larger Scrapers
Once you go beyond a single script, organize your scraper like any other Node.js project:
```
scraper/
  src/
    scrapers/       # One file per site/source
      amazon.js
      competitor.js
    extractors/     # DOM parsing logic
      product.js
      pricing.js
    utils/
      retry.js      # Retry/backoff logic
      rateLimit.js  # Request throttling
    index.js        # Entry point and orchestration
  output/           # Scraped data (gitignored)
  package.json
```

Separate the fetching (getting the HTML) from the extracting (parsing the data). When a site changes its layout, you only update the extractor. When you switch from direct fetching to an API, you only update the scraper.
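A sketch of that split (module names are hypothetical, and a trivial regex stands in for cheerio so the example stays dependency-free):

```javascript
// extractors/product.js -- pure function: HTML in, data out. The only
// file that changes when the site's markup changes, and easy to unit
// test against saved HTML fixtures.
function extractProducts(html) {
  return [...html.matchAll(/data-name="([^"]+)"/g)].map(m => ({ name: m[1] }));
}

// scrapers/competitor.js -- only knows how to obtain HTML. Swapping a
// direct fetch for a scraping API touches this function alone.
async function fetchProductPage(pageNum) {
  const response = await fetch(`https://example.com/products?page=${pageNum}`);
  return response.text();
}

// index.js -- orchestration composes the two.
async function scrapeCompetitorPage(pageNum) {
  return extractProducts(await fetchProductPage(pageNum));
}
```

Because the extractor never touches the network, you can regression-test it against HTML files checked into the repo.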
Quick Reference
| Scenario | Tool | Why |
|---|---|---|
| Static HTML | fetch + cheerio | Fast, lightweight, no browser overhead |
| JS-rendered pages | Playwright | Auto-wait, multi-browser, clean API |
| Anti-bot protected | Scraping API | Infrastructure handled for you |
| Internal APIs | fetch directly | Skip the browser entirely |
| High volume (10k+ pages/day) | Scraping API | Proxy management at scale is a full-time job |
The best scraper is the simplest one that gets the job done. Start with fetch, add Playwright when you need it, and move to a scraping API when you would rather spend time on your product than on fighting bot detection.