Web Scraping with Node.js in 2026: Puppeteer, Playwright, and When to Use a Scraping API
Yash Dubey
February 12, 2026
Most web scraping tutorials are written for Python. That is fine if Python is your stack, but if you are building a Node.js application, switching languages just to scrape data adds unnecessary complexity to your deployment.
Here is how to scrape effectively with JavaScript, which tools to use for which situations, and when to stop fighting browser automation entirely.
The Node.js Scraping Stack
Your tool choice depends on what you are scraping:
- Static HTML pages → `fetch` + cheerio
- JavaScript-rendered pages → Puppeteer or Playwright
- Anti-bot protected sites → Scraping API (more on this later)
Do not reach for a headless browser when you do not need one. Most pages serve their content in the initial HTML response. Check first.
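One way to check, sketched below: fetch the raw HTML and look for a string you expect in the data (a product name, a heading). The `marker` parameter and the probe function itself are illustrative, not a standard API.

```javascript
// A minimal probe, assuming you know a string (a product name, a heading)
// that should appear in the scraped data. If the raw HTML already contains
// it, fetch + cheerio is enough; if not, the page likely renders client-side.
function containsContent(html, marker) {
  return html.includes(marker);
}

async function probe(url, marker) {
  const response = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; probe)' },
  });
  const html = await response.text();
  return containsContent(html, marker)
    ? 'static: use fetch + cheerio'
    : 'dynamic: use a headless browser';
}
```

You can also open the page with View Source (not the DevTools inspector, which shows the rendered DOM) and search for the same string by hand.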
Starting Simple: fetch + cheerio
For static pages, this is all you need:
```javascript
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/products');
const html = await response.text();
const $ = cheerio.load(html);

const products = [];
$('.product-card').each((i, el) => {
  products.push({
    name: $(el).find('.title').text().trim(),
    price: $(el).find('.price').text().trim(),
    url: $(el).find('a').attr('href'),
  });
});

console.log(products);
```

This runs in milliseconds, uses almost no memory, and handles most documentation sites, blogs, directories, and simple product pages.
Common mistakes at this stage:
- Not setting a User-Agent. Many sites block requests with no User-Agent or with the default Node.js one. Set it to something realistic.
- Not handling encoding. Some sites use non-UTF-8 encoding. Check the `Content-Type` header and decode accordingly.
- Fetching too fast. Even without anti-bot protection, hammering a server with hundreds of concurrent requests gets your IP blocked.
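A fetch wrapper covering all three points might look like this (the User-Agent string and the delay values are illustrative starting points, not prescriptive):

```javascript
// Pull the charset out of a Content-Type header, defaulting to UTF-8.
function charsetFrom(contentType) {
  return /charset=([^;]+)/i.exec(contentType || '')?.[1]?.trim() ?? 'utf-8';
}

// Fetch with a realistic User-Agent, decode using the declared charset,
// and pause briefly so consecutive calls do not hammer the server.
async function politeFetch(url) {
  const response = await fetch(url, {
    headers: {
      'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    },
  });

  const charset = charsetFrom(response.headers.get('content-type'));
  const html = new TextDecoder(charset).decode(await response.arrayBuffer());

  // Illustrative throttle: roughly 1-2s between requests.
  await new Promise(r => setTimeout(r, 1000 + Math.random() * 1000));
  return html;
}
```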
When You Need a Browser: Puppeteer vs Playwright
If the page content loads via JavaScript (React, Vue, Angular apps), you need a headless browser.
Puppeteer
Google's browser automation library, focused on Chrome/Chromium.
```javascript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: 'new' });
const page = await browser.newPage();
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
await page.goto('https://example.com/app', { waitUntil: 'networkidle2' });

// Wait for the specific element you need
await page.waitForSelector('.data-table');

const data = await page.evaluate(() => {
  const rows = document.querySelectorAll('.data-table tr');
  return Array.from(rows).map(row => {
    const cells = row.querySelectorAll('td');
    return {
      name: cells[0]?.textContent?.trim(),
      value: cells[1]?.textContent?.trim(),
    };
  });
});

await browser.close();
```

Playwright
Microsoft's alternative. Supports Chrome, Firefox, and WebKit.
```javascript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/app');
await page.waitForSelector('.data-table');

const data = await page.$$eval('.data-table tr', rows =>
  rows.map(row => {
    const cells = row.querySelectorAll('td');
    return {
      name: cells[0]?.textContent?.trim(),
      value: cells[1]?.textContent?.trim(),
    };
  })
);

await browser.close();
```

Which One to Pick
Playwright is the better choice for most scraping projects in 2026:
- Auto-wait built in. Playwright automatically waits for elements to be actionable before interacting. Puppeteer requires manual `waitForSelector` calls everywhere.
- Better selectors. Playwright supports `text=`, `role=`, and CSS selectors out of the box.
- Multi-browser. If a site blocks Chrome, try Firefox or WebKit without rewriting your code.
- Network interception is cleaner. Intercept API calls the page makes and grab the JSON directly instead of parsing the DOM.
Puppeteer still makes sense if you are already deep in the Google ecosystem or need Chrome-specific DevTools protocol features.
The Network Interception Trick
Here is something most tutorials skip: many SPAs fetch their data from an API that returns JSON. Instead of parsing the rendered DOM, intercept the network request and grab the structured data directly.
```javascript
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Intercept API responses
const apiData = [];
page.on('response', async (response) => {
  const url = response.url();
  if (url.includes('/api/products')) {
    const json = await response.json();
    apiData.push(...json.results);
  }
});

await page.goto('https://example.com/products');
await page.waitForTimeout(3000); // Wait for API calls to complete
await browser.close();

console.log(apiData); // Clean JSON, no DOM parsing needed
```

This gives you cleaner data with less code. The tradeoff is that it breaks if the site changes its internal API endpoints, but the same is true for DOM selectors.
Handling Pagination
Most scraping projects need to handle pagination. Three patterns cover almost every site:
Pattern 1: URL-based pagination
```javascript
const allProducts = [];

for (let page = 1; page <= 50; page++) {
  const response = await fetch(`https://example.com/products?page=${page}`);
  const html = await response.text();
  const $ = cheerio.load(html);

  const products = $('.product').map((i, el) => ({
    name: $(el).find('.name').text().trim(),
  })).get();

  if (products.length === 0) break; // No more pages
  allProducts.push(...products);

  // Be respectful: wait between requests
  await new Promise(r => setTimeout(r, 1000 + Math.random() * 2000));
}
```

Pattern 2: Click-to-load / Infinite scroll
```javascript
const page = await browser.newPage();
await page.goto('https://example.com/feed');

let previousHeight = 0;
while (true) {
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
  await page.waitForTimeout(2000);

  const currentHeight = await page.evaluate(() => document.body.scrollHeight);
  if (currentHeight === previousHeight) break;
  previousHeight = currentHeight;
}

// Now extract all loaded items
const items = await page.$$eval('.feed-item', els =>
  els.map(el => el.textContent.trim())
);
```

Pattern 3: Cursor-based API pagination
```javascript
let cursor = null;
const allItems = [];

do {
  const url = new URL('https://example.com/api/items');
  url.searchParams.set('limit', '100');
  if (cursor) url.searchParams.set('cursor', cursor);

  const response = await fetch(url);
  const data = await response.json();

  allItems.push(...data.items);
  cursor = data.next_cursor;
} while (cursor);
```

Concurrency Without Getting Blocked
Sending requests one at a time is slow. Sending them all at once gets you blocked. The sweet spot is controlled concurrency:
```javascript
async function scrapeWithConcurrency(urls, maxConcurrent = 5) {
  const results = [];
  const executing = new Set();

  for (const url of urls) {
    const promise = scrapeUrl(url).then(result => {
      executing.delete(promise);
      results.push(result);
    });
    executing.add(promise);

    // Once the pool is full, wait for any in-flight request to finish
    if (executing.size >= maxConcurrent) {
      await Promise.race(executing);
    }
  }

  await Promise.all(executing);
  return results;
}
```

Start with 3-5 concurrent requests and increase only if the target server handles it without errors. Add jitter (random delays) between batches to look less like a bot.
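The jitter can be as simple as a helper like this (the base and spread defaults are arbitrary starting points, not recommendations):

```javascript
// Jittered delay: a fixed base plus a random spread, so retries and
// batches from multiple workers do not all land on the same instant.
function jitteredDelayMs(baseMs = 1000, spreadMs = 1000) {
  return baseMs + Math.floor(Math.random() * spreadMs);
}

// Usage between batches:
// await new Promise(r => setTimeout(r, jitteredDelayMs()));
```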
Error Handling That Actually Works
Scraping is inherently unreliable. Sites go down, layouts change, rate limits kick in. Build retry logic from the start:
```javascript
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url);

      if (response.status === 429) {
        const retryAfter = parseInt(response.headers.get('retry-after') || '60', 10);
        console.log(`Rate limited. Waiting ${retryAfter}s...`);
        await new Promise(r => setTimeout(r, retryAfter * 1000));
        continue;
      }

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}`);
      }

      return await response.text();
    } catch (error) {
      if (attempt === maxRetries) throw error;

      const backoff = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
      console.log(`Attempt ${attempt} failed. Retrying in ${Math.round(backoff / 1000)}s...`);
      await new Promise(r => setTimeout(r, backoff));
    }
  }
}
```

Key points:
- Exponential backoff prevents hammering a server that is already struggling.
- Jitter (the random component) prevents all your retries from hitting at the same time.
- Respect Retry-After headers. If a server tells you when to come back, listen.
When to Stop Building and Use a Scraping API
There is a point where maintaining your own scraping infrastructure costs more than it saves. You have crossed that line when:
- You are spending more time on infrastructure than on using the data. Proxy rotation, CAPTCHA solving, browser fingerprint management, and IP ban recovery are full-time problems.
- Anti-bot systems keep winning. Cloudflare, DataDome, and PerimeterX update their detection weekly. Your bypass that worked last Tuesday is already flagged.
- You need reliability. Production systems that depend on scraped data cannot afford a 20% failure rate because your proxy pool degraded overnight.
- Scale changed. Scraping 100 pages a day with Puppeteer is fine. Scraping 100,000 pages a day means managing browser instances, memory limits, and concurrent connections across multiple servers.
A scraping API like AlterLab handles the entire infrastructure layer: proxy rotation, anti-bot bypass, browser rendering, and retry logic. You send a URL, you get back data.
```javascript
const response = await fetch('https://alterlab.io/api/v1/scrape', {
  method: 'POST',
  headers: {
    'X-API-Key': 'your-api-key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://target-site.com/products',
    render_js: true,
  }),
});

const { content } = await response.json();

// Parse the returned HTML with cheerio as usual
const $ = cheerio.load(content);
```

Your scraping logic stays the same. The difference is that someone else handles the arms race with anti-bot systems.
Project Structure for Larger Scrapers
Once you go beyond a single script, organize your scraper like any other Node.js project:
```
scraper/
  src/
    scrapers/       # One file per site/source
      amazon.js
      competitor.js
    extractors/     # DOM parsing logic
      product.js
      pricing.js
    utils/
      retry.js      # Retry/backoff logic
      rateLimit.js  # Request throttling
    index.js        # Entry point and orchestration
  output/           # Scraped data (gitignored)
  package.json
```

Separate the fetching (getting the HTML) from the extracting (parsing the data). When a site changes its layout, you only update the extractor. When you switch from direct fetching to an API, you only update the scraper.
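A sketch of that split (module names are hypothetical, and a trivial regex stands in for cheerio so the example stays dependency-free):

```javascript
// extractors/product.js -- pure function: HTML in, data out. The only
// file that changes when the site's markup changes, and easy to unit
// test against saved HTML fixtures.
function extractProducts(html) {
  return [...html.matchAll(/data-name="([^"]+)"/g)].map(m => ({ name: m[1] }));
}

// scrapers/competitor.js -- only knows how to obtain HTML. Swapping a
// direct fetch for a scraping API touches this function alone.
async function fetchProductPage(pageNum) {
  const response = await fetch(`https://example.com/products?page=${pageNum}`);
  return response.text();
}

// index.js -- orchestration composes the two.
async function scrapeCompetitorPage(pageNum) {
  return extractProducts(await fetchProductPage(pageNum));
}
```

Because the extractor never touches the network, you can regression-test it against HTML files checked into the repo.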
Quick Reference
| Scenario | Tool | Why |
|---|---|---|
| Static HTML | fetch + cheerio | Fast, lightweight, no browser overhead |
| JS-rendered pages | Playwright | Auto-wait, multi-browser, clean API |
| Anti-bot protected | Scraping API | Infrastructure handled for you |
| Internal APIs | fetch directly | Skip the browser entirely |
| High volume (10k+ pages/day) | Scraping API | Proxy management at scale is a full-time job |
The best scraper is the simplest one that gets the job done. Start with fetch, add Playwright when you need it, and move to a scraping API when you would rather spend time on your product than on fighting bot detection.