Web Scraping with Node.js — Complete Guide
From your first fetch() call to concurrent data pipelines. Everything you need to scrape websites reliably with Node.js and TypeScript.
Node.js is a strong platform for web scraping — native async/await makes concurrent requests natural, the npm ecosystem provides excellent HTML parsing tools, and TypeScript adds type safety to large extraction codebases. This guide covers the full stack: fetching pages, parsing HTML with Cheerio, handling JavaScript-rendered content, and scaling with Promise.all().
Setting Up Your Node.js Scraping Project
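You need Node.js 18+ (for native fetch) and a few packages. Start with a TypeScript project for better maintainability on large scraping jobs. The examples in this guide use top-level await, and recent versions of p-limit are published as ES modules only, so set "type": "module" in package.json.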
mkdir my-scraper && cd my-scraper
npm init -y
npm install cheerio axios p-limit
npm install -D typescript @types/node tsx
Fetching Pages with Node.js
Node.js 18+ ships with native fetch, so simple cases do not need axios. axios remains a common choice when you want interceptors and built-in timeout options; retries are usually added through an interceptor or a plugin such as axios-retry.
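Retries can also be layered on top of native fetch with a small wrapper. The following is a minimal sketch, assuming exponential backoff and a retry only on timeouts, network errors, HTTP 429, and 5xx responses. The names backoffDelayMs, isRetryableStatus, and fetchWithRetry, and all default values, are illustrative assumptions, not part of any library.

```typescript
// Hypothetical helpers for illustration; tune the defaults for your target site.

/** Exponential backoff: baseMs, 2x, 4x, ... capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

/** Statuses that usually signal a temporary problem worth retrying. */
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(
  url: string,
  maxRetries = 3,
  timeoutMs = 10_000
): Promise<string> {
  let lastError: unknown = new Error(`No attempt made for ${url}`);
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const response = await fetch(url, {
        signal: AbortSignal.timeout(timeoutMs),
      });
      if (response.ok) return await response.text();
      lastError = new Error(`HTTP ${response.status} on ${url}`);
      // A 404 or 403 will not fix itself, so stop immediately.
      if (!isRetryableStatus(response.status)) break;
    } catch (err) {
      // Network failure or timeout: treat as temporary and retry.
      lastError = err;
    }
    // Wait before the next attempt, but not after the final one.
    if (attempt < maxRetries) await sleep(backoffDelayMs(attempt));
  }
  throw lastError;
}
```

A call would look like `const html = await fetchWithRetry("https://example.com/products");`. Capping the delay keeps a long outage from stretching a single request into minutes of waiting.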
// Native fetch (Node 18+)
const response = await fetch("https://example.com/products", {
  headers: {
    "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)",
    Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  },
  signal: AbortSignal.timeout(10000), // 10-second timeout
});

if (!response.ok) {
  throw new Error(`HTTP ${response.status} on ${response.url}`);
}

const html = await response.text();
Parsing HTML with Cheerio
Cheerio is a server-side implementation of the jQuery API for parsing HTML. It uses CSS selectors and the familiar .find(), .text(), and .attr() methods — fast and familiar if you know jQuery.
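One detail to plan for when reading links with .attr("href"): the value is often relative (for example /products/42) and needs to be resolved against the page URL before it can be fetched. A small sketch using the built-in URL class follows; toAbsoluteUrl is a hypothetical helper name, and dropping the fragment is an assumption that suits most crawls.

```typescript
/**
 * Resolve an href taken from a page into an absolute http(s) URL.
 * Returns null for missing, malformed, or non-web links (mailto:, javascript:, ...).
 */
function toAbsoluteUrl(href: string | undefined, pageUrl: string): string | null {
  if (!href) return null;
  try {
    const resolved = new URL(href, pageUrl);
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      return null;
    }
    resolved.hash = ""; // the fragment does not change what the server returns
    return resolved.href;
  } catch {
    return null; // new URL() throws on input it cannot parse
  }
}
```

With Cheerio, this could wrap the extracted attribute, for example `toAbsoluteUrl($(el).find("a").attr("href"), pageUrl)`, where pageUrl is the address the HTML was fetched from.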
import * as cheerio from "cheerio";

const $ = cheerio.load(html);

// CSS selectors: same as browser querySelector
const products: Array<{ title: string; price: string; url: string }> = [];
$("div.product-card").each((_i, el) => {
  products.push({
    title: $(el).find("h2.product-title").text().trim(),
    price: $(el).find("span.price").text().trim(),
    url: $(el).find("a").attr("href") ?? "",
  });
});

console.log(`Extracted ${products.length} products`);
Concurrent Scraping with Promise.all()
Node.js's event loop makes concurrent scraping natural. Use Promise.all() to fire multiple requests simultaneously. Use p-limit to control maximum concurrency and avoid overwhelming target servers.
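One caveat worth knowing: Promise.all() rejects as soon as any single request fails, so one bad URL discards the results of every other page. A hedged sketch of an alternative using the standard Promise.allSettled() is shown here; partitionSettled and scrapeAllSettled are hypothetical helper names, and the scrapeOne parameter stands in for whatever fetch function you use.

```typescript
interface FailedPage {
  url: string;
  reason: string;
}

/** Split allSettled() outcomes into successful values and failed URLs. */
function partitionSettled<T>(
  urls: string[],
  settled: PromiseSettledResult<T>[]
): { succeeded: T[]; failed: FailedPage[] } {
  const succeeded: T[] = [];
  const failed: FailedPage[] = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      succeeded.push(outcome.value);
    } else {
      // allSettled() preserves input order, so index i maps back to urls[i].
      failed.push({ url: urls[i] ?? "(unknown)", reason: String(outcome.reason) });
    }
  });
  return { succeeded, failed };
}

/** Run scrapeOne for every URL and keep going when some of them fail. */
async function scrapeAllSettled<T>(
  urls: string[],
  scrapeOne: (url: string) => Promise<T>
): Promise<{ succeeded: T[]; failed: FailedPage[] }> {
  const settled = await Promise.allSettled(urls.map((url) => scrapeOne(url)));
  return partitionSettled(urls, settled);
}
```

To keep the concurrency cap from the p-limit example in this section, scrapeOne can be `(url) => limit(() => scrapePage(url))`. The failed list can then be logged or fed into a second pass.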
import pLimit from "p-limit";

const limit = pLimit(5); // max 5 concurrent requests

async function scrapePage(url: string): Promise<{ url: string; html: string }> {
  const response = await fetch(url, { signal: AbortSignal.timeout(15000) });
  if (!response.ok) throw new Error(`${response.status} on ${url}`);
  return { url, html: await response.text() };
}

const urls = [
  "https://example.com/page/1",
  "https://example.com/page/2",
  "https://example.com/page/3",
  // ... up to thousands
];

const results = await Promise.all(
  urls.map((url) => limit(() => scrapePage(url)))
);

console.log(`Scraped ${results.length} pages`);
Handling JavaScript-Rendered Pages
Pages built with React, Vue, or Angular load content dynamically after the initial HTML response. A plain fetch() returns an empty shell — you need a browser that executes JavaScript.
Options: run Playwright or Puppeteer locally (covered in separate guides), or use a cloud rendering API. AlterLab runs a headless browser server-side and returns the fully rendered HTML through a simple POST request — no local browser management needed.
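Before paying the cost of a browser, it can help to check whether a page actually needs one. The sketch below is a rough heuristic, not a reliable detector: it strips scripts, styles, and tags with regular expressions and treats a page with very little remaining text as a probable client-rendered shell. The function names and the 200-character threshold are assumptions to tune per site.

```typescript
/** Approximate amount of human-visible text in an HTML string. */
function visibleTextLength(html: string): number {
  const text = html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, " ")
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, " ")
    .replace(/<noscript\b[^>]*>[\s\S]*?<\/noscript>/gi, " ")
    .replace(/<[^>]+>/g, " ") // drop the remaining tags
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
  return text.length;
}

/** True when the response looks like an empty shell that JavaScript fills in later. */
function looksLikeEmptyShell(html: string, minTextLength = 200): boolean {
  return visibleTextLength(html) < minTextLength;
}
```

A scraper could try a plain fetch() first and fall back to a rendering option only when looksLikeEmptyShell(html) returns true, which keeps the cheap path for static pages.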
Saving Results to JSON or Database
For simple projects, write to JSON files. For production pipelines, write to PostgreSQL, MongoDB, or stream to a data warehouse.
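For long-running jobs, one option between a single JSON file and a database is JSON Lines: one record per line, appended as each page finishes, so a crash does not lose earlier results. This is a minimal sketch under that assumption; toJsonLines and appendJsonLines are hypothetical helper names.

```typescript
import { appendFileSync } from "node:fs";

/** Serialize records as JSON Lines: one compact JSON object per line. */
function toJsonLines(records: unknown[]): string {
  return records.map((record) => JSON.stringify(record) + "\n").join("");
}

/** Append a batch of records to a .jsonl file, creating it if needed. */
function appendJsonLines(path: string, records: unknown[]): void {
  if (records.length === 0) return; // nothing to write
  appendFileSync(path, toJsonLines(records), "utf-8");
}
```

Calling `appendJsonLines("products.jsonl", products)` after each scraped page keeps the file valid at every point, since each line is a complete JSON document on its own.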
import { createWriteStream, writeFileSync } from "fs";

// Save to JSON
writeFileSync("products.json", JSON.stringify(products, null, 2), "utf-8");

// Or save to CSV. Quote every field and double any embedded quotes
// so that quotes and commas in a title do not break the columns.
const csvField = (value: string) => `"${value.replace(/"/g, '""')}"`;
const stream = createWriteStream("products.csv");
stream.write("title,price,url\n");
products.forEach((p) =>
  stream.write([p.title, p.price, p.url].map(csvField).join(",") + "\n")
);
stream.end();
Complete Node.js Scraper: Paginated Site
A complete scraper with pagination, HTTP status and timeout checks, and JSON output.
import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const HEADERS = {
  "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)",
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
};

async function fetchPage(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: HEADERS,
    signal: AbortSignal.timeout(10000),
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}: ${url}`);
  return response.text();
}

function parseArticles(html: string): Array<{ title: string; url: string }> {
  const $ = cheerio.load(html);
  const articles: Array<{ title: string; url: string }> = [];
  $("article.post").each((_i, el) => {
    const title = $(el).find("h2").text().trim();
    const url = $(el).find("a").attr("href") ?? "";
    if (title) articles.push({ title, url });
  });
  return articles;
}

function hasNextPage(html: string): boolean {
  const $ = cheerio.load(html);
  return $("a.next-page").length > 0;
}

async function scrapeAllPages(baseUrl: string): Promise<Array<{ title: string; url: string }>> {
  const all: Array<{ title: string; url: string }> = [];
  let page = 1;
  while (true) {
    const url = `${baseUrl}?page=${page}`;
    console.log(`Scraping page ${page}…`);
    const html = await fetchPage(url);
    const articles = parseArticles(html);
    // Store this page's articles first so the final page is not dropped.
    all.push(...articles);
    if (articles.length === 0 || !hasNextPage(html)) break;
    page++;
    await new Promise((r) => setTimeout(r, 1000)); // polite delay
  }
  return all;
}

const results = await scrapeAllPages("https://example.com/articles");
writeFileSync("articles.json", JSON.stringify(results, null, 2));
console.log(`Saved ${results.length} articles`);
Or Skip the Complexity: Use AlterLab
AlterLab handles JavaScript rendering, IP rotation, and website compatibility automatically. One POST request returns rendered HTML, with no browser management or proxy configuration. Pricing starts at $0.0002/request, with 5,000 free requests to start.
import * as cheerio from "cheerio";

const API_KEY = "YOUR_API_KEY"; // Get free at alterlab.io

async function scrapeWithAlterLab(
  url: string,
  renderJs = false
): Promise<string> {
  const response = await fetch("https://api.alterlab.io/api/v1/scrape", {
    method: "POST",
    headers: {
      "X-API-Key": API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, render_js: renderJs }),
    signal: AbortSignal.timeout(30000),
  });
  if (!response.ok) throw new Error(`AlterLab API: ${response.status}`);
  const data = await response.json();
  return data.html ?? "";
}

// Works on JavaScript-heavy pages and sites with compatibility layers
const html = await scrapeWithAlterLab(
  "https://example.com/products",
  true // render_js: true for SPAs and dynamic pages
);

const $ = cheerio.load(html);
const products: Array<{ title: string; price: string }> = [];
$("div.product-card").each((_i, el) => {
  const title = $(el).find("h2").text().trim();
  const price = $(el).find(".price").text().trim();
  if (title) products.push({ title, price });
});

console.log(`Extracted ${products.length} products`);
Choosing Your Approach
fetch / axios + Cheerio
Pros
- Lightweight and fast for static pages
- Low resource usage
- Full control over HTTP layer
Cons
- No JavaScript execution
- Manual IP rotation needed
- Breaks on challenge pages
Puppeteer / Playwright
Pros
- Full browser that executes any JavaScript
- Handles complex interactions
Cons
- High memory usage (1+ GB per browser)
- Slower (5–15 seconds per page)
- Browser detection is common
- Complex setup
AlterLab API
Pros
- Handles static, JavaScript, and challenge pages
- No browser management
- Automatic IP rotation
- 5-tier auto-escalation
- From $0.0002/request
Cons
- Per-request cost
- Requires network access