Node.js Guide · Beginner → Production

Web Scraping with Node.js — Complete Guide

From your first fetch() call to concurrent data pipelines. Everything you need to scrape websites reliably with Node.js and TypeScript.

Node.js is a strong platform for web scraping — native async/await makes concurrent requests natural, the npm ecosystem provides excellent HTML parsing tools, and TypeScript adds type safety to large extraction codebases. This guide covers the full stack: fetching pages, parsing HTML with Cheerio, handling JavaScript-rendered content, and scaling with Promise.all().

Setting Up Your Node.js Scraping Project

You need Node.js 18+ (for native fetch) and a few packages. Start with a TypeScript project for better maintainability on large scraping jobs.

mkdir my-scraper && cd my-scraper
npm init -y
npm pkg set type=module   # ES modules, so the top-level await used in the examples works
npm install cheerio axios p-limit
npm install -D typescript @types/node tsx

Fetching Pages with Node.js

Node.js 18+ ships with native fetch — no need for axios for simple cases. For more control over retries, timeouts, and interceptors, axios is the standard choice.

// Native fetch (Node 18+)
const response = await fetch("https://example.com/products", {
  headers: {
    "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)",
    Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  },
  signal: AbortSignal.timeout(10000), // 10-second timeout
});

if (!response.ok) {
  throw new Error(`HTTP ${response.status} on ${response.url}`);
}
const html = await response.text();
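Retries do not strictly require axios. As a minimal sketch, native fetch can be wrapped in a loop that retries network errors, timeouts, HTTP 429 and 5xx responses with exponential backoff, and fails fast on other 4xx responses. The names fetchWithRetry, HttpError and RetryOptions are hypothetical (they come from no library), and the attempt count and delays are illustrative assumptions. The fetchImpl option exists only so the helper can be exercised without network access.

```typescript
// Hypothetical helper, not part of any library: GET with retry and backoff.
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

interface RetryOptions {
  attempts?: number;    // total tries, including the first (assumed default: 3)
  baseDelayMs?: number; // wait before the second try; doubles on each retry
  timeoutMs?: number;   // per-attempt timeout
  fetchImpl?: FetchLike;
}

class HttpError extends Error {
  readonly status: number;
  constructor(status: number, url: string) {
    super(`HTTP ${status} on ${url}`);
    this.status = status;
  }
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Network failures and timeouts are retried; so are 429 and 5xx responses.
function isRetryable(error: unknown): boolean {
  if (error instanceof HttpError) {
    return error.status === 429 || error.status >= 500;
  }
  return true;
}

async function fetchWithRetry(
  url: string,
  options: RetryOptions = {}
): Promise<string> {
  const {
    attempts = 3,
    baseDelayMs = 500,
    timeoutMs = 10_000,
    fetchImpl = fetch,
  } = options;
  let lastError: unknown;

  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const response = await fetchImpl(url, {
        signal: AbortSignal.timeout(timeoutMs),
      });
      if (!response.ok) throw new HttpError(response.status, url);
      return await response.text();
    } catch (error) {
      lastError = error;
      if (!isRetryable(error) || attempt === attempts) break;
      await sleep(baseDelayMs * 2 ** (attempt - 1)); // 500 ms, 1 s, 2 s, ...
    }
  }
  throw lastError;
}
```

With the defaults, a page that keeps failing costs three attempts and about 1.5 seconds of waiting before the last error is rethrown to the caller.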

Parsing HTML with Cheerio

Cheerio is a server-side implementation of the jQuery API for parsing HTML. It uses CSS selectors and the familiar .find(), .text(), and .attr() methods — fast and familiar if you know jQuery.

import * as cheerio from "cheerio";

const $ = cheerio.load(html);

// CSS selectors — same as browser querySelector
const products: Array<{ title: string; price: string; url: string }> = [];

$("div.product-card").each((_i, el) => {
  products.push({
    title: $(el).find("h2.product-title").text().trim(),
    price: $(el).find("span.price").text().trim(),
    url: $(el).find("a").attr("href") ?? "",
  });
});

console.log(`Extracted ${products.length} products`);
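The href values collected this way are often relative (for example "/products/42") and cannot be fetched as they stand. A minimal sketch of one way to normalize them, using the standard URL constructor to resolve each href against the URL of the page it came from. The helper name toAbsoluteUrl is hypothetical, and dropping non-HTTP links is an assumption about what a scraper usually wants.

```typescript
// Hypothetical helper: resolve an href against the URL of the page it was found on.
// Returns null for empty, unparseable, or non-HTTP(S) links.
function toAbsoluteUrl(href: string, pageUrl: string): string | null {
  const trimmed = href.trim();
  if (trimmed === "") return null;
  try {
    const resolved = new URL(trimmed, pageUrl);
    // Skip mailto:, tel:, javascript: and similar schemes.
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      return null;
    }
    return resolved.href;
  } catch {
    return null; // the href could not be parsed as a URL
  }
}

// Example: normalize a batch of scraped links and drop the unusable ones.
const pageUrl = "https://example.com/products?page=2";
const rawLinks = ["/item/42", "https://other.example/x", "mailto:sales@example.com", ""];
const absoluteLinks = rawLinks
  .map((href) => toAbsoluteUrl(href, pageUrl))
  .filter((url): url is string => url !== null);
console.log(absoluteLinks);
```

In the extraction loop above, the same helper could wrap the `.attr("href")` value before it is stored.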

Concurrent Scraping with Promise.all()

Node.js's event loop makes concurrent scraping natural. Use Promise.all() to fire multiple requests simultaneously. Use p-limit to control maximum concurrency and avoid overwhelming target servers.

import pLimit from "p-limit";

const limit = pLimit(5); // max 5 concurrent requests

async function scrapePage(url: string): Promise<{ url: string; html: string }> {
  const response = await fetch(url, { signal: AbortSignal.timeout(15000) });
  if (!response.ok) throw new Error(`${response.status} on ${url}`);
  return { url, html: await response.text() };
}

const urls = [
  "https://example.com/page/1",
  "https://example.com/page/2",
  "https://example.com/page/3",
  // ... up to thousands
];

const results = await Promise.all(
  urls.map((url) => limit(() => scrapePage(url)))
);
console.log(`Scraped ${results.length} pages`);
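One caveat with the pattern above: Promise.all() rejects as soon as any single request fails, and the results of every other page are lost with it. A minimal sketch of a more forgiving variant uses the standard Promise.allSettled(), which waits for every task and reports each outcome separately. The helper name scrapeAllSettled and the shape of its return value are hypothetical choices for illustration.

```typescript
// Hypothetical helper: run one task per URL and separate successes from failures.
interface SettledSummary<T> {
  succeeded: Array<{ url: string; value: T }>;
  failed: Array<{ url: string; reason: string }>;
}

async function scrapeAllSettled<T>(
  urls: string[],
  task: (url: string) => Promise<T>
): Promise<SettledSummary<T>> {
  // allSettled never rejects; each entry is either fulfilled or rejected.
  const outcomes = await Promise.allSettled(urls.map((url) => task(url)));
  const summary: SettledSummary<T> = { succeeded: [], failed: [] };

  outcomes.forEach((outcome, index) => {
    const url = urls[index];
    if (outcome.status === "fulfilled") {
      summary.succeeded.push({ url, value: outcome.value });
    } else {
      const reason =
        outcome.reason instanceof Error
          ? outcome.reason.message
          : String(outcome.reason);
      summary.failed.push({ url, reason });
    }
  });
  return summary;
}
```

To keep the concurrency cap, the task can be the limited call from the example above, for instance `scrapeAllSettled(urls, (url) => limit(() => scrapePage(url)))`. The failed list can then be logged or fed into a second pass.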

Handling JavaScript-Rendered Pages

Pages built with React, Vue, or Angular often render their content in the browser after the initial HTML response. For those pages a plain fetch() returns a near-empty shell, so you need a browser that executes JavaScript.

Options: run Playwright or Puppeteer locally (covered in separate guides), or use a cloud rendering API. AlterLab runs a headless browser server-side and returns the fully rendered HTML through a simple POST request — no local browser management needed.
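Because browser rendering is slower and more expensive than a plain request, it can help to check first whether a fetched page is actually an unrendered shell. The following is a rough heuristic sketch only: it strips scripts, styles and tags with regular expressions and measures how much visible text remains. The function name looksLikeEmptyShell and the 200-character threshold are assumptions, and a page that renders little text legitimately would be misclassified, so tune it per site.

```typescript
// Hypothetical heuristic: does this HTML look like an unrendered SPA shell?
// The default threshold of 200 characters is an arbitrary assumption.
function looksLikeEmptyShell(html: string, minTextLength = 200): boolean {
  const visibleText = html
    // Drop blocks whose content is never shown as page text.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Drop all remaining tags, keeping only the text between them.
    .replace(/<[^>]+>/g, " ")
    // Collapse whitespace so indentation does not count as content.
    .replace(/\s+/g, " ")
    .trim();
  return visibleText.length < minTextLength;
}

// A typical client-rendered shell: a title, an empty mount point, a bundle.
const shellHtml =
  '<html><head><title>App</title></head><body>' +
  '<div id="root"></div><script src="/bundle.js"></script></body></html>';
console.log(looksLikeEmptyShell(shellHtml)); // true: only "App" remains
```

A scraper could fetch with plain fetch() first and fall back to a rendering option only when this check returns true.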

Saving Results to JSON or Database

For simple projects, write to JSON files. For production pipelines, write to PostgreSQL, MongoDB, or stream to a data warehouse.

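import { createWriteStream, writeFileSync } from "fs";

// Save to JSON
writeFileSync("products.json", JSON.stringify(products, null, 2), "utf-8");

// Or save to CSV. Every field is quoted and embedded quotes are doubled,
// so titles containing commas or quotes do not break the columns.
const csvField = (value: string) => `"${value.replace(/"/g, '""')}"`;

const stream = createWriteStream("products.csv");
stream.write("title,price,url\n");
products.forEach((p) =>
  stream.write([p.title, p.price, p.url].map(csvField).join(",") + "\n")
);
stream.end();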

Complete Node.js Scraper — Paginated Site

A complete scraper for a paginated site, with HTTP status checks, a polite delay between pages, and JSON output.

import * as cheerio from "cheerio";
import { writeFileSync } from "fs";

const HEADERS = {
  "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)",
  Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
};

async function fetchPage(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: HEADERS,
    signal: AbortSignal.timeout(10000),
  });
  if (!response.ok) throw new Error(`HTTP ${response.status}: ${url}`);
  return response.text();
}

function parseArticles(html: string): Array<{ title: string; url: string }> {
  const $ = cheerio.load(html);
  const articles: Array<{ title: string; url: string }> = [];
  $("article.post").each((_i, el) => {
    const title = $(el).find("h2").text().trim();
    const url = $(el).find("a").attr("href") ?? "";
    if (title) articles.push({ title, url });
  });
  return articles;
}

function hasNextPage(html: string): boolean {
  const $ = cheerio.load(html);
  return $("a.next-page").length > 0;
}

async function scrapeAllPages(baseUrl: string): Promise<Array<{ title: string; url: string }>> {
  const all: Array<{ title: string; url: string }> = [];
  let page = 1;

  while (true) {
    const url = `${baseUrl}?page=${page}`;
    console.log(`Scraping page ${page}…`);
    const html = await fetchPage(url);
    const articles = parseArticles(html);
    all.push(...articles); // collect first, so the final page is not dropped
    if (articles.length === 0 || !hasNextPage(html)) break;
    page++;
    await new Promise((r) => setTimeout(r, 1000)); // polite delay
  }

  return all;
}

const results = await scrapeAllPages("https://example.com/articles");
writeFileSync("articles.json", JSON.stringify(results, null, 2));
console.log(`Saved ${results.length} articles`);
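Two properties of the loop above are worth hardening for long runs: an exception thrown by fetchPage propagates out of scrapeAllPages and discards everything collected so far, and `while (true)` has no upper bound if a site always reports a next page. A minimal sketch of a generic pagination loop that addresses both follows. The names collectPages and PageResult and the default cap of 100 pages are hypothetical choices, not part of the scraper above.

```typescript
// Hypothetical helper: walk numbered pages, keep partial results on failure,
// and stop after maxPages even if the site always claims there is a next page.
interface PageResult<T> {
  items: T[];
  hasNext: boolean;
}

interface CollectResult<T> {
  items: T[];
  pagesRead: number;
  error?: string; // set when a page failed and the run stopped early
}

async function collectPages<T>(
  loadPage: (page: number) => Promise<PageResult<T>>,
  maxPages = 100
): Promise<CollectResult<T>> {
  const items: T[] = [];
  let pagesRead = 0;

  for (let page = 1; page <= maxPages; page++) {
    try {
      const result = await loadPage(page);
      pagesRead++;
      items.push(...result.items); // collect before deciding whether to stop
      if (result.items.length === 0 || !result.hasNext) break;
    } catch (error) {
      // Return what was collected so far instead of losing the whole run.
      const message = error instanceof Error ? error.message : String(error);
      return { items, pagesRead, error: message };
    }
  }
  return { items, pagesRead };
}
```

With the functions from the scraper above, loadPage could fetch `${baseUrl}?page=${page}` with fetchPage, then return `{ items: parseArticles(html), hasNext: hasNextPage(html) }`. The polite delay can live inside loadPage as well.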

Or Skip the Complexity — Use AlterLab

AlterLab handles JavaScript rendering, IP rotation, and website compatibility automatically. One POST request returns rendered HTML, with no browser management and no proxy configuration. Pricing starts at $0.0002 per request, with 5,000 free requests to start.

import * as cheerio from "cheerio";

const API_KEY = "YOUR_API_KEY"; // Get free at alterlab.io

async function scrapeWithAlterLab(
  url: string,
  renderJs = false
): Promise<string> {
  const response = await fetch("https://api.alterlab.io/api/v1/scrape", {
    method: "POST",
    headers: {
      "X-API-Key": API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, render_js: renderJs }),
    signal: AbortSignal.timeout(30000),
  });
  if (!response.ok) throw new Error(`AlterLab API: ${response.status}`);
  const data = await response.json();
  return data.html ?? "";
}

// Works on JavaScript-heavy pages and sites with compatibility layers
const html = await scrapeWithAlterLab(
  "https://example.com/products",
  true // render_js: true for SPAs and dynamic pages
);

const $ = cheerio.load(html);
const products: Array<{ title: string; price: string }> = [];

$("div.product-card").each((_i, el) => {
  const title = $(el).find("h2").text().trim();
  const price = $(el).find(".price").text().trim();
  if (title) products.push({ title, price });
});

console.log(`Extracted ${products.length} products`);

Choosing Your Approach

fetch / axios + Cheerio

Pros

  • Lightweight and fast for static pages
  • Low resource usage
  • Full control over HTTP layer

Cons

  • No JavaScript execution
  • Manual IP rotation needed
  • Breaks on challenge pages

Puppeteer / Playwright

Pros

  • Full browser that executes any JavaScript
  • Handles complex interactions

Cons

  • High memory usage (1+ GB per browser)
  • Slower (5–15 seconds per page)
  • Browser detection is common
  • Complex setup

AlterLab API

Pros

  • Handles static, JavaScript, and challenge pages
  • No browser management
  • Automatic IP rotation
  • 5-tier auto-escalation
  • From $0.0002/request

Cons

  • Per-request cost
  • Requires network access


Your first scrape.
Sixty seconds.

$1 free balance. No credit card. No SDK. Just a POST request.

curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}'

No credit card required · Up to 5,000 free scrapes · Balance never expires
