
Build an n8n Web Scraping Pipeline Without Code

Learn how to build a production-grade web scraping pipeline in n8n using HTTP Request nodes, JavaScript transforms, pagination handling, and automatic retries.

Yash Dubey

March 21, 2026

8 min read

n8n gives you a visual canvas for wiring together APIs, databases, and triggers. What it doesn't give you is a scraper — and that gap matters the moment you hit a JavaScript-heavy SPA, a Cloudflare-protected page, or a target that rate-limits by IP.

This guide shows you how to close that gap: build an n8n pipeline that extracts data from any website on a schedule, transforms it, handles pagination and errors, and routes results to any downstream sink.

What You'll Build

A recurring scraping pipeline that:

  • Fires on a cron schedule (or webhook trigger)
  • Requests rendered HTML or structured JSON from a scraping API
  • Parses and normalizes the response in a Code node
  • Loops through paginated results until all records are collected
  • Writes clean records to Postgres, Google Sheets, or any of n8n's 400+ integrations

No Puppeteer process to babysit. No proxy pool to rotate. No CAPTCHA solver to maintain.

Prerequisites

  • n8n running locally (npx n8n) or on n8n Cloud
  • An API key from AlterLab (free tier covers testing)
  • Basic familiarity with n8n's canvas — if you can drag a node, you're set

Pipeline Architecture

Each node has exactly one responsibility. Keep it that way and debugging becomes trivial.

Step 1: Create the Trigger

Add a Schedule Trigger node. Set the interval based on data freshness requirements:

  • Product prices: every 1–4 hours
  • News headlines: every 15–30 minutes
  • Job listings: every 6–12 hours

During development, use Manual Trigger instead. Swap it out for Schedule Trigger when you're ready for production — nothing else in the pipeline changes.
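If you use the Schedule Trigger's cron mode instead of the interval presets, the freshness tiers above map to standard five-field cron expressions. A sketch — the exact expressions are suggestions, tune them to your targets:

```python
# Suggested cron expressions for the Schedule Trigger's cron mode.
# Field order: minute hour day-of-month month day-of-week.
CRON_PRESETS = {
    "product_prices": "0 */4 * * *",    # top of every 4th hour
    "news_headlines": "*/15 * * * *",   # every 15 minutes
    "job_listings": "0 */6 * * *",      # top of every 6th hour
}

def cron_fields(expr: str) -> list[str]:
    """Split a cron expression into its five fields as a sanity check."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}")
    return fields
```

Paste the expression into the Schedule Trigger's cron field; the workflow itself stays identical.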

Step 2: Configure the HTTP Request Node

Add an HTTP Request node after the trigger.

  • Method: POST
  • URL: https://api.alterlab.io/v1/scrape
  • Authentication: Header Auth → key X-API-Key, value YOUR_KEY

Set the request body to JSON:

JSON
{
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "render": false,
  "extract_rules": {
    "titles": {
      "selector": "article.product_pod h3 a",
      "type": "list",
      "output": "text"
    },
    "prices": {
      "selector": "article.product_pod .price_color",
      "type": "list",
      "output": "text"
    },
    "next_page": {
      "selector": "li.next a",
      "type": "item",
      "output": "attr:href"
    }
  }
}

extract_rules returns structured JSON directly — parallel arrays where index i in titles corresponds to index i in prices. For complex extraction logic or when you need full DOM access, omit extract_rules and work with the raw HTML in the Code node.

Set "render": true for React, Vue, or Angular pages that hydrate client-side. This triggers a headless browser on the API side. Adjust the HTTP node timeout to 60 seconds when render is enabled.
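The render/timeout rule of thumb reduces to a small helper. A sketch in Python — the function name and the 30-second static-page default are ours; the 60-second rendered figure matches the guidance above:

```python
def build_scrape_request(url: str, render: bool) -> tuple[dict, float]:
    """Return the scrape request body and a suggested HTTP timeout in seconds.
    Rendered pages run a headless browser server-side (seconds per request);
    static fetches usually return in well under a second."""
    body = {"url": url, "render": render}
    timeout = 60.0 if render else 30.0
    return body, timeout
```

The same rule applies inside n8n: bump the HTTP Request node's timeout option whenever the body carries "render": true.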

Here is the equivalent cURL command for testing outside n8n before wiring the node:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "render": false,
    "extract_rules": {
      "titles": {"selector": "article.product_pod h3 a", "type": "list", "output": "text"},
      "prices": {"selector": "article.product_pod .price_color", "type": "list", "output": "text"},
      "next_page": {"selector": "li.next a", "type": "item", "output": "attr:href"}
    }
  }'


Step 3: Transform the Response

When extract_rules returns parallel arrays, zip them into records in a Code node. Add the node after HTTP Request, set language to JavaScript:

JAVASCRIPT
const data = $input.first().json;

const titles = data.titles ?? [];
const prices = data.prices ?? [];

// Zip parallel arrays into structured records
const records = titles.map((title, i) => ({
  title: title.trim(),
  price: parseFloat(prices[i]?.replace('£', '') ?? '0'),
  currency: 'GBP',
  scraped_at: new Date().toISOString(),
  source_url: $('HTTP Request').first().json._meta?.url ?? null,
  page: $getWorkflowStaticData('node').pageCount ?? 1,
}));

return records.map(r => ({ json: r }));

This Code node zips, cleans, type-casts, and attaches metadata. Its output is a flat array of items — each item flows independently into downstream nodes. A Postgres INSERT node will write one row per item automatically.
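The same zip-and-clean step is easy to unit-test in Python before porting it into the Code node. A sketch — the function name is ours, and the `£` handling mirrors the JavaScript above:

```python
from datetime import datetime, timezone

def zip_records(titles: list[str], prices: list[str], page: int = 1) -> list[dict]:
    """Pair parallel title/price arrays into structured records,
    trimming whitespace and casting prices to floats."""
    records = []
    for i, title in enumerate(titles):
        # Guard against prices being shorter than titles
        raw_price = prices[i] if i < len(prices) else "0"
        records.append({
            "title": title.strip(),
            "price": float(raw_price.replace("£", "") or "0"),
            "currency": "GBP",
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "page": page,
        })
    return records
```

Running your selectors against one saved response and asserting on the zipped output catches most extraction bugs before they reach the canvas.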

Step 4: Handle Pagination

Most real targets paginate. The standard n8n pattern:

  1. An IF node checks whether next_page is non-null
  2. On true: update the URL and loop back to the HTTP Request node
  3. On false: exit to the output node

Store the next page path in workflow static data so it survives across loop iterations:

JAVASCRIPT
const nextPage = $input.first().json.next_page;

// Persist state across iterations
const staticData = $getWorkflowStaticData('node');
staticData.nextPagePath = nextPage ?? null;
staticData.pageCount = (staticData.pageCount ?? 0) + 1;

const baseUrl = 'https://books.toscrape.com/catalogue/';
const nextUrl = nextPage ? `${baseUrl}${nextPage}` : null;

return [{
  json: {
    has_next_page: !!nextPage,
    next_url: nextUrl,
    pages_collected: staticData.pageCount,
  }
}];

Wire the IF node: has_next_page === true loops back to HTTP Request with next_url injected into the request body via an expression ({{ $json.next_url }}). Always add a max_pages guard — check pages_collected > 50 on the true branch and terminate if hit. Without the guard, a malformed site that keeps emitting a next-page link will loop indefinitely.

Step 5: Error Handling

By default, an HTTP Request node error halts the entire workflow. Enable Continue On Error in the node settings so failures flow through as data, then add an IF node checking $json.$response.statusCode >= 400.

For transient failures (429, 503, 502), add a Wait node on the error branch and loop back to retry:

JAVASCRIPT
const status = $input.first().json.$response?.statusCode ?? 0;
const headers = $input.first().json.$response?.headers ?? {};

// Respect Retry-After header if present, otherwise use fixed backoff
const retryable = [429, 502, 503, 504].includes(status);
const retryAfter = parseInt(headers['retry-after'] ?? '0', 10);
const delay = retryable
  ? (retryAfter > 0 ? retryAfter * 1000 : 5000 * Math.pow(2, $getWorkflowStaticData('node').retries ?? 0))
  : 0;

const retries = ($getWorkflowStaticData('node').retries ?? 0) + 1;
$getWorkflowStaticData('node').retries = retries;

return [{
  json: { retryable: retryable && retries <= 3, delay_ms: delay, status }
}];

Wire: retryable === true → Wait (delay_ms milliseconds) → HTTP Request. retryable === false → Slack/PagerDuty alert node. After a successful page, reset retries to 0 in the Transform Code node.

Python Equivalent

If you want to run the same pipeline outside n8n — in a scheduled job, an Airflow DAG, or a standalone service — here is the direct Python equivalent:

Python
import httpx
import time

API_KEY = "YOUR_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"

def scrape_all_pages(start_url: str, max_pages: int = 50) -> list[dict]:
    records: list[dict] = []
    url: str | None = start_url
    page = 1
    retries = 0

    while url and page <= max_pages:
        try:
            response = httpx.post(
                BASE_URL,
                headers={"X-API-Key": API_KEY},
                json={
                    "url": url,
                    "render": False,
                    "extract_rules": {
                        "titles": {"selector": "article.product_pod h3 a", "type": "list", "output": "text"},
                        "prices": {"selector": "article.product_pod .price_color", "type": "list", "output": "text"},
                        "next_page": {"selector": "li.next a", "type": "item", "output": "attr:href"},
                    },
                },
                timeout=30.0,
            )
            response.raise_for_status()
            retries = 0  # reset the backoff counter after a successful page
        except httpx.HTTPStatusError as exc:
            # Bounded retry on transient errors so a persistent 429 cannot loop forever
            if exc.response.status_code in (429, 502, 503, 504) and retries < 3:
                retries += 1
                time.sleep(int(exc.response.headers.get("retry-after", "10")))
                continue
            raise

        data = response.json()
        titles = data.get("titles", [])
        prices = data.get("prices", [])

        records.extend(
            {"title": t.strip(), "price": float(p.replace("£", "")), "page": page}
            for t, p in zip(titles, prices)
        )

        next_path = data.get("next_page")
        url = f"https://books.toscrape.com/catalogue/{next_path}" if next_path else None
        page += 1
        time.sleep(0.5)  # polite crawl delay

    return records

if __name__ == "__main__":
    results = scrape_all_pages("https://books.toscrape.com/catalogue/page-1.html")
    pages = results[-1]["page"] if results else 0
    print(f"Collected {len(results)} records across {pages} pages")

This function mirrors exactly what the n8n pipeline does: one POST per page, structured extraction, a pagination loop with a hard page cap, and retry handling for transient errors. Use it when you need the scraper embedded in a larger Python service or want to prototype extraction rules before wiring them into n8n.

Choosing the Right Output Node

Once records leave the Code node, they are standard n8n items. Route them to whichever sink fits your use case: Postgres for relational storage and deduplication queries, Google Sheets for quick review by non-technical stakeholders, or any of n8n's 400+ integrations for everything else.

Performance Considerations

Execution timeout: n8n Cloud imposes a per-execution timeout (varies by plan). For large datasets, split work across multiple triggered executions rather than one long-running loop.

Concurrency: n8n processes loop iterations sequentially by default. For parallelism, use the Split In Batches node with multiple URL inputs and let n8n fan them out across parallel execution paths.

Credential rotation: Store multiple API keys as separate credentials and select among them in a Code node using Math.floor(Math.random() * keys.length). Spreads request volume across keys when you're approaching per-key rate limits.

Execution logs: n8n stores full execution history including request/response payloads. For debugging extraction failures, open the HTTP Request node's output panel — the full JSON response is visible without any additional logging code.

Takeaway

The n8n approach to web scraping separates concerns cleanly by node boundary:

  • Triggering — Schedule or Webhook node handles when
  • Fetching — HTTP Request node handles the network call
  • Transforming — Code node handles parsing and normalization (~20 lines of JavaScript)
  • Error routing — IF + Wait nodes handle retry logic
  • Storing — any of n8n's integrations handles persistence

The scraping infrastructure — proxies, anti-bot bypass, headless rendering — runs on the API side. What you version-control and maintain is a workflow JSON file you can export, fork, and redeploy in minutes.

Start with a single target URL and verify extract_rules returns clean data. Add pagination next, then the error branch. A production-grade pipeline for most targets is under an hour to build and requires zero ongoing infrastructure management.


Frequently Asked Questions

Can n8n scrape websites on its own?
n8n has no built-in scraping engine. Its HTTP Request node can fetch static pages, but JavaScript-rendered sites, anti-bot protections, and proxy rotation all require an external scraping API. Routing requests through a scraping service handles that infrastructure automatically.

How do I scrape JavaScript-rendered pages?
Pass `"render": true` in the API request body. This triggers a headless browser on the API side and returns fully rendered HTML. Expect 2–4 seconds per request instead of ~400ms for static pages, so adjust your HTTP node timeout accordingly.

How do I avoid inserting duplicate records across runs?
Before inserting records, add a Postgres node in SELECT mode to query existing identifiers (URL, product ID, etc.). Feed the result into a Code node that filters out already-seen records, then pass only new items to the INSERT node. This makes each execution idempotent regardless of run frequency.
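That dedup filter step can be sketched as a pure function. In Python for illustration — the n8n Code node version is the same logic over `$input.all()`; the `source_url` field name is an example, use whatever identifier your target exposes:

```python
def filter_new_records(records: list[dict], seen_keys: set[str],
                       key_field: str = "source_url") -> list[dict]:
    """Keep only records whose identifier has not been inserted before,
    making repeated executions idempotent."""
    return [r for r in records if r.get(key_field) not in seen_keys]
```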