
Build an n8n Web Scraping Pipeline Without Code

Learn how to build a production-grade web scraping pipeline in n8n using HTTP Request nodes, JavaScript transforms, pagination handling, and automatic retries.

Yash Dubey

March 21, 2026

8 min read

n8n gives you a visual canvas for wiring together APIs, databases, and triggers. What it doesn't give you is a scraper — and that gap matters the moment you hit a JavaScript-heavy SPA, a Cloudflare-protected page, or a target that rate-limits by IP.

This guide shows you how to close that gap: build an n8n pipeline that extracts data from any website on a schedule, transforms it, handles pagination and errors, and routes results to any downstream sink.

What You'll Build

A recurring scraping pipeline that:

  • Fires on a cron schedule (or webhook trigger)
  • Requests rendered HTML or structured JSON from a scraping API
  • Parses and normalizes the response in a Code node
  • Loops through paginated results until all records are collected
  • Writes clean records to Postgres, Google Sheets, or any of n8n's 400+ integrations

No Puppeteer process to babysit. No proxy pool to rotate. No CAPTCHA solver to maintain.

Prerequisites

  • n8n running locally (npx n8n) or on n8n Cloud
  • An API key from AlterLab (free tier covers testing)
  • Basic familiarity with n8n's canvas — if you can drag a node, you're set

Pipeline Architecture

Each node has exactly one responsibility. Keep it that way and debugging becomes trivial.

Step 1: Create the Trigger

Add a Schedule Trigger node. Set the interval based on data freshness requirements:

  • Product prices: every 1–4 hours
  • News headlines: every 15–30 minutes
  • Job listings: every 6–12 hours

During development, use Manual Trigger instead. Swap it out for Schedule Trigger when you're ready for production — nothing else in the pipeline changes.
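If you use the Schedule Trigger's cron mode instead of the interval presets, the freshness tiers above map to standard five-field cron expressions. A sketch — the exact expressions are suggestions, tune them to your targets:

```python
# Suggested cron expressions for the Schedule Trigger's cron mode.
# Field order: minute hour day-of-month month day-of-week.
CRON_PRESETS = {
    "product_prices": "0 */4 * * *",    # top of every 4th hour
    "news_headlines": "*/15 * * * *",   # every 15 minutes
    "job_listings": "0 */6 * * *",      # top of every 6th hour
}

def cron_fields(expr: str) -> list[str]:
    """Split a cron expression into its five fields as a sanity check."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}")
    return fields
```

Paste the expression into the Schedule Trigger's cron field; the workflow itself stays identical.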

Step 2: Configure the HTTP Request Node

Add an HTTP Request node after the trigger.

  • Method: POST
  • URL: https://api.alterlab.io/v1/scrape
  • Authentication: Header Auth → key X-API-Key, value YOUR_KEY

Set the request body to JSON:

JSON
{
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "render": false,
  "extract_rules": {
    "titles": {
      "selector": "article.product_pod h3 a",
      "type": "list",
      "output": "text"
    },
    "prices": {
      "selector": "article.product_pod .price_color",
      "type": "list",
      "output": "text"
    },
    "next_page": {
      "selector": "li.next a",
      "type": "item",
      "output": "attr:href"
    }
  }
}

extract_rules returns structured JSON directly — parallel arrays where index i in titles corresponds to index i in prices. For complex extraction logic or when you need full DOM access, omit extract_rules and work with the raw HTML in the Code node.

Set "render": true for React, Vue, or Angular pages that hydrate client-side. This triggers a headless browser on the API side. Adjust the HTTP node timeout to 60 seconds when render is enabled.
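The render/timeout rule of thumb reduces to a small helper. A sketch in Python — the function name and the 30-second static-page default are ours; the 60-second rendered figure matches the guidance above:

```python
def build_scrape_request(url: str, render: bool) -> tuple[dict, float]:
    """Return the scrape request body and a suggested HTTP timeout in seconds.
    Rendered pages run a headless browser server-side (seconds per request);
    static fetches usually return in well under a second."""
    body = {"url": url, "render": render}
    timeout = 60.0 if render else 30.0
    return body, timeout
```

The same rule applies inside n8n: bump the HTTP Request node's timeout option whenever the body carries "render": true.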

Here is the equivalent cURL command for testing outside n8n before wiring the node:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "render": false,
    "extract_rules": {
      "titles": {"selector": "article.product_pod h3 a", "type": "list", "output": "text"},
      "prices": {"selector": "article.product_pod .price_color", "type": "list", "output": "text"},
      "next_page": {"selector": "li.next a", "type": "item", "output": "attr:href"}
    }
  }'


Step 3: Transform the Response

When extract_rules returns parallel arrays, zip them into records in a Code node. Add the node after HTTP Request, set language to JavaScript:

JAVASCRIPT
const data = $input.first().json;

const titles = data.titles ?? [];
const prices = data.prices ?? [];

// Zip parallel arrays into structured records
const records = titles.map((title, i) => ({
  title: title.trim(),
  price: parseFloat(prices[i]?.replace('£', '') ?? '0'),
  currency: 'GBP',
  scraped_at: new Date().toISOString(),
  source_url: $('HTTP Request').first().json._meta?.url ?? null,
  page: $getWorkflowStaticData('node').pageCount ?? 1,
}));

return records.map(r => ({ json: r }));

This Code node zips, cleans, type-casts, and attaches metadata. Its output is a flat array of items — each item flows independently into downstream nodes. A Postgres INSERT node will write one row per item automatically.
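The same zip-and-clean step is easy to unit-test in Python before porting it into the Code node. A sketch — the function name is ours, and the `£` handling mirrors the JavaScript above:

```python
from datetime import datetime, timezone

def zip_records(titles: list[str], prices: list[str], page: int = 1) -> list[dict]:
    """Pair parallel title/price arrays into structured records,
    trimming whitespace and casting prices to floats."""
    records = []
    for i, title in enumerate(titles):
        # Guard against prices being shorter than titles
        raw_price = prices[i] if i < len(prices) else "0"
        records.append({
            "title": title.strip(),
            "price": float(raw_price.replace("£", "") or "0"),
            "currency": "GBP",
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "page": page,
        })
    return records
```

Running your selectors against one saved response and asserting on the zipped output catches most extraction bugs before they reach the canvas.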

Step 4: Handle Pagination

Most real targets paginate. The standard n8n pattern:

  1. An IF node checks whether next_page is non-null
  2. On true: update the URL and loop back to the HTTP Request node
  3. On false: exit to the output node

Store the next page path in workflow static data so it survives across loop iterations:

JAVASCRIPT
const nextPage = $input.first().json.next_page;

// Persist state across iterations
const staticData = $getWorkflowStaticData('node');
staticData.nextPagePath = nextPage ?? null;
staticData.pageCount = (staticData.pageCount ?? 0) + 1;

const baseUrl = 'https://books.toscrape.com/catalogue/';
const nextUrl = nextPage ? `${baseUrl}${nextPage}` : null;

return [{
  json: {
    has_next_page: !!nextPage,
    next_url: nextUrl,
    pages_collected: staticData.pageCount,
  }
}];

Wire the IF node: has_next_page === true loops back to HTTP Request with next_url injected into the request body via an expression ({{ $json.next_url }}). Always add a max_pages guard — check pages_collected > 50 on the true branch and terminate if hit. Without the guard, a malformed site that keeps emitting a next-page link will loop indefinitely.

Step 5: Error Handling

By default, an HTTP Request node error halts the entire workflow. Enable Continue On Error in the node settings so failures flow through as data, then add an IF node checking $json.$response.statusCode >= 400.

For transient failures (429, 503, 502), add a Wait node on the error branch and loop back to retry:

JAVASCRIPT
const status = $input.first().json.$response?.statusCode ?? 0;
const headers = $input.first().json.$response?.headers ?? {};

// Respect Retry-After header if present, otherwise use fixed backoff
const retryable = [429, 502, 503, 504].includes(status);
const retryAfter = parseInt(headers['retry-after'] ?? '0', 10);
const delay = retryable
  ? (retryAfter > 0 ? retryAfter * 1000 : 5000 * Math.pow(2, $getWorkflowStaticData('node').retries ?? 0))
  : 0;

const retries = ($getWorkflowStaticData('node').retries ?? 0) + 1;
$getWorkflowStaticData('node').retries = retries;

return [{
  json: { retryable: retryable && retries <= 3, delay_ms: delay, status }
}];

Wire: retryable === true → Wait (delay_ms milliseconds) → HTTP Request. retryable === false → Slack/PagerDuty alert node. After a successful page, reset retries to 0 in the Transform Code node.

Python Equivalent

If you want to run the same pipeline outside n8n — in a scheduled job, an Airflow DAG, or a standalone service — here is the direct Python equivalent:

Python
import httpx
import time

API_KEY = "YOUR_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"

def scrape_all_pages(start_url: str, max_pages: int = 50) -> list[dict]:
    records: list[dict] = []
    url: str | None = start_url
    page = 1
    retries = 0

    while url and page <= max_pages:
        try:
            response = httpx.post(
                BASE_URL,
                headers={"X-API-Key": API_KEY},
                json={
                    "url": url,
                    "render": False,
                    "extract_rules": {
                        "titles": {"selector": "article.product_pod h3 a", "type": "list", "output": "text"},
                        "prices": {"selector": "article.product_pod .price_color", "type": "list", "output": "text"},
                        "next_page": {"selector": "li.next a", "type": "item", "output": "attr:href"},
                    },
                },
                timeout=30.0,
            )
            response.raise_for_status()
            retries = 0  # reset the backoff counter after a successful page
        except httpx.HTTPStatusError as exc:
            # Bounded retry on transient errors so a persistent 429 cannot loop forever
            if exc.response.status_code in (429, 502, 503, 504) and retries < 3:
                retries += 1
                time.sleep(int(exc.response.headers.get("retry-after", "10")))
                continue
            raise

        data = response.json()
        titles = data.get("titles", [])
        prices = data.get("prices", [])

        records.extend(
            {"title": t.strip(), "price": float(p.replace("£", "")), "page": page}
            for t, p in zip(titles, prices)
        )

        next_path = data.get("next_page")
        url = f"https://books.toscrape.com/catalogue/{next_path}" if next_path else None
        page += 1
        time.sleep(0.5)  # polite crawl delay

    return records

if __name__ == "__main__":
    results = scrape_all_pages("https://books.toscrape.com/catalogue/page-1.html")
    pages = results[-1]["page"] if results else 0
    print(f"Collected {len(results)} records across {pages} pages")

This function mirrors exactly what the n8n pipeline does: one POST per page, structured extraction, a pagination loop with a hard page cap, and retry handling for transient errors. Use it when you need the scraper embedded in a larger Python service or want to prototype extraction rules before wiring them into n8n.

Choosing the Right Output Node

Once records leave the Code node, they are standard n8n items. Route them to whichever sink fits your use case: Postgres for relational storage and deduplication queries, Google Sheets for quick review by non-technical stakeholders, or any of n8n's 400+ integrations for everything else.

Performance Considerations

Execution timeout: n8n Cloud imposes a per-execution timeout (varies by plan). For large datasets, split work across multiple triggered executions rather than one long-running loop.

Concurrency: n8n processes loop iterations sequentially by default. For parallelism, use the Split In Batches node with multiple URL inputs and let n8n fan them out across parallel execution paths.

Credential rotation: Store multiple API keys as separate credentials and select among them in a Code node using Math.floor(Math.random() * keys.length). Spreads request volume across keys when you're approaching per-key rate limits.

Execution logs: n8n stores full execution history including request/response payloads. For debugging extraction failures, open the HTTP Request node's output panel — the full JSON response is visible without any additional logging code.

Takeaway

The n8n approach to web scraping separates concerns cleanly by node boundary:

  • Triggering — Schedule or Webhook node handles when
  • Fetching — HTTP Request node handles the network call
  • Transforming — Code node handles parsing and normalization (~20 lines of JavaScript)
  • Error routing — IF + Wait nodes handle retry logic
  • Storing — any of n8n's integrations handles persistence

The scraping infrastructure — proxies, anti-bot bypass, headless rendering — runs on the API side. What you version-control and maintain is a workflow JSON file you can export, fork, and redeploy in minutes.

Start with a single target URL and verify extract_rules returns clean data. Add pagination next, then the error branch. A production-grade pipeline for most targets is under an hour to build and requires zero ongoing infrastructure management.


Frequently Asked Questions

Can n8n scrape websites on its own?
n8n has no built-in scraping engine. Its HTTP Request node can fetch static pages, but JavaScript-rendered sites, anti-bot protections, and proxy rotation all require an external scraping API. Routing requests through a scraping service handles that infrastructure automatically.

How do I scrape JavaScript-rendered pages?
Pass `"render": true` in the API request body. This triggers a headless browser on the API side and returns fully rendered HTML. Expect 2–4 seconds per request instead of ~400ms for static pages, so adjust your HTTP node timeout accordingly.

How do I avoid inserting duplicate records across runs?
Before inserting records, add a Postgres node in SELECT mode to query existing identifiers (URL, product ID, etc.). Feed the result into a Code node that filters out already-seen records, then pass only new items to the INSERT node. This makes each execution idempotent regardless of run frequency.
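That dedup filter step can be sketched as a pure function. In Python for illustration — the n8n Code node version is the same logic over `$input.all()`; the `source_url` field name is an example, use whatever identifier your target exposes:

```python
def filter_new_records(records: list[dict], seen_keys: set[str],
                       key_field: str = "source_url") -> list[dict]:
    """Keep only records whose identifier has not been inserted before,
    making repeated executions idempotent."""
    return [r for r in records if r.get(key_field) not in seen_keys]
```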