
Build a Web Scraping Pipeline with n8n and AlterLab

Connect n8n to a scraping API for automated data extraction with anti-bot bypass, JavaScript rendering, proxy rotation, and scheduled cron triggers — step by step.

Yash Dubey

March 31, 2026

8 min read

n8n is a workflow automation platform built around HTTP nodes, visual routing, and an in-process JavaScript runtime. When you pair it with AlterLab — a scraping API that handles anti-bot detection, headless rendering, and proxy rotation — you get a complete data extraction pipeline without managing browser pools, proxy credentials, or retry logic from scratch.

This tutorial builds a production-ready pipeline: URL inputs → scraping API → HTML parsing → structured storage, driven by a cron schedule with proper error handling.

Prerequisites

  • n8n instance (self-hosted via Docker or n8n Cloud)
  • API key — follow the quickstart guide to get one in under two minutes
  • Familiarity with n8n's workflow editor and basic JavaScript

Step 1: Store the API Key in n8n Credentials

Never hardcode secrets into HTTP Request nodes. Go to Settings → Credentials → Add Credential → Header Auth and fill in:

Field          Value
Name           Scraping API Key
Header Name    X-API-Key
Header Value   YOUR_API_KEY

Reference this credential in every HTTP Request node in the workflow. Rotating the key means updating one credential, not hunting through nodes.
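The same no-hardcoding rule applies to local test scripts outside n8n. A minimal sketch, reading the key from an environment variable (the ALTERLAB_API_KEY name is an assumption, not a documented convention):

```python
import os

def auth_headers() -> dict:
    """Build the X-API-Key header, reading the key from the environment
    at call time so rotating it never requires a code change."""
    key = os.environ.get("ALTERLAB_API_KEY", "")
    if not key:
        raise RuntimeError("Set ALTERLAB_API_KEY before running")
    return {"X-API-Key": key}
```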


Step 2: Configure the HTTP Request Node

Drop an HTTP Request node into the canvas. Set Method to POST, URL to https://api.alterlab.io/v1/scrape, authenticate with the credential created above, and set Body Content Type to JSON.

JSON
{
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "render_js": false,
  "premium_proxy": false,
  "country": "us",
  "timeout": 30000
}

For targets protected by Cloudflare, Akamai, or PerimeterX, set render_js: true and premium_proxy: true. The anti-bot bypass layer handles TLS fingerprinting, browser emulation, and CAPTCHA solving transparently — no extra configuration on your end.
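Since rendered requests cost more than static fetches, it can help to escalate only for domains you know are protected. A sketch of that decision, where PROTECTED_DOMAINS and build_payload are illustrative names you would maintain yourself based on observed blocks:

```python
from urllib.parse import urlparse

# Illustrative list -- populate it from domains that actually block you.
PROTECTED_DOMAINS = {"example-shop.com", "example-news.com"}

def build_payload(url: str, timeout_ms: int = 30000) -> dict:
    """Return a request body, enabling rendering and premium proxies
    only when the target host is on the protected list."""
    host = urlparse(url).hostname or ""
    protected = any(host.endswith(d) for d in PROTECTED_DOMAINS)
    return {
        "url": url,
        "render_js": protected,       # headless rendering only when needed
        "premium_proxy": protected,   # premium pool only when needed
        "country": "us",
        "timeout": timeout_ms,
    }
```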

The same request in cURL for testing before wiring into n8n:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "render_js": false,
    "premium_proxy": false
  }'

The equivalent single-URL call in Python:

Python
import httpx

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"

def scrape(url: str, render_js: bool = False) -> dict:
    with httpx.Client() as client:                    # synchronous single fetch
        r = client.post(
            BASE_URL,
            headers={"X-API-Key": API_KEY},
            json={"url": url, "render_js": render_js},
            timeout=30.0,
        )
        r.raise_for_status()
        return r.json()

result = scrape("https://books.toscrape.com/catalogue/page-1.html")
print(result["status_code"], result["elapsed_ms"], "ms")

The API response shape:

JSON
{
  "success": true,
  "status_code": 200,
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "html": "<!DOCTYPE html>...",
  "elapsed_ms": 712
}
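Note that success can be false even on an HTTP 200 if the target blocked the request, so validate the envelope before parsing. A minimal check, assuming the response shape shown above:

```python
def check_response(body: dict) -> str:
    """Return the HTML payload, or raise if the target blocked the request.
    Treats success: false as a failure even when the HTTP status is 200."""
    if not body.get("success"):
        raise RuntimeError(
            f"Scrape failed for {body.get('url')}: "
            f"upstream status {body.get('status_code')}"
        )
    return body["html"]
```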

Run the request against a live target to confirm the response shape before building the rest of the pipeline.



Step 3: Parse HTML in the Code Node

Add a Code node immediately after the HTTP Request. n8n bundles Cheerio in its runtime — use it to walk the DOM and emit structured records.

JavaScript
const { load } = require('cheerio');

const results = [];

for (const item of $input.all()) {
  const $ = load(item.json.html);

  $('article.product_pod').each((_, el) => {         // iterate product cards
    const title   = $(el).find('h3 a').attr('title');
    const price   = $(el).find('.price_color').text().trim();
    const rating  = $(el).find('p.star-rating').attr('class')?.split(' ')[1];
    const relHref = $(el).find('h3 a').attr('href');

    results.push({                                    // emit flat record
      title,
      price,
      rating,
      url: `https://books.toscrape.com/catalogue/${relHref}`,
      scraped_at: new Date().toISOString(),
    });
  });
}

return results.map(r => ({ json: r }));

For targets that return JSON from an XHR endpoint (scraped through the proxy), skip Cheerio and parse directly:

JavaScript
const raw = $input.first().json.html;
const data = JSON.parse(raw);            // html field contains the raw JSON string
return data.products.map(p => ({ json: p }));

If Cheerio is missing in a self-hosted setup, run npm install cheerio in the n8n working directory and restart the service.



Step 4: Scrape Multiple Pages

Use a Code node to generate a URL list, then feed it through Split In Batches → HTTP Request:

JavaScript
const BASE  = 'https://books.toscrape.com/catalogue/page-';
const PAGES = 50;

const urls = Array.from(                        // generate range of page URLs
  { length: PAGES },
  (_, i) => ({ json: { url: `${BASE}${i + 1}.html` } })
);

return urls;

Set Split In Batches to a batch size of 5 to avoid hammering the target. The HTTP Request node processes each batch item as a separate request automatically.
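The Split In Batches behavior can be sketched in a few lines if you orchestrate the same loop from Python instead (the batched helper is an illustration, not part of any API):

```python
def batched(items: list, size: int) -> list[list]:
    """Chunk a list into consecutive groups of at most `size` items,
    mirroring n8n's Split In Batches node."""
    return [items[i:i + size] for i in range(0, len(items), size)]

urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
batches = batched(urls, 5)   # 10 batches of 5 URLs each
```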

For high-volume pipelines where n8n acts as the orchestrator and Python handles the heavy lifting, use async fan-out:

Python
import asyncio
import httpx

API_KEY  = "YOUR_API_KEY"
ENDPOINT = "https://api.alterlab.io/v1/scrape"

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    r = await client.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},
        json={"url": url, "render_js": False},
        timeout=30.0,
    )
    r.raise_for_status()
    return r.json()

async def scrape_batch(urls: list[str]) -> list[dict]:  # fan-out entry point
    async with httpx.AsyncClient() as client:           # single connection pool
        tasks   = [fetch(client, u) for u in urls]      # build coroutine list
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

if __name__ == "__main__":
    pages = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
    data  = asyncio.run(scrape_batch(pages))

    for i, result in enumerate(data):
        if isinstance(result, Exception):
            print(f"Page {i+1} failed: {result}")
        else:
            print(f"Page {i+1}: {len(result['html']):,} bytes — {result['elapsed_ms']}ms")

The Python scraping API client wraps this pattern with built-in retry logic, concurrency throttling, and typed responses — worth switching to once you move beyond prototyping.
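Until then, a hand-rolled retry wrapper covers the common case. This is a sketch of the pattern, not the client's actual implementation: exponential backoff with jitter around a caller-supplied fetch function.

```python
import random
import time

def with_retries(fetch_fn, retries: int = 3, base_delay: float = 2.0):
    """Call fetch_fn, retrying on exception with exponential backoff
    plus jitter. In real code, narrow `Exception` to transport errors."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return fetch_fn()
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise last_exc
```

Wrap the single-URL scrape call from Step 2 in with_retries to get the same resilience the HTTP Request node's Retry On Fail setting provides inside n8n.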


Step 5: Route Data to Storage

Wire the Code node output to whichever storage node fits your stack.

Postgres — recommended for structured pipelines:

  • Node: Postgres, Operation: Insert, Table: scraped_books
  • Map title, price, rating, url, scraped_at directly from Code node output fields

Google Sheets — minimal setup for low-volume runs:

  • Node: Google Sheets, Operation: Append or Update
  • Same column mapping

Webhook forward — for downstream microservices or event buses:

JSON
{
  "source": "n8n-book-scraper",
  "run_id": "{{ $execution.id }}",
  "count": 20,
  "records": [
    { "title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..." }
  ]
}
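Assembling that envelope outside n8n looks like the sketch below; build_webhook_payload is a hypothetical helper, and run_id stands in for the value n8n supplies via $execution.id:

```python
def build_webhook_payload(run_id: str, records: list[dict]) -> dict:
    """Wrap scraped records in the webhook envelope shown above."""
    return {
        "source": "n8n-book-scraper",
        "run_id": run_id,
        "count": len(records),
        "records": records,
    }
```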

Step 6: Schedule and Add Error Handling

Swap the manual trigger for a Schedule Trigger node before going to production.

Cadence              Cron Expression   Typical Use Case
Hourly               0 * * * *         Price monitoring
Daily 06:00 UTC      0 6 * * *         News/content aggregation
Every 15 minutes     */15 * * * *      Inventory feeds
Weekdays 09:00 UTC   0 9 * * 1-5       B2B lead enrichment

For event-driven scraping — e.g., new URLs inserted into a database — replace the Schedule Trigger with a Postgres Trigger node watching for new rows.

Error handling — configure before going live:

  1. HTTP Request node → enable Retry On Fail: 3 retries, 2000ms backoff
  2. Code node → enable Continue On Fail if partial runs are acceptable
  3. In Settings → Error Workflow, assign a dedicated workflow that captures and routes failures:
JavaScript
// Runs inside the error workflow's Code node
const err = $input.first().json;

return [{
  json: {
    workflow:     err.workflow?.name,
    node:         err.execution?.lastNodeExecuted,   // which node threw
    message:      err.execution?.error?.message,
    failed_at:    new Date().toISOString(),
    execution_id: err.execution?.id,
  }
}];

Route the output to a Postgres scrape_errors table or a Slack node. Silent failures are harder to diagnose than loud ones.



Monitoring Pipeline Health

Don't rely solely on n8n's execution log. Instrument your pipeline explicitly:

  • Log success: false responses from the scraping API to a monitoring table — the API returns this field even on 200 responses if the target blocked the request
  • Store elapsed_ms per run in a scrape_metrics table; an upward trend signals proxy pool degradation
  • Row count guard — after the storage node, add a Code node that alerts if results.length < EXPECTED_MINIMUM:
JavaScript
const MINIMUM = 15; // expect at least 15 records per page

const count = $input.all().length;

if (count < MINIMUM) {                        // trigger alert path
  throw new Error(`Low yield: got ${count}, expected >= ${MINIMUM}`);
}

return $input.all(); // pass through if OK

Place this node between the Code parser and the storage node. When it throws, n8n's error workflow catches it.
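The latency-trend check from the bullets above can be made concrete. A sketch, assuming you query recent elapsed_ms samples from scrape_metrics; the window size and degradation factor are tunable assumptions:

```python
from statistics import mean

def latency_degraded(samples: list[int], window: int = 10, factor: float = 1.5) -> bool:
    """True if the mean of the most recent `window` elapsed_ms samples
    exceeds the mean of the preceding window by `factor`."""
    if len(samples) < 2 * window:
        return False          # not enough history to judge
    recent = mean(samples[-window:])
    prior = mean(samples[-2 * window:-window])
    return recent > prior * factor
```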


Takeaways

  • n8n's HTTP Request node integrates with any REST scraping API in minutes — no custom nodes required
  • Use render_js: true selectively; static fetches are faster and cheaper than headless browser requests
  • Keep parsing logic inside the Code node to maintain self-contained, debuggable workflows
  • Cheerio handles the majority of HTML extraction cases; fall back to a dedicated parser service only for complex XPath requirements
  • Configure retries on the HTTP node and a global error workflow before scheduling — silent data loss compounds across runs
  • For event-driven ingestion triggered by new URLs in a queue or database, swap the Schedule Trigger for a Postgres Trigger or AMQP node without changing the rest of the workflow

Frequently Asked Questions

Can n8n scrape websites on its own?
n8n has no native scraper, but its HTTP Request node connects to any scraping API. Pair it with a service like AlterLab to handle anti-bot bypass, JavaScript rendering, and proxy rotation; n8n handles scheduling, HTML parsing, and downstream storage.

How do I scrape JavaScript-rendered pages?
Pass `"render_js": true` in the HTTP Request body. The scraping API backend spins up a headless browser, executes JavaScript, and returns the fully rendered HTML; no Playwright or Puppeteer setup is needed inside your n8n instance.

How do I handle errors in scheduled scraping workflows?
Enable "Retry On Fail" on the HTTP Request node with 3 retries and 2-second backoff. Create a dedicated n8n Error Workflow to catch uncaught failures and log them to a Postgres dead-letter table or fire a Slack alert, preventing silent data loss in scheduled pipelines.