Build an n8n Web Scraping Pipeline Without Code
Learn how to build a production-grade web scraping pipeline in n8n using HTTP Request nodes, JavaScript transforms, pagination handling, and automatic retries.
March 21, 2026
n8n gives you a visual canvas for wiring together APIs, databases, and triggers. What it doesn't give you is a scraper — and that gap matters the moment you hit a JavaScript-heavy SPA, a Cloudflare-protected page, or a target that rate-limits by IP.
This guide shows you how to close that gap: build an n8n pipeline that extracts data from any website on a schedule, transforms it, handles pagination and errors, and routes results to any downstream sink.
What You'll Build
A recurring scraping pipeline that:
- Fires on a cron schedule (or webhook trigger)
- Requests rendered HTML or structured JSON from a scraping API
- Parses and normalizes the response in a Code node
- Loops through paginated results until all records are collected
- Writes clean records to Postgres, Google Sheets, or any of n8n's 400+ integrations
No Puppeteer process to babysit. No proxy pool to rotate. No CAPTCHA solver to maintain.
Prerequisites
- n8n running locally (npx n8n) or on n8n Cloud
- An API key from AlterLab (free tier covers testing)
- Basic familiarity with n8n's canvas — if you can drag a node, you're set
Pipeline Architecture
Each node has exactly one responsibility. Keep it that way and debugging becomes trivial.
Step 1: Create the Trigger
Add a Schedule Trigger node. Set the interval based on data freshness requirements:
- Product prices: every 1–4 hours
- News headlines: every 15–30 minutes
- Job listings: every 6–12 hours
During development, use Manual Trigger instead. Swap it out for Schedule Trigger when you're ready for production — nothing else in the pipeline changes.
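If you prefer cron expressions over the interval dropdown, the Schedule Trigger also accepts them. A sketch of expressions matching the freshness tiers above (the exact field labels in the node UI may differ):

```javascript
// Cron expressions for the freshness tiers above.
// Field order: minute hour day-of-month month day-of-week
const schedules = {
  productPrices: "0 */4 * * *",  // top of the hour, every 4 hours
  newsHeadlines: "*/15 * * * *", // every 15 minutes
  jobListings: "0 */6 * * *",    // every 6 hours
};

// Sanity check: a standard cron expression has exactly five fields
for (const [name, expr] of Object.entries(schedules)) {
  console.log(name, expr.split(" ").length === 5 ? "ok" : "invalid");
}
```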
Step 2: Configure the HTTP Request Node
Add an HTTP Request node after the trigger.
Method: POST
URL: https://api.alterlab.io/v1/scrape
Authentication: Header Auth → key X-API-Key, value YOUR_KEY
Set the request body to JSON:
{
"url": "https://books.toscrape.com/catalogue/page-1.html",
"render": false,
"extract_rules": {
"titles": {
"selector": "article.product_pod h3 a",
"type": "list",
"output": "text"
},
"prices": {
"selector": "article.product_pod .price_color",
"type": "list",
"output": "text"
},
"next_page": {
"selector": "li.next a",
"type": "item",
"output": "attr:href"
}
}
}

extract_rules returns structured JSON directly — parallel arrays where index i in titles corresponds to index i in prices. For complex extraction logic or when you need full DOM access, omit extract_rules and work with the raw HTML in the Code node.
Set "render": true for React, Vue, or Angular pages that hydrate client-side. This triggers a headless browser on the API side. Adjust the HTTP node timeout to 60 seconds when render is enabled.
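As a sketch, a render-enabled request body differs from the earlier example only in the render flag; the target URL and selectors below are illustrative, not real:

```javascript
// Request body for a client-side-rendered page; note render: true.
const body = {
  url: "https://example-spa.com/products", // hypothetical SPA target
  render: true, // headless browser on the API side; allow up to ~60 s in the node timeout
  extract_rules: {
    titles: { selector: "div.product h2", type: "list", output: "text" },
  },
};

console.log(JSON.stringify(body, null, 2));
```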
Here is the equivalent cURL command for testing outside n8n before wiring the node:
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://books.toscrape.com/catalogue/page-1.html",
"render": false,
"extract_rules": {
"titles": {"selector": "article.product_pod h3 a", "type": "list", "output": "text"},
"prices": {"selector": "article.product_pod .price_color", "type": "list", "output": "text"},
"next_page": {"selector": "li.next a", "type": "item", "output": "attr:href"}
}
}'

Try it against the live sandbox:
Scrape a live product listing page with AlterLab — no setup required
Step 3: Transform the Response
When extract_rules returns parallel arrays, zip them into records in a Code node. Add the node after HTTP Request, set language to JavaScript:
const data = $input.first().json;
const titles = data.titles ?? [];
const prices = data.prices ?? [];
// Zip parallel arrays into structured records
const records = titles.map((title, i) => ({
title: title.trim(),
price: parseFloat(prices[i]?.replace('£', '') ?? '0'),
currency: 'GBP',
scraped_at: new Date().toISOString(),
source_url: $('HTTP Request').first().json._meta?.url ?? null,
page: $getWorkflowStaticData('node').pageCount ?? 1,
}));
return records.map(r => ({ json: r }));

This block zips, cleans, type-casts, and attaches metadata. Output from this node is a flat array of items — each item flows independently into downstream nodes. A Postgres INSERT node will write one row per item automatically.
Step 4: Handle Pagination
Most real targets paginate. The standard n8n pattern:
- An IF node checks whether next_page is non-null
- On true: update the URL and loop back to the HTTP Request node
- On false: exit to the output node
Store the next page path in workflow static data so it survives across loop iterations:
const nextPage = $input.first().json.next_page;
// Persist state across iterations
const staticData = $getWorkflowStaticData('node');
staticData.nextPagePath = nextPage ?? null;
staticData.pageCount = (staticData.pageCount ?? 0) + 1;
const baseUrl = 'https://books.toscrape.com/catalogue/';
const nextUrl = nextPage ? `${baseUrl}${nextPage}` : null;
return [{
json: {
has_next_page: !!nextPage,
next_url: nextUrl,
pages_collected: staticData.pageCount,
}
}];

Wire the IF node: has_next_page === true loops back to HTTP Request with next_url injected into the request body via an expression ({{ $json.next_url }}). Always add a max_pages guard — check pages_collected > 50 on the true branch and terminate if hit. Broken pagination signals on malformed sites will otherwise run indefinitely.
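The termination logic is easier to verify when factored into a pure function you can run outside n8n. A sketch, using the 50-page cap suggested above:

```javascript
// Decide whether the loop should fetch another page.
// Returns the next URL to request, or null to terminate.
function nextPageUrl(nextPath, pagesCollected, baseUrl, maxPages = 50) {
  if (!nextPath) return null;                  // site signalled the last page
  if (pagesCollected >= maxPages) return null; // hard cap guards against broken pagination
  return `${baseUrl}${nextPath}`;
}

const base = "https://books.toscrape.com/catalogue/";
console.log(nextPageUrl("page-2.html", 1, base));  // continues the loop
console.log(nextPageUrl("page-2.html", 50, base)); // null: cap hit
console.log(nextPageUrl(null, 3, base));           // null: no next link
```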
Step 5: Error Handling
By default, an HTTP Request node error stops the entire workflow. Enable Continue On Error in the node settings so failures flow downstream instead, then add an IF node checking $json.$response.statusCode >= 400.
For transient failures (429, 503, 502), add a Wait node on the error branch and loop back to retry:
const status = $input.first().json.$response?.statusCode ?? 0;
const headers = $input.first().json.$response?.headers ?? {};
// Respect Retry-After header if present, otherwise use fixed backoff
const retryable = [429, 502, 503, 504].includes(status);
const retryAfter = parseInt(headers['retry-after'] ?? '0', 10);
const delay = retryable
? (retryAfter > 0 ? retryAfter * 1000 : 5000 * Math.pow(2, $getWorkflowStaticData('node').retries ?? 0))
: 0;
const retries = ($getWorkflowStaticData('node').retries ?? 0) + 1;
$getWorkflowStaticData('node').retries = retries;
return [{
json: { retryable: retryable && retries <= 3, delay_ms: delay, status }
}];

Wire: retryable === true → Wait (delay_ms milliseconds) → HTTP Request. retryable === false → Slack/PagerDuty alert node. After a successful page, reset retries to 0 in the Transform Code node.
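The backoff arithmetic in the error branch is worth testing in isolation. A sketch of the same policy as a pure function (Retry-After wins when present; otherwise a 5-second base doubling per attempt):

```javascript
// Same retry policy as the Code node above, as a testable function.
function retryDelayMs(status, retryAfterHeader, priorRetries) {
  const retryable = [429, 502, 503, 504].includes(status);
  if (!retryable) return 0;
  const retryAfter = parseInt(retryAfterHeader ?? "0", 10);
  if (retryAfter > 0) return retryAfter * 1000; // server-specified wait wins
  return 5000 * Math.pow(2, priorRetries);      // 5 s, 10 s, 20 s, ...
}

console.log(retryDelayMs(429, "7", 0));  // 7000: Retry-After header respected
console.log(retryDelayMs(503, null, 2)); // 20000: exponential backoff
console.log(retryDelayMs(404, null, 0)); // 0: not retryable
```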
Python Equivalent
If you want to run the same pipeline outside n8n — in a scheduled job, an Airflow DAG, or a standalone service — here is the direct Python equivalent:
import httpx
import time
API_KEY = "YOUR_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"
def scrape_all_pages(start_url: str, max_pages: int = 50) -> list[dict]:
records: list[dict] = []
url: str | None = start_url
page = 1
while url and page <= max_pages:
try:
response = httpx.post(
BASE_URL,
headers={"X-API-Key": API_KEY},
json={
"url": url,
"render": False,
"extract_rules": {
"titles": {"selector": "article.product_pod h3 a", "type": "list", "output": "text"},
"prices": {"selector": "article.product_pod .price_color", "type": "list", "output": "text"},
"next_page": {"selector": "li.next a", "type": "item", "output": "attr:href"},
},
},
timeout=30.0,
)
response.raise_for_status()
except httpx.HTTPStatusError as exc:
if exc.response.status_code in (429, 503):
time.sleep(int(exc.response.headers.get("retry-after", "10")))
continue
raise
data = response.json()
titles = data.get("titles", [])
prices = data.get("prices", [])
records.extend(
{"title": t.strip(), "price": float(p.replace("£", "")), "page": page}
for t, p in zip(titles, prices)
)
next_path = data.get("next_page")
url = f"https://books.toscrape.com/catalogue/{next_path}" if next_path else None
page += 1
time.sleep(0.5) # polite crawl delay
return records
if __name__ == "__main__":
results = scrape_all_pages("https://books.toscrape.com/catalogue/page-1.html")
print(f"Collected {len(results)} records across {results[-1]['page']} pages")

The loop above mirrors exactly what the n8n pipeline does: one POST per page, structured extraction, a pagination loop with a hard page cap, and retry handling for transient errors. Use this when you need the scraper embedded in a larger Python service or want to prototype extraction rules before wiring them into n8n.
Choosing the Right Output Node
Once records leave the Code node, they are standard n8n items. Route them to whichever sink fits your use case: Postgres for durable structured storage, Google Sheets for lightweight review and sharing, or any other integration node for pushing records into downstream services.
Performance Considerations
Execution timeout: n8n Cloud imposes a per-execution timeout (varies by plan). For large datasets, split work across multiple triggered executions rather than one long-running loop.
Concurrency: n8n processes loop iterations sequentially by default. For parallelism, use the Split In Batches node with multiple URL inputs and let n8n fan them out across parallel execution paths.
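Split In Batches operates on an array of input items, so the fan-out is equivalent to chunking the URL list. A sketch of that batching, assuming a hypothetical list of page URLs:

```javascript
// Chunk a URL list into batches of n, mirroring what Split In Batches does.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const urls = Array.from({ length: 10 }, (_, i) =>
  `https://books.toscrape.com/catalogue/page-${i + 1}.html`
);
console.log(chunk(urls, 4).length); // 3 batches: 4 + 4 + 2
```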
Credential rotation: Store multiple API keys as separate credentials and select among them in a Code node using Math.floor(Math.random() * keys.length). This spreads request volume across keys when you're approaching per-key rate limits.
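The rotation itself is one line. A sketch of how it might look inside a Code node (the key names here are hypothetical placeholders):

```javascript
// Pick a key at random so request volume spreads evenly across credentials.
const keys = ["key_alpha", "key_beta", "key_gamma"]; // hypothetical key names
const selected = keys[Math.floor(Math.random() * keys.length)];
console.log(keys.includes(selected)); // always true
```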
Execution logs: n8n stores full execution history including request/response payloads. For debugging extraction failures, open the HTTP Request node's output panel — the full JSON response is visible without any additional logging code.
Takeaway
The n8n approach to web scraping separates concerns cleanly by node boundary:
- Triggering — Schedule or Webhook node handles when
- Fetching — HTTP Request node handles the network call
- Transforming — Code node handles parsing and normalization (~20 lines of JavaScript)
- Error routing — IF + Wait nodes handle retry logic
- Storing — any of n8n's integrations handles persistence
The scraping infrastructure — proxies, anti-bot bypass, headless rendering — runs on the API side. What you version-control and maintain is a workflow JSON file you can export, fork, and redeploy in minutes.
Start with a single target URL and verify extract_rules returns clean data. Add pagination next, then the error branch. A production-grade pipeline for most targets is under an hour to build and requires zero ongoing infrastructure management.