
Build a Web Scraping Pipeline with n8n and AlterLab
Connect n8n to a scraping API for automated data extraction with anti-bot bypass, JavaScript rendering, proxy rotation, and scheduled cron triggers — step by step.
March 31, 2026
n8n is a workflow automation platform built around HTTP nodes, visual routing, and an in-process JavaScript runtime. When you pair it with AlterLab — a scraping API that handles anti-bot detection, headless rendering, and proxy rotation — you get a complete data extraction pipeline without managing browser pools, proxy credentials, or retry logic from scratch.
This tutorial builds a production-ready pipeline: URL inputs → scraping API → HTML parsing → structured storage, driven by a cron schedule with proper error handling.
Prerequisites
- n8n instance (self-hosted via Docker or n8n Cloud)
- API key — follow the quickstart guide to get one in under two minutes
- Familiarity with n8n's workflow editor and basic JavaScript
Step 1: Store the API Key in n8n Credentials
Never hardcode secrets into HTTP Request nodes. Go to Settings → Credentials → Add Credential → Header Auth and fill in:
| Field | Value |
|---|---|
| Name | Scraping API Key |
| Header Name | X-API-Key |
| Header Value | YOUR_API_KEY |
Reference this credential in every HTTP Request node in the workflow. Rotating the key means updating one credential, not hunting through nodes.
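Companion scripts outside n8n should follow the same rule. A minimal sketch that reads the key from an environment variable instead of hardcoding it — the variable name `ALTERLAB_API_KEY` is an assumption, not a documented convention:

```python
import os

def load_api_key(env_var: str = "ALTERLAB_API_KEY") -> str:
    """Read the scraping API key from the environment, failing loudly if unset."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running")
    return key
```

Failing at startup beats discovering a missing key three nodes into a scheduled run.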
Step 2: Configure the HTTP Request Node
Drop an HTTP Request node onto the canvas. Set Method to POST, URL to https://api.alterlab.io/v1/scrape, authenticate with the credential created above, and set Body Content Type to JSON.
{
"url": "https://books.toscrape.com/catalogue/page-1.html",
"render_js": false,
"premium_proxy": false,
"country": "us",
"timeout": 30000
}
For targets protected by Cloudflare, Akamai, or PerimeterX, set render_js: true and premium_proxy: true. The anti-bot bypass layer handles TLS fingerprinting, browser emulation, and CAPTCHA solving transparently — no extra configuration on your end.
The same request in cURL for testing before wiring into n8n:
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://books.toscrape.com/catalogue/page-1.html",
"render_js": false,
"premium_proxy": false
}'
The equivalent single-URL call in Python:
import httpx
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"
def scrape(url: str, render_js: bool = False) -> dict:
with httpx.Client() as client: # synchronous single fetch
r = client.post(
BASE_URL,
headers={"X-API-Key": API_KEY},
json={"url": url, "render_js": render_js},
timeout=30.0,
)
r.raise_for_status()
return r.json()
result = scrape("https://books.toscrape.com/catalogue/page-1.html")
print(result["status_code"], result["elapsed_ms"], "ms")
The API response shape:
{
"success": true,
"status_code": 200,
"url": "https://books.toscrape.com/catalogue/page-1.html",
"html": "<!DOCTYPE html>...",
"elapsed_ms": 712
}
Try it against a live target to see the response before building the rest of the pipeline.
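One pattern worth considering once the request and response shapes above are familiar: start every URL as a cheap static fetch and escalate to render_js plus premium_proxy only when the API reports a block. A sketch under that assumption — the escalation heuristic is this article's suggestion, not an API feature:

```python
API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://api.alterlab.io/v1/scrape"

def scrape_with_escalation(url: str, post=None) -> dict:
    """Try a static fetch first; escalate to full rendering if the API reports a block.

    `post` sends one payload and returns the parsed JSON body; the default uses
    httpx, but any callable can be injected (e.g. for tests).
    """
    if post is None:
        import httpx  # deferred so the helper stays importable without httpx installed

        def post(payload: dict) -> dict:
            r = httpx.post(ENDPOINT, headers={"X-API-Key": API_KEY},
                           json=payload, timeout=30.0)
            r.raise_for_status()
            return r.json()

    body: dict = {}
    for heavy in (False, True):  # cheap static attempt first, then headless + premium proxy
        body = post({"url": url, "render_js": heavy, "premium_proxy": heavy})
        if body.get("success"):  # the API flags blocked requests with success=false
            return body
    return body  # both attempts blocked; caller inspects the success flag
```

Static fetches stay cheap for soft targets, and hardened targets cost one wasted request before the heavy path kicks in.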
Step 3: Parse HTML in the Code Node
Add a Code node immediately after the HTTP Request. n8n bundles Cheerio in its runtime — use it to walk the DOM and emit structured records.
const { load } = require('cheerio');
const results = [];
for (const item of $input.all()) {
const $ = load(item.json.html);
$('article.product_pod').each((_, el) => { // iterate product cards
const title = $(el).find('h3 a').attr('title');
const price = $(el).find('.price_color').text().trim();
const rating = $(el).find('p.star-rating').attr('class')?.split(' ')[1];
const relHref = $(el).find('h3 a').attr('href');
results.push({ // emit flat record
title,
price,
rating,
url: `https://books.toscrape.com/catalogue/${relHref}`,
scraped_at: new Date().toISOString(),
});
});
}
return results.map(r => ({ json: r }));
For targets that return JSON from an XHR endpoint (scraped through the proxy), skip Cheerio and parse directly:
const raw = $input.first().json.html;
const data = JSON.parse(raw); // html field contains the raw JSON string
return data.products.map(p => ({ json: p }));
If Cheerio is missing in a self-hosted setup, run npm install cheerio in the n8n working directory and restart the service.
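The same XHR shortcut works outside n8n: the API's html field carries the raw response body verbatim, so when the target returned JSON, json.loads is all the parsing needed. The products key mirrors the JS example above and is an assumption about the target endpoint:

```python
import json

def parse_xhr_products(api_response: dict) -> list[dict]:
    """Extract product records when the scraped target returned JSON, not HTML."""
    raw = api_response["html"]  # html field holds the raw body, even for JSON targets
    data = json.loads(raw)
    return data["products"]    # key is target-specific; adjust per endpoint
```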
Step 4: Scrape Multiple Pages
Use a Code node to generate a URL list, then feed it through Split In Batches → HTTP Request:
const BASE = 'https://books.toscrape.com/catalogue/page-';
const PAGES = 50;
const urls = Array.from( // generate range of page URLs
{ length: PAGES },
(_, i) => ({ json: { url: `${BASE}${i + 1}.html` } })
);
return urls;
Set Split In Batches to a batch size of 5 to avoid hammering the target. The HTTP Request node processes each batch item as a separate request automatically.
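The same batching discipline is easy to reproduce in a companion script. A stdlib-only sketch — the one-second pause between batches is a guess at politeness, not a documented limit:

```python
import time

def batched(items: list, size: int) -> list[list]:
    """Split a list into consecutive chunks, mirroring n8n's Split In Batches node."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def crawl(urls: list[str], batch_size: int = 5, pause: float = 1.0) -> None:
    """Scrape URLs batch by batch with a polite pause between batches."""
    for batch in batched(urls, batch_size):
        for url in batch:
            ...  # call the scrape() helper from Step 2 here
        time.sleep(pause)  # pause length is an assumption; tune per target
```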
For high-volume pipelines where n8n acts as the orchestrator and Python handles the heavy lifting, use async fan-out:
import asyncio
import httpx
API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://api.alterlab.io/v1/scrape"
async def fetch(client: httpx.AsyncClient, url: str) -> dict:
r = await client.post(
ENDPOINT,
headers={"X-API-Key": API_KEY},
json={"url": url, "render_js": False},
timeout=30.0,
)
r.raise_for_status()
return r.json()
async def scrape_batch(urls: list[str]) -> list[dict]: # fan-out entry point
async with httpx.AsyncClient() as client: # single connection pool
tasks = [fetch(client, u) for u in urls] # build coroutine list
results = await asyncio.gather(*tasks, return_exceptions=True)
return results
if __name__ == "__main__":
pages = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
data = asyncio.run(scrape_batch(pages))
for i, result in enumerate(data):
if isinstance(result, Exception):
print(f"Page {i+1} failed: {result}")
else:
print(f"Page {i+1}: {len(result['html']):,} bytes — {result['elapsed_ms']}ms")
The Python scraping API client wraps this pattern with built-in retry logic, concurrency throttling, and typed responses — worth switching to once you move beyond prototyping.
Step 5: Route Data to Storage
Wire the Code node output to whichever storage node fits your stack.
Postgres — recommended for structured pipelines:
- Node: Postgres, Operation: Insert, Table: scraped_books
- Map title, price, rating, url, scraped_at directly from Code node output fields
Google Sheets — minimal setup for low-volume runs:
- Node: Google Sheets, Operation: Append or Update
- Same column mapping
Webhook forward — for downstream microservices or event buses:
{
"source": "n8n-book-scraper",
"run_id": "{{ $execution.id }}",
"count": 20,
"records": [
{ "title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..." }
]
}
Step 6: Schedule and Add Error Handling
Swap the manual trigger for a Schedule Trigger node before going to production.
| Cadence | Cron Expression | Typical Use Case |
|---|---|---|
| Hourly | 0 * * * * | Price monitoring |
| Daily 06:00 UTC | 0 6 * * * | News/content aggregation |
| Every 15 minutes | */15 * * * * | Inventory feeds |
| Weekdays 09:00 UTC | 0 9 * * 1-5 | B2B lead enrichment |
For event-driven scraping — e.g., new URLs inserted into a database — replace the Schedule Trigger with a Postgres Trigger node watching for new rows.
Error handling — configure before going live:
- HTTP Request node → enable Retry On Fail: 3 retries, 2000ms backoff
- Code node → enable Continue On Fail if partial runs are acceptable
- In Settings → Error Workflow, assign a dedicated workflow that captures and routes failures:
// Runs inside the error workflow's Code node
const err = $input.first().json;
return [{
json: {
workflow: err.workflow?.name,
node: err.execution?.lastNodeExecuted, // which node threw
message: err.execution?.error?.message,
failed_at: new Date().toISOString(),
execution_id: err.execution?.id,
}
}];
Route the output to a Postgres scrape_errors table or a Slack node. Silent failures are harder to diagnose than loud ones.
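Companion scripts deserve the same retry semantics the HTTP Request node provides (3 attempts, 2000 ms between tries). A sketch with a fixed pause — whether the node waits a fixed interval or backs off exponentially between retries is not something this relies on:

```python
import time

def with_retries(fn, attempts: int = 3, backoff_s: float = 2.0):
    """Call fn(), retrying on any exception with a fixed pause between attempts."""
    last_err = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_err = err
            if attempt < attempts - 1:
                time.sleep(backoff_s)
    raise last_err  # all attempts exhausted; surface the final error
```

Wrap the scrape() call from Step 2 in it: with_retries(lambda: scrape(url)).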
Monitoring Pipeline Health
Don't rely solely on n8n's execution log. Instrument your pipeline explicitly:
- Log success: false responses from the scraping API to a monitoring table — the API returns this field even on 200 responses if the target blocked the request
- Store elapsed_ms per run in a scrape_metrics table; an upward trend means proxy pool degradation
- Row count guard — after the storage node, add a Code node that alerts if results.length < EXPECTED_MINIMUM:
const MINIMUM = 15; // expect at least 15 records per page
const count = $input.all().length;
if (count < MINIMUM) { // trigger alert path
throw new Error(`Low yield: got ${count}, expected >= ${MINIMUM}`);
}
return $input.all(); // pass through if OK
Place this node between the Code parser and the storage node. When it throws, n8n's error workflow catches it.
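The elapsed_ms trend check from the list above can be automated the same way. A sketch that compares a recent window against the preceding baseline — the window size and the 1.5× threshold are arbitrary starting points, not tuned values:

```python
from statistics import mean

def latency_degraded(samples_ms: list[float], window: int = 20,
                     threshold: float = 1.5) -> bool:
    """Flag proxy-pool degradation when recent latency outgrows the baseline."""
    if len(samples_ms) < 2 * window:
        return False  # not enough history to compare yet
    baseline = mean(samples_ms[:-window])  # everything before the recent window
    recent = mean(samples_ms[-window:])
    return recent > baseline * threshold
```

Run it over the scrape_metrics table on each scheduled execution and route a True result to the same alert path as the row count guard.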
Takeaways
- n8n's HTTP Request node integrates with any REST scraping API in minutes — no custom nodes required
- Use render_js: true selectively; static fetches are faster and cheaper than headless browser requests
- Keep parsing logic inside the Code node to maintain self-contained, debuggable workflows
- Cheerio handles the majority of HTML extraction cases; fall back to a dedicated parser service only for complex XPath requirements
- Configure retries on the HTTP node and a global error workflow before scheduling — silent data loss compounds across runs
- For event-driven ingestion triggered by new URLs in a queue or database, swap the Schedule Trigger for a Postgres Trigger or AMQP node without changing the rest of the workflow