
Build a Web Scraping Pipeline with n8n and AlterLab
Connect n8n to a scraping API for automated data extraction with anti-bot bypass, JavaScript rendering, proxy rotation, and scheduled cron triggers — step by step.
March 31, 2026
n8n is a workflow automation platform built around HTTP nodes, visual routing, and an in-process JavaScript runtime. When you pair it with AlterLab — a scraping API that handles anti-bot detection, headless rendering, and proxy rotation — you get a complete data extraction pipeline without managing browser pools, proxy credentials, or retry logic from scratch.
This tutorial builds a production-ready pipeline: URL inputs → scraping API → HTML parsing → structured storage, driven by a cron schedule with proper error handling.
Prerequisites
- n8n instance (self-hosted via Docker or n8n Cloud)
- API key — follow the quickstart guide to get one in under two minutes
- Familiarity with n8n's workflow editor and basic JavaScript
Step 1: Store the API Key in n8n Credentials
Never hardcode secrets into HTTP Request nodes. Go to Settings → Credentials → Add Credential → Header Auth and fill in:
| Field | Value |
|---|---|
| Name | Scraping API Key |
| Header Name | X-API-Key |
| Header Value | YOUR_API_KEY |
Reference this credential in every HTTP Request node in the workflow. Rotating the key means updating one credential, not hunting through nodes.
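For standalone scripts that mirror this setup, apply the same rule: read the key from the environment instead of hardcoding it. A minimal sketch; `ALTERLAB_API_KEY` is an illustrative variable name, not one the service mandates:

```python
import os

def auth_headers(env_var: str = "ALTERLAB_API_KEY") -> dict:
    """Build the same header n8n's Header Auth credential injects."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return {"X-API-Key": key}
```

Rotating the key then means updating one environment variable, exactly as the n8n credential centralizes it.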
Step 2: Configure the HTTP Request Node
Drop an HTTP Request node onto the canvas. Set Method to POST, URL to https://api.alterlab.io/v1/scrape, authenticate with the credential created above, and set Body Content Type to JSON. Use this request body:
{
"url": "https://books.toscrape.com/catalogue/page-1.html",
"render_js": false,
"premium_proxy": false,
"country": "us",
"timeout": 30000
}
For targets protected by Cloudflare, Akamai, or PerimeterX, set render_js: true and premium_proxy: true. The anti-bot bypass layer handles TLS fingerprinting, browser emulation, and CAPTCHA solving transparently — no extra configuration on your end.
The same request in cURL for testing before wiring into n8n:
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://books.toscrape.com/catalogue/page-1.html",
"render_js": false,
"premium_proxy": false
}'
The equivalent single-URL call in Python:
import httpx
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"
def scrape(url: str, render_js: bool = False) -> dict:
with httpx.Client() as client: # synchronous single fetch
r = client.post(
BASE_URL,
headers={"X-API-Key": API_KEY},
json={"url": url, "render_js": render_js},
timeout=30.0,
)
r.raise_for_status()
return r.json()
result = scrape("https://books.toscrape.com/catalogue/page-1.html")
print(result["status_code"], result["elapsed_ms"], "ms")
The API response shape:
{
"success": true,
"status_code": 200,
"url": "https://books.toscrape.com/catalogue/page-1.html",
"html": "<!DOCTYPE html>...",
"elapsed_ms": 712
}
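Before wiring up the parser, it helps to validate that shape defensively. A small sketch against the fields shown above, assuming a missing or false `success` flag should abort the run:

```python
def extract_html(resp: dict) -> str:
    """Validate the scrape response shape before handing HTML to a parser."""
    if not resp.get("success"):
        raise RuntimeError(f"scrape failed for {resp.get('url')}")
    if resp.get("status_code") != 200:
        raise RuntimeError(f"target returned HTTP {resp.get('status_code')}")
    return resp["html"]
```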
Step 3: Parse HTML in the Code Node
Add a Code node immediately after the HTTP Request. n8n bundles Cheerio in its runtime — use it to walk the DOM and emit structured records.
const { load } = require('cheerio');
const results = [];
for (const item of $input.all()) {
const $ = load(item.json.html);
$('article.product_pod').each((_, el) => { // iterate product cards
const title = $(el).find('h3 a').attr('title');
const price = $(el).find('.price_color').text().trim();
const rating = $(el).find('p.star-rating').attr('class')?.split(' ')[1];
const relHref = $(el).find('h3 a').attr('href');
results.push({ // emit flat record
title,
price,
rating,
url: `https://books.toscrape.com/catalogue/${relHref}`,
scraped_at: new Date().toISOString(),
});
});
}
return results.map(r => ({ json: r }));
For targets that return JSON from an XHR endpoint (scraped through the proxy), skip Cheerio and parse directly:
const raw = $input.first().json.html;
const data = JSON.parse(raw); // html field contains the raw JSON string
return data.products.map(p => ({ json: p }));
If Cheerio is missing in a self-hosted setup, run `npm install cheerio` in the n8n working directory and restart the service.
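The same XHR-JSON pattern in a standalone Python script, assuming, as above, that the API returns the raw JSON body in the `html` field and that the payload carries a `products` array:

```python
import json

def parse_json_payload(resp: dict) -> list[dict]:
    """The API returns the response body in the `html` field even when it is JSON."""
    data = json.loads(resp["html"])
    return data["products"]
```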
Step 4: Scrape Multiple Pages
Use a Code node to generate a URL list, then feed it through Split In Batches → HTTP Request:
const BASE = 'https://books.toscrape.com/catalogue/page-';
const PAGES = 50;
const urls = Array.from( // generate range of page URLs
{ length: PAGES },
(_, i) => ({ json: { url: `${BASE}${i + 1}.html` } })
);
return urls;
Set Split In Batches to a batch size of 5 to avoid hammering the target. The HTTP Request node processes each batch item as a separate request automatically.
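The Split In Batches behavior is easy to mirror in a standalone script. A minimal sketch of the same batch-of-5 grouping; pace the batches with whatever delay suits your target:

```python
def chunked(items: list, size: int = 5) -> list[list]:
    """Mirror Split In Batches: groups of `size`, last group may be shorter."""
    return [items[i:i + size] for i in range(0, len(items), size)]

urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
batches = chunked(urls, 5)  # 10 batches of 5 pages each
```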
For high-volume pipelines where n8n acts as the orchestrator and Python handles the heavy lifting, use async fan-out:
import asyncio
import httpx
API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://api.alterlab.io/v1/scrape"
async def fetch(client: httpx.AsyncClient, url: str) -> dict:
r = await client.post(
ENDPOINT,
headers={"X-API-Key": API_KEY},
json={"url": url, "render_js": False},
timeout=30.0,
)
r.raise_for_status()
return r.json()
async def scrape_batch(urls: list[str]) -> list[dict]: # fan-out entry point
async with httpx.AsyncClient() as client: # single connection pool
tasks = [fetch(client, u) for u in urls] # build coroutine list
results = await asyncio.gather(*tasks, return_exceptions=True)
return results
if __name__ == "__main__":
pages = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
data = asyncio.run(scrape_batch(pages))
for i, result in enumerate(data):
if isinstance(result, Exception):
print(f"Page {i+1} failed: {result}")
else:
print(f"Page {i+1}: {len(result['html']):,} bytes — {result['elapsed_ms']}ms")
The Python scraping API client wraps this pattern with built-in retry logic, concurrency throttling, and typed responses — worth switching to once you move beyond prototyping.
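The retry and throttling behavior that client wraps can be sketched generically: exponential backoff per request, plus a semaphore to cap in-flight concurrency. The coroutine factory passed in stands for any fetch call; this is an illustration of the pattern, not the client's actual API:

```python
import asyncio

async def with_retry(coro_factory, attempts: int = 3, base_delay: float = 0.1):
    """Retry a coroutine factory with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            await asyncio.sleep(base_delay * 2 ** attempt)

async def throttled_gather(factories, limit: int = 5):
    """Run all factories concurrently, but never more than `limit` at once."""
    sem = asyncio.Semaphore(limit)

    async def run(factory):
        async with sem:
            return await with_retry(factory)

    return await asyncio.gather(*(run(f) for f in factories))
```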
Step 5: Route Data to Storage
Wire the Code node output to whichever storage node fits your stack.
Postgres — recommended for structured pipelines:
- Node: Postgres, Operation: Insert, Table: `scraped_books`
- Map `title`, `price`, `rating`, `url`, `scraped_at` directly from Code node output fields
Google Sheets — minimal setup for low-volume runs:
- Node: Google Sheets, Operation: Append or Update
- Same column mapping
Webhook forward — for downstream microservices or event buses:
{
"source": "n8n-book-scraper",
"run_id": "{{ $execution.id }}",
"count": 20,
"records": [
{ "title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..." }
]
}
Step 6: Schedule and Add Error Handling
Swap the manual trigger for a Schedule Trigger node before going to production.
| Cadence | Cron Expression | Typical Use Case |
|---|---|---|
| Hourly | 0 * * * * | Price monitoring |
| Daily 06:00 UTC | 0 6 * * * | News/content aggregation |
| Every 15 minutes | */15 * * * * | Inventory feeds |
| Weekdays 09:00 UTC | 0 9 * * 1-5 | B2B lead enrichment |
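To see what those expressions actually match, here is a deliberately simplified evaluator for a single cron field. It covers only the forms used in the table (`*`, `*/n`, `a-b`, and literals); real cron also supports comma-separated lists and month/day names:

```python
def field_matches(expr: str, value: int) -> bool:
    """Evaluate one cron field against a value (subset of cron syntax)."""
    if expr == "*":
        return True               # wildcard: every value
    if expr.startswith("*/"):     # step: */15 matches 0, 15, 30, 45
        return value % int(expr[2:]) == 0
    if "-" in expr:               # range: 1-5 matches Monday through Friday
        lo, hi = map(int, expr.split("-"))
        return lo <= value <= hi
    return value == int(expr)     # literal value
```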
For event-driven scraping — e.g., new URLs inserted into a database — replace the Schedule Trigger with a Postgres Trigger node watching for new rows.
Error handling — configure before going live:
- HTTP Request node → enable Retry On Fail: 3 retries, 2000ms backoff
- Code node → enable Continue On Fail if partial runs are acceptable
- In Settings → Error Workflow, assign a dedicated workflow that captures and routes failures:
// Runs inside the error workflow's Code node
const err = $input.first().json;
return [{
json: {
workflow: err.workflow?.name,
node: err.execution?.lastNodeExecuted, // which node threw
message: err.execution?.error?.message,
failed_at: new Date().toISOString(),
execution_id: err.execution?.id,
}
}];
Route the output to a Postgres `scrape_errors` table or a Slack node. Silent failures are harder to diagnose than loud ones.
Monitoring Pipeline Health
Don't rely solely on n8n's execution log. Instrument your pipeline explicitly:
- Log `success: false` responses from the scraping API to a monitoring table — the API returns this field even on 200 responses if the target blocked the request
- Store `elapsed_ms` per run in a `scrape_metrics` table; an upward trend signals proxy pool degradation
- Row count guard — after the parsing Code node, add a second Code node that alerts if `results.length < EXPECTED_MINIMUM`:
const MINIMUM = 15; // expect at least 15 records per page
const count = $input.all().length;
if (count < MINIMUM) { // trigger alert path
throw new Error(`Low yield: got ${count}, expected >= ${MINIMUM}`);
}
return $input.all(); // pass through if OK
Place this node between the Code parser and the storage node. When it throws, n8n's error workflow catches it.
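The `elapsed_ms` trend check mentioned above can be sketched as a comparison of recent latency against the run's own baseline. The window size and threshold factor here are illustrative starting points, not recommendations:

```python
from statistics import mean

def latency_degraded(samples: list[float], window: int = 10, factor: float = 1.5) -> bool:
    """Flag degradation when the mean of the most recent `window` samples
    exceeds the baseline mean (all earlier samples) by `factor`."""
    if len(samples) < 2 * window:
        return False  # not enough history to compare against
    baseline = mean(samples[:-window])
    recent = mean(samples[-window:])
    return recent > factor * baseline
```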
Takeaways
- n8n's HTTP Request node integrates with any REST scraping API in minutes — no custom nodes required
- Use `render_js: true` selectively; static fetches are faster and cheaper than headless browser requests
- Keep parsing logic inside the Code node to maintain self-contained, debuggable workflows
- Cheerio handles the majority of HTML extraction cases; fall back to a dedicated parser service only for complex XPath requirements
- Configure retries on the HTTP node and a global error workflow before scheduling — silent data loss compounds across runs
- For event-driven ingestion triggered by new URLs in a queue or database, swap the Schedule Trigger for a Postgres Trigger or AMQP node without changing the rest of the workflow