
Build a Web Scraping Pipeline with n8n and AlterLab

Connect n8n to a scraping API for automated data extraction with anti-bot bypass, JavaScript rendering, proxy rotation, and scheduled cron triggers — step by step.

Yash Dubey

March 31, 2026

8 min read

n8n is a workflow automation platform built around HTTP nodes, visual routing, and an in-process JavaScript runtime. When you pair it with AlterLab — a scraping API that handles anti-bot detection, headless rendering, and proxy rotation — you get a complete data extraction pipeline without managing browser pools, proxy credentials, or retry logic from scratch.

This tutorial builds a production-ready pipeline: URL inputs → scraping API → HTML parsing → structured storage, driven by a cron schedule with proper error handling.

Prerequisites

  • n8n instance (self-hosted via Docker or n8n Cloud)
  • API key — follow the quickstart guide to get one in under two minutes
  • Familiarity with n8n's workflow editor and basic JavaScript

Step 1: Store the API Key in n8n Credentials

Never hardcode secrets into HTTP Request nodes. Go to Settings → Credentials → Add Credential → Header Auth and fill in:

Field          Value
Name           Scraping API Key
Header Name    X-API-Key
Header Value   YOUR_API_KEY

Reference this credential in every HTTP Request node in the workflow. Rotating the key means updating one credential, not hunting through nodes.
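The same no-hardcoding rule applies to local test scripts outside n8n. A minimal sketch, reading the key from an environment variable (the ALTERLAB_API_KEY name is an assumption, not a documented convention):

```python
import os

def auth_headers() -> dict:
    """Build the X-API-Key header, reading the key from the environment
    at call time so rotating it never requires a code change."""
    key = os.environ.get("ALTERLAB_API_KEY", "")
    if not key:
        raise RuntimeError("Set ALTERLAB_API_KEY before running")
    return {"X-API-Key": key}
```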


Step 2: Configure the HTTP Request Node

Drop an HTTP Request node into the canvas. Set Method to POST, URL to https://api.alterlab.io/v1/scrape, authenticate with the credential created above, and set Body Content Type to JSON.

JSON
{
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "render_js": false,
  "premium_proxy": false,
  "country": "us",
  "timeout": 30000
}

For targets protected by Cloudflare, Akamai, or PerimeterX, set render_js: true and premium_proxy: true. The anti-bot bypass layer handles TLS fingerprinting, browser emulation, and CAPTCHA solving transparently — no extra configuration on your end.
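Since rendered requests cost more than static fetches, it can help to escalate only for domains you know are protected. A sketch of that decision, where PROTECTED_DOMAINS and build_payload are illustrative names you would maintain yourself based on observed blocks:

```python
from urllib.parse import urlparse

# Illustrative list -- populate it from domains that actually block you.
PROTECTED_DOMAINS = {"example-shop.com", "example-news.com"}

def build_payload(url: str, timeout_ms: int = 30000) -> dict:
    """Return a request body, enabling rendering and premium proxies
    only when the target host is on the protected list."""
    host = urlparse(url).hostname or ""
    protected = any(host.endswith(d) for d in PROTECTED_DOMAINS)
    return {
        "url": url,
        "render_js": protected,       # headless rendering only when needed
        "premium_proxy": protected,   # premium pool only when needed
        "country": "us",
        "timeout": timeout_ms,
    }
```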

The same request in cURL for testing before wiring into n8n:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "render_js": false,
    "premium_proxy": false
  }'

The equivalent single-URL call in Python:

Python
import httpx

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"

def scrape(url: str, render_js: bool = False) -> dict:
    with httpx.Client() as client:                    # synchronous single fetch
        r = client.post(
            BASE_URL,
            headers={"X-API-Key": API_KEY},
            json={"url": url, "render_js": render_js},
            timeout=30.0,
        )
        r.raise_for_status()
        return r.json()

result = scrape("https://books.toscrape.com/catalogue/page-1.html")
print(result["status_code"], result["elapsed_ms"], "ms")

The API response shape:

JSON
{
  "success": true,
  "status_code": 200,
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "html": "<!DOCTYPE html>...",
  "elapsed_ms": 712
}
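Note that success can be false even on an HTTP 200 if the target blocked the request, so validate the envelope before parsing. A minimal check, assuming the response shape shown above:

```python
def check_response(body: dict) -> str:
    """Return the HTML payload, or raise if the target blocked the request.
    Treats success: false as a failure even when the HTTP status is 200."""
    if not body.get("success"):
        raise RuntimeError(
            f"Scrape failed for {body.get('url')}: "
            f"upstream status {body.get('status_code')}"
        )
    return body["html"]
```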

Run the request against a live target to confirm the response shape before building the rest of the pipeline.



Step 3: Parse HTML in the Code Node

Add a Code node immediately after the HTTP Request. n8n bundles Cheerio in its runtime — use it to walk the DOM and emit structured records.

JavaScript
const { load } = require('cheerio');

const results = [];

for (const item of $input.all()) {
  const $ = load(item.json.html);

  $('article.product_pod').each((_, el) => {         // iterate product cards
    const title   = $(el).find('h3 a').attr('title');
    const price   = $(el).find('.price_color').text().trim();
    const rating  = $(el).find('p.star-rating').attr('class')?.split(' ')[1];
    const relHref = $(el).find('h3 a').attr('href');

    results.push({                                    // emit flat record
      title,
      price,
      rating,
      url: `https://books.toscrape.com/catalogue/${relHref}`,
      scraped_at: new Date().toISOString(),
    });
  });
}

return results.map(r => ({ json: r }));

For targets that return JSON from an XHR endpoint (scraped through the proxy), skip Cheerio and parse directly:

JavaScript
const raw = $input.first().json.html;
const data = JSON.parse(raw);            // html field contains the raw JSON string
return data.products.map(p => ({ json: p }));

If Cheerio is missing in a self-hosted setup, run npm install cheerio in the n8n working directory and restart the service.



Step 4: Scrape Multiple Pages

Use a Code node to generate a URL list, then feed it through Split In Batches → HTTP Request:

JavaScript
const BASE  = 'https://books.toscrape.com/catalogue/page-';
const PAGES = 50;

const urls = Array.from(                        // generate range of page URLs
  { length: PAGES },
  (_, i) => ({ json: { url: `${BASE}${i + 1}.html` } })
);

return urls;

Set Split In Batches to a batch size of 5 to avoid hammering the target. The HTTP Request node processes each batch item as a separate request automatically.
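The Split In Batches behavior can be sketched in a few lines if you orchestrate the same loop from Python instead (the batched helper is an illustration, not part of any API):

```python
def batched(items: list, size: int) -> list[list]:
    """Chunk a list into consecutive groups of at most `size` items,
    mirroring n8n's Split In Batches node."""
    return [items[i:i + size] for i in range(0, len(items), size)]

urls = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 51)]
batches = batched(urls, 5)   # 10 batches of 5 URLs each
```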

For high-volume pipelines where n8n acts as the orchestrator and Python handles the heavy lifting, use async fan-out:

Python
import asyncio
import httpx

API_KEY  = "YOUR_API_KEY"
ENDPOINT = "https://api.alterlab.io/v1/scrape"

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    r = await client.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},
        json={"url": url, "render_js": False},
        timeout=30.0,
    )
    r.raise_for_status()
    return r.json()

async def scrape_batch(urls: list[str]) -> list[dict]:  # fan-out entry point
    async with httpx.AsyncClient() as client:           # single connection pool
        tasks   = [fetch(client, u) for u in urls]      # build coroutine list
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

if __name__ == "__main__":
    pages = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
    data  = asyncio.run(scrape_batch(pages))

    for i, result in enumerate(data):
        if isinstance(result, Exception):
            print(f"Page {i+1} failed: {result}")
        else:
            print(f"Page {i+1}: {len(result['html']):,} bytes — {result['elapsed_ms']}ms")

The Python scraping API client wraps this pattern with built-in retry logic, concurrency throttling, and typed responses — worth switching to once you move beyond prototyping.
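Until then, a hand-rolled retry wrapper covers the common case. This is a sketch of the pattern, not the client's actual implementation: exponential backoff with jitter around a caller-supplied fetch function.

```python
import random
import time

def with_retries(fetch_fn, retries: int = 3, base_delay: float = 2.0):
    """Call fetch_fn, retrying on exception with exponential backoff
    plus jitter. In real code, narrow `Exception` to transport errors."""
    last_exc = None
    for attempt in range(retries + 1):
        try:
            return fetch_fn()
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise last_exc
```

Wrap the single-URL scrape call from Step 2 in with_retries to get the same resilience the HTTP Request node's Retry On Fail setting provides inside n8n.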


Step 5: Route Data to Storage

Wire the Code node output to whichever storage node fits your stack.

Postgres — recommended for structured pipelines:

  • Node: Postgres, Operation: Insert, Table: scraped_books
  • Map title, price, rating, url, scraped_at directly from Code node output fields

Google Sheets — minimal setup for low-volume runs:

  • Node: Google Sheets, Operation: Append or Update
  • Same column mapping

Webhook forward — for downstream microservices or event buses:

JSON
{
  "source": "n8n-book-scraper",
  "run_id": "{{ $execution.id }}",
  "count": 20,
  "records": [
    { "title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..." }
  ]
}
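Assembling that envelope outside n8n looks like the sketch below; build_webhook_payload is a hypothetical helper, and run_id stands in for the value n8n supplies via $execution.id:

```python
def build_webhook_payload(run_id: str, records: list[dict]) -> dict:
    """Wrap scraped records in the webhook envelope shown above."""
    return {
        "source": "n8n-book-scraper",
        "run_id": run_id,
        "count": len(records),
        "records": records,
    }
```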

Step 6: Schedule and Add Error Handling

Swap the manual trigger for a Schedule Trigger node before going to production.

Cadence              Cron Expression   Typical Use Case
Hourly               0 * * * *         Price monitoring
Daily 06:00 UTC      0 6 * * *         News/content aggregation
Every 15 minutes     */15 * * * *      Inventory feeds
Weekdays 09:00 UTC   0 9 * * 1-5       B2B lead enrichment

For event-driven scraping — e.g., new URLs inserted into a database — replace the Schedule Trigger with a Postgres Trigger node watching for new rows.

Error handling — configure before going live:

  1. HTTP Request node → enable Retry On Fail: 3 retries, 2000ms backoff
  2. Code node → enable Continue On Fail if partial runs are acceptable
  3. In Settings → Error Workflow, assign a dedicated workflow that captures and routes failures:
JavaScript
// Runs inside the error workflow's Code node
const err = $input.first().json;

return [{
  json: {
    workflow:     err.workflow?.name,
    node:         err.execution?.lastNodeExecuted,   // which node threw
    message:      err.execution?.error?.message,
    failed_at:    new Date().toISOString(),
    execution_id: err.execution?.id,
  }
}];

Route the output to a Postgres scrape_errors table or a Slack node. Silent failures are harder to diagnose than loud ones.



Monitoring Pipeline Health

Don't rely solely on n8n's execution log. Instrument your pipeline explicitly:

  • Log success: false responses from the scraping API to a monitoring table — the API returns this field even on 200 responses if the target blocked the request
  • Store elapsed_ms per run in a scrape_metrics table; an upward trend signals proxy pool degradation
  • Row count guard — after the storage node, add a Code node that alerts if results.length < EXPECTED_MINIMUM:
JavaScript
const MINIMUM = 15; // expect at least 15 records per page

const count = $input.all().length;

if (count < MINIMUM) {                        // trigger alert path
  throw new Error(`Low yield: got ${count}, expected >= ${MINIMUM}`);
}

return $input.all(); // pass through if OK

Place this node between the Code parser and the storage node. When it throws, n8n's error workflow catches it.
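The latency-trend check from the bullets above can be made concrete. A sketch, assuming you query recent elapsed_ms samples from scrape_metrics; the window size and degradation factor are tunable assumptions:

```python
from statistics import mean

def latency_degraded(samples: list[int], window: int = 10, factor: float = 1.5) -> bool:
    """True if the mean of the most recent `window` elapsed_ms samples
    exceeds the mean of the preceding window by `factor`."""
    if len(samples) < 2 * window:
        return False          # not enough history to judge
    recent = mean(samples[-window:])
    prior = mean(samples[-2 * window:-window])
    return recent > prior * factor
```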


Takeaways

  • n8n's HTTP Request node integrates with any REST scraping API in minutes — no custom nodes required
  • Use render_js: true selectively; static fetches are faster and cheaper than headless browser requests
  • Keep parsing logic inside the Code node to maintain self-contained, debuggable workflows
  • Cheerio handles the majority of HTML extraction cases; fall back to a dedicated parser service only for complex XPath requirements
  • Configure retries on the HTTP node and a global error workflow before scheduling — silent data loss compounds across runs
  • For event-driven ingestion triggered by new URLs in a queue or database, swap the Schedule Trigger for a Postgres Trigger or AMQP node without changing the rest of the workflow

Frequently Asked Questions

Can n8n scrape websites on its own?
n8n has no native scraper, but its HTTP Request node connects to any scraping API. Pair it with a service like AlterLab to handle anti-bot bypass, JavaScript rendering, and proxy rotation; n8n handles scheduling, HTML parsing, and downstream storage.

How do I scrape JavaScript-rendered pages?
Pass `"render_js": true` in the HTTP Request body. The scraping API backend spins up a headless browser, executes JavaScript, and returns the fully rendered HTML; no Playwright or Puppeteer setup is needed inside your n8n instance.

How do I handle errors in scheduled scraping workflows?
Enable "Retry On Fail" on the HTTP Request node with 3 retries and 2-second backoff. Create a dedicated n8n Error Workflow to catch uncaught failures and log them to a Postgres dead-letter table or fire a Slack alert, preventing silent data loss in scheduled pipelines.