
Automate Web Scraping in n8n with AlterLab API

Learn how to build automated web scraping workflows in n8n using AlterLab's API. Step-by-step tutorial with Python SDK and cURL examples.

Yash Dubey

April 11, 2026



n8n is a workflow automation tool that connects APIs, databases, and services. Pair it with a scraping API that handles anti-bot bypass, proxy rotation, and headless rendering, and you get a pipeline that pulls structured data from any website on a schedule.

This tutorial shows how to build that pipeline. You will configure an n8n workflow that sends scrape requests, receives clean JSON, and routes the data to a database, spreadsheet, or webhook.

Prerequisites

  • An n8n instance (self-hosted or cloud)
  • An API key from alterlab.io/signup
  • Basic familiarity with n8n's node-based workflow editor

Step 1: Configure the HTTP Request Node

Create a new workflow in n8n. Add an HTTP Request node and configure it as follows:

  • Method: POST
  • URL: https://api.alterlab.io/v1/scrape
  • Authentication: Header Auth
  • Header Name: X-API-Key
  • Header Value: Your API key
  • Send Body: JSON

Set the JSON body to:

JSON
{
  "url": "https://example.com/products",
  "formats": ["json"],
  "min_tier": 3
}

The min_tier parameter controls the scraping tier. Tier 3 enables JavaScript rendering. Set it higher for sites with aggressive bot detection. The anti-bot bypass system auto-escalates if the initial tier fails.
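If you prefer to script the call (or run it inside an n8n Code node), the same request can be sketched in Python with the `requests` library. This mirrors the node configuration above; the `build_payload` and `scrape` names are illustrative, not part of any SDK:

```python
import requests

API_URL = "https://api.alterlab.io/v1/scrape"

def build_payload(url: str, min_tier: int = 3) -> dict:
    """Request body matching the HTTP Request node configuration above."""
    return {"url": url, "formats": ["json"], "min_tier": min_tier}

def scrape(url: str, api_key: str, min_tier: int = 3) -> dict:
    """POST a scrape job to the AlterLab API and return the parsed JSON."""
    resp = requests.post(
        API_URL,
        headers={"X-API-Key": api_key},
        json=build_payload(url, min_tier),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```

Keeping the payload builder separate makes it easy to reuse the same body in both the script and the n8n node.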

Step 2: Test with cURL First

Before building the full workflow, verify the endpoint works from your terminal. This isolates API issues from n8n configuration problems.

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "formats": ["json"]}'

A successful response returns structured data:

JSON
{
  "status": "success",
  "data": {
    "products": [
      {"name": "Widget A", "price": 29.99},
      {"name": "Widget B", "price": 49.99}
    ]
  },
  "metadata": {
    "url": "https://example.com/products",
    "timestamp": "2026-04-11T10:30:00Z"
  }
}

Step 3: Build the Full n8n Workflow

A production workflow needs more than a single HTTP request. You need error handling, data transformation, and a destination for the scraped data.

Workflow Structure

Code
[Schedule Trigger] -> [HTTP Request (Scrape)] -> [Code (Parse)] -> [Database/Sheet/Webhook]

Add these nodes in order:

1. Schedule Trigger

Set a cron expression for your scrape frequency. Daily at 6 AM UTC:

Code
0 6 * * *

2. HTTP Request Node

Use the configuration from Step 1. Enable "Continue On Fail" so one failed scrape does not block the entire workflow.

3. Code Node (Data Transformation)

Parse the JSON response and extract the fields you need:

Python
# n8n Python Code node: the HTTP Request output is already a parsed dict
# (note: the Python Code node exposes _input, not the JavaScript $input)
response = _input.first().json

# Extract product data
products = response.get("data", {}).get("products", [])

# Transform to your schema
items = []
for product in products:
    items.append({
        "json": {
            "name": product["name"],
            "price": product["price"],
            "scraped_at": response["metadata"]["timestamp"],
            "source": response["metadata"]["url"]
        }
    })

return items

4. Destination Node

Connect your output node. Common choices:

  • Postgres/MySQL: Use the database node to upsert records
  • Google Sheets: Append rows for lightweight tracking
  • Webhook: Push to your own API or a Slack channel

Step 4: Handle Multiple URLs

Scraping a single page is straightforward. Real pipelines scrape dozens or hundreds of URLs. Use n8n's Split Out node to fan out requests.

Python
# Code node that outputs multiple URLs
urls = [
    "https://example.com/products/page/1",
    "https://example.com/products/page/2",
    "https://example.com/products/page/3"
]

return [{"json": {"url": u}} for u in urls]

Connect this to a Split Out node, then to your HTTP Request node. Each URL becomes a separate execution branch. n8n processes them in parallel up to your concurrency limit.

Add rate limiting between requests if the target site requires it. Use the Wait node between the Split Out and HTTP Request nodes:

Code
Wait: 2 seconds

Step 5: Add Error Handling and Retries

Scraping fails. Pages change structure, sites go down, anti-bot systems update. Your workflow should handle failures gracefully.

Retry Configuration

In the HTTP Request node settings:

  • Retry On Fail: Enable
  • Max Retries: 3
  • Retry Backoff: Exponential

Error Routing

Add an error output branch from the HTTP Request node:

Code
[HTTP Request] --(success)--> [Parse] --> [Database]
       |
       --(error)--> [Error Handler] --> [Alert/Log]

The error handler can log failures to a separate sheet, send a Slack notification, or queue the URL for a retry with a higher tier.

Python
from datetime import datetime, timezone

# Capture the failed request so it can be queued for retry
error_data = _input.first().json

failed_urls = [{
    "url": error_data.get("url"),
    "error": error_data.get("error"),
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "retry_tier": 4  # escalate tier on retry
}]

return [{"json": {"failed": failed_urls}}]

Step 6: Use Cortex AI for Structured Extraction

Some pages do not have clean HTML structures. Product listings buried in JavaScript, unstructured text, or dynamic content require a different approach. Cortex AI extracts structured data using natural language instructions.

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/reviews",
    "formats": ["json"],
    "cortex": {
      "prompt": "Extract reviewer name, rating (1-5), and review text from each review block"
    }
  }'

The response returns data matching your schema:

JSON
{
  "status": "success",
  "data": {
    "reviews": [
      {
        "reviewer_name": "Jane D.",
        "rating": 5,
        "review_text": "Excellent product, fast shipping."
      },
      {
        "reviewer_name": "Mark S.",
        "rating": 4,
        "review_text": "Good quality, slightly overpriced."
      }
    ]
  }
}

In n8n, the Cortex output works identically to standard JSON output. Route it through the same Code and Database nodes.
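The Code-node transformation for Cortex output can be reduced to a small helper that emits one n8n item per review. A minimal sketch, using the field names from the sample response above:

```python
def reviews_to_items(response: dict) -> list:
    """Turn a Cortex scrape response into one n8n item per review."""
    # Missing or empty "data" yields an empty item list rather than an error
    reviews = response.get("data", {}).get("reviews", [])
    return [{"json": review} for review in reviews]
```

Each returned dict follows n8n's item convention (`{"json": {...}}`), so the output feeds directly into a database or spreadsheet node.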

Step 7: Monitor and Alert on Changes

Scraping is not always about collecting new data. Sometimes you need to detect changes on existing pages. Price drops, stock availability, competitor updates, regulatory filings.

Configure monitoring by storing previous scrape results and comparing them on each run:

Python
# Compare current scrape with previous state
current = _input.first().json
previous = get_previous_state(current["url"])  # placeholder: load the last run's result from your database

changes = []
for key in current["data"]:
    if key not in previous:
        changes.append({"field": key, "action": "added", "value": current["data"][key]})
    elif current["data"][key] != previous[key]:
        changes.append({
            "field": key,
            "action": "changed",
            "old": previous[key],
            "new": current["data"][key]
        })

# Only pass through if changes detected
if changes:
    return [{"json": {"url": current["url"], "changes": changes}}]
return []

When changes exist, route to an alert node. When nothing changed, the workflow exits silently.


Cost Considerations

Scraping pipelines can get expensive if you are not careful. A few practices:

  • Cache aggressively: Do not re-scrape pages that have not changed. Store hashes of previous responses and skip identical results.
  • Use the lowest tier that works: Start with min_tier: 1 for static pages. Only escalate to tier 3+ for JavaScript-heavy sites.
  • Batch URLs: Group related URLs into single workflow runs rather than triggering separate workflows per URL.
  • Set spend limits: API keys support spend caps. Set them per workflow to prevent runaway costs.
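The first practice can be sketched with a content hash: store a SHA-256 digest of each response body and skip downstream processing when nothing changed. The in-memory `seen` dict here stands in for whatever persistent store your workflow uses:

```python
import hashlib
import json

# Stand-in for a persistent store (database, Redis, a Google Sheet, ...)
seen: dict[str, str] = {}

def content_hash(data: dict) -> str:
    """Stable SHA-256 of a JSON response body (key order normalized)."""
    return hashlib.sha256(
        json.dumps(data, sort_keys=True).encode()
    ).hexdigest()

def has_changed(url: str, data: dict) -> bool:
    """Return True (and record the new hash) only when the content differs."""
    digest = content_hash(data)
    if seen.get(url) == digest:
        return False  # identical to last run: skip re-processing
    seen[url] = digest
    return True
```

In n8n this check sits in a Code node between the HTTP Request and the destination node, so unchanged pages never reach the database.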

Check pricing for current rates. You pay for what you use with no monthly minimums.

Complete Workflow Example

Here is the full n8n workflow JSON for a daily product price scrape:

JSON
{
  "name": "Daily Price Scraper",
  "nodes": [
    {
      "name": "Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": { "interval": ["days"], "triggerAtHour": 6 }
      }
    },
    {
      "name": "Scrape Products",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "method": "POST",
        "url": "https://api.alterlab.io/v1/scrape",
        "authentication": "headerAuth",
        "body": {
          "url": "={{ $json.url }}",
          "formats": ["json"],
          "min_tier": 3
        },
        "options": {
          "retryOnFail": true,
          "maxTries": 3
        }
      }
    },
    {
      "name": "Parse Response",
      "type": "n8n-nodes-base.code",
      "parameters": {
        "jsCode": "const data = $input.first().json;\nreturn data.data.products.map(p => ({ json: p }));"
      }
    },
    {
      "name": "Save to Database",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "upsert",
        "table": "product_prices",
        "columns": "name,price,scraped_at"
      }
    }
  ],
  "connections": {
    "Schedule": { "main": [[{ "node": "Scrape Products", "type": "main" }]] },
    "Scrape Products": { "main": [[{ "node": "Parse Response", "type": "main" }]] },
    "Parse Response": { "main": [[{ "node": "Save to Database", "type": "main" }]] }
  }
}

Import this into n8n via the workflow editor, replace the authentication credentials with your API key, and adjust the URL and database schema to match your use case.

Troubleshooting

Empty responses: The page may require a higher tier. Increase min_tier to 4 or 5. Check the API docs for tier descriptions.

Rate limit errors: Add a Wait node between requests. Start with 1-2 seconds and increase if needed.

CAPTCHA blocks: Set min_tier: 5 to enable CAPTCHA solving. This costs more per request but eliminates manual intervention.

Schema drift: Websites change their HTML structure. Cortex AI handles this better than CSS selectors since it uses semantic understanding. Switch to Cortex if your selectors break frequently.

n8n timeout: Long-running scrapes can exceed n8n's execution timeout. For large batches, use the webhook pattern. Configure AlterLab to push results to an n8n webhook URL instead of polling.
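Several of these fixes combine into one retry helper: start at a low tier and re-submit at a higher one when the response comes back empty. A sketch, where the `scrape` callable is assumed to wrap the POST request from Step 1 and accept a `min_tier` keyword:

```python
import time

def scrape_with_escalation(scrape, url: str, start_tier: int = 1,
                           max_tier: int = 5, delay: float = 2.0) -> dict:
    """Retry a scrape at increasing tiers until data comes back."""
    result = {}
    for tier in range(start_tier, max_tier + 1):
        result = scrape(url, min_tier=tier)
        if result.get("data"):  # non-empty payload: done
            return result
        time.sleep(delay)  # back off before escalating the tier
    return result  # last attempt, possibly still empty
```

Because the API already auto-escalates on anti-bot failures, this wrapper only matters for the "empty but successful" case, where a higher tier (JavaScript rendering, CAPTCHA solving) is needed to see the content at all.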

Takeaway

n8n handles orchestration. AlterLab handles extraction. Together they give you a scraping pipeline that runs on a schedule, handles failures, and delivers clean data to your systems.

Start with a single URL and a basic HTTP Request node. Add error handling, multi-URL support, and change detection as your needs grow. The quickstart guide covers API setup in under five minutes.


Frequently Asked Questions

How do I connect n8n to AlterLab?
Use n8n's HTTP Request node to POST to https://api.alterlab.io/v1/scrape with your API key in the X-API-Key header. You can also use the Python SDK in a Code node for more complex workflows.

Does AlterLab handle anti-bot protection automatically?
Yes. AlterLab automatically handles anti-bot detection, CAPTCHAs, and JavaScript rendering. You set the tier level via the min_tier parameter and the API handles the rest.

Which output format works best with n8n?
AlterLab returns clean JSON, Markdown, or plain text. JSON works best in n8n since it maps directly to node outputs for downstream processing.