Automate Lead Enrichment in n8n with Web Scraping APIs


Build a deterministic n8n workflow to extract structured JSON data from public company websites using automated data pipelines and headless browsers.

Yash Dubey

April 24, 2026

5 min read

Lead enrichment pipelines typically rely on static, outdated third-party databases. By extracting data from public company websites directly, you guarantee data freshness and relevance. n8n provides the orchestration layer to move this data, but extracting structured data from unstructured HTML requires a dedicated scraping layer.

We will build an n8n workflow that takes a raw company URL, processes it through a headless browser, extracts specific firmographic data using LLM-based parsing, and pushes the structured JSON into a database.

The Core Concept: HTML to JSON

Raw HTML is noisy. Writing regex or CSS selectors for hundreds of different company website layouts is brittle and requires constant maintenance. The modern approach offloads the parsing to a scraping API that accepts a URL and a desired JSON schema, returning exactly the data requested.

AlterLab handles this via Cortex AI. You pass a target URL and a schema definition. The API handles the network routing, renders the JavaScript, parses the DOM, and returns the variables matching your schema.

Try it yourself

Extract company data as JSON with AlterLab

Pipeline Architecture

Our automated pipeline consists of four distinct stages inside n8n.

Before building the workflow, you need an active n8n instance (self-hosted or cloud) and a scraping API credential. If you do not have one, create an account to get an API key.

Prototyping the Extraction Request

Before configuring the n8n HTTP Request node, test the extraction logic locally. We want to extract three data points from a target company website: the main support email, the primary product offering, and the physical address.

Here is how to test the extraction using standard tools.

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-b2b-site.com/contact",
    "formats": ["json"],
    "cortex": {
      "schema": {
        "support_email": "string",
        "primary_product": "string",
        "headquarters_address": "string"
      }
    }
  }'

If you are building custom n8n nodes or prefer writing Python scripts for your data engineering tasks, you can achieve the exact same operation using our Python SDK.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://example-b2b-site.com/contact",
    formats=["json"],
    cortex={
        "schema": {
            "support_email": "string",
            "primary_product": "string",
            "headquarters_address": "string"
        }
    }
)

print(response.json)

Both methods return a deterministic JSON object mapping the schema keys to the extracted values. We will use this exact payload structure inside our n8n workflow.
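Before wiring the payload into n8n, it helps to check that every schema field actually came back populated. Here is a minimal validation sketch; the flat field layout mirrors the schema defined above, but the real API envelope may nest these keys differently.

```python
# Minimal sketch: validate an extracted payload before writing it to the
# database. Field names mirror the schema defined above; the actual
# response envelope from the API may nest them differently.
REQUIRED_FIELDS = ("support_email", "primary_product", "headquarters_address")

def is_complete(record: dict) -> bool:
    """Return True when every schema field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)

# Example payload mirroring the schema keys from the extraction request.
sample = {
    "support_email": "support@example-b2b-site.com",
    "primary_product": "B2B analytics platform",
    "headquarters_address": "123 Market St, San Francisco, CA",
}

print(is_complete(sample))                  # True
print(is_complete({"support_email": ""}))   # False
```

A check like this maps directly onto the IF node conditions used later in the workflow.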

Step 1: Configuring the n8n Trigger

Start by adding a Postgres node (or your preferred database node) to n8n. Set the operation to Execute Query.

Write a query that selects records lacking enrichment data. Limit the batch size to prevent overwhelming the downstream nodes.

SQL
SELECT id, domain 
FROM leads 
WHERE enrichment_status = 'pending' 
LIMIT 10;

Add a Schedule Trigger to run this query every hour. This creates a steady, predictable throughput for the enrichment pipeline.

Step 2: The Scraping Node

Add an HTTP Request node directly after the database trigger. This node runs once for each domain returned by the database query and calls the scraping API.

Configure the HTTP Request node with these settings:

  • Method: POST
  • URL: https://api.alterlab.io/v1/scrape
  • Authentication: Generic Credential Type (Header Auth)
  • Header Name: X-API-Key

In the Body Parameters section, use n8n expressions to dynamically inject the domain from the previous node.

JSON
{
  "url": "https://{{ $json.domain }}",
  "formats": ["json"],
  "cortex": {
    "schema": {
      "support_email": "string",
      "primary_product": "string",
      "headquarters_address": "string"
    }
  }
}

Managing Execution Tiers

Public B2B directories and heavily trafficked company websites often deploy strict anti-bot measures, so standard HTTP requests frequently fail with 403 Forbidden errors.

Your scraping configuration needs to account for this. By default, AlterLab automatically escalates the request through different proxy and browser tiers until it succeeds. You pay for what you use based on the tier required to access the public data. If you know a target domain requires JavaScript rendering, you can bypass the lower tiers by setting a min_tier parameter in your JSON body. This reduces total latency. Read more about handling complex targets in our anti-bot solution documentation.
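As a sketch, setting a minimum tier only adds one field to the request body from the prototyping step. The min_tier key is described above; the value "browser" here is a hypothetical tier name, so check the AlterLab documentation for the accepted values.

```python
# Sketch of the scrape payload with a minimum execution tier.
# "min_tier" comes from the article; the value "browser" is a
# hypothetical example of a tier name, not a confirmed API value.
payload = {
    "url": "https://example-b2b-site.com/contact",
    "formats": ["json"],
    "min_tier": "browser",  # hypothetical: skip tiers without JS rendering
    "cortex": {
        "schema": {
            "support_email": "string",
            "primary_product": "string",
            "headquarters_address": "string",
        },
    },
}

print(payload["min_tier"])
```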

Step 3: Parsing and Validation

Add an IF node after the HTTP Request. Network requests fail, target sites go down, and domains expire. You must handle these states gracefully.

Configure the IF node to check the HTTP status code. For the status code to appear in the node's output, enable the HTTP Request node's option to include response headers and status (the exact option name varies by n8n version).

  • Condition 1: {{ $json.statusCode }} Equal to 200.
  • Condition 2: {{ $json.data.support_email }} Is Not Empty.

If the conditions are true, route the workflow to the True branch. If false, route to an error handling branch that updates the database record status to failed to prevent infinite retry loops.
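The error branch can be a second Postgres node running an update along these lines. The column names assume the leads table from Step 1, and the id expression is an n8n template placeholder that should reference the record from the original trigger node.

```sql
-- Mark the lead as failed so the hourly trigger stops retrying it.
-- Adjust the expression so it resolves to the id from the trigger node.
UPDATE leads
SET enrichment_status = 'failed'
WHERE id = {{ $json.id }};
```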

Step 4: Storage and Database Updates

On the True branch of your IF node, add another Postgres node. Set the operation to Update.

Map the extracted JSON data to your database columns using n8n expressions:

  • email: {{ $json.data.support_email }}
  • product_focus: {{ $json.data.primary_product }}
  • location: {{ $json.data.headquarters_address }}
  • enrichment_status: completed

Ensure you use the id from the original trigger node as the update key.
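Expressed as raw SQL, the update this node performs looks roughly like the following. The column names mirror the mapping above, the expressions are n8n template placeholders, and the node reference $('Postgres') assumes the trigger node kept its default name.

```sql
-- Write the extracted fields back to the lead record. The WHERE clause
-- pulls the id from the original trigger node ($('Postgres') assumes
-- the default node name) rather than from the HTTP response.
UPDATE leads
SET email = '{{ $json.data.support_email }}',
    product_focus = '{{ $json.data.primary_product }}',
    location = '{{ $json.data.headquarters_address }}',
    enrichment_status = 'completed'
WHERE id = {{ $('Postgres').item.json.id }};
```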

Scaling the Workflow

Once the pipeline runs successfully for small batches, you will need to adjust the configuration for higher volume.

Concurrency and Rate Limits

n8n processes items sequentially by default. To manage throughput, use the Split In Batches node (called Loop Over Items in recent n8n versions) with a batch size of 5 or 10, and use the HTTP Request node's batching options to control how quickly requests are dispatched. Ensure your AlterLab API key has a sufficient concurrency limit to absorb the resulting request rate.

Asynchronous Processing

For complex sites requiring heavy JavaScript execution, the request might take longer than n8n's default HTTP timeout. Instead of keeping the HTTP connection open, switch to asynchronous webhooks.

  1. Create a Webhook node in n8n and copy its test URL.
  2. Update your AlterLab HTTP Request body to include the webhook URL: {"url": "...", "webhook_url": "YOUR_N8N_WEBHOOK"}.
  3. n8n will immediately receive a 202 Accepted response.
  4. The scraping API will process the page in the background and POST the final JSON payload to your n8n Webhook node once complete.
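The steps above only change the request body: the same payload gains a webhook_url field. A minimal sketch, where the webhook path is a placeholder copied from your own Webhook node:

```python
# Sketch: the same scrape payload, plus the n8n webhook URL for
# asynchronous delivery. The host and path below are placeholders for
# the test URL copied from your Webhook node.
async_payload = {
    "url": "https://example-b2b-site.com/contact",
    "formats": ["json"],
    "webhook_url": "https://your-n8n-host/webhook-test/lead-enrichment",
    "cortex": {
        "schema": {
            "support_email": "string",
            "primary_product": "string",
            "headquarters_address": "string",
        },
    },
}

print("webhook_url" in async_payload)  # True
```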

Takeaways

Extracting unstructured web data into structured database records does not require massive custom codebases. By connecting n8n's orchestration with a dedicated scraping API, you build a resilient pipeline that adapts to site changes automatically. Focus your engineering effort on how to use the enriched data, not on maintaining CSS selectors. Ensure you configure error routing, handle status codes, and test your schemas thoroughly before scaling the batch sizes.


Frequently Asked Questions

How do I extract structured data from a website in n8n?
You can use an HTTP Request node to send the target URL to a scraping API equipped with LLM extraction. The API parses the DOM and returns a clean JSON payload directly to your n8n workflow.

Does n8n render JavaScript when scraping?
n8n itself does not render JavaScript. You must route the HTTP request through a headless browser service or a scraping API that executes the JavaScript before returning the response to n8n.

How do I handle scraping jobs that exceed n8n's HTTP timeout?
Use the Webhook node in n8n to receive asynchronous POST requests. Instead of holding the HTTP node open, your scraping API processes the page in the background and pushes the final data to the n8n webhook URL.