
Automate Lead Enrichment in n8n with Web Scraping APIs
Build a deterministic n8n workflow to extract structured JSON data from public company websites using automated data pipelines and headless browsers.
April 24, 2026
Lead enrichment pipelines typically rely on static, outdated third-party databases. By extracting data from public company websites directly, you guarantee data freshness and relevance. n8n provides the orchestration layer to move this data, but extracting structured data from unstructured HTML requires a dedicated scraping layer.
We will build an n8n workflow that takes a raw company URL, processes it through a headless browser, extracts specific firmographic data using LLM-based parsing, and pushes the structured JSON into a database.
The Core Concept: HTML to JSON
Raw HTML is noisy. Writing regex or CSS selectors for hundreds of different company website layouts is brittle and requires constant maintenance. The modern approach offloads the parsing to a scraping API that accepts a URL and a desired JSON schema, returning exactly the data requested.
AlterLab handles this via Cortex AI. You pass a target URL and a schema definition. The API handles the network routing, renders the JavaScript, parses the DOM, and returns the variables matching your schema.
Pipeline Architecture
Our automated pipeline consists of four distinct stages inside n8n: a scheduled database trigger that selects pending leads, an HTTP Request node that calls the scraping API, an IF node that validates the response, and a final database node that writes the enriched record.
Before building the workflow, you need an active n8n instance (self-hosted or cloud) and a scraping API credential. If you do not have one, create an account to get an API key.
Prototyping the Extraction Request
Before configuring the n8n HTTP Request node, test the extraction logic locally. We want to extract three data points from a target company website: the main support email, the primary product offering, and the physical address.
Here is how to test the extraction using standard tools.
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-b2b-site.com/contact",
    "formats": ["json"],
    "cortex": {
      "schema": {
        "support_email": "string",
        "primary_product": "string",
        "headquarters_address": "string"
      }
    }
  }'

If you are building custom n8n nodes or prefer writing Python scripts for your data engineering tasks, you can achieve the exact same operation using our Python SDK.
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://example-b2b-site.com/contact",
    formats=["json"],
    cortex={
        "schema": {
            "support_email": "string",
            "primary_product": "string",
            "headquarters_address": "string"
        }
    }
)

print(response.json)

Both methods return a deterministic JSON object mapping the schema keys to the extracted values. We will use this exact payload structure inside our n8n workflow.
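For reference, a successful response for the schema above has this general shape (the values here are purely illustrative, and the top-level data wrapper matches the expressions used later in the workflow):

{
  "data": {
    "support_email": "support@example-b2b-site.com",
    "primary_product": "B2B contact data platform",
    "headquarters_address": "100 Example Street, Austin, TX 78701"
  }
}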
Step 1: Configuring the n8n Trigger
Start by adding a Postgres node (or your preferred database node) to n8n. Set the operation to Execute Query.
Write a query that selects records lacking enrichment data. Limit the batch size to prevent overwhelming the downstream nodes.
SELECT id, domain
FROM leads
WHERE enrichment_status = 'pending'
LIMIT 10;

Add a Schedule Trigger to run this query every hour. This creates a steady, predictable throughput for the enrichment pipeline.
Step 2: The Scraping Node
Add an HTTP Request node directly after the database trigger. This node loops through the domains returned by the database and calls the scraping API.
Configure the HTTP Request node with these settings:
- Method: POST
- URL: https://api.alterlab.io/v1/scrape
- Authentication: Generic Credential Type (Header Auth)
- Header Name: X-API-Key
In the Body Parameters section, use n8n expressions to dynamically inject the domain from the previous node.
{
  "url": "https://{{ $json.domain }}",
  "formats": ["json"],
  "cortex": {
    "schema": {
      "support_email": "string",
      "primary_product": "string",
      "headquarters_address": "string"
    }
  }
}

Managing Execution Tiers
Public B2B directories and heavily trafficked company websites often deploy strict security measures. Standard HTTP requests will fail with 403 Forbidden errors.
Your scraping configuration needs to account for this. By default, AlterLab automatically escalates the request through different proxy and browser tiers until it succeeds. You pay for what you use based on the tier required to access the public data. If you know a target domain requires JavaScript rendering, you can bypass the lower tiers by setting a min_tier parameter in your JSON body. This reduces total latency. Read more about handling complex targets in our anti-bot solution documentation.
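As a sketch of the min_tier approach, the parameter is added to the same request body used above. The value "browser" below is a placeholder; consult the API documentation for the tier names your account accepts.

{
  "url": "https://{{ $json.domain }}",
  "formats": ["json"],
  "min_tier": "browser",
  "cortex": {
    "schema": {
      "support_email": "string",
      "primary_product": "string",
      "headquarters_address": "string"
    }
  }
}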
Step 3: Parsing and Validation
Add an IF node after the HTTP Request. Network requests fail, target sites go down, and domains expire. You must handle these states gracefully.
Configure the IF node to check the HTTP status code.
- Condition 1: {{ $response.statusCode }} Equal to 200
- Condition 2: {{ $json.data.support_email }} Is Not Empty
If the conditions are true, route the workflow to the True branch. If false, route to an error handling branch that updates the database record status to failed to prevent infinite retry loops.
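On that error branch, the status update can be a single parameterized statement. The table and column names below follow the schema from Step 1; the $1 placeholder stands for the id carried through from the trigger node, and the exact binding syntax depends on your database node configuration.

UPDATE leads
SET enrichment_status = 'failed'
WHERE id = $1;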
Step 4: Storage and Database Updates
On the True branch of your IF node, add another Postgres node. Set the operation to Update.
Map the extracted JSON data to your database columns using n8n expressions:
- email: {{ $json.data.support_email }}
- product_focus: {{ $json.data.primary_product }}
- location: {{ $json.data.headquarters_address }}
- enrichment_status: completed
Ensure you use the id from the original trigger node as the update key.
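If you prefer the Execute Query operation over the Update operation, the equivalent statement looks like this. The $1 through $4 placeholders are illustrative; bind them to the mapped values and the original lead id.

UPDATE leads
SET email = $1,
    product_focus = $2,
    location = $3,
    enrichment_status = 'completed'
WHERE id = $4;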
Scaling the Workflow
Once the pipeline runs successfully for small batches, you will need to adjust the configuration for higher volume.
Concurrency and Rate Limits
n8n processes items sequentially by default. To process leads faster, use the Split In Batches node with a batch size of 5 or 10, and enable the Batching option on the HTTP Request node so requests within each batch are dispatched together. Ensure your AlterLab API key has a sufficient concurrency limit to handle the batch size.
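The same bounded-concurrency idea can be sketched outside n8n in plain Python. Here fetch_lead is a stub standing in for the real API call (a production version would POST the domain and schema to the scrape endpoint), and the pool size plays the role of your account's concurrency limit:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_lead(domain):
    # Stub for the real scraping API call; replace with an HTTP POST
    # to the /v1/scrape endpoint in a production pipeline.
    return {"domain": domain, "enrichment_status": "completed"}

def enrich_batch(domains, concurrency=5):
    # Bound the number of in-flight requests so a batch never exceeds
    # the concurrency limit attached to the API key.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fetch_lead, domains))

results = enrich_batch(["a.com", "b.com", "c.com"])
print(len(results))  # 3
```

Keeping the worker count at or below your plan's concurrency limit avoids 429 responses when batch sizes grow.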
Asynchronous Processing
For complex sites requiring heavy JavaScript execution, the request might take longer than n8n's default HTTP timeout. Instead of keeping the HTTP connection open, switch to asynchronous webhooks.
- Create a Webhook node in n8n and copy its test URL.
- Update your AlterLab HTTP Request body to include the webhook URL: {"url": "...", "webhook_url": "YOUR_N8N_WEBHOOK"}.
- n8n will immediately receive a 202 Accepted response.
- The scraping API will process the page in the background and POST the final JSON payload to your n8n Webhook node once complete.
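Whether the payload arrives synchronously or via webhook, the mapping into database columns is a small, testable transformation. The sketch below assumes the webhook body mirrors the synchronous response shape, with extracted fields under a top-level data key:

```python
def map_payload_to_columns(payload):
    # Translate an incoming payload into the column names used in Step 4.
    # Missing keys fall back to None so a partial extraction still maps cleanly.
    data = payload.get("data", {})
    return {
        "email": data.get("support_email"),
        "product_focus": data.get("primary_product"),
        "location": data.get("headquarters_address"),
        "enrichment_status": "completed",
    }

row = map_payload_to_columns(
    {"data": {"support_email": "hi@acme.test",
              "primary_product": "CRM",
              "headquarters_address": "Berlin"}}
)
print(row["email"])  # hi@acme.test
```

Keeping this mapping in one place (a Code node or a small function) means a schema change touches a single spot rather than every database expression.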
Takeaways
Extracting unstructured web data into structured database records does not require massive custom codebases. By connecting n8n's orchestration with a dedicated scraping API, you build a resilient pipeline that adapts to site changes automatically. Focus your engineering effort on how to use the enriched data, not on maintaining CSS selectors. Ensure you configure error routing, handle status codes, and test your schemas thoroughly before scaling the batch sizes.