
Automate Lead Enrichment in n8n with Web Scraping APIs
Build a deterministic n8n workflow to extract structured JSON data from public company websites using automated data pipelines and headless browsers.
Lead enrichment pipelines typically rely on static third-party databases that go stale. Extracting data directly from public company websites gives you records that are as current as the sites themselves. n8n provides the orchestration layer to move this data, but turning unstructured HTML into structured data requires a dedicated scraping layer.
We will build an n8n workflow that takes a raw company URL, processes it through a headless browser, extracts specific firmographic data using LLM-based parsing, and pushes the structured JSON into a database.
The Core Concept: HTML to JSON
Raw HTML is noisy. Writing regex or CSS selectors for hundreds of different company website layouts is brittle and requires constant maintenance. The modern approach offloads the parsing to a scraping API that accepts a URL and a desired JSON schema, returning exactly the data requested.
AlterLab handles this via Cortex AI. You pass a target URL and a schema definition. The API handles the network routing, renders the JavaScript, parses the DOM, and returns the variables matching your schema.
Pipeline Architecture
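Our automated pipeline consists of four distinct stages inside n8n: a scheduled database query that selects pending leads, an HTTP Request node that calls the scraping API, an IF node that validates the response, and a database update that stores the enriched fields.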
Before building the workflow, you need an active n8n instance (self-hosted or cloud) and a scraping API credential. If you do not have one, create an account to get an API key.
Prototyping the Extraction Request
Before configuring the n8n HTTP Request node, test the extraction logic locally. We want to extract three data points from a target company website: the main support email, the primary product offering, and the physical address.
Here is how to test the extraction using standard tools.
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-b2b-site.com/contact",
    "formats": ["json"],
    "cortex": {
      "schema": {
        "support_email": "string",
        "primary_product": "string",
        "headquarters_address": "string"
      }
    }
  }'
If you are building custom n8n nodes or prefer writing Python scripts for your data engineering tasks, you can achieve the exact same operation using our Python SDK.
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    url="https://example-b2b-site.com/contact",
    formats=["json"],
    cortex={
        "schema": {
            "support_email": "string",
            "primary_product": "string",
            "headquarters_address": "string"
        }
    }
)
print(response.json)
Both methods return a JSON object mapping the schema keys to the extracted values. We will use this exact payload structure inside our n8n workflow.
Step 1: Configuring the n8n Trigger
Start by adding a Postgres node (or your preferred database node) to n8n. Set the operation to Execute Query.
Write a query that selects records lacking enrichment data. Limit the batch size to prevent overwhelming the downstream nodes.
SELECT id, domain
FROM leads
WHERE enrichment_status = 'pending'
LIMIT 10;
Add a Schedule Trigger to run this query every hour. This creates a steady, predictable throughput for the enrichment pipeline.
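To see what the batch query hands to the downstream nodes, here is a minimal sketch that runs the same SELECT against an in-memory SQLite table. SQLite stands in for Postgres only so the example is self-contained; the sample rows, the `fetch_pending` helper, and the added `ORDER BY id` (which makes the batch order predictable) are assumptions for illustration, not part of the workflow above.

```python
import sqlite3

# In-memory SQLite stands in for Postgres so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leads (id INTEGER PRIMARY KEY, domain TEXT, enrichment_status TEXT)"
)
# Hypothetical sample data: odd-numbered companies are still pending.
conn.executemany(
    "INSERT INTO leads (domain, enrichment_status) VALUES (?, ?)",
    [(f"company-{i}.example", "pending" if i % 2 else "completed") for i in range(30)],
)


def fetch_pending(conn, batch_size=10):
    """Return up to batch_size (id, domain) rows still awaiting enrichment."""
    return conn.execute(
        "SELECT id, domain FROM leads "
        "WHERE enrichment_status = 'pending' "
        "ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()


batch = fetch_pending(conn)
print(len(batch), batch[0])
```

Each hourly run picks up at most one batch, so a backlog of pending leads drains at a fixed rate rather than flooding the scraping node.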
Step 2: The Scraping Node
Add an HTTP Request node directly after the database trigger. This node loops through the domains returned by the database and calls the scraping API.
Configure the HTTP Request node with these settings:
- Method: POST
- URL: https://api.alterlab.io/v1/scrape
- Authentication: Generic Credential Type (Header Auth)
- Header Name: X-API-Key
In the Body Parameters section, use n8n expressions to dynamically inject the domain from the previous node.
{
  "url": "https://{{ $json.domain }}",
  "formats": ["json"],
  "cortex": {
    "schema": {
      "support_email": "string",
      "primary_product": "string",
      "headquarters_address": "string"
    }
  }
}
Managing Execution Tiers
Public B2B directories and heavily trafficked company websites often deploy strict security measures. Standard HTTP requests will fail with 403 Forbidden errors.
Your scraping configuration needs to account for this. By default, AlterLab automatically escalates the request through different proxy and browser tiers until it succeeds. You pay for what you use based on the tier required to access the public data. If you know a target domain requires JavaScript rendering, you can bypass the lower tiers by setting a min_tier parameter in your JSON body. This reduces total latency. Read more about handling complex targets in our anti-bot solution documentation.
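The request body, including the optional min_tier override, can be sketched in Python as follows. This is an illustration under assumptions: `normalize_domain` and `build_payload` are hypothetical helpers, the domain cleanup is an extra safeguard not described above, and both the top-level placement of `min_tier` and the example value are guesses that should be checked against the API documentation.

```python
from urllib.parse import urlparse

SCHEMA = {
    "support_email": "string",
    "primary_product": "string",
    "headquarters_address": "string",
}


def normalize_domain(raw):
    """Reduce input such as ' HTTPS://Example.com/about ' to a bare hostname."""
    value = raw.strip().lower()
    # urlparse only fills netloc when a scheme or a '//' prefix is present.
    parsed = urlparse(value if "://" in value else "//" + value)
    return parsed.netloc


def build_payload(domain, min_tier=None):
    """Build the JSON body sent by the HTTP Request node."""
    payload = {
        "url": "https://" + normalize_domain(domain),
        "formats": ["json"],
        "cortex": {"schema": SCHEMA},
    }
    if min_tier is not None:
        # Assumption: min_tier sits at the top level of the body.
        payload["min_tier"] = min_tier
    return payload


print(build_payload("Example.com/about", min_tier=3))
```

Leaving `min_tier` unset keeps the default escalation behavior, so the override only needs to be applied to domains you already know require a higher tier.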
Step 3: Parsing and Validation
Add an IF node after the HTTP Request. Network requests fail, target sites go down, and domains expire. You must handle these states gracefully.
Configure the IF node to check the HTTP status code.
- Condition 1: {{ $response.statusCode }} Equal to 200.
- Condition 2: {{ $json.data.support_email }} Is Not Empty.
If the conditions are true, route the workflow to the True branch. If false, route to an error handling branch that updates the database record status to failed to prevent infinite retry loops.
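The two conditions and the resulting branch can be expressed as a small Python function. This is a sketch of the IF node's logic rather than n8n code; `classify_result` is a hypothetical name, and the response shape (a `data` object holding the schema keys) is inferred from the expressions above.

```python
def classify_result(status_code, body):
    """Return 'completed' only for a 200 response with a non-empty support_email."""
    data = body.get("data") if isinstance(body, dict) else None
    email = data.get("support_email") if isinstance(data, dict) else None
    if status_code == 200 and isinstance(email, str) and email.strip():
        return "completed"
    # Everything else is marked failed so the row is not selected again.
    return "failed"


print(classify_result(200, {"data": {"support_email": "help@example.com"}}))
print(classify_result(403, {}))
```

Treating a missing or blank email as a failure, even on a 200, keeps empty extractions from being written to the database as if they were valid enrichment.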
Step 4: Storage and Database Updates
On the True branch of your IF node, add another Postgres node. Set the operation to Update.
Map the extracted JSON data to your database columns using n8n expressions:
- email: {{ $json.data.support_email }}
- product_focus: {{ $json.data.primary_product }}
- location: {{ $json.data.headquarters_address }}
- enrichment_status: completed
Ensure you use the id from the original trigger node as the update key.
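For reference, the column mapping above corresponds to a single parameterized UPDATE keyed on the lead's id. The sketch below again uses in-memory SQLite in place of Postgres; the table definition, the sample values, and the `apply_enrichment` helper are assumptions made for illustration.

```python
import sqlite3

# Database column -> key in the extracted JSON, as mapped in the Update node.
COLUMN_MAP = {
    "email": "support_email",
    "product_focus": "primary_product",
    "location": "headquarters_address",
}


def apply_enrichment(conn, lead_id, data):
    """Write the extracted fields to one lead row and mark it completed."""
    values = [data.get(source_key) for source_key in COLUMN_MAP.values()]
    # Column names come from the fixed map above, never from scraped input.
    assignments = ", ".join(f"{column} = ?" for column in COLUMN_MAP)
    conn.execute(
        f"UPDATE leads SET {assignments}, enrichment_status = 'completed' WHERE id = ?",
        (*values, lead_id),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leads (id INTEGER PRIMARY KEY, domain TEXT, email TEXT, "
    "product_focus TEXT, location TEXT, enrichment_status TEXT)"
)
conn.execute(
    "INSERT INTO leads (id, domain, enrichment_status) VALUES (1, 'example.com', 'pending')"
)
apply_enrichment(conn, 1, {
    "support_email": "help@example.com",
    "primary_product": "Invoicing software",
    "headquarters_address": "1 Example Street",
})
row = conn.execute(
    "SELECT email, product_focus, location, enrichment_status FROM leads WHERE id = 1"
).fetchone()
print(row)
```

Passing the scraped values as bound parameters, rather than interpolating them into the SQL string, matters here because the content comes from third-party websites.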
Scaling the Workflow
Once the pipeline runs successfully for small batches, you will need to adjust the configuration for higher volume.
Concurrency and Rate Limits
n8n processes items sequentially by default. To process leads faster, use the Split In Batches node. Set the batch size to 5 or 10. The HTTP Request node will fire these requests in parallel. Ensure your AlterLab API key has a sufficient concurrency limit to handle the batch size.
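Outside n8n, the same idea of fixed-size batches with a cap on in-flight requests looks roughly like this in Python. It is a sketch of the concept, not of n8n's internals: `chunked`, `enrich_all`, and `fake_scrape` are hypothetical names, and `fake_scrape` replaces the real API call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_all(domains, scrape, batch_size=5):
    """Run `scrape` over all domains with at most batch_size requests in flight."""
    results = {}
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for batch in chunked(domains, batch_size):
            # pool.map preserves input order; iterating waits for the whole batch.
            for domain, result in zip(batch, pool.map(scrape, batch)):
                results[domain] = result
    return results


def fake_scrape(domain):
    """Stand-in for the real API call."""
    return {"data": {"support_email": f"support@{domain}"}}


domains = [f"company-{i}.example" for i in range(12)]
results = enrich_all(domains, fake_scrape)
print(len(results))
```

The batch size doubles as the concurrency cap, which is why it should not exceed the concurrency limit on your API key.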
Asynchronous Processing
For complex sites requiring heavy JavaScript execution, the request might take longer than n8n's default HTTP timeout. Instead of keeping the HTTP connection open, switch to asynchronous webhooks.
- Create a Webhook node in n8n and copy its test URL.
- Update your AlterLab HTTP Request body to include the webhook URL: {"url": "...", "webhook_url": "YOUR_N8N_WEBHOOK"}.
- n8n will immediately receive a 202 Accepted response.
- The scraping API will process the page in the background and POST the final JSON payload to your n8n Webhook node once complete.
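The two halves of the asynchronous flow described in the steps above can be sketched as plain functions: one builds the request body and checks for the 202 acknowledgement, the other interprets the callback. The callback payload shape is an assumption (it is taken to mirror the synchronous response, with a `data` object), and all three function names are hypothetical.

```python
import json


def build_async_body(target_url, webhook_url):
    """Request body that asks the API to POST the result to a webhook."""
    return {"url": target_url, "webhook_url": webhook_url}


def is_queued(status_code):
    """A 202 Accepted means the job was accepted and the result arrives later."""
    return status_code == 202


def handle_callback(raw_body):
    """Parse the webhook POST body into (status, fields) for the database update."""
    try:
        payload = json.loads(raw_body)
    except ValueError:
        return "failed", {}
    # Assumption: the callback carries the extracted fields under "data".
    data = payload.get("data") if isinstance(payload, dict) else None
    if not isinstance(data, dict) or not data.get("support_email"):
        return "failed", {}
    return "completed", data


print(build_async_body("https://example.com", "https://n8n.example/webhook/enrich"))
print(handle_callback(b'{"data": {"support_email": "help@example.com"}}'))
```

Because the callback arrives in a separate workflow execution, the validation that the IF node performed in the synchronous flow has to be repeated on the webhook side.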
Takeaways
Extracting unstructured web data into structured database records does not require massive custom codebases. By connecting n8n's orchestration with a dedicated scraping API, you build a resilient pipeline that adapts to site changes automatically. Focus your engineering effort on how to use the enriched data, not on maintaining CSS selectors. Ensure you configure error routing, handle status codes, and test your schemas thoroughly before scaling the batch sizes.