
Build AI Agents That Scrape the Web in Real Time
Learn how to build AI agents that scrape websites in real time using a managed API. No infrastructure to maintain, no proxies to rotate, no CAPTCHAs to solve.
April 10, 2026
AI agents need reliable data. The web has the most data. But scraping at scale means fighting CAPTCHAs, rotating proxies, maintaining headless browsers, and handling rate limits. Most teams spend more time on infrastructure than on the agent logic itself.
There is a better approach. Offload the scraping layer to a managed API. Your agent focuses on reasoning, planning, and acting. The API handles browsers, proxies, and anti-bot bypass. Here is how to build it.
The Architecture
An AI agent that scrapes the web has three layers:
- Agent core — An LLM that decides what to scrape, when, and what to do with the results.
- Scraping layer — A managed API that fetches pages, renders JavaScript, bypasses bot detection, and returns clean data.
- Data pipeline — Storage, transformation, and feedback loops that let the agent learn from past scrapes.
The scraping layer is where most projects fail. Running your own browser fleet means managing Chromium instances, handling memory leaks, rotating IP pools, solving CAPTCHAs, and updating fingerprints when sites change their detection logic. That is a full-time job.
A managed API removes all of that. You send a URL. You get back structured data. The agent does the rest.
How the Scraping Layer Works
When your agent calls a scraping API, here is what happens behind the scenes:
The API auto-escalates. It starts with the fastest, cheapest method — a simple HTTP request. If the page requires JavaScript rendering, it escalates to a headless browser. If the page has bot detection, it applies anti-bot bypass techniques. You do not configure any of this. It happens automatically.
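The escalation pattern itself is simple to sketch in plain Python. The tier functions below are illustrative stand-ins for this article, not the API's actual internals:

```python
def scrape_with_escalation(url, tiers):
    """Try each fetch tier in order; return the first non-None result."""
    for fetch in tiers:
        result = fetch(url)
        if result is not None:
            return result
    raise RuntimeError(f"All tiers failed for {url}")

# Illustrative stand-ins for the real fetchers
def plain_http(url):
    return None  # simulate: page needs JavaScript rendering

def headless_browser(url):
    return None  # simulate: bot detection blocked the browser

def stealth_browser(url):
    return "<html>page content</html>"  # strongest tier succeeds

html = scrape_with_escalation(
    "https://example.com",
    [plain_http, headless_browser, stealth_browser],
)
print(html)
```

Each failed tier costs a little latency, which is why the cascade starts with the cheapest method rather than defaulting to a full browser on every request.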
Python SDK Example
Install the SDK and make your first call:
pip install alterlab

from alterlab import AlterLabClient

client = AlterLabClient(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://news.ycombinator.com",
    formats=["json"],
    wait_for="table.item"
)

print(response.json)

The wait_for parameter tells the headless browser to wait until a specific CSS selector appears in the DOM. This is critical for JavaScript-heavy sites where content loads asynchronously. Without it, you get an empty page.
The formats parameter controls output. Pass ["json"] for structured data, ["markdown"] for LLM-friendly text, or ["html"] for raw DOM access.
cURL Equivalent
The same request via cURL:
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["json"],
    "wait_for": "table.item"
  }'

Both approaches return the same response. Use the SDK in your agent code. Use cURL for quick testing or when integrating from non-Python environments.
Building the Agent Loop
The scraping call is one piece. The agent needs a loop: decide what to scrape, fetch it, process results, decide next action. Here is a minimal implementation:
from alterlab import AlterLabClient
import json
client = AlterLabClient(api_key="YOUR_API_KEY")
def agent_loop(seed_urls, max_iterations=5):
    visited = set()
    results = []
    for url in seed_urls:
        if url in visited or len(visited) >= max_iterations:
            continue
        response = client.scrape(
            url=url,
            formats=["markdown"],
            cortex_extract={
                "prompt": "Extract all article titles and their URLs"
            }
        )
        extracted = response.cortex
        results.extend(extracted)
        visited.add(url)
        # Agent decides next URLs to scrape based on results
        new_urls = [item["url"] for item in extracted if "url" in item]
        seed_urls.extend(new_urls)
    return results

data = agent_loop(["https://news.ycombinator.com"])
print(json.dumps(data, indent=2))

The cortex_extract parameter sends the page content to an LLM with your extraction prompt. Instead of writing CSS selectors that break when sites redesign, you describe what you want in natural language. The LLM returns structured JSON.
This is where the agent becomes useful. It does not just scrape one page. It reads results, decides what to scrape next, and continues until it has enough data or hits your iteration limit.
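One purely illustrative "decide next" policy: stay on the seed domain and skip anything already visited. The item shape matches what the extraction prompt above asks for:

```python
from urllib.parse import urlparse

def decide_next_urls(extracted, visited, allowed_domain):
    """Pick follow-up targets: same domain, not yet visited, deduped."""
    next_urls = []
    for item in extracted:
        url = item.get("url")
        if not url or url in visited or url in next_urls:
            continue
        if urlparse(url).netloc == allowed_domain:
            next_urls.append(url)
    return next_urls

candidates = [
    {"title": "A", "url": "https://news.ycombinator.com/item?id=1"},
    {"title": "B", "url": "https://example.com/off-site"},
    {"title": "A again", "url": "https://news.ycombinator.com/item?id=1"},
]
picked = decide_next_urls(candidates, visited=set(),
                          allowed_domain="news.ycombinator.com")
print(picked)  # → ['https://news.ycombinator.com/item?id=1']
```

A production agent might replace this heuristic with an LLM call that scores candidate URLs against the current goal, but the same contract holds: extracted items in, a short list of next targets out.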
Handling Anti-Bot Detection
Sites use Cloudflare, PerimeterX, DataDome, and custom solutions to block scrapers. Your agent will hit these. The question is whether you handle them yourself or delegate.
Handling them yourself means:
- Maintaining a proxy pool with residential IPs
- Rotating user agents and browser fingerprints
- Solving CAPTCHAs via third-party services
- Updating your approach every time a site changes its detection logic
Delegating to a managed API means none of that. The anti-bot bypass layer handles it automatically. When a site blocks a simple HTTP request, the API escalates to a headless browser with real browser fingerprints. When a CAPTCHA appears, it solves it. When rate limits trigger, it rotates IPs.
You can also set a minimum tier to skip the lightweight attempts:
response = client.scrape(
    url="https://example-protected-site.com",
    formats=["json"],
    min_tier=3,
    bypass_level="high"
)

min_tier=3 tells the API to skip basic HTTP clients and go straight to headless browser mode. bypass_level="high" applies the strongest anti-bot measures. Use this for sites known to block scrapers aggressively.
Scheduling Recurring Scrapes
Agents do not always run once. Many need fresh data on a schedule — price checks every hour, job board updates every morning, competitor monitoring daily.
Instead of running a cron job that calls your agent, use the API's built-in scheduling:
schedule = client.schedules.create(
    url="https://example.com/products",
    formats=["json"],
    cron="0 */4 * * *",
    webhook_url="https://your-server.com/webhook",
    cortex_extract={
        "prompt": "Extract product name, price, and availability"
    }
)

print(f"Schedule ID: {schedule.id}")

This runs every 4 hours, extracts product data via AI, and pushes results to your webhook. No cron daemon. No retry logic. No monitoring whether your scraper crashed at 3 AM.
Monitoring for Changes
Another common agent pattern: watch a page and react when something changes. A product goes on sale. A job posting appears. A competitor updates their pricing.
The monitoring API handles this:
monitor = client.monitors.create(
    url="https://example.com/pricing",
    check_interval="1h",
    diff_detection=True,
    webhook_url="https://your-server.com/alerts",
    notify_on=["content_change", "price_change"]
)

When the page changes, the API sends a diff to your webhook. Your agent receives the change, evaluates whether it matters, and takes action. You do not poll. You do not compare snapshots manually. The API detects changes and pushes them to you.
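On the receiving end, the agent only needs to parse the webhook body and decide whether to act. The payload fields in this sketch ("url", "event", "diff") are assumptions about the shape, not the documented schema:

```python
import json

def evaluate_change(payload):
    """Agent-side decision: does this change warrant action?"""
    event = payload.get("event")
    if event == "price_change":
        return "reprice"
    if event == "content_change":
        return "re-scrape"
    return "ignore"

# Example body a webhook endpoint might receive (shape assumed)
body = '{"url": "https://example.com/pricing", "event": "price_change", "diff": "+$5.00"}'
payload = json.loads(body)
print(evaluate_change(payload))  # → reprice
```

Wire evaluate_change into whatever web framework serves your webhook URL; the decision logic stays the same regardless of transport.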
Output Formats for Agent Consumption
Agents consume data differently than traditional scrapers. An LLM does not need raw HTML. It needs structured, clean input.
For most agent workflows, Markdown is the sweet spot. It strips layout noise, preserves document structure, and uses fewer tokens than HTML. JSON is best when you need specific fields. Use Cortex AI extraction to get JSON directly from unstructured pages.
# Markdown for LLM context
md_response = client.scrape(url=url, formats=["markdown"])
agent_context = md_response.markdown

# JSON for structured data
json_response = client.scrape(
    url=url,
    formats=["json"],
    cortex_extract={"prompt": "Extract all prices as numbers"}
)
prices = json_response.json

Error Handling and Retries
Scraping fails. Sites go down. Selectors break. Rate limits trigger. Your agent needs to handle failures gracefully.
import time

from alterlab import AlterLabClient, ScrapingError

client = AlterLabClient(api_key="YOUR_API_KEY")

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.scrape(url=url, formats=["json"])
            return response.json
        except ScrapingError as e:
            if e.status_code == 429:
                time.sleep(2 ** attempt)  # exponential backoff
                continue
            elif e.status_code >= 500:
                time.sleep(5)
                continue
            raise
    raise Exception(f"Failed after {max_retries} retries")

The SDK raises ScrapingError with the HTTP status code. Rate limits (429) benefit from exponential backoff. Server errors (5xx) warrant a longer pause. Client errors (4xx) usually mean the URL is invalid or the site is blocking you — retrying will not help.
Cost Considerations
Every scrape costs something. The question is whether you pay in engineering hours or in API usage.
Running your own infrastructure means paying for servers, proxy services, CAPTCHA solvers, and developer time to maintain everything. A managed API charges per request. The pricing model scales with usage — you pay for what you scrape, not for idle browser instances.
For an agent that scrapes intermittently, the managed approach is almost always cheaper. You are not paying for a headless browser sitting idle between agent decisions.
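A rough back-of-envelope comparison makes the point. Every number here is hypothetical, chosen only to illustrate the structure of the two cost models:

```python
# Hypothetical prices, for illustration only
requests_per_month = 10_000
price_per_request = 0.002            # managed API, $ per scrape
managed_cost = requests_per_month * price_per_request

server_cost = 80                     # always-on browser fleet, $/month
proxy_cost = 75                      # residential proxy plan, $/month
diy_cost = server_cost + proxy_cost  # before any engineering time

print(f"managed: ${managed_cost:.2f}, DIY: ${diy_cost:.2f}")
```

The DIY column also omits its largest real line item: the developer time spent maintaining everything.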
Putting It All Together
Here is the complete pattern for an agent that scrapes, processes, and acts:
from alterlab import AlterLabClient
import json
client = AlterLabClient(api_key="YOUR_API_KEY")
class WebScrapingAgent:
    def __init__(self, seed_urls, extraction_prompt):
        self.seed_urls = seed_urls
        self.extraction_prompt = extraction_prompt
        self.visited = set()
        self.findings = []

    def run(self, max_pages=10):
        while len(self.visited) < max_pages and self.seed_urls:
            url = self.seed_urls.pop(0)
            if url in self.visited:
                continue
            response = client.scrape(
                url=url,
                formats=["json"],
                cortex_extract={"prompt": self.extraction_prompt}
            )
            data = response.json
            self.findings.extend(data)
            self.visited.add(url)
            # Agent logic: decide next URLs based on findings
            new_urls = self.decide_next_urls(data)
            self.seed_urls.extend(new_urls)
        return self.findings

    def decide_next_urls(self, data):
        # Your agent logic here
        # Could use an LLM to analyze results and pick next targets
        return []

agent = WebScrapingAgent(
    seed_urls=["https://news.ycombinator.com"],
    extraction_prompt="Extract article titles, authors, and URLs"
)
results = agent.run(max_pages=5)
print(json.dumps(results, indent=2))

This agent starts with seed URLs, scrapes each one, extracts structured data via AI, decides what to scrape next, and continues until it has enough data or hits your page limit. The entire scraping layer is a single API call.
Key Takeaways
- Offload browser management, proxy rotation, and CAPTCHA solving to a managed API
- Use Markdown format for LLM context, JSON for structured data
- Cortex AI extraction replaces fragile CSS selectors with natural language prompts
- Scheduling and monitoring APIs eliminate the need for cron jobs and polling
- Set min_tier to skip lightweight attempts for sites known to block scrapers
- Handle ScrapingError with exponential backoff for rate limits
- Pay per request instead of maintaining idle browser infrastructure
The agent should spend its time reasoning about data, not fighting with headless browsers.