
Build AI Agents That Scrape the Web in Real Time

Learn how to build AI agents that scrape websites in real time using a managed API. No infrastructure to maintain, no proxies to rotate, no CAPTCHAs to solve.

Yash Dubey

April 10, 2026

8 min read

AI agents need reliable data. The web has the most data. But scraping at scale means fighting CAPTCHAs, rotating proxies, maintaining headless browsers, and handling rate limits. Most teams spend more time on infrastructure than on the agent logic itself.

There is a better approach. Offload the scraping layer to a managed API. Your agent focuses on reasoning, planning, and acting. The API handles browsers, proxies, and anti-bot bypass. Here is how to build it.

The Architecture

An AI agent that scrapes the web has three layers:

  1. Agent core — An LLM that decides what to scrape, when, and what to do with the results.
  2. Scraping layer — A managed API that fetches pages, renders JavaScript, bypasses bot detection, and returns clean data.
  3. Data pipeline — Storage, transformation, and feedback loops that let the agent learn from past scrapes.

The scraping layer is where most projects fail. Running your own browser fleet means managing Chromium instances, handling memory leaks, rotating IP pools, solving CAPTCHAs, and updating fingerprints when sites change their detection logic. That is a full-time job.

A managed API removes all of that. You send a URL. You get back structured data. The agent does the rest.

How the Scraping Layer Works

When your agent calls a scraping API, here is what happens behind the scenes:

The API auto-escalates. It starts with the fastest, cheapest method — a simple HTTP request. If the page requires JavaScript rendering, it escalates to a headless browser. If the page has bot detection, it applies anti-bot bypass techniques. You do not configure any of this. It happens automatically.
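That decision ladder can be sketched in a few lines. This is a conceptual illustration of tiered fetching, not AlterLab's actual implementation; the heuristics and simulated tiers below are stand-ins:

```python
def looks_blocked(html):
    # Heuristic: common challenge-page markers
    markers = ("captcha", "access denied", "just a moment")
    return any(m in html.lower() for m in markers)

def looks_empty(html):
    # JS-heavy pages often return a near-empty shell without rendering
    return len(html.strip()) < 200

def fetch_with_escalation(url, fetchers):
    """Try each fetcher in order of cost; stop at the first usable result.
    `fetchers` is ordered cheapest-first: [plain_http, headless, stealth]."""
    html = ""
    for fetch in fetchers:
        html = fetch(url)
        if not looks_blocked(html) and not looks_empty(html):
            return html
    return html  # best effort: return the last attempt

# Simulated tiers for illustration
plain = lambda u: "<html></html>"                    # empty JS shell
headless = lambda u: "Just a moment..." + " " * 300  # challenge page
stealth = lambda u: "<html>" + "real content " * 50 + "</html>"

print(fetch_with_escalation("https://example.com", [plain, headless, stealth]))
```

Each tier only runs when the previous one produced a blocked or empty result, which is why the common case stays fast and cheap.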

Python SDK Example

Install the SDK and make your first call:

Bash
pip install alterlab
Python
from alterlab import AlterLabClient

client = AlterLabClient(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://news.ycombinator.com",
    formats=["json"],
    wait_for="table.item"
)

print(response.json)

The wait_for parameter tells the headless browser to wait until a specific CSS selector appears in the DOM. This is critical for JavaScript-heavy sites where content loads asynchronously. Without it, you get an empty page.

The formats parameter controls output. Pass ["json"] for structured data, ["markdown"] for LLM-friendly text, or ["html"] for raw DOM access.

cURL Equivalent

The same request via cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["json"],
    "wait_for": "table.item"
  }'

Both approaches return the same response. Use the SDK in your agent code. Use cURL for quick testing or when integrating from non-Python environments.


Building the Agent Loop

The scraping call is one piece. The agent needs a loop: decide what to scrape, fetch it, process results, decide next action. Here is a minimal implementation:

Python
from alterlab import AlterLabClient
import json

client = AlterLabClient(api_key="YOUR_API_KEY")

def agent_loop(seed_urls, max_iterations=5):
    visited = set()
    results = []
    queue = list(seed_urls)  # work queue; safe to extend while looping

    while queue and len(visited) < max_iterations:
        url = queue.pop(0)
        if url in visited:
            continue

        response = client.scrape(
            url=url,
            formats=["markdown"],
            cortex_extract={
                "prompt": "Extract all article titles and their URLs"
            }
        )

        extracted = response.cortex
        results.extend(extracted)
        visited.add(url)

        # Agent decides next URLs to scrape based on results
        new_urls = [item["url"] for item in extracted if "url" in item]
        queue.extend(new_urls)

    return results

data = agent_loop(["https://news.ycombinator.com"])
print(json.dumps(data, indent=2))

The cortex_extract parameter sends the page content to an LLM with your extraction prompt. Instead of writing CSS selectors that break when sites redesign, you describe what you want in natural language. The LLM returns structured JSON.

This is where the agent becomes useful. It does not just scrape one page. It reads results, decides what to scrape next, and continues until it has enough data or hits your iteration limit.

Handling Anti-Bot Detection

Sites use Cloudflare, PerimeterX, DataDome, and custom solutions to block scrapers. Your agent will hit these. The question is whether you handle them yourself or delegate.

Handling them yourself means:

  • Maintaining a proxy pool with residential IPs
  • Rotating user agents and browser fingerprints
  • Solving CAPTCHAs via third-party services
  • Updating your approach every time a site changes its detection logic

Delegating to a managed API means none of that. The anti-bot bypass layer handles it automatically. When a site blocks a simple HTTP request, the API escalates to a headless browser with real browser fingerprints. When a CAPTCHA appears, it solves it. When rate limits trigger, it rotates IPs.

You can also set a minimum tier to skip the lightweight attempts:

Python
response = client.scrape(
    url="https://example-protected-site.com",
    formats=["json"],
    min_tier=3,
    bypass_level="high"
)

min_tier=3 tells the API to skip basic HTTP clients and go straight to headless browser mode. bypass_level="high" applies the strongest anti-bot measures. Use this for sites known to block scrapers aggressively.

Scheduling Recurring Scrapes

Agents do not always run once. Many need fresh data on a schedule — price checks every hour, job board updates every morning, competitor monitoring daily.

Instead of running a cron job that calls your agent, use the API's built-in scheduling:

Python
schedule = client.schedules.create(
    url="https://example.com/products",
    formats=["json"],
    cron="0 */4 * * *",
    webhook_url="https://your-server.com/webhook",
    cortex_extract={
        "prompt": "Extract product name, price, and availability"
    }
)

print(f"Schedule ID: {schedule.id}")

This runs every 4 hours, extracts product data via AI, and pushes results to your webhook. No cron daemon. No retry logic. No monitoring whether your scraper crashed at 3 AM.
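On the receiving side, your server parses the webhook payload and hands the extracted records to the agent. The payload shape below is an assumption for illustration; check the API's webhook documentation for the real schema:

```python
import json

def handle_webhook(raw_body):
    """Parse a (hypothetical) webhook payload and flag items worth acting on.
    Assumed shape: {"schedule_id": ..., "data": [{"name", "price", "availability"}]}"""
    payload = json.loads(raw_body)
    alerts = []
    for item in payload.get("data", []):
        # Example agent rule: react to products that are in stock
        if item.get("availability") == "in_stock":
            alerts.append(item["name"])
    return alerts

# Simulated delivery
body = json.dumps({
    "schedule_id": "sch_123",
    "data": [
        {"name": "Widget A", "price": 19.99, "availability": "in_stock"},
        {"name": "Widget B", "price": 24.99, "availability": "out_of_stock"},
    ],
})
print(handle_webhook(body))  # → ['Widget A']
```

In production this function would sit behind an HTTP endpoint at your `webhook_url` and enqueue the alerts for the agent to act on.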

Monitoring for Changes

Another common agent pattern: watch a page and react when something changes. A product goes on sale. A job posting appears. A competitor updates their pricing.

The monitoring API handles this:

Python
monitor = client.monitors.create(
    url="https://example.com/pricing",
    check_interval="1h",
    diff_detection=True,
    webhook_url="https://your-server.com/alerts",
    notify_on=["content_change", "price_change"]
)

When the page changes, the API sends a diff to your webhook. Your agent receives the change, evaluates whether it matters, and takes action. You do not poll. You do not compare snapshots manually. The API detects changes and pushes them to you.
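When the diff arrives, the agent still has to decide whether the change matters. A minimal sketch of that evaluation, assuming the diff carries old and new prices (the real payload shape may differ):

```python
def price_change_matters(old_price, new_price, threshold=0.05):
    """Return True when the relative price change exceeds the threshold."""
    if old_price <= 0:
        return True  # treat a previously free/unknown price as significant
    return abs(new_price - old_price) / old_price >= threshold

print(price_change_matters(100.0, 102.0))  # 2% change → False
print(price_change_matters(100.0, 89.0))   # 11% drop → True
```

Thresholding like this keeps the agent from reacting to every trivial page edit while still catching meaningful moves.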

Output Formats for Agent Consumption

Agents consume data differently than traditional scrapers. An LLM does not need raw HTML. It needs structured, clean input.

For most agent workflows, Markdown is the sweet spot. It strips layout noise, preserves document structure, and uses fewer tokens than HTML. JSON is best when you need specific fields. Use Cortex AI extraction to get JSON directly from unstructured pages.

Python
# Markdown for LLM context
md_response = client.scrape(url=url, formats=["markdown"])
agent_context = md_response.markdown

# JSON for structured data
json_response = client.scrape(
    url=url,
    formats=["json"],
    cortex_extract={"prompt": "Extract all prices as numbers"}
)
prices = json_response.json
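The token savings from Markdown over raw HTML are easy to see with a crude comparison. Tag-stripping here is a stand-in for real HTML-to-Markdown conversion, and character count is a rough proxy for tokens:

```python
import re

html = '<div class="nav"><ul><li><a href="/a">Home</a></li></ul></div><p>Prices start at <b>$9</b>.</p>'
# Crude HTML→text conversion for illustration only; a real converter
# (or the API's markdown format) also preserves headings, lists, and links.
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"\s+", " ", text).strip()

print(len(html), "chars of HTML ->", len(text), "chars of text")
```

Markup overhead dominates on real pages far more than in this toy snippet, which is why stripping it before the LLM sees the page cuts context costs.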

Error Handling and Retries

Scraping fails. Sites go down. Selectors break. Rate limits trigger. Your agent needs to handle failures gracefully.

Python
from alterlab import AlterLabClient, ScrapingError
import time

client = AlterLabClient(api_key="YOUR_API_KEY")

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.scrape(url=url, formats=["json"])
            return response.json
        except ScrapingError as e:
            if e.status_code == 429:
                time.sleep(2 ** attempt)
                continue
            elif e.status_code >= 500:
                time.sleep(5)
                continue
            raise
    raise Exception(f"Failed after {max_retries} retries")

The SDK raises ScrapingError with the HTTP status code. Rate limits (429) benefit from exponential backoff. Server errors (5xx) warrant a longer pause. Client errors (4xx) usually mean the URL is invalid or the site is blocking you — retrying will not help.

Cost Considerations

Every scrape costs something. The question is whether you pay in engineering hours or in API usage.

Running your own infrastructure means paying for servers, proxy services, CAPTCHA solvers, and developer time to maintain everything. A managed API charges per request. The pricing model scales with usage — you pay for what you scrape, not for idle browser instances.

For an agent that scrapes intermittently, the managed approach is almost always cheaper. You are not paying for a headless browser sitting idle between agent decisions.
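A rough break-even calculation makes the tradeoff concrete. The numbers here are illustrative assumptions, not real pricing:

```python
# Illustrative monthly costs (assumptions, not actual prices)
self_hosted_fixed = 300.0     # servers + proxy pool + CAPTCHA solver, per month
self_hosted_per_req = 0.0005  # marginal cost per request, self-hosted
managed_per_req = 0.002       # managed API, pay per request

def monthly_cost(requests):
    """Return (self_hosted, managed) monthly cost for a request volume."""
    self_hosted = self_hosted_fixed + requests * self_hosted_per_req
    managed = requests * managed_per_req
    return self_hosted, managed

for n in (10_000, 100_000, 500_000):
    sh, m = monthly_cost(n)
    print(f"{n:>7} req/mo  self-hosted ${sh:,.0f}  managed ${m:,.0f}")
```

Under these assumed numbers the break-even sits around 200,000 requests per month; below that, the fixed cost of idle infrastructure dominates and the managed API wins. (Engineering time to maintain the self-hosted stack is not even counted here.)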

Putting It All Together

Here is the complete pattern for an agent that scrapes, processes, and acts:

Python
from alterlab import AlterLabClient
import json

client = AlterLabClient(api_key="YOUR_API_KEY")

class WebScrapingAgent:
    def __init__(self, seed_urls, extraction_prompt):
        self.seed_urls = seed_urls
        self.extraction_prompt = extraction_prompt
        self.visited = set()
        self.findings = []

    def run(self, max_pages=10):
        while len(self.visited) < max_pages and self.seed_urls:
            url = self.seed_urls.pop(0)
            if url in self.visited:
                continue

            response = client.scrape(
                url=url,
                formats=["json"],
                cortex_extract={"prompt": self.extraction_prompt}
            )

            data = response.json
            self.findings.extend(data)
            self.visited.add(url)

            # Agent logic: decide next URLs based on findings
            new_urls = self.decide_next_urls(data)
            self.seed_urls.extend(new_urls)

        return self.findings

    def decide_next_urls(self, data):
        # Your agent logic here
        # Could use an LLM to analyze results and pick next targets
        return []

agent = WebScrapingAgent(
    seed_urls=["https://news.ycombinator.com"],
    extraction_prompt="Extract article titles, authors, and URLs"
)

results = agent.run(max_pages=5)
print(json.dumps(results, indent=2))

This agent starts with seed URLs, scrapes each one, extracts structured data via AI, decides what to scrape next, and continues until it has enough data. The entire scraping layer is a single API call.
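The decide_next_urls stub is where the agent's intelligence lives. One option is to pass the findings back to an LLM and ask which links to follow; a simpler, deterministic heuristic is to follow only same-domain links you have not yet visited. A sketch of that heuristic:

```python
from urllib.parse import urlparse

def decide_next_urls(data, base_domain, visited, limit=3):
    """Pick unvisited same-domain links from extracted items.
    `data` is the list of extracted records, each possibly carrying a "url"."""
    candidates = []
    for item in data:
        url = item.get("url")
        if not url or url in visited:
            continue
        if urlparse(url).netloc == base_domain:
            candidates.append(url)
    return candidates[:limit]

found = [
    {"title": "Post A", "url": "https://news.ycombinator.com/item?id=1"},
    {"title": "Post B", "url": "https://example.com/offsite"},
    {"title": "Post C", "url": "https://news.ycombinator.com/item?id=2"},
]
print(decide_next_urls(found, "news.ycombinator.com", visited=set()))
```

The `limit` cap keeps the crawl frontier from exploding; swap this heuristic for an LLM call once the simple version works.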

Key Takeaways

  • Offload browser management, proxy rotation, and CAPTCHA solving to a managed API
  • Use Markdown format for LLM context, JSON for structured data
  • Cortex AI extraction replaces fragile CSS selectors with natural language prompts
  • Scheduling and monitoring APIs eliminate the need for cron jobs and polling
  • Set min_tier to skip lightweight attempts for sites known to block scrapers
  • Handle ScrapingError with exponential backoff for rate limits
  • Pay per request instead of maintaining idle browser infrastructure

The agent should spend its time reasoning about data, not fighting with headless browsers.

Frequently Asked Questions

How do AI agents scrape websites without getting blocked?

AI agents use managed scraping APIs that handle proxy rotation, CAPTCHA solving, and browser fingerprinting automatically. The agent sends a URL and receives clean data back, while the API layer handles all anti-bot measures.

How should scraped data be structured for AI agents?

Return data as JSON with consistent schemas. Use CSS selectors or AI-powered extraction to pull specific fields like prices, titles, or availability. Structured output lets agents reason over data without parsing HTML.

Can AI agents scrape JavaScript-heavy sites?

Yes. Use a scraping API with headless browser support. The API renders JavaScript server-side and returns the final DOM. You do not need to run Puppeteer, Playwright, or Selenium yourself.