
Build AI Agents That Scrape the Web in Real Time

Learn how to build AI agents that scrape websites in real time using a managed API. No infrastructure to maintain, no proxies to rotate, no CAPTCHAs to solve.

Yash Dubey

April 10, 2026

8 min read

AI agents need reliable data. The web has the most data. But scraping at scale means fighting CAPTCHAs, rotating proxies, maintaining headless browsers, and handling rate limits. Most teams spend more time on infrastructure than on the agent logic itself.

There is a better approach. Offload the scraping layer to a managed API. Your agent focuses on reasoning, planning, and acting. The API handles browsers, proxies, and anti-bot bypass. Here is how to build it.

The Architecture

An AI agent that scrapes the web has three layers:

  1. Agent core — An LLM that decides what to scrape, when, and what to do with the results.
  2. Scraping layer — A managed API that fetches pages, renders JavaScript, bypasses bot detection, and returns clean data.
  3. Data pipeline — Storage, transformation, and feedback loops that let the agent learn from past scrapes.

The scraping layer is where most projects fail. Running your own browser fleet means managing Chromium instances, handling memory leaks, rotating IP pools, solving CAPTCHAs, and updating fingerprints when sites change their detection logic. That is a full-time job.

A managed API removes all of that. You send a URL. You get back structured data. The agent does the rest.

How the Scraping Layer Works

When your agent calls a scraping API, here is what happens behind the scenes:

The API auto-escalates. It starts with the fastest, cheapest method — a simple HTTP request. If the page requires JavaScript rendering, it escalates to a headless browser. If the page has bot detection, it applies anti-bot bypass techniques. You do not configure any of this. It happens automatically.
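That decision ladder can be sketched in a few lines. This is a conceptual illustration of tiered fetching, not AlterLab's actual implementation; the heuristics and simulated tiers below are stand-ins:

```python
def looks_blocked(html):
    # Heuristic: common challenge-page markers
    markers = ("captcha", "access denied", "just a moment")
    return any(m in html.lower() for m in markers)

def looks_empty(html):
    # JS-heavy pages often return a near-empty shell without rendering
    return len(html.strip()) < 200

def fetch_with_escalation(url, fetchers):
    """Try each fetcher in order of cost; stop at the first usable result.
    `fetchers` is ordered cheapest-first: [plain_http, headless, stealth]."""
    html = ""
    for fetch in fetchers:
        html = fetch(url)
        if not looks_blocked(html) and not looks_empty(html):
            return html
    return html  # best effort: return the last attempt

# Simulated tiers for illustration
plain = lambda u: "<html></html>"                    # empty JS shell
headless = lambda u: "Just a moment..." + " " * 300  # challenge page
stealth = lambda u: "<html>" + "real content " * 50 + "</html>"

print(fetch_with_escalation("https://example.com", [plain, headless, stealth]))
```

Each tier only runs when the previous one produced a blocked or empty result, which is why the common case stays fast and cheap.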

Python SDK Example

Install the SDK and make your first call:

Bash
pip install alterlab
Python
from alterlab import AlterLabClient

client = AlterLabClient(api_key="YOUR_API_KEY")

response = client.scrape(
    url="https://news.ycombinator.com",
    formats=["json"],
    wait_for="table.item"
)

print(response.json)

The wait_for parameter tells the headless browser to wait until a specific CSS selector appears in the DOM. This is critical for JavaScript-heavy sites where content loads asynchronously. Without it, you get an empty page.

The formats parameter controls output. Pass ["json"] for structured data, ["markdown"] for LLM-friendly text, or ["html"] for raw DOM access.

cURL Equivalent

The same request via cURL:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["json"],
    "wait_for": "table.item"
  }'

Both approaches return the same response. Use the SDK in your agent code. Use cURL for quick testing or when integrating from non-Python environments.


Building the Agent Loop

The scraping call is one piece. The agent needs a loop: decide what to scrape, fetch it, process results, decide next action. Here is a minimal implementation:

Python
from alterlab import AlterLabClient
import json

client = AlterLabClient(api_key="YOUR_API_KEY")

def agent_loop(seed_urls, max_iterations=5):
    visited = set()
    results = []
    queue = list(seed_urls)  # work queue; safe to extend while looping

    while queue and len(visited) < max_iterations:
        url = queue.pop(0)
        if url in visited:
            continue

        response = client.scrape(
            url=url,
            formats=["markdown"],
            cortex_extract={
                "prompt": "Extract all article titles and their URLs"
            }
        )

        extracted = response.cortex
        results.extend(extracted)
        visited.add(url)

        # Agent decides next URLs to scrape based on results
        new_urls = [item["url"] for item in extracted if "url" in item]
        queue.extend(new_urls)

    return results

data = agent_loop(["https://news.ycombinator.com"])
print(json.dumps(data, indent=2))

The cortex_extract parameter sends the page content to an LLM with your extraction prompt. Instead of writing CSS selectors that break when sites redesign, you describe what you want in natural language. The LLM returns structured JSON.

This is where the agent becomes useful. It does not just scrape one page. It reads results, decides what to scrape next, and continues until it has enough data or hits your iteration limit.

Handling Anti-Bot Detection

Sites use Cloudflare, PerimeterX, DataDome, and custom solutions to block scrapers. Your agent will hit these. The question is whether you handle them yourself or delegate.

Handling them yourself means:

  • Maintaining a proxy pool with residential IPs
  • Rotating user agents and browser fingerprints
  • Solving CAPTCHAs via third-party services
  • Updating your approach every time a site changes its detection logic

Delegating to a managed API means none of that. The anti-bot bypass layer handles it automatically. When a site blocks a simple HTTP request, the API escalates to a headless browser with real browser fingerprints. When a CAPTCHA appears, it solves it. When rate limits trigger, it rotates IPs.

You can also set a minimum tier to skip the lightweight attempts:

Python
response = client.scrape(
    url="https://example-protected-site.com",
    formats=["json"],
    min_tier=3,
    bypass_level="high"
)

min_tier=3 tells the API to skip basic HTTP clients and go straight to headless browser mode. bypass_level="high" applies the strongest anti-bot measures. Use this for sites known to block scrapers aggressively.

Scheduling Recurring Scrapes

Agents do not always run once. Many need fresh data on a schedule — price checks every hour, job board updates every morning, competitor monitoring daily.

Instead of running a cron job that calls your agent, use the API's built-in scheduling:

Python
schedule = client.schedules.create(
    url="https://example.com/products",
    formats=["json"],
    cron="0 */4 * * *",
    webhook_url="https://your-server.com/webhook",
    cortex_extract={
        "prompt": "Extract product name, price, and availability"
    }
)

print(f"Schedule ID: {schedule.id}")

This runs every 4 hours, extracts product data via AI, and pushes results to your webhook. No cron daemon. No retry logic. No monitoring whether your scraper crashed at 3 AM.
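On the receiving side, your server parses the webhook payload and hands the extracted records to the agent. The payload shape below is an assumption for illustration; check the API's webhook documentation for the real schema:

```python
import json

def handle_webhook(raw_body):
    """Parse a (hypothetical) webhook payload and flag items worth acting on.
    Assumed shape: {"schedule_id": ..., "data": [{"name", "price", "availability"}]}"""
    payload = json.loads(raw_body)
    alerts = []
    for item in payload.get("data", []):
        # Example agent rule: react to products that are in stock
        if item.get("availability") == "in_stock":
            alerts.append(item["name"])
    return alerts

# Simulated delivery
body = json.dumps({
    "schedule_id": "sch_123",
    "data": [
        {"name": "Widget A", "price": 19.99, "availability": "in_stock"},
        {"name": "Widget B", "price": 24.99, "availability": "out_of_stock"},
    ],
})
print(handle_webhook(body))  # → ['Widget A']
```

In production this function would sit behind an HTTP endpoint at your `webhook_url` and enqueue the alerts for the agent to act on.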

Monitoring for Changes

Another common agent pattern: watch a page and react when something changes. A product goes on sale. A job posting appears. A competitor updates their pricing.

The monitoring API handles this:

Python
monitor = client.monitors.create(
    url="https://example.com/pricing",
    check_interval="1h",
    diff_detection=True,
    webhook_url="https://your-server.com/alerts",
    notify_on=["content_change", "price_change"]
)

When the page changes, the API sends a diff to your webhook. Your agent receives the change, evaluates whether it matters, and takes action. You do not poll. You do not compare snapshots manually. The API detects changes and pushes them to you.
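When the diff arrives, the agent still has to decide whether the change matters. A minimal sketch of that evaluation, assuming the diff carries old and new prices (the real payload shape may differ):

```python
def price_change_matters(old_price, new_price, threshold=0.05):
    """Return True when the relative price change exceeds the threshold."""
    if old_price <= 0:
        return True  # treat a previously free/unknown price as significant
    return abs(new_price - old_price) / old_price >= threshold

print(price_change_matters(100.0, 102.0))  # 2% change → False
print(price_change_matters(100.0, 89.0))   # 11% drop → True
```

Thresholding like this keeps the agent from reacting to every trivial page edit while still catching meaningful moves.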

Output Formats for Agent Consumption

Agents consume data differently than traditional scrapers. An LLM does not need raw HTML. It needs structured, clean input.

For most agent workflows, Markdown is the sweet spot. It strips layout noise, preserves document structure, and uses fewer tokens than HTML. JSON is best when you need specific fields. Use Cortex AI extraction to get JSON directly from unstructured pages.

Python
# Markdown for LLM context
md_response = client.scrape(url=url, formats=["markdown"])
agent_context = md_response.markdown

# JSON for structured data
json_response = client.scrape(
    url=url,
    formats=["json"],
    cortex_extract={"prompt": "Extract all prices as numbers"}
)
prices = json_response.json
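The token savings from Markdown over raw HTML are easy to see with a crude comparison. Tag-stripping here is a stand-in for real HTML-to-Markdown conversion, and character count is a rough proxy for tokens:

```python
import re

html = '<div class="nav"><ul><li><a href="/a">Home</a></li></ul></div><p>Prices start at <b>$9</b>.</p>'
# Crude HTML→text conversion for illustration only; a real converter
# (or the API's markdown format) also preserves headings, lists, and links.
text = re.sub(r"<[^>]+>", " ", html)
text = re.sub(r"\s+", " ", text).strip()

print(len(html), "chars of HTML ->", len(text), "chars of text")
```

Markup overhead dominates on real pages far more than in this toy snippet, which is why stripping it before the LLM sees the page cuts context costs.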

Error Handling and Retries

Scraping fails. Sites go down. Selectors break. Rate limits trigger. Your agent needs to handle failures gracefully.

Python
from alterlab import AlterLabClient, ScrapingError
import time

client = AlterLabClient(api_key="YOUR_API_KEY")

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.scrape(url=url, formats=["json"])
            return response.json
        except ScrapingError as e:
            if e.status_code == 429:
                time.sleep(2 ** attempt)
                continue
            elif e.status_code >= 500:
                time.sleep(5)
                continue
            raise
    raise Exception(f"Failed after {max_retries} retries")

The SDK raises ScrapingError with the HTTP status code. Rate limits (429) benefit from exponential backoff. Server errors (5xx) warrant a longer pause. Client errors (4xx) usually mean the URL is invalid or the site is blocking you — retrying will not help.

Cost Considerations

Every scrape costs something. The question is whether you pay in engineering hours or in API usage.

Running your own infrastructure means paying for servers, proxy services, CAPTCHA solvers, and developer time to maintain everything. A managed API charges per request. The pricing model scales with usage — you pay for what you scrape, not for idle browser instances.

For an agent that scrapes intermittently, the managed approach is almost always cheaper. You are not paying for a headless browser sitting idle between agent decisions.
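A rough break-even calculation makes the tradeoff concrete. The numbers here are illustrative assumptions, not real pricing:

```python
# Illustrative monthly costs (assumptions, not actual prices)
self_hosted_fixed = 300.0     # servers + proxy pool + CAPTCHA solver, per month
self_hosted_per_req = 0.0005  # marginal cost per request, self-hosted
managed_per_req = 0.002       # managed API, pay per request

def monthly_cost(requests):
    """Return (self_hosted, managed) monthly cost for a request volume."""
    self_hosted = self_hosted_fixed + requests * self_hosted_per_req
    managed = requests * managed_per_req
    return self_hosted, managed

for n in (10_000, 100_000, 500_000):
    sh, m = monthly_cost(n)
    print(f"{n:>7} req/mo  self-hosted ${sh:,.0f}  managed ${m:,.0f}")
```

Under these assumed numbers the break-even sits around 200,000 requests per month; below that, the fixed cost of idle infrastructure dominates and the managed API wins. (Engineering time to maintain the self-hosted stack is not even counted here.)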

Putting It All Together

Here is the complete pattern for an agent that scrapes, processes, and acts:

Python
from alterlab import AlterLabClient
import json

client = AlterLabClient(api_key="YOUR_API_KEY")

class WebScrapingAgent:
    def __init__(self, seed_urls, extraction_prompt):
        self.seed_urls = seed_urls
        self.extraction_prompt = extraction_prompt
        self.visited = set()
        self.findings = []

    def run(self, max_pages=10):
        while len(self.visited) < max_pages and self.seed_urls:
            url = self.seed_urls.pop(0)
            if url in self.visited:
                continue

            response = client.scrape(
                url=url,
                formats=["json"],
                cortex_extract={"prompt": self.extraction_prompt}
            )

            data = response.json
            self.findings.extend(data)
            self.visited.add(url)

            # Agent logic: decide next URLs based on findings
            new_urls = self.decide_next_urls(data)
            self.seed_urls.extend(new_urls)

        return self.findings

    def decide_next_urls(self, data):
        # Your agent logic here
        # Could use an LLM to analyze results and pick next targets
        return []

agent = WebScrapingAgent(
    seed_urls=["https://news.ycombinator.com"],
    extraction_prompt="Extract article titles, authors, and URLs"
)

results = agent.run(max_pages=5)
print(json.dumps(results, indent=2))

This agent starts with seed URLs, scrapes each one, extracts structured data via AI, decides what to scrape next, and continues until it has enough data. The entire scraping layer is a single API call.
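The decide_next_urls stub is where the agent's intelligence lives. One option is to pass the findings back to an LLM and ask which links to follow; a simpler, deterministic heuristic is to follow only same-domain links you have not yet visited. A sketch of that heuristic:

```python
from urllib.parse import urlparse

def decide_next_urls(data, base_domain, visited, limit=3):
    """Pick unvisited same-domain links from extracted items.
    `data` is the list of extracted records, each possibly carrying a "url"."""
    candidates = []
    for item in data:
        url = item.get("url")
        if not url or url in visited:
            continue
        if urlparse(url).netloc == base_domain:
            candidates.append(url)
    return candidates[:limit]

found = [
    {"title": "Post A", "url": "https://news.ycombinator.com/item?id=1"},
    {"title": "Post B", "url": "https://example.com/offsite"},
    {"title": "Post C", "url": "https://news.ycombinator.com/item?id=2"},
]
print(decide_next_urls(found, "news.ycombinator.com", visited=set()))
```

The `limit` cap keeps the crawl frontier from exploding; swap this heuristic for an LLM call once the simple version works.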

Key Takeaways

  • Offload browser management, proxy rotation, and CAPTCHA solving to a managed API
  • Use Markdown format for LLM context, JSON for structured data
  • Cortex AI extraction replaces fragile CSS selectors with natural language prompts
  • Scheduling and monitoring APIs eliminate the need for cron jobs and polling
  • Set min_tier to skip lightweight attempts for sites known to block scrapers
  • Handle ScrapingError with exponential backoff for rate limits
  • Pay per request instead of maintaining idle browser infrastructure

The agent should spend its time reasoning about data, not fighting with headless browsers.

Frequently Asked Questions

How do AI agents scrape websites without getting blocked?

AI agents use managed scraping APIs that handle proxy rotation, CAPTCHA solving, and browser fingerprinting automatically. The agent sends a URL and receives clean data back, while the API layer handles all anti-bot measures.

How should scraped data be structured for AI agents?

Return data as JSON with consistent schemas. Use CSS selectors or AI-powered extraction to pull specific fields like prices, titles, or availability. Structured output lets agents reason over data without parsing HTML.

Can AI agents scrape JavaScript-heavy sites?

Yes. Use a scraping API with headless browser support. The API renders JavaScript server-side and returns the final DOM. You do not need to run Puppeteer, Playwright, or Selenium yourself.