Build an MCP Server for Agentic Web Scraping and Real-Time LLM Grounding

Learn how to build a Model Context Protocol (MCP) server that empowers LLM agents to extract real-time data from public websites using Python.

Yash Dubey, May 8, 2026

Large Language Models (LLMs) know only what was in their training data. To build autonomous agents that perform market research, track public pricing across e-commerce sites, or analyze real estate listings, you must give them real-time access to the web. Static Retrieval-Augmented Generation (RAG) over a pre-built index falls short for data that changes hourly. Agents need to reach out, fetch current pages, and read the contents.

The Model Context Protocol (MCP) standardizes how AI models connect to external tools. Instead of writing custom tool-calling logic for every agent framework (LangChain, LlamaIndex, AutoGen), you write an MCP server once. Any MCP-compatible client—including Claude Desktop—can then discover and execute your tools automatically.

This tutorial demonstrates how to build an MCP server that gives your AI agents the ability to read the web. We will build a Python-based server that exposes a single tool for data extraction, utilizing an external infrastructure layer to handle headless browsers and proxy rotation.

The Architecture of Agentic Scraping

When an agent needs real-time data, it enters a standard tool-calling loop. The MCP architecture cleanly separates the reasoning engine from the execution environment.

By isolating the extraction logic within an MCP server, your agent does not need to know about timeouts, HTTP headers, or network retries. It simply requests a URL and receives text.

Core Concept: Preparing Data for the Context Window

Before writing the server, we must address the most common failure point in agentic scraping: token limits.

Raw HTML from modern single-page applications is bloated with inline CSS, SVG paths, and minified JavaScript. At a rough four characters per token, an 800KB HTML file is on the order of 200,000 tokens, enough to exhaust most context windows or crowd out the material the model needs to reason over.

The solution is converting HTML into clean Markdown before returning it to the agent. This strips the structural noise while preserving the semantic hierarchy (headings, links, tables) that the LLM needs to understand the page structure.
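To make the idea concrete, here is a minimal sketch of HTML-to-Markdown reduction using only Python's standard library. It is not how AlterLab performs its conversion (the article does not describe that). The names `MarkdownSketch` and `html_to_markdown` are hypothetical, and the sketch handles only headings, paragraphs, and links while discarding `script`, `style`, and `svg` subtrees.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED = {"script", "style", "svg"}  # subtrees with no readable text


class MarkdownSketch(HTMLParser):
    """Toy converter (hypothetical): headings, paragraphs, and links only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # output fragments, joined at the end
        self.skip_depth = 0  # greater than 0 while inside a skipped subtree
        self.href = None     # target of the currently open link, if any

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADINGS:
            # "h2" becomes "## ", "h3" becomes "### ", and so on
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace but keep word boundaries
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    blocks = "".join(parser.parts).split("\n\n")
    cleaned = (re.sub(r" +", " ", block).strip() for block in blocks)
    return "\n\n".join(block for block in cleaned if block)
```

Even this toy version shows the effect: the stylesheet, script, and SVG path data disappear, while the heading level and link target survive in a form the model can read.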

Try it yourself

Test raw data extraction and Markdown conversion before feeding it to your agent

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Data Extraction: cURL vs. Python

To implement the extraction, we use AlterLab. When your agent requests a URL, the MCP server will fire an API request to fetch the cleaned data.

Here is the exact same extraction operation demonstrated in both cURL and Python. Notice the format="markdown" parameter, which is critical for LLM consumption.

Bash

# cURL Implementation
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "format": "markdown",
    "render_js": true
  }'

Python

# Python SDK Implementation
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# format and render_js are the parameters that matter most for LLM consumption
response = client.scrape(
    url="https://example.com",
    format="markdown",
    render_js=True
)

print(response.text)

If you are building complex data pipelines, see the Python SDK documentation for advanced configuration options for specific site architectures.

Building the MCP Server

We will use the official mcp Python package provided by Anthropic. This package abstracts away the JSON-RPC messages and standard I/O handling, allowing you to define tools using standard Python decorators and type hints.

Prerequisites

Initialize a new Python project and install the required dependencies:

Bash

mkdir agent-scraper-mcp
cd agent-scraper-mcp
python -m venv venv
source venv/bin/activate
pip install mcp alterlab pydantic

The Server Code

Create a file named server.py. This script initializes the MCP server and registers the web scraping tool. The descriptive docstring inside the tool definition is critical: MCP passes this description to the LLM so it knows when and how to use the tool.

Python

import os

import alterlab
from mcp.server.fastmcp import FastMCP

# Initialize the MCP Server
mcp = FastMCP("WebScraper")

# Initialize the extraction client
# Ensure ALTERLAB_API_KEY is set in your environment variables
api_key = os.environ.get("ALTERLAB_API_KEY")
if not api_key:
    raise ValueError("ALTERLAB_API_KEY environment variable is missing.")

client = alterlab.Client(api_key)

# The docstring and type hints below are sent to the LLM.
# Write them as instructions to the AI agent.
@mcp.tool()
def scrape_public_url(url: str, render_js: bool = True) -> str:
    """
    Extracts readable text from a publicly accessible URL.
    Use this tool when you need to read the current contents of a webpage.
    Returns the page content formatted as Markdown.
    
    Args:
        url: The full HTTP/HTTPS URL of the target page.
        render_js: Set to False only if you know the site is static HTML.
    """
    try:
        # Fetch the page through the extraction API, returned as Markdown
        response = client.scrape(
            url=url,
            format="markdown",
            render_js=render_js
        )
        
        # Guardrail against overly massive pages
        content = response.text
        if len(content) > 100000:
            return content[:100000] + "\n\n...[Content truncated for length]..."
            
        return content
        
    except Exception as e:
        return f"Error extracting data from {url}: {str(e)}"

if __name__ == "__main__":
    # Run the server using Standard I/O transport
    mcp.run(transport='stdio')
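The hard slice at 100,000 characters can cut a table row or a link in half. One possible refinement, sketched below, backs up to the last blank line before the limit so the agent receives whole paragraphs. The helper name `truncate_markdown` and its parameters are hypothetical and not part of the server above.

```python
def truncate_markdown(
    content: str,
    limit: int = 100_000,
    notice: str = "\n\n...[Content truncated for length]...",
) -> str:
    """Trim content to at most `limit` characters, preferring a paragraph break."""
    if len(content) <= limit:
        return content

    # Find the last blank line that lies entirely within the first `limit` chars
    cut = content.rfind("\n\n", 0, limit)
    if cut <= 0:
        # No usable paragraph break: fall back to the hard cut
        cut = limit

    return content[:cut] + notice
```

Inside the tool, `return truncate_markdown(response.text)` would stand in for the length check and slice.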

Handling Anti-Bot and Dynamic Content

You might wonder why we don't just use Python's requests library inside the MCP tool.

When agents operate autonomously, they frequently encounter Cloudflare challenges, Datadome blocks, and pages that require extensive JavaScript rendering to populate the DOM. If your agent's requests.get() call returns a 403 Forbidden or an empty HTML skeleton, the agent may hallucinate an answer based on the failure message, or the workflow may simply crash.

By delegating the extraction to an infrastructure layer with robust anti-bot handling, the MCP server makes it far more likely that the agent receives the actual page content. The agent focuses purely on semantic reasoning, while the API handles proxy rotation, headless browser management, and fingerprinting.
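To illustrate the two failure modes, here is a rough sketch of the check a do-it-yourself tool would need before handing a response to the agent. The function name `classify_fetch`, the status set, and the 200-character threshold are all assumptions chosen for illustration, not established detection rules.

```python
import re

# Statuses commonly returned when a request is refused or throttled (assumed set)
BLOCK_STATUSES = {403, 429, 503}


def classify_fetch(status_code: int, html: str, min_text_chars: int = 200) -> str:
    """Return 'blocked', 'error', 'empty_skeleton', or 'ok' for a fetched page."""
    if status_code in BLOCK_STATUSES:
        return "blocked"
    if status_code >= 400:
        return "error"

    # Drop script/style bodies, then all remaining tags, to estimate visible text
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    text_only = re.sub(r"(?s)<[^>]+>", " ", without_code)
    visible = " ".join(text_only.split())

    if len(visible) < min_text_chars:
        # Typical of a JavaScript app shell that was never rendered
        return "empty_skeleton"
    return "ok"
```

A tool built on plain HTTP requests would have to return an explicit error for anything other than "ok", so the agent never mistakes a block page for real content.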

Connecting the Server to an Agent

MCP servers typically communicate over Standard I/O (stdio). This means the agent framework spawns the server as a subprocess and communicates via standard input and output streams.
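For a sense of what travels over those streams: MCP messages are JSON-RPC 2.0 objects, and the stdio transport sends each one as a single newline-delimited line. The sketch below builds and parses a `tools/call` request by hand, using the `scrape_public_url` tool name from the server above. The helpers `encode_message` and `decode_message` are hypothetical. In practice the mcp package handles this framing for you, and a real session starts with an initialization handshake that is omitted here.

```python
import json


def encode_message(message: dict) -> bytes:
    """Serialize one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one line
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_message(line: bytes) -> dict:
    """Parse one line read from the peer's stream back into a message."""
    return json.loads(line.decode("utf-8"))


# What a client might write to the server's stdin to invoke the tool
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_public_url",
        "arguments": {"url": "https://example.com", "render_js": True},
    },
}

wire = encode_message(request)
```

Because stdout carries the protocol itself, a stdio server should send any diagnostic logging to stderr rather than printing it.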

Testing with Claude Desktop

The easiest way to test your new server is by plugging it into Claude Desktop. You need to modify Claude's configuration file to point to your Python script.

Configuration file locations:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

Add your server to the mcpServers object:

JSON

{
  "mcpServers": {
    "web-scraper": {
      "command": "/path/to/your/agent-scraper-mcp/venv/bin/python",
      "args": [
        "/path/to/your/agent-scraper-mcp/server.py"
      ],
      "env": {
        "ALTERLAB_API_KEY": "your_api_key_here"
      }
    }
  }
}
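If you would rather not edit the file by hand, a small script can merge the entry while leaving any servers you have already configured untouched. This is a sketch: the function name `add_mcp_server` is hypothetical, and the paths passed to it would be the same placeholders shown above.

```python
import json
from pathlib import Path


def add_mcp_server(
    config_path: Path,
    name: str,
    command: str,
    args: list[str],
    env: dict[str, str],
) -> dict:
    """Insert or update one entry under "mcpServers", keeping everything else."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)

    # Create the mcpServers object if this is the first server being registered
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": args, "env": env}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Keep a backup of the original file before running anything like this, since a malformed config can prevent the servers from loading.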

Restart Claude Desktop. You will now see a small "plug" icon indicating available tools. You can issue prompts like:

"Read the documentation at https://docs.python.org/3/library/asyncio.html and summarize the latest changes to the TaskGroup API."

Claude should recognize that it lacks real-time knowledge of that URL, invoke the scrape_public_url tool via MCP, wait for the Markdown response, and answer from the live page content rather than from its training data.

Production Considerations for Agentic Pipelines

When moving from local testing to production agent deployments (e.g., deploying on AWS or running background workers with LangGraph), keep these architectural principles in mind:

Timeout Management: Web extraction can take anywhere from 1 to 15 seconds depending on the target's rendering complexity. Ensure your MCP client and the overlying LLM API calls have appropriate timeout buffers configured.
Context Window Protection: The truncation logic in the server.py snippet (content[:100000]) is critical. Note that the limit counts characters, not tokens, so tune it to your model's context window. Unbounded scraping returns can trigger context_length_exceeded errors from your LLM provider.
Structured Data: If your agent specifically needs JSON output instead of Markdown, you can define a secondary tool in your MCP server (extract_structured_data) and utilize Cortex AI to map the DOM into a predefined JSON schema. Read the API docs for implementation details on schema enforcement.
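One way to enforce the timeout budget from the first point is to run the blocking extraction call in a worker thread and bound the wait with asyncio.wait_for. This is a sketch under stated assumptions: `scrape_with_timeout` and `fetch` are hypothetical names, `fetch` stands in for any blocking call such as a wrapper around client.scrape, and the 20-second default is simply the 15-second upper bound plus a buffer.

```python
import asyncio
import time
from typing import Callable


async def scrape_with_timeout(
    fetch: Callable[[str], str],
    url: str,
    timeout_s: float = 20.0,
) -> str:
    """Run a blocking fetch in a thread and give up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(asyncio.to_thread(fetch, url), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Return a readable error so the agent can decide to retry or move on
        return f"Error extracting data from {url}: timed out after {timeout_s} seconds"


def slow_fetch(url: str) -> str:
    """Stand-in for a slow extraction call (hypothetical)."""
    time.sleep(0.3)
    return "# Late page"


if __name__ == "__main__":
    print(asyncio.run(scrape_with_timeout(slow_fetch, "https://example.com", timeout_s=0.05)))
```

The timeout only stops the caller from waiting. The worker thread keeps running until the underlying call returns, so the HTTP client used inside `fetch` should carry its own timeout as well.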

Markdown: Optimal LLM Format

stdio: Local MCP Transport

100k: Suggested Char Limit

Takeaways

Building an MCP server bridges the gap between static LLM reasoning and real-time internet data.

Use the Model Context Protocol to write tool definitions once, allowing any compliant agent framework to discover and use your extraction capabilities.
Never feed raw HTML into an agent. Always convert to Markdown to preserve context windows and reduce token costs.
Offload browser management and proxy rotation to dedicated infrastructure so your AI agents can focus strictly on reasoning and analysis.

By implementing this architecture, you transform isolated language models into capable, internet-aware research assistants.

Frequently Asked Questions

MCP is an open standard that allows developers to securely connect AI models to external data sources and tools. It provides a standardized way for LLMs to access files, databases, and APIs.

Web scraping provides AI agents with real-time grounding, allowing them to access current information that isn't in their training data. This enables tasks like automated competitive analysis, public price tracking, and live research.

Using a robust scraping API with smart rendering and proxy rotation ensures requests succeed without being blocked. This delegates infrastructure management so the agent only deals with the extracted data.
