Build an MCP Server for Agentic Web Scraping and Real-Time LLM Grounding

Learn how to build a Model Context Protocol (MCP) server that empowers LLM agents to extract real-time data from public websites using Python.

Yash Dubey

Large Language Models (LLMs) operate in a vacuum: they know only what was in their training data and what you place in their context window. To build autonomous agents that perform market research, track public pricing across e-commerce sites, or analyze real estate listings, you must give them real-time access to the web. Static Retrieval-Augmented Generation (RAG) is insufficient for data that changes hourly. Agents need the ability to reach out, fetch current pages, and read the contents.

The Model Context Protocol (MCP) standardizes how AI models connect to external tools. Instead of writing custom tool-calling logic for every agent framework (LangChain, LlamaIndex, AutoGen), you write an MCP server once. Any MCP-compatible client—including Claude Desktop—can then discover and execute your tools automatically.

This tutorial demonstrates how to build an MCP server that gives your AI agents the ability to read the web. We will build a Python-based server that exposes a single tool for data extraction, utilizing an external infrastructure layer to handle headless browsers and proxy rotation.

The Architecture of Agentic Scraping

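When an agent needs real-time data, it enters a standard tool-calling loop: the model emits a tool call, the MCP client forwards it to the MCP server, the server executes the tool and returns the result, and the model continues reasoning with that result in its context. The MCP architecture cleanly separates the reasoning engine from the execution environment.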

By isolating the extraction logic within an MCP server, your agent does not need to know about timeouts, HTTP headers, or network retries. It simply requests a URL and receives text.
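
To make the loop concrete, here is a rough sketch of the two JSON-RPC 2.0 messages exchanged for a single tool call, following the `tools/call` shape described in the MCP specification. The helper names `make_tool_call` and `make_tool_result` are hypothetical, the request id and Markdown text are placeholder values, and in practice an MCP client library builds these messages for you.

```python
import json

# Hypothetical helpers for illustration; real MCP clients and servers
# construct these messages internally.

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build the request a client sends once the model has chosen a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

def make_tool_result(request_id: int, text: str, is_error: bool = False) -> dict:
    """Build the server's reply: a list of content items plus an error flag."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "result": {
            "content": [{"type": "text", "text": text}],
            "isError": is_error,
        },
    }

request = make_tool_call(1, "scrape_public_url", {"url": "https://example.com"})
response = make_tool_result(1, "# Example Domain\n\nPlaceholder Markdown.")

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```

The agent only ever sees the tool name, its arguments, and the returned text; everything about how the page was fetched stays on the server side of this exchange.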

Core Concept: Preparing Data for the Context Window

Before writing the server, we must address the most common failure point in agentic scraping: token limits.

Raw HTML from modern single-page applications is bloated with inline CSS, SVG paths, and minified JavaScript. At a rough four characters per token, an 800KB HTML file is on the order of 200,000 tokens, enough to exceed many models' context windows outright and to degrade reasoning in those that accept it.

The solution is converting HTML into clean Markdown before returning it to the agent. This strips the structural noise while preserving the semantic hierarchy (headings, links, tables) that the LLM needs to understand the page structure.
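
The idea can be illustrated with the standard library alone. The sketch below is not how the extraction API used in this tutorial performs the conversion; it is a deliberately small stand-in that drops script, style, and SVG content while keeping headings, links, and list items. The names `MarkdownishExtractor` and `html_to_markdown` are hypothetical, and a production converter would also need to handle tables, nested lists, and malformed markup.

```python
from html.parser import HTMLParser

class MarkdownishExtractor(HTMLParser):
    """Hypothetical minimal converter: keeps headings, links, list items, text."""

    SKIP = {"script", "style", "svg", "noscript"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "div", "li", "tr", "br", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in self.BLOCKS:
                self.parts.append("\n")
            if tag in self.HEADINGS:
                self.parts.append(self.HEADINGS[tag])
            elif tag == "li":
                self.parts.append("- ")
            elif tag == "a":
                self.href = dict(attrs).get("href")
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            if tag == "a":
                self.parts.append(f"]({self.href or ''})")
                self.href = None
            elif tag in self.BLOCKS:
                self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            # Collapse internal whitespace but keep one space at each edge
            lead = " " if data[0].isspace() else ""
            trail = " " if data[-1].isspace() else ""
            self.parts.append(lead + " ".join(data.split()) + trail)

def html_to_markdown(html: str) -> str:
    parser = MarkdownishExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n\n".join(line for line in lines if line)

sample = """
<html><head><style>body { margin: 0 }</style>
<script>window.__STATE__ = {"big": "blob"};</script></head>
<body><h1>Pricing</h1>
<p>Plans start at <a href="/plans">$10 per month</a>.</p>
<svg><path d="M0 0L10 10"/></svg>
<ul><li>Starter</li><li>Team</li></ul></body></html>
"""
markdown = html_to_markdown(sample)
print(markdown)
print(f"{len(sample)} characters of HTML -> {len(markdown)} characters of Markdown")
```

Even on this tiny sample the script state, stylesheet, and SVG path disappear while the heading, link target, and list survive, which is the trade the agent needs.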

Data Extraction: cURL vs. Python

To implement the extraction, we use AlterLab. When your agent requests a URL, the MCP server will fire an API request to fetch the cleaned data.

Here is the exact same extraction operation demonstrated in both cURL and Python. Notice the format="markdown" parameter, which is critical for LLM consumption.

Bash
# cURL Implementation
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "format": "markdown",
    "render_js": true
  }'

Python
# Python SDK Implementation
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# format and render_js are the parameters that matter most for LLM consumption
response = client.scrape(
    url="https://example.com",
    format="markdown",
    render_js=True
)

print(response.text)

If you are building complex data pipelines, check the Python SDK documentation for advanced configuration options for specific site architectures.

Building the MCP Server

We will use the official mcp Python package provided by Anthropic. This package abstracts away the JSON-RPC messages and standard I/O handling, allowing you to define tools using standard Python decorators and type hints.

Prerequisites

Initialize a new Python project and install the required dependencies:

Bash
mkdir agent-scraper-mcp
cd agent-scraper-mcp
python -m venv venv
source venv/bin/activate
pip install mcp alterlab pydantic

The Server Code

Create a file named server.py. This script initializes the MCP server and registers the web scraping tool. The descriptive docstrings inside the tool definition are critical: MCP exposes them to the client as the tool's description, and the client passes that description to the LLM so it knows when and how to use the tool.

Python
import os

import alterlab
from mcp.server.fastmcp import FastMCP

# Initialize the MCP Server
mcp = FastMCP("WebScraper")

# Initialize the extraction client
# Ensure ALTERLAB_API_KEY is set in your environment variables
api_key = os.environ.get("ALTERLAB_API_KEY")
if not api_key:
    raise ValueError("ALTERLAB_API_KEY environment variable is missing.")

client = alterlab.Client(api_key)

# The docstring and type hints below are sent to the LLM.
# Write them as instructions to the AI agent.
@mcp.tool()
def scrape_public_url(url: str, render_js: bool = True) -> str:
    """
    Extracts readable text from a publicly accessible URL.
    Use this tool when you need to read the current contents of a webpage.
    Returns the page content formatted as Markdown.
    
    Args:
        url: The full HTTP/HTTPS URL of the target page.
        render_js: Set to False only if you know the site is static HTML.
    """
    try:
        # Delegate the fetch and the Markdown conversion to the extraction API
        response = client.scrape(
            url=url,
            format="markdown",
            render_js=render_js
        )
        
        # Guardrail against overly massive pages
        content = response.text
        if len(content) > 100000:
            return content[:100000] + "\n\n...[Content truncated for length]..."
            
        return content
        
    except Exception as e:
        return f"Error extracting data from {url}: {str(e)}"

if __name__ == "__main__":
    # Run the server using Standard I/O transport
    mcp.run(transport='stdio')
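
The guardrail in the tool cuts at a fixed character offset, which can split a Markdown table row or link in half. As an optional refinement, the sketch below backs up to the last complete line before the limit and reports how much was omitted. The function name `truncate_for_context` and the "second half" fallback rule are assumptions made for this illustration, not part of the original server.

```python
def truncate_for_context(content: str, max_chars: int = 100_000) -> str:
    """Hypothetical helper: cap content at max_chars, preferring a line break."""
    if len(content) <= max_chars:
        return content
    head = content[:max_chars]
    last_newline = head.rfind("\n")
    # Only back up to a line break if it does not discard more than half the budget
    if last_newline > max_chars // 2:
        head = head[:last_newline]
    omitted = len(content) - len(head)
    return f"{head}\n\n...[Content truncated: {omitted} characters omitted]..."

# Small demonstration with a deliberately tiny limit
text = "\n".join(f"row {i}" for i in range(1000))
clipped = truncate_for_context(text, max_chars=500)
print(clipped[-80:])
```

Telling the agent how many characters were dropped gives it a signal that the page was incomplete, so it can decide whether to request a more specific URL.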

Handling Anti-Bot and Dynamic Content

You might wonder why we don't just use Python's requests library inside the MCP tool.

When agents operate autonomously, they frequently encounter Cloudflare challenges, Datadome blocks, and pages that require extensive JavaScript rendering to populate the DOM. If a plain requests.get() call returns a 403 Forbidden or an empty HTML skeleton, the agent may hallucinate an answer from the failure message or abort the workflow.

By delegating the extraction to an infrastructure layer with anti-bot handling, the MCP server is far more likely to return the actual page content. The agent focuses on semantic reasoning, while the API handles proxy rotation, headless browser management, and fingerprinting.
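
To see why a naive fetch is risky, it helps to spell out the two failure modes mentioned above. The sketch below is a hypothetical heuristic, not something the tutorial's server uses: `classify_fetch` and its 200-character threshold are assumptions chosen for illustration. It flags blocked responses by status code and JavaScript shells by how little visible text remains once tags and scripts are removed.

```python
import re

def classify_fetch(status_code: int, html: str, min_visible_chars: int = 200) -> str:
    """Hypothetical heuristic: return 'ok', 'blocked', or 'empty_shell'."""
    if status_code in (401, 403, 429):
        return "blocked"
    # Drop script/style bodies, then every remaining tag, to estimate visible text
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    visible = re.sub(r"(?s)<[^>]+>", " ", without_code)
    visible = " ".join(visible.split())
    if len(visible) < min_visible_chars:
        return "empty_shell"
    return "ok"

spa_shell = (
    '<html><body><div id="root"></div><script>'
    + "x" * 5000
    + "</script></body></html>"
)
article = "<p>" + "word " * 100 + "</p>"

print(classify_fetch(403, "<html>Forbidden</html>"))
print(classify_fetch(200, spa_shell))
print(classify_fetch(200, article))
```

A tool that returns an explicit status like this gives the agent something to reason about, rather than leaving it to interpret a 5KB page that contains no readable content.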

Connecting the Server to an Agent

Locally run MCP servers typically communicate over standard I/O (stdio). The agent framework, acting as the MCP client, spawns the server as a subprocess and exchanges newline-delimited JSON-RPC messages with it over the standard input and output streams.
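
The mechanics of a stdio transport can be sketched with the standard library. The child process below is a stand-in, not an MCP server: it only echoes back the method name it received, and it skips the initialization handshake a real session begins with. `CHILD_SOURCE` and `round_trip` are hypothetical names. What the sketch does show accurately is the framing: one JSON-RPC message per line, written to the child's stdin and read from its stdout.

```python
import json
import subprocess
import sys

# Stand-in child: reads one JSON-RPC request per line on stdin and writes one
# response per line on stdout. A real MCP server does far more than this.
CHILD_SOURCE = r'''
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    reply = {"jsonrpc": "2.0", "id": request["id"],
             "result": {"echoedMethod": request["method"]}}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
'''

def round_trip(request: dict) -> dict:
    """Spawn the child, send one newline-delimited message, read one reply."""
    with subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        proc.stdin.write(json.dumps(request) + "\n")
        proc.stdin.flush()
        reply_line = proc.stdout.readline()
        proc.stdin.close()  # EOF on stdin lets the child's read loop finish
    return json.loads(reply_line)

reply = round_trip({"jsonrpc": "2.0", "id": 7, "method": "tools/list"})
print(reply)
```

Because stdout carries the protocol itself, a stdio server should send any logging or debug output to stderr; a stray print() in server.py would corrupt the message stream.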

Testing with Claude Desktop

The easiest way to test your new server is by plugging it into Claude Desktop. You need to modify Claude's configuration file to point to your Python script.

Configuration file locations:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Add your server to the mcpServers object:

JSON
{
  "mcpServers": {
    "web-scraper": {
      "command": "/path/to/your/agent-scraper-mcp/venv/bin/python",
      "args": [
        "/path/to/your/agent-scraper-mcp/server.py"
      ],
      "env": {
        "ALTERLAB_API_KEY": "your_api_key_here"
      }
    }
  }
}
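
If the configuration file already lists other servers, editing it by hand risks overwriting them or leaving invalid JSON behind. The sketch below shows one way to merge a single entry programmatically. `add_mcp_server` is a hypothetical helper, and the demo runs against a temporary file rather than the real Claude Desktop configuration; the paths and API key are the same placeholders used above.

```python
import json
import tempfile
from pathlib import Path

def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    """Hypothetical helper: add or replace one entry under mcpServers."""
    config = {}
    if config_path.exists() and config_path.read_text().strip():
        config = json.loads(config_path.read_text())
    # Leave every other top-level key and server entry untouched
    config.setdefault("mcpServers", {})[name] = entry
    config_path.write_text(json.dumps(config, indent=2))
    return config

# Demo against a temporary file that already contains another server
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "claude_desktop_config.json"
    path.write_text(json.dumps({"mcpServers": {"existing-server": {"command": "node"}}}))
    updated = add_mcp_server(path, "web-scraper", {
        "command": "/path/to/your/agent-scraper-mcp/venv/bin/python",
        "args": ["/path/to/your/agent-scraper-mcp/server.py"],
        "env": {"ALTERLAB_API_KEY": "your_api_key_here"},
    })
    on_disk = json.loads(path.read_text())

print(sorted(updated["mcpServers"]))
```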

Restart Claude Desktop. You should now see a tools indicator in the chat input area showing that the server's tools are available (the exact icon varies between versions). You can issue prompts like:

"Read the documentation at https://docs.python.org/3/library/asyncio.html and summarize the latest changes to the TaskGroup API."

Claude will recognize that it lacks real-time knowledge of that URL, invoke the scrape_public_url tool via MCP, wait for the Markdown response, and formulate an answer grounded in the live page content.

Production Considerations for Agentic Pipelines

When moving from local testing to production agent deployments (e.g., deploying on AWS or running background workers with LangGraph), keep these architectural principles in mind:

  1. Timeout Management: Web extraction can take anywhere from 1 to 15 seconds depending on the target's rendering complexity. Ensure your MCP client and the surrounding LLM API calls are configured with timeouts that leave a buffer above that range.
  2. Context Window Protection: The truncation logic in the server.py snippet (content[:100000]) is critical. Unbounded scraping results can trigger context-length errors (such as context_length_exceeded) from your LLM provider.
  3. Structured Data: If your agent specifically needs JSON output instead of Markdown, you can define a secondary tool in your MCP server (extract_structured_data) and use Cortex AI to map the DOM into a predefined JSON schema. Read the API docs for implementation details on schema enforcement.

Quick reference: Markdown is the optimal LLM format, stdio is the local MCP transport, and 100k characters is the suggested limit.
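
The timeout guidance above can be sketched as a small wrapper. This is an illustration under stated assumptions rather than part of the tutorial's server: `run_with_timeout` and `slow_fetch` are hypothetical names, and the 20-second default is simply the 15-second upper bound mentioned above plus a buffer. It runs a blocking call in a worker thread and returns an error string if the call takes too long.

```python
import asyncio
import time

async def run_with_timeout(blocking_fn, *args, timeout_s: float = 20.0) -> str:
    """Hypothetical wrapper: run a blocking call in a thread, give up after timeout_s."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(blocking_fn, *args), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # The worker thread is not killed; only the caller stops waiting for it
        return f"Error: extraction exceeded {timeout_s} seconds"

def slow_fetch(url: str) -> str:
    """Hypothetical stand-in for a slow extraction call."""
    time.sleep(0.3)
    return f"content of {url}"

fast = asyncio.run(run_with_timeout(slow_fetch, "https://example.com", timeout_s=2.0))
slow = asyncio.run(run_with_timeout(slow_fetch, "https://example.com", timeout_s=0.05))
print(fast)
print(slow)
```

Returning a readable error string instead of raising keeps the failure visible to the agent, which can then retry or report that the page was unavailable. Note that `asyncio.to_thread` requires Python 3.9 or later.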

Takeaways

Building an MCP server bridges the gap between static LLM reasoning and real-time internet data.

  • Use the Model Context Protocol to write tool definitions once, allowing any compliant agent framework to discover and use your extraction capabilities.
  • Never feed raw HTML into an agent. Always convert to Markdown to conserve context-window space and reduce token costs.
  • Offload browser management and proxy rotation to dedicated infrastructure so your AI agents can focus strictly on reasoning and analysis.

By implementing this architecture, you transform isolated language models into capable, internet-aware research assistants.

Frequently Asked Questions

What is the Model Context Protocol (MCP)?
MCP is an open standard that allows developers to securely connect AI models to external data sources and tools. It provides a standardized way for LLMs to access files, databases, and APIs.

Why do AI agents need web scraping?
Web scraping provides AI agents with real-time grounding, allowing them to access current information that isn't in their training data. This enables tasks like automated competitive analysis, public price tracking, and live research.

How do I keep an agent's requests from being blocked?
Using a scraping API with smart rendering and proxy rotation makes it much more likely that requests succeed without being blocked. This delegates infrastructure management so the agent only deals with the extracted data.