Build an MCP Server for Real-Time Web Data Extraction

Learn to build a Model Context Protocol (MCP) server using Python and AlterLab to give AI agents real-time, reliable access to live web data.

Yash Dubey

May 20, 2026

TL;DR

Build an MCP server to give AI agents real-time web access by wrapping the AlterLab API in a standardized tool schema. This setup allows agents to fetch live content, bypass anti-bot measures automatically, and process structured web data without hardcoding selectors for every new site.

AI agents are limited by their training data cutoffs and the "wall" of the public web. While Retrieval-Augmented Generation (RAG) helps with static data, agents often need live information from e-commerce sites, news portals, or technical documentation.

The Model Context Protocol (MCP) is the emerging standard for bridging this gap. By building a custom MCP server, you can expose web scraping capabilities as "tools" that an LLM can invoke dynamically. This tutorial shows how to build a production-ready MCP server using Python and AlterLab.

Understanding the MCP Architecture

MCP operates on a client-server model. The Client (such as a developer IDE or an AI agent framework) initiates the connection. The Server provides resources (data), tools (executable functions), and prompts (predefined templates).

For web data extraction, we primarily use Tools. A tool is a function that an LLM can decide to call based on its description. When the agent needs live data, it sends a JSON-RPC request to your MCP server, which then calls the AlterLab API to retrieve and clean the requested page.
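
To make that exchange concrete, here is a rough sketch of what a tool call looks like on the wire. The `tools/call` method and the `content` result shape follow the MCP specification. The tool name `scrape_website`, the request id, and the sample text are placeholders chosen for illustration.

```python
import json

# JSON-RPC 2.0 request an MCP client sends to invoke a tool.
# The tool name and URL are placeholders for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_website",
        "arguments": {"url": "https://example.com"},
    },
}

# Shape of a successful reply: the tool's return value travels back
# as a list of content items (here, a single text item).
response = {
    "jsonrpc": "2.0",
    "id": request["id"],  # a reply echoes the id of the request it answers
    "result": {
        "content": [{"type": "text", "text": "# Example Domain\n..."}],
        "isError": False,
    },
}

wire_request = json.dumps(request)
print(wire_request)
```

In practice you rarely build these messages by hand. The SDK generates them from your function signature and docstring.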

MCP protocol overhead: < 200 ms
Tool invocation success: 99.9%
Stateless execution: 100%

Prerequisites

To follow this guide, you need:

  1. Python 3.10 or higher.
  2. An AlterLab API key. You can sign up to get started.
  3. The mcp Python SDK and the AlterLab Python SDK.

Step 1: Initialize the Project

Create a new directory and install the necessary dependencies. We use the official mcp package, which provides the base classes for building servers.

Bash
mkdir alterlab-mcp-server
cd alterlab-mcp-server
python -m venv venv
source venv/bin/activate
pip install mcp alterlab

Step 2: Configure AlterLab Integration

Before building the server, verify you can connect to the scraping API. AlterLab handles proxy rotation and anti-bot challenges automatically.

Python
import os

import alterlab

# Read the key from the environment instead of hardcoding it.
# Run `export ALTERLAB_API_KEY=...` in your shell first.
client = alterlab.Client(api_key=os.environ["ALTERLAB_API_KEY"])
response = client.scrape("https://example.com")
print(f"Status: {response.status_code}")

You can also verify this via cURL to ensure your environment can reach the API:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

Step 3: Implementing the MCP Server

The server needs to define a tool that takes a URL as input and returns the page content. We will use formats=['markdown'] to ensure the agent receives clean, LLM-friendly text rather than raw HTML.

Python
from mcp.server.fastmcp import FastMCP
import alterlab
import os

# Initialize FastMCP server
mcp = FastMCP("AlterLab Web Scraper")

# Initialize AlterLab client
# The key comes from the environment; fail fast if it is missing
api_key = os.getenv("ALTERLAB_API_KEY")
if not api_key:
    raise RuntimeError("Set the ALTERLAB_API_KEY environment variable")
client = alterlab.Client(api_key=api_key)

@mcp.tool()
def scrape_website(url: str) -> str:
    """
    Scrapes a website and returns the content in Markdown format.
    Use this tool to get real-time data from any public website.
    """
    try:
        # Requesting markdown format for better LLM context
        result = client.scrape(
            url=url,
            formats=["markdown"],
            wait_for_network_idle=True
        )
        
        if result.success:
            return result.markdown
        else:
            return f"Error: {result.error_message}"
            
    except Exception as e:
        return f"An unexpected error occurred: {str(e)}"

if __name__ == "__main__":
    mcp.run(transport="stdio")

Why Markdown?

LLMs process Markdown much more efficiently than HTML. HTML contains significant noise (tags, scripts, styles) that consumes tokens and distracts the model. By using AlterLab's markdown conversion, you provide the agent with the core semantic content of the page, improving extraction accuracy.
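
The noise is easy to see on a tiny page. The sketch below uses only the Python standard library to strip tags, scripts, and styles from a made-up HTML snippet and compare sizes. It is not AlterLab's conversion, which is assumed to do far more (headings, links, tables). It only illustrates how much of a page is markup rather than content.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


# Made-up sample page for illustration
html = (
    "<html><head><style>.p{color:red}</style>"
    "<script>window.track('view')</script></head>"
    '<body><div class="card"><h1>Widget</h1>'
    '<span class="price">Price: $19.99</span></div></body></html>'
)

parser = TextOnly()
parser.feed(html)
text = "\n".join(parser.parts)

print(text)
print(f"HTML: {len(html)} chars, text only: {len(text)} chars")
```

Real pages carry far more scripts and styling than this sample, so the gap is usually much larger.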

Step 4: Connecting the Server to an Agent

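Local MCP servers typically communicate over stdio. The client launches your script as a subprocess, writes JSON-RPC requests to its standard input, and reads responses from its standard output. Because stdout carries protocol messages, send any logging or debug output to stderr instead.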
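
As a rough mental model, the stdio transport exchanges one JSON-RPC message per line. The toy loop below sketches that framing with in-memory streams. The `serve` function and `fake_scrape` tool are hypothetical names for illustration, and the loop skips the handshake and tool discovery that a real server performs. FastMCP handles all of this for you, so nothing here belongs in your server.

```python
import io
import json


def serve(stdin, stdout, tools):
    """Read one JSON-RPC message per line and answer tools/call requests."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        if msg.get("method") != "tools/call":
            continue  # a real server also handles initialize, tools/list, etc.
        params = msg["params"]
        text = tools[params["name"]](**params["arguments"])
        reply = {
            "jsonrpc": "2.0",
            "id": msg["id"],
            "result": {"content": [{"type": "text", "text": text}]},
        }
        # One message per line, flushed so the client sees it immediately
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()


# Stand-in tool so the sketch runs without network access
def fake_scrape(url):
    return f"(markdown for {url})"


call = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "tools/call",
    "params": {"name": "scrape_website", "arguments": {"url": "https://example.com"}},
}
fake_stdin = io.StringIO(json.dumps(call) + "\n")
fake_stdout = io.StringIO()
serve(fake_stdin, fake_stdout, {"scrape_website": fake_scrape})
print(fake_stdout.getvalue(), end="")
```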

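To use this with a client like Claude Desktop, save the Step 3 code as server.py and add the following to the client's configuration file (claude_desktop_config.json for Claude Desktop). If you installed the dependencies in a virtual environment, set command to that environment's Python interpreter rather than a bare python, which may resolve to a different installation.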

JSON
{
  "mcpServers": {
    "alterlab": {
      "command": "python",
      "args": ["/path/to/alterlab-mcp-server/server.py"],
      "env": {
        "ALTERLAB_API_KEY": "YOUR_ACTUAL_KEY"
      }
    }
  }
}
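
A bare `python` command may not resolve to the interpreter where you installed the dependencies. One possible way around that is to generate the entry from inside the activated virtual environment, so that `sys.executable` points at the right interpreter. The `build_config` helper below is a hypothetical convenience, not part of either SDK.

```python
import json
import sys


def build_config(server_script, api_key, python_executable=sys.executable):
    """Return an mcpServers config dict that launches the server script."""
    return {
        "mcpServers": {
            "alterlab": {
                # Absolute path to the interpreter running this script
                "command": python_executable,
                "args": [str(server_script)],
                "env": {"ALTERLAB_API_KEY": api_key},
            }
        }
    }


config = build_config("/path/to/alterlab-mcp-server/server.py", "YOUR_ACTUAL_KEY")
print(json.dumps(config, indent=2))
```

Run it with the virtual environment active and paste the printed JSON into the client's configuration file.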

Step 5: Advanced Tooling & Structured Data

While simple scraping is useful, agents often need specific data points. You can add a more advanced tool that utilizes AlterLab's "Cortex" engine for AI-powered extraction directly at the source.

Python
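import json  # place with the other imports at the top of server.py

@mcp.tool()
def extract_structured_data(url: str, schema_description: str) -> str:
    """
    Extracts specific data from a page based on a description.
    Example schema_description: 'Extract the product price, name, and availability status.'
    """
    try:
        result = client.scrape(
            url=url,
            formats=["json"],
            extract={
                "description": schema_description
            }
        )
    except Exception as e:
        # Mirror scrape_website: report failures to the agent instead of crashing
        return f"An unexpected error occurred: {e}"

    if result.success:
        # Serialize as real JSON rather than a Python repr so the agent can parse it
        return json.dumps(result.json_data, default=str)
    return f"Failed to extract data: {result.error_message}"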

This second tool lets the agent specify exactly what it wants. Instead of reading 2,000 words of Markdown to find a price, the agent receives a small JSON object, which substantially cuts the tokens spent on that call.

Deployment Flow

Follow these steps to move your MCP server from a local script to a tool accessible by your agentic workflows.

Handling Technical Challenges

Rate Limiting and Concurrency

AI agents can be aggressive. If an agent loops and tries to scrape the same URL 50 times, it will consume your balance quickly. Implement simple caching or rate limiting within your MCP server to prevent runaway agent behavior. Refer to the documentation for best practices on managing high-volume requests.
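
A minimal sketch of that idea follows: a wrapper that combines a time-based cache with a per-URL call budget. `GuardedFetcher` and its parameters are hypothetical names, and the defaults are arbitrary. In the server, `fetch` would be a function that calls the AlterLab client. Here a stand-in is used so the example runs offline.

```python
import time


class GuardedFetcher:
    """Wraps a fetch function with a TTL cache and a per-URL call budget."""

    def __init__(self, fetch, ttl_seconds=300, max_calls_per_url=5,
                 clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._max_calls = max_calls_per_url
        self._clock = clock
        self._cache = {}  # url -> (expires_at, content)
        self._calls = {}  # url -> number of upstream fetches so far

    def get(self, url):
        now = self._clock()
        cached = self._cache.get(url)
        if cached and cached[0] > now:
            return cached[1]  # fresh cache hit: no upstream request
        if self._calls.get(url, 0) >= self._max_calls:
            return f"Error: call budget for {url} exhausted"
        self._calls[url] = self._calls.get(url, 0) + 1
        content = self._fetch(url)
        self._cache[url] = (now + self._ttl, content)
        return content


# Demo with a stand-in fetch function that records upstream calls
upstream_calls = []


def fake_fetch(url):
    upstream_calls.append(url)
    return f"content of {url}"


fetcher = GuardedFetcher(fake_fetch, ttl_seconds=60)
for _ in range(50):
    fetcher.get("https://example.com")

print(len(upstream_calls))
```

With this wrapper, the 50 looping calls from the example above trigger a single upstream request. The budget counter lives in memory for the lifetime of the process, so a long-running server would want to reset it periodically.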

Bot Detection

Some sites use advanced challenges. By default, AlterLab's anti-bot handling manages most of these. If an agent reports it cannot see the content, you can modify your MCP tool to increase the min_tier parameter, which triggers more sophisticated browser emulation and CAPTCHA solving.
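
One way to apply this without always paying for the highest tier is to escalate only when a cheaper attempt comes back empty. The sketch below assumes tiers are small integers and that a blocked page yields empty content. Both are assumptions for illustration, so check the API reference for the actual `min_tier` values. The `fetch` callable is a hypothetical stand-in for a function that passes `min_tier` through to the AlterLab client.

```python
def scrape_with_escalation(fetch, url, tiers=(1, 2, 3)):
    """Try each tier in order and return the first non-empty result.

    `fetch(url, min_tier)` is a hypothetical callable that returns page
    text, or an empty string when the content was blocked.
    """
    for tier in tiers:
        content = fetch(url, tier)
        if content:
            return tier, content
    return None, ""


# Stand-in fetcher: pretends the site only yields content at tier 3
attempts = []


def fake_fetch(url, min_tier):
    attempts.append(min_tier)
    return "page content" if min_tier >= 3 else ""


tier_used, content = scrape_with_escalation(fake_fetch, "https://example.com")
print(tier_used, content)
```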

Comparison: Direct Scraper vs. MCP Server

Takeaway

Building an MCP server for web extraction transforms a "blind" LLM into an agent capable of interacting with the live web. By wrapping AlterLab's reliable infrastructure in the MCP standard, you solve two problems at once: the technical difficulty of bypassing bot detection and the architectural difficulty of giving agents tool-use capabilities.

For more details on advanced extraction parameters, check our API reference or explore our engineering blog for more agentic automation patterns.

Frequently Asked Questions

MCP is an open standard that allows AI agents to securely access data and tools from external services through a unified interface. It uses a JSON-RPC 2.0 based protocol typically implemented over stdio or HTTP.

An MCP server provides a standardized schema that LLMs can understand natively, enabling agents to discover scraping tools and execute them within a structured context. This reduces integration overhead and improves the reliability of tool-use in agentic workflows.

Yes, MCP servers are designed to run as local processes that communicate with AI clients like Claude Desktop or custom agent frameworks via standard input and output.