AI Agent & MCP Integration
Connect AI agents to the web using AlterLab's MCP server. Give Claude, Cursor, or any MCP-compatible tool the ability to scrape, extract, and screenshot any website.
What is MCP?
The Model Context Protocol (MCP) is an open standard that lets AI assistants such as Claude call external tools and data sources. AlterLab's MCP server exposes its scraping API as a set of these tools.
Overview
Traditional scraping requires writing code for every target. MCP integration flips this: your AI agent decides what to scrape, how to extract data, and what to do with results — all through natural language.
9 Tools
Scrape, extract, screenshot, estimate costs, check balance, and manage authenticated sessions.
Zero Code
Ask your AI agent in plain English. It picks the right tool, parameters, and output format.
Full Anti-Bot
Every tool call goes through AlterLab's tier escalation. Protected sites are handled automatically.
Step 1: Install the MCP Server
The AlterLab MCP server is published on npm. Install it globally so MCP clients can find the binary:
npm install -g alterlab-mcp-server

Requirements
Node.js and npm must be installed; MCP clients launch the server through npx.
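MCP clients resolve the server binary from your PATH, so you can verify the install before editing any config. A minimal check using only the Python standard library (the binary name matches the npm package above):

```python
import shutil

def mcp_server_installed(binary: str = "alterlab-mcp-server") -> bool:
    """Return True if the given binary is resolvable on PATH."""
    return shutil.which(binary) is not None
```

If this returns False after a global install, check that npm's global bin directory is on your PATH.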
Step 2: Configure Your MCP Client
Add AlterLab to your MCP client's configuration file. Below are examples for popular clients.
Claude Desktop
Edit your Claude Desktop config file:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"alterlab": {
"command": "npx",
"args": ["-y", "alterlab-mcp-server"],
"env": {
"ALTERLAB_API_KEY": "sk_live_your_api_key_here"
}
}
}
}

Keep Your API Key Secret
Anyone with this key can spend your account's credits. Never commit the config file to a shared repository, and rotate the key if it leaks.
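If you script machine setup, a small helper can merge the AlterLab entry into an existing Claude Desktop config instead of hand-editing JSON. This is a convenience sketch, not part of the AlterLab tooling; the entry it writes matches the config shown above:

```python
import json
from pathlib import Path

def add_alterlab_server(config: dict, api_key: str) -> dict:
    """Merge the AlterLab MCP server entry into a config dict, keeping other servers."""
    servers = config.setdefault("mcpServers", {})
    servers["alterlab"] = {
        "command": "npx",
        "args": ["-y", "alterlab-mcp-server"],
        "env": {"ALTERLAB_API_KEY": api_key},
    }
    return config

def update_claude_config(path: Path, api_key: str) -> None:
    """Read the config file (or start fresh), add AlterLab, and write it back."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_alterlab_server(config, api_key), indent=2))
```

Pass the key in from an environment variable rather than hard-coding it in your setup script.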
Step 3: Available Tools
Once configured, your AI agent has access to 9 tools. Here's what each one does:
Core Tools
alterlab_scrape
Scrape a URL and return content as markdown, text, HTML, or JSON. Automatically handles anti-bot protection with tier escalation.
- url (required) — URL to scrape
- formats (default: ["markdown"]) — Output formats: text, json, html, markdown
- render_js (default: false) — Render JavaScript with headless browser (+3 credits)
- use_proxy (default: false) — Route through premium proxy (+1 credit)
- session_id (optional) — UUID of a stored session for authenticated scraping

alterlab_extract
Extract structured data using pre-built profiles, custom JSON schemas, or natural language prompts.
- url (required) — URL to extract from
- extraction_profile (default: "auto") — auto, product, article, job_posting, faq, recipe, event
- extraction_schema (optional) — Custom JSON Schema for precise field extraction
- extraction_prompt (optional) — Natural language instructions

alterlab_screenshot
Take a full-page screenshot of any URL. Returns a PNG image directly in the conversation.
- url (required) — URL to screenshot
- wait_for (optional) — CSS selector to wait for before capturing

Utility Tools
alterlab_estimate_cost
Estimate the credit cost of scraping a URL without actually scraping it. Returns predicted tier, cost, and confidence level.
alterlab_check_balance
Check your account balance, total deposited, and total spent. No parameters needed.
Session Management
Sessions let you scrape authenticated pages by storing cookies across requests.
alterlab_create_session
Create a new session with cookies for authenticated scraping.
alterlab_list_sessions
List all stored sessions and their domains.
alterlab_validate_session
Check if a session's cookies are still valid.
alterlab_delete_session
Delete a stored session when no longer needed.
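The four session tools map onto a simple create/list/validate/delete lifecycle. The in-memory sketch below mirrors that lifecycle to show the data each tool works with; the field names and TTL are illustrative, not AlterLab's actual storage schema:

```python
import uuid
from datetime import datetime, timedelta, timezone

class SessionStore:
    """Illustrative in-memory model of the MCP session tools."""

    def __init__(self):
        self._sessions: dict[str, dict] = {}

    def create(self, domain: str, cookies: dict, ttl_hours: int = 24) -> str:
        """alterlab_create_session: store cookies, return a session UUID."""
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = {
            "domain": domain,
            "cookies": cookies,
            "expires_at": datetime.now(timezone.utc) + timedelta(hours=ttl_hours),
        }
        return session_id

    def list(self) -> list[tuple[str, str]]:
        """alterlab_list_sessions: return (session_id, domain) pairs."""
        return [(sid, s["domain"]) for sid, s in self._sessions.items()]

    def validate(self, session_id: str) -> bool:
        """alterlab_validate_session: session exists and has not expired."""
        s = self._sessions.get(session_id)
        return s is not None and s["expires_at"] > datetime.now(timezone.utc)

    def delete(self, session_id: str) -> None:
        """alterlab_delete_session: remove a stored session."""
        self._sessions.pop(session_id, None)
```

The session_id returned by create is what you pass to alterlab_scrape for authenticated pages.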
Step 4: Basic Usage Patterns
With the MCP server configured, you can ask your AI agent to scrape in natural language. The agent translates your request into the right tool call.
Simple Scraping
You say:
"Scrape https://news.ycombinator.com and give me the top 10 stories."
The agent calls alterlab_scrape with the URL, gets the markdown content, and parses out the story titles and links.
Structured Extraction
You say:
"Extract the product name, price, and rating from https://example.com/product/123"
The agent calls alterlab_extract with extraction_profile: "product" and returns structured JSON with the requested fields.
Cost-Aware Scraping
You say:
"How much would it cost to scrape these 50 URLs? Check a few first."
The agent calls alterlab_estimate_cost on a sample, then alterlab_check_balance to verify you have enough credits before proceeding.
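The sampling step the agent performs can be sketched in code: estimate a few URLs, extrapolate to the full batch, and compare against the balance before committing. The per-URL estimates here are plain numbers; in practice they would come from alterlab_estimate_cost:

```python
import math

def projected_batch_cost(sample_estimates: list[int], total_urls: int) -> int:
    """Extrapolate total credit cost from a sampled subset of URLs."""
    if not sample_estimates:
        raise ValueError("need at least one sampled estimate")
    avg = sum(sample_estimates) / len(sample_estimates)
    # Round up: better to over-budget slightly than stall mid-batch.
    return math.ceil(avg * total_urls)

def can_afford(sample_estimates: list[int], total_urls: int, balance: int) -> bool:
    """True if the current balance covers the projected batch cost."""
    return projected_batch_cost(sample_estimates, total_urls) <= balance
```

For example, samples of 1, 4, and 1 credits project to 100 credits across 50 URLs.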
Example: Autonomous Research Agent
This example shows how to build an AI agent that autonomously researches a topic by scraping multiple sources, extracting key data, and producing a summary report.
System Prompt
Give your AI agent this system prompt to enable autonomous research:
You are a research agent with web scraping capabilities.
When asked to research a topic:
1. Plan: Identify 3-5 authoritative sources to investigate
2. Scrape: Use alterlab_scrape to fetch each source as markdown
3. Extract: If pages have structured data, use alterlab_extract
4. Verify: Cross-reference facts across multiple sources
5. Synthesize: Produce a structured report with citations
Rules:
- Always check costs with alterlab_estimate_cost before large batches
- Use alterlab_check_balance to monitor credit usage
- Prefer markdown format for scraping (best for analysis)
- Use render_js=true only for JavaScript-heavy sites (SPAs, dashboards)
- Cite sources with URLs in your final report

Agent Conversation
Here is what a research session looks like:
User: Research the current state of web scraping regulations
in the EU. Include recent court rulings.
Agent: I'll research this topic across multiple sources.
[Calls alterlab_scrape: "https://gdpr.eu/tag/web-scraping/"]
[Calls alterlab_scrape: "https://techcrunch.com/tag/web-scraping/"]
[Calls alterlab_scrape: "https://eur-lex.europa.eu/..." with render_js=true]
[Calls alterlab_extract on legal database with extraction_prompt:
"Extract case name, date, ruling, and implications for web scraping"]
Agent: Here's my research report:
## EU Web Scraping Regulations - March 2026
### Key Findings
1. GDPR Article 6(1)(f) — legitimate interest basis...
2. Recent CJEU ruling in Case C-xxx/24...
3. ...
### Sources
- [GDPR.eu](https://gdpr.eu/...) — Retrieved March 2026
- [TechCrunch](https://techcrunch.com/...) — Retrieved March 2026
- [EUR-Lex](https://eur-lex.europa.eu/...) — Official text

Python Agent Implementation
If you prefer to build your own agent in code, here is a Python implementation using the AlterLab SDK directly:
import re

import alterlab

client = alterlab.AlterLab(api_key="sk_live_your_key")

def research_agent(topic: str, max_sources: int = 5) -> dict:
    """Autonomous research agent that scrapes and synthesizes."""
    # Step 1: Scrape a search-oriented page for source discovery
    # (a real agent would query a search engine for the topic)
    search_result = client.scrape(
        url="https://news.ycombinator.com/",
        formats=["markdown"],
    )

    # Pull candidate source URLs out of the markdown links
    discovered_urls = re.findall(r"\]\((https?://[^)]+)\)", search_result.markdown)

    # Step 2: Scrape each source and collect content
    sources = []
    for url in discovered_urls[:max_sources]:
        # Check cost first
        estimate = client.estimate(url=url)
        print(f"Estimated cost for {url}: {estimate.credits} credits")

        result = client.scrape(
            url=url,
            formats=["markdown"],
            advanced={"render_js": estimate.needs_js},
        )
        sources.append({
            "url": url,
            "content": result.markdown[:5000],  # Trim for context window
            "title": result.metadata.get("title", ""),
        })

    # Step 3: Extract structured data where applicable
    for source in sources:
        extraction = client.scrape(
            url=source["url"],
            formats=["json"],
            extraction_prompt=f"Extract key facts about {topic}",
        )
        source["structured_data"] = extraction.json_data

    return {
        "topic": topic,
        "source_count": len(sources),
        "sources": sources,
    }

Example: RAG Pipeline with AlterLab
Retrieval-Augmented Generation (RAG) combines web scraping with LLM reasoning. Use AlterLab to fetch fresh web content and feed it into your LLM as context for grounded, up-to-date answers.
Architecture
1. Query: User asks a question
2. Retrieve: Scrape relevant pages via AlterLab
3. Augment: Inject scraped content as LLM context
4. Generate: LLM answers with cited sources
MCP-Based RAG
The simplest RAG pipeline uses MCP directly — no code required. Just instruct your AI agent:
When I ask a question, follow this process:
1. Identify 2-3 authoritative URLs that would answer the question
2. Use alterlab_scrape to fetch each URL as markdown
3. Read the scraped content carefully
4. Answer my question using ONLY information from the scraped pages
5. Cite each claim with the source URL
Always scrape fresh content — don't rely on your training data for
facts that may have changed.

Code-Based RAG Pipeline
import alterlab
from openai import OpenAI

scraper = alterlab.AlterLab(api_key="sk_live_your_key")
llm = OpenAI()

def rag_answer(question: str, source_urls: list[str]) -> str:
    """Answer a question using fresh web content as context."""
    # Step 1: Scrape all source URLs
    context_parts = []
    for url in source_urls:
        result = scraper.scrape(
            url=url,
            formats=["markdown"],
        )
        context_parts.append(
            f"## Source: {url}\n\n{result.markdown[:3000]}"
        )
    context = "\n\n---\n\n".join(context_parts)

    # Step 2: Send to LLM with scraped context
    response = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer questions using ONLY the provided context. "
                    "Cite sources with URLs. If the context doesn't "
                    "contain the answer, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content

# Usage
answer = rag_answer(
    question="What are the latest changes to robots.txt standards?",
    source_urls=[
        "https://developers.google.com/search/docs/crawling-indexing/robots-txt",
        "https://www.rfc-editor.org/rfc/rfc9309",
    ],
)
print(answer)

Best Practices
1. Use Markdown Format for LLM Context
Request formats: ["markdown"] when scraping for AI consumption. Markdown preserves structure (headings, lists, links) while being token-efficient compared to HTML.
2. Estimate Before Batch Scraping
Always call alterlab_estimate_cost on a sample of URLs before scraping hundreds of pages. This prevents unexpected credit consumption, especially for sites that require JavaScript rendering.
3. Trim Content for Context Windows
Scraped pages can be long. Truncate content to the first 3,000 to 5,000 characters per source, or use alterlab_extract with a focused schema to get only the data you need.
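A plain string slice can cut mid-sentence. A slightly gentler trim breaks at the last paragraph boundary within the budget; this is a sketch, and max_chars should be tuned to your model's context window:

```python
def trim_for_context(markdown: str, max_chars: int = 3000) -> str:
    """Truncate scraped markdown at a paragraph boundary within max_chars."""
    if len(markdown) <= max_chars:
        return markdown
    cut = markdown[:max_chars]
    # Prefer to break at the last blank line; fall back to a hard cut
    # if no paragraph break lands in the second half of the budget.
    boundary = cut.rfind("\n\n")
    if boundary > max_chars // 2:
        cut = cut[:boundary]
    return cut.rstrip() + "\n\n[truncated]"
```

Apply it to each source's content before building the LLM context string.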
4. Use Sessions for Authenticated Content
For gated content, create a session with alterlab_create_session and pass the session_id to subsequent scrape calls. The session persists cookies across requests.
5. Enable JS Rendering Only When Needed
JavaScript rendering adds 3 credits per request and increases latency. Most news sites, blogs, and documentation pages work fine without it. Reserve render_js: true for SPAs, dashboards, and dynamic content.
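One way to decide is to fetch the raw HTML first with render_js off and check whether it already contains readable text. This heuristic is my own assumption, not an AlterLab feature, and the SPA markers are common but not exhaustive:

```python
import re

# Markers commonly left in the raw HTML shell of single-page apps
# (illustrative, not exhaustive).
SPA_MARKERS = ('id="root"', 'id="app"', "ng-app", "data-reactroot")

def probably_needs_js(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether a page is a JS-rendered shell from its raw HTML."""
    # Drop script bodies, then strip tags crudely and measure visible text.
    text = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    visible = len(" ".join(text.split()))
    has_marker = any(marker in html for marker in SPA_MARKERS)
    return visible < min_text_chars and has_marker
```

An empty root div plus almost no visible text is a strong hint the page hydrates client-side.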
6. Cross-Reference Multiple Sources
For research agents, scrape at least 3 sources per claim. LLMs can hallucinate when given thin context. More sources means better fact-checking and higher confidence in the final output.
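Cross-referencing can be made mechanical: track which source URLs support each claim and keep only claims above a support threshold. A sketch of that filter (the claim-to-sources mapping is assumed to come from the agent's extraction step):

```python
def well_supported_claims(claims: dict[str, set[str]], min_sources: int = 3) -> list[str]:
    """Return claims backed by at least min_sources distinct source URLs."""
    return sorted(
        claim for claim, urls in claims.items() if len(urls) >= min_sources
    )
```

Claims that fall below the threshold can be flagged for another round of scraping rather than dropped outright.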