How to Give Your AI Agent Access to PubMed Data

Learn how to give your AI agent structured access to PubMed's public data for medical research monitoring and RAG pipelines using AlterLab's extraction APIs.

TL;DR: Equip your AI agent with structured PubMed data by using AlterLab's Extract API to bypass anti-bot measures and return clean JSON. This enables reliable medical research monitoring, clinical trial tracking, and biotech intelligence without parsing HTML or managing proxies.

Disclaimer: This guide covers accessing publicly available data. Always review a site's robots.txt and Terms of Service before automated access.

99.2% Request Success Rate
<1s Avg Structured Response
0 HTML Parsing Required

Why AI agents need PubMed data

AI agents in healthcare and life sciences require current PubMed data for:

  • Medical research monitoring: Tracking new publications on specific diseases or treatments to update knowledge bases.
  • Clinical trial tracking: Identifying emerging trial results or protocol changes for real-time intelligence.
  • Biotech intelligence: Monitoring competitor research, grant publications, and emerging science for strategic decisions.

Why raw HTTP requests fail for agents

Direct requests to PubMed often fail for agents due to:

  • Rate limiting: PubMed blocks IPs exceeding request thresholds, causing failed tool calls.
  • JavaScript rendering: Any content that loads via JavaScript is missing from the HTML a naive scraper receives, so the agent may see an incomplete page.
  • Bot detection: Advanced anti-bot systems challenge requests with CAPTCHAs, wasting agent context windows on retries.
  • Token budget waste: Failed requests consume LLM tokens without yielding usable data, increasing costs and reducing pipeline reliability.
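
To make the retry cost concrete, here is a minimal sketch of the backoff logic an agent's tool layer would have to carry for direct requests. It is an illustration under stated assumptions, not part of any SDK: `RetryableError` and `call_with_backoff` are hypothetical names, and `fetch` stands in for whatever function performs the HTTP call.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error raised by fetch() for rate-limit or server failures."""


def call_with_backoff(fetch, max_attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Between attempts, waits base_delay * 2**attempt (capped at max_delay)
    plus up to 50% random jitter so parallel agents do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface one clear failure to the agent
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

Wrapping the tool call this way keeps retries out of the LLM loop, so a throttled request costs wall-clock time rather than tokens.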

Connecting your agent to PubMed via AlterLab

Use AlterLab's Extract API (see the Extract API docs) to get structured data from PubMed pages. It handles anti-bot bypass and JavaScript rendering, and returns clean JSON ready for your LLM.

The Getting started guide shows how to install the AlterLab SDK. Here’s a Python example that extracts structured data from a PubMed article:

Python
from alterlab import Client

client = Client("YOUR_API_KEY")

# Define schema for PubMed article structure
schema = {
    "title": "string",
    "authors": "string",
    "journal": "string",
    "pub_date": "string",
    "abstract": "string",
    "doi": "string"
}

# Extract structured data from a PubMed article URL
result = client.extract(
    url="https://pubmed.ncbi.nlm.nih.gov/34567890/",
    schema=schema
)

# result.data is a dict, ready for LLM context or a RAG pipeline
print(result.data)

Equivalent cURL command:

Bash
curl -X POST https://api.alterlab.io/api/v1/extract \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://pubmed.ncbi.nlm.nih.gov/34567890/",
    "schema": {
      "title": "string",
      "authors": "string",
      "journal": "string",
      "pub_date": "string",
      "abstract": "string",
      "doi": "string"
    }
  }'

If you need the full page content as raw HTML, use the Scrape API (/api/v1/scrape). For agents, structured extraction via the Extract API is recommended because it minimizes post-processing.
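
Once the extracted fields are available as a dict, they usually need to be rendered into a compact text block before they go into a prompt. The helper below is a minimal sketch of that step. `format_article_for_context` is a hypothetical name, the field names simply mirror the schema above, and the sample dict is hand-written rather than real API output.

```python
def format_article_for_context(article, max_abstract_chars=1500):
    """Render an extracted article dict as a compact text block for an LLM prompt."""
    abstract = (article.get("abstract") or "").strip()
    if len(abstract) > max_abstract_chars:
        # Truncate long abstracts to keep the prompt within budget
        abstract = abstract[:max_abstract_chars].rstrip() + "..."
    lines = [
        f"Title: {article.get('title') or 'unknown'}",
        f"Authors: {article.get('authors') or 'unknown'}",
        f"Journal: {article.get('journal') or 'unknown'} ({article.get('pub_date') or 'n.d.'})",
        f"DOI: {article.get('doi') or 'unknown'}",
        f"Abstract: {abstract or 'not available'}",
    ]
    return "\n".join(lines)


# Hand-written dict standing in for result.data (not real API output)
sample = {
    "title": "Example study",
    "authors": "A. Author; B. Author",
    "journal": "Example Journal",
    "pub_date": "2024",
    "abstract": "Short abstract.",
    "doi": None,
}
print(format_article_for_context(sample))
```

Missing fields fall back to placeholders instead of raising, which matters when a page lacks a DOI or an abstract.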

Using the Search API for PubMed queries

To search PubMed for articles matching a query, use AlterLab's Search API (/api/v1/search). This returns structured search results without needing to parse PubMed's search page.

Python
from alterlab import Client

client = Client("YOUR_API_KEY")

# Search PubMed for recent articles on cancer immunotherapy
search_params = {
    "query": "cancer immunotherapy 2024",
    "site": "pubmed.ncbi.nlm.nih.gov",
    "num_results": 10
}

response = client.search(**search_params)

# Response contains structured list of articles
for article in response.data:
    print(f"{article['title']} - {article['journal']}")

Equivalent cURL command:

Bash
curl -X POST https://api.alterlab.io/api/v1/search \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "cancer immunotherapy 2024",
    "site": "pubmed.ncbi.nlm.nih.gov",
    "num_results": 10
  }'
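
Search results are easier to deduplicate when each one carries a stable identifier. PubMed article URLs embed the PMID in the path (as in the article URL used earlier), so it can be recovered from a result's URL. The sketch below assumes each search result exposes a `url` field; that field and the helper name `pmid_from_url` are assumptions for illustration, not documented parts of the Search API response.

```python
import re
from urllib.parse import urlparse

# A PubMed article path is just the numeric PMID, with an optional trailing slash
_PMID_PATH = re.compile(r"^/(\d+)/?$")


def pmid_from_url(url):
    """Return the PMID string from a PubMed article URL, or None if it is not one."""
    parsed = urlparse(url)
    if parsed.hostname != "pubmed.ncbi.nlm.nih.gov":
        return None
    match = _PMID_PATH.match(parsed.path)
    return match.group(1) if match else None


print(pmid_from_url("https://pubmed.ncbi.nlm.nih.gov/34567890/"))
```

Search pages and other non-article URLs return None, so the caller can skip them before spending an extraction request.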

MCP integration

AlterLab provides an MCP server that lets Claude, GPT, or Cursor agents call web data extraction as a native tool. This simplifies agent configuration by abstracting API keys and request handling.

See the AI agent tutorial to set up the MCP server and integrate it with your agent framework.

Building a medical research monitoring pipeline

Here’s an end-to-end example of an agent monitoring PubMed for new diabetes research:

  1. The agent triggers a daily search for recently published "type 2 diabetes treatment" articles via AlterLab's Search API.
  2. For each new article (compared against a set of known IDs), the agent extracts structured data (title, abstract, DOI) using the Extract API.
  3. The agent summarizes key findings and updates a medical knowledge base in a vector store for RAG.
  4. If high-impact findings are detected (e.g., a new mechanism), the agent alerts researchers via Slack.

Python
import hashlib
import json
from datetime import datetime, timedelta

from alterlab import Client

# Initialize client (in production, load API key from secure vault)
client = Client("YOUR_API_KEY")

# Track seen articles to avoid duplicates
SEEN_ARTICLES_FILE = "seen_articles.json"

def load_seen_articles():
    try:
        with open(SEEN_ARTICLES_FILE) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def save_seen_articles(seen_set):
    with open(SEEN_ARTICLES_FILE, "w") as f:
        json.dump(list(seen_set), f)

def monitor_diabetes_research():
    seen = load_seen_articles()
    
    # Search for new diabetes articles from last 7 days
    seven_days_ago = (datetime.now() - timedelta(days=7)).strftime("%Y/%m/%d")
    search_query = f"type 2 diabetes treatment {seven_days_ago}[Date - Publication] : 3000[Date - Publication]"
    
    search_response = client.search(
        query=search_query,
        site="pubmed.ncbi.nlm.nih.gov",
        num_results=20
    )
    
    new_articles = []
    for article in search_response.data:
        # Create unique ID from PMID or DOI
        article_id = article.get("pmid") or article.get("doi") or hashlib.md5(article["title"].encode()).hexdigest()
        
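        if article_id not in seen:
            # Extract full structured data for new article
            try:
                extract_result = client.extract(
                    url=article["url"],
                    schema={
                        "title": "string",
                        "authors": "string",
                        "journal": "string",
                        "pub_date": "string",
                        "abstract": "string",
                        "doi": "string"
                    }
                )
            except Exception as exc:
                # Broad catch on purpose: one failed page should not abort the run
                print(f"Extraction failed for {article['url']}: {exc}")
                continue

            # Mark as seen only after success so failures are retried next run
            seen.add(article_id)
            new_articles.append(extract_result.data)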
    
    # Update knowledge base with new articles (pseudo-code)
    if new_articles:
        update_knowledge_base(new_articles)
        save_seen_articles(seen)
        print(f"Added {len(new_articles)} new diabetes research articles to knowledge base")
    else:
        print("No new articles found")

def update_knowledge_base(articles):
    # In practice: embed abstracts and store in vector DB (e.g., Pinecone, Weaviate)
    pass

if __name__ == "__main__":
    monitor_diabetes_research()
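
The script above covers searching, deduplication, and extraction, but not the Slack alert from step 4. One way to sketch that step uses a Slack incoming webhook, which accepts a JSON body with a `text` field. Everything named here is an assumption for illustration: the keyword list is a crude stand-in for whatever "high-impact" means in your pipeline, and `is_high_impact`, `build_slack_payload`, and `send_slack_alert` are hypothetical helpers.

```python
import json
import urllib.request

# Illustrative keywords only; a real pipeline would likely use an LLM or classifier
HIGH_IMPACT_TERMS = ("novel mechanism", "phase 3", "randomized controlled trial")


def is_high_impact(article, terms=HIGH_IMPACT_TERMS):
    """Flag an article whose title or abstract mentions any of the given terms."""
    text = f"{article.get('title') or ''} {article.get('abstract') or ''}".lower()
    return any(term in text for term in terms)


def build_slack_payload(articles):
    """Build the JSON body for a Slack incoming webhook message."""
    lines = [
        f"- {a.get('title') or 'Untitled'} (DOI: {a.get('doi') or 'n/a'})"
        for a in articles
    ]
    return {"text": "New high-impact PubMed articles:\n" + "\n".join(lines)}


def send_slack_alert(webhook_url, articles):
    """POST the payload to a Slack incoming webhook URL and return the HTTP status."""
    body = json.dumps(build_slack_payload(articles)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

In the monitoring function, this would run on `new_articles` after the knowledge base update, for example by filtering with `is_high_impact` and calling `send_slack_alert` only when the filtered list is non-empty.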

Key takeaways

  • AI agents need reliable, structured web data to function effectively in knowledge-intensive domains like healthcare.
  • AlterLab eliminates anti-bot, rendering, and parsing complexity, letting agents focus on data utilization rather than data acquisition.
  • Structured extraction via Extract API delivers PubMed data in LLM-ready JSON, preserving token budgets for reasoning.
  • Always comply with robots.txt and rate limits; users bear responsibility for reviewing PubMed's Terms of Service.
  • Scale agentic workloads efficiently with usage-based pricing; see pricing for details.

Frequently Asked Questions

Accessing publicly available data on PubMed is generally permissible under fair use and precedents like hiQ Labs v. LinkedIn, but agents must comply with PubMed's robots.txt, implement rate limiting, and avoid private or restricted data. Always review the site's Terms of Service.

AlterLab automatically manages rotating proxies, headless browsers with realistic fingerprints, and CAPTCHA solving, so agents receive consistent structured data without manual intervention.

AlterLab charges per successful request, with volume discounts. Agentic workloads typically start at $0.001 per request for basic scraping, and structured extraction adds minimal overhead. See pricing for details.