How to Scrape Medium Data: Complete Guide for 2026

A practical guide to scraping publicly accessible tech data from Medium using Python and Node.js with AlterLab's web scraping API in 2026.

TL;DR: To scrape Medium data in 2026, use AlterLab's API with automatic anti-bot handling. Start with T1/T2 tiers for public pages, escalate to T3/T4 for protected content, and extract structured data via Cortex for typed JSON output. Always respect robots.txt and rate limits.

This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.

Why collect tech data from Medium?

Medium hosts valuable public technical content useful for:

  • Tech trend analysis: Monitor emerging frameworks, libraries, and architectural patterns in engineering blogs
  • Competitive intelligence: Track how companies discuss product launches, API changes, or infrastructure shifts
  • Content aggregation: Build curated feeds of high-quality technical articles for internal knowledge sharing or newsletters

Technical challenges

Medium implements standard anti-bot measures including rate limiting based on IP reputation, header validation (User-Agent, Accept), and occasional JavaScript challenges for suspicious traffic. Raw HTTP requests often receive 429 or 403 responses. AlterLab's Smart Rendering API mitigates these through:

  • Automatic proxy rotation from a large residential pool
  • Dynamic header management mimicking real browsers
  • Tier escalation from T1 (curl) to T4 (headless browser) as needed
  • Built-in retry logic with exponential backoff

99.2% success rate · 1.2 s average response · $0.002 per request (T3)
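
The "retry logic with exponential backoff" bullet above can be pictured with a small sketch. This is not AlterLab's implementation: `TransientError`, `backoff_delays`, `fetch_with_backoff`, and the delay values are hypothetical names and numbers chosen only to illustrate the general technique (doubling delays, capped, with random jitter).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error standing in for a 429 or 5xx response."""


def backoff_delays(retries, base=0.5, cap=8.0):
    """Return the un-jittered delay before each retry: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def fetch_with_backoff(fetch, retries=4, base=0.5, cap=8.0, sleep=time.sleep):
    """Call fetch() until it succeeds, sleeping with full jitter between tries."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Full jitter: sleep a random amount up to the exponential delay.
            sleep(random.uniform(0, delays[attempt]))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting; in production code the default `time.sleep` would be used.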

Quick start with AlterLab API

See the Getting started guide for SDK installation. Below are examples for scraping a public Medium tech article.

Python example:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape("https://medium.com/@example/understanding-react-19-compiler-abc123")
print(response.text[:500])  # First 500 chars of HTML

Node.js example:

JavaScript
import { AlterLab } from "@alterlab/sdk";

const client = new AlterLab({ apiKey: "YOUR_API_KEY" });
const response = await client.scrape("https://medium.com/@example/understanding-react-19-compiler-abc123");
console.log(response.text.slice(0, 500));

cURL example:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://medium.com/@example/understanding-react-19-compiler-abc123"}'

Extracting structured data

For consistent data extraction, target these common CSS selectors on Medium article pages:

  • Title: h1[data-testid="storyTitle"] or h1.graf--title
  • Author: a[data-testid="authorName"] or a[data-action="show-user-card"]
  • Publication date: time[datetime] (ISO 8601 format in datetime attribute)
  • Reading time: span[data-testid="readingTime"]
  • Claps: button[data-testid="clapButton"] (note: requires interaction for real count; static count may be in adjacent text)
  • Tags: a[data-action="show-tag"] within the tag container

Example Python extraction:

Python
import alterlab
from bs4 import BeautifulSoup

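client = alterlab.Client("YOUR_API_KEY")
# Fetch the article HTML (placeholder URL).
html = client.scrape("https://medium.com/example/page").text
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every tag link; the list is empty if the selector matches nothing.
tags = [tag.get_text(strip=True) for tag in soup.select('a[data-action="show-tag"]')]
print(f"Tags: {tags}")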
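
Selector logic is easier to trust when it can be checked offline against saved HTML. The sketch below is a dependency-free variant using Python's standard `html.parser`, covering two of the selectors listed above (`time[datetime]` and `a[data-action="show-tag"]`). The `ArticleMetaParser` class and the `SAMPLE_HTML` markup are made up for illustration; Medium's real markup may differ and can change without notice.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collects the first time[datetime] value and the text of tag links."""

    def __init__(self):
        super().__init__()
        self.published = None
        self.tags = []
        self._in_tag_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "time" and self.published is None and "datetime" in attrs:
            self.published = attrs["datetime"]
        elif tag == "a" and attrs.get("data-action") == "show-tag":
            self._in_tag_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_tag_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_tag_link:
            text = "".join(self._buffer).strip()
            if text:
                self.tags.append(text)
            self._in_tag_link = False


# Made-up sample markup, not copied from a real Medium page.
SAMPLE_HTML = """
<article>
  <h1>Sample title</h1>
  <time datetime="2026-01-15T09:30:00.000Z">Jan 15</time>
  <a data-action="show-tag" href="/tag/react">React</a>
  <a data-action="show-tag" href="/tag/javascript"> JavaScript </a>
  <a href="/about">About</a>
</article>
"""

parser = ArticleMetaParser()
parser.feed(SAMPLE_HTML)
print(parser.published, parser.tags)
```

Running the same parser over a handful of saved pages is a cheap way to notice when a selector has stopped matching.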

Structured JSON extraction with Cortex

AlterLab's Cortex AI extracts typed JSON directly from pages without CSS selectors. Define a schema for Medium article metadata:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
result = client.extract(
    url="https://medium.com/@example/understanding-react-19-compiler-abc123",
    schema={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "author": {"type": "string"},
            "published_date": {"type": "string", "format": "date-time"},
            "reading_time_minutes": {"type": "integer"},
            "tags": {"type": "array", "items": {"type": "string"}}
        },
        "required": ["title", "author"]
    }
)
print(result.data)
# Output: {"title": "Understanding React 19 Compiler", "author": "Jane Dev", ...}

Cortex handles JavaScript rendering and anti-bot challenges automatically, returning validated JSON matching your schema.
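
Downstream code usually wants typed values rather than a raw dictionary. The following is a minimal sketch, assuming `result.data` is a plain `dict` keyed by the schema's field names. `ArticleMeta` and `parse_article` are hypothetical helpers, not part of the AlterLab SDK, and the sample `published_date` value is invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ArticleMeta:
    title: str
    author: str
    published_date: Optional[datetime] = None
    reading_time_minutes: Optional[int] = None
    tags: List[str] = field(default_factory=list)


def parse_article(data: dict) -> ArticleMeta:
    """Build an ArticleMeta from a dict shaped like the schema above."""
    missing = [key for key in ("title", "author") if not data.get(key)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    published = data.get("published_date")
    if published is not None:
        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
        published = datetime.fromisoformat(published.replace("Z", "+00:00"))

    return ArticleMeta(
        title=data["title"],
        author=data["author"],
        published_date=published,
        reading_time_minutes=data.get("reading_time_minutes"),
        tags=list(data.get("tags", [])),
    )


# Sample input shaped like the example output above; the date is invented.
article = parse_article({
    "title": "Understanding React 19 Compiler",
    "author": "Jane Dev",
    "published_date": "2026-01-15T09:30:00Z",
    "tags": ["React"],
})
print(article.published_date.year, article.tags)
```

Re-checking the required fields locally is a cheap safeguard even when the service validates the output itself.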

Cost breakdown

AlterLab's pricing scales with technical difficulty. For Medium:

  • T1/T2: Rarely sufficient due to header/JS checks
  • T3: Typical for Medium's anti-bot level (stealth mode with proxy rotation)
  • T4: Needed if heavy client-side rendering obstructs content

See AlterLab pricing for full details. Note: AlterLab auto-escalates tiers — start at T1 and the API promotes automatically if a lower tier fails. You only pay for the tier that succeeds.

Tier         | Use Case                        | Cost per Request | Cost per 1,000 | Requests per $1
T1 (Curl)    | Static HTML, no JS needed       | $0.0002          | $0.20          | 5,000
T2 (HTTP)    | Standard pages with headers     | $0.0003          | $0.30          | 3,333
T3 (Stealth) | Protected pages, anti-bot active | $0.002          | $2.00          | 500
T4 (Browser) | Full JS rendering required      | $0.004           | $4.00          | 250
T5 (CAPTCHA) | CAPTCHA solving + JS rendering  | $0.02            | $20.00         | 50
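
The per-request prices translate directly into a budget estimate. The sketch below copies the prices from the table; `estimate_cost` and the 90/10 split between T3 and T4 are hypothetical, chosen only to show the arithmetic. Real tier mixes will vary.

```python
# Per-request prices in USD, copied from the pricing table above.
PRICE_PER_REQUEST = {"T1": 0.0002, "T2": 0.0003, "T3": 0.002, "T4": 0.004, "T5": 0.02}


def estimate_cost(requests_by_tier):
    """Sum the cost of requests, keyed by the tier that finally succeeded."""
    total = sum(
        PRICE_PER_REQUEST[tier] * count
        for tier, count in requests_by_tier.items()
    )
    return round(total, 4)


# Hypothetical mix for 10,000 article fetches: 9,000 succeed at T3, 1,000 need T4.
# 9,000 * $0.002 + 1,000 * $0.004 = $18 + $4 = $22
print(estimate_cost({"T3": 9000, "T4": 1000}))
```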

Best practices

  • Rate limiting: Start with 1 request/second; adjust based on response headers (AlterLab includes X-RateLimit-Remaining)
  • Robots.txt compliance: Check https://medium.com/robots.txt (it disallows /api/ and /login/, but allows /@username/ paths)
  • Dynamic content: Use Cortex for JS-dependent data instead of manual scrolling/waiting
  • Error handling: Implement retries for 429/5xx responses; the AlterLab SDK auto-retries transient failures
  • Data freshness: For time-sensitive data, pair with AlterLab's scheduling (cron expressions) or webhooks
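
Two of the practices above, robots.txt compliance and a fixed request rate, can be sketched with the standard library alone. `SAMPLE_ROBOTS` is an illustrative rule set based on the description in the list, not a copy of Medium's live file; in real use the parser would load the actual robots.txt. `is_allowed` and `Throttle` are hypothetical helper names.

```python
import time
from urllib.robotparser import RobotFileParser

# Illustrative rules based on the description above, not the live file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /api/
Disallow: /login/
Allow: /
"""

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS.splitlines())


def is_allowed(url, user_agent="*"):
    """True if the parsed robots rules permit fetching this URL."""
    return robots.can_fetch(user_agent, url)


class Throttle:
    """Blocks so that consecutive calls are at least `interval` seconds apart."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


print(is_allowed("https://medium.com/@example/some-post"))
print(is_allowed("https://medium.com/api/some-endpoint"))
```

Calling `Throttle(1.0).wait()` before each request enforces the 1 request/second starting point; the interval can then be tuned from whatever rate-limit headers come back.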

Scaling up

For large-scale Medium data collection:

  • Batch requests: Use AlterLab's /batch endpoint (up to 100 URLs/request) to reduce overhead
  • Scheduling: Set up recurring scrapes via AlterLab's dashboard API for weekly trend analysis
  • Responsible scaling:
    • Monitor success rates per domain; pause if >5% failure rate
    • Use AlterLab's usage alerts to avoid unexpected costs
    • Store raw HTML minimally; extract only needed fields to reduce storage
    • Consider sampling: scrape 10% of articles daily instead of 100%
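
The batching, failure-rate, and sampling points above reduce to a few small helpers. This is a sketch under stated assumptions: `chunked`, `should_pause`, and `in_daily_sample` are hypothetical names, the 100-URL batch size and 5% threshold come from the list above, and hashing the URL together with the date is just one possible way to pick a stable 10% sample.

```python
import hashlib


def chunked(urls, size=100):
    """Split urls into lists of at most `size`, matching the batch limit above."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def should_pause(successes, failures, threshold=0.05):
    """True when the observed failure rate exceeds the threshold."""
    total = successes + failures
    return total > 0 and failures / total > threshold


def in_daily_sample(url, day, percent=10):
    """Deterministically pick roughly `percent`% of URLs, varying by day."""
    digest = hashlib.sha256(f"{day}:{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


# Placeholder URLs for illustration only.
urls = [f"https://medium.com/@example/post-{i}" for i in range(250)]
batches = chunked(urls)
print([len(batch) for batch in batches])
print(should_pause(successes=94, failures=6))
```

Because the sample is derived from a hash rather than a random draw, re-running the job on the same day selects the same articles, which keeps retries idempotent.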

Key takeaways

  • Medium's public tech content is scrapeable with proper anti-bot handling via AlterLab's tiered system
  • Always verify data accessibility through robots.txt and ToS before scraping
  • Use Cortex for reliable structured output instead of fragile CSS selectors
  • Budget for T3/T4 tiers ($0.002-$0.004/request) for consistent Medium access
  • Implement rate limiting and monitoring to maintain sustainable scraping practices

Frequently Asked Questions

Is it legal to scrape Medium?
Scraping publicly accessible data is generally legal under precedents like hiQ v. LinkedIn, but you must review Medium's robots.txt and Terms of Service, implement rate limiting, and avoid private or login-required data.

Does Medium block scrapers?
Medium employs standard anti-bot protections (rate limiting, header checks, occasional JS challenges) that can block basic HTTP requests. AlterLab's Smart Rendering API handles proxy rotation, header management, and automatic tier escalation to maintain access to public data.

How much does it cost to scrape Medium?
Costs range from $0.0002 per request for static HTML (T1) to $0.004 for full browser rendering (T4), with AlterLab auto-escalating tiers so you only pay for the successful tier. For Medium's typical protections, expect T3 ($0.002/request) or T4.