
How to Scrape Medium: Complete Guide for 2026

Learn how to scrape Medium articles, author data, and engagement metrics with Python. Includes working code examples, anti-bot bypass, and scaling strategies.

Yash Dubey

April 9, 2026

7 min read

Why scrape Medium?

Medium hosts millions of technical articles, opinion pieces, and industry analysis. Engineers scrape it for three primary use cases.

Content research and trend analysis. Track which topics gain traction over time. Measure engagement patterns across tags like machine-learning, devops, or cybersecurity. Build datasets that show how technical discourse shifts quarter over quarter.

Author and publication monitoring. Follow specific writers or publications for competitive intelligence. Track posting frequency, topic evolution, and audience response. Useful for content teams planning editorial calendars or recruiters identifying subject matter experts.

Training data for NLP models. Medium articles provide clean, well-formatted text suitable for language model fine-tuning, sentiment analysis, and topic classification. The platform's consistent structure makes extraction straightforward once you handle the anti-bot layer.

Anti-bot challenges on medium.com

Medium serves its content through a React-based single-page application. The initial HTML response contains minimal article content. JavaScript must execute before the actual text, images, and engagement metrics appear in the DOM.

Medium's anti-bot stack includes:

  • JavaScript challenges that verify the client can execute code before serving content
  • Browser fingerprinting that checks for headless browser signatures, missing fonts, and inconsistent navigator properties
  • Rate limiting on repeated requests from the same IP range
  • Cookie-based session validation that tracks request patterns over time

Running a basic requests.get() against medium.com returns a nearly empty HTML shell. You need a real browser environment with proper TLS fingerprints, consistent viewport dimensions, and realistic timing between interactions. Managing this infrastructure yourself means maintaining browser instances, rotating residential proxies, and updating evasion scripts whenever Medium changes their detection logic.
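You can detect an unrendered shell from your own pipeline before wasting parsing effort on it. The sketch below uses only the standard library; the 200-character threshold is an arbitrary heuristic, not a Medium-specific value.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Accumulates visible text from <p> and heading tags."""
    def __init__(self):
        super().__init__()
        self._capture = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "h1", "h2", "h3"):
            self._capture = True

    def handle_endtag(self, tag):
        if tag in ("p", "h1", "h2", "h3"):
            self._capture = False

    def handle_data(self, data):
        if self._capture:
            self.text.append(data.strip())

def looks_like_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic: a rendered article has substantial paragraph text;
    an unrendered SPA shell does not."""
    parser = _TextCollector()
    parser.feed(html)
    return len("".join(parser.text)) < min_chars
```

If `looks_like_shell` returns True for a response, you know JavaScript never executed and a higher rendering tier is needed.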

AlterLab handles all of this through its anti-bot bypass API. You send a URL, get back fully rendered HTML or structured JSON. No browser management, no proxy rotation, no fingerprint tuning.

  • 99.2% success rate
  • 1.2s average response
  • T3 minimum tier needed
  • Zero proxy management

Quick start with AlterLab API

Install the Python SDK and make your first request. The getting started guide covers account setup and API key generation.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@example/your-article-slug-abc123",
    formats=["json"],
    min_tier=3
)
print(response.json)
Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://medium.com/@example/your-article-slug-abc123",
    "formats": ["json"],
    "min_tier": 3
  }'

The min_tier=3 parameter tells the system to skip basic HTTP tiers and go straight to headless browser rendering. Medium requires JavaScript execution, so tiers 1 and 2 will return incomplete content. Setting min_tier=3 saves you a retry cycle.

The response includes the fully rendered page. With formats=["json"], you get a parsed structure instead of raw HTML. This matters because Medium's DOM is deeply nested and class names change frequently.

Try it yourself: scrape Medium with AlterLab.

Extracting structured data

Medium articles follow a consistent internal structure. Here are the selectors and extraction patterns for the data points engineers typically need.

Article content and metadata

Python
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@author/article-title-xyz789",
    min_tier=3
)

soup = BeautifulSoup(response.text, "html.parser")

title = soup.find("h1", class_="pw-post-title")
subtitle = soup.find("h2", class_="pw-subtitle-paragraph")
author = soup.find("div", class_="pw-post-author-name")
publish_date = soup.find("time")
# Scope extraction to the <article> element so nav and footer
# paragraphs are not pulled into the body text
article_body = soup.find("article")
content_sections = article_body.find_all("p") if article_body else []

article_data = {
    "title": title.text.strip() if title else None,
    "subtitle": subtitle.text.strip() if subtitle else None,
    "author": author.text.strip() if author else None,
    "published": publish_date["datetime"] if publish_date else None,
    "body": "\n".join(p.text for p in content_sections)
}
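Records like `article_data` above are convenient to persist as JSON Lines, one article per line, so incremental scrapes stay append-only. A minimal sketch (the file path is illustrative):

```python
import json
from pathlib import Path

def append_article(record: dict, path: str = "articles.jsonl") -> None:
    """Append one article record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_articles(path: str = "articles.jsonl") -> list[dict]:
    """Read every record back, skipping blank lines."""
    if not Path(path).exists():
        return []
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because each line is independent, a crashed run never corrupts earlier records, and you can deduplicate later by article URL or title.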

Engagement metrics

Claps, responses, and reading time render client-side. You need to wait for the page to fully hydrate before extracting these values.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@author/article-title-xyz789",
    min_tier=3,
    wait_for=".js-clapCount"
)

claps = response.css(".js-clapCount").text
responses = response.css(".js-responsesCount").text
reading_time = response.css(".postMetaInline-readingTime").text

The wait_for parameter ensures the scraper pauses until the engagement counters render. Without it, you will capture placeholder elements before the JavaScript populates actual numbers.
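Medium displays counts in abbreviated form, so a value like "1.2K" needs normalizing to an integer before you store or compare it. A small sketch assuming Medium's usual K/M suffixes:

```python
def parse_count(text: str) -> int:
    """Convert abbreviated counts ('1.2K', '3M', '57') to plain ints."""
    text = text.strip().upper().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000}
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text) if text else 0
```

Run the scraped clap and response strings through this before writing them to your dataset, so trend analysis works on numbers rather than strings.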

Author profile data

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@username",
    min_tier=3,
    formats=["json"]
)

profile = response.json
author_name = profile.get("author", {}).get("name")
follower_count = profile.get("author", {}).get("stats", {}).get("followersCount")
article_count = profile.get("author", {}).get("stats", {}).get("postsCount")
bio = profile.get("author", {}).get("bio")

When you request JSON format, AlterLab attempts to extract structured entities from the page. For author profiles, this includes name, bio, follower counts, and publication history. The exact fields depend on what Medium exposes in the rendered DOM at request time.
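The chained .get() calls above get verbose as paths deepen. A tiny helper that walks a key path and stops at the first missing level keeps extraction code readable; the helper name is ours, not part of the SDK:

```python
def dig(data, *keys, default=None):
    """Walk nested dicts by key path; return default if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

With it, the follower lookup becomes `dig(profile, "author", "stats", "followersCount")`, and a restructured response degrades to None instead of raising.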

Common pitfalls

Rate limiting and request patterns

Medium throttles aggressive scraping. Sending 50 requests per minute from a single IP triggers temporary blocks. Space your requests out. If you are pulling article lists or search results, add 2-3 seconds between requests. AlterLab's proxy rotation handles IP-level distribution, but you should still implement reasonable delays in your own code to avoid triggering application-level rate limits.
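One way to implement that spacing is a fixed base delay plus random jitter, so your request timing does not form a mechanical pattern. A sketch; the 2-3 second window follows the guidance above:

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Sleep between base and base + jitter seconds; return the delay used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Call `polite_delay()` between consecutive scrapes in a loop; returning the delay makes the pacing easy to log and tune.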

Dynamic content and lazy loading

Medium loads images, embedded tweets, and code snippets asynchronously. The initial render may not include all content. Use the wait_for parameter to target specific elements that load late in the page lifecycle. For articles with embedded gists or code blocks, wait_for=".gist" ensures those render before capture.

Member-only content and authentication

Some Medium content requires a logged-in session. Member-only articles display a paywall overlay that blocks the full text. AlterLab can scrape publicly visible content without authentication. If you need access to member-only articles, you must provide session cookies:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@author/member-only-article-abc",
    min_tier=3,
    cookies=[
        {"name": "uid", "value": "YOUR_SESSION_COOKIE", "domain": ".medium.com"}
    ]
)

Note that sharing or automating credential acquisition violates Medium's Terms of Service. Only use cookies from accounts you own and have explicit permission to use.

Class name volatility

Medium updates their frontend regularly. CSS class names like pw-post-title or js-clapCount may change between deployments. Build your extraction logic to handle missing fields gracefully. Log when selectors fail so you can update them before your pipeline accumulates bad data.
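One defensive pattern is to try a list of extractor callables in order, log every miss, and return None instead of raising. The helper below is generic and library-agnostic; the names are illustrative:

```python
import logging

logger = logging.getLogger("medium_scraper")

def first_match(extractors, label):
    """Return the first non-None extractor result; log every miss."""
    for fn in extractors:
        try:
            value = fn()
            if value is not None:
                return value
        except Exception:
            pass
        logger.warning("selector miss for %s: %s", label, getattr(fn, "__name__", fn))
    logger.error("all selectors failed for %s", label)
    return None
```

For the title extraction earlier, the extractors could be `lambda: soup.find("h1", class_="pw-post-title")` followed by a looser `lambda: soup.find("h1")` fallback, so a renamed class degrades gracefully and shows up in your logs.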

Scaling up

When you move from scraping a dozen articles to thousands, three things matter: cost control, scheduling, and error handling.

Batch processing

Structure your scraper to handle URL lists efficiently. Process articles in parallel where possible, but respect Medium's infrastructure by capping concurrent requests.

Python
import alterlab
import asyncio

client = alterlab.Client("YOUR_API_KEY")
urls = [
    "https://medium.com/@author/article-one-abc",
    "https://medium.com/@author/article-two-def",
    "https://medium.com/@author/article-three-ghi",
]

async def scrape_batch(url_list, max_concurrent=5):
    # Semaphore caps in-flight requests, per the note above
    semaphore = asyncio.Semaphore(max_concurrent)
    async def scrape_one(url):
        async with semaphore:
            return await client.scrape_async(url, min_tier=3)
    return await asyncio.gather(*(scrape_one(u) for u in url_list))

articles = asyncio.run(scrape_batch(urls))
for article in articles:
    print(article.status_code, article.url)

Scheduling recurring scrapes

If you monitor specific authors or tags, set up recurring scrapes instead of running manual jobs. AlterLab's scheduling system uses cron expressions.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
schedule = client.schedules.create(
    url="https://medium.com/tag/machine-learning",
    cron="0 9 * * 1",
    formats=["json"],
    min_tier=3,
    webhook_url="https://your-server.com/webhook/medium-articles"
)
print(f"Schedule created: {schedule.id}")

This runs every Monday at 9 AM UTC, scrapes the machine-learning tag page, and pushes results to your webhook endpoint. You get fresh data without maintaining a cron daemon or retry logic.

Cost management

Medium pages require headless browser rendering, which costs more per request than static HTML pages. Each scrape at tier 3 consumes more balance than a tier 1 curl request. Monitor your usage dashboard and set spend limits on your API keys to prevent unexpected charges.

For high-volume operations, consider caching results. If you scrape the same article URL multiple times, store the output locally and only re-scrape when you need updated engagement metrics. This reduces redundant requests and keeps costs predictable. Review AlterLab pricing to understand per-request costs at each tier and plan your budget accordingly.
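A minimal in-memory cache keyed by URL with a time-to-live covers the common case. In the sketch below, `fetch` stands in for whatever function wraps your AlterLab call (hypothetical wiring):

```python
import time

class TTLCache:
    """In-memory URL cache with per-entry expiry."""
    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}

    def get_or_fetch(self, url, fetch):
        """Return a cached result if fresh; otherwise call fetch(url) and cache it."""
        now = time.time()
        entry = self._store.get(url)
        if entry and now - entry[0] < self.ttl:
            return entry[1]
        result = fetch(url)
        self._store[url] = (now, result)
        return result
```

Set the TTL to match how stale your engagement metrics can afford to be: hours for trend analysis, minutes for near-real-time monitoring.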

Error handling and retries

Network failures, temporary blocks, and page structure changes will cause individual requests to fail. Wrap your scraping calls in retry logic with exponential backoff.

Python
import alterlab
import time

client = alterlab.Client("YOUR_API_KEY")

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.scrape(url, min_tier=3)
            if response.status_code == 200:
                return response
        except Exception:
            # Re-raise only after the final attempt fails
            if attempt == max_retries - 1:
                raise
        # Exponential backoff: 1s, 2s, 4s between attempts
        time.sleep(2 ** attempt)
    return None

Log failed URLs separately so you can investigate whether failures stem from selector changes, temporary outages, or permanent page removals.

Key takeaways

Medium requires headless browser rendering due to its React-based architecture and anti-bot protections. Setting min_tier=3 ensures you skip tiers that cannot execute JavaScript.

Use formats=["json"] to get structured data instead of parsing volatile HTML class names. Add wait_for selectors to capture lazy-loaded engagement metrics. Space out requests to avoid application-level rate limits, and cache results to reduce redundant scrapes.

For recurring monitoring, use scheduled scrapes with webhooks instead of maintaining your own cron infrastructure. Set spend limits on API keys to control costs at scale.

  • T3+ required tier
  • JSON recommended format
  • 2-3s request spacing
  • async batch processing

Frequently Asked Questions

Is it legal to scrape Medium?

Scraping publicly accessible data on Medium is generally legal, but you must respect their Terms of Service and robots.txt. Avoid scraping behind authenticated walls, do not overload their servers, and use the data for personal or research purposes rather than republishing full articles.

How does Medium detect and block scrapers?

Medium uses standard anti-bot protections including JavaScript challenges, fingerprinting, and rate limiting. AlterLab handles these automatically through its [anti-bot bypass API](/anti-bot-bypass-api), rotating proxies, and headless browser rendering so you get clean HTML without managing evasion logic yourself.

How much does it cost to scrape Medium?

Cost depends on volume and whether you need headless browser rendering. Medium pages typically require JavaScript execution, so you will want tier 3 or higher. Check [AlterLab pricing](/pricing) for per-request rates. Most engineers scraping a few thousand articles per month spend under $50.