How to Scrape Medium: Complete Guide for 2026
Learn how to scrape Medium articles, author data, and engagement metrics with Python. Includes working code examples, anti-bot bypass, and scaling strategies.
April 9, 2026
Why scrape Medium?
Medium hosts millions of technical articles, opinion pieces, and industry analysis. Engineers scrape it for three primary use cases.
Content research and trend analysis. Track which topics gain traction over time. Measure engagement patterns across tags like machine-learning, devops, or cybersecurity. Build datasets that show how technical discourse shifts quarter over quarter.
Author and publication monitoring. Follow specific writers or publications for competitive intelligence. Track posting frequency, topic evolution, and audience response. Useful for content teams planning editorial calendars or recruiters identifying subject matter experts.
Training data for NLP models. Medium articles provide clean, well-formatted text suitable for language model fine-tuning, sentiment analysis, and topic classification. The platform's consistent structure makes extraction straightforward once you handle the anti-bot layer.
Anti-bot challenges on medium.com
Medium serves its content through a React-based single-page application. The initial HTML response contains minimal article content. JavaScript must execute before the actual text, images, and engagement metrics appear in the DOM.
Their anti-bot stack includes:
- JavaScript challenges that verify the client can execute code before serving content
- Browser fingerprinting that checks for headless browser signatures, missing fonts, and inconsistent navigator properties
- Rate limiting on repeated requests from the same IP range
- Cookie-based session validation that tracks request patterns over time
Running a basic requests.get() against medium.com returns a nearly empty HTML shell. You need a real browser environment with proper TLS fingerprints, consistent viewport dimensions, and realistic timing between interactions. Managing this infrastructure yourself means maintaining browser instances, rotating residential proxies, and updating evasion scripts whenever Medium changes their detection logic.
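To see the problem concretely, here is a small heuristic sketch for detecting the "empty shell" response described above: an unrendered SPA page has almost no visible text once you strip scripts and markup. The `looks_like_spa_shell` helper is hypothetical, not part of any SDK, and the 500-character threshold is an arbitrary assumption you should tune.

```python
import re

def looks_like_spa_shell(html: str, min_text_chars: int = 500) -> bool:
    """Return True if the HTML looks like an unrendered SPA shell."""
    # Strip script/style blocks first, then all remaining tags.
    stripped = re.sub(r"(?s)<(script|style)[^>]*>.*?</\1>", "", html)
    text = re.sub(r"<[^>]+>", " ", stripped)
    visible = re.sub(r"\s+", " ", text).strip()
    return len(visible) < min_text_chars

# Typical shape of what a plain HTTP fetch of a React SPA returns:
shell = ("<html><head><script>window.__STATE__={}</script></head>"
         "<body><div id='root'></div></body></html>")
print(looks_like_spa_shell(shell))  # True: almost no visible text
```

A check like this is useful as a guard in a pipeline: if a response looks like a shell, escalate to a browser-rendering tier instead of parsing garbage.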
AlterLab handles all of this through its anti-bot bypass API. You send a URL, get back fully rendered HTML or structured JSON. No browser management, no proxy rotation, no fingerprint tuning.
Quick start with AlterLab API
Install the Python SDK and make your first request. The getting started guide covers account setup and API key generation.
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@example/your-article-slug-abc123",
    formats=["json"],
    min_tier=3
)
print(response.json)

The same request with cURL:

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://medium.com/@example/your-article-slug-abc123",
    "formats": ["json"],
    "min_tier": 3
  }'

The min_tier=3 parameter tells the system to skip basic HTTP tiers and go straight to headless browser rendering. Medium requires JavaScript execution, so tiers 1 and 2 will return incomplete content. Setting min_tier=3 saves you a retry cycle.
The response includes the fully rendered page. With formats=["json"], you get a parsed structure instead of raw HTML. This matters because Medium's DOM is deeply nested and class names change frequently.
Try scraping Medium with AlterLab
Extracting structured data
Medium articles follow a consistent internal structure. Here are the selectors and extraction patterns for the data points engineers typically need.
Article content and metadata
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@author/article-title-xyz789",
    min_tier=3
)

soup = BeautifulSoup(response.text, "html.parser")
title = soup.find("h1", class_="pw-post-title")
subtitle = soup.find("h2", class_="pw-subtitle-paragraph")
author = soup.find("div", class_="pw-post-author-name")
publish_date = soup.find("time")
content_sections = soup.find_all("p")

article_data = {
    "title": title.text.strip() if title else None,
    "subtitle": subtitle.text.strip() if subtitle else None,
    "author": author.text.strip() if author else None,
    "published": publish_date["datetime"] if publish_date else None,
    "body": "\n".join(p.text for p in content_sections)
}

Engagement metrics
Claps, responses, and reading time render client-side. You need to wait for the page to fully hydrate before extracting these values.
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@author/article-title-xyz789",
    min_tier=3,
    wait_for=".js-clapCount"
)

claps = response.css(".js-clapCount").text
responses = response.css(".js-responsesCount").text
reading_time = response.css(".postMetaInline-readingTime").text

The wait_for parameter ensures the scraper pauses until the engagement counters render. Without it, you will capture placeholder elements before the JavaScript populates actual numbers.
Author profile data
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@username",
    min_tier=3,
    formats=["json"]
)

profile = response.json
author = profile.get("author", {})
author_name = author.get("name")
follower_count = author.get("stats", {}).get("followersCount")
article_count = author.get("stats", {}).get("postsCount")
bio = author.get("bio")

When you request JSON format, AlterLab attempts to extract structured entities from the page. For author profiles, this includes name, bio, follower counts, and publication history. The exact fields depend on what Medium exposes in the rendered DOM at request time.
Common pitfalls
Rate limiting and request patterns
Medium throttles aggressive scraping. Sending 50 requests per minute from a single IP triggers temporary blocks. Space your requests out. If you are pulling article lists or search results, add 2-3 seconds between requests. AlterLab's proxy rotation handles IP-level distribution, but you should still implement reasonable delays in your own code to avoid triggering application-level rate limits.
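One way to implement that spacing is a small pacing generator with a randomized gap, so requests do not arrive on a perfectly regular beat. `polite_delays` is a hypothetical helper for this sketch; swap in your own scrape call where the comment indicates.

```python
import random
import time

def polite_delays(urls, low=2.0, high=3.0, sleep=time.sleep):
    """Yield each URL after a randomized delay between `low` and `high` seconds."""
    for i, url in enumerate(urls):
        if i > 0:  # no delay before the first request
            sleep(random.uniform(low, high))
        yield url

# The `sleep` parameter is injectable so the pacing logic is testable
# without actually waiting.
waits = []
for url in polite_delays(["u1", "u2", "u3"], sleep=waits.append):
    pass  # call client.scrape(url, min_tier=3) here
print(len(waits))  # two gaps for three URLs
```

The jitter matters: fixed 2.0-second intervals are themselves a bot signature, while a uniform 2-3 second spread looks closer to human pacing.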
Dynamic content and lazy loading
Medium loads images, embedded tweets, and code snippets asynchronously. The initial render may not include all content. Use the wait_for parameter to target specific elements that load late in the page lifecycle. For articles with embedded gists or code blocks, wait_for=".gist" ensures those render before capture.
Session and cookie handling
Some Medium content requires a logged-in session. Member-only articles display a paywall overlay that blocks the full text. AlterLab can scrape publicly visible content without authentication. If you need access to member-only articles, you must provide session cookies:
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    "https://medium.com/@author/member-only-article-abc",
    min_tier=3,
    cookies=[
        {"name": "uid", "value": "YOUR_SESSION_COOKIE", "domain": ".medium.com"}
    ]
)

Note that sharing or automating credential acquisition violates Medium's Terms of Service. Only use cookies from accounts you own and have explicit permission to use.
Class name volatility
Medium updates their frontend regularly. CSS class names like pw-post-title or js-clapCount may change between deployments. Build your extraction logic to handle missing fields gracefully. Log when selectors fail so you can update them before your pipeline accumulates bad data.
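A defensive-extraction sketch for that failure mode: treat every selector hit as optional, and log misses so a broken selector surfaces in your logs rather than as silent nulls in your dataset. Here `node` stands for whatever your parser returns (for example a BeautifulSoup Tag), or None when the selector no longer matches; `FakeTag` is a stand-in used only to keep the sketch self-contained.

```python
import logging

logger = logging.getLogger("medium_scraper")

def safe_text(node, field_name):
    """Return stripped text from a parser node, or None, logging the miss."""
    if node is None:
        logger.warning("selector miss: %s", field_name)
        return None
    return node.text.strip()

class FakeTag:  # stand-in for a BeautifulSoup Tag in this sketch
    text = "  Hello Medium  "

print(safe_text(FakeTag(), "title"))   # "Hello Medium"
print(safe_text(None, "clap_count"))   # None, and a warning is logged
```

Counting these warnings per run gives you an early signal that a Medium deployment changed class names, before a whole batch of records comes back empty.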
Scaling up
When you move from scraping a dozen articles to thousands, three things matter: cost control, scheduling, and error handling.
Batch processing
Structure your scraper to handle URL lists efficiently. Process articles in parallel where possible, but respect Medium's infrastructure by capping concurrent requests.
import alterlab
import asyncio

client = alterlab.Client("YOUR_API_KEY")

urls = [
    "https://medium.com/@author/article-one-abc",
    "https://medium.com/@author/article-two-def",
    "https://medium.com/@author/article-three-ghi",
]

async def scrape_batch(url_list):
    tasks = [client.scrape_async(url, min_tier=3) for url in url_list]
    results = await asyncio.gather(*tasks)
    return results

articles = asyncio.run(scrape_batch(urls))
for article in articles:
    print(article.status_code, article.url)

Scheduling recurring scrapes
If you monitor specific authors or tags, set up recurring scrapes instead of running manual jobs. AlterLab's scheduling system uses cron expressions.
import alterlab

client = alterlab.Client("YOUR_API_KEY")
schedule = client.schedules.create(
    url="https://medium.com/tag/machine-learning",
    cron="0 9 * * 1",
    formats=["json"],
    min_tier=3,
    webhook_url="https://your-server.com/webhook/medium-articles"
)
print(f"Schedule created: {schedule.id}")

This runs every Monday at 9 AM UTC, scrapes the machine-learning tag page, and pushes results to your webhook endpoint. You get fresh data without maintaining a cron daemon or retry logic.
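On the receiving side, a webhook endpoint can be very small. Here is a standard-library sketch of a handler that accepts a POSTed JSON payload and hands it off; the payload shape is an assumption (check AlterLab's webhook documentation), and a production endpoint would also verify a delivery signature and respond quickly.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # stand-in for your real processing pipeline

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        received.append(payload)      # hand off to your pipeline here
        self.send_response(200)       # acknowledge promptly
        self.end_headers()

    def log_message(self, *args):     # silence default request logging
        pass

def build_server(port=0):
    """Bind the webhook server; port=0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), WebhookHandler)

# build_server().serve_forever()  # run behind TLS/a reverse proxy in practice
```

Acknowledging with a 200 before doing heavy processing is the usual pattern: most webhook senders retry on timeouts, so slow handlers cause duplicate deliveries.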
Cost management
Medium pages require headless browser rendering, which costs more per request than scraping static HTML. Each tier 3 scrape consumes more of your balance than a tier 1 request. Monitor your usage dashboard and set spend limits on your API keys to prevent unexpected charges.
For high-volume operations, consider caching results. If you scrape the same article URL multiple times, store the output locally and only re-scrape when you need updated engagement metrics. This reduces redundant requests and keeps costs predictable. Review AlterLab pricing to understand per-request costs at each tier and plan your budget accordingly.
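The caching idea above can be sketched as a small on-disk cache keyed by a hash of the URL, with a TTL controlling when a re-scrape is allowed. The `fetch` callable stands in for your client.scrape call; the function name and JSON-file layout are choices made for this sketch, not an AlterLab feature.

```python
import hashlib
import json
import time
from pathlib import Path

def cached_scrape(url, fetch, cache_dir="cache", ttl_seconds=86400):
    """Return cached output for `url` if fresh, otherwise fetch and store it."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = cache / f"{key}.json"
    if path.exists():
        entry = json.loads(path.read_text())
        if time.time() - entry["fetched_at"] < ttl_seconds:
            return entry["data"]  # fresh enough: skip the paid request
    data = fetch(url)
    path.write_text(json.dumps({"fetched_at": time.time(), "data": data}))
    return data

calls = []
def fake_fetch(url):
    calls.append(url)
    return {"title": "cached article"}

cached_scrape("https://medium.com/@a/x", fake_fetch)
cached_scrape("https://medium.com/@a/x", fake_fetch)
print(len(calls))  # 1: the second call is served from cache
```

A short TTL (hours) suits engagement metrics that change daily; article bodies rarely change, so a much longer TTL is usually safe for content.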
Error handling and retries
Network failures, temporary blocks, and page structure changes will cause individual requests to fail. Wrap your scraping calls in retry logic with exponential backoff.
import alterlab
import time

client = alterlab.Client("YOUR_API_KEY")

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.scrape(url, min_tier=3)
            if response.status_code == 200:
                return response
            time.sleep(2 ** attempt)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    return None

Log failed URLs separately so you can investigate whether failures stem from selector changes, temporary outages, or permanent page removals.
Key takeaways
Medium requires headless browser rendering due to its React-based architecture and anti-bot protections. Setting min_tier=3 ensures you skip tiers that cannot execute JavaScript.
Use formats=["json"] to get structured data instead of parsing volatile HTML class names. Add wait_for selectors to capture lazy-loaded engagement metrics. Space out requests to avoid application-level rate limits, and cache results to reduce redundant scrapes.
For recurring monitoring, use scheduled scrapes with webhooks instead of maintaining your own cron infrastructure. Set spend limits on API keys to control costs at scale.