
How to Scrape Instagram: Complete Guide for 2026
Learn how to scrape Instagram in 2026 with Python. Covers Meta's anti-bot protections, GraphQL endpoints, structured data extraction, and scaling pipelines reliably.
March 31, 2026
Instagram holds structured, commercially valuable data — follower counts, post engagement rates, hashtag velocity, business contact info — that Meta's official API surfaces only partially and only after a lengthy app review. For most data use cases, scraping is the practical path.
Getting that data reliably in 2026 is significantly harder than it was three years ago. Meta has progressively hardened Instagram against automated access. This guide covers what actually works: understanding the defenses, bypassing them with a managed API, and building a pipeline that produces clean structured JSON from a public profile page.
Why Scrape Instagram?
Three use cases drive the majority of Instagram scraping work in production:
Influencer analytics and campaign measurement. Brands need follower counts, engagement rates (likes + comments ÷ followers), and posting frequency for potential partners before signing contracts. Meta's official API doesn't expose this data at scale without app review approval — scraping fills the gap.
Competitive intelligence. E-commerce brands track competitor product launches by monitoring branded hashtags, post frequency, and comment sentiment across 50–200 accounts on a recurring schedule. This is a standard data ops task that doesn't fit inside the official API's rate limits.
Academic and social research. Researchers studying public health communication, political messaging, and misinformation use Instagram post metadata — timestamps, caption text, engagement counts, geotags — as primary data sources. IRB-approved research often relies on programmatic collection since the official research API has restrictive access requirements.
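The engagement-rate formula mentioned above is simple enough to pin down in code. A minimal helper (the sample numbers are illustrative, not real account data):

```python
def engagement_rate(likes: int, comments: int, followers: int) -> float:
    """Average engagement per post as a fraction of the follower base:
    (likes + comments) / followers."""
    if followers == 0:
        return 0.0
    return (likes + comments) / followers

# A post with 150,000 likes and 2,000 comments on a 10M-follower account:
rate = engagement_rate(150_000, 2_000, 10_000_000)
print(f"{rate:.2%}")  # 1.52%
```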
Anti-Bot Challenges on instagram.com
Instagram's defenses are among the most sophisticated on the consumer web. Here's exactly what you're up against:
Login walls. Since 2024, Instagram redirects unauthenticated requests for most content to a login page. Even "public" profiles require a logged-in session to view more than a preview. This breaks requests-based scrapers immediately — you get a login redirect before you see any profile data.
JavaScript rendering. The profile page is a React SPA. The HTML served to a plain HTTP client contains almost no user data. Everything is hydrated client-side via GraphQL calls after the initial page load. You need a headless browser with full JavaScript execution to see what a real user sees.
Fingerprinting and behavioral analysis. Meta runs TLS fingerprinting, canvas fingerprinting, and mouse movement heuristics. Headless Chromium with default settings is detected and blocked almost immediately. Stealth configuration — patching navigator.webdriver, spoofing canvas, mimicking realistic viewport and timing — is required and needs ongoing maintenance as detection techniques evolve.
GraphQL endpoint churn. Instagram's internal API (the https://www.instagram.com/api/v1/ family) changes frequently. Query hashes get invalidated. Endpoint paths get restructured. A scraper that worked three months ago may 404 today without any announcement.
Aggressive rate limiting. Rate limits apply per IP, per session, and per account. Exceeding them triggers CAPTCHAs or temporary blocks lasting hours. Even with a valid session, sequential requests from the same IP saturate limits quickly.
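When a request does trip a limit, the standard mitigation is exponential backoff with jitter before retrying. A minimal sketch (the `fetch` callable and its error behavior are placeholders, not part of any SDK):

```python
import random
import time

def fetch_with_backoff(fetch, max_retries: int = 5, base_delay: float = 2.0):
    """Retry a zero-argument `fetch` callable with exponential backoff.
    `fetch` is assumed to raise on a rate-limit or block response."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # 2s, 4s, 8s, ... plus proportional jitter to avoid
            # synchronized retries across workers
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```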
Managing all of this yourself — maintaining stealth browser configs, rotating residential IPs, handling session expiry, tracking API changes — is a significant ongoing engineering cost. AlterLab's anti-bot bypass API handles the full stack: stealth browser rendering, residential proxy rotation, and session management, so you get clean HTML or JSON back from a single API call.
Quick Start with AlterLab API
Install the SDK and grab your API key by following the Getting started guide.
The minimal example — fetching a public Instagram profile page with full JavaScript rendering:
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.instagram.com/natgeo/",
    render_js=True,         # Full headless browser render
    premium_proxy=True,     # Residential proxy pool
    wait_for="#react-root"  # Wait for React hydration before capture
)

print(response.status_code)
print(response.text[:500])
```

For Instagram's internal profile API endpoint, which returns structured JSON and doesn't require a browser render:
```python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

username = "natgeo"
api_url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"

response = client.scrape(
    api_url,
    headers={
        "X-IG-App-ID": "936619743392459",
        "X-Requested-With": "XMLHttpRequest",
    },
    render_js=False,  # This endpoint returns JSON directly
    premium_proxy=True
)

data = json.loads(response.text)
user = data["data"]["user"]
print(f"Username: {user['username']}")
print(f"Followers: {user['edge_followed_by']['count']:,}")
print(f"Bio: {user['biography']}")
```

The same request via cURL, for testing or non-Python environments:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.instagram.com/api/v1/users/web_profile_info/?username=natgeo",
    "render_js": false,
    "premium_proxy": true,
    "headers": {
      "X-IG-App-ID": "936619743392459",
      "X-Requested-With": "XMLHttpRequest"
    }
  }'
```

Try scraping a public Instagram profile with AlterLab — no setup required
Extracting Structured Data
Once you have the raw response, parsing depends on whether you're working with the API JSON response or rendered page HTML.
From the Profile API Response
The web_profile_info endpoint returns a consistent JSON structure. Here's a complete extractor for the fields most pipelines need:
```python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

def fetch_raw(username: str) -> dict:
    url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"
    resp = client.scrape(
        url,
        headers={
            "X-IG-App-ID": "936619743392459",
            "X-Requested-With": "XMLHttpRequest",
        },
        premium_proxy=True
    )
    return json.loads(resp.text)["data"]["user"]

def extract_profile(username: str) -> dict:
    user = fetch_raw(username)
    posts = []
    for edge in user["edge_owner_to_timeline_media"]["edges"]:
        node = edge["node"]
        caption_edges = node.get("edge_media_to_caption", {}).get("edges", [])
        caption = caption_edges[0]["node"]["text"] if caption_edges else ""
        posts.append({
            "shortcode": node["shortcode"],
            "url": f"https://www.instagram.com/p/{node['shortcode']}/",
            "likes": node["edge_liked_by"]["count"],
            "comments": node["edge_media_to_comment"]["count"],
            "caption": caption[:280],
            "timestamp": node["taken_at_timestamp"],
            "is_video": node["is_video"],
        })
    return {
        "username": user["username"],
        "full_name": user["full_name"],
        "biography": user["biography"],
        "followers": user["edge_followed_by"]["count"],
        "following": user["edge_follow"]["count"],
        "post_count": user["edge_owner_to_timeline_media"]["count"],
        "is_verified": user["is_verified"],
        "profile_pic_url": user["profile_pic_url_hd"],
        "recent_posts": posts,
    }

profile = extract_profile("natgeo")
print(json.dumps(profile, indent=2))
```

JSON Path Reference
| Field | JSON Path |
|---|---|
| Username | data.user.username |
| Full name | data.user.full_name |
| Bio | data.user.biography |
| Follower count | data.user.edge_followed_by.count |
| Following count | data.user.edge_follow.count |
| Total posts | data.user.edge_owner_to_timeline_media.count |
| Verified badge | data.user.is_verified |
| Post shortcode | data.user.edge_owner_to_timeline_media.edges[N].node.shortcode |
| Post likes | data.user.edge_owner_to_timeline_media.edges[N].node.edge_liked_by.count |
| Post caption | data.user.edge_owner_to_timeline_media.edges[N].node.edge_media_to_caption.edges[0].node.text |
| Post timestamp | data.user.edge_owner_to_timeline_media.edges[N].node.taken_at_timestamp |
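A small helper (a sketch, not part of the AlterLab SDK) makes these dotted paths usable directly against the parsed response, returning a default instead of raising when a key is missing:

```python
from typing import Any

def get_path(data: dict, path: str, default: Any = None) -> Any:
    """Walk a dotted path like 'data.user.edge_followed_by.count',
    supporting [N] list indices, e.g. 'edges[0].node.shortcode'."""
    current: Any = data
    for part in path.replace("]", "").split("."):
        for key in part.split("["):
            if isinstance(current, dict):
                current = current.get(key)
            elif isinstance(current, list) and key.isdigit():
                idx = int(key)
                current = current[idx] if idx < len(current) else None
            else:
                return default
            if current is None:
                return default
    return current

sample = {"data": {"user": {"edge_followed_by": {"count": 42}}}}
print(get_path(sample, "data.user.edge_followed_by.count"))  # 42
print(get_path(sample, "data.user.biography", default=""))   # missing key, returns the default ""
```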
HTML Fallback
If the API endpoint returns a login redirect, fall back to parsing the rendered page. Instagram injects state data into <script type="application/json"> tags. The meta[name="description"] selector is the most stable fallback since Meta embeds follower counts in the page description:
```python
from bs4 import BeautifulSoup
import json

def extract_from_html(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Attempt to parse inline JSON state blocks
    for tag in soup.find_all("script", {"type": "application/json"}):
        try:
            data = json.loads(tag.string or "")
            raw = json.dumps(data)
            if '"biography"' in raw and '"edge_followed_by"' in raw:
                return data  # Found the right block
        except (json.JSONDecodeError, TypeError):
            continue
    # Stable CSS fallback — meta description embeds follower/post counts
    meta = soup.select_one('meta[name="description"]')
    return {
        "meta_description": meta["content"] if meta else None,
    }
```

Note: Instagram's CSS class names on visible DOM elements change with every deployment. Avoid selectors based on hashed class names like ._aacl. Prefer structural selectors or data-* attributes.
Common Pitfalls
Misreading login walls as successful responses. Instagram returns HTTP 200 even when serving a login modal. Always check the body before parsing:
```python
def is_login_wall(text: str) -> bool:
    indicators = [
        '"loginUrl"',
        "Log in to Instagram",
        '"requiresLogin":true',
    ]
    return any(indicator in text for indicator in indicators)

response = client.scrape(url, render_js=True, premium_proxy=True)
if is_login_wall(response.text):
    raise ValueError("Login wall detected — session or proxy rotation needed")
```

Relying on graphql/query/ with hardcoded query_hash values. These hashes are invalidated by Meta regularly without notice. The api/v1/ endpoint family is more stable. Store endpoint URLs as configuration values, not hardcoded strings.
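One way to follow that advice is to centralize endpoints in a small config object. A sketch (the paths and app ID mirror the values used earlier in this guide; none of them are guaranteed stable):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InstagramEndpoints:
    """Central config so an endpoint change is a one-line fix."""
    base: str = "https://www.instagram.com"
    profile_info: str = "/api/v1/users/web_profile_info/"
    app_id: str = "936619743392459"

    def profile_url(self, username: str) -> str:
        return f"{self.base}{self.profile_info}?username={username}"

IG = InstagramEndpoints()
print(IG.profile_url("natgeo"))
# https://www.instagram.com/api/v1/users/web_profile_info/?username=natgeo
```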
Ignoring pagination. The web_profile_info endpoint returns only the 12 most recent posts. For full post history, paginate using edge_owner_to_timeline_media.page_info.end_cursor and issue subsequent requests with the cursor as the after variable.
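The cursor loop can be sketched generically. Here `fetch_page` is a hypothetical callable that returns one `edge_owner_to_timeline_media` page; wiring it to an actual endpoint and cursor variable is left to your client:

```python
def paginate_posts(fetch_page, max_pages: int = 10):
    """Generic cursor loop. `fetch_page(cursor)` returns one
    edge_owner_to_timeline_media dict (first call with cursor=None)."""
    cursor = None
    for _ in range(max_pages):
        media = fetch_page(cursor)
        for edge in media.get("edges", []):
            yield edge["node"]
        page_info = media.get("page_info", {})
        if not page_info.get("has_next_page"):
            break  # No more pages to fetch
        cursor = page_info.get("end_cursor")
```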
Timestamp timezone mistakes. Instagram returns Unix timestamps in seconds (taken_at_timestamp). Convert with datetime.fromtimestamp(ts, tz=timezone.utc) and store as UTC; note that datetime.utcfromtimestamp() is deprecated since Python 3.12 and returns a naive datetime. Converting to local time at ingestion causes subtle bugs when processing data across regions.
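A minimal sketch of the safe conversion:

```python
from datetime import datetime, timezone

def post_time_utc(taken_at_timestamp: int) -> datetime:
    """Convert Instagram's Unix seconds into a timezone-aware UTC datetime.
    fromtimestamp with an explicit tz avoids the naive datetimes produced
    by the deprecated utcfromtimestamp."""
    return datetime.fromtimestamp(taken_at_timestamp, tz=timezone.utc)

ts = post_time_utc(1_700_000_000)
print(ts.isoformat())  # 2023-11-14T22:13:20+00:00
```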
Sequential requests without concurrency control. Sending requests one-by-one through the same proxy IP saturates per-IP rate limits quickly. The fix is async requests with a concurrency semaphore — covered in the next section.
Scaling Up
Async Batch Scraping
```python
import asyncio
import alterlab
import json
from typing import List

client = alterlab.AsyncClient("YOUR_API_KEY")

async def scrape_profile(username: str) -> dict:
    url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"
    try:
        resp = await client.scrape(
            url,
            headers={
                "X-IG-App-ID": "936619743392459",
                "X-Requested-With": "XMLHttpRequest",
            },
            premium_proxy=True
        )
        data = json.loads(resp.text)
        return {"username": username, "data": data["data"]["user"], "error": None}
    except Exception as exc:
        return {"username": username, "data": None, "error": str(exc)}

async def batch_scrape(usernames: List[str], concurrency: int = 15) -> List[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(u):
        async with sem:
            return await scrape_profile(u)

    return await asyncio.gather(*[bounded(u) for u in usernames])

# Usage
usernames = ["natgeo", "nasa", "time", "bbcnews", "cnn", "vogue", "wired"]
results = asyncio.run(batch_scrape(usernames))
for r in results:
    if r["error"]:
        print(f"FAILED {r['username']}: {r['error']}")
    else:
        u = r["data"]
        print(f"{u['username']:20s} — {u['edge_followed_by']['count']:>12,} followers")
```

Cost Modeling
AlterLab charges per successful scrape. Calls to Instagram's JSON endpoint with render_js=False bill as standard requests. Full headless browser renders (render_js=True) cost more per request but are required for content not accessible via the API endpoint.
For a pipeline scraping 10,000 profiles per month with a 70/30 split between API calls and browser renders, see the AlterLab pricing calculator for current per-request rates and volume discount tiers.
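The request-split arithmetic for that example volume, with placeholder per-request rates (substitute real numbers from the pricing page):

```python
def monthly_cost(profiles: int, render_share: float,
                 api_rate: float, render_rate: float) -> float:
    """Split a monthly volume into standard vs. browser-render requests
    and price each tier. Rates here are hypothetical placeholders."""
    renders = int(profiles * render_share)
    api_calls = profiles - renders
    return round(api_calls * api_rate + renders * render_rate, 2)

# 10,000 profiles, 70% API calls / 30% browser renders, placeholder rates:
print(monthly_cost(10_000, 0.30, api_rate=0.001, render_rate=0.005))
# 22.0  (7,000 standard requests + 3,000 renders)
```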
Key Takeaways
- Instagram's login walls, JavaScript rendering, and GraphQL API churn make DIY scraping brittle. Expect to spend significant engineering time just keeping a homegrown scraper running, not building value on top of data.
- The web_profile_info endpoint is the most reliable path to structured profile and post data without a full browser render. Use it as your first option; fall back to rendered HTML when it's blocked.
- Always validate responses for login walls before parsing. A 200 response with a login modal body is the most common silent failure mode in Instagram scrapers.
- Store raw JSON before parsing. Data structures change. Raw storage lets you re-parse without re-scraping, which saves both time and API credits.
- Use async requests with a concurrency semaphore. Proxy rotation distributes load across IPs, but only if you're not hitting a single IP faster than requests can be distributed. Cap concurrency at 10–20 for Instagram.
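The raw-storage takeaway above can be sketched in a few lines (the file layout and naming scheme are illustrative choices, not a prescribed format):

```python
import json
import time
from pathlib import Path

def store_raw(username: str, payload: dict, root: str = "raw") -> Path:
    """Write the untouched API response to disk, keyed by username and
    capture time, so parsers can be re-run without re-scraping."""
    out_dir = Path(root) / username
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{int(time.time())}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path
```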
Related Guides
Building a multi-platform social data pipeline? The same patterns apply across the major platforms:
- How to Scrape Reddit — public subreddit and post data via Reddit's API with reliable HTML fallback
- How to Scrape Twitter/X — profile, tweet, and engagement data from X's heavily rate-limited endpoints
- How to Scrape TikTok — video metadata, creator stats, and trending content extraction at scale