
How to Scrape Instagram: Complete Guide for 2026
Learn how to scrape Instagram in 2026 with Python. Covers Meta's anti-bot protections, GraphQL endpoints, structured data extraction, and scaling pipelines reliably.
March 31, 2026
Instagram holds structured, commercially valuable data — follower counts, post engagement rates, hashtag velocity, business contact info — that Meta's official API surfaces only partially and only after a lengthy app review. For most data use cases, scraping is the practical path.
Getting that data reliably in 2026 is significantly harder than it was three years ago. Meta has progressively hardened Instagram against automated access. This guide covers what actually works: understanding the defenses, bypassing them with a managed API, and building a pipeline that produces clean structured JSON from a public profile page.
Why Scrape Instagram?
Three use cases drive the majority of Instagram scraping work in production:
Influencer analytics and campaign measurement. Brands need follower counts, engagement rates (likes + comments ÷ followers), and posting frequency for potential partners before signing contracts. Meta's official API doesn't expose this data at scale without app review approval — scraping fills the gap.
Competitive intelligence. E-commerce brands track competitor product launches by monitoring branded hashtags, post frequency, and comment sentiment across 50–200 accounts on a recurring schedule. This is a standard data ops task that doesn't fit inside the official API's rate limits.
Academic and social research. Researchers studying public health communication, political messaging, and misinformation use Instagram post metadata — timestamps, caption text, engagement counts, geotags — as primary data sources. IRB-approved research often relies on programmatic collection since the official research API has restrictive access requirements.
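The engagement-rate formula mentioned above is simple enough to pin down in code. A minimal helper (the sample numbers are illustrative, not real account data):

```python
def engagement_rate(likes: int, comments: int, followers: int) -> float:
    """Average engagement per post as a fraction of the follower base:
    (likes + comments) / followers."""
    if followers == 0:
        return 0.0
    return (likes + comments) / followers

# A post with 150,000 likes and 2,000 comments on a 10M-follower account:
rate = engagement_rate(150_000, 2_000, 10_000_000)
print(f"{rate:.2%}")  # 1.52%
```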
Anti-Bot Challenges on instagram.com
Instagram's defenses are among the most sophisticated on the consumer web. Here's exactly what you're up against:
Login walls. Since 2024, Instagram redirects unauthenticated requests for most content to a login page. Even "public" profiles require a logged-in session to view more than a preview. This breaks requests-based scrapers immediately — you get a login redirect before you see any profile data.
JavaScript rendering. The profile page is a React SPA. The HTML served to a plain HTTP client contains almost no user data. Everything is hydrated client-side via GraphQL calls after the initial page load. You need a headless browser with full JavaScript execution to see what a real user sees.
Fingerprinting and behavioral analysis. Meta runs TLS fingerprinting, canvas fingerprinting, and mouse movement heuristics. Headless Chromium with default settings is detected and blocked almost immediately. Stealth configuration — patching navigator.webdriver, spoofing canvas, mimicking realistic viewport and timing — is required and needs ongoing maintenance as detection techniques evolve.
GraphQL endpoint churn. Instagram's internal API (the https://www.instagram.com/api/v1/ family) changes frequently. Query hashes get invalidated. Endpoint paths get restructured. A scraper that worked three months ago may 404 today without any announcement.
Aggressive rate limiting. Rate limits apply per IP, per session, and per account. Exceeding them triggers CAPTCHAs or temporary blocks lasting hours. Even with a valid session, sequential requests from the same IP saturate limits quickly.
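When a request does trip a limit, the standard mitigation is exponential backoff with jitter before retrying. A minimal sketch (the `fetch` callable and its error behavior are placeholders, not part of any SDK):

```python
import random
import time

def fetch_with_backoff(fetch, max_retries: int = 5, base_delay: float = 2.0):
    """Retry a zero-argument `fetch` callable with exponential backoff.
    `fetch` is assumed to raise on a rate-limit or block response."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # 2s, 4s, 8s, ... plus proportional jitter to avoid
            # synchronized retries across workers
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```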
Managing all of this yourself — maintaining stealth browser configs, rotating residential IPs, handling session expiry, tracking API changes — is a significant ongoing engineering cost. AlterLab's anti-bot bypass API handles the full stack: stealth browser rendering, residential proxy rotation, and session management, so you get clean HTML or JSON back from a single API call.
Quick Start with AlterLab API
Install the SDK and grab your API key by following the Getting started guide.
The minimal example — fetching a public Instagram profile page with full JavaScript rendering:
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.instagram.com/natgeo/",
    render_js=True,         # Full headless browser render
    premium_proxy=True,     # Residential proxy pool
    wait_for="#react-root"  # Wait for React hydration before capture
)

print(response.status_code)
print(response.text[:500])
```

For Instagram's internal profile API endpoint, which returns structured JSON and doesn't require a browser render:
```python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

username = "natgeo"
api_url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"

response = client.scrape(
    api_url,
    headers={
        "X-IG-App-ID": "936619743392459",
        "X-Requested-With": "XMLHttpRequest",
    },
    render_js=False,  # This endpoint returns JSON directly
    premium_proxy=True
)

data = json.loads(response.text)
user = data["data"]["user"]
print(f"Username: {user['username']}")
print(f"Followers: {user['edge_followed_by']['count']:,}")
print(f"Bio: {user['biography']}")
```

The same request via cURL, for testing or non-Python environments:
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.instagram.com/api/v1/users/web_profile_info/?username=natgeo",
    "render_js": false,
    "premium_proxy": true,
    "headers": {
      "X-IG-App-ID": "936619743392459",
      "X-Requested-With": "XMLHttpRequest"
    }
  }'
```

Try scraping a public Instagram profile with AlterLab — no setup required
Extracting Structured Data
Once you have the raw response, parsing depends on whether you're working with the API JSON response or rendered page HTML.
From the Profile API Response
The web_profile_info endpoint returns a consistent JSON structure. Here's a complete extractor for the fields most pipelines need:
```python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

def fetch_raw(username: str) -> dict:
    url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"
    resp = client.scrape(
        url,
        headers={
            "X-IG-App-ID": "936619743392459",
            "X-Requested-With": "XMLHttpRequest",
        },
        premium_proxy=True
    )
    return json.loads(resp.text)["data"]["user"]

def extract_profile(username: str) -> dict:
    user = fetch_raw(username)
    posts = []
    for edge in user["edge_owner_to_timeline_media"]["edges"]:
        node = edge["node"]
        caption_edges = node.get("edge_media_to_caption", {}).get("edges", [])
        caption = caption_edges[0]["node"]["text"] if caption_edges else ""
        posts.append({
            "shortcode": node["shortcode"],
            "url": f"https://www.instagram.com/p/{node['shortcode']}/",
            "likes": node["edge_liked_by"]["count"],
            "comments": node["edge_media_to_comment"]["count"],
            "caption": caption[:280],
            "timestamp": node["taken_at_timestamp"],
            "is_video": node["is_video"],
        })
    return {
        "username": user["username"],
        "full_name": user["full_name"],
        "biography": user["biography"],
        "followers": user["edge_followed_by"]["count"],
        "following": user["edge_follow"]["count"],
        "post_count": user["edge_owner_to_timeline_media"]["count"],
        "is_verified": user["is_verified"],
        "profile_pic_url": user["profile_pic_url_hd"],
        "recent_posts": posts,
    }

profile = extract_profile("natgeo")
print(json.dumps(profile, indent=2))
```

JSON Path Reference
| Field | JSON Path |
|---|---|
| Username | data.user.username |
| Full name | data.user.full_name |
| Bio | data.user.biography |
| Follower count | data.user.edge_followed_by.count |
| Following count | data.user.edge_follow.count |
| Total posts | data.user.edge_owner_to_timeline_media.count |
| Verified badge | data.user.is_verified |
| Post shortcode | data.user.edge_owner_to_timeline_media.edges[N].node.shortcode |
| Post likes | data.user.edge_owner_to_timeline_media.edges[N].node.edge_liked_by.count |
| Post caption | data.user.edge_owner_to_timeline_media.edges[N].node.edge_media_to_caption.edges[0].node.text |
| Post timestamp | data.user.edge_owner_to_timeline_media.edges[N].node.taken_at_timestamp |
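A small helper (a sketch, not part of the AlterLab SDK) makes these dotted paths usable directly against the parsed response, returning a default instead of raising when a key is missing:

```python
from typing import Any

def get_path(data: dict, path: str, default: Any = None) -> Any:
    """Walk a dotted path like 'data.user.edge_followed_by.count',
    supporting [N] list indices, e.g. 'edges[0].node.shortcode'."""
    current: Any = data
    for part in path.replace("]", "").split("."):
        for key in part.split("["):
            if isinstance(current, dict):
                current = current.get(key)
            elif isinstance(current, list) and key.isdigit():
                idx = int(key)
                current = current[idx] if idx < len(current) else None
            else:
                return default
            if current is None:
                return default
    return current

sample = {"data": {"user": {"edge_followed_by": {"count": 42}}}}
print(get_path(sample, "data.user.edge_followed_by.count"))  # 42
print(get_path(sample, "data.user.biography", default=""))   # missing key, returns the default ""
```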
HTML Fallback
If the API endpoint returns a login redirect, fall back to parsing the rendered page. Instagram injects state data into <script type="application/json"> tags. The meta[name="description"] selector is the most stable fallback since Meta embeds follower counts in the page description:
```python
from bs4 import BeautifulSoup
import json

def extract_from_html(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Attempt to parse inline JSON state blocks
    for tag in soup.find_all("script", {"type": "application/json"}):
        try:
            data = json.loads(tag.string or "")
            raw = json.dumps(data)
            if '"biography"' in raw and '"edge_followed_by"' in raw:
                return data  # Found the right block
        except (json.JSONDecodeError, TypeError):
            continue
    # Stable CSS fallback — meta description embeds follower/post counts
    meta = soup.select_one('meta[name="description"]')
    return {
        "meta_description": meta["content"] if meta else None,
    }
```

Note: Instagram's CSS class names on visible DOM elements change with every deployment. Avoid selectors based on hashed class names like ._aacl. Prefer structural selectors or data-* attributes.
Common Pitfalls
Misreading login walls as successful responses. Instagram returns HTTP 200 even when serving a login modal. Always check the body before parsing:
```python
def is_login_wall(text: str) -> bool:
    indicators = [
        '"loginUrl"',
        "Log in to Instagram",
        '"requiresLogin":true',
    ]
    return any(indicator in text for indicator in indicators)

response = client.scrape(url, render_js=True, premium_proxy=True)
if is_login_wall(response.text):
    raise ValueError("Login wall detected — session or proxy rotation needed")
```

Relying on graphql/query/ with hardcoded query_hash values. These hashes are invalidated by Meta regularly without notice. The api/v1/ endpoint family is more stable. Store endpoint URLs as configuration values, not hardcoded strings.
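One way to follow that advice is to centralize endpoints in a small config object. A sketch (the paths and app ID mirror the values used earlier in this guide; none of them are guaranteed stable):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InstagramEndpoints:
    """Central config so an endpoint change is a one-line fix."""
    base: str = "https://www.instagram.com"
    profile_info: str = "/api/v1/users/web_profile_info/"
    app_id: str = "936619743392459"

    def profile_url(self, username: str) -> str:
        return f"{self.base}{self.profile_info}?username={username}"

IG = InstagramEndpoints()
print(IG.profile_url("natgeo"))
# https://www.instagram.com/api/v1/users/web_profile_info/?username=natgeo
```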
Ignoring pagination. The web_profile_info endpoint returns only the 12 most recent posts. For full post history, paginate using edge_owner_to_timeline_media.page_info.end_cursor and issue subsequent requests with the cursor as the after variable.
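The cursor loop can be sketched generically. Here `fetch_page` is a hypothetical callable that returns one `edge_owner_to_timeline_media` page; wiring it to an actual endpoint and cursor variable is left to your client:

```python
def paginate_posts(fetch_page, max_pages: int = 10):
    """Generic cursor loop. `fetch_page(cursor)` returns one
    edge_owner_to_timeline_media dict (first call with cursor=None)."""
    cursor = None
    for _ in range(max_pages):
        media = fetch_page(cursor)
        for edge in media.get("edges", []):
            yield edge["node"]
        page_info = media.get("page_info", {})
        if not page_info.get("has_next_page"):
            break  # No more pages to fetch
        cursor = page_info.get("end_cursor")
```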
Timestamp timezone mistakes. Instagram returns Unix timestamps in seconds (taken_at_timestamp). Convert with datetime.fromtimestamp(ts, tz=timezone.utc) and store as UTC; note that datetime.utcfromtimestamp() is deprecated since Python 3.12 and returns a naive datetime. Converting to local time at ingestion causes subtle bugs when processing data across regions.
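A minimal sketch of the safe conversion:

```python
from datetime import datetime, timezone

def post_time_utc(taken_at_timestamp: int) -> datetime:
    """Convert Instagram's Unix seconds into a timezone-aware UTC datetime.
    fromtimestamp with an explicit tz avoids the naive datetimes produced
    by the deprecated utcfromtimestamp."""
    return datetime.fromtimestamp(taken_at_timestamp, tz=timezone.utc)

ts = post_time_utc(1_700_000_000)
print(ts.isoformat())  # 2023-11-14T22:13:20+00:00
```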
Sequential requests without concurrency control. Sending requests one-by-one through the same proxy IP saturates per-IP rate limits quickly. The fix is async requests with a concurrency semaphore — covered in the next section.
Scaling Up
Async Batch Scraping
```python
import asyncio
import alterlab
import json
from typing import List

client = alterlab.AsyncClient("YOUR_API_KEY")

async def scrape_profile(username: str) -> dict:
    url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"
    try:
        resp = await client.scrape(
            url,
            headers={
                "X-IG-App-ID": "936619743392459",
                "X-Requested-With": "XMLHttpRequest",
            },
            premium_proxy=True
        )
        data = json.loads(resp.text)
        return {"username": username, "data": data["data"]["user"], "error": None}
    except Exception as exc:
        return {"username": username, "data": None, "error": str(exc)}

async def batch_scrape(usernames: List[str], concurrency: int = 15) -> List[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(u):
        async with sem:
            return await scrape_profile(u)

    return await asyncio.gather(*[bounded(u) for u in usernames])

# Usage
usernames = ["natgeo", "nasa", "time", "bbcnews", "cnn", "vogue", "wired"]
results = asyncio.run(batch_scrape(usernames))
for r in results:
    if r["error"]:
        print(f"FAILED {r['username']}: {r['error']}")
    else:
        u = r["data"]
        print(f"{u['username']:20s} — {u['edge_followed_by']['count']:>12,} followers")
```

Cost Modeling
AlterLab charges per successful scrape. Calls to Instagram's JSON endpoint with render_js=False bill as standard requests. Full headless browser renders (render_js=True) cost more per request but are required for content not accessible via the API endpoint.
For a pipeline scraping 10,000 profiles per month with a 70/30 split between API calls and browser renders, see the AlterLab pricing calculator for current per-request rates and volume discount tiers.
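The request-split arithmetic for that example volume, with placeholder per-request rates (substitute real numbers from the pricing page):

```python
def monthly_cost(profiles: int, render_share: float,
                 api_rate: float, render_rate: float) -> float:
    """Split a monthly volume into standard vs. browser-render requests
    and price each tier. Rates here are hypothetical placeholders."""
    renders = int(profiles * render_share)
    api_calls = profiles - renders
    return round(api_calls * api_rate + renders * render_rate, 2)

# 10,000 profiles, 70% API calls / 30% browser renders, placeholder rates:
print(monthly_cost(10_000, 0.30, api_rate=0.001, render_rate=0.005))
# 22.0  (7,000 standard requests + 3,000 renders)
```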
Key Takeaways
- Instagram's login walls, JavaScript rendering, and GraphQL API churn make DIY scraping brittle. Expect to spend significant engineering time just keeping a homegrown scraper running, not building value on top of data.
- The web_profile_info endpoint is the most reliable path to structured profile and post data without a full browser render. Use it as your first option; fall back to rendered HTML when it's blocked.
- Always validate responses for login walls before parsing. A 200 response with a login modal body is the most common silent failure mode in Instagram scrapers.
- Store raw JSON before parsing. Data structures change. Raw storage lets you re-parse without re-scraping, which saves both time and API credits.
- Use async requests with a concurrency semaphore. Proxy rotation distributes load across IPs, but only if you're not hitting a single IP faster than requests can be distributed. Cap concurrency at 10–20 for Instagram.
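The raw-storage takeaway above can be sketched in a few lines (the file layout and naming scheme are illustrative choices, not a prescribed format):

```python
import json
import time
from pathlib import Path

def store_raw(username: str, payload: dict, root: str = "raw") -> Path:
    """Write the untouched API response to disk, keyed by username and
    capture time, so parsers can be re-run without re-scraping."""
    out_dir = Path(root) / username
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{int(time.time())}.json"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path
```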
Related Guides
Building a multi-platform social data pipeline? The same patterns apply across the major platforms:
- How to Scrape Reddit — public subreddit and post data via Reddit's API with reliable HTML fallback
- How to Scrape Twitter/X — profile, tweet, and engagement data from X's heavily rate-limited endpoints
- How to Scrape TikTok — video metadata, creator stats, and trending content extraction at scale