
How to Scrape Instagram: Complete Guide for 2026

Learn how to scrape Instagram in 2026 with Python. Covers Meta's anti-bot protections, GraphQL endpoints, structured data extraction, and scaling pipelines reliably.

Yash Dubey

March 31, 2026

9 min read

Instagram holds structured, commercially valuable data — follower counts, post engagement rates, hashtag velocity, business contact info — that Meta's official API surfaces only partially and only after a lengthy app review. For most data use cases, scraping is the practical path.

Getting that data reliably in 2026 is significantly harder than it was three years ago. Meta has progressively hardened Instagram against automated access. This guide covers what actually works: understanding the defenses, bypassing them with a managed API, and building a pipeline that produces clean structured JSON from a public profile page.


Why Scrape Instagram?

Three use cases drive the majority of Instagram scraping work in production:

Influencer analytics and campaign measurement. Brands need follower counts, engagement rates ((likes + comments) ÷ followers), and posting frequency for potential partners before signing contracts. Meta's official API doesn't expose this data at scale without app review approval — scraping fills the gap.
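The engagement-rate formula above is simple enough to pin down in a few lines. A minimal helper (the example numbers are illustrative, not real account data):

```python
def engagement_rate(likes: int, comments: int, followers: int) -> float:
    """(likes + comments) / followers, expressed as a percentage."""
    if followers == 0:
        return 0.0
    return (likes + comments) / followers * 100

# A post with 12,000 likes and 300 comments on a 500,000-follower account:
print(f"{engagement_rate(12_000, 300, 500_000):.2f}%")  # 2.46%
```

In practice you'd average this across an account's recent posts rather than judging a single post.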

Competitive intelligence. E-commerce brands track competitor product launches by monitoring branded hashtags, post frequency, and comment sentiment across 50–200 accounts on a recurring schedule. This is a standard data ops task that doesn't fit inside the official API's rate limits.

Academic and social research. Researchers studying public health communication, political messaging, and misinformation use Instagram post metadata — timestamps, caption text, engagement counts, geotags — as primary data sources. IRB-approved research often relies on programmatic collection since the official research API has restrictive access requirements.

2B+ monthly active users · 99.1% AlterLab success rate · 1.4s avg response time · 0 CAPTCHAs to solve

Anti-Bot Challenges on instagram.com

Instagram's defenses are among the most sophisticated on the consumer web. Here's exactly what you're up against:

Login walls. Since 2024, Instagram redirects unauthenticated requests for most content to a login page. Even "public" profiles require a logged-in session to view more than a preview. This breaks requests-based scrapers immediately — you get a login redirect before you see any profile data.

JavaScript rendering. The profile page is a React SPA. The HTML served to a plain HTTP client contains almost no user data. Everything is hydrated client-side via GraphQL calls after the initial page load. You need a headless browser with full JavaScript execution to see what a real user sees.

Fingerprinting and behavioral analysis. Meta runs TLS fingerprinting, canvas fingerprinting, and mouse movement heuristics. Headless Chromium with default settings is detected and blocked almost immediately. Stealth configuration — patching navigator.webdriver, spoofing canvas, mimicking realistic viewport and timing — is required and needs ongoing maintenance as detection techniques evolve.

GraphQL and internal API churn. Instagram's internal endpoints (both the legacy graphql/query paths and the https://www.instagram.com/api/v1/ family) change frequently. Query hashes get invalidated. Endpoint paths get restructured. A scraper that worked three months ago may 404 today without any announcement.

Aggressive rate limiting. Rate limits apply per IP, per session, and per account. Exceeding them triggers CAPTCHAs or temporary blocks lasting hours. Even with a valid session, sequential requests from the same IP saturate limits quickly.
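On the client side, the standard mitigation for rate limits is retrying with exponential backoff and jitter. A minimal sketch — the idea that a blocked response surfaces as an exception is an assumption; adapt the error check to whatever your fetch layer actually raises:

```python
import random
import time

def scrape_with_backoff(fetch, url, max_retries=4, base_delay=2.0):
    """Call fetch(url), retrying with exponential backoff plus jitter.

    fetch is any callable that raises on a blocked or CAPTCHA response.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; surface the last error
            # 2s, 4s, 8s, 16s by default, with jitter so retries don't synchronize
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Backoff alone won't defeat per-IP limits — it just keeps a temporarily throttled session from digging itself deeper.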

Managing all of this yourself — maintaining stealth browser configs, rotating residential IPs, handling session expiry, tracking API changes — is a significant ongoing engineering cost. AlterLab's anti-bot bypass API handles the full stack: stealth browser rendering, residential proxy rotation, and session management, so you get clean HTML or JSON back from a single API call.


Quick Start with AlterLab API

Install the SDK and grab your API key by following the Getting started guide.

The minimal example — fetching a public Instagram profile page with full JavaScript rendering:

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.instagram.com/natgeo/",
    render_js=True,           # Full headless browser render
    premium_proxy=True,       # Residential proxy pool
    wait_for="#react-root"    # Wait for React hydration before capture
)

print(response.status_code)
print(response.text[:500])

For Instagram's internal profile API endpoint, which returns structured JSON and doesn't require a browser render:

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

username = "natgeo"
api_url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"

response = client.scrape(
    api_url,
    headers={
        "X-IG-App-ID": "936619743392459",
        "X-Requested-With": "XMLHttpRequest",
    },
    render_js=False,    # This endpoint returns JSON directly
    premium_proxy=True
)

data = json.loads(response.text)
user = data["data"]["user"]
print(f"Username:  {user['username']}")
print(f"Followers: {user['edge_followed_by']['count']:,}")
print(f"Bio:       {user['biography']}")

The same request via cURL, for testing or non-Python environments:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.instagram.com/api/v1/users/web_profile_info/?username=natgeo",
    "render_js": false,
    "premium_proxy": true,
    "headers": {
      "X-IG-App-ID": "936619743392459",
      "X-Requested-With": "XMLHttpRequest"
    }
  }'


Extracting Structured Data

Once you have the raw response, parsing depends on whether you're working with the API JSON response or rendered page HTML.

From the Profile API Response

The web_profile_info endpoint returns a consistent JSON structure. Here's a complete extractor for the fields most pipelines need:

Python
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")

def fetch_raw(username: str) -> dict:
    url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"
    resp = client.scrape(
        url,
        headers={
            "X-IG-App-ID": "936619743392459",
            "X-Requested-With": "XMLHttpRequest",
        },
        premium_proxy=True
    )
    return json.loads(resp.text)["data"]["user"]

def extract_profile(username: str) -> dict:
    user = fetch_raw(username)

    posts = []
    for edge in user["edge_owner_to_timeline_media"]["edges"]:
        node = edge["node"]
        caption_edges = node.get("edge_media_to_caption", {}).get("edges", [])
        caption = caption_edges[0]["node"]["text"] if caption_edges else ""

        posts.append({
            "shortcode": node["shortcode"],
            "url": f"https://www.instagram.com/p/{node['shortcode']}/",
            "likes": node["edge_liked_by"]["count"],
            "comments": node["edge_media_to_comment"]["count"],
            "caption": caption[:280],
            "timestamp": node["taken_at_timestamp"],
            "is_video": node["is_video"],
        })

    return {
        "username": user["username"],
        "full_name": user["full_name"],
        "biography": user["biography"],
        "followers": user["edge_followed_by"]["count"],
        "following": user["edge_follow"]["count"],
        "post_count": user["edge_owner_to_timeline_media"]["count"],
        "is_verified": user["is_verified"],
        "profile_pic_url": user["profile_pic_url_hd"],
        "recent_posts": posts,
    }

profile = extract_profile("natgeo")
print(json.dumps(profile, indent=2))

JSON Path Reference

Field             JSON Path
Username          data.user.username
Full name         data.user.full_name
Bio               data.user.biography
Follower count    data.user.edge_followed_by.count
Following count   data.user.edge_follow.count
Total posts       data.user.edge_owner_to_timeline_media.count
Verified badge    data.user.is_verified
Post shortcode    data.user.edge_owner_to_timeline_media.edges[N].node.shortcode
Post likes        data.user.edge_owner_to_timeline_media.edges[N].node.edge_liked_by.count
Post caption      data.user.edge_owner_to_timeline_media.edges[N].node.edge_media_to_caption.edges[0].node.text
Post timestamp    data.user.edge_owner_to_timeline_media.edges[N].node.taken_at_timestamp
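Paths in this dotted-plus-index style can be resolved generically with a small helper, which keeps field locations in config instead of scattered indexing code. This is a convenience sketch, not part of the AlterLab SDK:

```python
import re

def get_path(data, path):
    """Resolve a dotted JSON path like 'data.user.edges[0].node.text'."""
    current = data
    for part in path.split("."):
        indexed = re.fullmatch(r"(\w+)\[(\d+)\]", part)
        if indexed:
            # A part like 'edges[0]': key lookup, then list index
            current = current[indexed.group(1)][int(indexed.group(2))]
        else:
            current = current[part]
    return current

payload = {"data": {"user": {"edge_followed_by": {"count": 283_000_000}}}}
print(get_path(payload, "data.user.edge_followed_by.count"))  # 283000000
```

A missing key raises `KeyError`, which is usually what you want: a silent default hides schema changes.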

HTML Fallback

If the API endpoint returns a login redirect, fall back to parsing the rendered page. Instagram injects state data into <script type="application/json"> tags. The meta[name="description"] selector is the most stable fallback since Meta embeds follower counts in the page description:

Python
from bs4 import BeautifulSoup
import json

def extract_from_html(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    # Attempt to parse inline JSON state blocks
    for tag in soup.find_all("script", {"type": "application/json"}):
        try:
            data = json.loads(tag.string or "")
            raw = json.dumps(data)
            if '"biography"' in raw and '"edge_followed_by"' in raw:
                return data  # Found the right block
        except (json.JSONDecodeError, TypeError):
            continue

    # Stable CSS fallback — meta description embeds follower/post counts
    meta = soup.select_one('meta[name="description"]')
    return {
        "meta_description": meta["content"] if meta else None,
    }

Note: Instagram's CSS class names on visible DOM elements change with every deployment. Avoid selectors based on hashed class names like ._aacl. Prefer structural selectors or data-* attributes.


Common Pitfalls

Misreading login walls as successful responses. Instagram returns HTTP 200 even when serving a login modal. Always check the body before parsing:

Python
def is_login_wall(text: str) -> bool:
    indicators = [
        '"loginUrl"',
        "Log in to Instagram",
        '"requiresLogin":true',
    ]
    return any(indicator in text for indicator in indicators)

response = client.scrape(url, render_js=True, premium_proxy=True)
if is_login_wall(response.text):
    raise ValueError("Login wall detected — session or proxy rotation needed")

Relying on graphql/query/ with hardcoded query_hash values. These hashes are invalidated by Meta regularly without notice. The api/v1/ endpoint family is more stable. Store endpoint URLs as configuration values, not hardcoded strings.

Ignoring pagination. The web_profile_info endpoint returns only the 12 most recent posts. For full post history, paginate using edge_owner_to_timeline_media.page_info.end_cursor and issue subsequent requests with the cursor as the after variable.
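The cursor-following loop is easiest to keep correct as a generator. The sketch below takes a `fetch_page(username, after)` callable so the paging logic stays separate from the endpoint details, which (as noted above) change; the exact request shape for cursor pages is an assumption you'll need to verify against live traffic:

```python
def iter_posts(fetch_page, username):
    """Yield post nodes across pages, following end_cursor until exhausted.

    fetch_page(username, after) must return an edge_owner_to_timeline_media
    dict: {"edges": [...], "page_info": {"has_next_page": ..., "end_cursor": ...}}.
    """
    cursor = None
    while True:
        media = fetch_page(username, cursor)
        for edge in media["edges"]:
            yield edge["node"]
        page_info = media["page_info"]
        if not page_info["has_next_page"]:
            break
        cursor = page_info["end_cursor"]  # passed as the `after` variable next call
```

Because it's a generator, callers can stop early (say, after the first 50 posts) without fetching pages they don't need.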

Timestamp timezone mistakes. Instagram returns Unix timestamps in seconds (taken_at_timestamp). Convert with datetime.fromtimestamp(ts, tz=timezone.utc) (utcfromtimestamp() is deprecated since Python 3.12) and store as UTC. Local timezone conversion at ingestion time causes subtle bugs when processing data across regions.
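The conversion in one place, so every pipeline stage sees the same ISO-8601 UTC string:

```python
from datetime import datetime, timezone

def post_time_utc(taken_at_timestamp: int) -> str:
    """Convert Instagram's seconds-since-epoch field to an ISO-8601 UTC string."""
    return datetime.fromtimestamp(taken_at_timestamp, tz=timezone.utc).isoformat()

print(post_time_utc(1_700_000_000))  # 2023-11-14T22:13:20+00:00
```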

Sequential requests without concurrency control. Sending requests one-by-one through the same proxy IP saturates per-IP rate limits quickly. The fix is async requests with a concurrency semaphore — covered in the next section.


Scaling Up

Async Batch Scraping

Python
import asyncio
import alterlab
import json
from typing import List

client = alterlab.AsyncClient("YOUR_API_KEY")

async def scrape_profile(username: str) -> dict:
    url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"
    try:
        resp = await client.scrape(
            url,
            headers={
                "X-IG-App-ID": "936619743392459",
                "X-Requested-With": "XMLHttpRequest",
            },
            premium_proxy=True
        )
        data = json.loads(resp.text)
        return {"username": username, "data": data["data"]["user"], "error": None}
    except Exception as exc:
        return {"username": username, "data": None, "error": str(exc)}

async def batch_scrape(usernames: List[str], concurrency: int = 15) -> List[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(u):
        async with sem:
            return await scrape_profile(u)

    return await asyncio.gather(*[bounded(u) for u in usernames])

# Usage
usernames = ["natgeo", "nasa", "time", "bbcnews", "cnn", "vogue", "wired"]
results = asyncio.run(batch_scrape(usernames))

for r in results:
    if r["error"]:
        print(f"FAILED  {r['username']}: {r['error']}")
    else:
        u = r["data"]
        print(f"{u['username']:20s}{u['edge_followed_by']['count']:>12,} followers")

Cost Modeling

AlterLab charges per successful scrape. Calls to Instagram's JSON endpoint with render_js=False bill as standard requests. Full headless browser renders (render_js=True) cost more per request but are required for content not accessible via the API endpoint.

For a pipeline scraping 10,000 profiles per month with a 70/30 split between API calls and browser renders, see the AlterLab pricing calculator for current per-request rates and volume discount tiers.
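The arithmetic itself is straightforward to model. The per-request rates below are placeholders, not AlterLab's actual pricing — substitute the numbers from the pricing calculator:

```python
def monthly_cost(profiles, api_share=0.7, api_rate=0.002, render_rate=0.01):
    """Estimate monthly spend for a mixed API-call / browser-render pipeline.

    api_rate and render_rate are hypothetical per-request prices in USD.
    """
    api_calls = profiles * api_share          # cheap JSON-endpoint requests
    renders = profiles * (1 - api_share)      # pricier full browser renders
    return api_calls * api_rate + renders * render_rate

# 10,000 profiles at a 70/30 split, at the placeholder rates:
print(f"${monthly_cost(10_000):.2f}")  # $44.00
```

Even with made-up rates, the model makes the lever obvious: shifting work from renders to API calls dominates the bill.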


Key Takeaways

  • Instagram's login walls, JavaScript rendering, and GraphQL API churn make DIY scraping brittle. Expect to spend significant engineering time just keeping a homegrown scraper running, not building value on top of data.
  • The web_profile_info endpoint is the most reliable path to structured profile and post data without a full browser render. Use it as your first option; fall back to rendered HTML when it's blocked.
  • Always validate responses for login walls before parsing. A 200 response with a login modal body is the most common silent failure mode in Instagram scrapers.
  • Store raw JSON before parsing. Data structures change. Raw storage lets you re-parse without re-scraping, which saves both time and API credits.
  • Use async requests with a concurrency semaphore. Proxy rotation distributes load across IPs, but only if you're not hitting a single IP faster than requests can be distributed. Cap concurrency at 10–20 for Instagram.
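The raw-storage point above is cheap to implement: persist each response body, keyed by username and fetch time, before any parsing runs. A minimal sketch (the directory layout is just one reasonable choice):

```python
import time
from pathlib import Path

def store_raw(username: str, body: str, root: str = "raw") -> Path:
    """Write the raw response body to disk before parsing, so schema
    changes can be handled by re-parsing instead of re-scraping."""
    out_dir = Path(root) / username
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{int(time.time())}.json"
    path.write_text(body, encoding="utf-8")
    return path
```

Call it with `response.text` immediately after each successful scrape, then parse from the stored file.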

Building a multi-platform social data pipeline? The same patterns apply across the other major platforms.


Frequently Asked Questions

Is it legal to scrape Instagram?

Scraping publicly accessible Instagram data sits in a legal gray area. Instagram's Terms of Service prohibit automated access, but courts — including the Ninth Circuit's hiQ v. LinkedIn ruling — have generally held that scraping publicly available data does not violate the CFAA. Always consult legal counsel for your specific use case, particularly if you process personal data subject to GDPR or CCPA.

How does AlterLab bypass Instagram's anti-bot protections?

Instagram uses TLS fingerprinting, JavaScript-based browser detection, behavioral analysis, and aggressive rate limiting — each layer requiring separate mitigation. AlterLab's anti-bot bypass API handles stealth browser rendering, residential proxy rotation, and session management automatically, so you get a clean response without building or maintaining any of that infrastructure yourself.

How much does it cost to scrape Instagram?

Cost depends on request volume and render type. Calls to Instagram's internal JSON endpoints (render_js=false) bill as standard requests; full headless browser renders cost more per request. AlterLab's pricing page includes a calculator to model monthly spend, and volume discounts apply for pipelines running tens of thousands of requests per month.