AlterLab
How to Scrape LinkedIn Profiles and Company Data Without Getting Blocked in 2026
Web Scraping


Yash Dubey

March 17, 2026

10 min read

Scraping LinkedIn profiles and company data is one of the harder engineering problems in data extraction — not because LinkedIn's HTML is complex, but because their bot detection is aggressive, layered, and constantly updated. This guide covers what LinkedIn's defense stack actually looks like in 2026, which approaches still work, and how to build a pipeline that holds up under sustained load.


What You're Up Against

LinkedIn does not use a third-party bot protection vendor. Their detection is in-house and operates across several independent layers simultaneously:

TLS fingerprinting (JA3/JA3S): LinkedIn inspects the TLS handshake before your request is even parsed. Python's requests library has a well-known JA3 hash, and so does Node.js's https module. If your fingerprint matches a known automation signature, you're rate-limited or blocked before LinkedIn serves a single byte.

HTTP/2 settings fingerprinting: Beyond TLS, LinkedIn inspects the HTTP/2 SETTINGS frame — window size, header table size, stream concurrency. These values differ between real browsers and libraries like httpx or aiohttp.

Behavioral analysis: LinkedIn tracks profile view velocity per session, per IP, and per account. Viewing 40 profiles in 20 minutes from the same session triggers a soft block. Scraping 200 profiles/day from the same account triggers a permanent suspension.

IP reputation: Datacenter IPs (AWS, GCP, DigitalOcean, Hetzner) are near-universally blocked. LinkedIn has had years to compile ASN-level blocklists. Residential proxies are required.

Authentication wall: Most profile data — current job, past experience, education, connections — is behind login. Public profile pages show a truncated view and often redirect to the login wall after 2-3 requests from an unauthenticated session.

Understanding this stack tells you what tools are off the table immediately: raw requests, basic Selenium without stealth patches, and datacenter proxies. The approaches that still work in 2026 are headless browsers with fingerprint spoofing, proper session management with valid li_at cookies, and residential proxy rotation.


What Data Is Realistically Scrapable

Before writing a line of code, be precise about what you need:

| Data Type | Requires Login | Detection Risk | Notes |
| --- | --- | --- | --- |
| Company overview (name, size, industry, HQ) | No | Low | Public pages are stable |
| Company employee count | No | Low | Often in structured ld+json |
| Job postings | No | Low | LinkedIn Jobs is more open |
| Personal profile (headline, current role) | Soft | Medium | Truncated without auth |
| Full work history, education | Yes | High | Requires li_at session |
| Connection graph | Yes | Very High | Heavily monitored |
| Post/activity feed | Yes | High | Lazy-loaded, paginated |

Company pages are significantly more accessible than personal profiles. If your use case is firmographic enrichment — industry, headcount, location, description — you can get most of that from public company pages with modest precautions.

For personal profiles with full history, you need an authenticated session.


Approach 1: Scraping Public Company Pages

Company pages (linkedin.com/company/stripe/) render a meaningful amount of data without authentication. They also embed an ld+json block with structured data, which is far more reliable than scraping HTML class names (LinkedIn obfuscates these and changes them frequently).

Python
import asyncio
import json
import random
import httpx
from parsel import Selector

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Ch-Ua": '"Chromium";v="122", "Not(A:Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
}

async def scrape_company(slug: str, proxy: str) -> dict:
    url = f"https://www.linkedin.com/company/{slug}/"
    
    # HTTP/2 requires the h2 extra: pip install "httpx[http2]".
    # Note: httpx speaks HTTP/2 but does not replicate Chrome's TLS
    # fingerprint, so expect occasional blocks regardless.
    async with httpx.AsyncClient(
        http2=True,
        headers=HEADERS,
        proxy=proxy,
        follow_redirects=True,
        timeout=30.0,
    ) as client:
        resp = await client.get(url)
        resp.raise_for_status()
    
    sel = Selector(resp.text)
    
    # Extract structured data first — more reliable than class-based selectors
    ld_json_blocks = sel.css('script[type="application/ld+json"]::text').getall()
    structured = {}
    for block in ld_json_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        # ld+json payloads are sometimes a list of entities
        candidates = data if isinstance(data, list) else [data]
        for entity in candidates:
            if isinstance(entity, dict) and entity.get("@type") in ("Organization", "Corporation"):
                structured = entity
                break
        if structured:
            break
    
    # Fall back to meta tags for basics
    name = (
        structured.get("name")
        or sel.css('meta[property="og:title"]::attr(content)').get("")
    )
    description = (
        structured.get("description")
        or sel.css('meta[name="description"]::attr(content)').get("")
    )
    employee_count = structured.get("numberOfEmployees", {})
    
    return {
        "slug": slug,
        "name": name,
        "description": description,
        "url": structured.get("url"),
        "founded": structured.get("foundingDate"),
        "employee_range": employee_count.get("value") if isinstance(employee_count, dict) else None,
        "industry": structured.get("industry"),
        "headquarters": structured.get("address", {}).get("addressLocality"),
    }


async def scrape_batch(slugs: list[str], proxies: list[str]):
    results = []
    for slug in slugs:
        proxy = random.choice(proxies)
        try:
            data = await scrape_company(slug, proxy)
            results.append(data)
        except httpx.HTTPStatusError as e:
            print(f"[{slug}] HTTP {e.response.status_code}")
        # Randomized delay — critical for avoiding velocity detection
        await asyncio.sleep(random.uniform(2.5, 6.0))
    return results

A few things worth noting in this code:

  • http2=True matters. LinkedIn's servers prefer HTTP/2, and an HTTP/1.1 client looks anomalous.
  • Sec-Ch-Ua and Sec-Fetch-* headers are set by Chrome automatically. Their absence is a fingerprint.
  • The ld+json extraction is the most stable part of this pipeline. LinkedIn's obfuscated class names can change weekly; their schema.org structured data changes far less frequently.
  • The randomized delay (uniform(2.5, 6.0)) is not optional. Fixed intervals like time.sleep(2) are a pattern that detection systems flag.
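One wrinkle in the ld+json extraction: schema.org models numberOfEmployees as a QuantitativeValue, which shows up in the wild as a bare integer, a dict with a "value" key, or a min/max range. A small normalizer keeps the pipeline from silently dropping the field. This is a sketch; the function name and the exact shapes handled are assumptions, not a documented LinkedIn format:

```python
from typing import Optional


def normalize_employee_count(raw) -> Optional[str]:
    """Normalize a schema.org numberOfEmployees (QuantitativeValue) field.

    Handles three shapes seen in practice: a bare integer, a dict with
    "value", or a dict with "minValue"/"maxValue" describing a range.
    Returns a display string, or None when nothing usable is present.
    """
    if raw is None:
        return None
    if isinstance(raw, int):
        return str(raw)
    if isinstance(raw, dict):
        if "value" in raw:
            return str(raw["value"])
        lo, hi = raw.get("minValue"), raw.get("maxValue")
        if lo is not None and hi is not None:
            return f"{lo}-{hi}"
        if lo is not None:
            return f"{lo}+"
    return None
```

Plugging this in where the example above reads employee_count.get("value") makes the scraper tolerant of all three shapes instead of just the dict form.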

Approach 2: Full Profile Scraping with Playwright

For personal profiles with full work history, you need a real browser. httpx won't execute the JavaScript that renders the page content, and LinkedIn uses lazy-loading for most profile sections.

Use playwright with playwright-stealth to patch the automation indicators that Playwright exposes by default (navigator.webdriver, Chrome runtime, permission APIs, etc.).

Python
import asyncio
import json
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

# li_at is LinkedIn's primary session cookie.
# Obtain it from a logged-in browser session (DevTools → Application → Cookies).
LI_AT_COOKIE = "your_li_at_cookie_value_here"

PROFILE_SELECTORS = {
    "name": "h1.text-heading-xlarge",
    "headline": "div.text-body-medium.break-words",
    "location": "span.text-body-small.inline.t-black--light.break-words",
    "about": "div.display-flex.ph5.pv3 span.visually-hidden",
}

async def scrape_profile(url: str, proxy_server: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": proxy_server},
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = await browser.new_context(
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
        )
        
        # Inject the li_at session cookie before navigating
        await context.add_cookies([{
            "name": "li_at",
            "value": LI_AT_COOKIE,
            "domain": ".linkedin.com",
            "path": "/",
            "httpOnly": True,
            "secure": True,
        }])
        
        page = await context.new_page()
        await stealth_async(page)
        
        # Block images and fonts to reduce bandwidth and page load time
        await page.route("**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2,ttf}", 
                         lambda route: route.abort())
        
        await page.goto(url, wait_until="domcontentloaded", timeout=45_000)
        
        # Mimic scroll behavior — LinkedIn lazy-loads experience/education sections
        for _ in range(4):
            await page.mouse.wheel(0, random.randint(400, 800))
            await asyncio.sleep(random.uniform(0.8, 1.8))
        
        # Extract visible text fields
        result = {"url": url}
        for key, selector in PROFILE_SELECTORS.items():
            try:
                el = page.locator(selector).first
                result[key] = (await el.inner_text(timeout=5_000)).strip()
            except Exception:
                result[key] = None
        
        # Extract experience section
        experience = []
        exp_items = await page.locator(
            "li.artdeco-list__item.pvs-list__item--line-separated"
        ).all()
        for item in exp_items[:10]:  # cap to avoid long-running loops
            try:
                title = await item.locator("span[aria-hidden='true']").first.inner_text()
                experience.append(title.strip())
            except Exception:
                continue
        result["experience_titles"] = experience
        
        await browser.close()
        return result


async def run_pipeline(profile_urls: list[str], proxies: list[str]):
    for url in profile_urls:
        proxy = random.choice(proxies)
        data = await scrape_profile(url, proxy)
        print(json.dumps(data, indent=2))
        # LinkedIn monitors inter-request timing at the account level
        # Keep it well under 3 profiles/minute per session
        await asyncio.sleep(random.uniform(20, 40))

Key decisions in this code:

  • Stealth patching: playwright_stealth patches ~20 browser properties that Playwright exposes. Without it, navigator.webdriver === true and you're flagged immediately.
  • Cookie injection over login flow: Automating the login form is slower and creates a distinct behavioral pattern. Injecting li_at directly is cleaner. Treat it as a secret — rotate accounts periodically.
  • Resource blocking: Blocking images and fonts cuts page load from ~4MB to ~400KB and halves scrape time.
  • Scroll simulation: LinkedIn's experience and education sections don't render until scrolled into view. The mouse.wheel calls are not optional for complete data.
  • 20–40 second delay between profiles: This is not excessive caution — it's roughly what a human reads a profile in. Anything faster risks session suspension.
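When a session does trip detection, it usually surfaces as a redirect rather than an error status: the navigation lands on a login wall or challenge page. The URL markers below are observational assumptions (commonly seen paths, not a documented API), but checking page.url after goto is a cheap way to retire a burned session early:

```python
from urllib.parse import urlparse

# Path fragments commonly associated with LinkedIn blocks and challenges.
# These are observed patterns, not documented behavior — treat as assumptions.
BLOCK_PATH_MARKERS = ("/checkpoint/", "/authwall", "/uas/login")


def is_blocked(final_url: str) -> bool:
    """Return True when a navigation landed on a login wall or challenge
    page instead of the profile that was requested."""
    path = urlparse(final_url).path
    return any(marker in path for marker in BLOCK_PATH_MARKERS)
```

If this returns True for page.url after navigation, stop using that li_at cookie and proxy pairing rather than retrying; repeated hits against a checkpoint accelerate account suspension.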

Proxy Strategy

Residential proxies are non-negotiable for LinkedIn at any meaningful scale. The decision tree is:

  • < 100 profiles/day: A single residential IP rotated per session is sufficient. Services like Oxylabs, Bright Data, or Smartproxy offer sticky per-session IPs.
  • 100–1,000 profiles/day: Rotate IPs per request for unauthenticated scraping; keep them sticky per session once cookies are involved. Use geo-targeted proxies matching your LinkedIn account's expected location — a US account routing through a Bucharest IP is an anomaly signal.
  • > 1,000 profiles/day: You need multiple LinkedIn accounts, multiple residential proxy pools, and request distribution across both dimensions. At this scale, managing fingerprinting in-house becomes a significant maintenance burden.

For teams that want to skip the proxy infrastructure and browser fingerprint management, scraping APIs like AlterLab handle rotating proxies, TLS fingerprint spoofing, and JavaScript rendering in a single API call — useful when the scraping itself isn't your core engineering problem.


Rate Limiting and Request Patterns

LinkedIn's rate limiting operates at three independent levels:

IP level: Even with residential proxies, individual IPs have request budgets. Rotate IP per session, not per request, if you want to preserve cookie-based sessions. Rotating mid-session triggers a re-authentication challenge.

Account level: LinkedIn tracks profile view counts per authenticated session. Stay under 80–100 profile views per 24-hour period per account. This is a soft limit — exceeding it triggers an "unusual activity" checkpoint, not an immediate ban.

Velocity detection: The interval between sequential profile views matters more than the total count. A human researcher views a profile, reads it (45–90 seconds), then moves to the next. Spikes below 15 seconds between views consistently trigger flags.

Practical implementation:

Python
import time
import random
from dataclasses import dataclass, field
from collections import deque

@dataclass
class RateLimiter:
    max_per_hour: int = 60
    min_interval_seconds: float = 20.0
    _timestamps: deque = field(default_factory=deque)
    
    def wait(self):
        now = time.monotonic()
        
        # Enforce minimum interval
        if self._timestamps:
            elapsed = now - self._timestamps[-1]
            if elapsed < self.min_interval_seconds:
                sleep_time = self.min_interval_seconds - elapsed + random.uniform(0, 5)
                time.sleep(sleep_time)
                now = time.monotonic()
        
        # Enforce hourly budget
        cutoff = now - 3600
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        
        if len(self._timestamps) >= self.max_per_hour:
            oldest = self._timestamps[0]
            wait_until = oldest + 3600
            time.sleep(max(0, wait_until - now) + random.uniform(10, 30))
        
        self._timestamps.append(time.monotonic())

Handling Structure Changes

LinkedIn's HTML uses obfuscated class names that change on deploys. Do not hard-code class names as primary selectors. Use this hierarchy, in order of stability:

  1. ld+json structured data — most stable, changes with schema.org spec
  2. aria-label and semantic attributes — stable across redesigns
  3. data-* attributes — moderately stable
  4. Tag + position selectors (e.g., h1:first-of-type) — fragile but better than class names
  5. Obfuscated class names (e.g., .pvs-list__item--line-separated) — treat as temporary

When selectors break — and they will — the fastest recovery path is to diff the HTML before/after the break and update your attribute-based selectors. Keep a snapshot of the last known-good HTML in your test fixtures.
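That hierarchy is straightforward to encode as an ordered fallback chain: try the stable extractors first and only fall through to fragile ones. The helper below is generic rather than LinkedIn-specific (each extractor is any zero-argument callable returning a string or None, e.g. a wrapped parsel or Playwright query); the function name is illustrative:

```python
from typing import Callable, Optional, Sequence


def first_match(extractors: Sequence[Callable[[], Optional[str]]]) -> Optional[str]:
    """Try extractors in priority order (structured data first, obfuscated
    class names last) and return the first non-empty result.

    A broken selector that raises simply falls through to the next tier,
    so one deploy-time class rename doesn't take down the whole field.
    """
    for extract in extractors:
        try:
            value = extract()
        except Exception:
            continue
        if value:
            return value.strip()
    return None
```

In practice the list is ordered exactly as the hierarchy above: ld+json, then aria-label, then data-* attributes, then positional selectors, then class names as a last resort.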


When Raw Scraping Isn't Worth It

There are scenarios where building and maintaining this stack isn't justified:

  • You need < 500 profiles/month and don't want to manage proxy billing and account rotation
  • Your team doesn't have bandwidth to monitor for LinkedIn anti-bot updates
  • You need consistent uptime SLAs that your own scraper can't provide

In those cases, a managed scraping API handles the fingerprint management, proxy infrastructure, and JavaScript rendering for you. AlterLab's API supports rendering JavaScript pages with a single POST request:

Python
import httpx

response = httpx.post(
    "https://api.alterlab.io/v1/scrape",
    headers={"X-API-Key": "your_api_key"},
    json={
        "url": "https://www.linkedin.com/company/stripe/",
        "render_js": True,
        "wait_for": "div.org-top-card",
        "proxy_country": "us",
    }
)

html = response.json()["html"]

The tradeoff: you trade control and cost optimization for reliability and zero infrastructure maintenance. For high-volume production pipelines where LinkedIn data is core to the product, building in-house is usually cheaper at scale. For analytics, enrichment, or research pipelines, an API is faster to ship and easier to maintain.
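Even a managed API returns transient 429s and 5xxs under load, so it's worth wrapping the call in exponential backoff with jitter. This is a generic sketch: call is any zero-argument function returning an object with a status_code attribute (such as a lambda around the httpx.post above), and sleep is injectable so the behavior can be tested without waiting:

```python
import random
import time


def with_backoff(call, max_attempts=4, base_delay=1.0,
                 retry_on=(429, 500, 502, 503), sleep=time.sleep):
    """Retry `call` on transient HTTP statuses with exponential backoff.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ...) plus up to
    one second of jitter. Returns the last response either way; the
    caller decides what to do with a still-failing status.
    """
    for attempt in range(max_attempts):
        resp = call()
        if resp.status_code not in retry_on:
            return resp
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return resp
```

Usage against the example above would be with_backoff(lambda: httpx.post(...)), keeping retry policy out of the request code itself.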


Legal Considerations

LinkedIn's Terms of Service prohibit automated scraping. The hiQ Labs v. LinkedIn case (9th Circuit, 2022) held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act, but that ruling doesn't override LinkedIn's ToS or make all scraping legally risk-free in every jurisdiction.

Be precise about what you actually need:

  • Personal profile data is subject to GDPR and CCPA. Have a documented legal basis.
  • Don't scrape contact information at scale for cold outreach — that's the use case that triggers the most aggressive legal responses.
  • Company firmographic data (headcount, industry, description) is the lowest-risk data type.

Key Takeaways

Scraping LinkedIn in 2026 requires addressing multiple detection layers simultaneously:

  • TLS and HTTP/2 fingerprinting — use a real browser or a library with Chrome-compatible fingerprints. Raw requests doesn't pass.
  • Residential proxies are not optional — datacenter IPs are blocked at the ASN level.
  • Session cookies (li_at) — required for full profile data. Inject them directly rather than automating login.
  • Behavioral mimicry — randomize delays, simulate scrolling, stay under 80 profile views per 24 hours per account.
  • Target ld+json and semantic attributes — obfuscated class names are temporary. Structured data and ARIA attributes are stable.
  • Company pages are far more accessible than personal profiles. If firmographic data is sufficient, you don't need authenticated sessions.
  • Build vs. buy depends on volume and team bandwidth — above ~5,000 profiles/day with SLA requirements, a managed API is often the right call.

The maintenance burden is the real cost here. LinkedIn's detection evolves continuously. Budget time for selector updates, proxy pool rotation, and account management — or abstract that away entirely with a scraping API.
