AlterLab
How to Scrape LinkedIn Profiles and Company Data Without Getting Blocked in 2026
Web Scraping


Yash Dubey

March 17, 2026

10 min read

Scraping LinkedIn profiles and company data is one of the harder engineering problems in data extraction — not because LinkedIn's HTML is complex, but because their bot detection is aggressive, layered, and constantly updated. This guide covers what LinkedIn's defense stack actually looks like in 2026, which approaches still work, and how to build a pipeline that holds up under sustained load.


What You're Up Against

LinkedIn does not use a third-party bot protection vendor. Their detection is in-house and operates across several independent layers simultaneously:

TLS fingerprinting (JA3/JA3S): LinkedIn inspects the TLS handshake before your request is even parsed. Python's requests library has a well-known JA3 hash, and so does Node.js's https module. If your fingerprint matches a known automation signature, you're rate-limited or blocked before LinkedIn serves a single byte.

HTTP/2 settings fingerprinting: Beyond TLS, LinkedIn inspects the HTTP/2 SETTINGS frame — window size, header table size, stream concurrency. These values differ between real browsers and libraries like httpx or aiohttp.

Behavioral analysis: LinkedIn tracks profile view velocity per session, per IP, and per account. Viewing 40 profiles in 20 minutes from the same session triggers a soft block. Scraping 200 profiles/day from the same account triggers a permanent suspension.

IP reputation: Datacenter IPs (AWS, GCP, DigitalOcean, Hetzner) are near-universally blocked. LinkedIn has had years to compile ASN-level blocklists. Residential proxies are required.

Authentication wall: Most profile data — current job, past experience, education, connections — is behind login. Public profile pages show a truncated view and often redirect to the login wall after 2-3 requests from an unauthenticated session.

Understanding this stack tells you what tools are off the table immediately: raw requests, basic Selenium without stealth patches, and datacenter proxies. The approaches that still work in 2026 are headless browsers with fingerprint spoofing, proper session management with valid li_at cookies, and residential proxy rotation.


What Data Is Realistically Scrapable

Before writing a line of code, be precise about what you need:

| Data Type | Requires Login | Detection Risk | Notes |
| --- | --- | --- | --- |
| Company overview (name, size, industry, HQ) | No | Low | Public pages are stable |
| Company employee count | No | Low | Often in structured ld+json |
| Job postings | No | Low | LinkedIn Jobs is more open |
| Personal profile (headline, current role) | Soft | Medium | Truncated without auth |
| Full work history, education | Yes | High | Requires li_at session |
| Connection graph | Yes | Very High | Heavily monitored |
| Post/activity feed | Yes | High | Lazy-loaded, paginated |

Company pages are significantly more accessible than personal profiles. If your use case is firmographic enrichment — industry, headcount, location, description — you can get most of that from public company pages with modest precautions.

For personal profiles with full history, you need an authenticated session.


Approach 1: Scraping Public Company Pages

Company pages (linkedin.com/company/stripe/) render a meaningful amount of data without authentication. They also embed an ld+json block with structured data, which is far more reliable than scraping HTML class names (LinkedIn obfuscates these and changes them frequently).

Python
import asyncio
import json
import random
import httpx
from parsel import Selector

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Ch-Ua": '"Chromium";v="122", "Not(A:Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
}

async def scrape_company(slug: str, proxy: str) -> dict:
    url = f"https://www.linkedin.com/company/{slug}/"
    
    # HTTP/2 requires the h2 extra: pip install "httpx[http2]".
    # Note: httpx speaks HTTP/2 but does not replicate Chrome's TLS
    # fingerprint, so expect occasional blocks regardless.
    async with httpx.AsyncClient(
        http2=True,
        headers=HEADERS,
        proxy=proxy,
        follow_redirects=True,
        timeout=30.0,
    ) as client:
        resp = await client.get(url)
        resp.raise_for_status()
    
    sel = Selector(resp.text)
    
    # Extract structured data first — more reliable than class-based selectors
    ld_json_blocks = sel.css('script[type="application/ld+json"]::text').getall()
    structured = {}
    for block in ld_json_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        # ld+json payloads are sometimes a list of entities
        candidates = data if isinstance(data, list) else [data]
        for entity in candidates:
            if isinstance(entity, dict) and entity.get("@type") in ("Organization", "Corporation"):
                structured = entity
                break
        if structured:
            break
    
    # Fall back to meta tags for basics
    name = (
        structured.get("name")
        or sel.css('meta[property="og:title"]::attr(content)').get("")
    )
    description = (
        structured.get("description")
        or sel.css('meta[name="description"]::attr(content)').get("")
    )
    employee_count = structured.get("numberOfEmployees", {})
    
    return {
        "slug": slug,
        "name": name,
        "description": description,
        "url": structured.get("url"),
        "founded": structured.get("foundingDate"),
        "employee_range": employee_count.get("value") if isinstance(employee_count, dict) else None,
        "industry": structured.get("industry"),
        "headquarters": structured.get("address", {}).get("addressLocality"),
    }


async def scrape_batch(slugs: list[str], proxies: list[str]):
    results = []
    for slug in slugs:
        proxy = random.choice(proxies)
        try:
            data = await scrape_company(slug, proxy)
            results.append(data)
        except httpx.HTTPStatusError as e:
            print(f"[{slug}] HTTP {e.response.status_code}")
        # Randomized delay — critical for avoiding velocity detection
        await asyncio.sleep(random.uniform(2.5, 6.0))
    return results

A few things worth noting in this code:

  • http2=True matters. LinkedIn's servers prefer HTTP/2, and an HTTP/1.1 client looks anomalous.
  • Sec-Ch-Ua and Sec-Fetch-* headers are set by Chrome automatically. Their absence is a fingerprint.
  • The ld+json extraction is the most stable part of this pipeline. LinkedIn's obfuscated class names can change weekly; their schema.org structured data changes far less frequently.
  • The randomized delay (uniform(2.5, 6.0)) is not optional. Fixed intervals like time.sleep(2) are a pattern that detection systems flag.
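One wrinkle in the ld+json extraction: schema.org models numberOfEmployees as a QuantitativeValue, which shows up in the wild as a bare integer, a dict with a "value" key, or a min/max range. A small normalizer keeps the pipeline from silently dropping the field. This is a sketch; the function name and the exact shapes handled are assumptions, not a documented LinkedIn format:

```python
from typing import Optional


def normalize_employee_count(raw) -> Optional[str]:
    """Normalize a schema.org numberOfEmployees (QuantitativeValue) field.

    Handles three shapes seen in practice: a bare integer, a dict with
    "value", or a dict with "minValue"/"maxValue" describing a range.
    Returns a display string, or None when nothing usable is present.
    """
    if raw is None:
        return None
    if isinstance(raw, int):
        return str(raw)
    if isinstance(raw, dict):
        if "value" in raw:
            return str(raw["value"])
        lo, hi = raw.get("minValue"), raw.get("maxValue")
        if lo is not None and hi is not None:
            return f"{lo}-{hi}"
        if lo is not None:
            return f"{lo}+"
    return None
```

Plugging this in where the example above reads employee_count.get("value") makes the scraper tolerant of all three shapes instead of just the dict form.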

Approach 2: Full Profile Scraping with Playwright

For personal profiles with full work history, you need a real browser. httpx won't execute the JavaScript that renders the page content, and LinkedIn uses lazy-loading for most profile sections.

Use playwright with playwright-stealth to patch the automation indicators that Playwright exposes by default (navigator.webdriver, Chrome runtime, permission APIs, etc.).

Python
import asyncio
import json
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

# li_at is LinkedIn's primary session cookie.
# Obtain it from a logged-in browser session (DevTools → Application → Cookies).
LI_AT_COOKIE = "your_li_at_cookie_value_here"

PROFILE_SELECTORS = {
    "name": "h1.text-heading-xlarge",
    "headline": "div.text-body-medium.break-words",
    "location": "span.text-body-small.inline.t-black--light.break-words",
    "about": "div.display-flex.ph5.pv3 span.visually-hidden",
}

async def scrape_profile(url: str, proxy_server: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": proxy_server},
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = await browser.new_context(
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
        )
        
        # Inject the li_at session cookie before navigating
        await context.add_cookies([{
            "name": "li_at",
            "value": LI_AT_COOKIE,
            "domain": ".linkedin.com",
            "path": "/",
            "httpOnly": True,
            "secure": True,
        }])
        
        page = await context.new_page()
        await stealth_async(page)
        
        # Block images and fonts to reduce bandwidth and page load time
        await page.route("**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2,ttf}", 
                         lambda route: route.abort())
        
        await page.goto(url, wait_until="domcontentloaded", timeout=45_000)
        
        # Mimic scroll behavior — LinkedIn lazy-loads experience/education sections
        for _ in range(4):
            await page.mouse.wheel(0, random.randint(400, 800))
            await asyncio.sleep(random.uniform(0.8, 1.8))
        
        # Extract visible text fields
        result = {"url": url}
        for key, selector in PROFILE_SELECTORS.items():
            try:
                el = page.locator(selector).first
                result[key] = (await el.inner_text(timeout=5_000)).strip()
            except Exception:
                result[key] = None
        
        # Extract experience section
        experience = []
        exp_items = await page.locator(
            "li.artdeco-list__item.pvs-list__item--line-separated"
        ).all()
        for item in exp_items[:10]:  # cap to avoid long-running loops
            try:
                title = await item.locator("span[aria-hidden='true']").first.inner_text()
                experience.append(title.strip())
            except Exception:
                continue
        result["experience_titles"] = experience
        
        await browser.close()
        return result


async def run_pipeline(profile_urls: list[str], proxies: list[str]):
    for url in profile_urls:
        proxy = random.choice(proxies)
        data = await scrape_profile(url, proxy)
        print(json.dumps(data, indent=2))
        # LinkedIn monitors inter-request timing at the account level
        # Keep it well under 3 profiles/minute per session
        await asyncio.sleep(random.uniform(20, 40))

Key decisions in this code:

  • Stealth patching: playwright_stealth patches ~20 browser properties that Playwright exposes. Without it, navigator.webdriver === true and you're flagged immediately.
  • Cookie injection over login flow: Automating the login form is slower and creates a distinct behavioral pattern. Injecting li_at directly is cleaner. Treat it as a secret — rotate accounts periodically.
  • Resource blocking: Blocking images and fonts cuts page load from ~4MB to ~400KB and halves scrape time.
  • Scroll simulation: LinkedIn's experience and education sections don't render until scrolled into view. The mouse.wheel calls are not optional for complete data.
  • 20–40 second delay between profiles: This is not excessive caution — it's roughly what a human reads a profile in. Anything faster risks session suspension.
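When a session does trip detection, it usually surfaces as a redirect rather than an error status: the navigation lands on a login wall or challenge page. The URL markers below are observational assumptions (commonly seen paths, not a documented API), but checking page.url after goto is a cheap way to retire a burned session early:

```python
from urllib.parse import urlparse

# Path fragments commonly associated with LinkedIn blocks and challenges.
# These are observed patterns, not documented behavior — treat as assumptions.
BLOCK_PATH_MARKERS = ("/checkpoint/", "/authwall", "/uas/login")


def is_blocked(final_url: str) -> bool:
    """Return True when a navigation landed on a login wall or challenge
    page instead of the profile that was requested."""
    path = urlparse(final_url).path
    return any(marker in path for marker in BLOCK_PATH_MARKERS)
```

If this returns True for page.url after navigation, stop using that li_at cookie and proxy pairing rather than retrying; repeated hits against a checkpoint accelerate account suspension.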

Proxy Strategy

Residential proxies are non-negotiable for LinkedIn at any meaningful scale. The decision tree is:

  • < 100 profiles/day: A single residential IP rotated per session is sufficient. Services like Oxylabs, Bright Data, or Smartproxy offer sticky per-session IPs.
  • 100–1,000 profiles/day: Rotate IPs per request for unauthenticated scraping; keep them sticky per session once cookies are involved. Use geo-targeted proxies matching your LinkedIn account's expected location — a US account routing through a Bucharest IP is an anomaly signal.
  • > 1,000 profiles/day: You need multiple LinkedIn accounts, multiple residential proxy pools, and request distribution across both dimensions. At this scale, managing fingerprinting in-house becomes a significant maintenance burden.

For teams that want to skip the proxy infrastructure and browser fingerprint management, scraping APIs like AlterLab handle rotating proxies, TLS fingerprint spoofing, and JavaScript rendering in a single API call — useful when the scraping itself isn't your core engineering problem.


Rate Limiting and Request Patterns

LinkedIn's rate limiting operates at three independent levels:

IP level: Even with residential proxies, individual IPs have request budgets. Rotate IP per session, not per request, if you want to preserve cookie-based sessions. Rotating mid-session triggers a re-authentication challenge.

Account level: LinkedIn tracks profile view counts per authenticated session. Stay under 80–100 profile views per 24-hour period per account. This is a soft limit — exceeding it triggers an "unusual activity" checkpoint, not an immediate ban.

Velocity detection: The interval between sequential profile views matters more than the total count. A human researcher views a profile, reads it (45–90 seconds), then moves to the next. Spikes below 15 seconds between views consistently trigger flags.

Practical implementation:

Python
import time
import random
from dataclasses import dataclass, field
from collections import deque

@dataclass
class RateLimiter:
    max_per_hour: int = 60
    min_interval_seconds: float = 20.0
    _timestamps: deque = field(default_factory=deque)
    
    def wait(self):
        now = time.monotonic()
        
        # Enforce minimum interval
        if self._timestamps:
            elapsed = now - self._timestamps[-1]
            if elapsed < self.min_interval_seconds:
                sleep_time = self.min_interval_seconds - elapsed + random.uniform(0, 5)
                time.sleep(sleep_time)
                now = time.monotonic()
        
        # Enforce hourly budget
        cutoff = now - 3600
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        
        if len(self._timestamps) >= self.max_per_hour:
            oldest = self._timestamps[0]
            wait_until = oldest + 3600
            time.sleep(max(0, wait_until - now) + random.uniform(10, 30))
        
        self._timestamps.append(time.monotonic())

Handling Structure Changes

LinkedIn's HTML uses obfuscated class names that change on deploys. Do not hard-code class names as primary selectors. Use this hierarchy, in order of stability:

  1. ld+json structured data — most stable, changes with schema.org spec
  2. aria-label and semantic attributes — stable across redesigns
  3. data-* attributes — moderately stable
  4. Tag + position selectors (e.g., h1:first-of-type) — fragile but better than class names
  5. Obfuscated class names (e.g., .pvs-list__item--line-separated) — treat as temporary

When selectors break — and they will — the fastest recovery path is to diff the HTML before/after the break and update your attribute-based selectors. Keep a snapshot of the last known-good HTML in your test fixtures.
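That hierarchy is straightforward to encode as an ordered fallback chain: try the stable extractors first and only fall through to fragile ones. The helper below is generic rather than LinkedIn-specific (each extractor is any zero-argument callable returning a string or None, e.g. a wrapped parsel or Playwright query); the function name is illustrative:

```python
from typing import Callable, Optional, Sequence


def first_match(extractors: Sequence[Callable[[], Optional[str]]]) -> Optional[str]:
    """Try extractors in priority order (structured data first, obfuscated
    class names last) and return the first non-empty result.

    A broken selector that raises simply falls through to the next tier,
    so one deploy-time class rename doesn't take down the whole field.
    """
    for extract in extractors:
        try:
            value = extract()
        except Exception:
            continue
        if value:
            return value.strip()
    return None
```

In practice the list is ordered exactly as the hierarchy above: ld+json, then aria-label, then data-* attributes, then positional selectors, then class names as a last resort.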


When Raw Scraping Isn't Worth It

There are scenarios where building and maintaining this stack isn't justified:

  • You need < 500 profiles/month and don't want to manage proxy billing and account rotation
  • Your team doesn't have bandwidth to monitor for LinkedIn anti-bot updates
  • You need consistent uptime SLAs that your own scraper can't provide

In those cases, a managed scraping API handles the fingerprint management, proxy infrastructure, and JavaScript rendering for you. AlterLab's API supports rendering JavaScript pages with a single POST request:

Python
import httpx

response = httpx.post(
    "https://api.alterlab.io/v1/scrape",
    headers={"X-API-Key": "your_api_key"},
    json={
        "url": "https://www.linkedin.com/company/stripe/",
        "render_js": True,
        "wait_for": "div.org-top-card",
        "proxy_country": "us",
    }
)

html = response.json()["html"]

The tradeoff: you trade control and cost optimization for reliability and zero infrastructure maintenance. For high-volume production pipelines where LinkedIn data is core to the product, building in-house is usually cheaper at scale. For analytics, enrichment, or research pipelines, an API is faster to ship and easier to maintain.
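Even a managed API returns transient 429s and 5xxs under load, so it's worth wrapping the call in exponential backoff with jitter. This is a generic sketch: call is any zero-argument function returning an object with a status_code attribute (such as a lambda around the httpx.post above), and sleep is injectable so the behavior can be tested without waiting:

```python
import random
import time


def with_backoff(call, max_attempts=4, base_delay=1.0,
                 retry_on=(429, 500, 502, 503), sleep=time.sleep):
    """Retry `call` on transient HTTP statuses with exponential backoff.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, ...) plus up to
    one second of jitter. Returns the last response either way; the
    caller decides what to do with a still-failing status.
    """
    for attempt in range(max_attempts):
        resp = call()
        if resp.status_code not in retry_on:
            return resp
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return resp
```

Usage against the example above would be with_backoff(lambda: httpx.post(...)), keeping retry policy out of the request code itself.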


Legal Considerations

LinkedIn's Terms of Service prohibit automated scraping. The hiQ Labs v. LinkedIn case (9th Circuit, 2022) held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act, but that ruling doesn't override LinkedIn's ToS or make all scraping legally risk-free in every jurisdiction.

Be precise about what you actually need:

  • Personal profile data is subject to GDPR and CCPA. Have a documented legal basis.
  • Don't scrape contact information at scale for cold outreach — that's the use case that triggers the most aggressive legal responses.
  • Company firmographic data (headcount, industry, description) is the lowest-risk data type.

Key Takeaways

Scraping LinkedIn in 2026 requires addressing multiple detection layers simultaneously:

  • TLS and HTTP/2 fingerprinting — use a real browser or a library with Chrome-compatible fingerprints. Raw requests doesn't pass.
  • Residential proxies are not optional — datacenter IPs are blocked at the ASN level.
  • Session cookies (li_at) — required for full profile data. Inject them directly rather than automating login.
  • Behavioral mimicry — randomize delays, simulate scrolling, stay under 80 profile views per 24 hours per account.
  • Target ld+json and semantic attributes — obfuscated class names are temporary. Structured data and ARIA attributes are stable.
  • Company pages are far more accessible than personal profiles. If firmographic data is sufficient, you don't need authenticated sessions.
  • Build vs. buy depends on volume and team bandwidth — above ~5,000 profiles/day with SLA requirements, a managed API is often the right call.

The maintenance burden is the real cost here. LinkedIn's detection evolves continuously. Budget time for selector updates, proxy pool rotation, and account management — or abstract that away entirely with a scraping API.
