
How to Scrape LinkedIn Profiles and Company Data Without Getting Blocked in 2026
March 17, 2026
Scraping LinkedIn profiles and company data is one of the harder engineering problems in data extraction — not because LinkedIn's HTML is complex, but because their bot detection is aggressive, layered, and constantly updated. This guide covers what LinkedIn's defense stack actually looks like in 2026, which approaches still work, and how to build a pipeline that holds up under sustained load.
What You're Up Against
LinkedIn does not use a third-party bot protection vendor. Their detection is in-house and operates across several independent layers simultaneously:
TLS fingerprinting (JA3/JA3S): LinkedIn inspects the TLS handshake before your request is even parsed. Python's requests library has a well-known JA3 hash. So does Node.js's https module. If your fingerprint matches a known automation signature, you're rate-limited or blocked before the server returns a single byte.
HTTP/2 settings fingerprinting: Beyond TLS, LinkedIn inspects the HTTP/2 SETTINGS frame — window size, header table size, stream concurrency. These values differ between real browsers and libraries like httpx or aiohttp.
Behavioral analysis: LinkedIn tracks profile view velocity per session, per IP, and per account. Viewing 40 profiles in 20 minutes from the same session triggers a soft block. Scraping 200 profiles/day from the same account triggers a permanent suspension.
IP reputation: Datacenter IPs (AWS, GCP, DigitalOcean, Hetzner) are near-universally blocked. LinkedIn has had years to compile ASN-level blocklists. Residential proxies are required.
Authentication wall: Most profile data — current job, past experience, education, connections — is behind login. Public profile pages show a truncated view and often redirect to the login wall after 2-3 requests from an unauthenticated session.
Understanding this stack tells you what tools are off the table immediately: raw requests, basic Selenium without stealth patches, and datacenter proxies. The approaches that still work in 2026 are headless browsers with fingerprint spoofing, proper session management with valid li_at cookies, and residential proxy rotation.
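To make the TLS layer concrete: a JA3 fingerprint is just an MD5 hash over five fields of the TLS ClientHello. A minimal sketch of how the hash is derived (the field values in the comment are illustrative, not a real Chrome handshake):

```python
import hashlib


def ja3_hash(tls_version: int, ciphers, extensions, curves, point_formats) -> str:
    # JA3 joins five ClientHello fields with commas, dash-separating
    # the values inside each field, then MD5-hashes the result.
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()
```

Because the hash covers the exact cipher and extension lists, any client that doesn't byte-for-byte reproduce Chrome's ClientHello produces a different, flaggable hash; changing even one cipher changes the fingerprint entirely.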
What Data Is Realistically Scrapable
Before writing a line of code, be precise about what you need:
| Data Type | Requires Login | Detection Risk | Notes |
|---|---|---|---|
| Company overview (name, size, industry, HQ) | No | Low | Public pages are stable |
| Company employee count | No | Low | Often in structured ld+json |
| Job postings | No | Low | LinkedIn Jobs is more open |
| Personal profile (headline, current role) | Soft | Medium | Truncated without auth |
| Full work history, education | Yes | High | Requires li_at session |
| Connection graph | Yes | Very High | Heavily monitored |
| Post/activity feed | Yes | High | Lazy-loaded, paginated |
Company pages are significantly more accessible than personal profiles. If your use case is firmographic enrichment — industry, headcount, location, description — you can get most of that from public company pages with modest precautions.
For personal profiles with full history, you need an authenticated session.
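The table above can be encoded as a pre-flight check: before building anything, decide from the requested fields whether you need an authenticated session at all. A sketch, where the field names are my own shorthand for the table rows:

```python
# Field → (requires_login, detection_risk), taken from the table above.
FIELD_ACCESS = {
    "company_overview": (False, "low"),
    "employee_count": (False, "low"),
    "job_postings": (False, "low"),
    "profile_headline": (False, "medium"),  # truncated without auth
    "work_history": (True, "high"),
    "connections": (True, "very high"),
    "activity_feed": (True, "high"),
}


def plan_scrape(fields: list[str]) -> dict:
    """Report whether a field list forces an authenticated session,
    and the highest detection risk it involves."""
    order = ["low", "medium", "high", "very high"]
    needs_login = any(FIELD_ACCESS[f][0] for f in fields)
    max_risk = max((FIELD_ACCESS[f][1] for f in fields), key=order.index)
    return {"needs_login": needs_login, "max_risk": max_risk}
```

If the plan comes back `needs_login: False`, you can stop at Approach 1 below and skip account management entirely.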
Approach 1: Scraping Public Company Pages
Company pages (linkedin.com/company/stripe/) render a meaningful amount of data without authentication. They also embed an `ld+json` block with structured data, which is far more reliable than scraping HTML class names (LinkedIn obfuscates these and changes them frequently).
```python
import asyncio
import json
import random

import httpx
from parsel import Selector

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Ch-Ua": '"Chromium";v="122", "Not(A:Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
}


async def scrape_company(slug: str, proxy: str) -> dict:
    url = f"https://www.linkedin.com/company/{slug}/"
    # http2=True on the client so the proxied transport also speaks HTTP/2.
    # Note that httpx does not spoof Chrome's TLS fingerprint; that is
    # tolerable for public company pages at modest volume, not for profiles.
    async with httpx.AsyncClient(
        http2=True,
        headers=HEADERS,
        proxies={"https://": proxy},
        follow_redirects=True,
        timeout=30.0,
    ) as client:
        resp = await client.get(url)
        resp.raise_for_status()

        sel = Selector(resp.text)
        # Extract structured data first — more reliable than class-based selectors
        ld_json_blocks = sel.css('script[type="application/ld+json"]::text').getall()
        structured = {}
        for block in ld_json_blocks:
            try:
                data = json.loads(block)
                if data.get("@type") in ("Organization", "Corporation"):
                    structured = data
                    break
            except json.JSONDecodeError:
                continue

        # Fall back to meta tags for basics
        name = (
            structured.get("name")
            or sel.css('meta[property="og:title"]::attr(content)').get("")
        )
        description = (
            structured.get("description")
            or sel.css('meta[name="description"]::attr(content)').get("")
        )
        employee_count = structured.get("numberOfEmployees", {})
        return {
            "slug": slug,
            "name": name,
            "description": description,
            "url": structured.get("url"),
            "founded": structured.get("foundingDate"),
            "employee_range": employee_count.get("value") if isinstance(employee_count, dict) else None,
            "industry": structured.get("industry"),
            "headquarters": structured.get("address", {}).get("addressLocality"),
        }


async def scrape_batch(slugs: list[str], proxies: list[str]):
    results = []
    for slug in slugs:
        proxy = random.choice(proxies)
        try:
            data = await scrape_company(slug, proxy)
            results.append(data)
        except httpx.HTTPStatusError as e:
            print(f"[{slug}] HTTP {e.response.status_code}")
        # Randomized delay — critical for avoiding velocity detection
        await asyncio.sleep(random.uniform(2.5, 6.0))
    return results
```

A few things worth noting in this code:
- `http2=True` matters. LinkedIn's servers prefer HTTP/2, and an HTTP/1.1 client looks anomalous.
- The `Sec-Ch-Ua` and `Sec-Fetch-*` headers are set by Chrome automatically. Their absence is a fingerprint.
- The `ld+json` extraction is the most stable part of this pipeline. LinkedIn's obfuscated class names can change weekly; their schema.org structured data changes far less frequently.
- The randomized delay (`uniform(2.5, 6.0)`) is not optional. Fixed intervals like `time.sleep(2)` are a pattern that detection systems flag.
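The batch loop above is strictly sequential. For higher throughput on public company pages, a bounded-concurrency variant is a safer middle ground than an unbounded `asyncio.gather`. A sketch, where `scrape_fn` stands in for a coroutine like `scrape_company`:

```python
import asyncio
import random


async def scrape_batch_bounded(slugs, scrape_fn, proxies,
                               max_concurrency=3, jitter=(2.5, 6.0)):
    # A semaphore caps in-flight requests; per-task jitter keeps the
    # timing pattern irregular even when tasks run in parallel.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(slug):
        async with sem:
            await asyncio.sleep(random.uniform(*jitter))
            return await scrape_fn(slug, random.choice(proxies))

    # return_exceptions=True so one blocked slug doesn't sink the batch
    return await asyncio.gather(*(one(s) for s in slugs), return_exceptions=True)
```

Keep `max_concurrency` small; the point is to overlap network latency, not to multiply request velocity.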
Approach 2: Full Profile Scraping with Playwright
For personal profiles with full work history, you need a real browser. `httpx` won't execute the JavaScript that renders the page content, and LinkedIn lazy-loads most profile sections.
Use Playwright with `playwright-stealth` to patch the automation indicators that Playwright exposes by default (`navigator.webdriver`, Chrome runtime, permission APIs, etc.).
```python
import asyncio
import json
import random

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

# li_at is LinkedIn's primary session cookie.
# Obtain it from a logged-in browser session (DevTools → Application → Cookies).
LI_AT_COOKIE = "your_li_at_cookie_value_here"

PROFILE_SELECTORS = {
    "name": "h1.text-heading-xlarge",
    "headline": "div.text-body-medium.break-words",
    "location": "span.text-body-small.inline.t-black--light.break-words",
    "about": "div.display-flex.ph5.pv3 span.visually-hidden",
}


async def scrape_profile(url: str, proxy_server: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": proxy_server},
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = await browser.new_context(
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
        )
        # Inject the li_at session cookie before navigating
        await context.add_cookies([{
            "name": "li_at",
            "value": LI_AT_COOKIE,
            "domain": ".linkedin.com",
            "path": "/",
            "httpOnly": True,
            "secure": True,
        }])
        page = await context.new_page()
        await stealth_async(page)

        # Block images and fonts to reduce bandwidth and page load time
        await page.route(
            "**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2,ttf}",
            lambda route: route.abort(),
        )

        await page.goto(url, wait_until="domcontentloaded", timeout=45_000)

        # Mimic scroll behavior — LinkedIn lazy-loads experience/education sections
        for _ in range(4):
            await page.mouse.wheel(0, random.randint(400, 800))
            await asyncio.sleep(random.uniform(0.8, 1.8))

        # Extract visible text fields
        result = {"url": url}
        for key, selector in PROFILE_SELECTORS.items():
            try:
                el = page.locator(selector).first
                result[key] = (await el.inner_text(timeout=5_000)).strip()
            except Exception:
                result[key] = None

        # Extract experience section
        experience = []
        exp_items = await page.locator(
            "li.artdeco-list__item.pvs-list__item--line-separated"
        ).all()
        for item in exp_items[:10]:  # cap to avoid long-running loops
            try:
                title = await item.locator("span[aria-hidden='true']").first.inner_text()
                experience.append(title.strip())
            except Exception:
                continue
        result["experience_titles"] = experience

        await browser.close()
        return result


async def run_pipeline(profile_urls: list[str], proxies: list[str]):
    for url in profile_urls:
        proxy = random.choice(proxies)
        data = await scrape_profile(url, proxy)
        print(json.dumps(data, indent=2))
        # LinkedIn monitors inter-request timing at the account level.
        # Keep it well under 3 profiles/minute per session.
        await asyncio.sleep(random.uniform(20, 40))
```

Key decisions in this code:
- Stealth patching: `playwright_stealth` patches ~20 browser properties that Playwright exposes. Without it, `navigator.webdriver === true` and you're flagged immediately.
- Cookie injection over login flow: Automating the login form is slower and creates a distinct behavioral pattern. Injecting `li_at` directly is cleaner. Treat it as a secret — rotate accounts periodically.
- Resource blocking: Blocking images and fonts cuts page load from ~4MB to ~400KB and halves scrape time.
- Scroll simulation: LinkedIn's experience and education sections don't render until scrolled into view. The `mouse.wheel` calls are not optional for complete data.
- 20–40 second delay between profiles: This is not excessive caution — it's roughly how long a human spends reading a profile. Anything faster risks session suspension.
Proxy Strategy
Residential proxies are non-negotiable for LinkedIn at any meaningful scale. The decision tree is:
- < 100 profiles/day: A single residential IP rotated per session is sufficient. Services like Oxylabs, Bright Data, or Smartproxy provide per-IP rotation.
- 100–1,000 profiles/day: Rotate per request. Use geo-targeted proxies matching your LinkedIn account's expected location — a US account routing through a Bucharest IP is an anomaly signal.
- > 1,000 profiles/day: You need multiple LinkedIn accounts, multiple residential proxy pools, and request distribution across both dimensions. At this scale, managing fingerprinting in-house becomes a significant maintenance burden.
For teams that want to skip the proxy infrastructure and browser fingerprint management, scraping APIs like AlterLab handle rotating proxies, TLS fingerprint spoofing, and JavaScript rendering in a single API call — useful when the scraping itself isn't your core engineering problem.
Rate Limiting and Request Patterns
LinkedIn's rate limiting operates at three independent levels:
IP level: Even with residential proxies, individual IPs have request budgets. Rotate IP per session, not per request, if you want to preserve cookie-based sessions. Rotating mid-session triggers a re-authentication challenge.
Account level: LinkedIn tracks profile view counts per authenticated session. Stay under 80–100 profile views per 24-hour period per account. This is a soft limit — exceeding it triggers an "unusual activity" checkpoint, not an immediate ban.
Velocity detection: The interval between sequential profile views matters more than the total count. A human researcher views a profile, reads it (45–90 seconds), then moves to the next. Spikes below 15 seconds between views consistently trigger flags.
Practical implementation:
```python
import time
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RateLimiter:
    max_per_hour: int = 60
    min_interval_seconds: float = 20.0
    _timestamps: deque = field(default_factory=deque)

    def wait(self):
        now = time.monotonic()
        # Enforce minimum interval between requests
        if self._timestamps:
            elapsed = now - self._timestamps[-1]
            if elapsed < self.min_interval_seconds:
                sleep_time = self.min_interval_seconds - elapsed + random.uniform(0, 5)
                time.sleep(sleep_time)
                now = time.monotonic()
        # Enforce hourly budget: drop timestamps older than an hour
        cutoff = now - 3600
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_per_hour:
            oldest = self._timestamps[0]
            wait_until = oldest + 3600
            time.sleep(max(0, wait_until - now) + random.uniform(10, 30))
        self._timestamps.append(time.monotonic())
```

Call `limiter.wait()` immediately before each profile fetch; it blocks until both the minimum interval and the hourly budget allow another request.

Handling Structure Changes
LinkedIn's HTML uses obfuscated class names that change on deploys. Do not hard-code class names as primary selectors. Use this hierarchy, in order of stability:
- `ld+json` structured data — most stable, changes with the schema.org spec
- `aria-label` and semantic attributes — stable across redesigns
- `data-*` attributes — moderately stable
- Tag + position selectors (e.g., `h1:first-of-type`) — fragile but better than class names
- Obfuscated class names (e.g., `.pvs-list__item--line-separated`) — treat as temporary
When selectors break — and they will — the fastest recovery path is to diff the HTML before/after the break and update your attribute-based selectors. Keep a snapshot of the last known-good HTML in your test fixtures.
When Raw Scraping Isn't Worth It
There are scenarios where building and maintaining this stack isn't justified:
- You need < 500 profiles/month and don't want to manage proxy billing and account rotation
- Your team doesn't have bandwidth to monitor for LinkedIn anti-bot updates
- You need consistent uptime SLAs that your own scraper can't provide
In those cases, a managed scraping API handles the fingerprint management, proxy infrastructure, and JavaScript rendering for you. AlterLab's API supports rendering JavaScript pages with a single POST request:
```python
import httpx

response = httpx.post(
    "https://api.alterlab.io/v1/scrape",
    headers={"X-API-Key": "your_api_key"},
    json={
        "url": "https://www.linkedin.com/company/stripe/",
        "render_js": True,
        "wait_for": "div.org-top-card",
        "proxy_country": "us",
    },
)
html = response.json()["html"]
```

The tradeoff: you trade control and cost optimization for reliability and zero infrastructure maintenance. For high-volume production pipelines where LinkedIn data is core to the product, building in-house is usually cheaper at scale. For analytics, enrichment, or research pipelines, an API is faster to ship and easier to maintain.
Legal and Ethical Considerations
LinkedIn's Terms of Service prohibit automated scraping. The hiQ Labs v. LinkedIn case (9th Circuit, 2022) established that scraping publicly available data is not a violation of the Computer Fraud and Abuse Act, but this doesn't override LinkedIn's ToS or make all scraping legally risk-free in all jurisdictions.
Be precise about what you actually need:
- Personal profile data is subject to GDPR and CCPA. Have a documented legal basis.
- Don't scrape contact information at scale for cold outreach — that's the use case that triggers the most aggressive legal responses.
- Company firmographic data (headcount, industry, description) is the lowest-risk data type.
Key Takeaways
Scraping LinkedIn in 2026 requires addressing multiple detection layers simultaneously:
- TLS and HTTP/2 fingerprinting — use a real browser or a library with Chrome-compatible fingerprints. Raw `requests` doesn't pass.
- Residential proxies are not optional — datacenter IPs are blocked at the ASN level.
- Session cookies (`li_at`) — required for full profile data. Inject them directly rather than automating login.
- Behavioral mimicry — randomize delays, simulate scrolling, stay under 80 profile views per 24 hours per account.
- Target `ld+json` and semantic attributes — obfuscated class names are temporary. Structured data and ARIA attributes are stable.
- Company pages are far more accessible than personal profiles. If firmographic data is sufficient, you don't need authenticated sessions.
- Build vs. buy depends on volume and team bandwidth — above ~5,000 profiles/day with SLA requirements, a managed API is often the right call.
The maintenance burden is the real cost here. LinkedIn's detection evolves continuously. Budget time for selector updates, proxy pool rotation, and account management — or abstract that away entirely with a scraping API.