
How to Scrape Crunchbase Data: Complete Guide for 2026
Learn how to scrape Crunchbase for public company data using Python, AlterLab API, and best practices for finance scraping in 2026.
This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.
TL;DR
To scrape Crunchbase with Python, use AlterLab’s API to render JavaScript pages, extract company fields via CSS selectors or JSON paths, and respect rate limits. The quickest path is a single alterlab.Client.scrape() call that returns clean HTML or structured output.
Why collect finance data from Crunchbase?
Crunchbase aggregates funding rounds, acquisitions, and leadership changes for private and public companies. Three practical uses include:
- Market research: Track emerging competitors by monitoring new funding announcements in your sector.
- Investment screening: Build watchlists of startups that match your criteria, such as funding stage and geography.
- Data enrichment: Augment CRM records with current employee counts or the most recent financing dates to personalize outreach.
Technical challenges
Finance‑focused sites like Crunchbase deploy several anti‑bot measures:
- Rate limiting per IP after a burst of requests.
- JavaScript-heavy pages that load company data during React hydration, leaving the raw HTML sparse.
- Bot detection using fingerprinting and CAPTCHA challenges on suspicious traffic.
Raw requests.get() often returns a minimal shell or a challenge page. AlterLab’s Smart Rendering API solves this by launching a headless browser, applying rotating proxies, and waiting for network idle before returning the fully rendered content.
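Before parsing, it can help to check whether a fetch returned real content or just a shell or challenge page. The sketch below is a rough heuristic and not part of AlterLab's SDK: the function name looks_like_shell, the marker strings, and the 200-character threshold are all assumptions to tune against the responses you actually see.

```python
import re

# Hypothetical marker strings; adjust to what you observe in practice.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "access denied")


def looks_like_shell(html: str, min_visible_chars: int = 200) -> bool:
    """Heuristically flag a challenge page or an unhydrated JavaScript shell."""
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    # Drop script/style bodies, then all remaining tags, to approximate visible text.
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    visible = re.sub(r"(?s)<[^>]+>", " ", without_code)
    visible = " ".join(visible.split())
    return len(visible) < min_visible_chars


shell = '<html><body><div id="root"></div><script>window.__DATA__={}</script></body></html>'
rendered = "<html><body><h1>Acme</h1><p>" + "word " * 100 + "</p></body></html>"
print(looks_like_shell(shell), looks_like_shell(rendered))
```

A check like this is cheap to run on every response, so a pipeline can escalate to browser rendering only when the plain fetch comes back empty.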
Quick start with AlterLab API
First, install the Python SDK (see the Getting started guide for full setup). Then authenticate and scrape a public company page.
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Target a public Crunchbase company profile
response = client.scrape(
    url="https://crunchbase.com/organization/sequoia-capital",
    params={"formats": ["html"], "wait_for": "networkidle"}
)
print(response.text[:1500])  # preview of rendered HTML
```
Equivalent cURL request:
```shell
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://crunchbase.com/organization/sequoia-capital",
    "formats": ["html"],
    "wait_for": "networkidle"
  }'
```
The response contains the fully rendered DOM, ready for parsing.
Extracting structured data
Once you have the rendered HTML, use a parser like BeautifulSoup or lxml to pull fields. Commonly visible data points on a company page include:
| Field | CSS selector (example) | Notes |
|---|---|---|
| Company name | h1.chz-heading | Usually the main heading |
| Tagline / description | .cb-section-description | Short pitch |
| Funding total | div:-soup-contains-own("Funding Total") + div | Adjacent value after label |
| Latest round | div:-soup-contains-own("Latest Round") + div | Stage and amount |
| Employee count | div:-soup-contains-own("Employee Count") + div | Number or range |
| Acquisitions | .acquisitions-section .cb-table-row | Loop for each row |
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")


def text_of(selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


# BeautifulSoup has no :has-text() (that is a Playwright selector);
# its selector engine matches on text with :-soup-contains-own().
name = text_of("h1.chz-heading")
tagline = text_of(".cb-section-description")
funding_total = text_of('div:-soup-contains-own("Funding Total") + div')
latest_round = text_of('div:-soup-contains-own("Latest Round") + div')
employees = text_of('div:-soup-contains-own("Employee Count") + div')

print({
    "name": name,
    "tagline": tagline,
    "funding_total": funding_total,
    "latest_round": latest_round,
    "employees": employees,
})
```
If you prefer structured output, AlterLab can return JSON via its built‑in extraction:
```python
response = client.scrape(
    url="https://crunchbase.com/organization/sequoia-capital",
    params={"formats": ["json"], "json_schema": {"company": "string"}}
)
print(response.json)  # already parsed
```
Best practices
- Rate limiting: Pause 1–2 seconds between requests to stay under typical limits; adjust based on HTTP 429 responses.
- Robots.txt: Check https://crunchbase.com/robots.txt for disallowed paths; avoid scraping /admin/ or /login/.
- Handling dynamic content: Use AlterLab’s wait_for parameter (e.g., "networkidle" or a CSS selector) instead of arbitrary time.sleep.
- Error handling: Retry on 5xx or network errors with exponential backoff; log failed URLs for later review.
- Data freshness: For frequently changing fields like funding totals, schedule re‑scrapes daily or weekly depending on use case.
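The pacing, robots.txt, and retry advice above can be combined into one small loop. This is a minimal sketch using only the standard library, not AlterLab's SDK: RetryableError, allowed_by_robots, fetch_with_backoff, and crawl are hypothetical names, and the caller is assumed to supply a fetch function that raises RetryableError on HTTP 429, 5xx, or network failures.

```python
import random
import time
from urllib.robotparser import RobotFileParser


class RetryableError(Exception):
    """Raised by the caller's fetch function for HTTP 429, 5xx, or network errors."""


def allowed_by_robots(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying RetryableError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except RetryableError:
            if attempt == max_retries:
                raise
            # Waits of about 1s, 2s, 4s, ... plus up to 0.5s of random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))


def crawl(urls, fetch, robots_lines, pause=1.5, sleep=time.sleep):
    """Fetch allowed URLs one at a time, pausing between requests."""
    results, failed = {}, []
    for url in urls:
        if not allowed_by_robots(robots_lines, url):
            continue  # skip paths that robots.txt disallows
        try:
            results[url] = fetch_with_backoff(fetch, url, sleep=sleep)
        except RetryableError:
            failed.append(url)  # keep for later review
        sleep(pause)
    return results, failed
```

Passing sleep as a parameter keeps the loop easy to test without real delays; in production the default time.sleep applies.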
Scaling up
When you need to scrape hundreds of company profiles:
- Batch requests: Send multiple URLs in parallel using asyncio or a thread pool; AlterLab’s API handles concurrency safely.
- Scheduling: Use the platform’s scheduling feature to run a pipeline nightly and store results in a data warehouse.
- Cost control: Monitor usage via the dashboard; see AlterLab pricing for per‑scrape rates and volume discounts. Adjust min_tier to skip unnecessary browser tiers for lighter pages.
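When sending requests in parallel, it is usually worth capping how many are in flight at once. The sketch below shows one way to do that with asyncio.Semaphore. It is independent of the SDK: gather_bounded is a hypothetical helper, and fake_scrape is a stand-in coroutine to replace with a real scrape call.

```python
import asyncio


async def gather_bounded(urls, scrape_one, max_concurrency=5):
    """Run scrape_one(url) for every URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:
            return await scrape_one(url)

    # return_exceptions=True puts failures in the result list instead of
    # raising on the first one, so each URL can be inspected afterwards.
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)


async def fake_scrape(url):
    # Stand-in for a real scrape call; replace with your client's coroutine.
    await asyncio.sleep(0.01)
    return {"url": url}


slugs = ["sequoia-capital", "a16z", "accel", "greylock"]
urls = [f"https://crunchbase.com/organization/{s}" for s in slugs]
results = asyncio.run(gather_bounded(urls, fake_scrape, max_concurrency=2))
print(len(results))
```

Results come back in the same order as the input URLs, so a failed entry (an exception object) can be matched to its URL by position.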
Example of a batch job using the SDK:
```python
import asyncio
from alterlab import Client

# Client is imported directly, so it is called without the module prefix.
client = Client("YOUR_API_KEY")

URLS = [
    f"https://crunchbase.com/organization/{slug}"
    for slug in ["sequoia-capital", "a16z", "accel", "greylock"]
]


async def scrape_one(url):
    return await client.scrape_async(
        url=url,
        params={"formats": ["json"]},
        max_retries=2
    )


async def main():
    # Scrape all profiles concurrently and print each parsed result.
    results = await asyncio.gather(*[scrape_one(u) for u in URLS])
    for r in results:
        print(r.json)


if __name__ == "__main__":
    asyncio.run(main())
```
Key takeaways
- AlterLab’s Smart Rendering API handles JavaScript rendering and anti‑bot challenges on Crunchbase, letting you focus on data extraction.
- Target only publicly visible fields; respect robots.txt, rate limits, and the site’s Terms of Service.
- Start with a single Python call, then scale via batching, scheduling, and smart tier selection to balance speed and cost.