
How to Scrape Crunchbase: Complete Guide for 2026
Learn how to scrape Crunchbase company data, funding rounds, and executive profiles with Python. Step-by-step guide with working code examples and anti-bot bypass.
April 8, 2026
Why scrape Crunchbase?
Crunchbase holds structured data on millions of companies, funding rounds, acquisitions, and key executives. Engineers scrape it for three common use cases.
Investment research. Track funding rounds across specific verticals. Monitor which startups raised Series A in the last 30 days. Feed that data into internal dashboards or alert systems.
Lead generation. Build prospect lists filtered by company size, industry, and recent funding events. Sales teams use this to prioritize outreach to companies that just raised capital and are likely expanding.
Market intelligence. Map competitive landscapes. Track acquisition patterns. Monitor executive moves between companies. Data teams pipe this into internal knowledge graphs or BI tools.
Doing this manually does not scale. You need a programmatic approach.
Anti-bot challenges on crunchbase.com
Crunchbase protects its data with several layers of anti-bot infrastructure.
Cloudflare bot detection. The site sits behind Cloudflare's WAF. Standard requests from Python's requests library get challenged or blocked entirely. You need a browser that executes JavaScript and passes Cloudflare's fingerprinting checks.
JavaScript rendering. Company profiles load dynamically. The initial HTML response contains minimal data. The actual content (funding tables, executive lists) renders client-side. A simple HTTP GET returns an empty shell.
Rate limiting. Crunchbase throttles repeated requests from the same IP. Aggressive scraping triggers temporary blocks. You need rotating proxies and request pacing.
Login walls. Some data points require authentication. Public company profiles are accessible, but deeper investor details and contact information sit behind accounts.
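Before reaching for heavier tooling, it helps to recognize a blocked response when you see one. Here is a minimal sketch of a checker for Cloudflare challenges and empty SPA shells (the helper name, marker strings, and size threshold are illustrative, not an exact fingerprint):

```python
# Hypothetical helper: detect when a plain HTTP response is a Cloudflare
# challenge page or an empty client-side shell rather than real content.
def looks_blocked(status_code: int, body: str) -> bool:
    challenge_markers = ("cf-challenge", "Just a moment", "Attention Required")
    if status_code in (403, 429, 503):
        return True
    if any(marker in body for marker in challenge_markers):
        return True
    # SPA shells are tiny: little markup beyond the bootstrap script
    return len(body) < 2000

# A 403 with a Cloudflare interstitial is clearly blocked
print(looks_blocked(403, "<html>Just a moment...</html>"))  # True
```

If your raw GET trips this check, you need JavaScript rendering and fingerprint handling, which is exactly the layer described below.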
Building infrastructure to handle all of this yourself means maintaining headless browsers, proxy pools, and challenge solvers. Most teams would rather extract data than debug CAPTCHAs. AlterLab handles the anti-bot layer so your code just sends a URL and receives rendered HTML. See the Anti-bot bypass API for technical details on how the rendering pipeline works.
Quick start with AlterLab API
Install the SDK and scrape your first Crunchbase page in under a minute. If you are new to the platform, follow the Getting started guide to set up your API key first.
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")
response = client.scrape(
    url="https://www.crunchbase.com/organization/stripe",
    formats=["markdown"],
    wait_for_selector=".component--funding-rounds"
)
print(response.markdown)

The wait_for_selector parameter tells the headless browser to pause until the funding rounds table renders. Without it, you get partial HTML.
Here is the same request with cURL:
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.crunchbase.com/organization/stripe",
    "formats": ["markdown"],
    "wait_for_selector": ".component--funding-rounds"
  }'

The response returns clean markdown with the funding table, company description, and executive list. No JavaScript to parse. No Cloudflare challenge to solve.
Extracting structured data
Raw HTML is a starting point. You need structured fields. Here are the CSS selectors for common Crunchbase data points.
from alterlab import AlterLab
from bs4 import BeautifulSoup

client = AlterLab(api_key="YOUR_API_KEY")
response = client.scrape(
    url="https://www.crunchbase.com/organization/stripe",
    formats=["html"]
)

soup = BeautifulSoup(response.html, "html.parser")
company_name = soup.select_one("h1.profile-title").get_text(strip=True)
description = soup.select_one(".profile-description").get_text(strip=True)
funding_total = soup.select_one(".funding-total .amount").get_text(strip=True)
last_funding_date = soup.select_one(".last-funding-date").get_text(strip=True)
headquarters = soup.select_one(".location-name").get_text(strip=True)

print(f"Company: {company_name}")
print(f"Total Funding: {funding_total}")
print(f"Last Round: {last_funding_date}")
print(f"HQ: {headquarters}")

For JSON output, skip BeautifulSoup entirely. Request the json format and parse the structured response:
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")
response = client.scrape(
    url="https://www.crunchbase.com/organization/stripe",
    formats=["json"],
    json_mode="extract"
)

data = response.json
print(data.get("company_name"))
print(data.get("funding_rounds"))

The json_mode="extract" parameter runs Cortex AI extraction on the page. You define the schema you want, and the LLM pulls structured fields from the rendered content. No CSS selectors to maintain when Crunchbase updates their layout.
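Whatever schema you define, normalize the extracted record before it enters your pipeline so missing fields never raise KeyError downstream. A minimal sketch (the field names here are illustrative, not fixed by the AlterLab API):

```python
# Illustrative field list -- match it to the schema you define for extraction.
EXPECTED_FIELDS = ["company_name", "funding_total", "funding_rounds"]

def normalize_extract(data: dict) -> dict:
    """Return the record with every expected field present,
    defaulting missing ones to None."""
    return {field: data.get(field) for field in EXPECTED_FIELDS}

record = normalize_extract({"company_name": "Stripe"})
print(record)  # {'company_name': 'Stripe', 'funding_total': None, 'funding_rounds': None}
```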
Try scraping a Crunchbase company profile with AlterLab
Common pitfalls
Skipping the wait selector. Crunchbase loads content asynchronously. If you scrape without wait_for_selector, you capture the loading skeleton, not the data. Always wait for a known element like .component--funding-rounds or .profile-header.
Hitting rate limits without rotation. Sending 50 requests per minute from a single IP triggers throttling. Use the proxy rotation built into the API. It switches IPs automatically between requests.
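If you manage pacing on your own side as well, a small throttle enforcing a minimum interval between calls keeps you under a fixed request rate. This is plain Python, independent of any AlterLab-specific feature:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive calls."""
    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough that calls are at least min_interval apart
        now = time.monotonic()
        elapsed = now - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

throttle = Throttle(min_interval=1.2)  # roughly 50 requests per minute
for url in ["https://www.crunchbase.com/organization/stripe"]:
    throttle.wait()
    # client.scrape(url=url, ...) goes here
```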
Scraping authenticated pages. Some Crunchbase data requires login. The API can handle public pages only. If you need authenticated data, you must provide session cookies, and even then, some endpoints block automated access entirely.
Ignoring output format. Default HTML output works for simple cases. For data pipelines, request formats=["json"] or formats=["markdown"]. Markdown strips navigation chrome and leaves you with readable content. JSON gives you parseable structure.
Not handling missing fields. Crunchbase pages vary in structure. Early-stage startups have sparse profiles. Public companies have dense ones. Your extraction code should handle None values gracefully. Use .get_text(strip=True) if element else None patterns.
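That None-safe pattern is worth wrapping in a small helper so every field extraction goes through one code path (selectors follow the examples earlier in this guide):

```python
from bs4 import BeautifulSoup

def safe_text(soup, selector):
    """Return the stripped text of the first match, or None if absent."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element else None

html = "<h1 class='profile-title'>Stripe</h1>"
soup = BeautifulSoup(html, "html.parser")
print(safe_text(soup, "h1.profile-title"))  # Stripe
print(safe_text(soup, ".funding-total"))    # None
```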
Scaling up
Scraping one company profile is straightforward. Scraping 10,000 requires planning.
Batch processing. Queue URLs in your application and send them in parallel. The API handles concurrent requests. You control the pace. Start with 5 concurrent requests, monitor response times, and scale up.
from alterlab import AlterLab
import asyncio

client = AlterLab(api_key="YOUR_API_KEY")

companies = [
    "https://www.crunchbase.com/organization/stripe",
    "https://www.crunchbase.com/organization/plaid",
    "https://www.crunchbase.com/organization/brex",
    "https://www.crunchbase.com/organization/ramp",
]

semaphore = asyncio.Semaphore(5)  # cap concurrency at 5 in-flight requests

async def scrape_company(url):
    async with semaphore:
        response = await client.scrape_async(
            url=url,
            formats=["json"],
            wait_for_selector=".component--funding-rounds"
        )
        return response.json

async def main():
    results = await asyncio.gather(*[scrape_company(url) for url in companies])
    for result in results:
        print(result.get("company_name"), result.get("funding_total"))

asyncio.run(main())

Scheduling recurring scrapes. Company data changes. Funding rounds close. Executives move. Use cron-based scheduling to re-scrape profiles on a cadence.
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")
schedule = client.schedules.create(
    url="https://www.crunchbase.com/organization/stripe",
    cron="0 9 * * 1",
    formats=["json"],
    webhook_url="https://your-server.com/webhook/crunchbase",
    name="Weekly Stripe Profile"
)
print(f"Schedule created: {schedule.id}")

This runs every Monday at 9 AM and pushes results to your webhook. No polling required.
Monitoring for changes. Instead of re-scraping on a fixed schedule, use the monitoring feature to detect when a page actually changes. You get notified only when funding data updates.
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")
monitor = client.monitors.create(
    url="https://www.crunchbase.com/organization/stripe",
    check_interval="daily",
    diff_threshold=0.05,
    webhook_url="https://your-server.com/webhook/changes"
)
print(f"Monitoring active: {monitor.id}")

Cost management. Each scrape consumes balance based on the tier required. Crunchbase needs JavaScript rendering, which maps to T2 or higher. Set spend limits on API keys to control costs. Check AlterLab pricing for current per-request rates across tiers. Most teams monitoring 500 companies with weekly checks stay well within the starter tier.
Key takeaways
Crunchbase data is valuable and well-protected. Cloudflare challenges, JavaScript rendering, and rate limiting make DIY scraping expensive to maintain.
Use a rendering API that handles bot bypass automatically. Request JSON or Markdown output to skip HTML parsing. Wait for dynamic content with wait_for_selector. Batch requests with async calls. Schedule recurring scrapes with cron expressions. Monitor pages for actual changes instead of blind re-scraping.
Start with a single company profile. Validate your extraction logic. Then scale to your full target list.