How to Scrape Crunchbase Data: Complete Guide for 2026

Learn how to scrape Crunchbase for public company data using Python, AlterLab API, and best practices for finance scraping in 2026.

This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.

TL;DR

To scrape Crunchbase with Python, use AlterLab’s API to render JavaScript pages, extract company fields via CSS selectors or JSON paths, and respect rate limits. The quickest path is a single alterlab.Client.scrape() call that returns clean HTML or structured output.

Why collect finance data from Crunchbase?

Crunchbase aggregates funding rounds, acquisitions, and leadership changes for private and public companies. Three practical uses include:

  • Market research: Track emerging competitors by monitoring new funding announcements in your sector.
  • Investment screening: Build watchlists of startups that match criteria such as funding stage and geography.
  • Data enrichment: Augment CRM records with current employee counts or the most recent financing date to personalize outreach.

Technical challenges

Finance‑focused sites like Crunchbase deploy several anti‑bot measures:

  • Rate limiting per IP after a burst of requests.
  • JavaScript‑heavy pages that load company data through client‑side React hydration, leaving the raw HTML sparse.
  • Bot detection using fingerprinting and CAPTCHA challenges on suspicious traffic.

A raw requests.get() call often returns a minimal shell or a challenge page. AlterLab’s Smart Rendering API addresses this by launching a headless browser, routing through rotating proxies, and waiting for the network to go idle before returning the fully rendered content.
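To tell which of these cases a response falls into, you can classify the HTML before parsing it. The sketch below is a rough heuristic and not part of AlterLab's SDK: classify_html, the marker phrases, and the 200-character threshold are all assumptions to tune against responses you actually receive.

```python
from html.parser import HTMLParser

# Hypothetical marker phrases; adjust them to the challenge pages you observe.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "access denied")


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def classify_html(html, min_text_chars=200):
    """Return 'challenge', 'shell', or 'content' for a fetched page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    if any(marker in text.lower() for marker in CHALLENGE_MARKERS):
        return "challenge"
    if len(text) < min_text_chars:
        # Little visible text usually means the data is loaded later by JavaScript.
        return "shell"
    return "content"
```

A result of "shell" or "challenge" is a signal to switch to a rendered fetch rather than to parse the page as-is. A page that legitimately mentions one of the marker phrases would be misclassified, so treat the result as a hint.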

99.2% Success Rate
1.2s Avg Response

Quick start with AlterLab API

First, install the Python SDK (see the Getting started guide for full setup). Then authenticate and scrape a public company page.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
# Target a public Crunchbase company profile
response = client.scrape(
    url="https://crunchbase.com/organization/sequoia-capital",
    params={"formats": ["html"], "wait_for": "networkidle"}
)
print(response.text[:1500])  # preview of rendered HTML

Equivalent cURL request:

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://crunchbase.com/organization/sequoia-capital",
        "formats": ["html"],
        "wait_for": "networkidle"
      }'

The response contains the fully rendered DOM, ready for parsing.

Extracting structured data

Once you have the rendered HTML, use a parser like BeautifulSoup or lxml to pull fields. Commonly visible data points on a company page include:

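Field | CSS selector (example) | Notes
Company name | h1.chz-heading | Usually the main heading
Tagline / description | .cb-section-description | Short pitch
Funding total | div:-soup-contains-own("Funding Total") + div | Adjacent value after label
Latest round | div:-soup-contains-own("Latest Round") + div | Stage and amount
Employee count | div:-soup-contains-own("Employee Count") + div | Number or range
Acquisitions | .acquisitions-section .cb-table-row | Loop for each row

The :-soup-contains-own() pseudo-class belongs to BeautifulSoup's selector engine (soupsieve); the Playwright equivalent is :has-text(), which BeautifulSoup does not accept. Treat the class names as examples and confirm them against the live page.

Python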
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")


def text_or_none(selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node else None


# :has-text() is Playwright syntax; BeautifulSoup (soupsieve) uses :-soup-contains-own().
# The class names are examples and may not match the live markup.
record = {
    "name": text_or_none("h1.chz-heading"),
    "tagline": text_or_none(".cb-section-description"),
    "funding_total": text_or_none('div:-soup-contains-own("Funding Total") + div'),
    "latest_round": text_or_none('div:-soup-contains-own("Latest Round") + div'),
    "employees": text_or_none('div:-soup-contains-own("Employee Count") + div'),
}
print(record)
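Values such as the funding total arrive as display strings, so they usually need normalizing before they can be sorted or compared. The following is a minimal sketch, assuming amounts appear in a compact form like "$1.2B" or "$350,000"; parse_amount is a hypothetical helper, and spelled-out units such as "million" are not handled.

```python
import re

# Multipliers for the compact suffixes assumed to appear in display strings.
_SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
# A number with optional thousands separators and decimals, then an optional suffix.
_AMOUNT_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)([KMB])?", re.IGNORECASE)


def parse_amount(text):
    """Convert a string like '$1.2B' to an integer amount, or None if no number is found."""
    match = _AMOUNT_RE.search(text)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    return int(round(number * _SUFFIX.get(suffix, 1)))
```

Returning None for unparseable text keeps missing or "Undisclosed" values distinguishable from a real amount of zero.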

If you prefer structured output, AlterLab can return JSON via its built‑in extraction:

Python
response = client.scrape(
    url="https://crunchbase.com/organization/sequoia-capital",
    params={"formats": ["json"], "json_schema": {"company": "string"}}
)
print(response.json)  # already parsed

Best practices

  • Rate limiting: Pause 1–2 seconds between requests to stay under typical limits; adjust based on HTTP 429 responses.
  • Robots.txt: Check https://crunchbase.com/robots.txt for disallowed paths; avoid scraping /admin/ or /login/.
  • Handling dynamic content: Use AlterLab’s wait_for parameter (e.g., "networkidle" or a CSS selector) instead of arbitrary time.sleep.
  • Error handling: Retry on 5xx or network errors with exponential backoff; log failed URLs for later review.
  • Data freshness: For frequently changing fields like funding totals, schedule re‑scrapes daily or weekly depending on use case.
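The retry advice above can be sketched without tying it to any particular client. In this example, fetch is any callable you supply that raises RetryableError on a 429, a 5xx, or a network failure; both names are hypothetical, and the delays are starting points rather than values Crunchbase or AlterLab documents.

```python
import random
import time


class RetryableError(Exception):
    """Raised by a fetch function for 429, 5xx, or network failures."""


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on RetryableError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller log the failed URL
            # Delay doubles each attempt; jitter keeps parallel workers from retrying in sync.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The sleep parameter exists so the function can be exercised in tests without real waiting; in production the default time.sleep applies.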

Scaling up

When you need to scrape hundreds of company profiles:

  • Batch requests: Send multiple URLs in parallel using asyncio or a thread pool; AlterLab’s API handles concurrency safely.
  • Scheduling: Use the platform’s scheduling feature to run a pipeline nightly and store results in a data warehouse.
  • Cost control: Monitor usage via the dashboard; see AlterLab pricing for per‑scrape rates and volume discounts. Adjust min_tier to skip unnecessary browser tiers for lighter pages.

Example of a batch job using the SDK's async client (run it from a scheduler to make it recurring):

Python
import asyncio
from alterlab import Client

# Client is imported by name, so it is called without the alterlab. prefix.
client = Client("YOUR_API_KEY")
URLS = [
    f"https://crunchbase.com/organization/{slug}"
    for slug in ["sequoia-capital", "a16z", "accel", "greylock"]
]

async def scrape_one(url):
    # Fetch one profile as structured JSON, retrying up to twice on failure.
    return await client.scrape_async(
        url=url,
        params={"formats": ["json"]},
        max_retries=2
    )

async def main():
    # Run all scrapes concurrently; results come back in the same order as URLS.
    results = await asyncio.gather(*[scrape_one(u) for u in URLS])
    for r in results:
        print(r.json)

if __name__ == "__main__":
    asyncio.run(main())
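With hundreds of URLs, launching every request at once can trip the rate limits described earlier. One common way to cap concurrency is an asyncio.Semaphore. The sketch below is independent of the SDK: gather_bounded is a hypothetical helper, coro_fn stands in for whatever async scrape function you use, and the limit of 5 is an arbitrary starting point.

```python
import asyncio


async def gather_bounded(coro_fn, items, limit=5):
    """Run coro_fn(item) for every item with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await coro_fn(item)

    # return_exceptions=True keeps one failed URL from discarding the other results.
    return await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )
```

Results come back in input order, with exception objects in place of failed items, so failed URLs can be collected and retried in a later pass.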

Key takeaways

  • AlterLab’s Smart Rendering API neutralizes Crunchbase’s JavaScript and anti‑bot layers, letting you focus on data extraction.
  • Target only publicly visible fields; respect robots.txt, rate limits, and the site’s Terms of Service.
  • Start with a single Python call, then scale via batching, scheduling, and smart tier selection to balance speed and cost.

Frequently Asked Questions

Is it legal to scrape Crunchbase?
Scraping publicly accessible data is generally permissible under precedents like hiQ Labs v. LinkedIn, but you must review Crunchbase’s robots.txt and Terms of Service, apply rate limiting, and avoid private or login‑restricted information.

What makes Crunchbase hard to scrape?
Crunchbase employs rate limits, JavaScript rendering, and bot detection; AlterLab’s Smart Rendering API handles headless browsing, proxy rotation, and automatic retries to maintain compliant access.

How much does it cost?
AlterLab charges per successful scrape; pricing scales with volume and tier (e.g., T3 for JS‑heavy pages). See the pricing page for detailed rates and volume discounts.