Crunchbase has data on over 2 million companies — funding rounds, investor profiles, leadership teams, acquisition history. If you need that data for lead generation or market research, copying it by hand is not an option.

This guide gives you three working methods to scrape Crunchbase with Python. Each one is fully coded, tested against Crunchbase's current Cloudflare protection, and doesn't depend on any paid scraping service.

I've used these techniques on production scrapers that pull data on thousands of companies weekly. You'll get the same code.

What Is Crunchbase Scraping?

Crunchbase scraping is the process of extracting structured company data — funding, headcount, leadership, investor details — from Crunchbase pages using automated scripts. It works by parsing the hidden JSON cache embedded in each page's <script id="ng-state"> element. Use it when you need company data at scale for lead generation or investment research.

What Data Can You Pull from Crunchbase?

Before writing any code, know what's available. Each Crunchbase company profile contains structured fields across several categories.

Category | Fields
Company basics | Name, description, industry, HQ location, founded date, status
Financials | Total funding, funding rounds, lead investors, IPO data
People | Founders, executives, board members, employee count
Activity | Recent news, events, acquisitions
Technology | Products, patents, tech stack
Social | Website URL, LinkedIn, Twitter, Facebook

The hidden JSON cache often contains more fields than what's visible on the page. That's one reason parsing the cache beats scraping the rendered HTML.

In my experience, the most valuable fields for lead generation are employee count, last funding round, and headquarters location. These three alone let you build a qualified prospect list without touching a CRM.

For investment research, the funding timeline matters more. You want round dates, lead investors, and total raised — all available in the cache under nested funding_rounds objects.
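
To make that concrete, here is a minimal sketch of pulling a funding timeline out of the props dict you'll build in Method 1, Step 4 below. The key names (funding_rounds, announced_on, investment_type, money_raised, lead_investor_identifiers) are assumptions based on typical cache dumps, so inspect your own output and adjust the paths.

import jmespath

def extract_funding_timeline(props):
    """Sketch: build a chronological funding timeline from the cached props dict.
    The key names below are assumptions; verify them against your own ng-state dump."""
    rounds = jmespath.search(
        "funding_rounds[*].{date: announced_on, round: investment_type, "
        "lead_investors: lead_investor_identifiers[*].value, "
        "raised_usd: money_raised.value_usd}",
        props,
    ) or []
    # Oldest round first so the timeline reads chronologically
    return sorted(rounds, key=lambda r: r.get("date") or "")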

Prerequisites

You need Python 3.9+ and pip. Create a project directory and install the dependencies:

mkdir crunchbase-scraper && cd crunchbase-scraper
pip install "httpx[http2]" parsel loguru jmespath

Here's what each library does:

  • httpx — async-capable HTTP client with HTTP/2 support
  • parsel — CSS/XPath selectors for HTML parsing
  • loguru — clean, zero-config logging
  • jmespath — query language for filtering nested JSON

Optional, for Method 2: pip install playwright && playwright install chromium

Method 1: Angular Cache Extraction (Fastest)

This is the approach I reach for first. Crunchbase renders pages server-side and dumps the full dataset into a script tag before the browser even finishes loading.

No JavaScript execution needed. No headless browser. Just an HTTP request and some JSON parsing.

Step 1: Configure Your HTTP Client

Crunchbase's Cloudflare layer inspects request headers. You need a realistic browser fingerprint.

import httpx
import json
import jmespath
from loguru import logger

BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/131.0.0.0 Safari/537.36"
    ),
}

client = httpx.Client(
    headers=BASE_HEADERS,
    timeout=30.0,
    follow_redirects=True,
    http2=True,  # Crunchbase servers prefer HTTP/2
)

HTTP/2 matters here. Crunchbase's CDN responds differently to HTTP/1.1 connections, and some requests fail silently without it.
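
You can confirm HTTP/2 was actually negotiated by checking the response's http_version attribute (this needs the h2 package, which the httpx[http2] extra above installs). A quick sanity check:

# Protocol sanity check: httpx reports the negotiated version on each response.
resp = client.get("https://www.crunchbase.com/")
print(resp.http_version)  # expect "HTTP/2"; "HTTP/1.1" means the h2 extra is missing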

Step 2: Discover Company URLs via Sitemap

Crunchbase publishes a sitemap index at https://www.crunchbase.com/www-sitemaps/sitemap-index.xml. This links to gzipped XML files containing every company URL on the platform.

import gzip
from parsel import Selector

def get_company_urls(client, max_urls=100):
    """Fetch company URLs from Crunchbase sitemap."""
    logger.info("Fetching sitemap index...")
    resp = client.get(
        "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"
    )
    sel = Selector(text=resp.text)
    # grab only the organization sitemap files
    sitemap_urls = [
        url for url in sel.css("sitemap loc::text").getall()
        if "sitemap-organizations" in url
    ]
    
    company_urls = []
    for sitemap_url in sitemap_urls[:2]:  # limit for demo
        resp = client.get(sitemap_url)
        xml = gzip.decompress(resp.content).decode()
        sel = Selector(text=xml)
        urls = sel.css("url loc::text").getall()
        company_urls.extend(urls)
        if len(company_urls) >= max_urls:
            break
    
    logger.info(f"Collected {len(company_urls)} company URLs")
    return company_urls[:max_urls]

The sitemap files are gzip-compressed. Decompress before parsing. Each file holds roughly 50,000 URLs, so set a limit unless you want the full 2 million+.

Step 3: Parse the ng-state JSON Cache

This is where the real extraction happens. Every Crunchbase page embeds its data in a <script id="ng-state"> tag as a JSON blob.

import html

def extract_company_data(page_html):
    """Extract company data from Angular's ng-state cache."""
    sel = Selector(text=page_html)
    
    # Angular stores the dataset in this script tag
    raw_json = sel.css("script#ng-state::text").get()
    if not raw_json:
        return None
    
    # Crunchbase HTML-encodes the JSON — decode it first
    decoded = html.unescape(raw_json)
    data = json.loads(decoded)
    
    return data

One gotcha: the JSON is HTML-encoded. Ampersands, angle brackets, and quotes are all escaped. The html.unescape() call handles this. Skip it and json.loads() will throw a JSONDecodeError.

Step 4: Extract Specific Fields with JMESPath

The JSON blob is large — often 50KB+ per page. JMESPath lets you pull exactly the fields you need without writing nested dictionary lookups.

def parse_company(raw_data):
    """Pull specific fields from the ng-state JSON."""
    # The data structure nests company info under HttpState keys
    # Find the first key containing organization properties
    for key, value in raw_data.items():
        if isinstance(value, dict) and "properties" in value:
            props = value["properties"]
            return {
                "name": jmespath.search("title", props),
                "description": jmespath.search("short_description", props),
                "hq": jmespath.search(
                    "location_identifiers[0].value", props
                ),
                "founded": jmespath.search("founded_on", props),
                "employee_count": jmespath.search(
                    "num_employees_enum", props
                ),
                "total_funding_usd": jmespath.search(
                    "funding_total.value_usd", props
                ),
                "last_funding_type": jmespath.search(
                    "last_funding_type", props
                ),
                "website": jmespath.search(
                    "identifier.permalink", props
                ),
                "industries": jmespath.search(
                    "categories[*].value", props
                ),
            }
    return None

The nested key structure varies slightly between pages. Iterating over top-level keys and checking for "properties" is more reliable than hardcoding a path.
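
When the structure doesn't match what you expect, a quick way to orient yourself is to dump the top-level keys for a single page and see where the organization payload actually lives. The URL below is just an example profile:

# Debugging helper: list the top-level ng-state keys for one page.
resp = client.get("https://www.crunchbase.com/organization/openai")  # example page
raw = extract_company_data(resp.text)
if raw:
    for key, value in raw.items():
        print(key, type(value).__name__)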

Step 5: Export Results

Tie it all together with a scraper loop and JSON export:

import time
import random

def scrape_crunchbase(urls, output_file="crunchbase_data.json"):
    """Main scraper: fetch pages, extract data, save to JSON."""
    results = []
    
    for i, url in enumerate(urls):
        logger.info(f"[{i+1}/{len(urls)}] Scraping {url}")
        try:
            resp = client.get(url)
            if resp.status_code != 200:
                logger.warning(f"Got {resp.status_code} for {url}")
                continue
            
            raw = extract_company_data(resp.text)
            if not raw:
                logger.warning(f"No ng-state data found for {url}")
                continue
            
            company = parse_company(raw)
            if company:
                results.append(company)
                
        except Exception as e:
            logger.error(f"Failed on {url}: {e}")
        
        # Random delay: 2-5 seconds between requests
        time.sleep(random.uniform(2, 5))
    
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    
    logger.info(f"Saved {len(results)} companies to {output_file}")
    return results

The random delay between requests is not optional. Hit Crunchbase too fast and Cloudflare will flag your IP within minutes. Two to five seconds per request is a safe baseline.

Method 2: Browser Automation with Playwright (Most Reliable)

When Cloudflare's JavaScript challenge blocks plain HTTP requests — and it will, eventually — you need a real browser.

Playwright launches an actual Chromium instance that passes Cloudflare's fingerprinting checks. It's slower but harder to block.

When to Use This

Switch to browser automation when you see: 403 responses on every request, the "Just a moment..." Cloudflare interstitial, or empty ng-state script tags (meaning the page loaded but the cache was stripped).
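
A small helper makes that switch decision programmatic instead of something you notice in the logs. This is a heuristic sketch for httpx responses from Method 1; tune the markers to what your blocked responses actually contain:

def looks_blocked(resp):
    """Heuristic: does this response look like a Cloudflare block? (sketch)"""
    if resp.status_code in (403, 429, 503):
        return True
    if "Just a moment" in resp.text:  # Cloudflare interstitial title
        return True
    sel = Selector(text=resp.text)
    # Page rendered but the Angular cache is missing or empty
    return not sel.css("script#ng-state::text").get()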

from playwright.sync_api import sync_playwright
import json
import html
import random

from loguru import logger

def scrape_with_browser(urls, output_file="crunchbase_browser.json"):
    """Scrape Crunchbase using a real browser to bypass Cloudflare."""
    results = []
    
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ]
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36"
            ),
        )
        page = context.new_page()
        
        for url in urls:
            try:
                page.goto(url, wait_until="domcontentloaded")
                page.wait_for_timeout(3000)  # let Cloudflare settle
                
                # Extract ng-state the same way as Method 1
                raw = page.eval_on_selector(
                    "script#ng-state",
                    "el => el.textContent"
                )
                if raw:
                    data = json.loads(html.unescape(raw))
                    company = parse_company(data)
                    if company:
                        results.append(company)
                        
            except Exception as e:
                logger.error(f"Browser error on {url}: {e}")
            
            page.wait_for_timeout(
                random.randint(2000, 5000)
            )
        
        browser.close()
    
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    
    return results

The --disable-blink-features=AutomationControlled flag removes the navigator.webdriver property that Cloudflare checks. Without it, the browser gets flagged immediately.
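
You can verify the flag did its job by asking the page directly. If the check below prints True, assume Cloudflare will flag the session:

# Sanity check inside the scraping loop: navigator.webdriver should not be True.
flag = page.evaluate("() => navigator.webdriver")
logger.info(f"navigator.webdriver = {flag}")  # expect False or None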

Notice we reuse the same parse_company() function from Method 1. The data extraction logic is identical — only the transport layer changes.

Performance Considerations

Browser automation is roughly 10x slower than direct HTTP requests. Each page requires launching a browser tab, loading JavaScript, waiting for Cloudflare, and then extracting data.

On a single machine, expect about 500–700 pages per hour with Playwright. Compare that to 3,000–5,000 with Method 1 when your IPs aren't blocked.

To speed things up, run multiple browser contexts in parallel. Playwright supports this natively — just create multiple context objects from the same browser instance. Keep it under 5 concurrent tabs to avoid memory issues.

# Parallel scraping with multiple contexts
# USER_AGENTS is your own list of desktop Chrome user-agent strings
contexts = [
    browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent=random.choice(USER_AGENTS),
    )
    for _ in range(3)
]
# Distribute URLs across contexts, e.g. round-robin:
# batches = [urls[i::len(contexts)] for i in range(len(contexts))]

Don't run each context through the same proxy. Assign different proxies to different contexts so Cloudflare sees traffic from distinct IPs.
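
Playwright takes a proxy option per context, so the assignment is one line per context. A sketch, assuming PROXIES is your own pool expressed in Playwright's proxy dict format:

# Sketch: one proxy per context so each context's traffic exits from a different IP.
PROXIES = [
    {"server": "http://proxy1.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy3.example.com:8080", "username": "user", "password": "pass"},
]

contexts = [
    browser.new_context(
        proxy=PROXIES[i % len(PROXIES)],
        viewport={"width": 1920, "height": 1080},
    )
    for i in range(3)
]

Depending on your Playwright version, Chromium may also require a proxy to be set at browser.launch() before per-context proxies take effect, so test with a single context first.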

Method 3: Crunchbase's Internal Search Endpoint

Crunchbase's search page at /discover/organization.companies makes POST requests to an internal API. You can intercept and replay these requests to get structured JSON directly.

How It Works

Open your browser DevTools, navigate to the Crunchbase search page, and filter by XHR requests. You'll see POST requests to an endpoint like:

https://www.crunchbase.com/v4/data/searches/organizations

The request body contains a JSON payload with search filters, field selections, and pagination parameters.

SEARCH_PAYLOAD = {
    "field_ids": [
        "identifier", "categories", "location_identifiers",
        "short_description", "rank_org", "founded_on",
        "funding_total", "num_employees_enum"
    ],
    "order": [{"field_id": "rank_org", "sort": "asc"}],
    "query": [],
    "limit": 50,
    "after_id": None,  # for pagination
}

def search_companies(client, total=200):
    """Fetch company data via Crunchbase's internal search API."""
    results = []
    after_id = None
    
    while len(results) < total:
        payload = {**SEARCH_PAYLOAD, "after_id": after_id}
        resp = client.post(
            "https://www.crunchbase.com/v4/data/searches/organizations",
            json=payload,
        )
        
        if resp.status_code != 200:
            logger.warning(f"Search API returned {resp.status_code}")
            break
        
        data = resp.json()
        entities = data.get("entities", [])
        if not entities:
            break
        
        for entity in entities:
            props = entity.get("properties", {})
            results.append({
                "name": props.get("identifier", {}).get("value"),
                "description": props.get("short_description"),
                "funding_total": props.get("funding_total", {}).get(
                    "value_usd"
                ),
                "employee_count": props.get("num_employees_enum"),
                "founded": props.get("founded_on"),
            })
        
        after_id = entities[-1].get("uuid")
        time.sleep(random.uniform(2, 4))
    
    return results[:total]

Limitations of This Method

The search endpoint requires a valid session cookie. You'll need to either log in with a Crunchbase account or extract the cookie from a browser session.
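
If you go the cookie route, the quickest option is to copy the values out of DevTools and hand them to httpx. The cookie name below is a placeholder; use whatever your logged-in browser session actually sets for crunchbase.com:

# Placeholder cookie name: copy the real name/value pairs from
# DevTools > Application > Cookies for crunchbase.com while logged in.
SESSION_COOKIES = {
    "cb_session": "paste-value-from-devtools",
}

authed_client = httpx.Client(
    headers=BASE_HEADERS,
    cookies=SESSION_COOKIES,
    timeout=30.0,
    follow_redirects=True,
    http2=True,
)

Pass authed_client into search_companies() in place of the anonymous client from Method 1.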

Free accounts are limited to 5 search result pages. Premium accounts get more, but the API rate limits are strict.

This approach is best for targeted searches — "all Series A companies in fintech" — not bulk data collection.

You can also add filters to the search payload. The query array accepts filter objects for industry, location, funding stage, employee count, and dozens of other dimensions. Intercepting a filtered search in DevTools will show you the exact payload format.
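
For reference, a filtered payload generally looks like the sketch below. The predicate shape and the literal values are illustrative and may lag behind what Crunchbase currently sends, so copy the exact objects from your own intercepted request rather than trusting these:

# Illustrative only: verify the predicate format against a request captured in DevTools.
FILTERED_PAYLOAD = {
    **SEARCH_PAYLOAD,
    "query": [
        {
            "type": "predicate",
            "field_id": "last_funding_type",
            "operator_id": "includes",
            "values": ["series_a"],
        },
        {
            "type": "predicate",
            "field_id": "num_employees_enum",
            "operator_id": "includes",
            "values": ["c_00011_00050"],  # 11-50 employees; enum value taken from a captured request
        },
    ],
}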

Building a Data Export Pipeline

Raw JSON files are fine for testing. For production, you'll want structured exports.

Here's a function that writes results to both JSON and CSV:

import csv

def export_results(results, base_name="crunchbase_export"):
    """Export scraped data to JSON and CSV formats."""
    # JSON export
    with open(f"{base_name}.json", "w") as f:
        json.dump(results, f, indent=2, default=str)
    
    # CSV export
    if results:
        keys = results[0].keys()
        with open(f"{base_name}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(results)
    
    logger.info(
        f"Exported {len(results)} records to "
        f"{base_name}.json and {base_name}.csv"
    )

For larger datasets, consider writing directly to a SQLite database. It handles deduplication better than flat files and lets you query results without loading everything into memory.
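
Here's a minimal sketch of that, using the company name as the dedup key (swap in the permalink if you capture it, since names aren't guaranteed unique):

import json
import sqlite3

def export_to_sqlite(results, db_path="crunchbase.db"):
    """Write results to SQLite, overwriting duplicates by company name."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS companies (name TEXT PRIMARY KEY, data TEXT)"
    )
    with conn:  # commits on success
        conn.executemany(
            "INSERT OR REPLACE INTO companies (name, data) VALUES (?, ?)",
            [(c.get("name"), json.dumps(c, default=str)) for c in results],
        )
    conn.close()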

Comparing All Three Methods

Dimension | Cache Extraction | Browser Automation | Search API
Speed | ~0.5s per page | ~5s per page | ~0.3s per request
Reliability | Medium — Cloudflare blocks IPs | High — passes JS challenges | Low — needs auth cookies
Setup difficulty | Easy | Medium | Medium
Scale | Good with proxy rotation | Limited by browser resources | Limited by account tier
Data completeness | Full page data + hidden fields | Full page data + hidden fields | Only search-indexed fields
Best for | Bulk scraping with proxies | Small batches or blocked IPs | Filtered searches

Start with Method 1. When it stops working, fall back to Method 2. Use Method 3 only when you need filtered queries and have a Crunchbase account.
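
In practice that fallback can be automated: run the fast HTTP path first, collect the URLs it couldn't parse, and feed only those to the browser. A rough sketch using the functions defined above:

def scrape_with_fallback(urls):
    """Sketch: fast HTTP scraping first, Playwright only for what it misses."""
    results, leftovers = [], []
    for url in urls:
        try:
            resp = client.get(url)
            raw = extract_company_data(resp.text) if resp.status_code == 200 else None
            company = parse_company(raw) if raw else None
        except Exception:
            company = None
        if company:
            results.append(company)
        else:
            leftovers.append(url)  # blocked or missing cache: retry in a browser
        time.sleep(random.uniform(2, 5))
    if leftovers:
        logger.info(f"Retrying {len(leftovers)} URLs with Playwright")
        results.extend(scrape_with_browser(leftovers))
    return results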

Crunchbase Scraper Extensions vs. Custom Code

Several Chrome extensions claim to scrape Crunchbase with one click — Crunchbase Scraper being the most popular. They work for grabbing 20–50 companies from a search results page. Beyond that, they're a dead end.

Here's why custom code wins for any real project:

Factor | Browser Extension | Custom Python Scraper
Setup time | 2 minutes | 30 minutes
Max scale | ~200 records before rate limits | Millions with proxy rotation
Customization | Fixed fields | Any field in the JSON cache
Automation | Manual trigger only | Cron jobs, CI/CD pipelines
Anti-bot handling | None | Proxy rotation, retry logic
Output format | CSV only (usually) | JSON, CSV, database, API

Use extensions for quick manual exports. Use custom scrapers for anything recurring, large-scale, or integrated into a data pipeline.

Can You Scrape Crunchbase with JavaScript?

Yes. If Python isn't your stack, here's the equivalent approach in Node.js using axios and cheerio:

const axios = require("axios");
const cheerio = require("cheerio");

async function scrapeCrunchbase(url) {
  const { data } = await axios.get(url, {
    headers: {
      "user-agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
        "AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
      "accept-language": "en-US,en;q=0.9",
    },
  });

  const $ = cheerio.load(data);
  const raw = $("#ng-state").text();
  if (!raw) return null;

  // Same JSON cache — parse and extract.
  // If the blob comes back HTML-entity-encoded (as in the Python version),
  // decode it first (e.g. with the "he" package) before calling JSON.parse.
  const cache = JSON.parse(raw);
  // Navigate the nested structure for company fields
  return cache;
}

The extraction logic is identical to the Python version. Crunchbase's ng-state cache doesn't care which language reads it — it's just JSON.

For browser automation in JavaScript, use Puppeteer or Playwright's Node.js bindings. The anti-bot bypass strategies are the same.

Several open-source Crunchbase scrapers exist on GitHub. FredericoBaker/crunchbase-scraper is a Selenium-based option that handles login sessions. Most GitHub scrapers break within months because Crunchbase updates their Cloudflare config, so always verify the last commit date before depending on someone else's code.

Handling Anti-Bot Protection

Crunchbase uses Cloudflare's enterprise tier. It's aggressive. Here's what you'll run into and how to handle it.

IP Blocking

Datacenter IPs get flagged fast. After a few dozen requests from the same IP, you'll start seeing 403 responses or Cloudflare challenge pages.

Residential proxies solve this. They route your traffic through real ISP connections, which Cloudflare treats as normal users. Rotate IPs on every request or every few requests.

proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    # add your proxy pool here
]

def get_rotating_client():
    """Create a new client with a random proxy."""
    proxy = random.choice(proxies)
    return httpx.Client(
        headers=BASE_HEADERS,
        proxy=proxy,
        timeout=30.0,
        follow_redirects=True,
        http2=True,
    )

If you're doing this at any real scale, residential proxies from Roundproxies work well for Crunchbase specifically — the IP pool is large enough that Cloudflare doesn't flag patterns across requests.

Rate Limiting

Even with proxy rotation, sending requests too fast gets you blocked. Crunchbase's Cloudflare config tracks request patterns across IPs.

Stick to 2–5 second delays between requests. Add jitter so the timing looks human. Never run parallel requests to the same company page.

TLS Fingerprinting

Cloudflare also checks your TLS fingerprint (JA3/JA4). Standard Python HTTP clients have a distinctive fingerprint that doesn't match real browsers.

For Method 1, use curl_cffi or tls_client to impersonate a real browser's TLS handshake. For Method 2, Playwright handles this automatically since it runs actual Chromium.
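
A minimal curl_cffi sketch looks like this. Install it with pip install curl_cffi; the available impersonation targets depend on your version, so adjust the target name if "chrome" isn't accepted:

from curl_cffi import requests as curl_requests

# "chrome" asks curl_cffi for its most recent Chrome TLS fingerprint;
# pin a specific target (e.g. "chrome120") if your version supports it.
resp = curl_requests.get(
    "https://www.crunchbase.com/organization/openai",  # example page
    headers=BASE_HEADERS,
    impersonate="chrome",
    timeout=30,
)
print(resp.status_code, len(resp.text))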

Adding Retry Logic

Production scrapers need retry logic. A single 403 doesn't mean you're permanently blocked — it often means that specific IP triggered a challenge.

import tenacity

@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(multiplier=2, min=4, max=30),
    retry=tenacity.retry_if_result(lambda r: r is None),
    # return None instead of raising RetryError when all attempts fail
    retry_error_callback=lambda retry_state: None,
)
def fetch_with_retry(url, proxy_pool):
    """Fetch a page with automatic retry and proxy rotation."""
    proxy = random.choice(proxy_pool)
    try:
        # context manager closes the client (and its connections) after each attempt
        with httpx.Client(
            headers=BASE_HEADERS,
            proxy=proxy,
            timeout=30.0,
            http2=True,
        ) as temp_client:
            resp = temp_client.get(url)
        if resp.status_code == 200:
            return resp.text
        logger.warning(f"{resp.status_code} from {proxy}")
        return None
    except Exception as e:
        logger.error(f"Request failed: {e}")
        return None

Each retry picks a different proxy. If all three attempts fail, fetch_with_retry returns None (via retry_error_callback) so the calling loop can log the URL and move on instead of crashing. Install tenacity with pip install tenacity.

Troubleshooting

"403 Forbidden" on every request

Why: Your IP is blocked or your headers are triggering Cloudflare.

Fix: Switch to a residential proxy. Double-check that your User-Agent header matches a current browser version. Ensure HTTP/2 is enabled.

Empty ng-state script tag

Why: Cloudflare served a challenge page instead of the real content. Your scraper saw HTML, but it was the "checking your browser" interstitial.

Fix: Switch to Method 2 (browser automation). If already using a browser, increase the wait time after navigation to let the challenge complete.

JSONDecodeError when parsing cache

Why: You forgot to HTML-unescape the raw JSON before parsing.

Fix: Always run html.unescape() on the raw ng-state content before passing it to json.loads().

Sitemap returns 403 or empty response

Why: Even sitemap access is Cloudflare-protected now.

Fix: Use a browser to fetch the sitemap index, or use curl_cffi with browser impersonation. The sitemap files themselves (gzipped XML) usually pass through once you have the index.

Is Scraping Crunchbase Legal?

Scraping publicly available data is generally legal for personal research. However, Crunchbase's Terms of Service explicitly prohibit automated data collection.

For commercial projects, their official API is the safe route. The free tier gives you 200 API calls per minute and access to basic company data. The Enterprise tier removes most limits but costs significantly more.

A practical middle ground: use the API for data you need regularly, and scrape only for one-off research projects where the API doesn't cover your use case.

Regardless of which method you use, follow these rules: respect rate limits even when you can bypass them, don't scrape personal data protected by GDPR, cache responses locally to avoid duplicate requests, and check robots.txt before crawling new paths.
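
That last rule, local caching, is cheap to implement: hash the URL, write the raw HTML to disk on first fetch, and read it back on re-runs. A minimal sketch:

import hashlib
from pathlib import Path

CACHE_DIR = Path("http_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(client, url):
    """Return cached HTML if this URL was fetched before; otherwise fetch and cache it."""
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if path.exists():
        return path.read_text()
    resp = client.get(url)
    if resp.status_code == 200:
        path.write_text(resp.text)
    return resp.text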

Frequently Asked Questions

Is it legal to scrape Crunchbase?

Scraping publicly visible data is generally legal under US law, but Crunchbase's ToS prohibit it. For commercial use, stick to their API. For personal research, keep your volume low and don't redistribute raw data.

How do I extract data from Crunchbase without scraping?

Crunchbase offers two official routes. The Basic Export lets you download a CSV of their full dataset if you have an Enterprise account. The Crunchbase API gives programmatic access — hit the endpoint with your API key and get structured JSON back. The free API tier is limited to basic company fields and 200 calls per minute.

How do I get a Crunchbase API key?

Log into your Crunchbase account, go to Integrations in your account settings, and your API key will be listed there. You need at least a Pro account. Team owners can find the team-wide API key in the same settings panel. If you don't see it, email enterprisesupport@crunchbase.com.

Can I scrape Crunchbase for free?

All three methods in this tutorial are free to run. Python, httpx, Playwright — all open-source. The only potential cost is proxies if you're scraping at scale. For small projects (under 50 pages), you don't even need those. The code in this guide works without any paid service or subscription.

How often does Crunchbase change its page structure?

The Angular ng-state cache structure has been stable for over two years. HTML layout changes don't affect Method 1 since you're parsing JSON, not CSS selectors. Expect minor field name changes every 6–12 months.

Can I scrape Crunchbase without proxies?

For small batches (under 50 pages), yes. Beyond that, your IP will hit Cloudflare's rate limit. Residential proxies are practically required for anything at scale.

Can I download Crunchbase data directly?

Not without a paid plan. Crunchbase offers bulk CSV exports through their Enterprise tier, but there's no free download button. Scraping is the alternative — you build the export yourself using the methods above. Your scraper can output JSON, CSV, or write directly to a database.

Wrapping Up

You now have three working methods to scrape Crunchbase data with Python. The Angular cache extraction (Method 1) is the fastest and should be your default. Browser automation (Method 2) handles Cloudflare blocks that stop HTTP-only scrapers. The search endpoint (Method 3) works best for filtered queries when you have an account.

The ng-state extraction pattern works on other Angular and React sites too. Once you recognize the hidden JSON cache trick, you'll find it everywhere — ZoomInfo, G2, and dozens of other data-rich platforms use the same server-side rendering approach.

Start with 10 companies. Confirm the extraction works. Then scale up with proxy rotation and error handling. If you hit walls with Cloudflare, the browser automation fallback will keep your pipeline running.