Crunchbase holds data on over 2 million companies, including funding rounds, leadership info, and investor details. Extracting this data manually would take weeks.
This guide shows you exactly how to scrape Crunchbase using Python. You'll learn multiple extraction methods, from simple HTTP requests to handling Cloudflare protection.
I've scraped Crunchbase for lead generation projects and market research. The techniques here come from real production scrapers that collected data on thousands of companies.
How Does Crunchbase Scraping Work?
Scraping Crunchbase works by extracting company data from the hidden JSON cache embedded in each page's HTML source. Crunchbase uses Angular and stores pre-rendered data in a <script id="ng-state"> element. You can parse this JSON directly instead of scraping visible HTML elements, making extraction faster and more reliable than traditional scraping methods.
This approach bypasses many common scraping headaches. No need to wait for JavaScript rendering or deal with complex CSS selectors.
What You'll Learn
This tutorial covers everything you need to build a working Crunchbase scraper. You'll learn how to discover company URLs through sitemaps, extract data from the Angular cache, handle anti-bot protection, and export results to JSON.
The code works with Python 3.8+ and requires only a few third-party libraries. Each step includes complete code examples you can copy and modify.
Prerequisites
Before starting, make sure you have Python 3.8 or higher installed on your system. You'll also need pip for installing packages.
Create a new project directory for your scraper:
mkdir crunchbase-scraper
cd crunchbase-scraper
Install the required libraries using pip:
pip install "httpx[http2]" parsel loguru
Here's what each package does. The httpx library handles HTTP requests, and the [http2] extra pulls in the h2 package that HTTP/2 support requires. Parsel provides CSS and XPath selectors for parsing HTML. Loguru gives you clean, colorful logging.
You can swap httpx for requests if you prefer, but note that requests only speaks HTTP/1.1, which makes its traffic easier to fingerprint. The core parsing logic stays the same.
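For reference, a minimal requests-based setup might look like the sketch below. It mirrors the headers Step 1 uses for httpx, and the target URL is just an example:

import requests

# Sketch of the same setup with requests -- note: no HTTP/2 support.
session = requests.Session()
session.headers.update({
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
})
response = session.get("https://www.crunchbase.com/organization/openai", timeout=30)
print(response.status_code)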
Step 1: Set Up Your HTTP Client
Start by creating a properly configured HTTP client. Crunchbase checks request headers, so you need realistic browser headers.
Create a file called scraper.py:
import httpx
import json
from loguru import logger
BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

client = httpx.Client(
    headers=BASE_HEADERS,
    timeout=30.0,
    follow_redirects=True,
    http2=True
)
The HTTP/2 support matters here. Modern browsers use HTTP/2 by default, so clients that only speak HTTP/1.1 stand out from normal browser traffic.
Setting a 30-second timeout prevents your scraper from hanging on slow responses. The follow_redirects parameter handles any URL redirections automatically.
Step 2: Discover Company URLs Through Sitemaps
You need a list of company URLs before scraping. Crunchbase publishes a sitemap index containing links to every company page.
The sitemap lives at https://www.crunchbase.com/www-sitemaps/sitemap-index.xml. This index file points to compressed XML files organized by content type.
Here's how to parse the sitemap index:
import gzip
from parsel import Selector

def get_sitemap_urls(client):
    """Fetch all organization sitemap URLs from the index."""
    logger.info("Fetching sitemap index...")
    response = client.get("https://www.crunchbase.com/www-sitemaps/sitemap-index.xml")
    selector = Selector(text=response.text)
    # Extract URLs containing 'organizations'
    sitemap_urls = selector.xpath("//sitemap/loc/text()").getall()
    org_sitemaps = [url for url in sitemap_urls if "organizations" in url]
    logger.info(f"Found {len(org_sitemaps)} organization sitemaps")
    return org_sitemaps
Each sitemap file is compressed with gzip. You need to decompress before parsing:
def parse_sitemap(client, sitemap_url):
    """Parse a gzipped sitemap and extract company URLs."""
    logger.info(f"Parsing sitemap: {sitemap_url}")
    response = client.get(sitemap_url)
    decompressed = gzip.decompress(response.content)
    selector = Selector(text=decompressed.decode())
    urls = selector.xpath("//url/loc/text()").getall()
    logger.info(f"Found {len(urls)} company URLs")
    return urls
The sitemaps also include lastmod timestamps. This tells you when each company profile was updated. Filter by date to scrape only recently modified pages.
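As a rough sketch of that filtering, building on the same gzip and Selector approach as parse_sitemap (the lastmod values are assumed to be ISO 8601 timestamps, which is worth verifying against a live sitemap):

from datetime import datetime, timezone

def filter_recent_urls(client, sitemap_url, days=30):
    """Keep only company URLs whose <lastmod> falls within the last `days` days."""
    response = client.get(sitemap_url)
    decompressed = gzip.decompress(response.content)
    selector = Selector(text=decompressed.decode())
    cutoff = datetime.now(timezone.utc).timestamp() - days * 86400
    recent = []
    # Each <url> entry pairs a <loc> with a <lastmod> timestamp
    for entry in selector.xpath("//url"):
        loc = entry.xpath("./loc/text()").get()
        lastmod = entry.xpath("./lastmod/text()").get()
        if not loc or not lastmod:
            continue
        # Assumes ISO 8601, e.g. "2024-01-15T08:30:00+00:00"
        modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if modified.timestamp() >= cutoff:
            recent.append(loc)
    return recent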
Step 3: Extract Data from the Hidden JSON Cache
Here's where Crunchbase scraping gets interesting. The site uses Angular, which pre-renders data into a JSON blob hidden in the page source.
Look for a <script id="ng-state"> tag. This contains all the data visible on the page, plus additional fields not shown in the UI.
First, you need to unescape the Angular-encoded content:
def unescape_angular(text):
    """Convert Angular escape sequences back to normal characters."""
    replacements = {
        "&a;": "&",
        "&q;": '"',
        "&s;": "'",
        "&l;": "<",
        "&g;": ">"
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
Angular escapes these characters so the serialized state can sit inside the HTML without breaking the markup or opening an XSS hole. The function above reverses that encoding.
Now extract and parse the company data:
def extract_company_data(html):
    """Extract company information from page HTML."""
    selector = Selector(text=html)
    # Find the Angular state script
    app_state = selector.css("script#ng-state::text").get()
    if not app_state:
        # Try alternative selector for newer pages
        app_state = selector.css("script#client-app-state::text").get()
    if not app_state:
        logger.warning("Could not find app state data")
        return None
    # Unescape and parse JSON
    app_state = unescape_angular(app_state)
    data = json.loads(app_state)
    return data
The JSON structure contains multiple cache entries. Company data lives under specific keys in the HttpState object.
Here's how to find and extract the relevant data:
def parse_organization(data):
    """Parse organization details from the app state."""
    http_state = data.get("HttpState", {})
    # Find the organization data cache key
    org_key = None
    for key in http_state.keys():
        if "entities/organizations/" in key:
            org_key = key
            break
    if not org_key:
        return None
    org_data = http_state[org_key].get("data", {})
    properties = org_data.get("properties", {})
    cards = org_data.get("cards", {})
    # Extract company details
    company = {
        "name": properties.get("title"),
        "permalink": properties.get("identifier", {}).get("permalink"),
        "description": properties.get("short_description"),
        "founded_year": cards.get("overview_fields2", {}).get("founded_on", {}).get("value"),
        "headquarters": cards.get("overview_fields2", {}).get("location_identifiers", []),
        "website": cards.get("overview_fields2", {}).get("website", {}).get("value"),
        "employee_count": cards.get("overview_fields2", {}).get("num_employees_enum"),
        "total_funding": cards.get("funding_total", {}).get("value_usd"),
    }
    return company
The cards object contains most useful fields. Different cards store different data types like funding rounds, team members, and technology info.
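As an illustration, here's a hedged sketch of pulling funding rounds out of the cards object. The funding_rounds_list card name and its field names are assumptions, so inspect the JSON for a real page and adjust the keys to match:

def parse_funding_rounds(cards):
    """Sketch: extract funding rounds from the cards object.
    The 'funding_rounds_list' key and field names are assumptions --
    check the app state for an actual page before relying on them."""
    rounds = []
    for entry in cards.get("funding_rounds_list") or []:
        rounds.append({
            "announced_on": entry.get("announced_on"),
            "investment_type": entry.get("investment_type"),
            "money_raised_usd": (entry.get("money_raised") or {}).get("value_usd"),
        })
    return rounds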
Step 4: Handle Cloudflare Protection with Proxies
Crunchbase uses Cloudflare to block automated access. After several requests from the same IP, you'll start seeing challenge pages.
Rotating proxies solve this problem. Each request comes from a different IP address, making your scraper look like many different users.
For serious scraping projects, residential proxies work best. Datacenter IPs often get blocked immediately. Services offer residential proxy pools that blend in with normal traffic.
Here's how to configure proxy rotation with httpx:
import random

PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    # Add more proxies here
]

def get_client_with_proxy():
    """Create an HTTP client with a random proxy."""
    proxy = random.choice(PROXY_LIST)
    client = httpx.Client(
        headers=BASE_HEADERS,
        timeout=30.0,
        proxy=proxy,  # on httpx versions before 0.26, pass proxies={"all://": proxy} instead
        http2=True
    )
    return client
Rotate proxies for each request when scraping Crunchbase at scale. This spreads your requests across many IP addresses.
Add delays between requests too. Even with proxy rotation, rapid-fire requests trigger rate limiting:
import time

def scrape_with_delay(urls, min_delay=2, max_delay=5):
    """Scrape URLs with random delays between requests."""
    results = []
    for url in urls:
        client = get_client_with_proxy()
        try:
            response = client.get(url)
            data = extract_company_data(response.text)
            company = parse_organization(data) if data else None
            if company:
                results.append(company)
                logger.info(f"Scraped: {company.get('name')}")
        except Exception as e:
            logger.error(f"Failed: {url} - {e}")
        finally:
            client.close()
        # Random delay between requests
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)
    return results
Random delays make your traffic pattern look more human. Bots typically send requests at fixed intervals.
Step 5: Extract Employee and Contact Data
Beyond company overview data, Crunchbase pages contain employee information. This includes names, titles, LinkedIn profiles, and sometimes contact details.
The people data lives in a different cache key:
def extract_employees(data):
    """Extract employee information from the app state."""
    http_state = data.get("HttpState", {})
    # Find the contacts/people cache key
    people_key = None
    for key in http_state.keys():
        if "/data/searches/contacts" in key:
            people_key = key
            break
    if not people_key:
        return []
    people_data = http_state[people_key].get("data", {})
    entities = people_data.get("entities", [])
    employees = []
    for person in entities:
        props = person.get("properties", {})
        employee = {
            "name": props.get("name"),
            "title": props.get("title"),
            "linkedin": props.get("linkedin"),
            "departments": props.get("job_departments", []),
            "levels": props.get("job_levels", [])
        }
        employees.append(employee)
    return employees
Note that detailed contact information requires visiting the /people tab of each company page. The main company page only shows basic employee data.
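Here's a hedged sketch of that follow-up request, assuming the tab is reachable at the company URL plus /people and embeds the same app state as the main profile page:

def scrape_company_people(client, company_url):
    """Sketch: fetch the /people tab and reuse extract_employees().
    Assumes '<company_url>/people' serves the same Angular app state."""
    people_url = company_url.rstrip("/") + "/people"
    response = client.get(people_url)
    data = extract_company_data(response.text)
    return extract_employees(data) if data else []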
Step 6: Export Results to JSON
After scraping, save your data in a structured format. JSON works well for further processing:
def save_results(companies, filename="crunchbase_data.json"):
    """Save scraped data to a JSON file."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(companies, f, indent=2, ensure_ascii=False)
    logger.info(f"Saved {len(companies)} companies to {filename}")
For large datasets, consider streaming to JSONL format. This writes one JSON object per line and handles memory better:
def save_results_streaming(companies, filename="crunchbase_data.jsonl"):
    """Save data in JSONL format for large datasets."""
    with open(filename, "w", encoding="utf-8") as f:
        for company in companies:
            f.write(json.dumps(company, ensure_ascii=False) + "\n")
JSONL files are easy to load into analysis tools like pandas, and you can process them line by line without loading everything into memory.
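For example, loading the file back for analysis. pandas reads JSONL directly with lines=True, and plain Python can walk it one record at a time:

import json
import pandas as pd  # pip install pandas

# Load the whole JSONL file into a DataFrame
df = pd.read_json("crunchbase_data.jsonl", lines=True)

# Or process it one record at a time without holding everything in memory
with open("crunchbase_data.jsonl", encoding="utf-8") as f:
    for line in f:
        company = json.loads(line)
        print(company.get("name"))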
Complete Working Example
Here's the full scraper combining all the pieces:
import httpx
import json
import gzip
import time
import random
from parsel import Selector
from loguru import logger

BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
}

def unescape_angular(text):
    replacements = {"&a;": "&", "&q;": '"', "&s;": "'", "&l;": "<", "&g;": ">"}
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

def scrape_company(client, url):
    """Scrape a single company page."""
    response = client.get(url)
    selector = Selector(text=response.text)
    app_state = selector.css("script#ng-state::text").get()
    if not app_state:
        app_state = selector.css("script#client-app-state::text").get()
    if not app_state:
        return None
    data = json.loads(unescape_angular(app_state))
    http_state = data.get("HttpState", {})
    for key, value in http_state.items():
        if "entities/organizations/" in key:
            org = value.get("data", {})
            props = org.get("properties", {})
            cards = org.get("cards", {})
            return {
                "name": props.get("title"),
                "description": props.get("short_description"),
                "website": cards.get("overview_fields2", {}).get("website", {}).get("value"),
                "headquarters": cards.get("overview_fields2", {}).get("location_identifiers", []),
                "funding_total": cards.get("funding_total", {}).get("value_usd"),
            }
    return None

def main():
    """Main scraper function."""
    urls = [
        "https://www.crunchbase.com/organization/tesla-motors",
        "https://www.crunchbase.com/organization/openai",
        "https://www.crunchbase.com/organization/stripe",
    ]
    results = []
    with httpx.Client(headers=BASE_HEADERS, timeout=30, http2=True) as client:
        for url in urls:
            company = scrape_company(client, url)
            if company:
                results.append(company)
                logger.info(f"Scraped: {company['name']}")
            time.sleep(random.uniform(2, 4))
    with open("crunchbase_data.json", "w") as f:
        json.dump(results, f, indent=2)
    logger.info(f"Done! Saved {len(results)} companies")

if __name__ == "__main__":
    main()
Run this script with python scraper.py. It scrapes three company pages and saves results to JSON.
Alternative: Browser Automation with Selenium
Sometimes HTTP requests fail against heavy Cloudflare protection. Browser automation provides a fallback option.
Selenium launches a real browser that executes JavaScript, which gets you past many bot-detection checks:
import json

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_with_browser(url):
    """Scrape using a real browser instance."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait for page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "script#ng-state"))
        )
        # Extract the same ng-state data (it still carries the Angular escaping)
        script = driver.find_element(By.CSS_SELECTOR, "script#ng-state")
        data = json.loads(unescape_angular(script.get_attribute("textContent")))
        return data
    finally:
        driver.quit()
Browser automation runs slower than HTTP requests. Use it only when direct requests fail consistently.
For scale, consider headless browser services. They run browsers in the cloud and handle proxy rotation automatically.
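If you stick with local Selenium, running Chrome headless keeps resource usage down. A minimal sketch, where the proxy flag is optional and the address is a placeholder:

from selenium import webdriver

def make_headless_driver(proxy=None):
    """Sketch: a headless Chrome instance, optionally routed through a proxy."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")   # new headless mode in recent Chrome
    options.add_argument("--window-size=1920,1080")
    if proxy:  # e.g. "http://proxy1.example.com:8080" (placeholder)
        options.add_argument(f"--proxy-server={proxy}")
    return webdriver.Chrome(options=options)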
Common Issues and Solutions
Several problems appear frequently when scraping Crunchbase. Here are fixes for the most common ones.
Getting blocked after a few requests happens when you hit rate limits. Add longer delays between requests and rotate proxies. Residential proxies from services like Roundproxies work better than datacenter IPs for avoiding blocks.
Empty ng-state data occurs on pages protected by JavaScript challenges. The browser needs to execute Cloudflare's challenge script first. Use Selenium or a headless browser service for these pages.
Timeouts on sitemap downloads happen because the gzipped files are large. Increase your timeout to 60 seconds or more. Stream the download if memory is limited.
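Here's a sketch of that streamed download with httpx, writing the gzipped file to disk in chunks instead of buffering it in memory:

def download_sitemap(client, sitemap_url, path="sitemap.xml.gz"):
    """Stream a large gzipped sitemap to disk instead of buffering it in memory."""
    with client.stream("GET", sitemap_url, timeout=60.0) as response:
        with open(path, "wb") as f:
            for chunk in response.iter_bytes():
                f.write(chunk)
    return path

From there, gzip.open(path) gives you the decompressed XML for the same Selector-based parsing as before.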
Missing fields in the JSON means the company profile lacks that data. Check if the field exists before accessing it, and handle None values gracefully.
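One way to handle that is a small helper that walks nested keys and falls back to a default when any level is missing or null (a convenience sketch, not part of the scraper above):

def safe_get(obj, *keys, default=None):
    """Walk nested dict keys, returning `default` if any level is missing or None."""
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key)
        if obj is None:
            return default
    return obj

# Example: safe_get(cards, "overview_fields2", "founded_on", "value")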
Final Thoughts
You now have a complete toolkit for scraping Crunchbase with Python. The hidden JSON extraction method works faster than HTML parsing and returns more data.
Start with the HTTP-based approach for speed. Fall back to browser automation when Cloudflare blocks become persistent. Rotate proxies to maintain access at scale.
The techniques here apply beyond Crunchbase. Many Angular and React sites store data in similar hidden caches. Once you understand this pattern, you can adapt the code for other targets.
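As a rough illustration of that pattern, the sketch below checks a page for a few common embedded-state script IDs. The ID list is indicative rather than exhaustive, and the Angular unescaping is only needed for Angular-style payloads:

STATE_SCRIPT_IDS = ["ng-state", "client-app-state", "__NEXT_DATA__"]

def find_embedded_state(html):
    """Return the first embedded JSON state blob found in the page, if any."""
    selector = Selector(text=html)
    for script_id in STATE_SCRIPT_IDS:
        raw = selector.css(f"script#{script_id}::text").get()
        if not raw:
            continue
        # Try plain JSON first (React/Next.js style), then Angular-escaped JSON
        for candidate in (raw, unescape_angular(raw)):
            try:
                return json.loads(candidate)
            except json.JSONDecodeError:
                continue
    return None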
FAQ
Is it legal to scrape Crunchbase?
Scraping publicly available data from Crunchbase is generally legal for personal use. However, Crunchbase's terms of service prohibit automated data collection. For commercial projects, consider using their official API to avoid legal issues.
How do I avoid getting blocked when scraping Crunchbase?
Use rotating residential proxies, add random delays of 2-5 seconds between requests, and set realistic browser headers. Crunchbase uses Cloudflare protection, so datacenter IPs get blocked quickly. Services like Roundproxies offer residential proxy pools that work well for this purpose.
What data can I extract from Crunchbase?
You can scrape company names, descriptions, funding information, employee counts, headquarters locations, founder details, and investor data. The hidden JSON cache often contains more fields than what's visible on the page, including technology stack and acquisition history.