Crunchbase holds data on over 2 million companies, including funding rounds, leadership info, and investor details. Extracting this data manually would take weeks.
This guide shows you exactly how to scrape Crunchbase using Python. You'll learn multiple extraction methods, from simple HTTP requests to handling Cloudflare protection.
I've scraped Crunchbase for lead generation projects and market research. The techniques here come from real production scrapers that collected data on thousands of companies.
How Does Crunchbase Scraping Work?
Scraping Crunchbase works by extracting company data from the hidden JSON cache embedded in each page's HTML source. Crunchbase uses Angular and stores pre-rendered data in a <script id="ng-state"> element. You can parse this JSON directly instead of scraping visible HTML elements, making extraction faster and more reliable than traditional scraping methods.
This approach bypasses many common scraping headaches. No need to wait for JavaScript rendering or deal with complex CSS selectors.
What You'll Learn
This tutorial covers everything you need to build a working Crunchbase scraper. You'll learn how to discover company URLs through sitemaps, extract data from the Angular cache, handle anti-bot protection, and export results to JSON.
The code works with Python 3.8+ and requires only a few third-party libraries. Each step includes complete code examples you can copy and modify.
Prerequisites
Before starting, make sure you have Python 3.8 or higher installed on your system. You'll also need pip for installing packages.
Create a new project directory for your scraper:
mkdir crunchbase-scraper
cd crunchbase-scraper
Install the required libraries using pip:
pip install "httpx[http2]" parsel loguru
Here's what each package does. The httpx library handles HTTP requests, and the [http2] extra pulls in the h2 package that HTTP/2 support requires. Parsel provides CSS and XPath selectors for parsing HTML. Loguru gives you clean, colorful logging.
You can swap httpx for requests if you prefer, but note that requests only speaks HTTP/1.1, which makes its traffic easier to fingerprint. The core parsing logic stays the same.
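For reference, a minimal requests-based setup might look like the sketch below. It mirrors the headers Step 1 uses for httpx, and the target URL is just an example:

import requests

# Sketch of the same setup with requests -- note: no HTTP/2 support.
session = requests.Session()
session.headers.update({
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
})
response = session.get("https://www.crunchbase.com/organization/openai", timeout=30)
print(response.status_code)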
Step 1: Set Up Your HTTP Client
Start by creating a properly configured HTTP client. Crunchbase checks request headers, so you need realistic browser headers.
Create a file called scraper.py:
import httpx
import json
from loguru import logger
BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

client = httpx.Client(
    headers=BASE_HEADERS,
    timeout=30.0,
    follow_redirects=True,
    http2=True
)
The HTTP/2 support matters here. Modern browsers use HTTP/2 by default, so clients that only speak HTTP/1.1 stand out from normal browser traffic.
Setting a 30-second timeout prevents your scraper from hanging on slow responses. The follow_redirects parameter handles any URL redirections automatically.
Step 2: Discover Company URLs Through Sitemaps
You need a list of company URLs before scraping. Crunchbase publishes a sitemap index containing links to every company page.
The sitemap lives at https://www.crunchbase.com/www-sitemaps/sitemap-index.xml. This index file points to compressed XML files organized by content type.
Here's how to parse the sitemap index:
import gzip
from parsel import Selector

def get_sitemap_urls(client):
    """Fetch all organization sitemap URLs from the index."""
    logger.info("Fetching sitemap index...")
    response = client.get("https://www.crunchbase.com/www-sitemaps/sitemap-index.xml")
    selector = Selector(text=response.text)
    # Extract URLs containing 'organizations'
    sitemap_urls = selector.xpath("//sitemap/loc/text()").getall()
    org_sitemaps = [url for url in sitemap_urls if "organizations" in url]
    logger.info(f"Found {len(org_sitemaps)} organization sitemaps")
    return org_sitemaps
Each sitemap file is compressed with gzip. You need to decompress before parsing:
def parse_sitemap(client, sitemap_url):
    """Parse a gzipped sitemap and extract company URLs."""
    logger.info(f"Parsing sitemap: {sitemap_url}")
    response = client.get(sitemap_url)
    decompressed = gzip.decompress(response.content)
    selector = Selector(text=decompressed.decode())
    urls = selector.xpath("//url/loc/text()").getall()
    logger.info(f"Found {len(urls)} company URLs")
    return urls
The sitemaps also include lastmod timestamps. This tells you when each company profile was updated. Filter by date to scrape only recently modified pages.
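As a rough sketch of that filtering, building on the same gzip and Selector approach as parse_sitemap (the lastmod values are assumed to be ISO 8601 timestamps, which is worth verifying against a live sitemap):

from datetime import datetime, timezone

def filter_recent_urls(client, sitemap_url, days=30):
    """Keep only company URLs whose <lastmod> falls within the last `days` days."""
    response = client.get(sitemap_url)
    decompressed = gzip.decompress(response.content)
    selector = Selector(text=decompressed.decode())
    cutoff = datetime.now(timezone.utc).timestamp() - days * 86400
    recent = []
    # Each <url> entry pairs a <loc> with a <lastmod> timestamp
    for entry in selector.xpath("//url"):
        loc = entry.xpath("./loc/text()").get()
        lastmod = entry.xpath("./lastmod/text()").get()
        if not loc or not lastmod:
            continue
        # Assumes ISO 8601, e.g. "2024-01-15T08:30:00+00:00"
        modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if modified.timestamp() >= cutoff:
            recent.append(loc)
    return recent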
Step 3: Extract Data from the Hidden JSON Cache
Here's where Crunchbase scraping gets interesting. The site uses Angular, which pre-renders data into a JSON blob hidden in the page source.
Look for a <script id="ng-state"> tag. This contains all the data visible on the page, plus additional fields not shown in the UI.
First, you need to unescape the Angular-encoded content:
def unescape_angular(text):
    """Convert Angular escape sequences back to normal characters."""
    replacements = {
        "&a;": "&",
        "&q;": '"',
        "&s;": "'",
        "&l;": "<",
        "&g;": ">"
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
Angular escapes these characters so the serialized state can sit inside the HTML without breaking the markup or opening an XSS hole. The function above reverses that encoding.
Now extract and parse the company data:
def extract_company_data(html):
    """Extract company information from page HTML."""
    selector = Selector(text=html)
    # Find the Angular state script
    app_state = selector.css("script#ng-state::text").get()
    if not app_state:
        # Try alternative selector for newer pages
        app_state = selector.css("script#client-app-state::text").get()
    if not app_state:
        logger.warning("Could not find app state data")
        return None
    # Unescape and parse JSON
    app_state = unescape_angular(app_state)
    data = json.loads(app_state)
    return data
The JSON structure contains multiple cache entries. Company data lives under specific keys in the HttpState object.
Here's how to find and extract the relevant data:
def parse_organization(data):
    """Parse organization details from the app state."""
    http_state = data.get("HttpState", {})
    # Find the organization data cache key
    org_key = None
    for key in http_state.keys():
        if "entities/organizations/" in key:
            org_key = key
            break
    if not org_key:
        return None
    org_data = http_state[org_key].get("data", {})
    properties = org_data.get("properties", {})
    cards = org_data.get("cards", {})
    # Extract company details
    company = {
        "name": properties.get("title"),
        "permalink": properties.get("identifier", {}).get("permalink"),
        "description": properties.get("short_description"),
        "founded_year": cards.get("overview_fields2", {}).get("founded_on", {}).get("value"),
        "headquarters": cards.get("overview_fields2", {}).get("location_identifiers", []),
        "website": cards.get("overview_fields2", {}).get("website", {}).get("value"),
        "employee_count": cards.get("overview_fields2", {}).get("num_employees_enum"),
        "total_funding": cards.get("funding_total", {}).get("value_usd"),
    }
    return company
The cards object contains most useful fields. Different cards store different data types like funding rounds, team members, and technology info.
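As an illustration, here's a hedged sketch of pulling funding rounds out of the cards object. The funding_rounds_list card name and its field names are assumptions, so inspect the JSON for a real page and adjust the keys to match:

def parse_funding_rounds(cards):
    """Sketch: extract funding rounds from the cards object.
    The 'funding_rounds_list' key and field names are assumptions --
    check the app state for an actual page before relying on them."""
    rounds = []
    for entry in cards.get("funding_rounds_list") or []:
        rounds.append({
            "announced_on": entry.get("announced_on"),
            "investment_type": entry.get("investment_type"),
            "money_raised_usd": (entry.get("money_raised") or {}).get("value_usd"),
        })
    return rounds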
Step 4: Handle Cloudflare Protection with Proxies
Crunchbase uses Cloudflare to block automated access. After several requests from the same IP, you'll start seeing challenge pages.
Rotating proxies solve this problem. Each request comes from a different IP address, making your scraper look like many different users.
For serious scraping projects, residential proxies work best. Datacenter IPs often get blocked immediately. Services offer residential proxy pools that blend in with normal traffic.
Here's how to configure proxy rotation with httpx:
import random

PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    # Add more proxies here
]

def get_client_with_proxy():
    """Create an HTTP client with a random proxy."""
    proxy = random.choice(PROXY_LIST)
    client = httpx.Client(
        headers=BASE_HEADERS,
        timeout=30.0,
        proxy=proxy,  # on httpx versions before 0.26, pass proxies={"all://": proxy} instead
        http2=True
    )
    return client
Rotate proxies for each request when scraping Crunchbase at scale. This spreads your requests across many IP addresses.
Add delays between requests too. Even with proxy rotation, rapid-fire requests trigger rate limiting:
import time

def scrape_with_delay(urls, min_delay=2, max_delay=5):
    """Scrape URLs with random delays between requests."""
    results = []
    for url in urls:
        client = get_client_with_proxy()
        try:
            response = client.get(url)
            data = extract_company_data(response.text)
            company = parse_organization(data) if data else None
            if company:
                results.append(company)
                logger.info(f"Scraped: {company.get('name')}")
        except Exception as e:
            logger.error(f"Failed: {url} - {e}")
        finally:
            client.close()
        # Random delay between requests
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)
    return results
Random delays make your traffic pattern look more human. Bots typically send requests at fixed intervals.
Step 5: Extract Employee and Contact Data
Beyond company overview data, Crunchbase pages contain employee information. This includes names, titles, LinkedIn profiles, and sometimes contact details.
The people data lives in a different cache key:
def extract_employees(data):
    """Extract employee information from the app state."""
    http_state = data.get("HttpState", {})
    # Find the contacts/people cache key
    people_key = None
    for key in http_state.keys():
        if "/data/searches/contacts" in key:
            people_key = key
            break
    if not people_key:
        return []
    people_data = http_state[people_key].get("data", {})
    entities = people_data.get("entities", [])
    employees = []
    for person in entities:
        props = person.get("properties", {})
        employee = {
            "name": props.get("name"),
            "title": props.get("title"),
            "linkedin": props.get("linkedin"),
            "departments": props.get("job_departments", []),
            "levels": props.get("job_levels", [])
        }
        employees.append(employee)
    return employees
Note that detailed contact information requires visiting the /people tab of each company page. The main company page only shows basic employee data.
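Here's a hedged sketch of that follow-up request, assuming the tab is reachable at the company URL plus /people and embeds the same app state as the main profile page:

def scrape_company_people(client, company_url):
    """Sketch: fetch the /people tab and reuse extract_employees().
    Assumes '<company_url>/people' serves the same Angular app state."""
    people_url = company_url.rstrip("/") + "/people"
    response = client.get(people_url)
    data = extract_company_data(response.text)
    return extract_employees(data) if data else []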
Step 6: Export Results to JSON
After scraping, save your data in a structured format. JSON works well for further processing:
def save_results(companies, filename="crunchbase_data.json"):
    """Save scraped data to a JSON file."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(companies, f, indent=2, ensure_ascii=False)
    logger.info(f"Saved {len(companies)} companies to {filename}")
For large datasets, consider streaming to JSONL format. This writes one JSON object per line and handles memory better:
def save_results_streaming(companies, filename="crunchbase_data.jsonl"):
    """Save data in JSONL format for large datasets."""
    with open(filename, "w", encoding="utf-8") as f:
        for company in companies:
            f.write(json.dumps(company, ensure_ascii=False) + "\n")
JSONL files are easy to load into analysis tools like pandas, and you can process them line by line without loading everything into memory.
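For example, loading the file back for analysis. pandas reads JSONL directly with lines=True, and plain Python can walk it one record at a time:

import json
import pandas as pd  # pip install pandas

# Load the whole JSONL file into a DataFrame
df = pd.read_json("crunchbase_data.jsonl", lines=True)

# Or process it one record at a time without holding everything in memory
with open("crunchbase_data.jsonl", encoding="utf-8") as f:
    for line in f:
        company = json.loads(line)
        print(company.get("name"))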
Complete Working Example
Here's the full scraper combining all the pieces:
import httpx
import json
import gzip
import time
import random
from parsel import Selector
from loguru import logger

BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"
}

def unescape_angular(text):
    replacements = {"&a;": "&", "&q;": '"', "&s;": "'", "&l;": "<", "&g;": ">"}
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

def scrape_company(client, url):
    """Scrape a single company page."""
    response = client.get(url)
    selector = Selector(text=response.text)
    app_state = selector.css("script#ng-state::text").get()
    if not app_state:
        app_state = selector.css("script#client-app-state::text").get()
    if not app_state:
        return None
    data = json.loads(unescape_angular(app_state))
    http_state = data.get("HttpState", {})
    for key, value in http_state.items():
        if "entities/organizations/" in key:
            org = value.get("data", {})
            props = org.get("properties", {})
            cards = org.get("cards", {})
            return {
                "name": props.get("title"),
                "description": props.get("short_description"),
                "website": cards.get("overview_fields2", {}).get("website", {}).get("value"),
                "headquarters": cards.get("overview_fields2", {}).get("location_identifiers", []),
                "funding_total": cards.get("funding_total", {}).get("value_usd"),
            }
    return None

def main():
    """Main scraper function."""
    urls = [
        "https://www.crunchbase.com/organization/tesla-motors",
        "https://www.crunchbase.com/organization/openai",
        "https://www.crunchbase.com/organization/stripe",
    ]
    results = []
    with httpx.Client(headers=BASE_HEADERS, timeout=30, http2=True) as client:
        for url in urls:
            company = scrape_company(client, url)
            if company:
                results.append(company)
                logger.info(f"Scraped: {company['name']}")
            time.sleep(random.uniform(2, 4))
    with open("crunchbase_data.json", "w") as f:
        json.dump(results, f, indent=2)
    logger.info(f"Done! Saved {len(results)} companies")

if __name__ == "__main__":
    main()
Run this script with python scraper.py. It scrapes three company pages and saves results to JSON.
Alternative: Browser Automation with Selenium
Sometimes HTTP requests fail against heavy Cloudflare protection. Browser automation provides a fallback option.
Selenium launches a real browser that executes JavaScript, which gets you past many bot-detection checks:
import json

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_with_browser(url):
    """Scrape using a real browser instance."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait for page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "script#ng-state"))
        )
        # Extract the same ng-state data (it still carries the Angular escaping)
        script = driver.find_element(By.CSS_SELECTOR, "script#ng-state")
        data = json.loads(unescape_angular(script.get_attribute("textContent")))
        return data
    finally:
        driver.quit()
Browser automation runs slower than HTTP requests. Use it only when direct requests fail consistently.
For scale, consider headless browser services. They run browsers in the cloud and handle proxy rotation automatically.
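If you stick with local Selenium, running Chrome headless keeps resource usage down. A minimal sketch, where the proxy flag is optional and the address is a placeholder:

from selenium import webdriver

def make_headless_driver(proxy=None):
    """Sketch: a headless Chrome instance, optionally routed through a proxy."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")   # new headless mode in recent Chrome
    options.add_argument("--window-size=1920,1080")
    if proxy:  # e.g. "http://proxy1.example.com:8080" (placeholder)
        options.add_argument(f"--proxy-server={proxy}")
    return webdriver.Chrome(options=options)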
Common Issues and Solutions
Several problems appear frequently when scraping Crunchbase. Here are fixes for the most common ones.
Getting blocked after a few requests happens when you hit rate limits. Add longer delays between requests and rotate proxies. Residential proxies from services like Roundproxies work better than datacenter IPs for avoiding blocks.
Empty ng-state data occurs on pages protected by JavaScript challenges. The browser needs to execute Cloudflare's challenge script first. Use Selenium or a headless browser service for these pages.
Timeouts on sitemap downloads happen because the gzipped files are large. Increase your timeout to 60 seconds or more. Stream the download if memory is limited.
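Here's a sketch of that streamed download with httpx, writing the gzipped file to disk in chunks instead of buffering it in memory:

def download_sitemap(client, sitemap_url, path="sitemap.xml.gz"):
    """Stream a large gzipped sitemap to disk instead of buffering it in memory."""
    with client.stream("GET", sitemap_url, timeout=60.0) as response:
        with open(path, "wb") as f:
            for chunk in response.iter_bytes():
                f.write(chunk)
    return path

From there, gzip.open(path) gives you the decompressed XML for the same Selector-based parsing as before.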
Missing fields in the JSON means the company profile lacks that data. Check if the field exists before accessing it, and handle None values gracefully.
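One way to handle that is a small helper that walks nested keys and falls back to a default when any level is missing or null (a convenience sketch, not part of the scraper above):

def safe_get(obj, *keys, default=None):
    """Walk nested dict keys, returning `default` if any level is missing or None."""
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key)
        if obj is None:
            return default
    return obj

# Example: safe_get(cards, "overview_fields2", "founded_on", "value")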
Final Thoughts
You now have a complete toolkit for scraping Crunchbase with Python. The hidden JSON extraction method works faster than HTML parsing and returns more data.
Start with the HTTP-based approach for speed. Fall back to browser automation when Cloudflare blocks become persistent. Rotate proxies to maintain access at scale.
The techniques here apply beyond Crunchbase. Many Angular and React sites store data in similar hidden caches. Once you understand this pattern, you can adapt the code for other targets.
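As a rough illustration of that pattern, the sketch below checks a page for a few common embedded-state script IDs. The ID list is indicative rather than exhaustive, and the Angular unescaping is only needed for Angular-style payloads:

STATE_SCRIPT_IDS = ["ng-state", "client-app-state", "__NEXT_DATA__"]

def find_embedded_state(html):
    """Return the first embedded JSON state blob found in the page, if any."""
    selector = Selector(text=html)
    for script_id in STATE_SCRIPT_IDS:
        raw = selector.css(f"script#{script_id}::text").get()
        if not raw:
            continue
        # Try plain JSON first (React/Next.js style), then Angular-escaped JSON
        for candidate in (raw, unescape_angular(raw)):
            try:
                return json.loads(candidate)
            except json.JSONDecodeError:
                continue
    return None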
FAQ
Is it legal to scrape Crunchbase?
Scraping publicly available data from Crunchbase is generally legal for personal use. However, Crunchbase's terms of service prohibit automated data collection. For commercial projects, consider using their official API to avoid legal issues.
How do I avoid getting blocked when scraping Crunchbase?
Use rotating residential proxies, add random delays of 2-5 seconds between requests, and set realistic browser headers. Crunchbase uses Cloudflare protection, so datacenter IPs get blocked quickly. Services like Roundproxies offer residential proxy pools that work well for this purpose.
What data can I extract from Crunchbase?
You can scrape company names, descriptions, funding information, employee counts, headquarters locations, founder details, and investor data. The hidden JSON cache often contains more fields than what's visible on the page, including technology stack and acquisition history.