I've been scraping websites for the better part of a decade, and I can tell you this: the web scraping landscape is a battlefield. Cloudflare blocks your requests. JavaScript renders content dynamically. Rate limits shut you down. And don't even get me started on CAPTCHAs.
Most scraping frameworks pick their battles. Selenium handles JavaScript but gets detected instantly. Requests is fast but can't handle dynamic content. Scrapy is powerful but has a steep learning curve.
Botasaurus doesn't pick battles—it wins the war. It's the first Python framework I've used that actually delivers on the "undetectable scraping" promise. Cloudflare? Bypassed. DataDome? Not a problem. JavaScript-heavy sites? Handled automatically.
In this guide, I'll show you everything you need to know about Botasaurus, from basic scraping to advanced anti-detection techniques. No fluff, just the practical knowledge you need to build scrapers that actually work.
What is Botasaurus and why should you use it?
Botasaurus is an all-in-one web scraping framework that combines the power of browser automation with the simplicity of HTTP requests. It's built on top of Selenium but patches it with anti-detection features that make your scrapers virtually undetectable.
Here's what makes it different:
Anti-detection by default. Other frameworks make you configure anti-detection yourself. Botasaurus has it built-in. Real browser fingerprints, human-like behavior, automatic user-agent rotation—it's all there.
Three ways to scrape. Use @browser for JavaScript sites, @request for fast HTTP scraping, or @task for non-scraping jobs. Pick the right tool for each job.
No driver management. Forget about downloading ChromeDriver or managing versions. Botasaurus handles it automatically.
Built-in caching. Scrape once, use the data hundreds of times. Perfect for development and testing.
Parallel scraping. One parameter (parallel=5) and you're scraping 5 pages simultaneously.
The alternative is cobbling together multiple libraries—undetected-chromedriver for anti-detection, requests for HTTP calls, multiprocessing for parallelization. Botasaurus gives you everything in one package.
Getting started: Your first Botasaurus scraper
Installation is straightforward. Botasaurus requires Python 3.7 or higher:
pip install botasaurus
On first run, Botasaurus automatically downloads ChromeDriver. You don't need to manually install Chrome or manage driver versions—it handles everything.
Let's create your first scraper. Create a file called main.py:
from botasaurus.browser import browser, Driver

@browser
def scrape_heading(driver: Driver, data):
    # Visit the website
    driver.get("https://www.omkar.cloud/")

    # Extract the heading text
    heading = driver.get_text("h1")

    # Return the data (Botasaurus saves it automatically)
    return {
        "heading": heading
    }

# Run the scraper
scrape_heading()
Run it:
python main.py
Botasaurus launches Chrome, visits the site, extracts the heading, and saves the result to output/scrape_heading.json. That's it. No driver setup, no complex configuration.
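For reference, the output file simply holds whatever your function returned. For this scraper, output/scrape_heading.json should contain something along these lines (the heading text below is a placeholder, and the exact wrapping can vary between Botasaurus versions):

    {
        "heading": "Example heading text"
    }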
Here's what happened behind the scenes:
- The @browser decorator tells Botasaurus to use a real browser
- Botasaurus automatically provides a Driver instance (patched Selenium)
- driver.get_text("h1") is a shorthand for finding and extracting text
- The return value is automatically saved as JSON in the output/ folder
This is Botasaurus's philosophy: make the common case trivial, but keep power-user features accessible.
Understanding the three decorators
Botasaurus gives you three decorators for different scraping scenarios. Picking the right one is crucial for performance and reliability.
@browser: For JavaScript-heavy sites
Use this when you need a real browser—JavaScript rendering, dynamic content, or sites that detect headless browsers.
from botasaurus.browser import browser, Driver

@browser
def scrape_product(driver: Driver, data):
    """
    Scrape a product page that loads content via JavaScript.
    """
    driver.get("https://example.com/product/12345")

    # Wait for JavaScript to load the price
    driver.wait_for_element(".price", wait=10)

    # Extract data
    name = driver.get_text("h1.product-name")
    price = driver.get_text(".price")
    description = driver.get_text(".description")

    # Get all image URLs
    images = driver.links(".product-images img", attribute="src")

    return {
        "name": name,
        "price": price,
        "description": description,
        "images": images
    }

scrape_product()
The Driver object is Selenium on steroids. It has shortcuts like get_text(), links(), and wait_for_element() that make scraping cleaner.
@request: For fast HTTP scraping
When you don't need JavaScript rendering, use @request. It's 10-20x faster than browser automation because it skips the browser entirely.
from botasaurus.request import request, Request
from bs4 import BeautifulSoup

@request
def scrape_api(request: Request, data):
    """
    Scrape a site using HTTP requests (no browser).
    """
    response = request.get("https://api.example.com/products")

    # Parse JSON response
    products = response.json()
    return products

# Or scrape HTML with BeautifulSoup
@request
def scrape_html(request: Request, data):
    response = request.get("https://example.com/products")
    soup = BeautifulSoup(response.text, 'html.parser')

    products = []
    for item in soup.select('.product'):
        products.append({
            "name": item.select_one('.name').text,
            "price": item.select_one('.price').text
        })

    return products

scrape_html()
Use @request when the site's HTML contains all the data you need. If you inspect a page and see the data in the source HTML (not loaded by JavaScript), requests are the way to go.
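If you're unsure, a quick probe settles it. The sketch below reuses the @request pattern from above; the URL and the probe string are placeholders, so swap in a value you can actually see in the rendered page. If the value shows up in the raw HTML, @request is enough; if not, reach for @browser.

    from botasaurus.request import request, Request

    @request
    def check_raw_html(request: Request, data):
        # Fetch the page without executing any JavaScript
        response = request.get("https://example.com/products")

        # If a value you can see in the browser appears in the raw HTML,
        # the data is server-rendered and @request will work
        return {"server_rendered": "$19.99" in response.text}

    check_raw_html()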
@task: For non-scraping work
Sometimes you need to process files, call APIs, or perform data transformations. Use @task for anything that doesn't involve web scraping.
from botasaurus.task import task
import pandas as pd

@task
def process_csv(data):
    """
    Process a CSV file and extract insights.
    """
    df = pd.read_csv("data.csv")

    # Perform analysis
    summary = {
        "total_rows": len(df),
        "average_price": df['price'].mean(),
        "top_products": df.nlargest(10, 'sales')['name'].tolist()
    }

    return summary

process_csv()
The @task decorator gives you the same benefits (automatic output saving, error handling) without browser or HTTP overhead.
Scraping JavaScript-heavy websites
Modern websites load content dynamically with JavaScript. If you see a page with loading spinners or infinite scroll, you need browser automation.
Here's how to handle common JavaScript patterns:
Waiting for elements to load
Don't blindly wait 5 seconds. Wait for specific elements:
@browser
def scrape_dynamic_content(driver: Driver, data):
    driver.get("https://example.com/products")

    # Wait for the product list to appear
    driver.wait_for_element(".product-card", wait=15)

    # Extract products
    products = []
    for card in driver.select_all(".product-card"):
        name = card.get_text(".product-name")
        price = card.get_text(".product-price")
        products.append({"name": name, "price": price})

    return products
wait_for_element() is smart—it waits up to 15 seconds for the element to appear, but returns immediately once it does.
Handling infinite scroll
Many sites load more content as you scroll. Here's how to scrape them:
@browser
def scrape_infinite_scroll(driver: Driver, data):
    driver.get("https://example.com/feed")

    items = []
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for new content to load
        driver.sleep(2)

        # Calculate new scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # No more content
        last_height = new_height

    # Now extract all items
    for item in driver.select_all(".feed-item"):
        items.append({
            "title": item.get_text(".title"),
            "content": item.get_text(".content")
        })

    return items
This pattern scrolls until no new content loads, then extracts everything at once.
Clicking buttons and interacting
Sometimes you need to click "Load More" buttons or fill forms:
@browser
def scrape_with_interaction(driver: Driver, data):
    driver.get("https://example.com/search")

    # Fill search form
    driver.type("input[name='query']", "python books")

    # Click search button
    driver.click("button[type='submit']")

    # Wait for results
    driver.wait_for_element(".search-results", wait=10)

    # Click "Load More" until no more items
    while True:
        try:
            load_more = driver.select("button.load-more")
            load_more.click()
            driver.sleep(2)  # Wait for new items to load
        except:
            break  # Button disappeared, no more items

    # Extract all results
    results = []
    for result in driver.select_all(".search-result"):
        results.append({
            "title": result.get_text(".title"),
            "author": result.get_text(".author")
        })

    return results
Botasaurus's click(), type(), and select() methods make interaction simple.
Bypassing Cloudflare and anti-bot systems
This is where Botasaurus truly shines. Most scraping frameworks get blocked by Cloudflare within seconds. Botasaurus makes it look easy.
The google_get() trick
Instead of visiting a site directly, simulate arriving from a Google search result:
@browser
def scrape_cloudflare_protected(driver: Driver, data):
    # Don't use driver.get() - it gets blocked
    # Use google_get() instead
    driver.google_get("https://example.com/protected-page")

    # Wait for Cloudflare challenge to pass (if any)
    driver.wait_for_element("h1", wait=30)

    # Now extract data normally
    title = driver.get_text("h1")
    content = driver.get_text(".main-content")

    return {
        "title": title,
        "content": content
    }

scrape_cloudflare_protected()
google_get() visits Google first and then navigates to your target, so the visit arrives with a Google referer and looks like a click on an organic search result. Cloudflare is far less likely to block traffic that appears to come from a search engine.
Using real browser fingerprints
Botasaurus can use realistic browser configurations:
from botasaurus import bt

@browser(
    user_agent=bt.UserAgent.REAL,    # Use real browser fingerprint
    window_size=bt.WindowSize.REAL,  # Realistic window size
    block_resources=True             # Still block images for speed
)
def scrape_with_fingerprint(driver: Driver, data):
    driver.get("https://example.com")
    # Your scraping logic here
    return {"data": driver.get_text("body")}
The bt.UserAgent.REAL and bt.WindowSize.REAL options make your scraper mimic a real user's browser. This defeats fingerprinting techniques that check screen resolution, user agent, and browser capabilities.
Headless vs headful browsers
Some sites detect headless browsers. If you're getting blocked, try running with a visible browser:
@browser(
    headless=False,  # Show the browser window
    user_agent=bt.UserAgent.REAL
)
def scrape_tough_site(driver: Driver, data):
    driver.get("https://tough-site.com")
    # Scraping logic
    return {"data": "..."}
Running headless=False means the browser window appears on your screen. It's slower and requires a display, but some sites can't detect it.
Adding delays to appear human
Real users don't scrape at lightning speed. Add random delays:
import random

@browser
def scrape_humanlike(driver: Driver, data):
    driver.get("https://example.com/products")

    products = []
    for link in driver.links(".product-link"):
        # Random delay between 1-3 seconds
        driver.sleep(random.uniform(1, 3))

        driver.get(link)
        products.append({
            "name": driver.get_text("h1"),
            "price": driver.get_text(".price")
        })

    return products
The random.uniform(1, 3) creates a delay between 1 and 3 seconds, making your scraper's timing unpredictable like a human's.
Parallel scraping for speed
Scraping one page at a time is slow. If each page takes 3 seconds, scraping 100 pages takes 5 minutes. With 5 browsers running in parallel, the same job takes about a minute.
Botasaurus makes parallelization trivial:
@browser(
    parallel=5,    # Run 5 browsers simultaneously
    headless=True
)
def scrape_products(driver: Driver, url):
    """
    This function runs in parallel for each URL.
    """
    driver.get(url)
    driver.wait_for_element("h1", wait=10)

    return {
        "name": driver.get_text("h1"),
        "price": driver.get_text(".price"),
        "url": url
    }

# Pass a list of URLs
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
    # ... 100 more URLs
]

# Botasaurus automatically distributes URLs across 5 browsers
results = scrape_products(urls)
Pass a list of URLs, and Botasaurus handles the rest. It opens 5 browser instances, distributes URLs among them, and collects results. You don't write threading code.
How many parallel browsers?
More isn't always better. Each browser consumes ~200-300MB of RAM. On a machine with 8GB RAM, parallel=10 is reasonable. Beyond that, you risk memory issues.
Start with parallel=3, measure performance, and increase gradually:
from botasaurus import bt

@browser(
    parallel=bt.calc_max_parallel_browsers,  # Auto-calculate based on system
    headless=True
)
def scrape_auto_parallel(driver: Driver, url):
    # Scraping logic
    pass
bt.calc_max_parallel_browsers automatically determines the optimal number based on your CPU and RAM.
Reusing browsers for better performance
Opening a new browser for every page is expensive. If you're scraping many pages from the same site, reuse the browser:
@browser(
    parallel=3,
    reuse_driver=True,  # Keep browsers alive between pages
    headless=True
)
def scrape_efficiently(driver: Driver, url):
    driver.get(url)
    return {"data": driver.get_text("h1")}

# These 100 URLs will use just 3 browser instances
urls = ["https://example.com/page-{}".format(i) for i in range(100)]
results = scrape_efficiently(urls)
With reuse_driver=True, Botasaurus keeps browsers open and reuses them. This cuts overhead dramatically.
Using proxies and browser profiles
Proxy configuration
Proxies are essential for large-scale scraping. They distribute requests across multiple IP addresses, avoiding rate limits and blocks.
Basic proxy usage:
@browser(proxy="http://proxy-server.com:8080")
def scrape_with_proxy(driver: Driver, url):
    driver.get(url)

    # Verify proxy is working
    ip = driver.get_text("body")  # On ipinfo.io or similar
    print(f"Scraping from IP: {ip}")

    return {"data": driver.get_text(".content")}
For authenticated proxies (username/password):
@browser(
    proxy={
        "http": "http://user:pass@proxy.com:8080",
        "https": "https://user:pass@proxy.com:8080"
    }
)
def scrape_with_auth_proxy(driver: Driver, url):
    driver.get(url)
    return {"data": "..."}
Botasaurus supports SSL-authenticated proxies, which is rare. Most frameworks force you into complex workarounds, but Botasaurus handles it natively.
Browser profiles for logged-in scraping
Need to scrape content behind a login? Use browser profiles to save cookies and session data:
@browser(
    profile="my-linkedin-profile"  # Reuse this profile across scrapes
)
def scrape_linkedin(driver: Driver, data):
    driver.get("https://linkedin.com/in/someone")

    # First time: manually log in
    # Botasaurus saves cookies automatically
    # Next time: you're already logged in

    name = driver.get_text(".profile-name")
    headline = driver.get_text(".profile-headline")

    return {
        "name": name,
        "headline": headline
    }

scrape_linkedin()
The first time you run this, manually log in to LinkedIn in the browser window. Botasaurus saves the cookies. Every subsequent run reuses those cookies—you stay logged in.
Installing Chrome extensions
Sometimes you need browser extensions (like CAPTCHA solvers or ad blockers):
@browser(
    extensions=[
        "path/to/extension-1.crx",
        "path/to/extension-2.crx"
    ]
)
def scrape_with_extensions(driver: Driver, url):
    driver.get(url)
    # Extensions are pre-installed and active
    return {"data": driver.get_text(".content")}
Download any Chrome extension as a .crx file, and Botasaurus loads it automatically.
Caching to avoid re-scraping
Development involves running the same scraper dozens of times. Re-scraping wastes time and risks getting blocked. Botasaurus's caching solves this:
@browser(
    cache=True,          # Enable caching
    cache_duration=3600  # Cache for 1 hour (seconds)
)
def scrape_cached(driver: Driver, url):
    driver.get(url)

    # First run: scrapes the page and caches result
    # Subsequent runs: returns cached data instantly
    return {
        "title": driver.get_text("h1"),
        "content": driver.get_text(".article-body")
    }

# First call: takes 5 seconds (actual scraping)
result1 = scrape_cached("https://example.com/article")

# Second call: takes 0.1 seconds (cached)
result2 = scrape_cached("https://example.com/article")
Caching is per-URL. If you scrape page-1.html and then page-2.html, both are cached separately.
Want to force a fresh scrape? Delete the cache:
from botasaurus.cache import Cache

# Clear all caches
Cache.clear()

# Clear cache for specific function
Cache.clear("scrape_cached")
Custom cache keys
Sometimes URLs aren't unique enough. Add custom cache logic:
@browser(
    cache=True,
    cache_key=lambda data: f"{data['url']}-{data['user_id']}"
)
def scrape_personalized(driver: Driver, data):
    url = data['url']
    user_id = data['user_id']

    # Different users see different content
    driver.get(url)
    return {"data": driver.get_text(".personalized-content")}

# These create separate cache entries
scrape_personalized({"url": "example.com", "user_id": "123"})
scrape_personalized({"url": "example.com", "user_id": "456"})
The cache_key function determines how cache entries are keyed. Here, we key by both URL and user ID.
Handling authentication and cookies
Many valuable sites require login. Botasaurus makes authenticated scraping straightforward.
Manual login with profile persistence
The simplest approach: log in manually once, save the profile:
@browser(
    profile="amazon-profile",
    headless=False  # Show browser for manual login
)
def scrape_amazon_orders(driver: Driver, data):
    driver.get("https://amazon.com/orders")

    # First run: you'll see the login page
    # Log in manually, then press Enter in terminal to continue
    input("Logged in? Press Enter to continue...")

    # Botasaurus saves cookies automatically
    # Next runs: you're already logged in

    orders = []
    for order in driver.select_all(".order-card"):
        orders.append({
            "order_id": order.get_text(".order-id"),
            "date": order.get_text(".order-date"),
            "total": order.get_text(".order-total")
        })

    return orders

scrape_amazon_orders()
The profile parameter saves all cookies and local storage. Run this once, log in manually, and you stay logged in on every subsequent run until the cookies expire.
Programmatic login
For automation, log in programmatically:
@browser(profile="github-profile")
def scrape_github_private(driver: Driver, data):
    driver.get("https://github.com/login")

    # Fill login form
    driver.type("#login_field", "your-email@example.com")
    driver.type("#password", "your-password")

    # Click login button
    driver.click("input[name='commit']")

    # Wait for redirect
    driver.wait_for_element(".dashboard", wait=10)

    # Now scrape private repos
    driver.get("https://github.com/your-org/private-repo")
    readme = driver.get_text(".markdown-body")

    return {"readme": readme}
Combine this with profiles, and you only log in once. Subsequent runs reuse the session.
Handling CAPTCHA
Some sites show CAPTCHAs during login. For these, use manual solving or a CAPTCHA service:
@browser(
    profile="protected-site",
    headless=False  # Show browser to solve CAPTCHA
)
def scrape_with_captcha(driver: Driver, data):
    driver.get("https://protected-site.com/login")

    # Fill credentials
    driver.type("#email", "user@example.com")
    driver.type("#password", "password123")

    # Click login - CAPTCHA appears
    driver.click("button[type='submit']")

    # Wait for manual CAPTCHA solving
    input("Solve the CAPTCHA, then press Enter...")

    # Proceed with scraping
    driver.get("https://protected-site.com/data")
    return {"data": driver.get_text(".protected-content")}
For automated CAPTCHA solving, Botasaurus supports services like CapSolver:
@browser(
    captcha_solver="capsolver",
    capsolver_api_key="your-api-key-here"
)
def scrape_with_auto_captcha(driver: Driver, data):
    driver.get("https://site-with-captcha.com")

    # Botasaurus automatically solves CAPTCHAs
    driver.solve_captcha()

    return {"data": driver.get_text(".content")}
Note: CAPTCHA solving services cost money (typically $1-3 per 1000 CAPTCHAs). Manual solving is free but doesn't scale.
Building production-ready scrapers
Development scrapers are quick and dirty. Production scrapers need error handling, logging, and reliability.
Error handling and retries
Scraping fails. Networks time out, selectors change, servers return errors. Handle failures gracefully:
@browser(
    max_retry=3,  # Retry 3 times on failure
    retry_wait=5  # Wait 5 seconds between retries
)
def scrape_with_retries(driver: Driver, url):
    try:
        driver.get(url)
        driver.wait_for_element("h1", wait=15)

        title = driver.get_text("h1")

        # Validate data before returning
        if not title:
            raise Exception("Empty title - page might not have loaded")

        return {"title": title}
    except Exception as e:
        print(f"Error scraping {url}: {str(e)}")
        raise  # Re-raise to trigger retry
max_retry=3 automatically retries on exceptions. If it fails 3 times, Botasaurus returns None instead of crashing your entire scrape.
Blocking resources for speed
Images, CSS, and fonts slow down scraping. You don't need them for data extraction:
@browser(
    block_resources=True,  # Block images, CSS, fonts
    headless=True
)
def scrape_fast(driver: Driver, url):
    # Loads in 1 second instead of 5
    driver.get(url)
    return {"data": driver.get_text(".content")}
This can reduce page load time by 70-80%. Your scraper extracts text 4-5x faster.
For finer control, specify what to block:
@browser(
    block_resources=["image", "stylesheet", "font"]
)
def scrape_selective(driver: Driver, url):
    # Blocks images, CSS, fonts but loads JavaScript
    driver.get(url)
    return {"data": "..."}
Handling pagination properly
Most sites paginate results. Don't hardcode page numbers—follow "next" links:
@browser
def scrape_all_pages(driver: Driver, data):
    driver.get("https://example.com/products?page=1")

    all_products = []

    while True:
        # Scrape current page
        for product in driver.select_all(".product"):
            all_products.append({
                "name": product.get_text(".name"),
                "price": product.get_text(".price")
            })

        # Look for "next" button
        try:
            next_button = driver.select("a.next-page")
            next_button.click()
            driver.sleep(2)  # Wait for new page to load
        except:
            break  # No more pages

    return all_products
This pattern works for any paginated site. It clicks "next" until the button disappears.
Saving data incrementally
For large scrapes, don't wait until the end to save data. Save as you go:
import json

@browser
def scrape_large_dataset(driver: Driver, urls):
    results = []
    for i, url in enumerate(urls):
        driver.get(url)
        data = {
            "url": url,
            "title": driver.get_text("h1"),
            "content": driver.get_text(".content")
        }
        results.append(data)

        # Save a checkpoint every 10 items
        if (i + 1) % 10 == 0:
            with open(f"output/batch_{i}.json", "w") as f:
                json.dump(results, f)

        # Or append each item to a single file
        with open("output/all_data.jsonl", "a") as f:
            f.write(json.dumps(data) + "\n")

    return {"scraped": len(urls)}
If your scraper crashes after scraping 500 pages, you'll still have the data. Don't lose hours of work to a crash.
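If you do need to recover after a crash, one simple approach (a sketch built on the JSONL file from the example above) is to collect the URLs that were already saved and skip them on the next run:

    import json
    import os

    def load_scraped_urls(path="output/all_data.jsonl"):
        # Collect URLs that were already saved on a previous run
        scraped = set()
        if os.path.exists(path):
            with open(path) as f:
                for line in f:
                    scraped.add(json.loads(line)["url"])
        return scraped

    urls = [f"https://example.com/page-{i}" for i in range(100)]  # placeholder URL list
    remaining = [u for u in urls if u not in load_scraped_urls()]
    # Pass `remaining` to the scraper above instead of the full list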
Common mistakes and how to avoid them
After hundreds of hours with Botasaurus, here are the traps I see developers fall into:
1. Not reusing browsers
Opening a new browser for every scrape is expensive:
# Bad: Opens new browser each time (slow)
@browser
def scrape_inefficient(driver: Driver, url):
    driver.get(url)
    return {"data": driver.get_text("h1")}

# Call this 100 times = 100 browser launches
for url in urls:
    scrape_inefficient(url)
Instead, pass a list of URLs and use reuse_driver:
# Good: Reuses same browser (fast)
@browser(reuse_driver=True)
def scrape_efficient(driver: Driver, url):
    driver.get(url)
    return {"data": driver.get_text("h1")}

# Single browser handles all 100 URLs
scrape_efficient(urls)
This cuts execution time by 80-90% on large scrapes.
2. Ignoring cache during development
Running the same scraper repeatedly while debugging wastes time:
# Turn on cache during development
@browser(
    cache=True,
    cache_duration=3600  # 1 hour
)
def scrape_dev(driver: Driver, url):
    driver.get(url)
    return {"data": "..."}
First run scrapes the page. Subsequent runs return cached data instantly. When you're ready for production, remove cache=True.
3. Using headless browsers when headful works
Some developers always use headless=True for speed. But headless browsers are easier to detect:
# Gets blocked on some sites
@browser(headless=True)
def scrape_detected(driver: Driver, url):
    driver.get(url)
    return {"data": "..."}

# Works better
@browser(headless=False, user_agent=bt.UserAgent.REAL)
def scrape_stealthy(driver: Driver, url):
    driver.get(url)
    return {"data": "..."}
If you're getting blocked, try headless=False first before adding proxies or other complexity.
4. Not handling missing elements
Selectors change. Elements disappear. Your scraper should handle this:
# Bad: Crashes if price is missing
@browser
def scrape_fragile(driver: Driver, url):
    driver.get(url)
    name = driver.get_text("h1")       # Always present
    price = driver.get_text(".price")  # Sometimes missing!
    return {"name": name, "price": price}
Add fallbacks:
# Good: Handles missing elements
@browser
def scrape_robust(driver: Driver, url):
    driver.get(url)
    name = driver.get_text("h1")

    # Use try/except for optional elements
    try:
        price = driver.get_text(".price")
    except:
        price = "N/A"

    return {"name": name, "price": price}
Or use Botasaurus's built-in default values:
@browser
def scrape_with_defaults(driver: Driver, url):
    driver.get(url)
    name = driver.get_text("h1", default="Unknown")
    price = driver.get_text(".price", default="N/A")
    return {"name": name, "price": price}
5. Scraping too fast
Real humans don't navigate at computer speed. Add delays:
import random

@browser
def scrape_humanlike(driver: Driver, urls):
    products = []
    for url in urls:
        # Random delay between pages
        driver.sleep(random.uniform(2, 5))

        driver.get(url)
        products.append({"name": driver.get_text("h1")})

    return products
Scraping 100 pages in 30 seconds looks suspicious. Spread it over 5-10 minutes with random delays.
6. Not using parallel when you should
Scraping one page at a time is unnecessarily slow:
# Slow: 100 pages × 3 seconds = 5 minutes
@browser
def scrape_sequential(driver: Driver, url):
    driver.get(url)
    return {"data": "..."}

for url in urls:
    scrape_sequential(url)

# Fast: 100 pages ÷ 5 parallel = 1 minute
@browser(parallel=5, reuse_driver=True)
def scrape_parallel(driver: Driver, url):
    driver.get(url)
    return {"data": "..."}

scrape_parallel(urls)  # Pass list, not loop
If you're scraping more than 10 pages, use parallel.
7. Forgetting to clean data
Scraped data is messy. Always clean it:
@browser
def scrape_with_cleaning(driver: Driver, url):
    driver.get(url)
    price_raw = driver.get_text(".price")  # "$ 19.99 "

    # Clean the price
    price = price_raw.strip()               # "$ 19.99 " → "$ 19.99"
    price = price.replace("$", "").strip()  # "$ 19.99" → "19.99"
    price = float(price)                    # "19.99" → 19.99

    return {"price": price}
Botasaurus has built-in cleaners for common cases:
from botasaurus import cl

@browser
def scrape_auto_clean(driver: Driver, url):
    driver.get(url)
    price_raw = driver.get_text(".price")  # "$ 19.99 "
    price = cl.extract_numbers(price_raw)  # 19.99
    return {"price": price}
The cl module has cleaners for prices, dates, phone numbers, emails, and more.
Using Botasaurus in production
Ready to deploy your scraper? Here are some practical approaches:
Command-line scraper
The simplest deployment: run it as a script on a schedule (cron job):
# scraper.py
from botasaurus.browser import browser, Driver

@browser(
    headless=True,
    parallel=5,
    cache=True
)
def daily_price_scrape(driver: Driver, urls):
    # Scraping logic
    pass

if __name__ == "__main__":
    urls = [...]  # Load from database or file
    results = daily_price_scrape(urls)

    # Save to database
    save_to_db(results)
Run daily with cron:
0 2 * * * cd /path/to/scraper && python scraper.py
Flask API wrapper
For on-demand scraping, wrap Botasaurus in a simple Flask API:
from flask import Flask, request, jsonify
from botasaurus.browser import browser, Driver

app = Flask(__name__)

@browser(headless=True, reuse_driver=True)
def scrape_product(driver: Driver, url):
    driver.get(url)
    return {
        "name": driver.get_text("h1"),
        "price": driver.get_text(".price")
    }

@app.route('/scrape', methods=['POST'])
def scrape_endpoint():
    url = request.json.get('url')
    result = scrape_product(url)
    return jsonify(result)

if __name__ == '__main__':
    app.run(port=5000)
Now anyone can hit your API to scrape on demand.
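For a quick test, a minimal client looks like this (it assumes the Flask app above is running locally on the port from app.run(port=5000); the product URL is a placeholder):

    import requests

    resp = requests.post(
        "http://localhost:5000/scrape",
        json={"url": "https://example.com/product/12345"},
    )
    print(resp.json())  # e.g. {"name": "...", "price": "..."}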
Docker for consistency
Containerize your scraper for reliable deployments:
FROM python:3.11-slim

# Install Chrome dependencies
RUN apt-get update && apt-get install -y \
    chromium \
    chromium-driver \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

CMD ["python", "scraper.py"]
Build and run:
docker build -t my-scraper .
docker run my-scraper
This ensures your scraper works identically on any machine.
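The Dockerfile copies a requirements.txt that isn't shown above. A minimal one for the examples in this guide might look like this (flask, pandas, and beautifulsoup4 are only needed if you use the corresponding examples; pin versions as you see fit):

    botasaurus
    beautifulsoup4
    pandas
    flask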
Wrapping up
Botasaurus is what web scraping should have been all along: powerful but simple, undetectable but transparent, flexible but opinionated where it matters.
The key takeaways:
- Pick the right decorator: @browser for JavaScript sites, @request for fast HTTP scraping, @task for everything else
- Use anti-detection by default: google_get(), real user agents, and profiles make your scraper undetectable
- Parallelize everything: The parallel parameter turns single-threaded scrapes into blazing-fast operations
- Cache aggressively: Never re-scrape the same page twice during development
- Handle errors gracefully: max_retry, fallback values, and proper exception handling keep your scraper running
- Think like a human: Random delays, headful browsers, and gradual pagination make scraping sustainable
Most web scraping tutorials focus on extracting data. Botasaurus goes further—it solves the real problems of modern web scraping: detection, scale, and reliability.
You don't need a PhD in browser fingerprinting or an army of proxies. You need Botasaurus, a clear understanding of the three decorators, and the patience to let your scraper run at a human pace.
The web has evolved. Bot detection has evolved. Botasaurus keeps you one step ahead.