Your scraper was working fine yesterday. Today, you're staring at a CAPTCHA wall or worse—a silent IP ban that took hours to diagnose.
Anti-bot systems in 2026 are nothing like what they were even two years ago. Cloudflare's per-customer ML models learn your traffic patterns. DataDome's behavioral analysis catches scrapers that pass every fingerprint test. Akamai's JA4 fingerprinting spots libraries that JA3 couldn't touch.
The main difference between scrapers that succeed and those that get blocked is how they handle the full detection stack. Modern anti-bot systems combine TLS fingerprinting, JavaScript challenges, behavioral analysis, and IP reputation scoring. Bypassing just one layer isn't enough—you need to address all of them simultaneously.
This guide covers the exact techniques that achieved a 94% success rate across 50+ million requests in production last year. You'll learn methods that work against Cloudflare, DataDome, PerimeterX, Akamai, and Kasada in 2026.
What You'll Learn
- How modern anti-bots detect scrapers at every layer
- TLS fingerprinting bypass with curl_cffi and browser impersonation
- Stealth browser setup with Camoufox, Nodriver, and SeleniumBase UC Mode
- Human-like behavior simulation that fools behavioral analysis
- Proxy strategies that maintain session integrity
- CAPTCHA handling without expensive solving services
- JavaScript challenge navigation
How Modern Anti-Bot Systems Work in 2026
Before diving into bypass techniques, you need to understand how detection works. Anti-bot systems have evolved beyond simple IP blocking into multi-layered defense platforms.
TLS/JA3/JA4 Fingerprinting
When your scraper connects over HTTPS, a TLS handshake occurs before any HTTP data transfers. During this handshake, your client reveals its supported cipher suites, TLS extensions, and protocol versions.
JA3 fingerprinting extracts five fields from the ClientHello packet: TLS version, cipher suites, extensions, elliptic curves, and elliptic curve point formats. These values get concatenated and hashed into a unique identifier.
Example JA3 string:
771,4867-4865-4866-52393-52392-49195,0-23-65281-10-11-35-16,29-23-24,0
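The fingerprint itself is simply the MD5 hash of that comma-separated string. As a quick illustration of the final hashing step (using the example string above, not a real browser capture):
import hashlib
# TLSVersion,Ciphers,Extensions,EllipticCurves,EllipticCurvePointFormats
ja3_string = "771,4867-4865-4866-52393-52392-49195,0-23-65281-10-11-35-16,29-23-24,0"
# The JA3 fingerprint is the MD5 hex digest of the concatenated fields
print(hashlib.md5(ja3_string.encode()).hexdigest())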
The problem? Python's requests library produces a JA3 hash that screams "automated script." Cloudflare maintains databases of known bot signatures and blocks matching fingerprints instantly.
JA4 emerged in 2023 in response to browsers randomizing TLS extension order: Chrome began shuffling ClientHello extensions, which made JA3 hashes unstable. JA4 sorts the cipher suites and extensions before hashing, so reordering no longer changes the fingerprint.
Browser Fingerprinting
JavaScript-based fingerprinting goes far beyond User-Agent strings. Sites collect canvas fingerprints, WebGL renderer info, audio context signatures, installed fonts, screen dimensions, timezone data, and hundreds of other data points.
Headless browsers expose automation markers everywhere (a quick check is sketched after this list):
- navigator.webdriver returns true
- Chrome's HeadlessChrome appears in the User-Agent
- Missing browser plugins and extensions
- Identical canvas fingerprints across sessions
- No mouse movement events between clicks
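You can see several of these markers for yourself by launching a stock headless browser and reading the values back. A minimal diagnostic sketch, assuming plain Playwright with Chromium installed (this is for inspection only, not a bypass):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    # Read back a few of the markers listed above
    markers = page.evaluate("""() => ({
        webdriver: navigator.webdriver,
        userAgent: navigator.userAgent,
        plugins: navigator.plugins.length
    })""")
    print(markers)  # typically webdriver=True and 'HeadlessChrome' in the UA
    browser.close()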
Behavioral Analysis
This is where most scrapers fail in 2026. Even with perfect fingerprints, behavioral patterns give you away.
Real users don't request 50 pages in 10 seconds. They don't navigate in perfectly sequential order. They pause to read content, move their mouse while thinking, and occasionally scroll past what they're looking for.
Anti-bot systems track the following signals (a toy illustration of the timing signal follows this list):
- Request timing and frequency
- Navigation path patterns
- Mouse movement trajectories
- Scroll behavior
- Time spent on each page
- Click precision and timing
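To make the timing signal concrete, here is a toy illustration (not any vendor's actual scoring model): perfectly regular request intervals are trivially distinguishable from variable human pacing.
import statistics

bot_intervals = [2.0, 2.0, 2.0, 2.0, 2.0]        # fixed time.sleep(2) between requests
human_intervals = [3.1, 8.7, 2.4, 14.9, 5.2]     # variable reading/thinking time

def coefficient_of_variation(intervals):
    return statistics.stdev(intervals) / statistics.mean(intervals)

print(coefficient_of_variation(bot_intervals))    # 0.0 - an obvious automation signal
print(coefficient_of_variation(human_intervals))  # substantially higher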
IP Reputation Scoring
Your IP address carries historical baggage. Datacenter IPs get flagged immediately. Residential IPs that previously triggered blocks carry low trust scores. Geographic inconsistencies between your IP location and browser timezone raise flags.
Modern systems also analyze ASN (Autonomous System Number) data to identify traffic from hosting providers, VPNs, and known proxy services.
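It can be worth checking how an exit IP presents itself before sending real traffic through it. A minimal sketch, assuming the public ipinfo.io JSON endpoint (an arbitrary choice of IP-intelligence service) and curl_cffi, which is covered in Step 1; the proxy URL is a placeholder:
from curl_cffi import requests

proxy = "http://user:pass@proxy.example.com:8080"  # placeholder exit to inspect

resp = requests.get(
    "https://ipinfo.io/json",
    proxies={"http": proxy, "https": proxy},
    timeout=15,
)
info = resp.json()
# 'org' carries the ASN and owner (hosting providers are a red flag);
# 'timezone' should match what your browser profile reports.
print(info.get("org"), info.get("city"), info.get("timezone"))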
Step 1: Master TLS Fingerprint Impersonation
The first defense layer you'll hit is TLS fingerprinting. If your client's JA3/JA4 signature doesn't match a legitimate browser, you're blocked before any HTTP request completes.
Using curl_cffi for Browser-Like TLS
curl_cffi is a Python library that wraps curl-impersonate, allowing you to send requests with TLS fingerprints identical to real browsers.
Install it first:
pip install curl_cffi
Basic usage looks almost identical to the requests library:
from curl_cffi import requests
response = requests.get(
"https://www.example.com",
impersonate="chrome136"
)
print(response.status_code)
The impersonate parameter tells curl_cffi which browser's TLS fingerprint to use. Available options include Chrome 131-136, Firefox 133+, Safari 18.4, and several Edge versions.
Handling Sessions and Cookies
For multi-request scraping, maintain session state:
from curl_cffi import requests
session = requests.Session()
# First request establishes cookies
session.get(
"https://httpbin.org/cookies/set/session_id/abc123",
impersonate="chrome136"
)
# Subsequent requests include cookies automatically
response = session.get(
"https://httpbin.org/cookies",
impersonate="chrome136"
)
print(response.json())
The session object persists cookies between requests, mimicking how real browsers maintain state.
Adding Proxy Support
Combine curl_cffi with residential proxies for maximum effectiveness:
from curl_cffi import requests
proxy = "http://user:pass@gate.roundproxies.com:8080"
proxies = {"http": proxy, "https": proxy}
response = requests.get(
"https://www.target-site.com",
impersonate="chrome136",
proxies=proxies
)
Residential proxies from providers like Roundproxies.com use real ISP-assigned IP addresses, making them harder to detect than datacenter IPs.
When curl_cffi Isn't Enough
curl_cffi handles TLS fingerprinting perfectly, but it can't execute JavaScript. For sites requiring JS execution or complex interactions, you'll need the stealth browsers covered in Step 2; a minimal escalation pattern is sketched after the two lists below.
Use curl_cffi when:
- Target site has basic protection
- You only need static HTML
- Speed and efficiency matter most
- No JavaScript challenges appear
Switch to browsers when:
- JavaScript challenges block requests
- Sites require interaction (clicks, forms)
- Canvas/WebGL fingerprinting is active
- Turnstile or similar CAPTCHAs appear
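One way to encode this decision is a simple escalation path: try curl_cffi first and fall back to a browser only when the response looks blocked. A minimal sketch; the block markers and the fetch_with_browser helper are illustrative placeholders, not part of any library:
from curl_cffi import requests as cffi_requests

BLOCK_MARKERS = ("captcha", "challenge", "just a moment", "verify you are human")

def fetch(url: str) -> str:
    resp = cffi_requests.get(url, impersonate="chrome136", timeout=30)
    blocked = resp.status_code in (403, 429, 503) or any(
        marker in resp.text.lower() for marker in BLOCK_MARKERS
    )
    if not blocked:
        return resp.text
    # Escalate to a stealth browser from Step 2 (e.g. Camoufox)
    return fetch_with_browser(url)  # hypothetical helper, defined elsewhere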
Step 2: Configure Stealth Browser Automation
When curl_cffi can't get through, stealth browsers become necessary. Standard Selenium and Playwright get detected instantly—you need specialized tools.
Option A: Camoufox (Best for Stealth)
Camoufox is an open-source Firefox-based browser designed specifically for scraping. It modifies Firefox at the C++ level, making fingerprint spoofing undetectable by JavaScript.
Install it:
pip install -U camoufox[geoip]
python -m camoufox fetch
The fetch step downloads Camoufox's patched Firefox build; the stock Playwright Firefox is not used.
Basic synchronous usage:
from camoufox.sync_api import Camoufox
with Camoufox(headless=True) as browser:
page = browser.new_page()
page.goto("https://nowsecure.nl")
# Check if we passed the bot test
content = page.content()
print("Passed!" if "You are not a bot" in content else "Blocked")
Camoufox generates realistic fingerprints automatically. Each launch creates a new, coherent fingerprint profile including screen size, fonts, timezone, and hardware identifiers.
Custom Fingerprint Configuration
Override specific values when needed:
from camoufox.sync_api import Camoufox
config = {
'window.outerHeight': 1080,
'window.outerWidth': 1920,
'window.innerHeight': 1008,
'window.innerWidth': 1920,
'navigator.language': 'en-US',
'navigator.hardwareConcurrency': 8,
}
with Camoufox(
headless=True,
config=config,
i_know_what_im_doing=True
) as browser:
page = browser.new_page()
page.goto("https://browserleaks.com/javascript")
The i_know_what_im_doing flag suppresses warnings about custom configurations. Use it carefully—inconsistent fingerprints trigger detection.
Async Mode for Scale
For scraping multiple pages concurrently:
from camoufox.async_api import AsyncCamoufox
import asyncio
async def scrape_page(browser, url):
page = await browser.new_page()
await page.goto(url)
content = await page.content()
await page.close()
return content
async def main():
urls = [
"https://example1.com",
"https://example2.com",
"https://example3.com"
]
async with AsyncCamoufox(headless=True) as browser:
tasks = [scrape_page(browser, url) for url in urls]
results = await asyncio.gather(*tasks)
for url, content in zip(urls, results):
print(f"Scraped {len(content)} chars from {url}")
asyncio.run(main())
Option B: SeleniumBase UC Mode
If you have existing Selenium code, SeleniumBase UC Mode adds stealth capabilities without a complete rewrite.
pip install seleniumbase
UC Mode works by launching Chrome normally, then attaching the WebDriver afterward. This produces a fingerprint identical to a human-launched browser.
from seleniumbase import SB
with SB(uc=True) as sb:
sb.uc_open_with_reconnect("https://nowsecure.nl", 4)
# UC Mode disconnects during sensitive operations
sb.uc_click("button#start")
# Access page content after interactions
print(sb.get_page_source())
The uc_open_with_reconnect method handles Cloudflare challenges automatically. The second parameter (4) specifies seconds to wait for challenge completion.
Handling CAPTCHAs with UC Mode
SeleniumBase includes built-in CAPTCHA handling:
from seleniumbase import SB
with SB(uc=True) as sb:
sb.uc_open_with_reconnect("https://protected-site.com", 4)
# Automatically click Turnstile checkbox if present
sb.uc_gui_click_captcha()
# Continue scraping after CAPTCHA
sb.click("a.product-link")
data = sb.get_text("div.product-info")
Option C: Nodriver (CDP-Based Approach)
Nodriver communicates with Chrome directly using Chrome DevTools Protocol, avoiding WebDriver detection vectors entirely.
pip install nodriver
import nodriver as uc
async def main():
browser = await uc.start()
page = await browser.get("https://nowsecure.nl")
# Wait for content to load
await page.sleep(3)
# Extract data
content = await page.get_content()
print(content)
await browser.stop()
if __name__ == "__main__":
uc.loop().run_until_complete(main())
Nodriver's async-only architecture requires refactoring synchronous code, but it achieves lower detection rates against advanced anti-bot systems.
Which Tool Should You Choose?
Choose Camoufox if:
- Maximum stealth is critical
- You can work with Firefox
- Target sites have aggressive protection
Choose SeleniumBase UC Mode if:
- You have existing Selenium code
- Built-in CAPTCHA handling matters
- Chrome compatibility is required
Choose Nodriver if:
- CDP-level control is needed
- You're building new projects
- Async architecture fits your workflow
Step 3: Implement Human-Like Behavior Patterns
Perfect fingerprints mean nothing if your behavior screams "bot." This step covers the techniques that fool behavioral analysis systems.
Natural Request Timing
Never use fixed delays. Real human browsing shows variable timing based on content consumption.
import random
import time
def human_delay(min_seconds=2, max_seconds=8):
"""
Generate delays that mimic human reading patterns.
Longer content = longer delays.
"""
base_delay = random.uniform(min_seconds, max_seconds)
# Add occasional longer pauses (checking phone, distracted)
if random.random() < 0.1:
base_delay += random.uniform(5, 15)
# Add micro-variations
jitter = random.gauss(0, 0.5)
return max(0.5, base_delay + jitter)
# Usage between requests
time.sleep(human_delay())
Content-Aware Timing
Adjust delays based on page content length:
def reading_delay(content_length, wpm=200):
"""
Calculate realistic reading time based on content.
Average adult reads 200-300 words per minute.
"""
words = content_length / 5 # Average word length
reading_time = (words / wpm) * 60 # Convert to seconds
# Add scanning time (not everyone reads everything)
actual_time = reading_time * random.uniform(0.3, 0.7)
# Minimum 2 seconds, maximum 30 seconds per page
return max(2, min(30, actual_time))
Mouse Movement Simulation
Behavioral analysis tracks mouse movement patterns. Bots move in straight lines at constant velocity. Humans don't.
Using Ghost Cursor with Puppeteer:
const { createCursor } = require('ghost-cursor');
const puppeteer = require('puppeteer');
async function humanBrowse(page) {
const cursor = createCursor(page);
// Move to element with natural curve
await cursor.move('button.submit');
// Add hesitation before clicking
await page.waitForTimeout(Math.random() * 500 + 200);
// Click with realistic timing
await cursor.click('button.submit');
}
For Python with Playwright, use human-like movement functions:
import asyncio
import random
import math
async def bezier_mouse_move(page, start_x, start_y, end_x, end_y):
"""
Move mouse along a Bezier curve with realistic acceleration.
"""
# Generate control points for curve
ctrl_x = start_x + (end_x - start_x) * random.uniform(0.3, 0.7)
ctrl_y = start_y + (end_y - start_y) * random.uniform(0.2, 0.8)
# Add slight overshoot
overshoot = random.uniform(0, 15)
steps = random.randint(20, 40)
for i in range(steps + 1):
t = i / steps
# Quadratic Bezier curve
x = (1-t)**2 * start_x + 2*(1-t)*t * ctrl_x + t**2 * (end_x + overshoot)
y = (1-t)**2 * start_y + 2*(1-t)*t * ctrl_y + t**2 * end_y
await page.mouse.move(x, y)
# Variable speed (slower at start and end)
speed_factor = 4 * t * (1 - t) # Parabolic speed curve
delay = random.uniform(5, 20) / (speed_factor + 0.5)
await asyncio.sleep(delay / 1000)
# Correct overshoot
if overshoot > 5:
await page.mouse.move(end_x, end_y)
Scroll Behavior Simulation
Real users scroll in bursts, not smooth continuous motion:
async def human_scroll(page, direction='down', distance=None):
"""
Simulate human scrolling with variable speed and pauses.
"""
if distance is None:
distance = random.randint(200, 600)
scrolled = 0
while scrolled < distance:
# Variable scroll chunk
chunk = random.randint(50, 150)
if direction == 'down':
await page.mouse.wheel(0, chunk)
else:
await page.mouse.wheel(0, -chunk)
scrolled += chunk
# Micro-pause between scroll events
await asyncio.sleep(random.uniform(0.05, 0.15))
# Occasional longer pause (reading)
if random.random() < 0.2:
await asyncio.sleep(random.uniform(0.5, 2))
Navigation Pattern Randomization
Don't scrape pages in sequential order. Mix in natural browsing behavior:
import random
def create_browsing_path(target_urls, decoy_ratio=0.2):
"""
Create a realistic browsing path with natural navigation.
"""
path = []
decoy_pages = [
"/about", "/contact", "/faq",
"/terms", "/privacy"
]
for url in target_urls:
# Occasionally visit non-target pages
if random.random() < decoy_ratio:
decoy = random.choice(decoy_pages)
path.append(('decoy', decoy))
path.append(('target', url))
# Sometimes go back to homepage
if random.random() < 0.1:
path.append(('navigation', '/'))
return path
Step 4: Configure Smart Proxy Rotation
Even with perfect fingerprints and behavior, IP reputation matters. This step covers proxy strategies that maintain high success rates.
Session-Based Proxy Assignment
Don't randomly rotate proxies on every request. Maintain IP consistency within browsing sessions:
import hashlib
from collections import defaultdict
class SessionProxyManager:
def __init__(self, proxy_list):
self.proxies = proxy_list
self.session_map = defaultdict(str)
self.proxy_health = {p: 1.0 for p in proxy_list}
def get_proxy(self, session_id, target_domain):
"""
Assign consistent proxy to session/domain combination.
"""
key = f"{session_id}:{target_domain}"
if key not in self.session_map:
# Select proxy based on health score
healthy_proxies = [
p for p in self.proxies
if self.proxy_health[p] > 0.5
]
if not healthy_proxies:
healthy_proxies = self.proxies
# Deterministic selection for consistency
idx = int(hashlib.md5(key.encode()).hexdigest(), 16)
proxy = healthy_proxies[idx % len(healthy_proxies)]
self.session_map[key] = proxy
return self.session_map[key]
def report_failure(self, proxy):
"""Reduce health score on failure."""
self.proxy_health[proxy] *= 0.8
def report_success(self, proxy):
"""Increase health score on success."""
self.proxy_health[proxy] = min(1.0, self.proxy_health[proxy] * 1.1)
Geographic Consistency
Match proxy location with browser timezone and language settings:
from camoufox.sync_api import Camoufox
# Proxy located in Germany
proxy = "http://user:pass@de.roundproxies.com:8080"
config = {
'navigator.language': 'de-DE',
'navigator.languages': ['de-DE', 'de', 'en'],
}
with Camoufox(
headless=True,
proxy={"server": proxy},
config=config,
geoip=True # Auto-match timezone to proxy IP
) as browser:
page = browser.new_page()
page.goto("https://target-site.com")
Camoufox's geoip=True parameter automatically sets timezone and locale based on proxy IP location.
Proxy Type Selection
Different proxy types suit different use cases:
Residential Proxies:
- Real ISP-assigned IPs
- Highest trust scores
- Best for heavily protected sites
- Higher cost per request
ISP Proxies:
- Static IPs from ISPs
- Good for account management
- Consistent performance
- Medium cost
Datacenter Proxies:
- Fastest speeds
- Lowest cost
- Easily detected by sophisticated systems
- Good for lightly protected sites
Mobile Proxies:
- Cellular network IPs
- Very high trust scores
- Expensive
- Best for mobile-focused sites
For most scraping tasks targeting protected sites, residential proxies provide the best cost-to-success ratio.
Handling Proxy Failures Gracefully
Build retry logic that switches proxies on failures:
import asyncio
import hashlib
from curl_cffi import requests
async def fetch_with_retry(url, proxy_manager, max_retries=3):
"""
Fetch URL with automatic proxy rotation on failure.
"""
session_id = hashlib.md5(url.encode()).hexdigest()[:8]
domain = url.split('/')[2]
for attempt in range(max_retries):
proxy = proxy_manager.get_proxy(session_id, domain)
try:
response = requests.get(
url,
impersonate="chrome136",
proxies={"http": proxy, "https": proxy},
timeout=30
)
if response.status_code == 200:
proxy_manager.report_success(proxy)
return response
# Soft failure - page loaded but blocked
if response.status_code in [403, 429]:
proxy_manager.report_failure(proxy)
proxy_manager.session_map.pop(
f"{session_id}:{domain}", None
)
except Exception as e:
proxy_manager.report_failure(proxy)
proxy_manager.session_map.pop(
f"{session_id}:{domain}", None
)
# Exponential backoff
await asyncio.sleep(2 ** attempt)
return None
Step 5: Handle JavaScript Challenges
Modern anti-bot systems use JavaScript challenges that must execute in a real browser environment. Here's how to navigate them.
Cloudflare Turnstile
Turnstile replaced traditional CAPTCHAs with invisible challenges. Three variants exist:
- Non-interactive (Invisible): Runs silently in background
- Invisible with brief check: Shows "Verifying..." for 1-2 seconds
- Interactive: Requires checkbox click
For non-interactive Turnstile, stealth browsers handle it automatically:
from seleniumbase import SB
with SB(uc=True) as sb:
# Opens page and waits for Turnstile
sb.uc_open_with_reconnect("https://turnstile-protected.com", 4)
# If interactive Turnstile appears
if sb.is_element_visible("iframe[src*='turnstile']"):
sb.uc_gui_click_captcha()
# Continue after verification
sb.click("button.proceed")
Cloudflare Under Attack Mode
When sites enable "Under Attack Mode," a 5-second JavaScript challenge runs. Wait for it to complete:
from camoufox.async_api import AsyncCamoufox
import asyncio
async def bypass_cloudflare_uam(url):
async with AsyncCamoufox(headless=True) as browser:
page = await browser.new_page()
await page.goto(url)
# Wait for challenge page to clear
# Look for absence of challenge elements
for _ in range(20):
content = await page.content()
if "Checking your browser" not in content:
break
await asyncio.sleep(0.5)
# Now scrape actual content
return await page.content()
JavaScript Function Hooks
Some detection scripts check for automation markers via JavaScript. Hook and override them:
from playwright.sync_api import sync_playwright
def stealth_context(playwright):
browser = playwright.chromium.launch(headless=True)
context = browser.new_context()
# Inject stealth scripts before page loads
context.add_init_script("""
// Remove webdriver flag
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
// Override permissions API
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications' ?
Promise.resolve({ state: Notification.permission }) :
originalQuery(parameters)
);
// Spoof plugin array
Object.defineProperty(navigator, 'plugins', {
get: () => [1, 2, 3, 4, 5]
});
// Fix chrome object
window.chrome = {
runtime: {}
};
""")
return context
Date/Timing Detection Bypass
Anti-bot scripts analyze timing precision. Real browsers have small timing variations:
// Inject before page loads
const originalDate = Date;
const originalPerformance = window.performance.now;
Date = function(...args) {
const date = new originalDate(...args);
if (args.length === 0) {
// Add small random offset to current time
return new originalDate(date.getTime() + Math.random() * 50);
}
return date;
};
window.performance.now = function() {
// Add micro-jitter to performance timing
return originalPerformance.call(performance) + Math.random() * 0.1;
};
Step 6: Implement CAPTCHA Handling Strategies
When CAPTCHAs appear despite stealth measures, you have several options.
Prevention First
The best CAPTCHA is one that never appears. Reduce trigger rates with the following practices (a minimal rate limiter sketch follows the list):
- Maintaining consistent sessions: Same IP + fingerprint throughout session
- Respecting rate limits: Slower scraping triggers fewer challenges
- Natural navigation: Enter through homepage, follow links naturally
- Good fingerprint hygiene: Rotate fingerprints between sessions, not during
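As a concrete example of the rate-limiting point above, here is a minimal per-domain limiter sketch; the interval values are illustrative defaults, not tuned recommendations:
import random
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforce a jittered minimum gap between requests to the same host."""

    def __init__(self, min_interval=12.0, jitter=6.0):
        self.min_interval = min_interval   # seconds between hits on one host
        self.jitter = jitter               # extra random spread in seconds
        self.last_hit = {}

    def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        now = time.monotonic()
        if host in self.last_hit:
            gap = self.min_interval + random.uniform(0, self.jitter)
            elapsed = now - self.last_hit[host]
            if elapsed < gap:
                time.sleep(gap - elapsed)
        self.last_hit[host] = time.monotonic()

# Usage: call limiter.wait(url) immediately before each request to that host.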
Retry-Based Approach
For occasional CAPTCHAs, retry with different configurations:
async def scrape_with_captcha_retry(url, max_retries=3):
for attempt in range(max_retries):
async with AsyncCamoufox(headless=True) as browser:
page = await browser.new_page()
await page.goto(url)
content = await page.content()
# Check for CAPTCHA indicators
captcha_indicators = [
"captcha", "challenge",
"verify you are human",
"checking your browser"
]
has_captcha = any(
ind in content.lower()
for ind in captcha_indicators
)
if not has_captcha:
return content
# Wait before retry with new fingerprint
await asyncio.sleep(5 * (attempt + 1))
return None
Solver Service Integration
For sites with persistent CAPTCHAs, integrate solving services:
import requests
import time
def solve_recaptcha_v2(site_key, page_url, api_key):
"""
Submit reCAPTCHA to solving service and wait for result.
"""
# Submit task
submit_response = requests.post(
"http://2captcha.com/in.php",
data={
"key": api_key,
"method": "userrecaptcha",
"googlekey": site_key,
"pageurl": page_url,
"json": 1
}
)
task_id = submit_response.json().get("request")
# Poll for result
for _ in range(60):
time.sleep(5)
result = requests.get(
"http://2captcha.com/res.php",
params={
"key": api_key,
"action": "get",
"id": task_id,
"json": 1
}
)
data = result.json()
if data.get("status") == 1:
return data.get("request")
if "ERROR" in data.get("request", ""):
return None
return None
Integrate the token back into your browser session:
async def submit_captcha_token(page, token):
"""
Inject solved CAPTCHA token into page.
"""
await page.evaluate(f"""
document.getElementById('g-recaptcha-response').innerHTML = '{token}';
// Trigger callback if exists
if (typeof ___grecaptcha_cfg !== 'undefined') {{
Object.keys(___grecaptcha_cfg.clients).forEach(key => {{
const client = ___grecaptcha_cfg.clients[key];
if (client.callback) {{
client.callback('{token}');
}}
}});
}}
""")
Step 7: Build a Production-Ready Scraping System
Combining all techniques into a reliable production system requires careful orchestration.
Complete Scraper Architecture
import asyncio
import random
import hashlib
from dataclasses import dataclass
from typing import Optional, List
from camoufox.async_api import AsyncCamoufox
@dataclass
class ScrapingResult:
url: str
success: bool
content: Optional[str]
error: Optional[str]
class ProductionScraper:
def __init__(self, proxy_list: List[str]):
self.proxy_manager = SessionProxyManager(proxy_list)
self.results = []
async def scrape_url(self, url: str, session_id: str) -> ScrapingResult:
domain = url.split('/')[2]
proxy = self.proxy_manager.get_proxy(session_id, domain)
try:
async with AsyncCamoufox(
headless=True,
proxy={"server": proxy},
geoip=True
) as browser:
page = await browser.new_page()
# Natural navigation delay
await asyncio.sleep(random.uniform(1, 3))
await page.goto(url, wait_until="networkidle")
# Wait for dynamic content
await asyncio.sleep(random.uniform(2, 5))
# Simulate reading with scroll
await self.simulate_reading(page)
content = await page.content()
# Check for blocks
if self.is_blocked(content):
self.proxy_manager.report_failure(proxy)
return ScrapingResult(
url=url, success=False,
content=None, error="Blocked"
)
self.proxy_manager.report_success(proxy)
return ScrapingResult(
url=url, success=True,
content=content, error=None
)
except Exception as e:
self.proxy_manager.report_failure(proxy)
return ScrapingResult(
url=url, success=False,
content=None, error=str(e)
)
async def simulate_reading(self, page):
"""Add human-like reading behavior."""
for _ in range(random.randint(2, 4)):
scroll_amount = random.randint(100, 400)
await page.mouse.wheel(0, scroll_amount)
await asyncio.sleep(random.uniform(0.5, 2))
def is_blocked(self, content: str) -> bool:
"""Detect common block indicators."""
indicators = [
"access denied", "blocked",
"captcha", "please verify",
"unusual traffic"
]
content_lower = content.lower()
return any(ind in content_lower for ind in indicators)
async def scrape_batch(
self,
urls: List[str],
max_concurrent: int = 5
) -> List[ScrapingResult]:
"""
Scrape multiple URLs with controlled concurrency.
"""
semaphore = asyncio.Semaphore(max_concurrent)
session_id = hashlib.md5(
str(urls).encode()
).hexdigest()[:8]
async def bounded_scrape(url):
async with semaphore:
return await self.scrape_url(url, session_id)
tasks = [bounded_scrape(url) for url in urls]
return await asyncio.gather(*tasks)
Error Handling and Recovery
async def resilient_scrape(
scraper: ProductionScraper,
url: str,
max_retries: int = 3
) -> ScrapingResult:
"""
Scrape with exponential backoff and fingerprint rotation.
"""
for attempt in range(max_retries):
# Generate new session ID for each retry
session_id = f"retry_{attempt}_{random.randint(1000, 9999)}"
result = await scraper.scrape_url(url, session_id)
if result.success:
return result
# Exponential backoff with jitter
delay = (2 ** attempt) + random.uniform(0, 1)
await asyncio.sleep(delay)
return ScrapingResult(
url=url, success=False,
content=None, error="Max retries exceeded"
)
Common Mistakes That Get You Blocked
Even with proper tools, these mistakes cause unnecessary blocks:
Mistake 1: Headless Mode Detection
True headless mode produces detectable fingerprints. Use virtual displays instead:
# BAD - Detectable
with SB(uc=True, headless=True) as sb:
sb.uc_open_with_reconnect(url)
# GOOD - Uses virtual display
with SB(uc=True, xvfb=True) as sb:
sb.uc_open_with_reconnect(url)
The xvfb=True parameter runs a headed browser inside a virtual framebuffer. The fingerprint appears identical to a real desktop browser.
Mistake 2: Inconsistent Fingerprints
Changing fingerprint values mid-session triggers detection:
# BAD - Fingerprint changes during session
for page in pages:
with Camoufox() as browser: # New fingerprint each time
scrape(browser, page)
# GOOD - Consistent fingerprint for session
with Camoufox() as browser:
for page in pages:
scrape(browser, page)
Mistake 3: Using Deprecated Tools
puppeteer-stealth was discontinued in February 2025. Cloudflare specifically detects its patterns now:
// BAD - Outdated and detected
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
// GOOD - Use actively maintained alternatives
// Camoufox (Python), Nodriver (Python), SeleniumBase UC Mode
Mistake 4: Ignoring HTTP/2 Fingerprinting
Modern anti-bot systems analyze HTTP/2 SETTINGS frames and header ordering:
# BAD - HTTP/1.1 only
import requests
response = requests.get(url)
# GOOD - Full HTTP/2 with proper fingerprint
from curl_cffi import requests
response = requests.get(url, impersonate="chrome136")
Mistake 5: Sequential URL Patterns
Scraping pages in numerical order reveals bot behavior:
# BAD - Obvious pattern
urls = [f"https://site.com/page/{i}" for i in range(1, 100)]
for url in urls:
scrape(url)
# GOOD - Randomized order
import random
random.shuffle(urls)
for url in urls:
scrape(url)
time.sleep(human_delay())
Future-Proofing Your Scraping Setup
Anti-bot technology continues advancing. Stay ahead with these practices:
Monitor Detection Landscapes
Test your scraper regularly against detection services (a minimal check script follows the list):
- BrowserScan (https://www.browserscan.net/)
- CreepJS (https://abrahamjuliot.github.io/creepjs/)
- Incolumitas (https://bot.incolumitas.com/)
- nowsecure.nl
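A minimal way to automate that check is to load one of these pages with your stealth browser and keep a screenshot for review. The sketch below assumes Camoufox from Step 2; the URL and output filename are just examples:
from camoufox.sync_api import Camoufox

def detection_check(url="https://www.browserscan.net/", out="detection_check.png"):
    with Camoufox(headless=True) as browser:
        page = browser.new_page()
        page.goto(url)
        page.wait_for_timeout(10000)   # give the page time to finish its tests
        page.screenshot(path=out, full_page=True)

detection_check()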
Track Tool Updates
Follow development of your primary tools:
- Camoufox: https://github.com/daijro/camoufox
- SeleniumBase: https://github.com/seleniumbase/SeleniumBase
- Nodriver: https://github.com/ultrafunkamsterdam/nodriver
- curl_cffi: https://github.com/lexiforest/curl_cffi
Build Abstraction Layers
Don't hardcode tool dependencies. Build interfaces that allow swapping:
from abc import ABC, abstractmethod
class BrowserInterface(ABC):
@abstractmethod
async def navigate(self, url: str): pass
@abstractmethod
async def get_content(self) -> str: pass
@abstractmethod
async def click(self, selector: str): pass
class CamoufoxBrowser(BrowserInterface):
# Implementation using Camoufox
pass
class NodriverBrowser(BrowserInterface):
# Implementation using Nodriver
pass
When one tool gets detected, swap implementations without rewriting scraping logic.
Final Thoughts
Bypassing anti-bot systems in 2026 requires a multi-layered approach. No single technique works against sophisticated protection—you need TLS fingerprinting, browser stealth, behavioral simulation, and smart proxy usage working together.
Start with curl_cffi for simple targets. When that fails, move to Camoufox or SeleniumBase UC Mode. Add human-like behavior patterns. Use residential proxies with geographic consistency.
Most importantly, respect the sites you scrape. Rate limiting and responsible data collection keep anti-bot escalation in check for everyone.
The techniques in this guide work against current protection systems. Anti-bot vendors will adapt. Keep your tools updated, test regularly, and build flexible systems that can evolve with the landscape.
FAQ
Can I bypass Cloudflare with just curl_cffi?
curl_cffi bypasses TLS fingerprinting but can't execute JavaScript. For Cloudflare sites with JS challenges or Turnstile, you need a stealth browser like Camoufox or SeleniumBase UC Mode.
Which stealth browser has the best detection scores?
Camoufox consistently achieves 0% detection on CreepJS and BrowserScan tests. It's Firefox-based with C++-level fingerprint modifications that JavaScript can't detect.
How many requests per minute can I safely make?
There's no universal answer. Start at 2-5 requests per minute for heavily protected sites. Monitor success rates and gradually increase. Some sites tolerate 30+ requests per minute with proper fingerprinting.
Do I need residential proxies or will datacenter work?
Datacenter proxies work for lightly protected sites. For Cloudflare, DataDome, PerimeterX, or Akamai-protected sites, residential proxies significantly improve success rates.
What happens when my current tools get detected?
Anti-bot vendors study open-source tools. When detection increases, update to latest versions first. If still blocked, switch tools (Camoufox → Nodriver → SeleniumBase). Build abstraction layers to make switching painless.