Most Python developers are scraping dynamic websites wrong. They fire up Selenium, render the entire page, wait for JavaScript, and wonder why their scrapers are slow, memory-hungry, and constantly getting blocked.
The truth? You're burning through gigabytes of RAM to do what a simple HTTP request could handle.
Dynamic website scraping means extracting data from sites that load content through JavaScript after the initial HTML response. Most "dynamic" sites fetch data from unprotected API endpoints that your scraper can call directly—skipping the browser entirely.
This guide shows you the methods that actually work in 2026, from finding hidden API endpoints to bypassing sophisticated anti-bot systems using tools that detection systems can't easily flag.
What Makes a Website Dynamic?
A dynamic website loads content through JavaScript after the initial page request. When you hit a React or Vue-powered site, the server sends minimal HTML with empty containers.
JavaScript then executes in the browser and makes AJAX calls to fetch actual data. The DOM updates with content that never existed in the original source code.
Your requests.get() only captures that first empty shell. The actual data arrives through those background API calls, and that's where your scraper should look.
Here's how to check if a site is truly dynamic:
import requests
from bs4 import BeautifulSoup
def check_dynamic_content(url, target_selector):
"""
Verify if content loads dynamically
Returns True if JavaScript rendering is needed
"""
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
soup = BeautifulSoup(response.text, 'html.parser')
# Try to find target content in raw HTML
content = soup.select(target_selector)
if not content or not content[0].text.strip():
print("Content loads dynamically - check for API endpoints")
return True
print("Content is in HTML - use regular scraping")
return False
# Example usage
is_dynamic = check_dynamic_content(
"https://example-store.com/products",
".product-card"
)
This function sends a plain HTTP request and checks if your target content exists. If the selector returns empty elements, the site uses JavaScript rendering.
Surprisingly often, you'll discover the data is right there in the HTML.
The API-First Approach: Skip the Browser
Before reaching for browser automation, investigate the network traffic. Open DevTools, navigate to the Network tab, and filter by XHR/Fetch requests.
Most dynamic sites reveal clean JSON endpoints hiding in plain sight.
GET https://api.example.com/products?page=1&limit=20
Response: {"products": [...], "total": 1547}
This structured data is exactly what your scraper should target.
Discovering Hidden API Endpoints
Use this Playwright-based discovery script to capture every API call a page makes:
from playwright.sync_api import sync_playwright
import json
def discover_api_endpoints(url):
"""
Intercept all JSON responses to find API endpoints
Returns list of discovered APIs with headers
"""
api_calls = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
def capture_response(response):
content_type = response.headers.get('content-type', '')
if 'json' in content_type or 'application/json' in content_type:
api_calls.append({
'url': response.url,
'method': response.request.method,
'status': response.status,
'headers': dict(response.request.headers)
})
page.on('response', capture_response)
page.goto(url, wait_until='networkidle')
# Trigger lazy-loaded content
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(2000)
browser.close()
return api_calls
# Discover endpoints
endpoints = discover_api_endpoints("https://target-site.com/products")
for endpoint in endpoints:
print(f"Found: {endpoint['method']} {endpoint['url']}")
The script launches a browser instance, intercepts all responses with JSON content types, and stores the request details. The scroll action triggers lazy-loaded content that might make additional API calls.
Calling APIs Directly
Once you've identified the endpoints, bypass the browser entirely:
import requests
def scrape_via_api(api_url, headers):
"""
Direct API access - 10-100x faster than browser automation
"""
response = requests.get(api_url, headers=headers)
if response.status_code == 200:
return response.json()
return None
# Use exact headers captured from browser
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Accept': 'application/json',
'Referer': 'https://target-site.com/products'
}
data = scrape_via_api(
"https://api.target-site.com/v1/products?page=1",
headers
)
This approach uses minimal resources—10-50MB for your script versus 200-500MB per browser instance. Request times drop from 2-5 seconds to 50-200ms.
The API returns structured JSON that requires no HTML parsing.
When Browser Automation Is Actually Necessary
Sometimes the API route genuinely doesn't work. You'll need browser automation when facing:
- Signed or encrypted request payloads
- Complex authentication flows with JavaScript challenges
- GraphQL with dynamic query generation
- Aggressive bot protection requiring full browser fingerprints
When browser automation is unavoidable, don't use vanilla Selenium.
Nodriver: The Undetectable Browser Tool
Nodriver is the official successor to undetected-chromedriver, built from scratch to bypass anti-bot detection. It communicates directly with Chrome using a custom CDP implementation—no Selenium, no WebDriver protocol.
Standard Selenium sets detectable flags that anti-bot systems catch instantly:
// These properties expose Selenium
navigator.webdriver === true
window.chrome === undefined
navigator.plugins.length === 0
Nodriver eliminates these signatures entirely.
Installation and Basic Setup
pip install nodriver
You need Chrome or a Chromium-based browser installed. No separate driver downloads required.
import nodriver as uc
async def basic_scrape(url):
"""
Basic nodriver setup with anti-detection
"""
browser = await uc.start(
headless=False, # Headless mode is more detectable
lang="en-US"
)
page = await browser.get(url)
# Wait for content to load
await page.wait_for_selector('.product-list')
# Extract data via JavaScript - faster than parsing
products = await page.evaluate('''
() => Array.from(document.querySelectorAll('.product')).map(p => ({
name: p.querySelector('.name')?.textContent?.trim(),
price: p.querySelector('.price')?.textContent?.trim(),
url: p.querySelector('a')?.href
}))
''')
await browser.close()
return products
# Run the scraper
if __name__ == '__main__':
uc.loop().run_until_complete(basic_scrape("https://target-site.com"))
The headless=False setting makes your scraper harder to detect: anti-bot systems distinguish headless Chrome from regular Chrome through a range of fingerprint checks.
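To see what those checks pick up, here's a minimal probe, sketched with Playwright (already used above for endpoint discovery) rather than nodriver. The properties it queries are just a few of the signals real detection scripts inspect:

```python
from playwright.sync_api import sync_playwright

def fingerprint_probe(headless: bool) -> dict:
    """Collect a few properties that detection scripts commonly inspect."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        page.goto("about:blank")
        # Read the values straight from the page context
        props = page.evaluate("""
            () => ({
                webdriver: navigator.webdriver,
                userAgent: navigator.userAgent,
                plugins: navigator.plugins.length,
                languages: navigator.languages,
            })
        """)
        browser.close()
        return props

print("headless:", fingerprint_probe(headless=True))
print("headed:  ", fingerprint_probe(headless=False))
```

Under headless Chromium the User-Agent typically advertises HeadlessChrome, which is exactly the kind of giveaway that headed mode and nodriver help you avoid.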
Handling Infinite Scroll
Many modern sites use infinite scroll instead of pagination. Nodriver handles this with scroll simulation:
import nodriver as uc
import asyncio
async def scrape_infinite_scroll(url, max_scrolls=10):
"""
Scrape pages with infinite scroll functionality
"""
browser = await uc.start(headless=False)
page = await browser.get(url)
all_items = []
previous_count = 0
scroll_count = 0
while scroll_count < max_scrolls:
# Scroll to bottom
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
# Wait for new content
await asyncio.sleep(2)
# Count current items
current_count = await page.evaluate('''
() => document.querySelectorAll('.item').length
''')
# Check if new items loaded
if current_count == previous_count:
# No new content - we've reached the end
break
previous_count = current_count
scroll_count += 1
print(f"Scroll {scroll_count}: Found {current_count} items")
# Extract all items after scrolling completes
all_items = await page.evaluate('''
() => Array.from(document.querySelectorAll('.item')).map(item => ({
title: item.querySelector('.title')?.textContent?.trim(),
description: item.querySelector('.desc')?.textContent?.trim()
}))
''')
await browser.close()
return all_items
if __name__ == '__main__':
items = uc.loop().run_until_complete(
scrape_infinite_scroll("https://infinite-scroll-site.com", max_scrolls=15)
)
print(f"Total items scraped: {len(items)}")
The script scrolls incrementally and tracks the item count. When no new items appear after a scroll it stops, and the max_scrolls cap keeps it from scrolling forever on pages that never run out of content.
Using Nodriver with Proxies
Proxy integration requires handling authentication through CDP:
import nodriver as uc
async def scrape_with_proxy(url, proxy_url):
"""
Nodriver with proxy support
proxy_url format: http://user:pass@host:port
"""
browser = await uc.start(
headless=False,
browser_args=[f'--proxy-server={proxy_url}']
)
page = await browser.get(url)
# Verify proxy is working
ip_info = await page.evaluate('''
async () => {
const response = await fetch('https://api.ipify.org?format=json');
return response.json();
}
''')
print(f"Current IP: {ip_info}")
# Continue with scraping
content = await page.get_content()
await browser.close()
return content
if __name__ == '__main__':
uc.loop().run_until_complete(
scrape_with_proxy("https://target.com", "http://user:pass@proxy.example.com:8080")
)
For authenticated proxies requiring popup handling, you'll need to use browser extensions or CDP fetch interception—check the nodriver documentation for advanced proxy patterns.
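If you only need an authenticated proxy during the endpoint-discovery phase, one workaround is to fall back to Playwright, which accepts proxy credentials directly at launch. A minimal sketch with placeholder proxy details:

```python
from playwright.sync_api import sync_playwright

# Placeholder credentials - substitute your provider's values
PROXY = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass",
}

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, proxy=PROXY)
    page = browser.new_page()
    # Confirm the exit IP belongs to the proxy, not your machine
    page.goto("https://api.ipify.org?format=json")
    print(page.content())
    browser.close()
```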
curl_cffi: Bypassing TLS Fingerprinting
Even with proper headers and User-Agent strings, some sites block requests based on TLS fingerprints. Your Python requests library has a unique signature that anti-bot systems recognize.
curl_cffi solves this by impersonating real browser TLS handshakes.
How TLS Fingerprinting Works
When establishing HTTPS connections, your client sends a "Client Hello" message containing:
- Supported cipher suites
- TLS extensions
- Signature algorithms
- ALPN protocols
This combination creates a unique JA3 fingerprint. Standard Python HTTP libraries produce fingerprints that scream "bot."
Installation and Usage
pip install curl_cffi
Basic usage mirrors the requests library:
from curl_cffi import requests
# Standard request - uses Python's TLS fingerprint (detectable)
# response = requests.get(url)
# Impersonated request - uses Chrome's TLS fingerprint
response = requests.get(
"https://protected-site.com",
impersonate="chrome"
)
print(response.status_code)
print(response.json())
The impersonate parameter switches the TLS fingerprint to match Chrome, Safari, or other browsers.
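You can verify the switch against a TLS fingerprint echo service. The sketch below assumes https://tls.browserleaks.com/json is reachable and returns a JSON body containing a ja3_hash field; any similar echo endpoint works, and the field name may differ:

```python
import requests as plain_requests
from curl_cffi import requests as cf_requests

# Assumed fingerprint echo endpoint - swap in any service that reflects JA3
FP_URL = "https://tls.browserleaks.com/json"

native = plain_requests.get(FP_URL, timeout=15).json()
impersonated = cf_requests.get(FP_URL, impersonate="chrome", timeout=15).json()

# The two hashes should differ: the first is Python's native TLS stack,
# the second should line up with a real Chrome build
print("requests :", native.get("ja3_hash", native))
print("curl_cffi:", impersonated.get("ja3_hash", impersonated))
```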
Available Browser Impersonations
curl_cffi supports multiple browser versions:
from curl_cffi import requests
# Chrome versions
response = requests.get(url, impersonate="chrome110")
response = requests.get(url, impersonate="chrome120")
response = requests.get(url, impersonate="chrome131")
response = requests.get(url, impersonate="chrome136")
# Safari versions
response = requests.get(url, impersonate="safari15_5")
response = requests.get(url, impersonate="safari17_0")
# Use latest available fingerprint
response = requests.get(url, impersonate="chrome")
Each impersonation adjusts cipher suites, extensions, and HTTP/2 settings to match the real browser's behavior exactly.
Session Management with curl_cffi
For sites requiring login or cookies, use session objects:
from curl_cffi import requests
def authenticated_scrape(login_url, target_url, credentials):
"""
Maintain session cookies across requests
"""
session = requests.Session()
# Login request
login_response = session.post(
login_url,
data=credentials,
impersonate="chrome"
)
if login_response.status_code != 200:
raise Exception("Login failed")
# Session cookies are automatically maintained
print(f"Cookies: {session.cookies}")
# Scrape authenticated content
data_response = session.get(
target_url,
impersonate="chrome"
)
return data_response.json()
# Usage
data = authenticated_scrape(
"https://site.com/login",
"https://site.com/api/protected-data",
{"username": "user", "password": "pass"}
)
The session object persists cookies across requests, maintaining authentication state throughout your scraping session.
Async Support for High-Volume Scraping
curl_cffi supports asyncio for concurrent requests:
import asyncio
from curl_cffi.requests import AsyncSession
async def scrape_multiple_pages(urls):
"""
Concurrent scraping with curl_cffi async
"""
async with AsyncSession() as session:
tasks = []
for url in urls:
task = session.get(url, impersonate="chrome")
tasks.append(task)
responses = await asyncio.gather(*tasks, return_exceptions=True)
results = []
for url, response in zip(urls, responses):
if isinstance(response, Exception):
print(f"Error fetching {url}: {response}")
continue
results.append({
'url': url,
'status': response.status_code,
'data': response.text[:500]
})
return results
# Run concurrent scraper
urls = [f"https://api.site.com/page/{i}" for i in range(1, 51)]
results = asyncio.run(scrape_multiple_pages(urls))
print(f"Successfully scraped {len(results)} pages")
Async requests dramatically increase throughput while maintaining the TLS fingerprint disguise.
Advanced Anti-Detection Techniques
Modern anti-bot systems analyze more than just headers and TLS fingerprints. They track behavioral patterns, timing, and session characteristics.
Humanized Request Timing
Bots make requests at consistent intervals. Humans don't.
import random
import time
from curl_cffi import requests
class HumanizedScraper:
def __init__(self, base_delay=1.5, variance=1.0):
self.session = requests.Session()
self.base_delay = base_delay
self.variance = variance
self.request_count = 0
def _human_delay(self):
"""Generate random delay mimicking human behavior"""
delay = self.base_delay + random.uniform(-self.variance, self.variance)
# Occasional longer pauses (simulating reading)
if random.random() < 0.1:
delay += random.uniform(3, 8)
time.sleep(max(0.5, delay))
def get(self, url, **kwargs):
"""Make request with human-like timing"""
self._human_delay()
self.request_count += 1
        # Vary Accept-Language occasionally
if self.request_count % 10 == 0:
kwargs.setdefault('headers', {})
kwargs['headers']['Accept-Language'] = random.choice([
'en-US,en;q=0.9',
'en-GB,en;q=0.8',
'en-US,en;q=0.9,es;q=0.8'
])
return self.session.get(url, impersonate="chrome", **kwargs)
# Usage
scraper = HumanizedScraper(base_delay=2, variance=1.5)
for page in range(1, 20):
response = scraper.get(f"https://site.com/page/{page}")
print(f"Page {page}: {response.status_code}")
The random delays and occasional longer pauses simulate natural browsing patterns.
Building Session Trust
Sites track session behavior from the first request. Arriving directly at a data endpoint looks suspicious.
from curl_cffi import requests
import time
import random
def build_trusted_session(site_url):
"""
Establish session trust before scraping target data
"""
session = requests.Session()
# Visit homepage first
session.get(site_url, impersonate="chrome")
time.sleep(random.uniform(2, 4))
# Browse naturally through the site
navigation_paths = ['/about', '/products', '/categories', '/contact']
random.shuffle(navigation_paths)
for path in navigation_paths[:random.randint(2, 4)]:
if random.random() < 0.7: # Don't visit everything
session.get(f"{site_url}{path}", impersonate="chrome")
time.sleep(random.uniform(1.5, 4))
return session
# Build trust, then scrape
session = build_trusted_session("https://target-site.com")
# Now access target data
response = session.get("https://target-site.com/api/products", impersonate="chrome")
This pattern establishes cookies, builds referrer history, and creates a session that appears legitimate.
Header Consistency
Inconsistent headers trigger detection. If your User-Agent claims Chrome but your Accept or Accept-Encoding values don't match Chrome's defaults, that's a red flag.
def get_consistent_headers(browser_type="chrome"):
"""
Return headers that match real browser behavior
"""
if browser_type == "chrome":
return {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'sec-ch-ua': '"Chromium";v="136", "Google Chrome";v="136", "Not.A/Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"'
}
return {}
# Use consistent headers
headers = get_consistent_headers("chrome")
response = requests.get(url, headers=headers, impersonate="chrome")
Match the sec-ch-ua values to the browser version you're impersonating with curl_cffi.
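A small helper keeps those values and the impersonation target in sync. The brand string below simply mirrors the Chrome 136 example above and is illustrative; real Chrome varies the greased "Not.A/Brand" entry, so copy the values from an actual browser of the version you mimic:

```python
from curl_cffi import requests

def client_hint_headers(chrome_major: int, platform: str = "Windows") -> dict:
    """Build sec-ch-ua headers for a given Chrome major version.
    Illustrative brand list - confirm it against the real browser you mimic."""
    return {
        'sec-ch-ua': f'"Chromium";v="{chrome_major}", "Google Chrome";v="{chrome_major}", "Not.A/Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': f'"{platform}"',
    }

headers = get_consistent_headers("chrome")   # defined in the snippet above
headers.update(client_hint_headers(136))
response = requests.get(
    "https://protected-site.com",
    headers=headers,
    impersonate="chrome136",                 # version matches the hints
)
```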
Proxy Rotation Done Right
Rotating proxies prevent IP-based blocking, but naive rotation causes more problems than it solves.
Intelligent Proxy Management
Track proxy performance and automatically filter out poor performers:
from collections import defaultdict
import time
import random
from curl_cffi import requests
class SmartProxyRotator:
def __init__(self, proxies):
self.proxies = proxies
self.stats = defaultdict(lambda: {
'success': 0,
'failure': 0,
'last_used': 0,
'blocked': False
})
self.cooldown_period = 30 # seconds
def get_proxy(self):
"""Select best performing proxy with cooldown"""
current_time = time.time()
# Filter available proxies
available = [
p for p in self.proxies
if not self.stats[p]['blocked']
and (current_time - self.stats[p]['last_used']) > self.cooldown_period
]
if not available:
# All proxies on cooldown - wait or use least recently used
available = [p for p in self.proxies if not self.stats[p]['blocked']]
if not available:
raise Exception("All proxies are blocked")
# Sort by success rate
available.sort(
key=lambda p: self.stats[p]['success'] / max(1, self.stats[p]['failure'] + self.stats[p]['success']),
reverse=True
)
# Weight selection toward better proxies
weights = [1 / (i + 1) for i in range(len(available))]
selected = random.choices(available, weights=weights, k=1)[0]
self.stats[selected]['last_used'] = current_time
return selected
def report_result(self, proxy, success, status_code=None):
"""Update proxy statistics"""
if success:
self.stats[proxy]['success'] += 1
else:
self.stats[proxy]['failure'] += 1
# Mark as blocked if repeated failures or specific status
if status_code in [403, 429] or self.stats[proxy]['failure'] > 5:
self.stats[proxy]['blocked'] = True
print(f"Proxy blocked: {proxy}")
def make_request(self, url, **kwargs):
"""Make request with automatic proxy rotation"""
proxy = self.get_proxy()
try:
response = requests.get(
url,
proxies={'http': proxy, 'https': proxy},
impersonate="chrome",
timeout=15,
**kwargs
)
success = response.status_code == 200
self.report_result(proxy, success, response.status_code)
return response
except Exception as e:
self.report_result(proxy, False)
raise
# Usage
proxies = [
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
"http://user:pass@proxy3.example.com:8080"
]
rotator = SmartProxyRotator(proxies)
for page in range(1, 100):
try:
response = rotator.make_request(f"https://site.com/page/{page}")
print(f"Page {page}: {response.status_code}")
except Exception as e:
print(f"Page {page} failed: {e}")
The rotator tracks success rates, enforces cooldowns, and automatically removes blocked proxies from rotation.
Proxy Types for Different Scenarios
Not all proxies are equal. Choose based on your target:
Datacenter Proxies: Fast and cheap but easily detected. Use for unprotected sites or high-volume, low-sensitivity scraping.
Residential Proxies: Real ISP IPs that appear as genuine users. Necessary for sites with aggressive bot detection. Services like Roundproxies.com offer residential pools with automatic rotation.
ISP Proxies: Static residential IPs. Useful when you need persistent sessions with residential-level trust.
Mobile Proxies: 4G/5G carrier IPs. Highest trust level but most expensive. Reserve for targets that block everything else.
Handling Cloudflare and Other WAFs
Cloudflare presents JavaScript challenges that verify browser capabilities. Traditional HTTP clients fail these checks.
The cloudscraper Approach
For Cloudflare-protected sites that don't require full browser automation:
import cloudscraper
def scrape_cloudflare_site(url):
"""
Bypass Cloudflare using cloudscraper
"""
scraper = cloudscraper.create_scraper(
interpreter='js2py',
delay=5,
browser={
'browser': 'chrome',
'platform': 'windows',
'desktop': True
}
)
response = scraper.get(url)
if response.status_code == 200:
return response.text
return None
# Usage
html = scrape_cloudflare_site("https://cloudflare-protected.com")
cloudscraper solves JavaScript challenges automatically using embedded interpreters.
When cloudscraper Fails
For sites with advanced Cloudflare protection (Challenge Pages, Turnstile), use nodriver:
import nodriver as uc
import asyncio
async def bypass_cloudflare(url):
"""
Use full browser to handle Cloudflare challenges
"""
browser = await uc.start(
headless=False,
browser_args=['--disable-blink-features=AutomationControlled']
)
page = await browser.get(url)
# Wait for Cloudflare challenge to resolve
# Cloudflare typically redirects after verification
await asyncio.sleep(5)
# Check if we're past the challenge
current_url = await page.evaluate("() => window.location.href")
if "challenge" not in current_url.lower():
# Successfully bypassed
content = await page.get_content()
# Extract cookies for future requests
cookies = await page.evaluate('''
() => document.cookie
''')
await browser.close()
return content, cookies
await browser.close()
return None, None
if __name__ == '__main__':
content, cookies = uc.loop().run_until_complete(
bypass_cloudflare("https://cloudflare-site.com")
)
The browser-based approach handles JavaScript challenges that HTTP-only libraries cannot solve.
Complete Production Scraper Example
Here's a full scraper combining the techniques covered:
import asyncio
import random
import time
import json
from dataclasses import dataclass, asdict
from typing import List, Optional
from curl_cffi import requests
from bs4 import BeautifulSoup
@dataclass
class Product:
name: str
price: str
url: str
description: Optional[str] = None
class ProductionScraper:
def __init__(self, proxies: List[str] = None):
self.session = requests.Session()
self.proxies = proxies or []
self.proxy_index = 0
self.request_count = 0
self.results: List[Product] = []
def _get_proxy(self) -> Optional[dict]:
"""Rotate through available proxies"""
if not self.proxies:
return None
proxy = self.proxies[self.proxy_index]
self.proxy_index = (self.proxy_index + 1) % len(self.proxies)
return {'http': proxy, 'https': proxy}
def _human_delay(self):
"""Random delay between requests"""
base = 1.5
variance = random.uniform(-0.5, 1.5)
delay = base + variance
# Occasional longer pause
if random.random() < 0.1:
delay += random.uniform(2, 5)
time.sleep(delay)
def _make_request(self, url: str, **kwargs) -> Optional[requests.Response]:
"""Make request with retry logic"""
max_retries = 3
for attempt in range(max_retries):
try:
self._human_delay()
self.request_count += 1
response = self.session.get(
url,
proxies=self._get_proxy(),
impersonate="chrome",
timeout=20,
**kwargs
)
if response.status_code == 200:
return response
if response.status_code in [403, 429]:
print(f"Blocked on attempt {attempt + 1}, waiting...")
time.sleep(30 * (attempt + 1))
continue
except Exception as e:
print(f"Request error: {e}")
time.sleep(5)
return None
def _check_for_api(self, url: str) -> Optional[str]:
"""
Check if site has accessible API endpoint
Many sites load data via XHR that can be called directly
"""
# Common API patterns
api_patterns = [
url.replace('/products', '/api/products'),
url.replace('/products', '/api/v1/products'),
url + '?format=json',
]
for api_url in api_patterns:
response = self._make_request(
api_url,
headers={'Accept': 'application/json'}
)
if response and 'application/json' in response.headers.get('content-type', ''):
return api_url
return None
def scrape_product_listing(self, url: str) -> List[Product]:
"""Scrape products from listing page"""
# First, check for API endpoint
api_url = self._check_for_api(url)
if api_url:
print(f"Found API endpoint: {api_url}")
return self._scrape_via_api(api_url)
# Fall back to HTML parsing
return self._scrape_via_html(url)
def _scrape_via_api(self, api_url: str) -> List[Product]:
"""Scrape using discovered API endpoint"""
products = []
page = 1
while True:
paginated_url = f"{api_url}{'&' if '?' in api_url else '?'}page={page}"
response = self._make_request(paginated_url)
if not response:
break
try:
data = response.json()
# Adapt to common API response structures
items = data.get('products') or data.get('items') or data.get('data') or []
if not items:
break
for item in items:
products.append(Product(
name=item.get('name') or item.get('title', ''),
price=str(item.get('price', '')),
url=item.get('url') or item.get('link', ''),
description=item.get('description')
))
page += 1
# Check for pagination limits
if page > data.get('total_pages', 100):
break
except json.JSONDecodeError:
break
return products
def _scrape_via_html(self, url: str) -> List[Product]:
"""Scrape by parsing HTML"""
products = []
page = 1
while True:
paginated_url = f"{url}{'&' if '?' in url else '?'}page={page}"
response = self._make_request(paginated_url)
if not response:
break
soup = BeautifulSoup(response.text, 'html.parser')
# Common product card selectors
product_cards = soup.select('.product, .product-card, [data-product], .item')
if not product_cards:
break
for card in product_cards:
name_el = card.select_one('.name, .title, h2, h3')
price_el = card.select_one('.price, [data-price]')
link_el = card.select_one('a[href]')
desc_el = card.select_one('.description, .desc, p')
products.append(Product(
name=name_el.text.strip() if name_el else '',
price=price_el.text.strip() if price_el else '',
url=link_el['href'] if link_el else '',
description=desc_el.text.strip() if desc_el else None
))
page += 1
# Check for next page
if not soup.select('.next, .pagination a[rel="next"], [aria-label="Next"]'):
break
return products
def export_results(self, filename: str):
"""Export scraped data to JSON"""
with open(filename, 'w') as f:
json.dump([asdict(p) for p in self.results], f, indent=2)
print(f"Exported {len(self.results)} products to {filename}")
# Usage example
def main():
# Optional: Add your proxy list
proxies = [
# "http://user:pass@proxy1.example.com:8080",
# "http://user:pass@proxy2.example.com:8080",
]
scraper = ProductionScraper(proxies=proxies if proxies else None)
# Scrape target site
products = scraper.scrape_product_listing("https://example-store.com/products")
scraper.results = products
scraper.export_results("products.json")
print(f"Scraped {len(products)} products")
print(f"Total requests: {scraper.request_count}")
if __name__ == '__main__':
main()
This scraper automatically discovers API endpoints, falls back to HTML parsing when needed, handles pagination, rotates proxies, and implements humanized delays.
Common Mistakes That Get You Blocked
Running Headless by Default
Headless browsers are easier to detect. When possible, run in headed mode:
# More detectable
browser = await uc.start(headless=True)
# Less detectable - use with virtual display on servers
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1920, 1080))
display.start()
browser = await uc.start(headless=False)
On headless servers, use Xvfb or pyvirtualdisplay to create a virtual screen.
Hammering with Concurrent Requests
Aggressive parallelism triggers rate limits immediately:
# This gets you banned
urls = [f"/product/{i}" for i in range(1000)]
results = await asyncio.gather(*[fetch(url) for url in urls])
# This doesn't
async def rate_limited_scrape(urls, max_concurrent=3):
semaphore = asyncio.Semaphore(max_concurrent)
async def bounded_fetch(url):
async with semaphore:
await asyncio.sleep(random.uniform(1, 3))
return await fetch(url)
return await asyncio.gather(*[bounded_fetch(url) for url in urls])
Keep concurrent requests to 3-5 for most sites. Increase only after confirming the site tolerates more.
Ignoring Referrer Headers
Requests without referrers look suspicious. Always set appropriate referrers:
def get_with_referrer(session, url, base_url):
"""Include referrer matching navigation pattern"""
return session.get(
url,
headers={'Referer': base_url},
impersonate="chrome"
)
Using the Same Session Too Long
Long-running sessions accumulate suspicious characteristics. Rotate sessions periodically:
class SessionRotator:
def __init__(self, rotation_interval=50):
self.rotation_interval = rotation_interval
self.request_count = 0
self._create_session()
def _create_session(self):
self.session = requests.Session()
self.request_count = 0
def get(self, url, **kwargs):
if self.request_count >= self.rotation_interval:
self._create_session()
self.request_count += 1
return self.session.get(url, impersonate="chrome", **kwargs)
Fresh sessions have cleaner fingerprints.
FAQ
What's the difference between dynamic and static websites?
Static websites serve pre-rendered HTML where all content exists in the initial page source. Dynamic websites generate content after page load using JavaScript, fetching data from APIs and updating the DOM client-side.
Can I scrape dynamic websites without a browser?
Yes, in most cases. Identify the underlying API endpoints using browser DevTools, then call them directly with libraries like curl_cffi. Browser automation should be your last resort, not your first approach.
Why is nodriver better than Selenium for web scraping?
Selenium uses the WebDriver protocol which sets detectable properties like navigator.webdriver=true. Nodriver bypasses this entirely using a custom CDP implementation, making it significantly harder for anti-bot systems to detect.
How do I avoid getting blocked when scraping?
Implement multiple layers: use curl_cffi for TLS fingerprint impersonation, rotate residential proxies, add humanized delays between requests, maintain consistent headers, and build session trust by navigating naturally before accessing target data.
Is web scraping legal?
Web scraping legality depends on your jurisdiction, the data being scraped, and how you use it. Public data is generally scrapable, but check robots.txt, terms of service, and local laws. Never scrape personal data without consent.
How fast can I scrape without getting blocked?
It depends on the target. Start with 1-2 second delays between requests and 3-5 concurrent connections. Monitor for 429 or 403 responses and adjust accordingly. Some sites tolerate aggressive scraping while others block at the first sign of automation.
Choosing the Right Approach for Your Target
Every dynamic website scraping project requires a different strategy. Here's a decision framework:
Use curl_cffi alone when:
- The site has accessible API endpoints
- No JavaScript challenges or CAPTCHAs
- TLS fingerprinting is the primary detection method
- You need high request volume with minimal resource usage
Use nodriver when:
- Content requires JavaScript execution to render
- The site uses Cloudflare or similar WAFs
- You need to interact with page elements (clicks, forms)
- Authentication requires browser-based flows
Combine both when:
- Initial access requires nodriver to bypass challenges
- Subsequent requests can use curl_cffi with captured cookies
- You need browser sessions for discovery but HTTP for volume
The hybrid approach works especially well for dynamic websites. Use nodriver to solve initial challenges and capture authentication tokens, then switch to curl_cffi for the actual data extraction.
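Here's a minimal sketch of that hand-off, reusing the bypass_cloudflare function from earlier. The /api/items endpoint is a placeholder, and some sites bind their clearance cookies to the requesting User-Agent and IP, so keep those consistent between the two phases:

```python
import nodriver as uc
from curl_cffi import requests

# Phase 1: solve the challenge once in a real browser
# (bypass_cloudflare is the nodriver function defined earlier in this guide)
content, cookie_string = uc.loop().run_until_complete(
    bypass_cloudflare("https://cloudflare-site.com")
)

# Phase 2: reuse the clearance cookies over plain HTTP for the bulk of the work
session = requests.Session()
session.headers.update({
    "Cookie": cookie_string or "",
    "Referer": "https://cloudflare-site.com/",
})

for page_num in range(1, 20):
    response = session.get(
        f"https://cloudflare-site.com/api/items?page={page_num}",  # placeholder endpoint
        impersonate="chrome",
    )
    print(page_num, response.status_code)
```

Note that document.cookie, which the earlier function reads, cannot see HttpOnly cookies; if the clearance cookie is missing from the string, retrieve cookies through the browser's cookie APIs instead.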
Performance Benchmarks: Browser vs. HTTP
Understanding the resource differences helps you choose wisely when scraping dynamic websites.
Browser Automation (nodriver/Playwright):
- Memory: 200-500MB per browser instance
- CPU: Moderate to high during JavaScript execution
- Speed: 2-5 seconds per page including rendering
- Concurrency: Limited by system resources (typically 5-10 instances)
HTTP Clients (curl_cffi/requests):
- Memory: 10-50MB for the entire script
- CPU: Minimal
- Speed: 50-200ms per request
- Concurrency: Hundreds of concurrent requests possible
For scraping dynamic websites at scale, the performance difference is 10-100x. Browser automation should be reserved for cases where HTTP requests genuinely cannot work.
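If you want numbers for your own target instead of these ballpark figures, a rough timing harness is enough. A sketch using the tools from this guide, with a placeholder URL:

```python
import time
from curl_cffi import requests
from playwright.sync_api import sync_playwright

URL = "https://example-store.com/products"   # placeholder target

# HTTP path: a single impersonated request
start = time.perf_counter()
requests.get(URL, impersonate="chrome", timeout=30)
print(f"curl_cffi: {time.perf_counter() - start:.2f}s")

# Browser path: full render of the same page
start = time.perf_counter()
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")
    browser.close()
print(f"browser  : {time.perf_counter() - start:.2f}s")
```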
Tools and Libraries Summary
Here's a quick reference of tools covered in this guide:
| Tool | Best For | Limitations |
|---|---|---|
| curl_cffi | TLS fingerprint bypass, fast requests | Cannot execute JavaScript |
| nodriver | Full anti-bot evasion, JS rendering | Resource intensive |
| cloudscraper | Basic Cloudflare bypass | Fails on advanced challenges |
| BeautifulSoup | HTML parsing | No JavaScript support |
| Playwright | Cross-browser automation | Detectable without patches |
Choose based on your specific target's protection level and your resource constraints.
Final Thoughts
Scraping dynamic websites doesn't require complex browser automation for most sites. Start with DevTools, find the API endpoints, and hit them directly.
When browser automation becomes necessary, tools like nodriver combined with curl_cffi's TLS fingerprint impersonation provide the stealth you need for dynamic website scraping.
The winning approach isn't building the most sophisticated scraper. It's finding the simplest path to clean data without triggering detection systems.
For high-volume scraping of dynamic websites that requires reliable proxy infrastructure, residential proxies from providers like Roundproxies.com give you the IP diversity needed to stay unblocked at scale.
Remember: the best scraper for dynamic websites is the one that gets data efficiently without ending up on a blocklist.