How to Scrape Dynamic Websites with Python (2026 Guide)

Most Python developers are scraping dynamic websites wrong. They fire up Selenium, render the entire page, wait for JavaScript, and wonder why their scrapers are slow, memory-hungry, and constantly getting blocked.

The truth? You're burning through gigabytes of RAM to do what a simple HTTP request could handle.

Dynamic website scraping means extracting data from sites that load content through JavaScript after the initial HTML response. Most "dynamic" sites fetch data from unprotected API endpoints that your scraper can call directly—skipping the browser entirely.

This guide shows you the methods that actually work in 2026, from finding hidden API endpoints to bypassing sophisticated anti-bot systems using tools that detection systems can't easily flag.

What Makes a Website Dynamic?

A dynamic website loads content through JavaScript after the initial page request. When you hit a React or Vue-powered site, the server sends minimal HTML with empty containers.

JavaScript then executes in the browser and makes AJAX calls to fetch actual data. The DOM updates with content that never existed in the original source code.

Your requests.get() only captures that first empty shell. The actual data lives in the AJAX calls the page makes after it loads.

Here's how to check if a site is truly dynamic:

import requests
from bs4 import BeautifulSoup

def check_dynamic_content(url, target_selector):
    """
    Verify if content loads dynamically
    Returns True if JavaScript rendering is needed
    """
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Try to find target content in raw HTML
    content = soup.select(target_selector)
    
    if not content or not content[0].text.strip():
        print("Content loads dynamically - check for API endpoints")
        return True
    
    print("Content is in HTML - use regular scraping")
    return False

# Example usage
is_dynamic = check_dynamic_content(
    "https://example-store.com/products",
    ".product-card"
)

This function sends a plain HTTP request and checks if your target content exists. If the selector returns empty elements, the site uses JavaScript rendering.

More often than you might expect, you'll discover the data is right there in the HTML.
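
When the check comes back False, a plain request and parse is all you need. A minimal sketch, reusing the hypothetical store URL and .product-card selector from the check above:

import requests
from bs4 import BeautifulSoup

def scrape_static(url, selector):
    """Plain fetch-and-parse for content that already ships in the HTML"""
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Return the text of every matching element
    return [el.get_text(strip=True) for el in soup.select(selector)]

cards = scrape_static("https://example-store.com/products", ".product-card")
print(f"Found {len(cards)} product cards in the raw HTML")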

The API-First Approach: Skip the Browser

Before reaching for browser automation, investigate the network traffic. Open DevTools, navigate to the Network tab, and filter by XHR/Fetch requests.

Most dynamic sites reveal clean JSON endpoints hiding in plain sight.

GET https://api.example.com/products?page=1&limit=20
Response: {"products": [...], "total": 1547}

This structured data is exactly what your scraper should target.

Discovering Hidden API Endpoints

Use this Playwright-based discovery script to capture every API call a page makes:

from playwright.sync_api import sync_playwright
import json

def discover_api_endpoints(url):
    """
    Intercept all JSON responses to find API endpoints
    Returns list of discovered APIs with headers
    """
    api_calls = []
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        def capture_response(response):
            content_type = response.headers.get('content-type', '')
            if 'json' in content_type:
                api_calls.append({
                    'url': response.url,
                    'method': response.request.method,
                    'status': response.status,
                    'headers': dict(response.request.headers)
                })
        
        page.on('response', capture_response)
        page.goto(url, wait_until='networkidle')
        
        # Trigger lazy-loaded content
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)
        
        browser.close()
    
    return api_calls

# Discover endpoints
endpoints = discover_api_endpoints("https://target-site.com/products")
for endpoint in endpoints:
    print(f"Found: {endpoint['method']} {endpoint['url']}")

The script launches a browser instance, intercepts all responses with JSON content types, and stores the request details. The scroll action triggers lazy-loaded content that might make additional API calls.

Calling APIs Directly

Once you've identified the endpoints, bypass the browser entirely:

import requests

def scrape_via_api(api_url, headers):
    """
    Direct API access - 10-100x faster than browser automation
    """
    response = requests.get(api_url, headers=headers)
    
    if response.status_code == 200:
        return response.json()
    
    return None

# Use exact headers captured from browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
    'Referer': 'https://target-site.com/products'
}

data = scrape_via_api(
    "https://api.target-site.com/v1/products?page=1",
    headers
)

This approach uses minimal resources—10-50MB for your script versus 200-500MB per browser instance. Request times drop from 2-5 seconds to 50-200ms.

The API returns structured JSON that requires no HTML parsing.
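
Pagination usually comes down to looping over a page parameter until the endpoint runs dry. A minimal sketch, assuming the example endpoint and the {"products": [...], "total": ...} response shape shown earlier; swap in whatever keys you actually see in DevTools:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'application/json',
}

def scrape_all_pages(api_url, headers, limit=20):
    """Walk an API's pagination until no more items come back"""
    all_products = []
    page = 1

    while True:
        response = requests.get(
            api_url,
            params={'page': page, 'limit': limit},
            headers=headers
        )
        if response.status_code != 200:
            break

        products = response.json().get('products', [])
        if not products:
            break  # an empty page means we're past the end

        all_products.extend(products)
        page += 1

    return all_products

products = scrape_all_pages("https://api.example.com/products", headers)
print(f"Collected {len(products)} products")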

When Browser Automation Is Actually Necessary

Sometimes the API route genuinely doesn't work. You'll need browser automation when facing:

  • Signed or encrypted request payloads
  • Complex authentication flows with JavaScript challenges
  • GraphQL with dynamic query generation
  • Aggressive bot protection requiring full browser fingerprints

When browser automation is unavoidable, don't use vanilla Selenium.

Nodriver: The Undetectable Browser Tool

Nodriver is the official successor to undetected-chromedriver, built from scratch to bypass anti-bot detection. It communicates directly with Chrome using a custom CDP implementation—no Selenium, no WebDriver protocol.

Standard Selenium sets detectable flags that anti-bot systems catch instantly:

// These properties give away Selenium and headless Chrome
navigator.webdriver === true
window.chrome === undefined
navigator.plugins.length === 0

Nodriver eliminates these signatures entirely.
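
You can verify this yourself from a nodriver session. A quick sketch that evaluates the same properties; the exact values you get back depend on your Chrome build:

import nodriver as uc

async def check_fingerprint():
    """Confirm the flags that expose Selenium are absent in nodriver"""
    browser = await uc.start(headless=False)
    page = await browser.get("https://example.com")

    # Unlike Selenium, navigator.webdriver should not be true here
    webdriver_flag = await page.evaluate("navigator.webdriver")
    plugin_count = await page.evaluate("navigator.plugins.length")

    print(f"navigator.webdriver: {webdriver_flag}")
    print(f"navigator.plugins.length: {plugin_count}")

    browser.stop()

if __name__ == '__main__':
    uc.loop().run_until_complete(check_fingerprint())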

Installation and Basic Setup

pip install nodriver

You need Chrome or a Chromium-based browser installed. No separate driver downloads required.

import nodriver as uc

async def basic_scrape(url):
    """
    Basic nodriver setup with anti-detection
    """
    browser = await uc.start(
        headless=False,  # Headless mode is more detectable
        lang="en-US"
    )
    
    page = await browser.get(url)
    
    # Wait for content to load (select() waits until the element appears)
    await page.select('.product-list')
    
    # Extract data via JavaScript - faster than parsing
    products = await page.evaluate('''
        Array.from(document.querySelectorAll('.product')).map(p => ({
            name: p.querySelector('.name')?.textContent?.trim(),
            price: p.querySelector('.price')?.textContent?.trim(),
            url: p.querySelector('a')?.href
        }))
    ''')
    
    browser.stop()
    return products

# Run the scraper
if __name__ == '__main__':
    uc.loop().run_until_complete(basic_scrape("https://target-site.com"))

The headless=False setting makes detection harder. Anti-bot systems distinguish headless Chrome from regular Chrome through various fingerprint checks.

Handling Infinite Scroll

Many modern sites use infinite scroll instead of pagination. Nodriver handles this with scroll simulation:

import nodriver as uc
import asyncio

async def scrape_infinite_scroll(url, max_scrolls=10):
    """
    Scrape pages with infinite scroll functionality
    """
    browser = await uc.start(headless=False)
    page = await browser.get(url)
    
    all_items = []
    previous_count = 0
    scroll_count = 0
    
    while scroll_count < max_scrolls:
        # Scroll to bottom
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        
        # Wait for new content
        await asyncio.sleep(2)
        
        # Count current items
        current_count = await page.evaluate(
            "document.querySelectorAll('.item').length"
        )
        
        # Check if new items loaded
        if current_count == previous_count:
            # No new content - we've reached the end
            break
        
        previous_count = current_count
        scroll_count += 1
        print(f"Scroll {scroll_count}: Found {current_count} items")
    
    # Extract all items after scrolling completes
    all_items = await page.evaluate('''
        Array.from(document.querySelectorAll('.item')).map(item => ({
            title: item.querySelector('.title')?.textContent?.trim(),
            description: item.querySelector('.desc')?.textContent?.trim()
        }))
    ''')
    
    browser.stop()
    return all_items

if __name__ == '__main__':
    items = uc.loop().run_until_complete(
        scrape_infinite_scroll("https://infinite-scroll-site.com", max_scrolls=15)
    )
    print(f"Total items scraped: {len(items)}")

The script scrolls incrementally and tracks item count. When no new items load after a scroll, it stops—preventing infinite loops on pages without content limits.

Using Nodriver with Proxies

Proxy integration requires handling authentication through CDP:

import nodriver as uc

async def scrape_with_proxy(url, proxy_url):
    """
    Nodriver with proxy support
    proxy_url format: http://host:port (Chrome ignores inline user:pass credentials)
    """
    browser = await uc.start(
        headless=False,
        browser_args=[f'--proxy-server={proxy_url}']
    )
    
    page = await browser.get(url)
    
    # Verify the proxy is working by loading an IP-echo service in a second tab
    ip_tab = await browser.get('https://api.ipify.org?format=json', new_tab=True)
    ip_info = await ip_tab.get_content()
    print(f"Current IP response: {ip_info}")
    
    # Continue with scraping
    content = await page.get_content()
    
    browser.stop()
    return content

if __name__ == '__main__':
    uc.loop().run_until_complete(
        scrape_with_proxy("https://target.com", "http://proxy.example.com:8080")
    )

For authenticated proxies requiring popup handling, you'll need to use browser extensions or CDP fetch interception—check the nodriver documentation for advanced proxy patterns.

curl_cffi: Bypassing TLS Fingerprinting

Even with proper headers and User-Agent strings, some sites block requests based on TLS fingerprints. Your Python requests library has a unique signature that anti-bot systems recognize.

curl_cffi solves this by impersonating real browser TLS handshakes.

How TLS Fingerprinting Works

When establishing HTTPS connections, your client sends a "Client Hello" message containing:

  • Supported cipher suites
  • TLS extensions
  • Signature algorithms
  • ALPN protocols

This combination creates a unique JA3 fingerprint. Standard Python HTTP libraries produce fingerprints that scream "bot."

Installation and Usage

pip install curl_cffi

Basic usage mirrors the requests library:

from curl_cffi import requests

# Standard request - uses Python's TLS fingerprint (detectable)
# response = requests.get(url)

# Impersonated request - uses Chrome's TLS fingerprint
response = requests.get(
    "https://protected-site.com",
    impersonate="chrome"
)

print(response.status_code)
print(response.json())

The impersonate parameter switches the TLS fingerprint to match Chrome, Safari, or other browsers.
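
To see the difference, ask a fingerprint-echo service for the JA3 hash your client produces with and without impersonation. A quick sketch using browserleaks.com; the ja3_hash field name is what the service returned at the time of writing:

import requests as plain_requests
from curl_cffi import requests as cffi_requests

# This endpoint echoes back the TLS fingerprint it observed for your connection
FINGERPRINT_URL = "https://tls.browserleaks.com/json"

plain = plain_requests.get(FINGERPRINT_URL).json()
impersonated = cffi_requests.get(FINGERPRINT_URL, impersonate="chrome").json()

print("python-requests JA3:", plain.get("ja3_hash"))
print("curl_cffi chrome JA3:", impersonated.get("ja3_hash"))
# The two hashes differ; the second should match a real Chrome installation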

Available Browser Impersonations

curl_cffi supports multiple browser versions:

from curl_cffi import requests

# Chrome versions
response = requests.get(url, impersonate="chrome110")
response = requests.get(url, impersonate="chrome120")
response = requests.get(url, impersonate="chrome131")
response = requests.get(url, impersonate="chrome136")

# Safari versions
response = requests.get(url, impersonate="safari15_5")
response = requests.get(url, impersonate="safari17_0")

# Use latest available fingerprint
response = requests.get(url, impersonate="chrome")

Each impersonation adjusts cipher suites, extensions, and HTTP/2 settings to match the real browser's behavior exactly.

Session Management with curl_cffi

For sites requiring login or cookies, use session objects:

from curl_cffi import requests

def authenticated_scrape(login_url, target_url, credentials):
    """
    Maintain session cookies across requests
    """
    session = requests.Session()
    
    # Login request
    login_response = session.post(
        login_url,
        data=credentials,
        impersonate="chrome"
    )
    
    if login_response.status_code != 200:
        raise Exception("Login failed")
    
    # Session cookies are automatically maintained
    print(f"Cookies: {session.cookies}")
    
    # Scrape authenticated content
    data_response = session.get(
        target_url,
        impersonate="chrome"
    )
    
    return data_response.json()

# Usage
data = authenticated_scrape(
    "https://site.com/login",
    "https://site.com/api/protected-data",
    {"username": "user", "password": "pass"}
)

The session object persists cookies across requests, maintaining authentication state throughout your scraping session.

Async Support for High-Volume Scraping

curl_cffi supports asyncio for concurrent requests:

import asyncio
from curl_cffi.requests import AsyncSession

async def scrape_multiple_pages(urls):
    """
    Concurrent scraping with curl_cffi async
    """
    async with AsyncSession() as session:
        tasks = []
        for url in urls:
            task = session.get(url, impersonate="chrome")
            tasks.append(task)
        
        responses = await asyncio.gather(*tasks, return_exceptions=True)
        
        results = []
        for url, response in zip(urls, responses):
            if isinstance(response, Exception):
                print(f"Error fetching {url}: {response}")
                continue
            results.append({
                'url': url,
                'status': response.status_code,
                'data': response.text[:500]
            })
        
        return results

# Run concurrent scraper
urls = [f"https://api.site.com/page/{i}" for i in range(1, 51)]
results = asyncio.run(scrape_multiple_pages(urls))
print(f"Successfully scraped {len(results)} pages")

Async requests dramatically increase throughput while maintaining the TLS fingerprint disguise.

Advanced Anti-Detection Techniques

Modern anti-bot systems analyze more than just headers and TLS fingerprints. They track behavioral patterns, timing, and session characteristics.

Humanized Request Timing

Bots make requests at consistent intervals. Humans don't.

import random
import time
from curl_cffi import requests

class HumanizedScraper:
    def __init__(self, base_delay=1.5, variance=1.0):
        self.session = requests.Session()
        self.base_delay = base_delay
        self.variance = variance
        self.request_count = 0
    
    def _human_delay(self):
        """Generate random delay mimicking human behavior"""
        delay = self.base_delay + random.uniform(-self.variance, self.variance)
        
        # Occasional longer pauses (simulating reading)
        if random.random() < 0.1:
            delay += random.uniform(3, 8)
        
        time.sleep(max(0.5, delay))
    
    def get(self, url, **kwargs):
        """Make request with human-like timing"""
        self._human_delay()
        
        self.request_count += 1
        
        # Vary Accept-Language occasionally
        if self.request_count % 10 == 0:
            kwargs.setdefault('headers', {})
            kwargs['headers']['Accept-Language'] = random.choice([
                'en-US,en;q=0.9',
                'en-GB,en;q=0.8',
                'en-US,en;q=0.9,es;q=0.8'
            ])
        
        return self.session.get(url, impersonate="chrome", **kwargs)

# Usage
scraper = HumanizedScraper(base_delay=2, variance=1.5)
for page in range(1, 20):
    response = scraper.get(f"https://site.com/page/{page}")
    print(f"Page {page}: {response.status_code}")

The random delays and occasional longer pauses simulate natural browsing patterns.

Building Session Trust

Sites track session behavior from the first request. Arriving directly at a data endpoint looks suspicious.

from curl_cffi import requests
import time
import random

def build_trusted_session(site_url):
    """
    Establish session trust before scraping target data
    """
    session = requests.Session()
    
    # Visit homepage first
    session.get(site_url, impersonate="chrome")
    time.sleep(random.uniform(2, 4))
    
    # Browse naturally through the site
    navigation_paths = ['/about', '/products', '/categories', '/contact']
    random.shuffle(navigation_paths)
    
    for path in navigation_paths[:random.randint(2, 4)]:
        if random.random() < 0.7:  # Don't visit everything
            session.get(f"{site_url}{path}", impersonate="chrome")
            time.sleep(random.uniform(1.5, 4))
    
    return session

# Build trust, then scrape
session = build_trusted_session("https://target-site.com")

# Now access target data
response = session.get("https://target-site.com/api/products", impersonate="chrome")

This pattern establishes cookies, builds referrer history, and creates a session that appears legitimate.

Header Consistency

Inconsistent headers trigger detection. If your User-Agent claims Chrome but your Accept or Accept-Encoding headers don't match Chrome's defaults, that's a red flag.

from curl_cffi import requests

def get_consistent_headers(browser_type="chrome"):
    """
    Return headers that match real browser behavior
    """
    if browser_type == "chrome":
        return {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'sec-ch-ua': '"Chromium";v="136", "Google Chrome";v="136", "Not.A/Brand";v="99"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Windows"'
        }
    
    return {}

# Use consistent headers
headers = get_consistent_headers("chrome")
response = requests.get(url, headers=headers, impersonate="chrome")

Match the sec-ch-ua values to the browser version you're impersonating with curl_cffi.
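
A small helper sketch for keeping those values in sync, reusing get_consistent_headers() from above. The version numbers are illustrative, and the "Not A Brand" token rotates between Chrome releases, so copy the exact string from a real request in DevTools:

from curl_cffi import requests

def client_hint_headers(chrome_version: int) -> dict:
    """Build sec-ch-ua headers that match the impersonated Chrome version"""
    # The brand-list format is stable, but the "Not A Brand" token and its
    # version change between releases - verify against a real browser request.
    return {
        'sec-ch-ua': (
            f'"Chromium";v="{chrome_version}", '
            f'"Google Chrome";v="{chrome_version}", '
            '"Not.A/Brand";v="99"'
        ),
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
    }

headers = {**get_consistent_headers("chrome"), **client_hint_headers(131)}
response = requests.get(
    "https://protected-site.com",
    headers=headers,
    impersonate="chrome131"
)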

Proxy Rotation Done Right

Rotating proxies prevent IP-based blocking, but naive rotation causes more problems than it solves.

Intelligent Proxy Management

Track proxy performance and automatically filter out poor performers:

from collections import defaultdict
import time
import random
from curl_cffi import requests

class SmartProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.stats = defaultdict(lambda: {
            'success': 0,
            'failure': 0,
            'last_used': 0,
            'blocked': False
        })
        self.cooldown_period = 30  # seconds
    
    def get_proxy(self):
        """Select best performing proxy with cooldown"""
        current_time = time.time()
        
        # Filter available proxies
        available = [
            p for p in self.proxies
            if not self.stats[p]['blocked']
            and (current_time - self.stats[p]['last_used']) > self.cooldown_period
        ]
        
        if not available:
            # All proxies on cooldown - wait or use least recently used
            available = [p for p in self.proxies if not self.stats[p]['blocked']]
        
        if not available:
            raise Exception("All proxies are blocked")
        
        # Sort by success rate
        available.sort(
            key=lambda p: self.stats[p]['success'] / max(1, self.stats[p]['failure'] + self.stats[p]['success']),
            reverse=True
        )
        
        # Weight selection toward better proxies
        weights = [1 / (i + 1) for i in range(len(available))]
        selected = random.choices(available, weights=weights, k=1)[0]
        
        self.stats[selected]['last_used'] = current_time
        return selected
    
    def report_result(self, proxy, success, status_code=None):
        """Update proxy statistics"""
        if success:
            self.stats[proxy]['success'] += 1
        else:
            self.stats[proxy]['failure'] += 1
            
            # Mark as blocked if repeated failures or specific status
            if status_code in [403, 429] or self.stats[proxy]['failure'] > 5:
                self.stats[proxy]['blocked'] = True
                print(f"Proxy blocked: {proxy}")
    
    def make_request(self, url, **kwargs):
        """Make request with automatic proxy rotation"""
        proxy = self.get_proxy()
        
        try:
            response = requests.get(
                url,
                proxies={'http': proxy, 'https': proxy},
                impersonate="chrome",
                timeout=15,
                **kwargs
            )
            
            success = response.status_code == 200
            self.report_result(proxy, success, response.status_code)
            
            return response
            
        except Exception as e:
            self.report_result(proxy, False)
            raise

# Usage
proxies = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080"
]

rotator = SmartProxyRotator(proxies)

for page in range(1, 100):
    try:
        response = rotator.make_request(f"https://site.com/page/{page}")
        print(f"Page {page}: {response.status_code}")
    except Exception as e:
        print(f"Page {page} failed: {e}")

The rotator tracks success rates, enforces cooldowns, and automatically removes blocked proxies from rotation.

Proxy Types for Different Scenarios

Not all proxies are equal. Choose based on your target:

Datacenter Proxies: Fast and cheap but easily detected. Use for unprotected sites or high-volume, low-sensitivity scraping.

Residential Proxies: Real ISP IPs that appear as genuine users. Necessary for sites with aggressive bot detection. Services like Roundproxies.com offer residential pools with automatic rotation.

ISP Proxies: Static residential IPs. Useful when you need persistent sessions with residential-level trust.

Mobile Proxies: 4G/5G carrier IPs. Highest trust level but most expensive. Reserve for targets that block everything else.

Handling Cloudflare and Other WAFs

Cloudflare presents JavaScript challenges that verify browser capabilities. Traditional HTTP clients fail these checks.
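
Before escalating, it helps to detect the challenge programmatically so your scraper knows when to switch tools. A rough heuristic sketch; the markers below (the cf-mitigated header and the "Just a moment" interstitial text) are what Cloudflare commonly returns today, but they can change:

from curl_cffi import requests

def looks_like_cloudflare_challenge(response) -> bool:
    """Heuristic check for a Cloudflare challenge response"""
    if response.status_code not in (403, 503):
        return False

    server = response.headers.get('server', '').lower()
    mitigated = response.headers.get('cf-mitigated', '')

    return (
        'cloudflare' in server
        or mitigated == 'challenge'
        or 'Just a moment' in response.text
    )

response = requests.get("https://cloudflare-protected.com", impersonate="chrome")
if looks_like_cloudflare_challenge(response):
    print("Challenge detected - escalate to cloudscraper or nodriver")
else:
    print(f"Direct access worked: {response.status_code}")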

The cloudscraper Approach

For Cloudflare-protected sites that don't require full browser automation:

import cloudscraper

def scrape_cloudflare_site(url):
    """
    Bypass Cloudflare using cloudscraper
    """
    scraper = cloudscraper.create_scraper(
        interpreter='js2py',
        delay=5,
        browser={
            'browser': 'chrome',
            'platform': 'windows',
            'desktop': True
        }
    )
    
    response = scraper.get(url)
    
    if response.status_code == 200:
        return response.text
    
    return None

# Usage
html = scrape_cloudflare_site("https://cloudflare-protected.com")

cloudscraper solves JavaScript challenges automatically using embedded interpreters.

When cloudscraper Fails

For sites with advanced Cloudflare protection (Challenge Pages, Turnstile), use nodriver:

import nodriver as uc
import asyncio

async def bypass_cloudflare(url):
    """
    Use full browser to handle Cloudflare challenges
    """
    browser = await uc.start(
        headless=False,
        browser_args=['--disable-blink-features=AutomationControlled']
    )
    
    page = await browser.get(url)
    
    # Wait for Cloudflare challenge to resolve
    # Cloudflare typically redirects after verification
    await asyncio.sleep(5)
    
    # Check if we're past the challenge
    current_url = await page.evaluate("window.location.href")
    
    if "challenge" not in current_url.lower():
        # Successfully bypassed
        content = await page.get_content()
        
        # Extract cookies for future requests (HttpOnly cookies won't appear here)
        cookies = await page.evaluate("document.cookie")
        
        browser.stop()
        return content, cookies
    
    browser.stop()
    return None, None

if __name__ == '__main__':
    content, cookies = uc.loop().run_until_complete(
        bypass_cloudflare("https://cloudflare-site.com")
    )

The browser-based approach handles JavaScript challenges that HTTP-only libraries cannot solve.

Complete Production Scraper Example

Here's a full scraper combining the techniques covered:

import asyncio
import random
import time
import json
from dataclasses import dataclass, asdict
from typing import List, Optional
from curl_cffi import requests
from bs4 import BeautifulSoup

@dataclass
class Product:
    name: str
    price: str
    url: str
    description: Optional[str] = None

class ProductionScraper:
    def __init__(self, proxies: List[str] = None):
        self.session = requests.Session()
        self.proxies = proxies or []
        self.proxy_index = 0
        self.request_count = 0
        self.results: List[Product] = []
    
    def _get_proxy(self) -> Optional[dict]:
        """Rotate through available proxies"""
        if not self.proxies:
            return None
        
        proxy = self.proxies[self.proxy_index]
        self.proxy_index = (self.proxy_index + 1) % len(self.proxies)
        
        return {'http': proxy, 'https': proxy}
    
    def _human_delay(self):
        """Random delay between requests"""
        base = 1.5
        variance = random.uniform(-0.5, 1.5)
        delay = base + variance
        
        # Occasional longer pause
        if random.random() < 0.1:
            delay += random.uniform(2, 5)
        
        time.sleep(delay)
    
    def _make_request(self, url: str, **kwargs) -> Optional[requests.Response]:
        """Make request with retry logic"""
        max_retries = 3
        
        for attempt in range(max_retries):
            try:
                self._human_delay()
                self.request_count += 1
                
                response = self.session.get(
                    url,
                    proxies=self._get_proxy(),
                    impersonate="chrome",
                    timeout=20,
                    **kwargs
                )
                
                if response.status_code == 200:
                    return response
                
                if response.status_code in [403, 429]:
                    print(f"Blocked on attempt {attempt + 1}, waiting...")
                    time.sleep(30 * (attempt + 1))
                    continue
                
            except Exception as e:
                print(f"Request error: {e}")
                time.sleep(5)
        
        return None
    
    def _check_for_api(self, url: str) -> Optional[str]:
        """
        Check if site has accessible API endpoint
        Many sites load data via XHR that can be called directly
        """
        # Common API patterns
        api_patterns = [
            url.replace('/products', '/api/products'),
            url.replace('/products', '/api/v1/products'),
            url + '?format=json',
        ]
        
        for api_url in api_patterns:
            response = self._make_request(
                api_url,
                headers={'Accept': 'application/json'}
            )
            
            if response and 'application/json' in response.headers.get('content-type', ''):
                return api_url
        
        return None
    
    def scrape_product_listing(self, url: str) -> List[Product]:
        """Scrape products from listing page"""
        
        # First, check for API endpoint
        api_url = self._check_for_api(url)
        
        if api_url:
            print(f"Found API endpoint: {api_url}")
            return self._scrape_via_api(api_url)
        
        # Fall back to HTML parsing
        return self._scrape_via_html(url)
    
    def _scrape_via_api(self, api_url: str) -> List[Product]:
        """Scrape using discovered API endpoint"""
        products = []
        page = 1
        
        while True:
            paginated_url = f"{api_url}{'&' if '?' in api_url else '?'}page={page}"
            response = self._make_request(paginated_url)
            
            if not response:
                break
            
            try:
                data = response.json()
                
                # Adapt to common API response structures
                items = data.get('products') or data.get('items') or data.get('data') or []
                
                if not items:
                    break
                
                for item in items:
                    products.append(Product(
                        name=item.get('name') or item.get('title', ''),
                        price=str(item.get('price', '')),
                        url=item.get('url') or item.get('link', ''),
                        description=item.get('description')
                    ))
                
                page += 1
                
                # Check for pagination limits
                if page > data.get('total_pages', 100):
                    break
                    
            except json.JSONDecodeError:
                break
        
        return products
    
    def _scrape_via_html(self, url: str) -> List[Product]:
        """Scrape by parsing HTML"""
        products = []
        page = 1
        
        while True:
            paginated_url = f"{url}{'&' if '?' in url else '?'}page={page}"
            response = self._make_request(paginated_url)
            
            if not response:
                break
            
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Common product card selectors
            product_cards = soup.select('.product, .product-card, [data-product], .item')
            
            if not product_cards:
                break
            
            for card in product_cards:
                name_el = card.select_one('.name, .title, h2, h3')
                price_el = card.select_one('.price, [data-price]')
                link_el = card.select_one('a[href]')
                desc_el = card.select_one('.description, .desc, p')
                
                products.append(Product(
                    name=name_el.text.strip() if name_el else '',
                    price=price_el.text.strip() if price_el else '',
                    url=link_el['href'] if link_el else '',
                    description=desc_el.text.strip() if desc_el else None
                ))
            
            page += 1
            
            # Check for next page
            if not soup.select('.next, .pagination a[rel="next"], [aria-label="Next"]'):
                break
        
        return products
    
    def export_results(self, filename: str):
        """Export scraped data to JSON"""
        with open(filename, 'w') as f:
            json.dump([asdict(p) for p in self.results], f, indent=2)
        print(f"Exported {len(self.results)} products to {filename}")

# Usage example
def main():
    # Optional: Add your proxy list
    proxies = [
        # "http://user:pass@proxy1.example.com:8080",
        # "http://user:pass@proxy2.example.com:8080",
    ]
    
    scraper = ProductionScraper(proxies=proxies if proxies else None)
    
    # Scrape target site
    products = scraper.scrape_product_listing("https://example-store.com/products")
    
    scraper.results = products
    scraper.export_results("products.json")
    
    print(f"Scraped {len(products)} products")
    print(f"Total requests: {scraper.request_count}")

if __name__ == '__main__':
    main()

This scraper automatically discovers API endpoints, falls back to HTML parsing when needed, handles pagination, rotates proxies, and implements humanized delays.

Common Mistakes That Get You Blocked

Running Headless by Default

Headless browsers are easier to detect. When possible, run in headed mode:

# More detectable
browser = await uc.start(headless=True)

# Less detectable - use with virtual display on servers
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1920, 1080))
display.start()
browser = await uc.start(headless=False)

On headless servers, use Xvfb or pyvirtualdisplay to create a virtual screen.

Hammering with Concurrent Requests

Aggressive parallelism triggers rate limits immediately:

# This gets you banned - fetch() stands in for your own async request coroutine
urls = [f"/product/{i}" for i in range(1000)]
results = await asyncio.gather(*[fetch(url) for url in urls])

# This doesn't
async def rate_limited_scrape(urls, max_concurrent=3):
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def bounded_fetch(url):
        async with semaphore:
            await asyncio.sleep(random.uniform(1, 3))
            return await fetch(url)
    
    return await asyncio.gather(*[bounded_fetch(url) for url in urls])

Keep concurrent requests between 3-5 for most sites. Increase only after confirming the site can handle more.
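
When you do trip a rate limit, back off instead of retrying immediately. A small sketch that honors the Retry-After header when the server sends one, and falls back to exponential delays otherwise:

import time
from curl_cffi import requests

def get_with_backoff(session, url, max_retries=4):
    """Retry on 429/503 responses, respecting Retry-After when present"""
    response = None
    for attempt in range(max_retries):
        response = session.get(url, impersonate="chrome", timeout=20)

        if response.status_code not in (429, 503):
            return response

        # Prefer the server's own hint, otherwise back off exponentially
        retry_after = response.headers.get('Retry-After')
        wait = int(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
        print(f"Rate limited (attempt {attempt + 1}), sleeping {wait}s")
        time.sleep(wait)

    return response

session = requests.Session()
response = get_with_backoff(session, "https://site.com/api/products")
print(response.status_code)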

Ignoring Referrer Headers

Requests without referrers look suspicious. Always set appropriate referrers:

def get_with_referrer(session, url, base_url):
    """Include referrer matching navigation pattern"""
    return session.get(
        url,
        headers={'Referer': base_url},
        impersonate="chrome"
    )

Using the Same Session Too Long

Long-running sessions accumulate suspicious characteristics. Rotate sessions periodically:

class SessionRotator:
    def __init__(self, rotation_interval=50):
        self.rotation_interval = rotation_interval
        self.request_count = 0
        self._create_session()
    
    def _create_session(self):
        self.session = requests.Session()
        self.request_count = 0
    
    def get(self, url, **kwargs):
        if self.request_count >= self.rotation_interval:
            self._create_session()
        
        self.request_count += 1
        return self.session.get(url, impersonate="chrome", **kwargs)

Fresh sessions have cleaner fingerprints.

FAQ

What's the difference between dynamic and static websites?

Static websites serve pre-rendered HTML where all content exists in the initial page source. Dynamic websites generate content after page load using JavaScript, fetching data from APIs and updating the DOM client-side.

Can I scrape dynamic websites without a browser?

Yes, in most cases. Identify the underlying API endpoints using browser DevTools, then call them directly with libraries like curl_cffi. Browser automation should be your last resort, not your first approach.

Why is nodriver better than Selenium for web scraping?

Selenium uses the WebDriver protocol which sets detectable properties like navigator.webdriver=true. Nodriver bypasses this entirely using a custom CDP implementation, making it significantly harder for anti-bot systems to detect.

How do I avoid getting blocked when scraping?

Implement multiple layers: use curl_cffi for TLS fingerprint impersonation, rotate residential proxies, add humanized delays between requests, maintain consistent headers, and build session trust by navigating naturally before accessing target data.

Is web scraping legal?

Web scraping legality depends on your jurisdiction, the data being scraped, and how you use it. Public data is generally scrapable, but check robots.txt, terms of service, and local laws. Never scrape personal data without consent.

How fast can I scrape without getting blocked?

It depends on the target. Start with 1-2 second delays between requests and 3-5 concurrent connections. Monitor for 429 or 403 responses and adjust accordingly. Some sites tolerate aggressive scraping while others block at the first sign of automation.

Choosing the Right Approach for Your Target

Every dynamic website scraping project requires a different strategy. Here's a decision framework:

Use curl_cffi alone when:

  • The site has accessible API endpoints
  • No JavaScript challenges or CAPTCHAs
  • TLS fingerprinting is the primary detection method
  • You need high request volume with minimal resource usage

Use nodriver when:

  • Content requires JavaScript execution to render
  • The site uses Cloudflare or similar WAFs
  • You need to interact with page elements (clicks, forms)
  • Authentication requires browser-based flows

Combine both when:

  • Initial access requires nodriver to bypass challenges
  • Subsequent requests can use curl_cffi with captured cookies
  • You need browser sessions for discovery but HTTP for volume

The hybrid approach works especially well for dynamic websites. Use nodriver to solve initial challenges and capture authentication tokens, then switch to curl_cffi for the actual data extraction.
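
A minimal sketch of that handoff, reusing the bypass_cloudflare() helper from earlier. Keep in mind document.cookie only exposes non-HttpOnly cookies, so some clearance tokens may need to be read through CDP instead:

import nodriver as uc
from curl_cffi import requests

def cookie_string_to_dict(cookie_string: str) -> dict:
    """Convert a document.cookie string into a name/value dict"""
    cookies = {}
    for pair in cookie_string.split('; '):
        if '=' in pair:
            name, _, value = pair.partition('=')
            cookies[name] = value
    return cookies

# 1. Solve the challenge once in a real browser (helper defined earlier)
content, cookie_string = uc.loop().run_until_complete(
    bypass_cloudflare("https://cloudflare-site.com")
)

# 2. Reuse the captured cookies for fast, browser-free requests
response = requests.get(
    "https://cloudflare-site.com/api/products",
    cookies=cookie_string_to_dict(cookie_string or ""),
    impersonate="chrome"
)
print(response.status_code)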

Performance Benchmarks: Browser vs. HTTP

Understanding the resource differences helps you choose wisely when scraping dynamic websites.

Browser Automation (nodriver/Playwright):

  • Memory: 200-500MB per browser instance
  • CPU: Moderate to high during JavaScript execution
  • Speed: 2-5 seconds per page including rendering
  • Concurrency: Limited by system resources (typically 5-10 instances)

HTTP Clients (curl_cffi/requests):

  • Memory: 10-50MB for the entire script
  • CPU: Minimal
  • Speed: 50-200ms per request
  • Concurrency: Hundreds of concurrent requests possible

For scraping dynamic websites at scale, the performance difference is 10-100x. Browser automation should be reserved for cases where HTTP requests genuinely cannot work.

Tools and Libraries Summary

Here's a quick reference of tools covered in this guide:

Tool | Best For | Limitations
curl_cffi | TLS fingerprint bypass, fast requests | Cannot execute JavaScript
nodriver | Full anti-bot evasion, JS rendering | Resource intensive
cloudscraper | Basic Cloudflare bypass | Fails on advanced challenges
BeautifulSoup | HTML parsing | No JavaScript support
Playwright | Cross-browser automation | Detectable without patches

Choose based on your specific target's protection level and your resource constraints.

Final Thoughts

Scraping dynamic websites doesn't require complex browser automation for most sites. Start with DevTools, find the API endpoints, and hit them directly.

When browser automation becomes necessary, tools like nodriver combined with curl_cffi's TLS fingerprint impersonation provide the stealth you need for dynamic website scraping.

The winning approach isn't building the most sophisticated scraper. It's finding the simplest path to clean data without triggering detection systems.

For high-volume scraping of dynamic websites that requires reliable proxy infrastructure, residential proxies from providers like Roundproxies.com give you the IP diversity needed to stay unblocked at scale.

Remember: the best scraper for dynamic websites is the one that gets data efficiently without ending up on a blocklist.