How to Scrape Dynamic Websites with Python

Dynamic website scraping is the latest "must-have" skill for Python developers, but most are doing it completely wrong.

Across Stack Overflow threads and Medium tutorials, the advice is the same: fire up Selenium, render the full page, and extract your data. Problem solved, right?

The problem?

You're burning through gigabytes of RAM to do what a simple HTTP request could handle. Most "dynamic" sites aren't as dynamic as you think—and the ones that are? They're usually just fetching data from unprotected API endpoints.

What Python scrapers need to know (The TL;DR)

Before diving into browser automation, here are the key takeaways:

  • Browser automation ≠ always necessary. Most dynamic sites just load data via XHR/Fetch requests you can intercept directly.
  • Selenium screams "BOT!" to every anti-bot system. If you must use a browser, use nodriver or Playwright with proper evasion.
  • API endpoints are hiding in plain sight. Open DevTools, watch the Network tab, and you'll find clean JSON endpoints on 80% of "JavaScript-heavy" sites.
  • Detection isn't about one thing. Modern anti-bot systems check TLS fingerprints, mouse movements, browser properties, and request patterns.
  • Speed kills (your scraper). Hammering a site with concurrent requests is the fastest way to get IP-banned.

This guide explains how dynamic website scraping actually works and shows you how to extract data efficiently without triggering every anti-bot system on the internet.

But you probably don't need that headless browser you're about to spin up.

What makes a website "dynamic," really?

A dynamic website loads content through JavaScript after the initial HTML response, using client-side rendering to populate data that doesn't exist in the source code.

Here's what happens when you hit a React-powered e-commerce site:

  1. Server sends minimal HTML with empty containers
  2. JavaScript executes in the browser
  3. AJAX calls fetch data from API endpoints
  4. DOM updates with the actual content

Your requests.get() only sees step 1. The data you want happens in step 3.

Sound familiar? It should. This is how 90% of modern web apps work.

Here's the secret most tutorials miss: you can usually skip straight to step 3.

Why most scrapers are doing it backwards

Even experienced developers reach for Selenium first. They spin up Chrome, load the full page, wait for JavaScript, then extract data from the rendered DOM.

This approach burns resources for no reason.

The resource cost nobody talks about

Running headless Chrome consumes:

  • 200-500MB RAM per instance
  • 2-5 seconds per page load
  • CPU cycles for JavaScript execution
  • Bandwidth for assets you don't need

Meanwhile, hitting the API directly uses:

  • 10-50MB for your Python script
  • 50-200ms per request
  • Minimal CPU
  • Only the JSON payload

The math isn't hard. One approach is 10-100x more efficient.

What the DevTools Network tab reveals

Open any "dynamic" site and watch the Network tab. Filter by XHR/Fetch. You'll see something like:

GET https://api.example.com/products?page=1&limit=20
Response: {"products": [...], "total": 1547}

Clean, structured JSON. No parsing required.

This is what your scraper should be targeting.
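
For instance, once you spot an endpoint like the one above, you can often call it directly. The URL and parameters below are just the hypothetical example from that snippet, not a real API:

import requests

# Hypothetical endpoint lifted straight from the Network tab example above
response = requests.get(
    "https://api.example.com/products",
    params={"page": 1, "limit": 20},
    timeout=10,
)
data = response.json()
print(data["total"])  # 1547, per the example response above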

The smart approach: Intercept, don't render

Step 1: Check if you even need JavaScript

Before anything else, verify the content is actually dynamic:

import requests
from bs4 import BeautifulSoup

def check_if_dynamic(url, target_selector):
    """Quick reality check before overengineering"""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    content = soup.select(target_selector)
    
    if not content or content[0].text.strip() == '':
        print("❌ Content loads dynamically")
        return True
    else:
        print("✅ Content is in the HTML - use regular scraping")
        return False
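
A quick sanity check could be as simple as this (the URL and selector are placeholders):

check_if_dynamic("https://example.com/products", "div.product-list")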

Half the time, the "dynamic" site you're worried about has the data right in the source.

Step 2: Hunt the API endpoints

If content is truly dynamic, become a detective:

from playwright.sync_api import sync_playwright
import json

def discover_api_endpoints(url):
    """Find the actual data sources"""
    api_calls = []
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        # Intercept all JSON responses
        def capture_apis(response):
            if 'json' in response.headers.get('content-type', ''):
                api_calls.append({
                    'url': response.url,
                    'method': response.request.method,
                    'headers': dict(response.request.headers)
                })
        
        page.on('response', capture_apis)
        page.goto(url, wait_until='networkidle')
        
        # Trigger lazy-loaded content
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)
        
        browser.close()
        
    return api_calls

Now you have the exact endpoints, headers, and parameters. No more guessing.
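
For example, to list what was captured (the URL is a placeholder):

endpoints = discover_api_endpoints("https://example.com/products")
for call in endpoints:
    print(call['method'], call['url'])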

Step 3: Call APIs directly (skip the middleman)

Once you find the endpoints:

import requests

def scrape_via_api(api_url, headers):
    """Why render HTML when you can get JSON?"""
    
    # Use the exact headers from browser
    response = requests.get(api_url, headers=headers)
    
    # Clean data, no parsing needed
    return response.json()

This approach is faster, cleaner, and less detectable than any browser automation.
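
Putting steps 2 and 3 together, a minimal sketch might look like this. The page URL is a placeholder, and it assumes at least one JSON call was captured and that its headers can be replayed as-is:

# Discover the endpoints once with a real browser...
endpoints = discover_api_endpoints("https://example.com/products")

# ...then hit the first one directly, no browser needed
if endpoints:
    first = endpoints[0]
    data = scrape_via_api(first['url'], first['headers'])
    print(data)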

When you actually need browser automation

Sometimes the API route genuinely doesn't work:

  • Signed/encrypted requests
  • Complex authentication flows
  • GraphQL with dynamic queries
  • Aggressive bot protection

Fine. But don't use vanilla Selenium.

The detection problem with WebDriver

Selenium drives the browser through the WebDriver protocol, which exposes detectable properties:

navigator.webdriver === true  // Dead giveaway
window.chrome === undefined   // Missing in headless
navigator.plugins.length === 0  // No plugins = bot

Sites check these properties and block you instantly.
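
You can see the giveaway yourself with a few lines of vanilla Selenium. This is a quick sketch assuming selenium and Chrome are installed; the URL is a placeholder:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")
# On stock Selenium this prints True - exactly what detection scripts look for
print(driver.execute_script("return navigator.webdriver"))
driver.quit()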

Enter nodriver: Selenium's stealthier cousin

import nodriver as uc
import asyncio

async def undetected_scrape(url):
    """Browser automation that doesn't announce itself"""
    
    browser = await uc.start(
        headless=False,  # Headless is more detectable
        lang="en-US",
    )
    
    page = await browser.get(url)
    await page.wait_for('.product-list')
    
    # Extract via JavaScript - faster than parsing
    products = await page.evaluate('''
        Array.from(document.querySelectorAll('.product')).map(p => ({
            name: p.querySelector('.name')?.textContent,
            price: p.querySelector('.price')?.textContent
        }))
    ''')
    
    browser.stop()
    return products

# Run it with: asyncio.run(undetected_scrape("https://example.com/products"))

Nodriver bypasses WebDriver entirely, driving Chrome over the DevTools Protocol instead, so the usual WebDriver signatures never show up.

The anti-detection arms race

Modern anti-bot systems like Cloudflare and DataDome look for more than just navigator.webdriver.

TLS fingerprinting catches most bots

Python's requests library produces a TLS handshake with a recognizable JA3 fingerprint that looks nothing like a browser's. Anti-bot systems maintain databases of these fingerprints.

Standard requests:

requests.get(url)  # TLS fingerprint screams "Python"

Disguised requests:

from curl_cffi import requests

# Mimics Chrome's TLS fingerprint exactly
response = requests.get(url, impersonate="chrome110")

One line change. Massive difference in detection.

Behavioral patterns matter more than headers

Anti-bot systems track:

  • Request intervals (too consistent = bot)
  • Mouse movements (none = bot)
  • Page interaction (straight to data = bot)
  • Session depth (single page = bot)

Here's how to look human:

import random
import requests
import time

class HumanizedScraper:
    def __init__(self):
        self.session = requests.Session()
        
    def build_trust(self, site_url):
        """Act like a real visitor"""
        
        # Visit homepage first
        self.session.get(site_url)
        time.sleep(random.uniform(2, 4))
        
        # Browse around naturally
        for page in ['/about', '/products', '/contact']:
            if random.random() < 0.7:  # Don't visit everything
                self.session.get(site_url + page)
                time.sleep(random.uniform(1.5, 3.5))
        
        return self.session

Real users don't beeline to your target data. Neither should your scraper.
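
Once the session is warmed up, reuse it for the pages you actually want. The URLs are placeholders, and the endpoint is assumed to return JSON:

scraper = HumanizedScraper()
session = scraper.build_trust("https://example.com")

# Only now go after the target data, with the same cookies and history
data = session.get("https://example.com/api/products").json()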

Common mistakes that get you blocked

The headless browser trap

Running headless seems smart—no GUI, faster, right?

Wrong. Headless browsers are easier to detect:

# Detectable
browser = await uc.start(headless=True)

# Better - invisible but not headless
# (pyvirtualdisplay drives Xvfb, so this approach needs a Linux environment)
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1920, 1080))
display.start()
browser = await uc.start(headless=False)

The browser genuinely runs in headful mode; it just renders to a display nobody sees, so headless checks come up clean.

The concurrent request death spiral

Greedy scrapers blast sites with parallel requests:

# This gets you banned
urls = [f"/product/{i}" for i in range(1000)]
results = await asyncio.gather(*[fetch(url) for url in urls])

Smart scrapers use controlled concurrency:

import asyncio
import random

async def distributed_scrape(urls, max_concurrent=3):
    """Scrape fast without triggering rate limits"""
    
    semaphore = asyncio.Semaphore(max_concurrent)
    
    async def bounded_fetch(url):
        async with semaphore:
            # Add jitter so request timing doesn't look robotic
            await asyncio.sleep(random.uniform(0.5, 1.5))
            # fetch() is whatever async request coroutine you use (sketch below)
            return await fetch(url)
    
    return await asyncio.gather(*[bounded_fetch(url) for url in urls])

The difference between 3 and 30 concurrent requests? About 27 blocked IPs.
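
The fetch() used above is whatever async HTTP coroutine you prefer. A minimal sketch with httpx (an assumption; aiohttp would work just as well):

import httpx

async def fetch(url):
    """Minimal async GET returning the response body"""
    # In production, share one AsyncClient across requests instead of opening one per call
    async with httpx.AsyncClient(timeout=10) as client:
        response = await client.get(url)
        return response.text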

The proxy rotation mistake

Don't just cycle through proxies randomly:

from collections import defaultdict
import time

class IntelligentProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.proxy_stats = defaultdict(lambda: {
            'success': 0, 'failure': 0, 'last_used': 0
        })
    
    def record_result(self, proxy, success):
        """Feed request outcomes back in so rotation can learn"""
        key = 'success' if success else 'failure'
        self.proxy_stats[proxy][key] += 1
    
    def get_proxy(self):
        """Prefer proxies with a good track record, let failing ones rest"""
        
        current_time = time.time()
        
        # Sort by success rate and cooldown
        best_proxies = sorted(
            self.proxies,
            key=lambda p: (
                self.proxy_stats[p]['success'] / 
                max(1, self.proxy_stats[p]['failure']),
                current_time - self.proxy_stats[p]['last_used']
            ),
            reverse=True
        )
        
        # Enforce cooldown period
        for proxy in best_proxies:
            if current_time - self.proxy_stats[proxy]['last_used'] > 30:
                self.proxy_stats[proxy]['last_used'] = current_time
                return proxy
                
        return best_proxies[0]  # Emergency fallback

Good proxies get reused efficiently. Bad ones get naturally filtered out.
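
In practice, wire the outcome of each request back into the rotator so the stats stay meaningful. The proxy addresses and target URL below are placeholders:

import requests

rotator = IntelligentProxyRotator([
    "http://user:pass@proxy1.example:8080",
    "http://user:pass@proxy2.example:8080",
])

proxy = rotator.get_proxy()
response = requests.get(
    "https://example.com/api/products",
    proxies={"http": proxy, "https": proxy},
    timeout=10,
)
rotator.record_result(proxy, response.ok)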

The uncomfortable truth about "undetectable" scraping

Here's what nobody tells you: There's no such thing as undetectable scraping.

Every technique in this guide has a counter-technique. Every evasion method leaves traces. The goal isn't perfect stealth—it's being too expensive to block.

Sites could block every scraper if they wanted. They don't because:

  • False positives hurt real users
  • Advanced detection costs money
  • Some scraping drives traffic
  • A monitored API makes a better honeypot than an outright block

Your job is to stay below the threshold where blocking you becomes worthwhile.

Closing takeaway

Chasing complex browser automation for dynamic websites is usually overkill.

Most "JavaScript-heavy" sites are just static HTML shells fetching data from API endpoints you can call directly. Even when browser automation is genuinely necessary, tools like nodriver combined with smart evasion techniques work better than vanilla Selenium ever could.

The winning approach isn't building the most sophisticated scraper—it's finding the simplest path to clean data. Start with DevTools, check for APIs, and only reach for browser automation when you've exhausted the alternatives.

Remember: The best scraper isn't the one with the most features. It's the one that gets the data efficiently without ending up on a blocklist.

Marius Bernard

Marius Bernard is a Product Advisor, Technical SEO, & Brand Ambassador at Roundproxies. He was the lead author for the SEO chapter of the 2024 Web Almanac and a reviewer for the 2023 SEO chapter.