Pydoll guide: How to scrape without WebDriver (2026)

Pydoll quietly showed up in early 2025 and solved the one thing that makes browser automation miserable: WebDriver management. No more matching ChromeDriver versions. No more navigator.webdriver=true getting your scraper flagged within seconds.

I switched my production scrapers from Selenium to Pydoll three months ago. Setup time dropped from 20 minutes of driver debugging to pip install and go. This guide walks you through everything — from first install to concurrent multi-tab scraping with proxy rotation.

What Is Pydoll?

Pydoll is an async-first Python library that automates Chromium-based browsers through the Chrome DevTools Protocol (CDP) — no WebDriver required. It connects directly to Chrome's debugging interface, which means fewer moving parts, better stealth against bot detection, and zero driver version mismatches. Use it when you need to scrape JavaScript-heavy sites that static tools like Requests or Scrapy can't handle.

The library hit 6,000+ GitHub stars within its first year. The codebase is fully type-annotated and mypy-checked, built on asyncio, and ships with built-in CAPTCHA bypass helpers for Cloudflare Turnstile and reCAPTCHA v3.

Pydoll vs. Selenium and Playwright

Before committing to a tool, you should know what you're trading off.

Feature | Pydoll | Selenium | Playwright
WebDriver required | No | Yes | No (bundled)
Async native | Yes | No | Yes
Type safety | 100% typed | Partial | Partial
CAPTCHA bypass | Built-in helpers | None | None
Multi-browser support | Chromium only | All major | All major
navigator.webdriver leak | No | Yes (by default) | No
Python version | 3.10+ | 3.8+ | 3.8+

Choose Pydoll when you're scraping Chromium-only and want the cleanest async API with built-in stealth. Skip it if you need Firefox or WebKit support — Playwright is better there.

For a deeper comparison, see our guide on Playwright vs. Selenium for web scraping.

How to Install Pydoll

You need Python 3.10 or higher and a Chromium-based browser (Chrome or Edge) installed on your machine.

Create a project directory and set up a virtual environment:

mkdir pydoll-scraper && cd pydoll-scraper
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Install the library:

pip install pydoll-python

Verify it worked:

python -c "import pydoll; print('Pydoll installed successfully')"

That's it. No chromedriver download, no path config, no version matching. Pydoll finds your local Chrome install automatically.

Core Concepts You Need to Know

Pydoll has a small API surface. Four concepts cover 90% of what you'll use:

Browser. The Chrome or Chromium instance. You create it with Chrome() and typically use it as an async context manager.

Tab (Page). Each open tab in the browser. You get the first one from browser.start() and create additional tabs with browser.new_tab(). All scraping happens on a tab.

WebElement. The result of finding an element on the page. Supports .click(), .type_text(), .get_text(), and attribute extraction.

Browser context. An isolated session within a single browser process — separate cookies, storage, and proxy settings. Think of it as a programmatic incognito window.
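
Here's how the four pieces fit together in one minimal sketch. It reuses the same calls the rest of this guide is built on, so treat it as orientation rather than an API reference:

import asyncio
from pydoll.browser import Chrome

async def main():
    async with Chrome() as browser:              # Browser: the Chromium process
        tab = await browser.start()              # Tab: the first open tab
        await tab.go_to("https://example.com")

        heading = await tab.find(css_selector="h1")   # WebElement
        print(await heading.get_text())

        # Browser context: an isolated session with its own cookies and storage
        context = await browser.new_context()
        private_tab = await context.new_tab()
        await private_tab.go_to("https://example.com")

asyncio.run(main())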

Your First Pydoll Scraper

Let's build a minimal scraper that navigates to a page and grabs its title. This confirms your setup works end to end.

import asyncio
from pydoll.browser import Chrome

async def main():
    async with Chrome() as browser:
        tab = await browser.start()
        await tab.go_to("https://books.toscrape.com/")
        
        title = await tab.execute_script("return document.title")
        print(f"Page title: {title}")

asyncio.run(main())

The async with Chrome() block launches Chrome and tears it down automatically when done. tab.go_to() navigates, and execute_script runs arbitrary JavaScript in the page context.

Run it and you should see: Page title: All products | Books to Scrape - Sandbox.

Extracting Data With CSS Selectors

The real work starts when you need to pull structured data off a page. Pydoll's tab.find() method accepts CSS selectors, IDs, class names, and XPath.

Here's how to scrape book titles and prices from Books to Scrape:

import asyncio
import json
from pydoll.browser import Chrome

async def main():
    async with Chrome() as browser:
        tab = await browser.start()
        await tab.go_to("https://books.toscrape.com/")

        # Find all product containers
        books = await tab.find(
            css_selector="article.product_pod",
            find_all=True
        )

        results = []
        for book in books:
            title = await book.find(css_selector="h3 a")
            price = await book.find(css_selector=".price_color")

            results.append({
                "title": await title.get_attribute("title"),
                "price": await price.get_text()
            })

        print(json.dumps(results[:5], indent=2))

asyncio.run(main())

Two things to notice. First, find_all=True returns a list of WebElement objects. Without it, you get a single element. Second, get_attribute("title") pulls HTML attributes while get_text() pulls visible text content.

The output looks like this:

[
  {"title": "A Light in the Attic", "price": "£51.77"},
  {"title": "Tipping the Velvet", "price": "£53.74"},
  {"title": "Soumission", "price": "£50.10"}
]

Exporting scraped data to CSV

Once you've collected data, you'll want it in a portable format. Here's the same scraper with CSV export:

import asyncio
import csv
from pathlib import Path
from pydoll.browser import Chrome

async def main():
    async with Chrome() as browser:
        tab = await browser.start()
        await tab.go_to("https://books.toscrape.com/")

        books = await tab.find(css_selector="article.product_pod", find_all=True)

        results = []
        for book in books:
            link = await book.find(css_selector="h3 a")
            price = await book.find(css_selector=".price_color")
            rating = await book.find(css_selector="p.star-rating")

            results.append({
                "title": await link.get_attribute("title"),
                "price": await price.get_text(),
                "rating": (await rating.get_attribute("class")).split()[-1]
            })

        # Write to CSV
        output = Path("books.csv")
        with open(output, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
            writer.writeheader()
            writer.writerows(results)

        print(f"Saved {len(results)} books to {output}")

asyncio.run(main())

Notice the rating extraction trick: the star rating is stored as a CSS class like star-rating Three, so splitting on whitespace and grabbing the last element gives you the rating word. Small details like this matter when you're building scrapers against real sites.

Handling JavaScript-Rendered Pages

Static scraping tools break on SPAs and sites that load content after the initial page load. Pydoll handles this natively because it runs a real browser.

The key is wait_element() — it pauses execution until a target element appears in the DOM:

import asyncio
from pydoll.browser import Chrome

async def main():
    async with Chrome() as browser:
        tab = await browser.start()
        await tab.go_to("https://quotes.toscrape.com/js/")

        # Wait for quotes to render (they load via JS after a delay)
        await tab.wait_element(css_selector=".quote", timeout=10)

        quotes = await tab.find(css_selector=".quote", find_all=True)

        for quote in quotes[:3]:
            text_el = await quote.find(css_selector=".text")
            author_el = await quote.find(css_selector=".author")
            print(f"{await text_el.get_text()} — {await author_el.get_text()}")

asyncio.run(main())

The timeout=10 parameter tells Pydoll to wait up to 10 seconds for the .quote elements to appear. If they don't show up, it raises a TimeoutError. This is more reliable than hardcoded asyncio.sleep() calls because it returns the moment the content is ready.
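
If a missing element should skip the page rather than kill the whole run, wrap the wait in a try/except. A minimal sketch, assuming the raised exception is (or aliases) Python's built-in TimeoutError; adjust the exception class if your Pydoll version raises its own:

import asyncio

async def get_quotes_or_skip(tab, url):
    """Return quote elements, or an empty list if they never render."""
    await tab.go_to(url)
    try:
        await tab.wait_element(css_selector=".quote", timeout=10)
    except (TimeoutError, asyncio.TimeoutError):
        # The quotes never appeared within 10 seconds; move on
        print(f"Skipping {url}: .quote never appeared")
        return []
    return await tab.find(css_selector=".quote", find_all=True)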

Concurrent Scraping With Multiple Tabs

This is where Pydoll earns its keep. Because it's async-first, you can scrape multiple pages simultaneously using asyncio.gather() with separate tabs.

Here's how to scrape 5 pages of a paginated site concurrently:

import asyncio
from pydoll.browser import Chrome

async def scrape_page(tab, url):
    """Scrape all book titles from a single page."""
    await tab.go_to(url)
    books = await tab.find(css_selector="article.product_pod", find_all=True)
    
    titles = []
    for book in books:
        link = await book.find(css_selector="h3 a")
        titles.append(await link.get_attribute("title"))
    return titles

async def main():
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    
    async with Chrome() as browser:
        tab1 = await browser.start()
        # Create 4 additional tabs
        tabs = [tab1] + [await browser.new_tab() for _ in range(4)]

        # Scrape pages 1-5 concurrently
        tasks = [
            scrape_page(tab, base_url.format(i))
            for tab, i in zip(tabs, range(1, 6))
        ]
        all_results = await asyncio.gather(*tasks)

        total = sum(len(page) for page in all_results)
        print(f"Scraped {total} books across 5 pages concurrently")

asyncio.run(main())

Each tab maintains its own session state. Five pages that would take 10+ seconds sequentially finish in roughly 2-3 seconds. The browser reuses a single Chromium process, so resource overhead stays low.

For even stronger isolation, use browser contexts instead of tabs. Each context gets its own cookies, storage, and cache — and can run its own proxy:

async def main():
    async with Chrome() as browser:
        await browser.start()
        
        # Create isolated contexts with different proxies
        context1 = await browser.new_context(
            proxy={"server": "http://proxy1.example.com:8080"}
        )
        context2 = await browser.new_context(
            proxy={"server": "http://proxy2.example.com:8080"}
        )
        
        tab1 = await context1.new_tab()
        tab2 = await context2.new_tab()
        
        # These tabs use different IPs and don't share cookies
        results = await asyncio.gather(
            scrape_page(tab1, "https://example.com/page/1"),
            scrape_page(tab2, "https://example.com/page/2")
        )

Browser contexts are the right tool when you're scraping a single domain with IP rotation. Tabs share cookies and sessions by default — contexts don't.
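
You can see the isolation for yourself by setting a cookie in the default tab and reading it back from a fresh context. A quick sketch reusing execute_script() and the new_context()/new_tab() calls from above:

import asyncio
from pydoll.browser import Chrome

async def main():
    async with Chrome() as browser:
        tab_a = await browser.start()
        await tab_a.go_to("https://example.com")
        await tab_a.execute_script("document.cookie = 'session=tab-a'")

        # A fresh context has its own cookie jar
        context = await browser.new_context()
        tab_b = await context.new_tab()
        await tab_b.go_to("https://example.com")

        cookies_b = await tab_b.execute_script("return document.cookie")
        print(f"Context cookies: {cookies_b!r}")  # 'session=tab-a' should be absent

asyncio.run(main())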

Browser-Context Requests: The Hybrid Pattern

This is the feature most tutorials skip, and it's genuinely powerful. After navigating and authenticating via the browser UI, you can make raw HTTP requests through tab.request — and they automatically inherit the tab's cookies, session, and headers.

Why does this matter? Browser automation is slow. HTTP requests are fast. The hybrid pattern lets you log in with the browser, then switch to direct API calls for the heavy data collection.

import asyncio
from pydoll.browser import Chrome

async def main():
    async with Chrome() as browser:
        tab = await browser.start()
        
        # Step 1: Navigate and interact via browser (slow, but handles JS/auth)
        await tab.go_to("https://quotes.toscrape.com/login")
        
        username_input = await tab.find(id="username")
        await username_input.type_text("admin")
        
        password_input = await tab.find(id="password")
        await password_input.type_text("admin")
        
        submit_btn = await tab.find(css_selector='input[type="submit"]')
        await submit_btn.click()

        # Step 2: Now use direct HTTP requests with the authenticated session
        response = await tab.request.get(
            "https://quotes.toscrape.com/api/quotes?page=1"
        )
        print(response.json())

asyncio.run(main())

The session cookies from the browser login carry over to tab.request.get(). No need to manually extract cookies or build a separate requests session.

This pattern cuts scraping time dramatically on sites where you need authentication but the actual data comes from API endpoints.
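
In practice that usually means looping over the API with tab.request once the login is done. A sketch of that continuation (the endpoint and response fields below are hypothetical; swap in whatever the site's real API returns):

async def fetch_all_pages(tab, max_pages=10):
    """Collect JSON results through the tab's authenticated session."""
    results = []
    for page in range(1, max_pages + 1):
        # Hypothetical endpoint and fields; adapt to the real API
        response = await tab.request.get(
            f"https://quotes.toscrape.com/api/quotes?page={page}"
        )
        data = response.json()
        results.extend(data.get("quotes", []))
        if not data.get("has_next"):
            break
    return results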

Using the @retry Decorator for Production Scrapers

Scrapers break. Elements don't load, connections drop, pages time out. Pydoll ships a @retry decorator that handles transient failures without cluttering your code with try/except blocks.

import asyncio
from pydoll.browser import Chrome
from pydoll.decorators import retry
from pydoll.exceptions import ElementNotFound, NetworkError

async def handle_failure(**kwargs):
    """Recovery logic — runs before each retry."""
    tab = kwargs.get("tab")
    if tab:
        await tab.go_to("about:blank")  # Reset state

@retry(
    max_retries=3,
    exceptions=[ElementNotFound, NetworkError],
    on_retry=handle_failure,
    exponential_backoff=True  # Waits 2s, 4s, 8s between retries
)
async def scrape_product(tab, url):
    await tab.go_to(url)
    await tab.wait_element(css_selector=".product-title", timeout=8)
    
    title_el = await tab.find(css_selector=".product-title")
    price_el = await tab.find(css_selector=".product-price")
    
    return {
        "title": await title_el.get_text(),
        "price": await price_el.get_text()
    }

Three things make this decorator worth using over a manual loop. First, exceptions lets you specify exactly which errors trigger a retry — you won't accidentally retry on a ValueError in your parsing logic. Second, on_retry runs custom recovery functions (like clearing cache or rotating proxies) before the next attempt. Third, exponential_backoff prevents hammering a struggling server.
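
Calling the decorated function looks like calling any other coroutine; the retries happen behind the scenes. A usage sketch continuing from the imports above (the product URLs are placeholders):

async def main():
    urls = [
        "https://example.com/products/1",  # placeholder URLs
        "https://example.com/products/2",
    ]
    async with Chrome() as browser:
        tab = await browser.start()
        for url in urls:
            try:
                print(await scrape_product(tab, url))
            except (ElementNotFound, NetworkError):
                # All three retries exhausted; log it and keep going
                print(f"Giving up on {url}")

asyncio.run(main())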

Intercepting Requests to Speed Up Scraping

Every page you scrape loads images, fonts, tracking scripts, and analytics. None of that matters for data extraction, and it all slows you down.

Pydoll's event system lets you intercept and block unnecessary resources:

import asyncio
from pydoll.browser import Chrome
from pydoll.events.network import NetworkEvents

BLOCKED_TYPES = {"Image", "Font", "Stylesheet", "Media"}

async def block_resources(event):
    """Drop requests for non-essential resource types."""
    resource_type = event.get("params", {}).get("type", "")

    if resource_type in BLOCKED_TYPES:
        # Abort the request
        return True
    return False

async def main():
    async with Chrome() as browser:
        tab = await browser.start()
        
        # Enable network interception
        await tab.enable_network_events()
        await tab.on(NetworkEvents.REQUEST_WILL_BE_SENT, block_resources)
        
        await tab.go_to("https://books.toscrape.com/")
        
        books = await tab.find(css_selector="article.product_pod", find_all=True)
        print(f"Found {len(books)} books (without loading images/css)")

asyncio.run(main())

Blocking images, fonts, and stylesheets alone can cut page load times by 40-60% on media-heavy sites. On a 1,000-page scraping job, that adds up to real time savings.

Integrating Proxies With Pydoll

When you're scraping at any meaningful scale, IP rotation isn't optional. Pydoll supports proxy configuration through ChromiumOptions.

Here's how to route your scraper through a proxy:

import asyncio
from pydoll.browser import Chrome
from pydoll.browser.options import ChromiumOptions

async def main():
    options = ChromiumOptions()
    
    # Set your proxy — works with HTTP, HTTPS, and SOCKS5
    options.add_argument("--proxy-server=http://USERNAME:PASSWORD@proxy.example.com:8080")

    async with Chrome(options=options) as browser:
        tab = await browser.start()
        
        # Verify your IP changed
        await tab.go_to("https://httpbin.org/ip")
        ip_text = await tab.execute_script(
            "return document.querySelector('pre').textContent"
        )
        print(f"Current IP: {ip_text}")

asyncio.run(main())

For rotating proxies, you can use browser contexts. Each context supports its own proxy configuration, so you can run multiple sessions with different IPs from a single browser process.
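
A sketch of that rotation, following the new_context(proxy=...) pattern from the concurrency section and reusing scrape_page() from earlier (the proxy URLs are placeholders for your own pool):

import asyncio
from pydoll.browser import Chrome

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",  # placeholder proxy pool
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

async def main():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 4)
    ]

    async with Chrome() as browser:
        await browser.start()

        # One isolated context (and tab) per proxy
        tabs = []
        for proxy in PROXIES:
            context = await browser.new_context(proxy={"server": proxy})
            tabs.append(await context.new_tab())

        # Each tab scrapes its page through a different IP
        results = await asyncio.gather(
            *(scrape_page(tab, url) for tab, url in zip(tabs, urls))
        )
        print(f"Scraped {sum(len(r) for r in results)} titles through {len(PROXIES)} proxies")

asyncio.run(main())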

Residential proxies work best here since datacenter IPs get flagged fast on protected sites. If you need a residential proxy pool, Roundproxies provides rotating residential, ISP, and mobile proxies that pair well with Pydoll's async architecture.

For maximum stealth, combine proxy rotation with request delays:

import asyncio
import random

async def scrape_with_delay(tab, url):
    await asyncio.sleep(random.uniform(1.0, 3.0))  # Random delay
    await tab.go_to(url)
    # ... scraping logic

Random delays of 1-3 seconds make your request pattern look less like a bot firing requests at fixed intervals.

Best Practices for Pydoll Scraping

These come from running Pydoll in production, not from reading docs.

Always use the async context manager. async with Chrome() as browser: ensures Chrome shuts down cleanly even if your script crashes. Orphaned Chrome processes eat memory fast and can block subsequent runs.

Set explicit timeouts everywhere. The default timeout is generous, which means a broken scraper sits idle for minutes before failing. Set timeout=8 on wait_element() calls and you'll catch problems faster.

Use headless mode for production, headed mode for debugging. During development, let Chrome run visibly so you can watch what's happening. Switch to headless for deployment:

options = ChromiumOptions()
options.add_argument("--headless=new")

The --headless=new flag uses Chrome's updated headless mode, which behaves identically to headed Chrome. The old --headless flag runs a different rendering path that some anti-bot systems detect.

Respect rate limits and robots.txt. The fact that Pydoll can bypass protections doesn't mean you should ignore the rules. Add delays between requests, check robots.txt before scraping, and don't hammer servers during peak hours.

Log your scraping runs. When you're scraping thousands of pages, you need to know which URLs failed and why. Python's built-in logging module is enough — just log the URL, status, and any exceptions at the INFO level.
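
A minimal setup that covers this: configure logging once at startup, then record each URL's outcome as you scrape.

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    filename="scrape.log",
)
logger = logging.getLogger("scraper")

async def scrape_one(tab, url):
    try:
        await tab.go_to(url)
        await tab.wait_element(css_selector="article.product_pod", timeout=8)
        logger.info("OK %s", url)
    except Exception as exc:
        # Record the failure and keep the run going
        logger.warning("FAILED %s (%s)", url, exc)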

Common Errors and Fixes

"No browser found" or "Could not find Chrome"

Pydoll auto-detects Chrome, but sometimes fails on non-standard installs.

Fix: Set the binary path explicitly:

options = ChromiumOptions()
options.binary_location = "/usr/bin/google-chrome"  # Your Chrome path

"TimeoutError: Element not found within timeout"

The element you're waiting for either doesn't exist or loads after your timeout.

Fix: Increase the timeout and verify your selector is correct:

# Bump timeout to 15 seconds
await tab.wait_element(css_selector=".my-element", timeout=15)

Open the target page in a regular browser and verify the selector matches what you expect using DevTools.

"ConnectionRefusedError" or "CDP connection failed"

Another Chrome process might be using the debugging port, or Chrome crashed.

Fix: Kill any lingering Chrome processes and retry:

pkill -f chrome  # Linux/Mac
taskkill /f /im chrome.exe  # Windows

Scraper works locally but fails in Docker

Headless Chrome in Docker needs specific flags.

Fix: Add these options:

options = ChromiumOptions()
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-gpu")

Wrapping Up

Pydoll strips away the worst parts of browser automation — driver management, detection fingerprints, and synchronous blocking — and replaces them with a clean async API that talks directly to Chrome via CDP.

The features that matter most for real scraping work: concurrent multi-tab execution for speed, browser-context requests for the hybrid auth-then-API pattern, and the @retry decorator so your scripts don't die at 3 AM on a transient network error.

Start with the basic scraper above, then add complexity as you need it. The official Pydoll documentation covers advanced topics like browser fingerprint management and the full evasion strategy in detail.