Python Web Scraping 2026: Step-by-step tutorial

Want to automatically extract data from websites without copying and pasting for hours? Web scraping with Python lets you do exactly that.

Python remains the go-to language for scraping because of its readable syntax and massive library ecosystem. Whether you're collecting product prices, monitoring competitor websites, or building datasets for machine learning, this guide has you covered.

In this guide, you'll learn how to build scrapers that work on both static HTML pages and JavaScript-heavy dynamic sites. We'll cover multiple approaches from simple HTTP requests to full browser automation.

What Is Web Scraping with Python?

Web scraping with Python involves writing automated scripts that fetch web pages, parse their HTML content, and extract specific data elements. The process mimics what a human does when browsing—visiting URLs, reading content, and copying information—but at machine speed.

Python dominates the scraping landscape because it offers libraries for every step of the process. You can send HTTP requests with requests, parse HTML with BeautifulSoup, control browsers with Playwright or Selenium, and build full crawlers with Scrapy.

The scraping workflow breaks down into four stages. First, you connect to the target page. Second, you retrieve its HTML content. Third, you locate and extract the data you need. Fourth, you store that data in a usable format like CSV or JSON.
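
To make the stages concrete, here is a minimal sketch that walks through all four using requests and BeautifulSoup (both installed in the setup section below); the output file name is just an example:

import csv

import requests
from bs4 import BeautifulSoup

# Stages 1 and 2: connect to the page and retrieve its HTML
response = requests.get("https://quotes.toscrape.com/")

# Stage 3: locate and extract the data you need
soup = BeautifulSoup(response.text, "lxml")
rows = [{"author": el.get_text(strip=True)} for el in soup.select(".author")]

# Stage 4: store the data in a usable format (CSV here)
with open("authors.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["author"])
    writer.writeheader()
    writer.writerows(rows)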

Static sites return complete HTML from the server. Dynamic sites load content through JavaScript after the initial page load. These two types require different scraping approaches.

Setting Up Your Python Scraping Environment

Before writing any scraping code, you need a properly configured Python environment. This prevents library conflicts and keeps your projects organized.

Prerequisites

You need Python 3.8 or newer installed on your system. Check your version by running this command in your terminal:

python3 --version

You should see output like Python 3.11.4 or similar. If Python isn't installed, download it from the official Python website.

Creating a Virtual Environment

Virtual environments isolate your project dependencies. Create one with these commands:

mkdir python-scraper
cd python-scraper
python3 -m venv venv

Activate the virtual environment. On macOS and Linux:

source venv/bin/activate

On Windows:

venv\Scripts\activate

Your terminal prompt should now show (venv) indicating the environment is active. All packages you install will stay within this environment.

Installing Core Libraries

Install the essential scraping libraries with pip:

pip install requests beautifulsoup4 lxml playwright selenium scrapy

This installs everything you need for both static and dynamic scraping. We'll cover each library in detail throughout this guide.
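
As a quick sanity check, you can confirm the packages installed correctly by printing their versions (a small optional sketch; the version numbers will differ on your machine):

from importlib.metadata import version

for package in ["requests", "beautifulsoup4", "lxml", "playwright", "selenium", "scrapy"]:
    print(package, version(package))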

Method 1: Scraping Static Sites with Requests and Beautiful Soup

For static HTML pages, the combination of requests and BeautifulSoup provides the fastest and simplest approach. This method works when the data you want exists in the initial HTML response from the server.

Understanding Static vs Dynamic Content

Open your browser's developer tools and view the page source. If you can see the data you want in that raw HTML, the page is static.

If the source shows empty containers or JavaScript variables, the page loads content dynamically.
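
You can also check programmatically: fetch the page with a plain HTTP request and see whether text you can read in the browser appears in the raw response. A minimal sketch (the search string is just an example that happens to appear on the practice site):

import requests

url = "https://quotes.toscrape.com/"
html = requests.get(url).text

# If text visible in the browser is missing from the raw HTML,
# the page most likely renders it with JavaScript.
if "Albert Einstein" in html:
    print("Static: the data is in the initial HTML")
else:
    print("Dynamic: the data is loaded by JavaScript")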

Making HTTP Requests

The requests library handles HTTP communication. Here's how to fetch a webpage:

import requests

url = "https://quotes.toscrape.com/"
response = requests.get(url)

print(response.status_code)  # Should be 200
print(response.text[:500])   # First 500 characters of HTML

The get() method sends an HTTP GET request to the specified URL. The response object contains the status code, headers, and the page content in the text attribute.

Always check the status code before parsing. A 200 means success. Codes like 403 or 429 indicate blocking or rate limiting.
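
If you prefer exceptions over manual checks, raise_for_status() turns 4xx and 5xx responses into an HTTPError you can catch:

import requests

response = requests.get("https://quotes.toscrape.com/", timeout=30)

try:
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx status codes
except requests.HTTPError as e:
    print(f"Request failed or was blocked: {e}")
else:
    print("Success:", response.status_code)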

Adding Request Headers

Websites can detect scrapers by examining request headers. Mimic a real browser by setting appropriate headers:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

url = "https://quotes.toscrape.com/"
response = requests.get(url, headers=headers)

The User-Agent header tells the server which browser you're using. Using a current Chrome or Firefox user agent reduces the chance of being blocked.

Parsing HTML with Beautiful Soup

Beautiful Soup transforms raw HTML text into a navigable tree structure. You can then search for elements using tag names, CSS classes, or CSS selectors.

from bs4 import BeautifulSoup

html = response.text
soup = BeautifulSoup(html, "lxml")

# Find elements by tag
all_divs = soup.find_all("div")

# Find by class
quotes = soup.find_all("div", class_="quote")

# Find by CSS selector
authors = soup.select(".author")

The lxml parser offers better performance than Python's built-in html.parser. Install it with pip install lxml if you haven't already.

Extracting Data from Elements

Once you've located elements, extract their text content or attribute values:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# Get text content
quote_element = soup.select_one(".text")
quote_text = quote_element.get_text(strip=True)

# Get attribute value
link_element = soup.select_one("a")
href = link_element.get("href")

# Get all text from nested elements
full_text = soup.get_text(separator=" ", strip=True)

The get_text() method extracts visible text from an element and its children.

The strip=True parameter removes leading and trailing whitespace.

Complete Static Scraping Example

Here's a full working scraper that extracts quotes from the practice site:

import requests
from bs4 import BeautifulSoup
import csv

def scrape_quotes():
    """Scrape all quotes from quotes.toscrape.com"""
    
    base_url = "https://quotes.toscrape.com"
    all_quotes = []
    page = 1
    
    while True:
        # Build URL for current page
        url = f"{base_url}/page/{page}/"
        
        # Fetch the page
        response = requests.get(url)
        
        # Check if page exists
        if response.status_code != 200:
            break
            
        # Parse HTML
        soup = BeautifulSoup(response.text, "lxml")
        
        # Find all quote containers
        quote_elements = soup.select(".quote")
        
        # Stop if no quotes found
        if not quote_elements:
            break
        
        # Extract data from each quote
        for quote in quote_elements:
            text = quote.select_one(".text").get_text(strip=True)
            author = quote.select_one(".author").get_text(strip=True)
            tags = [tag.get_text() for tag in quote.select(".tag")]
            
            all_quotes.append({
                "text": text,
                "author": author,
                "tags": ", ".join(tags)
            })
        
        print(f"Scraped page {page}: {len(quote_elements)} quotes")
        page += 1
    
    return all_quotes

# Run the scraper
quotes = scrape_quotes()
print(f"Total quotes collected: {len(quotes)}")

This scraper automatically handles pagination by incrementing the page number until no more quotes appear. The while loop continues until either the server returns an error or no quote elements exist on the page.
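
The example imports the csv module but leaves saving up to you; one way to write the collected quotes to disk is with csv.DictWriter:

with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(quotes)

print("Saved quotes to quotes.csv")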

Method 2: Scraping Dynamic Sites with Playwright

Modern websites often load content through JavaScript after the initial page render. Traditional HTTP clients can't execute JavaScript, so you need a browser automation tool.

Playwright, developed by Microsoft, provides fast and reliable browser automation. It supports Chromium, Firefox, and WebKit, offering cross-browser compatibility.

Installing Playwright

Install Playwright and download the browser binaries:

pip install playwright
playwright install

The second command downloads Chromium, Firefox, and WebKit browsers that Playwright will control.

Basic Playwright Usage

Playwright launches a real browser and lets you interact with pages programmatically:

from playwright.sync_api import sync_playwright

def scrape_dynamic_page():
    with sync_playwright() as p:
        # Launch browser in headless mode
        browser = p.chromium.launch(headless=True)
        
        # Create new page
        page = browser.new_page()
        
        # Navigate to URL
        page.goto("https://quotes.toscrape.com/scroll")
        
        # Wait for content to load
        page.wait_for_selector(".quote")
        
        # Get page content
        content = page.content()
        print(content[:1000])
        
        # Close browser
        browser.close()

scrape_dynamic_page()

The headless=True parameter runs the browser without a visible window. Set it to False during development to watch your scraper work.

Waiting for Dynamic Content

Dynamic pages require explicit waits. The content might not exist immediately after navigation completes.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    
    page.goto("https://quotes.toscrape.com/scroll")
    
    # Wait for specific element
    page.wait_for_selector(".quote", timeout=10000)
    
    # Or wait for network to be idle
    page.wait_for_load_state("networkidle")
    
    # Extract data
    quotes = page.locator(".quote").all()
    
    for quote in quotes:
        text = quote.locator(".text").text_content()
        author = quote.locator(".author").text_content()
        print(f"{author}: {text[:50]}...")
    
    browser.close()

The timeout parameter specifies how long to wait in milliseconds. Increase it for slow-loading pages.

Handling Infinite Scroll

Some sites load more content as you scroll down. Automate scrolling to trigger these loads:

from playwright.sync_api import sync_playwright
import time

def scrape_infinite_scroll():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        page.goto("https://quotes.toscrape.com/scroll")
        
        # Scroll to load all content
        previous_height = 0
        while True:
            # Scroll to bottom
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            
            # Wait for new content
            time.sleep(2)
            
            # Get new height
            new_height = page.evaluate("document.body.scrollHeight")
            
            # Break if no new content loaded
            if new_height == previous_height:
                break
            
            previous_height = new_height
        
        # Now extract all loaded quotes
        quotes = page.locator(".quote").all()
        print(f"Found {len(quotes)} quotes after scrolling")
        
        browser.close()

scrape_infinite_scroll()

This script scrolls until the page height stops increasing, indicating all content has loaded.

Complete Playwright Scraping Example

Here's a production-ready scraper for dynamic content:

from playwright.sync_api import sync_playwright
import json
import time

def scrape_dynamic_quotes():
    """Scrape quotes from dynamic JavaScript-rendered page"""
    
    all_quotes = []
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        
        # Set viewport size
        page = browser.new_page(viewport={"width": 1920, "height": 1080})
        
        # Navigate to page
        page.goto("https://quotes.toscrape.com/scroll")
        
        # Wait for initial quotes
        page.wait_for_selector(".quote")
        
        # Scroll to load all quotes
        for _ in range(10):  # Maximum 10 scroll attempts
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(1)
        
        # Extract quotes using Playwright locators
        quote_elements = page.locator(".quote").all()
        
        for element in quote_elements:
            text = element.locator(".text").text_content()
            author = element.locator(".author").text_content()
            
            # Clean the text
            text = text.strip().strip('“”')  # Remove the curly quotation marks around the text
            author = author.strip()
            
            all_quotes.append({
                "text": text,
                "author": author
            })
        
        browser.close()
    
    return all_quotes

# Run scraper
quotes = scrape_dynamic_quotes()

# Save to JSON
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(quotes, f, indent=2, ensure_ascii=False)

print(f"Saved {len(quotes)} quotes to quotes.json")

Method 3: Browser Automation with Selenium

Selenium has been the industry standard for browser automation for over a decade. While Playwright is newer and often faster, Selenium remains widely used and well-documented.

Installing Selenium

Install Selenium with pip:

pip install selenium

Modern Selenium (version 4.6+) automatically downloads the correct browser driver. You just need Chrome or Firefox installed on your system.

Basic Selenium Usage

Selenium controls browsers through a WebDriver interface:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def basic_selenium_scrape():
    # Configure Chrome options
    options = Options()
    options.add_argument("--headless")  # Run without GUI
    
    # Initialize driver
    driver = webdriver.Chrome(options=options)
    
    try:
        # Navigate to page
        driver.get("https://quotes.toscrape.com/")
        
        # Find elements
        quotes = driver.find_elements(By.CSS_SELECTOR, ".quote")
        
        for quote in quotes:
            text = quote.find_element(By.CSS_SELECTOR, ".text").text
            author = quote.find_element(By.CSS_SELECTOR, ".author").text
            print(f"{author}: {text[:50]}...")
    
    finally:
        # Always close the driver
        driver.quit()

basic_selenium_scrape()

The try/finally block ensures the browser closes even if an error occurs. Failing to close browsers leads to resource leaks.

Explicit Waits in Selenium

Selenium's explicit waits handle dynamic content loading:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def selenium_with_waits():
    options = Options()
    options.add_argument("--headless")
    
    driver = webdriver.Chrome(options=options)
    
    try:
        driver.get("https://quotes.toscrape.com/scroll")
        
        # Wait up to 10 seconds for quotes to appear
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".quote")))
        
        quotes = driver.find_elements(By.CSS_SELECTOR, ".quote")
        print(f"Found {len(quotes)} quotes")
        
    finally:
        driver.quit()

selenium_with_waits()

Expected conditions include presence_of_element_located, visibility_of_element_located, element_to_be_clickable, and many more.
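
For example, element_to_be_clickable is handy when you want to click something as soon as it becomes interactive. A short sketch that continues the setup above (the pagination selector is just an illustration):

# Wait until the "Next" pagination link is clickable, then click it
next_link = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "li.next a")))
next_link.click()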

Interacting with Page Elements

Selenium can fill forms, click buttons, and simulate keyboard input:

import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def selenium_interactions():
    options = Options()
    options.add_argument("--headless")
    
    driver = webdriver.Chrome(options=options)
    
    try:
        driver.get("https://quotes.toscrape.com/login")
        
        # Find and fill username field
        username = driver.find_element(By.ID, "username")
        username.send_keys("testuser")
        
        # Find and fill password field
        password = driver.find_element(By.ID, "password")
        password.send_keys("testpass")
        
        # Submit the form
        password.send_keys(Keys.RETURN)
        
        # Wait for redirect
        time.sleep(2)
        
        print(f"Current URL: {driver.current_url}")
        
    finally:
        driver.quit()

selenium_interactions()

Method 4: Using Scrapy for Large-Scale Scraping

Scrapy is a complete web crawling framework designed for extracting data from websites at scale. Unlike simple scripts, Scrapy handles concurrency, rate limiting, and data pipelines automatically.

Installing Scrapy

Install Scrapy with pip:

pip install scrapy

Creating a Scrapy Project

Scrapy uses a project structure. Create one with the command line tool:

scrapy startproject quotescraper
cd quotescraper
scrapy genspider quotes quotes.toscrape.com

This creates a directory structure with configuration files and a spider template.
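
The generated layout looks roughly like this (file names can vary slightly between Scrapy versions):

quotescraper/
    scrapy.cfg              # Deployment configuration
    quotescraper/
        __init__.py
        items.py            # Item definitions
        middlewares.py      # Request/response middlewares
        pipelines.py        # Data processing pipelines
        settings.py         # Project settings
        spiders/
            __init__.py
            quotes.py       # Spider created by genspider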

Writing a Scrapy Spider

Spiders define how to crawl pages and extract data. Edit the generated spider file:

# quotescraper/spiders/quotes.py

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    
    def parse(self, response):
        # Extract quotes from current page
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(),
                "author": quote.css(".author::text").get(),
                "tags": quote.css(".tag::text").getall()
            }
        
        # Follow pagination links
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

The yield statement returns extracted data and continues crawling. Scrapy handles the asynchronous execution behind the scenes.

Running the Spider

Run your spider from the command line:

scrapy crawl quotes -o quotes.json

The -o flag specifies the output file. Scrapy supports JSON, CSV, and XML formats natively.

Configuring Scrapy Settings

Adjust settings in settings.py to control crawling behavior:

# quotescraper/settings.py

# Respect robots.txt
ROBOTSTXT_OBEY = True

# Add delay between requests
DOWNLOAD_DELAY = 1

# Limit concurrent requests
CONCURRENT_REQUESTS = 8

# Set user agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"

# Enable auto-throttle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10

The auto-throttle feature automatically adjusts request rates based on server response times. This helps avoid overloading target servers.

Handling Anti-Bot Protection

Websites increasingly deploy anti-bot measures to block scrapers. Understanding these defenses helps you build more resilient scrapers.

Common Anti-Bot Techniques

Rate limiting restricts how many requests an IP address can make within a time period. Exceed the limit and you'll receive 429 errors or temporary blocks.

JavaScript challenges require browsers to execute scripts before accessing content. Simple HTTP clients fail these checks immediately.

CAPTCHAs ask users to solve puzzles proving they're human. These appear after suspicious activity detection.

Fingerprinting analyzes browser characteristics like screen resolution, installed fonts, and WebGL capabilities. Headless browsers often have detectable fingerprints.
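
As a concrete example of dealing with rate limiting, a scraper can watch for 429 responses and honor the Retry-After header when the server provides one. A minimal sketch (the fallback delay of 10 seconds is an arbitrary choice):

import time

import requests

def fetch_with_rate_limit_handling(url):
    response = requests.get(url, timeout=30)

    if response.status_code == 429:
        # The server may say how long to back off; fall back to 10 seconds
        retry_after = response.headers.get("Retry-After")
        wait_seconds = int(retry_after) if retry_after and retry_after.isdigit() else 10
        print(f"Rate limited, waiting {wait_seconds}s before retrying")
        time.sleep(wait_seconds)
        response = requests.get(url, timeout=30)

    return response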

Rotating User Agents

Vary your User-Agent header to appear as different browsers:

import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
]

def get_random_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    }

url = "https://quotes.toscrape.com/"
response = requests.get(url, headers=get_random_headers())

Adding Request Delays

Avoid triggering rate limits by spacing out requests:

import time
import random
import requests

def respectful_scrape(urls):
    results = []
    
    for url in urls:
        response = requests.get(url, headers=get_random_headers())
        results.append(response.text)
        
        # Random delay between 1 and 3 seconds
        delay = random.uniform(1, 3)
        time.sleep(delay)
    
    return results

Random delays appear more natural than fixed intervals.

Handling Request Failures

Build retry logic for transient failures:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session_with_retries():
    session = requests.Session()
    
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    
    return session

session = create_session_with_retries()
response = session.get("https://quotes.toscrape.com/")

The backoff factor increases wait time between retries exponentially.

Using Proxies for Web Scraping

Proxies route your requests through different IP addresses. This helps avoid IP-based blocks and rate limits.

Types of Proxies

Datacenter proxies come from cloud servers. They're fast and cheap but easily detected.

Residential proxies use IP addresses from real ISP customers. They appear legitimate but cost more.

Mobile proxies route through cellular networks. They're hardest to detect but most expensive.

Rotating proxies automatically switch IPs for each request. This provides the best protection against blocks.

If you need reliable proxies for your scraping projects, providers like Roundproxies.com offer residential, datacenter, ISP, and mobile proxy options with automatic rotation.

Implementing Proxy Rotation

Here's how to rotate proxies with requests:

import requests
import random

PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def get_with_proxy(url):
    proxy = random.choice(PROXIES)
    proxies = {
        "http": proxy,
        "https": proxy
    }
    
    response = requests.get(url, proxies=proxies, timeout=30)
    return response

response = get_with_proxy("https://example.com")

Using Proxies with Playwright

Playwright supports proxy configuration at browser launch:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://proxy.example.com:8080",
            "username": "user",
            "password": "pass"
        }
    )
    
    page = browser.new_page()
    page.goto("https://example.com")
    
    browser.close()

Using Proxies with Selenium

Configure Selenium to use proxies through Chrome options:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def selenium_with_proxy(proxy_url):
    options = Options()
    options.add_argument("--headless")
    options.add_argument(f"--proxy-server={proxy_url}")
    
    driver = webdriver.Chrome(options=options)
    return driver

driver = selenium_with_proxy("http://proxy.example.com:8080")
driver.get("https://example.com")
driver.quit()

Storing Scraped Data

Once you've extracted data, you need to store it in a useful format. The choice depends on data structure and intended use.

Saving to CSV

CSV works well for tabular data with consistent fields:

import csv

def save_to_csv(data, filename):
    if not data:
        return
    
    fieldnames = data[0].keys()
    
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

quotes = [
    {"text": "Quote 1", "author": "Author 1"},
    {"text": "Quote 2", "author": "Author 2"}
]

save_to_csv(quotes, "quotes.csv")

Saving to JSON

JSON preserves nested structures and mixed data types:

import json

def save_to_json(data, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

quotes = [
    {"text": "Quote 1", "author": "Author 1", "tags": ["wisdom", "life"]},
    {"text": "Quote 2", "author": "Author 2", "tags": ["humor"]}
]

save_to_json(quotes, "quotes.json")

Saving to SQLite Database

Databases handle large datasets and enable complex queries:

import sqlite3

def save_to_sqlite(data, db_name, table_name):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    
    # Create table
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {table_name} (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            text TEXT,
            author TEXT,
            tags TEXT
        )
    """)
    
    # Insert data
    for item in data:
        cursor.execute(
            f"INSERT INTO {table_name} (text, author, tags) VALUES (?, ?, ?)",
            (item["text"], item["author"], ", ".join(item.get("tags", [])))
        )
    
    conn.commit()
    conn.close()

save_to_sqlite(quotes, "quotes.db", "quotes")

Common Mistakes and How to Avoid Them

Years of experience with web scraping in Python reveal recurring patterns in what goes wrong. Learning from these mistakes saves hours of debugging.

Not Handling Errors Gracefully

Network requests fail. Servers go down. Elements disappear. Wrap critical code in try/except blocks:

import requests
from bs4 import BeautifulSoup

def safe_scrape(url):
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, "lxml")
        title = soup.select_one("h1")
        
        return title.get_text() if title else None
        
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None
    except Exception as e:
        print(f"Parsing failed: {e}")
        return None

Ignoring robots.txt

The robots.txt file specifies which pages scrapers should avoid. Respecting it demonstrates good citizenship:

from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent="*"):
    parser = RobotFileParser()
    
    # Find robots.txt URL
    from urllib.parse import urlparse
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    
    parser.set_url(robots_url)
    parser.read()
    
    return parser.can_fetch(user_agent, url)

if can_scrape("https://example.com/page"):
    # Proceed with scraping
    pass

Scraping Too Aggressively

Hammering servers with requests can get you blocked and may even cause legal issues. Always add delays and respect rate limits.

Not Validating Extracted Data

Empty or malformed data causes downstream problems. Validate before saving:

def validate_quote(quote):
    if not quote.get("text"):
        return False
    if not quote.get("author"):
        return False
    if len(quote["text"]) < 10:
        return False
    return True

valid_quotes = [q for q in quotes if validate_quote(q)]

Hardcoding Selectors

Websites change their HTML structure frequently. Make selectors configurable:

SELECTORS = {
    "quote_container": ".quote",
    "quote_text": ".text",
    "quote_author": ".author",
    "next_page": "li.next a"
}

def scrape_with_config(soup, selectors):
    quotes = []
    
    for container in soup.select(selectors["quote_container"]):
        text_el = container.select_one(selectors["quote_text"])
        author_el = container.select_one(selectors["quote_author"])
        
        if text_el and author_el:
            quotes.append({
                "text": text_el.get_text(strip=True),
                "author": author_el.get_text(strip=True)
            })
    
    return quotes

Advanced Techniques for 2026

The scraping landscape continues to evolve. Here are cutting-edge approaches that will define web scraping with Python in 2026 and beyond.

Async Scraping with HTTPX

HTTPX offers async capabilities that dramatically speed up scraping multiple pages:

import httpx
import asyncio
from bs4 import BeautifulSoup

async def fetch_page(client, url):
    """Fetch a single page asynchronously"""
    response = await client.get(url)
    return response.text

async def scrape_multiple_pages(urls):
    """Scrape multiple pages concurrently"""
    
    async with httpx.AsyncClient() as client:
        # Create tasks for all URLs
        tasks = [fetch_page(client, url) for url in urls]
        
        # Execute all requests concurrently
        pages = await asyncio.gather(*tasks)
        
        results = []
        for html in pages:
            soup = BeautifulSoup(html, "lxml")
            # Extract data from each page
            title = soup.select_one("h1")
            if title:
                results.append(title.get_text(strip=True))
        
        return results

# Generate URLs for first 10 pages
urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]

# Run async scraper
results = asyncio.run(scrape_multiple_pages(urls))
print(f"Scraped {len(results)} pages")

Async scraping can be 5-10x faster than synchronous approaches when scraping many pages.

AI-Powered Data Extraction

Large language models can extract structured data from messy HTML without brittle CSS selectors:

from openai import OpenAI
from bs4 import BeautifulSoup

def extract_with_ai(html, prompt):
    """Use AI to extract structured data from HTML"""
    
    # Clean HTML to reduce tokens
    soup = BeautifulSoup(html, "lxml")
    text = soup.get_text(separator="\n", strip=True)
    
    client = OpenAI()  # The client reads the OPENAI_API_KEY environment variable
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract structured data from the provided text."},
            {"role": "user", "content": f"{prompt}\n\nText:\n{text[:4000]}"}
        ]
    )
    
    return response.choices[0].message.content

# Example usage (html is the page source fetched with any of the methods shown earlier)
prompt = "Extract all product names and prices as JSON"
result = extract_with_ai(html, prompt)

This approach handles layout changes gracefully since it focuses on semantic meaning rather than DOM structure.

Headless Browser Stealth Mode

Modern anti-bot systems detect headless browsers through various fingerprinting techniques. Use stealth plugins to appear more human:

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def stealth_scrape(url):
    """Scrape with stealth mode to avoid detection"""
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        # Apply stealth settings
        stealth_sync(page)
        
        # Now navigate
        page.goto(url)
        
        content = page.content()
        browser.close()
        
        return content

Install the stealth plugin with pip install playwright-stealth.

Building Resilient Scrapers

Production scrapers need monitoring and automatic recovery:

import logging
import time
from datetime import datetime

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),
        logging.StreamHandler()
    ]
)

class ResilientScraper:
    def __init__(self, max_retries=3, backoff_factor=2):
        self.max_retries = max_retries
        self.backoff_factor = backoff_factor
        self.stats = {
            "success": 0,
            "failed": 0,
            "retries": 0
        }
    
    def scrape_with_retry(self, url, scrape_func):
        """Attempt scraping with exponential backoff"""
        
        for attempt in range(self.max_retries):
            try:
                result = scrape_func(url)
                self.stats["success"] += 1
                logging.info(f"Successfully scraped: {url}")
                return result
                
            except Exception as e:
                self.stats["retries"] += 1
                wait_time = self.backoff_factor ** attempt
                
                logging.warning(
                    f"Attempt {attempt + 1} failed for {url}: {e}. "
                    f"Retrying in {wait_time}s"
                )
                
                time.sleep(wait_time)
        
        self.stats["failed"] += 1
        logging.error(f"All attempts failed for: {url}")
        return None
    
    def get_stats(self):
        return self.stats

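A quick usage sketch, pairing the class with a simple requests-based fetch function (the target URL is just a placeholder):

import requests

def fetch_html(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

scraper = ResilientScraper(max_retries=3, backoff_factor=2)
html = scraper.scrape_with_retry("https://quotes.toscrape.com/", fetch_html)
print(scraper.get_stats())
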
Distributed Scraping Architecture

For very large projects, distribute scraping across multiple machines:

# Simple task queue using Redis
import redis
import json
from datetime import datetime

class ScrapingQueue:
    def __init__(self, redis_url="redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.queue_name = "scraping_tasks"
        self.results_name = "scraping_results"
    
    def add_task(self, url):
        """Add URL to scraping queue"""
        task = {"url": url, "added_at": datetime.now().isoformat()}
        self.redis.lpush(self.queue_name, json.dumps(task))
    
    def get_task(self):
        """Get next URL from queue"""
        task_json = self.redis.rpop(self.queue_name)
        if task_json:
            return json.loads(task_json)
        return None
    
    def save_result(self, url, data):
        """Save scraping result"""
        result = {"url": url, "data": data}
        self.redis.lpush(self.results_name, json.dumps(result))
    
    def pending_count(self):
        """Get count of pending tasks"""
        return self.redis.llen(self.queue_name)

Run multiple worker processes that pull URLs from the shared queue. This scales horizontally across machines.
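
A worker could look something like the sketch below, assuming a scrape_page() function like the ones shown earlier (scrape_page is a placeholder, not part of the class above):

import time

def run_worker(queue, scrape_page):
    """Pull tasks from the shared queue and process them until stopped."""
    while True:
        task = queue.get_task()
        if task is None:
            # Queue is empty; wait briefly in case new tasks arrive
            time.sleep(5)
            continue

        data = scrape_page(task["url"])
        queue.save_result(task["url"], data)

queue = ScrapingQueue()
queue.add_task("https://quotes.toscrape.com/")
# run_worker(queue, scrape_page)  # Start one worker per process or machine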

Comparing Python Scraping Libraries

Here's a quick comparison to help you choose the right tool:

Library          | Best For                 | Learning Curve | Speed     | Dynamic Content
Requests + BS4   | Static sites             | Easy           | Fast      | No
HTTPX            | Async static scraping    | Medium         | Very Fast | No
Playwright       | Dynamic sites, stealth   | Medium         | Medium    | Yes
Selenium         | Legacy projects, testing | Medium         | Slow      | Yes
Scrapy           | Large-scale crawling     | Hard           | Fast      | With plugins

Choose based on your specific requirements. Start simple and add complexity only when needed.

Ethical Scraping Best Practices

Responsible scraping benefits everyone. Follow these guidelines to maintain good relationships with target sites.

Respecting Server Resources

Your scraper impacts real infrastructure. Space out requests to avoid overwhelming servers:

import time
import random

def polite_scraper(urls, min_delay=1, max_delay=3):
    """Scrape with respectful delays"""
    
    for url in urls:
        # Your scraping logic here
        yield scrape_page(url)
        
        # Random delay appears more natural
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)

Checking Terms of Service

Many websites explicitly address scraping in their terms. Review these before starting any project. Some sites offer official APIs that provide cleaner data access.

Identifying Your Scraper

Include contact information in your User-Agent so site owners can reach you:

USER_AGENT = "MyScraperBot/1.0 (+https://mysite.com/scraper-info; contact@mysite.com)"

This transparency builds trust and often prevents blocks.

Caching Responses

Don't re-scrape data you already have. Cache responses to reduce server load:

import hashlib
import os
import json

class ResponseCache:
    def __init__(self, cache_dir="cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
    
    def _get_cache_path(self, url):
        """Generate cache file path from URL"""
        url_hash = hashlib.md5(url.encode()).hexdigest()
        return os.path.join(self.cache_dir, f"{url_hash}.json")
    
    def get(self, url):
        """Retrieve cached response"""
        path = self._get_cache_path(url)
        if os.path.exists(path):
            with open(path, "r") as f:
                return json.load(f)
        return None
    
    def set(self, url, data):
        """Cache response data"""
        path = self._get_cache_path(url)
        with open(path, "w") as f:
            json.dump(data, f)

cache = ResponseCache()

# Check cache before requesting
cached = cache.get(url)
if cached:
    data = cached
else:
    data = scrape_page(url)
    cache.set(url, data)

FAQ

Is web scraping with Python legal?

Web scraping with Python exists in a legal gray area. Scraping publicly available data is generally acceptable, but violating a site's terms of service or scraping private information can create legal issues. Always check the target site's terms before scraping.

What's the best Python library for web scraping?

The best tool for web scraping with Python depends on your needs. For static HTML pages, requests plus BeautifulSoup offers the simplest solution. For JavaScript-heavy sites, Playwright provides modern browser automation. Scrapy excels at large-scale crawling projects.

How do I avoid getting blocked while scraping?

Use realistic request headers, add random delays between requests, rotate IP addresses with proxies, and respect the site's robots.txt file. Avoid patterns that distinguish bots from humans.

Can I scrape websites that require login?

Yes. Browser automation tools like Playwright and Selenium can fill login forms and maintain session cookies. For API-based authentication, requests can handle cookies and tokens directly.

How often should I run my scraper?

That depends on how frequently the target data changes. For real-time prices, you might scrape hourly. For static content, weekly or monthly might suffice. Always consider server impact when scheduling frequent scrapes.

What's the difference between web scraping and web crawling?

Scraping extracts specific data from pages you've identified. Crawling discovers pages by following links across a website. Many projects combine both—crawling to find pages, then scraping to extract data.

Conclusion

Web scraping with Python remains one of the most powerful ways to collect data from the internet in 2026. Whether you're building a price monitoring tool, gathering research data, or feeding machine learning models, Python's ecosystem has you covered.

For static sites, start with requests and BeautifulSoup. They're simple, fast, and handle most use cases. When you encounter JavaScript-rendered content, switch to Playwright or Selenium for full browser automation.

Large-scale projects benefit from Scrapy's built-in handling of concurrency, retries, and data pipelines. Its learning curve pays off quickly when you're crawling thousands of pages.

Remember to scrape responsibly. Respect robots.txt files, add delays between requests, and use proxies when needed. Aggressive scraping damages websites and gets your IP banned.

The techniques in this guide form a solid foundation. Practice on the Quotes to Scrape sandbox site until the concepts become second nature. Then apply what you've learned to real-world projects.