Web scraping lets you automatically extract data from websites instead of copying and pasting like it's 1999. Python makes this easy with libraries like Requests, Beautiful Soup, and Selenium, but there's more to building a solid scraper than just firing off HTTP requests and hoping for the best.
In this guide, I'll walk you through the practical side of web scraping in Python—from basic techniques to performance optimization and anti-detection tricks that actually work. You'll learn how to handle JavaScript-heavy sites, avoid getting blocked, and scale your scrapers to handle thousands of pages without breaking a sweat.
Why Python for Web Scraping?
Python dominates the web scraping world for good reason. The syntax is clean enough that you can focus on solving problems instead of fighting the language. Plus, the ecosystem is packed with libraries built specifically for scraping.
But here's what nobody tells you: Python isn't the fastest language. For small to medium projects, this doesn't matter. For massive scrapers that need to hit thousands of pages per second, you might eventually look at Go or Node.js. That said, Python's async capabilities and the ability to distribute work across processes mean you can scale pretty far before hitting that wall.
Another advantage? The community. When you get stuck (and you will), there's probably a Stack Overflow answer or GitHub issue waiting for you.
Setting Up Your Environment
First things first—you need Python 3.8 or newer. I'm using 3.11 for this guide, but anything 3.8+ will work fine.
Create a virtual environment to keep your dependencies isolated:
python -m venv scraper-env
source scraper-env/bin/activate # On Windows: scraper-env\Scripts\activate
Now install the core libraries:
pip install requests beautifulsoup4 lxml httpx aiohttp
Here's what each does:
- requests: Makes HTTP requests (the standard, works everywhere)
- beautifulsoup4: Parses HTML and extracts data
- lxml: Fast HTML parser that Beautiful Soup can use
- httpx: Modern alternative to requests with async support
- aiohttp: For async HTTP requests at scale
You might also want playwright or selenium for JavaScript-heavy sites, but we'll get to that later.
Basic Web Scraping with Requests and Beautiful Soup
Let's start with a simple example. We'll scrape quotes from http://quotes.toscrape.com—a practice site that's scraper-friendly.
import requests
from bs4 import BeautifulSoup
# Make the HTTP request
url = "http://quotes.toscrape.com"
response = requests.get(url)
# Parse the HTML
soup = BeautifulSoup(response.content, 'lxml')
# Find all quote containers
quotes = soup.find_all('div', class_='quote')
# Extract the data
for quote in quotes:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    print(f"{text}\n— {author}\n")
This is your bread-and-butter scraper. You make a request, parse the HTML, find the elements you want, and extract the text.
What's happening here:
- requests.get() fetches the HTML
- BeautifulSoup() turns that HTML into a searchable tree
- find_all() grabs all matching elements
- get_text() extracts the actual text content
The 'lxml' parser is faster than the default html.parser, especially for larger pages. Always specify a parser—it makes your scraper more reliable.
Inspecting Websites: Finding the Data You Need
Before you write any code, you need to know where the data lives in the HTML. Open your browser's DevTools (F12 or right-click → Inspect) and start poking around.
Here's my process:
- Find one example of the data you want on the page
- Right-click it and select "Inspect" to jump to that element in the HTML
- Look for patterns—usually, multiple items share the same class or structure
- Check if the data loads dynamically (more on this later)
For example, on that quotes site, all quotes are wrapped in <div class="quote"> elements. Inside each, the quote text is in <span class="text"> and the author is in <small class="author">.
Pro tip: Use soup.select() with CSS selectors for cleaner code:
# Instead of this:
quotes = soup.find_all('div', class_='quote')
# You can write this:
quotes = soup.select('div.quote')
CSS selectors are usually shorter and match exactly what you see in DevTools.
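To make this concrete, here's the quotes scraper from earlier rewritten with CSS selectors. This is a minimal sketch based on the quotes.toscrape.com markup described above (the a.tag selector for the tag links is what that practice site uses):

import requests
from bs4 import BeautifulSoup

response = requests.get("http://quotes.toscrape.com")
soup = BeautifulSoup(response.content, 'lxml')

# select() takes the same selectors you'd test in the DevTools console
for quote in soup.select('div.quote'):
    text = quote.select_one('span.text').get_text()
    author = quote.select_one('small.author').get_text()
    tags = [tag.get_text() for tag in quote.select('a.tag')]
    print(f"{text} — {author} ({', '.join(tags)})")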
Handling Different Content Types
Not all data lives in nice, clean HTML tags. Sometimes you need to extract from different formats.
Tables
HTML tables are annoying to parse manually. Beautiful Soup makes it easier, but here's a trick: pandas can read tables directly:
import pandas as pd
url = "https://example.com/data-table"
tables = pd.read_html(url)
# If there are multiple tables, pick the one you want
df = tables[0]
print(df.head())
This returns a DataFrame you can work with immediately. Way faster than manually parsing <tr> and <td> tags.
JSON in HTML
Many modern sites embed data as JSON in <script> tags. This is actually easier to work with than HTML:
import json
import re
html = response.text
# Find JSON data in script tags
json_data = re.search(r'var products = ({.*?});', html, re.DOTALL)
if json_data:
    products = json.loads(json_data.group(1))
    print(products)
You're looking for patterns like var data = {...} or window.__INITIAL_STATE__ = {...}. The JSON is usually cleaner and more complete than what's rendered in the HTML.
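When the data lives in something like window.__INITIAL_STATE__, I usually search the script tags with Beautiful Soup instead of regexing the whole page. Here's a rough sketch—it assumes the matching script contains only that one assignment ending in a semicolon, which you should verify for your target site:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')
state = None
for script in soup.find_all('script'):
    content = script.string or ''
    if 'window.__INITIAL_STATE__' in content:
        # Take everything after the first '=' and drop the trailing semicolon
        raw = content.split('=', 1)[1].strip().rstrip(';')
        state = json.loads(raw)
        break

if state:
    print(list(state.keys()))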
Images and Files
To download images or PDFs:
import requests
from pathlib import Path
def download_file(url, save_path):
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(save_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
# Example
img_url = "https://example.com/image.jpg"
download_file(img_url, "image.jpg")
The stream=True parameter prevents loading huge files into memory all at once.
Scraping JavaScript-Heavy Sites
Here's where things get interesting. Many modern sites load content with JavaScript, meaning the HTML you get from requests is empty or incomplete.
Quick test: If you view the page source (Ctrl+U) and can't find your data, it's being loaded with JavaScript.
You have two options:
Option 1: Find the API (The Smart Way)
Most JavaScript-heavy sites load data from an API. Open DevTools → Network tab → XHR/Fetch, then refresh the page. You'll see API requests that return JSON data—often easier to work with than HTML.
import requests
# Instead of scraping the HTML
api_url = "https://api.example.com/products?page=1"
headers = {
'User-Agent': 'Mozilla/5.0',
'Accept': 'application/json'
}
response = requests.get(api_url, headers=headers)
data = response.json()
for item in data['products']:
    print(item['name'], item['price'])
This is faster and cleaner than using a headless browser. The API usually returns exactly the data you need without all the HTML markup.
Option 2: Use Playwright (The Heavy Way)
If there's no accessible API, you need a real browser. Playwright is better than Selenium for most use cases—it's faster, more reliable, and has a better API.
pip install playwright
playwright install # Downloads browser binaries
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    page.wait_for_selector('.product-card')  # Wait for content to load
    # Get the rendered HTML
    html = page.content()
    # Or extract data directly
    products = page.query_selector_all('.product-card')
    for product in products:
        name = product.query_selector('.name').inner_text()
        price = product.query_selector('.price').inner_text()
        print(f"{name}: {price}")
    browser.close()
Playwright handles all the JavaScript execution, waiting for elements to load, and even scrolling if needed. The downside? It's slower and uses more resources than simple HTTP requests.
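If the page only reveals more content as you scroll, you can drive that from Playwright too. Here's a minimal sketch—the URL, the .feed-item selector, and the five-scroll count are placeholders you'd tune for the real page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/feed')
    for _ in range(5):  # Scroll a few screens' worth of content
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1000)  # Give lazy-loaded items time to arrive
    items = page.query_selector_all('.feed-item')
    print(f"Loaded {len(items)} items")
    browser.close()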
Avoiding Detection Without Paid Tools
Websites block scrapers for a reason—you're using their bandwidth and computing resources. But if you're respectful and follow best practices, you can avoid most blocks without paying for proxy services.
Set Proper Headers
At minimum, always set a User-Agent:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1'
}
response = requests.get(url, headers=headers)
The default requests User-Agent screams "I'm a bot!" Adding realistic headers makes you look like a regular browser.
Rotate User Agents
Don't use the same User-Agent for every request:
import random
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0',
]
headers = {
'User-Agent': random.choice(user_agents)
}
Add Random Delays
Never scrape at a constant rate—vary your request timing:
import time
import random
for url in urls:
    response = requests.get(url, headers=headers)
    # Process response...
    # Random delay between 1-3 seconds
    time.sleep(random.uniform(1, 3))
This mimics human browsing behavior. Hitting a site with 100 requests per second is a surefire way to get blocked.
Respect robots.txt
Check the site's robots.txt file (add /robots.txt to the domain). It tells you what's allowed:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
url = "https://example.com/products"
user_agent = "MyBot"
if rp.can_fetch(user_agent, url):
    # OK to scrape
    response = requests.get(url)
else:
    print("Not allowed to scrape this URL")
Use Sessions for Cookies
Many sites expect cookies. Using a session maintains cookies across requests:
session = requests.Session()
session.headers.update(headers)
# First request sets cookies
response1 = session.get('https://example.com')
# Subsequent requests automatically include cookies
response2 = session.get('https://example.com/page2')
The httpx Alternative
Consider using httpx instead of requests. It has the same API but better performance and native async support:
import httpx
with httpx.Client() as client:
    response = client.get(url, headers=headers, follow_redirects=True)
    print(response.text)
The follow_redirects=True is important—some sites redirect scrapers to different pages.
Async Scraping for Speed
If you need to scrape hundreds or thousands of pages, sequential requests are painfully slow. Async programming lets you fire off multiple requests simultaneously.
Here's the difference in real terms: scraping 100 pages sequentially at 2 seconds each = 200 seconds. With async, the same job might take 10 seconds.
Basic Async with aiohttp
import asyncio
import aiohttp
from bs4 import BeautifulSoup
async def fetch_page(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_multiple(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, url) for url in urls]
        pages = await asyncio.gather(*tasks)
        for page in pages:
            soup = BeautifulSoup(page, 'lxml')
            # Extract data...
# Run it
urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
asyncio.run(scrape_multiple(urls))
What's happening:
- async def creates an async function
- await pauses execution until the response arrives
- asyncio.gather() runs all requests concurrently
- asyncio.run() starts the event loop
Throttling Concurrent Requests
Don't slam a server with 1000 simultaneous requests. Use a semaphore to limit concurrency:
import asyncio
import aiohttp
async def fetch_with_limit(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def scrape_with_throttle(urls, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_limit(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)
# Only 10 requests at a time
urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
results = asyncio.run(scrape_with_throttle(urls))
This limits your scraper to 10 concurrent requests, which is respectful and less likely to trigger rate limits.
The httpx Async Alternative
httpx supports async too and has a cleaner API:
import asyncio
import httpx
async def scrape_async(urls):
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            print(response.status_code)
urls = ['https://example.com/1', 'https://example.com/2']
asyncio.run(scrape_async(urls))
I prefer httpx over aiohttp for most projects. The API is more intuitive, and it handles edge cases better.
Handling Pagination and Multiple Pages
Most sites split data across multiple pages. You need to scrape all of them.
Simple Numbered Pagination
base_url = "https://example.com/products?page={}"
for page_num in range(1, 11):  # Pages 1-10
    url = base_url.format(page_num)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    # Extract data from this page...
    time.sleep(random.uniform(1, 2))
Following "Next" Links
Some sites use "Next" buttons instead of numbered pages:
from urllib.parse import urljoin

current_url = "https://example.com/products"
while current_url:
    response = requests.get(current_url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    # Extract data...
    # Find next page link (urljoin handles relative hrefs)
    next_link = soup.find('a', class_='next')
    current_url = urljoin(current_url, next_link['href']) if next_link else None
    if current_url:
        time.sleep(random.uniform(1, 2))
Infinite Scroll Pages
For sites that load content as you scroll (like Instagram or Twitter), you usually need to:
- Find the API endpoint they're calling
- Replicate those requests with pagination parameters
Check the Network tab in DevTools as you scroll. Look for XHR/Fetch requests with parameters like offset, cursor, or page.
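Once you've found the endpoint, the scraping loop usually looks something like this. The URL, the cursor and limit parameter names, and the next_cursor field below are made up—copy the real ones straight from the Network tab:

import time
import random
import requests

api_url = "https://example.com/api/feed"   # hypothetical endpoint
params = {'limit': 20, 'cursor': None}     # hypothetical parameter names
all_items = []

while True:
    response = requests.get(api_url, params=params, headers=headers)
    data = response.json()
    all_items.extend(data['items'])
    cursor = data.get('next_cursor')        # whatever the site calls it
    if not cursor:
        break
    params['cursor'] = cursor
    time.sleep(random.uniform(1, 2))

print(f"Collected {len(all_items)} items")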
Session Management and Authentication
Some data requires logging in. Here's how to handle authentication in your scraper.
Form-Based Login
import requests
session = requests.Session()
# Step 1: Get the login page (sometimes needed for CSRF tokens)
login_page = session.get('https://example.com/login')
soup = BeautifulSoup(login_page.content, 'lxml')
# Some sites use CSRF tokens; guard in case this one doesn't
csrf_input = soup.find('input', {'name': 'csrf_token'})
csrf_token = csrf_input['value'] if csrf_input else None
# Step 2: Submit login form
login_data = {
'username': 'your_username',
'password': 'your_password',
'csrf_token': csrf_token # If needed
}
response = session.post('https://example.com/login', data=login_data)
# Step 3: Use the session for authenticated requests
protected_page = session.get('https://example.com/protected-data')
The session object maintains cookies, so subsequent requests stay logged in.
API Token Authentication
Many APIs require tokens in headers:
headers = {
'Authorization': 'Bearer YOUR_API_TOKEN',
'User-Agent': 'Mozilla/5.0'
}
response = requests.get('https://api.example.com/data', headers=headers)
data = response.json()
If the API requires OAuth, use the requests-oauthlib library—it handles the token dance for you.
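As a rough sketch, the client-credentials flow with requests-oauthlib looks like this—the token URL and credentials are placeholders, and your API may use a different OAuth flow entirely:

from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session

# Exchange client credentials for an access token
client = BackendApplicationClient(client_id='YOUR_CLIENT_ID')
oauth = OAuth2Session(client=client)
token = oauth.fetch_token(
    token_url='https://api.example.com/oauth/token',  # placeholder URL
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET'
)

# The session now attaches the token to every request
response = oauth.get('https://api.example.com/data')
print(response.json())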
Storing Your Scraped Data
You've got the data. Now what? Let's look at storage options.
CSV Files
For simple tabular data:
import csv
data = [
{'name': 'Product 1', 'price': 19.99},
{'name': 'Product 2', 'price': 29.99}
]
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(data)
JSON Files
For nested or complex data:
import json
data = {
'products': [
{'name': 'Product 1', 'price': 19.99, 'tags': ['electronics', 'new']},
{'name': 'Product 2', 'price': 29.99, 'tags': ['clothing']}
]
}
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
The ensure_ascii=False preserves Unicode characters instead of escaping them.
SQLite Database
For larger projects where you need to query data:
import sqlite3
conn = sqlite3.connect('products.db')
cursor = conn.cursor()
# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY,
name TEXT,
price REAL,
url TEXT UNIQUE
)
''')
# Insert data
products = [
('Product 1', 19.99, 'https://example.com/p1'),
('Product 2', 29.99, 'https://example.com/p2')
]
cursor.executemany(
'INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)',
products
)
conn.commit()
conn.close()
The UNIQUE constraint on URL prevents duplicate entries if you run the scraper multiple times.
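Querying the data back out is where SQLite pays off. For example, to pull everything under a price threshold:

import sqlite3

conn = sqlite3.connect('products.db')
cursor = conn.cursor()
cursor.execute(
    'SELECT name, price FROM products WHERE price < ? ORDER BY price',
    (25.00,)
)
for name, price in cursor.fetchall():
    print(f"{name}: ${price:.2f}")
conn.close()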
Pandas DataFrames
For data analysis:
import pandas as pd
data = {
'name': ['Product 1', 'Product 2'],
'price': [19.99, 29.99]
}
df = pd.DataFrame(data)
# Save to CSV
df.to_csv('products.csv', index=False)
# Save to Excel
df.to_excel('products.xlsx', index=False)
# Basic analysis
print(df.describe())
print(df['price'].mean())
Common Mistakes and How to Avoid Them
Let me save you some headaches.
Mistake 1: Not Handling Errors
Networks fail. Servers return 500 errors. Your code needs to handle this:
import requests
from requests.exceptions import RequestException
import time
def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raises exception for 4xx/5xx
            return response
        except RequestException as e:
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff
Always set a timeout. Without it, your scraper can hang forever waiting for a response.
Mistake 2: Relying on Brittle Selectors
Using overly specific selectors breaks when sites change:
# Bad - relies on deep nesting
soup.find('div').find('div').find('span', class_='price-new')
# Better - target unique identifiers
soup.find('span', class_='price-new')
# Even better - have a fallback
price = (
soup.find('span', class_='price-new') or
soup.find('span', class_='price') or
soup.find('div', {'data-price': True})
)
Mistake 3: Not Checking Response Content
Always verify you got what you expected:
def looks_valid(response):
    # Check status code
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        return False
    # Check content type
    if 'text/html' not in response.headers.get('Content-Type', ''):
        print("Response is not HTML")
        return False
    # Check for common error pages
    if 'Access Denied' in response.text or 'Error 403' in response.text:
        print("Blocked by server")
        return False
    return True

response = requests.get(url)
if looks_valid(response):
    soup = BeautifulSoup(response.content, 'lxml')
    # Extract data...
Mistake 4: Ignoring Encodings
Text extraction can go wrong with weird characters:
# Explicitly handle encoding
response = requests.get(url)
response.encoding = response.apparent_encoding # Let requests detect encoding
soup = BeautifulSoup(response.text, 'lxml')
# When saving to files
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(data)
Mistake 5: Scraping Too Fast
I've said it before, but it's worth repeating: add delays. Getting your IP banned wastes more time than the delays ever could.
Wrapping Up
Web scraping in Python doesn't have to be complicated. Start with Requests and Beautiful Soup for simple sites, move to async with aiohttp or httpx when you need speed, and reach for Playwright only when dealing with heavy JavaScript.
The real skill isn't in the tools—it's in understanding how websites work, finding the cleanest path to your data, and building scrapers that don't get blocked. Check the Network tab before you write code, respect robots.txt, add random delays, and handle errors gracefully.
Now you've got the tools to build scrapers that actually work. Start small, test your code on scraper-friendly sites first, and gradually tackle more complex projects. The data's out there—go get it.