Web scraping lets you automatically extract data from websites instead of copying and pasting like it's 1999. Python makes this easy with libraries like Requests, Beautiful Soup, and Selenium, but there's more to building a solid scraper than just firing off HTTP requests and hoping for the best.
In this guide, I'll walk you through the practical side of web scraping in Python—from basic techniques to performance optimization and anti-detection tricks that actually work. You'll learn how to handle JavaScript-heavy sites, avoid getting blocked, and scale your scrapers to handle thousands of pages without breaking a sweat.
Why Python for Web Scraping?
Python dominates the web scraping world for good reason. The syntax is clean enough that you can focus on solving problems instead of fighting the language. Plus, the ecosystem is packed with libraries built specifically for scraping.
But here's what nobody tells you: Python isn't the fastest language. For small to medium projects, this doesn't matter. For massive scrapers that need to hit thousands of pages per second, you might eventually look at Go or Node.js. That said, Python's async capabilities and the ability to distribute work across processes mean you can scale pretty far before hitting that wall.
Another advantage? The community. When you get stuck (and you will), there's probably a Stack Overflow answer or GitHub issue waiting for you.
Setting Up Your Environment
First things first—you need Python 3.8 or newer. I'm using 3.11 for this guide, but anything 3.8+ will work fine.
Create a virtual environment to keep your dependencies isolated:
python -m venv scraper-env
source scraper-env/bin/activate # On Windows: scraper-env\Scripts\activate
Now install the core libraries:
pip install requests beautifulsoup4 lxml httpx aiohttp
Here's what each does:
- requests: Makes HTTP requests (the standard, works everywhere)
- beautifulsoup4: Parses HTML and extracts data
- lxml: Fast HTML parser that Beautiful Soup can use
- httpx: Modern alternative to requests with async support
- aiohttp: For async HTTP requests at scale
You might also want playwright or selenium for JavaScript-heavy sites, but we'll get to that later.
Basic Web Scraping with Requests and Beautiful Soup
Let's start with a simple example. We'll scrape quotes from http://quotes.toscrape.com—a practice site that's scraper-friendly.
import requests
from bs4 import BeautifulSoup
# Make the HTTP request
url = "http://quotes.toscrape.com"
response = requests.get(url)
# Parse the HTML
soup = BeautifulSoup(response.content, 'lxml')
# Find all quote containers
quotes = soup.find_all('div', class_='quote')
# Extract the data
for quote in quotes:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    print(f"{text}\n— {author}\n")
This is your bread-and-butter scraper. You make a request, parse the HTML, find the elements you want, and extract the text.
What's happening here:
- requests.get() fetches the HTML
- BeautifulSoup() turns that HTML into a searchable tree
- find_all() grabs all matching elements
- get_text() extracts the actual text content
The 'lxml' parser is faster than the default html.parser, especially for larger pages. Always specify a parser—it makes your scraper more reliable.
Inspecting Websites: Finding the Data You Need
Before you write any code, you need to know where the data lives in the HTML. Open your browser's DevTools (F12 or right-click → Inspect) and start poking around.
Here's my process:
- Find one example of the data you want on the page
- Right-click it and select "Inspect" to jump to that element in the HTML
- Look for patterns—usually, multiple items share the same class or structure
- Check if the data loads dynamically (more on this later)
For example, on that quotes site, all quotes are wrapped in <div class="quote"> elements. Inside each, the quote text is in <span class="text"> and the author is in <small class="author">.
Pro tip: Use soup.select() with CSS selectors for cleaner code:
# Instead of this:
quotes = soup.find_all('div', class_='quote')
# You can write this:
quotes = soup.select('div.quote')
CSS selectors are usually shorter and match exactly what you see in DevTools.
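To make this concrete, here's the quotes scraper from earlier rewritten with CSS selectors. This is a minimal sketch based on the quotes.toscrape.com markup described above (the a.tag selector for the tag links is what that practice site uses):

import requests
from bs4 import BeautifulSoup

response = requests.get("http://quotes.toscrape.com")
soup = BeautifulSoup(response.content, 'lxml')

# select() takes the same selectors you'd test in the DevTools console
for quote in soup.select('div.quote'):
    text = quote.select_one('span.text').get_text()
    author = quote.select_one('small.author').get_text()
    tags = [tag.get_text() for tag in quote.select('a.tag')]
    print(f"{text} — {author} ({', '.join(tags)})")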
Handling Different Content Types
Not all data lives in nice, clean HTML tags. Sometimes you need to extract from different formats.
Tables
HTML tables are annoying to parse manually. Beautiful Soup makes it easier, but here's a trick: pandas can read tables directly:
import pandas as pd
url = "https://example.com/data-table"
tables = pd.read_html(url)
# If there are multiple tables, pick the one you want
df = tables[0]
print(df.head())
This returns a DataFrame you can work with immediately. Way faster than manually parsing <tr> and <td> tags.
JSON in HTML
Many modern sites embed data as JSON in <script> tags. This is actually easier to work with than HTML:
import json
import re
html = response.text
# Find JSON data in script tags
json_data = re.search(r'var products = ({.*?});', html, re.DOTALL)
if json_data:
    products = json.loads(json_data.group(1))
    print(products)
You're looking for patterns like var data = {...} or window.__INITIAL_STATE__ = {...}. The JSON is usually cleaner and more complete than what's rendered in the HTML.
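When the data lives in something like window.__INITIAL_STATE__, I usually search the script tags with Beautiful Soup instead of regexing the whole page. Here's a rough sketch—it assumes the matching script contains only that one assignment ending in a semicolon, which you should verify for your target site:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')
state = None
for script in soup.find_all('script'):
    content = script.string or ''
    if 'window.__INITIAL_STATE__' in content:
        # Take everything after the first '=' and drop the trailing semicolon
        raw = content.split('=', 1)[1].strip().rstrip(';')
        state = json.loads(raw)
        break

if state:
    print(list(state.keys()))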
Images and Files
To download images or PDFs:
import requests
from pathlib import Path
def download_file(url, save_path):
    response = requests.get(url, stream=True)
    response.raise_for_status()
    with open(save_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
# Example
img_url = "https://example.com/image.jpg"
download_file(img_url, "image.jpg")
The stream=True parameter prevents loading huge files into memory all at once.
Scraping JavaScript-Heavy Sites
Here's where things get interesting. Many modern sites load content with JavaScript, meaning the HTML you get from requests is empty or incomplete.
Quick test: If you view the page source (Ctrl+U) and can't find your data, it's being loaded with JavaScript.
You have two options:
Option 1: Find the API (The Smart Way)
Most JavaScript-heavy sites load data from an API. Open DevTools → Network tab → XHR/Fetch, then refresh the page. You'll see API requests that return JSON data—often easier to work with than HTML.
import requests
# Instead of scraping the HTML
api_url = "https://api.example.com/products?page=1"
headers = {
'User-Agent': 'Mozilla/5.0',
'Accept': 'application/json'
}
response = requests.get(api_url, headers=headers)
data = response.json()
for item in data['products']:
    print(item['name'], item['price'])
This is faster and cleaner than using a headless browser. The API usually returns exactly the data you need without all the HTML markup.
Option 2: Use Playwright (The Heavy Way)
If there's no accessible API, you need a real browser. Playwright is better than Selenium for most use cases—it's faster, more reliable, and has a better API.
pip install playwright
playwright install # Downloads browser binaries
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')
    page.wait_for_selector('.product-card')  # Wait for content to load
    # Get the rendered HTML
    html = page.content()
    # Or extract data directly
    products = page.query_selector_all('.product-card')
    for product in products:
        name = product.query_selector('.name').inner_text()
        price = product.query_selector('.price').inner_text()
        print(f"{name}: {price}")
    browser.close()
Playwright handles all the JavaScript execution, waiting for elements to load, and even scrolling if needed. The downside? It's slower and uses more resources than simple HTTP requests.
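If the page only reveals more content as you scroll, you can drive that from Playwright too. Here's a minimal sketch—the URL, the .feed-item selector, and the five-scroll count are placeholders you'd tune for the real page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/feed')
    for _ in range(5):  # Scroll a few screens' worth of content
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1000)  # Give lazy-loaded items time to arrive
    items = page.query_selector_all('.feed-item')
    print(f"Loaded {len(items)} items")
    browser.close()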
Avoiding Detection Without Paid Tools
Websites block scrapers for a reason—you're using their bandwidth and computing resources. But if you're respectful and follow best practices, you can avoid most blocks without paying for proxy services.
Set Proper Headers
At minimum, always set a User-Agent:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1'
}
response = requests.get(url, headers=headers)
The default requests User-Agent screams "I'm a bot!" Adding realistic headers makes you look like a regular browser.
Rotate User Agents
Don't use the same User-Agent for every request:
import random
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) Firefox/121.0',
]
headers = {
'User-Agent': random.choice(user_agents)
}
Add Random Delays
Never scrape at a constant rate—vary your request timing:
import time
import random
for url in urls:
    response = requests.get(url, headers=headers)
    # Process response...
    # Random delay between 1-3 seconds
    time.sleep(random.uniform(1, 3))
This mimics human browsing behavior. Hitting a site with 100 requests per second is a surefire way to get blocked.
Respect robots.txt
Check the site's robots.txt file (add /robots.txt to the domain). It tells you what's allowed:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
url = "https://example.com/products"
user_agent = "MyBot"
if rp.can_fetch(user_agent, url):
    # OK to scrape
    response = requests.get(url)
else:
    print("Not allowed to scrape this URL")
Use Sessions for Cookies
Many sites expect cookies. Using a session maintains cookies across requests:
session = requests.Session()
session.headers.update(headers)
# First request sets cookies
response1 = session.get('https://example.com')
# Subsequent requests automatically include cookies
response2 = session.get('https://example.com/page2')
The httpx Alternative
Consider using httpx instead of requests. It has the same API but better performance and native async support:
import httpx
with httpx.Client() as client:
    response = client.get(url, headers=headers, follow_redirects=True)
    print(response.text)
The follow_redirects=True is important—some sites redirect scrapers to different pages.
Async Scraping for Speed
If you need to scrape hundreds or thousands of pages, sequential requests are painfully slow. Async programming lets you fire off multiple requests simultaneously.
Here's the difference in real terms: scraping 100 pages sequentially at 2 seconds each = 200 seconds. With async, the same job might take 10 seconds.
Basic Async with aiohttp
import asyncio
import aiohttp
from bs4 import BeautifulSoup
async def fetch_page(session, url):
    async with session.get(url) as response:
        return await response.text()

async def scrape_multiple(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_page(session, url) for url in urls]
        pages = await asyncio.gather(*tasks)
        for page in pages:
            soup = BeautifulSoup(page, 'lxml')
            # Extract data...
# Run it
urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
asyncio.run(scrape_multiple(urls))
What's happening:
- async def creates an async function
- await pauses execution until the response arrives
- asyncio.gather() runs all requests concurrently
- asyncio.run() starts the event loop
Throttling Concurrent Requests
Don't slam a server with 1000 simultaneous requests. Use a semaphore to limit concurrency:
import asyncio
import aiohttp
async def fetch_with_limit(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def scrape_with_throttle(urls, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_limit(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)
# Only 10 requests at a time
urls = [f'https://example.com/page/{i}' for i in range(1, 101)]
results = asyncio.run(scrape_with_throttle(urls))
This limits your scraper to 10 concurrent requests, which is respectful and less likely to trigger rate limits.
The httpx Async Alternative
httpx supports async too and has a cleaner API:
import asyncio
import httpx
async def scrape_async(urls):
    async with httpx.AsyncClient() as client:
        tasks = [client.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            print(response.status_code)
urls = ['https://example.com/1', 'https://example.com/2']
asyncio.run(scrape_async(urls))
I prefer httpx over aiohttp for most projects. The API is more intuitive, and it handles edge cases better.
Handling Pagination and Multiple Pages
Most sites split data across multiple pages. You need to scrape all of them.
Simple Numbered Pagination
base_url = "https://example.com/products?page={}"
for page_num in range(1, 11):  # Pages 1-10
    url = base_url.format(page_num)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    # Extract data from this page...
    time.sleep(random.uniform(1, 2))
Following "Next" Links
Some sites use "Next" buttons instead of numbered pages:
from urllib.parse import urljoin

current_url = "https://example.com/products"
while current_url:
    response = requests.get(current_url, headers=headers)
    soup = BeautifulSoup(response.content, 'lxml')
    # Extract data...
    # Find next page link (urljoin handles relative hrefs)
    next_link = soup.find('a', class_='next')
    current_url = urljoin(current_url, next_link['href']) if next_link else None
    if current_url:
        time.sleep(random.uniform(1, 2))
Infinite Scroll Pages
For sites that load content as you scroll (like Instagram or Twitter), you usually need to:
- Find the API endpoint they're calling
- Replicate those requests with pagination parameters
Check the Network tab in DevTools as you scroll. Look for XHR/Fetch requests with parameters like offset, cursor, or page.
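Once you've found the endpoint, the scraping loop usually looks something like this. The URL, the cursor and limit parameter names, and the next_cursor field below are made up—copy the real ones straight from the Network tab:

import time
import random
import requests

api_url = "https://example.com/api/feed"   # hypothetical endpoint
params = {'limit': 20, 'cursor': None}     # hypothetical parameter names
all_items = []

while True:
    response = requests.get(api_url, params=params, headers=headers)
    data = response.json()
    all_items.extend(data['items'])
    cursor = data.get('next_cursor')        # whatever the site calls it
    if not cursor:
        break
    params['cursor'] = cursor
    time.sleep(random.uniform(1, 2))

print(f"Collected {len(all_items)} items")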
Session Management and Authentication
Some data requires logging in. Here's how to handle authentication in your scraper.
Form-Based Login
import requests
session = requests.Session()
# Step 1: Get the login page (sometimes needed for CSRF tokens)
login_page = session.get('https://example.com/login')
soup = BeautifulSoup(login_page.content, 'lxml')
# Some sites use CSRF tokens; guard in case this one doesn't
csrf_input = soup.find('input', {'name': 'csrf_token'})
csrf_token = csrf_input['value'] if csrf_input else None
# Step 2: Submit login form
login_data = {
'username': 'your_username',
'password': 'your_password',
'csrf_token': csrf_token # If needed
}
response = session.post('https://example.com/login', data=login_data)
# Step 3: Use the session for authenticated requests
protected_page = session.get('https://example.com/protected-data')
The session object maintains cookies, so subsequent requests stay logged in.
API Token Authentication
Many APIs require tokens in headers:
headers = {
'Authorization': 'Bearer YOUR_API_TOKEN',
'User-Agent': 'Mozilla/5.0'
}
response = requests.get('https://api.example.com/data', headers=headers)
data = response.json()
If the API requires OAuth, use the requests-oauthlib library—it handles the token dance for you.
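As a rough sketch, the client-credentials flow with requests-oauthlib looks like this—the token URL and credentials are placeholders, and your API may use a different OAuth flow entirely:

from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session

# Exchange client credentials for an access token
client = BackendApplicationClient(client_id='YOUR_CLIENT_ID')
oauth = OAuth2Session(client=client)
token = oauth.fetch_token(
    token_url='https://api.example.com/oauth/token',  # placeholder URL
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET'
)

# The session now attaches the token to every request
response = oauth.get('https://api.example.com/data')
print(response.json())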
Storing Your Scraped Data
You've got the data. Now what? Let's look at storage options.
CSV Files
For simple tabular data:
import csv
data = [
{'name': 'Product 1', 'price': 19.99},
{'name': 'Product 2', 'price': 29.99}
]
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(data)
JSON Files
For nested or complex data:
import json
data = {
'products': [
{'name': 'Product 1', 'price': 19.99, 'tags': ['electronics', 'new']},
{'name': 'Product 2', 'price': 29.99, 'tags': ['clothing']}
]
}
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, ensure_ascii=False)
The ensure_ascii=False preserves Unicode characters instead of escaping them.
SQLite Database
For larger projects where you need to query data:
import sqlite3
conn = sqlite3.connect('products.db')
cursor = conn.cursor()
# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY,
name TEXT,
price REAL,
url TEXT UNIQUE
)
''')
# Insert data
products = [
('Product 1', 19.99, 'https://example.com/p1'),
('Product 2', 29.99, 'https://example.com/p2')
]
cursor.executemany(
'INSERT OR IGNORE INTO products (name, price, url) VALUES (?, ?, ?)',
products
)
conn.commit()
conn.close()
The UNIQUE constraint on URL prevents duplicate entries if you run the scraper multiple times.
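Querying the data back out is where SQLite pays off. For example, to pull everything under a price threshold:

import sqlite3

conn = sqlite3.connect('products.db')
cursor = conn.cursor()
cursor.execute(
    'SELECT name, price FROM products WHERE price < ? ORDER BY price',
    (25.00,)
)
for name, price in cursor.fetchall():
    print(f"{name}: ${price:.2f}")
conn.close()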
Pandas DataFrames
For data analysis:
import pandas as pd
data = {
'name': ['Product 1', 'Product 2'],
'price': [19.99, 29.99]
}
df = pd.DataFrame(data)
# Save to CSV
df.to_csv('products.csv', index=False)
# Save to Excel
df.to_excel('products.xlsx', index=False)
# Basic analysis
print(df.describe())
print(df['price'].mean())
Common Mistakes and How to Avoid Them
Let me save you some headaches.
Mistake 1: Not Handling Errors
Networks fail. Servers return 500 errors. Your code needs to handle this:
import requests
from requests.exceptions import RequestException
import time
def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raises exception for 4xx/5xx
            return response
        except RequestException as e:
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff
Always set a timeout. Without it, your scraper can hang forever waiting for a response.
Mistake 2: Relying on Brittle Selectors
Using overly specific selectors breaks when sites change:
# Bad - relies on deep nesting
soup.find('div').find('div').find('span', class_='price-new')
# Better - target unique identifiers
soup.find('span', class_='price-new')
# Even better - have a fallback
price = (
soup.find('span', class_='price-new') or
soup.find('span', class_='price') or
soup.find('div', {'data-price': True})
)
Mistake 3: Not Checking Response Content
Always verify you got what you expected:
def looks_valid(response):
    # Check status code
    if response.status_code != 200:
        print(f"Error: {response.status_code}")
        return False
    # Check content type
    if 'text/html' not in response.headers.get('Content-Type', ''):
        print("Response is not HTML")
        return False
    # Check for common error pages
    if 'Access Denied' in response.text or 'Error 403' in response.text:
        print("Blocked by server")
        return False
    return True

response = requests.get(url)
if looks_valid(response):
    soup = BeautifulSoup(response.content, 'lxml')
    # Extract data...
Mistake 4: Ignoring Encodings
Text extraction can go wrong with weird characters:
# Explicitly handle encoding
response = requests.get(url)
response.encoding = response.apparent_encoding # Let requests detect encoding
soup = BeautifulSoup(response.text, 'lxml')
# When saving to files
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(data)
Mistake 5: Scraping Too Fast
I've said it before, but it's worth repeating: add delays. Getting your IP banned wastes more time than the delays ever could.
Wrapping Up
Web scraping in Python doesn't have to be complicated. Start with Requests and Beautiful Soup for simple sites, move to async with aiohttp or httpx when you need speed, and reach for Playwright only when dealing with heavy JavaScript.
The real skill isn't in the tools—it's in understanding how websites work, finding the cleanest path to your data, and building scrapers that don't get blocked. Check the Network tab before you write code, respect robots.txt, add random delays, and handle errors gracefully.
Now you've got the tools to build scrapers that actually work. Start small, test your code on scraper-friendly sites first, and gradually tackle more complex projects. The data's out there—go get it.