Getting blocked while web scraping is frustrating. One minute you're collecting data smoothly, the next your IP is banned and you're staring at a 403 error page.
The good news? Most blocks are preventable if you understand how anti-bot systems work and know which techniques actually help. I've spent years scraping everything from e-commerce sites to social media platforms, and I've learned that staying undetected isn't about using every trick in the book—it's about using the right combination of techniques for your specific target.
This guide covers 15 practical methods to avoid getting blocked while web scraping, including some lesser-known approaches that can give you an edge. Whether you're scraping a handful of pages or running large-scale operations, these techniques will help you fly under the radar.
Why websites block scrapers in the first place
Before diving into solutions, it's worth understanding why you're getting blocked. Websites don't block scrapers just to be difficult—they have legitimate reasons:
Server load concerns: A poorly configured scraper can hammer a server with hundreds of requests per second, degrading performance for real users. That's basically a DDoS attack, even if unintentional.
Commercial interests: Companies view their data as a competitive asset. If you're scraping product prices or inventory data, they'd rather you didn't.
Terms of service: Many sites explicitly prohibit automated access in their ToS. While violating ToS isn't necessarily illegal, it gives them grounds to block you.
The key takeaway? Websites use increasingly sophisticated methods to detect bots—from simple IP tracking to advanced browser fingerprinting. Your job is to make your scraper look as human as possible.
1. Rotate IP addresses with proxy pools
IP rotation is the foundation of any serious scraping operation. Websites track how many requests come from each IP address, and if you send too many too fast, you'll get banned.
The solution is to distribute your requests across multiple IP addresses using proxies. Here's what you need to know:
Datacenter proxies are cheap (often under $1 per IP) but easier to detect because they come from hosting providers, not residential ISPs. They work fine for many sites but fail against sophisticated anti-bot systems.
Residential proxies route traffic through real user devices, making them much harder to detect. They're more expensive but essential for scraping sites with strong protections like Amazon or LinkedIn.
IP rotation frequency matters. Some scrapers rotate IPs after every request, while others use the same IP for several requests before switching. The right approach depends on your target—experiment to find what works.
Here's a simple Python example using a proxy rotation service:
import httpx
import random

proxies = [
    "http://residential.roundproxies.com:31299",
    "http://residential.roundproxies.com:31299",
    "http://residential.roundproxies.com:31299",
]

def scrape_with_rotation(url):
    proxy = random.choice(proxies)
    response = httpx.get(url, proxy=proxy)
    return response.text
Pro tip: Monitor which proxies get blocked and remove them from your pool. Some providers offer automatic proxy health checks and rotation.
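A minimal sketch of that idea, assuming a simple in-memory pool where proxies that keep returning block responses get retired (the failure threshold and the 403/429 check are illustrative, not tied to any particular provider):

import random
import httpx

class ProxyPool:
    """Track failures per proxy and drop the ones that keep getting blocked."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def get(self):
        healthy = [p for p, f in self.failures.items() if f < self.max_failures]
        if not healthy:
            raise RuntimeError("No healthy proxies left in the pool")
        return random.choice(healthy)

    def mark_blocked(self, proxy):
        self.failures[proxy] += 1

def fetch(pool, url):
    proxy = pool.get()
    response = httpx.get(url, proxy=proxy)
    # Treat 403/429 as a sign this proxy is burned for the target site
    if response.status_code in (403, 429):
        pool.mark_blocked(proxy)
    return response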
2. Randomize user agent strings
Every HTTP request includes a User-Agent header that identifies your browser and operating system. If your scraper sends thousands of requests with the same User-Agent, it's an obvious red flag.
The fix is simple: rotate through a list of realistic User-Agent strings. Don't just pick one at random—make sure it's consistent with other headers you're sending.
import httpx
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def scrape_with_ua_rotation(url):
    headers = {"User-Agent": random.choice(user_agents)}
    response = httpx.get(url, headers=headers)
    return response.text
Common mistake: Using outdated User-Agent strings or ones that don't match your platform. If you claim to be Chrome on Windows but your other headers suggest Mac, anti-bot systems will catch the inconsistency.
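One way to keep things consistent is to rotate complete header profiles rather than the User-Agent alone. A sketch, where the sec-ch-ua-platform and Accept-Language values are assumptions about what a matching Chrome client would plausibly send:

import random
import httpx

# Each profile bundles a User-Agent with headers that plausibly match it,
# so the platform claimed by the UA agrees with the client hint headers.
header_profiles = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "sec-ch-ua-platform": '"Windows"',
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "sec-ch-ua-platform": '"macOS"',
        "Accept-Language": "en-US,en;q=0.9",
    },
]

def scrape_with_profile(url):
    headers = random.choice(header_profiles)
    response = httpx.get(url, headers=headers)
    return response.text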
3. Add random delays between requests
Humans don't browse at robot speed. If you're making 10 requests per second with perfect consistency, you're screaming "I'm a bot!"
Introduce random delays between requests to mimic human behavior:
import time
import random
import httpx

def scrape_with_delays(urls):
    results = []
    for url in urls:
        # Random delay between 2-5 seconds
        time.sleep(random.uniform(2, 5))
        response = httpx.get(url)
        results.append(response.text)
    return results
How long should you wait? It depends on the site. For news sites with high traffic, 1-3 seconds might be fine. For smaller sites or accounts-based platforms, 5-10 seconds is safer. Monitor the site's response times and adjust accordingly.
Advanced approach: Instead of fixed delays, use exponential backoff when you detect rate limiting. Start with short delays, and if you get a 429 error (Too Many Requests), increase the delay exponentially before retrying.
4. Respect robots.txt (most of the time)
The robots.txt file tells crawlers which parts of a site they're allowed to access. While it's not legally binding, respecting it shows good faith and reduces your chances of getting blocked.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent="*"):
    # robots.txt lives at the site root, not under the page path
    root = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

# Check before scraping
if can_scrape("https://example.com/products"):
    # Proceed with scraping
    pass
That said, robots.txt is a guideline, not a law. If you have a legitimate reason to access disallowed content (research, archiving, competitive analysis), use your judgment. Just be extra careful about rate limiting and stealth when accessing restricted areas.
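robots.txt can also tell you how fast to crawl. Python's standard-library parser exposes any Crawl-delay and Request-rate directives, which you can feed straight into your delay logic. A small sketch:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def polite_delay(url, user_agent="*", default=2.0):
    """Return the delay (in seconds) the site asks for, or a default."""
    root = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
    parser.read()
    delay = parser.crawl_delay(user_agent)
    if delay:
        return float(delay)
    rate = parser.request_rate(user_agent)  # e.g. 10 requests per 60 seconds
    if rate:
        return rate.seconds / rate.requests
    return default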
5. Avoid honeypot traps
Honeypots are invisible links or elements designed to catch bots. They're styled with CSS to be invisible to humans (using display: none, visibility: hidden, or off-screen positioning) but appear in the HTML that scrapers parse.
If your scraper follows these links, the site knows you're a bot and can fingerprint your behavior.
How to avoid honeypots:
- Parse the CSS along with the HTML to identify hidden elements
- Skip links that match common honeypot patterns
- Test your scraper manually first to understand the site's structure
from bs4 import BeautifulSoup

def is_honeypot(element):
    # Normalize the style string so "display: none" and "display:none" both match
    style = element.get('style', '').replace(' ', '').lower()
    css_class = element.get('class') or []
    # Check for common honeypot indicators
    if 'display:none' in style or 'visibility:hidden' in style:
        return True
    if 'hidden' in css_class or 'trap' in css_class:
        return True
    return False

def scrape_safe_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', href=True)
    return [link['href'] for link in links if not is_honeypot(link)]
6. Reverse engineer the API instead of scraping HTML
Here's a technique most scrapers overlook: instead of parsing HTML, find the underlying API that serves the data.
Modern websites are often single-page applications that fetch data via XHR/Fetch requests to JSON APIs. These APIs are cleaner, faster, and less likely to trigger anti-bot systems than full browser automation.
How to find hidden APIs:
- Open Chrome DevTools and go to the Network tab
- Filter by XHR or Fetch requests
- Interact with the site (scroll, search, filter)
- Look for requests to endpoints containing /api/, json, graphql, or similar patterns
- Examine the request and response structure
import httpx

# Instead of scraping HTML like this:
# response = httpx.get("https://example.com/products")
# soup = BeautifulSoup(response.text, 'html.parser')

# Use the discovered API directly:
api_response = httpx.get("https://api.example.com/v1/products", params={
    "page": 1,
    "limit": 50
})
data = api_response.json()
products = data['products']
Real example: Many e-commerce sites load product data via API calls. Instead of rendering the full page and parsing HTML, you can call the API directly, bypass JavaScript rendering entirely, and get clean JSON data. It's faster, more reliable, and much harder to detect.
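To flesh that out, here's roughly what walking such a paginated API looks like. The endpoint, parameter names, and response shape below are assumptions for illustration:

import httpx

def fetch_all_products(base_url="https://api.example.com/v1/products", limit=50):
    """Walk a paginated JSON API until it stops returning items."""
    products = []
    page = 1
    while True:
        response = httpx.get(base_url, params={"page": page, "limit": limit})
        response.raise_for_status()
        batch = response.json().get("products", [])
        if not batch:
            break
        products.extend(batch)
        page += 1
    return products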
7. Use headless browsers with stealth plugins
For JavaScript-heavy sites that require browser rendering, headless browsers like Puppeteer or Playwright are essential. But out of the box, they're easy to detect because they set navigator.webdriver = true and have other telltale properties.
The solution is stealth plugins that patch these detection vectors:
// Using Puppeteer with puppeteer-extra and stealth plugin
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const content = await page.content();
  await browser.close();
})();
For Python users with Playwright:
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)  # Apply stealth patches
    page.goto('https://example.com')
    content = page.content()
    browser.close()
These plugins automatically:
- Remove the webdriver property
- Patch canvas fingerprinting
- Spoof Chrome runtime properties
- Fix WebGL and audio context leaks
Limitation: Even stealth plugins don't guarantee invisibility. Advanced fingerprinting systems like Cloudflare or DataDome can still detect automation through timing analysis, mouse movement patterns, and dozens of other signals.
8. Handle CAPTCHAs strategically
CAPTCHAs are designed to block bots, but they're not insurmountable. Here are your options:
Option 1: Avoid triggering them by:
- Slowing down your requests
- Using residential proxies
- Maintaining consistent headers
- Acting more human-like
Option 2: Solve them programmatically using:
- OCR for simple image CAPTCHAs (rarely works anymore)
- Audio CAPTCHA alternatives (slightly easier to automate)
- CAPTCHA-solving services (costs money but works)
Option 3: Use real browsers with human solvers. Some scraping operations pause when they hit a CAPTCHA and alert a human operator to solve it manually. Not scalable, but works for small operations.
The reality: If a site uses reCAPTCHA v3 or hCaptcha, you're fighting an uphill battle. These systems analyze your entire browsing session, not just the CAPTCHA interaction. Focus on not triggering them in the first place.
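If you want your scraper to at least notice when it has been challenged, one rough heuristic is to look for well-known CAPTCHA markers in the response and back off rather than keep hammering the page. A sketch (the marker strings are a heuristic, not an exhaustive list):

import time
import httpx

CAPTCHA_MARKERS = ("g-recaptcha", "hcaptcha", "cf-challenge", "captcha")

def looks_like_captcha(html):
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

def fetch_or_back_off(url, cooldown=300):
    response = httpx.get(url)
    if looks_like_captcha(response.text):
        # Challenged: stop hitting this site for a while instead of burning the IP
        print(f"CAPTCHA detected at {url}, cooling down for {cooldown}s")
        time.sleep(cooldown)
        return None
    return response.text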
9. Rotate browser fingerprints
Browser fingerprinting collects dozens of attributes—screen resolution, installed fonts, WebGL renderer, canvas fingerprint, audio context, timezone, language, and more—to create a unique identifier for your browser.
Even if you rotate IPs and User-Agents, if your fingerprint stays the same, you can be tracked.
Basic approach: Rotate viewport sizes and timezones to create variation.
import random
from playwright.sync_api import sync_playwright

viewports = [
    {"width": 1920, "height": 1080},
    {"width": 1366, "height": 768},
    {"width": 1536, "height": 864},
]
timezones = ["America/New_York", "Europe/London", "Europe/Berlin"]

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Pick a random viewport/timezone combination for this session
    page = browser.new_page(viewport=random.choice(viewports),
                            timezone_id=random.choice(timezones))
    # Scrape with this fingerprint
    browser.close()
Advanced approach: Use anti-detect browsers like Multilogin or GoLogin that rotate complete browser fingerprints including canvas, WebGL, fonts, and audio properties. These are commercial tools designed specifically for multi-accounting and scraping.
DIY option: Manually inject JavaScript to spoof canvas and WebGL (shown here with Puppeteer's evaluateOnNewDocument):
// Inject canvas noise to vary fingerprint
await page.evaluateOnNewDocument(() => {
  const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
  HTMLCanvasElement.prototype.toDataURL = function(type) {
    // Add random noise
    const context = this.getContext('2d');
    const imageData = context.getImageData(0, 0, this.width, this.height);
    for (let i = 0; i < imageData.data.length; i += 4) {
      imageData.data[i] += Math.random() * 10 - 5;
    }
    context.putImageData(imageData, 0, 0);
    return originalToDataURL.apply(this, arguments);
  };
});
10. Mimic human behavior patterns
Advanced anti-bot systems analyze behavioral signals like mouse movements, scroll patterns, and typing cadence. If you're using headless automation, add realistic human-like behavior:
// Random mouse movements
await page.mouse.move(
  Math.random() * 1000,
  Math.random() * 800
);

// Realistic scrolling
await page.evaluate(() => {
  window.scrollBy({
    top: 300 + Math.random() * 100,
    behavior: 'smooth'
  });
});

// Pause to "read" content
await page.waitForTimeout(2000 + Math.random() * 3000);
Going further: Record actual user sessions and replay those interaction patterns in your scraper. Some companies build machine learning models trained on real user behavior to make their bots indistinguishable from humans.
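A stripped-down version of the replay idea, assuming you've already captured a list of cursor positions and pauses from a real session (the events.json format here is made up for illustration):

import json
import random
from playwright.sync_api import sync_playwright

def replay_session(page, events_path):
    # Each event is assumed to look like {"x": 412, "y": 300, "pause_ms": 180}
    with open(events_path) as f:
        events = json.load(f)
    for event in events:
        # Jitter the recorded coordinates slightly so replays aren't identical
        page.mouse.move(event["x"] + random.uniform(-3, 3),
                        event["y"] + random.uniform(-3, 3))
        page.wait_for_timeout(event["pause_ms"] * random.uniform(0.8, 1.2))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    replay_session(page, "events.json")
    browser.close()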
11. Session management and cookie handling
Many sites require maintaining a session to access content. If you don't handle cookies properly, each request looks like it's from a different user, which is suspicious.
import httpx
# Create a client that persists cookies
client = httpx.Client()
# First request establishes session
response = client.get("https://example.com")
# Subsequent requests reuse cookies
products = client.get("https://example.com/products")
details = client.get("https://example.com/products/123")
client.close()
For headless browsers, cookies are handled automatically, but you can save and reuse them:
const fs = require('fs');

// Save cookies after login (Playwright for Node)
const cookies = await page.context().cookies();
fs.writeFileSync('cookies.json', JSON.stringify(cookies));

// Restore cookies later
const savedCookies = JSON.parse(fs.readFileSync('cookies.json'));
await page.context().addCookies(savedCookies);
Pro tip: Some sites embed session tokens in local storage or in JavaScript variables. Use browser DevTools to find where these tokens are stored and extract them for API requests.
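For example, with Playwright you can read local storage after logging in and reuse whatever token you find for direct API calls. A sketch that assumes the site stores its session token under an access_token key (the key name and API endpoint are hypothetical):

import httpx
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/login")
    # ... perform the login steps here ...

    # Pull the token the front-end stashed in local storage
    token = page.evaluate("() => window.localStorage.getItem('access_token')")
    browser.close()

# Reuse the token for lightweight API requests instead of driving the browser
response = httpx.get(
    "https://example.com/api/v1/orders",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {token}"},
)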
12. Scrape cached versions when possible
For non-time-sensitive data, scraping Google's cached version or Internet Archive snapshots can bypass anti-bot protections entirely.
Google Cache (note: Google has deprecated its cache feature, so the webcache.googleusercontent.com URL below may no longer return results for many pages):
import httpx

def scrape_cached(url):
    cache_url = f"https://webcache.googleusercontent.com/search?q=cache:{url}"
    response = httpx.get(cache_url)
    return response.text
Internet Archive's Wayback Machine:
import httpx

def scrape_archive(url):
    api_url = f"http://archive.org/wayback/available?url={url}"
    response = httpx.get(api_url)
    data = response.json()
    if 'archived_snapshots' in data and data['archived_snapshots']:
        snapshot_url = data['archived_snapshots']['closest']['url']
        snapshot = httpx.get(snapshot_url)
        return snapshot.text
    return None
Limitation: Cached data isn't current, so this only works if you don't need real-time information.
13. Implement exponential backoff for rate limiting
When you hit rate limits (429 errors or temporarily blocked), don't just retry immediately. Use exponential backoff to gradually increase wait times:
import httpx
import time

def scrape_with_backoff(url, max_retries=5):
    retry_count = 0
    base_delay = 1
    while retry_count < max_retries:
        try:
            response = httpx.get(url, timeout=30)
            if response.status_code == 200:
                return response.text
            elif response.status_code == 429:
                # Rate limited - wait and retry
                wait_time = base_delay * (2 ** retry_count)
                print(f"Rate limited. Waiting {wait_time}s before retry...")
                time.sleep(wait_time)
                retry_count += 1
            else:
                print(f"Error {response.status_code}")
                return None
        except Exception as e:
            print(f"Request failed: {e}")
            retry_count += 1
            time.sleep(base_delay * (2 ** retry_count))
    return None
This approach respects the server's capacity while giving you multiple chances to succeed.
14. Use multiple scraping strategies in parallel
Don't put all your eggs in one basket. Run different scraping approaches simultaneously and use whichever works best:
Strategy A: Direct API calls with rotating IPs
Strategy B: Headless browser with stealth plugins
Strategy C: Cloud browser automation service
from concurrent.futures import ThreadPoolExecutor

def scrape_method_a(url):
    # Fast, API-based approach
    pass

def scrape_method_b(url):
    # Browser-based fallback
    pass

def scrape_parallel(urls):
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = dict(zip(urls, executor.map(scrape_method_a, urls)))
        # If Method A failed, try Method B for those URLs
        failed_urls = [url for url, result in results.items() if result is None]
        if failed_urls:
            results.update(zip(failed_urls, executor.map(scrape_method_b, failed_urls)))
    return [results[url] for url in urls]
Real-world tip: Start with the fastest, cheapest method (HTTP requests to APIs). Fall back to slower methods (headless browsers) only when necessary. This optimizes both speed and cost.
15. Monitor and adapt continuously
Web scraping is a cat-and-mouse game. Sites update their anti-bot systems, and your scraper needs to adapt. Build monitoring into your setup:
import logging
import httpx

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_with_monitoring(url):
    try:
        response = httpx.get(url)
        # Log success metrics
        logger.info(f"Success: {url} - Status: {response.status_code}")
        # Check for soft blocks (200 but wrong content)
        if "access denied" in response.text.lower():
            logger.warning(f"Soft block detected at {url}")
        return response.text
    except Exception as e:
        logger.error(f"Failed to scrape {url}: {e}")
        # Alert your team or switch strategies
        return None
What to monitor:
- Success/failure rates
- Response times (sudden slowdowns indicate throttling)
- Content changes (detecting soft blocks)
- Proxy health
- Cost per successfully scraped page
Set up alerts for sudden drops in success rates so you can investigate and adapt quickly.
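A bare-bones way to do that is to track a rolling success rate and flag it when it dips. A sketch, with the window size and threshold as arbitrary choices:

from collections import deque
import logging

logger = logging.getLogger(__name__)

class SuccessRateMonitor:
    """Keep the outcome of the last N requests and warn when success drops."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success):
        self.outcomes.append(bool(success))
        rate = sum(self.outcomes) / len(self.outcomes)
        # Only alert once the window is full, so early noise doesn't trigger it
        if len(self.outcomes) == self.outcomes.maxlen and rate < self.threshold:
            logger.warning(f"Success rate dropped to {rate:.0%}: time to investigate")
        return rate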
Putting it all together
These 15 methods aren't meant to be used all at once. The right combination depends on your target, scale, and budget. Here's a suggested approach:
For simple sites (blogs, news): Use methods 1-3 (IP rotation, User-Agent rotation, delays). Skip the expensive stuff.
For medium complexity (e-commerce without heavy bot protection): Add methods 4-7 (robots.txt, honeypot avoidance, API reverse engineering, basic headless browsers).
For hardened targets (sites with Cloudflare, DataDome, or reCAPTCHA): You'll need methods 8-15, including browser fingerprinting, behavioral mimicry, and commercial proxy/browser solutions.
Start simple, escalate as needed. Don't overcomplicate your first attempt—many sites can be scraped with just careful rate limiting and IP rotation. Add complexity only when you're actually getting blocked.
The most important takeaway? Web scraping without getting blocked is about respecting the site's resources, mimicking human behavior, and continuously adapting. No single technique is a silver bullet, but combining several thoughtfully will keep your scrapers running smoothly.
Related reading:
- How to build a rotating proxy pool from scratch
- Selenium vs. Playwright: Which is better for scraping?
- Legal considerations for web scraping in 2026
This article was originally published in October 2026.