Getting blocked while web scraping is frustrating. One minute you're collecting data smoothly, the next your IP is banned and you're staring at a 403 error page.
The good news? Most blocks are preventable if you understand how anti-bot systems work and know which techniques actually help. I've spent years scraping everything from e-commerce sites to social media platforms, and I've learned that staying undetected isn't about using every trick in the book—it's about using the right combination of techniques for your specific target.
This guide covers 15 practical methods to avoid getting blocked while web scraping, including some lesser-known approaches that can give you an edge. Whether you're scraping a handful of pages or running large-scale operations, these techniques will help you fly under the radar.
Why websites block scrapers in the first place
Before diving into solutions, it's worth understanding why you're getting blocked. Websites don't block scrapers just to be difficult—they have legitimate reasons:
Server load concerns: A poorly configured scraper can hammer a server with hundreds of requests per second, degrading performance for real users. That's basically a DDoS attack, even if unintentional.
Commercial interests: Companies view their data as a competitive asset. If you're scraping product prices or inventory data, they'd rather you didn't.
Terms of service: Many sites explicitly prohibit automated access in their ToS. While violating ToS isn't necessarily illegal, it gives them grounds to block you.
The key takeaway? Websites use increasingly sophisticated methods to detect bots—from simple IP tracking to advanced browser fingerprinting. Your job is to make your scraper look as human as possible.
1. Rotate IP addresses with proxy pools
IP rotation is the foundation of any serious scraping operation. Websites track how many requests come from each IP address, and if you send too many too fast, you'll get banned.
The solution is to distribute your requests across multiple IP addresses using proxies. Here's what you need to know:
Datacenter proxies are cheap (often under $1 per IP) but easier to detect because they come from hosting providers, not residential ISPs. They work fine for many sites but fail against sophisticated anti-bot systems.
Residential proxies route traffic through real user devices, making them much harder to detect. They're more expensive but essential for scraping sites with strong protections like Amazon or LinkedIn.
IP rotation frequency matters. Some scrapers rotate IPs after every request, while others use the same IP for several requests before switching. The right approach depends on your target—experiment to find what works.
Here's a simple Python example using a proxy rotation service:
import httpx
import random

proxies = [
    "http://residential.roundproxies.com:31299",
    "http://residential.roundproxies.com:31299",
    "http://residential.roundproxies.com:31299",
]

def scrape_with_rotation(url):
    proxy = random.choice(proxies)
    response = httpx.get(url, proxy=proxy)
    return response.text
Pro tip: Monitor which proxies get blocked and remove them from your pool. Some providers offer automatic proxy health checks and rotation.
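A minimal sketch of that idea, assuming a simple in-memory pool where proxies that keep returning block responses get retired (the failure threshold and the 403/429 check are illustrative, not tied to any particular provider):

import random
import httpx

class ProxyPool:
    """Track failures per proxy and drop the ones that keep getting blocked."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def get(self):
        healthy = [p for p, f in self.failures.items() if f < self.max_failures]
        if not healthy:
            raise RuntimeError("No healthy proxies left in the pool")
        return random.choice(healthy)

    def mark_blocked(self, proxy):
        self.failures[proxy] += 1

def fetch(pool, url):
    proxy = pool.get()
    response = httpx.get(url, proxy=proxy)
    # Treat 403/429 as a sign this proxy is burned for the target site
    if response.status_code in (403, 429):
        pool.mark_blocked(proxy)
    return response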
2. Randomize user agent strings
Every HTTP request includes a User-Agent header that identifies your browser and operating system. If your scraper sends thousands of requests with the same User-Agent, it's an obvious red flag.
The fix is simple: rotate through a list of realistic User-Agent strings. Don't just pick one at random—make sure it's consistent with other headers you're sending.
import httpx
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def scrape_with_ua_rotation(url):
    headers = {"User-Agent": random.choice(user_agents)}
    response = httpx.get(url, headers=headers)
    return response.text
Common mistake: Using outdated User-Agent strings or ones that don't match your platform. If you claim to be Chrome on Windows but your other headers suggest Mac, anti-bot systems will catch the inconsistency.
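One way to keep things consistent is to rotate complete header profiles rather than the User-Agent alone. A sketch, where the sec-ch-ua-platform and Accept-Language values are assumptions about what a matching Chrome client would plausibly send:

import random
import httpx

# Each profile bundles a User-Agent with headers that plausibly match it,
# so the platform claimed by the UA agrees with the client hint headers.
header_profiles = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "sec-ch-ua-platform": '"Windows"',
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "sec-ch-ua-platform": '"macOS"',
        "Accept-Language": "en-US,en;q=0.9",
    },
]

def scrape_with_profile(url):
    headers = random.choice(header_profiles)
    response = httpx.get(url, headers=headers)
    return response.text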
3. Add random delays between requests
Humans don't browse at robot speed. If you're making 10 requests per second with perfect consistency, you're screaming "I'm a bot!"
Introduce random delays between requests to mimic human behavior:
import time
import random
import httpx

def scrape_with_delays(urls):
    results = []
    for url in urls:
        # Random delay between 2-5 seconds
        time.sleep(random.uniform(2, 5))
        response = httpx.get(url)
        results.append(response.text)
    return results
How long should you wait? It depends on the site. For news sites with high traffic, 1-3 seconds might be fine. For smaller sites or accounts-based platforms, 5-10 seconds is safer. Monitor the site's response times and adjust accordingly.
Advanced approach: Instead of fixed delays, use exponential backoff when you detect rate limiting. Start with short delays, and if you get a 429 error (Too Many Requests), increase the delay exponentially before retrying.
4. Respect robots.txt (most of the time)
The robots.txt file tells crawlers which parts of a site they're allowed to access. While it's not legally binding, respecting it shows good faith and reduces your chances of getting blocked.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url, user_agent="*"):
    # robots.txt lives at the site root, not under the page path
    root = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

# Check before scraping
if can_scrape("https://example.com/products"):
    # Proceed with scraping
    pass
That said, robots.txt is a guideline, not a law. If you have a legitimate reason to access disallowed content (research, archiving, competitive analysis), use your judgment. Just be extra careful about rate limiting and stealth when accessing restricted areas.
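robots.txt can also tell you how fast to crawl. Python's standard-library parser exposes any Crawl-delay and Request-rate directives, which you can feed straight into your delay logic. A small sketch:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def polite_delay(url, user_agent="*", default=2.0):
    """Return the delay (in seconds) the site asks for, or a default."""
    root = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
    parser.read()
    delay = parser.crawl_delay(user_agent)
    if delay:
        return float(delay)
    rate = parser.request_rate(user_agent)  # e.g. 10 requests per 60 seconds
    if rate:
        return rate.seconds / rate.requests
    return default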
5. Avoid honeypot traps
Honeypots are invisible links or elements designed to catch bots. They're styled with CSS to be invisible to humans (using display: none, visibility: hidden, or off-screen positioning) but appear in the HTML that scrapers parse.
If your scraper follows these links, the site knows you're a bot and can fingerprint your behavior.
How to avoid honeypots:
- Parse the CSS along with the HTML to identify hidden elements
- Skip links that match common honeypot patterns
- Test your scraper manually first to understand the site's structure
from bs4 import BeautifulSoup

def is_honeypot(element):
    # Normalize the style string so "display: none" and "display:none" both match
    style = element.get('style', '').replace(' ', '').lower()
    css_class = element.get('class') or []
    # Check for common honeypot indicators
    if 'display:none' in style or 'visibility:hidden' in style:
        return True
    if 'hidden' in css_class or 'trap' in css_class:
        return True
    return False

def scrape_safe_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', href=True)
    return [link['href'] for link in links if not is_honeypot(link)]
6. Reverse engineer the API instead of scraping HTML
Here's a technique most scrapers overlook: instead of parsing HTML, find the underlying API that serves the data.
Modern websites are often single-page applications that fetch data via XHR/Fetch requests to JSON APIs. These APIs are cleaner, faster, and less likely to trigger anti-bot systems than full browser automation.
How to find hidden APIs:
- Open Chrome DevTools and go to the Network tab
- Filter by XHR or Fetch requests
- Interact with the site (scroll, search, filter)
- Look for requests to endpoints containing /api/, json, graphql, or similar patterns
- Examine the request and response structure
import httpx

# Instead of scraping HTML like this:
# response = httpx.get("https://example.com/products")
# soup = BeautifulSoup(response.text, 'html.parser')

# Use the discovered API directly:
api_response = httpx.get("https://api.example.com/v1/products", params={
    "page": 1,
    "limit": 50
})
data = api_response.json()
products = data['products']
Real example: Many e-commerce sites load product data via API calls. Instead of rendering the full page and parsing HTML, you can call the API directly, bypass JavaScript rendering entirely, and get clean JSON data. It's faster, more reliable, and much harder to detect.
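To flesh that out, here's roughly what walking such a paginated API looks like. The endpoint, parameter names, and response shape below are assumptions for illustration:

import httpx

def fetch_all_products(base_url="https://api.example.com/v1/products", limit=50):
    """Walk a paginated JSON API until it stops returning items."""
    products = []
    page = 1
    while True:
        response = httpx.get(base_url, params={"page": page, "limit": limit})
        response.raise_for_status()
        batch = response.json().get("products", [])
        if not batch:
            break
        products.extend(batch)
        page += 1
    return products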
7. Use headless browsers with stealth plugins
For JavaScript-heavy sites that require browser rendering, headless browsers like Puppeteer or Playwright are essential. But out of the box, they're easy to detect because they set navigator.webdriver = true and have other telltale properties.
The solution is stealth plugins that patch these detection vectors:
// Using Puppeteer with puppeteer-extra and stealth plugin
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const content = await page.content();
  await browser.close();
})();
For Python users with Playwright:
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)  # Apply stealth patches
    page.goto('https://example.com')
    content = page.content()
    browser.close()
These plugins automatically:
- Remove the webdriver property
- Patch canvas fingerprinting
- Spoof Chrome runtime properties
- Fix WebGL and audio context leaks
Limitation: Even stealth plugins don't guarantee invisibility. Advanced fingerprinting systems like Cloudflare or DataDome can still detect automation through timing analysis, mouse movement patterns, and dozens of other signals.
8. Handle CAPTCHAs strategically
CAPTCHAs are designed to block bots, but they're not insurmountable. Here are your options:
Option 1: Avoid triggering them by:
- Slowing down your requests
- Using residential proxies
- Maintaining consistent headers
- Acting more human-like
Option 2: Solve them programmatically using:
- OCR for simple image CAPTCHAs (rarely works anymore)
- Audio CAPTCHA alternatives (slightly easier to automate)
- CAPTCHA-solving services (costs money but works)
Option 3: Use real browsers with human solvers. Some scraping operations pause when they hit a CAPTCHA and alert a human operator to solve it manually. Not scalable, but works for small operations.
The reality: If a site uses reCAPTCHA v3 or hCaptcha, you're fighting an uphill battle. These systems analyze your entire browsing session, not just the CAPTCHA interaction. Focus on not triggering them in the first place.
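If you want your scraper to at least notice when it has been challenged, one rough heuristic is to look for well-known CAPTCHA markers in the response and back off rather than keep hammering the page. A sketch (the marker strings are a heuristic, not an exhaustive list):

import time
import httpx

CAPTCHA_MARKERS = ("g-recaptcha", "hcaptcha", "cf-challenge", "captcha")

def looks_like_captcha(html):
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

def fetch_or_back_off(url, cooldown=300):
    response = httpx.get(url)
    if looks_like_captcha(response.text):
        # Challenged: stop hitting this site for a while instead of burning the IP
        print(f"CAPTCHA detected at {url}, cooling down for {cooldown}s")
        time.sleep(cooldown)
        return None
    return response.text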
9. Rotate browser fingerprints
Browser fingerprinting collects dozens of attributes—screen resolution, installed fonts, WebGL renderer, canvas fingerprint, audio context, timezone, language, and more—to create a unique identifier for your browser.
Even if you rotate IPs and User-Agents, if your fingerprint stays the same, you can be tracked.
Basic approach: Rotate viewport sizes and timezones to create variation.
import random
from playwright.sync_api import sync_playwright

viewports = [
    {"width": 1920, "height": 1080},
    {"width": 1366, "height": 768},
    {"width": 1536, "height": 864},
]
timezones = ["America/New_York", "Europe/London", "Europe/Berlin"]

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Pick a random viewport/timezone combination for this session
    page = browser.new_page(viewport=random.choice(viewports),
                            timezone_id=random.choice(timezones))
    # Scrape with this fingerprint
    browser.close()
Advanced approach: Use anti-detect browsers like Multilogin or GoLogin that rotate complete browser fingerprints including canvas, WebGL, fonts, and audio properties. These are commercial tools designed specifically for multi-accounting and scraping.
DIY option: Manually inject JavaScript to spoof canvas and WebGL (shown here with Puppeteer's evaluateOnNewDocument):
// Inject canvas noise to vary fingerprint
await page.evaluateOnNewDocument(() => {
  const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
  HTMLCanvasElement.prototype.toDataURL = function(type) {
    // Add random noise
    const context = this.getContext('2d');
    const imageData = context.getImageData(0, 0, this.width, this.height);
    for (let i = 0; i < imageData.data.length; i += 4) {
      imageData.data[i] += Math.random() * 10 - 5;
    }
    context.putImageData(imageData, 0, 0);
    return originalToDataURL.apply(this, arguments);
  };
});
10. Mimic human behavior patterns
Advanced anti-bot systems analyze behavioral signals like mouse movements, scroll patterns, and typing cadence. If you're using headless automation, add realistic human-like behavior:
// Random mouse movements
await page.mouse.move(
  Math.random() * 1000,
  Math.random() * 800
);

// Realistic scrolling
await page.evaluate(() => {
  window.scrollBy({
    top: 300 + Math.random() * 100,
    behavior: 'smooth'
  });
});

// Pause to "read" content
await page.waitForTimeout(2000 + Math.random() * 3000);
Going further: Record actual user sessions and replay those interaction patterns in your scraper. Some companies build machine learning models trained on real user behavior to make their bots indistinguishable from humans.
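A stripped-down version of the replay idea, assuming you've already captured a list of cursor positions and pauses from a real session (the events.json format here is made up for illustration):

import json
import random
from playwright.sync_api import sync_playwright

def replay_session(page, events_path):
    # Each event is assumed to look like {"x": 412, "y": 300, "pause_ms": 180}
    with open(events_path) as f:
        events = json.load(f)
    for event in events:
        # Jitter the recorded coordinates slightly so replays aren't identical
        page.mouse.move(event["x"] + random.uniform(-3, 3),
                        event["y"] + random.uniform(-3, 3))
        page.wait_for_timeout(event["pause_ms"] * random.uniform(0.8, 1.2))

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    replay_session(page, "events.json")
    browser.close()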
11. Session management and cookie handling
Many sites require maintaining a session to access content. If you don't handle cookies properly, each request looks like it's from a different user, which is suspicious.
import httpx
# Create a client that persists cookies
client = httpx.Client()
# First request establishes session
response = client.get("https://example.com")
# Subsequent requests reuse cookies
products = client.get("https://example.com/products")
details = client.get("https://example.com/products/123")
client.close()
For headless browsers, cookies are handled automatically, but you can save and reuse them:
const fs = require('fs');

// Save cookies after login (Playwright for Node)
const cookies = await page.context().cookies();
fs.writeFileSync('cookies.json', JSON.stringify(cookies));

// Restore cookies later
const savedCookies = JSON.parse(fs.readFileSync('cookies.json'));
await page.context().addCookies(savedCookies);
Pro tip: Some sites embed session tokens in local storage or in JavaScript variables. Use browser DevTools to find where these tokens are stored and extract them for API requests.
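For example, with Playwright you can read local storage after logging in and reuse whatever token you find for direct API calls. A sketch that assumes the site stores its session token under an access_token key (the key name and API endpoint are hypothetical):

import httpx
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/login")
    # ... perform the login steps here ...

    # Pull the token the front-end stashed in local storage
    token = page.evaluate("() => window.localStorage.getItem('access_token')")
    browser.close()

# Reuse the token for lightweight API requests instead of driving the browser
response = httpx.get(
    "https://example.com/api/v1/orders",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {token}"},
)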
12. Scrape cached versions when possible
For non-time-sensitive data, scraping Google's cached version or Internet Archive snapshots can bypass anti-bot protections entirely.
Google Cache (note: Google has deprecated its cache feature, so the webcache.googleusercontent.com URL below may no longer return results for many pages):
import httpx

def scrape_cached(url):
    cache_url = f"https://webcache.googleusercontent.com/search?q=cache:{url}"
    response = httpx.get(cache_url)
    return response.text
Internet Archive's Wayback Machine:
import httpx

def scrape_archive(url):
    api_url = f"http://archive.org/wayback/available?url={url}"
    response = httpx.get(api_url)
    data = response.json()
    if 'archived_snapshots' in data and data['archived_snapshots']:
        snapshot_url = data['archived_snapshots']['closest']['url']
        snapshot = httpx.get(snapshot_url)
        return snapshot.text
    return None
Limitation: Cached data isn't current, so this only works if you don't need real-time information.
13. Implement exponential backoff for rate limiting
When you hit rate limits (429 errors or temporarily blocked), don't just retry immediately. Use exponential backoff to gradually increase wait times:
import httpx
import time

def scrape_with_backoff(url, max_retries=5):
    retry_count = 0
    base_delay = 1
    while retry_count < max_retries:
        try:
            response = httpx.get(url, timeout=30)
            if response.status_code == 200:
                return response.text
            elif response.status_code == 429:
                # Rate limited - wait and retry
                wait_time = base_delay * (2 ** retry_count)
                print(f"Rate limited. Waiting {wait_time}s before retry...")
                time.sleep(wait_time)
                retry_count += 1
            else:
                print(f"Error {response.status_code}")
                return None
        except Exception as e:
            print(f"Request failed: {e}")
            retry_count += 1
            time.sleep(base_delay * (2 ** retry_count))
    return None
This approach respects the server's capacity while giving you multiple chances to succeed.
14. Use multiple scraping strategies in parallel
Don't put all your eggs in one basket. Run different scraping approaches simultaneously and use whichever works best:
Strategy A: Direct API calls with rotating IPs
Strategy B: Headless browser with stealth plugins
Strategy C: Cloud browser automation service
from concurrent.futures import ThreadPoolExecutor

def scrape_method_a(url):
    # Fast, API-based approach
    pass

def scrape_method_b(url):
    # Browser-based fallback
    pass

def scrape_parallel(urls):
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = dict(zip(urls, executor.map(scrape_method_a, urls)))
        # If Method A failed, try Method B for those URLs
        failed_urls = [url for url, result in results.items() if result is None]
        if failed_urls:
            results.update(zip(failed_urls, executor.map(scrape_method_b, failed_urls)))
    return [results[url] for url in urls]
Real-world tip: Start with the fastest, cheapest method (HTTP requests to APIs). Fall back to slower methods (headless browsers) only when necessary. This optimizes both speed and cost.
15. Monitor and adapt continuously
Web scraping is a cat-and-mouse game. Sites update their anti-bot systems, and your scraper needs to adapt. Build monitoring into your setup:
import logging
import httpx

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scrape_with_monitoring(url):
    try:
        response = httpx.get(url)
        # Log success metrics
        logger.info(f"Success: {url} - Status: {response.status_code}")
        # Check for soft blocks (200 but wrong content)
        if "access denied" in response.text.lower():
            logger.warning(f"Soft block detected at {url}")
        return response.text
    except Exception as e:
        logger.error(f"Failed to scrape {url}: {e}")
        # Alert your team or switch strategies
        return None
What to monitor:
- Success/failure rates
- Response times (sudden slowdowns indicate throttling)
- Content changes (detecting soft blocks)
- Proxy health
- Cost per successfully scraped page
Set up alerts for sudden drops in success rates so you can investigate and adapt quickly.
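A bare-bones way to do that is to track a rolling success rate and flag it when it dips. A sketch, with the window size and threshold as arbitrary choices:

from collections import deque
import logging

logger = logging.getLogger(__name__)

class SuccessRateMonitor:
    """Keep the outcome of the last N requests and warn when success drops."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, success):
        self.outcomes.append(bool(success))
        rate = sum(self.outcomes) / len(self.outcomes)
        # Only alert once the window is full, so early noise doesn't trigger it
        if len(self.outcomes) == self.outcomes.maxlen and rate < self.threshold:
            logger.warning(f"Success rate dropped to {rate:.0%}: time to investigate")
        return rate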
Putting it all together
These 15 methods aren't meant to be used all at once. The right combination depends on your target, scale, and budget. Here's a suggested approach:
For simple sites (blogs, news): Use methods 1-3 (IP rotation, User-Agent rotation, delays). Skip the expensive stuff.
For medium complexity (e-commerce without heavy bot protection): Add methods 4-7 (robots.txt, honeypot avoidance, API reverse engineering, basic headless browsers).
For hardened targets (sites with Cloudflare, DataDome, or reCAPTCHA): You'll need methods 8-15, including browser fingerprinting, behavioral mimicry, and commercial proxy/browser solutions.
Start simple, escalate as needed. Don't overcomplicate your first attempt—many sites can be scraped with just careful rate limiting and IP rotation. Add complexity only when you're actually getting blocked.
The most important takeaway? Web scraping without getting blocked is about respecting the site's resources, mimicking human behavior, and continuously adapting. No single technique is a silver bullet, but combining several thoughtfully will keep your scrapers running smoothly.
Related reading:
- How to build a rotating proxy pool from scratch
- Selenium vs. Playwright: Which is better for scraping?
- Legal considerations for web scraping in 2026
This article was originally published in October 2026.