Everyone wants to talk about the code. The libraries. The frameworks.

Nobody wants to talk about what actually makes scrapers work.

I've been building production scrapers for six years. I've extracted billions of data points from sites that supposedly "can't be scraped." And I'm going to tell you something uncomfortable: the difference between scrapers that work and scrapers that fail has almost nothing to do with your technical skills.

It's about how good you are at deception.

The lie we tell ourselves

Browse any scraping tutorial and you'll see the same advice regurgitated endlessly. Use rotating proxies. Set a User-Agent header. Add random delays.

All technically correct. All completely insufficient.

I watched a junior developer follow every "best practice" guide on the internet. His scraper worked perfectly in testing. The moment he pointed it at a real target with actual bot detection? Blocked within 47 requests.

Here's what those guides don't tell you: modern anti-bot systems don't just look at what you send. They look at how you send it. They look at when you send it. They look at patterns you didn't even know existed.

And they're getting smarter every month.

The fingerprint problem nobody wants to explain

Your browser is a snitch.

Every time you visit a website, your browser broadcasts hundreds of signals. Screen resolution. Installed fonts. WebGL renderer. Audio context hash. Canvas fingerprint. Timezone. Language preferences.

Combined, these create a fingerprint more unique than your actual fingerprint.

When you fire up Selenium or Puppeteer with default settings, you might as well be wearing a sign that says "I AM A BOT." The navigator.webdriver flag is set to true. Run headless and your User-Agent literally says "HeadlessChrome." Your plugin list is empty. Your viewport is some weird default size no real human browses at.

Anti-bot systems don't even need to look hard. You're practically screaming at them.
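Papering over the loudest of those tells takes only a few lines. Here's a minimal sketch with stock Selenium, using its standard Chrome options and a CDP call; consider it first aid, not a disguise:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# Stop Chrome from advertising that it's automation-controlled.
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)

# Overwrite navigator.webdriver before any page script gets a chance to read it.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)

driver.get("https://example.com")
print(driver.execute_script("return navigator.webdriver"))  # prints None, not True
```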

The dirty truth? Working scrapers don't just fake a User-Agent string. They fake everything. Canvas rendering. WebGL signatures. Audio fingerprints. They build coherent device profiles where every single attribute matches what you'd see on a real Windows 10 machine running Chrome 131 in Central Time.

If your fingerprint says you're on Windows but your font list contains macOS system fonts, you're done. If you claim to be Chrome but your TLS handshake looks like Python Requests, you're done. If your browser claims a 1920x1080 screen but reports a window larger than that screen, you're done.

Consistency is everything. And consistency requires lying comprehensively.

TLS fingerprinting: the silent killer

Here's something that broke my brain when I first learned about it.

Before your scraper even sends an HTTP request, before any of your headers are transmitted, before your User-Agent string goes anywhere—you've already been fingerprinted.

It happens during the TLS handshake. When your client connects over HTTPS, it announces its supported cipher suites, TLS version, and extensions. Those parameters get condensed into a fingerprint called JA3 (or its successor, JA4).

Python Requests has its own JA3 fingerprint. So does every other HTTP library. Real Chrome has a different one.

When a site sees a JA3 fingerprint that says "Python" but headers that say "Chrome," the jig is up. You've been identified before you even said hello.

Bypassing this requires tools that can impersonate Chrome's TLS stack at the network level. Libraries like curl_cffi exist specifically for this. But most developers don't even know this layer of detection exists.

They're sending requests and getting blocked, and they have no idea why. They rotate IPs. They change User-Agents. Nothing works.

Because the fingerprint that matters happened before any of that.
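This is the layer curl_cffi was built for. A minimal sketch; the exact impersonation targets depend on the version you install:

```python
# pip install curl_cffi
from curl_cffi import requests

# A plain Python client announces itself in the ClientHello before any header is sent.
# curl_cffi replays a real browser's TLS handshake instead.
resp = requests.get(
    "https://example.com",
    impersonate="chrome",  # or a pinned target like "chrome120", depending on your version
)

print(resp.status_code)
```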

The behavioral layer where scrapers die

Let's say you've solved fingerprinting. Your browser looks identical to a real one at every technical level. You're still probably getting blocked.

Why?

Because you don't act human.

Real humans don't request 50 pages in 10 seconds. They don't navigate in perfectly sequential order. They pause to read content. They scroll at variable speeds. They move their mouse while thinking. They sometimes scroll past what they're looking for, then scroll back up.

Your scraper does none of this.

Modern behavioral analysis systems track scroll velocity, mouse movement patterns, click timing, keyboard cadence, and viewport focus events. They build statistical models of "normal" behavior and flag anything that deviates.

A scraper that waits a random 1-3 seconds between requests still fails this test. Real humans don't have uniformly distributed random delays. Their timing follows different patterns—quick bursts of activity, then pauses. The delay after reading a short product title is different from the delay after reading a long product description.

The practitioners who succeed actually simulate behavioral noise. They inject random mouse movements. They scroll to elements before clicking them. They add micro-pauses that mimic the rhythm of human attention.
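Here's a rough sketch of what that looks like with Selenium. The URL, the selector, and the numbers are placeholders you'd tune per site; what matters is the shape of the timing, not the exact values:

```python
import random
import time

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

def human_pause():
    # Log-normal, not uniform: mostly short pauses, occasionally a long one.
    time.sleep(min(random.lognormvariate(0.2, 0.7), 12))

driver = webdriver.Chrome()
driver.get("https://example.com/products")  # placeholder URL

# Scroll down in uneven bursts, sometimes drifting back up like a distracted reader.
for _ in range(random.randint(4, 8)):
    driver.execute_script("window.scrollBy(0, arguments[0]);", random.randint(250, 700))
    if random.random() < 0.2:
        driver.execute_script("window.scrollBy(0, arguments[0]);", -random.randint(80, 200))
    human_pause()

# Wander the cursor near the target before clicking it.
card = driver.find_element(By.CSS_SELECTOR, ".product-card")  # hypothetical selector
driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", card)
human_pause()

actions = ActionChains(driver).move_to_element(card)
for _ in range(random.randint(2, 4)):
    actions.move_by_offset(random.randint(5, 30), random.randint(-10, 10))
    actions.pause(random.uniform(0.1, 0.4))
actions.perform()

human_pause()
card.click()
```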

It feels absurd. You're writing code that pretends to be distracted.

But it works.

The proxy lie

Everyone tells you to use proxies. Rotate IPs. Don't send too many requests from one address.

What they don't tell you: the type of proxy matters more than the rotation strategy.

Datacenter proxies are cheap. They're also largely useless against serious protection. Anti-bot services maintain databases of known datacenter IP ranges. Your fancy proxy rotation means nothing when every IP in your pool is pre-flagged as suspicious.

Residential proxies come from real ISPs, assigned to real homes. They're expensive—sometimes 10x the cost of datacenter alternatives. But they work because they look like regular people browsing from regular homes.

Mobile proxies are even harder to detect. They come from cellular networks where thousands of users share the same IP through CGNAT. Blocking a mobile IP means potentially blocking legitimate customers.
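Wiring one in is the easy part. The sketch below uses a placeholder gateway URL; the host, port, and credential format all come from your provider, and most residential services encode session stickiness into the username:

```python
from curl_cffi import requests

# Placeholder gateway. Everything about this string is provider-specific.
PROXY = "http://USERNAME:PASSWORD@residential-gateway.example:8000"

resp = requests.get(
    "https://example.com/catalog",  # placeholder target
    impersonate="chrome",
    proxies={"http": PROXY, "https": PROXY},
    timeout=30,
)
print(resp.status_code)
```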

The dirty secret of the proxy game? The money you spend on infrastructure directly correlates with your success rate. Sites with heavy protection are essentially pay-to-play.

And the math often works out. If you're scraping data worth $10,000/month to your business, spending $2,000/month on residential proxies isn't a cost—it's a requirement.

The honeypot test

Smart site owners don't just try to detect bots. They try to trap them.

Honeypots are invisible links planted in the HTML. Hidden via CSS—display: none or visibility: hidden or simply colored to match the background. A human visitor never sees them. A scraper blindly following every link in the DOM? Walked right into it.

The moment your scraper visits a honeypot URL, you're marked. Your fingerprint is flagged. Your IP goes on a watchlist. And often, that watchlist is shared across anti-bot services used by thousands of sites.

You didn't just get blocked from one site. You got blocked from an ecosystem.

Working scrapers check element visibility before following links. They verify that a link has actual dimensions, isn't hidden by CSS, isn't marked rel="nofollow", and isn't positioned off-screen.
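A first pass looks something like this in Selenium. It won't catch everything, links colored to match the background need a computed-style check on top, but it covers the common traps:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def looks_like_a_real_link(link):
    """Rough honeypot filter for a Selenium WebElement."""
    # display: none, visibility: hidden, and similar CSS hiding.
    if not link.is_displayed():
        return False
    # Zero-size or off-screen elements that still technically render.
    rect = link.rect
    if rect["width"] == 0 or rect["height"] == 0:
        return False
    if rect["x"] < 0 or rect["y"] < 0:
        return False
    # Links the site explicitly marks as not-for-crawlers.
    rel = (link.get_attribute("rel") or "").lower()
    if "nofollow" in rel:
        return False
    return True

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder target

safe_links = [
    a.get_attribute("href")
    for a in driver.find_elements(By.TAG_NAME, "a")
    if looks_like_a_real_link(a)
]
```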

The extra code adds complexity. But skipping it is how you get burned in ways that take weeks to diagnose.

The rate limiting you don't see

Obvious rate limiting returns 429 errors or blocks your IP after too many requests.

Subtle rate limiting does something worse: it feeds you bad data.

Some sites detect scraper-like behavior and don't block you. Instead, they serve different content. Prices are wrong. Product availability is incorrect. Reviews are shuffled or omitted.

You keep scraping, happily collecting data, not realizing it's garbage.

I've seen scrapers run for months before someone noticed the data didn't match what real users saw. Months of bad business decisions based on poisoned information.

This is why verification matters. Compare your scraped data against manual spot-checks. Use different IP ranges and compare results. If the data differs significantly, you're probably being fed a shadow version of the site.
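The check itself doesn't have to be elaborate. A sketch of the idea, with made-up SKUs and prices; the reference values come from a human checking the live site, ideally on a different network than the scraper uses:

```python
def spot_check(scraped: dict, reference: dict, tolerance: float = 0.0) -> list:
    """Return item IDs whose scraped price disagrees with a hand-verified reference."""
    mismatches = []
    for item_id, true_price in reference.items():
        got = scraped.get(item_id)
        if got is None or abs(got - true_price) > tolerance:
            mismatches.append(item_id)
    return mismatches

# Checked by hand in a normal browser, on a normal connection.
reference = {"sku-1001": 19.99, "sku-1002": 4.50, "sku-1003": 129.00}

# What the scraper brought back this run.
scraped = {"sku-1001": 19.99, "sku-1002": 4.50, "sku-1003": 149.00}

bad = spot_check(scraped, reference)
if len(bad) / len(reference) > 0.05:
    print(f"possible shadow content, {len(bad)} of {len(reference)} spot-checks failed: {bad}")
```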

What actually works in 2026

After everything I've described, here's what a working scraper stack actually looks like:

Residential or mobile proxies. Not because rotation is magic, but because IP reputation matters. Datacenter IPs are dead weight against real protection.

Stealth browsers. Tools like Camoufox, nodriver, or SeleniumBase's UC Mode. They handle fingerprint spoofing at multiple layers—JavaScript APIs, browser attributes, even WebGL rendering. Out of the box, they pass tests that vanilla Puppeteer fails instantly.

TLS impersonation. Libraries like curl_cffi that mimic real browser TLS signatures. Without this, you're detectable before your request is even processed.

Behavioral noise. Random delays aren't enough. You need variable scroll patterns, mouse movement injection, realistic viewport focus. The goal is to look bored, distracted, and human.

Element visibility checks. Never follow a link without verifying it's actually visible to users. Same for form fields and buttons.

Data validation. Assume you're being fed garbage until proven otherwise. Build verification into your pipeline.

Failure recovery. Exponential backoff on errors. Automatic proxy rotation when you hit blocks. Session persistence where appropriate.
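That last piece is the most mechanical of the bunch. A sketch of the retry loop, reusing a placeholder proxy pool; curl_cffi's exception classes vary by version, so this catches failures broadly:

```python
import random
import time

from curl_cffi import requests

# Placeholder pool; in practice these come from your proxy provider.
PROXY_POOL = [
    "http://USERNAME:PASSWORD@residential-gateway.example:8000",
    "http://USERNAME:PASSWORD@residential-gateway.example:8001",
]

def fetch(url: str, max_attempts: int = 5):
    """Fetch with exponential backoff, rotating to a fresh proxy whenever we look blocked."""
    proxy = random.choice(PROXY_POOL)
    for attempt in range(max_attempts):
        try:
            resp = requests.get(
                url,
                impersonate="chrome",
                proxies={"http": proxy, "https": proxy},
                timeout=30,
            )
            if resp.status_code == 200:
                return resp
            if resp.status_code in (403, 429):
                # Blocked or rate limited: burn the identity, not the whole run.
                proxy = random.choice(PROXY_POOL)
        except Exception:
            # Network-level failure; curl_cffi's specific exception types vary by version.
            proxy = random.choice(PROXY_POOL)
        # 1s, 2s, 4s, 8s... plus jitter so parallel workers don't retry in lockstep.
        time.sleep(2 ** attempt + random.uniform(0, 1))
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```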

It's a lot. It's also table stakes.

The uncomfortable conclusion

The scraping industry runs on a fundamental tension: sites don't want you there, but they also can't block everyone without collateral damage.

The scrapers that work exploit this tension. They invest enough in deception that blocking them would risk blocking legitimate users. They spend money on infrastructure that makes their traffic indistinguishable from organic visitors.

This isn't about being clever. It's about being thorough.

Most scrapers fail because their creators treat it as a coding problem. Write the script, extract the data, move on. But working scrapers are engineering projects. They require investment in infrastructure, ongoing maintenance as detection evolves, and a realistic budget for the proxies that make success possible.

The dirty truth is that scraping at scale is expensive. The dirtier truth is that companies doing it successfully don't talk about their methods. You don't publish a detailed breakdown of how you circumvent LinkedIn's bot detection—not if you want to keep doing it.

So the tutorials stay shallow. The guides recycle the same insufficient advice. And developers keep building scrapers that work fine in testing and fail immediately in production.

Now you know why.