Rust brings something different to web scraping: raw speed combined with memory safety. While Python dominates the scraping world with its simplicity, Rust compiles to native code and handles concurrency without the Global Interpreter Lock getting in your way.
That performance comes with a trade-off. Rust's ownership system and lifetime annotations can feel foreign at first. But once you understand the patterns, you'll find yourself building scrapers that process thousands of pages without breaking a sweat.
In this guide, you'll learn how to build production-grade web scrapers in Rust. We'll start with the basics and work our way up to concurrent scraping, anti-detection techniques, and memory optimization tricks that most tutorials skip.
Why Choose Rust for Web Scraping?
Before diving into code, let's talk about when Rust makes sense for scraping.
Rust shines when you need:
- High-volume scraping (think hundreds of thousands of pages)
- Maximum performance from limited hardware
- Memory-safe concurrent processing
- Integration with existing Rust codebases
- Fine-grained control over resource usage
Stick with Python or JavaScript if:
- You're scraping a few hundred pages
- Prototyping speed matters more than execution speed
- Your team doesn't know Rust
- You need a massive ecosystem of scraping libraries
The truth is, for most scraping projects, Python's simplicity wins. But when you're processing millions of records or need to squeeze every bit of performance from your infrastructure, Rust delivers.
Setting Up Your Rust Scraping Environment
First, install Rust using rustup:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Create a new project:
cargo new rust_scraper
cd rust_scraper
Add these dependencies to your Cargo.toml:
[dependencies]
reqwest = { version = "0.11", features = ["blocking", "cookies"] }
scraper = "0.18"
tokio = { version = "1", features = ["full"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
Here's what each does:
- reqwest: HTTP client that handles requests and responses
- scraper: HTML parsing using CSS selectors
- tokio: Async runtime for concurrent scraping
- serde/serde_json: Serialization for saving scraped data
Your First Rust Scraper: Blocking Mode
Let's scrape some quotes from http://quotes.toscrape.com. We'll start with synchronous (blocking) code before moving to async.
use reqwest::blocking::Client;
use scraper::{Html, Selector};
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create a client with a realistic user agent
let client = Client::builder()
.user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
.build()?;
// Fetch the page
let response = client.get("http://quotes.toscrape.com").send()?;
let body = response.text()?;
// Parse HTML
let document = Html::parse_document(&body);
// Define selectors
let quote_selector = Selector::parse("span.text").unwrap();
let author_selector = Selector::parse("small.author").unwrap();
// Extract data
for (quote_el, author_el) in document
.select("e_selector)
.zip(document.select(&author_selector))
{
let quote = quote_el.inner_html();
let author = author_el.inner_html();
println!("{} - {}", quote, author);
}
Ok(())
}
What's happening here?
The Client::builder() pattern lets you configure headers, timeouts, and connection pooling. Setting a user agent is crucial—many sites block requests without one.
The ? operator propagates errors up the call stack. This is Rust's way of handling errors without try-catch blocks. If any operation fails, the function returns early with that error.
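If the ? operator looks magical, it helps to see what a line like let body = response.text()?; roughly expands to (a sketch of the desugaring, using the same response variable as above):
// Roughly what `?` does: unwrap the Ok value, or convert the error and return early.
let body = match response.text() {
    Ok(text) => text,
    Err(e) => return Err(e.into()),
};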
Selector::parse() returns a Result, but we call .unwrap() because we control the selector string. In production code, you'd handle this more gracefully.
The .zip() iterator pairs quotes with authors. This works here because the page's HTML guarantees they appear in matching order; for less tidy markup, select each quote's container element and extract fields inside it, as the structured-extraction example does later.
Understanding Rust's Error Handling in Scrapers
One of Rust's biggest advantages is explicit error handling. No silent failures or uncaught exceptions.
use std::error::Error;
use reqwest::blocking::Client;
fn scrape_page(url: &str) -> Result<String, Box<dyn Error>> {
let client = Client::new();
let response = client.get(url).send()?;
// Check status code before parsing
if !response.status().is_success() {
return Err(format!("HTTP {}: {}", response.status(), url).into());
}
let text = response.text()?;
Ok(text)
}
fn main() {
match scrape_page("http://quotes.toscrape.com") {
Ok(html) => println!("Scraped {} bytes", html.len()),
Err(e) => eprintln!("Scraping failed: {}", e),
}
}
The Result type forces you to handle errors. You can't accidentally ignore a failed request like you might in Python with a bare except: pass.
In production scrapers, implement custom error types:
use std::error::Error;
use std::fmt;
#[derive(Debug)]
enum ScraperError {
NetworkError(reqwest::Error),
ParseError(String),
NotFound,
}
impl fmt::Display for ScraperError {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match self {
ScraperError::NetworkError(e) => write!(f, "Network error: {}", e),
ScraperError::ParseError(msg) => write!(f, "Parse error: {}", msg),
ScraperError::NotFound => write!(f, "Resource not found"),
}
}
}
impl Error for ScraperError {}
This gives you fine-grained control over error handling and recovery strategies.
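One piece worth adding, not shown above, is a From impl so that ? converts reqwest errors into ScraperError automatically. A minimal sketch (fetch is a hypothetical helper):
// Let `?` convert reqwest errors into our error type.
impl From<reqwest::Error> for ScraperError {
    fn from(e: reqwest::Error) -> Self {
        ScraperError::NetworkError(e)
    }
}

// Functions can now return Result<_, ScraperError> and still use `?` on reqwest calls.
fn fetch(url: &str) -> Result<String, ScraperError> {
    let response = reqwest::blocking::get(url)?;
    if response.status().as_u16() == 404 {
        return Err(ScraperError::NotFound);
    }
    Ok(response.text()?)
}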
Async Scraping with Tokio: The Real Power
Blocking IO works for single pages, but it doesn't scale. Async Rust with Tokio lets you scrape hundreds of pages concurrently without spawning OS threads.
Here's a basic async scraper:
use reqwest::Client;
use scraper::{Html, Selector};
use tokio;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new();
let urls = vec![
"http://quotes.toscrape.com/page/1/",
"http://quotes.toscrape.com/page/2/",
"http://quotes.toscrape.com/page/3/",
];
// Create futures for all requests
let futures = urls.into_iter().map(|url| {
let client = client.clone();
async move {
let response = client.get(url).send().await?;
let html = response.text().await?;
Ok::<String, reqwest::Error>(html)
}
});
// Execute concurrently
let results = futures::future::try_join_all(futures).await?;
println!("Scraped {} pages", results.len());
Ok(())
}
Key differences from blocking code:
The #[tokio::main] macro transforms your main function into an async runtime. Without it, you can't use .await.
client.clone() is cheap—it's just an Arc pointer under the hood. All clones share the same underlying connection pool.
try_join_all waits for all futures to complete. If any fail, the whole operation fails. Use join_all if you want to continue even when some requests fail.
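If you'd rather keep the successful pages and just log the failures, swap try_join_all for join_all. A sketch of how that part of main changes, reusing the futures iterator from above:
// Each future resolves to its own Result, so failures are handled per page.
let results = futures::future::join_all(futures).await;
for result in results {
    match result {
        Ok(html) => println!("fetched {} bytes", html.len()),
        Err(e) => eprintln!("request failed: {}", e),
    }
}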
Building a Proper Concurrent Scraper
Let's build something more realistic: a scraper that processes multiple pages with rate limiting and error recovery.
use reqwest::Client;
use scraper::{Html, Selector};
use tokio::time::{sleep, Duration};
use futures::stream::{self, StreamExt};
use std::sync::Arc;
struct Scraper {
client: Client,
concurrent_limit: usize,
}
impl Scraper {
fn new(concurrent_limit: usize) -> Self {
let client = Client::builder()
.user_agent("Mozilla/5.0 (compatible; RustScraper/1.0)")
.timeout(Duration::from_secs(30))
.build()
.unwrap();
Self { client, concurrent_limit }
}
async fn scrape_page(&self, url: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
// Random delay between 1-3 seconds
let delay = Duration::from_millis(1000 + (rand::random::<u64>() % 2000));
sleep(delay).await;
let response = self.client.get(url).send().await?;
let html = response.text().await?;
let document = Html::parse_document(&html);
let selector = Selector::parse("span.text").unwrap();
let quotes: Vec<String> = document
.select(&selector)
.map(|el| el.inner_html())
.collect();
Ok(quotes)
}
async fn scrape_all(&self, urls: Vec<String>) -> Vec<Result<Vec<String>, Box<dyn std::error::Error>>> {
stream::iter(urls)
.map(|url| async move {
self.scrape_page(&url).await
})
.buffer_unordered(self.concurrent_limit)
.collect()
.await
}
}
#[tokio::main]
async fn main() {
let scraper = Scraper::new(5); // Max 5 concurrent requests
let urls: Vec<String> = (1..=10)
.map(|i| format!("http://quotes.toscrape.com/page/{}/", i))
.collect();
let results = scraper.scrape_all(urls).await;
let mut total_quotes = 0;
for result in results {
match result {
Ok(quotes) => total_quotes += quotes.len(),
Err(e) => eprintln!("Error: {}", e),
}
}
println!("Total quotes scraped: {}", total_quotes);
}
Why this pattern works:
buffer_unordered is the secret sauce. It processes up to N futures concurrently, yielding results as they complete. This gives you concurrency control without manual semaphore management.
The random delay mimics human behavior. Most anti-bot systems look for regular patterns. A consistent 2-second delay between requests is more suspicious than variable timing.
We're building a Scraper struct instead of loose functions. This makes it easy to share configuration and add features like connection pooling or request caching.
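For comparison, here's roughly what the manual approach with a tokio::sync::Semaphore looks like; buffer_unordered gives you the same cap with less ceremony. scrape_one is a stand-in for whatever per-URL work you actually do:
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn scrape_one(url: String) {
    // Stand-in for a real request; replace with client.get(&url)... in practice.
    println!("scraping {}", url);
}

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(5)); // at most 5 tasks hold a permit at once
    let urls: Vec<String> = (1..=10)
        .map(|i| format!("http://quotes.toscrape.com/page/{}/", i))
        .collect();
    let mut handles = Vec::new();
    for url in urls {
        let semaphore = Arc::clone(&semaphore);
        handles.push(tokio::spawn(async move {
            // Holding the permit for the duration of the task enforces the limit.
            let _permit = semaphore.acquire().await.unwrap();
            scrape_one(url).await;
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}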
Advanced Anti-Detection: User-Agent Rotation
Sites often block scrapers by tracking user agents. Here's a rotation strategy:
use reqwest::Client;
use rand::seq::SliceRandom;
use std::sync::Arc;
use tokio::sync::Mutex;
const USER_AGENTS: &[&str] = &[
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0",
];
struct RotatingClient {
user_agents: Vec<String>,
index: Arc<Mutex<usize>>,
}
impl RotatingClient {
fn new() -> Self {
let mut user_agents: Vec<String> = USER_AGENTS
.iter()
.map(|s| s.to_string())
.collect();
user_agents.shuffle(&mut rand::thread_rng());
Self {
user_agents,
index: Arc::new(Mutex::new(0)),
}
}
async fn get_client(&self) -> Client {
let mut idx = self.index.lock().await;
let ua = &self.user_agents[*idx];
*idx = (*idx + 1) % self.user_agents.len();
Client::builder()
.user_agent(ua)
.build()
.unwrap()
}
}
#[tokio::main]
async fn main() {
let rotating_client = Arc::new(RotatingClient::new());
for i in 0..10 {
let client = rotating_client.get_client().await;
let response = client
.get("https://httpbin.org/user-agent")
.send()
.await
.unwrap();
println!("Request {}: {}", i, response.text().await.unwrap());
}
}
The Arc<Mutex<usize>> pattern lets multiple async tasks safely share the rotation index. The mutex ensures only one task modifies the index at a time.
Shuffling the user agent list at startup adds another layer of randomness. Each run of your scraper will use a different rotation order.
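Since the shared state is just a counter, a std::sync::atomic::AtomicUsize also works and avoids awaiting a lock entirely. A minimal sketch (AtomicRotator is a hypothetical variant of the struct above):
use std::sync::atomic::{AtomicUsize, Ordering};

struct AtomicRotator {
    user_agents: Vec<String>,
    index: AtomicUsize,
}

impl AtomicRotator {
    fn next_user_agent(&self) -> &str {
        // fetch_add returns the previous value, so concurrent callers each get a distinct slot.
        let i = self.index.fetch_add(1, Ordering::Relaxed) % self.user_agents.len();
        &self.user_agents[i]
    }
}
Either way, consider keeping a single shared Client and setting the header per request with .header(reqwest::header::USER_AGENT, ua); that preserves one connection pool instead of building a fresh client for every request.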
Cookie Management for Session-Based Scraping
Some sites require cookies for session management. Here's how to handle them:
use reqwest::{Client, Url, cookie::{CookieStore, Jar}};
use std::sync::Arc;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let jar = Arc::new(Jar::default());
let client = Client::builder()
.cookie_provider(Arc::clone(&jar))
.build()?;
// First request sets cookies
let login_url = "http://quotes.toscrape.com/login";
let response = client.get(login_url).send().await?;
// Check what cookies were set
let parsed_url = Url::parse(login_url)?;
let cookies = jar.cookies(&parsed_url);
println!("Cookies: {:?}", cookies);
// Subsequent requests automatically include cookies
let profile_url = "http://quotes.toscrape.com/";
let response = client.get(profile_url).send().await?;
println!("Status: {}", response.status());
Ok(())
}
The cookie jar persists across requests automatically. You don't need to manually extract and reinsert cookies like in some other languages.
For more complex scenarios, save cookies to disk:
use reqwest::cookie::{CookieStore, Jar};
use reqwest::Url;
use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize)]
struct CookieData {
    name: String,
    value: String,
    domain: String,
}

fn save_cookies(jar: &Jar, url: &Url, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // The jar exposes cookies for a URL as one "name=value; name=value" header
    let cookies_header = jar.cookies(url).ok_or("no cookies stored for this URL")?;
    let cookies_str = cookies_header.to_str()?;
    // Split the header into individual cookies, tagging each with the URL's host
    let cookies: Vec<CookieData> = cookies_str
        .split("; ")
        .filter_map(|pair| pair.split_once('='))
        .map(|(name, value)| CookieData {
            name: name.to_string(),
            value: value.to_string(),
            domain: url.host_str().unwrap_or_default().to_string(),
        })
        .collect();
    std::fs::write(path, serde_json::to_string_pretty(&cookies)?)?;
    Ok(())
}
Memory-Efficient Scraping with Rust's Ownership
One of Rust's killer features is zero-cost abstractions. Let's leverage that for memory-efficient scraping.
Instead of storing all HTML in memory, process it streaming:
use reqwest::Client;
use scraper::{Html, Selector};
use tokio::fs::File;
use tokio::io::AsyncWriteExt;
async fn scrape_to_file(
client: &Client,
url: &str,
output_path: &str,
) -> Result<(), Box<dyn std::error::Error>> {
let response = client.get(url).send().await?;
let html = response.text().await?;
// Parse and extract data
let document = Html::parse_document(&html);
let selector = Selector::parse("span.text").unwrap();
// Open file for writing
let mut file = File::create(output_path).await?;
// Write quotes one at a time instead of collecting
for element in document.select(&selector) {
let quote = element.inner_html();
file.write_all(quote.as_bytes()).await?;
file.write_all(b"\n").await?;
}
Ok(())
}
This writes each quote to disk as soon as it's extracted instead of collecting everything into a vector first. Note that Html::parse_document still holds the full parsed DOM in memory; the savings come from not buffering the extracted results, which adds up when you extract a lot of data per page or across many pages.
For truly massive datasets, use the bytes crate for zero-copy operations:
use bytes::Bytes;
use reqwest::Client;
async fn fetch_binary(client: &Client, url: &str) -> Result<Bytes, reqwest::Error> {
let response = client.get(url).send().await?;
let bytes = response.bytes().await?;
Ok(bytes)
}
Bytes is a reference-counted byte buffer. Multiple parts of your code can hold references to the same data without copying.
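If you never need the whole body in memory at all, you can stream it straight to disk with reqwest's chunk() method. A sketch (download_streaming and output_path are illustrative names):
use reqwest::Client;
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

// Stream a large response to disk piece by piece so the full body never sits in memory.
async fn download_streaming(
    client: &Client,
    url: &str,
    output_path: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut response = client.get(url).send().await?;
    let mut file = File::create(output_path).await?;
    // chunk() yields the next part of the body as it arrives over the network.
    while let Some(chunk) = response.chunk().await? {
        file.write_all(&chunk).await?;
    }
    file.flush().await?;
    Ok(())
}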
Building a Custom Rate Limiter
Most tutorials use simple delays. Let's build a proper token bucket rate limiter:
use tokio::time::{sleep, Duration, Instant};
use std::sync::Arc;
use tokio::sync::Mutex;
struct RateLimiter {
tokens: Arc<Mutex<f64>>,
capacity: f64,
refill_rate: f64,
last_refill: Arc<Mutex<Instant>>,
}
impl RateLimiter {
fn new(requests_per_second: f64) -> Self {
Self {
tokens: Arc::new(Mutex::new(requests_per_second)),
capacity: requests_per_second,
refill_rate: requests_per_second,
last_refill: Arc::new(Mutex::new(Instant::now())),
}
}
async fn acquire(&self) {
loop {
let mut tokens = self.tokens.lock().await;
let mut last_refill = self.last_refill.lock().await;
// Refill tokens based on elapsed time
let elapsed = last_refill.elapsed().as_secs_f64();
let new_tokens = (*tokens + elapsed * self.refill_rate).min(self.capacity);
*tokens = new_tokens;
*last_refill = Instant::now();
if *tokens >= 1.0 {
*tokens -= 1.0;
break;
}
drop(tokens);
drop(last_refill);
sleep(Duration::from_millis(100)).await;
}
}
}
#[tokio::main]
async fn main() {
let limiter = Arc::new(RateLimiter::new(5.0)); // 5 requests per second
let mut handles = vec![];
for i in 0..20 {
let limiter = Arc::clone(&limiter);
let handle = tokio::spawn(async move {
limiter.acquire().await;
println!("Request {} started at {:?}", i, Instant::now());
});
handles.push(handle);
}
for handle in handles {
handle.await.unwrap();
}
}
This rate limiter smooths out bursts. Instead of sleeping for a fixed interval, it refills tokens continuously. This is more efficient and provides better throughput.
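Wiring it into your requests is just a matter of awaiting acquire() before each call. A sketch with a hypothetical helper:
// Wait for a token from the limiter above, then fetch the page.
async fn rate_limited_get(
    limiter: &RateLimiter,
    client: &reqwest::Client,
    url: &str,
) -> Result<String, reqwest::Error> {
    limiter.acquire().await;
    client.get(url).send().await?.text().await
}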
Structured Data Extraction with Serde
Most scrapers need to save data. Serde makes this trivial:
use serde::{Serialize, Deserialize};
use scraper::{Html, Selector};
use std::fs::File;
use std::io::Write;
#[derive(Serialize, Deserialize, Debug)]
struct Quote {
text: String,
author: String,
tags: Vec<String>,
}
fn extract_quotes(html: &str) -> Vec<Quote> {
let document = Html::parse_document(html);
let quote_selector = Selector::parse("div.quote").unwrap();
let mut quotes = Vec::new();
let text_selector = Selector::parse("span.text").unwrap();
let author_selector = Selector::parse("small.author").unwrap();
let tag_selector = Selector::parse("a.tag").unwrap();
for quote_el in document.select(&quote_selector) {
let text = quote_el
.select(&text_selector)
.next()
.map(|el| el.inner_html())
.unwrap_or_default();
let author = quote_el
.select(&author_selector)
.next()
.map(|el| el.inner_html())
.unwrap_or_default();
let tags: Vec<String> = quote_el
.select(&tag_selector)
.map(|el| el.inner_html())
.collect();
quotes.push(Quote { text, author, tags });
}
quotes
}
fn save_to_json(quotes: &[Quote], path: &str) -> std::io::Result<()> {
let json = serde_json::to_string_pretty(quotes)?;
let mut file = File::create(path)?;
file.write_all(json.as_bytes())?;
Ok(())
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = reqwest::Client::new();
let response = client.get("http://quotes.toscrape.com").send().await?;
let html = response.text().await?;
let quotes = extract_quotes(&html);
save_to_json("es, "quotes.json")?;
println!("Saved {} quotes", quotes.len());
Ok(())
}
The #[derive(Serialize)] macro generates all serialization code at compile time. No runtime overhead, no reflection.
For CSV output, add csv = "1.3" to your dependencies:
use csv::Writer;
fn save_to_csv(quotes: &[Quote], path: &str) -> Result<(), Box<dyn std::error::Error>> {
let mut wtr = Writer::from_path(path)?;
// csv can't serialize the Vec<String> tags field directly, so flatten it into one cell
wtr.write_record(["text", "author", "tags"])?;
for quote in quotes {
let tags = quote.tags.join(",");
wtr.write_record([quote.text.as_str(), quote.author.as_str(), tags.as_str()])?;
}
wtr.flush()?;
Ok(())
}
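If you're writing results out as you scrape rather than collecting everything first, newline-delimited JSON is a convenient format. A small sketch (append_quote_jsonl is a hypothetical helper):
use std::fs::OpenOptions;
use std::io::Write;

// Append one Quote per line as JSON; easy to resume and to stream back out later.
fn append_quote_jsonl(quote: &Quote, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", serde_json::to_string(quote)?)?;
    Ok(())
}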
JavaScript-Rendered Content: Headless Chrome
For sites that require JavaScript to render, add the headless_chrome crate to your dependencies and drive a real browser:
use headless_chrome::{Browser, LaunchOptions};
fn scrape_js_page(url: &str) -> Result<String, Box<dyn std::error::Error>> {
let browser = Browser::new(LaunchOptions::default())?;
let tab = browser.new_tab()?;
// Navigate and wait for load
tab.navigate_to(url)?;
tab.wait_until_navigated()?;
// Wait for specific element to ensure JS has run
tab.wait_for_element("div.quote")?;
// Get rendered HTML
let html = tab.get_content()?;
Ok(html)
}
Warning: headless browsers are slow and resource-intensive. Only use them when you actually need JavaScript execution. For most sites, reqwest + scraper is 10x faster.
Production Tips: Error Recovery and Retry Logic
Real scrapers need robust error handling. Here's a retry wrapper:
use tokio::time::{sleep, Duration};
async fn retry_with_backoff<F, T, E>(
mut f: F,
max_retries: u32,
) -> Result<T, E>
where
F: FnMut() -> futures::future::BoxFuture<'static, Result<T, E>>,
{
let mut retries = 0;
loop {
match f().await {
Ok(result) => return Ok(result),
Err(e) if retries >= max_retries => return Err(e),
Err(_) => {
let backoff = Duration::from_secs(2u64.pow(retries));
sleep(backoff).await;
retries += 1;
}
}
}
}
// Usage example
async fn scrape_with_retry(url: &str) -> Result<String, reqwest::Error> {
let client = reqwest::Client::new();
let url = url.to_string();
retry_with_backoff(
|| {
let client = client.clone();
let url = url.clone();
Box::pin(async move {
client.get(&url).send().await?.text().await
})
},
3, // max 3 retries
)
.await
}
Exponential backoff prevents hammering a failing server. With 2u64.pow(retries), the first retry waits 1 second, the next waits 2, and the third waits 4.
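In practice you usually only want to retry transient failures. reqwest's error_for_status() turns 4xx/5xx responses into errors, so a rate-limited or overloaded server flows through the same retry path:
// Treat HTTP error statuses (e.g. 503) as failures so the retry wrapper sees them.
async fn fetch_checked(client: &reqwest::Client, url: &str) -> Result<String, reqwest::Error> {
    client
        .get(url)
        .send()
        .await?
        .error_for_status()?
        .text()
        .await
}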
Performance Benchmarking
Measure your scraper's performance with Criterion:
// benches/scraper_bench.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn benchmark_selector_parse(c: &mut Criterion) {
let html = r#"<div class="quote">...</div>"#;
c.bench_function("parse html", |b| {
b.iter(|| {
let document = scraper::Html::parse_document(black_box(html));
let selector = scraper::Selector::parse("div.quote").unwrap();
document.select(&selector).count()
})
});
}
criterion_group!(benches, benchmark_selector_parse);
criterion_main!(benches);
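Before this runs, Criterion needs to be a dev-dependency and the bench target needs the default harness disabled. Something along these lines in Cargo.toml (the version shown is simply a recent one):
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "scraper_bench"
harness = false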
Run with cargo bench. This helps you identify bottlenecks before they become problems.
Common Pitfalls and How to Avoid Them
1. Not handling Result properly
Don't use .unwrap() everywhere. It will panic on errors. Use ? or pattern matching:
// Bad
let html = response.text().await.unwrap();
// Good
let html = response.text().await?;
2. Creating too many HTTP clients
Each Client::new() creates a new connection pool. Reuse clients:
// Bad - creates 100 connection pools
for url in urls {
let client = Client::new();
client.get(url).send().await?;
}
// Good - one connection pool
let client = Client::new();
for url in urls {
client.get(url).send().await?;
}
3. Not setting timeouts
Always set timeouts to prevent hanging:
let client = Client::builder()
.timeout(Duration::from_secs(30))
.build()?;
4. Forgetting to respect robots.txt
Check a site's robots.txt before scraping at scale. It tells you which paths the site owner doesn't want crawled, and ignoring it is a fast way to get your scraper blocked. A minimal pre-flight check is sketched below.
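Here's that check. It only does a naive prefix match against Disallow lines and ignores User-agent groups, so treat it as a sketch rather than a compliant parser (naively_allowed is a hypothetical helper):
// Fetch robots.txt and check whether any Disallow rule is a prefix of the target path.
async fn naively_allowed(
    client: &reqwest::Client,
    base: &str,
    path: &str,
) -> Result<bool, reqwest::Error> {
    let robots = client
        .get(format!("{}/robots.txt", base))
        .send()
        .await?
        .text()
        .await?;
    let blocked = robots.lines().any(|line| {
        line.strip_prefix("Disallow:")
            .map(|rule| !rule.trim().is_empty() && path.starts_with(rule.trim()))
            .unwrap_or(false)
    });
    Ok(!blocked)
}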
When Rust Isn't the Answer
Be honest with yourself: Rust makes sense for performance-critical scraping, but it's overkill for many projects.
Use Python instead if:
- You're scraping < 1000 pages
- Your team doesn't know Rust
- You need to prototype quickly
- The scraping logic is complex and changes frequently
Use Rust when:
- Performance is actually a bottleneck
- You're processing millions of pages
- Memory usage matters (embedded systems, cloud costs)
- You're building a long-term production system
Wrapping Up
Rust brings speed and safety to web scraping, but it's not a silver bullet. The language's steep learning curve means you'll spend more time fighting the borrow checker than writing selectors—at least initially.
That said, once you internalize Rust's patterns, you'll build scrapers that are both fast and maintainable. The type system catches bugs at compile time that would be runtime errors in Python. The memory model lets you process huge datasets without GC pauses. And the async ecosystem scales to thousands of concurrent connections without spawning OS threads.
Start with the blocking examples in this guide. Once you're comfortable with Rust's basics, move to async code with Tokio. Add anti-detection features only when you need them. And always remember: the best scraper is one that respects the sites you're scraping.