Web scraping with R is one of those skills that looks intimidating until you actually try it. Then you realize it's just making HTTP requests and parsing HTML—which R handles surprisingly well.
I've been scraping websites with R for years, and I'm constantly surprised by how many data analysts don't know about some of the powerful (and polite) techniques available in the R ecosystem. Most tutorials cover rvest basics and stop there. This guide goes further.
Here's what you'll learn: how to scrape static and dynamic content, handle rate limiting properly with the polite package, speed things up with parallel processing using furrr, and tackle real-world obstacles like JavaScript-heavy sites and anti-bot measures. No fluff, just the techniques that actually work.
Why R for Web Scraping?
Before diving into code, let's address the elephant in the room: why use R when Python dominates web scraping discussions?
R excels at web scraping for one simple reason—you can extract, clean, analyze, and visualize data without switching tools. If you're already using R for data analysis, adding scraping to your workflow is seamless.
The tidyverse integration is another huge advantage. You can pipe scraped data directly into dplyr for cleaning or ggplot2 for visualization. This end-to-end workflow is why I prefer R for research and analysis projects, even though I'll reach for Python when building production scrapers.
Essential R Packages for Web Scraping
The R ecosystem has matured significantly. Here are the packages you actually need:
Core scraping:
- rvest - Your main tool for parsing HTML and extracting data
- httr2 - Advanced HTTP requests (successor to httr, much more powerful)
- xml2 - XML/HTML parsing engine that powers rvest
Dynamic content:
- chromote - Lightweight Chrome automation (my preference over RSelenium)
- RSelenium - Full browser automation when you need maximum control
Being a good citizen:
- polite - Automatically respects robots.txt and rate limits (criminally underused)
- robotstxt - Parse and check robots.txt files manually
Performance:
- furrr - Parallel processing with future backend (cleaner than doParallel)
- memoise - Cache function calls to avoid re-scraping
Install everything with:
install.packages(c("rvest", "httr2", "xml2", "polite", "robotstxt", "chromote", "RSelenium", "furrr", "memoise"))
Your First R Web Scraper
Let's build a basic scraper that actually works. We'll extract book data from a practice scraping site.
library(rvest)
library(dplyr)
# Target URL
url <- "http://books.toscrape.com/catalogue/page-1.html"
# Read the HTML
page <- read_html(url)
# Extract book titles
titles <- page %>%
  html_elements("h3 a") %>%
  html_attr("title")
# Extract prices
prices <- page %>%
  html_elements(".price_color") %>%
  html_text() %>%
  gsub("£", "", .) %>%
  as.numeric()
# Extract ratings
ratings <- page %>%
  html_elements(".star-rating") %>%
  html_attr("class") %>%
  gsub("star-rating ", "", .)
# Combine into dataframe
books_df <- data.frame(
  title = titles,
  price = prices,
  rating = ratings,
  stringsAsFactors = FALSE
)
head(books_df, 3)
This code demonstrates the core rvest workflow: read_html() fetches the page, html_elements() selects nodes using CSS selectors, and html_text() or html_attr() extracts the data you want.
A note on selectors: Use your browser's DevTools to find the right CSS selectors. Right-click an element, choose "Inspect," and you'll see the HTML structure. Chrome and Firefox also have a "Copy selector" option that gives you the exact path.
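For instance, these are the selector patterns you'll reach for most often with html_elements() (the #promotions id is made up for illustration; the other selectors exist on the books.toscrape.com practice site):
library(rvest)
page <- read_html("http://books.toscrape.com/catalogue/page-1.html")
page %>% html_elements("h3 a")                                # tag nested inside a tag
page %>% html_elements(".price_color")                        # by class
page %>% html_elements("#promotions")                         # by id (made up for illustration)
page %>% html_elements("article.product_pod p.price_color")   # tag + class, nested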
Scraping Multiple Pages Efficiently
Single-page scraping is great for learning, but real projects involve hundreds or thousands of pages. Here's how to handle pagination without overwhelming servers.
The Basic Loop Approach
library(rvest)
library(purrr)
scrape_page <- function(page_num) {
  url <- paste0("http://books.toscrape.com/catalogue/page-", page_num, ".html")
  # Add small delay to be respectful
  Sys.sleep(1)
  page <- read_html(url)
  titles <- page %>%
    html_elements("h3 a") %>%
    html_attr("title")
  prices <- page %>%
    html_elements(".price_color") %>%
    html_text() %>%
    gsub("£", "", .) %>%
    as.numeric()
  data.frame(
    title = titles,
    price = prices,
    page = page_num
  )
}
# Scrape pages 1-5
all_books <- map_dfr(1:5, scrape_page)
This works, but it's sequential and slow. Each page waits for the previous one to finish.
Parallel Scraping with furrr
Here's where things get interesting. The furrr package lets you parallelize any purrr function with minimal code changes:
library(furrr)
# Set up parallel processing
plan(multisession, workers = 4)  # Use 4 cores
# Same scrape_page function from before
scrape_page <- function(page_num) {
  url <- paste0("http://books.toscrape.com/catalogue/page-", page_num, ".html")
  Sys.sleep(1)
  page <- read_html(url)
  titles <- page %>%
    html_elements("h3 a") %>%
    html_attr("title")
  prices <- page %>%
    html_elements(".price_color") %>%
    html_text() %>%
    gsub("£", "", .) %>%
    as.numeric()
  data.frame(title = titles, price = prices, page = page_num)
}
# Parallel scraping - just replace map_dfr with future_map_dfr
all_books <- future_map_dfr(1:20, scrape_page, .progress = TRUE)
The future_map_dfr() call runs your function across multiple cores simultaneously. On my machine, scraping 20 pages sequentially takes ~25 seconds. With 4 cores, it drops to ~8 seconds.
Important: Don't use too many workers. Start with detectCores() - 1 to leave one core free for other processes. And always add delays—parallel requests can easily overwhelm small servers.
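A minimal sketch of that setup (detectCores() comes from the base parallel package, which is always available):
library(furrr)
# Leave one core free for the rest of the system
n_workers <- max(1, parallel::detectCores() - 1)
plan(multisession, workers = n_workers)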
The Polite Package: Be a Good Web Citizen
Most web scraping tutorials ignore ethics and etiquette. That's a mistake. The polite package makes responsible scraping automatic.
Here's the problem with basic rvest: it doesn't check robots.txt, doesn't identify your scraper properly, and makes it easy to accidentally DDoS a site. The polite package fixes all of this.
library(polite)
library(rvest)
# Introduce yourself to the host
session <- bow(
  url = "http://books.toscrape.com",
  user_agent = "Research Bot (your-email@example.com)",
  force = TRUE
)
# The session object contains info about what you can scrape
print(session)
When you call bow(), polite:
- Fetches and parses robots.txt
- Checks if your target URL is allowed (a check you can also run yourself with robotstxt—see the sketch after this list)
- Sets appropriate rate limits
- Caches robots.txt so you don't re-fetch it
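If you want to run that allowed-or-not check manually—say, before committing to a scraping project—the robotstxt package mentioned earlier does it directly; a small sketch:
library(robotstxt)
# TRUE if generic bots may crawl this path according to the site's robots.txt
paths_allowed(
  paths = "/catalogue/page-1.html",
  domain = "books.toscrape.com"
)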
Now scrape using the session:
# Navigate to a specific page
page_session <- nod(session, path = "/catalogue/page-1.html")
# Scrape the page (with automatic rate limiting)
page_data <- scrape(page_session)
# Extract data as usual
titles <- page_data %>%
html_elements("h3 a") %>%
html_attr("title")
The magic happens behind the scenes. The scrape() function:
- Enforces rate limits (default 5 seconds between requests; see the sketch after this list for adjusting it)
- Caches responses to avoid duplicate requests
- Respects crawl-delay directives from robots.txt
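If the default pause is slower than you need, bow() accepts a delay argument; a minimal sketch (polite is still expected to respect any crawl-delay the site declares in robots.txt):
library(polite)
# Ask for a 2-second pause between requests instead of the 5-second default
session <- bow(
  "http://books.toscrape.com",
  user_agent = "Research Bot (your-email@example.com)",
  delay = 2
)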
Polite Pagination Loop
Here's how to combine polite with pagination:
library(polite)
library(rvest)
library(dplyr)
library(purrr)
# Initialize session
session <- bow("http://books.toscrape.com")
scrape_books_polite <- function(page_num, session) {
  # Update path for this page
  current_page <- nod(session, path = paste0("/catalogue/page-", page_num, ".html"))
  # Scrape (rate limiting happens automatically)
  page <- scrape(current_page)
  # Extract data
  data.frame(
    title = page %>% html_elements("h3 a") %>% html_attr("title"),
    price = page %>% html_elements(".price_color") %>% html_text()
  )
}
# Scrape multiple pages
all_books <- map_dfr(1:10, ~scrape_books_polite(.x, session))
The polite package is one of those tools that separates professional scrapers from beginners. Use it.
Handling Dynamic Content with chromote
Many modern websites load content dynamically with JavaScript. When you scrape these sites with rvest, you get empty pages because the content hasn't rendered yet.
RSelenium is the traditional solution, but it's heavy—requires Java, external servers, and lots of setup. The chromote package is lighter and more R-native.
Basic chromote Setup
library(chromote)
library(rvest)
# Create a new Chrome session
browser <- ChromoteSession$new()
# Navigate to a JavaScript-heavy page
browser$Page$navigate("https://example.com/dynamic-content")
# Wait for page to load
browser$Page$loadEventFired()
# Get the fully rendered HTML
html_content <- browser$Runtime$evaluate("document.documentElement.outerHTML")$result$value
# Parse with rvest
page <- read_html(html_content)
# Extract data as usual
data <- page %>%
html_elements(".dynamic-content") %>%
html_text()
# Close browser when done
browser$close()
This opens a headless Chrome instance, waits for JavaScript to execute, and grabs the rendered HTML. You then parse it with rvest like any static page.
Handling Infinite Scroll
Here's a real-world challenge: sites that load more content as you scroll. You can automate scrolling with chromote:
library(chromote)
browser <- ChromoteSession$new()
browser$Page$navigate("https://example.com/infinite-scroll")
browser$Page$loadEventFired()
# Function to scroll to bottom
scroll_to_bottom <- function(browser, pause = 2) {
  # Get current scroll height
  prev_height <- browser$Runtime$evaluate(
    "document.body.scrollHeight"
  )$result$value
  # Scroll to bottom
  browser$Runtime$evaluate("window.scrollTo(0, document.body.scrollHeight);")
  # Wait for new content to load
  Sys.sleep(pause)
  # Get new scroll height
  new_height <- browser$Runtime$evaluate(
    "document.body.scrollHeight"
  )$result$value
  return(new_height > prev_height)
}
# Keep scrolling until no new content loads
scroll_count <- 0
while (scroll_to_bottom(browser) && scroll_count < 20) {
  scroll_count <- scroll_count + 1
  message(paste("Scroll", scroll_count))
}
# Now extract all loaded content
html_content <- browser$Runtime$evaluate(
  "document.documentElement.outerHTML"
)$result$value
page <- read_html(html_content)
browser$close()
This scrolls, waits for content to load, checks if the page got longer, and repeats. The scroll_count < 20 prevents infinite loops.
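One refinement: fixed Sys.sleep() pauses waste time on fast connections and fail on slow ones. Here's a sketch of polling for a specific element instead, using the same Runtime$evaluate() pattern (the .result-card selector is a hypothetical placeholder):
wait_for_selector <- function(browser, selector, timeout = 10, poll = 0.25) {
  js <- sprintf("document.querySelector('%s') !== null", selector)
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    found <- browser$Runtime$evaluate(js)$result$value
    if (isTRUE(found)) return(TRUE)
    Sys.sleep(poll)  # Check again shortly
  }
  FALSE  # Timed out without finding the element
}
# Usage: wait_for_selector(browser, ".result-card") before grabbing the HTML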
Advanced HTTP Handling with httr2
Sometimes rvest isn't enough. Maybe you need custom headers, authentication, or precise control over requests. That's when you reach for httr2.
Custom Headers to Mimic Browsers
Anti-bot systems often block requests that look suspicious. Setting proper headers helps:
library(httr2)
library(rvest)
# Build a request with proper headers
req <- request("https://example.com/products") %>%
  req_headers(
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" = "en-US,en;q=0.5",
    "Accept-Encoding" = "gzip, deflate",
    "Connection" = "keep-alive"
  )
# Perform the request
resp <- req_perform(req)
# Extract HTML and parse
html <- resp_body_html(resp)
# Now use rvest
data <- html %>%
  html_elements(".product") %>%
  html_text()
The req_headers() function makes your scraper look more like a real browser. Many sites check User-Agent and Accept headers—sending the right ones reduces blocks.
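When the User-Agent is all you care about, httr2 also has a dedicated helper; a quick sketch:
library(httr2)
req <- request("https://example.com/products") %>%
  req_user_agent("Research Bot (your-email@example.com)")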
Handling Sessions and Cookies
Some sites require login or maintain state through cookies. httr2 handles this elegantly:
library(httr2)
library(rvest)
# Shared cookie file so cookies persist across requests
cookie_path <- tempfile()
base_req <- request("https://example.com") %>%
  req_cookie_preserve(cookie_path)
# Login (example POST request)
login_resp <- base_req %>%
  req_url_path("/login") %>%
  req_body_form(
    username = "your_username",
    password = "your_password"
  ) %>%
  req_perform()
# Cookies are now stored - later requests that reuse the same cookie file send them
protected_page <- base_req %>%
  req_url_path("/protected-data") %>%
  req_perform() %>%
  resp_body_html()
# Parse as usual
data <- protected_page %>%
  html_elements(".data") %>%
  html_text()
The req_cookie_preserve() call points every request at the same cookie file. Once you log in, subsequent requests built from base_req include the authentication cookies automatically.
Caching Strategies to Save Time
Scraping takes time. Re-scraping the same pages wastes time and bandwidth. The memoise package caches function results so you never scrape the same URL twice.
library(memoise)
library(rvest)
# Original scraping function
scrape_page_original <- function(url) {
  Sys.sleep(1)  # Respectful delay
  read_html(url)
}
# Cached version
scrape_page_cached <- memoise(scrape_page_original)
# First call scrapes and caches
page1 <- scrape_page_cached("https://example.com/page1")
# Second call returns cached result instantly
page1_again <- scrape_page_cached("https://example.com/page1")  # Instant!
# Different URL still scrapes normally
page2 <- scrape_page_cached("https://example.com/page2")  # Takes 1+ second
This is invaluable during development. If your parsing code has bugs, you can fix and re-run without re-fetching pages.
For persistent caching across R sessions:
library(memoise)
# Cache to disk instead of memory. Cache the HTML as text: parsed xml2
# documents are external pointers and don't survive being written to disk.
fetch_html_text <- function(url) {
  Sys.sleep(1)
  as.character(read_html(url))
}
cache_dir <- cache_filesystem("./scraping_cache")
fetch_html_cached <- memoise(fetch_html_text, cache = cache_dir)
# Now cached results survive R restarts
The cache directory stores results as files. Even if you close R and come back later, previously fetched pages load from disk instantly—just re-parse the cached text with read_html() before extracting data.
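memoise also accepts caches built with the cachem package, which lets cached pages expire instead of lingering forever; a sketch, reusing the fetch_html_text() helper from above and assuming a 24-hour freshness window is acceptable:
library(memoise)
library(cachem)
# Disk cache whose entries expire after 24 hours
day_cache <- cache_disk("./scraping_cache", max_age = 60 * 60 * 24)
fetch_html_cached <- memoise(fetch_html_text, cache = day_cache)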
Handling Common Scraping Challenges
Real-world scraping involves problems. Here's how to handle the most common ones.
Missing Data and Error Handling
Not every page has every element. Use possibly() from purrr to handle failures gracefully:
library(purrr)
library(rvest)
# Function that might fail
extract_price <- function(page) {
  page %>%
    html_element(".price") %>%
    html_text() %>%
    as.numeric()
}
# Wrapped version returns NA on failure instead of erroring
safe_extract_price <- possibly(extract_price, otherwise = NA_real_)
# Use in your scraper
prices <- map_dbl(pages, safe_extract_price)  # No errors, just NAs
For more control, use tryCatch:
scrape_with_retry <- function(url, max_attempts = 3) {
  attempt <- 1
  while (attempt <= max_attempts) {
    result <- tryCatch({
      read_html(url)
    }, error = function(e) {
      message(paste("Attempt", attempt, "failed:", e$message))
      if (attempt < max_attempts) {
        Sys.sleep(2^attempt)  # Exponential backoff
      }
      NULL
    })
    if (!is.null(result)) return(result)
    attempt <- attempt + 1
  }
  stop("Failed after ", max_attempts, " attempts")
}
This retries failed requests with exponential backoff—wait 2 seconds after the first failure, 4 after the second, then give up with an error once the attempts are exhausted.
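If you're building requests with httr2 anyway, it has retry-with-backoff built in, so you don't have to hand-roll the loop; a minimal sketch with a placeholder URL:
library(httr2)
# Retry transient failures (429 and 503 by default), doubling the wait each time
resp <- request("https://example.com/products") %>%
  req_retry(max_tries = 3, backoff = function(attempt) 2^attempt) %>%
  req_perform()
page <- resp_body_html(resp)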
Dealing with Rate Limits
If you're hitting 429 "Too Many Requests" errors, you need better rate limiting:
library(ratelimitr)
# Limit to 10 requests per minute
scrape_limited <- limit_rate(
  read_html,
  rate(n = 10, period = 60)
)
# Use limited function instead
pages <- map(urls, scrape_limited)
Or use the slowly() function from purrr:
library(purrr)
# Add 2-second delay between calls
scrape_slowly <- slowly(read_html, rate = rate_delay(2))
pages <- map(urls, scrape_slowly)
Both approaches work. I prefer limit_rate() when I know the server's limits, and slowly() for simple delays.
Bypassing Simple Anti-Bot Measures
Some sites block obvious bots. Here are tricks that help (use responsibly):
1. Rotate User Agents:
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
)
scrape_with_ua_rotation <- function(url) {
  ua <- sample(user_agents, 1)
  request(url) %>%
    req_headers("User-Agent" = ua) %>%
    req_perform() %>%
    resp_body_html()
}
2. Add Random Delays:
scrape_with_random_delay <- function(url) {
  delay <- runif(1, min = 1, max = 3)  # 1-3 second random delay
  Sys.sleep(delay)
  read_html(url)
}
3. Respect Referer Headers:
Some sites check where requests come from. Set the Referer header:
request(target_url) %>%
  req_headers(
    "Referer" = "https://example.com/previous-page",
    "User-Agent" = "Mozilla/5.0..."
  ) %>%
  req_perform()
Extracting Data from Tables
HTML tables are common and rvest makes them trivial:
library(rvest)
url <- "https://example.com/data-table"
page <- read_html(url)
# Extract all tables on the page
tables <- page %>%
  html_table()
# Usually you want the first table
df <- tables[[1]]
# Clean column names
names(df) <- make.names(names(df))
head(df)
The html_table() function converts HTML tables directly to dataframes. In current rvest (1.0+), irregular tables with inconsistent row lengths are padded automatically, so the old fill = TRUE argument is deprecated and no longer needed.
For tables with complex structures (nested headers, merged cells), you might need to extract manually:
library(purrr)
# Get table element
table <- page %>% html_element("table")
# Extract headers
headers <- table %>%
  html_elements("thead th") %>%
  html_text()
# Extract rows
rows <- table %>%
  html_elements("tbody tr")
# Process each row
data <- map_dfr(rows, function(row) {
  cells <- row %>%
    html_elements("td") %>%
    html_text()
  setNames(as.list(cells), headers)
})
Working with APIs (The Better Alternative)
Before scraping, always check if the site has an API. APIs are faster, more reliable, and explicitly allowed.
Many sites have undocumented APIs that their frontend uses. Find these by:
- Opening DevTools Network tab
- Filtering for XHR/Fetch requests
- Finding JSON responses
- Replicating the request in R
Example of calling a JSON API:
library(httr2)
library(jsonlite)
library(purrr)  # for map_dfr() and %||%
# Example: GitHub API
resp <- request("https://api.github.com/users/hadley/repos") %>%
  req_headers("User-Agent" = "R-script") %>%
  req_perform()
# Parse JSON
repos <- resp %>%
  resp_body_json()
# Convert to dataframe
repos_df <- map_dfr(repos, function(repo) {
  data.frame(
    name = repo$name,
    stars = repo$stargazers_count,
    language = repo$language %||% "None"
  )
})
The %||% operator is a null-coalescer—it returns the left side unless it's NULL, in which case it returns the right side. It's exported by purrr and rlang, and is part of base R as of 4.4.
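A quick illustration (with purrr loaded, or base R 4.4+):
library(purrr)
NULL %||% "None"    # "None"
"R" %||% "None"     # "R"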
Putting It All Together: Complete Example
Here's a realistic scraper that combines techniques from this guide:
library(polite)
library(rvest)
library(furrr)
library(dplyr)
# 1. Set up polite session
session <- bow(
  "http://books.toscrape.com",
  user_agent = "Research Project (your.email@example.com)"
)
# 2. Function to scrape a single page
scrape_book_page <- function(page_num, session) {
  # Navigate to page
  current_page <- nod(session, path = paste0("/catalogue/page-", page_num, ".html"))
  # Scrape with automatic rate limiting
  page <- scrape(current_page, verbose = TRUE)
  # Extract book details
  books <- data.frame(
    title = page %>% html_elements("h3 a") %>% html_attr("title"),
    price = page %>% html_elements(".price_color") %>% html_text() %>%
      gsub("[^0-9.]", "", .) %>% as.numeric(),
    availability = page %>% html_elements(".availability") %>% html_text() %>% trimws(),
    rating = page %>% html_elements(".star-rating") %>% html_attr("class") %>%
      gsub("star-rating ", "", .)
  )
  # Add page number
  books$page <- page_num
  return(books)
}
# 3. Scrape multiple pages in parallel
plan(multisession, workers = 3)
all_books <- future_map_dfr(
  1:10,
  ~scrape_book_page(.x, session),
  .progress = TRUE
)
# 4. Clean and analyze
books_cleaned <- all_books %>%
  mutate(
    in_stock = grepl("In stock", availability),
    rating_num = case_when(
      rating == "One" ~ 1,
      rating == "Two" ~ 2,
      rating == "Three" ~ 3,
      rating == "Four" ~ 4,
      rating == "Five" ~ 5
    )
  )
# 5. Summary
books_cleaned %>%
  group_by(rating_num) %>%
  summarise(
    count = n(),
    avg_price = mean(price, na.rm = TRUE)
  )
This scraper:
- Uses polite for ethical scraping
- Respects rate limits automatically
- Runs in parallel for speed (note that each worker enforces polite's delay independently, so keep the worker count low)
- Extracts multiple data points
- Cleans and summarizes results
Best Practices and Legal Considerations
Before finishing, let's talk about staying out of trouble.
Legal stuff:
- Check the site's Terms of Service
- Respect robots.txt (polite does this for you)
- Don't scrape personal data without considering privacy laws
- If they have an API, use it
Technical best practices:
- Always set a user agent that identifies you
- Implement rate limiting (1-2 seconds minimum between requests)
- Cache results to avoid duplicate requests
- Use exponential backoff on failures
- Monitor your scraper for errors
- Store raw HTML before parsing—parsing bugs happen (see the sketch after this list)
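On that last point, here's a minimal sketch of stashing the fetched HTML to disk before parsing, so extraction bugs never force a re-download (the raw_html directory name is arbitrary):
library(rvest)
library(xml2)
fetch_and_archive <- function(url, dir = "raw_html") {
  dir.create(dir, showWarnings = FALSE)
  page <- read_html(url)
  # Keep an on-disk copy of exactly what was fetched
  out_file <- file.path(dir, paste0(gsub("[^A-Za-z0-9]", "_", url), ".html"))
  write_html(page, out_file)
  page
}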
Ethical considerations:
- Don't overwhelm small servers
- Scrape during off-peak hours for heavy jobs
- Don't republish copyrighted content
- When in doubt, ask permission
Wrapping Up
Web scraping in R is more powerful than most people realize. The combination of rvest for parsing, polite for ethics, chromote for JavaScript, and furrr for speed gives you a professional toolkit.
Start simple with rvest, add polite for any serious project, bring in chromote when you hit JavaScript-heavy sites, and parallelize with furrr when speed matters. Most importantly, be respectful—scraping is a privilege, not a right.
The techniques in this guide work for 90% of scraping projects. The remaining 10%—CAPTCHAs, sophisticated anti-bot systems, complex authentication—require specialized solutions beyond this guide's scope. But you'll know when you hit those limits, and you'll have the foundation to solve them.
Now go scrape responsibly.