Web scraping with R is one of those skills that looks intimidating until you actually try it. Then you realize it's just making HTTP requests and parsing HTML—which R handles surprisingly well.
I've been scraping websites with R for years, and I'm constantly surprised by how many data analysts don't know about some of the powerful (and polite) techniques available in the R ecosystem. Most tutorials cover rvest basics and stop there. This guide goes further.
Here's what you'll learn: how to scrape static and dynamic content, handle rate limiting properly with the polite package, speed things up with parallel processing using furrr, and tackle real-world obstacles like JavaScript-heavy sites and anti-bot measures. No fluff, just the techniques that actually work.
Why R for Web Scraping?
Before diving into code, let's address the elephant in the room: why use R when Python dominates web scraping discussions?
R excels at web scraping for one simple reason—you can extract, clean, analyze, and visualize data without switching tools. If you're already using R for data analysis, adding scraping to your workflow is seamless.
The tidyverse integration is another huge advantage. You can pipe scraped data directly into dplyr for cleaning or ggplot2 for visualization. This end-to-end workflow is why I prefer R for research and analysis projects, even though I'll reach for Python when building production scrapers.
Essential R Packages for Web Scraping
The R ecosystem has matured significantly. Here are the packages you actually need:
Core scraping:
- rvest - Your main tool for parsing HTML and extracting data
- httr2 - Advanced HTTP requests (successor to httr, much more powerful)
- xml2 - XML/HTML parsing engine that powers rvest
Dynamic content:
- chromote - Lightweight Chrome automation (my preference over RSelenium)
- RSelenium - Full browser automation when you need maximum control
Being a good citizen:
- polite - Automatically respects robots.txt and rate limits (criminally underused)
- robotstxt - Parse and check robots.txt files manually
Performance:
- furrr - Parallel processing with future backend (cleaner than doParallel)
- memoise - Cache function calls to avoid re-scraping
Install everything with:
install.packages(c("rvest", "httr2", "xml2", "polite", "robotstxt", "chromote", "RSelenium", "furrr", "memoise"))
Your First R Web Scraper
Let's build a basic scraper that actually works. We'll extract book data from a practice scraping site.
library(rvest)
library(dplyr)
# Target URL
url <- "http://books.toscrape.com/catalogue/page-1.html"
# Read the HTML
page <- read_html(url)
# Extract book titles
titles <- page %>%
  html_elements("h3 a") %>%
  html_attr("title")
# Extract prices
prices <- page %>%
  html_elements(".price_color") %>%
  html_text() %>%
  gsub("£", "", .) %>%
  as.numeric()
# Extract ratings
ratings <- page %>%
  html_elements(".star-rating") %>%
  html_attr("class") %>%
  gsub("star-rating ", "", .)
# Combine into dataframe
books_df <- data.frame(
  title = titles,
  price = prices,
  rating = ratings,
  stringsAsFactors = FALSE
)
head(books_df, 3)
This code demonstrates the core rvest workflow: read_html() fetches the page, html_elements() selects nodes using CSS selectors, and html_text() or html_attr() extracts the data you want.
A note on selectors: Use your browser's DevTools to find the right CSS selectors. Right-click an element, choose "Inspect," and you'll see the HTML structure. Chrome and Firefox also have a "Copy selector" option that gives you the exact path.
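For instance, these are the selector patterns you'll reach for most often with html_elements() (the #promotions id is made up for illustration; the other selectors exist on the books.toscrape.com practice site):
library(rvest)
page <- read_html("http://books.toscrape.com/catalogue/page-1.html")
page %>% html_elements("h3 a")                                # tag nested inside a tag
page %>% html_elements(".price_color")                        # by class
page %>% html_elements("#promotions")                         # by id (made up for illustration)
page %>% html_elements("article.product_pod p.price_color")   # tag + class, nested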
Scraping Multiple Pages Efficiently
Single-page scraping is great for learning, but real projects involve hundreds or thousands of pages. Here's how to handle pagination without overwhelming servers.
The Basic Loop Approach
library(rvest)
library(purrr)
scrape_page <- function(page_num) {
  url <- paste0("http://books.toscrape.com/catalogue/page-", page_num, ".html")
  # Add small delay to be respectful
  Sys.sleep(1)
  page <- read_html(url)
  titles <- page %>%
    html_elements("h3 a") %>%
    html_attr("title")
  prices <- page %>%
    html_elements(".price_color") %>%
    html_text() %>%
    gsub("£", "", .) %>%
    as.numeric()
  data.frame(
    title = titles,
    price = prices,
    page = page_num
  )
}
# Scrape pages 1-5
all_books <- map_dfr(1:5, scrape_page)
This works, but it's sequential and slow. Each page waits for the previous one to finish.
Parallel Scraping with furrr
Here's where things get interesting. The furrr package lets you parallelize any purrr function with minimal code changes:
library(furrr)
# Set up parallel processing
plan(multisession, workers = 4)  # Use 4 cores
# Same scrape_page function from before
scrape_page <- function(page_num) {
  url <- paste0("http://books.toscrape.com/catalogue/page-", page_num, ".html")
  Sys.sleep(1)
  page <- read_html(url)
  titles <- page %>%
    html_elements("h3 a") %>%
    html_attr("title")
  prices <- page %>%
    html_elements(".price_color") %>%
    html_text() %>%
    gsub("£", "", .) %>%
    as.numeric()
  data.frame(title = titles, price = prices, page = page_num)
}
# Parallel scraping - just replace map_dfr with future_map_dfr
all_books <- future_map_dfr(1:20, scrape_page, .progress = TRUE)
The future_map_dfr() call runs your function across multiple cores simultaneously. On my machine, scraping 20 pages sequentially takes ~25 seconds. With 4 cores, it drops to ~8 seconds.
Important: Don't use too many workers. Start with detectCores() - 1 to leave one core free for other processes. And always add delays—parallel requests can easily overwhelm small servers.
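A minimal sketch of that setup (detectCores() comes from the base parallel package, which is always available):
library(furrr)
# Leave one core free for the rest of the system
n_workers <- max(1, parallel::detectCores() - 1)
plan(multisession, workers = n_workers)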
The Polite Package: Be a Good Web Citizen
Most web scraping tutorials ignore ethics and etiquette. That's a mistake. The polite package makes responsible scraping automatic.
Here's the problem with basic rvest: it doesn't check robots.txt, doesn't identify your scraper properly, and makes it easy to accidentally DDoS a site. The polite package fixes all of this.
library(polite)
library(rvest)
# Introduce yourself to the host
session <- bow(
  url = "http://books.toscrape.com",
  user_agent = "Research Bot (your-email@example.com)",
  force = TRUE
)
# The session object contains info about what you can scrape
print(session)
When you call bow(), polite:
- Fetches and parses robots.txt
- Checks if your target URL is allowed (a check you can also run yourself with robotstxt—see the sketch after this list)
- Sets appropriate rate limits
- Caches robots.txt so you don't re-fetch it
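If you want to run that allowed-or-not check manually—say, before committing to a scraping project—the robotstxt package mentioned earlier does it directly; a small sketch:
library(robotstxt)
# TRUE if generic bots may crawl this path according to the site's robots.txt
paths_allowed(
  paths = "/catalogue/page-1.html",
  domain = "books.toscrape.com"
)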
Now scrape using the session:
# Navigate to a specific page
page_session <- nod(session, path = "/catalogue/page-1.html")
# Scrape the page (with automatic rate limiting)
page_data <- scrape(page_session)
# Extract data as usual
titles <- page_data %>%
html_elements("h3 a") %>%
html_attr("title")
The magic happens behind the scenes. The scrape() function:
- Enforces rate limits (default 5 seconds between requests; see the sketch after this list for adjusting it)
- Caches responses to avoid duplicate requests
- Respects crawl-delay directives from robots.txt
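If the default pause is slower than you need, bow() accepts a delay argument; a minimal sketch (polite is still expected to respect any crawl-delay the site declares in robots.txt):
library(polite)
# Ask for a 2-second pause between requests instead of the 5-second default
session <- bow(
  "http://books.toscrape.com",
  user_agent = "Research Bot (your-email@example.com)",
  delay = 2
)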
Polite Pagination Loop
Here's how to combine polite with pagination:
library(polite)
library(rvest)
library(dplyr)
library(purrr)
# Initialize session
session <- bow("http://books.toscrape.com")
scrape_books_polite <- function(page_num, session) {
  # Update path for this page
  current_page <- nod(session, path = paste0("/catalogue/page-", page_num, ".html"))
  # Scrape (rate limiting happens automatically)
  page <- scrape(current_page)
  # Extract data
  data.frame(
    title = page %>% html_elements("h3 a") %>% html_attr("title"),
    price = page %>% html_elements(".price_color") %>% html_text()
  )
}
# Scrape multiple pages
all_books <- map_dfr(1:10, ~scrape_books_polite(.x, session))
The polite package is one of those tools that separates professional scrapers from beginners. Use it.
Handling Dynamic Content with chromote
Many modern websites load content dynamically with JavaScript. When you scrape these sites with rvest, you get empty pages because the content hasn't rendered yet.
RSelenium is the traditional solution, but it's heavy—requires Java, external servers, and lots of setup. The chromote package is lighter and more R-native.
Basic chromote Setup
library(chromote)
library(rvest)
# Create a new Chrome session
browser <- ChromoteSession$new()
# Navigate to a JavaScript-heavy page
browser$Page$navigate("https://example.com/dynamic-content")
# Wait for page to load
browser$Page$loadEventFired()
# Get the fully rendered HTML
html_content <- browser$Runtime$evaluate("document.documentElement.outerHTML")$result$value
# Parse with rvest
page <- read_html(html_content)
# Extract data as usual
data <- page %>%
html_elements(".dynamic-content") %>%
html_text()
# Close browser when done
browser$close()
This opens a headless Chrome instance, waits for JavaScript to execute, and grabs the rendered HTML. You then parse it with rvest like any static page.
Handling Infinite Scroll
Here's a real-world challenge: sites that load more content as you scroll. You can automate scrolling with chromote:
library(chromote)
browser <- ChromoteSession$new()
browser$Page$navigate("https://example.com/infinite-scroll")
browser$Page$loadEventFired()
# Function to scroll to bottom
scroll_to_bottom <- function(browser, pause = 2) {
  # Get current scroll height
  prev_height <- browser$Runtime$evaluate(
    "document.body.scrollHeight"
  )$result$value
  # Scroll to bottom
  browser$Runtime$evaluate("window.scrollTo(0, document.body.scrollHeight);")
  # Wait for new content to load
  Sys.sleep(pause)
  # Get new scroll height
  new_height <- browser$Runtime$evaluate(
    "document.body.scrollHeight"
  )$result$value
  return(new_height > prev_height)
}
# Keep scrolling until no new content loads
scroll_count <- 0
while (scroll_to_bottom(browser) && scroll_count < 20) {
  scroll_count <- scroll_count + 1
  message(paste("Scroll", scroll_count))
}
# Now extract all loaded content
html_content <- browser$Runtime$evaluate(
  "document.documentElement.outerHTML"
)$result$value
page <- read_html(html_content)
browser$close()
This scrolls, waits for content to load, checks if the page got longer, and repeats. The scroll_count < 20 prevents infinite loops.
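One refinement: fixed Sys.sleep() pauses waste time on fast connections and fail on slow ones. Here's a sketch of polling for a specific element instead, using the same Runtime$evaluate() pattern (the .result-card selector is a hypothetical placeholder):
wait_for_selector <- function(browser, selector, timeout = 10, poll = 0.25) {
  js <- sprintf("document.querySelector('%s') !== null", selector)
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    found <- browser$Runtime$evaluate(js)$result$value
    if (isTRUE(found)) return(TRUE)
    Sys.sleep(poll)  # Check again shortly
  }
  FALSE  # Timed out without finding the element
}
# Usage: wait_for_selector(browser, ".result-card") before grabbing the HTML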
Advanced HTTP Handling with httr2
Sometimes rvest isn't enough. Maybe you need custom headers, authentication, or precise control over requests. That's when you reach for httr2.
Custom Headers to Mimic Browsers
Anti-bot systems often block requests that look suspicious. Setting proper headers helps:
library(httr2)
library(rvest)
# Build a request with proper headers
req <- request("https://example.com/products") %>%
  req_headers(
    "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" = "en-US,en;q=0.5",
    "Accept-Encoding" = "gzip, deflate",
    "Connection" = "keep-alive"
  )
# Perform the request
resp <- req_perform(req)
# Extract HTML and parse
html <- resp_body_html(resp)
# Now use rvest
data <- html %>%
  html_elements(".product") %>%
  html_text()
The req_headers() function makes your scraper look more like a real browser. Many sites check User-Agent and Accept headers—sending the right ones reduces blocks.
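When the User-Agent is all you care about, httr2 also has a dedicated helper; a quick sketch:
library(httr2)
req <- request("https://example.com/products") %>%
  req_user_agent("Research Bot (your-email@example.com)")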
Handling Sessions and Cookies
Some sites require login or maintain state through cookies. httr2 handles this elegantly:
library(httr2)
library(rvest)
# Shared cookie file so cookies persist across requests
cookie_path <- tempfile()
base_req <- request("https://example.com") %>%
  req_cookie_preserve(cookie_path)
# Login (example POST request)
login_resp <- base_req %>%
  req_url_path("/login") %>%
  req_body_form(
    username = "your_username",
    password = "your_password"
  ) %>%
  req_perform()
# Cookies are now stored - later requests that reuse the same cookie file send them
protected_page <- base_req %>%
  req_url_path("/protected-data") %>%
  req_perform() %>%
  resp_body_html()
# Parse as usual
data <- protected_page %>%
  html_elements(".data") %>%
  html_text()
The req_cookie_preserve() call points every request at the same cookie file. Once you log in, subsequent requests built from base_req include the authentication cookies automatically.
Caching Strategies to Save Time
Scraping takes time. Re-scraping the same pages wastes time and bandwidth. The memoise package caches function results so you never scrape the same URL twice.
library(memoise)
library(rvest)
# Original scraping function
scrape_page_original <- function(url) {
  Sys.sleep(1)  # Respectful delay
  read_html(url)
}
# Cached version
scrape_page_cached <- memoise(scrape_page_original)
# First call scrapes and caches
page1 <- scrape_page_cached("https://example.com/page1")
# Second call returns cached result instantly
page1_again <- scrape_page_cached("https://example.com/page1")  # Instant!
# Different URL still scrapes normally
page2 <- scrape_page_cached("https://example.com/page2")  # Takes 1+ second
This is invaluable during development. If your parsing code has bugs, you can fix and re-run without re-fetching pages.
For persistent caching across R sessions:
library(memoise)
# Cache to disk instead of memory. Cache the HTML as text: parsed xml2
# documents are external pointers and don't survive being written to disk.
fetch_html_text <- function(url) {
  Sys.sleep(1)
  as.character(read_html(url))
}
cache_dir <- cache_filesystem("./scraping_cache")
fetch_html_cached <- memoise(fetch_html_text, cache = cache_dir)
# Now cached results survive R restarts
The cache directory stores results as files. Even if you close R and come back later, previously fetched pages load from disk instantly—just re-parse the cached text with read_html() before extracting data.
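memoise also accepts caches built with the cachem package, which lets cached pages expire instead of lingering forever; a sketch, reusing the fetch_html_text() helper from above and assuming a 24-hour freshness window is acceptable:
library(memoise)
library(cachem)
# Disk cache whose entries expire after 24 hours
day_cache <- cache_disk("./scraping_cache", max_age = 60 * 60 * 24)
fetch_html_cached <- memoise(fetch_html_text, cache = day_cache)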
Handling Common Scraping Challenges
Real-world scraping involves problems. Here's how to handle the most common ones.
Missing Data and Error Handling
Not every page has every element. Use possibly() from purrr to handle failures gracefully:
library(purrr)
library(rvest)
# Function that might fail
extract_price <- function(page) {
  page %>%
    html_element(".price") %>%
    html_text() %>%
    as.numeric()
}
# Wrapped version returns NA on failure instead of erroring
safe_extract_price <- possibly(extract_price, otherwise = NA_real_)
# Use in your scraper
prices <- map_dbl(pages, safe_extract_price)  # No errors, just NAs
For more control, use tryCatch:
scrape_with_retry <- function(url, max_attempts = 3) {
  attempt <- 1
  while (attempt <= max_attempts) {
    result <- tryCatch({
      read_html(url)
    }, error = function(e) {
      message(paste("Attempt", attempt, "failed:", e$message))
      if (attempt < max_attempts) {
        Sys.sleep(2^attempt)  # Exponential backoff
      }
      NULL
    })
    if (!is.null(result)) return(result)
    attempt <- attempt + 1
  }
  stop("Failed after ", max_attempts, " attempts")
}
This retries failed requests with exponential backoff—wait 2 seconds after the first failure, 4 after the second, then give up with an error once the attempts are exhausted.
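If you're building requests with httr2 anyway, it has retry-with-backoff built in, so you don't have to hand-roll the loop; a minimal sketch with a placeholder URL:
library(httr2)
# Retry transient failures (429 and 503 by default), doubling the wait each time
resp <- request("https://example.com/products") %>%
  req_retry(max_tries = 3, backoff = function(attempt) 2^attempt) %>%
  req_perform()
page <- resp_body_html(resp)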
Dealing with Rate Limits
If you're hitting 429 "Too Many Requests" errors, you need better rate limiting:
library(ratelimitr)
# Limit to 10 requests per minute
scrape_limited <- limit_rate(
  read_html,
  rate(n = 10, period = 60)
)
# Use limited function instead
pages <- map(urls, scrape_limited)
Or use the slowly() function from purrr:
library(purrr)
# Add 2-second delay between calls
scrape_slowly <- slowly(read_html, rate = rate_delay(2))
pages <- map(urls, scrape_slowly)
Both approaches work. I prefer limit_rate() when I know the server's limits, and slowly() for simple delays.
Bypassing Simple Anti-Bot Measures
Some sites block obvious bots. Here are tricks that help (use responsibly):
1. Rotate User Agents:
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
)
scrape_with_ua_rotation <- function(url) {
  ua <- sample(user_agents, 1)
  request(url) %>%
    req_headers("User-Agent" = ua) %>%
    req_perform() %>%
    resp_body_html()
}
2. Add Random Delays:
scrape_with_random_delay <- function(url) {
  delay <- runif(1, min = 1, max = 3)  # 1-3 second random delay
  Sys.sleep(delay)
  read_html(url)
}
3. Respect Referer Headers:
Some sites check where requests come from. Set the Referer header:
request(target_url) %>%
  req_headers(
    "Referer" = "https://example.com/previous-page",
    "User-Agent" = "Mozilla/5.0..."
  ) %>%
  req_perform()
Extracting Data from Tables
HTML tables are common and rvest makes them trivial:
library(rvest)
url <- "https://example.com/data-table"
page <- read_html(url)
# Extract all tables on the page
tables <- page %>%
  html_table()
# Usually you want the first table
df <- tables[[1]]
# Clean column names
names(df) <- make.names(names(df))
head(df)
The html_table() function converts HTML tables directly to dataframes. In current rvest (1.0+), irregular tables with inconsistent row lengths are padded automatically, so the old fill = TRUE argument is deprecated and no longer needed.
For tables with complex structures (nested headers, merged cells), you might need to extract manually:
library(purrr)
# Get table element
table <- page %>% html_element("table")
# Extract headers
headers <- table %>%
  html_elements("thead th") %>%
  html_text()
# Extract rows
rows <- table %>%
  html_elements("tbody tr")
# Process each row
data <- map_dfr(rows, function(row) {
  cells <- row %>%
    html_elements("td") %>%
    html_text()
  setNames(as.list(cells), headers)
})
Working with APIs (The Better Alternative)
Before scraping, always check if the site has an API. APIs are faster, more reliable, and explicitly allowed.
Many sites have undocumented APIs that their frontend uses. Find these by:
- Opening DevTools Network tab
- Filtering for XHR/Fetch requests
- Finding JSON responses
- Replicating the request in R
Example of calling a JSON API:
library(httr2)
library(jsonlite)
library(purrr)  # for map_dfr() and %||%
# Example: GitHub API
resp <- request("https://api.github.com/users/hadley/repos") %>%
  req_headers("User-Agent" = "R-script") %>%
  req_perform()
# Parse JSON
repos <- resp %>%
  resp_body_json()
# Convert to dataframe
repos_df <- map_dfr(repos, function(repo) {
  data.frame(
    name = repo$name,
    stars = repo$stargazers_count,
    language = repo$language %||% "None"
  )
})
The %||% operator is a null-coalescer—it returns the left side unless it's NULL, in which case it returns the right side. It's exported by purrr and rlang, and is part of base R as of 4.4.
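A quick illustration (with purrr loaded, or base R 4.4+):
library(purrr)
NULL %||% "None"    # "None"
"R" %||% "None"     # "R"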
Putting It All Together: Complete Example
Here's a realistic scraper that combines techniques from this guide:
library(polite)
library(rvest)
library(furrr)
library(dplyr)
# 1. Set up polite session
session <- bow(
  "http://books.toscrape.com",
  user_agent = "Research Project (your.email@example.com)"
)
# 2. Function to scrape a single page
scrape_book_page <- function(page_num, session) {
  # Navigate to page
  current_page <- nod(session, path = paste0("/catalogue/page-", page_num, ".html"))
  # Scrape with automatic rate limiting
  page <- scrape(current_page, verbose = TRUE)
  # Extract book details
  books <- data.frame(
    title = page %>% html_elements("h3 a") %>% html_attr("title"),
    price = page %>% html_elements(".price_color") %>% html_text() %>%
      gsub("[^0-9.]", "", .) %>% as.numeric(),
    availability = page %>% html_elements(".availability") %>% html_text() %>% trimws(),
    rating = page %>% html_elements(".star-rating") %>% html_attr("class") %>%
      gsub("star-rating ", "", .)
  )
  # Add page number
  books$page <- page_num
  return(books)
}
# 3. Scrape multiple pages in parallel
plan(multisession, workers = 3)
all_books <- future_map_dfr(
  1:10,
  ~scrape_book_page(.x, session),
  .progress = TRUE
)
# 4. Clean and analyze
books_cleaned <- all_books %>%
  mutate(
    in_stock = grepl("In stock", availability),
    rating_num = case_when(
      rating == "One" ~ 1,
      rating == "Two" ~ 2,
      rating == "Three" ~ 3,
      rating == "Four" ~ 4,
      rating == "Five" ~ 5
    )
  )
# 5. Summary
books_cleaned %>%
  group_by(rating_num) %>%
  summarise(
    count = n(),
    avg_price = mean(price, na.rm = TRUE)
  )
This scraper:
- Uses polite for ethical scraping
- Respects rate limits automatically
- Runs in parallel for speed (note that each worker enforces polite's delay independently, so keep the worker count low)
- Extracts multiple data points
- Cleans and summarizes results
Best Practices and Legal Considerations
Before finishing, let's talk about staying out of trouble.
Legal stuff:
- Check the site's Terms of Service
- Respect robots.txt (polite does this for you)
- Don't scrape personal data without considering privacy laws
- If they have an API, use it
Technical best practices:
- Always set a user agent that identifies you
- Implement rate limiting (1-2 seconds minimum between requests)
- Cache results to avoid duplicate requests
- Use exponential backoff on failures
- Monitor your scraper for errors
- Store raw HTML before parsing—parsing bugs happen (see the sketch after this list)
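On that last point, here's a minimal sketch of stashing the fetched HTML to disk before parsing, so extraction bugs never force a re-download (the raw_html directory name is arbitrary):
library(rvest)
library(xml2)
fetch_and_archive <- function(url, dir = "raw_html") {
  dir.create(dir, showWarnings = FALSE)
  page <- read_html(url)
  # Keep an on-disk copy of exactly what was fetched
  out_file <- file.path(dir, paste0(gsub("[^A-Za-z0-9]", "_", url), ".html"))
  write_html(page, out_file)
  page
}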
Ethical considerations:
- Don't overwhelm small servers
- Scrape during off-peak hours for heavy jobs
- Don't republish copyrighted content
- When in doubt, ask permission
Wrapping Up
Web scraping in R is more powerful than most people realize. The combination of rvest for parsing, polite for ethics, chromote for JavaScript, and furrr for speed gives you a professional toolkit.
Start simple with rvest, add polite for any serious project, bring in chromote when you hit JavaScript-heavy sites, and parallelize with furrr when speed matters. Most importantly, be respectful—scraping is a privilege, not a right.
The techniques in this guide work for 90% of scraping projects. The remaining 10%—CAPTCHAs, sophisticated anti-bot systems, complex authentication—require specialized solutions beyond this guide's scope. But you'll know when you hit those limits, and you'll have the foundation to solve them.
Now go scrape responsibly.