I've been scraping websites with C# for over five years, and I've learned that most tutorials miss the stuff that actually matters when you're building production scrapers.
They show you how to download HTML and parse it, but then your scraper gets blocked after 10 requests, crashes on large datasets, or mysteriously stops working after a week.
This guide covers everything from basic HTTP requests to the sneaky tricks that keep your scrapers running. I'll show you the code I actually use—not the textbook examples that fall apart in the real world.
Why C# for Web Scraping?
C# isn't the first language people think of for web scraping—that's usually Python. But if you're already in the .NET ecosystem, C# has some real advantages:
Strong typing catches bugs early. When you're dealing with thousands of scraped records, a typo in a property name won't silently fail at 3 AM.
Async/await makes concurrency elegant. Scraping is inherently I/O-bound, and C#'s async patterns let you handle hundreds of concurrent requests without callback hell.
Performance is solid. With modern .NET (6 and later), C# performs on par with Node.js for network operations, and significantly better than Python for CPU-intensive parsing.
Enterprise-friendly. If you're building a scraper for a company that already runs .NET services, using C# means easier deployment, monitoring, and integration.
That said, if you're just doing quick one-off scrapes and aren't already a C# developer, Python with Beautiful Soup might still be faster to prototype.
The HttpClient Trap Everyone Falls Into
Before we write any scraping code, there's one critical mistake you need to avoid. I learned this the hard way when my scraper mysteriously started failing after a few hours of running fine.
Do NOT create a new HttpClient instance for each request.
Here's what many developers do (and what you shouldn't):
// DON'T DO THIS
public async Task<string> GetPageContent(string url)
{
using (var client = new HttpClient()) // This is wrong!
{
return await client.GetStringAsync(url);
}
}
This looks clean and follows standard C# disposal patterns, but it's a disaster waiting to happen. When an HttpClient is disposed, the TCP connections it opened are closed, and each closed socket lingers in a TIME_WAIT state (up to four minutes on Windows by default). Under heavy load, you'll exhaust your available sockets and get mysterious connection failures.
The right way:
// Create ONE HttpClient instance and reuse it
private static readonly HttpClient _httpClient = new HttpClient();
public async Task<string> GetPageContent(string url)
{
return await _httpClient.GetStringAsync(url);
}
Or, even better, use IHttpClientFactory if you're on .NET Core 2.1 or later:
public class WebScraper
{
private readonly IHttpClientFactory _clientFactory;
public WebScraper(IHttpClientFactory clientFactory)
{
_clientFactory = clientFactory;
}
public async Task<string> ScrapeUrl(string url)
{
var client = _clientFactory.CreateClient();
return await client.GetStringAsync(url);
}
}
This pattern handles DNS updates correctly and manages connections efficiently. It's a subtle change, but it's the difference between a scraper that runs for months and one that crashes after a few hours.
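One piece the snippet above leaves out is registration: IHttpClientFactory has to be wired into dependency injection before it can be injected. Here's a minimal sketch, assuming .NET 8-style minimal hosting with the Microsoft.Extensions.Hosting and Microsoft.Extensions.Http packages:
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// Registers IHttpClientFactory and pools the underlying handlers
builder.Services.AddHttpClient();

// Register the scraper so the factory is injected into its constructor
builder.Services.AddTransient<WebScraper>();

var host = builder.Build();
var scraper = host.Services.GetRequiredService<WebScraper>();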
Basic Scraping with HttpClient and HtmlAgilityPack
Let's build a simple but functional scraper. We'll scrape product data from a demo e-commerce site. Here's the complete workflow:
1. Install the necessary packages:
dotnet add package HtmlAgilityPack
2. Create the scraper class:
using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
public class Product
{
public string Name { get; set; }
public decimal Price { get; set; }
public string Url { get; set; }
}
public class SimpleProductScraper
{
private static readonly HttpClient _client = new HttpClient();
public async Task<List<Product>> ScrapeProducts(string url)
{
var products = new List<Product>();
// Set a real User-Agent to avoid basic bot detection
_client.DefaultRequestHeaders.Clear();
_client.DefaultRequestHeaders.Add("User-Agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
try
{
// Download the page
var html = await _client.GetStringAsync(url);
// Parse with HtmlAgilityPack
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Select product elements (adjust selector for your target site)
var productNodes = doc.DocumentNode.SelectNodes("//div[@class='product']");
if (productNodes == null) return products;
foreach (var node in productNodes)
{
var product = new Product
{
Name = node.SelectSingleNode(".//h2")?.InnerText.Trim(),
Price = ParsePrice(node.SelectSingleNode(".//span[@class='price']")?.InnerText),
Url = node.SelectSingleNode(".//a")?.GetAttributeValue("href", "")
};
products.Add(product);
}
}
catch (HttpRequestException ex)
{
Console.WriteLine($"Request failed: {ex.Message}");
}
return products;
}
private decimal ParsePrice(string priceText)
{
if (string.IsNullOrEmpty(priceText)) return 0;
// Remove currency symbols and parse
var cleanPrice = new string(priceText.Where(c => char.IsDigit(c) || c == '.').ToArray());
// Parse with the invariant culture so "." is always treated as the decimal separator
return decimal.TryParse(cleanPrice, NumberStyles.Number, CultureInfo.InvariantCulture, out var price) ? price : 0;
}
}
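Calling it is straightforward. A quick usage sketch (the URL is a placeholder for whatever listing page you're targeting):
// Hypothetical usage from a console app's top-level statements
var scraper = new SimpleProductScraper();
var products = await scraper.ScrapeProducts("https://example.com/products");

foreach (var product in products)
{
    Console.WriteLine($"{product.Name} - {product.Price} ({product.Url})");
}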
This is your foundation. It handles the basics: making the HTTP request, parsing HTML, and extracting data. But there's a lot more to cover for production use.
Memory-Efficient Scraping for Large Datasets
Here's something most tutorials don't mention: if you're scraping thousands of pages, loading entire HTML documents into memory will eventually crash your application.
I discovered this when trying to scrape a catalog with 50,000 products. The scraper worked fine for the first 10,000, then started throwing OutOfMemoryExceptions. The solution? Stream the content and parse incrementally.
Instead of loading everything at once:
// Memory-inefficient for large files
var html = await client.GetStringAsync(url);
Stream the content:
public async Task<List<Product>> ScrapeProductsEfficiently(string url)
{
using (var response = await _client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
{
response.EnsureSuccessStatusCode();
using (var stream = await response.Content.ReadAsStreamAsync())
{
var doc = new HtmlDocument();
doc.Load(stream);
// Your parsing logic here
return ExtractProducts(doc);
}
}
}
The key is HttpCompletionOption.ResponseHeadersRead. This tells HttpClient to return as soon as the headers arrive, not after downloading the entire response body. Then we stream the content directly into HtmlAgilityPack without creating a massive string in memory.
For really large datasets, you can also stream results one item at a time with IAsyncEnumerable:
public async IAsyncEnumerable<Product> StreamProducts(string url)
{
using var response = await _client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead);
using var stream = await response.Content.ReadAsStreamAsync();
var doc = new HtmlDocument();
doc.Load(stream);
var nodes = doc.DocumentNode.SelectNodes("//div[@class='product']");
if (nodes == null) yield break;
foreach (var node in nodes)
{
yield return ParseProduct(node);
// Yield control so the consumer can process each item as it's produced
await Task.Yield();
}
}
This pattern yields one product at a time, so the consumer can process and persist each item as it's parsed instead of buffering the whole result set. It's particularly useful when writing results to a database or file as you scrape.
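Consuming it looks like this; a sketch assuming StreamProducts lives on a scraper class like the ones above, writing each product to a JSON Lines file as it arrives:
using System.Text.Json;

// Hypothetical consumer: persist each product as soon as it's yielded
await using var writer = new StreamWriter("products.jsonl", append: true);

await foreach (var product in scraper.StreamProducts("https://example.com/products"))
{
    // One JSON object per line keeps memory usage flat
    await writer.WriteLineAsync(JsonSerializer.Serialize(product));
}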
Handling Dynamic Content Without a Full Browser
Not every JavaScript-rendered site needs Selenium. Sometimes you can find the API endpoints that the JavaScript calls and hit those directly. This is way faster than running a headless browser.
How to find these endpoints:
- Open Chrome DevTools (F12)
- Go to the Network tab
- Load the page and look for XHR/Fetch requests
- Check if any return JSON data
For example, many sites load product data from something like:
https://example.com/api/products?page=1&limit=20
You can call this directly:
using System.Text.Json;
public async Task<List<Product>> ScrapeFromApi(string apiUrl)
{
var json = await _client.GetStringAsync(apiUrl);
// Most JSON APIs use camelCase property names, so match case-insensitively
var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
return JsonSerializer.Deserialize<List<Product>>(json, options);
}
This approach is orders of magnitude faster than browser automation. I once replaced a Selenium scraper that took 30 minutes to scrape 1,000 products with an API-based scraper that did the same job in 45 seconds.
When this doesn't work:
If the site uses heavy JavaScript rendering with no clean API endpoints, or if it requires complex interactions (clicking buttons, scrolling, etc.), then you'll need browser automation. But always check for API endpoints first—it's worth the 10 minutes of investigation.
Anti-Detection Techniques That Actually Work
Getting blocked is frustrating. Here are techniques I've found that actually help, starting with the basics and moving to more advanced approaches.
1. Rotate User-Agents
Basic but essential. Don't use the same User-Agent for every request:
private static readonly string[] UserAgents = {
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:123.0) Gecko/20100101 Firefox/123.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.15"
};
private static string GetRandomUserAgent()
{
return UserAgents[Random.Shared.Next(UserAgents.Length)];
}
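To actually use the rotation with a shared HttpClient, set the header per request instead of mutating DefaultRequestHeaders, which isn't safe to change while other requests are in flight. A sketch, assuming the shared _httpClient instance from earlier:
public async Task<string> GetWithRandomUserAgent(string url)
{
    // Per-request headers avoid touching shared client state
    using var request = new HttpRequestMessage(HttpMethod.Get, url);
    request.Headers.TryAddWithoutValidation("User-Agent", GetRandomUserAgent());

    using var response = await _httpClient.SendAsync(request);
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync();
}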
2. Add Realistic Headers
Real browsers send more than just User-Agent:
public void SetRealisticHeaders(HttpClient client)
{
client.DefaultRequestHeaders.Clear();
client.DefaultRequestHeaders.Add("User-Agent", GetRandomUserAgent());
client.DefaultRequestHeaders.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.9");
client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, br");
client.DefaultRequestHeaders.Add("DNT", "1");
client.DefaultRequestHeaders.Add("Connection", "keep-alive");
client.DefaultRequestHeaders.Add("Upgrade-Insecure-Requests", "1");
}
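One caveat with that Accept-Encoding header: if you advertise gzip and brotli but your handler never decompresses the response, GetStringAsync hands you compressed bytes as garbage text. A sketch of the client setup, assuming .NET Core 3.0+ for DecompressionMethods.All:
using System.Net;
using System.Net.Http;

// A handler with automatic decompression transparently handles gzip/deflate/brotli
private static readonly HttpClient _client = new HttpClient(new HttpClientHandler
{
    AutomaticDecompression = DecompressionMethods.All
});
With AutomaticDecompression enabled, the handler also sends a matching Accept-Encoding header for you, so you don't need to add it manually at all.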
3. Respect Robots.txt (Sort Of)
You should check robots.txt, but keep in mind that it's advisory: many sites disallow crawlers there without actively enforcing the rules against moderate scraping. Check it, understand the rules, then make an informed decision. If they specifically disallow your use case, look for an API or contact the site for permission instead.
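If you want to automate that check, here's a deliberately simplified sketch (using the shared _client from earlier) that fetches robots.txt and tests a path against the Disallow rules in the User-agent: * group. It ignores wildcards, Allow directives, and agent-specific groups, so treat it as a rough first pass rather than a full parser:
public async Task<bool> IsProbablyAllowed(string baseUrl, string path)
{
    var robotsUrl = new Uri(new Uri(baseUrl), "/robots.txt");
    var robotsTxt = await _client.GetStringAsync(robotsUrl);

    var inWildcardGroup = false;
    foreach (var rawLine in robotsTxt.Split('\n'))
    {
        var line = rawLine.Trim();
        if (line.StartsWith("User-agent:", StringComparison.OrdinalIgnoreCase))
        {
            inWildcardGroup = line.EndsWith("*");
        }
        else if (inWildcardGroup && line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
        {
            var rule = line.Substring("Disallow:".Length).Trim();
            if (rule.Length > 0 && path.StartsWith(rule)) return false;
        }
    }
    return true;
}
Call it with the site root and the path you intend to scrape, for example IsProbablyAllowed("https://example.com", "/products").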
4. Implement Smart Delays
Fixed delays (like Thread.Sleep(1000)) look suspicious. Real users don't click at perfect intervals:
private static async Task HumanDelay()
{
// Random delay between 1.5 and 4.5 seconds
var baseDelay = 1500;
var jitter = Random.Shared.Next(0, 3000);
await Task.Delay(baseDelay + jitter);
}
For more sophisticated patterns, implement burst behavior:
public class HumanLikeDelayStrategy
{
private int _requestCount = 0;
public async Task DelayBeforeNextRequest()
{
_requestCount++;
// Fast browsing for first few pages
if (_requestCount <= 5)
{
await Task.Delay(Random.Shared.Next(500, 1500));
}
// Occasional longer break (like human reading)
else if (_requestCount % 10 == 0)
{
await Task.Delay(Random.Shared.Next(8000, 15000));
}
// Normal browsing speed
else
{
await Task.Delay(Random.Shared.Next(2000, 4500));
}
}
}
This pattern mimics human behavior more closely: quick initial browsing, occasional pauses to "read" content, and variable speeds overall.
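Wiring the strategy into a scrape loop is just a matter of awaiting the delay before each request. A quick sketch, assuming urls is the list of pages you're working through:
var delayStrategy = new HumanLikeDelayStrategy();

foreach (var url in urls)
{
    await delayStrategy.DelayBeforeNextRequest();
    var html = await _client.GetStringAsync(url);
    // ...parse the page and store the results...
}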
When to Use (and Not Use) Selenium
Selenium gets recommended a lot, but it's heavy. Here's when it's actually worth the overhead:
Use Selenium when:
- The site heavily relies on JavaScript with no accessible API
- You need to interact with complex UI elements (dropdowns, modals, infinite scroll)
- The site actively fingerprints browsers (checking for headless indicators)
- CAPTCHAs are involved that require human-like interaction
Don't use Selenium when:
- You can access data via direct HTTP requests
- The data is in JSON API responses
- Performance matters and you're scraping thousands of pages
- You're deploying to resource-constrained environments
If you must use Selenium, here's a minimal setup that works:
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
public class SeleniumScraper
{
public async Task<List<Product>> ScrapeWithSelenium(string url)
{
var options = new ChromeOptions();
options.AddArgument("--headless");
options.AddArgument("--disable-blink-features=AutomationControlled");
options.AddArgument("--user-agent=" + GetRandomUserAgent());
using var driver = new ChromeDriver(options);
driver.Navigate().GoToUrl(url);
// Crude wait for JavaScript to render; a WebDriverWait on a known element is more reliable
await Task.Delay(3000);
var products = new List<Product>();
var elements = driver.FindElements(By.CssSelector(".product"));
foreach (var element in elements)
{
products.Add(new Product
{
Name = element.FindElement(By.CssSelector("h2")).Text,
Price = ParsePrice(element.FindElement(By.CssSelector(".price")).Text)
});
}
return products;
}
}
The --disable-blink-features=AutomationControlled flag is crucial: it keeps Chrome from reporting navigator.webdriver as true, the property that screams "I'm a bot!"
Advanced: Concurrent Scraping with Rate Limiting
When you need speed but don't want to hammer the server (and get blocked), use controlled concurrency:
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
public class ConcurrentScraper
{
private readonly SemaphoreSlim _semaphore;
private readonly int _maxConcurrency;
public ConcurrentScraper(int maxConcurrentRequests = 5)
{
_maxConcurrency = maxConcurrentRequests;
_semaphore = new SemaphoreSlim(_maxConcurrency);
}
public async Task<List<Product>> ScrapeMultiplePages(List<string> urls)
{
var tasks = urls.Select(async url =>
{
await _semaphore.WaitAsync();
try
{
await Task.Delay(Random.Shared.Next(1000, 3000)); // Rate limiting
return await ScrapeUrl(url);
}
finally
{
_semaphore.Release();
}
});
var results = await Task.WhenAll(tasks);
return results.SelectMany(r => r).ToList();
}
private async Task<List<Product>> ScrapeUrl(string url)
{
// Your scraping logic here
return new List<Product>();
}
}
This limits you to 5 concurrent requests while adding random delays. It's fast enough to scrape thousands of pages in minutes, but respectful enough to avoid most rate limiting.
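Using it is just a matter of building the URL list up front. A quick sketch with a hypothetical paging scheme:
// Scrape 100 listing pages with at most 5 requests in flight
var urls = Enumerable.Range(1, 100)
    .Select(page => $"https://example.com/products?page={page}")
    .ToList();

var scraper = new ConcurrentScraper(maxConcurrentRequests: 5);
var products = await scraper.ScrapeMultiplePages(urls);
Console.WriteLine($"Scraped {products.Count} products");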
Storing Your Scraped Data
Once you've got the data, you need to store it efficiently. Here are three approaches I use depending on the project:
For quick jobs: CSV
using CsvHelper;
using System.Globalization;
using System.IO;
public async Task SaveToCsv(List<Product> products, string filename)
{
using var writer = new StreamWriter(filename);
using var csv = new CsvWriter(writer, CultureInfo.InvariantCulture);
await csv.WriteRecordsAsync(products);
}
For structured data: SQLite (via EF Core and the Microsoft.EntityFrameworkCore.Sqlite package)
using Microsoft.EntityFrameworkCore;
public class ScrapingContext : DbContext
{
public DbSet<Product> Products { get; set; }
protected override void OnConfiguring(DbContextOptionsBuilder options)
{
options.UseSqlite("Data Source=scraped_data.db");
}
protected override void OnModelCreating(ModelBuilder modelBuilder)
{
// Product has no Id property, so use the URL as its primary key
modelBuilder.Entity<Product>().HasKey(p => p.Url);
}
}
public async Task SaveToDatabase(List<Product> products)
{
using var context = new ScrapingContext();
await context.Database.EnsureCreatedAsync();
await context.Products.AddRangeAsync(products);
await context.SaveChangesAsync();
}
For flexible schemas: JSON
using System.IO;
using System.Text.Json;
public async Task SaveToJson(List<Product> products, string filename)
{
var options = new JsonSerializerOptions { WriteIndented = true };
var json = JsonSerializer.Serialize(products, options);
await File.WriteAllTextAsync(filename, json);
}
I usually start with JSON during development (easy to inspect), then move to SQLite for production (easy to query and deduplicate).
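Deduplication in SQLite then falls out of the URL-as-key setup above; a sketch, assuming Url uniquely identifies a product:
public async Task SaveNewProducts(List<Product> products)
{
    using var context = new ScrapingContext();
    await context.Database.EnsureCreatedAsync();

    // Skip anything whose Url is already stored
    var existingUrls = await context.Products.Select(p => p.Url).ToListAsync();
    var newProducts = products.Where(p => !existingUrls.Contains(p.Url)).ToList();

    await context.Products.AddRangeAsync(newProducts);
    await context.SaveChangesAsync();
}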
Common Pitfalls and How to Avoid Them
After building dozens of scrapers, here are the mistakes I see repeatedly:
1. Not handling pagination correctly
Many sites use "infinite scroll" that loads content via AJAX. You need to either:
- Find the API endpoint that loads more results (see the sketch after this list)
- Use Selenium to scroll and trigger loading
- Check for "Load More" buttons in the HTML
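The first option usually looks something like this; a sketch assuming a hypothetical paged JSON endpoint that returns an empty array once you run out of pages:
public async Task<List<Product>> ScrapeAllPages(string baseApiUrl)
{
    var all = new List<Product>();
    var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

    for (var page = 1; ; page++)
    {
        // Hypothetical paging parameters - adjust to match the real endpoint
        var json = await _client.GetStringAsync($"{baseApiUrl}?page={page}&limit=50");
        var batch = JsonSerializer.Deserialize<List<Product>>(json, options);

        if (batch == null || batch.Count == 0) break; // no more pages

        all.AddRange(batch);
        await Task.Delay(Random.Shared.Next(1000, 3000)); // be polite between pages
    }

    return all;
}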
2. Ignoring SSL certificate errors
In development, you might be tempted to ignore SSL errors:
// Don't do this in production
var handler = new HttpClientHandler();
handler.ServerCertificateCustomValidationCallback =
HttpClientHandler.DangerousAcceptAnyServerCertificateValidator;
This opens security holes. If you need to scrape a site with certificate issues, diagnose the actual problem first.
3. Not implementing retry logic
Networks fail. Servers timeout. Implement exponential backoff:
public async Task<string> GetWithRetry(string url, int maxRetries = 3)
{
for (int i = 0; i < maxRetries; i++)
{
try
{
return await _client.GetStringAsync(url);
}
catch (HttpRequestException) when (i < maxRetries - 1)
{
await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, i)));
}
}
throw new Exception($"Failed after {maxRetries} attempts");
}
4. Scraping too aggressively
If your scraper is making 100 requests per second, you'll get blocked. Period. Even if the site doesn't have explicit rate limiting, you're likely overloading their servers or triggering DDoS protection. Keep it reasonable—5-10 requests per second is often plenty fast and much more sustainable.
Legal and Ethical Considerations
I'm not a lawyer, but here's what you should know: web scraping exists in a legal gray area. In the US, the general consensus from recent cases is that scraping public data is legal, but:
- Respect robots.txt (even if not legally required)
- Don't create unreasonable server load
- Don't scrape data behind authentication without permission
- Don't use scraped data in ways that violate the site's terms of service
- Personal data (especially in EU) is subject to GDPR
When in doubt, look for an official API first. Many sites provide APIs for legitimate use cases, and they're always better than scraping.
Wrapping Up
C# is a surprisingly solid choice for web scraping. You get strong typing, great async support, and excellent performance with modern .NET. The key is avoiding common traps like improper HttpClient usage and understanding when you actually need browser automation versus simple HTTP requests.
The techniques I've covered here—from memory-efficient streaming to human-like delay patterns—are what separate scrapers that work for a day from those that run reliably for months. Start with the basics, add complexity only when needed, and always respect the sites you're scraping.
If you're building scrapers at scale, you'll eventually want to look into rotating proxies, CAPTCHA solving services, or even commercial scraping APIs. But for most projects, the approaches in this guide will get you 90% of the way there without spending a dime on external services.
Now go build something useful. Just please don't take down anyone's server while doing it.