Go makes web scraping feel effortless once you understand its concurrency model. Where Python scrapers might struggle under heavy load or Node.js hits memory limits, Go handles thousands of concurrent requests without breaking a sweat.
I've built scrapers in Python, JavaScript, and Go over the years. Go scrapers consistently outperform the others by a wide margin—we're talking 5-10x faster execution and a fraction of the memory usage. But here's what most tutorials won't tell you: raw speed doesn't matter if your scraper gets blocked or crashes after an hour of runtime.
This guide cuts through the basics and shows you how to build production-ready Go scrapers. You'll learn the patterns that actually work at scale, not just toy examples that fall apart in the real world.
Why Go for Web Scraping?
Go isn't just another programming language for scraping—it's purpose-built for the kind of high-concurrency, network-heavy operations that scraping demands.
Here's what sets Go apart:
Native concurrency that actually works. Goroutines are lightweight (around 2KB each) and managed by the Go runtime. You can spin up 10,000 goroutines without your system grinding to a halt. Python's threading? Limited by the GIL. Node.js? Better, but still heavier.
Compiled binaries mean dead-simple deployment. Your scraper compiles to a single executable. No messing with virtual environments, package conflicts, or runtime dependencies. Copy one file to a server and run it.
Speed where it counts. Go's compiled nature delivers 5-10x faster parsing than interpreted languages. When you're processing millions of HTML pages, that difference compounds fast.
Built-in HTTP client that's production-ready. The net/http package handles connection pooling, timeouts, and redirects out of the box. You're not bolting on third-party libraries to get basic functionality.
Most Go vs. Python comparisons focus on raw speed. That's nice, but the real win is Go's concurrency model. You can write clean, readable code that scales horizontally without the complexity of async/await patterns or callback hell.
Setting Up Your Go Scraping Environment
Before diving into code, let's get your environment configured properly.
Install Go (version 1.19 or higher) from the official website. Verify the installation:
go version
Create a new project directory and initialize a Go module:
mkdir go-web-scraper
cd go-web-scraper
go mod init github.com/yourusername/go-web-scraper
This creates a go.mod file that tracks your dependencies. Unlike Python's requirements.txt or Node's package.json, Go's module system is built into the language.
Install the packages we'll use:
go get github.com/PuerkitoBio/goquery
go get github.com/chromedp/chromedp
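After those go get commands, your go.mod will look roughly like this; the exact versions depend on when you run them, and indirect dependencies are omitted:
module github.com/yourusername/go-web-scraper

go 1.19

require (
    github.com/PuerkitoBio/goquery v1.8.1
    github.com/chromedp/chromedp v0.9.2
)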
Your basic main.go file structure:
package main

import "fmt"

func main() {
    fmt.Println("Scraper initialized")
}
Run it to confirm everything works:
go run main.go
Making HTTP Requests the Right Way
Most tutorials show you how to make a basic GET request and call it a day. In production, you need connection pooling, proper timeouts, and custom headers.
Here's a bare-bones HTTP request:
package main
import (
"fmt"
"io"
"log"
"net/http"
)
func main() {
resp, err := http.Get("https://example.com")
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
log.Fatal(err)
}
fmt.Println(string(body))
}
This works, but you're using Go's default HTTP client with no timeout. If the server hangs, your scraper hangs forever. Not ideal.
Here's how to configure a proper HTTP client:
package main
import (
"context"
"fmt"
"io"
"log"
"net/http"
"time"
)
func main() {
// Create a custom HTTP client with sensible defaults
client := &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
IdleConnTimeout: 90 * time.Second,
},
}
// Create request with context for fine-grained timeout control
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
req, err := http.NewRequestWithContext(ctx, "GET", "https://example.com", nil)
if err != nil {
log.Fatal(err)
}
// Set realistic headers
req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
req.Header.Set("Accept-Language", "en-US,en;q=0.5")
resp, err := client.Do(req)
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
log.Fatalf("Bad status: %s", resp.Status)
}
body, err := io.ReadAll(resp.Body)
if err != nil {
log.Fatal(err)
}
fmt.Println(string(body))
}
Why this matters:
- MaxIdleConns and MaxIdleConnsPerHost control connection pooling. The default MaxIdleConnsPerHost is 2, which is absurdly low for concurrent scraping.
- Context with timeout gives you precise control over request duration, separate from the overall client timeout.
- Custom User-Agent makes you look like a browser, not a bot.
- Always check StatusCode before processing the response body.
Parsing HTML with Goquery
Goquery brings jQuery-like selectors to Go. If you've done any web scraping in Python with BeautifulSoup or JavaScript with Cheerio, this'll feel familiar.
Install Goquery (if you haven't already):
go get github.com/PuerkitoBio/goquery
Basic parsing example:
package main
import (
"fmt"
"log"
"net/http"
"github.com/PuerkitoBio/goquery"
)
func main() {
resp, err := http.Get("https://example.com")
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
log.Fatalf("Status code: %d", resp.StatusCode)
}
// Parse HTML
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
log.Fatal(err)
}
// Extract all links
doc.Find("a").Each(func(i int, s *goquery.Selection) {
href, exists := s.Attr("href")
if exists {
fmt.Printf("Link %d: %s\n", i, href)
}
})
}
Extracting specific data:
Let's say you're scraping an e-commerce site for product information. Here's a realistic example:
type Product struct {
Name string
Price string
URL string
ImageURL string
}
func scrapeProducts(doc *goquery.Document) []Product {
var products []Product
doc.Find(".product-card").Each(func(i int, s *goquery.Selection) {
product := Product{
Name: s.Find(".product-title").Text(),
Price: s.Find(".product-price").Text(),
}
// Extract URL
if url, exists := s.Find("a").Attr("href"); exists {
product.URL = url
}
// Extract image URL
if imgURL, exists := s.Find("img").Attr("src"); exists {
product.ImageURL = imgURL
}
products = append(products, product)
})
return products
}
Goquery's Find() method uses CSS selectors. You can chain them just like jQuery: .parent .child, div#id, [data-attribute="value"], etc.
Pro tip: When scraping production sites, HTML structures change frequently. Build in fallback logic:
// Try multiple selectors in order of preference
price := s.Find(".price-new").Text()
if price == "" {
price = s.Find(".price").Text()
}
if price == "" {
price = s.Find("[data-price]").Text()
}
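If you find yourself repeating that pattern, a small helper keeps it readable. A sketch; it needs the strings package, and the selector names are placeholders:
// firstText returns the trimmed text of the first selector that matches non-empty content.
func firstText(s *goquery.Selection, selectors ...string) string {
    for _, sel := range selectors {
        if text := strings.TrimSpace(s.Find(sel).Text()); text != "" {
            return text
        }
    }
    return ""
}
The fallback chain then collapses to price := firstText(s, ".price-new", ".price", "[data-price]").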
Concurrent Scraping with Goroutines (Done Properly)
Here's where Go really shines. Most tutorials show you this:
// DON'T DO THIS
for _, url := range urls {
go scrapeURL(url)
}
That spawns unlimited goroutines with no coordination. If urls contains 10,000 URLs, you just launched 10,000 concurrent requests. Most servers will ban you instantly, and your own machine might run out of resources.
The right way: use WaitGroups for coordination
package main
import (
"fmt"
"net/http"
"sync"
"time"
)
func scrapeURL(url string, wg *sync.WaitGroup) {
defer wg.Done()
client := &http.Client{Timeout: 10 * time.Second}
resp, err := client.Get(url)
if err != nil {
fmt.Printf("Error scraping %s: %v\n", url, err)
return
}
defer resp.Body.Close()
fmt.Printf("Scraped %s - Status: %d\n", url, resp.StatusCode)
}
func main() {
urls := []string{
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
// ... more URLs
}
var wg sync.WaitGroup
for _, url := range urls {
wg.Add(1)
go scrapeURL(url, &wg)
// Rate limiting: small delay between launches
time.Sleep(100 * time.Millisecond)
}
wg.Wait()
fmt.Println("All URLs scraped")
}
What's happening:
- WaitGroup tracks how many goroutines are running
- wg.Add(1) increments the counter before launching a goroutine
- wg.Done() (via defer) decrements it when the goroutine finishes
- wg.Wait() blocks until the counter reaches zero
The time.Sleep() between goroutine launches staggers requests, which is crude but effective for small jobs. Note that it only spaces out launches; it doesn't cap how many requests are in flight at once.
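If you want bounded concurrency without hand-rolling a pool yet, the golang.org/x/sync/errgroup package (go get golang.org/x/sync/errgroup) is a reasonable middle ground: it caps in-flight goroutines and propagates the first error. A sketch against the same kind of URL list; the limit of 5 is just an example:
package main

import (
    "context"
    "fmt"
    "net/http"
    "time"

    "golang.org/x/sync/errgroup"
)

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    }

    client := &http.Client{Timeout: 10 * time.Second}

    // The group's context is canceled as soon as any request returns an error,
    // so the remaining requests fail fast.
    g, ctx := errgroup.WithContext(context.Background())
    g.SetLimit(5) // at most 5 requests in flight

    for _, url := range urls {
        url := url // capture the loop variable (needed before Go 1.22)
        g.Go(func() error {
            req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
            if err != nil {
                return err
            }
            resp, err := client.Do(req)
            if err != nil {
                return err
            }
            defer resp.Body.Close()
            fmt.Printf("Scraped %s - Status: %d\n", url, resp.StatusCode)
            return nil
        })
    }

    if err := g.Wait(); err != nil {
        fmt.Println("scrape aborted:", err)
    }
}
The worker pool in the next section gives you the same bounded concurrency plus result channels and per-worker rate limiting, which is why it's the pattern to reach for at larger scale.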
Building a Worker Pool for Controlled Concurrency
Unlimited goroutines are dangerous. A worker pool limits concurrency to a fixed number of workers, preventing resource exhaustion.
Here's a production-ready pattern most tutorials skip:
package main
import (
"context"
"fmt"
"net/http"
"sync"
"time"
)
type ScrapeJob struct {
URL string
}
type ScrapeResult struct {
URL string
StatusCode int
Error error
}
func worker(id int, jobs <-chan ScrapeJob, results chan<- ScrapeResult, wg *sync.WaitGroup) {
defer wg.Done()
client := &http.Client{
Timeout: 15 * time.Second,
}
for job := range jobs {
fmt.Printf("Worker %d processing %s\n", id, job.URL)
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
req, err := http.NewRequestWithContext(ctx, "GET", job.URL, nil)
if err != nil {
results <- ScrapeResult{URL: job.URL, Error: err}
cancel()
continue
}
req.Header.Set("User-Agent", "Mozilla/5.0")
resp, err := client.Do(req)
cancel()
if err != nil {
results <- ScrapeResult{URL: job.URL, Error: err}
continue
}
resp.Body.Close()
results <- ScrapeResult{
URL: job.URL,
StatusCode: resp.StatusCode,
}
// Rate limiting per worker
time.Sleep(500 * time.Millisecond)
}
}
func main() {
urls := []string{
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
// Add 100+ URLs here
}
// Create channels
jobs := make(chan ScrapeJob, len(urls))
results := make(chan ScrapeResult, len(urls))
// Number of concurrent workers
numWorkers := 5
var wg sync.WaitGroup
// Start workers
for w := 1; w <= numWorkers; w++ {
wg.Add(1)
go worker(w, jobs, results, &wg)
}
// Send jobs
for _, url := range urls {
jobs <- ScrapeJob{URL: url}
}
close(jobs)
// Wait for all workers to finish
wg.Wait()
close(results)
// Process results
for result := range results {
if result.Error != nil {
fmt.Printf("Failed %s: %v\n", result.URL, result.Error)
} else {
fmt.Printf("Success %s: %d\n", result.URL, result.StatusCode)
}
}
}
Why this pattern works:
- Fixed concurrency: Only numWorkers goroutines run simultaneously, no matter how many URLs you have
- Buffered channels: Prevent blocking when sending jobs
- Per-worker rate limiting: Each worker waits between requests
- Graceful shutdown: Channels close cleanly, no goroutine leaks
This is the pattern I use in production. It scales beautifully—handle 10 URLs or 10,000 by adjusting numWorkers.
Handling Dynamic JavaScript-Rendered Content
Some sites render content with JavaScript, which means a basic HTTP request returns the page shell without the data you actually want. You need a headless browser.
Chromedp controls Chrome via the DevTools Protocol. It's faster and more memory-efficient than Selenium.
Basic chromedp example:
package main
import (
"context"
"fmt"
"log"
"time"
"github.com/chromedp/chromedp"
)
func main() {
// Create context
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
// Set timeout
ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
defer cancel()
var htmlContent string
// Navigate and extract HTML
err := chromedp.Run(ctx,
chromedp.Navigate("https://example.com"),
chromedp.WaitVisible(`body`, chromedp.ByQuery),
chromedp.OuterHTML(`html`, &htmlContent),
)
if err != nil {
log.Fatal(err)
}
fmt.Println("Page HTML length:", len(htmlContent))
}
Scraping dynamic content:
func scrapeDynamicSite(url string) ([]string, error) {
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
defer cancel()
var titles []string
err := chromedp.Run(ctx,
chromedp.Navigate(url),
// Wait for content to load
chromedp.WaitVisible(`.product-card`, chromedp.ByQueryAll),
// Extract titles
chromedp.Evaluate(`Array.from(document.querySelectorAll('.product-title')).map(el => el.textContent)`, &titles),
)
if err != nil {
return nil, err
}
return titles, nil
}
Handling infinite scroll:
Many modern sites load content as you scroll. Here's how to trigger that:
err := chromedp.Run(ctx,
chromedp.Navigate(url),
chromedp.WaitVisible(`body`),
// Scroll to bottom multiple times
chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil),
chromedp.Sleep(2 * time.Second),
chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil),
chromedp.Sleep(2 * time.Second),
chromedp.OuterHTML(`html`, &htmlContent),
)
Chromedp is powerful but resource-intensive. Use it only when necessary—for static HTML, stick with net/http and Goquery.
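When you do need it across many pages, one way to keep the overhead down is to launch Chrome once and open a fresh tab per URL rather than a fresh browser. A sketch, assuming a urls slice like the earlier examples:
func scrapeWithSharedBrowser(urls []string) error {
    // One allocator and one browser process for the whole run.
    allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(),
        chromedp.DefaultExecAllocatorOptions[:]...)
    defer cancelAlloc()

    browserCtx, cancelBrowser := chromedp.NewContext(allocCtx)
    defer cancelBrowser()

    // Run with no actions starts the browser so later contexts reuse it.
    if err := chromedp.Run(browserCtx); err != nil {
        return err
    }

    for _, url := range urls {
        // A context derived from browserCtx is a new tab, not a new browser.
        tabCtx, cancelTab := chromedp.NewContext(browserCtx)
        tabCtx, cancelTimeout := context.WithTimeout(tabCtx, 30*time.Second)

        var html string
        err := chromedp.Run(tabCtx,
            chromedp.Navigate(url),
            chromedp.WaitVisible(`body`, chromedp.ByQuery),
            chromedp.OuterHTML(`html`, &html),
        )
        cancelTimeout()
        cancelTab()
        if err != nil {
            fmt.Printf("failed %s: %v\n", url, err)
            continue
        }
        fmt.Printf("%s: %d bytes of HTML\n", url, len(html))
    }
    return nil
}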
Rate Limiting with a Token Bucket
The time.Sleep() approach works for simple cases, but it's not precise. For production scrapers, implement a token bucket rate limiter.
Here's a reusable rate limiter:
package main
import (
"context"
"fmt"
"sync"
"time"
)
type RateLimiter struct {
tokens chan struct{}
maxTokens int
refillRate time.Duration
done chan struct{}
wg sync.WaitGroup
}
func NewRateLimiter(maxTokens int, refillRate time.Duration) *RateLimiter {
rl := &RateLimiter{
tokens: make(chan struct{}, maxTokens),
maxTokens: maxTokens,
refillRate: refillRate,
done: make(chan struct{}),
}
// Fill initial tokens
for i := 0; i < maxTokens; i++ {
rl.tokens <- struct{}{}
}
// Start refill goroutine
rl.wg.Add(1)
go rl.refill()
return rl
}
func (rl *RateLimiter) refill() {
defer rl.wg.Done()
ticker := time.NewTicker(rl.refillRate)
defer ticker.Stop()
for {
select {
case <-ticker.C:
select {
case rl.tokens <- struct{}{}:
default:
}
case <-rl.done:
return
}
}
}
func (rl *RateLimiter) Wait(ctx context.Context) error {
select {
case <-rl.tokens:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
func (rl *RateLimiter) Stop() {
close(rl.done)
rl.wg.Wait()
}
func main() {
// Allow 10 requests per second
limiter := NewRateLimiter(10, 100*time.Millisecond)
defer limiter.Stop()
ctx := context.Background()
for i := 0; i < 50; i++ {
if err := limiter.Wait(ctx); err != nil {
fmt.Printf("Rate limit error: %v\n", err)
break
}
fmt.Printf("Request %d at %s\n", i, time.Now().Format("15:04:05.000"))
// Make your HTTP request here
}
}
This rate limiter ensures you never exceed your target rate, even under heavy concurrent load. It's context-aware, so you can cancel operations cleanly.
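If you'd rather not maintain the limiter yourself, the golang.org/x/time/rate package implements the same token-bucket idea and is also context-aware. A minimal sketch (go get golang.org/x/time/rate):
package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // 10 tokens per second with a burst of up to 10.
    limiter := rate.NewLimiter(rate.Limit(10), 10)
    ctx := context.Background()

    for i := 0; i < 50; i++ {
        // Wait blocks until a token is available or the context is canceled.
        if err := limiter.Wait(ctx); err != nil {
            fmt.Printf("Rate limit error: %v\n", err)
            break
        }
        fmt.Printf("Request %d at %s\n", i, time.Now().Format("15:04:05.000"))
        // Make your HTTP request here
    }
}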
Error Handling and Retry Strategies
Networks fail. Servers time out. Scrapers crash. Production code needs resilient error handling.
Here's a retry function with exponential backoff:
package main
import (
"fmt"
"math"
"math/rand"
"net/http"
"time"
)
func fetchWithRetry(url string, maxRetries int) (*http.Response, error) {
var resp *http.Response
var err error
client := &http.Client{Timeout: 15 * time.Second}
for attempt := 0; attempt < maxRetries; attempt++ {
if attempt > 0 {
// Exponential backoff with jitter
backoff := time.Duration(math.Pow(2, float64(attempt))) * time.Second
jitter := time.Duration(rand.Intn(1000)) * time.Millisecond
sleepTime := backoff + jitter
fmt.Printf("Retry %d/%d for %s after %v\n", attempt, maxRetries, url, sleepTime)
time.Sleep(sleepTime)
}
resp, err = client.Get(url)
if err == nil && resp.StatusCode == http.StatusOK {
return resp, nil
}
if err == nil {
resp.Body.Close()
// Don't retry on client errors (4xx)
if resp.StatusCode >= 400 && resp.StatusCode < 500 {
return nil, fmt.Errorf("client error %d, not retrying", resp.StatusCode)
}
}
}
if err != nil {
return nil, fmt.Errorf("max retries exceeded: %w", err)
}
return nil, fmt.Errorf("max retries exceeded, last status: %d", resp.StatusCode)
}
Why exponential backoff with jitter:
- Exponential backoff: Wait longer between each retry (1s, 2s, 4s, 8s...)
- Jitter: Add randomness to prevent thundering herd problems
- Skip retries for 4xx errors: Client errors won't fix themselves by retrying (429 Too Many Requests is the one exception worth special-casing; see the sketch below)
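Here's a hedged sketch of that 429 exception, meant to slot into fetchWithRetry's err == nil branch right after resp.Body.Close(). It assumes the server sends Retry-After as a number of seconds (some send an HTTP date instead), and it needs the strconv import:
if resp.StatusCode == http.StatusTooManyRequests {
    // Honor Retry-After if present; the loop's exponential backoff still applies on the next pass.
    if ra := resp.Header.Get("Retry-After"); ra != "" {
        if secs, convErr := strconv.Atoi(ra); convErr == nil {
            time.Sleep(time.Duration(secs) * time.Second)
        }
    }
    continue // retry instead of treating 429 like the other 4xx errors
}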
Integrate this with your worker pool:
resp, err := fetchWithRetry(job.URL, 3)
if err != nil {
    results <- ScrapeResult{URL: job.URL, Error: err}
    continue
}
// ... parse resp.Body here ...
resp.Body.Close() // close explicitly; a defer inside the worker loop would not run until the worker exits
Real-World Production Patterns
Let's pull everything together into a production-ready scraper. This handles concurrency, rate limiting, retries, and proper error handling.
package main
import (
"context"
"fmt"
"io"
"log"
"net/http"
"sync"
"time"
"github.com/PuerkitoBio/goquery"
)
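// RateLimiter and NewRateLimiter are the token-bucket implementation from the rate limiting section above.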
type Scraper struct {
client *http.Client
rateLimiter *RateLimiter
maxRetries int
}
type Product struct {
Name string
Price string
URL string
}
func NewScraper(requestsPerSecond int) *Scraper {
return &Scraper{
client: &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{
MaxIdleConns: 100,
MaxIdleConnsPerHost: 10,
},
},
rateLimiter: NewRateLimiter(requestsPerSecond, time.Second/time.Duration(requestsPerSecond)),
maxRetries: 3,
}
}
func (s *Scraper) scrapeURL(ctx context.Context, url string) ([]Product, error) {
// Rate limiting
if err := s.rateLimiter.Wait(ctx); err != nil {
return nil, err
}
// Make request with retry
var resp *http.Response
var err error
for attempt := 0; attempt < s.maxRetries; attempt++ {
if attempt > 0 {
time.Sleep(time.Duration(attempt) * time.Second)
}
req, reqErr := http.NewRequestWithContext(ctx, "GET", url, nil)
if reqErr != nil {
return nil, reqErr
}
req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
resp, err = s.client.Do(req)
if err == nil && resp.StatusCode == http.StatusOK {
break
}
if resp != nil {
resp.Body.Close()
}
}
if err != nil {
return nil, fmt.Errorf("failed after %d retries: %w", s.maxRetries, err)
}
// Give up if the last attempt still returned a non-200 response
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("failed after %d retries, last status: %d", s.maxRetries, resp.StatusCode)
}
defer resp.Body.Close()
// Parse HTML
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
return nil, err
}
return s.extractProducts(doc), nil
}
func (s *Scraper) extractProducts(doc *goquery.Document) []Product {
var products []Product
doc.Find(".product-card").Each(func(i int, sel *goquery.Selection) {
product := Product{
Name: sel.Find(".product-name").Text(),
Price: sel.Find(".product-price").Text(),
}
if url, exists := sel.Find("a").Attr("href"); exists {
product.URL = url
}
products = append(products, product)
})
return products
}
func (s *Scraper) ScrapeMultiplePages(urls []string, workers int) []Product {
jobs := make(chan string, len(urls))
results := make(chan []Product, len(urls))
var wg sync.WaitGroup
ctx := context.Background()
// Start workers
for w := 0; w < workers; w++ {
wg.Add(1)
go func(workerID int) {
defer wg.Done()
for url := range jobs {
products, err := s.scrapeURL(ctx, url)
if err != nil {
log.Printf("Worker %d error scraping %s: %v", workerID, url, err)
continue
}
results <- products
}
}(w)
}
// Send jobs
for _, url := range urls {
jobs <- url
}
close(jobs)
// Wait and close results
go func() {
wg.Wait()
close(results)
}()
// Collect all products
var allProducts []Product
for products := range results {
allProducts = append(allProducts, products...)
}
return allProducts
}
func main() {
urls := []string{
"https://example.com/products?page=1",
"https://example.com/products?page=2",
"https://example.com/products?page=3",
// Add more URLs
}
scraper := NewScraper(5) // 5 requests per second
defer scraper.rateLimiter.Stop()
products := scraper.ScrapeMultiplePages(urls, 3) // 3 workers
fmt.Printf("Scraped %d products\n", len(products))
for i, p := range products {
if i >= 10 { // print at most the first 10
break
}
fmt.Printf("- %s: %s\n", p.Name, p.Price)
}
}
This production scraper includes:
- Configurable rate limiting
- Automatic retries with backoff
- Worker pool for controlled concurrency
- Context-aware operations
- Proper resource cleanup
- Structured error handling
Common Pitfalls and How to Avoid Them
After building dozens of Go scrapers, here are the mistakes I see repeatedly:
1. Not respecting robots.txt
Always check robots.txt before scraping. It's not just ethical—many sites will ban you faster if you ignore it.
import "github.com/temoto/robotstxt"
func checkRobots(baseURL, path, userAgent string) error {
robotsURL := baseURL + "/robots.txt"
resp, err := http.Get(robotsURL)
if err != nil {
return err
}
defer resp.Body.Close()
robots, err := robotstxt.FromResponse(resp)
if err != nil {
return err
}
if !robots.TestAgent(path, userAgent) {
return fmt.Errorf("scraping %s not allowed by robots.txt", path)
}
return nil
}
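A call site might look like this; the path and user agent string are placeholders:
if err := checkRobots("https://example.com", "/products", "MyScraper/1.0"); err != nil {
    log.Printf("skipping: %v", err)
    return
}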
2. Ignoring status codes
Always validate the HTTP status before parsing:
if resp.StatusCode != http.StatusOK {
return fmt.Errorf("unexpected status: %d", resp.StatusCode)
}
3. Memory leaks from unclosed response bodies
Always defer resp.Body.Close(), even in error cases:
resp, err := client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close() // Critical!
if resp.StatusCode != http.StatusOK {
return fmt.Errorf("bad status: %d", resp.StatusCode)
}
4. Parsing errors silently
Don't ignore selector errors. Log them or use fallback logic:
title := doc.Find(".title").Text()
if title == "" {
log.Printf("Warning: No title found for %s", url)
title = "Unknown"
}
5. Unbounded goroutines
Never launch unlimited goroutines. Always use a worker pool or semaphore to limit concurrency.
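A buffered channel works as a simple semaphore when a full worker pool feels like overkill; a sketch, with the limit of 10 and the fetch logic as placeholders:
sem := make(chan struct{}, 10) // at most 10 requests in flight
var wg sync.WaitGroup

for _, url := range urls {
    wg.Add(1)
    sem <- struct{}{} // acquire a slot; blocks while 10 requests are running
    go func(u string) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot
        // fetch and parse u here
        fmt.Println("scraping", u)
    }(url)
}
wg.Wait()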
6. Not handling context cancellation
When using contexts, check for cancellation:
select {
case <-ctx.Done():
return ctx.Err()
default:
// Continue processing
}
Wrapping Up
Go gives you the tools to build scrapers that actually scale. The combination of native concurrency, compiled speed, and a robust standard library means you can handle production workloads without drowning in complexity.
The patterns in this guide—worker pools, rate limiting, retry logic—aren't academic exercises. They're what separates toy scrapers from production systems that run reliably for months.
Start with the basic HTTP client and Goquery. Add concurrency once you understand the fundamentals. Layer in rate limiting and retries when you hit real-world network issues. Most importantly, test your scraper against realistic loads before deploying it.
Go makes fast scrapers easy to write. Building scrapers that stay fast under pressure takes a bit more work, but now you have the patterns to do it right.