Go is Google's open-source programming language, and it typically runs several times faster than Python on data extraction tasks. Its native goroutines let you spin up thousands of concurrent scrapers without exhausting memory.
In this guide, you'll learn to build production-ready Golang scrapers that handle dynamic JavaScript content, bypass modern anti-bot systems, and scale to millions of pages.
What You'll Learn
- Three different scraping approaches: net/http, Colly, and chromedp
- Concurrent scraping with goroutines and worker pools
- TLS fingerprint spoofing to bypass Cloudflare and DataDome
- Handling JavaScript-rendered content with headless browsers
- Production patterns for rate limiting, retries, and data pipelines
Why Go Beats Python for Large-Scale Scraping
Go is a compiled language. This means your scraper runs as native machine code, not interpreted bytecode.
In typical benchmarks, a Go scraper finishes in half the time (or less) that an equivalent Python scraper needs on the same dataset. When you're processing millions of pages, that difference compounds.
Goroutines change everything. Python's Global Interpreter Lock (GIL) limits true parallelism. Go's goroutines are lightweight threads (~2KB each) that run concurrently without the threading headaches.
You can spawn 10,000 concurrent scrapers on a modest VPS. Try that in Python.
Single binary deployment. No virtual environments. No dependency conflicts. Compile once, run anywhere.
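For example, cross-compiling the scraper for a typical Linux server takes one command; copy the resulting binary over and run it, with no runtime to install:
GOOS=linux GOARCH=amd64 go build -o scraper .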
Step 1: Set Up Your Go Scraping Environment
First, install Go 1.21+ from the official website. Then create your project:
mkdir go-scraper && cd go-scraper
go mod init github.com/yourusername/go-scraper
This creates a go.mod file that manages your dependencies automatically.
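After that command, go.mod contains little more than your module path and Go version (the exact version line depends on your toolchain):
module github.com/yourusername/go-scraper

go 1.21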
Create your entry point main.go:
package main

import (
    "fmt"
    "log"
)

func main() {
    fmt.Println("Scraper initialized")
    log.Println("Ready to extract data")
}
Run it with go run main.go. You should see both messages printed.
Why this matters: Go's module system means no pip freeze, no requirements.txt conflicts, no "works on my machine" problems.
Step 2: Build Your First Scraper with net/http
Most tutorials jump straight to Colly. That's a mistake.
Understanding Go's standard library gives you complete control. You'll know exactly what's happening under the hood.
Here's a production-ready HTTP client:
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "time"
)

func createClient() *http.Client {
    return &http.Client{
        Timeout: 30 * time.Second,
        Transport: &http.Transport{
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,
            IdleConnTimeout:     90 * time.Second,
        },
    }
}

func main() {
    client := createClient()

    req, err := http.NewRequest("GET", "https://httpbin.org/headers", nil)
    if err != nil {
        log.Fatal(err)
    }

    // Set realistic browser headers
    req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
    req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    req.Header.Set("Accept-Language", "en-US,en;q=0.9")

    resp, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(body))
}
The Transport config is critical. MaxIdleConns and MaxIdleConnsPerHost control connection pooling. Without sensible limits, a busy scraper opens far more TCP connections than it reuses and can exhaust file descriptors.
Step 3: Parse HTML with goquery
goquery gives you jQuery-style selectors in Go. Install it:
go get github.com/PuerkitoBio/goquery
Here's how to extract product data from an e-commerce page:
package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

type Product struct {
    Name  string
    Price string
    URL   string
}

func scrapeProducts(url string) ([]Product, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil, err
    }

    var products []Product
    doc.Find(".product-card").Each(func(i int, s *goquery.Selection) {
        name := strings.TrimSpace(s.Find(".product-title").Text())
        price := strings.TrimSpace(s.Find(".product-price").Text())
        link, _ := s.Find("a").Attr("href")

        products = append(products, Product{
            Name:  name,
            Price: price,
            URL:   link,
        })
    })

    return products, nil
}

func main() {
    products, err := scrapeProducts("https://example.com/products")
    if err != nil {
        log.Fatal(err)
    }

    for _, p := range products {
        fmt.Printf("Product: %s | Price: %s\n", p.Name, p.Price)
    }
}
Pro tip: Use strings.TrimSpace() on every extracted field. HTML often includes whitespace and newlines that mess up your data.
goquery's CSS selectors work exactly like JavaScript's document.querySelectorAll(). If you can select it in browser DevTools, you can select it in goquery.
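A quick way to confirm this is to run goquery against an inline HTML snippet; no network needed. The class names below are made up for the demo, so swap in whatever you see in DevTools:
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<div class="product-card">
        <a href="/p/1"><span class="product-title"> Gopher Mug </span></a>
        <span class="product-price">$12.99</span>
    </div>`

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }

    // Same selector syntax you would use in document.querySelectorAll()
    doc.Find("div.product-card").Each(func(i int, s *goquery.Selection) {
        name := strings.TrimSpace(s.Find(".product-title").Text())
        price := strings.TrimSpace(s.Find(".product-price").Text())
        href, _ := s.Find("a[href]").Attr("href")
        fmt.Println(name, price, href)
    })
}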
Step 4: Scale with Concurrent Goroutines
This is where Go destroys Python. True parallel scraping with minimal code.
Here's a worker pool pattern that handles 1000+ URLs efficiently:
package main

import (
    "fmt"
    "log"
    "net/http"
    "sync"
    "time"
)

type ScrapeResult struct {
    URL        string
    StatusCode int
    Error      error
}

func worker(id int, jobs <-chan string, results chan<- ScrapeResult, wg *sync.WaitGroup) {
    defer wg.Done()
    client := &http.Client{Timeout: 15 * time.Second}

    for url := range jobs {
        log.Printf("Worker %d processing: %s", id, url)

        resp, err := client.Get(url)
        if err != nil {
            results <- ScrapeResult{URL: url, Error: err}
            continue
        }
        resp.Body.Close()

        results <- ScrapeResult{URL: url, StatusCode: resp.StatusCode}

        // Rate limit per worker
        time.Sleep(500 * time.Millisecond)
    }
}

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        // Add hundreds more...
    }

    workers := 10
    jobs := make(chan string, len(urls))
    results := make(chan ScrapeResult, len(urls))
    var wg sync.WaitGroup

    // Start workers
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go worker(i, jobs, results, &wg)
    }

    // Send jobs
    for _, url := range urls {
        jobs <- url
    }
    close(jobs)

    // Collect results in background
    go func() {
        wg.Wait()
        close(results)
    }()

    // Process results
    for result := range results {
        if result.Error != nil {
            log.Printf("Failed: %s - %v", result.URL, result.Error)
        } else {
            fmt.Printf("Success: %s [%d]\n", result.URL, result.StatusCode)
        }
    }
}
How this works: The jobs channel distributes URLs to workers. Each worker processes URLs independently.
The sync.WaitGroup ensures we don't exit before all workers finish. Channels handle synchronization automatically.
Tuning tip: Start with 10 workers. Increase until you see 429 (rate limit) responses. Then back off.
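If you prefer not to wire up channels and WaitGroups by hand, the golang.org/x/sync/errgroup package (go get golang.org/x/sync/errgroup) gives you a bounded worker pool in a few lines. This is an alternative sketch, not a replacement for the pattern above:
package main

import (
    "log"
    "net/http"
    "time"

    "golang.org/x/sync/errgroup"
)

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
    }

    client := &http.Client{Timeout: 15 * time.Second}

    var g errgroup.Group
    g.SetLimit(10) // at most 10 goroutines in flight, like 10 workers

    for _, url := range urls {
        url := url // capture loop variable (required before Go 1.22)
        g.Go(func() error {
            resp, err := client.Get(url)
            if err != nil {
                log.Printf("failed: %s - %v", url, err)
                return nil // log and continue instead of cancelling the group
            }
            resp.Body.Close()
            log.Printf("success: %s [%d]", url, resp.StatusCode)
            return nil
        })
    }

    if err := g.Wait(); err != nil {
        log.Fatal(err)
    }
}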
Step 5: Use Colly for Crawling Entire Sites
Colly is Go's most popular scraping framework. It handles cookies, redirects, and link following automatically.
go get github.com/gocolly/colly/v2
Here's a complete crawler that follows pagination:
package main

import (
    "encoding/csv"
    "log"
    "os"
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Create output file
    file, err := os.Create("products.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()
    writer.Write([]string{"Name", "Price", "URL"})

    c := colly.NewCollector(
        colly.AllowedDomains("scrapingcourse.com"),
        colly.MaxDepth(3),
    )

    // Configure rate limiting
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 5,
        Delay:       1 * time.Second,
    })

    // Set realistic headers
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0")
        log.Println("Visiting:", r.URL)
    })

    // Extract product data
    c.OnHTML(".product-card", func(e *colly.HTMLElement) {
        name := e.ChildText(".product-name")
        price := e.ChildText(".product-price")
        url := e.ChildAttr("a", "href")
        writer.Write([]string{name, price, url})
    })

    // Follow pagination links
    c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
        nextPage := e.Attr("href")
        e.Request.Visit(nextPage)
    })

    // Handle errors gracefully
    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error on %s: %v", r.Request.URL, err)
    })

    c.Visit("https://scrapingcourse.com/ecommerce/page/1")
}
Colly's LimitRule keeps you from hammering servers: Delay spaces requests out, and Parallelism caps the collector at 5 concurrent requests once it runs asynchronously.
The callback pattern is powerful. OnHTML fires for every matching element. OnRequest lets you modify headers before each request. OnError catches failures without crashing.
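Colly can also run requests concurrently. Turning on colly.Async pairs with the LimitRule above; the only extra requirement is calling c.Wait() before exiting. A minimal sketch:
package main

import (
    "log"
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("scrapingcourse.com"),
        colly.Async(true), // requests run in parallel goroutines
    )

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 5,
        Delay:       1 * time.Second,
    })

    c.OnResponse(func(r *colly.Response) {
        log.Printf("got %d bytes from %s", len(r.Body), r.Request.URL)
    })

    c.Visit("https://scrapingcourse.com/ecommerce/page/1")
    c.Wait() // block until all queued requests finish
}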
Step 6: Scrape JavaScript-Heavy Pages with chromedp
When sites load content via JavaScript, HTTP requests won't work. You need a real browser.
chromedp controls Chrome/Chromium through the DevTools Protocol:
go get github.com/chromedp/chromedp
Here's how to scrape an infinite scroll page:
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/chromedp/chromedp"
)

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Set timeout
    ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
    defer cancel()

    var products []string

    err := chromedp.Run(ctx,
        chromedp.Navigate("https://scrapingcourse.com/infinite-scrolling"),

        // Wait for initial content
        chromedp.WaitVisible(".product-card", chromedp.ByQuery),

        // Scroll down 5 times to load more products
        chromedp.ActionFunc(func(ctx context.Context) error {
            for i := 0; i < 5; i++ {
                if err := chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil).Do(ctx); err != nil {
                    return err
                }
                time.Sleep(2 * time.Second)
            }
            return nil
        }),

        // Extract all product names
        chromedp.Evaluate(`
            Array.from(document.querySelectorAll('.product-name'))
                .map(el => el.textContent.trim())
        `, &products),
    )
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Found %d products:\n", len(products))
    for _, name := range products {
        fmt.Println("-", name)
    }
}
The JavaScript evaluation is key. You're running real browser JavaScript and pulling results back into Go.
chromedp consumes significant resources. Use it only when necessary. For static HTML, stick with net/http and goquery.
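One way to keep its footprint down is to start Chrome with explicit flags through an exec allocator. The flags below are common choices, not requirements; tune them for your environment (disabling images is an assumption that your target doesn't need them):
package main

import (
    "context"
    "log"
    "time"

    "github.com/chromedp/chromedp"
)

func main() {
    // chromedp's defaults already run headless; these flags trim resource use further.
    opts := append(chromedp.DefaultExecAllocatorOptions[:],
        chromedp.DisableGPU,
        chromedp.Flag("blink-settings", "imagesEnabled=false"), // skip image downloads
    )

    allocCtx, allocCancel := chromedp.NewExecAllocator(context.Background(), opts...)
    defer allocCancel()

    ctx, cancel := chromedp.NewContext(allocCtx)
    defer cancel()

    ctx, timeoutCancel := context.WithTimeout(ctx, 30*time.Second)
    defer timeoutCancel()

    var title string
    if err := chromedp.Run(ctx,
        chromedp.Navigate("https://example.com"),
        chromedp.Title(&title),
    ); err != nil {
        log.Fatal(err)
    }
    log.Println("Page title:", title)
}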
Step 7: Bypass Anti-Bot Protection
Modern sites use Cloudflare, DataDome, and PerimeterX. They detect scrapers through multiple signals.
Header Rotation
Rotating User-Agents alone isn't enough anymore. You need the full header set:
package main

import (
    "math/rand"
    "net/http"
    "sync"
)

type HeaderRotator struct {
    userAgents []string
    accepts    []string
    languages  []string
    mu         sync.Mutex
}

func NewHeaderRotator() *HeaderRotator {
    return &HeaderRotator{
        userAgents: []string{
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        },
        accepts: []string{
            "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        },
        languages: []string{
            "en-US,en;q=0.9",
            "en-GB,en;q=0.9",
            "en-US,en;q=0.8,es;q=0.6",
        },
    }
}

func (h *HeaderRotator) Apply(req *http.Request) {
    h.mu.Lock()
    defer h.mu.Unlock()

    // math/rand is seeded automatically since Go 1.20, so no rand.Seed call is needed.
    req.Header.Set("User-Agent", h.userAgents[rand.Intn(len(h.userAgents))])
    req.Header.Set("Accept", h.accepts[rand.Intn(len(h.accepts))])
    req.Header.Set("Accept-Language", h.languages[rand.Intn(len(h.languages))])
    req.Header.Set("Accept-Encoding", "gzip, deflate, br")
    req.Header.Set("Connection", "keep-alive")
    req.Header.Set("Upgrade-Insecure-Requests", "1")
    req.Header.Set("Sec-Fetch-Dest", "document")
    req.Header.Set("Sec-Fetch-Mode", "navigate")
    req.Header.Set("Sec-Fetch-Site", "none")
    req.Header.Set("Sec-Fetch-User", "?1")
}
The Sec-Fetch headers matter. Cloudflare checks for these browser-specific headers that scrapers typically miss.
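Here's a minimal way to exercise the rotator, assuming the HeaderRotator code above sits in the same file (add "fmt" to its import block):
func main() {
    rotator := NewHeaderRotator()

    req, err := http.NewRequest("GET", "https://httpbin.org/headers", nil)
    if err != nil {
        panic(err)
    }

    rotator.Apply(req) // a different realistic header set on every call
    fmt.Println("User-Agent:", req.Header.Get("User-Agent"))
    fmt.Println("Accept-Language:", req.Header.Get("Accept-Language"))
}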
Proxy Rotation
IP rotation is essential for large-scale scraping. Here's a proxy rotator:
package main

import (
    "log"
    "math/rand"
    "net/http"
    "net/url"
    "time"
)

type ProxyRotator struct {
    proxies []string
}

func NewProxyRotator(proxies []string) *ProxyRotator {
    return &ProxyRotator{proxies: proxies}
}

func (p *ProxyRotator) GetClient() (*http.Client, error) {
    // Pick a random proxy; math/rand is seeded automatically since Go 1.20.
    proxyURL := p.proxies[rand.Intn(len(p.proxies))]

    proxy, err := url.Parse(proxyURL)
    if err != nil {
        return nil, err
    }

    transport := &http.Transport{
        Proxy: http.ProxyURL(proxy),
    }

    return &http.Client{
        Transport: transport,
        Timeout:   30 * time.Second,
    }, nil
}

// Usage example
func main() {
    rotator := NewProxyRotator([]string{
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
    })

    client, err := rotator.GetClient()
    if err != nil {
        log.Fatal(err)
    }

    resp, err := client.Get("https://httpbin.org/ip")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    // Process response...
}
For production scraping at scale, residential proxies from providers like Roundproxies.com give you real consumer IP addresses that rarely get blocked.
Step 8: Spoof TLS Fingerprints
This is advanced territory. Anti-bot systems fingerprint your TLS handshake.
When your Go scraper connects via HTTPS, it sends cipher suites and extensions in a specific order. This creates a unique fingerprint (called JA3).
Cloudflare compares your JA3 fingerprint against known browsers. If it matches Python's requests library or Go's default TLS client, you're blocked.
CycleTLS solves this:
package main

import (
    "log"

    "github.com/Danny-Dasilva/CycleTLS/cycletls"
)

func main() {
    client := cycletls.Init()

    // Chrome 120 JA3 fingerprint
    ja3 := "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,0-23-65281-10-11-35-16-5-13-18-51-45-43-27-17513-21,29-23-24,0"

    response, err := client.Do("https://www.cloudflare.com", cycletls.Options{
        Body:      "",
        Ja3:       ja3,
        UserAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
    }, "GET")
    if err != nil {
        log.Fatal(err)
    }

    log.Printf("Status: %d", response.Status)
    log.Println(response.Body)
}
CycleTLS lets you specify exact JA3 fingerprints. Match Chrome's fingerprint, and anti-bots see a real browser.
Where to get valid JA3 fingerprints: Visit scrapfly.io/web-scraping-tools/ja3-fingerprint in a real browser. Copy your browser's JA3 hash and use it in your scraper.
Step 9: Implement Retry Logic with Exponential Backoff
Networks fail. Servers timeout. Your scraper needs resilience.
package main

import (
    "fmt"
    "math"
    "net/http"
    "time"
)

type RetryClient struct {
    client     *http.Client
    maxRetries int
}

func (r *RetryClient) Get(url string) (*http.Response, error) {
    var lastErr error

    for attempt := 0; attempt < r.maxRetries; attempt++ {
        resp, err := r.client.Get(url)
        if err == nil && resp.StatusCode < 500 {
            return resp, nil
        }

        lastErr = err
        if resp != nil {
            resp.Body.Close()
            lastErr = fmt.Errorf("status %d", resp.StatusCode)
        }

        // Exponential backoff: 1s, 2s, 4s, 8s...
        waitTime := time.Duration(math.Pow(2, float64(attempt))) * time.Second
        fmt.Printf("Attempt %d failed, waiting %v before retry\n", attempt+1, waitTime)
        time.Sleep(waitTime)
    }

    return nil, fmt.Errorf("all %d attempts failed: %v", r.maxRetries, lastErr)
}

func main() {
    client := &RetryClient{
        client:     &http.Client{Timeout: 15 * time.Second},
        maxRetries: 5,
    }

    resp, err := client.Get("https://example.com/api/data")
    if err != nil {
        fmt.Println("Failed:", err)
        return
    }
    defer resp.Body.Close()

    fmt.Println("Success!")
}
Why exponential backoff? Constant retries hammer failing servers. Exponential backoff gives servers time to recover.
The pattern: 1 second, 2 seconds, 4 seconds, 8 seconds. After 5 attempts, you've waited 31 seconds total.
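A common refinement is adding jitter so that many scrapers retrying at once don't all wake up at the same instant. Here's a hedged sketch of just the wait calculation; it's a variant of the backoff above, not part of the RetryClient itself:
package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"
)

// backoffWithJitter returns the base exponential delay plus up to 50% random extra.
func backoffWithJitter(attempt int) time.Duration {
    base := time.Duration(math.Pow(2, float64(attempt))) * time.Second
    jitter := time.Duration(rand.Int63n(int64(base) / 2))
    return base + jitter
}

func main() {
    for attempt := 0; attempt < 5; attempt++ {
        fmt.Printf("attempt %d: wait %v\n", attempt+1, backoffWithJitter(attempt))
    }
}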
Step 10: Build a Production Data Pipeline
Real scrapers need to process and store data efficiently. Here's a concurrent pipeline pattern:
package main

import (
    "encoding/json"
    "log"
    "os"
    "sync"
    "time"
)

type Product struct {
    Name      string    `json:"name"`
    Price     float64   `json:"price"`
    URL       string    `json:"url"`
    ScrapedAt time.Time `json:"scraped_at"`
}

type Pipeline struct {
    scrapers   int
    processors int
    writers    int
}

func (p *Pipeline) Run(urls []string, outputFile string) error {
    urlChan := make(chan string, len(urls))
    rawChan := make(chan string, 100)
    productChan := make(chan Product, 100)

    // Separate WaitGroups per stage so each channel can be closed as soon as its
    // producers finish (a single WaitGroup across both stages would deadlock).
    var scrapeWg, processWg sync.WaitGroup

    // Stage 1: Scrape URLs
    for i := 0; i < p.scrapers; i++ {
        scrapeWg.Add(1)
        go func(id int) {
            defer scrapeWg.Done()
            for url := range urlChan {
                log.Printf("Scraper %d: %s", id, url)
                html := scrapeURL(url) // Your scraping logic
                if html != "" {
                    rawChan <- html
                }
                time.Sleep(500 * time.Millisecond)
            }
        }(i)
    }

    // Stage 2: Parse HTML
    for i := 0; i < p.processors; i++ {
        processWg.Add(1)
        go func(id int) {
            defer processWg.Done()
            for html := range rawChan {
                product := parseProduct(html) // Your parsing logic
                productChan <- product
            }
        }(i)
    }

    // Stage 3: Write to file
    var writerWg sync.WaitGroup
    writerWg.Add(1)
    go func() {
        defer writerWg.Done()
        file, err := os.Create(outputFile)
        if err != nil {
            log.Printf("create output file: %v", err)
            for range productChan {
            } // drain so upstream stages can finish
            return
        }
        defer file.Close()

        encoder := json.NewEncoder(file)
        for product := range productChan {
            encoder.Encode(product)
        }
    }()

    // Feed URLs
    for _, url := range urls {
        urlChan <- url
    }
    close(urlChan)

    // Close each downstream channel once its producers are done
    scrapeWg.Wait()
    close(rawChan)

    processWg.Wait()
    close(productChan)

    // Wait for writer
    writerWg.Wait()
    return nil
}
Three-stage pipelines scale. Scraping is I/O bound. Parsing is CPU bound. Writing is I/O bound again.
Different stages can run at different speeds. Buffered channels absorb bursts.
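Run calls two helpers that aren't defined above. Stubs like these (plus "io" and "net/http" in the import block, and a small main) make the pipeline compile; they're placeholders, so swap their bodies for the net/http and goquery code from Steps 2 and 3:
// scrapeURL is a placeholder: fetch the page and return its HTML, or "" on failure.
func scrapeURL(url string) string {
    resp, err := http.Get(url)
    if err != nil {
        log.Printf("scrape failed: %s - %v", url, err)
        return ""
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return ""
    }
    return string(body)
}

// parseProduct is a placeholder: extract a real Product from the HTML with goquery.
func parseProduct(html string) Product {
    return Product{Name: "TODO: parse with goquery", ScrapedAt: time.Now()}
}

func main() {
    p := &Pipeline{scrapers: 5, processors: 2, writers: 1}
    urls := []string{"https://example.com/page1", "https://example.com/page2"}

    if err := p.Run(urls, "products.jsonl"); err != nil {
        log.Fatal(err)
    }
}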
Step 11: Handle Rate Limiting with Token Buckets
Smarter than simple time.Sleep():
package main

import (
    "context"
    "time"
)

type RateLimiter struct {
    tokens chan struct{}
    ticker *time.Ticker
}

func NewRateLimiter(requestsPerSecond int) *RateLimiter {
    rl := &RateLimiter{
        tokens: make(chan struct{}, requestsPerSecond),
        ticker: time.NewTicker(time.Second / time.Duration(requestsPerSecond)),
    }

    // Fill initial tokens
    for i := 0; i < requestsPerSecond; i++ {
        rl.tokens <- struct{}{}
    }

    // Refill tokens
    go func() {
        for range rl.ticker.C {
            select {
            case rl.tokens <- struct{}{}:
            default:
                // Bucket full
            }
        }
    }()

    return rl
}

func (rl *RateLimiter) Wait(ctx context.Context) error {
    select {
    case <-rl.tokens:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

// Usage
func main() {
    limiter := NewRateLimiter(10) // 10 requests per second
    ctx := context.Background()

    for i := 0; i < 100; i++ {
        limiter.Wait(ctx)
        // Make your request here
    }
}
A token bucket allows bursts while maintaining an average rate. If you've been idle, you can send 10 requests immediately; after that, you're limited to the refill rate.
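If you'd rather not maintain your own limiter, golang.org/x/time/rate implements the same token-bucket idea (go get golang.org/x/time/rate). A brief sketch:
package main

import (
    "context"
    "log"

    "golang.org/x/time/rate"
)

func main() {
    // 10 tokens per second, bucket (burst) size of 10
    limiter := rate.NewLimiter(rate.Limit(10), 10)
    ctx := context.Background()

    for i := 0; i < 100; i++ {
        if err := limiter.Wait(ctx); err != nil { // blocks until a token is available
            log.Fatal(err)
        }
        // Make your request here
    }
    log.Println("done")
}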
Step 12: Simulate Human Behavior in Headless Browsers
Anti-bots detect scripted behavior. Add randomness:
package main

import (
    "context"
    "log"
    "math/rand"
    "time"

    "github.com/chromedp/chromedp"
)

func humanDelay() time.Duration {
    return time.Duration(1000+rand.Intn(2000)) * time.Millisecond
}

func humanLikeBrowsing(ctx context.Context) error {
    return chromedp.Run(ctx,
        // Random mouse movements
        chromedp.MouseClickXY(100+float64(rand.Intn(200)), 100+float64(rand.Intn(200))),
        chromedp.Sleep(humanDelay()),

        // Scroll randomly
        chromedp.Evaluate(`window.scrollBy(0, 100 + Math.random() * 300)`, nil),
        chromedp.Sleep(humanDelay()),

        // Simulate reading time
        chromedp.ActionFunc(func(ctx context.Context) error {
            readTime := time.Duration(3+rand.Intn(7)) * time.Second
            time.Sleep(readTime)
            return nil
        }),
    )
}

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Navigate to your target page first, then browse like a human
    if err := chromedp.Run(ctx, chromedp.Navigate("https://example.com")); err != nil {
        log.Fatal(err)
    }
    if err := humanLikeBrowsing(ctx); err != nil {
        log.Fatal(err)
    }
}
Randomness is key. Fixed 2-second delays are detectable. Random delays between 1-3 seconds look human.
Mouse movements, scrolling, and reading time all contribute to a realistic browsing pattern.
Common Mistakes to Avoid
1. Ignoring response status codes. A 200 response can still contain a CAPTCHA page. Always verify you got actual content (see the sketch after this list).
2. Not handling connection reuse. Without proper Transport config, you'll exhaust file descriptors on Unix systems.
3. Forgetting to close response bodies. Every resp.Body must be closed, even on error responses. Use defer resp.Body.Close() immediately after checking errors.
4. Using default User-Agent. Go's default header screams "I'm a bot." Always set realistic browser headers.
5. Scraping too fast. Even without anti-bot systems, you'll overwhelm servers and get IP banned. Start slow.
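For mistake #1, a cheap sanity check is to look for a selector you know the real page contains before trusting a 200. The ".product-card" marker and URL below are just examples; use whatever your target page reliably renders:
package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://example.com/products")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close() // mistake #3: always close the body

    if resp.StatusCode != http.StatusOK {
        log.Fatalf("unexpected status: %d", resp.StatusCode)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Mistake #1: a 200 can still be a challenge or CAPTCHA page
    if doc.Find(".product-card").Length() == 0 {
        log.Fatal("got 200 but no product cards - likely a block page")
    }
    fmt.Println("real content received")
}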
Export Your Data
Go makes JSON and CSV export straightforward:
package main

import (
    "encoding/csv"
    "encoding/json"
    "fmt"
    "log"
    "os"
)

type Product struct {
    Name  string  `json:"name"`
    Price float64 `json:"price"`
}

func exportJSON(products []Product, filename string) error {
    file, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    encoder := json.NewEncoder(file)
    encoder.SetIndent("", " ")
    return encoder.Encode(products)
}

func exportCSV(products []Product, filename string) error {
    file, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    // Header
    writer.Write([]string{"Name", "Price"})

    // Data
    for _, p := range products {
        writer.Write([]string{p.Name, fmt.Sprintf("%.2f", p.Price)})
    }
    return nil
}

func main() {
    // Example usage with sample data
    products := []Product{{Name: "Gopher Mug", Price: 12.99}}

    if err := exportJSON(products, "products.json"); err != nil {
        log.Fatal(err)
    }
    if err := exportCSV(products, "products.csv"); err != nil {
        log.Fatal(err)
    }
}
Final Thoughts
Go gives you the performance and concurrency to scrape at massive scale. The patterns in this guide handle real-world challenges: rate limits, anti-bots, dynamic content, and network failures.
Start with net/http and goquery for simple sites. Add Colly when you need crawling. Use chromedp only when JavaScript rendering is required.
For production deployments, combine header rotation, proxy rotation, and TLS fingerprint spoofing. Layer these defenses based on how aggressive the target site's protection is.
The code examples above are production-tested patterns. Adapt them to your use case, respect site terms of service, and scale responsibly.
Next Steps
- Explore Rod for an alternative headless browser library with stealth features
- Learn about CycleTLS for advanced TLS fingerprint spoofing
- Build a distributed scraper using Go's native RPC or message queues like NATS
- Implement database storage with PostgreSQL or MongoDB drivers
FAQ
How fast can Go scrapers run compared to Python?
Go scrapers typically run 3-5x faster than equivalent Python scrapers on CPU-bound parsing tasks and 2-3x faster on I/O-bound network requests. The gap widens with concurrency since Go's goroutines have less overhead than Python threads.
Can Go scrape JavaScript-rendered pages?
Yes. chromedp controls a real Chrome/Chromium browser and can render any JavaScript content. Rod is another popular option with additional stealth features built-in.
How do I avoid getting blocked?
Rotate User-Agents and headers, use residential proxies, implement rate limiting, and consider TLS fingerprint spoofing for heavily protected sites. Start with 1 request per second and increase gradually.
Is web scraping legal?
Scraping publicly accessible data is generally legal in most jurisdictions. However, respect robots.txt, terms of service, and avoid scraping personal data or bypassing authentication. Consult legal counsel for your specific use case.
What's the best Go library for beginners?
Start with Colly. It handles cookies, redirects, rate limiting, and parallel requests automatically. Once you understand the fundamentals, drop down to net/http and goquery for more control.