Web Scraping in Golang: 2026 Step-by-Step Guide

Go is Google's open-source programming language, and it typically runs several times faster than Python for data extraction tasks. Its native goroutines let you spin up thousands of concurrent scrapers with minimal memory overhead.

In this guide, you'll learn to build production-ready Golang scrapers that handle dynamic JavaScript content, bypass modern anti-bot systems, and scale to millions of pages.

What You'll Learn

  • Three different scraping approaches: net/http, Colly, and chromedp
  • Concurrent scraping with goroutines and worker pools
  • TLS fingerprint spoofing to bypass Cloudflare and DataDome
  • Handling JavaScript-rendered content with headless browsers
  • Production patterns for rate limiting, retries, and data pipelines

Why Go Beats Python for Large-Scale Scraping

Go is a compiled language. This means your scraper runs as native machine code, not interpreted bytecode.

In real-world tests, Go scrapers finish in 20 minutes what Python scrapers take 40+ minutes to complete on identical datasets. When you're processing millions of pages, that difference compounds.

Goroutines change everything. Python's Global Interpreter Lock (GIL) limits true parallelism. Go's goroutines are lightweight threads (~2KB each) that run concurrently without the threading headaches.

You can spawn 10,000 concurrent scrapers on a modest VPS. Try that in Python.
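
The usual way to keep thousands of goroutines from becoming thousands of simultaneous requests is a buffered channel used as a semaphore. Here's a minimal sketch; the URL list and the limit of 200 in-flight requests are placeholder values:

package main

import (
    "log"
    "net/http"
    "sync"
)

func main() {
    urls := []string{"https://example.com/page1", "https://example.com/page2"} // placeholder URLs
    sem := make(chan struct{}, 200)                                            // at most 200 requests in flight
    var wg sync.WaitGroup

    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it when done

            resp, err := http.Get(u)
            if err != nil {
                log.Println(err)
                return
            }
            resp.Body.Close()
        }(url)
    }
    wg.Wait()
}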

Single binary deployment. No virtual environments. No dependency conflicts. Compile once, run anywhere.
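
For example, cross-compiling the same scraper for a Linux server from any development machine is one command (the output name scraper is arbitrary):

GOOS=linux GOARCH=amd64 go build -o scraper .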


Step 1: Set Up Your Go Scraping Environment

First, install Go 1.21+ from the official website. Then create your project:

mkdir go-scraper && cd go-scraper
go mod init github.com/yourusername/go-scraper

This creates a go.mod file that manages your dependencies automatically.

Create your entry point main.go:

package main

import (
    "fmt"
    "log"
)

func main() {
    fmt.Println("Scraper initialized")
    log.Println("Ready to extract data")
}

Run it with go run main.go. You should see both messages printed.

Why this matters: Go's module system means no pip freeze, no requirements.txt conflicts, no "works on my machine" problems.

Step 2: Build Your First Scraper with net/http

Most tutorials jump straight to Colly. That's a mistake.

Understanding Go's standard library gives you complete control. You'll know exactly what's happening under the hood.

Here's a production-ready HTTP client:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "time"
)

func createClient() *http.Client {
    return &http.Client{
        Timeout: 30 * time.Second,
        Transport: &http.Transport{
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,
            IdleConnTimeout:     90 * time.Second,
        },
    }
}

func main() {
    client := createClient()
    
    req, err := http.NewRequest("GET", "https://httpbin.org/headers", nil)
    if err != nil {
        log.Fatal(err)
    }
    
    // Set realistic browser headers
    req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
    req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
    req.Header.Set("Accept-Language", "en-US,en;q=0.9")
    
    resp, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}

The Transport config is critical. MaxIdleConns and MaxIdleConnsPerHost control connection pooling; the default of only two idle connections per host forces constant reconnects when you hammer a single domain, and without reuse you can churn through thousands of short-lived TCP connections and run out of file descriptors.

Step 3: Parse HTML with goquery

goquery gives you jQuery-style selectors in Go. Install it:

go get github.com/PuerkitoBio/goquery

Here's how to extract product data from an e-commerce page:

package main

import (
    "fmt"
    "log"
    "net/http"
    "strings"
    
    "github.com/PuerkitoBio/goquery"
)

type Product struct {
    Name  string
    Price string
    URL   string
}

func scrapeProducts(url string) ([]Product, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("unexpected status: %d", resp.StatusCode)
    }
    
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil, err
    }
    
    var products []Product
    
    doc.Find(".product-card").Each(func(i int, s *goquery.Selection) {
        name := strings.TrimSpace(s.Find(".product-title").Text())
        price := strings.TrimSpace(s.Find(".product-price").Text())
        link, _ := s.Find("a").Attr("href")
        
        products = append(products, Product{
            Name:  name,
            Price: price,
            URL:   link,
        })
    })
    
    return products, nil
}

func main() {
    products, err := scrapeProducts("https://example.com/products")
    if err != nil {
        log.Fatal(err)
    }
    
    for _, p := range products {
        fmt.Printf("Product: %s | Price: %s\n", p.Name, p.Price)
    }
}

Pro tip: Use strings.TrimSpace() on every extracted field. HTML often includes whitespace and newlines that mess up your data.

goquery's CSS selectors work exactly like JavaScript's document.querySelectorAll(). If you can select it in browser DevTools, you can select it in goquery.
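
As a quick illustration, the same selectors you would test in DevTools, including attribute and child selectors, work directly in Find(). The HTML fragment below is made up for the example:

package main

import (
    "fmt"
    "log"
    "strings"
    
    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<div class="product-card" data-sku="A1">
        <a href="/p/a1"><span class="product-title">Widget</span></a>
    </div>`
    
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }
    
    // Any selector that works in document.querySelectorAll() works here
    sku, _ := doc.Find(`.product-card[data-sku]`).Attr("data-sku")
    title := doc.Find(".product-card a > .product-title").Text()
    fmt.Println(sku, title) // A1 Widget
}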

Step 4: Scale with Concurrent Goroutines

This is where Go destroys Python. True parallel scraping with minimal code.

Here's a worker pool pattern that handles 1000+ URLs efficiently:

package main

import (
    "fmt"
    "log"
    "net/http"
    "sync"
    "time"
)

type ScrapeResult struct {
    URL        string
    StatusCode int
    Error      error
}

func worker(id int, jobs <-chan string, results chan<- ScrapeResult, wg *sync.WaitGroup) {
    defer wg.Done()
    
    client := &http.Client{Timeout: 15 * time.Second}
    
    for url := range jobs {
        log.Printf("Worker %d processing: %s", id, url)
        
        resp, err := client.Get(url)
        if err != nil {
            results <- ScrapeResult{URL: url, Error: err}
            continue
        }
        resp.Body.Close()
        
        results <- ScrapeResult{URL: url, StatusCode: resp.StatusCode}
        
        // Rate limit per worker
        time.Sleep(500 * time.Millisecond)
    }
}

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        // Add hundreds more...
    }
    
    workers := 10
    jobs := make(chan string, len(urls))
    results := make(chan ScrapeResult, len(urls))
    
    var wg sync.WaitGroup
    
    // Start workers
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go worker(i, jobs, results, &wg)
    }
    
    // Send jobs
    for _, url := range urls {
        jobs <- url
    }
    close(jobs)
    
    // Collect results in background
    go func() {
        wg.Wait()
        close(results)
    }()
    
    // Process results
    for result := range results {
        if result.Error != nil {
            log.Printf("Failed: %s - %v", result.URL, result.Error)
        } else {
            fmt.Printf("Success: %s [%d]\n", result.URL, result.StatusCode)
        }
    }
}

How this works: The jobs channel distributes URLs to workers. Each worker processes URLs independently.

The sync.WaitGroup ensures we don't exit before all workers finish. Channels handle synchronization automatically.

Tuning tip: Start with 10 workers. Increase until you see 429 (rate limit) responses. Then back off.
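
One way to automate that back-off is to check each response for a 429 and honor its Retry-After header when present. Here's a sketch of a helper you could call from the worker loop above; the 30-second fallback is an arbitrary choice:

package main

import (
    "fmt"
    "net/http"
    "strconv"
    "time"
)

// backoffFor429 returns how long to pause after a rate-limited response.
// It honors a numeric Retry-After header if the server sends one and
// falls back to a fixed pause otherwise. Zero means no pause is needed.
func backoffFor429(resp *http.Response) time.Duration {
    if resp == nil || resp.StatusCode != http.StatusTooManyRequests {
        return 0
    }
    if s := resp.Header.Get("Retry-After"); s != "" {
        if secs, err := strconv.Atoi(s); err == nil {
            return time.Duration(secs) * time.Second
        }
    }
    return 30 * time.Second // fallback when the header is missing or is a date
}

func main() {
    resp := &http.Response{
        StatusCode: http.StatusTooManyRequests,
        Header:     http.Header{"Retry-After": []string{"5"}},
    }
    fmt.Println(backoffFor429(resp)) // 5s
}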


Step 5: Use Colly for Crawling Entire Sites

Colly is Go's most popular scraping framework. It handles cookies, redirects, and link following automatically.

go get github.com/gocolly/colly/v2

Here's a complete crawler that follows pagination:

package main

import (
    "encoding/csv"
    "log"
    "os"
    "time"
    
    "github.com/gocolly/colly/v2"
)

func main() {
    // Create output file
    file, _ := os.Create("products.csv")
    defer file.Close()
    writer := csv.NewWriter(file)
    defer writer.Flush()
    
    writer.Write([]string{"Name", "Price", "URL"})
    
    c := colly.NewCollector(
        colly.AllowedDomains("scrapingcourse.com"),
        colly.MaxDepth(3),
    )
    
    // Configure rate limiting
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 5,
        Delay:       1 * time.Second,
    })
    
    // Set realistic headers
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0")
        log.Println("Visiting:", r.URL)
    })
    
    // Extract product data
    c.OnHTML(".product-card", func(e *colly.HTMLElement) {
        name := e.ChildText(".product-name")
        price := e.ChildText(".product-price")
        url := e.ChildAttr("a", "href")
        
        writer.Write([]string{name, price, url})
    })
    
    // Follow pagination links
    c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
        nextPage := e.Attr("href")
        e.Request.Visit(nextPage)
    })
    
    // Handle errors gracefully
    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error on %s: %v", r.Request.URL, err)
    })
    
    c.Visit("https://scrapingcourse.com/ecommerce/page/1")
}

Colly's LimitRule prevents you from hammering servers. The Parallelism: 5 setting runs 5 concurrent requests maximum.

The callback pattern is powerful. OnHTML fires for every matching element. OnRequest lets you modify headers before each request. OnError catches failures without crashing.

Step 6: Scrape JavaScript-Heavy Pages with chromedp

When sites load content via JavaScript, HTTP requests won't work. You need a real browser.

chromedp controls Chrome/Chromium through the DevTools Protocol:

go get github.com/chromedp/chromedp

Here's how to scrape an infinite scroll page:

package main

import (
    "context"
    "fmt"
    "log"
    "time"
    
    "github.com/chromedp/chromedp"
)

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    
    // Set timeout
    ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
    defer cancel()
    
    var products []string
    
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://scrapingcourse.com/infinite-scrolling"),
        
        // Wait for initial content
        chromedp.WaitVisible(".product-card", chromedp.ByQuery),
        
        // Scroll down 5 times to load more products
        chromedp.ActionFunc(func(ctx context.Context) error {
            for i := 0; i < 5; i++ {
                chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil).Do(ctx)
                time.Sleep(2 * time.Second)
            }
            return nil
        }),
        
        // Extract all product names
        chromedp.Evaluate(`
            Array.from(document.querySelectorAll('.product-name'))
                .map(el => el.textContent.trim())
        `, &products),
    )
    
    if err != nil {
        log.Fatal(err)
    }
    
    fmt.Printf("Found %d products:\n", len(products))
    for _, name := range products {
        fmt.Println("-", name)
    }
}

The JavaScript evaluation is key. You're running real browser JavaScript and pulling results back into Go.

chromedp consumes significant resources. Use it only when necessary. For static HTML, stick with net/http and goquery.
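
When you do need it, you can trim Chrome's footprint with allocator options. A minimal sketch; the image-disabling flag is a common choice, not a requirement:

package main

import (
    "context"
    
    "github.com/chromedp/chromedp"
)

// newLeanBrowserContext builds a chromedp context from the default
// allocator options (headless, no first run, etc.) with image loading
// disabled to save bandwidth and memory.
func newLeanBrowserContext() (context.Context, context.CancelFunc) {
    opts := append(chromedp.DefaultExecAllocatorOptions[:],
        chromedp.Flag("blink-settings", "imagesEnabled=false"),
    )
    allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(), opts...)
    ctx, cancelCtx := chromedp.NewContext(allocCtx)
    return ctx, func() { cancelCtx(); cancelAlloc() }
}

func main() {
    ctx, cancel := newLeanBrowserContext()
    defer cancel()
    
    // Use ctx with chromedp.Run exactly as in the example above
    _ = chromedp.Run(ctx, chromedp.Navigate("https://example.com"))
}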

Step 7: Bypass Anti-Bot Protection

Modern sites use Cloudflare, DataDome, and PerimeterX. They detect scrapers through multiple signals.

Header Rotation

Rotating User-Agents alone isn't enough anymore. You need the full header set:

package main

import (
    "math/rand"
    "net/http"
    "sync"
)

type HeaderRotator struct {
    userAgents []string
    accepts    []string
    languages  []string
    mu         sync.Mutex
}

func NewHeaderRotator() *HeaderRotator {
    return &HeaderRotator{
        userAgents: []string{
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
        },
        accepts: []string{
            "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        },
        languages: []string{
            "en-US,en;q=0.9",
            "en-GB,en;q=0.9",
            "en-US,en;q=0.8,es;q=0.6",
        },
    }
}

func (h *HeaderRotator) Apply(req *http.Request) {
    h.mu.Lock()
    defer h.mu.Unlock()
    
    req.Header.Set("User-Agent", h.userAgents[rand.Intn(len(h.userAgents))])
    req.Header.Set("Accept", h.accepts[rand.Intn(len(h.accepts))])
    req.Header.Set("Accept-Language", h.languages[rand.Intn(len(h.languages))])
    req.Header.Set("Accept-Encoding", "gzip, deflate, br")
    req.Header.Set("Connection", "keep-alive")
    req.Header.Set("Upgrade-Insecure-Requests", "1")
    req.Header.Set("Sec-Fetch-Dest", "document")
    req.Header.Set("Sec-Fetch-Mode", "navigate")
    req.Header.Set("Sec-Fetch-Site", "none")
    req.Header.Set("Sec-Fetch-User", "?1")
}

The Sec-Fetch headers matter. Cloudflare checks for these browser-specific headers that scrapers typically miss.

Proxy Rotation

IP rotation is essential for large-scale scraping. Here's a proxy rotator:

package main

import (
    "log"
    "math/rand"
    "net/http"
    "net/url"
    "time"
)

type ProxyRotator struct {
    proxies []string
    current int
}

func NewProxyRotator(proxies []string) *ProxyRotator {
    return &ProxyRotator{proxies: proxies}
}

func (p *ProxyRotator) GetClient() (*http.Client, error) {
    proxyURL := p.proxies[rand.Intn(len(p.proxies))]
    
    proxy, err := url.Parse(proxyURL)
    if err != nil {
        return nil, err
    }
    
    transport := &http.Transport{
        Proxy: http.ProxyURL(proxy),
    }
    
    return &http.Client{
        Transport: transport,
        Timeout:   30 * time.Second,
    }, nil
}

// Usage example
func main() {
    rotator := NewProxyRotator([]string{
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
    })
    
    client, err := rotator.GetClient()
    if err != nil {
        log.Fatal(err)
    }
    
    resp, err := client.Get("https://httpbin.org/ip")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    // Process response...
}

For production scraping at scale, residential proxies from providers like Roundproxies.com give you real consumer IP addresses that rarely get blocked.

Step 8: Spoof TLS Fingerprints

This is advanced territory. Anti-bot systems fingerprint your TLS handshake.

When your Go scraper connects via HTTPS, it sends cipher suites and extensions in a specific order. This creates a unique fingerprint (called JA3).

Cloudflare compares your JA3 fingerprint against known browsers. If it instead matches Python's requests library or Go's default TLS client, you're likely to be challenged or blocked.

CycleTLS solves this:

package main

import (
    "log"
    
    "github.com/Danny-Dasilva/CycleTLS/cycletls"
)

func main() {
    client := cycletls.Init()
    
    // Chrome 120 JA3 fingerprint
    ja3 := "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,0-23-65281-10-11-35-16-5-13-18-51-45-43-27-17513-21,29-23-24,0"
    
    response, err := client.Do("https://www.cloudflare.com", cycletls.Options{
        Body:      "",
        Ja3:       ja3,
        UserAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
    }, "GET")
    
    if err != nil {
        log.Fatal(err)
    }
    
    log.Printf("Status: %d", response.Status)
    log.Println(response.Body)
}

CycleTLS lets you specify exact JA3 fingerprints. Match Chrome's fingerprint, and anti-bots see a real browser.

Where to get valid JA3 fingerprints: Visit scrapfly.io/web-scraping-tools/ja3-fingerprint in a real browser. Copy your browser's JA3 hash and use it in your scraper.

Step 9: Implement Retry Logic with Exponential Backoff

Networks fail. Servers timeout. Your scraper needs resilience.

package main

import (
    "fmt"
    "math"
    "net/http"
    "time"
)

type RetryClient struct {
    client     *http.Client
    maxRetries int
}

func (r *RetryClient) Get(url string) (*http.Response, error) {
    var lastErr error
    
    for attempt := 0; attempt < r.maxRetries; attempt++ {
        resp, err := r.client.Get(url)
        
        if err == nil && resp.StatusCode < 500 {
            return resp, nil
        }
        
        lastErr = err
        if resp != nil {
            resp.Body.Close()
            lastErr = fmt.Errorf("status %d", resp.StatusCode)
        }
        
        // Exponential backoff: 1s, 2s, 4s, 8s...
        waitTime := time.Duration(math.Pow(2, float64(attempt))) * time.Second
        fmt.Printf("Attempt %d failed, waiting %v before retry\n", attempt+1, waitTime)
        time.Sleep(waitTime)
    }
    
    return nil, fmt.Errorf("all %d attempts failed: %v", r.maxRetries, lastErr)
}

func main() {
    client := &RetryClient{
        client:     &http.Client{Timeout: 15 * time.Second},
        maxRetries: 5,
    }
    
    resp, err := client.Get("https://example.com/api/data")
    if err != nil {
        fmt.Println("Failed:", err)
        return
    }
    defer resp.Body.Close()
    
    fmt.Println("Success!")
}

Why exponential backoff? Constant retries hammer failing servers. Exponential backoff gives servers time to recover.

The pattern: 1 second, 2 seconds, 4 seconds, 8 seconds. After 5 attempts, you've waited 31 seconds total.
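
One refinement worth considering (not shown in the snippet above) is adding random jitter so that many workers retrying at once don't all come back at the same instant. A minimal sketch:

package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"
)

// backoffWithJitter returns the wait before retry attempt n (0-based):
// the exponential base of 1s, 2s, 4s, ... plus up to one second of
// random jitter to spread out concurrent retries.
func backoffWithJitter(attempt int) time.Duration {
    base := time.Duration(math.Pow(2, float64(attempt))) * time.Second
    jitter := time.Duration(rand.Int63n(int64(time.Second)))
    return base + jitter
}

func main() {
    for attempt := 0; attempt < 5; attempt++ {
        fmt.Println(backoffWithJitter(attempt))
    }
}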

Step 10: Build a Production Data Pipeline

Real scrapers need to process and store data efficiently. Here's a concurrent pipeline pattern:

package main

import (
    "encoding/json"
    "log"
    "os"
    "sync"
    "time"
)

type Product struct {
    Name      string    `json:"name"`
    Price     float64   `json:"price"`
    URL       string    `json:"url"`
    ScrapedAt time.Time `json:"scraped_at"`
}

type Pipeline struct {
    scrapers   int
    processors int
    writers    int
}

func (p *Pipeline) Run(urls []string, outputFile string) error {
    urlChan := make(chan string, len(urls))
    rawChan := make(chan string, 100)
    productChan := make(chan Product, 100)
    
    var scrapeWg, parseWg sync.WaitGroup
    
    // Stage 1: Scrape URLs
    for i := 0; i < p.scrapers; i++ {
        scrapeWg.Add(1)
        go func(id int) {
            defer scrapeWg.Done()
            for url := range urlChan {
                log.Printf("Scraper %d: %s", id, url)
                html := scrapeURL(url) // Your scraping logic
                if html != "" {
                    rawChan <- html
                }
                time.Sleep(500 * time.Millisecond)
            }
        }(i)
    }
    
    // Stage 2: Parse HTML
    for i := 0; i < p.processors; i++ {
        parseWg.Add(1)
        go func(id int) {
            defer parseWg.Done()
            for html := range rawChan {
                product := parseProduct(html) // Your parsing logic
                productChan <- product
            }
        }(i)
    }
    
    // Stage 3: Write to file
    var writerWg sync.WaitGroup
    writerWg.Add(1)
    go func() {
        defer writerWg.Done()
        file, _ := os.Create(outputFile)
        defer file.Close()
        encoder := json.NewEncoder(file)
        
        for product := range productChan {
            encoder.Encode(product)
        }
    }()
    
    // Feed URLs
    for _, url := range urls {
        urlChan <- url
    }
    close(urlChan)
    
    // Close each stage's output channel once the goroutines feeding it finish
    scrapeWg.Wait()
    close(rawChan)
    parseWg.Wait()
    close(productChan)
    
    // Wait for writer
    writerWg.Wait()
    
    return nil
}

Three-stage pipelines scale. Scraping is I/O bound. Parsing is CPU bound. Writing is I/O bound again.

Different stages can run at different speeds. Buffered channels absorb bursts.

Step 11: Handle Rate Limiting with Token Buckets

Smarter than simple time.Sleep():

package main

import (
    "context"
    "time"
)

type RateLimiter struct {
    tokens chan struct{}
    ticker *time.Ticker
}

func NewRateLimiter(requestsPerSecond int) *RateLimiter {
    rl := &RateLimiter{
        tokens: make(chan struct{}, requestsPerSecond),
        ticker: time.NewTicker(time.Second / time.Duration(requestsPerSecond)),
    }
    
    // Fill initial tokens
    for i := 0; i < requestsPerSecond; i++ {
        rl.tokens <- struct{}{}
    }
    
    // Refill tokens
    go func() {
        for range rl.ticker.C {
            select {
            case rl.tokens <- struct{}{}:
            default:
                // Bucket full
            }
        }
    }()
    
    return rl
}

func (rl *RateLimiter) Wait(ctx context.Context) error {
    select {
    case <-rl.tokens:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

// Usage
func main() {
    limiter := NewRateLimiter(10) // 10 requests per second
    ctx := context.Background()
    
    for i := 0; i < 100; i++ {
        limiter.Wait(ctx)
        // Make your request here
    }
}

Token bucket allows bursts while maintaining average rate. If you've been idle, you can send 10 requests immediately. Then you're rate-limited.
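
If you'd rather not maintain your own limiter, the golang.org/x/time/rate package implements the same token-bucket idea and is widely used. A minimal sketch with the same 10 requests per second and a burst of 10:

package main

import (
    "context"
    "log"
    
    "golang.org/x/time/rate"
)

func main() {
    // 10 tokens per second, bursts of up to 10
    limiter := rate.NewLimiter(rate.Limit(10), 10)
    ctx := context.Background()
    
    for i := 0; i < 100; i++ {
        if err := limiter.Wait(ctx); err != nil {
            log.Fatal(err)
        }
        // Make your request here
    }
}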

Step 12: Simulate Human Behavior in Headless Browsers

Anti-bots detect scripted behavior. Add randomness:

package main

import (
    "context"
    "math/rand"
    "time"
    
    "github.com/chromedp/chromedp"
)

func humanDelay() time.Duration {
    return time.Duration(1000+rand.Intn(2000)) * time.Millisecond
}

func humanLikeBrowsing(ctx context.Context) error {
    return chromedp.Run(ctx,
        // Random mouse movements
        chromedp.MouseClickXY(100+float64(rand.Intn(200)), 100+float64(rand.Intn(200))),
        chromedp.Sleep(humanDelay()),
        
        // Scroll randomly
        chromedp.Evaluate(`window.scrollBy(0, 100 + Math.random() * 300)`, nil),
        chromedp.Sleep(humanDelay()),
        
        // Simulate reading time
        chromedp.ActionFunc(func(ctx context.Context) error {
            readTime := time.Duration(3+rand.Intn(7)) * time.Second
            time.Sleep(readTime)
            return nil
        }),
    )
}

Randomness is key. Fixed 2-second delays are detectable. Random delays between 1-3 seconds look human.

Mouse movements, scrolling, and reading time all contribute to a realistic browsing pattern.

Common Mistakes to Avoid

1. Ignoring response status codes. A 200 response can still contain a CAPTCHA page. Always verify you got actual content (see the sketch after this list).

2. Not handling connection reuse. Without proper Transport config, you'll exhaust file descriptors on Unix systems.

3. Forgetting to close response bodies. Every resp.Body must be closed, even on error responses. Use defer resp.Body.Close() immediately after checking errors.

4. Using default User-Agent. Go's default header screams "I'm a bot." Always set realistic browser headers.

5. Scraping too fast. Even without anti-bot systems, you'll overwhelm servers and get IP banned. Start slow.
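
A cheap guard against mistake #1 is a content check after every fetch. The marker strings below are examples only; tune them to the block pages you actually encounter:

package main

import (
    "fmt"
    "strings"
)

// looksBlocked reports whether a "successful" response body is actually a
// challenge or CAPTCHA page. The markers are illustrative, not exhaustive.
func looksBlocked(body string) bool {
    markers := []string{
        "just a moment",        // Cloudflare interstitial title
        "cf-challenge",         // Cloudflare challenge markup
        "verify you are human",
        "g-recaptcha",
    }
    lower := strings.ToLower(body)
    for _, m := range markers {
        if strings.Contains(lower, m) {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(looksBlocked("<title>Just a moment...</title>")) // true
}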

Export Your Data

Go makes JSON and CSV export straightforward:

package main

import (
    "encoding/csv"
    "encoding/json"
    "fmt"
    "os"
)

type Product struct {
    Name  string  `json:"name"`
    Price float64 `json:"price"`
}

func exportJSON(products []Product, filename string) error {
    file, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer file.Close()
    
    encoder := json.NewEncoder(file)
    encoder.SetIndent("", "  ")
    return encoder.Encode(products)
}

func exportCSV(products []Product, filename string) error {
    file, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer file.Close()
    
    writer := csv.NewWriter(file)
    defer writer.Flush()
    
    // Header
    writer.Write([]string{"Name", "Price"})
    
    // Data
    for _, p := range products {
        writer.Write([]string{p.Name, fmt.Sprintf("%.2f", p.Price)})
    }
    
    return nil
}

Final Thoughts

Go gives you the performance and concurrency to scrape at massive scale. The patterns in this guide handle real-world challenges: rate limits, anti-bots, dynamic content, and network failures.

Start with net/http and goquery for simple sites. Add Colly when you need crawling. Use chromedp only when JavaScript rendering is required.

For production deployments, combine header rotation, proxy rotation, and TLS fingerprint spoofing. Layer these defenses based on how aggressive the target site's protection is.

The code examples above are production-tested patterns. Adapt them to your use case, respect site terms of service, and scale responsibly.

Next Steps

  • Explore Rod for an alternative headless browser library with stealth features
  • Learn about CycleTLS for advanced TLS fingerprint spoofing
  • Build a distributed scraper using Go's native RPC or message queues like NATS
  • Implement database storage with PostgreSQL or MongoDB drivers

FAQ

How fast can Go scrapers run compared to Python?

Go scrapers typically run 3-5x faster than equivalent Python scrapers on CPU-bound parsing tasks and 2-3x faster on I/O-bound network requests. The gap widens with concurrency since Go's goroutines have less overhead than Python threads.

Can Go scrape JavaScript-rendered pages?

Yes. chromedp controls a real Chrome/Chromium browser and can render any JavaScript content. Rod is another popular option with additional stealth features built-in.

How do I avoid getting blocked?

Rotate User-Agents and headers, use residential proxies, implement rate limiting, and consider TLS fingerprint spoofing for heavily protected sites. Start with 1 request per second and increase gradually.

Is web scraping legal?

Scraping publicly accessible data is generally legal in most jurisdictions. However, respect robots.txt and terms of service, and avoid scraping personal data or bypassing authentication. Consult legal counsel for your specific use case.

What's the best Go library for beginners?

Start with Colly. It handles cookies, redirects, rate limiting, and parallel requests automatically. Once you understand the fundamentals, drop down to net/http and goquery for more control.