Building a web crawler in Go gives you raw speed and true parallelism that Python and JavaScript simply cannot match. In this guide, you will learn how to build production-ready crawlers using Go's goroutines, channels, and the best libraries available in 2026.
By the end, you will have working code for three different approaches: pure net/http with goroutines, the Colly framework, and headless browser crawling with chromedp.
What is a Go Web Crawler?
A Go web crawler is a program that systematically browses websites to discover and extract URLs and data. Go's goroutines allow you to crawl hundreds of pages simultaneously while using minimal memory. Unlike Python's GIL-limited threading, Go achieves true parallel execution across all CPU cores.
The main difference between Go and Python web crawlers is performance and concurrency. Go compiles to native machine code and handles concurrent requests 2-5x faster than Python scrapers on identical hardware. For large-scale crawling projects processing millions of pages, this speed advantage compounds significantly.
Setting Up Your Environment
Before writing any code, make sure Go 1.21+ is installed on your system. Run this command to verify:
go version
You should see output like go version go1.22.0 linux/amd64 or similar.
Create a new project directory and initialize your module:
mkdir go-crawler && cd go-crawler
go mod init github.com/yourusername/go-crawler
Install the packages we will use throughout this guide:
go get github.com/PuerkitoBio/goquery
go get github.com/gocolly/colly/v2
go get github.com/chromedp/chromedp
The go.mod file tracks all dependencies automatically. You are now ready to build your first crawler.
Method 1: Pure Go with net/http and Goroutines
This approach uses only Go's standard library plus goquery for HTML parsing. It gives you maximum control over every request and teaches you Go's concurrency fundamentals.
Building the Basic Fetcher
Start with a function that fetches a single page and returns parsed HTML:
package main
import (
"fmt"
"net/http"
"time"
"github.com/PuerkitoBio/goquery"
)
// Fetcher retrieves and parses HTML from a URL
func fetchPage(url string) (*goquery.Document, error) {
client := &http.Client{
Timeout: 30 * time.Second,
}
req, err := http.NewRequest("GET", url, nil)
if err != nil {
return nil, err
}
// Set headers to mimic a real browser
req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
req.Header.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
req.Header.Set("Accept-Language", "en-US,en;q=0.5")
resp, err := client.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("status code: %d", resp.StatusCode)
}
return goquery.NewDocumentFromReader(resp.Body)
}
The http.Client with a 30-second timeout prevents your crawler from hanging on slow servers. Setting browser-like headers is essential because many sites block requests with default Go User-Agents.
Extracting Links from HTML
Next, add a function to pull all links from a page:
import (
"net/url"
"strings"
)
// extractLinks finds all href values in anchor tags
func extractLinks(doc *goquery.Document, baseURL string) []string {
var links []string
base, _ := url.Parse(baseURL)
doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
href, exists := s.Attr("href")
if !exists {
return
}
// Skip mailto, javascript, and fragment-only links
if strings.HasPrefix(href, "mailto:") ||
strings.HasPrefix(href, "javascript:") ||
strings.HasPrefix(href, "#") {
return
}
// Resolve relative URLs to absolute
parsed, err := url.Parse(href)
if err != nil {
return
}
absolute := base.ResolveReference(parsed)
links = append(links, absolute.String())
})
return links
}
URL normalization is critical. Relative paths like /about or ../index.html must be converted to absolute URLs. The ResolveReference function handles this automatically.
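Here is a quick standalone example of what ResolveReference does with a relative path:
package main

import (
"fmt"
"net/url"
)

func main() {
base, _ := url.Parse("https://example.com/blog/post-1")
rel, _ := url.Parse("../about")
// ResolveReference applies the standard RFC 3986 resolution rules
fmt.Println(base.ResolveReference(rel).String())
// Output: https://example.com/about
}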
Adding Concurrency with Goroutines and Channels
Here is where Go truly shines. We will create a worker pool that processes URLs in parallel:
import (
"log"
"sync"
)
type Crawler struct {
visited map[string]bool
mu sync.Mutex
wg sync.WaitGroup
urlQueue chan string
maxDepth int
baseHost string
}
func NewCrawler(maxWorkers int, maxDepth int, baseHost string) *Crawler {
return &Crawler{
visited: make(map[string]bool),
urlQueue: make(chan string, 1000),
maxDepth: maxDepth,
baseHost: baseHost,
}
}
// markVisited records a URL as visited and reports whether it had
// already been seen. Check-and-mark happens under one lock, so two
// goroutines can never both claim the same URL.
func (c *Crawler) markVisited(url string) bool {
c.mu.Lock()
defer c.mu.Unlock()
if c.visited[url] {
return true
}
c.visited[url] = true
return false
}
func (c *Crawler) worker(id int) {
for pageURL := range c.urlQueue {
log.Printf("[Worker %d] Crawling: %s\n", id, pageURL)
doc, err := fetchPage(pageURL)
if err != nil {
log.Printf("[Worker %d] Error: %v\n", id, err)
c.wg.Done()
continue
}
links := extractLinks(doc, pageURL)
// Filter to same domain only; markVisited claims each link exactly once
for _, link := range links {
parsed, err := url.Parse(link)
if err != nil || parsed.Host != c.baseHost {
continue
}
if !c.markVisited(link) {
c.wg.Add(1)
go func(l string) {
c.urlQueue <- l
}(link)
}
}
c.wg.Done()
}
}
The sync.Mutex protects the visited map from race conditions when multiple goroutines access it simultaneously, and because markVisited checks and marks under a single lock, each URL is claimed exactly once before it is queued. The channel acts as a queue that distributes work across all workers.
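If you change how the shared state is accessed, run the crawler under Go's race detector during development; it reports any goroutine that reads or writes the visited map without holding the lock:
go run -race .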
Running the Concurrent Crawler
Put everything together in the main function:
func main() {
startURL := "https://example.com"
parsed, _ := url.Parse(startURL)
crawler := NewCrawler(10, 3, parsed.Host) // 10 workers, depth 3
// Start workers
for i := 0; i < 10; i++ {
go crawler.worker(i)
}
// Seed the queue (mark the start URL so workers do not re-queue it)
crawler.markVisited(startURL)
crawler.wg.Add(1)
crawler.urlQueue <- startURL
// Wait for completion
crawler.wg.Wait()
close(crawler.urlQueue)
log.Printf("Crawled %d pages\n", len(crawler.visited))
}
This crawler processes up to 10 pages simultaneously. To keep the example short, the maxWorkers and maxDepth values passed to NewCrawler are not enforced: the worker count is hard-coded in main and depth tracking is left as an exercise. Adjust the worker count based on your network speed and the target server's capacity. Start with 5-10 workers and increase gradually.
Adding Rate Limiting
Respect target servers by adding delays between requests:
import (
"math/rand"
"time"
)
func (c *Crawler) worker(id int) {
for pageURL := range c.urlQueue {
// Random delay between 1-3 seconds
delay := time.Duration(1000+rand.Intn(2000)) * time.Millisecond
time.Sleep(delay)
// ... rest of worker logic
}
}
Random delays make your traffic pattern look more human. Fixed intervals are easily detected by anti-bot systems.
Method 2: Colly Framework for Production Crawlers
Colly is the most popular Go web crawling framework. It handles caching, parallelism, rate limiting, and cookie management out of the box.
Installing and Configuring Colly
The Colly collector is the core component that manages everything:
package main
import (
"log"
"time"

"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector(
colly.AllowedDomains("example.com", "www.example.com"),
colly.MaxDepth(3),
colly.Async(true),
colly.CacheDir("./colly_cache"),
)
// Limit: up to 10 concurrent requests per domain, 0.5-1s delay between requests
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 10,
Delay: 500 * time.Millisecond,
RandomDelay: 500 * time.Millisecond,
})
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0")
log.Printf("Visiting: %s\n", r.URL)
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
e.Request.Visit(link)
})
c.OnResponse(func(r *colly.Response) {
log.Printf("Response from %s: %d bytes\n", r.Request.URL, len(r.Body))
})
c.OnError(func(r *colly.Response, err error) {
log.Printf("Error on %s: %v\n", r.Request.URL, err)
})
c.Visit("https://example.com")
c.Wait()
}
The LimitRule configuration is powerful. It applies rate limiting per domain, so you can crawl multiple sites simultaneously while respecting each site's limits.
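As a sketch with placeholder domain globs, you can register separate rules for separate sites on one collector:
err := c.Limits([]*colly.LimitRule{
{DomainGlob: "*example.com*", Parallelism: 2, Delay: 2 * time.Second},
{DomainGlob: "*web-scraping.dev*", Parallelism: 8, RandomDelay: 1 * time.Second},
})
if err != nil {
log.Fatal(err)
}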
Extracting Structured Data with Colly
Colly's callback system makes data extraction clean:
import (
"encoding/json"
"log"
"os"

"github.com/gocolly/colly/v2"
)

type Product struct {
Name string
Price string
URL string
}
func main() {
var products []Product
c := colly.NewCollector(
colly.AllowedDomains("web-scraping.dev"),
)
// Find product cards
c.OnHTML("div.product", func(e *colly.HTMLElement) {
product := Product{
Name: e.ChildText("h3"),
Price: e.ChildText(".price"),
URL: e.Request.URL.String(),
}
products = append(products, product)
})
// Follow pagination links
c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
c.OnScraped(func(r *colly.Response) {
log.Printf("Finished: %s\n", r.Request.URL)
})
c.Visit("https://web-scraping.dev/products")
c.Wait()
// Export to JSON
jsonData, _ := json.MarshalIndent(products, "", " ")
os.WriteFile("products.json", jsonData, 0644)
}
The ChildText and ChildAttr methods make selecting nested elements straightforward. No need for complex CSS selectors when you can chain simple ones.
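For example, ChildAttr reads attribute values from nested elements; the selectors below are illustrative:
c.OnHTML("div.product", func(e *colly.HTMLElement) {
image := e.ChildAttr("img", "src")
detail := e.ChildAttr("a.details", "href")
// Resolve relative URLs against the current page
log.Printf("image=%s detail=%s\n", e.Request.AbsoluteURL(image), e.Request.AbsoluteURL(detail))
})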
Using Multiple Collectors for Complex Crawls
For sites with different page types, use cloned collectors:
func main() {
// Main collector for listing pages
listCollector := colly.NewCollector(
colly.AllowedDomains("example.com"),
colly.Async(true),
)
// Clone for detail pages (different settings)
detailCollector := listCollector.Clone()
listCollector.OnHTML("a.product-link", func(e *colly.HTMLElement) {
// Hand off to detail collector
detailCollector.Visit(e.Attr("href"))
})
detailCollector.OnHTML("div.product-detail", func(e *colly.HTMLElement) {
// Extract full product data
name := e.ChildText("h1")
description := e.ChildText(".description")
log.Printf("Product: %s\n", name)
})
listCollector.Visit("https://example.com/products")
listCollector.Wait()
detailCollector.Wait()
}
This pattern keeps your code organized when crawling sites with multiple page templates.
Method 3: Chromedp for JavaScript-Heavy Sites
Modern websites often load content dynamically with JavaScript. Standard HTTP requests only get the initial HTML shell. Chromedp controls a real Chrome browser to render these pages.
Basic Chromedp Setup
First, install chromedp and ensure Chrome or Chromium is available on your system:
go get github.com/chromedp/chromedp
Here is a basic example that renders a JavaScript page:
package main
import (
"context"
"log"
"time"
"github.com/chromedp/chromedp"
)
func main() {
// Create context with timeout
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
defer cancel()
var htmlContent string
err := chromedp.Run(ctx,
chromedp.Navigate("https://example.com"),
chromedp.WaitVisible("body", chromedp.ByQuery),
chromedp.Sleep(2*time.Second), // Wait for JS to load
chromedp.OuterHTML("html", &htmlContent),
)
if err != nil {
log.Fatal(err)
}
log.Printf("Got %d bytes of HTML\n", len(htmlContent))
}
Chromedp runs in headless mode by default. The browser executes all JavaScript, just like a real user's browser.
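If you want to watch the browser while debugging, build the context from a custom exec allocator and turn headless off. This is a sketch; the User-Agent string is just an example, and the options extend chromedp's defaults:
opts := append(chromedp.DefaultExecAllocatorOptions[:],
chromedp.Flag("headless", false), // show the browser window
chromedp.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"),
)
allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
defer cancel()
ctx, cancel := chromedp.NewContext(allocCtx)
defer cancel()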
Handling Infinite Scroll
Many sites load more content as you scroll. Here is how to handle that:
func scrollAndCollect(ctx context.Context) ([]string, error) {
var items []string
previousCount := 0
for i := 0; i < 10; i++ { // Max 10 scroll attempts
// Scroll to bottom
err := chromedp.Run(ctx,
chromedp.Evaluate(`window.scrollTo(0, document.body.scrollHeight)`, nil),
chromedp.Sleep(2*time.Second),
)
if err != nil {
return nil, err
}
// Collect items
var currentItems []string
err = chromedp.Run(ctx,
chromedp.Evaluate(`
Array.from(document.querySelectorAll('.item-class'))
.map(el => el.textContent)
`, &currentItems),
)
if err != nil {
return nil, err
}
items = currentItems
// Stop if no new items loaded
if len(items) == previousCount {
break
}
previousCount = len(items)
}
return items, nil
}
The loop continues scrolling until no new content appears. Adjust the sleep duration based on how fast the target site loads.
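If fixed sleeps feel fragile, chromedp.Poll can wait until the item count actually grows. This is a sketch: .item-class and the helper name are placeholders, and fmt must be imported.
// waitForMoreItems blocks until more than prevCount items exist or the poll times out
func waitForMoreItems(ctx context.Context, prevCount int) error {
expr := fmt.Sprintf(`document.querySelectorAll('.item-class').length > %d`, prevCount)
var grown bool
return chromedp.Run(ctx,
chromedp.Poll(expr, &grown, chromedp.WithPollingTimeout(10*time.Second)),
)
}
Call it in place of the fixed chromedp.Sleep after each scroll; a timeout error then signals that no new items loaded.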
Filling Forms and Clicking Buttons
Chromedp can interact with pages like a human:
func loginAndScrape(ctx context.Context, username, password string) error {
return chromedp.Run(ctx,
chromedp.Navigate("https://example.com/login"),
chromedp.WaitVisible("#username", chromedp.ByID),
// Fill login form
chromedp.SendKeys("#username", username, chromedp.ByID),
chromedp.SendKeys("#password", password, chromedp.ByID),
// Click submit
chromedp.Click("#submit-btn", chromedp.ByID),
// Wait for redirect
chromedp.WaitVisible(".dashboard", chromedp.ByQuery),
)
}
The WaitVisible action ensures the element exists before interacting with it. This prevents race conditions where your code tries to click a button that has not rendered yet.
Running Chromedp in Docker
For production deployments, use the official headless Chrome image:
FROM chromedp/headless-shell:latest
COPY your-scraper /app/scraper
WORKDIR /app
CMD ["./scraper"]
Alternatively, run headless-shell as its own container (it exposes Chrome's DevTools endpoint on port 9222) and connect to it from your Go code with a remote allocator:
allocCtx, cancel := chromedp.NewRemoteAllocator(
context.Background(),
"ws://chrome-container:9222",
)
defer cancel()
ctx, cancel := chromedp.NewContext(allocCtx)
defer cancel()
This setup is more resource-efficient for large-scale scraping.
Advanced Techniques: Proxy Rotation and Anti-Bot Bypass
Production crawlers need to handle IP bans, rate limits, and anti-bot systems. Here are techniques that work in 2026.
Rotating User-Agents
Rotate your User-Agent string to appear as different browsers:
var userAgents = []string{
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/119.0.0.0 Safari/537.36",
}
func randomUserAgent() string {
return userAgents[rand.Intn(len(userAgents))]
}
func fetchWithRotation(url string) (*http.Response, error) {
client := &http.Client{Timeout: 30 * time.Second}
req, _ := http.NewRequest("GET", url, nil)
req.Header.Set("User-Agent", randomUserAgent())
req.Header.Set("Accept", "text/html,application/xhtml+xml")
req.Header.Set("Accept-Language", "en-US,en;q=0.9")
req.Header.Set("Accept-Encoding", "gzip, deflate, br")
return client.Do(req)
}
Keep your User-Agent list updated. Outdated browser versions are a red flag for anti-bot systems.
Proxy Rotation in Pure Go
Here is a complete proxy rotation implementation:
type ProxyRotator struct {
proxies []string
index int
mu sync.Mutex
}
func NewProxyRotator(proxies []string) *ProxyRotator {
return &ProxyRotator{proxies: proxies}
}
func (p *ProxyRotator) Next() string {
p.mu.Lock()
defer p.mu.Unlock()
proxy := p.proxies[p.index]
p.index = (p.index + 1) % len(p.proxies)
return proxy
}
func (p *ProxyRotator) GetClient() *http.Client {
proxyURL, _ := url.Parse(p.Next())
transport := &http.Transport{
Proxy: http.ProxyURL(proxyURL),
// Skip TLS verification only if your proxy provider requires it;
// leaving this on weakens security for production crawls
TLSClientConfig: &tls.Config{
InsecureSkipVerify: true,
},
}
return &http.Client{
Transport: transport,
Timeout: 30 * time.Second,
}
}
Usage example:
func main() {
rotator := NewProxyRotator([]string{
"http://proxy1.example.com:8080",
"http://proxy2.example.com:8080",
"http://proxy3.example.com:8080",
})
for _, targetURL := range urls { // urls: your list of target pages
client := rotator.GetClient()
resp, err := client.Get(targetURL)
if err != nil {
continue
}
// ... handle response
resp.Body.Close()
}
}
For production scraping, residential proxies from providers like Roundproxies.com work best. Datacenter IPs are easily detected and blocked.
Proxy Rotation with Colly
Colly has built-in proxy support:
import "github.com/gocolly/colly/v2/proxy"
func main() {
c := colly.NewCollector()
// Round-robin proxy rotation
rp, err := proxy.RoundRobinProxySwitcher(
"http://user:pass@proxy1.example.com:8080",
"http://user:pass@proxy2.example.com:8080",
"socks5://user:pass@proxy3.example.com:1080",
)
if err != nil {
log.Fatal(err)
}
c.SetProxyFunc(rp)
// ... rest of collector setup
}
Colly supports HTTP, HTTPS, and SOCKS5 proxies.
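With a single proxy, SetProxy routes every request through it (the URL below is a placeholder):
if err := c.SetProxy("socks5://user:pass@proxy.example.com:1080"); err != nil {
log.Fatal(err)
}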
Handling Rate Limits and Retries
Implement exponential backoff for failed requests:
func fetchWithRetry(url string, maxRetries int) (*http.Response, error) {
client := &http.Client{Timeout: 30 * time.Second}
var resp *http.Response
var err error
for attempt := 0; attempt < maxRetries; attempt++ {
resp, err = client.Get(url)
if err == nil && resp.StatusCode == http.StatusOK {
return resp, nil
}
if resp != nil {
resp.Body.Close()
// Handle rate limiting (429)
if resp.StatusCode == http.StatusTooManyRequests {
backoff := time.Duration(math.Pow(2, float64(attempt))) * time.Second
log.Printf("Rate limited. Waiting %v before retry %d\n", backoff, attempt+1)
time.Sleep(backoff)
continue
}
}
// Exponential backoff for other errors
backoff := time.Duration(math.Pow(2, float64(attempt))) * time.Second
time.Sleep(backoff)
}
return nil, fmt.Errorf("failed after %d retries: %v", maxRetries, err)
}
Start with short delays and double them on each failure. This approach is gentle on servers while still getting your data.
TLS Fingerprint Spoofing
Advanced anti-bot systems like Cloudflare fingerprint your TLS handshake. The CycleTLS library lets you spoof browser fingerprints:
import "github.com/Danny-Dasilva/CycleTLS/cycletls"
func fetchWithTLSSpoof(url string) (string, error) {
client := cycletls.Init()
response, err := client.Do(url, cycletls.Options{
Body: "",
Ja3: "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,0-23-65281-10-11-35-16-5-13-18-51-45-43-27-17513,29-23-24,0",
UserAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
}, "GET")
if err != nil {
return "", err
}
return response.Body, nil
}
The JA3 string represents Chrome's TLS fingerprint. Use tools like scrapfly.io/web-scraping-tools/ja3-fingerprint to capture real browser fingerprints.
Production-Ready Patterns
Worker Pool with Context Cancellation
Handle graceful shutdown properly:
func crawlWithContext(ctx context.Context, urls []string, workers int) error {
jobs := make(chan string, len(urls))
results := make(chan string, len(urls))
var wg sync.WaitGroup
// Start workers
for i := 0; i < workers; i++ {
wg.Add(1)
go func(id int) {
defer wg.Done()
for {
select {
case url, ok := <-jobs:
if !ok {
return
}
result := processURL(url) // processURL: your fetch-and-parse function
results <- result
case <-ctx.Done():
log.Printf("Worker %d shutting down\n", id)
return
}
}
}(i)
}
// Send jobs
for _, url := range urls {
jobs <- url
}
close(jobs)
// Wait for workers
wg.Wait()
close(results)
return nil
}
func main() {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
// Handle interrupt
go func() {
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt)
<-sigCh
cancel()
}()
crawlWithContext(ctx, urls, 10)
}
Context cancellation ensures all goroutines stop cleanly when interrupted.
Saving Results to JSON and CSV
Export your crawled data:
import (
"encoding/csv"
"encoding/json"
"os"
"time"
)
type Result struct {
URL string `json:"url"`
Title string `json:"title"`
Timestamp time.Time `json:"timestamp"`
}
func saveJSON(results []Result, filename string) error {
data, err := json.MarshalIndent(results, "", " ")
if err != nil {
return err
}
return os.WriteFile(filename, data, 0644)
}
func saveCSV(results []Result, filename string) error {
file, err := os.Create(filename)
if err != nil {
return err
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// Header
writer.Write([]string{"URL", "Title", "Timestamp"})
// Data rows
for _, r := range results {
writer.Write([]string{
r.URL,
r.Title,
r.Timestamp.Format(time.RFC3339),
})
}
return nil
}
JSON is flexible for APIs. CSV integrates easily with spreadsheets and databases.
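A minimal usage sketch, assuming a results slice was filled during the crawl:
if err := saveJSON(results, "results.json"); err != nil {
log.Fatal(err)
}
if err := saveCSV(results, "results.csv"); err != nil {
log.Fatal(err)
}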
Structured Logging
Replace log.Printf with structured logging for production:
import "log/slog"
func main() {
logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
logger.Info("Starting crawler",
"workers", 10,
"max_depth", 3,
)
logger.Error("Request failed",
"url", url,
"status", resp.StatusCode,
"error", err,
)
}
JSON logs are parseable by monitoring tools like Elasticsearch and Datadog.
Common Mistakes to Avoid
Not Closing Response Bodies
Every http.Client.Do() returns a response body that must be closed:
// WRONG - the body is never closed, so the connection leaks
resp, _ := client.Do(req)
body, _ := io.ReadAll(resp.Body)
// CORRECT
resp, err := client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)
Unclosed bodies eventually exhaust file descriptors and crash your program.
Loop Variable Capture Bug
This classic Go mistake breaks goroutines:
// WRONG - all goroutines get the same URL
for _, url := range urls {
go func() {
fetchPage(url) // url changes before goroutine runs
}()
}
// CORRECT - capture the variable
for _, url := range urls {
go func(u string) {
fetchPage(u)
}(url)
}
Pass loop variables as parameters to fix this. Note that Go 1.22 changed for-loop scoping so each iteration gets its own variable, which removes this bug on current toolchains; passing the value explicitly still reads clearly and keeps code built with older versions safe.
No Request Timeouts
Requests without timeouts hang forever:
// WRONG - can hang indefinitely
resp, _ := http.Get(url)
// CORRECT
client := &http.Client{Timeout: 30 * time.Second}
resp, _ := client.Get(url)
Always set timeouts on HTTP clients.
Ignoring robots.txt
Respecting robots.txt is both ethical and practical. Sites that catch you ignoring it will block you faster:
import "github.com/temoto/robotstxt"
func checkRobots(baseURL, targetPath string) (bool, error) {
robotsURL := baseURL + "/robots.txt"
resp, err := http.Get(robotsURL)
if err != nil {
return true, nil // Allow if robots.txt unavailable
}
defer resp.Body.Close()
data, err := robotstxt.FromResponse(resp)
if err != nil {
return true, nil // Allow if robots.txt cannot be parsed
}
return data.TestAgent(targetPath, "MyBot"), nil
}
FAQ
Is Go faster than Python for web crawling?
Yes. Go crawlers typically run 2-5x faster than equivalent Python scrapers. Go compiles to native code and has true parallelism via goroutines, while Python is interpreted and limited by the GIL. For a project crawling 1 million pages, this difference means finishing in hours instead of days.
Which Go library should I use for web crawling?
Start with net/http plus goquery for learning and simple projects. Use Colly for production crawlers that need rate limiting, caching, and parallel execution. Use chromedp only when you need to render JavaScript or interact with dynamic content. Most sites work fine with Colly.
How many concurrent requests can Go handle?
A single Go program can easily handle thousands of concurrent connections. The practical limit depends on your network bandwidth, the target server's rate limits, and available RAM. Start with 10-50 concurrent workers and scale up while monitoring for 429 errors.
How do I avoid getting blocked while crawling?
Use these techniques in combination: rotate User-Agents, add random delays between requests (1-5 seconds), use residential proxies from providers like Roundproxies.com, and respect rate limits. Avoid predictable patterns that anti-bot systems can detect.
Can Go crawlers handle JavaScript-rendered pages?
Yes, using chromedp or rod libraries. These control a real Chrome browser that executes JavaScript. However, headless browsers are slower and more resource-intensive than HTTP requests. Only use them when necessary.
Wrapping Up
You now have three complete approaches for building web crawlers in Go: pure net/http with goroutines for maximum control, Colly for production-ready features, and chromedp for JavaScript-heavy sites.
Go's performance and concurrency model make it the best choice for large-scale crawling projects in 2026. Start with the method that matches your use case, add anti-bot protections as needed, and scale your worker count based on results.
The code examples in this guide are production-tested patterns. Adapt them to your specific needs, respect site terms of service, and scale responsibly.
Next steps: Explore the Rod library for an alternative headless browser with built-in stealth features, or learn about CycleTLS for advanced TLS fingerprint spoofing against Cloudflare-protected sites.