Web crawlers fascinate me. They are like digital explorers, hopping from link to link, mapping the internet one page at a time. But let's face it: building one can feel intimidating, especially if you are new to concurrency or haven't spent much time with Go (also known as Golang).

In this guide, I will take you through the process of writing a web crawler using Go. Step by step. And in the end, you will have a script that fetches pages, parses links, and does it all with concurrency so you can explore faster.

Sound good?

Let's get into it.


Why Build Your Own Web Crawler?

Some folks say it is easier to grab an off-the-shelf tool. They are not wrong. Tools like Sitebulb or Screaming Frog exist for a reason. But building your own crawler unleashes your creativity.

Maybe you want to index very specific data: a list of blog posts with certain tags, or price information from e-commerce sites. A custom approach ensures you only gather the data you actually want.

Also, there is the raw speed of Go. It handles concurrency with ease. That means you can crawl multiple sites (or multiple URLs) at once, maxing out your bandwidth or CPU. If you need to build something big and robust, Go is ready for the challenge.


Setting Up Your Environment

Let's keep it simple:

  1. Install Go if you haven't already. (I am using 1.20, but anything newer than 1.18 is fine.)
  2. Create a new folder for your project. Maybe call it go-web-crawler.

Initialize your module:

go mod init yourusername/go-web-crawler  

From here, you are good to start coding.

Need an editor? Visual Studio Code or GoLand are popular. Both have built-in Go tools for auto-formatting, linting, and more.


The Basic Structure

We will need at least three big parts:

  1. A function to fetch a page and get its HTML.
  2. A function to parse that HTML and extract links.
  3. A concurrency pattern that keeps track of what we’ve visited and what we need to visit next.

Go has a built-in net/http library that we will use for making requests.

For parsing HTML, we could use golang.org/x/net/html or a library like goquery to simplify the process. Let's start with a straightforward approach using golang.org/x/net/html, the HTML parser maintained by the Go team.
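One small setup note: golang.org/x/net/html lives outside the standard library, so add it to your module before building:

go get golang.org/x/net/html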

Building the Fetcher

Let us start with something simple:

package main

import (
    "fmt"
    "net/http"

    "golang.org/x/net/html"
)

func fetchHTML(url string) (*html.Node, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    // If we want to handle non-200 statuses carefully, let's do it here:
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("status code error: %d %s", 
                               resp.StatusCode, resp.Status)
    }

    doc, err := html.Parse(resp.Body)
    if err != nil {
        return nil, err
    }
    return doc, nil
}

What are we doing here?

- We issue an HTTP GET with http.Get(url).
- If all goes well, we parse the response body using html.Parse.
- We return an *html.Node that we can explore for links.

Note: This function doesn't handle fancy stuff like timeouts, user-agents, or cookies.

If you need to get past basic bot checks or services (like simple Cloudflare challenges), you might add a custom http.Client with realistic headers, roughly the Go-side equivalent of reaching for a library like cfscrape in Python.

But for now, we keep it straightforward and stick with plain Go.
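If you do eventually want a timeout and a custom User-Agent, here is a minimal sketch of what that could look like. fetchHTMLWithClient is a made-up name, and you would also add "time" to the imports:

// Sketch of a more configurable fetcher: a shared http.Client with a
// timeout, plus a custom User-Agent header on each request.
var httpClient = &http.Client{Timeout: 10 * time.Second}

func fetchHTMLWithClient(url string) (*html.Node, error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", "go-web-crawler/0.1")

    resp, err := httpClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("status code error: %d %s", resp.StatusCode, resp.Status)
    }
    return html.Parse(resp.Body)
}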

Next, we want to navigate through the HTML node tree to find all <a> tags.

Let's do that.

func extractLinks(doc *html.Node) []string {
    var links []string

    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, attr := range n.Attr {
                if attr.Key == "href" {
                    links = append(links, attr.Val)
                }
            }
        }
        // Recursively traverse children
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
    f(doc)
    return links
}

We just:

1. Recursively walk the node tree.
2. Whenever we hit an <a> element, we grab its href.
3. We store those URLs in a slice.
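If you want to sanity-check these two functions before wiring up any concurrency, a throwaway main like this (replace it with the real one we build later) fetches a single page and prints every href it finds:

// Temporary sanity check: fetch one page and list its links.
// Swap this out for the concurrent crawl loop built in the next section.
func main() {
    doc, err := fetchHTML("https://golang.org")
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    for _, link := range extractLinks(doc) {
        fmt.Println(link)
    }
}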


Concurrency for the Win

Here is where Go shines: goroutines and channels.

We can queue up URLs to fetch, spawn a set number of workers, and let them do their thing in parallel.

Let's build a simple worker pool. We will keep track of visited URLs to avoid infinite loops or repeated processing.

We also decide how deep we want to crawl or how many total URLs to process.

import (
    "sync"
)

// A shared map to track visited pages
var visited = make(map[string]bool)
// A mutex to protect writes to visited
var mu sync.Mutex

func crawl(startURL string, maxDepth int) {
    // Use a channel for BFS or depth-based exploration
    urlsToVisit := make(chan string, 100)
    var wg sync.WaitGroup

    // Let's set up a function to process one URL
    worker := func() {
        defer wg.Done()
        for url := range urlsToVisit {
            // Check if already visited
            mu.Lock()
            if visited[url] {
                mu.Unlock()
                continue
            }
            visited[url] = true
            mu.Unlock()

            fmt.Println("Visiting:", url)
            doc, err := fetchHTML(url)
            if err != nil {
                fmt.Println("Error fetching:", err)
                continue
            }
            foundLinks := extractLinks(doc)

            // BFS-like approach: push discovered links back
            for _, link := range foundLinks {
                // Optional: sanitize or make absolute here
                // We'll do a naive approach for now
                if maxDepth > 0 {
                    // We'll pretend we have a function to check if link is same domain, etc.
                }
                urlsToVisit <- link
            }
        }
    }

    // Let’s say we spin up 5 workers
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go worker()
    }

    // Kickstart
    urlsToVisit <- startURL

    // Closing the channel is the tricky part. In production you would track
    // depth for each URL or detect when the BFS frontier is exhausted, then
    // close(urlsToVisit) so the workers' range loops can finish. To keep
    // this demonstration short, we leave termination naive, for example:

    go func() {
        // ... potential termination logic ...
        // time.Sleep(10 * time.Second)
        // close(urlsToVisit)
    }()

    wg.Wait()
}

Yes, this is a bit open-ended.

In a real-world setting, you might pass along a (url, depth) pair, or you might define a certain boundary.

But the gist is the same: multiple goroutines read from urlsToVisit. Each goroutine fetches a page, extracts links, and sends new URLs back into the channel if we haven't visited them.

The sync.WaitGroup ensures we wait until all workers are done.


Handling URL Normalization

One messy problem is that links come in all forms:

- Relative paths like /about or ../index.html
- Full absolute ones: https://example.com/about

We often want to unify them. Typically, you:

  1. Parse the base URL.
  2. Use something like net/url to resolve references.

A quick fix:

import "net/url"

func normalizeLink(base string, link string) string {
    baseURL, err := url.Parse(base)
    if err != nil {
        return link
    }
    ref, err := url.Parse(link)
    if err != nil {
        return link
    }
    return baseURL.ResolveReference(ref).String()
}

Then in the worker, you can do:

// inside the for _, link := range foundLinks loop:
normalized := normalizeLink(url, link)
urlsToVisit <- normalized

Now you avoid duplicates caused by un-normalized links.
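To make that concrete, here is roughly what normalizeLink produces for a few common link shapes, assuming a base page of https://example.com/blog/post (illustrative only; the values in the comments are what url.ResolveReference yields for each case):

base := "https://example.com/blog/post"
fmt.Println(normalizeLink(base, "/about"))              // https://example.com/about
fmt.Println(normalizeLink(base, "../index.html"))       // https://example.com/index.html
fmt.Println(normalizeLink(base, "https://other.com/x")) // https://other.com/x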


Dealing with JavaScript-Heavy Sites

Modern websites can rely heavily on client-side JavaScript to render their content. A naive HTTP GET against one of those might return little more than an empty shell. That's fine if you just want the raw HTML or are crawling mostly static (or server-rendered) sites.

But if you want to handle dynamic content, you will need a headless browser approach, for example via the chromedp library for Go.

That gets more complicated. For now, we keep it basic.
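Purely as a taste of what that looks like, here is a hedged, standalone sketch using chromedp. The function name and timeout are my own placeholders, and it assumes Chrome is installed locally:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/chromedp/chromedp"
)

// renderedHTML loads a page in headless Chrome and returns the HTML after
// scripts have run. Sketch only: no retries, minimal error handling.
func renderedHTML(pageURL string) (string, error) {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Bound how long we are willing to wait for the page to render.
    ctx, cancel = context.WithTimeout(ctx, 15*time.Second)
    defer cancel()

    var out string
    err := chromedp.Run(ctx,
        chromedp.Navigate(pageURL),
        chromedp.OuterHTML("html", &out, chromedp.ByQuery),
    )
    return out, err
}

func main() {
    htmlStr, err := renderedHTML("https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(len(htmlStr), "bytes of rendered HTML")
}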


Putting It All Together

Our main function might look like this:

func main() {
    start := "https://golang.org"
    fmt.Println("Starting at:", start)

    // We can define some basic depth or BFS limit here.
    // For simplicity, let's say we won't implement a strict BFS depth,
    // but you could pass along depth in the channel if you want.

    crawl(start, 1) // just pass 1 as a placeholder
}

When you run this:

  1. The program prints "Starting at: <url>".
  2. Spawns 5 workers that read from the channel.
  3. The first URL goes in, gets processed, extracts links, and so on.
  4. Because we have not closed the channel or set a real BFS limit, it might run forever or until you Ctrl+C.

In practice, you will refine this to avoid an infinite web crawl across the entire internet. One approach is to store not just visited[url] but also how many times or at what depth you visited it. Then only queue new links if depth < maxDepth. That way, you control the scope.
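One hedged way to sketch that, using a hypothetical crawlTask type and helper, is to pass depth along with each URL and refuse to queue anything past the limit:

// Hypothetical sketch: carry depth alongside each URL. The channel would
// then hold crawlTask values instead of bare strings.
type crawlTask struct {
    url   string
    depth int
}

// enqueueLinks queues newly discovered links only while the current task
// is still within the depth budget.
func enqueueLinks(tasks chan<- crawlTask, current crawlTask, links []string, maxDepth int) {
    if current.depth >= maxDepth {
        return
    }
    for _, link := range links {
        tasks <- crawlTask{
            url:   normalizeLink(current.url, link),
            depth: current.depth + 1,
        }
    }
}

Workers would then range over the task channel and pass task.url to fetchHTML, exactly as before.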


Conclusion

Building a web crawler in Go is surprisingly straightforward once you harness concurrency. You build a fetching function. A link parser. A concurrency pattern with goroutines and channels. And suddenly you have got a formidable data miner that can roam the web collecting what you need.

If all you wanted was a simple site-audit tool, you could do the same in Python or JavaScript with existing libraries, maybe even do quick hacks to bypass Cloudflare with something like cfscrape. But Go is a powerful ally when it comes to concurrency, speed, and efficiency.

So give it a go. Start small. Tweak the concurrency. Manage your visited set. Add a BFS depth limit. Or store everything in a database to parse later. Once you master these basics, the possibilities are huge.

By the time you finish, you won't just have a web crawler.

You will have a new perspective on how the internet is stitched together, link by link, waiting to be explored.

Marius Bernard

Marius Bernard is a Product Advisor, Technical SEO, & Brand Ambassador at Roundproxies. He was the lead author for the SEO chapter of the 2024 Web and a reviewer for the 2023 SEO chapter.