Lua doesn't get much attention in the web scraping world—Python and JavaScript dominate the conversation.

But if you're working with embedded systems, game engines, or tools like Nginx and Redis where Lua is already baked in, learning to scrape with Lua means you can extract data without adding another language to your stack.

I've been scraping with Lua for a few years now, mostly in environments where Python wasn't available or would've been overkill. The experience taught me that Lua's simplicity is actually its strength for scraping.

You won't find the ecosystem of Python, but you will find blazing-fast pattern matching, tiny memory footprints, and the ability to script directly into applications that already run Lua.

In this guide, I'll show you how to scrape with Lua from the ground up—from making HTTP requests to parsing HTML to handling the stuff that usually trips people up. I'll also share some techniques I haven't seen documented elsewhere, like using incremental parsing to scrape massive pages without running out of memory.

Why Scrape with Lua?

Before diving into code, let's address the obvious question: why bother with Lua when Python exists?

Here's when Lua makes sense:

You're already using Lua. If you're working with OpenResty, game mods, or embedded systems, adding a scraper in the same language beats managing multiple runtimes.

Memory matters. Lua's tiny footprint (the entire runtime is under 300KB) makes it perfect for constrained environments. I've run Lua scrapers on routers and IoT devices where Python wouldn't fit.

You need speed for pattern matching. Lua's pattern matching is compiled C code and runs significantly faster than Python's regex for simple cases. When you're processing millions of HTML strings, this adds up.

Simple is better. Lua has fewer features than Python, which means fewer ways to overcomplicate your scraper. Sometimes that's exactly what you want.

That said, if you need browser automation, complicated JavaScript rendering, or access to thousands of scraping libraries, stick with Python or JavaScript. Lua shines in specific niches—know which niche you're in.

Setting Up Your Environment

First, make sure you have Lua installed. Most Linux systems have it available:

# Check your Lua version
lua -v

# Install LuaRocks (Lua's package manager)
sudo apt-get install luarocks  # Debian/Ubuntu
brew install luarocks          # macOS

Now install the essential scraping packages:

# HTTP client
luarocks install luasocket

# HTTPS support
luarocks install luasec

# HTML parser
luarocks install gumbo

If luasec or gumbo fails to compile, you probably need a compiler and the OpenSSL development headers (luasec links against OpenSSL):

sudo apt-get install libssl-dev build-essential

That's it. Three libraries and you're ready to scrape.
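
If you want to confirm everything installed cleanly, a quick throwaway script only needs pcall:

-- Verify that the three scraping dependencies load
local function check(name)
    local ok = pcall(require, name)
    print(name, ok and "OK" or "MISSING")
end

check("socket.http")  -- luasocket
check("ssl.https")    -- luasec
check("gumbo")        -- lua-gumbo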

Making HTTP Requests with LuaSocket

LuaSocket provides the http module for making requests. The simplest approach looks like this:

local http = require("socket.http")

local body, code, headers, status = http.request("http://example.com")

if code == 200 then
    print(body)
else
    print("Request failed with code:", code)
end

This works, but it's the naive approach. The http.request function has two forms—a simple string form (above) and a more flexible table form that gives you control over everything:

local http = require("socket.http")
local ltn12 = require("ltn12")

-- Store response in a table
local response_body = {}

local result, code, headers = http.request{
    url = "http://example.com/api/data",
    method = "GET",
    headers = {
        ["User-Agent"] = "Mozilla/5.0 (compatible; LuaScraper/1.0)",
        ["Accept"] = "text/html,application/xhtml+xml"
    },
    sink = ltn12.sink.table(response_body)
}

if code == 200 then
    local full_body = table.concat(response_body)
    print(full_body)
end

Let me break down what's happening here:

The sink parameter tells LuaSocket where to put the response data. By using ltn12.sink.table(response_body), chunks of the response get appended to the response_body table as they arrive. The table sink still holds the whole response in memory, but the same mechanism accepts a custom sink that processes chunks incrementally without storing them (more on this later).

Custom headers are critical for real scraping. The default User-Agent screams "I'm a bot," which some sites block immediately. Always set a realistic User-Agent.

In the table form, the first return value is just 1 on success (the body goes to the sink, not the return value); the second is the status code and the third is a table of response headers. Check the code before you parse; an error page isn't worth parsing.

POST Requests and Form Data

For POST requests, you need to provide a source:

local http = require("socket.http")
local ltn12 = require("ltn12")

local request_body = "username=testuser&password=secret"
local response = {}

local result, code = http.request{
    url = "http://example.com/login",
    method = "POST",
    headers = {
        ["Content-Type"] = "application/x-www-form-urlencoded",
        ["Content-Length"] = tostring(#request_body)
    },
    source = ltn12.source.string(request_body),
    sink = ltn12.sink.table(response)
}

Notice we set Content-Length manually. LuaSocket won't calculate it for you; if you leave it out, the body is sent with chunked transfer encoding, which some servers reject.
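
One related detail: form values need URL-encoding, and the hard-coded string above only works because it contains no special characters. A small helper on top of LuaSocket's socket.url module keeps that tidy (a sketch; the field values are made up):

local url = require("socket.url")

-- Build an application/x-www-form-urlencoded body from a table of fields
local function encode_form(fields)
    local parts = {}
    for k, v in pairs(fields) do
        table.insert(parts, url.escape(k) .. "=" .. url.escape(tostring(v)))
    end
    return table.concat(parts, "&")
end

local request_body = encode_form{ username = "test user", password = "p@ss&word" }
-- e.g. "username=test%20user&password=p%40ss%26word" (pairs order may vary)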

Parsing HTML: The Gumbo Approach

Now you've got HTML—time to extract data. You have two options: use a proper HTML parser or use pattern matching. Let's start with the parser.

The lua-gumbo library provides a full HTML5 parser that builds a DOM tree:

local gumbo = require("gumbo")
local http = require("socket.http")

local body = http.request("http://example.com")
local document = gumbo.parse(body)

-- Find element by ID
local content = document:getElementById("main-content")
if content then
    print(content.textContent)
end

-- Find elements by tag name
local links = document:getElementsByTagName("a")
for i = 1, #links do
    local href = links[i]:getAttribute("href")
    if href then
        print("Link:", href)
    end
end

Gumbo builds a proper DOM tree that follows W3C standards. This means methods like getElementById, getElementsByTagName, and getElementsByClassName work just like in a browser.

Here's a practical example—scraping article titles and links from a blog:

local gumbo = require("gumbo")
local http = require("socket.http")

local body = http.request("http://blog.example.com")
local document = gumbo.parse(body)

local articles = document:getElementsByClassName("article-item")

for i = 1, #articles do
    local article = articles[i]
    
    -- Get the title (first h2 in the article)
    local h2_elements = article:getElementsByTagName("h2")
    local title = h2_elements[1] and h2_elements[1].textContent or "No title"
    
    -- Get the link
    local links = article:getElementsByTagName("a")
    local url = links[1] and links[1]:getAttribute("href") or "No URL"
    
    print(string.format("Title: %s\nURL: %s\n", title, url))
end

The code walks through each article element, extracts the title from an h2 tag, and grabs the first link. Simple and readable.

When Gumbo Isn't Enough: CSS Selectors

Gumbo doesn't support CSS selectors out of the box (unlike Python's BeautifulSoup or JavaScript's Cheerio). If you need complex selectors, you have two options:

  1. Traverse the DOM manually (tedious but works; see the sketch after this list)
  2. Use pattern matching for simpler cases (covered next)
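
Here's roughly what manual traversal looks like: a minimal sketch of a recursive walker that emulates a tag-plus-class lookup. It assumes lua-gumbo's DOM-style fields (childNodes, localName, getAttribute); adjust the names if your build exposes them differently.

-- Collect all descendant elements matching a tag name and optional class
local function select_all(node, tag, class, results)
    results = results or {}
    for _, child in ipairs(node.childNodes or {}) do
        if child.localName == tag then
            local cls = child.getAttribute and child:getAttribute("class")
            if not class or (cls and cls:find(class, 1, true)) then
                table.insert(results, child)
            end
        end
        select_all(child, tag, class, results)  -- recurse into children
    end
    return results
end

-- Rough equivalent of the CSS selector "div.article-item"
local articles = select_all(document.documentElement, "div", "article-item")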

For most scraping tasks, Gumbo's basic methods are sufficient. But if you're scraping a complex site with deeply nested structures, manual traversal gets ugly fast. That's where pattern matching can save you.

Lua Pattern Matching for Quick Extraction

Here's something most tutorials won't tell you: for quick-and-dirty scraping, Lua's pattern matching often beats using a full parser. It's faster, uses less memory, and requires no external dependencies.

Lua patterns aren't as powerful as regex, but they're perfect for extracting specific data from HTML. Here's the pattern matching approach to grab all links from a page:

local http = require("socket.http")

local body = http.request("http://example.com")

-- Extract all href values
for url in body:gmatch('href="([^"]+)"') do
    print(url)
end

That one loop replaces the entire Gumbo link extraction code. The pattern href="([^"]+)" matches href=" followed by one or more characters that aren't a quote, captured by the parentheses.

Here are some patterns I use constantly:

-- Extract all email addresses
for email in body:gmatch('[%w%.%-_]+@[%w%.%-_]+%.%w+') do
    print(email)
end

-- Extract prices (e.g., $19.99)
for price in body:gmatch('%$%d+%.%d%d') do
    print(price)
end

-- Extract content between specific tags
local content = body:match('<div class="content">(.-)</div>')
print(content)

The .- pattern is particularly useful—it matches as few characters as possible (non-greedy), preventing you from accidentally capturing too much.

Pattern Matching Gotchas

Lua patterns have a few quirks that trip up newcomers:

No alternation operator. You can't do (foo|bar) like in regex. Character classes such as [fb][oa][or] are sometimes suggested as a workaround, but they also match strings you never intended ("for", "boo", and so on). The practical fix is to try multiple patterns one after another, as in the sketch below.
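
For example, where regex would write (foo|bar), just run the candidates one after another; this is a quick sketch with made-up strings:

-- No (foo|bar): check each alternative in turn instead
local candidates = { "out of stock", "sold out", "currently unavailable" }
for _, phrase in ipairs(candidates) do
    if body:find(phrase, 1, true) then  -- plain find, no patterns needed
        print("Product unavailable")
        break
    end
end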

No \d or \w shorthands. Lua uses %d for digits and %w for alphanumeric. The % is Lua's escape character, not backslash.

Magic characters need escaping. Characters like ().-?*+[]%^$ have special meaning. Escape them with %:

-- Wrong: the dot matches any character, so this also matches "exampleXcom"
local pattern = "example.com"

-- Right: %. matches a literal dot
local pattern = "example%.com"

Handling HTTPS and Authentication

Most modern sites use HTTPS, which LuaSocket doesn't support by default. You need luasec:

local https = require("ssl.https")
local ltn12 = require("ltn12")

local response = {}
local result, code = https.request{
    url = "https://secure-site.com/api",
    sink = ltn12.sink.table(response)
}

if code == 200 then
    print(table.concat(response))
end

The API is identical to the HTTP version; just replace http with https.

Basic Authentication

Some APIs require HTTP Basic Auth. Embed credentials directly in the URL:

local https = require("ssl.https")

local body = https.request("https://user:password@api.example.com/data")
print(body)

If you'd rather not embed credentials in the URL, or you need a different scheme, construct the Authorization header yourself. Here's Basic auth built manually; token schemes like OAuth or JWT work the same way with a Bearer value (see the sketch after this example):

local https = require("ssl.https")
local ltn12 = require("ltn12")
local mime = require("mime")

local username = "your_username"
local password = "your_password"
local credentials = mime.b64(username .. ":" .. password)

local response = {}
https.request{
    url = "https://api.example.com/data",
    headers = {
        ["Authorization"] = "Basic " .. credentials
    },
    sink = ltn12.sink.table(response)
}
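
And a minimal Bearer-token variant, assuming you already have a token from whatever OAuth or JWT flow the API uses (the token string here is a placeholder):

local https = require("ssl.https")
local ltn12 = require("ltn12")

local token = "your_access_token_here"  -- placeholder, obtained elsewhere

local response = {}
https.request{
    url = "https://api.example.com/data",
    headers = {
        ["Authorization"] = "Bearer " .. token
    },
    sink = ltn12.sink.table(response)
}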

Advanced: Streaming Large Responses

Here's a technique I haven't seen documented: using LTN12 to process responses incrementally. Most tutorials show you ltn12.sink.table(), which loads the entire response into memory. But what if you're scraping a 500MB JSON file?

Instead, write a custom sink that processes data as it arrives:

local http = require("socket.http")
local ltn12 = require("ltn12")

-- This function gets called with each chunk of data
local function process_chunk(chunk)
    if chunk then
        -- Process the chunk here
        -- For example, extract data with pattern matching
        for line in chunk:gmatch("[^\n]+") do
            if line:match("important_data") then
                print("Found:", line)
            end
        end
    end
    -- A nil chunk signals end of stream; returning 1 keeps the pump going
    return 1
end

http.request{
    url = "http://example.com/huge-file.txt",
    sink = process_chunk
}

This approach lets you scrape files of any size without loading them entirely into RAM. I've used this to process multi-gigabyte log files on servers with limited memory.

Buffering for Pattern Matching Across Chunks

One problem: what if your pattern spans across chunk boundaries? Here's a buffered solution:

local buffer = ""
local max_tail = 8192  -- cap on data carried over between chunks

local function buffered_sink(chunk)
    if chunk then
        buffer = buffer .. chunk

        -- Process complete matches; the position capture () records where
        -- each match ends, so we know how much of the buffer is consumed
        local consumed = 0
        for url, pos in buffer:gmatch('href="([^"]+)"()') do
            print("Link:", url)
            consumed = pos - 1
        end

        -- Keep the unconsumed tail for the next chunk, capped so the
        -- buffer can't grow without bound on pages with no matches
        buffer = buffer:sub(consumed + 1)
        if #buffer > max_tail then
            buffer = buffer:sub(-max_tail)
        end

        return 1
    else
        -- End of stream: handle whatever is left in the buffer
        for url in buffer:gmatch('href="([^"]+)"') do
            print("Link:", url)
        end
        return 1
    end
end

This keeps everything after the last complete match in the buffer, so a pattern split across a chunk boundary still matches once the rest arrives. The cap on the tail stops the buffer from growing without bound on pages with few or no matches.
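
Hook it up the same way as the simpler sink above:

http.request{
    url = "http://example.com/huge-page.html",
    sink = buffered_sink
}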

JavaScript-Heavy Sites with Splash

Pure Lua can't execute JavaScript, so sites that load content dynamically won't work with LuaSocket alone. That's where Splash comes in—a lightweight browser engine that uses Lua for scripting.

Splash runs as a separate service (via Docker) and provides HTTP endpoints you can call from Lua:

# Start Splash
docker run -p 8050:8050 scrapinghub/splash

Now send requests with Lua scripts embedded:

local http = require("socket.http")
local ltn12 = require("ltn12")
local json = require("cjson")  -- install with: luarocks install lua-cjson

local lua_script = [[
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)  -- Wait for JS to execute
    return {html = splash:html()}  -- returning a table makes Splash respond with JSON
end
]]

local request_data = json.encode({
    lua_source = lua_script,
    url = "https://dynamic-site.com"
})

local response = {}
http.request{
    url = "http://localhost:8050/execute",
    method = "POST",
    headers = {
        ["Content-Type"] = "application/json",
        ["Content-Length"] = tostring(#request_data)
    },
    source = ltn12.source.string(request_data),
    sink = ltn12.sink.table(response)
}

local result = json.decode(table.concat(response))
print(result.html)

The Lua script inside Splash has access to browser methods like splash:go(), splash:wait(), and splash:html(). You can interact with pages, click buttons, fill forms—basically anything Selenium does, but faster.
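
As a taste of that interaction API, here's a sketch of filling a search form inside a Splash script. splash:select and the element methods send_text and mouse_click come from Splash's scripting API (2.3+); the selectors are placeholders for whatever the target page actually uses.

function main(splash, args)
    splash:go(args.url)
    splash:wait(1)

    -- Type into a search box and submit the form
    local box = splash:select('input[name="q"]')
    box:send_text("lua scraping")
    splash:select('button[type="submit"]'):mouse_click()

    splash:wait(2)  -- give the results time to render
    return {html = splash:html()}
end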

Handling Infinite Scroll with Splash

Here's a practical Splash script for sites with infinite scrolling:

function main(splash, args)
    splash:go(args.url)
    
    local scroll_count = 5
    local scroll_delay = 1.0
    
    for i = 1, scroll_count do
        splash:wait(scroll_delay)
        splash:runjs("window.scrollTo(0, document.body.scrollHeight);")
    end
    
    splash:wait(2)
    return splash:html()
end

This scrolls to the bottom five times, waiting between scrolls for content to load. After you get the fully-rendered HTML, parse it with Gumbo as usual.

Rate Limiting and Politeness

Scraping responsibly means not hammering servers. Lua doesn't have a built-in sleep function, so use the one from the socket library:

local socket = require("socket")
local http = require("socket.http")

local urls = {
    "http://example.com/page1",
    "http://example.com/page2",
    "http://example.com/page3"
}

for _, url in ipairs(urls) do
    local body = http.request(url)
    -- Process body...
    
    socket.sleep(1)  -- Wait 1 second between requests
end

For more sophisticated rate limiting, track request timestamps:

local socket = require("socket")
local http = require("socket.http")

local requests_per_second = 2
local request_times = {}

local function rate_limited_request(url)
    -- Remove old timestamps
    local now = socket.gettime()
    while #request_times > 0 and now - request_times[1] > 1 do
        table.remove(request_times, 1)
    end
    
    -- Wait if we've hit the limit
    if #request_times >= requests_per_second then
        local wait_time = 1 - (now - request_times[1])
        if wait_time > 0 then
            socket.sleep(wait_time)
        end
    end
    
    -- Make request and log timestamp
    table.insert(request_times, socket.gettime())
    return http.request(url)
end

This ensures you never exceed the specified rate, regardless of how fast your scraper runs.

Avoiding Common Lua Scraping Pitfalls

Mistake 1: Using dot syntax for method calls

In Lua, : calls a method (it passes the object as the implicit self argument), while . only accesses a field:

-- Wrong: doesn't pass the element as self, so the call fails
local href = link.getAttribute("href")

-- Right: the colon passes link as the first argument
local href = link:getAttribute("href")

-- For string or table lengths, skip methods and use the # operator
print(#body)

This trips up everyone coming from Python. When in doubt about lengths, use the # operator.

Mistake 2: Not handling nil values

Lua doesn't throw exceptions when you access missing fields—it returns nil. This can cause silent failures:

-- Bad
local element = document:getElementById("nonexistent")
print(element.textContent)  -- Crashes with "attempt to index a nil value"

-- Good
local element = document:getElementById("nonexistent")
if element then
    print(element.textContent)
end

Always check for nil before accessing fields.
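
The and/or idiom used elsewhere in this guide is a compact way to do the same check while supplying a default (the element id here is just an example):

-- Falls back to "No title" when the element or its text is missing
local element = document:getElementById("article-title")
local title = element and element.textContent or "No title"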

Mistake 3: Forgetting table.concat

When using ltn12.sink.table(), you get a table of chunks, not a single string:

local response = {}
http.request{
    url = "http://example.com",
    sink = ltn12.sink.table(response)
}

-- Wrong: prints table memory address
print(response)

-- Right: concatenates chunks into string
print(table.concat(response))

Mistake 4: Not escaping pattern characters

When searching for literal strings that contain pattern magic characters:

local url = "http://example.com/path?id=123"

-- Wrong: ? is a pattern character
if body:match(url) then ...

-- Right: escape every non-alphanumeric character
local escaped = url:gsub("([^%w])", "%%%1")
if body:match(escaped) then ...

Or better yet, use plain string search:

if body:find(url, 1, true) then  -- true = plain search
    print("Found it!")
end

Real-World Example: Scraping Product Data

Let's put it all together with a complete scraper that extracts product information from an e-commerce site:

local https = require("ssl.https")
local ltn12 = require("ltn12")
local gumbo = require("gumbo")
local socket = require("socket")

local function scrape_product(url)
    -- Make request with proper headers; the table form writes the body to a sink
    local chunks = {}
    local result, code = https.request{
        url = url,
        headers = {
            ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        },
        sink = ltn12.sink.table(chunks)
    }

    if code ~= 200 then
        print("Failed to fetch:", url)
        return nil
    end

    local body = table.concat(chunks)

    -- Parse HTML
    local document = gumbo.parse(body)
    
    -- Extract product name
    local name_elem = document:getElementById("product-title")
    local name = name_elem and name_elem.textContent or "Unknown"
    
    -- Extract price using pattern matching
    local price = body:match('price":"$([%d%.]+)"')
    
    -- Extract description
    local desc_elem = document:getElementById("product-description")
    local description = desc_elem and desc_elem.textContent or ""
    
    -- Extract all image URLs
    local images = {}
    local img_elements = document:getElementsByTagName("img")
    for i = 1, #img_elements do
        local src = img_elements[i]:getAttribute("src")
        if src and src:match("product%-image") then
            table.insert(images, src)
        end
    end
    
    return {
        name = name:match("^%s*(.-)%s*$"),  -- Trim whitespace
        price = price or "N/A",  -- the price pattern may not match every page
        description = description:match("^%s*(.-)%s*$"),
        images = images,
        url = url
    }
end

-- Scrape multiple products
local product_urls = {
    "https://shop.example.com/product/1",
    "https://shop.example.com/product/2",
    "https://shop.example.com/product/3"
}

local products = {}
for _, url in ipairs(product_urls) do
    local product = scrape_product(url)
    if product then
        table.insert(products, product)
        print(string.format("Scraped: %s - $%s", product.name, product.price))
    end
    socket.sleep(1)  -- Be polite
end

-- Output as CSV
print("\nName,Price,URL")
for _, p in ipairs(products) do
    print(string.format('"%s","$%s","%s"', p.name, p.price, p.url))
end

This scraper:

  • Sets proper headers to avoid detection
  • Uses Gumbo for structured parsing
  • Falls back to pattern matching for JSON-embedded data
  • Handles missing elements gracefully
  • Rate limits requests
  • Outputs clean CSV data

You can extend this by adding error logging, retry logic, or database storage.
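
As a starting point for the retry part, here's a small sketch that wraps scrape_product with exponential backoff; the attempt count and delays are arbitrary choices.

local socket = require("socket")

-- Retry a fetch function a few times, backing off between attempts
local function with_retries(fn, url, max_attempts)
    max_attempts = max_attempts or 3
    for attempt = 1, max_attempts do
        local result = fn(url)
        if result then
            return result
        end
        if attempt < max_attempts then
            print(string.format("Attempt %d failed for %s, retrying...", attempt, url))
            socket.sleep(2 ^ attempt)  -- back off: 2s, 4s, ...
        end
    end
    return nil
end

-- Usage: local product = with_retries(scrape_product, url)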

Final Thoughts

Lua web scraping isn't mainstream, but it fills a specific need: lightweight, fast data extraction in environments where Python isn't available or practical. You won't find as many libraries or Stack Overflow answers, but the core tools—LuaSocket, Gumbo, and pattern matching—cover 90% of scraping tasks.

The real strength of Lua scraping is memory efficiency. When you're running on embedded hardware, inside game servers, or anywhere RAM is precious, Lua's tiny footprint and careful memory management beat heavier alternatives. Combine that with Lua's speed and you have a scraping solution that scales down to IoT devices and up to high-throughput data pipelines.

Start with the basics: HTTP requests, pattern matching, and simple parsing. Once you're comfortable, explore Splash for JavaScript sites and LTN12 for streaming large responses. Keep your scrapers polite, respect robots.txt, and always check a site's terms of service before scraping.

Now go build something.