Lua doesn't get much attention in the web scraping world—Python and JavaScript dominate the conversation.
But if you're working with embedded systems, game engines, or tools like Nginx and Redis where Lua is already baked in, learning to scrape with Lua means you can extract data without adding another language to your stack.
I've been scraping with Lua for a few years now, mostly in environments where Python wasn't available or would've been overkill. The experience taught me that Lua's simplicity is actually its strength for scraping.
You won't find the ecosystem of Python, but you will find blazing-fast pattern matching, tiny memory footprints, and the ability to script directly into applications that already run Lua.
In this guide, I'll show you how to scrape with Lua from the ground up—from making HTTP requests to parsing HTML to handling the stuff that usually trips people up. I'll also share some techniques I haven't seen documented elsewhere, like using incremental parsing to scrape massive pages without running out of memory.
Why Scrape with Lua?
Before diving into code, let's address the obvious question: why bother with Lua when Python exists?
Here's when Lua makes sense:
You're already using Lua. If you're working with OpenResty, game mods, or embedded systems, adding a scraper in the same language beats managing multiple runtimes.
Memory matters. Lua's tiny footprint (the entire runtime is under 300KB) makes it perfect for constrained environments. I've run Lua scrapers on routers and IoT devices where Python wouldn't fit.
You need speed for pattern matching. Lua's pattern matcher is a small piece of compiled C with far less machinery than a full regex engine, so simple extractions run fast and allocate very little. When you're processing millions of HTML strings, this adds up.
Simple is better. Lua has fewer features than Python, which means fewer ways to overcomplicate your scraper. Sometimes that's exactly what you want.
That said, if you need browser automation, complicated JavaScript rendering, or access to thousands of scraping libraries, stick with Python or JavaScript. Lua shines in specific niches—know which niche you're in.
Setting Up Your Environment
First, make sure you have Lua installed. Most Linux systems have it available:
# Check your Lua version
lua -v
# Install LuaRocks (Lua's package manager)
sudo apt-get install luarocks # Debian/Ubuntu
brew install luarocks # macOS
Now install the essential scraping packages:
# HTTP client
luarocks install luasocket
# HTTPS support
luarocks install luasec
# HTML parser
luarocks install gumbo
If you hit compilation errors, you may need development tools: build-essential for modules that compile C (like gumbo) and libssl-dev for luasec:
sudo apt-get install libssl-dev build-essential
That's it. Three libraries and you're ready to scrape.
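Before moving on, a quick sanity check helps: a short script that tries to load each module will tell you immediately if anything is missing. This is just a convenience snippet, not part of any of these libraries:
-- sanity_check.lua: confirm the scraping stack loads
local ok_socket, http_or_err = pcall(require, "socket.http")
local ok_ssl, https_or_err = pcall(require, "ssl.https")
local ok_gumbo, gumbo_or_err = pcall(require, "gumbo")

print("luasocket:", ok_socket and "ok" or http_or_err)
print("luasec:   ", ok_ssl and "ok" or https_or_err)
print("gumbo:    ", ok_gumbo and "ok" or gumbo_or_err)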
Making HTTP Requests with LuaSocket
LuaSocket provides the http module for making requests. The simplest approach looks like this:
local http = require("socket.http")
local body, code, headers, status = http.request("http://example.com")
if code == 200 then
print(body)
else
print("Request failed with code:", code)
end
This works, but it's the naive approach. The http.request function has two forms—a simple string form (above) and a more flexible table form that gives you control over everything:
local http = require("socket.http")
local ltn12 = require("ltn12")
-- Store response in a table
local response_body = {}
local result, code, headers = http.request{
url = "http://example.com/api/data",
method = "GET",
headers = {
["User-Agent"] = "Mozilla/5.0 (compatible; LuaScraper/1.0)",
["Accept"] = "text/html,application/xhtml+xml"
},
sink = ltn12.sink.table(response_body)
}
if code == 200 then
local full_body = table.concat(response_body)
print(full_body)
end
Let me break down what's happening here:
The sink parameter tells LuaSocket where to put the response data. By using ltn12.sink.table(response_body), chunks of the response get appended to the response_body table as they arrive, and you concatenate them once at the end, which is cheaper than building the string through repeated concatenation. The same sink mechanism also lets you plug in a custom sink and process chunks incrementally instead of accumulating them all (more on this later).
Custom headers are critical for real scraping. The default User-Agent screams "I'm a bot," which some sites block immediately. Always set a realistic User-Agent.
Note the return values: with the table form, the body goes to the sink, so the first return value is just 1 on success (or nil plus an error message on failure), followed by the status code and headers. Check that code before parsing—you don't want to parse an error page.
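Since the body arrives through the sink rather than the return values, I usually wrap the table form in a small helper. Here's a minimal sketch (fetch is my own name for it, not part of LuaSocket):
local http = require("socket.http")
local ltn12 = require("ltn12")

-- Fetch a URL and return the body string, or nil plus the status/error on failure
local function fetch(url)
    local chunks = {}
    local ok, code = http.request{
        url = url,
        headers = {
            ["User-Agent"] = "Mozilla/5.0 (compatible; LuaScraper/1.0)"
        },
        sink = ltn12.sink.table(chunks)
    }
    if not ok or code ~= 200 then
        return nil, code
    end
    return table.concat(chunks)
end

local body, err = fetch("http://example.com")
if body then
    print(#body .. " bytes received")
else
    print("Fetch failed:", err)
end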
POST Requests and Form Data
For POST requests, you need to provide a source:
local http = require("socket.http")
local ltn12 = require("ltn12")
local request_body = "username=testuser&password=secret"
local response = {}
local result, code = http.request{
url = "http://example.com/login",
method = "POST",
headers = {
["Content-Type"] = "application/x-www-form-urlencoded",
["Content-Length"] = tostring(#request_body)
},
source = ltn12.source.string(request_body),
sink = ltn12.sink.table(response)
}
Notice we set Content-Length manually. LuaSocket won't calculate it for you, and many servers reject requests without it.
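Building the form body by hand gets error-prone once values contain spaces or special characters. LuaSocket's socket.url module has an escape function you can lean on; here's a small sketch of a helper (encode_form is my own name) that assembles a form-encoded body:
local socket_url = require("socket.url")

-- Encode a Lua table as application/x-www-form-urlencoded
local function encode_form(fields)
    local parts = {}
    for key, value in pairs(fields) do
        table.insert(parts, socket_url.escape(key) .. "=" .. socket_url.escape(tostring(value)))
    end
    return table.concat(parts, "&")
end

local request_body = encode_form{
    username = "testuser",
    password = "p@ss word&more"  -- characters that would break a hand-built string
}
-- Use request_body with the POST request shown above,
-- remembering to set Content-Length to #request_body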
Parsing HTML: The Gumbo Approach
Now you've got HTML—time to extract data. You have two options: use a proper HTML parser or use pattern matching. Let's start with the parser.
The lua-gumbo library provides a full HTML5 parser that builds a DOM tree:
local gumbo = require("gumbo")
local http = require("socket.http")
local body = http.request("http://example.com")
local document = gumbo.parse(body)
-- Find element by ID
local content = document:getElementById("main-content")
if content then
print(content.textContent)
end
-- Find elements by tag name
local links = document:getElementsByTagName("a")
for i = 1, #links do
local href = links[i]:getAttribute("href")
if href then
print("Link:", href)
end
end
Gumbo builds a proper DOM tree that follows W3C standards. This means methods like getElementById, getElementsByTagName, and getElementsByClassName work just like in a browser.
Here's a practical example—scraping article titles and links from a blog:
local gumbo = require("gumbo")
local https = require("ssl.https")
local body = https.request("https://blog.example.com")
local document = gumbo.parse(body)
local articles = document:getElementsByClassName("article-item")
for i = 1, #articles do
local article = articles[i]
-- Get the title (first h2 in the article)
local h2_elements = article:getElementsByTagName("h2")
local title = h2_elements[1] and h2_elements[1].textContent or "No title"
-- Get the link
local links = article:getElementsByTagName("a")
local url = links[1] and links[1]:getAttribute("href") or "No URL"
print(string.format("Title: %s\nURL: %s\n", title, url))
end
The code walks through each article element, extracts the title from an h2 tag, and grabs the first link. Simple and readable.
When Gumbo Isn't Enough: CSS Selectors
Gumbo doesn't support CSS selectors out of the box (unlike Python's BeautifulSoup or JavaScript's Cheerio). If you need complex selectors, you have two options:
- Traverse the DOM manually (tedious but works; see the sketch after this list)
- Use pattern matching for simpler cases (covered next)
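For the first option, a small helper built from the same methods we've already used (getElementsByTagName and getAttribute) goes a long way. This is a sketch under that assumption; find_all is my own helper, not part of Gumbo, and document is the tree parsed earlier:
-- Collect elements under a root that match a tag name and, optionally, a class
local function find_all(root, tag, class)
    local results = {}
    local candidates = root:getElementsByTagName(tag)
    for i = 1, #candidates do
        local el = candidates[i]
        local cls = el:getAttribute("class") or ""
        -- Match the class as a whole word within the class attribute
        if not class or (" " .. cls .. " "):find(" " .. class .. " ", 1, true) then
            table.insert(results, el)
        end
    end
    return results
end

-- Roughly equivalent to the selector "div.article-item h2"
for _, item in ipairs(find_all(document, "div", "article-item")) do
    for _, h2 in ipairs(find_all(item, "h2")) do
        print(h2.textContent)
    end
end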
For most scraping tasks, Gumbo's basic methods are sufficient. But if you're scraping a complex site with deeply nested structures, manual traversal gets ugly fast. That's where pattern matching can save you.
Lua Pattern Matching for Quick Extraction
Here's something most tutorials won't tell you: for quick-and-dirty scraping, Lua's pattern matching often beats using a full parser. It's faster, uses less memory, and requires no external dependencies.
Lua patterns aren't as powerful as regex, but they're perfect for extracting specific data from HTML. Here's the pattern matching approach to grab all links from a page:
local http = require("socket.http")
local body = http.request("http://example.com")
-- Extract all href values
for url in body:gmatch('href="([^"]+)"') do
print(url)
end
That short loop replaces the entire Gumbo link extraction code. The pattern href="([^"]+)" matches href=" followed by any characters that aren't a quote, captured in the parentheses.
Here are some patterns I use constantly:
-- Extract all email addresses
for email in body:gmatch('[%w%.%-_]+@[%w%.%-_]+%.%w+') do
print(email)
end
-- Extract prices (e.g., $19.99)
for price in body:gmatch('%$%d+%.%d%d') do
print(price)
end
-- Extract content between specific tags
local content = body:match('<div class="content">(.-)</div>')
print(content)
The .- pattern is particularly useful—it matches as few characters as possible (non-greedy), preventing you from accidentally capturing too much. One caveat: it stops at the first closing </div>, so if the content itself contains nested divs, the capture will be cut short.
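A quick comparison makes the difference concrete. On a string with two closing tags, the greedy version runs to the last one and the non-greedy stops at the first:
local html = '<div class="content">first</div><div class="content">second</div>'

-- Greedy: captures up to the LAST </div>
print(html:match('<div class="content">(.*)</div>'))
--> first</div><div class="content">second

-- Non-greedy: captures up to the FIRST </div>
print(html:match('<div class="content">(.-)</div>'))
--> first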
Pattern Matching Gotchas
Lua patterns have a few quirks that trip up newcomers:
No alternation operator. You can't do (foo|bar) like in regex. Character classes only get you so far: [fb][oa][or] matches "foo" and "bar", but also unintended strings like "for" and "boo", so the usual workaround is to try multiple patterns in sequence, as in the sketch below.
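A small helper keeps the try-several-patterns workaround readable (match_first is my own name, not a standard function; body is the page HTML from earlier):
-- Return the first capture produced by any of the given patterns
local function match_first(s, patterns)
    for _, pattern in ipairs(patterns) do
        local capture = s:match(pattern)
        if capture then
            return capture
        end
    end
    return nil
end

-- Where regex would use alternation: price as "$19.99" or "19.99 USD"
local price = match_first(body, {
    "%$(%d+%.%d%d)",
    "(%d+%.%d%d) USD"
})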
No \d or \w shorthands. Lua uses %d for digits and %w for alphanumeric. The % is Lua's escape character, not backslash.
Magic characters need escaping. Characters like ().-?*+[]%^$ have special meaning. Escape them with %:
-- Wrong: the unescaped dot matches any character, so this also matches "exampleXcom"
local pattern = "example.com"
-- Right: matches literal "example.com"
local pattern = "example%.com"
Handling HTTPS and Authentication
Most modern sites use HTTPS, which LuaSocket doesn't support by default. You need luasec:
local https = require("ssl.https")
local ltn12 = require("ltn12")
local response = {}
local result, code = https.request{
url = "https://secure-site.com/api",
sink = ltn12.sink.table(response)
}
if code == 200 then
print(table.concat(response))
end
The API is identical to the HTTP version; just replace http with https.
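If your URL list mixes schemes, you can pick the module from the URL prefix instead of hardcoding one. A minimal sketch (get is just my own wrapper name):
local http = require("socket.http")
local https = require("ssl.https")

-- Choose the right module based on the URL scheme
local function get(url)
    local mod = url:match("^https://") and https or http
    return mod.request(url)
end

local body, code = get("https://secure-site.com/api")
if code == 200 then
    print(body)
end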
Basic Authentication
Some APIs require HTTP Basic Auth. Embed credentials directly in the URL:
local https = require("ssl.https")
local body = https.request("https://user:password@api.example.com/data")
print(body)
You can also construct the Authorization header manually, which is the route you'll take for token-based schemes like OAuth or JWT (a token example follows the Basic one below):
local https = require("ssl.https")
local ltn12 = require("ltn12")
local mime = require("mime")
local username = "your_username"
local password = "your_password"
local credentials = mime.b64(username .. ":" .. password)
local response = {}
https.request{
url = "https://api.example.com/data",
headers = {
["Authorization"] = "Basic " .. credentials
},
sink = ltn12.sink.table(response)
}
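Token schemes like OAuth or JWT follow the same shape; only the Authorization value changes. A sketch, assuming you've already obtained a valid token elsewhere:
local https = require("ssl.https")
local ltn12 = require("ltn12")

local token = "eyJhbGciOi..."  -- placeholder: get this from your auth flow
local response = {}

https.request{
    url = "https://api.example.com/data",
    headers = {
        ["Authorization"] = "Bearer " .. token
    },
    sink = ltn12.sink.table(response)
}
print(table.concat(response))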
Advanced: Streaming Large Responses
Here's a technique I haven't seen documented: using LTN12 to process responses incrementally. Most tutorials show you ltn12.sink.table(), which loads the entire response into memory. But what if you're scraping a 500MB JSON file?
Instead, write a custom sink that processes data as it arrives:
local http = require("socket.http")
local ltn12 = require("ltn12")
-- This function gets called with each chunk of data (a nil chunk signals end of stream)
local function process_chunk(chunk, err)
    if chunk then
        -- Process the chunk here
        -- For example, extract data with pattern matching
        for line in chunk:gmatch("[^\n]+") do
            if line:match("important_data") then
                print("Found:", line)
            end
        end
    end
    return 1 -- Keep receiving (and acknowledge end of stream)
end
http.request{
url = "http://example.com/huge-file.txt",
sink = process_chunk
}
This approach lets you scrape files of any size without loading them entirely into RAM. I've used this to process multi-gigabyte log files on servers with limited memory.
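If you just want the raw file on disk rather than in-flight processing, LTN12 also ships a file sink that writes each chunk straight to a file handle as it arrives:
local http = require("socket.http")
local ltn12 = require("ltn12")

-- Stream the response straight to disk; the file sink writes each chunk
-- as it arrives and closes the handle when the transfer finishes
local out = assert(io.open("huge-file.txt", "wb"))
http.request{
    url = "http://example.com/huge-file.txt",
    sink = ltn12.sink.file(out)
}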
Buffering for Pattern Matching Across Chunks
One problem: what if your pattern spans across chunk boundaries? Here's a buffered solution:
local buffer = ""
local max_tail = 8192 -- cap on how much unmatched data we keep around

local function buffered_sink(chunk, err)
    if chunk then
        buffer = buffer .. chunk
        -- Emit every complete match, remembering where the last one ended
        local last_end = 0
        for url, pos in buffer:gmatch('href="([^"]+)"()') do
            print("Link:", url)
            last_end = pos - 1
        end
        if last_end > 0 then
            -- Drop everything already processed; the tail may hold the start
            -- of a match that continues in the next chunk
            buffer = buffer:sub(last_end + 1)
        elseif #buffer > max_tail then
            -- No match at all: keep only the tail so the buffer can't grow unbounded
            buffer = buffer:sub(-max_tail)
        end
        return 1
    else
        -- End of stream: process whatever is left in the buffer
        for url in buffer:gmatch('href="([^"]+)"') do
            print("Link:", url)
        end
        return 1
    end
end
This appends each chunk to a running buffer, emits every complete match, and keeps only the unmatched tail for the next chunk—so a pattern split across a chunk boundary is still found once its second half arrives.
JavaScript-Heavy Sites with Splash
Pure Lua can't execute JavaScript, so sites that load content dynamically won't work with LuaSocket alone. That's where Splash comes in—a lightweight browser engine that uses Lua for scripting.
Splash runs as a separate service (via Docker) and provides HTTP endpoints you can call from Lua:
# Start Splash
docker run -p 8050:8050 scrapinghub/splash
Now send requests with Lua scripts embedded:
local http = require("socket.http")
local ltn12 = require("ltn12")
local json = require("cjson")
local lua_script = [[
function main(splash, args)
splash:go(args.url)
splash:wait(2) -- Wait for JS to execute
return {html = splash:html()} -- return a table so Splash encodes the reply as JSON
end
]]
local request_data = json.encode({
lua_source = lua_script,
url = "https://dynamic-site.com"
})
local response = {}
http.request{
url = "http://localhost:8050/execute",
method = "POST",
headers = {
["Content-Type"] = "application/json",
["Content-Length"] = tostring(#request_data)
},
source = ltn12.source.string(request_data),
sink = ltn12.sink.table(response)
}
local result = json.decode(table.concat(response))
print(result.html)
The Lua script inside Splash has access to browser methods like splash:go(), splash:wait(), and splash:html(). You can interact with pages, click buttons, fill forms—much of what you'd otherwise reach for Selenium to do, with less overhead.
Handling Infinite Scroll with Splash
Here's a practical Splash script for sites with infinite scrolling:
function main(splash, args)
splash:go(args.url)
local scroll_count = 5
local scroll_delay = 1.0
for i = 1, scroll_count do
splash:wait(scroll_delay)
splash:runjs("window.scrollTo(0, document.body.scrollHeight);")
end
splash:wait(2)
return splash:html()
end
This scrolls to the bottom five times, waiting between scrolls for content to load. After you get the fully-rendered HTML, parse it with Gumbo as usual.
Rate Limiting and Politeness
Scraping responsibly means not hammering servers. Lua doesn't have built-in sleep functions, so use the socket library:
local http = require("socket.http")
local socket = require("socket")
local urls = {
"http://example.com/page1",
"http://example.com/page2",
"http://example.com/page3"
}
for _, url in ipairs(urls) do
local body = http.request(url)
-- Process body...
socket.sleep(1) -- Wait 1 second between requests
end
For more sophisticated rate limiting, track request timestamps:
local http = require("socket.http")
local socket = require("socket")
local requests_per_second = 2
local request_times = {}
local function rate_limited_request(url)
-- Remove old timestamps
local now = socket.gettime()
while #request_times > 0 and now - request_times[1] > 1 do
table.remove(request_times, 1)
end
-- Wait if we've hit the limit
if #request_times >= requests_per_second then
local wait_time = 1 - (now - request_times[1])
if wait_time > 0 then
socket.sleep(wait_time)
end
end
-- Make request and log timestamp
table.insert(request_times, socket.gettime())
return http.request(url)
end
This ensures you never exceed the specified rate, regardless of how fast your scraper runs.
Avoiding Common Lua Scraping Pitfalls
Mistake 1: Using dot syntax for method calls
In Lua, : is for method calls, . is for accessing fields:
-- Wrong: . doesn't pass the string as the implicit self argument, so match complains
local title = body.match("<title>(.-)</title>") -- Error!
-- Right: : is sugar for body.match(body, ...), passing body as self
local title = body:match("<title>(.-)</title>")
-- For lengths, skip methods entirely and use the # operator
print(#body)
This trips up everyone coming from Python. When in doubt about a length, use the # operator.
Mistake 2: Not handling nil values
Lua doesn't throw exceptions when you access missing fields—it returns nil. This can cause silent failures:
-- Bad
local element = document:getElementById("nonexistent")
print(element.textContent) -- Crashes with "attempt to index a nil value"
-- Good
local element = document:getElementById("nonexistent")
if element then
print(element.textContent)
end
Always check for nil before accessing fields.
Mistake 3: Forgetting table.concat
When using ltn12.sink.table(), you get a table of chunks, not a single string:
local response = {}
http.request{
url = "http://example.com",
sink = ltn12.sink.table(response)
}
-- Wrong: prints table memory address
print(response)
-- Right: concatenates chunks into string
print(table.concat(response))
Mistake 4: Not escaping pattern characters
When searching for literal strings that contain pattern magic characters:
local url = "http://example.com/path?id=123"
-- Wrong: ? is a pattern character
if body:match(url) then ...
-- Right: escape magic characters (prefixing any non-alphanumeric with % is safe)
local escaped = url:gsub("(%W)", "%%%1")
if body:match(escaped) then ...
Or better yet, use plain string search:
if body:find(url, 1, true) then -- true = plain search
print("Found it!")
end
Real-World Example: Scraping Product Data
Let's put it all together with a complete scraper that extracts product information from an e-commerce site:
local https = require("ssl.https")
local ltn12 = require("ltn12")
local gumbo = require("gumbo")
local socket = require("socket")
local function scrape_product(url)
    -- Make request with proper headers; the table form delivers the body via the sink
    local chunks = {}
    local ok, code = https.request{
        url = url,
        headers = {
            ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        },
        sink = ltn12.sink.table(chunks)
    }
    if code ~= 200 then
        print("Failed to fetch:", url)
        return nil
    end
    local body = table.concat(chunks)
    -- Parse HTML
    local document = gumbo.parse(body)
-- Extract product name
local name_elem = document:getElementById("product-title")
local name = name_elem and name_elem.textContent or "Unknown"
    -- Extract price using pattern matching (the price is embedded in JSON like "price":"$19.99")
    local price = body:match('"price":"%$([%d%.]+)"')
-- Extract description
local desc_elem = document:getElementById("product-description")
local description = desc_elem and desc_elem.textContent or ""
-- Extract all image URLs
local images = {}
local img_elements = document:getElementsByTagName("img")
for i = 1, #img_elements do
local src = img_elements[i]:getAttribute("src")
if src and src:match("product%-image") then
table.insert(images, src)
end
end
return {
name = name:match("^%s*(.-)%s*$"), -- Trim whitespace
price = price or "N/A", -- fall back if the price pattern didn't match
description = description:match("^%s*(.-)%s*$"),
images = images,
url = url
}
end
-- Scrape multiple products
local product_urls = {
"https://shop.example.com/product/1",
"https://shop.example.com/product/2",
"https://shop.example.com/product/3"
}
local products = {}
for _, url in ipairs(product_urls) do
local product = scrape_product(url)
if product then
table.insert(products, product)
print(string.format("Scraped: %s - $%s", product.name, product.price))
end
socket.sleep(1) -- Be polite
end
-- Output as CSV
print("\nName,Price,URL")
for _, p in ipairs(products) do
print(string.format('"%s","$%s","%s"', p.name, p.price, p.url))
end
This scraper:
- Sets proper headers to avoid detection
- Uses Gumbo for structured parsing
- Falls back to pattern matching for JSON-embedded data
- Handles missing elements gracefully
- Rate limits requests
- Outputs clean CSV data
You can extend this by adding error logging, retry logic, or database storage.
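Retry logic is the extension I usually add first. Here's a sketch that wraps the scrape_product function above with exponential backoff; the attempt count and delays are arbitrary choices:
local socket = require("socket")

-- Retry a scrape a few times, doubling the delay after each failure
local function scrape_with_retries(url, max_attempts)
    max_attempts = max_attempts or 3
    local delay = 2 -- seconds before the first retry
    for attempt = 1, max_attempts do
        local product = scrape_product(url)
        if product then
            return product
        end
        if attempt < max_attempts then
            print(string.format("Attempt %d failed, retrying in %ds", attempt, delay))
            socket.sleep(delay)
            delay = delay * 2
        end
    end
    return nil
end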
Final Thoughts
Lua web scraping isn't mainstream, but it fills a specific need: lightweight, fast data extraction in environments where Python isn't available or practical. You won't find as many libraries or Stack Overflow answers, but the core tools—LuaSocket, Gumbo, and pattern matching—cover 90% of scraping tasks.
The real strength of Lua scraping is memory efficiency. When you're running on embedded hardware, inside game servers, or anywhere RAM is precious, Lua's tiny footprint and careful memory management beat heavier alternatives. Combine that with Lua's speed and you have a scraping solution that scales down to IoT devices and up to high-throughput data pipelines.
Start with the basics: HTTP requests, pattern matching, and simple parsing. Once you're comfortable, explore Splash for JavaScript sites and LTN12 for streaming large responses. Keep your scrapers polite, respect robots.txt, and always check a site's terms of service before scraping.
Now go build something.