Kotlin brings type safety and concise syntax to web scraping, letting you extract data without the boilerplate you'd write in Java.

In this guide, you'll learn how to scrape websites using Kotlin—from basic HTML parsing to parallel processing with coroutines.

I've spent years scraping everything from e-commerce sites to social media platforms, and I've found that Kotlin's combination of null safety and coroutine support makes it uniquely suited for building production-ready scrapers.

Here's what makes it different: while Python dominates web scraping tutorials, Kotlin gives you compile-time safety that catches errors before they hit production, plus seamless integration with the entire Java ecosystem.

Why Choose Kotlin for Web Scraping?

Kotlin isn't the first language people think of for web scraping—that honor goes to Python. But Kotlin brings some advantages that become obvious once you start building real scrapers:

Null safety catches bugs at compile time. Ever had a scraper crash in production because a CSS selector returned null? Kotlin's type system forces you to handle these cases explicitly. You can't accidentally call .text() on a null element.
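
As a small illustration (using Jsoup, which we'll set up shortly): selectFirst() can come back null when a selector matches nothing, and Kotlin makes you deal with that at the call site rather than at 3 a.m. in production.

import org.jsoup.Jsoup

// selectFirst() returns null when nothing matches the selector, so the
// safe-call (?.) and elvis (?:) operators force an explicit fallback.
fun extractPrice(html: String): String {
    val doc = Jsoup.parse(html)
    return doc.selectFirst("p.price_color")?.text() ?: "price missing"
}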

Coroutines make parallel scraping elegant. Need to scrape 1,000 product pages? With Kotlin coroutines, you can do it concurrently without the complexity of thread pools or callbacks. It reads almost like synchronous code.

The entire Java ecosystem is available. Every Java library works in Kotlin. Jsoup for HTML parsing, Selenium for browser automation, OkHttp for HTTP requests—all available with better syntax.

Data classes are perfect for scraped data. Kotlin's data classes give you free equals(), hashCode(), and toString() methods, plus easy JSON serialization. Define your scraped product structure once and get all this for free.
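
For instance, a one-line data class already behaves sensibly when you print it, compare it, or tweak a copy of it (illustrative; the guide's own Product class appears in the next section):

data class Listing(val title: String, val price: Double)

fun main() {
    val a = Listing("Clean Code", 32.99)
    val b = Listing("Clean Code", 32.99)

    println(a)                      // Listing(title=Clean Code, price=32.99) -- free toString()
    println(a == b)                 // true -- structural equals()/hashCode()
    println(a.copy(price = 27.50))  // copy() for cheap derived records
}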

The language's interoperability with Java means you can use battle-tested libraries while writing less code. In my experience, a Kotlin scraper is typically 30-40% shorter than the equivalent Java version, with no loss in performance.

Setting Up Your Kotlin Scraping Environment

Before writing any scraping code, you need a proper development setup. Here's what you'll need:

Prerequisites

Install the JDK. Kotlin runs on the JVM, so you need Java Development Kit 11 or later. Download it from Oracle's website or use OpenJDK.

Verify your installation:

java -version
# Should show version 11 or later, e.g. "17.0.x"

Choose a build tool. Gradle is the standard for Kotlin projects. Install it via SDKMAN (recommended) or download directly:

# Using SDKMAN (recommended)
curl -s "https://get.sdkman.io" | bash
sdk install gradle 8.5

# Verify installation
gradle --version

Pick an IDE. IntelliJ IDEA Community Edition has the best Kotlin support, but VS Code with the Kotlin extension works fine too.

Creating Your Project

Generate a new Kotlin project with Gradle:

mkdir kotlin-scraper && cd kotlin-scraper
gradle init --type kotlin-application --dsl kotlin

When prompted, select these options:

  • Build script DSL: kotlin
  • Project name: kotlin-scraper
  • Source package: com.scraper.demo

This creates a standard project structure:

kotlin-scraper/
├── build.gradle.kts
├── src/
│   └── main/
│       └── kotlin/
│           └── com/scraper/demo/
│               └── App.kt
└── settings.gradle.kts

Adding Dependencies

Open build.gradle.kts and add the libraries you'll need:

plugins {
    kotlin("jvm") version "1.9.22"
    kotlin("plugin.serialization") version "1.9.22" // required for @Serializable, used later for JSON export
    application
}

repositories {
    mavenCentral()
}

dependencies {
    // HTML parsing with Jsoup
    implementation("org.jsoup:jsoup:1.17.2")
    
    // Kotlin-native scraping with Skrape{it}
    implementation("it.skrape:skrapeit:1.2.2")
    
    // Coroutines for parallel scraping
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.8.0")
    
    // JSON serialization
    implementation("org.jetbrains.kotlinx:kotlinx-serialization-json:1.6.2")
    
    // HTTP client with OkHttp
    implementation("com.squareup.okhttp3:okhttp:4.12.0")
}

Run gradle build to download these dependencies. You're now ready to scrape.

Your First Web Scraper with Jsoup

Jsoup is the Swiss Army knife of HTML parsing. It's a Java library, but it works seamlessly from Kotlin, and Kotlin's collection functions and null safety make the calling code noticeably cleaner.

Let's scrape a simple e-commerce site. We'll extract product names and prices from Books to Scrape, a scraping practice site.

First, define a data class for the scraped products:

data class Product(
    val title: String,
    val price: Double,
    val availability: String
)

Now create the scraper:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

fun scrapeProducts(url: String): List<Product> {
    // Fetch and parse the HTML
    val doc: Document = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .timeout(10000)
        .get()
    
    // Select all product containers
    val products = doc.select("article.product_pod")
    
    // Map each element to a Product object
    return products.mapNotNull { element ->
        try {
            Product(
                title = element.select("h3 a").attr("title"),
                price = element.select("p.price_color")
                    .text()
                    .removePrefix("£")
                    .toDouble(),
                availability = element.select("p.instock.availability")
                    .text()
                    .trim()
            )
        } catch (e: Exception) {
            // Skip malformed products
            null
        }
    }
}

fun main() {
    val url = "https://books.toscrape.com/catalogue/category/books_1/index.html"
    val products = scrapeProducts(url)
    
    products.forEach { product ->
        println("${product.title} - £${product.price} - ${product.availability}")
    }
}

Let's break down what's happening here:

Jsoup.connect(url) creates a connection to the target URL. The .get() method executes a GET request and returns a Document object representing the parsed HTML.

.userAgent() is critical. Many sites block requests without a user agent or with obvious bot user agents. We're setting one that looks like Chrome on Windows.

.timeout(10000) sets a 10-second timeout. Without this, your scraper hangs indefinitely if a site is slow to respond.

doc.select("article.product_pod") uses CSS selectors to find all product containers. If you've used jQuery, this syntax will be familiar.

mapNotNull is a Kotlin function that transforms each element and filters out nulls. If parsing any product fails, we skip it rather than crashing the entire scraper.

The try-catch inside mapNotNull is defensive programming. Real-world HTML is messy—maybe one product is missing a price, or the availability text is malformed. This pattern ensures you still get the other 99 products.
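
A slightly more idiomatic take on the same defensive pattern is Kotlin's runCatching, which wraps the block's outcome in a Result and lets a failed parse become null. Here's the same parsing logic factored into a helper; it's a sketch equivalent to the code above:

import org.jsoup.nodes.Document

fun parseProducts(doc: Document): List<Product> =
    doc.select("article.product_pod").mapNotNull { element ->
        runCatching {
            Product(
                title = element.select("h3 a").attr("title"),
                price = element.select("p.price_color").text().removePrefix("£").toDouble(),
                availability = element.select("p.instock.availability").text().trim()
            )
        }.getOrNull() // a failed parse becomes null and mapNotNull drops it
    }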

Scraping with Skrape{it}'s DSL

Jsoup works great, but if you want something that feels more Kotlin-native, Skrape{it} offers a DSL (Domain-Specific Language) that's incredibly readable.

Here's the same product scraper using Skrape{it}:

import it.skrape.core.*
import it.skrape.fetcher.*
import it.skrape.selects.html5.*

fun scrapeWithSkrape(url: String): List<Product> = skrape(HttpFetcher) {
    request {
        this.url = url
        userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        timeout = 10000
    }
    
    response {
        htmlDocument {
            article {
                withClass = "product_pod"
                findAll {
                    mapNotNull { element ->
                        try {
                            Product(
                                title = element.h3 {
                                    findFirst { a { findFirst { attribute("title") } } }
                                },
                                price = element.p {
                                    withClass = "price_color"
                                    findFirst { text }
                                        .removePrefix("£")
                                        .toDouble()
                                },
                                availability = element.p {
                                    withClass = "availability"
                                    findFirst { text.trim() }
                                }
                            )
                        } catch (e: Exception) {
                            null
                        }
                    }
                }
            }
        }
    }
}

Skrape{it}'s DSL reads almost like English: "In the HTML document, find all articles with class 'product_pod', then for each one pull the title attribute from the h3's link, the price text, and the availability text."

The choice between Jsoup and Skrape{it} comes down to preference. Jsoup has better documentation and more Stack Overflow answers. Skrape{it} has nicer syntax and feels more Kotlin-idiomatic. I use Jsoup for quick scripts and Skrape{it} for larger projects where readability matters.

Handling Pagination Like a Pro

Real websites don't put all their products on one page. You need to handle pagination—and there's a clever way to do this that most tutorials don't cover.

Many scrapers use a simple counter:

// The naive approach
for (page in 1..50) {
    scrapeProducts("https://example.com/products?page=$page")
}

This breaks if the site only has 23 pages. You're making 27 requests that return empty results or errors.

Here's a better pattern—scrape until there's no "next" button:

fun scrapeAllPages(startUrl: String): List<Product> {
    val allProducts = mutableListOf<Product>()
    var currentUrl: String? = startUrl
    
    while (currentUrl != null) {
        val doc = Jsoup.connect(currentUrl)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
            .get()
        
        // Scrape current page
        val pageProducts = doc.select("article.product_pod").mapNotNull { 
            // ... parsing logic here
        }
        allProducts.addAll(pageProducts)
        
        // Find the next page link; absUrl() resolves the relative href against
        // the current page's URL, which is safer than splicing strings together
        currentUrl = doc.select("li.next a").firstOrNull()
            ?.absUrl("href")
            ?.takeIf { it.isNotBlank() }
        
        // Be polite—don't hammer the server
        Thread.sleep(1000)
    }
    
    return allProducts
}

This approach:

  1. Scrapes the current page
  2. Looks for a "next page" link
  3. If found, constructs the absolute URL and continues
  4. If not found, stops

The Thread.sleep(1000) is crucial. It adds a 1-second delay between requests. Many sites will ban you if you make requests too quickly. This is where being polite pays off—your scraper runs longer before getting blocked.
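
If you want to go one step further (my own habit, not something every site requires), add a little random jitter so requests don't land at perfectly regular one-second ticks, which can itself look robotic:

// Illustrative helper: sleep for a base interval plus a random jitter
fun politePause(baseMs: Long = 1000, jitterMs: Long = 500) {
    Thread.sleep(baseMs + (0..jitterMs).random())
}

// Drop-in replacement for Thread.sleep(1000) in the pagination loop:
// politePause()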

Parallel Scraping with Coroutines

The pagination approach above is sequential—it scrapes one page, waits, scrapes the next. For large sites, this is slow. What if you could scrape multiple pages at once?

Kotlin coroutines make this trivial. Here's how to scrape 20 pages concurrently:

import kotlinx.coroutines.*

suspend fun scrapeProductsConcurrent(url: String): List<Product> {
    return withContext(Dispatchers.IO) {
        val doc = Jsoup.connect(url)
            .userAgent("Mozilla/5.0")
            .get()
        
        doc.select("article.product_pod").mapNotNull {
            // parsing logic
        }
    }
}

suspend fun scrapeMultiplePages(urls: List<String>): List<Product> {
    return coroutineScope {
        urls.map { url ->
            async { scrapeProductsConcurrent(url) }
        }.awaitAll().flatten()
    }
}

fun main() = runBlocking {
    val urls = (1..20).map { page ->
        "https://books.toscrape.com/catalogue/page-$page.html"
    }
    
    val allProducts = scrapeMultiplePages(urls)
    println("Scraped ${allProducts.size} products")
}

Let me explain what's happening:

suspend fun marks functions that can be paused and resumed. This is how Kotlin implements asynchronous code without callbacks.

withContext(Dispatchers.IO) tells Kotlin to run this code on a thread pool meant for blocking I/O such as network requests. This matters because the default dispatcher is sized to roughly one thread per CPU core; blocking calls like Jsoup.connect() would quickly starve it, whereas Dispatchers.IO allows many more threads.

async { } starts a coroutine that returns a result. Each async block runs concurrently.

awaitAll() waits for all async operations to complete and collects their results.

.flatten() converts a List<List<Product>> into a List<Product>.

This code will scrape all 20 pages at the same time, potentially 20x faster than the sequential version. But there's a catch—you might get banned for making too many simultaneous requests.

Rate Limiting with Semaphores

Scraping 20 pages at once is fast, but also rude and likely to get you blocked. The solution is rate limiting—controlling how many requests run simultaneously.

Kotlin's Semaphore class is perfect for this:

import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

class RateLimitedScraper(
    private val maxConcurrent: Int = 5,
    private val delayMs: Long = 1000
) {
    private val semaphore = Semaphore(maxConcurrent)
    
    suspend fun scrapeWithRateLimit(url: String): List<Product> {
        return semaphore.withPermit {
            // Only maxConcurrent coroutines can be here at once
            delay(delayMs) // Wait before making request
            
            withContext(Dispatchers.IO) {
                val doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .get()
                
                doc.select("article.product_pod").mapNotNull {
                    try {
                        Product(
                            title = it.select("h3 a").attr("title"),
                            price = it.select("p.price_color")
                                .text()
                                .removePrefix("£")
                                .toDouble(),
                            availability = it.select("p.instock.availability")
                                .text()
                                .trim()
                        )
                    } catch (e: Exception) {
                        null
                    }
                }
            }
        }
    }
}

fun main() = runBlocking {
    val scraper = RateLimitedScraper(maxConcurrent = 5, delayMs = 1000)
    
    val urls = (1..50).map { page ->
        "https://books.toscrape.com/catalogue/page-$page.html"
    }
    
    val allProducts = urls.map { url ->
        async { scraper.scrapeWithRateLimit(url) }
    }.awaitAll().flatten()
    
    println("Scraped ${allProducts.size} products from ${urls.size} pages")
}

The semaphore limits concurrent operations. Only 5 coroutines can be inside the withPermit block at once. The 6th waits until one finishes.

delay(delayMs) adds a pause before each request. Even with only 5 concurrent requests, this ensures we're not hammering the server.

This pattern scales beautifully. Need to scrape 1,000 pages? Just pass in 1,000 URLs. The rate limiter ensures you never exceed your limits, and coroutines keep it fast.

In production, I typically use maxConcurrent = 10 and delayMs = 500 for most sites. For sites with stricter rate limits, drop to maxConcurrent = 3 and delayMs = 2000.

Structured Error Handling with Sealed Classes

Web scraping is inherently error-prone. Networks fail, HTML changes, sites go down. Most tutorials handle this with try-catch and logging. There's a better way using Kotlin's sealed classes.

Sealed classes let you model all possible outcomes of a scraping operation:

sealed class ScrapeResult {
    data class Success(val products: List<Product>) : ScrapeResult()
    data class PartialSuccess(
        val products: List<Product>,
        val errors: List<String>
    ) : ScrapeResult()
    data class NetworkError(val message: String) : ScrapeResult()
    data class ParseError(val message: String, val html: String) : ScrapeResult()
    data class EmptyPage(val url: String) : ScrapeResult()
}

suspend fun scrapeWithResult(url: String): ScrapeResult {
    return try {
        withContext(Dispatchers.IO) {
            val doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .get()
            
            val productElements = doc.select("article.product_pod")
            
            if (productElements.isEmpty()) {
                return@withContext ScrapeResult.EmptyPage(url)
            }
            
            val products = mutableListOf<Product>()
            val errors = mutableListOf<String>()
            
            productElements.forEach { element ->
                try {
                    products.add(
                        Product(
                            title = element.select("h3 a").attr("title"),
                            price = element.select("p.price_color")
                                .text()
                                .removePrefix("£")
                                .toDouble(),
                            availability = element.select("p.instock.availability")
                                .text()
                                .trim()
                        )
                    )
                } catch (e: Exception) {
                    errors.add("Failed to parse product: ${e.message}")
                }
            }
            
            if (errors.isEmpty()) {
                ScrapeResult.Success(products)
            } else {
                ScrapeResult.PartialSuccess(products, errors)
            }
        }
    } catch (e: java.net.SocketTimeoutException) {
        ScrapeResult.NetworkError("Timeout: ${e.message}")
    } catch (e: java.io.IOException) {
        ScrapeResult.NetworkError("IO error: ${e.message}")
    } catch (e: Exception) {
        ScrapeResult.ParseError(e.message ?: "Unknown error", "")
    }
}

Now you can handle each case explicitly:

fun main() = runBlocking {
    val url = "https://books.toscrape.com/catalogue/page-1.html"
    
    when (val result = scrapeWithResult(url)) {
        is ScrapeResult.Success -> {
            println("Successfully scraped ${result.products.size} products")
            result.products.forEach { println(it) }
        }
        is ScrapeResult.PartialSuccess -> {
            println("Scraped ${result.products.size} products with ${result.errors.size} errors")
            result.errors.forEach { println("Error: $it") }
        }
        is ScrapeResult.NetworkError -> {
            println("Network error: ${result.message}")
            // Maybe retry with exponential backoff?
        }
        is ScrapeResult.ParseError -> {
            println("Parse error: ${result.message}")
            // Maybe save the HTML for debugging?
        }
        is ScrapeResult.EmptyPage -> {
            println("No products found at ${result.url}")
        }
    }
}

This pattern is powerful because the compiler ensures you handle every case. Forget to handle NetworkError? Compilation fails. Add a new error type later? The compiler tells you every place you need to update.

For production scrapers, I wrap this in a retry mechanism:

suspend fun scrapeWithRetry(
    url: String,
    maxRetries: Int = 3,
    delayMs: Long = 1000
): ScrapeResult {
    repeat(maxRetries) { attempt ->
        val result = scrapeWithResult(url)
        
        when (result) {
            is ScrapeResult.Success,
            is ScrapeResult.PartialSuccess,
            is ScrapeResult.EmptyPage -> return result
            
            is ScrapeResult.NetworkError -> {
                if (attempt < maxRetries - 1) {
                    delay(delayMs * (attempt + 1)) // Back off a little longer on each retry (1s, 2s, 3s...)
                    // Try again
                } else {
                    return result // Give up after max retries
                }
            }
            
            is ScrapeResult.ParseError -> return result // Don't retry parse errors
        }
    }
    
    return ScrapeResult.NetworkError("Max retries exceeded")
}

Bypassing Anti-Bot Measures

Modern sites employ multiple techniques to detect and block scrapers. Here are practical ways to bypass them.

User-Agent Rotation

Don't use the same user agent for every request. Create a pool:

val userAgents = listOf(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15"
)

fun randomUserAgent(): String = userAgents.random()

// Use it
val doc = Jsoup.connect(url)
    .userAgent(randomUserAgent())
    .get()

Adding Realistic Headers

Scrapers often omit headers that real browsers send:

fun connectWithHeaders(url: String): Document {
    return Jsoup.connect(url)
        .userAgent(randomUserAgent())
        .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
        .header("Accept-Language", "en-US,en;q=0.5")
        .header("Accept-Encoding", "gzip, deflate, br")
        .header("DNT", "1")
        .header("Connection", "keep-alive")
        .header("Upgrade-Insecure-Requests", "1")
        .get()
}

Using Proxies

If you're getting IP banned, rotate through proxies. With OkHttp:

import okhttp3.*
import java.net.InetSocketAddress
import java.net.Proxy

fun createProxyClient(proxyHost: String, proxyPort: Int): OkHttpClient {
    val proxy = Proxy(
        Proxy.Type.HTTP,
        InetSocketAddress(proxyHost, proxyPort)
    )
    
    return OkHttpClient.Builder()
        .proxy(proxy)
        .build()
}

// Use it with Jsoup
fun scrapeWithProxy(url: String, proxyHost: String, proxyPort: Int): Document {
    val client = createProxyClient(proxyHost, proxyPort)
    
    val request = Request.Builder()
        .url(url)
        .header("User-Agent", randomUserAgent())
        .build()
    
    val response = client.newCall(request).execute()
    return Jsoup.parse(response.body?.string() ?: "")
}

For production, use a rotating proxy service. Free proxies are slow and unreliable. Paid services like BrightData or Oxylabs are worth it for serious scraping.
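
If you manage your own proxy list instead, a simple round-robin rotation is enough to spread requests across IPs. Here's a sketch (the proxy hosts are placeholders) that builds on the createProxyClient/scrapeWithProxy functions above:

import java.util.concurrent.atomic.AtomicInteger

// Illustrative only: swap in your real proxy endpoints
data class ProxyEndpoint(val host: String, val port: Int)

class ProxyRotator(private val proxies: List<ProxyEndpoint>) {
    private val index = AtomicInteger(0)

    // Hands out proxies in round-robin order; safe to call from multiple coroutines
    fun next(): ProxyEndpoint = proxies[index.getAndIncrement().mod(proxies.size)]
}

val rotator = ProxyRotator(
    listOf(
        ProxyEndpoint("proxy1.example.com", 8080),
        ProxyEndpoint("proxy2.example.com", 8080)
    )
)

// Each call goes out through the next proxy in the pool
fun scrapeViaNextProxy(url: String) =
    rotator.next().let { p -> scrapeWithProxy(url, p.host, p.port) }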

Handling Cookies

Some sites require cookies to be maintained across requests:

val cookies = mutableMapOf<String, String>()

fun scrapePage1(url: String): Document {
    val response = Jsoup.connect(url)
        .userAgent(randomUserAgent())
        .execute()
    
    // Save cookies
    cookies.putAll(response.cookies())
    
    return response.parse()
}

fun scrapePage2(url: String): Document {
    return Jsoup.connect(url)
        .userAgent(randomUserAgent())
        .cookies(cookies) // Use saved cookies
        .get()
}
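
If you're on a recent Jsoup release (1.14.1 or later, if I recall the version correctly), there's a simpler option: Jsoup.newSession() returns a reusable Connection that carries cookies and other settings across requests for you.

import org.jsoup.Jsoup

val session = Jsoup.newSession()
    .userAgent(randomUserAgent())
    .timeout(10_000)

fun fetchInSession(url: String) =
    // newRequest() copies the session settings; cookies set by earlier
    // responses are sent automatically on later requests
    session.newRequest().url(url).get()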

Scraping JavaScript-Heavy Sites

Some sites load content dynamically with JavaScript. Jsoup can't execute JavaScript—it only parses static HTML. For these sites, you need a headless browser.

Selenium with Chrome is the standard approach:

First, add Selenium to your build.gradle.kts:

dependencies {
    implementation("org.seleniumhq.selenium:selenium-java:4.17.0")
}

Selenium 4.6 and later ships with Selenium Manager, which downloads a matching ChromeDriver automatically. You only need to download ChromeDriver yourself if you want to pin a specific driver binary.

Here's how to scrape with Selenium:

import org.openqa.selenium.By
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions

fun scrapeWithSelenium(url: String): List<Product> {
    // Selenium Manager (4.6+) resolves the driver automatically; set this
    // property only if you need to point at a specific ChromeDriver binary.
    // System.setProperty("webdriver.chrome.driver", "/path/to/chromedriver")
    
    val options = ChromeOptions().apply {
        addArguments("--headless") // Run without opening a browser window
        addArguments("--disable-gpu")
        addArguments("--no-sandbox")
        addArguments("user-agent=${randomUserAgent()}")
    }
    
    val driver = ChromeDriver(options)
    
    try {
        driver.get(url)
        
        // Crude wait for JavaScript to load; for anything serious, prefer
        // WebDriverWait with an expected condition instead of a fixed sleep
        Thread.sleep(3000)
        
        // Find product elements
        val productElements = driver.findElements(
            By.cssSelector("article.product_pod")
        )
        
        return productElements.mapNotNull { element ->
            try {
                Product(
                    title = element.findElement(By.cssSelector("h3 a"))
                        .getAttribute("title"),
                    price = element.findElement(By.cssSelector("p.price_color"))
                        .text
                        .removePrefix("£")
                        .toDouble(),
                    availability = element.findElement(
                        By.cssSelector("p.instock.availability")
                    ).text.trim()
                )
            } catch (e: Exception) {
                null
            }
        }
    } finally {
        driver.quit() // Always close the driver
    }
}

Selenium is powerful but slow. Each page takes 2-5 seconds to render. For sites that load content via JavaScript but don't have anti-bot protections, there's a faster way: intercept the API calls.

Most JavaScript-heavy sites fetch data from an API. Open Chrome DevTools, go to the Network tab, and watch what happens when the page loads. Look for XHR/Fetch requests returning JSON. You can often scrape the API directly, which is much faster:

import okhttp3.OkHttpClient
import okhttp3.Request

fun scrapeApi(apiUrl: String): String {
    val client = OkHttpClient()
    
    val request = Request.Builder()
        .url(apiUrl)
        .header("User-Agent", randomUserAgent())
        .header("Accept", "application/json")
        .build()
    
    val response = client.newCall(request).execute()
    return response.body?.string() ?: ""
}

// Parse JSON response
// Use kotlinx.serialization or Gson here

This is often 10x faster than Selenium and doesn't require ChromeDriver.
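
To turn that JSON string into Kotlin objects, kotlinx.serialization (already in our dependencies) works well. The field names below are made up for illustration; match them to whatever the real API actually returns.

import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json

// Hypothetical shape of the API payload; adjust to the actual response
@Serializable
data class ApiProduct(val name: String, val price: Double)

private val json = Json { ignoreUnknownKeys = true } // tolerate fields we don't model

fun parseProducts(body: String): List<ApiProduct> =
    json.decodeFromString<List<ApiProduct>>(body)

// Usage: val products = parseProducts(scrapeApi(apiUrl))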

Data Persistence Patterns

Once you've scraped data, you need to store it. Here are three common patterns:

1. CSV Export

Simple and works for tabular data:

import java.io.File

fun exportToCsv(products: List<Product>, filename: String) {
    File(filename).printWriter().use { out ->
        // Header
        out.println("Title,Price,Availability")
        
        // Rows (escape embedded double quotes so titles stay valid CSV)
        products.forEach { product ->
            val safeTitle = product.title.replace("\"", "\"\"")
            out.println("\"$safeTitle\",${product.price},\"${product.availability}\"")
        }
    }
}

2. JSON Export

Better for nested data structures:

import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json
import kotlinx.serialization.encodeToString
import java.io.File

// Annotate the existing Product class (repeated here in full for clarity)
@Serializable
data class Product(
    val title: String,
    val price: Double,
    val availability: String
)

fun exportToJson(products: List<Product>, filename: String) {
    val json = Json { prettyPrint = true }
    val jsonString = json.encodeToString(products)
    File(filename).writeText(jsonString)
}

3. Database Storage

For production systems, use a database. Here's an example with SQLite:

// Add to build.gradle.kts:
// implementation("org.xerial:sqlite-jdbc:3.44.1.0")

import java.sql.DriverManager
import java.sql.Connection

class ProductDatabase(private val dbPath: String) {
    private var connection: Connection? = null
    
    fun connect() {
        connection = DriverManager.getConnection("jdbc:sqlite:$dbPath")
        createTable()
    }
    
    private fun createTable() {
        val sql = """
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT NOT NULL,
                price REAL NOT NULL,
                availability TEXT,
                scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
            )
        """.trimIndent()
        
        connection?.createStatement()?.execute(sql)
    }
    
    fun insertProduct(product: Product) {
        val sql = """
            INSERT INTO products (title, price, availability)
            VALUES (?, ?, ?)
        """.trimIndent()
        
        connection?.prepareStatement(sql)?.use { stmt ->
            stmt.setString(1, product.title)
            stmt.setDouble(2, product.price)
            stmt.setString(3, product.availability)
            stmt.executeUpdate()
        }
    }
    
    fun insertProducts(products: List<Product>) {
        products.forEach { insertProduct(it) }
    }
    
    fun close() {
        connection?.close()
    }
}

// Usage
fun main() = runBlocking {
    val db = ProductDatabase("products.db")
    db.connect()
    
    val products = scrapeAllPages("https://books.toscrape.com")
    db.insertProducts(products)
    
    db.close()
    println("Saved ${products.size} products to database")
}
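
One note on insertProducts: inserting row by row autocommits every statement, which gets slow once you're storing thousands of products. Here's a sketch of a batched, transactional variant you could add inside ProductDatabase (same schema, standard JDBC batching):

    fun insertProductsBatch(products: List<Product>) {
        val sql = "INSERT INTO products (title, price, availability) VALUES (?, ?, ?)"
        val conn = connection ?: return

        conn.autoCommit = false // group the whole batch into one transaction
        try {
            conn.prepareStatement(sql).use { stmt ->
                products.forEach { p ->
                    stmt.setString(1, p.title)
                    stmt.setDouble(2, p.price)
                    stmt.setString(3, p.availability)
                    stmt.addBatch()
                }
                stmt.executeBatch()
            }
            conn.commit()
        } catch (e: Exception) {
            conn.rollback() // don't leave a half-written batch behind
            throw e
        } finally {
            conn.autoCommit = true
        }
    }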

Production-Ready Tips

Building a scraper that runs once is easy. Building one that runs reliably for months is harder. Here are lessons from production:

Log everything. Use a logging framework, not println():

// Add to build.gradle.kts:
// implementation("org.slf4j:slf4j-simple:2.0.9")

import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("ScraperApp")

fun scrapeWithLogging(url: String): ScrapeResult {
    logger.info("Starting scrape of $url")
    
    val result = scrapeWithResult(url)
    
    when (result) {
        is ScrapeResult.Success ->
            logger.info("Successfully scraped ${result.products.size} products")
        is ScrapeResult.NetworkError ->
            logger.error("Network error: ${result.message}")
        else ->
            logger.warn("Scrape finished with $result") // handle the remaining cases as needed
    }
    
    return result
}

Implement exponential backoff. When a request fails, wait longer before retrying:

suspend fun scrapeWithBackoff(
    url: String,
    maxRetries: Int = 5,
    initialDelayMs: Long = 1000
): ScrapeResult {
    var backoffMs = initialDelayMs
    
    repeat(maxRetries) { attempt ->
        val result = scrapeWithResult(url)
        
        if (result !is ScrapeResult.NetworkError) {
            return result
        }
        
        if (attempt < maxRetries - 1) {
            logger.warn("Retry ${attempt + 1}/$maxRetries after ${backoffMs}ms")
            delay(backoffMs)
            backoffMs *= 2 // Exponential backoff: 1s, 2s, 4s, 8s
        }
    }
    }
    
    return ScrapeResult.NetworkError("Max retries exceeded")
}

Monitor your scrapers. Track success rates, response times, and errors. A simple approach:

data class ScraperMetrics(
    var successCount: Int = 0,
    var errorCount: Int = 0,
    var totalResponseTime: Long = 0
) {
    val successRate: Double
        get() = if (successCount + errorCount > 0) {
            successCount.toDouble() / (successCount + errorCount)
        } else 0.0
    
    val averageResponseTime: Long
        get() = if (successCount > 0) {
            totalResponseTime / successCount
        } else 0
}

val metrics = ScraperMetrics()

suspend fun scrapeAndTrack(url: String): ScrapeResult {
    val startTime = System.currentTimeMillis()
    val result = scrapeWithResult(url)
    val duration = System.currentTimeMillis() - startTime
    
    when (result) {
        is ScrapeResult.Success -> {
            metrics.successCount++
            metrics.totalResponseTime += duration
        }
        else -> metrics.errorCount++
    }
    
    return result
}
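
One caveat: ScraperMetrics uses plain mutable fields, so if scrapeAndTrack runs from many coroutines at once the counters can race and under-count. A sketch of a thread-safe variant using atomics (my own adjustment, not part of the code above):

import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.atomic.AtomicLong

class ConcurrentScraperMetrics {
    private val successCount = AtomicInteger(0)
    private val errorCount = AtomicInteger(0)
    private val totalResponseTime = AtomicLong(0)

    fun recordSuccess(durationMs: Long) {
        successCount.incrementAndGet()
        totalResponseTime.addAndGet(durationMs)
    }

    fun recordError() {
        errorCount.incrementAndGet()
    }

    val successRate: Double
        get() {
            val total = successCount.get() + errorCount.get()
            return if (total > 0) successCount.get().toDouble() / total else 0.0
        }

    val averageResponseTime: Long
        get() {
            val successes = successCount.get()
            return if (successes > 0) totalResponseTime.get() / successes else 0
        }
}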

Respect robots.txt. Many sites specify scraping rules in /robots.txt. You can parse it:

fun isAllowedByRobots(url: String, userAgent: String = "*"): Boolean {
    return try {
        val uri = java.net.URI(url)
        val baseUrl = "${uri.scheme}://${uri.authority}"
        val robotsUrl = "$baseUrl/robots.txt"
        
        val robotsTxt = Jsoup.connect(robotsUrl)
            .ignoreContentType(true)
            .execute()
            .body()
        
        // Basic robots.txt parsing: collect every Disallow rule regardless of
        // which User-agent group it belongs to.
        // For production, use a proper robots.txt parser library
        val disallowedPaths = robotsTxt.lines()
            .filter { it.trim().startsWith("Disallow:") }
            .map { it.substringAfter("Disallow:").trim() }
            .filter { it.isNotEmpty() } // a bare "Disallow:" means everything is allowed
        
        val path = (uri.path ?: "/").ifEmpty { "/" }
        disallowedPaths.none { path.startsWith(it) }
    } catch (e: Exception) {
        // If robots.txt doesn't exist or can't be fetched, assume allowed
        true
    }
}

Schedule your scrapers. Don't run scrapers manually. Use cron (Linux/Mac) or Task Scheduler (Windows), or for JVM-based scheduling, use Quartz:

// Add to build.gradle.kts:
// implementation("org.quartz-scheduler:quartz:2.3.2")

import org.quartz.*
import org.quartz.impl.StdSchedulerFactory

class ScraperJob : Job {
    override fun execute(context: JobExecutionContext) {
        runBlocking {
            val products = scrapeAllPages("https://books.toscrape.com")
            // Save to database
            logger.info("Scheduled scrape completed: ${products.size} products")
        }
    }
}

fun scheduleDaily() {
    val scheduler = StdSchedulerFactory.getDefaultScheduler()
    
    val job = JobBuilder.newJob(ScraperJob::class.java)
        .withIdentity("dailyScrape", "scrapers")
        .build()
    
    val trigger = TriggerBuilder.newTrigger()
        .withIdentity("dailyTrigger", "scrapers")
        .withSchedule(
            CronScheduleBuilder.dailyAtHourAndMinute(2, 0) // 2 AM daily
        )
        .build()
    
    scheduler.scheduleJob(job, trigger)
    scheduler.start()
}

Wrapping Up

Kotlin brings type safety, concise syntax, and powerful concurrency tools to web scraping. While Python has more tutorials and libraries specifically for scraping, Kotlin offers compile-time safety that catches bugs before production and coroutines that make parallel scraping elegant.

The techniques in this guide—rate limiting with semaphores, structured error handling with sealed classes, and proper coroutine usage—separate production scrapers from hobby projects. Your scrapers should handle errors gracefully, respect rate limits, and run reliably without supervision.

Start with simple scrapers using Jsoup or Skrape{it}. Add coroutines when you need speed. Implement rate limiting before you get blocked. And always, always respect the sites you're scraping—add delays, rotate user agents, and check robots.txt.

Want to go deeper? The Kotlin documentation on coroutines is excellent, and the Jsoup documentation covers advanced parsing techniques. For JavaScript-heavy sites, look into Playwright's Java bindings, which work smoothly from Kotlin and are generally faster than Selenium.

Happy scraping!