Getting clean, structured data from websites for AI applications is frustrating. Standard scrapers choke on JavaScript, anti-bot measures block your requests, and you end up with messy HTML instead of usable content.
Teracrawl solves this by converting any website into clean Markdown that's ready for LLMs, RAG pipelines, and AI agents. It achieved the highest coverage score (84.2%) across 14 scraping providers in the scrape-evals benchmark.
In this guide, you'll learn how to set up Teracrawl, configure it for your needs, and use its API endpoints to scrape websites at scale.
What Is Teracrawl and Why Use It?
Teracrawl is a production-ready API that turns websites into clean, LLM-ready Markdown. It drives real Chrome browsers managed by Browser.cash, which keeps success rates high even on protected sites.
Unlike basic HTML scrapers, Teracrawl handles the hard stuff automatically. JavaScript rendering, anti-bot bypasses, and parallel execution happen behind the scenes.
Here's what makes it stand out:
- LLM-optimized output that converts complex HTML into semantic Markdown
- Smart two-phase crawling with fast mode for static pages and dynamic mode for SPAs
- Search and scrape in a single API call—query Google and scrape top results
- High concurrency through a robust session pool for parallel processing
The tool is open-source and runs locally or in Docker containers.
Step 1: Install Teracrawl and Dependencies
Before installing Teracrawl, make sure you have Node.js 18 or higher on your machine. You'll also need a Browser.cash API key.
Open your terminal and clone the repository:
git clone https://github.com/BrowserCash/teracrawl.git
cd teracrawl
This downloads the complete Teracrawl source code to your local machine.
Next, install the required npm packages:
npm install
The installation pulls in dependencies for browser session management, HTML-to-Markdown conversion, and API handling.
That's it for the basic setup. Teracrawl is lightweight and doesn't require complex toolchains.
Getting Your Browser.cash API Key
Teracrawl runs on Browser.cash's remote browser infrastructure. You'll need an API key to authenticate requests.
Visit browser.cash/developers and create an account. Your API key will be available in the dashboard.
Keep this key secure—it's your access to the browser pool that powers Teracrawl's scraping capabilities.
Step 2: Configure Your Environment
Teracrawl uses environment variables for configuration. Start by copying the example file:
cp .env.example .env
Open the .env file in your editor. The minimum required setting is your Browser.cash API key:
BROWSER_API_KEY=your_browser_cash_api_key_here
Replace your_browser_cash_api_key_here with the actual key from your Browser.cash dashboard.
Optional Configuration Variables
For most use cases, the defaults work fine. But you can tune performance with these variables:
| Variable | Default | What It Does |
|---|---|---|
| PORT | 8085 | Server port |
| HOST | 0.0.0.0 | Host to bind to |
| POOL_SIZE | 1 | Concurrent browser sessions |
| CRAWL_TABS_PER_SESSION | 8 | Max tabs per browser session |
| CRAWL_NAVIGATION_TIMEOUT_MS | 10000 | Fast mode timeout (ms) |
| CRAWL_SLOW_TIMEOUT_MS | 20000 | Slow mode timeout (ms) |
Increase POOL_SIZE if you're scraping at high volume. Each session can handle multiple tabs in parallel.
Starting the Server
Run Teracrawl in development mode with:
npm run dev
For production, build and start:
npm run build
npm start
The server starts at http://0.0.0.0:8085. You'll see confirmation in your terminal.
Test that it's running with a health check:
curl http://localhost:8085/health
You should get {"ok":true} back.
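If you're scripting against Teracrawl, a short poll of the health endpoint confirms the server is ready before you start sending work. Here's a minimal sketch in Python using requests; the endpoint and response shape are the ones from the health check above.

```python
import time
import requests

def wait_for_teracrawl(base_url="http://localhost:8085", timeout=30):
    """Poll /health until Teracrawl reports ok, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(f"{base_url}/health", timeout=2)
            if resp.ok and resp.json().get("ok"):
                return True
        except requests.RequestException:
            pass  # server not up yet; keep retrying
        time.sleep(1)
    return False

if not wait_for_teracrawl():
    raise RuntimeError("Teracrawl did not become healthy in time")
```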
Step 3: Scrape a Single URL
The /scrape endpoint converts any URL into clean Markdown. This is the core functionality of Teracrawl.
Here's a basic request:
curl -X POST http://localhost:8085/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post-1"
}'
The response comes back as JSON with the page title and Markdown content:
{
"url": "https://example.com/blog/post-1",
"title": "My Blog Post",
"markdown": "# My Blog Post\n\nContent of the post...",
"status": "success"
}
Notice how Teracrawl extracts the main content and strips away navigation, ads, and clutter.
How the Two-Phase Scraping Works
Teracrawl uses a smart scraping strategy that adapts to each page:
Fast Mode kicks in first. It reuses browser contexts, blocks heavy assets like images and fonts, and works great for static or server-rendered pages.
Dynamic Mode activates automatically when fast mode doesn't capture enough content. It waits for JavaScript hydration and client-side rendering to complete.
You don't need to configure which mode to use. Teracrawl detects the page type and switches automatically.
Scraping With Python
Want to use Teracrawl from Python? Here's a quick example:
import requests
response = requests.post(
"http://localhost:8085/scrape",
json={"url": "https://news.ycombinator.com/"}
)
data = response.json()
print(data["markdown"])
The markdown field contains clean text that's ready for your LLM pipeline.
Scraping With JavaScript
For Node.js applications:
const response = await fetch("http://localhost:8085/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url: "https://news.ycombinator.com/" })
});
const data = await response.json();
console.log(data.markdown);
Both examples show how straightforward it is to integrate Teracrawl into existing projects.
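If you already have a list of URLs rather than a search query, you can fan requests out to the local /scrape endpoint from your client code. Here's a rough sketch using Python's ThreadPoolExecutor; the URL list and worker count are placeholders, and the real concurrency is still governed by your POOL_SIZE and CRAWL_TABS_PER_SESSION settings.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

URLS = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/blog/post-3",
]

def scrape(url):
    """Send one URL to the local /scrape endpoint and return the parsed JSON."""
    resp = requests.post(
        "http://localhost:8085/scrape",
        json={"url": url},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# A few client-side workers is enough; Teracrawl parallelizes on its side too.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(scrape, URLS):
        print(result["title"], "-", len(result["markdown"]), "chars")
```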
Step 4: Search and Crawl Multiple Pages
The /crawl endpoint is where Teracrawl really shines. It queries Google, then scrapes the top results in parallel.
This is perfect for research tasks, competitive analysis, or building datasets.
Important: The /crawl endpoint requires a running instance of browser-serp on port 8080. See the Docker section for the easiest setup.
Here's how to search and scrape:
curl -X POST http://localhost:8085/crawl \
-H "Content-Type: application/json" \
-d '{
"q": "What is the capital of France?",
"count": 3
}'
The q parameter is your search query. The count parameter specifies how many results to scrape (max 20).
The response includes Markdown content from each scraped page:
{
"query": "What is the capital of France?",
"results": [
{
"url": "https://en.wikipedia.org/wiki/Paris",
"title": "Paris - Wikipedia",
"markdown": "# Paris\n\nParis is the capital and most populous city of France...",
"status": "success"
},
{
"url": "https://example.com/france-info",
"title": "France Facts",
"markdown": "# France Facts\n\nThe capital city is Paris...",
"status": "success"
}
]
}
Failed scrapes return an error message instead of Markdown. This helps you handle partial failures gracefully.
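In practice that means splitting the results by status before you feed anything to a model. The exact error payload isn't shown here, so this small sketch filters on the status field from the response above:

```python
import requests

resp = requests.post(
    "http://localhost:8085/crawl",
    json={"q": "What is the capital of France?", "count": 3},
    timeout=120,
)
data = resp.json()

# Keep only the pages that actually produced Markdown.
succeeded = [r for r in data["results"] if r["status"] == "success"]
failed = [r for r in data["results"] if r["status"] != "success"]

print(f"{len(succeeded)} pages scraped, {len(failed)} failed")
for page in succeeded:
    print("-", page["url"], f"({len(page['markdown'])} chars)")
```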
Building a RAG Pipeline With Teracrawl
Here's a practical example. Say you want to answer questions using fresh web data:
import requests
def get_web_context(question, num_sources=5):
"""Search the web and get relevant content for a question."""
response = requests.post(
"http://localhost:8085/crawl",
json={"q": question, "count": num_sources}
)
data = response.json()
# Combine successful results into context
context = ""
for result in data["results"]:
if result["status"] == "success":
context += f"\n\n## Source: {result['title']}\n"
context += result["markdown"][:2000] # Limit length
return context
# Use with your LLM
context = get_web_context("Latest developments in quantum computing")
prompt = f"Based on this context:\n{context}\n\nAnswer: What are the latest developments?"
This gives your LLM real-time web data instead of relying solely on training data.
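From there, the context is just a string you pass to whatever LLM client you use. As one hedged example, here's how it might look with the official openai Python SDK; the model name is illustrative and an OPENAI_API_KEY environment variable is assumed.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What are the latest developments in quantum computing?"
context = get_web_context(question)  # the function defined above

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; swap in whichever model you prefer
    messages=[
        {"role": "system", "content": "Answer using only the provided web context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```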
SERP-Only Searches
Sometimes you just want search results without scraping the pages. The /serp/search endpoint handles this:
curl -X POST http://localhost:8085/serp/search \
-H "Content-Type: application/json" \
-d '{
"q": "browser automation",
"count": 5
}'
Response:
{
"results": [
{
"url": "https://example.com/browser-automation",
"title": "Browser Automation Guide",
"description": "Learn how to automate browsers..."
}
]
}
Use this when you need URLs and descriptions but don't need full page content.
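Here's the same call from Python, collecting just the URLs and titles; a quick sketch against the response shape shown above:

```python
import requests

resp = requests.post(
    "http://localhost:8085/serp/search",
    json={"q": "browser automation", "count": 5},
)
for hit in resp.json()["results"]:
    print(hit["title"], "->", hit["url"])
```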
Step 5: Deploy With Docker for Production
Docker is the recommended way to run Teracrawl in production. It packages everything needed for consistent deployments.
Basic Docker Setup
Build the image:
docker build -t teracrawl .
Run with your environment file:
docker run -p 8085:8085 --env-file .env teracrawl
This starts Teracrawl on port 8085, ready to accept requests.
Docker Compose With SERP Service
For the full /crawl functionality, you need both Teracrawl and the browser-serp service. Docker Compose makes this easy:
version: "3.8"
services:
teracrawl:
build: .
ports:
- "8085:8085"
environment:
- BROWSER_API_KEY=${BROWSER_API_KEY}
- SERP_SERVICE_URL=http://serp:8080
depends_on:
- serp
serp:
image: ghcr.io/mega-tera/browser-serp:latest
ports:
- "8080:8080"
Save this as docker-compose.yml and run:
docker-compose up -d
Both services start together, with Teracrawl automatically connecting to the SERP service.
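Once the stack is up, a quick sanity check confirms that Teracrawl is healthy and can reach the SERP container. Here's a rough sketch in Python; it reuses the /health and /crawl endpoints documented above, and a single-result query keeps the check cheap.

```python
import requests

BASE = "http://localhost:8085"

# 1. Teracrawl itself should report healthy.
assert requests.get(f"{BASE}/health", timeout=5).json().get("ok")

# 2. A tiny /crawl query exercises the connection to the browser-serp service.
resp = requests.post(
    f"{BASE}/crawl",
    json={"q": "example domain", "count": 1},
    timeout=120,
)
resp.raise_for_status()
print("crawl ok:", len(resp.json()["results"]), "result(s)")
```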
Scaling for High Volume
Need to scrape thousands of pages? Adjust your configuration:
POOL_SIZE=5
CRAWL_TABS_PER_SESSION=12
CRAWL_JITTER_MS=100
POOL_SIZE controls concurrent browser sessions. Each session can handle CRAWL_TABS_PER_SESSION parallel tabs.
CRAWL_JITTER_MS adds random delay between requests. This prevents thundering herd problems and reduces load on target servers.
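As a rough upper bound, concurrency is POOL_SIZE times CRAWL_TABS_PER_SESSION: with the values above, 5 sessions with 12 tabs each means up to 60 pages in flight at once, subject to your Browser.cash plan limits and how quickly the target sites respond.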
Advanced Configuration Options
Teracrawl offers fine-grained control over crawling behavior.
Timeout Settings
Two timeouts control how long Teracrawl waits for pages:
CRAWL_NAVIGATION_TIMEOUT_MS=10000
CRAWL_SLOW_TIMEOUT_MS=20000
The navigation timeout applies during fast mode. The slow timeout kicks in for dynamic pages that need JavaScript execution.
Increase these if you're scraping slow sites. Decrease them for faster failures on problematic URLs.
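For example, for slow, JavaScript-heavy targets you might raise both limits (illustrative values, not recommendations):
CRAWL_NAVIGATION_TIMEOUT_MS=15000
CRAWL_SLOW_TIMEOUT_MS=30000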
Content Quality Thresholds
CRAWL_MIN_CONTENT_LENGTH=200
Teracrawl considers a scrape successful only if the Markdown output exceeds this character count. This filters out pages that blocked the request or returned errors.
Set a higher threshold if you need substantial content. Lower it if you're scraping pages with minimal text.
PDF Handling
Teracrawl can extract text from PDFs when configured with a Datalab API key:
DATALAB_API_KEY=your_datalab_key
PDFs get converted to Markdown just like web pages.
Debug Logging
Enable verbose logs for troubleshooting:
DEBUG_LOG=true
This shows detailed information about browser sessions, navigation events, and content extraction.
Common Errors and How to Fix Them
"Timeout exceeded" Errors
This happens when pages take too long to load. Try:
- Increase CRAWL_SLOW_TIMEOUT_MS to 30000 or higher
- Check if the target site is blocking automated traffic
- Verify your network connection
Empty Markdown Output
Some sites aggressively block scrapers. Teracrawl uses real browsers to avoid most blocks, but some sites still detect automation.
Solutions:
- Wait and retry—some blocks are temporary
- Try different URLs on the same domain
- Check if the site uses advanced anti-bot measures
SERP Service Connection Failed
The /crawl endpoint needs browser-serp running on port 8080.
Check that:
- The browser-serp container is running
- SERP_SERVICE_URL points to the correct address
- Network connectivity exists between services
Browser Session Errors
If you see session-related errors, your Browser.cash API key might be invalid or rate-limited.
Verify your key in the Browser.cash dashboard. Check your usage against plan limits.
Teracrawl vs Other Scraping Tools
How does Teracrawl compare to alternatives?
| Feature | Teracrawl | Firecrawl | Crawl4AI |
|---|---|---|---|
| Benchmark Score | 84.2% | ~80% | ~75% |
| Open Source | Yes | Partial | Yes |
| LLM Output | Markdown | Markdown/JSON | Markdown |
| Search + Scrape | Yes | Yes | No |
| Self-Hosted | Yes | Yes | Yes |
| Browser Backend | Browser.cash | Cloud | Local |
Teracrawl's strength is its high success rate across diverse websites. The benchmark tests 1,000 URLs across many site types.
The Browser.cash backend means you don't need local browser dependencies. Sessions run on managed infrastructure.
Real-World Use Cases
Building AI Research Assistants
Teracrawl feeds current web data to AI models. Search for a topic, scrape the results, and use the Markdown as context.
The clean output means less token waste on HTML noise.
Competitive Price Monitoring
Scrape product pages and extract pricing. Teracrawl's JavaScript rendering handles dynamic e-commerce sites that break basic scrapers.
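For example, once a product page is scraped, prices usually survive in the Markdown as plain text, so a simple pattern match often does the job. A small sketch, with a placeholder URL and a dollar-only regex you'd adapt per site:

```python
import re
import requests

resp = requests.post(
    "http://localhost:8085/scrape",
    json={"url": "https://example.com/product/widget"},  # placeholder product page
)
markdown = resp.json()["markdown"]

# Match dollar-style prices like "$1,299.00"; adjust per site and currency.
prices = re.findall(r"\$\d[\d,]*(?:\.\d{2})?", markdown)
print(prices)
```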
Content Aggregation
Pull articles from multiple sources and convert to a unified Markdown format. Great for building training datasets or news aggregators.
SEO Analysis
Analyze competitor content at scale. Scrape ranking pages and extract their structure, headings, and key phrases.
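Because the output is Markdown, a page's heading structure is easy to recover with a couple of lines. A small sketch, with a placeholder competitor URL:

```python
import re
import requests

resp = requests.post(
    "http://localhost:8085/scrape",
    json={"url": "https://example.com/ranking-article"},  # placeholder competitor URL
)
markdown = resp.json()["markdown"]

# Markdown headings start with one or more "#" characters.
for level, text in re.findall(r"^(#{1,6})\s+(.+)$", markdown, flags=re.MULTILINE):
    print("  " * (len(level) - 1) + text)
```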
FAQ
How much does Teracrawl cost?
Teracrawl is open-source and free to use. You pay only for Browser.cash API usage, which provides the remote browser infrastructure.
Can Teracrawl bypass CAPTCHAs?
Teracrawl uses real Chrome browsers, which helps avoid many detection mechanisms. However, it doesn't automatically solve CAPTCHAs. For sites with heavy CAPTCHA protection, you may need additional solutions.
Does Teracrawl work with JavaScript-heavy sites?
Yes. The dynamic mode waits for JavaScript hydration before extracting content. This handles React, Vue, Angular, and other SPA frameworks.
How many pages can I scrape per minute?
Throughput depends on your POOL_SIZE and CRAWL_TABS_PER_SESSION settings. With default settings, expect 10-20 pages per minute. Scale up by increasing pool size.
Is web scraping legal?
Legality depends on what you scrape and how you use the data. Always check a site's robots.txt and terms of service. Scrape only public data and respect rate limits.
Final Thoughts
Teracrawl simplifies the hardest parts of web scraping. Real browsers handle JavaScript rendering. Smart content extraction produces clean Markdown. Parallel execution keeps things fast.
The tool works well for AI applications where you need current web data in a format LLMs can process efficiently.
Start with the basic /scrape endpoint to understand the output quality. Then move to /crawl for research tasks that need multiple sources.
Check out the Teracrawl GitHub repository for updates and community contributions.