Getting clean, structured data from websites for AI applications is frustrating. Standard scrapers choke on JavaScript, anti-bot measures block your requests, and you end up with messy HTML instead of usable content.
Teracrawl solves this by converting any website into clean Markdown that's ready for LLMs, RAG pipelines, and AI agents. It achieved the highest coverage score (84.2%) across 14 scraping providers in the scrape-evals benchmark.
In this guide, you'll learn how to set up Teracrawl, configure it for your needs, and use its API endpoints to scrape websites at scale.
What Is Teracrawl and Why Use It?
Teracrawl is a production-ready API that turns websites into clean, LLM-ready Markdown. It drives real Chrome browsers managed by Browser.cash, which keeps success rates high even on protected sites.
Unlike basic HTML scrapers, Teracrawl handles the hard stuff automatically. JavaScript rendering, anti-bot bypasses, and parallel execution happen behind the scenes.
Here's what makes it stand out:
- LLM-optimized output that converts complex HTML into semantic Markdown
- Smart two-phase crawling with fast mode for static pages and dynamic mode for SPAs
- Search and scrape in a single API call—query Google and scrape top results
- High concurrency through a robust session pool for parallel processing
The tool is open-source and runs locally or in Docker containers.
Step 1: Install Teracrawl and Dependencies
Before installing Teracrawl, make sure you have Node.js 18 or higher on your machine. You'll also need a Browser.cash API key.
Open your terminal and clone the repository:
git clone https://github.com/BrowserCash/teracrawl.git
cd teracrawl
This downloads the complete Teracrawl source code to your local machine.
Next, install the required npm packages:
npm install
The installation pulls in dependencies for browser session management, HTML-to-Markdown conversion, and API handling.
That's it for the basic setup. Teracrawl is lightweight and doesn't require complex toolchains.
Getting Your Browser.cash API Key
Teracrawl runs on Browser.cash's remote browser infrastructure. You'll need an API key to authenticate requests.
Visit browser.cash/developers and create an account. Your API key will be available in the dashboard.
Keep this key secure—it's your access to the browser pool that powers Teracrawl's scraping capabilities.
Step 2: Configure Your Environment
Teracrawl uses environment variables for configuration. Start by copying the example file:
cp .env.example .env
Open the .env file in your editor. The minimum required setting is your Browser.cash API key:
BROWSER_API_KEY=your_browser_cash_api_key_here
Replace your_browser_cash_api_key_here with the actual key from your Browser.cash dashboard.
Optional Configuration Variables
For most use cases, the defaults work fine. But you can tune performance with these variables:
| Variable | Default | What It Does |
|---|---|---|
| PORT | 8085 | Server port |
| HOST | 0.0.0.0 | Host to bind to |
| POOL_SIZE | 1 | Concurrent browser sessions |
| CRAWL_TABS_PER_SESSION | 8 | Max tabs per browser session |
| CRAWL_NAVIGATION_TIMEOUT_MS | 10000 | Fast mode timeout (ms) |
| CRAWL_SLOW_TIMEOUT_MS | 20000 | Slow mode timeout (ms) |
Increase POOL_SIZE if you're scraping at high volume. Each session can handle multiple tabs in parallel.
Starting the Server
Run Teracrawl in development mode with:
npm run dev
For production, build and start:
npm run build
npm start
The server starts at http://0.0.0.0:8085. You'll see confirmation in your terminal.
Test that it's running with a health check:
curl http://localhost:8085/health
You should get {"ok":true} back.
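If you're scripting against Teracrawl, a short poll of the health endpoint confirms the server is ready before you start sending work. Here's a minimal sketch in Python using requests; the endpoint and response shape are the ones from the health check above.

```python
import time
import requests

def wait_for_teracrawl(base_url="http://localhost:8085", timeout=30):
    """Poll /health until Teracrawl reports ok, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(f"{base_url}/health", timeout=2)
            if resp.ok and resp.json().get("ok"):
                return True
        except requests.RequestException:
            pass  # server not up yet; keep retrying
        time.sleep(1)
    return False

if not wait_for_teracrawl():
    raise RuntimeError("Teracrawl did not become healthy in time")
```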
Step 3: Scrape a Single URL
The /scrape endpoint converts any URL into clean Markdown. This is the core functionality of Teracrawl.
Here's a basic request:
curl -X POST http://localhost:8085/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/blog/post-1"
}'
The response comes back as JSON with the page title and Markdown content:
{
"url": "https://example.com/blog/post-1",
"title": "My Blog Post",
"markdown": "# My Blog Post\n\nContent of the post...",
"status": "success"
}
Notice how Teracrawl extracts the main content and strips away navigation, ads, and clutter.
How the Two-Phase Scraping Works
Teracrawl uses a smart scraping strategy that adapts to each page:
Fast Mode kicks in first. It reuses browser contexts, blocks heavy assets like images and fonts, and works great for static or server-rendered pages.
Dynamic Mode activates automatically when fast mode doesn't capture enough content. It waits for JavaScript hydration and client-side rendering to complete.
You don't need to configure which mode to use. Teracrawl detects the page type and switches automatically.
Scraping With Python
Want to use Teracrawl from Python? Here's a quick example:
import requests
response = requests.post(
"http://localhost:8085/scrape",
json={"url": "https://news.ycombinator.com/"}
)
data = response.json()
print(data["markdown"])
The markdown field contains clean text that's ready for your LLM pipeline.
Scraping With JavaScript
For Node.js applications:
const response = await fetch("http://localhost:8085/scrape", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url: "https://news.ycombinator.com/" })
});
const data = await response.json();
console.log(data.markdown);
Both examples show how straightforward it is to integrate Teracrawl into existing projects.
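If you already have a list of URLs rather than a search query, you can fan requests out to the local /scrape endpoint from your client code. Here's a rough sketch using Python's ThreadPoolExecutor; the URL list and worker count are placeholders, and the real concurrency is still governed by your POOL_SIZE and CRAWL_TABS_PER_SESSION settings.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

URLS = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/blog/post-3",
]

def scrape(url):
    """Send one URL to the local /scrape endpoint and return the parsed JSON."""
    resp = requests.post(
        "http://localhost:8085/scrape",
        json={"url": url},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

# A few client-side workers is enough; Teracrawl parallelizes on its side too.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(scrape, URLS):
        print(result["title"], "-", len(result["markdown"]), "chars")
```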
Step 4: Search and Crawl Multiple Pages
The /crawl endpoint is where Teracrawl really shines. It queries Google, then scrapes the top results in parallel.
This is perfect for research tasks, competitive analysis, or building datasets.
Important: The /crawl endpoint requires a running instance of browser-serp on port 8080. See the Docker section for the easiest setup.
Here's how to search and scrape:
curl -X POST http://localhost:8085/crawl \
-H "Content-Type: application/json" \
-d '{
"q": "What is the capital of France?",
"count": 3
}'
The q parameter is your search query. The count parameter specifies how many results to scrape (max 20).
The response includes Markdown content from each scraped page:
{
"query": "What is the capital of France?",
"results": [
{
"url": "https://en.wikipedia.org/wiki/Paris",
"title": "Paris - Wikipedia",
"markdown": "# Paris\n\nParis is the capital and most populous city of France...",
"status": "success"
},
{
"url": "https://example.com/france-info",
"title": "France Facts",
"markdown": "# France Facts\n\nThe capital city is Paris...",
"status": "success"
}
]
}
Failed scrapes return an error message instead of Markdown. This helps you handle partial failures gracefully.
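In practice that means splitting the results by status before you feed anything to a model. The exact error payload isn't shown here, so this small sketch filters on the status field from the response above:

```python
import requests

resp = requests.post(
    "http://localhost:8085/crawl",
    json={"q": "What is the capital of France?", "count": 3},
    timeout=120,
)
data = resp.json()

# Keep only the pages that actually produced Markdown.
succeeded = [r for r in data["results"] if r["status"] == "success"]
failed = [r for r in data["results"] if r["status"] != "success"]

print(f"{len(succeeded)} pages scraped, {len(failed)} failed")
for page in succeeded:
    print("-", page["url"], f"({len(page['markdown'])} chars)")
```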
Building a RAG Pipeline With Teracrawl
Here's a practical example. Say you want to answer questions using fresh web data:
import requests
def get_web_context(question, num_sources=5):
"""Search the web and get relevant content for a question."""
response = requests.post(
"http://localhost:8085/crawl",
json={"q": question, "count": num_sources}
)
data = response.json()
# Combine successful results into context
context = ""
for result in data["results"]:
if result["status"] == "success":
context += f"\n\n## Source: {result['title']}\n"
context += result["markdown"][:2000] # Limit length
return context
# Use with your LLM
context = get_web_context("Latest developments in quantum computing")
prompt = f"Based on this context:\n{context}\n\nAnswer: What are the latest developments?"
This gives your LLM real-time web data instead of relying solely on training data.
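From there, the context is just a string you pass to whatever LLM client you use. As one hedged example, here's how it might look with the official openai Python SDK; the model name is illustrative and an OPENAI_API_KEY environment variable is assumed.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What are the latest developments in quantum computing?"
context = get_web_context(question)  # the function defined above

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; swap in whichever model you prefer
    messages=[
        {"role": "system", "content": "Answer using only the provided web context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```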
SERP-Only Searches
Sometimes you just want search results without scraping the pages. The /serp/search endpoint handles this:
curl -X POST http://localhost:8085/serp/search \
-H "Content-Type: application/json" \
-d '{
"q": "browser automation",
"count": 5
}'
Response:
{
"results": [
{
"url": "https://example.com/browser-automation",
"title": "Browser Automation Guide",
"description": "Learn how to automate browsers..."
}
]
}
Use this when you need URLs and descriptions but don't need full page content.
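Here's the same call from Python, collecting just the URLs and titles; a quick sketch against the response shape shown above:

```python
import requests

resp = requests.post(
    "http://localhost:8085/serp/search",
    json={"q": "browser automation", "count": 5},
)
for hit in resp.json()["results"]:
    print(hit["title"], "->", hit["url"])
```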
Step 5: Deploy With Docker for Production
Docker is the recommended way to run Teracrawl in production. It packages everything needed for consistent deployments.
Basic Docker Setup
Build the image:
docker build -t teracrawl .
Run with your environment file:
docker run -p 8085:8085 --env-file .env teracrawl
This starts Teracrawl on port 8085, ready to accept requests.
Docker Compose With SERP Service
For the full /crawl functionality, you need both Teracrawl and the browser-serp service. Docker Compose makes this easy:
version: "3.8"
services:
teracrawl:
build: .
ports:
- "8085:8085"
environment:
- BROWSER_API_KEY=${BROWSER_API_KEY}
- SERP_SERVICE_URL=http://serp:8080
depends_on:
- serp
serp:
image: ghcr.io/mega-tera/browser-serp:latest
ports:
- "8080:8080"
Save this as docker-compose.yml and run:
docker-compose up -d
Both services start together, with Teracrawl automatically connecting to the SERP service.
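Once the stack is up, a quick sanity check confirms that Teracrawl is healthy and can reach the SERP container. Here's a rough sketch in Python; it reuses the /health and /crawl endpoints documented above, and a single-result query keeps the check cheap.

```python
import requests

BASE = "http://localhost:8085"

# 1. Teracrawl itself should report healthy.
assert requests.get(f"{BASE}/health", timeout=5).json().get("ok")

# 2. A tiny /crawl query exercises the connection to the browser-serp service.
resp = requests.post(
    f"{BASE}/crawl",
    json={"q": "example domain", "count": 1},
    timeout=120,
)
resp.raise_for_status()
print("crawl ok:", len(resp.json()["results"]), "result(s)")
```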
Scaling for High Volume
Need to scrape thousands of pages? Adjust your configuration:
POOL_SIZE=5
CRAWL_TABS_PER_SESSION=12
CRAWL_JITTER_MS=100
POOL_SIZE controls concurrent browser sessions. Each session can handle CRAWL_TABS_PER_SESSION parallel tabs.
CRAWL_JITTER_MS adds random delay between requests. This prevents thundering herd problems and reduces load on target servers.
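As a rough upper bound, concurrency is POOL_SIZE times CRAWL_TABS_PER_SESSION: with the values above, 5 sessions with 12 tabs each means up to 60 pages in flight at once, subject to your Browser.cash plan limits and how quickly the target sites respond.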
Advanced Configuration Options
Teracrawl offers fine-grained control over crawling behavior.
Timeout Settings
Two timeouts control how long Teracrawl waits for pages:
CRAWL_NAVIGATION_TIMEOUT_MS=10000
CRAWL_SLOW_TIMEOUT_MS=20000
The navigation timeout applies during fast mode. The slow timeout kicks in for dynamic pages that need JavaScript execution.
Increase these if you're scraping slow sites. Decrease them for faster failures on problematic URLs.
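For example, for slow, JavaScript-heavy targets you might raise both limits (illustrative values, not recommendations):
CRAWL_NAVIGATION_TIMEOUT_MS=15000
CRAWL_SLOW_TIMEOUT_MS=30000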
Content Quality Thresholds
CRAWL_MIN_CONTENT_LENGTH=200
Teracrawl considers a scrape successful only if the Markdown output exceeds this character count. This filters out pages that blocked the request or returned errors.
Set a higher threshold if you need substantial content. Lower it if you're scraping pages with minimal text.
PDF Handling
Teracrawl can extract text from PDFs when configured with a Datalab API key:
DATALAB_API_KEY=your_datalab_key
PDFs get converted to Markdown just like web pages.
Debug Logging
Enable verbose logs for troubleshooting:
DEBUG_LOG=true
This shows detailed information about browser sessions, navigation events, and content extraction.
Common Errors and How to Fix Them
"Timeout exceeded" Errors
This happens when pages take too long to load. Try:
- Increase CRAWL_SLOW_TIMEOUT_MS to 30000 or higher
- Check if the target site is blocking automated traffic
- Verify your network connection
Empty Markdown Output
Some sites aggressively block scrapers. Teracrawl uses real browsers to avoid most blocks, but some sites still detect automation.
Solutions:
- Wait and retry—some blocks are temporary
- Try different URLs on the same domain
- Check if the site uses advanced anti-bot measures
SERP Service Connection Failed
The /crawl endpoint needs browser-serp running on port 8080.
Check that:
- The browser-serp container is running
- SERP_SERVICE_URL points to the correct address
- Network connectivity exists between services
Browser Session Errors
If you see session-related errors, your Browser.cash API key might be invalid or rate-limited.
Verify your key in the Browser.cash dashboard. Check your usage against plan limits.
Teracrawl vs Other Scraping Tools
How does Teracrawl compare to alternatives?
| Feature | Teracrawl | Firecrawl | Crawl4AI |
|---|---|---|---|
| Benchmark Score | 84.2% | ~80% | ~75% |
| Open Source | Yes | Partial | Yes |
| LLM Output | Markdown | Markdown/JSON | Markdown |
| Search + Scrape | Yes | Yes | No |
| Self-Hosted | Yes | Yes | Yes |
| Browser Backend | Browser.cash | Cloud | Local |
Teracrawl's strength is its high success rate across diverse websites. The benchmark tests 1,000 URLs across many site types.
The Browser.cash backend means you don't need local browser dependencies. Sessions run on managed infrastructure.
Real-World Use Cases
Building AI Research Assistants
Teracrawl feeds current web data to AI models. Search for a topic, scrape the results, and use the Markdown as context.
The clean output means less token waste on HTML noise.
Competitive Price Monitoring
Scrape product pages and extract pricing. Teracrawl's JavaScript rendering handles dynamic e-commerce sites that break basic scrapers.
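For example, once a product page is scraped, prices usually survive in the Markdown as plain text, so a simple pattern match often does the job. A small sketch, with a placeholder URL and a dollar-only regex you'd adapt per site:

```python
import re
import requests

resp = requests.post(
    "http://localhost:8085/scrape",
    json={"url": "https://example.com/product/widget"},  # placeholder product page
)
markdown = resp.json()["markdown"]

# Match dollar-style prices like "$1,299.00"; adjust per site and currency.
prices = re.findall(r"\$\d[\d,]*(?:\.\d{2})?", markdown)
print(prices)
```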
Content Aggregation
Pull articles from multiple sources and convert to a unified Markdown format. Great for building training datasets or news aggregators.
SEO Analysis
Analyze competitor content at scale. Scrape ranking pages and extract their structure, headings, and key phrases.
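Because the output is Markdown, a page's heading structure is easy to recover with a couple of lines. A small sketch, with a placeholder competitor URL:

```python
import re
import requests

resp = requests.post(
    "http://localhost:8085/scrape",
    json={"url": "https://example.com/ranking-article"},  # placeholder competitor URL
)
markdown = resp.json()["markdown"]

# Markdown headings start with one or more "#" characters.
for level, text in re.findall(r"^(#{1,6})\s+(.+)$", markdown, flags=re.MULTILINE):
    print("  " * (len(level) - 1) + text)
```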
FAQ
How much does Teracrawl cost?
Teracrawl is open-source and free to use. You pay only for Browser.cash API usage, which provides the remote browser infrastructure.
Can Teracrawl bypass CAPTCHAs?
Teracrawl uses real Chrome browsers, which helps avoid many detection mechanisms. However, it doesn't automatically solve CAPTCHAs. For sites with heavy CAPTCHA protection, you may need additional solutions.
Does Teracrawl work with JavaScript-heavy sites?
Yes. The dynamic mode waits for JavaScript hydration before extracting content. This handles React, Vue, Angular, and other SPA frameworks.
How many pages can I scrape per minute?
Throughput depends on your POOL_SIZE and CRAWL_TABS_PER_SESSION settings. With default settings, expect 10-20 pages per minute. Scale up by increasing pool size.
Is web scraping legal?
Legality depends on what you scrape and how you use the data. Always check a site's robots.txt and terms of service. Scrape only public data and respect rate limits.
Final Thoughts
Teracrawl simplifies the hardest parts of web scraping. Real browsers handle JavaScript rendering. Smart content extraction produces clean Markdown. Parallel execution keeps things fast.
The tool works well for AI applications where you need current web data in a format LLMs can process efficiently.
Start with the basic /scrape endpoint to understand the output quality. Then move to /crawl for research tasks that need multiple sources.
Check out the Teracrawl GitHub repository for updates and community contributions.