I've been scraping websites for years, and if there's one tool that changed the game for JavaScript developers, it's Puppeteer. Unlike old-school HTTP scraping where you parse static HTML, Puppeteer gives you a real browser—meaning you can scrape JavaScript-heavy sites, SPAs, and anything that loads content dynamically.
The best part? It's free, maintained by Google's Chrome team, and works right out of the box with Node.js. No fiddling with browser drivers or compatibility issues.
In this guide, I'll show you how to build scrapers with Puppeteer—from basic page scraping to handling anti-bot detection and scaling to hundreds of concurrent requests. We'll cover the stuff that actually matters when you're building production scrapers, not just toy examples.
What you'll learn:
- Setting up Puppeteer and building your first scraper
- Extracting data from dynamic pages and handling JavaScript
- Avoiding bot detection with stealth techniques
- Scaling scrapers with browser pooling and concurrency
- Real-world tricks that most tutorials don't cover
What is Puppeteer and why use it for scraping?
Puppeteer is a Node.js library that controls headless Chrome (or Chromium) through the DevTools Protocol. When you scrape with Puppeteer, you're automating an actual browser—not just making HTTP requests like you would with Axios or fetch.
This matters because modern websites rely heavily on JavaScript. If you try scraping a React app with traditional tools, you'll get an empty shell. Puppeteer executes the JavaScript, waits for content to load, and gives you the fully rendered page.
Here's what makes Puppeteer great for scraping:
Full browser control: Click buttons, fill forms, scroll pages—anything a user can do.
JavaScript execution: Scrape SPAs and dynamically loaded content without breaking a sweat.
Network interception: Block images and CSS to speed up scraping, or capture API requests directly.
Screenshots and PDFs: Useful for visual verification or archiving pages (see the short example after this list).
Chrome DevTools access: Monitor performance, intercept requests, and debug like you're in the browser console.
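For instance, a minimal sketch of that screenshots-and-PDFs point; the output file names are just examples:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Capture a full-page screenshot and a PDF of the same page (paths are illustrative)
  await page.screenshot({ path: 'page.png', fullPage: true });
  await page.pdf({ path: 'page.pdf', format: 'A4' });

  await browser.close();
})();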
The tradeoff? Puppeteer is heavier than HTTP-based scraping. Each browser instance uses significant RAM (around 50-100MB), so you need to be smart about resource management when scaling.
Setting up Puppeteer
Getting started is straightforward. First, create a new Node.js project:
mkdir puppeteer-scraper
cd puppeteer-scraper
npm init -y
Now install Puppeteer:
npm install puppeteer
This downloads Puppeteer along with a compatible version of Chromium (about 170-300MB). If you want to use your own Chrome installation instead, install puppeteer-core and specify the executable path when launching.
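For example, a minimal puppeteer-core sketch; the executablePath here is an assumption, so point it at wherever Chrome actually lives on your machine:
const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    executablePath: '/usr/bin/google-chrome' // example path; adjust for your OS
  });
  // ...use the browser exactly as in the examples below
  await browser.close();
})();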
Here's your first scraper—let's extract the title from a webpage:
const puppeteer = require('puppeteer');

(async () => {
  // Launch browser
  const browser = await puppeteer.launch({
    headless: true // Set to false to see the browser
  });

  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Extract the page title
  const title = await page.title();
  console.log('Page title:', title);

  await browser.close();
})();
Run it with node scraper.js. That's it—you just scraped your first page.
A few things to note here:
- Everything is async/await because Puppeteer communicates with the browser process
- page.goto() navigates to a URL and waits for the page to load
- Always close the browser when done to free up resources
Scraping dynamic content
The real power of Puppeteer shows when dealing with JavaScript-rendered content. Let's scrape product listings that load dynamically.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://books.toscrape.com/', {
    waitUntil: 'networkidle2' // Wait until network is mostly idle
  });

  // Extract book data
  const books = await page.evaluate(() => {
    const bookElements = document.querySelectorAll('.product_pod');
    return Array.from(bookElements).map(book => ({
      title: book.querySelector('h3 a').getAttribute('title'),
      price: book.querySelector('.price_color').textContent,
      availability: book.querySelector('.availability').textContent.trim()
    }));
  });

  console.log('Found', books.length, 'books');
  console.log(books.slice(0, 3)); // Show first 3

  await browser.close();
})();
Here's what's happening:
- waitUntil: 'networkidle2' tells Puppeteer to wait until there are no more than 2 network connections for 500ms, which generally means dynamic content has finished loading
- page.evaluate() runs code in the browser context—like executing JavaScript in the DevTools console
- We use standard DOM methods inside evaluate() to extract data
The key insight: anything inside page.evaluate() runs in the browser, not in Node.js. You can't access Node variables directly in there.
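If you need a Node value inside the browser, pass it as an extra argument to evaluate(). Here's a small sketch continuing the books example above; the price threshold is just an illustration:
const minPrice = 20; // lives in Node.js

const expensiveTitles = await page.evaluate((threshold) => {
  // This callback runs in the browser; `threshold` arrives as a serialized argument
  return Array.from(document.querySelectorAll('.product_pod'))
    .filter(book => parseFloat(book.querySelector('.price_color').textContent.replace(/[^\d.]/g, '')) > threshold)
    .map(book => book.querySelector('h3 a').getAttribute('title'));
}, minPrice); // extra arguments to evaluate() are passed into the browser context here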
Waiting for elements
Sometimes networkidle2 isn't enough. Pages might load in stages, or specific elements might appear late. Here's how to wait for specific content:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://example.com/products');

  // Wait for specific element
  await page.waitForSelector('.product-list', { timeout: 5000 });

  // Or wait for multiple conditions
  await Promise.all([
    page.waitForSelector('.products'),
    page.waitForSelector('.price'),
    page.waitForFunction(() => document.querySelectorAll('.product').length > 0)
  ]);

  // Now scrape
  const products = await page.$$eval('.product', items => {
    return items.map(item => ({
      name: item.querySelector('.name')?.textContent,
      price: item.querySelector('.price')?.textContent
    }));
  });

  console.log(products);

  await browser.close();
})();
Pro tip: Use page.$$eval() as a shorthand when you want to query elements and extract data in one go. It's cleaner than evaluate() for simple selections.
Handling pagination and infinite scroll
Real-world scraping often involves multiple pages. Here's how to crawl through pagination:
const puppeteer = require('puppeteer');

async function scrapeAllPages(url, maxPages = 5) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const allData = [];
  let currentPage = 1;

  await page.goto(url);

  while (currentPage <= maxPages) {
    console.log(`Scraping page ${currentPage}...`);

    // Wait for content
    await page.waitForSelector('.product_pod');

    // Scrape current page
    const pageData = await page.$$eval('.product_pod', books => {
      return books.map(book => ({
        title: book.querySelector('h3 a')?.getAttribute('title'),
        price: book.querySelector('.price_color')?.textContent
      }));
    });

    allData.push(...pageData);

    // Check if next button exists
    const nextButton = await page.$('.next a');
    if (!nextButton) break;

    // Click next and wait for navigation
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      nextButton.click()
    ]);

    currentPage++;
  }

  await browser.close();
  return allData;
}

scrapeAllPages('https://books.toscrape.com/').then(data => {
  console.log(`Scraped ${data.length} books total`);
});
For infinite scroll pages (like social media feeds), you need a different approach:
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 100;
      const timer = setInterval(() => {
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
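Usage looks something like this sketch; the feed URL and .feed-item selector are placeholders for whatever site you're scraping:
const page = await browser.newPage();
await page.goto('https://example.com/feed', { waitUntil: 'networkidle2' });

await autoScroll(page); // scroll until we catch up with the current page height

const posts = await page.$$eval('.feed-item', items =>
  items.map(item => item.textContent.trim())
);
console.log(`Loaded ${posts.length} posts after scrolling`);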
Avoiding bot detection
Here's where things get interesting. Default Puppeteer leaks obvious bot signals—like navigator.webdriver being true and "HeadlessChrome" in the user agent. Many sites will block you immediately.
The solution is puppeteer-extra with the stealth plugin:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Now your scraper looks more human:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-blink-features=AutomationControlled'
    ]
  });

  const page = await browser.newPage();

  // Set viewport and user agent
  await page.setViewport({ width: 1920, height: 1080 });
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

  await page.goto('https://bot-detection-test.com');

  // Test if detected
  const isDetected = await page.evaluate(() => {
    return navigator.webdriver;
  });

  console.log('Detected as bot?', isDetected); // Should be false

  await browser.close();
})();
The stealth plugin applies 17 different evasion techniques:
- Removes the navigator.webdriver flag
- Fixes Chrome runtime properties
- Masks WebGL vendor and renderer
- Patches permissions API
- And many more subtle fingerprint fixes
But here's the thing—stealth plugins aren't magic. They help pass basic bot tests, but sophisticated anti-bot systems (Cloudflare, DataDome) can still detect you through behavioral analysis, TLS fingerprinting, and other advanced techniques.
A few extra tricks that actually work:
1. Randomize timing between actions:
function randomDelay(min, max) {
  return new Promise(resolve => {
    setTimeout(resolve, Math.random() * (max - min) + min);
  });
}

await page.click('button');
await randomDelay(1000, 3000); // Wait 1-3 seconds
await page.type('input', 'search query', { delay: 100 }); // Type with human-like delays
2. Block unnecessary resources to speed up scraping:
await page.setRequestInterception(true);

page.on('request', (req) => {
  const resourceType = req.resourceType();

  // Block images, fonts, stylesheets
  if (['image', 'stylesheet', 'font'].includes(resourceType)) {
    req.abort();
  } else {
    req.continue();
  }
});
This can speed up scraping by 30-50% since you're not downloading assets you don't need.
3. Use residential proxies for IP rotation:
const browser = await puppeteer.launch({
  args: [
    '--proxy-server=http://proxy-address:port'
  ]
});

const page = await browser.newPage();
await page.authenticate({ username: 'user', password: 'pass' });
If you're scraping at scale, rotating IPs is essential. Free proxies are unreliable—consider paid residential proxy services for production use.
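If your provider gives you a pool of endpoints rather than a single rotating gateway, a simple per-launch rotation sketch could look like this; the proxy hostnames and credentials are placeholders:
const proxies = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
  'http://proxy3.example.com:8000'
];

// Pick a different proxy for each browser launch
const proxy = proxies[Math.floor(Math.random() * proxies.length)];

const browser = await puppeteer.launch({
  args: [`--proxy-server=${proxy}`]
});

const page = await browser.newPage();
await page.authenticate({ username: 'user', password: 'pass' }); // placeholder credentials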
Scaling with browser pooling
When you need to scrape hundreds or thousands of pages, launching a new browser for each page kills performance. Instead, use a browser pool to reuse instances efficiently.
Here's a practical approach with puppeteer-cluster:
npm install puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 5, // 5 parallel workers
    puppeteerOptions: {
      headless: true,
      args: ['--no-sandbox']
    }
  });

  // Define scraping task
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: 'networkidle2' });
    const title = await page.title();
    console.log(`Scraped: ${title}`);
    return { url, title };
  });

  // Queue URLs
  const urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    // ... hundreds more
  ];

  urls.forEach(url => cluster.queue(url));

  await cluster.idle(); // Wait for all tasks
  await cluster.close();
})();
This approach:
- Reuses browser contexts instead of launching new browsers
- Catches task errors for you and can retry failed jobs (via the retryLimit option)
- Manages concurrency so you don't overwhelm your system
For memory-constrained environments, keep concurrency between 3-5. Each context still uses significant RAM.
Alternative: batch processing without external libraries
If you don't want extra dependencies, you can implement batching with native Promises:
async function scrapeBatch(urls, batchSize = 5) {
  const browser = await puppeteer.launch({ headless: true });
  const results = [];

  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);

    const batchResults = await Promise.all(
      batch.map(async (url) => {
        const page = await browser.newPage();
        try {
          await page.goto(url, { timeout: 30000 });
          const data = await page.evaluate(() => document.title);
          return { url, data };
        } catch (error) {
          return { url, error: error.message };
        } finally {
          await page.close();
        }
      })
    );

    results.push(...batchResults);
    console.log(`Completed batch ${Math.floor(i / batchSize) + 1}`);
  }

  await browser.close();
  return results;
}
This processes URLs in batches of 5, ensuring you never have too many pages open simultaneously.
Real-world scraping patterns
Most tutorials stop at basic examples, but real scraping involves handling edge cases. Here are patterns I use in production:
Pattern 1: Retry with exponential backoff
async function scrapeWithRetry(url, maxRetries = 3) {
  const browser = await puppeteer.launch({ headless: true });

  try {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      const page = await browser.newPage();
      try {
        await page.goto(url, { timeout: 30000 });
        const data = await page.evaluate(() => {
          // Your scraping logic
          return document.title;
        });
        return data;
      } catch (error) {
        console.log(`Attempt ${attempt} failed:`, error.message);
        if (attempt === maxRetries) throw error;

        // Exponential backoff: 2s, 4s, 8s...
        const delay = Math.pow(2, attempt) * 1000;
        await new Promise(resolve => setTimeout(resolve, delay));
      } finally {
        // Close the page either way so failed attempts don't leak tabs
        await page.close();
      }
    }
  } finally {
    await browser.close();
  }
}
Pattern 2: Capture network requests instead of DOM scraping
Sometimes the page fetches data from an API. Instead of scraping HTML, intercept the API calls:
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Listen for API responses
page.on('response', async (response) => {
  const url = response.url();
  if (url.includes('/api/products')) {
    try {
      const data = await response.json();
      console.log('API data:', data);
    } catch (err) {
      // Ignore matching responses that aren't valid JSON (redirects, preflights, etc.)
    }
  }
});

await page.goto('https://example.com/products');
This is way faster than waiting for DOM rendering and parsing HTML. You get the raw data directly.
Pattern 3: Handle cookie consent popups
Cookie banners are annoying for scraping. Dismiss them automatically:
await page.goto('https://example.com');

// Try to click "Accept" button
try {
  await page.waitForSelector('#cookie-accept', { timeout: 3000 });
  await page.click('#cookie-accept');
} catch (err) {
  // No cookie popup, continue
}

// Now scrape the page
When NOT to use Puppeteer
Puppeteer isn't always the right tool. Here's when to consider alternatives:
Use simple HTTP requests + Cheerio if:
- The site doesn't use JavaScript for content
- You're scraping thousands of pages and performance matters
- You don't need to interact with the page (click, scroll, etc.)
const axios = require('axios');
const cheerio = require('cheerio');
const { data } = await axios.get('https://example.com');
const $ = cheerio.load(data);
const title = $('h1').text();
This is 10-20x faster than Puppeteer because you skip the browser overhead entirely.
Use Playwright if:
- You need cross-browser testing (Firefox, Safari, Chrome)
- You want better debugging tools and developer experience
- Your project already uses Playwright for testing
Playwright and Puppeteer are similar, but Playwright has better modern features and supports more browsers natively.
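To see how close the two APIs are, here's the first example from this guide as a minimal Playwright sketch:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log('Page title:', await page.title());
  await browser.close();
})();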
Wrapping up
Puppeteer gives you a real browser for scraping JavaScript-heavy sites, which is powerful but comes with resource costs. Start with basic scraping, add stealth techniques when you hit bot detection, and scale with pooling or cloud solutions when local resources aren't enough.
Key takeaways:
- Use Puppeteer when sites require JavaScript execution
- Apply stealth plugins and randomize behavior to avoid detection
- Block unnecessary resources to improve performance
- Batch requests and use browser pooling when scaling
- Consider simpler tools like Axios + Cheerio for static sites
The scraping landscape keeps evolving—anti-bot systems get smarter, but so do evasion techniques. Keep your Puppeteer version updated, stay aware of fingerprinting trends, and don't be afraid to combine multiple approaches (HTTP requests for simple pages, Puppeteer for complex ones).
Now go build something useful with it.