Puppeteer controls a real Chrome browser, letting you scrape JavaScript-heavy sites that break traditional HTTP scrapers. This guide covers everything from basic setup to bypassing Cloudflare in 2026.
Quick Answer: What is Puppeteer Web Scraping?
Puppeteer web scraping is a technique that uses Google's Puppeteer library to control a headless Chrome browser for extracting data from websites. Unlike traditional HTTP scrapers, Puppeteer executes JavaScript, renders dynamic content, and simulates real user interactions—making it ideal for modern React, Vue, and Angular applications that load content client-side.
Modern websites don't serve static HTML anymore. React apps, SPAs, and lazy-loaded content require a browser that executes JavaScript before you can extract data. That's exactly what Puppeteer does.
In this guide, I'll walk you through building production-ready scrapers with Puppeteer—including stealth techniques, proxy rotation, and scaling strategies that actually work in 2026.
What You'll Learn
- Setting up Puppeteer and scraping your first page
- Handling dynamic content and infinite scroll
- Bypassing bot detection with stealth plugins
- Intercepting network requests for faster scraping
- Scaling with browser pooling and concurrency
- Cloudflare bypass techniques that work in 2026
- Memory optimization for large-scale projects
What is Puppeteer and Why Use It for Web Scraping?
Puppeteer is a Node.js library that controls headless Chrome through the DevTools Protocol. When you scrape with Puppeteer, you're automating an actual browser—not sending raw HTTP requests like you would with Axios or fetch.
This distinction matters because modern websites depend on JavaScript. Try scraping a React app with traditional tools, and you'll get an empty shell. Puppeteer executes the JavaScript, waits for content to load, and gives you the fully rendered page.
Here's what makes Puppeteer great for web scraping:
- Full browser control — Click buttons, fill forms, scroll pages, and handle any interaction a human could do
- JavaScript execution — Scrape SPAs and dynamically loaded content without workarounds
- Network interception — Block images and CSS to speed up scraping, or capture API requests directly
- Screenshots and PDFs — Useful for visual verification or archiving pages (see the one-liners right after this list)
- DevTools access — Monitor performance, intercept requests, and debug like you're in the browser console
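For instance, screenshots and PDFs are each a single call (a quick illustration—example.com is just a placeholder, and page.pdf() only works in headless mode):
await page.goto('https://example.com');
await page.screenshot({ path: 'page.png', fullPage: true });
await page.pdf({ path: 'page.pdf', format: 'A4' });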
The tradeoff? Puppeteer is heavier than HTTP-based scraping. Each browser instance consumes around 50-150MB of RAM. You need to be smart about resource management when scaling.
Setting Up Puppeteer
Getting started is straightforward. First, create a new Node.js project:
mkdir puppeteer-scraper
cd puppeteer-scraper
npm init -y
Now install Puppeteer:
npm install puppeteer
This downloads Puppeteer along with a compatible Chromium build (around 170-300MB). If you want to use your own Chrome installation, install puppeteer-core instead and specify the executable path when launching.
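For example, a puppeteer-core launch looks like this—the executablePath is just an illustration; point it at wherever Chrome lives on your machine:
const puppeteer = require('puppeteer-core');

// Use an existing Chrome install instead of the bundled Chromium
const browser = await puppeteer.launch({
  headless: true,
  executablePath: '/usr/bin/google-chrome' // adjust for your OS and install location
});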
Here's your first scraper—extracting a page title:
const puppeteer = require('puppeteer');
(async () => {
// Launch browser in headless mode
const browser = await puppeteer.launch({
headless: true
});
// Open a new tab
const page = await browser.newPage();
// Navigate to the target URL
await page.goto('https://example.com');
// Extract the page title
const title = await page.title();
console.log('Page title:', title);
// Always close the browser to free resources
await browser.close();
})();
Save the script as scraper.js and run it with node scraper.js. You've just scraped your first page.
A few important notes here. Everything uses async/await because Puppeteer communicates with the browser process asynchronously. The page.goto() function navigates and waits for the page to load. Always close the browser when done to prevent memory leaks.
Scraping Dynamic Content
The real power of Puppeteer web scraping shows when dealing with JavaScript-rendered content. Let's scrape product listings that load dynamically.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://books.toscrape.com/', {
waitUntil: 'networkidle2'
});
// Extract book data using page.evaluate()
const books = await page.evaluate(() => {
const bookElements = document.querySelectorAll('.product_pod');
return Array.from(bookElements).map(book => ({
title: book.querySelector('h3 a').getAttribute('title'),
price: book.querySelector('.price_color').textContent,
availability: book.querySelector('.availability').textContent.trim()
}));
});
console.log('Found', books.length, 'books');
console.log(books.slice(0, 3));
await browser.close();
})();
The waitUntil: 'networkidle2' option tells Puppeteer to wait until there are no more than 2 network connections for 500ms. This ensures dynamic content has loaded before extraction.
The page.evaluate() function runs code in the browser context—like executing JavaScript in DevTools console. Inside evaluate, you use standard DOM methods to extract data.
Important: anything inside page.evaluate() runs in the browser, not in Node.js. You can't access Node variables directly in there.
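If you need a Node-side value inside the page, pass it as an extra argument to evaluate()—arguments are serialized and handed to your function in the browser:
const keyword = 'light'; // Node-side variable
const matches = await page.evaluate((term) => {
  return Array.from(document.querySelectorAll('.product_pod h3 a'))
    .map(a => a.getAttribute('title'))
    .filter(title => title.toLowerCase().includes(term));
}, keyword);
console.log(matches);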
Waiting for Specific Elements
Sometimes networkidle2 isn't enough. Pages might load content in stages, or specific elements might appear late.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com/products');
// Wait for a specific element to appear
await page.waitForSelector('.product-list', { timeout: 10000 });
// Or wait for multiple conditions
await Promise.all([
page.waitForSelector('.products'),
page.waitForSelector('.price'),
page.waitForFunction(() => {
return document.querySelectorAll('.product').length > 0;
})
]);
// Now extract the data
const products = await page.$$eval('.product', items => {
return items.map(item => ({
name: item.querySelector('.name')?.textContent,
price: item.querySelector('.price')?.textContent
}));
});
console.log(products);
await browser.close();
})();
The page.$$eval() function is a shorthand when you want to query elements and extract data in one operation. It's cleaner than evaluate() for simple selections.
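For a single element, the sibling helper page.$eval() does the same thing without the array handling:
// Grab one node and return a value computed in the page context
const firstTitle = await page.$eval('.product_pod h3 a', el => el.getAttribute('title'));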
Handling Pagination and Infinite Scroll
Real-world Puppeteer web scraping often involves multiple pages. Here's how to crawl through pagination:
const puppeteer = require('puppeteer');
async function scrapeAllPages(url, maxPages = 5) {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
const allData = [];
let currentPage = 1;
await page.goto(url);
while (currentPage <= maxPages) {
console.log(`Scraping page ${currentPage}...`);
// Wait for content to load
await page.waitForSelector('.product_pod');
// Extract data from current page
const pageData = await page.$$eval('.product_pod', books => {
return books.map(book => ({
title: book.querySelector('h3 a')?.getAttribute('title'),
price: book.querySelector('.price_color')?.textContent
}));
});
allData.push(...pageData);
// Check if next button exists
const nextButton = await page.$('.next a');
if (!nextButton) break;
// Click next and wait for navigation
await Promise.all([
page.waitForNavigation({ waitUntil: 'networkidle2' }),
nextButton.click()
]);
currentPage++;
}
await browser.close();
return allData;
}
scrapeAllPages('https://books.toscrape.com/').then(data => {
console.log(`Scraped ${data.length} books total`);
});
For infinite scroll pages (social media feeds, product listings), you need a different approach:
async function autoScroll(page) {
await page.evaluate(async () => {
await new Promise((resolve) => {
let totalHeight = 0;
const distance = 100;
const timer = setInterval(() => {
const scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
if (totalHeight >= scrollHeight) {
clearInterval(timer);
resolve();
}
}, 100);
});
});
}
// Usage in your scraper
await autoScroll(page);
// Now extract the loaded content
This scrolls the page in small steps until it reaches the bottom; because scrollHeight is re-read on every tick, newly loaded content keeps extending the scroll. Adjust the distance and interval timing based on how quickly the target site loads content.
Avoiding Bot Detection with Stealth Techniques
Here's where things get interesting. Default Puppeteer leaks obvious bot signals—like navigator.webdriver being true and "HeadlessChrome" in the user agent. Many sites block you immediately.
The solution is puppeteer-extra with the stealth plugin:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Now your scraper looks more human:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({
headless: true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-blink-features=AutomationControlled'
]
});
const page = await browser.newPage();
// Set realistic viewport and user agent
await page.setViewport({ width: 1920, height: 1080 });
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
'(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
);
await page.goto('https://bot.sannysoft.com');
// Test if detected
const isDetected = await page.evaluate(() => {
return navigator.webdriver;
});
console.log('Detected as bot?', isDetected);
await browser.close();
})();
The stealth plugin applies 17 different evasion techniques:
- Removes the navigator.webdriver flag
- Fixes Chrome runtime properties
- Masks WebGL vendor and renderer
- Patches the permissions API
- Handles iframe contentWindow consistency
- And many more subtle fingerprint fixes
But stealth plugins aren't magic. They help pass basic bot tests, but sophisticated anti-bot systems like Cloudflare and DataDome can still detect you through behavioral analysis, TLS fingerprinting, and other advanced techniques.
Extra Tricks That Actually Work
Randomize timing between actions:
function randomDelay(min, max) {
return new Promise(resolve => {
setTimeout(resolve, Math.random() * (max - min) + min);
});
}
// Use between actions
await page.click('button');
await randomDelay(1000, 3000);
await page.type('input', 'search query', { delay: 100 });
The delay option in page.type() adds a fixed pause (in milliseconds) between keystrokes, which looks far more human than injecting the whole string at once.
Simulate realistic mouse movements:
async function humanClick(page, selector) {
const element = await page.$(selector);
const box = await element.boundingBox();
// Move to random point within element
const x = box.x + Math.random() * box.width;
const y = box.y + Math.random() * box.height;
await page.mouse.move(x, y, { steps: 10 });
await randomDelay(50, 150);
await page.mouse.click(x, y);
}
This moves the mouse gradually to the element before clicking, rather than teleporting instantly.
Network Interception for Faster Scraping
Blocking unnecessary resources speeds up Puppeteer web scraping significantly—sometimes by 30-50%.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Enable request interception
await page.setRequestInterception(true);
page.on('request', (req) => {
const resourceType = req.resourceType();
// Block images, fonts, and stylesheets
if (['image', 'stylesheet', 'font', 'media'].includes(resourceType)) {
req.abort();
} else {
req.continue();
}
});
await page.goto('https://example.com');
// Scrape without waiting for images to load
const data = await page.evaluate(() => document.title);
console.log(data);
await browser.close();
})();
Capturing API Responses Directly
Sometimes the page fetches data from an API. Instead of scraping HTML, intercept the API calls:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
let apiData = null;
// Listen for API responses
page.on('response', async (response) => {
const url = response.url();
if (url.includes('/api/products')) {
try {
apiData = await response.json();
console.log('Captured API data:', apiData);
} catch (e) {
// Response wasn't JSON
}
}
});
await page.goto('https://example.com/products');
// Wait for the API call to complete (waitForTimeout was removed in newer Puppeteer versions)
await new Promise(resolve => setTimeout(resolve, 2000));
console.log('Final data:', apiData);
await browser.close();
})();
This approach is way faster than waiting for DOM rendering and parsing HTML. You get the raw data directly from the source.
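If you already know which endpoint you need, page.waitForResponse() is cleaner than a fixed delay—it resolves as soon as a matching response arrives (the /api/products path is the same placeholder used above):
const response = await page.waitForResponse(
  res => res.url().includes('/api/products') && res.status() === 200,
  { timeout: 15000 }
);
const apiData = await response.json();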
Using Proxies with Puppeteer
For large-scale Puppeteer web scraping, rotating IPs is essential. Here's how to configure proxies:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
headless: true,
args: [
'--proxy-server=http://proxy.example.com:8080'
]
});
const page = await browser.newPage();
// Authenticate if required
await page.authenticate({
username: 'your_username',
password: 'your_password'
});
await page.goto('https://httpbin.org/ip');
// Verify proxy is working
const content = await page.content();
console.log(content);
await browser.close();
})();
For rotating proxies, launch a new browser instance with a different proxy for each session. Here's a practical pattern:
const puppeteer = require('puppeteer');
const proxies = [
'http://proxy1.example.com:8080',
'http://proxy2.example.com:8080',
'http://proxy3.example.com:8080'
];
async function scrapeWithProxy(url, proxyIndex) {
const proxy = proxies[proxyIndex % proxies.length];
const browser = await puppeteer.launch({
headless: true,
args: [`--proxy-server=${proxy}`]
});
try {
const page = await browser.newPage();
await page.goto(url, { timeout: 30000 });
const data = await page.evaluate(() => document.title);
return { success: true, data, proxy };
} catch (error) {
return { success: false, error: error.message, proxy };
} finally {
await browser.close();
}
}
For production scraping, consider residential proxy services. Datacenter IPs get blocked quickly on most sites. Residential proxies from providers like Roundproxies.com offer rotating IPs that appear as regular home internet connections.
Bypassing Cloudflare in 2026
Cloudflare protection has gotten smarter, but so have bypass techniques. Here's what works in 2026.
Method 1: Puppeteer Real Browser
The puppeteer-real-browser package creates sessions that pass most Cloudflare checks:
npm install puppeteer-real-browser
const { connect } = require('puppeteer-real-browser');
(async () => {
const { browser, page } = await connect({
headless: false,
turnstile: true,
fingerprint: true
});
await page.goto('https://cloudflare-protected-site.com');
// Give the Cloudflare challenge time to resolve (plain setTimeout, since waitForTimeout was removed)
await new Promise(resolve => setTimeout(resolve, 5000));
// Now scrape the actual content
const content = await page.content();
console.log(content);
await browser.close();
})();
The turnstile: true option handles Cloudflare's Turnstile CAPTCHA automatically. The fingerprint: true option injects unique browser fingerprints each session.
Method 2: Cookie Persistence
Once you've passed Cloudflare once, save the cookies and reuse them:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const fs = require('fs');
puppeteer.use(StealthPlugin());
async function saveSession() {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://cloudflare-site.com');
// Manually solve any challenge within this window (waitForTimeout no longer exists)
await new Promise(resolve => setTimeout(resolve, 30000));
// Save cookies
const cookies = await page.cookies();
fs.writeFileSync('cookies.json', JSON.stringify(cookies));
await browser.close();
}
async function reuseSession() {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Load saved cookies
const cookies = JSON.parse(fs.readFileSync('cookies.json'));
await page.setCookie(...cookies);
// Now navigate—should bypass challenge
await page.goto('https://cloudflare-site.com');
const data = await page.evaluate(() => document.title);
console.log(data);
await browser.close();
}
Method 3: Header Matching
Cloudflare checks for header consistency. Match your user agent with appropriate headers:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
(async () => {
const browser = await puppeteer.launch({
headless: true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--disable-gpu'
]
});
const page = await browser.newPage();
// Set consistent headers
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
});
await page.setUserAgent(
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
'(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
);
// Randomize viewport
await page.setViewport({
width: 1920 + Math.floor(Math.random() * 100),
height: 1080 + Math.floor(Math.random() * 100),
deviceScaleFactor: 1
});
await page.goto('https://cloudflare-site.com', {
waitUntil: 'networkidle0',
timeout: 60000
});
await browser.close();
})();
Scaling with Browser Pooling
When you need to scrape hundreds or thousands of pages, launching a new browser for each page kills performance. Use browser pooling to reuse instances efficiently.
Using Puppeteer Cluster
npm install puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 5,
puppeteerOptions: {
headless: true,
args: ['--no-sandbox']
},
monitor: true
});
// Define the scraping task
await cluster.task(async ({ page, data: url }) => {
await page.goto(url, { waitUntil: 'networkidle2' });
const title = await page.title();
console.log(`Scraped: ${title}`);
return { url, title };
});
// Queue URLs
const urls = [
'https://example.com/page1',
'https://example.com/page2',
'https://example.com/page3',
// ... hundreds more
];
urls.forEach(url => cluster.queue(url));
await cluster.idle();
await cluster.close();
})();
This approach reuses browser contexts instead of launching new browsers, surfaces per-task errors through the cluster, and caps concurrency to prevent system overload. Retries are opt-in rather than automatic, as sketched below.
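Here's a sketch of those retry options—retryLimit and retryDelay are puppeteer-cluster settings, so double-check the names against the version you install:
const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_CONTEXT,
  maxConcurrency: 5,
  retryLimit: 2,    // re-queue a failed URL up to two more times
  retryDelay: 1000, // wait 1 second between attempts
  puppeteerOptions: { headless: true, args: ['--no-sandbox'] }
});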
Manual Batch Processing
If you don't want external dependencies, implement batching with native Promises:
const puppeteer = require('puppeteer');
async function scrapeBatch(urls, batchSize = 5) {
const browser = await puppeteer.launch({ headless: true });
const results = [];
for (let i = 0; i < urls.length; i += batchSize) {
const batch = urls.slice(i, i + batchSize);
const batchResults = await Promise.all(
batch.map(async (url) => {
const page = await browser.newPage();
try {
await page.goto(url, { timeout: 30000 });
const data = await page.evaluate(() => document.title);
return { url, data, success: true };
} catch (error) {
return { url, error: error.message, success: false };
} finally {
await page.close();
}
})
);
results.push(...batchResults);
console.log(`Completed batch ${Math.floor(i / batchSize) + 1}`);
}
await browser.close();
return results;
}
This processes URLs in batches of 5, ensuring you never have too many pages open simultaneously.
Memory Optimization Tips
Puppeteer can consume significant memory. Here are optimization strategies for production Puppeteer web scraping:
Close pages immediately after scraping:
const page = await browser.newPage();
try {
await page.goto(url);
const data = await page.evaluate(() => document.title);
return data;
} finally {
await page.close();
}
Use a single browser instance for multiple pages:
const browser = await puppeteer.launch({ headless: true });
for (const url of urls) {
const page = await browser.newPage();
await page.goto(url);
// scrape...
await page.close();
}
await browser.close();
Disable unnecessary features:
const browser = await puppeteer.launch({
headless: true,
args: [
'--disable-gpu',
'--disable-dev-shm-usage',
'--disable-setuid-sandbox',
'--no-first-run',
'--no-sandbox',
'--no-zygote',
'--single-process',
'--disable-extensions'
]
});
Monitor memory usage:
const page = await browser.newPage();
await page.goto(url);
const metrics = await page.metrics();
console.log('JS Heap Size:', metrics.JSHeapUsedSize / 1024 / 1024, 'MB');
Real-World Patterns
Here are patterns I use in production scrapers.
Retry with Exponential Backoff
async function scrapeWithRetry(url, maxRetries = 3) {
const browser = await puppeteer.launch({ headless: true });
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const page = await browser.newPage();
await page.goto(url, { timeout: 30000 });
const data = await page.evaluate(() => document.title);
await page.close();
await browser.close();
return { success: true, data };
} catch (error) {
console.log(`Attempt ${attempt} failed:`, error.message);
if (attempt === maxRetries) {
await browser.close();
return { success: false, error: error.message };
}
// Exponential backoff
const delay = Math.pow(2, attempt) * 1000;
await new Promise(resolve => setTimeout(resolve, delay));
}
}
}
Handle Cookie Consent Popups
async function dismissCookieBanner(page) {
const selectors = [
'#cookie-accept',
'.accept-cookies',
'[data-testid="cookie-accept"]',
'button[aria-label="Accept cookies"]'
];
for (const selector of selectors) {
try {
await page.waitForSelector(selector, { timeout: 3000 });
await page.click(selector);
return true;
} catch (err) {
// Selector not found, try next
}
}
return false;
}
// Usage
await page.goto('https://example.com');
await dismissCookieBanner(page);
// Continue scraping
Extract Structured Data
async function extractProductData(page) {
return await page.evaluate(() => {
// Look for JSON-LD structured data
const jsonLd = document.querySelector('script[type="application/ld+json"]');
if (jsonLd) {
try {
return JSON.parse(jsonLd.textContent);
} catch (e) {}
}
// Fall back to DOM extraction
return {
title: document.querySelector('h1')?.textContent?.trim(),
price: document.querySelector('[data-price]')?.dataset?.price,
description: document.querySelector('[itemprop="description"]')?.textContent?.trim()
};
});
}
Hidden Tricks Most Guides Don't Cover
These are techniques I've picked up from years of production scraping that rarely appear in tutorials.
Trick 1: Intercept and Modify Requests
You can modify outgoing requests before they're sent. This is useful for adding authentication headers or changing request parameters:
await page.setRequestInterception(true);
page.on('request', (request) => {
const headers = request.headers();
headers['Authorization'] = 'Bearer your_token_here';
headers['X-Custom-Header'] = 'custom_value';
request.continue({ headers });
});
This works for adding API keys to requests or spoofing referrer headers.
Trick 2: Execute CDP Commands Directly
Puppeteer exposes the Chrome DevTools Protocol directly. You can do things that aren't available through the high-level API:
const client = await page.target().createCDPSession();
// Emulate network conditions
await client.send('Network.emulateNetworkConditions', {
offline: false,
downloadThroughput: 1.5 * 1024 * 1024 / 8,
uploadThroughput: 750 * 1024 / 8,
latency: 40
});
// Clear browser cache
await client.send('Network.clearBrowserCache');
// Get performance metrics
const perfMetrics = await client.send('Performance.getMetrics');
console.log(perfMetrics);
Trick 3: Persist Browser Data Between Sessions
Instead of starting fresh every time, maintain browser state:
const browser = await puppeteer.launch({
headless: true,
userDataDir: './browser_data'
});
The userDataDir option tells Puppeteer to save cookies, localStorage, and cache to a directory. Next time you launch, it loads the existing state. This is huge for sites that remember logged-in users or have multi-step verification.
Trick 4: Extract Data from Shadow DOM
Many modern sites use Shadow DOM for component isolation. Regular selectors can't reach inside:
const data = await page.evaluate(() => {
const host = document.querySelector('custom-element');
const shadow = host.shadowRoot;
const innerContent = shadow.querySelector('.hidden-content');
return innerContent?.textContent;
});
If you're getting empty results from modern web components, check if they're using Shadow DOM.
Trick 5: Handle File Downloads
Puppeteer doesn't download files by default. Here's how to enable it:
const path = require('path');
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// Set download behavior
const client = await page.target().createCDPSession();
await client.send('Page.setDownloadBehavior', {
behavior: 'allow',
downloadPath: path.resolve('./downloads')
});
await page.goto('https://example.com/download-page');
await page.click('#download-button');
// Wait for the download to finish (simple fixed delay; see the polling sketch below)
await new Promise(resolve => setTimeout(resolve, 5000));
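Fixed delays are fragile for downloads. One alternative is to poll the download folder until a finished file appears—waitForDownload below is a hypothetical helper that relies on Chrome's .crdownload suffix for in-progress files:
const fs = require('fs');

async function waitForDownload(dir, timeoutMs = 30000) {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    // Ignore Chrome's partial-download files
    const files = fs.readdirSync(dir).filter(f => !f.endsWith('.crdownload'));
    if (files.length > 0) return files[0];
    await new Promise(resolve => setTimeout(resolve, 500));
  }
  throw new Error('Download timed out');
}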
Trick 6: Capture Console Logs and Errors
Debug issues by capturing what the page logs:
page.on('console', msg => {
console.log('PAGE LOG:', msg.type(), msg.text());
});
page.on('pageerror', error => {
console.log('PAGE ERROR:', error.message);
});
page.on('requestfailed', request => {
console.log('FAILED:', request.url(), request.failure().errorText);
});
This helps diagnose why scraping fails on certain pages.
Trick 7: Use Browser Contexts for Isolation
Browser contexts are like incognito windows—isolated from each other:
const browser = await puppeteer.launch({ headless: true });
// Create isolated contexts
const context1 = await browser.createBrowserContext();
const context2 = await browser.createBrowserContext();
const page1 = await context1.newPage();
const page2 = await context2.newPage();
// Each context has separate cookies, storage, etc.
await page1.goto('https://site1.com');
await page2.goto('https://site2.com');
// Clean up contexts when done
await context1.close();
await context2.close();
Use this for scraping with multiple accounts simultaneously without interference.
2026 Puppeteer Updates You Should Know
The Puppeteer library continues evolving. Here are the notable changes for 2026:
WebDriver BiDi Support: Puppeteer now supports the WebDriver BiDi protocol alongside CDP, enabling better Firefox support and cross-browser compatibility.
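A minimal cross-browser sketch—recent Puppeteer versions select Firefox with the browser: 'firefox' launch option (older releases used product: 'firefox'), and you may need to run npx puppeteer browsers install firefox first:
const puppeteer = require('puppeteer');

(async () => {
  // Assumption: a Puppeteer version with BiDi-based Firefox support installed
  const browser = await puppeteer.launch({ browser: 'firefox' });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();
})();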
New Headless Mode: Chrome's "new headless" mode is now the default whenever you pass headless: true—the transitional headless: 'new' value from older releases is no longer needed. It's harder to detect than the old headless mode because it runs the same browser engine as headed Chrome.
const browser = await puppeteer.launch({
  headless: true // Uses the new headless mode by default
});
Improved Locators: The locator API has matured with better auto-waiting:
await page.locator('button.submit').click();
// Automatically waits for element to be visible and clickable
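Locators handle typing as well—fill() waits for the input the same way (the selectors here are placeholders):
await page.locator('input[name="q"]').fill('puppeteer scraping');
await page.locator('button[type="submit"]').click();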
Better Error Messages: Stack traces now point to the actual location in your code, not Puppeteer internals. Debugging is significantly easier.
Performance Improvements: Memory usage is down 15-20% compared to 2024 versions. Browser launch time has improved by 30%.
Debugging Puppeteer Scripts
When things go wrong, here's how to diagnose issues.
Run in headed mode to see what's happening:
const browser = await puppeteer.launch({
headless: false,
slowMo: 100 // Slow down actions by 100ms
});
Take screenshots at failure points:
try {
await page.click('.nonexistent-button');
} catch (error) {
await page.screenshot({ path: 'error-screenshot.png', fullPage: true });
throw error;
}
Generate traces for performance analysis:
await page.tracing.start({ path: 'trace.json' });
await page.goto('https://example.com');
await page.tracing.stop();
// Open trace.json in Chrome DevTools Performance tab
Use the DevTools Protocol viewer:
const browser = await puppeteer.launch({
headless: false,
devtools: true // Opens DevTools automatically
});
When NOT to Use Puppeteer
Puppeteer isn't always the right tool. Here's when to consider alternatives.
Use HTTP requests + Cheerio if:
- The site doesn't use JavaScript for content
- You're scraping thousands of pages and performance matters
- You don't need to interact with the page
const axios = require('axios');
const cheerio = require('cheerio');
const { data } = await axios.get('https://example.com');
const $ = cheerio.load(data);
const title = $('h1').text();
This approach is 10-20x faster than Puppeteer because you skip browser overhead entirely.
Use Playwright if:
- You need cross-browser testing (Firefox, Safari, Chrome)
- You want better debugging tools and auto-wait features
- Your project already uses Playwright for testing
Playwright and Puppeteer are similar, but Playwright has better modern features and supports more browsers natively.
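For comparison, the equivalent Playwright script is almost identical—a sketch assuming npm install playwright:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();
})();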
Common Puppeteer Web Scraping Mistakes (And How to Fix Them)
After helping dozens of teams with their Puppeteer web scraping projects, these are the mistakes I see repeatedly.
Mistake 1: Not waiting long enough for dynamic content
// Wrong - content might not be loaded yet
await page.goto('https://spa-site.com');
const data = await page.$$('.products'); // Returns empty array
// Correct - wait for specific elements
await page.goto('https://spa-site.com');
await page.waitForSelector('.products', { timeout: 10000 });
const data = await page.$$('.products'); // Returns populated array
Mistake 2: Using page.waitForTimeout() in production
This method was deprecated (and removed in recent Puppeteer versions) because arbitrary delays are unreliable. Use conditional waits instead:
// Wrong
await page.waitForTimeout(5000);
// Correct
await page.waitForSelector('.loaded-indicator');
// or
await page.waitForFunction(() => window.dataLoaded === true);
Mistake 3: Not handling navigation errors
// Wrong - crashes on timeout
await page.goto('https://slow-site.com');
// Correct - handle gracefully
try {
await page.goto('https://slow-site.com', { timeout: 15000 });
} catch (error) {
if (error.name === 'TimeoutError') {
console.log('Page took too long to load');
// Retry or skip
}
}
Mistake 4: Memory leaks from unclosed browsers
// Wrong - browser stays open on error
const browser = await puppeteer.launch();
await doSomethingThatMightFail(); // If this throws, browser leaks
await browser.close();
// Correct - always close with try/finally
const browser = await puppeteer.launch();
try {
await doSomethingThatMightFail();
} finally {
await browser.close();
}
Mistake 5: Ignoring headless detection
Default Puppeteer is easily detected. Always use stealth plugins for production Puppeteer web scraping:
// Wrong
const browser = await puppeteer.launch({ headless: true });
// Correct
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({ headless: true });
Exporting Scraped Data
Getting data out of Puppeteer and into useful formats.
Export to JSON:
const fs = require('fs');
const scrapedData = await scrapeProducts(page);
fs.writeFileSync('products.json', JSON.stringify(scrapedData, null, 2));
Export to CSV:
const fs = require('fs');
function toCSV(data) {
if (data.length === 0) return '';
const headers = Object.keys(data[0]);
const rows = data.map(item =>
headers.map(header => {
const value = item[header] ?? '';
// Escape quotes and wrap in quotes if contains comma
if (typeof value === 'string' && (value.includes(',') || value.includes('"'))) {
return `"${value.replace(/"/g, '""')}"`;
}
return value;
}).join(',')
);
return [headers.join(','), ...rows].join('\n');
}
const csv = toCSV(scrapedData);
fs.writeFileSync('products.csv', csv);
Stream to database:
const { MongoClient } = require('mongodb');
async function saveToMongo(data) {
const client = new MongoClient('mongodb://localhost:27017');
await client.connect();
const db = client.db('scraping');
const collection = db.collection('products');
await collection.insertMany(data);
await client.close();
}
Performance Benchmarks: Puppeteer Web Scraping Speed
How fast can you scrape? Here are realistic benchmarks from production environments:
| Scenario | Pages/minute | RAM Usage | Notes |
|---|---|---|---|
| Single browser, sequential | 15-20 | ~150MB | Baseline approach |
| Single browser, 5 concurrent tabs | 50-70 | ~400MB | Good balance |
| Browser pool (5 instances) | 100-150 | ~750MB | Requires more RAM |
| With resource blocking | 2-3x faster | 30% less | Block images/CSS |
| Network interception (API) | 5-10x faster | 50% less | Skip DOM parsing |
These numbers assume a typical e-commerce site with moderate complexity. Sites with heavy JavaScript frameworks will be slower; static sites faster.
Quick wins for speed:
- Block images: 30-40% faster
- Block fonts and CSS: 20-30% faster
- Intercept API responses instead of DOM scraping: 5-10x faster
- Use networkidle2 instead of networkidle0: 10-20% faster
- Reuse browser instances: 50% faster than launching new browsers
Error Handling Best Practices
Production scrapers need robust error handling. Here's a complete pattern:
const puppeteer = require('puppeteer');
class ScraperError extends Error {
constructor(message, url, type) {
super(message);
this.name = 'ScraperError';
this.url = url;
this.type = type;
}
}
async function scrapeWithErrorHandling(url, options = {}) {
const {
timeout = 30000,
retries = 3,
onError = () => {}
} = options;
let browser;
for (let attempt = 1; attempt <= retries; attempt++) {
try {
browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
page.setDefaultTimeout(timeout);
const response = await page.goto(url, {
waitUntil: 'networkidle2',
timeout
});
if (!response.ok()) {
throw new ScraperError(
`HTTP ${response.status()}`,
url,
'HTTP_ERROR'
);
}
const data = await page.evaluate(() => {
// Your extraction logic
return document.title;
});
await browser.close();
return { success: true, data, attempt };
} catch (error) {
await onError(error, attempt);
if (browser) {
await browser.close().catch(() => {});
}
if (attempt === retries) {
return {
success: false,
error: error.message,
url,
attempts: attempt
};
}
// Exponential backoff
await new Promise(r => setTimeout(r, 1000 * Math.pow(2, attempt)));
}
}
}
// Usage
const result = await scrapeWithErrorHandling('https://example.com', {
timeout: 15000,
retries: 3,
onError: (error, attempt) => {
console.log(`Attempt ${attempt} failed:`, error.message);
}
});
Conclusion
Puppeteer gives you full browser control for scraping JavaScript-heavy sites. Start with basic extraction, add stealth techniques when you hit bot detection, and scale with pooling when local resources aren't enough.
Key takeaways:
- Use Puppeteer when sites require JavaScript execution
- Apply stealth plugins and randomize behavior to avoid detection
- Block unnecessary resources to improve performance
- Batch requests and use browser pooling when scaling
- Consider simpler tools like Axios + Cheerio for static sites
- Use residential proxies for large-scale operations
The scraping landscape keeps evolving. Anti-bot systems get smarter, but so do evasion techniques. Keep your Puppeteer version updated, stay aware of fingerprinting trends, and don't be afraid to combine multiple approaches.
Now go build something useful with it.
FAQ
Is Puppeteer still good for web scraping in 2026?
Yes. Puppeteer remains the standard for JavaScript-heavy site scraping. It's maintained by Google's Chrome team, has extensive plugin support, and handles modern web apps that break traditional HTTP scrapers.
How do I avoid getting blocked when scraping with Puppeteer?
Use the stealth plugin to mask automation signals. Rotate user agents and viewports. Add random delays between actions. Use residential proxies for IP rotation. Simulate human-like mouse movements and typing patterns.
Can Puppeteer bypass Cloudflare protection?
Default Puppeteer gets blocked by Cloudflare. With stealth plugins, cookie persistence, and packages like puppeteer-real-browser, you can bypass most Cloudflare challenges. Harder protections may require residential proxies or specialized services.
How much RAM does Puppeteer use?
Each browser instance uses 50-150MB of RAM. Optimize by closing pages immediately, blocking unnecessary resources, and using browser pooling for concurrent scraping rather than launching new browsers.
What's the difference between Puppeteer and Playwright?
Both control browsers for automation. Puppeteer focuses on Chrome/Chromium with deep DevTools integration. Playwright supports Chrome, Firefox, and Safari with better cross-browser testing and modern API features. For scraping, Puppeteer's stealth plugin ecosystem gives it an edge.