I've been scraping websites for years, and if there's one tool that changed the game for JavaScript developers, it's Puppeteer. Unlike old-school HTTP scraping where you parse static HTML, Puppeteer gives you a real browser—meaning you can scrape JavaScript-heavy sites, SPAs, and anything that loads content dynamically.
The best part? It's free, maintained by Google's Chrome team, and works right out of the box with Node.js. No fiddling with browser drivers or compatibility issues.
In this guide, I'll show you how to build scrapers with Puppeteer—from basic page scraping to handling anti-bot detection and scaling to hundreds of concurrent requests. We'll cover the stuff that actually matters when you're building production scrapers, not just toy examples.
What you'll learn:
- Setting up Puppeteer and building your first scraper
- Extracting data from dynamic pages and handling JavaScript
- Avoiding bot detection with stealth techniques
- Scaling scrapers with browser pooling and concurrency
- Real-world tricks that most tutorials don't cover
What is Puppeteer and why use it for scraping?
Puppeteer is a Node.js library that controls headless Chrome (or Chromium) through the DevTools Protocol. When you scrape with Puppeteer, you're automating an actual browser—not just making HTTP requests like you would with Axios or fetch.
This matters because modern websites rely heavily on JavaScript. If you try scraping a React app with traditional tools, you'll get an empty shell. Puppeteer executes the JavaScript, waits for content to load, and gives you the fully rendered page.
Here's what makes Puppeteer great for scraping:
Full browser control: Click buttons, fill forms, scroll pages—anything a user can do.
JavaScript execution: Scrape SPAs and dynamically loaded content without breaking a sweat.
Network interception: Block images and CSS to speed up scraping, or capture API requests directly.
Screenshots and PDFs: Useful for visual verification or archiving pages (see the short example after this list).
Chrome DevTools access: Monitor performance, intercept requests, and debug like you're in the browser console.
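For instance, a minimal sketch of that screenshots-and-PDFs point; the output file names are just examples:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });

  // Capture a full-page screenshot and a PDF of the same page (paths are illustrative)
  await page.screenshot({ path: 'page.png', fullPage: true });
  await page.pdf({ path: 'page.pdf', format: 'A4' });

  await browser.close();
})();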
The tradeoff? Puppeteer is heavier than HTTP-based scraping. Each browser instance uses significant RAM (around 50-100MB), so you need to be smart about resource management when scaling.
Setting up Puppeteer
Getting started is straightforward. First, create a new Node.js project:
mkdir puppeteer-scraper
cd puppeteer-scraper
npm init -y
Now install Puppeteer:
npm install puppeteer
This downloads Puppeteer along with a compatible version of Chromium (about 170-300MB). If you want to use your own Chrome installation instead, install puppeteer-core and specify the executable path when launching.
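For example, a minimal puppeteer-core sketch; the executablePath here is an assumption, so point it at wherever Chrome actually lives on your machine:
const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    executablePath: '/usr/bin/google-chrome' // example path; adjust for your OS
  });
  // ...use the browser exactly as in the examples below
  await browser.close();
})();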
Here's your first scraper—let's extract the title from a webpage:
const puppeteer = require('puppeteer');

(async () => {
  // Launch browser
  const browser = await puppeteer.launch({
    headless: true // Set to false to see the browser
  });

  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Extract the page title
  const title = await page.title();
  console.log('Page title:', title);

  await browser.close();
})();
Run it with node scraper.js. That's it—you just scraped your first page.
A few things to note here:
- Everything is async/await because Puppeteer communicates with the browser process
- page.goto() navigates to a URL and waits for the page to load
- Always close the browser when done to free up resources
Scraping dynamic content
The real power of Puppeteer shows when dealing with JavaScript-rendered content. Let's scrape product listings that load dynamically.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://books.toscrape.com/', {
    waitUntil: 'networkidle2' // Wait until network is mostly idle
  });

  // Extract book data
  const books = await page.evaluate(() => {
    const bookElements = document.querySelectorAll('.product_pod');
    return Array.from(bookElements).map(book => ({
      title: book.querySelector('h3 a').getAttribute('title'),
      price: book.querySelector('.price_color').textContent,
      availability: book.querySelector('.availability').textContent.trim()
    }));
  });

  console.log('Found', books.length, 'books');
  console.log(books.slice(0, 3)); // Show first 3

  await browser.close();
})();
Here's what's happening:
- waitUntil: 'networkidle2' tells Puppeteer to wait until there are no more than 2 network connections for 500ms, which generally means dynamic content has finished loading
- page.evaluate() runs code in the browser context—like executing JavaScript in the DevTools console
- We use standard DOM methods inside evaluate() to extract data
The key insight: anything inside page.evaluate() runs in the browser, not in Node.js. You can't access Node variables directly in there.
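If you need a Node value inside the browser, pass it as an extra argument to evaluate(). Here's a small sketch continuing the books example above; the price threshold is just an illustration:
const minPrice = 20; // lives in Node.js

const expensiveTitles = await page.evaluate((threshold) => {
  // This callback runs in the browser; `threshold` arrives as a serialized argument
  return Array.from(document.querySelectorAll('.product_pod'))
    .filter(book => parseFloat(book.querySelector('.price_color').textContent.replace(/[^\d.]/g, '')) > threshold)
    .map(book => book.querySelector('h3 a').getAttribute('title'));
}, minPrice); // extra arguments to evaluate() are passed into the browser context here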
Waiting for elements
Sometimes networkidle2 isn't enough. Pages might load in stages, or specific elements might appear late. Here's how to wait for specific content:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto('https://example.com/products');

  // Wait for specific element
  await page.waitForSelector('.product-list', { timeout: 5000 });

  // Or wait for multiple conditions
  await Promise.all([
    page.waitForSelector('.products'),
    page.waitForSelector('.price'),
    page.waitForFunction(() => document.querySelectorAll('.product').length > 0)
  ]);

  // Now scrape
  const products = await page.$$eval('.product', items => {
    return items.map(item => ({
      name: item.querySelector('.name')?.textContent,
      price: item.querySelector('.price')?.textContent
    }));
  });

  console.log(products);

  await browser.close();
})();
Pro tip: Use page.$$eval() as a shorthand when you want to query elements and extract data in one go. It's cleaner than evaluate() for simple selections.
Handling pagination and infinite scroll
Real-world scraping often involves multiple pages. Here's how to crawl through pagination:
const puppeteer = require('puppeteer');

async function scrapeAllPages(url, maxPages = 5) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const allData = [];
  let currentPage = 1;

  await page.goto(url);

  while (currentPage <= maxPages) {
    console.log(`Scraping page ${currentPage}...`);

    // Wait for content
    await page.waitForSelector('.product_pod');

    // Scrape current page
    const pageData = await page.$$eval('.product_pod', books => {
      return books.map(book => ({
        title: book.querySelector('h3 a')?.getAttribute('title'),
        price: book.querySelector('.price_color')?.textContent
      }));
    });

    allData.push(...pageData);

    // Check if next button exists
    const nextButton = await page.$('.next a');
    if (!nextButton) break;

    // Click next and wait for navigation
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'networkidle2' }),
      nextButton.click()
    ]);

    currentPage++;
  }

  await browser.close();
  return allData;
}

scrapeAllPages('https://books.toscrape.com/').then(data => {
  console.log(`Scraped ${data.length} books total`);
});
For infinite scroll pages (like social media feeds), you need a different approach:
async function autoScroll(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let totalHeight = 0;
      const distance = 100;
      const timer = setInterval(() => {
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
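Usage looks something like this sketch; the feed URL and .feed-item selector are placeholders for whatever site you're scraping:
const page = await browser.newPage();
await page.goto('https://example.com/feed', { waitUntil: 'networkidle2' });

await autoScroll(page); // scroll until we catch up with the current page height

const posts = await page.$$eval('.feed-item', items =>
  items.map(item => item.textContent.trim())
);
console.log(`Loaded ${posts.length} posts after scrolling`);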
Avoiding bot detection
Here's where things get interesting. Default Puppeteer leaks obvious bot signals—like navigator.webdriver being true and "HeadlessChrome" in the user agent. Many sites will block you immediately.
The solution is puppeteer-extra with the stealth plugin:
npm install puppeteer-extra puppeteer-extra-plugin-stealth
Now your scraper looks more human:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-blink-features=AutomationControlled'
    ]
  });

  const page = await browser.newPage();

  // Set viewport and user agent
  await page.setViewport({ width: 1920, height: 1080 });
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

  await page.goto('https://bot-detection-test.com');

  // Test if detected
  const isDetected = await page.evaluate(() => {
    return navigator.webdriver;
  });

  console.log('Detected as bot?', isDetected); // Should be false

  await browser.close();
})();
The stealth plugin applies 17 different evasion techniques:
- Removes the navigator.webdriver flag
- Fixes Chrome runtime properties
- Masks WebGL vendor and renderer
- Patches permissions API
- And many more subtle fingerprint fixes
But here's the thing—stealth plugins aren't magic. They help pass basic bot tests, but sophisticated anti-bot systems (Cloudflare, DataDome) can still detect you through behavioral analysis, TLS fingerprinting, and other advanced techniques.
A few extra tricks that actually work:
1. Randomize timing between actions:
function randomDelay(min, max) {
  return new Promise(resolve => {
    setTimeout(resolve, Math.random() * (max - min) + min);
  });
}

await page.click('button');
await randomDelay(1000, 3000); // Wait 1-3 seconds
await page.type('input', 'search query', { delay: 100 }); // Type with human-like delays
2. Block unnecessary resources to speed up scraping:
await page.setRequestInterception(true);

page.on('request', (req) => {
  const resourceType = req.resourceType();

  // Block images, fonts, stylesheets
  if (['image', 'stylesheet', 'font'].includes(resourceType)) {
    req.abort();
  } else {
    req.continue();
  }
});
This can speed up scraping by 30-50% since you're not downloading assets you don't need.
3. Use residential proxies for IP rotation:
const browser = await puppeteer.launch({
  args: [
    '--proxy-server=http://proxy-address:port'
  ]
});

const page = await browser.newPage();
await page.authenticate({ username: 'user', password: 'pass' });
If you're scraping at scale, rotating IPs is essential. Free proxies are unreliable—consider paid residential proxy services for production use.
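If your provider gives you a pool of endpoints rather than a single rotating gateway, a simple per-launch rotation sketch could look like this; the proxy hostnames and credentials are placeholders:
const proxies = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000',
  'http://proxy3.example.com:8000'
];

// Pick a different proxy for each browser launch
const proxy = proxies[Math.floor(Math.random() * proxies.length)];

const browser = await puppeteer.launch({
  args: [`--proxy-server=${proxy}`]
});

const page = await browser.newPage();
await page.authenticate({ username: 'user', password: 'pass' }); // placeholder credentials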
Scaling with browser pooling
When you need to scrape hundreds or thousands of pages, launching a new browser for each page kills performance. Instead, use a browser pool to reuse instances efficiently.
Here's a practical approach with puppeteer-cluster:
npm install puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 5, // 5 parallel workers
    puppeteerOptions: {
      headless: true,
      args: ['--no-sandbox']
    }
  });

  // Define scraping task
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { waitUntil: 'networkidle2' });
    const title = await page.title();
    console.log(`Scraped: ${title}`);
    return { url, title };
  });

  // Queue URLs
  const urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
    // ... hundreds more
  ];

  urls.forEach(url => cluster.queue(url));

  await cluster.idle(); // Wait for all tasks
  await cluster.close();
})();
This approach:
- Reuses browser contexts instead of launching new browsers
- Catches task errors for you and can retry failed jobs (via the retryLimit option)
- Manages concurrency so you don't overwhelm your system
For memory-constrained environments, keep concurrency between 3-5. Each context still uses significant RAM.
Alternative: batch processing without external libraries
If you don't want extra dependencies, you can implement batching with native Promises:
async function scrapeBatch(urls, batchSize = 5) {
  const browser = await puppeteer.launch({ headless: true });
  const results = [];

  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);

    const batchResults = await Promise.all(
      batch.map(async (url) => {
        const page = await browser.newPage();
        try {
          await page.goto(url, { timeout: 30000 });
          const data = await page.evaluate(() => document.title);
          return { url, data };
        } catch (error) {
          return { url, error: error.message };
        } finally {
          await page.close();
        }
      })
    );

    results.push(...batchResults);
    console.log(`Completed batch ${Math.floor(i / batchSize) + 1}`);
  }

  await browser.close();
  return results;
}
This processes URLs in batches of 5, ensuring you never have too many pages open simultaneously.
Real-world scraping patterns
Most tutorials stop at basic examples, but real scraping involves handling edge cases. Here are patterns I use in production:
Pattern 1: Retry with exponential backoff
async function scrapeWithRetry(url, maxRetries = 3) {
  const browser = await puppeteer.launch({ headless: true });

  try {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
      const page = await browser.newPage();
      try {
        await page.goto(url, { timeout: 30000 });
        const data = await page.evaluate(() => {
          // Your scraping logic
          return document.title;
        });
        return data;
      } catch (error) {
        console.log(`Attempt ${attempt} failed:`, error.message);
        if (attempt === maxRetries) throw error;

        // Exponential backoff: 2s, 4s, 8s...
        const delay = Math.pow(2, attempt) * 1000;
        await new Promise(resolve => setTimeout(resolve, delay));
      } finally {
        // Close the page either way so failed attempts don't leak tabs
        await page.close();
      }
    }
  } finally {
    await browser.close();
  }
}
Pattern 2: Capture network requests instead of DOM scraping
Sometimes the page fetches data from an API. Instead of scraping HTML, intercept the API calls:
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

// Listen for API responses
page.on('response', async (response) => {
  const url = response.url();
  if (url.includes('/api/products')) {
    try {
      const data = await response.json();
      console.log('API data:', data);
    } catch (err) {
      // Ignore matching responses that aren't valid JSON (redirects, preflights, etc.)
    }
  }
});

await page.goto('https://example.com/products');
This is way faster than waiting for DOM rendering and parsing HTML. You get the raw data directly.
Pattern 3: Handle cookie consent popups
Cookie banners are annoying for scraping. Dismiss them automatically:
await page.goto('https://example.com');

// Try to click "Accept" button
try {
  await page.waitForSelector('#cookie-accept', { timeout: 3000 });
  await page.click('#cookie-accept');
} catch (err) {
  // No cookie popup, continue
}

// Now scrape the page
When NOT to use Puppeteer
Puppeteer isn't always the right tool. Here's when to consider alternatives:
Use simple HTTP requests + Cheerio if:
- The site doesn't use JavaScript for content
- You're scraping thousands of pages and performance matters
- You don't need to interact with the page (click, scroll, etc.)
const axios = require('axios');
const cheerio = require('cheerio');
const { data } = await axios.get('https://example.com');
const $ = cheerio.load(data);
const title = $('h1').text();
This is 10-20x faster than Puppeteer because you skip the browser overhead entirely.
Use Playwright if:
- You need cross-browser testing (Firefox, Safari, Chrome)
- You want better debugging tools and developer experience
- Your project already uses Playwright for testing
Playwright and Puppeteer are similar, but Playwright has better modern features and supports more browsers natively.
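To see how close the two APIs are, here's the first example from this guide as a minimal Playwright sketch:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log('Page title:', await page.title());
  await browser.close();
})();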
Wrapping up
Puppeteer gives you a real browser for scraping JavaScript-heavy sites, which is powerful but comes with resource costs. Start with basic scraping, add stealth techniques when you hit bot detection, and scale with pooling or cloud solutions when local resources aren't enough.
Key takeaways:
- Use Puppeteer when sites require JavaScript execution
- Apply stealth plugins and randomize behavior to avoid detection
- Block unnecessary resources to improve performance
- Batch requests and use browser pooling when scaling
- Consider simpler tools like Axios + Cheerio for static sites
The scraping landscape keeps evolving—anti-bot systems get smarter, but so do evasion techniques. Keep your Puppeteer version updated, stay aware of fingerprinting trends, and don't be afraid to combine multiple approaches (HTTP requests for simple pages, Puppeteer for complex ones).
Now go build something useful with it.