Web Scraping

Web Scraping with PHP in 2026: Step-by-step tutorial

PHP remains one of the most practical languages for extracting data from websites. It powers over 75% of websites with known server-side languages, and its mature ecosystem makes web scraping with PHP straightforward once you know the right tools.

This guide walks you through building scrapers in PHP using multiple approaches. You'll learn cURL basics, modern libraries like Guzzle and Symfony Panther, and techniques for handling JavaScript-heavy sites.

What is Web Scraping with PHP?

Web scraping with PHP is the process of programmatically extracting data from websites using PHP scripts. You send HTTP requests to target URLs, receive HTML responses, and parse that HTML to pull out the specific data you need.

PHP handles this well because it was built for web tasks. It has native cURL support, excellent string manipulation functions, and libraries specifically designed for DOM traversal and CSS selector matching.

Why PHP for Web Scraping in 2026

PHP gets overlooked in scraping discussions. Python dominates the conversation with BeautifulSoup and Scrapy. But PHP has real advantages for certain use cases.

If your backend already runs PHP, adding scraping logic keeps your stack simple. No need to manage Python environments alongside your Laravel or WordPress installation.

PHP 8.3 and 8.4 brought performance improvements that matter for scraping. JIT compilation, available since PHP 8.0 and refined in later releases, speeds up CPU-bound parsing work noticeably compared to PHP 7.x.

The library ecosystem matured considerably. Guzzle handles HTTP requests with async support. Symfony DomCrawler offers jQuery-like selectors. Panther controls real browsers when you need JavaScript rendering.

Here's what makes PHP practical for web scraping in 2026:

  • Native cURL ships with almost every PHP installation
  • Composer makes dependency management painless
  • Memory handling improved dramatically in recent versions
  • Excellent documentation exists for all major libraries
  • Deployment to shared hosting costs almost nothing

PHP won't outperform Python for massive distributed scraping operations. But for small to medium projects, scheduled tasks, and WordPress integrations, it works reliably.

Setting Up Your PHP Environment

Before writing any scraping code, confirm your PHP installation meets the requirements.

Open your terminal and check the PHP version:

php --version

You need PHP 8.1 or higher. PHP 8.3+ is recommended for best performance.

Next, verify Composer is installed:

composer --version

If Composer isn't available, install it from getcomposer.org.

Create a new project directory and initialize Composer:

mkdir php-scraper
cd php-scraper
composer init --no-interaction --require="php >=8.1"

This creates your composer.json file. You're ready to add libraries as needed.

Method 1: Native cURL Approach

cURL comes bundled with PHP. It's the foundation of most HTTP operations in the language.
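
Most PHP builds ship with the cURL extension enabled, but it's worth confirming before writing any code. php -m lists the loaded extensions:

php -m | grep curl

If the command prints nothing, enable or install the curl extension for your PHP version before continuing.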

Start with a basic request. This fetches the raw HTML from a target URL:

<?php
// Initialize a cURL session
$ch = curl_init();

// Set the target URL
curl_setopt($ch, CURLOPT_URL, 'https://books.toscrape.com');

// Return the response instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Follow redirects automatically
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// Set a timeout to avoid hanging
curl_setopt($ch, CURLOPT_TIMEOUT, 30);

// Execute and store the response
$html = curl_exec($ch);

// Check for errors
if (curl_errno($ch)) {
    echo 'cURL Error: ' . curl_error($ch);
} else {
    echo "Fetched " . strlen($html) . " bytes\n";
}

// Close the session
curl_close($ch);

This returns raw HTML as a string. To extract specific data, combine cURL with DOMDocument.

Parsing HTML with DOMDocument

DOMDocument is PHP's built-in XML/HTML parser. It converts HTML strings into traversable document objects.

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://books.toscrape.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Suppress warnings from malformed HTML
libxml_use_internal_errors(true);

// Create DOM document and load HTML
$dom = new DOMDocument();
$dom->loadHTML($html);

// Use XPath for element selection
$xpath = new DOMXPath($dom);

// Find all book titles
$titles = $xpath->query('//article[@class="product_pod"]//h3/a/@title');

foreach ($titles as $title) {
    echo $title->nodeValue . "\n";
}

The XPath query locates <article> elements with the product_pod class, then drills down to the <h3> link's title attribute.

DOMDocument copes with most real-world HTML, even markup that wouldn't pass a validator. The libxml_use_internal_errors(true) call keeps those parser warnings from flooding your output.
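
During development it can be useful to see what the parser actually complained about. Since the warnings are collected rather than printed, you can dump them after parsing. A small sketch, reusing the $html variable from the example above:

<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect the collected parser warnings, then clear the buffer
foreach (libxml_get_errors() as $error) {
    echo "Line {$error->line}: " . trim($error->message) . "\n";
}
libxml_clear_errors();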

Adding Request Headers

Websites check headers to identify automated requests. Set a realistic User-Agent and other headers:

<?php
$ch = curl_init();

curl_setopt_array($ch, [
    CURLOPT_URL => 'https://books.toscrape.com',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT => 30,
    CURLOPT_HTTPHEADER => [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
        'Accept-Encoding: gzip, deflate',
        'Connection: keep-alive',
    ],
    CURLOPT_ENCODING => 'gzip',
]);

$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

echo "Status: $httpCode\n";
echo "Size: " . strlen($html) . " bytes\n";

curl_close($ch);

The CURLOPT_ENCODING option tells cURL to decompress gzip responses automatically.

Method 2: Using Guzzle HTTP Client

Guzzle wraps cURL in a modern, object-oriented interface. It handles cookies, sessions, and concurrent requests more cleanly than raw cURL.

Install Guzzle via Composer:

composer require guzzlehttp/guzzle

Basic usage looks like this:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;

$client = new Client([
    'timeout' => 30,
    'verify' => false, // disables SSL checks; see the SSL section under Common Errors below
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
    ],
]);

try {
    $response = $client->get('https://books.toscrape.com');
    $html = $response->getBody()->getContents();
    
    echo "Status: " . $response->getStatusCode() . "\n";
    echo "Fetched " . strlen($html) . " bytes\n";
    
} catch (RequestException $e) {
    echo "Request failed: " . $e->getMessage() . "\n";
}

Guzzle's exception handling makes error management cleaner than checking curl_errno().
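
For instance, you can branch on the failure type and read the status code straight off the exception when the server did respond (with Guzzle's defaults, 4xx and 5xx responses throw). A sketch using Guzzle's standard exception classes; the URL path here is just an example:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ClientException;   // 4xx responses
use GuzzleHttp\Exception\ServerException;   // 5xx responses
use GuzzleHttp\Exception\ConnectException;  // network-level failures

$client = new Client(['timeout' => 30]);

try {
    $response = $client->get('https://books.toscrape.com/does-not-exist');
} catch (ClientException $e) {
    // The server answered with a 4xx status; the response is available
    echo "Client error: " . $e->getResponse()->getStatusCode() . "\n";
} catch (ServerException $e) {
    // 5xx errors are often worth retrying
    echo "Server error: " . $e->getResponse()->getStatusCode() . "\n";
} catch (ConnectException $e) {
    // DNS failures, refused connections, timeouts
    echo "Connection failed: " . $e->getMessage() . "\n";
}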

Concurrent Requests with Guzzle

Scraping multiple pages sequentially wastes time. Guzzle supports concurrent requests through its Pool class:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client = new Client(['timeout' => 30]);

$urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
    'https://books.toscrape.com/catalogue/page-3.html',
    'https://books.toscrape.com/catalogue/page-4.html',
    'https://books.toscrape.com/catalogue/page-5.html',
];

$requests = function () use ($urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$results = [];

$pool = new Pool($client, $requests(), [
    'concurrency' => 5,
    'fulfilled' => function ($response, $index) use (&$results, $urls) {
        $results[$urls[$index]] = $response->getBody()->getContents();
        echo "Completed: " . $urls[$index] . "\n";
    },
    'rejected' => function ($reason, $index) use ($urls) {
        echo "Failed: " . $urls[$index] . " - " . $reason . "\n";
    },
]);

$promise = $pool->promise();
$promise->wait();

echo "Scraped " . count($results) . " pages\n";

The concurrency parameter controls how many requests run simultaneously. Setting it too high can trigger rate limits or IP blocks.
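
If the pooled requests need custom headers, the PSR-7 Request constructor takes a headers array as its third argument, so the generator can attach them per request. A brief sketch reusing the $urls list from above:

<?php
use GuzzleHttp\Psr7\Request;

$requests = function () use ($urls) {
    $headers = [
        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0',
        'Accept-Language' => 'en-US,en;q=0.5',
    ];
    foreach ($urls as $url) {
        // Third constructor argument sets headers on each pooled request
        yield new Request('GET', $url, $headers);
    }
};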

Method 3: Symfony DomCrawler for Parsing

DomCrawler provides jQuery-like syntax for HTML parsing. It's more intuitive than raw XPath for most developers.

Install it with Composer:

composer require symfony/dom-crawler symfony/css-selector

The CSS Selector component enables familiar selectors like .class and #id.

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$response = $client->get('https://books.toscrape.com');
$html = $response->getBody()->getContents();

$crawler = new Crawler($html);

// Extract all book data using CSS selectors
$books = $crawler->filter('article.product_pod')->each(function (Crawler $node) {
    return [
        'title' => $node->filter('h3 a')->attr('title'),
        'price' => $node->filter('.price_color')->text(),
        'stock' => $node->filter('.availability')->text(),
        'rating' => $node->filter('.star-rating')->attr('class'),
    ];
});

// Display results
foreach ($books as $book) {
    echo $book['title'] . ' - ' . $book['price'] . "\n";
}

The filter() method accepts CSS selectors. The each() method iterates over matched elements, returning an array of extracted data.

DomCrawler can extract links for pagination:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$baseUrl = 'https://books.toscrape.com/catalogue/';
$currentUrl = $baseUrl . 'page-1.html';
$allBooks = [];

while ($currentUrl) {
    echo "Scraping: $currentUrl\n";
    
    $response = $client->get($currentUrl);
    $crawler = new Crawler($response->getBody()->getContents());
    
    // Extract books from current page
    $crawler->filter('article.product_pod')->each(function ($node) use (&$allBooks) {
        $allBooks[] = [
            'title' => $node->filter('h3 a')->attr('title'),
            'price' => $node->filter('.price_color')->text(),
        ];
    });
    
    // Find next page link
    try {
        $nextLink = $crawler->filter('.next a')->attr('href');
        $currentUrl = $baseUrl . $nextLink;
    } catch (\Exception $e) {
        $currentUrl = null; // No more pages
    }
    
    // Be respectful with delays
    usleep(500000); // 0.5 second delay
}

echo "Total books scraped: " . count($allBooks) . "\n";

The usleep() call adds a half-second delay between requests. This prevents hammering the server and reduces your chance of getting blocked.

Method 4: Headless Browsers with Panther

Static HTML scrapers fail on JavaScript-heavy websites. Many modern sites render content client-side, so the initial HTML response contains little or none of the data you need.

Symfony Panther controls real browsers (Chrome or Firefox) programmatically. It waits for JavaScript to execute and renders the complete page.

Install Panther:

composer require symfony/panther

You also need ChromeDriver or GeckoDriver. On macOS:

brew install --cask chromedriver

On Windows with Chocolatey:

choco install chromedriver

Basic Panther usage:

<?php
require 'vendor/autoload.php';

use Symfony\Component\Panther\Client;

// Create a Chrome client
$client = Client::createChromeClient();

// Navigate to a JavaScript-heavy site
$crawler = $client->request('GET', 'https://quotes.toscrape.com/js/');

// Wait for content to load
$client->waitFor('.quote');

// Extract quotes after JavaScript execution
$quotes = $crawler->filter('.quote')->each(function ($node) {
    return [
        'text' => $node->filter('.text')->text(),
        'author' => $node->filter('.author')->text(),
    ];
});

foreach ($quotes as $quote) {
    echo $quote['author'] . ': ' . $quote['text'] . "\n\n";
}

// Always close the browser
$client->quit();

The waitFor() method pauses execution until the specified CSS selector appears in the DOM. This handles async content loading.
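
Panther's waitFor() also accepts an optional timeout in seconds and a polling interval in milliseconds, which helps on slow pages; it's worth confirming the exact signature against the Panther version you have installed. A minimal sketch:

<?php
// Wait up to 10 seconds for the quotes to appear, checking every 500 ms
$crawler = $client->waitFor('.quote', 10, 500);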

Interacting with Page Elements

Panther can click buttons, fill forms, and trigger JavaScript events:

<?php
require 'vendor/autoload.php';

use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://quotes.toscrape.com/js/');

$allQuotes = [];

while (true) {
    $client->waitFor('.quote');
    
    // Extract quotes from current page state
    $crawler->filter('.quote')->each(function ($node) use (&$allQuotes) {
        $allQuotes[] = [
            'text' => $node->filter('.text')->text(),
            'author' => $node->filter('.author')->text(),
        ];
    });
    
    // Try clicking the Next button
    try {
        $client->clickLink('Next');
        usleep(1000000); // Wait 1 second for page load
        $crawler = $client->refreshCrawler();
    } catch (\Exception $e) {
        break; // No more pages
    }
}

echo "Scraped " . count($allQuotes) . " quotes\n";

$client->quit();

Panther is resource-intensive. Each instance runs a full browser process. Use it only when static methods won't work.

Handling Anti-Bot Detection

Websites deploy various techniques to block scrapers. Here's how to work around common obstacles.

Rotating User Agents

Cycling through different User-Agent strings makes requests appear more natural:

<?php
$userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/17.2',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
];

function getRandomUserAgent() {
    global $userAgents;
    return $userAgents[array_rand($userAgents)];
}
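
Plug the helper into whichever client you use. With Guzzle, pass it as a per-request header so each call gets a fresh string. A quick sketch reusing the function above:

<?php
$client = new GuzzleHttp\Client(['timeout' => 30]);

$response = $client->get('https://books.toscrape.com', [
    'headers' => [
        // Pick a new random User-Agent on every request
        'User-Agent' => getRandomUserAgent(),
    ],
]);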

Using Proxies

Proxies route requests through different IP addresses, which reduces the risk of IP-based blocking when scraping at scale.

<?php
$ch = curl_init();

curl_setopt_array($ch, [
    CURLOPT_URL => 'https://httpbin.org/ip',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_PROXY => 'http://proxy.example.com:8080',
    CURLOPT_PROXYUSERPWD => 'username:password',
]);

$response = curl_exec($ch);
echo $response; // Shows the proxy IP

curl_close($ch);

For rotating proxies, services like Roundproxies.com offer residential and datacenter proxy pools that automatically cycle IPs between requests. This significantly reduces detection rates compared to single-IP scraping.

With Guzzle, proxy configuration looks like this:

<?php
$client = new GuzzleHttp\Client([
    'proxy' => 'http://username:password@proxy.example.com:8080',
    'timeout' => 30,
]);

Implementing Request Delays

Hammering a server with rapid requests triggers rate limiting. Space out your requests:

<?php
class RateLimiter
{
    private array $lastRequest = [];
    
    public function wait(string $domain, float $minDelay = 1.0): void
    {
        if (isset($this->lastRequest[$domain])) {
            $elapsed = microtime(true) - $this->lastRequest[$domain];
            if ($elapsed < $minDelay) {
                $sleepTime = ($minDelay - $elapsed) * 1000000;
                usleep((int)$sleepTime);
            }
        }
        $this->lastRequest[$domain] = microtime(true);
    }
}

// Usage
$limiter = new RateLimiter();

foreach ($urls as $url) {
    $domain = parse_url($url, PHP_URL_HOST);
    $limiter->wait($domain, 1.5); // 1.5 seconds minimum between requests
    
    // Make request...
}

Handling Cookies and Sessions

Some sites require maintaining session state:

<?php
$client = new GuzzleHttp\Client([
    'cookies' => true, // Enable cookie jar
]);

// First request establishes session
$client->get('https://example.com/');

// Subsequent requests maintain cookies
$response = $client->get('https://example.com/protected-page');

Guzzle's cookie jar automatically stores and sends cookies between requests.
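
If the session needs to survive between script runs, say a scheduled scraper that logs in once, Guzzle can persist cookies to disk with a FileCookieJar. A minimal sketch; the cookies.json path is just an example:

<?php
use GuzzleHttp\Client;
use GuzzleHttp\Cookie\FileCookieJar;

// Second argument also stores session cookies, not only persistent ones
$jar = new FileCookieJar('cookies.json', true);

$client = new Client(['cookies' => $jar]);

$client->get('https://example.com/');
// Cookies are written back to cookies.json when the jar is destructed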

Saving Scraped Data

Extracted data needs storage. Common formats include CSV, JSON, and databases.

Writing to CSV

PHP's fputcsv() handles CSV creation:

<?php
$books = [
    ['title' => 'Book One', 'price' => '£51.77'],
    ['title' => 'Book Two', 'price' => '£53.74'],
];

$file = fopen('books.csv', 'w');

// Write header row
fputcsv($file, ['Title', 'Price']);

// Write data rows
foreach ($books as $book) {
    fputcsv($file, [$book['title'], $book['price']]);
}

fclose($file);

echo "Saved " . count($books) . " books to CSV\n";

Writing to JSON

JSON output preserves nested structures better:

<?php
$books = [
    ['title' => 'Book One', 'price' => '£51.77', 'details' => ['pages' => 320]],
    ['title' => 'Book Two', 'price' => '£53.74', 'details' => ['pages' => 256]],
];

$json = json_encode($books, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);

file_put_contents('books.json', $json);

echo "Saved JSON data\n";

Database Storage with PDO

For larger scraping operations, database storage makes querying easier:

<?php
$pdo = new PDO('sqlite:scraper.db');

$pdo->exec('CREATE TABLE IF NOT EXISTS books (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    price TEXT,
    scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
)');

$stmt = $pdo->prepare('INSERT INTO books (title, price) VALUES (?, ?)');

// $books comes from one of the extraction examples above
foreach ($books as $book) {
    $stmt->execute([$book['title'], $book['price']]);
}

echo "Inserted " . count($books) . " records\n";

SQLite works well for local scraping scripts. For production, use MySQL or PostgreSQL.
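
Switching the backend is mostly a matter of changing the DSN; the prepared statements stay the same. A sketch with placeholder MySQL credentials:

<?php
// Host, database name, and credentials below are placeholders
$pdo = new PDO(
    'mysql:host=127.0.0.1;dbname=scraper;charset=utf8mb4',
    'db_user',
    'db_password',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
);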

Common Errors and Fixes

SSL Certificate Errors

Some servers have misconfigured SSL. Bypass verification (only for trusted targets):

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);

With Guzzle:

$client = new Client(['verify' => false]);
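
Disabling verification should be a last resort. If the real problem is an outdated CA bundle on your machine, point Guzzle at a current cacert.pem (downloadable from curl.se/docs/caextract.html) and keep TLS checks intact; the path below is an example:

$client = new Client(['verify' => '/path/to/cacert.pem']);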

Memory Issues with Large Scrapes

Processing thousands of pages can exhaust memory. Force garbage collection periodically:

<?php
$counter = 0;

foreach ($urls as $url) {
    // Scrape and process...
    
    $counter++;
    if ($counter % 100 === 0) {
        gc_collect_cycles();
        echo "Memory: " . (memory_get_usage(true) / 1024 / 1024) . " MB\n";
    }
}

Also unset large variables when done with them:

$html = $client->get($url)->getBody()->getContents();
$data = parseHtml($html);
unset($html); // Free memory immediately

Timeout Errors

Slow servers need longer timeouts:

curl_setopt($ch, CURLOPT_TIMEOUT, 60); // 60 second timeout
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // 10 second connection timeout

Character Encoding Issues

Force UTF-8 encoding when parsing. The old mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8') trick is deprecated as of PHP 8.2, so hint the charset to libxml instead:

// Prepend an encoding declaration so DOMDocument reads the input as UTF-8
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

Complete Web Scraper Class

Here's a production-ready scraper class that combines everything covered above:

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;
use Symfony\Component\DomCrawler\Crawler;

class WebScraper
{
    private Client $client;
    private array $userAgents;
    private array $lastRequest = [];
    private float $minDelay;
    
    public function __construct(float $minDelay = 1.0, ?string $proxy = null)
    {
        $this->minDelay = $minDelay;
        
        $this->userAgents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
        ];
        
        $config = [
            'timeout' => 30,
            'verify' => false,
            'cookies' => true,
        ];
        
        if ($proxy) {
            $config['proxy'] = $proxy;
        }
        
        $this->client = new Client($config);
    }
    
    private function getRandomUserAgent(): string
    {
        return $this->userAgents[array_rand($this->userAgents)];
    }
    
    private function respectRateLimit(string $domain): void
    {
        if (isset($this->lastRequest[$domain])) {
            $elapsed = microtime(true) - $this->lastRequest[$domain];
            if ($elapsed < $this->minDelay) {
                usleep((int)(($this->minDelay - $elapsed) * 1000000));
            }
        }
        $this->lastRequest[$domain] = microtime(true);
    }
    
    public function fetch(string $url, int $retries = 3): ?string
    {
        $domain = parse_url($url, PHP_URL_HOST);
        $this->respectRateLimit($domain);
        
        $attempt = 0;
        while ($attempt < $retries) {
            try {
                $response = $this->client->get($url, [
                    'headers' => [
                        'User-Agent' => $this->getRandomUserAgent(),
                        'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                        'Accept-Language' => 'en-US,en;q=0.5',
                    ],
                ]);
                
                return $response->getBody()->getContents();
                
            } catch (RequestException $e) {
                $attempt++;
                echo "Attempt $attempt failed: " . $e->getMessage() . "\n";
                
                if ($attempt < $retries) {
                    sleep(pow(2, $attempt)); // Exponential backoff
                }
            }
        }
        
        return null;
    }
    
    public function parse(string $html, string $selector): array
    {
        $crawler = new Crawler($html);
        
        return $crawler->filter($selector)->each(function (Crawler $node) {
            return $node->text();
        });
    }
    
    public function scrapeWithCallback(
        string $url, 
        string $itemSelector, 
        callable $extractor
    ): array {
        $html = $this->fetch($url);
        
        if (!$html) {
            return [];
        }
        
        $crawler = new Crawler($html);
        
        return $crawler->filter($itemSelector)->each($extractor);
    }
}

// Usage example
$scraper = new WebScraper(minDelay: 1.5);

$books = $scraper->scrapeWithCallback(
    'https://books.toscrape.com',
    'article.product_pod',
    function (Crawler $node) {
        return [
            'title' => $node->filter('h3 a')->attr('title'),
            'price' => $node->filter('.price_color')->text(),
        ];
    }
);

print_r($books);

This class handles rate limiting, user agent rotation, retries with exponential backoff, and clean callback-based extraction. Extend it for your specific needs.

Scraping APIs vs HTML Parsing

Not all data requires HTML parsing. Many websites expose APIs that return clean JSON. Check your browser's Network tab while browsing the target site.

If you find API endpoints, they're usually easier to work with:

<?php
$client = new GuzzleHttp\Client();

$response = $client->get('https://api.example.com/products', [
    'query' => [
        'page' => 1,
        'limit' => 50,
    ],
    'headers' => [
        'Accept' => 'application/json',
    ],
]);

$data = json_decode($response->getBody()->getContents(), true);

foreach ($data['products'] as $product) {
    echo $product['name'] . ' - $' . $product['price'] . "\n";
}

JSON responses skip the parsing complexity entirely. The data arrives structured and ready to use.

Look for GraphQL endpoints too. They often expose more data than visible on the page:

<?php
$client = new GuzzleHttp\Client();

$query = <<<GRAPHQL
{
    products(first: 50) {
        edges {
            node {
                title
                price
                description
            }
        }
    }
}
GRAPHQL;

$response = $client->post('https://example.com/graphql', [
    'json' => ['query' => $query],
]);

$result = json_decode($response->getBody()->getContents(), true);

GraphQL lets you request exactly the fields you need, reducing response size and processing time.

Performance Optimization Tips

Large scraping jobs need optimization to finish in reasonable time.

Connection Reuse

Create one cURL handle and reuse it for multiple requests:

<?php
$ch = curl_init();

curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT => 30,
]);

$urls = ['https://example.com/page1', 'https://example.com/page2'];
$results = [];

foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $results[$url] = curl_exec($ch);
    usleep(500000);
}

curl_close($ch);

Reusing handles avoids TCP handshake overhead for each request.

Async Processing with ReactPHP

For high-throughput scraping, ReactPHP provides true async capabilities:

composer require react/http react/event-loop

<?php
require 'vendor/autoload.php';

use React\EventLoop\Loop;
use React\Http\Browser;

$browser = new Browser();
$urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
    'https://books.toscrape.com/catalogue/page-3.html',
];

$promises = [];

foreach ($urls as $url) {
    $promises[$url] = $browser->get($url)->then(
        function ($response) use ($url) {
            echo "Completed: $url\n";
            return (string) $response->getBody();
        },
        function ($error) use ($url) {
            echo "Failed: $url - " . $error->getMessage() . "\n";
            return null;
        }
    );
}

React\Promise\all($promises)->then(function ($results) {
    echo "All done. Got " . count(array_filter($results)) . " pages\n";
});

Loop::run();

ReactPHP runs multiple requests concurrently within a single PHP process by multiplexing non-blocking I/O on an event loop.

Caching Responses

Avoid re-fetching unchanged pages:

<?php
class CachedScraper
{
    private string $cacheDir;
    private int $cacheTime;
    
    public function __construct(string $cacheDir = './cache', int $cacheTime = 3600)
    {
        $this->cacheDir = $cacheDir;
        $this->cacheTime = $cacheTime;
        
        if (!is_dir($cacheDir)) {
            mkdir($cacheDir, 0755, true);
        }
    }
    
    public function fetch(string $url): string
    {
        $cacheFile = $this->cacheDir . '/' . md5($url) . '.html';
        
        if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $this->cacheTime) {
            return file_get_contents($cacheFile);
        }
        
        $html = file_get_contents($url);
        file_put_contents($cacheFile, $html);
        
        return $html;
    }
}

Caching is essential during development when you're testing selectors repeatedly.

Scheduled Scraping with Cron

Production scrapers usually run on schedules. Set up a cron job to run your PHP script:

# Edit crontab
crontab -e

# Run scraper every 6 hours
0 */6 * * * /usr/bin/php /path/to/scraper.php >> /var/log/scraper.log 2>&1

For more complex scheduling, Laravel's task scheduler or Symfony Console Commands provide better management.

Basic standalone scheduler pattern:

<?php
// scraper.php
$startTime = date('Y-m-d H:i:s');
echo "[$startTime] Starting scrape job\n";

try {
    // Your scraping logic here
    $count = runScraper();
    
    $endTime = date('Y-m-d H:i:s');
    echo "[$endTime] Completed. Scraped $count items.\n";
    
} catch (Exception $e) {
    $endTime = date('Y-m-d H:i:s');
    echo "[$endTime] Error: " . $e->getMessage() . "\n";
    exit(1);
}

Log output and errors for debugging. Consider using Monolog for structured logging in larger projects.
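
A minimal Monolog setup writes timestamped entries to a file, which keeps cron output readable. A sketch, assuming monolog/monolog has been added via Composer:

<?php
require 'vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// Write timestamped log entries to scraper.log next to the script
$log = new Logger('scraper');
$log->pushHandler(new StreamHandler(__DIR__ . '/scraper.log'));

$log->info('Starting scrape job');

try {
    $count = runScraper(); // your scraping logic, as in the cron example above
    $log->info('Scrape completed', ['items' => $count]);
} catch (Exception $e) {
    $log->error('Scrape failed', ['error' => $e->getMessage()]);
    exit(1);
}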

FAQ

Can PHP handle JavaScript-rendered websites?

Yes, through headless browser libraries like Symfony Panther. Panther controls real Chrome or Firefox instances, waits for JavaScript to execute, then exposes the rendered DOM for parsing. It's resource-intensive compared to static methods but necessary for modern SPAs.

How do I avoid getting blocked while scraping?

Use realistic headers including proper User-Agent strings. Add delays between requests (1-2 seconds minimum). Consider rotating proxies through services like Roundproxies.com to distribute requests across IPs. Don't run hundreds of parallel requests against a single domain.

Is web scraping legal?

The legality depends on what you're scraping and how you use the data. Publicly available data is generally fair game. However, violating Terms of Service, accessing private data, or overloading servers can create legal issues. Always check the target site's robots.txt and terms of use.

Which PHP version is best for scraping?

PHP 8.3 or 8.4 offers the best performance thanks to JIT compilation. The minimum recommended version is 8.1 since older versions lack security updates and modern language features.

How does PHP compare to Python for web scraping?

Python has a larger scraping ecosystem with tools like Scrapy and BeautifulSoup. It handles concurrency better at scale. PHP is simpler when you already run PHP infrastructure and need moderate scraping capabilities. For enterprise-level distributed scraping, Python typically wins. For WordPress plugins or Laravel integrations, PHP makes more sense.

Final Thoughts

Web scraping with PHP works well when your project already uses PHP or when you need simple, reliable data extraction. The ecosystem matured significantly, and libraries like Guzzle, DomCrawler, and Panther handle most common scenarios.

For static websites, the Guzzle + DomCrawler combination offers the best balance of simplicity and power. Add Panther when JavaScript rendering becomes necessary.

Keep these principles in mind:

  • Respect robots.txt and rate limits
  • Use delays between requests to avoid overwhelming servers
  • Rotate headers and consider proxies for larger operations
  • Store data incrementally to avoid losing progress on failures
  • Handle errors gracefully with retry logic

PHP may not be the trendiest choice for web scraping in 2026, but it remains practical. If you're comfortable with the language and your requirements are moderate, there's no reason to add Python complexity to your stack.

Start with cURL and DOMDocument for simple tasks. Graduate to Guzzle and Symfony components as needs grow. Save headless browsers for sites that truly require them.