PHP gets a bad rap in the scraping world. Everyone talks about Python's BeautifulSoup or JavaScript's Puppeteer, but PHP can hold its own—especially if you're already running a PHP stack and don't want to manage another runtime.
I've scraped everything from e-commerce catalogs to job boards with PHP, and there are some tricks that make it fast and resilient.
The Basics: cURL and DOMDocument
Let's start with the foundation. Most PHP scrapers use two built-in components: cURL for HTTP requests and DOMDocument for parsing HTML.
Here's the simplest possible scraper:
<?php
// Fetch the page
$ch = curl_init('https://example.com/products');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);
// Parse it
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
// Extract product titles
$titles = $xpath->query('//h2[@class="product-title"]');
foreach ($titles as $title) {
echo $title->textContent . "\n";
}
This works, but it's slow once you need more than a handful of pages. The @ before loadHTML() matters: most real-world HTML isn't well-formed (unclosed tags, HTML5 elements libxml doesn't recognize), so DOMDocument emits a warning for every hiccup. You can also call libxml_use_internal_errors(true) if you'd rather collect those warnings and handle them yourself.
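Here's what that looks like if you want to log the parse warnings instead of discarding them entirely (a minimal sketch):
<?php
libxml_use_internal_errors(true); // Collect warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html); // No @ needed now
foreach (libxml_get_errors() as $error) {
    error_log('HTML parse warning: ' . trim($error->message));
}
libxml_clear_errors(); // Free the stored error list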
Why this matters: cURL gives you control over headers, timeouts, redirects, and SSL verification that file_get_contents() only offers through clunky stream contexts, and DOMDocument is far more reliable than regex for pulling data out of structured HTML.
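For example, a small fetch helper with explicit timeouts, a custom header, and SSL verification left on might look like this (a sketch of the fetchPage() helper referenced later in this article; tune the values to your target):
<?php
function fetchPage(string $url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 5,    // Give up if the connection takes longer than 5s
        CURLOPT_TIMEOUT        => 15,   // Abort the whole request after 15s
        CURLOPT_SSL_VERIFYPEER => true, // Keep certificate checks on
        CURLOPT_HTTPHEADER     => ['Accept-Language: en-US,en;q=0.5'],
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // String on success, false on failure
}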
Performance Hack: Parallel Scraping with curl_multi
Here's something most tutorials skip: scraping pages sequentially is painfully slow. If you need to scrape 100 product pages and each takes 2 seconds, you're looking at 200 seconds of execution time.
The solution? Parallel requests with curl_multi. This lets you make multiple requests simultaneously, dramatically reducing total scrape time.
<?php
function parallelScrape($urls) {
$mh = curl_multi_init();
$handles = [];
// Add all URLs to the multi handle
foreach ($urls as $i => $url) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_multi_add_handle($mh, $ch);
$handles[$i] = $ch;
}
// Execute all requests concurrently
$running = null;
do {
$status = curl_multi_exec($mh, $running);
if ($running) {
// Wait for activity on any handle instead of busy-looping
curl_multi_select($mh);
}
} while ($running > 0 && $status === CURLM_OK);
// Collect responses
$results = [];
foreach ($handles as $i => $ch) {
$results[$i] = curl_multi_getcontent($ch);
curl_multi_remove_handle($mh, $ch);
curl_close($ch);
}
curl_multi_close($mh);
return $results;
}
// Scrape 50 pages in parallel
$urls = array_map(fn($n) => "https://example.com/page/$n", range(1, 50));
$pages = parallelScrape($urls);
The catch: Don't go overboard. Hammering a site with 100 simultaneous connections will get you blocked fast. Keep it to 5-10 concurrent requests max, and add delays between batches.
Here's a refined version with rate limiting:
<?php
function scrapeInBatches($urls, $batchSize = 5, $delaySeconds = 2) {
$allResults = [];
$batches = array_chunk($urls, $batchSize);
foreach ($batches as $batch) {
$results = parallelScrape($batch);
$allResults = array_merge($allResults, $results);
sleep($delaySeconds); // Polite delay between batches
}
return $allResults;
}
This approach can cut scraping time from hours to minutes. I've used this to scrape 10,000+ product listings in under 15 minutes.
Smart Parsing with XPath (and Why It Beats CSS Selectors)
CSS selectors are intuitive, but XPath is faster and more powerful for scraping. DOMDocument doesn't natively support CSS selectors, and converting them to XPath adds overhead. Learn XPath—it pays off.
Here are patterns I use constantly:
<?php
$xpath = new DOMXPath($dom);
// Get all product prices
$prices = $xpath->query('//span[contains(@class, "price")]');
// Get links inside a specific div
$links = $xpath->query('//div[@id="products"]//a[@href]');
// Get text from elements with specific attributes
$ratings = $xpath->query('//div[@data-rating]/@data-rating');
// Get the next sibling of an element
$descriptions = $xpath->query('//h2[@class="title"]/following-sibling::p[1]');
// Get parent elements
$containers = $xpath->query('//span[@class="price"]/ancestor::div[@class="product"][1]');
Performance tip: If you're only extracting a few elements from a massive page, use specific XPath queries instead of grabbing everything. This query:
$prices = $xpath->query('//div[@id="products"]//span[@class="price"]');
Is faster than:
$everything = $xpath->query('//*'); // Don't do this
XPath also handles edge cases better. For instance, extracting text between tags without including child element text:
// Gets only the first direct text node, not text inside child elements
$directText = $xpath->query('//div[@class="desc"]/text()[1]');
Handling Dynamic Content Without Headless Browsers
JavaScript-rendered content is the bane of scrapers. Most people reach for Selenium or Puppeteer, but that's often overkill. Here are lighter alternatives:
1. Inspect Network Requests
Many "dynamic" sites actually load data via AJAX. Open DevTools (Network tab), reload the page, and look for JSON or HTML responses. You can often hit those endpoints directly:
<?php
// Instead of scraping the rendered page...
$ch = curl_init('https://example.com/api/products?page=1');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
$data = json_decode($json, true);
foreach ($data['products'] as $product) {
echo $product['title'] . ': $' . $product['price'] . "\n";
}
This is way faster than browser automation and less likely to get blocked.
2. Reverse Engineer the API
Some sites obfuscate their API endpoints. Look for patterns in the URLs or POST data. Tools like Postman or just curl from the command line help test these:
curl 'https://example.com/graphql' \
-H 'Content-Type: application/json' \
--data '{"query":"{ products { id title price } }"}'
Then replicate it in PHP:
<?php
$ch = curl_init('https://example.com/graphql');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode([
'query' => '{ products { id title price } }'
]));
$response = curl_exec($ch);
$data = json_decode($response, true);
3. Use php-webdriver Only When Necessary
If you absolutely need browser rendering, the Selenium PHP WebDriver works, but it's resource-intensive. Use it selectively:
<?php
require 'vendor/autoload.php';
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
$host = 'http://localhost:4444'; // Selenium server
$driver = RemoteWebDriver::create($host, DesiredCapabilities::chrome());
$driver->get('https://example.com/dynamic-page');
sleep(3); // Crude wait for the JS to render; an explicit WebDriverExpectedCondition wait is more reliable
$html = $driver->getPageSource();
$driver->quit();
// Now parse $html with DOMDocument as usual
Pro tip: Only use Selenium for pages that absolutely require it, then switch back to cURL for follow-up requests. Don't waste browser resources on static pages.
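One way to make that hand-off concrete: grab the cookies from the WebDriver session (before calling $driver->quit() in the example above) and replay them through cURL. A sketch, assuming php-webdriver's cookie API:
<?php
// Build a Cookie header from the browser session
$cookieHeader = implode('; ', array_map(
    fn($cookie) => $cookie->getName() . '=' . $cookie->getValue(),
    $driver->manage()->getCookies()
));

// Reuse it for lightweight cURL follow-up requests
$ch = curl_init('https://example.com/another-page');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Cookie: ' . $cookieHeader]);
$html = curl_exec($ch);
curl_close($ch);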
Memory Management for Large-Scale Scrapes
PHP's default memory limit (128MB) chokes on large datasets. Here's how to handle millions of records without crashing:
1. Stream Processing Instead of Loading Everything
Bad approach (accumulates every extracted record in memory):
<?php
$dom = new DOMDocument();
@$dom->loadHTML($hugeHtml); // The entire multi-MB document becomes a DOM tree in memory
$xpath = new DOMXPath($dom);
$products = $xpath->query('//div[@class="product"]'); // A node list referencing that tree
$allData = [];
foreach ($products as $product) {
$allData[] = [/* extracted fields */]; // This array grows with every product
}
Better approach (stream and discard):
<?php
ini_set('memory_limit', '256M'); // Set explicit limit
$dom = new DOMDocument();
@$dom->loadHTML($html); // Suppress warnings from malformed HTML, as before
$xpath = new DOMXPath($dom);
$products = $xpath->query('//div[@class="product"]');
foreach ($products as $product) {
// Extract data
$data = [
'title' => $xpath->query('.//h2', $product)->item(0)->textContent,
'price' => $xpath->query('.//span[@class="price"]', $product)->item(0)->textContent
];
// Save immediately instead of accumulating
file_put_contents('products.json', json_encode($data) . "\n", FILE_APPEND);
// Explicitly free memory
unset($data);
}
// Clear DOM from memory
$dom = null;
$xpath = null;
2. Use Generators for Large Result Sets
If you're processing scraped data in batches, generators prevent memory bloat:
<?php
function scrapeProducts($urls) {
foreach ($urls as $url) {
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$products = $xpath->query('//div[@class="product"]');
foreach ($products as $product) {
yield extractProductData($product, $xpath);
}
// Free memory before next iteration
$dom = null;
$xpath = null;
unset($html);
}
}
// Use it
foreach (scrapeProducts($urlList) as $product) {
// Process one product at a time
saveToDatabase($product);
}
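The generator above leans on an extractProductData() helper that isn't shown; a minimal version, assuming the same title/price markup as the earlier examples, could be:
<?php
function extractProductData(DOMNode $product, DOMXPath $xpath): array {
    $title = $xpath->query('.//h2', $product)->item(0);
    $price = $xpath->query('.//span[@class="price"]', $product)->item(0);
    return [
        'title' => $title ? trim($title->textContent) : null,
        'price' => $price ? trim($price->textContent) : null,
    ];
}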
3. Garbage Collection for Long-Running Scripts
Force PHP's garbage collector to run periodically:
<?php
$counter = 0;
foreach ($largeDataset as $item) {
processItem($item);
if (++$counter % 100 === 0) {
gc_collect_cycles(); // Force garbage collection
echo "Memory usage: " . memory_get_usage(true) / 1024 / 1024 . " MB\n";
}
}
This kept one of my scrapers running for 6+ hours without hitting memory limits.
Anti-Bot Techniques That Actually Work
Modern websites use sophisticated bot detection. Here's what works in 2025:
1. Rotate User Agents
PHP's cURL sends no User-Agent header at all unless you set one, and the curl CLI announces itself as curl/7.x or 8.x; either is an instant red flag. Rotate through real browser user agents:
<?php
$userAgents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15'
];
curl_setopt($ch, CURLOPT_USERAGENT, $userAgents[array_rand($userAgents)]);
2. Send Complete Headers
Sites check for missing or suspicious headers. Mimic a real browser:
<?php
$headers = [
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Accept-Encoding: gzip, deflate, br',
'DNT: 1',
'Connection: keep-alive',
'Upgrade-Insecure-Requests: 1',
'Sec-Fetch-Dest: document',
'Sec-Fetch-Mode: navigate',
'Sec-Fetch-Site: none',
'Cache-Control: max-age=0'
];
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_ENCODING, ''); // Handle gzip automatically
3. Respect Timing Patterns
Bots scrape too fast. Add randomized delays:
<?php
function randomDelay($minSeconds = 1, $maxSeconds = 3) {
usleep(rand($minSeconds * 1000000, $maxSeconds * 1000000));
}
foreach ($urls as $url) {
$html = fetchPage($url);
processPage($html);
randomDelay(2, 5); // 2-5 second delay between requests
}
4. Handle Cloudflare's Challenge
Cloudflare's "Checking your browser" page sets cookies after a JavaScript challenge. The challenge changes frequently, making automated solutions fragile.
Your options:
Option 1: Use a stream context with a cf_clearance cookie copied from a real browser session (the cookie is tied to the IP address and User-Agent it was issued for, so keep both consistent):
<?php
$context = stream_context_create([
'http' => [
'header' => "Cookie: cf_clearance=xxx\r\n" . // Get this from browser
"User-Agent: Mozilla/5.0 ...\r\n"
]
]);
$html = file_get_contents('https://protected-site.com', false, $context);
Option 2: Use FlareSolverr (external service) for challenge solving:
<?php
// FlareSolverr running on localhost:8191
$solverUrl = 'http://localhost:8191/v1';
$payload = [
'cmd' => 'request.get',
'url' => 'https://protected-site.com',
'maxTimeout' => 60000
];
$ch = curl_init($solverUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($payload));
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
$response = json_decode(curl_exec($ch), true);
$html = $response['solution']['response'];
$cookies = $response['solution']['cookies']; // Use these for subsequent requests
5. TLS Fingerprinting Defense
Advanced systems check your TLS handshake. cURL's defaults can be fingerprinted. This is harder to fix in PHP alone, but you can:
<?php
// Use HTTP/2
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_2_0);
// TLS 1.3 cipher suites are set via CURLOPT_TLS13_CIPHERS (needs cURL 7.61+ built against OpenSSL 1.1.1+);
// CURLOPT_SSL_CIPHER_LIST only covers TLS 1.2 and below
curl_setopt($ch, CURLOPT_TLS13_CIPHERS,
'TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256');
If sites are still blocking you based on TLS fingerprints, you'll need to use a service or proxy that handles this.
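If you do reach for a proxy, pointing cURL at it is the easy part (the host, port, and credentials below are placeholders for whatever your provider gives you):
<?php
curl_setopt($ch, CURLOPT_PROXY, 'http://proxy.example.com:8080'); // Placeholder endpoint
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'username:password');      // Placeholder credentials
// Rotating-IP providers typically hand you a single gateway and rotate behind it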
Session Persistence and Cookie Management
Many sites require login or track sessions. Handle cookies properly:
<?php
$cookieFile = tempnam(sys_get_temp_dir(), 'scraper_cookies_');
// Login request
$ch = curl_init('https://example.com/login');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
'username' => 'user',
'password' => 'pass'
]));
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile); // Save cookies
curl_exec($ch);
curl_close($ch);
// Subsequent authenticated requests
$ch = curl_init('https://example.com/protected-page');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // Load cookies
$html = curl_exec($ch);
curl_close($ch);
// Cleanup
unlink($cookieFile);
For multiple scraping sessions, store cookies in a database:
<?php
function saveCookies($sessionId, $cookies) {
$db = new PDO('sqlite:cookies.db');
$stmt = $db->prepare('INSERT OR REPLACE INTO cookies (session_id, data, expires) VALUES (?, ?, ?)');
$stmt->execute([$sessionId, serialize($cookies), time() + 3600]);
}
function loadCookies($sessionId) {
$db = new PDO('sqlite:cookies.db');
$stmt = $db->prepare('SELECT data FROM cookies WHERE session_id = ? AND expires > ?');
$stmt->execute([$sessionId, time()]);
$result = $stmt->fetch();
return $result ? unserialize($result['data']) : [];
}
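Both helpers assume the cookies table already exists; a schema along these lines works with the INSERT OR REPLACE above:
<?php
$db = new PDO('sqlite:cookies.db');
$db->exec('CREATE TABLE IF NOT EXISTS cookies (
    session_id TEXT PRIMARY KEY,
    data TEXT NOT NULL,
    expires INTEGER NOT NULL
)');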
Error Recovery Patterns
Scraping fails. A lot. Build in resilience:
1. Retry with Exponential Backoff
<?php
function fetchWithRetry($url, $maxAttempts = 3) {
$attempt = 0;
$delay = 1;
while ($attempt < $maxAttempts) {
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$error = curl_error($ch);
curl_close($ch);
if ($httpCode === 200 && $html) {
return $html;
}
$attempt++;
if ($attempt < $maxAttempts) {
echo "Attempt $attempt failed (HTTP $httpCode). Retrying in {$delay}s...\n";
sleep($delay);
$delay *= 2; // Exponential backoff
}
}
throw new Exception("Failed to fetch $url after $maxAttempts attempts");
}
2. Log Failures for Resume
<?php
$failedUrls = [];
foreach ($urls as $url) {
try {
$html = fetchWithRetry($url);
processPage($html);
} catch (Exception $e) {
$failedUrls[] = $url;
error_log("Failed to scrape $url: " . $e->getMessage());
}
}
// Save failed URLs to retry later
if (!empty($failedUrls)) {
file_put_contents('failed_urls.txt', implode("\n", $failedUrls));
}
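On a later run, feed that file back into the same retry loop (a sketch reusing the fetchWithRetry() function above):
<?php
if (file_exists('failed_urls.txt')) {
    $retryUrls = array_filter(array_map('trim', file('failed_urls.txt')));
    foreach ($retryUrls as $url) {
        try {
            processPage(fetchWithRetry($url));
        } catch (Exception $e) {
            error_log("Still failing after retry: $url");
        }
    }
}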
3. Checkpoint Progress
For long-running scrapes, save progress periodically:
<?php
function saveCheckpoint($data) {
file_put_contents('checkpoint.json', json_encode($data));
}
function loadCheckpoint() {
if (file_exists('checkpoint.json')) {
return json_decode(file_get_contents('checkpoint.json'), true);
}
return ['processed' => 0, 'urls' => []];
}
$checkpoint = loadCheckpoint();
$startFrom = $checkpoint['processed'];
for ($i = $startFrom; $i < count($urls); $i++) {
$html = fetchPage($urls[$i]);
processPage($html);
if (($i + 1) % 100 === 0) {
// Store the index of the next unprocessed URL so a resume doesn't repeat work
saveCheckpoint(['processed' => $i + 1, 'urls' => $urls]);
}
}
Streaming CSV Exports for Big Datasets
Don't build a massive array in memory—stream results directly to a file:
<?php
function streamToCSV($filePath, $data) {
$fp = fopen($filePath, 'a'); // Append mode
fputcsv($fp, $data);
fclose($fp);
}
// Initialize CSV with headers
$csvFile = 'products.csv';
streamToCSV($csvFile, ['Title', 'Price', 'URL']);
foreach ($products as $product) {
$data = [
$product['title'],
$product['price'],
$product['url']
];
streamToCSV($csvFile, $data);
// No accumulation in memory
}
For even better performance, keep the file handle open:
<?php
$fp = fopen('products.csv', 'w');
fputcsv($fp, ['Title', 'Price', 'URL']); // Header
foreach ($products as $product) {
fputcsv($fp, [
$product['title'],
$product['price'],
$product['url']
]);
}
fclose($fp);
When to Use (and Skip) Popular Libraries
Goutte
Use it when: You need a simple, jQuery-like API for basic scraping.
Skip it when: The site uses JavaScript rendering or you need fine-grained control.
Goutte itself is now deprecated; the recommended replacement is Symfony's BrowserKit, which offers the same crawler API:
<?php
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;
$browser = new HttpBrowser(HttpClient::create());
$crawler = $browser->request('GET', 'https://example.com');
$products = $crawler->filter('.product')->each(function ($node) {
return [
'title' => $node->filter('h2')->text(),
'price' => $node->filter('.price')->text()
];
});
Simple HTML DOM Parser
Use it when: You want CSS-like selectors without installing Composer packages.
Skip it when: You're scraping large pages (it's a memory hog) or need XPath's power.
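For reference, its selector syntax looks roughly like this (a sketch assuming simple_html_dom.php is available in your project):
<?php
require 'simple_html_dom.php';
$page = str_get_html($html);                 // Parse an HTML string
foreach ($page->find('div.product') as $product) {
    echo $product->find('h2', 0)->plaintext . "\n"; // First <h2> inside each product
}
$page->clear();                              // Free memory; this library holds onto a lot of it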
Roach PHP
Use it when: You're building a full-fledged spider with crawling logic.
Skip it when: You only need to scrape a few specific pages.
Symfony Panther / php-webdriver
Use it when: JavaScript rendering is absolutely unavoidable.
Skip it when: The data is available via API or in the initial HTML.
Real-World Example: Scraping an E-Commerce Site
Let's tie everything together. Here's a production-ready scraper for an e-commerce product listing:
<?php
class EcommerceScraper {
private $baseUrl = 'https://example.com';
private $cookieFile;
private $userAgents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
];
public function __construct() {
$this->cookieFile = tempnam(sys_get_temp_dir(), 'scraper_');
}
private function fetch($url, $retries = 3) {
$attempt = 0;
while ($attempt < $retries) {
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_TIMEOUT => 30,
CURLOPT_USERAGENT => $this->userAgents[array_rand($this->userAgents)],
CURLOPT_COOKIEFILE => $this->cookieFile,
CURLOPT_COOKIEJAR => $this->cookieFile,
CURLOPT_ENCODING => '',
CURLOPT_HTTPHEADER => [
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'DNT: 1',
'Connection: keep-alive'
]
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpCode === 200 && $html) {
return $html;
}
$attempt++;
sleep(pow(2, $attempt)); // Exponential backoff
}
throw new Exception("Failed to fetch $url");
}
private function parseProducts($html) {
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$productNodes = $xpath->query('//div[@class="product"]');
$products = [];
foreach ($productNodes as $node) {
$titleNode = $xpath->query('.//h2[@class="title"]', $node)->item(0);
$priceNode = $xpath->query('.//span[@class="price"]', $node)->item(0);
$linkNode = $xpath->query('.//a[@class="link"]', $node)->item(0);
if ($titleNode && $priceNode && $linkNode) {
$products[] = [
'title' => trim($titleNode->textContent),
'price' => trim($priceNode->textContent),
'url' => $linkNode->getAttribute('href')
];
}
}
return $products;
}
public function scrapeCategory($categoryUrl, $maxPages = 10) {
$allProducts = [];
for ($page = 1; $page <= $maxPages; $page++) {
try {
$url = $categoryUrl . "?page=$page";
echo "Scraping page $page...\n";
$html = $this->fetch($url);
$products = $this->parseProducts($html);
if (empty($products)) {
echo "No more products found.\n";
break;
}
$allProducts = array_merge($allProducts, $products);
// Polite delay
sleep(rand(2, 4));
} catch (Exception $e) {
error_log("Error scraping page $page: " . $e->getMessage());
continue;
}
}
return $allProducts;
}
public function exportToCSV($products, $filename) {
$fp = fopen($filename, 'w');
fputcsv($fp, ['Title', 'Price', 'URL']);
foreach ($products as $product) {
fputcsv($fp, [
$product['title'],
$product['price'],
$product['url']
]);
}
fclose($fp);
}
public function __destruct() {
if (file_exists($this->cookieFile)) {
unlink($this->cookieFile);
}
}
}
// Usage
$scraper = new EcommerceScraper();
$products = $scraper->scrapeCategory('https://example.com/electronics', 50);
$scraper->exportToCSV($products, 'electronics.csv');
echo "Scraped " . count($products) . " products\n";
This scraper includes:
- Retry logic with exponential backoff
- Random user agents and realistic headers
- Cookie persistence across requests
- Memory-efficient CSV export
- Polite delays between requests
- Error handling and logging
The API Alternative (When Scraping Isn't Worth It)
Sometimes, scraping is the wrong tool. If you're scraping at scale or the site actively blocks bots, consider these alternatives:
- Official APIs: Many sites offer APIs (Twitter, Reddit, LinkedIn). They're more reliable than scraping.
- Data providers: Services like Bright Data or ScraperAPI handle anti-bot systems for you. They're not free, but might save you weeks of maintenance.
- Pre-collected datasets: For academic or market research, check if someone already compiled the data (Kaggle, data.world, etc.).
Wrapping Up
PHP isn't glamorous for scraping, but it gets the job done. The techniques above—parallel requests, smart memory management, XPath parsing, and anti-bot strategies—will take you from hobbyist scraper to production-ready data extraction.
The biggest lesson? Don't over-engineer. Start with cURL and DOMDocument. Only add complexity (headless browsers, rotating proxies) when you actually hit roadblocks. And always, always add delays and respect robots.txt.
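If you want to honor robots.txt programmatically, even a naive check helps (this sketch ignores per-agent groups and only looks at Disallow prefixes):
<?php
function isDisallowed(string $baseUrl, string $path): bool {
    $robots = @file_get_contents(rtrim($baseUrl, '/') . '/robots.txt');
    if ($robots === false) {
        return false; // No robots.txt reachable; proceed carefully
    }
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        // Match "Disallow: /some/path" and test it as a prefix of the path we want
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m) && strpos($path, $m[1]) === 0) {
            return true;
        }
    }
    return false;
}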
Further reading: