You've built scrapers for dozens of eCommerce sites. Then you point your code at Shopee and watch it fail spectacularly.
Shopee isn't like other sites. It's one of the most heavily protected eCommerce platforms in Southeast Asia, serving nearly 300 million active users across multiple countries.
The platform employs login walls, aggressive fingerprinting, JavaScript-rendered content, and frequent DOM changes that break most standard scraping approaches.
In this guide, I'll show you how to scrape Shopee using custom Python solutions. No paid APIs. No subscription services. Just clean, working code you can run today.
What You'll Learn
Scraping Shopee requires understanding its defenses before writing any code. This guide covers:
- Why standard scrapers fail against Shopee
- Setting up stealth browser automation with Playwright
- Handling authentication and session persistence
- Extracting product data, prices, and reviews
- Implementing proxy rotation to avoid IP bans
- Storing scraped data in JSON and CSV formats
Let's start by understanding what makes Shopee different.
Why Is Shopee So Hard to Scrape?
Shopee uses multiple layers of protection that work together to detect and block automated access. Understanding these defenses is the first step to bypassing them.
JavaScript-Rendered Content
Shopee loads product data dynamically through JavaScript. Send a basic HTTP request and you'll get an empty shell with no useful data.
The requests library returns this:
<div id="main"></div>
All the actual product information loads after JavaScript executes in a browser environment.
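You can verify this yourself with a plain HTTP client (a minimal sketch, assuming the requests library is installed; the exact response varies by region, and Shopee may return a redirect or an anti-bot page instead of the shell):
import requests

# Fetch a Shopee search page without a real browser
response = requests.get(
    'https://shopee.sg/search?keyword=laptop',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'},
    timeout=15,
)
print(response.status_code)             # Frequently not a clean 200
print(len(response.text))               # Small payload compared to the rendered page
print('class="price"' in response.text) # Product markup is missing until JavaScript runs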
Mandatory Login Wall
Unlike Amazon or eBay, Shopee requires authentication to access most useful data. Without logging in, you'll hit redirect loops and blocked pages.
The platform requires:
- Email/password authentication
- Phone number verification (OTP)
- Region-specific phone numbers for new accounts
Aggressive Bot Detection
Shopee employs sophisticated fingerprinting that checks:
- Browser automation flags (Selenium, Puppeteer detection)
- Canvas and WebGL fingerprints
- Mouse movement patterns
- Request timing and frequency
- IP reputation and geolocation consistency
Frequent DOM Changes
Shopee updates its CSS selectors and page structure regularly. A scraper working today might break tomorrow when class names change from .shopee-search-item to .search-item-result__item.
This requires building scrapers that adapt to structural changes.
Method Overview: 4 Approaches to Scrape Shopee
Before diving into code, here's how different methods compare:
| Method | Difficulty | Cost | Success Rate | Best For |
|---|---|---|---|---|
| Basic HTTP Requests | Easy | Free | Very Low | Won't work |
| Standard Playwright | Medium | Free | Low | Testing only |
| Stealth Playwright | Medium | Free | High | Most use cases |
| Playwright + Anti-Detect | Hard | Free | Very High | Scale scraping |
Quick recommendation: Start with stealth Playwright. It handles 80% of Shopee scraping needs without additional complexity.
Prerequisites
Before writing any code, ensure you have:
- Python 3.9 or higher
- A Shopee account (create one with a local phone number)
- Basic understanding of async Python
- Chrome/Chromium browser installed
Install the required packages:
pip install playwright playwright-stealth aiofiles
playwright install chromium
The playwright-stealth package patches common automation detection points that Shopee checks.
Method 1: Stealth Playwright Setup
Standard Playwright gets detected immediately. Shopee checks for automation flags like navigator.webdriver being true.
Stealth Playwright patches these detection points automatically.
Basic Stealth Configuration
Create a file called shopee_scraper.py:
import asyncio
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
async def create_stealth_browser():
"""Initialize a stealth browser that bypasses basic detection."""
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(
headless=False, # Run headed first to debug
args=[
'--disable-blink-features=AutomationControlled',
'--disable-dev-shm-usage',
'--no-sandbox',
'--disable-setuid-sandbox',
]
)
context = await browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
locale='en-SG',
timezone_id='Asia/Singapore',
)
page = await context.new_page()
await stealth_async(page)
return playwright, browser, context, page
The --disable-blink-features=AutomationControlled flag removes the automation indicator that many sites check.
Setting locale and timezone to match your target Shopee region (Singapore in this example) helps avoid geographical mismatches.
Testing the Connection
Add a simple test to verify the setup works:
async def test_connection():
"""Verify we can reach Shopee without immediate blocking."""
playwright, browser, context, page = await create_stealth_browser()
try:
await page.goto('https://shopee.sg', wait_until='networkidle')
await page.wait_for_timeout(3000)
title = await page.title()
print(f"Page title: {title}")
# Check if we're blocked or redirected
current_url = page.url
if 'blocked' in current_url.lower() or 'captcha' in current_url.lower():
print("Detected blocking - stealth may need adjustment")
else:
print("Successfully loaded Shopee homepage")
finally:
await browser.close()
await playwright.stop()
if __name__ == "__main__":
asyncio.run(test_connection())
Run this before proceeding. If you see the homepage load successfully, your stealth configuration works.
Method 2: Handling Shopee Authentication
Scraping Shopee requires authenticated sessions for meaningful data. There are two approaches: manual login with cookie persistence, or automated login with OTP handling.
Approach A: Manual Login with Cookie Export
The safest method is logging in manually once, then reusing those cookies:
import json
import os
async def login_and_save_cookies(page, cookies_file='shopee_cookies.json'):
"""Navigate to login, wait for manual authentication, save cookies."""
await page.goto('https://shopee.sg/buyer/login')
print("Please log in manually in the browser window...")
print("Press Enter after you've successfully logged in.")
input()
# Save cookies for future sessions
cookies = await page.context.cookies()
with open(cookies_file, 'w') as f:
json.dump(cookies, f, indent=2)
print(f"Cookies saved to {cookies_file}")
return cookies
async def load_cookies(context, cookies_file='shopee_cookies.json'):
"""Load previously saved cookies into the browser context."""
if not os.path.exists(cookies_file):
return False
with open(cookies_file, 'r') as f:
cookies = json.load(f)
await context.add_cookies(cookies)
return True
This approach:
- Requires only one manual login
- Stores session data locally
- Works for days before requiring re-authentication
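Putting the two helpers together with the stealth browser from Method 1, a session bootstrap might look like this (a sketch; it assumes the create_stealth_browser, load_cookies, and login_and_save_cookies functions defined above):
async def start_authenticated_session():
    """Reuse saved cookies when available, otherwise fall back to a one-time manual login."""
    playwright, browser, context, page = await create_stealth_browser()
    if not await load_cookies(context):
        await login_and_save_cookies(page)
    # Reload the homepage so the session cookies take effect
    await page.goto('https://shopee.sg', wait_until='networkidle')
    return playwright, browser, context, page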
Approach B: Session Persistence with Browser Profiles
For longer-lasting sessions, save the entire browser state:
async def create_persistent_context():
"""Create a browser context that persists across sessions."""
playwright = await async_playwright().start()
# User data directory stores cookies, localStorage, etc.
user_data_dir = './shopee_profile'
context = await playwright.chromium.launch_persistent_context(
user_data_dir,
headless=False,
viewport={'width': 1920, 'height': 1080},
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
locale='en-SG',
timezone_id='Asia/Singapore',
args=['--disable-blink-features=AutomationControlled'],
)
page = context.pages[0] if context.pages else await context.new_page()
await stealth_async(page)
return playwright, context, page
Browser profiles maintain all session data automatically. After the first login, subsequent script runs stay authenticated.
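A first run with the persistent context can be as small as this sketch: log in manually once, and the profile directory keeps you authenticated on later runs.
async def run_with_profile():
    playwright, context, page = await create_persistent_context()
    try:
        await page.goto('https://shopee.sg', wait_until='networkidle')
        print(await page.title())  # After the first manual login, this loads as an authenticated session
    finally:
        await context.close()
        await playwright.stop()

asyncio.run(run_with_profile())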
Method 3: Extracting Product Data
With authentication handled, let's extract actual product information.
Scraping Search Results
This function searches for products and extracts basic information:
async def scrape_search_results(page, keyword, max_pages=3):
"""Search Shopee and extract product listings."""
products = []
search_url = f'https://shopee.sg/search?keyword={keyword}'
await page.goto(search_url, wait_until='networkidle')
for page_num in range(max_pages):
print(f"Scraping page {page_num + 1}...")
# Wait for product cards to load
await page.wait_for_selector('.shopee-search-item-result__item', timeout=10000)
# Scroll to trigger lazy loading
await scroll_page(page)
# Extract product data
items = await page.query_selector_all('.shopee-search-item-result__item')
for item in items:
product = await extract_product_card(item)
if product:
products.append(product)
# Navigate to next page
next_button = await page.query_selector('[class*="next-page"]')
if next_button:
await next_button.click()
await page.wait_for_timeout(2000)
else:
break
return products
async def scroll_page(page):
"""Scroll down to trigger lazy loading of images and data."""
await page.evaluate('''
async () => {
await new Promise(resolve => {
let totalHeight = 0;
const distance = 300;
const timer = setInterval(() => {
window.scrollBy(0, distance);
totalHeight += distance;
if (totalHeight >= document.body.scrollHeight) {
clearInterval(timer);
resolve();
}
}, 100);
});
}
''')
The scroll function is essential. Shopee lazy-loads product images and some data fields.
Extracting Product Card Information
Parse individual product cards:
async def extract_product_card(item):
"""Extract data from a single product card element."""
try:
# Product name
name_el = await item.query_selector('[data-sqe="name"]')
name = await name_el.inner_text() if name_el else None
# Price - handle both discounted and regular prices
price_el = await item.query_selector('.price')
price = await price_el.inner_text() if price_el else None
# Clean price string
if price:
price = price.replace('$', '').replace(',', '').strip()
# Product link
link_el = await item.query_selector('a')
link = await link_el.get_attribute('href') if link_el else None
if link and not link.startswith('http'):
link = f'https://shopee.sg{link}'
# Sold count
sold_el = await item.query_selector('.sold')
sold = await sold_el.inner_text() if sold_el else '0'
# Rating
rating_el = await item.query_selector('.rating')
rating = await rating_el.inner_text() if rating_el else None
return {
'name': name,
'price': price,
'link': link,
'sold': sold,
'rating': rating,
}
except Exception as e:
print(f"Error extracting product: {e}")
return None
Important: Shopee changes selector names frequently. If scraping fails, inspect the page structure and update selectors accordingly.
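One way to soften this is to try a few candidate selectors in order, so a single renamed class doesn't break the whole run. This is a sketch; the selector strings are illustrative and not guaranteed to match Shopee's current markup.
# Candidate selectors for product cards, newest first - update as Shopee's markup changes
PRODUCT_CARD_SELECTORS = [
    '.shopee-search-item-result__item',
    '.search-item-result__item',
    '[data-sqe="item"]',
]

async def query_with_fallbacks(page, selectors):
    """Return elements from the first selector that matches anything on the page."""
    for selector in selectors:
        items = await page.query_selector_all(selector)
        if items:
            return items
    return []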
Scraping Individual Product Pages
For detailed product information, navigate to individual pages:
async def scrape_product_details(page, product_url):
"""Extract detailed information from a product page."""
await page.goto(product_url, wait_until='networkidle')
await page.wait_for_timeout(2000)
# Scroll to load all content
await scroll_page(page)
details = {}
try:
# Product title
title_el = await page.query_selector('.product-title')
details['title'] = await title_el.inner_text() if title_el else None
# Current price
price_el = await page.query_selector('.price-current')
details['price'] = await price_el.inner_text() if price_el else None
# Original price (if discounted)
original_el = await page.query_selector('.price-original')
details['original_price'] = await original_el.inner_text() if original_el else None
# Stock quantity
stock_el = await page.query_selector('.product-stock')
details['stock'] = await stock_el.inner_text() if stock_el else None
# Description
desc_el = await page.query_selector('.product-description')
details['description'] = await desc_el.inner_text() if desc_el else None
# Seller information
seller_el = await page.query_selector('.seller-name')
details['seller'] = await seller_el.inner_text() if seller_el else None
# Ratings summary
rating_el = await page.query_selector('.product-rating')
details['rating'] = await rating_el.inner_text() if rating_el else None
# Number of reviews
review_count_el = await page.query_selector('.review-count')
details['review_count'] = await review_count_el.inner_text() if review_count_el else None
except Exception as e:
print(f"Error scraping product details: {e}")
return details
Method 4: Proxy Rotation for Scale
Scraping from a single IP triggers rate limits quickly. Rotating residential proxies distributes requests across many IPs.
Implementing Proxy Rotation
import random
class ProxyRotator:
"""Manage a pool of proxies for request distribution."""
def __init__(self, proxy_list):
self.proxies = proxy_list
self.current_index = 0
def get_next_proxy(self):
"""Return the next proxy in rotation."""
proxy = self.proxies[self.current_index]
self.current_index = (self.current_index + 1) % len(self.proxies)
return proxy
def get_random_proxy(self):
"""Return a random proxy from the pool."""
return random.choice(self.proxies)
async def create_browser_with_proxy(proxy_url):
"""Launch a browser configured to use a specific proxy."""
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(
headless=False,
proxy={'server': proxy_url},
args=['--disable-blink-features=AutomationControlled'],
)
context = await browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
)
page = await context.new_page()
await stealth_async(page)
return playwright, browser, context, page
Using the Proxy Rotator
# Example proxy list - replace with your actual proxies
proxies = [
'http://user:pass@proxy1.example.com:8080',
'http://user:pass@proxy2.example.com:8080',
'http://user:pass@proxy3.example.com:8080',
]
rotator = ProxyRotator(proxies)
async def scrape_with_rotation(keywords, products_per_keyword=50):
"""Scrape multiple keywords, rotating proxies between requests."""
all_products = []
for keyword in keywords:
proxy = rotator.get_next_proxy()
print(f"Scraping '{keyword}' with proxy: {proxy}")
playwright, browser, context, page = await create_browser_with_proxy(proxy)
try:
products = await scrape_search_results(page, keyword)
all_products.extend(products)
finally:
await browser.close()
await playwright.stop()
# Delay between keywords
await asyncio.sleep(random.uniform(5, 10))
return all_products
To work with Shopee's geo-restrictions, use residential proxies with IPs from the Southeast Asian countries where Shopee operates.
Rate Limiting and Request Delays
Aggressive scraping gets you blocked fast. Implement intelligent delays:
import random
import time
class RateLimiter:
"""Control request frequency to avoid triggering rate limits."""
def __init__(self, requests_per_minute=30):
self.min_delay = 60.0 / requests_per_minute
self.last_request = 0
async def wait(self):
"""Wait appropriate time before next request."""
elapsed = time.time() - self.last_request
if elapsed < self.min_delay:
delay = self.min_delay - elapsed
# Add random jitter
delay += random.uniform(0.5, 2.0)
await asyncio.sleep(delay)
self.last_request = time.time()
# Usage
rate_limiter = RateLimiter(requests_per_minute=20)
async def scrape_with_rate_limit(page, urls):
"""Scrape URLs while respecting rate limits."""
results = []
for url in urls:
await rate_limiter.wait()
try:
data = await scrape_product_details(page, url)
results.append(data)
except Exception as e:
print(f"Error scraping {url}: {e}")
continue
return results
Keep requests under 30 per minute per IP. Lower is safer for long-running scrapes.
Saving Scraped Data
Export your data to usable formats:
JSON Export
import json
from datetime import datetime
def save_to_json(products, filename=None):
"""Save product list to JSON file."""
if filename is None:
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
filename = f'shopee_products_{timestamp}.json'
with open(filename, 'w', encoding='utf-8') as f:
json.dump(products, f, ensure_ascii=False, indent=2)
print(f"Saved {len(products)} products to {filename}")
return filename
CSV Export
import csv
def save_to_csv(products, filename=None):
"""Save product list to CSV file."""
if not products:
print("No products to save")
return None
if filename is None:
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
filename = f'shopee_products_{timestamp}.csv'
# Get all unique keys from products
fieldnames = set()
for product in products:
fieldnames.update(product.keys())
fieldnames = sorted(list(fieldnames))
with open(filename, 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(products)
print(f"Saved {len(products)} products to {filename}")
return filename
Complete Working Example
Here's the full scraper combining everything:
import asyncio
import json
import random
from datetime import datetime
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
class ShopeeScraper:
"""Complete Shopee scraper with stealth and rate limiting."""
def __init__(self, cookies_file='shopee_cookies.json'):
self.cookies_file = cookies_file
self.playwright = None
self.browser = None
self.context = None
self.page = None
async def start(self):
"""Initialize the browser with stealth configuration."""
self.playwright = await async_playwright().start()
self.browser = await self.playwright.chromium.launch(
headless=False,
args=['--disable-blink-features=AutomationControlled'],
)
self.context = await self.browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
locale='en-SG',
timezone_id='Asia/Singapore',
)
self.page = await self.context.new_page()
await stealth_async(self.page)
# Load existing cookies if available
await self._load_cookies()
async def stop(self):
"""Clean up browser resources."""
if self.browser:
await self.browser.close()
if self.playwright:
await self.playwright.stop()
async def _load_cookies(self):
"""Load saved cookies into browser context."""
try:
with open(self.cookies_file, 'r') as f:
cookies = json.load(f)
await self.context.add_cookies(cookies)
print("Loaded existing session cookies")
except FileNotFoundError:
print("No saved cookies found - manual login required")
async def _save_cookies(self):
"""Save current cookies to file."""
cookies = await self.context.cookies()
with open(self.cookies_file, 'w') as f:
json.dump(cookies, f, indent=2)
async def search_products(self, keyword, max_results=50):
"""Search and extract product listings."""
products = []
search_url = f'https://shopee.sg/search?keyword={keyword}'
await self.page.goto(search_url, wait_until='networkidle')
await self._scroll_page()
items = await self.page.query_selector_all('[data-sqe="item"]')
for item in items[:max_results]:
await asyncio.sleep(random.uniform(0.1, 0.3))
product = await self._extract_product_card(item)
if product:
products.append(product)
return products
async def _extract_product_card(self, item):
"""Extract data from a product card element."""
try:
name_el = await item.query_selector('[data-sqe="name"]')
price_el = await item.query_selector('.price')
link_el = await item.query_selector('a')
return {
'name': await name_el.inner_text() if name_el else None,
'price': await price_el.inner_text() if price_el else None,
'url': await link_el.get_attribute('href') if link_el else None,
'scraped_at': datetime.now().isoformat(),
}
except Exception:
return None
async def _scroll_page(self):
"""Scroll to load lazy content."""
await self.page.evaluate('''
() => new Promise(resolve => {
let total = 0;
const timer = setInterval(() => {
window.scrollBy(0, 300);
total += 300;
if (total >= document.body.scrollHeight) {
clearInterval(timer);
resolve();
}
}, 100);
})
''')
async def main():
"""Main execution function."""
scraper = ShopeeScraper()
try:
await scraper.start()
# Search for products
products = await scraper.search_products('wireless earbuds', max_results=20)
# Save results
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
filename = f'shopee_results_{timestamp}.json'
with open(filename, 'w', encoding='utf-8') as f:
json.dump(products, f, ensure_ascii=False, indent=2)
print(f"Scraped {len(products)} products")
print(f"Results saved to {filename}")
finally:
await scraper.stop()
if __name__ == '__main__':
asyncio.run(main())
Run this script with python shopee_scraper.py after configuring your cookies.
Extracting Product Reviews
Reviews provide valuable market intelligence. Here's how to scrape them:
Navigating to Review Sections
Reviews load dynamically when you scroll to them. This function handles that:
async def scrape_product_reviews(page, product_url, max_reviews=50):
"""Extract reviews from a product page."""
await page.goto(product_url, wait_until='networkidle')
reviews = []
# Scroll down to reviews section
review_section = await page.query_selector('.product-ratings')
if review_section:
await review_section.scroll_into_view_if_needed()
await page.wait_for_timeout(2000)
# Click "All Reviews" tab if available
all_reviews_tab = await page.query_selector('[data-filter="0"]')
if all_reviews_tab:
await all_reviews_tab.click()
await page.wait_for_timeout(1500)
    while len(reviews) < max_reviews:
        previous_count = len(reviews)
        # Extract visible reviews
        review_items = await page.query_selector_all('.shopee-product-rating')
        for item in review_items:
            review = await extract_single_review(item)
            if review and review not in reviews:
                reviews.append(review)
        # Stop if a full pass added nothing new (avoids looping forever on the last page)
        if len(reviews) == previous_count:
            break
# Check for "Next" button in pagination
next_btn = await page.query_selector('.shopee-icon-button--right')
if next_btn:
is_disabled = await next_btn.get_attribute('disabled')
if not is_disabled:
await next_btn.click()
await page.wait_for_timeout(2000)
else:
break
else:
break
return reviews[:max_reviews]
async def extract_single_review(item):
"""Extract data from a single review element."""
try:
# Reviewer name
author_el = await item.query_selector('.shopee-product-rating__author-name')
author = await author_el.inner_text() if author_el else 'Anonymous'
# Rating (count the filled stars)
stars = await item.query_selector_all('.icon-rating-solid')
rating = len(stars) if stars else None
# Review text
content_el = await item.query_selector('.shopee-product-rating__content')
content = await content_el.inner_text() if content_el else ''
# Review date
date_el = await item.query_selector('.shopee-product-rating__time')
date = await date_el.inner_text() if date_el else None
# Product variation purchased
variation_el = await item.query_selector('.shopee-product-rating__variation')
variation = await variation_el.inner_text() if variation_el else None
return {
'author': author.strip(),
'rating': rating,
'content': content.strip(),
'date': date,
'variation': variation,
}
except Exception as e:
print(f"Error extracting review: {e}")
return None
Filtering Reviews by Rating
Shopee allows filtering reviews by star rating:
async def scrape_filtered_reviews(page, product_url, star_filter=None):
"""Scrape reviews filtered by star rating."""
await page.goto(product_url, wait_until='networkidle')
# Navigate to reviews section
review_section = await page.query_selector('.product-ratings')
if review_section:
await review_section.scroll_into_view_if_needed()
await page.wait_for_timeout(2000)
# Apply star filter if specified (1-5)
if star_filter and 1 <= star_filter <= 5:
filter_btn = await page.query_selector(f'[data-filter="{star_filter}"]')
if filter_btn:
await filter_btn.click()
await page.wait_for_timeout(1500)
    # Extract the filtered reviews in place; calling scrape_product_reviews here
    # would reload the page and clear the star filter. This grabs the first page only.
    reviews = []
    review_items = await page.query_selector_all('.shopee-product-rating')
    for item in review_items:
        review = await extract_single_review(item)
        if review:
            reviews.append(review)
    return reviews
This helps analyze negative reviews specifically or focus on highly positive feedback.
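For example, pulling only one-star reviews for complaint analysis could look like this (a usage sketch; the product URL is a placeholder):
async def collect_negative_reviews(page):
    """Fetch only 1-star reviews from a product page."""
    product_url = 'https://shopee.sg/some-product-url'  # Placeholder URL
    negative = await scrape_filtered_reviews(page, product_url, star_filter=1)
    print(f"Collected {len(negative)} one-star reviews")
    return negative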
Scaling Your Shopee Scraper
When scraping thousands of products, single-threaded execution becomes too slow. Here's how to scale efficiently.
Concurrent Scraping with asyncio
Run multiple browser contexts simultaneously:
import asyncio
import random
from asyncio import Semaphore
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
class ScalableShopeeScraper:
"""Handle concurrent scraping with controlled parallelism."""
def __init__(self, max_concurrent=5):
self.semaphore = Semaphore(max_concurrent)
self.results = []
async def scrape_url(self, url, playwright):
"""Scrape a single URL with semaphore control."""
async with self.semaphore:
browser = await playwright.chromium.launch(
headless=True,
args=['--disable-blink-features=AutomationControlled'],
)
try:
context = await browser.new_context(
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
)
page = await context.new_page()
await stealth_async(page)
# Add random delay
await asyncio.sleep(random.uniform(1, 3))
data = await scrape_product_details(page, url)
return data
finally:
await browser.close()
async def scrape_many(self, urls):
"""Scrape multiple URLs concurrently."""
playwright = await async_playwright().start()
try:
tasks = [self.scrape_url(url, playwright) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Filter out exceptions
valid_results = [r for r in results if not isinstance(r, Exception)]
return valid_results
finally:
await playwright.stop()
Using Semaphore(5) limits concurrent browsers to 5, preventing memory issues while still gaining significant speed improvements.
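Usage is straightforward (a sketch; the URLs below are placeholders):
async def run_concurrent_scrape():
    scraper = ScalableShopeeScraper(max_concurrent=5)
    urls = [
        'https://shopee.sg/product-url-1',  # Placeholder URLs
        'https://shopee.sg/product-url-2',
    ]
    results = await scraper.scrape_many(urls)
    print(f"Scraped {len(results)} product pages")

asyncio.run(run_concurrent_scrape())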
Distributed Scraping Across Machines
For enterprise-scale scraping, distribute work across multiple machines:
import asyncio
import json
import random
import redis
class DistributedScraper:
"""Coordinate scraping across multiple workers."""
def __init__(self, redis_host='localhost'):
self.redis = redis.Redis(host=redis_host)
self.queue_name = 'shopee_urls'
self.results_name = 'shopee_results'
def add_urls_to_queue(self, urls):
"""Add URLs to the distributed queue."""
for url in urls:
self.redis.rpush(self.queue_name, url)
print(f"Added {len(urls)} URLs to queue")
async def worker_loop(self, worker_id):
"""Main loop for a worker process."""
print(f"Worker {worker_id} starting...")
scraper = ShopeeScraper()
await scraper.start()
try:
while True:
# Get URL from queue
url = self.redis.lpop(self.queue_name)
if not url:
await asyncio.sleep(5)
continue
url = url.decode('utf-8')
print(f"Worker {worker_id} processing: {url}")
try:
data = await scrape_product_details(scraper.page, url)
self.redis.rpush(self.results_name, json.dumps(data))
except Exception as e:
print(f"Worker {worker_id} error: {e}")
# Re-queue failed URL
self.redis.rpush(self.queue_name, url)
await asyncio.sleep(random.uniform(2, 5))
finally:
await scraper.stop()
Run multiple workers on different machines, each pulling URLs from the shared Redis queue.
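A worker entry point can be as small as this sketch (the Redis hostname is a placeholder; one copy of this script runs on each machine):
import sys

async def run_worker():
    coordinator = DistributedScraper(redis_host='your-redis-host')  # Placeholder hostname
    worker_id = sys.argv[1] if len(sys.argv) > 1 else '1'
    await coordinator.worker_loop(worker_id)

if __name__ == '__main__':
    asyncio.run(run_worker())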
Memory Management for Long Runs
Browser automation consumes significant memory. Restart browsers periodically:
class MemoryEfficientScraper:
"""Scraper that manages memory by recycling browsers."""
def __init__(self, restart_after=100):
self.request_count = 0
self.restart_threshold = restart_after
self.scraper = None
async def ensure_browser(self):
"""Create or restart browser as needed."""
if self.scraper is None or self.request_count >= self.restart_threshold:
if self.scraper:
await self.scraper.stop()
self.scraper = ShopeeScraper()
await self.scraper.start()
self.request_count = 0
print("Browser restarted for memory management")
return self.scraper
async def scrape(self, url):
"""Scrape with automatic browser recycling."""
scraper = await self.ensure_browser()
self.request_count += 1
return await scrape_product_details(scraper.page, url)
Restarting every 100 requests prevents memory leaks from accumulating.
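A long catalog run then wraps the recycler like this sketch:
async def scrape_catalog(urls):
    """Scrape a long URL list while the browser is recycled every 100 requests."""
    scraper = MemoryEfficientScraper(restart_after=100)
    results = []
    try:
        for url in urls:
            try:
                results.append(await scraper.scrape(url))
            except Exception as e:
                print(f"Skipping {url}: {e}")
    finally:
        if scraper.scraper:
            await scraper.scraper.stop()
    return results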
Troubleshooting Common Issues
"Navigation timeout" Errors
Shopee pages load slowly. Increase timeout values:
await page.goto(url, timeout=60000, wait_until='networkidle')
CAPTCHA Challenges
If you encounter CAPTCHAs frequently:
- Reduce request frequency
- Use residential proxies instead of datacenter IPs
- Ensure browser fingerprint consistency
- Rotate user agents between sessions (a small sketch follows below)
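For the last point, here's a minimal sketch of picking a different user agent per session (the strings are illustrative; keep them current and consistent with the rest of your fingerprint):
import random

# Illustrative desktop user agents - keep these up to date
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
]

async def new_context_with_random_ua(browser):
    """Create a browser context with a user agent drawn from the pool."""
    return await browser.new_context(
        user_agent=random.choice(USER_AGENTS),
        viewport={'width': 1920, 'height': 1080},
    )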
Empty Product Data
Selectors change often. When scraping returns empty values:
- Open the browser in headed mode
- Inspect actual element classes
- Update selectors in your code
Login Session Expiration
Sessions expire after several hours. Implement automatic re-authentication:
async def check_login_status(page):
"""Verify if still logged in."""
await page.goto('https://shopee.sg/user/account')
return 'login' not in page.url.lower()
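A re-authentication wrapper can combine this check with the cookie helpers from Method 2 (a sketch; it assumes the check_login_status and login_and_save_cookies functions shown earlier):
async def ensure_logged_in(page, cookies_file='shopee_cookies.json'):
    """Verify the session before scraping; fall back to a manual login if it expired."""
    if await check_login_status(page):
        return True
    print("Session expired - logging in again")
    await login_and_save_cookies(page, cookies_file)
    return await check_login_status(page)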
"Access Denied" or 403 Errors
These indicate detection. Try these fixes in order:
- Verify stealth patches are applied:
# Test if webdriver flag is hidden
result = await page.evaluate('navigator.webdriver')
print(f"Webdriver detected: {result}") # Should be None or False
- Check timezone and locale match:
# Ensure these match your proxy location
context = await browser.new_context(
locale='en-SG',
timezone_id='Asia/Singapore',
geolocation={'latitude': 1.3521, 'longitude': 103.8198},
)
- Rotate to a fresh IP address:
# Close current browser and start with new proxy
await browser.close()
new_proxy = proxy_rotator.get_random_proxy()
# Create new browser with fresh proxy
Infinite Loading or Stuck Pages
Shopee sometimes hangs during heavy JavaScript execution:
async def safe_navigate(page, url, max_retries=3):
"""Navigate with retry logic for stuck pages."""
for attempt in range(max_retries):
try:
await page.goto(url, timeout=30000, wait_until='domcontentloaded')
# Wait for key element instead of full load
await page.wait_for_selector('.main-content', timeout=15000)
return True
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
await page.reload()
await asyncio.sleep(2)
return False
Using domcontentloaded instead of networkidle prevents waiting for slow-loading trackers.
Data Returns Empty Despite Page Loading
This usually means selectors changed. Debug with:
async def debug_selectors(page):
"""Print page structure to find correct selectors."""
# Get all class names on page
classes = await page.evaluate('''
() => {
const elements = document.querySelectorAll('*');
const classes = new Set();
elements.forEach(el => {
el.classList.forEach(c => classes.add(c));
});
return Array.from(classes).sort();
}
''')
print("Classes found on page:")
for cls in classes[:50]: # First 50
print(f" .{cls}")
Run this when scraping fails to discover current class names.
Best Practices for Shopee Scraping
Keep request rates low. Staying under 30 requests per minute per IP prevents most rate limiting; for safer operation, 15-20 requests per minute gives extra headroom.
Use regional proxies. IPs from Singapore, Malaysia, Thailand, or other Southeast Asian countries avoid geo-blocking. Datacenter IPs get flagged quickly—residential proxies work much better for Shopee.
Persist sessions. Reuse browser profiles and cookies instead of logging in repeatedly. Each new login increases account risk flags.
Monitor for changes. Shopee updates its site frequently. Set up weekly tests that verify your selectors still work. A simple test that checks if key elements exist catches most breakages.
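A weekly health check can be a short script like this sketch (the selectors are the ones used earlier in this guide; swap in whatever your scraper actually relies on):
# Selectors the scraper depends on - illustrative, update to match your own code
CRITICAL_SELECTORS = {
    'search result card': '[data-sqe="item"]',
    'product name': '[data-sqe="name"]',
    'price': '.price',
}

async def check_selectors(page, keyword='laptop'):
    """Load one search page and report which critical selectors no longer match."""
    await page.goto(f'https://shopee.sg/search?keyword={keyword}', wait_until='networkidle')
    broken = []
    for label, selector in CRITICAL_SELECTORS.items():
        if not await page.query_selector(selector):
            broken.append(f"{label} ({selector})")
    if broken:
        print("Selectors needing updates: " + ", ".join(broken))
    else:
        print("All critical selectors still match")
    return broken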
Implement exponential backoff. When you hit errors, increase wait times progressively:
async def scrape_with_backoff(page, url, max_retries=5):
"""Scrape with exponential backoff on failures."""
base_delay = 2
for attempt in range(max_retries):
try:
return await scrape_product_details(page, url)
except Exception as e:
delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
print(f"Attempt {attempt + 1} failed, waiting {delay:.1f}s")
await asyncio.sleep(delay)
raise Exception(f"Failed after {max_retries} attempts")
Respect the platform. Only collect publicly available data. Avoid personal information and honor robots.txt restrictions. Excessive scraping can lead to legal issues and harms the platform for other users.
Log everything. Track request counts, success rates, and error types:
import logging
from datetime import datetime
logging.basicConfig(
filename=f'scraper_{datetime.now():%Y%m%d}.log',
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
async def logged_scrape(page, url):
"""Scrape with comprehensive logging."""
start = datetime.now()
try:
result = await scrape_product_details(page, url)
duration = (datetime.now() - start).total_seconds()
logging.info(f"SUCCESS: {url} ({duration:.2f}s)")
return result
except Exception as e:
logging.error(f"FAILED: {url} - {str(e)}")
raise
Logs help diagnose issues and track scraper health over time.
Handle anti-fingerprinting properly. Beyond basic stealth patches, consider canvas fingerprint randomization:
await page.add_init_script('''
// Randomize canvas fingerprint
const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
HTMLCanvasElement.prototype.toDataURL = function(type) {
if (type === 'image/png' && this.width === 220 && this.height === 30) {
const context = this.getContext('2d');
const imageData = context.getImageData(0, 0, this.width, this.height);
for (let i = 0; i < imageData.data.length; i += 4) {
imageData.data[i] += Math.floor(Math.random() * 10) - 5;
}
context.putImageData(imageData, 0, 0);
}
return originalToDataURL.apply(this, arguments);
};
''')
This adds noise to canvas fingerprints that anti-bot systems use for tracking.
Conclusion
Scraping Shopee requires more sophistication than typical eCommerce sites, but it's absolutely achievable with the right approach.
The combination of stealth Playwright, proper session management, and residential proxy rotation handles most scenarios effectively.
Start with the basic stealth configuration and add complexity only as needed. Most use cases work fine with cookie-based authentication and moderate rate limiting.
Your next steps:
- Set up Playwright with stealth patches
- Log in manually and save cookies
- Test with small searches before scaling
- Add proxy rotation when you need volume
The complete code examples in this guide give you everything needed to start scraping Shopee today.