Amazon scraping gives you access to millions of product listings, prices, reviews, and market data—but it's also one of the most challenging scraping targets on the web. Between AWS WAF Bot Control, CAPTCHAs, and aggressive rate limiting, getting clean data from Amazon requires more than just firing off HTTP requests.

In this guide, I'll walk you through practical methods for scraping Amazon, from basic techniques to lesser-known tricks that actually work in 2026. No fluff, no theory—just code that runs and strategies that bypass the blocks.

What You Need to Know Before Starting

Before you write a single line of code, understand what you're up against. Amazon uses AWS WAF Bot Control, which employs machine learning, browser fingerprinting, and behavioral analysis to spot bots. It's not just checking your IP address—it's analyzing request patterns, JavaScript execution, mouse movements, and even your font list.

Here's what matters:

Respect robots.txt: Amazon's robots.txt explicitly disallows scraping of certain pages. Stick to publicly accessible product pages and search results. Extended reviews behind login walls are off-limits. (A quick way to check paths programmatically is sketched after this list.)

Rate limiting is real: Send too many requests too fast, and you'll trigger rate limits or get served CAPTCHAs. Even with perfect headers, volume matters.

Legal considerations: Scraping publicly available data isn't automatically illegal, but Amazon's Terms of Service prohibit automated data collection. Use this knowledge for personal research, education, or analysis—not for reselling data or competitive intelligence at scale.
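
On the robots.txt point above, Python's standard library can run the check for you. A minimal sketch; the paths are just illustrative examples:

from urllib.robotparser import RobotFileParser

# Fetch and parse Amazon's robots.txt once, then test paths before scraping them
rp = RobotFileParser()
rp.set_url("https://www.amazon.com/robots.txt")
rp.read()

for path in ["/dp/B08N5WRWNW", "/s?k=wireless+mouse", "/gp/cart/view.html"]:
    allowed = rp.can_fetch("*", f"https://www.amazon.com{path}")
    print(f"{path}: {'allowed' if allowed else 'disallowed'}")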

Method 1: Basic HTTP Request Scraping with Python

Let's start with the fundamentals. For simple, small-scale scraping, Python with Requests and BeautifulSoup works fine. Here's how to scrape a single product page:

import requests
from bs4 import BeautifulSoup
import time
import random

def scrape_amazon_product(url):
    # Mimic a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }
    
    # Random delay to mimic human behavior
    time.sleep(random.uniform(2, 5))
    
    response = requests.get(url, headers=headers, timeout=15)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract product data
        title = soup.find('span', {'id': 'productTitle'})
        title_text = title.text.strip() if title else 'N/A'
        
        # Price extraction (Amazon has multiple price formats)
        price_whole = soup.find('span', {'class': 'a-price-whole'})
        price_fraction = soup.find('span', {'class': 'a-price-fraction'})
        
        if price_whole and price_fraction:
            price = f"${price_whole.text}{price_fraction.text}"
        else:
            price = 'N/A'
        
        # Extract ASIN from URL
        asin = url.split('/dp/')[1].split('/')[0] if '/dp/' in url else 'N/A'
        
        return {
            'title': title_text,
            'price': price,
            'asin': asin,
            'url': url
        }
    else:
        print(f"Failed to fetch page: {response.status_code}")
        return None

# Test it
product_url = "https://www.amazon.com/dp/B08N5WRWNW"
data = scrape_amazon_product(product_url)
print(data)

This works for a handful of requests. But scale it to hundreds of products, and you'll hit blocks fast.

Why this approach fails at scale: Amazon tracks request patterns. Even with perfect headers, if you're hammering their servers from one IP, you're getting flagged. The solution? Rotate IPs with proxies or use residential proxies that look like real users.
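
Here's a minimal sketch of that rotation with Requests. The proxy URLs are placeholders, not real endpoints; swap in whatever provider you actually use. The point is that each attempt goes out through a different IP while keeping the same realistic headers:

import random
import requests

# Placeholder proxy URLs; replace with endpoints from your own provider
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
]

def fetch_with_rotating_proxy(url, headers):
    """Try the request through randomly ordered proxies until one succeeds."""
    for proxy in random.sample(PROXIES, len(PROXIES)):
        try:
            response = requests.get(
                url,
                headers=headers,
                proxies={'http': proxy, 'https': proxy},
                timeout=15
            )
            if response.status_code == 200:
                return response
        except requests.RequestException:
            continue  # Dead or blocked proxy, move on to the next one
    return None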

Method 2: Using Headless Browsers for Dynamic Content

Some Amazon pages load data dynamically via JavaScript. For these, you need a headless browser like Playwright or Selenium with anti-detection measures:

from playwright.sync_api import sync_playwright
import random
import time

def scrape_with_playwright(url):
    with sync_playwright() as p:
        # Launch browser with realistic settings
        browser = p.chromium.launch(
            headless=False,  # Sometimes headless mode gets flagged
            args=[
                '--disable-blink-features=AutomationControlled',
                '--window-size=1920,1080'
            ]
        )
        
        context = browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
        )
        
        # Remove webdriver flag
        page = context.new_page()
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """)
        
        # Navigate with realistic delays
        page.goto(url, wait_until='networkidle')
        time.sleep(random.uniform(2, 4))
        
        # Scroll to load lazy content
        page.evaluate("window.scrollBy(0, 500)")
        time.sleep(1)
        page.evaluate("window.scrollBy(0, 500)")
        
        # Extract data
        title = page.query_selector('#productTitle')
        title_text = title.inner_text() if title else 'N/A'
        
        browser.close()
        return {'title': title_text}

# Test it
data = scrape_with_playwright("https://www.amazon.com/dp/B08N5WRWNW")
print(data)

The catch: Headless browsers are resource-heavy and slower. Use them only when necessary—like for pages that require JavaScript rendering or when you need to interact with the page (clicking buttons, expanding sections).
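
A practical pattern is to try the cheap HTTP request first and only pay the browser cost when it fails. This is just a sketch wired to the two functions defined above:

def scrape_product(url):
    """Try plain HTTP first; fall back to a headless browser only when needed."""
    data = scrape_amazon_product(url)        # Method 1: fast and cheap
    
    # Fall back when the request was blocked or the title didn't render
    if data is None or data.get('title') == 'N/A':
        data = scrape_with_playwright(url)   # Method 2: heavy but handles JavaScript
    
    return data

print(scrape_product("https://www.amazon.com/dp/B08N5WRWNW"))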

Unknown Trick #1: Mining ASINs from Data Attributes

Here's something most tutorials won't tell you: Amazon embeds ASINs directly in HTML data attributes on search results pages. You don't need to visit each product page to get the ASIN—it's already there.

When you search for products on Amazon, each result card has a data-asin attribute. This is gold for bulk scraping:

import requests
from bs4 import BeautifulSoup

def extract_asins_from_search(search_query):
    # Format search URL
    url = f"https://www.amazon.com/s?k={search_query.replace(' ', '+')}"
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
    }
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all product divs with data-asin attribute
    product_divs = soup.find_all('div', {'data-asin': True})
    
    asins = []
    for div in product_divs:
        asin = div.get('data-asin')
        # Filter out empty ASINs (sponsored content sometimes has empty values)
        if asin and len(asin) == 10:
            asins.append(asin)
    
    return list(set(asins))  # Remove duplicates

# Get ASINs for all "wireless mouse" products on page 1
asins = extract_asins_from_search("wireless mouse")
print(f"Found {len(asins)} unique ASINs:")
print(asins)

Why this matters: Instead of scraping 20 product pages, you scrape one search page and get all the ASINs. Then you can either:

  • Use the ASINs to construct product URLs (https://amazon.com/dp/{ASIN})
  • Pass them to the Amazon Product Advertising API (if you have access)
  • Store them for later processing

This approach cuts your request volume by 20x, dramatically reducing your chances of getting blocked.
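
To make that workflow concrete, here's a small sketch that reuses extract_asins_from_search, builds the product URLs, and stores the ASINs so the same search never has to be fetched twice (the filename is arbitrary):

import json

# Harvest once, then reuse: build URLs and persist the ASINs for later runs
asins = extract_asins_from_search("wireless mouse")
product_urls = [f"https://www.amazon.com/dp/{asin}" for asin in asins]

with open('asins_wireless_mouse.json', 'w') as f:
    json.dump({'query': 'wireless mouse', 'asins': asins}, f, indent=2)

print(f"Saved {len(asins)} ASINs and built {len(product_urls)} URLs")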

Unknown Trick #2: The Search Results Shortcut

Most people scrape individual product pages for data. But if you only need basic info (title, price, rating, review count), you can get it all from search results—no need to visit product pages at all.

Amazon's search results contain condensed product data in the HTML. Here's how to extract it:

import requests
from bs4 import BeautifulSoup

def scrape_search_results(keyword, page=1):
    url = f"https://www.amazon.com/s?k={keyword}&page={page}"
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
    }
    
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    products = []
    
    # Each product result
    for item in soup.select('div[data-component-type="s-search-result"]'):
        asin = item.get('data-asin')
        
        # Title
        title_elem = item.select_one('h2 a span')
        title = title_elem.text if title_elem else 'N/A'
        
        # Price
        price_elem = item.select_one('.a-price .a-offscreen')
        price = price_elem.text if price_elem else 'N/A'
        
        # Rating
        rating_elem = item.select_one('.a-icon-alt')
        rating = rating_elem.text if rating_elem else 'N/A'
        
        # Review count (class names shift with Amazon's A/B tests; add fallbacks if this comes back empty)
        review_elem = item.select_one('span.a-size-base.s-underline-text')
        review_count = review_elem.text.strip() if review_elem else 'N/A'
        
        products.append({
            'asin': asin,
            'title': title,
            'price': price,
            'rating': rating,
            'review_count': review_count
        })
    
    return products

# Scrape first page of results
results = scrape_search_results("mechanical keyboard")
for product in results[:5]:
    print(product)

The advantage: You get 20 products per request instead of 1. For price monitoring or market research where you don't need detailed specs, this is 20x more efficient.
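
The same function extends naturally to multiple pages. A rough sketch; the three-page cap and the delay range are arbitrary choices, not Amazon limits:

import random
import time

def scrape_multiple_pages(keyword, max_pages=3):
    """Collect search results across several pages with polite delays."""
    all_products = []
    for page in range(1, max_pages + 1):
        products = scrape_search_results(keyword, page=page)
        if not products:
            break  # An empty page usually means no more results, or a block
        all_products.extend(products)
        time.sleep(random.uniform(3, 7))  # Pause between pages
    return all_products

results = scrape_multiple_pages("mechanical keyboard", max_pages=3)
print(f"Collected {len(results)} products")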

Unknown Trick #3: Amazon's Undocumented Widget Endpoints

This one's a bit obscure, but incredibly useful. Amazon has widget endpoints originally designed for their affiliate program. These endpoints return product images and basic data without the full page overhead—and they're less protected than regular pages.

For product images by ASIN:

def get_product_image_url(asin, marketplace='US', size='SL500'):
    """
    Get Amazon product image URL using widget endpoint
    
    Marketplaces:
    - US: MarketPlace=US, region=ws-na
    - UK: MarketPlace=GB, region=ws-eu
    - Germany: MarketPlace=DE, region=ws-eu
    - Japan: MarketPlace=JP, region=ws-fe
    
    Sizes: SL160, SL250, SL500, AC_SL500
    """
    
    region_map = {
        'US': 'ws-na',
        'GB': 'ws-eu',
        'DE': 'ws-eu',
        'FR': 'ws-eu',
        'JP': 'ws-fe'
    }
    
    region = region_map.get(marketplace, 'ws-na')
    
    image_url = (
        f"https://{region}.amazon-adsystem.com/widgets/q?"
        f"_encoding=UTF8&MarketPlace={marketplace}&ASIN={asin}"
        f"&ServiceVersion=20070822&ID=AsinImage&WS=1&Format={size}"
    )
    
    return image_url

# Example usage
asin = "B08N5WRWNW"
image_url = get_product_image_url(asin, 'US', 'SL500')
print(f"Product image: {image_url}")

# You can directly use this URL in your application
# No scraping needed, just construct the URL

Why this works: These widget endpoints are designed for high-volume traffic from affiliate sites, so they're more tolerant of automated requests. You won't get product details, but for images, it's perfect.
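
If you want the file itself rather than just the URL, a plain GET against that URL is enough. A small sketch building on get_product_image_url; the output filename is arbitrary:

import requests

def download_product_image(asin, filename=None):
    """Fetch the widget image for an ASIN and save it locally."""
    image_url = get_product_image_url(asin, 'US', 'SL500')
    response = requests.get(image_url, timeout=15)
    
    if response.status_code == 200 and response.content:
        filename = filename or f"{asin}.jpg"
        with open(filename, 'wb') as f:
            f.write(response.content)
        return filename
    return None

print(download_product_image("B08N5WRWNW"))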

Unknown Trick #4: Request Timing Patterns That Work

Amazon's bot detection doesn't just look at volume—it analyzes patterns. Send requests at exactly 3-second intervals? That's a bot. Here's a better approach:

import random
import time
from datetime import datetime

class SmartThrottler:
    def __init__(self):
        self.last_request_time = None
        self.request_count = 0
        self.hourly_limit = 200  # Conservative limit
    
    def wait(self):
        """
        Implement realistic human-like delays with variance
        """
        if self.last_request_time:
            # Base delay with randomness
            base_delay = random.uniform(3, 7)
            
            # Add occasional longer pauses (like humans taking breaks)
            if random.random() < 0.1:  # 10% chance
                base_delay += random.uniform(10, 30)
            
            # Time of day adjustments (slower at night)
            hour = datetime.now().hour
            if 22 <= hour or hour <= 6:
                base_delay *= 1.5
            
            time.sleep(base_delay)
        
        self.last_request_time = time.time()
        self.request_count += 1
        
        # Enforce hourly limits
        if self.request_count >= self.hourly_limit:
            print("Hourly limit reached, sleeping for 1 hour...")
            time.sleep(3600)
            self.request_count = 0

# Usage
throttler = SmartThrottler()

asins = ['B08N5WRWNW', 'B07ZPKN6YR', 'B08G9XVZ9G']
for asin in asins:
    throttler.wait()
    # Make your request here
    print(f"Scraping ASIN: {asin}")

The trick: Vary your delays, add random pauses, and respect hourly limits. Real users don't browse at constant intervals—neither should your scraper.

Handling AWS WAF Bot Control

Amazon uses AWS WAF Bot Control, which is sophisticated. Here's what it checks and how to address each:

1. JavaScript Execution

AWS WAF injects JavaScript challenges to verify you can execute code. If you're using plain HTTP requests, you'll fail. Solution: Use a headless browser or a scraping service that handles JS rendering.

2. Browser Fingerprinting

AWS WAF collects your screen size, fonts, WebGL info, canvas fingerprint, and more. If your "browser" doesn't have these properties, you're flagged.

Solution: Use anti-detect browsers or tools like undetected-chromedriver:

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import time

def scrape_with_undetected():
    options = uc.ChromeOptions()
    options.add_argument('--window-size=1920,1080')
    
    driver = uc.Chrome(options=options)
    
    try:
        driver.get('https://www.amazon.com/s?k=laptop')
        time.sleep(3)
        
        # Extract data
        products = driver.find_elements(By.CSS_SELECTOR, 'div[data-asin]')
        print(f"Found {len(products)} products")
        
    finally:
        driver.quit()

scrape_with_undetected()

3. IP Reputation

Datacenter IPs get flagged fast. Residential proxies work better because they look like real users. If you're serious about scraping Amazon at scale, invest in residential proxies from providers like Bright Data, Smartproxy, or Oxylabs.

4. CAPTCHA Challenges

Eventually, you'll hit CAPTCHAs. You'll want to detect them first (see the sketch after this list), then pick one of these options:

  • Slow down your requests (best solution)
  • Rotate IPs more aggressively
  • Use CAPTCHA solving services (2Captcha, CapMonster) as a last resort
  • Switch to a scraping API that handles CAPTCHAs automatically
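
Before any of those options matter, your scraper has to notice that it got a CAPTCHA instead of a product page. A minimal detection sketch; the marker strings below are ones commonly reported for Amazon's robot-check page, so verify them against the responses you actually receive:

import random
import time
import requests

def is_captcha_page(html):
    """Heuristic check for Amazon's robot-check page (markers may change)."""
    markers = [
        'Enter the characters you see below',
        'api-services-support@amazon.com',
        'validateCaptcha',
    ]
    return any(marker in html for marker in markers)

def fetch_or_back_off(url, headers):
    """Fetch a page, and back off for a while if Amazon serves a block page."""
    response = requests.get(url, headers=headers, timeout=15)
    if response.status_code == 503 or is_captcha_page(response.text):
        print("CAPTCHA or block detected, backing off...")
        time.sleep(random.uniform(60, 180))
        return None
    return response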

Parsing Amazon's Messy HTML

Amazon's HTML is a nightmare. Classes change, IDs are dynamic, and structure varies by page. Here are reliable selectors that work across most product pages:

from bs4 import BeautifulSoup

def parse_product_page(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    data = {}
    
    # Title - multiple fallbacks
    title_selectors = [
        {'id': 'productTitle'},
        {'class': 'product-title-word-break'}
    ]
    for selector in title_selectors:
        elem = soup.find('span', selector)
        if elem:
            data['title'] = elem.text.strip()
            break
    
    # Price - Amazon has multiple price formats
    price_elem = soup.find('span', {'class': 'a-price'})
    if price_elem:
        whole = price_elem.find('span', {'class': 'a-price-whole'})
        fraction = price_elem.find('span', {'class': 'a-price-fraction'})
        if whole and fraction:
            data['price'] = f"{whole.text}{fraction.text}"
    
    # Rating
    rating_elem = soup.find('span', {'class': 'a-icon-alt'})
    if rating_elem:
        data['rating'] = rating_elem.text.split(' ')[0]
    
    # Review count
    review_elem = soup.find('span', {'id': 'acrCustomerReviewText'})
    if review_elem:
        data['review_count'] = review_elem.text.split(' ')[0].replace(',', '')
    
    # Availability
    availability_elem = soup.find('div', {'id': 'availability'})
    if availability_elem:
        data['availability'] = availability_elem.text.strip()
    
    # Features/bullets
    features = []
    feature_bullets = soup.find('div', {'id': 'feature-bullets'})
    if feature_bullets:
        for li in feature_bullets.find_all('li'):
            span = li.find('span', {'class': 'a-list-item'})
            if span:
                features.append(span.text.strip())
    data['features'] = features
    
    return data

Pro tip: Always build fallbacks. If one selector fails, try another. Amazon A/B tests layouts constantly, so what works today might not work tomorrow.

Best Practices for Ethical Scraping

Let's be clear: Amazon doesn't want you scraping. But if you're going to do it, do it responsibly:

1. Rate limit yourself: Don't hammer their servers. Space out requests with realistic delays.

2. Use a rotating User-Agent: Mimic real browsers, but don't rotate too obviously.

# Use complete, current UA strings; truncated ones are easy to flag
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'
]

import random
headers = {'User-Agent': random.choice(user_agents)}

3. Respect robots.txt: Amazon's robots.txt disallows scraping of customer reviews (full text), checkout, and account pages. Stick to product pages and search results.

4. Cache aggressively: Don't request the same page twice. Store results locally:

import json
import hashlib
from pathlib import Path

def cache_page(url, content):
    cache_dir = Path('cache')
    cache_dir.mkdir(exist_ok=True)
    
    # Hash URL as filename
    filename = hashlib.md5(url.encode()).hexdigest()
    filepath = cache_dir / f"{filename}.json"
    
    with open(filepath, 'w') as f:
        json.dump({'url': url, 'content': content}, f)

def get_cached_page(url):
    cache_dir = Path('cache')
    filename = hashlib.md5(url.encode()).hexdigest()
    filepath = cache_dir / f"{filename}.json"
    
    if filepath.exists():
        with open(filepath, 'r') as f:
            return json.load(f)['content']
    return None

5. Monitor your success rate: If your success rate drops below 80%, you're being throttled. Back off.
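
A rolling window makes that rule easy to enforce. Here's a sketch; the window size, the 80% threshold, and the length of the back-off are tunable defaults, not hard limits:

from collections import deque

class SuccessMonitor:
    """Track the success rate of recent requests and flag when to back off."""
    
    def __init__(self, window=50, threshold=0.8):
        self.results = deque(maxlen=window)  # Rolling window of recent outcomes
        self.threshold = threshold           # Back off below this success rate
    
    def record(self, success):
        self.results.append(1 if success else 0)
    
    def should_back_off(self):
        if len(self.results) < 10:  # Not enough data to judge yet
            return False
        rate = sum(self.results) / len(self.results)
        return rate < self.threshold

monitor = SuccessMonitor()
# After each request:
#     monitor.record(response is not None and response.status_code == 200)
#     if monitor.should_back_off():
#         time.sleep(600)  # Take a long break before resuming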

Final Thoughts

Scraping Amazon in 2026 isn't easy, but it's doable if you combine the right techniques. The tricks in this guide—mining ASINs from data attributes, scraping search results instead of individual pages, using widget endpoints, and implementing realistic timing patterns—will get you further than standard approaches.

The real key is thinking like Amazon's detection systems. Don't just rotate IPs and hope for the best. Vary your request patterns, respect rate limits, and cache everything you can. If you're hitting blocks constantly, you're being too aggressive.

For production-scale scraping, consider using established scraping APIs like Bright Data, ScraperAPI, or Oxylabs. They handle the anti-bot dance for you and are often more reliable than rolling your own solution.

Remember: scraping is a cat-and-mouse game. What works today might not work next month. Stay adaptable, keep learning, and always have a backup plan.