How to Scrape Gymshark: Complete Python Guide (2025)

Gymshark runs on Shopify, which exposes product data through public JSON endpoints; browser automation covers anything the endpoints miss. This guide shows you both methods.

You'll learn the fastest way to scrape Gymshark products, handle pagination limits, and avoid common blocking issues.

What's the Fastest Way to Scrape Gymshark?

Gymshark runs on Shopify, whose /products.json endpoint returns structured product data with no HTML parsing. This approach runs roughly 5x faster than browser automation and works on every Gymshark regional site (US, UK, and Rest of World). For JavaScript-heavy pages, combine it with Playwright or Selenium for complete data extraction.

Why Scrape Gymshark?

Gymshark sells over 2,000 fitness products across multiple regions. Price monitoring reveals when products drop below retail.

Stock tracking shows which items sell fastest. This data helps with competitive analysis and inventory planning.

The site updates frequently. Scraping captures these changes automatically.

Gymshark's Technical Structure

Gymshark operates three regional Shopify stores: gymshark.com (US), uk.gymshark.com (UK), and row.gymshark.com (Rest of World).

Each store maintains separate product catalogs and pricing. The backend structure stays identical across regions.

This makes your scraper reusable. Change the domain and scrape any Gymshark site.

Method 1: The Shopify JSON Endpoint

This method extracts data directly from Shopify's product API. No HTML parsing needed.

Get All Product URLs

Gymshark publishes a sitemap with every product URL at /sitemap_products_1.xml.

import requests
import xmltodict

# Target the Gymshark US site
SITEMAP = 'https://www.gymshark.com/sitemap_products_1.xml'

# Fetch sitemap content
response = requests.get(SITEMAP)
sitemap = xmltodict.parse(response.text)

# Extract all product URLs
urls = []
for item in sitemap['urlset']['url']:
    urls.append(item['loc'])

print(f"Found {len(urls)} products")

This returns ~2,200 product URLs. The sitemap updates daily with new products.

Store URLs in a text file for batch processing later.
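One simple way to do that, reusing the urls list from the sitemap snippet above:

# Write one URL per line
with open('gymshark_urls.txt', 'w') as f:
    f.write('\n'.join(urls))

# Reload the list when the batch job starts
with open('gymshark_urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]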

Extract Product Data

The /products.json endpoint returns up to 250 products per request. Pagination handles larger catalogs.

import requests

def fetch_products(page=1):
    url = f'https://www.gymshark.com/products.json?limit=250&page={page}'
    
    response = requests.get(url)
    data = response.json()
    
    return data['products']

# Get first 250 products
products = fetch_products(page=1)

for product in products:
    print(f"Title: {product['title']}")
    print(f"Price: {product['variants'][0]['price']}")
    print(f"Available: {product['available']}")
    print("---")

Each product includes title, vendor, product type, and images. Variants carry the size-specific pricing, SKU, and availability data.

The API returns clean JSON. No CSS selectors or HTML parsing required.
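As a rough sketch (field names follow Shopify's standard products.json schema; the values below are placeholders, and real objects carry more fields):

{
  "id": 1234567890,
  "title": "Example Leggings",
  "handle": "example-leggings",
  "vendor": "Gymshark",
  "product_type": "Leggings",
  "variants": [
    {
      "id": 9876543210,
      "title": "S",
      "sku": "EXAMPLE-SKU-S",
      "price": "50.00",
      "available": true
    }
  ]
}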

Handle Pagination

Shopify limits results to 250 products per request. Loop through pages to get everything.

import time

def scrape_all_products():
    page = 1
    all_products = []
    
    while True:
        products = fetch_products(page=page)
        
        # Stop when no more products
        if not products:
            break
            
        all_products.extend(products)
        page += 1
        
        # Rate limiting - be respectful
        time.sleep(0.5)
    
    return all_products

# Get complete product catalog
complete_catalog = scrape_all_products()
print(f"Total products: {len(complete_catalog)}")

This scrapes the entire Gymshark catalog systematically. Takes about 10 seconds for 2,200 products.

Rate limiting prevents server overload. Stay under 2 requests per second.

Method 2: Browser Automation

Use this when JavaScript renders product data dynamically. Playwright handles modern web apps better than Selenium.

Setup Playwright
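If Playwright isn't installed yet, it needs both the Python package and a browser binary:

pip install playwright
playwright install chromium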

from playwright.sync_api import sync_playwright

def scrape_with_browser(url):
    with sync_playwright() as p:
        # Launch browser in headless mode
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        # Navigate to product page
        page.goto(url)
        
        # Wait for product details to load
        page.wait_for_selector('.ProductTitle_product-title__2dbjR')
        
        # Extract data
        title = page.locator('.ProductTitle_product-title__2dbjR').inner_text()
        price = page.locator('.ProductPrice_product-price__1VQdR').inner_text()
        
        browser.close()
        
        return {
            'title': title,
            'price': price
        }

This method works when Gymshark's frontend loads data with JavaScript. Slower than JSON but handles dynamic content.

The selector classes change occasionally. Check the page source when scraping fails.

Method Comparison: Which Should You Use?

Method | Speed | Complexity | Blocking Risk | Best For
JSON endpoint | Fast (~10s for 2,000+ products) | Low | Low | Bulk scraping, price monitoring
Browser automation | Slow (5-10s per product) | Medium | High | Dynamic content, JavaScript-heavy pages
Third-party APIs | Medium | Very low | Very low | Quick projects, no maintenance

JSON endpoints win for most use cases. Browser automation helps when JavaScript hides data.

Avoid Getting Blocked

Gymshark uses Cloudflare for bot detection. Here's how to scrape Gymshark without issues.

Rotate User Agents

import random
import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]

headers = {
    'User-Agent': random.choice(user_agents)
}

response = requests.get(url, headers=headers)

This mimics different browsers. Cloudflare checks User-Agent strings.

Rotate after every 50-100 requests for best results.
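A minimal sketch of that rotation, reusing the user_agents list above (the 50-request threshold is just the guideline, not a hard limit):

import random
import requests

ROTATE_EVERY = 50  # swap the User-Agent roughly every 50 requests

session = requests.Session()
request_count = 0

def get_with_rotation(url, **kwargs):
    # Pick a fresh User-Agent every ROTATE_EVERY requests, reuse one session otherwise
    global request_count
    if request_count % ROTATE_EVERY == 0:
        session.headers['User-Agent'] = random.choice(user_agents)
    request_count += 1
    return session.get(url, **kwargs)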

Use Proxies for Scale

Residential proxies avoid IP bans when scraping thousands of products daily.

proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080'
}

response = requests.get(url, proxies=proxies)

Free proxies fail frequently. Paid services like BrightData or Oxylabs work better.

Rotate proxies every 100-200 requests to stay under the radar.
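A minimal rotation sketch using itertools.cycle; the proxy URLs are placeholders for whatever endpoints your provider gives you:

from itertools import cycle
import requests

# Placeholder endpoints - substitute your provider's credentials
proxy_pool = cycle([
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
])

def get_via_proxy(url):
    # Each call moves to the next proxy in the pool
    proxy = next(proxy_pool)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=30)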

Regional Site Differences

UK and ROW sites use identical structures. Change the domain to scrape them.

# US site
us_url = 'https://www.gymshark.com/products.json'

# UK site
uk_url = 'https://uk.gymshark.com/products.json'

# Rest of World
row_url = 'https://row.gymshark.com/products.json'

Pricing differs by region. The same leggings cost $60 on the US site and £48 on the UK site.

Each region prices in its local currency. Scrape multiple regions to spot arbitrage opportunities.
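For example, a rough cross-region price check for a single product handle (the handle below is a placeholder; Shopify stores also serve /products/<handle>.json):

import requests

REGIONS = {
    'US': 'https://www.gymshark.com',
    'UK': 'https://uk.gymshark.com',
    'ROW': 'https://row.gymshark.com',
}

handle = 'example-leggings'  # placeholder - take real handles from the sitemap URLs

for region, base in REGIONS.items():
    response = requests.get(f'{base}/products/{handle}.json')
    if response.ok:
        product = response.json()['product']
        print(region, product['variants'][0]['price'])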

Build a Price Monitor

Track Gymshark prices automatically with scheduled scraping.

import schedule
import time
import json

def check_prices():
    products = fetch_products(page=1)
    
    # Load previous prices
    try:
        with open('prices.json', 'r') as f:
            old_prices = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        old_prices = {}
    
    # Compare and alert on drops
    for product in products:
        product_id = str(product['id'])
        current_price = float(product['variants'][0]['price'])
        
        if product_id in old_prices:
            old_price = old_prices[product_id]
            if current_price < old_price:
                print(f"Price drop: {product['title']}")
                print(f"Was ${old_price}, now ${current_price}")
        
        old_prices[product_id] = current_price
    
    # Save updated prices
    with open('prices.json', 'w') as f:
        json.dump(old_prices, f)

# Run every 6 hours
schedule.every(6).hours.do(check_prices)

while True:
    schedule.run_pending()
    time.sleep(60)

This monitors prices without manual checking and prints an alert whenever a product's price drops.

Run it on a cloud server for 24/7 monitoring.

Store Data Efficiently

Save scraped data in pandas DataFrames for analysis.

import pandas as pd

def save_to_csv(products):
    # Flatten product data
    rows = []
    for product in products:
        for variant in product['variants']:
            rows.append({
                'product_id': product['id'],
                'title': product['title'],
                'vendor': product['vendor'],
                'product_type': product['product_type'],
                'variant_id': variant['id'],
                'size': variant['title'],
                'price': variant['price'],
                'available': variant['available'],
                'sku': variant['sku']
            })
    
    df = pd.DataFrame(rows)
    df.to_csv('gymshark_products.csv', index=False)
    
    return df

# Export all products
df = save_to_csv(complete_catalog)
print(df.head())

CSV format works with Excel and Google Sheets. Easy sharing with non-technical teams.

DataFrames enable quick filtering and sorting. Find all products under $30 in seconds.
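For example, a quick filter on the DataFrame returned by save_to_csv (prices come back as strings, so cast them first):

# Variants priced under $30
cheap = df[df['price'].astype(float) < 30]
print(cheap[['title', 'size', 'price']])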

Legal and Ethical Considerations

Gymshark's robots.txt doesn't explicitly block product pages. The JSON endpoint serves public data.

Terms of service prohibit automated account actions. Don't scrape customer accounts or checkout flows.

Respect rate limits. Gymshark.com handles millions of visitors. Your scraper should blend in.

Common Errors and Fixes

"No products found" - You hit the end of pagination. This is normal.

"Timeout errors" - Add longer wait times between requests. Gymshark throttles aggressive scrapers.

"403 Forbidden" - Your IP got flagged. Switch proxies or reduce request frequency.

"Invalid JSON" - The endpoint returns 404 for removed products. Check status codes first.

Scale Your Scraping

Process products in parallel to speed up collection.

from concurrent.futures import ThreadPoolExecutor

def scrape_product(url):
    # Your scraping logic here
    pass

urls = [...]  # Your list of product URLs

with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(scrape_product, urls))
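The worker above is left as a stub. A minimal version, assuming Shopify's per-product JSON endpoint (append .json to any product URL from the sitemap), might look like:

import requests

def scrape_product(url):
    # Shopify also serves each product page as JSON at <product URL>.json
    response = requests.get(f'{url}.json', timeout=30)
    if response.status_code != 200:
        return None
    product = response.json()['product']
    return {
        'id': product['id'],
        'title': product['title'],
        'price': product['variants'][0]['price'],
    }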

Limit workers to 5-10 threads. Higher numbers trigger rate limits.

Parallel processing cuts scraping time by 80%. Handle 2,000 products in under 2 minutes.

Next Steps

Start with the JSON endpoint method. It's reliable and fast for most projects.

Add browser automation only when JavaScript blocks data extraction. This happens rarely on Gymshark.

Schedule your scraper with cron jobs or cloud functions. Daily runs capture inventory changes.

Frequently Asked Questions

Is it legal to scrape Gymshark?

Scraping public product data is generally legal. Avoid scraping user accounts or violating terms of service. Consult a lawyer for commercial use.

How often does Gymshark update products?

New products launch weekly. Prices change during sales events. Daily scraping captures most updates.

Can I scrape Gymshark reviews?

Reviews load via JavaScript. Use Playwright to access the review API endpoint after page load.

What's the best proxy for Gymshark?

Residential proxies work best. Rotate IPs from US, UK, or the target region.

How do I handle CAPTCHAs?

JSON endpoints rarely show CAPTCHAs. If you see them, reduce request frequency or use CAPTCHA-solving services.

Conclusion

The JSON endpoint is the fastest way to scrape Gymshark product data. It returns clean, structured information without HTML parsing.

Browser automation handles edge cases where JavaScript renders content. Combine both methods for complete coverage.

Rate limiting and proxies keep your scraper running. Schedule regular checks to track prices and inventory automatically.