Gymshark runs on Shopify, which exposes product data through JSON endpoints and predictable URL structures. This guide covers five different methods to extract their entire product catalog.
The fastest way to scrape Gymshark is using Shopify's /products.json endpoint combined with TLS fingerprint spoofing via curl_cffi. This approach runs 10x faster than browser automation while bypassing Cloudflare's bot detection. You'll extract 2,000+ products in under 30 seconds using the methods below.
Why Scrape Gymshark?
Gymshark maintains over 2,000 fitness products across three regional stores. The UK brand updates inventory multiple times daily during sales events.
Price monitoring captures flash sales before they sell out. Competitor analysis reveals pricing strategies and new product launches.
Stock tracking shows which items sell fastest. This data feeds into market research and inventory planning systems.
Gymshark's Technical Architecture
Gymshark operates three separate Shopify stores with identical backend structures:
| Region | Domain | Currency |
|---|---|---|
| US | gymshark.com | USD |
| UK | uk.gymshark.com | GBP |
| Rest of World | row.gymshark.com | Regional |
Each store maintains its own product catalog, pricing, and inventory levels. The Shopify backend stays consistent across all three.
Your scraper code works on any Gymshark site. Just swap the domain.
Method 1: JSON Endpoint with curl_cffi (Recommended)
Standard HTTP libraries like requests get blocked by Cloudflare's TLS fingerprinting. The curl_cffi library solves this by mimicking real browser TLS signatures.
Install Dependencies
pip install curl_cffi xmltodict pandas
This installs the TLS-spoofing HTTP client, XML parser, and data handling library.
Extract Product URLs from Sitemap
Gymshark publishes a product sitemap at /sitemap_products_1.xml. This contains every product URL on the site.
from curl_cffi import requests
import xmltodict
SITEMAP_URL = 'https://www.gymshark.com/sitemap_products_1.xml'
response = requests.get(
SITEMAP_URL,
impersonate="chrome"
)
sitemap_data = xmltodict.parse(response.text)
The impersonate="chrome" parameter makes the request appear to come from a real Chrome browser. Cloudflare sees a valid TLS fingerprint and allows the request through.
# Extract all product URLs from sitemap
product_urls = []
for item in sitemap_data['urlset']['url']:
url = item['loc']
if '/products/' in url:
product_urls.append(url)
print(f"Found {len(product_urls)} products")
This typically returns 2,200+ product URLs. The sitemap updates daily as new products launch.
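If you plan to use the per-product JSON endpoints later (Method 3), each product's handle can be derived directly from its sitemap URL. A minimal sketch using the standard library (the handle_from_url helper is illustrative, not part of the code above):

from urllib.parse import urlparse

def handle_from_url(product_url):
    """Illustrative helper: the Shopify handle is the last path segment of a product URL."""
    path = urlparse(product_url).path
    return path.rstrip('/').split('/')[-1]

handles = [handle_from_url(url) for url in product_urls]
print(handles[:5])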
Fetch Products via JSON API
Shopify's /products.json endpoint returns structured data without HTML parsing. Each request fetches up to 250 products.
from curl_cffi import requests
import time
def fetch_products_page(domain, page=1):
"""Fetch a single page of products from Shopify JSON API."""
url = f"https://{domain}/products.json?limit=250&page={page}"
response = requests.get(url, impersonate="chrome")
if response.status_code != 200:
return []
data = response.json()
return data.get('products', [])
The function returns an empty list once pagination ends, since Shopify serves an empty products array past the last page.
def scrape_all_products(domain):
"""Scrape complete product catalog from a Gymshark store."""
all_products = []
page = 1
while True:
products = fetch_products_page(domain, page)
if not products:
break
all_products.extend(products)
print(f"Page {page}: {len(products)} products")
page += 1
time.sleep(0.3) # Rate limiting
return all_products
The 0.3-second delay between requests prevents triggering Shopify's abuse detection. Aggressive scraping leads to temporary IP blocks.
# Scrape US store
products = scrape_all_products('www.gymshark.com')
print(f"Total products: {len(products)}")
Expect around 2,200 products in 8-10 pages. The entire catalog downloads in under 30 seconds.
Parse Product Data
Each product object contains nested variant information. Extract the fields you need into a flat structure.
def parse_product(product):
"""Extract relevant fields from product JSON."""
parsed = {
'id': product['id'],
'title': product['title'],
'vendor': product['vendor'],
'product_type': product['product_type'],
'handle': product['handle'],
'created_at': product['created_at'],
'updated_at': product['updated_at'],
'tags': ', '.join(product.get('tags', [])),
}
# Get primary variant pricing
if product.get('variants'):
variant = product['variants'][0]
parsed['price'] = variant.get('price')
parsed['compare_at_price'] = variant.get('compare_at_price')
parsed['sku'] = variant.get('sku')
parsed['available'] = variant.get('available')
return parsed
The compare_at_price field shows the original price when items are on sale. This helps identify discounted products.
import pandas as pd
# Parse all products
parsed_products = [parse_product(p) for p in products]
# Create DataFrame
df = pd.DataFrame(parsed_products)
df.to_csv('gymshark_products.csv', index=False)
print(df.head())
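When an item is on sale, compare_at_price holds the original price (see above), so a quick filter on the DataFrame surfaces the discounted rows. A sketch, assuming the df built in the previous step:

# Convert price columns to numbers; compare_at_price is None when an item isn't discounted
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['compare_at_price'] = pd.to_numeric(df['compare_at_price'], errors='coerce')

on_sale = df[df['compare_at_price'] > df['price']].copy()
on_sale['discount_pct'] = ((on_sale['compare_at_price'] - on_sale['price'])
                           / on_sale['compare_at_price'] * 100).round(1)
print(on_sale[['title', 'price', 'compare_at_price', 'discount_pct']].head())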
Method 2: Async Scraping for Speed
Sequential requests waste time waiting for responses. Async scraping processes multiple requests concurrently.
from curl_cffi.requests import AsyncSession
import asyncio
async def fetch_page_async(session, domain, page):
"""Fetch a single page asynchronously."""
url = f"https://{domain}/products.json?limit=250&page={page}"
response = await session.get(url, impersonate="chrome")
if response.status_code != 200:
return []
data = response.json()
return data.get('products', [])
The AsyncSession handles concurrent connections efficiently. Each request runs independently without blocking others.
async def scrape_all_async(domain, max_pages=20):
"""Scrape all products using concurrent requests."""
async with AsyncSession() as session:
# Create tasks for all pages
tasks = [
fetch_page_async(session, domain, page)
for page in range(1, max_pages + 1)
]
# Execute all requests concurrently
results = await asyncio.gather(*tasks)
# Flatten results
all_products = []
for page_products in results:
all_products.extend(page_products)
return all_products
# Run async scraper
products = asyncio.run(scrape_all_async('www.gymshark.com'))
print(f"Scraped {len(products)} products")
Async scraping completes in 3-5 seconds versus roughly 30 seconds for the sequential approach. Set max_pages high enough to capture all products.
Batch Processing with Rate Limits
Too many concurrent requests trigger rate limits. Batch processing balances speed and reliability.
async def scrape_with_batches(domain, batch_size=5):
"""Scrape products in controlled batches."""
async with AsyncSession() as session:
all_products = []
page = 1
while True:
# Create batch of tasks
tasks = [
fetch_page_async(session, domain, page + i)
for i in range(batch_size)
]
results = await asyncio.gather(*tasks)
# Check if we've reached the end
batch_products = []
for result in results:
batch_products.extend(result)
if not batch_products:
break
all_products.extend(batch_products)
page += batch_size
await asyncio.sleep(0.5) # Pause between batches
return all_products
This approach scrapes 5 pages at once, pauses briefly, then continues. You get speed without overwhelming the server.
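Run it the same way as the earlier async scraper:

products = asyncio.run(scrape_with_batches('www.gymshark.com'))
print(f"Scraped {len(products)} products in batches")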
Method 3: Individual Product Pages
Some product details only appear on individual product pages. The JSON endpoint provides basic info, but product pages contain reviews, sizing guides, and additional images.
Fetch Single Product JSON
Each product has a dedicated JSON endpoint at /products/{handle}.json.
def fetch_product_details(domain, handle):
"""Fetch detailed product data from individual endpoint."""
url = f"https://{domain}/products/{handle}.json"
response = requests.get(url, impersonate="chrome")
if response.status_code != 200:
return None
return response.json().get('product')
Individual product endpoints return the same structure as the list endpoint, so they're useful for on-demand checks of a single product without re-downloading the whole catalog. Use this for real-time price checks.
# Get detailed info for a specific product
product = fetch_product_details(
'www.gymshark.com',
'gymshark-vital-seamless-2-0-leggings'
)
if product:
print(f"Title: {product['title']}")
print(f"Variants: {len(product['variants'])}")
Extract All Variants
Products like leggings have multiple size and color variants. Each variant has its own price, SKU, and availability.
def extract_variants(product):
"""Extract all variant information from a product."""
variants = []
for variant in product.get('variants', []):
variants.append({
'product_id': product['id'],
'product_title': product['title'],
'variant_id': variant['id'],
'variant_title': variant['title'],
'price': variant['price'],
'compare_at_price': variant.get('compare_at_price'),
'sku': variant.get('sku'),
'available': variant.get('available'),
'inventory_quantity': variant.get('inventory_quantity'),
})
return variants
Inventory quantity shows exact stock levels when available. Some stores hide this field for competitive reasons.
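Combining the two helpers above, a short sketch pulls every variant of one product into a DataFrame (the handle is the same example used earlier):

import pandas as pd

product = fetch_product_details(
    'www.gymshark.com',
    'gymshark-vital-seamless-2-0-leggings'
)
if product:
    variants_df = pd.DataFrame(extract_variants(product))
    print(variants_df[['variant_title', 'price', 'available']].head())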
Method 4: Browser Automation with Playwright
JavaScript-rendered content requires browser automation. Gymshark's product pages load reviews and dynamic pricing through JavaScript.
Basic Playwright Setup
pip install playwright playwright-stealth
playwright install chromium
The stealth plugin patches detection signatures that Cloudflare looks for.
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
def scrape_with_browser(url):
"""Scrape page content using headless browser."""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
)
page = context.new_page()
stealth_sync(page) # Apply stealth patches
page.goto(url, wait_until='networkidle')
content = page.content()
browser.close()
return content
The stealth_sync() function modifies browser fingerprints to avoid detection. It patches navigator.webdriver and other telltale properties.
Extract Product Details from HTML
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
def scrape_product_page(url):
"""Extract product details from rendered page."""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
stealth_sync(page)
page.goto(url, wait_until='networkidle')
# Wait for product details to load
page.wait_for_selector('[data-testid="product-title"]', timeout=10000)
# Extract data
title = page.locator('[data-testid="product-title"]').inner_text()
price = page.locator('[data-testid="product-price"]').inner_text()
# Get all available sizes
sizes = page.locator('[data-testid="size-button"]').all_inner_texts()
browser.close()
return {
'title': title,
'price': price,
'sizes': sizes
}
These data-testid selectors change periodically as Gymshark updates its frontend. Check the page source when scraping fails.
Intercept Network Requests
Browser automation can capture API responses that pages make internally. This reveals hidden data endpoints.
def intercept_api_calls(url):
"""Capture API responses made by the page."""
api_responses = []
def handle_response(response):
if '/api/' in response.url or 'graphql' in response.url:
try:
api_responses.append({
'url': response.url,
'status': response.status,
'body': response.json()
})
            except Exception:
                pass  # response body wasn't JSON; skip it
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
stealth_sync(page)
page.on('response', handle_response)
page.goto(url, wait_until='networkidle')
browser.close()
return api_responses
This technique often discovers undocumented APIs. Gymshark may use internal endpoints for reviews, recommendations, or inventory checks.
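A quick way to see what a page calls under the hood (the product URL is just an example):

responses = intercept_api_calls(
    'https://www.gymshark.com/products/gymshark-vital-seamless-2-0-leggings'
)
for r in responses:
    print(r['status'], r['url'])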
Method 5: Collections and Category Scraping
Shopify organizes products into collections. Scraping by collection helps filter specific product types.
List All Collections
def fetch_collections(domain):
"""Fetch all product collections."""
url = f"https://{domain}/collections.json"
response = requests.get(url, impersonate="chrome")
if response.status_code != 200:
return []
data = response.json()
return data.get('collections', [])
Collections represent categories like "Women's Leggings" or "Men's Hoodies". Each has a handle for filtering.
# Get all collections
collections = fetch_collections('www.gymshark.com')
for collection in collections[:10]:
print(f"{collection['title']}: {collection['handle']}")
Scrape Products by Collection
def fetch_collection_products(domain, collection_handle, page=1):
"""Fetch products from a specific collection."""
url = f"https://{domain}/collections/{collection_handle}/products.json"
url += f"?limit=250&page={page}"
response = requests.get(url, impersonate="chrome")
if response.status_code != 200:
return []
data = response.json()
return data.get('products', [])
Collection-based scraping targets specific product categories without downloading the entire catalog.
# Scrape women's leggings only
leggings = fetch_collection_products(
'www.gymshark.com',
'womens-leggings'
)
print(f"Found {len(leggings)} leggings")
Avoiding Detection and Blocks
Gymshark uses Cloudflare for bot protection. Here's how to stay under the radar.
Rotate User Agents
import random
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
]
def get_random_headers():
"""Generate randomized request headers."""
return {
'User-Agent': random.choice(USER_AGENTS),
'Accept': 'application/json, text/plain, */*',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
}
Rotate user agents every 50-100 requests. Consistent fingerprints across thousands of requests raise flags.
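curl_cffi accepts a headers dict just like the standard requests library, so the rotated headers slot straight into the earlier calls. A sketch using the get_random_headers function defined above:

from curl_cffi import requests

url = 'https://www.gymshark.com/products.json?limit=250&page=1'
response = requests.get(
    url,
    impersonate="chrome",
    headers=get_random_headers()  # rotate per request or every N requests
)
print(response.status_code)

Since impersonate="chrome" presents a Chrome TLS fingerprint, keep the rotated user agents Chrome-based (as the list above already does) so headers and fingerprint stay consistent.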
Implement Proxy Rotation
Residential proxies avoid IP-based blocks when scraping at scale.
def fetch_with_proxy(url, proxy_url):
"""Make request through proxy server."""
proxies = {
'http': proxy_url,
'https': proxy_url
}
response = requests.get(
url,
impersonate="chrome",
proxies=proxies
)
return response
Datacenter proxies get blocked quickly. Residential proxies from providers like Roundproxies.com mimic real user connections.
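A simple rotation picks a random proxy per request. The proxy URLs below are placeholders for whatever endpoints your provider gives you:

import random
from curl_cffi import requests

# Placeholder proxy URLs -- substitute your provider's endpoints
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
]

def fetch_with_rotating_proxy(url):
    """Send each request through a randomly chosen proxy."""
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        impersonate="chrome",
        proxies={'http': proxy, 'https': proxy}
    )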
Handle Rate Limiting
import time
from curl_cffi import requests
def fetch_with_retry(url, max_retries=3):
"""Fetch URL with exponential backoff on failure."""
for attempt in range(max_retries):
try:
response = requests.get(url, impersonate="chrome")
if response.status_code == 429:
wait_time = 2 ** attempt * 5
print(f"Rate limited. Waiting {wait_time}s...")
time.sleep(wait_time)
continue
            if response.status_code == 403:
                # Blocked outright; pause before retrying (swap in a fresh proxy here if you rotate them)
                print("Blocked (403). Pausing before retry...")
                time.sleep(10)
                continue
return response
except Exception as e:
print(f"Error: {e}")
time.sleep(2 ** attempt)
return None
Exponential backoff prevents hammering a blocked server. Wait times increase with each failed attempt.
Build a Price Monitoring System
Automated price tracking catches sales and restocks instantly.
import json
import schedule
import time
from datetime import datetime
PRICE_FILE = 'price_history.json'
def load_price_history():
"""Load previous price data."""
try:
with open(PRICE_FILE, 'r') as f:
return json.load(f)
except FileNotFoundError:
return {}
def save_price_history(data):
"""Save price data to file."""
with open(PRICE_FILE, 'w') as f:
json.dump(data, f, indent=2)
Persistent storage tracks prices over time. JSON format makes it easy to analyze trends.
def check_for_price_drops():
"""Compare current prices to historical data."""
current_products = scrape_all_products('www.gymshark.com')
history = load_price_history()
alerts = []
for product in current_products:
product_id = str(product['id'])
current_price = float(product['variants'][0]['price'])
if product_id in history:
old_price = history[product_id]['price']
if current_price < old_price:
drop_pct = ((old_price - current_price) / old_price) * 100
alerts.append({
'title': product['title'],
'old_price': old_price,
'new_price': current_price,
'drop_percent': round(drop_pct, 1),
'url': f"https://www.gymshark.com/products/{product['handle']}"
})
history[product_id] = {
'price': current_price,
'title': product['title'],
'updated': datetime.now().isoformat()
}
save_price_history(history)
return alerts
The function returns a list of products with price drops. Set up notifications via email, Slack, or Discord.
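For example, a Slack incoming webhook only needs a JSON payload with a text field. A sketch, with a placeholder webhook URL you'd create in your own workspace:

from curl_cffi import requests

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder

def send_slack_alerts(alerts):
    """Post each price-drop alert to a Slack channel via an incoming webhook."""
    for alert in alerts:
        text = (f"{alert['title']}: ${alert['old_price']} -> ${alert['new_price']} "
                f"({alert['drop_percent']}% off)\n{alert['url']}")
        requests.post(SLACK_WEBHOOK_URL, json={'text': text})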
def run_monitor():
"""Execute price check and display alerts."""
print(f"\n[{datetime.now()}] Checking prices...")
alerts = check_for_price_drops()
if alerts:
print(f"\n🚨 {len(alerts)} PRICE DROPS FOUND!")
for alert in alerts:
print(f"\n{alert['title']}")
print(f" ${alert['old_price']} → ${alert['new_price']}")
print(f" {alert['drop_percent']}% off")
print(f" {alert['url']}")
else:
print("No price changes detected")
# Schedule checks every 6 hours
schedule.every(6).hours.do(run_monitor)
# Run initial check
run_monitor()
# Keep running
while True:
schedule.run_pending()
time.sleep(60)
Deploy this on a cloud server for 24/7 monitoring. For a serverless setup, drop the while loop and trigger check_for_price_drops() on a schedule with AWS Lambda or Google Cloud Functions instead.
Export Data for Analysis
Different output formats serve different use cases.
CSV Export with Pandas
import pandas as pd
def export_to_csv(products, filename='gymshark_products.csv'):
"""Export products to CSV file."""
rows = []
for product in products:
for variant in product.get('variants', []):
rows.append({
'product_id': product['id'],
'title': product['title'],
'handle': product['handle'],
'vendor': product['vendor'],
'product_type': product['product_type'],
'variant_id': variant['id'],
'variant_title': variant['title'],
'price': variant['price'],
'compare_at_price': variant.get('compare_at_price'),
'sku': variant.get('sku'),
'available': variant.get('available'),
'created_at': product['created_at'],
'updated_at': product['updated_at'],
})
df = pd.DataFrame(rows)
df.to_csv(filename, index=False)
print(f"Exported {len(rows)} variants to {filename}")
return df
CSV works with Excel, Google Sheets, and most analytics tools. Each row represents a single product variant.
JSON Export for APIs
import json
def export_to_json(products, filename='gymshark_products.json'):
"""Export products to JSON file."""
with open(filename, 'w') as f:
json.dump(products, f, indent=2)
print(f"Exported {len(products)} products to {filename}")
JSON preserves the nested structure. Useful for feeding data into web applications or databases.
SQLite Database Storage
import sqlite3
def create_database():
"""Initialize SQLite database for product storage."""
conn = sqlite3.connect('gymshark.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY,
title TEXT,
handle TEXT UNIQUE,
vendor TEXT,
product_type TEXT,
created_at TEXT,
updated_at TEXT
)
''')
cursor.execute('''
CREATE TABLE IF NOT EXISTS variants (
id INTEGER PRIMARY KEY,
product_id INTEGER,
title TEXT,
price REAL,
compare_at_price REAL,
sku TEXT,
available BOOLEAN,
FOREIGN KEY (product_id) REFERENCES products(id)
)
''')
cursor.execute('''
CREATE TABLE IF NOT EXISTS price_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
variant_id INTEGER,
price REAL,
recorded_at TEXT,
FOREIGN KEY (variant_id) REFERENCES variants(id)
)
''')
conn.commit()
return conn
SQLite stores historical data efficiently. Query past prices to identify sale patterns.
def save_to_database(products, conn):
"""Save scraped products to SQLite database."""
cursor = conn.cursor()
for product in products:
# Insert or update product
cursor.execute('''
INSERT OR REPLACE INTO products
(id, title, handle, vendor, product_type, created_at, updated_at)
VALUES (?, ?, ?, ?, ?, ?, ?)
''', (
product['id'],
product['title'],
product['handle'],
product['vendor'],
product['product_type'],
product['created_at'],
product['updated_at']
))
# Insert variants
for variant in product.get('variants', []):
cursor.execute('''
INSERT OR REPLACE INTO variants
(id, product_id, title, price, compare_at_price, sku, available)
VALUES (?, ?, ?, ?, ?, ?, ?)
''', (
variant['id'],
product['id'],
variant['title'],
variant['price'],
variant.get('compare_at_price'),
variant.get('sku'),
variant.get('available')
))
# Record price history
cursor.execute('''
INSERT INTO price_history (variant_id, price, recorded_at)
VALUES (?, ?, datetime('now'))
''', (variant['id'], variant['price']))
conn.commit()
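Once a few scrape runs have accumulated, a simple query against the schema above surfaces variants whose price moved recently. A sketch using the same sqlite3 connection:

def recent_price_changes(conn, days=7):
    """List variants whose recorded price changed within the last N days."""
    cursor = conn.cursor()
    cursor.execute('''
        SELECT variant_id,
               MIN(price) AS lowest_price,
               MAX(price) AS highest_price,
               COUNT(*)   AS samples
        FROM price_history
        WHERE recorded_at >= datetime('now', ?)
        GROUP BY variant_id
        HAVING MIN(price) < MAX(price)
        ORDER BY (MAX(price) - MIN(price)) DESC
    ''', (f'-{days} days',))
    return cursor.fetchall()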
Multi-Region Scraping
Compare prices across Gymshark's regional stores.
REGIONS = {
'US': 'www.gymshark.com',
'UK': 'uk.gymshark.com',
'ROW': 'row.gymshark.com'
}
def scrape_all_regions():
"""Scrape products from all regional stores."""
all_data = {}
for region, domain in REGIONS.items():
print(f"\nScraping {region} store...")
products = scrape_all_products(domain)
all_data[region] = products
print(f" Found {len(products)} products")
time.sleep(2) # Pause between regions
return all_data
Cross-region analysis reveals pricing arbitrage opportunities. The same leggings might cost less in one region.
def compare_regional_prices(all_data):
"""Compare prices for same products across regions."""
comparisons = []
# Build lookup by handle (product identifier)
us_products = {p['handle']: p for p in all_data.get('US', [])}
uk_products = {p['handle']: p for p in all_data.get('UK', [])}
for handle, us_product in us_products.items():
if handle in uk_products:
uk_product = uk_products[handle]
us_price = float(us_product['variants'][0]['price'])
uk_price = float(uk_product['variants'][0]['price'])
# Rough GBP to USD conversion
uk_price_usd = uk_price * 1.27
comparisons.append({
'title': us_product['title'],
'us_price': us_price,
'uk_price_gbp': uk_price,
'uk_price_usd': round(uk_price_usd, 2),
'difference': round(us_price - uk_price_usd, 2)
})
return comparisons
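Putting both functions together and listing the largest gaps:

all_data = scrape_all_regions()
comparisons = compare_regional_prices(all_data)

# Sort by absolute price gap and show the biggest discrepancies
comparisons.sort(key=lambda c: abs(c['difference']), reverse=True)
for comp in comparisons[:10]:
    print(f"{comp['title']}: US ${comp['us_price']} vs UK ${comp['uk_price_usd']} "
          f"(difference ${comp['difference']})")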
Error Handling and Logging
Production scrapers need robust error handling.
import logging
from datetime import datetime
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('scraper.log'),
logging.StreamHandler()
]
)
logger = logging.getLogger(__name__)
def scrape_with_logging(domain):
"""Scrape products with comprehensive logging."""
logger.info(f"Starting scrape of {domain}")
start_time = datetime.now()
try:
products = scrape_all_products(domain)
elapsed = (datetime.now() - start_time).total_seconds()
logger.info(f"Completed: {len(products)} products in {elapsed:.1f}s")
return products
except Exception as e:
logger.error(f"Scrape failed: {str(e)}")
raise
Logs help debug issues in production. Store them separately from output data.
Common Errors and Solutions
403 Forbidden: Cloudflare blocked your request. Use curl_cffi with impersonate="chrome" or switch to residential proxies.
Empty JSON Response: You've reached the end of pagination. This is expected behavior, not an error.
Connection Timeout: Network issue or server overload. Implement retry logic with exponential backoff.
Rate Limit (429): Too many requests. Slow down and add delays between requests.
Invalid JSON: The endpoint returned HTML instead of JSON. This usually means a CAPTCHA page or error. Check the response content.
SSL Certificate Error: TLS handshake failed. Update curl_cffi to the latest version.
Legal and Ethical Considerations
Gymshark's /products.json endpoint serves public data intended for search engines and third-party integrations. The sitemap explicitly lists product URLs for crawlers.
Respect rate limits. Don't hammer their servers with thousands of requests per minute.
Avoid scraping customer accounts or checkout processes. That crosses into unauthorized access territory.
Check their robots.txt for any restrictions. As of 2026, Gymshark doesn't block the JSON endpoints used in this guide.
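Checking robots.txt takes a single request with the same client used throughout this guide:

from curl_cffi import requests

response = requests.get('https://www.gymshark.com/robots.txt', impersonate="chrome")
print(response.text[:500])  # review the Disallow rules yourself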
Complete Working Scraper
Here's a production-ready scraper combining all techniques:
#!/usr/bin/env python3
"""
Gymshark Product Scraper - Complete Implementation
Extracts all products from Gymshark Shopify stores.
"""
from curl_cffi import requests
import pandas as pd
import time
import logging
import json
from datetime import datetime
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
class GymsharkScraper:
"""Scraper for Gymshark product data."""
DOMAINS = {
'US': 'www.gymshark.com',
'UK': 'uk.gymshark.com',
'ROW': 'row.gymshark.com'
}
def __init__(self, region='US'):
self.domain = self.DOMAINS.get(region, self.DOMAINS['US'])
self.region = region
def fetch_page(self, page=1, max_retries=3):
"""Fetch a single page of products."""
url = f"https://{self.domain}/products.json?limit=250&page={page}"
for attempt in range(max_retries):
try:
response = requests.get(url, impersonate="chrome")
if response.status_code == 200:
return response.json().get('products', [])
if response.status_code == 429:
wait = 2 ** attempt * 5
logger.warning(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
except Exception as e:
logger.error(f"Request failed: {e}")
time.sleep(2 ** attempt)
return []
def scrape_all(self):
"""Scrape complete product catalog."""
all_products = []
page = 1
while True:
products = self.fetch_page(page)
if not products:
break
all_products.extend(products)
logger.info(f"Page {page}: {len(products)} products")
page += 1
time.sleep(0.3)
logger.info(f"Total: {len(all_products)} products from {self.region}")
return all_products
def to_dataframe(self, products):
"""Convert products to pandas DataFrame."""
rows = []
for p in products:
for v in p.get('variants', []):
rows.append({
'product_id': p['id'],
'title': p['title'],
'handle': p['handle'],
'type': p['product_type'],
'variant_id': v['id'],
'variant': v['title'],
'price': v['price'],
'compare_price': v.get('compare_at_price'),
'available': v.get('available'),
'sku': v.get('sku'),
})
return pd.DataFrame(rows)
def export_csv(self, products, filename=None):
"""Export products to CSV."""
if filename is None:
filename = f"gymshark_{self.region.lower()}_{datetime.now():%Y%m%d}.csv"
df = self.to_dataframe(products)
df.to_csv(filename, index=False)
logger.info(f"Exported to {filename}")
return filename
def main():
"""Main execution function."""
# Scrape US store
scraper = GymsharkScraper(region='US')
products = scraper.scrape_all()
scraper.export_csv(products)
# Optional: scrape other regions
# for region in ['UK', 'ROW']:
# scraper = GymsharkScraper(region=region)
# products = scraper.scrape_all()
# scraper.export_csv(products)
if __name__ == '__main__':
main()
Save this as gymshark_scraper.py and run with python gymshark_scraper.py.
Next Steps
You now have multiple methods to scrape Gymshark's product catalog. The JSON endpoint approach works best for bulk data extraction.
For real-time monitoring, set up scheduled scraping with price comparison alerts. Cloud functions handle this efficiently.
Expand to other Shopify stores using the same techniques. The /products.json endpoint exists on most Shopify sites.
Consider building a web interface to browse and filter scraped data. React or Vue work well for this.
Frequently Asked Questions
Is it legal to scrape Gymshark?
Scraping publicly available product data is generally legal. The JSON endpoints serve data intended for public consumption. Don't scrape user accounts or bypass authentication.
How often should I scrape?
For price monitoring, every 6-12 hours captures most changes. During sales events, hourly checks catch flash deals.
Why do I get blocked even with curl_cffi?
You may need residential proxies for high-volume scraping. Datacenter IPs get flagged regardless of TLS fingerprint. Try rotating proxies between requests.
Can I scrape Gymshark reviews?
Reviews load via JavaScript. Use Playwright to render the page and intercept the review API calls. The network interception technique reveals the review endpoint.
How do I handle CAPTCHAs?
The JSON endpoint rarely triggers CAPTCHAs. If you see them, reduce request frequency or use residential proxies. CAPTCHA solving services work as a last resort.
What's the best proxy provider for Gymshark?
Residential proxies work best. Mobile proxies from UK and US locations have the highest success rates.
Conclusion
Gymshark's Shopify backend provides clean JSON access to their entire product catalog. The curl_cffi library bypasses TLS fingerprinting that blocks standard HTTP clients.
Start with the JSON endpoint method for bulk scraping. Add browser automation only when you need JavaScript-rendered content like reviews.
Rate limiting and proxy rotation keep your scraper running long-term. Monitor logs for blocks and adjust your approach accordingly.
The complete scraper class handles all common scenarios. Customize it for your specific use case.