SoundCloud scraping lets you extract track metadata, artist profiles, play counts, and audio information from one of the world's largest music platforms. This guide covers multiple approaches to scrape SoundCloud effectively in 2026, from lightweight HTTP requests to full browser automation.
Whether you're building a music analytics tool, tracking artist performance, or collecting data for research, you'll find working code examples for every skill level.
What is SoundCloud Scraping?
When you scrape SoundCloud, you programmatically extract data from its website or internal API endpoints. SoundCloud hosts over 320 million tracks from independent artists, podcasters, and creators worldwide.
The data you can extract includes track titles, artist names, play counts, likes, comments, waveform data, genre tags, and profile information. That makes SoundCloud a valuable source for music trend analysis, competitor research, and building recommendation systems.
SoundCloud loads most content dynamically through JavaScript. This presents a challenge because simple HTTP requests won't see data that gets rendered after page load.
You have three main options for scraping SoundCloud:
- API-based extraction - Intercept SoundCloud's internal API calls
- HTTP requests with session handling - Simulate browser requests directly
- Browser automation - Use Playwright or Puppeteer to render JavaScript
Each method has trade-offs between speed, reliability, and complexity. Let's explore all three.
Prerequisites and Setup
Before diving into code, set up your Python environment.
Install the required packages:
pip install requests beautifulsoup4 playwright pandas
For Playwright, you also need to install browser binaries:
playwright install chromium
Create a project folder and add this base configuration file:
# config.py
import random
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
def get_random_ua():
return random.choice(USER_AGENTS)
# Request delays to avoid rate limiting
MIN_DELAY = 1.5
MAX_DELAY = 3.0
# Base headers for requests
BASE_HEADERS = {
"Accept": "application/json, text/javascript, */*; q=0.01",
"Accept-Language": "en-US,en;q=0.9",
"Origin": "https://soundcloud.com",
"Referer": "https://soundcloud.com/",
}
This configuration rotates user agents and sets appropriate delays between requests.
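In practice, you import these values wherever you make requests. Here's a minimal sketch of the pattern, assuming config.py sits in the same folder (the discover URL is just a placeholder target):
# config_usage.py: minimal sketch of how config.py is meant to be used
import random
import time

import requests

from config import BASE_HEADERS, MAX_DELAY, MIN_DELAY, get_random_ua

urls = ["https://soundcloud.com/discover"]  # placeholder list of pages to fetch

for url in urls:
    headers = {**BASE_HEADERS, "User-Agent": get_random_ua()}  # fresh UA per request
    response = requests.get(url, headers=headers, timeout=15)
    print(url, response.status_code)
    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))  # polite delay between requests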
Method 1: Extracting the Client ID
SoundCloud's internal API requires a client_id parameter for authentication. This ID changes periodically, so you need to extract it dynamically.
Here's how to find the current client ID:
# extract_client_id.py
import requests
import re
from config import get_random_ua, BASE_HEADERS
def get_client_id():
"""
Extract SoundCloud's client_id from their JavaScript bundles.
The client_id is embedded in one of the app's JS files.
"""
headers = {**BASE_HEADERS, "User-Agent": get_random_ua()}
# First, get the main page to find JS bundle URLs
response = requests.get("https://soundcloud.com", headers=headers)
if response.status_code != 200:
raise Exception(f"Failed to load SoundCloud: {response.status_code}")
# Find all script URLs that match SoundCloud's asset pattern
script_pattern = r'https://a-v2\.sndcdn\.com/assets/[a-zA-Z0-9-]+\.js'
script_urls = re.findall(script_pattern, response.text)
# The client_id is typically in one of the last JS bundles
for url in reversed(script_urls):
js_response = requests.get(url, headers=headers)
if js_response.status_code == 200:
# Look for client_id pattern in the JS code
client_id_match = re.search(
r'client_id["\']?\s*[:=]\s*["\']([a-zA-Z0-9]{32})["\']',
js_response.text
)
if client_id_match:
return client_id_match.group(1)
raise Exception("Could not find client_id in SoundCloud scripts")
if __name__ == "__main__":
client_id = get_client_id()
print(f"Found client_id: {client_id}")
The script fetches SoundCloud's homepage, finds all JavaScript bundle URLs, then searches through them for the client ID pattern.
Save your extracted client ID because you'll use it for all API-based scraping methods.
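Because extraction costs several extra requests, consider caching the ID on disk and re-extracting only when it stops working. Here's a minimal sketch; the cache filename and six-hour expiry are arbitrary choices, not SoundCloud requirements:
# client_id_cache.py: cache the extracted client_id on disk (sketch)
import json
import time
from pathlib import Path

from extract_client_id import get_client_id

CACHE_FILE = Path("client_id_cache.json")

def get_cached_client_id(max_age_seconds=6 * 3600):
    """Return a cached client_id if it is recent enough, otherwise re-extract it."""
    if CACHE_FILE.exists():
        cached = json.loads(CACHE_FILE.read_text())
        if time.time() - cached["fetched_at"] < max_age_seconds:
            return cached["client_id"]
    client_id = get_client_id()
    CACHE_FILE.write_text(json.dumps({"client_id": client_id, "fetched_at": time.time()}))
    return client_id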
Method 2: API-Based Track Scraping
With the client ID in hand, you can query SoundCloud's internal API directly. This is the fastest approach and returns clean JSON data.
# scrape_tracks.py
import requests
import time
import random
import pandas as pd
from config import get_random_ua, BASE_HEADERS, MIN_DELAY, MAX_DELAY
from extract_client_id import get_client_id
class SoundCloudScraper:
"""
Scrape SoundCloud tracks using their internal API.
Returns structured data including play counts, likes, and metadata.
"""
def __init__(self, client_id=None):
self.client_id = client_id or get_client_id()
self.base_url = "https://api-v2.soundcloud.com"
self.session = requests.Session()
self.session.headers.update({
**BASE_HEADERS,
"User-Agent": get_random_ua()
})
def search_tracks(self, query, limit=50):
"""
Search for tracks matching a query string.
Returns list of track dictionaries with metadata.
"""
endpoint = f"{self.base_url}/search/tracks"
params = {
"q": query,
"client_id": self.client_id,
"limit": min(limit, 50), # API limit per request
"offset": 0,
}
all_tracks = []
while len(all_tracks) < limit:
response = self.session.get(endpoint, params=params)
if response.status_code != 200:
print(f"API error: {response.status_code}")
break
data = response.json()
tracks = data.get("collection", [])
if not tracks:
break
for track in tracks:
all_tracks.append(self._parse_track(track))
# Check if there are more results
next_href = data.get("next_href")
if not next_href:
break
params["offset"] += 50
# Respect rate limits
time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
return all_tracks[:limit]
def _parse_track(self, track_data):
"""
Extract relevant fields from raw API response.
Handles missing fields gracefully.
"""
user = track_data.get("user", {})
return {
"id": track_data.get("id"),
"title": track_data.get("title"),
"artist": user.get("username"),
"artist_id": user.get("id"),
"duration_ms": track_data.get("duration"),
"play_count": track_data.get("playback_count", 0),
"like_count": track_data.get("likes_count", 0),
"repost_count": track_data.get("reposts_count", 0),
"comment_count": track_data.get("comment_count", 0),
"genre": track_data.get("genre"),
"tags": track_data.get("tag_list"),
"created_at": track_data.get("created_at"),
"permalink_url": track_data.get("permalink_url"),
"downloadable": track_data.get("downloadable", False),
"artwork_url": track_data.get("artwork_url"),
}
def get_user_tracks(self, user_id, limit=100):
"""
Get all tracks uploaded by a specific user.
Useful for artist analysis and catalog scraping.
"""
endpoint = f"{self.base_url}/users/{user_id}/tracks"
params = {
"client_id": self.client_id,
"limit": 50,
"offset": 0,
}
all_tracks = []
while len(all_tracks) < limit:
response = self.session.get(endpoint, params=params)
if response.status_code != 200:
break
data = response.json()
tracks = data.get("collection", [])
if not tracks:
break
for track in tracks:
all_tracks.append(self._parse_track(track))
params["offset"] += 50
time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
return all_tracks[:limit]
def resolve_url(self, soundcloud_url):
"""
Convert a SoundCloud URL to API resource data.
Works with track, user, and playlist URLs.
"""
endpoint = f"{self.base_url}/resolve"
params = {
"url": soundcloud_url,
"client_id": self.client_id,
}
response = self.session.get(endpoint, params=params)
if response.status_code == 200:
return response.json()
return None
if __name__ == "__main__":
scraper = SoundCloudScraper()
# Search for tracks
tracks = scraper.search_tracks("electronic music", limit=20)
# Convert to DataFrame for analysis
df = pd.DataFrame(tracks)
print(f"Found {len(tracks)} tracks")
print(df[["title", "artist", "play_count", "genre"]].head(10))
# Save to CSV
df.to_csv("soundcloud_tracks.csv", index=False)
print("Saved to soundcloud_tracks.csv")
This scraper handles pagination automatically, respects rate limits, and outputs clean structured data.
The resolve_url method is particularly useful. Pass any SoundCloud URL and it returns the full API data for that resource.
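For example, resolving an arbitrary URL looks like this (the URL below is a placeholder; substitute any public track, user, or playlist link):
# resolve_example.py: sketch of resolving a SoundCloud URL to API data
from scrape_tracks import SoundCloudScraper

scraper = SoundCloudScraper()

data = scraper.resolve_url("https://soundcloud.com/some-artist/some-track")  # placeholder URL
if data:
    kind = data.get("kind")  # e.g. "track", "user", or "playlist"
    print(f"Resolved a {kind}: {data.get('title') or data.get('username')}")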
Method 3: Browser Automation with Playwright
Some SoundCloud pages require full JavaScript rendering. Browser automation handles these cases reliably.
# scrape_with_playwright.py
import asyncio
import json
from playwright.async_api import async_playwright
import pandas as pd
class SoundCloudBrowserScraper:
"""
Scrape SoundCloud using browser automation.
Handles JavaScript rendering and dynamic content loading.
"""
def __init__(self):
self.browser = None
self.context = None
async def init_browser(self, headless=True, proxy=None):
"""
Initialize Playwright browser with optional proxy support.
Use headless=False for debugging.
"""
playwright = await async_playwright().start()
browser_args = {
"headless": headless,
"args": [
"--disable-blink-features=AutomationControlled",
"--no-sandbox",
]
}
self.browser = await playwright.chromium.launch(**browser_args)
context_args = {
"viewport": {"width": 1920, "height": 1080},
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}
# Add proxy if provided
if proxy:
context_args["proxy"] = {
"server": proxy["server"],
"username": proxy.get("username"),
"password": proxy.get("password"),
}
self.context = await self.browser.new_context(**context_args)
async def scrape_search_results(self, query, max_results=50):
"""
Scrape tracks from SoundCloud search results page.
Scrolls to load additional results.
"""
page = await self.context.new_page()
# Navigate to search results
search_url = f"https://soundcloud.com/search/sounds?q={query}"
await page.goto(search_url, wait_until="networkidle")
# Handle cookie consent if present
try:
consent_btn = page.locator("#onetrust-accept-btn-handler")
if await consent_btn.is_visible(timeout=3000):
await consent_btn.click()
await page.wait_for_timeout(500)
except:
pass
tracks = []
previous_count = 0
while len(tracks) < max_results:
# Scroll to load more results
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
# Extract track data from page
track_elements = await page.locator(".searchList__item").all()
for element in track_elements[previous_count:]:
try:
track_data = await self._extract_track_from_element(element)
if track_data and track_data not in tracks:
tracks.append(track_data)
except Exception as e:
continue
# Check if we've loaded all results
if len(track_elements) == previous_count:
break
previous_count = len(track_elements)
await page.close()
return tracks[:max_results]
async def _extract_track_from_element(self, element):
"""
Parse track information from a search result element.
"""
title_el = element.locator(".soundTitle__title")
artist_el = element.locator(".soundTitle__usernameText")
title = await title_el.inner_text() if await title_el.count() > 0 else None
artist = await artist_el.inner_text() if await artist_el.count() > 0 else None
if not title:
return None
# Get the track URL
link_el = element.locator(".soundTitle__title a")
href = await link_el.get_attribute("href") if await link_el.count() > 0 else None
url = f"https://soundcloud.com{href}" if href else None
return {
"title": title.strip(),
"artist": artist.strip() if artist else None,
"url": url,
}
async def scrape_artist_page(self, artist_url):
"""
Scrape all tracks from an artist's profile page.
Handles infinite scrolling for large catalogs.
"""
page = await self.context.new_page()
# Navigate to artist's tracks page
tracks_url = f"{artist_url}/tracks" if not artist_url.endswith("/tracks") else artist_url
await page.goto(tracks_url, wait_until="networkidle")
# Accept cookies
try:
await page.click("#onetrust-accept-btn-handler", timeout=3000)
except:
pass
tracks = []
previous_count = 0
scroll_attempts = 0
max_scroll_attempts = 20
while scroll_attempts < max_scroll_attempts:
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
sound_items = await page.locator(".soundList__item").all()
if len(sound_items) == previous_count:
scroll_attempts += 1
else:
scroll_attempts = 0
previous_count = len(sound_items)
# Extract all loaded tracks
sound_items = await page.locator(".soundList__item").all()
for item in sound_items:
try:
track_data = await self._extract_track_from_sound_item(item)
if track_data:
tracks.append(track_data)
except:
continue
await page.close()
return tracks
async def _extract_track_from_sound_item(self, item):
"""
Extract track data from artist page sound item.
"""
title_el = item.locator(".soundTitle__title span")
title_text = await title_el.first.inner_text() if await title_el.count() > 0 else None
if not title_text:
return None
# Get play count if visible
play_el = item.locator(".sc-ministats-plays")
plays = None
if await play_el.count() > 0:
plays_text = await play_el.inner_text()
plays = self._parse_count(plays_text)
return {
"title": title_text.strip(),
"play_count": plays,
}
def _parse_count(self, count_text):
"""
Convert formatted count strings like '1.2M' to integers.
"""
if not count_text:
return None
count_text = count_text.strip().upper()
multipliers = {"K": 1000, "M": 1000000, "B": 1000000000}
for suffix, mult in multipliers.items():
if suffix in count_text:
try:
num = float(count_text.replace(suffix, ""))
return int(num * mult)
except:
return None
try:
return int(count_text.replace(",", ""))
except:
return None
async def close(self):
"""Clean up browser resources."""
if self.context:
await self.context.close()
if self.browser:
await self.browser.close()
async def main():
scraper = SoundCloudBrowserScraper()
await scraper.init_browser(headless=True)
# Scrape search results
tracks = await scraper.scrape_search_results("ambient music", max_results=30)
print(f"Found {len(tracks)} tracks")
for track in tracks[:5]:
print(f" {track['artist']} - {track['title']}")
await scraper.close()
if __name__ == "__main__":
asyncio.run(main())
Playwright renders the JavaScript for you, and the scrolling loops above handle infinite-scroll pagination. This approach works for pages where the API method fails.
The downside is speed. Browser automation runs slower than direct API calls. Use it when you need to scrape content that requires JavaScript rendering.
Method 4: Scraping with Proxies
High-volume scraping requires rotating proxies to avoid IP blocks. Here's how to integrate residential proxies into your scraper.
# scrape_with_proxies.py
import requests
import time
import random
from config import get_random_ua, BASE_HEADERS, MIN_DELAY, MAX_DELAY
class ProxiedSoundCloudScraper:
"""
SoundCloud scraper with rotating proxy support.
Suitable for high-volume data collection.
"""
def __init__(self, client_id, proxy_config=None):
"""
Initialize with proxy configuration.
proxy_config format:
{
"host": "proxy.example.com",
"port": 8080,
"username": "user",
"password": "pass"
}
"""
self.client_id = client_id
self.proxy_config = proxy_config
self.base_url = "https://api-v2.soundcloud.com"
def _get_proxies(self):
"""Build proxy dict for requests library."""
if not self.proxy_config:
return None
auth = ""
if self.proxy_config.get("username"):
auth = f"{self.proxy_config['username']}:{self.proxy_config['password']}@"
proxy_url = f"http://{auth}{self.proxy_config['host']}:{self.proxy_config['port']}"
return {
"http": proxy_url,
"https": proxy_url,
}
def search_tracks(self, query, limit=100):
"""
Search tracks with proxy rotation.
Automatically retries on failure.
"""
endpoint = f"{self.base_url}/search/tracks"
params = {
"q": query,
"client_id": self.client_id,
"limit": 50,
"offset": 0,
}
all_tracks = []
failures = 0
max_failures = 3
while len(all_tracks) < limit and failures < max_failures:
headers = {
**BASE_HEADERS,
"User-Agent": get_random_ua()
}
try:
response = requests.get(
endpoint,
params=params,
headers=headers,
proxies=self._get_proxies(),
timeout=15
)
if response.status_code == 200:
data = response.json()
tracks = data.get("collection", [])
if not tracks:
break
all_tracks.extend(tracks)
params["offset"] += 50
failures = 0
elif response.status_code == 429:
# Rate limited - increase delay
time.sleep(10)
failures += 1
else:
failures += 1
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
failures += 1
time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))
return all_tracks[:limit]
# Example usage with Roundproxies residential proxies
if __name__ == "__main__":
# Configure your proxy settings
proxy_config = {
"host": "gate.roundproxies.com",
"port": 5432,
"username": "your_username",
"password": "your_password"
}
# Get client_id first
from extract_client_id import get_client_id
client_id = get_client_id()
# Initialize scraper with proxies
scraper = ProxiedSoundCloudScraper(client_id, proxy_config)
tracks = scraper.search_tracks("hip hop beats", limit=50)
print(f"Scraped {len(tracks)} tracks with proxy rotation")
Residential proxies from providers like Roundproxies.com route requests through real user IPs, making them harder to detect and block.
For large-scale scraping, rotate between multiple proxy endpoints and use session stickiness for sequences of related requests.
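A simple way to rotate is to cycle through a list of proxy endpoints and build a fresh proxies dict per request. The endpoints below are placeholders; adapt _get_proxies to call something like this:
# proxy_rotation.py: sketch of cycling through several proxy endpoints
import itertools

# Placeholder endpoints: replace with your provider's gateways or ports
PROXY_ENDPOINTS = [
    {"host": "gate.roundproxies.com", "port": 5432, "username": "user", "password": "pass"},
    {"host": "gate.roundproxies.com", "port": 5433, "username": "user", "password": "pass"},
]

proxy_cycle = itertools.cycle(PROXY_ENDPOINTS)

def next_proxies():
    """Return a requests-style proxy dict for the next endpoint in the rotation."""
    cfg = next(proxy_cycle)
    proxy_url = f"http://{cfg['username']}:{cfg['password']}@{cfg['host']}:{cfg['port']}"
    return {"http": proxy_url, "https": proxy_url}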
Handling Common Challenges
SoundCloud scraping presents several challenges. Here's how to handle them.
Rate Limiting
SoundCloud limits API requests per IP. The signs include 429 response codes and empty responses.
Implement exponential backoff:
import random
import time

import requests

def request_with_backoff(url, params, max_retries=5):
"""Make requests with exponential backoff on failure."""
base_delay = 2
for attempt in range(max_retries):
response = requests.get(url, params=params)
if response.status_code == 200:
return response
if response.status_code == 429:
delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
time.sleep(delay)
else:
break
return None
Geo-Blocking
Some tracks are restricted by region. Use proxies from the appropriate country to access geo-locked content.
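Many residential providers let you pick the exit country through the proxy username or a dedicated port. The exact syntax varies by provider, so the snippet below is purely illustrative:
# geo_proxy.py: illustrative only; country-targeting syntax differs per provider
proxy_config = {
    "host": "gate.roundproxies.com",          # placeholder gateway
    "port": 5432,
    "username": "your_username-country-de",   # hypothetical country flag in the username
    "password": "your_password",
}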
Client ID Expiration
SoundCloud rotates client IDs periodically. Build in automatic refresh logic:
def ensure_valid_client_id(self):
"""Verify client_id works, refresh if needed."""
test_url = f"{self.base_url}/search/tracks"
params = {"q": "test", "client_id": self.client_id, "limit": 1}
response = requests.get(test_url, params=params)
if response.status_code != 200:
self.client_id = get_client_id()
Dynamic Selectors
SoundCloud updates their HTML classes frequently. Use multiple fallback selectors:
title_selectors = [
".soundTitle__title > .sc-link-dark",
".trackItem__trackTitle",
"h3 a.sc-link-dark",
]
for selector in title_selectors:
element = page.locator(selector)
if await element.count() > 0:
return await element.inner_text()
Exporting and Analyzing Data
Once you've collected data, export it for analysis.
# export_and_analyze.py
import pandas as pd
from datetime import datetime
def export_tracks(tracks, filename=None):
"""
Export track data to CSV with timestamp.
"""
df = pd.DataFrame(tracks)
if not filename:
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"soundcloud_tracks_{timestamp}.csv"
df.to_csv(filename, index=False)
print(f"Exported {len(tracks)} tracks to {filename}")
return df
def analyze_tracks(df):
"""
Generate basic statistics from scraped data.
"""
print("\n=== Track Analysis ===")
print(f"Total tracks: {len(df)}")
if "play_count" in df.columns:
print(f"Total plays: {df['play_count'].sum():,}")
print(f"Average plays: {df['play_count'].mean():,.0f}")
print(f"Top track: {df.loc[df['play_count'].idxmax(), 'title']}")
if "genre" in df.columns:
print(f"\nTop genres:")
print(df["genre"].value_counts().head(5))
if "artist" in df.columns:
print(f"\nMost prolific artists:")
print(df["artist"].value_counts().head(5))
# Example usage
if __name__ == "__main__":
from scrape_tracks import SoundCloudScraper
scraper = SoundCloudScraper()
tracks = scraper.search_tracks("indie rock", limit=100)
df = export_tracks(tracks)
analyze_tracks(df)
Legal and Ethical Considerations
Web scraping exists in a legal gray area. Follow these guidelines to scrape SoundCloud responsibly.
Respect robots.txt
Check SoundCloud's robots.txt file for crawl directives. Honor any restrictions specified there.
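Python's standard library can run the check for you. A small sketch:
# robots_check.py: test a URL against SoundCloud's robots.txt before crawling it
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://soundcloud.com/robots.txt")
parser.read()

url = "https://soundcloud.com/search/sounds?q=ambient"
if parser.can_fetch("*", url):
    print("Allowed to crawl:", url)
else:
    print("Disallowed by robots.txt:", url)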
Rate Limit Your Requests
Don't hammer the servers. Add delays between requests and implement backoff when you receive errors.
Don't Scrape Private Content
Only access publicly available data. Avoid attempting to access private tracks or user information.
Use Data Responsibly
Scraped data should be used for legitimate purposes like research, analysis, or building tools that add value. Don't redistribute copyrighted content.
Check Terms of Service
Review SoundCloud's Terms of Service before scraping. Commercial use may require explicit permission.
Scraping Playlists and User Profiles
Beyond individual tracks, you can scrape entire playlists and user profile data.
# scrape_playlists.py
import requests
import time
from config import get_random_ua, BASE_HEADERS, MIN_DELAY, MAX_DELAY
class PlaylistScraper:
"""
Extract tracks from SoundCloud playlists and sets.
"""
def __init__(self, client_id):
self.client_id = client_id
self.base_url = "https://api-v2.soundcloud.com"
self.session = requests.Session()
self.session.headers.update({
**BASE_HEADERS,
"User-Agent": get_random_ua()
})
def get_playlist_tracks(self, playlist_url):
"""
Get all tracks from a playlist by URL.
Returns playlist metadata and track list.
"""
# First resolve the playlist URL to get ID
resolve_endpoint = f"{self.base_url}/resolve"
params = {
"url": playlist_url,
"client_id": self.client_id,
}
response = self.session.get(resolve_endpoint, params=params)
if response.status_code != 200:
return None
playlist_data = response.json()
# Extract basic playlist info
result = {
"id": playlist_data.get("id"),
"title": playlist_data.get("title"),
"creator": playlist_data.get("user", {}).get("username"),
"track_count": playlist_data.get("track_count"),
"likes_count": playlist_data.get("likes_count"),
"tracks": []
}
# Get full track details
tracks = playlist_data.get("tracks", [])
for track in tracks:
result["tracks"].append({
"id": track.get("id"),
"title": track.get("title"),
"artist": track.get("user", {}).get("username"),
"duration_ms": track.get("duration"),
"play_count": track.get("playback_count", 0),
})
return result
def get_user_profile(self, username):
"""
Scrape comprehensive user profile data.
Includes stats, description, and social links.
"""
user_url = f"https://soundcloud.com/{username}"
resolve_endpoint = f"{self.base_url}/resolve"
params = {
"url": user_url,
"client_id": self.client_id,
}
response = self.session.get(resolve_endpoint, params=params)
if response.status_code != 200:
return None
user_data = response.json()
return {
"id": user_data.get("id"),
"username": user_data.get("username"),
"full_name": user_data.get("full_name"),
"description": user_data.get("description"),
"city": user_data.get("city"),
"country": user_data.get("country_code"),
"followers_count": user_data.get("followers_count"),
"followings_count": user_data.get("followings_count"),
"track_count": user_data.get("track_count"),
"playlist_count": user_data.get("playlist_count"),
"likes_count": user_data.get("likes_count"),
"avatar_url": user_data.get("avatar_url"),
"created_at": user_data.get("created_at"),
"verified": user_data.get("verified", False),
}
def get_user_followers(self, user_id, limit=100):
"""
Get list of users following a specific account.
Useful for network analysis.
"""
endpoint = f"{self.base_url}/users/{user_id}/followers"
params = {
"client_id": self.client_id,
"limit": 50,
"offset": 0,
}
all_followers = []
while len(all_followers) < limit:
response = self.session.get(endpoint, params=params)
if response.status_code != 200:
break
data = response.json()
followers = data.get("collection", [])
if not followers:
break
for follower in followers:
all_followers.append({
"id": follower.get("id"),
"username": follower.get("username"),
"followers_count": follower.get("followers_count"),
})
params["offset"] += 50
time.sleep(MIN_DELAY)
return all_followers[:limit]
if __name__ == "__main__":
from extract_client_id import get_client_id
client_id = get_client_id()
scraper = PlaylistScraper(client_id)
# Get playlist data
playlist = scraper.get_playlist_tracks(
"https://soundcloud.com/example-user/sets/my-playlist"
)
if playlist:
print(f"Playlist: {playlist['title']}")
print(f"Tracks: {playlist['track_count']}")
This scraper extracts playlist metadata, track listings, and user profile information including follower counts and social stats.
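Profile and follower scraping follows the same pattern as the playlist example. A quick sketch (the username is a placeholder):
# profile_example.py: sketch of pulling profile and follower data
from extract_client_id import get_client_id
from scrape_playlists import PlaylistScraper

scraper = PlaylistScraper(get_client_id())

profile = scraper.get_user_profile("example-user")  # placeholder username
if profile:
    print(f"{profile['username']}: {profile['followers_count']} followers")
    followers = scraper.get_user_followers(profile["id"], limit=100)
    print(f"Fetched {len(followers)} follower records")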
Storing Data in a Database
For ongoing data collection, store results in a database rather than CSV files.
# database_storage.py
import sqlite3
from datetime import datetime
class SoundCloudDatabase:
"""
SQLite storage for scraped SoundCloud data.
Supports incremental updates and deduplication.
"""
def __init__(self, db_path="soundcloud_data.db"):
self.conn = sqlite3.connect(db_path)
self._create_tables()
def _create_tables(self):
"""Initialize database schema."""
cursor = self.conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS tracks (
id INTEGER PRIMARY KEY,
title TEXT,
artist TEXT,
artist_id INTEGER,
duration_ms INTEGER,
play_count INTEGER,
like_count INTEGER,
genre TEXT,
permalink_url TEXT,
created_at TEXT,
scraped_at TEXT
)
""")
cursor.execute("""
CREATE TABLE IF NOT EXISTS users (
id INTEGER PRIMARY KEY,
username TEXT,
full_name TEXT,
followers_count INTEGER,
track_count INTEGER,
city TEXT,
country TEXT,
scraped_at TEXT
)
""")
cursor.execute("""
CREATE INDEX IF NOT EXISTS idx_tracks_artist
ON tracks(artist_id)
""")
self.conn.commit()
def save_track(self, track_data):
"""
Insert or update a track record.
Updates play counts on duplicate.
"""
cursor = self.conn.cursor()
cursor.execute("""
INSERT OR REPLACE INTO tracks
(id, title, artist, artist_id, duration_ms, play_count,
like_count, genre, permalink_url, created_at, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
track_data.get("id"),
track_data.get("title"),
track_data.get("artist"),
track_data.get("artist_id"),
track_data.get("duration_ms"),
track_data.get("play_count"),
track_data.get("like_count"),
track_data.get("genre"),
track_data.get("permalink_url"),
track_data.get("created_at"),
datetime.now().isoformat()
))
self.conn.commit()
def save_tracks_batch(self, tracks):
"""Bulk insert multiple tracks efficiently."""
for track in tracks:
self.save_track(track)
def get_tracks_by_artist(self, artist_id):
"""Retrieve all tracks by a specific artist."""
cursor = self.conn.cursor()
cursor.execute(
"SELECT * FROM tracks WHERE artist_id = ?",
(artist_id,)
)
return cursor.fetchall()
def get_top_tracks(self, limit=100):
"""Get tracks sorted by play count."""
cursor = self.conn.cursor()
cursor.execute(
"SELECT * FROM tracks ORDER BY play_count DESC LIMIT ?",
(limit,)
)
return cursor.fetchall()
def close(self):
"""Close database connection."""
self.conn.close()
SQLite handles storage for most scraping projects. For larger scale operations, migrate to PostgreSQL or MongoDB.
Scheduling Automated Scraping Jobs
Run your scrapers on a schedule to collect data over time. The example below uses the schedule package, which isn't in the earlier install list, so add it with pip install schedule.
# scheduler.py
import schedule
import time
from scrape_tracks import SoundCloudScraper
from database_storage import SoundCloudDatabase
def daily_trending_scrape():
"""
Scheduled job to scrape trending tracks daily.
"""
print(f"Starting daily scrape at {time.strftime('%Y-%m-%d %H:%M:%S')}")
scraper = SoundCloudScraper()
db = SoundCloudDatabase()
# Scrape multiple genres
genres = ["electronic", "hip-hop", "indie", "pop", "ambient"]
for genre in genres:
try:
tracks = scraper.search_tracks(f"trending {genre}", limit=50)
db.save_tracks_batch(tracks)
print(f"Saved {len(tracks)} {genre} tracks")
except Exception as e:
print(f"Error scraping {genre}: {e}")
db.close()
print("Daily scrape complete")
# Schedule the job
schedule.every().day.at("02:00").do(daily_trending_scrape)
if __name__ == "__main__":
print("Scheduler started. Press Ctrl+C to exit.")
while True:
schedule.run_pending()
time.sleep(60)
This scheduler runs a scraping job at 2 AM daily. Adjust the timing based on when SoundCloud sees the least traffic.
For production deployments, use a proper task queue like Celery or run jobs through cron.
Troubleshooting Common Issues
Here are solutions to problems you'll encounter when scraping SoundCloud.
Empty Responses After Working Initially
Your client ID probably expired. SoundCloud rotates these anywhere from every few hours to every few days. Refresh it using the extraction function.
403 Forbidden Errors
Your IP address has likely been blocked. Solutions include:
- Switch to a residential proxy
- Increase delays between requests
- Rotate user agents more frequently
Missing Data Fields
SoundCloud returns different fields based on the endpoint and track status. Always use .get() with defaults when accessing nested data.
Slow Playwright Scraping
Browser automation is inherently slower. Speed it up by:
- Blocking images and media:
page.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
- Using headless mode
- Reusing browser contexts instead of creating new ones (see the sketch below)
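Here's a sketch of the first and third tips using the async API. It assumes the setup from Method 3; the discover URL is just an example page:
# fast_context.py: block heavy assets and reuse one context across pages
import asyncio
from playwright.async_api import async_playwright

async def main():
    pw = await async_playwright().start()
    browser = await pw.chromium.launch(headless=True)
    context = await browser.new_context()  # reuse this one context for every page

    # Abort image requests so pages render faster
    await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())

    page = await context.new_page()
    await page.goto("https://soundcloud.com/discover")
    print(await page.title())

    await browser.close()
    await pw.stop()

if __name__ == "__main__":
    asyncio.run(main())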
Inconsistent Results
SoundCloud personalizes results based on location and history. Use a consistent proxy location and clear cookies between sessions for reproducible results.
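With the Method 3 scraper, that amounts to one clear_cookies call between runs. A minimal sketch:
# reset_state.py: clear cookies between runs so results stay comparable
import asyncio
from scrape_with_playwright import SoundCloudBrowserScraper

async def main():
    scraper = SoundCloudBrowserScraper()
    await scraper.init_browser(headless=True)

    first = await scraper.scrape_search_results("ambient music", max_results=10)
    await scraper.context.clear_cookies()  # drop any personalization before the next run
    second = await scraper.scrape_search_results("ambient music", max_results=10)

    print(len(first), len(second))
    await scraper.close()

if __name__ == "__main__":
    asyncio.run(main())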
Conclusion
You now have multiple proven methods to scrape SoundCloud effectively in 2026. The API-based approach offers speed and clean data. Browser automation handles JavaScript-heavy pages. Proxy integration enables scale. If you need to scrape SoundCloud data at volume, combining all three approaches gives you maximum flexibility.
Start with the API method for most use cases. It's faster and returns structured JSON. Fall back to Playwright when you encounter pages that require full JavaScript rendering.
For production scraping at scale, combine these methods with residential proxies to maintain reliable access. Monitor for changes in SoundCloud's structure and update your selectors accordingly.
The code examples in this guide are complete and working. Adapt them to your specific data collection needs and always scrape responsibly.
Frequently Asked Questions
Can I scrape SoundCloud without getting blocked?
Yes, by using reasonable request delays, rotating user agents, and residential proxies. Avoid aggressive scraping patterns that trigger rate limits.
Is scraping SoundCloud legal?
Scraping publicly available data is generally permitted for personal use and research. Commercial applications may require explicit permission from SoundCloud. Always review their Terms of Service.
How do I download audio files from SoundCloud?
This guide focuses on metadata extraction. Downloading audio files involves additional legal considerations around copyright. Only download tracks with explicit download permissions enabled by the creator.
What's the best programming language for SoundCloud scraping?
Python offers the best combination of libraries for web scraping. The requests library handles API calls efficiently, while Playwright manages browser automation when needed.
How often does SoundCloud change their API?
SoundCloud's internal API endpoints remain relatively stable, but client IDs rotate frequently. Build your scraper to refresh credentials automatically rather than hardcoding values.