
How to scrape YouTube in 2026: 5 methods (+ working code)

YouTube holds a goldmine of data. Video metadata, engagement metrics, comments, transcripts—it's all there waiting to be extracted for market research, sentiment analysis, or training ML models.

I've spent years building scrapers for YouTube data. Whether I'm tracking trending topics, analyzing competitor channels, or gathering datasets for content recommendation systems, I keep coming back to the same proven methods.

If you want to scrape YouTube and are wondering where to start, you're in the right place.

In this guide, I'll walk you through five different methods to extract YouTube data—from quick metadata grabs to large-scale channel scraping.

What is YouTube Scraping?

YouTube scraping is the process of programmatically extracting data from YouTube pages. This includes video metadata, channel information, comments, transcripts, search results, and engagement metrics.

YouTube relies heavily on JavaScript to render content. This makes traditional HTTP-based scraping challenging.

However, YouTube also exposes hidden JSON endpoints and embeds structured data in its HTML. These provide easier extraction paths than parsing rendered HTML.

In practice, you can scrape YouTube to:

  • Extract video titles, descriptions, view counts, and like counts
  • Gather comments for sentiment analysis
  • Download transcripts for content analysis
  • Monitor channel growth and posting frequency
  • Track trending topics and keywords
  • Build datasets for machine learning projects

What Data Can You Extract from YouTube?

Before diving into methods, let's clarify what you can actually scrape from YouTube.

Video Data

Field            Description
Title            Video title
Description      Full description text
View count       Total views
Like count       Number of likes
Comment count    Total comments
Duration         Video length
Upload date      When published
Thumbnail URL    Video thumbnail image
Tags             Associated keywords
Category         Content category

Channel Data

Field              Description
Channel name       Display name
Subscriber count   Total subscribers
Video count        Number of uploads
View count         Total channel views
Description        About section
Join date          Channel creation date
Links              External links

Additional Data

  • Comments: Text, author, likes, replies
  • Transcripts: Auto-generated and manual captions
  • Search results: Videos matching keywords
  • Playlists: Video lists and metadata

5 Methods to Scrape YouTube

Let's explore each method with working code examples.

Method 1: yt-dlp Library

Best for: Quick metadata extraction without browser overhead

Difficulty: Easy | Cost: Free | Speed: Fast

yt-dlp is a command-line tool and Python library forked from youtube-dl. It's the fastest way to extract YouTube metadata without rendering JavaScript.

Installation

pip install yt-dlp

Extract Video Metadata

from yt_dlp import YoutubeDL

def get_video_info(video_url):
    """Extract metadata from a YouTube video."""
    
    ydl_opts = {
        'quiet': True,
        'no_warnings': True,
        'extract_flat': False,
    }
    
    with YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(video_url, download=False)
        
    return {
        'title': info.get('title'),
        'description': info.get('description'),
        'view_count': info.get('view_count'),
        'like_count': info.get('like_count'),
        'duration': info.get('duration'),
        'upload_date': info.get('upload_date'),
        'channel': info.get('channel'),
        'channel_id': info.get('channel_id'),
        'tags': info.get('tags', []),
    }

# Usage
video_url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
data = get_video_info(video_url)
print(data)

The download=False parameter prevents downloading the actual video file.

Extract Comments

from yt_dlp import YoutubeDL

def get_video_comments(video_url, max_comments=100):
    """Extract comments from a YouTube video."""
    
    ydl_opts = {
        'quiet': True,
        'no_warnings': True,
        'getcomments': True,
        'extractor_args': {
            'youtube': {
                'max_comments': [str(max_comments)]
            }
        }
    }
    
    with YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(video_url, download=False)
        
    comments = info.get('comments', [])
    
    return [{
        'text': c.get('text'),
        'author': c.get('author'),
        'likes': c.get('like_count'),
        'timestamp': c.get('timestamp'),
    } for c in comments]

# Usage
comments = get_video_comments(video_url)
for comment in comments[:5]:
    print(f"{comment['author']}: {comment['text'][:50]}...")

Scrape YouTube Search Results

from yt_dlp import YoutubeDL

def search_youtube(query, max_results=10):
    """Search YouTube and return video metadata."""
    
    ydl_opts = {
        'quiet': True,
        'no_warnings': True,
        'extract_flat': True,
        'playlistend': max_results,
    }
    
    search_url = f"ytsearch{max_results}:{query}"
    
    with YoutubeDL(ydl_opts) as ydl:
        results = ydl.extract_info(search_url, download=False)
        
    videos = []
    for entry in results.get('entries', []):
        videos.append({
            'title': entry.get('title'),
            'url': entry.get('url'),
            'duration': entry.get('duration'),
            'view_count': entry.get('view_count'),
            'channel': entry.get('channel'),
        })
    
    return videos

# Usage
results = search_youtube("python web scraping tutorial")
for video in results:
    print(f"{video['title']}")

Pros and Cons

Pros:

  • No browser required—very fast
  • Handles most anti-bot detection automatically
  • Extracts comprehensive metadata
  • Active development and updates

Cons:

  • Can trigger sign-in prompts at scale
  • Limited control over request headers
  • Comments extraction can be slow

Method 2: YouTube Data API v3

Best for: Reliable, structured data with official support

Difficulty: Easy | Cost: Free (with quota limits) | Speed: Fast

The YouTube Data API is the official way to access YouTube data. It's reliable and returns clean JSON responses.

The downside? You're limited to 10,000 quota units per day.

Setup

  1. Go to Google Cloud Console
  2. Create a new project
  3. Enable YouTube Data API v3
  4. Create an API key under Credentials
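
One practical habit once the key exists: keep it out of your source code. A minimal sketch that reads it from an environment variable (the YOUTUBE_API_KEY name is my own convention, nothing Google requires):

import os

# Read the API key from an environment variable instead of hardcoding it
API_KEY = os.environ.get("YOUTUBE_API_KEY", "")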

Installation

pip install google-api-python-client

Search Videos

from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"

def search_videos(query, max_results=10):
    """Search YouTube using the official API."""
    
    youtube = build('youtube', 'v3', developerKey=API_KEY)
    
    request = youtube.search().list(
        q=query,
        part='id,snippet',
        maxResults=max_results,
        type='video'
    )
    
    response = request.execute()
    
    videos = []
    for item in response.get('items', []):
        videos.append({
            'video_id': item['id']['videoId'],
            'title': item['snippet']['title'],
            'description': item['snippet']['description'],
            'channel': item['snippet']['channelTitle'],
            'published_at': item['snippet']['publishedAt'],
            'thumbnail': item['snippet']['thumbnails']['high']['url'],
        })
    
    return videos

# Usage
results = search_videos("machine learning tutorial")
for video in results:
    print(f"{video['title']}")

Get Video Statistics

def get_video_stats(video_ids):
    """Get detailed statistics for videos."""
    
    youtube = build('youtube', 'v3', developerKey=API_KEY)
    
    # API accepts up to 50 IDs per request
    request = youtube.videos().list(
        id=','.join(video_ids),
        part='statistics,contentDetails,snippet'
    )
    
    response = request.execute()
    
    stats = []
    for item in response.get('items', []):
        stats.append({
            'video_id': item['id'],
            'title': item['snippet']['title'],
            'view_count': int(item['statistics'].get('viewCount', 0)),
            'like_count': int(item['statistics'].get('likeCount', 0)),
            'comment_count': int(item['statistics'].get('commentCount', 0)),
            'duration': item['contentDetails']['duration'],
        })
    
    return stats

# Usage
video_ids = ['dQw4w9WgXcQ', 'kJQP7kiw5Fk']
stats = get_video_stats(video_ids)
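
Note that contentDetails.duration comes back as an ISO 8601 string such as PT4M13S, not a number of seconds. If you want seconds, a small helper along these lines does the conversion (it only handles the common hours/minutes/seconds form):

import re

def iso8601_to_seconds(duration):
    """Convert an ISO 8601 duration like 'PT1H2M10S' to seconds."""
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration or '')
    if not match:
        return 0
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

# Usage
print(iso8601_to_seconds('PT4M13S'))  # 253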

Get Channel Information

def get_channel_info(channel_id):
    """Get channel details and statistics."""
    
    youtube = build('youtube', 'v3', developerKey=API_KEY)
    
    request = youtube.channels().list(
        id=channel_id,
        part='snippet,statistics,contentDetails'
    )
    
    response = request.execute()
    
    if not response.get('items'):
        return None
    
    item = response['items'][0]
    
    return {
        'channel_id': item['id'],
        'title': item['snippet']['title'],
        'description': item['snippet']['description'],
        'subscriber_count': int(item['statistics'].get('subscriberCount', 0)),
        'video_count': int(item['statistics'].get('videoCount', 0)),
        'view_count': int(item['statistics'].get('viewCount', 0)),
        'uploads_playlist': item['contentDetails']['relatedPlaylists']['uploads'],
    }

# Usage
channel = get_channel_info('UC8butISFwT-Wl7EV0hUK0BQ')
print(f"{channel['title']}: {channel['subscriber_count']} subscribers")

Quota Costs

Each API call consumes quota units:

Operation             Cost
search.list           100 units
videos.list           1 unit
channels.list         1 unit
commentThreads.list   1 unit

With 10,000 units daily, you can make roughly 100 searches or 10,000 video detail requests.
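
If you're planning a run, it's worth sanity-checking the numbers up front. A throwaway helper like this (not part of the API, just arithmetic over the table above) makes the trade-off concrete:

QUOTA_COSTS = {'search': 100, 'videos': 1, 'channels': 1, 'commentThreads': 1}
DAILY_QUOTA = 10_000

def fits_in_quota(searches=0, video_batches=0, channel_lookups=0, comment_pages=0):
    """Rough estimate of whether a planned run fits in one day's quota."""
    used = (searches * QUOTA_COSTS['search']
            + video_batches * QUOTA_COSTS['videos']
            + channel_lookups * QUOTA_COSTS['channels']
            + comment_pages * QUOTA_COSTS['commentThreads'])
    return used, used <= DAILY_QUOTA

# Usage: 50 searches plus 200 videos.list calls (each covering up to 50 IDs)
print(fits_in_quota(searches=50, video_batches=200))  # (5200, True)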

Pros and Cons

Pros:

  • Official and reliable
  • Clean JSON responses
  • No blocking or CAPTCHAs
  • Well-documented

Cons:

  • 10,000 quota units daily limit
  • Search costs 100 units per call
  • Doesn't include all public data
  • Requires API key management

Method 3: Hidden JSON Endpoints

Best for: Bypassing API limits with direct data access

Difficulty: Medium | Cost: Free | Speed: Fast

YouTube embeds JSON data directly in its HTML pages. The ytInitialData and ytInitialPlayerResponse objects contain structured data you can parse without rendering JavaScript.

Extract ytInitialData

import requests
import re
import json

def extract_initial_data(url):
    """Extract ytInitialData from YouTube page."""
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    
    # Find ytInitialData in script tags
    pattern = r'var ytInitialData = ({.*?});'
    match = re.search(pattern, response.text)
    
    if not match:
        # Try alternative pattern
        pattern = r'ytInitialData\s*=\s*({.*?});'
        match = re.search(pattern, response.text)
    
    if match:
        return json.loads(match.group(1))
    
    return None

# Usage
url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
data = extract_initial_data(url)
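
Extract ytInitialPlayerResponse

The same regex trick works for ytInitialPlayerResponse, the other object mentioned above; its videoDetails block holds the core metadata in a much flatter shape. A sketch along the same lines (field names reflect the current page structure and can change without notice):

import requests
import re
import json

def extract_player_response(url):
    """Extract ytInitialPlayerResponse and pull out videoDetails."""
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    
    # Same non-greedy pattern as extract_initial_data() above
    match = re.search(r'var ytInitialPlayerResponse\s*=\s*({.*?});', response.text)
    if not match:
        return None
    
    details = json.loads(match.group(1)).get('videoDetails', {})
    
    return {
        'video_id': details.get('videoId'),
        'title': details.get('title'),
        'channel': details.get('author'),
        'view_count': details.get('viewCount'),
        'length_seconds': details.get('lengthSeconds'),
        'keywords': details.get('keywords', []),
        'description': details.get('shortDescription'),
    }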

Parse Video Page Data

def parse_video_data(initial_data):
    """Parse video information from ytInitialData."""
    
    try:
        # Navigate to video details
        contents = initial_data['contents']['twoColumnWatchNextResults']
        primary = contents['results']['results']['contents']
        
        video_info = {}
        
        for content in primary:
            if 'videoPrimaryInfoRenderer' in content:
                renderer = content['videoPrimaryInfoRenderer']
                video_info['title'] = renderer['title']['runs'][0]['text']
                video_info['views'] = renderer['viewCount']['videoViewCountRenderer']['viewCount']['simpleText']
                
            if 'videoSecondaryInfoRenderer' in content:
                renderer = content['videoSecondaryInfoRenderer']
                video_info['channel'] = renderer['owner']['videoOwnerRenderer']['title']['runs'][0]['text']
                video_info['description'] = renderer.get('attributedDescription', {}).get('content', '')
        
        return video_info
        
    except (KeyError, IndexError) as e:
        print(f"Parse error: {e}")
        return None

Scrape Search Results via Hidden API

from urllib.parse import quote_plus

def scrape_youtube_search(query):
    """Scrape search results from the ytInitialData embedded in the results page."""
    
    # URL-encode the query; extract_initial_data() already sends the request
    # with realistic headers and parses the embedded JSON
    search_url = f"https://www.youtube.com/results?search_query={quote_plus(query)}"
    initial_data = extract_initial_data(search_url)
    
    if not initial_data:
        return []
    
    videos = []
    
    try:
        contents = initial_data['contents']['twoColumnSearchResultsRenderer']
        items = contents['primaryContents']['sectionListRenderer']['contents'][0]
        results = items['itemSectionRenderer']['contents']
        
        for item in results:
            if 'videoRenderer' in item:
                renderer = item['videoRenderer']
                videos.append({
                    'video_id': renderer['videoId'],
                    'title': renderer['title']['runs'][0]['text'],
                    'channel': renderer['ownerText']['runs'][0]['text'],
                    'views': renderer.get('viewCountText', {}).get('simpleText', 'N/A'),
                    'duration': renderer.get('lengthText', {}).get('simpleText', 'N/A'),
                })
    
    except (KeyError, IndexError):
        pass
    
    return videos

Handle Pagination with Continuation Tokens

def get_continuation_data(continuation_token):
    """Fetch next page using continuation token."""
    
    api_url = "https://www.youtube.com/youtubei/v1/browse"
    
    headers = {
        'Content-Type': 'application/json',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    }
    
    payload = {
        'context': {
            'client': {
                'clientName': 'WEB',
                'clientVersion': '2.20240101.00.00',
            }
        },
        'continuation': continuation_token,
    }
    
    response = requests.post(api_url, headers=headers, json=payload)
    return response.json()
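
What this doesn't show is where the token comes from. In my experience it sits inside a continuationItemRenderer in the parsed ytInitialData, though the exact nesting moves around, so a recursive search is the safest bet. A sketch:

def find_continuation_token(node):
    """Recursively search parsed YouTube JSON for a continuation token."""
    if isinstance(node, dict):
        if 'continuationItemRenderer' in node:
            endpoint = node['continuationItemRenderer'].get('continuationEndpoint', {})
            token = endpoint.get('continuationCommand', {}).get('token')
            if token:
                return token
        for value in node.values():
            token = find_continuation_token(value)
            if token:
                return token
    elif isinstance(node, list):
        for item in node:
            token = find_continuation_token(item)
            if token:
                return token
    return None

# Usage: initial_data is the parsed ytInitialData from earlier
token = find_continuation_token(initial_data)
if token:
    # Search pages typically continue via /youtubei/v1/search rather than /browse
    next_page = get_continuation_data(token)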

Pros and Cons

Pros:

  • No API key required
  • Faster than browser automation
  • Access to data not in official API
  • No quota limits

Cons:

  • Endpoints change without notice
  • Requires understanding JSON structure
  • Can break with YouTube updates
  • More complex parsing logic

Method 4: Selenium Browser Automation

Best for: Dynamic content requiring JavaScript execution

Difficulty: Medium | Cost: Free | Speed: Slow

When hidden endpoints don't work, Selenium provides full browser control. It renders JavaScript and handles dynamic content like infinite scroll.

Installation

pip install selenium webdriver-manager

Basic Setup

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time

def create_driver():
    """Create a configured Chrome driver."""
    
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
    
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)
    
    return driver

Scrape Channel Videos

def scrape_channel_videos(channel_url, max_videos=50):
    """Scrape all videos from a YouTube channel."""
    
    driver = create_driver()
    videos = []
    
    try:
        # Navigate to channel videos tab
        videos_url = f"{channel_url}/videos"
        driver.get(videos_url)
        
        # Wait for content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "contents"))
        )
        
        # Scroll to load more videos, extracting as we go so short
        # channel pages still return results
        last_height = driver.execute_script("return document.documentElement.scrollHeight")
        
        while len(videos) < max_videos:
            # Extract the video elements currently rendered
            video_elements = driver.find_elements(By.CSS_SELECTOR, "ytd-rich-item-renderer")
            
            for element in video_elements:
                if len(videos) >= max_videos:
                    break
                    
                try:
                    title_elem = element.find_element(By.CSS_SELECTOR, "#video-title")
                    views_elem = element.find_element(By.CSS_SELECTOR, "#metadata-line span:first-child")
                    
                    video_data = {
                        'title': title_elem.text,
                        'url': title_elem.get_attribute('href'),
                        'views': views_elem.text,
                    }
                    
                    if video_data not in videos:
                        videos.append(video_data)
                        
                except Exception:
                    continue
            
            # Scroll down to trigger the next batch
            driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
            time.sleep(2)
            
            # Stop once scrolling no longer loads new content
            new_height = driver.execute_script("return document.documentElement.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
        
        return videos
        
    finally:
        driver.quit()

Extract Video Details

def scrape_video_details(video_url):
    """Scrape detailed information from a video page."""
    
    driver = create_driver()
    
    try:
        driver.get(video_url)
        
        # Wait for video info to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1.ytd-watch-metadata"))
        )
        
        # Expand description
        try:
            expand_btn = driver.find_element(By.CSS_SELECTOR, "#expand")
            expand_btn.click()
            time.sleep(1)
        except Exception:
            pass
        
        # Extract data
        title = driver.find_element(By.CSS_SELECTOR, "h1.ytd-watch-metadata").text
        
        # Get view count from info section
        info_text = driver.find_element(By.CSS_SELECTOR, "#info-container").text
        
        # Get channel name
        channel = driver.find_element(By.CSS_SELECTOR, "#channel-name a").text
        
        # Get description
        description = driver.find_element(By.CSS_SELECTOR, "#description-inner").text
        
        return {
            'title': title,
            'channel': channel,
            'description': description,
            'info': info_text,
        }
        
    finally:
        driver.quit()

Pros and Cons

Pros:

  • Handles any JavaScript-rendered content
  • Full browser capabilities
  • Can interact with page elements
  • Works when other methods fail

Cons:

  • Slowest method
  • High resource usage
  • More likely to trigger detection
  • Complex to maintain

Method 5: Playwright with Stealth

Best for: Evading bot detection while automating browsers

Difficulty: Hard | Cost: Free | Speed: Medium

Playwright offers better stealth capabilities than Selenium. Combined with anti-detection techniques, it can bypass most bot detection systems.

Installation

pip install playwright playwright-stealth
playwright install chromium

Stealth Configuration

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def create_stealth_browser():
    """Create a browser with stealth mode enabled."""
    
    playwright = sync_playwright().start()
    
    browser = playwright.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--no-sandbox',
        ]
    )
    
    context = browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        locale='en-US',
    )
    
    page = context.new_page()
    stealth_sync(page)
    
    return playwright, browser, page

Scrape with Stealth

def scrape_youtube_stealth(url):
    """Scrape YouTube with stealth mode to avoid detection."""
    
    playwright, browser, page = create_stealth_browser()
    
    try:
        page.goto(url, wait_until='networkidle')
        
        # Handle cookie consent if present
        try:
            consent_button = page.locator('button:has-text("Accept all")')
            if consent_button.is_visible():
                consent_button.click()
                page.wait_for_timeout(1000)
        except Exception:
            pass
        
        # Wait for content
        page.wait_for_selector('#contents', timeout=10000)
        
        # Extract data using JavaScript
        data = page.evaluate('''() => {
            const videos = [];
            const items = document.querySelectorAll('ytd-video-renderer, ytd-rich-item-renderer');
            
            items.forEach(item => {
                const titleEl = item.querySelector('#video-title');
                const viewsEl = item.querySelector('#metadata-line span');
                
                if (titleEl) {
                    videos.push({
                        title: titleEl.textContent.trim(),
                        url: titleEl.href,
                        views: viewsEl ? viewsEl.textContent.trim() : 'N/A'
                    });
                }
            });
            
            return videos;
        }''')
        
        return data
        
    finally:
        browser.close()
        playwright.stop()

Block Unnecessary Resources

def scrape_fast_stealth(url):
    """Scrape with resource blocking for faster loads."""
    
    playwright, browser, page = create_stealth_browser()
    
    # Block images, videos, and fonts
    page.route('**/*.{png,jpg,jpeg,gif,webp,svg,mp4,webm,woff,woff2}', 
               lambda route: route.abort())
    
    page.route('**/googlevideo.com/**', lambda route: route.abort())
    
    try:
        page.goto(url, wait_until='domcontentloaded')
        page.wait_for_selector('#contents', timeout=10000)
        
        # Extract data...
        return page.content()
        
    finally:
        browser.close()
        playwright.stop()

Pros and Cons

Pros:

  • Best anti-detection capabilities
  • Modern API design
  • Auto-waiting for elements
  • Supports multiple browsers

Cons:

  • Steeper learning curve
  • Requires additional setup
  • Still slower than direct HTTP
  • Can still be detected at scale

Comparison: Which Method Should You Use?

Method        Speed     Difficulty   Anti-Bot Handling   Best For
yt-dlp        ⚡ Fast    Easy         Good                Quick metadata extraction
YouTube API   ⚡ Fast    Easy         N/A                 Reliable structured data
Hidden JSON   ⚡ Fast    Medium       Manual              Bypassing API limits
Selenium      🐢 Slow    Medium       Poor                Legacy systems
Playwright    Medium    Hard         Good                Stealth scraping

Decision Guide

Choose yt-dlp if:

  • You need video metadata quickly
  • You're scraping fewer than 1,000 videos
  • You want the simplest solution

Choose YouTube API if:

  • You need reliable, official data
  • Your daily needs fit within quota
  • You want clean, structured responses

Choose Hidden JSON if:

  • API quotas are insufficient
  • You understand JSON parsing
  • You can maintain code when endpoints change

Choose Selenium/Playwright if:

  • Other methods are blocked
  • You need to interact with page elements
  • You're scraping dynamic content

Handling Anti-Bot Detection

YouTube actively detects and blocks automated access. Here's how to stay under the radar.

Use Rotating Proxies

Residential proxies distribute requests across real IP addresses.

import requests

proxy = {
    'http': 'http://user:pass@proxy-server:port',
    'https': 'http://user:pass@proxy-server:port',
}

response = requests.get(url, proxies=proxy)

For high-volume YouTube scraping, residential proxies from providers like Roundproxies significantly reduce blocking.
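
The snippet above pins a single proxy. For actual rotation, cycle through a pool per request; a minimal sketch with placeholder endpoints (the proxy URLs are made up and need your provider's real gateway and credentials):

import itertools
import requests

# Hypothetical endpoints; substitute your provider's gateway and credentials
PROXY_POOL = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
    'http://user:pass@proxy3.example.com:8000',
]

proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch_with_rotation(url):
    """Send each request through the next proxy in the pool."""
    proxy_url = next(proxy_cycle)
    proxies = {'http': proxy_url, 'https': proxy_url}
    return requests.get(url, proxies=proxies, timeout=15)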

Add Request Delays

Never hammer YouTube with rapid requests.

import time
import random

def scrape_with_delay(urls):
    results = []
    
    for url in urls:
        result = scrape_url(url)
        results.append(result)
        
        # Random delay between 2-5 seconds
        delay = random.uniform(2, 5)
        time.sleep(delay)
    
    return results

Rotate User Agents

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15',
]

headers = {
    'User-Agent': random.choice(USER_AGENTS),
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

Common Errors and Troubleshooting

"Sign in to confirm you're not a bot"

Cause: YouTube detected automated access.

Fix: Use yt-dlp with cookies from a logged-in session:

# Export cookies from browser
yt-dlp --cookies-from-browser chrome "VIDEO_URL"

# Or use a cookies file
yt-dlp --cookies cookies.txt "VIDEO_URL"
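
The same cookie options are available from Python if you're using yt-dlp as a library. A sketch based on its cookiesfrombrowser and cookiefile options (replace VIDEO_URL with a real watch URL):

from yt_dlp import YoutubeDL

ydl_opts = {
    'quiet': True,
    # Reuse cookies from a logged-in Chrome profile...
    'cookiesfrombrowser': ('chrome',),
    # ...or point at an exported Netscape-format cookies file instead:
    # 'cookiefile': 'cookies.txt',
}

with YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info("VIDEO_URL", download=False)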

403 Forbidden Error

Cause: Request blocked by YouTube.

Fix:

  • Add realistic headers
  • Use residential proxies
  • Reduce request frequency

Empty ytInitialData

Cause: Page loaded with different structure or region restriction.

Fix:

  • Check if content requires sign-in
  • Try different Accept-Language headers
  • Use a VPN for region-locked content

Selenium Timeout Errors

Cause: Element not loading in time.

Fix:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Increase timeout
element = WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, "contents"))
)

Best Practices

1. Respect Rate Limits

Even without explicit limits, excessive requests harm YouTube's servers.

  • Add 2-5 second delays between requests
  • Limit concurrent connections (see the sketch after this list)
  • Implement exponential backoff on errors
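
For capping concurrent connections, a small thread pool is usually enough. A minimal sketch with an arbitrary limit of three in-flight requests and a short pause after each:

import random
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def polite_fetch(url):
    """Fetch one URL, then pause briefly so workers don't hammer the site."""
    response = requests.get(url, timeout=10)
    time.sleep(random.uniform(1, 3))
    return response

def fetch_all(urls, max_workers=3):
    # max_workers caps how many requests are in flight at any moment
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(polite_fetch, urls))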

2. Cache Responses

Don't re-scrape data you already have.

import hashlib
import json
import os

def get_cached_or_fetch(url):
    cache_dir = '.cache'
    os.makedirs(cache_dir, exist_ok=True)
    
    # Create cache key from URL
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = f'{cache_dir}/{cache_key}.json'
    
    # Check cache
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    
    # Fetch and cache
    data = fetch_data(url)
    with open(cache_file, 'w') as f:
        json.dump(data, f)
    
    return data

3. Handle Errors Gracefully

import logging
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def fetch_with_retry(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        logger.warning(f"Request failed: {e}")
        raise

4. Save Raw Responses

Always save original data before parsing.

import os
import requests
from datetime import datetime

def scrape_and_save(url, output_dir='raw_data'):
    os.makedirs(output_dir, exist_ok=True)
    
    response = requests.get(url)
    
    # Save raw response
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    filename = f'{output_dir}/response_{timestamp}.html'
    
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(response.text)
    
    # Then parse
    return parse_response(response.text)

Legal and Ethical Considerations

Before you scrape YouTube, understand the legal landscape.

Terms of Service

YouTube's Terms of Service prohibit automated access. However, courts have generally held that collecting publicly available data is not, by itself, unlawful.

Key considerations:

  • Don't scrape private or logged-in content
  • Don't circumvent technical protection measures
  • Don't use scraped data to compete with YouTube
  • Respect robots.txt (advisory)

Ethical Guidelines

Do:

  • Scrape only public data
  • Identify your scraper with a contact email
  • Minimize server load
  • Cache data to reduce requests
  • Use data responsibly

Don't:

  • Scrape personal user data without consent
  • Republish copyrighted content
  • Overload YouTube's servers
  • Sell scraped data commercially without legal review

When to Use the Official API

For commercial projects or applications requiring reliability, use the YouTube Data API. It's designed for programmatic access and won't get you blocked.

FAQs

Is it legal to scrape YouTube?

Scraping publicly available data is generally legal, but it violates YouTube's Terms of Service. Use these methods at your own risk for personal or research purposes. For commercial use, consult legal counsel or use the official API.

Can I download YouTube videos with these methods?

Yes, yt-dlp supports video downloads. However, downloading copyrighted content may violate copyright laws. Only download videos you have rights to.

How do I scrape YouTube comments at scale?

Use yt-dlp with the --get-comments flag or the YouTube Data API's commentThreads.list endpoint. For large volumes, implement pagination and rate limiting.
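
For the API route, commentThreads.list returns up to 100 top-level comments per page and paginates with nextPageToken. A sketch, reusing the API key setup from Method 2:

from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"  # same key as in Method 2

def get_comments_api(video_id, max_comments=500):
    """Fetch top-level comments via the official API, 100 per page."""
    
    youtube = build('youtube', 'v3', developerKey=API_KEY)
    
    comments = []
    page_token = None
    
    while len(comments) < max_comments:
        response = youtube.commentThreads().list(
            videoId=video_id,
            part='snippet',
            maxResults=100,
            pageToken=page_token,
            textFormat='plainText'
        ).execute()
        
        for item in response.get('items', []):
            top = item['snippet']['topLevelComment']['snippet']
            comments.append({
                'author': top['authorDisplayName'],
                'text': top['textDisplay'],
                'likes': top['likeCount'],
                'published_at': top['publishedAt'],
            })
        
        page_token = response.get('nextPageToken')
        if not page_token:
            break
    
    return comments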

Why does my scraper keep getting blocked?

YouTube blocks scrapers that:

  • Send too many requests too fast
  • Use datacenter IPs
  • Have bot-like fingerprints
  • Lack realistic headers

Use residential proxies, add delays, and rotate user agents to avoid detection.

What's the difference between yt-dlp and youtube-dl?

yt-dlp is an actively maintained fork of youtube-dl with better performance, more features, and faster bug fixes. Always use yt-dlp for new projects.

Conclusion

You now have five proven methods to scrape YouTube data:

  1. yt-dlp for quick metadata extraction
  2. YouTube Data API for official, reliable access
  3. Hidden JSON endpoints for bypassing quota limits
  4. Selenium for legacy automation needs
  5. Playwright for stealth scraping

Start with yt-dlp for simple tasks. Use the API for commercial projects. Fall back to browser automation only when necessary.

Remember to scrape responsibly, cache your data, and respect rate limits.