How to Scrape Reddit in 7 Steps in 2025

Reddit scraping is the process of automatically extracting data from Reddit's vast collection of posts, comments, and user-generated content. In this guide, we'll show you how to collect Reddit data using multiple methods - from simple JSON endpoints to advanced API techniques, all while respecting rate limits and staying within legal boundaries.

Why This Guide Matters

Since Reddit's API pricing changes in 2023, which introduced a charge of $0.24 per 1,000 API calls, developers have had to get creative with data collection methods. While apps that make fewer than 100 queries per minute per OAuth client ID can still use the free tier, many projects require alternative approaches.

This guide covers both official and unofficial methods, with a focus on request-based solutions that don't require heavy browser automation.

Step 1: Understand Your Options (And Choose Wisely)

Before diving into code, let's map out the landscape of Reddit scraping approaches:

The JSON Endpoint Method (Easiest & Fastest)

Reddit has an unofficial API where you can add .json to the end of any URL to get back the data for that page as JSON. This is perhaps the most elegant solution for simple scraping tasks.

Pros:

  • No authentication required
  • Works with any Reddit URL
  • Returns structured JSON data
  • Rate limiting is based on user-agent

Cons:

  • Data cuts off after 14 days of posts
  • Limited to 100 posts per request
  • No access to certain metadata
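For a quick taste of how this works, here's a minimal sketch that fetches a single post as JSON simply by appending .json to its URL (the post path below is a placeholder, not a real permalink):

import requests

# Any Reddit page URL becomes a JSON endpoint when ".json" is appended.
# The post path here is a placeholder - substitute a real permalink.
url = "https://www.reddit.com/r/python/comments/abc123/example_post.json"
headers = {"User-Agent": "Reddit-Scraper/1.0 (by /u/YourUsername)"}

response = requests.get(url, headers=headers)
response.raise_for_status()
data = response.json()

# Post pages return two listings: the post itself, then its comments
post = data[0]["data"]["children"][0]["data"]
print(post["title"], post["score"])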

The PRAW Method (Official & Robust)

PRAW is a widely-used Python package that provides simple access to Reddit's API, handling authentication and rate limiting automatically.

Pros:

  • Official API support
  • Full access to Reddit features
  • Can write data back to Reddit
  • Handles rate limiting automatically

Cons:

  • Requires API credentials
  • Subject to API pricing for high-volume use
  • More complex setup
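For reference, a minimal read-only PRAW sketch looks like this (the credentials are placeholders you would obtain by creating an app in Reddit's app preferences):

import praw

# Placeholder credentials - create a "script" app at reddit.com/prefs/apps
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="Reddit-Scraper/1.0 (by /u/YourUsername)",
)

# Read-only access: iterate over the hot posts of a subreddit
for submission in reddit.subreddit("python").hot(limit=10):
    print(submission.score, submission.title)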

The Old Reddit Method (Clever Workaround)

Old.reddit is very lightweight and doesn't rely on JavaScript, making it perfect for traditional web scraping.

Pros:

  • Loads up to 500 comments in a single request via the limit=500 query parameter
  • No JavaScript rendering required
  • Works with simple HTTP requests

Cons:

  • May not have all modern features
  • Could be deprecated in the future
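As a rough illustration, scraping old.reddit.com with plain HTTP requests might look like the sketch below. The CSS selectors are assumptions based on old Reddit's markup and may need adjusting:

import requests
from parsel import Selector

headers = {"User-Agent": "Reddit-Scraper/1.0 (by /u/YourUsername)"}
response = requests.get("https://old.reddit.com/r/python/", headers=headers)
selector = Selector(text=response.text)

# Each post on old Reddit is rendered as a div with class "thing" (assumed markup)
for thing in selector.css("div.thing"):
    title = thing.css("a.title::text").get()
    score = thing.css("div.score.unvoted::text").get()
    print(score, title)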

Step 2: Set Up Your Environment

First, let's install the necessary Python packages:

pip install requests beautifulsoup4 pandas praw httpx parsel loguru

For this guide, we'll use multiple libraries to demonstrate different approaches:

  • requests/httpx: For making HTTP requests
  • beautifulsoup4/parsel: For parsing HTML
  • praw: For official Reddit API access
  • pandas: For data manipulation

Step 3: Master the JSON Endpoint Technique

This is my favorite method for quick Reddit scraping. Here's how to use it effectively:

import requests
import json
from datetime import datetime

def scrape_subreddit_json(subreddit, sort='hot', limit=100, timeframe='all'):
    """
    Scrape Reddit posts using the JSON endpoint
    
    Args:
        subreddit: Name of the subreddit
        sort: 'hot', 'new', 'top', 'rising'
        limit: Number of posts (max 100)
        timeframe: 'hour', 'day', 'week', 'month', 'year', 'all'
    """
    # Build the URL
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"
    
    # Important: Set a custom user-agent to avoid rate limiting
    headers = {
        'User-Agent': 'Reddit-Scraper/1.0 (by /u/YourUsername)'
    }
    
    # Add parameters
    params = {
        'limit': limit,
        't': timeframe
    }
    
    try:
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        
        data = response.json()
        posts = []
        
        # Extract post data
        for child in data['data']['children']:
            post = child['data']
            posts.append({
                'id': post['id'],
                'title': post['title'],
                'author': post.get('author', '[deleted]'),
                'created_utc': datetime.fromtimestamp(post['created_utc']),
                'score': post['score'],
                'num_comments': post['num_comments'],
                'url': post['url'],
                'selftext': post.get('selftext', ''),
                'subreddit': post['subreddit'],
                'permalink': f"https://reddit.com{post['permalink']}"
            })
        
        return posts
        
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data: {e}")
        return []

# Example usage
posts = scrape_subreddit_json('python', sort='top', limit=50, timeframe='week')
print(f"Scraped {len(posts)} posts")

Pro Tip: Handling Pagination with 'after' Parameter

Each request returns at most 100 posts, but you can paginate using the 'after' parameter:

import time

def scrape_multiple_pages(subreddit, pages=5):
    """Scrape multiple pages of posts"""
    all_posts = []
    after = None
    
    for page in range(pages):
        url = f"https://www.reddit.com/r/{subreddit}/new.json"
        params = {'limit': 100}
        
        if after:
            params['after'] = after
            
        headers = {'User-Agent': 'Reddit-Scraper/1.0'}
        response = requests.get(url, headers=headers, params=params)
        
        if response.status_code == 200:
            data = response.json()
            posts = data['data']['children']
            all_posts.extend(posts)
            
            # Get the 'after' token for next page
            after = data['data'].get('after')
            if not after:
                break
                
            # Be respectful with rate limiting
            time.sleep(2)
    
    return all_posts

Step 4: Leverage Hidden APIs for Dynamic Content

Reddit's own web app loads dynamic content by calling hidden API endpoints that return structured JSON. Here's how to discover and use these endpoints:

import httpx
from parsel import Selector

async def scrape_hidden_api(subreddit):
    """Use Reddit's hidden API endpoints"""
    
    # The hidden API endpoint for getting more posts
    api_url = "https://gateway.reddit.com/desktopapi/v1/subreddits/{}/posts"
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    
    async with httpx.AsyncClient() as client:
        # First, get the initial page to extract tokens
        initial_response = await client.get(
            f"https://www.reddit.com/r/{subreddit}/", 
            headers=headers
        )
        
        # Extract the token from the page (you'll need to inspect the actual response)
        # This is a simplified example
        
        # Make request to hidden API
        api_response = await client.get(
            api_url.format(subreddit),
            headers=headers,
            params={
                'limit': 25,
                'sort': 'hot'
            }
        )
        
        return api_response.json()
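Since scrape_hidden_api is an async coroutine, you would run it with asyncio, for example:

import asyncio

# Run the async scraper defined above
data = asyncio.run(scrape_hidden_api("python"))
print(data)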

Step 5: Scrape Comments Efficiently Using Old Reddit

On the modern Reddit interface, comments load dynamically as you scroll. Since posts can have thousands of replies, relying on a headless browser isn't practical. Here's the smart solution:

def scrape_post_comments(post_id, limit=500):
    """
    Scrape comments from a Reddit post using old.reddit.com
    
    Args:
        post_id: The Reddit post ID
        limit: Number of comments to fetch (max 500)
    """
    # Use old.reddit.com for easier parsing
    url = f"https://old.reddit.com/comments/{post_id}.json"
    
    headers = {'User-Agent': 'Reddit-Comment-Scraper/1.0'}
    params = {'limit': limit}
    
    response = requests.get(url, headers=headers, params=params)
    
    if response.status_code == 200:
        data = response.json()
        
        # The response contains two listings: [post_data, comments_data]
        comments_data = data[1]['data']['children']
        
        comments = []
        for comment in comments_data:
            if comment['kind'] == 't1':  # t1 = comment
                comment_data = comment['data']
                comments.append({
                    'id': comment_data['id'],
                    'author': comment_data.get('author', '[deleted]'),
                    'body': comment_data.get('body', ''),
                    'score': comment_data['score'],
                    'created_utc': datetime.fromtimestamp(comment_data['created_utc']),
                    'parent_id': comment_data['parent_id'],
                    'replies': extract_replies(comment_data.get('replies', ''))
                })
        
        return comments
    
    return []

def extract_replies(replies_data):
    """Recursively extract nested replies"""
    if not replies_data or isinstance(replies_data, str):
        return []
    
    replies = []
    for reply in replies_data.get('data', {}).get('children', []):
        if reply['kind'] == 't1':
            reply_data = reply['data']
            replies.append({
                'author': reply_data.get('author', '[deleted]'),
                'body': reply_data.get('body', ''),
                'score': reply_data['score']
            })
    
    return replies
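Tying the pieces together, here's a small usage sketch that pulls this week's top posts with the JSON endpoint from Step 3 and then fetches comments for a few of them:

import time

# Fetch this week's top posts, then grab comments for the first few
top_posts = scrape_subreddit_json("python", sort="top", limit=10, timeframe="week")

for post in top_posts[:3]:
    comments = scrape_post_comments(post["id"], limit=500)
    print(f"{post['title']}: {len(comments)} top-level comments")
    time.sleep(2)  # be polite between requests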

Step 6: Implement Smart Rate Limiting and Error Handling

Insert a few seconds of delay between requests to avoid overloading Reddit's servers. Here's a production-ready approach:

import time
import random
from functools import wraps

def rate_limit(min_delay=1, max_delay=3):
    """Decorator for rate limiting requests"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            delay = random.uniform(min_delay, max_delay)
            time.sleep(delay)
            return result
        return wrapper
    return decorator

class RedditScraper:
    def __init__(self, user_agent="Reddit-Scraper/1.0"):
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': user_agent})
        self.request_count = 0
        self.last_request_time = 0
    
    @rate_limit(min_delay=2, max_delay=5)
    def get(self, url, **kwargs):
        """Make a rate-limited GET request"""
        try:
            response = self.session.get(url, **kwargs)
            self.request_count += 1
            
            # Handle rate limit responses
            if response.status_code == 429:
                retry_after = int(response.headers.get('Retry-After', 60))
                print(f"Rate limited. Waiting {retry_after} seconds...")
                time.sleep(retry_after)
                return self.get(url, **kwargs)
            
            response.raise_for_status()
            return response
            
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}")
            return None
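Usage is straightforward: instantiate the scraper once and reuse it for every request, for example:

scraper = RedditScraper(user_agent="Reddit-Scraper/1.0 (by /u/YourUsername)")
response = scraper.get("https://www.reddit.com/r/python/hot.json", params={"limit": 25})

if response is not None:
    data = response.json()
    print(f"Fetched {len(data['data']['children'])} posts in {scraper.request_count} request(s)")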

Advanced Techniques and Best Practices

Use Proxies for Large-Scale Scraping

Reddit actively blocks and bans IP addresses that send too many requests too quickly. Proxies allow you to route your traffic through multiple IPs:

def scrape_with_proxy(url, proxy_list):
    """Rotate through proxies for requests"""
    for proxy in proxy_list:
        try:
            proxies = {'http': proxy, 'https': proxy}
            response = requests.get(url, proxies=proxies, timeout=10)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            continue
    return None

Handle Dynamic Content Without Selenium

Instead of using browser automation, look for API endpoints in the Network tab:

def find_api_endpoints():
    """
    Monitor network traffic to find hidden API endpoints
    This is a manual process - use browser DevTools
    """
    # 1. Open Reddit in browser with DevTools
    # 2. Go to Network tab
    # 3. Filter by XHR/Fetch
    # 4. Scroll to trigger dynamic loading
    # 5. Look for JSON responses
    
    # Common patterns:
    endpoints = [
        "https://gateway.reddit.com/desktopapi/v1/",
        "https://oauth.reddit.com/",
        "https://www.reddit.com/api/",
        "https://gql.reddit.com/"  # GraphQL endpoint
    ]
    
    return endpoints

Store Data Efficiently

import pandas as pd
import sqlite3

def save_to_database(posts, db_name='reddit_data.db'):
    """Save scraped data to SQLite database"""
    conn = sqlite3.connect(db_name)
    
    # Convert to DataFrame
    df = pd.DataFrame(posts)
    
    # Save to database
    df.to_sql('posts', conn, if_exists='append', index=False)
    
    conn.close()
    print(f"Saved {len(posts)} posts to database")

Step 7: Stay Within Legal and Ethical Boundaries

While public Reddit data is generally fair game to scrape, there are legal guidelines and ethical factors to consider:

  1. Only scrape public data - Never attempt to access private subreddits or user profiles
  2. Respect robots.txt - Check Reddit's robots.txt file for guidelines
  3. Use reasonable volumes - Build datasets large enough for your needs, but don't overdo it
  4. Anonymize user data - Remove personally identifiable information (see the sketch after this list)
  5. Don't recreate Reddit - Avoid building services that directly compete
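On the anonymization point, a minimal sketch (hashing usernames so records stay linkable without exposing the handle) might look like this:

import hashlib

def anonymize_posts(posts):
    """Replace author names with a one-way hash before storage or sharing"""
    for post in posts:
        author = post.get('author', '[deleted]')
        post['author'] = hashlib.sha256(author.encode()).hexdigest()[:16]
    return posts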

Alternative Tools and Services

If coding isn't your thing, consider these alternatives:

No-Code Solutions

  • Apify's Reddit templates - Pre-built scrapers with visual interfaces
  • Octoparse - Point-and-click scraping tool
  • ParseHub - Visual web scraping platform

API Wrappers

  • PRAW (Python) - Handles rate limiting and authentication automatically
  • snoowrap (JavaScript) - Promise-based Reddit API wrapper
  • JRAW (Java) - Reddit API wrapper for Java

Conclusion and Next Steps

Reddit scraping in 2025 requires creativity and technical knowledge. The JSON endpoint method remains the most elegant solution for most use cases, while PRAW provides official API access for more complex needs. Remember to always respect rate limits, handle errors gracefully, and consider the ethical implications of your scraping activities.

For large-scale projects, combine multiple techniques:

  1. Use JSON endpoints for initial data discovery
  2. Leverage old.reddit.com for comment scraping
  3. Implement smart rate limiting and proxy rotation
  4. Store data efficiently in databases

The key is to think beyond traditional browser automation and use Reddit's own infrastructure to your advantage.

Marius Bernard

Marius Bernard is a Product Advisor, Technical SEO, & Brand Ambassador at Roundproxies. He was the lead author for the SEO chapter of the 2024 Web and a reviewer for the 2023 SEO chapter.