I've scraped everything from product catalogs to job listings, and one thing always catches people off guard: the storage challenge hits way before you expect it. You start with a few hundred records and suddenly you're drowning in gigabytes of duplicate data, wondering why your simple CSV file takes forever to open.

The trick to efficient data storage isn't just picking the right database—it's understanding when to use what, how to avoid storing the same thing twice, and setting up your pipeline so you can actually resume when things go wrong.

In this guide, I'll walk you through practical strategies for storing scraped data that actually scale, from simple file-based approaches to production-ready database setups. We'll also cover some clever tricks that can save you hours of debugging and thousands of dollars in storage costs.

Start with the right format (it matters more than you think)

Before you pick a database, you need to figure out what format makes sense for your data. I've seen developers jump straight to MongoDB because someone said it's "web scale," only to realize later that a simple CSV would've done the job.

Here's my breakdown of when to use each format:

CSV files work great when you're scraping tabular data that doesn't change much. Product prices, job listings, or company directories all fit this pattern. The killer feature? Everyone can open them—your data analyst colleague doesn't need to install PostgreSQL just to take a look.

import csv
from datetime import datetime

def save_to_csv(data, basename='scraped_data'):
    # Date-stamped filename, e.g. scraped_data_2025-01-15.csv
    timestamp = datetime.now().strftime('%Y-%m-%d')
    fieldnames = list(data[0].keys()) if data else []
    
    with open(f'{basename}_{timestamp}.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

The downside? CSVs get messy fast when you have nested data. If you're scraping an e-commerce site with product variants, multiple images, and reviews, you'll end up with ugly JSON strings inside CSV cells.

JSON files shine when your data has hierarchy. Social media posts with nested comments, API responses with multiple levels—these belong in JSON. Plus, most web scraping returns JSON-like structures anyway, so you're not fighting against the grain.

import json
from pathlib import Path

def save_to_json(data, filename='scraped_data.json'):
    output_dir = Path('data')
    output_dir.mkdir(exist_ok=True)
    
    # json.dump() writes the whole structure in one go, which is fine for small to medium dumps
    with open(output_dir / filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

The real catch with JSON is memory: json.dump() needs the entire dataset in RAM before anything hits disk. If you're dealing with hundreds of thousands of records, write them out incrementally instead, which in practice means switching to newline-delimited JSON (JSONL).

JSONL (JSON Lines) is my go-to for scraping projects. Each line is a complete JSON object, which means you can process the file line-by-line without loading everything into RAM. This format also plays nice with most big data tools.

import json

def append_to_jsonl(item, filename='scraped_data.jsonl'):
    with open(filename, 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# In your scraper
for item in scrape_items():
    append_to_jsonl(item)
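
Reading it back is just as light on memory. Here's a minimal sketch of a line-by-line reader (the filename matches the writer above):

import json

def read_jsonl(filename='scraped_data.jsonl'):
    """Yield one record at a time so huge files never sit fully in RAM."""
    with open(filename, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # Skip any blank lines
                yield json.loads(line)

# Stream a large dump without loading it all at once
for item in read_jsonl():
    print(item.get('url'))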

The database question: SQL vs. NoSQL

Once your scraping project grows beyond a few thousand records, files start showing their limits. You can't query them efficiently, updating is painful, and don't even think about concurrent access.

That's when you need a real database. But which one?

SQLite: The underrated champion

SQLite gets overlooked because it's so simple, but that simplicity is exactly why it's perfect for small to medium scraping projects. No server to set up, no authentication to configure—just a single file that contains your entire database.

import sqlite3
from contextlib import contextmanager

@contextmanager
def get_db_connection(db_path='scraped_data.db'):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # Rows support name-based access, e.g. row['url']
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()

def create_table():
    with get_db_connection() as conn:
        conn.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE NOT NULL,
                title TEXT,
                price REAL,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                last_checked TIMESTAMP,
                content_hash TEXT
            )
        ''')
        # url already gets an implicit index from its UNIQUE constraint;
        # index content_hash separately for fast dedup lookups
        conn.execute('CREATE INDEX IF NOT EXISTS idx_hash ON products(content_hash)')

def insert_product(url, title, price, content_hash):
    with get_db_connection() as conn:
        try:
            conn.execute(
                'INSERT INTO products (url, title, price, content_hash) VALUES (?, ?, ?, ?)',
                (url, title, price, content_hash)
            )
            return True
        except sqlite3.IntegrityError:
            # Duplicate URL, skip or update
            return False

The UNIQUE constraint on the URL column automatically prevents duplicates. SQLite handles this at the database level, so you don't need to check for duplicates manually.

One gotcha: SQLite doesn't handle high write concurrency well. If you're running multiple scrapers that all write to the same database, you'll hit lock contention. For single-threaded scrapers or read-heavy workloads, it's fantastic.
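
If you do end up with a couple of processes sharing one SQLite file, WAL mode and a busy timeout take most of the sting out of lock contention. A minimal sketch, using standard SQLite pragmas (tune the timeout to your workload):

import sqlite3

def open_connection(db_path='scraped_data.db'):
    conn = sqlite3.connect(db_path, timeout=30)
    # WAL lets readers keep working while a single writer commits
    conn.execute('PRAGMA journal_mode=WAL')
    # Wait up to 5 seconds for a lock instead of failing immediately
    conn.execute('PRAGMA busy_timeout=5000')
    return conn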

PostgreSQL: When you need the real deal

PostgreSQL is what I reach for when a project outgrows SQLite. It handles concurrent writes like a champ, has actual replication, and the JSON support is surprisingly good.

import psycopg2
from psycopg2.extras import Json, execute_values

def setup_postgres():
    conn = psycopg2.connect(
        host="localhost",
        database="scraping_db",
        user="scraper",
        password="your_password"
    )
    
    with conn.cursor() as cur:
        cur.execute('''
            CREATE TABLE IF NOT EXISTS scraped_items (
                id SERIAL PRIMARY KEY,
                url TEXT UNIQUE NOT NULL,
                data JSONB,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        # JSONB index for fast queries
        cur.execute('CREATE INDEX IF NOT EXISTS idx_data_gin ON scraped_items USING GIN (data)')
    
    conn.commit()
    return conn

def batch_insert(items, conn):
    # Batch inserts are WAY faster
    with conn.cursor() as cur:
        execute_values(
            cur,
            'INSERT INTO scraped_items (url, data) VALUES %s ON CONFLICT (url) DO NOTHING',
            [(item['url'], Json(item)) for item in items]
        )
    conn.commit()

The JSONB column type is brilliant: you get the flexibility of JSON plus the ability to query nested fields. Need to find all products under $50? Filter on (data->>'price')::numeric, since ->> returns text. The GIN index keeps containment and key-existence queries fast even with millions of records; for range filters like that price check, an expression index on the extracted field does more.
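
Here's a rough sketch of both query styles against the scraped_items table above (the field names inside data are placeholders for whatever your scraper actually stores):

def find_cheap_items(conn, max_price=50):
    with conn.cursor() as cur:
        # Range filter on a nested field: cast the extracted text to numeric
        cur.execute('''
            SELECT url, data->>'title' AS title
            FROM scraped_items
            WHERE (data->>'price')::numeric < %s
        ''', (max_price,))
        return cur.fetchall()

def find_in_stock(conn):
    with conn.cursor() as cur:
        # Containment query: this is the kind of lookup the GIN index accelerates
        cur.execute('''SELECT url FROM scraped_items WHERE data @> '{"in_stock": true}' ''')
        return cur.fetchall()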

MongoDB: For truly unstructured chaos

MongoDB gets a lot of hype, but honestly, I only reach for it when I'm scraping data with wildly varying schemas. Think crawling multiple websites where each one has completely different fields.

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError
from datetime import datetime

client = MongoClient('mongodb://localhost:27017/')
db = client['scraping_database']
products = db['products']

# Create indexes for common queries
products.create_index([('url', 1)], unique=True)
products.create_index([('scraped_at', -1)])

def insert_with_dedup(item):
    item['scraped_at'] = datetime.utcnow()
    try:
        products.insert_one(item)
        return True
    except DuplicateKeyError:
        # Already exists, so just refresh the last-seen timestamp
        products.update_one(
            {'url': item['url']},
            {'$set': {'last_seen': datetime.utcnow()}}
        )
        return False

MongoDB's schemaless nature is both its strength and weakness. You can throw any structure at it, but that also means you can't enforce data quality like you can with SQL constraints.

The deduplication trick that saves gigabytes

Here's something that took me embarrassingly long to figure out: most scraped data has tons of duplicates, but not in the way you'd expect.

Sure, you might scrape the same URL twice. That's easy to catch with a UNIQUE constraint. But what about when the same product appears on multiple pages? Different URLs, same data. That's where content-based deduplication comes in.

import hashlib
import json

def compute_content_hash(item, fields_to_hash=None):
    """
    Create a hash of the item's content to detect semantic duplicates.
    Only hash the fields that matter—ignore timestamps, URLs, etc.
    """
    if fields_to_hash is None:
        # Auto-detect: exclude common metadata fields
        fields_to_hash = {k for k in item.keys() 
                         if k not in ['url', 'scraped_at', 'id', 'timestamp']}
    
    # Sort keys for consistent hashing
    hashable = {k: item[k] for k in sorted(fields_to_hash) if k in item}
    content_str = json.dumps(hashable, sort_keys=True, ensure_ascii=False)
    
    return hashlib.blake2b(content_str.encode(), digest_size=16).hexdigest()

# Usage in your scraper
def save_with_content_dedup(item, db_conn):
    content_hash = compute_content_hash(item, fields_to_hash=['title', 'description', 'price'])
    
    # Check if we've seen this exact content before
    existing = db_conn.execute(
        'SELECT id FROM products WHERE content_hash = ?',
        (content_hash,)
    ).fetchone()
    
    if existing:
        print(f"Duplicate content found, skipping: {item['title']}")
        return False
    
    # New content, save it
    item['content_hash'] = content_hash
    db_conn.execute(
        'INSERT INTO products (url, title, price, content_hash) VALUES (?, ?, ?, ?)',
        (item['url'], item['title'], item['price'], content_hash)
    )
    return True

I use BLAKE2b instead of MD5 or SHA-256 because it's faster on modern 64-bit hardware and at least as collision-resistant for this purpose. The digest_size=16 gives you a 128-bit hash, which is plenty to avoid accidental collisions in any realistic scraping project.

This technique has saved me from storing the same product description hundreds of times when scraping aggregator sites.

Incremental scraping: Don't start from scratch every time

Nothing wastes more resources than re-scraping data you already have. But tracking what you've scraped gets tricky, especially when sites update content or add new items.

The checkpoint pattern

Scrapy has built-in support for this via the JOBDIR setting (e.g. scrapy crawl myspider -s JOBDIR=crawls/run-1), but the concept works anywhere:

import json
from pathlib import Path

class CheckpointManager:
    def __init__(self, checkpoint_file='scraper_checkpoint.json'):
        self.checkpoint_file = Path(checkpoint_file)
        self.state = self.load()
        # Keep processed URLs in a set for O(1) membership checks
        self.processed = set(self.state['processed_urls'])
    
    def load(self):
        if self.checkpoint_file.exists():
            with open(self.checkpoint_file) as f:
                return json.load(f)
        return {
            'last_scraped_url': None,
            'last_scraped_at': None,
            'processed_urls': []
        }
    
    def save(self):
        self.state['processed_urls'] = list(self.processed)
        with open(self.checkpoint_file, 'w') as f:
            json.dump(self.state, f, indent=2)
    
    def mark_processed(self, url):
        self.processed.add(url)
        if len(self.processed) % 100 == 0:
            self.save()  # Persist every 100 URLs
    
    def is_processed(self, url):
        return url in self.processed

# Usage
checkpoint = CheckpointManager()

for url in urls_to_scrape:
    if checkpoint.is_processed(url):
        continue
    
    try:
        data = scrape_url(url)
        save_data(data)
        checkpoint.mark_processed(url)
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        checkpoint.save()  # Save progress even on error
        raise

This pattern lets you Ctrl+C your scraper at any point and pick up exactly where you left off. The periodic saves (every 100 URLs) prevent you from losing too much progress if something crashes.

The timestamp trick for incremental updates

When you're scraping sites that update content regularly—like news sites or job boards—you want to re-scrape items but only fetch what changed.

from datetime import datetime, timedelta

def get_stale_items(db_conn, hours=24):
    """Get items that haven't been checked in the last N hours"""
    cutoff = datetime.now() - timedelta(hours=hours)
    
    cursor = db_conn.execute('''
        SELECT url, last_checked 
        FROM products 
        WHERE last_checked < ? OR last_checked IS NULL
        ORDER BY last_checked ASC NULLS FIRST
        LIMIT 1000
    ''', (cutoff,))
    
    return cursor.fetchall()

def update_item(db_conn, url, new_data):
    """Update item and refresh the last_checked timestamp"""
    old_hash = db_conn.execute(
        'SELECT content_hash FROM products WHERE url = ?',
        (url,)
    ).fetchone()
    
    new_hash = compute_content_hash(new_data)
    
    if old_hash and old_hash[0] == new_hash:
        # Content hasn't changed, just update timestamp
        db_conn.execute(
            'UPDATE products SET last_checked = ? WHERE url = ?',
            (datetime.now(), url)
        )
        return False
    
    # Content changed, update everything
    db_conn.execute('''
        UPDATE products 
        SET title = ?, price = ?, content_hash = ?, last_checked = ?
        WHERE url = ?
    ''', (new_data['title'], new_data['price'], new_hash, datetime.now(), url))
    
    return True

This approach prioritizes items that haven't been checked recently. Popular products might get checked every few hours, while rarely-viewed items only get updated daily. It's way more efficient than blindly re-scraping everything.
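
Tying the two helpers together, a refresh pass can look something like the sketch below (scrape_url stands in for whatever fetch-and-parse function your scraper already has):

def refresh_stale_items(db_conn, hours=24):
    updated = 0
    for row in get_stale_items(db_conn, hours=hours):
        url = row[0]  # First column of the SELECT is the URL
        try:
            new_data = scrape_url(url)  # Placeholder for your fetch-and-parse step
        except Exception as e:
            print(f"Error refreshing {url}: {e}")
            continue
        if update_item(db_conn, url, new_data):
            updated += 1
    db_conn.commit()  # Persist updates if you're managing the connection manually
    return updated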

Compression: The storage saver nobody talks about

Scraped HTML and JSON compress ridiculously well. I'm talking 80-90% reduction in some cases. But most people store it uncompressed because they don't realize how easy compression is to add.

Transparent compression with Python

import gzip
import json

def save_compressed_json(data, filename):
    """Save JSON with gzip compression"""
    with gzip.open(f'{filename}.gz', 'wt', encoding='utf-8') as f:
        json.dump(data, f)

def load_compressed_json(filename):
    """Load gzipped JSON"""
    with gzip.open(f'{filename}.gz', 'rt', encoding='utf-8') as f:
        return json.load(f)

# For database BLOBs
import sqlite3

def save_html_compressed(url, html_content, db_conn):
    compressed = gzip.compress(html_content.encode('utf-8'))
    
    db_conn.execute('''
        INSERT INTO raw_html (url, compressed_html, scraped_at)
        VALUES (?, ?, ?)
    ''', (url, compressed, datetime.now()))

def get_html(url, db_conn):
    row = db_conn.execute(
        'SELECT compressed_html FROM raw_html WHERE url = ?',
        (url,)
    ).fetchone()
    
    if row:
        return gzip.decompress(row[0]).decode('utf-8')
    return None

The beauty of storing compressed HTML is that you can always go back and re-parse it if your extraction logic improves. I've saved countless hours by keeping the raw HTML instead of just the parsed data.

PostgreSQL's built-in compression

PostgreSQL automatically compresses JSONB and TEXT columns when they're large enough (typically over 2KB). You don't need to do anything—just use the right column type and you get compression for free.

# This automatically benefits from TOAST compression
cur.execute('''
    INSERT INTO raw_pages (url, content)
    VALUES (%s, %s)
''', (url, large_html_content))

Avoiding the storage pitfalls I learned the hard way

Don't store images in the database. I see this constantly. Store the image path in the database and save the actual file to disk or object storage. Databases are terrible at storing large binary blobs.

from pathlib import Path
from urllib.parse import urlparse
import hashlib
import requests

def save_image(image_url, base_dir='scraped_images'):
    base_path = Path(base_dir)
    base_path.mkdir(exist_ok=True)
    
    # Use a hash of the URL as the filename to avoid duplicates
    filename = hashlib.md5(image_url.encode()).hexdigest()
    # Take the extension from the URL path so query strings don't leak in
    ext = Path(urlparse(image_url).path).suffix or '.jpg'
    filepath = base_path / f'{filename}{ext}'
    
    if filepath.exists():
        return str(filepath)
    
    response = requests.get(image_url, stream=True, timeout=30)
    response.raise_for_status()  # Don't save error pages as images
    with open(filepath, 'wb') as f:
        for chunk in response.iter_content(8192):
            f.write(chunk)
    
    return str(filepath)

# Store only the path
item['image_path'] = save_image(item['image_url'])

Batch your inserts. Inserting one row at a time is painfully slow. Batch them up—100 to 1000 rows per transaction makes a massive difference.

def batch_insert_products(items, batch_size=500):
    with get_db_connection() as conn:
        batch = []
        for item in items:
            batch.append((
                item['url'],
                item['title'],
                item['price'],
                compute_content_hash(item)
            ))
            
            if len(batch) >= batch_size:
                conn.executemany('''
                    INSERT OR IGNORE INTO products (url, title, price, content_hash)
                    VALUES (?, ?, ?, ?)
                ''', batch)
                batch = []
        
        # Don't forget the last partial batch
        if batch:
            conn.executemany('''
                INSERT OR IGNORE INTO products (url, title, price, content_hash)
                VALUES (?, ?, ?, ?)
            ''', batch)

Set up proper indexes from day one. Adding indexes to a table with millions of rows takes forever. Create them when your table is empty.

# Essential indexes for scraping workloads (skip idx_url if the column
# already carries a UNIQUE constraint, which creates its own index)
cursor.execute('CREATE UNIQUE INDEX IF NOT EXISTS idx_url ON products(url)')
cursor.execute('CREATE INDEX IF NOT EXISTS idx_content_hash ON products(content_hash)')
cursor.execute('CREATE INDEX IF NOT EXISTS idx_scraped_at ON products(scraped_at DESC)')

When to scale beyond a single machine

If you're scraping millions of pages, eventually a single database won't cut it. But honestly, you can get surprisingly far with one beefy PostgreSQL instance. I've run scrapers that collect 500K records daily on a single server for years.

The real bottleneck is usually the scraping itself, not the storage. Before you architect some distributed database cluster, make sure your scraper is actually maxing out your database's capacity.

That said, when you do need to scale:

Use object storage for raw HTML. Services like MinIO (self-hosted S3-compatible storage) are perfect for archiving raw scraped content. Cheap, scales indefinitely, and you can always move it to actual S3 later if needed.
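
As a rough sketch, the archiving step with boto3 pointed at a local MinIO instance might look like this (the endpoint, credentials, and bucket name are placeholders for your own setup):

import gzip
import hashlib
import boto3

# S3-compatible client aimed at a local MinIO server (placeholder credentials)
s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:9000',
    aws_access_key_id='minio_user',
    aws_secret_access_key='minio_password',
)

def archive_html(url, html_content, bucket='raw-html'):
    # Key the object by a hash of the URL and compress before uploading
    key = hashlib.blake2b(url.encode(), digest_size=16).hexdigest() + '.html.gz'
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=gzip.compress(html_content.encode('utf-8')),
        ContentType='text/html',
        ContentEncoding='gzip',
    )
    return key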

Partition your tables by date. PostgreSQL's table partitioning makes it easy to archive old data without slowing down current queries.

-- Create partitioned table
CREATE TABLE products (
    id SERIAL,
    url TEXT,
    scraped_at TIMESTAMP
) PARTITION BY RANGE (scraped_at);

-- Create monthly partitions
CREATE TABLE products_2025_01 PARTITION OF products
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Consider TimescaleDB for time-series data. If you're tracking price changes, stock levels, or anything that changes over time, TimescaleDB turns PostgreSQL into a time-series database. It's a game-changer for this type of scraping.
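
As a sketch, turning a hypothetical price_history table into a hypertable looks roughly like this (it assumes the TimescaleDB extension is installed on your PostgreSQL server):

def setup_price_history(conn):
    with conn.cursor() as cur:
        cur.execute('CREATE EXTENSION IF NOT EXISTS timescaledb')
        cur.execute('''
            CREATE TABLE IF NOT EXISTS price_history (
                url TEXT NOT NULL,
                price NUMERIC,
                scraped_at TIMESTAMPTZ NOT NULL DEFAULT now()
            )
        ''')
        # Chunk rows by time so old data can be compressed or dropped cheaply
        cur.execute("SELECT create_hypertable('price_history', 'scraped_at', if_not_exists => TRUE)")
    conn.commit()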

Wrapping up

Efficient data storage isn't about picking the fanciest database—it's about understanding your data, avoiding duplicates, and setting yourself up to resume when things fail.

Start simple with SQLite or even JSONL files. Add deduplication based on content hashes, not just URLs. Implement checkpointing so you can recover from failures. Compress your data, especially raw HTML. And batch your database operations.

These techniques have saved me from countless storage nightmares. Your future self will thank you when you're not debugging why your 2GB CSV file won't open or why you're paying $500/month to store the same product description 10,000 times.

The best storage solution is the one that lets you focus on extracting insights from your data, not wrestling with the infrastructure that holds it.