A web scraper pulls pages, parses the DOM, and extracts the pieces you care about. Pair it with a local Mistral model and you can summarize, structure, and QA scraped content—no paid APIs, no cloud bills.
In this guide we’ll wire up an async Python scraper, add JS rendering fallback, and run Mistral locally via Ollama to get clean JSON outputs you can drop straight into a database.
What you’ll build
- An async crawler (httpx/aiohttp) with polite rate limits, caching, and robots.txt checks
- HTML extraction that’s fast (selectolax) and readable (trafilatura)
- A local Mistral pipeline (via Ollama) for summaries and schema-true JSON extraction
- Optional JS fallback with Playwright (for SPA pages)
- Lightweight dedup with SimHash/MinHash to avoid double work
We’ll keep everything free and local. No paid third-party APIs.
TL;DR architecture
URLs → Fetch (async + cache + robots) → Parse (selectolax / trafilatura)
→ Chunk → Mistral (Ollama JSON mode / schema) → JSONL out (dedup)
↘ optional Playwright for JS pages
1. Environment: tools we’ll use (all free)
- Python 3.10+, pipx or pip
- selectolax: ultra-fast HTML parser with CSS selectors.
- trafilatura: robust “main content” extractor for messy pages.
- aiohttp + aiohttp-client-cache or requests + requests-cache for HTTP + caching.
- Playwright (Python) for JS-rendered pages.
- Ollama (local LLM server) + Mistral/Mixtral models.
- SimHash/Datasketch for near-duplicate detection.
Mistral 7B is Apache-2.0 licensed; you can run it locally without restrictions.
Install the basics:
# system deps (example: macOS)
brew install python
# python libs
pip install aiohttp aiohttp-client-cache aiosqlite selectolax trafilatura tqdm  # aiosqlite backs the SQLite cache
pip install playwright
playwright install chromium # once
# optional: requests flavour
pip install requests requests-cache
# dedup
pip install simhash datasketch
# ollama (mac/linux/windows installers on their site)
# after installing ollama:
ollama pull mistral # 7B instruct
# or:
ollama pull mixtral # 8x7B MoE, heavier but stronger
Why Ollama? It runs LLMs locally and exposes a simple REST API. It also supports JSON mode and structured outputs (JSON Schema) so you can reliably parse answers.
2. Be polite: robots.txt, rate limits, identity
Before you crawl, check robots and throttle. The REP (Robots Exclusion Protocol) is documented as RFC 9309—it’s not auth, but you should honor it.
# robots.py
import urllib.robotparser as rp
from urllib.parse import urljoin, urlparse

def allowed(url: str, user_agent="MyScraperBot/0.1 (+https://example.com/bot)") -> bool:
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    robots_url = urljoin(root, "/robots.txt")
    parser = rp.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except Exception:
        # If robots.txt is not reachable, default to the conservative choice (disallow)
        return False
    return parser.can_fetch(user_agent, url)
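A quick sanity check, with placeholder URLs:

# usage sketch: filter a candidate list before fetching anything
candidates = ["https://example.com/post/1", "https://example.com/search?q=shoes"]
crawlable = [u for u in candidates if allowed(u)]
print(crawlable)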
3. Async fetcher with caching (fast and friendly)
Use aiohttp with aiohttp-client-cache to avoid re-downloading pages. It supports SQLite, Redis, and other backends.
# fetch.py
import aiohttp
from aiohttp_client_cache import CachedSession, SQLiteBackend
from contextlib import asynccontextmanager

UA = "MyScraperBot/0.1 (+https://example.com/bot)"
DOMAIN_CONCURRENCY = 4
GLOBAL_CONCURRENCY = 20

@asynccontextmanager
async def session_ctx():
    async with CachedSession(
        cache=SQLiteBackend("http_cache.sqlite", expire_after=3600),
        headers={"User-Agent": UA},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as s:
        yield s

async def fetch_url(session, url):
    async with session.get(url, allow_redirects=True) as resp:
        ct = resp.headers.get("content-type", "").lower()
        if "text/html" not in ct and "application/xhtml+xml" not in ct:
            return None, ct
        text = await resp.text(errors="ignore")
        return text, ct
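The DOMAIN_CONCURRENCY / GLOBAL_CONCURRENCY constants above aren’t wired in yet; one way to use them is a global semaphore plus per-domain semaphores around fetch_url() (a sketch, with a fixed delay you would tune per site):

# throttle.py (sketch)
import asyncio
from urllib.parse import urlparse
from fetch import fetch_url, DOMAIN_CONCURRENCY, GLOBAL_CONCURRENCY

_global_sem = asyncio.Semaphore(GLOBAL_CONCURRENCY)
_domain_sems: dict[str, asyncio.Semaphore] = {}

async def polite_fetch(session, url, delay: float = 1.0):
    # cap concurrency globally and per domain, then pause briefly after each request
    domain = urlparse(url).netloc
    sem = _domain_sems.setdefault(domain, asyncio.Semaphore(DOMAIN_CONCURRENCY))
    async with _global_sem:
        async with sem:
            result = await fetch_url(session, url)
            await asyncio.sleep(delay)
            return result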
4. Parse quickly (selectolax) + get “main content” (trafilatura)
- selectolax for targeted CSS extraction (titles, prices, etc.)
- trafilatura for high-quality readable article text (boilerplate removal).
# parse.py
from selectolax.parser import HTMLParser
from trafilatura import extract

def extract_title(html: str) -> str | None:
    tree = HTMLParser(html)
    # Try OG title, then <title>, then the first <h1>
    og = tree.css_first('meta[property="og:title"]')
    if og and og.attributes.get("content"):
        return og.attributes["content"].strip()
    t = tree.css_first("title")
    if t:
        return t.text(strip=True)
    h1 = tree.css_first("h1")
    return h1.text(strip=True) if h1 else None

def extract_main_text(html: str) -> str | None:
    # Trafilatura’s extract() returns cleaned, readable text (or None)
    return extract(html, include_comments=False, include_tables=False)
If you prefer a framework, Scrapy provides selectors and project scaffolding.
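For contrast, a minimal Scrapy spider covering the same ground (the start URL and selectors below are illustrative; adapt them to your site):

# scrapy_spider.py (illustrative sketch)
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/blog"]
    custom_settings = {"ROBOTSTXT_OBEY": True, "DOWNLOAD_DELAY": 1.0}

    def parse(self, response):
        # follow article links, then extract a couple of fields per page
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}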
5. Run Mistral locally (Ollama) and talk JSON
Start Ollama and pull a model:
ollama serve &
ollama pull mistral
Quick sanity check:
curl http://localhost:11434/api/generate \
-d '{"model":"mistral","prompt":"Say hi in one short sentence."}'
Structured output: Ollama supports "format": "json" and full JSON Schema in the format field. Always also ask for JSON in your prompt.
# llm.py
import json, http.client

def mistral_extract(schema: dict, content: str) -> dict:
    """
    Use Ollama's JSON Schema structured outputs.
    """
    conn = http.client.HTTPConnection("localhost", 11434, timeout=120)
    prompt = (
        "You are an information extraction engine. "
        "Return ONLY valid JSON that matches the provided schema.\n\n"
        f"TEXT:\n{content[:6000]}"
    )
    body = {
        "model": "mistral",
        "prompt": prompt,
        "format": schema,  # JSON Schema here → structured output
        "stream": False,
    }
    conn.request(
        "POST", "/api/generate",
        body=json.dumps(body),
        headers={"Content-Type": "application/json"},
    )
    res = conn.getresponse().read()
    data = json.loads(res.decode("utf-8"))
    return json.loads(data["response"])
Prefer schema over plain JSON mode if you need type-safe outputs. (Ollama supports both.)
If you want a beefier local stack or an OpenAI-compatible server for multiple apps, run vLLM and load Mistral/Mixtral there.
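For example, once vLLM is serving on its default port 8000, its OpenAI-compatible endpoint can be queried directly (a sketch; the model name is whatever you actually loaded, and requests is an extra dependency):

# vllm_client.py (sketch against an OpenAI-compatible endpoint)
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # assumption: use the model you served
        "messages": [{"role": "user", "content": "Summarize: local scraping with Mistral."}],
        "temperature": 0.2,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])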
6. Connect it: crawl → parse → chunk → extract with Mistral
We’ll define a small schema for articles (tweak for your niche: products, jobs, reviews…).
# schema.py
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
        "published_date": {"type": "string"}
    },
    "required": ["title", "summary", "topics"]
}
Chunk long texts to stay within the model’s context window, then call mistral_extract().
# pipeline.py
import asyncio, json, hashlib
from fetch import session_ctx, fetch_url
from parse import extract_title, extract_main_text
from llm import mistral_extract
from schema import ARTICLE_SCHEMA
from robots import allowed
from tqdm.asyncio import tqdm_asyncio

def chunk(s: str, n=4000):
    for i in range(0, len(s), n):
        yield s[i:i+n]

async def process_url(session, url):
    if not allowed(url):
        return None
    try:
        html, ct = await fetch_url(session, url)
    except Exception:
        return None
    if not html:
        return None
    title = extract_title(html)
    text = extract_main_text(html) or ""
    if not text.strip():
        return None
    # Simple chunk + stitch strategy: summarize each chunk, then run the full schema
    # over the stitched summaries. mistral_extract() is blocking (http.client), so
    # run it in a worker thread to keep the event loop responsive.
    parts = []
    for c in chunk(text):
        out = await asyncio.to_thread(
            mistral_extract,
            {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"]
            },
            c
        )
        parts.append(out["summary"])
    final = await asyncio.to_thread(
        mistral_extract,
        ARTICLE_SCHEMA,
        f"TITLE: {title or ''}\n\nFULL TEXT:\n" + "\n\n".join(parts)
    )
    final["source_url"] = url
    final["hash"] = hashlib.sha1((title or "")[:200].encode() + text[:4000].encode()).hexdigest()
    return final

async def run(urls: list[str]):
    async with session_ctx() as session:
        results = await tqdm_asyncio.gather(*[process_url(session, u) for u in urls], total=len(urls))
    clean = [r for r in results if r]
    with open("out.jsonl", "w", encoding="utf-8") as f:
        for r in clean:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    return clean
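Kick it off with a seed list (the URLs below are placeholders):

# main.py
import asyncio
from pipeline import run

if __name__ == "__main__":
    seeds = [
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-2",
    ]
    asyncio.run(run(seeds))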
7. JS pages? Add a Playwright fallback
For SPA or heavy JS, you can hydrate the DOM with Playwright and then reuse the same parsers. (Keep usage modest—headless browsers are resource-heavy.)
# render.py
from playwright.async_api import async_playwright

async def render_html(url: str, timeout_ms=15000) -> str | None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=timeout_ms, wait_until="networkidle")
            html = await page.content()
            return html
        except Exception:
            # timeouts / navigation errors: return None rather than crash the crawl
            return None
        finally:
            await browser.close()
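To wire the fallback in, one option is to retry extraction with a rendered DOM only when the static HTML yields no main text (a sketch that reuses render_html() and extract_main_text()):

# fallback sketch: render only when the static fetch gave nothing usable
from render import render_html
from parse import extract_main_text

async def main_text_with_fallback(html: str | None, url: str) -> str | None:
    text = extract_main_text(html) if html else None
    if text and text.strip():
        return text
    rendered = await render_html(url)  # heavier path: headless Chromium
    return extract_main_text(rendered) if rendered else None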
8. Kill dupes: SimHash/MinHash in 15 lines
For large crawls, remove near-duplicates before sending text to Mistral. SimHash is tiny and fast; MinHash+LSH scales well.
# dedup.py
from simhash import Simhash

def simhash_text(s: str) -> int:
    # crude tokenization; customize for your domain
    tokens = [t.lower() for t in s.split()]
    return Simhash(tokens).value

# usage: keep a set of seen signatures and add a Hamming-distance filter if needed
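For example, a minimal in-memory filter on top of simhash_text(), assuming a Hamming distance of 3 or less counts as a near-duplicate (tune the threshold for your corpus); for very large crawls, prefer simhash’s index or datasketch’s MinHashLSH:

# near_dup.py (sketch)
from dedup import simhash_text

def hamming(a: int, b: int) -> int:
    # number of differing bits between two 64-bit signatures
    return bin(a ^ b).count("1")

class SeenSet:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.signatures: list[int] = []

    def is_new(self, text: str) -> bool:
        sig = simhash_text(text)
        if any(hamming(sig, s) <= self.threshold for s in self.signatures):
            return False
        self.signatures.append(sig)
        return True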
Practical extras (that actually help)
- Backoff and quotas: add jitter and per-domain semaphores; don’t overload hosts (helpful even if robots.txt allows you). See the retry sketch after this list.
- Cache first: aiohttp-client-cache / requests-cache slash re-fetches and cost.
- Fast parsing over BS4: selectolax is noticeably faster for CSS queries than typical BeautifulSoup stacks, which matters at scale.
- Readable text when the DOM is noisy: trafilatura.extract() reliably returns the main text for summaries.
- JS only when needed: detect empty content → fall back to Playwright for that URL.
- Structured outputs over raw text: prefer Ollama’s JSON Schema format to avoid brittle regex post-processing.
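A small retry helper with exponential backoff and jitter (a sketch; wrap calls like fetch_url() with it):

# backoff.py (sketch)
import asyncio, random

async def with_backoff(coro_fn, *args, retries: int = 3, base: float = 1.0):
    # retry an async call, doubling the wait each attempt and adding jitter
    for attempt in range(retries):
        try:
            return await coro_fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, 0.5))

Usage would be: html, ct = await with_backoff(fetch_url, session, url).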
Legal & ethics notes
- Honor robots.txt and site ToS. REP is a convention, not auth, but ignoring it is a great way to get blocked (or worse).
- Don’t bypass paywalls/captchas or scrape personal data.
- Identify your bot with a proper UA and contact page.
Troubleshooting
- Model outputs invalid JSON → instruct the model to return JSON in the prompt and set format. If you still see drift, lower the temperature and keep outputs short; schema mode is stricter than plain JSON mode.
- It’s slow → increase concurrency carefully, add caching, and avoid Playwright unless necessary.
- OOM on Mixtral → switch to mistral (7B) or use a quantized build.
Why Mistral for scraping?
- Local (privacy, zero per-call cost)
- Proven open models (Mistral 7B/Mixtral 8x7B) with permissive use; Mistral 7B was released under Apache-2.0.
- Modern features (function calling and structured outputs in the Mistral ecosystem; with Ollama/vLLM you can enforce JSON reliably).