A web scraper pulls pages, parses the DOM, and extracts the pieces you care about. Pair it with a local Mistral model and you can summarize, structure, and QA scraped content—no paid APIs, no cloud bills.
In this guide we’ll wire up an async Python scraper, add JS rendering fallback, and run Mistral locally via Ollama to get clean JSON outputs you can drop straight into a database.
What you’ll build
- An async crawler (httpx/aiohttp) with polite rate limits, caching, and robots.txt checks
- HTML extraction that’s fast (selectolax) and readable (trafilatura)
- A local Mistral pipeline (via Ollama) for summaries and schema-true JSON extraction
- Optional JS fallback with Playwright (for SPA pages)
- Lightweight dedup with SimHash/MinHash to avoid double work
We’ll keep everything free and local. No paid third-party APIs.
TL;DR architecture
URLs → Fetch (async + cache + robots) → Parse (selectolax / trafilatura)
→ Chunk → Mistral (Ollama JSON mode / schema) → JSONL out (dedup)
↘ optional Playwright for JS pages
1. Environment: tools we’ll use (all free)
- Python 3.10+, pipx or pip
- selectolax: ultra-fast HTML parser with CSS selectors.
- trafilatura: robust “main content” extractor for messy pages.
- aiohttp + aiohttp-client-cache or requests + requests-cache for HTTP + caching.
- Playwright (Python) for JS-rendered pages.
- Ollama (local LLM server) + Mistral/Mixtral models.
- SimHash/Datasketch for near-duplicate detection.
Mistral 7B is Apache-2.0 licensed; you can run it locally without restrictions.
Install the basics:
# system deps (example: macOS)
brew install python
# python libs
pip install aiohttp aiohttp-client-cache aiosqlite selectolax trafilatura tqdm  # aiosqlite backs the SQLite cache
pip install playwright
playwright install chromium # once
# optional: requests flavour
pip install requests requests-cache
# dedup
pip install simhash datasketch
# ollama (mac/linux/windows installers on their site)
# after installing ollama:
ollama pull mistral # 7B instruct
# or:
ollama pull mixtral # 8x7B MoE, heavier but stronger
Why Ollama? It runs LLMs locally and exposes a simple REST API. It also supports JSON mode and structured outputs (JSON Schema) so you can reliably parse answers.
2. Be polite: robots.txt, rate limits, identity
Before you crawl, check robots and throttle. The REP (Robots Exclusion Protocol) is documented as RFC 9309—it’s not auth, but you should honor it.
# robots.py
import urllib.robotparser as rp
from urllib.parse import urljoin, urlparse

def allowed(url: str, user_agent="MyScraperBot/0.1 (+https://example.com/bot)") -> bool:
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    robots_url = urljoin(root, "/robots.txt")
    parser = rp.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except Exception:
        # If robots.txt is not reachable, default to the conservative choice (disallow)
        return False
    return parser.can_fetch(user_agent, url)
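A quick sanity check, with placeholder URLs:

# usage sketch: filter a candidate list before fetching anything
candidates = ["https://example.com/post/1", "https://example.com/search?q=shoes"]
crawlable = [u for u in candidates if allowed(u)]
print(crawlable)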
3. Async fetcher with caching (fast and friendly)
Use aiohttp with aiohttp-client-cache to avoid re-downloading pages. It supports SQLite, Redis, and other backends.
# fetch.py
import aiohttp
from aiohttp_client_cache import CachedSession, SQLiteBackend
from contextlib import asynccontextmanager

UA = "MyScraperBot/0.1 (+https://example.com/bot)"
DOMAIN_CONCURRENCY = 4
GLOBAL_CONCURRENCY = 20

@asynccontextmanager
async def session_ctx():
    async with CachedSession(
        cache=SQLiteBackend("http_cache.sqlite", expire_after=3600),
        headers={"User-Agent": UA},
        timeout=aiohttp.ClientTimeout(total=30),
    ) as s:
        yield s

async def fetch_url(session, url):
    async with session.get(url, allow_redirects=True) as resp:
        ct = resp.headers.get("content-type", "").lower()
        if "text/html" not in ct and "application/xhtml+xml" not in ct:
            return None, ct
        text = await resp.text(errors="ignore")
        return text, ct
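The DOMAIN_CONCURRENCY / GLOBAL_CONCURRENCY constants above aren’t wired in yet; one way to use them is a global semaphore plus per-domain semaphores around fetch_url() (a sketch, with a fixed delay you would tune per site):

# throttle.py (sketch)
import asyncio
from urllib.parse import urlparse
from fetch import fetch_url, DOMAIN_CONCURRENCY, GLOBAL_CONCURRENCY

_global_sem = asyncio.Semaphore(GLOBAL_CONCURRENCY)
_domain_sems: dict[str, asyncio.Semaphore] = {}

async def polite_fetch(session, url, delay: float = 1.0):
    # cap concurrency globally and per domain, then pause briefly after each request
    domain = urlparse(url).netloc
    sem = _domain_sems.setdefault(domain, asyncio.Semaphore(DOMAIN_CONCURRENCY))
    async with _global_sem:
        async with sem:
            result = await fetch_url(session, url)
            await asyncio.sleep(delay)
            return result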
4. Parse quickly (selectolax) + get “main content” (trafilatura)
- selectolax for targeted CSS extraction (titles, prices, etc.)
- trafilatura for high-quality readable article text (boilerplate removal).
# parse.py
from selectolax.parser import HTMLParser
from trafilatura import extract

def extract_title(html: str) -> str | None:
    tree = HTMLParser(html)
    # Try OG title, then <title>, then the first <h1>
    og = tree.css_first('meta[property="og:title"]')
    if og and og.attributes.get("content"):
        return og.attributes["content"].strip()
    t = tree.css_first("title")
    if t:
        return t.text(strip=True)
    h1 = tree.css_first("h1")
    return h1.text(strip=True) if h1 else None

def extract_main_text(html: str) -> str | None:
    # Trafilatura’s extract() returns cleaned, readable text (or None)
    return extract(html, include_comments=False, include_tables=False)
If you prefer a framework, Scrapy provides selectors and project scaffolding.
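For contrast, a minimal Scrapy spider covering the same ground (the start URL and selectors below are illustrative; adapt them to your site):

# scrapy_spider.py (illustrative sketch)
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/blog"]
    custom_settings = {"ROBOTSTXT_OBEY": True, "DOWNLOAD_DELAY": 1.0}

    def parse(self, response):
        # follow article links, then extract a couple of fields per page
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}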
5. Run Mistral locally (Ollama) and talk JSON
Start Ollama and pull a model:
ollama serve &
ollama pull mistral
Quick sanity check:
curl http://localhost:11434/api/generate \
-d '{"model":"mistral","prompt":"Say hi in one short sentence."}'
Structured output: Ollama supports "format": "json" and full JSON Schema in the format field. Always also ask for JSON in your prompt.
# llm.py
import json, http.client

def mistral_extract(schema: dict, content: str) -> dict:
    """
    Use Ollama's JSON Schema structured outputs.
    """
    conn = http.client.HTTPConnection("localhost", 11434, timeout=120)
    prompt = (
        "You are an information extraction engine. "
        "Return ONLY valid JSON that matches the provided schema.\n\n"
        f"TEXT:\n{content[:6000]}"
    )
    body = {
        "model": "mistral",
        "prompt": prompt,
        "format": schema,  # JSON Schema here → structured output
        "stream": False,
    }
    conn.request(
        "POST", "/api/generate",
        body=json.dumps(body),
        headers={"Content-Type": "application/json"},
    )
    res = conn.getresponse().read()
    data = json.loads(res.decode("utf-8"))
    return json.loads(data["response"])
Prefer schema over plain JSON mode if you need type-safe outputs. (Ollama supports both.)
If you want a beefier local stack or an OpenAI-compatible server for multiple apps, run vLLM and load Mistral/Mixtral there.
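For example, once vLLM is serving on its default port 8000, its OpenAI-compatible endpoint can be queried directly (a sketch; the model name is whatever you actually loaded, and requests is an extra dependency):

# vllm_client.py (sketch against an OpenAI-compatible endpoint)
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.2",  # assumption: use the model you served
        "messages": [{"role": "user", "content": "Summarize: local scraping with Mistral."}],
        "temperature": 0.2,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])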
6. Connect it: crawl → parse → chunk → extract with Mistral
We’ll define a small schema for articles (tweak for your niche: products, jobs, reviews…).
# schema.py
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
        "published_date": {"type": "string"}
    },
    "required": ["title", "summary", "topics"]
}
Chunk long texts to stay within the model’s context window, then call mistral_extract().
# pipeline.py
import asyncio, json, hashlib
from fetch import session_ctx, fetch_url
from parse import extract_title, extract_main_text
from llm import mistral_extract
from schema import ARTICLE_SCHEMA
from robots import allowed
from tqdm.asyncio import tqdm_asyncio

def chunk(s: str, n=4000):
    for i in range(0, len(s), n):
        yield s[i:i+n]

async def process_url(session, url):
    if not allowed(url):
        return None
    try:
        html, ct = await fetch_url(session, url)
    except Exception:
        return None
    if not html:
        return None
    title = extract_title(html)
    text = extract_main_text(html) or ""
    if not text.strip():
        return None
    # Simple chunk + stitch strategy: summarize each chunk, then run the full schema
    # over the stitched summaries. mistral_extract() is blocking (http.client), so
    # run it in a worker thread to keep the event loop responsive.
    parts = []
    for c in chunk(text):
        out = await asyncio.to_thread(
            mistral_extract,
            {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"]
            },
            c
        )
        parts.append(out["summary"])
    final = await asyncio.to_thread(
        mistral_extract,
        ARTICLE_SCHEMA,
        f"TITLE: {title or ''}\n\nFULL TEXT:\n" + "\n\n".join(parts)
    )
    final["source_url"] = url
    final["hash"] = hashlib.sha1((title or "")[:200].encode() + text[:4000].encode()).hexdigest()
    return final

async def run(urls: list[str]):
    async with session_ctx() as session:
        results = await tqdm_asyncio.gather(*[process_url(session, u) for u in urls], total=len(urls))
    clean = [r for r in results if r]
    with open("out.jsonl", "w", encoding="utf-8") as f:
        for r in clean:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    return clean
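Kick it off with a seed list (the URLs below are placeholders):

# main.py
import asyncio
from pipeline import run

if __name__ == "__main__":
    seeds = [
        "https://example.com/blog/post-1",
        "https://example.com/blog/post-2",
    ]
    asyncio.run(run(seeds))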
7. JS pages? Add a Playwright fallback
For SPA or heavy JS, you can hydrate the DOM with Playwright and then reuse the same parsers. (Keep usage modest—headless browsers are resource-heavy.)
# render.py
from playwright.async_api import async_playwright

async def render_html(url: str, timeout_ms=15000) -> str | None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        try:
            await page.goto(url, timeout=timeout_ms, wait_until="networkidle")
            html = await page.content()
            return html
        except Exception:
            # timeouts / navigation errors: return None rather than crash the crawl
            return None
        finally:
            await browser.close()
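To wire the fallback in, one option is to retry extraction with a rendered DOM only when the static HTML yields no main text (a sketch that reuses render_html() and extract_main_text()):

# fallback sketch: render only when the static fetch gave nothing usable
from render import render_html
from parse import extract_main_text

async def main_text_with_fallback(html: str | None, url: str) -> str | None:
    text = extract_main_text(html) if html else None
    if text and text.strip():
        return text
    rendered = await render_html(url)  # heavier path: headless Chromium
    return extract_main_text(rendered) if rendered else None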
8. Kill dupes: SimHash/MinHash in 15 lines
For large crawls, remove near-duplicates before sending text to Mistral. SimHash is tiny and fast; MinHash+LSH scales well.
# dedup.py
from simhash import Simhash

def simhash_text(s: str) -> int:
    # crude tokenization; customize for your domain
    tokens = [t.lower() for t in s.split()]
    return Simhash(tokens).value

# usage: keep a set of seen signatures and add a Hamming-distance filter if needed
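For example, a minimal in-memory filter on top of simhash_text(), assuming a Hamming distance of 3 or less counts as a near-duplicate (tune the threshold for your corpus); for very large crawls, prefer simhash’s index or datasketch’s MinHashLSH:

# near_dup.py (sketch)
from dedup import simhash_text

def hamming(a: int, b: int) -> int:
    # number of differing bits between two 64-bit signatures
    return bin(a ^ b).count("1")

class SeenSet:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.signatures: list[int] = []

    def is_new(self, text: str) -> bool:
        sig = simhash_text(text)
        if any(hamming(sig, s) <= self.threshold for s in self.signatures):
            return False
        self.signatures.append(sig)
        return True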
Practical extras (that actually help)
- Backoff and quotas: add jitter and per-domain semaphores; don’t overload hosts (helpful even if robots.txt allows you). See the retry sketch after this list.
- Cache first: aiohttp-client-cache / requests-cache slash re-fetches and cost.
- Fast parsing over BS4: selectolax is noticeably faster for CSS queries than typical BeautifulSoup stacks, which matters at scale.
- Readable text when the DOM is noisy: trafilatura.extract() reliably returns the main text for summaries.
- JS only when needed: detect empty content → fall back to Playwright for that URL.
- Structured outputs over raw text: prefer Ollama’s JSON Schema format to avoid brittle regex post-processing.
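A small retry helper with exponential backoff and jitter (a sketch; wrap calls like fetch_url() with it):

# backoff.py (sketch)
import asyncio, random

async def with_backoff(coro_fn, *args, retries: int = 3, base: float = 1.0):
    # retry an async call, doubling the wait each attempt and adding jitter
    for attempt in range(retries):
        try:
            return await coro_fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, 0.5))

Usage would be: html, ct = await with_backoff(fetch_url, session, url).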
Legal & ethics notes
- Honor robots.txt and site ToS. REP is a convention, not auth, but ignoring it is a great way to get blocked (or worse).
- Don’t bypass paywalls/captchas or scrape personal data.
- Identify your bot with a proper UA and contact page.
Troubleshooting
- Model outputs invalid JSON → instruct the model to return JSON in the prompt and set format. If you still see drift, lower the temperature and keep outputs short; schema mode is stricter than plain JSON mode.
- It’s slow → increase concurrency carefully, add caching, and avoid Playwright unless necessary.
- OOM on Mixtral → switch to mistral (7B) or use a quantized build.
Why Mistral for scraping?
- Local (privacy, zero per-call cost)
- Proven open models (Mistral 7B/Mixtral 8x7B) with permissive use; Mistral 7B was released under Apache-2.0.
- Modern features (function calling and structured outputs in the Mistral ecosystem; with Ollama/vLLM you can enforce JSON reliably).