Groq brings lightning-fast AI inference to web scraping through its LPU (Language Processing Unit) chips, making it possible to parse messy HTML and extract structured data at speeds that leave traditional methods in the dust.
In this guide, we'll show you how to build scrapers that leverage Groq's inference speed to turn unstructured web data into clean JSON outputs.
What is Groq?
Before diving into the code, let's get clear on what Groq actually is. Groq is an AI inference platform built around custom LPU chips that deliver some of the fastest response times in the industry.
Unlike your standard CPU or GPU setup, these chips are purpose-built for running large language models at record-breaking speeds.
The platform is OpenAI-compatible, which means you can swap it into existing workflows without rewriting everything from scratch.
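To see what that compatibility looks like in practice, here's a minimal sketch that points the official OpenAI Python SDK at Groq's endpoint instead of OpenAI's. It assumes you have the openai package installed and a Groq key in the GROQ_API_KEY environment variable; the rest of this guide uses Groq's own SDK instead.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)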
For web scraping specifically, Groq shines because it can process massive amounts of text and return structured data faster than you can say "rate limit." The free tier gives you generous token limits, and paid plans scale without breaking the bank.
Why Use AI for Web Scraping?
Traditional scraping relies on CSS selectors and XPath expressions that break the moment a site redesigns. You spend more time maintaining selectors than actually collecting data. AI-powered scraping flips this model.
Instead of telling your scraper exactly where to find data, you describe what you want in plain English. The LLM figures out how to extract it, even when the HTML structure changes.
Groq's speed advantage means you can process scraped content in near real-time without the multi-second delays typical of other AI providers. This matters when you're scraping hundreds or thousands of pages.
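To make the contrast concrete, here's a toy, hypothetical example; the HTML snippet and class name are made up, but they show the difference between pointing at a location in the markup and describing the data you want.

from bs4 import BeautifulSoup

html = '<span class="price--current">$19.99</span>'  # toy markup

# Selector-based: tied to one exact class name, breaks on the next redesign
soup = BeautifulSoup(html, 'lxml')
price = soup.select_one('span.price--current').get_text()

# Prompt-based: describe the field and let the LLM locate it (see Step 3)
extraction_prompt = "Extract: price as a string, including the currency symbol"
print(price, '|', extraction_prompt)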
Step 1: Set Up Your Environment
Let's get the basics in place before we start scraping. You'll need Python 3.8 or newer installed on your machine.
First, create a project directory and set up a virtual environment:
mkdir groq-scraper
cd groq-scraper
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Now install the required packages:
pip install groq requests beautifulsoup4 lxml python-dotenv
Here's what each package does:
- groq: Official Groq SDK for API access
- requests: HTTP library for fetching web pages
- beautifulsoup4: HTML parsing when you need it
- lxml: Fast XML/HTML parser
- python-dotenv: Manages environment variables
Create a .env file in your project root to store your API key:
GROQ_API_KEY=your_groq_api_key_here
Head to console.groq.com to grab your free API key. The free tier gives you thousands of tokens per minute, which is plenty for getting started.
Create a file called scraper.py where we'll build our scraper:
import os
from groq import Groq
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
import json
import time
load_dotenv()
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
This setup loads your API key from the environment and initializes the Groq client. The load_dotenv() call reads the .env file and makes your API key available through os.environ.
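If you want to confirm the client is wired up correctly before going further, a throwaway test call at the bottom of scraper.py is enough; it uses the same model as the extraction examples later, and you can delete it once you've seen a response.

# Optional sanity check: one tiny completion to verify the key and client
ping = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=5,
)
print(ping.choices[0].message.content)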
Step 2: Fetch and Parse Web Content
Now we'll build a function that fetches a web page and cleans it up for the LLM. The trick here is to strip out all the noise—scripts, styles, navigation menus—so Groq only processes the content that matters.
def fetch_page(url):
    """
    Fetch a web page and return its raw HTML.
    Uses a proper User-Agent to avoid basic blocks.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
The User-Agent header makes your requests look like they're coming from a real browser instead of a Python script. Many sites block requests with the default python-requests user agent, so this simple header swap gets you past basic protection.
Next, we'll clean the HTML to extract just the meaningful content:
def clean_html(html_content):
    """
    Parse HTML and extract clean text, removing scripts, styles, and nav elements.
    """
    soup = BeautifulSoup(html_content, 'lxml')

    # Remove unwanted elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Get text and clean whitespace
    text = soup.get_text(separator=' ', strip=True)

    # Remove excessive whitespace
    text = ' '.join(text.split())

    return text
This function uses BeautifulSoup to parse the HTML, strips out common noise elements, and returns clean text. The separator=' ' argument in get_text() ensures words don't get mashed together when tags are removed.
Why clean the HTML first? Two reasons: it reduces token usage (saving money and staying under limits), and it helps the LLM focus on relevant content instead of getting distracted by CSS classes and JavaScript.
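As a quick sanity check of the cleanup step, you can run the two functions back to back and compare sizes. The URL below is just a placeholder; swap in a page you're allowed to scrape.

html = fetch_page("https://example.com")
if html:
    text = clean_html(html)
    print(f"Raw HTML: {len(html)} chars, cleaned text: {len(text)} chars")
    print(text[:300])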
Step 3: Extract Structured Data with Groq
Here's where Groq really shines. Instead of writing brittle CSS selectors, we'll ask the LLM to extract exactly what we need in a structured format.
def extract_with_groq(text_content, extraction_prompt, model="llama-3.3-70b-versatile"):
    """
    Use Groq to extract structured data from text content.
    Returns parsed JSON or None if extraction fails.
    """
    system_message = """You are a data extraction specialist.
    Extract information as valid JSON only.
    No additional text, explanations, or markdown formatting.
    If information is not found, use null for that field."""

    user_message = f"""Extract the following information from this content:

    {extraction_prompt}

    Content:
    {text_content[:15000]}

    Return valid JSON only."""

    try:
        chat_completion = client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            model=model,
            temperature=0,  # Deterministic output
            max_tokens=2048
        )
        response_text = chat_completion.choices[0].message.content

        # Try to parse JSON from the response.
        # Sometimes LLMs wrap JSON in markdown code blocks.
        if "```json" in response_text:
            response_text = response_text.split("```json")[1].split("```")[0]
        elif "```" in response_text:
            response_text = response_text.split("```")[1].split("```")[0]

        return json.loads(response_text.strip())
    except json.JSONDecodeError as e:
        print(f"Failed to parse JSON: {e}")
        print(f"Response was: {response_text}")
        return None
    except Exception as e:
        print(f"Groq API error: {e}")
        return None
Let's break down what's happening here:
The system_message sets the LLM's behavior. We explicitly tell it to return only JSON with no extra text. This is critical because LLMs love to explain themselves, and that breaks JSON parsing.
We truncate text_content to 15,000 characters because Groq's free tier models have token limits. On production systems, you'd want to implement smart chunking to process longer content.
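If you do need to handle longer pages, a simple character-based split is one possible starting point. The sketch below is an assumption about how you might do it, not part of the scraper above; a smarter version would split on paragraph or section boundaries and merge the per-chunk results.

def chunk_text(text, max_chars=15000):
    """Yield consecutive slices of text, each at most max_chars long."""
    for start in range(0, len(text), max_chars):
        yield text[start:start + max_chars]

def extract_in_chunks(text, extraction_prompt):
    """Run extraction on each chunk and collect the non-empty results."""
    results = []
    for chunk in chunk_text(text):
        data = extract_with_groq(chunk, extraction_prompt)
        if data:
            results.append(data)
    return results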
The temperature=0 setting makes the output deterministic. Higher temperatures add randomness, which is great for creative writing but terrible for data extraction where you want consistent results.
The JSON parsing includes fallback logic to handle cases where the LLM wraps the JSON in markdown code blocks (a common behavior).
Now let's put it together with a practical example:
def scrape_product_data(url):
    """
    Scrape product information from an e-commerce page.
    """
    print(f"Fetching: {url}")
    html = fetch_page(url)
    if not html:
        return None

    clean_text = clean_html(html)

    extraction_prompt = """
    Extract the following product information:
    - product_name: string
    - price: string (include currency symbol)
    - description: string
    - in_stock: boolean
    - rating: float or null
    - reviews_count: integer or null
    """

    result = extract_with_groq(clean_text, extraction_prompt)

    if result:
        result['url'] = url
        result['scraped_at'] = time.strftime('%Y-%m-%d %H:%M:%S')

    return result
This function ties everything together: fetch the page, clean it, extract structured data, and add metadata. The metadata is important for tracking when data was collected and debugging issues later.
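Calling it is a one-liner; the URL below is a placeholder for a real product page.

product = scrape_product_data("https://example.com/some-product")
if product:
    print(json.dumps(product, indent=2))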
Step 4: Handle Rate Limits and Scale
Groq has generous rate limits, but you'll still hit them if you're scraping aggressively. Here's how to build a scraper that respects limits and retries intelligently:
def scrape_with_retry(url, extraction_prompt, max_retries=3):
    """
    Scrape with exponential backoff retry logic.
    """
    for attempt in range(max_retries):
        try:
            html = fetch_page(url)
            if html:
                clean_text = clean_html(html)
                result = extract_with_groq(clean_text, extraction_prompt)
                if result:
                    return result
        except Exception as e:
            print(f"Attempt {attempt + 1} raised: {e}")

        # Exponential backoff before the next attempt: 1s, then 2s
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt
            print(f"Attempt {attempt + 1} failed, retrying in {wait_time} seconds...")
            time.sleep(wait_time)

    return None
def scrape_multiple(urls, extraction_prompt, delay=1):
    """
    Scrape multiple URLs with rate limiting.
    """
    results = []
    for i, url in enumerate(urls):
        print(f"\nScraping {i+1}/{len(urls)}: {url}")
        result = scrape_with_retry(url, extraction_prompt)
        if result:
            results.append(result)
            print("✓ Success")
        else:
            print("✗ Failed")

        # Rate limiting delay
        if i < len(urls) - 1:
            time.sleep(delay)

    return results
The exponential backoff in scrape_with_retry handles temporary failures gracefully: the first failed attempt waits 1 second before retrying and the second waits 2 seconds, whether the failure was a blocked fetch, an empty extraction, or an exception. This pattern prevents hammering a server or API when something goes wrong.
The scrape_multiple function adds a delay between requests. Even though Groq can handle the inference speed, you still need to be respectful to the websites you're scraping. A 1-second delay is reasonable for most use cases.
Let's see it in action:
if __name__ == "__main__":
    # Example: Scrape product data
    test_urls = [
        "https://example.com/product1",
        "https://example.com/product2",
        "https://example.com/product3"
    ]

    prompt = """
    Extract:
    - product_name: string
    - price: string
    - in_stock: boolean
    """

    products = scrape_multiple(test_urls, prompt, delay=2)

    # Save results
    with open('scraped_data.json', 'w') as f:
        json.dump(products, f, indent=2)

    print(f"\nScraped {len(products)} products")
Advanced: Using Groq's Built-in Web Search
Here's the trick most people miss: Groq's compound models have built-in web search capabilities. Instead of scraping HTML yourself, you can let Groq fetch and parse content directly. This works great for gathering information but gives you less control over the scraping process.
def groq_web_research(query):
    """
    Use Groq's compound model with built-in web search.
    Great for research and data gathering without manual scraping.
    """
    completion = client.chat.completions.create(
        model="compound-beta",
        messages=[
            {
                "role": "user",
                "content": query
            }
        ],
        search_settings={
            "enable": True,
            "max_results": 10
        }
    )
    return completion.choices[0].message.content
This approach is powerful when you need current information or want to gather data from multiple sources without managing the HTTP requests yourself. The tradeoff is less control over which specific pages get scraped and how the data is structured.
For targeted scraping where you need specific data from specific pages, stick with the manual approach from Steps 1-3. For broader research or current event data, Groq's built-in search is faster and simpler.
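Using it is as simple as passing a question. Here's a hypothetical query:

summary = groq_web_research(
    "Summarize this week's most notable open-source LLM releases, with sources."
)
print(summary)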
Dealing with Protected Sites
Some sites use Cloudflare or other anti-bot measures. While the basic techniques in this guide work for many sites, protected sites need different approaches:
Request-based approach (what we've used):
- Fast and efficient
- Works on sites without JavaScript-heavy protection
- Lower resource usage
Browser automation approach (when you need it):
- Use Playwright or Selenium with stealth plugins
- Handles JavaScript-rendered content
- Bypasses basic anti-bot measures
- Much slower and more resource-intensive
For sites with Cloudflare protection, you'd typically use Playwright with the stealth plugin, then feed the rendered HTML to Groq for extraction. That's beyond the scope of this guide, but the extraction logic stays the same.
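For reference, handing a rendered page back to the pipeline from Steps 2 and 3 might look like the sketch below. It assumes Playwright is installed (pip install playwright, then playwright install chromium) and leaves any stealth plugin out entirely, so treat it as a starting point rather than a Cloudflare bypass.

from playwright.sync_api import sync_playwright

def scrape_rendered_page(url, extraction_prompt):
    """Render a JavaScript-heavy page, then reuse clean_html and extract_with_groq."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()  # fully rendered HTML, scripts already executed
        browser.close()
    return extract_with_groq(clean_html(html), extraction_prompt)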
Rate Limit Management
Groq's free tier gives you:
- Requests per minute (RPM): varies by model
- Tokens per minute (TPM): varies by model
Check your current limits at console.groq.com/settings/limits. When you hit a rate limit, Groq returns a 429 error. Here's how to handle it:
def handle_rate_limit(func):
    """
    Decorator to handle rate limit errors with exponential backoff.
    """
    def wrapper(*args, **kwargs):
        max_attempts = 5
        base_delay = 2
        for attempt in range(max_attempts):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                if "rate_limit" in str(e).lower() or "429" in str(e):
                    if attempt < max_attempts - 1:
                        wait = base_delay * (2 ** attempt)
                        print(f"Rate limited. Waiting {wait}s...")
                        time.sleep(wait)
                    else:
                        raise
                else:
                    raise
    return wrapper

@handle_rate_limit
def extract_with_groq_protected(text_content, extraction_prompt):
    """
    Same as extract_with_groq but with rate limit protection.
    """
    return extract_with_groq(text_content, extraction_prompt)
This decorator wraps your Groq calls and automatically retries with increasing delays when you hit rate limits. For production systems, you might want to track your token usage and implement your own rate limiting before hitting Groq's limits.
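One way to do that client-side budgeting is sketched below. This is an assumption about structure, not a Groq feature: it reads the usage field the SDK returns with each completion and sleeps once a self-imposed per-minute token budget (the number here is made up) is spent. You'd call track_usage(chat_completion) right after the create() call inside extract_with_groq; it relies on the time import already at the top of scraper.py.

TOKEN_BUDGET_PER_MIN = 20000  # hypothetical budget; check your real limits in the console
_window_start = time.time()
_tokens_used = 0

def track_usage(chat_completion):
    """Record tokens from a completion and sleep if the per-minute budget is spent."""
    global _window_start, _tokens_used
    if time.time() - _window_start >= 60:
        _window_start, _tokens_used = time.time(), 0
    usage = getattr(chat_completion, "usage", None)
    if usage:
        _tokens_used += usage.total_tokens
    if _tokens_used >= TOKEN_BUDGET_PER_MIN:
        remaining = 60 - (time.time() - _window_start)
        if remaining > 0:
            print(f"Token budget reached, pausing {remaining:.0f}s...")
            time.sleep(remaining)
        _window_start, _tokens_used = time.time(), 0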
Real-World Example: News Article Scraper
Let's put everything together with a practical example that scrapes news articles:
def scrape_news_article(url):
    """
    Extract article data from news websites.
    """
    html = fetch_page(url)
    if not html:
        return None

    clean_text = clean_html(html)

    prompt = """
    Extract:
    - title: string
    - author: string or null
    - published_date: string (ISO format if possible) or null
    - article_text: string (main content only)
    - categories: array of strings or null
    - summary: string (3-sentence summary)
    """

    return extract_with_groq(clean_text, prompt)

# Scrape multiple news articles, pausing briefly between requests
news_urls = [
    "https://example-news.com/tech-article-1",
    "https://example-news.com/tech-article-2"
]

articles = []
for url in news_urls:
    articles.append(scrape_news_article(url))
    time.sleep(2)

# Filter and process results
valid_articles = [a for a in articles if a and a.get('title')]
print(f"Successfully scraped {len(valid_articles)} articles")
This example shows how flexible the LLM approach is. We're asking for a summary of each article alongside the structured data. Traditional scraping would require separate processing steps, but Groq handles it all in one call.
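Because the summary comes back as just another JSON field, working with it downstream is plain dictionary access:

for article in valid_articles:
    print(article['title'])
    print(article.get('summary', 'No summary returned'))
    print('-' * 40)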
Final Thoughts
Web scraping with Groq flips the traditional model on its head. Instead of maintaining brittle CSS selectors, you describe what you want and let the LLM figure out the extraction. The speed advantage of Groq's LPU chips means you can process pages in real-time without the delays typical of other AI providers.
The approach works best for:
- Sites with changing HTML structures
- Extracting complex or unstructured data
- Scenarios where speed matters
- Cases where maintaining selectors is too costly
It's not ideal for:
- Simple, structured data (use traditional scraping)
- Sites requiring JavaScript execution (add Playwright)
- Budget-constrained projects with massive scale (consider caching)
Start with the basic pattern from Steps 1-3, then add retry logic and rate limiting as you scale. The modular structure makes it easy to swap in browser automation or add new extraction patterns as your needs grow.
Remember to respect robots.txt, implement reasonable delays, and consider the ethical implications of your scraping. Just because you can scrape something doesn't always mean you should.