How to Use Jan.ai: Run Private AI Models Locally in 6 Steps

Jan.ai is an open-source ChatGPT alternative that runs 100% offline on your computer, powered by llama.cpp. It provides an OpenAI-compatible API server at localhost:1337, letting you run models like Llama 3, Mistral, and Qwen locally without sending data to external servers.

In this guide, I'll show you how to set up Jan.ai, optimize its performance, and integrate it with your development workflow – including some lesser-known tricks that can drastically improve your experience.

Why Jan.ai Over Cloud Solutions?

Before diving into the setup, let's address the elephant in the room. While everyone's jumping on the ChatGPT bandwagon, running AI locally offers distinct advantages:

  • No network dependency – Perfect for air-gapped environments or when traveling
  • Complete data sovereignty – Your prompts never leave your machine
  • No rate limits or usage caps – Generate as much as your hardware allows
  • Customizable censorship levels – Adjust alignment and moderation to your needs
  • Free API endpoint – Build applications without worrying about OpenAI credits

Prerequisites

Before installation, verify your hardware meets these requirements:

Minimum Specs

  • CPU: AVX2 support (Intel Haswell/AMD Excavator or newer)
  • RAM: 8GB minimum (16GB recommended)
  • Storage: 10GB free space for app + models
  • NVIDIA: 6GB+ VRAM with CUDA 12.0+
  • AMD: Vulkan support
  • Apple Silicon: Metal support (built-in)

Pro tip: Run grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" on Linux, or check with CPU-Z on Windows, to verify AVX2 support.

Step 1: Install Jan.ai and Configure for Maximum Performance

Download and Install

# macOS/Linux users can use Homebrew (unofficial)
brew install --cask jan

# Or download directly (asset names change between releases –
# copy the exact URL from the GitHub releases page if these 404)
wget https://github.com/menloresearch/jan/releases/latest/download/jan-mac-x64.dmg  # macOS Intel
wget https://github.com/menloresearch/jan/releases/latest/download/jan-mac-arm64.dmg  # macOS Silicon
wget https://github.com/menloresearch/jan/releases/latest/download/jan-linux-x86_64.AppImage  # Linux

For Windows users, grab the .exe from jan.ai or GitHub releases.

Initial Performance Optimization

Once installed, immediately navigate to Settings > Hardware and:

  1. Enable GPU acceleration if you have a compatible GPU
  2. Set CPU threads to your physical core count minus 2 (leave some for the OS)
  3. Adjust context size based on your RAM:
    • 8GB RAM: 2048 tokens
    • 16GB RAM: 4096 tokens
    • 32GB+ RAM: 8192+ tokens
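The thread and context heuristics above can be sketched as a small helper. This is purely illustrative – the key names (`cpu_threads`, `ctx_len`) mirror this guide's examples, not an official Jan settings schema:

```python
import os

def suggest_settings(ram_gb: int) -> dict:
    """Suggest CPU threads and context size per the heuristics above."""
    cores = os.cpu_count() or 4     # note: may report logical, not physical, cores
    threads = max(1, cores - 2)     # leave a couple of cores for the OS
    if ram_gb >= 32:
        ctx = 8192
    elif ram_gb >= 16:
        ctx = 4096
    else:
        ctx = 2048
    return {"cpu_threads": threads, "ctx_len": ctx}

print(suggest_settings(16))
```

Plug the resulting numbers into Settings > Hardware rather than treating them as gospel – the right thread count varies with what else your machine is running.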

Step 2: Download Models Strategically

Not all models are created equal. Here's how to choose:

Quick Model Selection Guide

RAM Available   Recommended Model Size   Example Models
8GB             3B-7B                    Qwen2.5-3B, Mistral-7B-Q4
16GB            7B-13B                   Llama-3.1-8B, Mistral-Nemo-12B
32GB+           13B-32B                  Qwen2.5-32B-Q4, Mixtral-8x7B-Q4

Note that a 70B model does not fit in 32GB even at Q4 – it needs roughly 40GB+ of memory.

Download via Jan Hub

  1. Click the Hub icon (four squares)
  2. Filter by your hardware capabilities
  3. Look for models with these quantization levels:
    • Q4_K_M: Best balance (recommended)
    • Q5_K_M: Higher quality, more RAM
    • Q3_K_S: Faster but lower quality
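To see why these quantization levels matter, here is a back-of-the-envelope file-size estimator. The bits-per-weight figures are rough rules of thumb for llama.cpp K-quants (not exact values), and the estimate ignores metadata and the KV cache you also need at runtime:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file-size estimate: parameter count x bits per weight."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Approximate bits/weight for common quant levels (rule of thumb, not exact)
for name, bpw in [("Q3_K_S", 3.5), ("Q4_K_M", 4.8), ("Q5_K_M", 5.7)]:
    print(f"7B @ {name}: ~{gguf_size_gb(7, bpw):.1f} GB")
```

Budget roughly 1-2GB on top of the file size for context and runtime overhead when deciding what fits in your RAM or VRAM.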

Advanced: Import Custom GGUF Models

# Download directly from Hugging Face
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Move to Jan's model directory (each model lives in its own subfolder)
mkdir -p ~/jan/models/mistral-7b-custom
mv mistral-7b-instruct-v0.2.Q4_K_M.gguf ~/jan/models/mistral-7b-custom/

Then create a model.json in that same subfolder:

{
  "id": "mistral-7b-custom",
  "object": "model",
  "name": "Mistral 7B Custom",
  "version": "1.0",
  "format": "gguf",
  "settings": {
    "ctx_len": 4096,
    "ngl": 35,
    "embedding": false,
    "n_batch": 512,
    "n_parallel": 4
  }
}
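A malformed model.json is a common reason a custom model silently fails to appear. A quick sanity check can catch that before restarting Jan – the required keys below are taken from the example above (a convention of this guide, not an official Jan schema):

```python
import json

# Keys taken from the example model.json above; treat this as a
# convention of this guide, not an official Jan schema.
REQUIRED = {"id", "object", "name", "version", "format", "settings"}

def check_model_json(path: str) -> list[str]:
    """Return a list of problems found in a model.json file."""
    with open(path) as f:
        cfg = json.load(f)  # raises ValueError on malformed JSON
    problems = [f"missing key: {k}" for k in sorted(REQUIRED - cfg.keys())]
    if cfg.get("format") != "gguf":
        problems.append("format should be 'gguf'")
    return problems
```

Run it against your file and fix anything it reports before relaunching Jan.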

Step 3: Optimize GPU Acceleration (The Game Changer)

This is where most tutorials stop, but proper GPU configuration can 10x your inference speed.

NVIDIA CUDA Setup

# Verify CUDA installation
nvidia-smi

# If you see your GPU, enable it in Jan
# Settings > Hardware > GPUs > Toggle ON

Critical setting: Adjust the ngl (GPU layers) parameter:

  1. Start with ngl: 35 for 7B models
  2. Monitor VRAM usage with nvidia-smi -l 1
  3. Increase until you hit 90% VRAM utilization
  4. Back off by 5 if you experience crashes
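Steps 2-3 above boil down to watching VRAM while you raise ngl. A small script can do the watching for you – it shells out to nvidia-smi's --query-gpu interface (NVIDIA only; the utilization helper itself is plain arithmetic):

```python
import subprocess

def vram_utilization(used_mb: float, total_mb: float) -> float:
    """Percent of VRAM in use -- aim for ~90% before backing off ngl."""
    return 100.0 * used_mb / total_mb

def read_vram() -> tuple[float, float]:
    """Query the first GPU's memory via nvidia-smi (NVIDIA only)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"], text=True)
    used, total = out.splitlines()[0].split(", ")
    return float(used), float(total)

if __name__ == "__main__":
    try:
        used, total = read_vram()
        print(f"VRAM: {vram_utilization(used, total):.0f}% "
              f"({used:.0f}/{total:.0f} MB)")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not available on this machine")
```

Run it after loading a model at each ngl value; once it reports around 90%, you've found your ceiling.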

AMD Vulkan Configuration

For AMD GPUs, Jan uses Vulkan. Set the backend:

# In Jan's settings.json
"llama_cpp_backend": "vulkan"

Apple Silicon Optimization

M-series Macs get Metal acceleration automatically, but you can tune it:

{
  "n_gpu_layers": -1,  // Use all available
  "metal_buffers": true,
  "use_mlock": true
}

Step 4: Expose the Local API Server

Jan's killer feature is its OpenAI-compatible API at localhost:1337. Here's how to put it to work:

Enable the API Server

  1. Click the <> button in Jan
  2. Navigate to Local API Server
  3. Configure:
    • Host: 127.0.0.1 (local only) or 0.0.0.0 (network access)
    • Port: 1337 (or any available port)
    • API Key: Set any string (e.g., "jan-local-key")
    • Enable CORS: Toggle on for web apps

Test the API

import requests
import json

url = "http://localhost:1337/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "model": "mistral-7b-q4",  # Use your loaded model ID
    "stream": False,
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()['choices'][0]['message']['content'])

Expose via Reverse Proxy (Advanced)

Want to access Jan from other devices? Use ngrok or Cloudflare Tunnel:

# Using ngrok
ngrok http 1337

# Using Cloudflare Tunnel
cloudflared tunnel --url http://localhost:1337

⚠️ Security Warning: Only expose with proper authentication in production!

Step 5: Integrate with Development Tools

VS Code + Continue.dev Integration

This setup gives you GitHub Copilot-like features using local models:

  1. Install Continue extension in VS Code
  2. Configure ~/.continue/config.json:
{
  "models": [
    {
      "title": "Jan Local",
      "provider": "openai",
      "model": "mistral-7b-q4",
      "apiKey": "EMPTY",
      "apiBase": "http://localhost:1337/v1"
    }
  ]
}
  3. Use Ctrl+L to chat, Ctrl+I for inline edits

Shell Integration with CLI

Create a bash function for quick AI queries:

# Add to ~/.bashrc or ~/.zshrc
ai() {
  # Build the payload with jq so quotes in the prompt don't break the JSON
  local payload
  payload=$(jq -n --arg content "$*" \
    '{model: "mistral-7b-q4",
      messages: [{role: "user", content: $content}],
      temperature: 0.7}')
  curl -s http://localhost:1337/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$payload" | jq -r '.choices[0].message.content'
}

# Usage
ai "Convert this to Python: ls -la | grep .txt"

Build Custom Applications

Since Jan provides an OpenAI-compatible API, you can use any OpenAI SDK:

// Node.js example
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:1337/v1',
  apiKey: 'jan-local-key',
});

const completion = await openai.chat.completions.create({
  model: 'mistral-7b-q4',
  messages: [{ role: 'user', content: 'Hello!' }],
});

Step 6: Advanced Optimization Tricks

Memory Optimization for Large Models

If you're hitting RAM limits, use these techniques:

  1. Reduce context window: Settings > Model > Context Length
  2. Enable mmap: Allows the OS to page model data
  3. Use quantization: Q3 versions use ~30% less RAM than Q5
{
  "use_mmap": true,
  "mlock": false,  // Don't lock in RAM
  "n_batch": 256,  // Smaller batches
  "n_ctx": 2048    // Reduced context
}

Speed Hacks

  1. Disable unused features:
{
  "use_flash_attn": false,  // If not supported
  "embedding": false,       // If not needed
  "low_vram": true         // For limited VRAM
}
  2. Use CPU+GPU hybrid: Set ngl to partial layers (e.g., 20 out of 35)
  3. Parallel processing: Increase n_parallel for batch inference

Model Mixing Strategy

Run multiple specialized models instead of one large general model:

  • Coding: CodeQwen-7B for development tasks
  • Writing: Mistral-7B for general text
  • Analysis: Llama-3.1-8B for reasoning
  • Small tasks: Qwen2.5-3B for quick responses

Switch between them based on task requirements – smaller models often outperform larger ones in specialized domains.
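A naive router makes this concrete: pick the model ID from keywords in the prompt. The model IDs below are illustrative – substitute whatever IDs are actually loaded in your Jan instance:

```python
# Naive keyword router for the model-mixing strategy above.
# Model IDs are illustrative -- use the IDs loaded in your Jan instance.
ROUTES = {
    "code": "codeqwen-7b",      # development tasks
    "write": "mistral-7b-q4",   # general text
    "analyze": "llama-3.1-8b",  # reasoning
}
DEFAULT = "qwen2.5-3b"          # small model for quick responses

def pick_model(prompt: str) -> str:
    lowered = prompt.lower()
    for keyword, model in ROUTES.items():
        if keyword in lowered:
            return model
    return DEFAULT

print(pick_model("analyze this dataset"))
```

Pass the chosen ID as the "model" field in your API calls; a real router could use a small classifier instead of keywords, but this captures the idea.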

Troubleshooting Common Issues

"Model won't start" or "Failed to fetch"

  • Reduce ngl (GPU layers) in model settings
  • Clear browser cache if using web interface
  • Check available RAM/VRAM

Slow inference on CPU

  • Ensure AVX2 is enabled in BIOS
  • Reduce context size to 2048
  • Use Q3 quantization instead of Q4/Q5

GPU not detected

# For NVIDIA
sudo nvidia-modprobe -c 0 -u

# For AMD
sudo modprobe amdgpu

# Restart Jan after driver fixes

The Real Power Move: Building Your Own AI Stack

Here's what most people miss: Jan isn't just a ChatGPT replacement – it's infrastructure for building private AI applications. Combine it with:

  • LangChain for complex chains
  • Qdrant for vector search
  • n8n for automation workflows
  • Open WebUI as a team interface

You essentially get an entire AI platform that costs nothing to run after the initial hardware investment.

Conclusion

Jan.ai transforms your computer into a private AI powerhouse. While cloud services have their place, the ability to run models locally – with complete privacy, no rate limits, and full customization – is invaluable for developers, researchers, and privacy-conscious users.

Start with a smaller model like Mistral-7B-Q4, get comfortable with the API, then scale up as needed. The ecosystem is evolving rapidly, and having local inference capability puts you ahead of the curve.

Remember: The best model is the one that runs reliably on your hardware. Don't chase parameter counts – chase actual utility.

Marius Bernard

Marius Bernard is a Product Advisor, Technical SEO, & Brand Ambassador at Roundproxies. He was the lead author for the SEO chapter of the 2024 Web and a reviewer for the 2023 SEO chapter.