How to Use Jan.ai: Run Private AI Models Locally in 6 Steps

Jan.ai is an open-source ChatGPT alternative that runs 100% offline on your computer, powered by llama.cpp. It provides an OpenAI-compatible API server at localhost:1337, letting you run models like Llama 3, Mistral, and Qwen locally without sending data to external servers.

In this guide, I'll show you how to set up Jan.ai, optimize its performance, and integrate it with your development workflow – including some lesser-known tricks that can drastically improve your experience.

Why Jan.ai Over Cloud Solutions?

Before diving into the setup, let's address the elephant in the room. While everyone's jumping on the ChatGPT bandwagon, running AI locally offers distinct advantages:

  • No network dependency – Perfect for air-gapped environments or when traveling
  • Complete data sovereignty – Your prompts never leave your machine
  • No rate limits or usage caps – Generate as much as your hardware allows
  • Customizable censorship levels – Adjust alignment and moderation to your needs
  • Free API endpoint – Build applications without worrying about OpenAI credits

Prerequisites

Before installation, verify your hardware meets these requirements:

Minimum Specs

  • CPU: AVX2 support (Intel Haswell/AMD Excavator or newer)
  • RAM: 8GB minimum (16GB recommended)
  • Storage: 10GB free space for app + models
  • NVIDIA: 6GB+ VRAM with CUDA 12.0+
  • AMD: Vulkan support
  • Apple Silicon: Metal support (built-in)

Pro tip: Run grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" on Linux, or check with CPU-Z on Windows, to verify AVX2 support.

Step 1: Install Jan.ai and Configure for Maximum Performance

Download and Install

# macOS/Linux users can use Homebrew (unofficial)
brew install --cask jan

# Or download directly (asset names change between releases –
# copy the exact URL from the GitHub releases page if these 404)
wget https://github.com/menloresearch/jan/releases/latest/download/jan-mac-x64.dmg  # macOS Intel
wget https://github.com/menloresearch/jan/releases/latest/download/jan-mac-arm64.dmg  # macOS Silicon
wget https://github.com/menloresearch/jan/releases/latest/download/jan-linux-x86_64.AppImage  # Linux

For Windows users, grab the .exe from jan.ai or GitHub releases.

Initial Performance Optimization

Once installed, immediately navigate to Settings > Hardware and:

  1. Enable GPU acceleration if you have a compatible GPU
  2. Set CPU threads to your physical core count minus 2 (leave some for the OS)
  3. Adjust context size based on your RAM:
    • 8GB RAM: 2048 tokens
    • 16GB RAM: 4096 tokens
    • 32GB+ RAM: 8192+ tokens
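The thread and context heuristics above can be sketched as a small helper. This is purely illustrative – the key names (`cpu_threads`, `ctx_len`) mirror this guide's examples, not an official Jan settings schema:

```python
import os

def suggest_settings(ram_gb: int) -> dict:
    """Suggest CPU threads and context size per the heuristics above."""
    cores = os.cpu_count() or 4     # note: may report logical, not physical, cores
    threads = max(1, cores - 2)     # leave a couple of cores for the OS
    if ram_gb >= 32:
        ctx = 8192
    elif ram_gb >= 16:
        ctx = 4096
    else:
        ctx = 2048
    return {"cpu_threads": threads, "ctx_len": ctx}

print(suggest_settings(16))
```

Plug the resulting numbers into Settings > Hardware rather than treating them as gospel – the right thread count varies with what else your machine is running.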

Step 2: Download Models Strategically

Not all models are created equal. Here's how to choose:

Quick Model Selection Guide

RAM Available   Recommended Model Size   Example Models
8GB             3B-7B                    Qwen2.5-3B, Mistral-7B-Q4
16GB            7B-13B                   Llama-3.1-8B, Mistral-Nemo-12B
32GB+           13B-32B                  Qwen2.5-32B-Q4, Mixtral-8x7B-Q4

Note that a 70B model does not fit in 32GB even at Q4 – it needs roughly 40GB+ of memory.

Download via Jan Hub

  1. Click the Hub icon (four squares)
  2. Filter by your hardware capabilities
  3. Look for models with these quantization levels:
    • Q4_K_M: Best balance (recommended)
    • Q5_K_M: Higher quality, more RAM
    • Q3_K_S: Faster but lower quality
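To see why these quantization levels matter, here is a back-of-the-envelope file-size estimator. The bits-per-weight figures are rough rules of thumb for llama.cpp K-quants (not exact values), and the estimate ignores metadata and the KV cache you also need at runtime:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file-size estimate: parameter count x bits per weight."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Approximate bits/weight for common quant levels (rule of thumb, not exact)
for name, bpw in [("Q3_K_S", 3.5), ("Q4_K_M", 4.8), ("Q5_K_M", 5.7)]:
    print(f"7B @ {name}: ~{gguf_size_gb(7, bpw):.1f} GB")
```

Budget roughly 1-2GB on top of the file size for context and runtime overhead when deciding what fits in your RAM or VRAM.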

Advanced: Import Custom GGUF Models

# Download directly from Hugging Face
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Move to Jan's model directory (each model lives in its own subfolder)
mkdir -p ~/jan/models/mistral-7b-custom
mv mistral-7b-instruct-v0.2.Q4_K_M.gguf ~/jan/models/mistral-7b-custom/

Then create a model.json in that same subfolder:

{
  "id": "mistral-7b-custom",
  "object": "model",
  "name": "Mistral 7B Custom",
  "version": "1.0",
  "format": "gguf",
  "settings": {
    "ctx_len": 4096,
    "ngl": 35,
    "embedding": false,
    "n_batch": 512,
    "n_parallel": 4
  }
}
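A malformed model.json is a common reason a custom model silently fails to appear. A quick sanity check can catch that before restarting Jan – the required keys below are taken from the example above (a convention of this guide, not an official Jan schema):

```python
import json

# Keys taken from the example model.json above; treat this as a
# convention of this guide, not an official Jan schema.
REQUIRED = {"id", "object", "name", "version", "format", "settings"}

def check_model_json(path: str) -> list[str]:
    """Return a list of problems found in a model.json file."""
    with open(path) as f:
        cfg = json.load(f)  # raises ValueError on malformed JSON
    problems = [f"missing key: {k}" for k in sorted(REQUIRED - cfg.keys())]
    if cfg.get("format") != "gguf":
        problems.append("format should be 'gguf'")
    return problems
```

Run it against your file and fix anything it reports before relaunching Jan.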

Step 3: Optimize GPU Acceleration (The Game Changer)

This is where most tutorials stop, but proper GPU configuration can 10x your inference speed.

NVIDIA CUDA Setup

# Verify CUDA installation
nvidia-smi

# If you see your GPU, enable it in Jan
# Settings > Hardware > GPUs > Toggle ON

Critical setting: Adjust the ngl (GPU layers) parameter:

  1. Start with ngl: 35 for 7B models
  2. Monitor VRAM usage with nvidia-smi -l 1
  3. Increase until you hit 90% VRAM utilization
  4. Back off by 5 if you experience crashes
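Steps 2-3 above boil down to watching VRAM while you raise ngl. A small script can do the watching for you – it shells out to nvidia-smi's --query-gpu interface (NVIDIA only; the utilization helper itself is plain arithmetic):

```python
import subprocess

def vram_utilization(used_mb: float, total_mb: float) -> float:
    """Percent of VRAM in use -- aim for ~90% before backing off ngl."""
    return 100.0 * used_mb / total_mb

def read_vram() -> tuple[float, float]:
    """Query the first GPU's memory via nvidia-smi (NVIDIA only)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"], text=True)
    used, total = out.splitlines()[0].split(", ")
    return float(used), float(total)

if __name__ == "__main__":
    try:
        used, total = read_vram()
        print(f"VRAM: {vram_utilization(used, total):.0f}% "
              f"({used:.0f}/{total:.0f} MB)")
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi not available on this machine")
```

Run it after loading a model at each ngl value; once it reports around 90%, you've found your ceiling.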

AMD Vulkan Configuration

For AMD GPUs, Jan uses Vulkan. Set the backend:

# In Jan's settings.json
"llama_cpp_backend": "vulkan"

Apple Silicon Optimization

M-series Macs get Metal acceleration automatically, but you can tune it:

{
  "n_gpu_layers": -1,  // Use all available
  "metal_buffers": true,
  "use_mlock": true
}

Step 4: Expose the Local API Server

Jan's killer feature is its OpenAI-compatible API at localhost:1337. Here's how to put it to work:

Enable the API Server

  1. Click the <> button in Jan
  2. Navigate to Local API Server
  3. Configure:
    • Host: 127.0.0.1 (local only) or 0.0.0.0 (network access)
    • Port: 1337 (or any available port)
    • API Key: Set any string (e.g., "jan-local-key")
    • Enable CORS: Toggle on for web apps

Test the API

import requests
import json

url = "http://localhost:1337/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "model": "mistral-7b-q4",  # Use your loaded model ID
    "stream": False,
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=payload)
print(response.json()['choices'][0]['message']['content'])

Expose via Reverse Proxy (Advanced)

Want to access Jan from other devices? Use ngrok or Cloudflare Tunnel:

# Using ngrok
ngrok http 1337

# Using Cloudflare Tunnel
cloudflared tunnel --url http://localhost:1337

⚠️ Security Warning: Only expose with proper authentication in production!

Step 5: Integrate with Development Tools

VS Code + Continue.dev Integration

This setup gives you GitHub Copilot-like features using local models:

  1. Install Continue extension in VS Code
  2. Configure ~/.continue/config.json:
{
  "models": [
    {
      "title": "Jan Local",
      "provider": "openai",
      "model": "mistral-7b-q4",
      "apiKey": "EMPTY",
      "apiBase": "http://localhost:1337/v1"
    }
  ]
}
  3. Use Ctrl+L to chat, Ctrl+I for inline edits

Shell Integration with CLI

Create a bash function for quick AI queries:

# Add to ~/.bashrc or ~/.zshrc
ai() {
  # Build the payload with jq so quotes in the prompt don't break the JSON
  local payload
  payload=$(jq -n --arg content "$*" \
    '{model: "mistral-7b-q4",
      messages: [{role: "user", content: $content}],
      temperature: 0.7}')
  curl -s http://localhost:1337/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$payload" | jq -r '.choices[0].message.content'
}

# Usage
ai "Convert this to Python: ls -la | grep .txt"

Build Custom Applications

Since Jan provides an OpenAI-compatible API, you can use any OpenAI SDK:

// Node.js example
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:1337/v1',
  apiKey: 'jan-local-key',
});

const completion = await openai.chat.completions.create({
  model: 'mistral-7b-q4',
  messages: [{ role: 'user', content: 'Hello!' }],
});

Step 6: Advanced Optimization Tricks

Memory Optimization for Large Models

If you're hitting RAM limits, use these techniques:

  1. Reduce context window: Settings > Model > Context Length
  2. Enable mmap: Allows the OS to page model data
  3. Use quantization: Q3 versions use ~30% less RAM than Q5
{
  "use_mmap": true,
  "mlock": false,  // Don't lock in RAM
  "n_batch": 256,  // Smaller batches
  "n_ctx": 2048    // Reduced context
}

Speed Hacks

  1. Disable unused features:
{
  "use_flash_attn": false,  // If not supported
  "embedding": false,       // If not needed
  "low_vram": true         // For limited VRAM
}
  2. Use CPU+GPU hybrid: Set ngl to partial layers (e.g., 20 out of 35)
  3. Parallel processing: Increase n_parallel for batch inference

Model Mixing Strategy

Run multiple specialized models instead of one large general model:

  • Coding: CodeQwen-7B for development tasks
  • Writing: Mistral-7B for general text
  • Analysis: Llama-3.1-8B for reasoning
  • Small tasks: Qwen2.5-3B for quick responses

Switch between them based on task requirements – smaller models often outperform larger ones in specialized domains.
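A naive router makes this concrete: pick the model ID from keywords in the prompt. The model IDs below are illustrative – substitute whatever IDs are actually loaded in your Jan instance:

```python
# Naive keyword router for the model-mixing strategy above.
# Model IDs are illustrative -- use the IDs loaded in your Jan instance.
ROUTES = {
    "code": "codeqwen-7b",      # development tasks
    "write": "mistral-7b-q4",   # general text
    "analyze": "llama-3.1-8b",  # reasoning
}
DEFAULT = "qwen2.5-3b"          # small model for quick responses

def pick_model(prompt: str) -> str:
    lowered = prompt.lower()
    for keyword, model in ROUTES.items():
        if keyword in lowered:
            return model
    return DEFAULT

print(pick_model("analyze this dataset"))
```

Pass the chosen ID as the "model" field in your API calls; a real router could use a small classifier instead of keywords, but this captures the idea.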

Troubleshooting Common Issues

"Model won't start" or "Failed to fetch"

  • Reduce ngl (GPU layers) in model settings
  • Clear browser cache if using web interface
  • Check available RAM/VRAM

Slow inference on CPU

  • Ensure AVX2 is enabled in BIOS
  • Reduce context size to 2048
  • Use Q3 quantization instead of Q4/Q5

GPU not detected

# For NVIDIA
sudo nvidia-modprobe -c 0 -u

# For AMD
sudo modprobe amdgpu

# Restart Jan after driver fixes

The Real Power Move: Building Your Own AI Stack

Here's what most people miss: Jan isn't just a ChatGPT replacement – it's infrastructure for building private AI applications. Combine it with:

  • LangChain for complex chains
  • Qdrant for vector search
  • n8n for automation workflows
  • Open WebUI as a team interface

You essentially get an entire AI platform that costs nothing to run after the initial hardware investment.

Conclusion

Jan.ai transforms your computer into a private AI powerhouse. While cloud services have their place, the ability to run models locally – with complete privacy, no rate limits, and full customization – is invaluable for developers, researchers, and privacy-conscious users.

Start with a smaller model like Mistral-7B-Q4, get comfortable with the API, then scale up as needed. The ecosystem is evolving rapidly, and having local inference capability puts you ahead of the curve.

Remember: The best model is the one that runs reliably on your hardware. Don't chase parameter counts – chase actual utility.

Marius Bernard

Marius Bernard is a Product Advisor, Technical SEO, & Brand Ambassador at Roundproxies. He was the lead author for the SEO chapter of the 2024 Web and a reviewer for the 2023 SEO chapter.