Overview

Run AI models entirely on your own machine. No internet, no cloud, no data leaves your device. Cross-platform server with bindings for Python, JavaScript, Rust, C++, and Java.

Current version: v0.1.5 (March 30, 2026) | License: Apache 2.0

What Is This?

The Offline Intelligence Library is a server that runs large language models (LLMs) on your own computer. You download a model file once, and from that point on all AI inference happens locally. No API calls to OpenAI, no subscription fees, no data sent to anyone.

The server is written in Rust for speed and stability. Once it is running, you talk to it over HTTP from any language: Python, JavaScript, Java, C++, or Rust. The server handles everything: loading the model, managing conversation memory, streaming responses token by token, and optionally fetching live data (weather, currency, crypto prices) to answer questions the model alone could not.

How It Works

The Rust crate is the server. All other language bindings (Python, JavaScript, Java, C++) are pure HTTP clients that connect to the Rust server running on port 9999. The Rust server in turn manages the llama-server process and the GGUF model file on your machine.

Your App (Python / JS / Java / C++)
    ↕  HTTP  (port 9999)

Offline Intelligence Rust Server
↕ HTTP (port 8081)
llama-server (llama.cpp)
↕
GGUF model file (local)

Features

| Feature | Description |
| --- | --- |
| 5 Language Bindings | Rust, Python, JavaScript/Node.js, Java, C++. All talk to the same server over HTTP |
| Fully Offline | Runs entirely on your machine. No internet required after model download |
| Privacy First | All data stays local. No telemetry, no cloud calls |
| Streaming Responses | Tokens stream back in real time, just like ChatGPT |
| Conversation Memory | SQLite-backed persistent memory with semantic search (HNSW index) |
| Live Web Tools | Automatically fetches weather, currency rates, and crypto prices to answer live questions |
| User Authentication | Built-in registration, login, JWT sessions, and Google OAuth 2.0 |
| API Key Management | Stores your HuggingFace and OpenRouter keys encrypted on-device |
| Online / Offline Toggle | Switch between local llama.cpp and OpenRouter cloud at runtime without restarting the server |
| File Attachments | Upload and attach files to conversations |
| Auto Hardware Detection | Automatically picks the right GPU layers, thread count, and memory limits for your machine |
| Prometheus Metrics | /metrics endpoint compatible with Grafana and any Prometheus-based monitoring stack |
| Multi-Format Models | Supports GGUF, GGML, ONNX, SafeTensors, CoreML, TensorRT model formats |

Supported Platforms

| OS | Architectures | Minimum Version |
| --- | --- | --- |
| Windows | x86_64, ARM64 | Windows 10 |
| Linux | x86_64, ARM64 | Ubuntu 20.04 / CentOS 8 |
| macOS | x86_64, Apple Silicon | macOS 11.0 |

Quick Start

This gets you from zero to a running AI server in 5 steps.

Step 1: Download llama-server

llama-server is the engine that runs the AI model. Download a prebuilt binary from: github.com/ggerganov/llama.cpp/releases

Look for the most recent release and download the zip matching your OS:

| OS | File to look for |
| --- | --- |
| Windows | llama-b*-bin-win-*-x64.zip → extract llama-server.exe |
| macOS Apple Silicon | llama-b*-bin-macos-arm64.zip → extract llama-server |
| macOS Intel | llama-b*-bin-macos-x64.zip → extract llama-server |
| Linux x86_64 | llama-b*-bin-ubuntu-x64.zip → extract llama-server |

Place the binary somewhere on your system, for example:

Windows: C:\llama\llama-server.exe
macOS/Linux: /usr/local/bin/llama-server

Step 2: Download a Model

The library uses GGUF format model files. Pick one based on your available RAM:

| Model | File size | RAM needed |
| --- | --- | --- |
| Llama 3.2 3B Q4 | ~2 GB | 4 GB |
| Mistral 7B Q4 | ~4 GB | 8 GB |
| Llama 3 8B Q4 | ~5 GB | 10 GB |
| Llama 3 70B Q4 | ~40 GB | 48 GB |

Not sure which to pick? Start with Llama 3.2 3B Q4: it runs on almost any machine and is a good baseline.

Browse all GGUF models: huggingface.co/models?library=gguf
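
The RAM figures in the table above can be turned into a small picker. A hypothetical helper, purely for illustration (model names and thresholds copied from the sizing table; `suggest_model` is not part of the library):

```python
def suggest_model(available_ram_gb: float) -> str:
    """Return the largest model from the sizing table that fits in RAM."""
    # (RAM needed, model) pairs from the table above, largest first
    tiers = [
        (48, "Llama 3 70B Q4"),
        (10, "Llama 3 8B Q4"),
        (8,  "Mistral 7B Q4"),
        (4,  "Llama 3.2 3B Q4"),
    ]
    for ram_needed, model in tiers:
        if available_ram_gb >= ram_needed:
            return model
    return "Llama 3.2 3B Q4"  # safe baseline for small machines

if __name__ == "__main__":
    print(suggest_model(16))  # a 16 GB machine comfortably fits the 8B quant
```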

Step 3: Create a .env File

Create a file called .env in the folder where you will run the server. This tells the server where your files are.

macOS / Linux:

LLAMA_BIN=/usr/local/bin/llama-server
MODEL_PATH=/home/yourname/.offline-intelligence/models/llama-3.2-3b-instruct-q4_k_m.gguf
API_HOST=127.0.0.1
API_PORT=9999

Windows:

LLAMA_BIN=C:\llama\llama-server.exe
MODEL_PATH=C:\models\llama-3.2-3b-instruct-q4_k_m.gguf
API_HOST=127.0.0.1
API_PORT=9999

Everything else (GPU layers, thread count, memory limits) is detected automatically.

Step 4: Start the Server

cargo install offline-intelligence
offline-intelligence

You should see:

Starting with thread-based architecture
Memory database initialized
Model manager initialized successfully
Starting server on 127.0.0.1:9999

Verify it is running:

curl http://127.0.0.1:9999/healthz

Expected response: {"status":"ok"}
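
The same check works from code, which is handy in scripts that need to wait for the server to come up. A minimal sketch using the requests library against the /healthz endpoint shown above (the polling helper itself is not part of the library):

```python
import time

import requests

def wait_until_ready(base_url: str = "http://127.0.0.1:9999",
                     timeout: float = 30.0) -> bool:
    """Poll /healthz until the server answers, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            r = requests.get(f"{base_url}/healthz", timeout=2)
            if r.status_code == 200 and r.json().get("status") == "ok":
                return True
        except requests.RequestException:
            pass  # server not up yet; retry
        time.sleep(0.5)
    return False

if __name__ == "__main__":
    print("server ready" if wait_until_ready() else "server did not come up")
```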

Note: The server must be running before you use any of the language clients below.

Step 5: Use Any Language Client

With the server running on port 9999, pick the language you want:

pip install offline-intelligence==0.1.5
npm install offline-intelligence@0.1.5
cargo add offline-intelligence@0.1.5

See the Language Usage Guide for full examples in each language.

Language Usage Guide

View on GitHub

Important: The Rust crate is the server. Every other language binding (Python, JavaScript, Java, C++) is an HTTP client that talks to the Rust server over port 9999. You must start the server first before using any non-Rust client.

Installation

Rust (Cargo)

cargo add offline-intelligence@0.1.5

Python (PyPI)

pip install offline-intelligence==0.1.5

JavaScript / Node.js (npm)

npm install offline-intelligence@0.1.5

Java (Maven)

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.github.OfflineIntelligence</groupId>
    <artifactId>offline-intelligence</artifactId>
    <version>v0.1.5</version>
</dependency>

Java (Gradle)

repositories { maven { url 'https://jitpack.io' } }
dependencies { 
    implementation 'com.github.OfflineIntelligence:offline-intelligence:v0.1.5' 
}

C++ (CMake FetchContent)

include(FetchContent)
FetchContent_Declare(
    offline_intelligence
    GIT_REPOSITORY https://github.com/OfflineIntelligence/offline-intelligence.git
    GIT_TAG        v0.1.5
    GIT_SHALLOW    TRUE
)
FetchContent_MakeAvailable(offline_intelligence)
target_link_libraries(your_target PRIVATE offline_intelligence)

C++ (Conan)

conan install --requires="offline-intelligence/0.1.5" --build=missing

C++ (Manual)

Copy bindings/cpp/include/offline_intelligence/offline_intelligence.hpp into your project. Requires cpp-httplib and nlohmann/json headers.

Usage Examples

Rust

In Rust, you embed the server directly in your application.

use offline_intelligence::{config::Config, run_thread_server};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let cfg = Config::from_env()?;
    run_thread_server(cfg, None).await
}

Custom configuration:

use offline_intelligence::{config::Config, run_thread_server};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let mut cfg = Config::from_env()?;
    cfg.api_host  = "0.0.0.0".to_string();
    cfg.api_port  = 9999;
    cfg.model_path = "/path/to/model.gguf".to_string();
    cfg.gpu_layers = 35;
    run_thread_server(cfg, None).await
}

Python

from offline_intelligence import OfflineIntelligence, Config
import requests

cfg = Config.from_env()
ai  = OfflineIntelligence(cfg)

print(ai.health_check())

response = ai.generate("Explain quantum computing in simple terms")
print(response)

for chunk in ai.generate_stream("Write a short poem about the ocean"):
    print(chunk, end="", flush=True)

convs = ai.get_conversations()
title = ai.generate_title(session_id="abc123", first_message="Tell me about space")

stats = ai.get_memory_stats("abc123")
ai.optimize_memory()

# Tools, mode switching, API keys, feedback
settings = requests.get("http://127.0.0.1:9999/tools/settings").json()
requests.post("http://127.0.0.1:9999/mode", json={"mode": "online"})
requests.post("http://127.0.0.1:9999/api-keys", json={"key_type": "openrouter", "value": "sk-or-..."})
requests.post("http://127.0.0.1:9999/feedback", json={"message": "Great!"})

Custom configuration:

from offline_intelligence import Config, OfflineIntelligence

cfg = Config()
cfg.api_host        = "127.0.0.1"
cfg.api_port        = 9999
cfg.backend_url     = "http://127.0.0.1:8081"
cfg.openrouter_api_key = "sk-or-..."

ai = OfflineIntelligence(cfg)

JavaScript / Node.js

const { OfflineIntelligence, Config } = require('offline-intelligence');

const cfg = Config.fromEnv();
const ai  = new OfflineIntelligence(cfg);

async function main() {
    const health = await ai.healthCheck();
    console.log(health);

    const response = await ai.generate('What is machine learning?');
    console.log(response);

    await ai.generateStream('Tell me a story', chunk => process.stdout.write(chunk));

    const convs = await ai.getConversations();
    const title = await ai.generateTitle('abc123', 'Tell me about black holes');

    const stats = await ai.getMemoryStats('abc123');
    await ai.optimizeMemory();
}

main();

Java

import com.offlineintelligence.*;  // OfflineIntelligence, Config, Conversation, MemoryStats
import java.util.List;

public class Main {
    public static void main(String[] args) {
        Config cfg = Config.fromEnv();
        OfflineIntelligence ai = new OfflineIntelligence(cfg);

        System.out.println(ai.healthCheck());

        String response = ai.generate("Explain recursion");
        System.out.println(response);

        ai.generateStream("Write a haiku", chunk -> System.out.print(chunk));

        List<Conversation> convs = ai.getConversations();
        String title = ai.generateTitle("abc123", "Space exploration");

        MemoryStats stats = ai.getMemoryStats("abc123");
        ai.optimizeMemory();
    }
}

C++

#include <offline_intelligence/offline_intelligence.hpp>
#include <iostream>

int main() {
    auto cfg = offline_intelligence::Config::from_env();
    auto ai  = offline_intelligence::OfflineIntelligence(cfg);

    std::cout << ai.health_check() << std::endl;

    auto response = ai.generate("What is AI?");
    std::cout << response << std::endl;

    ai.generate_stream("Count to 10", [](const std::string& chunk) {
        std::cout << chunk << std::flush;
    });

    auto convs = ai.get_conversations();
    auto title = ai.generate_title("abc123", "Programming help");

    auto stats = ai.get_memory_stats("abc123");
    ai.optimize_memory();

    return 0;
}

What's New in v0.1.5

Live Web Tools

The server now detects certain questions in real time and fetches live data before sending the conversation to the AI model. This means the model can answer questions it otherwise couldn't (current temperature, today's exchange rates, live crypto prices).

How it works: Every incoming user message is scanned for intent. If a relevant intent is detected, the data is fetched in parallel (max 8 seconds per source, 10-second hard deadline), formatted with numbered [1], [2] citation markers, and injected as a system context block. If the fetch times out or fails, the model answers from its training data silently, with no error shown to the user.

| Intent | Trigger example | Data source |
| --- | --- | --- |
| Weather | "What's the weather in Tokyo?" | Open-Meteo + Nominatim (keyless) |
| Currency | "Convert 200 USD to EUR" | ExchangeRate-API, 160+ currencies (keyless) |
| Crypto price | "What is Bitcoin worth right now?" | CoinGecko free API (keyless) |
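
The detection itself runs inside the Rust server. Purely to illustrate the routing idea, here is a toy sketch in Python; the patterns and intent names are invented for this example and are not the server's actual rules:

```python
import re
from typing import Optional

# Toy intent patterns -- illustrative only, not the server's real rules.
INTENT_PATTERNS = {
    "weather":  re.compile(r"\bweather\b|\btemperature\b", re.I),
    "currency": re.compile(r"\b[Cc]onvert\b.*\b[A-Z]{3}\b.*\b[A-Z]{3}\b"),
    "crypto":   re.compile(r"\bbitcoin\b|\bethereum\b|\bcrypto\b", re.I),
}

def detect_intent(message: str) -> Optional[str]:
    """Return the first matching intent name, or None for a plain question."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(message):
            return intent
    return None
```

When an intent is detected, the server fetches the data, formats it with citation markers, and injects it as system context before the model sees the message.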

Manage tools via API:

GET  http://127.0.0.1:9999/tools/settings
POST http://127.0.0.1:9999/tools/settings   {"enabled": true, "brave_key": "optional"}

User Authentication

Full auth stack built into the server. No third-party service needed:

POST /auth/register   {"username": "alice", "email": "alice@example.com", "password": "secret"}
POST /auth/login      {"email": "alice@example.com", "password": "secret"}
GET  /auth/google?redirect_uri=http://localhost:3000/callback
GET  /auth/verify?token=<email-verification-token>

Passwords are hashed with Argon2. Login returns a JWT token. Pass it as Authorization: Bearer <token> on protected endpoints.
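
Put together from Python, the flow might look like this. This is a sketch using the requests library against the endpoints above; the `token` field name in the login response is an assumption, so check a real response before relying on it:

```python
import requests

BASE = "http://127.0.0.1:9999"

def auth_header(token: str) -> dict:
    """Build the Authorization header expected by protected endpoints."""
    return {"Authorization": f"Bearer {token}"}

def register_and_login(username: str, email: str, password: str) -> str:
    """Register a user, log in, and return the JWT.

    Assumes the login response carries the JWT in a `token` field;
    adjust if your server returns a different field name.
    """
    requests.post(
        f"{BASE}/auth/register",
        json={"username": username, "email": email, "password": password},
        timeout=10,
    ).raise_for_status()
    r = requests.post(
        f"{BASE}/auth/login",
        json={"email": email, "password": password},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()["token"]

if __name__ == "__main__":
    jwt = register_and_login("alice", "alice@example.com", "secret")
    # Use the token on a protected endpoint:
    print(requests.get(f"{BASE}/conversations",
                       headers=auth_header(jwt), timeout=10).json())
```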

Encrypted API Key Storage

Store your HuggingFace and OpenRouter keys on-device. They are encrypted using a machine-specific key before being written to SQLite. They never exist in plaintext outside the process:

POST   /api-keys   {"key_type": "huggingface", "value": "hf_..."}
POST   /api-keys   {"key_type": "openrouter",  "value": "sk-or-..."}
GET    /api-keys?key_type=huggingface
DELETE /api-keys?key_type=openrouter

Runtime Mode Switching

Switch between local (llama.cpp) and cloud (OpenRouter) inference without restarting:

POST /mode   {"mode": "offline"}
POST /mode   {"mode": "online"}

User Feedback

POST /feedback   {"message": "Really helpful!", "email": "optional@email.com"}

Configuration Guide

Default Configuration Parameters

The server is configured through environment variables, usually via a .env file. The defaults below balance performance and resource use for general use cases; automatic hardware detection covers the rest, so most deployments only need to set LLAMA_BIN and MODEL_PATH.

# Required (no auto-detection)
LLAMA_BIN=/usr/local/bin/llama-server
MODEL_PATH=/home/user/.offline-intelligence/models/model.q4_k_m.gguf

# Server (defaults shown)
API_HOST=127.0.0.1
API_PORT=9999
LLAMA_HOST=127.0.0.1
LLAMA_PORT=8081
PROMETHEUS_PORT=9000

# Backend and cloud fallback
BACKEND_URL=http://127.0.0.1:8081
OPENROUTER_API_KEY=sk-or-... # optional, only for online mode

# Performance: auto-detected if omitted
CTX_SIZE=8192
BATCH_SIZE=256
GPU_LAYERS=auto
THREADS=auto

# Rate limiting and concurrency
MAX_CONCURRENT_STREAMS=4
REQUESTS_PER_SECOND=24

All Environment Variables

| Variable | Default | Auto-detect | Description |
| --- | --- | --- | --- |
| LLAMA_BIN | none | No | Path to llama-server binary (required) |
| MODEL_PATH | none | Yes | Path to GGUF model file (required) |
| BACKEND_URL | http://127.0.0.1:8081 | No | Full URL to llama-server |
| OPENROUTER_API_KEY | none | No | OpenRouter key for online/cloud fallback mode |
| API_HOST | 127.0.0.1 | No | API server bind address |
| API_PORT | 9999 | No | API server port |
| LLAMA_HOST | 127.0.0.1 | No | llama-server host |
| LLAMA_PORT | 8081 | No | llama-server port |
| CTX_SIZE | 8192 | Yes | Context window size in tokens |
| BATCH_SIZE | 256 | Yes | Processing batch size |
| THREADS | auto | Yes | CPU thread count (see CPU detection below) |
| GPU_LAYERS | auto | Yes | GPU acceleration layers (see GPU detection below) |
| MAX_CONCURRENT_STREAMS | 4 | No | Max simultaneous requests |
| PROMETHEUS_PORT | 9000 | No | Prometheus metrics endpoint port |
| REQUESTS_PER_SECOND | 24 | No | Rate limiting threshold |
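
A wrapper script may want to read the same variables. A minimal sketch that merges the environment over the documented defaults, treating LLAMA_BIN and MODEL_PATH as required (the `load_config` helper is illustrative, not part of the library):

```python
import os

# Server/networking defaults copied from the table above.
DEFAULTS = {
    "API_HOST": "127.0.0.1",
    "API_PORT": "9999",
    "LLAMA_HOST": "127.0.0.1",
    "LLAMA_PORT": "8081",
    "BACKEND_URL": "http://127.0.0.1:8081",
    "PROMETHEUS_PORT": "9000",
}

def load_config(env=os.environ) -> dict:
    """Merge environment variables over the documented defaults."""
    cfg = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # LLAMA_BIN and MODEL_PATH have no default; fail early if missing.
    for required in ("LLAMA_BIN", "MODEL_PATH"):
        if required not in env:
            raise KeyError(f"{required} must be set (no default)")
        cfg[required] = env[required]
    return cfg
```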

Performance Optimization Guidelines

Automatic tuning adapts to the available hardware, so manual adjustments are typically only needed for specialized deployments or performance requirements beyond what auto-detection covers.

| Scenario | CTX_SIZE | BATCH_SIZE | GPU_LAYERS | THREADS | MAX_CONCURRENT_STREAMS |
| --- | --- | --- | --- | --- | --- |
| Edge device (Raspberry Pi, Jetson Nano, 4 GB RAM, CPU only) | 2048 | 32 | 0 | 2 | 1 |
| Laptop / workstation (8–16 GB RAM, no GPU) | 4096 | 128 | 0 | auto | 2 |
| Standard GPU server (8–16 GB VRAM) | 8192 | 256 | auto | auto | 4 |
| High-performance GPU server (16 GB+ VRAM) | 8192 | 512 | auto | auto | 8 |

GPU Layer Auto-Detection (NVIDIA)

| VRAM available / hardware | GPU_LAYERS assigned |
| --- | --- |
| 0 to 4 GB | 12 |
| 5 to 8 GB | 20 |
| 9 to 12 GB | 32 |
| 13 to 16 GB | 40 |
| 16 GB+ | 50 |
| Apple Silicon (macOS ARM64) | 24 to 56 (Metal, unified memory) |
| Intel Mac (macOS x86_64) | 0 (CPU only) |
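
The NVIDIA tiers amount to a simple threshold lookup. Sketched in Python (thresholds copied from the table; the server's actual detection code is in Rust, and the table's 16 GB boundary is ambiguous, so this sketch assigns exactly 16 GB to the 40-layer tier):

```python
def gpu_layers_for_vram(vram_gb: float) -> int:
    """Map available NVIDIA VRAM (GB) to a GPU_LAYERS value, per the table above."""
    if vram_gb <= 4:
        return 12
    if vram_gb <= 8:
        return 20
    if vram_gb <= 12:
        return 32
    if vram_gb <= 16:
        return 40
    return 50
```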

Common Scenarios

Recommended settings for a few frequently encountered situations:

Low memory usage:

# Memory conservation
CTX_SIZE=2048
BATCH_SIZE=64
GPU_LAYERS=0

Maximum throughput:

# Performance maximization
GPU_LAYERS=40
BATCH_SIZE=512
THREADS=8

Resolving port conflicts:

# Alternative port assignments
API_PORT=8001
LLAMA_PORT=8082
PROMETHEUS_PORT=9001

API Reference and Monitoring

All endpoints are served by the Rust server at http://API_HOST:API_PORT (default 127.0.0.1:9999). The Prometheus metrics endpoint is served separately at PROMETHEUS_PORT (default 9000).

Core Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /generate/stream | SSE streaming generation. Body: messages[], session_id, temperature, max_tokens, top_p, frequency_penalty |
| GET | /healthz | Liveness check. Always returns {"status":"ok"} when the server process is alive |
| GET | /readyz | Readiness check. Returns backend_connected and model_loaded status; use for Kubernetes readiness probes |
| GET | /metrics | Prometheus-compatible metrics (requests, duration, resource usage). Served on PROMETHEUS_PORT (default 9000) |
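
Consuming /generate/stream without one of the language clients means reading Server-Sent Events off the HTTP response. A sketch of the parsing side: the `data:` line handling follows the SSE standard, but the JSON payload shape inside each event (a `token` field) is an assumption, so inspect a real response before relying on it:

```python
import json
from typing import Iterator, Optional

def parse_sse_line(line: str) -> Optional[str]:
    """Extract the payload from one SSE line; None for comments/blank lines."""
    if line.startswith("data:"):
        return line[len("data:"):].strip()
    return None

def iter_tokens(lines) -> Iterator[str]:
    """Yield text chunks from a stream of SSE lines.

    Assumes each event's data is JSON with a `token` field -- verify
    against your server's actual payload shape.
    """
    for line in lines:
        data = parse_sse_line(line)
        if not data or data == "[DONE]":
            continue
        yield json.loads(data).get("token", "")
```

With requests, pass `stream=True` and feed `resp.iter_lines(decode_unicode=True)` into `iter_tokens`.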

Admin Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /admin/status | Returns status, version, uptime, active connections, total requests served |
| POST | /admin/load | Load a model. Body: model_path, ctx_size, gpu_layers, batch_size |
| POST | /admin/stop | Stop the llama-server backend process |

Memory Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /memory/stats/{session_id} | Returns message count, total tokens, timestamps, and storage size for a session |
| POST | /memory/optimize | Runs PRAGMA optimize + WAL checkpoint across the SQLite memory database |
| POST | /memory/cleanup | Removes stale and expired session entries using elapsed-time thresholds |

Conversation Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /conversations | List all stored conversations |
| GET | /conversations/{id} | Get a specific conversation with full message history |
| DELETE | /conversations/{id} | Delete a conversation and all associated messages |
| GET | /conversations/{id}/title | Get the auto-generated title for a conversation |
| POST | /generate/title | Generate a title. Body: session_id, first_message |

Authentication Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /auth/register | Register a new user. Body: username, email, password |
| POST | /auth/login | Log in and receive a JWT token. Body: email, password |
| GET | /auth/google | Google OAuth 2.0 login. Query: redirect_uri |
| GET | /auth/verify | Verify email with token. Query: token |

API Keys Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api-keys | Store an encrypted API key. Body: key_type, value |
| GET | /api-keys | Retrieve a stored API key. Query: key_type |
| DELETE | /api-keys | Delete a stored API key. Query: key_type |

Mode & Tools Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /mode | Switch between local and online mode. Body: mode ("offline" or "online") |
| GET | /tools/settings | Get live web tools settings |
| POST | /tools/settings | Update tools settings. Body: enabled, brave_key (optional) |
| POST | /feedback | Submit user feedback. Body: message, email (optional) |

Metrics Access

# Health and readiness
curl http://127.0.0.1:9999/healthz
curl http://127.0.0.1:9999/readyz

# Prometheus metrics (note: separate port 9000)
curl http://127.0.0.1:9000/metrics

# Admin status
curl http://127.0.0.1:9999/admin/status

# Memory stats for a session
curl http://127.0.0.1:9999/memory/stats/my-session-id

Changelog

v0.1.5, March 30, 2026

  • Live Web Tools: Real-time intent detection for weather, currency conversion, and crypto prices. Data fetched from Open-Meteo, ExchangeRate-API, and CoinGecko with automatic citation injection
  • User Authentication: Full auth stack with registration, login, JWT sessions, Google OAuth 2.0, and email verification. Passwords hashed with Argon2
  • Encrypted API Key Storage: Store HuggingFace and OpenRouter keys encrypted on-device using machine-specific keys. Keys never exist in plaintext outside the process
  • Runtime Mode Switching: Switch between local (llama.cpp) and cloud (OpenRouter) inference at runtime without restarting the server via POST /mode
  • User Feedback Endpoint: New POST /feedback endpoint for collecting user feedback with optional email
  • File Attachments: Upload and attach files to conversations

v0.1.4, March 27, 2026

  • Lazy HNSW index rebuild: EmbeddingStore now uses an AtomicBool dirty flag. Index is rebuilt once on the first search after inserts, eliminating the previous per-insert O(n²) rebuild cost
  • Content-aware importance scoring: Replaced all hardcoded 0.5 values with score_message_importance(role, content): role base scores (system=0.9, assistant=0.6, user=0.4) plus bonuses for code blocks, key concepts, and message length
  • Real llama-server KV cache integration: LlamaKVCacheInterface now queries GET /slots for live token counts; cache operations use POST /slots/0 with erase/restore actions
  • Token-bucket KV entries: Slot token sequences divided into 64-token buckets; importance derived from position fraction (earlier = higher priority)
  • sysinfo-based memory limits: estimate_max_cache_memory() uses real available system RAM. 25% is allocated to KV cache, clamped between 256 MB and 8 GB
  • Database and cache workers fully wired: All database operations (store, get, update, delete conversations) call real database methods; cache worker flushes to database on update
  • Admin maintenance operational: Session cleanup uses DashMap::retain() with elapsed-time thresholds; optimize_database runs PRAGMA optimize + WAL checkpoint(TRUNCATE)

v0.1.3, March 22, 2026

  • Thread-based server architecture (run_thread_server) replacing single-threaded server
  • All four language bindings rewritten as pure HTTP clients (Python: requests; JavaScript: axios; Java: Java 11 HttpClient; C++: cpp-httplib + nlohmann/json)
  • Multi-format model support: .gguf, .onnx, .trt, .engine, .safetensors, .ggml, .mlmodel
  • New BACKEND_URL and OPENROUTER_API_KEY config fields
  • API port default changed from 8000 to 9999
  • New modules: model_management, model_runtime, engine_management, worker_threads
  • New APIs: conversations CRUD, title generation, memory optimize/cleanup, mode switching
  • Lock-free backend URL switching via arc-swap
  • Platform-specific GPU detection: Apple Silicon Metal, NVIDIA NVML, CPU fallback
  • JitPack support (jitpack.yml) and Conan package support (conanfile.py) added

v0.1.2, February 7, 2026

  • Automatic hardware detection added
  • Improved memory management
  • Enhanced error handling
  • Fixed critical security vulnerabilities

v0.1.1, December 15, 2025

  • Initial public release
  • Multi-language bindings (Rust, Python, JavaScript, Java, C++)
  • Core LLM integration via llama.cpp
  • SQLite-backed memory management system

License: Apache 2.0 for the core (roughly 80% of the codebase). Commercial extensions are available for advanced context management and enterprise features. Third-party components: llama.cpp (MIT), Axum (MIT), Tokio (MIT), Serde (MIT/Apache 2.0), SQLite (Public Domain).