SDK Documentation
v0.1.5, March 2026
Overview
Run AI models entirely on your own machine. No internet, no cloud, no data leaves your device. Cross-platform server with bindings for Python, JavaScript, Rust, C++, and Java.
Current version: v0.1.5 (March 30, 2026) | License: Apache 2.0
What Is This?
The Offline Intelligence Library is a server that runs large language models (LLMs) on your own computer. You download a model file once, and from that point on all AI inference happens locally. No API calls to OpenAI, no subscription fees, no data sent to anyone.
The server is written in Rust for speed and stability. Once it is running, you talk to it over HTTP from any language: Python, JavaScript, Java, C++, or Rust. The server handles everything: loading the model, managing conversation memory, streaming responses token by token, and optionally fetching live data (weather, currency, crypto prices) to answer questions the model alone could not.
How It Works
The Rust crate is the server. All other language bindings (Python, JavaScript, Java, C++) are pure HTTP clients that connect to the Rust server running on port 9999. The Rust server in turn manages the llama-server process and the GGUF model file on your machine.
Your App (Python / JS / Java / C++)
↕ HTTP (port 9999)
Offline Intelligence Rust Server
↕ HTTP (port 8081)
llama-server (llama.cpp)
↕
GGUF model file (local)

Features
| Feature | Description |
|---|---|
| 5 Language Bindings | Rust, Python, JavaScript/Node.js, Java, C++. All talk to the same server over HTTP |
| Fully Offline | Runs entirely on your machine. No internet required after model download |
| Privacy First | All data stays local. No telemetry, no cloud calls |
| Streaming Responses | Tokens stream back in real time, just like ChatGPT |
| Conversation Memory | SQLite-backed persistent memory with semantic search (HNSW index) |
| Live Web Tools | Automatically fetches weather, currency rates, and crypto prices to answer live questions |
| User Authentication | Built-in registration, login, JWT sessions, and Google OAuth 2.0 |
| API Key Management | Stores your HuggingFace and OpenRouter keys encrypted on-device |
| Online / Offline Toggle | Switch between local llama.cpp and OpenRouter cloud at runtime without restarting the server |
| File Attachments | Upload and attach files to conversations |
| Auto Hardware Detection | Automatically picks the right GPU layers, thread count, and memory limits for your machine |
| Prometheus Metrics | /metrics endpoint compatible with Grafana and any Prometheus-based monitoring stack |
| Multi-Format Models | Supports GGUF, GGML, ONNX, SafeTensors, CoreML, TensorRT model formats |
Supported Platforms
| OS | Architectures | Minimum Version |
|---|---|---|
| Windows | x86_64, ARM64 | Windows 10 |
| Linux | x86_64, ARM64 | Ubuntu 20.04 / CentOS 8 |
| macOS | x86_64, Apple Silicon | macOS 11.0 |
Quick Start
This gets you from zero to a running AI server in 5 steps.
Step 1: Download llama-server
llama-server is the engine that runs the AI model. Download a prebuilt binary from: github.com/ggerganov/llama.cpp/releases
Look for the most recent release and download the zip matching your OS:
| OS | File to look for |
|---|---|
| Windows | llama-b*-bin-win-*-x64.zip → extract llama-server.exe |
| macOS Apple Silicon | llama-b*-bin-macos-arm64.zip → extract llama-server |
| macOS Intel | llama-b*-bin-macos-x64.zip → extract llama-server |
| Linux x86_64 | llama-b*-bin-ubuntu-x64.zip → extract llama-server |
Place the binary somewhere on your system, for example:
Windows: C:\llama\llama-server.exe
macOS/Linux: /usr/local/bin/llama-server
Step 2: Download a Model
The library uses GGUF format model files. Pick one based on your available RAM:
| Model | File size | RAM needed | Download |
|---|---|---|---|
| Llama 3.2 3B Q4 | ~2 GB | 4 GB | Download |
| Mistral 7B Q4 | ~4 GB | 8 GB | Download |
| Llama 3 8B Q4 | ~5 GB | 10 GB | Download |
| Llama 3 70B Q4 | ~40 GB | 48 GB | Download |
Not sure which to pick? Start with Llama 3.2 3B Q4: it runs on almost any machine and is a good baseline.
Browse all GGUF models: huggingface.co/models?library=gguf
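The RAM guidance above can be expressed as a small helper. Note that `pick_model` is not part of the library; it simply mirrors the thresholds in the table for illustration:

```python
# Illustrative helper (not part of the SDK): map available RAM to a
# model from the table above.

def pick_model(ram_gb: float) -> str:
    """Return a reasonable GGUF model for the given amount of free RAM."""
    if ram_gb >= 48:
        return "Llama 3 70B Q4"
    if ram_gb >= 10:
        return "Llama 3 8B Q4"
    if ram_gb >= 8:
        return "Mistral 7B Q4"
    return "Llama 3.2 3B Q4"  # runs on almost any machine

print(pick_model(16))  # → Llama 3 8B Q4
```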
Step 3: Create a .env File
Create a file called .env in the folder where you will run the server. This tells the server where your files are.
macOS / Linux:
LLAMA_BIN=/usr/local/bin/llama-server
MODEL_PATH=/home/yourname/.offline-intelligence/models/llama-3.2-3b-instruct-q4_k_m.gguf
API_HOST=127.0.0.1
API_PORT=9999
Windows:
LLAMA_BIN=C:\llama\llama-server.exe
MODEL_PATH=C:\models\llama-3.2-3b-instruct-q4_k_m.gguf
API_HOST=127.0.0.1
API_PORT=9999
Everything else (GPU layers, thread count, memory limits) is detected automatically.
Step 4: Start the Server
cargo install offline-intelligence
offline-intelligence
You should see:
Starting with thread-based architecture
Memory database initialized
Model manager initialized successfully
Starting server on 127.0.0.1:9999
Verify it is running:
curl http://127.0.0.1:9999/healthz
Expected response: {"status":"ok"}
Note: The server must be running before you use any of the language clients below.
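If you are scripting the startup, the same health check can be done from Python. This is an illustrative snippet, not part of the SDK; it polls /healthz until the server answers {"status":"ok"}:

```python
import json
import time
import urllib.request

def is_healthy(body: str) -> bool:
    """Parse a /healthz response body and check for status ok."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False

def wait_for_server(url: str = "http://127.0.0.1:9999/healthz",
                    timeout: float = 30.0) -> bool:
    """Poll /healthz until the server is up or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if is_healthy(resp.read().decode()):
                    return True
        except OSError:
            pass  # server not up yet
        time.sleep(0.5)
    return False

# Usage, after launching `offline-intelligence`:
# assert wait_for_server(), "server did not come up in time"
```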
Step 5: Use Any Language Client
With the server running on port 9999, pick the language you want:
pip install offline-intelligence==0.1.5
npm install offline-intelligence@0.1.5
cargo add offline-intelligence@0.1.5
See the Language Usage Guide for full examples in each language.
Language Usage Guide
View on GitHub

Important: The Rust crate is the server. Every other language binding (Python, JavaScript, Java, C++) is an HTTP client that talks to the Rust server on port 9999. Start the server before using any non-Rust client.
Installation
Rust (Cargo)
cargo add offline-intelligence@0.1.5
Python (PyPI)
pip install offline-intelligence==0.1.5
JavaScript / Node.js (npm)
npm install offline-intelligence@0.1.5
Java (Maven)
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
<dependency>
<groupId>com.github.OfflineIntelligence</groupId>
<artifactId>offline-intelligence</artifactId>
<version>v0.1.5</version>
</dependency>

Java (Gradle)
repositories { maven { url 'https://jitpack.io' } }
dependencies {
implementation 'com.github.OfflineIntelligence:offline-intelligence:v0.1.5'
}

C++ (CMake FetchContent)
include(FetchContent)
FetchContent_Declare(
offline_intelligence
GIT_REPOSITORY https://github.com/OfflineIntelligence/offline-intelligence.git
GIT_TAG v0.1.5
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(offline_intelligence)
target_link_libraries(your_target PRIVATE offline_intelligence)

C++ (Conan)
conan install --requires="offline-intelligence/0.1.5" --build=missing
C++ (Manual)
Copy bindings/cpp/include/offline_intelligence/offline_intelligence.hpp into your project. Requires cpp-httplib and nlohmann/json headers.
Usage Examples
Rust
In Rust, you embed the server directly in your application.
use offline_intelligence::{config::Config, run_thread_server};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let cfg = Config::from_env()?;
run_thread_server(cfg, None).await
}

Custom configuration:
use offline_intelligence::{config::Config, run_thread_server};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let mut cfg = Config::from_env()?;
cfg.api_host = "0.0.0.0".to_string();
cfg.api_port = 9999;
cfg.model_path = "/path/to/model.gguf".to_string();
cfg.gpu_layers = 35;
run_thread_server(cfg, None).await
}

Python
from offline_intelligence import OfflineIntelligence, Config
import requests
cfg = Config.from_env()
ai = OfflineIntelligence(cfg)
print(ai.health_check())
response = ai.generate("Explain quantum computing in simple terms")
print(response)
for chunk in ai.generate_stream("Write a short poem about the ocean"):
print(chunk, end="", flush=True)
convs = ai.get_conversations()
title = ai.generate_title(session_id="abc123", first_message="Tell me about space")
stats = ai.get_memory_stats("abc123")
ai.optimize_memory()
# Tools, mode switching, API keys, feedback
settings = requests.get("http://127.0.0.1:9999/tools/settings").json()
requests.post("http://127.0.0.1:9999/mode", json={"mode": "online"})
requests.post("http://127.0.0.1:9999/api-keys", json={"key_type": "openrouter", "value": "sk-or-..."})
requests.post("http://127.0.0.1:9999/feedback", json={"message": "Great!"})

Custom configuration:
from offline_intelligence import Config, OfflineIntelligence

cfg = Config()
cfg.api_host = "127.0.0.1"
cfg.api_port = 9999
cfg.backend_url = "http://127.0.0.1:8081"
cfg.openrouter_api_key = "sk-or-..."
ai = OfflineIntelligence(cfg)
JavaScript / Node.js
const { OfflineIntelligence, Config } = require('offline-intelligence');
const cfg = Config.fromEnv();
const ai = new OfflineIntelligence(cfg);
async function main() {
const health = await ai.healthCheck();
console.log(health);
const response = await ai.generate('What is machine learning?');
console.log(response);
await ai.generateStream('Tell me a story', chunk => process.stdout.write(chunk));
const convs = await ai.getConversations();
const title = await ai.generateTitle('abc123', 'Tell me about black holes');
const stats = await ai.getMemoryStats('abc123');
await ai.optimizeMemory();
}
main();

Java
import com.offlineintelligence.OfflineIntelligence;
import com.offlineintelligence.Config;
public class Main {
public static void main(String[] args) {
Config cfg = Config.fromEnv();
OfflineIntelligence ai = new OfflineIntelligence(cfg);
System.out.println(ai.healthCheck());
String response = ai.generate("Explain recursion");
System.out.println(response);
ai.generateStream("Write a haiku", chunk -> System.out.print(chunk));
List<Conversation> convs = ai.getConversations();
String title = ai.generateTitle("abc123", "Space exploration");
MemoryStats stats = ai.getMemoryStats("abc123");
ai.optimizeMemory();
}
}

C++
#include <offline_intelligence/offline_intelligence.hpp>
#include <iostream>
int main() {
auto cfg = offline_intelligence::Config::from_env();
auto ai = offline_intelligence::OfflineIntelligence(cfg);
std::cout << ai.health_check() << std::endl;
auto response = ai.generate("What is AI?");
std::cout << response << std::endl;
ai.generate_stream("Count to 10", [](const std::string& chunk) {
std::cout << chunk << std::flush;
});
auto convs = ai.get_conversations();
auto title = ai.generate_title("abc123", "Programming help");
auto stats = ai.get_memory_stats("abc123");
ai.optimize_memory();
return 0;
}

What's New in v0.1.5
Live Web Tools
The server now detects certain questions in real time and fetches live data before sending the conversation to the AI model. This means the model can answer questions it otherwise couldn't (current temperature, today's exchange rates, live crypto prices).
How it works: Every incoming user message is scanned for intent. If a relevant intent is detected, the data is fetched in parallel (max 8 seconds per source, 10-second hard deadline), formatted with numbered [1], [2] citation markers, and injected as a system context block. If the fetch times out or fails, the model answers from its training data silently, with no error shown to the user.
| Intent | Trigger example | Data source |
|---|---|---|
| Weather | "What's the weather in Tokyo?" | Open-Meteo + Nominatim (keyless) |
| Currency | "Convert 200 USD to EUR" | ExchangeRate-API, 160+ currencies (keyless) |
| Crypto price | "What is Bitcoin worth right now?" | CoinGecko free API (keyless) |
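The intent scan described above can be sketched as simple pattern matching. The server's real detector is internal, so the patterns below are assumptions for demonstration only:

```python
import re

# Illustrative sketch of intent detection. These trigger patterns are
# assumptions; the server's actual detection logic is not public.
INTENT_PATTERNS = {
    "weather": re.compile(r"\bweather\b|\btemperature\b", re.I),
    "currency": re.compile(r"\bconvert\b.*\b[A-Z]{3}\b.*\b[A-Z]{3}\b", re.I),
    "crypto": re.compile(r"\bbitcoin\b|\bethereum\b|\bcrypto\b", re.I),
}

def detect_intents(message: str) -> list[str]:
    """Return the intents whose trigger patterns match the message."""
    return [name for name, pat in INTENT_PATTERNS.items() if pat.search(message)]

print(detect_intents("What's the weather in Tokyo?"))  # → ['weather']
print(detect_intents("Convert 200 USD to EUR"))        # → ['currency']
```

In the real server, each detected intent maps to one of the keyless data sources in the table, fetched in parallel under the 10-second deadline.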
Manage tools via API:
GET http://127.0.0.1:9999/tools/settings
POST http://127.0.0.1:9999/tools/settings {"enabled": true, "brave_key": "optional"}

User Authentication
Full auth stack built into the server. No third-party service needed:
POST /auth/register {"username": "alice", "email": "alice@example.com", "password": "secret"}
POST /auth/login {"email": "alice@example.com", "password": "secret"}
GET /auth/google?redirect_uri=http://localhost:3000/callback
GET /auth/verify?token=<email-verification-token>

Passwords are hashed with Argon2. Login returns a JWT. Pass it as Authorization: Bearer <token> on protected endpoints.
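The register → login → authenticated-request flow can be sketched in Python. The endpoint paths come from the list above; the `requests` usage and the `token` response field name are assumptions for illustration:

```python
def bearer_header(token: str) -> dict:
    """Build the Authorization header expected by protected endpoints."""
    return {"Authorization": f"Bearer {token}"}

def register_and_login(username: str, email: str, password: str,
                       base: str = "http://127.0.0.1:9999") -> str:
    import requests  # the same HTTP library the Python examples use
    requests.post(f"{base}/auth/register",
                  json={"username": username, "email": email, "password": password})
    resp = requests.post(f"{base}/auth/login",
                         json={"email": email, "password": password})
    # The "token" field name in the login response is an assumption.
    return resp.json()["token"]

# Usage (with the server running):
# token = register_and_login("alice", "alice@example.com", "secret")
# requests.get("http://127.0.0.1:9999/conversations", headers=bearer_header(token))
```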
Encrypted API Key Storage
Store your HuggingFace and OpenRouter keys on-device. They are encrypted using a machine-specific key before being written to SQLite. They never exist in plaintext outside the process:
POST /api-keys {"key_type": "huggingface", "value": "hf_..."}
POST /api-keys {"key_type": "openrouter", "value": "sk-or-..."}
GET /api-keys?key_type=huggingface
DELETE /api-keys?key_type=openrouter

Runtime Mode Switching
Switch between local (llama.cpp) and cloud (OpenRouter) inference without restarting:
POST /mode {"mode": "offline"}
POST /mode {"mode": "online"}

User Feedback
POST /feedback {"message": "Really helpful!", "email": "optional@email.com"}

Configuration Guide
Default Configuration Parameters
The server reads its configuration from environment variables (or a .env file). The defaults are tuned for general use, and performance-related values are auto-detected from your hardware where possible.
# Required (no auto-detection)
LLAMA_BIN=/usr/local/bin/llama-server
MODEL_PATH=/home/user/.offline-intelligence/models/model.q4_k_m.gguf

# Server (defaults shown)
API_HOST=127.0.0.1
API_PORT=9999
LLAMA_HOST=127.0.0.1
LLAMA_PORT=8081
PROMETHEUS_PORT=9000

# Backend and cloud fallback
BACKEND_URL=http://127.0.0.1:8081
OPENROUTER_API_KEY=sk-or-...  # optional, only for online mode

# Performance: auto-detected if omitted
CTX_SIZE=8192
BATCH_SIZE=256
GPU_LAYERS=auto
THREADS=auto

# Rate limiting and concurrency
MAX_CONCURRENT_STREAMS=4
REQUESTS_PER_SECOND=24
All Environment Variables
| Variable | Default | Auto-detect | Description |
|---|---|---|---|
| LLAMA_BIN | none | No | Path to llama-server binary (required) |
| MODEL_PATH | none | No | Path to GGUF model file (required) |
| BACKEND_URL | http://127.0.0.1:8081 | No | Full URL to llama-server |
| OPENROUTER_API_KEY | none | No | OpenRouter key for online/cloud fallback mode |
| API_HOST | 127.0.0.1 | No | API server bind address |
| API_PORT | 9999 | No | API server port |
| LLAMA_HOST | 127.0.0.1 | No | llama-server host |
| LLAMA_PORT | 8081 | No | llama-server port |
| CTX_SIZE | 8192 | Yes | Context window size in tokens |
| BATCH_SIZE | 256 | Yes | Processing batch size |
| THREADS | auto | Yes | CPU thread count (see CPU detection below) |
| GPU_LAYERS | auto | Yes | GPU acceleration layers (see GPU detection below) |
| MAX_CONCURRENT_STREAMS | 4 | No | Max simultaneous requests |
| PROMETHEUS_PORT | 9000 | No | Prometheus metrics endpoint port |
| REQUESTS_PER_SECOND | 24 | No | Rate limiting threshold |
Performance Optimization Guidelines
Automatic tuning adapts to the available hardware, so manual adjustments are usually only needed for specialized deployments or performance requirements beyond what auto-detection provides.
| Scenario | CTX_SIZE | BATCH_SIZE | GPU_LAYERS | THREADS | MAX_CONCURRENT_STREAMS |
|---|---|---|---|---|---|
| Edge device (Raspberry Pi, Jetson Nano, 4 GB RAM, CPU only) | 2048 | 32 | 0 | 2 | 1 |
| Laptop / workstation (8–16 GB RAM, no GPU) | 4096 | 128 | 0 | auto | 2 |
| Standard GPU server (8–16 GB VRAM) | 8192 | 256 | auto | auto | 4 |
| High-performance GPU server (16 GB+ VRAM) | 8192 | 512 | auto | auto | 8 |
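For example, the laptop/workstation row above translates to this .env fragment (THREADS is left unset so the server auto-detects it):

```
# Laptop / workstation, 8–16 GB RAM, no GPU
CTX_SIZE=4096
BATCH_SIZE=128
GPU_LAYERS=0
MAX_CONCURRENT_STREAMS=2
```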
GPU Layer Auto-Detection (NVIDIA)
| VRAM Available | GPU_LAYERS assigned |
|---|---|
| 0 to 4 GB | 12 |
| 5 to 8 GB | 20 |
| 9 to 12 GB | 32 |
| 13 to 16 GB | 40 |
| 16 GB+ | 50 |
| Apple Silicon (macOS ARM64) | 24 to 56 (Metal, unified memory) |
| Intel Mac (macOS x86_64) | 0 (CPU only) |
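The NVIDIA rows of the table can be restated as a small function. This is only an illustrative mirror of the table; the server's actual detection logic (via NVML) is internal:

```python
# Illustrative mirror of the NVIDIA VRAM → GPU_LAYERS table above.
# The server's real detection queries NVML; this just restates the table.

def gpu_layers_for_vram(vram_gb: int) -> int:
    """Return the GPU_LAYERS value the table assigns for a given VRAM size."""
    if vram_gb > 16:
        return 50
    if vram_gb >= 13:
        return 40
    if vram_gb >= 9:
        return 32
    if vram_gb >= 5:
        return 20
    return 12

print(gpu_layers_for_vram(8))   # → 20
print(gpu_layers_for_vram(24))  # → 50
```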
Common Scenarios
Typical configuration adjustments for frequently encountered situations.
Reducing Memory Usage
# Low-memory configuration
CTX_SIZE=2048
BATCH_SIZE=64
GPU_LAYERS=0

Maximizing Throughput
# High-throughput configuration
GPU_LAYERS=40
BATCH_SIZE=512
THREADS=8

Resolving Port Conflicts
# Alternative port assignments
API_PORT=8001
LLAMA_PORT=8082
PROMETHEUS_PORT=9001

API Reference and Monitoring
All endpoints are served by the Rust server at http://API_HOST:API_PORT (default 127.0.0.1:9999). The Prometheus metrics endpoint is served separately at PROMETHEUS_PORT (default 9000).
Core Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /generate/stream | SSE streaming generation. Body: messages[], session_id, temperature, max_tokens, top_p, frequency_penalty |
| GET | /healthz | Liveness check. Always returns {"status":"ok"} when the server process is alive |
| GET | /readyz | Readiness check. Returns backend_connected and model_loaded status; use for Kubernetes readiness probes |
| GET | /metrics | Prometheus-compatible metrics (requests, duration, resource usage). Served on PROMETHEUS_PORT (default 9000) |
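/generate/stream returns Server-Sent Events. A minimal streaming client can be sketched as follows. The request body fields come from the table above; the exact SSE payload shape (plain text chunks on `data:` lines) is an assumption:

```python
def parse_sse_data(line: str):
    """Extract the payload from one 'data: ...' SSE line; None for other lines."""
    prefix = "data: "
    return line[len(prefix):] if line.startswith(prefix) else None

def stream_generate(prompt: str, session_id: str = "demo",
                    base: str = "http://127.0.0.1:9999") -> None:
    import requests  # same HTTP library the Python binding uses
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "session_id": session_id,
        "temperature": 0.7,
        "max_tokens": 256,
    }
    with requests.post(f"{base}/generate/stream", json=body, stream=True) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            chunk = parse_sse_data(raw or "")
            if chunk is not None:
                print(chunk, end="", flush=True)

# Usage (with the server running):
# stream_generate("Write a haiku about autumn")
```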
Admin Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /admin/status | Returns status, version, uptime, active connections, total requests served |
| POST | /admin/load | Load a model. Body: model_path, ctx_size, gpu_layers, batch_size |
| POST | /admin/stop | Stop the llama-server backend process |
Memory Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /memory/stats/{session_id} | Returns message count, total tokens, timestamps, and storage size for a session |
| POST | /memory/optimize | Runs PRAGMA optimize + WAL checkpoint across the SQLite memory database |
| POST | /memory/cleanup | Removes stale and expired session entries using elapsed-time thresholds |
Conversation Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /conversations | List all stored conversations |
| GET | /conversations/{id} | Get a specific conversation with full message history |
| DELETE | /conversations/{id} | Delete a conversation and all associated messages |
| GET | /conversations/{id}/title | Get the auto-generated title for a conversation |
| POST | /generate/title | Generate a title. Body: session_id, first_message |
Authentication Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /auth/register | Register new user. Body: username, email, password |
| POST | /auth/login | Login and receive JWT token. Body: email, password |
| GET | /auth/google | Google OAuth 2.0 login. Query: redirect_uri |
| GET | /auth/verify | Verify email with token. Query: token |
API Keys Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api-keys | Store encrypted API key. Body: key_type, value |
| GET | /api-keys | Retrieve stored API key. Query: key_type |
| DELETE | /api-keys | Delete stored API key. Query: key_type |
Mode & Tools Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /mode | Switch between local and online mode. Body: mode ("offline" or "online") |
| GET | /tools/settings | Get live web tools settings |
| POST | /tools/settings | Update tools settings. Body: enabled, brave_key (optional) |
| POST | /feedback | Submit user feedback. Body: message, email (optional) |
Metrics Access
# Health and readiness
curl http://127.0.0.1:9999/healthz
curl http://127.0.0.1:9999/readyz

# Prometheus metrics (note: separate port 9000)
curl http://127.0.0.1:9000/metrics

# Admin status
curl http://127.0.0.1:9999/admin/status

# Memory stats for a session
curl http://127.0.0.1:9999/memory/stats/my-session-id
Changelog
v0.1.5, March 30, 2026
- Live Web Tools: Real-time intent detection for weather, currency conversion, and crypto prices. Data fetched from Open-Meteo, ExchangeRate-API, and CoinGecko with automatic citation injection
- User Authentication: Full auth stack with registration, login, JWT sessions, Google OAuth 2.0, and email verification. Passwords hashed with Argon2
- Encrypted API Key Storage: Store HuggingFace and OpenRouter keys encrypted on-device using machine-specific keys. Keys never exist in plaintext outside the process
- Runtime Mode Switching: Switch between local (llama.cpp) and cloud (OpenRouter) inference at runtime without restarting the server via POST /mode
- User Feedback Endpoint: New POST /feedback endpoint for collecting user feedback with optional email
- File Attachments: Upload and attach files to conversations
v0.1.4, March 27, 2026
- Lazy HNSW index rebuild: EmbeddingStore now uses an AtomicBool dirty flag. Index is rebuilt once on the first search after inserts, eliminating the previous per-insert O(n²) rebuild cost
- Content-aware importance scoring: Replaced all hardcoded 0.5 values with score_message_importance(role, content): role base scores (system=0.9, assistant=0.6, user=0.4) plus bonuses for code blocks, key concepts, and message length
- Real llama-server KV cache integration: LlamaKVCacheInterface now queries GET /slots for live token counts; cache operations use POST /slots/0 with erase/restore actions
- Token-bucket KV entries: Slot token sequences divided into 64-token buckets; importance derived from position fraction (earlier = higher priority)
- sysinfo-based memory limits: estimate_max_cache_memory() uses real available system RAM. 25% is allocated to KV cache, clamped between 256 MB and 8 GB
- Database and cache workers fully wired: All database operations (store, get, update, delete conversations) call real database methods; cache worker flushes to database on update
- Admin maintenance operational: Session cleanup uses DashMap::retain() with elapsed-time thresholds; optimize_database runs PRAGMA optimize + WAL checkpoint(TRUNCATE)
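The content-aware scoring change above can be illustrated with a sketch. The role base scores (system=0.9, assistant=0.6, user=0.4) come from the changelog entry; the bonus magnitudes and the 1.0 cap are assumptions:

```python
# Sketch of content-aware importance scoring. Role base scores are from
# the changelog; the bonus values and the cap are illustrative assumptions.

ROLE_BASE = {"system": 0.9, "assistant": 0.6, "user": 0.4}

def score_message_importance(role: str, content: str) -> float:
    score = ROLE_BASE.get(role, 0.4)
    if "```" in content:    # bonus for code blocks
        score += 0.15
    if len(content) > 500:  # bonus for long messages
        score += 0.1
    return min(score, 1.0)

print(score_message_importance("user", "hi"))  # → 0.4
```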
v0.1.3, March 22, 2026
- Thread-based server architecture (run_thread_server) replacing single-threaded server
- All four language bindings rewritten as pure HTTP clients (Python: requests; JavaScript: axios; Java: Java 11 HttpClient; C++: cpp-httplib + nlohmann/json)
- Multi-format model support: .gguf, .onnx, .trt, .engine, .safetensors, .ggml, .mlmodel
- New BACKEND_URL and OPENROUTER_API_KEY config fields
- API port default changed from 8000 to 9999
- New modules: model_management, model_runtime, engine_management, worker_threads
- New APIs: conversations CRUD, title generation, memory optimize/cleanup, mode switching
- Lock-free backend URL switching via arc-swap
- Platform-specific GPU detection: Apple Silicon Metal, NVIDIA NVML, CPU fallback
- JitPack support (jitpack.yml) and Conan package support (conanfile.py) added
v0.1.2, February 7, 2026
- Automatic hardware detection added
- Improved memory management
- Enhanced error handling
- Fixed critical security vulnerabilities
v0.1.1, December 15, 2025
- Initial public release
- Multi-language bindings (Rust, Python, JavaScript, Java, C++)
- Core LLM integration via llama.cpp
- SQLite-backed memory management system
License: Apache 2.0 (core 80%). Commercial extensions available for advanced context management and enterprise features. Third-party components: llama.cpp (MIT), Axum (MIT), Tokio (MIT), Serde (MIT/Apache 2.0), SQLite (Public Domain).