SDK Documentation
v0.1.5, March 2026
Overview
Run AI models entirely on your own machine. No internet, no cloud, no data leaves your device. Cross-platform server with bindings for Python, JavaScript, Rust, C++, and Java.
Current version: v0.1.5 (March 30, 2026) | License: Apache 2.0
What Is This?
The Offline Intelligence Library is a server that runs large language models (LLMs) on your own computer. You download a model file once, and from that point on all AI inference happens locally. No API calls to OpenAI, no subscription fees, no data sent to anyone.
The server is written in Rust for speed and stability. Once it is running, you talk to it over HTTP from any language: Python, JavaScript, Java, C++, or Rust. The server handles everything: loading the model, managing conversation memory, streaming responses token by token, and optionally fetching live data (weather, currency, crypto prices) to answer questions the model alone could not.
How It Works
The Rust crate is the server. All other language bindings (Python, JavaScript, Java, C++) are pure HTTP clients that connect to the Rust server running on port 9999. The Rust server in turn manages the llama-server process and the GGUF model file on your machine.
Your App (Python / JS / Java / C++)
↕ HTTP (port 9999)
Offline Intelligence Rust Server
↕ HTTP (port 8081)
llama-server (llama.cpp)
↕
GGUF model file (local)

Features
| Feature | Description |
|---|---|
| 5 Language Bindings | Rust, Python, JavaScript/Node.js, Java, C++. All talk to the same server over HTTP |
| Fully Offline | Runs entirely on your machine. No internet required after model download |
| Privacy First | All data stays local. No telemetry, no cloud calls |
| Streaming Responses | Tokens stream back in real time, just like ChatGPT |
| Conversation Memory | SQLite-backed persistent memory with semantic search (HNSW index) |
| Live Web Tools | Automatically fetches weather, currency rates, and crypto prices to answer live questions |
| User Authentication | Built-in registration, login, JWT sessions, and Google OAuth 2.0 |
| API Key Management | Stores your HuggingFace and OpenRouter keys encrypted on-device |
| Online / Offline Toggle | Switch between local llama.cpp and OpenRouter cloud at runtime without restarting the server |
| File Attachments | Upload and attach files to conversations |
| Auto Hardware Detection | Automatically picks the right GPU layers, thread count, and memory limits for your machine |
| Prometheus Metrics | /metrics endpoint compatible with Grafana and any Prometheus-based monitoring stack |
| Multi-Format Models | Supports GGUF, GGML, ONNX, SafeTensors, CoreML, TensorRT model formats |
Supported Platforms
| OS | Architectures | Minimum Version |
|---|---|---|
| Windows | x86_64, ARM64 | Windows 10 |
| Linux | x86_64, ARM64 | Ubuntu 20.04 / CentOS 8 |
| macOS | x86_64, Apple Silicon | macOS 11.0 |
Quick Start
This gets you from zero to a running AI server in 5 steps.
Step 1: Download llama-server
llama-server is the engine that runs the AI model. Download a prebuilt binary from: github.com/ggerganov/llama.cpp/releases
Look for the most recent release and download the zip matching your OS:
| OS | File to look for |
|---|---|
| Windows | llama-b*-bin-win-*-x64.zip → extract llama-server.exe |
| macOS Apple Silicon | llama-b*-bin-macos-arm64.zip → extract llama-server |
| macOS Intel | llama-b*-bin-macos-x64.zip → extract llama-server |
| Linux x86_64 | llama-b*-bin-ubuntu-x64.zip → extract llama-server |
Place the binary somewhere on your system, for example:
Windows: C:\llama\llama-server.exe
macOS/Linux: /usr/local/bin/llama-server
Step 2: Download a Model
The library uses GGUF format model files. Pick one based on your available RAM:
| Model | File size | RAM needed | Download |
|---|---|---|---|
| Llama 3.2 3B Q4 | ~2 GB | 4 GB | Download |
| Mistral 7B Q4 | ~4 GB | 8 GB | Download |
| Llama 3 8B Q4 | ~5 GB | 10 GB | Download |
| Llama 3 70B Q4 | ~40 GB | 48 GB | Download |
Not sure which to pick? Start with Llama 3.2 3B Q4: it runs on almost any machine and is a good baseline.
Browse all GGUF models: huggingface.co/models?library=gguf
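The RAM guidance above can be expressed as a small helper. Note that `pick_model` is not part of the library; it simply mirrors the thresholds in the table for illustration:

```python
# Illustrative helper (not part of the SDK): map available RAM to a
# model from the table above.

def pick_model(ram_gb: float) -> str:
    """Return a reasonable GGUF model for the given amount of free RAM."""
    if ram_gb >= 48:
        return "Llama 3 70B Q4"
    if ram_gb >= 10:
        return "Llama 3 8B Q4"
    if ram_gb >= 8:
        return "Mistral 7B Q4"
    return "Llama 3.2 3B Q4"  # runs on almost any machine

print(pick_model(16))  # → Llama 3 8B Q4
```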
Step 3: Create a .env File
Create a file called .env in the folder where you will run the server. This tells the server where your files are.
macOS / Linux:
LLAMA_BIN=/usr/local/bin/llama-server
MODEL_PATH=/home/yourname/.offline-intelligence/models/llama-3.2-3b-instruct-q4_k_m.gguf
API_HOST=127.0.0.1
API_PORT=9999
Windows:
LLAMA_BIN=C:\llama\llama-server.exe
MODEL_PATH=C:\models\llama-3.2-3b-instruct-q4_k_m.gguf
API_HOST=127.0.0.1
API_PORT=9999
Everything else (GPU layers, thread count, memory limits) is detected automatically.
Step 4: Start the Server
cargo install offline-intelligence
offline-intelligence
You should see:
Starting with thread-based architecture
Memory database initialized
Model manager initialized successfully
Starting server on 127.0.0.1:9999
Verify it is running:
curl http://127.0.0.1:9999/healthz
Expected response: {"status":"ok"}
Note: The server must be running before you use any of the language clients below.
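If you are scripting the startup, the same health check can be done from Python. This is an illustrative snippet, not part of the SDK; it polls /healthz until the server answers {"status":"ok"}:

```python
import json
import time
import urllib.request

def is_healthy(body: str) -> bool:
    """Parse a /healthz response body and check for status ok."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        return False

def wait_for_server(url: str = "http://127.0.0.1:9999/healthz",
                    timeout: float = 30.0) -> bool:
    """Poll /healthz until the server is up or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if is_healthy(resp.read().decode()):
                    return True
        except OSError:
            pass  # server not up yet
        time.sleep(0.5)
    return False

# Usage, after launching `offline-intelligence`:
# assert wait_for_server(), "server did not come up in time"
```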
Step 5: Use Any Language Client
With the server running on port 9999, pick the language you want:
pip install offline-intelligence==0.1.5
npm install offline-intelligence@0.1.5
cargo add offline-intelligence@0.1.5
See the Language Usage Guide for full examples in each language.
Language Usage Guide
View on GitHub

Important: The Rust crate is the server. Every other language binding (Python, JavaScript, Java, C++) is an HTTP client that talks to the Rust server on port 9999. Start the server before using any non-Rust client.
Installation
Rust (Cargo)
cargo add offline-intelligence@0.1.5
Python (PyPI)
pip install offline-intelligence==0.1.5
JavaScript / Node.js (npm)
npm install offline-intelligence@0.1.5
Java (Maven)
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
<dependency>
<groupId>com.github.OfflineIntelligence</groupId>
<artifactId>offline-intelligence</artifactId>
<version>v0.1.5</version>
</dependency>

Java (Gradle)
repositories { maven { url 'https://jitpack.io' } }
dependencies {
implementation 'com.github.OfflineIntelligence:offline-intelligence:v0.1.5'
}

C++ (CMake FetchContent)
include(FetchContent)
FetchContent_Declare(
offline_intelligence
GIT_REPOSITORY https://github.com/OfflineIntelligence/offline-intelligence.git
GIT_TAG v0.1.5
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(offline_intelligence)
target_link_libraries(your_target PRIVATE offline_intelligence)

C++ (Conan)
conan install --requires="offline-intelligence/0.1.5" --build=missing
C++ (Manual)
Copy bindings/cpp/include/offline_intelligence/offline_intelligence.hpp into your project. Requires cpp-httplib and nlohmann/json headers.
Usage Examples
Rust
In Rust, you embed the server directly in your application.
use offline_intelligence::{config::Config, run_thread_server};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let cfg = Config::from_env()?;
run_thread_server(cfg, None).await
}

Custom configuration:
use offline_intelligence::{config::Config, run_thread_server};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let mut cfg = Config::from_env()?;
cfg.api_host = "0.0.0.0".to_string();
cfg.api_port = 9999;
cfg.model_path = "/path/to/model.gguf".to_string();
cfg.gpu_layers = 35;
run_thread_server(cfg, None).await
}

Python
from offline_intelligence import OfflineIntelligence, Config
import requests
cfg = Config.from_env()
ai = OfflineIntelligence(cfg)
print(ai.health_check())
response = ai.generate("Explain quantum computing in simple terms")
print(response)
for chunk in ai.generate_stream("Write a short poem about the ocean"):
print(chunk, end="", flush=True)
convs = ai.get_conversations()
title = ai.generate_title(session_id="abc123", first_message="Tell me about space")
stats = ai.get_memory_stats("abc123")
ai.optimize_memory()
# Tools, mode switching, API keys, feedback
settings = requests.get("http://127.0.0.1:9999/tools/settings").json()
requests.post("http://127.0.0.1:9999/mode", json={"mode": "online"})
requests.post("http://127.0.0.1:9999/api-keys", json={"key_type": "openrouter", "value": "sk-or-..."})
requests.post("http://127.0.0.1:9999/feedback", json={"message": "Great!"})

Custom configuration:
from offline_intelligence import Config, OfflineIntelligence

cfg = Config()
cfg.api_host = "127.0.0.1"
cfg.api_port = 9999
cfg.backend_url = "http://127.0.0.1:8081"
cfg.openrouter_api_key = "sk-or-..."
ai = OfflineIntelligence(cfg)
JavaScript / Node.js
const { OfflineIntelligence, Config } = require('offline-intelligence');
const cfg = Config.fromEnv();
const ai = new OfflineIntelligence(cfg);
async function main() {
const health = await ai.healthCheck();
console.log(health);
const response = await ai.generate('What is machine learning?');
console.log(response);
await ai.generateStream('Tell me a story', chunk => process.stdout.write(chunk));
const convs = await ai.getConversations();
const title = await ai.generateTitle('abc123', 'Tell me about black holes');
const stats = await ai.getMemoryStats('abc123');
await ai.optimizeMemory();
}
main();

Java
import com.offlineintelligence.OfflineIntelligence;
import com.offlineintelligence.Config;
public class Main {
public static void main(String[] args) {
Config cfg = Config.fromEnv();
OfflineIntelligence ai = new OfflineIntelligence(cfg);
System.out.println(ai.healthCheck());
String response = ai.generate("Explain recursion");
System.out.println(response);
ai.generateStream("Write a haiku", chunk -> System.out.print(chunk));
List<Conversation> convs = ai.getConversations();
String title = ai.generateTitle("abc123", "Space exploration");
MemoryStats stats = ai.getMemoryStats("abc123");
ai.optimizeMemory();
}
}

C++
#include <offline_intelligence/offline_intelligence.hpp>
#include <iostream>
int main() {
auto cfg = offline_intelligence::Config::from_env();
auto ai = offline_intelligence::OfflineIntelligence(cfg);
std::cout << ai.health_check() << std::endl;
auto response = ai.generate("What is AI?");
std::cout << response << std::endl;
ai.generate_stream("Count to 10", [](const std::string& chunk) {
std::cout << chunk << std::flush;
});
auto convs = ai.get_conversations();
auto title = ai.generate_title("abc123", "Programming help");
auto stats = ai.get_memory_stats("abc123");
ai.optimize_memory();
return 0;
}

What's New in v0.1.5
Live Web Tools
The server now detects certain questions in real time and fetches live data before sending the conversation to the AI model. This means the model can answer questions it otherwise couldn't (current temperature, today's exchange rates, live crypto prices).
How it works: Every incoming user message is scanned for intent. If a relevant intent is detected, the data is fetched in parallel (max 8 seconds per source, 10-second hard deadline), formatted with numbered [1], [2] citation markers, and injected as a system context block. If the fetch times out or fails, the model answers from its training data silently, with no error shown to the user.
| Intent | Trigger example | Data source |
|---|---|---|
| Weather | "What's the weather in Tokyo?" | Open-Meteo + Nominatim (keyless) |
| Currency | "Convert 200 USD to EUR" | ExchangeRate-API, 160+ currencies (keyless) |
| Crypto price | "What is Bitcoin worth right now?" | CoinGecko free API (keyless) |
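The intent scan described above can be sketched as simple pattern matching. The server's real detector is internal, so the patterns below are assumptions for demonstration only:

```python
import re

# Illustrative sketch of intent detection. These trigger patterns are
# assumptions; the server's actual detection logic is not public.
INTENT_PATTERNS = {
    "weather": re.compile(r"\bweather\b|\btemperature\b", re.I),
    "currency": re.compile(r"\bconvert\b.*\b[A-Z]{3}\b.*\b[A-Z]{3}\b", re.I),
    "crypto": re.compile(r"\bbitcoin\b|\bethereum\b|\bcrypto\b", re.I),
}

def detect_intents(message: str) -> list[str]:
    """Return the intents whose trigger patterns match the message."""
    return [name for name, pat in INTENT_PATTERNS.items() if pat.search(message)]

print(detect_intents("What's the weather in Tokyo?"))  # → ['weather']
print(detect_intents("Convert 200 USD to EUR"))        # → ['currency']
```

In the real server, each detected intent maps to one of the keyless data sources in the table, fetched in parallel under the 10-second deadline.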
Manage tools via API:
GET http://127.0.0.1:9999/tools/settings
POST http://127.0.0.1:9999/tools/settings {"enabled": true, "brave_key": "optional"}

User Authentication
Full auth stack built into the server. No third-party service needed:
POST /auth/register {"username": "alice", "email": "alice@example.com", "password": "secret"}
POST /auth/login {"email": "alice@example.com", "password": "secret"}
GET /auth/google?redirect_uri=http://localhost:3000/callback
GET /auth/verify?token=<email-verification-token>

Passwords are hashed with Argon2. Login returns a JWT. Pass it as Authorization: Bearer <token> on protected endpoints.
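The register → login → authenticated-request flow can be sketched in Python. The endpoint paths come from the list above; the `requests` usage and the `token` response field name are assumptions for illustration:

```python
def bearer_header(token: str) -> dict:
    """Build the Authorization header expected by protected endpoints."""
    return {"Authorization": f"Bearer {token}"}

def register_and_login(username: str, email: str, password: str,
                       base: str = "http://127.0.0.1:9999") -> str:
    import requests  # the same HTTP library the Python examples use
    requests.post(f"{base}/auth/register",
                  json={"username": username, "email": email, "password": password})
    resp = requests.post(f"{base}/auth/login",
                         json={"email": email, "password": password})
    # The "token" field name in the login response is an assumption.
    return resp.json()["token"]

# Usage (with the server running):
# token = register_and_login("alice", "alice@example.com", "secret")
# requests.get("http://127.0.0.1:9999/conversations", headers=bearer_header(token))
```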
Encrypted API Key Storage
Store your HuggingFace and OpenRouter keys on-device. They are encrypted using a machine-specific key before being written to SQLite. They never exist in plaintext outside the process:
POST /api-keys {"key_type": "huggingface", "value": "hf_..."}
POST /api-keys {"key_type": "openrouter", "value": "sk-or-..."}
GET /api-keys?key_type=huggingface
DELETE /api-keys?key_type=openrouter

Runtime Mode Switching
Switch between local (llama.cpp) and cloud (OpenRouter) inference without restarting:
POST /mode {"mode": "offline"}
POST /mode {"mode": "online"}

User Feedback
POST /feedback {"message": "Really helpful!", "email": "optional@email.com"}

Configuration Guide
Default Configuration Parameters
The server reads its configuration from environment variables (or a .env file). The defaults are tuned for general use, and performance-related values are auto-detected from your hardware where possible.
# Required (no auto-detection)
LLAMA_BIN=/usr/local/bin/llama-server
MODEL_PATH=/home/user/.offline-intelligence/models/model.q4_k_m.gguf

# Server (defaults shown)
API_HOST=127.0.0.1
API_PORT=9999
LLAMA_HOST=127.0.0.1
LLAMA_PORT=8081
PROMETHEUS_PORT=9000

# Backend and cloud fallback
BACKEND_URL=http://127.0.0.1:8081
OPENROUTER_API_KEY=sk-or-...  # optional, only for online mode

# Performance: auto-detected if omitted
CTX_SIZE=8192
BATCH_SIZE=256
GPU_LAYERS=auto
THREADS=auto

# Rate limiting and concurrency
MAX_CONCURRENT_STREAMS=4
REQUESTS_PER_SECOND=24
All Environment Variables
| Variable | Default | Auto-detect | Description |
|---|---|---|---|
| LLAMA_BIN | none | No | Path to llama-server binary (required) |
| MODEL_PATH | none | No | Path to GGUF model file (required) |
| BACKEND_URL | http://127.0.0.1:8081 | No | Full URL to llama-server |
| OPENROUTER_API_KEY | none | No | OpenRouter key for online/cloud fallback mode |
| API_HOST | 127.0.0.1 | No | API server bind address |
| API_PORT | 9999 | No | API server port |
| LLAMA_HOST | 127.0.0.1 | No | llama-server host |
| LLAMA_PORT | 8081 | No | llama-server port |
| CTX_SIZE | 8192 | Yes | Context window size in tokens |
| BATCH_SIZE | 256 | Yes | Processing batch size |
| THREADS | auto | Yes | CPU thread count (see CPU detection below) |
| GPU_LAYERS | auto | Yes | GPU acceleration layers (see GPU detection below) |
| MAX_CONCURRENT_STREAMS | 4 | No | Max simultaneous requests |
| PROMETHEUS_PORT | 9000 | No | Prometheus metrics endpoint port |
| REQUESTS_PER_SECOND | 24 | No | Rate limiting threshold |
Performance Optimization Guidelines
Automatic tuning adapts to the available hardware, so manual adjustments are usually only needed for specialized deployments or performance requirements beyond what auto-detection provides.
| Scenario | CTX_SIZE | BATCH_SIZE | GPU_LAYERS | THREADS | MAX_CONCURRENT_STREAMS |
|---|---|---|---|---|---|
| Edge device (Raspberry Pi, Jetson Nano, 4 GB RAM, CPU only) | 2048 | 32 | 0 | 2 | 1 |
| Laptop / workstation (8–16 GB RAM, no GPU) | 4096 | 128 | 0 | auto | 2 |
| Standard GPU server (8–16 GB VRAM) | 8192 | 256 | auto | auto | 4 |
| High-performance GPU server (16 GB+ VRAM) | 8192 | 512 | auto | auto | 8 |
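For example, the laptop/workstation row above translates to this .env fragment (THREADS is left unset so the server auto-detects it):

```
# Laptop / workstation, 8–16 GB RAM, no GPU
CTX_SIZE=4096
BATCH_SIZE=128
GPU_LAYERS=0
MAX_CONCURRENT_STREAMS=2
```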
GPU Layer Auto-Detection (NVIDIA)
| VRAM Available | GPU_LAYERS assigned |
|---|---|
| 0 to 4 GB | 12 |
| 5 to 8 GB | 20 |
| 9 to 12 GB | 32 |
| 13 to 16 GB | 40 |
| 16 GB+ | 50 |
| Apple Silicon (macOS ARM64) | 24 to 56 (Metal, unified memory) |
| Intel Mac (macOS x86_64) | 0 (CPU only) |
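The NVIDIA rows of the table can be restated as a small function. This is only an illustrative mirror of the table; the server's actual detection logic (via NVML) is internal:

```python
# Illustrative mirror of the NVIDIA VRAM → GPU_LAYERS table above.
# The server's real detection queries NVML; this just restates the table.

def gpu_layers_for_vram(vram_gb: int) -> int:
    """Return the GPU_LAYERS value the table assigns for a given VRAM size."""
    if vram_gb > 16:
        return 50
    if vram_gb >= 13:
        return 40
    if vram_gb >= 9:
        return 32
    if vram_gb >= 5:
        return 20
    return 12

print(gpu_layers_for_vram(8))   # → 20
print(gpu_layers_for_vram(24))  # → 50
```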
Common Scenarios
Typical configuration adjustments for frequently encountered situations.
Reducing Memory Usage
# Low-memory configuration
CTX_SIZE=2048
BATCH_SIZE=64
GPU_LAYERS=0

Maximizing Throughput
# High-throughput configuration
GPU_LAYERS=40
BATCH_SIZE=512
THREADS=8

Resolving Port Conflicts
# Alternative port assignments
API_PORT=8001
LLAMA_PORT=8082
PROMETHEUS_PORT=9001

API Reference and Monitoring
All endpoints are served by the Rust server at http://API_HOST:API_PORT (default 127.0.0.1:9999). The Prometheus metrics endpoint is served separately at PROMETHEUS_PORT (default 9000).
Core Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /generate/stream | SSE streaming generation. Body: messages[], session_id, temperature, max_tokens, top_p, frequency_penalty |
| GET | /healthz | Liveness check. Always returns {"status":"ok"} when the server process is alive |
| GET | /readyz | Readiness check. Returns backend_connected and model_loaded status; use for Kubernetes readiness probes |
| GET | /metrics | Prometheus-compatible metrics (requests, duration, resource usage). Served on PROMETHEUS_PORT (default 9000) |
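/generate/stream returns Server-Sent Events. A minimal streaming client can be sketched as follows. The request body fields come from the table above; the exact SSE payload shape (plain text chunks on `data:` lines) is an assumption:

```python
def parse_sse_data(line: str):
    """Extract the payload from one 'data: ...' SSE line; None for other lines."""
    prefix = "data: "
    return line[len(prefix):] if line.startswith(prefix) else None

def stream_generate(prompt: str, session_id: str = "demo",
                    base: str = "http://127.0.0.1:9999") -> None:
    import requests  # same HTTP library the Python binding uses
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "session_id": session_id,
        "temperature": 0.7,
        "max_tokens": 256,
    }
    with requests.post(f"{base}/generate/stream", json=body, stream=True) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            chunk = parse_sse_data(raw or "")
            if chunk is not None:
                print(chunk, end="", flush=True)

# Usage (with the server running):
# stream_generate("Write a haiku about autumn")
```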
Admin Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /admin/status | Returns status, version, uptime, active connections, total requests served |
| POST | /admin/load | Load a model. Body: model_path, ctx_size, gpu_layers, batch_size |
| POST | /admin/stop | Stop the llama-server backend process |
Memory Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /memory/stats/{session_id} | Returns message count, total tokens, timestamps, and storage size for a session |
| POST | /memory/optimize | Runs PRAGMA optimize + WAL checkpoint across the SQLite memory database |
| POST | /memory/cleanup | Removes stale and expired session entries using elapsed-time thresholds |
Conversation Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | /conversations | List all stored conversations |
| GET | /conversations/{id} | Get a specific conversation with full message history |
| DELETE | /conversations/{id} | Delete a conversation and all associated messages |
| GET | /conversations/{id}/title | Get the auto-generated title for a conversation |
| POST | /generate/title | Generate a title. Body: session_id, first_message |
Authentication Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /auth/register | Register new user. Body: username, email, password |
| POST | /auth/login | Login and receive JWT token. Body: email, password |
| GET | /auth/google | Google OAuth 2.0 login. Query: redirect_uri |
| GET | /auth/verify | Verify email with token. Query: token |
API Keys Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api-keys | Store encrypted API key. Body: key_type, value |
| GET | /api-keys | Retrieve stored API key. Query: key_type |
| DELETE | /api-keys | Delete stored API key. Query: key_type |
Mode & Tools Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /mode | Switch between local and online mode. Body: mode ("offline" or "online") |
| GET | /tools/settings | Get live web tools settings |
| POST | /tools/settings | Update tools settings. Body: enabled, brave_key (optional) |
| POST | /feedback | Submit user feedback. Body: message, email (optional) |
Metrics Access
# Health and readiness
curl http://127.0.0.1:9999/healthz
curl http://127.0.0.1:9999/readyz

# Prometheus metrics (note: separate port 9000)
curl http://127.0.0.1:9000/metrics

# Admin status
curl http://127.0.0.1:9999/admin/status

# Memory stats for a session
curl http://127.0.0.1:9999/memory/stats/my-session-id
Changelog
v0.1.5, March 30, 2026
- Live Web Tools: Real-time intent detection for weather, currency conversion, and crypto prices. Data fetched from Open-Meteo, ExchangeRate-API, and CoinGecko with automatic citation injection
- User Authentication: Full auth stack with registration, login, JWT sessions, Google OAuth 2.0, and email verification. Passwords hashed with Argon2
- Encrypted API Key Storage: Store HuggingFace and OpenRouter keys encrypted on-device using machine-specific keys. Keys never exist in plaintext outside the process
- Runtime Mode Switching: Switch between local (llama.cpp) and cloud (OpenRouter) inference at runtime without restarting the server via POST /mode
- User Feedback Endpoint: New POST /feedback endpoint for collecting user feedback with optional email
- File Attachments: Upload and attach files to conversations
v0.1.4, March 27, 2026
- Lazy HNSW index rebuild: EmbeddingStore now uses an AtomicBool dirty flag. Index is rebuilt once on the first search after inserts, eliminating the previous per-insert O(n²) rebuild cost
- Content-aware importance scoring: Replaced all hardcoded 0.5 values with score_message_importance(role, content): role base scores (system=0.9, assistant=0.6, user=0.4) plus bonuses for code blocks, key concepts, and message length
- Real llama-server KV cache integration: LlamaKVCacheInterface now queries GET /slots for live token counts; cache operations use POST /slots/0 with erase/restore actions
- Token-bucket KV entries: Slot token sequences divided into 64-token buckets; importance derived from position fraction (earlier = higher priority)
- sysinfo-based memory limits: estimate_max_cache_memory() uses real available system RAM. 25% is allocated to KV cache, clamped between 256 MB and 8 GB
- Database and cache workers fully wired: All database operations (store, get, update, delete conversations) call real database methods; cache worker flushes to database on update
- Admin maintenance operational: Session cleanup uses DashMap::retain() with elapsed-time thresholds; optimize_database runs PRAGMA optimize + WAL checkpoint(TRUNCATE)
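The content-aware scoring change above can be illustrated with a sketch. The role base scores (system=0.9, assistant=0.6, user=0.4) come from the changelog entry; the bonus magnitudes and the 1.0 cap are assumptions:

```python
# Sketch of content-aware importance scoring. Role base scores are from
# the changelog; the bonus values and the cap are illustrative assumptions.

ROLE_BASE = {"system": 0.9, "assistant": 0.6, "user": 0.4}

def score_message_importance(role: str, content: str) -> float:
    score = ROLE_BASE.get(role, 0.4)
    if "```" in content:    # bonus for code blocks
        score += 0.15
    if len(content) > 500:  # bonus for long messages
        score += 0.1
    return min(score, 1.0)

print(score_message_importance("user", "hi"))  # → 0.4
```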
v0.1.3, March 22, 2026
- Thread-based server architecture (run_thread_server) replacing single-threaded server
- All four language bindings rewritten as pure HTTP clients (Python: requests; JavaScript: axios; Java: Java 11 HttpClient; C++: cpp-httplib + nlohmann/json)
- Multi-format model support: .gguf, .onnx, .trt, .engine, .safetensors, .ggml, .mlmodel
- New BACKEND_URL and OPENROUTER_API_KEY config fields
- API port default changed from 8000 to 9999
- New modules: model_management, model_runtime, engine_management, worker_threads
- New APIs: conversations CRUD, title generation, memory optimize/cleanup, mode switching
- Lock-free backend URL switching via arc-swap
- Platform-specific GPU detection: Apple Silicon Metal, NVIDIA NVML, CPU fallback
- JitPack support (jitpack.yml) and Conan package support (conanfile.py) added
v0.1.2, February 7, 2026
- Automatic hardware detection added
- Improved memory management
- Enhanced error handling
- Fixed critical security vulnerabilities
v0.1.1, December 15, 2025
- Initial public release
- Multi-language bindings (Rust, Python, JavaScript, Java, C++)
- Core LLM integration via llama.cpp
- SQLite-backed memory management system
License: Apache 2.0 (core 80%). Commercial extensions available for advanced context management and enterprise features. Third-party components: llama.cpp (MIT), Axum (MIT), Tokio (MIT), Serde (MIT/Apache 2.0), SQLite (Public Domain).