BeigeBox
OpenAI-compatible LLM middleware • routing • observability • local history

Stop rebuilding LLM integrations.
Smart routing. Local. Open.

BeigeBox sits between your app and any language model—automatically routing to cheaper models, caching expensive responses, and switching backends without touching your code. Zero frontend changes. Transparent to clients, invisible to backends.

  • 🎯 Smart routing · automatic model selection
  • 💾 Response cache · no redundant API calls
  • 🧩 Multi-backend support · OpenRouter + Ollama + OpenAI-compatible
  • ⚡ Local-first storage · SQLite + ChromaDB
  • 🚀 One HTML file UI · no build step
  • ✅ Production-ready · Apache 2.0 license

Unique Features

Capabilities you won't find assembled anywhere else.

Semantic response cache

Responses are cached by meaning using cosine similarity, not by exact string or TTL. Near-identical questions return instantly — no backend round-trip, no model load.

WASM post-processing pipeline

Drop any WASI-compiled module into wasm_modules/. BeigeBox buffers the stream, pipes it through your transform (stdin → stdout), and re-emits the clean result. Any language, zero Python deps.

Council mode with model affinity batching

Fan a single prompt out to multiple models simultaneously, then vote, merge, or surface the best answer. Model-affinity ensures each council member stays on the backend it performs best on.

Trajectory evaluation scoring

Every agentic run is scored end-to-end: tool selection, reasoning chain, and final answer quality. Scores surface in the dashboard so you can compare models on real tasks, not synthetic benchmarks.

Session-sticky multi-turn routing

The routing decision for turn 1 is cached for the whole conversation. A code session stays on the code model; a chat session stays on the fast model — no mid-conversation backend flips.

Operator with persistent notes

The operator agent writes structured notes to a persistent memory store between runs. Across sessions it recalls context, tool results, and prior reasoning — no re-discovery from scratch each time.

How it works

Client / UI BeigeBox route · cache · log · transform SQLite history ChromaDB vectors Ollama local OpenRouter cloud
Solid — request path  ·  Dashed — async writes
  1. Ingest. Any OpenAI-compatible client hits localhost:1337/v1. No changes to your client, frontend, or tools required.
  2. Cache check. The request is embedded and checked against the semantic cache. A sufficiently similar previous query returns the cached response immediately — no backend round-trip, no model load.
  3. Classify & route. A 4-tier pipeline picks the right model: Z-command overrides → agentic keyword scorer → embedding classifier (~50 ms) → decision LLM for borderline cases. Session-sticky so conversations stay on one model.
  4. Execute. The request forwards to the chosen provider — Ollama, OpenRouter, or any OpenAI-compatible endpoint. BeigeBox streams the response back transparently with latency-aware failover.
  5. Transform. If a WASM module is active, BeigeBox buffers the full stream, runs the transform, and re-emits. The client always sees the final output — not raw intermediate fragments.
  6. Log, embed & measure. Every event writes to a structured JSONL wire log and SQLite. Responses embed into ChromaDB for semantic memory. Latency, TTFT, token counts, and cost surface in the dashboard.

Quick start

Get running in 2 minutes

Full stack with Ollama + BeigeBox — Docker Compose does the rest.

git clone --recursive https://github.com/ralabarge/beigebox.git
cd beigebox/docker
cp env.example .env        # optional — set GPU, ports, API keys
docker compose up -d

Open http://localhost:1337 for the web UI. The OpenAI-compatible API is at http://localhost:1337/v1.

Optional: Add profiles for browser automation, voice I/O, or alternative inference engines.

Other setups:

Already have Ollama/another provider?

Run just BeigeBox in front of it. Set backend.url in config.yaml to your existing provider.

git clone --recursive https://github.com/ralabarge/beigebox.git
cd beigebox/docker
cp env.example .env
docker compose up -d beigebox

Then update config.yaml to point to your Ollama/OpenRouter/custom API endpoint.

Dev mode (hot-reload)

Edit config.yaml or runtime_config.yaml — changes take effect immediately, no restart.

pip install -e .
cp config.example.yaml config.yaml
beigebox dial

FastAPI at localhost:8000 · docs at /docs

Any OpenAI-compatible client works. Point your tools (ChatGPT UI, LM Studio, Cursor, Aider, Continue) at http://localhost:1337/v1 — zero code changes needed.
Docker Compose profiles (optional)

Browser automation (CDP): docker compose --profile cdp up -d

Voice I/O (STT + TTS): docker compose --profile voice up -d

Both: docker compose --profile cdp --profile voice up -d

Alternative engines: llama.cpp (engines-cpp), vLLM (engines-vllm), ExecutorTorch (engines-executorch)

Deploy anywhere

BeigeBox is self-hosted and portable. Choose your platform:

🐳 Docker Compose

The simplest path. Includes Ollama, BeigeBox, and all dependencies. .env controls everything — GPU, ports, model pulls.

See deploy/docker/ for production variant with volume mounts.

☸ Kubernetes

Deployments, Services, ConfigMaps, PVCs ready to go. Includes resource requests/limits, liveness + readiness probes, and security context.

See deploy/k8s/ for YAML manifests.

🖥 Bare Metal

Systemd unit file for Linux servers. Virtual environment support, security hardening, and environment variable overrides included.

See deploy/systemd/ for unit template.

Your data, local-first. All conversations, embeddings, and metrics stay in SQLite + ChromaDB — no cloud dependency, no vendor lock-in. Portable files across machines.

Who is this for

Perfect for you if:

  • ✓ Running Ollama or multiple LLM providers
  • ✓ Building LLM-powered features in an app
  • ✓ Want to reduce LLM API costs (route to cheaper models)
  • ✓ Need observability into LLM usage (latency, cost, tokens)
  • ✓ Self-hosted teams keeping data on-prem
  • ✓ Evaluating how different models perform on your workload

Not for you if:

  • ✗ You only use OpenAI (no multi-model benefit)
  • ✗ Your only concern is latency (caching helps but isn't the main goal)
  • ✗ You prefer managed cloud solutions (BeigeBox is self-hosted)
  • ✗ You have minimal LLM usage (overhead of routing > gains)

What makes it different

Most LLM proxies do passthrough and load balancing. BeigeBox goes further.

4
routing tiers
3
cache layers
3
multi-model modes
~50ms
classifier latency
0
JS dependencies
1
HTML file for the UI

An LLM that routes to other LLMs

The routing pipeline has four tiers. Simple cases never touch a model. Borderline cases get evaluated by a small, fast decision LLM before dispatch. Session-sticky so a conversation stays on one model.

  • tier 1 Z-command user overrides (instant, inline)
  • tier 2 Agentic keyword scorer (zero-cost pre-filter)
  • tier 3 Embedding classifier (~50 ms cosine distance)
  • tier 4 Decision LLM for borderline cases

Semantic response cache

Not a TTL cache. Every query is embedded with nomic-embed-text and checked against stored responses by cosine similarity. A sufficiently similar question gets an instant answer — no model load, no backend round-trip.

Combined with an embedding dedup cache (avoids redundant embeds) and a tool result cache (SHA-256 keyed, short TTL for deterministic tools).

WASM post-processing pipeline

Drop a compiled .wasm (WASI target) into wasm_modules/. BeigeBox buffers the full response stream, runs the transform (stdin → stdout), and re-emits. The client always sees the transformed output.

Any WASI-targeting language: Rust, C, Go, AssemblyScript. Timeout-enforced — if the module hangs, the original response passes through unmodified. Included: opener_strip removes sycophantic openers.

Latency-aware multi-backend routing

Each backend tracks a rolling P95 latency window over 100 samples. Backends that exceed a configurable threshold are automatically deprioritised to second-pass fallback — traffic returns when they recover.

A/B traffic splitting via traffic_split weights. Weighted random selection among healthy backends for percentage-based distribution.

Three multi-model orchestration modes

Run the same prompt across N models simultaneously and compare, vote, or iterate:

  • harness Side-by-side comparison, no arbiter
  • ensemble LLM judge picks the best response with reasoning
  • orchestrate Goal-driven loop: plan → dispatch → evaluate → iterate (SSE stream)

Built-in UI — one HTML file

Seven-tab interface with no build step, no JavaScript framework, no dependencies. Dashboard, multi-pane Chat, Conversations, Tap (live wire log), Operator, Harness, Config.

Per-pane chat settings: temperature, top-p/k, context window, repeat penalty, GPU layer count, force-reload (evicts model from VRAM for immediate effect), and per-pane system prompt.

Full feature set

Routing & delegation

4-tier pipeline with session-sticky assignment. Per-backend priority ordering with automatic health-check failover. Force a route with a Z-command at any time.

Semantic + tool caching

Semantic cache (cosine similarity), embedding dedup cache, and SHA-256 tool result cache. Three layers, all in-process, zero extra services.

Multi-backend with A/B splitting

Ollama, OpenRouter, or any OpenAI-compatible endpoint. Weighted traffic distribution. Rolling P95 auto-deprioritization with automatic recovery.

Observability

Structured JSONL wire log (TTFT, P50/P90/P95/P99, tokens/sec, cost), SQLite conversation history, per-model hardware stats (GPU layers, VRAM, context window), verbose agentic logging (thoughts, tool calls, iterations), and live backend health dashboard with latency tracking.

Harness + ensemble + orchestration

Three multi-model modes. Ensemble voting with an LLM arbiter that returns reasoning. Goal-driven orchestration via SSE stream for long-running tasks.

Operator agent

JSON tool-loop agent with web search, calculator, datetime, system info, memory recall, and ensemble. Shell execution through a 4-layer bubblewrap sandbox (no network, no home dir).

Conversation replay & fork

Any stored conversation reconstructed and re-run against a different model or config. Fork a conversation mid-thread without affecting the original branch.

Local-first storage

SQLite for conversations, metrics, and cost. ChromaDB (embedded, no separate service) for vectors. Portable files — your data, no cloud dependency.

Per-model GPU options

Set options.num_gpu per model in config.yaml. Override per chat pane. Force-reload evicts from VRAM and reloads immediately — no restart required.

Hooks & plugins

Pre/post request interceptors for prompt injection detection, RAG injection, and custom policy. Drop a *Tool class in plugins/ to auto-register a new tool.

System context injection

Hot-reloaded system_context.md prepended to every request. Override per chat pane with a window-specific system prompt — the ⚙ button glows when active.

Voice (optional profile)

Whisper (STT) + Kokoro (TTS) via Docker Compose voice profile. Push-to-talk with configurable hotkey. Auto-play TTS after each response. No changes to core proxy required.

Per-endpoint timeouts

Configure timeout_ms per backend. Ollama on slow hardware? Use 120s. OpenRouter? 60s. Prevents one slow backend from stalling the router. Overrides per-backend, hot-reloadable.

Response format validation

Optional JSON/XML/YAML schema enforcement. Validates final responses before caching. Non-blocking — validation failures logged but responses still forwarded. Helps catch malformed outputs early.

Browser automation (CDP)

Operator agent can control Chrome via DevTools Protocol. Navigate pages, take screenshots, click elements, read DOM, type text. Full I/O captured for cross-task context. Phase 1 live, more coming.

MCP server

Model Context Protocol server at POST /mcp. Expose BeigeBox tools to Claude Desktop, external agents, and third-party MCP clients. All tools + operator/run + skills as resources. Works with header auth.

OAuth2 web UI auth

Optional OAuth2 authentication for the web UI. Google provider included with PKCE S256. User sessions gated via signed cookies. Expandable provider protocol for GitHub, Discord, etc. Multi-user conversation isolation roadmap.

Z-commands

Prefix any message with z: to override routing, force tools, or branch a conversation — without leaving the chat.

z: simpleRoute this message to the fast model
z: complexRoute to the large / capable model
z: codeRoute to the code-specialist model
z: llama3.2:3bRoute to an exact model by name
z: searchForce web search and inject results before sending
z: memorySearch past conversations and inject relevant context
z: calc 2^32Evaluate a math expression inline (sandboxed)
z: sysinfoInject current system resource stats
z: forkBranch this conversation into a new thread
z: complex,searchChain multiple directives in one prefix
z: helpShow all available commands

UI snapshots

Built-in seven-tab web interface. No build step, no framework, one HTML file.

BeigeBox dashboard UI screenshot
Dashboard — storage stats, P50/P90/P99 latency charts, TTFT, cost breakdown, backend health.
BeigeBox multi-pane chat screenshot
Chat — multi-pane view, fan out to N models simultaneously, per-pane settings with GPU layer control and force-reload.

Licensing

BeigeBox is dual-licensed to support both open-source and commercial use.

AGPL-3.0 (Free)

Use, modify, and distribute freely.

  • ✓ Open-source projects
  • ✓ Internal use
  • ✓ Non-commercial use
  • ✓ Public modifications required

If you modify BeigeBox and run it as a service, you must share your modifications.

Read full AGPL terms →

Commercial License (Paid)

Use in proprietary products.

  • ✓ Proprietary/closed-source use
  • ✓ SaaS & managed services
  • ✓ Keep modifications private
  • ✓ No open-source obligations

For enterprise deployments, commercial products, or proprietary modifications.

Learn more about commercial licensing →