Measured on real APIs.
Every number is live.
A 15-turn product-support conversation. Token counts taken directly from `usage.input_tokens`: not estimated, not modeled. Open source, reproducible, and auditable.
Context grows linearly. NocturnusAI stays flat.
Each naive turn replays the full conversation. NocturnusAI retrieves only stored facts — so cost stays constant regardless of conversation length.
Gemini's growth is steeper because its naive pass concatenates the full conversation into a single prompt string. The compression ratio improves with conversation length.
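The shape of the two curves can be sketched with a toy model. The numbers below are illustrative placeholders, not the benchmark's measurements:

```python
AVG_MSG_TOKENS = 85   # assumed average tokens per message (illustrative)
FACT_TOKENS = 220     # assumed flat fact-context size per turn (illustrative)

def naive_input_tokens(turn: int) -> int:
    """Naive replay: turn t resends all prior messages plus the new one,
    so per-turn input grows linearly with the turn number."""
    prior_messages = (turn - 1) * 2  # one user + one assistant message per past turn
    return prior_messages * AVG_MSG_TOKENS + AVG_MSG_TOKENS

def retrieval_input_tokens(turn: int) -> int:
    """Fact retrieval: context size is independent of conversation length."""
    return FACT_TOKENS
```

Summing the naive per-turn cost over n turns gives O(n²) total input tokens for the whole conversation, versus O(n) with retrieval.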
Input tokens per turn — all 15 turns.
| Turn | Claude Naive | Claude Nocturnus | Ratio | Gemini Naive | Gemini Nocturnus | Ratio |
|---|---|---|---|---|---|---|
| T1 | 53 | 78 | 0.7× | 48 | 72 | 0.7× |
| T2 | 225 | 91 | 2.5× | 132 | 89 | 1.5× |
| T3 | 396 | 108 | 3.7× | 244 | 105 | 2.3× |
| T4 | 567 | 128 | 4.4× | 322 | 126 | 2.6× |
| T5 | 740 | 157 | 4.7× | 450 | 151 | 3.0× |
| T6 | 912 | 180 | 5.1× | 574 | 173 | 3.3× |
| T7 | 1,091 | 224 | 4.9× | 717 | 217 | 3.3× |
| T8 | 1,266 | 244 | 5.2× | 1,168 | 236 | 4.9× |
| T9 | 1,435 | 251 | 5.7× | 1,726 | 246 | 7.0× |
| T10 | 1,607 | 272 | 5.9× | 2,464 | 269 | 9.2× |
| T11 | 1,776 | 285 | 6.2× | 3,321 | 277 | 12.0× |
| T12 | 1,949 | 309 | 6.3× | 4,004 | 305 | 13.1× |
| T13 | 2,119 | 318 | 6.7× | 4,823 | 319 | 15.1× |
| T14 | 2,290 | 332 | 6.9× | 5,482 | 331 | 16.6× |
| T15 | 2,460 | 331 | 7.4× | 7,084 | 330 | 21.5× |
| Avg | 1,259 | 221 | 5.7× | 2,171 | 216 | 10.0× |
Claude Naive T1 is low (53 tokens) because turn 1 has no history yet. The ratio compounds from T2 onwards.
Reproducible. Open. Skeptics welcome.
The naive pass:

- Conversation history accumulated as a message list
- Every turn sends the complete history to the model
- Token count taken from the API's `usage.input_tokens`
- 0.35s sleep between calls to respect rate limits
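The naive loop can be sketched like this. The `send` callable is a stand-in for the actual API call (for Claude, a thin wrapper around `client.messages.create` that returns the reply text and `usage.input_tokens`); it is not part of the real script:

```python
import time

history = []  # full conversation, replayed on every turn

def naive_turn(user_msg, send):
    """One naive turn: append the user message, send the COMPLETE history,
    record the reply, and return the input-token count the API reported.
    `send(messages)` must return (reply_text, input_tokens)."""
    history.append({"role": "user", "content": user_msg})
    reply, input_tokens = send(history)
    history.append({"role": "assistant", "content": reply})
    time.sleep(0.35)  # rate-limit courtesy, as in the benchmark
    return input_tokens
```

Because `history` only ever grows, the tokens reported for each turn grow with it, which is exactly the linear climb in the naive columns above.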
The NocturnusAI pass:

- Key facts extracted per turn (predicate + value pairs)
- Stored via `POST /assert/fact` with tenant isolation
- Each turn retrieves facts via `POST /query`, one call per predicate
- Only retrieved facts sent as context; no history replay
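The store/retrieve cycle can be sketched as below. The endpoints `POST /assert/fact` and `POST /query` come from the benchmark description; the JSON field names (`tenant`, `predicate`, `value`) are assumptions, not the documented schema:

```python
import json
from urllib import request

BASE = "http://localhost:9300"  # default port from the docker command

def post_json(path, payload):
    """POST a JSON payload to the NocturnusAI server and decode the reply."""
    req = request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def store_fact(tenant, predicate, value, post=post_json):
    # POST /assert/fact with tenant isolation (field names are guesses)
    return post("/assert/fact", {"tenant": tenant, "predicate": predicate, "value": value})

def retrieve_facts(tenant, predicates, post=post_json):
    # one POST /query per predicate, as the benchmark does
    return [post("/query", {"tenant": tenant, "predicate": p}) for p in predicates]
```

Only the facts returned by `retrieve_facts` go into the prompt, so the context sent to the model stays flat no matter how long the conversation runs.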
What we measured (and what we didn't)

Measured:

- Input tokens only (from the API response)
- 15 turns × 4 passes (Claude naive, Claude NocturnusAI, Gemini naive, Gemini NocturnusAI)
- Fresh tenant per run (timestamp-suffixed)

Not measured:

- Output tokens (identical for both approaches)
- Response quality or accuracy
- NocturnusAI latency overhead (~2ms per turn)
`nocturnusai-bench/run_benchmark.py`: a single file, stdlib plus `anthropic` and `google-genai`.
50,000 turns/month. Real prices.
| Model | Price | Naive avg | Naive cost/mo | Nocturnus avg | Nocturnus cost/mo | Savings |
|---|---|---|---|---|---|---|
| Claude Opus 4 | $15/1M | 1,259 tok | $944 | 221 tok | $166 | 82% off |
| Gemini 2.0 Flash | $0.10/1M | 2,171 tok | $10.86 | 216 tok | $1.08 | 90% off |
The compression ratio (5.7× for Claude, 10.0× for Gemini) is constant regardless of volume, so the percentage savings holds at any scale.
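The cost columns are simple arithmetic on the average input tokens per turn; you can reproduce them yourself:

```python
def monthly_cost(avg_input_tokens, price_per_million_tokens, turns=50_000):
    """Monthly input-token cost: tokens/turn x turns/month x $/token."""
    return avg_input_tokens * turns * price_per_million_tokens / 1_000_000

claude_naive = monthly_cost(1_259, 15.0)  # Claude Opus 4 at $15/1M input tokens
claude_noct = monthly_cost(221, 15.0)
print(f"${claude_naive:.0f} vs ${claude_noct:.0f}, "
      f"{1 - claude_noct / claude_naive:.0%} off")  # → $944 vs $166, 82% off
```

The same function with Gemini 2.0 Flash's numbers (2,171 and 216 average tokens at $0.10/1M) reproduces the $10.86 and $1.08 figures in the table.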
Run it yourself.
OPEN SOURCE · SINGLE SCRIPT · BRING YOUR OWN API KEYS
```sh
# 1. Start NocturnusAI
docker run -p 9300:9300 ghcr.io/auctalis/nocturnusai:latest

# 2. Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...

# 3. Run
git clone https://github.com/Auctalis/nocturnusai
cd nocturnusai/nocturnusai-bench
uv run run_benchmark.py
```