Live Benchmark

Measured on real APIs.
Every number is live.

A 15-turn product-support conversation. Token counts taken directly from usage.input_tokens — not estimated, not modeled. Open source, reproducible, and auditable.

Run it yourself · Read the docs
Token Usage Per Turn

Context grows linearly. NocturnusAI stays flat.

Each naive turn replays the full conversation. NocturnusAI retrieves only stored facts — so cost stays constant regardless of conversation length.

Claude Opus 4 (claude-opus-4-6)
$15/1M input tokens · live API · usage.input_tokens
5.7× fewer tokens
[Chart: input tokens per turn, T1–T15. Naive (full history) climbs to 2,460; NocturnusAI (retrieved facts) holds around 331.]
Gemini 2.0 Flash (gemini-2.0-flash)
$0.10/1M input tokens · live API · count_tokens API
10.0× fewer tokens
[Chart: input tokens per turn, T1–T15. Naive (full history) climbs to 7,084; NocturnusAI (retrieved facts) holds around 330.]

Gemini's growth is steeper because its naive pass concatenates the full conversation into a single string, so the compression ratio improves as the conversation gets longer.
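The shape of both curves follows from a toy token model (the message size m below is an illustrative constant, not the measured data): under full-history replay, turn t resends t user messages plus t−1 assistant replies, so per-turn input grows linearly and total spend grows quadratically, while a facts-only prompt does not depend on t at all.

```python
def naive_input(turn: int, m: int = 85) -> int:
    """Full-history replay: turn t sends (2t - 1) messages of ~m tokens each."""
    return (2 * turn - 1) * m

def flat_input(num_facts: int = 3, f: int = 70) -> int:
    """Facts-only prompt: size depends on the fact count, not the turn number."""
    return num_facts * f

per_turn = [naive_input(t) for t in range(1, 16)]
print(per_turn[0], per_turn[-1])   # linear per-turn growth: 85 ... 2465
print(sum(per_turn))               # total = m * 15^2, i.e. quadratic in length
print(flat_input())                # one constant value, every turn
```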


Claude Opus 4
5.7× fewer input tokens per turn
221 avg with NocturnusAI vs 1,259 naive · 82% reduction
Gemini 2.0 Flash
10.0× fewer input tokens per turn
216 avg with NocturnusAI vs 2,171 naive · 90% reduction
Scenario
Product support, 15 turns: API timeout → connection pool exhaustion → root-cause diagnosis. Facts extracted per turn, queried individually via POST /query.
v0.3.10 · localhost:9300 · April 2026

Raw Numbers

Input tokens per turn — all 15 turns.

Turn Claude Naive Claude Nocturnus Ratio Gemini Naive Gemini Nocturnus Ratio
T1 53 78 0.7× 48 72 0.7×
T2 225 91 2.5× 132 89 1.5×
T3 396 108 3.7× 244 105 2.3×
T4 567 128 4.4× 322 126 2.6×
T5 740 157 4.7× 450 151 3.0×
T6 912 180 5.1× 574 173 3.3×
T7 1,091 224 4.9× 717 217 3.3×
T8 1,266 244 5.2× 1,168 236 4.9×
T9 1,435 251 5.7× 1,726 246 7.0×
T10 1,607 272 5.9× 2,464 269 9.2×
T11 1,776 285 6.2× 3,321 277 12.0×
T12 1,949 309 6.3× 4,004 305 13.1×
T13 2,119 318 6.7× 4,823 319 15.1×
T14 2,290 332 6.9× 5,482 331 16.6×
T15 2,460 331 7.4× 7,084 330 21.5×
Avg 1,259 221 5.7× 2,171 216 10.0×

Claude Naive T1 is low (53 tokens) because turn 1 has no history yet. The ratio compounds from T2 onwards.
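The averages and ratios in the last row can be re-derived from the per-turn columns above; a quick sanity check in Python:

```python
# Per-turn input tokens, copied from the table above.
claude_naive = [53, 225, 396, 567, 740, 912, 1091, 1266, 1435, 1607, 1776, 1949, 2119, 2290, 2460]
claude_noct  = [78, 91, 108, 128, 157, 180, 224, 244, 251, 272, 285, 309, 318, 332, 331]
gemini_naive = [48, 132, 244, 322, 450, 574, 717, 1168, 1726, 2464, 3321, 4004, 4823, 5482, 7084]
gemini_noct  = [72, 89, 105, 126, 151, 173, 217, 236, 246, 269, 277, 305, 319, 331, 330]

def avg(xs):
    return sum(xs) / len(xs)

print(round(avg(claude_naive)), round(avg(claude_noct)))  # 1259 221
print(round(avg(claude_naive) / avg(claude_noct), 1))     # 5.7
print(round(avg(gemini_naive)), round(avg(gemini_noct)))  # 2171 216
print(round(avg(gemini_naive) / avg(gemini_noct), 1))     # 10.0
```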


Methodology

Reproducible. Open. Skeptics welcome.

Naive Approach
  • Conversation history accumulated as a message list
  • Every turn sends the complete history to the model
  • Token count from API usage.input_tokens
  • 0.35s sleep between calls to respect rate limits
NocturnusAI Approach
  • Key facts extracted per turn (predicate + value pairs)
  • Stored via POST /assert/fact with tenant isolation
  • Each turn retrieves facts via POST /query per predicate
  • Only retrieved facts sent as context — no history replay
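The two passes differ only in how each turn's prompt is assembled. A minimal sketch of that difference (the word-count tokenizer and in-memory dict are stand-ins; real runs call POST /assert/fact and POST /query and read usage.input_tokens from the API response):

```python
def count_tokens(text: str) -> int:
    # Stub tokenizer; real runs read usage.input_tokens from the API response.
    return len(text.split())

# Naive pass: every turn replays the complete message history.
history, naive_tokens = [], []
for user_msg in ["my api times out"] * 5:
    history.append(user_msg)
    prompt = "\n".join(history)           # full conversation so far
    naive_tokens.append(count_tokens(prompt))
    history.append("assistant reply")     # reply joins the history for next turn

# NocturnusAI pass: facts stored once, only retrievals sent as context.
facts = {"symptom": "api timeout",        # stands in for POST /assert/fact
         "root_cause": "pool exhaustion"}
noct_tokens = []
for _ in range(5):
    retrieved = [f"{p}: {v}" for p, v in facts.items()]  # stands in for POST /query
    noct_tokens.append(count_tokens("\n".join(retrieved)))

print(naive_tokens)  # grows every turn
print(noct_tokens)   # constant
```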

What we measured (and what we didn't)

Measured
  • Input tokens only (from API response)
  • 15 turns × 4 passes (Claude naive, Claude NocturnusAI, Gemini naive, Gemini NocturnusAI)
  • Fresh tenant per run (timestamp-suffixed)
Not measured here
  • Output tokens (identical for both approaches)
  • Response quality or accuracy
  • NocturnusAI latency overhead (~2ms per turn)
See the full derivation on our Calculations page — every headline number traced back to the exact script line or notebook cell that produces it.
View source on GitHub

nocturnusai-bench/run_benchmark.py — single file, stdlib + anthropic + google-genai


Cost Projection

50,000 turns/month. Real prices.

Model Price Naive avg Naive cost/mo Nocturnus avg Nocturnus cost/mo Savings
Claude Opus 4 $15/1M 1,259 tok $944 221 tok $166 82% off
Gemini 2.0 Flash $0.10/1M 2,171 tok $10.86 216 tok $1.08 90% off

At 50,000 turns/month. The per-turn compression ratio (5.7× Claude, 10.0× Gemini) holds regardless of volume, so dollar savings scale linearly with traffic.
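The monthly figures reduce to one line of arithmetic: turns × average input tokens × price per million tokens.

```python
def monthly_cost(turns: int, avg_tokens: int, price_per_1m: float) -> float:
    """Monthly input spend given a per-turn average and a $/1M-token price."""
    return turns * avg_tokens * price_per_1m / 1_000_000

TURNS = 50_000
print(monthly_cost(TURNS, 1259, 15.00))  # Claude naive     -> 944.25
print(monthly_cost(TURNS, 221, 15.00))   # Claude Nocturnus -> 165.75
print(monthly_cost(TURNS, 2171, 0.10))   # Gemini naive     -> ~10.86
print(monthly_cost(TURNS, 216, 0.10))    # Gemini Nocturnus -> ~1.08
```

The percentage figures follow the same way: 1 − 221/1259 ≈ 82% and 1 − 216/2171 ≈ 90%.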

Run it yourself.

OPEN SOURCE · SINGLE SCRIPT · BRING YOUR OWN API KEYS

3 steps
# 1. Start NocturnusAI
docker run -p 9300:9300 ghcr.io/auctalis/nocturnusai:latest

# 2. Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...

# 3. Run
git clone https://github.com/Auctalis/nocturnusai
cd nocturnusai/nocturnusai-bench
uv run run_benchmark.py