Measured on real APIs.
Every number is live.
A 15-turn product-support conversation. Token counts taken directly from `usage.input_tokens`: not estimated, not modeled. Open source, reproducible, and auditable.
Context grows linearly. NocturnusAI stays flat.
Each naive turn replays the full conversation. NocturnusAI retrieves only stored facts — so cost stays constant regardless of conversation length.
Gemini's growth is steeper because its naive pass concatenates the full conversation into a single prompt string. The compression ratio improves with conversation length.
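The shape of the two curves can be sketched with a toy model. The numbers below are illustrative placeholders, not the benchmark's measurements:

```python
AVG_MSG_TOKENS = 85   # assumed average tokens per message (illustrative)
FACT_TOKENS = 220     # assumed flat fact-context size per turn (illustrative)

def naive_input_tokens(turn: int) -> int:
    """Naive replay: turn t resends all prior messages plus the new one,
    so per-turn input grows linearly with the turn number."""
    prior_messages = (turn - 1) * 2  # one user + one assistant message per past turn
    return prior_messages * AVG_MSG_TOKENS + AVG_MSG_TOKENS

def retrieval_input_tokens(turn: int) -> int:
    """Fact retrieval: context size is independent of conversation length."""
    return FACT_TOKENS
```

Summing the naive per-turn cost over n turns gives O(n²) total input tokens for the whole conversation, versus O(n) with retrieval.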
Input tokens per turn — all 15 turns.
| Turn | Claude Naive | Claude Nocturnus | Ratio | Gemini Naive | Gemini Nocturnus | Ratio |
|---|---|---|---|---|---|---|
| T1 | 53 | 78 | 0.7× | 48 | 72 | 0.7× |
| T2 | 225 | 91 | 2.5× | 132 | 89 | 1.5× |
| T3 | 396 | 108 | 3.7× | 244 | 105 | 2.3× |
| T4 | 567 | 128 | 4.4× | 322 | 126 | 2.6× |
| T5 | 740 | 157 | 4.7× | 450 | 151 | 3.0× |
| T6 | 912 | 180 | 5.1× | 574 | 173 | 3.3× |
| T7 | 1,091 | 224 | 4.9× | 717 | 217 | 3.3× |
| T8 | 1,266 | 244 | 5.2× | 1,168 | 236 | 4.9× |
| T9 | 1,435 | 251 | 5.7× | 1,726 | 246 | 7.0× |
| T10 | 1,607 | 272 | 5.9× | 2,464 | 269 | 9.2× |
| T11 | 1,776 | 285 | 6.2× | 3,321 | 277 | 12.0× |
| T12 | 1,949 | 309 | 6.3× | 4,004 | 305 | 13.1× |
| T13 | 2,119 | 318 | 6.7× | 4,823 | 319 | 15.1× |
| T14 | 2,290 | 332 | 6.9× | 5,482 | 331 | 16.6× |
| T15 | 2,460 | 331 | 7.4× | 7,084 | 330 | 21.5× |
| Avg | 1,259 | 221 | 5.7× | 2,171 | 216 | 10.0× |
Claude Naive T1 is low (53 tokens) because turn 1 has no history yet. The ratio compounds from T2 onwards.
Reproducible. Open. Skeptics welcome.
The naive pass:

- Conversation history accumulated as a message list
- Every turn sends the complete history to the model
- Token count taken from the API's `usage.input_tokens`
- 0.35s sleep between calls to respect rate limits
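The naive loop can be sketched like this. The `send` callable is a stand-in for the actual API call (for Claude, a thin wrapper around `client.messages.create` that returns the reply text and `usage.input_tokens`); it is not part of the real script:

```python
import time

history = []  # full conversation, replayed on every turn

def naive_turn(user_msg, send):
    """One naive turn: append the user message, send the COMPLETE history,
    record the reply, and return the input-token count the API reported.
    `send(messages)` must return (reply_text, input_tokens)."""
    history.append({"role": "user", "content": user_msg})
    reply, input_tokens = send(history)
    history.append({"role": "assistant", "content": reply})
    time.sleep(0.35)  # rate-limit courtesy, as in the benchmark
    return input_tokens
```

Because `history` only ever grows, the tokens reported for each turn grow with it, which is exactly the linear climb in the naive columns above.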
The NocturnusAI pass:

- Key facts extracted per turn (predicate + value pairs)
- Stored via `POST /assert/fact` with tenant isolation
- Each turn retrieves facts via `POST /query`, one call per predicate
- Only retrieved facts sent as context; no history replay
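The store/retrieve cycle can be sketched as below. The endpoints `POST /assert/fact` and `POST /query` come from the benchmark description; the JSON field names (`tenant`, `predicate`, `value`) are assumptions, not the documented schema:

```python
import json
from urllib import request

BASE = "http://localhost:9300"  # default port from the docker command

def post_json(path, payload):
    """POST a JSON payload to the NocturnusAI server and decode the reply."""
    req = request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def store_fact(tenant, predicate, value, post=post_json):
    # POST /assert/fact with tenant isolation (field names are guesses)
    return post("/assert/fact", {"tenant": tenant, "predicate": predicate, "value": value})

def retrieve_facts(tenant, predicates, post=post_json):
    # one POST /query per predicate, as the benchmark does
    return [post("/query", {"tenant": tenant, "predicate": p}) for p in predicates]
```

Only the facts returned by `retrieve_facts` go into the prompt, so the context sent to the model stays flat no matter how long the conversation runs.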
What we measured (and what we didn't)

Measured:

- Input tokens only (from the API response)
- 15 turns × 4 passes (Claude naive, Claude NocturnusAI, Gemini naive, Gemini NocturnusAI)
- Fresh tenant per run (timestamp-suffixed)

Not measured:

- Output tokens (identical for both approaches)
- Response quality or accuracy
- NocturnusAI latency overhead (~2ms per turn)
`nocturnusai-bench/run_benchmark.py`: a single file, stdlib plus `anthropic` and `google-genai`.
50,000 turns/month. Real prices.
| Model | Price | Naive avg | Naive cost/mo | Nocturnus avg | Nocturnus cost/mo | Savings |
|---|---|---|---|---|---|---|
| Claude Opus 4 | $15/1M | 1,259 tok | $944 | 221 tok | $166 | 82% off |
| Gemini 2.0 Flash | $0.10/1M | 2,171 tok | $10.86 | 216 tok | $1.08 | 90% off |
The compression ratio (5.7× for Claude, 10.0× for Gemini) is constant regardless of volume, so the percentage savings holds at any scale.
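The cost columns are simple arithmetic on the average input tokens per turn; you can reproduce them yourself:

```python
def monthly_cost(avg_input_tokens, price_per_million_tokens, turns=50_000):
    """Monthly input-token cost: tokens/turn x turns/month x $/token."""
    return avg_input_tokens * turns * price_per_million_tokens / 1_000_000

claude_naive = monthly_cost(1_259, 15.0)  # Claude Opus 4 at $15/1M input tokens
claude_noct = monthly_cost(221, 15.0)
print(f"${claude_naive:.0f} vs ${claude_noct:.0f}, "
      f"{1 - claude_noct / claude_naive:.0%} off")  # → $944 vs $166, 82% off
```

The same function with Gemini 2.0 Flash's numbers (2,171 and 216 average tokens at $0.10/1M) reproduces the $10.86 and $1.08 figures in the table.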
Run it yourself.
OPEN SOURCE · SINGLE SCRIPT · BRING YOUR OWN API KEYS
```sh
# 1. Start NocturnusAI
docker run -p 9300:9300 ghcr.io/auctalis/nocturnusai:latest

# 2. Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...

# 3. Run
git clone https://github.com/Auctalis/nocturnusai
cd nocturnusai/nocturnusai-bench
uv run run_benchmark.py
```