Every claim, derived.
Every number on this site comes from reproducible code — either a live API call counted
by usage.input_tokens, or a parametric model with editable parameters in a
Jupyter notebook. This page traces each headline figure back to the exact script line or
notebook cell that produces it.
The two scenarios
Two independent analyses produce the numbers on this site. They use different scales, different pricing, and answer different questions. Confusing them is the most common source of "the math doesn't add up" complaints — so we separate them clearly.
Scenario A — parametric cost model
nocturnusai-bench/notebooks/01_cost_model.ipynb
A mathematical model with editable parameters. Uses GPT-4 pricing ($30/1M input tokens), 50,000 turns/month, and a configurable conversation length. Useful for exploring how costs scale under different workload assumptions.
No live API calls — all arithmetic. The ratio depends entirely on the parameters you set. Change them in the notebook and the output changes. We ship this notebook for workload modeling, not for headline numbers.
Scenario B — live benchmark
nocturnusai-bench/run_benchmark.py and nocturnusai-bench/notebooks/02_live_benchmark.ipynb
Actual API calls to Claude Opus 4 and Gemini 2.0 Flash. A 15-turn product
support conversation. Token counts taken directly from usage.input_tokens
(Claude) and count_tokens (Gemini). These are the numbers on the
homepage and benchmark page.
The site's $13,600 → $2,400 claim uses this scenario scaled to 1,000 req/hr (720,000 turns/month) with Claude Opus 4 at $15/1M input tokens.
Live benchmark derivation
What the benchmark measures
A 15-turn product support conversation where a user reports API timeouts and the agent
diagnoses the issue step by step. The turns are hard-coded in
run_benchmark.py lines 15–46 (TURNS list). Each turn carries
a user message and a list of facts to extract.
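For illustration, an entry in that list might look like the following sketch. This is an assumption about shape only: the actual field names and turn text in run_benchmark.py may differ.

```python
# Hypothetical sketch of the TURNS list shape described above.
TURNS = [
    {
        "user": "Our API calls started timing out about an hour ago.",
        "facts": ["api_timeouts_started", "onset_one_hour_ago"],
    },
    {
        "user": "We're on the Pro plan in region us-east-1.",
        "facts": ["plan_pro", "region_us_east_1"],
    },
    # ... 13 more turns in the real script (lines 15-46)
]

# Each turn carries a user message plus the facts the agent should extract.
for turn in TURNS:
    assert isinstance(turn["user"], str)
    assert isinstance(turn["facts"], list)
```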
The same conversation runs four times: Claude naive, Claude + NocturnusAI, Gemini naive, Gemini + NocturnusAI. Token counts are collected per turn and summed.
Per-turn token arrays
The per-turn token arrays for the reference run are stored in
results/benchmark_results.json and hardcoded in
site/src/pages/benchmark.astro lines 10–13.
Ratio computation
The benchmark page uses sum/sum ratio rather than avg/avg to avoid
rounding errors when individual turn averages are rounded to integers.
See benchmark.astro lines 21–26.
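The sum/sum computation can be reproduced from the reference-run totals listed in the claim-to-source table further down, and the integer-rounding drift it avoids is visible in a few lines:

```python
# Reference-run totals over the 15-turn conversation,
# from results/benchmark_results.json.
claude_naive_sum, claude_noct_sum = 18_886, 3_308
gemini_naive_sum, gemini_noct_sum = 32_559, 3_246

# sum/sum divides the totals directly, with no intermediate rounding.
claude_ratio = claude_naive_sum / claude_noct_sum   # ~5.71
gemini_ratio = gemini_naive_sum / gemini_noct_sum   # ~10.03

# avg/avg on integer-rounded per-turn averages drifts slightly:
naive_avg = round(claude_naive_sum / 15)            # 1259
noct_avg = round(claude_noct_sum / 15)              # 221
avg_ratio = naive_avg / noct_avg                    # ~5.70, not 5.71

print(f"Claude {claude_ratio:.1f}x, Gemini {gemini_ratio:.1f}x")
```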
Headline token numbers (1,259 and 221)
The homepage and blog posts cite Claude Opus 4 averages of 1,259 naive
and 221 NocturnusAI. These come from a separate reference run of
run_benchmark.py where the summary output prints:
Claude Opus 4 ($15/1M)
Naive avg: 1,259 tok/turn
NocturnusAI avg: 221 tok/turn
Reduction: 5.7× (82% savings)
That run's per-turn naive average of 1,259 is consistent with the reference run stored
in benchmark_results.json, whose total of 18,886 naive tokens over 15 turns also
averages ≈1,259/turn. Both runs use the same script; natural variation between API
calls (token counting can differ slightly by model version) means the two runs need
not agree exactly. The benchmark page's charts use the reference run arrays; the
homepage uses the 1,259/221 pair from the summary run.
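As a quick check, the headline ratio and savings figures follow directly from that 1,259/221 pair:

```python
naive, noct = 1_259, 221   # Claude Opus 4 per-turn averages (summary run)

reduction = naive / noct               # 5.696... -> reported as "5.7x"
savings = (naive - noct) / naive * 100 # 82.44... -> reported as "82%"

print(f"{reduction:.1f}x fewer tokens, {savings:.0f}% savings")
```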
Cost projection
All cost projections on the site use the following scale: 1,000 requests/hour running 720 hours/month = 720,000 turns/month. This is a single always-on API endpoint at 1k req/hr — a realistic SaaS workload.
Source: index.astro line 372 contains the full arithmetic inline as a
tooltip/footnote. Gemini's cost is small in absolute terms because $0.10/1M input
tokens is 150× cheaper than Claude Opus 4 at $15/1M.
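The inline arithmetic can be reproduced in a few lines, using the prices and per-turn averages quoted above:

```python
TURNS_PER_MONTH = 1_000 * 720   # 1,000 req/hr x 720 hr/month = 720,000 turns

def monthly_cost(tokens_per_turn: int, price_per_1m: float) -> float:
    """Monthly input-token cost in dollars."""
    return TURNS_PER_MONTH * tokens_per_turn * price_per_1m / 1_000_000

claude_naive = monthly_cost(1_259, 15.0)   # ~$13,597 -> headline "$13,600"
claude_noct = monthly_cost(221, 15.0)      # ~$2,387  -> headline "$2,400"
gemini_naive = monthly_cost(2_171, 0.10)   # ~$156: small in absolute terms
```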
Parametric model (01_cost_model.ipynb)
The parametric notebook is a standalone mathematical model — no API keys required.
It lets you plug in your own workload parameters and see cost projections across six
models. Key parameters (notebook cell params):
| Parameter | Default value | What it controls |
|---|---|---|
| TURNS_PER_MONTH | 50,000 | Total agent turns across all users per month |
| AVG_TURNS_PER_CONV | 20 | Average conversation length in turns |
| TOKENS_PER_TURN_ADDED | 360 | Tokens each turn adds to the growing naive context |
| SYSTEM_PROMPT_TOKENS | 800 | Static system prompt size (same for both approaches) |
| NOCTURNUS_AVG_CONTEXT | 960 (800 + 160) | System prompt + retrieved facts; held flat regardless of turn count |
With these defaults the notebook computes:
- Naive avg context = 800 + 360 × (20 + 1) / 2 = 4,580 tokens
- NocturnusAI avg context = 960 tokens
- GPT-4 naive: 50,000 × 4,580 × $30 / 1,000,000 = $6,870/mo
- GPT-4 noct: 50,000 × 960 × $30 / 1,000,000 = $1,440/mo
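The same arithmetic in short runnable form, mirroring the notebook's defaults from the parameter table (the linear-growth average follows from each turn adding TOKENS_PER_TURN_ADDED to the context):

```python
TURNS_PER_MONTH = 50_000
AVG_TURNS_PER_CONV = 20
TOKENS_PER_TURN_ADDED = 360
SYSTEM_PROMPT_TOKENS = 800
NOCTURNUS_AVG_CONTEXT = 960        # 800 prompt + 160 retrieved facts
GPT4_PRICE_PER_1M = 30.0

# The naive context grows linearly with turn number, so the average
# context over an N-turn conversation is prompt + growth * (N + 1) / 2.
naive_avg_ctx = (SYSTEM_PROMPT_TOKENS
                 + TOKENS_PER_TURN_ADDED * (AVG_TURNS_PER_CONV + 1) / 2)

def monthly_cost(avg_context: float) -> float:
    return TURNS_PER_MONTH * avg_context * GPT4_PRICE_PER_1M / 1_000_000

print(naive_avg_ctx)                         # 4580.0 tokens
print(monthly_cost(naive_avg_ctx))           # 6870.0 -> $6,870/mo
print(monthly_cost(NOCTURNUS_AVG_CONTEXT))   # 1440.0 -> $1,440/mo
```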
The notebook is a workload-modeling tool: change the parameters, see how cost scales. It does not produce site claims — every headline number on this site comes from Scenario B, the live benchmark.
Claim-to-source table
Every claim on the site, the file that displays it, and the source that produces the number.
| Claim | Displayed in | Source | Arithmetic |
|---|---|---|---|
| 221 tokens/turn (NocturnusAI) | index.astro | run_benchmark.py summary output | avg of claude_nocturnus() turn values |
| 1,259 tokens/turn (naive) | index.astro | run_benchmark.py summary output | avg of claude_naive() turn values |
| 5.7× fewer tokens (Claude) | index.astro | run_benchmark.py summary | 1,259 ÷ 221 = 5.70 |
| 82% reduction (Claude) | index.astro, benchmark page | run_benchmark.py summary | (1,259 − 221) ÷ 1,259 = 82.4% |
| 10.0× fewer tokens (Gemini) | index.astro | run_benchmark.py summary | 2,171 ÷ 216 = 10.05 |
| 90% reduction (Gemini) | index.astro, benchmark page | run_benchmark.py summary | (2,171 − 216) ÷ 2,171 = 90.1% |
| $13,600/mo | index.astro | run_benchmark.py + index.astro line 372 | 720,000 × 1,259 × $15 ÷ 1,000,000 ≈ $13,597 |
| $2,400/mo | index.astro | run_benchmark.py + index.astro line 372 | 720,000 × 221 × $15 ÷ 1,000,000 = $2,387 |
| 5.7× (benchmark page, Claude) | benchmark.astro | results/benchmark_results.json | 18,886 ÷ 3,308 (sum/sum, this ref run) |
| 10.0× (benchmark page, Gemini) | benchmark.astro | results/benchmark_results.json | 32,559 ÷ 3,246 (sum/sum, this ref run) |
| 82–90% reduction range | Homepage meta description, index.astro | run_benchmark.py | Claude 82%, Gemini 90%; range covers both models |
| ~2ms latency overhead | benchmark.astro | Manual timing of POST /assert/fact + POST /query | Not formally benchmarked; observed during development |
Reproduce these numbers
Live benchmark (run_benchmark.py)
# 1. Start NocturnusAI
docker run -p 9300:9300 ghcr.io/auctalis/nocturnusai:latest
# 2. Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...
# 3. Clone and run
git clone https://github.com/Auctalis/nocturnusai
cd nocturnusai/nocturnusai-bench
uv run run_benchmark.py
Output: per-turn token arrays printed to stdout, summary ratios printed at the end,
PNG chart saved to results/04_live_token_usage.png, and
results/benchmark_results.json with the raw arrays.
Live benchmark notebook (02_live_benchmark.ipynb)
cd nocturnusai/nocturnusai-bench
pip install -r requirements.txt
jupyter notebook notebooks/02_live_benchmark.ipynb
Same benchmark as run_benchmark.py but in notebook form with inline charts.
Requires the same API keys and a running NocturnusAI instance.
Parametric model (01_cost_model.ipynb)
cd nocturnusai/nocturnusai-bench
pip install notebook numpy matplotlib pandas
jupyter notebook notebooks/01_cost_model.ipynb
No API keys needed. Edit TURNS_PER_MONTH, AVG_TURNS_PER_CONV,
or model prices in the params cell and re-run. All charts regenerate
automatically.