Transparency

Every claim, derived.

Every number on this site comes from reproducible code — either a live API call counted by usage.input_tokens, or a parametric model with editable parameters in a Jupyter notebook. This page traces each headline figure back to the exact script line or notebook cell that produces it.


The two scenarios

Two independent analyses produce the numbers on this site. They run at different scales, use different pricing, and answer different questions. Confusing them is the most common source of "the math doesn't add up" complaints, so we separate them clearly.

Scenario A — Parametric model
Enterprise at scale
nocturnusai-bench/notebooks/01_cost_model.ipynb

A mathematical model with editable parameters. Uses GPT-4 pricing ($30/1M), 50,000 turns/month, and configurable conversation length. Useful for exploring how costs scale under different workload assumptions.

No live API calls — all arithmetic. The ratio depends entirely on the parameters you set. Change them in the notebook and the output changes. We ship this notebook for workload modeling, not for headline numbers.

Scenario B — Live benchmark
Measured API calls
nocturnusai-bench/run_benchmark.py
nocturnusai-bench/notebooks/02_live_benchmark.ipynb

Actual API calls to Claude Opus 4 and Gemini 2.0 Flash. A 15-turn product support conversation. Token counts taken directly from usage.input_tokens (Claude) and count_tokens (Gemini). These are the numbers on the homepage and benchmark page.

The site's $13,600 → $2,400 claim uses this scenario scaled to 1,000 req/hr (720,000 turns/month) with Claude Opus 4 at $15/1M input tokens.

The benchmark page and homepage use Scenario B numbers. The parametric notebook (Scenario A) remains available for exploring different workload parameters.

Live benchmark derivation

What the benchmark measures

A 15-turn product support conversation where a user reports API timeouts and the agent diagnoses the issue step by step. The turns are hard-coded in run_benchmark.py lines 15–46 (TURNS list). Each turn carries a user message and a list of facts to extract.

The same conversation runs four times: Claude naive, Claude + NocturnusAI, Gemini naive, Gemini + NocturnusAI. Token counts are collected per turn and summed.
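A minimal sketch of the naive arm of this loop (illustrative only: the `count_tokens` stub stands in for the provider's real counter, and run_benchmark.py's actual code differs):

```python
# Illustrative sketch of the naive arm: each request resends the full
# history, so the provider-reported input token count grows every turn.
# count_tokens is a stub; the real script reads usage.input_tokens from
# the API response instead.
def count_tokens(messages):
    # Rough stand-in heuristic: ~1 token per 4 characters.
    return sum(len(m["content"]) for m in messages) // 4

def run_naive(turns):
    history, per_turn = [], []
    for user_msg in turns:
        history.append({"role": "user", "content": user_msg})
        per_turn.append(count_tokens(history))   # real: response.usage.input_tokens
        history.append({"role": "assistant", "content": "diagnostic reply " * 10})
    return per_turn

counts = run_naive(["my API calls keep timing out"] * 15)
assert counts == sorted(counts)   # context, and therefore cost, grows every turn
```

The NocturnusAI arm differs only in that each request carries the flat system prompt plus retrieved facts instead of the growing history.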

Per-turn token arrays

The arrays below are the reference run values, stored in results/benchmark_results.json and hardcoded in site/src/pages/benchmark.astro lines 10–13.

Claude Naive — input_tokens per turn
[53, 225, 396, 567, 740, 912, 1091, 1266, 1435, 1607, 1776, 1949, 2119, 2290, 2460]
Sum = 18,886
Claude + NocturnusAI — input_tokens per turn
[78, 91, 108, 128, 157, 180, 224, 244, 251, 272, 285, 309, 318, 332, 331]
Sum = 3,308
Gemini Naive — input_tokens per turn
[48, 132, 244, 322, 450, 574, 717, 1168, 1726, 2464, 3321, 4004, 4823, 5482, 7084]
Sum = 32,559
Gemini + NocturnusAI — input_tokens per turn
[72, 89, 105, 126, 151, 173, 217, 236, 246, 269, 277, 305, 319, 331, 330]
Sum = 3,246

Ratio computation

The benchmark page uses sum/sum ratio rather than avg/avg to avoid rounding errors when individual turn averages are rounded to integers. See benchmark.astro lines 21–26.

Claude ratio: 18,886 ÷ 3,308 = 5.7×
Claude reduction: 1 − (3,308 ÷ 18,886) = 82%
Claude avg naive: 18,886 ÷ 15 = 1,259 tok/turn
Claude avg NocturnusAI: 3,308 ÷ 15 = 221 tok/turn
Gemini ratio: 32,559 ÷ 3,246 = 10.0×
Gemini reduction: 1 − (3,246 ÷ 32,559) = 90%
Gemini avg naive: 32,559 ÷ 15 = 2,171 tok/turn
Gemini avg NocturnusAI: 3,246 ÷ 15 = 216 tok/turn
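The sums and ratios can be checked directly from the per-turn arrays above:

```python
# Recompute the page's headline ratios from the reference-run arrays.
claude_naive = [53, 225, 396, 567, 740, 912, 1091, 1266, 1435, 1607, 1776, 1949, 2119, 2290, 2460]
claude_noct  = [78, 91, 108, 128, 157, 180, 224, 244, 251, 272, 285, 309, 318, 332, 331]
gemini_naive = [48, 132, 244, 322, 450, 574, 717, 1168, 1726, 2464, 3321, 4004, 4823, 5482, 7084]
gemini_noct  = [72, 89, 105, 126, 151, 173, 217, 236, 246, 269, 277, 305, 319, 331, 330]

assert sum(claude_naive) == 18_886 and sum(claude_noct) == 3_308
assert sum(gemini_naive) == 32_559 and sum(gemini_noct) == 3_246

print(f"Claude: {sum(claude_naive) / sum(claude_noct):.1f}x, "
      f"{1 - sum(claude_noct) / sum(claude_naive):.0%} reduction")
print(f"Gemini: {sum(gemini_naive) / sum(gemini_noct):.1f}x, "
      f"{1 - sum(gemini_noct) / sum(gemini_naive):.0%} reduction")
# Claude: 5.7x, 82% reduction
# Gemini: 10.0x, 90% reduction
```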

Headline token numbers (1,259 and 221)

The homepage and blog posts cite Claude Opus 4 averages of 1,259 naive and 221 NocturnusAI. These come from a separate summary run of run_benchmark.py whose summary output prints:

Claude Opus 4 ($15/1M)
  Naive avg:        1,259 tok/turn
  NocturnusAI avg:    221 tok/turn
  Reduction:          5.7×  (82% savings)

That run's per-turn naive average of 1,259 matches the reference run's average stored in benchmark_results.json (18,886 ÷ 15 ≈ 1,259). Both runs use the same script; natural variation between API calls (token counting can differ slightly by model version) can produce small differences between runs. The benchmark page's charts use the reference run arrays above; the homepage uses the 1,259/221 pair from the summary run.

Cost projection

All cost projections on the site use the following scale: 1,000 requests/hour running 720 hours/month = 720,000 turns/month. This is a single always-on API endpoint at 1k req/hr — a realistic SaaS workload.

Turns/month: 1,000 req/hr × 720 hr/mo = 720,000
Claude naive cost: 720,000 × 1,259 × $15 ÷ 1,000,000 = $13,597 ≈ $13,600/mo
Claude NocturnusAI cost: 720,000 × 221 × $15 ÷ 1,000,000 = $2,387 ≈ $2,400/mo
Gemini naive cost: 720,000 × 2,171 × $0.10 ÷ 1,000,000 = $156.31/mo
Gemini NocturnusAI cost: 720,000 × 216 × $0.10 ÷ 1,000,000 = $15.55/mo
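The same projection arithmetic as a small function:

```python
# Monthly input-token cost at a fixed average context size per turn.
def monthly_cost(turns_per_month, avg_tokens_per_turn, price_per_million_usd):
    return turns_per_month * avg_tokens_per_turn * price_per_million_usd / 1_000_000

TURNS = 1_000 * 720  # 1,000 req/hr x 720 hr/mo = 720,000 turns/month

print(round(monthly_cost(TURNS, 1_259, 15.00), 2))  # 13597.2 -> ~$13,600/mo Claude naive
print(round(monthly_cost(TURNS, 221, 15.00), 2))    # 2386.8  -> ~$2,400/mo Claude + NocturnusAI
print(round(monthly_cost(TURNS, 2_171, 0.10), 2))   # 156.31  -> Gemini naive
print(round(monthly_cost(TURNS, 216, 0.10), 2))     # 15.55   -> Gemini + NocturnusAI
```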

Source: index.astro line 372 contains the full arithmetic inline as a tooltip/footnote. Gemini's cost is small in absolute terms because $0.10/1M input tokens is 150× cheaper than Claude Opus 4 at $15/1M.


Parametric model (01_cost_model.ipynb)

The parametric notebook is a standalone mathematical model — no API keys required. It lets you plug in your own workload parameters and see cost projections across six models. Key parameters (notebook cell params):

Parameter | Default value | What it controls
TURNS_PER_MONTH | 50,000 | Total agent turns across all users per month
AVG_TURNS_PER_CONV | 20 | Average conversation length in turns
TOKENS_PER_TURN_ADDED | 360 | Tokens each turn adds to the growing naive context
SYSTEM_PROMPT_TOKENS | 800 | Static system prompt size (same for both approaches)
NOCTURNUS_AVG_CONTEXT | 960 (800 + 160) | System prompt + retrieved facts; held flat regardless of turn count

With these defaults the notebook computes its own cost curves, which change whenever the parameters do.

The notebook is a workload-modeling tool: change the parameters, see how cost scales. It does not produce site claims. Every number on this site comes from Scenario B — the live benchmark.

The live benchmark is a harder test than any parametric model: real API calls, a real conversation, real token counting. We ship the notebook for workload exploration — not for marketing numbers.
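A sketch of how such a model can be computed from the parameters above. The average-context formula here is an assumption for illustration; the notebook's exact formula may differ:

```python
# Illustrative parametric cost model using the parameter names from the
# table above. The growth formula is an assumption, not the notebook's code.
TURNS_PER_MONTH = 50_000
AVG_TURNS_PER_CONV = 20
TOKENS_PER_TURN_ADDED = 360
SYSTEM_PROMPT_TOKENS = 800
NOCTURNUS_AVG_CONTEXT = 960   # 800 system prompt + 160 retrieved facts
PRICE_PER_MILLION = 30.00     # GPT-4 input pricing used by Scenario A

# Naive: turn i carries the system prompt plus i turns of accumulated
# history, so the average context over a conversation is (assumed):
avg_naive_context = SYSTEM_PROMPT_TOKENS + TOKENS_PER_TURN_ADDED * (AVG_TURNS_PER_CONV + 1) / 2

naive_cost = TURNS_PER_MONTH * avg_naive_context * PRICE_PER_MILLION / 1_000_000
noct_cost = TURNS_PER_MONTH * NOCTURNUS_AVG_CONTEXT * PRICE_PER_MILLION / 1_000_000
print(round(naive_cost, 2), round(noct_cost, 2), round(naive_cost / noct_cost, 1))
```

Change any parameter and the ratio moves with it, which is exactly why Scenario A is a modeling tool rather than a source of headline claims.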

Claim-to-source table

Every claim on the site, the file that displays it, and the source that produces the number.

Claim | Displayed in | Source | Arithmetic
221 tokens/turn (NocturnusAI) | index.astro | run_benchmark.py summary output | avg of claude_nocturnus() turn values
1,259 tokens/turn (naive) | index.astro | run_benchmark.py summary output | avg of claude_naive() turn values
5.7× fewer tokens (Claude) | index.astro | run_benchmark.py summary | 1,259 ÷ 221 = 5.70
82% reduction (Claude) | index.astro, benchmark page | run_benchmark.py summary | (1,259 − 221) ÷ 1,259 = 82.4%
10.0× fewer tokens (Gemini) | index.astro | run_benchmark.py summary | 2,171 ÷ 216 = 10.05
90% reduction (Gemini) | index.astro, benchmark page | run_benchmark.py summary | (2,171 − 216) ÷ 2,171 = 90.1%
$13,600/mo | index.astro | run_benchmark.py + index.astro line 372 | 720,000 × 1,259 × $15 ÷ 1,000,000 = $13,597
$2,400/mo | index.astro | run_benchmark.py + index.astro line 372 | 720,000 × 221 × $15 ÷ 1,000,000 = $2,387
5.7× (benchmark page, Claude) | benchmark.astro | results/benchmark_results.json | 18,886 ÷ 3,308 (sum/sum, reference run)
10.0× (benchmark page, Gemini) | benchmark.astro | results/benchmark_results.json | 32,559 ÷ 3,246 (sum/sum, reference run)
82–90% reduction range | Homepage meta description, index.astro | run_benchmark.py | Claude 82%, Gemini 90%; range covers both models
~2ms latency overhead | benchmark.astro | Manual timing of POST /assert/fact + POST /query | Not formally benchmarked; observed during development

Reproduce these numbers

Live benchmark (run_benchmark.py)

# 1. Start NocturnusAI
docker run -p 9300:9300 ghcr.io/auctalis/nocturnusai:latest

# 2. Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...

# 3. Clone and run
git clone https://github.com/Auctalis/nocturnusai
cd nocturnusai/nocturnusai-bench
uv run run_benchmark.py

Output: per-turn token arrays printed to stdout, summary ratios printed at the end, PNG chart saved to results/04_live_token_usage.png, and results/benchmark_results.json with the raw arrays.

Live benchmark notebook (02_live_benchmark.ipynb)

cd nocturnusai/nocturnusai-bench
pip install -r requirements.txt
jupyter notebook notebooks/02_live_benchmark.ipynb

Same benchmark as run_benchmark.py but in notebook form with inline charts. Requires the same API keys and a running NocturnusAI instance.

Parametric model (01_cost_model.ipynb)

cd nocturnusai/nocturnusai-bench
pip install notebook numpy matplotlib pandas
jupyter notebook notebooks/01_cost_model.ipynb

No API keys needed. Edit TURNS_PER_MONTH, AVG_TURNS_PER_CONV, or model prices in the params cell and re-run. All charts regenerate automatically.