Every claim, derived.
Every number on this site comes from reproducible code — either a live API call counted
by usage.input_tokens, or a parametric model with editable parameters in a
Jupyter notebook. This page traces each headline figure back to the exact script line or
notebook cell that produces it.
The two scenarios
Two independent analyses produce the numbers on this site. They use different scales, different pricing, and answer different questions. Confusing them is the most common source of "the math doesn't add up" complaints — so we separate them clearly.
Scenario A — parametric cost model
nocturnusai-bench/notebooks/01_cost_model.ipynb
A mathematical model with editable parameters. Uses GPT-4 pricing ($30/1M input tokens), 50,000 turns/month, and a configurable conversation length. Useful for exploring how costs scale under different workload assumptions.
No live API calls — all arithmetic. The ratio depends entirely on the parameters you set. Change them in the notebook and the output changes. We ship this notebook for workload modeling, not for headline numbers.
Scenario B — live benchmark
nocturnusai-bench/run_benchmark.py and nocturnusai-bench/notebooks/02_live_benchmark.ipynb
Actual API calls to Claude Opus 4 and Gemini 2.0 Flash. A 15-turn product
support conversation. Token counts taken directly from usage.input_tokens
(Claude) and count_tokens (Gemini). These are the numbers on the
homepage and benchmark page.
The site's $13,600 → $2,400 claim uses this scenario scaled to 1,000 req/hr (720,000 turns/month) with Claude Opus 4 at $15/1M input tokens.
Live benchmark derivation
What the benchmark measures
A 15-turn product support conversation where a user reports API timeouts and the agent
diagnoses the issue step by step. The turns are hard-coded in
run_benchmark.py lines 15–46 (TURNS list). Each turn carries
a user message and a list of facts to extract.
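For illustration, an entry in that list might look like the following sketch. This is an assumption about shape only: the actual field names and turn text in run_benchmark.py may differ.

```python
# Hypothetical sketch of the TURNS list shape described above.
TURNS = [
    {
        "user": "Our API calls started timing out about an hour ago.",
        "facts": ["api_timeouts_started", "onset_one_hour_ago"],
    },
    {
        "user": "We're on the Pro plan in region us-east-1.",
        "facts": ["plan_pro", "region_us_east_1"],
    },
    # ... 13 more turns in the real script (lines 15-46)
]

# Each turn carries a user message plus the facts the agent should extract.
for turn in TURNS:
    assert isinstance(turn["user"], str)
    assert isinstance(turn["facts"], list)
```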
The same conversation runs four times: Claude naive, Claude + NocturnusAI, Gemini naive, Gemini + NocturnusAI. Token counts are collected per turn and summed.
Per-turn token arrays
The per-turn token arrays for the reference run are stored in
results/benchmark_results.json and hardcoded in
site/src/pages/benchmark.astro lines 10–13.
Ratio computation
The benchmark page uses sum/sum ratio rather than avg/avg to avoid
rounding errors when individual turn averages are rounded to integers.
See benchmark.astro lines 21–26.
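The sum/sum computation can be reproduced from the reference-run totals listed in the claim-to-source table further down, and the integer-rounding drift it avoids is visible in a few lines:

```python
# Reference-run totals over the 15-turn conversation,
# from results/benchmark_results.json.
claude_naive_sum, claude_noct_sum = 18_886, 3_308
gemini_naive_sum, gemini_noct_sum = 32_559, 3_246

# sum/sum divides the totals directly, with no intermediate rounding.
claude_ratio = claude_naive_sum / claude_noct_sum   # ~5.71
gemini_ratio = gemini_naive_sum / gemini_noct_sum   # ~10.03

# avg/avg on integer-rounded per-turn averages drifts slightly:
naive_avg = round(claude_naive_sum / 15)            # 1259
noct_avg = round(claude_noct_sum / 15)              # 221
avg_ratio = naive_avg / noct_avg                    # ~5.70, not 5.71

print(f"Claude {claude_ratio:.1f}x, Gemini {gemini_ratio:.1f}x")
```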
Headline token numbers (1,259 and 221)
The homepage and blog posts cite Claude Opus 4 averages of 1,259 naive
and 221 NocturnusAI. These come from a separate reference run of
run_benchmark.py where the summary output prints:
Claude Opus 4 ($15/1M)
Naive avg: 1,259 tok/turn
NocturnusAI avg: 221 tok/turn
Reduction: 5.7× (82% savings)
That run's per-turn naive average of 1,259 is consistent with the reference run stored
in benchmark_results.json, whose total of 18,886 naive tokens over 15 turns also
averages ≈1,259/turn. Both runs use the same script; natural variation between API
calls (token counting can differ slightly by model version) means the two runs need
not agree exactly. The benchmark page's charts use the reference run arrays; the
homepage uses the 1,259/221 pair from the summary run.
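As a quick check, the headline ratio and savings figures follow directly from that 1,259/221 pair:

```python
naive, noct = 1_259, 221   # Claude Opus 4 per-turn averages (summary run)

reduction = naive / noct               # 5.696... -> reported as "5.7x"
savings = (naive - noct) / naive * 100 # 82.44... -> reported as "82%"

print(f"{reduction:.1f}x fewer tokens, {savings:.0f}% savings")
```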
Cost projection
All cost projections on the site use the following scale: 1,000 requests/hour running 720 hours/month = 720,000 turns/month. This is a single always-on API endpoint at 1k req/hr — a realistic SaaS workload.
Source: index.astro line 372 contains the full arithmetic inline as a
tooltip/footnote. Gemini's cost is small in absolute terms because $0.10/1M input
tokens is 150× cheaper than Claude Opus 4 at $15/1M.
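The inline arithmetic can be reproduced in a few lines, using the prices and per-turn averages quoted above:

```python
TURNS_PER_MONTH = 1_000 * 720   # 1,000 req/hr x 720 hr/month = 720,000 turns

def monthly_cost(tokens_per_turn: int, price_per_1m: float) -> float:
    """Monthly input-token cost in dollars."""
    return TURNS_PER_MONTH * tokens_per_turn * price_per_1m / 1_000_000

claude_naive = monthly_cost(1_259, 15.0)   # ~$13,597 -> headline "$13,600"
claude_noct = monthly_cost(221, 15.0)      # ~$2,387  -> headline "$2,400"
gemini_naive = monthly_cost(2_171, 0.10)   # ~$156: small in absolute terms
```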
Parametric model (01_cost_model.ipynb)
The parametric notebook is a standalone mathematical model — no API keys required.
It lets you plug in your own workload parameters and see cost projections across six
models. Key parameters (notebook cell params):
| Parameter | Default value | What it controls |
|---|---|---|
| TURNS_PER_MONTH | 50,000 | Total agent turns across all users per month |
| AVG_TURNS_PER_CONV | 20 | Average conversation length in turns |
| TOKENS_PER_TURN_ADDED | 360 | Tokens each turn adds to the growing naive context |
| SYSTEM_PROMPT_TOKENS | 800 | Static system prompt size (same for both approaches) |
| NOCTURNUS_AVG_CONTEXT | 960 (800 + 160) | System prompt + retrieved facts; held flat regardless of turn count |
With these defaults the notebook computes:
- Naive avg context = 800 + 360 × (20 + 1) / 2 = 4,580 tokens
- NocturnusAI avg context = 960 tokens
- GPT-4 naive: 50,000 × 4,580 × $30 / 1,000,000 = $6,870/mo
- GPT-4 noct: 50,000 × 960 × $30 / 1,000,000 = $1,440/mo
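The same arithmetic in short runnable form, mirroring the notebook's defaults from the parameter table (the linear-growth average follows from each turn adding TOKENS_PER_TURN_ADDED to the context):

```python
TURNS_PER_MONTH = 50_000
AVG_TURNS_PER_CONV = 20
TOKENS_PER_TURN_ADDED = 360
SYSTEM_PROMPT_TOKENS = 800
NOCTURNUS_AVG_CONTEXT = 960        # 800 prompt + 160 retrieved facts
GPT4_PRICE_PER_1M = 30.0

# The naive context grows linearly with turn number, so the average
# context over an N-turn conversation is prompt + growth * (N + 1) / 2.
naive_avg_ctx = (SYSTEM_PROMPT_TOKENS
                 + TOKENS_PER_TURN_ADDED * (AVG_TURNS_PER_CONV + 1) / 2)

def monthly_cost(avg_context: float) -> float:
    return TURNS_PER_MONTH * avg_context * GPT4_PRICE_PER_1M / 1_000_000

print(naive_avg_ctx)                         # 4580.0 tokens
print(monthly_cost(naive_avg_ctx))           # 6870.0 -> $6,870/mo
print(monthly_cost(NOCTURNUS_AVG_CONTEXT))   # 1440.0 -> $1,440/mo
```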
The notebook is a workload-modeling tool: change the parameters, see how cost scales. It does not produce site claims — every headline number on this site comes from Scenario B, the live benchmark.
Claim-to-source table
Every claim on the site, the file that displays it, and the source that produces the number.
| Claim | Displayed in | Source | Arithmetic |
|---|---|---|---|
| 221 tokens/turn (NocturnusAI) | index.astro | run_benchmark.py summary output | avg of claude_nocturnus() turn values |
| 1,259 tokens/turn (naive) | index.astro | run_benchmark.py summary output | avg of claude_naive() turn values |
| 5.7× fewer tokens (Claude) | index.astro | run_benchmark.py summary | 1,259 ÷ 221 = 5.70 |
| 82% reduction (Claude) | index.astro, benchmark page | run_benchmark.py summary | (1,259 − 221) ÷ 1,259 = 82.4% |
| 10.0× fewer tokens (Gemini) | index.astro | run_benchmark.py summary | 2,171 ÷ 216 = 10.05 |
| 90% reduction (Gemini) | index.astro, benchmark page | run_benchmark.py summary | (2,171 − 216) ÷ 2,171 = 90.1% |
| $13,600/mo | index.astro | run_benchmark.py + index.astro line 372 | 720,000 × 1,259 × $15 ÷ 1,000,000 ≈ $13,597 |
| $2,400/mo | index.astro | run_benchmark.py + index.astro line 372 | 720,000 × 221 × $15 ÷ 1,000,000 = $2,387 |
| 5.7× (benchmark page, Claude) | benchmark.astro | results/benchmark_results.json | 18,886 ÷ 3,308 (sum/sum, this ref run) |
| 10.0× (benchmark page, Gemini) | benchmark.astro | results/benchmark_results.json | 32,559 ÷ 3,246 (sum/sum, this ref run) |
| 82–90% reduction range | Homepage meta description, index.astro | run_benchmark.py | Claude 82%, Gemini 90%; range covers both models |
| ~2ms latency overhead | benchmark.astro | Manual timing of POST /assert/fact + POST /query | Not formally benchmarked; observed during development |
Reproduce these numbers
Live benchmark (run_benchmark.py)
# 1. Start NocturnusAI
docker run -p 9300:9300 ghcr.io/auctalis/nocturnusai:latest
# 2. Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...
# 3. Clone and run
git clone https://github.com/Auctalis/nocturnusai
cd nocturnusai/nocturnusai-bench
uv run run_benchmark.py
Output: per-turn token arrays printed to stdout, summary ratios printed at the end,
PNG chart saved to results/04_live_token_usage.png, and
results/benchmark_results.json with the raw arrays.
Live benchmark notebook (02_live_benchmark.ipynb)
cd nocturnusai/nocturnusai-bench
pip install -r requirements.txt
jupyter notebook notebooks/02_live_benchmark.ipynb
Same benchmark as run_benchmark.py but in notebook form with inline charts.
Requires the same API keys and a running NocturnusAI instance.
Parametric model (01_cost_model.ipynb)
cd nocturnusai/nocturnusai-bench
pip install notebook numpy matplotlib pandas
jupyter notebook notebooks/01_cost_model.ipynb
No API keys needed. Edit TURNS_PER_MONTH, AVG_TURNS_PER_CONV,
or model prices in the params cell and re-run. All charts regenerate
automatically.