Cut the prompt bill.
Big threads in. Smaller prompts out. Everything else is backend.
Most of the waste is old context.
The model keeps getting charged to reread things it does not need.
Repeated history
Old messages and old summaries keep showing up long after they stopped mattering.
Tool chatter
Trace output, logs, notes, and retries are useful to systems, not to every next prompt.
Stale state
Resolved issues and dead branches still ride along unless something cuts them out.
Three calls. Lower spend.
Cut the prompt. Tighten the next step. Stop replaying stable state.
POST /context with sessionId
Send just the new turns. The server tracks priors, extracts facts under the conversation scope, and returns a briefingDelta — only what's new.
{
  "turns": [new messages only],
  "scope": "conversation-id",
  "sessionId": "conversation-id"
}
POST /memory/context with goals
Once the next step is clear, narrow the window to just the subset the model needs.
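A minimal sketch of the goal-scoped request body. The field names (goals, sessionId, format) follow the pseudocode later on this page; the exact schema is an assumption, and the goal string here is made up.

```python
import json

def memory_context_request(goals, session_id, fmt="natural"):
    """Build the JSON body for POST /memory/context (shape assumed)."""
    return {
        "goals": goals,           # only what the next step needs
        "sessionId": session_id,  # same id used as the conversation scope
        "format": fmt,            # "natural" returns formattedText
    }

body = memory_context_request(["resolve outage ticket"], "conv-123")
print(json.dumps(body))
```

The point is the narrowing: instead of replaying the whole thread, the next prompt is assembled from just the facts relevant to the stated goals.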
DELETE /scope/{id}
Conversation done? Delete the scope. Promote durable facts to the tenant layer first.
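The ordering matters: promote first, delete second. A sketch of that sequence, using the endpoint paths from this page; the request shapes and the example fact are assumptions.

```python
def cleanup_calls(conversation_id, durable_facts):
    """Return the ordered (method, path, body) calls for teardown (sketch)."""
    calls = []
    for fact in durable_facts:
        # Asserting with no scope promotes the fact to the tenant layer.
        calls.append(("POST", "/assert/fact", fact))
    # Only after promotion is the scoped state safe to drop.
    calls.append(("DELETE", f"/scope/{conversation_id}", None))
    return calls

calls = cleanup_calls(
    "conv-123",
    [{"predicate": "IsA", "args": ["Acme", "Enterprise"]}],
)
```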
Backend mechanics. Not the pitch.
This is the machinery that keeps the cut reliable.
Turn arrays become usable state
Raw text gets extracted, ranked, and deduplicated before it hits the model.
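Illustrative only: a toy version of the extract, rank, and dedupe pass. The engine's real pipeline is internal; this just shows why raw turns shrink before reaching the model.

```python
def reduce_turns(turns):
    """Drop verbatim repeats, then rank what's left (toy heuristic)."""
    seen, kept = set(), []
    for turn in turns:
        key = turn.strip().lower()
        if key in seen:  # duplicate retries and echoes get cut
            continue
        seen.add(key)
        kept.append(turn.strip())
    # Toy ranking: longer lines first; the real ranker is smarter.
    return sorted(kept, key=len, reverse=True)

out = reduce_turns([
    "Retry succeeded.",
    "retry succeeded.",
    "Customer Acme reports an outage.",
])
```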
State can decay, compress, and expire
Old context can decay, compress, and expire instead of growing forever.
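A sketch of the simplest form of this, time-based expiry, assuming each fact carries a timestamp. The actual decay and compression policy is internal to the engine; the idea is just that context ages out instead of growing forever.

```python
import time

def expire(facts, ttl_seconds, now=None):
    """Keep only facts younger than the TTL (illustrative policy)."""
    now = time.time() if now is None else now
    return [f for f in facts if now - f["t"] <= ttl_seconds]

facts = [{"claim": "fresh", "t": 100.0}, {"claim": "stale", "t": 0.0}]
live = expire(facts, ttl_seconds=60, now=110.0)
```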
Stale conclusions don't linger
Truth maintenance and temporal state keep the reducer from drifting.
Three layers. One identifier.
Use the conversation id as both scope and sessionId. Nocturnus handles the rest.
Durable customer knowledge
Facts with no scope: customer profiles, policies, product catalogs. Survive across every conversation. Think CRM state.
Facts scoped to this dialog
Facts tagged with scope=conversationId. Tool results, user statements, agent commitments. Deleted when the ticket closes.
Inferred automatically
Rules fire across both layers. If a customer is enterprise tier and has an outage, the priority escalation rule triggers without you writing glue code.
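The escalation example can be sketched as a tiny inference step over both layers. Fact shapes here are assumptions; the engine evaluates rules like this internally, with no glue code on your side.

```python
def infer(tenant_facts, convo_facts):
    """Derive new facts from durable + conversation-scoped facts (toy rule)."""
    facts = tenant_facts | convo_facts
    derived = set()
    # Rule: enterprise-tier customer + active outage -> escalate priority.
    for who in {a for (p, a, b) in facts if p == "IsA" and b == "Enterprise"}:
        if ("Has", who, "Outage") in facts:
            derived.add(("Priority", who, "Escalated"))
    return derived

derived = infer(
    {("IsA", "Acme", "Enterprise")},   # durable tenant layer
    {("Has", "Acme", "Outage")},       # conversation scope
)
```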
// 1. Each turn — extract, store, get delta
r = POST /context(turns, scope, sessionId)
next_prompt = r.briefingDelta

// 2. Need a goal-specific subset?
ctx = POST /memory/context(goals, sessionId, format="natural")
next_prompt = ctx.formattedText

// 3. Promote durable facts before cleanup
POST /assert/fact({"predicate":"IsA","args":["Acme","Enterprise"]})

// 4. Conversation done — clean up
DELETE /scope/{conversationId}
POST /context/session/clear({"sessionId":"..."})
Keep your stack.
OpenClaw is a clear example: more context means more tokens. Nocturnus cuts the payload before the run.
Fast MCP now. Full Context Engine later.
Add it with openclaw mcp set now, or replace context assembly completely later.
Python and TypeScript
Call windows, optimize, diffs, summaries, and cleanup directly.
Use the reducer where the stack already runs.
Use REST in app code, MCP for agents, or CLI locally.