How NocturnusAI compresses context by 82–90%.
Most agent memory systems store embeddings and retrieve by similarity — so your context window fills with everything that looks related. NocturnusAI takes a different approach: it extracts structured facts, reasons about what's actually relevant to the current goal, and returns only what changed. That's logical compression, not summarization.
Your agent replays the entire thread every call.
In a typical agentic workflow, every LLM call includes the full conversation history. By turn 10, your model is processing thousands of tokens of old context — most of which it already knows. You're paying to re-read the same information over and over.
The standard "fix" is RAG — store embeddings, retrieve by cosine similarity. But similarity search returns everything that looks related, not what the agent actually needs for its next step. The context window is smaller, but full of noise.
Turn 1: User said X
Turn 2: Tool returned Y
Turn 3: Agent decided Z
Turn 4: User clarified W
Turn 5: Tool returned more data
Turn 6: Agent updated plan
Turn 7: User confirmed
Turn 8: Tool ran query
Turn 9: Agent summarized
Turn 10: New question — but model re-reads ALL of this

~1,607 tokens on turn 10 · ~$13,600/month at 1,000 req/hr
Three steps. Extract, infer, return the delta.
NocturnusAI sits between your agent and the LLM. Each turn passes through a compression pipeline powered by a logic engine — not vector search.
Extract structured facts
When your agent sends raw conversation turns to POST /context, an LLM extracts structured predicates. Not embeddings — actual facts with defined relationships.
"User: We can't log in after the Okta cutover. Account is Acme Corp, enterprise tier."
login_status(acme, failing)
cause(login_failure, okta_cutover)
customer_tier(acme, enterprise)
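In code, extracted predicates like these are just structured records. A toy representation in Python — the triple layout is an assumption for illustration, not NocturnusAI's wire format:

```python
# The utterance from above, reduced to predicate triples. The exact internal
# representation is an assumption; the point is structure, not embeddings.
utterance = ("User: We can't log in after the Okta cutover. "
             "Account is Acme Corp, enterprise tier.")

extracted = [
    ("login_status", "acme", "failing"),
    ("cause", "login_failure", "okta_cutover"),
    ("customer_tier", "acme", "enterprise"),
]

# Structured facts support exact queries -- no similarity threshold needed.
def lookup(predicate, facts):
    return [f for f in facts if f[0] == predicate]
```

Because the facts are symbolic rather than embedded, "find the cause of the login failure" is an exact match, not a nearest-neighbor search.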
Infer what's relevant to the current goal
This is where NocturnusAI diverges from every other memory system. Instead of "find similar facts," it runs backward-chaining inference — starting from the agent's current goals and working backward through the knowledge base to find only the facts that are logically reachable.
# Agent's goal: resolve the login issue
Goal: resolution(login_failure, ?fix)

# Engine traces backward through rules and facts:
resolution(X, ?fix) :- cause(X, ?cause), fix_for(?cause, ?fix)
  → needs cause(login_failure, ?cause)
  → found: cause(login_failure, okta_cutover)
  → needs fix_for(okta_cutover, ?fix)

# Only these facts are relevant. Everything else stays out.
Context window: 3 facts, not 47
Similarity search would return every fact that mentions login, Okta, or Acme. Backward chaining returns only the facts in the logical chain from the goal. That's why the compression is 82–90%, not 30%.
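The trace above can be reproduced with a minimal SLD-style backward chainer. This is a sketch of the technique, not NocturnusAI's engine: no rule-variable renaming and no occurs check, which is enough for the single rule below. The fact and rule names mirror the example:

```python
# Facts: the enterprise-tier fact is true but logically unreachable
# from the goal, so it never enters the proof.
facts = [
    ("cause", "login_failure", "okta_cutover"),
    ("fix_for", "okta_cutover", "update_saml_issuer"),
    ("customer_tier", "acme", "enterprise"),
]

# resolution(X, Fix) :- cause(X, Cause), fix_for(Cause, Fix)
rules = [
    (("resolution", "?x", "?fix"),
     [("cause", "?x", "?cause"), ("fix_for", "?cause", "?fix")]),
]

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def walk(t, env):
    while is_var(t) and t in env:
        t = env[t]
    return t

def unify(a, b, env):
    """Extend env so tuples a and b agree, or return None."""
    for x, y in zip(a, b):
        x, y = walk(x, env), walk(y, env)
        if x == y:
            continue
        if is_var(x):
            env = {**env, x: y}
        elif is_var(y):
            env = {**env, y: x}
        else:
            return None
    return env

def prove(goal, env=None):
    """Yield every binding environment under which goal holds."""
    env = {} if env is None else env
    for fact in facts:
        e = unify(goal, fact, env)
        if e is not None:
            yield e
    for head, body in rules:
        e = unify(goal, head, env)
        if e is not None:
            yield from prove_all(body, e)

def prove_all(goals, env):
    if not goals:
        yield env
        return
    for e in prove(goals[0], env):
        yield from prove_all(goals[1:], e)

fixes = [walk("?fix", e) for e in prove(("resolution", "login_failure", "?fix"))]
```

The search starts at the goal and only ever touches facts a rule chain can reach, which is exactly why unrelated facts stay out of the context window.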
Return only the delta
The response includes a briefingDelta — a natural-language summary of only what's new since the last turn. Your agent feeds this to the LLM instead of replaying the entire thread.
14 failed SAML assertions at 09:12 UTC. Issuer mismatch after IdP migration.
~221 tokens avg · 5.7× fewer (Claude Opus 4)
Full thread replay: all turns, tool calls, system events, retries...
~1,259 tokens avg · full-history replay
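Conceptually, the delta is a set difference: what the store knows now minus what it knew after the previous turn. A toy sketch with invented fact tuples:

```python
# Facts known after turn 1 vs. after turn 2; only the difference is briefed.
after_turn_1 = {
    ("login_status", "acme", "failing"),
    ("customer_tier", "acme", "enterprise"),
}
after_turn_2 = after_turn_1 | {
    ("saml_failures", "acme", 14),
    ("cause", "login_failure", "issuer_mismatch"),
}

# Everything in after_turn_1 is already in the LLM's head; skip it.
briefing_delta = after_turn_2 - after_turn_1
```

The actual briefingDelta is rendered as natural language, but the selection logic is this subtraction: facts the model has already seen are never re-sent.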
A logic engine, not a vector database.
NocturnusAI stores facts in a Hexastore (6-way indexed triple store) and reasons over them with two inference engines. No embeddings. No cosine similarity. Deterministic results.
Hexastore
Facts are indexed 6 ways (SPO, SOP, PSO, POS, OSP, OPS) for sub-millisecond pattern matching regardless of which terms are bound vs variable. No table scans.
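A six-way triple index can be sketched in a few lines: one nested dict per ordering of (subject, predicate, object), so any pattern — whichever terms are bound — resolves by prefix lookup instead of a scan. This is a toy to show the indexing idea, not the production store:

```python
class Hexastore:
    ORDERS = ("spo", "sop", "pso", "pos", "osp", "ops")

    def __init__(self):
        self.idx = {order: {} for order in self.ORDERS}

    def insert(self, s, p, o):
        term = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:  # write each triple into all six indexes
            a, b, c = (term[k] for k in order)
            self.idx[order].setdefault(a, {}).setdefault(b, set()).add(c)

    def query(self, s=None, p=None, o=None):
        term = {"s": s, "p": p, "o": o}
        # Pick the index whose prefix covers exactly the bound terms.
        order = "".join(sorted("spo", key=lambda k: term[k] is None))
        out = []
        top = self.idx[order]
        for a in ([term[order[0]]] if term[order[0]] is not None else top):
            mid = top.get(a, {})
            for b in ([term[order[1]]] if term[order[1]] is not None else mid):
                last = mid.get(b, set())
                cs = last if term[order[2]] is None else ({term[order[2]]} & last)
                for c in cs:
                    row = {order[0]: a, order[1]: b, order[2]: c}
                    out.append((row["s"], row["p"], row["o"]))
        return out
```

Writes cost six index updates; in exchange, every query shape — from fully bound to fully open — starts at a dict lookup rather than a table scan.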
Backward Chainer
Goal-driven SLD resolution with unification. Given a goal, it finds all facts and rules that can prove it — and nothing else. This is how context windows shrink by 82–90%.
Rete Engine
Forward chaining triggered on fact assertion. When new facts arrive, rules fire automatically to derive new conclusions. Supports negation-as-failure (NAF) for closed-world reasoning.
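A Rete network computes rule matches incrementally; a naive re-evaluation loop shows the same input/output behavior in a few lines. The escalation rule here is invented for illustration, and a real Rete engine would avoid re-matching everything on each assertion:

```python
facts = {("cause", "login_failure", "okta_cutover")}

def rule_escalate(fs):
    """Enterprise customer with a failing login => derive an escalation."""
    derived = set()
    for (p1, s1, o1) in fs:
        if p1 == "customer_tier" and o1 == "enterprise":
            for (p2, s2, o2) in fs:
                if p2 == "login_status" and s2 == s1 and o2 == "failing":
                    derived.add(("escalate", s1, "true"))
    return derived

def assert_fact(fs, fact, rules):
    """Add a fact, then fire rules to a fixed point (naive forward chaining)."""
    fs = set(fs) | {fact}
    changed = True
    while changed:
        new = set().union(*(r(fs) for r in rules)) - fs
        changed = bool(new)
        fs |= new
    return fs

kb = assert_fact(facts, ("customer_tier", "acme", "enterprise"), [rule_escalate])
kb = assert_fact(kb, ("login_status", "acme", "failing"), [rule_escalate])
```

The first assertion derives nothing; only when the second premise arrives does the rule fire — conclusions appear automatically as facts come in, which is the forward-chaining half of the engine.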
Truth Maintenance
The provenance tracker records why each fact was derived. When a premise is retracted, all dependent conclusions are automatically removed. No stale facts in your context.
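The retraction cascade can be sketched as a justification table: each derived fact records the premises that support it, and retracting a premise sweeps away everything downstream. The support structure here is a toy; the real tracker records full derivation provenance:

```python
facts = {
    ("cause", "login_failure", "okta_cutover"),
    ("fix_for", "okta_cutover", "update_issuer"),
}
# Derived fact -> the premises that justify it.
support = {
    ("resolution", "login_failure", "update_issuer"): {
        ("cause", "login_failure", "okta_cutover"),
        ("fix_for", "okta_cutover", "update_issuer"),
    },
}
facts |= set(support)

def retract(fact, facts, support):
    """Remove a fact and, transitively, every conclusion it supported."""
    gone = {fact}
    changed = True
    while changed:
        changed = False
        for derived, premises in support.items():
            if derived not in gone and premises & gone:
                gone.add(derived)
                changed = True
    return facts - gone

remaining = retract(("cause", "login_failure", "okta_cutover"), facts, support)
```

If it turns out the Okta cutover wasn't the cause, the resolution derived from it disappears with it — nothing stale survives into the next briefing.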
Salience Scoring
Composite scores from recency, access frequency, and explicit priority determine which facts matter most. When the context window has a budget, salience decides what makes the cut.
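A composite score might combine the three signals like this. The doc names the inputs — recency, access frequency, explicit priority — but the formula, weights, and half-life below are assumptions for illustration:

```python
import time

def salience(last_access_ts, access_count, priority,
             now=None, half_life_s=3600.0, weights=(0.5, 0.3, 0.2)):
    """Hypothetical composite salience in [0, 1] (priority assumed in [0, 1])."""
    now = time.time() if now is None else now
    recency = 0.5 ** ((now - last_access_ts) / half_life_s)  # halves each hour
    frequency = access_count / (access_count + 10.0)         # saturates toward 1
    w_r, w_f, w_p = weights
    return w_r * recency + w_f * frequency + w_p * priority

now = 1_000_000.0
hot = salience(now - 60, 20, 1.0, now=now)       # fresh, busy, high priority
cold = salience(now - 86_400, 1, 0.0, now=now)   # day-old, rarely touched
```

Given a token budget, facts would then be sorted by score and included until the budget is spent — the cut falls on the lowest-salience facts first.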
Temporal Awareness
Facts carry validFrom, validUntil, and ttl fields. Expired facts are automatically excluded. Point-in-time queries return what was true at any moment.
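Point-in-time filtering over those fields can be sketched as follows. The field names validFrom, validUntil, and ttl come from the doc; the record layout and the assertedAt field are assumptions:

```python
def valid_at(fact, t):
    """True if fact is live at time t (open-ended bounds default to +/- inf)."""
    start = fact.get("validFrom", float("-inf"))
    end = fact.get("validUntil", float("inf"))
    if "ttl" in fact:
        # assertedAt is a hypothetical field naming when the fact was stored.
        end = min(end, fact.get("assertedAt", start) + fact["ttl"])
    return start <= t < end

facts = [
    {"triple": ("sso_provider", "acme", "legacy_idp"), "validFrom": 0, "validUntil": 100},
    {"triple": ("sso_provider", "acme", "okta"), "validFrom": 100},
]

def query_at(facts, t):
    """What was true at time t -- expired facts are excluded automatically."""
    return [f["triple"] for f in facts if valid_at(f, t)]
```

Asking "who was Acme's SSO provider?" at two different moments returns two different, non-overlapping answers — the store never reports both as simultaneously true.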
Inference vs similarity search.
NocturnusAI (inference)
- Finds facts that are logically reachable from the goal
- Deterministic — same query, same result, every time
- Derives new facts from rules (not just retrieval)
- Returns a delta — only what changed
- Works without an LLM for the reasoning step
- No hallucinations in the reasoning layer
Vector RAG (similarity)
- Finds facts that look similar to the query
- Probabilistic — results vary with embedding model
- Can only retrieve, not derive new conclusions
- Returns a ranked list, not a diff
- Requires LLM or embedding model for every operation
- Similarity ≠ relevance (noise in context window)
One endpoint. One call per turn.
Your agent sends raw turns to POST /context. NocturnusAI handles extraction, storage, inference, and delta generation. The response includes everything your LLM needs.
curl -s -X POST http://localhost:9300/context \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"turns": [
"User: We cannot log in after the Okta cutover.",
"Tool crm_lookup: account=acme tier=enterprise",
"Tool auth_audit: 14 failed SAML assertions since 09:12 UTC"
],
"scope": "ticket-4821",
"sessionId": "ticket-4821"
}'

Response:

{
  "briefingDelta": "Login failing post-Okta cutover. 14 failed SAML assertions since 09:12. Acme is enterprise tier.",
  "factsExtracted": 4,
  "factsInScope": 4,
  "contextWindowTokens": 221,
  "sessionId": "ticket-4821"
}

Your agent code
r = POST /context(turns, scope, sessionId)

messages = [
    system(r.briefingDelta),   # ~221 tokens avg
    user(next_question),
]

llm_response = call_llm(messages)
# LLM sees only what changed — not the whole thread
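The pseudocode above as a minimal runnable client, using only the standard library. The endpoint, headers, and payload fields mirror the curl example; call_llm stands in for whatever your stack provides:

```python
import json
import urllib.request

NOCTURNUS_URL = "http://localhost:9300/context"  # same endpoint as the curl example

def build_payload(turns, scope, session_id):
    return {"turns": turns, "scope": scope, "sessionId": session_id}

def compress_context(turns, scope, session_id, tenant="default"):
    """One call per turn: send raw turns, get back only the delta."""
    req = urllib.request.Request(
        NOCTURNUS_URL,
        data=json.dumps(build_payload(turns, scope, session_id)).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with a running NocturnusAI instance):
# r = compress_context(turns, "ticket-4821", "ticket-4821")
# messages = [{"role": "system", "content": r["briefingDelta"]},
#             {"role": "user", "content": next_question}]
```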
What you skip
No embedding pipeline. No vector database. No retrieval tuning. No chunking strategy. No re-ranking. The logic engine handles relevance through inference, not similarity. You send turns, you get a delta.
Try it yourself.
DOCKER · NO SIGNUP · FIRST RESULT IN 2 MINUTES
Quick Start
Docker one-liner, zero config. Running in 10 seconds.
API Reference
Full endpoint docs, SDK examples, MCP tools.
Concepts
Predicates, rules, inference, salience — the backend mechanics.