How NocturnusAI compresses context by 82–90%.
Most agent memory systems store embeddings and retrieve by similarity — so your context window fills with everything that looks related. NocturnusAI takes a different approach: it extracts structured facts, reasons about what's actually relevant to the current goal, and returns only what changed. That's logical compression, not summarization.
Your agent replays the entire thread every call.
In a typical agentic workflow, every LLM call includes the full conversation history. By turn 10, your model is processing thousands of tokens of old context — most of which it already knows. You're paying to re-read the same information over and over.
The standard "fix" is RAG — store embeddings, retrieve by cosine similarity. But similarity search returns everything that looks related, not what the agent actually needs for its next step. The context window is smaller, but full of noise.
Turn 1: User said X
Turn 2: Tool returned Y
Turn 3: Agent decided Z
Turn 4: User clarified W
Turn 5: Tool returned more data
Turn 6: Agent updated plan
Turn 7: User confirmed
Turn 8: Tool ran query
Turn 9: Agent summarized
Turn 10: New question — but model re-reads ALL of this

~1,607 tokens on turn 10 · ~$13,600/month at 1,000 req/hr
Three steps. Extract, infer, return the delta.
NocturnusAI sits between your agent and the LLM. Each turn passes through a compression pipeline powered by a logic engine — not vector search.
Extract structured facts
When your agent sends raw conversation turns to POST /context, an LLM extracts structured predicates. Not embeddings — actual facts with defined relationships.
"User: We can't log in after the Okta cutover. Account is Acme Corp, enterprise tier."
login_status(acme, failing)
cause(login_failure, okta_cutover)
customer_tier(acme, enterprise)
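In code, extracted predicates like these are just structured records. A toy representation in Python — the triple layout is an assumption for illustration, not NocturnusAI's wire format:

```python
# The utterance from above, reduced to predicate triples. The exact internal
# representation is an assumption; the point is structure, not embeddings.
utterance = ("User: We can't log in after the Okta cutover. "
             "Account is Acme Corp, enterprise tier.")

extracted = [
    ("login_status", "acme", "failing"),
    ("cause", "login_failure", "okta_cutover"),
    ("customer_tier", "acme", "enterprise"),
]

# Structured facts support exact queries -- no similarity threshold needed.
def lookup(predicate, facts):
    return [f for f in facts if f[0] == predicate]
```

Because the facts are symbolic rather than embedded, "find the cause of the login failure" is an exact match, not a nearest-neighbor search.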
Infer what's relevant to the current goal
This is where NocturnusAI diverges from every other memory system. Instead of "find similar facts," it runs backward-chaining inference — starting from the agent's current goals and working backward through the knowledge base to find only the facts that are logically reachable.
# Agent's goal: resolve the login issue
Goal: resolution(login_failure, ?fix)

# Engine traces backward through rules and facts:
resolution(X, ?fix) :- cause(X, ?cause), fix_for(?cause, ?fix)
  → needs cause(login_failure, ?cause)
  → found: cause(login_failure, okta_cutover)
  → needs fix_for(okta_cutover, ?fix)

# Only these facts are relevant. Everything else stays out.
Context window: 3 facts, not 47
Similarity search would return every fact that mentions login, Okta, or Acme. Backward chaining returns only the facts in the logical chain from the goal. That's why the compression is 82–90%, not 30%.
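The trace above can be reproduced with a minimal SLD-style backward chainer. This is a sketch of the technique, not NocturnusAI's engine: no rule-variable renaming and no occurs check, which is enough for the single rule below. The fact and rule names mirror the example:

```python
# Facts: the enterprise-tier fact is true but logically unreachable
# from the goal, so it never enters the proof.
facts = [
    ("cause", "login_failure", "okta_cutover"),
    ("fix_for", "okta_cutover", "update_saml_issuer"),
    ("customer_tier", "acme", "enterprise"),
]

# resolution(X, Fix) :- cause(X, Cause), fix_for(Cause, Fix)
rules = [
    (("resolution", "?x", "?fix"),
     [("cause", "?x", "?cause"), ("fix_for", "?cause", "?fix")]),
]

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def walk(t, env):
    while is_var(t) and t in env:
        t = env[t]
    return t

def unify(a, b, env):
    """Extend env so tuples a and b agree, or return None."""
    for x, y in zip(a, b):
        x, y = walk(x, env), walk(y, env)
        if x == y:
            continue
        if is_var(x):
            env = {**env, x: y}
        elif is_var(y):
            env = {**env, y: x}
        else:
            return None
    return env

def prove(goal, env=None):
    """Yield every binding environment under which goal holds."""
    env = {} if env is None else env
    for fact in facts:
        e = unify(goal, fact, env)
        if e is not None:
            yield e
    for head, body in rules:
        e = unify(goal, head, env)
        if e is not None:
            yield from prove_all(body, e)

def prove_all(goals, env):
    if not goals:
        yield env
        return
    for e in prove(goals[0], env):
        yield from prove_all(goals[1:], e)

fixes = [walk("?fix", e) for e in prove(("resolution", "login_failure", "?fix"))]
```

The search starts at the goal and only ever touches facts a rule chain can reach, which is exactly why unrelated facts stay out of the context window.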
Return only the delta
The response includes a briefingDelta — a natural-language summary of only what's new since the last turn. Your agent feeds this to the LLM instead of replaying the entire thread.
14 failed SAML assertions at 09:12 UTC. Issuer mismatch after IdP migration.
~221 tokens avg · 5.7× fewer (Claude Opus 4)
Full thread replay: all turns, tool calls, system events, retries...
~1,259 tokens avg · full-history replay
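Conceptually, the delta is a set difference: what the store knows now minus what it knew after the previous turn. A toy sketch with invented fact tuples:

```python
# Facts known after turn 1 vs. after turn 2; only the difference is briefed.
after_turn_1 = {
    ("login_status", "acme", "failing"),
    ("customer_tier", "acme", "enterprise"),
}
after_turn_2 = after_turn_1 | {
    ("saml_failures", "acme", 14),
    ("cause", "login_failure", "issuer_mismatch"),
}

# Everything in after_turn_1 is already in the LLM's head; skip it.
briefing_delta = after_turn_2 - after_turn_1
```

The actual briefingDelta is rendered as natural language, but the selection logic is this subtraction: facts the model has already seen are never re-sent.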
A logic engine, not a vector database.
NocturnusAI stores facts in a Hexastore (6-way indexed triple store) and reasons over them with two inference engines. No embeddings. No cosine similarity. Deterministic results.
Hexastore
Facts are indexed 6 ways (SPO, SOP, PSO, POS, OSP, OPS) for sub-millisecond pattern matching regardless of which terms are bound vs variable. No table scans.
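A six-way triple index can be sketched in a few lines: one nested dict per ordering of (subject, predicate, object), so any pattern — whichever terms are bound — resolves by prefix lookup instead of a scan. This is a toy to show the indexing idea, not the production store:

```python
class Hexastore:
    ORDERS = ("spo", "sop", "pso", "pos", "osp", "ops")

    def __init__(self):
        self.idx = {order: {} for order in self.ORDERS}

    def insert(self, s, p, o):
        term = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:  # write each triple into all six indexes
            a, b, c = (term[k] for k in order)
            self.idx[order].setdefault(a, {}).setdefault(b, set()).add(c)

    def query(self, s=None, p=None, o=None):
        term = {"s": s, "p": p, "o": o}
        # Pick the index whose prefix covers exactly the bound terms.
        order = "".join(sorted("spo", key=lambda k: term[k] is None))
        out = []
        top = self.idx[order]
        for a in ([term[order[0]]] if term[order[0]] is not None else top):
            mid = top.get(a, {})
            for b in ([term[order[1]]] if term[order[1]] is not None else mid):
                last = mid.get(b, set())
                cs = last if term[order[2]] is None else ({term[order[2]]} & last)
                for c in cs:
                    row = {order[0]: a, order[1]: b, order[2]: c}
                    out.append((row["s"], row["p"], row["o"]))
        return out
```

Writes cost six index updates; in exchange, every query shape — from fully bound to fully open — starts at a dict lookup rather than a table scan.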
Backward Chainer
Goal-driven SLD resolution with unification. Given a goal, it finds all facts and rules that can prove it — and nothing else. This is how context windows shrink by 82–90%.
Rete Engine
Forward chaining triggered on fact assertion. When new facts arrive, rules fire automatically to derive new conclusions. Supports negation-as-failure (NAF) for closed-world reasoning.
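A Rete network computes rule matches incrementally; a naive re-evaluation loop shows the same input/output behavior in a few lines. The escalation rule here is invented for illustration, and a real Rete engine would avoid re-matching everything on each assertion:

```python
facts = {("cause", "login_failure", "okta_cutover")}

def rule_escalate(fs):
    """Enterprise customer with a failing login => derive an escalation."""
    derived = set()
    for (p1, s1, o1) in fs:
        if p1 == "customer_tier" and o1 == "enterprise":
            for (p2, s2, o2) in fs:
                if p2 == "login_status" and s2 == s1 and o2 == "failing":
                    derived.add(("escalate", s1, "true"))
    return derived

def assert_fact(fs, fact, rules):
    """Add a fact, then fire rules to a fixed point (naive forward chaining)."""
    fs = set(fs) | {fact}
    changed = True
    while changed:
        new = set().union(*(r(fs) for r in rules)) - fs
        changed = bool(new)
        fs |= new
    return fs

kb = assert_fact(facts, ("customer_tier", "acme", "enterprise"), [rule_escalate])
kb = assert_fact(kb, ("login_status", "acme", "failing"), [rule_escalate])
```

The first assertion derives nothing; only when the second premise arrives does the rule fire — conclusions appear automatically as facts come in, which is the forward-chaining half of the engine.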
Truth Maintenance
The provenance tracker records why each fact was derived. When a premise is retracted, all dependent conclusions are automatically removed. No stale facts in your context.
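The retraction cascade can be sketched as a justification table: each derived fact records the premises that support it, and retracting a premise sweeps away everything downstream. The support structure here is a toy; the real tracker records full derivation provenance:

```python
facts = {
    ("cause", "login_failure", "okta_cutover"),
    ("fix_for", "okta_cutover", "update_issuer"),
}
# Derived fact -> the premises that justify it.
support = {
    ("resolution", "login_failure", "update_issuer"): {
        ("cause", "login_failure", "okta_cutover"),
        ("fix_for", "okta_cutover", "update_issuer"),
    },
}
facts |= set(support)

def retract(fact, facts, support):
    """Remove a fact and, transitively, every conclusion it supported."""
    gone = {fact}
    changed = True
    while changed:
        changed = False
        for derived, premises in support.items():
            if derived not in gone and premises & gone:
                gone.add(derived)
                changed = True
    return facts - gone

remaining = retract(("cause", "login_failure", "okta_cutover"), facts, support)
```

If it turns out the Okta cutover wasn't the cause, the resolution derived from it disappears with it — nothing stale survives into the next briefing.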
Salience Scoring
Composite scores from recency, access frequency, and explicit priority determine which facts matter most. When the context window has a budget, salience decides what makes the cut.
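A composite score might combine the three signals like this. The doc names the inputs — recency, access frequency, explicit priority — but the formula, weights, and half-life below are assumptions for illustration:

```python
import time

def salience(last_access_ts, access_count, priority,
             now=None, half_life_s=3600.0, weights=(0.5, 0.3, 0.2)):
    """Hypothetical composite salience in [0, 1] (priority assumed in [0, 1])."""
    now = time.time() if now is None else now
    recency = 0.5 ** ((now - last_access_ts) / half_life_s)  # halves each hour
    frequency = access_count / (access_count + 10.0)         # saturates toward 1
    w_r, w_f, w_p = weights
    return w_r * recency + w_f * frequency + w_p * priority

now = 1_000_000.0
hot = salience(now - 60, 20, 1.0, now=now)       # fresh, busy, high priority
cold = salience(now - 86_400, 1, 0.0, now=now)   # day-old, rarely touched
```

Given a token budget, facts would then be sorted by score and included until the budget is spent — the cut falls on the lowest-salience facts first.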
Temporal Awareness
Facts carry validFrom, validUntil, and ttl fields. Expired facts are automatically excluded. Point-in-time queries return what was true at any moment.
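Point-in-time filtering over those fields can be sketched as follows. The field names validFrom, validUntil, and ttl come from the doc; the record layout and the assertedAt field are assumptions:

```python
def valid_at(fact, t):
    """True if fact is live at time t (open-ended bounds default to +/- inf)."""
    start = fact.get("validFrom", float("-inf"))
    end = fact.get("validUntil", float("inf"))
    if "ttl" in fact:
        # assertedAt is a hypothetical field naming when the fact was stored.
        end = min(end, fact.get("assertedAt", start) + fact["ttl"])
    return start <= t < end

facts = [
    {"triple": ("sso_provider", "acme", "legacy_idp"), "validFrom": 0, "validUntil": 100},
    {"triple": ("sso_provider", "acme", "okta"), "validFrom": 100},
]

def query_at(facts, t):
    """What was true at time t -- expired facts are excluded automatically."""
    return [f["triple"] for f in facts if valid_at(f, t)]
```

Asking "who was Acme's SSO provider?" at two different moments returns two different, non-overlapping answers — the store never reports both as simultaneously true.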
Inference vs similarity search.
NocturnusAI (inference)
- Finds facts that are logically reachable from the goal
- Deterministic — same query, same result, every time
- Derives new facts from rules (not just retrieval)
- Returns a delta — only what changed
- Works without an LLM for the reasoning step
- No hallucinations in the reasoning layer
Vector RAG (similarity)
- Finds facts that look similar to the query
- Probabilistic — results vary with embedding model
- Can only retrieve, not derive new conclusions
- Returns a ranked list, not a diff
- Requires LLM or embedding model for every operation
- Similarity ≠ relevance (noise in context window)
One endpoint. One call per turn.
Your agent sends raw turns to POST /context. NocturnusAI handles extraction, storage, inference, and delta generation. The response includes everything your LLM needs.
curl -s -X POST http://localhost:9300/context \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"turns": [
"User: We cannot log in after the Okta cutover.",
"Tool crm_lookup: account=acme tier=enterprise",
"Tool auth_audit: 14 failed SAML assertions since 09:12 UTC"
],
"scope": "ticket-4821",
"sessionId": "ticket-4821"
}'

Response:

{
  "briefingDelta": "Login failing post-Okta cutover. 14 failed SAML assertions since 09:12. Acme is enterprise tier.",
  "factsExtracted": 4,
  "factsInScope": 4,
  "contextWindowTokens": 221,
  "sessionId": "ticket-4821"
}

Your agent code
r = POST /context(turns, scope, sessionId)

messages = [
    system(r.briefingDelta),   # ~221 tokens avg
    user(next_question),
]

llm_response = call_llm(messages)
# LLM sees only what changed — not the whole thread
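The pseudocode above as a minimal runnable client, using only the standard library. The endpoint, headers, and payload fields mirror the curl example; call_llm stands in for whatever your stack provides:

```python
import json
import urllib.request

NOCTURNUS_URL = "http://localhost:9300/context"  # same endpoint as the curl example

def build_payload(turns, scope, session_id):
    return {"turns": turns, "scope": scope, "sessionId": session_id}

def compress_context(turns, scope, session_id, tenant="default"):
    """One call per turn: send raw turns, get back only the delta."""
    req = urllib.request.Request(
        NOCTURNUS_URL,
        data=json.dumps(build_payload(turns, scope, session_id)).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with a running NocturnusAI instance):
# r = compress_context(turns, "ticket-4821", "ticket-4821")
# messages = [{"role": "system", "content": r["briefingDelta"]},
#             {"role": "user", "content": next_question}]
```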
What you skip
No embedding pipeline. No vector database. No retrieval tuning. No chunking strategy. No re-ranking. The logic engine handles relevance through inference, not similarity. You send turns, you get a delta.
Try it yourself.
DOCKER · NO SIGNUP · FIRST RESULT IN 2 MINUTES
Quick Start
Docker one-liner, zero config. Running in 10 seconds.
API Reference
Full endpoint docs, SDK examples, MCP tools.
Concepts
Predicates, rules, inference, salience — the backend mechanics.