The Real Context Workflow
Think in terms of turn reduction, not knowledge modeling. Your input is a big array of turns. Your output is a smaller context window for the next model call.
An LLM must be configured before the endpoints below will work:
- Install Ollama and run ollama pull granite3.3:8b (the Docker image connects to Ollama on your host by default).
- Alternatively, set OPENAI_API_KEY, ANTHROPIC_API_KEY, or LLM_BASE_URL/LLM_MODEL as environment variables.
- If using the CLI, run setup in the REPL to configure your provider interactively.
Without a configured LLM, POST /context and POST /context/ingest will time out or return empty results.
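If you go the environment-variable route, a minimal shell sketch (the key values and the local base URL/model are placeholders for your own deployment, not real credentials):

```shell
# Cloud provider: either key is sufficient on its own
export OPENAI_API_KEY=sk-...            # placeholder
# export ANTHROPIC_API_KEY=sk-ant-...   # placeholder

# Or point at any OpenAI-compatible endpoint, e.g. a local Ollama server
# export LLM_BASE_URL=http://localhost:11434/v1
# export LLM_MODEL=granite3.3:8b
```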
What The Problem Actually Looks Like
In production, a thread is rarely just user and assistant messages. It also includes tool results, CRM data, system events, previous summaries, internal guidance, and repeated restatements of the same issue.
{
"turns": [
"User: We still cannot log in after yesterday's Okta cutover.",
"Agent: Pulling account metadata and auth logs.",
"Tool crm_lookup: account=acme_corp tier=enterprise billing=current renewal=2026-07-01",
"Tool auth_audit: 14 failed SAML assertions since 09:12 UTC; issuer mismatch detected.",
"Internal note: Customer is not delinquent. Keep ticket in support queue.",
"Previous ticket: promised service credit if outage exceeds 4 hours.",
"Slack escalation: INC-4821 open; workaround is manual issuer override.",
"User: Three teams lost admin access after yesterday's metadata change.",
"Tool statuspage: degraded identity service in us-east-1.",
"Agent: Need a concise handoff context before the next model call."
]
}

That array is still far smaller than what many teams send in reality. The important point is that it already contains overlap, stale details, and the same issue stated in different ways.
Step 1: First Reduction With POST /context
POST /context uses an LLM to extract structured facts from raw text.
The Docker image connects to Ollama on your host by default (install Ollama, then ollama pull granite3.3:8b).
Or pass -e ANTHROPIC_API_KEY / -e OPENAI_API_KEY for a cloud LLM.
First calls take 10–30s on local Ollama hardware.
If you just want to query pre-asserted facts without an LLM, skip to Step 2 using POST /memory/context.
Use POST /context when you want the first compact pass. Send raw turns. Get back the normalized state that seems most important. This endpoint extracts facts from your turns and stores them in the knowledge base.
Always include scope and sessionId so the server can track the conversation across turns and compute deltas for Steps 2–4. Use the same value for both (e.g. your ticket or thread ID).
curl -X POST http://localhost:9300/context \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"turns": [
"User: We still cannot log in after yesterday'\''s Okta cutover.",
"Tool crm_lookup: account=acme_corp tier=enterprise billing=current",
"Tool auth_audit: issuer mismatch detected after IdP migration.",
"Slack escalation: workaround is manual issuer override."
],
"scope": "support-thread-4821",
"sessionId": "support-thread-4821",
"maxFacts": 12
}'

The response:

{
"facts": [
{ "predicate": "HasRole", "args": ["acme_corp", "Enterprise Account"], "salience": 0.64 },
{ "predicate": "Identified", "args": ["Auth audit", "Issuer mismatch"], "salience": 0.64 },
{ "predicate": "Proposed", "args": ["Workaround", "Manual issuer override"], "salience": 0.64 },
{ "predicate": "HasStatus", "args": ["Login", "Failed"], "salience": 0.64 }
],
"factsReturned": 4,
"contradictions": 0,
"newFactsExtracted": 4,
"briefingDelta": "## Okta Cutover\nLogin is failing after the Okta cutover...",
"sessionId": "support-thread-4821"
}

Tracking A Conversation Across Many Turns
For multi-turn agent loops you usually want three things at once: facts that belong to this conversation should be partitioned, the LLM extractor should see prior turns so pronouns resolve, and the next model call only needs to see what's new. POST /context handles all three with two fields.
The Three-Layer Mental Model
It helps to think about facts in three layers, mapped onto NocturnusAI primitives that already exist:
- Tenant layer — durable knowledge that survives across conversations: customer profiles, product catalogs, policies. Stored under a tenantId with scope = null. Selected via the X-Tenant-ID header.
- Conversation layer — facts extracted from one specific dialog. Use the conversation id as the scope on every assertion so the conversation is a logical partition you can list, diff, merge, or delete in one shot.
- Derived layer — facts produced by inference rules off the other two. The provenance tracker keeps these consistent automatically; you don't manage them by hand.
The recommended pattern is to use the same string as both the scope and the sessionId. The scope partitions the facts; the sessionId keys the diff snapshot. Two clean roles, one identifier.
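A minimal Python sketch of that pattern (build_context_payload is a hypothetical helper name; the field names match the requests shown in this guide):

```python
def build_context_payload(conversation_id: str, turns: list[str],
                          max_facts: int = 12) -> dict:
    """Build a POST /context body that reuses one identifier for both roles:
    scope partitions the facts, sessionId keys the diff snapshot."""
    return {
        "turns": turns,
        "scope": conversation_id,
        "sessionId": conversation_id,
        "maxFacts": max_facts,
    }
```

Every turn of the same conversation then posts through this helper with the same conversation_id, so partitioning and diffing stay in lockstep by construction.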
Incremental Updates Per Turn
Pass sessionId and scope on every /context call for the same conversation. The server then:
- Tags every extracted fact and rule with that scope, so the conversation stays partitioned from tenant-wide knowledge.
- Keeps a small ring buffer of the last few turns under that key and feeds them to the extractor as contextHint, so references like "the same customer" resolve correctly.
- Snapshots the optimized window per sessionId, so the next call can compute a delta against it.
You can override the auto-built hint by sending contextHint explicitly. You can omit scope if you want all conversation facts to land under the tenant layer (not recommended for long-running agents — it conflates customers).
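For example, a hypothetical request body that supplies its own hint while keeping the scope/sessionId pairing (the hint text here is illustrative):

```python
# Overriding the auto-built contextHint for a single /context call
payload = {
    "turns": ["User: Same customer again; now the admin console is locked too."],
    "scope": "support-thread-4821",
    "sessionId": "support-thread-4821",
    # Explicit hint replaces the server's ring-buffer summary for this call
    "contextHint": "acme_corp enterprise account; Okta cutover caused a SAML issuer mismatch.",
}
```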
The briefingDelta Field
From the second turn onward (i.e. once a snapshot exists for the sessionId), the response includes a briefingDelta: a short LLM-formatted natural-language briefing of just the facts that are new this turn. This is what you want to add to the next model prompt — not the entire window again.
# Turn 1 — first call for this conversation
curl -X POST http://localhost:9300/context \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"turns": [
"User: We still cannot log in after yesterday'\''s Okta cutover.",
"Tool crm_lookup: account=acme_corp tier=enterprise"
],
"scope": "support-thread-4821",
"sessionId": "support-thread-4821",
"maxFacts": 12
}'

# Turn 2 — same scope + sessionId; the server pulls turn 1 as contextHint
curl -X POST http://localhost:9300/context \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"turns": [
"Tool auth_audit: 14 failed SAML assertions; issuer mismatch detected.",
"Slack escalation: workaround is manual issuer override."
],
"scope": "support-thread-4821",
"sessionId": "support-thread-4821",
"maxFacts": 12
}'

The response:

{
"facts": [ /* full optimized window */ ],
"factsReturned": 7,
"newFactsExtracted": 3,
"briefingDelta": "## Authentication\nThe auth_audit tool detected an issuer mismatch after 14 failed SAML assertions. The accepted workaround is a manual issuer override.",
"sessionId": "support-thread-4821"
}
Pass briefingDelta straight into the next model call as a short system message. The full facts array is still available when you need to inspect or audit.
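A minimal sketch of that handoff (delta_message is a hypothetical helper; it assumes the response shape shown above and falls back to the facts array on the first turn, when no snapshot exists yet):

```python
def delta_message(ctx_response: dict) -> dict:
    """Turn a POST /context response into one short system message."""
    briefing = ctx_response.get("briefingDelta")
    if not briefing:
        # Turn 1: no snapshot yet, so no delta — flatten the facts instead
        briefing = "\n".join(
            f"- {f['predicate']}({', '.join(f['args'])})"
            for f in ctx_response.get("facts", [])
        )
    return {"role": "system", "content": "New context this turn:\n" + briefing}
```

The returned dict drops straight into a chat-completions messages list, so each subsequent model call carries only the delta rather than the full window.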
Cleaning Up A Finished Conversation
When the conversation is done, delete the scope and clear the snapshot in two calls:
curl -X DELETE http://localhost:9300/scope/support-thread-4821 \
-H "X-Tenant-ID: default"
curl -X POST http://localhost:9300/context/session/clear \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{"sessionId":"support-thread-4821"}'

Step 2: Ask The Next Question With POST /memory/context
The first pass is broad. The next pass should be question-specific. Once you know what the next model call is trying to do, use /memory/context with goals to narrow the window further.
POST /memory/context is the unified context endpoint. Simple requests (just maxFacts/minSalience) use a fast salience-ranked path. Adding goals, sessionId, or relevanceBuckets automatically triggers the full optimization engine — goal-driven backward chaining, contradiction handling, and session snapshots.
curl -X POST http://localhost:9300/memory/context \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"sessionId": "support-thread-4821",
"maxFacts": 10,
"goals": [
{"predicate":"next_best_action","args":["acme_corp"]},
{"predicate":"service_credit_applicable","args":["acme_corp"]}
],
"format": "natural"
}'

This is where the context window becomes operational instead of merely descriptive. The model stops seeing the whole incident and starts seeing the subset needed to answer the next action question.
The format parameter controls output shape: "natural" returns a pre-formatted formattedText field ready to paste into a system prompt, while "structured" (the default) returns the entries array for programmatic use.
POST /context/optimize still works but is deprecated (sunset 2026-07-01). It returns a Deprecation: true header. Migrate to POST /memory/context with the same parameters.
Step 3: Stop Re-Sending The Same Context With POST /context/diff
This is the step most teams miss. If the conversation continues, do not send the whole optimized window again. Keep the same sessionId and ask for the delta.
curl -X POST http://localhost:9300/context/diff \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{
"sessionId": "support-thread-4821",
"maxFacts": 10
}'

The response:

{
"previousWindowId": "ctx-01",
"currentWindowId": "ctx-02",
"added": [
{ "predicate": "temporary_access_restored", "args": ["acme_corp", "true"], "salience": 0.92 }
],
"removed": [],
"unchanged": 9,
"fullRefreshRecommended": false
}

That is the production benefit: later model calls pay for the change, not for the entire thread history again.
Step 4: End The Thread Cleanly
When the thread is finished, clear the diff snapshot:
curl -X POST http://localhost:9300/context/session/clear \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-d '{"sessionId":"support-thread-4821"}'

Formatting For The Model
Most teams take the returned entries and flatten them into a short system or tool message. The model does not need the original transcript if the compact context already captures the operational state.
The easiest approach: pass "format": "natural" and use the formattedText field directly in your system prompt. No manual formatting needed.
Using the Python SDK (recommended):
from nocturnusai import SyncNocturnusAIClient
from openai import OpenAI
nocturnus = SyncNocturnusAIClient("http://localhost:9300", tenant_id="default")
openai_client = OpenAI()
ctx = nocturnus.context(
session_id="support-thread-4821",
max_facts=10,
goals=[{"predicate": "next_best_action", "args": ["acme_corp"]}],
format="natural",
)
context_text = ctx.formatted_text or ""
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Use only this reduced context:\n" + context_text},
{"role": "user", "content": "Write the next support reply."},
],
)

Or with raw HTTP if you prefer no SDK dependency:
import requests
from openai import OpenAI
client = OpenAI()
ctx = requests.post(
"http://localhost:9300/memory/context",
headers={"X-Tenant-ID": "default"},
json={
"sessionId": "support-thread-4821",
"maxFacts": 10,
"goals": [{"predicate": "next_best_action", "args": ["acme_corp"]}],
"format": "natural",
},
).json()
# With format="natural", use formattedText directly
context_text = ctx.get("formattedText", "")
# Or with format="structured" (default), build lines from entries
# context_lines = [
# f"- {entry['predicate']}({', '.join(entry['args'])})"
# for entry in ctx["entries"]
# ]
# context_text = "\n".join(context_lines)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Use only this reduced context:\n" + context_text},
{"role": "user", "content": "Write the next support reply."},
],
)

Where The Other Surfaces Fit
- API: best when your app already owns the turn array and prompt assembly. Use POST /memory/context for both simple salience windows and goal-driven optimization.
- Python SDK: best when you want client.context(), diff_context(), and session cleanup in app code.
- TypeScript SDK: best when you want client.context(), diffContext(), and clearContextSession() directly in app code.
- MCP: best when your agent runtime already uses tool calling. The MCP context tool now supports goals, sessionId, format, and includeRules — so you get the full optimization pipeline without leaving MCP.
If You Want The Backend Details
Predicates, rules, scopes, salience scoring, truth maintenance, and memory lifecycle still exist. They just belong in the backend explanation, not at the front of the product story.
If that is what you need next, go to How It Works on the Backend.