The Real Context Workflow

Think in terms of turn reduction, not knowledge modeling. Your input is a big array of turns. Your output is a smaller context window for the next model call.

LLM required. The context workflow uses an LLM to extract structured facts from raw text. You must configure a provider before these endpoints will work. The simplest path is to install Ollama and run ollama pull granite3.3:8b (the Docker image connects to Ollama on your host by default). Alternatively, set OPENAI_API_KEY, ANTHROPIC_API_KEY, or LLM_BASE_URL/LLM_MODEL as environment variables. If using the CLI, run setup in the REPL to configure your provider interactively. Without a configured LLM, POST /context and POST /context/ingest will time out or return empty results.
Do not optimize the wrong thing. Most teams lose time trying to make the raw transcript "better." The practical problem is simpler: too many turns go into the prompt, so the model pays attention to noise and you pay for the noise too.

What The Problem Actually Looks Like

In production, a thread is rarely just user and assistant messages. It also includes tool results, CRM data, system events, previous summaries, internal guidance, and repeated restatements of the same issue.

{
  "turns": [
    "User: We still cannot log in after yesterday's Okta cutover.",
    "Agent: Pulling account metadata and auth logs.",
    "Tool crm_lookup: account=acme_corp tier=enterprise billing=current renewal=2026-07-01",
    "Tool auth_audit: 14 failed SAML assertions since 09:12 UTC; issuer mismatch detected.",
    "Internal note: Customer is not delinquent. Keep ticket in support queue.",
    "Previous ticket: promised service credit if outage exceeds 4 hours.",
    "Slack escalation: INC-4821 open; workaround is manual issuer override.",
    "User: Three teams lost admin access after yesterday's metadata change.",
    "Tool statuspage: degraded identity service in us-east-1.",
    "Agent: Need a concise handoff context before the next model call."
  ]
}

That array is still far smaller than what many teams send in reality. The important point is that it already contains overlap, stale details, and the same issue stated in different ways.


Step 1: First Reduction With POST /context

Requires LLM extraction. POST /context depends on the LLM provider configured above (local Ollama by default, or a cloud key such as -e ANTHROPIC_API_KEY / -e OPENAI_API_KEY on the Docker image). First calls can take 10–30s on local Ollama hardware. If you just want to query pre-asserted facts without an LLM, skip to Step 2 and use POST /memory/context.

Use POST /context when you want the first compact pass. Send raw turns. Get back the normalized state that seems most important. This endpoint extracts facts from your turns and stores them in the knowledge base.

Always include scope and sessionId so the server can track the conversation across turns and compute deltas for Steps 2–4. Use the same value for both (e.g. your ticket or thread ID).

curl -X POST http://localhost:9300/context \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "turns": [
      "User: We still cannot log in after yesterday'\''s Okta cutover.",
      "Tool crm_lookup: account=acme_corp tier=enterprise billing=current",
      "Tool auth_audit: issuer mismatch detected after IdP migration.",
      "Slack escalation: workaround is manual issuer override."
    ],
    "scope": "support-thread-4821",
    "sessionId": "support-thread-4821",
    "maxFacts": 12
  }'
{
  "facts": [
    { "predicate": "HasRole", "args": ["acme_corp", "Enterprise Account"], "salience": 0.64 },
    { "predicate": "Identified", "args": ["Auth audit", "Issuer mismatch"], "salience": 0.64 },
    { "predicate": "Proposed", "args": ["Workaround", "Manual issuer override"], "salience": 0.64 },
    { "predicate": "HasStatus", "args": ["Login", "Failed"], "salience": 0.64 }
  ],
  "factsReturned": 4,
  "contradictions": 0,
  "newFactsExtracted": 4,
  "briefingDelta": "## Okta Cutover\nLogin is failing after the Okta cutover...",
  "sessionId": "support-thread-4821"
}
Important: you do not need to invent predicate names up front. The LLM extracts and normalizes structure from your turns. Exact predicate names and output phrasing vary by model. First calls may take 10–30 seconds depending on your LLM provider.

Tracking A Conversation Across Many Turns

For multi-turn agent loops you usually want three things at once: conversation facts partitioned from tenant-wide knowledge, prior turns visible to the extractor so pronouns resolve, and a next model call that sees only what's new. POST /context handles all three with two fields.

The Three-Layer Mental Model

It helps to think about facts in three layers, mapped onto NocturnusAI primitives that already exist:

  • Tenant layer — durable knowledge that survives across conversations: customer profiles, product catalogs, policies. Stored under a tenantId with scope = null. Selected via the X-Tenant-ID header.
  • Conversation layer — facts extracted from one specific dialog. Use the conversation id as the scope on every assertion so the conversation is a logical partition you can list, diff, merge, or delete in one shot.
  • Derived layer — facts produced by inference rules off the other two. The provenance tracker keeps these consistent automatically; you don't manage them by hand.

The recommended pattern is to use the same string as both the scope and the sessionId. The scope partitions the facts; the sessionId keys the diff snapshot. Two clean roles, one identifier.
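A minimal sketch of that pattern (the helper name is hypothetical; the payload fields come from the request examples on this page):

```python
def context_payload(conversation_id: str, turns: list[str],
                    max_facts: int = 12) -> dict:
    """Build a POST /context body that reuses one identifier for both roles:
    scope partitions the extracted facts, sessionId keys the diff snapshot."""
    return {
        "turns": turns,
        "scope": conversation_id,
        "sessionId": conversation_id,
        "maxFacts": max_facts,
    }
```

Deriving both fields from one argument makes it impossible for the partition and the snapshot to drift apart mid-conversation.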

Incremental Updates Per Turn

Pass sessionId and scope on every /context call for the same conversation. The server then:

  • Tags every extracted fact and rule with that scope, so the conversation stays partitioned from tenant-wide knowledge.
  • Keeps a small ring buffer of the last few turns under that key and feeds them to the extractor as contextHint, so references like "the same customer" resolve correctly.
  • Snapshots the optimized window per sessionId, so the next call can compute a delta against it.

You can override the auto-built hint by sending contextHint explicitly. You can omit scope if you want all conversation facts to land under the tenant layer (not recommended for long-running agents — it conflates customers).
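For example, a turn-2 request body that supplies its own hint rather than relying on the server's ring buffer (the field names match the ones used elsewhere on this page; the hint text itself is illustrative):

```python
payload = {
    "turns": ["User: Same customer as before, and now billing admins are locked out too."],
    "scope": "support-thread-4821",
    "sessionId": "support-thread-4821",
    # Explicit hint overriding the server's auto-built one, so "same customer"
    # resolves even if the ring buffer has rotated past the relevant turns.
    "contextHint": "Prior turns: acme_corp cannot log in after an Okta cutover.",
    "maxFacts": 12,
}
```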

The briefingDelta Field

From the second turn onward (i.e. once a snapshot exists for the sessionId), the response includes a briefingDelta: a short LLM-formatted natural-language briefing of just the facts that are new this turn. This is what you want to add to the next model prompt — not the entire window again.

# Turn 1 — first call for this conversation
curl -X POST http://localhost:9300/context \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "turns": [
      "User: We still cannot log in after yesterday'\''s Okta cutover.",
      "Tool crm_lookup: account=acme_corp tier=enterprise"
    ],
    "scope": "support-thread-4821",
    "sessionId": "support-thread-4821",
    "maxFacts": 12
  }'
# Turn 2 — same scope + sessionId; the server pulls turn 1 as contextHint
curl -X POST http://localhost:9300/context \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "turns": [
      "Tool auth_audit: 14 failed SAML assertions; issuer mismatch detected.",
      "Slack escalation: workaround is manual issuer override."
    ],
    "scope": "support-thread-4821",
    "sessionId": "support-thread-4821",
    "maxFacts": 12
  }'
{
  "facts": [ /* full optimized window */ ],
  "factsReturned": 7,
  "newFactsExtracted": 3,
  "briefingDelta": "## Authentication\nThe auth_audit tool detected an issuer mismatch after 14 failed SAML assertions. The accepted workaround is a manual issuer override.",
  "sessionId": "support-thread-4821"
}

Pass briefingDelta straight into the next model call as a short system message. The full facts array is still available when you need to inspect or audit.
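That decision can be folded into a small helper (the function name is hypothetical; briefingDelta and facts are the response fields shown above). On turn 1 no snapshot exists yet, so it falls back to flattening the full window:

```python
def prompt_context(response: dict) -> str:
    """Return the cheapest useful context for the next model call."""
    delta = response.get("briefingDelta")
    if delta:  # turn 2+: send only what changed
        return delta
    # turn 1: no snapshot yet, so flatten the full optimized window
    return "\n".join(
        f"- {fact['predicate']}({', '.join(fact['args'])})"
        for fact in response.get("facts", [])
    )
```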

Cleaning Up A Finished Conversation

When the conversation is done, delete the scope and clear the snapshot in two calls:

curl -X DELETE http://localhost:9300/scope/support-thread-4821 \
  -H "X-Tenant-ID: default"

curl -X POST http://localhost:9300/context/session/clear \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"sessionId":"support-thread-4821"}'

Step 2: Ask The Next Question With POST /memory/context

The first pass is broad. The next pass should be question-specific. Once you know what the next model call is trying to do, use /memory/context with goals to narrow the window further.

POST /memory/context is the unified context endpoint. Simple requests (just maxFacts/minSalience) use a fast salience-ranked path. Adding goals, sessionId, or relevanceBuckets automatically triggers the full optimization engine — goal-driven backward chaining, contradiction handling, and session snapshots.

curl -X POST http://localhost:9300/memory/context \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "sessionId": "support-thread-4821",
    "maxFacts": 10,
    "goals": [
      {"predicate":"next_best_action","args":["acme_corp"]},
      {"predicate":"service_credit_applicable","args":["acme_corp"]}
    ],
    "format": "natural"
  }'

This is where the context window becomes operational instead of merely descriptive. The model stops seeing the whole incident and starts seeing the subset needed to answer the next action question.

The format parameter controls output shape: "natural" returns a pre-formatted formattedText field ready to paste into a system prompt, while "structured" (the default) returns the entries array for programmatic use.

Deprecation notice: POST /context/optimize still works but is deprecated (sunset 2026-07-01). It returns a Deprecation: true header. Migrate to POST /memory/context with the same parameters.

Step 3: Stop Re-Sending The Same Context With POST /context/diff

This is the step most teams miss. If the conversation continues, do not send the whole optimized window again. Keep the same sessionId and ask for the delta.

curl -X POST http://localhost:9300/context/diff \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{
    "sessionId": "support-thread-4821",
    "maxFacts": 10
  }'
{
  "previousWindowId": "ctx-01",
  "currentWindowId": "ctx-02",
  "added": [
    { "predicate": "temporary_access_restored", "args": ["acme_corp", "true"], "salience": 0.92 }
  ],
  "removed": [],
  "unchanged": 9,
  "fullRefreshRecommended": false
}

That is the production benefit: later model calls pay for the change, not for the entire thread history again.
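If you cache the window locally, the diff can be applied in place instead of refetched. A sketch under the assumption that a fact is identified by its predicate plus args; the added, removed, and fullRefreshRecommended fields match the response above, and treating a recommended refresh as "refetch instead of patch" is this sketch's choice:

```python
def apply_diff(window: list[dict], diff: dict) -> list[dict]:
    """Update a cached context window from a /context/diff response."""
    if diff.get("fullRefreshRecommended"):
        raise ValueError("diff too large; refetch the full window instead")
    removed = {(f["predicate"], tuple(f["args"])) for f in diff.get("removed", [])}
    kept = [f for f in window
            if (f["predicate"], tuple(f["args"])) not in removed]
    return kept + diff.get("added", [])
```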


Step 4: End The Thread Cleanly

When the thread is finished, clear the diff snapshot:

curl -X POST http://localhost:9300/context/session/clear \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: default" \
  -d '{"sessionId":"support-thread-4821"}'

Formatting For The Model

Most teams take the returned entries and flatten them into a short system or tool message. The model does not need the original transcript if the compact context already captures the operational state.

The easiest approach: pass "format": "natural" and use the formattedText field directly in your system prompt. No manual formatting needed.

Using the Python SDK (recommended):

from nocturnusai import SyncNocturnusAIClient
from openai import OpenAI

nocturnus = SyncNocturnusAIClient("http://localhost:9300", tenant_id="default")
openai_client = OpenAI()

ctx = nocturnus.context(
    session_id="support-thread-4821",
    max_facts=10,
    goals=[{"predicate": "next_best_action", "args": ["acme_corp"]}],
    format="natural",
)
context_text = ctx.formatted_text or ""

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Use only this reduced context:\n" + context_text},
        {"role": "user", "content": "Write the next support reply."},
    ],
)

Or with raw HTTP if you prefer no SDK dependency:

import requests
from openai import OpenAI

client = OpenAI()

ctx = requests.post(
    "http://localhost:9300/memory/context",
    headers={"X-Tenant-ID": "default"},
    json={
        "sessionId": "support-thread-4821",
        "maxFacts": 10,
        "goals": [{"predicate": "next_best_action", "args": ["acme_corp"]}],
        "format": "natural",
    },
).json()

# With format="natural", use formattedText directly
context_text = ctx.get("formattedText", "")

# Or with format="structured" (default), build lines from entries
# context_lines = [
#     f"- {entry['predicate']}({', '.join(entry['args'])})"
#     for entry in ctx["entries"]
# ]
# context_text = "\n".join(context_lines)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Use only this reduced context:\n" + context_text},
        {"role": "user", "content": "Write the next support reply."},
    ],
)

Where The Other Surfaces Fit

  • API: best when your app already owns the turn array and prompt assembly. Use POST /memory/context for both simple salience windows and goal-driven optimization.
  • Python SDK: best when you want client.context(), diff_context(), and session cleanup in app code.
  • TypeScript SDK: best when you want client.context(), diffContext(), and clearContextSession() directly in app code.
  • MCP: best when your agent runtime already uses tool calling. The MCP context tool now supports goals, sessionId, format, and includeRules — so you get the full optimization pipeline without leaving MCP.

If You Want The Backend Details

Predicates, rules, scopes, salience scoring, truth maintenance, and memory lifecycle still exist. They just belong in the backend explanation, not at the front of the product story.

If that is what you need next, go to How It Works on the Backend.
