How Chat History Compaction Works in OpenCaddis
Long conversations are a feature, not a bug. But every LLM has a context window, and when your agent is 200 messages deep into a project, something has to give. Chat history compaction is our answer: automatic, LLM-powered summarization that keeps agents effective across marathon sessions without losing the thread of what matters.
In this post, I'll walk through the compaction system we built for FabrCore and OpenCaddis — the problem it solves, how the algorithm works, the safety mechanisms that protect conversation integrity, and how to configure it for your own agents.
The Problem: Context Windows Have a Ceiling
Every message in an agent conversation costs tokens. The system prompt, every user question, every assistant response, every tool call and its result — it all goes into the context window. For a simple Q&A chat, that's fine. But OpenCaddis agents are designed for sustained work: multi-hour research sessions, ongoing project assistance, personal secretary duties that span days.
Here's what happens without compaction:
- Token costs climb with every message. Each new message sends the entire history to the LLM: a 100-message conversation means you're paying for all 100 messages again on message 101. Per-request cost grows linearly with history length, so cumulative spend grows quadratically.
- You hit the context ceiling. Even GPT-4o's 128K context window fills up. When it does, the API returns an error and the conversation breaks.
- Older context becomes noise. That file listing from 80 messages ago? The tool call result that returned 5,000 characters of JSON? It's still in the context, consuming tokens that would be better spent on the current task.
The naive solution is to truncate — just drop old messages. But truncation is lossy in the worst way: it discards decisions, forgets context the user explicitly established, and can leave orphaned tool messages that break the API contract. We needed something smarter.
The Design: Summarize, Don't Truncate
Compaction uses the same LLM that powers the agent to compress its own history. The idea is simple: when the conversation gets too long, ask the model to produce a concise summary of the older messages, then replace those messages with the summary. The recent messages stay intact — they're the ones most likely to matter for the current task.
Compaction is not a background job or a scheduled task. It runs inline, before each message is processed, and only triggers when the token estimate exceeds a configurable threshold. If the conversation fits comfortably in the context window, compaction does nothing.
At a high level, the pipeline looks like this:
- Flush any pending messages so the history in storage is complete
- Estimate the token count of the full history
- Compare the estimate against the configured threshold (early exit if under)
- Split the history into older messages to summarize and recent messages to keep
- Summarize the older messages with the agent's own LLM
- Atomically replace the history with the summary plus the kept messages
The Implementation: Step by Step
The compaction engine lives in CompactionService in the FabrCore SDK. Here's exactly what happens when an agent processes a message.
1. Flush Pending Messages
Before compaction can assess the conversation size, it needs the complete picture. FabrCore's chat history system (FabrChatHistoryProvider) buffers recent messages in memory before persisting them to Orleans grain state. Compaction starts by flushing any pending messages so the token estimate covers everything:
if (provider.HasPendingMessages)
{
await provider.FlushAsync(ct);
}
var messages = await provider.GetStoredMessagesAsync();
The flush is thread-safe — it uses a lock internally and re-queues messages on failure to prevent data loss. GetStoredMessagesAsync then bypasses the in-memory cache and reads directly from the persisted grain state, giving us the authoritative message list.
2. Estimate Token Count
We need to know how much of the context window is consumed. Rather than making an API call to count tokens (which would add latency and cost on every message), we use a fast heuristic:
private static int EstimateTokens(List<StoredChatMessage> messages)
{
var totalChars = messages.Sum(m =>
(m.Role?.Length ?? 0) +
(m.AuthorName?.Length ?? 0) +
(m.ContentsJson?.Length ?? 0));
return totalChars / 4;
}
The totalChars / 4 heuristic approximates 1 token per 4 characters. It's not exact — real tokenization varies by model — but it's consistent and fast. For compaction's purpose, we don't need precision. We need to know "are we getting close to the limit?" and the heuristic answers that well enough. The threshold-based approach (default 75%) provides a comfortable buffer that absorbs the estimation error.
3. Check the Threshold
With the token estimate in hand, the service compares it against the configured limit:
var estimatedTokens = EstimateTokens(messages);
var threshold = config.MaxContextTokens.Value * config.Threshold;
if (estimatedTokens <= threshold)
{
return new CompactionResult
{
WasCompacted = false,
OriginalMessageCount = messages.Count,
EstimatedTokensBefore = estimatedTokens
};
}
If the default model has ContextWindowTokens: 128000 and the threshold is 0.75, compaction triggers at ~96,000 estimated tokens. This early exit means compaction has near-zero overhead on most messages — it's just a sum and a comparison.
4. Split the History
When compaction triggers, we need to decide where to cut. The KeepLastN parameter (default 20) controls how many recent messages are preserved verbatim:
var keepCount = Math.Min(config.KeepLastN, messages.Count);
var splitIndex = messages.Count - keepCount;
// If KeepLastN covers all messages but we're over threshold,
// reduce the keep window so we actually compact something.
// Always keep at least 2 messages (the most recent exchange).
if (splitIndex == 0 && messages.Count > 2)
{
keepCount = Math.Max(2, messages.Count / 2);
splitIndex = messages.Count - keepCount;
}
// Adjust split point forward past any orphaned "tool" role messages.
// Tool messages must follow their assistant message with tool_calls —
// if we split between them, the API rejects the orphaned tool result.
while (splitIndex < messages.Count &&
string.Equals(messages[splitIndex].Role, "tool",
StringComparison.OrdinalIgnoreCase))
{
splitIndex++;
}
var toSummarize = messages.Take(splitIndex).ToList();
var toKeep = messages.Skip(splitIndex).ToList();
There's an edge case here: if KeepLastN covers all the messages but the token estimate still exceeds the threshold (e.g., 15 messages with very large tool call results), the service reduces the keep window to half the message count (minimum 2). This ensures compaction always makes progress rather than silently skipping when the recent messages themselves are too large.
The tool message guard is equally critical. LLM APIs require that tool role messages immediately follow the assistant message that generated the tool call. If we split the history between an assistant tool call and its tool result, the API returns a validation error. The while loop walks the split point forward past any orphaned tool messages, ensuring the kept messages always start with a clean boundary.
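To see both the keep-window fallback and the boundary guard in action, here's a self-contained sketch. The message list is reduced to bare role strings; StoredChatMessage and the rest of CompactionService are elided:

```csharp
using System;
using System.Collections.Generic;

public static class SplitDemo
{
    // Mirrors the split logic above, operating on bare role strings.
    public static int AdjustSplit(List<string> roles, int keepLastN)
    {
        var keepCount = Math.Min(keepLastN, roles.Count);
        var splitIndex = roles.Count - keepCount;

        // KeepLastN covers everything: shrink the keep window (min 2)
        // so compaction still makes progress.
        if (splitIndex == 0 && roles.Count > 2)
        {
            keepCount = Math.Max(2, roles.Count / 2);
            splitIndex = roles.Count - keepCount;
        }

        // Walk forward past orphaned "tool" results so the kept messages
        // never start with a tool message whose parent was summarized away.
        while (splitIndex < roles.Count &&
               string.Equals(roles[splitIndex], "tool",
                   StringComparison.OrdinalIgnoreCase))
        {
            splitIndex++;
        }
        return splitIndex;
    }
}
```

With six messages ("user", "assistant", "assistant", "tool", "user", "assistant") and KeepLastN = 3, the naive split lands on index 3 — a tool result — so the guard moves it to index 4, keeping the final user/assistant exchange intact.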
5. Summarize with the LLM
The older messages are formatted as text and sent to the same LLM model the agent uses, with a carefully crafted prompt:
"Summarize the following conversation history concisely. Preserve:
- Key decisions and conclusions
- Important facts, names, and numbers
- Outstanding tasks or open questions
- The overall topic and context
Return ONLY the summary, no preamble."
The summarization call uses a MaxOutputTokens of 2048 — enough for a thorough summary but not so large that the summary itself becomes a token burden. The messages are extracted from their serialized AIContent JSON format into readable text (role: content format) before being passed to the LLM.
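Putting the prompt, the role: content formatting, and the output cap together, the summarization step might be sketched like this. The complete delegate is a stand-in for the agent's chat client (which the real service reuses); only the prompt text and the 2048-token cap come from the actual implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class Summarizer
{
    const string Instructions =
        "Summarize the following conversation history concisely. Preserve:\n" +
        "- Key decisions and conclusions\n" +
        "- Important facts, names, and numbers\n" +
        "- Outstanding tasks or open questions\n" +
        "- The overall topic and context\n" +
        "Return ONLY the summary, no preamble.";

    // complete(prompt, maxOutputTokens) stands in for the LLM call.
    public static async Task<string> SummarizeAsync(
        IEnumerable<(string Role, string Content)> older,
        Func<string, int, Task<string>> complete)
    {
        var sb = new StringBuilder(Instructions).Append("\n\n");
        foreach (var (role, content) in older)
            sb.Append(role).Append(": ").Append(content).Append('\n'); // "role: content" lines
        return await complete(sb.ToString(), 2048); // MaxOutputTokens cap
    }
}
```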
6. Atomic Replacement
With the summary in hand, the service constructs a new message list and atomically replaces the conversation history:
// Build the compacted history
var summaryMessage = new StoredChatMessage
{
Role = "system",
ContentsJson = /* serialized "[Compacted History]\n{summary}" */
};
var newMessages = new List<StoredChatMessage> { summaryMessage };
newMessages.AddRange(toKeep);
// Replace atomically
await provider.ReplaceAndResetCacheAsync(newMessages);
ReplaceAndResetCacheAsync does three things in sequence: replaces all messages in the Orleans grain state, resets the local in-memory cache to the new messages, and clears the pending buffer. The atomic nature of this operation is important — if the replacement fails partway through, the original messages are still intact in grain state.
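The ordering is what makes this safe: persist first, then mutate in-memory state. A toy model of that contract — the names are simplified stand-ins for the real FabrChatHistoryProvider internals, with a flag to simulate a storage fault:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// If the persistent write throws, neither the cache nor the pending
// buffer is touched, so the original history survives intact.
public class ToyHistoryStore
{
    private readonly object _lock = new();
    public List<string> Persisted { get; private set; } = new();
    public List<string> Cache { get; private set; } = new();
    public List<string> Pending { get; } = new();
    public bool FailPersist; // simulate a storage fault

    public Task ReplaceAndResetCacheAsync(List<string> newMessages)
    {
        if (FailPersist)
            throw new InvalidOperationException("storage fault");
        Persisted = new List<string>(newMessages); // 1. replace grain state
        lock (_lock)
        {
            Cache = new List<string>(newMessages); // 2. reset local cache
            Pending.Clear();                       // 3. clear pending buffer
        }
        return Task.CompletedTask;
    }
}
```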
How Agents Call Compaction
In the FabrCore SDK, compaction is exposed through FabrAgentProxy — the base class for all agent implementations. A single protected method, TryCompactAsync(), handles everything:
var compaction = await TryCompactAsync(
onCompacting: () => ThinkingNotifier.SendThinkingAsync(
fabrAgentHost, "Compacting history..."));
if (compaction?.WasCompacted == true)
{
await ThinkingNotifier.SendThinkingAsync(fabrAgentHost,
$"Compacted history: {compaction.OriginalMessageCount} → {compaction.CompactedMessageCount} messages");
}
This pattern appears in both AssistantAgent and DelegateAgent. The method is designed to be safe to call on every message:
- Returns null if compaction is not configured (no ContextWindowTokens set)
- Returns a result with WasCompacted = false if the threshold wasn't exceeded
- Catches and logs exceptions without breaking the message flow
- Lazily initializes the CompactionService on first call
The onCompacting callback lets each agent type notify its user in the appropriate way. Assistant agents use ThinkingNotifier; Delegate agents use their own SendThinkingAsync helper. In the UI, the user sees a brief "Compacting history..." indicator, followed by the result count.
Configuration: Build from Agent Args
Compaction configuration lives in the agent's Args dictionary, parsed by BuildCompactionConfigAsync() in the FabrCore SDK:
private async Task<CompactionConfig> BuildCompactionConfigAsync()
{
var args = AgentConfiguration.Args ?? new();
var enabled = !args.TryGetValue("CompactionEnabled", out var e)
|| !bool.TryParse(e, out var b) || b;
var keepLastN = args.TryGetValue("CompactionKeepLastN", out var k)
&& int.TryParse(k, out var n) ? n : 20;
// Fall back to model configuration's ContextWindowTokens
int? maxTokens = args.TryGetValue("CompactionMaxContextTokens", out var m)
&& int.TryParse(m, out var t) ? t : null;
if (maxTokens == null)
maxTokens = modelConfig?.ContextWindowTokens;
var threshold = args.TryGetValue("CompactionThreshold", out var th)
&& double.TryParse(th, out var d) ? d : 0.75;
return new CompactionConfig { ... };
}
The fallback chain means you typically only need to set ContextWindowTokens on your model in fabr.json — the defaults handle the rest. Per-agent overrides are there when you need them, like a conservative threshold for a long-running research agent or aggressive compaction for a quick-chat agent on a smaller model.
| Strategy | Threshold | KeepLastN | Good For |
|---|---|---|---|
| Standard | 0.75 | 20 | Most agents — balanced cost and context |
| Conservative | 0.90 | 30+ | Research agents that need deep context |
| Aggressive | 0.50 | 10 | Quick-chat agents on smaller models |
| Disabled | — | — | Short conversations that never approach limits |
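For example, the Aggressive row above maps to agent Args like this. The key names are the ones BuildCompactionConfigAsync parses; the surrounding agent definition shape is illustrative:

```json
{
  "Name": "quick-chat",
  "Args": {
    "CompactionEnabled": "true",
    "CompactionThreshold": "0.50",
    "CompactionKeepLastN": "10",
    "CompactionMaxContextTokens": "16000"
  }
}
```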
Safety Mechanisms
Compaction touches the most sensitive part of an agent — its conversation memory. We built several safety mechanisms to make it reliable:
Tool Message Integrity
LLM APIs enforce a strict contract: tool role messages must follow their parent assistant message with tool_calls. A split between them produces an API error. The split-point adjustment loop guarantees this never happens by walking forward past any tool messages at the boundary.
Atomic State Replacement
Message replacement goes through ReplaceAndResetCacheAsync, which replaces the entire Orleans grain state in one operation. If the operation fails, the original messages remain in grain state unchanged. The local cache is only reset after a successful replacement.
Flush-Before-Read
Any pending messages are flushed to grain state before compaction reads the full history. This prevents a scenario where recent messages exist only in the in-memory buffer and get lost during the replacement. The flush itself uses lock-based synchronization and re-queues on failure.
Graceful Failure
TryCompactAsync() wraps the entire compaction call in a try-catch. If the summarization LLM call fails, the JSON deserialization breaks, or anything else goes wrong, the error is logged and the agent continues with the original uncompacted history. The user never sees an error — the worst case is slightly higher token usage until the next message triggers a retry.
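The shape of that guard is simple enough to sketch — a simplified version of the pattern; the real method lives on FabrAgentProxy and logs through the SDK's logger:

```csharp
using System;
using System.Threading.Tasks;

public static class SafeCompaction
{
    // Sketch of the graceful-failure contract: any exception from the
    // compaction step is logged and swallowed; the caller just sees null
    // and continues with the original, uncompacted history.
    public static async Task<T> TryRunAsync<T>(
        Func<Task<T>> compact, Action<Exception> log) where T : class
    {
        try
        {
            return await compact();
        }
        catch (Exception ex)
        {
            log(ex); // worst case: higher token usage until the next retry
            return null;
        }
    }
}
```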
Architecture: Where It Sits
Compaction spans two layers of the OpenCaddis stack:
| Layer | Component | Role |
|---|---|---|
| FabrCore SDK | CompactionService | Core algorithm: token estimation, message splitting, LLM summarization, history replacement |
| FabrCore SDK | FabrAgentProxy | Base class integration: TryCompactAsync(), lazy initialization, config building from agent args |
| FabrCore SDK | FabrChatHistoryProvider | Storage layer: flush, read, and atomic replace operations on the message thread |
| OpenCaddis | AssistantAgent | Calls TryCompactAsync() before each message with thinking notifications |
| OpenCaddis | DelegateAgent | Calls TryCompactAsync() before routing with its own notification pattern |
| OpenCaddis | Settings UI | Exposes ContextWindowTokens in the model configuration panel |
The design is intentionally layered. CompactionService is a pure service with no knowledge of agents — it takes a config, a chat history provider, and a chat client, and does the work. FabrAgentProxy handles the integration plumbing. And the concrete agents (Assistant, Delegate) decide when to call it and how to notify the user. This means any new agent type built on FabrCore gets compaction for free by calling TryCompactAsync().
What It Looks Like in Practice
Here's what users see when compaction runs during a long conversation:
> User: Now analyze the third competitor from our research list.
>
> Compacting history... Compacted history: 147 → 22 messages
>
> Agent: Based on our research framework and the two competitors already analyzed, here's the breakdown for Acme Corp...
147 messages compressed to 22 — the system message with the summary plus the 20 most recent messages and the current one. The agent's response shows it retained the context from earlier in the conversation ("our research framework", "two competitors already analyzed") even though those messages were summarized.
The CompactionResult gives you the numbers:
{
    WasCompacted = true,
    OriginalMessageCount = 147,
    CompactedMessageCount = 22,
    EstimatedTokensBefore = 98420,
    EstimatedTokensAfter = 14850
}
Getting Started
Compaction is enabled by default in OpenCaddis. The only thing you need to do is set ContextWindowTokens on your model configuration so the system knows the ceiling:
{
"Name": "default",
"Provider": "Azure",
"Model": "gpt-4o",
"ContextWindowTokens": 128000,
// ... other settings
}
Or set it in the OpenCaddis Settings UI under the Context Window field on the model configuration tab.
That's it. Your agents will automatically compact when they approach the context limit. For fine-tuning, see the compaction configuration docs.
Looking Ahead
Compaction is one piece of the context management puzzle. The current summarization approach works well for general conversations, but there's room to evolve: smarter chunking that understands tool call boundaries better, multi-pass summarization for very long histories, and configurable summary prompts that agents can tailor to their domain.
The foundation is in place in the FabrCore SDK, and any improvements there flow through to every OpenCaddis agent automatically. If you're building agents that need to sustain long conversations — research assistants, project coordinators, personal secretaries — compaction gives them the endurance to keep going without losing the thread.
Check out the OpenCaddis source on GitHub and the FabrCore framework to see the full implementation.
Builder of OpenCaddis and the FabrCore framework.