Building a Chain-of-Thought Reasoning Agent on FabrCore: What We Learned About LLM Performance

OpenCaddis Team · February 21, 2026, 2:35 PM · 12 min read

We recently built a new agent type for OpenCaddis — the ChainOfThoughtAgent — that breaks complex questions into steps, executes them, and iteratively refines its answer until it reaches a confidence threshold. Along the way, we discovered some important lessons about LLM performance on Azure OpenAI that every developer building agentic systems should know.

The Architecture

The CoT agent follows a simple loop: Assess → Plan → Execute → Synthesize → (optionally Replan). Each phase makes a structured JSON call to the LLM. The agent tracks a confidence score and uses a linearly decaying threshold — starting strict (0.75) and relaxing to a floor (0.40) over up to 8 loops. Simple questions skip the loop entirely via a "fast path."
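Sketched in isolation, the decaying threshold is a simple linear interpolation. The start (0.75), floor (0.40), and loop cap (8) come from the description above; the exact decay curve is our illustrative assumption, not the verbatim OpenCaddis code:

```csharp
using System;

// Linearly decaying confidence threshold: strict on the first loop, relaxed
// toward the floor by the last. Start, floor, and loop cap match the values
// described above; the exact interpolation is an illustrative assumption.
static double ConfidenceThreshold(int loop, double start = 0.75, double floor = 0.40, int maxLoops = 8)
{
    double t = Math.Min(1.0, loop / (double)(maxLoops - 1));
    return start - (start - floor) * t;
}

Console.WriteLine(ConfidenceThreshold(0)); // 0.75 on the first loop
Console.WriteLine(ConfidenceThreshold(7)); // down to the 0.40 floor by the final loop
```

The effect: an answer good enough to ship on loop 6 would have triggered a replan on loop 1, so the agent keeps iterating only while iteration is still cheap relative to the quality bar.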

What makes this interesting is that every internal LLM call uses structured JSON output — we define C# types like AssessmentOutput, PlanOutput, and SynthesisOutput, generate JSON schemas from them, and pass them via ChatOptions.ResponseFormat. The LLM returns valid JSON every time, which we deserialize directly. No regex parsing, no prompt-and-pray.
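The pattern looks roughly like this. AssessmentOutput here is a simplified stand-in for the real type, and the schema/ChatOptions wiring is elided to a comment:

```csharp
using System.Text.Json;

// Because ResponseFormat constrains the model to the schema, the reply can be
// deserialized directly — no regex, no retry-on-parse-failure.
static AssessmentOutput ParseAssessment(string llmJson) =>
    JsonSerializer.Deserialize<AssessmentOutput>(llmJson,
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true })!;

Console.WriteLine(
    ParseAssessment("""{"complexity":"simple","confidence":0.9,"useFastPath":true}""").UseFastPath); // prints True

// A simplified stand-in for the real OpenCaddis type. A schema generated from
// it (e.g. via AIJsonUtilities.CreateJsonSchema) goes into ChatOptions.ResponseFormat.
record AssessmentOutput(string Complexity, double Confidence, bool UseFastPath);
```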

The execution engine supports parallel step execution via topological sorting. Steps declare their dependencies, and we compute execution layers — all steps in the same layer with no side effects run concurrently via Task.WhenAll.
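The layering computation is a Kahn-style peel of the dependency graph. Here is a self-contained sketch with string step IDs standing in for the real step objects:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Peel the dependency graph into layers: each layer holds every step whose
// dependencies are already satisfied, so its members can run concurrently
// (in the real agent, only if they declare no side effects).
static List<List<string>> ComputeLayers(Dictionary<string, string[]> dependsOn)
{
    var remaining = dependsOn.ToDictionary(kv => kv.Key, kv => kv.Value.ToHashSet());
    var layers = new List<List<string>>();
    while (remaining.Count > 0)
    {
        var layer = remaining.Where(kv => kv.Value.Count == 0)
                             .Select(kv => kv.Key)
                             .ToList();
        if (layer.Count == 0)
            throw new InvalidOperationException("Cycle in step dependencies.");
        layers.Add(layer);
        foreach (var id in layer) remaining.Remove(id);
        foreach (var unmet in remaining.Values) unmet.ExceptWith(layer);
    }
    return layers;
}

// "a" and "b" are independent, so they share a layer and could run via Task.WhenAll.
var layers = ComputeLayers(new()
{
    ["a"] = Array.Empty<string>(),
    ["b"] = Array.Empty<string>(),
    ["c"] = new[] { "a", "b" },
});
Console.WriteLine(string.Join(" | ", layers.Select(l => string.Join(",", l))));
```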

Discovery 1: GPT-5 Models Are Significantly Slower Than GPT-4

Our first surprise was latency. After upgrading from GPT-4o to GPT-5-nano on Azure, every call took 13–47 seconds. The same structured output calls that completed in 2–5 seconds on GPT-4o were now roughly an order of magnitude slower.

This isn't a bug — it's by design. GPT-5 models use internal reasoning (chain-of-thought) even when you don't ask for it. Every API call spends time "thinking" before generating output. On a CoT agent that makes 8+ LLM calls per user message, this compounds fast.

Lesson: For agentic workloads with many sequential LLM calls, model selection matters enormously. A faster model with slightly lower quality may produce better end-to-end results because the agent gets more iterations in the same wall-clock time.

Discovery 2: ReasoningEffort.None Is a Game-Changer

The Microsoft.Extensions.AI abstraction library (v10.3.0, shipping in-box with .NET 10) exposes a ChatOptions.Reasoning property with an Effort enum: None, Low, Medium, High, ExtraHigh.

Setting ReasoningEffort.None on our structured output calls cut latency substantially — typically 30–40%, and our Plan phase dropped even further, from 46.6s to 19.5s. This makes sense: our CoT agent is already doing the reasoning at the orchestration level. We don't need the LLM to reason internally about how to fill in a JSON schema; we just need it to follow the schema and produce output.

var chatOptions = new ChatOptions
{
    // Structured JSON output: the model must conform to this schema.
    ResponseFormat = ChatResponseFormat.ForJsonSchema(schema, name, desc),
    Reasoning = new ReasoningOptions
    {
        // Skip internal chain-of-thought; the agent reasons at the orchestration layer.
        Effort = ReasoningEffort.None,
        Output = ReasoningOutput.None
    },
    AllowMultipleToolCalls = false
};

Lesson: If your agent architecture handles reasoning at the orchestration layer, disable LLM-level reasoning for the individual calls. You're paying double for reasoning you don't need.

Discovery 3: Not All ChatOptions Work With All Models

We tried adding Temperature = 0f for deterministic output. GPT-5-nano rejected it:

HTTP 400: 'temperature' does not support 0 with this model.
Only the default (1) value is supported.

GPT-5 reasoning models (like the o-series before them) lock temperature to 1. Similarly, we tried MaxOutputTokens = 2000 to reduce output size, but combining it with ReasoningEffort.None and structured JSON output caused every response to return empty (Content-Length: 899, zero actual content). The combination was silently incompatible.

Lesson: Test ChatOptions combinations carefully on your target model. Options that work independently may break when combined. Build your agent to gracefully handle these constraints.
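One way to honor that lesson is to retry once with the rejected option stripped whenever the model returns an error naming it. This is our sketch of a defensive pattern, not the actual OpenCaddis implementation; the delegates keep it independent of any particular chat client:

```csharp
using System;
using System.Threading.Tasks;

// If the model rejects a request because of one option (e.g. the HTTP 400 for
// temperature above), strip that option and retry once.
static async Task<T> RetryWithoutRejectedOptionAsync<T>(
    Func<Task<T>> call, Action stripRejectedOption, string rejectionMarker)
{
    try
    {
        return await call();
    }
    catch (Exception ex) when (ex.Message.Contains(rejectionMarker))
    {
        stripRejectedOption();  // e.g. () => options.Temperature = null
        return await call();    // one retry without the offending option
    }
}

// Hypothetical wiring against an IChatClient:
// var response = await RetryWithoutRejectedOptionAsync(
//     () => client.GetResponseAsync(messages, options),
//     () => options.Temperature = null,
//     "temperature");
```

The marker-matching is deliberately crude; a production version would inspect the provider's error code rather than the message text.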

Discovery 4: Structured Output Has a Hidden Schema Cache

Azure OpenAI converts JSON schemas to context-free grammars (CFGs) on the server side. The first request with a new schema incurs 10–60 seconds of additional latency. The schema is then cached with a ~120 second TTL.

Our CoT agent uses 5 different structured types (Assessment, Plan, StepExecution, Synthesis, Replan). Each one pays the schema compilation penalty on first use. On a cold start, the first full CoT run is significantly slower than subsequent runs.

Lesson: If you use structured output with multiple schemas, warm them on startup with throwaway requests. Reuse the same schema objects — don't regenerate them per request.
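A warm-up pass can be as simple as the sketch below. The delegate stands in for whatever sends a minimal structured-output request (e.g. a one-token prompt with that schema's ChatOptions); the exact wiring is an assumption on our part:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// On startup, fire one throwaway request per structured-output schema so the
// server compiles and caches each grammar before real traffic hits the cold path.
static async Task WarmSchemasAsync(
    IReadOnlyList<string> schemaNames,       // e.g. Assessment, Plan, StepExecution, ...
    Func<string, Task> sendThrowawayRequest) // e.g. name => client.GetResponseAsync("ping", OptionsFor(name))
{
    foreach (var name in schemaNames)
        await sendThrowawayRequest(name); // response is discarded; only the server-side cache matters
}
```

Note the ~120 second TTL: a long-idle deployment loses the warm cache, so periodic re-warming may be worth it for latency-sensitive services.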

Discovery 5: Output Size Is the Real Bottleneck

With reasoning effort disabled, the dominant factor in latency became output token generation. Our step execution calls were producing 7,000–18,000 character JSON responses. A step generating 18K chars took 47 seconds; one generating 7K took 20 seconds. The relationship is roughly linear.

We tried MaxOutputTokens to cap this, but it was incompatible with our other settings on GPT-5-nano. The effective solution is prompt-level guidance — telling the LLM in the system prompt to keep responses concise.

Lesson: For agentic workloads, optimize output size through prompt engineering. Every token the LLM generates costs wall-clock time. If you're feeding the output into the next step's context (not showing it to the user), brevity is a feature.
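For illustration, guidance along these lines goes into the system prompt of each internal call — the wording here is hypothetical, not the exact OpenCaddis prompt:

```csharp
// Hypothetical wording: an output-budget instruction for internal calls whose
// JSON feeds the next step's context rather than being shown to a user.
const string BrevityGuidance =
    "Keep every field brief: short phrases, not paragraphs. " +
    "Your output is consumed by another pipeline step, not read by a human, " +
    "so optimize for density over polish.";
```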

The Numbers

Our final benchmark for a medium-complexity question ("Compare microservices vs monolithic architecture for a startup"):

Metric             Value
-----------------  -----------------------
Total time         181 seconds (3:01)
LLM calls          8
Loops              1 (no replan needed)
Steps executed     4
Final confidence   0.78
Model              Azure OpenAI gpt-5-nano

The breakdown: Assess (13s) + Plan (19s) + 4 Steps (120s total) + Synthesize (18s) + Finalize (10s).

What's Next

The CoT agent is model-agnostic. We're testing it with Gemini 2.5 Flash and Grok 4, both of which should offer significantly faster inference. The agent architecture means you can swap the model without changing the reasoning logic — just update ModelConfig in the agent configuration.

We're also exploring reducing the number of sequential dependencies in plans. Our current run had all 4 steps in a chain (each depending on the previous), which prevented any parallelization. Prompt tuning to encourage the planner to create independent steps could cut execution time by 40–50% for appropriate questions.

The ChainOfThoughtAgent is available now in OpenCaddis. Configure it with "AgentType": "chainofthought" and start asking it hard questions.
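For reference, the two settings mentioned in this post sit together in the agent configuration. This is a minimal sketch of the shape; only AgentType and ModelConfig come from the post, and ModelConfig is left empty because its fields aren't covered here:

```json
{
  "AgentType": "chainofthought",
  "ModelConfig": {}
}
```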


Built with FabrCore and Microsoft.Extensions.AI on .NET 10.

OpenCaddis Team

Builders of OpenCaddis and the FabrCore framework.