Building in public

Why we killed our sufficiency check

Parsa BaratiJune 3, 20263 min read

latency
llm
engineering
building in public

iCog's chat was slow. Not "needs-a-spinner" slow. It was 17 to 40 seconds per reply slow, with a long tail that occasionally hit 190 seconds. For a product whose whole pitch is "an AI that remembers you," taking three minutes to remember anything is a bad look.

So I went hunting. What I found was three bugs stacked on top of each other, and one of them was a feature I'd been proud of.

The 300ms tax nobody was paying for

Every chat turn ran a sufficiency check: a small LLM classifier that looked at the recalled memories and decided, "do we have enough context to answer well, or should we go fetch more?" Elegant on the whiteboard. The kind of thing you put on a slide.

In production it had a 300ms timeout. I pulled the telemetry for a full day: 27 out of 27 turns hit that timeout. Every single one. The classifier never returned in time, so it contributed exactly zero signal while adding 300ms of dead air to the front of every reply.

It was a tax with no service behind it. I set ICOG_SUFFICIENCY_CHECK=off. Nothing got worse, because it was already doing nothing.

The lesson isn't "sufficiency checks are bad." It's that a clever idea that times out 100% of the time is indistinguishable from a sleep(300). Measure the thing, not the diagram.

The 40× fix was one routing setting

That was the small one. The big one was the model.

We route synthesis through gpt-oss-120b on OpenRouter. OpenRouter load-balances across providers, and by default it optimizes for price. That means your request lands on whichever provider is cheapest, not whichever is fastest. For an open-weight model served by a dozen providers of wildly varying speed, that's the difference between 1.5 seconds and 190.

One setting, provider: { sort: "throughput" }, tells OpenRouter to route to the fastest provider instead of the cheapest. The result, on the same prompt sizes:

Before: 17–40s typical, 100–190s tail
After: 753ms, 1.49s, 1.77s

Roughly a 10–60× reduction, from a one-line change. Latency was never about our code. It was about where the request landed.

The bug that returned empty answers

Then a third one surfaced while smoke-testing: some replies came back completely empty. The logs showed the model generating 512 tokens and stopping with finish=max_tokens, but the answer text was blank.

gpt-oss-120b is a reasoning model. It spends tokens thinking before it emits a single word of the answer. If you cap max_tokens too low, the reasoning eats the entire budget and the model hits the ceiling before it ever starts the reply. You get a confident, well-reasoned silence.

The fix was a floor: any task routed to a reasoning model gets at least 4096 tokens, enforced centrally so no call site can starve it again. Reasoning models change the rules. You budget for the thinking and the talking, not just the talking.

The throughline

Three bugs, one theme: the slow, broken, and silent parts of the system were all in the seams. The timeout on a clever classifier, the default sort on a router, a token cap that predated reasoning models. The cognition core was healthy the whole time. The latency lived in the plumbing.

This is the part of building an AI product that the demos never show: most of the work isn't the model. It's the 300ms here, the routing default there, the assumption that quietly stopped being true. iCog's job is to remember you; my job is to make sure it does it in under two seconds.

It does now.