COGNITIVEX · BENCHMARK

CognitiveX on LongMemEval

LongMemEval is a benchmark for long-horizon conversational memory: can a system hold and reason over hundreds of turns across many sessions, not just recall a single stated fact? It targets exactly what raw LLMs fail at. Here is what it measures, why it is hard, and why the Large Cognition Model (LCM) is built for it.

WHAT IT MEASURES

A test of memory, not of a context window.

LongMemEval constructs long, realistic chat histories and then asks hundreds of questions about them. The histories are deliberately long enough that you cannot just paste the whole conversation back into the prompt and hope.

The questions are sorted into five distinct abilities. The easy end is information extraction: a single fact was stated once; can you find it again? The hard end is multi-session reasoning, where the answer only exists if you stitch together evidence from several separate conversations. Temporal reasoning asks questions whose answer depends on when something happened. Knowledge updates check whether you return the current value of a fact that changed over time, or the stale one. And abstention checks whether you have the discipline to say that something was never stated rather than invent a confident answer.

Read the headline accuracy with suspicion: it averages over all five abilities and hides the one that matters. A system can look strong by acing single-fact extraction while collapsing on multi-session reasoning, which is the slice that actually predicts whether a memory layer is useful in a long-lived product.
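To make the averaging problem concrete, here is a minimal sketch of how a headline number can hide a weak slice. The counts are invented purely for the arithmetic: they are not CognitiveX results, and the question mix is an assumption rather than the real LongMemEval distribution.

```python
from collections import defaultdict

# Invented grading outcomes, for arithmetic only: (ability, asked, correct).
# These are NOT measured results and NOT the real benchmark mix.
ILLUSTRATIVE_COUNTS = [
    ("information_extraction", 50, 48),
    ("multi_session_reasoning", 10, 3),
    ("temporal_reasoning", 10, 7),
    ("knowledge_updates", 10, 8),
    ("abstention", 10, 8),
]


def expand(counts):
    """Turn (ability, asked, correct) rows into one (ability, is_correct) pair per question."""
    rows = []
    for ability, asked, correct in counts:
        rows += [(ability, True)] * correct
        rows += [(ability, False)] * (asked - correct)
    return rows


def headline_accuracy(rows):
    """Single pooled accuracy over every question, regardless of ability."""
    return sum(ok for _, ok in rows) / len(rows)


def per_ability_accuracy(rows):
    """Accuracy computed separately for each ability."""
    asked = defaultdict(int)
    correct = defaultdict(int)
    for ability, ok in rows:
        asked[ability] += 1
        correct[ability] += int(ok)
    return {ability: correct[ability] / asked[ability] for ability in asked}


if __name__ == "__main__":
    rows = expand(ILLUSTRATIVE_COUNTS)
    print(f"headline: {headline_accuracy(rows):.3f}")
    for ability, acc in per_ability_accuracy(rows).items():
        print(f"{ability}: {acc:.3f}")
```

With these made-up counts the pooled score lands above 0.8 while multi-session reasoning sits at 0.3, which is why the per-ability breakdown is the thing to read.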

WHY IT IS HARD

Long-horizon recall is the part LLMs fail at.

A raw language model has no memory between sessions, and its grip on a single long context is weaker than the marketing suggests.

Three failures compound. First, no persistence: close the session and the model forgets everything, so cross-session questions are unanswerable by definition. Second, lost-in-the-middle: even inside one context window, attention degrades over distance, and the fact you need is rarely at the edges where the model looks hardest. Third, no supersession: when a fact changes across sessions, the model has no notion that the later statement replaced the earlier one, so it averages contradictions or returns whichever it attended to.

The naive fix, a bigger context window, does not address any of this. It makes the haystack larger, raises cost at least linearly with length, and still leaves the model without a way to rank, order, or retire facts. The real answer is an external memory that decides what to keep, when it was true, and how it relates to everything else. That is a different problem from generation, and it is the problem the Large Cognition Model is built around.

OUR APPROACH

Consolidation, salience, and four memory tiers.

Four tiers, not one bucket

Episodic holds what happened and when, preserving order for temporal questions.
Semantic holds settled facts; foundational holds the stable, identity-level ones.
Procedural holds how-tos, so recall fits the kind of question being asked.

Consolidation across sessions

Overnight dream consolidation promotes recurring episodes into semantic structure.
A later fact supersedes the earlier one, so updates are tracked, not averaged.
Multi-session answers become a recall, not a context-window reconstruction.
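
Supersession, as described above, can be sketched with a toy store. This is an illustration under stated assumptions, not the LCM's implementation: the `FactStore` class, its method names, and the (subject, attribute) keying are all hypothetical. Each statement keeps its date, the latest one wins for "current" questions, and earlier ones stay queryable for "as of" questions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class Statement:
    value: str
    stated_on: date


@dataclass
class FactStore:
    """Hypothetical toy store: one dated history per (subject, attribute)."""

    _history: dict = field(default_factory=dict)

    def remember(self, subject: str, attribute: str, value: str, stated_on: date) -> None:
        statements = self._history.setdefault((subject, attribute), [])
        statements.append(Statement(value, stated_on))
        # Keep the history ordered by date so sessions may be ingested out of order.
        statements.sort(key=lambda s: s.stated_on)

    def current(self, subject: str, attribute: str) -> str | None:
        """Latest statement wins; None means the fact was never stated."""
        statements = self._history.get((subject, attribute))
        return statements[-1].value if statements else None

    def as_of(self, subject: str, attribute: str, when: date) -> str | None:
        """Value that held on a given date, or None if nothing was stated by then."""
        statements = self._history.get((subject, attribute), [])
        earlier = [s for s in statements if s.stated_on <= when]
        return earlier[-1].value if earlier else None


if __name__ == "__main__":
    store = FactStore()
    store.remember("user", "city", "Berlin", date(2023, 1, 5))
    store.remember("user", "city", "Lisbon", date(2023, 6, 1))
    print(store.current("user", "city"))
    print(store.as_of("user", "city", date(2023, 3, 1)))
    print(store.current("user", "employer"))
```

The design point is that the older statement is retired rather than deleted or blended: a knowledge-update question reads the newest value, while a temporal question can still ask what was true earlier.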

Salience decides what surfaces

Salience scoring weights what a user repeats over what they said once in passing.
Recall returns ranked memories with scores, which supports honest abstention.
Decay fades what no longer matters, keeping the haystack small.
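
The three ideas above (repetition raises salience, decay lowers it, and scored recall allows abstention) can be sketched together. Everything here is an assumption made for illustration: the log-of-mentions weight, the 30-day half-life, the word-overlap relevance, and the 0.05 threshold are placeholders, not the LCM's actual scoring.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    mentions: int      # how many times the user has repeated this
    age_days: float    # time since it was last mentioned


def salience(memory, half_life_days=30.0):
    """Repetition raises salience; exponential decay fades it with age (assumed formula)."""
    decay = 0.5 ** (memory.age_days / half_life_days)
    return math.log1p(memory.mentions) * decay


def relevance(query, memory):
    """Crude word-overlap (Jaccard) relevance; a stand-in for a real retriever."""
    q = set(query.lower().split())
    m = set(memory.text.lower().split())
    union = q | m
    return len(q & m) / len(union) if union else 0.0


def recall(query, memories, top_k=3):
    """Return up to top_k (score, memory) pairs, best first."""
    scored = [(relevance(query, m) * salience(m), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def answer_or_abstain(query, memories, min_score=0.05):
    """Return the best memory's text, or None when nothing scores high enough."""
    ranked = recall(query, memories)
    if not ranked or ranked[0][0] < min_score:
        return None  # abstain: nothing relevant enough was ever stored
    return ranked[0][1].text


if __name__ == "__main__":
    memories = [
        Memory("user likes hiking in the alps", mentions=5, age_days=2),
        Memory("user mentioned a dentist appointment", mentions=1, age_days=200),
    ]
    print(answer_or_abstain("what does the user like hiking", memories))
    print(answer_or_abstain("favorite programming language", memories))
```

Because recall returns scores rather than a forced single answer, the caller can decide where the abstention line sits instead of the memory layer guessing on its behalf.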

The through-line is that LongMemEval rewards a system for doing the work a raw LLM cannot: keeping memory across sessions, ordering it in time, ranking it by importance, and retiring what is stale. Those are not prompt tricks. They are the mechanisms the LCM ships, described in detail in the research and exposed over the cogx SDK and MCP.

ABILITY BY ABILITY

What each sub-task asks, and what answers it.

The benchmark is not one number. It is a set of distinct abilities, and each one maps to a specific mechanism in the LCM rather than to a bigger prompt.

Ability	What it tests	CognitiveX LCM mechanism
Information extraction	Pull a single stated fact out of a long, noisy history	Semantic + episodic recall, ranked by relevance and salience
Multi-session reasoning	Synthesize an answer that spans many separate sessions	Dream consolidation promotes recurring episodes into semantic structure
Temporal reasoning	Answer questions whose truth depends on when things happened	Episodic tier preserves order; recency + salience weight what surfaces
Knowledge updates	Track a fact that changed across sessions, not the stale version	Consolidation supersedes outdated episodes; decay fades what no longer holds
Abstention	Say nothing was stated rather than hallucinate a plausible answer	Recall returns ranked memories with scores, not a forced single guess

RESULTS · STATUS

No invented numbers. A score with code, or nothing.

Here is the honest part. CognitiveX does not publish a LongMemEval score today.

The AI-memory field is full of self-reported figures from harnesses nobody else can run. We are not going to add to that pile. When CognitiveX reports a LongMemEval result it will be from the public harness, with the configuration and the code published alongside it, so anyone can reproduce it. Until then, the claim on this page is narrow and defensible: the benchmark targets long-horizon, multi-session memory, and the LCM is architected around exactly that. Its building blocks are consolidation, salience, temporal episodic memory, and supersession of stale facts, not a bigger context window.

If you want to evaluate it yourself before any leaderboard exists, you can: point an agent at the hosted memory, replay a multi-session history through remember, and probe it with recall. The honest way to read a memory benchmark is to watch the multi-session reasoning slice and to trust only numbers that ship with a way to reproduce them.
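
The replay-and-probe loop can be sketched as follows. The `MemoryClient` interface is hypothetical: the page names the operations remember and recall, but the signatures below are assumptions and should not be read as the cogx SDK's actual API. `ToyMemory` is a keyword-overlap stand-in with no consolidation, included only so the loop runs end to end.

```python
from typing import Protocol


class MemoryClient(Protocol):
    """Assumed interface; the real SDK signatures may differ."""

    def remember(self, text: str, session_id: str) -> None: ...

    def recall(self, query: str) -> list: ...


class ToyMemory:
    """In-memory stand-in: stores turns verbatim, ranks by word overlap."""

    def __init__(self):
        self._items = []

    def remember(self, text, session_id):
        self._items.append((session_id, text))

    def recall(self, query):
        q = set(query.lower().split())
        scored = []
        for _, text in self._items:
            overlap = len(q & set(text.lower().split()))
            if overlap:
                scored.append((text, overlap / len(q)))
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


def replay(client, sessions):
    """Feed every turn of every session into memory, in order."""
    for session_id, turns in sessions:
        for turn in turns:
            client.remember(turn, session_id=session_id)


def probe(client, questions):
    """Ask each question; mark a hit if the expected text is in the top recall."""
    outcomes = []
    for ability, query, expected in questions:
        ranked = client.recall(query)
        top_text = ranked[0][0] if ranked else ""
        outcomes.append((ability, expected.lower() in top_text.lower()))
    return outcomes


if __name__ == "__main__":
    sessions = [
        ("s1", ["I moved to Berlin last year"]),
        ("s2", ["My dog is called Pixel"]),
    ]
    questions = [("information_extraction", "what is my dog called", "Pixel")]
    memory = ToyMemory()
    replay(memory, sessions)
    print(probe(memory, questions))
```

Swapping `ToyMemory` for a real client that satisfies the same two methods keeps the loop unchanged, and tagging each question with its ability means the outcomes can be broken down per slice rather than pooled.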

COMMON QUESTIONS

Quick answers.

What does LongMemEval test?

Long-horizon, multi-session conversational memory. The benchmark builds long chat histories and then asks hundreds of questions across five abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. The histories are long enough that stuffing everything into a context window is not a real strategy, which is the whole point.

Why is long-horizon recall hard for LLMs?

A raw LLM has no memory between sessions, and even within a session its attention degrades over a long context: facts in the middle get lost, and the model cannot tell which earlier turn was superseded. LongMemEval is built to expose exactly that. The answer is not a bigger context window but an external memory that consolidates, orders, and ranks what matters.

Which sub-score matters most?

Multi-session reasoning. It is where pure store-and-retrieve plumbing struggles, because the answer depends on synthesizing across sessions rather than recalling one stated fact. It is also where a consolidation engine, one that promotes episodes into semantic structure, should earn its keep.

Does CognitiveX publish a LongMemEval score?

Not yet. CognitiveX publishes no benchmark score and will not claim one until it runs the open harness transparently with reproducible code. This page explains the benchmark and why the architecture is designed to do well on it. The numbers will follow as evaluated runs, not invented figures.

KEEP READING

Go deeper on the architecture in the Large Cognition Model and the research behind it, see how it is exposed in the platform, or read the head-to-head memory-layer comparisons across Mem0, Zep, Letta, and Cognee.
