Benchmark
LongMemEval, explained
LongMemEval is the benchmark the AI-memory field increasingly fights over. It tests whether a system can hold and reason over long, multi-session histories, not just recall a single stated fact. Here is what it measures and how to read the claims around it.
The sub-score that matters
Headline accuracy hides the interesting part. The multi-session reasoning slice is where store-and-retrieve plumbing struggles, because the answer requires synthesizing across sessions. That is exactly where a consolidation engine, one that promotes episodes into semantic structure, should earn its keep. Read benchmark claims with that slice in mind, and prefer reproducible, open-harness numbers over self-reported ones.
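As a rough illustration of reading a result by slice rather than by headline, the sketch below computes overall accuracy alongside per-slice accuracy. It assumes a hypothetical list of graded records, each with a `question_type` label and a boolean `correct` flag. Those field names, the slice labels, and the toy records are assumptions made for this example, not output from any real harness or system.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return (overall_accuracy, {question_type: accuracy}) for graded records.

    Each record is assumed to be a dict with a "question_type" string and a
    boolean "correct" flag (hypothetical field names for this sketch).
    """
    if not records:
        raise ValueError("no records to score")

    totals = defaultdict(int)  # questions seen per slice
    hits = defaultdict(int)    # correct answers per slice
    for rec in records:
        slice_name = rec["question_type"]
        totals[slice_name] += 1
        hits[slice_name] += int(rec["correct"])

    per_slice = {name: hits[name] / totals[name] for name in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_slice


# Made-up toy records, only to show the shape of the computation.
# These are NOT benchmark results for any system.
toy = [
    {"question_type": "single-session", "correct": True},
    {"question_type": "single-session", "correct": True},
    {"question_type": "single-session", "correct": True},
    {"question_type": "single-session", "correct": True},
    {"question_type": "multi-session", "correct": True},
    {"question_type": "multi-session", "correct": False},
    {"question_type": "multi-session", "correct": False},
    {"question_type": "multi-session", "correct": False},
]

overall, per_slice = accuracy_by_slice(toy)
print(f"overall: {overall:.3f}")
for name, acc in sorted(per_slice.items()):
    print(f"{name}: {acc:.3f}")
```

In this toy data the single overall figure averages a perfect slice with a weak one, which is the effect to watch for when a claim reports only headline accuracy.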
CognitiveX publishes no score yet and will not claim one until it has run the open harness publicly. The architecture behind our expectation of doing well on the multi-session slice is covered in "Memory consolidation, explained."
FAQ
What does LongMemEval test?
Long-term, multi-session memory across hundreds of questions: information extraction, multi-session reasoning, temporal reasoning, and knowledge updates over long histories.
Which sub-score matters most?
Multi-session reasoning. It is where pure store-and-retrieve approaches struggle and where a consolidation engine should have an advantage, because the answer depends on synthesizing across sessions, not recalling one fact.
Does CognitiveX publish a score?
Not yet. CognitiveX publishes no benchmark score and will not claim one until it runs the open harness transparently. This page explains the benchmark; the number will follow with reproducible code.