
LongMemEval: The AI Memory Benchmark, Explained
LongMemEval is a benchmark that tests whether an AI assistant can remember across many chat sessions. Published at ICLR 2025, it puts 500 human-curated questions inside long, multi-session histories and measures five abilities: information extraction, multi-session reasoning, knowledge updates, temporal reasoning, and abstention. Its headline finding: as history grows, accuracy collapses.
What LongMemEval actually tests
The benchmark (arXiv 2410.10813, Wu et al.) frames memory as a chat-assistant problem, not a document-retrieval one. You don't get a clean knowledge base; you get a sprawling conversation history and a question whose answer is buried somewhere in it, possibly updated three sessions later, possibly not present at all.
The 500 questions span seven types and stress five core skills:
- Information extraction: find a specific fact stated once, sessions ago.
- Multi-session reasoning: combine facts spread across different sessions.
- Knowledge updates: track a fact that changed and return the current value.
- Temporal reasoning: answer "when" and order-of-events questions.
- Abstention: know when the answer simply isn't in the history and decline.
That last one matters more than it looks: a system that confidently fabricates an answer when it should abstain is worse than useless. LongMemEval scores it explicitly.
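To make the abstention rule concrete, here is a minimal bookkeeping sketch. The names (`Result`, `is_correct`, `score_by_ability`) are hypothetical and not part of the LongMemEval harness, and the `judged_correct` flag stands in for the benchmark's LLM judge. It only illustrates the idea that declining is the sole correct response to an unanswerable question.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    ability: str          # e.g. "abstention", "temporal reasoning"
    answerable: bool      # False when the answer is not in the history
    abstained: bool       # True when the system declined to answer
    judged_correct: bool  # stand-in for the judge's verdict on a given answer


def is_correct(r: Result) -> bool:
    if not r.answerable:
        # Unanswerable question: the only right move is to decline.
        return r.abstained
    # Answerable question: declining is a miss, otherwise defer to the judge.
    return (not r.abstained) and r.judged_correct


def score_by_ability(results):
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r.ability] += 1
        hits[r.ability] += int(is_correct(r))
    return {ability: hits[ability] / totals[ability] for ability in totals}
```

A system that fabricates an answer to an unanswerable question scores zero on that item here, no matter how plausible the answer sounds.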
LongMemEval_S vs LongMemEval_M
The benchmark ships at two scales, and the gap between them is the whole point.
| Variant | History size | Sessions | What it stresses |
|---|---|---|---|
| LongMemEval_S | ~115k tokens | 30-40 | Fits in a large context window; tests recall under load |
| LongMemEval_M | ~1.5M tokens | ~500 | Far beyond any context window; forces real memory architecture |
LongMemEval_S is the "can a long-context model brute-force this?" test. LongMemEval_M is the "no, it can't, now what?" test. A 1.5M-token history doesn't fit anywhere, so the M scale separates systems that store and retrieve from systems that just stuff everything into the prompt.
The leaderboard: why long context isn't enough
Here's the result that launched a hundred memory startups. On LongMemEval_S, commercial assistants and long-context LLMs show a 30-60% accuracy drop as history accumulates. GPT-4o-class systems land at only ~30-70% accuracy depending on setup. The context window was supposedly "big enough", and accuracy still fell off a cliff.
This is the empirical backbone of the memory-that-learns thesis: raw context length does not solve memory; structure does. The LongMemEval paper shows it directly. It models memory as three stages (indexing → retrieval → reading) and reports measurable, architecture-level wins:
- Round-level storage (vs session-level) improves what's retrievable.
- Fact-augmented keys add +4% recall@k.
- Time-aware query expansion adds +7-11% on temporal reasoning.
- Structured / Chain-of-Note reading adds ~10 absolute points.
None of these require a bigger model. They require a better-organized memory. That is the entire argument for a dedicated memory layer over a fatter prompt.
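The three stages can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: word overlap stands in for a real retriever, the facts are supplied by hand rather than extracted by a model, the time-aware step is reduced to a date-range filter, and every name (`Round`, `retrieve`, `reading_prompt`) is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Round:
    session_date: date
    text: str  # one turn of the conversation, not a whole session
    facts: list[str] = field(default_factory=list)  # hand-written here

    def key(self) -> str:
        # Fact-augmented key: index the raw round together with its facts.
        return " ".join([self.text, *self.facts]).lower()


def tokens(s: str) -> set[str]:
    return set(s.lower().split())


def retrieve(rounds, query, k=2, after=None, before=None):
    # Time-aware step: keep only rounds inside the date range implied by the query.
    pool = [
        r for r in rounds
        if (after is None or r.session_date >= after)
        and (before is None or r.session_date <= before)
    ]
    # Toy relevance score: word overlap between query and fact-augmented key.
    q = tokens(query)
    ranked = sorted(pool, key=lambda r: len(q & tokens(r.key())), reverse=True)
    return ranked[:k]


def reading_prompt(rounds, query):
    # Chain-of-Note style reading: ask for a note per excerpt before the answer.
    excerpts = "\n".join(f"[{r.session_date}] {r.text}" for r in rounds)
    return f"Write a note for each excerpt, then answer.\n{excerpts}\nQuestion: {query}"
```

In this sketch the extracted fact "user lives in Denver" is what lets a question about the user's city match a round that only says "We moved to Denver last week".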
The multi-session sub-score, where consolidation should win
Multi-session reasoning is the hardest part of LongMemEval because the answer doesn't exist in any single place; it has to be assembled from fragments across the history. A flat vector store reconstructs that link at query time, every time, and only if retrieval happens to pull all the right fragments together.
A consolidating memory does the opposite: it pre-computes cross-session links offline. When CognitiveX's consolidation engine promotes episodic traces into semantic facts and extracts patterns during dream consolidation, it's building exactly the connective tissue that multi-session questions demand, before the question is ever asked. That's why the multi-session sub-score is the leaderboard line where a consolidating architecture should beat a retrieval-only one. (For the full mechanism, see memory consolidation for AI agents.)
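As a rough sketch of what "pre-computing cross-session links" could mean, the example below groups per-session traces by entity in an offline pass. It is an illustration only and does not describe CognitiveX's actual engine. The `Trace` type, the entity labels, and both functions are hypothetical, and real consolidation would need entity resolution and conflict handling that are omitted here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    session_id: int
    entity: str  # assumed to be already resolved to a canonical label
    fact: str


def consolidate(traces):
    """Offline pass: link traces about the same entity across sessions."""
    by_entity = defaultdict(list)
    for t in sorted(traces, key=lambda t: t.session_id):
        by_entity[t.entity].append(t)
    # Keep only entities seen in more than one session: the cross-session links.
    return {
        entity: ts
        for entity, ts in by_entity.items()
        if len({t.session_id for t in ts}) > 1
    }


def answer_context(semantic, entity):
    # At query time one lookup returns every fragment, already in session order.
    return [f"s{t.session_id}: {t.fact}" for t in semantic.get(entity, [])]
```

The design point is where the work happens: the grouping runs once, before any question arrives, so a multi-session question becomes a single lookup instead of a retrieval that must happen to surface every fragment.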
Zep's published numbers gesture at the same shape. They report 71.2% overall on LongMemEval (gpt-4o) at ~2.6s latency, vs 60.2% for vanilla full-context at ~29s: more accurate and roughly 10x faster, with the largest relative gains on temporal reasoning (+17.3pp) and multi-session QA (+13.6pp). The mechanism is a temporal knowledge graph that stamps valid_at/invalid_at on every edge, so it knows when a fact was true. Worth noting: those figures are vendor-reported, not third-party. Which brings us to the part of this post that matters most.
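The idea of stamping validity intervals on edges can be sketched as follows. Only the field names `valid_at` and `invalid_at` come from the description above. The `Edge` type and both functions are hypothetical and are not Zep's API, and the sketch assumes each subject/relation pair holds a single value at a time.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Edge:
    subject: str
    relation: str
    obj: str
    valid_at: date
    invalid_at: Optional[date] = None  # None means the fact is still true


def assert_fact(edges, subject, relation, obj, when):
    # Close any open edge for the same subject/relation, then add the new one.
    for e in edges:
        if e.subject == subject and e.relation == relation and e.invalid_at is None:
            e.invalid_at = when
    edges.append(Edge(subject, relation, obj, valid_at=when))


def value_at(edges, subject, relation, when):
    for e in edges:
        if (
            e.subject == subject
            and e.relation == relation
            and e.valid_at <= when
            and (e.invalid_at is None or when < e.invalid_at)
        ):
            return e.obj
    return None  # nothing known at that time, a natural cue to abstain
```

Because old edges are closed rather than deleted, the same structure can answer both "where does the user live now?" (a knowledge update) and "where did they live last summer?" (temporal reasoning).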
How to read memory-benchmark claims skeptically
A single benchmark headline tells you almost nothing. Two cautionary tales:
1. The Mem0-vs-Zep dispute. Mem0 published LoCoMo numbers claiming SOTA. Zep responded that Mem0 had misconfigured Zep: it assigned the user role to both speakers, appended timestamps to message text instead of using the created_at field, and ran searches sequentially (inflating reported latency). Zep's corrected score: 75.14% ±0.17 vs the 65.99% Mem0 had reported for it. Both sides are interested parties. The lesson: when a vendor benchmarks a competitor, audit the competitor's config first. (We unpack the field in Mem0 vs Zep vs Letta vs Cognee.)
2. The LoCoMo audit. LoCoMo is the older memory benchmark vendors love to cite. Penfield Labs audited it and found 99 of its 1,540 answer keys (6.4%) are corrupted, so the theoretical max score is ~93.6%, not 100%. The errors include hallucinated facts (a "Ferrari 488 GTB" that exists only in an internal annotator field no memory system ingests), mis-resolved dates, and 24 questions with the wrong speaker attributed. Worse, the gpt-4o-mini judge accepted 62.81% of intentionally wrong-but-on-topic answers: the judge rewards vagueness. The takeaway is blunt: score differences below a few points on LoCoMo are not interpretable.
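The arithmetic behind both tales is easy to check. The helper names below are hypothetical, and the inputs are the figures quoted above.

```python
def score_ceiling(corrupted_keys: int, total_keys: int) -> float:
    """Best achievable accuracy when every corrupted key marks a correct answer wrong."""
    return 1 - corrupted_keys / total_keys


def gap_in_points(score_a: float, score_b: float) -> float:
    """Difference between two percentage scores, in percentage points."""
    return round(score_a - score_b, 2)


print(f"{99 / 1540:.1%}")                # 6.4%  share of corrupted LoCoMo keys
print(f"{score_ceiling(99, 1540):.1%}")  # 93.6% ceiling for a perfect system
print(gap_in_points(75.14, 65.99))       # 9.15  Zep-corrected vs Mem0-reported
```

A configuration dispute worth about 9 points is larger than the 6.4% of answer keys the audit found corrupted, which is why neither number should be read without its setup.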
The honest reading: be most skeptical of vendor-reported numbers obtained by benchmarking a possibly misconfigured competitor, and least skeptical of academic, peer-reviewed harnesses with public data, which is exactly what LongMemEval offers.
| Claim | Source | How to weight it |
|---|---|---|
| 30-60% drop on LongMemEval_S | LongMemEval paper (ICLR 2025) | Peer-reviewed, trust |
| Zep 71.2% / +17.3pp temporal | Zep report | Vendor-reported, verify config |
| Zep 75.14% corrected on LoCoMo | Zep blog | Vendor dispute, both sides interested |
| 6.4% of LoCoMo keys wrong | Penfield Labs audit | Third-party, trust, and discount LoCoMo |
Where CognitiveX stands
CognitiveX's run on the open LongMemEval harness is in progress, and we will not state a number before it's published. That's deliberate. The harness, data, baselines, and GPT-4o judge are public (the judge reports >97% agreement with human experts), so any number is reproducible, which means it has to be honest. When we publish, we'll publish the methodology with it: which scale, which judge, which config. Until then, the honest claim is the architectural one: LongMemEval's own paper rewards the exact moves CognitiveX is built on (consolidation, time-aware indexing, salience-weighted decay), and the multi-session sub-score is where that architecture should show its edge.
FAQ
What is the LongMemEval benchmark and what does it test? A benchmark (ICLR 2025) of long-term, multi-session AI memory. 500 questions test information extraction, multi-session reasoning, knowledge updates, temporal reasoning, and abstention inside long chat histories.
What's the difference between LongMemEval_S and LongMemEval_M? _S uses ~115k-token histories (30-40 sessions) that can fit a large context window. _M uses ~1.5M tokens (~500 sessions), far beyond any window, forcing a real memory architecture rather than prompt-stuffing.
Why do long-context LLMs lose accuracy over many sessions? Bigger context windows fill up and degrade ("context rot"). LongMemEval shows a 30-60% accuracy drop on the _S scale as history grows. Structured indexing and retrieval beat brute-force context.
LongMemEval vs LoCoMo: which is more reliable? LongMemEval is peer-reviewed with a public harness. LoCoMo has 6.4% corrupted answer keys per a third-party audit, and its judge accepts ~63% of wrong-but-on-topic answers, so small LoCoMo score gaps aren't meaningful.
Can you trust vendor-reported memory scores like Mem0 vs Zep? Treat them as a starting point, not a verdict. The Mem0/Zep dispute showed that benchmarking a competitor with the wrong config can swing its score by ~9 points. Audit the configuration before believing any cross-vendor number.
Want a memory layer built around consolidation, not just storage (the architecture LongMemEval rewards)? Try CognitiveX →