
AI Agent Long-Term Memory

Standard RAG gives LLM agents access to facts — but not to how a problem was solved before. This maps to cognitive science’s separation between semantic memory (what you know) and episodic memory (what you’ve done and how it went).

Current vector stores are unordered — they cannot return the sequence of states an episode traversed. CVX provides episode identity, temporal ordering, causal continuation, reward filtering, and trajectory analytics in a single index.


| System | Architecture | ALFWorld Success | Model | Venue / Year |
|---|---|---|---|---|
| AutoManual | Online rule learning (procedural) | 97.4% | GPT-4-turbo | NeurIPS 2024 |
| Reflexion | Verbal self-reflection (episodic) | ~97% | GPT-4 | NeurIPS 2023 |
| ReflAct | Goal-state reflection | 94.8% | GPT-4o | 2025 |
| Memp | Procedural memory repository | 87.1% | GPT-4o | 2025 |
| AutoManual | Same, smaller model | 86.2% | GPT-3.5-turbo | NeurIPS 2024 |
| ExpeL | Experience extraction + insights | ~59% | GPT-4 (1-shot) | 2023 |
| CLIN | Continual causal abstractions | +11pp unseen | GPT-4 | COLM 2024 |
| CVX-Causal | Temporal episodic (causal retrieval) | 20% | qwen2.5:7b (1-shot) | This work |
| No memory | Baseline | 3.3% | qwen2.5:7b | This work |

Key insight: CVX achieves a 6× improvement (3.3% → 20%) with a 7B model, while the SOTA systems above all use GPT-4-class models. The critical missing experiment is a head-to-head under identical 1-shot conditions: CVX + GPT-4o vs ExpeL + GPT-4.

| System | Storage | Retrieval | Temporal? | Reward filter? |
|---|---|---|---|---|
| Generative Agents | NL stream | Recency + relevance + importance | Decay only | No |
| Reflexion | Verbal reflections | Sliding window | No | No |
| ExpeL | Trajectories + rules | Task similarity | Partial | Post-hoc |
| Voyager | Code skill library | Embedding similarity | No | No |
| MemGPT/Letta | Hierarchical paging | OS-style tier routing | Partial | No |
| CLIN | Causal abstractions | Persistent updates | Trial-level | No |
| Zep/Graphiti | Temporal knowledge graph | Time + semantic + graph | Yes | No |
| CVX | Temporal vector index | HNSW + causal search | Yes | Yes (bitmap) |

CVX is the only system that combines episode identity, temporal ordering, causal continuation, reward pre-filtering, and trajectory analytics (signatures, changepoints) in a single index.

| Benchmark | Domain | CVX Fit | Current SOTA |
|---|---|---|---|
| ALFWorld | Household RL | High: step-by-step causal retrieval | 97.4% (AutoManual, GPT-4-turbo) |
| LongMemEval | Temporal reasoning | High: timestamps, ordering | 95% (OM + gpt-5-mini) |
| Mem2ActBench | Action from memory | High: temporal span, metadata | New benchmark, few baselines |
| MemoryAgentBench | Knowledge updates | Medium: changepoint detection | No system masters all 4 tasks |
| HumanEval | Code generation | Low: saturated by frontier models | 96.2% (o1-mini) |

| Feature | API | Purpose |
|---|---|---|
| Episode encoding | `entity_id = episode_id * 10000 + step` | Group steps into episodes |
| Causal search | `index.causal_search(vec, k, temporal_context)` | "What happened next?" via successor/predecessor edges |
| Hybrid search | `index.hybrid_search(vec, k, beta)` | Beam search exploring semantic + temporal neighbors |
| Reward filtering | `index.search_with_reward(vec, k, min_reward)` | Only retrieve successful experiences (bitmap pre-filter) |
| Metadata pre-filtering | `index.insert(..., metadata={"goal": "clean"})` | Context-dependent retrieval via inverted index |
| Native centering | `index.set_centroid(centroid)` | 30× signal improvement for anisotropic embeddings |
| Path signatures | `cvx.path_signature(traj, depth)` | Trajectory shape comparison |
| Change point detection | `cvx.detect_changepoints(id, traj)` | Identify regime shifts in agent behavior |
```
entity_id: episode_id * 10000 + step_index
timestamp: episode_id * 10000 + step_index (monotonic within episode)
vector:    embed(observation + action)
reward:    0.0-1.0 (set retroactively via set_reward())
metadata:  {"goal": "clean", "room": "kitchen", "task_type": "pick"}
```
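The episode encoding can be sketched in a few lines of plain Python. The `EPISODE_STRIDE` constant and helper names below are illustrative, not part of the CVX API:

```python
# Illustrative sketch of the episode encoding scheme; not CVX API.
EPISODE_STRIDE = 10_000  # supports up to 10,000 steps per episode

def encode_entity_id(episode_id: int, step_index: int) -> int:
    """Pack (episode, step) into one integer; ids from the same episode
    are contiguous and ordered by step."""
    assert 0 <= step_index < EPISODE_STRIDE
    return episode_id * EPISODE_STRIDE + step_index

def decode_entity_id(entity_id: int) -> tuple[int, int]:
    """Recover (episode_id, step_index) from a packed id."""
    return divmod(entity_id, EPISODE_STRIDE)

def successor(entity_id: int) -> int:
    """The 'what happened next' neighbor within the same episode."""
    return entity_id + 1

print(decode_entity_id(encode_entity_id(42, 7)))  # (42, 7)
```

Because ids are monotonic within an episode, successor/predecessor edges reduce to integer arithmetic, which is what makes causal continuation cheap at query time.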

All experiments use the same protocol: 336 expert trajectories from AgentInstruct indexed in CVX with episode encoding. The agent queries CVX at each step with its current observation and receives expert continuations from similar past states. 30 games per condition, eval_out_of_distribution split (134 games).
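The per-step retrieval loop can be sketched without any CVX dependency. The cosine ranking and successor lookup below are a simplified stand-in for what `causal_search` provides, not the actual implementation:

```python
# Simplified stand-in for causal retrieval: rank stored expert states by
# similarity to the current observation, then return the actions the
# experts took one step LATER (their causal successors).
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a)) or 1.0
    nb = sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def causal_search(index, query_vec, k=2):
    """index: {entity_id: (vector, action)}. Returns expert actions taken
    one step after the k states most similar to query_vec."""
    ranked = sorted(index, key=lambda eid: cosine(index[eid][0], query_vec),
                    reverse=True)
    continuations = []
    for eid in ranked[:k]:
        succ = eid + 1  # same episode, next step (episode encoding)
        if succ in index:
            continuations.append(index[succ][1])
    return continuations

# Two expert steps from episode 1 (entity_id = 1*10000 + step):
index = {
    10000: ([1.0, 0.0], "go to drawer 2"),
    10001: ([0.9, 0.1], "open drawer 2"),
}
print(causal_search(index, [1.0, 0.05], k=1))  # ['open drawer 2']
```

The returned continuations are what gets injected into the agent's prompt as "expert next actions" at each step.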

| Experiment | Model | No memory | CVX-Causal | Improvement |
|---|---|---|---|---|
| E3 (proof of concept) | qwen2.5:7b (Ollama) | 3.3% | 20.0% | +16.7pp (6.0×) |
| E6 | GPT-4o-mini | 13.3% | 26.7% | +13.4pp (2.0×) |
| E5 | GPT-4o | 20.0% | 43.3% | +23.3pp (2.2×) |

Key findings:

  1. CVX memory improves performance across all model scales — from 7B local to frontier GPT-4o
  2. The largest absolute improvement comes with the strongest model: GPT-4o leverages the retrieved expert actions most effectively (+23.3pp, vs +16.7pp at 7B and +13.4pp at GPT-4o-mini), though the trend is not strictly monotonic
  3. Memory partially compensates for model size — CVX + GPT-4o-mini (26.7%) outperforms NoMemory + GPT-4o-mini (13.3%) and approaches NoMemory + GPT-4o (20.0%)
  4. Critical implementation detail: the causal context must include actual action text from expert successors, not just similarity scores. Empty context (E5 v1) showed zero improvement

Comparison with SOTA (ALFWorld, 1-shot, no retry)

| System | Success Rate | Model | Memory Type |
|---|---|---|---|
| ExpeL | ~59% | GPT-4 | Experience extraction + insights |
| CVX-Causal | 43.3% | GPT-4o | Temporal vector retrieval |
| ReAct (no memory) | ~20% | GPT-4o | None |
| CVX-Causal | 26.7% | GPT-4o-mini | Temporal vector retrieval |
| CVX-Causal | 20.0% | qwen:7b | Temporal vector retrieval |
| No memory | 3.3% | qwen:7b | None |

CVX at 43.3% does not yet beat ExpeL (59%), but ExpeL uses experience extraction + insight learning (a complex multi-stage pipeline) while CVX uses pure retrieval from a temporal index — no post-processing, no rule extraction.

Online Learning (E7b) — Self-Improving Memory


The agent plays multiple rounds, adding its own experience to CVX after each round. Successful episodes get reward=1.0, failures get reward=0.0. Expert trajectories that were retrieved but led to failure get a 10% reward decay.
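The per-round reward update described above can be sketched as follows; the `rewards` dict stands in for CVX's `set_reward()` calls, and the function name is illustrative:

```python
# Sketch of the online reward-annotation loop: annotate the agent's own
# episode after each round, and decay experts retrieved during failures.
# The rewards dict stands in for per-episode set_reward() calls.
DECAY = 0.10  # 10% decay for retrieved-but-unhelpful expert trajectories

def update_rewards(rewards, own_episode, success, retrieved_experts):
    rewards[own_episode] = 1.0 if success else 0.0
    if not success:
        # Blind decay: penalize every expert retrieved during the failure
        for exp in retrieved_experts:
            rewards[exp] = rewards.get(exp, 1.0) * (1.0 - DECAY)
    return rewards

rewards = {"expert_7": 1.0}
update_rewards(rewards, "agent_round1", success=False,
               retrieved_experts=["expert_7"])
print(rewards)  # {'expert_7': 0.9, 'agent_round1': 0.0}
```

Note this is the blind-decay variant; the context-aware refinement from E7e (discussed later) adds conditions before penalizing an expert.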

Key insight — context format matters for small models: Including expert observations in the prompt (E7 verbose, 722 chars) degraded performance vs compact action chains (E7b, ~200 chars). Small models lose performance with long contexts. The best format combines abstract strategy templates with compact action chains:

```
Strategy: Find object, take it, go to sinkbasin, clean it, go to target, put it.
Expert action sequences:
[1] go to drawer 2 -> open drawer 2 -> take soapbar 1 -> go to sinkbasin 1
[2] go to cabinet 1 -> open cabinet 1 -> take cloth 2 -> go to sinkbasin 1
```
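A helper that produces this compact format from retrieved trajectories might look like the sketch below; the function name and the ~200-character budget are illustrative (and the truncation here is a naive character cut, not a token-aware one):

```python
# Sketch: collapse retrieved expert trajectories into compact action
# chains, the format that worked best for small models in E7b.
def compact_context(strategy: str, trajectories: list[list[str]],
                    budget: int = 200) -> str:
    lines = [f"Strategy: {strategy}", "Expert action sequences:"]
    for i, actions in enumerate(trajectories, 1):
        lines.append(f"[{i}] " + " -> ".join(actions))
    return "\n".join(lines)[:budget]  # naive character budget

ctx = compact_context(
    "Find object, take it, go to sinkbasin, clean it.",
    [["go to drawer 2", "open drawer 2", "take soapbar 1"]],
)
print(ctx)
```

The key design choice is what is *omitted*: expert observations are dropped entirely, keeping only the abstract strategy line plus bare action chains.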
| Round | Index size | E7 (verbose) | E7b (compact) |
|---|---|---|---|
| 1 (expert only) | 4,542 | 6.7% | 6.7% |
| 2 (+own experience) | ~5,400 | 6.7% | 13.3% |
| 3 (+reward decay) | ~6,300 | 16.7% | 26.7% |

E7b Round 3 (26.7%) beats the E3 baseline (20%) — online learning + compact strategy templates outperform static expert retrieval.

Learning Curve & Memory Dynamics (10 rounds, qwen:7b)


Three variants tested over 10 rounds to understand how memory quality evolves with accumulated experience:

| Round | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Mean R4-10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| E7c | 6.7 | 10.0 | 20.0 | 13.3 | 16.7 | 16.7 | 13.3 | 10.0 | 16.7 | 16.7 | 14.8% |
| E7d | 10.0 | 23.3 | 26.7 | 13.3 | 13.3 | 16.7 | 13.3 | 26.7 | 13.3 | 23.3 | 17.1% |
| E7e | 10.0 | 23.3 | 30.0 | 16.7 | 16.7 | 23.3 | 13.3 | 26.7 | 13.3 | 26.7 | 19.5% |

| Variant | Approach | Peak | Mean R4-10 |
|---|---|---|---|
| E7c | All experience in index, blind decay (10%) | 20.0% | 14.8% |
| E7d | Wins-only index, blind decay (15%) | 26.7% | 17.1% |
| E7e | Wins-only index, context-aware decay (25%) | 30.0% | 19.5% |

1. Memory contamination degrades retrieval (E7c)

Each round adds ~25 failed episodes and ~4 successful ones. By Round 10, only 59% of memory episodes are successful. Failed experiences are semantically similar to new queries (same task types) and pollute retrieval. Fix: never add failed episodes to the retrieval index (E7d).
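The E7d fix reduces to a gate at insertion time; the function and names below are an illustrative sketch, not CVX API:

```python
# Sketch of the E7d fix: only successful episodes enter the retrieval
# index, so failures can never pollute future retrievals.
def maybe_index_episode(index, episode_id, steps, success):
    """steps: list of (vector, action) pairs. Returns #entries inserted."""
    if not success:
        return 0  # failed episodes are never indexed
    for step_idx, (vec, action) in enumerate(steps):
        index[episode_id * 10_000 + step_idx] = (vec, action)
    return len(steps)

index = {}
maybe_index_episode(index, 2, [([0.1], "look")], success=False)   # skipped
maybe_index_episode(index, 3, [([0.2], "go to shelf 1")], success=True)
print(sorted(index))  # [30000]
```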

2. Blind reward decay destroys useful experts (E7d)

An expert trajectory for clean soapbar retrieved during a failed cool lettuce game gets penalized — even though the expert was irrelevant to that failure. After several rounds, good experts in unrelated task types lose reward unfairly.

3. Context-aware decay preserves memory quality (E7e)

Decay only when: (a) expert task type matches the failed game’s task type, AND (b) the agent actually followed the expert’s suggested action. This protects cross-task experts and experts whose advice was ignored.

| Scenario | Blind decay | Context-aware decay |
|---|---|---|
| Expert `clean` retrieved during failed `cool` game | -15% | No decay |
| Expert action retrieved but agent chose differently | -15% | No decay |
| Expert followed AND agent failed at same task type | -15% | -25% |
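The context-aware rule amounts to a single predicate guarding the decay; the constant and names below are an illustrative sketch:

```python
# Sketch of the E7e context-aware decay rule: decay an expert's reward
# only when BOTH conditions hold (task types match AND advice followed).
CONTEXT_DECAY = 0.25

def context_aware_decay(reward, expert_task, failed_task, followed):
    if expert_task == failed_task and followed:
        return reward * (1.0 - CONTEXT_DECAY)
    return reward  # cross-task experts and ignored advice are protected

# The three scenarios from the table above:
print(context_aware_decay(1.0, "clean", "cool",  followed=True))   # 1.0
print(context_aware_decay(1.0, "clean", "clean", followed=False))  # 1.0
print(context_aware_decay(1.0, "clean", "clean", followed=True))   # 0.75
```

Compared with blind decay, the only rewards that ever shrink are those of experts whose advice was actually taken in a failed game of the same task type.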

See RFC-013 Part E for the full analysis and RFC-013 Part F for the integrated active memory architecture that combines these findings.

E1: Code Generation (MBPP → HumanEval) — 77.8% pass@1 with episodic retrieval (qwen:7b)

E2: ALFWorld Plan Quality — 0.709 semantic similarity with episodic retrieval

E4: Iterative Debugging (APPS) — 28% → 31% with error-to-fix memory (qwen:7b)


Phase 1: Scale to Frontier Models — DONE


E5 (GPT-4o) and E6 (GPT-4o-mini) completed. CVX-Causal shows consistent 2× improvement across model scales. The 43.3% result with GPT-4o is the current best.

E7b shows 4× improvement across 3 rounds with compact strategy templates + online reward annotation. The learning curve (6.7% → 13.3% → 26.7%) suggests further rounds may improve. Next: study the saturation curve to understand when and why improvement plateaus.

Phase 2: Learning Curve & Saturation Analysis (In Progress)

| Experiment | Approach | Question |
|---|---|---|
| E7c | E7b with 10+ rounds | Where does the learning curve saturate? |
| E7c analysis | Per-task-type breakdown across rounds | Which task types improve most? Which plateau? |
| Transfer analysis | Compare retrieved vs actual actions | Why does retrieval fail for specific task types? |

Phase 3: Structural Extensions (Under Investigation)


Open question: can auxiliary structures (knowledge graphs, HMMs, Bayesian networks) improve the memory beyond what pure vector retrieval provides? See RFC-012 Part D.

LongMemEval — Temporal Reasoning Subtask

CVX’s temporal features (timestamps, ordering, causal_search) directly address temporal reasoning:

  • “When did X happen relative to Y?”
  • “What changed between session 3 and session 7?”
  • “What was the most recent update to topic Z?”
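Because CVX timestamps are monotonic within an episode (`episode_id * 10000 + step`), these queries reduce to integer comparisons over retrieved entries; the helper names below are an illustrative sketch:

```python
# Sketch: temporal-reasoning queries over CVX-style monotonic timestamps.
def happened_before(ts_x: int, ts_y: int) -> bool:
    """'Did X happen before Y?' is a plain integer comparison."""
    return ts_x < ts_y

def most_recent(entries):
    """entries: list of (timestamp, payload); the latest payload answers
    'what was the most recent update to topic Z?'."""
    return max(entries, key=lambda e: e[0])[1]

# Session 3, step 5 vs session 7, step 0:
print(happened_before(3 * 10_000 + 5, 7 * 10_000))  # True
print(most_recent([(30_005, "draft"), (70_000, "final")]))  # final
```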

Current SOTA: 71.2% (Zep/Graphiti + GPT-4o) on temporal reasoning, 95% (OM + gpt-5-mini) overall.

E8: Mem2ActBench

New benchmark (2025) testing whether agents can infer constraints from history and ground them into tool calls. Few baselines exist — early entry opportunity. CVX’s metadata filtering and temporal span tracking are directly applicable.

| Experiment | CVX Feature | Research Question |
|---|---|---|
| Reward-weighted retrieval | `search_with_reward()` | Does filtering by success improve action quality? |
| Context-conditioned search | Metadata pre-filtering | Does goal-aware retrieval outperform embedding-only? |
| Trajectory signature matching | `path_signature()` | Can we match solution shapes rather than states? |
| Memory regime detection | `detect_changepoints()` | Can we detect when agent strategy shifts? |
| Cross-episode Granger causality | `granger_causality()` | Do actions in episode A cause improvements in B? |

Target paper: “Trajectory-Aware Vector Memory for Interactive Agents”

  • Venue: NeurIPS / ICML / COLM
  • Core claim: Temporal vector memory with causal continuation outperforms unstructured retrieval and procedural memory for interactive agents
  • Experiments: E5 (ALFWorld + GPT-4o) + E7 (LongMemEval) + E8 (Mem2ActBench)
  • Baseline comparison: ExpeL, Memp, Reflexion, CLIN, Zep

  1. Park et al. “Generative Agents” (UIST 2023) — Memory stream with recency/relevance/importance
  2. Shinn et al. “Reflexion” (NeurIPS 2023) — Verbal self-reflection as memory
  3. Zhao et al. “ExpeL” (2023) — Experience extraction + insight learning
  4. Chen et al. “AutoManual” (NeurIPS 2024) — Online rule learning, 97.4% ALFWorld
  5. Majumder et al. “CLIN” (COLM 2024) — Continual causal abstractions
  6. Fang et al. “Memp” (2025) — Procedural memory repository
  7. Packer et al. “MemGPT/Letta” (2023) — OS-style hierarchical memory
  8. Rasmussen et al. “Zep/Graphiti” (2025) — Temporal knowledge graph for agents
  9. “MapAgent” (2025) — Trajectory-constructed memory for planning
  10. Zheng et al. “Synapse” (2023) — Trajectory-as-exemplar prompting
  11. Wu et al. “LongMemEval” (ICLR 2025) — 5 memory abilities, temporal reasoning
  12. Maharana et al. “LoCoMo” (ACL 2024) — Very long-term conversational memory
  13. “MemBench” (ACL Findings 2025) — Comprehensive memory evaluation
  14. “MemoryAgentBench” (ICLR 2026) — Incremental multi-turn interactions
  15. “Mem2ActBench” (2025) — Memory to action grounding
  16. “Memory in the Age of AI Agents” (2024) — Comprehensive taxonomy
  17. “Memory for Autonomous LLM Agents” (2026) — Mechanisms, evaluation, frontiers

| Notebook | Model | Focus | Key Result |
|---|---|---|---|
| `E1_episodic_coding` | qwen:7b | Code gen with episodic retrieval | 77.8% HumanEval pass@1 |
| `E2_episodic_alfworld` | qwen:7b | Plan quality from episodic retrieval | 0.709 semantic similarity |
| `E3_interactive_alfworld` | qwen:7b | Step-by-step agent with CVX causal search | 3.3% → 20.0% (6×) |
| `E4_iterative_coding` | qwen:7b | Debug retry with error memory | 28% → 31% |
| `E5_alfworld_gpt4o` | GPT-4o | ALFWorld with frontier model | 20.0% → 43.3% (2.2×) |
| `E6_alfworld_gpt4o_mini` | GPT-4o-mini | Cost-effective scaling test | 13.3% → 26.7% (2.0×) |
| `E7_online_learning` | qwen:7b | Online learning with observation context | 6.7% → 16.7% (verbose) |
| `E7b` | qwen:7b | Compact strategy + online learning | 6.7% → 26.7% (3 rounds) |
| `E7c` | qwen:7b | 10-round saturation study | Peak 20%, plateau 14.8% |
| `E7d` | qwen:7b | Clean memory (wins-only index) | Peak 26.7%, plateau 17.1% |
| `E7e` | qwen:7b | Context-aware reward decay | Peak 30%, plateau 19.5% |

| Notebook | Focus | Target |
|---|---|---|
| `E7c` | Learning curve saturation (10+ rounds) | Find plateau point |
| `E8_longmemeval` | Temporal reasoning evaluation | LongMemEval (ICLR 2025) |
| `E9_mem2act` | Memory-to-action grounding | Mem2ActBench (2025) |