Skip to content

Episodic Memory for LLM Agents — Experimental Report

Large Language Models (LLMs) generate text from scratch each time — they have no memory of how they solved similar problems before. When a coding agent hits a TypeError and eventually fixes it, that debugging experience is lost. When a household robot learns to find a mug in the kitchen, it can’t recall that experience when asked to find a plate.

This is the difference between semantic memory (general knowledge in model weights) and episodic memory (specific past experiences). Humans use both; current LLM agents use only semantic memory.

Retrieval-Augmented Generation (RAG) partially addresses this by retrieving relevant documents from a vector store. But standard RAG retrieves facts — it cannot retrieve how a problem was solved step by step, because vector stores have no concept of temporal ordering or episode identity.

ChronosVector is a temporal vector database built in Rust. Unlike standard vector stores (FAISS, Pinecone, etc.), CVX natively supports:

CapabilityStandard Vector StoreCVX
Find similar vectors✓ Cosine/L2 search✓ Same, via HNSW
Episode identity✗ Vectors are independententity_id groups vectors into episodes
Temporal ordering✗ No concept of “before/after”✓ Timestamps define step order within episodes
”What happened next?”✗ Cannot extract continuations✓ Search → find episode → extract subsequent steps
Trajectory reconstruction✗ Not possibletrajectory(entity_id) returns ordered steps

When an agent queries CVX with its current state (not a static task description), and CVX returns what successful agents did next from similar states, the agent performs significantly better than without memory.

This is “causal retrieval”: instead of “find similar problems”, we ask “find where someone was in my situation, and show me what they did next.”

Each step in an agent’s experience is stored as a vector in CVX:

entity_id: episode_id << 16 | step_index (groups steps into episodes)
timestamp: episode_id * 1000 + step_index (defines step order)
vector: embed(state + action + observation)

This encoding allows CVX to:

  1. Search across all steps of all episodes (finding similar states anywhere)
  2. Identify which episode a match belongs to (via timestamp // 1000)
  3. Extract the continuation (steps after the match) as procedural guidance

ComponentChoice
LLMqwen2.5-coder:7b-instruct via Ollama on GPU server (NVIDIA 3090)
Embeddingall-MiniLM-L6-v2 (D=384, local)
CVXChronosVector TemporalIndex (M=16, ef_construction=100)
BaselineFlatCosine: numpy brute-force cosine similarity (no temporal structure)

We run four experiments with increasing complexity, each designed to test whether CVX’s temporal structure adds value over flat vector search:

ExperimentDomainQuery TypeCVX Features Used
E1Code generation (HumanEval)Static (problem → retrieve → generate)insert, search
E2Embodied planning (ALFWorld)Static (task → retrieve → plan)insert, search
E3Embodied execution (ALFWorld)Interactive (observation → retrieve → act → repeat)insert, search, episode encoding, continuation extraction
E4Code debugging (APPS)Iterative (error → retrieve fix → retry → repeat)insert, search, error matching, multi-step traces

Goal: Can retrieving similar solved problems improve code generation?

  • Training corpus: MBPP (384 Python problems with solutions), each stored as a 3-step episode in CVX (problem → plan → solution)
  • Test benchmark: HumanEval (164 problems), split into validation (0–81) and test (82–163)
  • Protocol: Validation sweep k∈{1,3,5,7} at T=0; test with best k, T=0.2, 5 random seeds

Conditions:

ConditionHow it retrievesWhat it provides to the LLM
NoMemoryNothingJust the problem
RandomFewShotk random MBPP solutionsFull problem + solution
FlatCosinenumpy cosine on problem embeddingsFull problem + solution
CVX-EpisodicCVX search, filter to step_0 onlyFull problem + solution
CVX-CausalCVX search across ALL stepsOnly the continuation from match point

Notebook: E1_episodic_coding.ipynb — Cell 7 defines all retrieval functions, Cell 11 runs the validation sweep, Cell 14 runs the 5-seed test evaluation.

Validation (T=0, n=82):

kNoMemoryRandomFlatCosineCVX-EpisodicCVX-Causal
163.4%84.1%85.4%84.1%81.7%
363.4%78.0%86.6%86.6%80.5%
563.4%82.9%84.1%85.4%85.4%
763.4%82.9%82.9%82.9%82.9%

Test (T=0.2, 5 seeds, k=5, n=82):

Conditionpass@1 (mean ± std)
NoMemory72.9% ± 2.1%
CVX-Causal75.1% ± 1.8%
RandomFewShot76.6% ± 2.9%
CVX-Episodic77.8% ± 1.6%
FlatCosine78.5% ± 1.8%

Key output: Cell 18 — McNemar’s test shows CVX-Episodic vs FlatCosine has 0 discordant pairs (p=1.0). Retrieval overlap between CVX and FlatCosine is 96.7%.

CVX does not differentiate from flat cosine for code generation. All few-shot conditions perform similarly (~75-78%), significantly above zero-shot (73%), but the source of the examples doesn’t matter. For code generation, the value of few-shot comes from teaching the model the output format, not from problem-specific similarity.

CVX-Causal underperforms because with only 3 steps per episode (problem → plan → solution), the “continuation” is always either the full solution or nothing — there’s no meaningful mid-episode state.


4. Experiment E2: Static Embodied Planning

Section titled “4. Experiment E2: Static Embodied Planning”

Goal: Can retrieving similar expert trajectories improve action planning?

  • Dataset: AgentInstruct ALFWorld split — 336 GPT-4 expert trajectories (3–35 action steps each), parsed from conversation format
  • Evaluation: 99 episodes stratified by task type, leave-one-out protocol (each episode excluded from its own retrieval)

Conditions add FlatCosine and CVX-Causal to the baseline set.

Metrics (computed against expert trajectory):

  • Semantic Similarity: cosine(embed(plan), embed(expert trajectory))
  • Action Overlap: fraction of expert action verbs (go, take, put, open, etc.) present in plan
  • Object Recall: fraction of expert-mentioned objects (fridge 1, countertop 2, etc.) in plan
  • Step Ratio: plan_steps / expert_steps (closer to 1.0 = more detailed)

Notebook: E2_episodic_alfworld.ipynb — Cell 3 parses AgentInstruct conversations into structured episodes, Cell 7 defines all retrieval + metric functions, Cell 8 runs the 99-episode evaluation.

ConditionSemantic SimAction OverlapObject RecallStep Ratio
NoMemory0.6000.600.090.56
RandomTrajectory0.6140.780.170.70
FlatCosine0.7020.890.300.74
CVX-Episodic0.7090.880.280.74
CVX-Causal0.5880.700.210.43

Statistical tests (Wilcoxon signed-rank, Cell 11):

  • CVX-Episodic vs NoMemory: Δ=+0.109, p < 0.0001 ***
  • FlatCosine vs NoMemory: Δ=+0.102, p < 0.0001 ***
  • CVX-Episodic vs FlatCosine: Δ=+0.007, p=0.213 ns
  • CVX-Causal vs NoMemory: Δ=-0.012, p=0.927 ns

Semantic retrieval significantly outperforms random and zero-shot for embodied planning (p<0.0001). Object recall triples (0.09 → 0.30) when the agent sees expert trajectories with relevant objects and locations.

But CVX ≈ FlatCosine — both retrieve similar episodes and produce indistinguishable results. The temporal structure adds no value when retrieval is done once per task with a static query.

CVX-Causal is worse than zero-shot. The match step distribution (Cell 11) reveals why: 36.7% of matches land on step 10+ (near the end of episodes), so the “continuation” is just the last 2-3 actions without context. The problem: we’re querying with a task description against embeddings that contain "Task: X | Action: Y | Obs: Z" — the task text dilutes the match, causing it to hit late-episode states where the task prefix happens to be a close match.


5. Experiment E3: Interactive Embodied Agent

Section titled “5. Experiment E3: Interactive Embodied Agent”

E2 failed for CVX-Causal because the query was a static task description. But the whole point of causal retrieval is to query with the agent’s current state — which changes at every step. E3 tests this by running an actual agent inside the ALFWorld simulator.

Environment: ALFWorld TextWorld (134 eval_out_of_distribution games, household tasks like “put a cool tomato in microwave”)

The interactive loop (Cell 9):

1. env.reset() → observation ("You are in the middle of a room...")
2. embed(observation + task) → query CVX for similar past states
3. CVX returns continuations: "from a similar state, agents did X next"
4. LLM(observation + continuations + admissible_actions) → choose action
5. env.step(action) → new observation
6. Repeat until task complete or 30 steps

Critical difference from E2: the query to CVX changes at every step. When the agent is at “countertop 2 with a tomato”, CVX finds expert states at countertops with tomatoes and returns “take tomato → go to fridge → cool” — directly executable guidance.

Conditions:

ConditionMemory at Each Step
NoMemoryOnly current observation + admissible actions
CVX-CausalObservation → CVX search → inject continuation

Notebook: E3_interactive_alfworld.ipynb — Cell 7 implements retrieve_causal() (queries CVX with the current observation), Cell 9 implements the full agent loop with run_episode().

ConditionTasks CompletedRateMean Steps
NoMemory1/303.3%29.3
CVX-Causal6/3020.0%27.2

McNemar’s test (Cell 13):

  • CVX-Causal only won: 5 tasks
  • NoMemory only won: 0 tasks
  • Net: +5, χ²=3.20, p=0.074 (borderline at n=30)

6x improvement in task completion (3.3% → 20.0%). This is the strongest result across all experiments.

The same CVX index, the same LLM, the same expert trajectories as E2 — the only change is that the query evolves at each step based on the real environment state. This is what makes CVX’s temporal structure useful: it can find similar mid-episode states and extract what happened next.

Why FlatCosine cannot replicate this: A flat vector store has no concept of step ordering or episode identity. Given a matched vector, it cannot determine “this was step 5 of a 12-step episode” and extract steps 6-12. It can only return the k most similar vectors — which are not necessarily from the same episode or in any meaningful order.

From the retrieval test (Cell 7):

Query: "You are in the middle of a room... Your task is to: put a cool tomato in microwave"
Retrieved continuations:
ep=35, step 4/11 (sim=0.334): "cool some tomato and put it in microwave"
→ take tomato 1 from countertop 2
→ go to fridge 1
ep=89, step 4/11 (sim=0.357): "cool some tomato and put it in microwave"
→ take tomato 2 from countertop 2
→ go to fridge 1
ep=68, step 2/10 (sim=0.371): "put a cool tomato in microwave"
→ go to countertop 2
→ take tomato 1 from countertop 2

The agent receives concrete, executable next-actions derived from expert experience — not generic trajectories, but specifically what to do from its current situation.


6. Experiment E4: Iterative Code Debugging

Section titled “6. Experiment E4: Iterative Code Debugging”

E1 showed that static retrieval doesn’t help for code generation. E3 showed that interactive retrieval works for embodied tasks. E4 asks: can we make code generation interactive?

The idea: instead of generating code once, run a generate → test → debug loop. When the code fails, embed the error message and search CVX for similar past errors. CVX returns how those errors were fixed.

Phase 1 — Build debug trace corpus (Cell 6):

  • Generate solutions for 200 APPS interview problems (hard) at T=0
  • Run tests, capture errors
  • Retry with error feedback, capture second error
  • Store as 6-step episodes: problem → attempt1 → error1 → attempt2 → error2 → ground truth fix

Phase 2 — Evaluate (Cell 12):

  • 100 APPS introductory problems (medium difficulty, ~28% single-pass for 7B model)
  • 3 conditions, max 3 retries per problem
ConditionHow retries work
SinglePassOne attempt, no retry
Retry-NoMemoryRetry with error feedback only
Retry-CVX-CausalRetry with error + CVX fix suggestions from similar past errors

Notebook: E4_iterative_coding.ipynb — Cell 6 builds 172 multi-step traces with diverse error types, Cell 8 indexes in CVX (1032 vectors), Cell 10 implements retrieve_fix_suggestions(), Cell 12 runs the full evaluation.

Unlike E4 v1 (MBPP, 95% NameError), APPS interview produces semantically rich errors:

Error TypeCount%
Wrong Answer22766.4%
ValueError4814.0%
IndexError277.9%
Timeout154.4%
EOFError92.6%
Other164.7%

These are real algorithmic errors — wrong logic, edge cases, off-by-one — not typos.

Conditionpass@1Rescued from SinglePass
SinglePass28/100 = 28.0%
Retry-NoMemory28/100 = 28.0%2/72
Retry-CVX-Causal31/100 = 31.0%3/72

McNemar (Cell 13): CVX only=3, NoMem only=0, Net=+3, p=0.248 (ns at n=100)

Retry alone doesn’t help on APPS (28.0% = same as SinglePass). When the error is “Wrong Answer” (66% of cases), telling the model “your output was wrong” doesn’t help it fix the logic — it needs to see how similar errors were fixed.

CVX-Causal rescues 3 additional problems (3 CVX-only, 0 regressions). The direction is correct but the effect is modest (+3pp). The corpus of 172 traces from a different difficulty level (interview → introductory) limits transfer.


ExperimentDomainStatic/InteractiveCVX vs FlatCVX vs NoMemoryp-value
E1Code gen (HumanEval)Static= (96.7% overlap)+4.9pp (ns)1.000
E2Planning (ALFWorld)Static= (67% overlap)+10.9pp (***)<0.0001
E3Execution (ALFWorld)InteractiveN/A+16.7pp0.074
E4Debugging (APPS)IterativeN/A+3.0pp0.248

CVX’s temporal structure adds value only in interactive settings where the query evolves at each step.

  • Static retrieval (E1, E2): CVX ≈ FlatCosine. When you search once with a task description, HNSW finds the same neighbors as brute-force numpy. The episode encoding and temporal ordering go unused.

  • Interactive retrieval (E3, E4): The query is the agent’s actual current state — an environment observation or an error message — which changes at every step. CVX’s ability to find similar mid-episode states and extract continuations provides guidance that a flat store cannot.

7.3 What CVX Provides That Flat Cosine Cannot

Section titled “7.3 What CVX Provides That Flat Cosine Cannot”
CapabilityUsed in E1/E2?Used in E3/E4?Impact
HNSW approximate searchYesYesNone (same results as brute-force at this scale)
Episode encoding (entity_id)NoYesGroups steps into episodes for continuation extraction
Timestamp orderingNoYesDetermines step order → “what came next”
Multi-step embeddingsPartiallyYesAll action-observation steps are searchable, not just task descriptions
Continuation extractionNoYesReturn steps N+1..end from matched episode

The experiments suggest a clear architecture pattern for CVX-powered agents:

┌─────────────────────────────────────────────┐
│ Environment / Test Runner / User │
│ (provides observations / errors / feedback) │
└──────────────────┬──────────────────────────┘
│ observation at step t
┌──────────────────────────────────────────────┐
│ CVX Episodic Memory │
│ search(embed(observation)) → similar states │
│ extract continuation → "do X next" │
└──────────────────┬───────────────────────────┘
│ causal context
┌──────────────────────────────────────────────┐
│ LLM Agent │
│ (observation + causal context + actions) │
│ → choose next action │
└──────────────────┬───────────────────────────┘
│ action
Environment step → new observation
→ repeat

The key design principle: CVX is not a retrieval augmentation for prompts — it is a step-by-step memory system that the agent consults at every decision point.


  • Scale: The largest corpus is 336 episodes (4542 vectors). At this scale, HNSW and brute-force are equivalent. CVX’s scalability advantages (sub-linear search) would emerge at 100K+ vectors.
  • Single model: All experiments use qwen2.5-coder:7b. Larger models may show different patterns (ceiling effects for easy tasks, more benefit from memory for hard tasks).
  • E3 sample size: 30 games (p=0.074). Scaling to 134 games would likely reach statistical significance.
  • E4 trace transfer: Debug traces from APPS interview may not transfer well to APPS introductory (different difficulty, different error patterns).
  • No causal_search() in Python: We simulate causal retrieval with search() + timestamp decoding. Native causal_search() (available in Rust/MCP) would be more efficient.
  1. Scale E3 to 134 games for statistical power
  2. Add FlatCosine interactive baseline to E3 (give it full episodes at each step) to directly measure the value of continuation extraction vs full-episode context
  3. Scale E4 trace corpus to 1000+ APPS problems for richer error diversity
  4. Expose causal_search() in Python bindings for native temporal retrieval
  • Online learning: Agents that add their own experience to CVX during evaluation (not just expert trajectories)
  • Velocity-based filtering: Use cvx.velocity() to select episodes where the agent was making rapid progress (high semantic velocity) vs stuck (low velocity)
  • Cross-domain transfer: Can debug traces from one domain help in another?

NotebookLocationKey Cells
E1notebooks/E1_episodic_coding.ipynbCell 7 (retrieval), Cell 11 (val sweep), Cell 14 (test), Cell 18 (stats)
E2notebooks/E2_episodic_alfworld.ipynbCell 7 (retrieval + metrics), Cell 8 (evaluation), Cell 11 (stats)
E3notebooks/E3_interactive_alfworld.ipynbCell 7 (causal retrieval), Cell 9 (agent loop), Cell 11 (evaluation)
E4notebooks/E4_iterative_coding.ipynbCell 6 (trace building), Cell 10 (fix retrieval), Cell 12 (evaluation)
Terminal window
# Python packages
pip install sentence-transformers openai datasets scipy plotly
# For E3: ALFWorld environment
pip install alfworld
alfworld-download
# For E3 on Python 3.13: patch textworld eval bug
# In textworld/envs/pddl/textgen/__init__.py, line 94-97:
# Replace: locals().update(context["variables"])
# value = eval(self.expression)
# With: value = eval(self.expression, {}, context["variables"])
# Ollama on GPU server
OLLAMA_HOST=0.0.0.0 ollama serve
ollama pull qwen2.5-coder:7b-instruct

All CVX indices and metadata are cached in data/episodic/ and rebuild automatically if missing. The MBPP, HumanEval, AgentInstruct, and APPS datasets are downloaded from HuggingFace on first run.


  • Fang et al. (2025). Memp: Exploring Agent Procedural Memory. arXiv:2508.06433
  • Maharana et al. (2024). LoCoMo: Evaluating Very Long-Term Conversational Memory. ACL 2024
  • Xu et al. (2025). A-MEM: Agentic Memory for LLM Agents. arXiv:2502.12110
  • Position paper (2025). Episodic Memory is the Missing Piece. arXiv:2502.06975
  • Shinn et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023
  • Zhou et al. (2023). Language Agent Tree Search. NeurIPS 2023
  • Hendrycks et al. (2021). Measuring Coding Challenge Competence With APPS. NeurIPS 2021
  • Shridhar et al. (2021). ALFWorld: Aligning Text and Embodied Environments. ICLR 2021