Episodic Memory for LLM Agents — Experimental Report
1. Introduction
Section titled “1. Introduction”1.1 What Problem Are We Solving?
Section titled “1.1 What Problem Are We Solving?”Large Language Models (LLMs) generate text from scratch each time — they have no memory of how they solved similar problems before. When a coding agent hits a TypeError and eventually fixes it, that debugging experience is lost. When a household robot learns to find a mug in the kitchen, it can’t recall that experience when asked to find a plate.
This is the difference between semantic memory (general knowledge in model weights) and episodic memory (specific past experiences). Humans use both; current LLM agents use only semantic memory.
Retrieval-Augmented Generation (RAG) partially addresses this by retrieving relevant documents from a vector store. But standard RAG retrieves facts — it cannot retrieve how a problem was solved step by step, because vector stores have no concept of temporal ordering or episode identity.
1.2 What is ChronosVector (CVX)?
Section titled “1.2 What is ChronosVector (CVX)?”ChronosVector is a temporal vector database built in Rust. Unlike standard vector stores (FAISS, Pinecone, etc.), CVX natively supports:
| Capability | Standard Vector Store | CVX |
|---|---|---|
| Find similar vectors | ✓ Cosine/L2 search | ✓ Same, via HNSW |
| Episode identity | ✗ Vectors are independent | ✓ entity_id groups vectors into episodes |
| Temporal ordering | ✗ No concept of “before/after” | ✓ Timestamps define step order within episodes |
| ”What happened next?” | ✗ Cannot extract continuations | ✓ Search → find episode → extract subsequent steps |
| Trajectory reconstruction | ✗ Not possible | ✓ trajectory(entity_id) returns ordered steps |
1.3 The Core Hypothesis
Section titled “1.3 The Core Hypothesis”When an agent queries CVX with its current state (not a static task description), and CVX returns what successful agents did next from similar states, the agent performs significantly better than without memory.
This is “causal retrieval”: instead of “find similar problems”, we ask “find where someone was in my situation, and show me what they did next.”
1.4 Episode Data Model
Section titled “1.4 Episode Data Model”Each step in an agent’s experience is stored as a vector in CVX:
entity_id: episode_id << 16 | step_index (groups steps into episodes)timestamp: episode_id * 1000 + step_index (defines step order)vector: embed(state + action + observation)This encoding allows CVX to:
- Search across all steps of all episodes (finding similar states anywhere)
- Identify which episode a match belongs to (via timestamp // 1000)
- Extract the continuation (steps after the match) as procedural guidance
2. Experimental Setup
Section titled “2. Experimental Setup”2.1 Infrastructure
Section titled “2.1 Infrastructure”| Component | Choice |
|---|---|
| LLM | qwen2.5-coder:7b-instruct via Ollama on GPU server (NVIDIA 3090) |
| Embedding | all-MiniLM-L6-v2 (D=384, local) |
| CVX | ChronosVector TemporalIndex (M=16, ef_construction=100) |
| Baseline | FlatCosine: numpy brute-force cosine similarity (no temporal structure) |
2.2 Four Experiments
Section titled “2.2 Four Experiments”We run four experiments with increasing complexity, each designed to test whether CVX’s temporal structure adds value over flat vector search:
| Experiment | Domain | Query Type | CVX Features Used |
|---|---|---|---|
| E1 | Code generation (HumanEval) | Static (problem → retrieve → generate) | insert, search |
| E2 | Embodied planning (ALFWorld) | Static (task → retrieve → plan) | insert, search |
| E3 | Embodied execution (ALFWorld) | Interactive (observation → retrieve → act → repeat) | insert, search, episode encoding, continuation extraction |
| E4 | Code debugging (APPS) | Iterative (error → retrieve fix → retry → repeat) | insert, search, error matching, multi-step traces |
3. Experiment E1: Static Code Generation
Section titled “3. Experiment E1: Static Code Generation”3.1 Setup
Section titled “3.1 Setup”Goal: Can retrieving similar solved problems improve code generation?
- Training corpus: MBPP (384 Python problems with solutions), each stored as a 3-step episode in CVX (problem → plan → solution)
- Test benchmark: HumanEval (164 problems), split into validation (0–81) and test (82–163)
- Protocol: Validation sweep k∈{1,3,5,7} at T=0; test with best k, T=0.2, 5 random seeds
Conditions:
| Condition | How it retrieves | What it provides to the LLM |
|---|---|---|
| NoMemory | Nothing | Just the problem |
| RandomFewShot | k random MBPP solutions | Full problem + solution |
| FlatCosine | numpy cosine on problem embeddings | Full problem + solution |
| CVX-Episodic | CVX search, filter to step_0 only | Full problem + solution |
| CVX-Causal | CVX search across ALL steps | Only the continuation from match point |
Notebook:
E1_episodic_coding.ipynb— Cell 7 defines all retrieval functions, Cell 11 runs the validation sweep, Cell 14 runs the 5-seed test evaluation.
3.2 Results
Section titled “3.2 Results”Validation (T=0, n=82):
| k | NoMemory | Random | FlatCosine | CVX-Episodic | CVX-Causal |
|---|---|---|---|---|---|
| 1 | 63.4% | 84.1% | 85.4% | 84.1% | 81.7% |
| 3 | 63.4% | 78.0% | 86.6% | 86.6% | 80.5% |
| 5 | 63.4% | 82.9% | 84.1% | 85.4% | 85.4% |
| 7 | 63.4% | 82.9% | 82.9% | 82.9% | 82.9% |
Test (T=0.2, 5 seeds, k=5, n=82):
| Condition | pass@1 (mean ± std) |
|---|---|
| NoMemory | 72.9% ± 2.1% |
| CVX-Causal | 75.1% ± 1.8% |
| RandomFewShot | 76.6% ± 2.9% |
| CVX-Episodic | 77.8% ± 1.6% |
| FlatCosine | 78.5% ± 1.8% |
Key output: Cell 18 — McNemar’s test shows CVX-Episodic vs FlatCosine has 0 discordant pairs (p=1.0). Retrieval overlap between CVX and FlatCosine is 96.7%.
3.3 Interpretation
Section titled “3.3 Interpretation”CVX does not differentiate from flat cosine for code generation. All few-shot conditions perform similarly (~75-78%), significantly above zero-shot (73%), but the source of the examples doesn’t matter. For code generation, the value of few-shot comes from teaching the model the output format, not from problem-specific similarity.
CVX-Causal underperforms because with only 3 steps per episode (problem → plan → solution), the “continuation” is always either the full solution or nothing — there’s no meaningful mid-episode state.
4. Experiment E2: Static Embodied Planning
Section titled “4. Experiment E2: Static Embodied Planning”4.1 Setup
Section titled “4.1 Setup”Goal: Can retrieving similar expert trajectories improve action planning?
- Dataset: AgentInstruct ALFWorld split — 336 GPT-4 expert trajectories (3–35 action steps each), parsed from conversation format
- Evaluation: 99 episodes stratified by task type, leave-one-out protocol (each episode excluded from its own retrieval)
Conditions add FlatCosine and CVX-Causal to the baseline set.
Metrics (computed against expert trajectory):
- Semantic Similarity: cosine(embed(plan), embed(expert trajectory))
- Action Overlap: fraction of expert action verbs (go, take, put, open, etc.) present in plan
- Object Recall: fraction of expert-mentioned objects (fridge 1, countertop 2, etc.) in plan
- Step Ratio: plan_steps / expert_steps (closer to 1.0 = more detailed)
Notebook:
E2_episodic_alfworld.ipynb— Cell 3 parses AgentInstruct conversations into structured episodes, Cell 7 defines all retrieval + metric functions, Cell 8 runs the 99-episode evaluation.
4.2 Results
Section titled “4.2 Results”| Condition | Semantic Sim | Action Overlap | Object Recall | Step Ratio |
|---|---|---|---|---|
| NoMemory | 0.600 | 0.60 | 0.09 | 0.56 |
| RandomTrajectory | 0.614 | 0.78 | 0.17 | 0.70 |
| FlatCosine | 0.702 | 0.89 | 0.30 | 0.74 |
| CVX-Episodic | 0.709 | 0.88 | 0.28 | 0.74 |
| CVX-Causal | 0.588 | 0.70 | 0.21 | 0.43 |
Statistical tests (Wilcoxon signed-rank, Cell 11):
- CVX-Episodic vs NoMemory: Δ=+0.109, p < 0.0001 ***
- FlatCosine vs NoMemory: Δ=+0.102, p < 0.0001 ***
- CVX-Episodic vs FlatCosine: Δ=+0.007, p=0.213 ns
- CVX-Causal vs NoMemory: Δ=-0.012, p=0.927 ns
4.3 Interpretation
Section titled “4.3 Interpretation”Semantic retrieval significantly outperforms random and zero-shot for embodied planning (p<0.0001). Object recall triples (0.09 → 0.30) when the agent sees expert trajectories with relevant objects and locations.
But CVX ≈ FlatCosine — both retrieve similar episodes and produce indistinguishable results. The temporal structure adds no value when retrieval is done once per task with a static query.
CVX-Causal is worse than zero-shot. The match step distribution (Cell 11) reveals why: 36.7% of matches land on step 10+ (near the end of episodes), so the “continuation” is just the last 2-3 actions without context. The problem: we’re querying with a task description against embeddings that contain "Task: X | Action: Y | Obs: Z" — the task text dilutes the match, causing it to hit late-episode states where the task prefix happens to be a close match.
5. Experiment E3: Interactive Embodied Agent
Section titled “5. Experiment E3: Interactive Embodied Agent”5.1 The Key Insight
Section titled “5.1 The Key Insight”E2 failed for CVX-Causal because the query was a static task description. But the whole point of causal retrieval is to query with the agent’s current state — which changes at every step. E3 tests this by running an actual agent inside the ALFWorld simulator.
5.2 Setup
Section titled “5.2 Setup”Environment: ALFWorld TextWorld (134 eval_out_of_distribution games, household tasks like “put a cool tomato in microwave”)
The interactive loop (Cell 9):
1. env.reset() → observation ("You are in the middle of a room...")2. embed(observation + task) → query CVX for similar past states3. CVX returns continuations: "from a similar state, agents did X next"4. LLM(observation + continuations + admissible_actions) → choose action5. env.step(action) → new observation6. Repeat until task complete or 30 stepsCritical difference from E2: the query to CVX changes at every step. When the agent is at “countertop 2 with a tomato”, CVX finds expert states at countertops with tomatoes and returns “take tomato → go to fridge → cool” — directly executable guidance.
Conditions:
| Condition | Memory at Each Step |
|---|---|
| NoMemory | Only current observation + admissible actions |
| CVX-Causal | Observation → CVX search → inject continuation |
Notebook:
E3_interactive_alfworld.ipynb— Cell 7 implementsretrieve_causal()(queries CVX with the current observation), Cell 9 implements the full agent loop withrun_episode().
5.3 Results
Section titled “5.3 Results”| Condition | Tasks Completed | Rate | Mean Steps |
|---|---|---|---|
| NoMemory | 1/30 | 3.3% | 29.3 |
| CVX-Causal | 6/30 | 20.0% | 27.2 |
McNemar’s test (Cell 13):
- CVX-Causal only won: 5 tasks
- NoMemory only won: 0 tasks
- Net: +5, χ²=3.20, p=0.074 (borderline at n=30)
5.4 Interpretation
Section titled “5.4 Interpretation”6x improvement in task completion (3.3% → 20.0%). This is the strongest result across all experiments.
The same CVX index, the same LLM, the same expert trajectories as E2 — the only change is that the query evolves at each step based on the real environment state. This is what makes CVX’s temporal structure useful: it can find similar mid-episode states and extract what happened next.
Why FlatCosine cannot replicate this: A flat vector store has no concept of step ordering or episode identity. Given a matched vector, it cannot determine “this was step 5 of a 12-step episode” and extract steps 6-12. It can only return the k most similar vectors — which are not necessarily from the same episode or in any meaningful order.
5.5 Qualitative Example
Section titled “5.5 Qualitative Example”From the retrieval test (Cell 7):
Query: "You are in the middle of a room... Your task is to: put a cool tomato in microwave"
Retrieved continuations: ep=35, step 4/11 (sim=0.334): "cool some tomato and put it in microwave" → take tomato 1 from countertop 2 → go to fridge 1 ep=89, step 4/11 (sim=0.357): "cool some tomato and put it in microwave" → take tomato 2 from countertop 2 → go to fridge 1 ep=68, step 2/10 (sim=0.371): "put a cool tomato in microwave" → go to countertop 2 → take tomato 1 from countertop 2The agent receives concrete, executable next-actions derived from expert experience — not generic trajectories, but specifically what to do from its current situation.
6. Experiment E4: Iterative Code Debugging
Section titled “6. Experiment E4: Iterative Code Debugging”6.1 Motivation
Section titled “6.1 Motivation”E1 showed that static retrieval doesn’t help for code generation. E3 showed that interactive retrieval works for embodied tasks. E4 asks: can we make code generation interactive?
The idea: instead of generating code once, run a generate → test → debug loop. When the code fails, embed the error message and search CVX for similar past errors. CVX returns how those errors were fixed.
6.2 Setup
Section titled “6.2 Setup”Phase 1 — Build debug trace corpus (Cell 6):
- Generate solutions for 200 APPS interview problems (hard) at T=0
- Run tests, capture errors
- Retry with error feedback, capture second error
- Store as 6-step episodes: problem → attempt1 → error1 → attempt2 → error2 → ground truth fix
Phase 2 — Evaluate (Cell 12):
- 100 APPS introductory problems (medium difficulty, ~28% single-pass for 7B model)
- 3 conditions, max 3 retries per problem
| Condition | How retries work |
|---|---|
| SinglePass | One attempt, no retry |
| Retry-NoMemory | Retry with error feedback only |
| Retry-CVX-Causal | Retry with error + CVX fix suggestions from similar past errors |
Notebook:
E4_iterative_coding.ipynb— Cell 6 builds 172 multi-step traces with diverse error types, Cell 8 indexes in CVX (1032 vectors), Cell 10 implementsretrieve_fix_suggestions(), Cell 12 runs the full evaluation.
6.3 Debug Trace Quality
Section titled “6.3 Debug Trace Quality”Unlike E4 v1 (MBPP, 95% NameError), APPS interview produces semantically rich errors:
| Error Type | Count | % |
|---|---|---|
| Wrong Answer | 227 | 66.4% |
| ValueError | 48 | 14.0% |
| IndexError | 27 | 7.9% |
| Timeout | 15 | 4.4% |
| EOFError | 9 | 2.6% |
| Other | 16 | 4.7% |
These are real algorithmic errors — wrong logic, edge cases, off-by-one — not typos.
6.4 Results
Section titled “6.4 Results”| Condition | pass@1 | Rescued from SinglePass |
|---|---|---|
| SinglePass | 28/100 = 28.0% | — |
| Retry-NoMemory | 28/100 = 28.0% | 2/72 |
| Retry-CVX-Causal | 31/100 = 31.0% | 3/72 |
McNemar (Cell 13): CVX only=3, NoMem only=0, Net=+3, p=0.248 (ns at n=100)
6.5 Interpretation
Section titled “6.5 Interpretation”Retry alone doesn’t help on APPS (28.0% = same as SinglePass). When the error is “Wrong Answer” (66% of cases), telling the model “your output was wrong” doesn’t help it fix the logic — it needs to see how similar errors were fixed.
CVX-Causal rescues 3 additional problems (3 CVX-only, 0 regressions). The direction is correct but the effect is modest (+3pp). The corpus of 172 traces from a different difficulty level (interview → introductory) limits transfer.
7. Consolidated Findings
Section titled “7. Consolidated Findings”7.1 Summary Table
Section titled “7.1 Summary Table”| Experiment | Domain | Static/Interactive | CVX vs Flat | CVX vs NoMemory | p-value |
|---|---|---|---|---|---|
| E1 | Code gen (HumanEval) | Static | = (96.7% overlap) | +4.9pp (ns) | 1.000 |
| E2 | Planning (ALFWorld) | Static | = (67% overlap) | +10.9pp (***) | <0.0001 |
| E3 | Execution (ALFWorld) | Interactive | N/A | +16.7pp | 0.074 |
| E4 | Debugging (APPS) | Iterative | N/A | +3.0pp | 0.248 |
7.2 The Central Insight
Section titled “7.2 The Central Insight”CVX’s temporal structure adds value only in interactive settings where the query evolves at each step.
-
Static retrieval (E1, E2): CVX ≈ FlatCosine. When you search once with a task description, HNSW finds the same neighbors as brute-force numpy. The episode encoding and temporal ordering go unused.
-
Interactive retrieval (E3, E4): The query is the agent’s actual current state — an environment observation or an error message — which changes at every step. CVX’s ability to find similar mid-episode states and extract continuations provides guidance that a flat store cannot.
7.3 What CVX Provides That Flat Cosine Cannot
Section titled “7.3 What CVX Provides That Flat Cosine Cannot”| Capability | Used in E1/E2? | Used in E3/E4? | Impact |
|---|---|---|---|
| HNSW approximate search | Yes | Yes | None (same results as brute-force at this scale) |
| Episode encoding (entity_id) | No | Yes | Groups steps into episodes for continuation extraction |
| Timestamp ordering | No | Yes | Determines step order → “what came next” |
| Multi-step embeddings | Partially | Yes | All action-observation steps are searchable, not just task descriptions |
| Continuation extraction | No | Yes | Return steps N+1..end from matched episode |
7.4 Implications for System Design
Section titled “7.4 Implications for System Design”The experiments suggest a clear architecture pattern for CVX-powered agents:
┌─────────────────────────────────────────────┐│ Environment / Test Runner / User ││ (provides observations / errors / feedback) │└──────────────────┬──────────────────────────┘ │ observation at step t ▼┌──────────────────────────────────────────────┐│ CVX Episodic Memory ││ search(embed(observation)) → similar states ││ extract continuation → "do X next" │└──────────────────┬───────────────────────────┘ │ causal context ▼┌──────────────────────────────────────────────┐│ LLM Agent ││ (observation + causal context + actions) ││ → choose next action │└──────────────────┬───────────────────────────┘ │ action ▼ Environment step → new observation → repeatThe key design principle: CVX is not a retrieval augmentation for prompts — it is a step-by-step memory system that the agent consults at every decision point.
8. Limitations and Future Work
Section titled “8. Limitations and Future Work”8.1 Current Limitations
Section titled “8.1 Current Limitations”- Scale: The largest corpus is 336 episodes (4542 vectors). At this scale, HNSW and brute-force are equivalent. CVX’s scalability advantages (sub-linear search) would emerge at 100K+ vectors.
- Single model: All experiments use qwen2.5-coder:7b. Larger models may show different patterns (ceiling effects for easy tasks, more benefit from memory for hard tasks).
- E3 sample size: 30 games (p=0.074). Scaling to 134 games would likely reach statistical significance.
- E4 trace transfer: Debug traces from APPS interview may not transfer well to APPS introductory (different difficulty, different error patterns).
- No
causal_search()in Python: We simulate causal retrieval withsearch()+ timestamp decoding. Nativecausal_search()(available in Rust/MCP) would be more efficient.
8.2 Immediate Next Steps
Section titled “8.2 Immediate Next Steps”- Scale E3 to 134 games for statistical power
- Add FlatCosine interactive baseline to E3 (give it full episodes at each step) to directly measure the value of continuation extraction vs full-episode context
- Scale E4 trace corpus to 1000+ APPS problems for richer error diversity
- Expose
causal_search()in Python bindings for native temporal retrieval
8.3 Research Directions
Section titled “8.3 Research Directions”- Online learning: Agents that add their own experience to CVX during evaluation (not just expert trajectories)
- Velocity-based filtering: Use
cvx.velocity()to select episodes where the agent was making rapid progress (high semantic velocity) vs stuck (low velocity) - Cross-domain transfer: Can debug traces from one domain help in another?
9. Reproducibility
Section titled “9. Reproducibility”9.1 Notebooks
Section titled “9.1 Notebooks”| Notebook | Location | Key Cells |
|---|---|---|
| E1 | notebooks/E1_episodic_coding.ipynb | Cell 7 (retrieval), Cell 11 (val sweep), Cell 14 (test), Cell 18 (stats) |
| E2 | notebooks/E2_episodic_alfworld.ipynb | Cell 7 (retrieval + metrics), Cell 8 (evaluation), Cell 11 (stats) |
| E3 | notebooks/E3_interactive_alfworld.ipynb | Cell 7 (causal retrieval), Cell 9 (agent loop), Cell 11 (evaluation) |
| E4 | notebooks/E4_iterative_coding.ipynb | Cell 6 (trace building), Cell 10 (fix retrieval), Cell 12 (evaluation) |
9.2 Prerequisites
Section titled “9.2 Prerequisites”# Python packagespip install sentence-transformers openai datasets scipy plotly
# For E3: ALFWorld environmentpip install alfworldalfworld-download
# For E3 on Python 3.13: patch textworld eval bug# In textworld/envs/pddl/textgen/__init__.py, line 94-97:# Replace: locals().update(context["variables"])# value = eval(self.expression)# With: value = eval(self.expression, {}, context["variables"])
# Ollama on GPU serverOLLAMA_HOST=0.0.0.0 ollama serveollama pull qwen2.5-coder:7b-instruct9.3 Data
Section titled “9.3 Data”All CVX indices and metadata are cached in data/episodic/ and rebuild automatically if missing. The MBPP, HumanEval, AgentInstruct, and APPS datasets are downloaded from HuggingFace on first run.
10. References
Section titled “10. References”- Fang et al. (2025). Memp: Exploring Agent Procedural Memory. arXiv:2508.06433
- Maharana et al. (2024). LoCoMo: Evaluating Very Long-Term Conversational Memory. ACL 2024
- Xu et al. (2025). A-MEM: Agentic Memory for LLM Agents. arXiv:2502.12110
- Position paper (2025). Episodic Memory is the Missing Piece. arXiv:2502.06975
- Shinn et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023
- Zhou et al. (2023). Language Agent Tree Search. NeurIPS 2023
- Hendrycks et al. (2021). Measuring Coding Challenge Competence With APPS. NeurIPS 2021
- Shridhar et al. (2021). ALFWorld: Aligning Text and Embodied Environments. ICLR 2021