E2: Episodic ALFWorld Benchmark — Results
Hypothesis
H1: Semantic retrieval significantly outperforms random retrieval and zero-shot prompting for embodied planning.
H2: CVX-Causal (match mid-episode states, return continuation) outperforms full-episode retrieval by providing targeted procedural context.
Experimental Setup
| Component | Choice |
|---|---|
| Dataset | AgentInstruct ALFWorld (336 GPT-4 expert trajectories, 3–35 steps) |
| LLM | qwen2.5-coder:7b-instruct (Ollama) |
| Embedding | all-MiniLM-L6-v2 (D=384) |
| Evaluation | 99 episodes, stratified by task type, leave-one-out, k=3 |
Conditions
| Condition | Retrieval | Context |
|---|---|---|
| NoMemory | None | Zero-shot |
| RandomTrajectory | Random | Full expert trajectory |
| FlatCosine | numpy cosine on tasks | Full trajectory |
| CVX-Episodic | CVX search | Full trajectory |
| CVX-Causal | CVX search ALL steps | Continuation only |
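As a rough illustration of the FlatCosine condition under leave-one-out evaluation with k=3, the sketch below ranks stored task vectors by cosine similarity to the query task and skips the query episode itself. It uses only the standard library rather than numpy, and the function names and the 2-D toy vectors are hypothetical stand-ins for the 384-D task embeddings, not the benchmark's own code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_leave_one_out(task_vecs, query_idx, k=3):
    """Return indices of the k most similar tasks, excluding the query itself."""
    scored = [
        (cosine(task_vecs[query_idx], vec), idx)
        for idx, vec in enumerate(task_vecs)
        if idx != query_idx  # leave-one-out: never retrieve the held-out episode
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy 2-D vectors (assumption) standing in for real task embeddings.
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
neighbours = top_k_leave_one_out(toy_vecs, query_idx=0, k=3)
print(neighbours)
```

The retrieved indices would then select the expert trajectories that are placed in the prompt as context.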
Metrics
- Semantic Similarity: cosine(plan embedding, expert trajectory embedding)
- Action Overlap: fraction of expert action verbs in plan
- Object Recall: fraction of expert objects in plan
- Step Ratio: plan_steps / expert_steps
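The three non-embedding metrics can be sketched as simple set and length ratios. The extraction rules here are assumptions made for illustration, since the source does not specify them: the action verb is taken to be the first word of a step, and objects are taken to be lowercase words followed by an instance number (as in "mug 1").

```python
import re

def first_word(step):
    """Assumed verb extraction: the first token of the step."""
    return step.split()[0]

def objects_in(step):
    """Assumed object extraction: lowercase words followed by an instance number."""
    return re.findall(r"([a-z]+) \d+", step)

def recall(expert_items, plan_items):
    """Fraction of distinct expert items that also appear in the plan."""
    expert = set(expert_items)
    if not expert:
        return 0.0
    return len(expert & set(plan_items)) / len(expert)

def plan_metrics(plan_steps, expert_steps):
    return {
        "action_overlap": recall(
            [first_word(s) for s in expert_steps],
            [first_word(s) for s in plan_steps],
        ),
        "object_recall": recall(
            [o for s in expert_steps for o in objects_in(s)],
            [o for s in plan_steps for o in objects_in(s)],
        ),
        "step_ratio": len(plan_steps) / len(expert_steps),
    }

# Hypothetical ALFWorld-style steps, for illustration only.
expert = [
    "go to countertop 1",
    "take mug 1 from countertop 1",
    "go to sinkbasin 1",
    "clean mug 1 with sinkbasin 1",
    "go to coffeemachine 1",
    "put mug 1 in/on coffeemachine 1",
]
plan = [
    "go to countertop 1",
    "take mug 1 from countertop 1",
    "put mug 1 in/on coffeemachine 1",
]
metrics = plan_metrics(plan, expert)
print(metrics)
```

In this toy case the plan skips the cleaning step, so it loses one verb ("clean") and one object ("sinkbasin") and covers half of the expert's step count.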
Results
| Condition | Semantic Sim. | Action Overlap | Object Recall | Step Ratio |
|---|---|---|---|---|
| NoMemory | 0.600 | 0.60 | 0.09 | 0.56 |
| RandomTrajectory | 0.614 | 0.78 | 0.17 | 0.70 |
| FlatCosine | 0.702 | 0.89 | 0.30 | 0.74 |
| CVX-Episodic | 0.709 | 0.88 | 0.28 | 0.74 |
| CVX-Causal | 0.588 | 0.70 | 0.21 | 0.43 |
Statistical Tests (Wilcoxon, semantic similarity)
| Comparison | Δ | p | Sig |
|---|---|---|---|
| CVX-Episodic vs NoMemory | +0.109 | <0.0001 | *** |
| FlatCosine vs NoMemory | +0.102 | <0.0001 | *** |
| CVX-Episodic vs Random | +0.095 | <0.0001 | *** |
| CVX-Episodic vs FlatCosine | +0.007 | 0.213 | ns |
| CVX-Causal vs NoMemory | -0.012 | 0.927 | ns |
| CVX-Causal vs FlatCosine | -0.114 | 1.000 | ns |
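A p-value of 1.000 alongside a negative Δ is what a one-sided test in the "first condition is greater" direction would produce. The source does not state which alternative was used, so the sketch below is an assumption: a standard-library Wilcoxon signed-rank test with a normal approximation, with zero differences dropped and no tie or continuity correction. In practice, `scipy.stats.wilcoxon(a, b, alternative="greater")` is the library route.

```python
import math

def wilcoxon_greater(a, b):
    """Approximate one-sided p-value for H1: paired values in a exceed those in b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

# Synthetic paired scores (not benchmark data): one condition is uniformly higher.
higher = [0.70 + 0.001 * i for i in range(30)]
lower = [0.60] * 30
p_forward = wilcoxon_greater(higher, lower)
p_reverse = wilcoxon_greater(lower, higher)
print(p_forward, p_reverse)
```

Testing the weaker condition against the stronger one pushes the p-value toward 1, which matches the pattern in the CVX-Causal rows.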
CVX-Causal Match Step Distribution
| Match step | Count | Share | Bar |
|---|---|---|---|
| 0 | 21 | 7.1% | ██ |
| 1 | 8 | 2.7% | █ |
| 2 | 9 | 3.0% | █ |
| 3 | 22 | 7.4% | ██ |
| 5 | 23 | 7.7% | ███ |
| 7 | 39 | 13.1% | █████ |
| 10+ | 109 | 36.7% | ███████████████ |
Key Findings
- H1 confirmed: Semantic retrieval (FlatCosine/CVX-Episodic) significantly outperforms zero-shot (+0.10 semantic similarity, p<0.0001) and random retrieval (+0.09, p<0.0001). Object recall more than triples (0.09 → 0.30).
- H2 rejected: CVX-Causal (0.588) scores slightly below NoMemory (0.600), a difference that is not significant (p=0.927), and well below FlatCosine (0.702). The continuation-only format, without full task context, appears to confuse the LLM.
- CVX-Episodic ≈ FlatCosine: +0.007 semantic similarity (p=0.213, ns), with 67% retrieval overlap.
- Why CVX-Causal fails: 36.7% of matches are at step 10+, so the continuation is often just the last 2-3 steps of an episode (e.g., “put X in/on Y”). Without the preceding context (which objects, which rooms), these fragments are useless. The query (task description) matches late-episode states because they contain the task text as a prefix.
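The effect of a late match on continuation length can be seen in a minimal sketch. The slicing rule (return every step after the matched one) and the 13-step episode are assumptions for illustration, not the benchmark's actual implementation.

```python
def continuation(episode_steps, match_step):
    """Return the steps that follow the matched step (assumed slicing rule)."""
    return episode_steps[match_step + 1:]

# Hypothetical 13-step episode; the labels stand in for real actions.
episode = [f"step {i}" for i in range(13)]

early = continuation(episode, match_step=1)
late = continuation(episode, match_step=10)
print(len(early), len(late))
```

A match at step 1 leaves 11 steps of procedure for the LLM, while a match at step 10 leaves only the final two, with none of the earlier navigation or object context.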
Root Cause Analysis: CVX-Causal
The causal retrieval hypothesis is sound, but the implementation context is wrong:
| What we do | What we should do |
|---|---|
| Query = task description | Query = current agent state (observation) |
| Match against any step | Match against observation-only embeddings |
| Static plan generation | Interactive step-by-step with environment |
| Return text continuation | Return structured action sequence |
CVX-Causal requires an interactive loop:
- Agent takes action in environment
- Environment returns observation
- Embed observation → `search()` in CVX
- Find similar mid-episode state in memory
- Return continuation → agent uses it for next action
- Repeat until task completion
This is a fundamentally different architecture from static plan generation. It requires the ALFWorld simulator running in the loop, which is the natural next step.
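The interactive loop could be sketched as follows. `ToyEnv` and `ToyMemory` are hypothetical stand-ins, not the ALFWorld or CVX APIs: the environment advances only on the expected action, and the memory does an exact-match lookup where a real system would embed the observation and run a nearest-neighbour search.

```python
class ToyEnv:
    """Hypothetical stand-in for the ALFWorld simulator."""

    def __init__(self, goal_actions):
        self.goal_actions = goal_actions
        self.progress = 0

    def reset(self):
        self.progress = 0
        return "obs 0"

    def step(self, action):
        # Advance only when the agent picks the expected action.
        if action == self.goal_actions[self.progress]:
            self.progress += 1
        done = self.progress == len(self.goal_actions)
        return f"obs {self.progress}", done


class ToyMemory:
    """Hypothetical memory mapping an observation to a stored continuation."""

    def __init__(self, episode_obs, episode_actions):
        self.obs = episode_obs
        self.actions = episode_actions

    def search(self, observation):
        # Exact match stands in for embedding + similarity search.
        idx = self.obs.index(observation)
        return self.actions[idx:]


def run_episode(env, memory, max_steps=20):
    observation, done, trace = env.reset(), False, []
    while not done and len(trace) < max_steps:
        continuation = memory.search(observation)  # query = current state
        action = continuation[0]                   # an LLM would choose here
        observation, done = env.step(action)
        trace.append(action)
    return trace, done


actions = [
    "go to countertop 1",
    "take mug 1 from countertop 1",
    "put mug 1 in/on coffeemachine 1",
]
stored_obs = ["obs 0", "obs 1", "obs 2"]
trace, done = run_episode(ToyEnv(actions), ToyMemory(stored_obs, actions))
print(trace, done)
```

The key difference from static plan generation is that the query changes on every iteration, so the retrieved continuation always starts from the agent's current state rather than from the task description.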
What These Results Mean for CVX
- As a vector store (insert + search): CVX ≈ numpy. No differentiation at this scale.
- As an episodic memory with temporal structure: The value is real but requires an interactive agent architecture to exploit it. Static benchmarks don’t exercise the temporal features.
- Next step: Implement the interactive ALFWorld agent loop where the query to CVX changes at each step based on the actual environment state.
CVX Features Exercised
TemporalIndex, insert, search (multi-step), save/load, episode encoding, timestamp-based step extraction, continuation slicing.