E2: Episodic ALFWorld Benchmark — Results

H1: Semantic retrieval significantly outperforms random/zero-shot for embodied planning.

H2: CVX-Causal (match mid-episode states, return continuation) outperforms full-episode retrieval by providing targeted procedural context.

| Component | Choice |
| --- | --- |
| Dataset | AgentInstruct ALFWorld (336 GPT-4 expert trajectories, 3–35 steps) |
| LLM | qwen2.5-coder:7b-instruct (Ollama) |
| Embedding | all-MiniLM-L6-v2 (D=384) |
| Evaluation | 99 episodes, stratified by task type, leave-one-out, k=3 |
| Condition | Retrieval | Context |
| --- | --- | --- |
| NoMemory | None | Zero-shot |
| RandomTrajectory | Random | Full expert trajectory |
| FlatCosine | numpy cosine on tasks | Full trajectory |
| CVX-Episodic | CVX search | Full trajectory |
| CVX-Causal | CVX search (ALL steps) | Continuation only |
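The FlatCosine baseline is straightforward to sketch in numpy: cosine similarity between the query episode's task embedding and every stored task embedding, with the query itself excluded (leave-one-out). The function name and the random stand-in embeddings below are illustrative, not the benchmark's actual code:

```python
import numpy as np

def top_k_episodes(query_vec, episode_vecs, query_idx, k=3):
    """FlatCosine retrieval, sketched: cosine similarity against every
    stored task embedding, leave-one-out. Names are illustrative."""
    sims = episode_vecs @ query_vec / (
        np.linalg.norm(episode_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    sims[query_idx] = -np.inf            # leave-one-out: never retrieve yourself
    return np.argsort(sims)[::-1][:k]    # indices of the k most similar episodes

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 384))  # stand-ins for MiniLM task embeddings
embeddings[7] = 2.0 * embeddings[4]      # same direction as episode 4 -> cosine 1
hits = top_k_episodes(embeddings[4], embeddings, query_idx=4, k=3)
assert hits[0] == 7 and 4 not in list(hits)
```

At n=336 episodes this brute-force scan is effectively instant, which is why (as noted below) CVX and numpy are indistinguishable as plain vector stores at this scale.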
Metrics:

- Semantic Similarity: cosine(plan embedding, expert trajectory embedding)
- Action Overlap: fraction of expert action verbs appearing in the plan
- Object Recall: fraction of expert objects appearing in the plan
- Step Ratio: plan_steps / expert_steps
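The set-based metrics are simple ratios over the expert trajectory. The sketch below is an illustrative reimplementation, assuming verbs and objects have already been extracted from the plan and trajectory text (extraction itself is not shown); semantic similarity is plain cosine over sentence embeddings:

```python
import numpy as np

def semantic_similarity(u, v):
    """Cosine between plan and expert-trajectory embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def action_overlap(plan_verbs, expert_verbs):
    """Fraction of expert action verbs that also appear in the plan."""
    expert = set(expert_verbs)
    return len(expert & set(plan_verbs)) / len(expert) if expert else 0.0

def object_recall(plan_objects, expert_objects):
    """Fraction of expert objects mentioned anywhere in the plan."""
    expert = set(expert_objects)
    return len(expert & set(plan_objects)) / len(expert) if expert else 0.0

def step_ratio(n_plan_steps, n_expert_steps):
    """>1 means the plan is longer than the expert trajectory."""
    return n_plan_steps / n_expert_steps

# toy example
assert action_overlap(["go", "take", "put"], ["go", "take", "heat", "put"]) == 0.75
assert object_recall(["mug"], ["mug", "microwave"]) == 0.5
assert step_ratio(5, 10) == 0.5
```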
| Condition | SemSim | ActOvl | ObjRec | StepR |
| --- | --- | --- | --- | --- |
| NoMemory | 0.600 | 0.60 | 0.09 | 0.56 |
| RandomTrajectory | 0.614 | 0.78 | 0.17 | 0.70 |
| FlatCosine | 0.702 | 0.89 | 0.30 | 0.74 |
| CVX-Episodic | 0.709 | 0.88 | 0.28 | 0.74 |
| CVX-Causal | 0.588 | 0.70 | 0.21 | 0.43 |

Statistical Tests (Wilcoxon, semantic similarity)

| Comparison | Δ | p | Sig |
| --- | --- | --- | --- |
| CVX-Episodic vs NoMemory | +0.109 | <0.0001 | *** |
| FlatCosine vs NoMemory | +0.102 | <0.0001 | *** |
| CVX-Episodic vs Random | +0.095 | <0.0001 | *** |
| CVX-Episodic vs FlatCosine | +0.007 | 0.213 | ns |
| CVX-Causal vs NoMemory | -0.012 | 0.927 | ns |
| CVX-Causal vs FlatCosine | -0.114 | 1.000 | ns |
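The shape of these paired tests can be reproduced with `scipy.stats.wilcoxon`, which pairs each episode's semantic-similarity score under two conditions. The scores below are synthetic stand-ins (random values with a +0.10 shift, roughly matching the CVX-Episodic vs NoMemory delta), not the benchmark data:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
# Synthetic per-episode SemSim scores for n=99 episodes (stand-ins only).
no_memory = rng.uniform(0.4, 0.8, size=99)
cvx_episodic = no_memory + rng.normal(loc=0.10, scale=0.05, size=99)

# Paired signed-rank test, two-sided by default.
stat, p = wilcoxon(cvx_episodic, no_memory)
assert p < 1e-4  # a consistent +0.10 shift across 99 pairs is highly significant
```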
CVX-Causal match-position distribution (step index of the matched state):

```
Step  0:   21 ( 7.1%) ██
Step  1:    8 ( 2.7%) █
Step  2:    9 ( 3.0%) █
Step  3:   22 ( 7.4%) ██
Step  5:   23 ( 7.7%) ███
Step  7:   39 (13.1%) █████
Step 10+: 109 (36.7%) ███████████████
```
  1. H1 confirmed: Semantic retrieval (FlatCosine/CVX-Episodic) significantly outperforms zero-shot (+0.10 semantic sim, p<0.0001) and random (+0.09, p<0.0001). Object recall triples (0.09 → 0.30).

  2. H2 rejected: CVX-Causal (0.588) is worse than NoMemory (0.600). The continuation-only format, without full task context, confuses the LLM.

  3. CVX-Episodic ≈ FlatCosine: +0.007 semantic similarity (p=0.213, ns); the two conditions retrieve the same episodes 67% of the time.

  4. Why CVX-Causal fails: 36.7% of matches are at step 10+, meaning the continuation is just the last 2-3 steps of an episode (e.g., “put X in/on Y”). Without the preceding context (which objects, which rooms), these fragments are useless. The query (task description) matches late-episode states because they contain the task text as a prefix.
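The failure mode in point 4 follows directly from how a continuation is sliced: everything after the matched step. A hypothetical sketch (the benchmark slices stored trajectory text, not Python lists):

```python
def continuation_after(episode_steps, match_idx):
    """CVX-Causal context, sketched: only the steps AFTER the matched state.
    Illustrative helper, not the benchmark's actual slicing code."""
    return episode_steps[match_idx + 1:]

episode = [f"step {i}" for i in range(12)]
late = continuation_after(episode, 10)   # a step-10+ match in a 12-step episode
early = continuation_after(episode, 1)   # an early match keeps the procedure
assert late == ["step 11"]               # almost no procedural context survives
assert len(early) == 10
```

With 36.7% of matches landing at step 10+, most continuations look like `late` above: a trailing fragment with no record of which objects or rooms the episode involved.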

The causal retrieval hypothesis is sound, but the implementation context is wrong:

| What we do | What we should do |
| --- | --- |
| Query = task description | Query = current agent state (observation) |
| Match against any step | Match against observation-only embeddings |
| Static plan generation | Interactive step-by-step with environment |
| Return text continuation | Return structured action sequence |

CVX-Causal requires an interactive loop:

  1. Agent takes action in environment
  2. Environment returns observation
  3. Embed observation → search() in CVX
  4. Find similar mid-episode state in memory
  5. Return continuation → agent uses it for next action
  6. Repeat until task completion
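The loop above can be sketched with stub components. Everything here is an assumption standing in for the real pieces: `ToyEnv` for the ALFWorld simulator, `embed` for the all-MiniLM-L6-v2 encoder, and `search_continuation` for a CVX search over mid-episode states:

```python
class ToyEnv:
    """Stub for the ALFWorld simulator (assumption: the real env returns
    an observation and a done flag for each action)."""
    def __init__(self, horizon=3):
        self.t, self.horizon = 0, horizon
    def step(self, action):
        self.t += 1
        return f"obs-{self.t} after {action}", self.t >= self.horizon

def embed(text):
    """Deterministic stand-in for the sentence embedder."""
    return sum(ord(c) for c in text)

def search_continuation(obs_code):
    """Stand-in for CVX search over mid-episode states: returns the stored
    continuation (here, a single suggested next action)."""
    return [f"action-{obs_code % 5}"]

env, action, trace, done = ToyEnv(), "look", [], False
while not done:
    obs, done = env.step(action)            # 1-2: act, observe
    cont = search_continuation(embed(obs))  # 3-5: embed -> search -> continuation
    action = cont[0]                        # agent follows the continuation
    trace.append(action)
assert len(trace) == 3                      # one retrieval per environment step
```

The key difference from the static benchmark: the query changes every step, so early-episode observations match early-episode states and the retrieved continuation actually contains the remaining procedure.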

This is a fundamentally different architecture from static plan generation. It requires the ALFWorld simulator running in the loop, which is the natural next step.

- As a vector store (insert + search): CVX ≈ numpy. No differentiation at this scale.
- As an episodic memory with temporal structure: the value is real but requires an interactive agent architecture to exploit it. Static benchmarks don’t exercise the temporal features.
- Next step: implement the interactive ALFWorld agent loop, where the query to CVX changes at each step based on the actual environment state.

CVX functionality exercised: TemporalIndex, insert, search (multi-step), save/load, episode encoding, timestamp-based step extraction, continuation slicing.