E3: Interactive ALFWorld Agent — Results
Hypothesis
H: An LLM agent that queries CVX at each step using its current environment state (not a static task description) will complete more tasks than a zero-shot agent, because the temporal structure of CVX enables extraction of relevant continuations from similar mid-episode states.
This is the hypothesis that E2’s static evaluation could not test.
Experimental Setup
| Component | Choice |
|---|---|
| Environment | ALFWorld TextWorld (eval_out_of_distribution, 134 games) |
| LLM | qwen2.5-coder:7b-instruct (Ollama) |
| Memory | CVX index from E2 (336 AgentInstruct expert episodes, 4542 vectors) |
| Evaluation | 30 games, max 30 steps per game |
The Interactive Loop
1. env.reset() → observation + task
2. embed(observation + task context) → query CVX
3. CVX.search() → find similar mid-episode states across all expert episodes
4. Extract continuation: next 5 steps from each matched episode
5. LLM(observation + continuations + admissible_actions) → choose action
6. env.step(action) → new observation
7. → goto 2 until task complete or max_steps

Key Difference from E2
| | E2 (Static) | E3 (Interactive) |
|---|---|---|
| Query | Task description | Current observation from env |
| When | Once per task | Every step |
| Context | Full episode trajectories | Continuations from similar states |
| Metric | Plan similarity to expert | Task completion (binary) |
| Environment | None | ALFWorld TextWorld simulator |
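
The per-step querying that separates E3 from E2 can be sketched as below. This is a minimal illustration under stated assumptions, not the experiment's code: `ToyEnv`, `search_memory`, `choose_action`, and the word-overlap similarity are hypothetical stand-ins for the ALFWorld simulator, the CVX search, the LLM call, and embedding similarity.

```python
from dataclasses import dataclass


@dataclass
class Step:
    episode_id: int
    index: int          # position of the step inside its episode
    observation: str
    action: str         # action the expert took from this observation


# Hypothetical expert memory: a single tiny episode.
MEMORY = [
    Step(0, 0, "you are in the kitchen", "go to countertop 2"),
    Step(0, 1, "you are at countertop 2, you see a tomato", "take tomato"),
    Step(0, 2, "you hold the tomato", "go to fridge"),
    Step(0, 3, "you are at the fridge", "cool tomato"),
]


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a stand-in for embedding similarity."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


def search_memory(observation, memory, horizon=5):
    """Find the most similar stored state and return the actions that followed."""
    best = max(memory, key=lambda s: overlap(observation, s.observation))
    following = [s for s in memory
                 if s.episode_id == best.episode_id and s.index >= best.index]
    following.sort(key=lambda s: s.index)
    return [s.action for s in following[:horizon]]


class ToyEnv:
    """Scripted stand-in for the simulator: done after 'cool tomato'."""
    OBS = {
        "go to countertop 2": "you are at countertop 2, you see a tomato",
        "take tomato": "you hold the tomato",
        "go to fridge": "you are at the fridge",
        "cool tomato": "the tomato is cool",
    }

    def reset(self):
        return "you are in the kitchen"

    def admissible_actions(self):
        return list(self.OBS)

    def step(self, action):
        return self.OBS[action], action == "cool tomato"


def choose_action(continuation, admissible):
    """Stand-in for the LLM: first suggested action that is admissible."""
    for action in continuation:
        if action in admissible:
            return action
    return admissible[0]


def run_episode(env, memory, max_steps=30):
    observation = env.reset()
    for step_count in range(1, max_steps + 1):
        # Query memory with the current observation, not the task description.
        continuation = search_memory(observation, memory)
        action = choose_action(continuation, env.admissible_actions())
        observation, done = env.step(action)
        if done:
            return True, step_count
    return False, max_steps


print(run_episode(ToyEnv(), MEMORY))
```

The memory is queried inside the loop, so the retrieved continuation changes as the observation changes. An E2-style agent would call the search once, before the loop, with the task description.
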
Conditions
| Condition | Memory | Query |
|---|---|---|
| NoMemory | None | Only observation + admissible actions |
| CVX-Causal | CVX step-by-step | Observation → search → continuation |
Results
| Condition | Completed | Rate | Mean Steps |
|---|---|---|---|
| NoMemory | 1/30 | 3.3% | 29.3 |
| CVX-Causal | 6/30 | 20.0% | 27.2 |
Statistical Test
McNemar’s test:
- CVX-Causal only won: 5 tasks
- NoMemory only won: 0 tasks
- Both won: 1
- Neither won: 24
- Net: +5 tasks; continuity-corrected χ² = (|5 − 0| − 1)² / 5 = 3.20, p=0.074 (borderline at n=30)
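
The figures above can be checked from the two discordant counts alone. The sketch below uses only the Python standard library. It assumes the reported statistic is the continuity-corrected form, which is the variant that yields 3.20 for a 5-to-0 split. The function name `mcnemar` is chosen here for illustration.

```python
import math


def mcnemar(b: int, c: int):
    """McNemar's test from discordant pair counts b and c (requires b + c > 0).

    Returns (continuity-corrected chi2, asymptotic p, exact two-sided p).
    """
    n = b + c
    chi2 = (abs(b - c) - 1) ** 2 / n
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_asymptotic = math.erfc(math.sqrt(chi2 / 2))
    # Exact test: under H0 each discordant pair is a fair coin flip.
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return chi2, p_asymptotic, p_exact


chi2, p_asym, p_exact = mcnemar(b=5, c=0)
print(f"chi2={chi2:.2f}, p={p_asym:.3f}, exact p={p_exact:.4f}")
```

With only five discordant pairs the asymptotic approximation is rough. The exact binomial p-value for a 5-to-0 split is 2 × 0.5⁵ = 0.0625, which is also above 0.05.
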
Retrieval Characteristics
The match step distribution shows CVX-Causal correctly matching mid-episode states:
- Early steps (0-3): room descriptions, initial exploration
- Mid steps (4-7): object interaction, navigation
- Late steps (8+): task completion actions
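
One way to tabulate such a distribution is to bucket the step index of each retrieved match into the three ranges listed above. This is a hedged sketch: the `phase` helper is hypothetical and the indices in `matched_steps` are made-up illustrative values, not measurements from the experiment.

```python
from collections import Counter


def phase(step_index: int) -> str:
    """Map a matched step index to the early / mid / late ranges."""
    if step_index <= 3:
        return "early"
    if step_index <= 7:
        return "mid"
    return "late"


# Illustrative indices only; these are not results from E3.
matched_steps = [0, 2, 4, 5, 7, 9, 12]
histogram = Counter(phase(i) for i in matched_steps)
print(dict(histogram))
```
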
Findings
Section titled “Findings”-
6x improvement in task completion (3.3% → 20.0%). This is the strongest result across E1-E3.
-
The interactive loop is essential. E2 showed CVX-Causal was worse than zero-shot when the query was a static task description. E3 shows it’s 6x better when the query is the real environment state. Same memory, same CVX index, same LLM — the only difference is what you query with.
-
Continuations provide actionable guidance. When the agent sees “You are at countertop 2, you see a tomato”, CVX finds expert states like “agent was at countertop with tomato in a ‘cool tomato’ task” and returns “take tomato → go to fridge → cool tomato” — directly executable actions.
-
NoMemory is nearly helpless. A 7B model with only admissible actions and no examples solves 3.3% of out-of-distribution ALFWorld tasks. It lacks the procedural knowledge of which sequence of actions leads to task completion.
-
p=0.074 is borderline. With n=30, statistical power is limited. Scaling to n=134 (all eval games) would likely reach significance given the 0/5 discordant pair ratio.
Why This Validates CVX
| Capability | Used in E3? | Why it matters |
|---|---|---|
| `search()` across all steps | Yes | Finds similar states anywhere in any episode |
| Episode encoding (entity_id) | Yes | Groups steps by episode for continuation extraction |
| Timestamp ordering | Yes | Determines step order → extracts “what came next” |
| Step-level embeddings | Yes | Each action-observation pair is a searchable state |
FlatCosine cannot replicate this because:
- It has no concept of step ordering within an episode
- It cannot extract “continuation from step N” — only “top-k most similar vectors”
- It would need to re-implement episode grouping, step ordering, and continuation slicing — which is exactly what CVX provides natively
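
To make the three operations named above concrete (episode grouping, step ordering, continuation slicing), here is a minimal sketch over plain Python tuples. The record layout `(entity_id, timestamp, action)` and the function name `continuation` are assumptions for illustration; this is not CVX's API, only the logic a flat top-k store would have to add on top of its results.

```python
from collections import defaultdict

# Hypothetical flat records, in arbitrary order: (entity_id, timestamp, action).
records = [
    (7, 2, "go to fridge"),
    (3, 0, "open drawer"),
    (7, 0, "go to countertop 2"),
    (7, 3, "cool tomato"),
    (7, 1, "take tomato"),
    (3, 1, "take knife"),
]


def continuation(records, entity_id, timestamp, horizon=5):
    """Return up to `horizon` actions of one episode, starting at the matched step."""
    episodes = defaultdict(list)
    for eid, ts, action in records:            # 1. group steps by episode
        episodes[eid].append((ts, action))
    steps = sorted(episodes[entity_id])        # 2. order steps by timestamp
    tail = [action for ts, action in steps if ts >= timestamp]
    return tail[:horizon]                      # 3. slice the continuation


print(continuation(records, entity_id=7, timestamp=1))
```

A match on episode 7 at timestamp 1 yields the "take tomato", "go to fridge", "cool tomato" sequence used as the example in the findings.
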
Limitations
- n=30: Small sample. Should scale to full 134 eval games.
- No FlatCosine baseline: We should add a flat cosine agent that retrieves full episodes (like E2’s best condition) for fair comparison.
- No ablation on continuation length: We use next-5-steps; optimal length is unknown.
- Single model: Larger models may not need memory (ceiling effect).
- Admissible actions provided: ALFWorld gives the valid action set; real environments don’t.
Next Steps
- Scale to n=134 (all eval_ood games) for statistical power
- Add FlatCosine interactive baseline (full-episode context at each step)
- Expose `causal_search()` in Python for native CVX support
- Apply to code generation as an iterative generate-test-debug loop (see E4 proposal)