
E1: Episodic Coding Benchmark — Results

H1: Semantic retrieval outperforms random few-shot for code generation.

H2: CVX-Causal (match any step, return continuation) outperforms CVX-Episodic (match problem, return full episode) by leveraging temporal structure.

| Component | Choice |
|---|---|
| Training corpus | MBPP sanitized (384 problems, 3 steps each) |
| Test benchmark | HumanEval (164 problems) |
| Embedding | all-MiniLM-L6-v2 (D=384) |
| LLM | qwen2.5-coder:7b-instruct (Ollama) |
| Condition | Retrieval | Formatting |
|---|---|---|
| NoMemory | None | Zero-shot |
| RandomFewShot | Random MBPP | Full problem + solution |
| FlatCosine | numpy cosine on problems | Full problem + solution |
| CVX-Episodic | CVX search, step_0 only | Full problem + solution |
| CVX-Causal | CVX search, all steps | Continuation only (what happened after the match) |
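The FlatCosine baseline described above can be sketched in a few lines. This is illustrative, not the experiment's actual code; `top_k_cosine` and the toy vectors are stand-ins for the real all-MiniLM-L6-v2 problem embeddings.

```python
import numpy as np

def top_k_cosine(query_vec, corpus_vecs, k=3):
    """Rank corpus rows by cosine similarity to the query (FlatCosine-style)."""
    q = query_vec / np.linalg.norm(query_vec)
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = C @ q                      # cosine similarity per corpus row
    return np.argsort(-sims)[:k]      # indices of the k nearest problems

# Toy 384-d vectors standing in for embedded MBPP problems.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10, 384))
query = corpus[4] + 0.01 * rng.normal(size=384)  # near corpus item 4
print(top_k_cosine(query, corpus, k=3))          # index 4 ranks first
```

The retrieved indices would then be formatted as full problem + solution few-shot examples, per the table above.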
  • Validation: HumanEval[0:82], T=0, k ∈ {1,3,5,7}
  • Test: HumanEval[82:164], T=0.2, 5 seeds, best k from validation
  • Statistics: McNemar (majority-vote), paired t-test (seed-level)
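The two statistical tests listed above can be sketched as follows. This is a minimal sketch under assumptions: the exact-binomial McNemar variant and the hypothetical seed-level numbers are illustrative, not the experiment's actual procedure or data.

```python
from scipy.stats import binomtest, ttest_rel

def mcnemar_exact(a_correct, b_correct):
    """Exact McNemar test on paired per-problem pass/fail outcomes."""
    b = sum(x and not y for x, y in zip(a_correct, b_correct))  # A passes, B fails
    c = sum(y and not x for x, y in zip(a_correct, b_correct))  # B passes, A fails
    if b + c == 0:
        return 1.0  # no discordant pairs: conditions indistinguishable
    # Two-sided exact binomial test on the discordant pairs.
    return binomtest(b, b + c, 0.5).pvalue

# Hypothetical seed-level pass@1 values for two conditions (5 seeds each).
flat = [0.79, 0.77, 0.80, 0.78, 0.785]
rand = [0.77, 0.76, 0.79, 0.77, 0.775]
print(ttest_rel(flat, rand).pvalue)  # paired t-test across seeds
```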
Validation (pass@1 by k, HumanEval[0:82]):

| k | NoMemory | RandomFewShot | FlatCosine | CVX-Episodic | CVX-Causal |
|---|---|---|---|---|---|
| 1 | 59.8% | 85.4% | 85.4% | 84.1% | 81.7% |
| 3 | 58.5% | 78.0% | 85.4% | 84.1% | 78.0% |
| 5 | 58.5% | 82.9% | 81.7% | 81.7% | 82.9% |
| 7 | 58.5% | 85.4% | 79.3% | 79.3% | 81.7% |
Test (pass@1 over 5 seeds, HumanEval[82:164]):

| Condition | pass@1 (mean ± std) |
|---|---|
| NoMemory | 71.2% ± 1.5% |
| CVX-Causal | 75.4% ± 2.6% |
| RandomFewShot | 77.3% ± 1.7% |
| CVX-Episodic | 77.3% ± 2.0% |
| FlatCosine | 78.5% ± 1.8% |
  1. H1 rejected: Semantic retrieval (FlatCosine 78.5%, CVX-Episodic 77.3%) does not significantly outperform random few-shot (77.3%). For code generation, any example teaches formatting; similarity adds marginal value.

  2. H2 rejected: CVX-Causal (75.4%) underperforms both CVX-Episodic (77.3%) and FlatCosine (78.5%). Matching on plan/solution steps without interactive state feedback introduces noise rather than signal.

  3. CVX ≈ FlatCosine: 96.7% retrieval overlap at this corpus size. HNSW approximate search finds the same neighbors as brute-force.
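The 96.7% figure is a retrieval-agreement statistic. One way to compute such an overlap@k between the HNSW and brute-force retrievers can be sketched as follows; the function name and toy ID lists are illustrative.

```python
def overlap_at_k(retrieved_a, retrieved_b):
    """Mean fraction of shared IDs between two retrievers' top-k lists,
    averaged over queries."""
    per_query = [len(set(a) & set(b)) / len(a)
                 for a, b in zip(retrieved_a, retrieved_b)]
    return sum(per_query) / len(per_query)

hnsw = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # e.g. CVX (HNSW) top-3 per query
flat = [[1, 2, 3], [4, 5, 7], [7, 8, 9]]  # e.g. FlatCosine top-3 per query
print(overlap_at_k(hnsw, flat))           # 8/9 ≈ 0.889
```

At this corpus size (384 problems), near-total agreement is unsurprising: approximate search has little room to diverge from exact search.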

  4. CVX-Causal step distribution: When searching across all steps, the model finds matches at plan (step 1) and solution (step 2) — but the resulting continuations (just the solution without the problem context) are less useful than the full episode.

The CVX-Causal null result is expected for code generation with static retrieval. The causal hypothesis (“show what happened after a similar state”) requires the query to be an in-progress state, not a task description. For code:

  • The “state” at query time is always the same: “I have a problem description, generate code”
  • There is no mid-episode state differentiation — the problem is either solved or not
  • Matching on plan/solution vectors returns examples based on code similarity rather than problem similarity, which is less useful for prompting
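The prompting difference between the two CVX conditions can be sketched as follows. The episode fields and function names here are hypothetical, not the experiment's actual code; they only illustrate why a continuation stripped of its problem makes a weaker few-shot example.

```python
# Hypothetical 3-step episode (problem -> plan -> solution); illustrative only.
episode = {
    "problem": "Write a function that reverses a string.",
    "plan": "Slice the string with step -1.",
    "solution": "def rev(s):\n    return s[::-1]",
}

def format_full(ep):
    """CVX-Episodic / FlatCosine prompt: the problem plus its solution."""
    return f"Problem: {ep['problem']}\nSolution:\n{ep['solution']}"

def format_continuation(ep, matched_step):
    """CVX-Causal prompt: only what happened after the matched step."""
    steps = [ep["problem"], ep["plan"], ep["solution"]]
    return "\n".join(steps[matched_step + 1:])

# A match at the plan step (step 1) yields a bare solution with no
# problem context attached:
print(format_continuation(episode, 1))
```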

Where CVX-Causal should work: Interactive environments (ALFWorld, games) where the agent has genuine mid-episode states that evolve. This requires step-by-step environment interaction, not static plan generation.

TemporalIndex coverage: insert, search (across all steps, not just step_0), save/load, episode encoding, timestamp-based step filtering.
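The listed capabilities can be illustrated with a minimal stand-in. This class is a sketch, not the actual TemporalIndex implementation; its names and signatures are assumptions chosen to mirror the listed features (per-step insert, search across all steps, timestamp-based filtering).

```python
import numpy as np

class MiniTemporalIndex:
    """Toy brute-force stand-in for a temporal episode index."""

    def __init__(self, dim):
        self.dim = dim
        self.meta = []                   # (episode_id, step, timestamp) per row
        self.vecs = np.empty((0, dim))   # unit-normalized step vectors

    def insert(self, episode_id, step, timestamp, vec):
        self.meta.append((episode_id, step, timestamp))
        self.vecs = np.vstack([self.vecs, vec / np.linalg.norm(vec)])

    def search(self, query, k=3, min_step=0, before=None):
        """Top-k cosine matches across ALL steps, optionally filtered
        by step index or timestamp."""
        q = query / np.linalg.norm(query)
        sims = self.vecs @ q
        hits = []
        for i in np.argsort(-sims):
            ep, step, ts = self.meta[i]
            if step < min_step or (before is not None and ts >= before):
                continue  # timestamp/step filtering
            hits.append((ep, step, float(sims[i])))
            if len(hits) == k:
                break
        return hits

idx = MiniTemporalIndex(dim=4)
idx.insert("ep0", 0, 100, np.array([1.0, 0.0, 0.0, 0.0]))
idx.insert("ep0", 1, 101, np.array([0.9, 0.1, 0.0, 0.0]))
idx.insert("ep1", 0, 200, np.array([0.0, 1.0, 0.0, 0.0]))
print(idx.search(np.array([1.0, 0.0, 0.0, 0.0]), k=2, min_step=1))
```

Searching with `min_step=1` restricts matches to plan/solution steps, which is the behavior the CVX-Causal condition relies on.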