
E4: Iterative Code Generation with CVX Memory (Proposal)

E1 showed that static retrieval (find similar problems → inject solutions) doesn’t differentiate CVX from flat cosine for code generation. The reason: code generation was treated as a single-step task — the query is always “here’s a problem, generate code.” There’s no evolving state.

E3 showed that CVX shines when the query changes at each step based on real feedback. The fix for code generation is to make it interactive: a generate-test-debug loop where each iteration produces a new state (the error message, the failing test, the partial solution) that CVX can match against.

┌─────────────────────────────────────────────────────────────┐
│ 1. LLM generates code (attempt N) │
│ 2. Execute tests → pass? → done ✓ │
│ 3. If fail: embed(error_message + failing_test + code) │
│ 4. CVX.search(error_embedding) → find similar past errors │
│ 5. Extract continuation: "this error was fixed by..." │
│ 6. LLM(code + error + CVX fix suggestions) → attempt N+1 │
│ 7. → goto 2 │
└─────────────────────────────────────────────────────────────┘
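
The loop above can be sketched in Python. `generate`, `run_tests`, and `memory_search` are hypothetical stand-ins for the LLM call, the test harness, and the CVX error search; none of them name a real API here:

```python
# Sketch of the generate-test-debug loop (steps 1-7 above).
# All three callables are injected stand-ins, not real APIs.

def debug_loop(problem, generate, run_tests, memory_search, max_retries=3):
    """Return (code, attempts) once tests pass, or (None, max_retries)."""
    prompt = problem
    for attempt in range(1, max_retries + 1):
        code = generate(prompt)                      # 1. LLM attempt N
        ok, error = run_tests(code)                  # 2. execute tests
        if ok:
            return code, attempt                     # done
        # 3-4. embed the failure state and search past debug traces
        hints = memory_search(error + "\n" + code)
        # 5-6. fold the error and retrieved fixes into the next prompt
        prompt = (f"{problem}\n\nPrevious attempt:\n{code}\n"
                  f"Error:\n{error}\nSimilar past fixes:\n{hints}")
    return None, max_retries
```

The retrieved `hints` are simply appended to the retry prompt; a real implementation would format them more carefully.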

Each debug episode is a multi-step trajectory stored in CVX:

| Step | Embedding | Timestamp | Content |
|------|-----------|-----------|---------|
| 0 | problem description | t₀ | "Write a function that…" |
| 1 | attempt 1 (code) | t₁ | Generated code |
| 2 | error from attempt 1 | t₂ | "TypeError: list indices must be integers" |
| 3 | fix applied | t₃ | Diff or corrected code |
| 4 | attempt 2 (code) | t₄ | Updated code |
| 5 | pass ✓ | t₅ | Final working solution |

The reward encodes whether the episode ultimately succeeded. Failed debug traces (never resolved) get reward=0; successful ones get reward=1.
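
A minimal sketch of what storing such an episode might look like. The dict layout and field names (`step`, `kind`, `reward`) are illustrative assumptions, not CVX's actual schema:

```python
import time

def make_episode(problem, steps, solved):
    """Build one debug episode.

    steps: list of (kind, content) in temporal order, e.g.
    [('attempt', code1), ('error', msg), ('fix', diff), ('attempt', code2)].
    """
    t0 = time.time()
    return {
        # Failed debug traces (never resolved) get reward=0
        "reward": 1 if solved else 0,
        # Step 0 is always the problem description; timestamps preserve order
        "steps": [
            {"step": i, "kind": kind, "content": content, "t": t0 + i}
            for i, (kind, content) in enumerate(
                [("problem", problem)] + list(steps))
        ],
    }
```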

Step 1: Initial generation

  • Query: problem description
  • CVX returns: similar problems and their solutions
  • This is what E1 already does — baseline

Step 2: First error (NEW — this is where CVX differentiates)

  • Query: embed("TypeError: list indices must be integers, not str" + code_context)
  • CVX searches ALL steps, finds similar errors in past debug traces
  • Returns: “when this error occurred before, the fix was to cast index to int”
  • This is impossible with flat cosine — it only has problem embeddings, not error embeddings
Step 3: Subsequent errors

  • Query: embed(new_error + previous_fix_attempt)
  • CVX tracks the debugging trajectory — if the agent is going in circles (hitting the same error repeatedly), CVX can detect this via velocity() and suggest a different approach
  • Use drift() between consecutive attempts — if the code isn't changing (low drift), the agent is stuck
  • Use hurst_exponent() on the debug trajectory — anti-persistent trajectories (H < 0.5) indicate productive exploration, while persistent ones (H > 0.5) indicate the agent is repeating the same mistakes
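
The stuck-detection idea can be illustrated with a small drift check on attempt embeddings. `is_stuck` approximates what a drift()-based check might do; the threshold value is an assumption:

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_stuck(attempt_embeddings, drift_threshold=0.05):
    """True if the last two attempts barely moved in embedding space,
    i.e. the agent is regenerating essentially the same code."""
    if len(attempt_embeddings) < 2:
        return False
    drift = 1.0 - cosine(attempt_embeddings[-2], attempt_embeddings[-1])
    return drift < drift_threshold
```

When `is_stuck` fires, the loop could switch strategy, e.g. raise temperature or drop the retrieved hints that led in circles.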

Training Data: Where Do Debug Traces Come From?

Option A: Synthetic debug traces

  1. Run the LLM on all MBPP problems with T=0.8 (intentionally imperfect)
  2. Capture the errors when solutions fail
  3. Fix each error (either by the LLM with feedback or by using the known solution)
  4. Store the full trace: problem → bad code → error → fix → good code
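
The four steps above can be sketched as a single loop. `llm_generate`, `run_tests`, and `llm_fix` are hypothetical stand-ins for the model at T=0.8, the MBPP test harness, and the feedback-driven repair step:

```python
def build_traces(problems, llm_generate, run_tests, llm_fix):
    """Build synthetic debug traces from deliberately imperfect generations."""
    traces = []
    for problem in problems:
        code = llm_generate(problem)            # step 1: imperfect attempt
        ok, error = run_tests(problem, code)
        if ok:
            continue                            # no failure, no debug trace
        fixed = llm_fix(problem, code, error)   # step 3: repair with feedback
        fixed_ok, _ = run_tests(problem, fixed)
        traces.append({                         # step 4: full trace
            "problem": problem, "bad_code": code, "error": error,
            "fix": fixed, "reward": 1 if fixed_ok else 0,
        })
    return traces
```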

Option B: Real debug traces from development

  • Instrument a coding workflow to capture edit → test → error → fix cycles
  • Each commit-test cycle is a step in the episode
  • Git history provides natural temporal ordering

Option C: Multiple models and temperatures

  • Generate solutions from multiple models/temperatures
  • Errors from weaker attempts + fixes from stronger models
  • Creates diverse debug traces covering many error patterns

| Feature | Usage | Why flat cosine can't |
|---------|-------|-----------------------|
| search() across all steps | Match on errors, not just problems | Flat store only has problem embeddings |
| Episode encoding | Group problem → error → fix as one trajectory | No episode concept |
| Timestamp ordering | Know that the fix came AFTER the error | No ordering |
| Continuation extraction | "This error was followed by this fix" | Can't extract "what came next" |
| velocity() | Detect productive vs stuck debugging | No trajectory concept |
| drift() | Measure if attempts are converging | No temporal structure |
| detect_changepoints() | Find where the debugging approach pivoted | No sequence |
| Reward filtering | Only retrieve from successful debug traces | Post-hoc only |

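
Continuation extraction combined with reward filtering can be sketched as a search over stored episodes: match the query against error steps in successful traces, then return the step that came next. The list-of-dicts store and the `similarity` callback are illustrative assumptions:

```python
def extract_continuation(episodes, query, similarity):
    """Find the most similar error step among reward=1 episodes and
    return the content of the step that followed it (the fix)."""
    best, best_score = None, float("-inf")
    for ep in episodes:
        if ep["reward"] != 1:             # reward filtering: successes only
            continue
        steps = ep["steps"]
        for i, step in enumerate(steps[:-1]):
            if step["kind"] != "error":   # match on errors, not problems
                continue
            score = similarity(query, step["content"])
            if score > best_score:
                # the continuation is whatever came after this error
                best, best_score = steps[i + 1]["content"], score
    return best
```
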
  • MBPP (384 problems) for building debug traces
  • HumanEval (164 problems) for evaluation
  • LiveCodeBench (if available) for harder problems where multi-step debugging matters more

| Condition | Description |
|-----------|-------------|
| NoMemory | Single-pass generation, no retry |
| Retry-NoMemory | Generate → test → retry with error (no CVX) |
| Retry-FlatCosine | Retry + flat cosine retrieval on error text |
| Retry-CVX-Causal | Retry + CVX causal retrieval (match error, return fix continuation) |

  • pass@1: Single generation (baseline)
  • pass@1 after k retries: With debug loop (primary metric)
  • Mean retries to pass: Efficiency of debugging
  • Error diversity: Does CVX help the agent try different fixes (not repeat)?
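
Given per-problem retry counts, the first three metrics can be computed as follows. This is a sketch: `None` marks a problem never solved within the retry budget, and the attempt index is 1-based:

```python
def retry_metrics(retries_to_pass, k):
    """retries_to_pass: list of 1-based attempt indices at which each
    problem first passed, or None if it was never solved."""
    n = len(retries_to_pass)
    pass_at_1 = sum(1 for r in retries_to_pass if r == 1) / n
    pass_after_k = sum(1 for r in retries_to_pass
                       if r is not None and r <= k) / n
    solved = [r for r in retries_to_pass if r is not None]
    mean_retries = sum(solved) / len(solved) if solved else float("nan")
    return {"pass@1": pass_at_1, "pass_after_k": pass_after_k,
            "mean_retries": mean_retries}
```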

The hypothesis is that CVX-Causal will differentiate most on hard problems that require debugging:

  • Easy problems: all conditions pass on first attempt → no difference
  • Medium problems: Retry-NoMemory may fix with generic error feedback
  • Hard problems: Only CVX-Causal can retrieve specific fix patterns from similar past errors
  1. Build debug trace corpus (Option A): Run LLM on MBPP with T=0.8, capture traces
  2. Index in CVX: Each trace is an episode with problem/attempt/error/fix steps
  3. Implement retry loop: generate → test → embed error → search CVX → retry
  4. Evaluate on HumanEval: Compare pass@k across conditions
  5. Analyze: Which error types benefit most from CVX memory?

This positions CVX as a prefrontal cortex for coding agents:

  • Working memory: Current problem + recent attempts (conversation context)
  • Episodic memory (CVX): Past debug experiences — “I’ve seen this error before, here’s what worked”
  • Semantic memory: General coding knowledge (LLM weights)

The iterative debug loop mirrors human problem-solving: try → fail → recall similar past failures → apply learned fix → retry. CVX provides the “recall” step that transforms a stateless LLM into a learning agent.