# RFC-013: Probabilistic & Relational Memory Structures

**Status:** Proposed

## Motivation
CVX stores agent experiences as time-stamped vectors with temporal edges and reward annotations. This enables “find similar states” and “what happened next”. However, retrieval is purely similarity-based — it cannot answer:
- “In states like mine, which action has the highest success probability?”
- “What sequence of regions do successful episodes traverse?”
- “Is there a causal relationship between action X and outcome Y?”
These questions require probabilistic and relational structure over the existing vector index — not replacing HNSW, but augmenting it.
## Part A: Region Transition Model (MDP over HNSW Regions)

### Concept

HNSW regions (hub nodes at level L) define a discrete state space. Observed episode trajectories define transitions between these states. Reward annotations define outcome signals. Together, this is a Markov Decision Process (MDP):
- State: `s = region(node, level=L)`
- Action: `a = action_type` (go_to, take, put, clean, heat, cool, use, open)
- Transition: `P(s' | s, a) = count(s→s' via a) / count(s, a)`
- Reward: `R(s, a) = mean(reward | transitions from s via a)`
### What CVX Already Provides

| Primitive | API | MDP role |
|---|---|---|
| Discrete states | `region_assignments(level)` | State space |
| State membership | `assign_region(vec, level)` | Observation → state |
| Trajectories | `trajectory(entity_id)` | State sequences |
| Outcomes | `reward(node_id)` | Reward signal |
| Temporal ordering | Timestamps | Transition ordering |
### What Needs to Be Built (Python layer, no Rust changes)

```python
from collections import Counter, defaultdict

import numpy as np

class RegionMDP:
    """MDP over HNSW regions, learned from episode trajectories."""

    def __init__(self, index, level=1):
        self.index = index
        self.level = level
        self.transitions = defaultdict(Counter)  # (region, action_type) → Counter{next_region: count}
        self.rewards = defaultdict(list)         # (region, action_type) → [reward1, reward2, ...]

    def learn_from_episodes(self, episodes):
        """Build the transition model from observed trajectories."""
        region_of = self.index.region_assignments(self.level)  # node_id → region
        for episode in episodes:
            traj = self.index.trajectory(episode.entity_id)
            for i in range(len(traj) - 1):
                s = region_of[traj[i].node_id]
                s_next = region_of[traj[i + 1].node_id]
                action_type = classify_action(traj[i].action)  # map raw action → action type
                reward = self.index.reward(traj[i].node_id)
                self.transitions[(s, action_type)][s_next] += 1
                self.rewards[(s, action_type)].append(reward)

    def best_action(self, current_region):
        """Rank action types from this region by mean observed reward
        (≈ P(success | region, action_type) for 0/1 rewards)."""
        scores = {}
        for (s, a), rewards in self.rewards.items():
            if s == current_region:
                scores[a] = np.mean(rewards)
        return sorted(scores.items(), key=lambda x: -x[1])

    def transition_probability(self, s, a, s_next):
        """P(s' | s, a), estimated from transition counts."""
        counts = self.transitions.get((s, a), {})
        total = sum(counts.values())
        return counts.get(s_next, 0) / total if total > 0 else 0.0
```
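To make the counting concrete, here is a self-contained toy run of the same estimators, with hand-built (region, action_type, next_region, reward) tuples standing in for what `learn_from_episodes` would extract from a real index (all numbers invented):

```python
from collections import Counter, defaultdict

# Hand-built observations: (region, action_type, next_region, reward)
observed = [
    (3, "take", 5, 1.0),
    (3, "take", 5, 1.0),
    (3, "take", 7, 0.0),
    (3, "go_to", 4, 0.0),
]

transitions = defaultdict(Counter)
rewards = defaultdict(list)
for s, a, s_next, r in observed:
    transitions[(s, a)][s_next] += 1
    rewards[(s, a)].append(r)

def p(s, a, s_next):
    """P(s' | s, a) as the maximum-likelihood count ratio."""
    counts = transitions.get((s, a), Counter())
    total = sum(counts.values())
    return counts[s_next] / total if total else 0.0

p(3, "take", 5)  # 2 of the 3 observed 'take' transitions from region 3 → 2/3
```

The mean of `rewards[(3, "take")]` is the corresponding `R(s, a)` estimate.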
### Integration with CVX Retrieval

The MDP augments, not replaces, vector retrieval:
1. Agent observes current state → embed → query CVX
2. CVX returns k similar past states with causal continuations
3. MDP scores each continuation: P(success | current_region, action_type)
4. Agent receives: retrieval results ranked by P(success)
### Expected Impact

- Task types with clear phase structure (`heat_then_place`: find→take→microwave→take→put) should improve most — the MDP learns the successful phase sequence
- Task types with high location variance (`pick_and_place`: the object can be anywhere) may improve less — location is not captured by region
### References

- DreamerV3 (Hafner et al., 2023) — World model with discrete latent states and learned transition dynamics. Validates discretizing continuous spaces for planning.
- LATS (Zhou et al., NeurIPS 2023) — Uses LLM as world model + MCTS with past trajectories for planning. Shows experience-based transition prediction.
- VQ-BeT (Lee et al., ICML 2024) — Vector-quantized behavior tokens for discretizing continuous action spaces. Codebook approach analogous to HNSW regions as state tokens.
## Part B: Typed Edge Layer

### Concept

Add a second edge layer to `TemporalGraphIndex` with typed, weighted, cross-entity edges:
```rust
pub struct TypedEdgeLayer {
    edges: HashMap<EdgeType, Vec<(u32, u32, f32)>>, // type → (from, to, weight)
    adjacency: Vec<Vec<(EdgeType, u32, f32)>>,      // node → outgoing
}

pub enum EdgeType {
    Temporal,         // Already exists (predecessor/successor)
    SameActionType,   // Same action in different episodes
    CausedSuccess,    // Action → successful outcome
    CausedFailure,    // Action → failed outcome
    RegionTransition, // Cross-region link with probability weight
    GrangerCausal,    // Granger causality between entities
}
```
### Query Patterns Enabled

| Query | Edge types used | Example |
|---|---|---|
| “What actions succeed from here?” | `CausedSuccess` | Navigate from current node via success edges |
| “What do other agents do in this situation?” | `SameActionType` | Cross-entity exploration |
| “What region should I aim for?” | `RegionTransition` (high reward) | Follow high-reward transition paths |
| “Does this entity influence that one?” | `GrangerCausal` | Multi-agent causal analysis |
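The first two query patterns reduce to filtered adjacency lookups. A minimal Python stand-in for the layer's semantics (the real implementation would be the Rust struct above; `TypedEdges` and its method names are illustrative):

```python
from collections import defaultdict

class TypedEdges:
    """In-memory sketch of TypedEdgeLayer semantics (illustrative names)."""

    def __init__(self):
        self.adjacency = defaultdict(list)  # node → [(edge_type, to, weight)]

    def add_edge(self, from_node, to_node, edge_type, weight):
        self.adjacency[from_node].append((edge_type, to_node, weight))

    def typed_neighbors(self, node, edge_type, min_weight=0.0):
        """Outgoing neighbors of `node` filtered by edge type and weight."""
        return [(to, w) for (t, to, w) in self.adjacency[node]
                if t == edge_type and w >= min_weight]

edges = TypedEdges()
edges.add_edge(1, 2, "caused_success", 0.85)
edges.add_edge(1, 3, "caused_failure", 0.9)
edges.add_edge(1, 4, "caused_success", 0.3)

edges.typed_neighbors(1, "caused_success", min_weight=0.5)  # [(2, 0.85)]
```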
### Proposed API

```python
# Add typed edge
index.add_edge(from_node, to_node, edge_type="caused_success", weight=0.85)

# Query by edge type
successors = index.typed_neighbors(node_id, edge_type="caused_success", min_weight=0.5)

# Multi-hop traversal
paths = index.traverse(start_node, edge_types=["caused_success", "temporal"], max_hops=3)

# Hybrid search: vector similarity + edge reachability
results = index.graph_search(
    query_vec,
    k=5,
    edge_boost={"caused_success": 0.3},  # Boost nodes reachable via success edges
)
```
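One plausible reading of `edge_boost`, an additive bonus for candidates that have outgoing edges of a boosted type, can be sketched index-free (the RFC does not fix the exact formula; `boost_scores` is hypothetical):

```python
def boost_scores(candidates, out_edges, edge_boost):
    """Re-rank (node_id, cosine_score) pairs with additive edge bonuses.
    out_edges: node_id → set of outgoing edge types
    edge_boost: edge_type → additive bonus"""
    boosted = []
    for node, score in candidates:
        bonus = sum(b for t, b in edge_boost.items() if t in out_edges.get(node, ()))
        boosted.append((node, score + bonus))
    return sorted(boosted, key=lambda x: -x[1])

candidates = [(10, 0.80), (11, 0.78), (12, 0.75)]
out_edges = {11: {"caused_success"}, 12: {"caused_failure"}}
boost_scores(candidates, out_edges, {"caused_success": 0.3})
# node 11 overtakes node 10 thanks to its success edge
```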
### Implementation Complexity

| Component | Effort | Where |
|---|---|---|
| `TypedEdgeLayer` struct | Medium | `cvx-index` (new module) |
| Serialization | Low | Postcard, alongside temporal edges |
| Python bindings | Low | PyO3 wrapper |
| `graph_search` with edge boosting | High | Modify HNSW traversal to follow typed edges |
### References

- Graph RAG (Edge et al., Microsoft, 2024) — Community-based summarization over knowledge graphs + vector retrieval. Validates hybrid graph-vector architectures.
- Neo4j vector index (2023-2024) — Industry system combining property graph traversal with vector similarity search.
- Weaviate cross-references (2023-2024) — Typed relationships between vector-indexed objects.
## Part C: Bayesian Retrieval Scoring

### Concept

Replace flat cosine scoring with a Bayesian posterior. Concretely, score each candidate episode e for query q by a weighted log-posterior (mirroring the scoring function below):

score(e | q) = w₀·sim(q, e) + w₁·exp(−λ·Δt(e)) + w₂·reward(e) + w₃·1[region(e) = region(q)] + w₄·usage(e)

where the prior combines:
| Factor | Current CVX support | Formula |
|---|---|---|
| Similarity | HNSW distance | `sim(q, e)` (cosine) |
| Recency | Timestamp | `exp(−λ·Δt)` |
| Reward | `reward(node_id)` | `reward(e)` |
| Region match | `assign_region()` | indicator `1[region(e) = region(q)]` |
| Usage success | Track externally | `usage(e)` (success rate when retrieved) |
### Scoring Function

```python
import numpy as np

def bayesian_score(similarity, recency, reward, region_match, usage_success,
                   weights=(1.0, 0.3, 0.5, 0.2, 0.4)):
    """Weighted log-posterior for episode relevance."""
    w = weights
    return (w[0] * similarity
            + w[1] * np.exp(-0.01 * recency)
            + w[2] * reward
            + w[3] * float(region_match)
            + w[4] * usage_success)
```
### Learning the Weights

The weights can be learned from the online learning loop:

- For each retrieval that led to success: increase the weights that contributed
- For each retrieval that led to failure: decrease the weights that contributed

This is a simple logistic regression: features = (similarity, recency, reward, region_match, usage_success), label = game_won.
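A minimal sketch of that update as a single SGD step of logistic regression (`update_weights` and the learning rate are illustrative, not CVX API):

```python
import numpy as np

def update_weights(w, features, label, lr=0.1):
    """One SGD step of logistic regression on a retrieval outcome.
    features: (similarity, recency_factor, reward, region_match, usage_success)
    label: 1 if the game was won after this retrieval, else 0."""
    x = np.asarray(features, dtype=float)
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # predicted P(win)
    return w + lr * (label - p) * x           # gradient ascent on log-likelihood

w = np.array([1.0, 0.3, 0.5, 0.2, 0.4])
x_win = (0.9, 0.8, 1.0, 1.0, 0.7)
w_new = update_weights(w, x_win, label=1)  # win: contributing weights go up
```

After a loss (`label=0`) the same step pushes the contributing weights down, matching the bullet points above.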
### References

- CoALA (Sumers et al., TMLR 2024) — Cognitive Architectures for Language Agents. Surveys agent memory through a cognitive science lens, discusses Bayesian memory models.
- Generative Agents (Park et al., UIST 2023) — Recency + relevance + importance scoring for memory retrieval. Foundational multi-factor scoring model.
- MuZero (Schrittwieser et al., Nature 2020) — Learned value function for scoring future states. Validates learning to score state quality.
## Part D: Proposed Query API

### New Query Types

| Query | Input | Output | Uses |
|---|---|---|---|
| `transition_model(level)` | HNSW level | `RegionMDP` object | Part A |
| `best_action(region, task_type)` | Current region + context | Ranked action types | Part A |
| `typed_neighbors(node, edge_type)` | Node + edge filter | Connected nodes | Part B |
| `graph_search(vec, k, edge_boost)` | Vector + edge preferences | Boosted results | Part B |
| `bayesian_search(vec, k, weights)` | Vector + scoring weights | Re-scored results | Part C |
### Composability

All three approaches can be composed:

```python
# 1. Vector retrieval (existing)
candidates = index.causal_search(query_vec, k=10, temporal_context=5)

# 2. MDP scoring (Part A)
mdp = RegionMDP(index, level=1)
current_region = mdp.assign(query_vec)
action_scores = mdp.best_action(current_region)

# 3. Bayesian re-ranking (Part C)
scored = []
for c in candidates:
    s = bayesian_score(
        similarity=c['score'],
        recency=now - c['timestamp'],
        reward=index.reward(c['node_id']),
        region_match=(region_of[c['node_id']] == current_region),
        usage_success=usage_tracker[c['entity_id']],
    )
    scored.append((c, s))

# 4. Present top results with MDP-informed strategy
scored.sort(key=lambda x: -x[1])
strategy = mdp.strategy_for(current_region, task_type)
```
### Implementation Priority

| Phase | Component | Effort | Impact | Dependencies |
|---|---|---|---|---|
| 1 | RegionMDP (Python) | Low | High | None — uses existing CVX API |
| 2 | Bayesian scoring (Python) | Low | Medium | Usage tracking from online learning |
| 3 | TypedEdgeLayer (Rust) | Medium | High | New module in cvx-index |
| 4 | graph_search with edge boosting | High | High | Part 3 |
| 5 | Weight learning (Python) | Low | Medium | Parts 1-2 + online learning data |
## Relationship to Other RFCs

- RFC-010 (Temporal Graph): Provides the temporal edge infrastructure. Part B extends this with additional edge types.
- RFC-012 Part C (Agent Memory Gaps): This RFC addresses the “auxiliary structures” question raised in Gap 4 and Part D.
- RFC-012 Part B (Centering): Bayesian scoring (Part C) benefits from centered embeddings for more meaningful similarity scores.
## Part E: Experimental Evidence — Online Learning Dynamics

### Experiments Conducted

| Experiment | Approach | 10-round results (qwen:7b) |
|---|---|---|
| E7c | Online learning, all experience added to index | Peak 20% (R3), plateau 15% (R4-10) |
| E7d | Clean memory — only wins in index, fails → decay only | Peak 27% (R3), plateau 17% (R4-10) |
| E8 | Region MDP scoring + online learning | Unstable: 27% (R3), 7-17% (R2,R4-5) |
### The Contamination Problem

E7c showed progressive memory degradation:

| Round | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Success in memory | 92% | 86% | 81% | 77% | 73% | 70% | 67% | 64% | 62% | 59% |
| Win rate (%) | 6.7 | 10.0 | 20.0 | 13.3 | 16.7 | 16.7 | 13.3 | 10.0 | 16.7 | 16.7 |

Each round adds ~25 fails and ~4 wins. The retrieval index becomes contaminated because failed experiences are semantically similar to new queries (same task types, same observations). HNSW returns them as nearest neighbors regardless of reward.
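The dilution itself is simple arithmetic: adding ~4 wins and ~25 fails per round pulls the index toward the add-rate of 4/29 ≈ 14% wins regardless of the starting mix. A sketch with hypothetical starting sizes:

```python
def success_fraction(wins, total, rounds, wins_per_round=4, fails_per_round=25):
    """Fraction of successful episodes in the index after each round,
    using the ~4 wins / ~25 fails per round observed in E7c
    (the starting sizes below are hypothetical)."""
    fractions = []
    for _ in range(rounds):
        wins += wins_per_round
        total += wins_per_round + fails_per_round
        fractions.append(wins / total)
    return fractions

# Starting from a mostly-expert index, the success share dilutes monotonically:
fracs = success_fraction(wins=320, total=350, rounds=9)
```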
### Clean Memory (E7d) — Partial Fix

Excluding fails from the index improved late-round mean (17.1% vs 14.8%) but did not eliminate degradation. The remaining problem: blind reward decay removes useful experts along with useless ones.
An expert trajectory for clean soapbar that happened to be retrieved
during a failed cool lettuce game gets penalized — even though the
expert was irrelevant to that failure. After several rounds of
cross-task decay, good experts lose their reward unfairly.
### The Need for Context-Aware Reward Decay

Current decay: `reward *= 0.85` for every expert successor retrieved during a failed game. This is context-blind.
Context-aware decay would consider:
| Factor | Question | Available in CVX? |
|---|---|---|
| Task type match | Was the expert solving the same type of task? | Yes — metadata |
| Action follow-through | Did the agent actually follow the expert’s suggestion? | Yes — compare expert action vs agent action |
| Phase match | Was the expert action at the same stage of the task? | Partial — step index in trajectory |
| Region proximity | Was the expert in the same semantic region? | Yes — region_assignments() |
### How RFC-013 Parts Address This

Part A (Region MDP) provides the framework:
- P(success | region, action_type) gives a contextual estimate
- Decay should be proportional to 1 - P(success | expert_region, expert_action_type)
- Experts in high-success regions should be protected from decay
- Experts in low-success regions should decay faster
Part B (Typed Edges) provides attribution:
- `caused_success` edges trace which experts contributed to wins
- `caused_failure` edges trace which experts were present during fails
- But: an expert can be *present during* a failure without *causing* it
- The edge type distinction enables selective decay: only decay experts where the agent followed the expert’s action and still failed
Part C (Bayesian Scoring) provides the scoring model:
- Instead of hard decay (`reward *= 0.85`), use a Bayesian update of P(useful | history):
- An expert with high prior P(useful) (many past successes) should barely be affected by one failure
- An expert with weak prior (few observations) should update more
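One concrete realization of this bullet is a Beta-Bernoulli posterior over P(useful); the sketch below is illustrative (prior strength and helper name are assumptions, not CVX API):

```python
def p_useful(successes, failures, alpha=1.0, beta=1.0):
    """Posterior mean of P(useful) under a Beta(alpha, beta) prior."""
    return (alpha + successes) / (alpha + beta + successes + failures)

veteran = p_useful(successes=20, failures=0)  # strong prior from many wins
rookie = p_useful(successes=1, failures=0)    # weak prior, few observations

# One observed failure barely moves the veteran but hits the rookie hard:
veteran_after = p_useful(20, 1)
rookie_after = p_useful(1, 1)
```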
### Proposed Context-Aware Decay

```python
def context_aware_decay(expert_node, failed_game, index, mdp):
    """Decay expert reward only if contextually relevant to the failure."""
    expert_task_type = task_type_of(expert_node)
    game_task_type = failed_game['task_type']

    # 1. Task type mismatch → no decay (expert was irrelevant)
    if expert_task_type != game_task_type:
        return  # don't penalize cross-task retrievals

    # 2. Did the agent follow the expert?
    expert_action = action_of(expert_node)
    agent_actions = [step['action'] for step in failed_game['history']]
    if expert_action not in agent_actions:
        return  # expert was retrieved but ignored → not responsible

    # 3. Region-conditional decay
    expert_region = region_of(expert_node)
    region_quality = mdp.region_quality(expert_region)

    # Decay proportional to region failure rate:
    #   high-quality region → small decay (expert probably fine, game was unlucky)
    #   low-quality region  → large decay (expert is in a bad area)
    decay_factor = 0.7 + 0.25 * region_quality  # range: 0.7 (bad region) to 0.95 (good region)

    current_reward = index.reward(expert_node)
    if current_reward is not None:
        index.set_reward(expert_node, current_reward * decay_factor)
```
### Expected Impact

| Scenario | Blind decay (current) | Context-aware decay (proposed) |
|---|---|---|
| Expert for `clean` retrieved during failed `cool` game | -15% reward | No decay (wrong task type) |
| Expert action retrieved but agent chose differently | -15% reward | No decay (not followed) |
| Expert in high-success region, game failed for other reasons | -15% reward | -5% decay (protected) |
| Expert in low-success region, agent followed and failed | -15% reward | -30% decay (amplified) |
## Part F: Integrated Active Memory Architecture

### From Passive Retrieval to Active Learning

Parts A-E describe individual components. Together, they form a closed-loop active memory system where CVX evolves from a retrieval engine into a learning agent that refines its own memory.
### The Loop

```
┌─────────────────────────────────────────────────────┐
│ DECISION TIME                                       │
│                                                     │
│ 1. Embed observation → query vector                 │
│ 2. CVX causal_search → k candidate episodes         │
│ 3. Region MDP scores each candidate's action type   │
│    P(success | current_region, action_type)         │
│ 4. Bayesian model re-ranks candidates               │
│    P(useful | similarity, recency, reward,          │
│      region_match, task_match, usage_history)       │
│ 5. Format top candidates as LLM context             │
│ 6. LLM chooses action                               │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│ EXECUTION                                           │
│                                                     │
│ 7. Environment step → observation, reward           │
│ 8. If episode complete → outcome signal             │
└──────────────────────┬──────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────┐
│ MEMORY UPDATE                                       │
│                                                     │
│ 9. Win  → add episode to index (reward=1.0)         │
│    Fail → do NOT add to index                       │
│                                                     │
│ 10. Context-aware decay of retrieved experts:       │
│     - Skip if task_type mismatch                    │
│     - Skip if agent didn't follow expert action     │
│     - Decay ∝ (1 - region_quality) if followed+fail │
│                                                     │
│ 11. Update MDP transitions:                         │
│     (region, action_type) → next_region [+reward]   │
│                                                     │
│ 12. Update Bayesian weights:                        │
│     logistic regression on (features → outcome)     │
│                                                     │
│ 13. Add typed edges:                                │
│     Win: candidate → caused_success                 │
│     Fail + followed: candidate → caused_failure     │
│     Fail + ignored: candidate → irrelevant          │
└─────────────────────────────────────────────────────┘
```
### What Each Part Contributes

| Step | Part | What it provides | Without it |
|---|---|---|---|
| 3 | A (MDP) | Contextual action scoring | All actions scored equally |
| 4 | C (Bayesian) | Learned retrieval ranking | Fixed cosine ranking |
| 10 | E (Context decay) | Selective expert refinement | Blind decay destroys good experts |
| 13 | B (Typed edges) | Causal attribution of outcomes | No attribution, no learning |
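A toy dispatch of memory-update steps 9-13, with dict-based stand-ins for the index, MDP, and typed-edge APIs (all names illustrative; the blind `*= 0.85` is a placeholder for Part E's context-aware decay):

```python
def memory_update(memory, won, episode_actions, retrieved):
    """Toy version of steps 9-13 of the loop.
    memory: {"episodes": [...], "edges": [...], "reward": {node: float}}
    retrieved: [(node_id, suggested_action)]"""
    if won:                                 # 9. wins-only indexing
        memory["episodes"].append(episode_actions)

    for node, action in retrieved:
        followed = action in episode_actions
        if won:                             # 13. credit the retrieved candidate
            memory["edges"].append((node, "caused_success"))
        elif followed:                      # 10 + 13. followed and failed
            memory["edges"].append((node, "caused_failure"))
            memory["reward"][node] *= 0.85  # placeholder for context-aware decay
        else:                               # fail + ignored → no decay, no blame
            memory["edges"].append((node, "irrelevant"))
    return memory

mem = {"episodes": [], "edges": [], "reward": {7: 1.0, 8: 1.0}}
memory_update(mem, won=False, episode_actions=["take mug"],
              retrieved=[(7, "take mug"), (8, "open fridge")])
# failed episode: nothing indexed, node 7 blamed and decayed, node 8 untouched
```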
### Properties of the Integrated System

**Self-improving:** Each episode refines the memory — wins add knowledge, fails sharpen the decay. The MDP and Bayesian weights update continuously.

**Non-destructive:** Original expert trajectories are never deleted. Reward decay reduces their retrieval priority but preserves them for potential re-evaluation. Clean memory (E7d) ensures fails never pollute the index.

**Compositional:** Each part can be added incrementally. Part A (MDP, Python) works without Parts B-C. Parts B-C enhance but don’t require each other.

**Model-agnostic:** The memory system works with any LLM (qwen:7b, GPT-4o). Stronger models extract more value from the same memory (E5: 43% with GPT-4o vs E7b: 27% with qwen:7b, both from the same expert trajectories).
### Experimental Evidence for Each Step

| Step | Evidence | Result |
|---|---|---|
| Wins-only index (9) | E7d vs E7c | Mean 17.1% vs 14.8% in late rounds |
| Context-aware decay (10) | E7e (running) | Pending |
| Strategy templates (5) | E7b vs E7 | Round 3: 26.7% vs 16.7% |
| Online learning (9-12) | E7c Rounds 1-3 | 6.7% → 20% (3× improvement) |
| Frontier model scaling (6) | E5 vs E3 | 43.3% vs 20% (same memory, bigger model) |
### Additional References

- Causal Reinforcement Learning (Bareinboim et al., 2022-2024) — Structural causal models for RL agents. Formal framework for causal discovery from trajectory data.
- CausalCuriosity (Sontakke et al., ICML 2021) — Agents learn causal structure by active experimentation. Uses trajectory data to discover causal graphs.
- Decision Transformer (Chen et al., NeurIPS 2021) — Frames RL as sequence modeling over (state, action, reward) trajectories. Foundational for treating episode memory as sequence prediction.