
RFC-013: Probabilistic & Relational Memory Structures


CVX stores agent experiences as timestamped vectors with temporal edges and reward annotations. This supports “find similar states” and “what happened next” queries. Retrieval, however, is purely similarity-based; it cannot answer:

  • “In states like mine, which action has the highest success probability?”
  • “What sequence of regions do successful episodes traverse?”
  • “Is there a causal relationship between action X and outcome Y?”

These questions require probabilistic and relational structure over the existing vector index — not replacing HNSW, but augmenting it.


Part A: Region Transition Model (MDP over HNSW Regions)


HNSW regions (hub nodes at level L) define a discrete state space. Observed episode trajectories define transitions between these states. Reward annotations define outcome signals. Together, this is a Markov Decision Process (MDP):

State: s = region(node, level=L)
Action: a = action_type (go_to, take, put, clean, heat, cool, use, open)
Transition: P(s' | s, a) = count(s→s' via a) / count(s, a)
Reward: R(s, a) = mean(reward | transitions from s via a)
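As a hedged numeric illustration (the region IDs, counts, and rewards below are invented), these maximum-likelihood estimates are plain count ratios:

```python
from collections import Counter

# Hypothetical transitions observed from region 3 via action "take"
counts = Counter({7: 6, 3: 2, 9: 2})  # next_region -> count
rewards = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

total = sum(counts.values())
p_next = {s: c / total for s, c in counts.items()}  # P(s' | s=3, a="take")
r_sa = sum(rewards) / len(rewards)                  # R(s=3, a="take")

print(p_next[7], r_sa)  # most likely next region is 7 with probability 0.6
```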
| Primitive | API | MDP role |
| --- | --- | --- |
| Discrete states | region_assignments(level) | State space S |
| State membership | assign_region(vec, level) | Observation → state |
| Trajectories | trajectory(entity_id) | State sequences |
| Outcomes | reward(node_id) | Reward signal R |
| Temporal ordering | Timestamps | Transition ordering |

What Needs to Be Built (Python layer, no Rust changes)

from collections import Counter, defaultdict

import numpy as np

class RegionMDP:
    """MDP over HNSW regions, learned from episode trajectories."""

    def __init__(self, index, level=1):
        self.index = index
        self.level = level
        # (region, action_type) → Counter{next_region: count}
        self.transitions = defaultdict(Counter)
        # (region, action_type) → [reward1, reward2, ...]
        self.rewards = defaultdict(list)

    def learn_from_episodes(self, index, episodes):
        """Build the transition model from observed trajectories."""
        region_of = index.region_assignments(self.level)
        for episode in episodes:
            traj = index.trajectory(episode.entity_id)
            for i in range(len(traj) - 1):
                s = region_of[traj[i].node_id]
                s_next = region_of[traj[i + 1].node_id]
                # classify_action: external helper mapping raw actions
                # to the eight action types listed above
                action_type = classify_action(traj[i].action)
                reward = index.reward(traj[i].node_id)
                self.transitions[(s, action_type)][s_next] += 1
                if reward is not None:
                    self.rewards[(s, action_type)].append(reward)

    def best_action(self, current_region):
        """P(success | region, action_type) for each action type."""
        scores = {}
        for (s, a), rewards in self.rewards.items():
            if s == current_region:
                scores[a] = np.mean(rewards)
        return sorted(scores.items(), key=lambda x: -x[1])

    def transition_probability(self, s, a, s_next):
        """P(s' | s, a)"""
        counts = self.transitions.get((s, a), {})
        total = sum(counts.values())
        return counts.get(s_next, 0) / total if total > 0 else 0.0
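For intuition, the ranking inside best_action amounts to “sort action types by mean observed reward from this region”. A standalone sketch with invented regions and rewards:

```python
from statistics import mean

# (region, action_type) -> observed rewards, as learn_from_episodes would build
rewards = {
    (3, "take"): [1.0, 1.0, 0.0],
    (3, "heat"): [1.0, 1.0, 1.0, 1.0],
    (5, "take"): [0.0],
}

def best_action(current_region):
    # restrict to this region, score each action type by mean reward
    scores = {a: mean(rs) for (s, a), rs in rewards.items() if s == current_region}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(best_action(3))  # "heat" ranks first from region 3
```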

The MDP augments, not replaces, vector retrieval:

1. Agent observes current state → embed → query CVX
2. CVX returns k similar past states with causal continuations
3. MDP scores each continuation: P(success | current_region, action_type)
4. Agent receives: retrieval results RANKED by P(success)
  • Task types with clear phase structure (heat_then_place: find→take→microwave→take→put) should improve most — the MDP learns the successful phase sequence
  • Task types with high location variance (pick_and_place: object can be anywhere) may improve less — location is not captured by region
  1. DreamerV3 (Hafner et al., 2023) — World model with discrete latent states and learned transition dynamics. Validates discretizing continuous spaces for planning.
  2. LATS (Zhou et al., NeurIPS 2023) — Uses LLM as world model + MCTS with past trajectories for planning. Shows experience-based transition prediction.
  3. VQ-BeT (Lee et al., ICML 2024) — Vector-quantized behavior tokens for discretizing continuous action spaces. Codebook approach analogous to HNSW regions as state tokens.

Add a second edge layer to TemporalGraphIndex with typed, weighted, cross-entity edges:

pub struct TypedEdgeLayer {
    edges: HashMap<EdgeType, Vec<(u32, u32, f32)>>, // type → (from, to, weight)
    adjacency: Vec<Vec<(EdgeType, u32, f32)>>,      // node → outgoing
}

pub enum EdgeType {
    Temporal,         // Already exists (predecessor/successor)
    SameActionType,   // Same action in different episodes
    CausedSuccess,    // Action → successful outcome
    CausedFailure,    // Action → failed outcome
    RegionTransition, // Cross-region link with probability weight
    GrangerCausal,    // Granger causality between entities
}
| Query | Edge types used | Example |
| --- | --- | --- |
| “What actions succeed from here?” | CausedSuccess | Navigate from current node via success edges |
| “What do other agents do in this situation?” | SameActionType | Cross-entity exploration |
| “What region should I aim for?” | RegionTransition (high reward) | Follow high-reward transition paths |
| “Does this entity influence that one?” | GrangerCausal | Multi-agent causal analysis |
# Add typed edge
index.add_edge(from_node, to_node, edge_type="caused_success", weight=0.85)
# Query by edge type
successors = index.typed_neighbors(node_id, edge_type="caused_success", min_weight=0.5)
# Multi-hop traversal
paths = index.traverse(start_node, edge_types=["caused_success", "temporal"], max_hops=3)
# Hybrid search: vector similarity + edge reachability
results = index.graph_search(
    query_vec, k=5,
    edge_boost={"caused_success": 0.3},  # Boost nodes reachable via success edges
)
| Component | Effort | Where |
| --- | --- | --- |
| TypedEdgeLayer struct | Medium | cvx-index (new module) |
| Serialization | Low | Postcard, alongside temporal edges |
| Python bindings | Low | PyO3 wrapper |
| graph_search with edge boosting | High | Modify HNSW traversal to follow typed edges |
  1. Graph RAG (Edge et al., Microsoft, 2024) — Community-based summarization over knowledge graphs + vector retrieval. Validates hybrid graph-vector architectures.
  2. Neo4j vector index (2023-2024) — Industry system combining property graph traversal with vector similarity search.
  3. Weaviate cross-references (2023-2024) — Typed relationships between vector-indexed objects.

Replace flat cosine scoring with a Bayesian posterior:

$$P(\text{episode relevant} \mid \text{query}, \text{context}) \propto \underbrace{P(\text{query} \mid \text{episode})}_{\text{similarity}} \times \underbrace{P(\text{episode})}_{\text{prior}}$$

Where the prior combines:

$$P(\text{episode}) = f(\text{recency}) \times f(\text{reward}) \times f(\text{region\_match}) \times f(\text{usage\_count})$$

| Factor | Current CVX support | Formula |
| --- | --- | --- |
| Similarity | HNSW distance | $\exp(-d_{\text{cosine}})$ |
| Recency | Timestamp | $\exp(-\lambda \cdot \text{age})$ |
| Reward | reward(node_id) | $r \in [0, 1]$ |
| Region match | assign_region() | $\mathbb{1}[\text{same region}]$ |
| Usage success | Track externally | $\text{wins} / (\text{wins} + \text{uses})$ |
def bayesian_score(similarity, recency, reward, region_match, usage_success,
                   weights=(1.0, 0.3, 0.5, 0.2, 0.4)):
    """Weighted log-posterior for episode relevance.

    recency is the episode's age (now - timestamp); larger means older.
    """
    w = weights
    return (w[0] * similarity +
            w[1] * np.exp(-0.01 * recency) +
            w[2] * reward +
            w[3] * float(region_match) +
            w[4] * usage_success)

The weights can be learned from the online learning loop:

  • For each retrieval that led to success: increase the weights that contributed
  • For each retrieval that led to failure: decrease the weights that contributed

This is a simple logistic regression: features = (similarity, recency, reward, region_match, usage_success), label = game_won.
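A minimal sketch of that weight-learning step, fit by plain gradient descent on the log-loss (the feature matrix here is synthetic; a real loop would log one feature row per retrieval plus the game outcome):

```python
import numpy as np

rng = np.random.default_rng(0)

# One row per retrieval: (similarity, recency, reward, region_match, usage_success)
X = rng.random((200, 5))
# Synthetic labels: games where similarity and reward were high tended to be won
y = (X[:, 0] + X[:, 2] + 0.1 * rng.standard_normal(200) > 1.0).astype(float)

w = np.zeros(5)
b = 0.0
lr = 0.5
for _ in range(500):  # gradient descent on logistic log-loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * (p - y).mean()

# Similarity (w[0]) and reward (w[2]) should dominate the learned weights
print(np.round(w, 2))
```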

  1. CoALA (Sumers et al., TMLR 2024) — Cognitive Architectures for Language Agents. Surveys agent memory through cognitive science lens, discusses Bayesian memory models.
  2. Generative Agents (Park et al., UIST 2023) — Recency + relevance + importance scoring for memory retrieval. Foundational multi-factor scoring model.
  3. MuZero (Schrittwieser et al., Nature 2020) — Learned value function for scoring future states. Validates learning to score state quality.

| Query | Input | Output | Uses |
| --- | --- | --- | --- |
| transition_model(level) | HNSW level | RegionMDP object | Part A |
| best_action(region, task_type) | Current region + context | Ranked action types | Part A |
| typed_neighbors(node, edge_type) | Node + edge filter | Connected nodes | Part B |
| graph_search(vec, k, edge_boost) | Vector + edge preferences | Boosted results | Part B |
| bayesian_search(vec, k, weights) | Vector + scoring weights | Re-scored results | Part C |

All three approaches can be composed:

# (now, region_of, usage_tracker, and task_type are supplied by the agent loop)

# 1. Vector retrieval (existing)
candidates = index.causal_search(query_vec, k=10, temporal_context=5)

# 2. MDP scoring (Part A)
mdp = RegionMDP(index, level=1)
current_region = mdp.assign(query_vec)
action_scores = mdp.best_action(current_region)

# 3. Bayesian re-ranking (Part C)
scored = []
for c in candidates:
    s = bayesian_score(
        similarity=c['score'],
        recency=now - c['timestamp'],
        reward=index.reward(c['node_id']),
        region_match=(region_of[c['node_id']] == current_region),
        usage_success=usage_tracker[c['entity_id']],
    )
    scored.append((c, s))

# 4. Present top results with MDP-informed strategy
scored.sort(key=lambda x: -x[1])
strategy = mdp.strategy_for(current_region, task_type)

| Phase | Component | Effort | Impact | Dependencies |
| --- | --- | --- | --- | --- |
| 1 | RegionMDP (Python) | Low | High | None — uses existing CVX API |
| 2 | Bayesian scoring (Python) | Low | Medium | Usage tracking from online learning |
| 3 | TypedEdgeLayer (Rust) | Medium | High | New module in cvx-index |
| 4 | graph_search with edge boosting | High | High | Part 3 |
| 5 | Weight learning (Python) | Low | Medium | Parts 1-2 + online learning data |
  • RFC-010 (Temporal Graph): Provides the temporal edge infrastructure. Part B extends this with additional edge types.
  • RFC-012 Part C (Agent Memory Gaps): This RFC addresses the “auxiliary structures” question raised in Gap 4 and Part D.
  • RFC-012 Part B (Centering): Bayesian scoring (Part C) benefits from centered embeddings for more meaningful similarity scores.

Part E: Experimental Evidence — Online Learning Dynamics

| Experiment | Approach | 10-round results (qwen:7b) |
| --- | --- | --- |
| E7c | Online learning, all experience added to index | Peak 20% (R3), plateau 15% (R4-10) |
| E7d | Clean memory — only wins in index, fails → decay only | Peak 27% (R3), plateau 17% (R4-10) |
| E8 | Region MDP scoring + online learning | Unstable: 27% (R3), 7-17% (R2, R4-5) |

E7c showed progressive memory degradation:

Round:            1     2     3     4     5     6     7     8     9    10
Success in mem:  92%   86%   81%   77%   73%   70%   67%   64%   62%   59%
Win rate (%):    6.7  10.0  20.0  13.3  16.7  16.7  13.3  10.0  16.7  16.7

Each round adds ~25 fails and ~4 wins. The retrieval index becomes contaminated because failed experiences are semantically similar to new queries (same task types, same observations). HNSW returns them as nearest neighbors regardless of reward.

Excluding fails from the index improved late-round mean (17.1% vs 14.8%) but did not eliminate degradation. The remaining problem: blind reward decay removes useful experts along with useless ones.

An expert trajectory for clean soapbar that happened to be retrieved during a failed cool lettuce game gets penalized — even though the expert was irrelevant to that failure. After several rounds of cross-task decay, good experts lose their reward unfairly.

Current decay: reward *= 0.85 for every expert successor retrieved during a failed game. This is context-blind.

Context-aware decay would consider:

| Factor | Question | Available in CVX? |
| --- | --- | --- |
| Task type match | Was the expert solving the same type of task? | Yes — metadata |
| Action follow-through | Did the agent actually follow the expert’s suggestion? | Yes — compare expert action vs agent action |
| Phase match | Was the expert action at the same stage of the task? | Partial — step index in trajectory |
| Region proximity | Was the expert in the same semantic region? | Yes — region_assignments() |

Part A (Region MDP) provides the framework:

  • P(success | region, action_type) gives a contextual estimate
  • Decay should be proportional to 1 - P(success | expert_region, expert_action_type)
  • Experts in high-success regions should be protected from decay
  • Experts in low-success regions should decay faster
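These rules can be folded into one hedged formula. The sketch below reuses the 0.70-0.95 range from the context_aware_decay code in this RFC; the function name is invented:

```python
def contextual_decay_factor(p_success, floor=0.70, span=0.25):
    """Reward multiplier applied to an expert after a failed game.

    p_success = P(success | expert_region, expert_action_type) from the
    Region MDP. High-success regions are protected (factor near
    floor + span = 0.95); low-success regions decay toward the 0.70 floor.
    """
    return floor + span * p_success

print(contextual_decay_factor(1.0))  # near 0.95: protected expert
print(contextual_decay_factor(0.0))  # near 0.70: fast decay
```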

Part B (Typed Edges) provides attribution:

  • caused_success edges trace which experts contributed to wins
  • caused_failure edges trace which experts were present during fails
  • But: an expert can be present_during_failure without causing_failure
  • The edge type distinction enables selective decay: only decay experts where the agent followed the expert’s action and still failed
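That distinction can be made at attribution time, when the game outcome is known. A minimal sketch: the follow-check is simplified to exact action membership, and the present_during_failure label (for retrieved-but-ignored experts) is a hypothetical addition to the edge types proposed above:

```python
def attribute_outcome(expert_action, agent_actions, game_won):
    """Classify the expert → outcome edge to add after a game.

    Only experts whose suggested action the agent actually executed are
    held responsible for a failure; mere presence is not causation.
    """
    followed = expert_action in agent_actions
    if game_won:
        return "caused_success"
    return "caused_failure" if followed else "present_during_failure"

print(attribute_outcome("take soapbar", ["go to sink", "take soapbar"], False))
# followed and failed -> caused_failure
print(attribute_outcome("heat egg", ["go to sink", "take soapbar"], False))
# ignored -> present_during_failure, exempt from decay
```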

Part C (Bayesian Scoring) provides the scoring model:

  • Instead of hard decay (reward *= 0.85), use a Bayesian update: $P(\text{useful} \mid \text{retrieved in failure}) = \frac{P(\text{failure} \mid \text{useful}) \cdot P(\text{useful})}{P(\text{failure})}$
  • An expert with high prior P(useful) (many past successes) should barely be affected by one failure
  • An expert with weak prior (few observations) should update more
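One concrete instantiation of that update, assuming each expert carries Beta-distributed usefulness counts (pure bookkeeping, not an existing CVX API):

```python
def posterior_usefulness(successes, failures):
    """Mean of Beta(successes + 1, failures + 1): P(useful) given history."""
    return (successes + 1) / (successes + failures + 2)

# Strong prior: 20 past successes; one new failure barely moves the estimate
strong_before = posterior_usefulness(20, 0)  # ~0.955
strong_after = posterior_usefulness(20, 1)   # ~0.913

# Weak prior: a single observation; the same failure moves it much more
weak_before = posterior_usefulness(1, 0)     # ~0.667
weak_after = posterior_usefulness(1, 1)      # 0.5
```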
def context_aware_decay(expert_node, failed_game, index, mdp):
    """Decay expert reward only if contextually relevant to the failure."""
    expert_task_type = task_type_of(expert_node)
    game_task_type = failed_game['task_type']

    # 1. Task type mismatch → no decay (expert was irrelevant)
    if expert_task_type != game_task_type:
        return  # don't penalize cross-task retrievals

    # 2. Did the agent follow the expert?
    expert_action = action_of(expert_node)
    agent_actions = [step['action'] for step in failed_game['history']]
    if expert_action not in agent_actions:
        return  # expert was retrieved but ignored → not responsible

    # 3. Region-conditional decay
    expert_region = region_of(expert_node)
    region_quality = mdp.region_quality(expert_region)

    # Decay proportional to region failure rate:
    #   high-quality region → small decay (expert probably fine, game was unlucky)
    #   low-quality region  → large decay (expert is in a bad area)
    decay_factor = 0.7 + 0.25 * region_quality  # range: 0.7 (bad) to 0.95 (good)
    current_reward = index.reward(expert_node)
    if current_reward is not None:
        index.set_reward(expert_node, current_reward * decay_factor)
| Scenario | Blind decay (current) | Context-aware decay (proposed) |
| --- | --- | --- |
| Expert for clean retrieved during failed cool game | -15% reward | No decay (wrong task type) |
| Expert action retrieved but agent chose differently | -15% reward | No decay (not followed) |
| Expert in high-success region, game failed for other reasons | -15% reward | -5% decay (protected) |
| Expert in low-success region, agent followed and failed | -15% reward | -30% decay (amplified) |

Part F: Integrated Active Memory Architecture


Parts A-E describe individual components. Together, they form a closed-loop active memory system where CVX evolves from a retrieval engine into a learning agent that refines its own memory.

┌─────────────────────────────────────────────────────┐
│                    DECISION TIME                    │
│                                                     │
│ 1. Embed observation → query vector                 │
│ 2. CVX causal_search → k candidate episodes         │
│ 3. Region MDP scores each candidate's action type   │
│    P(success | current_region, action_type)         │
│ 4. Bayesian model re-ranks candidates               │
│    P(useful | similarity, recency, reward,          │
│      region_match, task_match, usage_history)       │
│ 5. Format top candidates as LLM context             │
│ 6. LLM chooses action                               │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│                      EXECUTION                      │
│                                                     │
│ 7. Environment step → observation, reward           │
│ 8. If episode complete → outcome signal             │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│                    MEMORY UPDATE                    │
│                                                     │
│ 9. Win → add episode to index (reward=1.0)          │
│    Fail → do NOT add to index                       │
│                                                     │
│ 10. Context-aware decay of retrieved experts:       │
│     - Skip if task_type mismatch                    │
│     - Skip if agent didn't follow expert action     │
│     - Decay ∝ (1 - region_quality) if followed+fail │
│                                                     │
│ 11. Update MDP transitions:                         │
│     (region, action_type) → next_region [+reward]   │
│                                                     │
│ 12. Update Bayesian weights:                        │
│     logistic regression on (features → outcome)     │
│                                                     │
│ 13. Add typed edges:                                │
│     Win: candidate → caused_success                 │
│     Fail + followed: candidate → caused_failure     │
│     Fail + ignored: candidate → irrelevant          │
└─────────────────────────────────────────────────────┘
| Step | Part | What it provides | Without it |
| --- | --- | --- | --- |
| 3 | A (MDP) | Contextual action scoring | All actions scored equally |
| 4 | C (Bayesian) | Learned retrieval ranking | Fixed cosine ranking |
| 10 | E (Context decay) | Selective expert refinement | Blind decay destroys good experts |
| 13 | B (Typed edges) | Causal attribution of outcomes | No attribution, no learning |

Self-improving: Each episode refines the memory — wins add knowledge, fails sharpen the decay. The MDP and Bayesian weights update continuously.

Non-destructive: Original expert trajectories are never deleted. Reward decay reduces their retrieval priority but preserves them for potential re-evaluation. Clean memory (E7d) ensures fails never pollute the index.

Compositional: Each part can be added incrementally. Part A (MDP, Python) works without Parts B-C. Parts B-C enhance but don’t require each other.

Model-agnostic: The memory system works with any LLM (qwen:7b, GPT-4o). Stronger models extract more value from the same memory (E5: 43% with GPT-4o vs E7b: 27% with qwen:7b, both from same expert trajectories).

| Step | Evidence | Result |
| --- | --- | --- |
| Wins-only index (9) | E7d vs E7c | Mean 17.1% vs 14.8% in late rounds |
| Context-aware decay (10) | E7e (running) | Pending |
| Strategy templates (5) | E7b vs E7 | Round 3: 26.7% vs 16.7% |
| Online learning (9-12) | E7c Rounds 1-3 | 6.7% → 20% (3× improvement) |
| Frontier model scaling (6) | E5 vs E3 | 43.3% vs 20% (same memory, bigger model) |

  1. Causal Reinforcement Learning (Bareinboim et al., 2022-2024) — Structural causal models for RL agents. Formal framework for causal discovery from trajectory data.
  2. CausalCuriosity (Sontakke et al., ICML 2021) — Agents learn causal structure by active experimentation. Uses trajectory data to discover causal graphs.
  3. Decision Transformer (Chen et al., NeurIPS 2021) — Frames RL as sequence modeling over (state, action, reward) trajectories. Foundational for treating episode memory as sequence prediction.