
RFC-013: Probabilistic & Relational Memory Structures


CVX stores agent experiences as timestamped vectors with temporal edges and reward annotations. This supports “find similar states” and “what happened next” queries. Retrieval, however, is purely similarity-based; it cannot answer:

  • “In states like mine, which action has the highest success probability?”
  • “What sequence of regions do successful episodes traverse?”
  • “Is there a causal relationship between action X and outcome Y?”

These questions require probabilistic and relational structure over the existing vector index — not replacing HNSW, but augmenting it.


Part A: Region Transition Model (MDP over HNSW Regions)


HNSW regions (hub nodes at level L) define a discrete state space. Observed episode trajectories define transitions between these states. Reward annotations define outcome signals. Together, this is a Markov Decision Process (MDP):

State: s = region(node, level=L)
Action: a = action_type (go_to, take, put, clean, heat, cool, use, open)
Transition: P(s' | s, a) = count(s→s' via a) / count(s, a)
Reward: R(s, a) = mean(reward | transitions from s via a)
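As a hedged numeric illustration (the region IDs, counts, and rewards below are invented), these maximum-likelihood estimates are plain count ratios:

```python
from collections import Counter

# Hypothetical transitions observed from region 3 via action "take"
counts = Counter({7: 6, 3: 2, 9: 2})  # next_region -> count
rewards = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]

total = sum(counts.values())
p_next = {s: c / total for s, c in counts.items()}  # P(s' | s=3, a="take")
r_sa = sum(rewards) / len(rewards)                  # R(s=3, a="take")

print(p_next[7], r_sa)  # most likely next region is 7 with probability 0.6
```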
| Primitive | API | MDP role |
| --- | --- | --- |
| Discrete states | region_assignments(level) | State space S |
| State membership | assign_region(vec, level) | Observation → state |
| Trajectories | trajectory(entity_id) | State sequences |
| Outcomes | reward(node_id) | Reward signal R |
| Temporal ordering | Timestamps | Transition ordering |

What Needs to Be Built (Python layer, no Rust changes)

from collections import Counter, defaultdict

import numpy as np

class RegionMDP:
    """MDP over HNSW regions, learned from episode trajectories."""

    def __init__(self, index, level=1):
        self.index = index
        self.level = level
        # (region, action_type) → Counter{next_region: count}
        self.transitions = defaultdict(Counter)
        # (region, action_type) → [reward1, reward2, ...]
        self.rewards = defaultdict(list)

    def learn_from_episodes(self, index, episodes):
        """Build the transition model from observed trajectories."""
        region_of = index.region_assignments(self.level)
        for episode in episodes:
            traj = index.trajectory(episode.entity_id)
            for i in range(len(traj) - 1):
                s = region_of[traj[i].node_id]
                s_next = region_of[traj[i + 1].node_id]
                # classify_action: external helper mapping raw actions
                # to the eight action types listed above
                action_type = classify_action(traj[i].action)
                reward = index.reward(traj[i].node_id)
                self.transitions[(s, action_type)][s_next] += 1
                if reward is not None:
                    self.rewards[(s, action_type)].append(reward)

    def best_action(self, current_region):
        """P(success | region, action_type) for each action type."""
        scores = {}
        for (s, a), rewards in self.rewards.items():
            if s == current_region:
                scores[a] = np.mean(rewards)
        return sorted(scores.items(), key=lambda x: -x[1])

    def transition_probability(self, s, a, s_next):
        """P(s' | s, a)"""
        counts = self.transitions.get((s, a), {})
        total = sum(counts.values())
        return counts.get(s_next, 0) / total if total > 0 else 0.0
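For intuition, the ranking inside best_action amounts to “sort action types by mean observed reward from this region”. A standalone sketch with invented regions and rewards:

```python
from statistics import mean

# (region, action_type) -> observed rewards, as learn_from_episodes would build
rewards = {
    (3, "take"): [1.0, 1.0, 0.0],
    (3, "heat"): [1.0, 1.0, 1.0, 1.0],
    (5, "take"): [0.0],
}

def best_action(current_region):
    # restrict to this region, score each action type by mean reward
    scores = {a: mean(rs) for (s, a), rs in rewards.items() if s == current_region}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(best_action(3))  # "heat" ranks first from region 3
```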

The MDP augments, not replaces, vector retrieval:

1. Agent observes current state → embed → query CVX
2. CVX returns k similar past states with causal continuations
3. MDP scores each continuation: P(success | current_region, action_type)
4. Agent receives: retrieval results RANKED by P(success)
  • Task types with clear phase structure (heat_then_place: find→take→microwave→take→put) should improve most — the MDP learns the successful phase sequence
  • Task types with high location variance (pick_and_place: object can be anywhere) may improve less — location is not captured by region
  1. DreamerV3 (Hafner et al., 2023) — World model with discrete latent states and learned transition dynamics. Validates discretizing continuous spaces for planning.
  2. LATS (Zhou et al., NeurIPS 2023) — Uses LLM as world model + MCTS with past trajectories for planning. Shows experience-based transition prediction.
  3. VQ-BeT (Lee et al., ICML 2024) — Vector-quantized behavior tokens for discretizing continuous action spaces. Codebook approach analogous to HNSW regions as state tokens.

Add a second edge layer to TemporalGraphIndex with typed, weighted, cross-entity edges:

pub struct TypedEdgeLayer {
    edges: HashMap<EdgeType, Vec<(u32, u32, f32)>>, // type → (from, to, weight)
    adjacency: Vec<Vec<(EdgeType, u32, f32)>>,      // node → outgoing
}

pub enum EdgeType {
    Temporal,         // Already exists (predecessor/successor)
    SameActionType,   // Same action in different episodes
    CausedSuccess,    // Action → successful outcome
    CausedFailure,    // Action → failed outcome
    RegionTransition, // Cross-region link with probability weight
    GrangerCausal,    // Granger causality between entities
}
| Query | Edge types used | Example |
| --- | --- | --- |
| “What actions succeed from here?” | CausedSuccess | Navigate from current node via success edges |
| “What do other agents do in this situation?” | SameActionType | Cross-entity exploration |
| “What region should I aim for?” | RegionTransition (high reward) | Follow high-reward transition paths |
| “Does this entity influence that one?” | GrangerCausal | Multi-agent causal analysis |
# Add typed edge
index.add_edge(from_node, to_node, edge_type="caused_success", weight=0.85)
# Query by edge type
successors = index.typed_neighbors(node_id, edge_type="caused_success", min_weight=0.5)
# Multi-hop traversal
paths = index.traverse(start_node, edge_types=["caused_success", "temporal"], max_hops=3)
# Hybrid search: vector similarity + edge reachability
results = index.graph_search(
    query_vec, k=5,
    edge_boost={"caused_success": 0.3},  # Boost nodes reachable via success edges
)
| Component | Effort | Where |
| --- | --- | --- |
| TypedEdgeLayer struct | Medium | cvx-index (new module) |
| Serialization | Low | Postcard, alongside temporal edges |
| Python bindings | Low | PyO3 wrapper |
| graph_search with edge boosting | High | Modify HNSW traversal to follow typed edges |
  1. Graph RAG (Edge et al., Microsoft, 2024) — Community-based summarization over knowledge graphs + vector retrieval. Validates hybrid graph-vector architectures.
  2. Neo4j vector index (2023-2024) — Industry system combining property graph traversal with vector similarity search.
  3. Weaviate cross-references (2023-2024) — Typed relationships between vector-indexed objects.

Replace flat cosine scoring with a Bayesian posterior:

$$P(\text{episode relevant} \mid \text{query}, \text{context}) \propto \underbrace{P(\text{query} \mid \text{episode})}_{\text{similarity}} \times \underbrace{P(\text{episode})}_{\text{prior}}$$

Where the prior combines:

$$P(\text{episode}) = f(\text{recency}) \times f(\text{reward}) \times f(\text{region\_match}) \times f(\text{usage\_count})$$

| Factor | Current CVX support | Formula |
| --- | --- | --- |
| Similarity | HNSW distance | $\exp(-d_{\text{cosine}})$ |
| Recency | Timestamp | $\exp(-\lambda \cdot \text{age})$ |
| Reward | reward(node_id) | $r \in [0, 1]$ |
| Region match | assign_region() | $\mathbb{1}[\text{same region}]$ |
| Usage success | Track externally | $\text{wins} / (\text{wins} + \text{uses})$ |
def bayesian_score(similarity, recency, reward, region_match, usage_success,
                   weights=(1.0, 0.3, 0.5, 0.2, 0.4)):
    """Weighted log-posterior for episode relevance.

    recency is the episode's age (now - timestamp); larger means older.
    """
    w = weights
    return (w[0] * similarity +
            w[1] * np.exp(-0.01 * recency) +
            w[2] * reward +
            w[3] * float(region_match) +
            w[4] * usage_success)

The weights can be learned from the online learning loop:

  • For each retrieval that led to success: increase the weights that contributed
  • For each retrieval that led to failure: decrease the weights that contributed

This is a simple logistic regression: features = (similarity, recency, reward, region_match, usage_success), label = game_won.
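A minimal sketch of that weight-learning step, fit by plain gradient descent on the log-loss (the feature matrix here is synthetic; a real loop would log one feature row per retrieval plus the game outcome):

```python
import numpy as np

rng = np.random.default_rng(0)

# One row per retrieval: (similarity, recency, reward, region_match, usage_success)
X = rng.random((200, 5))
# Synthetic labels: games where similarity and reward were high tended to be won
y = (X[:, 0] + X[:, 2] + 0.1 * rng.standard_normal(200) > 1.0).astype(float)

w = np.zeros(5)
b = 0.0
lr = 0.5
for _ in range(500):  # gradient descent on logistic log-loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * (p - y).mean()

# Similarity (w[0]) and reward (w[2]) should dominate the learned weights
print(np.round(w, 2))
```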

  1. CoALA (Sumers et al., TMLR 2024) — Cognitive Architectures for Language Agents. Surveys agent memory through cognitive science lens, discusses Bayesian memory models.
  2. Generative Agents (Park et al., UIST 2023) — Recency + relevance + importance scoring for memory retrieval. Foundational multi-factor scoring model.
  3. MuZero (Schrittwieser et al., Nature 2020) — Learned value function for scoring future states. Validates learning to score state quality.

| Query | Input | Output | Uses |
| --- | --- | --- | --- |
| transition_model(level) | HNSW level | RegionMDP object | Part A |
| best_action(region, task_type) | Current region + context | Ranked action types | Part A |
| typed_neighbors(node, edge_type) | Node + edge filter | Connected nodes | Part B |
| graph_search(vec, k, edge_boost) | Vector + edge preferences | Boosted results | Part B |
| bayesian_search(vec, k, weights) | Vector + scoring weights | Re-scored results | Part C |

All three approaches can be composed:

# (now, region_of, usage_tracker, and task_type are supplied by the agent loop)

# 1. Vector retrieval (existing)
candidates = index.causal_search(query_vec, k=10, temporal_context=5)

# 2. MDP scoring (Part A)
mdp = RegionMDP(index, level=1)
current_region = mdp.assign(query_vec)
action_scores = mdp.best_action(current_region)

# 3. Bayesian re-ranking (Part C)
scored = []
for c in candidates:
    s = bayesian_score(
        similarity=c['score'],
        recency=now - c['timestamp'],
        reward=index.reward(c['node_id']),
        region_match=(region_of[c['node_id']] == current_region),
        usage_success=usage_tracker[c['entity_id']],
    )
    scored.append((c, s))

# 4. Present top results with MDP-informed strategy
scored.sort(key=lambda x: -x[1])
strategy = mdp.strategy_for(current_region, task_type)

| Phase | Component | Effort | Impact | Dependencies |
| --- | --- | --- | --- | --- |
| 1 | RegionMDP (Python) | Low | High | None — uses existing CVX API |
| 2 | Bayesian scoring (Python) | Low | Medium | Usage tracking from online learning |
| 3 | TypedEdgeLayer (Rust) | Medium | High | New module in cvx-index |
| 4 | graph_search with edge boosting | High | High | Part 3 |
| 5 | Weight learning (Python) | Low | Medium | Parts 1-2 + online learning data |
  • RFC-010 (Temporal Graph): Provides the temporal edge infrastructure. Part B extends this with additional edge types.
  • RFC-012 Part C (Agent Memory Gaps): This RFC addresses the “auxiliary structures” question raised in Gap 4 and Part D.
  • RFC-012 Part B (Centering): Bayesian scoring (Part C) benefits from centered embeddings for more meaningful similarity scores.

Part E: Experimental Evidence — Online Learning Dynamics

| Experiment | Approach | 10-round results (qwen:7b) |
| --- | --- | --- |
| E7c | Online learning, all experience added to index | Peak 20% (R3), plateau 15% (R4-10) |
| E7d | Clean memory — only wins in index, fails → decay only | Peak 27% (R3), plateau 17% (R4-10) |
| E8 | Region MDP scoring + online learning | Unstable: 27% (R3), 7-17% (R2, R4-5) |

E7c showed progressive memory degradation:

Round:            1     2     3     4     5     6     7     8     9    10
Success in mem:  92%   86%   81%   77%   73%   70%   67%   64%   62%   59%
Win rate (%):    6.7  10.0  20.0  13.3  16.7  16.7  13.3  10.0  16.7  16.7

Each round adds ~25 fails and ~4 wins. The retrieval index becomes contaminated because failed experiences are semantically similar to new queries (same task types, same observations). HNSW returns them as nearest neighbors regardless of reward.

Excluding fails from the index improved late-round mean (17.1% vs 14.8%) but did not eliminate degradation. The remaining problem: blind reward decay removes useful experts along with useless ones.

An expert trajectory for clean soapbar that happened to be retrieved during a failed cool lettuce game gets penalized — even though the expert was irrelevant to that failure. After several rounds of cross-task decay, good experts lose their reward unfairly.

Current decay: reward *= 0.85 for every expert successor retrieved during a failed game. This is context-blind.

Context-aware decay would consider:

| Factor | Question | Available in CVX? |
| --- | --- | --- |
| Task type match | Was the expert solving the same type of task? | Yes — metadata |
| Action follow-through | Did the agent actually follow the expert’s suggestion? | Yes — compare expert action vs agent action |
| Phase match | Was the expert action at the same stage of the task? | Partial — step index in trajectory |
| Region proximity | Was the expert in the same semantic region? | Yes — region_assignments() |

Part A (Region MDP) provides the framework:

  • P(success | region, action_type) gives a contextual estimate
  • Decay should be proportional to 1 - P(success | expert_region, expert_action_type)
  • Experts in high-success regions should be protected from decay
  • Experts in low-success regions should decay faster
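These rules can be folded into one hedged formula. The sketch below reuses the 0.70-0.95 range from the context_aware_decay code in this RFC; the function name is invented:

```python
def contextual_decay_factor(p_success, floor=0.70, span=0.25):
    """Reward multiplier applied to an expert after a failed game.

    p_success = P(success | expert_region, expert_action_type) from the
    Region MDP. High-success regions are protected (factor near
    floor + span = 0.95); low-success regions decay toward the 0.70 floor.
    """
    return floor + span * p_success

print(contextual_decay_factor(1.0))  # near 0.95: protected expert
print(contextual_decay_factor(0.0))  # near 0.70: fast decay
```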

Part B (Typed Edges) provides attribution:

  • caused_success edges trace which experts contributed to wins
  • caused_failure edges trace which experts were present during fails
  • But: an expert can be present_during_failure without causing_failure
  • The edge type distinction enables selective decay: only decay experts where the agent followed the expert’s action and still failed
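That distinction can be made at attribution time, when the game outcome is known. A minimal sketch: the follow-check is simplified to exact action membership, and the present_during_failure label (for retrieved-but-ignored experts) is a hypothetical addition to the edge types proposed above:

```python
def attribute_outcome(expert_action, agent_actions, game_won):
    """Classify the expert → outcome edge to add after a game.

    Only experts whose suggested action the agent actually executed are
    held responsible for a failure; mere presence is not causation.
    """
    followed = expert_action in agent_actions
    if game_won:
        return "caused_success"
    return "caused_failure" if followed else "present_during_failure"

print(attribute_outcome("take soapbar", ["go to sink", "take soapbar"], False))
# followed and failed -> caused_failure
print(attribute_outcome("heat egg", ["go to sink", "take soapbar"], False))
# ignored -> present_during_failure, exempt from decay
```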

Part C (Bayesian Scoring) provides the scoring model:

  • Instead of hard decay (reward *= 0.85), use a Bayesian update: $P(\text{useful} \mid \text{retrieved in failure}) = \frac{P(\text{failure} \mid \text{useful}) \cdot P(\text{useful})}{P(\text{failure})}$
  • An expert with high prior P(useful) (many past successes) should barely be affected by one failure
  • An expert with weak prior (few observations) should update more
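One concrete instantiation of that update, assuming each expert carries Beta-distributed usefulness counts (pure bookkeeping, not an existing CVX API):

```python
def posterior_usefulness(successes, failures):
    """Mean of Beta(successes + 1, failures + 1): P(useful) given history."""
    return (successes + 1) / (successes + failures + 2)

# Strong prior: 20 past successes; one new failure barely moves the estimate
strong_before = posterior_usefulness(20, 0)  # ~0.955
strong_after = posterior_usefulness(20, 1)   # ~0.913

# Weak prior: a single observation; the same failure moves it much more
weak_before = posterior_usefulness(1, 0)     # ~0.667
weak_after = posterior_usefulness(1, 1)      # 0.5
```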
def context_aware_decay(expert_node, failed_game, index, mdp):
    """Decay expert reward only if contextually relevant to the failure."""
    expert_task_type = task_type_of(expert_node)
    game_task_type = failed_game['task_type']

    # 1. Task type mismatch → no decay (expert was irrelevant)
    if expert_task_type != game_task_type:
        return  # don't penalize cross-task retrievals

    # 2. Did the agent follow the expert?
    expert_action = action_of(expert_node)
    agent_actions = [step['action'] for step in failed_game['history']]
    if expert_action not in agent_actions:
        return  # expert was retrieved but ignored → not responsible

    # 3. Region-conditional decay
    expert_region = region_of(expert_node)
    region_quality = mdp.region_quality(expert_region)

    # Decay proportional to region failure rate:
    #   high-quality region → small decay (expert probably fine, game was unlucky)
    #   low-quality region  → large decay (expert is in a bad area)
    decay_factor = 0.7 + 0.25 * region_quality  # range: 0.7 (bad) to 0.95 (good)
    current_reward = index.reward(expert_node)
    if current_reward is not None:
        index.set_reward(expert_node, current_reward * decay_factor)
| Scenario | Blind decay (current) | Context-aware decay (proposed) |
| --- | --- | --- |
| Expert for clean retrieved during failed cool game | -15% reward | No decay (wrong task type) |
| Expert action retrieved but agent chose differently | -15% reward | No decay (not followed) |
| Expert in high-success region, game failed for other reasons | -15% reward | -5% decay (protected) |
| Expert in low-success region, agent followed and failed | -15% reward | -30% decay (amplified) |

Part F: Integrated Active Memory Architecture


Parts A-E describe individual components. Together, they form a closed-loop active memory system where CVX evolves from a retrieval engine into a learning agent that refines its own memory.

┌─────────────────────────────────────────────────────┐
│                    DECISION TIME                    │
│                                                     │
│ 1. Embed observation → query vector                 │
│ 2. CVX causal_search → k candidate episodes         │
│ 3. Region MDP scores each candidate's action type   │
│    P(success | current_region, action_type)         │
│ 4. Bayesian model re-ranks candidates               │
│    P(useful | similarity, recency, reward,          │
│      region_match, task_match, usage_history)       │
│ 5. Format top candidates as LLM context             │
│ 6. LLM chooses action                               │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│                      EXECUTION                      │
│                                                     │
│ 7. Environment step → observation, reward           │
│ 8. If episode complete → outcome signal             │
└──────────────────────────┬──────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│                    MEMORY UPDATE                    │
│                                                     │
│ 9. Win → add episode to index (reward=1.0)          │
│    Fail → do NOT add to index                       │
│                                                     │
│ 10. Context-aware decay of retrieved experts:       │
│     - Skip if task_type mismatch                    │
│     - Skip if agent didn't follow expert action     │
│     - Decay ∝ (1 - region_quality) if followed+fail │
│                                                     │
│ 11. Update MDP transitions:                         │
│     (region, action_type) → next_region [+reward]   │
│                                                     │
│ 12. Update Bayesian weights:                        │
│     logistic regression on (features → outcome)     │
│                                                     │
│ 13. Add typed edges:                                │
│     Win: candidate → caused_success                 │
│     Fail + followed: candidate → caused_failure     │
│     Fail + ignored: candidate → irrelevant          │
└─────────────────────────────────────────────────────┘
| Step | Part | What it provides | Without it |
| --- | --- | --- | --- |
| 3 | A (MDP) | Contextual action scoring | All actions scored equally |
| 4 | C (Bayesian) | Learned retrieval ranking | Fixed cosine ranking |
| 10 | E (Context decay) | Selective expert refinement | Blind decay destroys good experts |
| 13 | B (Typed edges) | Causal attribution of outcomes | No attribution, no learning |

Self-improving: Each episode refines the memory — wins add knowledge, fails sharpen the decay. The MDP and Bayesian weights update continuously.

Non-destructive: Original expert trajectories are never deleted. Reward decay reduces their retrieval priority but preserves them for potential re-evaluation. Clean memory (E7d) ensures fails never pollute the index.

Compositional: Each part can be added incrementally. Part A (MDP, Python) works without Parts B-C. Parts B-C enhance but don’t require each other.

Model-agnostic: The memory system works with any LLM (qwen:7b, GPT-4o). Stronger models extract more value from the same memory (E5: 43% with GPT-4o vs E7b: 27% with qwen:7b, both from same expert trajectories).

| Step | Evidence | Result |
| --- | --- | --- |
| Wins-only index (9) | E7d vs E7c | Mean 17.1% vs 14.8% in late rounds |
| Context-aware decay (10) | E7e (running) | Pending |
| Strategy templates (5) | E7b vs E7 | Round 3: 26.7% vs 16.7% |
| Online learning (9-12) | E7c Rounds 1-3 | 6.7% → 20% (3× improvement) |
| Frontier model scaling (6) | E5 vs E3 | 43.3% vs 20% (same memory, bigger model) |

  1. Causal Reinforcement Learning (Bareinboim et al., 2022-2024) — Structural causal models for RL agents. Formal framework for causal discovery from trajectory data.
  2. CausalCuriosity (Sontakke et al., ICML 2021) — Agents learn causal structure by active experimentation. Uses trajectory data to discover causal graphs.
  3. Decision Transformer (Chen et al., NeurIPS 2021) — Frames RL as sequence modeling over (state, action, reward) trajectories. Foundational for treating episode memory as sequence prediction.