
RFC-012: Performance, Correctness & Agent Memory Architecture


Building an HNSW index for 1.3M points × D=768 takes ~30min on a single core. bulk_insert is fully sequential — each insertion does a greedy search + neighbor connection.

Two-phase parallel construction with rayon:

  1. Sequential node allocation (fast): assign node IDs and levels, add vectors to storage
  2. Parallel neighbor connection (slow part): partition nodes into chunks, each thread connects neighbors using RwLock on the graph adjacency lists

Expected speedup: ~4-6x on 8 cores. The bottleneck is distance computation during neighbor search (O(ef_construction × D) per node), which is embarrassingly parallel across nodes in the same level.
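The two-phase scheme can be sketched in Python for illustration. This is not the proposed implementation (which would use rayon over the Rust graph); `connect_fn` stands in for the greedy neighbor search, and a single lock stands in for the per-list RwLocks:

```python
# Illustrative two-phase build: sequential allocation, parallel connection.
import threading
from concurrent.futures import ThreadPoolExecutor

def two_phase_build(vectors, connect_fn, num_threads=8):
    # Phase 1 (sequential, fast): assign node IDs and adjacency slots.
    adjacency = {node_id: [] for node_id in range(len(vectors))}
    lock = threading.Lock()  # stand-in for RwLock on adjacency lists

    # Phase 2 (parallel, slow part): each worker connects a chunk of nodes.
    def connect_chunk(chunk):
        for node_id in chunk:
            neighbors = connect_fn(node_id, vectors)  # greedy search, O(ef·D)
            with lock:
                adjacency[node_id].extend(neighbors)
                for n in neighbors:  # bidirectional links
                    adjacency[n].append(node_id)

    chunks = [list(range(i, len(vectors), num_threads))
              for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(connect_chunk, chunks))
    return adjacency
```

The lock contention noted in the table below shows up exactly in the `with lock` block: finer-grained per-list locks reduce it at the cost of deadlock-avoidance care.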

| Approach | Speedup | Complexity | Trade-off |
|---|---|---|---|
| Parallel rayon insertion | 4-6x | Medium | Lock contention on shared neighbors |
| PCA pre-reduction (768→128) | ~6x | Low | Loses precision for anchor projections |
| Scalar quantization during build | ~2x | Low (already implemented) | Approximate distances |
| Bottom-up batch construction | ~10x | High | Requires restructuring the graph builder |
  • rayon already in workspace dependencies
  • ConcurrentTemporalHnsw already uses RwLock — extend to build phase
  • Must maintain insertion order determinism for reproducibility (optional flag)

Part B: Native Embedding Space Centering (Anisotropy Correction)


Modern sentence embedding models (MentalRoBERTa, sentence-transformers, OpenAI, Cohere) produce embeddings that occupy a narrow cone in the high-dimensional space — a phenomenon known as representation anisotropy. All vectors share a dominant component (the “average text” direction), and the discriminative signal is compressed into a small residual.

Empirically observed in CVX with MentalRoBERTa (D=768) on eRisk data:

| Metric | Before centering | After centering |
|---|---|---|
| Depression user → depressed_mood anchor | cosine sim 0.975 | cosine sim 0.42 |
| Control user → depressed_mood anchor | cosine sim 0.964 | cosine sim 0.09 |
| Discriminative gap | 0.011 | 0.33 |

The gap increases 30× after centering. Without centering, anchor projections, drift measurements, and similarity searches all operate on a signal buried under shared bias.
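The effect is easy to reproduce on toy data. The sketch below uses synthetic 4-dimensional vectors with a large shared component, not real MentalRoBERTa embeddings: before centering the two vectors look nearly identical; after subtracting the mean, the residuals dominate.

```python
# Toy demonstration of anisotropy correction via mean-centering.
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def center(vectors):
    d = len(vectors[0])
    mu = [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]
    return [[x - m for x, m in zip(v, mu)] for v in vectors]

# Shared dominant component (the "average text" direction) + small residuals.
u = [10.3, 10.0, 10.0, 10.0]
v = [10.0, 10.3, 10.0, 10.0]

raw_sim = cosine_sim(u, v)            # near 1.0: signal buried under bias
u_c, v_c = center([u, v])
centered_sim = cosine_sim(u_c, v_c)   # residuals point in different directions
```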

The anisotropy problem in contextual embeddings is well-documented:

  1. Ethayarajh (2019) — “How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2”. EMNLP 2019. First systematic measurement showing BERT embeddings are anisotropic — all representations occupy a narrow cone, with average cosine similarity between random sentences > 0.95.

  2. Li et al. (2020) — “On the Sentence Embeddings from Pre-trained Language Models”. EMNLP 2020. Shows that BERT sentence embeddings have a dominant direction that accounts for most of the variance. Proposes BERT-flow (normalizing flow transformation) to correct the distribution.

  3. Su et al. (2021) — “Whitening Sentence Representations for Better Semantics and Faster Retrieval”. ACL 2021. Proposes whitening (centering + rotation to decorrelate dimensions) as a simpler alternative to flow-based correction. Shows that even simple mean-centering significantly improves semantic similarity tasks.

  4. Huang et al. (2021) — “WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach”. EMNLP Findings 2021. Confirms that centering + optional whitening improves STS benchmarks without any fine-tuning, across multiple models.

  5. Rajaee & Pilehvar (2021) — “A Cluster-based Approach for Improving Isotropy in Contextual Embedding Space”. ACL 2021. Analyzes the geometric structure of the anisotropic cone and proposes cluster-based correction.

The consistent finding across all papers: subtracting the mean embedding vector is the single most impactful correction, often recovering 70-90% of the performance gap between anisotropic and isotropic representations.

CVX computes temporal analytics (drift, velocity, changepoints, anchor projections) on embedding trajectories. All of these operations use cosine distance. In an anisotropic space, cosine distances are compressed into a narrow range, causing:

  • Anchor projections (project_to_anchors): All posts equidistant to all anchors
  • Drift measurements (drift, velocity): Signal-to-noise ratio degraded
  • HNSW search (search): Nearest-neighbor quality reduced (many false ties)
  • Changepoint detection (detect_changepoints): Reduced sensitivity to real regime changes
  • Region quality (regions, region_assignments): Regions semantically less meaningful

Centering is a universal fix that benefits all downstream operations regardless of the specific embedding model used.

```rust
// In TemporalHnsw
pub struct TemporalHnsw<D: DistanceMetric> {
    // ... existing fields ...
    centroid: Option<Vec<f32>>, // NEW: global mean for centering
}
```

```python
# Python API
centroid = index.compute_centroid()  # O(N×D) single pass over stored vectors
index.set_centroid(centroid)         # All subsequent operations use centered distances

# Or provide an external centroid (e.g., from a larger corpus)
index.set_centroid(precomputed_centroid)
```

```python
index = cvx.TemporalIndex(m=16, ef_construction=200, centering=True)
index.bulk_insert(entity_ids, timestamps, vectors)
# Centroid computed automatically from inserted vectors
# Stored alongside the index in the .cvx file
```

Option 3: Centering as distance metric wrapper

```rust
pub struct CenteredCosine {
    inner: CosineDistance,
    centroid: Vec<f32>,
}

impl DistanceMetric for CenteredCosine {
    fn distance(&self, a: &[f32], b: &[f32]) -> f32 {
        // Center both vectors, then compute cosine
        let a_c: Vec<f32> = a.iter().zip(&self.centroid).map(|(x, c)| x - c).collect();
        let b_c: Vec<f32> = b.iter().zip(&self.centroid).map(|(x, c)| x - c).collect();
        self.inner.distance(&a_c, &b_c)
    }
}
```

Decision: Option 1 (manual centroid) for the initial implementation:

  • Simplest, no breaking changes
  • compute_centroid(): single O(N×D) pass
  • set_centroid(): stores in struct, serialized with index
  • All functions that compute distances check self.centroid.is_some() and center before computing
  • project_to_anchors centers both the trajectory vectors AND the anchor vectors

This is non-invasive: existing indices without a centroid continue to work unchanged.
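The non-invasive check can be mirrored in Python (hypothetical `CenteredIndex` and `_prepare` names; the Rust code would branch on `self.centroid.is_some()` wherever a distance is computed):

```python
# Sketch: center only when a centroid has been set, so indices without
# a centroid keep their existing raw-vector behavior.
class CenteredIndex:
    def __init__(self):
        self.centroid = None  # mirrors Option<Vec<f32>> in the Rust struct

    def set_centroid(self, centroid):
        self.centroid = list(centroid)

    def _prepare(self, vec):
        # Older indices (centroid is None) are untouched.
        if self.centroid is None:
            return vec
        return [x - c for x, c in zip(vec, self.centroid)]
```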

| Function | Current behavior | With centering |
|---|---|---|
| project_to_anchors | cosine(raw_vec, raw_anchor) | cosine(vec - μ, anchor - μ) |
| drift | cosine(raw_v1, raw_v2) | cosine(v1 - μ, v2 - μ) |
| velocity | Δ(raw vectors) / Δt | Δ(centered vectors) / Δt |
| search | kNN on raw space | kNN on centered space |
| detect_changepoints | PELT on raw distances | PELT on centered distances |
| region_assignments | assign_region on raw vectors | assign_region on centered vectors |

Full whitening (centering + rotation by inverse covariance) would further decorrelate dimensions, but requires computing and storing a D×D matrix. For D=768, this is 589,824 floats (~2.3 MB as f32). The marginal improvement over centering alone is typically small (Su et al. 2021 report ~2-5% STS improvement from whitening vs centering-only).

Recommendation: Implement centering first. Add whitening as optional enhancement later if benchmarks show meaningful improvement on CVX-specific tasks.

| Phase | Work | Complexity |
|---|---|---|
| 1 | compute_centroid() + set_centroid() + serialize in snapshot | Low |
| 2 | Center vectors in project_to_anchors, drift, velocity | Low |
| 3 | Center in search and assign_region (affects HNSW traversal) | Medium |
| 4 | Auto-centering mode in bulk_insert | Low |
| 5 | Python bindings + notebook validation | Low |
| 6 | Optional whitening (compute_whitening_transform()) | Medium |

Part C: Architecture Review — Gaps & Refactoring Priorities


An architecture audit was conducted evaluating CVX as a tool for AI agent long-term memory — specifically for storing and retrieving successful action sequences dependent on context. This section documents the findings.

CVX already supports episodic memory via episode_encoding.rs:

```rust
// entity_id = (episode_id << 16) | step_index
// Max 281 trillion episodes × 65535 steps each
encode_entity_id(episode_id, step_index) -> u64
decode_entity_id(entity_id) -> (episode_id, step_index)
episode_range(episode_id) -> (start_id, end_id)
```

Validated in notebooks E1–E4:

| Experiment | Task | Baseline | CVX Memory | Improvement |
|---|---|---|---|---|
| E1 (code gen) | MBPP → HumanEval | 77.8% pass@1 | | Episodic retrieval works |
| E3 (ALFWorld) | Interactive RL | 3.3% completion | 20.0% completion | 6× with causal retrieval |
| E4 (debugging) | APPS retries | 28.0% | 31.0% | +3 rescued problems |

CVX stores vectors but has no concept of success or failure. An agent searching “what did I do in similar states?” retrieves ALL experiences without distinguishing successful from failed ones.

Impact: Retrieval noise — failed strategies pollute the result set.

Proposed extension:

```rust
// New field in TemporalPoint or as indexed metadata
pub struct OutcomeAnnotation {
    reward: f32,                      // Continuous reward signal
    success: bool,                    // Binary outcome
    outcome_vector: Option<Vec<f32>>, // Optional: embedding of the final state
}
```

Python API:

```python
index.insert(entity_id, timestamp, vector, reward=1.0)
results = index.search(query, k=5, min_reward=0.5)  # Only successful experiences
```

Complexity: Low. Reward is a float stored alongside the vector; filtered via bitmap like temporal filtering.
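Until rewards are stored natively, the same behavior can be approximated in user code as a post-filter. This is a hypothetical helper: `search_fn` and `rewards` are stand-ins for the existing search call and a user-maintained reward map, and the 4× over-fetch factor mirrors the current metadata post-filtering path:

```python
# Reward-filtered retrieval as a post-filter (illustrative workaround;
# the proposal would pre-filter with a bitmap, like temporal filtering).
def search_min_reward(search_fn, rewards, query, k, min_reward):
    # Over-fetch, then keep only experiences that met the reward threshold.
    candidates = search_fn(query, k=4 * k)
    kept = [(eid, score) for eid, score in candidates
            if rewards.get(eid, 0.0) >= min_reward]
    return kept[:k]
```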

The most valuable pattern for agents: “given a similar state, what steps came AFTER?”. TemporalGraphIndex (RFC-010) implements this with predecessor/successor edges, but:

  • ConcurrentTemporalHnsw wraps TemporalHnsw, NOT TemporalGraphIndex
  • causal_search is not available in the Python API
  • The temporal edge layer is invisible to end users

Impact: The primary agent memory pattern requires manual multi-step reconstruction in Python instead of a single native call.
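Concretely, the reconstruction an agent must do today looks something like this sketch, where `index.search` and `index.vector` stand in for the existing Python bindings and `_encode`/`_decode` reimplement the episode encoding locally:

```python
# Manual "what came AFTER?" reconstruction, step by step in Python.
def _decode(entity_id):
    return entity_id >> 16, entity_id & 0xFFFF

def _encode(episode_id, step_index):
    return (episode_id << 16) | step_index

def manual_causal_search(index, query, k, continuation_steps):
    results = []
    for entity_id, ts, score in index.search(query, k=k):
        episode_id, step = _decode(entity_id)
        continuation = []
        for s in range(step + 1, step + 1 + continuation_steps):
            next_id = _encode(episode_id, s)
            vec = index.vector(next_id)  # assume None when the step doesn't exist
            if vec is None:
                break
            continuation.append((next_id, vec))
        results.append({"match": (entity_id, ts, score),
                        "continuation": continuation})
    return results
```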

Proposed fix: Restructure the wrapper chain:

```
Current:  ConcurrentTemporalHnsw<D> → RwLock<TemporalHnsw<D>>
Proposed: ConcurrentTemporalHnsw<D> → RwLock<TemporalGraphIndex<D>>
                                          ├── inner: TemporalHnsw<D>
                                          └── edges: TemporalEdgeLayer
```

Expose in Python:

```python
results = index.causal_search(
    query=embedding,
    k=5,
    continuation_steps=10,  # Return next 10 steps from each match
)
# results[i] = {
#     "match": (entity_id, timestamp, score),
#     "continuation": [(entity_id, timestamp, vector), ...],
# }
```

Complexity: Medium. The data structures exist; needs wiring.

An agent needs: “in situations similar to X when the goal was Y”. Currently, context is mixed into the embedding — there’s no way to filter by goal, task type, or environment state.

Metadata exists (HashMap<String, String>) but is post-filtered (over-fetch 4k candidates → filter → take k). Not indexed.

Proposed extension: Inverted index on metadata keys:

```rust
pub struct IndexedMetadata {
    // key → value → RoaringBitmap of matching node_ids
    indices: HashMap<String, HashMap<String, RoaringBitmap>>,
}
```

This allows O(1) membership checks during HNSW traversal, same as temporal filtering. Pre-filter instead of post-filter.
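A minimal sketch of the structure, with Python sets standing in for RoaringBitmap (class and method names are illustrative):

```python
# Inverted index on metadata: key → value → set of matching node_ids.
class IndexedMetadata:
    def __init__(self):
        self.indices = {}

    def insert(self, node_id, metadata):
        for key, value in metadata.items():
            self.indices.setdefault(key, {}).setdefault(value, set()).add(node_id)

    def matching(self, filters):
        # Intersection of per-(key, value) bitmaps; membership in the result
        # is what HNSW traversal would check per candidate (pre-filter).
        sets = [self.indices.get(k, {}).get(v, set())
                for k, v in filters.items()]
        return set.intersection(*sets) if sets else set()
```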

Python API:

```python
index.insert(entity_id, ts, vec, metadata={"goal": "clean", "room": "kitchen"})
results = index.search(query, k=5, metadata={"goal": "clean"})  # Pre-filtered
```

Complexity: Medium. Mirrors the temporal bitmap pattern.

Gap 4: Memory consolidation — deferred to roadmap


Biological memory consolidates repeated experiences into prototypes. At scale (10M+ episodes), unconsolidated accumulation may degrade retrieval quality through noise.

However, consolidation introduces serious risks:

  • Destroys episodic structure: A centroid of 10 episodes has no predecessor/successor edges — causal search breaks on prototypes
  • Loses variance: Edge cases (often the most informative) are averaged away
  • Consistency is hard: Updating prototypes when source episodes change or are re-evaluated requires complex invalidation policies

Decision: Defer consolidation. For current and near-term scale (1-10M episodes), improving retrieval quality (centering, metadata filtering, outcome weighting) is more impactful and less risky than consolidation.

See Part D for a future design (complementary prototypes with tiered fidelity) when scale demands it.

More recent experiences should be more accessible by default. CVX has time_decay_weight in temporal_edges.rs but doesn’t use it in general search scoring.

Proposed extension: Optional recency factor in composite distance:

d_final = α·d_semantic + β·d_temporal + γ·recency_penalty(age)

Where recency_penalty = 1 - exp(-λ·age) (older = higher penalty).

Complexity: Low. Adds one term to the scoring function.
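As a sketch, with α, β, γ, λ as free tuning parameters (names chosen here, not existing CVX API):

```python
# Composite distance with an optional recency penalty term.
import math

def composite_distance(d_semantic, d_temporal, age,
                       alpha=1.0, beta=0.0, gamma=0.5, lam=0.1):
    recency_penalty = 1.0 - math.exp(-lam * age)  # older → penalty closer to 1
    return alpha * d_semantic + beta * d_temporal + gamma * recency_penalty
```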

  1. Composition over inheritance: Each layer (HnswGraph → TemporalHnsw → TemporalGraphIndex → ConcurrentWrapper) adds responsibility without modifying the inner layer. Idiomatic Rust decorator pattern.

  2. SmallVec for neighbor lists: SmallVec<[u32; 16]> keeps neighbor lists inline (no heap allocation) for the default M=16.

  3. RoaringBitmap temporal filtering: Sub-byte per vector, O(1) membership check. Excellent for 1M+ scale.

  4. Postcard serialization: Compact binary format with separate snapshot structs — serialization logic doesn’t pollute domain structs.

  5. Trait-based polymorphism: DistanceMetric, TemporalIndexAccess, StorageBackend enable real loose coupling and testability.

  1. TemporalIndexAccess is a god trait: 12 methods with empty defaults. Violates Interface Segregation. Should split into:

    • TemporalSearch (search_raw, search_with_metadata)
    • TrajectoryAccess (trajectory, vector, entity_id, timestamp)
    • RegionAccess (regions, region_members, region_assignments, region_trajectory)
  2. Python API bypasses query engine: cvx-python calls cvx-index and cvx-analytics directly, not through cvx-query. Features must be exposed twice. The query engine’s TemporalQuery enum (15 query types) is richer than what Python exposes.

  3. No snapshot versioning: A struct field change silently breaks deserialization. Needs version: u32 in TemporalSnapshot.

  4. Composite distance scale mismatch: α·d_semantic + (1-α)·d_temporal assumes comparable scales, but cosine ∈ [0,2] vs temporal ∈ [0,1]. With α=0.5, semantic has 2× the effective weight.

  5. Entity ID is untyped: A user, document, and episode are all u64. No way to distinguish entity types at the index level.
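For the composite distance scale mismatch (item 4), the simplest fix is to rescale cosine distance into [0, 1] before mixing, so α weights the two terms as intended. A sketch (function name is illustrative):

```python
# Normalize cosine distance (range [0, 2]) to [0, 1] before blending
# with the temporal distance (already in [0, 1]).
def composite(d_cosine, d_temporal, alpha=0.5):
    d_semantic = d_cosine / 2.0
    return alpha * d_semantic + (1 - alpha) * d_temporal
```

With this normalization, α=0.5 gives the two components equal effective weight, instead of semantic carrying 2× the weight.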

| # | Refactoring | Impact | Effort | Benefit for AI agents |
|---|---|---|---|---|
| 1 | Native centering (Part B) | 30× discrimination | Low | More precise memory retrieval |
| 2 | Expose causal_search in Python | Enables continuation pattern | Medium | Primary agent memory pattern |
| 3 | Indexed metadata filtering | Context-dependent retrieval | Medium | "Similar state + same goal" |
| 4 | Outcome-aware search | Filter by success/reward | Low | Only retrieve what worked |
| 5 | Snapshot versioning | Robustness | Low | Avoid silent data corruption |
| 6 | Trait segregation | Maintainability | Medium | Cleaner extension points |
| 7 | Recency-weighted search | Temporal relevance | Low | Prefer recent experiences |
| 8 | Distance scale normalization | Correctness | Low | Balanced semantic/temporal weighting |
| 9 | Parallel HNSW build (Part A) | 4-6× build speedup | Medium | 30min→5min for researchers |
| 10 | Procrustes model alignment | Cross-model robustness | Medium | Preserve memory across model changes |

Memory consolidation has been deferred to roadmap (see Part D). At current scale, improving retrieval quality through items 1-4 is higher impact and lower risk than lossy consolidation.

  • RFC-010 (Temporal Graph Index): Provides the causal_search infrastructure. Gap 2 is about exposing it, not reimplementing it.
  • RFC-004 (Semantic Regions): region_assignments provides the clustering infrastructure that future consolidation would build on.
  • RFC-005 (Region Members): Temporal filtering on regions enables time-scoped analytics and future tiered consolidation.

D.1 Memory Consolidation via Tiered Fidelity


At scale (10M+ episodes), unconsolidated accumulation increases retrieval noise. But naive consolidation (replacing episodes with centroids) destroys episodic structure — a centroid has no steps, no predecessor/successor edges, no causal continuation capability.

Design: Complementary Prototypes (not substitutive)


Prototypes complement episodes, they never replace them. A prototype is an additional node in the HNSW with metadata linking to its source episodes.

```
COLD = original episodes (append-only, immutable ground truth)
WARM = original episodes + derived prototypes (marked as synthetic)
HOT  = recent episodes + most-consulted prototypes
```

Key principles:

  1. Cold is immutable: Original data is never modified or deleted. This is the source of truth for re-derivation if consolidation introduces artifacts.

  2. Prototypes are traceable: Each prototype stores {type: "prototype", source_episodes: [A, B, C], n_sources: 10}. If retrieval returns a prototype and the agent needs more detail, it follows the links to the source episodes in cold.

  3. Fidelity degrades gracefully: Hot (fast, possibly consolidated) → Warm (moderate, mixed) → Cold (slow, always original). An agent can “zoom in” from a prototype match to the actual episodes.

  4. Consistency policy: Prototypes are invalidated when their source episodes’ metadata changes (e.g., reward updated). Re-derivation is triggered lazily on next access or eagerly via background compaction.
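Principles 2 and 4 can be sketched as a small bookkeeping structure. Everything here is hypothetical (`PrototypeStore`, version stamps) and stands in for whatever invalidation policy is eventually chosen:

```python
# Prototypes link to source episodes and record the metadata version they
# were derived from; a source update invalidates the prototype.
class PrototypeStore:
    def __init__(self):
        self.episode_version = {}  # episode_id → metadata version counter
        self.prototypes = {}       # proto_id → {"sources": [...], "versions": {...}}

    def derive(self, proto_id, source_episodes):
        self.prototypes[proto_id] = {
            "sources": list(source_episodes),
            "versions": {e: self.episode_version.get(e, 0)
                         for e in source_episodes},
        }

    def touch(self, episode_id):
        # e.g. reward re-evaluated on a source episode
        self.episode_version[episode_id] = self.episode_version.get(episode_id, 0) + 1

    def is_valid(self, proto_id):
        p = self.prototypes[proto_id]
        return all(self.episode_version.get(e, 0) == v
                   for e, v in p["versions"].items())
```

An invalid prototype would then be re-derived lazily on next access, or eagerly by background compaction, per the consistency policy above.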

  • Current scale (1-10M episodes) doesn’t require consolidation
  • HNSW search is O(log N) — even 10M is fast
  • Retrieval quality improvements (centering, metadata filtering, outcome weighting) have higher impact at current scale
  • The consolidation algorithm itself is a research question: what to consolidate, when, how to preserve episodic structure in prototypes
  • Gap 1 (outcome awareness): Need reward annotations to know which episodes are worth consolidating
  • Gap 3 (metadata indexing): Need to mark prototypes as synthetic and link to sources
  • Tiered storage wiring: Cold tier PQ code exists but is not connected

Should CVX incorporate structures beyond HNSW — specifically Bayesian networks, knowledge graphs, or causal DAGs?

  1. Causal relationships: “action A caused outcome B” is a directed edge, not a distance
  2. Conditional dependencies: “strategy X works IF condition Y holds” requires structured inference, not similarity search
  3. Compositional knowledge: “tool A is-a instrument” — taxonomic relations are discrete and transitive
  4. Probabilistic reasoning: P(state | observations) requires belief propagation

| Structure | What it adds | Integration point | Use case |
|---|---|---|---|
| Knowledge graph | Typed entities + relations | Indexed metadata | Compositional planning |
| Bayesian network | Conditional probabilities | Region transitions as CPTs | Decision under uncertainty |
| Causal DAG | Directed cause-effect | Granger causality → edges | Counterfactual reasoning |

Defer to a future RFC. The gaps in Part C (outcome awareness, context filtering, causal continuation) are prerequisites. Auxiliary structures become valuable only after primary retrieval is reliable and context-aware.

If/when needed, the most natural integrations are:

  1. Knowledge graph as metadata index: Entity relations stored as indexed metadata, leveraging Gap 3’s infrastructure
  2. Bayesian network as post-retrieval scorer: Lightweight BN scoring P(success | region, context) over HNSW candidates
  3. Causal DAG from Granger tests: Materialize Granger causality results (already computed) as persistent directed graph

These would be companion crates (cvx-graph, cvx-bayes), not modifications to the core index.

D.3 Documentation Debt — Architecture vs Implementation


An audit identified significant gaps between architecture documentation and actual implementation. The following components are documented as features but have no implementation:

| Component | Docs status | Implementation | Action |
|---|---|---|---|
| Data Virtualization | 10+ sections | 0% | Move to roadmap “Production Ingestion” |
| Distributed Deployment | Full architecture | 0% | Move to roadmap “Phase 5+” |
| Observability (Prometheus/OTLP) | Detailed | Only tracing crate | Mark as planned |
| Temporal ML (Burn/Torch backends) | 3 backends | Only AnalyticBackend | Mark differentiable as future |
| Multi-Scale Alignment | 4 methods | Only resample() | Keep Procrustes, remove rest |
| Interpretability | 6 artifacts | Only drift attribution | Document what exists |
| gRPC QueryStream | Documented | IngestStream + WatchDrift only | Sufficient for now |
| Cold Storage | PQ codebook | Code exists, not wired | Wire when scale demands |

Conversely, the following implemented features are poorly documented or absent from architecture pages:

| Feature | Implementation | Current docs | Action |
|---|---|---|---|
| region_assignments() O(N) | Complete in temporal.rs + Python | Only in examples overview | Add to temporal-index.md |
| Episodic memory data model | episode_encoding.rs, E1-E4 validated | Only in research section | Add to data-model.md and analytics-engine.md |
| Anchor projection pipeline | anchor.rs, anchor_index.rs, Python project_to_anchors() | Only in RFC-006 | Add to analytics-engine.md |
| Centering / anisotropy correction | Manual in notebooks, native planned (RFC-012 Part B) | Only in RFC-012 | Add to analytics-engine.md when implemented |
| Metadata filtering | MetadataStore, MetadataFilter, search_with_metadata() | Not in architecture | Add to temporal-index.md |
| Temporal edges / causal search | TemporalEdgeLayer, TemporalGraphIndex (RFC-010) | Not in temporal-index.md | Add temporal edges section |
| region_trajectory() EMA smoothing | Complete in temporal.rs + Python | Not documented | Add to temporal-index.md |
| Scalar quantization | enable_quantization() / disable_quantization() | Mentioned in RFC-002 only | Add to temporal-index.md |

Action plan:

  1. Update intro/vision with actual state (done)
  2. Mark unimplemented architecture sections with badges (done)
  3. Update temporal-index.md: add temporal edges, metadata, regions, SQ
  4. Update analytics-engine.md: add anchor projection, episodic encoding
  5. Update data-model.md: add episode encoding scheme, metadata model
  6. Move enterprise features (distributed, data virtualization) to roadmap