
RFC-012: Performance, Correctness & Agent Memory Architecture


Building an HNSW index for 1.3M points × D=768 takes ~30min on a single core. bulk_insert is fully sequential — each insertion does a greedy search + neighbor connection.

Two-phase parallel construction with rayon:

  1. Sequential node allocation (fast): assign node IDs and levels, add vectors to storage
  2. Parallel neighbor connection (slow part): partition nodes into chunks, each thread connects neighbors using RwLock on the graph adjacency lists

Expected speedup: ~4-6x on 8 cores. The bottleneck is distance computation during neighbor search (O(ef_construction × D) per node), which is embarrassingly parallel across nodes in the same level.
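The two-phase scheme can be sketched in Python for illustration. This is not the proposed implementation (which would use rayon over the Rust graph); `connect_fn` stands in for the greedy neighbor search, and a single lock stands in for the per-list RwLocks:

```python
# Illustrative two-phase build: sequential allocation, parallel connection.
import threading
from concurrent.futures import ThreadPoolExecutor

def two_phase_build(vectors, connect_fn, num_threads=8):
    # Phase 1 (sequential, fast): assign node IDs and adjacency slots.
    adjacency = {node_id: [] for node_id in range(len(vectors))}
    lock = threading.Lock()  # stand-in for RwLock on adjacency lists

    # Phase 2 (parallel, slow part): each worker connects a chunk of nodes.
    def connect_chunk(chunk):
        for node_id in chunk:
            neighbors = connect_fn(node_id, vectors)  # greedy search, O(ef·D)
            with lock:
                adjacency[node_id].extend(neighbors)
                for n in neighbors:  # bidirectional links
                    adjacency[n].append(node_id)

    chunks = [list(range(i, len(vectors), num_threads))
              for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(connect_chunk, chunks))
    return adjacency
```

The lock contention noted in the table below shows up exactly in the `with lock` block: finer-grained per-list locks reduce it at the cost of deadlock-avoidance care.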

| Approach | Speedup | Complexity | Trade-off |
|---|---|---|---|
| Parallel rayon insertion | 4-6x | Medium | Lock contention on shared neighbors |
| PCA pre-reduction (768→128) | ~6x | Low | Loses precision for anchor projections |
| Scalar quantization during build | ~2x | Low (already implemented) | Approximate distances |
| Bottom-up batch construction | ~10x | High | Requires restructuring the graph builder |
  • rayon already in workspace dependencies
  • ConcurrentTemporalHnsw already uses RwLock — extend to build phase
  • Must maintain insertion order determinism for reproducibility (optional flag)

Part B: Native Embedding Space Centering (Anisotropy Correction)


Modern sentence embedding models (MentalRoBERTa, sentence-transformers, OpenAI, Cohere) produce embeddings that occupy a narrow cone in the high-dimensional space — a phenomenon known as representation anisotropy. All vectors share a dominant component (the “average text” direction), and the discriminative signal is compressed into a small residual.

Empirically observed in CVX with MentalRoBERTa (D=768) on eRisk data:

| Metric | Before centering | After centering |
|---|---|---|
| Depression user → depressed_mood anchor | cosine sim 0.975 | cosine sim 0.42 |
| Control user → depressed_mood anchor | cosine sim 0.964 | cosine sim 0.09 |
| Discriminative gap | 0.011 | 0.33 |

The gap increases 30× after centering. Without centering, anchor projections, drift measurements, and similarity searches all operate on a signal buried under shared bias.
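The effect is easy to reproduce on toy data. The sketch below uses synthetic 4-dimensional vectors with a large shared component, not real MentalRoBERTa embeddings: before centering the two vectors look nearly identical; after subtracting the mean, the residuals dominate.

```python
# Toy demonstration of anisotropy correction via mean-centering.
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def center(vectors):
    d = len(vectors[0])
    mu = [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]
    return [[x - m for x, m in zip(v, mu)] for v in vectors]

# Shared dominant component (the "average text" direction) + small residuals.
u = [10.3, 10.0, 10.0, 10.0]
v = [10.0, 10.3, 10.0, 10.0]

raw_sim = cosine_sim(u, v)            # near 1.0: signal buried under bias
u_c, v_c = center([u, v])
centered_sim = cosine_sim(u_c, v_c)   # residuals point in different directions
```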

The anisotropy problem in contextual embeddings is well-documented:

  1. Ethayarajh (2019) — “How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2”. EMNLP 2019. First systematic measurement showing BERT embeddings are anisotropic — all representations occupy a narrow cone, with average cosine similarity between random sentences > 0.95.

  2. Li et al. (2020) — “On the Sentence Embeddings from Pre-trained Language Models”. EMNLP 2020. Shows that BERT sentence embeddings have a dominant direction that accounts for most of the variance. Proposes BERT-flow (normalizing flow transformation) to correct the distribution.

  3. Su et al. (2021) — “Whitening Sentence Representations for Better Semantics and Faster Retrieval”. ACL 2021. Proposes whitening (centering + rotation to decorrelate dimensions) as a simpler alternative to flow-based correction. Shows that even simple mean-centering significantly improves semantic similarity tasks.

  4. Huang et al. (2021) — “WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach”. EMNLP Findings 2021. Confirms that centering + optional whitening improves STS benchmarks without any fine-tuning, across multiple models.

  5. Rajaee & Pilehvar (2021) — “A Cluster-based Approach for Improving Isotropy in Contextual Embedding Space”. ACL 2021. Analyzes the geometric structure of the anisotropic cone and proposes cluster-based correction.

The consistent finding across all papers: subtracting the mean embedding vector is the single most impactful correction, often recovering 70-90% of the performance gap between anisotropic and isotropic representations.

CVX computes temporal analytics (drift, velocity, changepoints, anchor projections) on embedding trajectories. All of these operations use cosine distance. In an anisotropic space, cosine distances are compressed into a narrow range, causing:

  • Anchor projections (project_to_anchors): All posts equidistant to all anchors
  • Drift measurements (drift, velocity): Signal-to-noise ratio degraded
  • HNSW search (search): Nearest-neighbor quality reduced (many false ties)
  • Changepoint detection (detect_changepoints): Reduced sensitivity to real regime changes
  • Region quality (regions, region_assignments): Regions semantically less meaningful

Centering is a universal fix that benefits all downstream operations regardless of the specific embedding model used.

```rust
// In TemporalHnsw
pub struct TemporalHnsw<D: DistanceMetric> {
    // ... existing fields ...
    centroid: Option<Vec<f32>>, // NEW: global mean for centering
}
```

```python
# Python API
centroid = index.compute_centroid()  # O(N×D) single pass over stored vectors
index.set_centroid(centroid)         # All subsequent operations use centered distances

# Or provide an external centroid (e.g., from a larger corpus)
index.set_centroid(precomputed_centroid)
```

```python
index = cvx.TemporalIndex(m=16, ef_construction=200, centering=True)
index.bulk_insert(entity_ids, timestamps, vectors)
# Centroid computed automatically from inserted vectors
# Stored alongside the index in the .cvx file
```

Option 3: Centering as distance metric wrapper

```rust
pub struct CenteredCosine {
    inner: CosineDistance,
    centroid: Vec<f32>,
}

impl DistanceMetric for CenteredCosine {
    fn distance(&self, a: &[f32], b: &[f32]) -> f32 {
        // Center both vectors, then compute cosine
        let a_c: Vec<f32> = a.iter().zip(&self.centroid).map(|(x, c)| x - c).collect();
        let b_c: Vec<f32> = b.iter().zip(&self.centroid).map(|(x, c)| x - c).collect();
        self.inner.distance(&a_c, &b_c)
    }
}
```

Decision: Option 1 (manual centroid) for the initial implementation:

  • Simplest, no breaking changes
  • compute_centroid(): single O(N×D) pass
  • set_centroid(): stores in struct, serialized with index
  • All functions that compute distances check self.centroid.is_some() and center before computing
  • project_to_anchors centers both the trajectory vectors AND the anchor vectors

This is non-invasive: existing indices without a centroid continue to work unchanged.
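The non-invasive check can be mirrored in Python (hypothetical `CenteredIndex` and `_prepare` names; the Rust code would branch on `self.centroid.is_some()` wherever a distance is computed):

```python
# Sketch: center only when a centroid has been set, so indices without
# a centroid keep their existing raw-vector behavior.
class CenteredIndex:
    def __init__(self):
        self.centroid = None  # mirrors Option<Vec<f32>> in the Rust struct

    def set_centroid(self, centroid):
        self.centroid = list(centroid)

    def _prepare(self, vec):
        # Older indices (centroid is None) are untouched.
        if self.centroid is None:
            return vec
        return [x - c for x, c in zip(vec, self.centroid)]
```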

| Function | Current behavior | With centering |
|---|---|---|
| project_to_anchors | cosine(raw_vec, raw_anchor) | cosine(vec - μ, anchor - μ) |
| drift | cosine(raw_v1, raw_v2) | cosine(v1 - μ, v2 - μ) |
| velocity | Δ(raw vectors) / Δt | Δ(centered vectors) / Δt |
| search | kNN on raw space | kNN on centered space |
| detect_changepoints | PELT on raw distances | PELT on centered distances |
| region_assignments | assign_region on raw vectors | assign_region on centered vectors |

Full whitening (centering + rotation by inverse covariance) would further decorrelate dimensions, but requires computing and storing a D×D matrix. For D=768, this is 589,824 floats (~2.3 MB as f32). The marginal improvement over centering alone is typically small (Su et al. 2021 report ~2-5% STS improvement from whitening vs centering-only).

Recommendation: Implement centering first. Add whitening as optional enhancement later if benchmarks show meaningful improvement on CVX-specific tasks.

| Phase | Work | Complexity |
|---|---|---|
| 1 | compute_centroid() + set_centroid() + serialize in snapshot | Low |
| 2 | Center vectors in project_to_anchors, drift, velocity | Low |
| 3 | Center in search and assign_region (affects HNSW traversal) | Medium |
| 4 | Auto-centering mode in bulk_insert | Low |
| 5 | Python bindings + notebook validation | Low |
| 6 | Optional whitening (compute_whitening_transform()) | Medium |

Part C: Architecture Review — Gaps & Refactoring Priorities


An architecture audit was conducted evaluating CVX as a tool for AI agent long-term memory — specifically for storing and retrieving successful action sequences dependent on context. This section documents the findings.

CVX already supports episodic memory via episode_encoding.rs:

```rust
// entity_id = (episode_id << 16) | step_index
// Max 281 trillion episodes × 65535 steps each
encode_entity_id(episode_id, step_index) -> u64
decode_entity_id(entity_id) -> (episode_id, step_index)
episode_range(episode_id) -> (start_id, end_id)
```

Validated in notebooks E1–E4:

| Experiment | Task | Baseline | CVX Memory | Improvement |
|---|---|---|---|---|
| E1 (code gen) | MBPP → HumanEval | 77.8% pass@1 | | Episodic retrieval works |
| E3 (ALFWorld) | Interactive RL | 3.3% completion | 20.0% completion | 6× with causal retrieval |
| E4 (debugging) | APPS retries | 28.0% | 31.0% | +3 rescued problems |

CVX stores vectors but has no concept of success or failure. An agent searching “what did I do in similar states?” retrieves ALL experiences without distinguishing successful from failed ones.

Impact: Retrieval noise — failed strategies pollute the result set.

Proposed extension:

```rust
// New field in TemporalPoint or as indexed metadata
pub struct OutcomeAnnotation {
    reward: f32,                      // Continuous reward signal
    success: bool,                    // Binary outcome
    outcome_vector: Option<Vec<f32>>, // Optional: embedding of the final state
}
```

Python API:

```python
index.insert(entity_id, timestamp, vector, reward=1.0)
results = index.search(query, k=5, min_reward=0.5)  # Only successful experiences
```

Complexity: Low. Reward is a float stored alongside the vector; filtered via bitmap like temporal filtering.
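Until rewards are stored natively, the same behavior can be approximated in user code as a post-filter. This is a hypothetical helper: `search_fn` and `rewards` are stand-ins for the existing search call and a user-maintained reward map, and the 4× over-fetch factor mirrors the current metadata post-filtering path:

```python
# Reward-filtered retrieval as a post-filter (illustrative workaround;
# the proposal would pre-filter with a bitmap, like temporal filtering).
def search_min_reward(search_fn, rewards, query, k, min_reward):
    # Over-fetch, then keep only experiences that met the reward threshold.
    candidates = search_fn(query, k=4 * k)
    kept = [(eid, score) for eid, score in candidates
            if rewards.get(eid, 0.0) >= min_reward]
    return kept[:k]
```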

The most valuable pattern for agents: “given a similar state, what steps came AFTER?”. TemporalGraphIndex (RFC-010) implements this with predecessor/successor edges, but:

  • ConcurrentTemporalHnsw wraps TemporalHnsw, NOT TemporalGraphIndex
  • causal_search is not available in the Python API
  • The temporal edge layer is invisible to end users

Impact: The primary agent memory pattern requires manual multi-step reconstruction in Python instead of a single native call.
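Concretely, the reconstruction an agent must do today looks something like this sketch, where `index.search` and `index.vector` stand in for the existing Python bindings and `_encode`/`_decode` reimplement the episode encoding locally:

```python
# Manual "what came AFTER?" reconstruction, step by step in Python.
def _decode(entity_id):
    return entity_id >> 16, entity_id & 0xFFFF

def _encode(episode_id, step_index):
    return (episode_id << 16) | step_index

def manual_causal_search(index, query, k, continuation_steps):
    results = []
    for entity_id, ts, score in index.search(query, k=k):
        episode_id, step = _decode(entity_id)
        continuation = []
        for s in range(step + 1, step + 1 + continuation_steps):
            next_id = _encode(episode_id, s)
            vec = index.vector(next_id)  # assume None when the step doesn't exist
            if vec is None:
                break
            continuation.append((next_id, vec))
        results.append({"match": (entity_id, ts, score),
                        "continuation": continuation})
    return results
```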

Proposed fix: Restructure the wrapper chain:

```
Current:  ConcurrentTemporalHnsw<D> → RwLock<TemporalHnsw<D>>
Proposed: ConcurrentTemporalHnsw<D> → RwLock<TemporalGraphIndex<D>>
                                          ├── inner: TemporalHnsw<D>
                                          └── edges: TemporalEdgeLayer
```

Expose in Python:

```python
results = index.causal_search(
    query=embedding,
    k=5,
    continuation_steps=10,  # Return next 10 steps from each match
)
# results[i] = {
#     "match": (entity_id, timestamp, score),
#     "continuation": [(entity_id, timestamp, vector), ...],
# }
```

Complexity: Medium. The data structures exist; needs wiring.

An agent needs: “in situations similar to X when the goal was Y”. Currently, context is mixed into the embedding — there’s no way to filter by goal, task type, or environment state.

Metadata exists (HashMap<String, String>) but is post-filtered (over-fetch 4k candidates → filter → take k). Not indexed.

Proposed extension: Inverted index on metadata keys:

```rust
pub struct IndexedMetadata {
    // key → value → RoaringBitmap of matching node_ids
    indices: HashMap<String, HashMap<String, RoaringBitmap>>,
}
```

This allows O(1) membership checks during HNSW traversal, same as temporal filtering. Pre-filter instead of post-filter.
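A minimal sketch of the structure, with Python sets standing in for RoaringBitmap (class and method names are illustrative):

```python
# Inverted index on metadata: key → value → set of matching node_ids.
class IndexedMetadata:
    def __init__(self):
        self.indices = {}

    def insert(self, node_id, metadata):
        for key, value in metadata.items():
            self.indices.setdefault(key, {}).setdefault(value, set()).add(node_id)

    def matching(self, filters):
        # Intersection of per-(key, value) bitmaps; membership in the result
        # is what HNSW traversal would check per candidate (pre-filter).
        sets = [self.indices.get(k, {}).get(v, set())
                for k, v in filters.items()]
        return set.intersection(*sets) if sets else set()
```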

Python API:

```python
index.insert(entity_id, ts, vec, metadata={"goal": "clean", "room": "kitchen"})
results = index.search(query, k=5, metadata={"goal": "clean"})  # Pre-filtered
```

Complexity: Medium. Mirrors the temporal bitmap pattern.

Gap 4: Memory consolidation — deferred to roadmap


Biological memory consolidates repeated experiences into prototypes. At scale (10M+ episodes), unconsolidated accumulation may degrade retrieval quality through noise.

However, consolidation introduces serious risks:

  • Destroys episodic structure: A centroid of 10 episodes has no predecessor/successor edges — causal search breaks on prototypes
  • Loses variance: Edge cases (often the most informative) are averaged away
  • Consistency is hard: Updating prototypes when source episodes change or are re-evaluated requires complex invalidation policies

Decision: Defer consolidation. For current and near-term scale (1-10M episodes), improving retrieval quality (centering, metadata filtering, outcome weighting) is more impactful and less risky than consolidation.

See Part D for a future design (complementary prototypes with tiered fidelity) when scale demands it.

More recent experiences should be more accessible by default. CVX has time_decay_weight in temporal_edges.rs but doesn’t use it in general search scoring.

Proposed extension: Optional recency factor in composite distance:

d_final = α·d_semantic + β·d_temporal + γ·recency_penalty(age)

Where recency_penalty = 1 - exp(-λ·age) (older = higher penalty).

Complexity: Low. Adds one term to the scoring function.
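As a sketch, with α, β, γ, λ as free tuning parameters (names chosen here, not existing CVX API):

```python
# Composite distance with an optional recency penalty term.
import math

def composite_distance(d_semantic, d_temporal, age,
                       alpha=1.0, beta=0.0, gamma=0.5, lam=0.1):
    recency_penalty = 1.0 - math.exp(-lam * age)  # older → penalty closer to 1
    return alpha * d_semantic + beta * d_temporal + gamma * recency_penalty
```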

  1. Composition over inheritance: Each layer (HnswGraph → TemporalHnsw → TemporalGraphIndex → ConcurrentWrapper) adds responsibility without modifying the inner layer. Idiomatic Rust decorator pattern.

  2. SmallVec for neighbor lists: SmallVec<[u32; 16]> keeps neighbor lists inline (no heap allocation) for the default M=16.

  3. RoaringBitmap temporal filtering: Sub-byte per vector, O(1) membership check. Excellent for 1M+ scale.

  4. Postcard serialization: Compact binary format with separate snapshot structs — serialization logic doesn’t pollute domain structs.

  5. Trait-based polymorphism: DistanceMetric, TemporalIndexAccess, StorageBackend enable real loose coupling and testability.

  1. TemporalIndexAccess is a god trait: 12 methods with empty defaults. Violates Interface Segregation. Should split into:

    • TemporalSearch (search_raw, search_with_metadata)
    • TrajectoryAccess (trajectory, vector, entity_id, timestamp)
    • RegionAccess (regions, region_members, region_assignments, region_trajectory)
  2. Python API bypasses query engine: cvx-python calls cvx-index and cvx-analytics directly, not through cvx-query. Features must be exposed twice. The query engine’s TemporalQuery enum (15 query types) is richer than what Python exposes.

  3. No snapshot versioning: A struct field change silently breaks deserialization. Needs version: u32 in TemporalSnapshot.

  4. Composite distance scale mismatch: α·d_semantic + (1-α)·d_temporal assumes comparable scales, but cosine ∈ [0,2] vs temporal ∈ [0,1]. With α=0.5, semantic has 2× the effective weight.

  5. Entity ID is untyped: A user, document, and episode are all u64. No way to distinguish entity types at the index level.
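For the composite distance scale mismatch (item 4), the simplest fix is to rescale cosine distance into [0, 1] before mixing, so α weights the two terms as intended. A sketch (function name is illustrative):

```python
# Normalize cosine distance (range [0, 2]) to [0, 1] before blending
# with the temporal distance (already in [0, 1]).
def composite(d_cosine, d_temporal, alpha=0.5):
    d_semantic = d_cosine / 2.0
    return alpha * d_semantic + (1 - alpha) * d_temporal
```

With this normalization, α=0.5 gives the two components equal effective weight, instead of semantic carrying 2× the weight.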

| # | Refactoring | Impact | Effort | Benefit for AI agents |
|---|---|---|---|---|
| 1 | Native centering (Part B) | 30× discrimination | Low | More precise memory retrieval |
| 2 | Expose causal_search in Python | Enables continuation pattern | Medium | Primary agent memory pattern |
| 3 | Indexed metadata filtering | Context-dependent retrieval | Medium | "Similar state + same goal" |
| 4 | Outcome-aware search | Filter by success/reward | Low | Only retrieve what worked |
| 5 | Snapshot versioning | Robustness | Low | Avoid silent data corruption |
| 6 | Trait segregation | Maintainability | Medium | Cleaner extension points |
| 7 | Recency-weighted search | Temporal relevance | Low | Prefer recent experiences |
| 8 | Distance scale normalization | Correctness | Low | Balanced semantic/temporal weighting |
| 9 | Parallel HNSW build (Part A) | 4-6× build speedup | Medium | 30min→5min for researchers |
| 10 | Procrustes model alignment | Cross-model robustness | Medium | Preserve memory across model changes |

Memory consolidation has been deferred to roadmap (see Part D). At current scale, improving retrieval quality through items 1-4 is higher impact and lower risk than lossy consolidation.

  • RFC-010 (Temporal Graph Index): Provides the causal_search infrastructure. Gap 2 is about exposing it, not reimplementing it.
  • RFC-004 (Semantic Regions): region_assignments provides the clustering infrastructure that future consolidation would build on.
  • RFC-005 (Region Members): Temporal filtering on regions enables time-scoped analytics and future tiered consolidation.

D.1 Memory Consolidation via Tiered Fidelity


At scale (10M+ episodes), unconsolidated accumulation increases retrieval noise. But naive consolidation (replacing episodes with centroids) destroys episodic structure — a centroid has no steps, no predecessor/successor edges, no causal continuation capability.

Design: Complementary Prototypes (not substitutive)


Prototypes complement episodes, they never replace them. A prototype is an additional node in the HNSW with metadata linking to its source episodes.

```
COLD = original episodes (append-only, immutable ground truth)
WARM = original episodes + derived prototypes (marked as synthetic)
HOT  = recent episodes + most-consulted prototypes
```

Key principles:

  1. Cold is immutable: Original data is never modified or deleted. This is the source of truth for re-derivation if consolidation introduces artifacts.

  2. Prototypes are traceable: Each prototype stores {type: "prototype", source_episodes: [A, B, C], n_sources: 10}. If retrieval returns a prototype and the agent needs more detail, it follows the links to the source episodes in cold.

  3. Fidelity degrades gracefully: Hot (fast, possibly consolidated) → Warm (moderate, mixed) → Cold (slow, always original). An agent can “zoom in” from a prototype match to the actual episodes.

  4. Consistency policy: Prototypes are invalidated when their source episodes’ metadata changes (e.g., reward updated). Re-derivation is triggered lazily on next access or eagerly via background compaction.
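Principles 2 and 4 can be sketched as a small bookkeeping structure. Everything here is hypothetical (`PrototypeStore`, version stamps) and stands in for whatever invalidation policy is eventually chosen:

```python
# Prototypes link to source episodes and record the metadata version they
# were derived from; a source update invalidates the prototype.
class PrototypeStore:
    def __init__(self):
        self.episode_version = {}  # episode_id → metadata version counter
        self.prototypes = {}       # proto_id → {"sources": [...], "versions": {...}}

    def derive(self, proto_id, source_episodes):
        self.prototypes[proto_id] = {
            "sources": list(source_episodes),
            "versions": {e: self.episode_version.get(e, 0)
                         for e in source_episodes},
        }

    def touch(self, episode_id):
        # e.g. reward re-evaluated on a source episode
        self.episode_version[episode_id] = self.episode_version.get(episode_id, 0) + 1

    def is_valid(self, proto_id):
        p = self.prototypes[proto_id]
        return all(self.episode_version.get(e, 0) == v
                   for e, v in p["versions"].items())
```

An invalid prototype would then be re-derived lazily on next access, or eagerly by background compaction, per the consistency policy above.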

  • Current scale (1-10M episodes) doesn’t require consolidation
  • HNSW search is O(log N) — even 10M is fast
  • Retrieval quality improvements (centering, metadata filtering, outcome weighting) have higher impact at current scale
  • The consolidation algorithm itself is a research question: what to consolidate, when, how to preserve episodic structure in prototypes
  • Gap 1 (outcome awareness): Need reward annotations to know which episodes are worth consolidating
  • Gap 3 (metadata indexing): Need to mark prototypes as synthetic and link to sources
  • Tiered storage wiring: Cold tier PQ code exists but is not connected

Should CVX incorporate structures beyond HNSW — specifically Bayesian networks, knowledge graphs, or causal DAGs?

  1. Causal relationships: “action A caused outcome B” is a directed edge, not a distance
  2. Conditional dependencies: “strategy X works IF condition Y holds” requires structured inference, not similarity search
  3. Compositional knowledge: “tool A is-a instrument” — taxonomic relations are discrete and transitive
  4. Probabilistic reasoning: P(state | observations) requires belief propagation

| Structure | What it adds | Integration point | Use case |
|---|---|---|---|
| Knowledge graph | Typed entities + relations | Indexed metadata | Compositional planning |
| Bayesian network | Conditional probabilities | Region transitions as CPTs | Decision under uncertainty |
| Causal DAG | Directed cause-effect | Granger causality → edges | Counterfactual reasoning |

Defer to a future RFC. The gaps in Part C (outcome awareness, context filtering, causal continuation) are prerequisites. Auxiliary structures become valuable only after primary retrieval is reliable and context-aware.

If/when needed, the most natural integrations are:

  1. Knowledge graph as metadata index: Entity relations stored as indexed metadata, leveraging Gap 3’s infrastructure
  2. Bayesian network as post-retrieval scorer: Lightweight BN scoring P(success | region, context) over HNSW candidates
  3. Causal DAG from Granger tests: Materialize Granger causality results (already computed) as persistent directed graph

These would be companion crates (cvx-graph, cvx-bayes), not modifications to the core index.

D.3 Documentation Debt — Architecture vs Implementation


An audit identified significant gaps between architecture documentation and actual implementation. The following components are documented as features but have no implementation:

| Component | Docs status | Implementation | Action |
|---|---|---|---|
| Data Virtualization | 10+ sections | 0% | Move to roadmap “Production Ingestion” |
| Distributed Deployment | Full architecture | 0% | Move to roadmap “Phase 5+” |
| Observability (Prometheus/OTLP) | Detailed | Only tracing crate | Mark as planned |
| Temporal ML (Burn/Torch backends) | 3 backends | Only AnalyticBackend | Mark differentiable as future |
| Multi-Scale Alignment | 4 methods | Only resample() | Keep Procrustes, remove rest |
| Interpretability | 6 artifacts | Only drift attribution | Document what exists |
| gRPC QueryStream | Documented | IngestStream + WatchDrift only | Sufficient for now |
| Cold Storage | PQ codebook | Code exists, not wired | Wire when scale demands |

Conversely, the following implemented features are poorly documented or absent from architecture pages:

| Feature | Implementation | Current docs | Action |
|---|---|---|---|
| region_assignments() O(N) | Complete in temporal.rs + Python | Only in examples overview | Add to temporal-index.md |
| Episodic memory data model | episode_encoding.rs, E1-E4 validated | Only in research section | Add to data-model.md and analytics-engine.md |
| Anchor projection pipeline | anchor.rs, anchor_index.rs, Python project_to_anchors() | Only in RFC-006 | Add to analytics-engine.md |
| Centering / anisotropy correction | Manual in notebooks, native planned (RFC-012 Part B) | Only in RFC-012 | Add to analytics-engine.md when implemented |
| Metadata filtering | MetadataStore, MetadataFilter, search_with_metadata() | Not in architecture | Add to temporal-index.md |
| Temporal edges / causal search | TemporalEdgeLayer, TemporalGraphIndex (RFC-010) | Not in temporal-index.md | Add temporal edges section |
| region_trajectory() EMA smoothing | Complete in temporal.rs + Python | Not documented | Add to temporal-index.md |
| Scalar quantization | enable_quantization() / disable_quantization() | Mentioned in RFC-002 only | Add to temporal-index.md |

Action plan:

  1. Update intro/vision with actual state (done)
  2. Mark unimplemented architecture sections with badges (done)
  3. Update temporal-index.md: add temporal edges, metadata, regions, SQ
  4. Update analytics-engine.md: add anchor projection, episodic encoding
  5. Update data-model.md: add episode encoding scheme, metadata model
  6. Move enterprise features (distributed, data virtualization) to roadmap