RFC-002: Correctness & Performance
This RFC documents 10 identified weaknesses in the current ChronosVector implementation and proposes concrete fixes for each. Issues span durability, algorithmic correctness, concurrency, numerical stability, and API robustness.
Summary Table
Section titled “Summary Table”| # | Severity | Issue | Effort | Status |
|---|---|---|---|---|
| 01 | CRITICAL | WAL fsync not guaranteed on commit | Low | ✅ Done |
| 02 | HIGH | NaN panic in distance sort | Low | ✅ Done |
| 03 | HIGH | Heuristic neighbor condition too restrictive | Medium | ✅ Done |
| 04 | HIGH | Single-writer RwLock bottleneck | High | ✅ Done |
| 05 | HIGH | gRPC delivery semantics undefined | High | ✅ Done |
| 06 | MEDIUM | PELT candidate set unbounded growth | Medium | ✅ Done |
| 07 | MEDIUM | Random level generation unbounded | Low | ✅ Done |
| 08 | MEDIUM | Warm store O(N) range scans | Medium | ✅ Done |
| 09 | MEDIUM | PQ codebook initialization bias | Medium | ✅ Done |
| 10 | LOW | ODE solver no stiffness detection | High | Deferred |
Implementation Phases
Section titled “Implementation Phases”Phase 1: Critical Fixes (1-2 days)
Section titled “Phase 1: Critical Fixes (1-2 days)”RFC-002-01, 002-02, 002-07 — Less than 20 lines total. Prevent data loss and panics.
Phase 2: Correctness (1 week)
Section titled “Phase 2: Correctness (1 week)”RFC-002-03, 002-06 — Fix heuristic neighbor selection (Malkov §4.2 Algorithm 4) and PELT pruning bounds.
Phase 3: Architecture (2-3 weeks)
Section titled “Phase 3: Architecture (2-3 weeks)”RFC-002-04, 002-05, 002-08, 002-09 — Concurrent insert architecture, gRPC exactly-once semantics, warm store zone maps, PQ k-means++ initialization.
Phase 4: Deferred
Section titled “Phase 4: Deferred”RFC-002-10 — ODE stiffness detection. Triggers when burn ML backend is integrated.
RFC-002-01: WAL fsync Guarantee on Commit
Section titled “RFC-002-01: WAL fsync Guarantee on Commit”Severity: CRITICAL | Effort: Low
Problem
Section titled “Problem”Wal::commit() calls BufWriter::flush() (pushes data to OS page cache) and persist_meta() (atomic rename), but never calls sync_all() on the segment file. If the process crashes between flush and the OS flushing the page cache to disk, committed entries are lost.
pub fn commit(&mut self, sequence: u64) -> Result<(), StorageError> { self.meta.committed_sequence = sequence; self.flush()?; // BufWriter::flush — NOT fsync self.persist_meta()?; // rename — NOT directory fsync Ok(())}pub fn commit(&mut self, sequence: u64) -> Result<(), StorageError> { self.meta.committed_sequence = sequence; if let Some(ref mut writer) = self.current_writer { writer.flush()?; writer.get_ref().sync_all()?; // Sync segment to disk } self.persist_meta()?; let dir = std::fs::File::open(&self.dir)?; dir.sync_all()?; // Sync directory to make rename durable Ok(())}References
Section titled “References”- Pillai et al., “All File Systems Are Not Created Equal” (OSDI 2014)
- Zheng et al., “BtrBlk: Efficient B-tree Logging” (SIGMOD 2022)
RFC-002-02: NaN-Safe Distance Sorting
Section titled “RFC-002-02: NaN-Safe Distance Sorting”Severity: HIGH | Effort: Low
Problem
Section titled “Problem”6 occurrences of partial_cmp(&b.1).unwrap() across HNSW modules. If any distance is NaN, the server panics.
// Replace all occurrences:result_vec.sort_by(|a, b| a.1.total_cmp(&b.1));f32::total_cmp (stable since Rust 1.62) defines total ordering including NaN.
References
Section titled “References”- IEEE 754-2019, §5.10:
totalOrderpredicate
RFC-002-03: Heuristic Neighbor Selection
Section titled “RFC-002-03: Heuristic Neighbor Selection”Severity: HIGH | Effort: Medium
Problem
Section titled “Problem”Current condition uses <= (non-strict), should be < (strict) per Malkov Algorithm 4:
// Current (too permissive for equidistant candidates):cand_dist <= dist_to_selected
// Correct (per paper):cand_dist < dist_to_selectedExpected Impact
Section titled “Expected Impact”1-3% recall improvement at high dimensionality (D≥128).
References
Section titled “References”- Malkov & Yashunin, “Efficient and Robust Approximate Nearest Neighbor Using HNSW Graphs” (IEEE TPAMI 2018), §4.2
- Fu et al., “Fast ANN Search with the Navigating Spreading-out Graph” (VLDB 2019)
RFC-002-04: Concurrent Insert Architecture
Section titled “RFC-002-04: Concurrent Insert Architecture”Severity: HIGH | Effort: High
Problem
Section titled “Problem”Single RwLock<TemporalHnsw> serializes all inserts and blocks all searches during insert.
Proposed Options
Section titled “Proposed Options”| Option | Approach | Latency Impact | Complexity |
|---|---|---|---|
| A (recommended) | Insert queue + batch commit | Eventual visibility (~10ms) | Medium |
| B | Per-node locks | Immediate visibility | High |
| C | Segmented graph | Immediate, partitioned | High |
References
Section titled “References”- Singh et al., “FreshDiskANN” (arXiv 2023) — per-node concurrent insert
- Subramanya et al., “DiskANN” (NeurIPS 2019) — lock-free graph traversal
RFC-002-05: gRPC Delivery Semantics
Section titled “RFC-002-05: gRPC Delivery Semantics”Severity: HIGH | Effort: High
Problem
Section titled “Problem”- No deduplication on re-sent points
- Only last ACK returned (partial failures invisible)
- No backpressure mechanism
Proposed Protocol
Section titled “Proposed Protocol”Client assigns sequence numbers. Server ACKs every N points with committed WAL sequence. On reconnect, client resumes from last ACK.
References
Section titled “References”- Kreps et al., “Kafka: A Distributed Messaging System” (NetDB 2011)
RFC-002-06: PELT Candidate Set Bounding
Section titled “RFC-002-06: PELT Candidate Set Bounding”Severity: MEDIUM | Effort: Medium
Problem
Section titled “Problem”Pruning fallback f[t] == f64::INFINITY retains all candidates during early iterations, allowing O(N) candidate set growth.
Remove infinity fallback and add explicit cap on candidate set size.
References
Section titled “References”- Killick et al., “Optimal Detection of Changepoints with Linear Computational Cost” (JASA 2012), Theorem 3.1
RFC-002-07: Random Level Bounding
Section titled “RFC-002-07: Random Level Bounding”Severity: MEDIUM | Effort: Low
let level = (-r.ln() * self.config.level_mult).floor() as usize;level.min(32) // Cap at theoretical max for 10^38 nodesReferences
Section titled “References”- Malkov & Yashunin (2018), §4.1
RFC-002-08: Warm Store Zone Maps
Section titled “RFC-002-08: Warm Store Zone Maps”Severity: MEDIUM | Effort: Medium
Problem
Section titled “Problem”Range queries deserialize ALL points then filter in memory. O(N) for selective queries.
Per-chunk manifest with (min_ts, max_ts). Skip non-overlapping chunks.
References
Section titled “References”- Lamb et al., “The Vertica Analytic Database” (VLDB 2012) — zone maps
RFC-002-09: PQ Codebook Initialization
Section titled “RFC-002-09: PQ Codebook Initialization”Severity: MEDIUM | Effort: Medium
Problem
Section titled “Problem”Sequential initialization from temporally ordered data biases centroids toward early observations.
k-means++ initialization: sample centroids proportional to .
References
Section titled “References”- Arthur & Vassilvitskii, “k-means++: The Advantages of Careful Seeding” (SODA 2007)
- Jégou et al., “Product Quantization for Nearest Neighbor Search” (IEEE TPAMI 2011)
RFC-002-10: ODE Stiffness Detection
Section titled “RFC-002-10: ODE Stiffness Detection”Severity: LOW | Status: Deferred
Dormand-Prince (explicit RK45) cannot handle stiff systems efficiently. Deferred until Neural ODE training (burn backend) is implemented.
References
Section titled “References”- Hairer & Wanner, “Solving ODEs II: Stiff Problems” (Springer, 2002)
- Chen et al., “Neural Ordinary Differential Equations” (NeurIPS 2018)
Full Reference List
Section titled “Full Reference List”- Arthur & Vassilvitskii. “k-means++: The Advantages of Careful Seeding.” SODA 2007.
- Chen et al. “Neural Ordinary Differential Equations.” NeurIPS 2018.
- Fu et al. “Fast ANN Search with the Navigating Spreading-out Graph.” VLDB 2019.
- Ge et al. “Optimized Product Quantization.” IEEE TPAMI 2014.
- Hairer & Wanner. “Solving Ordinary Differential Equations II.” Springer, 2002.
- IEEE 754-2019. “Standard for Floating-Point Arithmetic.”
- Jégou, Douze & Schmid. “Product Quantization for Nearest Neighbor Search.” IEEE TPAMI 2011.
- Kidger. “On Neural Differential Equations.” PhD Thesis, Oxford, 2022.
- Killick, Fearnhead & Eckley. “Optimal Detection of Changepoints with Linear Computational Cost.” JASA 2012.
- Kreps, Narkhede & Rao. “Kafka: A Distributed Messaging System.” NetDB 2011.
- Lamb et al. “The Vertica Analytic Database: C-Store 7 Years Later.” VLDB 2012.
- Maidstone et al. “On Optimal Multiple Changepoint Algorithms for Large Data.” Statistics & Computing 2017.
- Malkov & Yashunin. “Efficient and Robust ANN Using HNSW Graphs.” IEEE TPAMI 2018.
- Pillai et al. “All File Systems Are Not Created Equal.” OSDI 2014.
- Singh et al. “FreshDiskANN: Streaming Similarity Search.” arXiv 2023.
- Subramanya et al. “DiskANN: Billion-point Nearest Neighbor Search.” NeurIPS 2019.
- Zheng et al. “BtrBlk: Efficient B-tree Logging for Database Storage.” SIGMOD 2022.