RFC-002: Correctness & Performance

This RFC documents 10 identified weaknesses in the current ChronosVector implementation and proposes concrete fixes for each. Issues span durability, algorithmic correctness, concurrency, numerical stability, and API robustness.

Summary Table

#	Severity	Issue	Effort	Status
01	CRITICAL	WAL fsync not guaranteed on commit	Low	✅ Done
02	HIGH	NaN panic in distance sort	Low	✅ Done
03	HIGH	Heuristic neighbor condition too restrictive	Medium	✅ Done
04	HIGH	Single-writer RwLock bottleneck	High	✅ Done
05	HIGH	gRPC delivery semantics undefined	High	✅ Done
06	MEDIUM	PELT candidate set unbounded growth	Medium	✅ Done
07	MEDIUM	Random level generation unbounded	Low	✅ Done
08	MEDIUM	Warm store O(N) range scans	Medium	✅ Done
09	MEDIUM	PQ codebook initialization bias	Medium	✅ Done
10	LOW	ODE solver no stiffness detection	High	Deferred

Implementation Phases

Phase 1: Critical Fixes (1-2 days)

RFC-002-01, 002-02, 002-07 — Less than 20 lines total. Prevent data loss and panics.

Phase 2: Correctness (1 week)

RFC-002-03, 002-06 — Fix heuristic neighbor selection (Malkov §4.2 Algorithm 4) and PELT pruning bounds.

Phase 3: Architecture (2-3 weeks)

RFC-002-04, 002-05, 002-08, 002-09 — Concurrent insert architecture, gRPC exactly-once semantics, warm store zone maps, PQ k-means++ initialization.

Phase 4: Deferred

RFC-002-10 — ODE stiffness detection. Triggers when burn ML backend is integrated.

RFC-002-01: WAL fsync Guarantee on Commit

Severity: CRITICAL | Effort: Low

Problem

Wal::commit() calls BufWriter::flush() (pushes data to OS page cache) and persist_meta() (atomic rename), but never calls sync_all() on the segment file. If the process crashes between flush and the OS flushing the page cache to disk, committed entries are lost.

pub fn commit(&mut self, sequence: u64) -> Result<(), StorageError> {
    self.meta.committed_sequence = sequence;
    self.flush()?;        // BufWriter::flush — NOT fsync
    self.persist_meta()?; // rename — NOT directory fsync
    Ok(())
}

Fix

pub fn commit(&mut self, sequence: u64) -> Result<(), StorageError> {
    self.meta.committed_sequence = sequence;
    if let Some(ref mut writer) = self.current_writer {
        writer.flush()?;
        writer.get_ref().sync_all()?;  // Sync segment to disk
    }
    self.persist_meta()?;
    let dir = std::fs::File::open(&self.dir)?;
    dir.sync_all()?;  // Sync directory to make rename durable
    Ok(())
}

References

Pillai et al., “All File Systems Are Not Created Equal” (OSDI 2014)
Zheng et al., “BtrBlk: Efficient B-tree Logging” (SIGMOD 2022)

RFC-002-02: NaN-Safe Distance Sorting

Severity: HIGH | Effort: Low

Problem

6 occurrences of partial_cmp(&b.1).unwrap() across HNSW modules. If any distance is NaN, the server panics.

Fix

// Replace all occurrences:
result_vec.sort_by(|a, b| a.1.total_cmp(&b.1));

f32::total_cmp (stable since Rust 1.62) defines total ordering including NaN.

References

IEEE 754-2019, §5.10: totalOrder predicate

RFC-002-03: Heuristic Neighbor Selection

Severity: HIGH | Effort: Medium

Problem

Current condition uses <= (non-strict), should be < (strict) per Malkov Algorithm 4:

// Current (too permissive for equidistant candidates):
cand_dist <= dist_to_selected

// Correct (per paper):
cand_dist < dist_to_selected

Expected Impact

1-3% recall improvement at high dimensionality (D≥128).

References

Malkov & Yashunin, “Efficient and Robust Approximate Nearest Neighbor Using HNSW Graphs” (IEEE TPAMI 2018), §4.2
Fu et al., “Fast ANN Search with the Navigating Spreading-out Graph” (VLDB 2019)

RFC-002-04: Concurrent Insert Architecture

Severity: HIGH | Effort: High

Problem

Single RwLock<TemporalHnsw> serializes all inserts and blocks all searches during insert.

Proposed Options

Option	Approach	Latency Impact	Complexity
A (recommended)	Insert queue + batch commit	Eventual visibility (~10ms)	Medium
B	Per-node locks	Immediate visibility	High
C	Segmented graph	Immediate, partitioned	High

References

Singh et al., “FreshDiskANN” (arXiv 2023) — per-node concurrent insert
Subramanya et al., “DiskANN” (NeurIPS 2019) — lock-free graph traversal

RFC-002-05: gRPC Delivery Semantics

Severity: HIGH | Effort: High

Problem

No deduplication on re-sent points
Only last ACK returned (partial failures invisible)
No backpressure mechanism

Proposed Protocol

Client assigns sequence numbers. Server ACKs every N points with committed WAL sequence. On reconnect, client resumes from last ACK.

References

Kreps et al., “Kafka: A Distributed Messaging System” (NetDB 2011)

RFC-002-06: PELT Candidate Set Bounding

Severity: MEDIUM | Effort: Medium

Problem

Pruning fallback f[t] == f64::INFINITY retains all candidates during early iterations, allowing O(N) candidate set growth.

Fix

Remove infinity fallback and add explicit cap on candidate set size.

References

Killick et al., “Optimal Detection of Changepoints with Linear Computational Cost” (JASA 2012), Theorem 3.1

RFC-002-07: Random Level Bounding

Severity: MEDIUM | Effort: Low

Fix

let level = (-r.ln() * self.config.level_mult).floor() as usize;
level.min(32)  // Cap at theoretical max for 10^38 nodes

References

Malkov & Yashunin (2018), §4.1

RFC-002-08: Warm Store Zone Maps

Severity: MEDIUM | Effort: Medium

Problem

Range queries deserialize ALL points then filter in memory. O(N) for selective queries.

Fix

Per-chunk manifest with (min_ts, max_ts). Skip non-overlapping chunks.

References

Lamb et al., “The Vertica Analytic Database” (VLDB 2012) — zone maps

RFC-002-09: PQ Codebook Initialization

Severity: MEDIUM | Effort: Medium

Problem

Sequential initialization from temporally ordered data biases centroids toward early observations.

Fix

k-means++ initialization: sample centroids proportional to $D^2(x)$ .

References

Arthur & Vassilvitskii, “k-means++: The Advantages of Careful Seeding” (SODA 2007)
Jégou et al., “Product Quantization for Nearest Neighbor Search” (IEEE TPAMI 2011)

RFC-002-10: ODE Stiffness Detection

Severity: LOW | Status: Deferred

Dormand-Prince (explicit RK45) cannot handle stiff systems efficiently. Deferred until Neural ODE training (burn backend) is implemented.

References

Hairer & Wanner, “Solving ODEs II: Stiff Problems” (Springer, 2002)
Chen et al., “Neural Ordinary Differential Equations” (NeurIPS 2018)

Full Reference List

Arthur & Vassilvitskii. “k-means++: The Advantages of Careful Seeding.” SODA 2007.
Chen et al. “Neural Ordinary Differential Equations.” NeurIPS 2018.
Fu et al. “Fast ANN Search with the Navigating Spreading-out Graph.” VLDB 2019.
Ge et al. “Optimized Product Quantization.” IEEE TPAMI 2014.
Hairer & Wanner. “Solving Ordinary Differential Equations II.” Springer, 2002.
IEEE 754-2019. “Standard for Floating-Point Arithmetic.”
Jégou, Douze & Schmid. “Product Quantization for Nearest Neighbor Search.” IEEE TPAMI 2011.
Kidger. “On Neural Differential Equations.” PhD Thesis, Oxford, 2022.
Killick, Fearnhead & Eckley. “Optimal Detection of Changepoints with Linear Computational Cost.” JASA 2012.
Kreps, Narkhede & Rao. “Kafka: A Distributed Messaging System.” NetDB 2011.
Lamb et al. “The Vertica Analytic Database: C-Store 7 Years Later.” VLDB 2012.
Maidstone et al. “On Optimal Multiple Changepoint Algorithms for Large Data.” Statistics & Computing 2017.
Malkov & Yashunin. “Efficient and Robust ANN Using HNSW Graphs.” IEEE TPAMI 2018.
Pillai et al. “All File Systems Are Not Created Equal.” OSDI 2014.
Singh et al. “FreshDiskANN: Streaming Similarity Search.” arXiv 2023.
Subramanya et al. “DiskANN: Billion-point Nearest Neighbor Search.” NeurIPS 2019.
Zheng et al. “BtrBlk: Efficient B-tree Logging for Database Storage.” SIGMOD 2022.