
# Benchmark Strategy

“Benchmark what matters, not what’s easy.”

We do not measure operations that every database handles identically (e.g., health check latency). We measure the operations where CVX should excel (temporal queries, trajectory analytics) and the operations where CVX must not lose (vanilla kNN, ingest throughput).


## Category A: Unique Capabilities (CVX-Only)


These benchmarks measure operations that no existing vector database supports natively. The comparison baseline is an ad-hoc solution a user would have to build on top of a generic VDB.

| Benchmark | What It Measures | Key Metric | Target |
| --- | --- | --- | --- |
| A1: Temporal kNN | Snapshot kNN with native temporal awareness vs. Qdrant with timestamp as payload filter | Recall@10 vs. ground truth temporal kNN | CVX recall ≥ 0.95 |
| A2: Trajectory Reconstruction | Retrieve full entity trajectory (single call) vs. N individual point lookups | Latency and storage comparison | ≥ 5× faster, 3-5× less storage |
| A3: Change Point Detection | Detect semantic shifts in entity trajectories | F1 score vs. planted ground truth | PELT F1 ≥ 0.85 |
| A4: Drift Attribution | Measure and explain concept drift | Correlation with known semantic shifts | Top-10 dimensions capture ≥ 80% of shifts |
| A5: Prediction (Neural ODE) | Predict future vector states | MSE vs. linear extrapolation baseline | ≥ 15% lower MSE |
| A6: Temporal Analogy | “What in 2018 played the role that X does in 2024?” | MRR on curated analogy dataset | MRR ≥ 0.4 |
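
Several Category A targets are phrased as recall@10 against an exact ground-truth run. A minimal sketch of that metric (the helper name `recall_at_k` is ours for illustration, not CVX's API):

```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int], k: int = 10) -> float:
    """Fraction of the exact top-k neighbors that the approximate top-k recovered."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Example: the approximate index misses one of the ten true neighbors.
exact = list(range(10))           # ground-truth temporal kNN ids
approx = list(range(9)) + [42]    # one wrong neighbor
assert recall_at_k(approx, exact) == 0.9
```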

## Category B: Competitive Parity (CVX vs. Qdrant)


These benchmarks demonstrate that CVX does not sacrifice base performance by adding temporal capabilities. The competitor is Qdrant (latest stable release) on identical hardware.

| Benchmark | What It Measures | Key Metric | Target |
| --- | --- | --- | --- |
| B1: Vanilla kNN | Pure semantic kNN (α = 1.0, no temporal component) | QPS at recall@10 ≥ 0.95 | Within 80% of Qdrant’s QPS |
| B2: Ingest Throughput | Sustained vector insertion rate | Vectors/second over 10M inserts | ≥ 50K vectors/sec |
| B3: Memory Efficiency | RAM usage per million vectors | RSS per million vectors | Within ±20% of Qdrant |
| B4: Concurrent Queries | Query latency under concurrent load | Latency p50/p99 at 1-100 concurrent queries | Similar degradation curve |
## Category C: Storage Efficiency

| Benchmark | What It Measures | Key Metric | Target |
| --- | --- | --- | --- |
| C1: Delta Compression | Storage savings from delta encoding vs. full vector storage | Compression ratio by drift rate | ≥ 3× for slow drift |
| C2: Tiered Storage | Total storage cost across hot/warm/cold tiers | Cold tier size vs. hot tier | Cold < 5% of hot with recall ≥ 0.90 |
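
C1's compression ratio can be illustrated with a toy delta codec: store the first snapshot in full float32 and each later snapshot as int8-quantized deltas from its predecessor. This is a sketch under assumed parameters (the `scale` quantizer and the σ = 0.01 drift rate are illustrative choices, not CVX's actual codec):

```python
import random
import struct

def delta_encode(trajectory, scale=0.05):
    """First vector stored as full float32; each subsequent vector encoded
    as int8-quantized deltas from its predecessor. `scale` maps the expected
    delta range onto int8 (an illustrative assumption, not CVX's codec)."""
    blobs = [struct.pack(f"{len(trajectory[0])}f", *trajectory[0])]
    prev = trajectory[0]
    for vec in trajectory[1:]:
        deltas = [max(-127, min(127, round((v - p) / scale * 127)))
                  for v, p in zip(vec, prev)]
        blobs.append(struct.pack(f"{len(deltas)}b", *deltas))
        prev = vec
    return blobs

random.seed(0)
D, T = 768, 100
vec = [random.gauss(0, 1) for _ in range(D)]
traj = [vec]
for _ in range(T - 1):                       # slow drift: sigma = 0.01 per step
    vec = [v + random.gauss(0, 0.01) for v in vec]
    traj.append(vec)

full_bytes = T * D * 4                       # float32 storage for every snapshot
delta_bytes = sum(len(b) for b in delta_encode(traj))
ratio = full_bytes / delta_bytes             # ≈ 3.9× for this slow-drift trajectory
```

The byte counts are deterministic (one float32 snapshot plus T-1 int8 delta blocks), so a slow-drift trajectory like this one clears the ≥ 3× target by construction; real savings depend on the drift rate, as the C1 metric notes.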
## Category D: Stochastic Modeling

| Benchmark | What It Measures | Key Metric | Target |
| --- | --- | --- | --- |
| D1: Characterization Accuracy | Classification accuracy of drift significance, mean reversion, Hurst exponent | Correct classification on synthetic processes | ≥ 95% accuracy |
| D2: Signature Quality | Can signature-based kNN find trajectories with similar dynamics? | Recall@10 for same-pattern trajectories | ≥ 85% recall |
| D3: Neural SDE Calibration | Does Neural SDE provide better calibrated uncertainty than Neural ODE? | % of true values within 95% confidence interval | Neural SDE calibration ≥ 90% |
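
D3's calibration metric is the fraction of true values that fall inside the model's central 95% predictive interval. A minimal sketch, assuming Gaussian predictive marginals (mean ± 1.96·std); the helper name is ours:

```python
def coverage_95(y_true, y_mean, y_std):
    """Fraction of true values inside the central 95% interval,
    assuming Gaussian predictive marginals (mean ± 1.96 * std)."""
    inside = sum(1 for t, m, s in zip(y_true, y_mean, y_std)
                 if m - 1.96 * s <= t <= m + 1.96 * s)
    return inside / len(y_true)

# A well-calibrated model covers ~95% of truths; here one of three
# truths falls outside the interval, so coverage is 2/3.
assert coverage_95([0.0, 0.5, 3.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]) == 2 / 3
```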

## Datasets

### Wikipedia Temporal

Monthly snapshots of Wikipedia articles from 2018-2025, embedded with a sentence transformer. Known events (COVID-19, the Ukraine conflict, the AI boom) create natural ground-truth change points. Available in three sizes:

| Subset | Articles | Months | Total Points |
| --- | --- | --- | --- |
| Small | 10K | 84 | ~840K |
| Medium | 100K | 84 | ~8.4M |
| Large | 500K | 84 | ~42M |
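
The Total Points column is simply articles × monthly snapshots:

```python
# Total points = articles × monthly snapshots per article (84 months).
assert 10_000 * 84 == 840_000        # Small
assert 100_000 * 84 == 8_400_000     # Medium
assert 500_000 * 84 == 42_000_000    # Large
```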

### Synthetic Drift

Random walks in D = 768 with planted abrupt shifts at known timestamps. Parameters vary across drift rate (σ ∈ [0.001, 0.05]), change point magnitude (0.1-2.0 × σ), number of change points (0-5), and trajectory length (100-100K points). Total: 40K trajectories with ground truth.
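
A generator for such trajectories might look like the following stdlib-only sketch. The jump model (one ±magnitude·σ shift per dimension at each change point) and all default parameters are illustrative choices, not the benchmark's exact configuration:

```python
import random

def synthetic_trajectory(D=768, T=1000, sigma=0.01,
                         change_points=(300, 700), magnitude=1.5, seed=0):
    """Random walk in R^D with planted abrupt shifts at known timestamps.
    At each change point every dimension jumps by magnitude * sigma in a
    random direction. Returns (trajectory, change_points) so benchmarks
    have exact ground truth."""
    rng = random.Random(seed)
    vec = [0.0] * D
    traj = []
    for t in range(T):
        vec = [v + rng.gauss(0.0, sigma) for v in vec]  # ordinary diffusion step
        if t in change_points:                          # planted abrupt shift
            vec = [v + rng.choice((-1.0, 1.0)) * magnitude * sigma for v in vec]
        traj.append(list(vec))
    return traj, change_points

traj, cps = synthetic_trajectory(D=8, T=200, change_points=(100,))
```

Fixing the seed keeps the dataset reproducible run to run, matching the fixed-seed requirement in the methodology rules below.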

### ArXiv Temporal

ArXiv paper abstracts embedded with SPECTER2 (D = 768), with annual snapshots from 2010-2025. Ground truth includes known field evolution patterns: RNN to LSTM to Transformer, SVM to DNN to Foundation Models. Used primarily for temporal analogy benchmarks.

### Synthetic Uniform

Random uniform vectors in D = 768 with no temporal component. Used for apples-to-apples comparison with Qdrant on vanilla kNN, following the ann-benchmarks methodology. Sizes: 1M, 5M, 10M vectors.


## Methodology

Every competitive benchmark follows these rules:

  1. Identical hardware. Both CVX and Qdrant run on the same machine with the same resource limits.
  2. Default configurations. Both systems use default settings unless tuning is explicitly part of the benchmark.
  3. Equivalent index parameters. Qdrant uses HNSW with comparable M and ef_construction values.
  4. Warm-up period. The first 10% of queries are discarded before measurement begins.
  5. Statistical rigor. Minimum 5 repetitions per measurement. Report median, p95, and p99 — not mean. Include 95% confidence intervals on all comparative claims.
  6. Version documentation. Every run records the exact CVX commit hash and Qdrant version.
  7. Reproducibility. All scripts, dataset generators (with fixed random seeds), and Docker Compose configurations are in the repository.
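
Rules 4 and 5 can be sketched as a small summarizer that drops the warm-up fraction and reports percentiles rather than the mean (`summarize_latencies` is a hypothetical helper, not part of the benchmark harness):

```python
import statistics

def summarize_latencies(latencies_ms, warmup_frac=0.10):
    """Discard the first 10% of queries as warm-up, then report
    median / p95 / p99 instead of the mean."""
    measured = sorted(latencies_ms[int(len(latencies_ms) * warmup_frac):])
    cuts = statistics.quantiles(measured, n=100, method="inclusive")
    return {
        "median": statistics.median(measured),
        "p95": cuts[94],   # 95th percentile cut point
        "p99": cuts[98],   # 99th percentile cut point
    }

# 10% slow outliers dominate the tail but leave the median untouched.
stats = summarize_latencies([1.0] * 90 + [50.0] * 10)
```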

## Success Criteria

| Criterion | Target | Status |
| --- | --- | --- |
| CVX temporal kNN recall ≥ Qdrant post-filter | recall@10 ≥ 0.95 | Pending |
| CVX trajectory retrieval faster than N Qdrant lookups | ≥ 5× speedup | Pending |
| PELT F1 on planted changes | ≥ 0.85 | Pending |
| Neural ODE < linear extrapolation error | ≥ 15% lower MSE | Pending |
| CVX vanilla kNN within 80% of Qdrant QPS | At equivalent recall | Pending |
| Delta encoding ≥ 3× compression | On slow-drift data | Pending |
| Cold tier < 5% of hot tier storage | With recall ≥ 0.90 | Pending |
| Stochastic classification accuracy | ≥ 95% on synthetic data | Pending |
| All benchmarks reproducible in CI | Green on weekly run | Pending |

## Repository Layout

```
benches/
├── datasets/                      # Dataset generation scripts
│   ├── wikipedia_temporal.py
│   ├── synthetic_drift.py
│   ├── arxiv_temporal.py
│   └── synthetic_uniform.py
├── criterion/                     # Rust micro-benchmarks
│   ├── distance_kernels.rs
│   ├── hnsw_search.rs
│   ├── delta_encoding.rs
│   └── pelt.rs
├── integration/                   # Full system benchmarks
│   ├── temporal_knn_vs_qdrant.py  # A1
│   ├── trajectory_efficiency.py   # A2
│   ├── cpd_accuracy.py            # A3
│   ├── vanilla_knn.py             # B1
│   └── ...
└── reports/
    └── generate_report.py         # Comparison charts
```
## CI Integration

| Mode | Trigger | Duration | Scope |
| --- | --- | --- | --- |
| Quick | PRs touching `benches/` | ~5 min | Criterion micro-benchmarks only |
| Full | Weekly schedule + releases | ~60 min | All categories A/B/C/D with Qdrant comparison |

Criterion benchmarks track results across git commits and alert when performance regresses by more than 5%.
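
A standalone sketch of that regression gate (our own comparison logic, not Criterion's):

```python
def regression_alert(baseline_ns, current_ns, threshold=0.05):
    """Flag a benchmark whose current time exceeds the stored baseline
    by more than the threshold (5% by default)."""
    slowdown = (current_ns - baseline_ns) / baseline_ns
    return slowdown > threshold

assert regression_alert(100.0, 106.0)       # 6% slower: alert
assert not regression_alert(100.0, 103.0)   # 3% slower: within tolerance
```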

Each benchmark run produces three outputs:

  1. JSON results — machine-readable for historical tracking
  2. Markdown summary — human-readable for PR comments and the project README
  3. PNG charts — visual comparisons (recall-QPS curves, compression ratios, latency distributions, F1 scores)