# Benchmark Strategy
## Guiding Principle

“Benchmark what matters, not what’s easy.”
We do not measure operations that every database handles identically (e.g., health check latency). We measure the operations where CVX should excel (temporal queries, trajectory analytics) and the operations where CVX must not lose (vanilla kNN, ingest throughput).
## Benchmark Categories

### Category A: Unique Capabilities (CVX-Only)

These benchmarks measure operations that no existing vector database supports natively. The comparison baseline is an ad-hoc solution a user would have to build on top of a generic VDB.
| Benchmark | What It Measures | Key Metric | Target |
|---|---|---|---|
| A1: Temporal kNN | Snapshot kNN with native temporal awareness vs. Qdrant with timestamp as payload filter | Recall@10 vs. ground truth temporal kNN | CVX recall |
| A2: Trajectory Reconstruction | Retrieve full entity trajectory (single call) vs. N individual point lookups | Latency and storage comparison | faster, less storage |
| A3: Change Point Detection | Detect semantic shifts in entity trajectories | F1 score vs. planted ground truth | PELT F1 |
| A4: Drift Attribution | Measure and explain concept drift | Correlation with known semantic shifts | Top-10 dimensions capture of shifts |
| A5: Prediction (Neural ODE) | Predict future vector states | MSE vs. linear extrapolation baseline | lower MSE |
| A6: Temporal Analogy | “What in 2018 played the role that X does in 2024?” | MRR on curated analogy dataset | MRR |
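A1 and A6 score retrieval quality against brute-force ground truth computed over the snapshot. A minimal sketch of the recall metric and the temporal ground-truth search (function names are illustrative, not CVX or Qdrant API):

```python
import numpy as np

def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the true top-k neighbors that the system under test found."""
    return len(set(retrieved_ids[:k]) & set(ground_truth_ids[:k])) / k

def temporal_knn_ground_truth(query, vectors, timestamps, t, k=10):
    """Brute-force snapshot kNN: only points visible at time t are candidates."""
    idx = np.flatnonzero(timestamps <= t)            # restrict to the snapshot
    dists = np.linalg.norm(vectors[idx] - query, axis=1)
    return idx[np.argsort(dists)[:k]]
```

The same `recall_at_k` is reused for B1; the ground-truth function is the thing a generic VDB cannot do natively, which is exactly what the payload-filter baseline approximates.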
### Category B: Competitive Parity (CVX vs. Qdrant)

These benchmarks demonstrate that CVX does not sacrifice base performance by adding temporal capabilities. The competitor is Qdrant (latest stable release) on identical hardware.
| Benchmark | What It Measures | Key Metric | Target |
|---|---|---|---|
| B1: Vanilla kNN | Pure semantic kNN (no temporal component) | QPS at recall@10 | Within 80% of Qdrant’s QPS |
| B2: Ingest Throughput | Sustained vector insertion rate | Vectors/second over 10M inserts | K vectors/sec |
| B3: Memory Efficiency | RAM usage per million vectors | RSS per million vectors | Within of Qdrant |
| B4: Concurrent Queries | Query latency under concurrent load | Latency p50/p99 at 1-100 concurrent queries | Similar degradation curve |
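B1 and B4 follow the ann-benchmarks pattern: time each query individually, discard the warm-up fraction, and report QPS plus latency percentiles rather than means. A sketch in which the `search_fn` callable stands in for either engine's client:

```python
import time
import statistics

def measure_qps(search_fn, queries, warmup_frac=0.10):
    """Single-threaded throughput run: drop the first 10% of queries as
    warm-up, then report QPS and per-query latency percentiles in ms."""
    warmup = int(len(queries) * warmup_frac)
    for q in queries[:warmup]:
        search_fn(q)                                  # warm caches, JIT, etc.
    latencies = []
    start = time.perf_counter()
    for q in queries[warmup:]:
        t0 = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start
    cuts = statistics.quantiles(latencies, n=100)     # 99 percentile cut points
    return {"qps": len(latencies) / elapsed,
            "p50_ms": cuts[49], "p99_ms": cuts[98]}
```

The concurrent-load variant (B4) would wrap `search_fn` in a thread or process pool; the percentile bookkeeping stays the same.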
### Category C: Storage Efficiency

| Benchmark | What It Measures | Key Metric | Target |
|---|---|---|---|
| C1: Delta Compression | Storage savings from delta encoding vs. full vector storage | Compression ratio by drift rate | for slow drift |
| C2: Tiered Storage | Total storage cost across hot/warm/cold tiers | Cold tier size vs. hot tier | Cold of hot with recall |
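C1's compression ratio can be estimated without the real storage engine by counting bytes under a simple sparse-delta scheme: store the first vector densely, then only the (index, value) pairs of dimensions that actually changed. This is illustrative accounting only; the tolerance and on-disk layout are assumptions, not the CVX format:

```python
import numpy as np

def delta_compression_ratio(trajectory, tol=1e-3):
    """full-storage bytes / delta-storage bytes for one trajectory."""
    traj = np.asarray(trajectory, dtype=np.float32)
    full_bytes = traj.size * 4                       # float32 everywhere
    delta_bytes = traj.shape[1] * 4                  # first frame, dense
    for prev, cur in zip(traj[:-1], traj[1:]):
        changed = np.abs(cur - prev) > tol
        delta_bytes += int(changed.sum()) * (4 + 4)  # u32 index + f32 value
    return full_bytes / delta_bytes
```

On a slow-drift trajectory most steps change few dimensions beyond the tolerance, so the ratio grows with trajectory length; on a fast-drift trajectory the index overhead can make deltas *worse* than dense storage, which is why the target is stated per drift rate.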
### Category D: Stochastic Analytics

| Benchmark | What It Measures | Key Metric | Target |
|---|---|---|---|
| D1: Characterization Accuracy | Classification accuracy of drift significance, mean reversion, Hurst exponent | Correct classification on synthetic processes | accuracy |
| D2: Signature Quality | Can signature-based kNN find trajectories with similar dynamics? | Recall@10 for same-pattern trajectories | recall |
| D3: Neural SDE Calibration | Does Neural SDE provide better calibrated uncertainty than Neural ODE? | % of true values within 95% confidence interval | Neural SDE calibration |
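D3's calibration metric is empirical coverage: a well-calibrated 95% interval should contain roughly 95% of the true values. A sketch assuming Gaussian predictive marginals (so the interval is mean ± 1.96·std):

```python
import numpy as np

def coverage_95(true_values, mean_pred, std_pred):
    """Fraction of true values inside the model's central 95% interval."""
    true_values = np.asarray(true_values)
    lo = mean_pred - 1.96 * std_pred
    hi = mean_pred + 1.96 * std_pred
    return float(np.mean((true_values >= lo) & (true_values <= hi)))
```

A Neural ODE has no predictive spread of its own, so its baseline interval has to come from something like residual variance on a validation set; coverage far below 0.95 means overconfident intervals, far above means wastefully wide ones.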
## Datasets

### Wikipedia Temporal Embeddings

Monthly snapshots of Wikipedia articles from 2018-2025, embedded with a sentence transformer. Known events (COVID-19, the Ukraine conflict, the AI boom) create natural ground-truth change points. Available in three sizes:
| Subset | Articles | Months | Total Points |
|---|---|---|---|
| Small | 10K | 84 | ~840K |
| Medium | 100K | 84 | ~8.4M |
| Large | 500K | 84 | ~42M |
### Synthetic Planted Drift

Random walks with planted abrupt shifts at known timestamps. Parameters vary across drift rate, change point magnitude, number of change points (0-5), and trajectory length (100-100K). Total: 40K trajectories with ground truth.
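A generator along these lines (parameter names and defaults are illustrative, not the repository's `synthetic_drift.py` interface) produces a trajectory plus its ground-truth change points:

```python
import numpy as np

def planted_drift_trajectory(dim=64, length=1000, drift=0.01,
                             n_changes=2, shift_magnitude=1.0, seed=0):
    """Random walk with abrupt planted shifts; returns (trajectory, change_points)."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(scale=drift, size=(length, dim))
    change_points = np.sort(rng.choice(np.arange(1, length),
                                       size=n_changes, replace=False))
    for cp in change_points:
        direction = rng.normal(size=dim)
        direction *= shift_magnitude / np.linalg.norm(direction)
        steps[cp] += direction               # one-step jump of known magnitude
    return np.cumsum(steps, axis=0), change_points
```

Because the jump magnitude and locations are known exactly, A3's F1 score has unambiguous ground truth, and sweeping `drift` vs. `shift_magnitude` controls how hard the detection problem is.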
### ArXiv Temporal Embeddings

ArXiv paper abstracts embedded with SPECTER2, with annual snapshots from 2010-2025. Ground truth includes known field evolution patterns: RNN to LSTM to Transformer, SVM to DNN to Foundation Models. Used primarily for temporal analogy benchmarks.
### Synthetic Uniform

Random uniform vectors with no temporal component. Used for an apples-to-apples comparison with Qdrant on vanilla kNN, following the ann-benchmarks methodology. Sizes: 1M, 5M, 10M vectors.
## Fair Comparison Methodology

Every competitive benchmark follows these rules:
- Identical hardware. Both CVX and Qdrant run on the same machine with the same resource limits.
- Default configurations. Both systems use default settings unless tuning is explicitly part of the benchmark.
- Equivalent index parameters. Qdrant uses HNSW with comparable `m` and `ef_construction` values.
- Warm-up period. The first 10% of queries are discarded before measurement begins.
- Statistical rigor. Minimum 5 repetitions per measurement. Report median, p95, and p99 — not mean. Include 95% confidence intervals on all comparative claims.
- Version documentation. Every run records the exact CVX commit hash and Qdrant version.
- Reproducibility. All scripts, dataset generators (with fixed random seeds), and Docker Compose configurations are in the repository.
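The "statistical rigor" rule above can be implemented with a bootstrap over the repeated measurements; a sketch:

```python
import numpy as np

def median_with_ci(samples, n_boot=2000, seed=0):
    """Median of the measurements plus a bootstrap 95% confidence interval,
    as required for all comparative claims."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    boots = np.median(
        rng.choice(samples, size=(n_boot, len(samples)), replace=True), axis=1)
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return float(np.median(samples)), (float(lo), float(hi))
```

A comparative claim ("CVX within 80% of Qdrant's QPS") then holds only if it holds at the pessimistic end of both systems' intervals, not just at the point estimates.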
## Success Criteria

| Criterion | Target | Status |
|---|---|---|
| CVX temporal kNN recall ≥ Qdrant post-filter | recall@10 | Pending |
| CVX trajectory retrieval faster than N Qdrant lookups | speedup | Pending |
| PELT F1 on planted changes | | Pending |
| Neural ODE vs. linear extrapolation error | lower MSE | Pending |
| CVX vanilla kNN within 80% of Qdrant QPS | At equivalent recall | Pending |
| Delta encoding compression | On slow-drift data | Pending |
| Cold tier of hot tier storage | With recall | Pending |
| Stochastic classification accuracy | on synthetic data | Pending |
| All benchmarks reproducible in CI | Green on weekly run | Pending |
## Infrastructure

### Benchmark Runner Structure

```
benches/
├── datasets/                        # Dataset generation scripts
│   ├── wikipedia_temporal.py
│   ├── synthetic_drift.py
│   ├── arxiv_temporal.py
│   └── synthetic_uniform.py
├── criterion/                       # Rust micro-benchmarks
│   ├── distance_kernels.rs
│   ├── hnsw_search.rs
│   ├── delta_encoding.rs
│   └── pelt.rs
├── integration/                     # Full system benchmarks
│   ├── temporal_knn_vs_qdrant.py    # A1
│   ├── trajectory_efficiency.py     # A2
│   ├── cpd_accuracy.py              # A3
│   ├── vanilla_knn.py               # B1
│   └── ...
└── reports/
    └── generate_report.py           # Comparison charts
```

### CI Integration

| Mode | Trigger | Duration | Scope |
|---|---|---|---|
| Quick | PRs touching benches/ | ~5 min | Criterion micro-benchmarks only |
| Full | Weekly schedule + releases | ~60 min | All categories A/B/C/D with Qdrant comparison |
Criterion benchmarks track results across git commits and alert when performance regresses by more than 5%.
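Criterion handles the regression alert natively for the Rust micro-benchmarks; for the integration results, the same check over stored baselines is a few lines (the `{name: median}` schema is an assumption, not an existing file format):

```python
def check_regression(current, baseline, threshold=0.05):
    """Return benchmarks whose median latency regressed by more than
    `threshold` relative to the baseline; values are fractional slowdowns.
    Both arguments map benchmark name -> median latency."""
    regressions = {}
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is not None and cur > base * (1 + threshold):
            regressions[name] = (cur - base) / base
    return regressions
```

In the weekly CI run, a non-empty return value would fail the job and surface the offending benchmarks in the PR comment.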
## Reporting

Each benchmark run produces three outputs:
- JSON results — machine-readable for historical tracking
- Markdown summary — human-readable for PR comments and the project README
- PNG charts — visual comparisons (recall-QPS curves, compression ratios, latency distributions, F1 scores)
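A possible shape for one row of the JSON output (field names are illustrative; the commit hash and engine version are recorded per the fair-comparison rules):

```python
import json
import subprocess
from datetime import datetime, timezone

def result_record(benchmark, metrics, qdrant_version="unknown"):
    """Build one machine-readable result row for historical tracking."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"        # e.g. running outside a git checkout
    return {
        "benchmark": benchmark,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "cvx_commit": commit,
        "qdrant_version": qdrant_version,
        "metrics": metrics,
    }

record = result_record("B1_vanilla_knn", {"qps": 1234.5, "recall_at_10": 0.97})
print(json.dumps(record, indent=2))
```

Appending one such record per run to a results file is enough for the historical tracking and the regression check to work from.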