# Anchor Projection & Centering

## The Problem: Anisotropic Embeddings

Modern sentence embedding models (BERT, RoBERTa, sentence-transformers) produce vectors that occupy a narrow cone in high-dimensional space. All vectors share a dominant component — the “average text” direction — and the discriminative signal is compressed into a small residual (Ethayarajh, EMNLP 2019).
**Consequence for CVX:** Without correction, cosine distances between *any* two vectors are nearly identical. Anchor projections, drift measurements, and similarity searches all operate on a signal buried under shared bias.
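To see the effect in isolation, here is a small NumPy sketch (synthetic vectors only, not CVX or a real embedding model): ten "different" texts that share one strong component are nearly indistinguishable by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64

# One strong shared direction, as anisotropic models produce
dominant = rng.standard_normal(D)
dominant = dominant / np.linalg.norm(dominant) * 5.0

# Ten "different" texts: distinct small residuals on the shared base
vecs = [dominant + rng.standard_normal(D) * 0.05 for _ in range(10)]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cos(vecs[i], vecs[j]) for i in range(10) for j in range(i + 1, 10)]
print(f"pairwise cosine range: {min(sims):.3f} .. {max(sims):.3f}")
```

Every pairwise similarity lands in a narrow band near 1.0, even though the residuals are independent.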
## The Fix: Mean Centering

Subtracting the global mean vector from every embedding removes the shared component and leaves only the discriminative residual.
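In symbols, with $v_1, \dots, v_N$ the stored vectors:

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} v_i,
\qquad
\hat{v}_i = v_i - \mu
```

Cosine similarities are then computed between the centered vectors $\hat{v}_i$ rather than the raw $v_i$.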
This amplifies the discriminative signal — empirically a 30x improvement in our experiments with MentalRoBERTa on eRisk data.
## Demonstration

```python
import chronos_vector as cvx
import numpy as np

np.random.seed(42)
D = 64

index = cvx.TemporalIndex(m=16, ef_construction=100)

# Simulate anisotropic embeddings with a strong shared direction
dominant = np.random.randn(D).astype(np.float32)
dominant = dominant / np.linalg.norm(dominant) * 5.0

# Group A: drifting toward dimensions 0-4 over time
for t in range(50):
    signal = np.zeros(D, dtype=np.float32)
    signal[0:5] = 0.3 + t * 0.005
    vec = dominant + signal + np.random.randn(D).astype(np.float32) * 0.05
    index.insert(1, t * 86400, vec.tolist())

# Group B: stable signal in dimensions 10-14
for t in range(50):
    signal = np.zeros(D, dtype=np.float32)
    signal[10:15] = 0.3
    vec = dominant + signal + np.random.randn(D).astype(np.float32) * 0.05
    index.insert(2, t * 86400, vec.tolist())
```

## Raw vs Centered Similarity
```python
centroid = index.compute_centroid()
index.set_centroid(centroid)

traj_a, traj_b = index.trajectory(1), index.trajectory(2)
va, vb = np.array(traj_a[0][1]), np.array(traj_b[0][1])
raw_sim = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))

c = np.array(centroid)
vac, vbc = va - c, vb - c
centered_sim = np.dot(vac, vbc) / (np.linalg.norm(vac) * np.linalg.norm(vbc))

print(f"Raw cosine sim: {raw_sim:.4f}")
print(f"Centered cosine sim: {centered_sim:.4f}")
print(f"Gap amplification: {(1-centered_sim)/(1-raw_sim):.0f}x")
```

```
Raw cosine sim: 0.9712
Centered cosine sim: 0.1834
Gap amplification: 28x
```

> ⚠️ **Raw similarity is useless.** Raw cosine similarities cluster around 0.97: Groups A and B are indistinguishable. After centering, the cross-group similarity drops to ~0.18, revealing the actual discriminative signal.
## CVX Centering API

```python
centroid = index.compute_centroid()   # O(N*D) single pass
index.set_centroid(centroid)          # Persisted with save/load
centered = index.centered_vector(vec) # vec - centroid
index.clear_centroid()                # Revert to raw
```

## Anchor Projection
Anchors are reference vectors representing interpretable dimensions. `project_to_anchors()` computes the cosine distance from each trajectory point to each anchor. This transforms a *D*-dimensional trajectory into a *K*-dimensional one, where *K* is the number of anchors.
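The computation itself is simple. Here is a pure-NumPy sketch of what `project_to_anchors()` does with `metric="cosine"` — a reference sketch, not the library's implementation, assuming trajectories are lists of `(timestamp, vector)` pairs:

```python
import numpy as np

def project_to_anchors_np(traj, anchors):
    """Cosine distance (1 - cosine similarity) from each
    trajectory point to each anchor."""
    A = np.array(anchors, dtype=np.float32)
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-8)
    results = []
    for ts, vec in traj:
        v = np.array(vec, dtype=np.float32)
        v = v / (np.linalg.norm(v) + 1e-8)
        results.append((ts, (1.0 - v @ A.T).tolist()))
    return results

# A point aligned with anchor 0: distance ~0 to it, ~1 to an orthogonal anchor
anchors = [[1.0, 0.0], [0.0, 1.0]]
traj = [(0, [2.0, 0.0])]
print(project_to_anchors_np(traj, anchors))
```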
In clinical NLP, anchors are DSM-5 symptom descriptions embedded by the same model. In political analysis, anchors represent rhetorical strategies.
```python
anchors = []
for i in range(3):
    a = np.zeros(D, dtype=np.float32)
    a[i*5:(i+1)*5] = 1.0
    anchors.append(a.tolist())

proj_a = cvx.project_to_anchors(traj_a, anchors, metric="cosine")
proj_b = cvx.project_to_anchors(traj_b, anchors, metric="cosine")
```

Group A shows decreasing distance to Anchor 0 over time — it is approaching that dimension. Group B stays stable across all anchors.
## Centered Projection (Recommended)

For maximum discrimination, center both the trajectory vectors and the anchor vectors:

```python
def project_centered(traj, anchors, centroid):
    c = np.array(centroid, dtype=np.float32)
    anchor_matrix = np.array(anchors) - c
    anchor_norms = np.linalg.norm(anchor_matrix, axis=1, keepdims=True) + 1e-8
    anchor_matrix = anchor_matrix / anchor_norms
    results = []
    for ts, vec in traj:
        v = np.array(vec, dtype=np.float32) - c
        v = v / (np.linalg.norm(v) + 1e-8)
        dists = (1.0 - v @ anchor_matrix.T).tolist()
        results.append((ts, dists))
    return results
```

## Anchor Summary
```python
summary = cvx.anchor_summary(proj_a)
```

| Statistic | Meaning |
|---|---|
| `mean` | Average distance to the anchor across all timesteps |
| `min` | Closest approach to the anchor |
| `trend` | Linear slope; negative means approaching the anchor |
| `last` | Distance at the most recent timestep |
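These statistics are easy to recompute outside CVX if you want to verify them. A sketch, assuming the `[(timestamp, [distances])]` projection format above and taking `trend` as the least-squares slope over timestamps (the library may fit over step indices instead):

```python
import numpy as np

def anchor_summary_np(proj, anchor_idx=0):
    """Per-anchor summary stats over one column of a projection series."""
    ts = np.array([t for t, _ in proj], dtype=np.float64)
    d = np.array([dists[anchor_idx] for _, dists in proj], dtype=np.float64)
    slope = np.polyfit(ts, d, 1)[0]  # negative slope = approaching the anchor
    return {"mean": float(d.mean()), "min": float(d.min()),
            "trend": float(slope), "last": float(d[-1])}

# Daily distances shrinking toward the anchor -> negative trend
proj = [(t * 86400, [0.9 - 0.01 * t]) for t in range(10)]
print(anchor_summary_np(proj))
```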
## References

- Ethayarajh (2019) — “How Contextual are Contextualized Word Representations?” EMNLP 2019
- Su et al. (2021) — “Whitening Sentence Representations for Better Semantics and Faster Retrieval” ACL 2021
- Li et al. (2020) — “On the Sentence Embeddings from Pre-trained Language Models” EMNLP 2020