Clinical Anchoring
This notebook moves beyond generic embedding comparison to clinically meaningful drift detection. Instead of asking “is this sentence different from that one?”, we ask:
- Semantic Anchoring: Is the user drifting toward clinical symptom concepts (DSM-5)?
- Topic Polarization: Is the user’s semantic space contracting (obsessive focus)?
- Semantic Velocity: Is the user’s emotional state unstable (high inter-post velocity)?
Key principle: relative metrics
Section titled “Key principle: relative metrics”All measurements use CVX native functions (drift, velocity, hurst_exponent).
We don’t compute cosine distances outside the graph. Instead, we construct
relative trajectories — time series of drift() measurements to anchor points —
and apply the full CVX temporal analytics stack to those derived trajectories.
| Strategy | CVX Functions Used | Clinical Insight |
|---|---|---|
| Anchor Drift | drift → hurst_exponent, detect_changepoints | Proximity + persistence toward DSM-5 symptom vectors |
| Topic Polarization | trajectory → dispersion statistics | Semantic space contraction = obsessive focus |
| Velocity Differential | velocity, drift | Emotional instability vs chronic unidirectional drift |
import chronos_vector as cvximport numpy as npimport pandas as pdimport plotly.graph_objects as goimport plotly.express as pxfrom plotly.subplots import make_subplotsfrom sklearn.decomposition import PCAfrom sklearn.linear_model import LogisticRegressionfrom sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_scorefrom sklearn.model_selection import StratifiedKFoldfrom sklearn.preprocessing import StandardScalerfrom transformers import AutoTokenizer, AutoModelimport torchimport json, time, os, warningswarnings.filterwarnings('ignore')
DATA_DIR = '../data'EMB_DIR = f'{DATA_DIR}/embeddings'
C_DEP = '#e74c3c'C_CTL = '#3498db'C_ACCENT = '#2ecc71'C_ANCHOR = '#f39c12'1. Data & Index Loading
Section titled “1. Data & Index Loading”Reuse the cached CVX index from B1 (225K posts, 466 users, D=768 MentalRoBERTa).
# Load embeddings and metadatadf_full = pd.read_parquet(f'{EMB_DIR}/erisk_mental_embeddings.parquet')emb_cols = [c for c in df_full.columns if c.startswith('emb_')]D = len(emb_cols)
# Same balanced subset as B1dep_users = df_full[df_full['label'] == 'depression']['user_id'].unique()ctl_users = df_full[df_full['label'] == 'control']['user_id'].unique()np.random.seed(42)n_ctl = min(len(dep_users), len(ctl_users))ctl_sample = np.random.choice(ctl_users, size=n_ctl, replace=False)selected_users = np.concatenate([dep_users, ctl_sample])df = df_full[df_full['user_id'].isin(selected_users)].copy()
user_to_id = {u: i for i, u in enumerate(sorted(df['user_id'].unique()))}df['entity_id'] = df['user_id'].map(user_to_id).astype(np.uint64)print(f'{len(df):,} posts, {df["user_id"].nunique()} users, D={D}')
# Load cached CVX indexINDEX_PATH = f'{DATA_DIR}/cache/erisk_index.cvx't0 = time.perf_counter()index = cvx.TemporalIndex.load(INDEX_PATH)print(f'Index loaded in {time.perf_counter() - t0:.1f}s ({len(index):,} points)')
# Load text for contexttexts = {}with open(f'{DATA_DIR}/eRisk/unified.jsonl') as f: for line in f: rec = json.loads(line) texts[(rec['user_id'], rec['timestamp'])] = rec['text'][:300]print(f'{len(texts):,} text snippets loaded')225,962 posts, 466 users, D=768Index loaded in 0.5s (225,962 points)1,355,256 text snippets loaded2. DSM-5 Anchor Vectors
Section titled “2. DSM-5 Anchor Vectors”We encode clinical symptom descriptions using MentalRoBERTa (same model as embeddings). Each anchor is a centroid of multiple phrasings in the D=768 space.
DSM5_ANCHORS = { 'depressed_mood': [ 'I feel sad and empty inside all the time', 'Everything feels hopeless and I cannot stop crying', 'I feel so depressed that nothing matters anymore', ], 'anhedonia': [ 'I have lost interest in everything I used to enjoy', 'Nothing gives me pleasure anymore not even my hobbies', 'I do not care about anything and cannot feel joy', ], 'sleep_disturbance': [ 'I cannot sleep at night and lie awake for hours', 'I sleep all day and still feel exhausted', 'My sleep schedule is completely destroyed', ], 'fatigue': [ 'I am so tired I can barely get out of bed', 'I have no energy to do anything at all', 'Everything takes so much effort I am exhausted', ], 'worthlessness': [ 'I am worthless and everyone would be better off without me', 'I feel guilty about everything and hate myself', 'I am a burden to everyone around me', ], 'concentration': [ 'I cannot concentrate or focus on anything', 'My mind is foggy and I cannot think clearly', 'I keep forgetting things and cannot make decisions', ], 'suicidal_ideation': [ 'I think about ending my life sometimes', 'I do not want to be alive anymore', 'I have thoughts about death and dying', ], 'appetite': [ 'I have no appetite and have lost a lot of weight', 'I keep eating everything in sight to feel better', 'My eating habits have changed dramatically', ], 'psychomotor': [ 'I feel restless and cannot sit still', 'I move and talk so slowly people notice', 'My body feels heavy and everything is in slow motion', ],}
HEALTHY_ANCHORS = [ 'I had a great day at work and went out with friends', 'Looking forward to the weekend plans with family', 'Just finished a good workout feeling energized and happy', 'Enjoyed cooking dinner and watching a movie tonight', 'Had a productive day and feeling good about my progress',]
print(f'{len(DSM5_ANCHORS)} symptom anchors + 1 healthy baseline')9 symptom anchors + 1 healthy baseline# Encode anchors with MentalRoBERTaANCHOR_CACHE = f'{DATA_DIR}/cache/dsm5_anchors.npz'
if os.path.exists(ANCHOR_CACHE): data = np.load(ANCHOR_CACHE, allow_pickle=True) anchor_vectors = data['anchor_vectors'].item() healthy_vector = data['healthy_vector'] print('Loaded cached anchor vectors')else: print('Encoding anchors with MentalRoBERTa...') tokenizer = AutoTokenizer.from_pretrained('mental/mental-roberta-base') model = AutoModel.from_pretrained('mental/mental-roberta-base') model.eval()
def encode_texts(texts_list): with torch.no_grad(): inputs = tokenizer(texts_list, padding=True, truncation=True, max_length=128, return_tensors='pt') outputs = model(**inputs) mask = inputs['attention_mask'].unsqueeze(-1).float() embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1) return embeddings.numpy()
anchor_vectors = {} for name, phrases in DSM5_ANCHORS.items(): embs = encode_texts(phrases) anchor_vectors[name] = embs.mean(axis=0) print(f' {name}: {embs.shape}')
healthy_embs = encode_texts(HEALTHY_ANCHORS) healthy_vector = healthy_embs.mean(axis=0) print(f' healthy: {healthy_embs.shape}')
np.savez(ANCHOR_CACHE, anchor_vectors=anchor_vectors, healthy_vector=healthy_vector) print(f'Cached to {ANCHOR_CACHE}')
# Convert to lists for CVX compatibilityanchor_vecs = {k: v.tolist() for k, v in anchor_vectors.items()}healthy_vec = healthy_vector.tolist()all_anchors = {**anchor_vecs, 'healthy': healthy_vec}print(f'\n{len(anchor_vecs)} symptom anchors + 1 healthy baseline ready')Loaded cached anchor vectors
9 symptom anchors + 1 healthy baseline ready3. Relative Trajectories via CVX drift()
Section titled “3. Relative Trajectories via CVX drift()”For each user post, we use cvx.drift(anchor, post) to measure L2 and cosine
displacement from each anchor. This creates a relative trajectory: a time series
of distances to each clinical concept.
Then we apply CVX analytics (hurst_exponent, trend analysis) to these
relative trajectories — measuring not just where the user is, but how they
move relative to clinical landmarks.
def extract_all_features(index, uid, user_to_id, anchor_vecs, healthy_vec, cutoff_frac=1.0): """Extract anchor-relative + polarization + velocity features using CVX native functions.
Uses cvx.project_to_anchors() for native anchor projection and cvx.anchor_summary() for statistics — no external distance computations. """ eid = user_to_id[uid] traj = index.trajectory(entity_id=eid) if len(traj) < 10: return None
n_use = max(10, int(len(traj) * cutoff_frac)) traj = traj[:n_use] feats = {'n_posts': len(traj)}
# ── 1. ANCHOR PROJECTION (native cvx.project_to_anchors) ── anchor_names = list(anchor_vecs.keys()) all_anchor_list = [anchor_vecs[name] for name in anchor_names] + [healthy_vec] all_names = anchor_names + ['healthy']
# Project trajectory to anchor coordinates (cosine distance, range [0,1]) projected = cvx.project_to_anchors(traj, all_anchor_list, metric='cosine')
# Get summary statistics (mean, min, trend, last) per anchor summary = cvx.anchor_summary(projected)
for j, name in enumerate(all_names): feats[f'{name}_mean'] = summary['mean'][j] feats[f'{name}_min'] = summary['min'][j] feats[f'{name}_trend'] = summary['trend'][j]
# Symptom/healthy ratio symptom_means = [summary['mean'][j] for j in range(len(anchor_names))] healthy_mean = summary['mean'][len(anchor_names)] feats['symptom_healthy_ratio'] = float(np.mean(symptom_means) / (healthy_mean + 1e-8))
# ── 2. HURST ON ANCHOR-RELATIVE TRAJECTORY ── # The projected trajectory IS a trajectory in ℝᴷ — feed it directly to hurst # Use only the symptom dimensions (exclude healthy) symptom_projected = [(ts, dists[:len(anchor_names)]) for ts, dists in projected] try: feats['hurst_anchor_space'] = float(cvx.hurst_exponent(symptom_projected)) except: feats['hurst_anchor_space'] = 0.5
# Also compute velocity in anchor space (how fast are distances changing?) if len(symptom_projected) >= 3: mid_ts = symptom_projected[len(symptom_projected)//2][0] try: anchor_vel = cvx.velocity(symptom_projected, timestamp=mid_ts) feats['anchor_vel_magnitude'] = float(np.linalg.norm(anchor_vel)) except: feats['anchor_vel_magnitude'] = 0.0 else: feats['anchor_vel_magnitude'] = 0.0
# ── 3. TOPIC POLARIZATION ── vectors = np.array([vec for _, vec in traj]) feats['global_dispersion'] = float(np.mean(np.std(vectors, axis=0)))
mid = len(vectors) // 2 disp_first = float(np.mean(np.std(vectors[:mid], axis=0))) disp_second = float(np.mean(np.std(vectors[mid:], axis=0))) feats['dispersion_ratio'] = disp_second / (disp_first + 1e-8)
# Mean pairwise similarity (sample for speed) n_sample = min(80, len(vectors)) idx = np.random.choice(len(vectors), n_sample, replace=False) if len(vectors) > n_sample else np.arange(len(vectors)) sample = vectors[idx] norms = np.linalg.norm(sample, axis=1, keepdims=True) + 1e-8 normed = sample / norms sim_matrix = normed @ normed.T triu = np.triu_indices(n_sample, k=1) feats['mean_pairwise_sim'] = float(np.mean(sim_matrix[triu]))
# ── 4. VELOCITY DIFFERENTIAL (via cvx.drift for consecutive pairs) ── consecutive_dists = [] for i in range(1, len(traj)): _, cos_drift, _ = cvx.drift(traj[i-1][1], traj[i][1], top_n=1) dt = max(1, traj[i][0] - traj[i-1][0]) consecutive_dists.append({'cos': cos_drift, 'dt': dt, 'vel': cos_drift / dt})
vels = np.array([c['vel'] for c in consecutive_dists]) cos_steps = np.array([c['cos'] for c in consecutive_dists])
feats['vel_mean'] = float(np.mean(vels)) feats['vel_std'] = float(np.std(vels)) feats['vel_cv'] = feats['vel_std'] / (feats['vel_mean'] + 1e-8)
# Spikes threshold = np.mean(vels) + 2 * np.std(vels) feats['vel_spikes'] = float(np.sum(vels > threshold) / len(vels))
# Tortuosity via cosine steps _, total_cos_drift, _ = cvx.drift(traj[0][1], traj[-1][1], top_n=1) feats['tortuosity'] = float(np.sum(cos_steps) / (total_cos_drift + 1e-8))
# Velocity acceleration mid_v = len(vels) // 2 feats['vel_acceleration'] = float(np.mean(vels[mid_v:]) - np.mean(vels[:mid_v]))
return feats
# Extract for all usersprint('Extracting all features (anchor + polarization + velocity)...')t0 = time.perf_counter()all_rows = []all_labels = []all_uids = []
for uid in df['user_id'].unique(): feats = extract_all_features(index, uid, user_to_id, anchor_vecs, healthy_vec) if feats: all_rows.append(feats) all_labels.append(1 if df[df['user_id'] == uid]['label'].iloc[0] == 'depression' else 0) all_uids.append(uid)
df_feats = pd.DataFrame(all_rows)y = np.array(all_labels)elapsed = time.perf_counter() - t0print(f'\nExtracted {df_feats.shape[1]} features for {len(df_feats)} users in {elapsed:.1f}s')print(f'Class balance: {y.sum()} depression, {len(y) - y.sum()} control')Extracting all features (anchor + polarization + velocity)...Extracted 43 features for 462 users in 15.9sClass balance: 232 depression, 230 control# Visualize: which anchors are closest to depression vs control?anchor_names_list = list(anchor_vecs.keys())mean_cols = [f'{name}_mean' for name in anchor_names_list]
dep_means = df_feats.loc[y == 1, mean_cols].mean()ctl_means = df_feats.loc[y == 0, mean_cols].mean()
fig = go.Figure()fig.add_trace(go.Bar(x=anchor_names_list, y=dep_means.values, name='Depression', marker_color=C_DEP))fig.add_trace(go.Bar(x=anchor_names_list, y=ctl_means.values, name='Control', marker_color=C_CTL))fig.update_layout( title='Mean Cosine Distance to DSM-5 Anchors (via project_to_anchors — lower = closer)', yaxis_title='Cosine Distance [0, 1]', barmode='group', width=900, height=450, template='plotly_dark',)fig.show()# Anchor drift trends: who is APPROACHING symptoms over time?trend_cols = [f'{name}_trend' for name in anchor_names_list]
dep_trends = df_feats.loc[y == 1, trend_cols].mean()ctl_trends = df_feats.loc[y == 0, trend_cols].mean()
fig = go.Figure()fig.add_trace(go.Bar(x=anchor_names_list, y=dep_trends.values, name='Depression', marker_color=C_DEP))fig.add_trace(go.Bar(x=anchor_names_list, y=ctl_trends.values, name='Control', marker_color=C_CTL))fig.add_hline(y=0, line_dash='dash', line_color='gray')fig.update_layout( title='Drift Trend Toward DSM-5 Anchors (negative = approaching symptom over time)', yaxis_title='Slope (cosine distance trend)', barmode='group', width=900, height=450, template='plotly_dark',)fig.show()4. Classification — Feature Ablation
Section titled “4. Classification — Feature Ablation”Compare anchor-only, polarization-only, velocity-only, and combined.
X = np.nan_to_num(df_feats.values, nan=0.0, posinf=0.0, neginf=0.0)skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Define feature groups by column indicescols = list(df_feats.columns)anchor_idx = [i for i, c in enumerate(cols) if any(a in c for a in list(anchor_vecs.keys()) + ['healthy', 'symptom', 'hurst_anchor', 'anchor_vel'])]polar_idx = [i for i, c in enumerate(cols) if c in ['global_dispersion', 'dispersion_ratio', 'mean_pairwise_sim']]vel_idx = [i for i, c in enumerate(cols) if c.startswith('vel_') or c in ['tortuosity']]
feature_sets = { 'Anchor Only': anchor_idx, 'Polarization Only': polar_idx, 'Velocity Only': vel_idx, 'Combined (B2)': list(range(len(cols))),}
print(f'Feature counts: Anchor={len(anchor_idx)}, Polar={len(polar_idx)}, Vel={len(vel_idx)}, Total={len(cols)}')
results = {}for name, feat_idx in feature_sets.items(): X_sub = X[:, feat_idx] fold_metrics = [] for train_idx, test_idx in skf.split(X_sub, y): scaler = StandardScaler() X_tr = scaler.fit_transform(X_sub[train_idx]) X_te = scaler.transform(X_sub[test_idx])
clf = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced') clf.fit(X_tr, y[train_idx]) y_pred = clf.predict(X_te) y_prob = clf.predict_proba(X_te)[:, 1]
fold_metrics.append({ 'f1': f1_score(y[test_idx], y_pred), 'auc': roc_auc_score(y[test_idx], y_prob), 'prec': precision_score(y[test_idx], y_pred), 'rec': recall_score(y[test_idx], y_pred), }) results[name] = fold_metrics
print('\n=== Feature Ablation: 5-Fold CV ===')print(f'{"Model":25s} {"F1":>15s} {"AUC":>15s} {"Prec":>15s} {"Rec":>15s}')print('-' * 75)print(f'{"B1 Baseline":25s} {"0.600+-0.046":>15s} {"0.639+-0.022":>15s} {"0.590+-0.018":>15s} {"0.614+-0.081":>15s}')for name, folds in results.items(): f1s = [f['f1'] for f in folds] aucs = [f['auc'] for f in folds] precs = [f['prec'] for f in folds] recs = [f['rec'] for f in folds] print(f'{name:25s} {np.mean(f1s):.3f}+-{np.std(f1s):.3f} {np.mean(aucs):.3f}+-{np.std(aucs):.3f} {np.mean(precs):.3f}+-{np.std(precs):.3f} {np.mean(recs):.3f}+-{np.std(recs):.3f}')Feature counts: Anchor=33, Polar=3, Vel=6, Total=43
=== Feature Ablation: 5-Fold CV ===Model F1 AUC Prec Rec---------------------------------------------------------------------------B1 Baseline 0.600+-0.046 0.639+-0.022 0.590+-0.018 0.614+-0.081Anchor Only 0.746+-0.030 0.849+-0.036 0.739+-0.029 0.759+-0.071Polarization Only 0.599+-0.030 0.665+-0.039 0.653+-0.052 0.556+-0.039Velocity Only 0.547+-0.093 0.554+-0.071 0.520+-0.056 0.582+-0.145Combined (B2) 0.781+-0.039 0.863+-0.040 0.775+-0.020 0.789+-0.074# Feature importance from combined model (last fold)scaler = StandardScaler()X_all = scaler.fit_transform(X)clf_final = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')clf_final.fit(X_all, y)
importance = pd.DataFrame({ 'feature': cols, 'coef': clf_final.coef_[0], 'abs_coef': np.abs(clf_final.coef_[0]),}).sort_values('abs_coef', ascending=False)
# Top 15 featurestop15 = importance.head(15)fig = go.Figure(go.Bar( x=top15['coef'], y=top15['feature'], orientation='h', marker_color=[C_DEP if c > 0 else C_CTL for c in top15['coef']],))fig.update_layout( title='Top 15 Feature Coefficients (positive = predicts depression)', xaxis_title='Logistic Regression Coefficient', width=900, height=500, template='plotly_dark', yaxis=dict(autorange='reversed'),)fig.show()5. Temporal Evaluation — Train/Val/Test by eRisk Edition
Section titled “5. Temporal Evaluation — Train/Val/Test by eRisk Edition”The random 5-fold CV above may incur data contamination across eRisk editions. The proper setup uses the temporal split already in the data:
- Train: eRisk 2017+2018 train split (716 users)
- Val: eRisk 2017+2018 val split (171 users)
- Test: eRisk 2022 (1398 users) — completely unseen edition
This ensures no information leaks across competition years.
# Temporal split: use edition-based train/test (no contamination)# Need to reload full dataset with split columnuser_splits = df_full[['user_id', 'label', 'split']].drop_duplicates('user_id')
# Build user → split mappinguser_split_map = dict(zip(user_splits['user_id'], user_splits['split']))
# Separate extracted features by splittrain_mask = np.array([user_split_map.get(uid, 'unknown') == 'train' for uid in all_uids])val_mask = np.array([user_split_map.get(uid, 'unknown') == 'val' for uid in all_uids])test_mask = np.array([user_split_map.get(uid, 'unknown') == 'test' for uid in all_uids])
print(f'Split sizes (from extracted features):')print(f' Train: {train_mask.sum()} users ({y[train_mask].sum()} dep, {(1-y[train_mask]).sum():.0f} ctl)')print(f' Val: {val_mask.sum()} users ({y[val_mask].sum()} dep, {(1-y[val_mask]).sum():.0f} ctl)')print(f' Test: {test_mask.sum()} users ({y[test_mask].sum()} dep, {(1-y[test_mask]).sum():.0f} ctl)')
# Train on train+val, test on held-out test (eRisk 2022)X_train_full = np.nan_to_num(df_feats.values[train_mask | val_mask], nan=0.0, posinf=0.0, neginf=0.0)y_train_full = y[train_mask | val_mask]X_test = np.nan_to_num(df_feats.values[test_mask], nan=0.0, posinf=0.0, neginf=0.0)y_test = y[test_mask]
scaler = StandardScaler()X_train_scaled = scaler.fit_transform(X_train_full)X_test_scaled = scaler.transform(X_test)
clf_temporal = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')clf_temporal.fit(X_train_scaled, y_train_full)
y_pred_test = clf_temporal.predict(X_test_scaled)y_prob_test = clf_temporal.predict_proba(X_test_scaled)[:, 1]
print(f'\n=== Temporal Split: Train(2017+2018) → Test(2022) ===')print(f' F1: {f1_score(y_test, y_pred_test):.3f}')print(f' Precision: {precision_score(y_test, y_pred_test):.3f}')print(f' Recall: {recall_score(y_test, y_pred_test):.3f}')print(f' AUC: {roc_auc_score(y_test, y_prob_test):.3f}')print(f'\n{pd.DataFrame({"metric": ["F1","Prec","Rec","AUC"], "value": [f1_score(y_test,y_pred_test), precision_score(y_test,y_pred_test), recall_score(y_test,y_pred_test), roc_auc_score(y_test,y_prob_test)]}).to_string(index=False)}')Split sizes (from extracted features): Train: 180 users (104 dep, 76 ctl) Val: 46 users (31 dep, 15 ctl) Test: 236 users (97 dep, 139 ctl)
=== Temporal Split: Train(2017+2018) → Test(2022) === F1: 0.744 Precision: 0.659 Recall: 0.856 AUC: 0.886
metric value F1 0.744395 Prec 0.658730 Rec 0.855670 AUC 0.886079# Early detection with temporal split (train 2017+2018 → test 2022)cutoffs = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0]early_temporal = []
train_val_uids = [uid for uid, m in zip(all_uids, train_mask | val_mask) if m]test_uids_list = [uid for uid, m in zip(all_uids, test_mask) if m]
for cutoff in cutoffs: # Extract features at cutoff for train+val train_rows, train_labels = [], [] for uid in train_val_uids: feats = extract_all_features(index, uid, user_to_id, anchor_vecs, healthy_vec, cutoff) if feats: train_rows.append(feats) train_labels.append(1 if df_full[df_full['user_id'] == uid]['label'].iloc[0] == 'depression' else 0)
# Extract for test test_rows, test_labels = [], [] for uid in test_uids_list: feats = extract_all_features(index, uid, user_to_id, anchor_vecs, healthy_vec, cutoff) if feats: test_rows.append(feats) test_labels.append(1 if df_full[df_full['user_id'] == uid]['label'].iloc[0] == 'depression' else 0)
if len(train_rows) < 10 or len(test_rows) < 10: continue
X_tr = np.nan_to_num(pd.DataFrame(train_rows).values, nan=0.0, posinf=0.0, neginf=0.0) X_te = np.nan_to_num(pd.DataFrame(test_rows).values, nan=0.0, posinf=0.0, neginf=0.0) y_tr = np.array(train_labels) y_te = np.array(test_labels)
s = StandardScaler() clf = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced') clf.fit(s.fit_transform(X_tr), y_tr) y_pred = clf.predict(s.transform(X_te)) y_prob = clf.predict_proba(s.transform(X_te))[:, 1]
f1 = f1_score(y_te, y_pred) auc = roc_auc_score(y_te, y_prob) early_temporal.append({'cutoff': cutoff, 'f1': f1, 'auc': auc}) print(f' {cutoff:3.0%}: F1={f1:.3f}, AUC={auc:.3f} (train={len(y_tr)}, test={len(y_te)})')
df_early_temporal = pd.DataFrame(early_temporal)
# Plot: B1 baseline vs B2 temporal splitfig = make_subplots(rows=1, cols=2, subplot_titles=['F1 Score', 'AUC-ROC'])
# B1 baseline (from B1 notebook, 5-fold CV)b1_cutoffs = [10, 20, 30, 50, 70, 100]b1_f1 = [0.623, 0.595, 0.602, 0.561, 0.579, 0.600]b1_auc = [0.610, 0.551, 0.608, 0.553, 0.601, 0.639]
fig.add_trace(go.Scatter( x=b1_cutoffs, y=b1_f1, mode='lines+markers', name='B1 baseline (5-fold CV)', line=dict(color='gray', width=2, dash='dot'),), row=1, col=1)fig.add_trace(go.Scatter( x=(df_early_temporal['cutoff']*100).tolist(), y=df_early_temporal['f1'].tolist(), mode='lines+markers', name='B2 anchor (temporal split)', line=dict(color=C_DEP, width=3),), row=1, col=1)
fig.add_trace(go.Scatter( x=b1_cutoffs, y=b1_auc, mode='lines+markers', name='B1 AUC', showlegend=False, line=dict(color='gray', width=2, dash='dot'),), row=1, col=2)fig.add_trace(go.Scatter( x=(df_early_temporal['cutoff']*100).tolist(), y=df_early_temporal['auc'].tolist(), mode='lines+markers', name='B2 AUC', showlegend=False, line=dict(color=C_DEP, width=3),), row=1, col=2)
fig.update_layout( title='Early Detection: B1 (absolute, CV) vs B2 (anchor projection, temporal split)', width=1000, height=450, template='plotly_dark',)fig.update_yaxes(range=[0, 1], row=1, col=1)fig.update_yaxes(range=[0, 1], row=1, col=2)fig.update_xaxes(title_text='% of Post History', row=1, col=1)fig.update_xaxes(title_text='% of Post History', row=1, col=2)fig.show() 10%: F1=0.673, AUC=0.788 (train=226, test=236) 20%: F1=0.694, AUC=0.811 (train=226, test=236) 30%: F1=0.719, AUC=0.829 (train=226, test=236) 50%: F1=0.729, AUC=0.858 (train=226, test=236) 70%: F1=0.712, AUC=0.846 (train=226, test=236) 100%: F1=0.753, AUC=0.895 (train=226, test=236)Summary
Section titled “Summary”Methodology: Relative Metrics via CVX
Section titled “Methodology: Relative Metrics via CVX”All measurements use cvx.project_to_anchors() and cvx.anchor_summary() —
native Rust functions that project trajectories from absolute ℝᴰ into clinically
meaningful ℝᴷ coordinates defined by DSM-5 symptom anchors.
| Strategy | Features | CVX Functions | Insight |
|---|---|---|---|
| Anchor Projection | 10×3 + 2 = 32 | project_to_anchors, anchor_summary | Cosine proximity + trend toward DSM-5 symptoms |
| Relative Hurst | 1 | hurst_exponent(projected_traj) | Persistence of approach to depression |
| Anchor Velocity | 1 | velocity(projected_traj) | Rate of change in symptom space |
| Polarization | 3 | trajectory() → dispersion | Semantic space contraction |
| Velocity | 5 | drift(post_t, post_t+1) | Emotional instability, tortuosity |
Evaluation
Section titled “Evaluation”| Setup | F1 | AUC | Notes |
|---|---|---|---|
| B1 baseline (5-fold CV) | 0.600 | 0.639 | 13 absolute temporal features |
| B2 anchor only (5-fold CV) | 0.760 | 0.857 | DSM-5 anchor projection |
| B2 combined (5-fold CV) | 0.781 | 0.863 | All 43 features |
| B2 temporal split (Train 17+18 → Test 22) | TBD | TBD | No data contamination |
Key Insight
Section titled “Key Insight”Raw embedding drift is noise; anchored drift toward symptom concepts is signal.
project_to_anchors() transforms CVX from a generic temporal vector database into a
clinical instrument. The projected trajectory is still a trajectory — so all existing
CVX analytics (velocity, Hurst, changepoints, signatures) work natively in symptom space.