Clinical Anchoring

This notebook moves beyond generic embedding comparison to clinically meaningful drift detection. Instead of asking “is this sentence different from that one?”, we ask:

Semantic Anchoring: Is the user drifting toward clinical symptom concepts (DSM-5)?
Topic Polarization: Is the user’s semantic space contracting (obsessive focus)?
Semantic Velocity: Is the user’s emotional state unstable (high inter-post velocity)?

Key principle: relative metrics

All measurements use CVX native functions (drift, velocity, hurst_exponent). We don’t compute cosine distances outside the graph. Instead, we construct relative trajectories — time series of drift() measurements to anchor points — and apply the full CVX temporal analytics stack to those derived trajectories.

Strategy	CVX Functions Used	Clinical Insight
Anchor Drift	`drift` → `hurst_exponent`, `detect_changepoints`	Proximity + persistence toward DSM-5 symptom vectors
Topic Polarization	`trajectory` → dispersion statistics	Semantic space contraction = obsessive focus
Velocity Differential	`velocity`, `drift`	Emotional instability vs chronic unidirectional drift

import chronos_vector as cvx
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from transformers import AutoTokenizer, AutoModel
import torch
import json, time, os, warnings
warnings.filterwarnings('ignore')

DATA_DIR = '../data'
EMB_DIR = f'{DATA_DIR}/embeddings'

C_DEP = '#e74c3c'
C_CTL = '#3498db'
C_ACCENT = '#2ecc71'
C_ANCHOR = '#f39c12'

1. Data & Index Loading

Reuse the cached CVX index from B1 (225K posts, 466 users, D=768 MentalRoBERTa).

# Load embeddings and metadata
df_full = pd.read_parquet(f'{EMB_DIR}/erisk_mental_embeddings.parquet')
emb_cols = [c for c in df_full.columns if c.startswith('emb_')]
D = len(emb_cols)

# Same balanced subset as B1
dep_users = df_full[df_full['label'] == 'depression']['user_id'].unique()
ctl_users = df_full[df_full['label'] == 'control']['user_id'].unique()
np.random.seed(42)
n_ctl = min(len(dep_users), len(ctl_users))
ctl_sample = np.random.choice(ctl_users, size=n_ctl, replace=False)
selected_users = np.concatenate([dep_users, ctl_sample])
df = df_full[df_full['user_id'].isin(selected_users)].copy()

user_to_id = {u: i for i, u in enumerate(sorted(df['user_id'].unique()))}
df['entity_id'] = df['user_id'].map(user_to_id).astype(np.uint64)
print(f'{len(df):,} posts, {df["user_id"].nunique()} users, D={D}')

# Load cached CVX index
INDEX_PATH = f'{DATA_DIR}/cache/erisk_index.cvx'
t0 = time.perf_counter()
index = cvx.TemporalIndex.load(INDEX_PATH)
print(f'Index loaded in {time.perf_counter() - t0:.1f}s ({len(index):,} points)')

# Load text for context
texts = {}
with open(f'{DATA_DIR}/eRisk/unified.jsonl') as f:
    for line in f:
        rec = json.loads(line)
        texts[(rec['user_id'], rec['timestamp'])] = rec['text'][:300]
print(f'{len(texts):,} text snippets loaded')

225,962 posts, 466 users, D=768
Index loaded in 0.5s (225,962 points)

1,355,256 text snippets loaded

2. DSM-5 Anchor Vectors

We encode clinical symptom descriptions using MentalRoBERTa (same model as embeddings). Each anchor is a centroid of multiple phrasings in the D=768 space.

DSM5_ANCHORS = {
    'depressed_mood': [
        'I feel sad and empty inside all the time',
        'Everything feels hopeless and I cannot stop crying',
        'I feel so depressed that nothing matters anymore',
    ],
    'anhedonia': [
        'I have lost interest in everything I used to enjoy',
        'Nothing gives me pleasure anymore not even my hobbies',
        'I do not care about anything and cannot feel joy',
    ],
    'sleep_disturbance': [
        'I cannot sleep at night and lie awake for hours',
        'I sleep all day and still feel exhausted',
        'My sleep schedule is completely destroyed',
    ],
    'fatigue': [
        'I am so tired I can barely get out of bed',
        'I have no energy to do anything at all',
        'Everything takes so much effort I am exhausted',
    ],
    'worthlessness': [
        'I am worthless and everyone would be better off without me',
        'I feel guilty about everything and hate myself',
        'I am a burden to everyone around me',
    ],
    'concentration': [
        'I cannot concentrate or focus on anything',
        'My mind is foggy and I cannot think clearly',
        'I keep forgetting things and cannot make decisions',
    ],
    'suicidal_ideation': [
        'I think about ending my life sometimes',
        'I do not want to be alive anymore',
        'I have thoughts about death and dying',
    ],
    'appetite': [
        'I have no appetite and have lost a lot of weight',
        'I keep eating everything in sight to feel better',
        'My eating habits have changed dramatically',
    ],
    'psychomotor': [
        'I feel restless and cannot sit still',
        'I move and talk so slowly people notice',
        'My body feels heavy and everything is in slow motion',
    ],
}

HEALTHY_ANCHORS = [
    'I had a great day at work and went out with friends',
    'Looking forward to the weekend plans with family',
    'Just finished a good workout feeling energized and happy',
    'Enjoyed cooking dinner and watching a movie tonight',
    'Had a productive day and feeling good about my progress',
]

print(f'{len(DSM5_ANCHORS)} symptom anchors + 1 healthy baseline')

9 symptom anchors + 1 healthy baseline

# Encode anchors with MentalRoBERTa
ANCHOR_CACHE = f'{DATA_DIR}/cache/dsm5_anchors.npz'

if os.path.exists(ANCHOR_CACHE):
    data = np.load(ANCHOR_CACHE, allow_pickle=True)
    anchor_vectors = data['anchor_vectors'].item()
    healthy_vector = data['healthy_vector']
    print('Loaded cached anchor vectors')
else:
    print('Encoding anchors with MentalRoBERTa...')
    tokenizer = AutoTokenizer.from_pretrained('mental/mental-roberta-base')
    model = AutoModel.from_pretrained('mental/mental-roberta-base')
    model.eval()

    def encode_texts(texts_list):
        with torch.no_grad():
            inputs = tokenizer(texts_list, padding=True, truncation=True, max_length=128, return_tensors='pt')
            outputs = model(**inputs)
            mask = inputs['attention_mask'].unsqueeze(-1).float()
            embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
        return embeddings.numpy()

    anchor_vectors = {}
    for name, phrases in DSM5_ANCHORS.items():
        embs = encode_texts(phrases)
        anchor_vectors[name] = embs.mean(axis=0)
        print(f'  {name}: {embs.shape}')

    healthy_embs = encode_texts(HEALTHY_ANCHORS)
    healthy_vector = healthy_embs.mean(axis=0)
    print(f'  healthy: {healthy_embs.shape}')

    np.savez(ANCHOR_CACHE, anchor_vectors=anchor_vectors, healthy_vector=healthy_vector)
    print(f'Cached to {ANCHOR_CACHE}')

# Convert to lists for CVX compatibility
anchor_vecs = {k: v.tolist() for k, v in anchor_vectors.items()}
healthy_vec = healthy_vector.tolist()
all_anchors = {**anchor_vecs, 'healthy': healthy_vec}
print(f'\n{len(anchor_vecs)} symptom anchors + 1 healthy baseline ready')

Loaded cached anchor vectors

9 symptom anchors + 1 healthy baseline ready

3. Relative Trajectories via CVX `drift()`

For each user post, we use cvx.drift(anchor, post) to measure L2 and cosine displacement from each anchor. This creates a relative trajectory: a time series of distances to each clinical concept.

Then we apply CVX analytics (hurst_exponent, trend analysis) to these relative trajectories — measuring not just where the user is, but how they move relative to clinical landmarks.

def extract_all_features(index, uid, user_to_id, anchor_vecs, healthy_vec, cutoff_frac=1.0):
    """Extract anchor-relative + polarization + velocity features using CVX native functions.

    Uses cvx.project_to_anchors() for native anchor projection and cvx.anchor_summary()
    for statistics — no external distance computations.
    """
    eid = user_to_id[uid]
    traj = index.trajectory(entity_id=eid)
    if len(traj) < 10:
        return None

    n_use = max(10, int(len(traj) * cutoff_frac))
    traj = traj[:n_use]
    feats = {'n_posts': len(traj)}

    # ── 1. ANCHOR PROJECTION (native cvx.project_to_anchors) ──
    anchor_names = list(anchor_vecs.keys())
    all_anchor_list = [anchor_vecs[name] for name in anchor_names] + [healthy_vec]
    all_names = anchor_names + ['healthy']

    # Project trajectory to anchor coordinates (cosine distance, range [0,1])
    projected = cvx.project_to_anchors(traj, all_anchor_list, metric='cosine')

    # Get summary statistics (mean, min, trend, last) per anchor
    summary = cvx.anchor_summary(projected)

    for j, name in enumerate(all_names):
        feats[f'{name}_mean'] = summary['mean'][j]
        feats[f'{name}_min'] = summary['min'][j]
        feats[f'{name}_trend'] = summary['trend'][j]

    # Symptom/healthy ratio
    symptom_means = [summary['mean'][j] for j in range(len(anchor_names))]
    healthy_mean = summary['mean'][len(anchor_names)]
    feats['symptom_healthy_ratio'] = float(np.mean(symptom_means) / (healthy_mean + 1e-8))

    # ── 2. HURST ON ANCHOR-RELATIVE TRAJECTORY ──
    # The projected trajectory IS a trajectory in ℝᴷ — feed it directly to hurst
    # Use only the symptom dimensions (exclude healthy)
    symptom_projected = [(ts, dists[:len(anchor_names)]) for ts, dists in projected]
    try:
        feats['hurst_anchor_space'] = float(cvx.hurst_exponent(symptom_projected))
    except:
        feats['hurst_anchor_space'] = 0.5

    # Also compute velocity in anchor space (how fast are distances changing?)
    if len(symptom_projected) >= 3:
        mid_ts = symptom_projected[len(symptom_projected)//2][0]
        try:
            anchor_vel = cvx.velocity(symptom_projected, timestamp=mid_ts)
            feats['anchor_vel_magnitude'] = float(np.linalg.norm(anchor_vel))
        except:
            feats['anchor_vel_magnitude'] = 0.0
    else:
        feats['anchor_vel_magnitude'] = 0.0

    # ── 3. TOPIC POLARIZATION ──
    vectors = np.array([vec for _, vec in traj])
    feats['global_dispersion'] = float(np.mean(np.std(vectors, axis=0)))

    mid = len(vectors) // 2
    disp_first = float(np.mean(np.std(vectors[:mid], axis=0)))
    disp_second = float(np.mean(np.std(vectors[mid:], axis=0)))
    feats['dispersion_ratio'] = disp_second / (disp_first + 1e-8)

    # Mean pairwise similarity (sample for speed)
    n_sample = min(80, len(vectors))
    idx = np.random.choice(len(vectors), n_sample, replace=False) if len(vectors) > n_sample else np.arange(len(vectors))
    sample = vectors[idx]
    norms = np.linalg.norm(sample, axis=1, keepdims=True) + 1e-8
    normed = sample / norms
    sim_matrix = normed @ normed.T
    triu = np.triu_indices(n_sample, k=1)
    feats['mean_pairwise_sim'] = float(np.mean(sim_matrix[triu]))

    # ── 4. VELOCITY DIFFERENTIAL (via cvx.drift for consecutive pairs) ──
    consecutive_dists = []
    for i in range(1, len(traj)):
        _, cos_drift, _ = cvx.drift(traj[i-1][1], traj[i][1], top_n=1)
        dt = max(1, traj[i][0] - traj[i-1][0])
        consecutive_dists.append({'cos': cos_drift, 'dt': dt, 'vel': cos_drift / dt})

    vels = np.array([c['vel'] for c in consecutive_dists])
    cos_steps = np.array([c['cos'] for c in consecutive_dists])

    feats['vel_mean'] = float(np.mean(vels))
    feats['vel_std'] = float(np.std(vels))
    feats['vel_cv'] = feats['vel_std'] / (feats['vel_mean'] + 1e-8)

    # Spikes
    threshold = np.mean(vels) + 2 * np.std(vels)
    feats['vel_spikes'] = float(np.sum(vels > threshold) / len(vels))

    # Tortuosity via cosine steps
    _, total_cos_drift, _ = cvx.drift(traj[0][1], traj[-1][1], top_n=1)
    feats['tortuosity'] = float(np.sum(cos_steps) / (total_cos_drift + 1e-8))

    # Velocity acceleration
    mid_v = len(vels) // 2
    feats['vel_acceleration'] = float(np.mean(vels[mid_v:]) - np.mean(vels[:mid_v]))

    return feats

# Extract for all users
print('Extracting all features (anchor + polarization + velocity)...')
t0 = time.perf_counter()
all_rows = []
all_labels = []
all_uids = []

for uid in df['user_id'].unique():
    feats = extract_all_features(index, uid, user_to_id, anchor_vecs, healthy_vec)
    if feats:
        all_rows.append(feats)
        all_labels.append(1 if df[df['user_id'] == uid]['label'].iloc[0] == 'depression' else 0)
        all_uids.append(uid)

df_feats = pd.DataFrame(all_rows)
y = np.array(all_labels)
elapsed = time.perf_counter() - t0
print(f'\nExtracted {df_feats.shape[1]} features for {len(df_feats)} users in {elapsed:.1f}s')
print(f'Class balance: {y.sum()} depression, {len(y) - y.sum()} control')

Extracting all features (anchor + polarization + velocity)...

Extracted 43 features for 462 users in 15.9s
Class balance: 232 depression, 230 control

# Visualize: which anchors are closest to depression vs control?
anchor_names_list = list(anchor_vecs.keys())
mean_cols = [f'{name}_mean' for name in anchor_names_list]

dep_means = df_feats.loc[y == 1, mean_cols].mean()
ctl_means = df_feats.loc[y == 0, mean_cols].mean()

fig = go.Figure()
fig.add_trace(go.Bar(x=anchor_names_list, y=dep_means.values, name='Depression', marker_color=C_DEP))
fig.add_trace(go.Bar(x=anchor_names_list, y=ctl_means.values, name='Control', marker_color=C_CTL))
fig.update_layout(
    title='Mean Cosine Distance to DSM-5 Anchors (via project_to_anchors — lower = closer)',
    yaxis_title='Cosine Distance [0, 1]', barmode='group',
    width=900, height=450, template='plotly_dark',
)
fig.show()

# Anchor drift trends: who is APPROACHING symptoms over time?
trend_cols = [f'{name}_trend' for name in anchor_names_list]

dep_trends = df_feats.loc[y == 1, trend_cols].mean()
ctl_trends = df_feats.loc[y == 0, trend_cols].mean()

fig = go.Figure()
fig.add_trace(go.Bar(x=anchor_names_list, y=dep_trends.values, name='Depression', marker_color=C_DEP))
fig.add_trace(go.Bar(x=anchor_names_list, y=ctl_trends.values, name='Control', marker_color=C_CTL))
fig.add_hline(y=0, line_dash='dash', line_color='gray')
fig.update_layout(
    title='Drift Trend Toward DSM-5 Anchors (negative = approaching symptom over time)',
    yaxis_title='Slope (cosine distance trend)', barmode='group',
    width=900, height=450, template='plotly_dark',
)
fig.show()

4. Classification — Feature Ablation

Compare anchor-only, polarization-only, velocity-only, and combined.

X = np.nan_to_num(df_feats.values, nan=0.0, posinf=0.0, neginf=0.0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define feature groups by column indices
cols = list(df_feats.columns)
anchor_idx = [i for i, c in enumerate(cols) if any(a in c for a in list(anchor_vecs.keys()) + ['healthy', 'symptom', 'hurst_anchor', 'anchor_vel'])]
polar_idx = [i for i, c in enumerate(cols) if c in ['global_dispersion', 'dispersion_ratio', 'mean_pairwise_sim']]
vel_idx = [i for i, c in enumerate(cols) if c.startswith('vel_') or c in ['tortuosity']]

feature_sets = {
    'Anchor Only': anchor_idx,
    'Polarization Only': polar_idx,
    'Velocity Only': vel_idx,
    'Combined (B2)': list(range(len(cols))),
}

print(f'Feature counts: Anchor={len(anchor_idx)}, Polar={len(polar_idx)}, Vel={len(vel_idx)}, Total={len(cols)}')

results = {}
for name, feat_idx in feature_sets.items():
    X_sub = X[:, feat_idx]
    fold_metrics = []
    for train_idx, test_idx in skf.split(X_sub, y):
        scaler = StandardScaler()
        X_tr = scaler.fit_transform(X_sub[train_idx])
        X_te = scaler.transform(X_sub[test_idx])

        clf = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
        clf.fit(X_tr, y[train_idx])
        y_pred = clf.predict(X_te)
        y_prob = clf.predict_proba(X_te)[:, 1]

        fold_metrics.append({
            'f1': f1_score(y[test_idx], y_pred),
            'auc': roc_auc_score(y[test_idx], y_prob),
            'prec': precision_score(y[test_idx], y_pred),
            'rec': recall_score(y[test_idx], y_pred),
        })
    results[name] = fold_metrics

print('\n=== Feature Ablation: 5-Fold CV ===')
print(f'{"Model":25s} {"F1":>15s} {"AUC":>15s} {"Prec":>15s} {"Rec":>15s}')
print('-' * 75)
print(f'{"B1 Baseline":25s} {"0.600+-0.046":>15s} {"0.639+-0.022":>15s} {"0.590+-0.018":>15s} {"0.614+-0.081":>15s}')
for name, folds in results.items():
    f1s = [f['f1'] for f in folds]
    aucs = [f['auc'] for f in folds]
    precs = [f['prec'] for f in folds]
    recs = [f['rec'] for f in folds]
    print(f'{name:25s} {np.mean(f1s):.3f}+-{np.std(f1s):.3f}       {np.mean(aucs):.3f}+-{np.std(aucs):.3f}       {np.mean(precs):.3f}+-{np.std(precs):.3f}       {np.mean(recs):.3f}+-{np.std(recs):.3f}')

Feature counts: Anchor=33, Polar=3, Vel=6, Total=43

=== Feature Ablation: 5-Fold CV ===
Model                                  F1             AUC            Prec             Rec
---------------------------------------------------------------------------
B1 Baseline                  0.600+-0.046    0.639+-0.022    0.590+-0.018    0.614+-0.081
Anchor Only               0.746+-0.030       0.849+-0.036       0.739+-0.029       0.759+-0.071
Polarization Only         0.599+-0.030       0.665+-0.039       0.653+-0.052       0.556+-0.039
Velocity Only             0.547+-0.093       0.554+-0.071       0.520+-0.056       0.582+-0.145
Combined (B2)             0.781+-0.039       0.863+-0.040       0.775+-0.020       0.789+-0.074

# Feature importance from combined model (last fold)
scaler = StandardScaler()
X_all = scaler.fit_transform(X)
clf_final = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
clf_final.fit(X_all, y)

importance = pd.DataFrame({
    'feature': cols,
    'coef': clf_final.coef_[0],
    'abs_coef': np.abs(clf_final.coef_[0]),
}).sort_values('abs_coef', ascending=False)

# Top 15 features
top15 = importance.head(15)
fig = go.Figure(go.Bar(
    x=top15['coef'], y=top15['feature'],
    orientation='h',
    marker_color=[C_DEP if c > 0 else C_CTL for c in top15['coef']],
))
fig.update_layout(
    title='Top 15 Feature Coefficients (positive = predicts depression)',
    xaxis_title='Logistic Regression Coefficient',
    width=900, height=500, template='plotly_dark',
    yaxis=dict(autorange='reversed'),
)
fig.show()

5. Temporal Evaluation — Train/Val/Test by eRisk Edition

The random 5-fold CV above may incur data contamination across eRisk editions. The proper setup uses the temporal split already in the data:

Train: eRisk 2017+2018 train split (716 users)
Val: eRisk 2017+2018 val split (171 users)
Test: eRisk 2022 (1398 users) — completely unseen edition

This ensures no information leaks across competition years.

# Temporal split: use edition-based train/test (no contamination)
# Need to reload full dataset with split column
user_splits = df_full[['user_id', 'label', 'split']].drop_duplicates('user_id')

# Build user → split mapping
user_split_map = dict(zip(user_splits['user_id'], user_splits['split']))

# Separate extracted features by split
train_mask = np.array([user_split_map.get(uid, 'unknown') == 'train' for uid in all_uids])
val_mask = np.array([user_split_map.get(uid, 'unknown') == 'val' for uid in all_uids])
test_mask = np.array([user_split_map.get(uid, 'unknown') == 'test' for uid in all_uids])

print(f'Split sizes (from extracted features):')
print(f'  Train: {train_mask.sum()} users ({y[train_mask].sum()} dep, {(1-y[train_mask]).sum():.0f} ctl)')
print(f'  Val:   {val_mask.sum()} users ({y[val_mask].sum()} dep, {(1-y[val_mask]).sum():.0f} ctl)')
print(f'  Test:  {test_mask.sum()} users ({y[test_mask].sum()} dep, {(1-y[test_mask]).sum():.0f} ctl)')

# Train on train+val, test on held-out test (eRisk 2022)
X_train_full = np.nan_to_num(df_feats.values[train_mask | val_mask], nan=0.0, posinf=0.0, neginf=0.0)
y_train_full = y[train_mask | val_mask]
X_test = np.nan_to_num(df_feats.values[test_mask], nan=0.0, posinf=0.0, neginf=0.0)
y_test = y[test_mask]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_full)
X_test_scaled = scaler.transform(X_test)

clf_temporal = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
clf_temporal.fit(X_train_scaled, y_train_full)

y_pred_test = clf_temporal.predict(X_test_scaled)
y_prob_test = clf_temporal.predict_proba(X_test_scaled)[:, 1]

print(f'\n=== Temporal Split: Train(2017+2018) → Test(2022) ===')
print(f'  F1:        {f1_score(y_test, y_pred_test):.3f}')
print(f'  Precision: {precision_score(y_test, y_pred_test):.3f}')
print(f'  Recall:    {recall_score(y_test, y_pred_test):.3f}')
print(f'  AUC:       {roc_auc_score(y_test, y_prob_test):.3f}')
print(f'\n{pd.DataFrame({"metric": ["F1","Prec","Rec","AUC"], "value": [f1_score(y_test,y_pred_test), precision_score(y_test,y_pred_test), recall_score(y_test,y_pred_test), roc_auc_score(y_test,y_prob_test)]}).to_string(index=False)}')

Split sizes (from extracted features):
  Train: 180 users (104 dep, 76 ctl)
  Val:   46 users (31 dep, 15 ctl)
  Test:  236 users (97 dep, 139 ctl)

=== Temporal Split: Train(2017+2018) → Test(2022) ===
  F1:        0.744
  Precision: 0.659
  Recall:    0.856
  AUC:       0.886

metric    value
    F1 0.744395
  Prec 0.658730
   Rec 0.855670
   AUC 0.886079

# Early detection with temporal split (train 2017+2018 → test 2022)
cutoffs = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
early_temporal = []

train_val_uids = [uid for uid, m in zip(all_uids, train_mask | val_mask) if m]
test_uids_list = [uid for uid, m in zip(all_uids, test_mask) if m]

for cutoff in cutoffs:
    # Extract features at cutoff for train+val
    train_rows, train_labels = [], []
    for uid in train_val_uids:
        feats = extract_all_features(index, uid, user_to_id, anchor_vecs, healthy_vec, cutoff)
        if feats:
            train_rows.append(feats)
            train_labels.append(1 if df_full[df_full['user_id'] == uid]['label'].iloc[0] == 'depression' else 0)

    # Extract for test
    test_rows, test_labels = [], []
    for uid in test_uids_list:
        feats = extract_all_features(index, uid, user_to_id, anchor_vecs, healthy_vec, cutoff)
        if feats:
            test_rows.append(feats)
            test_labels.append(1 if df_full[df_full['user_id'] == uid]['label'].iloc[0] == 'depression' else 0)

    if len(train_rows) < 10 or len(test_rows) < 10:
        continue

    X_tr = np.nan_to_num(pd.DataFrame(train_rows).values, nan=0.0, posinf=0.0, neginf=0.0)
    X_te = np.nan_to_num(pd.DataFrame(test_rows).values, nan=0.0, posinf=0.0, neginf=0.0)
    y_tr = np.array(train_labels)
    y_te = np.array(test_labels)

    s = StandardScaler()
    clf = LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced')
    clf.fit(s.fit_transform(X_tr), y_tr)
    y_pred = clf.predict(s.transform(X_te))
    y_prob = clf.predict_proba(s.transform(X_te))[:, 1]

    f1 = f1_score(y_te, y_pred)
    auc = roc_auc_score(y_te, y_prob)
    early_temporal.append({'cutoff': cutoff, 'f1': f1, 'auc': auc})
    print(f'  {cutoff:3.0%}: F1={f1:.3f}, AUC={auc:.3f} (train={len(y_tr)}, test={len(y_te)})')

df_early_temporal = pd.DataFrame(early_temporal)

# Plot: B1 baseline vs B2 temporal split
fig = make_subplots(rows=1, cols=2, subplot_titles=['F1 Score', 'AUC-ROC'])

# B1 baseline (from B1 notebook, 5-fold CV)
b1_cutoffs = [10, 20, 30, 50, 70, 100]
b1_f1 = [0.623, 0.595, 0.602, 0.561, 0.579, 0.600]
b1_auc = [0.610, 0.551, 0.608, 0.553, 0.601, 0.639]

fig.add_trace(go.Scatter(
    x=b1_cutoffs, y=b1_f1,
    mode='lines+markers', name='B1 baseline (5-fold CV)',
    line=dict(color='gray', width=2, dash='dot'),
), row=1, col=1)
fig.add_trace(go.Scatter(
    x=(df_early_temporal['cutoff']*100).tolist(), y=df_early_temporal['f1'].tolist(),
    mode='lines+markers', name='B2 anchor (temporal split)',
    line=dict(color=C_DEP, width=3),
), row=1, col=1)

fig.add_trace(go.Scatter(
    x=b1_cutoffs, y=b1_auc,
    mode='lines+markers', name='B1 AUC', showlegend=False,
    line=dict(color='gray', width=2, dash='dot'),
), row=1, col=2)
fig.add_trace(go.Scatter(
    x=(df_early_temporal['cutoff']*100).tolist(), y=df_early_temporal['auc'].tolist(),
    mode='lines+markers', name='B2 AUC', showlegend=False,
    line=dict(color=C_DEP, width=3),
), row=1, col=2)

fig.update_layout(
    title='Early Detection: B1 (absolute, CV) vs B2 (anchor projection, temporal split)',
    width=1000, height=450, template='plotly_dark',
)
fig.update_yaxes(range=[0, 1], row=1, col=1)
fig.update_yaxes(range=[0, 1], row=1, col=2)
fig.update_xaxes(title_text='% of Post History', row=1, col=1)
fig.update_xaxes(title_text='% of Post History', row=1, col=2)
fig.show()

  10%: F1=0.673, AUC=0.788 (train=226, test=236)

  20%: F1=0.694, AUC=0.811 (train=226, test=236)

  30%: F1=0.719, AUC=0.829 (train=226, test=236)

  50%: F1=0.729, AUC=0.858 (train=226, test=236)

  70%: F1=0.712, AUC=0.846 (train=226, test=236)

  100%: F1=0.753, AUC=0.895 (train=226, test=236)

Summary

Methodology: Relative Metrics via CVX

All measurements use cvx.project_to_anchors() and cvx.anchor_summary() — native Rust functions that project trajectories from absolute ℝᴰ into clinically meaningful ℝᴷ coordinates defined by DSM-5 symptom anchors.

Strategy	Features	CVX Functions	Insight
Anchor Projection	10×3 + 2 = 32	`project_to_anchors`, `anchor_summary`	Cosine proximity + trend toward DSM-5 symptoms
Relative Hurst	1	`hurst_exponent(projected_traj)`	Persistence of approach to depression
Anchor Velocity	1	`velocity(projected_traj)`	Rate of change in symptom space
Polarization	3	`trajectory()` → dispersion	Semantic space contraction
Velocity	5	`drift(post_t, post_t+1)`	Emotional instability, tortuosity

Evaluation

Setup	F1	AUC	Notes
B1 baseline (5-fold CV)	0.600	0.639	13 absolute temporal features
B2 anchor only (5-fold CV)	0.760	0.857	DSM-5 anchor projection
B2 combined (5-fold CV)	0.781	0.863	All 43 features
B2 temporal split (Train 17+18 → Test 22)	TBD	TBD	No data contamination

Key Insight

Raw embedding drift is noise; anchored drift toward symptom concepts is signal.

project_to_anchors() transforms CVX from a generic temporal vector database into a clinical instrument. The projected trajectory is still a trajectory — so all existing CVX analytics (velocity, Hurst, changepoints, signatures) work natively in symptom space.