Mental Health & Clinical NLP

CVX enables a fundamentally different approach to mental health detection from social media: instead of treating each post as an independent feature vector, it tracks how a user’s language evolves over time — velocity, drift direction, change points, and proximity to clinical symptom anchors.

This is the most mature application domain for CVX, validated on the eRisk shared task dataset (1.36M Reddit posts, 2,285 users) with MentalRoBERTa embeddings (D=768).

Embedding Anisotropy Correction (30x Signal Improvement)

MentalRoBERTa embeddings occupy a narrow cone in high-dimensional space — all pairwise cosine similarities are ~0.96 regardless of content. This makes raw anchor projections useless.

Centering (subtracting the global mean vector from every embedding) widens the gap between depression and control users from 0.011 to 0.33, a 30x improvement in discriminative signal:

| Metric | Before centering | After centering |
|---|---|---|
| Depression user → depressed_mood anchor (cosine similarity) | 0.975 | 0.42 |
| Control user → depressed_mood anchor (cosine similarity) | 0.964 | 0.09 |
| Discriminative gap | 0.011 | 0.33 |

This correction benefits ALL downstream CVX operations. See RFC-012 Part B for the academic references (Ethayarajh 2019, Su et al. 2021).
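
The effect of centering can be reproduced on synthetic data. The sketch below is not CVX code and does not use MentalRoBERTa: it assumes a toy model in which every vector shares a large common offset (the narrow cone) plus a small hypothetical symptom direction, and the names `cosine_to`, `gap`, `common` and `symptom` are illustrative only. The resulting numbers will not match the table above; the point is only that the gap grows once the shared component is subtracted.

```python
import numpy as np

def cosine_to(vectors, anchor):
    """Cosine similarity of each row of `vectors` to a single `anchor` vector."""
    num = vectors @ anchor
    den = np.linalg.norm(vectors, axis=1) * np.linalg.norm(anchor)
    return num / den

def gap(dep, ctl, anc):
    """Mean similarity of depression rows minus mean similarity of control rows."""
    return float(cosine_to(dep, anc).mean() - cosine_to(ctl, anc).mean())

rng = np.random.default_rng(0)
dim, n = 768, 200

# Assumption: a large shared offset dominates every embedding (anisotropy).
common = 5.0 * rng.normal(size=dim)
# Assumption: a hypothetical direction that only depression-like text carries.
symptom = rng.normal(size=dim)

depression = common + symptom + rng.normal(size=(n, dim))
control = common + rng.normal(size=(n, dim))
anchor = common + 2.0 * symptom

corpus = np.vstack([depression, control])
raw_similarity = float(cosine_to(corpus, anchor).mean())
raw_gap = gap(depression, control, anchor)

# Centering: subtract the global mean from posts and from the anchor alike.
global_mean = corpus.mean(axis=0)
centered_gap = gap(depression - global_mean,
                   control - global_mean,
                   anchor - global_mean)

print(f"mean raw similarity: {raw_similarity:.3f}")
print(f"raw gap:             {raw_gap:.3f}")
print(f"centered gap:        {centered_gap:.3f}")
```

Note that the anchor is centered with the same mean as the posts; centering only one side would leave the shared offset in the dot product.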

The anchor set consists of 9 DSM-5 symptom anchors plus 1 healthy baseline, each encoded as the MentalRoBERTa centroid of representative clinical phrases. After centering, population-level profiles clearly separate depression from control:

  • Depression users: highest proximity to depressed_mood (0.35), worthlessness (0.31), anhedonia (0.28)
  • Control users: uniformly low proximity across all symptoms (0.05-0.09)
  • Drift direction: depression users show increasing proximity to symptoms over time; controls remain stable
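
The anchor profile described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the implementation behind `project_to_anchors()` or `anchor_summary()`: the helper names `build_anchor`, `anchor_proximity` and `proximity_trend` are hypothetical, the embeddings are synthetic 16-dimensional vectors, and the trend is taken to be an ordinary least-squares slope of proximity over time.

```python
import numpy as np

def build_anchor(phrase_embeddings, global_mean):
    """Centroid of the centered phrase embeddings for one symptom."""
    return (phrase_embeddings - global_mean).mean(axis=0)

def anchor_proximity(post_embeddings, anchors, global_mean):
    """Cosine similarity of each centered post to each anchor: (n_posts, n_anchors)."""
    posts = post_embeddings - global_mean
    posts = posts / np.linalg.norm(posts, axis=1, keepdims=True)
    matrix = np.stack(list(anchors.values()))
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return posts @ matrix.T

def proximity_trend(timestamps, proximity):
    """Least-squares slope of proximity over time, one value per anchor."""
    t = np.asarray(timestamps, dtype=float)
    t = t - t.mean()
    return (t @ (proximity - proximity.mean(axis=0))) / (t @ t)

rng = np.random.default_rng(1)
dim, n_posts = 16, 60
offset = 5.0 * np.ones(dim)                      # shared component (assumption)
background = offset + rng.normal(size=(500, dim))
global_mean = background.mean(axis=0)

names = ["depressed_mood", "anhedonia", "healthy_baseline"]
anchors = {}
for i, name in enumerate(names):
    direction = np.zeros(dim)
    direction[i] = 3.0                           # one toy axis per anchor
    phrases = offset + direction + 0.1 * rng.normal(size=(5, dim))
    anchors[name] = build_anchor(phrases, global_mean)

# Synthetic user whose posts move steadily toward the depressed_mood axis.
strength = np.linspace(0.0, 8.0, n_posts)[:, None]
drift_dir = np.zeros(dim)
drift_dir[0] = 1.0
posts = offset + strength * drift_dir + rng.normal(size=(n_posts, dim))
timestamps = np.arange(n_posts)

proximity = anchor_proximity(posts, anchors, global_mean)
slopes = dict(zip(names, proximity_trend(timestamps, proximity)))
for name, slope in slopes.items():
    print(f"{name}: trend {slope:+.4f} per post")
```

A positive slope for a symptom anchor corresponds to the "increasing proximity over time" pattern in the list above; a slope near zero corresponds to the stable control pattern.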

HNSW’s natural hierarchy produces semantic regions with strong clinical meaning:

  • Level 2 regions show clusters with depression ratios from 0.15 to 0.85
  • High-depression clusters contain posts about hopelessness, isolation, sleep disruption
  • Low-depression clusters contain posts about social activities, hobbies, future plans
  • This demonstrates unsupervised specialization — the graph structure separates clinical from non-clinical content without labels
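
To make the per-cluster figure concrete, the sketch below computes a depression ratio per region from a list of region assignments. It assumes the ratio is the fraction of posts in a region written by depression-labeled users, which the text above does not define explicitly, and it takes the assignments as a plain array rather than calling CVX. The function name `depression_ratio_per_region` is hypothetical.

```python
import numpy as np

def depression_ratio_per_region(region_ids, is_depression):
    """Fraction of posts in each region that come from depression-labeled users.

    region_ids:    one integer region id per post
    is_depression: one 0/1 flag per post (label of the post's author)
    """
    region_ids = np.asarray(region_ids)
    flags = np.asarray(is_depression, dtype=float)
    regions, inverse = np.unique(region_ids, return_inverse=True)
    totals = np.bincount(inverse, minlength=len(regions))
    positives = np.bincount(inverse, weights=flags, minlength=len(regions))
    return {int(r): float(p / t) for r, p, t in zip(regions, positives, totals)}

# Toy data: ten posts spread over three regions.
region_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
is_depression = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

ratios = depression_ratio_per_region(region_ids, is_depression)
for region, ratio in sorted(ratios.items()):
    print(f"region {region}: depression ratio {ratio:.2f}")
```

The labels are used only to describe the regions after the fact; they play no part in forming them, which is what makes the specialization unsupervised.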

Classification results for anchor-projected temporal features, evaluated on a proper temporal split (train on 2017 and 2018, test on 2022):

| Model | F1 | AUC | Precision | Recall |
|---|---|---|---|---|
| B1 Baseline (absolute features) | 0.600 | 0.639 | 0.590 | 0.614 |
| B2 Combined (anchor + polarization + velocity) | 0.744 | 0.886 | 0.739 | 0.750 |
| Early detection (10% of posts) | 0.673 | | | |
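
A temporal split of this kind, together with the precision, recall and F1 definitions used in such a table, can be sketched as follows. This is a minimal illustration, not the evaluation code from the notebooks: `UserRecord`, its `edition_year` field and the helper names are hypothetical, and the records are toy data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserRecord:
    user_id: str
    edition_year: int   # hypothetical field: year the user's data belongs to
    label: int          # 1 = depression, 0 = control

def temporal_split(records, train_years, test_years):
    """Split by year so that every training year precedes every test year."""
    train_years, test_years = set(train_years), set(test_years)
    if max(train_years) >= min(test_years):
        raise ValueError("training years must precede test years")
    train = [r for r in records if r.edition_year in train_years]
    test = [r for r in records if r.edition_year in test_years]
    return train, test

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

records = [
    UserRecord("u1", 2017, 1),
    UserRecord("u2", 2017, 0),
    UserRecord("u3", 2018, 1),
    UserRecord("u4", 2022, 0),
    UserRecord("u5", 2022, 1),
]
train, test = temporal_split(records, train_years=[2017, 2018], test_years=[2022])
print(len(train), "training users,", len(test), "test users")

precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Splitting by year rather than at random keeps later data out of training, so the test score reflects performance on data the model could not have seen.
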

| Feature | Purpose |
|---|---|
| `project_to_anchors()` | DSM-5 symptom proximity trajectories |
| `anchor_summary()` | Per-user mean, min, trend per symptom |
| `drift()` / `velocity()` | Rate and direction of linguistic change |
| `detect_changepoints()` | Onset/escalation event detection (PELT) |
| `regions()` / `region_assignments()` | Unsupervised semantic clustering via HNSW hierarchy |
| `hurst_exponent()` | Long-memory estimation (persistent vs antipersistent drift) |
| `path_signature()` | Trajectory shape classification |
| Centering (manual, RFC-012 pending) | Anisotropy correction for 30x signal improvement |
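
As a rough intuition for "rate and direction of linguistic change", the sketch below computes two simple quantities on a toy trajectory. The definitions are assumptions made for illustration: step-wise speed as displacement norm divided by elapsed time, and net drift as the vector from the first to the last embedding. They are not taken from the CVX source, and the functions here only share their names with `velocity()` and `drift()` in the table.

```python
import numpy as np

def velocity(timestamps, embeddings):
    """Speed between consecutive posts: ||x[i+1] - x[i]|| / (t[i+1] - t[i])."""
    t = np.asarray(timestamps, dtype=float)
    x = np.asarray(embeddings, dtype=float)
    dt = np.diff(t)
    if np.any(dt <= 0):
        raise ValueError("timestamps must be strictly increasing")
    return np.linalg.norm(np.diff(x, axis=0), axis=1) / dt

def drift(embeddings):
    """Net displacement from first to last post: (magnitude, unit direction)."""
    x = np.asarray(embeddings, dtype=float)
    displacement = x[-1] - x[0]
    magnitude = float(np.linalg.norm(displacement))
    if magnitude == 0.0:
        return 0.0, np.zeros_like(displacement)
    return magnitude, displacement / magnitude

# Toy 2-D trajectory with uneven gaps between posts.
timestamps = [0.0, 1.0, 3.0]
embeddings = [[0.0, 0.0], [3.0, 4.0], [3.0, 8.0]]

speeds = velocity(timestamps, embeddings)
magnitude, direction = drift(embeddings)
print("speeds:", speeds)
print("drift magnitude:", magnitude, "direction:", direction)
```

Dividing by the elapsed time matters for social media data, where gaps between posts are irregular and a large jump after a long silence should not count as rapid change.
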

| Notebook | Focus | Status |
|---|---|---|
| B1_interactive_explorer | HNSW hierarchy visualization, depression ratio per cluster | Best: cluster visualization |
| B2_clinical_anchoring | Anchor projection pipeline, classification benchmarks | Complete |
| B3_clinical_dashboard | DSM-5 radar, symptom drift direction, clinical timeline | Best: population profiles, drift tracking |
| B1_erisk_rigorous | Rigorous eRisk evaluation (SVC+PCA, no data balancing) | Complete |