
Temporal ML (Differentiable Features)

CVX extracts rich temporal features from embedding trajectories: velocity, acceleration, drift, change points, volatility. These features are powerful for downstream classification tasks — for example, early detection of psychological disorders from social media post histories.

But if these features are not differentiable, the gradient from the classifier cannot propagate back to the base embedding model (BERT, sentence-transformers). This prevents end-to-end fine-tuning and limits the system’s power:

```
Text → Embedding Model → v(t) → CVX features → Classifier → loss
         (BERT)                      ▲
                                     └─ if this step is not differentiable,
                                        the gradient dies here and BERT
                                        cannot adjust
```

CVX offers two paths for temporal features — same mathematical computation, different execution context:

| Path | Implementation | Differentiable | Purpose |
| --- | --- | --- | --- |
| Analytic | Rust, SIMD | No | Serving, API, interpretation |
| ML | burn / tch-rs with autograd | Yes | End-to-end training, fine-tuning |

Both paths share the same logic via a TemporalOps trait and produce numerically identical results. The difference is that the ML path records operations in an autograd graph, enabling gradient flow.

The trait abstracts temporal operations over tensors, with three backend implementations:

```rust
pub trait TemporalOps {
    type Tensor;

    fn velocity(embeddings: &Self::Tensor, timestamps: &Self::Tensor) -> Self::Tensor;
    fn acceleration(embeddings: &Self::Tensor, timestamps: &Self::Tensor) -> Self::Tensor;
    fn drift(embeddings: &Self::Tensor) -> Self::Tensor;
    fn volatility(embeddings: &Self::Tensor, timestamps: &Self::Tensor) -> Self::Tensor;
    fn soft_changepoints(embeddings: &Self::Tensor, timestamps: &Self::Tensor, temperature: f64) -> Self::Tensor;
    fn extract_all(embeddings: &Self::Tensor, timestamps: &Self::Tensor, config: &TemporalFeaturesConfig) -> Self::Tensor;
}
```
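As a rough illustration of what the analytic path computes (a pure-Python sketch with hypothetical function names, not the actual Rust `AnalyticBackend`), the core features reduce to finite differences over irregularly spaced timestamps:

```python
# Pure-Python sketch of the analytic temporal features; the real
# AnalyticBackend operates on Vec<Vec<f32>> with SIMD in Rust.
from statistics import pstdev

def velocity(embeddings, timestamps):
    """Finite-difference velocity per step, using the real (irregular) Δt."""
    return [
        [(b - a) / (t1 - t0) for a, b in zip(e0, e1)]
        for e0, e1, t0, t1 in zip(embeddings, embeddings[1:], timestamps, timestamps[1:])
    ]

def drift(embeddings):
    """Total drift: last embedding minus first, per dimension."""
    return [last - first for first, last in zip(embeddings[0], embeddings[-1])]

def volatility(embeddings, timestamps):
    """Std of step-to-step velocity magnitudes."""
    mags = [sum(x * x for x in v) ** 0.5 for v in velocity(embeddings, timestamps)]
    return pstdev(mags)

emb = [[0.0, 0.0], [1.0, 2.0], [3.0, 2.0]]   # three posts in 2-D
ts = [0.0, 1.0, 3.0]                          # irregular gaps: Δt = 1, then 2
print(velocity(emb, ts))  # [[1.0, 2.0], [1.0, 0.0]]
print(drift(emb))         # [3.0, 2.0]
```

The same arithmetic, written against `burn` or `tch` tensors, is what the ML backends record in the autograd graph.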

| Backend | Type | When to use |
| --- | --- | --- |
| `AnalyticBackend` | `Vec<Vec<f32>>` | API serving, cvx-explain, any context that does not need gradients |
| `BurnBackend` | `burn::tensor::Tensor<B, 2>` | Pure-Rust training with CUDA support; shares a backend with Neural ODE |
| `TorchBackend` | `tch::Tensor` | Python interop — gradients cross the Rust/Python boundary via PyO3 with zero-copy tensors |

Backends are feature-gated: temporal-ml-burn enables BurnBackend, temporal-ml-torch enables TorchBackend. The analytic backend is always available.

These features compute identically across all backends. In the ML backends, they register operations in the autograd graph:

| Feature | Formula | Gradient |
| --- | --- | --- |
| Velocity | $(v_{t+1} - v_t) / \Delta t$ | $-1/\Delta t$ and $+1/\Delta t$ (linear) |
| Acceleration | $(v_{t+2} - 2v_{t+1} + v_t) / \Delta t^2$ | Linear |
| Total drift | $v_{t_{\text{last}}} - v_{t_{\text{first}}}$ | $\pm 1$ (trivial) |
| Per-dim delta | $v_{t_2}[d] - v_{t_1}[d]$ | $\pm 1$ (trivial) |
| Volatility | $\text{std}(\|v_{t+1} - v_t\|)$ | Derivative of std, well-defined |
| Cosine distance | $1 - \cos(v_{t_1}, v_{t_2})$ | Standard, differentiable |
| EMA | $\sum e^{-\lambda \cdot \text{age}} \cdot v_t$ | Fixed exponential weights |
| Neural ODE state | $z(t) = \text{ODESolve}(f_\theta, z_0, t)$ | Adjoint method (Chen et al., 2018) |
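The linear gradients in the table are easy to check numerically. This pure-Python sketch (an illustration of the math, not CVX code) verifies via central finite differences that the velocity feature has gradient $+1/\Delta t$ with respect to $v_{t+1}$ and $-1/\Delta t$ with respect to $v_t$:

```python
# Numerically verify the velocity gradients from the table:
# d/dv_{t+1} [(v_{t+1} - v_t)/Δt] = +1/Δt, and d/dv_t = -1/Δt.
def vel(v_t, v_t1, dt):
    return (v_t1 - v_t) / dt

dt, v_t, v_t1, eps = 2.0, 0.5, 1.5, 1e-6
grad_next = (vel(v_t, v_t1 + eps, dt) - vel(v_t, v_t1 - eps, dt)) / (2 * eps)
grad_prev = (vel(v_t + eps, v_t1, dt) - vel(v_t - eps, v_t1, dt)) / (2 * eps)
print(grad_next, grad_prev)  # ≈ +0.5 and -0.5, i.e. +1/Δt and -1/Δt for Δt = 2
```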

Soft Relaxations (Differentiable Approximations of Discrete Features)

Some features are inherently discrete in the analytic path (PELT, counts) but have continuous differentiable approximations:

| Discrete feature | Differentiable relaxation | Control parameter |
| --- | --- | --- |
| Number of change points | $\sum \sigma((\text{deviation} - \mu) / \tau)$ — sum of sigmoids | $\tau$ (temperature) |
| Maximum severity | $\text{softmax}(\text{severity}) \cdot \text{severity}$ — smooth max | $\tau$ |
| Top-K dimensions | $\text{gumbel\_softmax}(\lvert\Delta v\rvert)$ | $\tau$ |
| Silence count | $\sum \sigma((\text{gap} - \theta) / \tau)$ | $\theta$ (threshold), $\tau$ |

As $\tau \to 0$, the relaxations converge to the discrete versions. During training, $\tau > 0$ allows gradients; at inference time, $\tau \to 0$ recovers exact results.
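This convergence can be demonstrated directly. A pure-Python sketch (illustrative, not CVX's implementation) of the soft change-point count shows that lowering $\tau$ drives the sum of sigmoids toward the hard threshold count:

```python
# Soft change-point count as a sum of sigmoids; as τ → 0 it converges
# to the hard count of deviations exceeding the threshold μ.
import math

def soft_count(deviations, mu, tau):
    return sum(1.0 / (1.0 + math.exp(-(d - mu) / tau)) for d in deviations)

def hard_count(deviations, mu):
    return sum(1 for d in deviations if d > mu)

devs = [0.1, 0.9, 0.4, 1.6, 0.2]  # per-step deviation scores
mu = 0.8                          # change-point threshold
for tau in (1.0, 0.1, 0.01):
    print(tau, soft_count(devs, mu, tau))  # approaches the hard count as τ shrinks
print(hard_count(devs, mu))  # 2
```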

These features exist only for human interpretation, not for training:

| Feature | Why not differentiable |
| --- | --- |
| PELT (exact segmentation) | Discrete combinatorial optimization |
| BOCPD posterior | Discrete run-length |
| Change point narrative | Produces structure/text, not numbers |
| Drift attribution (Pareto) | Sorting + cumsum with a discrete threshold |

The TemporalFeatureExtractor is not just fixed computations — it includes trainable components that optimize jointly with the classifier:

A linear module that learns which embedding dimensions matter for the task:

$\text{output} = \text{drift} \odot \sigma(W \cdot \text{drift})$

This lets the model learn, for example, that dimensions associated with “negative affect” are more relevant than “syntax” dimensions for depression detection. It is the learnable equivalent of CVX’s analytical drift attribution.
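A minimal pure-Python sketch of this gating formula (the matrix $W$ would be learned; here it is fixed by hand, and the function name is hypothetical):

```python
# Dimension-attention gate: output = drift ⊙ σ(W · drift).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dimension_attention(drift, W):
    # gate[i] = σ(Σ_j W[i][j] * drift[j]); output[i] = drift[i] * gate[i]
    gates = [sigmoid(sum(w * d for w, d in zip(row, drift))) for row in W]
    return [d * g for d, g in zip(drift, gates)]

drift = [2.0, -1.0, 0.5]
W = [[10.0, 0.0, 0.0],   # strongly passes dimension 0 through
     [0.0, -10.0, 0.0],  # also opens the gate for dimension 1
     [0.0, 0.0, 0.0]]    # indifferent → gate stays at σ(0) = 0.5
out = dimension_attention(drift, W)
print(out)  # dimension 2 is halved; 0 and 1 pass almost unchanged
```

Because the gate is a smooth function of `drift`, gradients flow through both the drift values and $W$ during training.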

Learnable weights for the relative importance of each temporal scale:

$\text{output} = \sum_s w_s \cdot \text{features}_s$

The model learns whether the signal is stronger at the daily, weekly, or monthly scale — analogous to CVX’s multi-scale analysis, but optimized for the specific task.
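A toy sketch of the weighted sum above (the weights $w_s$ would be learned jointly with the classifier; the dictionary layout here is an assumption for illustration):

```python
# Multi-scale mixing: output = Σ_s w_s · features_s, one weight per scale.
def scale_mix(features_by_scale, weights):
    dim = len(next(iter(features_by_scale.values())))
    out = [0.0] * dim
    for w, feats in zip(weights, features_by_scale.values()):
        for i, f in enumerate(feats):
            out[i] += w * f
    return out

scales = {"daily": [1.0, 0.0], "weekly": [0.0, 1.0], "monthly": [1.0, 1.0]}
mixed = scale_mix(scales, weights=[0.5, 0.3, 0.2])
print(mixed)  # daily dominates: [0.7, 0.5]
```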

The temperature $\tau$ itself can be a learnable parameter, so the model discovers the optimal sensitivity to changes for the task.

For $D = 768$ and $n\_scales = 3$, the TemporalFeatureExtractor produces a fixed-size feature vector regardless of sequence length:

| Component | Dimensions |
| --- | --- |
| Mean velocity | 768 |
| Attended drift | 768 |
| Volatility | 768 |
| Mean acceleration | 768 |
| Scalar features | 5 |
| Scale features | 6 |
| Total | 3077 |

This solves the variable-length problem: users with 10 posts and users with 500 posts produce the same-shaped feature vector.
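A toy sketch of why pooling fixes the shape (illustrative only; the helper names and the tiny $D$ are assumptions, not the real extractor): sequence statistics like mean velocity collapse any number of posts into a $D$-sized vector.

```python
# Mean velocity pools a variable-length trajectory into a fixed-size vector.
import random
random.seed(0)

D = 4  # toy embedding size (the running example in this page uses D = 768)

def mean_velocity(embeddings, timestamps):
    # average finite-difference velocity over all steps → always length D
    steps = len(embeddings) - 1
    out = [0.0] * D
    for e0, e1, t0, t1 in zip(embeddings, embeddings[1:], timestamps, timestamps[1:]):
        for i in range(D):
            out[i] += (e1[i] - e0[i]) / (t1 - t0) / steps
    return out

def fake_user(n_posts):
    emb = [[random.random() for _ in range(D)] for _ in range(n_posts)]
    ts = sorted(random.random() * 1000 for _ in range(n_posts))
    return emb, ts

short = mean_velocity(*fake_user(10))
long_ = mean_velocity(*fake_user(500))
print(len(short), len(long_))  # 4 4 — same shape regardless of history length
```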

Concrete Example: Social Media Classification

A dataset of $N$ users, each with a variable number of timestamped posts and a binary label. The goal is to classify users based on how their language evolves over time.

Training path: Text goes through BERT, produces per-post embeddings, CVX extracts differentiable temporal features, a classifier produces logits, and loss.backward() propagates gradients all the way back to BERT.

Analysis path: After training, the analytic (non-differentiable) tools provide interpretability — PELT finds exact change points, drift attribution identifies responsible dimensions, trajectory projection visualizes the evolution.

For PyTorch users, the TorchBackend enables gradient flow across the Rust/Python boundary:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

import cvx_python  # Rust compiled extension

bert = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(3077, 2)

# assumes a dataloader yielding batches with .tokens, .timestamps, .label
for batch in dataloader:
    embeddings = bert(batch.tokens).last_hidden_state[:, 0]  # [CLS] per post
    features = cvx_python.temporal_features(embeddings, batch.timestamps)  # Rust, autograd OK
    logits = classifier(features)
    loss = F.cross_entropy(logits, batch.label)
    loss.backward()  # gradients reach BERT
```

After training, the learned DimensionAttention weights can be compared against CVX’s analytical drift attribution:

  • High correlation — the model learned what CVX’s analytics already show. Increases confidence in both.
  • Low correlation — the model discovered signals that the analytics do not capture. Worth investigating.

This cross-validation between the differentiable and analytic paths is a unique diagnostic capability.

Social media posts are not periodic. The implementation handles:

  • Long gaps: Velocity is computed with the real $\Delta t$, not assuming uniform intervals.
  • Bursts: Optionally aggregate by time window before computing features.
  • Silence as signal: The posting pattern itself (gaps, frequency, burstiness) is a differentiable feature via soft silence counting: $\sum \sigma((\text{gap} - \theta) / \tau)$.
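A pure-Python sketch of the soft silence count on an irregular posting history (illustrative only; the function name and the day-based units are assumptions):

```python
# Differentiable silence count over posting gaps: a gap longer than θ
# counts as a "silence"; the sigmoid makes the count soft in τ.
import math

def soft_silences(timestamps, theta, tau):
    gaps = [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]
    return sum(1.0 / (1.0 + math.exp(-(g - theta) / tau)) for g in gaps)

ts = [0.0, 1.0, 2.0, 30.0, 31.0, 90.0]  # post times in days: two long silences
print(round(soft_silences(ts, theta=7.0, tau=0.1)))  # 2
```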

| Operation | Target |
| --- | --- |
| Feature extraction (100 posts, $D=768$, burn CPU) | < 1 ms |
| Feature extraction (100 posts, $D=768$, CUDA) | < 0.5 ms |
| Batch extraction (1000 users × 100 posts, CUDA) | < 100 ms |
| Backward pass overhead vs. forward-only | < 2× |
| Feature vector size (fixed) | 3077 (for $D=768$, 3 scales) |