Temporal Word Embeddings for Early Detection of Psychological Disorders on Social Media¶

Abastract¶

Mental health disorders represent a public health challenge, where early detection is critical to mitigating adverse outcomes for individuals and society. The study of language and behavior is a pivotal component in mental health research, and the content from social media platforms serves as a valuable tool for identifying signs of mental health risks. This paper presents a novel framework leveraging temporal word embeddings to capture linguistic changes over time. We specifically aim at at identifying emerging psychological concerns on social media. By adapting temporal word representations, our approach quantifies shifts in language use that may signal mental health risks. To that end, we implement two alternative temporal word embedding models to detect linguistic variations and exploit these variations to train early detection classifiers. Our experiments, conducted on 18 datasets from the eRisk initiative (covering signs of conditions such as depression, anorexia, and self-harm), show that simple models focusing exclusively on temporal word usage patterns achieve competitive performance compared to state-of-the-art systems. Additionally, we perform a word-level analysis to understand the evolution of key terms among positive and control users. These findings underscore the potential of time-sensitive word models in this domain, being a promising avenue for future research in mental health surveillance.

Models¶

TWEC¶

In this tutorial, we will focus exclusively on TWEC. Let's begin by defining our temporal word embedding models. The first model, TWEC (Temporal Word Embeddings with a Compass), is an extension of Word2Vec that incorporates temporal information. It captures linguistic shifts over time by leveraging the context of surrounding words across different time periods.

In [1]:

Copied!

from models.twec import TWEC
from models.twec import TWEC

Deltas¶

Deltas is a metric designed to quantify semantic drift in word meaning over time within a diachronic corpus. It is computed by applying similarity measures—such as cosine similarity or Euclidean distance—between temporally contextualized word embeddings and their corresponding static representations.

In [2]:

Copied!

from models.deltas import DISTANCES
from models.deltas import DISTANCES

Filters¶

Now that we have defined our models, we need to encapsulate them inside a class that implements the BaseFilter interface and binds them to the container.

In [7]:

Copied!

# !pip install framework3==1.1.1
# !pip install framework3==1.1.1

In [8]:

Copied!

from labchain.utils.patch_type_guard import patch_inspect_for_notebooks

patch_inspect_for_notebooks()
from labchain.utils.patch_type_guard import patch_inspect_for_notebooks

patch_inspect_for_notebooks()

✅ Patched inspect.getsource using dill.

In [9]:

Copied!





from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Literal
from scipy.sparse import dok_matrix
from tqdm import tqdm
from labchain.base.base_clases import BaseFilter
from labchain.base.base_types import XYData
from labchain.container import Container

import pandas as pd
import numpy as np
import torch
import os


@Container.bind()
class TWECFilter(BaseFilter):
    def __init__(
        self,
        context_size: int,
        _cpus: int = 4,
        deltas_f: List[
            Literal[
                "cosine",
                "euclidean",
                "chebyshev",
                "jensen_shannon",
                "wasserstein",
                "manhattan",
                "minkowski",
            ]
        ] = ["cosine"],
    ):
        super().__init__()
        self._twec = TWEC(size=300, window=context_size)
        self.deltas_f = deltas_f
        self.context_size = context_size
        actual_cpus = os.cpu_count()
        if actual_cpus is not None:
            self._cpus = min(actual_cpus, _cpus)
        else:
            self._cpus = _cpus

    def fit(self, x: XYData, y: XYData | None) -> float | None:
        data: pd.DataFrame = x.value
        self._twec.train_compass(data.text.values.tolist())
        self._vocab_hash_map = dict(
            zip(
                self._twec.compass.wv.index_to_key,  # type: ignore
                range(len(self._twec.compass.wv.index_to_key)),  # type: ignore
            )
        )

    def predict(self, x: XYData) -> XYData:
        data: pd.DataFrame = x.value
        n_rows = len(data.index)
        n_cols = len(self._vocab_hash_map.items())
        metric_names = self.deltas_f

        all_deltas = {
            metric: dok_matrix((n_rows, n_cols), dtype=np.float32)
            for metric in metric_names
        }

        def process_user_deltas(i, tc):
            result = {metric: [] for metric in metric_names}
            for word in tc.wv.index_to_key:  # type: ignore
                if word in self._vocab_hash_map:
                    j = self._vocab_hash_map[word]
                    for metric in metric_names:
                        dist = (
                            DISTANCES[metric](
                                torch.tensor(np.array([[self._twec.compass.wv[word]]])),  # type: ignore
                                torch.tensor(np.array([[tc.wv[word]]])),  # type: ignore
                            )
                            .detach()
                            .cpu()
                            .item()
                        )
                        result[metric].append((i, j, dist))
            return result

        with ThreadPoolExecutor(max_workers=self._cpus) as executor:
            futures = {
                executor.submit(
                    process_user_deltas, i, self._twec.train_slice(row.text)
                ): i
                for i, row in tqdm(
                    enumerate(data.itertuples()),
                    total=n_rows,
                    desc="generating embeddings",
                )
            }

            for future in tqdm(
                as_completed(futures), total=n_rows, desc="parallel prediction"
            ):
                chunk_result = future.result()
                for metric, values in chunk_result.items():
                    for i, j, val in values:
                        all_deltas[metric][i, j] = val

        return XYData.mock(all_deltas)
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List, Literal
from scipy.sparse import dok_matrix
from tqdm import tqdm
from labchain.base.base_clases import BaseFilter
from labchain.base.base_types import XYData
from labchain.container import Container

import pandas as pd
import numpy as np
import torch
import os


@Container.bind()
class TWECFilter(BaseFilter):
    def __init__(
        self,
        context_size: int,
        _cpus: int = 4,
        deltas_f: List[
            Literal[
                "cosine",
                "euclidean",
                "chebyshev",
                "jensen_shannon",
                "wasserstein",
                "manhattan",
                "minkowski",
            ]
        ] = ["cosine"],
    ):
        super().__init__()
        self._twec = TWEC(size=300, window=context_size)
        self.deltas_f = deltas_f
        self.context_size = context_size
        actual_cpus = os.cpu_count()
        if actual_cpus is not None:
            self._cpus = min(actual_cpus, _cpus)
        else:
            self._cpus = _cpus

    def fit(self, x: XYData, y: XYData | None) -> float | None:
        data: pd.DataFrame = x.value
        self._twec.train_compass(data.text.values.tolist())
        self._vocab_hash_map = dict(
            zip(
                self._twec.compass.wv.index_to_key,  # type: ignore
                range(len(self._twec.compass.wv.index_to_key)),  # type: ignore
            )
        )

    def predict(self, x: XYData) -> XYData:
        data: pd.DataFrame = x.value
        n_rows = len(data.index)
        n_cols = len(self._vocab_hash_map.items())
        metric_names = self.deltas_f

        all_deltas = {
            metric: dok_matrix((n_rows, n_cols), dtype=np.float32)
            for metric in metric_names
        }

        def process_user_deltas(i, tc):
            result = {metric: [] for metric in metric_names}
            for word in tc.wv.index_to_key:  # type: ignore
                if word in self._vocab_hash_map:
                    j = self._vocab_hash_map[word]
                    for metric in metric_names:
                        dist = (
                            DISTANCES[metric](
                                torch.tensor(np.array([[self._twec.compass.wv[word]]])),  # type: ignore
                                torch.tensor(np.array([[tc.wv[word]]])),  # type: ignore
                            )
                            .detach()
                            .cpu()
                            .item()
                        )
                        result[metric].append((i, j, dist))
            return result

        with ThreadPoolExecutor(max_workers=self._cpus) as executor:
            futures = {
                executor.submit(
                    process_user_deltas, i, self._twec.train_slice(row.text)
                ): i
                for i, row in tqdm(
                    enumerate(data.itertuples()),
                    total=n_rows,
                    desc="generating embeddings",
                )
            }

            for future in tqdm(
                as_completed(futures), total=n_rows, desc="parallel prediction"
            ):
                chunk_result = future.result()
                for metric, values in chunk_result.items():
                    for i, j, val in values:
                        all_deltas[metric][i, j] = val

        return XYData.mock(all_deltas)

The classifiers¶

This work addresses an early prediction task using the eRisk dataset, which requires the use of classification models. We will now define a set of classifiers and integrate them within the framework by wrapping them in the appropriate classes.

⚠️ Warning: In order for classes to be parallelizable, they must be defined in a standalone module. For this reason, we have moved the classifiers to separate files. The code shown here is provided for reference purposes only.

⚠️ Warning: Also note that some hyperparameters are not primitive types. While this works well with sklearn and Optuna optimizers, it may break when using the wandb optimizer. The code should be adapted accordingly if you plan to use wandb for optimization.

SVM¶

from typing import Callable, Mapping
from sklearn.svm import SVC


@Container.bind()
class ClassifierSVM(BaseFilter):
    def __init__(
        self,
        C: float = 1,
        kernel: Callable | Literal['linear', 'poly', 'rbf', 'sigmoid', 'precomputed'] = "rbf",
        gamma: float | Literal['scale', 'auto'] = "scale",
        coef0:float=0.0,
        tol:float=0.001,
        decision_function_shape:Literal['ovo', 'ovr'] = "ovr",
        class_weight_1: Mapping[Any, Any] | str | None = None,
        probability:bool = False,
    ):
        super().__init__()
        self.proba = probability
        self._model = SVC(
            C=C,
            kernel=kernel,
            gamma=gamma,
            coef0=coef0,
            tol=tol,
            decision_function_shape=decision_function_shape,
            class_weight={1: class_weight_1},
            probability=probability,
            random_state=43,
        )

    def fit(self, x: XYData, y: XYData | None):
        if y is None:
            raise ValueError("y must be provided for training")
        self._model.fit(x.value, y.value)

    def predict(self, x: XYData) -> XYData:
        if self.proba:
            result = list(map(lambda i: i[1], self._model.predict_proba(x.value)))
            return XYData.mock(result)
        else:
            result = self._model.predict(x.value)
            return XYData.mock(result)

In [10]:

Copied!

from models.svm import ClassifierSVM
from models.svm import ClassifierSVM

The Metrics¶

Since the early prediction task is essentially a classification problem, we will use standard classification metrics such as F1-score, Precision, and Recall. However, due to the early nature of the task, we also need to include metrics that penalize delayed decisions, as timing is a critical aspect of the evaluation.

In [11]:

Copied!

from labchain.plugins.metrics import F1, Precission, Recall

f1 = F1()
precision = Precission()
recall = Recall()
from labchain.plugins.metrics import F1, Precission, Recall

f1 = F1()
precision = Precission()
recall = Recall()

from typing import Iterable
from sklearn.metrics import confusion_matrix
from labchain import BaseMetric

from numpy import exp

@Container.bind()
class ERDE(BaseMetric):
    def __init__(self, count: Iterable, k: int = 5):
        self.k = k
        self.count = count

    def evaluate(
        self, x_data: XYData, y_true: XYData | None, y_pred: XYData
    ) -> float | np.ndarray:
        if y_true is None:
            raise ValueError("y_true must be provided for evaluation")

        all_erde = []
        _, _, _, tp = confusion_matrix(y_true.value, y_pred.value).ravel()
        for expected, result, count in list(
            zip(y_true.value, y_pred.value, self.count)
        ):
            if result == 1 and expected == 0:
                all_erde.append(float(tp) / len(y_true.value))
            elif result == 0 and expected == 1:
                all_erde.append(1.0)
            elif result == 1 and expected == 1:
                all_erde.append(1.0 - (1.0 / (1.0 + exp(count - self.k))))
            elif result == 0 and expected == 0:
                all_erde.append(0.0)
        return float(np.mean(all_erde) * 100)

In [12]:

Copied!

from metrics.erde import ERDE_5, ERDE_50
from metrics.erde import ERDE_5, ERDE_50

For simplicity, we will only consider the 2023 gambling data.¶

In [13]:

Copied!

gambling_2023_train = pd.read_csv("data/standard_gambling_train_2023.csv", index_col=0)
gambling_2023_train.head(5)

gambling_2023_test = pd.read_csv("data/standard_gambling_2023.csv", index_col=0)
gambling_2023_test.head(5)
gambling_2023_train = pd.read_csv("data/standard_gambling_train_2023.csv", index_col=0)
gambling_2023_train.head(5)

gambling_2023_test = pd.read_csv("data/standard_gambling_2023.csv", index_col=0)
gambling_2023_test.head(5)

Out[13]:

	id	text	date	user
0	subject5539_0	For PC: I don't know what company, but they ne...	2015-12-03 13:31:29	subject5539
1	subject5539_1	You play as a Pokmon trainer (that you customi...	2015-12-04 19:45:40	subject5539
2	subject5539_2	A Clash of Clans RPG (or MMORPG)	2015-12-23 23:32:51	subject5539
3	subject5539_3	You would have to manage your species's needs ...	2015-12-26 20:45:30	subject5539
4	subject5539_4	The game starts you as a child and you have to...	2016-01-02 08:40:18	subject5539

In [14]:

Copied!





gg_2023_train = (
    gambling_2023_train.groupby(["user", "chunk"])
    .agg(
        {
            "id": "count",
            "text": list,
            "date": list,
            "label": "first",
        }
    )
    .rename(columns={"id": "n_texts"})
    .reset_index()
)

gg_2023_test = (
    gambling_2023_test.groupby(["user", "chunk"])
    .agg(
        {
            "id": "count",
            "text": list,
            "date": list,
            "label": "first",
        }
    )
    .rename(columns={"id": "n_texts"})
    .reset_index()
)
gg_2023_train
gg_2023_train = (
    gambling_2023_train.groupby(["user", "chunk"])
    .agg(
        {
            "id": "count",
            "text": list,
            "date": list,
            "label": "first",
        }
    )
    .rename(columns={"id": "n_texts"})
    .reset_index()
)

gg_2023_test = (
    gambling_2023_test.groupby(["user", "chunk"])
    .agg(
        {
            "id": "count",
            "text": list,
            "date": list,
            "label": "first",
        }
    )
    .rename(columns={"id": "n_texts"})
    .reset_index()
)
gg_2023_train

Out[14]:

	user	chunk	n_texts	text	date	label
0	subject1	0	132	[Vulcan's ultimate landing at max range is so ...	[2017-08-18 11:34:09, 2017-08-20 15:26:34, 201...	0
1	subject1	1	132	[Awesome! It is always good to hear these news...	[2018-05-18 23:46:33, 2018-06-18 17:17:55, 201...	0
2	subject1	2	132	[The syringe is a lie!, I'd say Scylla or Than...	[2018-09-20 08:20:44, 2018-09-24 10:12:03, 201...	0
3	subject1	3	131	[Some of the symptoms you may experience are b...	[2019-05-06 17:50:52, 2019-05-06 19:05:44, 201...	0
4	subject1	4	132	[So ur saying that huge map is better than Afg...	[2019-10-06 23:22:46, 2019-10-12 18:08:06, 201...	0
...	...	...	...	...	...	...
39265	subject9999	5	8	[10v10, is those a bunch of bots, I didn't eve...	[2021-07-04 06:51:04, 2021-07-04 07:01:05, 202...	0
39266	subject9999	6	8	[I'm commenting this based on the fact that Am...	[2021-07-15 19:11:42, 2021-07-27 17:44:09, 202...	0
39267	subject9999	7	8	[Aesthetic set, It's a fucking downgrade,, It'...	[2021-09-12 14:55:05, 2021-09-23 00:31:11, 202...	0
39268	subject9999	8	8	[u/save, u/savevideo, Snu snu ! Snu snu! Snu s...	[2021-09-23 13:38:48, 2021-10-08 13:44:30, 202...	0
39269	subject9999	9	8	[Why every fucking time there's a new weapon o...	[2021-11-27 06:45:16, 2021-11-27 07:04:36, 202...	0

39270 rows × 6 columns

⚠️ Warning: There are several restrictions for the plugins to work properly:

Constructor arguments should be public attributes.
Other data must be set as private attributes.
All public attributes must be serializable using jsonable_encoder.

In [16]:

Copied!

test_erde_5 = ERDE_5()
test_erde_50 = ERDE_50()
test_erde_5 = ERDE_5()
test_erde_50 = ERDE_50()

Selector¶

We are using sklearn for grid search. This optimizer will check the input dimensions of the X and y values. We have generated a dictionary with the deltas based on different distance measures, but this results in an incompatible dimensions error from sklearn. To work around this issue, we define a class that selects the appropriate deltas based on a hyperparameter.

In [22]:

Copied!





@Container.bind()
class DeltaSelectorFilter(BaseFilter):
    def __init__(
        self,
        deltas_f: Literal[
            "cosine",
            "euclidean",
            "chebyshev",
            "jensen_shannon",
            "wasserstein",
            "manhattan",
            "minkowski",
        ] = "cosine",
    ):
        super().__init__()
        self.deltas_f = deltas_f

    def fit(self, x: XYData, y: XYData | None):
        pass

    def predict(self, x: XYData) -> XYData:
        # Crear una nueva dok_matrix con el mismo shape y en float32
        old_dok = x.value[self.deltas_f]
        new_dok = dok_matrix(old_dok.shape, dtype=np.float32)

        # Copiar todos los valores existentes y convertir el dtype
        for (i, j), value in old_dok.items():
            new_dok[i, j] = float(value)  # conversión a float32 implícita
        return XYData.mock(new_dok.tocsr())
@Container.bind()
class DeltaSelectorFilter(BaseFilter):
    def __init__(
        self,
        deltas_f: Literal[
            "cosine",
            "euclidean",
            "chebyshev",
            "jensen_shannon",
            "wasserstein",
            "manhattan",
            "minkowski",
        ] = "cosine",
    ):
        super().__init__()
        self.deltas_f = deltas_f

    def fit(self, x: XYData, y: XYData | None):
        pass

    def predict(self, x: XYData) -> XYData:
        # Crear una nueva dok_matrix con el mismo shape y en float32
        old_dok = x.value[self.deltas_f]
        new_dok = dok_matrix(old_dok.shape, dtype=np.float32)

        # Copiar todos los valores existentes y convertir el dtype
        for (i, j), value in old_dok.items():
            new_dok[i, j] = float(value)  # conversión a float32 implícita
        return XYData.mock(new_dok.tocsr())

The pipeline¶

Now comes the most exciting part: integrating the filters into the pipeline. This step can be done incrementally, which is more convenient when developing a model. However, since we already have a clear understanding of the process, we will combine all the parts into one step.

In [ ]:

Copied!





from labchain import Cached, SklearnOptimizer
from labchain.plugins.pipelines.sequential import F3Pipeline

all_test_metrics = [
    f1,
    precision,
    recall,
    test_erde_5,
    test_erde_50,
]

pipeline_svm = F3Pipeline(
    filters=[
        Cached(
            filter=TWECFilter(
                context_size=25,
                _cpus=10,
                deltas_f=["cosine", "euclidean", "manhattan", "chebyshev"],
            ),
        ),
        DeltaSelectorFilter(deltas_f="cosine"),
        F3Pipeline(
            filters=[
                ClassifierSVM(
                    tol=0.003,
                    probability=False,
                    decision_function_shape="ovr",
                    kernel="rbf",
                    gamma="scale",
                ).grid(
                    {
                        "C": [1, 3, 5],
                        "class_weight_1": [{1: 1.5}, {1: 2.5}, {1: 3.0}],
                    }
                )
            ],
            metrics=[F1()],
        ).optimizer(SklearnOptimizer(scoring="f1_weighted", cv=2, n_jobs=-1)),
    ],
    metrics=all_test_metrics,
)
from labchain import Cached, SklearnOptimizer
from labchain.plugins.pipelines.sequential import F3Pipeline

all_test_metrics = [
    f1,
    precision,
    recall,
    test_erde_5,
    test_erde_50,
]

pipeline_svm = F3Pipeline(
    filters=[
        Cached(
            filter=TWECFilter(
                context_size=25,
                _cpus=10,
                deltas_f=["cosine", "euclidean", "manhattan", "chebyshev"],
            ),
        ),
        DeltaSelectorFilter(deltas_f="cosine"),
        F3Pipeline(
            filters=[
                ClassifierSVM(
                    tol=0.003,
                    probability=False,
                    decision_function_shape="ovr",
                    kernel="rbf",
                    gamma="scale",
                ).grid(
                    {
                        "C": [1, 3, 5],
                        "class_weight_1": [{1: 1.5}, {1: 2.5}, {1: 3.0}],
                    }
                )
            ],
            metrics=[F1()],
        ).optimizer(SklearnOptimizer(scoring="f1_weighted", cv=2, n_jobs=-1)),
    ],
    metrics=all_test_metrics,
)

Data Preparation¶

In F3, all data must be wrapped in the XYData class. This ensures that each data transformation is hashed and the results are cached.

In [25]:

Copied!





train_x = XYData(_hash=" Gambling_2023_train_x", _path="/dataset", _value=gg_2023_train)
train_y = XYData(
    _hash="Gambling_2023_train_y", _path="/dataset", _value=gg_2023_train.label.tolist()
)

test_x = XYData(_hash="Gambling_2023_test_x", _path="/dataset", _value=gg_2023_test)

test_y = XYData(
    _hash="Gambling_2023_test_y", _path="/dataset", _value=gg_2023_test.label.tolist()
)
train_x = XYData(_hash=" Gambling_2023_train_x", _path="/dataset", _value=gg_2023_train)
train_y = XYData(
    _hash="Gambling_2023_train_y", _path="/dataset", _value=gg_2023_train.label.tolist()
)

test_x = XYData(_hash="Gambling_2023_test_x", _path="/dataset", _value=gg_2023_test)

test_y = XYData(
    _hash="Gambling_2023_test_y", _path="/dataset", _value=gg_2023_test.label.tolist()
)

Model training¶

⚠️ Warning: Please note that for parallel backend usage, a considerable amount of RAM will be required.

In [26]:

Copied!





from joblib import parallel_backend
import sys

with parallel_backend("loky", n_jobs=-1):
    print("Starting GridSearchCV fitting...", flush=True)
    pipeline_svm.fit(train_x, train_y)
    sys.stdout.flush()
from joblib import parallel_backend
import sys

with parallel_backend("loky", n_jobs=-1):
    print("Starting GridSearchCV fitting...", flush=True)
    pipeline_svm.fit(train_x, train_y)
    sys.stdout.flush()

Starting GridSearchCV fitting...
Calling prefit on Cached
Calling prefit on DeltaSelectorFilter
Calling prefit on SklearnOptimizer

____________________________________________________________________________________________________
Fitting pipeline...
****************************************************************************************************

Cached(
    filter=TWECFilter(deltas_f=['cosine', 'euclidean', 'manhattan', 'chebyshev'], context_size=25),
    cache_data=True,
    cache_filter=True,
    overwrite=False,
    storage=None
)

Calling prefit on TWECFilter

         - El filtro TWECFilter({'deltas_f': ['cosine', 'euclidean', 'manhattan', 'chebyshev'], 'context_size': 
25}) Existe, se carga del storage.

         - El dato XYData(_hash='f991d8f14f3bbdb0a54b565de7e60e42cfd36dc9', 
_path='TWECFilter/60b322a0bd676ce665f0d6b568a28ef664fef914') Existe, se carga del storage.

DeltaSelectorFilter(deltas_f='cosine')

Calling prefit on DeltaSelectorFilter
	 * Downloading: <_io.BufferedReader name='cache/TWECFilter/60b322a0bd676ce665f0d6b568a28ef664fef914/f991d8f14f3bbdb0a54b565de7e60e42cfd36dc9'>

SklearnOptimizer(
    scoring='f1_weighted',
    cv=2,
    pipeline=F3Pipeline(
        filters=[ClassifierSVM(proba=False)],
        metrics=[F1(average='binary')],
        overwrite=False,
        store=False,
        log=False
    ),
    n_jobs=-1
)

Calling prefit on SklearnOptimizer
Fitting 2 folds for each of 9 candidates, totalling 18 fits
[CV 1/2; 3/9] START ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 3.0}..
[CV 2/2; 7/9] START ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 1.5}..
Calling prefit on ClassifierSVM
[CV 1/2; 1/9] START ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 1.5}..
[CV 1/2; 6/9] START ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 3.0}..
[CV 2/2; 9/9] START ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 3.0}..
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
[CV 1/2; 9/9] START ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 3.0}..
Calling prefit on ClassifierSVM
[CV 2/2; 3/9] START ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 3.0}..
Calling prefit on ClassifierSVM
[CV 1/2; 5/9] START ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 2.5}..
[CV 2/2; 1/9] START ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 1.5}..
[CV 1/2; 2/9] START ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 2.5}..
[CV 1/2; 7/9] START ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 1.5}..
Calling prefit on ClassifierSVM
[CV 2/2; 4/9] START ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 1.5}..
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
[CV 2/2; 2/9] START ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 2.5}..
[CV 1/2; 8/9] START ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 2.5}..
[CV 1/2; 4/9] START ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 1.5}..
[CV 2/2; 5/9] START ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 2.5}..
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
[CV 2/2; 6/9] START ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 3.0}..
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
[CV 2/2; 8/9] START ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 2.5}..
Calling prefit on ClassifierSVM
Calling prefit on ClassifierSVM
[CV 1/2; 9/9] END ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 3.0};, score=0.965 total time=24.9min
[CV 1/2; 5/9] END ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 2.5};, score=0.965 total time=25.2min
[CV 1/2; 1/9] END ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 1.5};, score=0.964 total time=25.5min
[CV 2/2; 2/9] END ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 2.5};, score=0.964 total time=26.1min
[CV 1/2; 4/9] END ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 1.5};, score=0.965 total time=26.3min
[CV 1/2; 2/9] END ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 2.5};, score=0.964 total time=26.4min
[CV 2/2; 1/9] END ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 1.5};, score=0.963 total time=26.5min
[CV 1/2; 7/9] END ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 1.5};, score=0.965 total time=26.8min
[CV 1/2; 3/9] END ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 3.0};, score=0.964 total time=27.4min
[CV 1/2; 6/9] END ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 3.0};, score=0.965 total time=27.6min
[CV 2/2; 6/9] END ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 3.0};, score=0.966 total time=27.7min
[CV 1/2; 8/9] END ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 2.5};, score=0.965 total time=27.8min
[CV 2/2; 3/9] END ClassifierSVM__C=1, ClassifierSVM__class_weight_1={1: 3.0};, score=0.965 total time=28.1min
[CV 2/2; 9/9] END ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 3.0};, score=0.966 total time=28.4min
[CV 2/2; 4/9] END ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 1.5};, score=0.964 total time=28.8min
[CV 2/2; 8/9] END ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 2.5};, score=0.965 total time=28.9min
[CV 2/2; 5/9] END ClassifierSVM__C=3, ClassifierSVM__class_weight_1={1: 2.5};, score=0.966 total time=29.2min
[CV 2/2; 7/9] END ClassifierSVM__C=5, ClassifierSVM__class_weight_1={1: 1.5};, score=0.965 total time=29.3min
Calling prefit on ClassifierSVM

   param_ClassifierSVM__C param_ClassifierSVM__class_weight_1  \
5                       3                            {1: 3.0}   
4                       3                            {1: 2.5}   
8                       5                            {1: 3.0}   
7                       5                            {1: 2.5}   
6                       5                            {1: 1.5}   
2                       1                            {1: 3.0}   
3                       3                            {1: 1.5}   
1                       1                            {1: 2.5}   
0                       1                            {1: 1.5}   

                                              params  split0_test_score  \
5  {'ClassifierSVM__C': 3, 'ClassifierSVM__class_...           0.965313   
4  {'ClassifierSVM__C': 3, 'ClassifierSVM__class_...           0.965364   
8  {'ClassifierSVM__C': 5, 'ClassifierSVM__class_...           0.965114   
7  {'ClassifierSVM__C': 5, 'ClassifierSVM__class_...           0.965356   
6  {'ClassifierSVM__C': 5, 'ClassifierSVM__class_...           0.965201   
2  {'ClassifierSVM__C': 1, 'ClassifierSVM__class_...           0.964449   
3  {'ClassifierSVM__C': 3, 'ClassifierSVM__class_...           0.964589   
1  {'ClassifierSVM__C': 1, 'ClassifierSVM__class_...           0.964195   
0  {'ClassifierSVM__C': 1, 'ClassifierSVM__class_...           0.963823   

   split1_test_score  mean_test_score  std_test_score  rank_test_score  
5           0.965726         0.965519        0.000206                1  
4           0.965614         0.965489        0.000125                2  
8           0.965626         0.965370        0.000256                3  
7           0.965307         0.965332        0.000024                4  
6           0.965129         0.965165        0.000036                5  
2           0.965158         0.964803        0.000354                6  
3           0.964434         0.964512        0.000078                7  
1           0.964178         0.964186        0.000009                8  
0           0.963234         0.963529        0.000295                9

In [27]:

Copied!

_y = pipeline_svm.predict(test_x)
_y = pipeline_svm.predict(test_x)

____________________________________________________________________________________________________
Predicting pipeline...
****************************************************************************************************

Cached(
    filter=TWECFilter(deltas_f=['cosine', 'euclidean', 'manhattan', 'chebyshev'], context_size=25),
    cache_data=True,
    cache_filter=True,
    overwrite=False,
    storage=None
)

         - El dato XYData(_hash='ed20b892e7858a253df46cdd3d19ef040844625d', 
_path='TWECFilter/60b322a0bd676ce665f0d6b568a28ef664fef914') Existe, se carga del storage.

DeltaSelectorFilter(deltas_f='cosine')

	 * Downloading: <_io.BufferedReader name='cache/TWECFilter/60b322a0bd676ce665f0d6b568a28ef664fef914/ed20b892e7858a253df46cdd3d19ef040844625d'>

SklearnOptimizer(
    scoring='f1_weighted',
    cv=2,
    pipeline=F3Pipeline(
        filters=[ClassifierSVM(proba=False)],
        metrics=[F1(average='binary')],
        overwrite=False,
        store=False,
        log=False
    ),
    n_jobs=-1
)

Evaluation¶

After training the model on the training set using cross-validation, we evaluate its performance on the test set. This comparison is somewhat biased, as it involves predicting the label of individual chunks while evaluating against labels that were propagated from user-level annotations to their corresponding chunks.

In [28]:

Copied!

pipeline_svm.evaluate(test_x, test_y, _y)
pipeline_svm.evaluate(test_x, test_y, _y)

____________________________________________________________________________________________________
Evaluating pipeline......
****************************************************************************************************

Out[28]:

{'F1': 0.845859872611465,
 'Precission': 0.8736842105263158,
 'Recall': 0.8197530864197531,
 'ERDE_5': 3.0647720557294345,
 'ERDE_50': 0.959133296979966}

If we perform a fairer evaluation by propagating the predictions to the user level—assigning a user as positive if at least one of their chunks is predicted as positive—we observe that the performance remains similar or even improves, which indicates that the system is working as intended.

In [29]:

Copied!

gg_2023_test["_y"] = _y.value
gg_2023_test.head(5)
gg_2023_test["_y"] = _y.value
gg_2023_test.head(5)

Out[29]:

	user	chunk	n_texts	text	date
0	subject1	0	64	[Dope, No retcons or changes. The way it was, ...	[2020-08-03 21:33:23, 2020-08-12 16:41:30, 202...
1	subject1	1	64	[Where did you get this?, I have no idea how t...	[2020-10-04 22:34:08, 2020-10-04 22:38:55, 202...
2	subject1	2	64	[A little something im working on, Tried to do...	[2021-02-10 21:13:06, 2021-02-17 20:34:45, 202...
3	subject1	3	64	[Oh the episodes after the characters stories ...	[2021-04-08 22:55:11, 2021-04-08 23:48:15, 202...
4	subject1	4	64	[They need to drop easter eggs or hints like t...	[2021-04-25 23:19:00, 2021-04-28 23:10:38, 202...

In [30]:

Copied!





aux = gg_2023_test.groupby(["user"]).agg(
    {"label": "first", "_y": lambda x: 1 if any(list(x)) else 0}
)
aux
aux = gg_2023_test.groupby(["user"]).agg(
    {"label": "first", "_y": lambda x: 1 if any(list(x)) else 0}
)
aux

Out[30]:

	label	_y
user
subject1	0	0
subject10	0	0
subject10000	0	0
subject1001	0	0
subject1005	0	0
...	...	...
subject9982	0	0
subject9984	0	0
subject999	0	0
subject9990	0	0
subject9999	0	0

2079 rows × 2 columns

In [31]:

Copied!

aux.groupby(["label"]).describe()
aux.groupby(["label"]).describe()

Out[31]:

	_y
	count	mean	std	min	25%	50%	75%	max
label
0	1998.0	0.009009	0.094511	0.0	0.0	0.0	0.0	1.0
1	81.0	0.975309	0.156150	0.0	1.0	1.0	1.0	1.0

In [32]:

Copied!





{
    "F1": F1().evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
    "Precission": Precission().evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
    "Recall": Recall().evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
    "ERDE_5": test_erde_5.evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
    "ERDE_50": test_erde_50.evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
}
{
    "F1": F1().evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
    "Precission": Precission().evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
    "Recall": Recall().evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
    "ERDE_5": test_erde_5.evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
    "ERDE_50": test_erde_50.evaluate(
        test_x, XYData.mock(aux.label.tolist()), XYData.mock(aux._y.tolist())
    ),
}

Out[32]:

{'F1': 0.8876404494382022,
 'Precission': 0.8144329896907216,
 'Recall': 0.9753086419753086,
 'ERDE_5': 3.3773273408026343,
 'ERDE_50': 1.8609453827254776}