Sometimes we want to avoid running the same pipeline multiple times with the same parameters. For this purpose, F3 provides a caching mechanism.
Data preparation¶
from framework3.utils.patch_type_guard import patch_inspect_for_notebooks
patch_inspect_for_notebooks()
✅ Patched inspect.getsource using dill.
from sklearn.datasets import fetch_20newsgroups
# Load the 20 Newsgroups dataset
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
from framework3.base import XYData
X_train = XYData(
_hash="20NG train X",
_path="/datasets",
_value=train.data, # type: ignore
)
X_test = XYData(
_hash="20NG test X",
_path="/datasets",
_value=test.data, # type: ignore
)
y_train = XYData(
_hash="20NG train y",
_path="/datasets",
_value=train.target, # type: ignore
)
y_test = XYData(
_hash="20NG test y",
_path="/datasets",
_value=test.target, # type: ignore
)
First, we need to transform our text data into numerical vectors using Sentence Transformers.¶
Let's download a pre-trained language model for sentence embeddings.
from framework3.plugins.filters.llm import HuggingFaceSentenceTransformerPlugin
llm = HuggingFaceSentenceTransformerPlugin(
model_name="sentence-transformers/all-mpnet-base-v2"
)
Pipeline preparation¶
We can perform PCA to reduce dimensionality and then apply an SVM for classification. But we don't know which hyperparameters are best for this particular data, so we can use a simple grid search.
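For intuition, the search F3 runs under the hood is equivalent to scikit-learn's GridSearchCV over a small Pipeline. Here is a minimal standalone sketch on synthetic data (the synthetic dimensions are arbitrary, so the `n_components` values are scaled down from the ones used below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the embedded 20NG data
X, y = make_classification(n_samples=200, n_features=50, n_informative=20, random_state=0)

pipe = Pipeline([("pca", PCA()), ("svm", SVC(kernel="rbf"))])
grid = {
    "pca__n_components": [10, 20],
    "svm__C": [0.1, 1.0, 10.0],
    "svm__gamma": [1e-3, 1e-4],
}

# Exhaustively try every parameter combination with 2-fold cross-validation
search = GridSearchCV(pipe, grid, scoring="f1_weighted", cv=2)
search.fit(X, y)
print(search.best_params_)
```

F3's `.grid(...)` / `SklearnOptimizer` combination expresses the same idea declaratively on top of its own filter abstraction.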
from framework3 import F1, ClassifierSVMPlugin, SklearnOptimizer
from framework3.plugins.filters import PCAPlugin
from framework3.plugins.pipelines.sequential import F3Pipeline
grid_pipelin = F3Pipeline(
filters=[
PCAPlugin().grid({"n_components": [10, 50, 100]}),
ClassifierSVMPlugin(kernel="rbf").grid(
{"C": [0.1, 1.0, 10.0], "gamma": [1e-3, 1e-4]}
),
],
metrics=[F1()],
).optimizer(SklearnOptimizer(scoring="f1_weighted", cv=2))
{ 'clazz': 'F3Pipeline', 'params': { 'filters': [ {'clazz': 'PCAPlugin', 'params': {'n_components': 2}, '_grid': {'n_components': [10, 50, 100]}}, { 'clazz': 'ClassifierSVMPlugin', 'params': {'C': 1.0, 'kernel': 'rbf', 'gamma': 'scale'}, '_grid': {'C': [0.1, 1.0, 10.0], 'gamma': [0.001, 0.0001]} } ], 'metrics': [{'clazz': 'F1', 'params': {'average': 'weighted'}}], 'overwrite': False, 'store': False, 'log': False } }
Caching: The fun part¶
Now it's time to put everything together. The llm filter produces a heavy data item, the output embeddings of the model, and we don't want to compute the same embeddings several times. We will cache this item using the cache_data=True parameter.
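Conceptually, Cached keys stored artifacts by a hash derived from the filter configuration and the input data, so an identical (filter, input) pair is served from storage instead of being recomputed. A toy, framework-agnostic sketch of that idea (F3's actual hashing and storage layers differ):

```python
import hashlib
import pickle

_cache: dict[str, bytes] = {}  # stands in for F3's local/remote storage


def cache_key(config: dict, data) -> str:
    # Hash the filter config together with the input data
    payload = pickle.dumps((sorted(config.items()), data))
    return hashlib.sha1(payload).hexdigest()


def cached_transform(config: dict, data, transform):
    key = cache_key(config, data)
    if key in _cache:  # hit: skip the expensive computation
        return pickle.loads(_cache[key])
    result = transform(data)
    _cache[key] = pickle.dumps(result)  # miss: compute once, then store
    return result


# An "expensive" transform stands in for the sentence-embedding model
calls = []
def embed(texts):
    calls.append(1)
    return [len(t) for t in texts]

out1 = cached_transform({"model": "mpnet"}, ("a", "bb"), embed)
out2 = cached_transform({"model": "mpnet"}, ("a", "bb"), embed)  # served from cache
```

After the second call, `embed` has still only run once, which is exactly the behavior you will see in the logs below.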
from framework3 import Cached, Precission, Recall
final_pipeline = F3Pipeline(
filters=[Cached(llm, cache_data=True, cache_filter=False), grid_pipelin],
metrics=[F1(), Precission(), Recall()],
)
final_pipeline.fit(X_train, y_train)
_y = final_pipeline.predict(X_test)
____________________________________________________________________________________________________
Fitting pipeline...
****************************************************************************************************
*Cached({'filter': HuggingFaceSentenceTransformerPlugin({'model_name': 'sentence-transformers/all-mpnet-base-v2'}), 'cache_data': True, 'cache_filter': False, 'overwrite': False, 'storage': None})
- The filter HuggingFaceSentenceTransformerPlugin({'model_name': 'sentence-transformers/all-mpnet-base-v2'}) with hash cd52a2089d77df27ae1a888d97422cd38e3bb01a does not exist; it will be trained.
- The data XYData(_hash='0bb5f0568e0be233f4dfc5bbe7e893a6369f353e', _path='HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a') does not exist; it will be created.
- The data XYData(_hash='0bb5f0568e0be233f4dfc5bbe7e893a6369f353e', _path='HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a') is being cached.
* Saving in local path: cache/HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a/0bb5f0568e0be233f4dfc5bbe7e893a6369f353e * Saved !
*SklearnOptimizer({'scoring': 'f1_weighted', 'cv': 2, 'pipeline': F3Pipeline({'filters': [PCAPlugin({'n_components': 2}), ClassifierSVMPlugin({'C': 1.0, 'kernel': 'rbf', 'gamma': 'scale'})], 'metrics': [F1({'average': 'weighted'})], 'overwrite': False, 'store': False, 'log': False})})
Fitting 2 folds for each of 18 candidates, totalling 36 fits
____________________________________________________________________________________________________
Predicting pipeline...
****************************************************************************************************
*Cached({'filter': HuggingFaceSentenceTransformerPlugin({'model_name': 'sentence-transformers/all-mpnet-base-v2'}), 'cache_data': True, 'cache_filter': False, 'overwrite': False, 'storage': None})
- The data XYData(_hash='68fc2995310bba822e143578b3f4ee9ddd9f212e', _path='HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a') does not exist; it will be created.
- The data XYData(_hash='68fc2995310bba822e143578b3f4ee9ddd9f212e', _path='HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a') is being cached.
* Saving in local path: cache/HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a/68fc2995310bba822e143578b3f4ee9ddd9f212e * Saved !
*SklearnOptimizer({'scoring': 'f1_weighted', 'cv': 2, 'pipeline': F3Pipeline({'filters': [PCAPlugin({'n_components': 2}), ClassifierSVMPlugin({'C': 1.0, 'kernel': 'rbf', 'gamma': 'scale'})], 'metrics': [F1({'average': 'weighted'})], 'overwrite': False, 'store': False, 'log': False})})
final_pipeline.evaluate(X_test, y_test, _y)
____________________________________________________________________________________________________
Evaluating pipeline......
****************************************************************************************************
{'F1': 0.7983619633942866, 'Precission': 0.8005437637532571, 'Recall': 0.798194370685077}
Pretty cool, right? But this doesn't yet show the full potential of the caching mechanism. The benefits of caching come when we want to change things without repeating unnecessary computations. For instance, if we want to optimize other classifiers or pipelines, the same embeddings won't be computed again: F3 will use the cached data.
Let's define another pipeline¶
from framework3 import F1, KnnFilter, SklearnOptimizer
from framework3.plugins.filters import PCAPlugin
from framework3.plugins.pipelines.sequential import F3Pipeline
grid_pipelin_v2 = F3Pipeline(
filters=[
PCAPlugin().grid({"n_components": [10, 50, 100]}),
KnnFilter().grid({"n_neighbors": [2, 5, 10]}),
],
metrics=[F1()],
).optimizer(SklearnOptimizer(scoring="f1_weighted", cv=2))
{ 'clazz': 'F3Pipeline', 'params': { 'filters': [ {'clazz': 'PCAPlugin', 'params': {'n_components': 2}, '_grid': {'n_components': [10, 50, 100]}}, { 'clazz': 'KnnFilter', 'params': { 'n_neighbors': 5, 'weights': 'uniform', 'algorithm': 'auto', 'leaf_size': 30, 'p': 2, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None }, '_grid': {'n_neighbors': [2, 5, 10]} } ], 'metrics': [{'clazz': 'F1', 'params': {'average': 'weighted'}}], 'overwrite': False, 'store': False, 'log': False } }
Now let's update our main pipeline with this new pipeline.
final_pipeline = F3Pipeline(
filters=[Cached(llm, cache_data=True, cache_filter=False), grid_pipelin_v2],
metrics=[F1(), Precission(), Recall()],
)
Let's train and evaluate this modified pipeline and see what happens.
final_pipeline.fit(X_train, y_train)
_y = final_pipeline.predict(X_test)
____________________________________________________________________________________________________
Fitting pipeline...
****************************************************************************************************
*Cached({'filter': HuggingFaceSentenceTransformerPlugin({'model_name': 'sentence-transformers/all-mpnet-base-v2'}), 'cache_data': True, 'cache_filter': False, 'overwrite': False, 'storage': None})
- The filter HuggingFaceSentenceTransformerPlugin({'model_name': 'sentence-transformers/all-mpnet-base-v2'}) with hash cd52a2089d77df27ae1a888d97422cd38e3bb01a does not exist; it will be trained.
- The data XYData(_hash='0bb5f0568e0be233f4dfc5bbe7e893a6369f353e', _path='HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a') exists; creating a lazy loader (lambda).
*SklearnOptimizer({'scoring': 'f1_weighted', 'cv': 2, 'pipeline': F3Pipeline({'filters': [PCAPlugin({'n_components': 2}), KnnFilter({'n_neighbors': 5, 'weights': 'uniform', 'algorithm': 'auto', 'leaf_size': 30, 'p': 2, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None})], 'metrics': [F1({'average': 'weighted'})], 'overwrite': False, 'store': False, 'log': False})})
* Downloading: <_io.BufferedReader name='cache/HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a/0bb5f0568e0be233f4dfc5bbe7e893a6369f353e'> Fitting 2 folds for each of 9 candidates, totalling 18 fits
____________________________________________________________________________________________________
Predicting pipeline...
****************************************************************************************************
*Cached({'filter': HuggingFaceSentenceTransformerPlugin({'model_name': 'sentence-transformers/all-mpnet-base-v2'}), 'cache_data': True, 'cache_filter': False, 'overwrite': False, 'storage': None})
- The data XYData(_hash='68fc2995310bba822e143578b3f4ee9ddd9f212e', _path='HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a') exists; creating a lazy loader (lambda).
*SklearnOptimizer({'scoring': 'f1_weighted', 'cv': 2, 'pipeline': F3Pipeline({'filters': [PCAPlugin({'n_components': 2}), KnnFilter({'n_neighbors': 5, 'weights': 'uniform', 'algorithm': 'auto', 'leaf_size': 30, 'p': 2, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None})], 'metrics': [F1({'average': 'weighted'})], 'overwrite': False, 'store': False, 'log': False})})
* Downloading: <_io.BufferedReader name='cache/HuggingFaceSentenceTransformerPlugin/cd52a2089d77df27ae1a888d97422cd38e3bb01a/68fc2995310bba822e143578b3f4ee9ddd9f212e'>
final_pipeline.evaluate(X_test, y_test, _y)
____________________________________________________________________________________________________
Evaluating pipeline......
****************************************************************************************************
{'F1': 0.8090253572616752, 'Precission': 0.8127742649585784, 'Recall': 0.8090812533191716}
As you can see, F3 reused the cached embeddings from the first pipeline, so this run was faster and consumed fewer resources.
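The logs above show artifacts being saved under a local `cache/` directory, keyed by filter hash and then by data hash. If you want to see what has been cached so far, a quick directory walk is enough (a sketch; the exact layout depends on the storage backend configured):

```python
from pathlib import Path


def list_cache(root: str = "cache"):
    # List every cached artifact, relative to the cache root,
    # e.g. "HuggingFaceSentenceTransformerPlugin/<filter_hash>/<data_hash>"
    base = Path(root)
    if not base.exists():
        return []
    return sorted(str(p.relative_to(base)) for p in base.rglob("*") if p.is_file())


for entry in list_cache():
    print(entry)
```

Deleting a filter's subdirectory (or passing overwrite=True) forces the corresponding embeddings to be recomputed on the next run.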