
SklearnOptimizer

labchain.plugins.optimizer.sklearn_optimizer

__all__ = ['SklearnOptimizer'] module-attribute

SklearnOptimizer

Bases: BaseOptimizer

Sklearn-based optimizer for hyperparameter tuning using GridSearchCV.

This class implements hyperparameter optimization using scikit-learn's GridSearchCV. It allows for efficient searching of hyperparameter spaces for machine learning models within the Framework3 pipeline system.

Key Features
  • Supports various types of hyperparameters (categorical, numerical)
  • Integrates with scikit-learn's GridSearchCV for exhaustive search
  • Allows for customizable scoring metrics
  • Integrates with the Framework3 pipeline system
Usage

The SklearnOptimizer can be used to optimize hyperparameters of a machine learning pipeline:

from framework3.plugins.optimizer import SklearnOptimizer
from framework3.base import XYData

# Assuming you have a pipeline and data
pipeline = ...
x_data = XYData(...)
y_data = XYData(...)

optimizer = SklearnOptimizer(scoring='accuracy', cv=5)
optimizer.optimize(pipeline)
optimizer.fit(x_data, y_data)

best_pipeline = optimizer.pipeline
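
Because `scoring` is forwarded directly to GridSearchCV, any scorer that scikit-learn accepts can be used, including a callable. A minimal sketch, assuming the same `pipeline`, `x_data`, and `y_data` as above:

```python
from sklearn.metrics import f1_score, make_scorer

from framework3.plugins.optimizer import SklearnOptimizer

# Wrap a custom metric as a scorer that GridSearchCV understands.
f1_macro = make_scorer(f1_score, average="macro")

optimizer = SklearnOptimizer(scoring=f1_macro, cv=3, n_jobs=-1)
optimizer.optimize(pipeline)
best_score = optimizer.fit(x_data, y_data)  # returns GridSearchCV.best_score_
```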

Attributes:

  • scoring (str | Callable | Tuple | Dict): The scoring metric for GridSearchCV.
  • pipeline (BaseFilter | None): The pipeline to be optimized.
  • cv (int): The number of cross-validation folds.
  • _grid (Dict): The parameter grid for GridSearchCV.
  • _filters (List[Tuple[str, SkWrapper]]): The list of pipeline steps.
  • _pipeline (Pipeline): The scikit-learn Pipeline object.
  • _clf (GridSearchCV): The GridSearchCV object.

Methods:

  • optimize(pipeline: BaseFilter): Set up the optimization process for a given pipeline.
  • fit(x: XYData, y: Optional[XYData]) -> None | float: Fit the GridSearchCV object to the given data.
  • predict(x: XYData) -> XYData: Make predictions using the best estimator found by GridSearchCV.
  • evaluate(x_data: XYData, y_true: XYData | None, y_pred: XYData) -> Dict[str, Any]: Evaluate the optimized pipeline.

Source code in labchain/plugins/optimizer/sklearn_optimizer.py
@Container.bind()
class SklearnOptimizer(BaseOptimizer):
    """
    Sklearn-based optimizer for hyperparameter tuning using GridSearchCV.

    This class implements hyperparameter optimization using scikit-learn's GridSearchCV.
    It allows for efficient searching of hyperparameter spaces for machine learning models
    within the Framework3 pipeline system.

    Key Features:
        - Supports various types of hyperparameters (categorical, numerical)
        - Integrates with scikit-learn's GridSearchCV for exhaustive search
        - Allows for customizable scoring metrics
        - Integrates with the Framework3 pipeline system

    Usage:
        The SklearnOptimizer can be used to optimize hyperparameters of a machine learning pipeline:

        ```python
        from framework3.plugins.optimizer import SklearnOptimizer
        from framework3.base import XYData

        # Assuming you have a pipeline and data
        pipeline = ...
        x_data = XYData(...)
        y_data = XYData(...)

        optimizer = SklearnOptimizer(scoring='accuracy', cv=5)
        optimizer.optimize(pipeline)
        optimizer.fit(x_data, y_data)

        best_pipeline = optimizer.pipeline
        ```

    Attributes:
        scoring (str | Callable | Tuple | Dict): The scoring metric for GridSearchCV.
        pipeline (BaseFilter | None): The pipeline to be optimized.
        cv (int): The number of cross-validation folds.
        _grid (Dict): The parameter grid for GridSearchCV.
        _filters (List[Tuple[str, SkWrapper]]): The list of pipeline steps.
        _pipeline (Pipeline): The scikit-learn Pipeline object.
        _clf (GridSearchCV): The GridSearchCV object.

    Methods:
        optimize(pipeline: BaseFilter): Set up the optimization process for a given pipeline.
        fit(x: XYData, y: Optional[XYData]) -> None | float: Fit the GridSearchCV object to the given data.
        predict(x: XYData) -> XYData: Make predictions using the best estimator found by GridSearchCV.
        evaluate(x_data: XYData, y_true: XYData | None, y_pred: XYData) -> Dict[str, Any]:
            Evaluate the optimized pipeline.
    """

    def __init__(
        self,
        scoring: str | Callable | Tuple | Dict,
        pipeline: BaseFilter | None = None,
        cv: int = 2,
        n_jobs: int | None = None,
    ):
        """
        Initialize the SklearnOptimizer.

        Args:
            scoring (str | Callable | Tuple | Dict): Strategy to evaluate the performance of the cross-validated model.
            pipeline (BaseFilter | None): The pipeline to be optimized. Defaults to None.
            cv (int): Determines the cross-validation splitting strategy. Defaults to 2.
        """

        super().__init__(
            scoring=scoring,
            cv=cv,
            pipeline=pipeline,
        )
        self.pipeline = pipeline
        self.n_jobs = n_jobs
        self._grid = {}

    def get_grid(self, aux: Dict[str, Any]) -> None:
        """
        Recursively process the grid configuration of a pipeline or filter.

        This method traverses the configuration dictionary and builds the parameter grid
        for GridSearchCV.

        Args:
            aux (Dict[str, Any]): The configuration dictionary to process.

        Note:
            This method modifies the _grid attribute in-place.
        """
        match aux["params"]:
            case {"filters": filters, **r}:
                for filter_config in filters:
                    self.get_grid(filter_config)
            case {"pipeline": pipeline, **r}:  # noqa: F841
                self.get_grid(pipeline)
            case _:
                if "_grid" in aux:
                    for param, value in aux["_grid"].items():
                        if type(value) is list:
                            self._grid[f'{aux["clazz"]}__{param}'] = value
                        else:
                            self._grid[f'{aux["clazz"]}__{param}'] = [value]

    def optimize(self, pipeline: BaseFilter):
        """
        Set up the optimization process for a given pipeline.

        This method prepares the GridSearchCV object for optimization.

        Args:
            pipeline (BaseFilter): The pipeline to be optimized.
        """
        self.pipeline = pipeline
        self.pipeline.verbose(False)
        self._filters = list(
            map(lambda x: (x.__name__, SkWrapper(x)), self.pipeline.get_types())
        )

        dumped_pipeline = self.pipeline.item_dump(include=["_grid"])
        self.get_grid(dumped_pipeline)

        self._pipeline = Pipeline(self._filters)

        self._clf: GridSearchCV = GridSearchCV(
            estimator=self._pipeline,
            param_grid=self._grid,
            scoring=self.scoring,
            cv=self.cv,
            n_jobs=self.n_jobs,
            verbose=10,
        )

    def start(
        self, x: XYData, y: Optional[XYData], X_: Optional[XYData]
    ) -> Optional[XYData]:
        """
        Start the pipeline execution.

        This method fits the optimizer and makes predictions if X_ is provided.

        Args:
            x (XYData): Input data for fitting.
            y (Optional[XYData]): Target data for fitting.
            X_ (Optional[XYData]): Data for prediction (if different from x).

        Returns:
            Optional[XYData]: Prediction results if X_ is provided, else None.

        Raises:
            Exception: If an error occurs during pipeline execution.
        """
        try:
            self.fit(x, y)
            if X_ is not None:
                return self.predict(X_)
            else:
                return self.predict(x)
        except Exception as e:
            print(f"Error during pipeline execution: {e}")
            raise e

    def fit(self, x: XYData, y: Optional[XYData]) -> None | float:
        """
        Fit the GridSearchCV object to the given data.

        This method performs the grid search and prints the results.

        Args:
            x (XYData): The input features.
            y (Optional[XYData]): The target values.

        Returns:
            None | float: The best score achieved during the grid search.
        """
        self._clf.fit(x.value, y.value if y is not None else None)
        results = self._clf.cv_results_
        results_df = (
            pd.DataFrame(results)
            .iloc[:, 4:]
            .sort_values("mean_test_score", ascending=False)
        )
        print(results_df)
        return self._clf.best_score_  # type: ignore

    def predict(self, x: XYData) -> XYData:
        """
        Make predictions using the best estimator found by GridSearchCV.

        Args:
            x (XYData): The input features.

        Returns:
            XYData: The predicted values wrapped in an XYData object.
        """
        return XYData.mock(self._clf.predict(x.value))  # type: ignore

    def evaluate(
        self, x_data: XYData, y_true: XYData | None, y_pred: XYData
    ) -> Dict[str, Any]:
        """
        Evaluate the optimized pipeline.

        This method applies each metric in the pipeline to the predicted and true values,
        and includes the best score from GridSearchCV.

        Args:
            x_data (XYData): Input data.
            y_true (XYData | None): True target data.
            y_pred (XYData): Predicted target data.

        Returns:
            Dict[str, Any]: A dictionary containing the evaluation results for each metric
                            and the best score from GridSearchCV.

        Example:
            ```python
            >>> evaluation = optimizer.evaluate(x_test, y_test, predictions)
            >>> print(evaluation)
            {'F1Score': 0.85, 'best_score': 0.87}
            ```
        """
        if self.pipeline is None:
            raise Exception("No pipeline set for evaluation.")

        results = self.pipeline.evaluate(x_data, y_true, y_pred)
        results["best_score"] = self._clf.best_score_  # type: ignore
        return results

Instance attributes
  • n_jobs = n_jobs
  • pipeline = pipeline

__init__(scoring, pipeline=None, cv=2, n_jobs=None)

Initialize the SklearnOptimizer.

Parameters:

  • scoring (str | Callable | Tuple | Dict): Strategy to evaluate the performance of the cross-validated model. Required.
  • pipeline (BaseFilter | None): The pipeline to be optimized. Defaults to None.
  • cv (int): Determines the cross-validation splitting strategy. Defaults to 2.
  • n_jobs (int | None): Number of jobs run in parallel by the underlying GridSearchCV. Defaults to None.
Source code in labchain/plugins/optimizer/sklearn_optimizer.py
def __init__(
    self,
    scoring: str | Callable | Tuple | Dict,
    pipeline: BaseFilter | None = None,
    cv: int = 2,
    n_jobs: int | None = None,
):
    """
    Initialize the SklearnOptimizer.

    Args:
        scoring (str | Callable | Tuple | Dict): Strategy to evaluate the performance of the cross-validated model.
        pipeline (BaseFilter | None): The pipeline to be optimized. Defaults to None.
        cv (int): Determines the cross-validation splitting strategy. Defaults to 2.
    """

    super().__init__(
        scoring=scoring,
        cv=cv,
        pipeline=pipeline,
    )
    self.pipeline = pipeline
    self.n_jobs = n_jobs
    self._grid = {}
evaluate(x_data, y_true, y_pred)

Evaluate the optimized pipeline.

This method applies each metric in the pipeline to the predicted and true values, and includes the best score from GridSearchCV.

Parameters:

  • x_data (XYData): Input data. Required.
  • y_true (XYData | None): True target data. Required.
  • y_pred (XYData): Predicted target data. Required.

Returns:

  • Dict[str, Any]: A dictionary containing the evaluation results for each metric and the best score from GridSearchCV.

Example
>>> evaluation = optimizer.evaluate(x_test, y_test, predictions)
>>> print(evaluation)
{'F1Score': 0.85, 'best_score': 0.87}
Source code in labchain/plugins/optimizer/sklearn_optimizer.py
def evaluate(
    self, x_data: XYData, y_true: XYData | None, y_pred: XYData
) -> Dict[str, Any]:
    """
    Evaluate the optimized pipeline.

    This method applies each metric in the pipeline to the predicted and true values,
    and includes the best score from GridSearchCV.

    Args:
        x_data (XYData): Input data.
        y_true (XYData | None): True target data.
        y_pred (XYData): Predicted target data.

    Returns:
        Dict[str, Any]: A dictionary containing the evaluation results for each metric
                        and the best score from GridSearchCV.

    Example:
        ```python
        >>> evaluation = optimizer.evaluate(x_test, y_test, predictions)
        >>> print(evaluation)
        {'F1Score': 0.85, 'best_score': 0.87}
        ```
    """
    if self.pipeline is None:
        raise Exception("No pipeline set for evaluation.")

    results = self.pipeline.evaluate(x_data, y_true, y_pred)
    results["best_score"] = self._clf.best_score_  # type: ignore
    return results
fit(x, y)

Fit the GridSearchCV object to the given data.

This method performs the grid search and prints the results.

Parameters:

  • x (XYData): The input features. Required.
  • y (Optional[XYData]): The target values. Required.

Returns:

  • None | float: The best score achieved during the grid search.

Source code in labchain/plugins/optimizer/sklearn_optimizer.py
def fit(self, x: XYData, y: Optional[XYData]) -> None | float:
    """
    Fit the GridSearchCV object to the given data.

    This method performs the grid search and prints the results.

    Args:
        x (XYData): The input features.
        y (Optional[XYData]): The target values.

    Returns:
        None | float: The best score achieved during the grid search.
    """
    self._clf.fit(x.value, y.value if y is not None else None)
    results = self._clf.cv_results_
    results_df = (
        pd.DataFrame(results)
        .iloc[:, 4:]
        .sort_values("mean_test_score", ascending=False)
    )
    print(results_df)
    return self._clf.best_score_  # type: ignore
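
After `fit`, the usual scikit-learn attributes are available on the internal GridSearchCV instance. A short sketch for inspecting the search results, assuming `optimizer` was fitted as in the Usage section (note that `_clf` is a private attribute, accessed here only for illustration):

```python
import pandas as pd

best_score = optimizer.fit(x_data, y_data)

# Standard GridSearchCV attributes on the internal estimator.
print(optimizer._clf.best_params_)  # parameter combination with the highest mean test score
print(optimizer._clf.best_score_)   # the same value that fit() returned

# Full cross-validation table (also printed by fit()), sorted by mean test score.
results_df = pd.DataFrame(optimizer._clf.cv_results_).sort_values(
    "mean_test_score", ascending=False
)
print(results_df[["params", "mean_test_score", "std_test_score"]])
```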
get_grid(aux)

Recursively process the grid configuration of a pipeline or filter.

This method traverses the configuration dictionary and builds the parameter grid for GridSearchCV.

Parameters:

  • aux (Dict[str, Any]): The configuration dictionary to process. Required.

Note:

This method modifies the _grid attribute in-place.

Source code in labchain/plugins/optimizer/sklearn_optimizer.py
def get_grid(self, aux: Dict[str, Any]) -> None:
    """
    Recursively process the grid configuration of a pipeline or filter.

    This method traverses the configuration dictionary and builds the parameter grid
    for GridSearchCV.

    Args:
        aux (Dict[str, Any]): The configuration dictionary to process.

    Note:
        This method modifies the _grid attribute in-place.
    """
    match aux["params"]:
        case {"filters": filters, **r}:
            for filter_config in filters:
                self.get_grid(filter_config)
        case {"pipeline": pipeline, **r}:  # noqa: F841
            self.get_grid(pipeline)
        case _:
            if "_grid" in aux:
                for param, value in aux["_grid"].items():
                    if type(value) is list:
                        self._grid[f'{aux["clazz"]}__{param}'] = value
                    else:
                        self._grid[f'{aux["clazz"]}__{param}'] = [value]
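
To make the traversal concrete, the sketch below feeds `get_grid` a hypothetical dumped configuration (the dictionary layout and the `KnnFilter`/`StandardScalerPlugin` names are illustrative assumptions, not actual `item_dump` output), assuming a freshly constructed optimizer:

```python
dumped = {
    "clazz": "F3Pipeline",
    "params": {
        "filters": [
            # A filter with an empty grid contributes nothing.
            {"clazz": "StandardScalerPlugin", "params": {}, "_grid": {}},
            # List values are kept as-is; scalar values are wrapped in a single-element list.
            {
                "clazz": "KnnFilter",
                "params": {},
                "_grid": {"n_neighbors": [3, 5, 7], "weights": "uniform"},
            },
        ]
    },
}

optimizer.get_grid(dumped)
print(optimizer._grid)
# {'KnnFilter__n_neighbors': [3, 5, 7], 'KnnFilter__weights': ['uniform']}
```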
optimize(pipeline)

Set up the optimization process for a given pipeline.

This method prepares the GridSearchCV object for optimization.

Parameters:

  • pipeline (BaseFilter): The pipeline to be optimized. Required.

Source code in labchain/plugins/optimizer/sklearn_optimizer.py
def optimize(self, pipeline: BaseFilter):
    """
    Set up the optimization process for a given pipeline.

    This method prepares the GridSearchCV object for optimization.

    Args:
        pipeline (BaseFilter): The pipeline to be optimized.
    """
    self.pipeline = pipeline
    self.pipeline.verbose(False)
    self._filters = list(
        map(lambda x: (x.__name__, SkWrapper(x)), self.pipeline.get_types())
    )

    dumped_pipeline = self.pipeline.item_dump(include=["_grid"])
    self.get_grid(dumped_pipeline)

    self._pipeline = Pipeline(self._filters)

    self._clf: GridSearchCV = GridSearchCV(
        estimator=self._pipeline,
        param_grid=self._grid,
        scoring=self.scoring,
        cv=self.cv,
        n_jobs=self.n_jobs,
        verbose=10,
    )
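
The `ClassName__param` keys produced by `get_grid` follow scikit-learn's standard convention for addressing parameters of named Pipeline steps, which is why the step names built from `x.__name__` line up with the `clazz` prefixes in the grid. A plain-scikit-learn sketch of the same construction, using stock estimators instead of SkWrapper purely for illustration:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names play the role of the wrapped filter class names in optimize().
pipe = Pipeline([("Scaler", StandardScaler()), ("KNN", KNeighborsClassifier())])

# Keys use the "<step name>__<parameter>" convention that get_grid() emits.
param_grid = {"KNN__n_neighbors": [3, 5, 7], "KNN__weights": ["uniform", "distance"]}

clf = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring="accuracy", cv=2)
```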
predict(x)

Make predictions using the best estimator found by GridSearchCV.

Parameters:

  • x (XYData): The input features. Required.

Returns:

  • XYData: The predicted values wrapped in an XYData object.

Source code in labchain/plugins/optimizer/sklearn_optimizer.py
def predict(self, x: XYData) -> XYData:
    """
    Make predictions using the best estimator found by GridSearchCV.

    Args:
        x (XYData): The input features.

    Returns:
        XYData: The predicted values wrapped in an XYData object.
    """
    return XYData.mock(self._clf.predict(x.value))  # type: ignore
start(x, y, X_)

Start the pipeline execution.

This method fits the optimizer and makes predictions if X_ is provided.

Parameters:

  • x (XYData): Input data for fitting. Required.
  • y (Optional[XYData]): Target data for fitting. Required.
  • X_ (Optional[XYData]): Data for prediction (if different from x). Required.

Returns:

  • Optional[XYData]: Prediction results if X_ is provided, else None.

Raises:

  • Exception: If an error occurs during pipeline execution.

Source code in labchain/plugins/optimizer/sklearn_optimizer.py
def start(
    self, x: XYData, y: Optional[XYData], X_: Optional[XYData]
) -> Optional[XYData]:
    """
    Start the pipeline execution.

    This method fits the optimizer and makes predictions if X_ is provided.

    Args:
        x (XYData): Input data for fitting.
        y (Optional[XYData]): Target data for fitting.
        X_ (Optional[XYData]): Data for prediction (if different from x).

    Returns:
        Optional[XYData]: Prediction results if X_ is provided, else None.

    Raises:
        Exception: If an error occurs during pipeline execution.
    """
    try:
        self.fit(x, y)
        if X_ is not None:
            return self.predict(X_)
        else:
            return self.predict(x)
    except Exception as e:
        print(f"Error during pipeline execution: {e}")
        raise e
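
Putting the pieces together, `start` is a convenience wrapper that fits the grid search and then predicts. A minimal end-to-end sketch, assuming a training/test split (`x_train`, `y_train`, `x_test`, `y_test` are illustrative names) and a `pipeline` built as in the Usage section:

```python
optimizer = SklearnOptimizer(scoring="accuracy", cv=5, n_jobs=-1)
optimizer.optimize(pipeline)

# Fit the grid search on the training data and predict on the held-out split.
predictions = optimizer.start(x_train, y_train, X_=x_test)

if predictions is not None:
    report = optimizer.evaluate(x_test, y_true=y_test, y_pred=predictions)
    print(report)  # per-metric results from the pipeline plus 'best_score' from GridSearchCV
```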