KFoldSplitter

`framework3.plugins.splitter.cross_validation_splitter`

`KFoldSplitter`

Bases: BaseSplitter

A K-Fold cross-validation splitter for evaluating machine learning models.

This class implements K-Fold cross-validation, which splits the dataset into K folds of approximately equal size (fold sizes differ by at most one sample when the dataset size is not divisible by K). The model is trained on K-1 folds and validated on the remaining fold. This process is repeated K times, with each fold serving as the validation set exactly once.
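
The partitioning described above can be sketched without any dependencies. The helper `k_fold_indices` below is a hypothetical illustration, not part of framework3, and its shuffling uses Python's `random` module, so the exact index order will differ from whatever the `KFold` object used internally produces. It only shows how fold sizes are balanced and that every sample is validated exactly once.

```python
import random


def k_fold_indices(n_samples, n_splits=5, shuffle=True, seed=42):
    """Yield (train_idx, val_idx) index lists, one pair per fold."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    if n_splits > n_samples:
        raise ValueError("n_splits cannot exceed the number of samples")

    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the split reproducible between runs.
        random.Random(seed).shuffle(indices)

    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)  # [4, 3, 3]
```

With 10 samples and 3 folds, the validation folds hold 4, 3, and 3 samples, and each training set holds the remaining samples.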

Key Features

Configurable number of splits
Option to shuffle data before splitting
Supports custom pipelines for model training and evaluation
Provides mean loss across all folds

Usage

from framework3.plugins.splitter import KFoldSplitter
from framework3.plugins.pipelines.sequential import F3Pipeline
from framework3.base import XYData
import numpy as np

# Create a dummy pipeline
pipeline = F3Pipeline(filters=[...], metrics=[...])

# Create the KFoldSplitter
splitter = KFoldSplitter(n_splits=5, shuffle=True, random_state=42, pipeline=pipeline)

# Prepare some dummy data
X = XYData(value=np.random.rand(100, 10))
y = XYData(value=np.random.randint(0, 2, 100))

# Fit and evaluate the model using cross-validation
mean_loss = splitter.fit(X, y)
print(f"Mean loss across folds: {mean_loss}")

# Make predictions on new data
X_new = XYData(value=np.random.rand(20, 10))
predictions = splitter.predict(X_new)

Attributes:

Name	Type	Description
`n_splits`	`int`	Number of folds. Must be at least 2.
`shuffle`	`bool`	Whether to shuffle the data before splitting.
`random_state`	`int`	Controls the shuffling applied to the data before applying the split.
`pipeline`	`BaseFilter \| None`	The pipeline to be used for training and evaluation.

Methods:

Name	Description
`split(pipeline: BaseFilter)`	Set the pipeline for the splitter.
`fit(x: XYData, y: XYData \| None) -> Optional[float]`	Perform K-Fold cross-validation.
`predict(x: XYData) -> XYData`	Make predictions using the fitted pipeline.
`evaluate(x_data: XYData, y_true: XYData \| None, y_pred: XYData) -> Dict[str, Any]`	Evaluate the pipeline on the provided data.
`start(x: XYData, y: Optional[XYData], X_: Optional[XYData]) -> Optional[XYData]`	Start the cross-validation process and make predictions.

Source code in framework3/plugins/splitter/cross_validation_splitter.py

@Container.bind()
class KFoldSplitter(BaseSplitter):
    """
    A K-Fold cross-validation splitter for evaluating machine learning models.

    This class implements K-Fold cross-validation, which splits the dataset into K equally sized folds.
    The model is trained on K-1 folds and validated on the remaining fold. This process is repeated K times,
    with each fold serving as the validation set once.

    Key Features:
        - Configurable number of splits
        - Option to shuffle data before splitting
        - Supports custom pipelines for model training and evaluation
        - Provides mean loss across all folds

    Usage:
        ```python
        from framework3.plugins.splitter import KFoldSplitter
        from framework3.plugins.pipelines.sequential import F3Pipeline
        from framework3.base import XYData
        import numpy as np

        # Create a dummy pipeline
        pipeline = F3Pipeline(filters=[...], metrics=[...])

        # Create the KFoldSplitter
        splitter = KFoldSplitter(n_splits=5, shuffle=True, random_state=42, pipeline=pipeline)

        # Prepare some dummy data
        X = XYData(value=np.random.rand(100, 10))
        y = XYData(value=np.random.randint(0, 2, 100))

        # Fit and evaluate the model using cross-validation
        mean_loss = splitter.fit(X, y)
        print(f"Mean loss across folds: {mean_loss}")

        # Make predictions on new data
        X_new = XYData(value=np.random.rand(20, 10))
        predictions = splitter.predict(X_new)
        ```

    Attributes:
        n_splits (int): Number of folds. Must be at least 2.
        shuffle (bool): Whether to shuffle the data before splitting.
        random_state (int): Controls the shuffling applied to the data before applying the split.
        pipeline (BaseFilter | None): The pipeline to be used for training and evaluation.

    Methods:
        split(pipeline: BaseFilter): Set the pipeline for the splitter.
        fit(x: XYData, y: XYData | None) -> Optional[float]: Perform K-Fold cross-validation.
        predict(x: XYData) -> XYData: Make predictions using the fitted pipeline.
        evaluate(x_data: XYData, y_true: XYData | None, y_pred: XYData) -> Dict[str, Any]:
            Evaluate the pipeline on the provided data.
        start(x: XYData, y: Optional[XYData], X_: Optional[XYData]) -> Optional[XYData]:
            Start the cross-validation process and optionally make predictions.
    """

    def __init__(
        self,
        n_splits: int = 5,
        shuffle: bool = True,
        random_state: int = 42,
        pipeline: BaseFilter | None = None,
        # evaluator: BaseMetric | None = None
    ):
        """
        Initialize the KFoldSplitter.

        Args:
            n_splits (int, optional): Number of folds. Must be at least 2. Defaults to 5.
            shuffle (bool, optional): Whether to shuffle the data before splitting. Defaults to True.
            random_state (int, optional): Controls the shuffling applied to the data before applying the split. Defaults to 42.
            pipeline (BaseFilter | None, optional): The pipeline to be used for training and evaluation. Defaults to None.
        """
        super().__init__(pipeline=pipeline)
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        self._kfold = KFold(
            n_splits=n_splits, shuffle=shuffle, random_state=random_state
        )
        self.pipeline = pipeline
        # self.evaluator = evaluator

    def split(self, pipeline: BaseFilter):
        """
        Set the pipeline for the splitter and disable its verbosity.

        Args:
            pipeline (BaseFilter): The pipeline to be used for training and evaluation.
        """
        self.pipeline = pipeline
        self.pipeline.verbose(False)

    def fit(self, x: XYData, y: XYData | None) -> Optional[float]:
        """
        Perform K-Fold cross-validation on the given data.

        This method splits the data into K folds, trains the pipeline on K-1 folds,
        and evaluates it on the remaining fold. This process is repeated K times.

        Args:
            x (XYData): The input features.
            y (XYData | None): The target values.

        Returns:
            Optional[float]: The mean loss across all folds, or 0.0 if no losses were calculated.

        Raises:
            ValueError: If y is None or if the pipeline is not set.
        """
        self._print_acction("Fitting with KFold Splitter...")
        if self._verbose:
            rprint(self.pipeline)

        X = x.value
        if y is None:  # type: ignore
            raise ValueError("y must be provided for KFold split")

        Y = y.value

        if self.pipeline is None:
            raise ValueError("Pipeline must be fitted before splitting")

        losses = []
        splits = self._kfold.split(X)
        for train_idx, val_idx in tqdm(
            splits, total=self._kfold.get_n_splits(X), disable=not self._verbose
        ):
            X_train = XYData(
                _hash=f"{x._hash}_{train_idx}",
                _path=f"{x._path}_{train_idx}",
                _value=X[train_idx],
            )
            X_val = XYData(
                _hash=f"{x._hash}_{val_idx}",
                _path=f"{x._path}_{val_idx}",
                _value=X[val_idx],
            )
            y_train = XYData(
                _hash=f"{y._hash}_{train_idx}",
                _path=f"{y._path}_{train_idx}",
                _value=Y[train_idx],
            )
            y_val = XYData(
                _hash=f"{y._hash}_{val_idx}",
                _path=f"{y._path}_{val_idx}",
                _value=Y[val_idx],
            )
            self.pipeline.fit(X_train, y_train)

            _y = self.pipeline.predict(X_val)

            loss = self.pipeline.evaluate(X_val, y_val, _y)
            losses.append(float(next(iter(loss.values()))))

        return float(np.mean(losses) if losses else 0.0)

    def start(
        self, x: XYData, y: Optional[XYData], X_: Optional[XYData]
    ) -> Optional[XYData]:
        """
        Start the cross-validation process and optionally make predictions.

        This method performs cross-validation by fitting the model and then
        makes predictions on X_ if it is provided, otherwise on x.

        Args:
            x (XYData): The input features for training.
            y (Optional[XYData]): The target values for training.
            X_ (Optional[XYData]): The input features for prediction, if different from x.

        Returns:
            Optional[XYData]: Predictions for X_ if it is provided, otherwise predictions for x.

        Raises:
            Exception: If an error occurs during the process.
        """
        try:
            self.fit(x, y)
            if X_ is not None:
                return self.predict(X_)
            else:
                return self.predict(x)
        except Exception as e:
            print(f"Error during pipeline execution: {e}")
            raise e

    def predict(self, x: XYData) -> XYData:
        """
        Make predictions using the fitted pipeline.

        This method uses the pipeline that was fitted during cross-validation
        to make predictions on new data.

        Args:
            x (XYData): The input features for prediction.

        Returns:
            XYData: The predictions made by the pipeline.

        Raises:
            ValueError: If the pipeline has not been fitted.
        """
        self._print_acction("Predicting with KFold Splitter...")
        if self._verbose:
            rprint(self.pipeline)

        # X = x.value
        if self.pipeline is None:
            raise ValueError("Pipeline must be fitted before prediction")

        return self.pipeline.predict(x)

    def evaluate(
        self, x_data: XYData, y_true: XYData | None, y_pred: XYData
    ) -> Dict[str, Any]:
        """
        Evaluate the pipeline using the provided data.

        This method uses the pipeline's evaluate method to assess its performance
        on the given data.

        Args:
            x_data (XYData): The input features.
            y_true (XYData | None): The true target values.
            y_pred (XYData): The predicted target values.

        Returns:
            Dict[str, Any]: A dictionary containing the evaluation metrics.

        Raises:
            ValueError: If the pipeline has not been fitted.
        """
        if self.pipeline is None:
            raise ValueError("Pipeline must be fitted before evaluation")
        return self.pipeline.evaluate(x_data, y_true, y_pred)

`n_splits = n_splits` `instance-attribute`

`pipeline = pipeline` `instance-attribute`

`random_state = random_state` `instance-attribute`

`shuffle = shuffle` `instance-attribute`

`__init__(n_splits=5, shuffle=True, random_state=42, pipeline=None)`

Initialize the KFoldSplitter.

Parameters:

Name	Type	Description	Default
`n_splits`	`int`	Number of folds. Must be at least 2.	`5`
`shuffle`	`bool`	Whether to shuffle the data before splitting.	`True`
`random_state`	`int`	Controls the shuffling applied to the data before applying the split.	`42`
`pipeline`	`BaseFilter \| None`	The pipeline to be used for training and evaluation.	`None`

Source code in framework3/plugins/splitter/cross_validation_splitter.py

def __init__(
    self,
    n_splits: int = 5,
    shuffle: bool = True,
    random_state: int = 42,
    pipeline: BaseFilter | None = None,
    # evaluator: BaseMetric | None = None
):
    """
    Initialize the KFoldSplitter.

    Args:
        n_splits (int, optional): Number of folds. Must be at least 2. Defaults to 5.
        shuffle (bool, optional): Whether to shuffle the data before splitting. Defaults to True.
        random_state (int, optional): Controls the shuffling applied to the data before applying the split. Defaults to 42.
        pipeline (BaseFilter | None, optional): The pipeline to be used for training and evaluation. Defaults to None.
    """
    super().__init__(pipeline=pipeline)
    self.n_splits = n_splits
    self.shuffle = shuffle
    self.random_state = random_state
    self._kfold = KFold(
        n_splits=n_splits, shuffle=shuffle, random_state=random_state
    )
    self.pipeline = pipeline

`evaluate(x_data, y_true, y_pred)`

Evaluate the pipeline using the provided data.

This method uses the pipeline's evaluate method to assess its performance on the given data.

Parameters:

Name	Type	Description	Default
`x_data`	`XYData`	The input features.	required
`y_true`	`XYData \| None`	The true target values.	required
`y_pred`	`XYData`	The predicted target values.	required

Returns:

Type	Description
`Dict[str, Any]`	A dictionary containing the evaluation metrics.

Raises:

Type	Description
`ValueError`	If no pipeline has been set.

Source code in framework3/plugins/splitter/cross_validation_splitter.py

def evaluate(
    self, x_data: XYData, y_true: XYData | None, y_pred: XYData
) -> Dict[str, Any]:
    """
    Evaluate the pipeline using the provided data.

    This method uses the pipeline's evaluate method to assess its performance
    on the given data.

    Args:
        x_data (XYData): The input features.
        y_true (XYData | None): The true target values.
        y_pred (XYData): The predicted target values.

    Returns:
        Dict[str, Any]: A dictionary containing the evaluation metrics.

    Raises:
        ValueError: If the pipeline has not been fitted.
    """
    if self.pipeline is None:
        raise ValueError("Pipeline must be fitted before evaluation")
    return self.pipeline.evaluate(x_data, y_true, y_pred)

`fit(x, y)`

Perform K-Fold cross-validation on the given data.

This method splits the data into K folds, trains the pipeline on K-1 folds, and evaluates it on the remaining fold. This process is repeated K times.

Parameters:

Name	Type	Description	Default
`x`	`XYData`	The input features.	required
`y`	`XYData \| None`	The target values.	required

Returns:

Type	Description
`Optional[float]`	The mean loss across all folds, or 0.0 if no losses were calculated.

Raises:

Type	Description
`ValueError`	If y is None or if the pipeline is not set.
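
The way `fit` reduces per-fold results to a single number can be sketched with plain dictionaries. The metric names and values below are made up for illustration, and `statistics.fmean` stands in for the `np.mean` call in the source. Only the first value of each dictionary returned by `pipeline.evaluate` is used, so the reported mean follows whichever metric comes first in that dictionary.

```python
from statistics import fmean

# Hypothetical per-fold results, shaped like the Dict[str, Any] that
# pipeline.evaluate returns (metric names and values are invented).
fold_results = [
    {"F1": 0.80, "Precision": 0.70},
    {"F1": 0.90, "Precision": 0.60},
    {"F1": 0.70, "Precision": 0.80},
]

# Same reduction as in fit: keep only the first value of each dictionary.
losses = [float(next(iter(result.values()))) for result in fold_results]

# Mean over folds, falling back to 0.0 when nothing was collected.
mean_loss = fmean(losses) if losses else 0.0
print(losses)
```

Here only the `F1` values contribute to `mean_loss`; the `Precision` values are ignored because they are not first in each dictionary.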

Source code in framework3/plugins/splitter/cross_validation_splitter.py

def fit(self, x: XYData, y: XYData | None) -> Optional[float]:
    """
    Perform K-Fold cross-validation on the given data.

    This method splits the data into K folds, trains the pipeline on K-1 folds,
    and evaluates it on the remaining fold. This process is repeated K times.

    Args:
        x (XYData): The input features.
        y (XYData | None): The target values.

    Returns:
        Optional[float]: The mean loss across all folds, or 0.0 if no losses were calculated.

    Raises:
        ValueError: If y is None or if the pipeline is not set.
    """
    self._print_acction("Fitting with KFold Splitter...")
    if self._verbose:
        rprint(self.pipeline)

    X = x.value
    if y is None:  # type: ignore
        raise ValueError("y must be provided for KFold split")

    Y = y.value

    if self.pipeline is None:
        raise ValueError("Pipeline must be fitted before splitting")

    losses = []
    splits = self._kfold.split(X)
    for train_idx, val_idx in tqdm(
        splits, total=self._kfold.get_n_splits(X), disable=not self._verbose
    ):
        X_train = XYData(
            _hash=f"{x._hash}_{train_idx}",
            _path=f"{x._path}_{train_idx}",
            _value=X[train_idx],
        )
        X_val = XYData(
            _hash=f"{x._hash}_{val_idx}",
            _path=f"{x._path}_{val_idx}",
            _value=X[val_idx],
        )
        y_train = XYData(
            _hash=f"{y._hash}_{train_idx}",
            _path=f"{y._path}_{train_idx}",
            _value=Y[train_idx],
        )
        y_val = XYData(
            _hash=f"{y._hash}_{val_idx}",
            _path=f"{y._path}_{val_idx}",
            _value=Y[val_idx],
        )
        self.pipeline.fit(X_train, y_train)

        _y = self.pipeline.predict(X_val)

        loss = self.pipeline.evaluate(X_val, y_val, _y)
        losses.append(float(next(iter(loss.values()))))

    return float(np.mean(losses) if losses else 0.0)

`predict(x)`

Make predictions using the fitted pipeline.

This method uses the pipeline that was fitted during cross-validation to make predictions on new data.

Parameters:

Name	Type	Description	Default
`x`	`XYData`	The input features for prediction.	required

Returns:

Type	Description
`XYData`	The predictions made by the pipeline.

Raises:

Type	Description
`ValueError`	If no pipeline has been set.
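
Because `fit` calls `pipeline.fit` once per fold and never refits on the full dataset, the pipeline used by `predict` is in the state left by the last fold. The sketch below illustrates this with `RecordingPipeline`, a hypothetical stand-in that is not part of framework3 and assumes each `fit` call replaces the previously learned state. Real pipelines may behave differently.

```python
class RecordingPipeline:
    """Hypothetical stand-in for a pipeline; it only remembers its last training set."""

    def __init__(self):
        self.trained_on = None

    def fit(self, x, y):
        # Each call replaces whatever was learned before.
        self.trained_on = list(x)

    def predict(self, x):
        # Predict the training-set mean for every input row.
        mean = sum(self.trained_on) / len(self.trained_on)
        return [mean for _ in x]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Three contiguous folds of two samples each (no shuffling, for readability).
folds = [([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]

pipeline = RecordingPipeline()
for train_idx, val_idx in folds:
    train = [data[i] for i in train_idx]
    pipeline.fit(train, train)
    pipeline.predict([data[i] for i in val_idx])

# After the loop the pipeline reflects only the last fold's training subset.
final_training_set = pipeline.trained_on
prediction = pipeline.predict([10.0])
print(final_training_set, prediction)  # [1.0, 2.0, 3.0, 4.0] [2.5]
```

If predictions from a model trained on all available data are needed, one option is to fit the pipeline on the full dataset separately after cross-validation.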

Source code in framework3/plugins/splitter/cross_validation_splitter.py

def predict(self, x: XYData) -> XYData:
    """
    Make predictions using the fitted pipeline.

    This method uses the pipeline that was fitted during cross-validation
    to make predictions on new data.

    Args:
        x (XYData): The input features for prediction.

    Returns:
        XYData: The predictions made by the pipeline.

    Raises:
        ValueError: If the pipeline has not been fitted.
    """
    self._print_acction("Predicting with KFold Splitter...")
    if self._verbose:
        rprint(self.pipeline)

    # X = x.value
    if self.pipeline is None:
        raise ValueError("Pipeline must be fitted before prediction")

    return self.pipeline.predict(x)

`split(pipeline)`

Set the pipeline for the splitter and disable its verbosity.

Parameters:

Name	Type	Description	Default
`pipeline`	`BaseFilter`	The pipeline to be used for training and evaluation.	required

Source code in framework3/plugins/splitter/cross_validation_splitter.py

def split(self, pipeline: BaseFilter):
    """
    Set the pipeline for the splitter and disable its verbosity.

    Args:
        pipeline (BaseFilter): The pipeline to be used for training and evaluation.
    """
    self.pipeline = pipeline
    self.pipeline.verbose(False)

`start(x, y, X_)`

Start the cross-validation process and optionally make predictions.

This method performs cross-validation by fitting the model and then makes predictions on X_ if it is provided, otherwise on x.

Parameters:

Name	Type	Description	Default
`x`	`XYData`	The input features for training.	required
`y`	`Optional[XYData]`	The target values for training.	required
`X_`	`Optional[XYData]`	The input features for prediction, if different from x.	required

Returns:

Type	Description
`Optional[XYData]`	Predictions for X_ if it is provided, otherwise predictions for x.

Raises:

Type	Description
`Exception`	If an error occurs during the process.
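
The control flow of `start` can be sketched with a small stand-in class. `TinySplitter` below is hypothetical and reproduces only the branching shown in the source: fit first, then predict on `X_` when it is given and on `x` otherwise. Its `fit` and `predict` bodies are placeholders, not the real cross-validation logic.

```python
from typing import List, Optional


class TinySplitter:
    """Hypothetical stand-in that reproduces only the control flow of start."""

    def fit(self, x: List[float], y: Optional[List[float]]) -> float:
        if y is None:
            raise ValueError("y must be provided for KFold split")
        return 0.0  # placeholder for the mean loss

    def predict(self, x: List[float]) -> List[float]:
        return [value * 2 for value in x]  # placeholder model

    def start(
        self, x: List[float], y: Optional[List[float]], X_: Optional[List[float]]
    ) -> List[float]:
        self.fit(x, y)
        # Predict on X_ when given, otherwise on the training inputs.
        return self.predict(X_ if X_ is not None else x)


splitter = TinySplitter()
on_new = splitter.start([1.0, 2.0], [0.0, 1.0], [5.0])
on_train = splitter.start([1.0, 2.0], [0.0, 1.0], None)
print(on_new, on_train)  # [10.0] [2.0, 4.0]
```

Passing `None` for `X_` therefore still yields predictions, computed on the same inputs that were used for fitting.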

Source code in framework3/plugins/splitter/cross_validation_splitter.py

def start(
    self, x: XYData, y: Optional[XYData], X_: Optional[XYData]
) -> Optional[XYData]:
    """
    Start the cross-validation process and optionally make predictions.

    This method performs cross-validation by fitting the model and then
    makes predictions on X_ if it is provided, otherwise on x.

    Args:
        x (XYData): The input features for training.
        y (Optional[XYData]): The target values for training.
        X_ (Optional[XYData]): The input features for prediction, if different from x.

    Returns:
        Optional[XYData]: Predictions for X_ if it is provided, otherwise predictions for x.

    Raises:
        Exception: If an error occurs during the process.
    """
    try:
        self.fit(x, y)
        if X_ is not None:
            return self.predict(X_)
        else:
            return self.predict(x)
    except Exception as e:
        print(f"Error during pipeline execution: {e}")
        raise e
