Clustering Filters

KMeansFilter

Bases: BaseFilter

A wrapper for scikit-learn's KMeans clustering algorithm using the framework3 BaseFilter interface.

This filter implements the K-Means clustering algorithm within the framework3 ecosystem.

Key Features
  • Integrates scikit-learn's KMeans with framework3
  • Supports various KMeans parameters like number of clusters, initialization method, and algorithm
  • Provides methods for fitting the model, making predictions, and transforming data
  • Includes a static method for generating parameter grids for hyperparameter tuning
Usage

The KMeansFilter can be used to perform K-Means clustering on your data:

from framework3.plugins.filters.clustering.kmeans import KMeansFilter
from framework3.base.base_types import XYData
import numpy as np

# Create sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
X_data = XYData(_hash='X_data', _path='/tmp', _value=X)

# Create and fit the KMeans filter (y is unused for clustering, so pass None)
kmeans = KMeansFilter(n_clusters=2, random_state=42)
kmeans.fit(X_data, None)

# Make predictions
X_test = XYData(_hash='X_test', _path='/tmp', _value=np.array([[0, 0], [4, 4]]))
predictions = kmeans.predict(X_test)
print(predictions.value)

Attributes:

  • _clf (KMeans): The underlying scikit-learn KMeans clustering model.

Methods:

  • fit(x: XYData, y: Optional[XYData], evaluator: BaseMetric | None = None) -> Optional[float]: Fit the KMeans model to the given data.
  • predict(x: XYData) -> XYData: Predict the closest cluster for each sample in X.
  • transform(x: XYData) -> XYData: Transform X to a cluster-distance space.
  • item_grid(**kwargs) -> Dict[str, Any]: Generate a parameter grid for hyperparameter tuning.

Note

This filter uses scikit-learn's implementation of KMeans, which may have its own dependencies and requirements. Ensure that scikit-learn is properly installed and compatible with your environment.
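
A quick way to check which scikit-learn release is installed is sketched below. The helper name `sklearn_version` is hypothetical and not part of framework3; it uses only the standard library. The check matters because the `'lloyd'` value for `algorithm` was introduced in scikit-learn 1.1, so older releases would likely reject the default used by this filter.

```python
from importlib import metadata
from typing import Optional


def sklearn_version() -> Optional[str]:
    """Return the installed scikit-learn version string, or None if it is missing."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


version = sklearn_version()
if version is None:
    print("scikit-learn is not installed")
else:
    print("scikit-learn version:", version)
```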

Source code in framework3/plugins/filters/clustering/kmeans.py
@Container.bind()
class KMeansFilter(BaseFilter):
    """
    A wrapper for scikit-learn's KMeans clustering algorithm using the framework3 BaseFilter interface.

    This filter implements the K-Means clustering algorithm within the framework3 ecosystem.

    Key Features:
        - Integrates scikit-learn's KMeans with framework3
        - Supports various KMeans parameters like number of clusters, initialization method, and algorithm
        - Provides methods for fitting the model, making predictions, and transforming data
        - Includes a static method for generating parameter grids for hyperparameter tuning

    Usage:
        The KMeansFilter can be used to perform K-Means clustering on your data:

        ```python
        from framework3.plugins.filters.clustering.kmeans import KMeansFilter
        from framework3.base.base_types import XYData
        import numpy as np

        # Create sample data
        X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
        X_data = XYData(_hash='X_data', _path='/tmp', _value=X)

        # Create and fit the KMeans filter
        kmeans = KMeansFilter(n_clusters=2, random_state=42)
        kmeans.fit(X_data)

        # Make predictions
        X_test = XYData(_hash='X_test', _path='/tmp', _value=np.array([[0, 0], [4, 4]]))
        predictions = kmeans.predict(X_test)
        print(predictions.value)
        ```

    Attributes:
        _clf (KMeans): The underlying scikit-learn KMeans clustering model.

    Methods:
        fit(x: XYData, y: Optional[XYData], evaluator: BaseMetric | None = None) -> Optional[float]:
            Fit the KMeans model to the given data.
        predict(x: XYData) -> XYData:
            Predict the closest cluster for each sample in X.
        transform(x: XYData) -> XYData:
            Transform X to a cluster-distance space.
        item_grid(**kwargs) -> Dict[str, Any]:
            Generate a parameter grid for hyperparameter tuning.

    Note:
        This filter uses scikit-learn's implementation of KMeans, which may have its own dependencies and requirements.
        Ensure that scikit-learn is properly installed and compatible with your environment.
    """

    def __init__(
        self,
        n_clusters: int = 8,
        init: Literal["k-means++", "random"] = "k-means++",
        n_init: int = 10,
        max_iter: int = 300,
        tol: float = 1e-4,
        random_state: Optional[int] = None,
        algorithm: Literal["lloyd", "elkan"] = "lloyd",
    ):
        """
        Initialize a new KMeansFilter instance.

        This constructor sets up the KMeansFilter with the specified parameters and
        initializes the underlying scikit-learn KMeans model.

        Args:
            n_clusters (int): The number of clusters to form. Defaults to 8.
            init (Literal["k-means++", "random"]): Method for initialization. Defaults to 'k-means++'.
            n_init (int): Number of times the k-means algorithm will be run with different centroid seeds. Defaults to 10.
            max_iter (int): Maximum number of iterations of the k-means algorithm for a single run. Defaults to 300.
            tol (float): Relative tolerance with regards to Frobenius norm of the difference
                         in the cluster centers of two consecutive iterations to declare convergence. Defaults to 1e-4.
            random_state (Optional[int]): Determines random number generation for centroid initialization. Defaults to None.
            algorithm (Literal["lloyd", "elkan"]): K-means algorithm to use. Defaults to 'lloyd'.

        Note:
            The parameters are passed directly to scikit-learn's KMeans.
            Refer to scikit-learn's documentation for detailed information on these parameters.
        """
        super().__init__(
            n_clusters=n_clusters,
            init=init,
            n_init=n_init,
            max_iter=max_iter,
            tol=tol,
            random_state=random_state,
            algorithm=algorithm,
        )
        self._clf = KMeans(
            n_clusters=n_clusters,
            init=init,
            n_init=n_init,
            max_iter=max_iter,
            tol=tol,
            random_state=random_state,
            algorithm=algorithm,
        )

    def fit(
        self, x: XYData, y: Optional[XYData], evaluator: BaseMetric | None = None
    ) -> Optional[float]:
        """
        Fit the KMeans model to the given data.

        This method trains the KMeans model on the provided input features.

        Args:
            x (XYData): The input features for training.
            y (Optional[XYData]): Not used, present for API consistency.
            evaluator (BaseMetric | None): An optional evaluator for the model. Not used in this method.

        Returns:
            Optional[float]: The inertia (within-cluster sum-of-squares) of the fitted model.

        Note:
            This method uses scikit-learn's fit method internally.
            The inertia is returned as a measure of how well the model fits the data.
        """
        self._clf.fit(x.value)
        return self._clf.inertia_  # type: ignore

    def predict(self, x: XYData) -> XYData:
        """
        Predict the closest cluster for each sample in X.

        This method uses the trained KMeans model to predict cluster labels for new input data.

        Args:
            x (XYData): The input features to predict.

        Returns:
            XYData: The predicted cluster labels wrapped in an XYData object.

        Note:
            This method uses scikit-learn's predict method internally.
            The predictions are wrapped in an XYData object for consistency with the framework.
        """
        predictions = self._clf.predict(x.value)
        return XYData.mock(predictions)

    def transform(self, x: XYData) -> XYData:
        """
        Transform X to a cluster-distance space.

        This method computes the distance between each sample in X and the cluster centers.

        Args:
            x (XYData): The input features to transform.

        Returns:
            XYData: The transformed data wrapped in an XYData object.

        Note:
            This method uses scikit-learn's transform method internally.
            The transformed data is wrapped in an XYData object for consistency with the framework.
        """
        transformed = self._clf.transform(x.value)
        return XYData.mock(transformed)

    @staticmethod
    def item_grid(**kwargs: Dict[str, Any]) -> Dict[str, Any]:
        """
        Generate a parameter grid for hyperparameter tuning.

        This static method provides a way to generate a grid of parameters for use in
        hyperparameter optimization techniques like grid search.

        Args:
            **kwargs (Dict[str, Any]): Keyword arguments representing the parameter names and their possible values.

        Returns:
            Dict[str, Any]: A dictionary of parameter names and their possible values.

        Note:
            The returned dictionary can be used directly with hyperparameter tuning tools
            that accept parameter grids, such as scikit-learn's GridSearchCV.
            The parameter names are prefixed with "KMeansFilter__" for compatibility with nested estimators.
        """

        return dict(map(lambda x: (f"KMeansFilter__{x[0]}", x[1]), kwargs.items()))

__init__(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, random_state=None, algorithm='lloyd')

Initialize a new KMeansFilter instance.

This constructor sets up the KMeansFilter with the specified parameters and initializes the underlying scikit-learn KMeans model.

Parameters:

  • n_clusters (int): The number of clusters to form. Default: 8.
  • init (Literal['k-means++', 'random']): Method for initialization. Default: 'k-means++'.
  • n_init (int): Number of times the k-means algorithm will be run with different centroid seeds. Default: 10.
  • max_iter (int): Maximum number of iterations of the k-means algorithm for a single run. Default: 300.
  • tol (float): Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence. Default: 1e-4.
  • random_state (Optional[int]): Determines random number generation for centroid initialization. Default: None.
  • algorithm (Literal['lloyd', 'elkan']): K-means algorithm to use. Default: 'lloyd'.

Note

The parameters are passed directly to scikit-learn's KMeans. Refer to scikit-learn's documentation for detailed information on these parameters.
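
To make the roles of `n_init`, `max_iter`, and `tol` concrete, here is a minimal pure-Python sketch of Lloyd's algorithm with restarts. It is an illustration only, not scikit-learn's implementation: the helper names `sq_dist`, `lloyd`, and `best_of` are hypothetical, starts are drawn with plain random sampling, and the stopping rule compares the raw squared shift of the centers against `tol`, whereas scikit-learn's own rule differs in details such as how the tolerance is scaled.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def lloyd(points: List[Point], init_centers: List[Point],
          max_iter: int = 300, tol: float = 1e-4) -> Tuple[List[Point], float]:
    """Run one k-means pass from the given start; return (centers, inertia)."""
    centers = list(init_centers)
    k = len(centers)
    for _ in range(max_iter):
        # Assignment step: group every point with its nearest center
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        shift = sum(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # centers barely moved: treat as converged
            break
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia


def best_of(points: List[Point], k: int, n_init: int = 10, max_iter: int = 300,
            tol: float = 1e-4, seed: Optional[int] = None) -> Tuple[List[Point], float]:
    """Repeat lloyd() from n_init random starts and keep the lowest inertia."""
    rng = random.Random(seed)
    runs = [lloyd(points, rng.sample(points, k), max_iter, tol) for _ in range(n_init)]
    return min(runs, key=lambda run: run[1])


points = [(1.0, 2.0), (1.0, 4.0), (1.0, 0.0), (4.0, 2.0), (4.0, 4.0), (4.0, 0.0)]
_, good = lloyd(points, [(1.0, 2.0), (4.0, 2.0)])
_, poor = lloyd(points, [(1.0, 2.0), (1.0, 4.0)])
print(good, poor)  # 16.0 17.5
best_centers, best_inertia = best_of(points, k=2, seed=42)
print(best_inertia)
```

The two fixed starts end at different inertia values on the same data, which is why a single run can get stuck and why repeating the run `n_init` times and keeping the best result is useful.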

Source code in framework3/plugins/filters/clustering/kmeans.py
def __init__(
    self,
    n_clusters: int = 8,
    init: Literal["k-means++", "random"] = "k-means++",
    n_init: int = 10,
    max_iter: int = 300,
    tol: float = 1e-4,
    random_state: Optional[int] = None,
    algorithm: Literal["lloyd", "elkan"] = "lloyd",
):
    """
    Initialize a new KMeansFilter instance.

    This constructor sets up the KMeansFilter with the specified parameters and
    initializes the underlying scikit-learn KMeans model.

    Args:
        n_clusters (int): The number of clusters to form. Defaults to 8.
        init (Literal["k-means++", "random"]): Method for initialization. Defaults to 'k-means++'.
        n_init (int): Number of times the k-means algorithm will be run with different centroid seeds. Defaults to 10.
        max_iter (int): Maximum number of iterations of the k-means algorithm for a single run. Defaults to 300.
        tol (float): Relative tolerance with regards to Frobenius norm of the difference
                     in the cluster centers of two consecutive iterations to declare convergence. Defaults to 1e-4.
        random_state (Optional[int]): Determines random number generation for centroid initialization. Defaults to None.
        algorithm (Literal["lloyd", "elkan"]): K-means algorithm to use. Defaults to 'lloyd'.

    Note:
        The parameters are passed directly to scikit-learn's KMeans.
        Refer to scikit-learn's documentation for detailed information on these parameters.
    """
    super().__init__(
        n_clusters=n_clusters,
        init=init,
        n_init=n_init,
        max_iter=max_iter,
        tol=tol,
        random_state=random_state,
        algorithm=algorithm,
    )
    self._clf = KMeans(
        n_clusters=n_clusters,
        init=init,
        n_init=n_init,
        max_iter=max_iter,
        tol=tol,
        random_state=random_state,
        algorithm=algorithm,
    )

fit(x, y, evaluator=None)

Fit the KMeans model to the given data.

This method trains the KMeans model on the provided input features.

Parameters:

  • x (XYData): The input features for training. Required.
  • y (Optional[XYData]): Not used, present for API consistency. Required.
  • evaluator (BaseMetric | None): An optional evaluator for the model. Not used in this method. Default: None.

Returns:

  • Optional[float]: The inertia (within-cluster sum-of-squares) of the fitted model.

Note

This method uses scikit-learn's fit method internally. The inertia is returned as a measure of how well the model fits the data.
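
As a rough illustration of the value `fit` returns, the sketch below computes inertia by hand as the sum of squared distances from each point to its nearest center. The function name `inertia` and the hand-picked centers are assumptions made for the example; a fitted model would supply its own centers.

```python
from typing import List, Sequence


def inertia(points: List[Sequence[float]], centers: List[Sequence[float]]) -> float:
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers)
    return total


points = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
tight = inertia(points, [[1, 2], [4, 2]])  # one center per column of points
loose = inertia(points, [[2.5, 2]])        # a single center for all points
print(tight, loose)  # 16.0 29.5
```

Lower inertia means tighter clusters, but inertia tends to fall as `n_clusters` grows, so it is of limited use for comparing models with different numbers of clusters.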

Source code in framework3/plugins/filters/clustering/kmeans.py
def fit(
    self, x: XYData, y: Optional[XYData], evaluator: BaseMetric | None = None
) -> Optional[float]:
    """
    Fit the KMeans model to the given data.

    This method trains the KMeans model on the provided input features.

    Args:
        x (XYData): The input features for training.
        y (Optional[XYData]): Not used, present for API consistency.
        evaluator (BaseMetric | None): An optional evaluator for the model. Not used in this method.

    Returns:
        Optional[float]: The inertia (within-cluster sum-of-squares) of the fitted model.

    Note:
        This method uses scikit-learn's fit method internally.
        The inertia is returned as a measure of how well the model fits the data.
    """
    self._clf.fit(x.value)
    return self._clf.inertia_  # type: ignore

item_grid(**kwargs) staticmethod

Generate a parameter grid for hyperparameter tuning.

This static method provides a way to generate a grid of parameters for use in hyperparameter optimization techniques like grid search.

Parameters:

  • **kwargs (Dict[str, Any]): Keyword arguments representing the parameter names and their possible values. Default: {}.

Returns:

  • Dict[str, Any]: A dictionary of parameter names and their possible values.

Note

The returned dictionary can be used directly with hyperparameter tuning tools that accept parameter grids, such as scikit-learn's GridSearchCV. The parameter names are prefixed with "KMeansFilter__" for compatibility with nested estimators.
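
The prefixing can be reproduced without framework3, which makes the shape of the result easier to see. In this sketch `prefixed_grid` mirrors the one-line body of `item_grid` shown in the source, while `expand` is a hypothetical helper that lists the combinations a grid search would typically enumerate; neither function is part of framework3.

```python
from itertools import product
from typing import Any, Dict, List


def prefixed_grid(prefix: str, **kwargs: Any) -> Dict[str, Any]:
    """Prefix every parameter name with '<prefix>__', as item_grid does."""
    return {f"{prefix}__{name}": values for name, values in kwargs.items()}


def expand(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """List every parameter combination contained in the grid."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


grid = prefixed_grid("KMeansFilter", n_clusters=[2, 3, 4, 5], init=["k-means++", "random"])
print(grid)
# {'KMeansFilter__n_clusters': [2, 3, 4, 5], 'KMeansFilter__init': ['k-means++', 'random']}

combos = expand(grid)
print(len(combos))  # 8
print(combos[0])    # {'KMeansFilter__n_clusters': 2, 'KMeansFilter__init': 'k-means++'}
```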

Source code in framework3/plugins/filters/clustering/kmeans.py
@staticmethod
def item_grid(**kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """
    Generate a parameter grid for hyperparameter tuning.

    This static method provides a way to generate a grid of parameters for use in
    hyperparameter optimization techniques like grid search.

    Args:
        **kwargs (Dict[str, Any]): Keyword arguments representing the parameter names and their possible values.

    Returns:
        Dict[str, Any]: A dictionary of parameter names and their possible values.

    Note:
        The returned dictionary can be used directly with hyperparameter tuning tools
        that accept parameter grids, such as scikit-learn's GridSearchCV.
        The parameter names are prefixed with "KMeansFilter__" for compatibility with nested estimators.
    """

    return dict(map(lambda x: (f"KMeansFilter__{x[0]}", x[1]), kwargs.items()))

predict(x)

Predict the closest cluster for each sample in X.

This method uses the trained KMeans model to predict cluster labels for new input data.

Parameters:

  • x (XYData): The input features to predict. Required.

Returns:

  • XYData: The predicted cluster labels wrapped in an XYData object.

Note

This method uses scikit-learn's predict method internally. The predictions are wrapped in an XYData object for consistency with the framework.
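
Conceptually, prediction assigns each sample the index of its nearest cluster center. The sketch below shows that rule in plain Python; `nearest_center` and the two centers are hypothetical stand-ins for what a fitted model would hold.

```python
import math
from typing import List, Sequence


def nearest_center(point: Sequence[float], centers: List[Sequence[float]]) -> int:
    """Return the index of the center closest to the point."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))


centers = [(1.0, 2.0), (4.0, 2.0)]  # assumed centers, chosen by hand
labels = [nearest_center(p, centers) for p in [(0, 0), (4, 4)]]
print(labels)  # [0, 1]
```

The label values are only indices into the list of centers. A fitted model may number its clusters in a different order, so compare groupings rather than raw label values.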

Source code in framework3/plugins/filters/clustering/kmeans.py
def predict(self, x: XYData) -> XYData:
    """
    Predict the closest cluster for each sample in X.

    This method uses the trained KMeans model to predict cluster labels for new input data.

    Args:
        x (XYData): The input features to predict.

    Returns:
        XYData: The predicted cluster labels wrapped in an XYData object.

    Note:
        This method uses scikit-learn's predict method internally.
        The predictions are wrapped in an XYData object for consistency with the framework.
    """
    predictions = self._clf.predict(x.value)
    return XYData.mock(predictions)

transform(x)

Transform X to a cluster-distance space.

This method computes the distance between each sample in X and the cluster centers.

Parameters:

  • x (XYData): The input features to transform. Required.

Returns:

  • XYData: The transformed data wrapped in an XYData object.

Note

This method uses scikit-learn's transform method internally. The transformed data is wrapped in an XYData object for consistency with the framework.
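
The cluster-distance space has one column per cluster: each row holds the distances from one sample to every center. A minimal sketch, assuming Euclidean distance and hand-picked centers (the name `to_distance_space` is hypothetical):

```python
import math
from typing import List, Sequence


def to_distance_space(points: List[Sequence[float]],
                      centers: List[Sequence[float]]) -> List[List[float]]:
    """Return a matrix of shape (n_samples, n_clusters) of point-to-center distances."""
    return [[math.dist(p, c) for c in centers] for p in points]


centers = [(0.0, 0.0), (3.0, 4.0)]  # assumed centers, chosen by hand
rows = to_distance_space([(0.0, 0.0), (3.0, 0.0)], centers)
print(rows)  # [[0.0, 5.0], [3.0, 4.0]]
```

The position of the smallest value in each row matches the label that nearest-center prediction would assign to that sample.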

Source code in framework3/plugins/filters/clustering/kmeans.py
def transform(self, x: XYData) -> XYData:
    """
    Transform X to a cluster-distance space.

    This method computes the distance between each sample in X and the cluster centers.

    Args:
        x (XYData): The input features to transform.

    Returns:
        XYData: The transformed data wrapped in an XYData object.

    Note:
        This method uses scikit-learn's transform method internally.
        The transformed data is wrapped in an XYData object for consistency with the framework.
    """
    transformed = self._clf.transform(x.value)
    return XYData.mock(transformed)

Overview

The Clustering Filters module in framework3 provides unsupervised learning algorithms for clustering data. The filters share a consistent interface, so they fit into framework3 pipelines alongside other filters.

Available Clustering Algorithms

K-Means Clustering

The K-Means clustering algorithm is implemented in the KMeansFilter. This popular clustering method aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean (cluster centroid).

Usage

from framework3.plugins.filters.clustering.kmeans import KMeansFilter

kmeans_clusterer = KMeansFilter(n_clusters=3, init='k-means++', n_init=10, max_iter=300)

Parameters

  • n_clusters (int): The number of clusters to form and the number of centroids to generate.
  • init (str): Method for initialization of centroids. Options include 'k-means++' and 'random'.
  • n_init (int): Number of times the k-means algorithm will be run with different centroid seeds.
  • max_iter (int): Maximum number of iterations for a single run.
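
The difference between the two `init` options can be sketched as follows. With 'random', starting centroids are drawn uniformly from the data; with 'k-means++', each new centroid is drawn with probability proportional to its squared distance from the centroids already chosen, which spreads the starts out. This is the textbook form of the method under hypothetical names (`sq_dist`, `kmeans_pp_init`); scikit-learn's implementation may differ in detail.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_init(points: List[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick k starting centers; assumes at least k distinct points."""
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen center.
        # Already-chosen points get weight 0 and cannot be picked again.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(1.0, 2.0), (1.0, 4.0), (1.0, 0.0), (4.0, 2.0), (4.0, 4.0), (4.0, 0.0)]
starts = kmeans_pp_init(points, k=3, rng=random.Random(0))
print(starts)
```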

DBSCAN Clustering

The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is implemented in the DBSCANFilter. This algorithm is particularly effective for datasets with clusters of arbitrary shape.

Usage

from framework3.plugins.filters.clustering.dbscan import DBSCANFilter

dbscan_clusterer = DBSCANFilter(eps=0.5, min_samples=5)

Parameters

  • eps (float): The maximum distance between two samples for one to be considered as in the neighborhood of the other.
  • min_samples (int): The number of samples in a neighborhood for a point to be considered as a core point.
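
How `eps` and `min_samples` interact can be seen in a small sketch of the core-point test. The helper names are hypothetical, and the neighborhood count includes the point itself, which follows scikit-learn's documented convention for `min_samples`. A full DBSCAN run would then grow clusters outward from core points and mark unreachable points as noise.

```python
import math
from typing import List, Sequence


def neighbors(points: List[Sequence[float]], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def core_points(points: List[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Indices of points whose eps-neighborhood holds at least min_samples points."""
    return [i for i in range(len(points)) if len(neighbors(points, i, eps)) >= min_samples]


points = [(0.0, 0.0), (0.0, 0.3), (0.3, 0.0), (0.3, 0.3), (5.0, 5.0)]
print(core_points(points, eps=0.5, min_samples=4))  # [0, 1, 2, 3]
print(core_points(points, eps=0.5, min_samples=5))  # []
```

The isolated point at (5.0, 5.0) is never a core point here, and raising `min_samples` above the size of the dense group leaves no core points at all.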

Comprehensive Example: Clustering with Synthetic Data

In this example, we'll demonstrate how to use the Clustering Filters with synthetic data, showcasing both K-Means and DBSCAN algorithms, as well as integration with GridSearchCV for parameter tuning.

from framework3.plugins.pipelines.gs_cv_pipeline import GridSearchCVPipeline
from framework3.plugins.filters.clustering.kmeans import KMeansFilter
from framework3.plugins.filters.clustering.dbscan import DBSCANFilter
from framework3.base.base_types import XYData
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import silhouette_score
import numpy as np

# Generate synthetic datasets
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Create XYData objects
X_blobs_data = XYData(_hash='X_blobs', _path='/tmp', _value=X_blobs)
X_moons_data = XYData(_hash='X_moons', _path='/tmp', _value=X_moons)

# K-Means Clustering
kmeans_pipeline = GridSearchCVPipeline(
    filterx=[KMeansFilter],
    param_grid=KMeansFilter.item_grid(n_clusters=[2, 3, 4, 5], init=['k-means++', 'random']),
    scoring='silhouette',
    cv=5
)

# Fit K-Means on blobs dataset
kmeans_pipeline.fit(X_blobs_data)

# Make predictions
kmeans_labels = kmeans_pipeline.predict(X_blobs_data)
print("K-Means Cluster Labels:", kmeans_labels.value)

# DBSCAN Clustering
dbscan_pipeline = GridSearchCVPipeline(
    filterx=[DBSCANFilter],
    param_grid=DBSCANFilter.item_grid(eps=[0.1, 0.2, 0.3], min_samples=[3, 5, 7]),
    scoring='silhouette',
    cv=5
)

# Fit DBSCAN on moons dataset
dbscan_pipeline.fit(X_moons_data)

# Make predictions
dbscan_labels = dbscan_pipeline.predict(X_moons_data)
print("DBSCAN Cluster Labels:", dbscan_labels.value)

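# Evaluate the models
kmeans_silhouette = silhouette_score(X_blobs, kmeans_labels.value)
print("K-Means Silhouette Score:", kmeans_silhouette)

# silhouette_score needs at least 2 distinct labels; DBSCAN can return a single
# label (for example, all points marked as noise with label -1)
if len(np.unique(dbscan_labels.value)) > 1:
    dbscan_silhouette = silhouette_score(X_moons, dbscan_labels.value)
    print("DBSCAN Silhouette Score:", dbscan_silhouette)
else:
    print("DBSCAN found fewer than 2 labels; silhouette score is undefined")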

This example demonstrates how to:

  1. Generate synthetic datasets suitable for different clustering algorithms
  2. Create XYData objects for use with framework3
  3. Set up GridSearchCV pipelines for both K-Means and DBSCAN clustering
  4. Fit the models and make predictions
  5. Evaluate the models using silhouette scores
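
For readers unfamiliar with the silhouette score used in step 5, the following pure-Python sketch shows what it measures: for each sample, how much closer it is on average to its own cluster (a) than to the nearest other cluster (b), scored as (b - a) / max(a, b). The function name `silhouette` is hypothetical, and the sketch assumes at least two clusters and no duplicate points; in practice `sklearn.metrics.silhouette_score` is the function to use.

```python
import math
from typing import List, Sequence


def silhouette(points: List[Sequence[float]], labels: List[int]) -> float:
    """Mean silhouette coefficient over all samples."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # a cluster with a single member scores 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the sample's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the two tight groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters, while negative scores suggest that samples sit closer to another cluster than to their own.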

Best Practices

  1. Data Preprocessing: Ensure your data is properly preprocessed before applying clustering filters. This may include scaling, normalization, or handling missing values.

  2. Algorithm Selection: Choose the appropriate clustering algorithm based on the characteristics of your data and the specific requirements of your problem.

  3. Parameter Tuning: Use GridSearchCVPipeline to find the optimal parameters for your chosen clustering algorithm, as demonstrated in the example.

  4. Cluster Evaluation: Always evaluate your clustering results using appropriate metrics such as silhouette score, Calinski-Harabasz index, or Davies-Bouldin index.

  5. Visualization: Visualize your clustering results to gain insights into the structure of your data and the performance of the clustering algorithm.

  6. Ensemble Methods: Consider using ensemble clustering techniques to improve the robustness and stability of your clustering results.
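
The scaling mentioned in the first practice matters because both K-Means and DBSCAN rely on distances, so a feature with a large numeric range can dominate the result. A minimal standardization sketch in plain Python follows; the name `standardize` is hypothetical, and in a real pipeline a library scaler would normally be used instead.

```python
import statistics
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
scaled = standardize(rows)
for row in scaled:
    print(row)
```

After scaling, both columns cover the same range even though the raw values differ by three orders of magnitude.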

Conclusion

The Clustering Filters module in framework3 provides K-Means and DBSCAN filters for unsupervised learning tasks. Combined with other framework3 components such as GridSearchCVPipeline, these filters can be assembled into clustering pipelines with parameter tuning, as the synthetic-data example shows.