Clustering Filters¶
KMeansFilter¶
Bases: BaseFilter
A wrapper for scikit-learn's KMeans clustering algorithm using the framework3 BaseFilter interface.
This filter implements the K-Means clustering algorithm within the framework3 ecosystem.
Key Features
- Integrates scikit-learn's KMeans with framework3
- Supports various KMeans parameters like number of clusters, initialization method, and algorithm
- Provides methods for fitting the model, making predictions, and transforming data
- Includes a static method for generating parameter grids for hyperparameter tuning
Usage
The KMeansFilter can be used to perform K-Means clustering on your data:
```python
from framework3.plugins.filters.clustering.kmeans import KMeansFilter
from framework3.base.base_types import XYData
import numpy as np

# Create sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
X_data = XYData(_hash='X_data', _path='/tmp', _value=X)

# Create and fit the KMeans filter
kmeans = KMeansFilter(n_clusters=2, random_state=42)
kmeans.fit(X_data)

# Make predictions
X_test = XYData(_hash='X_test', _path='/tmp', _value=np.array([[0, 0], [4, 4]]))
predictions = kmeans.predict(X_test)
print(predictions.value)
```
Attributes:

| Name | Type | Description |
|---|---|---|
| `_clf` | `KMeans` | The underlying scikit-learn KMeans clustering model. |
Methods:

| Name | Description |
|---|---|
| `fit` | `fit(x: XYData, y: Optional[XYData], evaluator: BaseMetric \| None = None) -> Optional[float]`: Fit the KMeans model to the given data. |
| `predict` | `predict(x: XYData) -> XYData`: Predict the closest cluster for each sample in X. |
| `transform` | `transform(x: XYData) -> XYData`: Transform X to a cluster-distance space. |
| `item_grid` | Generate a parameter grid for hyperparameter tuning. |
Note
This filter uses scikit-learn's implementation of KMeans, which may have its own dependencies and requirements. Ensure that scikit-learn is properly installed and compatible with your environment.
__init__(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, random_state=None, algorithm='lloyd')¶
Initialize a new KMeansFilter instance.
This constructor sets up the KMeansFilter with the specified parameters and initializes the underlying scikit-learn KMeans model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n_clusters` | `int` | The number of clusters to form. | `8` |
| `init` | `Literal['k-means++', 'random']` | Method for initialization. | `'k-means++'` |
| `n_init` | `int` | Number of times the k-means algorithm will be run with different centroid seeds. | `10` |
| `max_iter` | `int` | Maximum number of iterations of the k-means algorithm for a single run. | `300` |
| `tol` | `float` | Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence. | `0.0001` |
| `random_state` | `Optional[int]` | Determines random number generation for centroid initialization. | `None` |
| `algorithm` | `Literal['lloyd', 'elkan']` | K-means algorithm to use. | `'lloyd'` |
Note
The parameters are passed directly to scikit-learn's KMeans. Refer to scikit-learn's documentation for detailed information on these parameters.
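For illustration, a minimal sketch of constructing the filter with non-default values; the argument names come from the signature above, and the values are arbitrary:

```python
from framework3.plugins.filters.clustering.kmeans import KMeansFilter

# Arbitrary illustrative values; each argument maps directly onto scikit-learn's KMeans
kmeans = KMeansFilter(
    n_clusters=5,
    init='random',
    n_init=20,
    max_iter=500,
    tol=1e-5,
    random_state=0,
    algorithm='elkan',
)
```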
fit(x, y, evaluator=None)¶
Fit the KMeans model to the given data.
This method trains the KMeans model on the provided input features.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `XYData` | The input features for training. | required |
| `y` | `Optional[XYData]` | Not used; present for API consistency. | required |
| `evaluator` | `BaseMetric \| None` | An optional evaluator for the model. Not used in this method. | `None` |

Returns:

| Type | Description |
|---|---|
| `Optional[float]` | The inertia (within-cluster sum-of-squares) of the fitted model. |
Note
This method uses scikit-learn's fit method internally. The inertia is returned as a measure of how well the model fits the data.
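Continuing the usage example at the top of this page, a short sketch of reading the returned inertia:

```python
# `fit` returns the model's inertia (within-cluster sum of squares); lower means tighter clusters
inertia = kmeans.fit(X_data)
print(f"Inertia of the fitted model: {inertia}")
```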
item_grid(**kwargs) staticmethod¶
Generate a parameter grid for hyperparameter tuning.
This static method provides a way to generate a grid of parameters for use in hyperparameter optimization techniques like grid search.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Dict[str, Any]` | Keyword arguments representing the parameter names and their possible values. | `{}` |

Returns:

| Type | Description |
|---|---|
| `Dict[str, Any]` | A dictionary of parameter names and their possible values. |
Note
The returned dictionary can be used directly with hyperparameter tuning tools that accept parameter grids, such as scikit-learn's GridSearchCV. The parameter names are prefixed with "KMeansFilter__" for compatibility with nested estimators.
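A small illustrative sketch, based on the note above about the `KMeansFilter__` prefix (the values are arbitrary):

```python
grid = KMeansFilter.item_grid(n_clusters=[2, 3, 4], init=['k-means++', 'random'])
# Per the note above, the keys are prefixed for nested estimators, e.g.:
# {'KMeansFilter__n_clusters': [2, 3, 4], 'KMeansFilter__init': ['k-means++', 'random']}
print(grid)
```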
predict(x)¶
Predict the closest cluster for each sample in X.
This method uses the trained KMeans model to predict cluster labels for new input data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `XYData` | The input features to predict. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `XYData` | `XYData` | The predicted cluster labels wrapped in an XYData object. |
Note
This method uses scikit-learn's predict method internally. The predictions are wrapped in an XYData object for consistency with the framework.
transform(x)¶
Transform X to a cluster-distance space.
This method computes the distance between each sample in X and the cluster centers.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `XYData` | The input features to transform. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `XYData` | `XYData` | The transformed data wrapped in an XYData object. |
Note
This method uses scikit-learn's transform method internally. The transformed data is wrapped in an XYData object for consistency with the framework.
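Continuing the usage example at the top of this page, a short sketch of the cluster-distance space returned by `transform`:

```python
# Each row holds the distances from one sample to every cluster center
distances = kmeans.transform(X_test)
print(distances.value.shape)  # expected: (2, 2) -> 2 samples x 2 clusters
print(distances.value)
```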
Overview¶
The Clustering Filters module in framework3 provides a collection of unsupervised learning algorithms for clustering data. These filters are designed to work seamlessly within the framework3 ecosystem, offering a consistent interface and enhanced functionality for various clustering tasks.
Available Clustering Algorithms¶
K-Means Clustering¶
The K-Means clustering algorithm is implemented in `KMeansFilter`. This popular clustering method aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean (cluster centroid).
Usage¶
```python
from framework3.plugins.filters.clustering.kmeans import KMeansFilter

kmeans_clusterer = KMeansFilter(n_clusters=3, init='k-means++', n_init=10, max_iter=300)
```
Parameters¶
- `n_clusters` (int): The number of clusters to form and the number of centroids to generate.
- `init` (str): Method for initialization of centroids. Options include 'k-means++' and 'random'.
- `n_init` (int): Number of times the k-means algorithm will be run with different centroid seeds.
- `max_iter` (int): Maximum number of iterations for a single run.
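Continuing the snippet above, a minimal sketch of fitting this clusterer on framework3 data; the `XYData` construction mirrors the usage example earlier on this page, and the data values are purely illustrative:

```python
import numpy as np
from framework3.base.base_types import XYData

# Purely illustrative data: 100 two-dimensional points
X = XYData(_hash='kmeans_demo', _path='/tmp', _value=np.random.rand(100, 2))

# Fit the clusterer defined above and predict a cluster index for every sample
kmeans_clusterer.fit(X)
labels = kmeans_clusterer.predict(X)
print(labels.value)  # cluster indices in the range [0, n_clusters)
```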
DBSCAN Clustering¶
The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is implemented in `DBSCANFilter`. This algorithm is particularly effective for datasets with clusters of arbitrary shape.
Usage¶
```python
from framework3.plugins.filters.clustering.dbscan import DBSCANFilter

dbscan_clusterer = DBSCANFilter(eps=0.5, min_samples=5)
```
Parameters¶
- `eps` (float): The maximum distance between two samples for one to be considered as in the neighborhood of the other.
- `min_samples` (int): The number of samples in a neighborhood for a point to be considered as a core point.
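Assuming `DBSCANFilter` exposes the same `BaseFilter`-style `fit`/`predict` interface as `KMeansFilter` above, a minimal sketch might look like this:

```python
from framework3.base.base_types import XYData
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape DBSCAN handles well
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = XYData(_hash='dbscan_demo', _path='/tmp', _value=X_moons)

# Assumed BaseFilter-style interface, mirroring KMeansFilter above
dbscan_clusterer.fit(X)
labels = dbscan_clusterer.predict(X)
print(labels.value)  # DBSCAN conventionally labels noise points as -1
```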
Comprehensive Example: Clustering with Synthetic Data¶
In this example, we'll demonstrate how to use the Clustering Filters with synthetic data, showcasing both K-Means and DBSCAN algorithms, as well as integration with GridSearchCV for parameter tuning.
```python
from framework3.plugins.pipelines.gs_cv_pipeline import GridSearchCVPipeline
from framework3.plugins.filters.clustering.kmeans import KMeansFilter
from framework3.plugins.filters.clustering.dbscan import DBSCANFilter
from framework3.base.base_types import XYData
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import silhouette_score
import numpy as np

# Generate synthetic datasets
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Create XYData objects
X_blobs_data = XYData(_hash='X_blobs', _path='/tmp', _value=X_blobs)
X_moons_data = XYData(_hash='X_moons', _path='/tmp', _value=X_moons)

# K-Means Clustering
kmeans_pipeline = GridSearchCVPipeline(
    filterx=[KMeansFilter],
    param_grid=KMeansFilter.item_grid(n_clusters=[2, 3, 4, 5], init=['k-means++', 'random']),
    scoring='silhouette',
    cv=5
)

# Fit K-Means on the blobs dataset
kmeans_pipeline.fit(X_blobs_data)

# Make predictions
kmeans_labels = kmeans_pipeline.predict(X_blobs_data)
print("K-Means Cluster Labels:", kmeans_labels.value)

# DBSCAN Clustering
dbscan_pipeline = GridSearchCVPipeline(
    filterx=[DBSCANFilter],
    param_grid=DBSCANFilter.item_grid(eps=[0.1, 0.2, 0.3], min_samples=[3, 5, 7]),
    scoring='silhouette',
    cv=5
)

# Fit DBSCAN on the moons dataset
dbscan_pipeline.fit(X_moons_data)

# Make predictions
dbscan_labels = dbscan_pipeline.predict(X_moons_data)
print("DBSCAN Cluster Labels:", dbscan_labels.value)

# Evaluate the models with silhouette scores
kmeans_silhouette = silhouette_score(X_blobs, kmeans_labels.value)
dbscan_silhouette = silhouette_score(X_moons, dbscan_labels.value)
print("K-Means Silhouette Score:", kmeans_silhouette)
print("DBSCAN Silhouette Score:", dbscan_silhouette)
```
This example demonstrates how to:
- Generate synthetic datasets suitable for different clustering algorithms
- Create XYData objects for use with framework3
- Set up GridSearchCV pipelines for both K-Means and DBSCAN clustering
- Fit the models and make predictions
- Evaluate the models using silhouette scores
Best Practices¶
- Data Preprocessing: Ensure your data is properly preprocessed before applying clustering filters. This may include scaling, normalization, or handling missing values (see the sketch after this list).
- Algorithm Selection: Choose the appropriate clustering algorithm based on the characteristics of your data and the specific requirements of your problem.
- Parameter Tuning: Use `GridSearchCVPipeline` to find the optimal parameters for your chosen clustering algorithm, as demonstrated in the example.
- Cluster Evaluation: Always evaluate your clustering results using appropriate metrics such as the silhouette score, Calinski-Harabasz index, or Davies-Bouldin index (also shown in the sketch below).
- Visualization: Visualize your clustering results to gain insights into the structure of your data and the performance of the clustering algorithm.
- Ensemble Methods: Consider using ensemble clustering techniques to improve the robustness and stability of your clustering results.
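To make the preprocessing and evaluation points concrete, here is a small, self-contained sketch using plain scikit-learn rather than the framework3 wrappers: standard scaling before K-Means, followed by three common internal cluster-quality metrics.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# 1. Preprocessing: put every feature on a comparable scale before clustering
X_scaled = StandardScaler().fit_transform(X)

# 2. Cluster and predict labels
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

# 3. Evaluation: three common internal metrics
print("Silhouette:       ", silhouette_score(X_scaled, labels))         # higher is better, in [-1, 1]
print("Calinski-Harabasz:", calinski_harabasz_score(X_scaled, labels))  # higher is better
print("Davies-Bouldin:   ", davies_bouldin_score(X_scaled, labels))     # lower is better
```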
Conclusion¶
The Clustering Filters module in framework3 provides a powerful set of tools for unsupervised learning tasks. By leveraging these filters in combination with other framework3 components, you can build efficient and effective clustering pipelines. The example with synthetic data demonstrates how easy it is to use these clustering algorithms and integrate them with GridSearchCV for parameter tuning.