cli.FuzzyClustering

FuzzyClustering(in_dataset, id_column, desc_columns, k=10, m=2, error=1e-06, maxiter=1000, init=None, metric=DistanceMetrics.euclidean, pca=False, out_folder='./fuzzy_results/', processes=1, parallelplot=False, radarplot=True, cmap='magma', verbose=False, save_parameters=False, overwrite=False)

Fuzzy Clustering

FuzzyClustering is a wrapper script for a Fuzzy C-Means clustering analysis. By design, the script fits a model for each number of clusters up to the maximum set by --k, then returns various evaluation metrics and summary barplots/parallel plots.

Evaluation Metrics

The fuzzy partition coefficient (FPC) is a metric defined between 0 and 1, with 1 being the best score. It represents how well the data is described by the clustering model, so a higher FPC indicates a better-fitted model. On real-world data, a local maximum can also be interpreted as one of the optimal solutions.
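As an illustration of the FPC described above, here is a minimal sketch that computes it from a membership matrix as the sum of squared memberships divided by the number of samples. The function name and the list-of-rows layout (one row per cluster) are assumptions made for this example; this is not the script's own code.

```python
def fuzzy_partition_coefficient(u):
    """Return the FPC of a membership matrix.

    `u` is a list of rows, one per cluster; each row holds the membership
    of every sample in that cluster, so each column sums to 1.
    """
    n_samples = len(u[0])
    # Sum of all squared memberships, averaged over the samples.
    return sum(value ** 2 for row in u for value in row) / n_samples


# Illustrative matrices (made up): a crisp partition and a fully ambiguous one.
crisp = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
ambiguous = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
print(fuzzy_partition_coefficient(crisp))
print(fuzzy_partition_coefficient(ambiguous))
```

With c clusters, the lowest value this formula can reach is 1/c (every sample shared equally between all clusters), so scores close to 1/c indicate a poorly defined partition.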

The Silhouette Coefficient evaluates how well defined the clusters are. The score is bounded between -1 and 1, with 1 being the perfect score and -1 indicating a poor clustering result. A higher Silhouette Coefficient relates to a model with better-defined clusters (and therefore a better model). It tends to be higher for convex clusters than for other kinds of clusters, such as those produced by density-based methods.
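The Silhouette Coefficient can be sketched in a few lines of plain Python. The example below assumes hard labels (for instance, each sample assigned to its highest-membership cluster, which is an assumption about how fuzzy results would be scored) and Euclidean distance; the function name is hypothetical and at least two clusters are required.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette over all samples, using Euclidean distance."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster (p itself excluded).
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster: silhouette is set to 0
            continue
        a = mean_dist[own]  # cohesion: mean distance within the own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Illustrative data (made up): two tight, well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_coefficient(points, [0, 0, 1, 1]))  # sensible labeling
print(silhouette_coefficient(points, [0, 1, 0, 1]))  # poor labeling
```

The sensible labeling scores close to 1, while the labeling that mixes the two groups yields a negative score.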

The Calinski-Harabasz Index (or Variance Ratio Criterion) can be used when no ground-truth labels are available. It reflects the density and separation of clusters: a higher Calinski-Harabasz Index relates to better-defined clusters. It tends to be higher for convex clusters than for other kinds of clusters, such as those produced by density-based methods.
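To make the variance ratio concrete, the sketch below computes the Calinski-Harabasz Index as the between-cluster dispersion divided by the within-cluster dispersion, each scaled by its degrees of freedom. It assumes hard labels and at least two clusters; the function name is hypothetical.

```python
def calinski_harabasz(points, labels):
    """Variance ratio: (between / (k - 1)) / (within / (n - k))."""
    n = len(points)
    dims = range(len(points[0]))
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in dims]
    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, label in zip(points, labels) if label == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in dims]
        # Size-weighted squared distance from the centroid to the overall mean.
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in dims)
        # Squared distance from each member to its own centroid.
        within += sum((p[d] - centroid[d]) ** 2 for p in members for d in dims)
    return (between / (k - 1)) / (within / (n - k))


# Illustrative data (made up).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(calinski_harabasz(points, [0, 0, 1, 1]))
```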

The Davies-Bouldin Index (DBI) is reported for all cluster models. A lower DBI relates to a model with better cluster separation. It measures the similarity between clusters and is based solely on quantities and features of the dataset. It tends to be higher for convex clusters, and because it relies on the distance between cluster centroids, it limits the distance metric to Euclidean space.
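A minimal sketch of the Davies-Bouldin Index follows: for each cluster, it takes the worst ratio of combined within-cluster scatter to centroid distance, then averages those ratios. It assumes hard labels, Euclidean distance, and at least two clusters; the function name is hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = sorted(set(labels))
    dims = range(len(points[0]))
    centroids, scatter = {}, {}
    for c in clusters:
        members = [p for p, label in zip(points, labels) if label == c]
        centroids[c] = [sum(p[d] for p in members) / len(members) for d in dims]
        # Mean distance from the members to their centroid.
        scatter[c] = sum(math.dist(p, centroids[c]) for p in members) / len(members)
    worst = []
    for i in clusters:
        worst.append(max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        ))
    return sum(worst) / len(worst)


# Illustrative data (made up).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(davies_bouldin(points, [0, 0, 1, 1]))
```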

The within-cluster sum of squared errors (WSS) represents the total squared distance from each point to its cluster centroid. WSS is combined with the elbow method to determine the optimal number of clusters k.
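The sketch below shows one way to compute the WSS for hard labels, together with a simple numerical stand-in for the elbow method: picking the k where the WSS curve bends most (largest second difference). Both function names are hypothetical, the elbow heuristic is only one of several possible choices (the elbow is often read visually from the plot), and it assumes consecutive values of k.

```python
def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    dims = range(len(points[0]))
    total = 0.0
    for c in set(labels):
        members = [p for p, label in zip(points, labels) if label == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in dims]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in dims)
    return total


def elbow_k(wss_by_k):
    """Return the k with the largest second difference of the WSS curve."""
    ks = sorted(wss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = wss_by_k[prev] - 2 * wss_by_k[cur] + wss_by_k[nxt]
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k


# Illustrative data and WSS values (made up).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(within_cluster_ss(points, [0, 0, 1, 1]))
print(elbow_k({1: 100.0, 2: 20.0, 3: 15.0, 4: 12.0}))
```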

The GAP statistic is based on the WSS. It relies on computing the difference in cluster compactness between the actual data and data simulated from a null distribution. The optimal number of clusters k is identified by a maximized GAP statistic (local maxima can also suggest possible solutions).
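For a single value of k, the GAP statistic can be sketched as the mean log-WSS over the simulated (null) datasets minus the log-WSS of the actual data. The function below assumes those WSS values were already computed elsewhere; its name and the numbers in the example are hypothetical.

```python
import math


def gap_statistic(wss_observed, wss_reference):
    """Gap for one k: mean(log(W_reference)) - log(W_observed)."""
    expected = sum(math.log(w) for w in wss_reference) / len(wss_reference)
    return expected - math.log(wss_observed)


# Illustrative values (made up): observed WSS and WSS of three simulated datasets per k.
observed = {2: 50.0, 3: 12.0, 4: 10.0}
simulated = {2: [80.0, 85.0, 82.0], 3: [60.0, 62.0, 58.0], 4: [45.0, 47.0, 44.0]}
gaps = {k: gap_statistic(observed[k], simulated[k]) for k in observed}
best_k = max(gaps, key=gaps.get)  # k with the maximized gap
print(gaps, best_k)
```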

Configurations

Details regarding the parameters can be seen below. The --m parameter defines the degree of fuzziness of the resulting membership matrix. Using --m 1 returns crisp clusters, whereas --m values greater than 1 return increasingly fuzzy clusters. It is also possible to pre-initialize the c-partitioned matrix from a previous membership matrix. To do so, specify a folder containing a membership matrix for each number of clusters (meaning that if you want to perform clustering up to k=10, you need a membership matrix for each of them). The folder must respect this naming convention:

[init_folder]
    |-- cluster_membership_1.npy
    |-- cluster_membership_2.npy
    |-- [...]
    └-- cluster_membership_{k}.npy
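To illustrate how --m controls fuzziness, the sketch below applies the standard Fuzzy C-Means membership update to a single point, given its distance to each centroid. The function name is hypothetical, the distances are made up and must be nonzero, and this formula is undefined at exactly m = 1, so how the script handles --m 1 is not shown here.

```python
def memberships(distances, m):
    """Membership of one point in each cluster, from its distance to each centroid."""
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]


# One point that is twice as far from the second centroid as from the first.
for m in (1.1, 2.0, 5.0):
    print(m, [round(u, 3) for u in memberships([1.0, 2.0], m)])
```

As m approaches 1, the membership in the nearest cluster approaches 1 (nearly crisp); as m grows, the memberships drift toward an even split.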

Output Folder Structure

The script creates a default output structure in the destination specified with --out-folder. The output structure is as follows:

[out_folder]
    |-- CENTROIDS
    |       |-- clusters_centroids_2.xlsx
    |       |-- [...]
    |       └-- clusters_centroids_{k}.xlsx
    |-- MEMBERSHIP_DF
    |       |-- clusters_membership_2.xlsx
    |       |-- [...]
    |       └-- clusters_membership_{k}.xlsx
    |-- MEMBERSHIP_MAT (in .npy format)
    |-- METRICS
    |       |-- chi.png
    |       |-- [...]
    |       └-- wss.png
    |-- PARALLEL_PLOTS (optional)
    |       |-- parallel_plot_2clusters.png
    |       |-- [...]
    |       └-- parallel_plot_{k}clusters.png
    |-- PCA (optional)
    |       |-- transformed_data.xlsx
    |       |-- variance_explained.xlsx
    |       └-- pca_model.joblib
    |-- RADAR_PLOTS (optional)
    |       |-- radar_plot_2clusters.png
    |       |-- [...]
    |       └-- radar_plot_{k}clusters.png
    |-- validation_indices.xlsx
    └-- viz_multiple_cluster_nb.png
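When post-processing results, it can help to build the expected paths for a given number of clusters from the layout above. The helper below is a hypothetical convenience, not part of the script; it only joins the folder and file names listed in the structure and does not check that the files exist.

```python
from pathlib import Path


def expected_outputs(out_folder, k):
    """Map a few output names to their expected paths for a k-cluster solution."""
    root = Path(out_folder)
    return {
        "centroids": root / "CENTROIDS" / f"clusters_centroids_{k}.xlsx",
        "membership": root / "MEMBERSHIP_DF" / f"clusters_membership_{k}.xlsx",
        "radar_plot": root / "RADAR_PLOTS" / f"radar_plot_{k}clusters.png",
        "validation": root / "validation_indices.xlsx",
    }


for name, path in expected_outputs("./fuzzy_results/", 3).items():
    print(name, path)
```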

References

[1] Scikit-Fuzzy Documentation (https://pythonhosted.org/scikit-fuzzy/auto_examples/plot_cmeans.html)

[2] Scikit-Learn Documentation - Clustering Performance Evaluation (https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation)

[3] Selecting the optimal number of clusters - 1 (https://towardsdatascience.com/cheat-sheet-to-implementing-7-methods-for-selecting-optimal-number-of-clusters-in-python-898241e1d6ad)

[4] Selecting the optimal number of clusters - 2 (https://towardsdatascience.com/how-to-determine-the-right-number-of-clusters-with-code-d58de36368b1)

[5] Scikit-Fuzzy GitHub Repository (https://github.com/scikit-fuzzy/scikit-fuzzy)

Example Usage

FuzzyClustering --in-dataset dataset.csv --id-column ID --desc-columns 1 \
    --k 10 --m 2 --error 1e-6 --maxiter 1000 --init init_folder \
    --metric euclidean --pca --out-folder ./fuzzy_results/ --processes 4 \
    --verbose

Parameters

in_dataset : Input dataset.

id_column : Name of the column containing the subject’s ID tag. Required for proper handling of IDs and merging multiple datasets.

desc_columns : Number of descriptive columns at the beginning of the dataset to exclude from statistics and descriptive tables.

k : Maximum number of clusters to fit a model for. (The script will iterate until k is reached.)

m : Exponent applied to the membership function; it determines the degree of fuzziness of the membership matrix.

error : Error threshold for convergence stopping criterion.

maxiter : Maximum number of iterations to perform.

init : Folder containing the initial fuzzy c-partitioned matrices (see Configurations).

metric : Metric to use to compute the distance between the original points and the cluster centroids.

pca : If set, will perform PCA decomposition to 2 components before clustering.

out_folder : Path of the folder in which the results will be written. If not specified, current folder and default name will be used.

processes : Number of processes to launch in parallel.

parallelplot : If true, will output parallel plot for each cluster solution. Default is False.

radarplot : If true, will output radar plot for each cluster solution. Default is True.

cmap : Colormap to use for plotting. Default is “magma”. See https://matplotlib.org/stable/tutorials/colors/colormaps.html.

verbose : If true, produce verbose output.

save_parameters : If true, will save input parameters to .txt file.

overwrite : If true, force overwriting of existing output files.