clustering.metrics

compute_evaluation_metrics(X, labels, metric='euclidean')

Function to compute a variety of metrics to evaluate the goodness of fit of a clustering model.

Parameters

X : Data that was clustered, from which the metrics are derived.

labels : List of cluster labels, one per sample in X.

metric : Distance metric to use. Defaults to 'euclidean'. Accepts the options supported by sklearn.metrics.pairwise.pairwise_distances.

Returns

ss : Silhouette Score (SS).

chi : Calinski-Harabasz Score (CHI).

dbi : Davies-Bouldin Score (DBI).
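The three returned scores correspond to functions available in sklearn.metrics. The following is a minimal sketch of how such a function could be assembled, not the module's actual implementation. The name evaluation_metrics_sketch is hypothetical, and the sketch assumes that metric is forwarded only to the silhouette score, since the scikit-learn Calinski-Harabasz and Davies-Bouldin functions do not take a metric argument.

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def evaluation_metrics_sketch(X, labels, metric="euclidean"):
    """Hypothetical stand-in for compute_evaluation_metrics."""
    # Silhouette: in [-1, 1], higher means better separated clusters.
    ss = silhouette_score(X, labels, metric=metric)
    # Calinski-Harabasz: higher means denser, better separated clusters.
    chi = calinski_harabasz_score(X, labels)
    # Davies-Bouldin: lower means better separation (0 is the minimum).
    dbi = davies_bouldin_score(X, labels)
    return ss, chi, dbi


# Two well separated synthetic blobs with known crisp labels.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(30, 2)),
    rng.normal(5.0, 0.3, size=(30, 2)),
])
labels = np.array([0] * 30 + [1] * 30)

ss, chi, dbi = evaluation_metrics_sketch(X, labels)
```

For data this cleanly separated, the silhouette score should be close to 1 and the Davies-Bouldin score close to 0.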

compute_gap_stats(X, wss, nrefs, n_cluster, m=2, error=1e-06, maxiter=1000, metric='euclidean', init=None)

Function to compute the GAP statistic used to determine the optimal number of clusters. Adapted from: https://towardsdatascience.com/cheat-sheet-to-implementing-7-methods-for-selecting-optimal-number-of-clusters-in-python-898241e1d6ad and https://github.com/milesgranger/gap_statistic

Parameters

X : Data array on which clustering will be computed.

wss : Within Cluster Sum of Squared Error (WSS) for this clustering model.

nrefs : Number of random reference datasets to generate and average over.

n_cluster : Number of clusters for this model.

m : Exponent applied to the fuzzy membership values, as used in the main script. Defaults to 2.

error : Convergence error threshold. Defaults to 1E-6.

maxiter : Maximum iterations to perform. Defaults to 1000.

metric : Distance metric to use. Defaults to 'euclidean'.

init : Initial fuzzy c-partitioned matrix. Defaults to None.

Returns

gap : GAP statistic.

sk : Standard deviation of the GAP statistic.
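The relationship between wss, the reference datasets, and the two return values can be sketched with the formulas from Tibshirani et al. (2001). This sketch assumes the dispersions of the nrefs reference datasets have already been computed by clustering each one; the function name gap_from_dispersions and the ref_wss argument are hypothetical, and the module's own computation may differ in detail.

```python
import numpy as np


def gap_from_dispersions(wss, ref_wss):
    """Gap statistic and its error term from precomputed dispersions.

    wss     : dispersion of the clustering on the real data
    ref_wss : dispersions obtained on each random reference dataset
    """
    log_ref = np.log(np.asarray(ref_wss, dtype=float))
    nrefs = log_ref.size
    # Gap = mean of log reference dispersions minus log observed dispersion.
    gap = log_ref.mean() - np.log(wss)
    # s_k = sd(log reference dispersions) * sqrt(1 + 1 / nrefs).
    sk = log_ref.std() * np.sqrt(1.0 + 1.0 / nrefs)
    return float(gap), float(sk)


# Three identical reference dispersions: the spread term is zero
# and the gap reduces to log(40 / 10).
gap, sk = gap_from_dispersions(wss=10.0, ref_wss=[40.0, 40.0, 40.0])
```

A larger gap means the real data cluster more tightly than uniformly scattered reference data would for the same number of clusters.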

compute_knee_location(lst, direction='decreasing')

Function to compute the elbow location using the Kneed package.

Parameters

lst: list : List of values representing the indicators to identify the elbow location.

direction: str, optional : Direction of the curve. Defaults to 'decreasing'.

Returns

elbow: int : Elbow location.
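To illustrate what an elbow location means, here is a simplified pure-Python sketch. It does not call Kneed: it swaps in a plain "largest distance from the chord" rule on the normalized curve, without the smoothing Kneed applies. The name knee_location_sketch is hypothetical, and the sketch assumes a monotone curve, a convex shape when decreasing and a concave shape when increasing, and that the result is an index into lst.

```python
def knee_location_sketch(lst, direction="decreasing"):
    """Index of the point farthest from the chord joining both ends."""
    n = len(lst)
    if n < 3:
        raise ValueError("need at least three values to locate an elbow")
    lo, hi = min(lst), max(lst)
    if hi == lo:
        raise ValueError("a flat curve has no elbow")

    best_idx, best_dist = 0, float("-inf")
    for i, y in enumerate(lst):
        x_n = i / (n - 1)            # position scaled to [0, 1]
        y_n = (y - lo) / (hi - lo)   # value scaled to [0, 1]
        if direction == "decreasing":
            # Chord runs from (0, 1) to (1, 0); a convex curve lies below it.
            dist = (1.0 - x_n) - y_n
        else:
            # Chord runs from (0, 0) to (1, 1); a concave curve lies above it.
            dist = y_n - x_n
        if dist > best_dist:
            best_idx, best_dist = i, dist
    return best_idx


# A typical WSS curve: steep drop, then a plateau starting at index 2.
elbow = knee_location_sketch([100, 50, 25, 20, 18, 17])
```

If lst holds one value per candidate number of clusters, the returned index still has to be mapped back to the corresponding k.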

compute_rand_index(dict)

Compute the adjusted Rand Index from a list of fuzzy membership matrices using sklearn.metrics.adjusted_rand_score. A defuzzification step is required since this method applies only to crisp clusters.

Parameters

dict : Dictionary containing all dataframes.

Returns

np.array : Symmetric ndarray of pairwise adjusted Rand index values.
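The defuzzification and pairwise comparison can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the name rand_index_matrix_sketch is hypothetical, each dictionary value is assumed to be a membership matrix of shape (n_clusters, n_samples), and defuzzification is assumed to mean assigning each sample to its highest-membership cluster.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def rand_index_matrix_sketch(memberships):
    """Pairwise adjusted Rand index between defuzzified partitions."""
    names = list(memberships)
    # Defuzzify: each sample takes the cluster with the largest membership.
    crisp = [np.argmax(np.asarray(memberships[name]), axis=0) for name in names]

    n = len(names)
    ari = np.eye(n)  # a partition always agrees perfectly with itself
    for i in range(n):
        for j in range(i + 1, n):
            ari[i, j] = ari[j, i] = adjusted_rand_score(crisp[i], crisp[j])
    return ari


u_a = np.array([[0.9, 0.8, 0.2, 0.1],
                [0.1, 0.2, 0.8, 0.9]])
u_b = u_a[::-1]  # same partition, cluster ids swapped
u_c = np.array([[0.9, 0.2, 0.8, 0.1],
                [0.1, 0.8, 0.2, 0.9]])  # a different partition

ari = rand_index_matrix_sketch({"a": u_a, "b": u_b, "c": u_c})
```

Because the adjusted Rand index ignores how clusters are numbered, the entry for "a" versus "b" is 1 even though their cluster ids are swapped.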

compute_sse(X, cntr, labels)

Function to compute the within-cluster sum of squared errors (WSS). Adapted from: https://towardsdatascience.com/how-to-determine-the-right-number-of-clusters-with-code-d58de36368b1

Parameters

X : Original data (S, N).

cntr : Centroid points (N, F).

labels : Discrete labels (S,).

Returns

WSS : Within-cluster sum of squared errors (WSS).
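As an illustration of the quantity being returned, the sketch below sums the squared Euclidean distance from each sample to the centroid of its assigned cluster. The name sse_sketch is hypothetical, and the sketch assumes that X has one row per sample, that cntr has one row per cluster with the same number of columns as X, and that labels are integer indices into cntr.

```python
import numpy as np


def sse_sketch(X, cntr, labels):
    """Within-cluster sum of squared errors for crisp labels."""
    X = np.asarray(X, dtype=float)
    cntr = np.asarray(cntr, dtype=float)
    labels = np.asarray(labels)
    # cntr[labels] picks, for every sample, the centroid of its cluster.
    diffs = X - cntr[labels]
    return float(np.sum(diffs ** 2))


X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
cntr = np.array([[1.0, 0.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])

# Every sample sits exactly one unit from its centroid.
wss = sse_sketch(X, cntr, labels)
```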

find_optimal_gap(gap, sk)

Function to find the optimal number of clusters k based on the GAP statistic, using the method from Tibshirani R. et al., 2001 (https://hastie.su.domains/Papers/gap.pdf). Returns the first value of k for which GAP[k] >= GAP[k+1] - SD[k+1].

Parameters

gap : Ndarray of GAP statistics values for a range of k clusters.

sk : Ndarray of standard deviations, one for each GAP value.

Returns

optimal : Optimal number of clusters.
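The selection rule can be sketched in a few lines. The name find_optimal_gap_sketch and the k_min parameter are hypothetical: the sketch assumes that gap[0] corresponds to k_min clusters and that consecutive entries correspond to consecutive k values. Returning None when no k satisfies the rule is also an assumption; the module may handle that case differently.

```python
def find_optimal_gap_sketch(gap, sk, k_min=1):
    """First k with gap[k] >= gap[k+1] - sk[k+1], or None if there is none."""
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - sk[i + 1]:
            # Index i corresponds to k_min + i clusters.
            return k_min + i
    return None


gap = [0.20, 0.50, 0.90, 0.92, 0.95]
sk = [0.05, 0.05, 0.05, 0.05, 0.05]

# The gap keeps rising sharply until k = 3, where the next gain
# is smaller than one standard deviation.
optimal = find_optimal_gap_sketch(gap, sk, k_min=1)
```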