metrics
compute_evaluation_metrics(X, labels, metric='euclidean')
Function to compute a variety of metrics to evaluate the goodness of fit of a clustering model.
Parameters
X : Data from clustering algorithm to derive metrics from.
labels : List of labels.
metric : Distance metric to use. Defaults to 'euclidean'. Accepts any option supported by sklearn.metrics.pairwise.pairwise_distances.
Returns
ss : Silhouette Score (SS).
chi : Calinski-Harabasz Score (CHI).
dbi : Davies-Bouldin Score (DBI).
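As an illustration of the three returned scores, here is a minimal sketch built directly on scikit-learn. The function name evaluation_metrics_sketch and the toy data are hypothetical, and whether the documented function forwards metric to anything other than the silhouette score is an assumption (scikit-learn's Calinski-Harabasz and Davies-Bouldin functions take no metric argument).

```python
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def evaluation_metrics_sketch(X, labels, metric="euclidean"):
    """Return (ss, chi, dbi) for a crisp labelling of X (illustrative only)."""
    # Silhouette accepts a distance metric; higher is better, range [-1, 1].
    ss = silhouette_score(X, labels, metric=metric)
    # Calinski-Harabasz: ratio of between- to within-cluster dispersion; higher is better.
    chi = calinski_harabasz_score(X, labels)
    # Davies-Bouldin: average cluster similarity; lower is better.
    dbi = davies_bouldin_score(X, labels)
    return ss, chi, dbi


# Toy data (assumption): two tight, well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels_demo = np.array([0] * 20 + [1] * 20)
ss, chi, dbi = evaluation_metrics_sketch(X_demo, labels_demo)
```

For well-separated clusters such as these, the silhouette score should be close to 1 and the Davies-Bouldin score close to 0.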
compute_gap_stats(X, wss, nrefs, n_cluster, m=2, error=1e-06, maxiter=1000, metric='euclidean', init=None)
Function to compute the GAP statistic used to determine the optimal number of clusters. Adapted from: https://towardsdatascience.com/cheat-sheet-to-implementing-7-methods-for-selecting-optimal-number-of-clusters-in-python-898241e1d6ad and https://github.com/milesgranger/gap_statistic
Parameters
X : Data array on which clustering will be computed.
wss : Within Cluster Sum of Squared Error (WSS) for this clustering model.
nrefs : Number of random reference datasets to generate and average over.
n_cluster : Number of clusters for this model.
m : Exponentiation value as used in the main script. Defaults to 2.
error : Convergence error threshold. Defaults to 1E-6.
maxiter : Maximum iterations to perform. Defaults to 1000.
metric : Distance metric to use. Defaults to 'euclidean'.
init : Initial fuzzy c-partitioned matrix. Defaults to None.
Returns
gap : GAP statistic.
sk : Standard deviation of the GAP statistic.
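The arithmetic behind the two return values can be sketched as follows, using the definitions from Tibshirani et al. (2001). This is not the documented implementation: gap_from_wss is a hypothetical helper that assumes the WSS of each reference dataset (ref_wss) has already been computed by clustering nrefs random datasets, and the sqrt(1 + 1/B) correction on sk is taken from the paper, not confirmed by this page.

```python
import numpy as np


def gap_from_wss(wss, ref_wss):
    """Gap statistic and its spread from observed and reference WSS values.

    wss     : WSS of the clustering on the real data (scalar).
    ref_wss : WSS values obtained on the random reference datasets (length B).
    """
    log_ref = np.log(np.asarray(ref_wss, dtype=float))
    # Gap = mean of log(W_ref) over the B references, minus log(W_observed).
    gap = log_ref.mean() - np.log(wss)
    # s_k = sd(log W_ref) * sqrt(1 + 1/B), with the population standard deviation.
    sk = log_ref.std() * np.sqrt(1.0 + 1.0 / log_ref.size)
    return gap, sk


# Toy values (assumption): every reference dataset clusters twice as loosely.
gap, sk = gap_from_wss(wss=10.0, ref_wss=[20.0, 20.0, 20.0])
```

With these toy values the gap equals log(2) and sk is zero, because all reference WSS values are identical.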
compute_knee_location(lst, direction='decreasing')
Function to compute the elbow location using the Kneed package.
Parameters
lst: list : List of values representing the indicators to identify the elbow location.
direction: str, optional : Direction of the curve. Defaults to 'decreasing'.
Returns
elbow: int : Elbow location.
compute_rand_index(dict)
Compute the adjusted Rand Index from a list of fuzzy membership matrices using sklearn.metrics.adjusted_rand_score. A defuzzification step is required since this method applies only to crisp clusters.
Parameters
dict : Dictionary containing all dataframes.
Returns
np.array : Symmetric ndarray.
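The defuzzification step and the pairwise comparison can be sketched as below. This is an assumption-laden illustration, not the documented code: rand_index_matrix_sketch is a hypothetical name, each dictionary value is assumed to be a membership matrix of shape (n_clusters, n_samples), and defuzzification is assumed to mean assigning each sample to its highest-membership cluster.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def rand_index_matrix_sketch(memberships):
    """Pairwise Adjusted Rand Index between defuzzified clusterings.

    memberships : dict mapping a name to a fuzzy membership matrix
                  of shape (n_clusters, n_samples) (assumed layout).
    """
    # Defuzzify: each sample gets the label of its highest-membership cluster.
    crisp = [np.argmax(np.asarray(u), axis=0) for u in memberships.values()]
    n = len(crisp)
    ari = np.ones((n, n))  # a clustering always agrees perfectly with itself
    for i in range(n):
        for j in range(i + 1, n):
            ari[i, j] = ari[j, i] = adjusted_rand_score(crisp[i], crisp[j])
    return ari


# Toy membership matrices (assumption): 2 clusters, 4 samples.
u_a = [[0.9, 0.8, 0.2, 0.1], [0.1, 0.2, 0.8, 0.9]]
u_b = [[0.1, 0.2, 0.8, 0.9], [0.9, 0.8, 0.2, 0.1]]  # same partition, labels swapped
u_c = [[0.9, 0.1, 0.9, 0.1], [0.1, 0.9, 0.1, 0.9]]  # a different partition
ari = rand_index_matrix_sketch({"a": u_a, "b": u_b, "c": u_c})
```

Because the Adjusted Rand Index is invariant to label permutations, the first two clusterings score 1 against each other even though their cluster numbering differs.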
compute_sse(X, cntr, labels)
Function to compute the within-cluster sum of squared errors (WSS). Adapted from: https://towardsdatascience.com/how-to-determine-the-right-number-of-clusters-with-code-d58de36368b1
Parameters
X : Original data (S, N).
cntr : Centroid points (N, F).
labels : Discrete labels (S,).
Returns
WSS : Within-Cluster Sum of Squared Errors (WSS).
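A minimal NumPy sketch of the WSS computation follows. The name wss_sketch is hypothetical, and the array layout is an assumption: X is taken as (n_samples, n_features), cntr as (n_clusters, n_features), and labels as integer cluster indices into cntr.

```python
import numpy as np


def wss_sketch(X, cntr, labels):
    """Sum of squared Euclidean distances from each sample to its own centroid."""
    X = np.asarray(X, dtype=float)
    cntr = np.asarray(cntr, dtype=float)
    labels = np.asarray(labels)
    # cntr[labels] picks, for every sample, the centroid of its assigned cluster.
    diffs = X - cntr[labels]
    return float(np.sum(diffs ** 2))


# Toy data (assumption): two clusters, each point exactly 1 unit from its centroid.
X_demo = [[0, 0], [2, 0], [10, 10], [10, 12]]
cntr_demo = [[1, 0], [10, 11]]
labels_demo = [0, 0, 1, 1]
wss = wss_sketch(X_demo, cntr_demo, labels_demo)  # 1 + 1 + 1 + 1 = 4.0
```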
find_optimal_gap(gap, sk)
Function to find the optimal number of clusters k based on the GAP statistic, using the method from Tibshirani R. et al., 2001 (https://hastie.su.domains/Papers/gap.pdf). Highlights the first value of k for which GAP[k] >= GAP[k+1] - SD[k+1].
Parameters
gap : Ndarray of GAP statistics values for a range of k clusters.
sk : Ndarray of standard deviations, one for each GAP value.
Returns
optimal : Optimal number of clusters.
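The selection rule GAP[k] >= GAP[k+1] - SD[k+1] can be sketched as below. The name first_gap_optimum and the k_min parameter (the number of clusters corresponding to the first array entry) are hypothetical, and returning None when no k satisfies the rule is an assumption about behavior this page does not specify.

```python
import numpy as np


def first_gap_optimum(gap, sk, k_min=1):
    """Return the first k with gap[k] >= gap[k+1] - sk[k+1], or None if none qualifies."""
    gap = np.asarray(gap, dtype=float)
    sk = np.asarray(sk, dtype=float)
    # The last entry has no successor to compare against, so stop one short.
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - sk[i + 1]:
            return k_min + i  # convert array index to number of clusters
    return None


# Toy values (assumption) for k = 1..5: the gap rises, then flattens after k = 3.
gap_demo = [0.20, 0.50, 0.90, 0.85, 0.80]
sk_demo = [0.02, 0.02, 0.02, 0.02, 0.02]
optimal = first_gap_optimum(gap_demo, sk_demo)  # k = 3: 0.90 >= 0.85 - 0.02
```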