CompareClustering
CompareClustering(in_dataset, id_column, desc_columns, out_folder, columns_name=[], cmap='magma', title='Adjusted Rand Index Heatmap', verbose=False, save_parameters=False, overwrite=False)
Compare Clustering Results
CompareClustering is a script that compares clustering results from multiple solutions using the Adjusted Rand Index (ARI) and produces a heatmap of the results.
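The comparison described above can be sketched as a pairwise ARI matrix over all solutions. This is a minimal illustration, not the script's actual implementation: the helper name pairwise_ari, the solution names, and the toy labels are assumptions made for the example. Only scikit-learn's adjusted_rand_score and NumPy are used.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def pairwise_ari(solutions):
    """Return (names, matrix) where matrix[i, j] is the ARI between
    solution i and solution j. `solutions` maps a name to a label list.
    Hypothetical helper, not part of CompareClustering."""
    names = list(solutions)
    # Each solution agrees perfectly with itself, so the diagonal is 1.
    matrix = np.ones((len(names), len(names)))
    for i, name_a in enumerate(names):
        for j in range(i + 1, len(names)):
            score = adjusted_rand_score(solutions[name_a], solutions[names[j]])
            # The ARI is symmetric, so fill both triangles.
            matrix[i, j] = matrix[j, i] = score
    return names, matrix


# Toy cluster labels for the same six subjects (illustrative only).
solutions = {
    "kmeans": [0, 0, 1, 1, 2, 2],
    "hierarchical": [1, 1, 0, 0, 2, 2],  # same partition, different labels
    "gmm": [0, 0, 0, 1, 1, 1],
}
names, ari = pairwise_ari(solutions)
```

The resulting matrix could then be drawn as a heatmap, for instance with matplotlib's imshow and a colormap such as 'magma'.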
Adjusted Rand Index
The Adjusted Rand Index (ARI) [1] measures the similarity between two clustering results by counting the pairs of samples that are grouped together, or kept apart, in both solutions. It is symmetric and ignores the actual label values, so no ground truth is required: any two clustering solutions can be compared. An ARI of 1 means that the two clustering results are identical (up to a relabelling of the clusters), a value close to 0 means that the agreement is no better than expected from random labelling, and negative values (bounded below by -0.5 [4]) mean that the agreement is lower than expected by chance. The ARI extends the Rand Index (RI) [2] by correcting for chance: the raw RI lies between 0 and 1, but its expected value for random labellings is not 0 and tends to increase with the number of clusters.
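To make the chance correction concrete, here is a minimal pure-Python sketch of the Hubert-Arabie formula, computed from the contingency counts of two label lists. The function name adjusted_rand_index is an assumption for illustration; it is not part of CompareClustering, and in practice scikit-learn's adjusted_rand_score [4] would normally be used instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same samples (illustrative sketch)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same samples.")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Pairs placed together in both solutions (contingency table cells).
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together within each solution separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement by chance
    maximum = (together_a + together_b) / 2           # best possible agreement
    if maximum == expected:
        return 1.0  # degenerate case: both solutions are the same trivial partition
    return (together_both - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))            # identical up to relabelling
print(round(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]), 2))  # partial agreement
print(round(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]), 2))  # worse than chance
```

The three calls illustrate an identical pair of solutions, a partially agreeing pair, and a pair that agrees less than chance would predict.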
References
[1] Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1), 193-218. (https://doi.org/10.1007/BF01908075)
[2] Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846-850. (https://doi.org/10.2307/2284239)
[3] Steinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3), 386. (https://psycnet.apa.org/doi/10.1037/1082-989X.9.3.386)
[4] Scikit-Learn Documentation - Adjusted Rand Index (ARI) (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html)
Example Usage
Parameters
in_dataset : Input dataset(s) (at least 2 are expected to produce a comparison).
id_column : Name of the column containing the subject’s ID tag. Required for proper handling of IDs and merging multiple datasets.
desc_columns : Number of descriptive columns at the beginning of the dataset to exclude from statistics and descriptive tables.
out_folder : Output folder containing the results.
columns_name : Name given to each input dataset (needs to be in the same order as the input datasets).
cmap : Name of the matplotlib colormap to use. Defaults to 'magma'. See https://matplotlib.org/stable/tutorials/colors/colormaps.html
title : Heatmap title.
verbose : If true, produce verbose output.
save_parameters : If true, save the input parameters to a .txt file.
overwrite : If true, force overwriting of existing output files.