
CompareClustering

CompareClustering(in_dataset, id_column, desc_columns, out_folder, columns_name=[], cmap='magma', title='Adjusted Rand Index Heatmap', verbose=False, save_parameters=False, overwrite=False)

Compare Clustering Results

CompareClustering is a script that compares clustering results from multiple solutions using the Adjusted Rand Index (ARI) and produces a heatmap of the results.

Adjusted Rand Index

The Adjusted Rand Index (ARI) [1] measures the similarity between two clusterings of the same subjects. It counts the pairs of subjects that both clusterings treat the same way (grouped together in both, or separated in both) and corrects that count for the agreement expected by chance. Neither clustering has to be a ground truth: the score is symmetric and does not depend on how the cluster labels are named. An ARI of 1 means that the two clusterings are identical (up to a relabelling of the clusters), a value close to 0 means that they agree no more than random labelling would, and negative values, bounded below by -0.5 [4], mean that they agree less than expected by chance. The ARI extends the Rand Index (RI) [2], which is not corrected for chance: the expected RI of two random clusterings is not constant and tends to increase with the number of clusters.
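To make the chance correction concrete, here is a minimal pure-Python sketch of the Hubert-Arabie formula computed from the contingency table of two labelings. The function name `adjusted_rand_index` is hypothetical and this is not necessarily how CompareClustering computes the score internally; it only illustrates the definition.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie ARI between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    # Contingency counts: items in cluster i of A and cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(n, 2) for n in cell_counts.values())
    sum_rows = sum(comb(n, 2) for n in Counter(labels_a).values())
    sum_cols = sum(comb(n, 2) for n in Counter(labels_b).values())

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on
    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2           # largest attainable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (maximum - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial_match = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
discordant = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, partial_match, discordant)
```

The first pair differs only in how the clusters are named and scores 1, the second agrees partially, and the third is a maximally discordant split of four items that reaches the lower bound of -0.5.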

References

[1] Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1), 193-218. (https://doi.org/10.1007/BF01908075)

[2] Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846-850. (https://doi.org/10.2307/2284239)

[3] Steinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3), 386. (https://psycnet.apa.org/doi/10.1037/1082-989X.9.3.386)

[4] Scikit-Learn Documentation - Adjusted Rand Index (ARI) (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html)

Example Usage

CompareClustering --in_dataset dataset1.csv --in_dataset dataset2.csv \
    --in_dataset dataset3.csv --id_column ID --desc_columns 1 \
    --out_folder ./ --columns_name dataset1 dataset2 dataset3 \
    --title "ARI Heatmap" --verbose

Parameters

in_dataset : Input dataset(s) (at least 2 are expected to produce a comparison).

id_column : Name of the column containing the subject’s ID tag. Required for proper handling of IDs and merging multiple datasets.

desc_columns : Number of descriptive columns at the beginning of the dataset to exclude from statistics and descriptive tables.

out_folder : Output folder containing the results.

columns_name : Names given to the input datasets (must be listed in the same order as the input datasets).

cmap : Name of the colormap to use. Defaults to “magma”. See https://matplotlib.org/stable/tutorials/colors/colormaps.html

title : Heatmap title.

verbose : If true, produce verbose output.

save_parameters : If true, save the input parameters to a .txt file.

overwrite : If true, force overwriting of existing output files.
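As a rough illustration of what the comparison amounts to, the sketch below aligns several clustering solutions on a shared ID column and fills a symmetric matrix of pairwise ARI values using scikit-learn's `adjusted_rand_score` [4] and pandas. The helper `pairwise_ari`, the `cluster` column name, and the toy data are assumptions made for this example; the actual layout of the input files and the script's internals are not described here and may differ.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import adjusted_rand_score


def pairwise_ari(datasets, id_column, label_column):
    """Symmetric ARI matrix for named DataFrames sharing an ID column.

    `label_column` is a hypothetical name for the column of cluster labels.
    """
    names = list(datasets)
    matrix = pd.DataFrame(1.0, index=names, columns=names)
    for a, b in combinations(names, 2):
        # Inner merge keeps only the subjects present in both solutions.
        merged = datasets[a][[id_column, label_column]].merge(
            datasets[b][[id_column, label_column]],
            on=id_column,
            suffixes=("_a", "_b"),
        )
        score = adjusted_rand_score(
            merged[f"{label_column}_a"], merged[f"{label_column}_b"]
        )
        matrix.loc[a, b] = matrix.loc[b, a] = score
    return matrix


# Toy data standing in for dataset1.csv, dataset2.csv and dataset3.csv.
datasets = {
    "dataset1": pd.DataFrame({"ID": [1, 2, 3, 4], "cluster": [0, 0, 1, 1]}),
    "dataset2": pd.DataFrame({"ID": [4, 3, 2, 1], "cluster": [5, 5, 9, 9]}),
    "dataset3": pd.DataFrame({"ID": [1, 2, 3, 4], "cluster": [0, 1, 0, 1]}),
}
ari_matrix = pairwise_ari(datasets, id_column="ID", label_column="cluster")
print(ari_matrix)
```

Merging on the ID column before scoring matters because the rows of each file need not be in the same order, as with `dataset2` above. A matrix like this could then be rendered as a heatmap with the colormap named by `cmap`.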