
Fuzzy Clustering with NeuroStatX

This tutorial demonstrates how to use the NeuroStatX Python package for fuzzy clustering in cognitive-behavioral neuroscience. We’ll focus on two use cases:

  1. Projecting new data onto precomputed fuzzy centroids (e.g., for generalization or replication across cohorts).
  2. Running fuzzy c-means clustering from scratch to derive data-driven participant profiles (coming soon).

For prediction, the initial clustering approach and validation strategies are detailed in our paper:

Gagnon, A., Gillet, V., Desautels, A.-S., Lepage, J.-F., Baccarelli, A. A., Posner, J., Descoteaux, M., Brunet, M. A., & Takser, L. (2025). Beyond Discrete Classifications: A Computational Approach to the Continuum of Cognition and Behavior in Children. medRxiv. https://doi.org/10.1101/2025.04.14.25325835


Projecting New Participants to Precomputed Centroids

This is useful when you have:

  • Fuzzy centroids already derived on a training set (the centroids derived from the baseline visit of the Adolescent Brain Cognitive Development (ABCD) study are shipped with NeuroStatX (Gagnon A. et al., 2025)).
  • New participant data with similar features (e.g., cognitive and behavioral scores).

Why predict?

Predicting avoids re-fitting the clustering model and keeps membership values comparable across cohorts. It is also a major advantage when your sample does not have enough subjects to derive clusters: you can instead rely on centroids precomputed from large databases to extract membership values for your new participants.
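To make "extracting membership values from precomputed centroids" concrete, here is a minimal sketch of the textbook fuzzy c-means membership formula applied to fixed centroids. It is not NeuroStatX's implementation: the function name fuzzy_membership, the Euclidean distance, the fuzzifier m = 2, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np


def fuzzy_membership(X, centroids, m=2.0):
    """Textbook fuzzy c-means membership of each row of X to each centroid.

    Hypothetical helper for illustration; m is the fuzzifier (m > 1).
    """
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise Euclidean distances, shape (n_subjects, n_centroids).
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Avoid division by zero when a subject sits exactly on a centroid.
    dist = np.fmax(dist, np.finfo(float).eps)
    # Membership is proportional to d^(-2 / (m - 1)); rows are normalised to sum to 1.
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


# Toy example: two centroids on a line and two made-up subjects.
centroids = np.array([[0.0, 0.0], [2.0, 0.0]])
X_new = np.array([[1.0, 0.0], [0.5, 0.0]])
U = fuzzy_membership(X_new, centroids)
print(U.round(3))  # rows are approximately [0.5, 0.5] and [0.9, 0.1]
```

No centroid is updated here, which is why predicting is cheap: each new participant only needs its distances to the existing centroids.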

Requirements for projecting new data onto existing centroids

While the actual prediction is trivial, some requirements must be met to obtain sound results.

  1. If using the centroids from Gagnon A. et al. (2025), your data needs to include the features [Internalizing, Externalizing, Stress, VA, EFPS, MEM] in this specific order.
  2. Each feature in your data needs to be scaled to a mean of 0 and a standard deviation of 1. This step can be performed using scikit-learn (sklearn).
  3. Optionally, if you have access to ABCD data, harmonizing your cognitive and behavioral scores might improve your results.
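The scaling requirement can be sketched with scikit-learn's StandardScaler. This is a minimal example, not part of NeuroStatX: the participant IDs and scores below are made up, and the ids column name is an assumption. Only the six feature names and their order come from the requirements above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Feature names in the order required by the Gagnon A. et al. (2025) centroids.
FEATURES = ["Internalizing", "Externalizing", "Stress", "VA", "EFPS", "MEM"]

# Made-up scores for four hypothetical participants (illustration only).
raw = pd.DataFrame(
    {
        "ids": ["sub-01", "sub-02", "sub-03", "sub-04"],
        "Internalizing": [48.0, 55.0, 62.0, 51.0],
        "Externalizing": [50.0, 47.0, 66.0, 53.0],
        "Stress": [12.0, 18.0, 25.0, 15.0],
        "VA": [101.0, 95.0, 88.0, 110.0],
        "EFPS": [98.0, 104.0, 91.0, 107.0],
        "MEM": [105.0, 99.0, 93.0, 111.0],
    }
)

scaled = raw.copy()
# Select the features in the required order, then z-score each column.
scaled[FEATURES] = StandardScaler().fit_transform(raw[FEATURES])

# Each feature now has mean 0 and (population) standard deviation 1.
print(scaled[FEATURES].mean().round(6))
print(scaled[FEATURES].std(ddof=0).round(6))
```

StandardScaler divides by the population standard deviation (ddof=0), so pandas' default std() (ddof=1) will report values slightly above 1 on small samples.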

Viewing test data

NeuroStatX provides a command-line interface (CLI) tool to predict fuzzy membership values. It also ships the centroids of the cognitive and behavioral profiles extracted in Gagnon A. et al. (2025). Let’s go through an example with the data/example.csv file. Before using the CLI script, let’s load the file in the Python console and look at its structure.

from neurostatx.io.loader import DatasetLoader
# Load example.csv from the /data folder.
df = DatasetLoader().load_data("data/example.csv")
df.get_data().head(5)
print(df.get_metadata())

As you can see, we have 50 subjects with 9 columns. The first three are descriptive variables representing the ID, sex, and age. The next six are our features of interest. Note that the Int and Ext columns do not match the predefined column names; as long as the features respect the required order, this should not be a problem for now. Let’s try this out and see what comes out.
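If you prefer to make the column names explicit before predicting, a small pandas check along these lines can help. This is only a sketch: the toy values are made up, and apart from ids, Int, and Ext, the column names below are assumptions about what such a file might contain.

```python
import pandas as pd

# Feature order expected by the Gagnon A. et al. (2025) centroids.
EXPECTED = ["Internalizing", "Externalizing", "Stress", "VA", "EFPS", "MEM"]
N_DESC = 3  # number of leading descriptive columns (ID, sex, age)

# Toy frame mimicking the layout described above (values are made up).
df = pd.DataFrame(
    {
        "ids": ["sub-01", "sub-02"],
        "sex": ["F", "M"],
        "age": [10, 11],
        "Int": [0.3, -0.3],
        "Ext": [-1.1, 1.1],
        "Stress": [0.5, -0.5],
        "VA": [0.2, -0.2],
        "EFPS": [-0.7, 0.7],
        "MEM": [0.9, -0.9],
    }
)

# Rename the abbreviated columns, leaving everything else untouched.
renamed = df.rename(columns={"Int": "Internalizing", "Ext": "Externalizing"})

# Verify that the feature columns follow the required order.
feature_cols = list(renamed.columns[N_DESC:])
if feature_cols != EXPECTED:
    raise ValueError(f"Unexpected feature order: {feature_cols}")
print(feature_cols)
```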

Using the CLI script

To test this out, we will use the PredictFuzzyMembership CLI script. It already wraps all commands required to predict membership values for new participants. First, let’s call the help to see which inputs are required.

PredictFuzzyMembership --help

We can see from the script’s help that only the following inputs are required:

  1. Input dataset.
  2. Centroids.
  3. Name of the column containing subject IDs.
  4. Number of descriptive columns to ignore at the beginning of the dataset.

However, feature reduction was performed prior to clustering in Gagnon A. et al. (2025). NeuroStatX also provides the loadings for this PCA model, which we need to supply at runtime so that the same feature reduction is applied to our new subjects.
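The idea of reusing a fitted PCA on new subjects can be illustrated with scikit-learn. This is a generic sketch under stated assumptions: the random data stand in for real cohorts, the choice of 3 components is arbitrary, and nothing here describes the actual contents or format of the PCA file shipped with NeuroStatX.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 6))   # stand-in for the training cohort
new_subjects = rng.normal(size=(5, 6))  # stand-in for the new cohort

# Fit the PCA once, on the reference cohort only, then serialise it.
pca = PCA(n_components=3).fit(reference)
blob = pickle.dumps(pca)  # in practice this would be written to a .pkl file

# Later: reload the fitted model and apply it to new subjects.
# Use transform(), not fit_transform(), so the training loadings are reused.
loaded = pickle.loads(blob)
projected = loaded.transform(new_subjects)

# Equivalent manual projection: centre with the training mean, then apply the loadings.
manual = (new_subjects - loaded.mean_) @ loaded.components_.T
print(projected.shape)
```

Refitting the PCA on the new cohort would produce different components, and the resulting scores would no longer live in the space where the centroids were defined.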

Now, let’s test this out for real with the data/example.csv dataset (keep in mind that the column names do not match the required naming).

PredictFuzzyMembership \
--in-dataset data/example.csv \
--in-cntr data/GagnonA_2025_centroids.xlsx \
--id-column ids \
--desc-columns 3 \
--pca \
--pca-model data/GagnonA_2025_pca.pkl \
--out-folder testPredictFuzzy/ \
-v -s -f

Your new subjects have been successfully projected in the profile space of Gagnon A. et al. (2025)! Let’s inspect the results. You can find them in the testPredictFuzzy/ folder.

Radar plot

One interesting way to visualize clustering results is with a radar plot. These plots are generated automatically when you predict new data. Here is the one from our earlier prediction.

[Figure: radar plot of the predicted profiles]

We can see they reproduce the findings from the original paper! That is exactly what we want. Let’s now view the results using a graph network.

Constructing a graph network

To do this, let’s head back into the Python console and inspect our new membership values.

from neurostatx.io.loader import DatasetLoader
# Load the output from the previous script.
df = DatasetLoader().load_data("testPredictFuzzy/predicted_membership_matrix.xlsx")
df.get_data().head(5)

We can see that the membership values for our four clusters have been appended to the original dataframe! We can use those membership values to build our graph network. Let’s leverage additional NeuroStatX functions to do so. Most of these functions are explained in the Introduction to NeuroStatX section.

from neurostatx.io.loader import GraphLoader
from neurostatx.network.utils import get_nodes_and_edges

# Drop the descriptive columns and original scores (column indices 1 to 8).
df.drop_columns(list(range(1, 9)))

# Construct pairs of subject-centroid nodes.
pairs, _, _ = df.custom_function(get_nodes_and_edges)

# Build the graph, using the membership values as edge attributes.
G = GraphLoader().build_graph(
    pairs,
    "node1",
    "node2",
    edge_attr="membership"
)

# Compute the layout, weighted by membership.
G.layout(weight="membership")

# View the graph.
G.visualize(
    "predictedsubjects.png",
    weight="membership",
    edge_width_multiplier=1
)

[Figure: graph network of the predicted subjects]

There you go! Naturally, the layout differs from the one presented in Gagnon A. et al. (2025) since we have far fewer subjects. You are now ready to use those membership values in your statistical analysis!

Performing fuzzy clustering from scratch