LogisticRegression
LogisticRegression(in_graph, out_folder, attributes, covariates=None, weight='membership', splits=10, test_size=0.2, cs=10, max_iter=1000, penalty=Penalty.l2, solver=Solver.lbfgs, permutations=1000, scoring=ScoringMethod.r2, processes=1, plot_distributions=False, verbose=False, save_parameters=False, overwrite=False)
Logistic Regression Analysis
This script performs a Logistic Regression on a graph, using the edges’ weights as predictors and the nodes’ attributes as the response variable. The script cross-validates the model within a training dataset before testing it on held-out test data. It then runs permutation testing to determine whether the model is statistically significant. Finally, it outputs the coefficients and statistics, as well as plots of the coefficients and of the distributions of the attributes and edges’ weights.
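The workflow described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the script's actual implementation: the use of `LogisticRegressionCV`, the AUC score, and the choice to permute the training labels are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors (edge weights) and a binary node attribute.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

# Hold out 20% of the samples for the final test (mirrors test_size=0.2).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Cross-validate the regularization strength inside the training set
# (mirrors cs=10, splits=10, max_iter=1000; the default penalty is L2).
model = LogisticRegressionCV(Cs=10, cv=10, solver="lbfgs", max_iter=1000, scoring="roc_auc")
model.fit(X_train, y_train)
test_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Permutation testing (assumed scheme): refit on shuffled training labels
# and score each null model on the same test set.
best_c = float(np.ravel(model.C_)[0])
rng = np.random.default_rng(0)
n_permutations = 20  # kept small here; the script defaults to 1000
permuted_scores = np.empty(n_permutations)
for i in range(n_permutations):
    y_perm = rng.permutation(y_train)
    null_model = LogisticRegression(C=best_c, solver="lbfgs", max_iter=1000)
    null_model.fit(X_train, y_perm)
    permuted_scores[i] = roc_auc_score(y_test, null_model.predict_proba(X_test)[:, 1])
```

The array `permuted_scores` is the null distribution against which the test score is compared.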
Preprocessing
The script scales the data to zero mean and unit variance and applies a log transformation to the edges’ weights (for now, it assumes that the weights are membership values resulting from a fuzzy clustering analysis).
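A minimal NumPy sketch of this preprocessing follows. The order of operations (log first, then standardization), the use of the natural log, and the function name `preprocess_weights` are assumptions for illustration; the source does not specify them.

```python
import numpy as np

def preprocess_weights(weights):
    """Log-transform membership weights, then standardize each column."""
    weights = np.asarray(weights, dtype=float)
    # Assumption: memberships are strictly positive, so the natural log is defined.
    logged = np.log(weights)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = np.where(std == 0, 1.0, std)
    return (logged - mean) / std

# Hypothetical membership matrix: rows are nodes, columns are clusters.
memberships = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.25, 0.25, 0.50],
    [0.05, 0.15, 0.80],
])
scaled = preprocess_weights(memberships)
```

After this step, each column of `scaled` has zero mean and unit variance.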
Nodes’ Attributes
The script takes only one graph file as input. The graph file must be in .gml format. The script will then fetch the attributes from the graph file and will perform the analysis on the attributes and edges’ weights. At least one attribute needs to be provided in order to fit a model. To set attributes to the nodes in the graph file, please see AddNodesAttributes.
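To illustrate what such a graph file might contain, the sketch below writes and reads a small .gml file with networkx. The layout (subject nodes linked to cluster nodes), the attribute name `diagnosis`, and the node names are hypothetical; only the .gml format and the `membership` edge weight come from this page.

```python
import os
import tempfile

import networkx as nx

# Hypothetical graph: subject nodes carry an attribute, and their edges
# to cluster nodes carry a "membership" weight.
G = nx.Graph()
G.add_node("s1", diagnosis=1)
G.add_node("s2", diagnosis=0)
G.add_node("c1")
G.add_node("c2")
G.add_edge("s1", "c1", membership=0.8)
G.add_edge("s1", "c2", membership=0.2)
G.add_edge("s2", "c1", membership=0.3)
G.add_edge("s2", "c2", membership=0.7)

# Round-trip through a .gml file, as the script expects a .gml input.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.gml")
    nx.write_gml(G, path)
    loaded = nx.read_gml(path)

# Response variable: the attribute value of every node that has one.
attributes = nx.get_node_attributes(loaded, "diagnosis")

# Predictors: the membership weights on each of those nodes' edges.
weights = {
    node: {nbr: loaded[node][nbr]["membership"] for nbr in loaded[node]}
    for node in attributes
}
```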
Scoring Options
The script uses permutation testing to determine whether the model is statistically significant. By default, the p-value is computed from the area under the curve (AUC); other scores can be selected instead. The available scores are listed in [1]. The single-tailed p-value is computed as:
p-value = ∑(score_perm >= score) / (nb_permutations)
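The equation above counts the permutations whose score is at least as high as the observed score, then divides by the number of permutations. A small NumPy sketch, with a hypothetical function name and made-up scores:

```python
import numpy as np

def permutation_p_value(score, permuted_scores):
    """Single-tailed p-value: fraction of permuted scores >= the observed score."""
    permuted_scores = np.asarray(permuted_scores, dtype=float)
    return float(np.sum(permuted_scores >= score) / permuted_scores.size)

# Made-up values for illustration only.
observed = 0.80
permuted = [0.50, 0.55, 0.81, 0.62, 0.48, 0.85, 0.58, 0.51, 0.49, 0.60]
p = permutation_p_value(observed, permuted)  # 2 of the 10 permuted scores reach 0.80
```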
Coefficient Significance
The script also computes a p-value for each coefficient using permutation testing, by comparing the coefficients of the fitted model with the coefficients obtained from the permuted models. The two-tailed p-value is computed as:
p-value = ∑(abs(coef_perm) >= abs(coef)) / (nb_permutations)
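The same counting applies per coefficient, using absolute values so that large effects in either direction count as extreme. A NumPy sketch with a hypothetical function name and made-up coefficients:

```python
import numpy as np

def coefficient_p_values(coef, permuted_coefs):
    """Two-tailed p-value per coefficient from permuted-model coefficients."""
    coef = np.asarray(coef, dtype=float)                      # shape (n_features,)
    permuted_coefs = np.asarray(permuted_coefs, dtype=float)  # shape (n_permutations, n_features)
    # For each feature, count permutations at least as extreme in magnitude.
    exceed = np.abs(permuted_coefs) >= np.abs(coef)
    return exceed.sum(axis=0) / permuted_coefs.shape[0]

# Made-up values for illustration only: 2 features, 4 permutations.
coef = [1.2, -0.1]
permuted_coefs = [
    [0.3, -0.4],
    [-1.5, 0.05],
    [0.2, 0.2],
    [-0.6, -0.3],
]
p_values = coefficient_p_values(coef, permuted_coefs)
```

Here only one permuted coefficient is as large in magnitude as 1.2, while three are as large as 0.1, so the first coefficient receives the smaller p-value.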
References
[1] https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
Example Usage
Parameters
in_graph : Graph file containing the data for the model.
out_folder : Output folder.
attributes : Attribute names to include in the model. Must be present in the graph file. At least one attribute is required.
covariates : Covariates to include in the model. Must be present in the graph file.
weight : Edge weight to use for the model.
splits : Number of splits to use for the cross-validation.
test_size : Size of the testing set. Must be between 0 and 1.
cs : Inverse of regularization strength. Smaller values specify stronger regularization.
max_iter : Maximum number of iterations for the solver.
penalty : Regularization penalty to use for the LogisticRegression model.
solver : Solver to use for the LogisticRegression model.
permutations : Number of permutations to use for the permutation testing.
scoring : Scoring method to use for the permutation testing.
processes : Number of processes to use for the cross-validation.
verbose : If true, produce verbose output.
plot_distributions : If true, will plot the distributions of the edges’ weights.
save_parameters : If true, will save input parameters to .txt file.
overwrite : If true, force overwriting of existing output files.