utils.preprocessing

binary_to_yes_no(df, cols)

Function to change binary answers (1/0) to Yes or No in specific columns of a pandas dataframe. Validate that Yes and No are assigned to the correct values; the default behavior is Yes = 1 and No = 0.

Parameters

df : Pandas dataframe object.

cols : List of column names.

Returns

df : Pandas dataframe object with changed binary answers.
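The implementation of binary_to_yes_no is not shown in this reference. The following is a minimal sketch of the described behavior using plain pandas; the helper name binary_to_yes_no_sketch, the example data, and the choice to work on a copy are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical survey data; the column names are illustrative only.
df = pd.DataFrame({"smoker": [1, 0, 1], "age": [34, 51, 27]})


def binary_to_yes_no_sketch(df, cols):
    """Illustrative stand-in, not the package's implementation."""
    out = df.copy()
    for col in cols:
        # Map 1 -> "Yes" and 0 -> "No"; any other value becomes NaN.
        out[col] = out[col].map({1: "Yes", 0: "No"})
    return out


converted = binary_to_yes_no_sketch(df, ["smoker"])

# Cross-tabulate original against converted values to validate the mapping,
# as the description recommends.
check = pd.crosstab(df["smoker"], converted["smoker"])
print(check)
```

The cross-tabulation makes a swapped assignment easy to spot: every 1 should fall under Yes and every 0 under No.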

compute_correlation_coefficient(df, out_folder, context='poster', font_scale=0.2, cmap=None, annot=False)

Function to compute a correlation matrix for all variables in a dataframe.

Parameters

df : Pandas dataframe.

out_folder : Path to the output folder.

context : Style to apply to the plots. Defaults to ‘poster’.

font_scale : Font scale. Defaults to 0.2.

cmap : Colormap to use in the heatmap. Defaults to None.

annot : Flag to write correlation values inside the heatmap squares. Defaults to False.

Returns

corr_mat : Correlation matrix with Pearson correlation coefficients.
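As a rough illustration of the returned value, a Pearson correlation matrix can be obtained directly from pandas. This sketch assumes nothing about how compute_correlation_coefficient works internally and leaves out the heatmap written to out_folder; the example data are hypothetical.

```python
import pandas as pd

# Hypothetical data: y is perfectly correlated with x, z perfectly anti-correlated.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],
        "z": [4.0, 3.0, 2.0, 1.0],
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr_mat = df.corr(method="pearson")
print(corr_mat)
```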

compute_pca(X, n_components)

Function to compute a PCA decomposition on a dataset.

Parameters

X : Dataframe to compute PCA on.

n_components : Number of components.

Returns

X : Transformed data array.

pca : PCA model.

exp_var : Explained variance.

components : Components.

p_value : Bartlett’s p-value.

kmo_model : Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy.
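To make the transformed data, components, and explained variance concrete, here is a NumPy-only sketch of a PCA decomposition through the singular value decomposition. It is not the package's implementation: the backend used by compute_pca is not stated here, the Bartlett and KMO outputs are not reproduced, and all variable names and data are illustrative.

```python
import numpy as np

# Hypothetical dataset: 50 observations of 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
n_components = 2

# Center each variable, then decompose with SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal axes, ordered by decreasing variance.
components = Vt[:n_components]

# Project the centered data onto the retained components.
X_t = Xc @ components.T

# Variance carried by each component, and the share of total variance retained.
exp_var = (S ** 2) / (X.shape[0] - 1)
exp_var_ratio = exp_var[:n_components] / exp_var.sum()
```

Whether exp_var in the package refers to the absolute variance or to the ratio is not specified above, so both are computed here.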

compute_shapiro_wilk_test(df)

Function to assess normality with the Shapiro-Wilk test, returning the W statistics and their p-values.

Parameters

df : Pandas dataframe.

Returns

wilk : Shapiro-Wilk values (W).

pvalues : Associated p-values.
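A minimal sketch of the test itself, using scipy.stats.shapiro. Applying the test column by column and storing the results in dictionaries are assumptions; the reference does not say how compute_shapiro_wilk_test iterates over the dataframe or what container it returns.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Hypothetical data: one roughly normal column, one clearly skewed column.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "normal": rng.normal(size=200),
        "skewed": rng.exponential(size=200),
    }
)

wilk, pvalues = {}, {}
for col in df.columns:
    # shapiro returns the W statistic and the p-value for one sample.
    w, p = shapiro(df[col].dropna())
    wilk[col], pvalues[col] = w, p

print(wilk)
print(pvalues)
```

A small p-value indicates that the column departs from a normal distribution.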

get_column_indices(df, column_names)

Function to extract column indices from a list of column names.

Parameters

df : Pandas dataframe object.

column_names : List of column names as strings.

Returns

indices : List of column indices.
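The same lookup can be sketched with pandas' Index.get_loc. This shows the idea rather than the package's code, and assumes unique column names; the dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical empty dataframe; only the column order matters here.
df = pd.DataFrame(columns=["id", "age", "sex", "score"])
column_names = ["age", "score"]

# get_loc returns the integer position of a label when column names are unique.
indices = [df.columns.get_loc(name) for name in column_names]
print(indices)
```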

merge_dataframes(dict_df, index, repeated_columns=False)

Function to merge a variable number of dataframes by matching the values of a specific column (hereafter called the index). Index values must appear only once in each dataframe for the function to work.

Parameters

dict_df : Dictionary of pandas dataframes.

index : String of the name of the column to use as index (needs to be the same across all dataframes).

repeated_columns : Flag to set if column names are repeated across the dataframes to merge. Defaults to False.

Returns

out : Merged pandas dataframe.
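One way to picture the merge is a chain of pandas.merge calls on the shared index column. This sketch is an assumption about the general approach, not the package's code: the outer join, the uniqueness check, and the example frames are illustrative, and the repeated_columns handling is not reproduced.

```python
from functools import reduce

import pandas as pd

# Hypothetical input: two dataframes sharing a "subject" column.
dict_df = {
    "demographics": pd.DataFrame({"subject": [1, 2, 3], "age": [30, 41, 25]}),
    "cognition": pd.DataFrame({"subject": [1, 2, 3], "score": [88, 92, 79]}),
}
index = "subject"

# Index values must be unique within each dataframe.
for name, frame in dict_df.items():
    if not frame[index].is_unique:
        raise ValueError(f"Duplicate '{index}' values in '{name}'")

# Merge the dataframes pairwise, left to right, on the shared column.
out = reduce(
    lambda left, right: pd.merge(left, right, on=index, how="outer"),
    dict_df.values(),
)
print(out)
```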

plot_distributions(df, out_folder, context='poster', font_scale=1)

Function to plot the distribution of each variable in a dataframe.

Parameters

df : Pandas dataframe.

out_folder : Path to the output folder.

context : Style to apply to the plots. Defaults to ‘poster’.

font_scale : Font scale. Defaults to 1.

remove_nans(df)

Clean up dataset by removing all rows containing NaNs.

Parameters

df : Pandas dataframe.

Returns

rows_with_nans : Dataframe containing rows with NaNs.

complete_rows : Cleaned dataframe.
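The split into the two returned dataframes can be sketched with a boolean row mask. This is an illustration of the described behavior on hypothetical data, not the package's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in rows 1 and 2.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [4.0, 5.0, np.nan, 6.0],
    }
)

# True for every row that contains at least one NaN.
has_nan = df.isna().any(axis=1)

rows_with_nans = df[has_nan]
complete_rows = df[~has_nan]
```

Keeping the discarded rows alongside the cleaned data makes it possible to inspect what was removed.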

rename_columns(df, old_names, new_names)

Function to rename specific columns given matching lists of old and new column names.

Parameters

df : Pandas dataframe object.

old_names : List of old column names as strings.

new_names : List of new column names as strings.

Returns

df : Pandas dataframe object containing the renamed columns.
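A minimal sketch of the renaming step with DataFrame.rename, assuming the two lists are paired by position. The data and names are hypothetical, and the package's own implementation may differ.

```python
import pandas as pd

# Hypothetical dataframe with questionnaire-style column names.
df = pd.DataFrame({"Q1": [1], "Q2": [2], "id": [3]})
old_names = ["Q1", "Q2"]
new_names = ["age", "score"]

# Pair old and new names by position; columns not listed are left untouched.
renamed = df.rename(columns=dict(zip(old_names, new_names)))
print(list(renamed.columns))
```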