preprocessing
binary_to_yes_no(df, cols)
Function to change binary answers (1/0) to Yes or No in specific columns of a pandas dataframe. Please validate that Yes and No are assigned to the correct values; the default behavior is Yes = 1 and No = 0.
Parameters
df : Pandas dataframe object.
cols : List of column names.
Returns
df : Pandas dataframe object with changed binary answers.
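The mapping described above can be reproduced and checked with plain pandas. This is a minimal sketch, not the function's actual implementation: the column names and data are hypothetical, and the exact output strings "Yes" and "No" are assumed from the description.

```python
import pandas as pd

# Hypothetical survey data; column names are illustrative only.
df = pd.DataFrame({"smoker": [1, 0, 1], "diabetic": [0, 0, 1], "age": [34, 51, 47]})
cols = ["smoker", "diabetic"]

# Assumed default mapping: 1 -> "Yes", 0 -> "No".
mapping = {1: "Yes", 0: "No"}

converted = df.copy()
for col in cols:
    # Values outside the mapping become NaN, which makes coding errors visible.
    converted[col] = converted[col].map(mapping)

# Validate the assignment by cross-tabulating original and converted values.
check = pd.crosstab(df["smoker"], converted["smoker"])
print(check)
```

Cross-tabulating the original against the converted column is one quick way to confirm that Yes and No landed on the intended values.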
compute_correlation_coefficient(df, out_folder, context='poster', font_scale=0.2, cmap=None, annot=False)
Function to compute a correlation matrix for all variables in a dataframe.
Parameters
df : Pandas dataframe.
out_folder : Path to the output folder.
context : Style to apply to the plots. Defaults to 'poster'.
font_scale : Font scale. Defaults to 0.2.
cmap : Cmap to use in the heatmap. Defaults to None.
annot : Flag to write correlation values inside the heatmap squares. Defaults to False.
Returns
corr_mat : Correlation matrix with Pearson correlation coefficients.
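For orientation, a Pearson correlation matrix of the kind returned here can be obtained directly from pandas. The sketch below only illustrates the shape and meaning of such a matrix on hypothetical data; it does not reproduce the heatmap output and is not necessarily how the function computes its result.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls with x.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
})

# Square matrix with one row and one column per variable.
corr_mat = df.corr(method="pearson")
print(corr_mat.round(2))
```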
compute_pca(X, n_components)
Function to compute a PCA decomposition on a dataset.
Parameters
X : Data array.
n_components : Number of components.
Returns
X : Transformed data array.
pca : PCA model.
exp_var : Explained variance.
components : Components.
p_value : Bartlett’s p-value.
kmo_model : KMO model.
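To make the first four return values concrete, here is a rough sketch of a PCA decomposition using NumPy's SVD on random data. It is a stand-in technique, not the function's own backend; the variable names only mirror the documented return values, Bartlett's test and the KMO model are not covered, and whether `exp_var` is reported as a variance or as a ratio is an assumption.

```python
import numpy as np

# Hypothetical data: 50 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
n_components = 2

# Center the data, then decompose it with a thin SVD.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Principal axes, one row per component (n_components x n_features).
components = Vt[:n_components]

# Project the centered data onto the retained components.
X_transformed = X_centered @ components.T

# Variance carried by each retained component, and its share of the total.
exp_var = (S ** 2 / (X.shape[0] - 1))[:n_components]
exp_var_ratio = S[:n_components] ** 2 / np.sum(S ** 2)
```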
compute_shapiro_wilk_test(df)
Function computing the Shapiro-Wilk test for normality and outputting the W statistics and p-values.
Parameters
df : Pandas dataframe.
Returns
wilk : Shapiro-Wilk values (W).
pvalues : Associated p-values.
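As an illustration of what the W and p-values represent, the test can be run per column with `scipy.stats.shapiro`. This is a sketch under the assumption that the test is applied to each column separately; the data and the dictionary outputs are hypothetical and may not match the function's return types.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Hypothetical data: one roughly normal column and one clearly skewed column.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "normal": rng.normal(size=200),
    "skewed": rng.exponential(size=200),
})

wilk, pvalues = {}, {}
for col in df.columns:
    # shapiro returns the W statistic and its p-value; NaNs are dropped first.
    stat, p = shapiro(df[col].dropna())
    wilk[col], pvalues[col] = stat, p

print(wilk)
print(pvalues)
```

A small p-value suggests the column departs from normality; W close to 1 is consistent with a normal distribution.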
get_column_indices(df, column_names)
Function to extract column indices based on a list of column names.
Parameters
df : Pandas dataframe object.
column_names : List of column names as strings.
Returns
indices : List of column indices.
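The same lookup can be sketched with `Index.get_loc` from pandas. The dataframe below is hypothetical, and the sketch assumes column names are unique.

```python
import pandas as pd

# Hypothetical dataframe with three uniquely named columns.
df = pd.DataFrame({"id": [1, 2], "age": [30, 40], "score": [0.5, 0.7]})
column_names = ["score", "id"]

# Positional index of each requested column, in the order requested.
indices = [df.columns.get_loc(name) for name in column_names]
print(indices)  # [2, 0]
```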
merge_dataframes(dict_df, index, repeated_columns=False)
Function to merge a variable number of dataframes by matching the values of a specific column (referred to here as the index). Each index value must appear only once within a dataframe for the function to work.
Parameters
dict_df : Dictionary of pandas dataframes.
index : String of the name of the column to use as index (needs to be the same across all dataframes).
repeated_columns : Flag to use if column names are repeated across the dataframes to merge. Defaults to False.
Returns
out : Joint large pandas dataframe.
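The merging behavior can be approximated with `functools.reduce` and `pandas.merge`. This is a sketch only: the dataframes and the `subject_id` column are hypothetical, the outer join is an assumption, and the handling of repeated column names is not shown.

```python
from functools import reduce

import pandas as pd

# Hypothetical inputs sharing a "subject_id" column.
dict_df = {
    "demographics": pd.DataFrame({"subject_id": [1, 2, 3], "age": [30, 41, 27]}),
    "scores": pd.DataFrame({"subject_id": [1, 2, 3], "score": [0.8, 0.6, 0.9]}),
}
index = "subject_id"

# Each index value must be unique within a dataframe before merging.
for name, frame in dict_df.items():
    if not frame[index].is_unique:
        raise ValueError(f"{name}: duplicated values in column '{index}'")

# Merge the dataframes pairwise on the shared column (outer join assumed).
out = reduce(
    lambda left, right: pd.merge(left, right, on=index, how="outer"),
    dict_df.values(),
)
print(out)
```

Checking uniqueness up front gives a clear error instead of silently duplicated rows after the merge.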
plot_distributions(df, out_folder, context='poster', font_scale=1)
Function to generate distribution plots for every variable in a dataframe.
Parameters
df : Pandas dataframe.
out_folder : Path to the output folder.
context : Style to apply to the plots. Defaults to 'poster'.
font_scale : Font scale. Defaults to 1.
remove_nans(df)
Clean up dataset by removing all rows containing NaNs.
Parameters
df : Pandas dataframe.
Returns
rows_with_nans : Dataframe containing rows with NaNs.
complete_rows : Cleaned dataframe.
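The split into incomplete and complete rows can be sketched with a boolean mask in pandas. The data is hypothetical, and the variable names simply mirror the documented return values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 each contain one NaN.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# True for every row that has at least one NaN.
has_nan = df.isna().any(axis=1)

rows_with_nans = df[has_nan]
complete_rows = df[~has_nan]
print(complete_rows)
```

Keeping the rows with NaNs alongside the cleaned dataframe makes it easy to report how many observations were excluded.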
rename_columns(df, old_names, new_names)
Function renaming specific columns according to a list of new and old column names.
Parameters
df : Pandas dataframe object.
old_names : List of old column names as strings.
new_names : List of new column names as strings.
Returns
df : Pandas dataframe object containing the renamed columns.
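An equivalent renaming can be sketched with `DataFrame.rename`, pairing the two lists by position. The column names are hypothetical, and the sketch assumes both lists have the same length and order.

```python
import pandas as pd

# Hypothetical dataframe with terse column names.
df = pd.DataFrame({"q1": [1, 0], "q2": [0, 1], "id": [10, 11]})
old_names = ["q1", "q2"]
new_names = ["smoker", "diabetic"]

# Pair old and new names by position; columns not listed are left unchanged.
renamed = df.rename(columns=dict(zip(old_names, new_names)))
print(list(renamed.columns))  # ['smoker', 'diabetic', 'id']
```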