utils

KNNimputation(ref_df, df, n_neighbors=5, weights='distance', metric='nan_euclidean', keep_all_features=True)

Impute missing data in a dataset based on relationships learned from a reference dataset.

This function uses scikit-learn's KNNImputer to impute missing values in a dataset. The imputation is based on the relationship learned from a reference dataset: the reference dataset is used to calculate the distance between samples, and each missing value is imputed from the n_neighbors closest samples. This is useful for completing data from a different population so that the two populations can be compared later on.

Note: The reference dataset should not contain any missing values. The reference dataset and the dataset to impute values in should contain the same columns.

Parameters

ref_df : Reference dataset to learn the features’ relationship from.

df : Dataset to impute.

n_neighbors : Number of neighbors to use. Defaults to 5.

weights : Weight function to use. Possible values:

  • uniform: Uniform weights. All neighbors have equal importance.
  • distance: Weight neighbors by the inverse of their distance, so closer neighbors have more influence. This is the default.

metric : Distance metric for searching neighbors. Defaults to nan_euclidean.

keep_all_features : If True, even columns containing only NaNs will be imputed. Defaults to True.

Returns

pd.DataFrame : Imputed dataset.
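
The fit-on-reference, transform-on-target pattern described above can be sketched with scikit-learn directly. This is a minimal illustration under stated assumptions, not the implementation of KNNimputation: the helper name knn_impute_from_reference and the toy data are hypothetical, and the keep_all_features option is not covered.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute_from_reference(ref_df, df, n_neighbors=5, weights="distance",
                              metric="nan_euclidean"):
    """Hypothetical helper: fit a KNNImputer on ref_df, then fill NaNs in df."""
    # Both frames are expected to share the same columns, in the same order.
    if list(ref_df.columns) != list(df.columns):
        raise ValueError("ref_df and df must contain the same columns")

    imputer = KNNImputer(n_neighbors=n_neighbors, weights=weights, metric=metric)
    imputer.fit(ref_df)              # neighbors are drawn from the reference data
    imputed = imputer.transform(df)  # returns a NumPy array

    # Restore the column names and index of the dataset that was imputed.
    return pd.DataFrame(imputed, columns=df.columns, index=df.index)


# Toy data (made up for illustration): the reference set is complete,
# the dataset to impute has one missing value in column "b".
ref_df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                       "b": [10.0, 20.0, 30.0, 40.0]})
df = pd.DataFrame({"a": [2.0, 3.0], "b": [np.nan, 30.0]})

result = knn_impute_from_reference(ref_df, df, n_neighbors=1)
print(result)
```

With n_neighbors=1, the missing value in the first row is taken from the single reference sample whose "a" value is closest; values that were already present are left unchanged.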

apply_various_models(df, mod)

Function to apply various models to a dataset.

Parameters

df : Dataframe to use.

mod : Model to use.

Returns

y : Predicted values.