utils
KNNimputation(ref_df, df, n_neighbors=5, weights='distance', metric='nan_euclidean', keep_all_features=True)
Impute missing values in a dataset using relationships learned from a reference dataset.
This function uses the KNNImputer from the sklearn library. The imputer learns from the reference dataset: distances are computed between each sample to impute and the reference samples, and every missing value is filled in from the n_neighbors closest reference samples. This is useful for completing data from a different population so that the two populations can be compared later on.
** Note: The reference dataset should not contain any missing values. The reference dataset and the dataset to impute values in should contain the same columns. **
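The neighbor-based filling described above can be illustrated with a minimal sketch that calls sklearn's KNNImputer directly. The column names and values are made up for illustration, and this is not the code of KNNimputation itself; it only shows how the choice of weights changes the imputed value when the neighbors come from a complete reference dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up reference data with no missing values.
ref = pd.DataFrame({"a": [0.0, 1.0, 4.0], "b": [0.0, 10.0, 40.0]})
# One sample to impute: its "b" value is missing.
target = pd.DataFrame({"a": [1.5], "b": [np.nan]})

results = {}
for weights in ("uniform", "distance"):
    imputer = KNNImputer(n_neighbors=2, weights=weights, metric="nan_euclidean")
    imputer.fit(ref)  # neighbors are drawn from the reference rows
    # transform returns a NumPy array; column 1 is "b".
    results[weights] = float(imputer.transform(target)[0, 1])

print(results)
```

With these values the two closest reference samples are the ones with a = 1.0 and a = 0.0, at distances in a 1:3 ratio. Uniform weighting averages their b values to 5.0, while inverse-distance weighting favors the closer sample and gives 7.5.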
Parameters
ref_df : Reference dataset to learn the features’ relationship from.
df : Dataset to impute.
n_neighbors : Number of neighbors to use. Defaults to 5.
weights : Weight function to use. Defaults to distance. Possible values:
- uniform: Uniform weights. All neighbors have equal importance.
- distance: Weight neighbors by the inverse of their distance, so closer neighbors have more influence.
metric : Distance metric for searching neighbors. Defaults to nan_euclidean.
keep_all_features : If True, even columns containing only NaNs will be imputed. Defaults to True.
Returns
pd.DataFrame : Imputed dataset.
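As a rough sketch of the behavior documented above, the following shows one way such a function could be written on top of sklearn's KNNImputer. The name knn_impute_sketch and the sample data are hypothetical, and the actual implementation of KNNimputation may differ. The keep_all_features option is left out because how it maps onto KNNImputer is an assumption (it may correspond to keep_empty_features in recent sklearn versions).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute_sketch(ref_df, df, n_neighbors=5, weights="distance",
                      metric="nan_euclidean"):
    """Hypothetical stand-in for KNNimputation, not the library's code."""
    # Both datasets must share the same columns; align their order.
    df = df[ref_df.columns]
    imputer = KNNImputer(n_neighbors=n_neighbors, weights=weights, metric=metric)
    imputer.fit(ref_df)              # learn from the complete reference data
    imputed = imputer.transform(df)  # fill gaps in the target dataset
    return pd.DataFrame(imputed, columns=ref_df.columns, index=df.index)


# Made-up data: a complete reference cohort and a target cohort with a gap.
ref_df = pd.DataFrame({
    "age": [30.0, 40.0, 50.0, 60.0],
    "bmi": [22.0, 24.0, 27.0, 29.0],
})
df = pd.DataFrame({
    "age": [35.0, 55.0],
    "bmi": [np.nan, 28.0],
})

imputed = knn_impute_sketch(ref_df, df, n_neighbors=2)
print(imputed)
```

Observed values in the target dataset are returned unchanged; only the missing bmi entry is filled, here from the two reference samples closest in age.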
apply_various_models(df, mod)
Function to apply various models to a dataset.
Parameters
df : Dataframe to use.
mod : Model to use.
Returns
y : Predicted values.