In R we have access to a variety of complex methods to impute missing
data. For example, we can use advanced statistical algorithms like
**EMB** (Expectation–Maximization with Bootstrap),
implemented by the Amelia package, or a machine learning approach in the
form of **random forest**, implemented by missForest.

Problems appear when you want to use one of these methods in a machine
learning workflow or simply include them in bigger scripts. All of these
packages have different implementations; for example, most of them return
different objects. In NADIA we try to automate the process of using
these packages (including the available methods for improving imputation). We
created a uniform interface for the following packages:
**Amelia**, **mice**,
**missMDA**, **missForest**,
**missRanger**, **VIM**, and
**softImpute**. To give the user easy access to all
methods in a machine learning workflow, we implemented them as operators in
**mlr3pipelines** (Binder et al.
2020).

From Github:

`devtools::install_github("https://github.com/ModelOriented/EMMA/", subdir = "NADIA_package/NADIA")`

From CRAN:

`install.packages("NADIA")`

**Amelia** (Honaker, King, and Blackwell 2011) is a
commonly used implementation of Expectation-Maximization with Bootstrap.
By default, this package performs multiple imputation. In the case of
the **mlr3pipelines** operators, we have to choose one of the
produced data sets. **Amelia** can impute both categorical and
continuous variables.
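As a standalone illustration of what Amelia's multiple imputation produces, here is a hedged sketch on the `africa` example data shipped with the package (the argument choices are illustrative, not NADIA's defaults):

```
library(Amelia)

# 'africa' is an example panel data set shipped with Amelia
data(africa)

# m = 3 imputed data sets; 'ts' and 'cs' mark the time and cross-section columns
fit <- amelia(africa, m = 3, ts = "year", cs = "country")

# each completed data set lives in fit$imputations
head(fit$imputations[[1]])
```

NADIA's operator hides this multiplicity and hands a single completed data set to the pipeline.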

**mice** (van Buuren and Groothuis-Oudshoorn 2011)
(Multiple Imputation by Chained Equations) is another popular package
for working with missing data. In our implementation, we use linear models
to evaluate and improve imputation. **mice** can be used in
two possible approaches.
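For context, plain **mice** outside NADIA looks roughly like this; a minimal sketch on the `nhanes` toy data shipped with mice:

```
library(mice)

# nhanes: small example data set with missing values, shipped with mice
imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# extract the first of the five completed data sets
completed <- complete(imp, 1)
anyNA(completed)
#> [1] FALSE
```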

The **missMDA** (Josse and Husson
2016) package implements methods that are especially useful when you
want to use PCA or similar techniques after imputation. Because of the number of
methods, imputation from this package was separated into two
functions.

The first function implements three complementary methods:

- *PCA* (Principal Components Analysis), used when the data contains only continuous features,
- *MCA* (Multiple Correspondence Analysis), used when the data contains only categorical features,
- *FMAD* (Factorial Analysis for Mixed Data), used when the data contains mixed features.

The second function implements:

- *MFA* (Multiple Factor Analysis), which can be used for all types of data.
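A minimal standalone sketch of the mixed-data case, using the `ozone` example data shipped with missMDA (the `ncp` value is illustrative):

```
library(missMDA)

# ozone: mixed continuous/categorical data with missing values, shipped with missMDA
data(ozone)

# impute with Factorial Analysis for Mixed Data, keeping 3 components
res <- imputeFAMD(ozone, ncp = 3)

head(res$completeObs)  # the completed data set
```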

**missForest** (Stekhoven and Buehlmann
2012) uses machine learning to impute missing data. In this
package, the random-forest model is trained on data with missing values
and used to perform imputation.
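A minimal standalone call, using `prodNA` (the NA-generating helper shipped with missForest) to spoil the complete `iris` data:

```
library(missForest)

# introduce 10% missing values into iris
set.seed(1)
iris_mis <- prodNA(iris, noNA = 0.1)

# train random forests on the incomplete data and impute
res <- missForest(iris_mis)

head(res$ximp)   # the imputed data
res$OOBerror     # out-of-bag estimate of the imputation error
```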

**missRanger** (Mayer 2019) is an
improved version of missForest where Predictive Mean Matching is added
between random-forest iterations. This firstly avoids imputation with
values not already present in the original data. Secondly, predictive
mean matching tries to raise the variance in the resulting conditional
distributions to a realistic level.
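A minimal standalone call, using `generateNA` (the helper shipped with missRanger); setting `pmm.k` above zero switches on predictive mean matching:

```
library(missRanger)

# introduce about 10% missing values into iris
set.seed(1)
iris_mis <- generateNA(iris, p = 0.1)

# impute; pmm.k = 3 draws each imputed value from the 3 closest observed candidates
imputed <- missRanger(iris_mis, pmm.k = 3, num.trees = 100)
anyNA(imputed)
#> [1] FALSE
```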

**VIM** (Kowarik and Templ 2016) implements four
different imputation methods, each in a separate function:

- *Hot-Deck*: the data set is sorted and missing values are imputed sequentially, running through the data set line (observation) by line (observation). A fast and simple imputation method,
- *IRMI* (Iterative Robust Model-based Imputation): in each step of the iteration (inner loop), one variable is used as a response variable and the remaining variables serve as the regressors. The procedure is repeated until the algorithm converges (outer loop),
- *kNN* (k nearest neighbors): an aggregation of the k values of the nearest neighbors is used as the imputed value. The function used to aggregate the neighbors can be passed as an argument,
- *Regression Imputer*: trains linear models using the columns without missing values as features and a column with missing values as the target.
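Each of the four methods is called directly as its own function; a minimal sketch on VIM's `sleep` example data (argument choices are illustrative):

```
library(VIM)

# sleep: mammal sleep data with missing values, shipped with VIM
data(sleep, package = "VIM")

imp_hd   <- hotdeck(sleep)        # Hot-Deck
imp_irmi <- irmi(sleep)           # IRMI
imp_knn  <- kNN(sleep, k = 5)     # k nearest neighbors

# Regression Imputer: complete columns as features, a column with NAs as target
imp_reg <- regressionImp(Sleep ~ BodyWgt + BrainWgt, data = sleep)
```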

**softImpute** (Hastie and Mazumder
2015) bases imputation on matrix operations. It is fast but
limited to numeric variables, so it has to be used alongside a simple
imputation method for categorical variables; that imputation function
can be passed as an argument.
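A minimal sketch of the matrix workflow (the rank and lambda values here are arbitrary):

```
library(softImpute)

# small numeric matrix with a few missing entries
set.seed(1)
x <- matrix(rnorm(30), nrow = 6)
x[sample(length(x), 5)] <- NA

# low-rank fit, then fill the missing cells from it
fit <- softImpute(x, rank.max = 2, lambda = 0.1)
x_filled <- complete(x, fit)  # softImpute's own complete()
anyNA(x_filled)
#> [1] FALSE
```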

In standard machine learning, the model is first trained on training data. Then a trained model is used to predict new data. This workflow is recommended and should be used when it’s possible. We call this approach A and present it in the diagram below:

Problems start appearing when we want to include advanced imputation
methods in this approach. The majority of the used packages don't allow
separating the training stage from the imputation stage (except
**mice**; more about this in the next section). Because of
this, we have to use something we call approach B. In this case,
imputation works separately on the training and test sets, but the rest of the
model is trained the same as in approach A. Approach B is presented in
the diagram below:

Approach B has obvious limitations; for example, it's impossible to predict on only one example because imputation techniques don't work for too-small samples. On the other hand, approach B can be beneficial when the training data has a different distribution than the testing data. This situation can happen when training is performed using historical data.

Not all included packages are limited to approach B. We can use
**mice** in approach A using a simple trick. First, we
perform imputation on the training data and then use the trained imputer on the
testing set. To avoid data leakage, we remove the real values from the
testing data set when imputation is performed; these data are added back
after imputation. By doing that, we allow testing on only one example and
avoid all problems with a small test sample size. This approach to
**mice** is available with all mice methods.
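A sketch of how this could look in NADIA (the operator name `PipeOpMice_A` is an assumption here; check the package index for the exact name of the approach-A mice operator):

```
library(NADIA)

# assumed name for the approach-A mice operator; verify against the NADIA index
imp_A <- PipeOpMice_A$new()

# train the imputer on the task; the fitted imputer is reused at predict time
task_imputed <- imp_A$train(list(tsk("pima")))[[1]]
task_imputed$missings()
```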

All included packages are available in the form of
**mlr3pipelines** operators, so they can be used like this:

```
# Task with missing data from mlr3
task_with_missing <- tsk('pima')

# Creating an operator implementing the imputation method
imputation_methods <- PipeOpMice$new()

# Imputation
task_with_no_missing <- imputation_methods$train(list(task_with_missing))[[1]]

# Check
task_with_missing$missings()
#> diabetes      age  glucose  insulin     mass pedigree pregnant pressure
#>        0        0        5      374       11        0        0       35
#>  triceps
#>      227
task_with_no_missing$missings()
#> diabetes      age pedigree pregnant  glucose  insulin     mass pressure
#>        0        0        0        0        0        0        0        0
#>  triceps
#>        0
```

But the real advantage of using NADIA comes from its integration with mlr3 (Lang et al. 2019). Because of that, we can easily include advanced imputation techniques inside machine learning models. For example:

```
library(mlr3learners)

# Creating graph learner
# imputation method
imp <- PipeOpmissRanger$new()

# learner
learner <- lrn('classif.glmnet')

graph <- imp %>>% learner
graph_learner <- GraphLearner$new(graph, id = 'missRanger.learner')

# resampling
set.seed(1)
resample(tsk('pima'), graph_learner, rsmp('cv', folds = 5))
#> INFO [21:15:45.831] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 1/5)
#> INFO [21:15:47.274] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 2/5)
#> INFO [21:15:49.196] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 3/5)
#> INFO [21:15:50.378] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 4/5)
#> INFO [21:15:52.053] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 5/5)
#> <ResampleResult> of 5 iterations
#> * Task: pima
#> * Learner: missRanger.learner
#> * Warnings: 0 in 0 iterations
#> * Errors: 0 in 0 iterations
```

Advanced imputation techniques can often cause errors. NADIA uses mlr3's methods to handle that:

```
# Error handling
graph_learner$encapsulate <- c(train = 'evaluate', predict = 'evaluate')

# Creating a problematic task
data <- iris
data[, 1] <- NA
task_problematic <- TaskClassif$new('task', data, 'Species')

# Resampling
# All folds will be tested and the script run further
set.seed(1)
resample(task_problematic, graph_learner, rsmp('cv', folds = 5))
#> INFO [21:16:09.323] [mlr3] Applying learner 'missRanger.learner' on task 'task' (iter 1/5)
#> INFO [21:16:09.353] [mlr3] Applying learner 'missRanger.learner' on task 'task' (iter 2/5)
#> INFO [21:16:09.388] [mlr3] Applying learner 'missRanger.learner' on task 'task' (iter 3/5)
#> INFO [21:16:09.423] [mlr3] Applying learner 'missRanger.learner' on task 'task' (iter 4/5)
#> INFO [21:16:09.455] [mlr3] Applying learner 'missRanger.learner' on task 'task' (iter 5/5)
#> <ResampleResult> of 5 iterations
#> * Task: task
#> * Learner: missRanger.learner
#> * Warnings: 0 in 0 iterations
#> * Errors: 5 in 5 iterations
```

We want to include any form of imputation tuning provided by the used packages in our functions. This is not possible for every package, but it can be used, for example, in missRanger:

```
# Turning off encapsulation
graph_learner$encapsulate <- c(train = 'none', predict = 'none')

# Turning on optimization
graph_learner$param_set$values$impute_missRanger_B.optimize <- TRUE

# Resampling
set.seed(1)
resample(tsk('pima'), graph_learner, rsmp('cv', folds = 5))
#> INFO [21:16:16.863] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 1/5)
#> INFO [21:16:21.162] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 2/5)
#> INFO [21:16:25.126] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 3/5)
#> INFO [21:16:29.365] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 4/5)
#> INFO [21:16:33.660] [mlr3] Applying learner 'missRanger.learner' on task 'pima' (iter 5/5)
#> <ResampleResult> of 5 iterations
#> * Task: pima
#> * Learner: missRanger.learner
#> * Warnings: 0 in 0 iterations
#> * Errors: 0 in 0 iterations
```

Using optimization slows down the whole process, especially in approach B, where imputation has to be optimized separately on the training and test sets.

NADIA also implements simple imputation methods like median or mean in approach B. For example:

```
# Creating graph learner
# imputation method
imp <- PipeOpMean_B$new()

# learner
learner <- lrn('classif.glmnet')

graph <- imp %>>% learner
graph_learner <- GraphLearner$new(graph)
graph_learner$id <- 'mean.learner'

# resampling
set.seed(1)
resample(tsk('pima'), graph_learner, rsmp('cv', folds = 5))
#> INFO [21:16:44.609] [mlr3] Applying learner 'mean.learner' on task 'pima' (iter 1/5)
#> INFO [21:16:44.766] [mlr3] Applying learner 'mean.learner' on task 'pima' (iter 2/5)
#> INFO [21:16:44.916] [mlr3] Applying learner 'mean.learner' on task 'pima' (iter 3/5)
#> INFO [21:16:45.078] [mlr3] Applying learner 'mean.learner' on task 'pima' (iter 4/5)
#> INFO [21:16:45.238] [mlr3] Applying learner 'mean.learner' on task 'pima' (iter 5/5)
#> <ResampleResult> of 5 iterations
#> * Task: pima
#> * Learner: mean.learner
#> * Warnings: 0 in 0 iterations
#> * Errors: 0 in 0 iterations
```

NADIA gives the user easy access to advanced imputation techniques scattered across many packages. It also simplifies using these techniques and provides a high level of automation. Beyond that, NADIA implements functions to simulate missing data, which can be especially useful for comparing imputation methods with each other.
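A hedged sketch of that simulation facility (the function name `simulate_missings` and its percentage argument are assumptions about NADIA's API; verify against the package index before use):

```
library(NADIA)

# spoil the complete iris data with roughly 20% missing values
# (function and argument names assumed; check NADIA's documentation)
iris_mis <- simulate_missings(iris, per_missings = 20)
colSums(is.na(iris_mis))
```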

For example, I will perform two-fold cross-validation using
**missMDA** and calculate the mean accuracy on a simple data set,
with and without NADIA.

Without NADIA:

```
library(missMDA)
library(mlr3learners)

# Using task from mlr3
task <- tsk("pima")

# I can't perform imputation on a task, so I have to extract the data frame
data <- as.data.frame(task$data())

# Splitting into two sets and removing the target column
indx <- sample(1:nrow(data), nrow(data)/2)
data1 <- data[indx, -1]
data2 <- data[-indx, -1]

## Performing imputation with optimization
# Features are only numeric, so I will use PCA (this has to be checked)
# Optimization
ncp1 <- estim_ncpPCA(data1)$ncp
ncp2 <- estim_ncpPCA(data2)$ncp

# Imputation
data1 <- as.data.frame(imputePCA(data1, ncp1)$completeObs)
data2 <- as.data.frame(imputePCA(data2, ncp2)$completeObs)

# Adding back the target column
data1$diabetes <- data$diabetes[indx]
data2$diabetes <- data$diabetes[-indx]

# Creating new tasks to make a prediction
task1 <- TaskClassif$new("t1", data1, "diabetes")
task2 <- TaskClassif$new("t2", data2, "diabetes")

# Training, prediction, and evaluation
# Fold 1
learner <- lrn("classif.glmnet")
learner$train(task1)
p2 <- learner$predict(task2)
acc2 <- p2$score(msr("classif.acc"))

# Fold 2
learner <- lrn("classif.glmnet")
learner$train(task2)
p1 <- learner$predict(task1)
acc1 <- p1$score(msr("classif.acc"))

# Mean acc
(acc1 + acc2)/2
#> classif.acc
#>   0.7708333
```

With NADIA:

```
library(mlr3learners)

# Using task from mlr3
task <- tsk("pima")

# Imputation, training, prediction, and evaluation
graph <- PipeOpMissMDA_PCA_MCA_FMAD$new() %>>% lrns("classif.glmnet")
graph_learner <- GraphLearner$new(graph)
graph_learner$id <- 'learner'

rr <- resample(task, graph_learner, rsmp("cv", folds = 2))
#> INFO [21:17:08.749] [mlr3] Applying learner 'learner' on task 'pima' (iter 1/2)
#> INFO [21:17:09.184] [mlr3] Applying learner 'learner' on task 'pima' (iter 2/2)
rr$aggregate(msr("classif.acc"))
#> classif.acc
#>   0.7682292
```

As we can see, NADIA automates the whole process and allows you to easily include imputation techniques in your machine learning models.

Binder, Martin, Florian Pfisterer, Lennart Schneider, Bernd Bischl,
Michel Lang, and Susanne Dandl. 2020. *Mlr3pipelines: Preprocessing
Operators and Pipelines for ’Mlr3’*.

Hastie, Trevor, and Rahul Mazumder. 2015. *softImpute: Matrix
Completion via Iterative Soft-Thresholded SVD*. https://CRAN.R-project.org/package=softImpute.

Honaker, James, Gary King, and Matthew Blackwell. 2011.
“Amelia II: A Program for Missing Data.”
*Journal of Statistical Software* 45 (7): 1–47. https://www.jstatsoft.org/v45/i07/.

Josse, Julie, and François Husson. 2016. “missMDA: A Package for Handling Missing Values in
Multivariate Data Analysis.” *Journal of Statistical
Software* 70 (1): 1–31. https://doi.org/10.18637/jss.v070.i01.

Kowarik, Alexander, and Matthias Templ. 2016. “Imputation with the
R Package VIM.” *Journal of
Statistical Software* 74 (7): 1–16. https://doi.org/10.18637/jss.v074.i07.

Lang, Michel, Martin Binder, Jakob Richter, Patrick Schratz, Florian
Pfisterer, Stefan Coors, Quay Au, Giuseppe Casalicchio, Lars Kotthoff,
and Bernd Bischl. 2019. “mlr3: A
Modern Object-Oriented Machine Learning Framework in
R.” *Journal of Open Source Software*,
December. https://doi.org/10.21105/joss.01903.

Mayer, Michael. 2019. *missRanger: Fast Imputation of Missing
Values*. https://CRAN.R-project.org/package=missRanger.

Stekhoven, Daniel J., and Peter Buehlmann. 2012. “MissForest -
Non-Parametric Missing Value Imputation for Mixed-Type Data.”
*Bioinformatics* 28 (1): 112–18.

van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations
in r.” *Journal of Statistical Software* 45 (3): 1–67. https://www.jstatsoft.org/v45/i03/.