Variable Selection for Multiply Imputed Data

Alexander Rix
Center For Precision Health Data Science
Department of Biostatistics
University of Michigan School of Public Health

05 March, 2024

Penalized regression methods, such as the lasso and elastic net, are used in many biomedical applications when simultaneous regression coefficient estimation and variable selection are desired. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm to each imputed dataset separately will likely yield different sets of selected predictors, making it difficult to ascertain a final active set without resorting to ad hoc combination rules. ‘miselect’ implements the Stacked Adaptive Elastic Net (saenet) and the Grouped Adaptive LASSO (galasso) for continuous and binary outcomes. By construction, both methods select the same set of variables across all imputed datasets. ‘miselect’ also provides cross-validated variants of these methods.

saenet works by stacking the multiply imputed data into a single matrix and running a weighted adaptive elastic net on it. galasso works by adding a group penalty to the aggregated objective function to ensure consistent selection across imputations. Simulations (Du et al. 2020) suggest that the “stacked” objective function approach (i.e., saenet) tends to be more computationally efficient and to have better estimation and selection properties.
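
Concretely, with m imputed datasets of n subjects each, the two objectives can be sketched as follows; the notation is lightly simplified from Du et al. (2020), so treat this as a schematic rather than the exact implemented criterion. saenet minimizes a weighted, stacked penalized log-likelihood,

$$\hat{\beta} = \underset{\beta}{\arg\min}\; -\sum_{k=1}^{m} \sum_{i=1}^{n} \frac{w_i}{m} \log L\left(y_{ik} \mid x_{ik}, \beta\right) + \lambda \left( \alpha \sum_{j=1}^{p} \hat{a}_j \lvert \beta_j \rvert + \frac{1 - \alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right),$$

where $w_i$ is an observation-level weight (below, the proportion of observed entries for subject $i$) and $\hat{a}_j$ are adaptive weights. galasso instead estimates a separate coefficient $\beta_{jk}$ per imputation and couples them through an adaptive group penalty of the form $\lambda \sum_{j=1}^{p} \hat{a}_j \sqrt{\sum_{k=1}^{m} \beta_{jk}^2}$, which zeroes out a covariate in all imputations simultaneously.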

Installation

miselect can be installed from GitHub via

# install.packages("devtools")
devtools::install_github("umich-cphds/miselect", build_opts = c())

The GitHub version may contain bug fixes not yet present on CRAN, so if you are experiencing issues, you may want to try it.
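
The release version is on CRAN (as the note above implies) and can be installed in the usual way:

# Install the release version from CRAN
install.packages("miselect")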

Example

The purpose of this example is to help the user get started with the methods in the package. To facilitate this, we have included a synthetic example dataset in the package, miselect.df, which contains a binary response, Y, and 20 continuous covariates, X1, ..., X20.

library(miselect)

colMeans(is.na(miselect.df))
#>         X1         X2         X3         X4         X5         X6         X7 
#> 0.03333333 0.02666667 0.05333333 0.05000000 0.05666667 0.16000000 0.14666667 
#>         X8         X9        X10        X11        X12        X13        X14 
#> 0.27333333 0.22666667 0.22666667 0.21000000 0.25333333 0.32000000 0.37333333 
#>        X15        X16        X17        X18        X19        X20          Y 
#> 0.34000000 0.34666667 0.36333333 0.43000000 0.38000000 0.42666667 0.00000000

As you can see, this dataset includes missing values, so we need to impute them using the R package mice. Imputation should be done carefully to avoid introducing biases into the imputed data that could affect the analysis of interest. There are many tutorials available on how to do this properly; a good reference text is Little and Rubin (2019).

However, for the sake of example, we are going to use the default mice settings, i.e., predictive mean matching.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind

set.seed(48109)

# Using the mice defaults for sake of example only.
mids <- mice(miselect.df, m = 5, printFlag = FALSE)

Both saenet and galasso take lists of (imputed) design matrices and responses. Manipulating the mice output into this form is not too difficult.

# Generate list of completed data.frames
dfs <- lapply(1:5, function(i) complete(mids, action = i))

# Generate list of imputed design matrices and imputed responses
x <- list()
y <- list()
for (i in 1:5) {
    x[[i]] <- as.matrix(dfs[[i]][, paste0("X", 1:20)])
    y[[i]] <- dfs[[i]]$Y
}
# Calculate observational weights: the proportion of observed
# (non-missing) entries for each subject in the un-imputed data
weights  <- 1 - rowMeans(is.na(miselect.df))
# Penalty factor for each covariate (1 = standard penalization)
pf       <- rep(1, 20)
# Adaptive weights; all 1s yield a non-adaptive elastic net at first
adWeight <- rep(1, 20)
# Candidate elastic net mixing parameters
alpha    <- c(.5, 1)

# Since 'Y' is a binary variable, we use 'family = "binomial"'
fit <- cv.saenet(x, y, pf, adWeight, weights, family = "binomial",
                 alpha = alpha, nfolds = 5)

# By default 'coef' returns the betas for (lambda.min, alpha.min)
coef(fit)
#>  (Intercept)           X1           X2           X3           X4           X5 
#>  0.058829196  1.246865643  0.556702265  0.001789517  1.589190800  0.007083157 
#>           X6           X7           X8           X9          X10          X11 
#> -0.167764887  1.662568906  0.011219389 -0.061129447  0.139055222  0.944357657 
#>          X12          X13          X14          X15          X16          X17 
#>  0.000000000  0.195019395 -0.052230604 -0.313618952  0.000000000  0.208136800 
#>          X18          X19          X20 
#>  0.025830652 -0.272683197  0.104190732

By default, coef returns the coefficients for the lambda / alpha pair with the lowest cross-validation error.
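
These selected tuning parameters are stored on the fitted object, so you can inspect them directly:

# Tuning parameters selected by cross validation
fit$lambda.min
fit$alpha.min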

You can also supply different values of lambda and alpha. Here we use the lambda and alpha selected by the one-standard-error rule:

coef(fit, lambda = fit$lambda.1se, alpha = fit$alpha.1se)
#> (Intercept)          X1          X2          X3          X4          X5 
#>  0.09801859  1.02644049  0.25535298  0.02707120  1.10440081  0.00000000 
#>          X6          X7          X8          X9         X10         X11 
#>  0.00000000  1.14061661  0.00000000  0.00000000  0.00000000  0.70733455 
#>         X12         X13         X14         X15         X16         X17 
#>  0.00000000  0.03535632  0.00000000 -0.13505194  0.00000000  0.06696791 
#>         X18         X19         X20 
#>  0.00000000 -0.11068623  0.00000000

Note that the adaptive weights (adWeight) were all 1, so fit was just an elastic net. Let’s use its coefficients as adaptive weights for a second, adaptive fit. The first coefficient is the intercept, so we drop it.

adWeight <- 1 / (abs(coef(fit)[-1]) + 1 / nrow(miselect.df))

afit <- cv.saenet(x, y, pf, adWeight, weights, family = "binomial",
                  alpha = alpha, nfolds = 5)

coef(afit)
#> (Intercept)          X1          X2          X3          X4          X5 
#>  0.04581241  1.29894689  0.67176521  0.00000000  1.73297802  0.00000000 
#>          X6          X7          X8          X9         X10         X11 
#> -0.23781791  1.83607754  0.00000000 -0.05813537  0.17851775  1.03222712 
#>         X12         X13         X14         X15         X16         X17 
#>  0.00000000  0.23992341 -0.02461918 -0.37817973  0.00000000  0.24911599 
#>         X18         X19         X20 
#>  0.00000000 -0.32677614  0.11543197
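
To read off the final active set, keep the covariates whose coefficients are nonzero (dropping the intercept):

# Final active set: covariates with nonzero coefficients in the adaptive fit
beta <- coef(afit)[-1]
names(beta)[beta != 0]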

galasso works similarly to saenet, but does not have weights or alpha parameters.
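
As a minimal sketch (assuming the cv.galasso interface mirrors the cv.saenet call above, minus weights and alpha), a galasso fit on the same lists would look like:

# Sketch only: cv.galasso takes the same lists of imputed design
# matrices and responses, plus penalty factors and adaptive weights
gfit <- cv.galasso(x, y, pf, adWeight, family = "binomial", nfolds = 5)
coef(gfit)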

Bugs

If you encounter a bug, please open an issue on the GitHub Issues tab or send us an email.

Contact

For questions or feedback, please email Jiacong Du or Alexander Rix.

References

Du, J., Boss, J., Han, P., Beesley, L. J., Goutman, S. A., Batterman, S., Feldman, E. L., & Mukherjee, B. (2020). Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. arXiv:2003.07398.

Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). John Wiley & Sons.