VSOLassoBag

Variable-Selection Oriented Lasso Bagging Algorithm

1. Introduction

VSOLassoBag is a variable-selection oriented Lasso bagging algorithm for biomarker development in omics-based translational research, providing a bagging Lasso framework for selecting significant variables from multiple models. A main application of this package is to screen a limited number of variables that are less dependent to the training data sets. Basically, this package was initially developed to adjust Lasso selected results from bootstrapped sample sets. The variables with the highest frequency among several selected results were considered as the stable variables to discriminate different sample sets. However, it is usually difficult to determine the cut-off point in terms of frequency when applied in real data sets. In this package, we introduced several methods, namely (1) curve elbow point detection, (2) parametric statistical test to help determine the cut-off point for variable selection. The source code and documents are freely available through Github (https://github.com/likelet/VSOLassoBag).

1.1 Citation

If you find this tool useful, please cite:

Liang J, Wang C, Zhang D, et al. VSOLassoBag: A variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research [published online ahead of print, 2023 Jan 3]. J Genet Genomics. 2023;S1673-8527(22)00285-5. doi:10.1016/j.jgg.2022.12.005

2. Installation

VSOLassoBag can run in both Windows system and most POSIX systems. To the latest version of this package, start R and enter:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("likelet/VSOLassoBag")

3. Prepare input data

To perform analysis with VSOLassoBag function, you need to provide:

  • ExpressionData is an object constructed by SummarizedExperiment, including independent variables (e.g., expression matrix) and dependent/outcome variable(s). Please view the SummarizedExperiment doucumention for the construction of SummarizedExperiment objects.

  • character indicating the type of dependent variables as input to the parameter a.family.

Note:

  • a.family is determined entirely by the type of dependent/outcome variable(s).

  • You can also tune other parameters to better balance the performance and resource required for the analysis. Key parameters include bootN for bagging times, and bagFreq.sigMethod for the cut-off point decision method.

3.1 Independent variables

Independent variables is a matrix-like object with variables as rows and observations as columns, all the entries should be numeric. It can be obtained by SummarizedExperiment::assay function.

Example Independent variables matrix

          [,1]       [,2]       [,3]        [,4]        [,5]       [,6]
X_1 -1.0059043 -0.5831154  0.5188620  0.08714558  0.45080348  0.1149953
X_2  1.2907372 -0.1005391  0.3385391  0.28154759 -0.14453034  1.0213830
X_3 -0.5321077  1.9809263  0.5795232  0.98104200 -0.02066287 -1.0550853
X_4 -0.9984327  1.4455361  1.3130022  0.47038414 -1.23164279  2.3385465
X_5 -0.2032319 -0.6088388 -1.3633668 -0.22835534  2.33349367  0.2281900
X_6  2.4383744  0.7443007  1.7134854  0.62076548  0.07363745 -0.0915762

3.2 dependent/outcome variable(s)

dependent/outcome variable(s) store sample information with the same rows as the samples in the independent variables matrix and can be invoked by SummarizedExperiment::colData function.

Example out.mat vector

[1]  0.1580122  0.4223877  4.3253401  3.9309707  1.2470942 -0.4556825

3.3 a.family

a.family is a character determined by the data type of out.mat, including binomial, gaussian, cox, multinomial, mgaussian, poisson.

4. Results

A list with:

  1. results: a dataframe containing variables with selection frequency >= 1. If setting bagFreq.sigMethod == "PST", the P.value and the P.adjust of each variable will be provided. If setting bagFreq.sigMethod = "CEP" or using the elbow point indicators, elbow point(s) will be marked with * and thresholds for each variable will be assigned.

  2. permutations: the permutation test results.

  3. model: the regression model built on LassoBag results.

5. Usage Example

For tutorial purpose, here we used two examples utilizing different cut-off point decision methods to exhibit how to interpret results. The data is simulated by Gaussian model and you can obtain it by data(simulated_example)

5.1 Using “CEP” cut-off point decision method

“CEP” (i.e. “Curve Elbow Point Detection”) is the default and recommended method for cut-off point decision. This method assumes that a sharp decrease of the observed frequency should separate important features from those unimportant ones, and based on this detects the elbow point(s) on the observed frequency curve. Finally the features with observed frequency higher than the elbow point are inferred as important features.

Note: There may be more than one elbow point detected on the curve when using loose threshold, therefore it is recommended to use a stricter threshold first (use a larger kneedle.S) and the algorithm will automatically loosen the S parameter in case no elbow point can be found.

  library(VSOLassoBag)
  data(ExpressionData)
  set.seed(19084)
  VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y", bootN = 100, a.family = "gaussian", bagFreq.sigMethod = "PST", do.plot = FALSE)

Results of important variables:

Sat Mar 25 00:09:40 2023 -- INIT process completed, start analyzing 
Sat Mar 25 00:09:40 2023 -- start calculating observed frequency 
2023-03-25 00:10:41--Bagging finished ...
Sat Mar 25 00:10:41 2023 -- Using Non-permutation method to determine a cutoff point 
Using Parametric Statistical Test...
Sat Mar 25 00:10:41 2023 -- Done
    variable Frequency       P.value      P.adjust
3        X_3       100  0.000000e+00  0.000000e+00
7        X_7       100  0.000000e+00  0.000000e+00
2        X_2        99 3.872023e-116 3.071805e-114
10      X_10        98 5.138189e-113 3.057222e-111
6        X_6        97 3.375159e-110 1.606576e-108
9        X_9        88  1.233395e-89  4.892466e-88
5        X_5        87  1.214149e-87  4.128108e-86
8        X_8        82  3.167383e-78  9.422964e-77
4        X_4        62  4.736353e-47  1.252502e-45
1        X_1        45  1.290263e-26  3.070827e-25
59     X_108        28  2.930143e-11  6.339763e-10
226    X_468        25  3.983529e-09  7.900667e-08
86     X_169        21  1.372018e-06  2.511849e-05
193    X_404        19  1.836787e-05  3.122538e-04

CEP Observed Frequency Curve (generated by do.plot = TRUE):

5.2 Using “PST” cut-off point decision method

“PST” (i.e. “Parametric Statistical Test”) is one of the alternative methods for cut-off point decision, which is computed as fast and memory-efficient as “CEP”. It assumes the expected selection frequency of all variables follows a binomial distribution, so we can first model such a theoretical background distribution, and then obtain the statistical significance (p-value) of all variables.

  library(VSOLassoBag)
  data(ExpressionData)
  set.seed(19084)
  VSOLassoBag(ExpressionData = ExpressionData, outcomevariable = "y", bootN = 100, a.family = "gaussian", bagFreq.sigMethod = "CEP", do.plot = TRUE)

Results of PST method

Sat Mar 25 00:10:42 2023 -- INIT process completed, start analyzing 
Sat Mar 25 00:10:42 2023 -- start calculating observed frequency 
2023-03-25 00:11:42--Bagging finished ...
Sat Mar 25 00:11:43 2023 -- Using Non-permutation method to determine a cutoff point 
Detecting Elbow Points on the Observed Frequency Curve...
Using S = 10 for elbow point dection.
Sat Mar 25 00:11:43 2023 -- Done
    variable Frequency elbow.point Diff     Thres
3        X_3       100                0  0.000000
7        X_7       100                0  0.000000
2        X_2        99                1  0.000000
10      X_10        98                1  0.000000
6        X_6        97                1  0.000000
9        X_9        88           *    9  4.822785
5        X_5        87                1  4.822785
8        X_8        82                5  4.822785
4        X_4        62           *   20 15.822785
1        X_1        45               17 15.822785
59     X_108        28               17 15.822785
226    X_468        25                3 15.822785
86     X_169        21                4 15.822785
193    X_404        19                2 15.822785

PST Observed Frequency Distribution (generated by do.plot = TRUE):

6. Session Info

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=C                               
[2] LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] VSOLassoBag_0.99.1

loaded via a namespace (and not attached):
 [1] POT_1.1-10                  Rcpp_1.0.10                
 [3] lattice_0.20-45             glmnet_4.1-6               
 [5] digest_0.6.31               foreach_1.5.2              
 [7] utf8_1.2.3                  R6_2.5.1                   
 [9] GenomeInfoDb_1.34.9         stats4_4.2.2               
[11] evaluate_0.20               ggplot2_3.4.1              
[13] pillar_1.8.1                zlibbioc_1.44.0            
[15] rlang_1.0.6                 rstudioapi_0.14            
[17] jquerylib_0.1.4             S4Vectors_0.36.2           
[19] Matrix_1.5-1                rmarkdown_2.20             
[21] splines_4.2.2               RCurl_1.98-1.10            
[23] munsell_0.5.0               DelayedArray_0.23.2        
[25] compiler_4.2.2              xfun_0.37                  
[27] pkgconfig_2.0.3             BiocGenerics_0.44.0        
[29] shape_1.4.6                 htmltools_0.5.4            
[31] tidyselect_1.2.0            SummarizedExperiment_1.28.0
[33] tibble_3.2.0                GenomeInfoDbData_1.2.9     
[35] bookdown_0.33               IRanges_2.32.0             
[37] codetools_0.2-18            matrixStats_0.63.0         
[39] fansi_1.0.4                 dplyr_1.1.0                
[41] bitops_1.0-7                grid_4.2.2                 
[43] jsonlite_1.8.4              gtable_0.3.1               
[45] lifecycle_1.0.3             magrittr_2.0.3             
[47] scales_1.2.1                rmdformats_1.0.4           
[49] cli_3.6.0                   cachem_1.0.7               
[51] pbapply_1.7-0               XVector_0.38.0             
[53] bslib_0.4.2                 generics_0.1.3             
[55] vctrs_0.5.2                 iterators_1.0.14           
[57] tools_4.2.2                 Biobase_2.58.0             
[59] glue_1.6.2                  MatrixGenerics_1.10.0      
[61] parallel_4.2.2              fastmap_1.1.1              
[63] survival_3.4-0              yaml_2.3.7                 
[65] colorspace_2.1-0            GenomicRanges_1.50.2       
[67] knitr_1.42                  sass_0.4.5                 

@ Copyright 2018-2022, Center of Bioinformatics, Sun Yat-sen University Cancer Center Revision 322baf5b.