ELCIC

Chixiang Chen

ELCIC is a robust and consistent model selection criterion based on the data-driven empirical likelihood. In particular, the framework adopts plug-in estimators obtained by solving external estimating equations, not limited to those from the empirical likelihood itself, which avoids potential computational convergence issues and allows versatile applications such as generalized linear models, generalized estimating equations, and penalized regressions. The proposed criterion is initially derived from the asymptotic expansion of the marginal likelihood under the variable selection framework; more importantly, its consistent model selection property is established in a general context.

Installation

if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("chencxxy28/ELCIC")
library(ELCIC)
library(MASS)
library(mvtnorm)

We provide a basic tutorial illustrating the usage of the ELCIC package; more technical details can be found in Chen et al. (2021). In the following, we present three applications of ELCIC: variable selection in generalized linear models, model selection in longitudinal data, and model selection in longitudinal data with missingness.

Variable Selection in Generalized Linear Models (GLMs)

First, let us generate pseudo data with the function glm.generator, which produces both responses and covariates. Several outcome distributions are supported, including the Gaussian, Poisson, Binomial, and Negative Binomial distributions; for the Negative Binomial case, an overdispersion parameter (ov) can be specified.

We first test how ELCIC works when the distribution is correctly specified. The outcomes are generated from a Poisson distribution. We then consider seven candidate models with different covariate combinations and compare ELCIC with the most commonly used criteria: AIC, BIC, and GIC. The function ELCIC.glm produces the values of ELCIC, AIC, BIC, and GIC for each candidate model. The selection rates of the four criteria, based on 20 Monte Carlo runs, are reported below (this may take 1-2 minutes to run).
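Before running the full simulation, the following minimal sketch applies these two functions to a single data set. It only reuses the calls that appear in the simulation below; the interpretation of the returned matrix (criteria in rows, candidate models in columns) is inferred from how the simulation code indexes it.

#minimal single-run sketch: one Poisson data set, criteria for a few candidate models
set.seed(1)
single.data<-glm.generator(beta=c(0.5,0.5,0.5,0),samplesize=100,rho=0.5,dist="poisson")
criterion.single<-ELCIC.glm(x=single.data[["x"]],y=single.data[["y"]],
                            candidate.sets=list(c(1,2),c(1,2,3),c(1,2,3,4)),
                            name.var=NULL,dist="poisson")
criterion.single                    #criterion values: criteria in rows, candidate models in columns
apply(criterion.single,1,which.min) #candidate model minimizing each criterion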

set.seed(28582)
inter.total=20
count.matrix<-matrix(0,nrow=4,ncol=7)
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
rownames(count.matrix)<-c("ELCIC","AIC","BIC","GIC")
colnames(count.matrix)<-candidate.sets
samplesize<-100

for(iter in 1:inter.total)
{
    #generate Poisson outcomes: the last coefficient in beta is zero,
    #so the true model corresponds to the candidate set c(1,2,3)
    simulated.data<-glm.generator(beta=c(0.5,0.5,0.5,0),samplesize=samplesize,rho=0.5,dist="poisson")
    y<-simulated.data[["y"]]
    x<-simulated.data[["x"]]

    #compute ELCIC, AIC, BIC, and GIC for every candidate model
    criterion.all<-ELCIC.glm(x=x,y=y,candidate.sets=candidate.sets,name.var=NULL,dist="poisson")

    #record the candidate model minimizing each criterion
    index.used<-cbind(1:4,apply(criterion.all,1,which.min))
    count.matrix[index.used]<-count.matrix[index.used]+1
}
count.matrix/inter.total

Based on the results, ELCIC and BIC have the best performance, with the highest selection rates for the true model. This shows that ELCIC is as powerful as BIC when the distribution is correctly specified. Next, we consider the situation where AIC, BIC, and GIC are based on a misspecified distribution. Accordingly, we generate outcomes from a Negative Binomial distribution with overdispersion parameter equal to 2, while AIC, BIC, and GIC take the Poisson distribution as input (this may take 1-2 minutes to run).

set.seed(28582)
inter.total=20
count.matrix<-matrix(0,nrow=4,ncol=7)
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
rownames(count.matrix)<-c("ELCIC","AIC","BIC","GIC")
colnames(count.matrix)<-candidate.sets
samplesize<-100
ov=2

for(iter in 1:inter.total)
{
    #generate Negative Binomial outcomes with overdispersion parameter ov=2;
    #the last coefficient in beta is zero, so the true model is c(1,2,3)
    simulated.data<-glm.generator(beta=c(0.5,0.5,0.5,0),samplesize=samplesize,rho=0.5,dist="NB",ov=ov)
    y<-simulated.data[["y"]]
    x<-simulated.data[["x"]]

    #compute the criteria with a (misspecified) Poisson working model
    criterion.all<-ELCIC.glm(x=x,y=y,candidate.sets=candidate.sets,name.var=NULL,dist="poisson")

    #record the candidate model minimizing each criterion
    index.used<-cbind(1:4,apply(criterion.all,1,which.min))
    count.matrix[index.used]<-count.matrix[index.used]+1
}
count.matrix/inter.total

The above results show that ELCIC has the highest selection rate for the true model. Its better performance is attributed to its distribution-free property, which is highly valuable in real applications where the underlying distribution is hard to specify correctly.

Model Selection in Longitudinal Data Analysis (Assuming Missing Completely at Random)

Since ELCIC is distribution-free, it is well suited to statistical models in a semi-parametric framework. This section evaluates how ELCIC works for the joint selection of marginal mean and working correlation structures in longitudinal data analysis. We assume the data are complete or missing completely at random. The following packages are used to generate different types of outcomes.

library(PoisNor)
library(bindata)
library(geepack)

In the following evaluation, we consider the generalized estimating equations (GEE) framework, which is semi-parametric and widely used in longitudinal data analysis. We jointly select the marginal mean and working correlation structures; note that correctly identifying the working correlation structure improves estimation efficiency for the mean structure. We provide a function gee.generator to generate responses and covariates. Several outcome distributions are supported, including the Gaussian, Poisson, and Binomial distributions, and both time-dependent and time-independent covariates are allowed. We compare ELCIC with another widely adopted information criterion, QIC.
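To make the workflow concrete before the Monte Carlo comparison, the short sketch below performs a single run with ELCIC.gee. It reuses only the gee.generator and ELCIC.gee calls from the simulation below; the reading of the output (working correlation structures crossed with candidate mean models) is inferred from how the simulation code indexes the results.

#single-run sketch: one longitudinal Poisson data set, joint selection via ELCIC.gee
set.seed(1)
one.run<-gee.generator(beta=c(-1,1,0.5,0),samplesize=300,time=3,num.time.dep=2,
                       num.time.indep=1,rho=0.4,x.rho=0.2,dist="poisson",
                       cor.str="exchangeable",x.cor.str="exchangeable")
r.complete<-rep(1,300*3)            #observing indicators: all 1's (complete data)
elcic.gee.single<-ELCIC.gee(x=one.run$x,y=one.run$y,r=r.complete,id=one.run$id,time=3,
                            candidate.sets=list(c(1,2,3),c(1,2,3,4)),name.var=NULL,
                            dist="poisson",
                            candidate.cor.sets=c("independence","exchangeable","ar1"))
elcic.gee.single                                            #one value per (correlation, mean) combination
which(elcic.gee.single==min(elcic.gee.single),arr.ind=TRUE) #jointly selected combination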

set.seed(28582)
inter.total=20
samplesize<-300
time=3
dist="poisson"
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
candidate.cor.sets<-c("independence","exchangeable","ar1")

count.matrix.elcic<-count.matrix.qic<-matrix(0,nrow=length(candidate.cor.sets),ncol=length(candidate.sets))
rownames(count.matrix.elcic)<-rownames(count.matrix.qic)<-candidate.cor.sets
colnames(count.matrix.elcic)<-colnames(count.matrix.qic)<-candidate.sets

for(iter in 1:inter.total)
{
    #generate longitudinal Poisson outcomes with an exchangeable correlation structure;
    #the last coefficient in beta is zero, so the true mean model is c(1,2,3)
    data.corpos<-gee.generator(beta=c(-1,1,0.5,0),samplesize=samplesize,time=time,num.time.dep=2,num.time.indep=1,rho=0.4,x.rho=0.2,dist="poisson",cor.str="exchangeable",x.cor.str="exchangeable")

    y<-data.corpos$y
    x<-data.corpos$x
    id<-data.corpos$id
    r<-rep(1,samplesize*time)   #observing indicators: all 1's since the data are complete

    #ELCIC over all combinations of candidate mean and correlation structures
    criterion.elcic<-ELCIC.gee(x=x,y=y,r=r,id=id,time=time,candidate.sets=candidate.sets,name.var=NULL,dist="poisson",candidate.cor.sets=candidate.cor.sets)

    #QIC over the same candidates
    criterion.qic<-QICc.gee(x=x,y=y,id=id,dist=dist,candidate.sets=candidate.sets,name.var=NULL,candidate.cor.sets=candidate.cor.sets)

    #record the (correlation, mean) combination jointly selected by ELCIC
    index.used.elcic<-which(criterion.elcic==min(criterion.elcic),arr.ind=TRUE)
    count.matrix.elcic[index.used.elcic]<-count.matrix.elcic[index.used.elcic]+1

    #record the combination selected by QIC
    index.used.qic.row<-which(candidate.cor.sets==rownames(criterion.qic))
    index.used.qic.col<-which.min(criterion.qic)
    count.matrix.qic[index.used.qic.row,index.used.qic.col]<-count.matrix.qic[index.used.qic.row,index.used.qic.col]+1

    print(iter)   #print progress
}
count.matrix.elcic/inter.total
count.matrix.qic/inter.total

The results above show that ELCIC has much higher power for joint selection, especially for selecting the marginal mean structure.

Model Selection in Longitudinal Data Analysis (Dropout Missingness)

coming soon