# Examples for Big Data Mediation Analysis

## Package installation

R package mmabig is used for general multiple mediation/confounding analysis with high dimensional data sets Yu and Li (2022). Mediation/confounding effects refer to the effects from third variables that intervene the relationship between an exposure and an outcome. In the following we generally call the mediation/confounding effect as the third-variable effect (TVE). TVEs include the total effect between the exposure and the outcome, direct effect from the exposure to outcome after adjusting for other variables, and indirect effect of a third variable (the effect from the exposure variable to the third variable and to the outcome), which are defined and the statistical inferences described in Li et al. (2021). In mmabig, a generalized linear model with LASSO or elastic net regularization is used to fit the final model. If readers are interested in using a generalized linear model or multiple Additive Regression Trees (MART) to fit the model, please refer to the R package mma .

To use the R package mmabig, we first install the package in R (install.packages("mmabig")) and load it.

## Data Organization and Identify Potential Moderators/Confounders

The function data.org.big is used to do a preliminary data analysis to identify potential Mediators/Confounders (MCs) and covariates. It returns an object of “med_iden” class that organizes the data into a format to be used directly for the mediation analysis functions.

### One binary/continuous predictor and one binary/continuous outcome

The following code generates a simulated data set with 20 potential MCs, of which five (m16-m20) are multicategorical variables. The real mediator/confounders are “m11”, “m12”, and “m16”, which are highly related with both the predictor “pred” and the outcome “y”.

# a binary predictor
set.seed(1)
n=100
pred=rbinom(n,1,0.5)
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*10,mean=pred,sd=1),n,10)
m3.1=m2[,6:10]
m3=m3.1
m2=m2[,1:5]
m3[m3.1<=0.1]=0
m3[0.1<m3.1 & m3.1<=1]=1
m3[m3.1>1]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))

lu<--0.5363+0.701*pred+0.801*m[,1]+0.518*m[,2]+1.402*m[,11]+0.773*m[,12]+
ifelse(m[,16]=="2",2.15,0)+ifelse(m[,16]=="1",0.201,0)

# a continuous y
y<-rnorm(n,lu,1)

Next, we use the function “data.org.big” to identify mediators. The exposure variable is specified by “pred=”. All potential MCs and covariates are in the dataframe “x”. The outcome variable is “y”. Both the exposure and the outcome can be multivariate. The “mediator=1:ncol(m)” indicates that all variables in x should be tested for potential MCs. Two tests are needed to identify a potential MC: first, the variable is significantly related with the predictor adjusting for other covariates. The significance level is set by “alpha2”, whose default value is 0.01. Second, the variable has to be significantly related with the outcome not adjusting (testtype=2, by default) or adjusting (testtype=2) for the predictor and other variables. The significance level is set by “alpha1”. In the following example, p-value 1 shows the results for the second test, and p-value 2 are the results for the first test. A variable that passes the second test but not the first test is considered as a covariate. Variables do not pass the second test are not adopted in further analysis. Variables in “x” but not in “mediator” are forced in further analysis as covariates. The argument “alpha” is the elasticnet mixing parameter such that $$0\leq alpha\leq 1$$, with alpha=1 be the lasso penalty, and alpha=0 be the ridge penalty. By default, alpha=1.

data.e1<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),
pred=data.frame(pred),testtype=1)
summary(data.e1,only=TRUE)
#> Identified as mediators:
#> [1] "m11"  "m12"  "m161" "m162"
#> Selected as covariates:
#> [1] "m01" "m02"
#> Tests:
#>        Coefficients.y P-Value 2.pred
#> m01             0.702          0.548
#> m02             0.534          0.920
#> m11 *           1.180          0.000
#> m12 *           0.479          0.000
#> m161 *          0.000          0.000
#> m162 *          1.548          0.000
#> ----
#>  *:mediator,-:joint mediator
#>  Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor

The results from “data.org.big” function are summarized by the “summary” function. When “only=TRUE”, it only shows the test results for selected covariates and mediators. The multicategorical mediator “m16” has 3 categories, and it has been binarized into two binary variables “m161” and “m162”, and they are selected as potential mediators together. We specify testtype=1 in the above example to identify covariates/mediators using full model. By default, testtype=2 is used that covariates/mediators are tested one by one in models with the predictor only. The following codes show the results when testtype is 2.

data.e1.2<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred))
summary(data.e1.2,only=TRUE)
#> Identified as mediators:
#> [1] "m11"  "m12"  "m161" "m162"
#> Selected as covariates:
#> [1] "m01" "m02"
#> Tests:
#>       P-Value 1.y P-Value 2.pred
#> m01         0.001          0.548
#> m02         0.000          0.920
#> m11 *       0.000          0.000
#> m12 *       0.000          0.000
#> m16 *       0.000          0.000
#> ----
#>  *:mediator,-:joint mediator
#>  P-Value 1:univariate relationship test with the outcome;P-Value 2:Tests of relationship with the Predictor

### Survival outcome

The functions in mmabig can deal with binary, categorical, continuous, or time-to-event outcomes. If the outcome is time-to-event, it should be defined by the “Surv” function in the survival package. The following is an example.

lambda=1/500
survt=-log(runif(n))/lambda/exp(lu)
st=round(runif(n,1,500),0)
time=ifelse(st+survt>600,600,st+survt)-st
cen=ifelse(st+survt>600,0,1)
y=Surv(time,cen)

data.e3<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),testtype=1)
summary(data.e3,only=TRUE)
#> Identified as mediators:
#> [1] "m11"  "m12"  "m161" "m162"
#> Selected as covariates:
#> [1] "m01"
#> Tests:
#>        Coefficients.y P-Value 2.pred
#> m01             0.341          0.548
#> m11 *           0.818          0.000
#> m12 *           0.286          0.000
#> m161 *          0.000          0.000
#> m162 *          1.290          0.000
#> ----
#>  *:mediator,-:joint mediator
#>  Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor

### Multivariate predictors

In addition, the package can handle multivariate and multicategorical predictors. If the predictor is multicategorical of k levels, “data.org.big” first transforms the predictor to k-1 binary predictors. If a variable significantly relates with any of the k-1 predictors, the variable passes the first test described above. P-value 2 is shown for each predictor. In the following example, the predictor has three levels.

# multicategorical predictor
set.seed(1)
n=100
pred=rmultinom(100,1,c(0.5, 0.3, 0.2))
pred=pred[1,]*0+pred[2,]*1+pred[3,]*2
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*10,mean=pred,sd=1),n,10)
m3.1=m2[,6:10]
m2=m2[,1:5]
m3=m3.1
m3[m3.1<=0.1]=0
m3[0.1<m3.1 & m3.1<=1]=1
m3[m3.1>1]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))
pred<-as.factor(pred)
# continuous y
lu<--0.5363+ifelse(pred=="1",0.3,0)+ifelse(pred=="2",0.7,0)+0.801*m[,1]+0.518*m[,2]+
1.402*m[,11]+0.773*m[,12]+ifelse(m[,16]=="2",2.15,0)+ifelse(m[,16]=="1",0.201,0)
y<-rnorm(n,lu,1)

data.m.e1<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),testtype=1)
summary(data.m.e1,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12" "m15"
#> Selected as covariates:
#> [1] "m01" "m02" "m16"
#> Tests:
#>       Coefficients.y P-Value 2.pred1...i.1 P-Value 2.pred1...i.2
#> m01            0.546                 0.721                 0.732
#> m02            0.461                 0.157                 0.201
#> m11 *          1.375                 0.000                 0.000
#> m12 *          0.609                 0.000                 0.000
#> m15 *          0.041                 0.000                 0.000
#> ----
#>  *:mediator,-:joint mediator
#>  Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor

In the following example, the predictor is bivariate.

# multivariate predictor
set.seed(1)
n=100
pred=cbind(runif(n,-1,1),rnorm(n))
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*5,mean=0.3*pred[,1]+0.4*pred[,2],sd=1),n,5)
m3.1=matrix(rnorm(n*5,mean=0.7*pred[,1]+0.8*pred[,2],sd=1),n,5)
m3=m3.1
m3[m3.1<=0]=0
m3[0<m3.1 & m3.1<=1.28]=1
m3[m3.1>1.28]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))
colnames(pred)=c("x1","x2")
# binary y
lu<--0.6852+0.3*pred[,1]+0.7*pred[,2]+0.801*m[,1]+0.518*m[,2]+1.402*m[,11]+0.773*m[,12]+
ifelse(m[,16]=="2",2.15,0)+ifelse(m[,16]=="1",0.201,0)
y<-rbinom(n,1,exp(lu)/(1+exp(lu)))

data.m.c.e2<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),
testtype=1,alpha1=0.05,alpha2=0.05)
summary(data.m.c.e2,only=TRUE)
#> Identified as mediators:
#> [1] "m11"  "m12"  "m15"  "m161" "m162" "m181" "m182" "m191" "m192"
#> Selected as covariates:
#> [1] "m01" "m06"
#> Tests:
#>        Coefficients.y P-Value 2.x1 P-Value 2.x2
#> m01             0.355        0.607        0.533
#> m06            -0.211        0.216        0.977
#> m11 *           0.637        0.653        0.000
#> m12 *           0.381        0.003        0.000
#> m15 *           0.096        0.095        0.000
#> m161 *          0.000        0.003        0.003
#> m162 *          0.923        0.000        0.000
#> m181 *          0.092        0.018        0.018
#> m182 *          0.000        0.000        0.000
#> m191 *          0.000        0.028        0.028
#> m192 *          0.046        0.000        0.000
#> ----
#>  *:mediator,-:joint mediator
#>  Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor

### Multivariate outcomes

Similarly, the package can deal with multivariate outcomes. The following code deals with multivariate predictors and multivariate responses. If the variable is significantly related with any one of the outcomes, it passes the second test described above. The results from “data.org.big” are summarized for each combination of the exposure-outcome relationship.

# multivariate predictor
set.seed(1)
n=100
pred=cbind(runif(n,-1,1),rnorm(n))
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*5,mean=0.3*pred[,1]+0.4*pred[,2],sd=1),n,5)
m3.1=matrix(rnorm(n*5,mean=0.7*pred[,1]+0.8*pred[,2],sd=1),n,5)
m3=m3.1
m3[m3.1<=0]=0
m3[0<m3.1 & m3.1<=1.28]=1
m3[m3.1>1.28]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))
colnames(pred)=c("x1","x2")
#multivariate responses
y<-cbind(rnorm(n,lu,1),rbinom(n,1,exp(lu)/(1+exp(lu))))
colnames(y)=c("y1","y2")

data.m.m.c.e2<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),
testtype=1,alpha1=0.05,alpha2=0.05)
summary(data.m.m.c.e2,only=TRUE)
#> Identified as mediators:
#>  [1] "m11"  "m12"  "m15"  "m161" "m162" "m171" "m172" "m191" "m192" "m201"
#> [11] "m202"
#> Selected as covariates:
#> [1] "m01" "m02" "m09"
#> Tests:
#>        Coefficients.y1 Coefficients.y2 P-Value 2.x1 P-Value 2.x2
#> m01              0.652           0.069        0.607        0.533
#> m02              0.454           0.000        0.053        0.936
#> m09              0.101           0.000        0.678        0.292
#> m11 *            1.375           0.813        0.653        0.000
#> m12 *            0.661           0.114        0.003        0.000
#> m15 *            0.059           0.000        0.095        0.000
#> m161 *           0.547           0.000        0.003        0.003
#> m162 *           1.519           0.021        0.000        0.000
#> m171 *           0.000           0.000        0.219        0.219
#> m172 *           0.163           0.000        0.000        0.000
#> m191 *           0.000           0.000        0.028        0.028
#> m192 *           0.072           0.000        0.000        0.000
#> m201 *           0.000           0.008        0.053        0.053
#> m202 *           0.149           0.331        0.000        0.000
#> ----
#>  *:mediator,-:joint mediator
#>  Coefficients: estimated coefficients; P-Value 2:Tests of relationship with the Predictor

data.m.m.c.e2.2<-data.org.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),
alpha1=0.05,alpha2=0.05)
summary(data.m.m.c.e2.2,only=TRUE)
#> Identified as mediators:
#> [1] "m11" "m12"
#> Selected as covariates:
#> [1] "m01" "m09"
#> Tests:
#>       P-Value 1.y1 P-Value 1.y2 P-Value 2.x1 P-Value 2.x2
#> m01          0.000        0.053        0.607        0.533
#> m09          0.043        0.993        0.678        0.292
#> m11 *        0.000        0.000        0.653        0.000
#> m12 *        0.003        0.148        0.003        0.000
#> ----
#>  *:mediator,-:joint mediator
#>  P-Value 1:univariate relationship test with the outcome;P-Value 2:Tests of relationship with the Predictor

##Third-Variable Effect Analysis Next, we use med.big function to estimate the TVE using the results from data.org.big function.

### Binary predictor and continuous outcome

med.e1<-med.big(data.e1)

To show the results:

med.e1
#> $y #> pred #> de 0.01857787 #> m161 0.00000000 #> m162 0.98931416 #> m11 1.21150833 #> m12 0.52051526 #> te 2.73991562 ### Survival outcome For survival outcome, the default option is to fit the final full model using Cox proportional hazard model. ### Multivariate outcomes med.m.m.c.e2.2<-med.big(data.m.m.c.e2.2) med.m.m.c.e2.2 #>$y1
#>            x1        x2
#> de  0.2494818 0.9580249
#> m11 0.1204943 0.7289243
#> m12 0.4290098 0.3444318
#> te  0.7989859 2.0313811
#>
#> \$y2
#>              x1          x2
#> de  0.000000000 0.000000000
#> m11 0.063517400 0.384245356
#> m12 0.003373375 0.002708325
#> te  0.066890775 0.386953680

Finally, in the mmabig package, the relationship between third variables and the exposure variable(s) are fitted through generalized smoothing splines to allow the fit of potential nonlinear relationships. In the package, the ns() function is used to generate the B-spline basis matrix for a natural cubic spline. The argument “df” in the med.big function is used to assign the degrees of freedom for the spline basis matrix. By default, the degree of freedom is 1, which is to fit a linear relationship.

We also allow generalized linear models to fit the relationship between outcomes and all predictors in the full model. The argument used in the med.big function to define the generalized linear model is “family1”. It is a list with the ith item defines the conditional distribution of the ith outcome, y[,i] given the predictors, and the linkage function that links the mean of the outcome with the system component if generalized linear model is used as the final full model. The default value of “family1”” is gaussian(link=“identity”) for continuous outcomes, and binomial(link = “logit”) for binary ones.

## Combined function for multiple TVE analysis with big data sets

The “mma.big” is a function that automatically identify potential MCs, based on which to make statistical inference on the mediation effects for high-dimensional data sets. Bootstrap method is used to make inferences on the TVE. The summary function is used to summarize inference results. In the summary function, three different sets of confidence intervals are calculated: based on the normal assumption of bootstrap method (lwbd, upbd), on the quantiles (lwbd_q,upbd_q), and on the confidence ball (lwbd_b, upbd_b) .

set.seed(1)
n=100
pred=rbinom(n,1,0.5)
m1=matrix(rnorm(n*10),n,10)
m2<-matrix(rnorm(n*10,mean=pred,sd=1),n,10)
m3.1=m2[,6:10]
m3=m3.1
m2=m2[,1:5]
m3[m3.1<=0.1]=0
m3[0.1<m3.1 & m3.1<=1]=1
m3[m3.1>1]=2
m3<-apply(m3,2,as.factor)
m<-data.frame(m1,m2,m3)
colnames(m)<-c(paste("m0",1:9,sep=""),paste("m",10:20,sep=""))

lu<--0.5363+0.701*pred+0.801*m[,1]+0.518*m[,2]+1.402*m[,11]+0.773*m[,12]+
ifelse(m[,16]=="2",2.15,0)+ifelse(m[,16]=="1",0.201,0)
y<-rnorm(n,lu,1)

mma.e1<-mma.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),
alpha=1,alpha1=0.05,alpha2=0.05,n2=3)  #use only the test results.
#> [1] 1
#> [1] 2
#> [1] 3
summary(mma.e1)
#>
#> The mediaiton effects:
#>
#> For the response variable,y,
#>        pred.de pred.m161 pred.m162 pred.m11 pred.m12 pred.te
#> est      0.019     0.000     0.989    1.212    0.521   2.740
#> mean     0.150    -0.012     1.046    1.201    0.528   2.913
#> sd       0.205     0.025     0.143    0.300    0.197   0.353
#> upbd_q   0.368     0.010     1.197    1.469    0.738   3.247
#> lwbd_q   0.003    -0.037     0.936    0.902    0.393   2.576
#> upbd     0.421     0.048     1.270    1.800    0.906   3.432
#> lwbd    -0.384    -0.048     0.708    0.623    0.135   2.047
#> upbd_b   0.065    -0.007     1.207    1.237    0.754   2.916
#> lwbd_b   0.000    -0.038     0.932    0.884    0.440   2.558

mma.big also accepts the organized dataset from data.org.big as the first argument.

mma.e1.2<-mma.big(data=data.e1.2,alpha1=0.05,alpha2=0.05,n2=3)
#> [1] 1
#> [1] 2
#> [1] 3
summary(mma.e1.2,RE=TRUE)
#> The relative effects:
#>
#> For the response variable,y,
#>        pred.de pred.m161 pred.m162 pred.m11 pred.m12 pred.te
#> est      0.007         0     0.361    0.442    0.190       1
#> mean     0.089         0     0.351    0.374    0.185       1
#> sd       0.069         0     0.050    0.020    0.085       0
#> upbd_q   0.132         0     0.404    0.396    0.267       1
#> lwbd_q   0.016         0     0.321    0.360    0.105       1
#> upbd     0.141         0     0.459    0.482    0.358       1
#> lwbd    -0.128         0     0.263    0.402    0.022       1

### Plots of the fitted mma object from boot.med

plot.mma.big() plots the marginal effect of the selected variable on the outcome, and the marginal effect of the predictor on the selected variable.

#plot(mma.e1.2,vari="m16")
plot(mma.e1.2,vari="m11")

In the above figure, the upper panel shows the coefficient of “m11” when it is used to estimate y in an elasticnet regression with y being the outcome in bootstrap samples. We see that the coefficient is significantly positive. The lower panel shows when the exposure variable increases by 1 unit, the average change in “m11”. Again we see that the change in “m11” is positive. Therefore, there is a significant indirect effect (positive) through “m11”. As is shown by summary results, all confidence intervals for the indirect effect of “m11” are to the right of 0.

### Survival outcome

lambda=1/500
survt=-log(runif(n))/lambda/exp(lu)
st=round(runif(n,1,500),0)
time=ifelse(st+survt>600,600,st+survt)-st
cen=ifelse(st+survt>600,0,1)
y=Surv(time,cen)

mma.e3<-mma.big(x=m,y=data.frame(y),mediator=1:ncol(m),pred=data.frame(pred),alpha=1,alpha1=0.05,
alpha2=0.05,n2=3)  #use only the test results.
#> [1] 1
#> [1] 2
#> [1] 3
mma.e3.2<-mma.big(data=data.e3.2,alpha1=0.05,alpha2=0.05,n2=3)  #use only the test results.
#> [1] 1
#> [1] 2
#> [1] 3
summary(mma.e3)
#>
#> The mediaiton effects:
#>
#> For the response variable,y,
#>        pred.de pred.m161 pred.m162 pred.m11 pred.m12 pred.te
#> est      0.201    -0.037     0.683    0.809    0.530   2.185
#> mean     0.182    -0.025     0.581    0.906    0.412   2.055
#> sd       0.160     0.047     0.183    0.451    0.078   0.701
#> upbd_q   0.297     0.005     0.719    1.357    0.460   2.697
#> lwbd_q   0.012    -0.075     0.388    0.506    0.328   1.367
#> upbd     0.514     0.056     1.041    1.693    0.683   3.560
#> lwbd    -0.112    -0.130     0.325   -0.075    0.377   0.811
#> upbd_b   0.245     0.005     0.645    0.844    0.460   2.108
#> lwbd_b   0.000    -0.079     0.375    0.489    0.453   1.328
summary(mma.e3.2,RE=TRUE,quant=FALSE)
#> The relative effects:
#>
#> For the response variable,y,
#>        pred.de pred.m161 pred.m162 pred.m171 pred.m172 pred.m11 pred.m12
#> est      0.124    -0.002     0.347     0.000     0.000    0.375    0.156
#> mean     0.075    -0.010     0.338     0.013     0.011    0.394    0.178
#> sd       0.111     0.008     0.013     0.014     0.014    0.030    0.066
#> upbd_q   0.194    -0.001     0.348     0.027     0.026    0.422    0.234
#> lwbd_q   0.001    -0.015     0.324     0.001     0.000    0.366    0.111
#> upbd     0.342     0.015     0.373     0.028     0.028    0.433    0.284
#> lwbd    -0.095    -0.019     0.321    -0.028    -0.028    0.317    0.027
#>        pred.te
#> est          1
#> mean         1
#> sd           0
#> upbd_q       1
#> lwbd_q       1
#> upbd         1
#> lwbd         1

We can also plot the selected variables on the survival outcome.

plot(mma.e3.2,vari="m16")

plot(mma.e3.2,vari="m11")

## References

Li, Bin, Qingzhao Yu, L. Zhang, and M. Hsieh. 2021. “Regularized Multiple Mediation Analysis for Big Data Sets.” Statistics and Its Interface 14: 449–58.
Yu, Qingzhao, and Bin Li. 2017. “Mma: An r Package for Mediation Analysis with Multiple Mediators.” Journal of Open Research Software 5 (1): 11. https://doi.org/10.5334/jors.160.
———. 2021. “A Multivariate Multiple Mediation Analysis with an Application to Explore Racial and Ethnic Disparities in Obesity.” Journal of Applied Statistics, no. 4: 750–64. https://doi.org/https://doi.org/10.1080/02664763.2020.1738359.
———. 2022. Statistical Methods for Mediation, Confounding and Moderation Analysis Using r and SAS. Chapman & Hall/CRC. https://www.routledge.com/Statistical-Methods-for-Mediation-Confounding-and-Moderation-Analysis/Yu-Li/p/book/9780367365479.