Using GxEScanR

This vignette shows some examples of using GxEScanR to perform genome-wide association study (GWAS) and genome-wide by environment interaction study (GWEIS) scans using all the options available to the user.

library(GxEScanR)

Introduction

With growing number of SNPs that can be imputed it is necessary to have software that can efficiently perform GWAS and GWEIS scans. GxEScanR can do this using files that were saved in the BinaryDosage format. The BinaryDosage package can convert VCF and GEN files into the BinaryDosage format. The BinaryDosage format was designed to keep the file with the genetic data small with fast read times. GxEScanR uses this and efficient large scale regression routines to perform GWAS and GWEIS scans quickly.

Example Files

The examples below use three sample files. The first contains a data frame that has subject data. The second file is a genetic data file in the BinaryDosage format. The last file contains the information returned by the BinaryDosage::getbdinfo routine that returns information about the binary dosage file that makes reading it fast.

covdatafile <- system.file("extdata", "covdata.rds", package = "GxEScanR")
covdata <- readRDS(covdatafile)
First 5 Subjects
sid y e
I1 0 0
I2 0 0
I3 0 0
I4 0 0
I5 0 0

To load the binary dosage information file, it is necessary to update the file name since the file has been moved from its original location during the installation process. The following loads the binary dosage information and updates the file name.

bdinfofile <- system.file("extdata", "pdata_4_1.bdinfo", package = "GxEScanR")
bdinfo <- readRDS(bdinfofile)
bdinfo$filename <- system.file("extdata", "pdata_4_1.bdose", package = "GxEScanR")
Models Fit
model outcome predictors subjects
D|E,G Phenotype All covariates, gene All
D|E,G,GxE Phenotype All covariates, gene, gene x last covariate All
E|G Last covariate All other covariates, gene All
E|G,D=1, case only Last covariate All other covariates, gene Cases
E|G,D=0 control only Last covariate All other covariates, gene Controls

Examples

Linear Regression GWAS

The simplest scan to do is a linear regression GWAS. The following model is first when doing a linear regression GWAS.

Model Fit
model outcome predictors subjects
D|E,G Phenotype All covariates, gene All

In the example data set, the phenotype, y, is coded 0,1. When GxEScanR sees the phenotype codes this way it assumes the outcome is binary and uses logistic regression. To perform a linear regression GWAS the binary option needs to be set to FALSE. The following shows how to do a linear GWAS along with the results.

The routine outputs the number of subjects used in the analysis and returns a data frame with the results.

lingwas1 <- gwas(data = covdata,
                 bdinfo = bdinfo,
                 binary = FALSE)
#> [1] "200 subjects have complete data"
Linear Regression GWAS
snp betag lrtg
1:10001 0.1605708 9.0532054
1:10002 -0.0220380 0.1219035
1:10003 -0.0732687 1.3812032
1:10004 0.0326160 0.2355224
1:10005 0.0627953 1.2072107

The output can be redirected to output file that produces a plain test version of the results in a tab delimited file that can be read into R using the read.table routine. In this case, the gwas routine returns a value of 0.

outfile <- tempfile()
lingwas2 <- gwas(data = covdata,
                 bdinfo = bdinfo,
                 outfile = outfile,
                 binary = FALSE)
#> [1] "200 subjects have complete data"
lingwas2
#> [1] 0
lingwas2 <- read.table(outfile, header = TRUE, sep ='\t')
Linear Regression GWAS
SNPID betag lrtg
1:10001 0.1605708 9.0532054
1:10002 -0.0220380 0.1219035
1:10003 -0.0732687 1.3812032
1:10004 0.0326160 0.2355224
1:10005 0.0627953 1.2072107

Linear Regression GWEIS

The gweis routine takes the same parameters as the gwas function but performs additional tests. The models fit for a linear regression GWAS are.

Models Fit
model outcome predictors subjects
D|E,G Phenotype All covariates, gene All
D|E,G,GxE Phenotype All covariates, gene, gene x last covariate All

Note: When doing a GWEIS the interaction covariate is in the last column of the subject data frame.

In this test the minmaf option was used. When minmaf is specified the minor allele for a SNP must exceed minmaf to be test. Notice that only 5 SNPs are in the output data frame. This is because one of the SNPs has a minor allele frequency below 0.2.

lingweis1 <- gweis(data = covdata,
                   bdinfo = bdinfo,
                   minmaf = 0.2,
                   binary = FALSE)
#> [1] "200 subjects have complete data"
Linear Regression GWEIS
snp betadg lrtdg betagxe lrtgxe lrt2df
1:10001 0.1605708 9.0532054 0.0777308 0.5402899 9.593495
1:10002 -0.0220380 0.1219035 0.1484303 1.3479068 1.469810
1:10004 0.0326160 0.2355224 0.1968781 2.0769149 2.312437
1:10005 0.0627953 1.2072107 -0.0442079 0.1401332 1.347344

If the user is interested in see what happened to SNPs that weren’t included in the data frame, the skipfile option can be used. The skipfile value is the name of a file to write the skipped SNPs to. The skipfile option can be used along with the outfile option. The skip file is in the same format as the output file. Below is an example using the skipfile option.

skipfile = tempfile()
lingweis2 <- gweis(data = covdata,
                   bdinfo = bdinfo,
                   skipfile = skipfile,
                   minmaf = 0.2,
                   binary = FALSE)
#> [1] "200 subjects have complete data"
Linear Regression GWEIS
snp betadg lrtdg betagxe lrtgxe lrt2df
1:10001 0.1605708 9.0532054 0.0777308 0.5402899 9.593495
1:10002 -0.0220380 0.1219035 0.1484303 1.3479068 1.469810
1:10004 0.0326160 0.2355224 0.1968781 2.0769149 2.312437
1:10005 0.0627953 1.2072107 -0.0442079 0.1401332 1.347344
knitr::kable(skipsnps, caption = "Skipped SNPs")
Skipped SNPs
SNPID reasondg reasongxe
1:10003 2 2

The following table lists the reasons SNPs were skipped given the skipped value.

Skipped Reasons
code reason
1 Excluded by user
2 Minor allele frequency below threshold
3 X matrix singular
4 Singular matrix updating beta top
5 Singular matrix updating beta bottom
6 Maximum iterations exceeded

Logistic Regression GWAS

In this example, the phenotype is coded (0,1). The gwas and gweis routines check for this an will run logistic regressions if the outcome is coded (0,1) unless binary is set to FALSE. If the use wants to make sure the outcome is coded (0,1), the user may set binary to TRUE. In this case, if the outcome is not coded (0,1) an error is produced.

The following model is fit when doing a logistic regression GWAS.

Model Fit
model outcome predictors subjects
D|E,G Phenotype All covariates, gene All
loggwas1 <- gwas(data = covdata,
                 bdinfo = bdinfo,
                 blksize = 2,
                 binary = TRUE)
#> [1] "200 subjects have complete data"
Logistic Regression GWAS
snp betag lrtg
1:10001 0.7129 9.0230
1:10002 -0.0943 0.1219
1:10003 -0.3144 1.3777
1:10004 0.1397 0.2356
1:10005 0.2680 1.2003

In this example, the option blksize is used. When an analysis is run several SNPs are read in at one time. This saves disk time. The following are the default values for given the number of subjects. These values were chosen to keep the program running using less than 4GB of RAM. The user is allowed to specify a value up to twice the default value. Little performance gain is seen going with larger values. If the user enters 0 for blksize, the default value is used.

Default blksize
subjects blksize
less than 10,000 5000
10,000 to 24,999 2000
25,000 to 49,999 1000
50,000 to 99,999 500
100,000 to 249,999 200
250,000 to 499,999 100
500,000 or greater 50

Logistic Regression GWEIS

A logistic regression GWEIS fits an additional 4 models that produce 7 more tests. 3 of these models use the the interaction covariate as the outcome. The following show all the models fit in a logistic regression GWEIS.

The following model is fit when doing a logistic regression GWAS.

Models Fit
model outcome predictors subjects
D|E,G Phenotype All covariates, gene All
D|E,G,GxE Phenotype All covariates, gene, gene x last covariate All
E|G Last covariate All other covariates, gene All
E|G,D=1, case only Last covariate All other covariates, gene Cases
E|G,D=0 control only Last covariate All other covariates, gene Controls

Note: When doing a GWEIS the interaction covariate is in the last column of the data frame.

Logistic Regression GWEIS with Binary Covariate

In the example subject data, the covariate is coded (0,1). In this case, the gweis routine will use logistic regression to fit the last 3 models.

loggweis1 <- gweis(data = covdata,
                   bdinfo = bdinfo,
                   snps = 1:2,
                   binary = TRUE)
#> [1] "200 subjects have complete data"
Logistic Regression GWEIS
snp betadg lrtdg betagxe lrtgxe lrt2df betaeg lrteg lrt3df betacase lrtcase betactrl lrtctrl
1:10001 0.7129 9.0230 0.4931 0.9520 9.9750 0.4115 3.3606 13.3356 0.4290 2.0147 -0.0477 0.0147
1:10002 -0.0943 0.1219 0.6508 1.3569 1.4789 0.0798 0.0910 1.5699 0.3882 1.1018 -0.2562 0.3735

In this example the snps options was used. The snps option can either be a vector of indices indicating what SNPs to include or a list of SNPs by SNP ID. A vector of indices was used in this example.

Logistic Regression GWEIS with a Continuous Covariate

In the example subject data, the covariate is coded (0,1) which the gweis routine sees a binary covariate to make the routine do a linear regression 1 can be added to the interaction covariate. This will change the coding to (1,2) which the routine sees as a continuous covariate.

covdata2 <- covdata
covdata2$e <- covdata2$e + 1
loggweis2 <- gweis(data = covdata2,
                   bdinfo = bdinfo,
                   snps = c("1:10001", "1:10002"))
#> [1] "200 subjects have complete data"
Logistic Regression GWEIS
snp betadg lrtdg betagxe lrtgxe lrt2df betaeg lrteg lrt3df betacase lrtcase betactrl lrtctrl
1:10001 0.7129 9.0230 0.4931 0.9520 9.9750 0.1001 3.4031 13.3781 0.1044 2.0128 -0.0098 0.0147
1:10002 -0.0943 0.1219 0.6508 1.3569 1.4789 0.0194 0.0912 1.5700 0.0947 1.0967 -0.0516 0.3688

In this example the snps options was used with the SNP IDs.