bmass Introduction – Simulated Data

Michael C Turchin

2019-05-16

This is a very simple, introductory vignette on how to run bmass. In this introductory example, we will use a small, simulated dataset to run bmass and explore some basic output. The dataset is provided with the package.

The purpose of this vignette is to quickly allow users to get up and running with bmass. For a more advanced introductory vignette, where we download and use the GlobalLipids 2013 dataset as an example, please see here.

Running bmass

For this introductory vignette, we will be using bmass_SimulatedData1, bmass_SimulatedData2, and bmass_SimulatedSigSNPs. These are simulated datasets created for the purposes of unit testing and this vignette. To load up ‘bmass’ and the datasets, run the following:

library("bmass");
data(bmass_SimulatedData1, bmass_SimulatedData2, bmass_SimulatedSigSNPs)
Phenotypes <- c("bmass_SimulatedData1", "bmass_SimulatedData2")

First, we’ll take a look at the format of one of our simulated datasets:

head(bmass_SimulatedData1)
##   Chr    BP Marker  MAF A1 A2 Direction  pValue    N
## 1   1  1000    rs1 0.20  A  G         - 6.0e-13 2500
## 2   1  2000    rs2 0.10  G  T         - 7.6e-02 2467
## 3   2  3000    rs3 0.06  T  G         + 3.0e-09 2761
## 4   3  4000    rs4 0.40  C  A         + 5.6e-03 2310
## 5   3 15000    rs5 0.40  C  T         + 1.0e-07 2410
## 6   3 21000    rs6 0.37  G  A         + 5.0e-08 2582

Here you should see nine columns: Chr, BP, Marker, MAF, A1, A2, Direction, pValue, N. bmass expects each input dataset (representing a single phenotype GWAS) to contain these 9 columns and with these specific headers. Their descriptions are as follows –

Now to run bmass, all we use is the following code:

bmassResults <- bmass(Phenotypes, bmass_SimulatedSigSNPs);

Note that at its most basic level, we only need to provide two inputs for bmass() to properly run: a vector of strings containing the variable names of each of our univariate summary GWAS datasets and a file containing the list of previously associated univariate genome-wide significant SNPs. The former input, the vector of strings, dictates to bmass() where to find each dataset and what to call each phenotype being analyzed (ie the name of each variable containg the GWAS summary data is expected to correspond to the phenotype represented by that data). The latter input is just a list of each previous GWAS SNP with the following two columns of information, Chr and BP:

bmass_SimulatedSigSNPs
##   Chr   BP
## 1   1 1000
## 2   2 3000

Exploring basic results

Now to begin to get a sense of what bmass() outputs, we can run the following:

summary(bmassResults)
##                       Length Class  Mode     
## MarginalSNPs            3    -none- list     
## PreviousSNPs            4    -none- list     
## NewSNPs                 3    -none- list     
## LogFile                19    -none- character
## ZScoresCorMatrix        4    -none- numeric  
## Models                 18    -none- numeric  
## ModelPriors           126    -none- numeric  
## GWASlogBFMinThreshold   1    -none- numeric

As you can see, there are multiple output variables corresponding to different components of the results. For the purposes of this vignette, we will only touch on the first three variables shown from this summary() output: PreviousSNPs, MarginalSNPS, and NewSNPs. PreviousSNPs refers to information on the input GWAS SNPs we used for the analysis, MarginalSNPs refers to information on the the marginally significant SNPs that were used in the final steps of the analysis, and NewSNPs refers to information on any SNPs that reach multivariate genome-wide levels of significance.

The main result most users will be immediately interested in is whether there are any new variants identified as genome-wide multivariate significant, and this can be determined by accessing the NewSNPs sublist:

summary(bmassResults$NewSNPs)
##            Length Class      Mode   
## SNPs       22     data.frame list   
## logBFs     27     -none-     numeric
## Posteriors 27     -none-     numeric
head(bmassResults$NewSNPs$SNPs)
##    ChrBP Chr   BP Marker  MAF A1 bmass_SimulatedData1_A2
## 9 4_7000   4 7000    rs8 0.15  G                       C
##   bmass_SimulatedData1_Direction bmass_SimulatedData1_pValue
## 9                              +                       7e-08
##   bmass_SimulatedData1_N bmass_SimulatedData1_ZScore
## 9                   2514                    5.391171
##   bmass_SimulatedData2_Direction bmass_SimulatedData2_pValue
## 9                              -                       6e-08
##   bmass_SimulatedData2_N bmass_SimulatedData2_ZScore GWASannot   mvstat
## 9                   2514                   -5.418801         0 145.0667
##   mvstat_log10pVal  unistat unistat_log10pVal Nmin logBFWeightedAvg
## 9         31.50084 29.36341          7.221849 2514           29.107

As you can see in this example, we have 1 new SNP that reaches multivariate genome-wide significance. This is determined by using the smallest log BF weighted average value among the previous univariate significanct GWAS SNPs (PreviousSNPS):

bmassResults$GWASlogBFMinThreshold
## [1] 7.41173

Any MarginalSNPs with logBFWeightedAvg greater than this value are designated as new multivariate associations (see eq. 1 in Turchin and Stephens bioRxiv 2019).

For further exploration and details of bmass output, please see the second introductory vignette.