From GWAS Summary Statistics to Credible Sets

Anna Hutchinson

Z scores to PPs

Maller et al. derive a method for calculating posterior probabilities of causality (PPs) from GWAS summary statistics (see their Supplementary text), on which the following is based. For each of the \(k\) SNPs in a genomic region, let \(\beta_i\), \(i=1,...,k\), be the regression coefficient from a single-SNP logistic regression model, quantifying the association between SNP \(i\) and the disease. Assume that there is only one causal variant (CV) per region and that it is typed in the study. Then, if SNP \(i\) is causal, \(\beta_i\neq 0\) and \(\beta_j\) (for \(j\neq i\)) is non-zero only through LD between SNPs \(i\) and \(j\). No parametric assumptions are required for \(\beta_i\) yet, so we write that it is sampled from some unspecified distribution, \(\beta_i \sim \text{[ ]}\). The likelihood is then \[\begin{equation} \begin{split} P(D|\beta_i\sim\text{[ ]},\text{ }i\text{ causal}) & = P(D_i |\beta_i\sim\text{[ ]},\text{ }i\text{ causal}) \times P(D_{-i}|D_i,\text{ }\beta_i\sim\text{[ ]},\text{ }i\text{ causal})\\ & = P(D_i |\beta_i\sim\text{[ ]},\text{ }i\text{ causal}) \times P(D_{-i}|D_i,\text{ }i\text{ causal})\,, \end{split} \end{equation}\]

since \(D_{-i}\) is independent of \(\beta_i\) given \(D_i\). Here, \(D\) is the genotype data (0, 1 or 2 counts of the minor allele) for the entire genomic region and \(i\) is a SNP in the region, such that \(D_i\) and \(D_{-i}\) are the genotype data at SNP \(i\) and at the remaining SNPs in the genomic region, respectively.

Parametric assumptions can now be placed on SNP \(i\)’s true effect on disease. This effect is typically quantified as a log odds ratio and is assumed to be sampled from a Gaussian distribution, \(\beta_i\sim N(0,W)\), where \(W\) is chosen to reflect the researcher’s prior belief about the variability of the true odds ratio (OR). Conventionally \(W\) is set to \(0.2^2\), reflecting a belief that 95% of odds ratios lie between \(\exp(-1.96\times 0.2)=0.68\) and \(\exp(1.96\times 0.2)=1.48\).
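The interval quoted for the conventional prior can be checked with a few lines of arithmetic. The snippet below is only an illustrative sketch; the variable names are chosen here and do not come from the original text.

```python
import math

# Prior variance of the log odds ratio, as in the conventional choice W = 0.2^2.
W = 0.2 ** 2
prior_sd = math.sqrt(W)

# 1.96 is the usual two-sided 95% quantile of the standard normal distribution.
z_95 = 1.96

# Central 95% prior interval, moved from the log odds ratio scale to the OR scale.
lower_or = math.exp(-z_95 * prior_sd)
upper_or = math.exp(z_95 * prior_sd)

print(f"95% prior interval for the OR: {lower_or:.2f} to {upper_or:.2f}")
```

Changing `W` in this sketch shows directly how a wider or narrower prior translates into the range of odds ratios considered plausible.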

The posterior probability of causality for each SNP \(i\) in an associated genomic region with \(k\) SNPs is then defined as \[\begin{equation} PP_i=P(\beta_i \sim N(0,W),\text{ }i \text{ causal}|D)\,, \quad i \in \{1,...,k\}. \end{equation}\]

Under the assumption that each SNP is a priori equally likely to be causal, \[\begin{equation} P(\beta_i \sim N(0,W),\text{ }i\text{ causal})=\dfrac{1}{k}\,, \quad i \in \{1,...,k\} \end{equation}\] and Bayes’ theorem can be used to write \[\begin{equation} \begin{aligned} PP_i=P(\beta_i \sim N(0,W),\text{ }i \text{ causal}|D)\propto P(D|\beta_i\sim N(0,W),\text{ }i\text{ causal}). \end{aligned} \end{equation}\]

Dividing through by the probability of the genotype data given the null model of no genetic effect, \(H_0\), which is the same for every SNP and so preserves the proportionality, yields a likelihood ratio, \[\begin{equation} PP_i\propto \dfrac{P(D|\beta_i \sim N(0,W),\text{ }i \text{ causal})}{P(D|H_0)}\,, \end{equation}\]

from which Equation (1) can be used to derive, \[\begin{equation} PP_i\propto \frac{P(D_i|\beta_i \sim N(0,W),\text{ }i \text{ causal})}{P(D_i|H_0)}= BF_i\,, \end{equation}\] where \(BF_i\) is the Bayes factor for SNP \(i\), measuring the ratio of the probabilities of the data at SNP \(i\) given the alternative (SNP \(i\) is causal) and the null (no genetic effect) models.
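The step from the full-data likelihood ratio to the single-SNP Bayes factor can be spelled out as follows. This is a sketch that assumes, in addition to Equation (1), that the distribution of the remaining genotypes given those at SNP \(i\) is the same under the null model as under the model in which \(i\) is causal, so that \(P(D_{-i}|D_i,\text{ }i\text{ causal})=P(D_{-i}|D_i,H_0)\). That assumption is stated here for illustration and should be checked against Maller et al. \[\begin{aligned} \frac{P(D|\beta_i \sim N(0,W),\text{ }i \text{ causal})}{P(D|H_0)} &= \frac{P(D_i|\beta_i \sim N(0,W),\text{ }i \text{ causal}) \times P(D_{-i}|D_i,\text{ }i\text{ causal})}{P(D_i|H_0)\times P(D_{-i}|D_i,H_0)}\\ &= \frac{P(D_i|\beta_i \sim N(0,W),\text{ }i \text{ causal})}{P(D_i|H_0)} = BF_i\,. \end{aligned}\]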

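In genetic association studies, where sample sizes are usually large, these BFs can be approximated using Wakefield’s asymptotic Bayes factors (ABFs). Writing \(\hat\beta_i\) for the estimated effect of SNP \(i\) and \(V_i\) for the variance of that estimate, and given that \(\hat\beta_i\sim N(\beta_i,V_i)\) and \(\beta_i\sim N(0,W)\),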

\[\begin{equation} PP_i\propto BF_i \approx ABF_i=\sqrt{\frac{V_i}{V_i+W}}\exp\left(\frac{Z_i^2}{2}\frac{W}{V_i+W}\right)\,, \end{equation}\] where \(Z_i^2=\dfrac{\hat\beta_i^2}{V_i}\) is the squared marginal \(Z\) score for SNP \(i\).
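The ABF formula can be evaluated directly from a SNP's \(Z\) score and the variance of its effect estimate. The following is a minimal sketch rather than a reference implementation: the function name `log_abf`, its argument names and the example numbers are all hypothetical. It works on the log scale, since the exponential term can overflow for strongly associated SNPs.

```python
import math

def log_abf(z, v, w=0.2 ** 2):
    """Log of the asymptotic Bayes factor for one SNP.

    z: marginal Z score (beta_hat divided by its standard error)
    v: variance of beta_hat (the squared standard error)
    w: prior variance of the true log odds ratio
    """
    if v <= 0 or w <= 0:
        raise ValueError("v and w must be positive")
    shrinkage = w / (v + w)
    # log ABF = 0.5 * log(V / (V + W)) + (Z^2 / 2) * W / (V + W)
    return 0.5 * math.log(v / (v + w)) + 0.5 * z * z * shrinkage

# Hypothetical summary statistics for a single SNP (illustrative values only).
beta_hat = 0.15
standard_error = 0.03
example_z = beta_hat / standard_error
example_log_abf = log_abf(example_z, standard_error ** 2)
print(example_log_abf)
```

Only \(Z_i\), \(V_i\) and the prior variance \(W\) are needed, which is why the method can be run on summary statistics without individual-level genotype data.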

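In Bayesian fine-mapping, PPs are calculated for all SNPs in the genomic region. Because the expressions above define the PPs only up to proportionality, and exactly one SNP in the region is assumed to be causal, the PPs are normalised to sum to one, \(PP_i=ABF_i/\sum_{j=1}^{k}ABF_j\). The variants are then sorted into descending order of their PP, and the PPs are cumulatively summed until some threshold, \(\alpha\), is exceeded. The variants required to exceed this threshold form the \(\alpha\)-level credible set.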
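The credible set construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the function names `posterior_probs` and `credible_set` are hypothetical, the inputs are assumed to be log ABFs for the \(k\) SNPs in one region, and the PPs are obtained by normalising the ABFs to sum to one, as the single causal variant assumption implies.

```python
import math

def posterior_probs(log_abfs):
    """Turn per-SNP log ABFs into PPs that sum to one."""
    # Subtracting the maximum before exponentiating avoids overflow.
    largest = max(log_abfs)
    weights = [math.exp(value - largest) for value in log_abfs]
    total = sum(weights)
    return [weight / total for weight in weights]

def credible_set(pps, threshold=0.95):
    """Return (SNP indices in the set, cumulative PP of the set)."""
    # Indices of the SNPs, ordered from largest to smallest PP.
    order = sorted(range(len(pps)), key=lambda i: pps[i], reverse=True)
    chosen = []
    cumulative = 0.0
    for index in order:
        chosen.append(index)
        cumulative += pps[index]
        # Stop as soon as the running total exceeds the threshold.
        if cumulative > threshold:
            break
    return chosen, cumulative

# Hypothetical log ABFs for a four-SNP region (illustrative values only).
example_pps = posterior_probs([0.5, 6.0, 4.5, -0.2])
example_set, example_size = credible_set(example_pps, threshold=0.95)
print(example_set, example_size)
```

The second return value is the summed PP of the selected variants, which is at least as large as the threshold whenever the threshold is below one.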