Bayesian Testing of Equal Genotype Proportions between Multiple Populations

Martin Lysy, Wookjung P. Kim, Terin Robinson

2024-02-26

This tutorial shows how to use the MADPop package to test for genetic differences between two populations, of which the individuals of a same species contain a variable number of alleles.

require(MADPop)
## Loading required package: MADPop
## Loading required package: rstan
## Loading required package: StanHeaders
## 
## rstan version 2.32.5 (Stan version 2.32.2)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
## change `threads_per_chain` option:
## rstan_options(threads_per_chain = 1)

Pre-Processing

Our data consists of \(N = 215\) recordings of Major Histocompatibility Complex (MHC) genotypes of lake trout from \(K = 11\) lakes in Ontario, Canada. For each of the fish, between 1-4 alleles in the MHC genotype are recorded. This is partially because duplicate genes are undetectable by current instrumentation, and possibly because the fish possess a variable number of alleles of a given MHC gene.

Our dataset fish215 is included with MADPop. A random sample from it looks like this:

head(fish215[sample(nObs),])
##             Lake  A1  A2  A3  A4
## 68  Michipicoten     r.3 r.4    
## 167    Kingscote r.1 r.8        
## 129       Dickey v.3 r.4        
## 162    Kingscote             w.5
## 43       Crystal     r.4     t.7
## 14         Hogan x.4 r.4

The first column is the lake name (or population ID) for each sample, the remaining four columns are for potentially recorded allele codes (A1-A4). Here the code to identify a unique allele is a small letter followed by a number, but it could have been the sequence of integers \(1, 2,\ldots, A\), which for the fish215 data is \(A = 57\) unique alleles.

It is relatively straightforward to import a CSV file into the format above. An example of this is given along with our raw data in the extdata directory of the local copy of the MADPop package.

Two-Population Comparisons

Suppose that we wish to compare two lakes, say Dickey and Simcoe. The allele counts in these lakes are in the table below. It is a subset of the full contingency table on all \(K = 11\) lakes, which is produced by the MADPop function UM.suff():

popId <- c("Dickey", "Simcoe")          # lakes to compare
Xsuff <- UM.suff(fish215)             # summary statistics for dataset
ctab <- Xsuff$tab[popId,]             # contingency table
ctab <- ctab[,colSums(ctab) > 0] # remove alleles with no counts
#ctab
rbind(ctab, Total = colSums(ctab))
##        1.20 1.4.5 1.4.9.45 1.5 10 20 20.54 27.28.29 3 4 4.10 4.11 4.20.27
## Dickey    1     1        0   2  0  2     1        1 0 0    1    0       1
## Simcoe    0     0        1   1  2  0     0        0 1 1    2    1       0
## Total     1     1        1   3  2  2     1        1 1 1    3    1       1
##        4.20.31 4.20.45 4.27 4.33 4.5 4.53.55 4.56 4.7 4.7.10 5 5.20 5.30 5.32
## Dickey       1       1    1    0   1       1    1   0      0 0    1    1    1
## Simcoe       0       0    0    1   1       0    0   3      1 1    0    0    0
## Total        1       1    1    1   2       1    1   3      1 1    1    1    1
##        5.7 7 9.11
## Dickey   1 0    0
## Simcoe   1 2    1
## Total    2 2    1

The unique allele identifiers are encoded as integers between \(1\) and \(A\) and separated by dots. The original allele names are stored in Xsuff$A, such that the genotype of the first column 1.20 is

gtype <- colnames(ctab)[1]
gtype <- as.numeric(strsplit(gtype, "[.]")[[1]])
gtype
## [1]  1 20
names(gtype) <- paste0("A", gtype)
sapply(gtype, function(ii) Xsuff$A[ii])
##    A1   A20 
## "r.1" "u.4"

There are \(C = 29\) genotype combinations observed in these two lakes, corresponding to each column of the table.

Multinomial Model

In the two-population problem we have \(K = 2\) lakes with \(N_1\) and \(N_2\) fish sampled from each. Let \(\boldsymbol Y_k = (Y_{k1}, \ldots Y_{kC})\) denote the counts for each genotype observed in lake \(k\), such that \(\sum_{i=1}^C Y_{ki} = N_k\). The sampling model for these data is \[ \boldsymbol Y_k \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,2, \] where \(\boldsymbol \rho_k = (\rho_{k1},...,\rho_{kC})\) are the population proportions of each genotype, and \(\sum_{i=1}^C \rho_{ki} = 1\).

Hypothesis Testing

Our objective is to test \[ \begin{split} H_0 &: \textrm{The two populations have the same genotype proportions} \\ & \phantom{: } \iff \boldsymbol \rho_1 = \boldsymbol \rho_2. \end{split} \] The classical test statistics for assessing \(H_0\) are Pearson’s Chi-Square statistic \(\mathcal X\) and the Likelihood Ratio statistic \(\Lambda\), \[ \mathcal X = \sum_{k=1}^2 \sum_{i=1}^C \frac{(N_k\hat\rho_i - Y_{ki})^2}{N_k\hat\rho_i}, \qquad \Lambda = 2 \sum_{k=1}^2 \sum_{i=1}^C Y_{ki} \log\left(\frac{Y_{ki}}{N_k\hat\rho_i}\right), \qquad \hat \rho_i = \frac{Y_{1i} + Y_{2i}}{N_1 + N_2}. \] Under \(H_0\), the asymptotic distribution of either of these test statistics \(T = \mathcal X\) or \(\Lambda\) is \(\chi^2_{(C-1)}\), such that the \(p\)-value \[ \textrm{p}_\textrm{v}= \textrm{Pr}(T > T_\textrm{obs} \mid H_0) \] for an observed value of \(T_\textrm{obs}\) can be estimated as follows:

# observed values of the test statistics
chi2.obs <- chi2.stat(ctab) # Pearson's chi^2
LRT.obs <- LRT.stat(ctab) # LR test statistic
T.obs <- c(chi2 = chi2.obs, LRT = LRT.obs)
# p-value with asymptotic calculation
C <- ncol(ctab)
pv.asy <- pchisq(q = T.obs, df = C-1, lower.tail = FALSE)
signif(pv.asy, 2)
##  chi2   LRT 
## 0.330 0.041

The Chi-Square and LR tests are asymptotically equivalent and so should give roughly the same \(p\)-values. The huge discrepancy observed here indicates that the sample sizes are too small for asymptotics to kick in. A more reliable \(p\)-value estimate can be obtained by the Bootstrap method, which in this case consists of simulating \(M\) contigency tables with \(\boldsymbol Y_k^{(m)} \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \hat{\boldsymbol \rho})\), \(m = 1, \ldots, M\), where \(\hat{\boldsymbol \rho}\) is the estimate of the common probability vector \(\boldsymbol \rho_1 = \boldsymbol \rho_2\) under \(H_0\). For each contingency table \((\boldsymbol Y_1^{(m)}, \boldsymbol Y_2^{(m)})\), we calculate the test statistic \(T^{(m)}\), and the bootstrapped p-value is defined as \[ \textrm{p}_\textrm{v}^{\textrm{boot}}= \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_\textrm{obs}). \] \(\textrm{p}_\textrm{v}^{\textrm{boot}}\) is calculated with MADPop as follows:

N1 <- sum(ctab[1,])                     # size of first sample
N2 <- sum(ctab[2,])                     # size of second sample
rho.hat <- colSums(ctab)/(N1+N2)        # common probability vector
# bootstrap distribution of the test statistics
# set verbose = TRUE for progress output
system.time({
  T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.hat, nreps = 1e4,
                      verbose = FALSE)
})
##    user  system elapsed 
##   0.224   0.014   0.240
# bootstrap p-value
pv.boot <- rowMeans(t(T.boot) >= T.obs)
signif(pv.boot, 2)
##   chi2    LRT 
## 0.0068 0.0059

Note that the bootstrap \(p\)-values for both tests are roughly the same and decisively reject \(H_0\), whereas the less reliable asymptotic \(p\)-values both failed to reject (at quite different significance levels).

Pairwise Comparisons between Multiple Populations

Bootstrapping overcomes many deficiencies of the asymptotic \(p\)-value calculation. However, bootstrapping has a tendency to reject \(H_0\) when sample sizes are small. To see why this is, consider all columns of ctab which have only one genotype count between the two lakes:

itab1 <- colSums(ctab) == 1             # single count genotypes
cbind(ctab[,itab1],
      Other = rowSums(ctab[,!itab1]),
      Total = rowSums(ctab))
##        1.20 1.4.5 1.4.9.45 20.54 27.28.29 3 4 4.11 4.20.27 4.20.31 4.20.45 4.27
## Dickey    1     1        0     1        1 0 0    0       1       1       1    1
## Simcoe    0     0        1     0        0 1 1    1       0       0       0    0
##        4.33 4.53.55 4.56 4.7.10 5 5.20 5.30 5.32 9.11 Other Total
## Dickey    0       1    1      0 0    1    1    1    0     7    20
## Simcoe    1       0    0      1 1    0    0    0    1    12    20

There are \(c_1 = 21\) such columns, accounting for \(\hat p_1 = 0.525\) of the common genotype distribution under \(H_0\), as estimated from the two-lake sample. For each of these columns, observing counts in one lake but not the other provides evidence against \(H_0\). Moreover, under the estimated common distribution \(\hat {\boldsymbol \rho}\), it is very unlikely to have counts in only one of the lakes for each of these \(c_1 = 21\) genotypes. Therefore, the data appear to provide very strong evidence against \(H_0\). However, it is not so unlikely to have \(c_1 = 21\) one-count genotypes if the true number of unique genotypes in these two lakes is much larger than the observed value of \(C = 29\). With \(C = 29\) unique genotypes in only \(N = N_1 + N_2 = 40\) fish samples, it is quite plausible that a new sample of fish would yield several genotypes which are not present in the original two-lake sample ctab.

One way to obtain information about the unobserved genotypes in lakes Dickey and Simcoe is to consider the genotypes in all \(K = 11\) Ontario lakes for which we have collected data. A natural way to do this is through a hierarchical model: \[ \begin{split} \boldsymbol Y_k \mid \boldsymbol \rho_k & \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_k), \quad k = 1,\ldots,K \\ \boldsymbol \rho_k & \stackrel{\textrm{iid}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha), \qquad \boldsymbol \alpha= (\alpha_1, \ldots, \alpha_C). \end{split} \] That is, each population is allowed to have its own probability vector \(\boldsymbol \rho_k\). However, these probability vectors are drawn from a common Dirichlet distribution. The common distribution specifies that before any data is drawn, we have \[ E[\boldsymbol \rho] = \bar{\boldsymbol \alpha}, \qquad \textrm{var}(\rho_i) = \frac{(\bar \alpha_i)(1-\bar \alpha_i)}{\alpha_0 + 1}, \] where \(\alpha_0 = \sum_{i=1}^C \alpha_i\) and \(\bar{\boldsymbol \alpha} = \boldsymbol \alpha/ \alpha_0\). Moreover, the posterior distribution of \(\boldsymbol \rho_k\) given \(\boldsymbol \alpha\) and the data is also Dirichlet: \[ \boldsymbol \rho_k \mid \boldsymbol Y\stackrel{\textrm{ind}}{\sim}\textrm{Dirichlet}(\boldsymbol \alpha+ \boldsymbol Y_k). \] In this sense, the posterior estimate of the proportion of genotype \(i\) in population \(k\), \(\rho_{ki}\), can be non-zero even if \(Y*{ki} = 0\). In practice, \(\boldsymbol \alpha\) is estimated from \(\boldsymbol Y\), the data from all \(K\) populations (details in the following section). The extent to which the genotypes observed in one lake affect inference in the other lakes is determined by \(\alpha*0\) (the larger this value, the larger the effect).

Parameter Estimation

The likelihood function for this model corresponds to a Dirichlet-Multinomial distribution, \[ \begin{split} \mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) = \prod_{k=1}^K p(\boldsymbol Y_k \mid \boldsymbol \alpha) & = \prod_{k=1}^K \int p(\boldsymbol Y_k \mid \boldsymbol \rho_k) p(\boldsymbol \rho_k \mid \boldsymbol \alpha) \,\mathrm{d}\boldsymbol \rho_k \\ & = \prod_{k=1}^K \left[\frac{N_k!\cdot\Gamma(\alpha_0)}{\Gamma(N_k+\alpha_0)} \prod_{i=1}^C \frac{\Gamma(Y_{ki} + \alpha_i)}{Y_{ki}!\cdot\Gamma(\alpha_i)}\right], \end{split} \] from which \(\boldsymbol \alpha\) can be estimated by maximum likelihood (Minka 2000). Alternatively we estimate \(\boldsymbol \alpha\) using a Bayesian approach, which is more readily extensible to more complex genotype models, for which \(\mathcal L(\boldsymbol \alpha\mid \boldsymbol Y)\) has no closed form (see Future Work).

First, we specify a prior distribution \[ \pi(\alpha_0, \bar{\boldsymbol \alpha}) = \frac{1}{(1+ \alpha_0)^2}, \] which is a uniform distribution on the prior probability vector \(\bar{\boldsymbol \alpha}\), and a uniform distribution on the variance-like parameter \(A = (1+\alpha_0)^{-1}\), in the sense that the prior mean and variance of a given genotype population probability are \[ E[\rho_{kj}] = \tilde \alpha_j, \qquad \textrm{var}(\rho_{kj}) = A \cdot \tilde \alpha_j(1-\tilde \alpha_j). \] Then, we sample from the posterior distribution \(p(\boldsymbol \alpha\mid \boldsymbol Y) \propto \mathcal L(\boldsymbol \alpha\mid \boldsymbol Y) \pi(\boldsymbol \alpha)\) using a Markov chain Monte Carlo (MCMC) algorithm provided by the Stan programming language (Stan Development Team 2016), as detailed below.

Bayesian Hypothesis Testing

Under the hierarchical model, the null hypothesis of equal genetic proportions between two populations, say \(k = 1,2\), is \(H_0: \boldsymbol \rho_1 = \boldsymbol \rho_2 = \boldsymbol \rho_{12}\). Our Bayesian hypothesis-testing strategy is implemented in two steps:

  1. Sample from the posterior distribution \(p(\boldsymbol \alpha\mid \boldsymbol Y, H_0)\).
  2. Sample from the posterior distribution of either classical test statistic, \(T = \mathcal X\) or \(\Lambda\), thus obtaining a posterior p-value \[ \textrm{p}_\textrm{v}^{\textrm{post}}= \textrm{Pr}(T > T_\textrm{obs} \mid \boldsymbol Y, H_0). \]

MCMC Sampling from the Posterior Distribution

In order to estimate \(\boldsymbol \alpha\) under \(H_0\), we apply the Dirichlet-Multinomial distribution to \(K-1\) lakes, with the first two being collapsed into a common count vector \(\boldsymbol Y_{12} = \boldsymbol Y_1 + \boldsymbol Y_2\) with sample size \(N_{12} = N_1 + N_2\). MCMC sampling is implemented via a Hybrid Monte Carlo (HMC) variant provided by the R package rstan. In MADPop, sampling from \(p(\boldsymbol \alpha\mid \boldsymbol Y, H_0)\) is accomplished with the function hUM.post(), where “hUM” stands for hierarchical Unconstrained Multinomial model. We start by merging the two samples from equal distributions under \(H_0\):

popId                                   # equal lakes under H0
## [1] "Dickey" "Simcoe"
eqId0 <- paste0(popId, collapse = ".")  # merged lake name
popId0 <- as.character(fish215$Lake)    # all lake names
popId0[popId0 %in% popId] <- eqId0
table(popId0, dnn = NULL)               # merged lake counts
##       Crystal Dickey.Simcoe         Hogan     Kingscote     Macdonald 
##            20            40            21            18            19 
##       Manitou  Michipicoten       Opeongo        Seneca         Slate 
##            20            20            20            18            19

Next, we sample from \(p(\boldsymbol \alpha\mid \boldsymbol Y, H_0)\) using rstan:

X0 <- cbind(Id = popId0, fish215[-1])  # allele data with merged lakes

nsamples <- 1e4
fit0 <- hUM.post(nsamples = nsamples, X = X0,
                 rhoId = eqId0,     # output rho only for merged lakes
                 chains = 1,  # next two arguments are passed to rstan
                 warmup = min(1e4, floor(nsamples/10)),
                 full.stan.out = FALSE)
## 
## SAMPLING FOR MODEL 'DirichletMultinomial' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 5.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.59 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
## Chain 1: Iteration: 1001 / 10000 [ 10%]  (Sampling)
## Chain 1: Iteration: 2000 / 10000 [ 20%]  (Sampling)
## Chain 1: Iteration: 3000 / 10000 [ 30%]  (Sampling)
## Chain 1: Iteration: 4000 / 10000 [ 40%]  (Sampling)
## Chain 1: Iteration: 5000 / 10000 [ 50%]  (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.83 seconds (Warm-up)
## Chain 1:                7.557 seconds (Sampling)
## Chain 1:                8.387 seconds (Total)
## Chain 1:

rstan typically outputs a number of messages and warnings, many of which are harmless. The MADPop package lists rstan as a dependency, such that its sophisticated tuning mechanism is exposed. Optionally, hUM.post() returns the entire rstan output (full.stan.out = TRUE), which can be run through rstan’s MCMC convergence diagnostics.

By setting the rhoId argument of hUM.post(), the fit0 object contains samples from \(p(\boldsymbol \alpha, \boldsymbol \rho_{12} \mid \boldsymbol Y, H_0)\). Boxplots of the 50 highest posterior probability genotypes in the common distribution \(\boldsymbol \rho_{12}\) are displayed below:

rho.post <- fit0$rho[,1,]   # p(rho12 | Y)
# sort genotype counts by decreasing order
rho.count <- colSums(Xsuff$tab[popId,])
rho.ord <- order(colMeans(rho.post), decreasing = TRUE)

# plot
nbxp <- 50            # number of genotypes for which to make boxplots
clrs <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2",
          "#D55E00", "#CC79A7")
rho.ct <- rho.count[rho.ord[1:nbxp]]
rho.lgd <- unique(sort(rho.ct))
rho.box <- rho.post[,rho.ord[1:nbxp]]
par(mar = c(5,4.5,.5,.5)+.1, cex.axis = .7)
boxplot(x = rho.box,
        las = 2, col = clrs[rho.ct+1], pch = 16, cex = .2)
title(xlab = "Genotype", line = 4)
title(ylab = expression(p(bold(rho)[(12)]*" | "*bold(Y))))
legend("topright", legend = rev(rho.lgd), fill = rev(clrs[rho.lgd+1]),
       title = "Counts", ncol = 2, cex = .8)

Calculating the Posterior P-Value

This is calculated analogously to the bootstrapped p-value, by generating \(M\) contingency tables with \(\boldsymbol Y_k^{(m)} \stackrel{\textrm{ind}}{\sim}\textrm{Multinomial}(N_k, \boldsymbol \rho_{12}^{(m)}\), \(m = 1, \ldots, M\). However, for each contingency table, the common probability vector \(\boldsymbol \rho_{12}^{(m)}\) is different, corresponding to a random draw from \(p(\boldsymbol \rho_{12} \mid \boldsymbol Y, H_0)\). The test statistic \(T^{(m)}\) is then calculated and we have \[ \textrm{p}_\textrm{v}^{\textrm{post}}= \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_{\textrm{obs}}). \] \(\textrm{p}_\textrm{v}^{\textrm{post}}\) is calculated with MADPop as follows:

system.time({
  T.post <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.post, nreps = 1e4,
                      verbose = FALSE)
})
##    user  system elapsed 
##   0.387   0.032   0.422
# posterior p-value
pv.post <- rowMeans(t(T.post) >= T.obs)

We can now compare the three types of p-values (asymptotic, bootstrapped, posterior) for each test statistic (\(\mathcal X\) and \(\Lambda\)):

p-value (\(\times100\%\))
Asymptotic Bootstrap Posterior
\(\mathcal X\) 33.0 0.68 13
\(\Lambda\) 4.1 0.59 12

We see that \(\textrm{p}_\textrm{v}^{\textrm{post}}\) is much more conservative than \(\textrm{p}_\textrm{v}^{\textrm{boot}}\).

Future Work

Here we used a completely unconstrained Multinomial model for the genotype counts, in the sense that the only restriction on \(\boldsymbol \rho_k\) is that \(\rho_{ki} \ge 0\) and \(\sum_{i=1}^C \rho_{ki} = 1\). However, it is possible to impose genetic constraints such as Hardy-Weinberg equilibrium, preferential mating, etc., which effectively reduce the degrees of freedom of \(\boldsymbol \rho_k\) below \(C-1\). In this case, the closed-form likelihood \(\mathcal L(\boldsymbol \alpha\mid \boldsymbol Y)\) is typically unavailable, but the Stan code can easily be modified to sample from \(p(\boldsymbol \alpha, \boldsymbol{\mathcal R}\mid \boldsymbol Y)\), where \(\boldsymbol{\mathcal R}= (\boldsymbol \rho_1, \ldots, \boldsymbol \rho_K)\). We hope to feature some of these extensions to the basic Dirichlet-Multinomial model in the next version of MADPop.

References

Minka, T.P. 2000. “Estimating a Dirichlet Distribution.” Technical Report. https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf.
Stan Development Team. 2016. “Rstan: The R Interface to Stan.” 2016. https://mc-stan.org/rstan/.