Empirical Bayes Metrics with openEBGM

2017-11-15

Background

In Bayesian statistics, the gamma distribution is the conjugate prior distribution for a Poisson likelihood. ‘Conjugate’ means that the posterior distribution will follow the same general form as the prior distribution. DuMouchel (1999) used a model with a Poisson(\(\mu_{ij}\)) likelihood for the counts (for row i and column j of the contingency table). We are interested in the ratio \(\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}\), where \(E_{ij}\) are the expected counts. The \(\lambda_{ij}\)s are considered random draws from a mixture of two gamma distributions (our prior) with hyperparameter \(\theta=(\alpha_1,\beta_1,\alpha_2,\beta_2,P)\), where \(P\) is the prior probability that \(\lambda\) came from the first component of the prior mixture (i.e., the mixture fraction). The prior is a single distribution that models all the cells in the table; however, there is a separate posterior distribution for each cell in the table. The posterior distribution of \(\lambda\), given count \(N=n\), is a mixture of two gamma distributions with parameters \(\theta=(\alpha_1+n,\beta_1+E,\alpha_2+n,\beta_2+E,Q_n)\) (subscripts suppressed for clarity), where \(Q_n\) is the probability that \(\lambda\) came from the first component of the posterior, given \(N=n\) (i.e., the mixture fraction).

The posterior distribution, in a sense, is a Bayesian representation of the relative reporting ratio, \(RR\) (note the similarity in the equations \(RR_{ij}=\frac{N_{ij}}{E_{ij}}\) and \(\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}\)). The Empirical Bayes (EB) metrics are taken from the posterior distribution. The Empirical Bayes Geometric Mean \((EBGM)\) is the antilog of the mean of the log2-transformed posterior distribution. The \(EBGM\) is therefore a measure of central tendency of the posterior distribution. The 5% and 95% quantiles of the posterior distributions can be used to create two-sided 90% credibility intervals for \(\lambda_{ij}\), given \(N_{ij}\) (i.e, our “sort of” RR). Alternatively, since we are most interested in the lower bound, we could ignore the upper bound and create a one-sided 95% credibility interval.

Due to Bayesian shrinkage (please see the Background section of the Introduction to openEBGM vignette), the EB scores are much more stable than \(RR\) for small counts.

References

Please see the Introduction to openEBGM vignette.


Calculating the EB-Scores

Once the product/event combinations have been counted and the hyperparameters have been estimated, we can calculate the EB scores:

library(openEBGM)
data(caers)  #subset of publicly available CAERS data

processed <- processRaw(caers, stratify = FALSE, zeroes = FALSE)
squashed <- squashData(processed)
squashed2 <- squashData(squashed, count = 2, bin_size = 10, keep_bins = 5)
theta_init <- data.frame(alpha1 = c(0.2, 0.1, 0.3, 0.5, 0.2),
                         beta1  = c(0.1, 0.1, 0.5, 0.3, 0.2),
                         alpha2 = c(2,   10,  6,   12,  5),
                         beta2  = c(4,   10,  6,   12,  5),
                         p      = c(1/3, 0.2, 0.5, 0.8, 0.4))
hyper_estimates <- autoHyper(squashed2, theta_init = theta_init)
(theta_hat <- hyper_estimates$estimates)
#>     alpha1      beta1     alpha2      beta2          P 
#> 3.25379064 0.39988801 2.02613219 1.90808279 0.06534557

Qn()

The Qn() function calculates the mixture fractions for the posterior distributions. The values returned by Qn() correspond to the probability that \(\lambda\) came from the first component of the posterior mixture distribution, given \(N=n\) (recall there is a \(\lambda|N=n\) for each cell in the table, but that each \(\lambda\) comes from a common distribution). Thus, the output from Qn() returns a numeric vector of length equal to the total number of product-symptom combinations, which is also the number of rows in the data frame returned by processRaw(). When calculating the \(Q_n\)s, be sure to use the full data set from processRaw() – not the squashed data set or the raw data.

qn <- Qn(theta_hat, N = processed$N, E = processed$E)
head(qn)
#> [1] 0.2819736 0.3409652 0.3482316 0.2819736 0.2226354 0.2556668
identical(length(qn), nrow(processed))
#> [1] TRUE
summary(qn)
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> 0.0001205 0.2340247 0.3089647 0.2846020 0.3397683 0.9999997

ebgm()

The ebgm() function calculates the Empirical Bayes Geometric Mean \((EBGM)\) scores. \(EBGM\) is a measure of central tendency of the posterior distributions, \(\lambda_{ij}|N=n\). Scores much larger than one indicate product/adverse event pairs that are reported at an unusually high rate.

processed$ebgm <- ebgm(theta_hat, N = processed$N, E = processed$E, qn  = qn)
head(processed)
#>                      var1                  var2 N            E      RR
#> 1         1-PHENYLALANINE  HEART RATE INCREASED 1 0.0360548272   27.74
#> 2 11 UNSPECIFIED VITAMINS                ASTHMA 1 0.0038736591  258.15
#> 3 11 UNSPECIFIED VITAMINS CARDIAC FUNCTION TEST 1 0.0002979738 3356.00
#> 4 11 UNSPECIFIED VITAMINS            CHEST PAIN 1 0.0360548272   27.74
#> 5 11 UNSPECIFIED VITAMINS              DYSPNOEA 1 0.0765792610   13.06
#> 6 11 UNSPECIFIED VITAMINS      HYPERSENSITIVITY 1 0.0527413588   18.96
#>      PRR ebgm
#> 1  27.96 2.23
#> 2 279.58 2.58
#> 3    Inf 2.63
#> 4  27.96 2.23
#> 5  13.11 1.92
#> 6  19.06 2.09

quantBisect()

The quantBisect() function calculates quantiles of the posterior distribution using the bisection method. quantBisect() can calculate any quantile of the posterior distribution between 1 and 99%, and these quantiles can be used as limits for credibility intervals. Below, QUANT_05 is the 5th percentile; QUANT_95 is the 95th percentile. These form the lower and upper bounds of 90% credibility intervals for the Empirical Bayes (EB) scores.

processed$QUANT_05 <- quantBisect(5, theta_hat = theta_hat,
                                  N = processed$N, E = processed$E, qn = qn)
processed$QUANT_95 <- quantBisect(95, theta_hat = theta_hat,
                                  N = processed$N, E = processed$E, qn = qn)
head(processed)
#>                      var1                  var2 N            E      RR
#> 1         1-PHENYLALANINE  HEART RATE INCREASED 1 0.0360548272   27.74
#> 2 11 UNSPECIFIED VITAMINS                ASTHMA 1 0.0038736591  258.15
#> 3 11 UNSPECIFIED VITAMINS CARDIAC FUNCTION TEST 1 0.0002979738 3356.00
#> 4 11 UNSPECIFIED VITAMINS            CHEST PAIN 1 0.0360548272   27.74
#> 5 11 UNSPECIFIED VITAMINS              DYSPNOEA 1 0.0765792610   13.06
#> 6 11 UNSPECIFIED VITAMINS      HYPERSENSITIVITY 1 0.0527413588   18.96
#>      PRR ebgm QUANT_05 QUANT_95
#> 1  27.96 2.23     0.49    13.85
#> 2 279.58 2.58     0.52    15.78
#> 3    Inf 2.63     0.52    16.02
#> 4  27.96 2.23     0.49    13.85
#> 5  13.11 1.92     0.47    11.77
#> 6  19.06 2.09     0.48    12.95

Analysis of EB-Scores

The EB-scores (\(EBGM\) and quantile scores) can be used to look for “signals” in the data. As stated in the Background section of the Introduction to openEBGM vignette, Bayesian shrinkage causes the EB-scores to be far more stable than their \(RR\) counterparts, which allows for better separation between signal and noise. One could, for example, look at all product-symptom combinations where QUANT_05 (the lower part of the 90% two-sided credibility interval) is 2 or greater. This is often used as a conservative alternative to \(EBGM\) since QUANT_05 scores are naturally smaller than \(EBGM\) scores. We can say with high confidence that the “true relative reporting ratios” of product/adverse event combinations above this threshold are much greater than 1, so those combinations are truly reported more than expected. The value of 2 is arbitrarily chosen, and depends on the context. Below is an example of how one may identify product-symptom combinations that require further investigation based on the EB-scores.

suspicious <- processed[processed$QUANT_05 >= 2, ]
nrow(suspicious); nrow(processed); nrow(suspicious)/nrow(processed)
#> [1] 131
#> [1] 17189
#> [1] 0.007621153

From above we see that less than 1% of product-symptom pairs are suspect based on the QUANT_05 score. One may look more closely at these product-symptom combinations to ascertain which products may need further investigation. Subject matter knowledge is required to determine which signals might identify a possible causal relationship. The EB-scores find statistical associations – not necessarily causal relationships.

suspicious <- suspicious[order(suspicious$QUANT_05, decreasing = TRUE),
                         c("var1", "var2", "N", "E", "QUANT_05", "ebgm", 
                           "QUANT_95")]
head(suspicious, 5)
#>                                           var1                        var2
#> 13924                            REUMOFAN PLUS            WEIGHT INCREASED
#> 8187  HYDROXYCUT REGULAR RAPID RELEASE CAPLETS          EMOTIONAL DISTRESS
#> 13886                            REUMOFAN PLUS                    IMMOBILE
#> 7793              HYDROXYCUT HARDCORE CAPSULES CARDIO-RESPIRATORY DISTRESS
#> 8220  HYDROXYCUT REGULAR RAPID RELEASE CAPLETS                      INJURY
#>        N          E QUANT_05  ebgm QUANT_95
#> 13924 16 0.40643623    15.68 23.26    33.48
#> 8187  19 0.89690107    11.65 16.78    23.55
#> 13886  6 0.07866508    10.16 18.28    30.83
#> 7793   8 0.30482718     9.00 15.25    24.52
#> 8220  11 0.56317044     8.98 14.28    21.78

tabbed <- table(suspicious$var1)
head(tabbed[order(tabbed, decreasing = TRUE)])
#> 
#> HYDROXYCUT REGULAR RAPID RELEASE CAPLETS 
#>                                       26 
#>             HYDROXYCUT HARDCORE CAPSULES 
#>                                       13 
#>                            REUMOFAN PLUS 
#>                                        8 
#>                      HYDROXYCUT CAPSULES 
#>                                        5 
#>               HYDROXYCUT MAX LIQUID CAPS 
#>                                        5 
#>         HYDROXYCUT CAFFEINE FREE CAPLETS 
#>                                        4

The output above suggests some products which may require further investigation.

Next, the openEBGM Objects and Class Functions vignette will demonstrate the object-oriented features of the openEBGM package.