# A Brief Critique of Proportionality

## Introduction

We recognize that this package uses concepts that are not necessarily intuitive. As such, we offer a brief critique of proportionality analysis. Although the user may feel eager to start here, we strongly recommend first reading the companion vignette, “An Introduction to Proportionality”.

## Sample data

To facilitate discussion, we simulate count data for 5 features (e.g., genes) labeled “a”, “b”, “c”, “d”, and “e”, as measured across 100 subjects.

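library(propr)
N <- 100 # number of subjects
a <- seq(from = 5, to = 15, length.out = N) # rises steadily across subjects
b <- a * rnorm(N, mean = 1, sd = 0.1) # proportional to "a", with noise
c <- rnorm(N, mean = 10) # random noise around 10
d <- rnorm(N, mean = 10) # random noise around 10, independent of "c"
e <- rep(10, N) # fixed at 10 for every subject
X <- data.frame(a, b, c, d, e)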

Let us assume that these data $$X$$ represent absolute abundance counts (i.e., not relative data). We can build a relative dataset, $$Y$$, by constraining and perturbing $$X$$:

# Scale each row to sum to 1, then multiply it by a random positive factor
Y <- X / rowSums(X) * abs(rnorm(N))

We can check that the new feature vectors do in fact contain relative quantities. For example, the ratio of the second feature to the first is the same for both the absolute and relative datasets.

all(round(X[, 2] / X[, 1] - Y[, 2] / Y[, 1], 5) == 0)
## [1] TRUE

## Spurious correlation

Next, we compare pairwise scatterplots for the absolute count data and the corresponding relative count data. We see quickly how these relative data suggest a spurious correlation: although genes “c” and “d” do not correlate with one another absolutely, their relative quantities do.

pairs(X) # absolute data

pairs(Y) # relative data

Spurious correlation is also evident in the correlation coefficients.

suppressWarnings(cor(X)) # absolute correlation
##              a           b            c           d  e
## a  1.000000000  0.94475844  0.003970374 -0.07234164 NA
## b  0.944758439  1.00000000 -0.014389878 -0.11947914 NA
## c  0.003970374 -0.01438988  1.000000000  0.02875936 NA
## d -0.072341638 -0.11947914  0.028759360  1.00000000 NA
## e           NA          NA           NA          NA  1
cor(Y) # relative correlation
##           a         b         c         d         e
## a 1.0000000 0.9863649 0.8539498 0.8503377 0.8650862
## b 0.9863649 1.0000000 0.8348674 0.8345597 0.8499053
## c 0.8539498 0.8348674 1.0000000 0.9758778 0.9894774
## d 0.8503377 0.8345597 0.9758778 1.0000000 0.9843489
## e 0.8650862 0.8499053 0.9894774 0.9843489 1.0000000
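
One way to see where the spurious correlation comes from is to keep the per-subject scaling factor in hand. The following base R sketch re-simulates data like the above and shows that each relative feature mostly tracks that shared factor. The seed and the names `X_abs`, `Y_rel`, and `s` are illustrative assumptions, not part of the package.

```r
set.seed(1) # arbitrary seed, used only to make this sketch repeatable
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X_abs <- cbind(a = a,
               b = a * rnorm(N, mean = 1, sd = 0.1),
               c = rnorm(N, mean = 10),
               d = rnorm(N, mean = 10),
               e = 10)

# Per-subject factor that turns absolute counts into relative ones
s <- abs(rnorm(N)) / rowSums(X_abs)
Y_rel <- X_abs * s # row i is multiplied by s[i]

cor(X_abs[, "c"], X_abs[, "d"]) # near zero: "c" and "d" are independent
cor(Y_rel[, "c"], Y_rel[, "d"]) # large: both now carry the factor s
cor(Y_rel[, "c"], s)            # "c" mostly tracks s itself
```

Because `s` varies far more across subjects than “c” or “d” do, it dominates every relative feature and drags them all into apparent agreement.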

## An in-depth look at VLR

In contrast, the variance of the log-ratios (VLR), defined as the variance of the logarithm of the ratio of two feature vectors, offers a measure of dependence that (a) does not change with respect to the nature of the data (i.e., absolute or relative), and (b) does not change with respect to the number of features included in the computation. As such, the VLR, constituting the numerator portion of the $$\phi$$ metric and a portion of the $$\rho$$ metric as well, is considered sub-compositionally coherent. Yet, while VLR yields valid results for compositional data, it lacks a meaningful scale.
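
To make property (a) concrete, suppose each subject $$i$$ is observed only up to an unknown positive factor $$s_i$$, so that the relative value is $$y_{ij} = s_i x_{ij}$$ (the symbols $$s_i$$, $$x_{ij}$$, and $$y_{ij}$$ are introduced here for illustration). For any two features $$j$$ and $$k$$, the factor cancels inside the ratio:

$$
\mathrm{VLR}(y_j, y_k)
= \mathrm{var}_i\left(\log\frac{y_{ij}}{y_{ik}}\right)
= \mathrm{var}_i\left(\log\frac{s_i x_{ij}}{s_i x_{ik}}\right)
= \mathrm{var}_i\left(\log\frac{x_{ij}}{x_{ik}}\right)
= \mathrm{VLR}(x_j, x_k)
$$

Property (b) follows because the expression involves only features $$j$$ and $$k$$; no other feature enters the calculation.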

propr:::proprVLR(Y[, 1:4]) # relative VLR
##            a          b         c         d
## a 0.00000000 0.01021946 0.1042684 0.1119747
## b 0.01021946 0.00000000 0.1171924 0.1254567
## c 0.10426838 0.11719244 0.0000000 0.0163765
## d 0.11197472 0.12545665 0.0163765 0.0000000
propr:::proprVLR(X) # absolute VLR
##            a          b           c           d           e
## a 0.00000000 0.01021946 0.104268383 0.111974717 0.097960496
## b 0.01021946 0.00000000 0.117192436 0.125456654 0.109071317
## c 0.10426838 0.11719244 0.000000000 0.016376496 0.007706843
## d 0.11197472 0.12545665 0.016376496 0.000000000 0.009189899
## e 0.09796050 0.10907132 0.007706843 0.009189899 0.000000000
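
The agreement between the two matrices can be checked without the package. The sketch below is a minimal base R version, assuming data simulated as above; `vlr_matrix` is a hypothetical helper written from the definition of VLR and is not the package's implementation.

```r
set.seed(1) # arbitrary seed, used only to make this sketch repeatable
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X_abs <- cbind(a = a,
               b = a * rnorm(N, mean = 1, sd = 0.1),
               c = rnorm(N, mean = 10),
               d = rnorm(N, mean = 10),
               e = 10)
Y_rel <- X_abs / rowSums(X_abs) * abs(rnorm(N))

# Hypothetical helper: variance of the log-ratio for every pair of columns
vlr_matrix <- function(M) {
  L <- log(M)
  p <- ncol(L)
  out <- matrix(0, p, p, dimnames = list(colnames(M), colnames(M)))
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      out[i, j] <- var(L[, i] - L[, j])
    }
  }
  out
}

vlr_abs <- vlr_matrix(X_abs)        # absolute data, all 5 features
vlr_sub <- vlr_matrix(Y_rel[, 1:4]) # relative data, one feature dropped

# The shared entries agree up to floating-point error
all.equal(vlr_abs[1:4, 1:4], vlr_sub)
```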

## An in-depth look at clr

In the calculation of proportionality, we adjust the VLR, which has no fixed upper bound, by the variance of its individual constituents. To do this, we need to place samples on a comparable scale. A log-ratio transformation, such as the centered log-ratio (clr) transformation, shifts the data onto a “standardized” scale that allows us to compare differences in the VLR matrix.
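
For reference, the quantities named above can be written out as follows. The notation is our own shorthand: $$x_i$$ is the vector of $$p$$ features measured in subject $$i$$, $$g(x_i)$$ is its geometric mean, and $$A_j$$ is the clr-transformed feature $$j$$ taken across subjects. The formulas are stated as we understand them from Lovell et al. (2015) and Erb and Notredame (2016).

$$
\mathrm{clr}(x_i)_j = \log\frac{x_{ij}}{g(x_i)},
\qquad
g(x_i) = \left(\prod_{k=1}^{p} x_{ik}\right)^{1/p}
$$

$$
\phi(A_j, A_k) = \frac{\mathrm{var}(A_j - A_k)}{\mathrm{var}(A_j)},
\qquad
\rho(A_j, A_k) = 1 - \frac{\mathrm{var}(A_j - A_k)}{\mathrm{var}(A_j) + \mathrm{var}(A_k)}
$$

Because the geometric mean cancels in the difference $$A_j - A_k$$, the term $$\mathrm{var}(A_j - A_k)$$ is exactly the VLR of features $$j$$ and $$k$$. Only the denominators depend on the transformation.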

In the next figures, we compare pairwise scatterplots for the clr-transformed absolute count data and the corresponding clr-transformed relative count data. The two sets of plots are comparable, and both show a relationship between “c” and “d” that should not exist based on what we know from the non-transformed absolute count data. This demonstrates that, although the clr-transformation helps us compare values across samples, it does not rescue information lost by making absolute data relative.

pairs(propr:::proprCLR(Y[, 1:4])) # relative clr-transformation

pairs(propr:::proprCLR(X)) # absolute clr-transformation

Proportionality is a compromise between the advantages of VLR and the disadvantages of clr to establish a measure of dependence that is robust yet interpretable. Note, however, that because of the division of VLR by the variance of the clr-transformed data, proportionality is not sub-compositionally coherent. As such, spurious proportionality is possible when the clr does not adequately approximate the absolute data.

propr(Y[, 1:4])@matrix # relative proportionality with clr
##            a          b          c          d
## a  1.0000000  0.8272187 -0.8824763 -0.8856807
## b  0.8272187  1.0000000 -0.8904919 -0.9013457
## c -0.8824763 -0.8904919  1.0000000  0.7368192
## d -0.8856807 -0.9013457  0.7368192  1.0000000
propr(X)@matrix # absolute proportionality with clr
##            a          b          c          d          e
## a  1.0000000  0.8730806 -0.8215966 -0.8437871 -0.8512108
## b  0.8730806  1.0000000 -0.8101044 -0.8386186 -0.8052084
## c -0.8215966 -0.8101044  1.0000000  0.6357140  0.7924989
## d -0.8437871 -0.8386186  0.6357140  1.0000000  0.7738257
## e -0.8512108 -0.8052084  0.7924989  0.7738257  1.0000000
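
The dependence on which features enter the clr can be reproduced in base R. The following sketch assumes data simulated as above; `clr` and `rho_matrix` are hypothetical helpers written from the standard definitions of the clr and of $$\rho$$, not the package's own code.

```r
set.seed(1) # arbitrary seed, used only to make this sketch repeatable
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X_abs <- cbind(a = a,
               b = a * rnorm(N, mean = 1, sd = 0.1),
               c = rnorm(N, mean = 10),
               d = rnorm(N, mean = 10),
               e = 10)

# Subtract each row's mean log, i.e. the log of its geometric mean
clr <- function(M) {
  L <- log(M)
  L - rowMeans(L)
}

# rho = 1 - var(A_j - A_k) / (var(A_j) + var(A_k)) for every pair of columns
rho_matrix <- function(A) {
  p <- ncol(A)
  out <- diag(p)
  dimnames(out) <- list(colnames(A), colnames(A))
  for (i in seq_len(p)) {
    for (j in seq_len(p)) {
      out[i, j] <- 1 - var(A[, i] - A[, j]) / (var(A[, i]) + var(A[, j]))
    }
  }
  out
}

rho_full <- rho_matrix(clr(X_abs))        # reference built from all 5 features
rho_sub  <- rho_matrix(clr(X_abs[, 1:4])) # reference built from 4 features

round(rho_full[1:4, 1:4], 3)
round(rho_sub, 3)
```

The two printed matrices describe the same four features yet differ, because dropping “e” changes the geometric mean that every clr value is measured against.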

## An in-depth look at alr

Unlike the clr, which adjusts each subject vector by the geometric mean of that vector, the additive log-ratio (alr) adjusts each subject vector by the value of one of its own components, chosen as a reference. If we select as a reference some feature $$D$$ with an a priori known fixed absolute count across all subjects, we can effectively “back-calculate” absolute data from relative data. When initially crafting the data $$X$$, we included “e” as this fixed value.
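
To spell out the back-calculation, write $$x_{ij}$$ for absolute counts and $$y_{ij} = s_i x_{ij}$$ for relative ones, where the per-subject factor $$s_i$$ is unknown (this notation is ours). If the reference feature $$D$$ has the same absolute value $$k$$ in every subject, then

$$
\mathrm{alr}(y_i)_j
= \log\frac{y_{ij}}{y_{iD}}
= \log\frac{s_i x_{ij}}{s_i x_{iD}}
= \log x_{ij} - \log k
$$

so the alr-transformed relative data equal the log absolute data up to a constant shift, which leaves variances and correlations unchanged.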

The following figures compare pairwise scatterplots for alr-transformed relative count data (i.e., $$alr(Y)$$ with “e” as the reference) and the corresponding absolute count data. We see here how alr-transformation eliminates the spurious correlation between “c” and “d”.

pairs(propr:::proprALR(Y, ivar = 5)) # relative alr

pairs(X[, 1:4]) # absolute data

Again, this is reflected in the proportionality matrix when we select “e” as the reference.

propr(Y, ivar = 5)@matrix # relative proportionality with alr
##             a            b            c           d e
## a  1.00000000  0.950638235  0.013239238 -0.04502384 0
## b  0.95063823  1.000000000 -0.003547555 -0.06084360 0
## c  0.01323924 -0.003547555  1.000000000  0.03078971 0
## d -0.04502384 -0.060843598  0.030789706  1.00000000 0
## e  0.00000000  0.000000000  0.000000000  0.00000000 1
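
The “back-calculation” can be verified numerically. This is a minimal base R sketch, assuming data simulated as above and assuming we know that “e” equals 10 in every subject; `alr` is a hypothetical helper, not the package's implementation.

```r
set.seed(1) # arbitrary seed, used only to make this sketch repeatable
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X_abs <- cbind(a = a,
               b = a * rnorm(N, mean = 1, sd = 0.1),
               c = rnorm(N, mean = 10),
               d = rnorm(N, mean = 10),
               e = 10)
Y_rel <- X_abs / rowSums(X_abs) * abs(rnorm(N))

# Log-ratio of every other column to the reference column ivar
alr <- function(M, ivar) {
  L <- log(M)
  L[, -ivar, drop = FALSE] - L[, ivar]
}

# Undo the transformation using the known absolute value of "e"
X_back <- 10 * exp(alr(Y_rel, ivar = 5))

# Recovers the absolute counts of "a" to "d" up to floating-point error
all.equal(X_back, X_abs[, 1:4])
```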

Now, let us assume that these same data, $$X$$, actually measure relative counts. In other words, $$X$$ is already relative, and we do not know the absolute quantities to which $$X$$ corresponds. If we knew that “a” represented a known fixed quantity, we could again use alr-transformation to “back-calculate” the absolute abundances. In this case, “c”, “d”, and “e” actually do have proportional expression. Although the measured quantities of “c”, “d”, and “e” do not change considerably across subjects, the measured quantity of the known fixed feature does. As such, whenever “a” increases while “c”, “d”, and “e” remain the same, the latter three features have actually decreased. Since they all decrease together, they act as a highly proportional module.

pairs(propr:::proprALR(X, ivar = 1)) # new relative alr

Again, this is reflected in the proportionality matrix when we select “a” as the reference.

propr(X, ivar = 1)@matrix # new relative proportionality with alr
##   a            b           c           d            e
## a 1  0.000000000  0.00000000  0.00000000  0.000000000
## b 0  1.000000000 -0.02362345 -0.02669915 -0.008239653
## c 0 -0.023623446  1.00000000  0.92426812  0.961890494
## d 0 -0.026699149  0.92426812  1.00000000  0.956225073
## e 0 -0.008239653  0.96189049  0.95622507  1.000000000
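
The emergence of this module can be traced in base R by forming the log-ratios to “a” directly. The sketch below assumes data simulated as above, now read as relative; the names `X_rel` and `alr_a` are illustrative.

```r
set.seed(1) # arbitrary seed, used only to make this sketch repeatable
N <- 100
a <- seq(from = 5, to = 15, length.out = N)
X_rel <- cbind(a = a,
               b = a * rnorm(N, mean = 1, sd = 0.1),
               c = rnorm(N, mean = 10),
               d = rnorm(N, mean = 10),
               e = 10)

# Log-ratios of "c", "d", and "e" to the reference "a"
alr_a <- log(X_rel[, c("c", "d", "e")]) - log(X_rel[, "a"])

cor(log(X_rel[, "c"]), log(X_rel[, "d"])) # near zero before choosing a reference
cor(alr_a) # large positive entries once everything is expressed relative to "a"
```

The shared term `log(X_rel[, "a"])` varies much more across subjects than the log of “c”, “d”, or “e”, so after subtraction the three features move together.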

We see here that, unlike clr-transformed proportionality metrics, the alr-transformed metric $$\rho$$ is sub-compositionally coherent and yields identical results regardless of the nature of the data explored. Of course, this assumes that one knows the identity of a feature fixed across all subjects.

citation("propr")
##
## To cite propr in publications use:
##
##   Quinn T, Richardson MF, Lovell D, Crowley T (2017) propr: An
##   R-package for Identifying Proportionally Abundant Features Using
##   Compositional Data Analysis. Scientific Reports 7(16252):
##   doi:10.1038/s41598-017-16520-0
##
##   Erb I, Quinn T, Lovell D, Notredame C (2017) Differential
##   Proportionality - A Normalization-Free Approach To Differential
##   Gene Expression. Proceedings of CoDaWork 2017, The 7th
##   Compositional Data Analysis Workshop; available under bioRxiv
##   134536: doi:10.1101/134536
##
##   Quinn T, Erb I, Richardson MF, Crowley T (2018) Understanding
##   sequencing data as compositions: an outlook and review.
##   Bioinformatics 34(16): doi:10.1093/bioinformatics/bty175
##
##   Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bahler J
##   (2015) Proportionality: A Valid Alternative to Correlation for
##   Relative Data. PLoS Comput Biol 11(3):
##   doi:10.1371/journal.pcbi.1004075
##
##   Erb I, Notredame C (2016) How should we measure proportionality
##   on relative gene expression data? Theory Biosci 135(1):
##   doi:10.1007/s12064-015-0220-8
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.