In this vignette, we use a real dataset to show how to apply proportionality analysis to RNA-seq count data. We place a particular emphasis here on documenting \(\rho\) and the visualization tools included with this package. Although the user may feel eager to start here, we strongly recommend first reading the companion vignette, “An Introduction to Proportionality”.

As a use case, we analyze raw RNA-seq counts from a published study on cane toad evolution and adaptation (Rollins 2015). This dataset contains the transcript counts for 20 toads, sampled from two locations in the wild. Sugar cane farmers introduced cane toads to Australia in 1935 as a pest control measure, but these toads quickly became pests themselves. Starting in Queensland (QLD), they have since spread out into parts of Western Australia (WA). The two locations sampled, which we will treat as the experimental groups, include the early settlement region in QLD and the front of expansion into WA. In this analysis, we try to tease apart the genomic differences between the regional and invasive toads. To begin, we load the `propr`

library along with the example data.

```
library(propr)
data(caneToad.counts)
data(caneToad.groups)
```

Next, we calculate proportionality between all transcripts in the dataset. When working with a large number of transcripts (57,580 in this case), we may first wish to filter non-informative transcripts to minimize the computational burden of analysis. However, while the numerator portion of \(\rho\) remains fixed regardless of pre-filtering, the denominator portion of \(\rho\) does change. In other words, although the numerator portion is sub-compositionally coherent, the denominator portion is not. Therefore, pre-filtering *before* calculating \(\rho\) impacts the final result.

In some circumstances, it is simply infeasible to calculate a proportionality matrix using all features. For example, calculating pairwise proportionality here requires at least 32 GB of RAM, not counting the additional RAM required for indexing and visualization. However, beginning with version 2.0.3, we can now pre-filter *while* calculating \(\rho\), which does not impact the final result. For this pre-filter step, we remove all transcripts that do not have at least 10 counts in at least 10 samples. We base this on the intuition that lowly-expressed transcripts likely have high relative error in cross-sample comparisons due to random noise from the assay technology and alignment protocol. Note that beginning with version 2.2.0, `caneToad.counts`

is bundled as already pre-filtered in order to satisfy strict file size limits imposed by CRAN.

```
keep <- apply(caneToad.counts, 2, function(x) sum(x >= 10) >= 10)
rho <- propr(caneToad.counts, metric = "rho", select = keep)
```

Next, we index the most highly proportional pairs based on a cutoff (preferably chosen with `updateCutoffs`

to minimize false discovery). Here, we use an arbitrarily high threshold so that the vignette renders quickly.

`best <- rho[">", .995]`

After indexing, it is helpful to check the pairwise distribution of proportional pairs using the `plot`

(or, equivalently, `smear`

) function. As an “index-aware” function, `plot`

only plots feature pairs indexed with `[`

. When interpreting this figure, a smear of straight diagonal lines confirms that the pairs are proportional. Intuitively, this means that as one transcript increases in log-ratio transformed abundance, so does the other.

`plot(best)`

`## Alert: Generating plot using indexed feature pairs.`

You can also inspect how the indexed pairs cluster using the `dendrogram`

function. Specifically, this function clusters features based on a hierarchical clustering of the proportionality matrix. In this package, we define the dissimilarity measure as `as.dist(1-abs(rho@matrix))`

. Like `plot`

, this function is “index-aware” and only plots feature pairs indexed with `[`

. Take note that this dendrogram matches the dendrogram used to label co-clusters in the `prism`

, `bokeh`

, and `bucket`

plots below, but differs from the dendrogram produced by the `snapshot`

plot. Heatmap intensity is not scaled.

`dendrogram(best)`

`## Alert: Generating plot using indexed feature pairs.`

The remaining visualization methods do not restrict plotting to indexed pairs, but instead incorporate all features in the `propr`

object. To exclude features that do not comprise at least one indexed pair, we `simplify`

the indexed `propr`

object.

`best <- simplify(best)`

Compositional data are not distributed in Euclidean space, but rather exist in a space known as the Aitchison simplex (Fernandes 2014). The log-ratio transformation transforms such data to Euclidean space, allowing us to summarize our data through conventional statistics.

First, we look at `pca`

, a function for visualizing the samples across the first two dimensions as reduced by principal components analysis (PCA). This is a valid implementation of PCA for relative data, computed using log-ratio transformed data (Gloor 2016). The `group`

argument allows us to color the sample IDs by group membership.

`pca(best, group = caneToad.groups)`

Second, we look at `snapshot`

, a function for visualizing the intensity of the log-ratio transformed data across samples. Heatmap intensity is not scaled.

`snapshot(best)`

Finally, we look at `prism`

, `bokeh`

, and `bucket`

(pronounced *bouquet*), three functions for visualizing the co-clustering of proportional features. We mention these plots together because they share some key similarities. First, they are “index-naive” and plot all \(\rho\) in the `@matrix`

slot of the `propr`

object. Second, they identify the feature pairs where both constituents co-cluster (with the total number of clusters toggled by `k`

). Third, they return a vector of cluster memberships for all features in the `propr`

object.

The `prism`

function plots the variance of the ratio of the log-ratio transformed feature pair (VLR) versus the sum of the individual variances of each log-ratio transformed feature (VLS). The ratio of the VLR to the VLS equals \(1 - \rho\). As such, we use here seven rainbow colored lines to indicate where \(\rho = [.01, .05, .50, 0, 1.50, 1.95, 1.99]\), going from red to violet. A low VLR with a high VLS suggests that the feature pair remains in an equilibrium despite high variability among the individual features.

`clusts <- prism(best, k = 5)`

The `bokeh`

function plots pairs across the individual variances of the constituent log-ratio transformed features. For clarity of visualization, this figure projects the data on a log-fold scale. Therefore, the highly variable co-clusters appear in the top-right of the figure while the lowly variable co-clusters appear in the bottom-left. Meanwhile, highly proportional pairs tend to aggregate around the \(y=x\) diagonal. The user can retrieve the table used to generate the `prism`

and `bokeh`

plots by passing the `propr`

object through the `slate`

function.

`clusts <- bokeh(best, k = 5)`

The `bucket`

function plots an estimation of the degree to which a feature pair differentiates the experimental groups versus the proportionality between that pair.

`clusts <- bucket(best, group = caneToad.groups, k = 5)`

`## Warning: Removed 1 rows containing missing values (geom_hline).`

These plots can help us conceptualize high-dimensional data and select a highly proportional module for further analysis. In this example, we choose co-cluster 2 for further analysis because it shows high proportionality in the setting of high individual feature variance.

We can extract co-cluster 2 from the `propr`

object using the `subset`

method.

`sub <- subset(best, select = (clusts == 2))`

`## Alert: User must repopulate @pairs slot after `subset`.`

`## Alert: Using 'subset' is not compatible the @results table.`

`## Alert: Using 'subset' disables permutation testing.`

We can visualize this cluster as a network using the `cytescape`

function. The resultant figure shows indexed proportional pairs as edges connecting the constituent features. Providing feature names to the optional arguments `col1`

and `col2`

will color those nodes red and blue, respectively. We recommend using this function for integrating proportionality analysis with conventional differential expression analysis. Note that setting `d3 = TRUE`

will have the `rgl`

package render the network in 3D.

`cyt <- cytescape(sub[">", .95])`

We can use the `pca`

function to see how well this cluster differentiates the two experimental groups. We see here that the first two principal components of this highly proportional module leads good separation between the experimental groups. This matches the separation achieved in the source publication which used `edgeR`

for feature selection (Rollins 2015), suggesting that our unsupervised method identified a feature subset that is relevant to the experimental condition. Note, however, that our approach prioritizes features with known coordination across all samples.

`pca(sub, group = caneToad.groups)`

Having identified a cluster that separates the experimental groups, the next step in this pipeline might involve a gene set enrichment analysis (GSEA) of the transcripts participating in this highly proportional module. The application of GSEA is beyond the scope of this vignette, although we can easily extract the names of the transcripts that belong to this cluster.

`transcripts <- colnames(sub@logratio)`

We introduce here a simple tool that is valid for all biological count data. In the example above, we show how this package can uncover a highly proportional transcript module that differentiates the experimental groups. We find this feat particularly impressive considering that, with the exception of the superfluous `bucket`

plot, we discovered this module without ever specifying how the experimental groups should guide feature selection. We believe that this unsupervised application of proportionality analysis could offer unique insights into biological systems.

Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1-2 (June 2016): 21-36. http://dx.doi.org/10.1007/s12064-015-0220-8.

Fernandes, Andrew D., Jennifer Ns Reid, Jean M. Macklaim, Thomas A. McMurrough, David R. Edgell, and Gregory B. Gloor. “Unifying the Analysis of High-Throughput Sequencing Datasets: Characterizing RNA-Seq, 16S rRNA Gene Sequencing and Selective Growth Experiments by Compositional Data Analysis.” Microbiome 2 (2014): 15. http://dx.doi.org/10.1186/2049-2618-2-15.

Gloor, Gregory B., and Gregor Reid. “Compositional Analysis: A Valid Approach to Analyze Microbiome High-Throughput Sequencing Data.” Canadian Journal of Microbiology 62, no. 8 (August 2016): 692-703. http://dx.doi.org/10.1139/cjm-2015-0821.

Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075. http://dx.doi.org/10.1371/journal.pcbi.1004075.

Rollins, Lee A., Mark F. Richardson, and Richard Shine. “A Genetic Perspective on Rapid Evolution in Cane Toads (Rhinella Marina).” Molecular Ecology 24, no. 9 (May 2015): 2264-76. http://dx.doi.org/10.1111/mec.13184.