Using tidyestimate

Kai Aragaki

2023-08%d

library(tidyestimate)

Introduction

tidyestimate is a tidy implementation of the estimate package and paper (Yoshihara et al. 2013).

estimate infers tumor purity from gene expression data by performing gene-set-enrichment-analysis (GSEA)(Subramanian et al. 2005) on a single-sample basis (ssGSEA)(Barbie et al. 2009), using both a stromal and immune gene set. By adding these two signatures together, an ESTIMATE score can then be converted into an inferred purity score.

Whereas estimate inputs and outputs were largely .GCT files, tidyestimate simplifies the estimate package by taking a data frame, tibble, or matrix as an input, and returning a tibble. This eliminates the need for intermediate files, reduces execution time, and allows for tidier (pipe-compatible) code. Furthermore, tidyestimate provides an optional conservative alias-matching algorithm to capture genes that may be listed under a different name within the dataset.

Use

The dataset we will be using is a set of 10 ovarian cancer tumors, whose expression was profiled using Affymetrix arrays

dim(ov)
#> [1] 17256    10
head(ov[, 1:5])
#>             s516    s518    s519    s520    s521
#> C9orf152  4.8815  4.5757  3.7395  3.6960  4.1597
#> ELMO2     7.2981  7.5554  7.5332  7.3824  7.3079
#> CREB3L1   5.5692  5.7004  5.9597  5.7700  5.2190
#> RPS11    13.3899 13.8488 13.6429 13.6546 13.5698
#> PNMA1     9.3480 10.0092 10.4310  9.5399  9.6423
#> MMP2      7.6182  8.0369  8.9551 10.3875  7.4141

This dataset is a matrix, but it could just as easily be a tibble or data.frame. The identifiers can either be HGNC symbols or Entrez IDs - here we have HGNC symbols.

The ESTIMATE algorithm is sensitive to number of genes included. As such, we will filter for genes that are common to six different expression profiling platforms (see ?tidyestimate::common_genes for more information).

filtered <- filter_common_genes(ov, 
                                id = "hgnc_symbol", 
                                tidy = FALSE, 
                                tell_missing = TRUE, 
                                find_alias = TRUE)
#> 461 of 488 missing genes found matches using aliases.
#> 
#> The following genes are in the list of common genes, but not in your dataset:
#> 
#> ACKR2 EPRS1 AOC1 GARS1 ASIC2 MELTF SEPTIN2 SEPTIN4 POLR1G B4GAT1...and 17 others
#> 
#> Found 10364 of 10391 genes (99.74%) in your dataset.

You will notice there are several arguments:

After we have put our dataset on common ground, we can calculate the ESTIMATE score for each sample:

scored <- estimate_score(filtered,
                         is_affymetrix = TRUE)
#> Number of stromal_signature genes in data: 141 (out of 141)
#> Number of immune_signature genes in data: 141 (out of 141)
scored
#>    sample    stromal     immune   estimate    purity
#> 1    s516 -285.49841  165.75062  -119.7478 0.8323791
#> 2    s518 -429.16931   99.71302  -329.4563 0.8490421
#> 3    s519  -60.98619 -368.70314  -429.6893 0.8567232
#> 4    s520 1927.51431 2326.15984  4253.6742 0.3348246
#> 5    s521 -673.84954  141.72775  -532.1218 0.8643812
#> 6    s522 1447.95517 1166.51854  2614.4737 0.5497248
#> 7    s523 -271.15756 -928.44921 -1199.6068 0.9094242
#> 8    s525  965.61804 1310.27775  2275.8958 0.5905450
#> 9    s526  545.99467 2149.10473  2695.0994 0.5398002
#> 10   s527 -710.44370 1303.08009   592.6364 0.7699846

Note that the estimate_score function takes an is_affymetrix argument. This argument determines whether or not a purity score will be calculated for the samples. As the data used to train the model to convert ESTIMATE scores to purity scores were produced by Affymetrix, it is unwise to infer a tumor purity score using the same method for RNAseq data. However, stromal and immune infiltration as well as ESTIMATE scores can be used to measure relative purity in RNAseq samples (versus absolute 0 to 1 purity inferred for Affymetrix samples). For instance, sample s516 has less stroma than it does immune cell infiltrate. Further, it has more stromal infiltration than sample s518. By looking at the ESTIMATE scores, we can see that s516 is less pure than s518. This can be said absolutely with the purity values: s516 is roughly 83% pure, a metric that can be used both within and across studies.

In the case of Affymetrix samples, where a purity score can be calculated, you may want to see where your sample stands in the model generated by Yoshihara et al. This can be done simply with plotting:

plot_purity(scored, is_affymetrix = TRUE)

The is_affymetrix argument is a bit of a false choice - if FALSE, the function will remind the user that purity scores for non-Affymetrix data are not supported, then stop.

On this plot, gray circles represent the Affymetrix samples used to train the model. Their tumor purity was measured using the ABSOLUTE method (Carter et al. 2012), and a model was fit using an evolutionary model that takes the form of

\[\cos(0.6049872018 + 0.0001467884 * ESTIMATE)\]

The red circles, then, represent the input samples, and have been labeled by their sample name (column names in the original matrix)