This vignette shows the relevant steps to QC RNA-seq data. The rationale of this workflow is to assess whether the overall experiment meets desired quality standards and to detect low quality samples.
We start with loading relevant packages for the analysis.
library("RNAseqQC") library("DESeq2") library("ensembldb") library("dplyr") library("ggplot2") library("purrr") library("tidyr") library("tibble") library("magrittr")
The input to the workflow consists of a genes \(\times\) samples count matrix and
data.frame of metadata annotating the samples, where the
number of metadata rows must equal the number of matrix columns. In
addition, the count matrix must have column names and the row names must
be ENSEMBL gene IDs.
As an example, we are analyzing a data set in which T47D cells with
different mutation statuses were treated with E2 (estradiol) or vehicle
(Bahreini et al. 2017). The input data
are stored in this package as a
T47D. For demonstration purposes, we extract the count
matrix and sample metadata to show how to construct an annotated
DESeqDataSet, which is required for running this
<- counts(T47D) count_mat <- data.frame(colData(T47D)) meta # count matrix; rownames must be ENSEMBL gene IDs head(which(rowSums(count_mat) > 0)), 1:10] count_mat[#> veh_WT_rep1 veh_WT_rep2 veh_WT_rep3 veh_WT_rep4 veh_D538G_rep1 #> ENSG00000223972 772 386 487 929 568 #> ENSG00000278267 290 391 311 623 319 #> ENSG00000227232 11914 13539 14108 14427 9616 #> ENSG00000284332 0 0 0 63 0 #> ENSG00000243485 89 353 153 352 0 #> ENSG00000237613 0 0 0 150 0 #> veh_D538G_rep2 veh_D538G_rep4 veh_Y537S_rep1 veh_Y537S_rep2 #> ENSG00000223972 210 891 518 760 #> ENSG00000278267 318 267 354 283 #> ENSG00000227232 10344 10747 11011 9569 #> ENSG00000284332 0 52 0 0 #> ENSG00000243485 253 501 103 137 #> ENSG00000237613 0 0 0 0 #> veh_Y537S_rep3 #> ENSG00000223972 406 #> ENSG00000278267 324 #> ENSG00000227232 13232 #> ENSG00000284332 0 #> ENSG00000243485 118 #> ENSG00000237613 0 # metadata of the samples, where row i corresponds to column i in the count matrix meta#> treatment mutation replicate #> veh_WT_rep1 veh WT rep1 #> veh_WT_rep2 veh WT rep2 #> veh_WT_rep3 veh WT rep3 #> veh_WT_rep4 veh WT rep4 #> veh_D538G_rep1 veh D538G rep1 #> veh_D538G_rep2 veh D538G rep2 #> veh_D538G_rep4 veh D538G rep4 #> veh_Y537S_rep1 veh Y537S rep1 #> veh_Y537S_rep2 veh Y537S rep2 #> veh_Y537S_rep3 veh Y537S rep3 #> veh_Y537S_rep4 veh Y537S rep4 #> veh_D538G_rep3 veh D538G rep3 #> E2_WT_rep1 E2 WT rep1 #> E2_WT_rep2 E2 WT rep2 #> E2_WT_rep3 E2 WT rep3 #> E2_WT_rep4 E2 WT rep4 #> E2_Y537S_rep1 E2 Y537S rep1 #> E2_Y537S_rep2 E2 Y537S rep2 #> E2_Y537S_rep3 E2 Y537S rep3 #> E2_Y537S_rep4 E2 Y537S rep4 #> E2_D538G_rep1 E2 D538G rep1 #> E2_D538G_rep2 E2 D538G rep2 #> E2_D538G_rep3 E2 D538G rep3 #> E2_D538G_rep4 E2 D538G rep4
We make a
DESeqDataSet from the input count matrix and
metadata. This step also requires to specify a record identifier from
the AnnotationHub package, that allows to download an
object to annotate gene IDs and store them in the
slot. We choose the latest homo sapiens AnnotationHub
<- make_dds(counts = count_mat, metadata = meta, ah_record = "AH89426")dds
Records for other species can be found by running the following code. Ideally, the AnnotationHub record should match the genome build against which the sequencing reads were aligned.
mcols(AnnotationHub::AnnotationHub()) %>% as_tibble(rownames = "record_id") %>% ::filter(rdataclass == "EnsDb") dplyr
A good metric to start quality control is to look at the total number of counts for each sample. This is also referred to as library size and we typically expect all samples to have total counts within the same order of magnitude.
For each sample, it is shown what fraction of counts is taken up by what fraction of genes. Samples showing a different library complexity than the rest might be considered low quality. In our case, all samples have very similar complexity.
Another interesting quantity is the number of detected genes for each sample. This number depends on the detection threshold, i.e. the minimum count required to call a gene detected and also on the used protocol (e.g. RNA-seq read counts vs. DRUG-seq UMI counts). Thus, different thresholds are of interest. Note that genes in this context refers to ENSEMBL genes and includes protein coding genes, lncRNAs, and various other biotypes.
It is also helpful to stratify the total gene count by the different gene biotypes. The plot below shows for each sample the total (library size) normalized count for the major gene biotypes. In a high quality data set, we expect this plot to show almost horizontal lines. Deviations from the horizontal pattern may correspond to different biological groups in the data set, especially in case of different cell lines, but could also indicate batch effects or low quality samples.
Before continuing with quality control, a useful intermediate step is to remove genes with low counts. This often substantially reduces the total number of genes, and thus the overall size of the data set and computation time. A good strategy is to determine the size of a biological group, i.e. all samples under the same biological condition, and filter (keep) genes with a certain count at least as often as the size of the biological group. In our case, each group is a combination of treatment and mutation status with 4 samples and we choose 5 counts as threshold.
<- filter_genes(dds, min_count = 5, min_rep = 4)dds
For downstream tasks like clustering or PCA, a transformation is
necessary so that the variance of a gene does not depend on it’s mean,
i.e. we want genes with low and high mean counts to have similar
variances. A transformation that achieves this is called variance
stabilizing and we use the
vst function of DESeq2 for
this task. To check if the variance is indeed stabilized, we plot for
each gene the rank of the mean count versus the standard deviation. In
the resulting plot, the red trend line should be relatively flat with
respect to the scale on the y-axis, as can be seen for our example
<- vst(dds) vsd mean_sd_plot(vsd)
Chromosome expression plots can be useful to detect expression patterns related to the chromatin. For each chromosome, we make a heatmap where genes are arranged by chromosomal position along the x-axis and each row is a sample. The columns of the heatmap are scaled such that different genes can be compared to each other. Long red (blue) stretches along the x-axis indicate regions with increased (decreased) expression.
There can be several reasons for differential expression of a chromosomal region, for example duplication of a chromosome band or changes in chromatin accessibility. Also if the data comprise different cell lines or primary cancer samples, differential expression of a region or a whole chromosome could be observed.
In our example, chromosome 1 shows no obvious differentially expressed region, chromosome 5 shows differential expression related to the interaction of treatment and mutation status, and chromosome 14 is down regulated in samples with the D538G mutation.
map(c("1", "5", "14"), ~plot_chromosome(vsd, .x)) #> []
#> #> []
#> #> []
A good quality metric on a per-sample basis is to check whether
biological and/or technical replicates are closely correlated. To do
this, we define groups of samples by specifying a column of the
colData slot as grouping variable, and define all samples
that belong to the same level as group. Subsequently, a reference sample
is computed for each group by taking for each gene the median over all
samples in the group. Finally, we make an MA-plot where each sample is
plotted against its group reference.
In our example, we define a new grouping variable as the combination of treatment and mutation status and show replicate variability for the sample groups veh_WT and veh_Y537S. All shown replicates have high concordance with their respective reference. The only exception is sample veh_Y537S_rep3, which has an increased number down-regulated genes with high mean count.
# define new grouping variable colData(vsd)$trt_mut <- paste0(colData(vsd)$treatment, "_", colData(vsd)$mutation) <- plot_sample_MAs(vsd, group = "trt_mut") ma_plots ::plot_grid(plotlist = ma_plots[17:24], ncol = 2)cowplot
We can also have a look at the clustering of samples by a distance metric such as euclidean distance or pearson correlation distance. It is useful to define annotation variables for the distance heatmap and relate appearing clusters to the annotation. In our example, samples cluster according to the biological groups defined in the previous section, highlighting the good quality of the data set.
# set seed to control random annotation colors set.seed(1) plot_sample_clustering(vsd, anno_vars = c("treatment", "mutation",