Install ADAPTSdata3 using the code:

install.packages(‘devtools’)

library(devtools) devtools::install_github(‘sdanzige/ADAPTSdata3’)

Step 1: Build signature matrices from the normal data and estimate the accuracy on the diabetes data

All the data comes from E-MTAB-5061 - Single-cell RNA-seq analysis of human pancreas. In this vignette,instead of splitting the normal data into training and test set, all the signature matrices are built using the entire normal data, and then tested on the new diabetes data.

normalData <- log(ADAPTSdata3::normalData.5061+1)
diabetesData<-log(ADAPTSdata3::diabetesData.5061+1)


Step 1a: Build a deconvolution seed matrix using ranger forest and estimate the accuracy on pseudo bulk diabetes

ADAPTS provides the option of building a new seed matrix de novo based on the sample given, in addition to augmenting existing signature matrices, such as LM22. This is particularly helpful for single cell data sets, where the cell types present have come from their native tissue.

trainSet.30sam <- ADAPTS::scSample(RNAcounts = normalData, groupSize = 30, randomize = TRUE)
trainSet.3sam <- ADAPTS::scSample(RNAcounts = normalData, groupSize = 3, randomize = TRUE)

seedMat<-ADAPTS::buildSeed(trainSet=normalData, trainSet.3sam =trainSet.3sam, trainSet.30sam = trainSet.30sam, genesInSeed = 100, groupSize = 30, randomize = TRUE, num.trees = 1000, plotIt = TRUE)  

pseudobulk.test <- data.frame(test=rowSums(diabetesData))
pseudobulk.test.counts<-table(sub('\\..*','',colnames(diabetesData)))
actFrac.test <- 100 * pseudobulk.test.counts / sum(pseudobulk.test.counts)

estimates.test <- as.data.frame(ADAPTS::estCellPercent.DCQ(seedMat, pseudobulk.test))
colnames(estimates.test)<-'seed'

estimates.test$actFrac<-round(actFrac.test[rownames(estimates.test)],2)

seedAcc<-ADAPTS::calcAcc(estimates=estimates.test[,1], reference=estimates.test[,2])


Step 1b: Build a deconvolution matrix using all the genes and estimate the accuracy on pseudo bulk diabetes data

This step tests if building signature matrix is really necessary by comparing the performance of signature matrices and all-gene matrix.

allGeneSig <- apply(trainSet.3sam, 1, function(x){tapply(x, colnames(trainSet.3sam), mean, na.rm=TRUE)})

estimates.allGene <- as.data.frame(ADAPTS::estCellPercent.DCQ(t(allGeneSig), pseudobulk.test))
colnames(estimates.allGene)<-'all'

estimates.test<-cbind(estimates.allGene,estimates.test)

allAcc<-ADAPTS::calcAcc(estimates=estimates.test[,1], reference=estimates.test[,3])


Step 1c: Augment the seed matrix and estimate the accuray on pseudo bulk diabetes data

ADAPTS takes in the seed matrix, adds one additional gene from the full data at a time and records their condition number. The new augmented signature matrix is chosen based on the lowest condition number.

gList <- ADAPTS::gListFromRF(trainSet=trainSet.30sam)

augTrain <- ADAPTS::AugmentSigMatrix(origMatrix = seedMat, fullData = trainSet.3sam, gList = gList, nGenes = 1:100, newData = trainSet.3sam, plotToPDF = FALSE, pdfDir = '.')

estimates.augment <- as.data.frame(ADAPTS::estCellPercent.DCQ(augTrain, pseudobulk.test))
colnames(estimates.augment) <- paste('aug')

estimates.test <- cbind(estimates.augment, estimates.test)

augAcc<-ADAPTS::calcAcc(estimates=estimates.test[,1], reference=estimates.test[,4])


Step 1d: Shrink the augmented matrix to minimize the condition number and estimate the accuray on pseudo bulk diabetes data

This step iteratively removes the genes that would most improve the condition number by their absence. The condition number will usually decrease before increasing again when the number of genes becomes small, as shown in the plot below. Note that this plot should be read from right to left.

ADAPTS provides options to automatically select different points along the condition number curve corresponding to different numbers of genes. Typically, the best classifier has the largest number of genes where the condition number is at the bottom of the bowl shown in the plot. The fastStop option speeds up this process by halting execution when the algorithm has detected (perhaps incorrectly) that it has reached this point.

augTrain.shrink <- ADAPTS::shrinkSigMatrix(augTrain, numChunks=NULL, verbose=FALSE, plotIt=TRUE, aggressiveMin=TRUE,sigGenesList=NULL, singleCore=FALSE, fastStop=FALSE) 
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

estimates.shrink <- as.data.frame(ADAPTS::estCellPercent.DCQ(augTrain.shrink, pseudobulk.test))
colnames(estimates.shrink)<-'shrink'

titleStr <- paste('Shrunk Signature Matrix,', '# Genes:',nrow(augTrain),'->',nrow(augTrain.shrink))
pheatmap(augTrain.shrink, main=titleStr)

estimates.test <- cbind(estimates.shrink, estimates.test)

augshrunkAcc<-ADAPTS::calcAcc(estimates=estimates.test[,1], reference=estimates.test[,5])



Step 2: Combine indistinguishible cell types for meta analysis

Cell-types for single cells are typically assigned by using unsupervised clustering techniques to group cells that are putatively all of the same cell type and then assigning cells in each group based on the expression of marker genes. While powerful, this approach is limited in that the delineations between clusters are often wrong. In other word, cell identified as different cell types by clustering may not really be different cell types or they may not really be detectable as different cell types based on gene expression data.

ADAPTS detects when cell clusters ought to be combined and this typically results in a much more accurate signature matrix.

This step shows that combining cell types that are highly correlated improves the ability to accurately deconvolve. The clusters are built based on n-pass spillover matrix. ADAPTS iteratively applying the spillover calculation until the results converge into clusters of cell types. See more details in Algorithm 2 of ADAPTS paper at https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224693. All the signature matrices are built as in Step 1.

varClusts <- ADAPTS::clustWspillOver(sigMatrix = seedMat, geneExpr = trainSet.3sam)
metaCluster.id <- list()
  for(i in 1:length(varClusts$allClusters)) {
    for (x in varClusts$allClusters[[i]]) {
      metaCluster.id[[x]] <- paste('Meta',i,sep='_')
    }
  }

metaClust.LUT <-  unlist(metaCluster.id)  

metatrainSet<-normalData
metatestSet<-diabetesData

colnames(metatrainSet) <- metaClust.LUT[sub('\\..*','',colnames(metatestSet))]
colnames(metatestSet) <- metaClust.LUT[sub('\\..*','', colnames(metatestSet))]

metatrainSet.3sam <- ADAPTS::scSample(RNAcounts = metatrainSet, groupSize = 3, randomize = TRUE)
metatrainSet.30sam <- ADAPTS::scSample(RNAcounts = metatrainSet, groupSize = 30, randomize = TRUE)
metaclusterIDs <- factor(colnames(metatrainSet.30sam))
metatrainSet.4reg <- t(metatrainSet.30sam)

metapseudobulk.test <- data.frame(test=rowSums(metatestSet))
metapseudobulk.test.counts <- table(sub('\\..*','', colnames(metatestSet)))
meta.actFrac <- 100 * metapseudobulk.test.counts / sum(metapseudobulk.test.counts)


Step 2a: Build a deconvolution seed matrix for clustered cell types

metaseedMat <- ADAPTS::buildSeed(trainSet=metatrainSet, trainSet.3sam = metatrainSet.3sam, trainSet.30sam = metatrainSet.30sam, genesInSeed = 100, groupSize = 30, randomize = TRUE, num.trees = 1000, plotIt = TRUE)

estimates.Meta.onTest <- as.data.frame(ADAPTS::estCellPercent.DCQ(metaseedMat, metapseudobulk.test))
colnames(estimates.Meta.onTest)<-paste('meta.seed')
estimates.Meta.onTest$meta.actFrac <- round(meta.actFrac[rownames(estimates.Meta.onTest)],2)
metaseedAcc<-ADAPTS::calcAcc(estimates=estimates.Meta.onTest[,1], reference=estimates.Meta.onTest[,2])


Step 2b: Build a deconvolution matrix using all the genes for clustered cell types

metaallGeneSig <- apply(metatrainSet.3sam, 1, function(x){tapply(x, colnames(metatrainSet.3sam), mean, na.rm=TRUE)})
metaestimates.allGene <- as.data.frame(ADAPTS::estCellPercent.DCQ(t(metaallGeneSig), metapseudobulk.test))
colnames(metaestimates.allGene)<-paste('meta.all')
estimates.Meta.onTest <- cbind(metaestimates.allGene, estimates.Meta.onTest)
metaallAcc<-ADAPTS::calcAcc(estimates=estimates.Meta.onTest[,1], reference=estimates.Meta.onTest[,3])


Step 2c: Augment the seed matrix for clustered cell types

metagList <- ADAPTS::gListFromRF(trainSet=metatrainSet.30sam)

metaaugTrain <- ADAPTS::AugmentSigMatrix(origMatrix = metaseedMat, fullData = metatrainSet.3sam, gList = metagList, nGenes = 1:100, newData = metatrainSet.3sam, plotToPDF = FALSE, pdfDir = '.')

estimates.Meta.augment <- as.data.frame(ADAPTS::estCellPercent.DCQ(metaaugTrain, metapseudobulk.test))
colnames(estimates.Meta.augment) <- paste('meta.aug')
estimates.Meta.onTest <- cbind(estimates.Meta.augment, estimates.Meta.onTest)
metaaugAcc<-ADAPTS::calcAcc(estimates=estimates.Meta.onTest[,1], reference=estimates.Meta.onTest[,4])


Step 2d: Shrink the augmented matrix for clustered cell types

metaaugTrain.shrink <- ADAPTS::shrinkSigMatrix(sigMatrix=metaaugTrain, numChunks=NULL, verbose=FALSE, plotIt = TRUE, aggressiveMin=TRUE, fastStop=FALSE, singleCore=FALSE)
estimates.Meta.shrink <- as.data.frame(ADAPTS::estCellPercent.DCQ(metaaugTrain.shrink, metapseudobulk.test))
colnames(estimates.Meta.shrink) <- paste('meta.shrink')
metatitleStr <- paste('Shrunk Signature Matrix,', '# Genes:',nrow(metaaugTrain),'->',nrow(metaaugTrain.shrink))
pheatmap(augTrain.shrink, main=metatitleStr)
estimates.Meta.onTest <- cbind(estimates.Meta.shrink, estimates.Meta.onTest)
metashrunkAcc <-ADAPTS::calcAcc(estimates=estimates.Meta.onTest[,1], reference=estimates.Meta.onTest[,5])



Final result

The most accurate signature matrix (as measured by Pearson and Spearman’s correlation) is usually generated by meta-clustering with ‘clustWspillOver’, building a seed matrix with ‘buildSeed’, augmenting it with ‘gListFromRF’ and ‘AugmentSigMatrix’, and then removing excess genes with ‘shrinkSigMatrix’. Compare the rho.cor and spear.cor for ‘meta.shrink’ to all other methods, but note that overall RMSE tends to increase due to combing the clusters.

The result varies in different runs due to the randomization in ranger forest. Codes for automatically running the entire simulation multiple times will be included in the next version of ADAPTS.

Note: This data set is relatively difficult to predict. ADAPTS regularly achieves correlations above 0.8 on more recently generated single cell data sets that assumedly use better technology.

acc<-cbind(augshrunkAcc,augAcc,allAcc,seedAcc)
acc<-acc[c(1,3,6),]
acc<-cbind(acc,rep(1,3))
colnames(acc)<-c('shrink','aug','all','seed','actFrac')
deconTable<-round(rbind(estimates.test,acc),2)

metacc<-cbind(metashrunkAcc,metaaugAcc,metaallAcc,metaseedAcc)
metacc<-metacc[c(1,3,6),]
metacc<-cbind(metacc,rep(1,3))
colnames(metacc)<-c('meta.shrink','meta.aug','meta.all','meta.seed','meta.actFrac')
metadeconTable<-round(rbind(estimates.Meta.onTest,metacc),2)

print(deconTable)
##              shrink   aug   all  seed actFrac
## acinar         1.04  1.79  7.17  0.00    5.78
## alpha         15.74 11.72  9.54 10.61   36.24
## beta           7.44  7.34  9.26  5.08   12.39
## co             9.62  8.58  8.48  7.03    1.68
## delta          6.50  8.07  9.57  6.22    7.35
## ductal         8.13  7.95  7.57  5.86   21.74
## endothelial    7.80  7.50  5.54 10.58    0.53
## epsilon        8.49  9.03  6.66  9.10    0.21
## gamma          8.95  8.98  9.56  7.11    9.45
## mast           3.82  6.61  3.31 11.30    0.32
## MHC            2.13  3.97  3.89  6.54    0.32
## PSC            5.95  6.93  6.40 10.50    2.31
## unclassified  14.39 11.51 13.05 10.06    1.68
## others         0.00  0.00  0.00  0.00      NA
## rho.cor        0.51  0.39  0.34 -0.04    1.00
## spear.rho      0.28  0.23  0.61 -0.33    1.00
## rmse           8.89  9.52  9.65 10.77    1.00
print(metadeconTable)
##           meta.shrink meta.aug meta.all meta.seed meta.actFrac
## Meta_1          15.22    15.57    15.30     15.46        27.52
## Meta_2          18.82    18.48    16.27     18.34        45.69
## Meta_3          17.26    17.36    15.92     17.21        14.08
## Meta_4          16.03    16.41    15.91     16.15         9.03
## Meta_5          15.16    15.45    15.69     15.31         3.15
## Meta_6           8.76     6.97     9.43      7.31         0.21
## Meta_7           8.74     9.76    11.48     10.22         0.32
## others           0.00     0.00     0.00      0.00           NA
## rho.cor          0.72     0.67     0.58      0.68         1.00
## spear.rho        0.86     0.89     0.79      0.89         1.00
## rmse            13.21    13.28    14.30     13.36         1.00