dsb (denoised and scaled by background) is an R package developed in John Tsang’s Lab for removing noise and normalizing protein data from single cell methods measuring protein with DNA-barcoded antibodies such as CITE-seq or Mission Bio Tapestri data. See the dsb Biorxiv preprint for details and please consider citing the paper if you use this package or find our experiments / modeling describing the sources of noise in CITE-seq data useful.

Table of Contents

  1. Background and motivation
  2. Installation and quick overview
  3. Tutorial with public 10X Genomics data
  4. Step 1 Download public data
  5. Step 2 Load RNA and protein alignment (Cell Ranger) data and define metadata
  6. Step 3 Quality control on cell-containing and background droplets
  7. Step 4 Normalize protein data with the DSBNormalizeProtein Function
  8. Integrating dsb with Seurat
  9. Clustering cells based on dsb normalized protein using Seurat
  10. dsb derived cluster interpretation
  11. Weighted Nearest Neighbor multimodal clustering using dsb normalized values with Seurat
  12. Integrating dsb with Bioconductor
  13. Integrating dsb with python/Scanpy
  14. Using dsb with data lacking isotype controls
  15. Integrating dsb with sample multiplexing experiments
  16. Frequently Asked Questions and advanced usage topics

Background and motivation

CITE-seq protein data suffers from substantial background noise (for example, see supplementary fig 5a in Stoeckius et. al. 2017 Nat. Methods). We performed experiments and analysis to dissect this noise; the dsb method is based on 3 key findings outlined in our preprint

  1. Based on spike in experiments and modeling we found a major source of protein background noise comes from ambient, unbound antibody encapsulated in droplets.

  2. “Empty” droplets (containing ambient mRNA and antibody but no cell), outnumber cell-containing droplets by 10-100 fold and capture the ambient component of protein background noise.

  3. Cell-to-cell technical variations (i.e. stochastic differences in capture, RT efficiency, sequencing depth, and non-specific antibody binding) can be estimated and removed through a model of each cell’s “technical component” using isotype control counts and per cell mixture modeling.

Installation and quick overview

Install dsb directly from CRAN using install.packages('dsb').

Use the DSBNormalizeProtein() function to normalize included package data of a raw protein count matrix cells_citeseq_mtx using background/empty droplet matrix empty_drop_citeseq_mtx. Model and remove the ‘technical component’ of each cell’s protein library by setting denoise.counts = TRUE including isotype controls in calculating the technical component with use.isotype.control = TRUE.

Tutorial with public 10X Genomics data

Below we demonstrate dsb using data 10X Genomics data aligned with Cell Ranger. We define cells and background drops, run QC, normalize protein with dsb, cluster cells based on dsb normalized protein levels, visualize and interpret clustering results based on dsb values. Finally we use dsb values directly with Seurat’s Weighted Nearest Neighbor multimodal RNA + protein clustering algorithm. Wrappers are also provided to integrate dsb into AnnData (python) and Bioconductor (SingleCellExperiment) data structures (see table of contents).

Step 1 Download public data

Download both the filtered and raw count matrices from Cell Ranger from this public 10X Genomics CITE-seq data.

To follow code exactly below without changing paths, download and uncompress the files ‘Feature / cell matrix (filtered)’ and ‘Feature / cell matrix (raw)’ which will be automatically renamed filtered_feature_bc_matrix and raw_feature_bc_matrix and add them to a directory data/ in your working directory.

The filtered output is a subset of the raw output that are the cells estimated by Cell Ranger’s improved cell calling algorithm based on the EmptyDrops algorithm. One should always set the Cell Ranger --expect-cells argument roughly equal to the estimated cell recovery per lane based on number of cells loaded in the experiment. The raw output is all possible cell barcodes ranging from 500k-6M barcodes depending on the assay version. We will filter these barcodes to extract the major background signal. You can also estimate cells directly from count aligners such as CITE-seq-count (set the argument –cells to a value >2e5 to align background) or Kallisto.

Your working directory should now be structured:
data/
  |_filtered_feature_bc_matrix
  |_raw_feature_bc_matrix

Step 2 Load RNA and protein alignment (Cell Ranger) data and define metadata

Here we use the convenience function from Seurat Read10X which will automatically detect multiple assays and create two element list Gene Expression and Antibody Capture.

Step 3 Quality control on cell-containing and background droplets

The plot below shows the number of detected genes vs the protein library size for cells vs background drops. One can also define the cells vs background drops directly by thresholding on a plot like this if for example using a different count aligner that does not provide filtered cell output.

We next further filter cells based on thresholds calculated from quality control metrics as in any standard scRNAseq analysis, e.g. see Luecken et. al. 2019 Mol Syst Biol.

Note cells with the smaller library size are mostly naive CD4 T cells which are small in size and naturally have less mRNA content.

Sanity check: are the number of QCd cells in line with the expected recovery from the experiment?

[1] 4096

Yes. After quality control above we have 4096 cells which is in line with the ~5000 cells loaded in this experiment.

We also filter background droplets to remove potential spurious cells (drops with high mRNA content) and use the major peak in the background distribution between 1.4 and 2.5 log total protein counts.

Optional step; remove proteins without staining

While dsb will handle noisy proteins, some proteins in an experiment may not work for bioinformatic reasons or may target a very rare cell population that was absent in the experiment. This is especially true as panels increase in size. Proteins without counts on the stained cells should be removed prior to normalization.

prot pmax
CD34_TotalSeqB 4
CD80_TotalSeqB 60
CD274_TotalSeqB 75
IgG2b_control_TotalSeqB 90
IgG2a_control_TotalSeqB 95
IgG1_control_TotalSeqB 112

In this experiment, the stem cell marker CD34 has essentially no data (a maximum UMI of 4 across all cells in the experiment). We therefore remove it from cell and background matrices. In many cases, removing proteins is not necessary.

Step 4 Normalize protein data with the DSBNormalizeProtein Function

For data with isotype control proteins, set denoise.counts = TRUE and use.isotype.control = TRUE and provide a vector containing names of isotype control proteins (the rownames of the protein matrix that are isotype controls). For data without isotype controls, see the vignette section Using dsb with data lacking isotype controls.

The function returns a matrix of normalized protein values which can be integrated with any single cell analysis software. We provide an example with Seurat, Bioconductor and Scanpy below.
optional QC step sometimes denoising results in very small negative values for a protein corresponding to very low expression; it can be helpful to convert all dsb normalized values < -10 to 0 for interpretation / visualization purposes. (see FAQ)

Integrating dsb with Seurat

This object can be used in downstream analysis using Seurat (note to use dsb values do not use the default CLR normalization after adding dsb values).

Clustering cells based on dsb normalized protein using Seurat

Here we will cluster the cells and annotate them based on dsb normalized protein levels. This is similar to the workflows used in our paper Kotliarov et. al. 2020 Nat. Medicine. We first run spectral clustering using Seurat directly on the dsb normalized protein values without reducing dimensionality of the cells x protein matrix with PCA.

dsb derived cluster interpretation

A heatmap of average dsb normalized values in each cluster help in annotating clusters results. The values for each cell represent the number of standard deviations of each protein from the expected noise from reflected by the protein’s distribution in empty droplets, +/- the residual of the fitted model to the cell-intrinsic technical component.

Annotate cell types based on median dsb normalized protein levels per cluster

Weighted Nearest Neighbor multimodal clustering using dsb normalized values with Seurat

Below we demonstrate using Seurat’s weighted nearest neighbors multimodal clustering method with dsb normalized values as input for the protein assay. The performance of this algorithm is better on larger datasets but we demonstrate here on this small dataset as an example.

Of note, the default recommendation of the WNN method is to first compress both the ADT and mRNA data into principal components. For a dataset with a smaller number of proteins, we have found that just using the dsb normalized cells x protein directly rather than compressing the ADT data into principal components can improve the resulting clusters and interpretation of the joint embedding. Datasets generated with recently available pre-titrated panels consisting of more than 100 or 200 proteins may benefit more from dimensionality reduction with PCA.

Below, examples are provided for using both the dsb normalized protein values directly as well as the dsb normalized values compressed into principal components (the default WNN method). Again, this algorithm is more powerful on datasets with a greater number of cells.

# use pearson residuals as normalized values for pca 
DefaultAssay(s) = "RNA"
s = NormalizeData(s, verbose = FALSE) %>% 
  FindVariableFeatures(selection.method = 'vst', verbose = FALSE) %>% 
  ScaleData(verbose = FALSE) %>%
  RunPCA(verbose = FALSE)

# set up dsb values to use in WNN analysis 
DefaultAssay(s) = "CITE"
# hack seurat to use normalized protein values as a dimensionality reduction object.
VariableFeatures(s) = prots

# run true pca to initialize dr pca slot for WNN 
s = ScaleData(s, assay = 'CITE', verbose = FALSE)
s = RunPCA(s, reduction.name = 'pdsb', features = VariableFeatures(s), verbose = FALSE)

# make matrix of norm values to add as dr embeddings
pseudo = t(s@assays$CITE@data)[,1:29]
pseudo_colnames = paste('pseudo', 1:29, sep = "_")
colnames(pseudo) = pseudo_colnames
# add to object 
s@reductions$pdsb@cell.embeddings = pseudo

# run WNN 
s = FindMultiModalNeighbors(
  object = s,
  reduction.list = list("pca", "pdsb"),
  weighted.nn.name = "dsb_wnn", 
  knn.graph.name = "dsb_knn",
  modality.weight.name = "dsb_weight",
  snn.graph.name = "dsb_snn",
  dims.list = list(1:30, 1:29), 
  verbose = FALSE
)

s = FindClusters(s, graph.name = "dsb_knn", algorithm = 3, resolution = 1.5,
                 random.seed = 1990,  verbose = FALSE)
s = RunUMAP(s, nn.name = "dsb_wnn", reduction.name = "dsb_wnn_umap", 
            reduction.key = "dsb_wnnUMAP_", seed.use = 1990, verbose = FALSE)

# plot 
p1 = Seurat::DimPlot(s, reduction = 'dsb_wnn_umap', group.by = 'dsb_knn_res.1.5',
              label = TRUE, repel = TRUE, label.size = 2.5, pt.size = 0.1) + 
  theme_bw() + 
  xlab('dsb protein RNA multimodal UMAP 1') + 
  ylab('dsb protein RNA multimodal UMAP 2') + 
  ggtitle('WNN using dsb normalized protein values')

p1

Now we can add these higher resolution subcluster annotations to the dsb derived protein only clusters for interpretation and visualize the results with a multimodal heatmap.

# create multimodal heatmap 
vf = VariableFeatures(s,assay = "RNA")

Idents(s) = "dsb_knn_res.1.5"
DefaultAssay(s)  = "RNA"
rnade = FindAllMarkers(s, features = vf, only.pos = TRUE)
gene_plot = rnade %>% filter(avg_log2FC > 1 ) %>%  group_by(cluster) %>% top_n(3) %$% gene %>% unique 

s@meta.data$celltype_subcluster = paste(s@meta.data$celltype, s@meta.data$dsb_knn_res.1.5)

d = cbind(s@meta.data, 
          # protein 
          as.data.frame(t(s@assays$CITE@data)), 
          # mRNA
          as.data.frame(t(as.matrix(s@assays$RNA@data[gene_plot, ]))),
          s@reductions$umap@cell.embeddings)

# combined data 
adt_plot = d %>% 
  dplyr::group_by(dsb_knn_res.1.5) %>% 
  dplyr::summarize_at(.vars = c(prots, gene_plot), .funs = median) %>% 
  tibble::remove_rownames() %>% 
  tibble::column_to_rownames("dsb_knn_res.1.5") 


# make a combined plot 
suppressMessages(library(ComplexHeatmap))
# protein heatmap 
prot_col = circlize::colorRamp2(breaks = seq(-10,30, by = 2), colors = viridis::viridis(n = 18, option = "B", end = 0.95))
p1 = Heatmap(t(adt_plot)[prots, ], name = "protein",col = prot_col, use_raster = T,
             row_names_gp = gpar(color = "black", fontsize = 5))

# mRNA heatmap 
mrna = t(adt_plot)[gene_plot, ]
rna_col = circlize::colorRamp2(breaks = c(-2,-1,0,1,2), colors = colorspace::diverge_hsv(n = 5))
p2 = Heatmap(t(scale(t(mrna))), name = "mRNA", col = rna_col, use_raster = T, 
             clustering_method_columns = 'average',
             column_names_gp = gpar(color = "black", fontsize = 7), 
             row_names_gp = gpar(color = "black", fontsize = 5))

ht_list = p1 %v% p2
draw(ht_list)

One can also use the default implementation of Seurat’s WNN analysis with principal components based dsb normalized values as input.

Integrating dsb with Bioconductor

Rather than Seurat you may wish to use the SingleCellExperiment class to use Bioconductor packages - this can be accomplished with this alternative to step III above using the following code. To use Bioconductor’s semantics, we store raw protein values in an ‘alternative Experiment’ in a SingleCellExperiment object containing RNA counts.

Integrating dsb with python/Scanpy

You can also use dsb normalized values with the AnnData class in Python by following steps 1 and 2 above to load data, define drops and normalize with dsb in R. The simplest option is then to use reticulate to create the AnnData object from dsb denoised and normalized protein values as well as raw RNA data. That object can then be imported into a python session. Anndata are not structured as separate assays; we therefore need to merge the RNA and protein data. See the current Scanpy CITE-seq workflow and more on interoperability between Scanpy Bioconductor and Seurat

NEW: dsb users keen on using python are also encouraged to checkout the dsb wrapper available in the muon python multimodal framework about muon dsb wrapper for python in muon

Using dsb with data lacking isotype controls

If isotype controls are not included, you can run dsb correcting ambient background without cell denoising. We only recommend setting denoise.counts = FALSE if isotype controls were not included in the experiment which results in not defining the technical component of each cell’s protein library. The values of the normalized matrix returned are the number of standard deviations above the expected ambient noise captured by empty droplets.

We strongly recommend using isotype controls, however if these are not available, the background mean for each cell inferred via a per-cell gaussian mixture model (µ1) can theoretically be used alone to define the cell’s technical component, however this assumes the background mean has no expected biological variation. In our data the background mean had weak but significant correlation with the foreground mean (µ2) across single cells (see the paper). Isotype controls anchor the component of the background mean associated with noise.

Integrating dsb with sample multiplexing experiments

In multiplexing experiments with cell superloading, demultiplexing functions define a “negative” cell population which can then be used to define background droplets for dsb. Multiplexing / Demultiplexing methods and functions compatible with dsb:
HTODemux (Seurat)
deMULTIplex (Multiseq)
demuxlet

In our data, dsb normalized values were nearly identically distributed when dsb was run with background defined by demultiplexing functions or protein library size (see the paper).

We load the raw output from cell ranger. Prior to demultiplexing we use the min.genes argument in the Seurat::Read10X function to partially threshold out some background drops yet still retain sufficient (often > 80,000 droplets per 10X lane depending on experiment) from which to estimate the background. This balances memory strain when demultiplexing tens of thousands of cells with requirements of the Seurat::HTODemux function to have sufficient empty drops to estimate the background population of each Hash antibody. Importantly, the HTODemux function may not converge if the cell ranger filtered output was loaded since there will not be sufficient negative drops to estimate the background for each hashing antibody. Increasing the number of drops used in demultiplexing will result in more droplets defined by the function as “negative” which can increase the confidence in the estimate of background used by dsb.

Frequently Asked Questions

Can other count aligners like Kallisto and CITE-seq-count be used? Yes! dsb was developed prior to 10X Genomics supporting CITE-seq or hashing data and we routinely use other alignment pipelines. It is important to specify a large number of cells for e.g. CITE-seq-count in order to make sure you obtain background reads and not just cells. If you loaded 10,000 cells, you would NOT set the –cells argument in CITE-seq-count to be 10000, you would set it to e.g. 300000 in order to align to background drops as well as the large library size cell-containing droplets. dsb can then be used exactly as in the vignette, one simply loads the ADT count data and defines foreground and background drops using thresholds based on mRNA and protein as shown in the vignette. For using Cell Ranger, the analogous argument expect-cells should actually be set to the number of expected recovered cells-Cell Ranger uses the EmptyDrops algorithm to call cells vs empty drops based in part on this parameter and returns the full output in the raw and just the estimated cells in the filtered outputs respectively (as shown in the documentation example above).

For more info see: https://github.com/Hoohm/CITE-seq-Count/issues/146
For comparison of CITE-seq alignment pipelines in a workflow that used dsb see: https://github.com/Terkild/CITE-seq_optimization

I get the error “Error in quantile.default(x, seq(from = 0, to = 1, length = n)): missing values and NaN’s not allowed if ‘na.rm’ is FALSE” What should I do? - This error occurs during denoising, (denoise = TRUE) when you have antibodies with 0 counts or close to 0 across all cells. To get rid of this error, check the distributions of the antibodies with e.g. apply(cells_protein_matrix, 1, quantile) to find the protein(s) with basically no counts, then remove these from the empty drops and the cells. Please refer to these links:
https://github.com/niaid/dsb/issues/6
https://github.com/niaid/dsb/issues/26

I get a "problem too large or memory exhausted error when I try to convert to a regular R matrix - (see issue 10 on the dsb github) CITE-seq protein counts don’t need a sparse representation-very likely this error is because there are too many negative droplets defined (i.e. over 1 million). You should be able to normalize datasets with 100,000+ cells and similar numbers of negative droplets (or less) on a normal 16GB laptop. By further narrowing in on the major background distribution, one should be able to convert the cells and background to a normal R matrix which should run successfully.
https://github.com/niaid/dsb/issues/10

the range of dsb normalized values is large is this normal? In all cases encountered thus far, the large range of values for a protein (e.g. ranging from -50 to 50) are caused by just a few outlier cells, most often a few cells with low negative values for the protein. We have now provided a quantile clipping option in dsb to address these outlier cells. Users can also investigate these cells to see if they have very high values for isotype control proteins and they can possibly be removed from the dataset. https://github.com/niaid/dsb/issues/22
https://github.com/niaid/dsb/issues/9

implement dsb with Quantile clipping

By default setting quanitle.clipping = TRUE sets cells values for a given protein above the 99.95th percentile or below the 0.1 percentile of that protein’s expression to be thees quantile values. That value is optimized to remove a few high and low magnitude outliers but can be set to adapt to the number of cells in the dataset.

A simpler alternative to quantile clipping not requiring re-running dsb

Another simple option if there are a few low magnitude outliers: cells with a given protein value < -10 can be interpreted as not expressed. One can simply convert these values to 0 with the code below.

What are the minimum number of proteins required to use dsb dsb is compatible with datasets with any number of proteins with step I alone. We have validated the algorithm assumptions on datasets with 14 phenotyping antibodies and 3 isotype controls. With less proteins than this, we recommend users return the internal stats calculated by dsb and check correlations of the variables as shown below. If these values are reasonably correlated, it indicates the model assumptions of dsb are likely being met.

How do I know whether I should set the denoise.counts argument to TRUE vs FALSE?
In the vast majority of cases we recommend setting denoise.counts = TRUE and use.isotype.control = TRUE (this is the package default). The denoise.counts argument specifies whether to remove cell-intrinsic technical noise by defining and regressing out cell-intrinsic technical factors that contribute technical variations not captured by protein counts in background droplets used in dsb. The only reason not to use this argument is if the model assumptions used to define the technical component are not expected to be met by the particular experiment: with denoise.counts = TRUE dsb models the negative protein population (µ1) for each cell with a two-component Gaussian mixture, making the conservative assumption that cells in the experiment should be negative for a subset of the measured proteins. If you expect all cells in your experiment be positive for all of the proteins measured, this may not be an optimal assumption. Model assumptions were validated on datasets measuring less than 20 to more than 200 proteins. If your data have less proteins, see note above.

I have multiple “lanes” of 10X data from the same pool of cells, how should I run the workflow above?

Droplets derived from the same pool of stained cells partitioned across multiple lanes should be normalized together. To do this, you should merge the raw output of each lane, then run step 1 in the workflow-note that since the cell barcode names are the same for each lane in the raw output, you need to add a string to each barcode to identify the lane of origin to make the barcodes have unique names; here is one way to do that: First, add each 10X lane raw output from Cell Ranger into a separate directory in a folder “data”
data.
|_10xlane1
  |_outs
    |_raw_feature_bc_matrix
|_10xlane2
  |_outs
    |_raw_feature_bc_matrix

I have 2 batches, should I combine them into a single batch or normalize each batch separately? - (See issue 12 on the dsb github) How much batch variation there is depends on how much experiment-specific and expected biological variability there is between the batches. In the dataset used in the preprint, if we normalized with all background drops and cells in a single normalization, the resulting dsb normalized values were highly concordant with when we normalized each batch separately, this held true with either definition of background drops used (i.e. based on thresholding with the library size or based on hashing-see below). One could try both and see which mitigates the batch variation the most. See https://github.com/niaid/dsb/issues/12 for example code.