1 Quality control

Low-quality libraries in scRNA-seq data can arise from a variety of sources such as cell damage during dissociation or failure in library preparation. These usually manifest as “cells” with low total counts, few expressed genes and high mitochondrial or spike-in proportions. These low-quality libraries are problematic as they can contribute to misleading results in downstream analyses:

They form their own distinct cluster(s), complicating the interpretation of the results.
They distort the characterization of population heterogeneity during variance estimation or principal component analysis.
They contain genes that appear to be strongly “upregulated” due to aggressive scaling to normalize for small library sizes.

To avoid—or at least mitigate—these problems, we need to remove these cells at the start of the analysis. This step is commonly referred to as quality control (QC) on the cells.

2 QC metrics

The library size is defined as the total sum of counts across all relevant features for each cell.
The number of expressed features in each cell is defined as the number of endogenous genes with non-zero counts for that cell.
The proportion of reads mapped to spike-in transcripts is calculated relative to the total count across all features (including spike-ins) for each cell.
In the absence of spike-in transcripts, the proportion of reads mapped to genes in the mitochondrial genome can be used (and we have the chromosomic position of each gene).

A key assumption here is that the QC metrics are independent of the biological state of each cell. Poor values (e.g., low library sizes, high mitochondrial proportions) are presumed to be driven by technical factors rather than biological processes, meaning that the subsequent removal of cells will not misrepresent the biology in downstream analyses.

The first step is to remove non-expressed genes, this will reduce the dimensionality of the data we have to work with. To test if a given gene is expressed you can use the nexprs() function and add the information to the rowData() slot.

load(file = "sce_singecellexperiment.Rdata", v = T)

## Loading objects:
##   sce

rowData(sce)$expressed <- nexprs(sce, byrow = TRUE) > 0
table(rowData(sce)$expressed)

## 
## FALSE  TRUE 
## 38951 38251

We can compute a large number of different quality metrics on each gene and on each cell. The scater package contains the addPerCellQC() function that appends a lot of quality control data. The subset option allows us to compute statistics on the non-genomic gene content (i.e. Mitocondrial genes)

per_cell <- perCellQCMetrics(sce[rowData(sce)$expressed, ], subset = list(non_genomic = !rowData(sce[rowData(sce)$expressed, ])$is_genomic))
summary(per_cell$sum)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     1.00     1.00    76.04     8.00 74292.00

summary(per_cell$detected)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    1.00   23.29    8.00 9664.00

summary(per_cell$subsets_non_genomic_percent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    2.64    0.00  100.00   43822

# we add the statistics to the sce object
colData(sce) <- cbind(colData(sce), per_cell)

The plotColData() function makes easy ggplot from SingleCellExperiment object. We can make QC diagnostic plots to check the statistics computed.

plotColData(sce,
            x = "sum",
            y = "detected",
            colour_by = "species")

plotColData(sce,
            x = "sum",
            y = "subsets_non_genomic_percent",
            other_fields = "species") + facet_wrap( ~ species)

We can identify outlier cells. The isOutlier function provides a more data-adaptive way of choosing these thresholds. This defines the threshold at a certain number of median absolute deviations (MADs) away from the median. Values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells. Here, we define small outliers (using type="lower") for the log-total counts at 3 MADs from the median. Use this function to identify cells with low total umi counts.

colData(sce)$keep_total <-
  isOutlier(colData(sce)$sum, type = "lower", log = TRUE)

Analogous to our call to addPerCellQC() in the prior section, the addPerFeatureQC() function will insert values in the rowData slot of our sce object:

sce <- addPerFeatureQC(sce)
rowData(sce)

## DataFrame with 77202 rows and 8 columns
##                                  genes_id genes_names      species
##                               <character> <character>  <character>
## ENSG00000237235.2       ENSG00000237235.2       TRDD2 Homo sapiens
## ENSG00000223997.1       ENSG00000223997.1       TRDD1 Homo sapiens
## ENSG00000228985.1       ENSG00000228985.1       TRDD3 Homo sapiens
## ENSG00000282253.1       ENSG00000282253.1          NA Homo sapiens
## ENSG00000282431.1       ENSG00000282431.1       TRBD1 Homo sapiens
## ...                                   ...         ...          ...
## ENSMUSG00000106908.1 ENSMUSG00000106908.1          NA Mus musculus
## ENSMUSG00000107077.1 ENSMUSG00000107077.1          NA Mus musculus
## ENSMUSG00000106901.1 ENSMUSG00000106901.1          NA Mus musculus
## ENSMUSG00000107196.1 ENSMUSG00000107196.1          NA Mus musculus
## ENSMUSG00000118095.1 ENSMUSG00000118095.1          NA Mus musculus
##                                chr_pos is_genomic expressed      mean  detected
##                            <character>  <logical> <logical> <numeric> <numeric>
## ENSG00000237235.2                   14       TRUE     FALSE         0         0
## ENSG00000223997.1                   14       TRUE     FALSE         0         0
## ENSG00000228985.1                   14       TRUE     FALSE         0         0
## ENSG00000282253.1    CHR_HSCHR7_2_CTG6      FALSE     FALSE         0         0
## ENSG00000282431.1                    7       TRUE     FALSE         0         0
## ...                                ...        ...       ...       ...       ...
## ENSMUSG00000106908.1  CHR_MG3562_PATCH      FALSE     FALSE         0         0
## ENSMUSG00000107077.1  CHR_MG3562_PATCH      FALSE     FALSE         0         0
## ENSMUSG00000106901.1  CHR_MG3562_PATCH      FALSE     FALSE         0         0
## ENSMUSG00000107196.1  CHR_MG3562_PATCH      FALSE     FALSE         0         0
## ENSMUSG00000118095.1  CHR_MG4249_PATCH      FALSE     FALSE         0         0

summary(rowData(sce)$detected)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.03016 0.01436 6.48303

3 Filter empty droplets

Most barcodes in the matrix correspond to empty droplets. A common way to determine which barcodes are empty droplets and which are real cells is to plot the rank of total UMI counts of each barcode against the total UMI counts itself, which is commonly called knee plot. The inflection point in that plot, signifying a change in state, is used as a cutoff for total UMI counts; barcodes below that cutoff are deemed empty droplets and removed.

Droplet-based data is that we have no prior knowledge about whether a particular library (i.e., cell barcode) corresponds to cell-containing or empty droplets. Thus, we need to call cells from empty droplets based on the observed expression profiles.

The distribution of total counts exhibits a sharp transition between barcodes with large and small total counts, probably corresponding to cell-containing and empty droplets respectively. A simple approach would be to apply a threshold on the total count to only retain those barcodes with large totals. However, this unnecessarily discards libraries derived from cell types with low RNA content.

To compute the barcode ranks, you can use the barcodesRanks() function. The following knee_plot function allows you to display the knee plot.

knee_plot <- function(bc_rank) {
  knee_plt <- tibble(rank = bc_rank[["rank"]],
                     total = bc_rank[["total"]]) %>% 
    distinct() %>% 
    dplyr::filter(total > 0)
  annot <- tibble(inflection = metadata(bc_rank)[["inflection"]],
                  rank_cutoff = max(bc_rank$rank[bc_rank$total > metadata(bc_rank)[["inflection"]]]))
  p <- ggplot(knee_plt, aes(total, rank)) +
    geom_line() +
    geom_hline(aes(yintercept = rank_cutoff), data = annot, linetype = 2) +
    geom_vline(aes(xintercept = inflection), data = annot, linetype = 2) +
    scale_x_log10() +
    scale_y_log10() +
    annotation_logticks() +
    labs(y = "Rank", x = "Total UMIs")
  return(p)
}
bcrank <- barcodeRanks(counts(sce[rowData(sce)$expressed, ]))
knee_plot(bcrank)

colData(sce)$is_cell <- colData(sce)$n_umi > metadata(bcrank)$inflection

We can also use the emptyDrops() function to test whether the expression profile for each cell barcode is significantly different from the ambient RNA pool (Lun et al. 2019). Any significant deviation indicates that the barcode corresponds to a cell-containing droplet.

# emptyDrops performs Monte Carlo simulations to compute p-values,
# so we need to set the seed to obtain reproducible results.
set.seed(100)
ed_out <- emptyDrops(counts(sce))
table(Sig=ed_out$FDR <= 0.05, Limited=ed_out$Limited)
is_cell <- ed_out$FDR <= 0.05
sum(is_cell, na.rm=TRUE)
tibble(logprob = ed_out$LogProb,
       is_cell = is_cell,
       total = ed_out$Total,
       species = colData(sce)$species
) %>% 
  ggplot() +
  geom_point(aes(x = total, y = -logprob, color = is_cell)) +
  annotation_logticks() +
  scale_y_log10() +
  scale_x_log10() +
  facet_wrap(~species) +
  labs(x = "Total UMI count", y = "-Log Probability")

colData(sce)$is_cell <- ifelse(is.na(is_cell), FALSE, is_cell)

Now that we can work with a more reasonable number of cells (with the is_cell variable), we can plot genes. The plotHighestExprs() fuction shows the top 50 (by default) most-expressed genes Each row in the plot below corresponds to a gene, and each bar corresponds to the expression of a gene in a single cell. The circle indicates the median expression of each gene, with which genes are sorted.

You can plot a selection of genes with the plotExpression() function (the 6 firsts).

plotHighestExprs(
  sce[rowData(sce)$expressed, sce$is_cell],
  exprs_values = "counts",
  n = 10
)

sce <- computeSumFactors(
  sce[rowData(sce)$expressed, sce$is_cell],
  clusters = sce$species[sce$is_cell]
)
sce <- logNormCounts(sce)
plotExpression(sce, rownames(sce)[1:6], x = "species")

Here, we added a logNormCounts matrix to our sce object. We can store many different counts matrix simultaneously in one sce object. This is very useful to switch between function working on counts or on pseudo-count. We see that the assay slot has grown to contain two entries: "counts" (our initial data) and "logcounts" (the log-transformed normalized data). Similar to "counts", the "logcounts" name can be conveniently accessed using logcounts(sce).

logcounts(sce)

4 Doublet detection

We have seen many cells annotated as mixed, how many real cells are human-mouse mixture ? Make a visualization of the proportion of n_mm_umi and n_hg_umi for each cells.

colData(sce) %>% 
  as_tibble() %>% 
  dplyr::filter(is_cell) %>% 
  ggplot() +
  geom_point(aes(x = n_mm_umi, y = n_hg_umi, color = species)) +
  annotation_logticks() +
  scale_y_log10() +
  scale_x_log10() +
  labs(x = "Mouse UMI count", y = "Human UMI count")

How many cells are from humans and how many from mice? The number of cells with mixed species indicates doublet rate. Here, it’s easy, we can identify human-mouse mixed droplets and deduce that they are doublet. But what about human-human doublets or mouse-mouse doublets ?

Doublet detection is a difficult problem, and state-of-the-art strategy involves in silico simulation of doublets from the single-cell expression profiles (Dahlin et al. 2018). This is performed using the doubletCells() function from scran, which will :

Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
For each original cell, compute the density of simulated doublets in the surrounding neighborhood.
For each original cell, compute the density of other observed cells in the neighborhood.
Return the ratio between the two density as a “doublet score” for each cell.

This approach assumes that the simulated doublets are good approximations for real doublets. The use of random selection accounts for the relative abundances of different subpopulations, which affect the likelihood of their involvement in doublets; and the calculation of a ratio avoids high scores for non-doublet cells in highly abundant subpopulations.

dbl_dens <- doubletCells(sce[rowData(sce)$expressed, sce$is_cell])
summary(dbl_dens)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    1.319   23.476    1.347 2717.790

sce$doublet_score <- 0
sce$doublet_score[sce$is_cell] <- log10(dbl_dens + 1)

pca_data <- prcomp(t(logcounts(sce)), rank = 50)
set.seed(5252)
tsne_data <- Rtsne(pca_data$x[,1:50], pca = FALSE)
reducedDims(sce) <- list(PCA = pca_data$x, TSNE = tsne_data$Y)
plotTSNE(sce[rowData(sce)$expressed, sce$is_cell], colour_by = "doublet_score")

save(sce, file = "sce_quality_control.Rdata")

Doublet detection procedures should only be applied to libraries generated in the same experimental batch. It is obviously impossible for doublets to form between two cells that were captured separately.

single-cell RNA sequencing analysis

Laurent Modolo

April 6, 2021

1 Quality control

2 QC metrics

3 Filter empty droplets

4 Doublet detection