Single-cell RNA sequencing analysis

Author

Laurent Modolo


Normalization

Systematic differences in sequencing coverage between libraries are often observed in single-cell RNA sequencing data. They typically arise from technical differences in cDNA capture or PCR amplification efficiency across cells, attributable to the difficulty of achieving consistent library preparation with minimal starting material. Normalization aims to remove these differences such that they do not interfere with comparisons of the expression profiles between cells. This ensures that any observed heterogeneity or differential expression within the cell population is driven by biology and not technical biases.

We will now calculate some properties and visually inspect the data. Our main interest is in the general trends rather than in individual outliers: genes or cells that stand out are not important at this step; we focus on the global trends.

Derive gene and cell attributes from the UMI matrix to plot the mean-variance relationship (for expressed genes and non-empty droplets, on a log scale).

curl -O http://perso.ens-lyon.fr/laurent.modolo/scrna/sce_quality_control.Rdata
if (!require("tidyverse", quietly = TRUE))
    install.packages("tidyverse")
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!require("SingleCellExperiment", quietly = TRUE))
  BiocManager::install("SingleCellExperiment")
if (!require("codetools", quietly = TRUE))
  install.packages('codetools')
if (!require("sctransform", quietly = TRUE))
  install.packages('sctransform')
library(sctransform)
library(codetools)
library(tidyverse)
library(SingleCellExperiment)
load(file = "sce_quality_control.Rdata", v = T)
Loading objects:
  sce
# keep expressed human genes and non-empty droplets called as human cells
sce_hg <- sce[rowData(sce)$expressed & rowData(sce)$species %in% "Homo sapiens",
              colData(sce)$is_cell & colData(sce)$species %in% "Homo sapiens"]
# gene-level attributes: mean, variance and detection rate of each gene
rowData(sce_hg) <- tibble(
  mean = rowMeans(counts(sce_hg)),
  var = apply(counts(sce_hg), 1, var),
  detection_rate = rowMeans(counts(sce_hg) > 0),
  log_mean = log10(mean),
  log_var = log10(var)) %>% 
  cbind(rowData(sce_hg), .)

# drop genes that are never detected
sce_hg <- sce_hg[rowData(sce_hg)$detection_rate > 0, ]

# mean-variance relationship; the red line is the Poisson expectation
# (variance = mean)
rowData(sce_hg) %>% 
  as_tibble() %>% 
  ggplot(aes(log_mean, log_var)) +
  geom_point(alpha = 0.3, shape = 16) +
  geom_density_2d(size = 0.3) +
  geom_abline(intercept = 0,
              slope = 1,
              color = 'red')

From this mean-variance relationship, we can see that up to a given mean UMI count the variance follows the line through the origin with slope one, i.e. variance and mean are roughly equal, as expected under a Poisson model. However, genes with a higher average UMI count show overdispersion compared to Poisson.
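To see how strongly the data depart from Poisson, one can overlay the negative binomial variance \(\mu + \mu^2/\theta\) for a few values of the overdispersion parameter \(\theta\). This sketch is not part of the original analysis, and the \(\theta\) values (1, 10, 100) are arbitrary choices for illustration only.

# sketch (not part of the original analysis): overlay the negative binomial
# variance mu + mu^2 / theta for a few illustrative theta values
nb_var <- tidyr::crossing(
  log_mean = seq(from = -4, to = 2, length.out = 200),
  theta = c(1, 10, 100)
) %>%
  dplyr::mutate(
    mean = 10^log_mean,
    log_var = log10(mean + mean^2 / theta)
  )

rowData(sce_hg) %>%
  as_tibble() %>%
  ggplot(aes(log_mean, log_var)) +
  geom_point(alpha = 0.3, shape = 16) +
  geom_line(data = nb_var, aes(color = factor(theta))) +
  geom_abline(intercept = 0, slope = 1, color = "red")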

The sampling process of each gene, given its mean expression, can be modeled with a Poisson distribution. We can plot this theoretical detection rate and compare it to the empirical detection rate of each gene.

# add the expected detection rate under Poisson model
# under a Poisson model, the probability of detecting a gene is
# 1 - P(count = 0) = 1 - exp(-lambda)
x <- seq(from = -3, to = 2, length.out = 1000)
poisson_model <- tibble(
  log_mean = x,
  detection_rate = 1 - dpois(0, lambda = 10 ^ x)
)
rowData(sce_hg) %>% 
  as_tibble() %>% 
  dplyr::filter(is_genomic) %>% 
  ggplot(aes(x = log_mean, y = detection_rate)) + 
  geom_point(alpha = 0.3, shape = 16) + 
  geom_line(data = poisson_model, color = 'red') +
  theme_gray(base_size = 8) +
  facet_wrap(~species)

From the mean-detection-rate relationship, we see a lower than expected detection rate in the medium expression range. However, for the highly expressed genes, the rate is at or very close to 1.0, suggesting that there is no zero-inflation in the counts for those genes and that the apparent zero-inflation at lower expression levels is a result of overdispersion rather than an independent systematic bias.
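As a quick sanity check (not in the original analysis; the threshold of a mean UMI count above 1, i.e. log_mean > 0, is an arbitrary choice), we can summarise the empirical detection rate of the highly expressed genes:

# summarise the detection rate of genes with a mean UMI count above 1
rowData(sce_hg) %>%
  as_tibble() %>%
  dplyr::filter(log_mean > 0) %>%
  dplyr::summarise(
    n_genes = dplyr::n(),
    median_detection_rate = median(detection_rate),
    min_detection_rate = min(detection_rate)
  )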

# number of detected genes as a function of the total UMI count per cell
colData(sce_hg) %>% 
  as_tibble() %>% 
  ggplot(aes(n_umi, detected)) + 
  geom_point(alpha = 0.3, shape = 16) + 
  geom_density_2d(size = 0.3)

The more UMI counts a cell has, the more genes are detected. In this data set, this seems to be an almost linear relationship (at least within the UMI range of most of the cells).
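One simple way to quantify this relationship is to compute the correlation between the two quantities. This check is ours and is not part of the original analysis:

# correlation between total UMI counts and number of detected genes per cell
colData(sce_hg) %>%
  as_tibble() %>%
  dplyr::summarise(
    pearson = cor(n_umi, detected),
    spearman = cor(n_umi, detected, method = "spearman")
  )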

The sctransform algorithm (Hafemeister et al. 2019) proposes to model the expression of each gene as a negative binomial random variable with a mean that depends on other variables. Here these other variables model the differences in sequencing depth between cells and are used as independent variables in a regression model. In order to avoid overfitting, we first fit model parameters per gene, and then use the relationship between gene mean and parameter values to regularize these estimates, thereby combining information across genes. Given the fitted model parameters, we transform each observed UMI count into a Pearson residual, which can be interpreted as the number of standard deviations by which an observed count deviates from its expected mean. If the model accurately describes the mean-variance relationship and the dependence of the mean on latent factors, then the result should have mean zero and a stable variance across the range of expression.
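Concretely, writing \(y\) for the observed UMI count of a gene in a given cell, \(\mu\) for its expected value under the fitted model and \(\theta\) for the gene's overdispersion parameter, the Pearson residual described above is

\[
z = \frac{y - \mu}{\sqrt{\mu + \frac{\mu^2}{\theta}}}
\]

where the denominator is the negative binomial standard deviation discussed below. The notation here is ours; the formula simply combines the definition of a Pearson residual with the negative binomial variance.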

The vst() function estimates the model parameters and performs the variance stabilizing transformation. Here we use the log10 of the total UMI count of each cell as the sequencing depth variable. After the transformation, we plot the model parameters as a function of the gene mean (geometric mean). The correct() function computes the corrected count matrix.

set.seed(44)
# log10 of the total UMI count per cell, used as the sequencing depth covariate
colData(sce_hg)$log_umi <- log10(colData(sce_hg)$n_umi)
vst_out <- sctransform::vst(
  counts(sce_hg),
  cell_attr = colData(sce_hg),
  latent_var = c("log_umi"),
  return_gene_attr = TRUE,
  return_cell_attr = TRUE,
  show_progress = FALSE)
# regularized model parameters as a function of the gene mean
sctransform::plot_model_pars(vst_out)

# corrected (sequencing-depth normalized) UMI count matrix
corrected_counts <- sctransform::correct(vst_out, show_progress = FALSE)

Internally, vst() performs a Poisson regression per gene with \(\log(\mu) = \beta_0 + \beta_1 x\), where \(x\) is log_umi, the base 10 logarithm of the total number of UMI counts in each cell, and \(\mu\) is the expected number of UMI counts of the given gene. The previous plot shows \(\beta_0\) (Intercept), the \(\beta_1\) coefficient for log_umi, and the maximum likelihood estimate of the overdispersion parameter \(\theta\) under the negative binomial model. Under the negative binomial model, the variance of a gene depends on the expected UMI counts and on \(\theta\): \(\mu + \frac{\mu^2}{\theta}\). In a second step, the regularized model parameters are used to turn observed UMI counts into Pearson residuals.
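To make this concrete, here is a minimal sketch of the per-gene step for a single, arbitrarily chosen highly expressed gene. This is an illustration, not the actual sctransform implementation: the gene choice, the use of MASS::theta.ml() for the \(\theta\) estimate, and the absence of the regularization step across genes are all simplifications of ours.

# illustration of the per-gene model for one highly expressed gene
gene_idx <- which.max(rowData(sce_hg)$log_mean)
y <- counts(sce_hg)[gene_idx, ]
log_umi <- colData(sce_hg)$log_umi

# Poisson regression: log(mu) = beta_0 + beta_1 * log_umi
fit <- glm(y ~ log_umi, family = poisson)
mu <- fitted(fit)

# maximum likelihood estimate of the overdispersion parameter theta
theta <- as.numeric(MASS::theta.ml(y, mu))

# Pearson residuals under the negative binomial model
z <- (y - mu) / sqrt(mu + mu^2 / theta)
c(mean = mean(z), variance = var(z))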

After exploring the problem of empty droplets and doublets, we now turn to another scRNA-Seq data set to explore another aspect of scRNA-Seq data.

We are now going to run a simple scRNA-Seq analysis with Seurat.