Single Cell RNA-seq Experiments
Gemma now supports single-cell data. This is still a work in progress, but the main features for compatibility are now in place and we have started to make data sets available through the Gemma web site and web services.
Scope
We are currently focusing on data sets collected with 10x Genomics platforms. We are not yet re-analyzing the data from FASTQ files (this is in the works as of 9/2025). For now, this limits us to data sets where count matrices are provided by the authors. Large numbers of additional data sets will be eligible for inclusion once we have our re-analysis pipeline up and running.
Data Model
While we store cell-level data, the presentation in Gemma is primarily based on pseudo-bulk (aggregated) data for each cell type, per sample. We are conducting differential expression analysis in the same manner as for other data sets. We are still evaluating our pipeline for any adjustments that may be needed.
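As a rough illustration of what pseudo-bulk aggregation means here, the sketch below sums per-cell counts within each (sample, cell type) group. It is a minimal, dependency-free example and not Gemma's implementation: the function name `pseudobulk`, the toy data, and the choice of summation as the aggregation are all assumptions made for illustration.

```python
from collections import defaultdict


def pseudobulk(counts, samples, cell_types):
    """Sum per-cell count vectors within each (sample, cell type) group.

    Hypothetical helper: summation is assumed as the aggregation method.
    """
    n_genes = len(counts[0])
    totals = defaultdict(lambda: [0] * n_genes)
    for row, sample, cell_type in zip(counts, samples, cell_types):
        group = totals[(sample, cell_type)]
        for g, value in enumerate(row):
            group[g] += value
    return dict(totals)


# Toy data: 5 cells x 3 genes (made-up counts).
counts = [
    [1, 0, 3],
    [2, 1, 0],
    [0, 4, 1],
    [5, 0, 0],
    [1, 1, 1],
]
samples = ["s1", "s1", "s1", "s2", "s2"]
cell_types = ["neuron", "neuron", "astrocyte", "neuron", "neuron"]

# One aggregated expression vector per (sample, cell type) pair.
bulk = pseudobulk(counts, samples, cell_types)
```

Each resulting vector can then be treated like one bulk expression profile, which is what allows the same differential expression workflow to be reused.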
API Support
Support for single-cell experiment access is implemented in the R and Python packages. We are especially keen to get feedback on the features, data and analysis results. Please contact us through pavlab-support@msl.ubc.ca with any questions or comments.
Methods
We are re-annotating cell types in many data sets. This is necessary because most authors do not provide cell type annotations. At this time, our pipeline supports the neocortex and hippocampus of human and mouse. More brain regions and tissues will be added based on availability of reference data (and our own resources).
Cell Calling
We have implemented an optional cell-calling step which applies Cell Ranger’s cell-calling algorithm. The intended use case is unfiltered 10x Genomics MEX data from GEO which contains “empty” cell barcodes presumed to contain ambient RNA molecules instead of true cellular content.
Classifying Cells
For details on the methods we are using, please see our annotation pipeline repository. Briefly, cell types are assigned using a random forest classifier trained on scVI latent embeddings from the CELLxGENE Discover Census data corpus [1][2][3]. Our automated Nextflow pipeline downloads a trained scVI model based on the provided organism and Census version. Embeddings are generated for query data using the pre-trained model. Reference cells and latent embeddings are sub-sampled from the Census for a given organism and set of collection names. A random forest classifier trained on the reference embeddings is then used to predict unknown cell types in the query data.
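The final classification step can be sketched as follows. This is a minimal illustration that assumes scikit-learn's `RandomForestClassifier` and uses synthetic arrays as stand-ins for scVI latent embeddings. The embedding dimension, cluster centers, cell type labels and hyperparameters are all invented for the example and do not reflect the settings of the actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-ins for reference scVI embeddings: two well-separated synthetic
# clusters, one per (hypothetical) cell type label.
n_ref, n_latent = 200, 10
ref_labels = np.array(["neuron"] * (n_ref // 2) + ["astrocyte"] * (n_ref // 2))
centers = {"neuron": -3.0, "astrocyte": 3.0}
ref_embeddings = np.vstack(
    [rng.normal(centers[label], 1.0, size=n_latent) for label in ref_labels]
)

# Train the classifier on the reference embeddings and their labels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(ref_embeddings, ref_labels)

# Stand-ins for query embeddings produced by the same pre-trained model.
query_embeddings = np.vstack([
    rng.normal(-3.0, 1.0, size=n_latent),
    rng.normal(3.0, 1.0, size=n_latent),
])

# Predicted label per query cell, plus the highest class probability
# as a simple per-cell confidence score.
predicted = clf.predict(query_embeddings)
confidence = clf.predict_proba(query_embeddings).max(axis=1)
```

Because reference and query cells are embedded with the same model, the classifier only ever sees points in one shared latent space, which is what makes the transfer of labels possible.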
Defining Outliers
Our single-cell re-annotation pipeline may pass UMI, ribosomal, hemoglobin, mitochondrial, and “counts” outliers to Gemma for optional masking from linear models of gene expression. Outliers are defined using Median Absolute Deviations (MAD) per-sample in line with current best practices [4].
Mitochondrial, Ribosomal, Hemoglobin, UMI and Genes Outliers
A cell is called an outlier for one of the above metrics if
\[\lvert M_i - \mathrm{median}(M) \rvert > X \cdot \mathrm{MAD}(M),\]
where $M_i$ is the metric of interest generated by scanpy.pp.calculate_qc_metrics:
- Mitochondrial: pct_counts_mito
- Ribosomal: pct_counts_ribo
- Hemoglobin: pct_counts_hb
- Gene content: log1p_n_genes_by_counts
- UMI content: log1p_total_counts
and $X = \texttt{--nmads}$ (default $X=5$).
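The MAD rule above can be sketched in plain Python as follows. This is an illustrative example rather than the pipeline's code: the helper names `mad_outliers` and `flag_per_sample` and the toy values are hypothetical, and the MAD is assumed to be unscaled (no normal-consistency constant).

```python
from collections import defaultdict
from statistics import median


def mad_outliers(values, nmads=5.0):
    """Flag values lying more than `nmads` unscaled MADs from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [abs(v - center) > nmads * mad for v in values]


def flag_per_sample(metric, samples, nmads=5.0):
    """Apply the MAD rule separately within each sample."""
    by_sample = defaultdict(list)
    for idx, sample in enumerate(samples):
        by_sample[sample].append(idx)
    flags = [False] * len(metric)
    for indices in by_sample.values():
        sample_flags = mad_outliers([metric[i] for i in indices], nmads)
        for i, flag in zip(indices, sample_flags):
            flags[i] = flag
    return flags


# Toy mitochondrial percentages for two samples (made-up numbers).
pct_counts_mito = [2.0, 2.5, 3.0, 3.5, 40.0, 10.0, 11.0, 12.0, 13.0, 12.5]
samples = ["s1"] * 5 + ["s2"] * 5

# Only the 40.0 cell in sample s1 exceeds 5 MADs from its sample median.
flags = flag_per_sample(pct_counts_mito, samples, nmads=5.0)
```

Computing the median and MAD within each sample means that a sample with generally higher mitochondrial content (s2 here) is judged against its own baseline rather than a pooled one.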
Counts Outliers
So-called “counts” outliers are defined as cells whose gene counts deviate by more than --nmads MADs from the expectation of log-linearity with UMI counts per cell.
Let the residuals be
\[r_i = \ln(\mathrm{genes}_i + 1) - \widehat{\ln(\mathrm{genes}_i + 1)},\]
where the fitted values $\widehat{\ln(\mathrm{genes}+1)}$ come from the model
\[\ln(\mathrm{genes}+1) \sim \ln(\mathrm{counts}+1).\]
Then mark as outliers those with
\[\lvert r_i - \mathrm{median}(r) \rvert > X \cdot \mathrm{MAD}(r).\]
Doublets
Doublets are predicted with the Scanpy implementation of the Scrublet algorithm (Wolock et al., 2019).
References
- Lim N., et al., Curation of over 10,000 transcriptomic studies to enable data reuse. Database, 2021.
- CZI Single-Cell Biology Program, Shibla Abdulla, Brian Aevermann, Pedro Assis, Seve Badajoz, Sidney M. Bell, Emanuele Bezzi, et al. “CZ CELL×GENE Discover: A Single-Cell Data Platform for Scalable Exploration, Analysis and Modeling of Aggregated Data,” November 2, 2023. https://doi.org/10.1101/2023.10.30.563174.
- Lopez, Romain, Jeffrey Regier, Michael B. Cole, Michael I. Jordan, and Nir Yosef. “Deep Generative Modeling for Single-Cell Transcriptomics.” Nature Methods 15, no. 12 (December 2018): 1053–58. https://doi.org/10.1038/s41592-018-0229-2.
- Heumos, L., Schaar, A.C., Lance, C. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet (2023). https://doi.org/10.1038/s41576-023-00586-w