Advancing Single-Cell Annotation - Emerging Approaches Over Classical Methods
Introduction
Annotating scRNA-seq data is crucial but challenging: clustering cells is relatively straightforward, but assigning biological labels requires bridging computational results with prior knowledge. Manual annotation is labor-intensive and subjective, and it is often limited by inconsistent definitions of cell types. Computational methods address this by leveraging curated gene sets (e.g., GO, KEGG) or by aligning expression profiles with annotated reference datasets.
Prominent models such as scBERT, scGPT, Geneformer, CellLM, scAnnotate, and TOSICA exemplify the capabilities of machine learning (ML) and deep learning (DL) in cell annotation, with many more contributing to this growing field. Many of these models draw on pre-trained knowledge, generalize across datasets, and in some cases can infer cell types without task-specific training, which can make them faster and more consistent than traditional approaches while enabling novel biological discoveries.
Methods in Single-Cell Annotation
- Classical methods (Marker genes/Correlation).
- Machine learning based methods.
Classical methods
Classical methods for single-cell annotation rely on predefined marker genes, statistical techniques, and clustering algorithms. Researchers can use independently curated marker gene lists or leverage existing databases and ontologies to identify cell types. Alternatively, gene expression profiles from reference datasets can be used to annotate query datasets directly. These approaches either annotate entire cell clusters or classify individual cells; the latter avoids biases introduced by the clustering step.
Marker gene-based tools like scCATCH[Fig-1] and SCSA utilize curated marker databases (e.g., CellMarker, CancerSEA) to infer cell types. scCATCH assigns cell types by scoring tissue-specific markers and integrates biological functions through GO term enrichment. SCSA uses a similar scoring system and allows users to add custom markers, providing functional insights. CellAssign adopts a Bayesian probabilistic model to assign cell types at the single-cell level using trusted marker gene references, offering a more statistical approach.
Typical Cell-to-Gene Mapping derived from Marker Gene Databases.
gene_marker = {
    "CD14+ Mono": ["FCN1", "CD14"],
    "CD16+ Mono": ["TCF7L2", "FCGR3A", "LYN"],
    "cDC1": ["CLEC9A", "CADM1"],
    "Erythroblast": ["MKI67", "HBA1", "HBB"],
    "NK": ["GNLY", "NKG7", "CD247", "GRIK4", "FCER1G", "TYROBP", "KLRG1", "FCGR3A"],
    "Naive CD20+ B": ["MS4A1", "IL4R", "IGHD", "FCRL1", "IGHM"],
}
Fig-1: Marker gene-based single-cell cluster annotation in scCATCH, Shao et al., 2020.
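To make the idea of marker-based scoring concrete, here is a minimal sketch that scores each cluster by the average expression of each cell type's markers and picks the highest-scoring type. It is not the scoring scheme used by scCATCH, SCSA, or CellAssign; the `cluster_means` values and the helper names `score_cluster` and `annotate` are made up for illustration.

```python
from statistics import mean

# Subset of the marker dictionary shown above.
gene_marker = {
    "CD14+ Mono": ["FCN1", "CD14"],
    "CD16+ Mono": ["TCF7L2", "FCGR3A", "LYN"],
    "NK": ["GNLY", "NKG7", "CD247"],
}

# Hypothetical mean log-expression per gene for two clusters (made-up values).
cluster_means = {
    "0": {"FCN1": 2.9, "CD14": 3.1, "FCGR3A": 0.2, "GNLY": 0.1, "NKG7": 0.2},
    "1": {"GNLY": 3.4, "NKG7": 3.0, "CD247": 1.8, "FCN1": 0.1, "CD14": 0.0},
}


def score_cluster(expr, markers):
    """Average expression of each cell type's markers; missing genes count as 0."""
    return {
        cell_type: mean(expr.get(gene, 0.0) for gene in genes)
        for cell_type, genes in markers.items()
    }


def annotate(cluster_means, markers):
    """Assign each cluster the cell type with the highest marker score."""
    labels = {}
    for cluster, expr in cluster_means.items():
        scores = score_cluster(expr, markers)
        labels[cluster] = max(scores, key=scores.get)
    return labels


labels = annotate(cluster_means, gene_marker)
print(labels)
```

A plain average treats every marker equally and ignores negative markers, which is one reason dedicated tools add tissue-specific weighting or probabilistic models on top of this basic idea.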
Correlation-based tools such as clustifyr[Fig-2], CIPR, and SingleR match query data to reference datasets by measuring gene expression similarity. clustifyr compares cluster centroids using metrics such as Spearman, Pearson, and cosine similarity, adding a consensus score for robustness. CIPR uses log-transformed fold changes for refined feature selection. SingleR performs single-cell comparisons using Spearman correlation and supports multiple references with label harmonization to improve the accuracy of cell-type annotation.
Fig-2: Correlation-based single-cell cluster annotation in clustifyr, Fu et al., 2020.
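The correlation-based matching described above can be sketched in a few lines: compute the Spearman correlation (the Pearson correlation of ranks) between a query cluster centroid and each reference profile, then take the best match. This is an illustrative sketch only, not the implementation in clustifyr, CIPR, or SingleR; the reference profiles and the query centroid are made-up values over the same five hypothetical genes.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical average expression over the same five genes (made-up values).
reference = {
    "B cell": [5.0, 4.2, 0.1, 0.3, 0.0],
    "T cell": [0.2, 0.1, 4.8, 3.9, 1.0],
}
query_centroid = [0.4, 0.0, 3.5, 4.1, 0.9]

scores = {name: spearman(query_centroid, profile) for name, profile in reference.items()}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 2))
```

Rank-based correlation is insensitive to monotonic differences in scale between query and reference, which is one reason it is a common choice when the two datasets come from different platforms.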
Scanpy, a popular Python library for scalable analysis and visualization of single-cell RNA-seq data, is commonly used for marker-based manual annotation of clusters. A typical workflow follows these steps.
- Filter for highly variable genes to reduce noise.
- Compute principal components and generate a reduced matrix (n_cells × n_PCs).
- Construct a k-NN graph from PCA components for clustering.
- Perform Leiden-based clustering of cells based on their proximity in the k-NN graph.
- Manually map cell types to marker genes using literature or databases.
- Visualize gene expression patterns across clusters using dot plots.
- Assign cell types to clusters by inspecting expression patterns in the dot plot.
Fig-3: Single-cell cluster annotated using SCANPY: Wolf et al., 2018.
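The steps above can be sketched with Scanpy's standard functions. This is a minimal sketch rather than a complete pipeline: it assumes `adata` is an AnnData object that already holds normalized, log-transformed counts, it flags highly variable genes instead of dropping the others so that marker genes stay available for plotting, and the function names `cluster_and_plot` and `label_clusters` as well as the parameter values are illustrative choices, not recommendations from the source.

```python
def label_clusters(cluster_ids, cluster_to_type, default="Unknown"):
    """Translate per-cell cluster ids into cell-type labels."""
    return [cluster_to_type.get(str(c), default) for c in cluster_ids]


def cluster_and_plot(adata, marker_genes, n_top_genes=2000, n_pcs=40, resolution=1.0):
    """Flag HVGs, run PCA, build the k-NN graph, cluster with Leiden, draw a dot plot.

    Assumes `adata` holds normalized, log-transformed counts (an assumption of this sketch).
    """
    import scanpy as sc  # imported lazily so label_clusters works without Scanpy

    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    sc.pp.pca(adata, n_comps=50)  # restricted to flagged HVGs by default, if present
    sc.pp.neighbors(adata, n_neighbors=15, n_pcs=n_pcs)
    sc.tl.leiden(adata, resolution=resolution, key_added="leiden")

    # Keep only markers present in the dataset to avoid lookup errors when plotting.
    present = {
        cell_type: [g for g in genes if g in adata.var_names]
        for cell_type, genes in marker_genes.items()
    }
    present = {cell_type: genes for cell_type, genes in present.items() if genes}
    sc.pl.dotplot(adata, present, groupby="leiden")
    return adata


# After inspecting the dot plot, record the manual decisions as a mapping
# (hypothetical cluster ids and labels) and apply it to every cell.
cluster_to_type = {"0": "CD14+ Mono", "1": "NK"}
example = label_clusters(["0", "1", "7"], cluster_to_type)
print(example)
```

With a real dataset, the final assignment would be something like `adata.obs["cell_type"] = label_clusters(adata.obs["leiden"], cluster_to_type)`; clusters left out of the mapping are labeled "Unknown" so that unreviewed clusters remain visible.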
Another popular tool is Seurat, which supports both marker-based and reference-based cell annotation. It identifies differentially expressed genes for each cluster, which can then be matched against known marker genes. Additionally, Seurat supports reference-based mapping through label transfer: query cells are matched to an annotated reference through anchors (pairs of mutually similar cells found by nearest-neighbor search), and reference labels are transferred along those anchors to infer cell identities.
Limitations of Classical Annotation Methods
- Marker genes are context-dependent and may not generalize across datasets, leading to ambiguous annotations.
- Dropout events and technical artifacts can obscure marker expression and weaken correlation-based classifications.
- Closely related subtypes that share marker genes may be grouped together incorrectly, so subtype-level annotation requires a more refined classification.
- Performance depends on well-annotated reference datasets, which may lack rare or novel cell types.
- Interpretation often requires expert validation, introducing human bias and inconsistency.
Machine learning based methods
Classical ML methods
Supervised cell type annotation methods using classical machine learning rely on labeled reference datasets for classification.
- Support Vector Machine (SVM)-based tools like scPred[Fig-4] operate on PCA-transformed gene expression matrices to reduce bias.
- Random Forest-based tools like SingleCellNet compute similarity scores.
- k-Nearest Neighbors (kNN) methods such as OnClass classify cells based on feature proximity.
- Hierarchical classification tools like Garnett leverage predefined cell type trees for structured annotation.
Fig-4: scPred method for cell-type classification from single-cell RNA-seq data, Alquicira-Hernandez et al., 2019.
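As a rough illustration of how a supervised classifier transfers labels from a reference, the sketch below implements a plain k-nearest-neighbors majority vote with a rejection option. It is not the algorithm used by scPred, SingleCellNet, OnClass, or Garnett; the 2-D coordinates stand in for a low-dimensional embedding such as principal components, and `min_agreement` is a hypothetical rejection threshold.

```python
from collections import Counter
from math import dist


def knn_predict(query, ref_points, ref_labels, k=3, min_agreement=0.6):
    """Label `query` by majority vote among its k nearest reference cells.

    Returns "Unassigned" when the winning label's vote share is below
    `min_agreement` (a hypothetical rejection threshold).
    """
    nearest = sorted(range(len(ref_points)), key=lambda i: dist(query, ref_points[i]))[:k]
    votes = Counter(ref_labels[i] for i in nearest)
    label, count = votes.most_common(1)[0]
    return label if count / k >= min_agreement else "Unassigned"


# Hypothetical 2-D embeddings of labeled reference cells (made-up values).
ref_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
ref_labels = ["B cell", "B cell", "B cell", "T cell", "T cell", "T cell"]

print(knn_predict((0.1, 0.1), ref_points, ref_labels))
print(knn_predict((4.9, 5.0), ref_points, ref_labels))
```

The rejection option matters in practice: a classifier that must always choose among the reference labels cannot flag a cell type that the reference does not contain.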
DNN/LLMs
Deep learning-based approaches improve accuracy and generalizability by leveraging neural networks.
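- CellTypist[Fig-5] uses logistic regression classifiers trained with stochastic gradient descent on gene expression data to predict cell types, combining feature selection with supervised learning to provide scalable cell-type annotation across diverse datasets. It is a linear model rather than a deep network, but it is commonly grouped with these automated tools.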
- ACTINN and SuperCT handle non-linear relationships and batch effects efficiently.
- scVI[Fig-6], a deep generative model, learns a latent representation of single-cell RNA-seq data. Its scANVI extension refines this by incorporating both labeled and unlabeled cells. Using known annotations to guide learning, scANVI assigns probabilistic cell type labels, enhancing annotation accuracy, especially for rare or ambiguous populations.
- Large language models (LLMs) such as scBERT, scGPT, and Geneformer introduce transformer-based architectures pre-trained on vast biological datasets.
- Pre-trained LLMs can enable zero-shot or few-shot annotation by capturing complex gene-gene interactions, reducing reliance on curated references.
- LLM-based approaches offer a scalable and adaptable solution, particularly useful when dataset variability or marker availability challenges classical methods.
Fig-5: CellTypist for automated cell-type annotation, Domínguez Conde et al., 2022.
Fig-6: scVI for automated cell-type annotation, Xu et al., 2021.
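Several of the tools above report a probability for each candidate cell type rather than a single hard label. The sketch below shows one common way to turn raw classifier scores into probabilities with a softmax and to leave low-confidence cells unassigned. It is a generic illustration, not the model used by CellTypist or scANVI; the scores and the 0.5 threshold are made-up values.

```python
from math import exp


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def call_label(logits, classes, threshold=0.5):
    """Return (label, probability); the label is "Unassigned" below `threshold`."""
    probs = softmax(logits)
    best = max(range(len(classes)), key=lambda i: probs[i])
    label = classes[best] if probs[best] >= threshold else "Unassigned"
    return label, probs[best]


classes = ["B cell", "T cell", "NK"]
confident = call_label([4.0, 0.5, 0.2], classes)  # one score clearly dominates
ambiguous = call_label([1.0, 0.9, 0.8], classes)  # scores are nearly tied
print(confident)
print(ambiguous)
```

Keeping the probability alongside the label makes it possible to review ambiguous populations separately instead of accepting every call at face value.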
Geneformer (Theodoris et al., 2023)[Fig-7,8] is a transformer model pre-trained on a large corpus of single-cell transcriptomes, starting with ~30 million non-cancer transcriptomes in June 2021 and expanding to ~95 million by April 2024. It then underwent continual learning on ~14 million cancer transcriptomes to create a cancer-tuned version. Cells with high mutational burdens were excluded from the pre-training corpus, which is intended to make the learned representations more robust and interpretable across diverse datasets. The annotation process in Geneformer follows these steps:
- Tokenize all cells in the test dataset using Geneformer’s tokenizer.
- Select the pre-trained Geneformer model and the cluster-to-cell type mapping JSON file. If fine-tuning, train the model on labeled data before proceeding to prediction.
- Perform cell type prediction and compute probability scores using the tokenized data.
- Assess prediction accuracy by visualizing results with UMAP or t-SNE.
Fig-7: Geneformer methods for transfer learning, Theodoris et al., 2023.
Fig-8: Cell-type annotation using the Geneformer pre-trained model (~30M), cellxgene-census.
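The tokenization step relies on what the Geneformer authors call rank value encoding: each gene's expression in a cell is scaled by that gene's typical expression across the pre-training corpus, and the cell is then represented as its genes ordered from highest to lowest scaled value. The sketch below illustrates that idea only. It is not Geneformer's tokenizer; the medians, token ids, and the function name `rank_value_encode` are hypothetical, per-cell total-count normalization is omitted because it does not change the ordering within a cell, and `max_len=2048` is an assumed input length.

```python
def rank_value_encode(expression, gene_medians, token_ids, max_len=2048):
    """Turn one cell's expression into a ranked list of gene tokens.

    expression:   gene -> count in this cell
    gene_medians: gene -> typical nonzero expression across a corpus (hypothetical values)
    token_ids:    gene -> integer token id (hypothetical vocabulary)
    """
    scaled = {
        gene: count / gene_medians[gene]
        for gene, count in expression.items()
        if count > 0 and gene in gene_medians and gene in token_ids
    }
    # Genes that are unusually high in this cell come first.
    ranked = sorted(scaled, key=lambda gene: scaled[gene], reverse=True)
    return [token_ids[gene] for gene in ranked[:max_len]]


gene_medians = {"ACTB": 50.0, "CD14": 2.0, "FCN1": 4.0, "MS4A1": 1.0}
token_ids = {"ACTB": 10, "CD14": 11, "FCN1": 12, "MS4A1": 13}
cell = {"ACTB": 100, "CD14": 8, "FCN1": 12, "MS4A1": 0}

tokens = rank_value_encode(cell, gene_medians, token_ids)
print(tokens)
```

In this toy cell the housekeeping gene ACTB has the highest raw count but ends up last, because scaling by corpus-wide expression pushes ubiquitously expressed genes down and lets cell-type-specific genes such as CD14 and FCN1 lead the sequence.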
While Geneformer provides powerful predictions for cell-type annotation, the pre-trained model used without fine-tuning may not always deliver accurate results and sometimes predicts cell types from unrelated tissues. To improve accuracy, the authors recommend fine-tuning the model on data specific to the task. Fine-tuning has also been applied to tasks such as transcription factor dosage sensitivity, chromatin dynamics, gene network centrality, and disease classification.
Limitations
- Classical ML approaches require well-annotated training datasets, making them ineffective for identifying novel or rare cell types.
- Deep learning models, especially transformers like Geneformer, scGPT, and scBERT, demand significant computational resources for training, fine-tuning, and inference, limiting accessibility for smaller research groups.
- Pre-trained models, although trained on large-scale datasets, may not generalize well to new tissues or conditions without additional fine-tuning, which can lead to incorrect cell-type predictions.
Conclusion
Automated cell type annotation in scRNA-seq has evolved from marker-based and correlation-driven methods to machine learning classifiers, deep learning (DL), and pre-trained transformer models. While marker-based approaches rely on curated gene sets and correlation methods leverage reference datasets, supervised tools like CellTypist learn classifiers from labeled references for scalable predictions, and deep generative models such as scVI and scANVI learn latent representations of the data. More recently, transformer models such as Geneformer, scGPT, and scBERT have emerged, offering frameworks pre-trained on millions of single-cell transcriptomes. These models enable zero-shot learning but often require fine-tuning for precise tissue-specific annotations, as seen with Geneformer. By capturing complex gene interactions and mitigating batch effects, they can improve annotation accuracy across diverse datasets.
References
- Automated methods for cell type annotation on scRNA-seq data, Pasquini et al., 2021.
- scCATCH: Automatic Annotation on Cell Types of Clusters from Single-Cell RNA Sequencing Data, Shao et al., 2020.
- clustifyr: an R package for automated single-cell RNA sequencing cluster classification, Fu et al., 2020.
- SCANPY: large-scale single-cell gene expression data analysis, Wolf et al., 2018.
- scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data, Alquicira-Hernandez et al., 2019.
- Cross-tissue immune cell analysis reveals tissue-specific features in humans, Domínguez Conde et al., 2022.
- Transfer learning enables predictions in network biology, Theodoris et al., 2023.
- CZ CELLxGENE Discover Census.