Deciphering Cellular Complexity using Differential Gene Expression Analysis in Single-Cell Transcriptomic data: A Practical Guide
Brief Introduction
In the realm of molecular biology, the quest to understand the intricacies of cellular diversity and function has been greatly accelerated by the advent of single-cell transcriptomics. This cutting-edge technology allows researchers to scrutinize gene expression profiles at the resolution of individual cells, unveiling a wealth of information that was previously inaccessible through traditional bulk RNA sequencing methods. At the heart of single-cell transcriptomic analysis lies the powerful tool of differential gene expression (DGE) analysis, which enables the identification of genes that are differentially expressed across distinct cell populations or experimental conditions. DGE analysis serves as a cornerstone in unraveling the molecular underpinnings of cellular heterogeneity.By comparing gene expression profiles between different cell types, states, or experimental perturbations, researchers can pinpoint genes that are uniquely associated with specific cellular identities or responses. This comparative approach not only reveals the core gene expression signatures defining various cell populations but also provides insights into the dynamic regulatory networks governing cellular behaviors[1-2].
One of the primary applications of DGE analysis in single-cell transcriptomics is in the characterization of cell types and states within complex tissues. By dissecting heterogeneous cell populations into distinct clusters based on their gene expression profiles, researchers can identify marker genes that define each cell type. This knowledge not only enhances our understanding of tissue organization and function but also facilitates the discovery of novel cell subtypes with specialized roles in health and disease[3-4].
Major steps of DGE analysis
- Preprocessing and normalization
- Statistical modeling
- Visualization
An experimental procedure
In this blog, we will delve into differential gene expression analysis in single-cell RNA sequencing data using the pseudo-bulking approach[5], leveraging the capabilities of the pyDESeq2
library[6].
Import library
import anndata as ad
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context
import numpy as np
import pandas as pd
import pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import scanpy as sc
Reading single cell RNA-seq AnnData
An AnnData object in the .h5ad
file format, containing gene expression data X
, cell metadata obs
, and gene ID details var
, should be used. It is highly recommended to use the raw gene expression data available in anndata.raw.X
. This is because unnormalized or raw counts typically do not contain negative expression values, and these negative counts can cause errors during calculations.
adata = sc.read_h5ad('/home/demofile.h5ad', backed='r')
adata.var
: A pandas DataFrame representing the gene or feature details
Figure 1: The var
DataFrame representing the Gene_ID's and the related metadata.
adata.obs
: A pandas DataFrame representing the metadata
Figure 2: The obs
DataFrame representing cells and the associated metadata for each donor
or subject
.
adata.raw.X
: A CSRDataset
which efficiently stores the sparse gene expression matrix
In the absence of adata.raw.X
, we can proceed with adata.X
, but we must be cautious about normalization and batch correction in the expression data, as these processes can produce negative values.
Figure 3: Gene expression matrix representing the raw expression values of each gene across all cells.
Parameters for intra cell DGE analysis between different conditions
Cell-specific DGE
analysis involves comparing gene expression within a cell type
between two conditions. So we need to select a specific cell type
; for example, we may compare the gene expression of naive B cell
between COVID-19
and normal
subjects;figure[4]. Here, the disease
column is the design_factor
, and 'COVID-19
& normal
are the two levels of the design_factor
.
Since single-cell data contains multiple expression vectors for a cell type
in each subject, the expression values need to be aggregated for DGE analysis. This aggregation is done for the selected cell type
in each subject. In figure[4], the expression values of naive B cell
for each subject listed in the donor_id
column are aggregated.
Figure 4: Important parameters to perform DGE.
Subsetting the anndata with the fixed celltype: "naive B cell"
fixed_celltype = 'naive B cell'
celltype_col = 'cell_type'
cell_subset_adata = adata[adata.obs[celltype_col] == fixed_celltype].to_memory()
print(f'The "cell_subset_adata" has {cell_subset_adata.shape[0]} cells and {cell_subset_adata.shape[1]} genes')
The "cell_subset_adata" has 8998 cells and 23586 genes
The cell_subset_adata
contains all the subjects
that have expression values for the naive B cell
.
Figure 5: Subset anndata upon selection of a fixed cell type : naive B cell
.
Aggregating the gene expression of each donor for selected celltype naive B cell
def get_psudo_expr_df(cell_subset_ad,
stratifier,
design_factor):
'''Aggregates the expression values of individual genes for each 'startifier' to
produce individual anndata, and concatenate all anndata.
Args:
cell_subset_ad : cell_subset_adata
stratifier : A column containing donors or subject ids in 'obs'.
design_factor : The column name in 'obs', containing different levels of
condition
Returns:
A pandas DataFrame with aggregated counts from concatenated anndata.
A pandas DataFrame with design_factor specific metadata'''
bulk_ad_per_stratifier_list = []
for uniq_id in cell_subset_ad.obs[stratifier].unique():
samp_cell_subset_ad = cell_subset_ad[cell_subset_ad.obs[stratifier] == uniq_id]
if cell_subset_ad.raw is not None:
rep_adata = sc.AnnData(X = samp_cell_subset_ad.raw.X.sum(axis = 0),
var = samp_cell_subset_ad.var[[]])
else:
rep_adata = sc.AnnData(X = samp_cell_subset_ad.X.sum(axis = 0),
var = samp_cell_subset_ad.var[[]])
rep_adata.obs_names = [uniq_id]
rep_adata.obs[design_factor] = samp_cell_subset_ad.obs[design_factor].iloc[0]
bulk_ad_per_stratifier_list.append(rep_adata)
bulk_ad_per_stratifier_combined = sc.concat(bulk_ad_per_stratifier_list)
count_matrix = pd.DataFrame(bulk_ad_per_stratifier_combined.X,
columns=bulk_ad_per_stratifier_combined.var_names,
index =bulk_ad_per_stratifier_combined.obs_names)
count_matrix = count_matrix.round().astype(int)
metadata = bulk_ad_per_stratifier_combined.obs
return count_matrix,metadata
stratifier = 'donor_id'
design_factor = 'disease'
aggr_counts_df,aggr_metadata = get_psudo_expr_df(cell_subset_adata,stratifier,design_factor)
The get_psudo_expr_df
function generates a metadata DataFrame
that includes all donors or subjects listed in the donor_id
column in cell_subset_adata.obs
, who have the expression values for naive B cell
along with the condition-level they belong to. The disease
column selected as design_factor
has all the associated condition-levels, COVID_19
& normal
. This metadata will be used as a design_matrix
later; figure[6A].
The function also returns donor-wise summed expression values for each gene within the selected cell type, naive B cell
; figure[6B].
Figure 6: Design matrix(A) and the aggregated gene expression matrix(B) across individual donor IDs.
Differentially expressed gene in 'naive B cell' between 'COVID-19' and 'Normal' samples
Calculating dispersion and Log fold change in gene expression
def get_deseq2_dataset(counts_df,
metadata,
design_factor):
'''Creates an instance of 'DeseqDataSet', select genes expressed
in at least one cell, normalizes the aggregated expression values,
estimates dispersion parameter, fit a GLM to identify Differential
expressed genes and calculates the log2 fold change in expression
of genes between condition.
Args:
counts_df : 'aggr_counts_df' returned from 'get_psudo_expr_df'
metadata : 'aggr_metadata' returned from 'get_psudo_expr_df'
design_factor : The column name in 'aggr_metadata', containing
different levels of condition.
Returns:
A AnnData object with size factors, Log Fold change values,
dispersions and so on.
'''
deseq_dataset = DeseqDataSet(counts = counts_df,
metadata=metadata,
design_factors=design_factor)
#removing genes with no expression across all samples
sc.pp.filter_genes(deseq_dataset, min_cells = 1)
deseq_dataset.deseq2()
return deseq_dataset
deseq2_dataset = get_deseq2_dataset(aggr_counts_df,aggr_metadata,design_factor)
Extracting statistical results and visualizing in volcano plot
def get_dge_statsdf_volc_plot(dge_dataset,design_factor,factor_1,
factor_2,fixed_cell,
save_stats_df = True,p_val = 0.05,
fold = 1.5,plot_save =True):
''' Extract summary statistics such as number of significantly differentially
expressed genes with p,and adjusted p values. Volcano-plot with
y->(-log10pvalue) and x->(Log2 fold change) for each gene
Args:
dge_dataset : deseq2_dataset
design_factor : design_factor
factor_1 : condition level-1
factor_2 : condition level-2
fixed_cell : fixed_celltype
save_stats_df : True, if stats results to be saved as a csv file
p_val : significant p value to be used in the volcano plot.
fold : linear fold change value that is to be converted to log2fold
change for visualizing differentially expressed genes.
plot_save : True, if volcano plot to be saved as a png file
Returns:
A pandas DataFrame with summary stats
'''
stats = DeseqStats(dge_dataset,contrast=(design_factor, factor_1,factor_2))
stats.summary()
stats_df = stats.results_df
if save_stats_df:
stats_df.index.name = 'Gene'
stats_df.to_csv(f'DGE_reslt_{factor_1}_vs_{factor_2}.csv',header = True)
#for plotting
log2_fc = np.log2(fold)
log10_pval_neg = -np.log10(p_val)
stats_df['-log10(pvalue)'] = -np.log10(stats_df['pvalue'])
down_genes_bool = (stats_df['log2FoldChange']<=(-log2_fc))&(stats_df['-log10(pvalue)']>=(log10_pval_neg))
up_genes_bool = (stats_df['log2FoldChange']>=(log2_fc))&(stats_df['-log10(pvalue)']>=(log10_pval_neg))
rest_gene_bool = ~(down_genes_bool|up_genes_bool)
plt.figure(figsize=(10, 5))
plt.grid(False)
plt.scatter(stats_df.loc[up_genes_bool,'log2FoldChange'], stats_df.loc[up_genes_bool,'-log10(pvalue)'],
color='red',s=20,alpha=0.25,label= f"Fold>={fold} & p<={p_val}")
plt.scatter(stats_df.loc[down_genes_bool,'log2FoldChange'], stats_df.loc[down_genes_bool,'-log10(pvalue)'],
color='blue',s=20,alpha=0.25,label= f"Fold<={fold} & p<={p_val}")
plt.scatter(stats_df.loc[rest_gene_bool,'log2FoldChange'], stats_df.loc[rest_gene_bool,'-log10(pvalue)'],
color='gray',alpha=0.2,s=20)
plt.axvline(log2_fc, color='black', linestyle='--', alpha=0.4)
plt.axvline(-log2_fc, color='black', linestyle='--', alpha=0.4)
plt.axhline(log10_pval_neg, color='black', linestyle='--', alpha=0.4)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.6))
plt.xlabel('Log2 Fold Change',fontweight='bold')
plt.ylabel('-Log10(p-value)',fontweight='bold')
plt.title(f'DGE of "{fixed_cell}" in {factor_1} v/s {factor_2} samples',fontweight='bold')
if plot_save:
plt.savefig(f'DGE_plot_{fixed_cell}_{factor_1}_vs_{factor_2}.png',bbox_inches='tight')
plt.show()
return stats_df
factor_1 = 'COVID-19'#The condition-1 to do the DGE
factor_2 = 'normal' # The reference condition-2 angaist which the condition-1 is going to be compared
dge_stats_df = get_dge_statsdf_volc_plot(dge_dataset = deseq2_dataset,
design_factor = design_factor,
factor_1 = factor_1,
factor_2 = factor_2,
fixed_cell = fixed_celltype)
Figure 7: Statistical results of differentially expressed genes in COVID-19
samples compared to normal
ones.
Volcano plots, which plot -log10(p-value)
v/s log_2(fold change)
for each gene, are useful for visualizing the extent of gene upregulation and downregulation. In figure[8], red dots represent genes that are significantly upregulated in COVID-19 w.r.t. normal subjects (with a p-value ≤0.05 and a fold change ≥1.5), while blue dots indicate genes that are significantly downregulated (with a p-value ≤0.05 and a fold change ≤-1.5).
Figure 8: Volcano plot of -log10(p value)
v/s log2Foldchange
helps in visualising upregulated and down regulated genes.
Conclusion
In conclusion, DGE analysis in single-cell transcriptomics represents a powerful approach for unraveling cellular complexity and dynamics across tissues, developmental stages, and disease states. By comparing gene expression profiles between different cell populations or experimental conditions, researchers can decipher the molecular underpinnings of cellular identity, behavior, and pathology. As technology continues to evolve and analytical techniques become more sophisticated, the insights gleaned from single-cell transcriptomic analysis promise to revolutionize our understanding of biology and disease.
References
- Svensson, Valentine. "Droplet-based single-cell RNA sequencing and its bioinformatics challenges." Computational and structural biotechnology journal 16 (2018): 363-370.
- Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142.
- Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature biotechnology 32.4 (2014): 381-386.
- Macosko, Evan Z., et al. "Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets." Cell 161.5 (2015): 1202-1214.
- https://bioconductor.org/packages/devel/bioc/vignettes/glmGamPoi/inst/doc/pseudobulk.html.
- Muzellec et al. "PyDESeq2: a python package for bulk RNA-seq differential expression analysis." : 10.1093/bioinformatics/btad547(2023).