Deciphering Cellular Complexity using Differential Gene Expression Analysis in Single-Cell Transcriptomic data: A Practical Guide

Single-Cell Genomics

Differential Gene Expression

Python

Amit Samal Bioinformatics Engineer @ Infocusp

14 min read . 20 July 2024

banner image

Brief Introduction

In the realm of molecular biology, the quest to understand the intricacies of cellular diversity and function has been greatly accelerated by the advent of single-cell transcriptomics. This cutting-edge technology allows researchers to scrutinize gene expression profiles at the resolution of individual cells, unveiling a wealth of information that was previously inaccessible through traditional bulk RNA sequencing methods. At the heart of single-cell transcriptomic analysis lies the powerful tool of differential gene expression (DGE) analysis, which enables the identification of genes that are differentially expressed across distinct cell populations or experimental conditions. DGE analysis serves as a cornerstone in unraveling the molecular underpinnings of cellular heterogeneity.By comparing gene expression profiles between different cell types, states, or experimental perturbations, researchers can pinpoint genes that are uniquely associated with specific cellular identities or responses. This comparative approach not only reveals the core gene expression signatures defining various cell populations but also provides insights into the dynamic regulatory networks governing cellular behaviors[1-2].

One of the primary applications of DGE analysis in single-cell transcriptomics is in the characterization of cell types and states within complex tissues. By dissecting heterogeneous cell populations into distinct clusters based on their gene expression profiles, researchers can identify marker genes that define each cell type. This knowledge not only enhances our understanding of tissue organization and function but also facilitates the discovery of novel cell subtypes with specialized roles in health and disease[3-4].

Major steps of DGE analysis

Preprocessing and normalization
Statistical modeling
Visualization

An experimental procedure

In this blog, we will delve into differential gene expression analysis in single-cell RNA sequencing data using the pseudo-bulking approach[5], leveraging the capabilities of the pyDESeq2 library[6].

Import library

import anndata as ad
import matplotlib.pyplot as plt
from matplotlib.pyplot import rc_context
import numpy as np
import pandas as pd
import pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import scanpy as sc

Reading single cell RNA-seq AnnData

An AnnData object in the .h5ad file format, containing gene expression data X, cell metadata obs, and gene ID details var, should be used. It is highly recommended to use the raw gene expression data available in anndata.raw.X. This is because unnormalized or raw counts typically do not contain negative expression values, and these negative counts can cause errors during calculations.

adata = sc.read_h5ad('/home/demofile.h5ad', backed='r')

`adata.var` : A pandas DataFrame representing the gene or feature details

Figure 1: The var DataFrame representing the Gene_ID's and the related metadata.

`adata.obs` : A pandas DataFrame representing the metadata

Figure 2: The obs DataFrame representing cells and the associated metadata for each donor or subject.

`adata.raw.X` : A `CSRDataset` which efficiently stores the sparse gene expression matrix

In the absence of adata.raw.X, we can proceed with adata.X, but we must be cautious about normalization and batch correction in the expression data, as these processes can produce negative values.

Figure 3: Gene expression matrix representing the raw expression values of each gene across all cells.

Parameters for intra cell DGE analysis between different conditions

Cell-specific DGE analysis involves comparing gene expression within a cell type between two conditions. So we need to select a specific cell type; for example, we may compare the gene expression of naive B cell between COVID-19 and normal subjects;figure[4]. Here, the disease column is the design_factor, and 'COVID-19 & normal are the two levels of the design_factor.

Since single-cell data contains multiple expression vectors for a cell type in each subject, the expression values need to be aggregated for DGE analysis. This aggregation is done for the selected cell type in each subject. In figure[4], the expression values of naive B cell for each subject listed in the donor_id column are aggregated.

Figure 4: Important parameters to perform DGE.

Subsetting the anndata with the fixed celltype: "naive B cell"

fixed_celltype = 'naive B cell'
celltype_col = 'cell_type'
cell_subset_adata = adata[adata.obs[celltype_col] == fixed_celltype].to_memory()
print(f'The "cell_subset_adata" has {cell_subset_adata.shape[0]} cells and {cell_subset_adata.shape[1]} genes')

The "cell_subset_adata" has 8998 cells and 23586 genes

The cell_subset_adata contains all the subjects that have expression values for the naive B cell.

Figure 5: Subset anndata upon selection of a fixed cell type : naive B cell.

Aggregating the gene expression of each donor for selected celltype `naive B cell`

def get_psudo_expr_df(cell_subset_ad,
                      stratifier,
                      design_factor):
    '''Aggregates the expression values of individual genes for each 'startifier' to
    produce individual anndata, and concatenate all anndata.

    Args:
        cell_subset_ad : cell_subset_adata
        stratifier     : A column containing donors or subject ids in 'obs'.
        design_factor  : The column name in 'obs', containing different levels of 
                         condition

    Returns:
        A pandas DataFrame with aggregated counts from concatenated anndata.
        A pandas DataFrame with design_factor specific metadata'''
    
    bulk_ad_per_stratifier_list = []
    for uniq_id in cell_subset_ad.obs[stratifier].unique():
        samp_cell_subset_ad = cell_subset_ad[cell_subset_ad.obs[stratifier] == uniq_id]
        
        if cell_subset_ad.raw is not None:
            rep_adata = sc.AnnData(X = samp_cell_subset_ad.raw.X.sum(axis = 0),
                                   var = samp_cell_subset_ad.var[[]])
        else:
            rep_adata = sc.AnnData(X = samp_cell_subset_ad.X.sum(axis = 0),
                                   var = samp_cell_subset_ad.var[[]])
        
        rep_adata.obs_names = [uniq_id]
        rep_adata.obs[design_factor] = samp_cell_subset_ad.obs[design_factor].iloc[0]
        
        bulk_ad_per_stratifier_list.append(rep_adata)
    bulk_ad_per_stratifier_combined = sc.concat(bulk_ad_per_stratifier_list)
    
    count_matrix = pd.DataFrame(bulk_ad_per_stratifier_combined.X,
                            columns=bulk_ad_per_stratifier_combined.var_names,
                            index =bulk_ad_per_stratifier_combined.obs_names)
    count_matrix = count_matrix.round().astype(int)
    metadata = bulk_ad_per_stratifier_combined.obs
    
    return count_matrix,metadata

stratifier = 'donor_id'
design_factor = 'disease'
aggr_counts_df,aggr_metadata = get_psudo_expr_df(cell_subset_adata,stratifier,design_factor)

The get_psudo_expr_df function generates a metadata DataFrame that includes all donors or subjects listed in the donor_id column in cell_subset_adata.obs, who have the expression values for naive B cell along with the condition-level they belong to. The disease column selected as design_factor has all the associated condition-levels, COVID_19 & normal. This metadata will be used as a design_matrix later; figure[6A].

The function also returns donor-wise summed expression values for each gene within the selected cell type, naive B cell; figure[6B].

Figure 6: Design matrix(A) and the aggregated gene expression matrix(B) across individual donor IDs.

Differentially expressed gene in 'naive B cell' between 'COVID-19' and 'Normal' samples

Calculating dispersion and Log fold change in gene expression

def get_deseq2_dataset(counts_df,
                      metadata,
                      design_factor):
    '''Creates an instance of 'DeseqDataSet', select genes expressed
    in at least one cell, normalizes the aggregated expression values,
    estimates dispersion parameter, fit a GLM to identify Differential
    expressed genes and calculates the log2 fold change in expression 
    of genes between condition.

    Args:
        counts_df     : 'aggr_counts_df' returned from 'get_psudo_expr_df'
        metadata      : 'aggr_metadata' returned from 'get_psudo_expr_df'
        design_factor : The column name in 'aggr_metadata', containing 
                        different levels of condition.

    Returns:
        A AnnData object with size factors, Log Fold change values, 
        dispersions and so on.
    '''
    
    deseq_dataset = DeseqDataSet(counts = counts_df,
                             metadata=metadata,
                             design_factors=design_factor)
    #removing genes with no expression across all samples
    sc.pp.filter_genes(deseq_dataset, min_cells = 1)
    deseq_dataset.deseq2()
    return deseq_dataset

deseq2_dataset = get_deseq2_dataset(aggr_counts_df,aggr_metadata,design_factor)

Extracting statistical results and visualizing in volcano plot

def get_dge_statsdf_volc_plot(dge_dataset,design_factor,factor_1,
                              factor_2,fixed_cell,
                              save_stats_df = True,p_val = 0.05,
                              fold = 1.5,plot_save =True):
    ''' Extract summary statistics such as number of significantly differentially
    expressed genes with p,and adjusted p values. Volcano-plot with
    y->(-log10pvalue) and x->(Log2 fold change) for each gene
    
    Args:
        dge_dataset : deseq2_dataset
        design_factor : design_factor 
        factor_1 : condition level-1
        factor_2 : condition level-2
        fixed_cell : fixed_celltype
        save_stats_df : True, if stats results to be saved as a csv file
        p_val : significant p value to be used in the volcano plot.
        fold : linear fold change value that is to be converted to log2fold 
                change for visualizing differentially expressed genes.
        plot_save : True, if volcano plot to be saved as a png file

    Returns: 
            A pandas DataFrame with summary stats
    '''
    
    stats = DeseqStats(dge_dataset,contrast=(design_factor, factor_1,factor_2))
    stats.summary()
    stats_df = stats.results_df
    if save_stats_df:
        stats_df.index.name = 'Gene'
        stats_df.to_csv(f'DGE_reslt_{factor_1}_vs_{factor_2}.csv',header = True)
    
    #for plotting
    log2_fc = np.log2(fold)
    log10_pval_neg = -np.log10(p_val)
    stats_df['-log10(pvalue)'] = -np.log10(stats_df['pvalue'])
    down_genes_bool = (stats_df['log2FoldChange']<=(-log2_fc))&(stats_df['-log10(pvalue)']>=(log10_pval_neg))
    up_genes_bool = (stats_df['log2FoldChange']>=(log2_fc))&(stats_df['-log10(pvalue)']>=(log10_pval_neg))
    rest_gene_bool = ~(down_genes_bool|up_genes_bool)  
    plt.figure(figsize=(10, 5))
    plt.grid(False)
    plt.scatter(stats_df.loc[up_genes_bool,'log2FoldChange'], stats_df.loc[up_genes_bool,'-log10(pvalue)'],
                color='red',s=20,alpha=0.25,label= f"Fold>={fold} & p<={p_val}")
    plt.scatter(stats_df.loc[down_genes_bool,'log2FoldChange'], stats_df.loc[down_genes_bool,'-log10(pvalue)'],
                color='blue',s=20,alpha=0.25,label= f"Fold<={fold} & p<={p_val}")

    plt.scatter(stats_df.loc[rest_gene_bool,'log2FoldChange'], stats_df.loc[rest_gene_bool,'-log10(pvalue)'], 
                color='gray',alpha=0.2,s=20)
    
    plt.axvline(log2_fc, color='black', linestyle='--', alpha=0.4)
    plt.axvline(-log2_fc, color='black', linestyle='--', alpha=0.4)
    plt.axhline(log10_pval_neg, color='black', linestyle='--', alpha=0.4)
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.6))

    plt.xlabel('Log2 Fold Change',fontweight='bold')
    plt.ylabel('-Log10(p-value)',fontweight='bold')
    plt.title(f'DGE of "{fixed_cell}" in {factor_1} v/s {factor_2} samples',fontweight='bold')
    if plot_save:
        plt.savefig(f'DGE_plot_{fixed_cell}_{factor_1}_vs_{factor_2}.png',bbox_inches='tight')    
    plt.show()
    return stats_df

factor_1 = 'COVID-19'#The condition-1 to do the DGE
factor_2 = 'normal' # The reference condition-2 angaist which the condition-1 is going to be compared
dge_stats_df = get_dge_statsdf_volc_plot(dge_dataset = deseq2_dataset,
                                         design_factor = design_factor,
                                         factor_1 = factor_1,
                                         factor_2 = factor_2,
                                         fixed_cell = fixed_celltype)

Figure 7: Statistical results of differentially expressed genes in COVID-19 samples compared to normal ones.

Volcano plots, which plot -log10(p-value) v/s log_2(fold change) for each gene, are useful for visualizing the extent of gene upregulation and downregulation. In figure[8], red dots represent genes that are significantly upregulated in COVID-19 w.r.t. normal subjects (with a p-value ≤0.05 and a fold change ≥1.5), while blue dots indicate genes that are significantly downregulated (with a p-value ≤0.05 and a fold change ≤-1.5).

Figure 8: Volcano plot of -log10(p value) v/s log2Foldchange helps in visualising upregulated and down regulated genes.

Conclusion

In conclusion, DGE analysis in single-cell transcriptomics represents a powerful approach for unraveling cellular complexity and dynamics across tissues, developmental stages, and disease states. By comparing gene expression profiles between different cell populations or experimental conditions, researchers can decipher the molecular underpinnings of cellular identity, behavior, and pathology. As technology continues to evolve and analytical techniques become more sophisticated, the insights gleaned from single-cell transcriptomic analysis promise to revolutionize our understanding of biology and disease.

References

Svensson, Valentine. "Droplet-based single-cell RNA sequencing and its bioinformatics challenges." Computational and structural biotechnology journal 16 (2018): 363-370.
Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142.
Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature biotechnology 32.4 (2014): 381-386.
Macosko, Evan Z., et al. "Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets." Cell 161.5 (2015): 1202-1214.
https://bioconductor.org/packages/devel/bioc/vignettes/glmGamPoi/inst/doc/pseudobulk.html.
Muzellec et al. "PyDESeq2: a python package for bulk RNA-seq differential expression analysis." : 10.1093/bioinformatics/btad547(2023).