12  Other tools used in scATAC-seq

Overview of the scATAC-seq analysis steps.

12.1 Gene regulatory network reconstruction

In the previous section, we demonstrated how to integrate RNA and ATAC assays to identify cell type heterogeneity and explore cis- and trans-regulatory elements important for cell identity. Another approach is to focus on gene relationships, specifically the regulatory interactions between TFs and cis-regulatory elements that control target gene transcription—this is known as gene regulatory network (GRN) analysis

One of the early approaches involves checking expression correlations between genes to build a co-expression network, followed by the identification of co-expression modules. These modules contain genes thought to be co-regulated, and their regulators can be identified through methods like TF binding motif enrichment analysis. A well-known example is WGCNA (Weighted Gene Coexpression Network Analysis), commonly used in bulk transcriptome analysis.

Another approach treats the expression of a gene as a function of its regulators’ (mostly TFs) expression. The goal is to fit a linear or non-linear regression model using the target gene expression as the response and the expression of all possible regulators as covariates. This is followed by statistical tests to determine significant contributions from covariates or uses regularization for feature selection. Examples include methods like GENIE3 and GRNBoost2, which have also proven applicable to single-cell transcriptomic data.

While these methods provide valuable insights into regulatory mechanisms, they rely solely on the expression of TFs and target genes, overlooking the biological requirement that the target gene must contain a binding motif for a TF at its cis-regulatory elements (promoter or enhancers). To enhance biological relevance, pipelines now integrate both expression data and TF binding site predictions. SCENIC, developed by the Aerts lab, adds a filtering step after using GRNBoost2 to infer the raw network, removing interactions lacking corresponding TF binding motifs at the gene’s promoter. CellOracle takes a different approach by identifying putative TFs via motif searching at promoters and enhancers, using public or ATAC-seq data, and then applying a regularized regression model with only TFs that have predicted binding motifs. Further details are available in their respective publications.

With scMultiome data, we can quantitatively incorporate chromatin accessibility profiles into GRN inference, which is the rationale behind Pando. Pando first scans for TF binding motifs at each peak in the ATAC assay, applying further filtering based on sequence conservation and public regulatory element databases (e.g., ENCODE). For each gene, it builds a linear regression model that considers the expression of TFs with predicted binding sites as well as their interaction with the peak accessibility at those binding sites. More details can be found in the biorxiv preprint. In the next part of this tutorial, we will briefly demonstrate how to run Pando on scMultiome data to identify potential TF targets.

Pando

Pando had a major upgrade to change its defined data structure, which was to avoid possible effect by any major change to the data structure of Seurat objects in the future. The following tutorial session has therefore been updated to adapt to the changes.

12.1.1 References