6  Practical 15: Multimodal analysis and Integration

In this step, we will demonstrate the following:

7 Batch Effect Correction

Variation in different Chromium Single Cell ATAC (Assay for Transposase Accessible Chromatin) samples can be affected by technical factors, such as laboratory conditions or reagent choices. These batch effects may confound true biological variation between samples. Therefore, correcting the batch effects can be useful for data analysis.

If you are combining libraries generated by Chromium Single Cell ATAC v1.1 and v2 reagents, you might observe systematic differences in chromatin structure profiles between libraries. Here, we will demonstrate how to use community-developed tools to merge and correct batch effects between Single Cell ATAC v1.1 and v2 data. The same procedure could also be used to correct other types of batch effects.

7.1 Option 1: Merging objects

When merging multiple single-cell chromatin datasets, it’s important to be aware that if peak calling was performed on each dataset independently, the peaks are unlikely to be exactly the same. We therefore need to create a common set of peaks across all the datasets to be merged.

The two datasets used in this example can be found here: dataset 1: 10k Human PBMCs, ATAC v1.1 cells and dataset 2: 10k Human PBMCs, ATAC v2 cells.

From: Batch Effect Correction in Chromium Single Cell ATAC Data

Code
# Set working directory before downloading files
setwd("data")

# Download dataset 1 filtered_peak_bc_matrix.h5 file

download.file("https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv1p1_nextgem_Chromium_X/10k_pbmc_ATACv1p1_nextgem_Chromium_X_filtered_peak_bc_matrix.h5","10k_pbmc_ATACv1p1_nextgem_Chromium_X_filtered_peak_bc_matrix.h5")

# Download dataset 2 filtered_peak_bc_matrix.h5 file

download.file("https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_filtered_peak_bc_matrix.h5","10k_pbmc_ATACv2_nextgem_Chromium_X_filtered_peak_bc_matrix.h5")

7.2 Option 2: Harmony batch effect

We can use the Harmony batch effect correction algorithm (Korsunsky et al. 2019) implemented in the Signac R package. The Harmony algorithm is available on GitHub, and Signac has tutorials for integration.

Code
hm.integrated <- RunHarmony(object = unintegrated, group.by.vars = 'dataset', reduction = 'lsi', assay.use = 'peaks', project.dim = FALSE)
hm.integrated <- RunUMAP(hm.integrated, dims = 2:30, reduction = 'harmony')
DimPlot(hm.integrated, group.by = 'dataset', pt.size = 0.1)

7.3 Option 3: scATAC-seq data integration

An important first step in any integrative analysis of single-cell chromatin data is to ensure that the same features are measured in each dataset. Here, we quantify the multiome peaks in the ATAC dataset to ensure that there are common features across the two datasets. See the merge vignette for more information about merging chromatin assays.

7.4 Common peak set

If the peaks were identified independently in each experiment then they will likely not overlap perfectly. We can merge peaks from all the datasets to create a common peak set, and quantify this peak set in each experiment prior to merging the objects.

First we’ll load the peak coordinates for each experiment and convert them to genomic ranges, the use the GenomicRanges::reduce function to create a common set of peaks to quantify in each dataset.

flowchart LR    
  A(Create a common peak set) --> B(Create fragment objects)     
  A --> C(Quantify peaks in each dataset)
  C --> D(Create the objects)
  D --> E(Merge objects)
  E --> F(Merging without a \ncommon feature set)

8 scATAC-seq Downstream

flowchart LR    
  A{Integrating with scRNA-seq data} --> B(Gene activity matrix approach - \nRNA imputation)     
  A --> C(Dictionary Learning for cross-modality \nintegration - Bridge integration)
  B --> D[Check Biomarkers]
  C --> D

8.1 Option B: Dictionary Learning for cross-modality integration / Bridge integration

Broad schematic of the bridge integration workflow. From: Seurat v5, Hao et al, 2023. Nature biotechnology

Broad schematic of the bridge integration workflow. From: Seurat v5, Hao et al, 2023. Nature biotechnology

8.1.1 Step 1: Load the bridge, query, and reference datasets (each modality individually)

Input files:

  • 10x multiome dataset: Consisting of ~12,000 PBMC from a helthy donor. The dataset measures scRNA-seq and scATAC-seq in the same cell, and is available for download from 10x Genomics here.

  • scATAC-seq Query: Represents ~10,000 PBMC from a healthy donor, and is available for download here.

  • Reference from Azimuth: We load the reference (download here) from this recent paper. This reference is stored as an h5Seurat file, a format that enables on-disk storage of multimodal Seurat objects (more details on h5Seurat and SeuratDisk can be found here).

Total 2.7 Gb

# 10x multiome dataset 
# Raw data 
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5 
# fragments file 
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz 
# fragments index 
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi  

# scATAC-seq Query (Total 2.7 Gb) 
# Raw data  
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_filtered_peak_bc_matrix.h5 
# metadata 
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_singlecell.csv 
# fragments file 
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz 
# fragments index 
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fragments.tsv.gz.tbi  

# Azimuth Reference (Total 9 Gb)
wget https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat

Load and setup the 10x multiome object.

Code
library(Seurat)
library(Signac)
library(EnsDb.Hsapiens.v86)
library(dplyr)
library(ggplot2)
library(glmGamPoi)

# the 10x hdf5 file contains both data types.
inputdata.10x <- Read10X_h5("/brahms/hartmana/vignette_data/pbmc_cellranger_arc_2/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
# extract RNA and ATAC data
rna_counts <- inputdata.10x$`Gene Expression`
atac_counts <- inputdata.10x$Peaks
# Create Seurat object
obj.multi <- CreateSeuratObject(counts = rna_counts)
# Get % of mitochondrial genes
obj.multi[["percent.mt"]] <- PercentageFeatureSet(obj.multi, pattern = "^MT-")

# add the ATAC-seq assay
grange.counts <- StringToGRanges(rownames(atac_counts), sep = c(":", "-"))
grange.use <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
atac_counts <- atac_counts[as.vector(grange.use), ]
# Get gene annotations
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
# Change style to UCSC
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg38"
# File with ATAC per fragment information file
frag.file <- "/brahms/hartmana/vignette_data/pbmc_cellranger_arc_2/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"
# Add in ATAC-seq data as ChromatinAssay object
chrom_assay <- CreateChromatinAssay(
  counts = atac_counts,
  sep = c(":", "-"),
  genome = 'hg38',
  fragments = frag.file,
  min.cells = 10,
  annotation = annotations
)

# Add the ATAC assay to the multiome object
obj.multi[["ATAC"]] <- chrom_assay
# Filter ATAC data based on QC metrics
obj.multi <- subset(
  x = obj.multi,
  subset = nCount_ATAC < 7e4 &
    nCount_ATAC > 5e3 &
    nCount_RNA < 25000 &
    nCount_RNA > 1000 &
    percent.mt < 20
)

We note that it is important to quantify the same set of genomic features in the query dataset as are quantified in the multi-omic bridge. We therefore requantify the set of scATAC-seq peaks using the FeatureMatrix command. This is also described in the Signac vignettes and shown below.

Load and setup the 10x scATAC-seq query.

Code
# Load ATAC dataset
atac_pbmc_data <- Read10X_h5(filename = "data/10k_PBMC_ATAC_nextgem_Chromium_X_filtered_peak_bc_matrix.h5") 
fragpath <- "data/10k_PBMC_ATAC_nextgem_Chromium_X_fragments.tsv.gz"
# Get gene annotations
annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
# Change to UCSC style 
seqlevelsStyle(annotation) <- 'UCSC'
# Create ChromatinAssay for ATAC data
atac_pbmc_assay <- CreateChromatinAssay(
  counts = atac_pbmc_data,
  sep = c(":", "-"),
  fragments = fragpath,
  annotation = annotation
)
# Requantify query ATAC to have same features as multiome ATAC dataset
requant_multiome_ATAC <- FeatureMatrix(
  fragments = Fragments(atac_pbmc_assay),
  features = granges(obj.multi[['ATAC']]),
  cells = Cells(atac_pbmc_assay)
)
# Create assay with requantified ATAC data
ATAC_assay <- CreateChromatinAssay(
  counts = requant_multiome_ATAC,
  fragments = fragpath,
  annotation = annotation
)
# Create Seurat sbject
obj.atac  <- CreateSeuratObject(counts = ATAC_assay,assay = 'ATAC')
obj.atac[['peak.orig']] <- atac_pbmc_assay
obj.atac <- subset(obj.atac, subset = nCount_ATAC < 7e4 & nCount_ATAC > 2000)

Download Azimuth reference and load it.

Code
obj.rna <- readRDS("data/pbmc_multimodal_2023.rds")

8.1.2 Step 2: Preprocessing/normalization for all datasets

Prior to performing bridge integration, we normalize and pre-process each of the datasets (note that the reference has already been normalized). We normalize gene expression data using sctransform, and ATAC data using TF-IDF.

Code
# normalize multiome RNA
DefaultAssay(obj.multi) <- "RNA"
obj.multi <- SCTransform(obj.multi, verbose = FALSE)
# normalize multiome ATAC
DefaultAssay(obj.multi) <- "ATAC"
obj.multi <- RunTFIDF(obj.multi)
obj.multi <- FindTopFeatures(obj.multi, min.cutoff = "q0")
# normalize query
obj.atac <- RunTFIDF(obj.atac)

8.1.3 Step 3: Map scATAC-seq dataset using bridge integration

Now that we have the reference, query, and bridge datasets set up, we can begin integration. The bridge dataset enables translation between the scRNA-seq reference and the scATAC-seq query, effectively augmenting the reference so that it can map a new data type. We call this an extended reference, and first set it up. Note that you can save the results of this function and map multiple scATAC-seq datasets without having to rerun.

First, we drop the first dimension of the ATAC reduction.

Code
dims.atac <- 2:50
dims.rna <- 1:50
DefaultAssay(obj.multi) <-  "RNA"
DefaultAssay(obj.rna) <- "SCT"
obj.rna.ext <- PrepareBridgeReference(
  reference = obj.rna, bridge = obj.multi,
  reference.reduction = "spca", reference.dims = dims.rna,
  normalization.method = "SCT")

Now, we can directly find anchors between the extended reference and query objects. We use the FindBridgeTransferAnchors function, which translates the query dataset using the same dictionary as was used to translate the reference, and then identifies anchors in this space. The function is meant to mimic our FindTransferAnchors function, but to identify correspondences across modalities.

Code
bridge.anchor <- FindBridgeTransferAnchors(
  extended.reference = obj.rna.ext, query = obj.atac,
  reduction = "lsiproject", dims = dims.atac)

Once we have identified anchors, we can map the query dataset onto the reference. The MapQuery function is the same as we have previously introduced for reference mapping . It transfers cell annotations from the reference dataset, and also visualizes the query dataset on a previously computed UMAP embedding. Since our reference dataset contains cell type annotations at three levels of resolution (l1 - l3), we can transfer each level to the query dataset.

Code
obj.atac <- MapQuery(
  anchorset = bridge.anchor, reference = obj.rna.ext,
  query = obj.atac,
  refdata = list(
    l1 = "celltype.l1",
    l2 = "celltype.l2",
    l3 = "celltype.l3"),
  reduction.model = "wnn.umap")

Now we can visualize the results, plotting the scATAC-seq cells based on their predicted annotations, on the reference UMAP embedding. You can see that each scATAC-seq cell has been assigned a cell name based on the scRNA-seq defined cell ontology.

Code
# plot
DimPlot(
  obj.atac, group.by = "predicted.l2",
  reduction = "ref.umap", label = TRUE
) + ggtitle("ATAC") + NoLegend()

8.1.4 Step 4: Assessing the mapping

To assess the mapping and cell type predictions, we will first see if the predicted cell type labels are concordant with an unsupervised analysis of the scATAC-seq dataset. We follow the standard unsupervised processing workflow for scATAC-seq data:

Code
obj.atac <- FindTopFeatures(obj.atac, min.cutoff = "q0")
obj.atac <- RunSVD(obj.atac)
obj.atac <- RunUMAP(obj.atac, reduction = "lsi", dims = 2:50)

Now, we visualize the predicted cluster labels on the unsupervised UMAP emebdding. We can see that predicted cluster labels (from the scRNA-seq reference) are concordant with the structure of the scATAC-seq data. However, there are some cell types (i.e. Treg), that do not appear to separate in unsupervised analysis. These may be prediction errors, or cases where the reference mapping provides additional resolution.

Code
DimPlot(obj.atac, group.by = "predicted.l2", reduction = "umap", label = FALSE)

Lastly, we validate the predicted cell types for the scATAC-seq data by examining their chromatin accessibility profiles at canonical loci. We use the CoveragePlot function to visualize accessibility patterns at the CD8A, FOXP3, and RORC, after grouping cells by their predicted labels. We see expected patterns in each case. For example, the PAX5 locus exhibits peaks that are accessible exclusively in B cells, and the CD8A locus shows the same in CD8 T cell subsets. Similarly, the accessibility of FOXP3, a canonical marker of regulatory T cells (Tregs), in predicted Tregs provides strong support for the accuracy of our prediction.

Code
CoveragePlot(
  obj.atac, region  = "PAX5", group.by = "predicted.l1",
  idents = c("B", "CD4 T", "Mono", "NK"), window = 200,
  extend.upstream = -150000)

9 Analysis of multi-omics data

Recently, single cell multi-omics methods that run several assays on the same cells have become available. One such method is Chromium Single Cell Multiome from 10X genomics, which simultaneously measures gene expression (RNA-seq) and chromatin accessibility (ATAC-seq) in the same nuclei. This makes it possible to identify cell types and states based on both gene expression and accessibility. It also makes it easy to use external gene expression data to annotate your cells, and at the same time study the chromatin accessibility in the cells.

9.1 Joint RNA and ATAC analysis: 10x multiomic

In this tutorial, we’ll demonstrate how to jointly analyze a single-cell dataset measuring both DNA accessibility and gene expression in the same cells using Signac and Seurat. In this vignette we’ll be using a publicly available 10x Genomic Multiome dataset for human PBMCs.

9.2 Dataset overview

Libraries generated:

  • Chromium Single Cell Multiome ATAC library

  • Chromium Single Cell Multiome Gene Expression library

# multiomic data  
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5
# fragments file 
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz
# fragments index 
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi

9.2.1 📗 Step 1: Load the data and create the Seurat object

Code
library(Signac)
library(Seurat)
library(EnsDb.Hsapiens.v86) # Ensembl based annotation package (hg37)
# BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
library(BSgenome.Hsapiens.UCSC.hg38) # Full genomic sequences for Homo sapiens (UCSC genome hg38)
Code
library(future)

Attaching package: 'future'
The following object is masked from 'package:AnnotationFilter':

    value
Code
# parallelization options
plan("multicore", workers = 8)
# Increase the maximum memory usage
options(future.globals.maxSize = 14 * 1024^3)  # para 14 GB de RAM

Load data and add annotation:

Code
# load the RNA and ATAC data
counts <- Read10X_h5("data/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5")
fragpath <- "data/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz"
# get gene annotations for hg38
annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevels(annotation) <- paste0('chr', seqlevels(annotation))

# create a Seurat object containing the RNA adata
pbmc <- CreateSeuratObject(
  counts = counts$`Gene Expression`,
  assay = "RNA"
)

# create ATAC assay and add it to the object
pbmc[["ATAC"]] <- CreateChromatinAssay(
  counts = counts$Peaks,
  sep = c(":", "-"),
  fragments = fragpath,
  annotation = annotation
)

View file:

Code
pbmc
An object of class Seurat 
180488 features across 11898 samples within 2 assays 
Active assay: RNA (36601 features, 0 variable features)
 1 layer present: counts
 1 other assay present: ATAC

9.2.2 📕 Step 2: Computing QC metrics

We can compute per-cell quality control metrics using the DNA accessibility data and remove cells that are outliers for these metrics, as well as cells with low or unusually high counts for either the RNA or ATAC assay.

Code
DefaultAssay(pbmc) <- "ATAC"

pbmc <- NucleosomeSignal(pbmc)
pbmc <- TSSEnrichment(pbmc)
Extracting TSS positions
Extracting fragments at TSSs

Computing TSS enrichment score

The relationship between variables stored in the object metadata can be visualized using the DensityScatter() function. This can also be used to quickly find suitable cutoff values for different QC metrics by setting quantiles=TRUE:

Code
DensityScatter(pbmc, x = 'nCount_ATAC', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)

We can plot the distribution of each QC metric separately using a violin plot:

Code
VlnPlot(
  object = pbmc,
  features = c("nCount_RNA", "nCount_ATAC", "TSS.enrichment", "nucleosome_signal"),
  ncol = 4,
  pt.size = 0
)

Finally we remove cells that are outliers for these QC metrics. The exact QC thresholds used will need to be adjusted according to your dataset.

Code
# filter out low quality cells
pbmc <- subset(
  x = pbmc,
  subset = nCount_ATAC < 100000 &
    nCount_RNA < 25000 &
    nCount_ATAC > 1800 &
    nCount_RNA > 1000 &
    nucleosome_signal < 2 &
    TSS.enrichment > 1
)
pbmc
An object of class Seurat 
180488 features across 11211 samples within 2 assays 
Active assay: ATAC (143887 features, 0 variable features)
 2 layers present: counts, data
 1 other assay present: RNA

9.2.3 📘 Step 3: Analysis on the RNA assay

We can normalize the gene expression data using SCTransform, and reduce the dimensionality using PCA.

Code
DefaultAssay(pbmc) <- "RNA"
pbmc <- SCTransform(pbmc)
Running SCTransform on assay: RNA
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Variance stabilizing transformation of count matrix of size 24459 by 11211
Model formula is y ~ log_umi
Get Negative Binomial regression parameters per gene
Using 2000 genes, 5000 cells
Found 167 outliers - those will be ignored in fitting/regularization step
Second step: Get residuals using fitted parameters for 24459 genes
Computing corrected count matrix for 24459 genes
Calculating gene attributes
Wall clock passed: Time difference of 1.82833 mins
Determine variable features
Centering data matrix
Getting residuals for block 1(of 3) for counts dataset
Getting residuals for block 2(of 3) for counts dataset
Getting residuals for block 3(of 3) for counts dataset
Centering data matrix
Finished calculating residuals for counts
Set default assay to SCT
Code
pbmc <- RunPCA(pbmc)
PC_ 1 
Positive:  VCAN, SAT1, PLXDC2, SLC8A1, NEAT1, DPYD, ZEB2, NAMPT, LRMDA, LYZ 
       FCN1, AOAH, LYN, ANXA1, CD36, JAK2, ARHGAP26, TYMP, RBM47, PID1 
       PLCB1, LRRK2, ACSL1, LYST, IRAK3, DMXL2, MCTP1, TCF7L2, AC020916.1, GAB2 
Negative:  EEF1A1, RPL13, RPS27, RPL13A, RPL41, RPS27A, RPS2, RPL3, RPS12, RPL11 
       RPS18, BCL2, LEF1, BCL11B, RPL23A, RPL10, BACH2, IL32, RPL34, INPP4B 
       RPS3, CAMK4, IL7R, RPS26, RPS14, RPL30, RPL19, LTB, RPLP2, SKAP1 
PC_ 2 
Positive:  GNLY, CCL5, NKG7, CD247, IL32, PRKCH, PRF1, BCL11B, INPP4B, LEF1 
       VCAN, IL7R, DPYD, THEMIS, GZMA, NEAT1, CAMK4, RORA, AOAH, PLCB1 
       TXK, GZMH, TC2N, TRBC1, ANXA1, FYB1, STAT4, PITPNC1, TRAC, AAK1 
Negative:  BANK1, IGHM, AFF3, IGKC, RALGPS2, MS4A1, CD74, PAX5, EBF1, FCRL1 
       CD79A, OSBPL10, LINC00926, COL19A1, BLK, NIBAN3, IGHD, CD22, HLA-DRA, CD79B 
       ADAM28, PLEKHG1, AP002075.1, COBLL1, LIX1-AS1, CCSER1, TCF4, BCL11A, STEAP1B, GNG7 
PC_ 3 
Positive:  LEF1, CAMK4, EEF1A1, MAML2, PDE3B, FHIT, BACH2, IL7R, INPP4B, TSHZ2 
       SERINC5, NELL2, CCR7, BCL2, ANK3, PRKCA, FOXP1, TCF7, BCL11B, AC139720.1 
       RPL11, LTB, RPL13, OXNAD1, RPS27A, MLLT3, TRABD2A, RPS12, RASGRF2, VCAN 
Negative:  GNLY, NKG7, CCL5, PRF1, GZMA, GZMH, KLRD1, SPON2, FGFBP2, CST7 
       GZMB, CCL4, TGFBR3, KLRF1, CTSW, BNC2, PDGFD, ADGRG1, PPP2R2B, PYHIN1 
       A2M, IL2RB, CLIC3, MCTP2, NCAM1, ACTB, TRDC, SAMD3, NCALD, MYBL1 
PC_ 4 
Positive:  BANK1, IGHM, VCAN, IGKC, PAX5, MS4A1, GNLY, FCRL1, RALGPS2, EBF1 
       LINC00926, CD79A, OSBPL10, COL19A1, NKG7, ARHGAP24, CCL5, IGHD, BACH2, PLCB1 
       PDE4D, DPYD, LIX1-AS1, CD22, LARGE1, PLEKHG1, PLXDC2, STEAP1B, ANXA1, AP002075.1 
Negative:  TCF4, LINC01374, CUX2, AC023590.1, RHEX, PLD4, EPHB1, LINC01478, PTPRS, LINC00996 
       ZFAT, LILRA4, CLEC4C, COL26A1, FAM160A1, CCDC50, PLXNA4, ITM2C, SCN9A, COL24A1 
       UGCG, RUNX2, NRP1, GZMB, IRF8, SLC35F3, JCHAIN, BCL11A, TNFRSF21, AC007381.1 
PC_ 5 
Positive:  CDKN1C, FCGR3A, TCF7L2, IFITM3, MTSS1, AIF1, CST3, LST1, MS4A7, WARS 
       PSAP, ACTB, SERPINA1, HLA-DPA1, IFI30, CD74, COTL1, FCER1G, CFD, HLA-DRA 
       FTL, HLA-DRB1, SMIM25, CSF1R, TMSB4X, FMNL2, SAT1, MAFB, S100A4, HLA-DPB1 
Negative:  VCAN, TCF4, PLCB1, LINC01374, RHEX, CUX2, AC023590.1, DPYD, EPHB1, LINC01478 
       PTPRS, ANXA1, ZFAT, LINC00996, CD36, GZMB, PLXDC2, ARHGAP26, FCHSD2, COL26A1 
       CLEC4C, LILRA4, ACSL1, PLXNA4, AC020916.1, DYSF, ARHGAP24, CSF3R, LRMDA, MEGF9 
Caution

If you have this error:

“Error in getGlobalsAndPackages(expr, envir = envir, globals = globals) :

The total size of the 19 globals exported for future expression (‘FUN()’) is 4.69 GiB.. This exceeds the maximum allowed size of 500.00 MiB (option ‘future.globals.maxSize’). The three largest globals are ‘FUN’ (4.67 GiB of class ‘function’), ‘umi_bin’ (19.28 MiB of class ‘numeric’) and ‘data_step1’ (1.27 MiB of class ‘list’)“

You need to use parallelization in Seurat with future. If you want know more about this read this issue and documentation.

9.2.4 📙 Step 4: Analysis on the ATAC assay

Here we process the DNA accessibility assay the same way we would process a scATAC-seq dataset, by performing latent semantic indexing (LSI).

Code
# Normalization and linear dimensional reduction (LSI)
DefaultAssay(pbmc) <- "ATAC"
pbmc <- FindTopFeatures(pbmc, min.cutoff = 5)
pbmc <- RunTFIDF(pbmc)
Performing TF-IDF normalization
Code
pbmc <- RunSVD(pbmc)
Running SVD
Scaling cell embeddings

9.2.5 ✒️ Step 5: Annotating cell types

To annotate cell types in the dataset we can transfer cell labels from an existing PBMC reference dataset using tools in the Seurat package. See the Seurat reference mapping vignette for more information.

We’ll use an annotated PBMC reference dataset from Hao et al. (2020), available for download here: https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat

Note that the SeuratDisk package is required to load the reference dataset. Installation instructions for SeuratDisk can be found here.

SeuratDisk is not currently available on CRAN. You can install it from GitHub with:

Code
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("mojaveazure/seurat-disk")

If you have problems you can use this way, from this issue:

  • Download the package source locally (bash terminal):
cd ~/
git clone https://github.com/mojaveazure/seurat-disk
  • Install package from source:
Code
install.packages("~/seurat-disk", repos = NULL, type = "source")
Code
library(SeuratDisk)
Registered S3 method overwritten by 'SeuratDisk':
  method            from  
  as.sparse.H5Group Seurat
Code
# load PBMC reference
reference <- LoadH5Seurat("data/pbmc_multimodal.h5seurat", assays = list("SCT" = "counts"), reductions = 'spca')
Validating h5Seurat file
Initializing SCT with counts
Adding variable feature information for SCT
Adding miscellaneous information for SCT
Adding reduction spca
Adding cell embeddings for spca
Adding feature loadings for spca
Adding miscellaneous information for spca
Adding graph wknn
Adding graph wsnn
Adding command information
Adding cell-level metadata
Code
reference <- UpdateSeuratObject(reference)
Validating object structure
Updating object slots
Ensuring keys are in the proper structure
Updating matrix keys for DimReduc 'spca'
Ensuring keys are in the proper structure
Ensuring feature names don't have underscores or pipes
Updating slots in SCT
Updating slots in wknn
Cannot find wknn in the object, setting default assay of wknn to SCT
Updating slots in wsnn
Cannot find wsnn in the object, setting default assay of wsnn to SCT
Updating slots in spca
Validating object structure for SCTAssay 'SCT'
Validating object structure for Graph 'wknn'
Validating object structure for Graph 'wsnn'
Validating object structure for DimReduc 'spca'
Object representation is consistent with the most current Seurat version
Code
DefaultAssay(pbmc) <- "SCT"

# transfer cell type labels from reference to query
transfer_anchors <- FindTransferAnchors(
  reference = reference,
  query = pbmc,
  normalization.method = "SCT",
  reference.reduction = "spca",
  recompute.residuals = FALSE,
  dims = 1:50
)
Projecting cell embeddings
Finding neighborhoods
Finding anchors
    Found 14597 anchors
Code
predictions <- TransferData(
  anchorset = transfer_anchors, 
  refdata = reference$celltype.l2,
  weight.reduction = pbmc[['pca']],
  dims = 1:50
)
Finding integration vectors
Finding integration vector weights
Predicting cell labels
Code
pbmc <- AddMetaData(
  object = pbmc,
  metadata = predictions
)

# set the cell identities to the cell type predictions
Idents(pbmc) <- "predicted.id"

# remove low-quality predictions
pbmc <- pbmc[, pbmc$prediction.score.max > 0.5]

9.2.6 📐 Step 6: Joint UMAP visualization

Using the weighted nearest neighbor methods in Seurat v4, we can compute a joint neighbor graph that represent both the gene expression and DNA accessibility measurements.

Code
# build a joint neighbor graph using both assays
pbmc <- FindMultiModalNeighbors(
  object = pbmc,
  reduction.list = list("pca", "lsi"), 
  dims.list = list(1:50, 2:40),
  modality.weight.name = "RNA.weight",
  verbose = TRUE
)
Calculating cell-specific modality weights
Finding 20 nearest neighbors for each modality.
Calculating kernel bandwidths
Warning in FindMultiModalNeighbors(object = pbmc, reduction.list = list("pca",
: The number of provided modality.weight.name is not equal to the number of
modalities. SCT.weight ATAC.weight are used to store the modality weights
Finding multimodal neighbors
Constructing multimodal KNN graph
Constructing multimodal SNN graph
Code
# build a joint UMAP visualization
pbmc <- RunUMAP(
  object = pbmc,
  nn.name = "weighted.nn",
  assay = "RNA",
  verbose = TRUE
)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
18:31:32 UMAP embedding parameters a = 0.9922 b = 1.112
18:31:38 Commencing smooth kNN distance calibration using 8 threads with target n_neighbors = 20
18:31:41 Found 2 connected components, falling back to random initialization
18:31:41 Initializing from uniform random
18:31:41 Commencing optimization for 200 epochs, with 316712 positive edges
18:31:55 Optimization finished
Code
DimPlot(pbmc, label = TRUE, repel = TRUE, reduction = "umap") + NoLegend()

9.2.7 📊 Step 7: Linking peaks to genes

For each gene, we can find the set of peaks that may regulate the gene by by computing the correlation between gene expression and accessibility at nearby peaks, and correcting for bias due to GC content, overall accessibility, and peak size. See the Signac paper for a full description of the method we use to link peaks to genes.

Running this step on the whole genome can be time consuming, so here we demonstrate peak-gene links for a subset of genes as an example. The same function can be used to find links for all genes by omitting the genes.use parameter:

Code
DefaultAssay(pbmc) <- "ATAC"

# first compute the GC content for each peak
pbmc <- RegionStats(pbmc, genome = BSgenome.Hsapiens.UCSC.hg38)
Warning in RegionStats.default(object = regions, genome = genome, verbose =
verbose, : Not all seqlevels present in supplied genome
Code
# link peaks to genes
pbmc <- LinkPeaks(
  object = pbmc,
  peak.assay = "ATAC",
  expression.assay = "SCT",
  genes.use = c("LYZ", "MS4A1")
)
Testing 2 genes and 143871 peaks
Warning in .merge_two_Seqinfo_objects(x, y): Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': GL000194.1, GL000195.1, GL000205.2, GL000219.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1, KI270727.1, KI270728.1, KI270731.1, KI270734.1
  - in 'y': chrMT
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations
('future_lapply-1') unexpectedly generated random numbers without declaring so.
There is a risk that those random numbers are not statistically sound and the
overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This
ensures that proper, parallel-safe random numbers are produced via the
L'Ecuyer-CMRG method. To disable this check, use 'future.seed = NULL', or set
option 'future.rng.onMisuse' to "ignore".
Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations
('future_lapply-2') unexpectedly generated random numbers without declaring so.
There is a risk that those random numbers are not statistically sound and the
overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This
ensures that proper, parallel-safe random numbers are produced via the
L'Ecuyer-CMRG method. To disable this check, use 'future.seed = NULL', or set
option 'future.rng.onMisuse' to "ignore".

We can visualize these links using the CoveragePlot() function, or alternatively we could use the CoverageBrowser() function in an interactive analysis:

Code
idents.plot <- c("B naive", "B intermediate", "B memory",
                 "CD14 Mono", "CD16 Mono", "CD8 TEM", "CD8 Naive")

pbmc <- SortIdents(pbmc)
Creating pseudobulk profiles for 27 variables across 143887 features
Computing euclidean distance between pseudobulk profiles
Clustering distance matrix
Code
p1 <- CoveragePlot(
  object = pbmc,
  region = "MS4A1",
  features = "MS4A1",
  expression.assay = "SCT",
  idents = idents.plot,
  extend.upstream = 500,
  extend.downstream = 10000
)

p2 <- CoveragePlot(
  object = pbmc,
  region = "LYZ",
  features = "LYZ",
  expression.assay = "SCT",
  idents = idents.plot,
  extend.upstream = 8000,
  extend.downstream = 5000
)

patchwork::wrap_plots(p1, p2, ncol = 1)

Code
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Mexico_City
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SeuratDisk_0.0.0.9021             future_1.34.0                    
 [3] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.72.0                  
 [5] rtracklayer_1.64.0                BiocIO_1.14.0                    
 [7] Biostrings_2.72.1                 XVector_0.44.0                   
 [9] EnsDb.Hsapiens.v86_2.99.0         ensembldb_2.28.1                 
[11] AnnotationFilter_1.28.0           GenomicFeatures_1.56.0           
[13] AnnotationDbi_1.66.0              Biobase_2.64.0                   
[15] GenomicRanges_1.56.1              GenomeInfoDb_1.40.1              
[17] IRanges_2.38.1                    S4Vectors_0.42.1                 
[19] BiocGenerics_0.50.0               Seurat_5.1.0                     
[21] SeuratObject_5.0.2                sp_2.1-4                         
[23] Signac_1.14.0                    

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22            splines_4.4.1              
  [3] later_1.3.2                 bitops_1.0-8               
  [5] tibble_3.2.1                polyclip_1.10-7            
  [7] rpart_4.1.23                XML_3.99-0.17              
  [9] fastDummies_1.7.4           lifecycle_1.0.4            
 [11] hdf5r_1.3.11                globals_0.16.3             
 [13] lattice_0.22-6              MASS_7.3-60.2              
 [15] backports_1.5.0             magrittr_2.0.3             
 [17] Hmisc_5.1-3                 plotly_4.10.4              
 [19] rmarkdown_2.28              yaml_2.3.10                
 [21] httpuv_1.6.15               glmGamPoi_1.16.0           
 [23] sctransform_0.4.1           spam_2.10-0                
 [25] spatstat.sparse_3.1-0       reticulate_1.39.0          
 [27] cowplot_1.1.3               pbapply_1.7-2              
 [29] DBI_1.2.3                   RColorBrewer_1.1-3         
 [31] abind_1.4-8                 zlibbioc_1.50.0            
 [33] Rtsne_0.17                  purrr_1.0.2                
 [35] biovizBase_1.52.0           RCurl_1.98-1.16            
 [37] nnet_7.3-19                 tweenr_2.0.3               
 [39] VariantAnnotation_1.50.0    GenomeInfoDbData_1.2.12    
 [41] ggrepel_0.9.6               irlba_2.3.5.1              
 [43] listenv_0.9.1               spatstat.utils_3.1-0       
 [45] goftest_1.2-3               RSpectra_0.16-2            
 [47] spatstat.random_3.3-2       fitdistrplus_1.2-1         
 [49] parallelly_1.38.0           DelayedMatrixStats_1.26.0  
 [51] leiden_0.4.3.1              codetools_0.2-20           
 [53] DelayedArray_0.30.1         RcppRoll_0.3.1             
 [55] ggforce_0.4.2               tidyselect_1.2.1           
 [57] UCSC.utils_1.0.0            farver_2.1.2               
 [59] base64enc_0.1-3             matrixStats_1.4.0          
 [61] spatstat.explore_3.3-2      GenomicAlignments_1.40.0   
 [63] jsonlite_1.8.8              Formula_1.2-5              
 [65] progressr_0.14.0            ggridges_0.5.6             
 [67] survival_3.6-4              tools_4.4.1                
 [69] ica_1.0-3                   Rcpp_1.0.13                
 [71] glue_1.7.0                  gridExtra_2.3              
 [73] SparseArray_1.4.8           xfun_0.48                  
 [75] MatrixGenerics_1.16.0       dplyr_1.1.4                
 [77] withr_3.0.1                 fastmap_1.2.0              
 [79] fansi_1.0.6                 digest_0.6.36              
 [81] R6_2.5.1                    mime_0.12                  
 [83] colorspace_2.1-1            scattermore_1.2            
 [85] tensor_1.5                  dichromat_2.0-0.1          
 [87] spatstat.data_3.1-2         RSQLite_2.3.7              
 [89] utf8_1.2.4                  tidyr_1.3.1                
 [91] generics_0.1.3              data.table_1.16.2          
 [93] httr_1.4.7                  htmlwidgets_1.6.4          
 [95] S4Arrays_1.4.1              uwot_0.2.2                 
 [97] pkgconfig_2.0.3             gtable_0.3.5               
 [99] blob_1.2.4                  lmtest_0.9-40              
[101] htmltools_0.5.8.1           dotCall64_1.1-1            
[103] ProtGenerics_1.36.0         scales_1.3.0               
[105] png_0.1-8                   spatstat.univar_3.0-1      
[107] knitr_1.48                  rstudioapi_0.17.0          
[109] reshape2_1.4.4              rjson_0.2.22               
[111] checkmate_2.3.2             nlme_3.1-164               
[113] curl_5.2.2                  zoo_1.8-12                 
[115] cachem_1.1.0                stringr_1.5.1              
[117] KernSmooth_2.23-24          vipor_0.4.7                
[119] parallel_4.4.1              miniUI_0.1.1.1             
[121] foreign_0.8-86              ggrastr_1.0.2              
[123] restfulr_0.0.15             pillar_1.9.0               
[125] grid_4.4.1                  vctrs_0.6.5                
[127] RANN_2.6.2                  promises_1.3.0             
[129] xtable_1.8-4                cluster_2.1.6              
[131] beeswarm_0.4.0              htmlTable_2.4.3            
[133] evaluate_1.0.1              cli_3.6.2                  
[135] compiler_4.4.1              Rsamtools_2.20.0           
[137] rlang_1.1.3                 crayon_1.5.3               
[139] future.apply_1.11.2         labeling_0.4.3             
[141] ggbeeswarm_0.7.2            plyr_1.8.9                 
[143] stringi_1.8.4               viridisLite_0.4.2          
[145] deldir_2.0-4                BiocParallel_1.38.0        
[147] munsell_0.5.1               lazyeval_0.2.2             
[149] spatstat.geom_3.3-3         Matrix_1.7-0               
[151] RcppHNSW_0.6.0              patchwork_1.3.0            
[153] sparseMatrixStats_1.16.0    bit64_4.0.5                
[155] ggplot2_3.5.1               KEGGREST_1.44.1            
[157] shiny_1.9.1                 SummarizedExperiment_1.34.0
[159] ROCR_1.0-11                 igraph_2.0.3               
[161] memoise_2.0.1               fastmatch_1.1-4            
[163] bit_4.0.5                  

9.3 References