ENCODE 4 Mouse snRNA-seq and snATAC-seq Analysis

Overview

Analysis of postnatal timecourse of Bl6/Cast F1 hybrid mouse development in 5 core tissues: cortex, hippocampus, heart, adrenal gland, and gastrocnemius, at 7 timepoints for snRNA-seq: PND04, PND10, PND14, PND25, PND36, PNM02, and PNM18-20 and 2 timepoints for multiome snRNA-seq+snATAC-seq: PND14 and PNM02, with 2 males and 2 females per timepoint. Also added C2C12 mouse myoblast cell line data at 0hr and 72hr differentiation timepoints.

Contact info

Analyst: Liz Rebboah, Mortazavi Lab. [email protected]

snRNA-seq

Data

Note: All Parse snRNA-seq experiments were done at the Mortazavi lab while all 10x multiome experiments were done at the Snyder lab, from nuclei isolation to sequencing.

Pre-processing

step1_get_data.sh fetches unfiltered sparse gene count matrix of all reads tar.gz files from the ENCODE portal using the batch download script xargs -L 1 curl -O -J -L < files.txt for Parse and 10x separately. The tar folders are unzipped in counts_parse or counts_10x with the ENCODE "ENCFF" filename ID as the folder name. There are 40 10x folders and 436 Parse folders. Unzipping the data requires 39G of space for Parse, 32G for 10x. Next get_counts_parse.R is called to merge Parse data by experimental batch across file IDs and filter the resulting counts matrices by gene biotypes and for nuclei > 500 UMI, which are saved as sparse matrices and tsv files for genes and barcodes in the scrublet folder. Finally get_counts_10x.R reads in 10x data by file ID and filters the counts matrices by gene biotypes and for nuclei > 0 UMI for ambient RNA removal. The sparse matrices and tsv files for genes and barcodes are saved in the same file ID folder in counts_10x.
step2_cellbender.sh runs Cellbender as an array job to remove ambient RNA from the 40 10x datasets using these Cellbender settings, where the first column is path to the file ID folder, second column is expected cells, and third column is total droplets included. The unfiltered Cellbender output is saved as cellbender.h5 in the same file ID folder in counts_10x. The Cellbender output requires an additional 56G of space.
step3_run_scrublet.sh runs followup script format_cellbender_output.R to output >500 UMI sparse matrices to the scrublet folder from the 10x cellbender.h5 files, matching the status of the Parse data. The python script run_scrublet.py detects doublets (same script for Parse and 10x) and outputs a modified barcodes tsv (_barcodes_scrublet.tsv) file in the scrublet folder with additonal columns for doublet information.
step4_submit_data.sh runs Submission_files.R to output files for submission to Synapse/ENCODE portal using these guidelines.

Celltype Annotation

Note: must run snATAC script step1_atac_archr.sh before intergrating data so we can filter the 10x snRNA nuclei for those also passing snATAC filters.

Each tissue is integrated across technologies and annotated using Seurat. The final annotations have 3 levels of granularity: gen_celltype, celltypes, and subtypes. See celltype metadata to see how these levels relate to each other.

integrate_parse_10x.R merges counts across technologies and makes 3 Seurat objects for Parse standard, Parse deep, and 10x. Nuclei are filtered (see detailed metadata for filter cutoffs). CCA integrates the 3 objects. The combined.sct object is processed with PCA, UMAP, SNN graph construction, and high-resolution clustering. Marker genes are called from Seurat clusters and saved in the seurat folder.

Hippocampus

Run integration R script with integrate_hippocampus.sh.
Check integration results and clustering resolution in HC_snRNA.ipynb.
predict_hippocampus_celltypes.R uses an external 10x dataset from mouse hippcampus and cortex subsetted by 1,000 nuclei in each annotated subtype to predict celltypes.
Check prediction results in HC_snRNA.ipynb and make adjustments.

Cortex

Run integration R script with integrate_cortex.sh.
Check integration results and clustering resolution in CX_snRNA.ipynb.
predict_cortex_celltypes.R uses an external 10x dataset from mouse hippcampus and cortex subsetted by 1,000 nuclei in each annotated subtype to predict celltypes.
Check prediction results in CX_snRNA.ipynb and make adjustments.

Adrenal

Run integration R script with integrate_adrenal.sh.
Check integration results and clustering resolution in ADR_snRNA.ipynb.
Manually annotate gen_celltype, celltypes, and subtypes in ADR_snRNA.ipynb.

Heart

Run integration R script with integrate_heart.sh.
Check integration results and clustering resolution in HT_snRNA.ipynb.
predict_heart_celltypes.R uses 2 external datasets to predict celltypes:
- Stressed mouse ventricles and associated paper
- Human heart cell atlas converted to mouse.
Check prediction results in HT_snRNA.ipynb and make adjustments.

Gastrocnemius

Run integration R script with integrate_gastroc.sh.
Check integration results and clustering resolution in GC_snRNA.ipynb.
predict_gastroc_celltypes.R uses external dataset from TA muscle to predict celltypes:
- PND10, PND21, PNM 05 and associated paper
Check prediction results in GC_snRNA.ipynb and make adjustments.

C2C12

Add metadata from paper in C2C12_snRNA.ipynb.

Regulatory gene list

Correlate miRNA expression to host gene expression: miRNA_hostGene_correlation.ipynb
Download and concatenate GO terms: Histone_regulators.ipynb
Make final regulatory gene "display table": Regulatory_genes.ipynb
Subset data on regulatory genes and merge across tissues: snRNA_regulatory_subset.ipynb

snATAC-seq

step1_atac_archr.sh downloads the 10x multiome fragment files from the ENCODE portal using the batch download script xargs -L 1 curl -O -J -L < files.txt from this cart for the 5 tissues. The tar folders are unzipped in the fragments folder with the ENCODE "ENCFF" filename ID as the folder name. There are 40 folders. Unzipping the data requires 93G of space.. Calls make_archr_proj.R which makes ArchR Arrow files, initializes ArchR Project and filters nuclei (minTSS = 4, minFrags = 1000, filterDoublets).
step2_archr_process.sh calls process_archr_proj.R to also filter snATAC nuclei for those that are also in the corresponding 10x snRNA metadata (passed RNA filters). Also includes standard ArchR processing such as iterative LSI, clustering, and UMAPs. All snATAC data is multiome, so cell type annotations are simply transferred from the snRNA annotation to the ArchR object.
step3_tissue_analysis.sh calls archr_tissue_objects.R to split 10x mouse ArchR object into 5 separate tissue objects followed by standard ArchR processing.

Submission files

Code for submission files. Files are currently hosted on Synapse:

Figure captions

Text captions for figures included in figures tarball. Not all tissues have the same figures in figures/snrna/annotation.

figures/snatac/

Plot-UMAP-Sample-Clusters.pdf: 4 snATAC UMAPs grouped by metadata: sample ID, tissue, timepoint, and annotated general celltype, celltypes, and subtypes.

figures/snrna/annotation

UMAP_final_celltypes.pdf: snRNA UMAP grouped by final annotated celltypes.
UMAP_final_gen_celltype.pdf: snRNA UMAP grouped by final annotated general celltype.
UMAP_final_subtypes.pdf: snRNA UMAP grouped by final annotated general subtypes.
UMAP_maximum_predictions.pdf: snRNA UMAP grouped by the maximum predicted celltype in each cluster.
UMAP_predictions.pdf: snRNA UMAP grouped by predicted celltypes, if using an external reference.
celltypes_marker_dotplot.pdf: Dot plot of cells grouped by annotated celltype showing average expression and percent of cells expressing selected marker genes.
cluster_marker_dotplot.pdf: Dot plot of cells grouped by cluster showing average expression and percent of cells expressing selected marker genes.
marker_featureplots.pdf: snRNA UMAP "feature plots" of selected marker genes.
subtypes_marker_dotplot.pdf: Dot plot of cells grouped by annotated subtypes showing average expression and percent of cells expressing selected marker genes.
timepoint_celltypes_proportions.pdf: Proportion of celltypes across postnatal development.

figures/snrna/clustering

Parse_10x_experiment_distribution.pdf: snRNA UMAPs grouped by cluster and split by experiment (Parse standard/deep, 10x) to check integration results, also cluster-level experiment proportion barplot.
UMAP_Parse_10x.pdf: snRNA UMAP grouped by technology and cluster.
UMAP_cluster_sample_barplot.pdf: snRNA UMAP grouped by cluster and cluster-level sample proportion barplot.
age_sex_barplot.pdf: snRNA UMAP grouped by developmental timepoint and sex alongside cluster-level timepoint and sex proportion barplots.

figures/snrna/qc

experiment_kneeplots.pdf: Knee plots showing total # UMIs of ranked cells and red dashed line at 500 UMIs, grouped by snRNA experiment (label is counts matrix file accession).
experiment_violinplots.pdf: Violin plots of # genes, # UMIs, percent mitochondrial, and percent ribosomal gene expression, grouped by sample and split by experiment (Parse standard/deep, 10x).
qc_featureplot.png: snRNA UMAP "feature plots" of # genes, # UMIs, percent mitochondrial and percent ribosomal gene expression, doublet score, and cell cycle G2M score.

Auxiliary data - available on Synapse

Auxiliary data tarball includes processed Seurat snRNA objects before and after integration and Seurat object metadata as a csv file, Seurat cluster marker genes, and snATAC ArchR project folders (arrow files, rds files, and plots).

Adrenal gland: 20GB
Left cortex: 24GB
Hippocampus: 19GB
Heart: 22GB
Gastrocnemius: 17GB

Directory structure

FAQ

Q: Why is the snRNA analysis so much longer than the snATAC analysis?

A: The snRNA comes from 2 different single cell platforms and must be integrated, then celltypes must be manually annotated. The ATAC comes from the 10x multiome where chromatin accessibility and gene expression are detected simultaneously from the same nucleus. Therefore the same nuclei that were annotated in the snRNA analysis are in the snATAC data, so the celltype annotations were simply transferred to the ATAC Archr metadata.

Q: Why are there differences in pre-processing between Parse and 10x snRNA?

A: The snRNA comes from Parse Biosciences which uses combinatorial barcoding, and 10x Genomics, which uses droplet-based barcoding. Droplet-based barcoding introduces the possibility of RNA outside of the nucleus ending up in a droplet and getting barcoded along with real nuclei ("empty droplets"). Combinatorial barcoding requires each RNA molecule to be fixed inside the nuclei across every round of barcoding, so we feel that the empty droplets filter is unnecessary. Other than ambient RNA removal, the processing for Parse and 10x data is the same.

Q: Why are there so many Parse snRNA experiments?

A: 1. We conducted Parse snRNA-seq for all 7 postnatal timepoints, while 10x snRNA-seq was done for PND14 and PNM2. 2. We sequenced every short-read Parse experiment at a standard depth, but a subset of 1,000-2,000 nuclei were deeply sequenced. These deeply sequenced nuclei were also sequenced with either PacBio or ONT long read platforms (single-nucleus LR analysis code coming soon). See our LR-Split-seq paper for more details.

Q: Why do you need to integrate 2 Parse Seurat RNA objects with 1 10x Parse RNA object?

A: See above, but basically the difference in sequencing depth between Parse "standard" and Parse "deep" is a batch effect best fixed with the same integration strategy for combining the Parse and 10x experiments. The raw counts matrices can be merged within technology, depth, and tissue (i.e. across timepoints and sexes) with no batch effects, but differences in Parse and 10x experiments (including differences in nuclei preparation) required a heavy hand at the integration step.

Name		Name	Last commit message	Last commit date
Latest commit History 445 Commits
snatac		snatac
snrna		snrna
README.md		README.md
cellbender2_environment.yml		cellbender2_environment.yml
directory_struct.png		directory_struct.png
hpc3sc_environment.yml		hpc3sc_environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ENCODE 4 Mouse snRNA-seq and snATAC-seq Analysis

Overview

Contact info

snRNA-seq

Data

Pre-processing

Celltype Annotation

Hippocampus

Cortex

Adrenal

Heart

Gastrocnemius

C2C12

Regulatory gene list

snATAC-seq

Submission files

Figure captions

figures/snatac/

figures/snrna/annotation

figures/snrna/clustering

figures/snrna/qc

Auxiliary data - available on Synapse

Directory structure

FAQ

About

Releases

Packages

Languages

erebboah/enc4_mouse

Folders and files

Latest commit

History

Repository files navigation

ENCODE 4 Mouse snRNA-seq and snATAC-seq Analysis

Overview

Contact info

snRNA-seq

Data

Pre-processing

Celltype Annotation

Hippocampus

Cortex

Adrenal

Heart

Gastrocnemius

C2C12

Regulatory gene list

snATAC-seq

Submission files

Figure captions

figures/snatac/

figures/snrna/annotation

figures/snrna/clustering

figures/snrna/qc

Auxiliary data - available on Synapse

Directory structure

FAQ

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages