Benj utility scripts
R installation: Use devtools to install
devtools::install_github("kellislab/benj")
Python installation: Use pip
pip install git+https://github.com/kellislab/benj
Environments under conda/ can be used. Mamba is recommended for environment creation because of the number of packages
On Luria, you can put this script in ~/src/conda_init.sh
, where you can run source ~/src/conda_init.sh
to activate Conda,
and then you can run conda activate benj
to activate the benj
environment.
#!/usr/bin/env bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh" ]; then
. "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
The overall strategy is
- Load per-sample H5AD from cellranger H5, including velocyto loom. Barcodes are renamed to “Sample#barcode-1” to make multiomic ATAC+GEX analysis easier, as ArchR uses this format.
- Concatenate all H5AD, add metadata, and run scrublet to remove doublets.
- Integrate cell type iteratively
From a list of sample names and filtered_feature_bc_matrix.h5 files in a file called “items.txt”, you can enter a SLURM script as follows to build H5AD for each:
#!/usr/bin/env bash
#SBATCH -p kellis
#SBATCH --array=1-106%10
#SBATCH --nice=10000
cd /path/to/items/dir
export SAMPLE=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $1 }')
export H5=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $2 }')
export OUTPUT="${SAMPLE}.h5ad"
if [ -f "${OUTPUT}" ]; then
echo "File ${OUTPUT} already created."
else
module load lmod-conda
conda activate pybenj
`type -P time` h5ad_from_cellranger_rna.py --h5 "${H5}" --sample "${SAMPLE}" --output "${OUTPUT}"
fi
Now is when sample metadata is included. For every sample in metadata index, the H5AD is read in from the directory.
aggregate_anndata.py --metadata /path/to/md.tsv -d /path/to/h5ad/dir/ -o concatenated.h5ad
function integrate {
cls=${1:-overall}
shift
integrate_rna.py -i concatenated.h5ad \
-o "${cls}/${cls}.h5ad" \
-t "${cls}/${cls}_clust.tsv.gz" \
-f "${cls}" \
-b "batch" \
-l "${cls}_clust" \
--hvg 5000 \
--dotplot Mbp Pdgfra Aif1 Gfap Flt1 Pdgfrb Syt1 Slc17a7 Gad1 \
"$@"
}
integrate overall
integrate exc -a ./overall_clust/overall_clust.tsv.gz --subset overall_clust=C1,C2,C3 --plot overall_clust
integrate exc_L23 -a ./exc_clust/exc_clust.tsv.gz --subset exc_clust=C0,C2,C3 --plot overall_clust exc_clust
The ATAC integration pipeline is very similar to the RNA integration pipeline.
As a brief overview, this pipeline uses both ArchR and muon. ArchR is currently better at QC, peaks, and motifs, and muon is better at primary analysis (cell types, UMAPs, multiomics).
So, the same barcodes (ArchR format, “Sample#barcode-1”) is used to ensure no mixup.
The steps consist of:
library(ArchR)
addArchRGenome("hg38")
geneAnnotation = benj::createGeneAnnotationGFF("/path/to/cellranger/refdata/genes/genes.gtf", OrgDb=org.Hs.eg.db::org.Hs.eg.db, dataSource="cellranger", organism="Homo sapiens")
ArrowFiles=createArrowFiles(..., geneAnnotation=geneAnnotation)
proj = createArchRProject(ArrowFiles, outputDirectory="ArchR", geneAnnotation=geneAnnotation)
gzf = gzfile("ArchR_metadata.tsv.gz", "w")
write.table(as.data.frame(proj@cellColData), gzf, sep="\t")
close(gzf)
saveArchRProject(proj, "ArchR")
If you don’t have a previous peak annotation, use a tile BED file.
Usually the cellranger-arc aggr
pipeline provides a good peak BED that is calculated early in the process, so cellranger
can be quit after the BED
#!/usr/bin/env bash
#SBATCH -p kellis
#SBATCH --array=1-106%10
#SBATCH --nice=10000
cd /home/benjames/data/SCORCH/1.Mash_BA9_NAc/ATAC
export SAMPLE=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $1 }')
export FRAG=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $2 }')
export META="/path/to/ArchR_metadata.tsv.gz"
export PEAKS="/path/to/atac_peak_annotation.tsv.gz"
export OUTPUT="${SAMPLE}.h5ad"
if [ -f "${OUTPUT}" ]; then
echo "File ${OUTPUT} already exists"
else
module load lmod-conda
conda activate pybenj
`type -P time` h5ad_from_archr_annotation.py --fragments "${FRAG}" --sample "${SAMPLE}" --cell-metadata "${META}" --peaks "${PEAKS}" --output "${OUTPUT}"
fi
Now is when sample metadata is included. For every sample in metadata index, the H5AD is read in from the directory.
aggregate_anndata.py --metadata /path/to/md.tsv -d /path/to/h5ad/dir/ -o concatenated.h5ad
Using ArchR style gene estimation, except use the peak set instead of tile matrix.
estimate_gene_accessibility -i concatenated.h5ad -o gacc.h5ad --gtf /path/to/cellranger/genes/genes.gtf.gz
Or, if you have annotations already (or after annotations!) you can rank genes and plot:
estimate_gene_accessibility -i concatenated.h5ad -o gacc.h5ad --gtf /path/to/cellranger/genes/genes.gtf.gz --groupby CellType
Very similar to RNA integration process. But, use sample level batch correction.
function integrate {
cls=${1:-overall}
shift
integrate_atac.py -i concatenated.h5ad \
-o "${cls}/${cls}.h5ad" \
-t "${cls}/${cls}_clust.tsv.gz" \
-f "${cls}" \
-b "Sample" \
-l "${cls}_clust" \
"$@"
}
integrate overall
integrate exc -a ./overall_clust/overall_clust.tsv.gz --subset overall_clust=C1,C2,C3 --plot overall_clust
integrate exc_L23 -a ./exc_clust/exc_clust.tsv.gz --subset exc_clust=C0,C2,C3 --plot overall_clust exc_clust
From the integrated ATAC, load in the *_clust.tsv.gz
files, and addGroupCoverages in ArchR, then call new peaks.
Then, you can iteratively improve the integration.
- Currently, you should process 1) RNA first, using RNA subtyping framework.
- Then, process ATAC alone as single-omic. But, in the integrate() function, add an annotation for the
Standardized single cell data environments allow for easy computation.
This repo works best with the following environment structure per-batch (e.g. from cellranger count):
├── aggr.csv
├── sample_list.txt
├── metrics_summary.tsv
├── ArrowFiles
│ ├── archr_metadata.tsv.gz
│ ├── Sample1.arrow
│ ├── Sample2.arrow
│ ├── Sample3.arrow
│ ├── Sample4.arrow
│ └── Sample5.arrow
├── H5AD
│ ├── cellbender
│ │ ├── Sample1.h5ad
│ │ ├── Sample2.h5ad
│ │ ├── Sample3.h5ad
│ │ ├── Sample4.h5ad
│ │ ├── Sample5.h5ad
│ ├── filtered
│ │ ├── Sample1.h5ad
│ │ ├── Sample2.h5ad
│ │ ├── Sample3.h5ad
│ │ ├── Sample4.h5ad
│ │ ├── Sample5.h5ad
│ └── raw
│ ├── Sample1.h5ad
│ ├── Sample2.h5ad
│ ├── Sample3.h5ad
│ ├── Sample4.h5ad
│ └── Sample5.h5ad
├── Sample1
│ ├── _cmdline
│ ├── _filelist
│ ├── _finalstate
│ ├── _invocation
│ ├── _jobmode
│ ├── _log
│ ├── _mrosource
│ ├── outs
...
├── Sample2
│ ├── _cmdline
│ ├── _filelist
│ ├── _finalstate
│ ├── _invocation
│ ├── _jobmode
│ ├── _log
│ ├── _mrosource
│ ├── outs
...
The script check_batch.sh
allows a user to check if a certain batch matches what is expected from this structure to ensure uniform quality.
aggr.csv
is generated by benj::make_aggregation_table()
or something similar, where sample name is the 1st column, the atac_fragments
are contained in another column if ATAC is included.
Other columns should be metadata to include in H5AD per sample, such as tissue or post-mortem interval, or case/control status.
List of samples. Ideally used for cellranger-count, but used to keep track of the directories output from count
. May not necessarily be the rownames in aggr.csv
if you want to rename after counting.
Summary of metrics per sample, such as # of reads, # of estimated cells. Computed by:
cellranger_metrics_summary -i ./*/outs/*summary.csv -o metrics_summary.tsv
and can be used, e.g. extracting expected number of cells for cellbender.
H5AD files should be computed using h5ad_from_cellranger_rna.sh
, for each of filtered_feature_bc_matrix, raw_feature_bc_matrix, and cellbender_filtered.
Note that barcodes will use ArchR format, in the case of multiome data, to allow for shared barcodes to trivially overlap.
You can rename the sample if necessary. Note that velocyto counts will be included by default if exists, and cellranger counts will be aggregated into the cellranger directory if those exist.
For example, if you have the sample names used in cellranger count
in sample_list.txt and you have the new names as the rownames in aggr.csv
, you can use;
paste <(cat sample_list.txt) <(tail -n+2 <aggr.csv | cut -d , -f 1) | awk -F, '{ print "h5ad_from_cellranger_rna.sh \"$1\" \"$2\"" }' | xargs -P8 sh -c
Or, if they are the same:
< sample_list.txt xargs -P8 h5ad_from_cellranger_rna.sh
Speaking of cellbender, if computed, you should place the H5 files into the outs/
subdirectory per counts directory, just like filtered_feature_bc_matrix.h5
.
Using velocyto is easy now. If you’re in a conda env in the batch directory, you can use:
velocyto_slurmgen.sh -b /path/to/batch/directory | sbatch -p my_queue_name
to generate velocyto counts.
For ATAC, ArchR metadata is later used for metadata, and arrow files are needed regardless for secondary analysis. To compute Arrow files for each fragment file, and compute metadata,
mkdir ArrowFiles && pushd ArrowFiles
archr_from_fragments.R -a ../aggr.csv --tss 1 -g /path/to/ArchR_refdata-cellranger-arc-GRCh38-2020-A-2.0.0.rds
popd
I currently have the file ~/src/conda_init.sh
on Luria as:
#!/usr/bin/env bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh" ]; then
. "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh"
else
export PATH="/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin:$PATH"
fi
fi
unset __conda_setup
# <<< conda initialize <<<
which should allow anyone with group access to use the benj/pybenj/rbenj environments as updated.
In ~/.bashrc
of BOTH the host and all clients/desktops you want to connect from, set
export JUPYTER_LURIA_PREFIX=110
or some number above 80. This will help if anyone else is using the port number on Luria.
The simple SSH config here ~/.ssh/config
should keep things simple.
Please use an SSH key to make things easy.
Host luria.mit.edu
HostName 10.159.3.125
User your_kerb
PubkeyAcceptedKeyTypes +ssh-rsa
Match exec "echo %h | grep -qE '^b[0-9]+$'"
HostName %h
ProxyJump luria.mit.edu
User your_kerb
If you are on GNU/Linux that uses SystemD, you could use the following per-user service file, ~/.config/systemd/user/[email protected]
:
[Unit]
Description=jupyter_luria.sh "%I"
After=network.target
[Service]
ExecStart=%h/.local/bin/jupyter_luria.sh %i
Restart=on-failure
RestartSec=5s
Environment="JUPYTER_LURIA_PREFIX=95"
[Install]
WantedBy=default.target
assuming jupyter_luria.sh is in ~/.local/bin/
(you can find this out with type -P jupyter_luria.sh
),
and the JUPYTER_LURIA_PREFIX is the same as on the Luria server.
Then, systemctl --user daemon-reload
and systemctl --user start jupyter_luria@b3
should connect you to b3.
Then, to launch a Jupyter job on Luria,
sbatch jupyter_luria.sh benj
to launch a jupyter lab in the environment named “benj”.
Look at the hostname of the just-launched job via
squeue -u ${USER}
for example, b3, and connect on your laptop/desktop with:
jupyter_luria.sh b3
Then, you should go in your web browser and connect to https://localhost:XXXX/ where XXXX
is the port output by the script.
If you have trouble connecting, try running jupyter lab password
in a compute node to use password-based login instead of token-based.
To use, in your ~/.bash_profile
, put
module use /path/to/this/repo/modules
and re-login to view changes.
Conda integration is at modules/lmod-conda
To change the default Conda root directory, replace ~/data/miniconda3
with your conda root directory.
- https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/downloads/latest
- https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest
- https://support.10xgenomics.com/single-cell-atac/software/downloads/latest
Download from https://genome.ucsc.edu/cgi-bin/hgTables