hatpal is an R package for alternative polyadenylation (APA) identification & analysis using 3’ scRNA-seq (10x etc.).
By clustering the possible polyA sites, hatpal can find high-confidence APA clusters without GTF file, and create a count matrix for downstream analysis after being annotated.
Softwares: samtools [>= 1.9] and R [>= 3.6.1]
R packages: data.table [1.13.2], GenomicAlignments [1.24.0], Matrix [1.2-18], Biostrings [2.56.0] and Ckmeans.1d.dp [4.3.3]
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
if (!require("data.table")) install.packages("data.table")
if (!require("Ckmeans.1d.dp")) install.packages("Ckmeans.1d.dp")
if (!require("Matrix")) install.packages("Matrix")
if (!require("GenomicAlignments")) BiocManager::install("GenomicAlignments")
if (!require("Biostrings")) BiocManager::install("Biostrings")
The latest hatpal release can be downloaded here.
hatpal can be installed as followed in linux:
R CMD INSTALL hatpal2_0.1.2.tar.gz
and in windows:
install.packages("hatpal2_0.1.2.tar.gz", repos = NULL, type = "win.binary")
hatpal can also be installed from github through devtools:
library(devtools)
devtools::install_github('HHengUG/hatpal2')
- An indexed BAM file generated from 3’ scRNA-seq for preprocessing
- A GTF file if you need annotation and count matrix for downstream analysis in Seurat
- A FASTA file for filtering with a .fai index in the same directory
To find the mapped soft-clipped reads, run the preprocessing of BAM file on Linux via samtools.
#!/bin/bash
echo $1
# input BAM
echo $2
# output directory
echo $3
# project name
filtered=$2$3"_F.bam"
sorted=$2$3"_S.bam"
psam=$2$3"_P3S.sam"
nsam=$2$3"_N3S.sam"
header=$2$3"_HEADER"
sbam=$2$3"_3S.bam"
ssorted=$2$3"_S3S.bam"
samtools view -@ 2 -F 256 -b -h $1 > $filtered
samtools sort $filtered > $sorted
samtools index -@ 2 $sorted
samtools view -F 16 $sorted | awk '{if ($6~/([0-9][0-9]|[3-9])S$/ ) print }' > $psam
samtools view -f 16 $sorted | awk '{if ($6~/^([0-9][0-9]|[3-9])S/ ) print }' > $nsam
samtools view -H $sorted > $header
cat $header $psam $nsam | samtools view -b > $sbam
samtools sort $sbam > $ssorted
samtools index -@ 2 $ssorted
echo $ssorted
The output *S.bam and *S3S.bam are unique-mapped BAM and soft-clipped BAM.
hatpal includes six steps and functions.
You can simply run the script provided here, or use the following functions:
Before you run these functions, you should:
library(hatpal2)
to library this package
ExtractChrinfo() extracts the chromosomes information from the indexed BAM file, including name and length.
The input should be a BAM file with a .bai index in the same directory.
chr_info <- ExtractChrinfo("input/input.sorted.bam")
The output is a data.table object in R.
ExtractAPA() generates two-dimensional coordinate files of each strand for clustering.
Two BAM files and an output directory are needed, and .bai index in the same directory is also required. The third parameter is a data.table containing chromosomes information extracted by ExtractChrinfo() and the last parameter is the output directory.
Use the original indexed BAM file as the first parameter, and the soft-clipped indexed BAM as the second parameter.
ExtractAPA("input/S.bam", "input/S3S.bam", chr_info, "output")
The outputs are six files (rc_end.prebed, rc_uni.prebed and sa.prebed) in two directories (negative_strand/ and positive_strand/), and an extra file containing the list of cell barcodes named "output/_CBlist.txt".
ClusterAPA() clusters the data in two two-dimensional coordinate files (SA file and rc_uni file) to find the APA clusters.
Every strand has three two-dimensional coordinate files (rc_end.prebed, rc_uni.prebed and sa.prebed) in two directories (negative_strand/ and positive_strand/). Four input files (rc_end.prebed, rc_uni.prebed and sa.prebed) are generated by ExtractAPA(). The first parameter is the sa file, and the second is the sa file or rc_uni file. The third parameter is the output directory. So you should run this function twice for each strand:
ClusterAPA("output/negative_strand/sa.prebed", "output/negative_strand/sa.prebed",
"output/negative_strand/")
ClusterAPA("output/positive_strand/sa.prebed", "output/positive_strand/sa.prebed",
"output/positive_strand/")
Or if you want more APA clusters, just replace second sa file with rc_uni file:
ClusterAPA("output/negative_strand/sa.prebed", "output/negative_strand/rc_uni.prebed",
"output/negative_strand/")
ClusterAPA("output/positive_strand/sa.prebed", "output/positive_strand/ec_uni.prebed",
"output/positive_strand/")
The outputs are two files containing APA clusters (APA_clusters.out).
FilterAPAc() filters internal false-positive APAc.
Every strand has one APA_clusters.out file. To filter it, use FilterAPAc() with a fasta file. The first parameter is the APA_clusters.out file, and the second is the fasta file. The third parameter is the strand and the last one is the output directory.
FilterAPA("output/positive_strand/APA_clusters.out", "ref/hg19.fasta", "+", "output/positive_strand/")
FilterAPA("output/negative_strand/APA_clusters.out", "ref/hg19.fasta", "-", "output/negative_strand/")
The outputs are two files containing APA clusters after filtering (APA_clusters.filtered.out).
MapAPAc() maps APAc with the count region.
To mark and count the full usage of every APAc, use MapAPAc() to expand the count region. The first two parameters are the rc_uni and sa files from ExtractAPA() , and the third is the APA_clusters.filtered.out file and the last one is the output directory.
MapAPA("output/positive_strand/rc_uni.prebed", "output/positive_strand/sa.prebed","output/positive_strand/APA_clusters.filtered.out", "output/positive_strand/")
MapAPA("output/negative_strand/rc_uni.prebed", "output/negative_strand/sa.prebed","output/negative_strand/APA_clusters.filtered.out", "output/negative_strand/")
The outputs are two files containing APA clusters and map region after filtering (APA_map.out).
AnnoAPAc() annotates the APA clusters from a GTF file.
The first two inputs are one APA cluster file and one GTF file. The third parameter determines the strand. If chromosome names of the GTF file do not have "chr" in them, the fourth parameter (add_chr = TRUE) will add string "chr" before them. You should also run AnnoAPAc() twice for each strand.
AnnoAPAc("output/positive_strand/APA_map.out", "ref/hg_19.gtf", "+")
AnnoAPAc("output/negative_strand/APA_map.out", "ref/hg_19.gtf", "-")
When the chromosome names of the GTF do not have "chr" in it (i.e., 1 2 3), but that of the APA_clusters.out or chr_info have (i.e., chr1 chr2 chr3), run:
AnnoAPAc("output/positive_strand/APA_map.out", "ref/hg_19.gtf", "+", add_chr = TRUE)
AnnoAPAc("output/negative_strand/APA_map.out", "ref/hg_19.gtf", "-", add_chr = TRUE)
The output (APA_map.out.anno) is a file where each annotation of clusters is described by 11 columns. The first common three columns are chr, start and end, and the fourth and fifth columns are annotated gene and type; the sixth and seventh columns are map region; the ninth column is the relative position of APAc on this annotation region. If there is no annotation for the clusters, then the last three columns will be the gene, type and the relative distance of the closest annotation on the 5' end.
CountAPAc() creates a APA_cluster-Cell_Barcode count matrix which can be used in Seurat analysis.
The first two parameters are annotated APA cluster files of positive strand and negative strand. The third parameter is a BAM file with a .bai index in the same directory. The fourth parameter is a CB list, you can use white list of cell barcode from cellranger, or use the "output/_CBlist.txt" generated by WriteEndlist(). The fifth parameter is a data.table containing chromosomes information extracted by ExtractChrinfo(), and the last parameter is the output directory.
CountCB("output/positive_strand/APA_map.out.anno",
"output/negative_strand/APA_map.out.anno",
"input/S.bam","output/_CBlist.txt",
chr_info, "output/")
The outputs are a combined annotated APA clusters file and three files (genes.tsv, barcodes.tsv and martrix.mtx) in a directory named "matrix/" under "output/".
The first 8 columns in combined annotated APA clusters file are the same as the output of AnnoAPAc(), and the last four columns are cluster unique id, cluster id in each gene, cluster count per gene and min-max normalization value of location in each gene.
The matrix can be easily loaded into Seurat:
library(Seurat)
pbmc.data <- Read10X("output/matrix/")
The 3k PBMC dataset from 10x Genomics were used for testing.
A script to run hatpal2 automatically can be found here .
Question and feedback are welcome at:
HHeng: hengheng at genomics dot cn
2021/05/08
- hatpal2 [0.1.2] is released!
- new readme