Skip to content

Latest commit

 

History

History
254 lines (158 loc) · 10.2 KB

README.md

File metadata and controls

254 lines (158 loc) · 10.2 KB

hatpal2 🧢🧢

hatpal is an R package for alternative polyadenylation (APA) identification & analysis using 3’ scRNA-seq (10x etc.).

By clustering the possible polyA sites, hatpal can find high-confidence APA clusters without GTF file, and create a count matrix for downstream analysis after being annotated.

Before the start 📰📰

Dependencies

Softwares: samtools [>= 1.9] and R [>= 3.6.1]

R packages: data.table [1.13.2], GenomicAlignments [1.24.0], Matrix [1.2-18], Biostrings [2.56.0] and Ckmeans.1d.dp [4.3.3]

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
if (!require("data.table")) install.packages("data.table")
if (!require("Ckmeans.1d.dp")) install.packages("Ckmeans.1d.dp")
if (!require("Matrix")) install.packages("Matrix")
if (!require("GenomicAlignments")) BiocManager::install("GenomicAlignments")
if (!require("Biostrings")) BiocManager::install("Biostrings")

Installation

The latest hatpal release can be downloaded here.

hatpal can be installed as followed in linux:

R CMD INSTALL hatpal2_0.1.2.tar.gz

and in windows:

install.packages("hatpal2_0.1.2.tar.gz", repos = NULL, type = "win.binary")

hatpal can also be installed from github through devtools:

library(devtools)
devtools::install_github('HHengUG/hatpal2')

Input preparation

  • An indexed BAM file generated from 3’ scRNA-seq for preprocessing
  • A GTF file if you need annotation and count matrix for downstream analysis in Seurat
  • A FASTA file for filtering with a .fai index in the same directory

To find the mapped soft-clipped reads, run the preprocessing of BAM file on Linux via samtools.

#!/bin/bash
echo $1 
# input BAM
echo $2
# output directory
echo $3
# project name

filtered=$2$3"_F.bam"
sorted=$2$3"_S.bam"
psam=$2$3"_P3S.sam"
nsam=$2$3"_N3S.sam"
header=$2$3"_HEADER"
sbam=$2$3"_3S.bam"
ssorted=$2$3"_S3S.bam"

samtools view -@ 2 -F 256 -b -h $1 > $filtered 
samtools sort $filtered > $sorted  
samtools index -@ 2 $sorted  
samtools view -F 16 $sorted | awk '{if ($6~/([0-9][0-9]|[3-9])S$/ ) print }' > $psam 
samtools view -f 16 $sorted | awk '{if ($6~/^([0-9][0-9]|[3-9])S/ ) print }' > $nsam 
samtools view -H $sorted > $header 
cat $header $psam $nsam | samtools view -b > $sbam
samtools sort $sbam > $ssorted 
samtools index -@ 2 $ssorted 

echo $ssorted

The output *S.bam and *S3S.bam are unique-mapped BAM and soft-clipped BAM.

Run hatpal 🚴

hatpal includes six steps and functions.

You can simply run the script provided here, or use the following functions:

Before you run these functions, you should:

library(hatpal2)

to library this package

ExtractChrinfo

ExtractChrinfo() extracts the chromosomes information from the indexed BAM file, including name and length.

The input should be a BAM file with a .bai index in the same directory.

chr_info <- ExtractChrinfo("input/input.sorted.bam")

The output is a data.table object in R.

ExtractAPA

ExtractAPA() generates two-dimensional coordinate files of each strand for clustering.

Two BAM files and an output directory are needed, and .bai index in the same directory is also required. The third parameter is a data.table containing chromosomes information extracted by ExtractChrinfo() and the last parameter is the output directory.

Use the original indexed BAM file as the first parameter, and the soft-clipped indexed BAM as the second parameter.

ExtractAPA("input/S.bam", "input/S3S.bam", chr_info, "output")

The outputs are six files (rc_end.prebed, rc_uni.prebed and sa.prebed) in two directories (negative_strand/ and positive_strand/), and an extra file containing the list of cell barcodes named "output/_CBlist.txt".

ClusterAPA

ClusterAPA() clusters the data in two two-dimensional coordinate files (SA file and rc_uni file) to find the APA clusters.

Every strand has three two-dimensional coordinate files (rc_end.prebed, rc_uni.prebed and sa.prebed) in two directories (negative_strand/ and positive_strand/). Four input files (rc_end.prebed, rc_uni.prebed and sa.prebed) are generated by ExtractAPA(). The first parameter is the sa file, and the second is the sa file or rc_uni file. The third parameter is the output directory. So you should run this function twice for each strand:

ClusterAPA("output/negative_strand/sa.prebed", "output/negative_strand/sa.prebed", 
           "output/negative_strand/")
ClusterAPA("output/positive_strand/sa.prebed", "output/positive_strand/sa.prebed", 
           "output/positive_strand/")

Or if you want more APA clusters, just replace second sa file with rc_uni file:

ClusterAPA("output/negative_strand/sa.prebed", "output/negative_strand/rc_uni.prebed", 
           "output/negative_strand/")
ClusterAPA("output/positive_strand/sa.prebed", "output/positive_strand/ec_uni.prebed", 
           "output/positive_strand/")

The outputs are two files containing APA clusters (APA_clusters.out).

FilterAPAc

FilterAPAc() filters internal false-positive APAc.

Every strand has one APA_clusters.out file. To filter it, use FilterAPAc() with a fasta file. The first parameter is the APA_clusters.out file, and the second is the fasta file. The third parameter is the strand and the last one is the output directory.

FilterAPA("output/positive_strand/APA_clusters.out", "ref/hg19.fasta", "+", "output/positive_strand/")
FilterAPA("output/negative_strand/APA_clusters.out", "ref/hg19.fasta", "-", "output/negative_strand/")

The outputs are two files containing APA clusters after filtering (APA_clusters.filtered.out).

MapAPAc

MapAPAc() maps APAc with the count region.

To mark and count the full usage of every APAc, use MapAPAc() to expand the count region. The first two parameters are the rc_uni and sa files from ExtractAPA() , and the third is the APA_clusters.filtered.out file and the last one is the output directory.

MapAPA("output/positive_strand/rc_uni.prebed", "output/positive_strand/sa.prebed","output/positive_strand/APA_clusters.filtered.out", "output/positive_strand/")
MapAPA("output/negative_strand/rc_uni.prebed", "output/negative_strand/sa.prebed","output/negative_strand/APA_clusters.filtered.out", "output/negative_strand/")

The outputs are two files containing APA clusters and map region after filtering (APA_map.out).

AnnoAPAc

AnnoAPAc() annotates the APA clusters from a GTF file.

The first two inputs are one APA cluster file and one GTF file. The third parameter determines the strand. If chromosome names of the GTF file do not have "chr" in them, the fourth parameter (add_chr = TRUE) will add string "chr" before them. You should also run AnnoAPAc() twice for each strand.

AnnoAPAc("output/positive_strand/APA_map.out", "ref/hg_19.gtf", "+")
AnnoAPAc("output/negative_strand/APA_map.out", "ref/hg_19.gtf", "-")

When the chromosome names of the GTF do not have "chr" in it (i.e., 1 2 3), but that of the APA_clusters.out or chr_info have (i.e., chr1 chr2 chr3), run:

AnnoAPAc("output/positive_strand/APA_map.out", "ref/hg_19.gtf", "+", add_chr = TRUE)
AnnoAPAc("output/negative_strand/APA_map.out", "ref/hg_19.gtf", "-", add_chr = TRUE)

The output (APA_map.out.anno) is a file where each annotation of clusters is described by 11 columns. The first common three columns are chr, start and end, and the fourth and fifth columns are annotated gene and type; the sixth and seventh columns are map region; the ninth column is the relative position of APAc on this annotation region. If there is no annotation for the clusters, then the last three columns will be the gene, type and the relative distance of the closest annotation on the 5' end.

CountAPAc

CountAPAc() creates a APA_cluster-Cell_Barcode count matrix which can be used in Seurat analysis.

The first two parameters are annotated APA cluster files of positive strand and negative strand. The third parameter is a BAM file with a .bai index in the same directory. The fourth parameter is a CB list, you can use white list of cell barcode from cellranger, or use the "output/_CBlist.txt" generated by WriteEndlist(). The fifth parameter is a data.table containing chromosomes information extracted by ExtractChrinfo(), and the last parameter is the output directory.

CountCB("output/positive_strand/APA_map.out.anno", 
        "output/negative_strand/APA_map.out.anno", 
        "input/S.bam","output/_CBlist.txt", 
        chr_info, "output/")

The outputs are a combined annotated APA clusters file and three files (genes.tsv, barcodes.tsv and martrix.mtx) in a directory named "matrix/" under "output/".

The first 8 columns in combined annotated APA clusters file are the same as the output of AnnoAPAc(), and the last four columns are cluster unique id, cluster id in each gene, cluster count per gene and min-max normalization value of location in each gene.

The matrix can be easily loaded into Seurat:

library(Seurat)
pbmc.data  <- Read10X("output/matrix/")

Sample 🎥

The 3k PBMC dataset from 10x Genomics were used for testing.

A script to run hatpal2 automatically can be found here .

Contact 📨

Question and feedback are welcome at:

HHeng: hengheng at genomics dot cn

News ✨

2021/05/08

  • hatpal2 [0.1.2] is released!
  • new readme