Skip to content

Commit

Permalink
add readme
Browse files Browse the repository at this point in the history
qwq
  • Loading branch information
HHengUG authored May 8, 2021
1 parent f8b4a2a commit c288654
Showing 1 changed file with 256 additions and 0 deletions.
256 changes: 256 additions & 0 deletions hatpal2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,256 @@
# hatpal2 🧢🧢

**hatpal** is an R package for alternative polyadenylation (APA) identification & analysis using 3’ scRNA-seq (10x etc.).

By clustering the possible polyA sites, hatpal can find high-confidence APA clusters without GTF file, and create a count matrix for downstream analysis after being annotated.



## Before the start 📰📰

### Dependencies

Softwares: [**samtools**](http://www.htslib.org/download/) [>= 1.9] and [**R**](https://www.r-project.org/) [>= 3.6.1]

R packages: [**data.table**](https://cran.r-project.org/web/packages/data.table/index.html) [1.13.2], [**GenomicAlignments**](https://bioconductor.org/packages/release/bioc/html/GenomicAlignments.html) [1.24.0], **[Matrix](https://cran.r-project.org/web/packages/Matrix/index.html)** [1.2-18], **[Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html)** [2.56.0] and **[Ckmeans.1d.dp](https://cran.r-project.org/web/packages/Ckmeans.1d.dp/index.html)** [4.3.3]

```R
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
if (!require("data.table")) install.packages("data.table")
if (!require("Ckmeans.1d.dp")) install.packages("Ckmeans.1d.dp")
if (!require("Matrix")) install.packages("Matrix")
if (!require("GenomicAlignments")) BiocManager::install("GenomicAlignments")
if (!require("Biostrings")) BiocManager::install("Biostrings")
```



### Installation

The latest hatpal release can be downloaded [here](https://github.com/HHengUG/hatpal2/releases).

hatpal can be installed as followed in linux:

```shell
R CMD INSTALL hatpal2_0.1.2.tar.gz
```

and in windows:

```R
install.packages("hatpal2_0.1.2.tar.gz", repos = NULL, type = "win.binary")
```

hatpal can also be installed from github through [devtools](https://github.com/r-lib/devtools):

```R
library(devtools)
devtools::install_github('HHengUG/hatpal2')
```



### Input preparation

- An **indexed BAM** file generated from 3’ scRNA-seq for preprocessing
- A **GTF** file if you need annotation and count matrix for downstream analysis in Seurat
- A **FASTA** file for filtering with a .fai index in the same directory

To find the mapped soft-clipped reads, run the preprocessing of BAM file on Linux via **samtools**.

```shell
#!/bin/bash
echo $1
# input BAM
echo $2
# output directory
echo $3
# project name

filtered=$2$3"_F.bam"
sorted=$2$3"_S.bam"
psam=$2$3"_P3S.sam"
nsam=$2$3"_N3S.sam"
header=$2$3"_HEADER"
sbam=$2$3"_3S.bam"
ssorted=$2$3"_S3S.bam"

samtools view -@ 2 -F 256 -b -h $1 > $filtered
samtools sort $filtered > $sorted
samtools index -@ 2 $sorted
samtools view -F 16 $sorted | awk '{if ($6~/([0-9][0-9]|[3-9])S$/ ) print }' > $psam
samtools view -f 16 $sorted | awk '{if ($6~/^([0-9][0-9]|[3-9])S/ ) print }' > $nsam
samtools view -H $sorted > $header
cat $header $psam $nsam | samtools view -b > $sbam
samtools sort $sbam > $ssorted
samtools index -@ 2 $ssorted

echo $ssorted
```

The output *S.bam and *S3S.bam are unique-mapped BAM and soft-clipped BAM.



## Run hatpal 🚴

hatpal includes six steps and functions.

You can simply run the script provided [here](https://github.com/HHengUG/hatpal2_example), or use the following functions:

Before you run these functions, you should:

```R
library(hatpal2)
```

to library this package

### ExtractChrinfo

*ExtractChrinfo()* extracts the chromosomes information from the indexed BAM file, including name and length.

The input should be a BAM file with a .bai index in the same directory.

```R
chr_info <- ExtractChrinfo("input/input.sorted.bam")
```

The output is a data.table object in R.

### ExtractAPA

*ExtractAPA()* generates two-dimensional coordinate files of each strand for clustering.

Two BAM files and an output directory are needed, and .bai index in the same directory is also required. The third parameter is a data.table containing chromosomes information extracted by *ExtractChrinfo()* and the last parameter is the output directory.

Use the original indexed BAM file as the first parameter, and the soft-clipped indexed BAM as the second parameter.

```R
ExtractAPA("input/S.bam", "input/S3S.bam", chr_info, "output")
```

The outputs are six files (rc_end.prebed, rc_uni.prebed and sa.prebed) in two directories (negative_strand/ and positive_strand/), and an extra file containing the list of cell barcodes named "output/_CBlist.txt".

### ClusterAPA

*ClusterAPA()* clusters the data in two two-dimensional coordinate files (SA file and rc_uni file) to find the APA clusters.

Every strand has three two-dimensional coordinate files (rc_end.prebed, rc_uni.prebed and sa.prebed) in two directories (negative_strand/ and positive_strand/). Four input files (rc_end.prebed, rc_uni.prebed and sa.prebed) are generated by *ExtractAPA()*. The first parameter is the sa file, and the second is the sa file or rc_uni file. The third parameter is the output directory. So you should run this function twice for each strand:

```R
ClusterAPA("output/negative_strand/sa.prebed", "output/negative_strand/sa.prebed",
"output/negative_strand/")
ClusterAPA("output/positive_strand/sa.prebed", "output/positive_strand/sa.prebed",
"output/positive_strand/")
```

Or if you want more APA clusters, just replace second sa file with rc_uni file:

```R
ClusterAPA("output/negative_strand/sa.prebed", "output/negative_strand/rc_uni.prebed",
"output/negative_strand/")
ClusterAPA("output/positive_strand/sa.prebed", "output/positive_strand/ec_uni.prebed",
"output/positive_strand/")
```

The outputs are two files containing APA clusters (APA_clusters.out).

### FilterAPAc

*FilterAPAc()* filters internal false-positive APAc.

Every strand has one APA_clusters.out file. To filter it, use *FilterAPAc()* with a fasta file. The first parameter is the APA_clusters.out file, and the second is the fasta file. The third parameter is the strand and the last one is the output directory.

```R
FilterAPA("output/positive_strand/APA_clusters.out", "ref/hg19.fasta", "+", "output/positive_strand/")
FilterAPA("output/negative_strand/APA_clusters.out", "ref/hg19.fasta", "-", "output/negative_strand/")
```

The outputs are two files containing APA clusters after filtering (APA_clusters.filtered.out).

### MapAPAc

*MapAPAc()* maps APAc with the count region.

To mark and count the full usage of every APAc, use *MapAPAc()* to expand the count region. The first two parameters are the rc_uni and sa files from *ExtractAPA()* , and the third is the APA_clusters.filtered.out file and the last one is the output directory.

```R
MapAPA("output/positive_strand/rc_uni.prebed", "output/positive_strand/sa.prebed","output/positive_strand/APA_clusters.filtered.out", "output/positive_strand/")
MapAPA("output/negative_strand/rc_uni.prebed", "output/negative_strand/sa.prebed","output/negative_strand/APA_clusters.filtered.out", "output/negative_strand/")
```

The outputs are two files containing APA clusters and map region after filtering (APA_map.out).

### AnnoAPAc

*AnnoAPAc()* annotates the APA clusters from a GTF file.

The first two inputs are one APA cluster file and one GTF file. The third parameter determines the strand. If chromosome names of the GTF file do not have "chr" in them, the fourth parameter (add_chr = TRUE) will add string "chr" before them. You should also run *AnnoAPAc()* twice for each strand.

```R
AnnoAPAc("output/positive_strand/APA_map.out", "ref/hg_19.gtf", "+")
AnnoAPAc("output/negative_strand/APA_map.out", "ref/hg_19.gtf", "-")
```

When the chromosome names of the GTF do not have "chr" in it (i.e., 1 2 3), but that of the APA_clusters.out or chr_info have (i.e., chr1 chr2 chr3), run:

```R
AnnoAPAc("output/positive_strand/APA_map.out", "ref/hg_19.gtf", "+", add_chr = TRUE)
AnnoAPAc("output/negative_strand/APA_map.out", "ref/hg_19.gtf", "-", add_chr = TRUE)
```

The output (APA_map.out.anno) is a file where each annotation of clusters is described by 11 columns. The first common three columns are chr, start and end, and the fourth and fifth columns are annotated gene and type; the sixth and seventh columns are map region; the ninth column is the relative position of APAc on this annotation region. If there is no annotation for the clusters, then the last three columns will be the gene, type and the relative distance of the closest annotation on the 5' end.

### CountAPAc

*CountAPAc()* creates a APA_cluster-Cell_Barcode count matrix which can be used in Seurat analysis.

The first two parameters are annotated APA cluster files of positive strand and negative strand. The third parameter is a BAM file with a .bai index in the same directory. The fourth parameter is a CB list, you can use white list of cell barcode from cellranger, or use the "output/_CBlist.txt" generated by *WriteEndlist()*. The fifth parameter is a data.table containing chromosomes information extracted by *ExtractChrinfo(),* and the last parameter is the output directory.

```R
CountCB("output/positive_strand/APA_map.out.anno",
"output/negative_strand/APA_map.out.anno",
"input/S.bam","output/_CBlist.txt",
chr_info, "output/")
```

The outputs are a combined annotated APA clusters file and three files (genes.tsv, barcodes.tsv and martrix.mtx) in a directory named "matrix/" under "output/".

The first 8 columns in combined annotated APA clusters file are the same as the output of *AnnoAPAc()*, and the last four columns are cluster unique id, cluster id in each gene, cluster count per gene and min-max normalization value of location in each gene.

The matrix can be easily loaded into [Seurat](https://satijalab.org/seurat/):

```R
library(Seurat)
pbmc.data <- Read10X("output/matrix/")
```



## Sample 🎥

The [3k PBMC dataset](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.0.0/pbmc3k) from 10x Genomics were used for testing.

The sample outputs can be found [here](https://github.com/HHengUG/hatpal_example/tree/main).

A script to run hatpal automatically can be found [here](https://github.com/HHengUG/hatpal_example) .



## Contact 📨

Question and feedback are welcome at:

HHeng: hengheng at genomics dot cn



## News ✨

2021/05/08

- hatpal2 [0.1.2] is released!
- new readme

0 comments on commit c288654

Please sign in to comment.