Scallop2 is a reference-based transcript assembler specifically optimized for paired-/multiple-end RNA-seq data. The development of Scallop2 has been based on the Scallop assembler. Scripts and documentation that evaluate Scallop2 is available at scallop2-test.
Scallop2 can be easily installed with conda:
Install from source code: download the source code of Scallop2 from
here.
Scallop2 uses additional libraries of Boost and htslib.
If they have not been installed in your system, you first
need to download and install them. You might also need to
export the runtime library path to certain environmental
variable (for example, LD_LIBRARY_PATH
, for most linux distributions).
After install these dependencies, you then compile the source code of Scallop2.
If some of the above dependencies are not installed to the default system
directories (for example, /usr/local
, for most linux distributions),
their corresponding installing paths should be specified to configure
of Scallop2.
The entire installation typically takes a few minutes to complete.
If Boost has not been downloaded/installed, download Boost (license) from (http://www.boost.org). Uncompress it somewhere (compiling and installing are not necessary).
If htslib has not been installed, download htslib (license) from (http://www.htslib.org/) with version 1.5 or higher. Note that htslib relies on zlib. So if zlib has not been installed in your system, you need to install zlib first. To do so, download zlib (license) at (https://zlib.net/). Use the following commands to install zlib:
./configure
make
make install
After installing zlib, use the following commands to build htslib:
./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no
make
make install
The default installation location of htslib is /usr/lib
.
If you would install it to a different location, replace the above configure
line with
the following (by adding --prefix=/path/to/your/htslib
to the end):
./configure --disable-bz2 --disable-lzma --disable-gcs --disable-s3 --enable-libcurl=no --prefix=/path/to/your/htslib
In this case, you also need to export the runtime library path (note that there
is an additional lib
following the installation path):
export LD_LIBRARY_PATH=/path/to/your/htslib/lib:$LD_LIBRARY_PATH
Use the following to compile Scallop2:
./configure --with-htslib=/path/to/your/htslib --with-boost=/path/to/your/boost
make
If some of the dependencies are installed in the default system directory (for example, /usr/lib
),
then the corresponding --with-
option might not be necessary.
The executable file scallop2
will appear at src/scallop2
.
The usage of scallop2
is:
./scallop2 -i <input.bam> -o <output.gtf> [options]
The input.bam
is the read alignment file generated by some RNA-seq aligner, (for example, STAR or HISAT2).
Make sure that it is sorted; otherwise run samtools
to sort it:
samtools sort input.bam > input.sort.bam
The reconstructed transcripts shall be written as gtf format into output.gtf
.
Scallop2 support the following parameters. Please refer to additional explanations below the table.
Parameters | Default Value | Description |
---|---|---|
--help | print usage of Scallop2 and exit | |
--version | print version of Scallop2 and exit | |
--preview | show the inferred library_type and exit |
|
--verbose | 1 | chosen from {0, 1, 2} |
-f/--transcript_fragments | file to which the assembled non-full-length transcripts will be written to | |
--library_type | empty | chosen from {empty, unstranded, first, second} |
--assemble_duplicates | 10 | the number of consensus runs of the decomposition |
--min_transcript_coverage | 1.5 | the minimum coverage required to output a multi-exon transcript |
--min_single_exon_coverage | 20 | the minimum coverage required to output a single-exon transcript |
--min_transcript_length_base | 150 | the minimum base length of a transcript |
--min_transcript_length_increase | 50 | the minimum increased length of a transcript with each additional exon |
--min_mapping_quality | 1 | ignore reads with mapping quality less than this value |
--max_num_cigar | 1000 | ignore reads with CIGAR size larger than this value |
--min_bundle_gap | 100 | the minimum distances required to start a new bundle |
--min_num_hits_in_bundle | 5 | the minimum number of reads required in a bundle |
--min_flank_length | 3 | the minimum match length required in each side for a spliced read |
--min_splice_bundary_hits | 1 | the minimum number of spliced reads required to support a junction |
-
For
--verbose
, 0: quiet; 1: one line for each splice graph; 2: details of graph decomposition. -
--library_type
is highly recommended to provide. Theunstranded
,first
, andsecond
correspond tofr-unstranded
,fr-firststrand
, andfr-secondstrand
used in standard Illumina sequencing libraries. If none of them is given, i.e., it isempty
by default, then Scallop2 will try to infer thelibrary_type
by itself (see--preview
). Notice that such inference is based on theXS
tag stored in the inputbam
file. If the inputbam
file do not containXS
tag, then it is essential to provide thelibrary_type
to Scallop2. You can try--preview
to see the inferredlibrary_type
. -
--min_transcript_coverage
is used to filter lowly expressed transcripts: Scallop2 will filter out transcripts whose (predicted) raw counts (number of moleculars) is less than this number. -
--min_transcript_length_base
and--min_transcript_length_increase
is combined to filter short transcripts: the minimum length of a transcript is given by--min_transcript_length_base
+--min_transcript_length_increase
* num-of-exons-in-this-transcript. Transcripts that are less than this number will be filtered out.
The alignments of a set of 10 Illumina paired-end RNA-seq samples are available at doi:10.26208/8c06-w247. Scallop2 usually take a few minteus to an hour to assemble a typical aligned Illumina RNA-seq sample.
A small example of input data example-input.bam
is available in the example
directory.
Suppose we have installed Scallop2 following the steps in the Installation
section, we have the executable file scallop2
at src/scallop2
.
Commands to enter example
directory and run Scallop2 using example-input.bam
as input:
cd ./example
../src/scallop2 -i example-input.bam -o example-output.gtf
An output file named example-output.gtf
will appear in the example
directory.
The output file stores the reconstructed transcripts assembled by Scallop2 in GTF format.
The Gene transfer format (GTF) is a file format used to hold information about gene structure, detailed documentation about GTF format is available from Ensembl.
We can use IGV to visualize the input reads (BAM file) and the output assembled transcripts (GTF file).
A snapshot of IGV visualization of example-input.bam
and example-output.gtf
at gene loci chr1:46,303,377-46,321,251 is avaliable at example/example-IGV-snapshot.png
.