Here we describe HTSeq-Hadoop, which extends the HTSeq package with Hadoop implementations.

HTSeq provides an Application Programming Interface (API) for manipulating raw and processed Next Generation Sequencing (NGS) data in the Python programming language. A limitation of HTSeq is that it generally runs in a single thread, although in some cases it can scale up to a whole multicore node.

We modified two HTSeq tools that are widely used in RNA-seq analysis: htseq-count, which counts how many reads are mapped to the genes, and htseq-qa, which assesses the quality of raw or mapped reads. Both were adapted to run in the Hadoop framework in order to increase scalability.

HTSeq-Hadoop currently contains two utilities:

HTSeqCount -- mimicking the functionality of htseq-count
HTSeqQA -- mimicking the functionality of htseq-qa
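To give a feel for the kind of statistic a quality-assessment utility works with, the following minimal sketch computes the mean base quality at each read position from FASTQ quality strings. It is an illustration only, not the HTSeqQA implementation: the function name `mean_quality_per_position` and the Phred+33 encoding are assumptions made for this example.

```python
from collections import defaultdict


def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred quality for each read position.

    quality_strings: iterable of FASTQ quality lines (one per read).
    offset: ASCII offset of the quality encoding (33 is assumed here).
    Reads may have different lengths; each position is averaged only
    over the reads that are long enough to cover it.
    """
    sums = defaultdict(int)    # position -> sum of quality scores
    counts = defaultdict(int)  # position -> number of reads covering it
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            sums[pos] += ord(char) - offset
            counts[pos] += 1
    return [sums[pos] / counts[pos] for pos in sorted(counts)]
```

Because the per-position sums and counts are additive, partial results computed on separate chunks of reads can be merged by simple addition, which is what makes this kind of statistic a natural fit for a map/reduce job.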

The runtime performance of HTSeqCount under Hadoop was compared with that of a Pig Latin script on the Apache Pig platform. Because HTSeq-Hadoop uses the Hadoop Streaming library, it can also be run with the GNU Parallel utility in parallel on multicore Linux workstations or on a cluster node.
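The Hadoop Streaming model mentioned above can be sketched as a mapper and a reducer that exchange key/value pairs. The example below is a simplified illustration, not the HTSeqCount code: it assumes a hypothetical tab-separated input of `read_id<TAB>gene_id` lines in which each read has already been assigned to a gene, whereas a real count step would first have to resolve overlaps between alignments and annotated features. The function names are likewise made up for this sketch.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit (gene_id, 1) for every read assigned to a gene."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and reads without a gene assignment.
        if len(fields) < 2 or not fields[1]:
            continue
        yield fields[1], 1


def reduce_pairs(pairs):
    """Reducer: sum the counts of consecutive pairs sharing a key.

    As in Hadoop Streaming, the input must already be sorted by key.
    """
    for gene_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield gene_id, sum(count for _, count in group)


def run_local(lines):
    """Emulate map -> sort -> reduce in a single process for testing."""
    return dict(reduce_pairs(sorted(map_lines(lines))))
```

In a streaming job the mapper and reducer would be separate scripts reading from standard input and writing tab-separated pairs to standard output, with the framework performing the sort between them. Since each stage is an ordinary filter over text lines, the same scripts can be chained with a plain `sort` on one machine, which is what makes running them in parallel on a workstation straightforward.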

The documentation for HTSeq-Hadoop is available here.
