This document describes HTSeq-Hadoop, which extends the HTSeq package with Hadoop implementations.

HTSeq provides an Application Programming Interface (API) for manipulating raw and processed Next Generation Sequencing (NGS) data in the Python programming language. A limitation of HTSeq is that it generally runs in a single thread, although in some cases it can scale up to a whole multicore node.

We modified two HTSeq tools that are widely used in RNA-seq analysis: htseq-count, which counts how many reads map to each gene, and htseq-qa, which assesses the quality of raw or mapped reads. Both were adapted to run in the Hadoop framework in order to significantly increase their scalability.

HTSeq-Hadoop currently provides two utilities:

  • HTSeqCount -- mimics the functionality of htseq-count
  • HTSeqQA -- mimics the functionality of htseq-qa
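
To give a feel for why quality assessment distributes well, here is a minimal sketch of one typical QA statistic, the mean base-call quality per read position. It is not the HTSeqQA code: the function names are hypothetical, and it assumes FASTQ quality strings in the Phred+33 encoding. Each chunk of reads is reduced to per-position sums and counts, and those partial results can be merged in any order.

```python
from collections import defaultdict


def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def partial_quality_sums(quality_strings):
    """Accumulate per-position [sum, count] over one chunk of reads."""
    sums = defaultdict(lambda: [0, 0])
    for qual in quality_strings:
        for pos, score in enumerate(phred33_scores(qual)):
            sums[pos][0] += score
            sums[pos][1] += 1
    return sums


def merge_and_average(partials):
    """Merge partial sums from several chunks into mean quality per position."""
    total = defaultdict(lambda: [0, 0])
    for part in partials:
        for pos, (score_sum, count) in part.items():
            total[pos][0] += score_sum
            total[pos][1] += count
    return {pos: s / n for pos, (s, n) in sorted(total.items())}


# Two chunks, as they might be handed to two independent workers.
chunk_a = partial_quality_sums(["II!", "I5"])
chunk_b = partial_quality_sums(["!!"])
means = merge_and_average([chunk_a, chunk_b])
```

Because only sums and counts are exchanged, reads of different lengths are handled naturally and the result does not depend on how the input was split.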

The runtime performance of HTSeqCount under Hadoop was compared with that of a Pig Latin script running on the Apache Pig platform. Choosing the Hadoop Streaming library also made it possible to use the GNU parallel utility to run HTSeq-Hadoop in multiple threads on multicore Linux workstations or on a single cluster node.
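
The Hadoop Streaming model passes tab-separated key/value lines between a mapper and a reducer over standard input and output, which is why the same scripts can also be driven by other tools. The sketch below illustrates that pattern for per-gene read counting. It is not the HTSeqCount implementation: the input layout (``read_id<TAB>gene_id``, with the gene assignment already computed) and all function names are assumptions made for illustration.

```python
import sys
from itertools import groupby


def map_reads(lines):
    """Mapper: emit (gene_id, 1) for every assigned read.

    Assumes hypothetical input lines of the form read_id<TAB>gene_id.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip malformed or unassigned records
        yield fields[1], 1


def reduce_counts(pairs):
    """Reducer: sum counts per gene.

    The pairs must arrive sorted by key, which Hadoop Streaming
    guarantees between the map and reduce stages.
    """
    for gene, group in groupby(pairs, key=lambda kv: kv[0]):
        yield gene, sum(count for _, count in group)


def parse_pairs(lines):
    """Parse key<TAB>value lines as produced by the mapper."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        yield key, int(value)


def run_stage(role, instream=sys.stdin, outstream=sys.stdout):
    """Run one streaming stage ("map" or "reduce") on text streams."""
    if role == "map":
        records = map_reads(instream)
    else:
        records = reduce_counts(parse_pairs(instream))
    for key, value in records:
        outstream.write(f"{key}\t{value}\n")


# Local emulation of map -> sort -> reduce on a small in-memory sample.
sample = ["r1\tgeneA\n", "r2\tgeneB\n", "r3\tgeneA\n", "bad\n"]
counts = dict(reduce_counts(sorted(map_reads(sample))))
```

In a streaming job each stage would call ``run_stage`` on its own standard input. Since the stages only read and write plain text, the same pipeline can be emulated on one machine by piping the mapper output through ``sort`` into the reducer.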

The documentation for HTSeq-Hadoop is available here.