Skip to content

Commit

Permalink
Update README.rst
Browse files Browse the repository at this point in the history
  • Loading branch information
bbuchfink authored Aug 26, 2016
1 parent 666db39 commit d079360
Showing 1 changed file with 16 additions and 145 deletions.
161 changes: 16 additions & 145 deletions README.rst
Original file line number Diff line number Diff line change
@@ -1,21 +1,19 @@
**DIAMOND v0.8.19 by Benjamin Buchfink** - http://github.com/bbuchfink/diamond

Published in *Nature Methods* **12**, 59–60 (2015) | `doi:10.1038/nmeth.3176 <http://dx.doi.org/10.1038/nmeth.3176>`_
Introduction
============
DIAMOND is a sequence aligner for protein and translated DNA searches and functions as a drop-in replacement for the NCBI BLAST software tools. It is suitable for protein-protein search as well as DNA-protein search on short reads and longer sequences including contigs and assemblies, providing a speedup over BLAST ranging up to x20,000.

DIAMOND is a BLAST-compatible local aligner for mapping protein and translated DNA query sequences against a protein reference database (BLASTP and BLASTX alignment mode). The speedup over BLAST is up to 20,000 on short reads at a typical sensitivity of 90-99% relative to BLAST depending on the data and settings.
Quick start guide
=================
Please read the `manual <https://github.com/bbuchfink/diamond/raw/master/diamond_manual.pdf>`_ for detailed installation and usage instructions. This demonstrates a quick example for setting up and using the program on Linux.

Download & Installation
=======================
If you use a recent **Linux** operating system, you can download the software in binary format for immediate use::
Installing the software on your system may be done by downloading it in binary format for immediate use::

wget http://github.com/bbuchfink/diamond/releases/download/v0.8.19/diamond-linux64.tar.gz
wget http://github.com/bbuchfink/diamond/releases/download/v0.8.20/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz

Users of **Mac OS X** and **older Linux systems** need to compile the software from source (see `Compiling from source`_). On **FreeBSD**, the software can be installed using ``pkg install diamond``. A binary executable for **Windows** is available under Releases (you also need to install the Visual C++ Redistributable: https://www.microsoft.com/en-us/download/details.aspx?id=48145).
The extracted ``diamond`` binary file should be moved to a directory contained in your executable search path (PATH environment variable).

Basic command line use
======================
We assume to have a protein database file in FASTA format named ``nr.faa`` and a file of DNA reads that we want to align named ``reads.fna``.
To now run an alignment task, we assume to have a protein database file in FASTA format named ``nr.faa`` and a file of DNA reads that we want to align named ``reads.fna``.

In order to set up a reference database for DIAMOND, the ``makedb`` command needs to be executed with the following command line::

Expand All @@ -28,139 +26,12 @@ This will create a binary DIAMOND database file with the specified name (``nr.dm
The output file here is specified with the ``–o`` option and named ``matches.m8``. By default, it is generated in BLAST tabular format.

*Note*:
- Other output formats can be chosen with the ``-f`` option (see below).
- The program may use quite a lot of memory and also temporary disk space. Should the program fail due to running out of either one, you need to set a lower value for the block size parameter ``-b`` (see below).
- The default (fast) mode was mainly designed for short reads. For longer sequences, the sensitive modes (options ``--sensitive`` or ``--more-sensitive``) are recommended.
- The full speed of the program is only attained on large query datasets. It is currently not efficient in mapping a smaller number of query sequences.

About
=====
DIAMOND is developed by Benjamin Buchfink. Feel free to contact me for support (`Email <mailto:[email protected]>`_ `Twitter <http://twitter.com/bbuchfink>`_).

Commands
========
Commands are issued as the first parameter on the command line and set the task to be run by the program.

======= ===========
Command Description
======= ===========
makedb Create DIAMOND formatted reference database from a FASTA input file.
blastp Align protein query sequences against a protein reference database.
blastx Align translated DNA query sequences against a protein reference database.
view Generate formatted output from DAA files.
======= ===========

Makedb options
==============
============ ===== ======= ===========
Option Short Default Description
============ ===== ======= ===========
--in Path to protein reference database file in FASTA format (may be gzip compressed).
--db -d Path to DIAMOND database file.
============ ===== ======= ===========

General & IO options
====================
========== ===== ======= ===========
Option Short Default Description
========== ===== ======= ===========
--threads -p max Number of CPU threads.
--db -d Path to DIAMOND database file (not including the file extension .dmnd).
--query -q Path to query input file in FASTA or FASTQ format (may be gzip compressed).
--out -o stdout Path to output file.
--outfmt -f Format of output file. (tab = BLAST tabular format; sam = SAM format; xml = BLAST XML format; daa = DAA format)
--compress 0 Compression for output file (0=none, 1=gzip).
========== ===== ======= ===========

Sensitivity & speed options
===========================
================ ===== ======= ===========
Option Short Default Description
================ ===== ======= ===========
--sensitive Trigger the sensitive alignment mode with a 16x9 seed shape configuration.
--more-sensitive This mode provides some additional sensitivity compared to the sensitive mode.
--band auto Dynamic programming band for seed extension. This corresponds to the maximum length of gaps that can be found in alignments.
================ ===== ======= ===========

Scoring & Reporting Options
=================
================= ===== ======== ===========
Option Short Default Description
================= ===== ======== ===========
--gapopen 11 Gap open penalty.
--gapextend 1 Gap extension penalty.
--matrix BLOSUM62 Scoring matrix.
--seg Enable SEG masking of low complexity segments in the query (yes/no). The default is no for blastp and yes for blastx.
--max-target-seqs -k 25 The maximum number of target sequences per query to keep alignments for.
--top Keep alignments within the given percentage range of the top alignment score for a query (overrides –max-target-seqs option).
--evalue -e 0.001 Maximum expected value to keep an alignment.
--min-score Minimum bit score to keep an alignment. Setting this option will override the --evalue parameter.
--query-cover Report only alignments above the given percentage of query cover.
--salltitles Print full length subject titles in output.
================= ===== ======== ===========

Memory & performance options
============================
============== ===== ======== ===========
Option Short Default Description
============== ===== ======== ===========
--block-size -b 2.0 Block size in billions of sequence letters to be processed at a time.
--tmpdir -t Directory to be used for temporary storage.
--index-chunks -c 4 The number of chunks for processing the seed index.
============== ===== ======== ===========
The ``--block-size/-b`` parameter is the main option for controlling the program's memory usage. Bigger numbers will increase the use of memory *and* temporary disk space, but also improve performance. The program can be expected to roughly use six times this number of memory (in GB). So for the default value of ``-b2.0``, the memory usage will be about 12 GB.

The temporary directory is set to the output directory by default. The amount of disk space that will be used depends on the program's settings and your data. As a general rule you should ensure that 100 GB of disk space are available here. If you run the program in a cluster environment, and disk space is only available over a slow network based file system, you may want to set the ``--tmpdir`` option to ``/dev/shm``. This will keep temporary information in memory and thus increase the program's memory usage substantially.

The ``--index-chunks/-c`` option can be additionally used to tune the performance. It is recommended to set this to 1 on a high memory server, which will increase performance and memory usage, but not the usage of temporary disk space.

View options
============
========== ===== ======== ===========
Option Short Default Description
========== ===== ======== ===========
--daa -a Path to input file in DAA format.
--out -o Path to output file.
--outfmt -f Format of output file. (tab = BLAST tabular format; sam = SAM format; xml = BLAST XML format)
--compress 0 Compression for output file (0=none, 1=gzip).
========== ===== ======== ===========
FAQ
===
*DIAMOND is slower than claimed in the paper, even slower than BLAST.*

The DIAMOND algorithm is designed for the alignment of large datasets. The algorithm is not efficient for a small number of query sequences or only a single one of them, and speed will be low. BLAST is recommend for small datasets.

*Can several copies of DIAMOND be run in parallel?*

It is possible, but not recommended. The algorithm is more efficient if you allocate more memory to a single task. If you need to process several files, performance will be better if you run DIAMOND on them sequentially.

*Reads imported into MEGAN lack taxonomic or functional assignment.*

MEGAN requires mapping files which need to be downloaded separately at the MEGAN website and configured to be used.

.. _Compiling from source:
Compiling from source
=====================
To compile DIAMOND from source, invoke the following commands on the shell::

$ wget http://github.com/bbuchfink/diamond/archive/v0.8.19.tar.gz
$ tar xzf v0.8.19.tar.gz
$ cd diamond-0.8.19
$ mkdir bin
$ cd bin
$ cmake ..
$ make install

*Note*:
- Use cmake -DCMAKE_INSTALL_PREFIX=... to install to a different prefix.
- Use cmake -DBUILD_STATIC=ON to create a statically linked executable.

Scoring matrices
================
======== ============================================
Matrix Supported values for (gap open)/(gap extend)
======== ============================================
BLOSUM45 (10-13)/3; (12-16)/2; (16-19)/1
BLOSUM50 (9-13)/3; (12-16)/2; (15-19)/1
BLOSUM62 (6-11)/2; (9-13)/1
BLOSUM80 (6-9)/2; 13/2; 25/2; (9-11)/1
BLOSUM90 (6-9)/2; (9-11)/1
PAM250 (11-15)/3; (13-17)/2; (17-21)/1
PAM70 (6-8)/2; (9-11)/1
PAM30 (5-7)/2; (8-10)/1
======== ============================================
If you use DIAMOND in published research, please cite B. Buchfink, Xie C., D. Huson, "Fast and sensitive protein alignment using DIAMOND", Nature Methods 12, 59-60 (2015).

0 comments on commit d079360

Please sign in to comment.