Documentation
BALIGN is a command-line tool that operates on text files in specific formats. This page explains the command-line options of BALIGN, the file formats it recognizes, and the algorithms it incorporates.
balign -s SOURCE [-t TARGET] [-o OUTPUT] [-e] [-a ALIGNS] [-m METHOD] [-p GAPOPEN] [-x GAPEXTEND] [-i SCOREMAT] [-u SOURCE2] [-r TARGET2] [-c SCOREMAT2] [-b BALANCE]
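As a rough illustration of the synopsis above, the following Python sketch assembles an argument list from a few of the documented flags. The function name `build_balign_args` and the file names are hypothetical; only the flags themselves come from this page. The sketch builds the list without running anything.

```python
from typing import List, Optional


def build_balign_args(
    source: str,
    target: Optional[str] = None,
    output: Optional[str] = None,
    method: str = "NW",
    gap_open: float = 10,
    gap_extend: float = 0.5,
    extra_columns: bool = False,
) -> List[str]:
    """Build an argument list following the balign synopsis (hypothetical helper)."""
    if method not in ("NW", "SW", "OK"):
        raise ValueError(f"unknown method: {method}")
    args = ["balign", "-s", source]  # -s is the only mandatory option
    if target is not None:
        args += ["-t", target]
    if output is not None:
        args += ["-o", output]
    if extra_columns:
        args.append("-e")
    args += ["-m", method, "-p", str(gap_open), "-x", str(gap_extend)]
    return args


# Hypothetical file names, for illustration only.
print(build_balign_args("source.fasta", output="scores.tsv", method="SW"))
```

The resulting list could be handed to `subprocess.run` on a machine where `balign` is installed.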
-s SOURCE
- Path to the FASTA formatted file containing the list of source sequences. This option is mandatory.
-t TARGET
- Path to the FASTA formatted file containing the list of target sequences. If target sequences are not specified, the source sequences are aligned/compared against themselves.
-o OUTPUT
- Path to the pairwise score file to be generated. If omitted, no pairwise score file will be generated.
-e
- Appends four extra columns to the pairwise score file: percent identity, percent similarity, percent gaps and alignment length. This option triggers full alignment computation, which may slow down the process.
-a ALIGNS
- Path to the alignment file to be generated. This option triggers full alignment computation, which may slow down the process.
-m METHOD
- Sequence alignment/comparison method to be used. Available options are NW for Needleman-Wunsch, SW for Smith-Waterman, OK for Oommen-Kashyap. Default is NW. See Algorithms section for details.
-p GAPOPEN
- Gap opening penalty. Default is 10. Each new series of consecutive gaps is penalized by GAPOPEN.
-x GAPEXTEND
- Gap extension penalty. Default is 0.5. Each gap in a series of consecutive gaps is penalized by GAPEXTEND.
-i SCOREMAT
- Path to the BLAST formatted scoring matrix file. If omitted, the internal BLOSUM62 matrix is used.
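The GAPOPEN and GAPEXTEND descriptions above can be read as an affine gap penalty. The sketch below takes that reading literally: one GAPOPEN per run of consecutive gaps, plus one GAPEXTEND for every gap in the run, including the first. This is an assumption about BALIGN's behavior. Conventions differ between tools, and some charge the extension penalty only from the second gap onward.

```python
def gap_penalty(run_length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Penalty for one run of consecutive gaps (assumed literal reading of -p and -x)."""
    if run_length < 0:
        raise ValueError("run_length must be non-negative")
    if run_length == 0:
        return 0.0
    # One opening charge per run, one extension charge per gap in the run.
    return gap_open + run_length * gap_extend


print(gap_penalty(1))  # 10.5
print(gap_penalty(3))  # 11.5
```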
Compound alignment is a way of extending the sequence alphabets using Cartesian products of standard 8-bit alphabets. In compound alignment, each (primary) sequence has a dual secondary sequence. The compound sequence is a pair of primary and secondary sequences. Accordingly, a compound sequence element is a pair of primary and secondary elements, and the compound alphabet is the Cartesian product of the primary alphabet and the secondary alphabet.
To perform compound alignment, one has to specify secondary source sequences, secondary target sequences and a secondary scoring matrix in addition to the previous options. A balance factor may also be specified to adjust the balance between the primary scoring matrix and the secondary scoring matrix.
-u SOURCE2
- Path to the FASTA formatted file containing the list of secondary source sequences. This option is mandatory for compound alignment.
-r TARGET2
- Path to the FASTA formatted file containing the list of secondary target sequences. If primary and secondary target sequences are not specified, the source sequences are aligned/compared against themselves.
-c SCOREMAT2
- Path to the BLAST formatted secondary scoring matrix file. If omitted, the internal WALLQVIST matrix is used.
-b BALANCE
- Balance factor for scoring matrices. Default is 0.5. A balance of 0 takes only the primary sequences into account. A balance of 1 takes only the secondary sequences into account. Values between 0 and 1 linearly interpolate between the primary scoring matrix and the secondary scoring matrix.
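The linear interpolation described for BALANCE can be sketched as follows. The function name `compound_score` is hypothetical, and the formula is an assumption derived from the description above (0 selects the primary score, 1 the secondary score), not a confirmed excerpt of BALIGN's code.

```python
def compound_score(primary_score: float, secondary_score: float, balance: float = 0.5) -> float:
    """Interpolate between primary and secondary letter scores (assumed formula)."""
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in the range 0..1")
    return (1.0 - balance) * primary_score + balance * secondary_score


print(compound_score(4, -2, balance=0.0))  # 4.0, primary only
print(compound_score(4, -2, balance=1.0))  # -2.0, secondary only
print(compound_score(4, -2))               # 1.0, equal weight
```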
The FASTA file format is used in bioinformatics applications to represent nucleotide and amino acid sequences in a text-based file. Please refer to Wikipedia for more information on the standard FASTA format.
BALIGN uses the same basic syntax to represent any character-based sequence list. A BALIGN sequence may contain any 8-bit character, provided it is defined in the scoring matrix (see below), with the following exceptions:
- Control characters (0--31, 127--160).
- Space character (32).
- Number sign (#).
- Minus sign (-).
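The exclusion list above can be checked with a small helper. This is a sketch that assumes characters are identified by their code point in a single-byte encoding such as Latin-1; the function names are hypothetical. It covers only the exclusions listed here and does not check whether a character is defined in the scoring matrix.

```python
from typing import List


def is_valid_sequence_char(ch: str) -> bool:
    """Return True if a single character passes the exclusion list above."""
    code = ord(ch)
    if code > 255:            # not an 8-bit character
        return False
    if code <= 32:            # control characters 0-31 and space 32
        return False
    if 127 <= code <= 160:    # control characters 127-160
        return False
    if ch in "#-":            # number sign and minus sign
        return False
    return True


def invalid_chars(sequence: str) -> List[str]:
    """List the distinct characters of a sequence that are not allowed."""
    return sorted({ch for ch in sequence if not is_valid_sequence_char(ch)})


print(invalid_chars("elektrik_elektronik"))  # []
print(invalid_chars("a b-c#d"))              # [' ', '#', '-']
```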
The scoring matrix file specifies the letter-to-letter similarity scores. It also determines the alphabet for source and target sequences.
The BLAST scoring matrix file format is a simple way to represent scoring matrices for 8-bit characters. Some examples are available here. The basic structure of a BLAST scoring matrix file is as follows:
- Optional comment lines starting with a number sign (#).
- Header row, containing the letters of the target alphabet separated with spaces.
- Data rows, where
- the first column is a letter of the source alphabet,
- remaining columns are the similarity scores between the source letter and the corresponding target letter.
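A minimal parser for the structure described above might look like the following. It is a sketch under a few assumptions not stated on this page: columns are separated by arbitrary whitespace, blank lines are skipped, and scores are read as floats. The function name and the toy matrix are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def parse_scoring_matrix(text: str) -> Dict[Tuple[str, str], float]:
    """Map (source letter, target letter) pairs to similarity scores."""
    scores: Dict[Tuple[str, str], float] = {}
    header: Optional[List[str]] = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if header is None:
            header = fields  # header row: letters of the target alphabet
            continue
        source_letter, values = fields[0], fields[1:]
        if len(values) != len(header):
            raise ValueError(f"row {source_letter!r} has {len(values)} scores, expected {len(header)}")
        for target_letter, value in zip(header, values):
            scores[(source_letter, target_letter)] = float(value)
    if header is None:
        raise ValueError("no header row found")
    return scores


# Toy two-letter matrix, for illustration only.
example = """# toy matrix
   A  C
A  2 -1
C -1  2
"""
matrix = parse_scoring_matrix(example)
print(matrix[("A", "C")])  # -1.0
```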
The pairwise score file is created and populated by BALIGN during the sequence alignment/comparison process. It contains one line for each source-target sequence pair. Each line has five columns separated by tab characters:
- Index of the source sequence.
- Index of the target sequence.
- Length of the source sequence.
- Length of the target sequence.
- Alignment/comparison score.
With the -e option (see above), four extra columns are appended to each line:
- Percent identity.
- Percent similarity.
- Percent gaps.
- Length of the alignment.
1	1	29	29	2900	100	100	0	29
1	2	29	17	200	5	5	79	38
2	1	17	29	200	7	7	47	30
2	2	17	17	1700	100	100	0	17
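Reading a pairwise score file back can be sketched as below. The sketch assumes tab-separated lines with either five columns or nine columns (when -e was used), as described above. The `PairScore` type and its field names are hypothetical, and scores are read as floats because fractional gap penalties may produce non-integer values.

```python
from typing import NamedTuple, Optional


class PairScore(NamedTuple):
    source_index: int
    target_index: int
    source_length: int
    target_length: int
    score: float
    percent_identity: Optional[float] = None
    percent_similarity: Optional[float] = None
    percent_gaps: Optional[float] = None
    alignment_length: Optional[int] = None


def parse_score_line(line: str) -> PairScore:
    """Parse one line of a pairwise score file (5 or 9 tab-separated columns)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (5, 9):
        raise ValueError(f"expected 5 or 9 columns, got {len(fields)}")
    base = (int(fields[0]), int(fields[1]), int(fields[2]), int(fields[3]), float(fields[4]))
    if len(fields) == 5:
        return PairScore(*base)
    return PairScore(*base, float(fields[5]), float(fields[6]), float(fields[7]), int(fields[8]))


record = parse_score_line("1\t2\t29\t17\t200\t5\t5\t79\t38")
print(record.score, record.percent_gaps, record.alignment_length)  # 200.0 79.0 38
```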
If full alignment computation is enabled, BALIGN can also list actual alignments of the sequences (with all the different possibilities) in a separate file. For each source-target sequence pair, an alignment file contains:
- A header line in the format: ** {source index}, {target index}
- A list of best alignments (each having the maximum alignment score):
  - A header line in the format: * {source start}-{source end} / {target start}-{target end}, Length: {alignment length}, Score: {alignment score}, Identity: {# of identities}, Similarity: {# of similarities}, Gaps: {# of gaps}
  - Aligned source sequence.
  - Aligned target sequence.
Example:
** 1, 8
* 1-29 / 1-33, Length: 37, Score: 7, Identity: 10, Similarity: 10, Gaps: 12
elektrik_elektronik_fak---ültesi-----
---itü_t-elekomünikasyon_mühendisliği
* 1-29 / 1-33, Length: 39, Score: 7, Identity: 11, Similarity: 11, Gaps: 16
elektrik-_-elektronik_fak---ültesi-----
------itü_telekomünikasyon_mühendisliği
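The Length, Identity and Gaps fields of an alignment header can be recomputed from the two aligned lines. The sketch below assumes that a gap is a column where either sequence has a minus sign and that an identity is a column with two equal non-gap characters; the function name is hypothetical. Similarity is left out because it depends on the scoring matrix.

```python
from typing import Dict


def alignment_stats(aligned_source: str, aligned_target: str, gap_char: str = "-") -> Dict[str, int]:
    """Count alignment length, identities and gap columns for one alignment."""
    if len(aligned_source) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    identities = 0
    gaps = 0
    for a, b in zip(aligned_source, aligned_target):
        if a == gap_char or b == gap_char:
            gaps += 1
        elif a == b:
            identities += 1
    return {"length": len(aligned_source), "identity": identities, "gaps": gaps}


# First alignment from the example above.
stats = alignment_stats(
    "elektrik_elektronik_fak---ültesi-----",
    "---itü_t-elekomünikasyon_mühendisliği",
)
print(stats)  # {'length': 37, 'identity': 10, 'gaps': 12}
```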
This section is under construction.
The time complexity of the Needleman-Wunsch algorithm is O(N^2) for two sequences of length N.
- Needleman, S. & Wunsch, C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 1970, 48, 443-453
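To make the quadratic cost concrete, here is a minimal global alignment score computation in the Needleman-Wunsch style. It is not BALIGN's implementation: for brevity it uses a single linear gap penalty instead of separate GAPOPEN and GAPEXTEND values, and a toy match/mismatch function instead of a scoring matrix. All names are hypothetical. The two nested loops fill one table cell per pair of positions, which is where the O(N^2) bound comes from.

```python
from typing import Callable


def simple_score(a: str, b: str) -> float:
    """Toy scoring function: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


def needleman_wunsch_score(
    source: str,
    target: str,
    score: Callable[[str, str], float] = simple_score,
    gap: float = 1.0,
) -> float:
    """Global alignment score with a linear gap penalty (simplified sketch)."""
    m = len(target)
    # Row 0: aligning an empty source prefix against target prefixes costs only gaps.
    previous = [-gap * j for j in range(m + 1)]
    for i in range(1, len(source) + 1):
        current = [-gap * i] + [0.0] * m
        for j in range(1, m + 1):
            current[j] = max(
                previous[j - 1] + score(source[i - 1], target[j - 1]),  # align two letters
                previous[j] - gap,      # gap in target
                current[j - 1] - gap,   # gap in source
            )
        previous = current
    return previous[m]


print(needleman_wunsch_score("ABC", "AC"))  # 1.0
```

Only two rows of the table are kept, which is enough for the score alone; recovering the actual alignments requires the full table for traceback.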
The time complexity of the Smith-Waterman algorithm is O(N^2) for two sequences of length N.
- Smith, T. & Waterman, M. Identification of common molecular subsequences. J. Mol. Biol., 1981, 147, 195-197
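A local alignment score in the Smith-Waterman style can be sketched in the same simplified setting: a linear gap penalty and a toy match/mismatch function, neither of which is taken from BALIGN. All names are hypothetical. Compared with global alignment, every cell is clamped at zero and the result is the best cell anywhere in the table.

```python
from typing import Callable


def simple_score(a: str, b: str) -> float:
    """Toy scoring function: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


def smith_waterman_score(
    source: str,
    target: str,
    score: Callable[[str, str], float] = simple_score,
    gap: float = 1.0,
) -> float:
    """Best local alignment score with a linear gap penalty (simplified sketch)."""
    best = 0.0
    previous = [0.0] * (len(target) + 1)
    for a in source:
        current = [0.0]
        for j, b in enumerate(target, start=1):
            cell = max(
                0.0,                          # start a new local alignment here
                previous[j - 1] + score(a, b),
                previous[j] - gap,
                current[j - 1] - gap,
            )
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(smith_waterman_score("xxABCxx", "yyABCyy"))  # 3.0
```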
The time complexity of the Oommen-Kashyap algorithm is O(N^3) for two sequences of length N.
- Oommen, B. & Kashyap, R. A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognition, Elsevier, 1998, 31, 1159-1177
- Aygün, E.; Oommen, B. & Cataltepe, Z. Peptide classification using optimal and information theoretic syntactic modeling. Pattern Recognition, Elsevier, 2010
- Aygün, E.; Oommen, B. & Cataltepe, Z. On utilizing optimal and information theoretic syntactic modeling for peptide classification. Pattern Recognition in Bioinformatics, IAPR, 2009