eseraygun edited this page May 30, 2012 · 1 revision

BALIGN is a command line tool that aligns and compares sequences stored in text files of specific formats. This page describes the command line options of BALIGN, the file formats it recognizes and the algorithms it incorporates.

Command Line Options

balign -s SOURCE [-t TARGET] [-o OUTPUT] [-e] [-a ALIGNS] [-m METHOD] [-p GAPOPEN] [-x GAPEXTEND] [-i SCOREMAT] [-u SOURCE2] [-r TARGET2] [-c SCOREMAT2] [-b BALANCE]
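As a rough illustration of the synopsis above, the following Python sketch assembles an argument list from a few of the listed flags. The helper name `balign_command` and the file names are hypothetical, and the sketch assumes the executable is called `balign`, as in the synopsis; it only builds the command and does not run it.

```python
import shlex

def balign_command(source, target=None, output=None, method="NW",
                   gap_open=10, gap_extend=0.5):
    """Build an argument list using flags taken from the synopsis."""
    args = ["balign", "-s", source]        # -s is the only mandatory option
    if target is not None:
        args += ["-t", target]
    if output is not None:
        args += ["-o", output]
    args += ["-m", method, "-p", str(gap_open), "-x", str(gap_extend)]
    return args

# Placeholder file names, not files shipped with BALIGN.
cmd = balign_command("source.fasta", target="target.fasta",
                     output="scores.tsv", method="SW")
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` on a system where BALIGN is installed.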

Basic Options

-s SOURCE

Path to the FASTA formatted file containing the list of source sequences. This option is mandatory.
-t TARGET
Path to the FASTA formatted file containing the list of target sequences. If target sequences are not specified, the source sequences are aligned/compared against themselves.
-o OUTPUT
Path to the pairwise score file to be generated. If omitted, no pairwise score file will be generated.
-e
Appends four extra columns to the pairwise score file: percent identity, percent similarity, percent gaps and alignment length. This option triggers full alignment computation, which may slow down the process.
-a ALIGNS
Path to the alignment file to be generated. This option triggers full alignment computation, which may slow down the process.
-m METHOD
Sequence alignment/comparison method to be used. Available options are NW for Needleman-Wunsch, SW for Smith-Waterman and OK for Oommen-Kashyap. Default is NW. See the Algorithms section for details.
-p GAPOPEN
Gap opening penalty. Default is 10. Each new series of consecutive gaps is penalized by GAPOPEN.
-x GAPEXTEND
Gap extension penalty. Default is 0.5. Each gap in a series of consecutive gaps is penalized by GAPEXTEND.
-i SCOREMAT
Path to the BLAST formatted scoring matrix file. If omitted, the internal BLOSUM62 matrix is used.
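The two gap options can be read as an affine gap penalty. The sketch below follows a literal reading of the descriptions above, where a run of k consecutive gaps costs GAPOPEN + k × GAPEXTEND. This is an assumption: some tools charge GAPOPEN + (k − 1) × GAPEXTEND instead, and this page does not say which convention BALIGN uses. The function names are illustrative.

```python
def gap_run_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one run of `length` consecutive gaps (literal reading)."""
    if length <= 0:
        return 0.0
    return gap_open + length * gap_extend

def total_gap_penalty(aligned, gap_open=10.0, gap_extend=0.5):
    """Sum the penalties of all gap runs in one aligned sequence."""
    total = 0.0
    run = 0
    for ch in aligned + "x":   # the sentinel closes a trailing gap run
        if ch == "-":
            run += 1
        else:
            total += gap_run_penalty(run, gap_open, gap_extend)
            run = 0
    return total

print(gap_run_penalty(3))            # one run of three gaps
print(total_gap_penalty("ab--c-"))   # two separate runs
```

With the default values, a single long gap is much cheaper than several short ones, which is the usual motivation for separating the opening and extension penalties.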

Compound Alignment Options

Compound alignment is a way of extending the sequence alphabets using Cartesian products of standard 8-bit alphabets. In compound alignment, each (primary) sequence has a dual secondary sequence. The compound sequence is a pair of primary and secondary sequences. Accordingly, a compound sequence element is a pair of primary and secondary elements, and the compound alphabet is the Cartesian product of the primary alphabet and the secondary alphabet.

To perform compound alignment, one has to specify secondary source sequences, secondary target sequences and a secondary scoring matrix in addition to the previous options. A balance factor may also be specified to adjust the balance between the primary scoring matrix and the secondary scoring matrix.

-u SOURCE2

Path to the FASTA formatted file containing the list of secondary source sequences. This option is mandatory for compound alignment.
-r TARGET2
Path to the FASTA formatted file containing the list of secondary target sequences. If primary and secondary target sequences are not specified, the source sequences are aligned/compared against themselves.
-c SCOREMAT2
Path to the BLAST formatted secondary scoring matrix file. If omitted, the internal WALLQVIST matrix is used.
-b BALANCE
Balance factor for scoring matrices. Default is 0.5. A balance of 0 takes only the primary sequences into account, and a balance of 1 takes only the secondary sequences into account. Values between 0 and 1 linearly interpolate between the primary scoring matrix and the secondary scoring matrix.
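The linear interpolation described for BALANCE can be sketched as follows. The function name and the example scores are illustrative; the formula is the standard linear blend implied by the description, not code taken from BALIGN.

```python
def compound_score(primary_score, secondary_score, balance=0.5):
    """Blend a primary and a secondary similarity score.

    balance = 0 uses only the primary score, balance = 1 only the secondary.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in the range 0 to 1")
    return (1.0 - balance) * primary_score + balance * secondary_score

# Hypothetical scores for one pair of compound sequence elements.
print(compound_score(4, -2, balance=0.0))   # primary only
print(compound_score(4, -2, balance=1.0))   # secondary only
print(compound_score(4, -2))                # default balance of 0.5
```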

Files and Formats

Sequence List Files (FASTA Format)

The FASTA file format is used in bioinformatics applications to represent nucleotide and amino acid sequences in a text-based file. Please refer to Wikipedia for more information on the standard FASTA format.

BALIGN uses the same basic syntax to represent any character-based sequence list. A BALIGN sequence may contain any 8-bit character, provided it is defined in the scoring matrix (see below), with the following exceptions:

  • Control characters (0--31, 127--160).
  • Space character (32).
  • Number sign (#).
  • Minus sign (-).
Also, all upper-case Latin letters are transformed into lower-case during parsing. In total, BALIGN can recognize 161 distinct characters.
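The figure of 161 characters can be checked with a short Python sketch that applies the exclusions above to the 256 possible 8-bit codes. It assumes that only the 26 ASCII letters A-Z are folded to lower case, which is the reading that reproduces the stated total; the helper names are illustrative.

```python
def is_excluded(code):
    """True if the 8-bit character code may not appear in a sequence."""
    return (
        code <= 31                 # control characters 0-31
        or 127 <= code <= 160      # control characters 127-160
        or code == 32              # space character
        or code == ord("#")        # number sign
        or code == ord("-")        # minus sign
    )

def normalize(code):
    """Fold upper-case ASCII letters (A-Z) to lower case (assumption)."""
    if ord("A") <= code <= ord("Z"):
        return code + 32
    return code

alphabet = {normalize(c) for c in range(256) if not is_excluded(c)}
print(len(alphabet))  # 161
```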

Scoring Matrix File (BLAST Format)

The scoring matrix file specifies the letter-to-letter similarity scores. It also determines the alphabet for source and target sequences.

The BLAST scoring matrix file format is a simple way to represent scoring matrices for 8-bit characters. Some examples are available here. The basic structure of a BLAST scoring matrix file is as follows:

  • Optional comment lines starting with a number sign (#).
  • Header row, containing the letters of the target alphabet separated with spaces.
  • Data rows, where
    • the first column is a letter of the source alphabet,
    • remaining columns are the similarity scores between the source letter and the corresponding target letter.
The source alphabet and the target alphabet may differ. 8-bit characters representing the letters are subject to the limitations listed in the previous section.
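A minimal Python sketch of a reader for the structure described above is shown below. It is not BALIGN's own parser: the function name, the toy matrix and the choice of a dictionary keyed by (source letter, target letter) are assumptions made for illustration.

```python
def parse_scoring_matrix(text):
    """Parse a BLAST-style scoring matrix into {(source, target): score}."""
    target_letters = None
    scores = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue                        # skip blank and comment lines
        fields = stripped.split()
        if target_letters is None:
            target_letters = fields         # header row: target alphabet
            continue
        source_letter, values = fields[0], fields[1:]
        if len(values) != len(target_letters):
            raise ValueError("row %r has the wrong number of scores" % source_letter)
        for target_letter, value in zip(target_letters, values):
            scores[(source_letter, target_letter)] = float(value)
    return scores

# Toy matrix for illustration only; it is not shipped with BALIGN.
toy_matrix = """\
# toy matrix
   a  b  c
a  2 -1 -1
b -1  2  0
"""
matrix = parse_scoring_matrix(toy_matrix)
print(matrix[("b", "c")])
```

The toy matrix has two source letters and three target letters, which illustrates that the two alphabets may differ.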

Pairwise Score File

The pairwise score file is created and populated by BALIGN during the sequence alignment/comparison process. It contains one line for each source-target sequence pair. In each line, there are five columns separated by tab characters:

  • Index of the source sequence.
  • Index of the target sequence.
  • Length of the source sequence.
  • Length of the target sequence.
  • Alignment/comparison score.
If full alignment computation is enabled by the -e option (see above), four extra columns are appended to each line:
  • Percent identity.
  • Percent similarity.
  • Percent gaps.
  • Length of the alignment.
Example:
1	1	29	29	2900	100	100	0	29
1	2	29	17	200	5	5	79	38
2	1	17	29	200	7	7	47	30
2	2	17	17	1700	100	100	0	17
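Lines of this file could be read with a sketch like the following. The type and field names are illustrative, and the sketch assumes that indices, lengths and the alignment length are integers while the score and the percentages may be fractional.

```python
from typing import NamedTuple, Optional

class PairScore(NamedTuple):
    source_index: int
    target_index: int
    source_length: int
    target_length: int
    score: float
    percent_identity: Optional[float] = None    # present only with -e
    percent_similarity: Optional[float] = None
    percent_gaps: Optional[float] = None
    alignment_length: Optional[int] = None

def parse_pair_line(line):
    """Parse one tab-separated line with five or nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (5, 9):
        raise ValueError("expected 5 or 9 columns, got %d" % len(fields))
    base = [int(f) for f in fields[:4]] + [float(fields[4])]
    if len(fields) == 5:
        return PairScore(*base)
    extras = [float(f) for f in fields[5:8]] + [int(fields[8])]
    return PairScore(*base, *extras)

row = parse_pair_line("1\t2\t29\t17\t200\t5\t5\t79\t38")
print(row.score, row.percent_gaps, row.alignment_length)
```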

Alignment File

If full alignment computation is enabled, BALIGN can also list actual alignments of the sequences (with all the different possibilities) in a separate file. For each source-target sequence pair, an alignment file contains:

  • A header line in the format ** {source index}, {target index}.
  • A list of best alignments (each having the maximum alignment score):
    • A header line in the format * {source start}-{source end} / {target start}-{target end}, Length: {alignment length}, Score: {alignment score}, Identity: {# of identities}, Similarity: {# of similarities}, Gaps: {# of gaps}.
    • Aligned source sequence.
    • Aligned target sequence.
Gaps are represented by minus signs (-) in the aligned sequences.

Example:

** 1, 8
* 1-29 / 1-33, Length: 37, Score: 7, Identity: 10, Similarity: 10, Gaps: 12
elektrik_elektronik_fak---ültesi-----
---itü_t-elekomünikasyon_mühendisliği
* 1-29 / 1-33, Length: 39, Score: 7, Identity: 11, Similarity: 11, Gaps: 16
elektrik-_-elektronik_fak---ültesi-----
------itü_telekomünikasyon_mühendisliği
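The per-alignment header line can be parsed with a regular expression, as in the sketch below. The pattern is derived only from the format and example shown above, and the field names are illustrative. The gap check assumes that "Gaps" counts the minus signs in both aligned sequences, which matches the first alignment of the example.

```python
import re

HEADER = re.compile(
    r"^\* (\d+)-(\d+) / (\d+)-(\d+), Length: (\d+), Score: ([^,]+), "
    r"Identity: (\d+), Similarity: (\d+), Gaps: (\d+)$"
)
NAMES = ("source_start", "source_end", "target_start", "target_end",
         "length", "score", "identities", "similarities", "gaps")

def parse_alignment_header(line):
    """Turn one '* ...' header line into a dictionary of numbers."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError("not an alignment header: %r" % line)
    values = [float(v) if name == "score" else int(v)
              for name, v in zip(NAMES, match.groups())]
    return dict(zip(NAMES, values))

def count_gaps(aligned_source, aligned_target):
    """Count gap characters in both aligned sequences."""
    return aligned_source.count("-") + aligned_target.count("-")

header = parse_alignment_header(
    "* 1-29 / 1-33, Length: 37, Score: 7, Identity: 10, Similarity: 10, Gaps: 12"
)
source = "elektrik_elektronik_fak---ültesi-----"
target = "---itü_t-elekomünikasyon_mühendisliği"
print(header["gaps"], count_gaps(source, target))
```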

Algorithms

This section is under construction.

Needleman-Wunsch Algorithm for Global Alignment

The time complexity of the Needleman-Wunsch algorithm is $O(N^2)$ for two sequences of length $N$.

  • Needleman, S. & Wunsch, C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 1970, 48, 443-453.
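The quadratic cost comes from filling a table with one cell per pair of positions. The following is a minimal sketch of the textbook Needleman-Wunsch recurrence, not BALIGN's implementation: for brevity it assumes a single linear gap penalty instead of the GAPOPEN/GAPEXTEND pair, and the `score` callback stands in for a scoring matrix lookup.

```python
def needleman_wunsch_score(source, target, score, gap=1.0):
    """Global alignment score with a linear gap penalty (simplification)."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = -gap * i              # source prefix aligned to gaps
    for j in range(1, cols):
        table[0][j] = -gap * j              # target prefix aligned to gaps
    for i in range(1, rows):                # two nested loops: O(N^2) cells
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + score(source[i - 1], target[j - 1]),
                table[i - 1][j] - gap,      # gap in the target
                table[i][j - 1] - gap,      # gap in the source
            )
    return table[-1][-1]

def match_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0

print(needleman_wunsch_score("gattaca", "gattaca", match_score))
print(needleman_wunsch_score("ab", "b", match_score))
```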

Smith-Waterman Algorithm for Local Alignment

The time complexity of the Smith-Waterman algorithm is $O(N^2)$ for two sequences of length $N$.

  • Smith, T. & Waterman, M. Identification of common molecular subsequences. J. Mol. Biol., 1981, 147, 195-197.
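Local alignment fills the same kind of table, but cell values are never allowed to drop below zero and the result is the largest cell rather than the last one. The sketch below shows the textbook recurrence under the same simplifying assumptions as before (a linear gap penalty and a hypothetical match/mismatch score); it is not BALIGN's implementation.

```python
def smith_waterman_score(source, target, score, gap=1.0):
    """Best local alignment score with a linear gap penalty (simplification)."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                0.0,                        # restart the alignment here
                table[i - 1][j - 1] + score(source[i - 1], target[j - 1]),
                table[i - 1][j] - gap,
                table[i][j - 1] - gap,
            )
            best = max(best, table[i][j])   # local optimum may be anywhere
    return best

def match_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0

print(smith_waterman_score("xxabcxx", "yyabcyy", match_score))
```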

Oommen-Kashyap Algorithm for Transition Probability Computation

The time complexity of the Oommen-Kashyap algorithm is $O(N^3)$.

  • Oommen, B. & Kashyap, R. A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognition, Elsevier, 1998, 31, 1159-1177.
  • Aygün, E.; Oommen, B. & Cataltepe, Z. Peptide classification using optimal and information theoretic syntactic modeling. Pattern Recognition, Elsevier, 2010.
  • Aygün, E.; Oommen, B. & Cataltepe, Z. On utilizing optimal and information theoretic syntactic modeling for peptide classification. Pattern Recognition in Bioinformatics, IAPR, 2009.