Conversion of `maf` to `fasta` files

These scripts merge the sequence alignments of a maf (Multiple Alignment Format) file in a consistent and predictable way.

Required features are:

enforceable order of output sequences in the resulting fasta file (maf2fasta)
consistent padding with gaps for samples not included in an alignment block (maf2fasta)
checks for consistent length of the fasta output (concat_fastas)
checks for sequences consisting entirely of gaps (concat_fastas)
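
The two checks listed for concat_fastas can be illustrated with a minimal sketch. The function name `check_alignment` and its signature are hypothetical and not taken from the scripts; the sketch only assumes that an alignment is held as a mapping of sample name to sequence.

```python
def check_alignment(seqs, gap="-"):
    """Validate a mapping of sample name -> aligned sequence.

    Hypothetical helper, not part of the actual scripts.
    Raises ValueError if the sequences differ in length and returns
    the names of all sequences that consist entirely of gap characters.
    """
    lengths = {name: len(seq) for name, seq in seqs.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"inconsistent sequence lengths: {lengths}")
    # A non-empty sequence is "gaps only" if stripping the gap
    # character from both ends leaves nothing behind.
    return [name for name, seq in seqs.items() if seq and not seq.strip(gap)]
```

Called on `{"A": "ACGT", "D": "----"}`, this would report `D` as a gaps-only sequence, while sequences of unequal length raise an error.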

The desired output is (potentially) concatenated at two levels:

within each maf -> fasta conversion (all sequences with the same header are concatenated)
combining the results of multiple conversions (combining mafs from different scaffolds in the reference genome)
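
The first level of concatenation, together with the gap padding and the enforced sample order, can be sketched as follows. This is an illustration under assumptions, not the code of maf2fasta: the name `concat_blocks` is hypothetical, and each alignment block is assumed to be available as a dict of sample name to aligned sequence.

```python
def concat_blocks(blocks, sample_order, gap="-"):
    """Concatenate alignment blocks per sample (hypothetical helper).

    blocks: list of dicts mapping sample name -> aligned sequence; all
            sequences within one block are assumed to share one length.
    Samples missing from a block are padded with gaps of the block width,
    and the returned dict follows sample_order.
    """
    merged = {sample: [] for sample in sample_order}
    for block in blocks:
        if not block:
            continue
        width = len(next(iter(block.values())))
        for sample in sample_order:
            merged[sample].append(block.get(sample, gap * width))
    return {sample: "".join(parts) for sample, parts in merged.items()}
```

Because every sample receives exactly one piece per block, all output sequences end up with the same length, and a sample that never occurs becomes a sequence of gaps only.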

Dependencies

The Python scripts require Biopython (e.g. as specified in the conda environment in envs/biopython.yml).

Intersect `maf` and `bed` file

The purpose of the intersect_maf_bed script is to clip a sequence alignment in maf format to the regions specified in a bed file.

Note that overlapping bed entries are merged before the intersection with the maf file.

The clipped alignment is then re-exported as a maf file, for example:

./intersect_maf_bed \
  -m tests/maf/test3.maf \
  -b tests/bed/A.bed \
  -r A

##maf version=1 program=intersect_maf_bed

a
s A.Chromosome1  15 30 + 275 TACGTACGTACGTACGATTTACGTAACGTT
s B.Chr2         18 30 + 175 TACGTACGTACGTACGATTTACGTAACGTT
s C.chrA        135 25 + 375 -----ACGTACGTAGGATTTATGTAACGTT

a
s A.Chromosome1 110 15 +  275 GTACGTACGTACGTA
s B.Chr2         56 15 +  175 CTACGTACGTACGTA
s C.chrA        650  5 + 3375 ----------ACGTA
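
The merging of overlapping bed entries prior to the intersection can be sketched like this. The helper names `merge_intervals` and `clip_to_block` are hypothetical, intervals are taken as (start, end) pairs in the 0-based, half-open convention of the bed format, and treating directly adjacent entries as mergeable is an assumption of this sketch rather than documented behaviour of intersect_maf_bed.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals (hypothetical helper).

    Assumption: entries that merely touch (end == next start) are merged too.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def clip_to_block(block_start, block_end, merged):
    """Return the parts of the merged intervals that fall inside one block."""
    clipped = []
    for start, end in merged:
        lo, hi = max(start, block_start), min(end, block_end)
        if lo < hi:
            clipped.append((lo, hi))
    return clipped
```

Translating the clipped reference coordinates into alignment columns additionally requires accounting for gaps in the reference sequence, which this sketch leaves out.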

Convert `maf` file to multi-sample `fasta` file

./maf2fasta \
  --maf tests/maf/test1.maf \
  -s A,C,B,D

>A
GTACGTACGTACGTACGTACGATTTACGTAACGTTACGTACGTACGTACGTACGT
>C
----------ACGTACGTAGGATTTATGTAACGTTACGTACGAC-----------
>B
CTACGTACGTACGTACGTACGATTTACGTAACGTTACGTACGTACGTACGTTCGT
>D
-------------------------------------------------------

Note that the order of the output fasta sequences is enforced with the --sample-order flag (-s in the example above). Also, a sample ID that does not occur in the maf file produces an entirely blank sequence for that ID (consisting only of - characters).
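
To make the link between maf sequence lines and sample IDs concrete, here is a small sketch that parses a single `s` line. It relies on the standard maf field layout (source, start, size, strand, source size, text). Deriving the sample ID from the part of the source name before the first dot is an assumption based on the examples in this readme (e.g. `A.Chromosome1` and `>A`); `MafSeq` and `parse_s_line` are hypothetical names.

```python
from typing import NamedTuple


class MafSeq(NamedTuple):
    sample: str    # assumed: part of the source name before the first "."
    contig: str    # remainder of the source name
    start: int
    size: int      # number of non-gap characters in text
    strand: str
    src_size: int
    text: str      # aligned sequence including gaps


def parse_s_line(line):
    """Parse one maf 's' line into a MafSeq (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError(f"not a maf sequence line: {line!r}")
    _, src, start, size, strand, src_size, text = fields
    sample, _, contig = src.partition(".")
    return MafSeq(sample, contig, int(start), int(size), strand, int(src_size), text)
```

A sample listed in the sample order but never returned by such a parser for any block would then end up as a gaps-only sequence.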

Concatenate several multi-sample `fasta` files

The main purpose of the concat_fastas script is to concatenate several multi-sample fasta files. This happens on a sample-by-sample basis:

./concat_fastas \
  tests/fa/test1.fa tests/fa/test2.fa \
  -s A,C,B,Y | \
  fold -w 45

>A
GTACGTACGTACGTACGTACGATTTACGTAACGTTACGTACGTAC
GTACGTACGTATCAGTCAGCAGTGTAGCTGTGTGTGCATGCATGC
>C
----------ACGTACGTAGGATTTATGTAACGTTACGTACGAC-
----------ATTAG-----AG---AGCTCTGA-----TGCAAGC
>B
CTACGTACGTACGTACGTACGATTTACGTAACGTTACGTACGTAC
GTACGTTCGTATTATTTAGC--TGA----------GGATGCATGG

Warning: The following sample(s) were dropped from the output: [ D, E, Y ]
(either because they are missing from --sample-order, or from the input fasta file(s))

If a fasta file contains multiple sequences with the same name, they are merged in order of appearance.

./concat_fastas \
  tests/fa/test1_split1.fa \
  -s A

Warning: The following sample(s) were dropped from the output: [ B, C, D, E ]
(either because they are missing from --sample-order, or from the input fasta file(s))
>A
GTACGTACGTACGTACGTACGATTTACGTAACGTTACGTACGTACGTACGTACGT
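
The sample-wise concatenation and the merging of repeated names can be sketched in plain Python. This is not the implementation of concat_fastas: the function names are hypothetical, the input is taken as fasta text rather than file paths, and keeping only samples that occur in every input is an assumption of this sketch.

```python
def read_fasta(text):
    """Parse fasta text into {name: sequence}.

    Records that share a name are merged in order of appearance.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records.setdefault(name, [])
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def concat_fasta_texts(texts, sample_order):
    """Concatenate several multi-sample fastas sample by sample.

    Returns (merged, dropped): merged follows sample_order, dropped lists
    samples missing from sample_order or (assumption) from any input.
    """
    parsed = [read_fasta(text) for text in texts]
    seen = set().union(*parsed)
    kept = [s for s in sample_order if parsed and all(s in p for p in parsed)]
    dropped = sorted((seen | set(sample_order)) - set(kept))
    merged = {s: "".join(p[s] for p in parsed) for s in kept}
    return merged, dropped
```

The `dropped` list corresponds to the kind of information given in the warning shown above.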

The script can also be used to create a summary of the resulting fasta file (e.g. to check the gap content per sequence).

./concat_fastas \
  tests/fa/test1.fa tests/fa/test2.fa \
  -s A,C,B,D \
  -o /dev/null \
  --keep-gaps-only \
  --base-report

Warning: The following sample(s) were dropped from the output: [ E ]
(either because they are missing from --sample-order, or from the input fasta file(s))
Info: The following sequence(s) contain only gaps: [ D ]
# Summary of gaps and bases counts for each sample:
# sample        gaps    gaps%   A       C       G       T       N       n
# A     0       0.0     21      19      24      26      0       90
# C     34      37.8    17      10      14      15      0       90
# B     12      13.3    19      15      19      25      0       90
# D     90      100.0   0       0       0       0       0       90
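
The columns of this report can be reproduced with a few lines of Python. The sketch below is an illustration only: `base_report` is a hypothetical name, and counting bases case-insensitively is an assumption rather than confirmed behaviour of --base-report.

```python
from collections import Counter


def base_report(seqs, gap="-"):
    """Return one row per sample: (sample, gaps, gaps%, A, C, G, T, N, n).

    Hypothetical helper; n is the total sequence length including gaps.
    """
    rows = []
    for name, seq in seqs.items():
        counts = Counter(seq.upper())  # assumption: case-insensitive counts
        n = len(seq)
        gaps = counts[gap]
        gap_pct = round(100 * gaps / n, 1) if n else 0.0
        rows.append((name, gaps, gap_pct,
                     counts["A"], counts["C"], counts["G"], counts["T"],
                     counts["N"], n))
    return rows
```

With the numbers from the report above, sample C has 34 gaps in 90 columns, which gives 37.8 percent.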
