This repository contains several scripts I have used during my research to manipulate fasta files using BioPython. The tools in this list include:
- Fasta extractor: Extract fasta sequences from a multifasta using a list, or extract all at once. This can also be used to transform fasta format between formats, including fasta, fasta-2line, tab, and pir.
- update_descriptions.py: Take the fasta descriptions from one multifasta and apply them to fasta sequences of the same name from another. Useful for transferring annotations between fasta files where the names may be identical, but the data is not (for instance, between protein and CDS sequences).
- find_duplicates.py: This Python script uses BioPython to parse through a multifasta and print sequence IDs of sequences with duplicated names.
The only external requirement that these scripts need is BioPython. This can be installed using pip.
pip install -r requirements.txt
Fasta Extractor is a straightforward Python script for extracting fasta sequences from a multifasta file using a list of sequence names. Fasta extractor uses Argparse and BioPython to parse input files and fasta sequences.
usage: Fasta-Extractor [-h] [-i] -f -o [--format] [--quiet]
Fasta-Extractor BioPython to extract fasta sequences from a multifasta using a list of sequence names
options:
-h, --help show this help message and exit
-i , --input Input file containing sequence list. Sequence names must be one name per line.
If no file is entered, then the entire fasta file will be extracted.
-f , --fasta Input fasta file
-o , --out Output file name.
--format Output format. Options: 'fasta', 'fasta-2line', 'pir', 'tab'. Default = fasta
--quiet Silence print messages. Default = false
Thank you for using Fasta-Extractor. For more details, please visit the GitHub repository at https://github.com/Lachiemckbioinfo/fasta_extractor
Fasta Extractor has three required commands: input file (your gene list, -i/--input
), fasta file (where you are extracting sequences from, -f/--fasta
), and output file (where fasta sequences are being saved to, -o/--out
).
An example usage of Fasta Extractor would look like this:
python fasta_extractor.py --input genelist.txt --fasta sequences.fa --out output_sequences.fa
Fasta Extractor can be run in quiet mode to prevent printing to stdout by using the --quiet
argument.
The output format can be changed using the --format
argument. Choices include FASTA, PIR and tab formats. For example:
python fasta-extractor.py --input list.txt --fasta sequences.fa --out output_file.fasta --format fasta
This script is used to take the fasta descriptions from one file, and apply them to sequences of the same name in another file. I generated this tool as I had protein sequences with annotations, but my CDS descriptions did not have those annotations. Thus, this tool was made to take the protein descriptions and apply them to the CDS fastas of the same name.
This script uses simple positional arguments to run. Three positional arguments are used. These are (in order): template fasta (which has the descriptions), query fasta (which doesn't have the descriptions), and output fasta. Thus, to use this script, use the following code:
python update-descriptions.py template-fasta.fa query-fasta.fa output.fa
As it runs, this script also generates a logfile. This logs all the sequences in the query fasta that were not updated, typically due to them not being in the template, or their names not matching.
This script uses a single positional argument, which is used to select the fasta file to be searched. This script can be used using the following code:
python find_duplicates.py <fasta_file>