Filter the NCBI nr FASTA database on any arbitrary taxonomy or source database.
Download the latest release of filterNRFASTA.jar from the releases page:
Java 11 or higher. Download from
Only needs to be done periodically...
Download NCBI's nr FASTA: You do not need to uncompress it.
Download NCBI's accession to taxonomy mapping: You do not need to uncompress it.
Download NCBI's taxonomy hierarchy file (new_taxdump), as either a .tar.gz or a .zip archive.
- Uncompress with tar -xvzf new_taxdump.tar.gz or unzip new_taxdump.zip
- Only the nodes.dmp file is needed; the rest can be deleted.
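Since only nodes.dmp is needed, it can be pulled out of the tarball on its own rather than unpacking everything. This is a minimal sketch, not part of filterNRFASTA: the function name extract_nodes_dmp is hypothetical, and it assumes nodes.dmp sits at the top level of the archive.

```shell
# Extract only nodes.dmp from a gzipped taxdump tarball.
# Hypothetical helper; assumes nodes.dmp is stored at the archive root.
extract_nodes_dmp() {
  archive="$1"   # path to new_taxdump.tar.gz
  dest_dir="$2"  # directory that will receive nodes.dmp
  mkdir -p "$dest_dir"
  # Naming the member after the archive restricts extraction to that file.
  tar -xzf "$archive" -C "$dest_dir" nodes.dmp
}
```

For example, extract_nodes_dmp new_taxdump.tar.gz ../taxonomy_dump would leave just ../taxonomy_dump/nodes.dmp, matching the path used in the examples further down.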
Note: This program can consume a lot of RAM, especially when filtering on very general taxa such as bacteria (NCBI taxonomy ID 2). If Java runs out of heap space while the program is running, raise the maximum heap size with the -Xmx parameter, as shown in the bacteria example below.
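To get a feel for how broad a taxon is before filtering on it, the taxa beneath an ID can be listed directly from nodes.dmp. The sketch below is not part of filterNRFASTA and says nothing about how the program itself walks the hierarchy. The function name list_descendants is hypothetical, and it assumes the standard taxdump layout: fields separated by tab, pipe, tab, with the taxonomy ID in the first field and its parent ID in the second.

```shell
# Print the given taxonomy ID and every descendant found in nodes.dmp.
# Hypothetical helper; assumes fields are separated by "\t|\t" and that
# the only self-parented node is the root of the taxonomy.
list_descendants() {
  nodes_file="$1"
  root_id="$2"
  awk -F'\t[|]\t' -v root="$root_id" '
    { parent[$1] = $2 }            # remember each node and its parent
    END {
      for (id in parent) {
        cur = id
        # Walk up towards the root of the tree. Stop when the target ID is
        # reached, at a self-parented node, or at an ID missing from the file.
        while (1) {
          if (cur == root) { print id; break }
          if (!(cur in parent) || parent[cur] == cur) break
          cur = parent[cur]
        }
      }
    }
  ' "$nodes_file" | sort -n
}
```

Piping the result through wc -l, as in list_descendants ../taxonomy_dump/nodes.dmp 2 | wc -l, gives the number of taxa under bacteria. A large count is a hint that the run may need a larger -Xmx value.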
You will need to know the NCBI taxonomy ID(s) you want to filter on. You can look them up at
java -jar filterNRFASTA.jar
java -jar filterNRFASTA.jar -h
Outputs only sequences and accessions corresponding to human (NCBI taxonomy: 9606). The result is redirected to an output file.
java -jar filterNRFASTA.jar \
  -t 9606 \
  -a ../tax-dump-old/prot.accession2taxid.gz \
  -n ../taxonomy_dump/nodes.dmp \
  -f ../sequences/nr.gz >
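Includes sequences and accessions corresponding to any bacteria (NCBI taxonomy: 2). The -d options name the source databases to filter on, here refseq and uniprot. The result is redirected to an output file.
This command also allows the Java process up to 64 GB of heap via -Xmx64g.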
java -Xmx64g -jar filterNRFASTA.jar \
  -t 2 \
  -d refseq \
  -d uniprot \
  -a ../tax-dump-old/prot.accession2taxid.gz \
  -n ../taxonomy_dump/nodes.dmp \
  -f ../sequences/nr.gz >
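After a run, a quick sanity check is to count how many sequences ended up in the output. The sketch below is an assumption-laden helper, not part of filterNRFASTA: the name count_fasta_records is hypothetical, and it assumes the redirected output is an uncompressed FASTA file in which every record starts with a ">" header line.

```shell
# Count FASTA records by counting header lines (lines starting with ">").
# Hypothetical helper; assumes the file is plain, uncompressed FASTA.
count_fasta_records() {
  fasta_file="$1"
  grep -c '^>' "$fasta_file"
}
```

A count of 0 usually means the taxonomy ID or database filter matched nothing, which is worth checking before passing the file to downstream tools.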