A fast FASTQ filter program.
Fastq-filter correctly takes into account that quality scores are log scores when calculating the mean. It also provides an option to filter on average error rate directly.
A phred score of Q=30 stands for an error rate of 0.001, Q=20 for 0.01 and Q=10 for 0.1. This is not very intuitive: Q=20 has 10 times more errors than Q=30, though the numbers 20 and 30 do little to convey this difference. Using 0.01 and 0.001 correctly conveys that these error rates are an order of magnitude apart. It also means that phred scores cannot be naively averaged. Q=10 and Q=30 do not average to Q=20. The actual average error rate is (0.001 + 0.1) / 2 = 0.0505, roughly 1 in 20, which corresponds to a phred score of about 13. Q=20 means 0.01, or 1 in 100. Naive averaging therefore overestimates the quality by a factor of 5! This means any tool that averages naively is unusable in practice.
Unfortunately many tools do this. fastq-filter was written to provide a very fast filtering solution so the correct filtering can be applied at a very low cost.
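The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch, not code taken from fastq-filter; the function names are made up for this example.

```python
import math


def phred_to_error_rate(score):
    """Convert a phred score to an error probability: p = 10^(-Q/10)."""
    return 10 ** (-score / 10)


def error_rate_to_phred(error_rate):
    """Convert an error probability back to a phred score."""
    return -10 * math.log10(error_rate)


def naive_mean_phred(scores):
    """The incorrect approach: average the log-scaled scores directly."""
    return sum(scores) / len(scores)


def mean_error_rate(scores):
    """The correct approach: average the error probabilities."""
    return sum(phred_to_error_rate(q) for q in scores) / len(scores)


scores = [10, 30]
print(naive_mean_phred(scores))                                # 20.0
print(round(mean_error_rate(scores), 4))                       # 0.0505
print(round(error_rate_to_phred(mean_error_rate(scores)), 2))  # 12.97
```

The naive mean reports Q=20, while the mean error rate of 0.0505 corresponds to a phred score just below 13.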
- With pip:
pip install fastq-filter
- For the latest development version:
pip install git+https://github.com/LUMC/fastq-filter
- With conda:
conda install -c conda-forge -c bioconda fastq-filter
Single fastq files can be filtered with:
fastq-filter -e 0.001 -o output.fastq input.fastq
Multiple fastq files can be filtered with:
fastq-filter -e 0.001 -o r1_filtered.fastq.gz -o r2_filtered.fastq.gz r1.fastq.gz r2.fastq.gz
Fastq-filter ensures the output is in sync. It is not limited to two inputs, so R1.fq, R2.fq and R3.fq can also be filtered together.
In the following section, 'pair' denotes a group of 2 or more FASTQ records that are evaluated together. When multiple FASTQ files are given the filters behave as follows:
- Average error rate: The average error rate is calculated over the combined phred scores of all records in the pair.
- Median quality: The median of the combined phred scores is used.
- Minimum length: At least one of the records in the pair must meet the minimum length.
- Maximum length: None of the records in the pair may exceed the maximum length.
The rationale for the length filters is that R1 and R2 both sequence the same molecule and the canonical length is the longest of both.
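The pair rules above can be sketched in Python as follows. This is a hypothetical illustration, not fastq-filter's implementation: records are represented as `(sequence, qualities)` string tuples, phred+33 encoding is assumed, and all function names are invented for the example.

```python
import statistics

PHRED_OFFSET = 33  # assumption: qualities are phred+33 encoded


def combined_phred_scores(records):
    """Pool the phred scores of all records in the pair into one list."""
    return [ord(char) - PHRED_OFFSET
            for _, qualities in records
            for char in qualities]


def average_error_rate(records):
    """Mean error probability over the combined phred scores."""
    scores = combined_phred_scores(records)
    return sum(10 ** (-q / 10) for q in scores) / len(scores)


def median_quality(records):
    """Median of the combined phred scores."""
    return statistics.median(combined_phred_scores(records))


def passes_min_length(records, min_length):
    """At least one record in the pair must meet the minimum length."""
    return any(len(sequence) >= min_length for sequence, _ in records)


def passes_max_length(records, max_length):
    """No record in the pair may exceed the maximum length."""
    return all(len(sequence) <= max_length for sequence, _ in records)


# 'I' encodes Q=40 and '#' encodes Q=2 in phred+33.
pair = [("ACGTACGT", "IIIIIIII"), ("ACG", "###")]
print(passes_min_length(pair, 5))  # True: the first record has length 8
print(passes_max_length(pair, 6))  # False: the first record exceeds 6
print(median_quality(pair))        # 40
```

Because the scores are pooled before the statistic is taken, a short low-quality mate has less influence than it would if each record were evaluated separately.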
usage: fastq-filter [-h] [-o OUTPUT] [-l MIN_LENGTH] [-L MAX_LENGTH]
                    [-e AVERAGE_ERROR_RATE] [-q MEAN_QUALITY]
                    [-Q MEDIAN_QUALITY] [-c COMPRESSION_LEVEL] [--verbose]
                    [--quiet]
                    input [input ...]

Filter FASTQ files on various metrics.

positional arguments:
  input                 Input FASTQ files. Compression format automatically
                        detected. Use - for stdin.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output FASTQ files. Compression format automatically
                        determined by file extension. Flag can be used
                        multiple times. An output must be given for each
                        input. Default: stdout.
  -l MIN_LENGTH, --min-length MIN_LENGTH
                        The minimum length for a read.
  -L MAX_LENGTH, --max-length MAX_LENGTH
                        The maximum length for a read.
  -e AVERAGE_ERROR_RATE, --average-error-rate AVERAGE_ERROR_RATE
                        The minimum average per base error rate.
  -q MEAN_QUALITY, --mean-quality MEAN_QUALITY
                        Average quality. Same as the '--average-error-rate'
                        option but specified with a phred score. I.e '-q 30'
                        is equivalent to '-e 0.001'.
  -Q MEDIAN_QUALITY, --median-quality MEDIAN_QUALITY
                        The minimum median phred score.
  -c COMPRESSION_LEVEL, --compression-level COMPRESSION_LEVEL
                        Compression level for the output files. Relevant when
                        output files have a .gz extension. Default: 2
  --verbose             Report stats on individual filters.
  --quiet               Turn of logging output.
fastq-filter uses the following optimizations to be fast:
- Multiple filters can be applied simultaneously to minimize IO.
- fastq-filter can be used in pipes to minimize IO.
- Python's built-in filter function is used, which is a shorthand for a loop that would otherwise be executed as interpreted Python code.
- The mean and median quality algorithms are implemented in C with bindings to Python.
- The mean quality algorithm uses a lookup table since there are only 94 possible phred scores (0 through 93) encoded in FASTQ. This avoids repeated power calculations when converting scores to error probabilities.
- The median quality algorithm implements a counting sort, which is really fast but not applicable for generic data. Since FASTQ qualities are uniquely suited for a counting sort, median calculation can be performed very quickly.
- dnaio is used as the FASTQ parser. It parses FASTQ files with a parser written in Cython.
- xopen is used to read and write files. This allows for support of gzip compressed files which are opened using python-isal which reads gzip files 2 times faster and writes gzip files 5 times faster than the python gzip module implementation.
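The lookup table and counting sort ideas from the list above can be sketched in pure Python. The real algorithms are implemented in C, so this is only an illustration under stated assumptions: phred+33 encoding, where the printable ASCII codes 33 to 126 map to scores 0 through 93, and an even-length median computed as the mean of the two middle scores. All names are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: phred+33 encoding
MAX_PHRED = 93     # ASCII 126 ('~') minus the offset

# Precompute the error probability for every possible score once,
# so no power calculation is needed per base.
ERROR_RATE_TABLE = [10 ** (-score / 10) for score in range(MAX_PHRED + 1)]


def phred_scores(qualities: bytes):
    """Yield validated phred scores from a FASTQ quality string."""
    if not qualities:
        raise ValueError("empty quality string")
    for code in qualities:
        score = code - PHRED_OFFSET
        if not 0 <= score <= MAX_PHRED:
            raise ValueError(f"invalid quality character code: {code}")
        yield score


def average_error_rate(qualities: bytes) -> float:
    """Mean error probability using table lookups only."""
    total = sum(ERROR_RATE_TABLE[score] for score in phred_scores(qualities))
    return total / len(qualities)


def median_phred(qualities: bytes) -> float:
    """Median via counting sort: build a histogram, then walk it."""
    counts = [0] * (MAX_PHRED + 1)
    for score in phred_scores(qualities):
        counts[score] += 1
    n = len(qualities)
    lower_rank = (n - 1) // 2  # 0-based rank of the lower middle value
    upper_rank = n // 2        # 0-based rank of the upper middle value
    seen = 0
    lower = None
    for score, count in enumerate(counts):
        seen += count
        if lower is None and seen > lower_rank:
            lower = score
        if seen > upper_rank:
            return (lower + score) / 2
    raise AssertionError("unreachable for non-empty input")


# '+' encodes Q=10, '?' encodes Q=30, '#' encodes Q=2, 'I' encodes Q=40.
print(round(average_error_rate(b"+?"), 4))  # 0.0505
print(median_phred(b"###IIIIIIII"))         # 40.0
print(median_phred(b"##II"))                # 21.0
```

The histogram has a fixed size of 94 buckets regardless of read length, so the median is found in a single pass over the qualities plus a bounded walk over the buckets, without a comparison sort.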