fastq-filter

A fast FASTQ filter program.

Fastq-filter correctly takes into account that quality scores are log scores when calculating the mean. It also provides an option to filter on average error rate directly.

A phred score of Q=30 stands for an error rate of 0.001, Q=20 for 0.01 and Q=10 for 0.1. This is not very intuitive: Q=20 has 10 times more errors than Q=30, though the numbers 20 and 30 do little to convey this difference. Writing 0.01 and 0.001 makes it clear that these error rates are an order of magnitude apart. It also means that phred scores cannot be naively averaged. Q=10 and Q=30 do not average to Q=20. The actual average error rate is (0.001 + 0.1) / 2 = 0.0505, roughly 1 in 20, whereas Q=20 means 0.01, or 1 in 100. Naive averaging therefore overestimates the quality by a factor of 5, so a filter based on the naive mean can let through reads that are considerably worse than the threshold suggests.
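The arithmetic above can be checked with a short Python sketch. The function names below are illustrative only and are not part of the fastq-filter API; the conversion used is the standard phred definition, p = 10^(-Q/10).

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a phred score to an error probability: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert an error probability back to a phred score."""
    return -10 * math.log10(p)


def naive_mean_phred(scores):
    """Arithmetic mean of the phred scores (the incorrect approach)."""
    return sum(scores) / len(scores)


def true_mean_phred(scores):
    """Average the error rates first, then convert back to a phred score."""
    mean_error = sum(phred_to_error_rate(q) for q in scores) / len(scores)
    return error_rate_to_phred(mean_error)


scores = [10, 30]
naive = naive_mean_phred(scores)    # 20.0
correct = true_mean_phred(scores)   # about 12.97
print(f"naive mean: Q={naive:.2f}, error-rate-based mean: Q={correct:.2f}")
```

The error-rate-based mean of Q=10 and Q=30 is close to Q=13, well below the Q=20 that naive averaging reports.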

Unfortunately, many tools average phred scores naively. fastq-filter was written to provide a very fast filtering solution, so that correct filtering can be applied at very low cost.

Installation

  • With pip: pip install fastq-filter
  • For the latest development version: pip install git+https://github.com/LUMC/fastq-filter
  • With conda: conda install -c conda-forge -c bioconda fastq-filter

Usage

Single fastq files can be filtered with:

fastq-filter -e 0.001 -o output.fastq input.fastq

Multiple fastq files can be filtered with:

fastq-filter -e 0.001 -o r1_filtered.fastq.gz -o r2_filtered.fastq.gz r1.fastq.gz r2.fastq.gz

Fastq-filter ensures the output is in sync. It is not limited to two inputs, so R1.fq, R2.fq and R3.fq can also be filtered together.

In the following section 'pair' denotes a group of 2 or more FASTQ records that are evaluated together. When multiple FASTQ files are given the filters behave as follows:

  • Average error rate: The average of the combined phred scores is used.
  • Median quality: The median of the combined phred scores is used.
  • Minimum length: At least one of the records of the pair must meet the minimum length.
  • Maximum length: No record in the pair may exceed the maximum length.

The rationale for the length filters is that R1 and R2 both sequence the same molecule, and the canonical length is that of the longest record.
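The pair rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not fastq-filter's actual code: the function names are hypothetical, qualities are assumed to be phred+33 encoded strings, and "combined" is assumed to mean that the quality values of all records in the pair are pooled before averaging their error rates.

```python
from typing import Sequence


def passes_min_length(sequences: Sequence[str], min_length: int) -> bool:
    """At least one record of the pair must meet the minimum length."""
    return any(len(seq) >= min_length for seq in sequences)


def passes_max_length(sequences: Sequence[str], max_length: int) -> bool:
    """No record of the pair may exceed the maximum length."""
    return all(len(seq) <= max_length for seq in sequences)


def combined_average_error_rate(qualities: Sequence[str],
                                phred_offset: int = 33) -> float:
    """Average error rate over the pooled quality values of all records.

    Assumption: each quality character encodes phred + phred_offset.
    """
    total = 0.0
    count = 0
    for quality in qualities:
        for char in quality:
            total += 10 ** (-(ord(char) - phred_offset) / 10)
            count += 1
    if count == 0:
        raise ValueError("no quality values given")
    return total / count


r1, r2 = "ACGTACGT", "ACG"
print(passes_min_length([r1, r2], 5))   # R1 is long enough, so the pair passes
print(passes_max_length([r1, r2], 7))   # R1 is too long, so the pair fails
```

Both length checks amount to comparing the threshold against the longest record of the pair, which matches the rationale given above.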

usage: fastq-filter [-h] [-o OUTPUT] [-l MIN_LENGTH] [-L MAX_LENGTH]
                    [-e AVERAGE_ERROR_RATE] [-q MEAN_QUALITY]
                    [-Q MEDIAN_QUALITY] [-c COMPRESSION_LEVEL] [--verbose]
                    [--quiet]
                    input [input ...]

Filter FASTQ files on various metrics.

positional arguments:
  input                 Input FASTQ files. Compression format automatically
                        detected. Use - for stdin.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output FASTQ files. Compression format automatically
                        determined by file extension. Flag can be used
                        multiple times. An output must be given for each
                        input. Default: stdout.
  -l MIN_LENGTH, --min-length MIN_LENGTH
                        The minimum length for a read.
  -L MAX_LENGTH, --max-length MAX_LENGTH
                        The maximum length for a read.
  -e AVERAGE_ERROR_RATE, --average-error-rate AVERAGE_ERROR_RATE
                        The minimum average per base error rate.
  -q MEAN_QUALITY, --mean-quality MEAN_QUALITY
                        Average quality. Same as the '--average-error-rate'
                        option but specified with a phred score. I.e '-q 30'
                        is equivalent to '-e 0.001'.
  -Q MEDIAN_QUALITY, --median-quality MEDIAN_QUALITY
                        The minimum median phred score.
  -c COMPRESSION_LEVEL, --compression-level COMPRESSION_LEVEL
                        Compression level for the output files. Relevant when
                        output files have a .gz extension. Default: 2
  --verbose             Report stats on individual filters.
  --quiet               Turn off logging output.

Optimizations

fastq-filter uses the following optimizations to be fast:

  • Multiple filters can be applied simultaneously to minimize IO.
  • fastq-filter can be used in pipes to minimize IO.
  • The Python built-in filter function is used, which is shorthand for Python code that would otherwise need to be interpreted.
  • The mean and median quality algorithms are implemented in C with bindings to Python.
  • The mean quality algorithm uses a lookup table, since only 94 phred scores (0 through 93) can be encoded in FASTQ. That saves a lot of power calculations when computing the error probabilities.
  • The median quality algorithm implements a counting sort, which is really fast but not applicable for generic data. Since FASTQ qualities are uniquely suited for a counting sort, median calculation can be performed very quickly.
  • dnaio is used as FASTQ parser. This parses the FASTQ files with a parser written in Cython.
  • xopen is used to read and write files. It supports gzip compressed files, which are opened using python-isal. python-isal reads gzip files 2 times faster and writes gzip files 5 times faster than the Python gzip module implementation.
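The lookup table and counting sort ideas from the list above can be illustrated in pure Python. This is only a sketch: fastq-filter implements these algorithms in C, and the names, the phred+33 offset and the handling of even-length input (averaging the two middle scores) are assumptions made here for the example. Input bytes are assumed to be valid quality characters in the ASCII range 33 to 126.

```python
PHRED_OFFSET = 33  # assumption: phred+33 encoding
NUM_SCORES = 94    # ASCII 33..126 map to phred 0..93

# Precompute every possible error rate once, so no power calculation
# is needed per base.
ERROR_RATE_TABLE = [10 ** (-q / 10) for q in range(NUM_SCORES)]


def mean_error_rate(qualities: bytes) -> float:
    """Mean error rate using one table lookup per quality byte."""
    if not qualities:
        raise ValueError("empty quality string")
    total = 0.0
    for byte in qualities:
        total += ERROR_RATE_TABLE[byte - PHRED_OFFSET]
    return total / len(qualities)


def median_phred(qualities: bytes) -> float:
    """Median phred score via counting sort: count each score, then walk
    the counts until the middle rank(s) are reached."""
    if not qualities:
        raise ValueError("empty quality string")
    counts = [0] * NUM_SCORES
    for byte in qualities:
        counts[byte - PHRED_OFFSET] += 1
    n = len(qualities)
    lower_rank = (n - 1) // 2  # 0-based rank of the lower middle element
    upper_rank = n // 2        # equals lower_rank when n is odd
    lower = None
    upper = None
    seen = 0
    for score, count in enumerate(counts):
        seen += count
        if lower is None and seen > lower_rank:
            lower = score
        if seen > upper_rank:
            upper = score
            break
    return (lower + upper) / 2


# '+' is Q10, '5' is Q20, '?' is Q30, 'I' is Q40 in phred+33.
print(mean_error_rate(b"+?"))
print(median_phred(b"+5?I"))
```

The counting approach works because the scores come from a small, fixed range: the histogram has a constant size, so the median is found in a single pass over the data plus a walk over at most 94 counters, without a comparison sort.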