zfq: FastQ file compressor

Description

zfq is a lossless compression/decompression wrapper for FastQ files.

Key features:

Universal:
- zfq is based on a general-purpose compression algorithm. This somewhat reduces its compression performance, but means it can handle any FastQ (no character limit in sequences, headers or qualities).
- No reference genome is required.
Robust:
- Wraps well-maintained and well-tested standard software: zstd.
- Decompression result is automatically tested after each compression.
- md5sum of the original file is stored to be automatically tested after each decompression.
- No failures on the benchmark dataset of 645 files from Element Biosciences, Illumina and PacBio instruments.
Efficient:
- Compression rate is better than the widely used gzip (up to 2 times better than gzip best), similar to zstd in ultra mode, and lower than sequence compression algorithms like quip (up to 2 times lower).
User-friendly:
- Gzipped FastQ files can be taken directly as input or written directly as output.
- zfq info instantly provides the number of sequences and nucleotides stored in the file.

Installation

Requirements:

python (>=3.7)
zstd (>=1.4.4), a fast lossless compression algorithm developed by Meta.

Build: python -m pip install zfq

Usage example

Compress fastq(.gz) file:

Command: zfq.py compress -t 2 -i SRR.fastq.gz -o SRR.fastq.zfq -r

Options in example:

-r/--remove: remove the input file after compression.
-t/--threads: number of compression threads.

STDERR:

zfq.py compress -i SRR.fastq.gz -o SRR.fastq.zfq
2023-09-05 11:48:20,483 -- [zfq.py][pid:3163205][INFO] -- Command: zfq.py compress -t 2 -i SRR.fastq.gz -o SRR.fastq.zfq
2023-09-05 11:48:21,341 -- [zfq.py][pid:3163205][INFO] -- End of job

Get information from original file:

Command: zfq.py info -i SRR.fastq.zfq

STDOUT:

{"seq": 14615, "nt": 1865822, "md5": "c1f5e805b3a076d5c58fa206f2c30ac5", "mtime": 1693907258.333263}
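The JSON line above can be consumed directly by a script. The following is a minimal sketch, assuming that "mtime" is a POSIX timestamp in seconds and that "seq" and "nt" are the read and nucleotide counts; the variable names are illustrative and not part of zfq.

```python
import json
from datetime import datetime, timezone

# Output of `zfq.py info`, copied from the example above.
info_line = (
    '{"seq": 14615, "nt": 1865822, '
    '"md5": "c1f5e805b3a076d5c58fa206f2c30ac5", "mtime": 1693907258.333263}'
)

info = json.loads(info_line)

# Mean read length derived from the two counters.
mean_length = info["nt"] / info["seq"]

# Assumption: mtime is seconds since the Unix epoch.
modified = datetime.fromtimestamp(info["mtime"], tz=timezone.utc)

print(f"reads: {info['seq']}")
print(f"mean read length: {mean_length:.2f}")
print(f"original file modified (UTC): {modified.isoformat()}")
```

Because the counters are stored in the archive, this kind of summary does not require decompressing the sequences.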

Convert zfq to fastq.gz:

Command: zfq.py uncompress -i SRR.fastq.zfq -o SRR2.fastq.gz -r

Option in example:

-r/--remove: remove the zfq file after decompression.

STDERR:

2023-09-05 11:55:42,348 -- [zfq.py][pid:3164218][INFO] -- Command: zfq.py uncompress -i SRR.fastq.zfq -o SRR2.fastq.gz -r
2023-09-05 11:55:44,066 -- [zfq.py][pid:3164218][INFO] -- End of job

How it works

Compress:

Write number of reads, nucleotides, modification time and original md5sum in info file.
Split FastQ into three parts: headers, sequences (no new line) and qualities.
Compress each part with zstd.
Store all compressed files and input info in a tar archive.
Apply the original modification time to archive.
Decompress the file into a temporary file to compare md5sum of the original file and decompressed file.
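The compression steps above can be sketched in Python as follows. This is a minimal illustration, not zfq's actual code: the function names `split_fastq` and `build_archive`, the member names inside the archive, and the use of `zlib` as a stand-in for zstd (to keep the example standard-library only) are all assumptions. It also assumes four-line records with a bare `+` separator line, and it leaves out the modification-time handling and the final round-trip check.

```python
import hashlib
import io
import json
import tarfile
import zlib


def split_fastq(data: bytes):
    """Split FastQ bytes into headers, concatenated sequences and qualities."""
    lines = data.splitlines()
    headers = lines[0::4]
    sequences = lines[1::4]
    qualities = lines[3::4]  # lines[2::4] are the '+' separators, dropped here
    info = {
        "seq": len(sequences),
        "nt": sum(len(sequence) for sequence in sequences),
        "md5": hashlib.md5(data).hexdigest(),
    }
    parts = {
        "headers": b"\n".join(headers) + b"\n",
        "sequences": b"".join(sequences),  # no new line between reads
        "qualities": b"\n".join(qualities) + b"\n",
    }
    return info, parts


def build_archive(data: bytes, compress=zlib.compress) -> bytes:
    """Compress each part separately and store everything in a tar archive."""
    info, parts = split_fastq(data)
    # Hypothetical member names; zlib stands in for zstd.
    members = {name + ".z": compress(blob) for name, blob in parts.items()}
    members["info.json"] = json.dumps(info).encode()
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, blob in members.items():
            entry = tarfile.TarInfo(name)
            entry.size = len(blob)
            archive.addfile(entry, io.BytesIO(blob))
    return buffer.getvalue()


example = b"@read1\nACGT\n+\nIIII\n@read2\nAC\n+\nII\n"
info, parts = split_fastq(example)
print(info["seq"], info["nt"], parts["sequences"])
```

Storing sequences without line breaks is possible because the quality lines keep the per-read lengths, which is what the decompression step relies on.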

Decompress:

Extract files.
Decompress with zstd.
Merge each part (sequences are split according to quality length).
Apply original modification time to decompressed file.
Compare md5sum of the original file (from info file) and decompressed file.
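The merge and verification steps can be illustrated with a short Python sketch. It is not zfq's implementation: `merge_fastq` and `check_md5` are hypothetical names, the three parts are assumed to be already decompressed, and the `+` separator line is assumed to be bare. The point it shows is how the length of each quality line is enough to cut the concatenated sequences back into reads.

```python
import hashlib


def merge_fastq(headers: bytes, sequences: bytes, qualities: bytes) -> bytes:
    """Rebuild FastQ records, cutting sequences by the length of each quality line."""
    records = []
    offset = 0
    for header, quality in zip(headers.splitlines(), qualities.splitlines()):
        sequence = sequences[offset:offset + len(quality)]
        offset += len(quality)
        records.append(b"\n".join([header, sequence, b"+", quality]) + b"\n")
    if offset != len(sequences):
        raise ValueError("sequence and quality lengths disagree")
    return b"".join(records)


def check_md5(data: bytes, expected: str) -> None:
    """Raise if the rebuilt file does not match the stored md5sum."""
    found = hashlib.md5(data).hexdigest()
    if found != expected:
        raise ValueError(f"md5 mismatch: expected {expected}, found {found}")


original = b"@read1\nACGT\n+\nIIII\n@read2\nAC\n+\nII\n"
rebuilt = merge_fastq(b"@read1\n@read2\n", b"ACGTAC", b"IIII\nII\n")
check_md5(rebuilt, hashlib.md5(original).hexdigest())
print(rebuilt == original)
```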

Benchmarks

Software

Text compression:
- gzip (https://www.gnu.org/software/gzip/) in two modes: default and best.
- zstd (https://github.com/facebook/zstd) in two modes: 13 and ultra 22.
Sequences compression:
- lfastqc (https://github.uconn.edu/sya12005/LFastqC)
- lfqc (https://github.com/mariusmni/lfqc)
- picard (https://broadinstitute.github.io/picard/command-line-overview.html) to convert to uBAM
- quip (https://github.com/dcjones/quip)
- zfq

Dataset

645 files
Sequencers types: Element Biosciences (AVITI), Illumina (MiSeq, NextSeq and NovaSeq) and PacBio (Sequel 2)
Library: amplicon, capture and whole
Matrix: DNA, ctDNA and RNA
Species: Homo sapiens and several viruses

Results

Compression rate and time

Compression rate is better than the widely used gzip (up to 2 times better than gzip best), similar to zstd in ultra mode, and lower than sequence compression algorithms like quip (up to 2 times lower).

Compression is faster than zstd ultra and lfqc, similar to gzip best, and slower than quip.

Decompression time

Faster than uBAM, quip, lfastqc and lfqc; slower than gzip and the others.

Error rate

Sequence compression algorithms failed to compress or decompress several files in the dataset, because of memory requirements (> 200G) or a limited quality range. This is especially problematic when compression completes without apparent error but decompression is impossible because the file is invalid.
Text compression algorithms can convert every type of FastQ.

Copyright

2023 CHU Toulouse
