Skip to content

Latest commit

 

History

History
154 lines (108 loc) · 9.92 KB

README.md

File metadata and controls

154 lines (108 loc) · 9.92 KB

demuxFQ

Flexibly demultiplex Fastq files.

This package provides demuxFQ , a program for demultiplexing Fastq files generated by Illumina's sequencers (or any other Fastq in a sufficiently similar format). The user provides a sample sheet (listing the indices in the Fastq, and the corresponding output filenames) and a Fastq file, then the program generates the demultiplexed files. The program will produce a summary file, recording the number of reads found for each index in the sample sheet. Optionally, the summary can be produced without actually demultiplexing, for diagnostic purposes.

Sample Sheets

demuxFQ reads a sample sheet that describes the expected indices in the Fastq file. The format of a line in the sample sheet is:

XXXXXXXX filename

Or if the file has dual indices:

XXXXXXXX YYYYYYYY filename

where XXXXXXXX and YYYYYYYY are index sequences, and filename is the name of the file to which to write sequences matching this index (pair). The white space separating the tokens may be spaces or tabs. Filenames and indices may not contain white space. (But you weren't going to do that anyway, right?) The sample sheet may contain additional text after the filename:

XXXXXXXX filename extra notes here

This extended text will be added to the demultiplexing summary, for documentation purposes. If the index pattern (see the next section) includes a second index, then the first two columns of the sample sheet are the indices, and the third is the filename. Otherwise, the first column is the index, and the second the filename. In this way it is possible to distinguish single-index from dual-index sample sheets, even in the presence of arbitrary columns of additional text.

Filenames are normally given as base names (i.e. without a leading path); the path to write them to, if not the current directory, is given on the command line with the "-o" option. The destination directory must exist.

Blank lines and lines with "#" as the first non-blank character are ignored.

An example of a sample sheet with single indices::

GCCAATA jc1174_D1837ACXX_6_1.fq.gz
CTTGTAA jc1175_D1837ACXX_6_1.fq.gz
GTGAAAC jc1176_D1837ACXX_6_1.fq.gz
ACAGTGA jc1173_D1837ACXX_6_1.fq.gz

An example with dual indices:

# This is a comment.
ATTACTCG TAATCTTA SLX-5906.S12.r_1.fq.gz
ATTACTCG CAGGACGT SLX-5906.S13.r_1.fq.gz
ATTACTCG GTACTGAC SLX-5906.S14.r_1.fq.gz
TCCGGAGA TATAGCCT SLX-5906.S03.r_1.fq.gz
TCCGGAGA ATAGAGGC SLX-5906.S04.r_1.fq.gz
TCCGGAGA CCTATCCT SLX-5906.S15.r_1.fq.gz

If filenames in the sample sheet end with .gz the files will be gzip compressed; otherwise they will be written uncompressed.

Fastq Headers

The format of the indices in the Fastq header is described by a string that indicates the positions of the first and second indices (if there is a second), relative to the end of the header line. Note that this software assumes that the index is at the end of the Fastq read name line, or at least that any trailing characters are a fixed-length string. With this assumption, the description is composed of the characters 1, 2, X, plus possibly other literal characters, such that a run of 1's shows the first index, a run of 2's the second, and X's positions to be ignored. Any other character is expected to match literally.

For example:

11111111XXX+22222222

indicates 8 characters for the 1st index, 3 skipped characters, a literal "+" separating the indices, and 8 characters for the 2nd index.

11111111XXXX

indicates 8 characters for the 1st index, followed by 4 unused characters (presumably some other lane used 12bp indices, so 12 were sequenced).

11111111+22222222

shows a straightforward dual index, 8bp for the first and second, separated by a literal "+" symbol.

Examples of complete Fastq header lines:

@HWI-ST230:965:1:1101:13957:19936:CGTACGTA#TAATCTTA/1
@HWI-ST230:965:1:1101:13957:19936:CGTACGTA/1
@HWI-ST230:965:1:1101:13957:19936:CGTACGTA+TAATCTTA
@HWI-ST230:965:1:1101:13957:19936:CGTACG

which would be described by these patterns:

11111111#22222222XX
11111111XX
11111111+22222222
111111

Given that the pattern ends the line, it is not necessary to specify the delimiter (in this case a colon): given a pattern of n characters, the program will consider the last n characters of each Fastq header.

Command line

demuxFQ [options] <sampleSheet> <fastq> ...

The program summarizes the indices in the file(s), and demultiplexes them depending on the presence of the -d option. Multiple fastq files may be listed; they will be demultiplexed in order. Note that the output will be merged, as if all of the inputs were strung together as one long file. Input fastq files may be compressed with gzip, or uncompressed.

Command-line options include:

  • -b <filename> -- Save reads with non-matching indices to the named file (default: discard non-matching reads).
  • -d -- Generate demultiplexed output files. (Without this option the program will generate a summary without demultiplexing.)
  • -f -- overwrite existing files (force overwrite) (default: do not overwrite).
  • -o <directory> -- Write output files to the given directory (default: current directory). Applies to Fastq only, not the summary file.
  • -p <pattern> -- parse indices according to the given pattern (default: 11111111+22222222). Note that (obviously) the pattern and the sample sheet must match. If there is a run of "2" in the pattern, then the sample sheet should have two index columns.
  • -r <x.xxx> -- Report unknown indices with frequency at least this high (default: 0.001).
  • -R -- Reverse complement the second index when matching (to allow for HiSeq 4000 and NextSeq reversing the second index read) (default: do not reverse complement).
  • -s <summaryFile> -- Write summary to this file (default: name of the sample sheet, with the suffix removed, and "_summary.txt" appended). (A suffix is a period followed by non-period text, just as one might expect; if no period is present, the name will be used without removing a suffix.)
  • -t <N> -- Allow up to N mismatches (default: 2).

Summary File

A summary of the indices found is provided, including the parameters used for this run, the expected indices and how many reads matching each were found. In addition, any unexpected indices are reported, if they occur more frequently than some threshold, as well as the overall number of unique indices found.

A sample report might look like:

176602548 reads
18154293 10.27% lost
1 = threshold for match
3 = minimum distance between barcodes
Expected:
Index   Total   Balance    0 Mismatches    File
GCCAAT  53878803    91.52%  53800000 89.6% thisfile.fq.gz
CTTGTA  48841472    82.96%  48800000 95.0% thatfile.fq.gz
GTGAAA  55727980    94.66%  55000000 89.x% borkfile.fq.gz
23511 distinct codes
 45548212   CTTGTAA
 50236885   GCCAATA
 51427561   GTGAAAA
Unknown
  1248202   ACGTACA

This report includes

  • the number of reads,
  • the number (and percentage) lost, i.e. did not match a known index within N mismatches,
  • the allowed number of mismatches,
  • the minimum edit distance between pairs of indices in the sample sheet,
  • counts of the reads matching each expected index,
  • any unknown indices seen with a frequency at least the threshold given.

For each expected index, the output includes

  • the index sequence,
  • the number of reads matching this index,
  • the balance (see below),
  • the number and percent of perfect matches,
  • the filename of the output,
  • the matching metadata for this index, if extended metadata are provided.

In a balanced pool of multiplexed samples, all samples would occur with about the same frequency. Given N total sequences and M samples, about N/M sequences per sample would be good. The balance column indicates how close to this ideal the data are: If E = N/M is the expected number of reads per sample, and A the actual number of reads found for a sample, the balance is A/E, expressed as a percentage. Accordingly, a balance of 100% is perfect, less than 100% means that fewer than expected reads were seen, and more than 100% means more than expected reads were seen for this sample. Note that the percentages are based on the total number of identified reads; the lost reads are not included in this calculation.

For example, if there are 176602548 reads total (identified), and 3 indices in the sample sheet, we expect around 176602548/3 = 58867516 reads to match each index. If 53878803 actually match a particular barcode, then the "balance" is 53878803/58867516 = 91.5%. If 74959637 actually match, then the "balance" is 74959637/58867516 = 127.3%.

Features to add

  • Check for IO errors, fail loudly if found. (Obviously we should be doing this anyway, but...)
  • Validate number of saved, lost reads against input total.
    • fast: just check the sums on exit.
    • slow (optional): re-read files, count reads.
    • safe mode: count reads in input file first, warn if didn't write (or discard) that many.
  • Write summary in machine-friendly format (JSON? XML?).
  • Add parameter for max length of Fastq sequence (so we can pre-allocate a Fastq object this big, ready for reading).
  • Add parameter to specify the separator between the header and index, defaulting to ':' (use to check whether pattern length matches index length).
  • Add memory-saving mode that does not track unique indices (doesn't need to maintain a hash of them). Note that the summary report will necessarily be more limited.
  • Support paired-end sequencing directly: include two filenames in sample sheet.
  • Perform strict checking on Fastq input syntax.
  • Support for extracting UMI's from the index. Include a "U" as a legal character in the pattern, and an optional filename to extract those sequences, and write the UMI's found to that file. Example: 11111111UUUUUUUUU+22222222 includes 8bp index 1, 9bp UMI, 8bp index 2.
  • Allow the last component of the destination path not to exist, and create it.