kmer-count

This program reads a DNA sequence from an input file and outputs the N most common k-mers. Due to the internal representation of the k-mers (detailed below), it only works for k <= 29. The following assumptions are made:

Input files contain a single valid DNA sequence (alphabet {A, C, G, T, N})
The sequence is on a single line, without whitespace (except for a possible newline character at the end of the file).

Implementation details

In order to strike a balance between speed an memory requirements, two different approaches are used, depending on k:

For k <= 10, the number of possible k-mers is small enough to fit into memory. Therefore, the program uses a trie structure (prefix tree) to store and count the k-mers. Once the input sequence has been read, the counts are extracted from the leaves.
For 10 < k <= 29, the k-mers are binned into files according to their first two bases. Bases 3 to k are encoded as numbers of base 5. Up to 27 bases can be stored as a unique 64-bit number in this way, leading to the maximum k of 29. The disk space required is therefore about 8 times the size of the input file (e.g. with a 1GB input file, 8GB of disk space are necessary). In a second stage, the files are read back into memory one by one, and the k-mers are counted.

In both cases, only the N most common k-mers are kept during the counting phase. If there are multiple k-mers with the same count, they are ordered lexicographically. In case the N-th and (N+1)-th k-mers have the same count, only the N-th k-mer is reported.

Building

The source code can be compiled using the included makefile:

cd k-mer_counter
make depend
make

will compile the executable.

The program was validated with a set of tests (using the Catch framework). To run these tests, use

make depend
make test
./test

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
k-mer counter.xcodeproj		k-mer counter.xcodeproj
k-mer counter		k-mer counter
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kmer-count

Implementation details

Building

About

Releases

Packages

Contributors 2

Languages

ddlqn/kmer-count

Folders and files

Latest commit

History

Repository files navigation

kmer-count

Implementation details

Building

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages