Merge pull request #41 from oma219/main
Updates to README
oma219 authored Jun 24, 2024
2 parents c18ba23 + 76b1208 commit dde2cff
Showing 5 changed files with 271 additions and 185 deletions.
82 changes: 80 additions & 2 deletions README.md
@@ -1,6 +1,82 @@
# Digest
C++ library which supports various minimizer schemes for digestion of DNA sequences
# ✂️ Digest: fast, multi-use $k$-mer sub-sampling library

<p align="left">
<img width="900" alt="image1" src="https://github.com/oma219/digest/assets/32006908/09523db6-fd0b-49de-8e57-0d2cedef2a26">
<br>
<em>Visualization of the different minimizer schemes supported in Digest, and a code example using the library</em>
</p>


## What is the Digest library?
- A `C++` library that supports various sub-sampling schemes for $k$-mers in DNA sequences.
- The `Digest` library uses the rolling hash function from [ntHash](https://github.com/bcgsc/ntHash) to order the $k$-mers in a window.

## How to install and build into your project?
<img width="600" alt="image2" src="https://github.com/oma219/digest/assets/32006908/7cea427e-c22a-4271-a234-a2aafeb45413">

### Step 1: Install library

After cloning from GitHub, we use the [Meson](https://mesonbuild.com) build system to install the library.
- `PREFIX` is an absolute path to the directory where the library files (`*.h` and `*.a`) will be installed.
- **IMPORTANT**: `PREFIX` should not be the root directory of the `Digest/` repo, to avoid issues with installation.
- These commands generate `include` and `lib` folders in the `PREFIX` folder.

```bash
git clone https://github.com/VeryAmazed/digest.git
cd digest

meson setup --prefix=<PREFIX> --buildtype=release build
meson install -C build
```

### Step 2: Include Digest in your project

#### (a) Using `Meson`:

If your coding project uses `Meson` to build the executable(s), you can include a file called `subprojects/digest.wrap` in your repository and let Meson install it for you.

#### (b) Using `g++`:

To use Digest in your C++ project, you just need to include the header files (`*.h`) and library file (`*.a`) that were installed in the first step. Assuming that `install/` is the directory you installed them in, here is how you can compile.

```bash
g++ -std=c++17 -o main main.cpp -I install/include/ -L install/lib -lnthash
```

## Detailed Look at Example Usage (2 ways):

There are three types of minimizer schemes that can be used:

1. Windowed Minimizer
2. Modimizer
3. Syncmer

The general steps to use Digest are as follows: (1) include the relevant header files, (2) declare the Digest object, and (3) find the positions where the minimizers are present in the sequence.
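
The schemes differ mainly in the rule that decides which $k$-mers are kept. As a rough illustration (not Digest's implementation), the sketch below applies a modimizer rule and a syncmer rule to a vector of precomputed hash values. The function names `mod_select` and `syncmer_select` are hypothetical, and the syncmer rule assumes the formulation in which a window of `w` consecutive $k$-mers is selected when its smallest hash sits at the first or last position of the window.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Modimizer rule (illustrative): keep k-mer i when its hash is congruent
// to `congruence` modulo `mod`.
std::vector<std::size_t> mod_select(const std::vector<std::uint64_t> &hashes,
                                    std::uint64_t mod,
                                    std::uint64_t congruence = 0) {
    assert(mod > 0);
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < hashes.size(); ++i) {
        if (hashes[i] % mod == congruence) out.push_back(i);
    }
    return out;
}

// Syncmer rule (illustrative, assumed formulation): keep the window that
// starts at i when the smallest hash among its w k-mers is at the first
// or the last position of that window.
std::vector<std::size_t> syncmer_select(const std::vector<std::uint64_t> &hashes,
                                        std::size_t w) {
    assert(w > 0);
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i + w <= hashes.size(); ++i) {
        const auto first = hashes.begin() + static_cast<std::ptrdiff_t>(i);
        const auto last = first + static_cast<std::ptrdiff_t>(w);
        const std::uint64_t smallest = *std::min_element(first, last);
        if (smallest == hashes[i] || smallest == hashes[i + w - 1]) {
            out.push_back(i);
        }
    }
    return out;
}
```

For example, with hash values `{5, 3, 8, 1, 9, 4}` and `w = 3`, `syncmer_select` keeps the windows starting at indices 1 and 3, while `mod_select` with `mod = 2` and `congruence = 1` keeps indices 0, 1, 3, and 4.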

### 1. Find positions of minimizers:
```cpp
#include "digest/digester.hpp"
#include "digest/window_minimizer.hpp"

digest::WindowMin<digest::BadCharPolicy::WRITEOVER, digest::ds::Adaptive> digester (dna, 15, 7);

std::vector<size_t> output;
digester.roll_minimizer(100, output);
```
- This code snippet will find up to 100 Windowed Minimizers and store their positions in the vector called `output`.
- `digest::BadCharPolicy::WRITEOVER` means that any time the code encounters a non-`ACTG` character, it will replace it with an `A`.
- `digest::BadCharPolicy::SKIPOVER` will skip any $k$-mers with non-`ACTG` characters.
- `digest::ds::Adaptive` is our recommended data-structure for finding the minimum value in a window (see wiki for other options)
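
To make the idea of a windowed minimizer concrete, here is a minimal, naive sketch that is independent of Digest. It hashes every $k$-mer with a toy FNV-1a hash (an assumption for illustration; Digest uses ntHash's rolling hash) and, for each window of `w` consecutive $k$-mers, reports the start position of the $k$-mer with the smallest hash. The names `toy_kmer_hash` and `naive_window_minimizers` are hypothetical, the leftmost-on-ties rule is an arbitrary choice, and there is no handling of non-`ACTG` characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in hash (FNV-1a) over the k-mer starting at `pos`.
// This is NOT the hash Digest uses; it only gives the k-mers an order.
std::uint64_t toy_kmer_hash(const std::string &seq, std::size_t pos,
                            std::size_t k) {
    std::uint64_t h = 1469598103934665603ULL;
    for (std::size_t i = 0; i < k; ++i) {
        h ^= static_cast<unsigned char>(seq[pos + i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// Naive O(n * w) windowed minimizer. For every window of `w` consecutive
// k-mers, pick the leftmost k-mer with the smallest hash and record its
// start position once (consecutive windows often share a minimizer).
std::vector<std::size_t> naive_window_minimizers(const std::string &seq,
                                                 std::size_t k,
                                                 std::size_t w) {
    assert(k > 0 && w > 0);
    std::vector<std::size_t> out;
    if (seq.size() < k + w - 1) return out;  // not even one full window

    const std::size_t num_kmers = seq.size() - k + 1;
    std::vector<std::uint64_t> hashes(num_kmers);
    for (std::size_t i = 0; i < num_kmers; ++i) {
        hashes[i] = toy_kmer_hash(seq, i, k);
    }

    for (std::size_t start = 0; start + w <= num_kmers; ++start) {
        std::size_t best = start;
        for (std::size_t j = start + 1; j < start + w; ++j) {
            if (hashes[j] < hashes[best]) best = j;
        }
        if (out.empty() || out.back() != best) out.push_back(best);
    }
    return out;
}
```

Calling `naive_window_minimizers(dna, 15, 7)` mirrors the parameters in the snippet above, although the selected positions will generally differ from Digest's because the hash function is different. A production implementation avoids the inner loop by tracking the window minimum incrementally, which is the job of data structures such as `digest::ds::Adaptive`.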
### 2. Find both positions and hash values of minimizers
If you would like to obtain both the positions and hash values for each minimizer, you can pass a vector of paired integers to do so.
```cpp
std::vector<std::pair<size_t, size_t>> output;
digester.roll_minimizer(100, output);
```
<!---
# Implementation
Supports Mod Minimizers, Window Minimizers, and Syncmers
@@ -55,3 +131,5 @@ meson setup build
cd build && meson compile
```
this will generate proper executables for benchmark/testing

-->
4 changes: 2 additions & 2 deletions include/digest/data_structure.hpp
@@ -115,7 +115,7 @@ template <uint32_t k> struct Naive {
std::array<uint64_t, k> arr;
unsigned int i = 0;

Naive(uint32_t){};
Naive(uint32_t) {};
Naive(const Naive &other) = default;
Naive &operator=(const Naive &other) = default;

@@ -183,7 +183,7 @@ template <uint32_t k> struct Naive2 {
unsigned int last = 0;
std::vector<uint64_t> arr = std::vector<uint64_t>(k);

Naive2(uint32_t){};
Naive2(uint32_t) {};
Naive2(const Naive2 &other) = default;
Naive2 &operator=(const Naive2 &other) = default;

162 changes: 83 additions & 79 deletions tests/density/ACTG.cpp
@@ -31,83 +31,87 @@ int dir2[] = {0, 1, 0, -1, 1, -1, 1, -1};

int main() {

std::cout << std::fixed << std::setprecision(8);
// if you use ld, use the above and don't use string stream

std::string str;

std::vector<std::string> strs;
assert(freopen("../tests/density/ACTG.txt", "r", stdin));
for (int i = 0; i < 100; i++) {
std::cin >> str;
strs.pb(str);
}

std::vector<std::vector<double>> mod_min_vec(4, std::vector<double>());
std::vector<std::vector<double>> wind_min_vec(4, std::vector<double>());
std::vector<std::vector<double>> sync_vec(4, std::vector<double>());

uint64_t mods[4] = {109, 128, 1009, 1024};
unsigned l_winds[4] = {7, 8, 17, 16};

double kmers = 100000 - 16 + 1;

for (int i = 0; i < 4; i++) {
for (int j = 0; j < 100; j++) {
digest::ModMin<digest::BadCharPolicy::SKIPOVER> mm(
strs[j], 16, mods[i], 0, 0, digest::MinimizedHashType::CANON);
std::vector<uint32_t> temp;
mm.roll_minimizer(100000, temp);
double am = temp.size();
am /= kmers;
mod_min_vec[i].pb(am);
}
}

for (int i = 0; i < 4; i++) {
for (int j = 0; j < 100; j++) {
digest::WindowMin<digest::BadCharPolicy::SKIPOVER, digest::ds::Adaptive>
wm(strs[j], 16, l_winds[i], 0, digest::MinimizedHashType::CANON);
std::vector<uint32_t> temp;
wm.roll_minimizer(100000, temp);
double am = temp.size();
am /= kmers;
wind_min_vec[i].pb(am);
}
}

for (int i = 0; i < 4; i++) {
for (int j = 0; j < 100; j++) {
digest::Syncmer<digest::BadCharPolicy::SKIPOVER, digest::ds::Adaptive>
syn(strs[j], 16, l_winds[i], 0, digest::MinimizedHashType::CANON);
std::vector<uint32_t> temp;
syn.roll_minimizer(100000, temp);
double am = temp.size();
am /= kmers;
sync_vec[i].pb(am);
}
}
assert(freopen("../tests/density/out1.txt", "w", stdout));
for (int i = 0; i < 4; i++) {
for (size_t j = 0; j < 100; j++) {
std::cout << mod_min_vec[i][j] << " ";
}
std::cout << std::endl;
}

for (int i = 0; i < 4; i++) {
for (size_t j = 0; j < 100; j++) {
std::cout << wind_min_vec[i][j] << " ";
}
std::cout << std::endl;
}

for (int i = 0; i < 4; i++) {
for (size_t j = 0; j < 100; j++) {
std::cout << sync_vec[i][j] << " ";
}
std::cout << std::endl;
}

return 0;
std::cout << std::fixed << std::setprecision(8);
// if you use ld, use the above and don't use string stream

std::string str;

std::vector<std::string> strs;
assert(freopen("../tests/density/ACTG.txt", "r", stdin));
for (int i = 0; i < 100; i++) {
std::cin >> str;
strs.pb(str);
}

std::vector<std::vector<double>> mod_min_vec(4, std::vector<double>());
std::vector<std::vector<double>> wind_min_vec(4, std::vector<double>());
std::vector<std::vector<double>> sync_vec(4, std::vector<double>());

uint64_t mods[4] = {109, 128, 1009, 1024};
unsigned l_winds[4] = {7, 8, 17, 16};

double kmers = 100000 - 16 + 1;

for (int i = 0; i < 4; i++) {
for (int j = 0; j < 100; j++) {
digest::ModMin<digest::BadCharPolicy::SKIPOVER> mm(
strs[j], 16, mods[i], 0, 0, digest::MinimizedHashType::CANON);
std::vector<uint32_t> temp;
mm.roll_minimizer(100000, temp);
double am = temp.size();
am /= kmers;
mod_min_vec[i].pb(am);
}
}

for (int i = 0; i < 4; i++) {
for (int j = 0; j < 100; j++) {
digest::WindowMin<digest::BadCharPolicy::SKIPOVER,
digest::ds::Adaptive>
wm(strs[j], 16, l_winds[i], 0,
digest::MinimizedHashType::CANON);
std::vector<uint32_t> temp;
wm.roll_minimizer(100000, temp);
double am = temp.size();
am /= kmers;
wind_min_vec[i].pb(am);
}
}

for (int i = 0; i < 4; i++) {
for (int j = 0; j < 100; j++) {
digest::Syncmer<digest::BadCharPolicy::SKIPOVER,
digest::ds::Adaptive>
syn(strs[j], 16, l_winds[i], 0,
digest::MinimizedHashType::CANON);
std::vector<uint32_t> temp;
syn.roll_minimizer(100000, temp);
double am = temp.size();
am /= kmers;
sync_vec[i].pb(am);
}
}
assert(freopen("../tests/density/out1.txt", "w", stdout));
for (int i = 0; i < 4; i++) {
for (size_t j = 0; j < 100; j++) {
std::cout << mod_min_vec[i][j] << " ";
}
std::cout << std::endl;
}

for (int i = 0; i < 4; i++) {
for (size_t j = 0; j < 100; j++) {
std::cout << wind_min_vec[i][j] << " ";
}
std::cout << std::endl;
}

for (int i = 0; i < 4; i++) {
for (size_t j = 0; j < 100; j++) {
std::cout << sync_vec[i][j] << " ";
}
std::cout << std::endl;
}

return 0;
}
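
The test above measures density as the number of selected positions divided by the number of $k$-mers in the sequence (`100000 - 16 + 1`). A minimal sketch of that calculation follows, together with the usual back-of-envelope expectations for random sequences and a well-mixed hash: roughly `1 / mod` for a modimizer and roughly `2 / (w + 1)` for a windowed minimizer. The helper names are hypothetical and are not part of Digest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Observed density: fraction of the sequence's k-mers that were selected.
double observed_density(std::size_t num_selected, std::size_t seq_len,
                        std::size_t k) {
    assert(k > 0 && seq_len >= k);
    const double num_kmers = static_cast<double>(seq_len - k + 1);
    return static_cast<double>(num_selected) / num_kmers;
}

// Approximate expected density of a modimizer under a uniform hash.
double expected_mod_density(std::uint64_t mod) {
    assert(mod > 0);
    return 1.0 / static_cast<double>(mod);
}

// Approximate expected density of a random windowed minimizer with a
// window of w k-mers.
double expected_window_density(unsigned w) {
    assert(w > 0);
    return 2.0 / (static_cast<double>(w) + 1.0);
}
```

With the parameters used in the test, `expected_mod_density(128)` is `1/128` and `expected_window_density(7)` is `0.25`, which gives a rough reference point for the values written to `out1.txt`.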