Merge pull request #146 from smehringer/merged_fpr
[FEATURE] Allow a higher false positive rate on merged bins.
eseiler authored Oct 23, 2023
2 parents 2621d5a + 226b1f1 commit c725f38
Showing 18 changed files with 235 additions and 98 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -118,7 +118,7 @@ int main()
seqan::hibf::config config{.input_fn = get_user_bin_data, // required
.number_of_user_bins = 3u, // required
.number_of_hash_functions = 2u,
.maximum_false_positive_rate = 0.05,
.maximum_fpr = 0.05,
.threads = 1u};

// The HIBF constructor will determine a hierarchical layout for the user bins and build the filter.
69 changes: 51 additions & 18 deletions include/hibf/config.hpp
@@ -30,19 +30,20 @@ namespace seqan::hibf
*
* Here is the list of all config options:
*
* | Type | Option Name | Default | Note |
* |:--------|:-------------------------------------------------|:-------:|:-----------------------|
* | General | seqan::hibf::config::input_fn | - | [REQUIRED] |
* | General | seqan::hibf::config::number_of_user_bins | - | [REQUIRED] |
* | General | seqan::hibf::config::number_of_hash_functions | 2 | |
* | General | seqan::hibf::config::maximum_false_positive_rate | 0.05 | [RECOMMENDED_TO_ADAPT] |
* | General | seqan::hibf::config::threads | 1 | [RECOMMENDED_TO_ADAPT] |
* | Layout | seqan::hibf::config::sketch_bits | 12 | |
* | Layout | seqan::hibf::config::tmax | 0 | 0 indicates unset |
* | Layout | seqan::hibf::config::max_rearrangement_ratio | 0.5 | |
* | Layout | seqan::hibf::config::alpha | 1.2 | |
* | Layout | seqan::hibf::config::disable_estimate_union | false | |
* | Layout | seqan::hibf::config::disable_rearrangement | false | |
* | Type | Option Name | Default | Note |
* |:--------|:----------------------------------------------|:-------:|:-----------------------|
* | General | seqan::hibf::config::input_fn | - | [REQUIRED] |
* | General | seqan::hibf::config::number_of_user_bins | - | [REQUIRED] |
* | General | seqan::hibf::config::number_of_hash_functions | 2 | |
* | General | seqan::hibf::config::maximum_fpr | 0.05 | [RECOMMENDED_TO_ADAPT] |
* | General | seqan::hibf::config::relaxed_fpr | 0.3 | |
* | General | seqan::hibf::config::threads | 1 | [RECOMMENDED_TO_ADAPT] |
* | Layout | seqan::hibf::config::sketch_bits | 12 | |
* | Layout | seqan::hibf::config::tmax | 0 | 0 indicates unset |
* | Layout | seqan::hibf::config::max_rearrangement_ratio | 0.5 | |
* | Layout | seqan::hibf::config::alpha | 1.2 | |
* | Layout | seqan::hibf::config::disable_estimate_union | false | |
* | Layout | seqan::hibf::config::disable_rearrangement | false | |
*
* As a copy and paste source, here are all config options with their defaults:
*
@@ -61,7 +62,7 @@ namespace seqan::hibf
* Check the documentation of the following options that influence the memory consumption:
* * seqan::hibf::config::threads
* * seqan::hibf::config::number_of_hash_functions
* * seqan::hibf::config::maximum_false_positive_rate
* * seqan::hibf::config::maximum_fpr
*
* ## Validation
*
@@ -133,7 +134,7 @@ struct config
/*!\brief The desired maximum false positive rate of the underlying Bloom Filters. [RECOMMENDED_TO_ADAPT]
*
* We ensure that when querying a single hash value in the (H)IBF, the probability of getting a false positive answer
* will not exceed the value set for seqan::hibf::config::maximum_false_positive_rate.
* will not exceed the value set for seqan::hibf::config::maximum_fpr.
* The internal Bloom Filters will be configured accordingly. Individual Bloom Filters might have a different
* but always lower false positive rate (FPR).
*
@@ -146,7 +147,35 @@ struct config
*
* \sa [Bloom Filter Calculator](https://hur.st/bloomfilter/).
*/
double maximum_false_positive_rate{0.05};
double maximum_fpr{0.05};

/*!\brief Allow a higher FPR in non-accuracy-critical parts of the HIBF structure.
*
* Some parts of the hierarchical structure are not critical for ensuring the seqan::hibf::config::maximum_fpr.
* These can be allowed to have a higher FPR to reduce the overall space consumption, while only minimally
* affecting the runtime performance.
*
* Value must be in range (0.0,1.0).
* Value must be equal to or larger than seqan::hibf::config::maximum_fpr.
* Recommendation: default value (0.3)
*
* ### Technical details
*
* Merged bins in an HIBF layout will always be followed by one or more lower-level IBFs that will have split bins
* or single bins (split = 1) to recover the original user bins. Thus, the FPR of merged bins does not determine the
* seqan::hibf::config::maximum_fpr, but is independent. Choosing a higher FPR for merged bins can
* lower the memory requirement but increases the runtime. Experiments show that the decrease in memory is
* significant, while the runtime suffers only slightly. The accuracy of the results is not affected by this
* parameter.
*
* Note: For each IBF there is a limit to how high the FPR of merged bins can be. Specifically, the FPR for merged
* bins can never decrease the IBF size more than what is needed to ensure the
* seqan::hibf::config::maximum_fpr for split bins. This means that, at some point, choosing even
* higher values for this parameter will have no effect anymore.
*
* \sa [Bloom Filter Calculator](https://hur.st/bloomfilter/).
*/
double relaxed_fpr{0.3};

/*!\brief The number of threads to use during construction. [RECOMMENDED_TO_ADAPT]
*
@@ -264,7 +293,10 @@ struct config
*
* Constraints:
* * seqan::hibf::config::number_of_hash_functions must be in `[1,5]`.
* * seqan::hibf::config::maximum_false_positive_rate must be in `(0.0,1.0)`.
* * seqan::hibf::config::maximum_fpr must be in `(0.0,1.0)`.
* * seqan::hibf::config::relaxed_fpr must be in `(0.0,1.0)`.
* * seqan::hibf::config::relaxed_fpr must be equal to or larger than
* seqan::hibf::config::maximum_fpr.
* * seqan::hibf::config::threads must be greater than `0`.
* * seqan::hibf::config::sketch_bits must be in `[5,32]`.
* * seqan::hibf::config::tmax must be at most `18446744073709551552`.
@@ -291,7 +323,8 @@ struct config

archive(CEREAL_NVP(number_of_user_bins));
archive(CEREAL_NVP(number_of_hash_functions));
archive(CEREAL_NVP(maximum_false_positive_rate));
archive(CEREAL_NVP(maximum_fpr));
archive(CEREAL_NVP(relaxed_fpr));
archive(CEREAL_NVP(threads));

archive(CEREAL_NVP(sketch_bits));
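
To make the memory argument in the `relaxed_fpr` documentation concrete, here is a minimal sketch that compares the size of a single Bloom filter at the two default rates. It uses the textbook Bloom filter size formula; `bloom_bits` is a hypothetical helper for illustration and is not claimed to match `build::bin_size_in_bits` exactly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (an assumption, not the library's build::bin_size_in_bits):
// textbook Bloom filter size m = ceil(-k * n / ln(1 - p^(1/k)))
// for n elements, k hash functions and a target false positive rate p.
inline std::size_t bloom_bits(double const fpr, std::size_t const hash_count, std::size_t const elements)
{
    assert(fpr > 0.0 && fpr < 1.0 && hash_count > 0u);

    double const k = static_cast<double>(hash_count);
    double const n = static_cast<double>(elements);
    double const denominator = std::log(1.0 - std::exp(std::log(fpr) / k)); // always negative

    return static_cast<std::size_t>(std::ceil(-(k * n) / denominator));
}
```

Under this formula, with 2 hash functions and 1000 elements, a filter sized for an FPR of 0.3 needs roughly a third of the bits of one sized for 0.05, which is the kind of saving the option is aiming at for merged bins.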
2 changes: 1 addition & 1 deletion include/hibf/hierarchical_interleaved_bloom_filter.hpp
@@ -161,7 +161,7 @@ class hierarchical_interleaved_bloom_filter
*
* Options recommended to adapt to your setup:
* * `threads` - Choose number of threads depending on your hardware settings to speed up construction
* * `maximum_false_positive_rate` - How many false positive answers can you tolerate? A low FPR (e.g. 0.001) is
* * `maximum_fpr` - How many false positive answers can you tolerate? A low FPR (e.g. 0.001) is
* needed if you can tolerate a high RAM peak when using the HIBF but post-processing steps are heavy and FPs
* should be avoided. A high FPR (e.g. `0.3`) can be chosen if you want a very small HIBF and false positives
* can be easily filtered in the downstream analysis
5 changes: 5 additions & 0 deletions include/hibf/layout/graph.hpp
@@ -48,6 +48,11 @@ struct graph
std::optional<size_t> favourite_child_idx{std::nullopt};
std::vector<layout::layout::user_bin> remaining_records{}; // non-merged bins (either split or single)

bool max_bin_is_merged() const
{
return favourite_child_idx.has_value();
}

// Doesn't work, because the type is incomplete. To compare node, a comparison for the children member is needed.
// But children is a std::vector<node>, so a comparison for node is needed to compare children.
// https://godbolt.org/z/arrr4YKae
79 changes: 70 additions & 9 deletions include/hibf/layout/hierarchical_binning.hpp
@@ -9,9 +9,10 @@
#include <utility> // for addressof, pair
#include <vector> // for vector

#include <hibf/config.hpp> // for config
#include <hibf/layout/data_store.hpp> // for data_store
#include <hibf/platform.hpp> // for HIBF_WORKAROUND_GCC_BOGUS_MEMCPY
#include <hibf/build/bin_size_in_bits.hpp> // for bin_size_in_bits
#include <hibf/config.hpp> // for config
#include <hibf/layout/data_store.hpp> // for data_store
#include <hibf/platform.hpp> // for HIBF_WORKAROUND_GCC_BOGUS_MEMCPY

namespace seqan::hibf::layout
{
@@ -32,6 +33,70 @@ class hierarchical_binning
//!\brief The number of technical bins requested by the user.
size_t num_technical_bins{};

//!\brief Simplifies passing the parameters needed for tracking the maximum technical bin.
struct maximum_bin_tracker
{
size_t max_id{}; //!< The ID of the technical bin with maximal size.
size_t max_size{}; //!< The maximum technical bin size seen so far.
size_t max_split_id{}; //!< The ID of the split bin with maximal size (if any).
size_t max_split_size{}; //!< The maximum split bin size seen so far.

void update_max(size_t const new_id, size_t const new_size)
{
if (new_size > max_size)
{
max_id = new_id;
max_size = new_size;
}
}

//!\brief Split cardinality `new_size` must already account for fpr-correction.
void update_split_max(size_t const new_id, size_t const new_size)
{
if (new_size > max_split_size)
{
max_split_id = new_id;
max_split_size = new_size;
}
}

/*!\brief Decides which bin is reported as the maximum bin.
*\param config The HIBF configuration.
*\return The chosen max bin id.
*
* As an HIBF feature, the merged bin FPR can differ from the overall maximum FPR. Merged bins in an HIBF layout
* will always be followed by one or more lower-level IBFs that will have split bins or single bins (split = 1)
* to recover the original user bins.
*
* We need to make sure, though, that downsizing merged bins does not affect split bins.
* Therefore, we check if choosing a merged bin as the max bin violates the minimum_bits needed for split bins.
* If so, we can report the largest split bin as the max bin as it will choose the correct size and downsize
* larger merged bins only a little.
*/
size_t choose_max_bin(seqan::hibf::config const & config)
{
if (max_id == max_split_id) // Overall max bin is a split bin.
return max_id;

// Split cardinality `max_split_size` already accounts for fpr correction.
// The minimum size of the TBs of this IBF to ensure the maximum_fpr for split bins.
size_t const minimum_bits{build::bin_size_in_bits({.fpr = config.maximum_fpr,
.hash_count = config.number_of_hash_functions,
.elements = max_split_size})};

// The potential size of the TBs of this IBF given the allowed merged bin FPR.
size_t const merged_bits{build::bin_size_in_bits({.fpr = config.relaxed_fpr, //
.hash_count = config.number_of_hash_functions,
.elements = max_size})};

// If split and merged bits are the same, we prefer merged bins. Better for build parallelisation.
if (minimum_bits > merged_bits)
return max_split_id;

return max_id;
}
};

public:
hierarchical_binning() = default; //!< Defaulted.
hierarchical_binning(hierarchical_binning const &) = delete; //!< Deleted. Would modify same data.
@@ -123,15 +188,13 @@ class hierarchical_binning
void backtrack_merged_bin(size_t trace_j,
size_t const next_j,
size_t const bin_id,
size_t & high_level_max_id,
size_t & high_level_max_size,
maximum_bin_tracker & max_tracker,
bool is_first_row = false);

void backtrack_split_bin(size_t trace_j,
size_t const number_of_bins,
size_t const bin_id,
size_t & high_level_max_id,
size_t & high_level_max_size);
maximum_bin_tracker & max_tracker);

//!\brief Backtracks the trace matrix and writes the resulting binning into the output file.
size_t backtracking(std::vector<std::vector<std::pair<size_t, size_t>>> const & trace);
@@ -143,8 +206,6 @@ class hierarchical_binning
void update_libf_data(data_store & libf_data, size_t const bin_id) const;

size_t add_lower_level(data_store & libf_data) const;

void update_max_id(size_t & max_id, size_t & max_size, size_t const new_id, size_t const new_size) const;
};

} // namespace seqan::hibf::layout
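
The decision made by `maximum_bin_tracker::choose_max_bin` can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: `bits_for` is a hypothetical stand-in for `build::bin_size_in_bits` (textbook Bloom filter size), the FPRs are passed directly instead of via `seqan::hibf::config`, and the caller is assumed to call `update_max` for every technical bin and `update_split_max` additionally for split bins.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for build::bin_size_in_bits: textbook Bloom filter size.
inline std::size_t bits_for(double const fpr, std::size_t const hash_count, std::size_t const elements)
{
    double const k = static_cast<double>(hash_count);
    double const n = static_cast<double>(elements);
    return static_cast<std::size_t>(std::ceil(-(k * n) / std::log(1.0 - std::exp(std::log(fpr) / k))));
}

// Simplified copy of the tracker's logic, for illustration only.
struct max_bin_sketch
{
    std::size_t max_id{};         // ID of the largest technical bin (split or merged).
    std::size_t max_size{};       // Its size.
    std::size_t max_split_id{};   // ID of the largest split bin.
    std::size_t max_split_size{}; // Its (already FPR-corrected) size.

    void update_max(std::size_t const id, std::size_t const size)
    {
        if (size > max_size)
        {
            max_id = id;
            max_size = size;
        }
    }

    void update_split_max(std::size_t const id, std::size_t const size)
    {
        if (size > max_split_size)
        {
            max_split_id = id;
            max_split_size = size;
        }
    }

    std::size_t choose(double const maximum_fpr, double const relaxed_fpr, std::size_t const hash_count) const
    {
        if (max_id == max_split_id) // The overall maximum is a split bin.
            return max_id;

        std::size_t const minimum_bits = bits_for(maximum_fpr, hash_count, max_split_size);
        std::size_t const merged_bits = bits_for(relaxed_fpr, hash_count, max_size);

        // A relaxed merged bin must not shrink the IBF below what the split bins need.
        return (minimum_bits > merged_bits) ? max_split_id : max_id;
    }
};

// Demo: technical bin 0 is a split bin with 1000 elements, technical bin 1 is a merged bin.
inline std::size_t demo_choice(std::size_t const merged_size)
{
    max_bin_sketch tracker{};
    tracker.update_max(0u, 1000u);
    tracker.update_split_max(0u, 1000u);
    tracker.update_max(1u, merged_size);
    return tracker.choose(0.05, 0.3, 2u);
}
```

In this demo a merged bin of 1500 elements at FPR 0.3 would need fewer bits than the split bin of 1000 elements at FPR 0.05, so the split bin is reported as the max bin; a merged bin of 5000 elements is large enough to remain the max bin.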
18 changes: 13 additions & 5 deletions src/build/construct_ibf.cpp
@@ -28,12 +28,20 @@ seqan::hibf::interleaved_bloom_filter construct_ibf(robin_hood::unordered_flat_s
build_data & data,
bool is_root)
{
size_t const kmers_per_bin{static_cast<size_t>(std::ceil(static_cast<double>(kmers.size()) / number_of_bins))};
double const bin_bits{static_cast<double>(bin_size_in_bits({.fpr = data.config.maximum_false_positive_rate,
.hash_count = data.config.number_of_hash_functions,
.elements = kmers_per_bin}))};
bool const max_bin_is_merged = ibf_node.max_bin_is_merged();
assert(!max_bin_is_merged || number_of_bins == 1u); // merged max bin implies (=>) number of bins == 1

size_t const kmers_per_bin{(kmers.size() + number_of_bins - 1u) / number_of_bins}; // Integer ceil
double const fpr = max_bin_is_merged ? data.config.relaxed_fpr : data.config.maximum_fpr;

size_t const bin_bits{bin_size_in_bits({.fpr = fpr, //
.hash_count = data.config.number_of_hash_functions,
.elements = kmers_per_bin})};
// data.fpr_correction[1] == 1.0, but we can avoid floating point operations with the ternary.
// Check number_of_bins instead of max_bin_is_merged, because split bins can also occupy only one technical bin.
seqan::hibf::bin_size const bin_size{
static_cast<size_t>(std::ceil(bin_bits * data.fpr_correction[number_of_bins]))};
number_of_bins == 1u ? bin_bits
: static_cast<size_t>(std::ceil(bin_bits * data.fpr_correction[number_of_bins]))};
seqan::hibf::bin_count const bin_count{ibf_node.number_of_technical_bins};

timer<concurrent::no> local_index_allocation_timer{};
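
The two small changes in this hunk, integer ceiling division and the per-IBF choice of FPR, can be shown standalone. The function names below are hypothetical and exist only for this sketch; `ceil_div_fp` mirrors the replaced floating-point path for comparison.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Integer ceiling of numerator / denominator without converting to double.
// Requires denominator > 0; numerator + denominator - 1 must not overflow.
constexpr std::size_t ceil_div(std::size_t const numerator, std::size_t const denominator)
{
    return (numerator + denominator - 1u) / denominator;
}

// Floating-point variant, as used before the change.
inline std::size_t ceil_div_fp(std::size_t const numerator, std::size_t const denominator)
{
    assert(denominator > 0u);
    return static_cast<std::size_t>(std::ceil(static_cast<double>(numerator) / static_cast<double>(denominator)));
}

// The IBF is sized with the relaxed FPR only if its max bin is a merged bin.
constexpr double select_fpr(bool const max_bin_is_merged, double const maximum_fpr, double const relaxed_fpr)
{
    return max_bin_is_merged ? relaxed_fpr : maximum_fpr;
}
```

The integer form avoids rounding issues for k-mer counts that exceed the 53-bit mantissa of a `double`, at the cost of a possible overflow when the numerator is close to the maximum of `std::size_t`.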
11 changes: 9 additions & 2 deletions src/config.cpp
@@ -72,8 +72,15 @@ void config::validate_and_set_defaults()
if (number_of_hash_functions == 0u || number_of_hash_functions > 5u)
throw std::invalid_argument{"[HIBF CONFIG ERROR] config::number_of_hash_functions must be in [1,5]."};

if (maximum_false_positive_rate <= 0.0 || maximum_false_positive_rate >= 1.0)
throw std::invalid_argument{"[HIBF CONFIG ERROR] config::maximum_false_positive_rate must be in (0.0,1.0)."};
if (maximum_fpr <= 0.0 || maximum_fpr >= 1.0)
throw std::invalid_argument{"[HIBF CONFIG ERROR] config::maximum_fpr must be in (0.0,1.0)."};

if (relaxed_fpr <= 0.0 || relaxed_fpr >= 1.0)
throw std::invalid_argument{"[HIBF CONFIG ERROR] config::relaxed_fpr must be in (0.0,1.0)."};

if (relaxed_fpr < maximum_fpr)
throw std::invalid_argument{"[HIBF CONFIG ERROR] config::relaxed_fpr must be "
"greater than or equal to config::maximum_fpr."};

if (threads == 0u)
throw std::invalid_argument{"[HIBF CONFIG ERROR] config::threads must be greater than 0."};
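
The FPR checks added to `validate_and_set_defaults()` can be exercised on their own with a reduced sketch. `validate_fpr` and `is_valid` are hypothetical names used only here; the conditions and their order follow the hunk above, while the error messages are shortened.

```cpp
#include <stdexcept>

// Reduced sketch of the FPR-related checks; throws on the first violated constraint.
inline void validate_fpr(double const maximum_fpr, double const relaxed_fpr)
{
    if (maximum_fpr <= 0.0 || maximum_fpr >= 1.0)
        throw std::invalid_argument{"maximum_fpr must be in (0.0,1.0)."};

    if (relaxed_fpr <= 0.0 || relaxed_fpr >= 1.0)
        throw std::invalid_argument{"relaxed_fpr must be in (0.0,1.0)."};

    if (relaxed_fpr < maximum_fpr)
        throw std::invalid_argument{"relaxed_fpr must be greater than or equal to maximum_fpr."};
}

// Convenience wrapper that turns the exception into a boolean.
inline bool is_valid(double const maximum_fpr, double const relaxed_fpr)
{
    try
    {
        validate_fpr(maximum_fpr, relaxed_fpr);
        return true;
    }
    catch (std::invalid_argument const &)
    {
        return false;
    }
}
```

Note that equal values are accepted: setting `relaxed_fpr` to the same value as `maximum_fpr` effectively disables the relaxation.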
11 changes: 6 additions & 5 deletions src/hierarchical_interleaved_bloom_filter.cpp
@@ -47,7 +47,7 @@ size_t hierarchical_build(hierarchical_interleaved_bloom_filter & hibf,

auto initialise_max_bin_kmers = [&]() -> size_t
{
if (current_node.favourite_child_idx.has_value()) // max bin is a merged bin
if (current_node.max_bin_is_merged())
{
// recursively initialize favourite child first
ibf_positions[current_node.max_bin_index] =
@@ -91,7 +91,7 @@ size_t hierarchical_build(hierarchical_interleaved_bloom_filter & hibf,

// We do not want to process the favourite child. It has already been processed prior.
// https://godbolt.org/z/6Yav7hrG1
if (current_node.favourite_child_idx.has_value())
if (current_node.max_bin_is_merged())
std::erase(indices, current_node.favourite_child_idx.value());

if (is_root)
@@ -127,7 +127,7 @@ size_t hierarchical_build(hierarchical_interleaved_bloom_filter & hibf,
loop_over_children();

// If max bin was a merged bin, process all remaining records, otherwise the first one has already been processed
size_t const start{(current_node.favourite_child_idx.has_value()) ? 0u : 1u};
size_t const start{(current_node.max_bin_is_merged()) ? 0u : 1u};
for (size_t i = start; i < current_node.remaining_records.size(); ++i)
{
auto const & record = current_node.remaining_records[i];
@@ -182,8 +182,9 @@ void build_index(hierarchical_interleaved_bloom_filter & hibf,
layout::graph::node const & root_node = data.ibf_graph.root;

size_t const t_max{root_node.number_of_technical_bins};
data.fpr_correction = layout::compute_fpr_correction(
{.fpr = config.maximum_false_positive_rate, .hash_count = config.number_of_hash_functions, .t_max = t_max});
data.fpr_correction = layout::compute_fpr_correction({.fpr = config.maximum_fpr, //
.hash_count = config.number_of_hash_functions,
.t_max = t_max});

hierarchical_build(hibf, root_node, data);
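
The FPR correction is still computed from `maximum_fpr` only, since it applies to split bins. One way such a correction factor can be derived is sketched below; this is an assumption for illustration and is not confirmed to be what `layout::compute_fpr_correction` does. If a user bin is split into `s` technical bins and a query counts as positive when any of them answers positively, each technical bin needs a per-bin rate `p' = 1 - (1 - p)^(1/s)` so that the combined rate stays at `p`. The factor is the ratio of the bits needed at `p'` to the bits needed at `p`.

```cpp
#include <cmath>
#include <cstddef>

// ln(1 - p^(1/k)): the term in the denominator of the textbook Bloom filter size formula.
inline double log_term(double const fpr, std::size_t const hash_count)
{
    return std::log(1.0 - std::exp(std::log(fpr) / static_cast<double>(hash_count)));
}

// Hypothetical sketch of a size correction factor for a user bin split into `splits` technical bins.
inline double split_correction(double const fpr, std::size_t const hash_count, std::size_t const splits)
{
    // Per-bin FPR such that the probability of at least one false positive in `splits` bins equals `fpr`.
    double const per_bin_fpr = 1.0 - std::pow(1.0 - fpr, 1.0 / static_cast<double>(splits));

    // Size is proportional to -1 / log_term, so the size ratio is log_term(fpr) / log_term(per_bin_fpr).
    return log_term(fpr, hash_count) / log_term(per_bin_fpr, hash_count);
}
```

For a single bin the factor is 1 (up to floating-point rounding), which is consistent with the comment in `construct_ibf.cpp` that `data.fpr_correction[1] == 1.0`.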

2 changes: 1 addition & 1 deletion src/interleaved_bloom_filter.cpp
@@ -59,7 +59,7 @@ size_t max_bin_size(config & configuration)
max_size = std::max(max_size, kmers.size());
}

return build::bin_size_in_bits({.fpr = configuration.maximum_false_positive_rate,
return build::bin_size_in_bits({.fpr = configuration.maximum_fpr, //
.hash_count = configuration.number_of_hash_functions,
.elements = max_size});
}
1 comment on commit c725f38

@vercel vercel bot commented on c725f38 Oct 23, 2023

Successfully deployed to the following URLs:

hibf – ./

hibf-seqan.vercel.app
hibf.vercel.app
hibf-git-main-seqan.vercel.app
