Rewrite detset limit code for discover to never slow DFD down
The original implementation worked by removing search nodes for sets
above the given limit. For attributes with sets larger than that, or no
sets at all, this caused DFD to "ping-pong" against the limit boundary,
which is much less efficient than searching above the boundary as usual.

The new implementation works by forbidding sets above the given size
from being search path seeds, so that nodes for sets above the limit are
only visited when doing so helps resolve sets within the limit. This
requires filtering out the above-limit dependencies at the end of the
full search, and rarely makes the search faster than running without a
limit, but it never makes the search take longer.
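
Illustrative sketch of the new seed rule (the helper name and bitset
representation here are hypothetical, not the package's internals): a
set may seed a search path only if its size, i.e. its bitset popcount,
is within the limit.

    # sets are logical bitsets over the attributes; size = popcount
    seeds_within_limit <- function(candidate_seeds, detset_limit) {
      # forbid above-limit sets from seeding new search paths
      Filter(function(bitset) sum(bitset) <= detset_limit, candidate_seeds)
    }
    seeds_within_limit(
      list(c(TRUE, TRUE, FALSE, FALSE), c(TRUE, TRUE, TRUE, FALSE)),
      2
    )  # keeps only the first (size-2) seed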
CharnelMouse committed Nov 27, 2024
1 parent cfa3436 commit f423691
Showing 3 changed files with 58 additions and 34 deletions.
2 changes: 1 addition & 1 deletion NEWS.md
@@ -6,7 +6,7 @@
* Added an `as.character` method for `functional_dependency`. The optional `align_arrows` argument can add padding to one side, in order to make the arrows align when they're printed to different lines. These options are used to align arrows in its `print` method, and its `format` method for when printed as a data frame column.
* Added `==` and `!=` implementations for `functional_dependency`. These ignore differences in `attrs_order`: differently-ordered determinant sets are considered equal.
* Added a `dependants` argument to `discover`, which limits the functional dependency search to those with a dependant in the given set of column names, defaulting to all of them. This should significantly speed up searches where only some dependants are of interest.
* Added a `detset_limit` argument for `discover`/`autodb`, which limits the FD search to only look for dependencies with the determinant set size under a given limit. For DFD, this usually makes the run time dramatically larger, rather than smaller, but it will be useful when other search algorithms are implemented.
* Added a `detset_limit` argument for `discover`/`autodb`, which limits the FD search to only look for dependencies with determinant set size at most a given limit. For DFD, this usually doesn't significantly reduce the search time, but it won't make it worse. It will be useful once other search algorithms are implemented.

## Fixes

56 changes: 35 additions & 21 deletions R/discover.r
@@ -62,17 +62,17 @@
#'
#' ## Limiting the determinant set size
#'
#' Setting \code{detset_limit} smaller than the largest-possible value reduces
#' the size of the search tree for possible functional dependencies. The result
#' is that \code{discover(x, 1, detset_limit = n)} is equivalent to doing a full
#' search, \code{fds <- discover(x, 1)}, then filtering by determinant set size
#' post-hoc, \code{fds[lengths(detset(fds)) <= n]}.
#' Setting \code{detset_limit} smaller than the largest-possible value has
#' different behaviour for different search algorithms, but the result is
#' always that \code{discover(x, 1, detset_limit = n)} is equivalent to doing a
#' full search, \code{fds <- discover(x, 1)}, then filtering by determinant set
#' size post-hoc, \code{fds[lengths(detset(fds)) <= n]}.
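#'
#' For example, with a hypothetical data frame \code{x} and limit \code{n},
#' the limited search and the filtered full search are expected to agree:
#' \preformatted{
#'   limited <- discover(x, 1, detset_limit = n)
#'   full <- discover(x, 1)
#'   # TRUE, assuming both return the dependencies in the same order
#'   all(limited == full[lengths(detset(full)) <= n])
#' }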
#'
#' However, setting a lower value for \code{detset_limit} should be done with
#' caution: since it reduces the search tree, you might think that it should
#' also decrease the computation time, but this depends on the search algorithm
#' used. For DFD, the only algorithm currently implemented, the computation
#' often greatly increases!
#' For DFD, the naive implementation removes determinant sets larger than the
#' limit from the search tree of possible functional dependencies for each
#' dependant. However, this usually results in the search taking much more time
#' than without a limit.
#'
#' For example, suppose we search for determinant sets for a dependant that has
#' none (say, because the dependant is the only key for \code{df}). Using DFD,
@@ -87,7 +87,17 @@
#' With a smaller limit \eqn{k}, there are \eqn{\binom{n}{k}} maximum-size sets
#' to explore. Since a DFD search adds or removes one attribute at each step,
#' this means the search must take at least \eqn{k - 2 + 2\binom{n}{k}} steps,
#' which is usually much larger than \eqn{n}.
#' which is larger than \eqn{n} for all non-trivial cases \eqn{0 < k < n}.
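#'
#' As a rough worked example with hypothetical sizes \eqn{n = 20} and \eqn{k =
#' 10}, the bound is \code{10 - 2 + 2*choose(20, 10)}, i.e. 369520 steps,
#' whereas \eqn{n} itself is only 20.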
#'
#' We therefore use a different approach, where any determinant sets above the
#' size limit are not allowed to be candidate seeds for new search paths, and
#' any discovered dependencies with a size above the limit are discarded at the
#' end of the entire DFD search. This means that nodes for determinant sets
#' above the size limit are only visited in order to determine maximality of
#' non-dependencies within the size limit. It turns out to be rare that this
#' results in a significant speed-up, but it never results in the search having
#' to visit more nodes than it would without a size limit, so the average search
#' time is never made worse.
#' @param df a data.frame, the relation to evaluate.
#' @param accuracy a numeric in (0, 1]: the accuracy threshold required in order
#' to conclude a dependency.
@@ -214,7 +224,7 @@ discover <- function(
}
nonfixed <- setdiff(column_names, fixed)
valid_dependant_attrs <- intersect(dependants, nonfixed)
if (length(valid_dependant_attrs) == 0)
if (length(valid_dependant_attrs) == 0 || detset_limit < 1)
return(flatten(dependencies, column_names))

# For nonfixed attributes, all can be dependants, but
@@ -254,8 +264,7 @@
max_n_lhs_attrs,
nonempty_powerset,
"constructing powerset",
use_visited,
max_size = min(detset_limit, max_n_lhs_attrs)
use_visited
)
# cache generated powerset and reductions, otherwise we spend a lot
# of time duplicating reduction work
@@ -304,6 +313,7 @@
partitions,
compute_partitions,
bijection_candidate_nonfixed_indices,
detset_limit,
store_cache
)
if (lhss[[2 + store_cache]]) {
@@ -346,6 +356,7 @@
nonfixed,
column_names
)
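# discard above-limit dependencies: their nodes were only visited to help
# resolve determinant sets within the limit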
dependencies <- lapply(dependencies, \(x) x[lengths(x) <= detset_limit])
flatten(dependencies, column_names)
}

@@ -357,6 +368,7 @@ find_LHSs_dfd <- function(
partitions,
compute_partitions,
bijection_candidate_nonfixed_indices,
detset_limit,
store_cache = FALSE
) {
# The original library "names" nodes with their attribute set,
@@ -493,7 +505,7 @@
max_non_deps <- res[[5]]
node <- res[[1]]
}
seeds <- generate_next_seeds(max_non_deps, min_deps, simple_nodes, nodes)
seeds <- generate_next_seeds(max_non_deps, min_deps, simple_nodes, nodes, detset_limit)
}
if (store_cache)
list(
@@ -516,6 +528,7 @@ find_LHSs_tane <- function(
partitions,
compute_partitions,
bijection_candidate_nonfixed_indices,
detset_limit,
store_cache = FALSE
) {
# See find_LHSs_dfd for node categories, etc.
@@ -634,7 +647,7 @@ remove_pruned_supersets <- function(supersets, subsets, bitsets) {
supersets[!prune]
}

generate_next_seeds <- function(max_non_deps, min_deps, lhs_attr_nodes, nodes) {
generate_next_seeds <- function(max_non_deps, min_deps, lhs_attr_nodes, nodes, detset_limit) {
if (length(max_non_deps) == 0) {
# original DFD paper doesn't mention case where no maximal non-dependencies
# found yet, so this approach could be inefficient
@@ -652,7 +665,7 @@ generate_next_seeds <- function(max_non_deps, min_deps, lhs_attr_nodes, nodes) {
if (length(seeds) == 0)
seeds <- max_non_dep_c
else {
seeds <- cross_intersection(seeds, max_non_dep_c, nodes)
seeds <- cross_intersection(seeds, max_non_dep_c, nodes, detset_limit)
}
}
}
@@ -677,15 +690,16 @@ generate_next_seeds_tane <- function(seeds, nodes, min_deps) {
)
}

cross_intersection <- function(seeds, max_non_dep, powerset) {
cross_intersection <- function(seeds, max_non_dep, powerset, detset_limit) {
new_seed_full_indices <- integer()
for (dep in seeds) {
seed_bitset <- powerset$bits[[dep]]
for (set in max_non_dep) {
set_bitset <- powerset$bits[[set]]
new_seed_bitset_index <- to_bitset_index(which(
seed_bitset | set_bitset
))
new_seed_bitset <- seed_bitset | set_bitset
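# forbid above-limit sets from seeding new search paths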
if (sum(new_seed_bitset) > detset_limit)
next
new_seed_bitset_index <- to_bitset_index(which(new_seed_bitset))
new_seed_full_indices <- c(new_seed_full_indices, new_seed_bitset_index)
}
}
34 changes: 22 additions & 12 deletions man/discover.Rd

Some generated files are not rendered by default.
