Solves the maximum coverage problem.
Given a list of sets (over a common base set), the question is: which k sets should be chosen so that their union covers as much of the base set as possible?
The solution is one or more k-sized subsets of the given sets.
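For reference, a minimal brute-force sketch in Python of what "solving" means here; the function name and instance are illustrative only, not the tool's API (the tool itself is an optimized C++ implementation):

```python
from itertools import combinations

def max_coverage(sets, k):
    """Brute force: return (best coverage, all optimal k-tuples of set indices)."""
    best, argbest = -1, []
    for comb in combinations(range(len(sets)), k):
        covered = len(set().union(*(sets[i] for i in comb)))
        if covered > best:
            best, argbest = covered, [comb]
        elif covered == best:
            argbest.append(comb)
    return best, argbest

print(max_coverage([{"a", "b"}, {"c", "d"}, {"e", "f"}], 2))
# (4, [(0, 1), (0, 2), (1, 2)])
```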
cmake -DCMAKE_BUILD_TYPE=Release . && make
- cmake
- C++11 compliant compiler
- optional, for distributed parallelization:
  - gnu parallel
  - python
  - bash
See maxcover --help
Calculates maximum coverage of sets.
Author: Gabor Borbely, contact: [email protected]
The input is read from stdin, one line represents a set, a 'set' is a list of words separated by spaces.
Command line arguments:
-k <int> the exact number of sets you can choose. positive integer, default: 5
-p --progress --print prints progress periodically to stderr, default: false
--file -i --input <filename> default: "-"
-b --begin -f --from <indices> starting subset
-e --end -t --to <indices> ending subset, not inclusive
Example:
maxcover -k 2 < test/01.txt
Output:
4/6
0 1
0 2
1 2
The first line of the output is the achieved maximum coverage over the size of the union of all input sets (here 4 of the 6 base elements can be covered).
The rest of the output lists the optimal solutions: each line is a list of k indices (0-based) telling which sets to choose.
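For illustration only (the actual contents of test/01.txt are not reproduced here), an input consistent with this output is three pairwise disjoint two-element sets:
a b
c d
e f
Any two of these sets cover 4 of the 6 base elements, so all 3-choose-2 = 3 pairs are optimal.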
If you interrupt the calculation (CTRL+C), the best solution(s) found so far are printed, which are not guaranteed to be the maximum. You can later resume the search from where it stopped (see parallelization).
Simply checks all n-choose-k possibilities, where k is the number of sets to choose and n is the total number of sets to choose from (lines of input).
The key components of the algorithm are optimized at compile time. This bloats the compiled binary and imposes limits on the size of the sets and on the k parameter, but those limits can easily be raised. The number of sets (n) is not limited at compile time.
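The compile-time limits suggest (this is an assumption, not documented internals) that each set is stored as a fixed-width bitset: the union of a combination is then a bitwise OR, and its coverage a popcount. A Python analogue, with unbounded ints standing in for fixed-width bitsets:

```python
def to_masks(sets):
    """Map each set of words to an integer bitmask over the base set."""
    base = sorted(set().union(*sets))
    bit = {w: 1 << i for i, w in enumerate(base)}
    return [sum(bit[w] for w in s) for s in sets], len(base)

def coverage(masks, comb):
    """Union = bitwise OR, coverage = popcount."""
    union = 0
    for i in comb:
        union |= masks[i]
    return bin(union).count("1")

masks, base_size = to_masks([{"a", "b"}, {"c", "d"}, {"e", "f"}])
print(coverage(masks, (0, 1)), "/", base_size)  # 4 / 6
```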
The parallelization is done in the least fancy way possible, but it is quite effective.
For example, suppose you have 10 sets, k=5, and you want to work on 8 threads. The total of 10-choose-5 = 252 possibilities can be split into 8 subsets. The maximum over each subset of combinations can be calculated in parallel, and the final solution is the maximum of those 8 sub-solutions.
Since the combinations can be ordered lexicographically, one can represent a subset of combinations with an interval-like notation. For example:
"0 1 2 3 4" - "0 1 3 8 9"
"0 1 3 8 9" - "0 2 3 4 7"
"0 2 3 4 7" - "0 2 5 6 8"
"0 2 5 6 8" - "0 3 4 5 6"
"0 3 4 5 6" - "0 3 4 5 7"
"0 3 4 5 7" - "2 3 4 8 9"
"2 3 4 8 9" - "2 5 7 8 9"
"2 5 7 8 9" -
The last valid combination in this example is "5 6 7 8 9".
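In code, such an interval can be walked by repeatedly taking the lexicographic successor. A Python sketch of the semantics of the -b/-e boundaries (not the repo's C++ code):

```python
def next_combination(c, n):
    """Lexicographic successor of a k-combination of range(n), or None."""
    c, k = list(c), len(c)
    for i in reversed(range(k)):
        if c[i] < n - k + i:            # position i can still be incremented
            c[i] += 1
            for j in range(i + 1, k):   # reset the tail to its minimum
                c[j] = c[j - 1] + 1
            return tuple(c)
    return None                          # c was the last combination

def interval(begin, end, n):
    """Yield combinations from begin (inclusive) to end (exclusive);
    end=None means up to and including the last combination."""
    c = tuple(begin)
    while c is not None and c != end:
        yield c
        c = next_combination(c, n)

print(sum(1 for _ in interval((0, 1, 2, 3, 4), (0, 1, 3, 8, 9), 10)))  # 35
```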
In the example above the sizes of the splits are quite uneven (the right end of each interval is exclusive):
"0 1 2 3 4" - "0 1 3 8 9" -> 35
"0 1 3 8 9" - "0 2 3 4 7" -> 23
"0 2 3 4 7" - "0 2 5 6 8" -> 24
"0 2 5 6 8" - "0 3 4 5 6" -> 9
"0 3 4 5 6" - "0 3 4 5 7" -> 1
"0 3 4 5 7" - "2 3 4 8 9" -> 113
"2 3 4 8 9" - "2 5 7 8 9" -> 24
"2 5 7 8 9" - -> 23
-------------------------------
"0 1 2 3 4" - -> 252
So in this case the subtask "0 3 4 5 7" - "2 3 4 8 9" is the bottleneck.
Nonetheless, with these intervals you can break the search down into subtasks that can be calculated separately, even on remote machines.
As you can see, the effectiveness of the parallelization relies on splitting the subtasks evenly.
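Interval sizes need not be found by enumeration: the lexicographic rank of a combination has a closed form, and an interval's size is the difference of its endpoints' ranks (an even split then picks boundaries at ranks i*total/jobs by inverting this function). A sketch of the counting; presumably close to what generate_boundaries.py does, though the actual script may differ:

```python
from math import comb

def rank(c, n):
    """Number of k-combinations of range(n) that precede c lexicographically."""
    k, r, prev = len(c), 0, -1
    for i, ci in enumerate(c):
        for v in range(prev + 1, ci):
            r += comb(n - 1 - v, k - 1 - i)
        prev = ci
    return r

# interval size = difference of endpoint ranks (cf. the table above):
print(rank((0, 1, 3, 8, 9), 10) - rank((0, 1, 2, 3, 4), 10))  # 35
print(rank((2, 3, 4, 8, 9), 10) - rank((0, 3, 4, 5, 7), 10))  # 113
```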
A more even split would be:
"0 1 2 3 4" - "0 1 6 7 8" -> 52
"0 1 6 7 8" - "0 2 6 7 8" -> 35
"0 2 6 7 8" - "0 6 7 8 9" -> 38
"0 6 7 8 9" - "1 3 6 7 8" -> 52
"1 3 6 7 8" - "1 6 7 8 9" -> 18
"1 6 7 8 9" - "2 3 6 7 8" -> 17
"2 3 6 7 8" - "2 6 7 8 9" -> 18
"2 6 7 8 9" - -> 22
The pseudo-code for parallelization:
- break down the search space into intervals (use generate_boundaries.py)
- compute all of those intervals using test.sh
- find the maximum among the sub-tasks (with post_process.sh)
See exp.sh for an example; a self-contained Python analogue of the whole pipeline is sketched below.
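Purely as an illustration of the pipeline's shape (not the repo's scripts), the same three steps in self-contained Python, using multiprocessing in place of gnu parallel; the file name input.txt and all names here are hypothetical:

```python
from itertools import combinations, islice
from math import comb
from multiprocessing import Pool

# hypothetical input file: one set of space-separated words per line
SETS = [frozenset(line.split()) for line in open("input.txt")]
K, JOBS = 5, 8

def best_in_chunk(task):
    start, length = task
    # islice walks from the beginning; the real tool jumps straight
    # to a boundary combination via -b/-e instead.
    chunk = islice(combinations(range(len(SETS)), K), start, start + length)
    # keeps a single optimum per chunk; maxcover itself reports all ties
    return max((len(frozenset().union(*(SETS[i] for i in c))), c) for c in chunk)

if __name__ == "__main__":
    total = comb(len(SETS), K)
    bounds = [total * j // JOBS for j in range(JOBS + 1)]      # generate_boundaries.py step
    tasks = [(bounds[j], bounds[j + 1] - bounds[j]) for j in range(JOBS)]
    with Pool(JOBS) as pool:
        results = pool.map(best_in_chunk, tasks)               # test.sh step
    print(max(results))                                        # post_process.sh step
```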