use HashMapDictionary #397

Merged
merged 67 commits into from
Nov 6, 2023
d868b6b
move Default into dinctionary implementations
mkroetzsch Aug 28, 2023
035b745
new binary for testing dictionary code
mkroetzsch Aug 28, 2023
20556ab
automated updates
mkroetzsch Aug 28, 2023
53e37de
add detailed timing and wait when done
mkroetzsch Aug 28, 2023
508841c
new enum for Dictionary.add return type
mkroetzsch Aug 28, 2023
5df465c
more tests
mkroetzsch Aug 28, 2023
c68931e
fixed bug
mkroetzsch Aug 28, 2023
0bf04c6
renamed enum variants for EntryStatus
mkroetzsch Aug 28, 2023
18eae76
new buffer-based hashmap dictionary
mkroetzsch Sep 1, 2023
90a36b0
+pages for global string buffer
mkroetzsch Sep 1, 2023
916104c
docs
mkroetzsch Sep 1, 2023
608e5aa
corrected assertions
mkroetzsch Sep 1, 2023
2ad31e8
some cleanup and documentation
mkroetzsch Sep 1, 2023
a827a86
Merge branch 'dict-exp' of github.com:knowsys/nemo into dict-exp
larry-gonzalez Sep 1, 2023
e25d236
multi-buffer solution with some thread-safety
mkroetzsch Sep 2, 2023
0f84d08
small cosmetic changes
mkroetzsch Sep 2, 2023
f4377b8
make tmp string handling thread-safe
mkroetzsch Sep 3, 2023
c178b1f
initial work on meta dict, with some placeholder code
mkroetzsch Sep 5, 2023
942830e
some cleanup
mkroetzsch Sep 5, 2023
0942f5b
Merge branch 'dict-exp' of github.com:knowsys/nemo into dict-exp
larry-gonzalez Sep 5, 2023
9263970
compute bit mask instead of fixing it manually
mkroetzsch Sep 5, 2023
12fa2d8
remove is_empty() from Dictionary trait
mkroetzsch Sep 5, 2023
e54a4e0
update Dictionary interface
mkroetzsch Sep 5, 2023
44b45b1
update and extend Dictionary interface
mkroetzsch Sep 5, 2023
d6cdf7b
intermediate commit, expanded features of DictionaryString
mkroetzsch Sep 7, 2023
b7c8520
Fix DictionaryString; format
Sep 7, 2023
8279ff7
format
Sep 7, 2023
e8921f5
add tests
Sep 7, 2023
2602dfd
typo
Sep 7, 2023
45558f2
Merge branch 'dict-exp' of github.com:knowsys/nemo into dict-exp
larry-gonzalez Sep 13, 2023
14c9c7e
more work on infix-based dictoinaries
mkroetzsch Sep 21, 2023
6e8a210
improved tests
mkroetzsch Sep 21, 2023
553b0f0
use infix shorternings + some testing code
mkroetzsch Sep 21, 2023
731ffd2
some cleanup
mkroetzsch Sep 21, 2023
e7d231b
create infix dictionaries dynamically
mkroetzsch Sep 21, 2023
c23c51e
moved some outputs to logging
mkroetzsch Sep 22, 2023
f504109
faster lookup of suitable dictionary
mkroetzsch Sep 22, 2023
d1e4a88
some optimisations, leading to faster runtimes in meta
mkroetzsch Sep 22, 2023
7efad52
some further performance tweaks
mkroetzsch Sep 22, 2023
6a9739a
some speedup
mkroetzsch Sep 25, 2023
27210c0
some smaller optimizations
mkroetzsch Sep 29, 2023
8ca091c
Merge branch 'dict-exp' of github.com:knowsys/nemo into dict-exp
larry-gonzalez Oct 2, 2023
6b110a5
use MetaDictionary instead of StringDictionary; ignoring cfg no-prefi…
Oct 6, 2023
253b536
merge; fix merge conflict
Nov 1, 2023
b0a5122
use merge main; use nemo_physical::management::database::Dict in test…
Nov 1, 2023
f5f9c91
use HashMapDictionary as Dict
Nov 2, 2023
3c2af3f
inline methods
mkroetzsch Nov 2, 2023
e55086d
Merge branch 'dict-exp' of https://github.com/knowsys/nemo into dict-exp
mkroetzsch Nov 2, 2023
9054dd0
Merge branch 'main' into dict-exp
Nov 2, 2023
ac399bb
Merge branch 'dict-exp' of github.com:knowsys/nemo into dict-exp
Nov 2, 2023
69003ad
Merge branch 'dict-exp' of github.com:knowsys/nemo into dict-exp
larry-gonzalez Nov 2, 2023
7f31fdf
remove -no-prefixed-string-dictionary feature
larry-gonzalez Nov 2, 2023
dcfb04e
fix typo
larry-gonzalez Nov 2, 2023
06bf86b
add documentation; address code review
larry-gonzalez Nov 2, 2023
656c6a0
Improve documentation
larry-gonzalez Nov 2, 2023
ce1820c
addressing code review
larry-gonzalez Nov 2, 2023
c7cc8eb
addressing code review
Nov 3, 2023
b7d1e43
address code review: push unsafe inside
Nov 3, 2023
11e6f46
format
Nov 3, 2023
a744789
address code review: implementing Display; adding new lines; pushing …
Nov 3, 2023
1e812ba
address code review: adding new lines
Nov 3, 2023
41780cd
address code review: code simplification
Nov 3, 2023
b1a2a9f
implementing clyppy suggestions
Nov 3, 2023
e83a3b1
refactoring
Nov 3, 2023
4718138
address `cargo clippy` and `cargo doc` comments.
Nov 6, 2023
b52901f
addressing clippy
Nov 6, 2023
c5f805d
addressing code review: droping comments
Nov 6, 2023
57 changes: 57 additions & 0 deletions Cargo.lock

10 changes: 9 additions & 1 deletion nemo-benches/Cargo.toml
Original file line number Diff line number Diff line change
@@ -9,13 +9,21 @@ license.workspace = true
readme = "README.md"
repository.workspace = true

[[bin]]
name = "dict-bench"
path = "src/bin/dict-bench.rs"

[dependencies]
nemo-physical = { path = "../nemo-physical", default-features = false }
nemo = { path = "../nemo", default-features = false }
rand = "0.8.5"
flate2 = "1"
log = { version = "0.4", features = [ "max_level_trace", "release_max_level_trace" ] }
clap = { version = "4.0.32", features = [ "derive", "cargo", "env" ] }
colored = "2"
env_logger = "*"

[dev-dependencies]
env_logger = "*"
criterion = { version = "0.5", features = [ "html_reports" ] }
rand_pcg = "0.3"

8 changes: 4 additions & 4 deletions nemo-benches/benches/input.rs
@@ -7,7 +7,7 @@ use nemo_physical::{
builder_proxy::{
ColumnBuilderProxy, PhysicalBuilderProxyEnum, PhysicalStringColumnBuilderProxy,
},
-dictionary::PrefixedStringDictionary,
+dictionary::HashMapDictionary,
};
use rand::{distributions::Alphanumeric, prelude::*};
use rand_pcg::Pcg64;
@@ -35,7 +35,7 @@ pub fn benchmark_input(c: &mut Criterion) {
group.bench_function("read_strings", |b| {
b.iter_batched(
|| {
-let dict = std::cell::RefCell::new(PrefixedStringDictionary::default());
+let dict = std::cell::RefCell::new(HashMapDictionary::default());
(strings.clone(), dict)
},
|(input, dict)| {
@@ -53,7 +53,7 @@ pub fn benchmark_input(c: &mut Criterion) {
group.bench_function("read_terms", |b| {
b.iter_batched(
|| {
-let dict = std::cell::RefCell::new(PrefixedStringDictionary::default());
+let dict = std::cell::RefCell::new(HashMapDictionary::default());
(terms.clone(), dict)
},
|(input, dict)| {
@@ -71,7 +71,7 @@ pub fn benchmark_input(c: &mut Criterion) {
group.bench_function("read_iris", |b| {
b.iter_batched(
|| {
-let dict = std::cell::RefCell::new(PrefixedStringDictionary::default());
+let dict = std::cell::RefCell::new(HashMapDictionary::default());
(iris.clone(), dict)
},
|(input, dict)| {
119 changes: 119 additions & 0 deletions nemo-benches/src/bin/dict-bench.rs
@@ -0,0 +1,119 @@
use flate2::read::MultiGzDecoder;
use std::env;
use std::fs::File;
use std::io::prelude::*;
use std::io::stdin;
use std::io::BufReader;

use nemo::meta::{timing::TimedDisplay, TimedCode};
use nemo_physical::dictionary::{
hash_map_dictionary::HashMapDictionary, meta_dictionary::MetaDictionary,
prefixed_string_dictionary::PrefixedStringDictionary, string_dictionary::StringDictionary,
AddResult, Dictionary,
};

fn create_dictionary(dict_type: &str) -> Box<dyn Dictionary> {
match dict_type {
"hash" => {
println!("Using StringDictionary.");
Box::new(StringDictionary::new())
}
"hashmap" => {
println!("Using HashMapDictionary.");
Box::new(HashMapDictionary::new())
}
"prefix" => {
println!("Using PrefixedStringDictionary.");
Box::new(PrefixedStringDictionary::new())
}
"meta" => {
println!("Using MetaDictionary.");
Box::new(MetaDictionary::new())
}
_ => panic!("Unexpected dictionary type '{}'.", dict_type),
}
}

fn main() {
env_logger::init();
TimedCode::instance().start();

let args: Vec<_> = env::args().collect();
if args.len() < 3 {
println!("Usage: dict-bench <filename> <dicttype> <nonstop>");
println!(
" <filename> File with dictionary entries, one per line, possibly with duplicates."
);
println!(
" <dicttype> Identifier for the dictionary to test, e.g., \"hash\" or \"prefix\"."
);
println!(
" <nonstop> If anything is given here, the program will terminate without asking for a prompt."
);
// Exit after printing usage; indexing args[1] and args[2] below would panic otherwise.
return;
}

let filename = &args[1];
let dicttype = &args[2];

let reader = BufReader::new(MultiGzDecoder::new(
File::open(filename).expect("Cannot open file."),
));

let mut dict = create_dictionary(dicttype);
let mut count_lines = 0;
let mut count_unique = 0;
let mut bytes = 0;

TimedCode::instance().sub("Dictionary filling").start();

println!("Starting to fill dictionary ...");

for l in reader.lines() {
let s = l.unwrap();
let b = s.len();

let entry_status = dict.add_string(s);
match entry_status {
AddResult::Fresh(_value) => {
bytes += b;
count_unique += 1;
}
AddResult::Known(_value) => {}
AddResult::Rejected => {}
}

count_lines += 1;
}

TimedCode::instance().sub("Dictionary filling").stop();

println!(
"Processed {} strings (dictionary contains {} unique strings with {} bytes overall).",
count_lines, count_unique, bytes
);

TimedCode::instance().stop();

println!(
"\n{}",
TimedCode::instance().create_tree_string(
"dict-bench",
&[
TimedDisplay::default(),
TimedDisplay::default(),
TimedDisplay::new(nemo::meta::timing::TimedSorting::LongestThreadTime, 0)
]
)
);

if args.len() < 4 {
println!("All done. Press return to end benchmark (and free all memory).");
let mut s = String::new();
stdin().read_line(&mut s).expect("No string entered?");
}

if dict.len() == 123456789 {
// FWIW, prevent dict from going out of scope before really finishing
println!("Today is your lucky day.");
}
}
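
The benchmark above relies on `add_string` reporting whether a string was new (`AddResult::Fresh`), already present (`AddResult::Known`), or refused (`AddResult::Rejected`). The following is a minimal, self-contained sketch of that contract, not the code merged in this PR: `SketchDictionary`, its two fields, and the `fill` helper are hypothetical names introduced for illustration, and only the method names `add_string`, `fetch_id`, `len` and the three variant names are taken from the diff.

```rust
use std::collections::HashMap;

/// Outcome of an insertion; variant names follow the `match` in dict-bench.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddResult {
    Fresh(usize),
    Known(usize),
    /// Never produced by this sketch; kept so the `match` shape is the same.
    Rejected,
}

/// Hypothetical stand-in for a hash-map-backed dictionary (assumption, not nemo's type).
#[derive(Default)]
pub struct SketchDictionary {
    ids: HashMap<String, usize>,
    strings: Vec<String>,
}

impl SketchDictionary {
    /// Returns `Known(id)` for an existing string, otherwise assigns the next id.
    pub fn add_string(&mut self, s: String) -> AddResult {
        if let Some(&id) = self.ids.get(&s) {
            return AddResult::Known(id);
        }
        let id = self.strings.len();
        self.strings.push(s.clone());
        self.ids.insert(s, id);
        AddResult::Fresh(id)
    }

    /// Looks up an id without inserting anything.
    pub fn fetch_id(&self, s: &str) -> Option<usize> {
        self.ids.get(s).copied()
    }

    /// Number of distinct strings stored.
    pub fn len(&self) -> usize {
        self.strings.len()
    }
}

/// Same counting logic as the benchmark loop: (lines, unique strings, bytes of unique strings).
pub fn fill(dict: &mut SketchDictionary, lines: &[&str]) -> (usize, usize, usize) {
    let (mut count_lines, mut count_unique, mut bytes) = (0, 0, 0);
    for line in lines {
        let b = line.len();
        match dict.add_string(line.to_string()) {
            AddResult::Fresh(_) => {
                bytes += b;
                count_unique += 1;
            }
            AddResult::Known(_) | AddResult::Rejected => {}
        }
        count_lines += 1;
    }
    (count_lines, count_unique, bytes)
}

fn main() {
    let mut dict = SketchDictionary::default();
    let (lines, unique, bytes) = fill(&mut dict, &["a", "bb", "a"]);
    println!("Processed {lines} strings ({unique} unique, {bytes} bytes).");
    println!("len = {}, id of \"bb\" = {:?}", dict.len(), dict.fetch_id("bb"));
}
```

Counting bytes only in the `Fresh` arm is what makes the reported total reflect stored data rather than raw input size.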
3 changes: 0 additions & 3 deletions nemo-cli/Cargo.toml
@@ -9,9 +9,6 @@ license.workspace = true
readme = "README.md"
repository.workspace = true

-[features]
-no-prefixed-string-dictionary = ["nemo/no-prefixed-string-dictionary"]
-
[[bin]]
name = "nmo"
path = "src/main.rs"
3 changes: 2 additions & 1 deletion nemo-physical/Cargo.toml
@@ -14,7 +14,6 @@ default = ["timing"]
# Enables time measurements using the "howlong" crate
# If this feature is not enabled, all time measurements will display zero instead
timing = ["dep:howlong"]
-no-prefixed-string-dictionary = []

[dependencies]
log = "0.4"
@@ -24,10 +23,12 @@ num = "0.4.0"
ascii_tree = "0.1.1"
once_cell = "1"
linked-hash-map = "0.5.6"
+lru = "0.11.1"
howlong = { version = "0.1", optional = true }
rio_turtle = "0.8.4"
rio_xml = "0.8.4"
reqwest = "0.11.18"
+regex = "1.9.5"

[dev-dependencies]
arbitrary = { version = "1", features = ["derive"] }
2 changes: 1 addition & 1 deletion nemo-physical/src/arithmetic/traits.rs
@@ -114,7 +114,7 @@ impl CheckedPow for usize {

impl CheckedPow for u8 {
fn checked_pow(self, exponent: Self) -> Option<Self> {
-num::checked_pow(self, exponent.try_into().ok()?)
+num::checked_pow(self, exponent.into())
}
}
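
The change above swaps a fallible conversion for an infallible one: widening a `u8` exponent can never fail, so `into()` is enough and the `try_into().ok()?` detour adds nothing. The sketch below illustrates the same point using the standard library's `u8::checked_pow` (which takes a `u32` exponent) instead of `num::checked_pow`, so it runs without external crates; the two function names are hypothetical.

```rust
/// Infallible widening: `u32: From<u8>` exists, so `into()` cannot fail.
fn checked_pow_u8(base: u8, exponent: u8) -> Option<u8> {
    base.checked_pow(exponent.into())
}

/// Equivalent result via the fallible route; the `ok()?` branch is never taken for u8 -> u32.
fn checked_pow_u8_fallible(base: u8, exponent: u8) -> Option<u8> {
    base.checked_pow(u32::try_from(exponent).ok()?)
}

fn main() {
    // 2^7 fits into u8, 2^8 overflows and yields None.
    println!("{:?}", checked_pow_u8(2, 7));
    println!("{:?}", checked_pow_u8(2, 8));
    // Both variants agree on every input pair.
    let same = (0..=u8::MAX).all(|b| {
        (0..=u8::MAX).all(|e| checked_pow_u8(b, e) == checked_pow_u8_fallible(b, e))
    });
    println!("variants agree: {same}");
}
```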

8 changes: 7 additions & 1 deletion nemo-physical/src/builder_proxy.rs
@@ -95,7 +95,13 @@ impl ColumnBuilderProxy<PhysicalString> for PhysicalStringColumnBuilderProxy<'_>
generic_trait_impl_without_add!(VecT::U64);
fn add(&mut self, input: PhysicalString) -> Result<(), ReadingError> {
self.commit();
-self.value = Some(self.dict.borrow_mut().add(input.into()).try_into()?);
+self.value = Some(
+self.dict
+.borrow_mut()
+.add_string(input.into())
+.value()
+.try_into()?,
+);
Ok(())
}
}
9 changes: 7 additions & 2 deletions nemo-physical/src/datatypes/data_value.rs
@@ -73,7 +73,12 @@ impl DataValueT {
match self {
Self::String(val) => {
// dictionary indices
-StorageValueT::U64(dict.add(val.clone().into()).try_into().unwrap())
+StorageValueT::U64(
+dict.add_string(val.clone().into())
+.value()
+.try_into()
+.unwrap(),
+)
}
Self::U32(val) => StorageValueT::U32(*val),
Self::U64(val) => StorageValueT::U64(*val),
@@ -88,7 +93,7 @@ impl DataValueT {
match self {
Self::String(val) => Some(StorageValueT::U64(
// dictionary indices
-dict.index_of(val.into())?.try_into().unwrap(),
+dict.fetch_id(val.into())?.try_into().unwrap(),
)),
Self::U32(val) => Some(StorageValueT::U32(*val)),
Self::U64(val) => Some(StorageValueT::U64(*val)),