Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Milestone-3] Serokell: New OrderedSet.mo & OrderedMap fixup #662

Closed
wants to merge 52 commits into from

Conversation

GoPavel
Copy link

@GoPavel GoPavel commented Oct 7, 2024

This is an MR for the 3rd Milestone of the Serokell's grant about improving Motoko's base library.

The main goal of the PR is to introduce a new functional implementation of the set data structure to the' base' library. Also, it brings a few changes to the new functional map that was added in #664 , #654 .

General changes:

  • rename PersistentOrderedMap to OrderedMap (same for the OrderedSet)
  • improve docs

Functional Map changes:

New functionality:

  • add any/all functions
  • add contains function
  • add minEntry/maxEntry

Optimizations:

Fixup:

  • add entriesRev(), remove iter()

NEW functional Set:

The new data structure implements an ordered set interface using Red-Black trees as well as the new functional map from the 1-2 Milestones.

API implemented:

  • Basic operations (based on the map): put, delete, contains, fromIter, etc
  • Maps and folds: map, mapFilter, foldLeft, foldRight
  • Set operations: union , intersect, diff, isSubset, equal
  • Additional operations (as for the OrderedMap): min/max, all/some

Maintainance support:

  • Unit, property tests
  • Documentation

Applied optimizations:

Rejected optimizations:

  • Nipkow's implementation of set operation [Tobias Nipkow's "Functional Data Structures and Algorithms", 117].
    Initially, we were planning to use an implementation of set operations (intersect, union, diff) from Nipkow's book. However, the experiment shows that naive implementation with a simple size heuristic performs better. The benchmark results are comparing 3 versions:
    • persistentset_baseline -- original implementation that uses Nipkow's algorithms. However, the black height is calculated before each set operation (the book assumes it's stored).
    • persistentset_bh -- the same as the baseline but the black height is stored in each node.
    • persistentset -- naive implementation that looks up in a smaller set and modifies a bigger one (it gives us O(min(n,m)log((max(n,m)) which is very close to Nipkow's version). Sizes of sets are also stored but only in the root.
      The last one outperforms others and keeps a tree slim in terms of byte size. Thus, we have picked it.

Final benchmark results:

Collection benchmarks

binary_size generate max mem batch_get 50 batch_put 50 batch_remove 50 upgrade
orderedset+100 218_168 186_441 37_916 53_044 121_237 127_460 346_108
trieset+100 211_245 574_022 47_652 131_218 288_429 268_499 729_696
orderedset+1000 218_168 2_561_296 520_364 69_883 158_349 170_418 3_186_579
trieset+1000 211_245 7_374_045 633_440 162_806 383_594 375_264 9_178_466
orderedset+10000 218_168 40_015_301 320_532 84_660 192_931 215_592 31_522_120
trieset+10000 211_245 105_695_670 682_792 192_931 457_923 462_594 129_453_045
orderedset+100000 218_168 476_278_087 3_200_532 98_553 230_123 258_372 409_032_232
trieset+100000 211_245 1_234_038_235 6_826_516 222_247 560_440 549_813 1_525_692_388
orderedset+1000000 218_168 5_514_198_432 32_000_532 115_836 268_236 306_896 4_090_302_778
trieset+1000000 211_245 13_990_048_548 68_228_312 252_211 650_405 642_099 17_455_845_492

set API

size intersect union diff equals isSubset
orderedset+100 100 146_264 157_544 215_871 28_117 27_726
trieset+100 100 352_496 411_306 350_935 201_896 201_456
orderedset+1000 1000 162_428 194_198 286_747 242_329 241_938
trieset+1000 1000 731_650 1_079_906 912_629 2_589_090 4_023_673
orderedset+10000 10000 177_080 231_070 345_529 2_383_587 2_383_591
trieset+10000 10000 3_986_854 21_412_306 5_984_106 46_174_710 31_885_381
orderedset+100000 100000 190_727 267_008 402_081 91_300_348 91_300_393
trieset+100000 100000 178_863_894 209_889_623 199_028_396 521_399_350 521_399_346
orderedset+1000000 1000000 205_022 304_937 464_859 912_901_595 912_901_558
trieset+1000000 1000000 1_782_977_198 2_092_850_787 1_984_818_266 5_813_335_155 5_813_335_151

new set API

size foldLeft foldRight mapfilter map
orderedset 100 16_487 16_463 88_028 224_597
orderedset 1000 133_685 131_953 1_526_510 4_035_782
orderedset 10000 1_305_120 1_287_495 28_455_361 51_527_733
orderedset 100000 13_041_665 12_849_418 344_132_505 630_692_463
orderedset 1000000 130_428_573 803_454_777 4_019_592_041 7_453_944_902

@GoPavel GoPavel changed the title Serokell: New PersistentOrderedSet.mo lib [Draft] Serokell: New PersistentOrderedSet.mo lib Oct 7, 2024
@GoPavel GoPavel changed the title [Draft] Serokell: New PersistentOrderedSet.mo lib Draft: [Milestone-3] Serokell: New PersistentOrderedSet.mo lib Oct 22, 2024
@GoPavel GoPavel changed the title Draft: [Milestone-3] Serokell: New PersistentOrderedSet.mo lib Draft: [Milestone-3] Serokell: New OrderedSet.mo lib Nov 5, 2024
@GoPavel GoPavel changed the title Draft: [Milestone-3] Serokell: New OrderedSet.mo lib [Milestone-3] Serokell: New OrderedSet.mo lib Nov 5, 2024
@GoPavel GoPavel changed the title [Milestone-3] Serokell: New OrderedSet.mo lib [Milestone-3] Serokell: New OrderedSet.mo & OrderedMap fixup Nov 5, 2024
@crusso
Copy link
Contributor

crusso commented Nov 5, 2024

One quick observation. subset and equality seem very expensive: I wonder if that's because the isSubsetHelper still uses an iterator, that allocates. Maybe just use recursion again?

Perhaps there's even faster way but I don't know enough about red-black trees to suggest one. If the representation is canonical, maybe you can exploit that. But I'm not sure it is or depends on the order of insertions/deletions. Do you know?

@GoPavel
Copy link
Author

GoPavel commented Nov 6, 2024

One quick observation. subset and equality seem very expensive: I wonder if that's because the isSubsetHelper still uses an iterator, that allocates. Maybe just use recursion again?

Perhaps there's even faster way but I don't know enough about red-black trees to suggest one. If the representation is canonical, maybe you can exploit that. But I'm not sure it is or depends on the order of insertions/deletions. Do you know?

It's not canonical, so we cannot have just a linear one-to-one comparison. I've experimented with a couple of alternatives I think it works pretty well now. I've managed to use the order of nodes in the tree to optimize it a bit.

I have updated the benchmark results in the description, for the comparison of the version before and after the last commit see serokell#37 .

@crusso
Copy link
Contributor

crusso commented Nov 6, 2024

It's not canonical, so we cannot have just a linear one-to-one comparison. I've experimented with a couple of alternatives I think it works pretty well now. I've managed to use the order of nodes in the tree to optimize it a bit.

I have updated the benchmark results in the description, for the comparison of the version before and after the last commit see serokell#37 .

That's much better indeed! Thank you!

@GoPavel
Copy link
Author

GoPavel commented Nov 8, 2024

Also, I've optimized intersect for Set and fixed doc comments: serokell#39

src/OrderedMap.mo Outdated Show resolved Hide resolved
};

/// Test helpers
public module MapDebug {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this module into the test dir, to avoid polluting the library?

src/OrderedSet.mo Outdated Show resolved Hide resolved
src/OrderedSet.mo Outdated Show resolved Hide resolved
src/OrderedSet.mo Outdated Show resolved Hide resolved
src/OrderedMap.mo Outdated Show resolved Hide resolved
src/OrderedMap.mo Outdated Show resolved Hide resolved
src/OrderedMap.mo Outdated Show resolved Hide resolved
src/OrderedMap.mo Outdated Show resolved Hide resolved
src/OrderedMap.mo Outdated Show resolved Hide resolved
Sereja313 and others added 19 commits November 13, 2024 16:21
Store the set size in the root node and make set operations to use the
set size instead of calculating black height.
* rename elements() -> vals()
* add valsRev()
* Hide helper types for iterations into the Internal module
* rename `rbSet` to `set` or `s`
* add import `Debug` into examples
* fix typos
Also:
* fix code format
* fix typo
* refactor internal functions a bit
Co-authored-by: Claudio Russo <[email protected]>
auto-merge was automatically disabled November 13, 2024 15:21

Head branch was pushed to by a user without write access

@crusso crusso enabled auto-merge (squash) November 13, 2024 15:22
@crusso crusso disabled auto-merge November 13, 2024 15:24
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the adjustments!

crusso added a commit that referenced this pull request Nov 13, 2024
This is an MR for the 3rd Milestone of the Serokell's grant about
improving Motoko's base library.

The main goal of the PR is to introduce a new functional implementation
of the set data structure to the' base' library. Also, it brings a few
changes to the new functional map that was added in #664 , #654 .

# General changes:

* rename `PersistentOrderedMap` to `OrderedMap` (same for the
`OrderedSet`)
* improve docs

# Functional Map changes:

## New functionality:
+ add `any`/`all` functions
+ add `contains` function
+ add `minEntry`/`maxEntry`

## Optimizations:
+ Store `size` in the Map, [benchmark
results](serokell#35)

## Fixup: 
+ add `entriesRev()`, remove `iter()`

# NEW functional Set:

The new data structure implements an ordered set interface using
Red-Black trees as well as the new functional map from the 1-2
Milestones.

## API implemented:
* Basic operations (based on the map): `put`, `delete`, `contains`,
`fromIter`, etc
* Maps and folds: `map`, `mapFilter`, `foldLeft`, `foldRight`
* Set operations: `union` , `intersect`, `diff`, `isSubset`, `equal`
* Additional operations (as for the `OrderedMap`): `min`/`max`,
`all`/`some`

## Maintainance support:
* Unit, property tests
* Documentation

## Applied optimizations:

* Same optimizations that were useful for the functional map:
   * inline node color
   * float-out exceeded matching in iteration
   * `map`/`filterMap` through `foldLeft`
   * direct recursion in `foldLeft`
* [Benchmark results for all four optimizations
together](serokell#27)
* store size in the root of the tree, [benchmark
results](serokell#36 (comment))
* Pattern matching order optimization, [benchmark
results](serokell#36 (comment))
 * Other optimizations:
* Inline code of `OrderedMap` instead of sharing it, [benchmark
results](serokell#25)
* `intersect` optimization: use order of output values to build the
resulting tree faster, see
serokell#39
* `isSubset`, `equal` optimization: use early exit and use order of
subtrees to reduce intermediate tree height, see
serokell#37

## Rejected optimizations:

* Nipkow's implementation of set operation [Tobias Nipkow's "Functional
Data Structures and Algorithms", 117].
Initially, we were planning to use an implementation of set operations
(`intersect`, `union`, `diff`) from Nipkow's book. However, the
experiment shows that naive implementation with a simple size heuristic
performs better. [The benchmark
results](serokell#33) are comparing
3 versions:
* persistentset_baseline -- original implementation that uses Nipkow's
algorithms. However, the black height is calculated before each set
operation (the book assumes it's stored).
* persistentset_bh -- the same as the baseline but the black height is
stored in each node.
* persistentset -- naive implementation that looks up in a smaller set
and modifies a bigger one (it gives us `O(min(n,m)log((max(n,m))` which
is very close to Nipkow's version). Sizes of sets are also stored but
only in the root.
The last one outperforms others and keeps a tree slim in terms of byte
size. Thus, we have picked it.

## Final benchmark results:

### Collection benchmarks

| |binary_size|generate|max mem|batch_get 50|batch_put 50|batch_remove
50|upgrade|
|--:|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|218_168|186_441|37_916|53_044|121_237|127_460|346_108|
|trieset+100|211_245|574_022|47_652|131_218|288_429|268_499|729_696|

|orderedset+1000|218_168|2_561_296|520_364|69_883|158_349|170_418|3_186_579|

|trieset+1000|211_245|7_374_045|633_440|162_806|383_594|375_264|9_178_466|

|orderedset+10000|218_168|40_015_301|320_532|84_660|192_931|215_592|31_522_120|

|trieset+10000|211_245|105_695_670|682_792|192_931|457_923|462_594|129_453_045|

|orderedset+100000|218_168|476_278_087|3_200_532|98_553|230_123|258_372|409_032_232|

|trieset+100000|211_245|1_234_038_235|6_826_516|222_247|560_440|549_813|1_525_692_388|

|orderedset+1000000|218_168|5_514_198_432|32_000_532|115_836|268_236|306_896|4_090_302_778|

|trieset+1000000|211_245|13_990_048_548|68_228_312|252_211|650_405|642_099|17_455_845_492|

### set API

| |size|intersect|union|diff|equals|isSubset|
|--:|--:|--:|--:|--:|--:|--:|
|orderedset+100|100|146_264|157_544|215_871|28_117|27_726|
|trieset+100|100|352_496|411_306|350_935|201_896|201_456|
|orderedset+1000|1000|162_428|194_198|286_747|242_329|241_938|
|trieset+1000|1000|731_650|1_079_906|912_629|2_589_090|4_023_673|
|orderedset+10000|10000|177_080|231_070|345_529|2_383_587|2_383_591|

|trieset+10000|10000|3_986_854|21_412_306|5_984_106|46_174_710|31_885_381|
|orderedset+100000|100000|190_727|267_008|402_081|91_300_348|91_300_393|

|trieset+100000|100000|178_863_894|209_889_623|199_028_396|521_399_350|521_399_346|

|orderedset+1000000|1000000|205_022|304_937|464_859|912_901_595|912_901_558|

|trieset+1000000|1000000|1_782_977_198|2_092_850_787|1_984_818_266|5_813_335_155|5_813_335_151|

### new set API

| |size|foldLeft|foldRight|mapfilter|map|
|--:|--:|--:|--:|--:|--:|
|orderedset|100|16_487|16_463|88_028|224_597|
|orderedset|1000|133_685|131_953|1_526_510|4_035_782|
|orderedset|10000|1_305_120|1_287_495|28_455_361|51_527_733|
|orderedset|100000|13_041_665|12_849_418|344_132_505|630_692_463|
|orderedset|1000000|130_428_573|803_454_777|4_019_592_041|7_453_944_902|

---------

Co-authored-by: Andrei Borzenkov <[email protected]>
Co-authored-by: Andrei Borzenkov <[email protected]>
Co-authored-by: Sergey Gulin <[email protected]>
Co-authored-by: Claudio Russo <[email protected]>
@crusso
Copy link
Contributor

crusso commented Nov 13, 2024

Squashed and fast-forward merged with commit 1961fab

(see https://dfinity.slack.com/archives/D07MAEML9RD/p1731527989433379)

@crusso crusso closed this Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants