Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OAK-11602 : removed usage of Guava's ImmutableSet.copyOf with LinkedSet #2178

Open
wants to merge 3 commits into
base: trunk
Choose a base branch
from

Conversation

rishabhdaim
Copy link
Contributor

No description provided.

Copy link

Commit-Check ✔️

Copy link
Contributor

@nfsantos nfsantos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this replacement is too dangerous. It changes significantly the semantics and performance characteristics.

  • Changes mutability: LinkedHashSet is not immutable.
  • Changes performance: the sets created by ImmutableSet.copyOf() are highly optimized. There are special implementations for empty and one-element sets. And the implementation for general sized sets is also optimised, taking advantage of the set being immutable. It stores the elements in an array, which is more compact and faster to access than in a LinkedHashSet.

This might have unintended consequences in performance and even in correctness.

@reschke
Copy link
Contributor

reschke commented Mar 13, 2025

The tricky questions remain: should we review the code to check what semantics are really needed (a bit a dangerous, does not scale well, but may clarify the code). should we check whether performance really matters?

FWIW, when I replaced other collection utilities, I did exacly that, and apparently did not break anything. See, for instance, https://issues.apache.org/jira/browse/OAK-11278. Note that in this case, I did the zero/one/more variants separately (evidently zero or one arguments are easier to argue about).

@reschke
Copy link
Contributor

reschke commented Mar 13, 2025

Regarding mutability: there are many cases where it's trivial to see that it does not matter. For instance, when the set is just used once in a chain of collection transformations.

@nfsantos
Copy link
Contributor

The tricky questions remain: should we review the code to check what semantics are really needed (a bit a dangerous, does not scale well, but may clarify the code). should we check whether performance really matters?

As these changes are intended to be purely mechanical, I don't think we should be looking at how the objects are used. That does not scale and it is very error-prone, as we may miss some hidden assumptions about the behaviour of the APIs that are being changed. The safest way would be to preserve the semantics and performance characteristics as much as possible. So in this case, the Set should still be immutable and should have similar performance and memory usage. Replacing the Guava sets by the sets created by Set.of() would be closer to the original semantic and performance characteristics.

FWIW, when I replaced other collection utilities, I did exacly that, and apparently did not break anything. See, for instance, https://issues.apache.org/jira/browse/OAK-11278. Note that in this case, I did the zero/one/more variants separately (evidently zero or one arguments are easier to argue about).

This PR replaces ImmutableMap with Map.of(), which are much closer, both are immutable and have similar performance characteristics. But it was still dangerous because the iteration behaviour of the maps created by Map.of() is not the same as the ones created by ImmutableMap, and some code may rely on a particular iteration order, even though this is not guaranteed by the Map spec. This is a subtle change which may not be caught in testing, and may end up causing hard to debug problems in production.

@rishabhdaim
Copy link
Contributor Author

I think this replacement is too dangerous. It changes significantly the semantics and performance characteristics.

* Changes mutability: LinkedHashSet is not immutable.

* Changes performance: the sets created by ImmutableSet.copyOf() are highly optimized. There are special implementations for empty and one-element sets. And the implementation for general sized sets is also optimised, taking advantage of the set being immutable. It stores the elements in an array, which is more compact and faster to access than in a LinkedHashSet.

This might have unintended consequences in performance and even in correctness.

I believe that these comments are given on the assumption that all the changes are mechanical which is not true.

I checked the usage before making this change. I also checked the previous PR, where we removed the ImmutableSet.copyOf operation and replaced it with HashSet, and IMO, that's what caused the failures because Guava maintains the insertion order while HashSet doesn't.
This is the reason why I have chosen LinkedHashSet now.

Second, about the performance, we are only using the ImmutableSet.copyOf(Iterables/Iterator ...) variant, and that also takes O(n) in Guava's implementation, and since it maintains an additional Hashtable, it uses more space.

In my implementation, I am simply creating a LinkedHashSet, and the time complexity is the same as that of Guava.

Regarding Immutability, there is nothing in JDK or Apache's commons-collections4 that provides the drop-in replacement of ImmutableSet of Guava.
Set.of again, changes the iteration order and thus we again risk the previous issue.

We could overcome this (case by case basis) with either Collections.unmodifiableSet from JDK and UnmodifiableSet.unmodifiableSet(set) from commons-collections4 but that won't be necessary in the majority of the cases.

But just to remove any doubt, I would again do a thorough review of all the cases and wrap it inside an unmodifiable wrapper in case of any doubt.

cc @nfsantos @reschke

@rishabhdaim
Copy link
Contributor Author

Replacing the Guava sets by the sets created by Set.of() would be closer to the original semantic and performance characteristics.

Set.of has 2 big issues, first, it doesn't maintain insertion order, and second, it doesn't even allow null in the Set.contains(Object o) operation. Passing a null to contains would throw an NPE.

Copy link
Member

@thomasmueller thomasmueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a solution that:

  • Exactly matches the behavior of ImmutableSet.copyOf (specially being read-only, behavior of null entries, concurrency).
  • Have similar performance (measurably). Here a simple micro-benchmark would help; it doesn't need to be a unit test.
  • Have similar memory usage (measured). I don't think we need a unit test for that; instead, we could just test with a fixed set of e.g. 100 million entries, and see if it's running OOME at the same memory setting.

Also, I would probably not change all 35 files at the same time, but maybe half of those (so in two steps) with one week delay. So in case we break something, we know more exactly where.

@rishabhdaim
Copy link
Contributor Author

rishabhdaim commented Mar 13, 2025

@nfsantos @thomasmueller thanks for the review.

I did try to incorporate some of your review comments in 9876048.

However, I agree with @thomasmueller that we should even split this PR into smaller chunks.

@thomasmueller does the following sound reasonable to you:

ImmutableSet.copyOf replaced with Collections.unmodifiableSet(LinkedHashSet(...)). this ensures that we won't be able to modify it and also the order would remain the same.

@rishabhdaim
Copy link
Contributor Author

rishabhdaim commented Mar 14, 2025

@thomasmueller @nfsantos Please find the benchmark below:

Each method creates a Set from Iterables and iterates over it.
So far I haven't gotten any OOM from either of them.

Benchmark (size) Mode Cnt Score Error Units
ImmutableSet.apache 1 avgt 10 0.013 ± 0.001 us/op
ImmutableSet.apache 100 avgt 10 1.171 ± 0.006 us/op
ImmutableSet.apache 10000 avgt 10 112.914 ± 7.751 us/op
ImmutableSet.apache 1000000 avgt 10 22179.836 ± 3071.103 us/op
ImmutableSet.apache 100000000 avgt 10 3063667.217 ± 70743.291 us/op
ImmutableSet.guava 1 avgt 10 0.001 ± 0.001 us/op
ImmutableSet.guava 100 avgt 10 1.026 ± 0.006 us/op
ImmutableSet.guava 10000 avgt 10 128.871 ± 9.784 us/op
ImmutableSet.guava 1000000 avgt 10 51641.190 ± 2674.666 us/op
ImmutableSet.guava 100000000 avgt 10 3225314.017 ± 661972.699 us/op
ImmutableSet.jdk 1 avgt 10 0.017 ± 0.001 us/op
ImmutableSet.jdk 100 avgt 10 1.306 ± 0.022 us/op
ImmutableSet.jdk 10000 avgt 10 113.677 ± 6.856 us/op
ImmutableSet.jdk 1000000 avgt 10 24777.681 ± 9976.868 us/op
ImmutableSet.jdk 100000000 avgt 10 3145239.833 ± 121858.887 us/op

@rishabhdaim rishabhdaim force-pushed the OAK-11602 branch 4 times, most recently from 3746b8b to 94e7383 Compare March 14, 2025 07:00
@@ -86,7 +87,7 @@ protected void beforeSuite() throws Exception {
createForEachPrincipal(principal, acMgr, allPrivileges);

adminSession.save();
nodeSet = ImmutableSet.copyOf(nodePaths);
nodeSet = SetUtils.toLinkedSet(nodePaths);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a test case, so no need for Immutability here.

@nfsantos
Copy link
Contributor

@thomasmueller @nfsantos Please find the benchmark below:

Each method creates a Set from Iterables and iterates over it. So far I haven't gotten any OOM from either of them.

Benchmark (size) Mode Cnt Score Error Units
ImmutableSet.apache 1 avgt 10 0.013 ± 0.001 us/op
ImmutableSet.apache 100 avgt 10 1.171 ± 0.006 us/op
ImmutableSet.apache 10000 avgt 10 112.914 ± 7.751 us/op
ImmutableSet.apache 1000000 avgt 10 22179.836 ± 3071.103 us/op
ImmutableSet.apache 100000000 avgt 10 3063667.217 ± 70743.291 us/op
ImmutableSet.guava 1 avgt 10 0.001 ± 0.001 us/op
ImmutableSet.guava 100 avgt 10 1.026 ± 0.006 us/op
ImmutableSet.guava 10000 avgt 10 128.871 ± 9.784 us/op
ImmutableSet.guava 1000000 avgt 10 51641.190 ± 2674.666 us/op
ImmutableSet.guava 100000000 avgt 10 3225314.017 ± 661972.699 us/op
ImmutableSet.jdk 1 avgt 10 0.017 ± 0.001 us/op
ImmutableSet.jdk 100 avgt 10 1.306 ± 0.022 us/op
ImmutableSet.jdk 10000 avgt 10 113.677 ± 6.856 us/op
ImmutableSet.jdk 1000000 avgt 10 24777.681 ± 9976.868 us/op
ImmutableSet.jdk 100000000 avgt 10 3145239.833 ± 121858.887 us/op

Thanks for the benchmarks. Could you do similar benchmarks for lookups? Fast lookups are one of the main reasons to use a set instead of a list, so I would imagine that most Sets created by Oak will be heavily used for lookups. Therefore I think lookup performance is more important than traversal performance.

@@ -34,7 +35,7 @@ public PathElementComparator() {
}

public PathElementComparator(Iterable<String> preferredPathElements) {
this.preferred = ImmutableSet.copyOf(preferredPathElements);
this.preferred = Collections.unmodifiableSet(SetUtils.toLinkedSet(preferredPathElements));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we do not need to preserve insertion order. Sets.of() is enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with Sets.of is that, we need to make sure that we don't even pass null to Set.contains().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, we don't pass null (this is indexing code, I am familiar with it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set.of() doesn't accept Iterable and Set.copyOf only accepts Collection.

@reschke
Copy link
Contributor

reschke commented Mar 14, 2025

As these changes are intended to be purely mechanical,

Nope. That's not what we planned. The focus was on replacing Guava methods with standard JDK calls wherever possible.

@rishabhdaim
Copy link
Contributor Author

@nfsantos benchmark for creating set from Iterable + iterating it + calling contains API

Benchmark (size) Mode Cnt Score Error Units
ImmutableSet.apache 1 avgt 10 0.016 ± 0.002 us/op
ImmutableSet.apache 100 avgt 10 1.343 ± 0.018 us/op
ImmutableSet.apache 10000 avgt 10 141.176 ± 5.004 us/op
ImmutableSet.apache 1000000 avgt 10 30727.830 ± 14243.153 us/op
ImmutableSet.apache 100000000 avgt 10 3440894.454 ± 99999.848 us/op
ImmutableSet.guava 1 avgt 10 0.001 ± 0.001 us/op
ImmutableSet.guava 100 avgt 10 1.157 ± 0.008 us/op
ImmutableSet.guava 10000 avgt 10 139.581 ± 0.581 us/op
ImmutableSet.guava 1000000 avgt 10 61763.741 ± 17289.356 us/op
ImmutableSet.guava 100000000 avgt 10 3550494.812 ± 353004.799 us/op
ImmutableSet.jdk 1 avgt 10 0.022 ± 0.001 us/op
ImmutableSet.jdk 100 avgt 10 1.609 ± 0.099 us/op
ImmutableSet.jdk 10000 avgt 10 150.021 ± 1.770 us/op
ImmutableSet.jdk 1000000 avgt 10 25955.204 ± 3648.838 us/op
ImmutableSet.jdk 100000000 avgt 10 3531891.188 ± 92113.725 us/op

@rishabhdaim rishabhdaim requested a review from nfsantos March 14, 2025 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants