Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bytes.hashCode() can be optimized #389

Closed
Tracked by #14395
artemananiev opened this issue Feb 7, 2025 · 1 comment · Fixed by #391
Closed
Tracked by #14395

Bytes.hashCode() can be optimized #389

artemananiev opened this issue Feb 7, 2025 · 1 comment · Fixed by #391
Assignees
Labels
Performance Issues related to performance concerns.
Milestone

Comments

@artemananiev
Copy link
Member

Bytes.hashCode() is implemented in a sub-optimal way:

    @Override
    public int hashCode() {
        int h = 1;
        for (long i = length() - 1; i >= 0; i--) {
            h = 31 * h + getByte(i);
        }
        return h;
    }

Instead, we can use Arrays.hashCode(), which leverages vector ops support in modern CPUs.

There is one issue, though. The existing hashCode() may not be changed, not yet. If it's changed, hash code values will be changed, too. If a hash code is persisted anywhere (e.g. in MerkleDb), this place will be broken. We've seen it in the past: a key was stored on disk with one hash code, then hashCode() implementation was changed, and the key couldn't be retrieved from disk any longer.

So I propose:

  • Introduce a separate method in Bytes, for example, fastHashCode(). This is what this ticket is about
  • When we switch to Virtual Mega Map, use this method instead of hashCode() to store keys/values on disk. Migration to the mega map doesn't have to be backwards compatible, all data is migrated anyway
  • Some time later, hashCode() can be changed the same way as fastHashCode(), and the latter can be removed. All usages of fastHashCode() is the main repo will be replaced with hashCode(). Since implementation is the same, it will be backwards compatible
  • In-memory usages of hashCode(), e.g. when an object is stored in a Java collection, should be fine, since they are not persisted across node runs (no traces in state snapshots)
@artemananiev artemananiev added the Performance Issues related to performance concerns. label Feb 7, 2025
@artemananiev artemananiev self-assigned this Feb 7, 2025
@artemananiev
Copy link
Member Author

Some interesting facts:

  • Arrays.hashCode() only works with full arrays, there is no way to hash a part of an array (offset + length). It makes this method unusable for sliced Bytes objects (created using Bytes.slice())

  • Arrays.hashCode() value is not equal to existing Bytes.hashCode()

  • Arrays.hashCode() calls ArraysSupport.vectorizedHashCode(), which is annotated with IntrinsicCandidate. It means, this method may or may not be optimized by VM

  • Optimized and unoptimized versions of ArraysSupport.IntrinsicCandidate() produce different values. We need stable and predictable hashCode() implementation for Bytes, so Arrays.hashCode() doesn't look like a viable option

And some numbers:

  • On mac, Arrays.hashCode() is 50-60% faster than existing Bytes.hashCode()
  • I suspect Arrays.hashCode() is not optimized by JVM on mac. With some tweaks to our current implementation, it can be made almost as fast as Arrays.hashCode()
  • On linux, Arrays.hashCode() is 3-4 times faster than the current implementation
  • Unoptimized version of Arrays.hashCode() is about 25% faster than the current implementation, but with the same tweaks the difference can be eliminated

So I am going to repurpose this ticket to improve existing Bytes.hashCode() rather than to provide a new "fast" version of this method. These improvements will produce same values as hashCode() does today, i.e. new implementation will be fully backwards compatible. It will also be predictable, so we don't have to assume JVM will or will not run some optimizations (which result in different hash code values), hash codes will be always calculated the very same way.

@artemananiev artemananiev changed the title Bytes.fastHashCode() Bytes.hashCode() can be optimized Feb 12, 2025
@artemananiev artemananiev added this to the 0.9.17 milestone Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Issues related to performance concerns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant