Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In Intern::new compute the hash before acquiring the mutex #28

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

stepancheg
Copy link
Contributor

So there's smaller lock contention.

This also unlocks an option to do partitioning by hash, which should
reduce contention greatly.

So there's smaller lock contention.

This also unlocks an option to do partitioning by hash, which should
reduce contention greatly.
@droundy
Copy link
Owner

droundy commented Nov 15, 2021

Just a check: can you confirm that you've run benchmarks and this makes a difference? How big a difference does it make, and under what kind of workload?

@stepancheg
Copy link
Contributor Author

No I didn't do benchmark, sorry. It is not trivial to do testing with unreleased versions in our repo setup.

But there are more reasons to do this change. Another one is this: current implementation relies on Borrow. So it is possible to intern without allocation for example of String from str, but it is not possible to do interning of:

struct Pair(String, String)

from

struct PairRef(&str, &str)

because it is not possible to borrow Pair to PairRef. With switching away from HashSet to RawTable it is possible.

@stepancheg
Copy link
Contributor Author

stepancheg commented Nov 15, 2021

I reimplemented the library in our project with two PRs I submitted and more changes.

This is the implementation. https://gist.github.com/stepancheg/9d2ebac23b27ff8d21a1bcf494b5ac7c

The speedup in our program (not just internment) is about 1% (edit). However, the speedup is against version 0.4.0.

@droundy
Copy link
Owner

droundy commented Nov 16, 2021

I'm wondering why https://docs.rs/hashbrown/0.11.2/hashbrown/hash_map/struct.HashMap.html#method.raw_entry_mut wouldn't have worked for this purpose? I don't like that the raw API describes itself as "unsafe and experimental". It seems like raw_entry_mut would give the advantages you describe. Am I missing something?

@stepancheg
Copy link
Contributor Author

raw_entry_mut should probably work. I didn't know such function exists.

@droundy
Copy link
Owner

droundy commented Nov 24, 2021

Okay I'm now thinking that this implementation could open users of internment up to hash collision attacks, since it uses a hash with DefaultHasher rather than the hasher for the HashMap, which means the RandomState is predictable and an attached who can generate values that are interned could generate many values with collisions. There are various use cases for internment that I could imagine where this attack path might be open. On the other hand, one might point out that since Intern leaks memory, it's not safe to let attackers generate unlimited Intern values in any case... Not sure if this matters, but I'm reluctant to sidestep the protections built into the std HashMap.

@droundy
Copy link
Owner

droundy commented Nov 25, 2021

I've confirmed from the docs that DefaultHasher::new() generates a SipHasher with zero for the keys, which means we would be vulnerable to DOS attacks. Probably your shouldn't let potential attackers provide data for internment anyways, but I'm also not comfortable forgoing the standard protection provided by the standard library.

I'll think about how to store a RandomState outside a Mutex so we can compute hashes securely before taking the lock.

@stepancheg
Copy link
Contributor Author

I'll think about how to store a RandomState outside a Mutex

static RANDOM_STATE: OnceCell<RandomState> = ... should not have noticeable performance overhead, no?

@droundy
Copy link
Owner

droundy commented Nov 29, 2021 via email

@droundy
Copy link
Owner

droundy commented Nov 29, 2021

BTW, the reason I'm picky about optimization changes is that I've been bitten in the past by "optimizations" which ended up introducing serious bugs (only discovered months later after they had corrupted many people's data), without even a demonstrated benefit. I doubt that's the case here (among other things, internment is unlikely to affect anyone's on-disk format), but want to ensure that I understand the trade-offs, and that we don't sacrifice my ability to fix any bugs either in this pull request, or that arise in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants