Fallible APIs for StringInterner for #70 #71
base: master
Conversation
First, I haven't found issues adding fallible APIs to the Backend implementations (except the capacity issue, below). The most problematic change involves the
I'm also a bit concerned about the sanity of the
What if instead we give the user a way to estimate the "average" string size? Or perhaps even a separate additional capacity for string data.
@royaltm thank you for the PR! I think this PR generally goes in the right direction. The CI is currently unhappy with docs and formatting. Also, a memory consumption test is broken because the number of expected allocations decreased. Can you explain why? I am a bit skeptical about the usefulness of the reserve API for
Perhaps the different number of allocations has something to do with changes to the buffer backend, where I introduced the
As for the RawTable API, I'm not quite sure what advantage it would have over the current implementation. Perhaps just fewer intermediary function calls?
I have read somewhere that the
I suppose moving to
Regarding
Perhaps I'll remove the
…e updated buffer backend to reflect changes in backend
Hi @Robbepop! Sorry for the delay, I've had a whirlwind of obligations to fulfill in the past weeks, and I'd rather not contribute at all than contribute haphazardly during my lunch break, as "good" and "quickly" seldom meet.
If this meets your criteria, I don't think there's anything left considering the scope of this PR.
Hi @Robbepop, I've fixed the formatting errors; I'm not sure, however, what I can do to fix the tarpaulin issue.
I just ran benchmarks for this branch locally and compared them to the current main branch. Unfortunately I saw quite some big performance regressions with this PR:
head: FixedString::with_capacity(cap),
head: FixedString::try_with_capacity(cap).unwrap(),
Not a big fan of using raw unwrap calls, but coming up with proper expect messages is not easy here, I see. I usually try to use unwrap_or_else(|| panic!("...")) because it allows using scoped information for the message, and otherwise I use expect("...") with a proper message stating why this must not happen. However, here it is due to the guarantees of the API. 🤔
Here I was actually assuming the error message from try_with_capacity would do the job.
// self.spans.try_reserve(1)?;
// // this cannot fail, the previous line either returned or added at least 1 free slot
// let _ = self.spans.push_within_capacity(value);
We could also recursively call try_push_span again instead.
Wouldn't that end up calling try_next_symbol twice?
With the new code, we try to reserve capacity eagerly:
self.spans.try_reserve(1)?;
self.spans.push(interned);
which unfortunately checks the capacity redundantly, the second check being inside push itself, which might impact performance. The whole point of using push_within_capacity is to try the push first, so there is only one capacity check in the optimistic scenario; only when that fails does the code branch out to allocate.
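Until push_within_capacity lands on stable, this fast path can only be approximated. Below is a minimal sketch (the name try_push is made up for illustration, stable Rust only) of a fallible push that takes the fallible reservation only on the slow path; note that push still performs its own internal capacity check, so this does not fully remove the redundancy the nightly API would:

```rust
use std::collections::TryReserveError;

/// Fallible push: only hits the (fallible) allocation path when the vector
/// is actually full; otherwise pushes directly.
fn try_push<T>(vec: &mut Vec<T>, value: T) -> Result<(), TryReserveError> {
    if vec.len() == vec.capacity() {
        // Slow path: reserve fallibly; after this, push cannot reallocate.
        vec.try_reserve(1)?;
    }
    vec.push(value);
    Ok(())
}

fn main() {
    let mut spans: Vec<u32> = Vec::new();
    for i in 0..100 {
        try_push(&mut spans, i).expect("allocation failed");
    }
    assert_eq!(spans.len(), 100);
    println!("ok");
}
```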
// FIXME: vec_push_within_capacity #100486, replace the following with:
//
// if let Err(value) = self.full.push_within_capacity(old_string) {
//     self.full.try_reserve(1)?;
//     // this cannot fail, the previous line either returned or added at least 1 free slot
//     let _ = self.full.push_within_capacity(value);
// }
This seems to be the same code as above, so maybe introduce a helper method instead. Or?
Sure, but that would need to be a generic function that takes a vector as an argument, because spans and full are vectors of different types.
fn encode_var_usize(&mut self, value: usize) -> usize {
    encode_var_usize(&mut self.buffer, value)
fn try_encode_var_length(&mut self, length: usize) -> Result<()> {
    let add_len = length + calculate_var7_size(length);
I would rename it to something like len_encode_var_usize, which is very similar to encode_var_usize, to draw the parallels for the reader of the code.
Though I have to admit that it is a bit unfortunate that this encoding step is now two-phased.
encode_var_usize pushes each byte separately; calling try_reserve every time a byte is encoded would take down the performance even more. However, with push_within_capacity stabilized, that could be implemented differently.
An alternative solution is to encode the length into an array on the stack and then call extend_from_slice on self.buffer. This, however, can't be done efficiently without a bit of unsafe code (MaybeUninit) or would require using ArrayVec or a similar crate.
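For illustration, here is one possible shape of the stack-array variant (the name var7_to_stack is made up; this is not the PR's code): a plain zero-initialized array sidesteps MaybeUninit at the small cost of pre-initializing a few bytes, after which a single extend_from_slice appends the encoded length in one go:

```rust
/// Worst-case var7 length of a usize: ceil(BITS / 7), i.e. 10 on 64-bit.
const MAX_VAR7: usize = (usize::BITS as usize + 6) / 7;

/// Encode `value` as var7 into a stack array; returns the buffer and the
/// number of bytes actually used.
fn var7_to_stack(mut value: usize) -> ([u8; MAX_VAR7], usize) {
    let mut buf = [0u8; MAX_VAR7];
    let mut len = 0;
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            buf[len] = byte; // last chunk: continuation bit clear
            return (buf, len + 1);
        }
        buf[len] = byte | 0x80; // continuation bit set, more chunks follow
        len += 1;
    }
}

fn main() {
    let mut buffer: Vec<u8> = Vec::new();
    let (arr, n) = var7_to_stack(0x80);
    buffer.extend_from_slice(&arr[..n]); // single append, single capacity check
    assert_eq!(buffer, [0b1000_0000, 0b0000_0001]);
    println!("ok");
}
```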
/// According to google the approx. word length is 5.
const DEFAULT_STR_LEN: usize = 5;
We should really get rid of these things entirely. My bad for introducing them in the first place. 🙈
By that, do you mean to remove with_capacity from the Backend altogether?
/// We encode the `usize` string length into the buffer as well.
const LEN_USIZE: usize = mem::size_of::<usize>();
/// According to google the approx. word length is 5.
const DEFAULT_STR_LEN: usize = 5;
let bytes_per_string = DEFAULT_STR_LEN + LEN_USIZE;
// We encode the `usize` string length into the buffer as well.
let var7_len: usize = calculate_var7_size(capacity);
let bytes_per_string = DEFAULT_STR_LEN + var7_len;
As said in a prior comment, we should probably get rid of all the heuristics-based with_capacity methods and get this down to the basics. A Rust BTreeMap, for example, also does not provide a with_capacity method, because it simply does not make sense to do so.
I agree, it's rather pointless.
/// Calculate var7 encoded size from a given value.
#[inline]
fn calculate_var7_size(value: usize) -> usize {
    // number of bits to encode
    // value = 0 would give 0 bits, hence: |1, could be anything up to |0x7F as well
    let bits = usize::BITS - (value | 1).leading_zeros();
    // (bits to encode / 7).ceil()
    ((bits + 6) / 7) as usize
}
Can you please explain this formula a bit better so that it does not look like magic to the reader?
Also: is this precise, or an over-approximation?
This formula is precise and this is also reflected in tests.
The formula goes like this:
- Take the binary size (bit length) of the original length:
let bits = usize::BITS - (value | 1).leading_zeros();
For example, if your length is 42, that is binary 101010, so the bit length is 6. Assuming usize::BITS is 64, leading_zeros of 42usize is 58, and 64 - 58 = 6.
- If the binary size is 0, make it 1 (0 is the only case that breaks the formula, hence (value | 1)).
- The encoded size in bytes is the total number of encoded bits divided by 7 (each byte encodes only 7 bits of the original value), rounded up because the last byte's 7 bits might not be filled entirely, hence (bits + 6):
// (bits to encode / 7).ceil()
((bits + 6) / 7) as usize
For example, 0x7F has bit length 7, and (7 + 6) / 7 == 1; the length is encoded as 0b0111_1111.
0x80 has bit length 8, and (8 + 6) / 7 == 2; the length is encoded as 0b1000_0000 0b0000_0001.
Let's take the original bit value of the encoded length (each bit placement is represented by a character):
ZYXWVUTSRQPONMLKJIHGFEDCBA9876543210
(bit-length: 36)
The encoded chunks end up being:
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS _______Z
where ! and _ are 1 and 0 bit markers respectively.
As can be seen, each group of 7 bits ends up in its own chunk, with the last chunk holding the remaining 1-7 bits:
(36 + 6) / 7 == 6
Now if we take:
YXWVUTSRQPONMLKJIHGFEDCBA9876543210
(bit-length: 35)
The encoded chunks end up being:
!6543210 !DCBA987 !KJIHGFE !RQPONML _YXWVUTS
(35 + 6) / 7 == 5
This process repeats every 7 bits.
The only exception is 0 itself which, if we followed the formula precisely, should be encoded in exactly 0 bytes - because there are no bits to encode!
This leads us to the interesting observation that we could in fact extend the encoded var7 by appending any number of redundant 0 bytes at the end, and the decoding formula (provided we have infinite integer capacity) would still work:
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS ________
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS !_______ ________
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS !_______ !_______ ________
and so on...
I hope that explains why 0 breaks the formula without actually really breaking it. What we really need is just 1 bit of information signalling that this is the end of the stream of data.
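To make the "precise, not an over-approximation" claim concrete, here is a small self-contained check, reproducing calculate_var7_size from the diff above next to a straightforward reference encoder written just for this sketch, asserting that the formula matches the emitted byte count, including the special-cased 0:

```rust
/// Formula from the diff: exact var7 size of `value`.
fn calculate_var7_size(value: usize) -> usize {
    let bits = usize::BITS - (value | 1).leading_zeros();
    ((bits + 6) / 7) as usize
}

/// Reference: count the bytes a var7 encoder actually emits.
fn var7_encoded_len(mut value: usize) -> usize {
    let mut len = 1; // even 0 takes one byte on the wire
    while value >= 0x80 {
        value >>= 7;
        len += 1;
    }
    len
}

fn main() {
    for v in [0usize, 1, 42, 0x7F, 0x80, 0x3FFF, 0x4000, usize::MAX] {
        assert_eq!(calculate_var7_size(v), var7_encoded_len(v), "mismatch at {v}");
    }
    println!("ok");
}
```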
// According to google the approx. word length is 5.
const DEFAULT_WORD_LEN: usize = 5;
Again: we should remove these heuristics entirely.
@@ -204,6 +224,13 @@ where
    backend,
} = self;
let hash = make_hash(hasher, string.as_ref());
// this call requires hashbrown `raw` feature enabled
Please remove the superfluous comment.
Sure, this was more of a message to you than a real code comment.
All in all this looks good. The benchmark regressions need to be fixed before we can merge, though.
I'm beginning to think that without stabilization of
If I'm not mistaken, another area of improvement might be in the raw hash-map department. With the latest changes I have also introduced redundant capacity checks in the
Also, while I was pondering the var-int buffer encoding issue: if we are going to address the "two-phased encoding step" and decide to try the single-step array encoding approach, I'd also suggest changing the var-int algorithm to a more efficient one. I've created a gist with more info on the topic and playground examples. Please feel free to check it out at your leisure.
Perhaps a better approach is to use https://doc.rust-lang.org/alloc/vec/struct.Vec.html#method.spare_capacity_mut. Something along the lines of:
Although a lot of additional unsafety is introduced, the generated assembly looks neat: there's a very tight loop at:
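As an illustration of the idea (a hedged sketch written for this discussion, not the code linked above; the function name is made up), writing the var7 bytes straight into the vector's spare capacity after a single fallible reservation avoids per-byte capacity checks, at the price of one unsafe set_len:

```rust
use std::collections::TryReserveError;

/// Worst-case var7 length of a usize: ceil(BITS / 7), i.e. 10 on 64-bit.
const MAX_VAR7: usize = (usize::BITS as usize + 6) / 7;

fn encode_var_usize_spare(buffer: &mut Vec<u8>, mut value: usize) -> Result<usize, TryReserveError> {
    // One fallible reservation for the worst case, then no more checks.
    buffer.try_reserve(MAX_VAR7)?;
    let spare = buffer.spare_capacity_mut();
    let mut written = 0;
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        let done = value == 0;
        spare[written].write(if done { byte } else { byte | 0x80 });
        written += 1;
        if done {
            break;
        }
    }
    // SAFETY: exactly `written` (<= MAX_VAR7) bytes of the spare capacity
    // reserved above were initialized by the loop.
    unsafe { buffer.set_len(buffer.len() + written) };
    Ok(written)
}

fn main() {
    let mut buffer = vec![0xAAu8]; // pre-existing content must survive
    let n = encode_var_usize_spare(&mut buffer, 0x80).unwrap();
    assert_eq!(n, 2);
    assert_eq!(buffer, [0xAA, 0b1000_0000, 0b0000_0001]);
    println!("ok");
}
```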
Hi @Robbepop!
This is my take on issue #70.
This pull request is NOT READY to merge.
I haven't added any tests, nor did I check whether the new code breaks anything in the existing test suites.
Let's iterate on this here instead.