Fallible APIs for StringInterner for #70 #71
base: master
Conversation
First, I haven't found issues adding fallible APIs to the Backend implementations (except the capacity issue, below). The most problematic change involves the
I'm also a bit concerned about the sanity of the
What if instead we give the user a way to estimate the "average" string size? Or perhaps even a separate additional capacity for string data.
@royaltm thank you for the PR! I think this PR generally goes in the right direction. The CI is currently unhappy with docs and formatting. Also, a memory consumption test is broken because the number of expected allocations decreased. Can you explain why? I am a bit skeptical about the usefulness of the reserve API for
Perhaps the different number of allocations has something to do with changes to the buffer backend, where I introduced the
As for the RawTable API, I'm not quite sure what advantage it would have over the current implementation. Perhaps just fewer intermediary function calls?
I have read somewhere that the
I suppose moving to
Regarding
Perhaps I'll remove the
…e updated buffer backend to reflect changes in backend
Hi @Robbepop! Sorry for the delay, I've had a whirlwind of obligations to fulfill in the past weeks, and I'd rather not contribute at all than contribute haphazardly during my lunch break, as "good" and "quickly" seldom meet.
If this meets your criteria, I don't think there's anything left considering the scope of this PR.
Hi @Robbepop, I've fixed the formatting errors; I'm not sure, however, what I can do to fix the tarpaulin issue.
I just ran benchmarks for this branch locally and compared them to the current main branch. Unfortunately I saw quite some big performance regressions with this PR:
head: FixedString::with_capacity(cap),
head: FixedString::try_with_capacity(cap).unwrap(),
Not a big fan of using raw unwrap calls, but coming up with proper expect messages is not easy here, I see. I usually try to use unwrap_or_else(|| panic!("...")) because it allows using scoped information for the message, and otherwise I use expect("...") with a proper message stating why this must not happen. However, here it is due to the guarantees of the API. 🤔
Here I was actually assuming the error message from try_with_capacity would do the job.
// self.spans.try_reserve(1)?;
// // this cannot fail, the previous line either returned or added at least 1 free slot
// let _ = self.spans.push_within_capacity(value);
We could also recursively call try_push_span again instead.
Wouldn't that end up calling try_next_symbol twice?
With the new code, we try to reserve capacity eagerly:
self.spans.try_reserve(1)?;
self.spans.push(interned);
which unfortunately checks the capacity redundantly, the second check being inside push itself, which might impact performance. The whole point of using push_within_capacity is to try the push first, so there is only one capacity check in the optimistic scenario; only when that fails does the code branch out to allocate.
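Until push_within_capacity lands on stable, this fast path can only be approximated. Below is a minimal sketch (the name try_push is made up for illustration, stable Rust only) of a fallible push that takes the fallible reservation only on the slow path; note that push still performs its own internal capacity check, so this does not fully remove the redundancy the nightly API would:

```rust
use std::collections::TryReserveError;

/// Fallible push: only hits the (fallible) allocation path when the vector
/// is actually full; otherwise pushes directly.
fn try_push<T>(vec: &mut Vec<T>, value: T) -> Result<(), TryReserveError> {
    if vec.len() == vec.capacity() {
        // Slow path: reserve fallibly; after this, push cannot reallocate.
        vec.try_reserve(1)?;
    }
    vec.push(value);
    Ok(())
}

fn main() {
    let mut spans: Vec<u32> = Vec::new();
    for i in 0..100 {
        try_push(&mut spans, i).expect("allocation failed");
    }
    assert_eq!(spans.len(), 100);
    println!("ok");
}
```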
// FIXME: vec_push_within_capacity #100486, replace the following with:
//
// if let Err(value) = self.full.push_within_capacity(old_string) {
//     self.full.try_reserve(1)?;
//     // this cannot fail, the previous line either returned or added at least 1 free slot
//     let _ = self.full.push_within_capacity(value);
// }
This seems to be the same code as above, so maybe introduce a helper method instead. Or?
Sure, but that would need to be a generic function that takes a vector as an argument, because spans and full are vectors of different types.
fn encode_var_usize(&mut self, value: usize) -> usize {
    encode_var_usize(&mut self.buffer, value)
fn try_encode_var_length(&mut self, length: usize) -> Result<()> {
    let add_len = length + calculate_var7_size(length);
I would rename it to something like len_encode_var_usize, which is very similar to encode_var_usize, to draw the parallels for the reader of the code.
Though I have to admit that it is a bit unfortunate that this encoding step is now two-phased.
encode_var_usize pushes each byte separately; calling try_reserve every time a byte is encoded would take down the performance even more. However, with push_within_capacity stabilized, that could be implemented differently.
An alternative solution is to encode the length into an array on the stack and then call extend_from_slice on self.buffer. This, however, can't be done efficiently without a bit of unsafe code (MaybeUninit) or would require using ArrayVec or a similar crate.
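For illustration, here is one possible shape of the stack-array variant (the name var7_to_stack is made up; this is not the PR's code): a plain zero-initialized array sidesteps MaybeUninit at the small cost of pre-initializing a few bytes, after which a single extend_from_slice appends the encoded length in one go:

```rust
/// Worst-case var7 length of a usize: ceil(BITS / 7), i.e. 10 on 64-bit.
const MAX_VAR7: usize = (usize::BITS as usize + 6) / 7;

/// Encode `value` as var7 into a stack array; returns the buffer and the
/// number of bytes actually used.
fn var7_to_stack(mut value: usize) -> ([u8; MAX_VAR7], usize) {
    let mut buf = [0u8; MAX_VAR7];
    let mut len = 0;
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            buf[len] = byte; // last chunk: continuation bit clear
            return (buf, len + 1);
        }
        buf[len] = byte | 0x80; // continuation bit set, more chunks follow
        len += 1;
    }
}

fn main() {
    let mut buffer: Vec<u8> = Vec::new();
    let (arr, n) = var7_to_stack(0x80);
    buffer.extend_from_slice(&arr[..n]); // single append, single capacity check
    assert_eq!(buffer, [0b1000_0000, 0b0000_0001]);
    println!("ok");
}
```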
/// According to google the approx. word length is 5.
const DEFAULT_STR_LEN: usize = 5;
We should really get rid of these things entirely. My bad for introducing them in the first place. 🙈
By that, do you mean to remove with_capacity from the Backend altogether?
/// We encode the `usize` string length into the buffer as well.
const LEN_USIZE: usize = mem::size_of::<usize>();
/// According to google the approx. word length is 5.
const DEFAULT_STR_LEN: usize = 5;
let bytes_per_string = DEFAULT_STR_LEN + LEN_USIZE;
// We encode the `usize` string length into the buffer as well.
let var7_len: usize = calculate_var7_size(capacity);
let bytes_per_string = DEFAULT_STR_LEN + var7_len;
As said in a prior comment, we should probably get rid of all the heuristics-based with_capacity methods and get this down to the basics. A Rust BTreeMap, for example, also does not provide a with_capacity method, because it simply does not make sense to do so.
I agree, it's rather pointless.
/// Calculate var7 encoded size from a given value.
#[inline]
fn calculate_var7_size(value: usize) -> usize {
    // number of bits to encode
    // value = 0 would give 0 bits, hence: |1, could be anything up to |0x7F as well
    let bits = usize::BITS - (value | 1).leading_zeros();
    // (bits to encode / 7).ceil()
    ((bits + 6) / 7) as usize
}
Can you please explain this formula a bit better so that it does not look like magic to the reader?
Also: is this precise, or an over-approximation?
This formula is precise and this is also reflected in tests.
The formula goes like this:
- Take the binary size (bit length) of the original length:
let bits = usize::BITS - (value | 1).leading_zeros();
For example, if your length is 42, that is binary 101010, so the bit length is 6. Assuming usize::BITS is 64, leading_zeros of 42usize is 58, and 64 - 58 = 6.
- If the binary size is 0, make it 1 (0 is the only case that breaks the formula, hence (value | 1)).
- The encoded size in bytes is the total number of encoded bits divided by 7 (each byte encodes only 7 bits of the original value), rounded up because the last byte's 7 bits might not be filled entirely, hence (bits + 6):
// (bits to encode / 7).ceil()
((bits + 6) / 7) as usize
For example, 0x7F has bit length 7, and (7 + 6) / 7 == 1; the length is encoded as 0b0111_1111.
0x80 has bit length 8, and (8 + 6) / 7 == 2; the length is encoded as 0b1000_0000 0b0000_0001.
Let's take the original bit value of the encoded length (each bit placement is represented by a character):
ZYXWVUTSRQPONMLKJIHGFEDCBA9876543210
(bit-length: 36)
The encoded chunks end up being:
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS _______Z
where ! and _ are 1 and 0 bit markers respectively.
As can be seen, each group of 7 bits ends up in its own chunk, with the last chunk holding the remaining 1-7 bits:
(36 + 6) / 7 == 6
Now if we take:
YXWVUTSRQPONMLKJIHGFEDCBA9876543210
(bit-length: 35)
The encoded chunks end up being:
!6543210 !DCBA987 !KJIHGFE !RQPONML _YXWVUTS
(35 + 6) / 7 == 5
This process repeats every 7 bits.
The only exception is 0 itself which, if we followed the formula precisely, should be encoded in exactly 0 bytes - because there are no bits to encode!
This leads us to the interesting observation that we could in fact extend the encoded var7 by appending any number of redundant 0 bytes at the end, and the decoding formula (provided we have infinite integer capacity) would still work:
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS ________
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS !_______ ________
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS !_______ !_______ ________
and so on...
I hope that explains why 0 breaks the formula without actually really breaking it. What we really need is just 1 bit of information signalling that this is the end of the stream of data.
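To make the "precise, not an over-approximation" claim concrete, here is a small self-contained check, reproducing calculate_var7_size from the diff above next to a straightforward reference encoder written just for this sketch, asserting that the formula matches the emitted byte count, including the special-cased 0:

```rust
/// Formula from the diff: exact var7 size of `value`.
fn calculate_var7_size(value: usize) -> usize {
    let bits = usize::BITS - (value | 1).leading_zeros();
    ((bits + 6) / 7) as usize
}

/// Reference: count the bytes a var7 encoder actually emits.
fn var7_encoded_len(mut value: usize) -> usize {
    let mut len = 1; // even 0 takes one byte on the wire
    while value >= 0x80 {
        value >>= 7;
        len += 1;
    }
    len
}

fn main() {
    for v in [0usize, 1, 42, 0x7F, 0x80, 0x3FFF, 0x4000, usize::MAX] {
        assert_eq!(calculate_var7_size(v), var7_encoded_len(v), "mismatch at {v}");
    }
    println!("ok");
}
```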
// According to google the approx. word length is 5.
const DEFAULT_WORD_LEN: usize = 5;
Again: we should remove these heuristics entirely.
@@ -204,6 +224,13 @@ where
    backend,
} = self;
let hash = make_hash(hasher, string.as_ref());
// this call requires hashbrown `raw` feature enabled
Please remove the superfluous comment.
Sure, this was more of a message to you than a real code comment.
All in all this looks good. The benchmark regressions need to be fixed before we can merge, though.
I'm beginning to think that without stabilization of
If I'm not mistaken, another area of improvement might be in the raw hash-map department. With the latest changes I have also introduced redundant capacity checks in the
Also, while I was pondering the var-int buffer encoding issue: if we are going to address the "two-phased encoding step" and decide to try the single-step array encoding approach, I'd also suggest changing the var-int algorithm to a more efficient one. I've created a gist with more info on the topic and playground examples. Please feel free to check it out at your leisure.
Perhaps a better approach is to use https://doc.rust-lang.org/alloc/vec/struct.Vec.html#method.spare_capacity_mut. Something along the lines of:
Although a lot of additional unsafety is introduced, the generated assembly looks neat: there's a very tight loop at:
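As an illustration of the idea (a hedged sketch written for this discussion, not the code linked above; the function name is made up), writing the var7 bytes straight into the vector's spare capacity after a single fallible reservation avoids per-byte capacity checks, at the price of one unsafe set_len:

```rust
use std::collections::TryReserveError;

/// Worst-case var7 length of a usize: ceil(BITS / 7), i.e. 10 on 64-bit.
const MAX_VAR7: usize = (usize::BITS as usize + 6) / 7;

fn encode_var_usize_spare(buffer: &mut Vec<u8>, mut value: usize) -> Result<usize, TryReserveError> {
    // One fallible reservation for the worst case, then no more checks.
    buffer.try_reserve(MAX_VAR7)?;
    let spare = buffer.spare_capacity_mut();
    let mut written = 0;
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        let done = value == 0;
        spare[written].write(if done { byte } else { byte | 0x80 });
        written += 1;
        if done {
            break;
        }
    }
    // SAFETY: exactly `written` (<= MAX_VAR7) bytes of the spare capacity
    // reserved above were initialized by the loop.
    unsafe { buffer.set_len(buffer.len() + written) };
    Ok(written)
}

fn main() {
    let mut buffer = vec![0xAAu8]; // pre-existing content must survive
    let n = encode_var_usize_spare(&mut buffer, 0x80).unwrap();
    assert_eq!(n, 2);
    assert_eq!(buffer, [0xAA, 0b1000_0000, 0b0000_0001]);
    println!("ok");
}
```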
Hi @Robbepop!
This is my take on issue #70.
This pull request is NOT READY to merge.
I haven't added any tests, nor did I check whether the new code breaks anything in the existing test suites.
Let's iterate on this here instead.