Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallible APIs for StringInterner for #70 #71

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ edition = "2021"

[dependencies]
cfg-if = "1.0"
hashbrown = { version = "0.14.0", default-features = false, features = ["ahash"] }
hashbrown = { version = "0.14.0", default-features = false, features = ["ahash", "raw"] }
serde = { version = "1.0", default-features = false, features = ["alloc"], optional = true }

[dev-dependencies]
Expand Down
25 changes: 14 additions & 11 deletions src/backend/bucket/fixed_str.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
use super::InternedStr;
use crate::Result;
#[cfg(not(feature = "std"))]
use alloc::string::String;

Expand All @@ -10,13 +11,16 @@ pub struct FixedString {
impl FixedString {
/// Creates a new fixed string with the given fixed capacity.
#[inline]
pub fn with_capacity(cap: usize) -> Self {
Self {
contents: String::with_capacity(cap),
}
pub fn try_with_capacity(cap: usize) -> Result<Self> {
let mut contents = String::new();
contents.try_reserve(cap)?;
Ok(Self { contents })
// FIXME: try_with_capacity #91913, replace with the following:
// Ok(Self {
// contents: String::try_with_capacity(cap)?,
// })
}

/// Returns the underlying [`Box<str>`].
/// Returns the underlying [`String`].
///
/// Guarantees not to perform any reallocations in this process.
#[inline]
Expand All @@ -43,17 +47,16 @@ impl FixedString {
#[inline]
pub fn push_str(&mut self, string: &str) -> Option<InternedStr> {
let len = self.len();
if self.capacity() < len + string.len() {
let new_len = len + string.len();
if self.capacity() < new_len {
return None;
}
self.contents.push_str(string);
debug_assert_eq!(self.contents.len(), len + string.len());
debug_assert_eq!(self.contents.len(), new_len);
Some(InternedStr::new(
// SAFETY: We convert from bytes to utf8 from which we know through the
// input string that they must represent valid utf8.
unsafe {
core::str::from_utf8_unchecked(&self.contents.as_bytes()[len..len + string.len()])
},
unsafe { core::str::from_utf8_unchecked(&self.contents.as_bytes()[len..new_len]) },
))
}
}
69 changes: 47 additions & 22 deletions src/backend/bucket/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ mod interned_str;

use self::{fixed_str::FixedString, interned_str::InternedStr};
use super::Backend;
use crate::{symbol::expect_valid_symbol, DefaultSymbol, Symbol};
use crate::{symbol::expect_valid_symbol, DefaultSymbol, Error, Result, Symbol};
#[cfg(not(feature = "std"))]
use alloc::string::String;
use alloc::vec::Vec;
Expand Down Expand Up @@ -91,25 +91,25 @@ where
fn with_capacity(cap: usize) -> Self {
Self {
spans: Vec::with_capacity(cap),
head: FixedString::with_capacity(cap),
head: FixedString::try_with_capacity(cap).unwrap(),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fan of using raw unwrap calls but coming up with proper expect messages is not easy here I see.
I usually try to use unwrap_or_else(|| panic!("...")) because it allows to use scoped information for the message and otherwise I use expect("...") with a proper message stating why this must not happen. However, here it is due to the guarantees of the API. 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I was actually assuming the error message from try_with_capacity will do the job.

full: Vec::new(),
marker: Default::default(),
}
}

#[inline]
fn intern(&mut self, string: &str) -> Self::Symbol {
fn try_intern(&mut self, string: &str) -> Result<Self::Symbol> {
// SAFETY: This is safe because we never hand out the returned
// interned string instance to the outside and only operate
// on it within this backend.
let interned = unsafe { self.alloc(string) };
self.push_span(interned)
let interned = unsafe { self.try_alloc(string)? };
self.try_push_span(interned)
}

#[cfg_attr(feature = "inline-more", inline)]
fn intern_static(&mut self, string: &'static str) -> Self::Symbol {
fn try_intern_static(&mut self, string: &'static str) -> Result<Self::Symbol> {
let interned = InternedStr::new(string);
self.push_span(interned)
self.try_push_span(interned)
}

fn shrink_to_fit(&mut self) {
Expand Down Expand Up @@ -142,29 +142,54 @@ where
S: Symbol,
{
/// Returns the next available symbol.
fn next_symbol(&self) -> S {
expect_valid_symbol(self.spans.len())
fn try_next_symbol(&self) -> Result<S> {
S::try_from_usize(self.spans.len()).ok_or(Error::OutOfSymbols)
}

/// Pushes the given interned string into the spans and returns its symbol.
fn push_span(&mut self, interned: InternedStr) -> S {
let symbol = self.next_symbol();
/// Pushes the given interned string into the spans and returns its symbol on success.
fn try_push_span(&mut self, interned: InternedStr) -> Result<S> {
let symbol = self.try_next_symbol()?;
// FIXME: vec_push_within_capacity #100486, replace the following with:
//
// if let Err(value) = self.spans.push_within_capacity(interned) {
// self.spans.try_reserve(1)?;
// // this cannot fail, the previous line either returned or added at least 1 free slot
// let _ = self.spans.push_within_capacity(value);
Comment on lines +155 to +157
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also recursively call try_push_span again instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't that end up calling try_next_symbol twice?

With the new code, we try to reserve capacity eagerly:

        self.spans.try_reserve(1)?;
        self.spans.push(interned);

which unfortunately checks capacity redundantly, the 2nd one being in the push itself, which might impact the performance. The whole point of using push_within_capacity is to try to push first, where there is only one capacity check in the optimistic scenario, and only when that fails code branches to allocate.

// }
self.spans.try_reserve(1)?;
self.spans.push(interned);
symbol

Ok(symbol)
}

/// Interns a new string into the backend and returns a reference to it.
unsafe fn alloc(&mut self, string: &str) -> InternedStr {
/// Ensure head has enough reserved capacity or replace it with a new one.
#[cfg_attr(feature = "inline-more", inline)]
fn try_reserve_head(&mut self, additional: usize) -> Result<()> {
let cap = self.head.capacity();
if cap < self.head.len() + string.len() {
let new_cap = (usize::max(cap, string.len()) + 1).next_power_of_two();
let new_head = FixedString::with_capacity(new_cap);
if cap < self.head.len() + additional {
let new_cap = (usize::max(cap, additional) + 1).next_power_of_two();
let new_head = FixedString::try_with_capacity(new_cap)?;
let old_head = core::mem::replace(&mut self.head, new_head);
self.full.push(old_head.finish());
let old_string = old_head.finish();
// FIXME: vec_push_within_capacity #100486, replace the following with:
//
// if let Err(value) = self.full.push_within_capacity(old_string) {
// self.full.try_reserve(1)?;
// // this cannot fail, the previous line either returned or added at least 1 free slot
// let _ = self.full.push_within_capacity(value);
// }
Comment on lines +174 to +180
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the same code as above so maybe propose to introduce a helper method instead. Or?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, but that would need to be a generic function that takes a vector as argument because spans and full are vectors of different types

self.full.try_reserve(1)?;
self.full.push(old_string);
}
self.head
Ok(())
}
/// Interns a new string into the backend and returns a reference to it.
unsafe fn try_alloc(&mut self, string: &str) -> Result<InternedStr> {
self.try_reserve_head(string.len())?;
Ok(self
.head
.push_str(string)
.expect("encountered invalid head capacity (2)")
.expect("encountered invalid head capacity (2)"))
}
}

Expand All @@ -174,7 +199,7 @@ impl<S> Clone for BucketBackend<S> {
// head string leaving the cloned `full` empty.
let new_head_cap =
self.head.capacity() + self.full.iter().fold(0, |lhs, rhs| lhs + rhs.len());
let mut head = FixedString::with_capacity(new_head_cap);
let mut head = FixedString::try_with_capacity(new_head_cap).unwrap();
let mut spans = Vec::with_capacity(self.spans.len());
for span in &self.spans {
let string = span.as_str();
Expand Down
69 changes: 46 additions & 23 deletions src/backend/buffer.rs
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
#![cfg(feature = "backends")]

use super::Backend;
use crate::{symbol::expect_valid_symbol, DefaultSymbol, Symbol};
use crate::{DefaultSymbol, Error, Result, Symbol};
use alloc::vec::Vec;
use core::{marker::PhantomData, mem, str};
use core::{marker::PhantomData, str};

/// An interner backend that appends all interned string information in a single buffer.
///
Expand Down Expand Up @@ -78,8 +78,8 @@ where
{
/// Returns the next available symbol.
#[inline]
fn next_symbol(&self) -> S {
expect_valid_symbol(self.buffer.len())
fn try_next_symbol(&self) -> Result<S> {
S::try_from_usize(self.buffer.len()).ok_or(Error::OutOfSymbols)
}

/// Resolves the string for the given symbol if any.
Expand Down Expand Up @@ -123,30 +123,39 @@ where
unsafe { str::from_utf8_unchecked(str_bytes) }
}

/// Pushes the given value onto the buffer with `var7` encoding.
/// Pushes the given length value onto the buffer with `var7` encoding.
/// Ensures there's enough capacity for the `var7` encoded length and
/// the following string bytes ahead of pushing.
///
/// Returns the amount of `var7` encoded bytes.
#[inline]
fn encode_var_usize(&mut self, value: usize) -> usize {
encode_var_usize(&mut self.buffer, value)
fn try_encode_var_length(&mut self, length: usize) -> Result<()> {
let add_len = length + calculate_var7_size(length);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename it to something like len_encode_var_usize which is very similar to encode_var_usize to draw the parallels to the reader of the code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I have to admit that it is a bit unfortunate that this encoding step is now 2 phased.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encode_var_usize pushes each byte separately, calling try_reserve every time a byte is encoded would take down the performance even more. However with push_within_capacity stabilized that could be implemented differently.

Alternative solution is to encode length into an array on the stack and then call extend_from_slice on self.buffer. This however can't be done efficiently without a bit of unsafe code (MaybeUninit) or would require using ArrayVec or a similar crate.

self.buffer.try_reserve(add_len)?;
encode_var_usize(&mut self.buffer, length);
Ok(())
}

/// Pushes the given string into the buffer and returns its span.
/// Pushes the given string into the buffer and returns its span on success.
///
/// # Panics
/// # Errors
///
/// If the backend ran out of symbols.
fn push_string(&mut self, string: &str) -> S {
let symbol = self.next_symbol();
/// Returns an [`Error`] if the backend ran out of symbols or memory.
fn try_push_string(&mut self, string: &str) -> Result<S> {
let symbol = self.try_next_symbol()?;
let str_len = string.len();
let str_bytes = string.as_bytes();
self.encode_var_usize(str_len);
self.buffer.extend(str_bytes);
self.try_encode_var_length(str_len)?;
// try_encode_var_length ensures there's enough space left for str_bytes
self.buffer.extend_from_slice(str_bytes);
self.len_strings += 1;
symbol
Ok(symbol)
}
}

/// According to google the approx. word length is 5.
const DEFAULT_STR_LEN: usize = 5;
Comment on lines +156 to +157
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really get rid of this thing entirely. My bad to have introduced them in the first place. 🙈

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By that, do you mean to remove with_capacity from the Backend altogether?


impl<S> Backend for BufferBackend<S>
where
S: Symbol,
Expand All @@ -158,11 +167,9 @@ where

#[cfg_attr(feature = "inline-more", inline)]
fn with_capacity(capacity: usize) -> Self {
/// We encode the `usize` string length into the buffer as well.
const LEN_USIZE: usize = mem::size_of::<usize>();
/// According to google the approx. word length is 5.
const DEFAULT_STR_LEN: usize = 5;
let bytes_per_string = DEFAULT_STR_LEN + LEN_USIZE;
// We encode the `usize` string length into the buffer as well.
let var7_len: usize = calculate_var7_size(capacity);
let bytes_per_string = DEFAULT_STR_LEN + var7_len;
Comment on lines -161 to +172
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said in a prior comment, we should probably get rid of all the heuristics based with_capacity methods and get this down to the basics. A Rust BTreeMap for example also does not provide a with_capacity method because it simply does not make sense to do so.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it's rather pointless.

Self {
len_strings: 0,
buffer: Vec::with_capacity(capacity * bytes_per_string),
Expand All @@ -171,8 +178,8 @@ where
}

#[inline]
fn intern(&mut self, string: &str) -> Self::Symbol {
self.push_string(string)
fn try_intern(&mut self, string: &str) -> Result<Self::Symbol> {
self.try_push_string(string)
}

#[inline]
Expand Down Expand Up @@ -200,6 +207,16 @@ where
}
}

/// Calculate var7 encoded size from a given value.
#[inline]
fn calculate_var7_size(value: usize) -> usize {
// number of bits to encode
// value = 0 would give 0 bits, hence: |1, could be anything up to |0x7F as well
let bits = usize::BITS - (value | 1).leading_zeros();
// (bits to encode / 7).ceil()
((bits + 6) / 7) as usize
}
Comment on lines +210 to +218
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain this formula a bit better so that it does not look like magic to the reader?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: is this precise or an over approximation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formula is precise and this is also reflected in tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formula goes like this:

  1. Take the binary size (bit length) of the original length
let bits = usize::BITS - (value | 1).leading_zeros();

For example if your length is 42 this is binary 101010, the bit length is 6.
Assuming usize::BITS is 64 leading_zeros of 42usize is 58. 64 - 58 = 6.

  1. If the binary size is 0, make it 1 (0 is the only case that breakes the formula, hence (value | 1)).

  2. The encoded size in bytes is the total number of encoded bits divided by 7 (each byte encodes only 7 bits of the original value), rounded up because the last byte's 7-bits might not be filled entirely: hence (bits + 6).

// (bits to encode / 7).ceil()
((bits + 6) / 7) as usize

For example 0x7F bit length is 7, (7 + 6) / 7 == 1

The length is encoded as: 0b0111_1111

0x80 bit length is 8, (8 + 6) / 7 == 2

The length is encoded as: 0b1000_0000 0b0000_0001

Copy link
Author

@royaltm royaltm Jun 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's take the original bit value of the encoded length (each bit placement is represented by a character)

ZYXWVUTSRQPONMLKJIHGFEDCBA9876543210 (bit-length: 36)

The encoded chunks end up being:

!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS _______Z

Where ! and _ are 1 and 0 bit markers respectively.

As seen clearly each 7 bits ends up in each chunk with the last chunk filled with the remaining 1-7 bits

(36 + 6) / 7 == 6

Now if we take:

YXWVUTSRQPONMLKJIHGFEDCBA9876543210 (bit-length: 35)

The encoded chunks end up being:

!6543210 !DCBA987 !KJIHGFE !RQPONML _YXWVUTS

(35 + 6) / 7 == 5

This process repeats every 7 bits.

The only exception is the 0 itself which if we followed the formula precisely should be encoded in exactly 0 bytes - because there are no bits to encode!

This leads us to the interesting observation that we could in fact extend the encoded var7 by adding any number of redundant 0 at the end, and the decoding formula (providing we have an infinite integer capacity) would still work:

!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS ________
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS !_______ ________
!6543210 !DCBA987 !KJIHGFE !RQPONML !YXWVUTS !_______ !_______ ________

and so on...

I hope that explains why 0 is breaking the formula without actually really breaking it. What we really need is just 1 bit of information that this is the end of the stream of data.


/// Encodes the value using variable length encoding into the buffer.
///
/// Returns the amount of bytes used for the encoding.
Expand Down Expand Up @@ -308,7 +325,7 @@ fn decode_var_usize_cold(buffer: &[u8]) -> Option<(usize, usize)> {

#[cfg(test)]
mod tests {
use super::{decode_var_usize, encode_var_usize};
use super::{calculate_var7_size, decode_var_usize, encode_var_usize};
#[cfg(not(feature = "std"))]
use alloc::vec::Vec;

Expand All @@ -317,6 +334,7 @@ mod tests {
let mut buffer = Vec::new();
for i in 0..2usize.pow(7) {
buffer.clear();
assert_eq!(calculate_var7_size(i), 1);
assert_eq!(encode_var_usize(&mut buffer, i), 1);
assert_eq!(buffer, [i as u8]);
assert_eq!(decode_var_usize(&buffer), Some((i, 1)));
Expand All @@ -328,6 +346,7 @@ mod tests {
let mut buffer = Vec::new();
for i in 2usize.pow(7)..2usize.pow(14) {
buffer.clear();
assert_eq!(calculate_var7_size(i), 2);
assert_eq!(encode_var_usize(&mut buffer, i), 2);
assert_eq!(buffer, [0x80 | ((i & 0x7F) as u8), (0x7F & (i >> 7) as u8)]);
assert_eq!(decode_var_usize(&buffer), Some((i, 2)));
Expand All @@ -340,6 +359,7 @@ mod tests {
let mut buffer = Vec::new();
for i in 2usize.pow(14)..2usize.pow(21) {
buffer.clear();
assert_eq!(calculate_var7_size(i), 3);
assert_eq!(encode_var_usize(&mut buffer, i), 3);
assert_eq!(
buffer,
Expand All @@ -359,6 +379,7 @@ mod tests {
let mut buffer = Vec::new();
for i in range {
buffer.clear();
assert_eq!(calculate_var7_size(i), 4);
assert_eq!(encode_var_usize(&mut buffer, i), 4);
assert_eq!(
buffer,
Expand Down Expand Up @@ -401,6 +422,7 @@ mod tests {
fn encode_var_u32_max_works() {
let mut buffer = Vec::new();
let i = u32::MAX as usize;
assert_eq!(calculate_var7_size(i), 5);
assert_eq!(encode_var_usize(&mut buffer, i), 5);
assert_eq!(buffer, [0xFF, 0xFF, 0xFF, 0xFF, 0x0F]);
assert_eq!(decode_var_usize(&buffer), Some((i, 5)));
Expand All @@ -410,6 +432,7 @@ mod tests {
fn encode_var_u64_max_works() {
let mut buffer = Vec::new();
let i = u64::MAX as usize;
assert_eq!(calculate_var7_size(i), 10);
assert_eq!(encode_var_usize(&mut buffer, i), 10);
assert_eq!(
buffer,
Expand Down
Loading
Loading