-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fallible APIs for StringInterner for #70 #71
base: master
Are you sure you want to change the base?
Changes from all commits
e8a7616
9ad151b
27c77cc
55df1e7
5b022da
a1b1175
474f2b9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,7 +5,7 @@ mod interned_str; | |
|
||
use self::{fixed_str::FixedString, interned_str::InternedStr}; | ||
use super::Backend; | ||
use crate::{symbol::expect_valid_symbol, DefaultSymbol, Symbol}; | ||
use crate::{symbol::expect_valid_symbol, DefaultSymbol, Error, Result, Symbol}; | ||
#[cfg(not(feature = "std"))] | ||
use alloc::string::String; | ||
use alloc::vec::Vec; | ||
|
@@ -91,25 +91,25 @@ where | |
fn with_capacity(cap: usize) -> Self { | ||
Self { | ||
spans: Vec::with_capacity(cap), | ||
head: FixedString::with_capacity(cap), | ||
head: FixedString::try_with_capacity(cap).unwrap(), | ||
full: Vec::new(), | ||
marker: Default::default(), | ||
} | ||
} | ||
|
||
#[inline] | ||
fn intern(&mut self, string: &str) -> Self::Symbol { | ||
fn try_intern(&mut self, string: &str) -> Result<Self::Symbol> { | ||
// SAFETY: This is safe because we never hand out the returned | ||
// interned string instance to the outside and only operate | ||
// on it within this backend. | ||
let interned = unsafe { self.alloc(string) }; | ||
self.push_span(interned) | ||
let interned = unsafe { self.try_alloc(string)? }; | ||
self.try_push_span(interned) | ||
} | ||
|
||
#[cfg_attr(feature = "inline-more", inline)] | ||
fn intern_static(&mut self, string: &'static str) -> Self::Symbol { | ||
fn try_intern_static(&mut self, string: &'static str) -> Result<Self::Symbol> { | ||
let interned = InternedStr::new(string); | ||
self.push_span(interned) | ||
self.try_push_span(interned) | ||
} | ||
|
||
fn shrink_to_fit(&mut self) { | ||
|
@@ -142,29 +142,54 @@ where | |
S: Symbol, | ||
{ | ||
/// Returns the next available symbol. | ||
fn next_symbol(&self) -> S { | ||
expect_valid_symbol(self.spans.len()) | ||
fn try_next_symbol(&self) -> Result<S> { | ||
S::try_from_usize(self.spans.len()).ok_or(Error::OutOfSymbols) | ||
} | ||
|
||
/// Pushes the given interned string into the spans and returns its symbol. | ||
fn push_span(&mut self, interned: InternedStr) -> S { | ||
let symbol = self.next_symbol(); | ||
/// Pushes the given interned string into the spans and returns its symbol on success. | ||
fn try_push_span(&mut self, interned: InternedStr) -> Result<S> { | ||
let symbol = self.try_next_symbol()?; | ||
// FIXME: vec_push_within_capacity #100486, replace the following with: | ||
// | ||
// if let Err(value) = self.spans.push_within_capacity(interned) { | ||
// self.spans.try_reserve(1)?; | ||
// // this cannot fail, the previous line either returned or added at least 1 free slot | ||
// let _ = self.spans.push_within_capacity(value); | ||
Comment on lines
+155
to
+157
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could also recursively call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. wouldn't that end up calling With the new code, we try to reserve capacity eagerly:
which unfortunately checks capacity redundantly, the 2nd one being in the push itself, which might impact the performance. The whole point of using |
||
// } | ||
self.spans.try_reserve(1)?; | ||
self.spans.push(interned); | ||
symbol | ||
|
||
Ok(symbol) | ||
} | ||
|
||
/// Interns a new string into the backend and returns a reference to it. | ||
unsafe fn alloc(&mut self, string: &str) -> InternedStr { | ||
/// Ensure head has enough reserved capacity or replace it with a new one. | ||
#[cfg_attr(feature = "inline-more", inline)] | ||
fn try_reserve_head(&mut self, additional: usize) -> Result<()> { | ||
let cap = self.head.capacity(); | ||
if cap < self.head.len() + string.len() { | ||
let new_cap = (usize::max(cap, string.len()) + 1).next_power_of_two(); | ||
let new_head = FixedString::with_capacity(new_cap); | ||
if cap < self.head.len() + additional { | ||
let new_cap = (usize::max(cap, additional) + 1).next_power_of_two(); | ||
let new_head = FixedString::try_with_capacity(new_cap)?; | ||
let old_head = core::mem::replace(&mut self.head, new_head); | ||
self.full.push(old_head.finish()); | ||
let old_string = old_head.finish(); | ||
// FIXME: vec_push_within_capacity #100486, replace the following with: | ||
// | ||
// if let Err(value) = self.full.push_within_capacity(old_string) { | ||
// self.full.try_reserve(1)?; | ||
// // this cannot fail, the previous line either returned or added at least 1 free slot | ||
// let _ = self.full.push_within_capacity(value); | ||
// } | ||
Comment on lines
+174
to
+180
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems to be the same code as above so maybe propose to introduce a helper method instead. Or? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sure, but that would need to be a generic function that takes a vector as argument because |
||
self.full.try_reserve(1)?; | ||
self.full.push(old_string); | ||
} | ||
self.head | ||
Ok(()) | ||
} | ||
/// Interns a new string into the backend and returns a reference to it. | ||
unsafe fn try_alloc(&mut self, string: &str) -> Result<InternedStr> { | ||
self.try_reserve_head(string.len())?; | ||
Ok(self | ||
.head | ||
.push_str(string) | ||
.expect("encountered invalid head capacity (2)") | ||
.expect("encountered invalid head capacity (2)")) | ||
} | ||
} | ||
|
||
|
@@ -174,7 +199,7 @@ impl<S> Clone for BucketBackend<S> { | |
// head string leaving the cloned `full` empty. | ||
let new_head_cap = | ||
self.head.capacity() + self.full.iter().fold(0, |lhs, rhs| lhs + rhs.len()); | ||
let mut head = FixedString::with_capacity(new_head_cap); | ||
let mut head = FixedString::try_with_capacity(new_head_cap).unwrap(); | ||
let mut spans = Vec::with_capacity(self.spans.len()); | ||
for span in &self.spans { | ||
let string = span.as_str(); | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,9 @@ | ||
#![cfg(feature = "backends")] | ||
|
||
use super::Backend; | ||
use crate::{symbol::expect_valid_symbol, DefaultSymbol, Symbol}; | ||
use crate::{DefaultSymbol, Error, Result, Symbol}; | ||
use alloc::vec::Vec; | ||
use core::{marker::PhantomData, mem, str}; | ||
use core::{marker::PhantomData, str}; | ||
|
||
/// An interner backend that appends all interned string information in a single buffer. | ||
/// | ||
|
@@ -78,8 +78,8 @@ where | |
{ | ||
/// Returns the next available symbol. | ||
#[inline] | ||
fn next_symbol(&self) -> S { | ||
expect_valid_symbol(self.buffer.len()) | ||
fn try_next_symbol(&self) -> Result<S> { | ||
S::try_from_usize(self.buffer.len()).ok_or(Error::OutOfSymbols) | ||
} | ||
|
||
/// Resolves the string for the given symbol if any. | ||
|
@@ -123,30 +123,39 @@ where | |
unsafe { str::from_utf8_unchecked(str_bytes) } | ||
} | ||
|
||
/// Pushes the given value onto the buffer with `var7` encoding. | ||
/// Pushes the given length value onto the buffer with `var7` encoding. | ||
/// Ensures there's enough capacity for the `var7` encoded length and | ||
/// the following string bytes ahead of pushing. | ||
/// | ||
/// Returns the amount of `var7` encoded bytes. | ||
#[inline] | ||
fn encode_var_usize(&mut self, value: usize) -> usize { | ||
encode_var_usize(&mut self.buffer, value) | ||
fn try_encode_var_length(&mut self, length: usize) -> Result<()> { | ||
let add_len = length + calculate_var7_size(length); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would rename it to something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Though I have to admit that it is a bit unfortunate that this encoding step is now 2 phased. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Alternative solution is to encode length into an array on the stack and then call |
||
self.buffer.try_reserve(add_len)?; | ||
encode_var_usize(&mut self.buffer, length); | ||
Ok(()) | ||
} | ||
|
||
/// Pushes the given string into the buffer and returns its span. | ||
/// Pushes the given string into the buffer and returns its span on success. | ||
/// | ||
/// # Panics | ||
/// # Errors | ||
/// | ||
/// If the backend ran out of symbols. | ||
fn push_string(&mut self, string: &str) -> S { | ||
let symbol = self.next_symbol(); | ||
/// Returns an [`Error`] if the backend ran out of symbols or memory. | ||
fn try_push_string(&mut self, string: &str) -> Result<S> { | ||
let symbol = self.try_next_symbol()?; | ||
let str_len = string.len(); | ||
let str_bytes = string.as_bytes(); | ||
self.encode_var_usize(str_len); | ||
self.buffer.extend(str_bytes); | ||
self.try_encode_var_length(str_len)?; | ||
// try_encode_var_length ensures there's enough space left for str_bytes | ||
self.buffer.extend_from_slice(str_bytes); | ||
self.len_strings += 1; | ||
symbol | ||
Ok(symbol) | ||
} | ||
} | ||
|
||
/// According to google the approx. word length is 5. | ||
const DEFAULT_STR_LEN: usize = 5; | ||
Comment on lines
+156
to
+157
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should really get rid of this thing entirely. My bad to have introduced them in the first place. 🙈 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. By that, do you mean to remove |
||
|
||
impl<S> Backend for BufferBackend<S> | ||
where | ||
S: Symbol, | ||
|
@@ -158,11 +167,9 @@ where | |
|
||
#[cfg_attr(feature = "inline-more", inline)] | ||
fn with_capacity(capacity: usize) -> Self { | ||
/// We encode the `usize` string length into the buffer as well. | ||
const LEN_USIZE: usize = mem::size_of::<usize>(); | ||
/// According to google the approx. word length is 5. | ||
const DEFAULT_STR_LEN: usize = 5; | ||
let bytes_per_string = DEFAULT_STR_LEN + LEN_USIZE; | ||
// We encode the `usize` string length into the buffer as well. | ||
let var7_len: usize = calculate_var7_size(capacity); | ||
let bytes_per_string = DEFAULT_STR_LEN + var7_len; | ||
Comment on lines
-161
to
+172
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As said in a prior comment, we should probably get rid of all the heuristics based There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, it's rather pointless. |
||
Self { | ||
len_strings: 0, | ||
buffer: Vec::with_capacity(capacity * bytes_per_string), | ||
|
@@ -171,8 +178,8 @@ where | |
} | ||
|
||
#[inline] | ||
fn intern(&mut self, string: &str) -> Self::Symbol { | ||
self.push_string(string) | ||
fn try_intern(&mut self, string: &str) -> Result<Self::Symbol> { | ||
self.try_push_string(string) | ||
} | ||
|
||
#[inline] | ||
|
@@ -200,6 +207,16 @@ where | |
} | ||
} | ||
|
||
/// Calculate var7 encoded size from a given value. | ||
#[inline] | ||
fn calculate_var7_size(value: usize) -> usize { | ||
// number of bits to encode | ||
// value = 0 would give 0 bits, hence: |1, could be anything up to |0x7F as well | ||
let bits = usize::BITS - (value | 1).leading_zeros(); | ||
// (bits to encode / 7).ceil() | ||
((bits + 6) / 7) as usize | ||
} | ||
Comment on lines
+210
to
+218
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you please explain this formula a bit better so that it does not look like magic to the reader? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also: is this precise or an over approximation? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This formula is precise and this is also reflected in tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The formula goes like this:
For example if your length is
For example 0x7F bit length is 7, (7 + 6) / 7 == 1 The length is encoded as: 0b0111_1111 0x80 bit length is 8, (8 + 6) / 7 == 2 The length is encoded as: 0b1000_0000 0b0000_0001 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's take the original bit value of the encoded length (each bit placement is represented by a character)
The encoded chunks end up being:
Where As seen clearly each 7 bits ends up in each chunk with the last chunk filled with the remaining 1-7 bits
Now if we take:
The encoded chunks end up being:
This process repeats every 7 bits. The only exception is the This leads us to the interesting observation that we could in fact extend the encoded var7 by adding any number of redundant
and so on... I hope that explains why |
||
|
||
/// Encodes the value using variable length encoding into the buffer. | ||
/// | ||
/// Returns the amount of bytes used for the encoding. | ||
|
@@ -308,7 +325,7 @@ fn decode_var_usize_cold(buffer: &[u8]) -> Option<(usize, usize)> { | |
|
||
#[cfg(test)] | ||
mod tests { | ||
use super::{decode_var_usize, encode_var_usize}; | ||
use super::{calculate_var7_size, decode_var_usize, encode_var_usize}; | ||
#[cfg(not(feature = "std"))] | ||
use alloc::vec::Vec; | ||
|
||
|
@@ -317,6 +334,7 @@ mod tests { | |
let mut buffer = Vec::new(); | ||
for i in 0..2usize.pow(7) { | ||
buffer.clear(); | ||
assert_eq!(calculate_var7_size(i), 1); | ||
assert_eq!(encode_var_usize(&mut buffer, i), 1); | ||
assert_eq!(buffer, [i as u8]); | ||
assert_eq!(decode_var_usize(&buffer), Some((i, 1))); | ||
|
@@ -328,6 +346,7 @@ mod tests { | |
let mut buffer = Vec::new(); | ||
for i in 2usize.pow(7)..2usize.pow(14) { | ||
buffer.clear(); | ||
assert_eq!(calculate_var7_size(i), 2); | ||
assert_eq!(encode_var_usize(&mut buffer, i), 2); | ||
assert_eq!(buffer, [0x80 | ((i & 0x7F) as u8), (0x7F & (i >> 7) as u8)]); | ||
assert_eq!(decode_var_usize(&buffer), Some((i, 2))); | ||
|
@@ -340,6 +359,7 @@ mod tests { | |
let mut buffer = Vec::new(); | ||
for i in 2usize.pow(14)..2usize.pow(21) { | ||
buffer.clear(); | ||
assert_eq!(calculate_var7_size(i), 3); | ||
assert_eq!(encode_var_usize(&mut buffer, i), 3); | ||
assert_eq!( | ||
buffer, | ||
|
@@ -359,6 +379,7 @@ mod tests { | |
let mut buffer = Vec::new(); | ||
for i in range { | ||
buffer.clear(); | ||
assert_eq!(calculate_var7_size(i), 4); | ||
assert_eq!(encode_var_usize(&mut buffer, i), 4); | ||
assert_eq!( | ||
buffer, | ||
|
@@ -401,6 +422,7 @@ mod tests { | |
fn encode_var_u32_max_works() { | ||
let mut buffer = Vec::new(); | ||
let i = u32::MAX as usize; | ||
assert_eq!(calculate_var7_size(i), 5); | ||
assert_eq!(encode_var_usize(&mut buffer, i), 5); | ||
assert_eq!(buffer, [0xFF, 0xFF, 0xFF, 0xFF, 0x0F]); | ||
assert_eq!(decode_var_usize(&buffer), Some((i, 5))); | ||
|
@@ -410,6 +432,7 @@ mod tests { | |
fn encode_var_u64_max_works() { | ||
let mut buffer = Vec::new(); | ||
let i = u64::MAX as usize; | ||
assert_eq!(calculate_var7_size(i), 10); | ||
assert_eq!(encode_var_usize(&mut buffer, i), 10); | ||
assert_eq!( | ||
buffer, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big fan of using raw
unwrap
calls but coming up with proper expect messages is not easy here I see.I usually try to use
unwrap_or_else(|| panic!("..."))
because it allows to use scoped information for the message and otherwise I useexpect("...")
with a proper message stating why this must not happen. However, here it is due to the guarantees of the API. 🤔There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I was actually assuming the error message from try_with_capacity will do the job.