Adds lazy reader support for reading clobs #638
🗺️ PR tour
@@ -413,7 +413,7 @@ impl<'data> LazyRawBinaryValue<'data> {
     fn read_clob(&self) -> ValueParseResult<'data, BinaryEncoding> {
         debug_assert!(self.encoded_value.ion_type() == IonType::Clob);
         let bytes = self.value_body()?;
-        Ok(RawValueRef::Clob(bytes))
+        Ok(RawValueRef::Clob(bytes.into()))
🗺️ Reading a clob now returns a `BytesRef<'_>` instead of a `&[u8]` to accommodate the escape decoding process that happens in text clobs. This change mirrors the one made for blobs in #629.
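A minimal sketch of why a plain `&[u8]` is not enough here. The `ClobBytes` alias and `read_clob_body` function are hypothetical stand-ins for illustration, not the crate's `BytesRef` API, and the toy decoder only understands the `\n` escape. The point is that input without escapes can be borrowed directly, while input with escapes needs a freshly allocated, decoded buffer, so the return type has to be able to hold either.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a bytes reference that can borrow or own its data.
type ClobBytes<'a> = Cow<'a, [u8]>;

/// Toy reader: borrows the input when no escapes are present and
/// returns an owned, decoded copy otherwise. Only `\n` is decoded;
/// any other escape is copied through unchanged.
fn read_clob_body(raw: &[u8]) -> ClobBytes<'_> {
    if !raw.contains(&b'\\') {
        // Nothing to decode, so no allocation is needed.
        return Cow::Borrowed(raw);
    }
    let mut decoded = Vec::with_capacity(raw.len());
    let mut iter = raw.iter().copied();
    while let Some(byte) = iter.next() {
        if byte == b'\\' {
            match iter.next() {
                Some(b'n') => decoded.push(b'\n'),
                Some(other) => {
                    decoded.push(b'\\');
                    decoded.push(other);
                }
                None => decoded.push(b'\\'),
            }
        } else {
            decoded.push(byte);
        }
    }
    Cow::Owned(decoded)
}
```

A binary clob body would always take the borrowed path in a design like this, which is why the binary reader in the hunk above only needs `bytes.into()`.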
            Cow::Owned(text) => Vec::from(text).into(),
        }
    }
}
🗺️ This impl converts a `String` into its underlying `Vec<u8>` or a `&str` to its underlying `&[u8]`.
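The conversion described above can be sketched with standard-library types only. `text_into_bytes` is a hypothetical helper name, and `Cow` stands in for the crate's own string and bytes reference types, which are not reproduced here. Both arms reuse the existing storage: the borrowed arm reinterprets the slice, and the owned arm takes over the `String`'s buffer without copying.

```rust
use std::borrow::Cow;

/// Hypothetical helper: reinterpret borrowed-or-owned text as
/// borrowed-or-owned bytes without copying the underlying data.
fn text_into_bytes(text: Cow<'_, str>) -> Cow<'_, [u8]> {
    match text {
        // A `&str` is already a view over UTF-8 bytes.
        Cow::Borrowed(s) => Cow::Borrowed(s.as_bytes()),
        // `String::into_bytes` hands over the existing `Vec<u8>`.
        Cow::Owned(s) => Cow::Owned(s.into_bytes()),
    }
}
```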
@@ -1002,13 +1008,13 @@ impl<'data> TextBufferView<'data> {

     /// Returns a matched buffer and a boolean indicating whether any escaped characters were
     /// found in the short string.
-    fn match_short_string_body(self) -> IonParseResult<'data, (Self, bool)> {
+    pub(crate) fn match_short_string_body(self) -> IonParseResult<'data, (Self, bool)> {
🗺️ The clob reading logic re-uses the short- and long-form string matchers to isolate the content within the larger match.
         let text = String::from_utf8(sanitized).unwrap();
         Ok(StrRef::from(text.to_string()))
     }
 }

-fn escape_text(matched_input: TextBufferView, sanitized: &mut Vec<u8>) -> IonResult<()> {
+fn decode_text_containing_escapes(
🗺️ I renamed this method to make it clearer which "direction" we were going. It accepts text with escapes and decodes them into bytes.
This name is still confusing for me because it's possible to "decode" text to bytes (e.g. base64) and to "decode" bytes to text (e.g. UTF-8). What about something like `convert_escaped_text_to_bytes` or `decode_escaped_text_into_bytes`?
    let mut remaining = matched_input;

    // For ways to optimize this in the future, look at the `memchr` crate.
    let match_byte = |byte: &u8| *byte == b'\\' || *byte == b'\r';
🗺️ The logic needed to normalize an unescaped `\r` differs from that needed to replace an escaped `\r` (or any other escape). We're looking for a raw byte value `0x0D` that is not prefixed with a `\`.
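As a rough illustration of the scan, the sketch below reuses the `match_byte` predicate from the hunk above inside a hypothetical helper, `find_substitution_point`, which is not a function in the PR. It returns the offset of the first byte that needs attention: either a backslash that starts an escape sequence or a raw carriage return.

```rust
/// Hypothetical helper: returns the index of the first byte that needs
/// special handling, which is either a backslash (start of an escape)
/// or a raw carriage return (0x0D).
fn find_substitution_point(input: &[u8]) -> Option<usize> {
    let match_byte = |byte: &u8| *byte == b'\\' || *byte == b'\r';
    input.iter().position(match_byte)
}
```

A caller would copy everything before the returned index verbatim, then branch on which of the two bytes was found.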
        // being allocated when it isn't strictly necessary.
        contains_escaped_chars = true;
        continue;
    }
🗺️ In long-form clobs and long-form strings, we need to normalize unescaped `\r` and `\r\n` to `\n`. This throws the naming off a bit; `contains_escapes` should really be something like `requires_substitutions`. However, I think `escapes` is a more obvious/suggestive name. Open to input here; I left it as-is because a consistent rename across usages/modules would touch a lot of lines and I'd rather do it in another PR.
`requires_substitutions_of_escaped_characters`? (What a mouthful... maybe too long.)
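For reference, the newline normalization under discussion can be sketched on its own. `normalize_newlines` is a hypothetical standalone function, not the PR's implementation, which folds this step into the same pass that decodes escapes. It rewrites a lone `\r` and the pair `\r\n` to a single `\n`, which is why a flag named after escapes ends up covering input that contains no backslash at all.

```rust
/// Hypothetical helper: rewrites raw `\r` and `\r\n` sequences to `\n`.
fn normalize_newlines(input: &[u8]) -> Vec<u8> {
    let mut output = Vec::with_capacity(input.len());
    let mut index = 0;
    while index < input.len() {
        match input[index] {
            b'\r' => {
                output.push(b'\n');
                // Skip the `\n` of a `\r\n` pair so it isn't emitted twice.
                if input.get(index + 1) == Some(&b'\n') {
                    index += 1;
                }
            }
            other => output.push(other),
        }
        index += 1;
    }
    output
}
```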
    // Normalize newlines
    true,
    // Support unicode escapes
    true,
🗺️ I considered enums for these two `bool`s to make them self-documenting, but as they're not part of the public API I decided to just comment the handful of places where this method is called.
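For comparison, the enum alternative that was considered and not adopted might look roughly like this. Every name below (`NewlineHandling`, `UnicodeEscapes`, `decode_with_flags`, `decode_with_options`) is hypothetical and chosen for illustration; the PR keeps the two commented `bool` arguments.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NewlineHandling {
    Normalize,
    Preserve,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UnicodeEscapes {
    Supported,
    Unsupported,
}

/// Hypothetical stand-in for a decoder that takes two positional bools.
/// It simply echoes its settings so the mapping can be observed.
fn decode_with_flags(normalize_newlines: bool, support_unicode_escapes: bool) -> (bool, bool) {
    (normalize_newlines, support_unicode_escapes)
}

/// Thin wrapper that makes each call site self-documenting.
fn decode_with_options(newlines: NewlineHandling, unicode: UnicodeEscapes) -> (bool, bool) {
    decode_with_flags(
        newlines == NewlineHandling::Normalize,
        unicode == UnicodeEscapes::Supported,
    )
}
```

The trade-off is two extra types in exchange for call sites that cannot silently swap the arguments, which matters less for a crate-private function with only a handful of callers.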
    List,
    SExp,
    Struct,
    // TODO: ...the other types
🥳
    Short,
    Long,
Please add even a brief doc comment for these.
Also, is it worth having separate cases for with and without escapes? Or long with single vs multiple segments? (Did we already talk about this? I think we might have.)
There's a narrow case that benefits: single-segment clobs that only contain ASCII. Every other case requires a sanitization/decoding buffer anyway. I concluded that I'd wait to see if anyone actually uses clobs outside of `ion-tests` before worrying about optimizing it further.
Sorry, I meant to approve this because none of my latest comments are things that would block the PR.
Adds `LazyRawTextReader` support for matching and reading clobs. Fixes #634.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.