-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds LazyRawTextReader
support for structs
#619
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #619 +/- ##
========================================
Coverage 81.64% 81.65%
========================================
Files 120 122 +2
Lines 21876 22235 +359
Branches 21876 22235 +359
========================================
+ Hits 17860 18155 +295
- Misses 2366 2411 +45
- Partials 1650 1669 +19
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ PR tour
src/lazy/encoding.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This file housed lots of ToDo*
types that existed to satisfy the LazyDecoder
trait. This PR replaces them with actual implementations.
map( | ||
match_and_length(Self::match_struct), | ||
|(matched_struct, length)| { | ||
EncodedTextValue::new(MatchedValue::Struct, matched_struct.offset(), length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This PR changes the parser to require that containers be completely buffered before reading begins. This means that all value types have the same behavior for both buffering and skipping. If we later determine that this is too restrictive, we can change it.
@@ -331,6 +411,74 @@ impl<'data> TextBufferView<'data> { | |||
.parse(self) | |||
} | |||
|
|||
/// Matches a list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Introduces a matching function for the list
type that finds its end offset.
Ok((remaining, matched)) | ||
} | ||
|
||
/// Matches a struct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Introduces a matching function for the struct
type that finds its end offset.
MatchedString::Short(MatchedShortString::new(contains_escaped_chars)) | ||
if contains_escaped_chars { | ||
MatchedString::ShortWithEscapes | ||
} else { | ||
MatchedString::ShortWithoutEscapes | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -715,7 +867,13 @@ impl<'data> TextBufferView<'data> { | |||
/// Matches a quoted symbol (`'foo'`). | |||
fn match_quoted_symbol(self) -> IonParseResult<'data, MatchedSymbol> { | |||
delimited(char('\''), Self::match_quoted_symbol_body, char('\'')) | |||
.map(|(_matched, contains_escaped_chars)| MatchedSymbol::Quoted(contains_escaped_chars)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ As above, I've broken out the nested (bool)
on the MatchedSymbol
variant.
@@ -86,7 +91,13 @@ impl EncodedTextValue { | |||
// 'foo' | |||
// "foo" | |||
// $10 | |||
pub(crate) fn with_field_name(mut self, offset: usize, length: usize) -> EncodedTextValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ This now stores the MatchedSymbol
, which indicates which symbol syntax was found in the input. (Quoted, identifier, symbol ID)
fn read(&self) -> IonResult<RawValueRef<'data, TextEncoding>> { | ||
let matched_input = self.input.slice(0, self.encoded_value.data_length()); | ||
let matched_input = self.input.slice( | ||
self.encoded_value.data_offset() - self.input.offset(), | ||
self.encoded_value.data_length(), | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🗺️ Prior to this PR, the matched input slice always started with the value. Now that we have struct fields, it sometimes begins at the field name.
|
||
/// Matches any amount of whitespace followed by a closing `}`. | ||
fn match_struct_end(self) -> IonMatchResult<'data> { | ||
whitespace_and_then(peek(tag("}"))).parse(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does whitespace_and_then
also account for comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does. I wanted a more explicit name but couldn't get it short enough. ws_and_comments_and_then
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ambivalent about the name. I was just looking to confirm my assumption about its behavior.
@@ -60,6 +62,8 @@ pub(crate) struct EncodedTextValue { | |||
// value is stored. For others (e.g. a timestamp), the various components of the value are | |||
// recognized during matching and partial information like subfield offsets can be stored here. | |||
matched_value: MatchedValue, | |||
|
|||
field_name_syntax: Option<MatchedSymbol>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Doc comment?
- Did you consider creating a
LazyRawTextSymbolToken
or similar to represent annotations and field names? What made you decide to use the approach that you did? - Further down, I see that there is a
LazyRawTextField
. Is there any way we could put field name data (length, offset, etc) in that struct (or something analogous) rather than inEncodedTextValue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Will do.
- I'm not sure I understand the suggestion, but I can explain the approach. Both the binary version and the text version model their lazy values as a
(field_id?, annotations?, value)
tuple. That is, the field is modeled as being part of the value. (This language is loose; the data structures and approach to parsing couple the field with the value but I recognize that "the field is part of the value" doesn't feel quite right from an Ion data model perspective.) If I did introduce aLazyRawTextSymbolToken
type, it would simply wrap aMatchedSymbol
. Let me know if you'd like to discuss this further, I think an offline discussion would probably be helpful. - I think it's possible, yeah. Might even be a small performance win by shrinking the size of values that aren't in structs. I've opened Explore moving struct field info out of
Lazy*Value
and intoLazy*Field
#631 to track investigating this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #632 to track investigating an optimized representation for annotations offset info.
// Quoted symbol | ||
'bar': 200, | ||
// Short-form string | ||
"baz": 300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have some comments in unexpected (but legal) locations.
"baz": 300 | |
"baz": 300, | |
/* Comments... */ qux /* comments */ : /* are */ 400 /* everywhere */, |
Builds on outstanding PRs #609, #612, #613, #614, #616, and #617.
Adds support for reading structs to the
LazyRawTextReader
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.