Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds LazyRawTextReader support for structs #619

Merged
merged 16 commits into from
Aug 23, 2023
Merged

Adds LazyRawTextReader support for structs #619

merged 16 commits into from
Aug 23, 2023

Conversation

zslayton
Copy link
Contributor

Builds on outstanding PRs #609, #612, #613, #614, #616, and #617.

Adds support for reading structs to the LazyRawTextReader.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov
Copy link

codecov bot commented Aug 10, 2023

Codecov Report

Patch coverage: 74.43% and no project coverage change.

Comparison is base (cb1042a) 81.64% compared to head (37a2a50) 81.65%.

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #619    +/-   ##
========================================
  Coverage   81.64%   81.65%            
========================================
  Files         120      122     +2     
  Lines       21876    22235   +359     
  Branches    21876    22235   +359     
========================================
+ Hits        17860    18155   +295     
- Misses       2366     2411    +45     
- Partials     1650     1669    +19     
Files Changed Coverage Δ
src/lazy/encoding.rs 0.00% <ø> (ø)
src/lazy/text/matched.rs 67.32% <57.57%> (-2.49%) ⬇️
src/lazy/decoder.rs 60.00% <60.00%> (ø)
src/lazy/text/encoded_value.rs 70.54% <60.71%> (+5.78%) ⬆️
src/lazy/text/raw/reader.rs 90.28% <61.53%> (-2.70%) ⬇️
src/lazy/text/raw/struct.rs 68.57% <68.57%> (ø)
src/lazy/text/raw/sequence.rs 53.27% <80.95%> (+28.27%) ⬆️
src/lazy/text/buffer.rs 88.25% <85.71%> (-0.48%) ⬇️
src/lazy/raw_value_ref.rs 71.09% <100.00%> (ø)
src/lazy/text/value.rs 44.73% <100.00%> (+8.37%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This file housed lots of ToDo* types that existed to satisfy the LazyDecoder trait. This PR replaces them with actual implementations.

map(
match_and_length(Self::match_struct),
|(matched_struct, length)| {
EncodedTextValue::new(MatchedValue::Struct, matched_struct.offset(), length)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This PR changes the parser to require that containers be completely buffered before reading begins. This means that all value types have the same behavior for both buffering and skipping. If we later determine that this is too restrictive, we can change it.

@@ -331,6 +411,74 @@ impl<'data> TextBufferView<'data> {
.parse(self)
}

/// Matches a list.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Introduces a matching function for the list type that finds its end offset.

Ok((remaining, matched))
}

/// Matches a struct.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Introduces a matching function for the struct type that finds its end offset.

Comment on lines -620 to +772
MatchedString::Short(MatchedShortString::new(contains_escaped_chars))
if contains_escaped_chars {
MatchedString::ShortWithEscapes
} else {
MatchedString::ShortWithoutEscapes
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The MatchedShortString enum variant used to have a nested (bool) that indicated whether the string contained escapes. Per @popematt's feedback on #616, I've broken it into two variants with descriptive names.

@@ -715,7 +867,13 @@ impl<'data> TextBufferView<'data> {
/// Matches a quoted symbol (`'foo'`).
fn match_quoted_symbol(self) -> IonParseResult<'data, MatchedSymbol> {
delimited(char('\''), Self::match_quoted_symbol_body, char('\''))
.map(|(_matched, contains_escaped_chars)| MatchedSymbol::Quoted(contains_escaped_chars))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ As above, I've broken out the nested (bool) on the MatchedSymbol variant.

@@ -86,7 +91,13 @@ impl EncodedTextValue {
// 'foo'
// "foo"
// $10
pub(crate) fn with_field_name(mut self, offset: usize, length: usize) -> EncodedTextValue {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This now stores the MatchedSymbol, which indicates which symbol syntax was found in the input. (Quoted, identifier, symbol ID)

Comment on lines 50 to +54
fn read(&self) -> IonResult<RawValueRef<'data, TextEncoding>> {
let matched_input = self.input.slice(0, self.encoded_value.data_length());
let matched_input = self.input.slice(
self.encoded_value.data_offset() - self.input.offset(),
self.encoded_value.data_length(),
);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Prior to this PR, the matched input slice always started with the value. Now that we have struct fields, it sometimes begins at the field name.

@zslayton zslayton marked this pull request as ready for review August 10, 2023 00:40

/// Matches any amount of whitespace followed by a closing `}`.
fn match_struct_end(self) -> IonMatchResult<'data> {
whitespace_and_then(peek(tag("}"))).parse(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does whitespace_and_then also account for comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does. I wanted a more explicit name but couldn't get it short enough. ws_and_comments_and_then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ambivalent about the name. I was just looking to confirm my assumption about its behavior.

@@ -60,6 +62,8 @@ pub(crate) struct EncodedTextValue {
// value is stored. For others (e.g. a timestamp), the various components of the value are
// recognized during matching and partial information like subfield offsets can be stored here.
matched_value: MatchedValue,

field_name_syntax: Option<MatchedSymbol>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Doc comment?
  2. Did you consider creating a LazyRawTextSymbolToken or similar to represent annotations and field names? What made you decide to use the approach that you did?
  3. Further down, I see that there is a LazyRawTextField. Is there any way we could put field name data (length, offset, etc) in that struct (or something analogous) rather than in EncodedTextValue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Will do.
  2. I'm not sure I understand the suggestion, but I can explain the approach. Both the binary version and the text version model their lazy values as a (field_id?, annotations?, value) tuple. That is, the field is modeled as being part of the value. (This language is loose; the data structures and approach to parsing couple the field with the value but I recognize that "the field is part of the value" doesn't feel quite right from an Ion data model perspective.) If I did introduce a LazyRawTextSymbolToken type, it would simply wrap a MatchedSymbol. Let me know if you'd like to discuss this further, I think an offline discussion would probably be helpful.
  3. I think it's possible, yeah. Might even be a small performance win by shrinking the size of values that aren't in structs. I've opened Explore moving struct field info out of Lazy*Value and into Lazy*Field #631 to track investigating this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #632 to track investigating an optimized representation for annotations offset info.

// Quoted symbol
'bar': 200,
// Short-form string
"baz": 300
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have some comments in unexpected (but legal) locations.

Suggested change
"baz": 300
"baz": 300,
/* Comments... */ qux /* comments */ : /* are */ 400 /* everywhere */,

Base automatically changed from lazy-lists to main August 23, 2023 00:44
@zslayton zslayton merged commit 998a5bf into main Aug 23, 2023
17 of 18 checks passed
@zslayton zslayton deleted the lazy-structs branch August 23, 2023 03:18
@zslayton zslayton self-assigned this Aug 29, 2023
zslayton added a commit that referenced this pull request Sep 7, 2023
zslayton added a commit that referenced this pull request Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants