Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds LazyRawTextReader support for reading symbols #616

Merged
merged 13 commits into from
Aug 23, 2023
Merged

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented Aug 1, 2023

Builds on #609, #612, #613, and #614.

Adds support for reading quoted symbols ('foo'), identifiers (foo), and symbol IDs ($42). Also modifies the SymbolRef and RawSymbolToken types to hold a Cow<'a, str> instead of a &str to accommodate situations where the symbol's text in input contained escapes and so required allocating a new string.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour

Comment on lines +594 to +598
fn match_long_string(self) -> IonParseResult<'data, MatchedString> {
// TODO: implement long string matching
// The `fail` parser is a nom builtin that never matches.
fail(self)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This placeholder method was moved from further down in the file.


/// A helper method for matching bytes until the specified delimiter. Ignores any byte
/// (including the delimiter) that is prefaced by the escape character `\`.
fn match_text_until_unescaped(self, delimiter: u8) -> IonParseResult<'data, (Self, bool)> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This method was previously match_short_string, but it's generally useful for both strings and symbols. match_short_string and match_quoted_symbol now call this.

self.input,
result
);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Prior to this change, this unit test method would assert that there was no match. However, it was possible for the parser to match part of the input and report success. Now this method requires that the parser match the entire test input to be considered a successful match.


fn escape_short_string(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This method was also generally useful for text types and has been broken out into a helper method.


#[test]
fn test_top_level() -> IonResult<()> {
let data = r#"
let mut data = String::new();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Previously, was just a &str literal. However, some of the test cases require actual escaped bytes to appear in them, which isn't possible within a raw string (r#""#). Now it's a mutable String that we can append things to in bulk.


/// Like RawSymbolToken, but the Text variant holds a borrowed reference instead of a String.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RawSymbolTokenRef<'a> {
SymbolId(SymbolId),
Text(&'a str),
Text(Cow<'a, str>),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ If the raw reader encounters a symbol like 'Hello\nworld!', it can't just return a reference to those bytes in the input buffer. It has to make a new String with the \n replaced by 0x0A. Using Cow allows the RawSymbolTokenRef to hold either a borrowed &str or an owned String.

use std::fmt::{Debug, Formatter};
use std::hash::{Hash, Hasher};

/// A reference to a fully resolved symbol. Like `Symbol` (a fully resolved symbol with a
/// static lifetime), a `SymbolRef` may have known or undefined text (i.e. `$0`).
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone)]
pub struct SymbolRef<'a> {
text: Option<&'a str>,
text: Option<Cow<'a, str>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This change is analogous to the one in RawSymbolTokenRef.

@zslayton zslayton marked this pull request as ready for review August 1, 2023 22:37
@codecov
Copy link

codecov bot commented Aug 1, 2023

Codecov Report

Patch coverage: 84.31% and project coverage change: +0.08% 🎉

Comparison is base (6d22b6f) 81.64% compared to head (eba5913) 81.72%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #616      +/-   ##
==========================================
+ Coverage   81.64%   81.72%   +0.08%     
==========================================
  Files         119      119              
  Lines       21547    21778     +231     
  Branches    21547    21778     +231     
==========================================
+ Hits        17591    17799     +208     
- Misses       2312     2331      +19     
- Partials     1644     1648       +4     
Files Changed Coverage Δ
src/lazy/text/encoded_value.rs 65.38% <0.00%> (-0.64%) ⬇️
src/lazy/text/value.rs 32.25% <0.00%> (-1.08%) ⬇️
src/lazy/value.rs 73.28% <0.00%> (-1.34%) ⬇️
src/lazy/text/matched.rs 69.80% <69.48%> (+0.46%) ⬆️
src/symbol_ref.rs 75.00% <69.56%> (-7.36%) ⬇️
src/raw_symbol_token_ref.rs 88.46% <80.00%> (-0.83%) ⬇️
src/lazy/text/buffer.rs 88.36% <99.22%> (+2.03%) ⬆️
src/binary/binary_writer.rs 64.67% <100.00%> (ø)
src/lazy/text/raw/reader.rs 93.45% <100.00%> (+2.31%) ⬆️
src/text/raw_text_writer.rs 85.67% <100.00%> (+0.01%) ⬆️
... and 2 more

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +1176 to +1177
// These inputs have leading/trailing whitespace to make them more readable, but the string
// matcher doesn't accept whitespace. We'll trim each one before testing it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outdated comment?


sanitized.push(substitute);
Ok(input_after_escape)
fn write_escaped<'data>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc comment for this function would be appreciated. I believe this is responsible for rewriting escaped characters as their unescaped counterparts—in other words, it unescapes any escaped characters in a TextBufferView—but the function name had me thinking the opposite at first.

Comment on lines +391 to +393
/// The symbol is delimited by single quotes. Holds a `bool` indicating whether the
/// matched input contained any escaped bytes.
Quoted(bool),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious—any particular reason for having Quoted(bool) instead of e.g. Quoted and QuotedWithEscaped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, that's better. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in #619.

@zslayton zslayton merged commit dc8579d into main Aug 23, 2023
18 checks passed
@zslayton zslayton deleted the lazy-symbols branch August 23, 2023 00:31
@zslayton zslayton self-assigned this Aug 29, 2023
zslayton added a commit that referenced this pull request Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants