Adds `LazyRawTextReader` support for reading symbols #616

zslayton · 2023-08-01T20:24:29Z

Adds support for reading quoted symbols (`'foo'`), identifiers (`foo`), and symbol IDs (`$42`). Also modifies the `SymbolRef` and `RawSymbolTokenRef` types to hold a `Cow<'a, str>` instead of a `&str`, to accommodate cases where the symbol's text in the input contains escapes and therefore requires allocating a new string.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

zslayton

🗺️ PR tour

zslayton · 2023-08-01T20:32:13Z

src/lazy/text/buffer.rs

+    fn match_long_string(self) -> IonParseResult<'data, MatchedString> {
+        // TODO: implement long string matching
+        //       The `fail` parser is a nom builtin that never matches.
+        fail(self)
+    }


🗺️ This placeholder method was moved from further down in the file.

zslayton · 2023-08-01T20:40:19Z

src/lazy/text/buffer.rs

+
+    /// A helper method for matching bytes until the specified delimiter. Ignores any byte
+    /// (including the delimiter) that is prefaced by the escape character `\`.
+    fn match_text_until_unescaped(self, delimiter: u8) -> IonParseResult<'data, (Self, bool)> {


🗺️ This method was previously match_short_string, but it's generally useful for both strings and symbols. match_short_string and match_quoted_symbol now call this.
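
A minimal sketch of the scanning idea, not the PR's implementation: the real method works on a `TextBufferView` and returns a nom-style `IonParseResult`. The free function `find_unescaped` and its return shape are assumptions made for illustration. It looks for the first delimiter that is not preceded by `\` and reports whether any escape was skipped along the way.

```rust
/// Scans `input` for the first `delimiter` byte that is not preceded by a backslash.
/// Returns the index of that delimiter and whether any escape sequence was skipped,
/// or `None` if the input ends before an unescaped delimiter appears.
/// (Hypothetical helper; the name and signature are not from the PR.)
fn find_unescaped(input: &[u8], delimiter: u8) -> Option<(usize, bool)> {
    let mut contains_escapes = false;
    let mut index = 0;
    while index < input.len() {
        match input[index] {
            b'\\' => {
                // Skip the backslash and the byte it escapes, even if that byte
                // is the delimiter itself.
                contains_escapes = true;
                index += 2;
            }
            byte if byte == delimiter => return Some((index, contains_escapes)),
            _ => index += 1,
        }
    }
    None
}
```

The returned flag is what lets a caller decide later whether the matched text can be borrowed as-is or needs to be rewritten.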

zslayton · 2023-08-01T20:41:44Z

src/lazy/text/buffer.rs

+                    self.input,
+                    result
+                );
+            }


🗺️ Prior to this change, this unit test method would assert that there was no match. However, it was possible for the parser to match part of the input and report success. Now this method requires that the parser match the entire test input to be considered a successful match.
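
The difference between "the parser reported success" and "the parser consumed everything" can be sketched as follows. This does not use nom or the PR's test helpers; `match_digits` and `matches_entire_input` are hypothetical names, and the `(remaining, matched)` return shape is an assumption that mirrors how parser combinators usually report results.

```rust
/// A toy matcher standing in for a real parser: it consumes leading ASCII digits
/// and returns `(remaining, matched)`, or `None` if the input has no leading digit.
fn match_digits(input: &str) -> Option<(&str, &str)> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    Some((&input[end..], &input[..end]))
}

/// Returns true only if `parser` succeeds AND leaves no unconsumed input behind.
fn matches_entire_input<'a>(
    parser: impl Fn(&'a str) -> Option<(&'a str, &'a str)>,
    input: &'a str,
) -> bool {
    match parser(input) {
        Some((remaining, _matched)) => remaining.is_empty(),
        None => false,
    }
}
```

With this check, an input such as `123abc` counts as a mismatch even though the toy matcher happily consumes the `123` prefix.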

zslayton · 2023-08-01T20:42:18Z

src/lazy/text/matched.rs


-    fn escape_short_string(


🗺️ This method was also generally useful for text types and has been broken out into a helper method.

zslayton · 2023-08-01T22:34:42Z

src/lazy/text/raw/reader.rs


    #[test]
    fn test_top_level() -> IonResult<()> {
-        let data = r#"
+        let mut data = String::new();


🗺️ Previously, `data` was just a `&str` literal. However, some of the test cases require actual escaped bytes to appear in them, which isn't possible within a raw string (`r#""#`). Now it's a mutable `String` that we can append things to in bulk.
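
To illustrate the raw-string limitation: Rust does not process backslash escapes inside `r#""#`, so a real newline or tab byte can only come from a regular string literal. The function below is a hypothetical sketch; what the PR's test actually appends is not shown in this thread.

```rust
/// Builds test input that mixes Ion text whose backslash escapes must reach the
/// parser untouched with real control bytes (0x0A newline, 0x09 tab).
/// (Hypothetical helper; the contents are illustrative, not the PR's test data.)
fn build_test_data() -> String {
    let mut data = String::new();
    // Raw string: the backslash and `n` stay as two separate characters.
    data.push_str(r#"'Hello\nworld!' "#);
    // Regular string: Rust turns `\n` and `\t` into single bytes at compile time.
    data.push_str("foo\n\tbar");
    data
}
```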

zslayton · 2023-08-01T22:36:15Z

src/raw_symbol_token_ref.rs


 /// Like RawSymbolToken, but the Text variant holds a borrowed reference instead of a String.
 #[derive(Debug, Clone, PartialEq, Eq)]
 pub enum RawSymbolTokenRef<'a> {
    SymbolId(SymbolId),
-    Text(&'a str),
+    Text(Cow<'a, str>),


🗺️ If the raw reader encounters a symbol like 'Hello\nworld!', it can't just return a reference to those bytes in the input buffer. It has to make a new String with the \n replaced by 0x0A. Using Cow allows the RawSymbolTokenRef to hold either a borrowed &str or an owned String.
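
A minimal sketch of the borrow-or-allocate decision, assuming a hypothetical `decode_symbol_text` function that receives the text between the quotes. It handles only three escapes (`\n`, `\\`, `\'`), which is a simplification of Ion's full escape set and not the PR's code.

```rust
use std::borrow::Cow;

/// Returns the text of a quoted symbol's body, borrowing when possible.
/// Only `\n`, `\\`, and `\'` are recognized in this sketch.
fn decode_symbol_text(body: &str) -> Result<Cow<'_, str>, String> {
    if !body.contains('\\') {
        // No escapes: hand back a reference to the input, with no allocation.
        return Ok(Cow::Borrowed(body));
    }
    // At least one escape: build a new String with each escape replaced.
    let mut decoded = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            decoded.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => decoded.push('\n'),
            Some('\\') => decoded.push('\\'),
            Some('\'') => decoded.push('\''),
            Some(other) => return Err(format!("unsupported escape: \\{other}")),
            None => return Err("input ended in the middle of an escape".to_string()),
        }
    }
    Ok(Cow::Owned(decoded))
}
```

Callers that only read the text can treat both cases the same way, because `Cow<str>` dereferences to `str`.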

zslayton · 2023-08-01T22:36:34Z

src/symbol_ref.rs

 use std::fmt::{Debug, Formatter};
 use std::hash::{Hash, Hasher};

 /// A reference to a fully resolved symbol. Like `Symbol` (a fully resolved symbol with a
 /// static lifetime), a `SymbolRef` may have known or undefined text (i.e. `$0`).
 #[derive(PartialEq, Eq, PartialOrd, Ord, Clone)]
 pub struct SymbolRef<'a> {
-    text: Option<&'a str>,
+    text: Option<Cow<'a, str>>,


🗺️ This change is analogous to the one in RawSymbolTokenRef.
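
As a sketch of what the new field type allows, the stand-in struct below stores `Option<Cow<'a, str>>` and still exposes a plain `Option<&str>`. Only the `text` field comes from the diff; the type name `SymbolRefSketch` and its methods are assumptions for illustration.

```rust
use std::borrow::Cow;

/// A stand-in for the PR's `SymbolRef`; only the `text` field is taken from the diff.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SymbolRefSketch<'a> {
    text: Option<Cow<'a, str>>,
}

impl<'a> SymbolRefSketch<'a> {
    /// Wraps borrowed or owned text; `&str` and `String` both convert into `Cow<str>`.
    fn with_text(text: impl Into<Cow<'a, str>>) -> Self {
        Self { text: Some(text.into()) }
    }

    /// A symbol whose text is undefined, such as `$0`.
    fn with_unknown_text() -> Self {
        Self { text: None }
    }

    /// Exposes the text as a plain `&str` regardless of how it is stored.
    fn text(&self) -> Option<&str> {
        self.text.as_deref()
    }
}
```

Because `Cow` compares by content, a borrowed and an owned value holding the same text are equal.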

codecov · 2023-08-01T23:08:52Z

Codecov Report

Patch coverage: 84.31% and project coverage change: +0.08% 🎉

Comparison is base (6d22b6f) 81.64% compared to head (eba5913) 81.72%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #616      +/-   ##
==========================================
+ Coverage   81.64%   81.72%   +0.08%     
==========================================
  Files         119      119              
  Lines       21547    21778     +231     
  Branches    21547    21778     +231     
==========================================
+ Hits        17591    17799     +208     
- Misses       2312     2331      +19     
- Partials     1644     1648       +4

Files Changed	Coverage Δ
src/lazy/text/encoded_value.rs	`65.38% <0.00%> (-0.64%)`	⬇️
src/lazy/text/value.rs	`32.25% <0.00%> (-1.08%)`	⬇️
src/lazy/value.rs	`73.28% <0.00%> (-1.34%)`	⬇️
src/lazy/text/matched.rs	`69.80% <69.48%> (+0.46%)`	⬆️
src/symbol_ref.rs	`75.00% <69.56%> (-7.36%)`	⬇️
src/raw_symbol_token_ref.rs	`88.46% <80.00%> (-0.83%)`	⬇️
src/lazy/text/buffer.rs	`88.36% <99.22%> (+2.03%)`	⬆️
src/binary/binary_writer.rs	`64.67% <100.00%> (ø)`
src/lazy/text/raw/reader.rs	`93.45% <100.00%> (+2.31%)`	⬆️
src/text/raw_text_writer.rs	`85.67% <100.00%> (+0.01%)`	⬆️
... and 2 more

... and 1 file with indirect coverage changes


popematt · 2023-08-02T17:39:48Z

src/lazy/text/buffer.rs

+        // These inputs have leading/trailing whitespace to make them more readable, but the string
+        // matcher doesn't accept whitespace. We'll trim each one before testing it.


Outdated comment?

popematt · 2023-08-02T17:53:02Z

src/lazy/text/matched.rs


-        sanitized.push(substitute);
-        Ok(input_after_escape)
+fn write_escaped<'data>(


Doc comment for this function would be appreciated. I believe this is responsible for rewriting escaped characters as their unescaped counterparts—in other words, it unescapes any escaped characters in a TextBufferView—but the function name had me thinking the opposite at first.

popematt · 2023-08-02T17:54:44Z

src/lazy/text/matched.rs

+    /// The symbol is delimited by single quotes. Holds a `bool` indicating whether the
+    /// matched input contained any escaped bytes.
+    Quoted(bool),


Just curious—any particular reason for having Quoted(bool) instead of e.g. Quoted and QuotedWithEscaped?

Oh yeah, that's better. Thanks!

Addressed in #619.
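
The two shapes discussed in this thread can be compared side by side. The enum names and the `Identifier` variant below are assumptions; only `Quoted(bool)` appears in the diff, and `Quoted` / `QuotedWithEscaped` are the names the reviewer floated.

```rust
/// The shape shown in the diff: one variant carrying a flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MatchedSymbolWithFlag {
    Identifier,
    Quoted(bool),
}

/// The reviewer's suggestion: encode the flag in the variant itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MatchedSymbolSplit {
    Identifier,
    Quoted,
    QuotedWithEscaped,
}

/// Converts the flag-based form into the split form.
fn split(symbol: MatchedSymbolWithFlag) -> MatchedSymbolSplit {
    match symbol {
        MatchedSymbolWithFlag::Identifier => MatchedSymbolSplit::Identifier,
        MatchedSymbolWithFlag::Quoted(false) => MatchedSymbolSplit::Quoted,
        MatchedSymbolWithFlag::Quoted(true) => MatchedSymbolSplit::QuotedWithEscaped,
    }
}

/// Reports whether reading the symbol's text requires building a new `String`.
fn needs_allocation(symbol: MatchedSymbolSplit) -> bool {
    match symbol {
        MatchedSymbolSplit::Identifier | MatchedSymbolSplit::Quoted => false,
        MatchedSymbolSplit::QuotedWithEscaped => true,
    }
}
```

With the split form, a `match` at the call site names each case directly instead of requiring the reader to remember what `true` means.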

Feedback from PRs: * #609 * #614 * #616 * #619 * #620 * #627 * #628 * #638 * #639

zslayton added 11 commits July 24, 2023 16:54

Top-level nulls, bools, ints

e0a83d8

Consolidate impls of AsUtf8 w/helper fn

89f79aa

Improved TextBufferView docs, removed DataSource

840be4d

Adds lazy text floats

5db1ff0

Adds LazyRawTextReader support for comments

07d4a70

Adds LazyRawTextReader support for reading strings

181e0a5

clippy fixes

357ca8f

Fix a couple of unit tests

716ff34

Less ambitious float eq comparison

e29fec5

Adds LazyRawTextReader support for reading symbols

8f79a36

Adds more doc comments

4cb9b2b

zslayton commented Aug 1, 2023

zslayton marked this pull request as ready for review August 1, 2023 22:37

zslayton requested review from jobarr-amzn, popematt and desaikd August 1, 2023 22:37

More doc comments

54470d2

popematt approved these changes Aug 2, 2023

This was referenced Aug 3, 2023

Adds LazyRawTextReader support for reading lists #617

Merged

Adds LazyRawTextReader support for structs #619

Merged

This was referenced Aug 10, 2023

Adds LazyRawTextReader support for reading IVMs #620

Merged

Initial impl of a LazyRawAnyReader #621

Merged

Adds lazy reader support for reading annotations #622

Merged

Adds lazy reader support for timestamps #623

Merged

This was referenced Aug 18, 2023

Lazy reader support for s-expressions #627

Merged

Adds lazy reader support for decimals #628

Merged

Adds lazy reader support for blobs #629

Merged

Adds lazy reader support for long strings #630

Merged

Base automatically changed from lazy-strings to main August 23, 2023 00:01

Merge remote-tracking branch 'origin/main' into lazy-symbols

eba5913

zslayton merged commit dc8579d into main Aug 23, 2023
18 checks passed

zslayton deleted the lazy-symbols branch August 23, 2023 00:31

zslayton self-assigned this Aug 29, 2023

zslayton mentioned this pull request Sep 7, 2023

Incorporates pending feedback from lazy reader PRs #642

Merged

zslayton added a commit that referenced this pull request Sep 7, 2023

Incorporates pending feedback from lazy reader PRs (#642)

ec91888

Feedback from PRs: * #609 * #614 * #616 * #619 * #620 * #627 * #628 * #638 * #639
