Adds lazy reader support for blobs #629

Merged
merged 36 commits into from
Sep 1, 2023
Commits
e0a83d8
Top-level nulls, bools, ints
zslayton Jul 16, 2023
89f79aa
Consolidate impls of AsUtf8 w/helper fn
zslayton Jul 25, 2023
840be4d
Improved TextBufferView docs, removed DataSource
zslayton Jul 25, 2023
5db1ff0
Adds lazy text floats
zslayton Jul 27, 2023
07d4a70
Adds LazyRawTextReader support for comments
zslayton Jul 27, 2023
181e0a5
Adds LazyRawTextReader support for reading strings
zslayton Jul 28, 2023
357ca8f
clippy fixes
zslayton Jul 28, 2023
716ff34
Fix a couple of unit tests
zslayton Jul 29, 2023
e29fec5
Less ambitious float eq comparison
zslayton Jul 29, 2023
8f79a36
Adds LazyRawTextReader support for reading symbols
zslayton Aug 1, 2023
4cb9b2b
Adds more doc comments
zslayton Aug 1, 2023
54470d2
More doc comments
zslayton Aug 1, 2023
78014e7
Adds `LazyRawTextReader` support for reading lists
zslayton Aug 3, 2023
a6a3aa8
Adds `LazyRawTextReader` support for structs
zslayton Aug 10, 2023
4fc9078
More doc comments
zslayton Aug 10, 2023
11174ac
Adds `LazyRawTextReader` support for reading IVMs
zslayton Aug 10, 2023
719dbaa
Initial impl of a LazyRawAnyReader
zslayton Aug 11, 2023
f603872
Improved comments.
zslayton Aug 11, 2023
4696ca5
Adds LazyRawTextReader support for annotations
zslayton Aug 11, 2023
c7129ac
Adds lazy reader support for timestamps
zslayton Aug 14, 2023
44435ea
Lazy reader support for s-expressions
zslayton Aug 18, 2023
d50e05b
Fixed doc comments
zslayton Aug 18, 2023
8283422
Fix internal doc link
zslayton Aug 18, 2023
0f01099
Adds lazy reader support for decimals
zslayton Aug 19, 2023
b60f1fe
Fixed bad unit test example case
zslayton Aug 20, 2023
915c83a
clippy fixes
zslayton Aug 20, 2023
fe922ff
Adds lazy reader support for blobs
zslayton Aug 20, 2023
4b53bb3
Merge remote-tracking branch 'origin/main' into lazy-timestamps
zslayton Aug 23, 2023
60d5a17
Incorporates review feedback
zslayton Aug 23, 2023
db9718d
Matcher recognizes +00:00 as Zulu
zslayton Aug 23, 2023
37264a3
Merge remote-tracking branch 'origin/lazy-timestamps' into lazy-sexps
zslayton Aug 23, 2023
74b8baf
Merge remote-tracking branch 'origin/lazy-sexps' into lazy-decimals
zslayton Aug 23, 2023
d716c9f
Merge remote-tracking branch 'origin/lazy-decimals' into lazy-blobs
zslayton Aug 23, 2023
8cecdda
Allow blobs with interleaved whitespace
zslayton Aug 28, 2023
b28df56
clippy suggestions
zslayton Aug 28, 2023
8ea5c9d
Merge remote-tracking branch 'origin/main' into lazy-blobs
zslayton Sep 1, 2023
2 changes: 1 addition & 1 deletion src/lazy/binary/raw/value.rs
@@ -406,7 +406,7 @@ impl<'data> LazyRawBinaryValue<'data> {
fn read_blob(&self) -> ValueParseResult<'data, BinaryEncoding> {
debug_assert!(self.encoded_value.ion_type() == IonType::Blob);
let bytes = self.value_body()?;
- Ok(RawValueRef::Blob(bytes))
+ Ok(RawValueRef::Blob(bytes.into()))
}

/// Helper method called by [`Self::read`]. Reads the current value as a clob.
123 changes: 123 additions & 0 deletions src/lazy/bytes_ref.rs
@@ -0,0 +1,123 @@
use crate::text::text_formatter::IonValueFormatter;
use crate::Bytes;
use std::borrow::Cow;
use std::fmt::{Debug, Display, Formatter};
use std::ops::Deref;

pub struct BytesRef<'data> {
data: Cow<'data, [u8]>,
}
Comment on lines +7 to +9
🗺️ When there was only a binary reader, methods that read blobs could always return a &[u8]--a slice of the input buffer. Now that we also have a text reader, we need to accommodate base64-encoded blobs, which always require a new Vec to be allocated to hold the decoded data. BytesRef can hold either a borrowed &[u8] or an owned Vec<u8>, allowing it to be used in either situation.

This type is analogous to StrRef and SymbolRef but for blobs.
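
As a rough illustration of the borrowed-or-owned idea described above, here is a minimal standalone sketch. The names `BlobBytes`, `borrowed`, `owned`, `is_borrowed`, and `checksum` are hypothetical and are not part of this PR; the only thing taken from the diff is the `Cow<'data, [u8]>` field and the `Deref` to `[u8]`.

```rust
use std::borrow::Cow;
use std::ops::Deref;

/// Hypothetical stand-in for `BytesRef`: one type that can either borrow
/// bytes from the input buffer or own a freshly allocated `Vec<u8>`.
pub struct BlobBytes<'data> {
    data: Cow<'data, [u8]>,
}

impl<'data> BlobBytes<'data> {
    /// Binary-style path: no allocation, just a view into the input buffer.
    pub fn borrowed(input: &'data [u8]) -> Self {
        BlobBytes { data: Cow::Borrowed(input) }
    }

    /// Text-style path: the decoded data lives in its own `Vec`.
    pub fn owned(decoded: Vec<u8>) -> Self {
        BlobBytes { data: Cow::Owned(decoded) }
    }

    /// Reports whether this value still points into the input buffer.
    pub fn is_borrowed(&self) -> bool {
        matches!(self.data, Cow::Borrowed(_))
    }
}

impl<'data> Deref for BlobBytes<'data> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        &self.data
    }
}

/// Code that consumes a blob only needs a byte slice, so it does not care
/// which variant it was handed.
pub fn checksum(bytes: &[u8]) -> u32 {
    bytes.iter().map(|b| u32::from(*b)).sum()
}
```

Because of the `Deref` impl, `checksum(&blob)` compiles for both the borrowed and the owned case; only the producer decides whether an allocation happens.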


impl<'data> Deref for BytesRef<'data> {
type Target = [u8];

fn deref(&self) -> &Self::Target {
self.data.as_ref()
}
}

impl<'data> BytesRef<'data> {
pub fn to_owned(&self) -> Bytes {
Bytes::from(self.as_ref())
}

pub fn into_owned(self) -> Bytes {
Bytes::from(self)
}

pub fn data(&self) -> &[u8] {
self.as_ref()
}
}

impl<'data> From<BytesRef<'data>> for Bytes {
fn from(value: BytesRef<'data>) -> Self {
match value.data {
Cow::Borrowed(bytes) => Bytes::from(bytes),
Cow::Owned(bytes) => Bytes::from(bytes),
}
}
}

impl<'data, const N: usize> From<&'data [u8; N]> for BytesRef<'data> {
fn from(bytes: &'data [u8; N]) -> Self {
BytesRef {
data: Cow::from(bytes.as_ref()),
}
}
}

impl<'data> From<&'data [u8]> for BytesRef<'data> {
fn from(bytes: &'data [u8]) -> Self {
BytesRef {
data: Cow::from(bytes),
}
}
}

impl<'data> From<Vec<u8>> for BytesRef<'data> {
fn from(bytes: Vec<u8>) -> Self {
BytesRef {
data: Cow::from(bytes),
}
}
}

impl<'data> From<&'data str> for BytesRef<'data> {
fn from(text: &'data str) -> Self {
BytesRef {
data: Cow::from(text.as_bytes()),
}
}
}

impl<'data> PartialEq<[u8]> for BytesRef<'data> {
fn eq(&self, other: &[u8]) -> bool {
self.data() == other
}
}

impl<'data> PartialEq<&[u8]> for BytesRef<'data> {
fn eq(&self, other: &&[u8]) -> bool {
self.data() == *other
}
}

impl<'data> PartialEq<BytesRef<'data>> for [u8] {
fn eq(&self, other: &BytesRef<'data>) -> bool {
self == other.data()
}
}

impl<'a, 'b> PartialEq<BytesRef<'a>> for BytesRef<'b> {
fn eq(&self, other: &BytesRef<'a>) -> bool {
self == other.data()
}
}

impl<'data> Display for BytesRef<'data> {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
let mut formatter = IonValueFormatter { output: f };
formatter
.format_blob(self.data())
.map_err(|_| std::fmt::Error)
}
}

impl<'data> Debug for BytesRef<'data> {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
const NUM_BYTES_TO_SHOW: usize = 32;
let data = self.data.as_ref();
// Shows up to the first 32 bytes in hex
write!(f, "BytesRef: [")?;
for byte in data.iter().copied().take(NUM_BYTES_TO_SHOW) {
write!(f, "{:x} ", byte)?;
}
if data.len() > NUM_BYTES_TO_SHOW {
write!(f, "...{} more", (data.len() - NUM_BYTES_TO_SHOW))?;
}
write!(f, "]")?;

Ok(())
}
}
3 changes: 2 additions & 1 deletion src/lazy/mod.rs
@@ -1,8 +1,9 @@
//! Provides an ergonomic, lazy view of an Ion stream that permits random access within each
//! top level value.

- mod any_encoding;
+ pub mod any_encoding;
pub mod binary;
+ pub mod bytes_ref;
pub mod decoder;
pub(crate) mod encoding;
pub mod raw_stream_item;
7 changes: 4 additions & 3 deletions src/lazy/raw_value_ref.rs
@@ -1,3 +1,4 @@
use crate::lazy::bytes_ref::BytesRef;
use crate::lazy::decoder::LazyDecoder;
use crate::lazy::str_ref::StrRef;
use crate::result::IonFailure;
@@ -18,7 +19,7 @@ pub enum RawValueRef<'data, D: LazyDecoder<'data>> {
Timestamp(Timestamp),
String(StrRef<'data>),
Symbol(RawSymbolTokenRef<'data>),
- Blob(&'data [u8]),
+ Blob(BytesRef<'data>),
🗺️ RawValueRef now returns a BytesRef instead of a &[u8] so it has the option to allocate a Vec<u8> when the input encoding is base64 text. The binary reader can still return a slice of the input buffer.
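
To make the allocation concrete, here is a hedged sketch of what decoding a text blob body could look like using only the standard library. The decoding step itself is not shown in this diff, so `decode_blob_body` and `sextet` are hypothetical names. The sketch also assumes that the body must be fully padded (length a multiple of 4 once whitespace is removed) and treats ASCII whitespace as a stand-in for Ion whitespace.

```rust
/// Maps one base64 alphabet character to its 6-bit value.
fn sextet(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some(u32::from(c - b'A')),
        b'a'..=b'z' => Some(u32::from(c - b'a') + 26),
        b'0'..=b'9' => Some(u32::from(c - b'0') + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Decodes the text found between `{{` and `}}`, ignoring interleaved
/// whitespace. Returns `None` for illegal characters or bad padding.
pub fn decode_blob_body(body: &str) -> Option<Vec<u8>> {
    // Step 1: strip whitespace. This already forces a new buffer.
    let compact: Vec<u8> = body
        .bytes()
        .filter(|b| !b.is_ascii_whitespace())
        .collect();

    // Step 2: count trailing `=` and validate the overall shape.
    let pad = compact.iter().rev().take_while(|&&b| b == b'=').count();
    if pad > 2 || compact.len() % 4 != 0 {
        return None;
    }
    let digits = &compact[..compact.len() - pad];

    // Step 3: pack 6-bit groups into bytes. The output is an owned Vec,
    // which is why a borrowed `&[u8]` cannot represent a text blob.
    let mut out = Vec::with_capacity(digits.len() * 3 / 4);
    let mut acc: u32 = 0;
    let mut bits: u32 = 0;
    for &c in digits {
        acc = (acc << 6) | sextet(c)?;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1u32 << bits) - 1;
        }
    }
    Some(out)
}
```

The returned `Vec<u8>` is the kind of value that `From<Vec<u8>> for BytesRef` would wrap, while the binary reader keeps using `From<&[u8]>` on a slice of its input.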

Clob(&'data [u8]),
SExp(D::SExp),
List(D::List),
@@ -140,7 +141,7 @@ impl<'data, D: LazyDecoder<'data>> RawValueRef<'data, D> {
}
}

- pub fn expect_blob(self) -> IonResult<&'data [u8]> {
+ pub fn expect_blob(self) -> IonResult<BytesRef<'data>> {
if let RawValueRef::Blob(b) = self {
Ok(b)
} else {
@@ -247,7 +248,7 @@ mod tests {
);
assert_eq!(
reader.next()?.expect_value()?.read()?.expect_blob()?,
- &[0x06, 0x5A, 0x1B] // Base64-decoded "Blob"
+ [0x06u8, 0x5A, 0x1B].as_ref() // Base64-decoded "Blob"
);
assert_eq!(
reader.next()?.expect_value()?.read()?.expect_clob()?,
97 changes: 94 additions & 3 deletions src/lazy/text/buffer.rs
@@ -6,7 +6,7 @@ use std::str::FromStr;

use nom::branch::alt;
use nom::bytes::streaming::{is_a, is_not, tag, take_until, take_while1, take_while_m_n};
- use nom::character::streaming::{char, digit1, one_of, satisfy};
+ use nom::character::streaming::{alphanumeric1, char, digit1, one_of, satisfy};
use nom::combinator::{consumed, fail, map, not, opt, peek, recognize, success, value};
use nom::error::{ErrorKind, ParseError};
use nom::multi::{many0_count, many1_count};
@@ -17,8 +17,8 @@ use crate::lazy::encoding::TextEncoding;
use crate::lazy::raw_stream_item::RawStreamItem;
use crate::lazy::text::encoded_value::EncodedTextValue;
use crate::lazy::text::matched::{
- MatchedDecimal, MatchedFloat, MatchedHoursAndMinutes, MatchedInt, MatchedString, MatchedSymbol,
- MatchedTimestamp, MatchedTimestampOffset, MatchedValue,
+ MatchedBlob, MatchedDecimal, MatchedFloat, MatchedHoursAndMinutes, MatchedInt, MatchedString,
+ MatchedSymbol, MatchedTimestamp, MatchedTimestampOffset, MatchedValue,
};
use crate::lazy::text::parse_result::{InvalidInputError, IonParseError};
use crate::lazy::text::parse_result::{IonMatchResult, IonParseResult};
@@ -497,6 +497,12 @@ impl<'data> TextBufferView<'data> {
)
},
),
map(
match_and_length(Self::match_blob),
|(matched_blob, length)| {
EncodedTextValue::new(MatchedValue::Blob(matched_blob), self.offset(), length)
},
),
map(
match_and_length(Self::match_list),
|(matched_list, length)| {
@@ -1341,6 +1347,36 @@ impl<'data> TextBufferView<'data> {
recognize(pair(one_of("012345"), Self::match_any_digit)),
)(self)
}

/// Matches a complete blob, including the opening `{{` and closing `}}`.
pub fn match_blob(self) -> IonParseResult<'data, MatchedBlob> {
delimited(
tag("{{"),
// Only whitespace (not comments) can appear within the blob
recognize(Self::match_base64_content),
preceded(Self::match_optional_whitespace, tag("}}")),
)
.map(|base64_data| {
MatchedBlob::new(base64_data.offset() - self.offset(), base64_data.len())
})
.parse(self)
}

/// Matches the base64 content within a blob. Ion allows the base64 content to be broken up with
/// whitespace, so the matched input region may need to be stripped of whitespace before
/// the data can be decoded.
fn match_base64_content(self) -> IonMatchResult<'data> {
recognize(terminated(
many0_count(preceded(
Self::match_optional_whitespace,
alt((alphanumeric1, is_a("+/"))),
)),
opt(preceded(
Self::match_optional_whitespace,
alt((tag("=="), tag("="))),
)),
))(self)
}
}

// === nom trait implementations ===
@@ -2008,4 +2044,59 @@ mod tests {
mismatch_sexp(input);
}
}

#[test]
fn test_match_blob() {
fn match_blob(input: &str) {
MatchTest::new(input).expect_match(match_length(TextBufferView::match_blob));
}
fn mismatch_blob(input: &str) {
MatchTest::new(input).expect_mismatch(match_length(TextBufferView::match_blob));
}
// Base64 encodings of utf-8 strings
let good_inputs = &[
// <empty blobs>
"{{}}",
"{{ }}",
"{{\n\t}}",
// hello
"{{aGVsbG8=}}",
"{{ aGVsbG8=}}",
"{{aGVsbG8= }}",
"{{\taGVsbG8=\n\n}}",
"{{aG Vs bG 8 =}}",
r#"{{
aG Vs
bG 8=
}}"#,
// hello!
"{{aGVsbG8h}}",
"{{ aGVsbG8h}}",
"{{aGVsbG8h }}",
"{{ aGVsbG8h }}",
// razzle dazzle root beer
"{{cmF6emxlIGRhenpsZSByb290IGJlZXI=}}",
"{{\ncmF6emxlIGRhenpsZSByb290IGJlZXI=\r}}",
];
for input in good_inputs {
match_blob(input);
}

let bad_inputs = &[
// illegal character $
"{{$aGVsbG8=}}",
// comment within braces
r#"{{
// Here's the data:
aGVsbG8=
}}"#,
// padding at the beginning
"{{=aGVsbG8}}",
// too much padding
"{{aGVsbG8===}}",
];
for input in bad_inputs {
mismatch_blob(input);
}
}
}
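
For readers unfamiliar with nom, the grammar accepted by `match_blob` and `match_base64_content` can be approximated with a small hand-written matcher. This is only a sketch under stated assumptions: `match_blob_len` and `skip_ws` are hypothetical names, the input is treated as complete rather than streaming, and ASCII whitespace stands in for Ion whitespace.

```rust
/// Advances `i` past any ASCII whitespace and returns the new index.
fn skip_ws(input: &[u8], mut i: usize) -> usize {
    while i < input.len() && input[i].is_ascii_whitespace() {
        i += 1;
    }
    i
}

/// Returns the length of the blob at the start of `input` (including the
/// `{{` and `}}` delimiters), or `None` if no blob matches there.
pub fn match_blob_len(input: &str) -> Option<usize> {
    let bytes = input.as_bytes();
    if !bytes.starts_with(b"{{") {
        return None;
    }
    let mut i = 2;

    // Zero or more base64 characters, each optionally preceded by whitespace.
    loop {
        let j = skip_ws(bytes, i);
        match bytes.get(j) {
            Some(c) if c.is_ascii_alphanumeric() || *c == b'+' || *c == b'/' => i = j + 1,
            _ => break,
        }
    }

    // Optional padding: `=` or `==`, optionally preceded by whitespace.
    let j = skip_ws(bytes, i);
    if bytes.get(j) == Some(&b'=') {
        i = j + 1;
        if bytes.get(i) == Some(&b'=') {
            i += 1;
        }
    }

    // Optional whitespace, then the closing delimiter.
    let j = skip_ws(bytes, i);
    if bytes[j..].starts_with(b"}}") {
        Some(j + 2)
    } else {
        None
    }
}
```

As in the test cases above, a `//` comment inside the braces is rejected: the two `/` characters are legal base64 characters, but the apostrophe and colon that follow are not, so the closing `}}` is never reached.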
1 change: 1 addition & 0 deletions src/lazy/text/encoded_value.rs
@@ -130,6 +130,7 @@ impl EncodedTextValue {
MatchedValue::Timestamp(_) => IonType::Timestamp,
MatchedValue::String(_) => IonType::String,
MatchedValue::Symbol(_) => IonType::Symbol,
MatchedValue::Blob(_) => IonType::Blob,
MatchedValue::List => IonType::List,
MatchedValue::SExp => IonType::SExp,
MatchedValue::Struct => IonType::Struct,