Skip to content

RUST-1998 Remove lossy utf8 as a decoder option #550

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 9, 2025

Conversation

abr-egn
Copy link
Contributor

@abr-egn abr-egn commented Jun 4, 2025

RUST-1998

This removes lossy utf8 decoding as a public option for the bson serde decoder, and as a consequence many of the API entry points. Dealing with lossy utf8 is now done in one of three ways depending on the specific need:

  • The existing Utf8LossyDeserialization helper type, if the user is going via serde and knows that particular fields might have bad data;
  • The newly-exposed RawElement::value_utf8_lossy method, which allows fine-grained iteration over potentially-invalid raw data, or
  • The new RawDocument::to_document_utf8_lossy method, which is a convenience wrapper over the above to bulk-convert an entire RawDocument at once.

@abr-egn abr-egn force-pushed the RUST-1998/lossy-utf8 branch from 3df66fd to 328ac6d Compare June 4, 2025 16:07
@@ -386,13 +386,6 @@ impl crate::DateTime {
}
}

#[deprecated(since = "2.3.0", note = "Use try_to_rfc3339_string instead.")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is completely unrelated but I happened to notice it and figured we should remove deprecated methods while we're doing a major version release. I did a grep and didn't find any others.

@@ -531,6 +531,21 @@ impl RawDocument {
let bytes = self.cstring_bytes_at(start_at)?;
try_to_str(bytes)
}

/// Copy this into a [`Document`], returning an error if invalid BSON is encountered.
pub fn to_document(&self) -> Result<Document> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this from RawDocumentBuf since that was calling the TryInto impl on RawDocument anyway and this way both types can use it (RawDocumentBuf derefs to this one).

src/raw/error.rs Outdated
Self::MalformedValue { message: r_message },
) => l_message == r_message,
(Self::Utf8EncodingError, Self::Utf8EncodingError) => true,
(Self::DeError(_), Self::DeError(_)) => true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crate::de:Error doesn't impl PartialEq because it wraps std::io::Error, which also doesn't, hence the manual impl. I figure this whole situation can be revisted for RUST-1406.

This is all needed so I can use crate::de::reader_to_vec in RawDocumentBuf::from_reader; the former uses the crate::de error hierarchy, the latter the crate::raw one.

@@ -266,7 +268,47 @@ impl<'a> RawElement<'a> {
})
}

pub(crate) fn value_utf8_lossy(&self) -> Result<Option<Utf8LossyBson<'a>>> {
pub fn value_utf8_lossy(&self) -> Result<RawBson> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted to expose this method rather than the lower-level value_utf8_lossy_inner because in my estimation the confluence between needing to deal with invalid utf8 and needing to save the memory cost of copying a single element is very small, and outside of that this is a strictly more useful interface.

@abr-egn abr-egn marked this pull request as ready for review June 4, 2025 17:15
@abr-egn abr-egn requested a review from a team as a code owner June 4, 2025 17:15
@abr-egn abr-egn requested a review from isabelatkinson June 4, 2025 17:15
Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new API looks good! we are using from_slice_utf8_lossy in insert and update response parsing in the driver, which is probably a hot path for some users - would the wrapper type be a fine replacement here regarding performance?

@abr-egn
Copy link
Contributor Author

abr-egn commented Jun 6, 2025

the new API looks good! we are using from_slice_utf8_lossy in insert and update response parsing in the driver, which is probably a hot path for some users - would the wrapper type be a fine replacement here regarding performance?

the new API looks good! we are using from_slice_utf8_lossy in insert and update response parsing in the driver, which is probably a hot path for some users - would the wrapper type be a fine replacement here regarding performance?

Yup, it'll be equivalent performance-wise, since using the wrapper type just enables behavior the same way from_slice_utf8_lossy would. Looking at this, we can actually do a bit better using the wrapper type and wrap the error messages specifically so that the rest is parsed strictly.

@abr-egn abr-egn force-pushed the RUST-1998/lossy-utf8 branch from 96b7844 to 6930185 Compare June 6, 2025 14:12
@abr-egn abr-egn requested a review from isabelatkinson June 6, 2025 14:25
isabelatkinson
isabelatkinson previously approved these changes Jun 9, 2025
@abr-egn abr-egn merged commit 8389d37 into mongodb:main Jun 9, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants