Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid re-validating UTF-8 in FromUtf8Error::into_utf8_lossy #130408

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

okaneco
Copy link
Contributor

@okaneco okaneco commented Sep 15, 2024

Part of the unstable feature string_from_utf8_lossy_owned - #129436

Refactor FromUtf8Error::into_utf8_lossy to copy valid UTF-8 bytes into the buffer, avoiding double validation of bytes.
Add tests that mirror the String::from_utf8_lossy tests.

@rustbot
Copy link
Collaborator

rustbot commented Sep 15, 2024

r? @Noratrieb

rustbot has assigned @Noratrieb.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Sep 15, 2024
Comment on lines +2099 to +2106
let iter = self.bytes[self.error.valid_up_to()..].utf8_chunks();

for chunk in iter {
res.push_str(chunk.valid());
if !chunk.invalid().is_empty() {
res.push_str(REPLACEMENT);
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the logic from String::from_utf8_lossy


// `Utf8Error::valid_up_to` returns the maximum index of validated
// UTF-8 bytes. Copy the valid bytes into the output buffer.
v.extend(self.bytes[..self.error.valid_up_to()].iter().copied());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use extend_from_slice here? that's both simpler and possible more efficient (though the iterator probably produces good code as well? but either way, extend_from_slice is shorter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I normally use extend_from_slice and second-guessed myself since I couldn't remember if there was a difference.
I checked the docs and went with this because of the last sentence here but that's maybe a bit hopeful still.

Note that this function is same as extend except that it is specialized to work with slices instead. If and when Rust gets specialization this function will likely be deprecated (but still available).

https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#method.extend_from_slice

assert_eq!(func(xs), ys);

let xs = b"Hello\xC2 There\xFF Goodbye";
assert_eq!(func(xs), String::from("Hello\u{FFFD} There\u{FFFD} Goodbye"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (you don't have to change this but can if you're touching it anyways and want to): you've used .to_owned() above but use String::from here, i prefer .to_owned() myself so they could be made consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

I just copied the test above but I prefer that too.

Copy link
Member

@Noratrieb Noratrieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, this looks good! a small comment and then it's good

Refactor `into_utf8_lossy` to copy valid UTF-8 bytes into the buffer,
avoiding double validation of bytes.
Add tests that mirror the `String::from_utf8_lossy` tests
@okaneco
Copy link
Contributor Author

okaneco commented Sep 21, 2024

Force pushed
Use extend_from_slice in FromUtf8Error::into_utf8_lossy
Use to_owned in the tests instead of String::from

@Noratrieb
Copy link
Member

@bors r+ rollup

@bors
Copy link
Contributor

bors commented Sep 21, 2024

📌 Commit b94c5a1 has been approved by Noratrieb

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants