Add new function bytes_to_utf8_free_me #22823

Open
wants to merge 1 commit into
base: blead

@khwilliamson (Contributor) commented Dec 5, 2024

This is like bytes_to_utf8, but if the representation of the input string is the same in UTF-8 as it is in native format, the allocation of new memory is skipped.

This presents optimization possibilities.
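
The idea can be illustrated with a minimal standalone sketch. It is not the code from this PR: the names `count_variants` and `to_utf8_maybe_copy` and the signature are hypothetical, it assumes an ASCII platform where only bytes 0x80 and above change under UTF-8, and it leaves out the trailing-NUL requirement raised in the review comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: count bytes whose UTF-8 form differs from the
 * native (Latin-1) form. Assumes an ASCII platform. */
static size_t count_variants(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) n++;
    }
    return n;
}

/* Hypothetical stand-in for the proposed function.
 * If no byte changes under UTF-8, returns s itself and sets *free_me to
 * NULL. Otherwise returns a malloc'd, NUL-terminated UTF-8 copy, stores
 * the same pointer in *free_me, and updates *lenp to the new length. */
static const unsigned char *
to_utf8_maybe_copy(const unsigned char *s, size_t *lenp, void **free_me)
{
    assert(s != NULL && lenp != NULL && free_me != NULL);

    const size_t len = *lenp;
    const size_t variants = count_variants(s, len);

    *free_me = NULL;
    if (variants == 0) return s;          /* nothing to convert: skip malloc */

    /* Each variant byte expands to two bytes; +1 for the trailing NUL. */
    unsigned char *d = malloc(len + variants + 1);
    if (d == NULL) return NULL;

    unsigned char *p = d;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            *p++ = s[i];
        } else {
            *p++ = (unsigned char)(0xC0 | (s[i] >> 6));
            *p++ = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    *p = '\0';

    *lenp = (size_t)(p - d);
    *free_me = d;
    return d;
}
```

With this shape a caller can call `free(free_me)` unconditionally after use, since `free(NULL)` is a no-op.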

Suggestions for a better name are welcome.

  • This set of changes requires a perldelta entry, to be furnished

utf8.c Outdated

const U8 * const send = s + *lenp;
Size_t variant_count = variant_under_utf8_count(s, send);
if (free_me_ptr != NULL && variant_count == 0 && s[*lenp-1] == '\0') {
Contributor

Here it looks like *lenp includes the terminating NUL, but below it doesn't.

Consider the case where *lenp starts as 1: this expects s[0] to be NUL, which doesn't match what I expect from this API.

That said, the alternative would be to assume that s[*lenp] is valid. s + *lenp is always valid as a one-past-the-end pointer, but such a pointer cannot be dereferenced.

So the requirement that s[*lenp] is safe to read becomes a precondition when free_me_ptr isn't NULL.
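
The two length conventions in question can be contrasted in a small sketch. The helper names `nul_inside` and `nul_after` are hypothetical and exist only for illustration; the second one is correct only if the caller guarantees that the byte at s[len] belongs to the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Convention 1: len counts the terminating NUL, so the NUL (if any) is
 * the last byte inside the range [s, s + len). */
static bool nul_inside(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    return len > 0 && s[len - 1] == '\0';
}

/* Convention 2: len does not count the NUL, so the NUL sits at s[len].
 * Precondition (assumed, not checkable here): s[len] is readable.
 * Without it, s + len may only be computed and compared, not read. */
static bool nul_after(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    return s[len] == '\0';
}
```

For the two-byte buffer `{'a', '\0'}`, `nul_inside` holds with a length of 2, while `nul_after` holds with a length of 1; with a length of 1, `nul_inside` inspects s[0], which is 'a'.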

Contributor Author

I'm trying to understand this comment. I don't see how I'm dereferencing s[*lenp]. I am dereferencing s[*lenp - 1].

The input is not required to be NUL-terminated, but the output is. So if *lenp is 1, it looks at s[0]. If that is NUL, the function returns s unchanged, as it is a NUL-terminated string whose representation doesn't change when encoded in UTF-8. If it isn't a NUL, the function allocates new memory that includes whatever byte is in s[0] and appends a NUL to it.

In re-reading the code, I see I failed to consider the possibility that *lenp is 0, and that I might be overallocating the new memory by 1 byte. I also did a bit more cleanup, so I now dereference *(send - 1) instead. And I think it is better to dereference a pointer once into a local variable than to dereference it multiple times.
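
A rough sketch of what such a revised check could look like follows. This is not the updated patch: `can_reuse_input` is a hypothetical name, an ASCII platform is assumed, and treating a zero-length input as "must allocate" is an assumption made here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: may the input buffer be returned as-is?
 * True only if the last byte inside the buffer is NUL and no byte
 * would change when encoded as UTF-8. */
static bool can_reuse_input(const unsigned char *s, const size_t *lenp)
{
    assert(s != NULL && lenp != NULL);

    const size_t len = *lenp;              /* dereference once into a local */
    if (len == 0) return false;            /* no last byte to inspect */

    const unsigned char * const send = s + len;
    if (*(send - 1) != '\0') return false; /* output must be NUL-terminated */

    for (const unsigned char *p = s; p < send; p++) {
        if (*p >= 0x80) return false;      /* byte differs under UTF-8 */
    }
    return true;
}
```

The `len == 0` test comes first so that `send - 1` is never formed for an empty buffer, where it would point before the start of the input.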
