Add new function bytes_to_utf8_free_me #22823

khwilliamson · 2024-12-05T00:22:59Z

This is like bytes_to_utf8, but if the representation of the input string is the same in UTF-8 as it is in native format, the allocation of new memory is skipped.

This presents optimization possibilities.

Suggestions for a better name are welcome.

This set of changes requires a perldelta entry, which is still to be furnished.

tonycoz · 2024-12-10T02:58:42Z

utf8.c

+
+    const U8 * const send = s + *lenp;
+    Size_t variant_count = variant_under_utf8_count(s, send);
+    if (free_me_ptr != NULL && variant_count == 0 && s[*lenp-1] == '\0') {


Here it looks like *lenp includes the terminating NUL, but below it doesn't.

Consider when *lenp starts as 1: this expects s[0] to be NUL, which doesn't match what I expect from this API.

That said, we would be assuming that s[*lenp] is valid. s + *lenp is always valid as a one-past-the-end pointer, but such a pointer cannot be dereferenced.

So the safety of reading s[*lenp] becomes a precondition when free_me_ptr isn't NULL.

I'm trying to understand this comment. I don't see how I'm dereferencing s[*lenp]. I am dereferencing s[*lenp - 1].

The input is not required to be NUL-terminated, but the output is. So if *lenp is 1, it looks at s[0]. If that is NUL, the function returns s unchanged, as it is a NUL-terminated string whose representation doesn't change when encoded in UTF-8. If it isn't a NUL, the function allocates new memory that includes whatever byte is in s[0] and appends a NUL to it.

In re-reading the code, I see I failed to consider the possibility that *lenp is 0, and that I might be overallocating the new memory by 1 byte. I also did a bit more cleanup, so I now dereference *(send - 1) instead. And I think it is better to dereference a pointer once into a local variable than to dereference it multiple times.
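The fast-path test and the empty-input guard described in this reply can be sketched in isolation. This is a minimal stand-alone model, not the code in utf8.c: `count_variants` and `can_skip_allocation` are hypothetical names, plain C types stand in for `U8` and `Size_t`, and a variant is modeled as any byte >= 0x80, which holds only on ASCII platforms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for variant_under_utf8_count(): on ASCII
 * platforms a byte changes representation under UTF-8 exactly when
 * it is >= 0x80. */
static size_t count_variants(const unsigned char *s, const unsigned char *send)
{
    size_t n = 0;
    for (; s < send; s++) {
        if (*s >= 0x80) n++;
    }
    return n;
}

/* Returns 1 when the input could be handed back unchanged: it has no
 * variant bytes and its final counted byte is already a NUL. */
static int can_skip_allocation(const unsigned char *s, size_t len)
{
    assert(s != NULL);

    /* send is a one-past-the-end pointer; it is compared against but
     * never dereferenced. */
    const unsigned char * const send = s + len;

    /* Guard the empty input so *(send - 1) never reads before s. */
    if (len == 0) return 0;

    return count_variants(s, send) == 0 && *(send - 1) == '\0';
}
```

With this model, `can_skip_allocation` accepts "abc" only when the length passed is 4 (counting the NUL), and rejects a zero-length input without touching memory outside the buffer.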

There are no tests or example code, so it's hard to tell how it's meant to be called.

Let's say I call:

    Size_t len = 10; /* does not count the NUL, is that typical/expected? */
    const U8 *free_me;
    const U8 *result = bytes_to_utf8_free_me("0123456789", &len, &free_me);
    ...
    Safefree(free_me);

As the code is written now, this will always allocate a new string, but if I call it with:

    Size_t len = 11; /* does count the NUL */
    const U8 *free_me;
    const U8 *result = bytes_to_utf8_free_me("0123456789", &len, &free_me);
    ...
    Safefree(free_me);

result will be a pointer to the string passed in, and free_me will be NULL.

Is including the NUL in the count passed in the intended way to call this function?

Note that if you do expect that, then when a string is allocated due to variants the resulting string will have double NUL termination, which is a bit unexpected.
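To make the two calls and the double-NUL case concrete, here is a rough stand-alone model of the behavior as described at this point in the review. It is an illustration under assumptions, not the utf8.c implementation: `model_bytes_to_utf8_free_me` is a hypothetical name, `malloc`/`free` stand in for Perl's allocator and `Safefree`, and the encoding step assumes an ASCII platform where each byte >= 0x80 becomes two UTF-8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of the reviewed behavior; NOT the real function.
 * On return, *free_me is NULL if the input was returned unchanged,
 * otherwise it points at memory the caller must free(). */
static const unsigned char *
model_bytes_to_utf8_free_me(const unsigned char *s, size_t *lenp,
                            void **free_me)
{
    assert(s != NULL && lenp != NULL && free_me != NULL);

    const size_t len = *lenp;   /* read the length once */
    size_t variants = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) variants++;
    }

    /* Fast path: nothing to convert and the last counted byte is NUL. */
    if (variants == 0 && len > 0 && s[len - 1] == '\0') {
        *free_me = NULL;
        return s;
    }

    /* Slow path: copy, expanding each variant to two bytes, then
     * append a terminating NUL that is not counted in *lenp. */
    unsigned char *d = malloc(len + variants + 1);
    if (d == NULL) {
        *free_me = NULL;
        return NULL;
    }
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            d[j++] = s[i];
        } else {
            d[j++] = (unsigned char)(0xC0 | (s[i] >> 6));
            d[j++] = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    d[j] = '\0';
    *lenp = j;
    *free_me = d;
    return d;
}
```

In this model, passing "0123456789" with a length of 10 takes the slow path and allocates, while a length of 11 returns the input pointer with `*free_me` set to NULL. Passing a string that contains a variant with the NUL counted in the length yields a buffer ending in two NULs: the copied one and the appended one.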

I think you reviewed this before refreshing with the latest version available at the time. I had already noticed the double NUL and fixed it.

I don't know what to do about the length disparity. If you include the NUL in the length in blead, you will get a double NUL. In order for the new form to know that there is a trailing NUL, it has to be able to examine that byte, and so the length has to include it. I added a paragraph to the pod that (hopefully) explains it.

It has since occurred to me that it might be better to just not make the guarantee of a trailing NUL if there is no other reason to allocate new memory.
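A sketch of what that relaxed contract could look like, as a predicate only. This illustrates the idea floated above and makes no claim about what was eventually merged; `can_skip_allocation_relaxed` is a hypothetical name, and variants are again modeled as bytes >= 0x80 on an ASCII platform.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical relaxed fast-path test: the input is returned as-is
 * whenever it has no variant bytes, whether or not it ends in a NUL.
 * The caller then cannot rely on NUL termination in that case. */
static int can_skip_allocation_relaxed(const unsigned char *s, size_t len)
{
    assert(s != NULL);

    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) return 0;   /* a variant forces a new buffer */
    }
    return 1;
}
```

Under this rule the length disparity goes away: "0123456789" is accepted with a length of 10 or 11, and no byte at or beyond s + len is ever examined.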

khwilliamson · 2025-01-07T19:46:13Z

The only change since the last time is rewriting the pod

tonycoz · 2025-01-07T23:31:44Z

Still approved :)

tonycoz reviewed Dec 10, 2024

khwilliamson force-pushed the bytes_to_utf8 branch 2 times, most recently from ce98a9c to 5d31895 Compare December 16, 2024 03:50

tonycoz approved these changes Dec 17, 2024

khwilliamson force-pushed the bytes_to_utf8 branch from 5d31895 to 4a3e5ea Compare December 18, 2024 23:00

khwilliamson force-pushed the bytes_to_utf8 branch from 4a3e5ea to e051041 Compare January 7, 2025 19:45

Add new function bytes_to_utf8_free_me

cf64e7c

This is like bytes_to_utf8, but if the representation of the input string is the same in UTF-8 as it is in native format, the allocation of new memory is skipped. This presents optimization possibilities.

khwilliamson force-pushed the bytes_to_utf8 branch from e051041 to cf64e7c Compare January 7, 2025 19:51

khwilliamson merged commit 992f768 into Perl:blead Jan 8, 2025

khwilliamson deleted the bytes_to_utf8 branch January 28, 2025 05:34
