Need clear behavior for strings with utf8 flag set #2

dolmen · 2013-08-19T16:34:02Z

The base bencoding format specifies encoding only for strings of bytes, not Unicode strings. When decoding, there is no way to tell whether the original data was a UTF-8 string or a byte buffer that merely looks like a UTF-8 string.

The module documentation should clearly specify how Perl strings with the utf8 flag set are handled when passed to bencode, and the implementation should be tested for it. Throwing an exception would be appropriate behavior, since it forces users of the module to encode their data as bytes properly.

The bdecode function should clearly disallow a string of characters and allow only a string of bytes.
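
To make the bytes-versus-characters point concrete, here is a minimal sketch of what "properly encode as bytes" means on the caller's side. It uses only the core Encode module and builds the bencoded string by hand rather than calling this module, so nothing below should be read as the module's actual behavior.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: 4 characters, the last one is U+00E9.
my $chars = "caf\x{e9}";

# Explicitly encode to UTF-8 bytes: U+00E9 becomes the two bytes 0xC3 0xA9.
my $bytes = encode('UTF-8', $chars);

# A bencoded string is "<byte length>:<bytes>", so the prefix must count
# bytes, not characters. Hand-rolled here purely for illustration.
my $bencoded = length($bytes) . ':' . $bytes;

# Decoding the bencoding yields bytes again. Only the caller knows that
# this particular payload was UTF-8 text and may be decoded back.
my ($len, $payload) = $bencoded =~ /\A(\d+):(.*)\z/s;
my $roundtrip = decode('UTF-8', $payload);
```

The format itself carries no marker that says "this was text", which is why the decision to decode has to stay with the caller.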

ap · 2013-08-19T17:50:28Z

The UTF8 flag is irrelevant, and whether the string was bytes or characters cannot be known outside of one particular case. Namely, if the string matches /[^\x0-\xff]/, then it contains some wide characters, so it cannot be a (proper) byte string. (It may contain bytes mixed in with the characters if the code constructing it is buggy.) But a string that does not match this could be anything, regardless of whether its UTF8 flag is set.
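
A small sketch of that point, using only core Perl. The helper name `has_wide_chars` is hypothetical and is not part of the module; it just wraps the regex check described above, with the range written as `\x00-\xff`.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any character is above 0xFF,
# i.e. the string cannot be a proper byte string.
sub has_wide_chars { return $_[0] =~ /[^\x00-\xff]/ ? 1 : 0 }

my $plain    = "caf\xe9";    # every character is <= 0xFF
my $upgraded = $plain;
utf8::upgrade($upgraded);    # changes internal storage and sets the UTF8 flag

# Both variables hold the same string as far as Perl is concerned,
# even though only one of them has the flag set.
my $same         = $plain eq $upgraded ? 1 : 0;
my $flag_plain   = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgrade = utf8::is_utf8($upgraded) ? 1 : 0;

# Neither contains a wide character, so either could be bytes or text.
my $wide_plain   = has_wide_chars($plain);
my $wide_upgrade = has_wide_chars($upgraded);

# A string with a character above 0xFF is the one detectable case.
my $wide_smiley  = has_wide_chars("\x{263A}");
```

Under these assumptions, a check based on the flag would treat `$plain` and `$upgraded` differently although they compare equal, while the wide-character check treats them the same.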

In any case you are right, the docs should declare how the module handles this issue.
