Need clear behavior for strings with utf8 flag set #2

Open
dolmen opened this issue Aug 19, 2013 · 1 comment
Comments

dolmen commented Aug 19, 2013

The base bencoding format specifies encoding only for strings of bytes, not Unicode strings. When decoding, there is no way to tell whether the original data was a UTF-8 string or a byte buffer that merely looks like a UTF-8 string.
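
To make the ambiguity concrete, here is a minimal sketch. It assumes only the standard `<length>:<bytes>` form of a bencoded string; `bencode_bytes` is a hypothetical helper written for this illustration, not the module's API.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not the module's API): bencode one byte string
# as "<length>:<bytes>".
sub bencode_bytes {
    my ($bytes) = @_;
    return length($bytes) . ':' . $bytes;
}

my $text = "caf\x{e9}";              # character string: 4 characters
my $utf8 = encode('UTF-8', $text);   # 5 bytes: c a f 0xC3 0xA9
my $blob = "caf\xc3\xa9";            # raw byte buffer with the same 5 bytes

# Both inputs produce the same bencoded output, so a decoder cannot
# recover which one the sender started from.
print bencode_bytes($utf8) eq bencode_bytes($blob)
    ? "identical\n"
    : "different\n";
```

This prints `identical`: once the text has been encoded, nothing in the output records that it used to be text.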

The module documentation should clearly specify (and the implementation should be tested for) how Perl strings with the utf8 flag set are handled when passed to bencode. Throwing an exception would be appropriate behavior, since it would force users of the module to encode their data as bytes first.
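
A rough sketch of the behavior proposed here, under the assumption that the check is made with `utf8::is_utf8`; `bencode_string_strict` is a hypothetical wrapper for illustration and does not exist in the module.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical strict wrapper (not part of the module): refuse any
# string whose utf8 flag is set, otherwise emit "<length>:<bytes>".
sub bencode_string_strict {
    my ($str) = @_;
    croak 'bencode: string has the utf8 flag set; encode it to bytes first'
        if utf8::is_utf8($str);
    return length($str) . ':' . $str;
}

my $ok = bencode_string_strict("spam");

# A literal with a code point above 0xFF is stored with the flag set,
# so this call dies and the message ends up in $err.
my $err = '';
eval { bencode_string_strict("\x{263a}"); 1 } or $err = $@;

print "$ok\n";
print "rejected\n" if $err =~ /utf8 flag set/;
```

Whether the flag is the right thing to test is a separate question; this only shows what the proposal would look like in code.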

The bdecode function should clearly disallow a string of characters and allow only a string of bytes.

Owner

ap commented Aug 19, 2013

The UTF8 flag is irrelevant, and whether the string was bytes or characters cannot be known outside of one particular case. Namely, if the string matches /[^\x0-\xff]/, then it contains some wide characters, so it cannot be a (proper) byte string. (It may contain bytes mixed in with the characters if the code constructing it is buggy.) But a string that does not match this could be anything, regardless of whether its UTF8 flag is set.
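
A small sketch of that distinction, using only the core `utf8::upgrade` and `utf8::is_utf8` functions; `has_wide_chars` is a hypothetical helper name chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string holds any code point above 0xFF.
sub has_wide_chars {
    my ($str) = @_;
    return $str =~ /[^\x00-\xff]/ ? 1 : 0;
}

my $latin    = "caf\xe9";     # 4 code points, all <= 0xFF, flag off
my $upgraded = $latin;
utf8::upgrade($upgraded);     # flag on, but the code points are unchanged
my $wide     = "\x{263a}";    # one code point above 0xFF

for my $pair ([latin => $latin], [upgraded => $upgraded], [wide => $wide]) {
    my ($name, $str) = @$pair;
    printf "%-9s flag=%d wide=%d\n",
        $name, (utf8::is_utf8($str) ? 1 : 0), has_wide_chars($str);
}

# The two strings compare equal even though their flags differ.
print "same string\n" if $latin eq $upgraded;
```

Here `$latin` and `$upgraded` are equal as Perl strings and neither matches the pattern, yet only one has the flag set; only `$wide` is detectably not a byte string.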

In any case, you are right: the docs should state how the module handles this.
