utf16le + readStringNT compatibility #23

Open
azz opened this issue May 19, 2019 · 4 comments


azz commented May 19, 2019

Consider:

const buffer = SmartBuffer.fromSize(0, 'utf16le');
buffer.writeStringNT('hello');
const output = buffer.readStringNT();

We'd expect output to be "hello", but it's currently '', due to:

for (let i = this._readOffset; i < this.length; i++) {
  if (this._buff[i] === 0x00) {
    nullPos = i;
    break;
  }
}

The buffer (after the write) looks like this:

68 00 65 00 6c 00 6c 00 6f 00 00
   ^^ perceived NT            ^^ actual NT

I'm not sure if any encodings other than utf16le suffer from this, but to fix it the i++ should be changed to i += 2 for utf16le.
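For reference, the behaviour can be reproduced without the library. This is a minimal sketch on a plain Node `Buffer`; `findNullByteWise` is a hypothetical helper that mirrors the loop quoted above, not smart-buffer's actual code, and the byte layout is copied from the dump in this comment.

```javascript
// Hypothetical helper mirroring the byte-wise scan quoted above.
// Returns the index of the first 0x00 byte, or -1 if there is none.
function findNullByteWise(buf, start = 0) {
  for (let i = start; i < buf.length; i++) {
    if (buf[i] === 0x00) return i;
  }
  return -1;
}

// 'hello' encoded as utf16le, followed by a single 0x00 byte (11 bytes total),
// matching the dump above.
const helloBytes = Buffer.concat([
  Buffer.from('hello', 'utf16le'),
  Buffer.from([0x00]),
]);

// The scan stops at the high byte of 'h' (index 1), not at the real terminator.
const perceivedNullPos = findNullByteWise(helloBytes);

// Decoding only the bytes before that position cannot yield 'hello'.
const decodedByteWise = helloBytes.toString('utf16le', 0, perceivedNullPos);

console.log(perceivedNullPos, JSON.stringify(decodedByteWise));
```

The scan returns index 1 because every ASCII character in utf16le carries a 0x00 high byte.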

@JoshGlazebrook
Owner

Hmm this one is interesting. I'll have to check the other possible encodings and see if any others do this.

@JoshGlazebrook
Owner

So I looked into this a bit more. UTF-16 is variable length: a single character is represented by either 2 bytes or 4 bytes. So even the fix above will only work for certain characters.
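To make the 2-byte versus 4-byte point concrete, here is a small sketch using only Node's `Buffer`. The two sample characters are arbitrary choices for illustration; nothing here comes from smart-buffer itself.

```javascript
// U+0068 ('h') sits in the Basic Multilingual Plane: one 16-bit unit.
const bmpChar = Buffer.from('h', 'utf16le');

// U+1F600 lies outside the BMP: encoded as a surrogate pair (two 16-bit units).
const astralChar = Buffer.from('\u{1F600}', 'utf16le');

console.log(bmpChar.length, bmpChar.toString('hex'));       // 2 '6800'
console.log(astralChar.length, astralChar.toString('hex')); // 4 '3dd800de'
```

Note that the surrogate pair `3d d8 00 de` contains a 0x00 byte, although neither of its 16-bit units (0xD83D, 0xDE00) is zero.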

I think the solution here is to just throw an error if attempting to write or read a null-terminated string using utf16 or ucs2.

https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings

https://en.wikipedia.org/wiki/Null-terminated_string#Character_encodings

Technically it looks like this isn't possible with even utf8, but it works for most characters.

@azz
Author

azz commented Jul 22, 2019

I might be misremembering my UTF studies, but I'm pretty sure that continuation bytes (when a code point extends beyond a single byte in utf8, or beyond two bytes in utf16) cannot be 0. They have to be a negative number when expressed as a signed value (a signed byte for utf8, a signed short for utf16); that is, the first bit must be a 1.
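A quick way to check this on a sample is sketched below with plain Node `Buffer` calls. `hasZeroUnit` is a hypothetical helper and the sample string is an arbitrary mix of an ASCII character, a surrogate pair, and a code point whose low byte is 0x00; it illustrates the claim for this input rather than proving it in general.

```javascript
// Hypothetical helper: true if any 16-bit little-endian unit in buf is 0x0000.
function hasZeroUnit(buf) {
  for (let i = 0; i + 1 < buf.length; i += 2) {
    if (buf.readUInt16LE(i) === 0) return true;
  }
  return false;
}

// 'h' (U+0068), U+1F600 (surrogate pair), U+0100 (low byte is 0x00).
const sampleText = 'h\u{1F600}\u0100';
const asUtf16 = Buffer.from(sampleText, 'utf16le');
const asUtf8 = Buffer.from(sampleText, 'utf8');

// Individual bytes in the utf16le encoding can be zero...
const utf16HasZeroByte = asUtf16.includes(0x00);
// ...but none of the 16-bit units is zero for this sample.
const utf16HasZeroUnit = hasZeroUnit(asUtf16);
// The utf8 encoding of the same text has no zero byte at all.
const utf8HasZeroByte = asUtf8.includes(0x00);

console.log({ utf16HasZeroByte, utf16HasZeroUnit, utf8HasZeroByte });
```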

@exodustx0

There is one question to ask here: is the null terminator to be interpreted as a character that is part of the string's encoding?

If yes, then the null terminator would be as it is in the string: 2 bytes, meaning you'd be checking for two consecutive null bytes at an even offset from the starting read offset.

If not, there is no way to safely detect a null terminator in UTF-16, as either byte of a code point may be null, so there are no guarantees when checking individual bytes.

Thus, I would say it's the logical decision to interpret the null terminator as a character in the string's encoding.
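The "terminator is a character in the string's encoding" reading could look roughly like the sketch below. Both functions are hypothetical helpers on plain Node Buffers, not smart-buffer's API, and reading to the end of the buffer when no terminator is found is an assumption made only for this example.

```javascript
// Hypothetical: encode str as utf16le and append a 2-byte null terminator.
function writeUtf16leNT(str) {
  return Buffer.concat([Buffer.from(str, 'utf16le'), Buffer.from([0x00, 0x00])]);
}

// Hypothetical: scan 16-bit units at even offsets from `start` and stop at
// the first unit equal to 0x0000.
function readUtf16leNT(buf, start = 0) {
  for (let i = start; i + 1 < buf.length; i += 2) {
    if (buf.readUInt16LE(i) === 0) {
      return buf.toString('utf16le', start, i);
    }
  }
  // Assumption for this sketch: with no terminator, decode the remainder.
  return buf.toString('utf16le', start);
}

// Round trip with ASCII, a surrogate pair, and a code point with a 0x00 low byte.
const original = 'hello \u{1F600}\u0100';
const roundTripped = readUtf16leNT(writeUtf16leNT(original));

console.log(roundTripped === original);
```

Comparing whole 16-bit units, rather than single bytes, is what keeps code points such as U+0100 or the low surrogate 0xDE00 from being mistaken for the terminator.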
