utf16le + readStringNT compatibility #23

azz · 2019-05-19T12:26:32Z

Consider:

const buffer = SmartBuffer.fromSize(0, 'utf16le');
buffer.writeStringNT('hello');
const output = buffer.readStringNT();

We'd expect output to be "hello", but it's currently '', due to:

smart-buffer/src/smartbuffer.ts

Lines 685 to 690 in d35c0ce

    
    for (let i = this._readOffset; i < this.length; i++) {
      if (this._buff[i] === 0x00) {
        nullPos = i;
        break;
      }
    }

The buffer (after the write) looks like this:

68 00 65 00 6c 00 6c 00 6f 00 00
   ^^ perceived NT            ^^ actual NT

I'm not sure if any encodings other than utf16le suffer from this, but to fix it the i++ should be changed to i += 2 for utf16le.
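
The mismatch can be reproduced with a plain Node.js Buffer, without smart-buffer itself. The sketch below assumes writeStringNT appends a single 0x00 byte after the encoded string (which is what the dump above shows); findNull is a hypothetical stand-in for the scan loop, not a library function.

```javascript
// Hypothetical stand-in for the scan loop in readStringNT; `step` is the
// number of bytes advanced per iteration (1 today, 2 in the proposed fix).
function findNull(buf, start, step) {
  for (let i = start; i < buf.length; i += step) {
    if (buf[i] === 0x00) return i;
  }
  return buf.length; // no terminator found
}

// 'hello' in utf16le followed by one 0x00 byte, matching the dump above.
const bytes = Buffer.concat([Buffer.from('hello', 'utf16le'), Buffer.from([0x00])]);

const byteWise = findNull(bytes, 0, 1); // stops at the high byte of 'h' (index 1)
const unitWise = findNull(bytes, 0, 2); // stops at the real terminator (index 10)

console.log(bytes.toString('utf16le', 0, byteWise)); // ''
console.log(bytes.toString('utf16le', 0, unitWise)); // 'hello'
```

Checking a single byte at even offsets would still misfire on any code unit whose low byte is zero, such as U+0100 (stored as 00 01), so this only covers strings like the one in the report.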

JoshGlazebrook · 2019-05-22T17:43:32Z

Hmm this one is interesting. I'll have to check the other possible encodings and see if any others do this.

JoshGlazebrook · 2019-07-22T04:05:55Z

So I looked into this a bit more. UTF-16 is variable length: a single character is represented by either 2 bytes or 4 bytes. So even the fix above will only work for certain characters.
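
As a quick illustration of the variable-length point, using only Node's built-in Buffer (the two sample characters are arbitrary choices):

```javascript
// Byte length of single characters when encoded as utf16le.
const bmp = Buffer.from('h', 'utf16le');            // U+0068, inside the BMP
const astral = Buffer.from('\u{1F600}', 'utf16le'); // U+1F600, outside the BMP

console.log(bmp.length);             // 2
console.log(astral.length);          // 4 (a surrogate pair: D83D DE00)
console.log(astral.toString('hex')); // '3dd800de'
```

The third byte of the 4-byte sequence is 0x00, so a scan that tests one byte at every even offset would stop in the middle of this character.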

I think the solution here is to just throw an error if attempting to write or read a null terminated string using utf16 or ucs2.
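
A guard along those lines might look like the sketch below. assertNullTerminatable is a hypothetical helper name, not an existing smart-buffer function, and the set of rejected names is an assumption based on the aliases Node.js accepts for UTF-16LE.

```javascript
// Encoding names Node.js treats as UTF-16LE.
const TWO_BYTE_ENCODINGS = new Set(['utf16le', 'utf-16le', 'ucs2', 'ucs-2']);

// Hypothetical guard that the *StringNT methods could call first.
function assertNullTerminatable(encoding) {
  if (TWO_BYTE_ENCODINGS.has(String(encoding).toLowerCase())) {
    throw new Error(`Null-terminated strings are not supported for encoding "${encoding}".`);
  }
}

assertNullTerminatable('utf8'); // passes silently

try {
  assertNullTerminatable('UCS2');
} catch (err) {
  console.log(err.message);
}
```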

https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings

https://en.wikipedia.org/wiki/Null-terminated_string#Character_encodings

Technically it looks like this isn't possible with even utf8, but it works for most characters.

azz · 2019-07-22T08:57:23Z

I might be misremembering my UTF studies, but I'm pretty sure that continuation bytes (used when a code point extends beyond a single byte for utf8, or two bytes for utf16) cannot be 0. They have to be a negative number when expressed as a signed value (a signed byte for utf8, a signed short for utf16); that is, the first bit must be a 1.
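
This can be checked empirically in Node.js. The sketch below encodes a few arbitrary sample characters and inspects the resulting bytes and 16-bit code units; it illustrates the claim for these samples rather than proving it in general.

```javascript
const samples = ['\u00e9', '\u20ac', '\u{1F600}']; // U+00E9, U+20AC, U+1F600

const rows = samples.map((ch) => {
  const utf8 = Buffer.from(ch, 'utf8');
  const utf16 = Buffer.from(ch, 'utf16le');

  // Every byte of a multi-byte UTF-8 sequence has its top bit set.
  const utf8HighBits = [...utf8].every((b) => (b & 0x80) !== 0);

  // Collect the 16-bit code units; when a character needs two of them,
  // both are surrogates in the range 0xD800..0xDFFF (top bit set).
  const units = [];
  for (let i = 0; i < utf16.length; i += 2) units.push(utf16.readUInt16LE(i));

  return { utf8: utf8.toString('hex'), utf8HighBits, units };
});

for (const row of rows) {
  console.log(row.utf8, row.utf8HighBits, row.units.map((u) => u.toString(16)));
}
```

Individual bytes of a UTF-16 code unit can still be zero (the de00 unit of the last sample is stored as 00 de), which is why a byte-wise scan is unreliable even though no 16-bit unit here is zero.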

exodustx0 · 2021-01-24T15:09:35Z

There is one question to ask here: is the null terminator to be interpreted as a character that is part of the string's encoding?

If yes, then the null terminator would be as it is in the string: 2 bytes, meaning you'd be checking for two consecutive null bytes at an even offset from the starting read offset.

If not, there is no way to safely detect a null terminator in UTF-16, as either byte of a code point may be null, so there are no guarantees when checking individual bytes.

Thus, I would say it's the logical decision to interpret the null terminator as a character in the string's encoding.
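
A minimal sketch of that interpretation, written against a plain Node.js Buffer rather than smart-buffer internals. writeUtf16NT and readUtf16NT are hypothetical helpers; the sketch assumes the terminator is written as a full 00 00 code unit and that the string starts at the given read offset.

```javascript
// Encode `value` as utf16le and append a two-byte (U+0000) terminator.
function writeUtf16NT(value) {
  return Buffer.concat([Buffer.from(value, 'utf16le'), Buffer.from([0x00, 0x00])]);
}

// Scan whole code units from `offset`; stop at the first 0x0000 unit
// (or at the end of the buffer if no terminator is present).
function readUtf16NT(buf, offset = 0) {
  let end = offset;
  while (end + 1 < buf.length && buf.readUInt16LE(end) !== 0) end += 2;
  return buf.toString('utf16le', offset, end);
}

// 'h', U+0100 and U+1F600: the encoded bytes are 68 00 00 01 3d d8 00 de 00 00.
const encoded = writeUtf16NT('h\u0100\u{1F600}');
const decoded = readUtf16NT(encoded);

console.log(encoded.toString('hex')); // '680000013dd800de0000'
console.log(decoded === 'h\u0100\u{1F600}'); // true
```

The sample bytes contain 00 00 at offsets 1 and 2, which is not a terminator; stepping in whole code units from the read offset is what keeps the scan from stopping there.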

anonghuser mentioned this issue Oct 16, 2023

Incorrect use of value.length in insertStringNT and writeStringNT when value is non-ascii strings or encoding is UTF-16 #50

Open
