Optimize Utf8Validator with constant input Vector.slice API #68


Draft: wants to merge 1 commit into base: main

Conversation

@jatin-bhateja commented Aug 6, 2025

This patch replaces Utf8Validator's existing bulky handling for pulling in the last, second-last, and third-last bytes of the previous byte-vector chunk with an optimized constant-input Vector.slice API call.
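For reference, the Vector API's `slice(origin, v2)` treats the receiver and `v2` as one concatenated lane sequence and extracts a full vector's worth of lanes starting at `origin`; with `origin = length - 1` it yields the last byte of the previous chunk followed by the leading bytes of the current chunk. Below is a minimal scalar sketch of that semantics, with plain byte arrays standing in for byte vectors (`SliceDemo` and its helper are illustrative names, not part of the patch):

```java
import java.util.Arrays;

public class SliceDemo {
    // Scalar model of jdk.incubator.vector Vector.slice(origin, v2):
    // lane i of the result is concat(first, second)[origin + i].
    static byte[] slice(byte[] first, int origin, byte[] second) {
        byte[] out = new byte[first.length];
        for (int i = 0; i < out.length; i++) {
            int j = origin + i;
            out[i] = j < first.length ? first[j] : second[j - first.length];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] prev = {1, 2, 3, 4, 5, 6, 7, 8};        // previous chunk
        byte[] cur  = {9, 10, 11, 12, 13, 14, 15, 16}; // current chunk

        // prev1/prev2/prev3: the current chunk shifted right by 1/2/3 lanes,
        // pulling in the trailing bytes of the previous chunk -- exactly the
        // shifted views the validator pairs against the current bytes.
        byte[] prev1 = slice(prev, prev.length - 1, cur);
        byte[] prev2 = slice(prev, prev.length - 2, cur);
        byte[] prev3 = slice(prev, prev.length - 3, cur);

        System.out.println(Arrays.toString(prev1)); // [8, 9, 10, 11, 12, 13, 14, 15]
        System.out.println(Arrays.toString(prev2)); // [7, 8, 9, 10, 11, 12, 13, 14]
        System.out.println(Arrays.toString(prev3)); // [6, 7, 8, 9, 10, 11, 12, 13]
    }
}
```

Because `origin` is a compile-time constant in the validator's hot loop, the JIT can lower the whole operation to a single concatenate-and-shift instruction sequence on suitable targets.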

The following are the results of a performance analysis of Utf8ValidatorBenchmark on an AMD Ryzen 7 7840HS (8C/16T, AVX-512) system.

Baseline: jdk-24.0.2
Benchmark                                (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator  /twitter.json  thrpt    2  62117.351          ops/s

Withopt: jdk-24.0.2
Benchmark                                (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator  /twitter.json  thrpt    2  80763.046          ops/s

Baseline: jdk-26
Benchmark                                     (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator       /twitter.json  thrpt    2  59075.477          ops/s

Withopt: jdk-26 (without ALIGNR)
Benchmark                                     (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator       /twitter.json  thrpt    2  81333.082          ops/s

Withopt: jdk mainline (with ALIGNR)
Benchmark                                     (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator       /twitter.json  thrpt    2  85728.800          ops/s

Analysis:
Utf8Validator with the slice API is performant on AVX-512 targets even without an ALIGNR-based instruction sequence.
On AVX2/Intel E-core targets, however, only the ALIGNR-based slice optimization that is part of JDK mainline PR [1] is performant.

On AMD targets, where both client and server CPUs support the AVX-512 ISA, the slice-based Utf8Validator will always
be performant, even with stock JDK 24.0.2.

On Intel targets, by contrast, we need to wait for the integration of PR [1] and for Gradle compatibility with JDK 25+.
JDK-8290322 improved byte-vector rearrange on AVX2/E-core targets, but its performance does not match a direct
VPERMB instruction.

PS: This patch depends on the integration of the existing simd-json PR [2].

[1] openjdk/jdk#24104
[2] #67

@jatin-bhateja (Author) commented:
Quick notes on the Utf8Validator implementation:

There are two broad categories of errors. The first is detected using 3 distributed lookup tables, each indexed with a 4-bit value,
i.e., one of the first three nibbles of a consecutive byte pair.
    a. Errors detected from two consecutive byte sequences, strictly speaking from the high-order 12 bits of a consecutive byte pair.
        -  overlong 2 bytes
        -  overlong 3 bytes
        -  overlong 4 bytes
            - All overlong errors are encodings that could be represented in a smaller number of bytes. Binary numbers hold
              the property that any number less than a given one can be formed by moving its trailing set bits to lower bit
              positions; hence we pick the minimum legal value of each multi-byte encoding and check for bit patterns lower
              than its first 12 bits.
              2-byte encoding
                    -  110xxxxx 10yyyyyy
                    -  Min value :  110|00010 10|000000   -> C0 and C1 are illegal two-byte leading bytes.
                                -   Byte1hi - 110|0
                                -   Byte1lo - 0010
                                -   Byte2hi - 10|00

                    - Any value less than the minimum will have Byte1lo as 0000 or 0001; therefore the first two rows of the second
                       lookup table "createByte1LowLookup" set the OVERLONG_2BYTE bit.
                    - The existing distributed table lookups are generated based on this concept.
                 
              Similarly, for 3-byte encoding the minimum legal value is
                     - 1110xxxx 10yyzzzz 10vvuuuu
                     - Min value : 1110|0000 10|100000 10|000000
                     - Thus the Byte1hi lookup table sets OVERLONG_3BYTE at index 1110 and the Byte1lo table at index 0000,
                        while the Byte2hi lookup table sets OVERLONG_3BYTE at the continuation indexes below 1010,
                        i.e., 1000 and 1001.
                  
             Similar logic also applies to overlong 4-byte encodings, i.e., code points below U+10000 encoded in 4 bytes.
                     -   11110xxx 10yyzzzz 10uuvvvv 10aabbbb
                     -   Minimum legal value :  11110|000 10|010000 10|000000 10|000000
                     -   The | convention used here differentiates between the fixed format bits and the actual decodable bits.
                     -   Any value below this minimum has no set decodable bits in its first twelve bits and a Byte2hi of 1000;
                          thus we set the OVERLONG_4BYTE bit at index 1111 of the Byte1hi, index 0000 of the Byte1lo, and
                          index 1000 of the Byte2hi lookup tables.

    b.  TOO_SHORT, these are the cases where a leading byte is followed by a non-continuation byte.
            -  11xxxxxx   - 2-, 3-, and 4-byte encoded UTF-8 characters always have the most significant two bits of the
               leading byte set.
            -  Thus, we set the TOO_SHORT bit for indexes 1100, 1101, 1110, and 1111 of the Byte1hi lookup table.
            -  This leading byte should be followed by a continuation byte, i.e., one that begins with 10 in its most significant
                bits; otherwise it is an error scenario.
            -  This is why the TOO_SHORT bit is set for index values 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111 of the
               Byte2hi lookup table.
    
     c.  TOO_LONG, these are the cases where a leading ASCII byte is followed by a continuation byte.
            - All ASCII characters begin with 0 at the most significant bit, thus indexes 0000-0111 of the Byte1hi lookup table
              set the TOO_LONG bit.
            - If the upper nibble of the second byte holds the bit pattern '10' in its most significant two bits, then it is an error
              scenario, which is why indexes 1000-1011 of the Byte2hi lookup table set the TOO_LONG bit.
 
      d.  TOO_LARGE, these are characters whose decoded value is greater than U+10FFFF.
           -  This check mainly targets 4-byte encodings, where any value greater than 11110_100 10_001111 10_111111 10_111111
               is illegal.
           -  Therefore, index 1111 of Byte1hi, the index range 0100 -> 1111 of Byte1lo, and indexes greater than or equal to 1001
               of the Byte2hi lookup table set the TOO_LARGE bit.
 
      e.   SURROGATE, these are bit patterns which overlap with legal UTF-16 surrogate encodings and therefore qualify as
            illegal UTF-8 encodings.
            - Again, the central idea here is that we set the same bit across the matching index ranges of all the lookup tables
              to enable fast error detection.

      f.   3- or 4-byte encoding errors: missing continuation bytes.
         - We first look at the bytes two and three positions before each consecutive pair of continuation bit patterns and
            error out if the bit pattern of the leading byte does not match the 3- or 4-byte encoding format bits, i.e., 1110 or 11110.
         -  The TWO_CONTINUATIONS bit is set for the indexes corresponding to 10xx in the Byte1hi and Byte2hi lookup tables.
            Since we use the three 4-bit nibbles of the first twelve bits of a consecutive byte pair and perform a logical AND
            between the results of the lookups, the TWO_CONTINUATIONS bit is set for all the indexes of the Byte1lo lookup table.

    The current implementation with 3 distributed lookup tables is already quite efficient. With this patch, we aim to reduce the
    code for looking up the last, second-last, and third-last bytes by using a constant-index Vector.slice operation.
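The per-pair classification described in (a)-(f) can be sketched in scalar form. This toy version populates only the OVERLONG_2BYTE bit and uses illustrative table and constant names, not the project's actual `createByte1HighLookup`-style tables, which carry every error class at once:

```java
public class NibbleLookupDemo {
    static final int OVERLONG_2BYTE = 1 << 0; // toy bit assignment, for illustration only

    // Toy 16-entry lookup tables, one per nibble of the first 12 bits of a byte pair.
    static final int[] BYTE1_HI = new int[16];
    static final int[] BYTE1_LO = new int[16];
    static final int[] BYTE2_HI = new int[16];
    static {
        BYTE1_HI[0b1100] = OVERLONG_2BYTE;  // leading bytes C0..CF share high nibble 1100
        BYTE1_LO[0b0000] = OVERLONG_2BYTE;  // low nibble of C0
        BYTE1_LO[0b0001] = OVERLONG_2BYTE;  // low nibble of C1
        for (int i = 0b1000; i <= 0b1011; i++) {
            BYTE2_HI[i] = OVERLONG_2BYTE;   // second byte is a continuation (10xxxxxx)
        }
    }

    // Classify one consecutive byte pair from its first three nibbles;
    // a nonzero result means at least one error bit survived the AND.
    static int classify(int b1, int b2) {
        int byte1hi = (b1 >> 4) & 0xF;
        int byte1lo = b1 & 0xF;
        int byte2hi = (b2 >> 4) & 0xF;
        return BYTE1_HI[byte1hi] & BYTE1_LO[byte1lo] & BYTE2_HI[byte2hi];
    }

    public static void main(String[] args) {
        System.out.println(classify(0xC0, 0x80)); // 1 -> overlong 2-byte, flagged
        System.out.println(classify(0xC1, 0xBF)); // 1 -> overlong 2-byte, flagged
        System.out.println(classify(0xC2, 0x80)); // 0 -> minimal legal 2-byte, clean
    }
}
```

The vectorized validator performs the same three table lookups as shuffles over whole byte vectors and ANDs the results lane-wise, so one pass flags every erroneous byte pair in a chunk.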

jatin-bhateja commented Aug 12, 2025

Here is the PoC implementation for optimizing the existing Utf8Validator with a two-lookup-table solution instead of the existing three-lookup-table solution:
#69
