Optimize Utf8Validator with constant input Vector.slice API #68


Draft: wants to merge 1 commit into base: main

Conversation

@jatin-bhateja commented Aug 6, 2025

This patch replaces Utf8Validator's existing bulky handling for pulling in the last, second-last, and third-last bytes of the previous byte-vector chunk with an optimized constant-input Vector.slice API call.
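For reference, the Vector API's `slice(origin, v2)` treats the receiver and `v2` as one concatenated lane sequence and extracts a full vector's worth of lanes starting at `origin`; with `origin = length - 1` it yields the last byte of the previous chunk followed by the leading bytes of the current chunk. Below is a minimal scalar sketch of that semantics, with plain byte arrays standing in for byte vectors (`SliceDemo` and its helper are illustrative names, not part of the patch):

```java
import java.util.Arrays;

public class SliceDemo {
    // Scalar model of jdk.incubator.vector Vector.slice(origin, v2):
    // lane i of the result is concat(first, second)[origin + i].
    static byte[] slice(byte[] first, int origin, byte[] second) {
        byte[] out = new byte[first.length];
        for (int i = 0; i < out.length; i++) {
            int j = origin + i;
            out[i] = j < first.length ? first[j] : second[j - first.length];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] prev = {1, 2, 3, 4, 5, 6, 7, 8};        // previous chunk
        byte[] cur  = {9, 10, 11, 12, 13, 14, 15, 16}; // current chunk

        // prev1/prev2/prev3: the current chunk shifted right by 1/2/3 lanes,
        // pulling in the trailing bytes of the previous chunk -- exactly the
        // shifted views the validator pairs against the current bytes.
        byte[] prev1 = slice(prev, prev.length - 1, cur);
        byte[] prev2 = slice(prev, prev.length - 2, cur);
        byte[] prev3 = slice(prev, prev.length - 3, cur);

        System.out.println(Arrays.toString(prev1)); // [8, 9, 10, 11, 12, 13, 14, 15]
        System.out.println(Arrays.toString(prev2)); // [7, 8, 9, 10, 11, 12, 13, 14]
        System.out.println(Arrays.toString(prev3)); // [6, 7, 8, 9, 10, 11, 12, 13]
    }
}
```

Because `origin` is a compile-time constant in the validator's hot loop, the JIT can lower the whole operation to a single concatenate-and-shift instruction sequence on suitable targets.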

The following are the results of a performance analysis of Utf8ValidatorBenchmark on an AMD Ryzen 7 7840HS (8C/16T, AVX-512) system.

Baseline: jdk-24.0.2
Benchmark                                (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator  /twitter.json  thrpt    2  62117.351          ops/s

Withopt: jdk-24.0.2
Benchmark                                (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator  /twitter.json  thrpt    2  80763.046          ops/s

Baseline: jdk-26
Benchmark                                     (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator       /twitter.json  thrpt    2  59075.477          ops/s

Withopt: jdk-26 (without ALIGNR)
Benchmark                                     (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator       /twitter.json  thrpt    2  81333.082          ops/s

Withopt: jdk mainline (with ALIGNR)
Benchmark                                     (fileName)   Mode  Cnt      Score   Error  Units
Utf8ValidatorBenchmark.utf8Validator       /twitter.json  thrpt    2  85728.800          ops/s

Analysis:
Utf8Validator with the slice API is performant on AVX-512 targets even without an ALIGNR-based instruction sequence.
On AVX2/Intel E-core targets, however, only the ALIGNR-based slice optimization that is part of JDK mainline PR [1] is performant.

On AMD targets, where both client and server CPUs support the AVX-512 ISA, the slice-based Utf8Validator will always
be performant, even with stock JDK 24.0.2.

On Intel targets, by contrast, we need to wait for the integration of PR [1] and for Gradle compatibility with JDK 25+.
JDK-8290322 improved byte-vector rearrange on AVX2/E-core targets, but its performance does not match a direct
VPERMB instruction.

PS: This patch depends on the integration of the existing simd-json PR [2].

[1] openjdk/jdk#24104
[2] #67

@jatin-bhateja (Author) commented:
Quick notes on the Utf8Validator implementation:

There are two broad categories of errors. The first is detected using 3 distributed lookup tables, each indexed with a 4-bit value,
i.e., one of the first three nibbles of a consecutive byte pair.
    a. Errors detected from two consecutive byte sequences, strictly speaking from the high-order 12 bits of a consecutive byte pair.
        -  overlong 2 bytes
        -  overlong 3 bytes
        -  overlong 4 bytes
            - All overlong errors are encodings that could be represented in a smaller number of bytes. Binary numbers hold
              the property that any number less than a given one can be formed by moving its trailing set bits to lower bit
              positions; hence we pick the minimum legal value of each multi-byte encoding and check for bit patterns lower
              than its first 12 bits.
              2-byte encoding
                    -  110xxxxx 10yyyyyy
                    -  Min value :  110|00010 10|000000   -> C0 and C1 are illegal two-byte leading bytes.
                                -   Byte1hi - 110|0
                                -   Byte1lo - 0010
                                -   Byte2hi - 10|00

                    - Any value less than the minimum will have Byte1lo as 0000 or 0001; therefore the first two rows of the second
                       lookup table "createByte1LowLookup" set the OVERLONG_2BYTE bit.
                    - The existing distributed table lookups are generated based on this concept.
                 
              Similarly, for 3-byte encoding the minimum legal value is
                     - 1110xxxx 10yyzzzz 10vvuuuu
                     - Min value : 1110|0000 10|100000 10|000000
                     - Thus the Byte1hi lookup table sets OVERLONG_3BYTE at index 1110 and the Byte1lo table at index 0000,
                        while the Byte2hi lookup table sets OVERLONG_3BYTE at the continuation indexes below 1010,
                        i.e., 1000 and 1001.
                  
             Similar logic also applies to overlong 4-byte encodings, i.e., code points below U+10000 encoded in 4 bytes.
                     -   11110xxx 10yyzzzz 10uuvvvv 10aabbbb
                     -   Minimum legal value :  11110|000 10|010000 10|000000 10|000000
                     -   The | convention used here differentiates between the fixed format bits and the actual decodable bits.
                     -   Any value below this minimum has no set decodable bits in its first twelve bits and a Byte2hi of 1000;
                          thus we set the OVERLONG_4BYTE bit at index 1111 of the Byte1hi, index 0000 of the Byte1lo, and
                          index 1000 of the Byte2hi lookup tables.

    b.  TOO_SHORT, these are the cases where a leading byte is followed by a non-continuation byte.
            -  11xxxxxx   - 2-, 3-, and 4-byte encoded UTF-8 characters always have the most significant two bits of the
               leading byte set.
            -  Thus, we set the TOO_SHORT bit for indexes 1100, 1101, 1110, and 1111 of the Byte1hi lookup table.
            -  This leading byte should be followed by a continuation byte, i.e., one that begins with 10 in its most significant
                bits; otherwise it is an error scenario.
            -  This is why the TOO_SHORT bit is set for index values 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111 of the
               Byte2hi lookup table.
    
     c.  TOO_LONG, these are the cases where a leading ASCII byte is followed by a continuation byte.
            - All ASCII characters begin with 0 at the most significant bit, thus indexes 0000-0111 of the Byte1hi lookup table
              set the TOO_LONG bit.
            - If the upper nibble of the second byte holds the bit pattern '10' in its most significant two bits, then it is an error
              scenario, which is why indexes 1000-1011 of the Byte2hi lookup table set the TOO_LONG bit.
 
      d.  TOO_LARGE, these are characters whose decoded value is greater than U+10FFFF.
           -  This check mainly targets 4-byte encodings, where any value greater than 11110_100 10_001111 10_111111 10_111111
               is illegal.
           -  Therefore, index 1111 of Byte1hi, the index range 0100 -> 1111 of Byte1lo, and indexes greater than or equal to 1001
               of the Byte2hi lookup table set the TOO_LARGE bit.
 
      e.   SURROGATE, these are bit patterns which overlap with legal UTF-16 surrogate encodings and therefore qualify as
            illegal UTF-8 encodings.
            - Again, the central idea here is that we set the same bit across the matching index ranges of all the lookup tables
              to enable fast error detection.

      f.   3- or 4-byte encoding errors: missing continuation bytes.
         - We first look at the bytes two and three positions before each consecutive pair of continuation bit patterns and
            error out if the bit pattern of the leading byte does not match the 3- or 4-byte encoding format bits, i.e., 1110 or 11110.
         -  The TWO_CONTINUATIONS bit is set for the indexes corresponding to 10xx in the Byte1hi and Byte2hi lookup tables.
            Since we use the three 4-bit nibbles of the first twelve bits of a consecutive byte pair and perform a logical AND
            between the results of the lookups, the TWO_CONTINUATIONS bit is set for all the indexes of the Byte1lo lookup table.

    The current implementation with 3 distributed lookup tables is already quite efficient. With this patch, we aim to reduce the
    code for looking up the last, second-last, and third-last bytes by using a constant-index Vector.slice operation.
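The per-pair classification described in (a)-(f) can be sketched in scalar form. This toy version populates only the OVERLONG_2BYTE bit and uses illustrative table and constant names, not the project's actual `createByte1HighLookup`-style tables, which carry every error class at once:

```java
public class NibbleLookupDemo {
    static final int OVERLONG_2BYTE = 1 << 0; // toy bit assignment, for illustration only

    // Toy 16-entry lookup tables, one per nibble of the first 12 bits of a byte pair.
    static final int[] BYTE1_HI = new int[16];
    static final int[] BYTE1_LO = new int[16];
    static final int[] BYTE2_HI = new int[16];
    static {
        BYTE1_HI[0b1100] = OVERLONG_2BYTE;  // leading bytes C0..CF share high nibble 1100
        BYTE1_LO[0b0000] = OVERLONG_2BYTE;  // low nibble of C0
        BYTE1_LO[0b0001] = OVERLONG_2BYTE;  // low nibble of C1
        for (int i = 0b1000; i <= 0b1011; i++) {
            BYTE2_HI[i] = OVERLONG_2BYTE;   // second byte is a continuation (10xxxxxx)
        }
    }

    // Classify one consecutive byte pair from its first three nibbles;
    // a nonzero result means at least one error bit survived the AND.
    static int classify(int b1, int b2) {
        int byte1hi = (b1 >> 4) & 0xF;
        int byte1lo = b1 & 0xF;
        int byte2hi = (b2 >> 4) & 0xF;
        return BYTE1_HI[byte1hi] & BYTE1_LO[byte1lo] & BYTE2_HI[byte2hi];
    }

    public static void main(String[] args) {
        System.out.println(classify(0xC0, 0x80)); // 1 -> overlong 2-byte, flagged
        System.out.println(classify(0xC1, 0xBF)); // 1 -> overlong 2-byte, flagged
        System.out.println(classify(0xC2, 0x80)); // 0 -> minimal legal 2-byte, clean
    }
}
```

The vectorized validator performs the same three table lookups as shuffles over whole byte vectors and ANDs the results lane-wise, so one pass flags every erroneous byte pair in a chunk.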

jatin-bhateja commented Aug 12, 2025

Here is the PoC implementation for optimizing the existing Utf8Validator with a two-lookup-table solution instead of the existing three-lookup-table solution:
#69
