8361842: Move input validation checks to Java for java.lang.StringCoding intrinsics #25998

vy · 2025-06-26T10:48:37Z

Validate input in java.lang.StringCoding intrinsic Java wrappers, improve their documentation, enhance the checks in the associated IR or assembly code, and adapt them to cause VM crash on invalid input.

Implementation notes

The goal of the associated umbrella issue JDK-8156534 is to, for java.lang.String* classes,

Move @IntrinsicCandidate-annotated public methods¹ (in Java code) to private ones, and wrap them with a public "front door" method
Since we moved the @IntrinsicCandidate annotation to a new method, intrinsic mappings – i.e., associated do_intrinsic() calls in vmIntrinsics.hpp – need to be updated too
Add necessary input validation (range, null, etc.) checks to the newly created public front door method
Place all input validation checks in the intrinsic code (add if missing!) behind a VerifyIntrinsicChecks VM flag

The following preliminary work needs to be carried out as well:

Add a new VerifyIntrinsicChecks VM flag
Update generate_string_range_check to produce a HaltNode. That is, crash the VM if VerifyIntrinsicChecks is set and a Java wrapper fails to spot an invalid input.

¹ @IntrinsicCandidate-annotated constructors are not subject to this change, since they are a special case.
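
As a rough illustration of the "front door" pattern listed above, here is a minimal sketch. The class name FrontDoorSketch and the pure-Java loop body are assumptions made for illustration. The real private method would carry @IntrinsicCandidate and be replaced by the JIT, and the actual StringCoding code may differ in its details.

```java
import java.util.Objects;

// Hypothetical stand-in for the pattern described above; not the JDK's actual code.
final class FrontDoorSketch {

    // Public "front door": performs all validation in Java, then delegates.
    static int countPositives(byte[] ba, int off, int len) {
        Objects.requireNonNull(ba, "ba");
        Objects.checkFromIndexSize(off, len, ba.length);
        return countPositives0(ba, off, len);
    }

    // In the JDK this private method would carry @IntrinsicCandidate;
    // the annotation is omitted here because it is JDK-internal.
    // Simplified fallback: counts the leading non-negative bytes.
    private static int countPositives0(byte[] ba, int off, int len) {
        int limit = off + len;
        for (int i = off; i < limit; i++) {
            if (ba[i] < 0) {
                return i - off;
            }
        }
        return len;
    }

    public static void main(String[] args) {
        System.out.println(countPositives(new byte[]{1, 2, 3}, 0, 3));
        try {
            countPositives(new byte[]{1, 2, 3}, 2, 5);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this shape, the intrinsic body never sees unvalidated input as long as every caller goes through the front door.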

Functional and performance tests

tier1 (which includes test/hotspot/jtreg/compiler/intrinsics/string) passes on several platforms. Further tiers will be executed after integrating reviewer feedback.
Performance impact is still actively monitored using test/micro/org/openjdk/bench/java/lang/String{En,De}code.java, among other tests. If you have suggestions on benchmarks, please share in the comments.

Verification of the VM crash

I've tested the VM crash scenario as follows:

Created the following test program:

public class StrIntri {
    public static void main(String[] args) {
        Exception lastException = null;
        for (int i = 0; i < 1_000_000; i++) {
            try {
                jdk.internal.access.SharedSecrets.getJavaLangAccess().countPositives(new byte[]{1,2,3}, 2, 5);
            } catch (Exception exception) {
                lastException = exception;
            }
        }
        if (lastException != null) {
            lastException.printStackTrace();
        } else {
            System.out.println("completed");
        }
    }
}

Compiled the JDK and ran the test:

$ bash jib.sh configure -p linux-x64-slowdebug
$ CONF=linux-x64-slowdebug make jdk
$ ./build/linux-x64-slowdebug/jdk/bin/java -XX:+VerifyIntrinsicChecks --add-exports java.base/jdk.internal.access=ALL-UNNAMED StrIntri.java
java.lang.ArrayIndexOutOfBoundsException: Range [2, 2 + 5) out of bounds for length 3

Received AIOOBE as expected.

Removed all checks in StringCoding.java, and re-compiled the JDK
Set the countPositives(...) arguments in the program to (null, 1, 1), ran it, and observed the VM crash with unexpected null in intrinsic.
Set the countPositives(...) arguments in the program to (new byte[]{1,2,3}, 2, 5), ran it, and observed the VM crash with unexpected guard failure in intrinsic.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8361842: Move input validation checks to Java for java.lang.StringCoding intrinsics (Sub-task - P4)

Reviewers

Damon Fenacci (@dafedafe - Committer) 🔄 Re-review required (review applies to db1ed388)
Tobias Hartmann (@TobiHartmann - Reviewer) 🔄 Re-review required (review applies to db1ed388)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25998/head:pull/25998
$ git checkout pull/25998

Update a local copy of the PR:
$ git checkout pull/25998
$ git pull https://git.openjdk.org/jdk.git pull/25998/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25998

View PR using the GUI difftool:
$ git pr show -t 25998

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25998.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-06-26T10:49:31Z

👋 Welcome back vyazici! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-06-26T10:49:35Z

@vy This change is no longer ready for integration - check the PR body for details.

openjdk · 2025-06-26T10:50:11Z

@vy The following labels will be automatically applied to this pull request:

core-libs
graal
hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

src/java.base/share/classes/java/lang/StringCoding.java


vy · 2025-07-10T12:18:25Z

src/java.base/share/classes/java/lang/StringCoding.java

+     * </p>
+     *
+     * @param sa the source byte array containing characters encoded in UTF-16
+     * @param sp the index of the <em>byte (not character!)</em> from the source array to start reading from


Note the byte (not character!) emphasis here and below.

I think this is incorrect.
This is the index of a character (two bytes).
As it is used in encodeISOArray0(), it is incremented by 1 and passed to StringUTF16.getChar(), where it is multiplied by 2 to obtain the real byte[] index.

vy · 2025-07-10T12:20:45Z

src/java.base/share/classes/java/lang/StringCoding.java

+     *         {@linkplain Preconditions#checkFromIndexSize(int, int, int, BiFunction) out of bounds}
+     */
+    static int encodeISOArray(byte[] sa, int sp, byte[] da, int dp, int len) {
+        checkFromIndexSize(sp, len << 1, requireNonNull(sa, "sa").length, AIOOBE_FORMATTER);


sa contains 2-byte chars, and sp points to an index of this inflated array. Though, len denotes the codepoint count, hence the len << 1 while checking sp and len bounds.

The reference to sa.length is likely wrong as well: it is the source length in bytes, whereas the index check should use the source length in chars.
It might be worth trying to find or create a test for the accidental misinterpretation of a length in bytes vs. chars.
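
To make the byte-versus-char concern concrete, here is a small sketch of the kind of test input that separates the two interpretations. The class and method names are hypothetical; only Objects.checkFromIndexSize is a real API. It assumes sp and len are both char counts, as the review comments suggest.

```java
import java.util.Objects;

// Hypothetical helper names; only Objects.checkFromIndexSize is a real API.
final class CharVsByteBounds {

    // Scales len to bytes but leaves sp unscaled: the mixed-unit check questioned above.
    static boolean mixedUnitsCheckPasses(byte[] sa, int sp, int len) {
        try {
            Objects.checkFromIndexSize(sp, len << 1, sa.length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    // Treats sp and len as char counts and compares against the length in chars.
    static boolean charUnitsCheckPasses(byte[] sa, int sp, int len) {
        try {
            Objects.checkFromIndexSize(sp, len, sa.length >> 1);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] sa = new byte[8];   // room for 4 UTF-16 chars
        int sp = 3, len = 2;       // asks for chars 3 and 4; char 4 does not exist
        System.out.println("mixed units: " + mixedUnitsCheckPasses(sa, sp, len));
        System.out.println("char units:  " + charUnitsCheckPasses(sa, sp, len));
    }
}
```

For an 8-byte array with sp = 3 and len = 2, the mixed-unit check accepts the range (3 + 4 <= 8) even though reading char 4 would touch bytes 8 and 9, while the char-unit check rejects it.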

src/hotspot/share/opto/library_call.cpp

mlbridge · 2025-07-10T12:58:53Z

Webrevs

rose00 · 2025-07-10T20:54:24Z

I disagree with a small part of the statement of goals:

Always validate all input at the intrinsic (but preferably behind a VM flag)

As formulated above, this is a violation of DRY and if embraced the wrong way will lead to code that is harder to review and prove bug-free. Performing 100% accurate range/null/validation checks is deeply impractical for an assembly-based or IR-based intrinsic. It’s too hard to verify by code review, and coverage testing is suspect.

We must frankly put all the weight of verification on Java code, including Java bytecode intrinsic behaviors. Java code is high-level and can be read mostly as a declarative spec, if clearly written (as straight-line code, then the intrinsic call). Also, such simple Java code shapes (and their underlying bytecodes) are tested many orders of magnitude more than any given intrinsic.

I see two bits of evidence that you agree with me on this: 1. The intrinsic-local validation (IR or assembly) is allowed to Halt instead of throw, and 2. the intrinsic-local validation is optional, turned on only by a stress test mode. This tells me that the extra optional testing is also not required to be 100%.

Thus, I think the above goal would be better stated this way:

Validate input in the IR or assembly code of the intrinsic in an ad hoc manner to catch bugs in the Java validation.

Note: IR- or assembly-based validation code should not obscure the code or add large maintenance costs; it should sit behind a VM diagnostic flag (or debug flag) and cause a VM halt instead of a Java throw.

I think I'm agreeing with you on the material points. It is important to summarize our intentions accurately at the top, for those readers that are reading only the top as a summary.

vy · 2025-07-11T08:22:49Z

@rose00, thanks so much for the feedback. I agree with your remarks and take your point that "Always validate all input at the intrinsic" is a violation of DRY and an impractical goal.

I incorporated your suggestions as follows:

Renamed the ticket to Move input validation checks to Java for String-related intrinsics (to better reflect the goal)
Replaced Always validate all input at the intrinsic... with your suggestion

dafedafe

Thanks a lot for looking into this Volkan!
I left a couple of minor comments.
I also noticed that you haven't yet added the benchmark results to the description: do you want to run them again after the reviews?

src/hotspot/share/opto/c2_globals.hpp

src/hotspot/cpu/x86/macroAssembler_x86.cpp

src/hotspot/share/classfile/vmIntrinsics.hpp

vy · 2025-07-15T19:31:45Z

I left a couple of minor comments. I also noticed that you haven't yet added the benchmark results to the description: do you want to run them again after the reviews?

@dafedafe, thanks so much for the review! I've implemented the changes you requested, and shared some benchmark figures in the associated ticket. I am still actively working on evaluating the performance impact.

src/java.base/share/classes/java/lang/StringCoding.java

src/java.base/share/classes/sun/nio/cs/ISO_8859_1.java

src/java.base/share/classes/java/lang/StringCoding.java

rgiulietti · 2025-07-17T14:11:52Z

src/java.base/share/classes/java/lang/StringCoding.java

+     * </p>
+     *
+     * @param sa the source byte array containing characters encoded in UTF-16
+     * @param sp the index of the <em>byte (not character!)</em> from the source array to start reading from


I think this is incorrect.
This is the index of a character (two bytes).
As it is used in encodeISOArray0(), it is incremented by 1 and passed to StringUTF16.getChar(), where it is multiplied by 2 to obtain the real byte[] index.

rgiulietti · 2025-07-17T14:29:08Z

What is the thinking when an @IntrinsicCandidate method invokes another @IntrinsicCandidate method?
Which part is responsible for the checks?

For example, the Java code of StringCoding.encodeISOArray0() invokes StringUTF16.getChar(), another @IntrinsicCandidate method. The latter does not check its arguments (OK, there's an assert, but this is a weak check). The invocation from encodeISOArray0() is fine and safe, but getChar() is invoked by other parts of the code.

So what is the general strategy? Add checks to getChar() and rely on the runtime to eliminate redundant checks?

src/java.base/share/classes/java/lang/StringCoding.java

test/hotspot/jtreg/compiler/intrinsics/TestVerifyIntrinsicChecks.java

rgiulietti · 2025-07-17T15:33:37Z

What is the thinking when an @IntrinsicCandidate method invokes another @IntrinsicCandidate method? Which part is responsible for the checks?

For example, the Java code of StringCoding.encodeISOArray0() invokes StringUTF16.getChar(), another @IntrinsicCandidate method. The latter does not check its arguments (OK, there's an assert, but this is a weak check). The invocation from encodeISOArray0() is fine and safe, but getChar() is invoked by other parts of the code.

So what is the general strategy? Add checks to getChar() and rely on the runtime to eliminate redundant checks?

To reformulate my confusing question for the above example, there are apparently around 75-80 invocations of getChar(). How to make sure that they are all safe? Some are easy to verify, but others are not.

It's not possible to determine the required capacity of the target array in constant time, as Unicode code points may occupy either one or two `char` values. As a result, existing implementations typically invoke encoding methods in a loop, handling each unmappable character on a case-by-case basis. For an example, see `sun.nio.cs.DoubleByte.Encoder::encode`.

rose00 · 2025-07-18T23:18:16Z

What is the thinking when an @IntrinsicCandidate method invokes another @IntrinsicCandidate method? Which part is responsible for the checks?

This is a good question. Suppose IC1 calls IC2 and both are intrinsic candidates, and suppose that M1 and M2 are their checked "front doors".

I think the answer has to be that, once you start executing IC1, you cannot expect any further checks. Probably some assembler macro implements IC2 and it may be called from more than one place. The tricky thing to prove is that all uses of IC2's intrinsic code, whether direct (via M2) or indirect (via things like M1) have adequate checks.

If intrinsics are factored this way (as they are for string methods) I think that IC1 has to advertise that it calls IC2, so that the front door method M1 is responsible for validity checks for both IC1 and IC2. That is because after intrinsic expansion, IC2 is reached without going through M2; the entry was indirectly from M1. So M1 has to duplicate M2's front door checks.

To make this workable, it may be that M2's front door checks are factored into a subroutine FD2, so that M1 can refer to FD2, rather than do risky code duplication.

If (as in this case) IC2 loops over calls to IC1, then perhaps M2 should have a companion method FD2R which checks a range of inputs to IC2, so that M1 can call FD2R. If all goes well, then FD2R has a range check that duplicates the front door logic of M1, so that the JIT can remove the duplicate checking.

In the case of StringUTF16.getChar, I see it is marked as trusted, and it does not have a front-door method, and does have many callers. In the terms of this PR, perhaps it should be renamed getChar0 (or the like) to make it more clear (at non-local use points) that it must be called from trusted code. Perhaps it should also have a range check method associated with it, so that some callers can use that range check method, so that the non-local responsibility is more clearly fulfilled.

Maybe some callers (if less performance critical) should be changed to call a properly checked front-door method, getChar (as opposed to getChar0). Remaining callers of getChar0 should be clearly linked to the front-door checks required by getChar0.

The above seems to be the principled way to deal with an unchecked intrinsic called from many trusted use sites. The basic idea is that every trusted use site should reaffirm its responsibility locally, not just hope that a non-local assert will catch a bug. We want some kind of reviewable (static/local) proof that each use site (of an unchecked private intrinsic) has correct checks.

Some examples: A new front-door getChar method can be used in less important places like AbstractStringBuilder::getChar.

In trusted loops over getChar like String::encodeASCII, the loop containing getChar can be prefaced by a range check which is batched for all the loop iterations, something like StringUTF16.getCharChecks(val, 0, len). The same pattern occurs in String::encode8859_1 and encodeUTF8_UTF16 and computeSizeUTF8_UTF16 and maybe elsewhere. The val reference and limit variable len or sl should be marked final to ensure that the batched range check remains correct (because it should not take loop-variant inputs).

As I read through String.java I see that a batched range check would cover a lot of use cases… I haven't read through all the uses of getChar, however.

The intrinsic encodeISOArray0 (was implEncodeISOArray) calls getChar. This is an example where its front door method (now encodeISOArray with no "0") should call a batched check method like getCharBatchChecks. Let's look at this in detail. The getCharBatchChecks method could look like this:

//non-public
void getCharBatchChecks(byte[] val, int charStart, int charSize) {
  Objects.requireNonNull(val, "val");  // *** what style guide mandates this line??
  Preconditions.checkFromIndexSize(charStart << 1, charSize << 1, val.length, Preconditions.AIOOBE_FORMATTER);
  // *** using "char" in the names helps reduce confusion from the mix of byte and char indexes
}
…
static int encodeISOArray(…) {
    …
    StringUTF16.getCharBatchChecks(sa, sp, len);  // next method loops over getChar(sa, sp++)
    return encodeISOArray0(sa, sp, da, dp, len);
}

Note that after inlining, the batch checks exactly match pre-existing checks for the caller intrinsic. Perhaps the caller's checks could be removed manually, or perhaps the JIT removes the duplication.

Actually, I think you got this documentation wrong:

@param sp the index of the <em>byte (not character!)</em> from the source array to start reading from

AFAICT, sp is a char index; note that getChar scales it as (index=sp)<<1.

Note that getChar has zero javadoc, so you are left to guess helplessly about its index operand.

This stuff is complicated to get right. The above exercise in wiring up the checking logic tends to uncover bugs and misconceptions, I think.

rose00 · 2025-07-19T06:22:33Z

If (as in this case) IC2 loops over calls to IC1
Correction; I meant IC1 calls IC2, in a loop, N times. We don't want a pre-loop in M1 that checks each of N distinct arguments to IC2 (like N calls to M2 would), but rather a batch check routine which checks all of the arguments to IC2, in O(1) time.
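
The batched-check idea can be sketched in runnable form as follows. All names here (BatchCheckSketch, getCharUnchecked, encodeLatin1) are hypothetical, big-endian byte order is assumed for simplicity, and the check is written in char units rather than with the shifted byte offsets shown earlier. This is a simplified model, not the JDK's StringUTF16 or StringCoding code.

```java
import java.util.Objects;

// Hypothetical names throughout; a simplified model of "check once, then loop".
final class BatchCheckSketch {

    // Unchecked accessor: index is a char index into a UTF-16 byte array.
    // Big-endian byte order is assumed here for simplicity.
    private static char getCharUnchecked(byte[] val, int index) {
        int i = index << 1;
        return (char) (((val[i] & 0xff) << 8) | (val[i + 1] & 0xff));
    }

    // One O(1) check covering every char index the loop below will touch.
    static void getCharBatchChecks(byte[] val, int charStart, int charSize) {
        Objects.requireNonNull(val, "val");
        Objects.checkFromIndexSize(charStart, charSize, val.length >> 1);
    }

    // Front door: validate once, then run the trusted loop.
    // sa, sp, and len are final so the batched check stays valid for the whole loop.
    static int encodeLatin1(final byte[] sa, final int sp, byte[] da, int dp, final int len) {
        getCharBatchChecks(sa, sp, len);
        Objects.checkFromIndexSize(dp, len, da.length);
        for (int i = 0; i < len; i++) {
            char c = getCharUnchecked(sa, sp + i);
            if (c > 0xff) {
                return i;          // stop at the first char outside ISO-8859-1
            }
            da[dp + i] = (byte) c;
        }
        return len;
    }
}
```

The point of the shape is that the loop performs N unchecked reads after a single constant-time check, instead of N separately checked reads.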

vy · 2025-07-21T12:49:13Z

I needed to replace all Preconditions invocations that throw AIOOBE on failure with a more lenient approach that returns 0 on out-of-bounds input, because:

this matches the compiler intrinsic behavior
there are several (i.e., ~7) sun.nio.cs classes that depend on this lenient behavior. I needed to either fix(?) these 7 classes or make the intrinsic wrappers more lenient
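
For readers unfamiliar with what "lenient" means here, the following is a minimal sketch of a front door that returns 0 instead of throwing when the requested range is out of bounds. The class name, the char[] source type, and the exact form of the bounds test are assumptions made for illustration; the checks that actually landed in the PR may be shaped differently.

```java
// Hypothetical sketch of a lenient front door; not the code from the PR.
final class LenientFrontDoorSketch {

    // Returns the number of chars encoded, or 0 if the requested range is invalid.
    static int encodeAscii(char[] sa, int sp, byte[] da, int dp, int len) {
        // Reading sa.length and da.length acts as an implicit null check (NPE on null).
        if ((sp | dp | len) < 0 || sp > sa.length - len || dp > da.length - len) {
            return 0;
        }
        return encodeAscii0(sa, sp, da, dp, len);
    }

    // Stand-in for the intrinsic candidate: copies chars until the first non-ASCII one.
    private static int encodeAscii0(char[] sa, int sp, byte[] da, int dp, int len) {
        for (int i = 0; i < len; i++) {
            char c = sa[sp + i];
            if (c >= 0x80) {
                return i;
            }
            da[dp + i] = (byte) c;
        }
        return len;
    }
}
```

A caveat of this style is that a return value of 0 no longer distinguishes "first char is not encodable" from "arguments were out of bounds", so callers that loop on the result need to be aware of it.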

vy · 2025-07-21T12:52:14Z

What is the thinking when an @IntrinsicCandidate method invokes another @IntrinsicCandidate method? Which part is responsible for the checks?

...
In the case of StringUTF16.getChar, ...

@rgiulietti, thanks so much for this crucial question. @rose00, thanks so much for the elaborate response. I will work on StringUTF16 in a separate PR and use these guidelines provided. 🙇

vy · 2025-07-21T13:53:46Z

Even though the tier1, tier2, tier3, tier4, tier5, hs-comp-stress, and hs-precheckin-comp tests pass on several platforms, @rgiulietti pointed out other shortcomings of the recent lenient approach. Please allow me some time with this PR. I will keep this PR updated. 🍿

myankelev · 2025-07-21T15:39:34Z

Minor: could you please add a bug id to the @bug annotations in tests?

vy · 2025-07-23T12:49:56Z

Minor: could you please add a bug id to the @bug annotations in tests?

@myankelev, thanks for the heads up. Implemented in 1d02189.

RogerRiggs

Will re-review when the changes settle.


vy · 2025-07-28T09:33:13Z

Status update:

Changes are pretty much settled. I'm waiting for a final review from @cl4es.
Checked the performance impact of the changes against DaCapo BioJava and the org.openjdk.bench.java.lang.String{En,De}code benchmarks on several platforms. Spotted one consistent, significant (i.e., >2%) regression: StringDecode::decodeShortMixed; explicitly forcing inlining¹ did not help either. Given that this is the only regression and that it is benchmark- and platform-specific, @RogerRiggs suggested creating a follow-up ticket for it and integrating this PR. I will wait for input from @cl4es.

¹ -XX:CompileCommand=inline,java.lang.StringCoding::encodeISOArray -XX:CompileCommand=inline,java.lang.StringCoding::encodeAsciiArray

Move StringCoding::countPositives checks from C++ to Java

ac5df9f

openjdk bot added the graal, hotspot, and core-libs labels Jun 26, 2025

Apply review feedback

1498824

liach reviewed Jun 26, 2025

src/java.base/share/classes/java/lang/StringCoding.java

vy added 5 commits July 4, 2025 16:34

Add StringCodingCountPositives benchmark

196fc5d

Improve intrinsics in StringCoding

9932dd3

Remove StringCodingCountPositives, String{En,De}code already cove…

14275e5

…r our cases This reverts commit 196fc5d.

Fix EUC_JP.java.template broken due to encodeASCII rename

b9a6adf

Merge remote-tracking branch 'upstream/master' into strIntrinCheck

6af9864

vy changed the title ~~8156534: Check if range checks can be moved into Java wrapper for intrinsics~~ 8361842: Validate input in both Java and C++ for java.lang.StringCoding intrinsics Jul 10, 2025

vy commented Jul 10, 2025

src/hotspot/share/opto/library_call.cpp

vy marked this pull request as ready for review July 10, 2025 12:55

openjdk bot added the rfr Pull request is ready for review label Jul 10, 2025

vy changed the title ~~8361842: Validate input in both Java and C++ for java.lang.StringCoding intrinsics~~ 8361842: Move input validation checks to Java for String-related intrinsics Jul 11, 2025

dafedafe reviewed Jul 14, 2025

src/hotspot/share/opto/c2_globals.hpp

src/hotspot/cpu/x86/macroAssembler_x86.cpp

src/hotspot/share/classfile/vmIntrinsics.hpp

vy added 3 commits July 15, 2025 21:10

Improve wording of the VerifyIntrinsicChecks flag

c331fbf

Remove Markdown-styling in comments

b60ff45

Minimize the number of touched lines in vmIntrinsics.hpp

7c042b3

RogerRiggs reviewed Jul 15, 2025


rgiulietti reviewed Jul 17, 2025

src/java.base/share/classes/java/lang/StringCoding.java

mur47x111 reviewed Jul 17, 2025

test/hotspot/jtreg/compiler/intrinsics/TestVerifyIntrinsicChecks.java

vy changed the title ~~8361842: Move input validation checks to Java for String-related intrinsics~~ 8361842: Move input validation checks to Java for java.lang.StringCoding intrinsics Jul 18, 2025

openjdk bot added the ready Pull request is ready to be integrated label Jul 18, 2025

vy added 3 commits July 18, 2025 14:45

Disable TestVerifyIntrinsicChecks for GraalVM

4016c7a

Fix encodeISOArray bounds checks and Javadoc

943f840

vy added 2 commits July 21, 2025 10:01

Make StringCoding encoding intrinsics lenient

fb8f6ef

Merge remote-tracking branch 'upstream/master' into strIntrinCheck

f69374f

openjdk bot removed the ready Pull request is ready to be integrated label Jul 21, 2025

Remove superseded @throws Javadoc

86e3ed8

Fix bit shifting

025c7ef

vy added 4 commits July 22, 2025 12:40

Cap destination array bounds

07cd41c

Make source array bound checks lenient too

cb4780d

Improve wording of @param len

dc5e673

Add @bug tags

1d02189

RogerRiggs suggested changes Jul 24, 2025


vy added 2 commits July 25, 2025 09:36

Replace requireNonNull with implicit null checks to reduce bytecode…

e70dfa3

… size

Merge remote-tracking branch 'upstream/master' into strIntrinCheck

c322f0e
