-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8361842: Move input validation checks to Java for java.lang.StringCoding intrinsics #25998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
👋 Welcome back vyazici! A progress list of the required criteria for merging this PR into |
@vy This change is no longer ready for integration - check the PR body for details. |
…r our cases This reverts commit 196fc5d.
* </p> | ||
* | ||
* @param sa the source byte array containing characters encoded in UTF-16 | ||
* @param sp the index of the <em>byte (not character!)</em> from the source array to start reading from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the byte (not character!)
emphasis here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is incorrect.
This is the index of a character (two bytes).
As it is used in encodeISOArray0()
, it is incremented by 1 and passed to StringUTF16.getChar()
, where it is multiplied by 2 to obtain the real byte[]
index.
* {@linkplain Preconditions#checkFromIndexSize(int, int, int, BiFunction) out of bounds} | ||
*/ | ||
static int encodeISOArray(byte[] sa, int sp, byte[] da, int dp, int len) { | ||
checkFromIndexSize(sp, len << 1, requireNonNull(sa, "sa").length, AIOOBE_FORMATTER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sa
contains 2-byte char
s, and sp
points to an index of this inflated array. Though, len
denotes the codepoint count, hence the len << 1
while checking sp
and len
bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference of sa.length
is likely wrong also, as it is the source length in bytes but for the index check should be checking the source length in chars.
It might be worth trying to find or create a test for the accidental incorrect interpretation of length in bytes vs chars..
Webrevs
|
I disagree with a small part of the statement of goals:
As formulated above, this is a violation of DRY and if embraced the wrong way will lead to code that is harder to review and prove bug-free. Performing 100% accurate range/null/validation checks is deeply impractical for an assembly-based or IR-based intrinsic. It’s too hard to verify by code review, and coverage testing is suspect. We must frankly put all the weight of verification on Java code, including Java bytecode intrinsic behaviors. Java code is high-level and can be read mostly as a declarative spec, if clearly written (as straight-line code, then the intrinsic call). Also, such simple Java code shapes (and their underlying bytecodes) are tested many orders of magnitude more than any given intrinsic. I see two bits of evidence that you agree with me on this: 1. The intrinsic-local validation (IR or assembly) is allowed to Halt instead of throw, and 2. the intrinsic-local validation is optional, turned on only by a stress test mode. This tells me that the extra optional testing is also not required to be 100%. Thus, I think the above goal would be better stated this way:
I think I'm agreeing with you on the material points. It is important to summarize our intentions accurately at the top, for those readers that are reading only the top as a summary. |
@rose00, thanks so much for the feedback. I agree with your remarks and get your points on "Always validate all input at the intrinsic" is a violation of DRY and an impractical goal. I incorporated your suggestions as follows:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for looking into this Volkan!
I left a couple of minor comments.
I also noticed that you haven't yet added the benchmark results to the description: do you want to run them again after the reviews?
@dafedafe, thanks so much for the review! I've implemented the changes you requested, and shared some benchmark figures in the associated ticket. I am still actively working on evaluating the performance impact. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me too. Thanks @vy!
* </p> | ||
* | ||
* @param sa the source byte array containing characters encoded in UTF-16 | ||
* @param sp the index of the <em>byte (not character!)</em> from the source array to start reading from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is incorrect.
This is the index of a character (two bytes).
As it is used in encodeISOArray0()
, it is incremented by 1 and passed to StringUTF16.getChar()
, where it is multiplied by 2 to obtain the real byte[]
index.
What is the thinking when an For example, the Java code of So what is the general strategy? Add checks to |
To reformulate my confusing question for the above example, there are apparently around 75-80 invocations of |
It's not possible to determine the required capacity of the target array in constant time, as Unicode code points may occupy either one or two `char` values. As a result, existing implementations typically invoke encoding methods in a loop, handling each unmappable character on a case-by-case basis. For an example, see `sun.nio.cs.DoubleByte.Encoder::encode`.
This is a good question. Suppose IC1 calls IC2 and both are intrinsic candidates, and suppose that M1 and M2 are their checked "front doors". I think the answer has to be that, once you start executing IC1, you cannot expect any further checks. Probably some assembler macro implements IC2 and it may be called from more than one place. The tricky thing to prove is that all uses of IC2's intrinsic code, whether direct (via M2) or indirect (via things like M1) have adequate checks. If intrinsics are factored this way (as they are for string methods) I think that IC1 has to advertise that it calls IC2, so that the front door method M1 is responsible for validity checks for both IC1 and IC2. That is because after intrinsic expansion, IC2 is reached without going through M2; the entry was indirectly from M1. So M1 has to duplicate M2's front door checks. To make this workable, it may be that M2's front door checks are factored into a subroutine FD2, so that M1 can refer to FD2, rather than do risky code duplication. If (as in this case) IC2 loops over calls to IC1, then perhaps M2 should have a companion method FD2R which checks a range a range of inputs to IC2, so that M1 can call FD2R. If all goes well, then FD2R has a range check that duplicates the front door logic of M1, so that the JIT can remove the duplicate checking. In the case of Maybe some callers (if less performance critical) should be changed to call a properly checked front-door method, The above seems to be the principled way to deal with an unchecked intrinsic called from many trusted use sites. The basic idea is that every trusted use site should reaffirm its responsibility locally, not just hope that a non-local assert will catch a bug. We want some kind of reviewable (static/local) proof that each use site (of an unchecked private intrinsics) has correct checks. Some examples: A new front-door In trusted loops over As I read through The intrinsic
Note that after inlining, the batch checks exactly match pre-existing checks for the caller intrinsic. Perhaps the caller's checks could be removed manually, or perhaps the JIT removes the duplication. Actually, I think you got this documentation wrong:
AFAICT, Note that This stuff is complicated to get right. The above exercise in wiring up the checking logic tends to uncover bugs and misconceptions, I think. |
|
Needed to replace all
|
@rgiulietti, thanks so much for this crucial question. @rose00, thanks so much for the elaborate response. I will work on |
Even though the |
Minor: could you please add a bug id to the |
@myankelev, thanks for the heads up. Implemented in 1d02189. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will re-review when the changes settle.
Validate input in
java.lang.StringCoding
intrinsic Java wrappers, improve their documentation, enhance the checks in the associated IR or assembly code, and adapt them to cause VM crash on invalid input.Implementation notes
The goal of the associated umbrella issue JDK-8156534 is to, for
java.lang.String*
classes,@IntrinsicCandidate
-annotatedpublic
methods1 (in Java code) toprivate
ones, and wrap them with apublic
"front door" method@IntrinsicCandidate
annotation to a new method, intrinsic mappings – i.e., associateddo_intrinsic()
calls invmIntrinsics.hpp
– need to be updated tooVerifyIntrinsicChecks
VM flagFollowing preliminary work needs to be carried out as well:
VerifyIntrinsicChecks
VM flaggenerate_string_range_check
to produce aHaltNode
. That is, crash the VM ifVerifyIntrinsicChecks
is set and a Java wrapper fails to spot an invalid input.1
@IntrinsicCandidate
-annotated constructors are not subject to this change, since they are a special case.Functional and performance tests
tier1
(which includestest/hotspot/jtreg/compiler/intrinsics/string
) passes on several platforms. Further tiers will be executed after integrating reviewer feedback.Performance impact is still actively monitored using
test/micro/org/openjdk/bench/java/lang/String{En,De}code.java
, among other tests. If you have suggestions on benchmarks, please share in the comments.Verification of the VM crash
I've tested the VM crash scenario as follows:
Received
AIOOBE
as expected.StringCodec.java
, and re-compiled the JDKcountPositives(...)
arguments in the program to(null, 1, 1)
, run it, and observed the VM crash withunexpected null in intrinsic
.countPositives(...)
arguments in the program to(new byte[]{1,2,3}, 2, 5)
, run it, and observed the VM crash withunexpected guard failure in intrinsic
.Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25998/head:pull/25998
$ git checkout pull/25998
Update a local copy of the PR:
$ git checkout pull/25998
$ git pull https://git.openjdk.org/jdk.git pull/25998/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 25998
View PR using the GUI difftool:
$ git pr show -t 25998
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25998.diff
Using Webrev
Link to Webrev Comment