-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding #26249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…truction encoding
👋 Welcome back yadongwang! A progress list of the required criteria for merging this PR into |
@theRealAph @adinn Could you help review it? |
@yadongw This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 43 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
Webrevs
|
Can't you just delete the special handling of byte_map_base? |
I can choose to delete it directly, but do you think the current changes in pr are safe? |
It looks like riscv.ad has the same problem. I think Andrew's suggestion is safe. We will emit the correct code based on relocType. |
Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport? |
Let me see if I can unpack your question. Yes jdk8u still uses immPollPage, which is similar to immByteMapBase, but I don't think it has the same problem, because I don't see how an oop could have the same address as the polling page. If we fix riscv.ad now, there won't be a clean backport to jdk8u because the riscv port doesn't exist. For aarch64, either of the two proposed fixes could be backported to jdk8u, correct? |
Yes, but byte_map_base can be same address as the polling page in jdk8u, in both 2 proposed fixes. |
Hi. Not a code review. Our internal CI reported a jtreg failure
It seems that the newly added assertion by this PR is hit in this jtreg case. |
Sure, I will follow Andrew's suggestion, removing immByteMapBase. |
It is such a beautiful bug to read about on Friday. So the net effect of this mismatch is that we miss oop relocation/record when
I think we should do these things separately:
The backports for (1) would not be clean, as Generational Shenandoah barrier checks would likely trigger technical conflicts in the code that is being removed. So there is doubly no point in going for clean backports, we should really slice them by the rule we are removing. |
Agree, very clear strategy. l'll just remove the bytemapbase rule in this pr and do enough tests. |
…truction encoding
The proposed solution is not simply going to work when the Leyden project introduces code save/restore. In the assembly phase for an AOT cache (i.e when compiling code to store in the cache) we need to recognize that an incoming address is the byte map base address and generate an lea with an external address relocation. So, the current code in premain relies on matching against immByteMapBase. I believe we can retain the current approach if we make the immByteMapBase predicate test the operand type for a RawPtr rather than an OopPtr. That is actually the key distinction that separates the cases we are dealing with. I believe it would work for both immByteMapBase and immPollPage and is a much smaller change to the status quo. |
Wait, but only AArch64/RISC-V have |
Poll page does this. No steenkin' reloc necessary...
|
To be clear, I would very much prefer to remove the special-case handling for card table base in AArch64/RISC-V AD, rather than piling on more special cases into that rule. |
I'm not sure how this works for the byte_map_base on x86. I just looked through the x86 premain code and I cannot find any special case handling for it. However, on both x86 and aarch64 we also use immediate ConP operand rules to inject relocs for constant addresses that refer to entries in the (global) AOT Runtime Constants area. So, this is another case similar to byte_map_base where we AOT compilation needs to recognize a special address and handle it appropriately. Here are the x86 rules for AOT Constant addressses
The referenced macro assembler method is defined as follows
I'm not sure whether the latest premain code is just in flux and does not yet deal with byte_map_base on x86 or whether there is some other mechanism. Perhaps @ashu-mehra or @vnkozlov can clarify. |
To be clear: it is not just the card table base that needs special case handling in Leyden. |
@shipilev One other thing. I believe that the only code that creates a RawPtr ConP for the card table base is in the shared C2 card table barrier set assembler. Now that we have moved to late barrier insertion for G1 (along with Shenandoah and Z) is there still any code -- in the dev tree or in jdk25 -- that will create one of these nodes? If not then perhaps the aarch64 immByteMapBase and loadByteMapBase rules are no longer needed and should be deleted? |
On x86 byte_map_base is handled in GC code: Using relocation for byte_map_base is not safe (see comment in I have long standing work to use AOTRuntimeConstants table instead for it. |
Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation? If that is the case then I don't think we need the immAOTRuntimeConstantsAddress and loadAOTRCAddress rules in x86_64.ad either. Likewise we can drop the equivalents from aarch64.ad. |
Shenandoah still does not do late barrier expansion. But it also does not emit card table bases as constants, it loads card table bases from TLS. G1 would do this with throughput GC barriers soon too, AFAICS. I think Shenandoah also encodes some "unallocated" |
Correct. |
Correction. It is true for G1, Z, Shenandoah. For others we still have constant in C2 IR: I think we never fully tested AOT with them. |
Ok, so for Leyden premain we really need to detect a card table base ConP in an x86 C2 graph and generate an external reloc for it -- if, say, we use a generational serial/parallel GC. Likewise we will need to detect and generate external relocs for any ConP node that references an AOTRuntimeConstants field. We don't have to do it using a matching rule. We could instead implement the relevant logic in the encoding for a generic load(ConP) rule or in a macro assembler method called from that encoding. I still personally prefer the idea of distinguishing the relevant cases by detecting a RawPtr vs an OopPtr as being the least intrusive. However, as you say that may still leave us with a card base address that might cause other problems (e.g. it might be zero or a small negative offset). So, let's deal with the backports as Alexei proposed and, when we come to it, implement the premain variants in the encoding or macro assembler rather than via match rules. |
Here:
It says "Do not use |
You need to read it as a FIXME ;-) The current workaround that we have prototyped is to place the base 'address' in a field in the global AOTRuntimeConstants instance and load it via that field. That's not very attractive because we need to do an indirect load from an lea'd constant address (movz/movk/movk/ldr) at every occurrence of a GC barrier. We also need to relocate every mov sequence when we load the code from the archive. What we would like longer term is to store the address in a method/stub's constants section and use a pc-relative load to access it. We would need to provide the constant entry with a relocation so we can reinit it to the VM's current base at AOT code load but we can use the same constant for every occurrence of the barrier. That also means we update less pages during reloc which means we should eventually be able to rely on mmaping AOT cocde cache pages with limited copy on write for reloced insns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read that we are in consensus on removing the bad rule. I ran Linux AArch64 server fastdebug make test TEST=all
on Graviton 3 and seen no new problems. Therefore, I am approving this to get the reviewer train going.
/reviewers 2
Aha!
Eww.
IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. Or is even that small work too much? We would need to provide the constant entry with a relocation so we can reinit it to the VM's current base at AOT code load but we can use the same constant for every occurrence of the barrier. That also means we update less pages during reloc which means we should eventually be able to rely on mmaping AOT cocde cache pages with limited copy on write for reloced insns. |
Well, it helps a bit even though it suffers from the faults you identify. The downside of using even a single MOVZ is that every on requries a reloc during AOT code reloading. So, the number of relocs in any code blob that we handle during loading is no longer 1. Instead it equals the number of post-barriers in the blob. The bigger hit will come when we try to optimize code loading by mmapping AOT code blobs into the code cache -- at present we copy-relocate it from an mmapped region. Instead of just one constant to patch at the start of the blob we will have many places to patch scattered throughout the code blob. So, many more copy-on-write pages rather than vanilla mapped pages. That drags back in copy overheads that the mmap is intended to avoid and also means less opportunities for co-hosted JVMs to share pages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's start the ball rolling.
Is anyone trying to load AOT blobs at a fixed address? If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course... |
That's not really a good idea for security. However, even if we did we would still need AOT code to refer to external symbols that may be at different locations in the Assembly (AOT compile) VM and production VM. In particular the BMB varies depending on where the card table and heap base are located, both determined by runtime allocations. n.b. it is important to note that BMB is not the base of the table. It is a pre-offset variant on that base address, computed by adding (card table base - heap base) >> log(card table granularity). As you are no doubt aware, a pre-offset address constant avoids the need for the barrier to do anything but shift a heap pointer down and add it to BMB to compute the associated card address. That explains why ConP(BMB) is problematic. It is not -- as often stated -- because the value is not an address (offseting an address by a constant still gives an address). The problem is rather what value that address might have. Depending on where the two regions are placed it could actually be, say, ConP(0) or some other ConP that we want to treat specially (especially a ConP in the oop range).
In principle that can already happen when the BMB is inserted into a C2 graph as a ConP -- although I have never seen it actually occurring. However, we have recently migrated barrier insertion to the back end for G1, Shenanadoah and ZGC, So, there is no opportunity for the compiler to perform constant elision in those three cases. The only GCs for which a ConP(BMB) is currently inserted into the graph are (generational) Serial/Parallel. |
We-ell, hmm. I'm not sure I agree with that. On AArch64, addresses are either 48, 52, 0r 56 bits in length. Anything outside that is, literally, not an address. I guess one could call such things "unmappable addresses", but then we'd be arguing about definitions, which is usually sterile and pointless.
That's fixable. We'd need the late barrier expansion to be passed the location of the card table base, hopefully in a register. Surely we have to do something sensible here. Adding a load latency to every oop store would be a Bad Thing. |
The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base.
C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc.
// The assembler store_check code will do an unsigned shift of the oop,
// then add it to _byte_map_base, i.e.
//
// _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift)
_byte_map = (CardValue*) rs.base();
_byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift);
In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand.
// Card Table Byte Map Base
operand immByteMapBase()
%{
// Get base of card map
predicate((jbyte*)n->get_ptr() ==
((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base);
match(ConP);
op_cost(0);
format %{ %}
interface(CONST_INTER);
%}
// Load Byte Map Base Constant
instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con)
%{
match(Set dst con);
ins_cost(INSN_COST);
format %{ "adr $dst, $con\t# Byte Map Base" %}
ins_encode(aarch64_enc_mov_byte_map_base(dst, con));
ins_pipe(ialu_imm);
%}
As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address:
0xffff25caf08c: ldaxr x8, [x11]
0xffff25caf090: cmp x10, x8
0xffff25caf094: b.ne 0xffff25caf0a0 // b.any
0xffff25caf098: stlxr w8, x28, [x11]
0xffff25caf09c: cbnz w8, 0xffff25caf08c
0xffff25caf0a0: orr x11, xzr, #0x3
0xffff25caf0a4: str x11, [x13]
0xffff25caf0a8: b.eq 0xffff25caef80 // b.none
0xffff25caf0ac: str x14, [sp]
0xffff25caf0b0: add x2, sp, #0x20
0xffff25caf0b4: adrp x1, 0xffff21730000
0xffff25caf0b8: bl 0xffff256fffc0
0xffff25caf0bc: ldr x14, [sp]
0xffff25caf0c0: b 0xffff25caef80
0xffff25caf0c4: add x13, sp, #0x20
0xffff25caf0c8: adrp x12, 0xffff21730000
0xffff25caf0cc: ldr x10, [x13]
0xffff25caf0d0: cmp x10, xzr
0xffff25caf0d4: b.eq 0xffff25caf130 // b.none
0xffff25caf0d8: ldr x11, [x12]
0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f8
0xffff25caf0e0: ldxr x11, [x12]
0xffff25caf0e4: cmp x13, x11
0xffff25caf0e8: b.ne 0xffff25caf130 // b.any
0xffff25caf0ec: stlxr w11, x10, [x12]
0xffff25caf0f0: cbz w11, 0xffff25caf130
0xffff25caf0f4: b 0xffff25caf0e0
Details see https://mail.openjdk.org/pipermail/aarch64-port-dev/2025-July/016021.html.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26249/head:pull/26249
$ git checkout pull/26249
Update a local copy of the PR:
$ git checkout pull/26249
$ git pull https://git.openjdk.org/jdk.git pull/26249/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 26249
View PR using the GUI difftool:
$ git pr show -t 26249
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26249.diff
Using Webrev
Link to Webrev Comment