Skip to content

8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding #26249

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

yadongw
Copy link
Contributor

@yadongw yadongw commented Jul 10, 2025

The bug is that the predicate rule of immByteMapBase would cause a ConP Node for oop incorrect matching with byte_map_base when the placeholder jni handle address was just allocated to the address of byte_map_base.

C2 uses JNI handles as placeholders to encoding constant oops, and one of some handle maybe locate at the address of byte_map_base, which is not memory reserved by CardTable. It's possible because JNIHandleBlocks are allocated by malloc.

// The assembler store_check code will do an unsigned shift of the oop,
// then add it to _byte_map_base, i.e.
//
// _byte_map = _byte_map_base + (uintptr_t(low_bound) >> card_shift)
_byte_map = (CardValue*) rs.base();
_byte_map_base = _byte_map - (uintptr_t(low_bound) >> _card_shift);

In aarch64 port, C2 will incorrectly match ConP for oop to ConP for byte_map_base by the immByteMapBase operand.

// Card Table Byte Map Base
operand immByteMapBase()
%{
// Get base of card map
predicate((jbyte*)n->get_ptr() ==
((CardTableModRefBS*)(Universe::heap()->barrier_set()))->byte_map_base);
match(ConP);

op_cost(0);
format %{ %}
interface(CONST_INTER);
%}

// Load Byte Map Base Constant
instruct loadByteMapBase(iRegPNoSp dst, immByteMapBase con)
%{
match(Set dst con);

ins_cost(INSN_COST);
format %{ "adr $dst, $con\t# Byte Map Base" %}

ins_encode(aarch64_enc_mov_byte_map_base(dst, con));

ins_pipe(ialu_imm);
%}

As below, a typical incorrect instructions generated by C2 for java.lang.ref.Finalizer.register(Ljava/lang/Object;)V (10 bytes) @ 0x0000ffff25caf0bc [0x0000ffff25caee80+0x23c], where 0xffff21730000 is the byte_map_base address mistakenly used as an object address:
0xffff25caf08c: ldaxr x8, [x11]
0xffff25caf090: cmp x10, x8
0xffff25caf094: b.ne 0xffff25caf0a0 // b.any
0xffff25caf098: stlxr w8, x28, [x11]
0xffff25caf09c: cbnz w8, 0xffff25caf08c
0xffff25caf0a0: orr x11, xzr, #0x3
0xffff25caf0a4: str x11, [x13]
0xffff25caf0a8: b.eq 0xffff25caef80 // b.none
0xffff25caf0ac: str x14, [sp]
0xffff25caf0b0: add x2, sp, #0x20
0xffff25caf0b4: adrp x1, 0xffff21730000
0xffff25caf0b8: bl 0xffff256fffc0
0xffff25caf0bc: ldr x14, [sp]
0xffff25caf0c0: b 0xffff25caef80
0xffff25caf0c4: add x13, sp, #0x20
0xffff25caf0c8: adrp x12, 0xffff21730000
0xffff25caf0cc: ldr x10, [x13]
0xffff25caf0d0: cmp x10, xzr
0xffff25caf0d4: b.eq 0xffff25caf130 // b.none
0xffff25caf0d8: ldr x11, [x12]
0xffff25caf0dc: tbnz w10, #1, 0xffff25caf0f8
0xffff25caf0e0: ldxr x11, [x12]
0xffff25caf0e4: cmp x13, x11
0xffff25caf0e8: b.ne 0xffff25caf130 // b.any
0xffff25caf0ec: stlxr w11, x10, [x12]
0xffff25caf0f0: cbz w11, 0xffff25caf130
0xffff25caf0f4: b 0xffff25caf0e0

Details see https://mail.openjdk.org/pipermail/aarch64-port-dev/2025-July/016021.html.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26249/head:pull/26249
$ git checkout pull/26249

Update a local copy of the PR:
$ git checkout pull/26249
$ git pull https://git.openjdk.org/jdk.git pull/26249/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26249

View PR using the GUI difftool:
$ git pr show -t 26249

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26249.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 10, 2025

👋 Welcome back yadongwang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@yadongw
Copy link
Contributor Author

yadongw commented Jul 10, 2025

@theRealAph @adinn Could you help review it?

@openjdk
Copy link

openjdk bot commented Jul 10, 2025

@yadongw This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8361892: AArch64: Incorrect matching rule leading to improper oop instruction encoding

Reviewed-by: shade, adinn

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 43 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 10, 2025
@openjdk
Copy link

openjdk bot commented Jul 10, 2025

@yadongw The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge
Copy link

mlbridge bot commented Jul 10, 2025

Webrevs

@theRealAph
Copy link
Contributor

@theRealAph @adinn Could you help review it?

Can't you just delete the special handling of byte_map_base?

@yadongw
Copy link
Contributor Author

yadongw commented Jul 11, 2025

@theRealAph @adinn Could you help review it?

Can't you just delete the special handling of byte_map_base?
It's fine to just delete immByteMapBase, and then ConP for byte_map_base will match immP in aarch64_enc_mov_p, and using adrp+add if valid address and using mov if unvalid.

    enc_class aarch64_enc_mov_p(iRegP dst, immP src) %{
    Register dst_reg = as_Register($dst$$reg);
    address con = (address)$src$$constant;
    if (con == nullptr || con == (address)1) {
      ShouldNotReachHere();
    } else {
      relocInfo::relocType rtype = $src->constant_reloc();
      if (rtype == relocInfo::oop_type) {
        __ movoop(dst_reg, (jobject)con);
      } else if (rtype == relocInfo::metadata_type) {
        __ mov_metadata(dst_reg, (Metadata*)con);
      } else {
        assert(rtype == relocInfo::none, "unexpected reloc type");
        if (! __ is_valid_AArch64_address(con) ||
            con < (address)(uintptr_t)os::vm_page_size()) {
          __ mov(dst_reg, con);
        } else {
          uint64_t offset;
          __ adrp(dst_reg, con, offset);
          __ add(dst_reg, dst_reg, offset);
        }
      }
    }
  %}

I can choose to delete it directly, but do you think the current changes in pr are safe?

@dean-long
Copy link
Member

It looks like riscv.ad has the same problem. I think Andrew's suggestion is safe. We will emit the correct code based on relocType.

@yadongw
Copy link
Contributor Author

yadongw commented Jul 11, 2025

It looks like riscv.ad has the same problem. I think Andrew's suggestion is safe. We will emit the correct code based on relocType.

Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport?

@dean-long
Copy link
Member

Let me see if I can unpack your question. Yes jdk8u still uses immPollPage, which is similar to immByteMapBase, but I don't think it has the same problem, because I don't see how an oop could have the same address as the polling page. If we fix riscv.ad now, there won't be a clean backport to jdk8u because the riscv port doesn't exist. For aarch64, either of the two proposed fixes could be backported to jdk8u, correct?

@yadongw
Copy link
Contributor Author

yadongw commented Jul 11, 2025

Let me see if I can unpack your question. Yes jdk8u still uses immPollPage, which is similar to immByteMapBase, but I don't think it has the same problem, because I don't see how an oop could have the same address as the polling page. If we fix riscv.ad now, there won't be a clean backport to jdk8u because the riscv port doesn't exist. For aarch64, either of the two proposed fixes could be backported to jdk8u, correct?

Yes, but byte_map_base can be same address as the polling page in jdk8u, in both 2 proposed fixes.

@shqking
Copy link
Contributor

shqking commented Jul 11, 2025

Hi. Not a code review. Our internal CI reported a jtreg failure test/hotspot/jtreg/sources/TestNoNULL.java with this PR on Nvidia Grace machine.

  • How to reproduce the failure: build one fastdebug JDK and run make test TEST="test/hotspot/jtreg/sources/TestNoNULL.java"
  • The snippet of the error log
STDERR:
Error: 'NULL' found in /tmp/src/hotspot/cpu/aarch64/aarch64.ad at line 4563:
            n->get_ptr_type()->isa_rawptr() != NULL &&
java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above.
        at TestNoNULL.main(TestNoNULL.java:84)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:565)
        at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335)
        at java.base/java.lang.Thread.run(Thread.java:1474)

JavaTest Message: Test threw exception: java.lang.RuntimeException
JavaTest Message: shutting down test


TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above.

It seems that the newly added assertion by this PR is hit in this jtreg case.
Could you help take a look at this issue? Thanks.

@yadongw
Copy link
Contributor Author

yadongw commented Jul 11, 2025

Hi. Not a code review. Our internal CI reported a jtreg failure test/hotspot/jtreg/sources/TestNoNULL.java with this PR on Nvidia Grace machine.

  • How to reproduce the failure: build one fastdebug JDK and run make test TEST="test/hotspot/jtreg/sources/TestNoNULL.java"
  • The snippet of the error log
STDERR:
Error: 'NULL' found in /tmp/src/hotspot/cpu/aarch64/aarch64.ad at line 4563:
            n->get_ptr_type()->isa_rawptr() != NULL &&
java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above.
        at TestNoNULL.main(TestNoNULL.java:84)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:565)
        at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:335)
        at java.base/java.lang.Thread.run(Thread.java:1474)

JavaTest Message: Test threw exception: java.lang.RuntimeException
JavaTest Message: shutting down test


TEST RESULT: Failed. Execution failed: `main' threw exception: java.lang.RuntimeException: Test found 1 usages of 'NULL' in source files. See errors above.

It seems that the newly added assertion by this PR is hit in this jtreg case. Could you help take a look at this issue? Thanks.

Sure, I will follow Andrew's suggestion, removing immByteMapBase.

@shipilev
Copy link
Member

shipilev commented Jul 11, 2025

It is such a beautiful bug to read about on Friday.

So the net effect of this mismatch is that we miss oop relocation/record when ConP accidentally mismatches to card table base, did I get that right?

Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport?

I think we should do these things separately:

  1. immByteMapBase rule removal in AArch64, this PR, then backport it to 25, 21, 17, maybe to 11, 8
  2. immByteMapBase rule removal in RISC-V, separate PR, then backport it to 25, 21
  3. immPollPage rule removal in AArch64, in 11u and 8u specific PRs, IF we think that is a problem, which I don't think it is.

The backports for (1) would not be clean, as Generational Shenandoah barrier checks would likely trigger technical conflicts in the code that is being removed. So there is doubly no point in going for clean backports, we should really slice them by the rule we are removing.

@yadongw
Copy link
Contributor Author

yadongw commented Jul 12, 2025

It is such a beautiful bug to read about on Friday.

So the net effect of this mismatch is that we miss oop relocation/record when ConP accidentally mismatches to card table base, did I get that right?

Yes, it maybe a better solution for jdk main line, because immPollPage was remove in https://bugs.openjdk.org/browse/JDK-8220051. But how about jdk8u backport?

I think we should do these things separately:

  1. immByteMapBase rule removal in AArch64, this PR, then backport it to 25, 21, 17, maybe to 11, 8
  2. immByteMapBase rule removal in RISC-V, separate PR, then backport it to 25, 21
  3. immPollPage rule removal in AArch64, in 11u and 8u specific PRs, IF we think that is a problem, which I don't think it is.

The backports for (1) would not be clean, as Generational Shenandoah barrier checks would likely trigger technical conflicts in the code that is being removed. So there is doubly no point in going for clean backports, we should really slice them by the rule we are removing.

Agree, very clear strategy. l'll just remove the bytemapbase rule in this pr and do enough tests.

@adinn
Copy link
Contributor

adinn commented Jul 14, 2025

The proposed solution is not simply going to work when the Leyden project introduces code save/restore. In the assembly phase for an AOT cache (i.e when compiling code to store in the cache) we need to recognize that an incoming address is the byte map base address and generate an lea with an external address relocation. So, the current code in premain relies on matching against immByteMapBase.

I believe we can retain the current approach if we make the immByteMapBase predicate test the operand type for a RawPtr rather than an OopPtr. That is actually the key distinction that separates the cases we are dealing with. I believe it would work for both immByteMapBase and immPollPage and is a much smaller change to the status quo.

@shipilev
Copy link
Member

So, the current code in premain relies on matching against immByteMapBase.

Wait, but only AArch64/RISC-V have immByteMapBase, so I presume other platforms deal with the need to record relocation for card table base through some other means? With special immByteMapBase removed, would the same way work for AArch64?

@theRealAph
Copy link
Contributor

I believe it would work for both immByteMapBase and immPollPage and is a much smaller change to the status quo.

Poll page does this. No steenkin' reloc necessary...

// Move the address of the polling page into dest.
void MacroAssembler::get_polling_page(Register dest, relocInfo::relocType rtype) {
  ldr(dest, Address(rthread, JavaThread::polling_page_offset()));
}

@shipilev
Copy link
Member

To be clear, I would very much prefer to remove the special-case handling for card table base in AArch64/RISC-V AD, rather than piling on more special cases into that rule.

@adinn
Copy link
Contributor

adinn commented Jul 14, 2025

Wait, but only AArch64/RISC-V have immByteMapBase, so I presume other platforms deal with the need to record relocation for card table base through some other means? With special immByteMapBase removed, would the same way work for AArch64?

I'm not sure how this works for the byte_map_base on x86. I just looked through the x86 premain code and I cannot find any special case handling for it.

However, on both x86 and aarch64 we also use immediate ConP operand rules to inject relocs for constant addresses that refer to entries in the (global) AOT Runtime Constants area. So, this is another case similar to byte_map_base where we AOT compilation needs to recognize a special address and handle it appropriately. Here are the x86 rules for AOT Constant addressses

// AOT Runtime Constants Address
operand immAOTRuntimeConstantsAddress()
%{
  // Check if the address is in the range of AOT Runtime Constants
  predicate(AOTRuntimeConstants::contains((address)(n->get_ptr())));
  match(ConP);

  op_cost(0);
  format %{ %}
  interface(CONST_INTER);
%}
. . .
instruct loadAOTRCAddress(rRegP dst, immAOTRuntimeConstantsAddress con)
%{
  match(Set dst con);

  format %{ "leaq  $dst, $con\t# AOT Runtime Constants Address" %}

  ins_encode %{
    __ load_aotrc_address($dst$$Register, (address)$con$$constant);
  %}

  ins_pipe(ialu_reg_fat);
%}

The referenced macro assembler method is defined as follows

void MacroAssembler::load_aotrc_address(Register reg, address a) {
#if INCLUDE_CDS
  assert(AOTRuntimeConstants::contains(a), "address out of range for data area");
  if (AOTCodeCache::is_on_for_dump()) {
    // all aotrc field addresses should be registered in the AOTCodeCache address table
    lea(reg, ExternalAddress(a));
  } else {
    mov64(reg, (uint64_t)a);
  }
#else
  ShouldNotReachHere();
#endif
}

I'm not sure whether the latest premain code is just in flux and does not yet deal with byte_map_base on x86 or whether there is some other mechanism. Perhaps @ashu-mehra or @vnkozlov can clarify.

@adinn
Copy link
Contributor

adinn commented Jul 14, 2025

To be clear: it is not just the card table base that needs special case handling in Leyden.

@adinn
Copy link
Contributor

adinn commented Jul 14, 2025

@shipilev One other thing. I believe that the only code that creates a RawPtr ConP for the card table base is in the shared C2 card table barrier set assembler. Now that we have moved to late barrier insertion for G1 (along with Shenandoah and Z) is there still any code -- in the dev tree or in jdk25 -- that will create one of these nodes? If not then perhaps the aarch64 immByteMapBase and loadByteMapBase rules are no longer needed and should be deleted?

@vnkozlov
Copy link
Contributor

On x86 byte_map_base is handled in GC code:
https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp#L314
https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L67

Using relocation for byte_map_base is not safe (see comment in g1BarrierSetAssembler_x86.cpp).
We are "safe" because we bailout AOT code caching if byte_map_base is not relocatable:
https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/aotCodeCache.cpp#L338

I have long standing work to use AOTRuntimeConstants table instead for it.

@adinn
Copy link
Contributor

adinn commented Jul 14, 2025

On x86 byte_map_base is handled in GC code:

Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation?

If that is the case then I don't think we need the immAOTRuntimeConstantsAddress and loadAOTRCAddress rules in x86_64.ad either. Likewise we can drop the equivalents from aarch64.ad.

@shipilev
Copy link
Member

One other thing. I believe that the only code that creates a RawPtr ConP for the card table base is in the shared C2 card table barrier set assembler. Now that we have moved to late barrier insertion for G1 (along with Shenandoah and Z)

Shenandoah still does not do late barrier expansion. But it also does not emit card table bases as constants, it loads card table bases from TLS. G1 would do this with throughput GC barriers soon too, AFAICS. I think Shenandoah also encodes some "unallocated" ConP for the similar biased-base trick, e.g. for collection set bitmap, but none of those ever pretend to be oops.

@vnkozlov
Copy link
Contributor

Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation?

Correct.

@vnkozlov
Copy link
Contributor

Ok, but those two methods cover cases where mov/lea instructions are directly generated into the code stream. Why is there no need for a C2 rule for immByteMapBaee and loadByteMapBase in x86_64.ad. Is it because nothing inserts the card table base into a C2 graph as a ConP node now that we have late barrier generation?

Correct.

Correction. It is true for G1, Z, Shenandoah. For others we still have constant in C2 IR:
https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/gc/shared/c2/cardTableBarrierSetC2.cpp#L38

I think we never fully tested AOT with them.

@adinn
Copy link
Contributor

adinn commented Jul 14, 2025

Correction. It is true for G1, Z, Shenandoah. For others we still have constant in C2 IR

Ok, so for Leyden premain we really need to detect a card table base ConP in an x86 C2 graph and generate an external reloc for it -- if, say, we use a generational serial/parallel GC. Likewise we will need to detect and generate external relocs for any ConP node that references an AOTRuntimeConstants field.

We don't have to do it using a matching rule. We could instead implement the relevant logic in the encoding for a generic load(ConP) rule or in a macro assembler method called from that encoding.

I still personally prefer the idea of distinguishing the relevant cases by detecting a RawPtr vs an OopPtr as being the least intrusive. However, as you say that may still leave us with a card base address that might cause other problems (e.g. it might be zero or a small negative offset). So, let's deal with the backports as Alexei proposed and, when we come to it, implement the premain variants in the encoding or macro assembler rather than via match rules.

@theRealAph
Copy link
Contributor

On x86 byte_map_base is handled in GC code: https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp#L314 https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L67

Using relocation for byte_map_base is not safe (see comment in g1BarrierSetAssembler_x86.cpp). We are "safe" because we bailout AOT code caching if byte_map_base is not relocatable: https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/code/aotCodeCache.cpp#L338

Here:

 // Do not use ExternalAddress to load 'byte_map_base', since 'byte_map_base' is NOT
  // a valid address and therefore is not properly handled by the relocation code.
  if (AOTCodeCache::is_on_for_dump()) {
    // AOT code needs relocation info for this address
    __ lea(tmp2, ExternalAddress((address)ct->card_table()->byte_map_base()));   // tmp2 := card table base address
  } else {

It says "Do not use ExternalAddress to load 'byte_map_base'" but then uses ExternalAddress. I'm baffled...

@adinn
Copy link
Contributor

adinn commented Jul 15, 2025

It says "Do not use ExternalAddress to load 'byte_map_base'" but then uses ExternalAddress. I'm baffled...

You need to read it as a FIXME ;-)

The current workaround that we have prototyped is to place the base 'address' in a field in the global AOTRuntimeConstants instance and load it via that field. That's not very attractive because we need to do an indirect load from an lea'd constant address (movz/movk/movk/ldr) at every occurrence of a GC barrier. We also need to relocate every mov sequence when we load the code from the archive.

What we would like longer term is to store the address in a method/stub's constants section and use a pc-relative load to access it. We would need to provide the constant entry with a relocation so we can reinit it to the VM's current base at AOT code load but we can use the same constant for every occurrence of the barrier. That also means we update less pages during reloc which means we should eventually be able to rely on mmaping AOT cocde cache pages with limited copy on write for reloced insns.

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read that we are in consensus on removing the bad rule. I ran Linux AArch64 server fastdebug make test TEST=all on Graviton 3 and seen no new problems. Therefore, I am approving this to get the reviewer train going.

/reviewers 2

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 15, 2025
@openjdk
Copy link

openjdk bot commented Jul 15, 2025

@shipilev
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Jul 15, 2025
@theRealAph
Copy link
Contributor

It says "Do not use ExternalAddress to load 'byte_map_base'" but then uses ExternalAddress. I'm baffled...

You need to read it as a FIXME ;-)

Aha!

The current workaround that we have prototyped is to place the base 'address' in a field in the global AOTRuntimeConstants instance and load it via that field. That's not very attractive because we need to do an indirect load from an lea'd constant address (movz/movk/movk/ldr) at every occurrence of a GC barrier. We also need to relocate every mov sequence when we load the code from the archive.

Eww.

What we would like longer term is to store the address in a method/stub's constants section and use a pc-relative load to access it.

IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. Or is even that small work too much?

We would need to provide the constant entry with a relocation so we can reinit it to the VM's current base at AOT code load but we can use the same constant for every occurrence of the barrier. That also means we update less pages during reloc which means we should eventually be able to rely on mmaping AOT cocde cache pages with limited copy on write for reloced insns.

@adinn
Copy link
Contributor

adinn commented Jul 15, 2025

IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. Or is even that small work too much?

Well, it helps a bit even though it suffers from the faults you identify.

The downside of using even a single MOVZ is that every on requries a reloc during AOT code reloading. So, the number of relocs in any code blob that we handle during loading is no longer 1. Instead it equals the number of post-barriers in the blob.

The bigger hit will come when we try to optimize code loading by mmapping AOT code blobs into the code cache -- at present we copy-relocate it from an mmapped region. Instead of just one constant to patch at the start of the blob we will have many places to patch scattered throughout the code blob. So, many more copy-on-write pages rather than vanilla mapped pages. That drags back in copy overheads that the mmap is intended to avoid and also means less opportunities for co-hosted JVMs to share pages.

Copy link
Contributor

@adinn adinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's start the ball rolling.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 15, 2025
@theRealAph
Copy link
Contributor

IMHO that doesn't help very much. You're still looking at ~4 cycles to load from L1 dcache, and you kill a dcache line for it, and you consume a full xword of code space for it. If we put the byte-map base on a 32-bit boundary we only need a single MOVZ, and that can be relocated easily enough when we load the code from the archive. Surely that's better. Or is even that small work too much?

Well, it helps a bit even though it suffers from the faults you identify.

The downside of using even a single MOVZ is that every on requries a reloc during AOT code reloading. So, the number of relocs in any code blob that we handle during loading is no longer 1. Instead it equals the number of post-barriers in the blob.

The bigger hit will come when we try to optimize code loading by mmapping AOT code blobs into the code cache -- at present we copy-relocate it from an mmapped region. Instead of just one constant to patch at the start of the blob we will have many places to patch scattered throughout the code blob. So, many more copy-on-write pages rather than vanilla mapped pages. That drags back in copy overheads that the mmap is intended to avoid and also means less opportunities for co-hosted JVMs to share pages.

Is anyone trying to load AOT blobs at a fixed address?

If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course...

@adinn
Copy link
Contributor

adinn commented Jul 16, 2025

Is anyone trying to load AOT blobs at a fixed address?

That's not really a good idea for security. However, even if we did we would still need AOT code to refer to external symbols that may be at different locations in the Assembly (AOT compile) VM and production VM. In particular the BMB varies depending on where the card table and heap base are located, both determined by runtime allocations.

n.b. it is important to note that BMB is not the base of the table. It is a pre-offset variant on that base address, computed by adding (card table base - heap base) >> log(card table granularity). As you are no doubt aware, a pre-offset address constant avoids the need for the barrier to do anything but shift a heap pointer down and add it to BMB to compute the associated card address.

That explains why ConP(BMB) is problematic. It is not -- as often stated -- because the value is not an address (offseting an address by a constant still gives an address). The problem is rather what value that address might have. Depending on where the two regions are placed it could actually be, say, ConP(0) or some other ConP that we want to treat specially (especially a ConP in the oop range).

If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course...

In principle that can already happen when the BMB is inserted into a C2 graph as a ConP -- although I have never seen it actually occurring. However, we have recently migrated barrier insertion to the back end for G1, Shenanadoah and ZGC, So, there is no opportunity for the compiler to perform constant elision in those three cases. The only GCs for which a ConP(BMB) is currently inserted into the graph are (generational) Serial/Parallel.

@theRealAph
Copy link
Contributor

That explains why ConP(BMB) is problematic. It is not -- as often stated -- because the value is not an address (offseting an address by a constant still gives an address). The problem is rather what value that address might have.

We-ell, hmm. I'm not sure I agree with that. On AArch64, addresses are either 48, 52, 0r 56 bits in length. Anything outside that is, literally, not an address. I guess one could call such things "unmappable addresses", but then we'd be arguing about definitions, which is usually sterile and pointless.

If C2 can be persuaded to treat the BMB as a value to be propagated like any other value then all of this conversation effectively becomes a don't care. Except for legacy architectures with insufficient registers, of course...

In principle that can already happen when the BMB is inserted into a C2 graph as a ConP -- although I have never seen it actually occurring. However, we have recently migrated barrier insertion to the back end for G1, Shenanadoah and ZGC, So, there is no opportunity for the compiler to perform constant elision in those three cases.

That's fixable. We'd need the late barrier expansion to be passed the location of the card table base, hopefully in a register.

Surely we have to do something sensible here. Adding a load latency to every oop store would be a Bad Thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler [email protected] ready Pull request is ready to be integrated rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

7 participants