8360048: NMT crash in gtest/NMTGtests.java: fatal error: NMT corruption: Block at 0x0000017748307120: header canary broken #26284

Closed
wants to merge 5 commits into from

Conversation

dholmes-ora
Member

@dholmes-ora dholmes-ora commented Jul 14, 2025

This is a clone of #25950 that we need to get integrated ASAP.


The canary header test failed because remove and free() operations ran concurrently on the tree. The remove operations are now synchronized with the corresponding NMT lock. MemoryReserver::reserve() uses the same lock internally and is therefore kept outside the locked code block.
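
The locking arrangement described above can be sketched in a few lines. This is a minimal illustration only, assuming a plain std::mutex in place of the NMT lock and a std::map in place of the shared tree; toy_tracker_lock, toy_regions, toy_reserve, toy_remove_all and toy_reserve_then_reset are hypothetical names and none of them are HotSpot code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>

// Hypothetical stand-ins for illustration only; none of these names exist in HotSpot.
static std::mutex toy_tracker_lock;                     // plays the role of the NMT lock
static std::map<std::size_t, std::size_t> toy_regions;  // plays the role of the shared tree

// Takes the lock internally, so callers must NOT already hold it:
// std::mutex is not recursive, and relocking it from the same thread is undefined behavior.
void toy_reserve(std::size_t base, std::size_t size) {
  assert(size > 0);
  std::lock_guard<std::mutex> guard(toy_tracker_lock);
  toy_regions[base] = size;
}

// Clearing the shared structure frees its nodes, so it runs with the lock held.
std::size_t toy_remove_all() {
  std::lock_guard<std::mutex> guard(toy_tracker_lock);
  const std::size_t removed = toy_regions.size();
  toy_regions.clear();
  return removed;
}

// The self-locking call stays outside any locked block; the removal is locked.
std::size_t toy_reserve_then_reset() {
  toy_reserve(0x1000, 0x100);
  toy_reserve(0x2000, 0x100);
  return toy_remove_all();
}
```

In this sketch the only design point is the scope of the guard: each mutation of the shared map happens with the lock held, and no function that already holds the lock calls another function that takes it.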


I'm re-testing with tiers 1-4


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8360048: NMT crash in gtest/NMTGtests.java: fatal error: NMT corruption: Block at 0x0000017748307120: header canary broken (Bug - P2)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26284/head:pull/26284
$ git checkout pull/26284

Update a local copy of the PR:
$ git checkout pull/26284
$ git pull https://git.openjdk.org/jdk.git pull/26284/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26284

View PR using the GUI difftool:
$ git pr show -t 26284

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26284.diff

Using Webrev

Link to Webrev Comment

@dholmes-ora
Member Author

/contributor add afshin-zafari

@bridgekeeper

bridgekeeper bot commented Jul 14, 2025

👋 Welcome back dholmes! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 14, 2025

@dholmes-ora This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8360048: NMT crash in gtest/NMTGtests.java: fatal error: NMT corruption: Block at 0x0000017748307120: header canary broken

Co-authored-by: Afshin Zafari <[email protected]>
Reviewed-by: gziemski

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 12 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the rfr label (Pull request is ready for review) Jul 14, 2025
@openjdk

openjdk bot commented Jul 14, 2025

@dholmes-ora afshin-zafari was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <[email protected]>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

@openjdk

openjdk bot commented Jul 14, 2025

@dholmes-ora The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@mlbridge

mlbridge bot commented Jul 14, 2025

Webrevs

@dholmes-ora
Member Author

/contributor add azafari

@openjdk

openjdk bot commented Jul 14, 2025

@dholmes-ora
Contributor Afshin Zafari <[email protected]> successfully added.

Member

@shipilev shipilev left a comment

Are you accepting nits here? I see the other PR was already reviewed.

Really hard to understand where the fix is. Bug synopsis also does not help :) I assume the problem is really in the test?

@dholmes-ora
Member Author

Are you accepting nits here? I see the other PR was already reviewed.

The intent was to just create a proxy for the other PR so we could hit the integrate button. But that hasn't happened yet so nits accepted.

Really hard to understand where the fix is. Bug synopsis also does not help :) I assume the problem is really in the test?

I have similar thoughts and am asking for some clarification from Gerard (as original reviewer).

@gerard-ziemski

gerard-ziemski commented Jul 14, 2025

Really hard to understand where the fix is. Bug synopsis also does not help :) I assume the problem is really in the test?

Agreed, it's confusing. Afshin said:

The canary header test failed because remove and free() operations ran concurrently on the tree. The remove operations are now synchronized with the corresponding NMT lock.

but frankly I don't see any locks involved in this code path:

This is where we detect the issue:

template<typename InTypeParam, typename OutTypeParam>
inline OutTypeParam MallocHeader::resolve_checked_impl(InTypeParam memblock) {
  char msg[256];
  address corruption = nullptr;
  if (!is_valid_malloced_pointer(memblock, msg, sizeof(msg))) {
    fatal("Not a valid malloc pointer: " PTR_FORMAT ": %s", p2i(memblock), msg);
  }
  OutTypeParam header_pointer = (OutTypeParam)memblock - 1;
  if (!header_pointer->check_block_integrity(msg, sizeof(msg), &corruption)) {
    header_pointer->print_block_on_error(tty, corruption != nullptr ? corruption : (address)header_pointer);
    fatal("NMT corruption: Block at " PTR_FORMAT ": %s", p2i(memblock), msg);
  }
  return header_pointer;
}

called by:

inline MallocHeader* MallocHeader::resolve_checked(void* memblock) {
  return MallocHeader::resolve_checked_impl<void*, MallocHeader*>(memblock);
}

called by:

void* MallocTracker::record_free_block(void* memblock) {
...
  MallocHeader* header = MallocHeader::resolve_checked(memblock);
...
}

called by:

static inline void* record_free(void* memblock) {
...
    return MallocTracker::record_free_block(memblock);
}

called by:

void  os::free(void *memblock) {
...
  void* const old_outer_ptr = MemTracker::record_free(memblock);
...
}

called by:

  void Treap::remove_all() {
...
      _allocator.free(head);
...
  }

called by:

  static void test_add_committed_region_adjacent_overlapping() {
    RegionsTree* rtree = VirtualMemoryTracker::Instance::tree();
    rtree->tree().remove_all();

    size_t size  = 0x01000000;
    ReservedSpace rs = MemoryReserver::reserve(size, mtTest);
    MemTracker::NmtVirtualMemoryLocker nvml;
...

As you can see, in the old code we call remove_all() before we take the lock (MemTracker::NmtVirtualMemoryLocker).

I think the simplest temporary fix here would be to do:

  static void test_add_committed_region_adjacent_overlapping() {
    MemTracker::NmtVirtualMemoryLocker nvml;

    RegionsTree* rtree = VirtualMemoryTracker::Instance::tree();
    rtree->tree().remove_all();

    size_t size  = 0x01000000;
    ReservedSpace rs = MemoryReserver::reserve(size, mtTest);

In the new code we don't call remove_all().

Afshin's original fix incorporated feedback that was not directly applicable to this fix, and now I wish we had gone with a simple fix and left the other enhancements for later. Live and learn...
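
As background for the check in resolve_checked_impl quoted above, here is a toy illustration of how a header canary can catch a damaged or already-freed block. It is a sketch under stated assumptions, not the HotSpot MallocHeader layout: ToyHeader, the two marker values, and the toy_malloc, toy_header_ok and toy_free helpers are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical marker values chosen for this sketch only.
constexpr std::uint16_t kToyCanaryLive = 0xA11E;
constexpr std::uint16_t kToyCanaryDead = 0xDEAD;

// Hypothetical header placed directly in front of each payload.
struct alignas(std::max_align_t) ToyHeader {
  std::size_t size;
  std::uint16_t canary;
};

// Allocates header + payload and returns a pointer to the payload.
void* toy_malloc(std::size_t size) {
  void* raw = std::malloc(sizeof(ToyHeader) + size);
  if (raw == nullptr) return nullptr;
  ToyHeader* header = new (raw) ToyHeader{size, kToyCanaryLive};
  return header + 1;
}

// Steps back from the payload to its header and checks the marker.
bool toy_header_ok(void* payload) {
  const ToyHeader* header = static_cast<const ToyHeader*>(payload) - 1;
  return header->canary == kToyCanaryLive;
}

// Verifies the marker, flips it to the dead value, then releases the block.
// A second free of the same block would no longer find the live marker.
void toy_free(void* payload) {
  ToyHeader* header = static_cast<ToyHeader*>(payload) - 1;
  assert(header->canary == kToyCanaryLive && "header canary broken");
  header->canary = kToyCanaryDead;
  std::free(header);
}
```

In this toy model, anything that overwrites the header or frees the block twice leaves a marker other than the live value, which is the kind of condition an integrity check on free can report.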


@gerard-ziemski gerard-ziemski left a comment

Wishing that we did not include all the tangential work in the original fix...

@openjdk openjdk bot added the ready label (Pull request is ready to be integrated) Jul 14, 2025
Member

@shipilev shipilev left a comment

So replacing the uses of the (singleton, shared) VirtualMemoryTracker::Instance with a local VirtualMemoryTracker vmt(true); dodges the locking issue, right?

What confuses me in this patch is why we are doing replacements like:

-  site->commit_memory(rgn->committed_size());
+  site->commit_memory(VirtualMemoryTracker::Instance::committed_size(rgn));

Doesn't that introduce dependencies on that singleton instance? Can you confirm that is sane?
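
The difference between a shared singleton and a test-local instance can be sketched as follows. This is an illustration only, assuming a toy tracker class; ToyTracker, shared() and run_isolated_test are hypothetical names and do not reflect the VirtualMemoryTracker API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical tracker used only for this sketch.
class ToyTracker {
 public:
  void add_region(std::size_t base, std::size_t size) { _regions[base] = size; }
  void remove_all() { _regions.clear(); }
  std::size_t region_count() const { return _regions.size(); }

  // Process-wide instance that any thread could reach, so access to it
  // would need external locking in a multi-threaded program.
  static ToyTracker& shared() {
    static ToyTracker instance;
    return instance;
  }

 private:
  std::map<std::size_t, std::size_t> _regions;
};

// A test that builds its own tracker never touches the shared instance,
// so clearing it cannot interfere with state owned by anyone else.
std::size_t run_isolated_test() {
  ToyTracker local;
  local.add_region(0x1000, 0x100);
  local.add_region(0x2000, 0x100);
  local.remove_all();
  local.add_region(0x3000, 0x100);
  return local.region_count();
}
```

The isolation only holds as long as every call in the test goes through the local object; a helper that reaches for shared() internally would bring the shared state back in, which is the concern raised about the static committed_size call.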

@coleenp
Contributor

coleenp commented Jul 15, 2025

@gerard-ziemski Can you prepare just the test fix for us to review? This should be a further RFE.

@gerard-ziemski

gerard-ziemski commented Jul 15, 2025

So replacing the uses of the (singleton, shared) VirtualMemoryTracker::Instance with a local VirtualMemoryTracker vmt(true); dodges the locking issue, right?

Exactly.

What confuses me in this patch is why we are doing replacements like:

-  site->commit_memory(rgn->committed_size());
+  site->commit_memory(VirtualMemoryTracker::Instance::committed_size(rgn));

Doesn't that introduce dependencies on that singleton instance? Can you confirm that is sane?

I will take a look, but I think the plan now is to come up with a simple fix (since this needs to be backported to jdk25), so it looks like we are not going to push this one. (Coleen said you fixed your issue already, so there is no urgency anymore? Please let me know if that is not the case.)

@gerard-ziemski

@gerard-ziemski Can you prepare just the test fix for us to review? This should be a further RFE.

Happily, on it...

@shipilev
Member

(Coleen said you fixed your issue already, so there is no urgency anymore? Please let me know if that is not the case.)

Yeah, mine is JDK-8361752, which is orthogonal to this. I am still interested in not having gtest failures in our testing :)

@dholmes-ora
Member Author

Closing this PR and returning control to Afshin and Gerard. The main issue we thought this would address was JDK-8361752 but that is a separate issue.

@dholmes-ora dholmes-ora deleted the 8360048-nmt branch July 17, 2025 02:02
Labels
hotspot-runtime
ready: Pull request is ready to be integrated
rfr: Pull request is ready for review
5 participants