More thread crash improvements #417

Open · wants to merge 18 commits into base `dev`
Conversation

mabruzzo (Collaborator)

This does a bunch of things (that would be a little annoying to subdivide into separate pull requests -- but I'm happy to do that). We divide up the description of this PR into a few parts.

I have highlighted points for where feedback might be particularly useful in "important quote-blocks"

Bug Fixes

  • fixes a significant bug introduced by Consolidate GPU Error Checking Function #350. The changes introduced by #350 to consolidate error-handling routines accidentally started suppressing all errors related to kernel launches. The way that kernel-launch errors are reported is very subtle (so it's easy to see how the error handling could have accidentally been broken). Without the changes in this PR, all failures of Cholla's kernels to launch are completely silent!
  • fixing the preceding bug revealed that when I factored out PostUpdate_Conserved_Correct_Crashed_3D (the function to average cells) in Improving thread crash logic #406, I had messed up the launch parameters. That is fixed by the current PR.1
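The launch-error subtlety mentioned above can be illustrated with a small mock of the CUDA error model (all names below are invented for this sketch; they are neither CUDA's nor Cholla's actual API). A kernel launch never returns its error code; the failure is only recorded internally, so an error checker that inspects only the values explicitly handed to it (e.g. return codes of API calls) never sees the launch failure unless the call site also queries the `cudaGetLastError()`-style state:

```cpp
#include <cassert>

// Minimal mock of the CUDA error model (illustrative names only).
enum class gpuError { success, launchFailure };

static gpuError last_error = gpuError::success;

// Mimics a <<<...>>> launch with bad parameters: the failure is
// recorded internally but NOT returned to the caller.
void launch_kernel(bool bad_launch_params) {
  if (bad_launch_params) last_error = gpuError::launchFailure;
}

// Mimics cudaGetLastError(): returns and clears the recorded error.
gpuError get_last_error() {
  gpuError e = last_error;
  last_error = gpuError::success;
  return e;
}

// A consolidated checker that only inspects the value passed to it
// cannot see a launch failure unless the caller fetches it first.
bool error_check(gpuError e) { return e == gpuError::success; }
```

With this mock, `launch_kernel(true); error_check(gpuError::success);` reports success even though the launch failed; only `error_check(get_last_error())` catches it. That is precisely how a refactor of the checking routine can silently drop launch errors.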

More Logging when applying "fixes"

We now start logging more information when we overwrite values. This mainly affects 2 locations:

  1. Temperature_Floor_Kernel now describes wherever we apply a temperature floor (this helped us identify the need for the bugfixes in the preceding section)
  2. Average_Cell_All_Fields now describes the values of the advected internal energy when applying updates (cases without the dual-energy formalism are appropriately handled). It also prints out any neighbors that get omitted when we compute the average.

These changes raise the question: "Are we making the log too noisy?" In general, I don't really think so, but let's talk through the 2 cases.

  1. I could imagine the case being a little noisy for Temperature_Floor_Kernel

Important

My inclination is to wait until it becomes a problem, but I could be convinced to comment-out these warnings in most cases.

  2. I feel strongly that this extra logging is important for Average_Cell_All_Fields. Prior to this PR, we already logged every time this function is called. All we are doing is reporting any neighbors that can't be used by this function since that should be rare. Furthermore, if a scenario ever comes up where somebody feels that this pollutes the log, it's probably indicative of a serious simulation issue!

Provide explicit behavior when PostUpdate_Conserved_Correct_Crashed_3D fails

The simulation now aborts if PostUpdate_Conserved_Correct_Crashed_3D tries to "fix" a crashed cell, but the crashed cell does not have at least 2 valid neighbors to use for averaging. The number 2 is arbitrary. I originally just wanted to implement defined behavior when there were 0 valid neighbors. But, I decided that having just 1 valid neighbor was also pretty bad.
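The policy just described can be sketched as follows (a simplified, hypothetical host-side illustration, not Cholla's actual kernel code; the validity test and function name are invented for the sketch):

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Overwrite a crashed cell's value with the average over its valid
// neighbors. If fewer than 2 neighbors are usable, abort rather than
// continuing with an essentially meaningless "fix".
double average_from_valid_neighbors(const std::vector<double>& neighbor_vals) {
  double sum = 0.0;
  int n_valid = 0;
  for (double v : neighbor_vals) {
    // a neighbor is "valid" in this sketch if it is finite and positive
    if (std::isfinite(v) && (v > 0.0)) {
      sum += v;
      ++n_valid;
    }
  }
  if (n_valid < 2) {
    std::fprintf(stderr, "crashed cell has only %d valid neighbor(s); aborting\n",
                 n_valid);
    std::abort();  // explicit, defined behavior instead of a silent bad average
  }
  return sum / n_valid;
}
```

Averaging over a single neighbor would just copy that neighbor's state, which is why requiring at least 2 (however arbitrary the exact number) seems like a reasonable floor.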

Reorder when we apply PostUpdate_Conserved_Correct_Crashed_3D

Finally, we also reordered PostUpdate_Conserved_Correct_Crashed_3D and "dual energy reconciliation."2

This fixed a problem that occurs with some frequency (and was exposed by all of the preceding fixes) in particle-based feedback simulations.

Essentially, a scenario comes up where the hydrodynamic fields are seemingly valid at the start of a cycle, but by the end of the hydrodynamic solve (before any source terms other than self-gravity) you end up with a weird situation: a crashed cell surrounded by neighboring cells that all have a positive total energy that is exceeded by the corresponding kinetic energy.

Because the advected internal energy in the neighboring cells always seems to be reasonable, we simply apply dual-energy reconciliation before PostUpdate_Conserved_Correct_Crashed_3D.

  • Care is taken to ensure that "dual-energy reconciliation" does not modify crashed cells. We use the existing definition of a crashed cell: any cell with a density or total energy that is NaN or negative.

Important

Should we also consider cells where density or total energy is 0 to be crashed cells?

  • "dual-energy reconciliation" also takes care to ignore crashed cells during the step where we need the max nearby total energy.

Important

We don't do anything special if a neighboring cell has a positive total energy that is exceeded by the corresponding cell's kinetic energy. An alternative might be to use the sum of the advected internal energy and the kinetic energy for that cell. (I'm not sure how much this actually matters.)
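The two bullets above can be sketched together (a hypothetical, simplified illustration; the struct and function names are invented and the real kernels operate on flattened field arrays, not per-cell structs):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// The existing crashed-cell definition: density or total energy is
// NaN or negative. (Whether 0 should also count is an open question.)
bool is_crashed(double density, double total_energy) {
  return std::isnan(density) || (density < 0.0) ||
         std::isnan(total_energy) || (total_energy < 0.0);
}

struct Cell {
  double density;
  double total_energy;
};

// Max total energy among neighbors, skipping crashed cells, as the
// reordered dual-energy reconciliation step now does. Returns 0.0 if
// every neighbor is crashed (a fallback invented for this sketch).
double max_valid_neighbor_energy(const std::vector<Cell>& neighbors) {
  double e_max = 0.0;
  for (const Cell& c : neighbors) {
    if (!is_crashed(c.density, c.total_energy) && c.total_energy > e_max) {
      e_max = c.total_energy;
    }
  }
  return e_max;
}
```

With this predicate, a neighbor with negative density but huge total energy is correctly excluded from the max-energy scan instead of contaminating the reconciliation decision.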

For posterity, here are more details about the particular problem:

  • I confirmed that self-gravity source terms aren't to be blamed for this issue
  • I also confirmed that the issue occurs when we just use "resolved feedback" as well as when we use both resolved and unresolved feedback
  • I did a deep dive into a scenario produced by a sim with just "resolved feedback"
    • this scenario emerged within the core of a thermal energy dump (a "resolved" SN) that was deposited off-center in an existing SN remnant. The star cluster that produced this newest injection (and almost certainly produced the existing remnant) last underwent a SN ~70 kyr earlier.
    • I traced out the evolution of the dump over about 15 cycles (I think we started tracing the evolution ~3 kyr after the SN and then we followed it for another ~3 kyr) until the problematic scenario manifested
    • there was no other feedback over that time and the min cooling time in any of those cells was no less than many 10s of kyr. Thus, the issue was primarily hydrodynamic
    • The scenario occurred at the core of the newer supernova remnant. Over the 15 observed cycles, the temperature in the center of the newer remnant dropped from >~1e8 K to somewhere between 1e5 and 1e6 K. (since the cooling time was fairly long over this period, the temperature reduction transpired because of "adiabatic cooling")

To be clear, this solution is a bit of a hack.3 I suspect that this scenario arises because either:

  • we should be using larger stencils (I could totally imagine that the small-size of our stencil coupled with differential background flows has bizarre effects)
  • we might want to average the density, velocity and thermal energy before we perform "resolved feedback" (i.e. thermal energy dumps)

Other solutions to this particular scenario could involve Riemann/reconstruction fallback (or rewinding and taking a smaller timestep).

Footnotes

  1. For that reason, this PR and Improving thread crash logic #406 should be merged at the same time (or they should be combined together -- whatever is easier)

  2. This refers to the canonical step of the dual-energy formalism where we might replace the advected internal energy field with the value of Etot - Kinetic and the common additional logic of subsequently replacing Etot with the advected internal energy plus the kinetic energy.

  3. While the dual-energy formalism can be used in this scenario, I don't think it is really intended to solve this kind of issue
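The "dual-energy reconciliation" step described in footnote 2 can be sketched as follows (a hypothetical, heavily simplified illustration: the real formalism decides between the two energies using thresholds and neighboring-cell values that are collapsed into a single boolean here, and the names are invented):

```cpp
#include <cassert>

struct DEState {
  double E_total;              // total energy
  double E_kinetic;            // kinetic energy
  double E_internal_advected;  // separately advected internal energy
};

// Footnote 2's two-part update: optionally replace the advected
// internal energy with (Etot - kinetic), then make Etot consistent
// with the (possibly updated) advected internal energy plus kinetic.
void reconcile_dual_energy(DEState& s, bool use_total_energy) {
  if (use_total_energy) {
    s.E_internal_advected = s.E_total - s.E_kinetic;
  }
  s.E_total = s.E_internal_advected + s.E_kinetic;
}
```

Note that when `use_total_energy` is false, the second assignment overwrites `E_total`, which is exactly why running this step before (rather than after) the crashed-cell averaging changes what the averaging sees.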

mabruzzo and others added 18 commits July 22, 2024 13:36
In slightly more detail, we invoke `Average_Cell_All_Fields` to get the average values from the neighbors to overwrite the values in a crashed cell or (when the `AVERAGE_SLOW_CELLS` macro is defined) in a slow cell.

  - We have already been excluding any neighboring cell from this average that had crashed.

  - The key innovation here is that we now also exclude a neighboring cell that we would consider a slow cell (when the `AVERAGE_SLOW_CELLS` macro is defined).

To implement this, we pass around the SlowCellConditionCheck object, which implements methods to check whether a cell is considered slow or not. When the `AVERAGE_SLOW_CELLS` macro is defined, that method does the full check. When the macro isn't defined, the logic short-circuits and always returns that the cell isn't slow.
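The short-circuit pattern described above can be sketched like this (a hypothetical simplification: the member name, threshold, and check are invented; the real object presumably carries the simulation's slow-cell parameters):

```cpp
#include <cassert>

// When AVERAGE_SLOW_CELLS is undefined, is_slow() compiles down to
// "never slow", so call sites need no #ifdef of their own.
struct SlowCellConditionCheck {
  double max_dti_slow;  // threshold on the cell's inverse timestep

  bool is_slow(double cell_inverse_dt) const {
#ifdef AVERAGE_SLOW_CELLS
    return cell_inverse_dt > max_dti_slow;  // full check
#else
    (void)cell_inverse_dt;  // silence unused-parameter warnings
    return false;           // short-circuit: no cell is ever "slow"
#endif
  }
};
```

Passing the object around (rather than sprinkling `#ifdef AVERAGE_SLOW_CELLS` through the kernels) keeps the averaging code identical in both build configurations.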
The dual-energy formalism now respects crashed cells in the function
call `Select_Internal_Energy_ND`
-> the function does not modify any cells where the cell "crashed"
-> while considering the neighboring values of the crashed cells, we
   ignore any neighbors that "crashed"
-> we also now explicitly pick advected internal energy over the
   internal energy computed from total energy in cases where the total
   energy is positive, but does not exceed kinetic energy.

This was all done in preparation for averaging crashed cells after the
dual-energy formalism: that averaging now operates after the
dual-energy formalism.
@mabruzzo mentioned this pull request Oct 26, 2024