More thread crash improvements #417

Open · wants to merge 18 commits into base `dev`
Conversation

mabruzzo (Collaborator)

This does a bunch of things (that would be a little annoying to subdivide into separate pull requests -- but I'm happy to do that). We divide up the description of this PR into a few parts.

I have highlighted points for where feedback might be particularly useful in "important quote-blocks"

Bug Fixes

  • fixes a significant bug introduced by Consolidate GPU Error Checking Function #350. The changes introduced by #350 to consolidate error-handling routines accidentally started suppressing all errors related to kernel launches. The way that kernel-launch errors are reported is very subtle (so it's easy to see how the error handling could have accidentally been broken). Without the changes in this PR, all failures of Cholla's kernels to launch are completely silent!
  • fixing the preceding bug revealed that when I factored out PostUpdate_Conserved_Correct_Crashed_3D (the function to average cells) in Improving thread crash logic #406, I had messed up the launch parameters. That is fixed by the current PR.1
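The launch-error subtlety mentioned above can be illustrated with a small mock of the CUDA error model (all names below are invented for this sketch; they are neither CUDA's nor Cholla's actual API). A kernel launch never returns its error code; the failure is only recorded internally, so an error checker that inspects only the values explicitly handed to it (e.g. return codes of API calls) never sees the launch failure unless the call site also queries the `cudaGetLastError()`-style state:

```cpp
#include <cassert>

// Minimal mock of the CUDA error model (illustrative names only).
enum class gpuError { success, launchFailure };

static gpuError last_error = gpuError::success;

// Mimics a <<<...>>> launch with bad parameters: the failure is
// recorded internally but NOT returned to the caller.
void launch_kernel(bool bad_launch_params) {
  if (bad_launch_params) last_error = gpuError::launchFailure;
}

// Mimics cudaGetLastError(): returns and clears the recorded error.
gpuError get_last_error() {
  gpuError e = last_error;
  last_error = gpuError::success;
  return e;
}

// A consolidated checker that only inspects the value passed to it
// cannot see a launch failure unless the caller fetches it first.
bool error_check(gpuError e) { return e == gpuError::success; }
```

With this mock, `launch_kernel(true); error_check(gpuError::success);` reports success even though the launch failed; only `error_check(get_last_error())` catches it. That is precisely how a refactor of the checking routine can silently drop launch errors.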

More Logging when applying "fixes"

We now start logging more information when we overwrite values. This mainly affects 2 locations:

  1. Temperature_Floor_Kernel now describes wherever we apply a temperature floor (this helped us identify the need for the bugfixes in the preceding section)
  2. Average_Cell_All_Fields now describes the values of the advected internal energy when applying updates (cases without the dual-energy formalism are appropriately handled). It also prints out any neighbors that get omitted when we compute the average.

These changes raise the question: "Are we making the log too noisy?" In general, I don't really think so, but let's talk through the 2 cases.

  1. I could imagine the case being a little noisy for Temperature_Floor_Kernel

Important

My inclination is to wait until it becomes a problem, but I could be convinced to comment-out these warnings in most cases.

  2. I feel strongly that this extra logging is important for Average_Cell_All_Fields. Prior to this PR, we already logged every time this function is called. All we are doing is reporting any neighbors that can't be used by this function since that should be rare. Furthermore, if a scenario ever comes up where somebody feels that this pollutes the log, it's probably indicative of a serious simulation issue!

Provide explicit behavior when PostUpdate_Conserved_Correct_Crashed_3D fails

The simulation now aborts if PostUpdate_Conserved_Correct_Crashed_3D tries to "fix" a crashed cell, but the crashed cell does not have at least 2 valid neighbors to use for averaging. The number 2 is arbitrary. I originally just wanted to implement defined behavior when there were 0 valid neighbors. But, I decided that having just 1 valid neighbor was also pretty bad.
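The policy just described can be sketched as follows (a simplified, hypothetical host-side illustration, not Cholla's actual kernel code; the validity test and function name are invented for the sketch):

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Overwrite a crashed cell's value with the average over its valid
// neighbors. If fewer than 2 neighbors are usable, abort rather than
// continuing with an essentially meaningless "fix".
double average_from_valid_neighbors(const std::vector<double>& neighbor_vals) {
  double sum = 0.0;
  int n_valid = 0;
  for (double v : neighbor_vals) {
    // a neighbor is "valid" in this sketch if it is finite and positive
    if (std::isfinite(v) && (v > 0.0)) {
      sum += v;
      ++n_valid;
    }
  }
  if (n_valid < 2) {
    std::fprintf(stderr, "crashed cell has only %d valid neighbor(s); aborting\n",
                 n_valid);
    std::abort();  // explicit, defined behavior instead of a silent bad average
  }
  return sum / n_valid;
}
```

Averaging over a single neighbor would just copy that neighbor's state, which is why requiring at least 2 (however arbitrary the exact number) seems like a reasonable floor.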

Reorder when we apply PostUpdate_Conserved_Correct_Crashed_3D

Finally, we also reordered PostUpdate_Conserved_Correct_Crashed_3D and "dual energy reconciliation."2

This fixed a problem that occurs with some frequency (and was exposed by all of the preceding fixes) in particle-based feedback simulations.

Essentially, a scenario comes up where the hydrodynamic fields are seemingly valid at the start of a cycle, but by the end of the hydrodynamic solve (before any source terms other than self-gravity) you end up with a weird situation: a crashed cell surrounded by neighboring cells that all have a positive total energy that is exceeded by the corresponding kinetic energy.

Because the advected internal energy in the neighboring cells always seems to be reasonable, we simply apply dual-energy reconciliation before PostUpdate_Conserved_Correct_Crashed_3D.

  • Care is taken to ensure that "dual-energy reconciliation" does not modify crashed cells. We use the existing definition of a crashed cell: any cell with a density or total energy that is NaN or negative.

Important

Should we also consider cells where density or total energy is 0 to be crashed cells?

  • "dual-energy reconciliation" also takes care to ignore crashed cells during the step where we need the max nearby total energy.

Important

We don't do anything special if a neighboring cell has a positive total energy that is exceeded by the corresponding cell's kinetic energy. An alternative might be to use the sum of the advected internal energy and the kinetic energy for that cell. (I'm not sure how much this actually matters.)
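The two bullets above can be sketched together (a hypothetical, simplified illustration; the struct and function names are invented and the real kernels operate on flattened field arrays, not per-cell structs):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// The existing crashed-cell definition: density or total energy is
// NaN or negative. (Whether 0 should also count is an open question.)
bool is_crashed(double density, double total_energy) {
  return std::isnan(density) || (density < 0.0) ||
         std::isnan(total_energy) || (total_energy < 0.0);
}

struct Cell {
  double density;
  double total_energy;
};

// Max total energy among neighbors, skipping crashed cells, as the
// reordered dual-energy reconciliation step now does. Returns 0.0 if
// every neighbor is crashed (a fallback invented for this sketch).
double max_valid_neighbor_energy(const std::vector<Cell>& neighbors) {
  double e_max = 0.0;
  for (const Cell& c : neighbors) {
    if (!is_crashed(c.density, c.total_energy) && c.total_energy > e_max) {
      e_max = c.total_energy;
    }
  }
  return e_max;
}
```

With this predicate, a neighbor with negative density but huge total energy is correctly excluded from the max-energy scan instead of contaminating the reconciliation decision.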

For posterity, here are more details about the particular problem:

  • I confirmed that self-gravity source terms aren't to be blamed for this issue
  • I also confirmed that the issue occurs when we just use "resolved feedback" as well as when we use both resolved and unresolved feedback
  • I did a deep dive into a scenario produced by a sim with just "resolved feedback"
    • this scenario emerged within the core of a thermal energy dump (a "resolved" SN) that was deposited off-center in an existing SN remnant. The star cluster that produced this newest injection (and almost certainly produced the existing remnant) last underwent a SN ~70 kyr earlier.
    • I traced out the evolution of the dump over about 15 cycles (I think we started tracing the evolution ~3 kyr after the SN and then we followed it for another ~3 kyr) until the problematic scenario manifested
    • there was no other feedback over that time and the min cooling time in any of those cells was no less than many 10s of kyr. Thus, the issue was primarily hydrodynamic
    • The scenario occurred at the core of the newer supernova remnant. Over the 15 observed cycles, the temperature in the center of the newer remnant dropped from >~1e8 K to somewhere between 1e5 and 1e6 K. (since the cooling time was fairly long over this period, the temperature reduction transpired because of "adiabatic cooling")

To be clear, this solution is a bit of a hack.3 I suspect that this scenario arises because either:

  • we should be using larger stencils (I could totally imagine that the small-size of our stencil coupled with differential background flows has bizarre effects)
  • we might want to average the density, velocity and thermal energy before we perform "resolved feedback" (i.e. thermal energy dumps)

Other solutions to this particular scenario could involve Riemann/reconstruction fallback (or rewinding and taking a smaller timestep).

Footnotes

  1. For that reason, this PR and Improving thread crash logic #406 should be merged at the same time (or they should be combined together -- whatever is easier)

  2. This refers to the canonical step of the dual-energy formalism where we might replace the advected internal energy field with the value of Etot - Kinetic and the common additional logic of subsequently replacing Etot with the advected internal energy plus the kinetic energy.

  3. While the dual-energy formalism can be used in this scenario, I don't think it is really intended to solve this kind of issue
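The "dual-energy reconciliation" step described in footnote 2 can be sketched as follows (a hypothetical, heavily simplified illustration: the real formalism decides between the two energies using thresholds and neighboring-cell values that are collapsed into a single boolean here, and the names are invented):

```cpp
#include <cassert>

struct DEState {
  double E_total;              // total energy
  double E_kinetic;            // kinetic energy
  double E_internal_advected;  // separately advected internal energy
};

// Footnote 2's two-part update: optionally replace the advected
// internal energy with (Etot - kinetic), then make Etot consistent
// with the (possibly updated) advected internal energy plus kinetic.
void reconcile_dual_energy(DEState& s, bool use_total_energy) {
  if (use_total_energy) {
    s.E_internal_advected = s.E_total - s.E_kinetic;
  }
  s.E_total = s.E_internal_advected + s.E_kinetic;
}
```

Note that when `use_total_energy` is false, the second assignment overwrites `E_total`, which is exactly why running this step before (rather than after) the crashed-cell averaging changes what the averaging sees.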

mabruzzo and others added 18 commits July 22, 2024 13:36
In slightly more detail, we invoke `Average_Cell_All_Fields` to get the average values from the neighbors to overwrite the values in a crashed cell or (when the `AVERAGE_SLOW_CELLS` macro is defined) in a slow cell.

  - We have already been excluding any neighboring cell from this average that had crashed.

  - The key innovation here is that we now also exclude a neighboring cell that we would consider a slow cell (when the `AVERAGE_SLOW_CELLS` macro is defined).

To implement this, we pass around the SlowCellConditionCheck object, which implements methods to check whether a cell is considered slow or not. When the `AVERAGE_SLOW_CELLS` macro is defined, that method does the full check. When the macro isn't defined, the logic short-circuits and always returns that the cell isn't slow.
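The short-circuit pattern described above can be sketched like this (a hypothetical simplification: the member name, threshold, and check are invented; the real object presumably carries the simulation's slow-cell parameters):

```cpp
#include <cassert>

// When AVERAGE_SLOW_CELLS is undefined, is_slow() compiles down to
// "never slow", so call sites need no #ifdef of their own.
struct SlowCellConditionCheck {
  double max_dti_slow;  // threshold on the cell's inverse timestep

  bool is_slow(double cell_inverse_dt) const {
#ifdef AVERAGE_SLOW_CELLS
    return cell_inverse_dt > max_dti_slow;  // full check
#else
    (void)cell_inverse_dt;  // silence unused-parameter warnings
    return false;           // short-circuit: no cell is ever "slow"
#endif
  }
};
```

Passing the object around (rather than sprinkling `#ifdef AVERAGE_SLOW_CELLS` through the kernels) keeps the averaging code identical in both build configurations.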
The dual-energy formalism now respects crashed cells in the function
call `Select_Internal_Energy_ND`
-> the function does not modify any cells where the cell "crashed"
-> while considering the neighboring values of the crashed cells, we
   ignore any neighbors that "crashed"
-> we also now explicitly pick advected internal energy over the
   internal energy computed from total energy in cases where the total
   energy is positive, but does not exceed kinetic energy.

This was all done in preparation for averaging crashed cells after the
dual-energy formalism: that averaging now operates after the
dual-energy formalism.
@mabruzzo mentioned this pull request Oct 26, 2024