
Inconsistent behavior with Max Snapshots Per File with Restarts #6795

Open
bogensch opened this issue Dec 4, 2024 · 9 comments · May be fixed by #6846
Labels: bug, EAMxx (PRs focused on capabilities for EAMxx)

Comments

bogensch commented Dec 4, 2024

In EAMxx the YAML directive "Max Snapshots Per File" appears to misbehave across restarts.

For example, in a DPxx run 6 hours in duration, I have an output stream with Max Snapshots Per File: 1 and hourly averaged output. I have observed the following:

  • CNTL: If the simulation runs straight through without restarts, it produces 6 files, as expected.
  • EXP01: If the simulation runs with one restart halfway through, it produces 5 output files; the third file in the series contains two time slices.
  • EXP02: If the simulation runs with a restart each hour (hence, five restarts), it produces 1 output file containing all 6 time slices.

I have observed similar behavior in multiple DPxx simulations on both pm-cpu and pm-gpu. In a large production run I am doing daily restarts, with an output stream that has hourly output and Max Snapshots Per File: 24. In this case all my data ends up in ONE file.

I have performed one global ne30 test and noticed similar behavior with restarts and the same unexpected treatment of Max Snapshots Per File. Thus, this does not appear to be a DPxx-specific problem but a general one.

For a quick reproducer of EXP01, run the DYCOMS-RF01 case:
https://github.com/E3SM-Project/scmlib/blob/master/DPxx_SCREAM_SCRIPTS/run_dpxx_scream_DYCOMSrf01.csh
Set to run for 3 hours (search for stop_n) and set for one restart.

And direct to the following YAML:

%YAML 1.1
---
Averaging Type: Average
Max Snapshots Per File: 1
filename_prefix: ${CASE}.scream.hourly.avg
Fields:
  Physics PG2:
    Field Names:
    - LiqWaterPath
output_control:
  Frequency: 1
  frequency_units: nhours
bogensch added the bug and EAMxx labels on Dec 4, 2024
mahf708 commented Dec 4, 2024

I don't believe we've seen this odd behavior when Max Snapshots Per File equals a day's worth of data (so 8 snaps for a 3-hourly stream). I do wonder if this is specific to Max Snapshots Per File: 1 ...

mahf708 commented Dec 4, 2024

we discussed making this default 👀 👀

Restart:
  force_new_file: true

bogensch commented Dec 4, 2024

@mahf708 I see this even with Max Snapshots Per File: 24. You probably don't hit this issue because I see your stream sets:

Restart:
  force_new_file: true

which overrides this behavior (consistent with my experience too).

mahf708 commented Dec 4, 2024

Yep! I was gonna say the same exact thing!

I have an output stream with hourly output with Max Snapshots Per File: 24. In this case it is putting all my data into ONE file.

Also, sorry, I just saw this in your OP :/

bartgol commented Dec 4, 2024

Even without forcing a new file, the output infrastructure should see that the last output file is full, so it should NOT try to add a new slice to that one. Definitely a bug.

mahf708 commented Dec 4, 2024

Even without forcing a new file, the output infrastructure should see that the last output file is full, so it should NOT try to add a new slice to that one. Definitely a bug.

Guess something's gone wrong here?

      // Check if the prev run wrote any output file (it may have not, if the restart was written
      // before the 1st output step). If there is a file, check if there's still room in it.
      const auto& last_output_filename = get_attribute<std::string>(rhist_file,"GLOBAL","last_output_filename");
      m_resume_output_file = last_output_filename!="" and not restart_pl.get("force_new_file",false);
      if (m_resume_output_file) {
        m_output_file_specs.storage.num_snapshots_in_file = scorpio::get_attribute<int>(rhist_file,"GLOBAL","last_output_file_num_snaps");

        if (m_output_file_specs.storage.snapshot_fits(m_output_control.next_write_ts)) {
          // The setup_file call will not register any new variable (the file is in Append mode,
          // so all dims/vars must already be in the file). However, it will register decompositions,
          // since those are a property of the run, not of the file.
          m_output_file_specs.filename = last_output_filename;
          m_output_file_specs.is_open = true;
          setup_file(m_output_file_specs,m_output_control);
        } else {
          m_output_file_specs.close();
        }
      }

bartgol commented Dec 4, 2024

I guess. But nothing stands out... I would have to reproduce manually, then inundate the src code with print statements and see...

I'll get to it hopefully this week. Unless someone else feels like taking this on

AaronDonahue commented

A quick look at Peter B's case shows that in the metadata, last_output_file_num_snaps is always 0 for those rhist files. So the issue is likely wherever that metadata is being set.

AaronDonahue commented

https://github.com/E3SM-Project/E3SM/blob/master/components/eamxx/src/share/io/scream_output_manager.cpp#L537

So we'd probably just need print statements here. I have a bunch of meetings tomorrow, but if I get a good break I will take a look.

AaronDonahue added a commit that referenced this issue Dec 11, 2024
This commit fixes an issue during restarts that occurs with
averaged type output.  The restart history file (rhist) metadata
was incorrectly set up, which could lead EAMxx to reopen files that
already had the max number of snaps in them and continue to fill
them at the restart step.

Fixes #6795