
Highly suspected "bug" related to emission files & reproducibility issues #55

Open · j34ni opened this issue Apr 4, 2020 · 26 comments
Labels: bug (Something isn't working), good-to-have

j34ni commented Apr 4, 2020

Adding emission files significantly affects the reproducibility of the simulations, even if they only contain zeros (whether these zeros are defined as float or as double precision).

This is very likely linked to the problem that occurred in summer 2019 (while carrying out the CMIP6 runs), although at that time it made the model crash more or less randomly, which is not the case here.

j34ni added the bug label on Apr 4, 2020
MichaelSchulzMETNO (Contributor) commented Apr 6, 2020

@j34ni
This should certainly be tracked down. I wonder:

  • Is it a CAM or a NorESM problem? Does it also happen in an AMIP configuration?
  • Can it be circumvented by manipulating the emissions without adding a file, e.g. by setting the SO2 emissions in one of the existing emission files to zero, or by adding the Pinatubo emissions in month 6 to the SO2 fluxes in an existing file? (See the sketch after this list.)
  • Does this happen on both Vilje and Fram?
  • Is the output bit-unidentical from the first month?
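A minimal sketch of that circumvention, assuming the netCDF Fortran 90 interface; the file name emissions_SO2_surface.nc, the variable name emiss_ene, and its assumed (lon, lat, time) layout are placeholders rather than the actual NorESM input names:

program zero_emission_field
   ! Hedged sketch: overwrite one emission variable with zeros in an existing
   ! file, as an alternative to supplying a separate zero-emission file.
   ! File name, variable name and the assumed (lon, lat, time) layout are
   ! placeholders for whatever the actual NorESM emission input uses.
   use netcdf
   implicit none
   integer :: ncid, varid, dimids(3), nlon, nlat, ntime
   real, allocatable :: field(:,:,:)

   call check( nf90_open('emissions_SO2_surface.nc', nf90_write, ncid) )
   call check( nf90_inq_varid(ncid, 'emiss_ene', varid) )
   call check( nf90_inquire_variable(ncid, varid, dimids=dimids) )
   call check( nf90_inquire_dimension(ncid, dimids(1), len=nlon) )
   call check( nf90_inquire_dimension(ncid, dimids(2), len=nlat) )
   call check( nf90_inquire_dimension(ncid, dimids(3), len=ntime) )

   allocate(field(nlon, nlat, ntime))
   field = 0.0
   call check( nf90_put_var(ncid, varid, field) )
   call check( nf90_close(ncid) )

contains
   subroutine check(status)
      integer, intent(in) :: status
      if (status /= nf90_noerr) then
         print *, trim(nf90_strerror(status))
         stop 1
      end if
   end subroutine check
end program zero_emission_field

The model would then read the modified file through its normal emission settings, so no extra file needs to be added.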

j34ni (Author) commented Apr 6, 2020

  • Is it a CAM or a NorESM problem: this is hard to say. I only made short runs, and the problem was easy to spot after only a few months because there were already significant differences, for instance in the sea ice (which do not occur with prescribed SSTs). I have asked Ada about what happened for CMIP and how they diagnosed the issues at the time.

  • Can it be circumvented: probably, but would it not be more sensible to find a more permanent solution, since this bug is likely to have other consequences that are not yet understood?

  • Only tried on Fram, since I did not have CPU time on Vilje.

  • Not bit-identical from the first month: correct, and some differences are already clearly visible (sea-ice fraction, and probably other variables too).

DirkOlivie (Contributor) commented

Here are some comments:

  • The problems we encountered with emission files (such as (1) a crash or (2) the onset of non-bit-identical behaviour) often started in the middle of the month. The atm.log file is a place where one can follow the state of the model every time step (and one can see when two simulations start to diverge).
  • Not all combinations of compsets and machines have been tested. However, a few results are:
    (1) The problem only appeared on Fram, not on Vilje.
    (2) On Fram, it happened for the fully-coupled compsets when using 30 nodes (+/- the standard setup).
    (3) On Fram, it happened for the fixed-SST compsets when using 32 nodes, but not when using 16 nodes.
  • With the "frc2"-type compsets (which use fewer emission files), we avoided the crashes and the non-bit-identical behaviour. Maybe it is an option to do the Pinatubo tests with the N1850frc2 compset.

MichaelSchulzMETNO (Contributor) commented

Suggestions from Thomas' email: @tto061 @j34ni @DirkOlivie

  1. run a parallel test with prescribed SSTs and sea ice (e.g. NF2000climo compset)
  2. run a parallel test with CESM CAM (e.g. F2000climo compset, assuming you can adjust your input to suit MAM -- if not, please ignore)
  3. run a parallel test without the land component (QP compset; you'd need to reset all your inputs manually; I can probably help you with that).

j34ni (Author) commented Apr 14, 2020

I ran an NF2000climo compset with and without additional zero-emission files, and the results are different there too!

tto061 commented Apr 14, 2020

OK, thanks Jean. So we've ruled out sea ice. Do you think you can try test #2? Also, could you share your NorESM case directories and point to your NorESM root directory for these tests on Fram?

j34ni (Author) commented Apr 29, 2020

I have not done this particular test on Fram but on a virtual machine, with the same run-time environment (same compiler version, same libraries, etc.), without a batch system or queuing time (and also with fewer computational resources).

Let me know if you want to look at particular files and I will put them somewhere accessible to you.

j34ni (Author) commented May 4, 2020

As for CESM and the F2000climo compset, I ran it several times in similar conditions (f19 resolution) and it never crashed. Also, it does not give different results when other emission files containing zeros are added.

j34ni (Author) commented May 4, 2020

I forgot to mention that I did all the CESM tests with the latest release (cesm2.1.3). Is it worth trying older versions, or should we focus on NorESM?

MichaelSchulzMETNO (Contributor) commented

I believe we should just test the newest NorESM-CAM6-Nor without "coupling" to the other components. My suspicion is that it is related to the emissions read in CAM, in combination with some other feature of the aerosol or CAM-Nor code.

A test could be to see whether NF2000climo / CAM6-Nor with the MAM4 aerosol can be run. @DirkOlivie, is that possible? It would be interesting anyway.

j34ni (Author) commented May 4, 2020

It seems to me that there are several problems which may or may not be related: (i) intermittent NorESM crashes (occurrence of NaNs and INFs), (ii) non-bit-for-bit reproducibility, and (iii) issues when reading the emission files.

DirkOlivie (Contributor) commented

The NF2000climo compset and the more recent CAM6-Nor compsets impose the use of the CAM-Oslo aerosol scheme (an essential part of the compset definition).
Have the frc2 compsets been tested in this context? Has a test been done without any emissions?

oyvindseland commented

If no one else does, I can check whether adding zeros when reading in existing files matters, i.e. before the numbers are scattered to the chunks.

tto061 commented May 4, 2020 via email

oyvindseland commented

Hi,

I can elaborate a bit more on my comment above. The check that can be done is to add an extra input sector at the point where the input files are read in, but instead of reading in a zero file, simply define the input array to be zero (see the sketch below). The purpose would be to check whether it is the read-in process itself that causes the problem, or whether it is the definition of new sectors.
If the results are still different, the addition of zeros can be done further down in the physics structure.
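A rough, standalone illustration of that check follows; all names and array shapes (emis_field with a (ncol, nlev, sector) layout, add_zero_sector) are placeholders, not the actual CAM/NorESM routine or variable names:

program zero_sector_check
   ! Hedged, standalone illustration of the check described above: append one
   ! extra emission "sector" that is simply set to zero in memory, instead of
   ! reading an all-zero file. Array shapes and names are placeholders.
   implicit none
   real(8), allocatable :: emis_field(:,:,:)   ! (ncol, nlev, sector)
   integer :: n_sectors

   n_sectors = 2
   allocate(emis_field(4, 3, n_sectors))
   emis_field = 1.0_8                          ! stand-in for data read from the real files

   call add_zero_sector(emis_field, n_sectors)
   print *, 'sectors:', n_sectors, ' sum of last sector:', sum(emis_field(:,:,n_sectors))

contains

   subroutine add_zero_sector(field, nsec)
      real(8), allocatable, intent(inout) :: field(:,:,:)
      integer,              intent(inout) :: nsec
      real(8), allocatable :: tmp(:,:,:)
      ! Grow the sector dimension by one and zero the new slot, mimicking a
      ! "zero emission file" without touching the netCDF read path.
      allocate(tmp(size(field,1), size(field,2), nsec+1))
      tmp(:,:,1:nsec) = field
      tmp(:,:,nsec+1) = 0.0_8
      call move_alloc(tmp, field)
      nsec = nsec + 1
   end subroutine add_zero_sector

end program zero_sector_check

If the runs still diverge with such a purely in-memory zero sector, the file read-in itself can be ruled out; otherwise the suspicion shifts to the reading of the extra file.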

monsieuralok commented

Update from 22/10/2019:
When I was executing the compset NFPTAERO60 with grid f19_f19_mg17, I was getting some strange values for the field names, which caused a crash at least when I compiled with MPI+OpenMP.

I printed the following block from file ndrop.F90 around line 2172

#ifdef OSLO_AERO
   tendencyCounted(:) = .FALSE.
   do m = 1, ntot_amode
      do l = 1, nspec_amode(m)
         mm   = mam_idx(m,l)
         lptr = getTracerIndex(m,l,.false.)
         if (.NOT. tendencyCounted(lptr)) then
            print*, mm, fieldname(mm), 'ndrop'
            call outfld(fieldname(mm),    coltend(:,lptr),    pcols, lchnk)
            call outfld(fieldname_cw(mm), coltend_cw(:,lptr), pcols, lchnk)
            tendencyCounted(lptr) = .TRUE.
         endif
      end do
   end do
#endif

I get :

        8 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop
      12 BC_AI_mixnuc1           ndrop
      13 OM_AI_mixnuc1           ndrop
      15 SO4_A2_mixnuc1          ndrop
      18 SO4_PR_mixnuc1          ndrop
      19 BC_AC_mixnuc1           ndrop
      20 OM_AC_mixnuc1           ndrop
      22 SO4_AC_mixnuc1          ndrop
      26 DST_A2_mixnuc1          ndrop
      34 DST_A3_mixnuc1          ndrop
      35 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop

When I printed fieldname(mm) for mm = 8, 14 and 35 at line 263 in ndrop.F90, I guess it is never assigned any value or initialized.

Second, it could be that the loop should not go over these indices at all. Could you please check and update?
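A hedged, standalone illustration of one possible guard (not the actual ndrop.F90 fix): initialize the name array to blanks and skip indices that were never assigned, so outfld is never called with an undefined string. The sizes and the two assigned names below are placeholders taken from the printout above:

program fieldname_guard_demo
   ! Hedged sketch only: mimic the OSLO_AERO block above with a blank-initialized
   ! name array and a guard that skips never-assigned entries (e.g. the garbled
   ! mm = 8 and mm = 35 in the output above). The real fix would be to make sure
   ! fieldname(mm) is defined for every mm the mam_idx loop can reach, or to
   ! restrict the loop to valid indices.
   implicit none
   integer, parameter :: nfields = 40
   character(len=24)  :: fieldname(nfields)
   integer :: mm

   fieldname(:)  = ' '                   ! explicit initialization: no uninitialized memory
   fieldname(12) = 'BC_AI_mixnuc1'       ! only some entries get assigned, as in the model
   fieldname(13) = 'OM_AI_mixnuc1'

   do mm = 1, nfields
      if (len_trim(fieldname(mm)) == 0) cycle    ! skip undefined names instead of using them
      print *, mm, trim(fieldname(mm)), ' ndrop' ! stand-in for the outfld calls
   end do
end program fieldname_guard_demo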

tto061 commented May 6, 2020

Further adding to the picture: as far as I can tell, none of my integrations on Tetralith, including the NFHIST cases for CMIP6, are reproducible bit-for-bit with the default compiler options (i.e. -O2 for Fortran), either from existing restarts or from the default initial conditions. I have not run a reproducibility test with the -O0 option.

j34ni (Author) commented May 14, 2020

I am investigating the bug with different tools (like the Intel Inspector) for memory and thread checking and debugging.
I think I am getting there.
That now seems to work on a virtual machine.

DirkOlivie (Contributor) commented

@j34ni A temporary solution might be to use one 3D SO2 emission file, which would contain the standard 3D emissions plus the Pinatubo eruption emissions. Would you like me to create such a file?

j34ni (Author) commented May 18, 2020

@DirkOlivie We can give it a go

j34ni (Author) commented May 22, 2020

@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061 I eventually got NorESM working in the Conda environment (with a GNU compiler) and have not managed to make it crash yet!

There may be something very wrong with the Intel 2018 compiler, as was already the case when I was running the Variable Resolution CESM (for which I ended up using Intel 2019).

MichaelSchulzMETNO (Contributor) commented May 27, 2020

@j34ni Did you/could you explain how one can run NorESM in a conda environment? Is that in the NorESM2 documentation already? (I mean, that's really interesting to have!)

oyvindseland commented

@j34ni Really great news that you can run NorESM in a Conda environment. It is going to be interesting to see scaling results.

j34ni (Author) commented May 28, 2020

@MichaelSchulzMETNO At the moment that has not been documented much; it is still work in progress, building on the "conda cesm" recipe. That was mainly used for teaching purposes (to learn how to run an ESM on Galaxy) and for development (without having to wait in a queue). However, a proper "conda noresm" will be made available soon, which will allow a simple installation and contain what is needed to run the model (including configuration files, the Math Kernel Library instead of BLAS/LAPACK, etc.), on generic platforms first and after that on an HPC.

j34ni (Author) commented May 28, 2020

@oyvindseland Yes, we will have to evaluate the scalability on an HPC (so far we have only used small configurations on virtual machines with a single node); Betzy comes at the perfect time...

j34ni (Author) commented Jun 8, 2020

@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061
Some of the problems occur at the very beginning of a run: initialization issues (obviously) but also non-BFB reproducibility and even crashes due to NaNs or INFs.

To test that quickly:

  • create a new case (for example original_N1850_f19_tn14) and run the simulation for 1 day;
  • create a 1st branch from the original case (needs a copy of the restart files & rpointers from the original case in the run dir) and continue the run for a single time step;
  • create a 2nd branch from the original case, add a couple of zero-emission files (also copy the restarts into the run dir), and run it for 1 time step;
  • continue the original simulation for 1 time step (CONTINUE_RUN=TRUE);
  • compare the original case with the 1st and 2nd branches.

I did that many times with CESM, and the 3 simulations systematically give identical results.

So far with NorESM that only worked with the GNU (for instance 9.3.0) and Intel (2019.5) compilers, not with Intel 2018, whether it makes use of Alok's SourceMods or not.

That is not meant to replace a long run, but it is much faster for evaluating the effect of various fixes: if the 3 simulations do not provide identical results after one time step, there is no need to waste more resources. However, even if they do provide identical results, the simulation can still fail later.
