
Highly suspected "bug" related to emission files & reproducibility issues #55

Open · j34ni opened this issue Apr 4, 2020 · 26 comments
Labels: bug (Something isn't working), good-to-have

j34ni commented Apr 4, 2020

Adding emission files significantly affects the reproducibility of the simulations, even if they only contain zeros (whether these zeros are defined as float or as double precision).

This is very likely linked to the problem that occurred in summer 2019 (while carrying out the CMIP6 runs), although at that time it made the model crash more or less randomly, which is not the case here.

j34ni added the bug label on Apr 4, 2020
MichaelSchulzMETNO (Contributor) commented Apr 6, 2020

@j34ni
This should certainly be tracked down. I wonder:

  • Is it a CAM or a NorESM problem? Does it also happen in an AMIP configuration?
  • Can it be circumvented by manipulating the emissions without adding a file, e.g. by setting the SO2 emissions in one of the existing emission files to zero, or by adding the Pinatubo emissions in month 6 to the SO2 fluxes in an existing file? (See the sketch after this list.)
  • Does this happen on both Vilje and Fram?
  • Is the output bit-unidentical from the first month?
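A minimal sketch of that circumvention, assuming the netCDF Fortran 90 interface; the file name emissions_SO2_surface.nc, the variable name emiss_ene, and its assumed (lon, lat, time) layout are placeholders rather than the actual NorESM input names:

program zero_emission_field
   ! Hedged sketch: overwrite one emission variable with zeros in an existing
   ! file, as an alternative to supplying a separate zero-emission file.
   ! File name, variable name and the assumed (lon, lat, time) layout are
   ! placeholders for whatever the actual NorESM emission input uses.
   use netcdf
   implicit none
   integer :: ncid, varid, dimids(3), nlon, nlat, ntime
   real, allocatable :: field(:,:,:)

   call check( nf90_open('emissions_SO2_surface.nc', nf90_write, ncid) )
   call check( nf90_inq_varid(ncid, 'emiss_ene', varid) )
   call check( nf90_inquire_variable(ncid, varid, dimids=dimids) )
   call check( nf90_inquire_dimension(ncid, dimids(1), len=nlon) )
   call check( nf90_inquire_dimension(ncid, dimids(2), len=nlat) )
   call check( nf90_inquire_dimension(ncid, dimids(3), len=ntime) )

   allocate(field(nlon, nlat, ntime))
   field = 0.0
   call check( nf90_put_var(ncid, varid, field) )
   call check( nf90_close(ncid) )

contains
   subroutine check(status)
      integer, intent(in) :: status
      if (status /= nf90_noerr) then
         print *, trim(nf90_strerror(status))
         stop 1
      end if
   end subroutine check
end program zero_emission_field

The model would then read the modified file through its normal emission settings, so no extra file needs to be added.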

j34ni (Author) commented Apr 6, 2020

  • Is it a CAM or a NorESM problem: this is hard to say. I only made short runs, and the problem was easy to spot after only a few months because there were already significant differences, for instance in the sea ice (which do not occur with prescribed SSTs). I have asked Ada about what happened for CMIP and how they diagnosed the issues at the time.

  • Can it be circumvented: probably, but would it not be more sensible to find a more permanent solution, since this bug is likely to have other consequences that are not yet understood?

  • Only tried on Fram, since I did not have CPU time on Vilje.

  • Not bit-identical from the first month: correct, and some differences are already clearly visible (sea-ice fraction, and probably other variables too).

DirkOlivie (Contributor) commented

Here are some comments:

  • The problems we encountered with emission files (such as (1) a crash or (2) the onset of non-bit-identical behaviour) often started in the middle of the month. The atm.log file is a place where one can follow the state of the model every time step (and one can see when two simulations start to diverge).
  • Not all combinations of compsets and machines have been tested. However, a few results are:
    (1) The problem only appeared on Fram, not on Vilje.
    (2) On Fram, it happened for the fully-coupled compsets when using 30 nodes (+/- the standard setup).
    (3) On Fram, it happened for the fixed-SST compsets when using 32 nodes, but not when using 16 nodes.
  • With the "frc2"-type compsets (which use fewer emission files), we avoided the crashes and the non-bit-identical behaviour. Maybe it is an option to do the Pinatubo tests with the N1850frc2 compset.

MichaelSchulzMETNO (Contributor) commented

Suggestions from Thomas' email: @tto061 @j34ni @DirkOlivie

  1. run a parallel test with prescribed SSTs and sea ice (e.g. NF2000climo compset)
  2. run a parallel test with CESM CAM (e.g. F2000climo compset, assuming you can adjust your input to suit MAM -- if not, please ignore)
  3. run a parallel test without the land component (QP compset; you'd need to reset all your inputs manually; I can probably help you with that).

j34ni (Author) commented Apr 14, 2020

I ran an NF2000climo compset with and without additional zero-emission files, and the results are different there too!

tto061 commented Apr 14, 2020

OK, thanks Jean. So we've ruled out sea ice. Do you think you can try test #2? Also, could you share your NorESM case directories and point to your NorESM root directory for these tests on Fram?

j34ni (Author) commented Apr 29, 2020

I have not done this particular test on Fram but on a virtual machine, with the same run-time environment (same compiler version, same libraries, etc.), without a batch system or queuing time (and also with fewer computational resources).

Let me know if you want to look at particular files and I will put them somewhere accessible to you.

j34ni (Author) commented May 4, 2020

As for CESM and the F2000climo compset, I ran it several times in similar conditions (f19 resolution) and it never crashed. Also, it does not give different results when other emission files containing zeros are added.

j34ni (Author) commented May 4, 2020

I forgot to mention that I did all the CESM tests with the latest release (cesm2.1.3). Is it worth trying older versions, or should we focus on NorESM?

MichaelSchulzMETNO (Contributor) commented

I believe we should just test the newest NorESM-CAM6-Nor without "coupling" to the other components. My suspicion is that it is related to the emissions read in CAM, in combination with some other feature of the aerosol or CAM-Nor code.

A test could be to see whether NF2000climo / CAM6-Nor with the MAM4 aerosol can be run. @DirkOlivie, is that possible? It would be interesting anyway.

j34ni (Author) commented May 4, 2020

It seems to me that there are several problems which may or may not be related: (i) intermittent NorESM crashes (occurrence of NaNs and INFs), (ii) non-bit-for-bit reproducibility, and (iii) issues when reading the emission files.

DirkOlivie (Contributor) commented

The NF2000climo compset and the more recent CAM6-Nor compsets impose the use of the CAM-Oslo aerosol scheme (an essential part of the compset definition).
Have the frc2 compsets been tested in this context? Has a test been done without any emissions?

oyvindseland commented

If no one else does, I can check whether adding zeros when reading in existing files matters, i.e. before the numbers are scattered to the chunks.

tto061 commented May 4, 2020 via email

oyvindseland commented

Hi,

I can elaborate a bit more on my comment above. The check that can be done is to add an extra input sector at the point where the input files are read in, but instead of reading in a zero file, simply define the input array to be zero (see the sketch below). The purpose would be to check whether it is the read-in process itself that causes the problem, or whether it is the definition of new sectors.
If the results are still different, the addition of zeros can be done further down in the physics structure.
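A rough, standalone illustration of that check follows; all names and array shapes (emis_field with a (ncol, nlev, sector) layout, add_zero_sector) are placeholders, not the actual CAM/NorESM routine or variable names:

program zero_sector_check
   ! Hedged, standalone illustration of the check described above: append one
   ! extra emission "sector" that is simply set to zero in memory, instead of
   ! reading an all-zero file. Array shapes and names are placeholders.
   implicit none
   real(8), allocatable :: emis_field(:,:,:)   ! (ncol, nlev, sector)
   integer :: n_sectors

   n_sectors = 2
   allocate(emis_field(4, 3, n_sectors))
   emis_field = 1.0_8                          ! stand-in for data read from the real files

   call add_zero_sector(emis_field, n_sectors)
   print *, 'sectors:', n_sectors, ' sum of last sector:', sum(emis_field(:,:,n_sectors))

contains

   subroutine add_zero_sector(field, nsec)
      real(8), allocatable, intent(inout) :: field(:,:,:)
      integer,              intent(inout) :: nsec
      real(8), allocatable :: tmp(:,:,:)
      ! Grow the sector dimension by one and zero the new slot, mimicking a
      ! "zero emission file" without touching the netCDF read path.
      allocate(tmp(size(field,1), size(field,2), nsec+1))
      tmp(:,:,1:nsec) = field
      tmp(:,:,nsec+1) = 0.0_8
      call move_alloc(tmp, field)
      nsec = nsec + 1
   end subroutine add_zero_sector

end program zero_sector_check

If the runs still diverge with such a purely in-memory zero sector, the file read-in itself can be ruled out; otherwise the suspicion shifts to the reading of the extra file.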

monsieuralok commented

Update from 22/10/2019:
When I was executing the compset NFPTAERO60 with grid f19_f19_mg17, I was getting some strange values for the field names, which caused a crash at least when I compiled with MPI+OpenMP.

I printed the following block from file ndrop.F90 around line 2172

#ifdef OSLO_AERO
   tendencyCounted(:) = .FALSE.
   do m = 1, ntot_amode
      do l = 1, nspec_amode(m)
         mm   = mam_idx(m,l)
         lptr = getTracerIndex(m,l,.false.)
         if (.NOT. tendencyCounted(lptr)) then
            print*, mm, fieldname(mm), 'ndrop'
            call outfld(fieldname(mm),    coltend(:,lptr),    pcols, lchnk)
            call outfld(fieldname_cw(mm), coltend_cw(:,lptr), pcols, lchnk)
            tendencyCounted(lptr) = .TRUE.
         endif
      end do
   end do
#endif

I get :

        8 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop
      12 BC_AI_mixnuc1           ndrop
      13 OM_AI_mixnuc1           ndrop
      15 SO4_A2_mixnuc1          ndrop
      18 SO4_PR_mixnuc1          ndrop
      19 BC_AC_mixnuc1           ndrop
      20 OM_AC_mixnuc1           ndrop
      22 SO4_AC_mixnuc1          ndrop
      26 DST_A2_mixnuc1          ndrop
      34 DST_A3_mixnuc1          ndrop
      35 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ndrop

When I printed fieldname(mm) for mm = 8, 14 and 35 at line 263 in ndrop.F90, I guess it is never assigned any value or initialized.

Second, it could be that the loop should not go over these indices at all. Could you please check and update?
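A hedged, standalone illustration of one possible guard (not the actual ndrop.F90 fix): initialize the name array to blanks and skip indices that were never assigned, so outfld is never called with an undefined string. The sizes and the two assigned names below are placeholders taken from the printout above:

program fieldname_guard_demo
   ! Hedged sketch only: mimic the OSLO_AERO block above with a blank-initialized
   ! name array and a guard that skips never-assigned entries (e.g. the garbled
   ! mm = 8 and mm = 35 in the output above). The real fix would be to make sure
   ! fieldname(mm) is defined for every mm the mam_idx loop can reach, or to
   ! restrict the loop to valid indices.
   implicit none
   integer, parameter :: nfields = 40
   character(len=24)  :: fieldname(nfields)
   integer :: mm

   fieldname(:)  = ' '                   ! explicit initialization: no uninitialized memory
   fieldname(12) = 'BC_AI_mixnuc1'       ! only some entries get assigned, as in the model
   fieldname(13) = 'OM_AI_mixnuc1'

   do mm = 1, nfields
      if (len_trim(fieldname(mm)) == 0) cycle    ! skip undefined names instead of using them
      print *, mm, trim(fieldname(mm)), ' ndrop' ! stand-in for the outfld calls
   end do
end program fieldname_guard_demo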

tto061 commented May 6, 2020

Further adding to the picture: as far as I can tell, none of my integrations on Tetralith, including the NFHIST cases for CMIP6, are reproducible bit-for-bit with the default compiler options (i.e. -O2 for Fortran), either from existing restarts or from the default initial conditions. I have not run a reproducibility test with the -O0 option.

j34ni (Author) commented May 14, 2020

I am investigating the bug with different tools (like the Intel Inspector) for memory and thread checking and debugging.
I think I am getting there.
That now seems to work on a virtual machine.

DirkOlivie (Contributor) commented

@j34ni A temporary solution might be to use one 3D SO2 emission file, which would contain the standard 3D emissions plus the Pinatubo eruption emissions. Would you like me to create such a file?

j34ni (Author) commented May 18, 2020

@DirkOlivie We can give it a go

j34ni (Author) commented May 22, 2020

@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061 I eventually got NorESM working in the Conda environment (with a GNU compiler) and have not managed to make it crash yet!

There may be something very wrong with the Intel 2018 compiler, as was already the case when I was running the Variable Resolution CESM (for which I ended up using Intel 2019).

MichaelSchulzMETNO (Contributor) commented May 27, 2020

@j34ni Did you/could you explain how one can run NorESM in a conda environment? Is that in the NorESM2 documentation already? (I mean, that's really interesting to have!)

oyvindseland commented

@j34ni Really great news that you can run NorESM in a Conda environment. It is going to be interesting to see scaling results.

j34ni (Author) commented May 28, 2020

@MichaelSchulzMETNO At the moment that has not been documented much; it is still work in progress, building on the "conda cesm" recipe. That was mainly used for teaching purposes (to learn how to run an ESM on Galaxy) and for development (without having to wait in a queue). However, a proper "conda noresm" will be made available soon, which will allow a simple installation and contain what is needed to run the model (including configuration files, the Math Kernel Library instead of BLAS/LAPACK, etc.), on generic platforms first and after that on an HPC.

j34ni (Author) commented May 28, 2020

@oyvindseland Yes, we will have to evaluate the scalability on an HPC (so far we have only used small configurations on virtual machines with a single node); Betzy comes at the perfect time...

j34ni (Author) commented Jun 8, 2020

@MichaelSchulzMETNO @DirkOlivie @monsieuralok @tto061
Some of the problems occur at the very beginning of a run: initialization issues (obviously) but also non-BFB reproducibility and even crashes due to NaNs or INFs.

To test that quickly:

  • create a new case (for example original_N1850_f19_tn14) and run the simulation for 1 day;
  • create a 1st branch from the original case (needs a copy of the restart files & rpointers from the original case in the run dir) and continue the run for a single time step;
  • create a 2nd branch from the original case, add a couple of zero-emission files (also copy the restarts into the run dir), and run it for 1 time step;
  • continue the original simulation for 1 time step (CONTINUE_RUN=TRUE);
  • compare the original case with the 1st and 2nd branches.

I did that many times with CESM, and the 3 simulations systematically give identical results.

So far with NorESM that only worked with the GNU (for instance 9.3.0) and Intel (2019.5) compilers, not with Intel 2018, whether it makes use of Alok's SourceMods or not.

That is not meant to replace a long run, but it is much faster for evaluating the effect of various fixes: if the 3 simulations do not provide identical results after one time step, there is no need to waste more resources. However, even if they do provide identical results, the simulation can still fail later.
