zstd on WCOSS2 for UFS #3

Open
edwardhartnett opened this issue Aug 6, 2024 · 39 comments

@edwardhartnett
Contributor

Enable and install zstd compression.

Currently testing on acorn.

@junwang-noaa

FYI. The corresponding ufs weather model issue is at: ufs-community/ufs-weather-model#2319

@edwardhartnett
Contributor Author

Still waiting for testing on WCOSS2.

NCO must review zstd. Once test is finished, Hang will request review.

@edwardhartnett
Contributor Author

Some email activity:

Hi, Rahul,
Okay, I can start the procedure now.
Hopefully, we don't need the review, since they have an incomplete zstd on acorn.
I will start the zstd request first and see what they say.
Hang

On Fri, Aug 23, 2024 at 10:51 AM Rahul Mahajan - NOAA Federal wrote:
What is preventing us from requesting the installation before getting confirmation from UFS?
In all likelihood, we will be using zstd for compression.
Since zstd is a new library, it will need to go through security review which will take time.

Preliminary testing in the UFS has shown that zstd will work, and the work required for it will either be in UFS or netCDF (I assume) to provide compression settings.

I suggest we open a ticket ASAP w/ WCOSS2/GDIT to request the review and installation of zstd. Any amendments necessary can be communicated on the ticket at a later stage if necessary.

What do you think?

Rahul

On Fri, Aug 23, 2024 at 10:46 AM Hang Lei - NOAA Affiliate wrote:
Hi, Rahul

We don't have the request now.
Following our procedure, we need to wait for UFS to test the zstd compression and confirm with us.
Then we can send the request. The issue has been open on our github and awaiting confirmation from the UFS team.

As I checked with GDIT, they have a proven zstd version.
Therefore, we may only submit the request for installation.

Thanks,
Hang

On Fri, Aug 23, 2024 at 10:42 AM Rahul Mahajan - NOAA Federal wrote:
Hi Hang,

Do we have an open ticket w/ GDIT requesting the installation of the zstd library on WCOSS2?
If we do, can you please provide the ticket number and/or forward the email?
If there are any updates that you would like to share, please do.

I spoke w/ Steven this AM and he was unable to find the ticket in the system.

Thanks,
Rahul

@aerorahul @JacobCarley-NOAA @Hang-Lei-NOAA please discuss this topic on this issue so we have a full record.

@junwang-noaa has the UFS team been able to test zstd on WCOSS2?

I do not believe we should proceed until after this testing has taken place. Is there some reason it can't be done?

@junwang-noaa

Please see discussion on ufs weather model issue #2319.
Several RT tests failed on acorn.

@edwardhartnett
Contributor Author

OK, @junwang-noaa please let us know when you are satisfied with the testing and we are ready to proceed with the install. Let us know if there is anything else we can do to help move this forward.

@edwardhartnett
Contributor Author

@junwang-noaa , @BrianCurtis-NOAA , @JacobCarley-NOAA and @aerorahul This issue is a great example of the extra costs that accompany manual software testing, and that's why no one does it anymore, outside the UFS.

Three weeks have been spent on manual testing, and we still don't have a clear answer. And the next time around, we will do everything all over again, again manually. Every time we want to test we will have multiple expensive PhD-programmers spend N weeks on easily automated tasks. (Recently we went through the same with the UPP.)

This has happened for the last few decades, and is intended to happen for the next few decades. Consider the accumulated cost of all this manual testing! (Probably 10 to 15% of all project costs are being spent on manual testing.)

Furthermore, the waiting drives costs into other teams. It's not just the UFS team that has been waiting three weeks, other teams have been spending time on this and would like to tick it off the list, but instead are marking time and waiting.

With most scientific software (ex. zstd, netcdf-c, HDF5, NCEPLIBS, UFS_UTILS, etc.) we can run the unit tests after the install and know whether it worked or not within 10 minutes, without waiting for other human involvement. It's a shame that the UFS code cannot provide this very basic level of self-testing, as other projects do. (And have done, for decades.) Most other projects have fewer resources, which no doubt helps guide them to more efficient methods.

The weeks spent in even one testing cycle would provide ample time to start on automated testing, reducing the test burden next time. Eventually, no manual testing would be needed. Since it is UFS programmers who are doing the manual testing, it is also the UFS team which will directly benefit from freeing time currently spent on manual testing. If anyone on the UFS team wants to put more efficient methods in place, I'm happy to help. But I can't do it alone.

It may be that in the past, with ample schedule and plenty of resources, this kind of delay and additional cost was acceptable. However, I would suggest that in today's fast-paced UFS ecosystem, it's time to tighten up and remove these inefficient time-sinks.

How about a UFS system that can install and self-test without a three-week wait? This is easily within reach.

@junwang-noaa

@edwardhartnett I am confused. I thought that the libraries (zstd, netcdf-c, HDF5, NCEPLIBS, UFS_UTILS, etc.) have unit tests, so the new feature or the installation should be well tested. It seems what you are saying is that if the library is updated, it's the application's responsibility to test whether the library feature or the installation is working. Please note that the UFS system is public to the community and anyone can test it without any waiting time. If you don't know how to access and test UFS WM, below are the commands.

git clone --recursive https://github.com/ufs-community/ufs-weather-model
cd ufs-weather-model
cd modulefiles  # update module files with new library
cd ../tests
./rt.sh 

Further documentation can be found at:

https://github.com/ufs-community/ufs-weather-model/wiki/Running-regression-test-using-rt.sh

Code managers have been given several trainings on testing UFS WM, but so far testing libraries still falls on code managers on wcoss2.

@edwardhartnett
Contributor Author

I guess I am puzzled as to why this is taking so long to test, if you have such a simple test apparatus ready to use.

Please let us know when testing is complete and we can move on to the next phase.

@junwang-noaa

junwang-noaa commented Aug 26, 2024

@edwardhartnett We have discussed this already in the UFS WM issue. First of all, acorn has not always been stable over the past several weeks, and it is the only platform on which we were asked to test the library installation. Second, we can test whether the library is installed correctly (at least some verification has been done). At this time, it looks to me like the installation issue with the previous ESMF PIO error has shown up, but the zstd library installation should have nothing to do with that. We have kept testing it, which takes more time.

@edwardhartnett
Contributor Author

The presence of zstd in netCDF is something that can be detected at build time. Let's modify the ufs_weather_model build system so that any problems in using zstd can be detected immediately at configure-time. I will take a look.

@junwang-noaa

So there is no testing/log in ESMF installation script to confirm that zstd netcdf is used? UFS WM build system has to be used to identify the installation issue?

@edwardhartnett
Contributor Author

edwardhartnett commented Aug 26, 2024

ESMF should also check the version and capabilities of netCDF.

The UFS WM build system should check, as is standard with CMake systems.

Every problem that the UFS WM build can find is an expensive delay that can easily be avoided. Every time we have an expensive delay, we need to ask if some CMake checking would have saved that time. The CMake build should not allow the software to be built unless all necessary dependencies (and their capabilities) are present.

If ESMF does not test these things, that's unfortunate, but makes it even more important that UFS WM does.
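
As a rough illustration of that kind of early check, here is a minimal shell sketch. It assumes the netCDF installation ships a `netcdf_meta.h` header containing a line of the form `#define NC_HAS_ZSTD 1`; the function name `netcdf_has_zstd` is hypothetical, and a real UFS WM check would more likely live in the CMake build itself.

```shell
# Report whether a netCDF meta header advertises zstd support.
# Assumption: the header contains "#define NC_HAS_ZSTD 1" when zstd is enabled.
netcdf_has_zstd() {
    header=$1
    if [ ! -r "$header" ]; then
        # Header missing or unreadable: we cannot tell either way.
        echo "unknown"
        return 2
    fi
    if grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+NC_HAS_ZSTD[[:space:]]+1([^0-9]|$)' "$header"; then
        echo "yes"
    else
        echo "no"
        return 1
    fi
}
```

Something like `netcdf_has_zstd "$(nc-config --includedir)/netcdf_meta.h" || exit 1` near the top of a build script would then stop the build before any model code is compiled.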

@edwardhartnett
Contributor Author

ATMLND test failed. Hang is investigating further. This may be related to current problems on acorn.

@BrianCurtis-NOAA

@edwardhartnett @Hang-Lei-NOAA make sure you use the following on Acorn in tests/fv3_conf/fv3_qsub.in_acorn:

# Needed as WAR for SEGV in MPI_Finalise() under OFED-5.8 and higher.
export FI_OFI_RXM_USE_SRX=0
export FI_VERBS_PREFER_XRC=0

This is a temporary workaround while the Acorn SAs fix vendor issues.

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Aug 30, 2024 via email

@edwardhartnett
Contributor Author

@Hang-Lei-NOAA have you rerun the ATMLND test with @BrianCurtis-NOAA 's setting?

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Sep 5, 2024 via email

@JacobCarley-NOAA

I talked with NCO this morning and we might have a possible path forward. I'm paraphrasing here, but essentially they've asked us to reach out directly to the SPA team (i.e. Steven and Justin) and work with them to install and test zstd on the main machine (wcoss2 backup). This will remove any potential issues with Acorn. I'm happy to help coordinate this - just let me know.

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Sep 6, 2024 via email

@edwardhartnett
Contributor Author

edwardhartnett commented Sep 23, 2024

Acorn compiler has issues. Last week it built netCDF, this week it cannot.

Hang installed stack on cactus. Our regression tests are failing because of slight differences in the verification data. Last month's version of UFS does work. Current development branch fails only because of these differences in the verification data.

Hang tried to repeat results on acorn but it is still unstable.

@junwang-noaa and @DusanJovic-NOAA Hang will be sending an email with these results this morning.

Note that none of this actually had anything to do with zstd installation. ;-)

As part of this effort @edwardhartnett asked the MAPL and ESMF teams to add some unit tests that use zstd and they have agreed to do so in a future release. We have also started running MAPL unit tests whenever MAPL is installed. @AlexanderRichert-NOAA will add MAPL and ESMF to the packages tested weekly, automatically with spack installs.

@Hang-Lei-NOAA

Summary of the zstd tests on the WCOSS2 machines:
Acorn: the C compiler is not stable. Sometimes it does not compile correctly.
Dogwood/Cactus: installation for software verification purposes for GDIT.
UFS tests were run with the new and old (two months ago) UFS develop versions. The full tests finished without a break: some of the UFS tests passed with results identical to the baseline. Some of them finished, but not all data files are identical.
I noticed that the model has also been modified. Both the new UFS and the two-month-old UFS show differences.

@junwang-noaa

@Hang-Lei-NOAA can you provide the module file and the location of the new UFS develop branch on cactus/dogwood?

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Sep 24, 2024 via email

@edwardhartnett
Contributor Author

Can we close this issue?

@Hang-Lei-NOAA

@edwardhartnett Brian's tests find that the zstd results are not identical between different runs. Could you confirm that zstd will give consistent results when the cases are repeated?

@edwardhartnett
Contributor Author

Yes, zstd is lossless. It will give the same results every time.
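
One way to narrow down where run-to-run differences appear is a plain byte-for-byte comparison of the output files from two repeated runs. The sketch below is only an illustration: the function name `compare_runs` and the assumption that outputs are `*.nc` files sitting directly in each run directory are mine. Note that a byte comparison is stricter than a data comparison with nccmp: identical bytes imply identical data, but differing bytes do not by themselves prove that the data values differ.

```shell
# List *.nc files that are missing or differ byte-for-byte between two run directories.
# Returns 0 only when every file in dir_a has an identical counterpart in dir_b.
compare_runs() {
    dir_a=$1
    dir_b=$2
    rc=0
    for file_a in "$dir_a"/*.nc; do
        # Skip the unexpanded pattern when dir_a has no .nc files.
        [ -e "$file_a" ] || continue
        name=${file_a##*/}
        if [ ! -e "$dir_b/$name" ]; then
            echo "MISSING $name"
            rc=1
        elif ! cmp -s "$file_a" "$dir_b/$name"; then
            echo "DIFF $name"
            rc=1
        fi
    done
    return $rc
}
```

Running it on two reruns of the same case with the same settings shows whether the differences are reproducible at all before anyone digs into compression settings.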

@BrianCurtis-NOAA

https://github.com/ufs-community/ufs-weather-model/pull/2444/files

@edwardhartnett We've switched all ideflate to zstandard_level and if ideflate was 1 we made zstandard_level 5, otherwise 0. Maybe we're not doing this move quite right. If you have any ideas, please let me know.
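
For reference, the mapping described above (ideflate 1 becomes zstandard_level 5, anything else becomes 0) can be sketched as a small filter. This is an illustration only, not the change made in the PR: the function name `map_ideflate` is hypothetical, and it assumes each setting sits on its own line as `key: value ...`, which may not match the real model_configure layout. Column alignment is not preserved.

```shell
# Rewrite "ideflate: N ..." lines as "zstandard_level: M ...",
# where M is 5 when N is 1 and 0 otherwise. Other lines pass through unchanged.
# Assumption: one "key: value" setting per line.
map_ideflate() {
    awk '
        $1 == "ideflate:" {
            out = "zstandard_level:"
            for (i = 2; i <= NF; i++)
                out = out " " (($i == 1) ? 5 : 0)
            print out
            next
        }
        { print }
    '
}
```

It reads from standard input, so `map_ideflate < model_configure` prints the converted settings without touching the original file.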

@edwardhartnett
Contributor Author

That sounds correct.

When you switch back to zlib do you still see the differences?

Are you using quantize as well?

@BrianCurtis-NOAA

> That sounds correct.
>
> When you switch back to zlib do you still see the differences?
>
> Are you using quantize as well?

If we're using zstandard_level, does it require a quantize_mode and/or a quantize_nsd? I can see places where we don't have a quantize_mode listed with a zstandard_level in our model_configure.

@edwardhartnett
Contributor Author

edwardhartnett commented Nov 1, 2024

Zstd will work with and without quantize. Just like zlib.

@BrianCurtis-NOAA

@Hang-Lei-NOAA Did you build the nccmp with the netcdf you built with zstd support?

@edwardhartnett
Contributor Author

Also, were netcdf-c and netcdf-fortran tests run? When the tests are run, look for the zstd test. If it is running and passing, zstd is working.

Also are we running the parallel I/O tests for netcdf-c and netcdf-fortran? To do so, both packages require: --enable-parallel-tests --with-mpiexec="srun -A MYPROJ" where MYPROJ is whatever project you can charge jobs to. (I'm not sure if it's -A or -a...)

The point of the --with-mpiexec="" argument is to specify another command to use to run jobs instead of mpiexec.

These tests should always be run and checked before any debugging work happens with any application.
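
To make "look for the zstd test" less of a manual step, the captured output of `make check` can be scanned for it. The sketch below assumes the Automake-style `PASS: name` / `FAIL: name` result lines and that the zstd test has "zstd" somewhere in its name; the function name `zstd_tests_passed` and the log file name are hypothetical.

```shell
# Succeed only if at least one test with "zstd" in its name passed
# and no such test failed, errored, or passed unexpectedly.
# Input: a file holding captured "make check" output.
zstd_tests_passed() {
    log=$1
    # Any bad result for a zstd test means zstd is not working.
    if grep -Eq '^(FAIL|ERROR|XPASS): .*zstd' "$log"; then
        return 1
    fi
    # A skipped or absent zstd test also counts as "not verified".
    grep -Eq '^PASS: .*zstd' "$log"
}
```

For example, after `make check 2>&1 | tee check.log`, running `zstd_tests_passed check.log || echo "zstd not verified"` flags installs where the zstd test never ran.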

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Nov 1, 2024 via email

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Nov 1, 2024 via email

@BrianCurtis-NOAA

The system installed nccmp is not. Since this has not been passed to GDIT.

On Fri, Nov 1, 2024 at 3:09 PM Hang Lei - NOAA Affiliate wrote:
@brian which nccmp are you referring to?

Is there any reason to think that if the nccmp is not built with the netcdf with zstd that it would report differences even if there were none?

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Nov 4, 2024 via email

@BrianCurtis-NOAA

@Hang-Lei-NOAA am I doing something wrong here:

brian.curtis@clogin05:/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/zstd_netcdf/tests> !8680
nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format /lfs/h2/emc/ptmp/brian.curtis/FV3_RT/REGRESSION_TEST/control_p8_intel/sfcf000.nc /lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_250036/control_p8_intel/sfcf000.nc
nccmp: error while loading shared libraries: libnetcdf.so.19: cannot open shared object file: No such file or directory
brian.curtis@clogin05:/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/zstd_netcdf/tests> nccmp --help
nccmp: error while loading shared libraries: libnetcdf.so.19: cannot open shared object file: No such file or directory
brian.curtis@clogin05:/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/zstd_netcdf/tests> module show nccmp
---------------------------------------------------------------------------------------------------------
   /lfs/h2/emc/eib/save/hang.lei/forgdit/nco_wcoss2/install2/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.12/nccmp/1.8.9.0.lua:
---------------------------------------------------------------------------------------------------------
help([[]])
conflict("nccmp")
prepend_path("PATH","/lfs/h2/emc/eib/save/hang.lei/forgdit/nco_wcoss2/install2/intel-19.1.3.304/cray-mpich-8.1.12/nccmp/1.8.9.0/bin")
prepend_path("MANPATH","/lfs/h2/emc/eib/save/hang.lei/forgdit/nco_wcoss2/install2/intel-19.1.3.304/cray-mpich-8.1.12/nccmp/1.8.9.0/share/man")
setenv("NCCMP_ROOT","/lfs/h2/emc/eib/save/hang.lei/forgdit/nco_wcoss2/install2/intel-19.1.3.304/cray-mpich-8.1.12/nccmp/1.8.9.0")
setenv("NCCMP_VERSION","1.8.9.0")
whatis("Name: nccmp")
whatis("Version: 1.8.9.0")
whatis("Category: library")
whatis("Description: NetCDF Comparision Utility")
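
The loader error above suggests this nccmp binary was linked against `libnetcdf.so.19` and cannot find it at run time, possibly because the matching netcdf module is not loaded. A quick way to see every unresolved library at once is to filter `ldd` output; the helper name `missing_libs` below is hypothetical, and this is only a diagnostic sketch.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on standard input.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Typical use would be `ldd "$(command -v nccmp)" | missing_libs`, run once with only the nccmp module loaded and once with the full module suite, to see which environment actually resolves the netCDF library.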

@BrianCurtis-NOAA

@Hang-Lei-NOAA It seems that when I try to load just nccmp/pnetcdf/netcdf/hdf5, it doesn't like it. When I load the full modulefile suite, it won't load nccmp (from your location). As soon as intel/19.1.3.304 is loaded, it goes straight to the NCO version of nccmp.
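
Since modules work largely by prepending directories to `PATH`, listing every nccmp on the search path in order shows which copy wins after each `module load`. The helper below is a hypothetical sketch (the name `find_on_path` is not from any tool); `type -a nccmp` gives similar information in bash.

```shell
# Print every executable named "$1" found on a PATH-style list, in search order.
# The list defaults to the current $PATH; the first line printed is the one the shell runs.
find_on_path() {
    name=$1
    path_list=${2:-$PATH}
    old_ifs=$IFS
    IFS=:
    for dir in $path_list; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            echo "$dir/$name"
        fi
    done
    IFS=$old_ifs
}
```

Running `find_on_path nccmp` before and after loading intel/19.1.3.304 would show whether the NCO copy is being placed ahead of the one under the test install.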

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Nov 4, 2024 via email
