zstd on WCOSS2 for UFS #3
Comments
FYI. The corresponding ufs weather model issue is at: ufs-community/ufs-weather-model#2319 |
Still waiting for testing on WCOSS2. NCO must review zstd. Once testing is finished, Hang will request the review. |
Some email activity:
@aerorahul @JacobCarley-NOAA @Hang-Lei-NOAA please discuss this topic on this issue so we have a full record. @junwang-noaa has the UFS team been able to test zstd on WCOSS2? I do not believe we should proceed until after this testing has taken place. Is there some reason it can't be done? |
Please see discussion on ufs weather model issue #2319. |
OK, @junwang-noaa please let us know when you are satisfied with the testing and we are ready to proceed with the install. Let us know if there is anything else we can do to help move this forward. |
@junwang-noaa, @BrianCurtis-NOAA, @JacobCarley-NOAA and @aerorahul This issue is a great example of the extra costs that accompany manual software testing, which is why no one does it anymore outside the UFS. Three weeks have been spent on manual testing, and we still don't have a clear answer. The next time around, we will do everything all over again, again manually. Every time we want to test, multiple expensive PhD-level programmers will spend N weeks on easily automated tasks. (Recently we went through the same thing with the UPP.) This has happened for the last few decades, and is intended to happen for the next few decades. Consider the accumulated cost of all this manual testing! (Probably 10 to 15% of all project costs are being spent on manual testing.)
Furthermore, the waiting drives costs into other teams. It's not just the UFS team that has been waiting three weeks. Other teams have been spending time on this and would like to tick it off the list, but instead are marking time and waiting.
With most scientific software (e.g. zstd, netcdf-c, HDF5, NCEPLIBS, UFS_UTILS, etc.) we can run the unit tests after the install and know within 10 minutes whether it worked, without waiting for other human involvement. It's a shame that the UFS code cannot provide this very basic level of self-testing, as other projects do (and have done for decades). Most other projects have fewer resources, which no doubt helps guide them to more efficient methods.
The weeks spent in even one testing cycle would provide ample time to start on automated testing, reducing the test burden next time. Eventually, no manual testing would be needed. Since it is UFS programmers who are doing the manual testing, it is also the UFS team that will directly benefit from freeing the time currently spent on it. If anyone on the UFS team wants to put more efficient methods in place, I'm happy to help. But I can't do it alone.
It may be that in the past, with ample schedule and plenty of resources, this kind of delay and additional cost was acceptable. However, I would suggest that in today's fast-paced UFS ecosystem, it's time to tighten up and remove these inefficient time sinks. How about a UFS system that can install and self-test without a three-week wait? This is easily within reach. |
@edwardhartnett I am confused. I thought that the libraries (zstd, netcdf-c, HDF5, NCEPLIBS, UFS_UTILS, etc.) have unit tests, so the new feature or the installation should already be well tested. It seems you are saying that when a library is updated, it is the application's responsibility to test whether the library feature or the installation works. Please note that the UFS system is public to the community, and anyone can test it without any waiting time. If you don't know how to access and test the UFS WM, below are the commands.
Further documentation can be found at: https://github.com/ufs-community/ufs-weather-model/wiki/Running-regression-test-using-rt.sh Code managers have been given several trainings on testing UFS WM, but so far testing libraries still falls on code managers on wcoss2. |
I guess I am puzzled as to why this is taking so long to test, if you have such a simple test apparatus ready to use. Please let us know when testing is complete and we can move on to the next phase. |
@edwardhartnett We have discussed this already in the UFS WM issue. First of all, acorn has not always been stable over the past several weeks, and it is the only platform on which we were asked to test the library installation. Second, we can test whether the library is installed correctly (at least some verification has been done). At this time, it looks to me like the installation issue with the previous ESMF PIO error has shown up again. This zstd library installation should have nothing to do with that, but we have kept testing it, which takes longer. |
The presence of zstd in netCDF is something that can be detected at build time. Let's modify the ufs_weather_model build system so that any problems in using zstd can be detected immediately at configure-time. I will take a look. |
So there is no test or log in the ESMF installation script to confirm that the zstd-enabled netCDF is used? The UFS WM build system has to be used to identify the installation issue? |
ESMF should also check the version and capabilities of netCDF. The UFS WM build system should check too, as is standard with CMake systems. Every problem that the UFS WM build could have caught is an expensive delay that can easily be avoided. Every time we have an expensive delay, we need to ask whether some CMake checking would have saved that time. The CMake build should not allow the software to be built unless all necessary dependencies (and their capabilities) are present. If ESMF does not test these things, that's unfortunate, but it makes it even more important that UFS WM does. |
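As an illustration of the kind of capability check discussed above, here is a minimal sketch in Python (a real UFS WM check would live in CMake, not Python). It assumes that the installed `netcdf_meta.h` header advertises optional features through macros such as `NC_HAS_ZSTD` and `NC_HAS_QUANTIZE`, as recent netcdf-c 4.9.x releases do; verify the macro names against your own header. The function names and the sample header text are hypothetical.

```python
import re


def netcdf_has_feature(header_text: str, macro: str = "NC_HAS_ZSTD") -> bool:
    """Return True if `macro` is #defined to a non-zero integer in header_text."""
    pattern = re.compile(
        r"^\s*#\s*define\s+" + re.escape(macro) + r"\s+(\d+)\b", re.MULTILINE
    )
    match = pattern.search(header_text)
    return bool(match) and int(match.group(1)) != 0


def require_features(header_text: str, macros) -> None:
    """Raise early (the analogue of a configure-time failure) if a feature is missing."""
    missing = [m for m in macros if not netcdf_has_feature(header_text, m)]
    if missing:
        raise RuntimeError("netCDF build lacks: " + ", ".join(missing))


# Hypothetical excerpt standing in for the contents of an installed netcdf_meta.h.
SAMPLE_HEADER = """\
#define NC_VERSION_MAJOR 4
#define NC_HAS_ZSTD 1
#define NC_HAS_QUANTIZE 1
#define NC_HAS_SZIP_WRITE 0
"""

if __name__ == "__main__":
    require_features(SAMPLE_HEADER, ["NC_HAS_ZSTD", "NC_HAS_QUANTIZE"])
    print("zstd and quantize are advertised by this netCDF build")
```

A header check like this only shows that netCDF was built with zstd support. Because netCDF applies zstd through an HDF5 filter plugin, a run-time check that `HDF5_PLUGIN_PATH` points at the plugin directory is still worthwhile.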
ATMLND test failed. Hang is investigating further. This may be related to current problems on acorn. |
@edwardhartnett @Hang-Lei-NOAA make sure you use the following on Acorn in tests/fv3_conf/fv3_qsub.in_acorn:
# Needed as WAR for SEGV in MPI_Finalise() under OFED-5.8 and higher.
export FI_OFI_RXM_USE_SRX=0
export FI_VERBS_PREFER_XRC=0
This is a temporary workaround while the Acorn SAs fix vendor issues. |
@BrianCurtis-NOAA Thanks, Brian, for the information. I had missed it. |
@Hang-Lei-NOAA have you rerun the ATMLND test with @BrianCurtis-NOAA 's setting? |
Acorn has issues: ATMLND passed previously, but I cannot repeat that success now. I am switching to an alternative test on WCOSS2 with the develop branch. |
I talked with NCO this morning and we might have a possible path forward. I'm paraphrasing here, but essentially they've asked us to reach out directly to the SPA team (i.e. Steven and Justin) and work with them to install and test zstd on the main machine (wcoss2 backup). This will remove any potential issues with Acorn. I'm happy to help coordinate this - just let me know. |
Thanks to Jacob for his leadership. We will quickly work this out and start the code delivery. |
Acorn compiler has issues. Last week it built netCDF, this week it cannot. Hang installed stack on cactus. Our regression tests are failing because of slight differences in the verification data. Last month's version of UFS does work. Current development branch fails only because of these differences in the verification data. Hang tried to repeat results on acorn but it is still unstable. @junwang-noaa and @DusanJovic-NOAA Hang will be sending an email with these results this morning. Note that none of this actually had anything to do with zstd installation. ;-) As part of this effort @edwardhartnett asked the MAPL and ESMF teams to add some unit tests that use zstd and they have agreed to do so in a future release. We have also started running MAPL unit tests whenever MAPL is installed. @AlexanderRichert-NOAA will add MAPL and ESMF to the packages tested weekly, automatically with spack installs. |
Summary of the zstd tests on the WCOSS2 machines: |
@Hang-Lei-NOAA can you provide the module file and the location of the new UFS develop branch on cactus/dogwood? |
@junwang-noaa and @BrianCurtis-NOAA
Please see the modulefile on cactus:
/lfs/h2/emc/eib/noscrub/hang.lei/ufsdevelop/modulefiles/ufs_wcoss2.intel.lua
Some tests run identically.
Some are not identical (aerosol-related, atmlnd, etc.). |
Can we close this issue? |
@edwardhartnett Brian's tests find that the results with zstd are not identical between runs. Could you confirm that zstd will give consistent results when the cases are repeated? |
Yes, zstd is lossless. It will give the same results every time. |
https://github.com/ufs-community/ufs-weather-model/pull/2444/files @edwardhartnett We've switched all ideflate to zstandard_level and if ideflate was 1 we made zstandard_level 5, otherwise 0. Maybe we're not doing this move quite right. If you have any ideas, please let me know. |
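The mapping described in this comment can be written down as a small sketch. It assumes `model_configure` holds simple `key: value` lines with a bare integer value; the helper names are hypothetical and this is not the code from the pull request.

```python
def zstandard_level_from_ideflate(ideflate: int) -> int:
    """Rule as stated in the thread: ideflate == 1 becomes level 5, anything else 0."""
    return 5 if ideflate == 1 else 0


def convert_line(line: str) -> str:
    """Rewrite an 'ideflate: N' line as 'zstandard_level: M'; pass other lines through."""
    key, sep, value = line.partition(":")
    if sep and key.strip() == "ideflate":
        level = zstandard_level_from_ideflate(int(value.strip()))
        return f"zstandard_level: {level}"
    return line


if __name__ == "__main__":
    for original in ["ideflate: 1", "ideflate: 0", "quantize_nsd: 5"]:
        print(f"{original!r} -> {convert_line(original)!r}")
```

Note that the rule as stated sends every `ideflate` value other than 1 to level 0, so a test that previously used a different deflate level would end up uncompressed.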
That sounds correct. When you switch back to zlib, do you still see the differences? Are you using quantize as well? |
If we're using zstandard_level, does it require a quantize_mode and/or a quantize_nsd? I can see places where we don't have a quantize_mode listed with a zstandard_level in our model_configure. |
Zstd will work with and without quantize, just like zlib. |
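To make the lossless/lossy distinction concrete, here is a small sketch. Two stand-ins are used and should not be mistaken for the real thing: Python's standard-library `zlib` stands in for zstd (both are lossless compressors), and rounding to a number of significant decimal digits stands in for netCDF's quantize algorithms, which work differently internally. The sample values are made up.

```python
import struct
import zlib


def pack_doubles(values):
    """Serialize floats as little-endian 64-bit doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def unpack_doubles(blob):
    """Inverse of pack_doubles."""
    return list(struct.unpack(f"<{len(blob) // 8}d", blob))


def quantize(values, nsd):
    """Keep about `nsd` significant decimal digits (simplified, lossy stand-in)."""
    return [float(f"{v:.{nsd - 1}e}") for v in values]


field = [273.15678, 101325.4321, 0.000123456789]  # made-up sample values
raw = pack_doubles(field)

# Lossless compression: the round trip returns exactly the original values.
round_trip_ok = unpack_doubles(zlib.decompress(zlib.compress(raw, 6))) == field

# Same input and same level give the same compressed bytes on repeated calls.
repeatable = zlib.compress(raw, 6) == zlib.compress(raw, 6)

# Quantization, by contrast, changes the stored values.
quantized = quantize(field, 3)

if __name__ == "__main__":
    print("round trip identical:", round_trip_ok)
    print("compression repeatable:", repeatable)
    print("quantized values:", quantized)
```

If results differ between runs, the compressor itself is therefore an unlikely cause. Differences in quantize settings, or in the model output before compression, are more plausible places to look.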
@Hang-Lei-NOAA Did you build the nccmp with the netcdf you built with zstd support? |
Also, were netcdf-c and netcdf-fortran tests run? When the tests are run, look for the zstd test. If it is running and passing, zstd is working.
Also are we running the parallel I/O tests for netcdf-c and netcdf-fortran? To do so, both packages require: --enable-parallel-tests --with-mpiexec="srun -A MYPROJ" where MYPROJ is whatever project you can charge jobs to. (I'm not sure if it's -A or -a...)
The point of the --with-mpiexec="" argument is to specify another command to use to run jobs instead of mpiexec.
These tests should always be run and checked before any debugging work happens with any application. |
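Checking whether the zstd test ran and passed can itself be automated. The sketch below assumes Automake-style result lines (`PASS: name`, `FAIL: name`, and so on), as printed by an autotools `make check`; a CMake/ctest build prints a different format. The test names in the sample log are made-up placeholders, not the real netcdf-c test names.

```python
import re

# Automake prints one "<STATUS>: <test name>" line per test.
RESULT_LINE = re.compile(r"^(PASS|FAIL|XFAIL|XPASS|SKIP|ERROR): (\S+)")


def results_matching(log_text: str, keyword: str) -> dict:
    """Map test name -> status for every result line whose name contains keyword."""
    results = {}
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line)
        if match and keyword in match.group(2).lower():
            results[match.group(2)] = match.group(1)
    return results


def feature_tested_and_passing(log_text: str, keyword: str = "zstd") -> bool:
    """True only if at least one matching test ran and every matching test passed."""
    results = results_matching(log_text, keyword)
    return bool(results) and all(status == "PASS" for status in results.values())


# Hypothetical log excerpt; the test names are placeholders.
SAMPLE_LOG = """\
PASS: tst_example_basic
PASS: tst_example_zstd.sh
SKIP: tst_example_szip.sh
"""

if __name__ == "__main__":
    print("zstd tested and passing:", feature_tested_and_passing(SAMPLE_LOG))
```

Treating "no zstd test found" as a failure is deliberate: a test that is silently skipped says nothing about whether zstd works.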
@BrianCurtis-NOAA which nccmp are you referring to? |
The system-installed nccmp is not built with the zstd-enabled netCDF, since this has not been passed to GDIT. |
Is there any reason to think that if the nccmp is not built with the netcdf with zstd that it would report differences even if there were none? |
@BrianCurtis-NOAA I tested the newly built nccmp with other modules. The results do not seem to be affected.
Please test the nccmp built against the zstd-enabled netCDF:
/lfs/h2/emc/eib/save/hang.lei/forgdit/nco_wcoss2/install2/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.12/nccmp/1.8.9.0.lua |
@Hang-Lei-NOAA am I doing something wrong here:
|
@Hang-Lei-NOAA When I try to load just nccmp/pnetcdf/netcdf/hdf5, it doesn't like it. When I load the full modulefile suite, it won't load nccmp (from your location). As soon as intel/19.1.3.304 is loaded, it goes straight to the NCO version of nccmp. |
@BrianCurtis-NOAA I am trying to check your loading script and block it from loading the system library directly, but cactus seems to be having problems. I have not been able to log in to the system for the past hour, so I have to wait. |
Enable and install zstd compression.
Currently testing on acorn.