
As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 2d40c6af06573e03e5a429d6ebf348e5194ca3d68b7531741706e7842538c390 wres-20250311-3bce492.zip #448

Closed
10 tasks done
HankHerr-NOAA opened this issue Mar 11, 2025 · 43 comments
Labels
deployment Work related to deployments
Milestone

Comments

@HankHerr-NOAA
Contributor

HankHerr-NOAA commented Mar 11, 2025

This deployment will not be preceded by a dependency update.

  • 1 - Tag a commit as staging to kick off pre-release workflow (Pre-release github action; this tests db migration; see the sketch after this list)
  • 2 - Nominate a commit that has passed the pre-release workflow (Pre-release github action)
  • 3 - Initiate staging release (initiate-release-deployment Jenkins workflow; emails will report test results for external service, system tests, and 900 series)
  • 4 - Test candidate with identified alpha (internal) User Acceptance Tests (-ti)
  • 5 - Test candidate as a standalone.
  • 6 - Push yml files changed
  • 7 - Tag the commits that produced the release that passed everything and was already deployed to -prod
  • 8 - Update database documentation in wiki
  • 9 - Update user documentation
  • 10 - Close GitHub tickets actually completed in this release and move tickets not completed to backlog
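
For step 1, a minimal sketch of tagging the nominated commit, assuming the pre-release GitHub Action is triggered by a tag literally named `staging` (the tag name, remote name, and forced tag move are assumptions here, not the documented VLab procedure; Java is used only because it is the project language):

```java
// Hypothetical sketch: point a "staging" tag at the nominated commit and push it
// so the pre-release workflow runs. Tag name, remote, and forced move are assumptions.
import java.io.IOException;

public class TagStagingCandidate
{
    public static void main( String[] args ) throws IOException, InterruptedException
    {
        String commit = args.length > 0 ? args[0] : "3bce492"; // candidate revision
        run( "git", "tag", "--force", "staging", commit );      // (re)point the staging tag
        run( "git", "push", "--force", "origin", "staging" );   // push to kick off the workflow
    }

    private static void run( String... command ) throws IOException, InterruptedException
    {
        Process process = new ProcessBuilder( command ).inheritIO().start();
        if ( process.waitFor() != 0 )
        {
            throw new IllegalStateException( "Command failed: " + String.join( " ", command ) );
        }
    }
}
```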
@HankHerr-NOAA HankHerr-NOAA added the deployment Work related to deployments label Mar 11, 2025
@HankHerr-NOAA HankHerr-NOAA added this to the v6.30 milestone Mar 11, 2025
@HankHerr-NOAA
Contributor Author

James:

If you have a change that you think I should wait for, let me know. Otherwise, I hope to start the deployment process following the instructions in VLab in about a half-hour. Thanks,

Hank

@james-d-brown
Collaborator

No, not especially.

@HankHerr-NOAA
Contributor Author

I think I see PR #449, addressing issue #228, in process. I'll wait for that to be done and then nominate that revision for deployment.

Hank

@HankHerr-NOAA
Contributor Author

Anyone know why the external services tests have failed twice in the past couple of days? The failure appears to be related to NWIS.

Hank

@HankHerr-NOAA
Contributor Author

The latest test reported a "server error" when reaching out to NWIS,

08:19:10.998 [Test worker] DEBUG wres.http.WebClient -- Got server error from https://nwis.waterservices.usgs.gov/nwis/iv/?format=json&indent=on&sites=01631000&parameterCd=00060&startDT=2020-03-01T00:00:00Z&endDT=2020-04-30T23:59:59Z in PT0.035378378S.

NwisTest > canGetAndParseResponseFromNwisWithWebClientAndJacksonPojos() FAILED
com.fasterxml.jackson.core.JsonParseException at NwisTest.java:60

Maybe this is related to the same NWIS issue that triggered retries, which resulted in a connection leak that James addressed. Point is, I don't think this is the WRES; I think it's NWIS.
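
For reference, a minimal sketch of reproducing the check outside the test harness, using the plain JDK HttpClient and Jackson rather than the wres.http.WebClient the test uses (jackson-databind on the classpath is assumed). A server-side outage typically returns a non-JSON error body, which then surfaces as the JsonParseException seen above:

```java
// Minimal sketch, not the NwisTest code: issue the same NWIS request and attempt to
// parse the response with Jackson. A JsonParseException indicates NWIS returned
// something other than JSON, e.g., an HTML error page during an outage.
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NwisProbe
{
    public static void main( String[] args ) throws Exception
    {
        String url = "https://nwis.waterservices.usgs.gov/nwis/iv/?format=json&indent=on"
                     + "&sites=01631000&parameterCd=00060"
                     + "&startDT=2020-03-01T00:00:00Z&endDT=2020-04-30T23:59:59Z";

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send( HttpRequest.newBuilder( URI.create( url ) ).GET().build(),
                       HttpResponse.BodyHandlers.ofString() );

        System.out.println( "HTTP status: " + response.statusCode() );

        // Throws JsonParseException if the body is not JSON.
        new ObjectMapper().readTree( response.body() );
        System.out.println( "Parsed JSON successfully." );
    }
}
```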

Hank

@james-d-brown
Collaborator

Yeah, just issues with NWIS, I think. Manually executed evaluations that rely on NWIS data were failing too.

As an aside, these tests should be renamed as WRES external services tests rather than WRES WRDS SERVICE TEST in the e-mail notification.

@HankHerr-NOAA
Contributor Author

Thanks, James.

Waiting for the merge to complete. Looks like it needs to run through checks/testing using master now. I'll also have to wait on a round of testing by Jenkins to complete. Hopefully, by 9:30 EST or sooner, we can start the deploy.

Hank

@HankHerr-NOAA
Contributor Author

Finally. Took over 17 minutes to complete the system tests for the merge, and GitHub only runs a subset of the system tests.

Step 1: "Tag a commit as staging to kick off pre-release workflow". Doing that now,

Hank

@HankHerr-NOAA HankHerr-NOAA changed the title As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: TBD wres-TBD.zip As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 3bce4921ad05c85dc94c1b3d3f6e2c63278a7007wres-TBD.zip Mar 11, 2025
@HankHerr-NOAA
Contributor Author

Commit 3bce492 tagged. Workflows are running.

Hank

@HankHerr-NOAA HankHerr-NOAA changed the title As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 3bce4921ad05c85dc94c1b3d3f6e2c63278a7007wres-TBD.zip As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 3bce4921ad05c85dc94c1b3d3f6e2c63278a7007 wres-20250311-3bce492.zip Mar 11, 2025
@HankHerr-NOAA HankHerr-NOAA changed the title As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 3bce4921ad05c85dc94c1b3d3f6e2c63278a7007 wres-20250311-3bce492.zip As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 2d40c6af06573e03e5a429d6ebf348e5194ca3d68b7531741706e7842538c390 wres-20250311-3bce492.zip Mar 11, 2025
@HankHerr-NOAA
Contributor Author

Updated the title of this ticket to include the hash of the .zip. Updated the VLab wiki to indicate where to find that hash.
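
As a side note, anyone pulling the candidate can verify the download against the hash in the title; a minimal sketch, assuming that hash is a SHA-256 digest of the zip (the 64 hex characters suggest SHA-256, but confirm against the VLab wiki):

```java
// Hypothetical verification of the release zip against the hash in the ticket title.
// Assumes the hash is SHA-256; reads the whole file into memory, which is fine for a sketch.
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class VerifyReleaseZip
{
    public static void main( String[] args ) throws Exception
    {
        Path zip = Path.of( args.length > 0 ? args[0] : "wres-20250311-3bce492.zip" );
        String expected = "2d40c6af06573e03e5a429d6ebf348e5194ca3d68b7531741706e7842538c390";

        byte[] digest = MessageDigest.getInstance( "SHA-256" ).digest( Files.readAllBytes( zip ) );
        String actual = HexFormat.of().formatHex( digest );

        System.out.println( actual.equals( expected ) ? "Hash matches." : "Hash MISMATCH: " + actual );
    }
}
```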

Hank

@HankHerr-NOAA
Contributor Author

The pre-release database migration test is still ongoing. It's now on the "Fresh migration from upcoming release" step. Hopefully, this will complete by the afternoon. I guess we'll see.

Hank

@james-d-brown
Collaborator

Oh dear, we may be back to the bad old days of a 24+ hour turnaround, even if RC1 succeeds all gates :(

We'll need to investigate that issue as a matter of priority.

@epag
Collaborator

epag commented Mar 11, 2025

Looks like it has finished now, but I agree, the times seem to be ever climbing. I need to focus on finishing this rhel8 migration, but after that I will prioritize looking at this. Or, if I am halted on the rhel8 migration during a potential (likely?) shutdown, I can look at this then, since I think we aren't allowed to do anything production-related during a shutdown.

@HankHerr-NOAA
Contributor Author

Pre-release steps are done. I'm proceeding with deployment Step 3, initiating the release using Jenkins.

Hank

@HankHerr-NOAA
Contributor Author

I submitted the job, which has kicked off a full scenario test. The initiate-release-deployment is job 14.

Once deployed to staging, I'll shift my focus to the VLab ticket and UAT.

Hank

@epag
Collaborator

epag commented Mar 11, 2025

There were some failures and odd behavior, but I am aware of what caused those. Going to fix them quickly and restart a deploy.

@HankHerr-NOAA
Contributor Author

Evan gave me the thumbs up to try again, so I started initiate release job 15.

Hank

@HankHerr-NOAA
Contributor Author

The 900 tests are running particularly slowly due to issues obtaining data from USGS NWIS and may end up dying at some point. I'm told that that would result in the Jenkins workflow halting, preventing deployment to staging.

We can either agree to ignore the 900 series and force the staging deployment, or we can hold up the deployment with some sort of fix for that issue. Though I don't really think the NWIS failures are an issue we can fix.

Thoughts?

Hank

@HankHerr-NOAA
Contributor Author

It's the regular system tests that are taking a while. Specifically, 703. Again, do we wait it out (eventually, it may time out and fail), or do we just trigger the 900 series and move on with the deployment?

Hank

@HankHerr-NOAA
Contributor Author

Evan mentioned killing the system tests, moving to the 900 series, and then deploying. Since this is just going to staging, I think that's a good idea. However, it would be good to see 703 and the external services tests pass before we deploy to production. We can revisit that discussion later.

Hank

@james-d-brown
Collaborator

If this were a critical deploy to address a CVE, I would say go ahead, because the risk/reward would be stacked toward mitigating a bigger risk with a smaller one. However, in this case, I don't think we really want to set a precedent of ignoring gates, even if we can explain them. The problem is that these tests incorporate attributes other than NWIS, and our unit/integration test coverage is not (close to) 100%, more like 50%.

@james-d-brown
Collaborator

I appreciate that this makes us depend on live services (that can break) for our deploy, and it would be better if we could test these same features without that being the case, but, right now, we cannot. I have no issues with attempting to improve resilience in that regard, but that is for later.

@epag
Collaborator

epag commented Mar 11, 2025

It is also worth mentioning that, while the actual deploy artifacts didn't pass the sys tests, we did test the commit already as part of the automated testing pipeline:

[Successful] wres-automated-testing/full-scenario-testing - 69 PASSED, 0 FAILED for commit: 3bce492

Also, it seems like NWIS is now just outright reporting a service outage as well.

@epag
Collaborator

epag commented Mar 11, 2025

I don't have a super strong opinion on whether we wait or push through here in this specific case, but I think we should strive toward separating checks of partner live-service uptime from our deployment process. We shouldn't allow issues with partner services to stop us from making progress/deployments.

@james-d-brown
Collaborator

Agree with that, but there is a difference between that situation and the current situation and we are not deploying an urgent fix, so I think we should wait and then work towards a better separation of concerns. However, I will say that many of these tests against live services were specifically designed to test that capability, so it would require us to stand up an equivalent service/api, at least for the limited data required, which is a headache in itself. But, again, that is a future discussion.

@HankHerr-NOAA
Contributor Author

Since we are close to the end of today, anyway, I'm going to push the deployment to tomorrow morning and re-trigger the initiate-release Jenkins workflow when I get in in the morning. If it continues to not get past the system tests, then we'll probably have to hold off on deployment. If 703 is failing due to USGS NWIS, then some of our UAT is likely to fail, as well, and we'll be forced to delay regardless.

More tomorrow,

Hank

@epag
Collaborator

epag commented Mar 11, 2025

However, I will say that many of these tests against live services were specifically designed to test that capability, so it would require us to stand up an equivalent service/api, at least for the limited data required, which is a headache in itself. But, again, that is a future discussion.

Agreed. I think this feeds into some earlier conversations we have had around not using full/complete datasets for our tests and creating leaner tests/datasets, which it seems would be needed in order to separate live partner services from our deploy pipeline. Also, since we are on Jenkins now with its own host, we can download data there and rely on that instead.

Agree with that, but there is a difference between that situation and the current situation and we are not deploying an urgent fix, so I think we should wait and then work towards a better separation of concerns.

I have no qualms with this

Also, it looks like, based on this test, that we will retry calls for well over an hour... That seems somewhat excessive to me; let me know what you all think:

First failure: T17:23
Last failure: T18:40
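
For discussion, a minimal sketch of the kind of cap that could bound that window: exponential backoff between attempts plus an overall deadline after which we stop retrying. This is illustrative only, not the existing WebClient retry logic:

```java
// Illustrative only: retry an action with exponential backoff, but give up once an
// overall deadline has elapsed. Not the existing WRES WebClient behavior.
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

public class BoundedRetry
{
    public static <T> T callWithDeadline( Callable<T> action,
                                          Duration deadline,
                                          Duration initialBackoff ) throws Exception
    {
        Instant giveUpAt = Instant.now().plus( deadline );
        Duration backoff = initialBackoff;
        Exception last = null;

        while ( Instant.now().isBefore( giveUpAt ) )
        {
            try
            {
                return action.call();
            }
            catch ( Exception e )
            {
                last = e;
                long waitMillis = Math.max( 0, Math.min( backoff.toMillis(),
                        Duration.between( Instant.now(), giveUpAt ).toMillis() ) );
                Thread.sleep( waitMillis );
                backoff = backoff.multipliedBy( 2 ); // double the backoff between attempts
            }
        }

        throw new IllegalStateException( "Gave up after " + deadline, last );
    }
}
```

For example, wrapping a hypothetical request in `callWithDeadline( () -> client.send( request, handler ), Duration.ofMinutes( 10 ), Duration.ofSeconds( 5 ) )` would stop retrying after roughly ten minutes instead of an hour-plus.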

@HankHerr-NOAA
Contributor Author

Well, looks like the full-scenario-testing passed, followed by the 900 series. And it appears to have deployed to staging.

I'll note the details in the VLab ticket,

Hank

@HankHerr-NOAA
Contributor Author

I'll commence with UAT and standalone testing tomorrow morning. Staging COWRES is ready for use,

Hank

@HankHerr-NOAA
Contributor Author

I'm unchecking box 3 for now. The workflow corresponding to box 3 requires passage of the system tests. As Evan pointed out to me in a chat, scenario703 did not pass, so the system tests did not pass. The 900 series passed, and staging has the revision installed from the correct artifact.

Tomorrow, we just need to make sure the system tests can pass using the same artifacts. I'll work with Evan on how best to make sure that happens (he mentioned just stopping the workflow after the system tests pass and while it's in the middle of the 900 series tests). This is assuming USGS NWIS is up and running, of course.

Have a great evening!

Hank

@HankHerr-NOAA
Contributor Author

NWIS appears to be responsive today. I'm going to trigger the initiate-release-deployment workflow and see if the system tests succeed.

Hank

@HankHerr-NOAA
Contributor Author

Just triggered initiate-release-deployment, which triggered full-scenario-testing 16. I'll keep my eyes on it.

I also checked the overnight evaluations in staging and they all succeeded. That's a good sign.

I don't plan to stop the deployment partway, as mentioned yesterday. Instead, I'll start UAT once I've confirmed the system tests have succeeded, or at least that 703 passed, and then expect to be interrupted when it's redeployed to staging. Again, Jenkins is just pulling the artifacts from GitHub; it's not rebuilding them. So what is currently in staging is identical to what is being deployed now.

UAT to start in a bit,

Hank

@HankHerr-NOAA
Contributor Author

System testing passed. I confirmed it manually in the console output. I'm going to start UAT in staging,

Hank

@HankHerr-NOAA
Contributor Author

The artifacts were deployed to staging, again. That forced a pause in UAT. Continuing now,

Hank

@HankHerr-NOAA
Contributor Author

One thing to note: If the artifacts deployed were truly the same, then the images would be the same, and, when pushed to the registry, all of the layers should already exist, so the date/time associated with the images in the registry should not be changed.

I confirmed that this was the case for the worker and tasker images. Thus, I can be confident my tests earlier this morning before the deployment to staging are still valid. Cool.
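
For the record, a hedged sketch of how that could be checked directly, assuming the registry exposes the standard Docker Registry HTTP API v2 and allows anonymous reads (the registry URL, repository name, and tag below are placeholders, not the real WRES registry):

```java
// Hypothetical check: ask the registry for the manifest digest of an image tag. If the
// digest is unchanged after a redeploy, the image content is byte-for-byte identical.
// Registry URL, repository, and tag are placeholders; authentication is omitted.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ImageDigestCheck
{
    public static void main( String[] args ) throws Exception
    {
        String registry = "https://registry.example.gov"; // placeholder
        String repository = "wres/worker";                // placeholder
        String tag = "20250311-3bce492";                  // placeholder

        HttpRequest request = HttpRequest.newBuilder(
                URI.create( registry + "/v2/" + repository + "/manifests/" + tag ) )
                .header( "Accept", "application/vnd.docker.distribution.manifest.v2+json" )
                .GET()
                .build();

        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send( request, HttpResponse.BodyHandlers.discarding() );

        // The registry reports the content-addressable digest in this response header.
        System.out.println( "Digest: "
                + response.headers().firstValue( "Docker-Content-Digest" ).orElse( "<none>" ) );
    }
}
```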

Continuing with UAT,

Hank

@HankHerr-NOAA
Contributor Author

Pull request is in for the Docker files, compose-entry.yml, and compose-workers.yml. See #452. I'll wait for it to pass the checks.

Hank

@HankHerr-NOAA
Contributor Author

No database documentation changes are needed, so I'll check that box. The documentation changes for reference_date_pools are being pushed to 6.31. Checking those boxes.

Hank

HankHerr-NOAA added a commit that referenced this issue Mar 12, 2025
Docker files for v6.30; refs GitHub #448
@HankHerr-NOAA
Contributor Author

I think it's revision ec8f281 that has the Docker file changes, but let me confirm through GitHub.

Hank

@HankHerr-NOAA
Contributor Author

Yes, that's the commit after the merge. Now doing the tagging step,

Hank

@HankHerr-NOAA
Contributor Author

I'm doing another Docker pull request and merge. I took the easy way out and committed the compose from production, which uses different Docker network options, instead of the one from staging, which uses the default options. Evan recommended against that, so I'm pushing/merging again. Once it passes the checks, I'll capture the revision, tag it, and be done with it.

Hank

HankHerr-NOAA added a commit that referenced this issue Mar 12, 2025
Changed to use default net, not different net; refs GitHub #448
@HankHerr-NOAA
Contributor Author

The docker revision is c996384.

Tagging,

Hank

@HankHerr-NOAA
Contributor Author

@HankHerr-NOAA
Contributor Author

The only ticket remaining open is this one. Closing it now.

Hank
