
As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 2d40c6af06573e03e5a429d6ebf348e5194ca3d68b7531741706e7842538c390 wres-20250311-3bce492.zip #448

Closed
10 tasks done
HankHerr-NOAA opened this issue Mar 11, 2025 · 43 comments
Labels
deployment Work related to deployments
Milestone

Comments

@HankHerr-NOAA
Contributor

HankHerr-NOAA commented Mar 11, 2025

This deployment will not be preceded by a dependency update.

  • 1 - Tag a commit as staging to kick off pre-release workflow (Pre-release github action; this tests db migration; see the sketch after this list)
  • 2 - Nominate a commit that has passed the pre-release workflow (Pre-release github action)
  • 3 - Initiate staging release (initiate-release-deployment Jenkins workflow; emails will report test results for external service, system tests, and 900 series)
  • 4 - Test candidate with identified alpha (internal) User Acceptance Tests (-ti)
  • 5 - Test candidate as a standalone.
  • 6 - Push yml files changed
  • 7 - Tag the commits that produced the release that passed everything and was already deployed to -prod
  • 8 - Update database documentation in wiki
  • 9 - Update user documentation
  • 10 - Close GitHub tickets actually completed in this release and move tickets not completed to backlog
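
For step 1, a minimal sketch of tagging the nominated commit, assuming the pre-release GitHub Action is triggered by a tag literally named `staging` (the tag name, remote name, and forced tag move are assumptions here, not the documented VLab procedure; Java is used only because it is the project language):

```java
// Hypothetical sketch: point a "staging" tag at the nominated commit and push it
// so the pre-release workflow runs. Tag name, remote, and forced move are assumptions.
import java.io.IOException;

public class TagStagingCandidate
{
    public static void main( String[] args ) throws IOException, InterruptedException
    {
        String commit = args.length > 0 ? args[0] : "3bce492"; // candidate revision
        run( "git", "tag", "--force", "staging", commit );      // (re)point the staging tag
        run( "git", "push", "--force", "origin", "staging" );   // push to kick off the workflow
    }

    private static void run( String... command ) throws IOException, InterruptedException
    {
        Process process = new ProcessBuilder( command ).inheritIO().start();
        if ( process.waitFor() != 0 )
        {
            throw new IllegalStateException( "Command failed: " + String.join( " ", command ) );
        }
    }
}
```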
@HankHerr-NOAA HankHerr-NOAA added the deployment Work related to deployments label Mar 11, 2025
@HankHerr-NOAA HankHerr-NOAA added this to the v6.30 milestone Mar 11, 2025
@HankHerr-NOAA
Contributor Author

James:

If you have a change that you think I should wait for, let me know. Otherwise, I hope to start the deployment process following the instructions in VLab in about a half-hour. Thanks,

Hank

@james-d-brown
Collaborator

No, not especially.

@HankHerr-NOAA
Contributor Author

I think I see PR #449, addressing issue #228, in process. I'll wait for that to be done and then nominate that revision for deployment.

Hank

@HankHerr-NOAA
Contributor Author

Anyone know why the external services tests have failed twice in the past couple of days? The failure appears to be related to NWIS.

Hank

@HankHerr-NOAA
Contributor Author

The latest test reported a "server error" when reaching out to NWIS,

08:19:10.998 [Test worker] DEBUG wres.http.WebClient -- Got server error from https://nwis.waterservices.usgs.gov/nwis/iv/?format=json&indent=on&sites=01631000&parameterCd=00060&startDT=2020-03-01T00:00:00Z&endDT=2020-04-30T23:59:59Z in PT0.035378378S.

NwisTest > canGetAndParseResponseFromNwisWithWebClientAndJacksonPojos() FAILED
com.fasterxml.jackson.core.JsonParseException at NwisTest.java:60

Maybe this is related to the same NWIS issue that triggered retries, which resulted in a connection leak that James addressed. Point is, I don't think this is the WRES; I think it's NWIS.
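
For reference, a minimal sketch of reproducing the check outside the test harness, using the plain JDK HttpClient and Jackson rather than the wres.http.WebClient the test uses (jackson-databind on the classpath is assumed). A server-side outage typically returns a non-JSON error body, which then surfaces as the JsonParseException seen above:

```java
// Minimal sketch, not the NwisTest code: issue the same NWIS request and attempt to
// parse the response with Jackson. A JsonParseException indicates NWIS returned
// something other than JSON, e.g., an HTML error page during an outage.
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NwisProbe
{
    public static void main( String[] args ) throws Exception
    {
        String url = "https://nwis.waterservices.usgs.gov/nwis/iv/?format=json&indent=on"
                     + "&sites=01631000&parameterCd=00060"
                     + "&startDT=2020-03-01T00:00:00Z&endDT=2020-04-30T23:59:59Z";

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send( HttpRequest.newBuilder( URI.create( url ) ).GET().build(),
                       HttpResponse.BodyHandlers.ofString() );

        System.out.println( "HTTP status: " + response.statusCode() );

        // Throws JsonParseException if the body is not JSON.
        new ObjectMapper().readTree( response.body() );
        System.out.println( "Parsed JSON successfully." );
    }
}
```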

Hank

@james-d-brown
Collaborator

Yeah, just issues with NWIS, I think. Manually executed evaluations that rely on NWIS data were failing too.

As an aside, these tests should be renamed as WRES external services tests rather than WRES WRDS SERVICE TEST in the e-mail notification.

@HankHerr-NOAA
Contributor Author

Thanks, James.

Waiting for the merge to complete. Looks like it needs to run through checks/testing using master now. I'll also have to wait on a round of testing by Jenkins to complete. Hopefully, by 9:30 EST or sooner, we can start the deploy.

Hank

@HankHerr-NOAA
Contributor Author

Finally. Took over 17 minutes to complete the system tests for the merge, and GitHub only runs a subset of the system tests.

Step 1: "Tag a commit as staging to kick off pre-release workflow". Doing that now,

Hank

@HankHerr-NOAA HankHerr-NOAA changed the title As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: TBD wres-TBD.zip As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 3bce4921ad05c85dc94c1b3d3f6e2c63278a7007wres-TBD.zip Mar 11, 2025
@HankHerr-NOAA
Contributor Author

Commit 3bce492 tagged. Workflows are running.

Hank

@HankHerr-NOAA HankHerr-NOAA changed the title As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 3bce4921ad05c85dc94c1b3d3f6e2c63278a7007wres-TBD.zip As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 3bce4921ad05c85dc94c1b3d3f6e2c63278a7007 wres-20250311-3bce492.zip Mar 11, 2025
@HankHerr-NOAA HankHerr-NOAA changed the title As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 3bce4921ad05c85dc94c1b3d3f6e2c63278a7007 wres-20250311-3bce492.zip As WRES team, I would like to identify a candidate for github Release v6.30, test it, and if it passes, deploy it; candidate RC1: 2d40c6af06573e03e5a429d6ebf348e5194ca3d68b7531741706e7842538c390 wres-20250311-3bce492.zip Mar 11, 2025
@HankHerr-NOAA
Contributor Author

Updated the title of this ticket to include the hash of the .zip. Updated the VLab wiki to indicate where to find that hash.
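
As a side note, anyone pulling the candidate can verify the download against the hash in the title; a minimal sketch, assuming that hash is a SHA-256 digest of the zip (the 64 hex characters suggest SHA-256, but confirm against the VLab wiki):

```java
// Hypothetical verification of the release zip against the hash in the ticket title.
// Assumes the hash is SHA-256; reads the whole file into memory, which is fine for a sketch.
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class VerifyReleaseZip
{
    public static void main( String[] args ) throws Exception
    {
        Path zip = Path.of( args.length > 0 ? args[0] : "wres-20250311-3bce492.zip" );
        String expected = "2d40c6af06573e03e5a429d6ebf348e5194ca3d68b7531741706e7842538c390";

        byte[] digest = MessageDigest.getInstance( "SHA-256" ).digest( Files.readAllBytes( zip ) );
        String actual = HexFormat.of().formatHex( digest );

        System.out.println( actual.equals( expected ) ? "Hash matches." : "Hash MISMATCH: " + actual );
    }
}
```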

Hank

@HankHerr-NOAA
Contributor Author

The pre-release database migration test is still ongoing. It's now on the "Fresh migration from upcoming release" step. Hopefully, this will complete by the afternoon. I guess we'll see.

Hank

@james-d-brown
Collaborator

Oh dear, we may be back to the bad old days of a 24+ hour turnaround, even if RC1 succeeds all gates :(

We'll need to investigate that issue as a matter of priority.

@epag
Collaborator

epag commented Mar 11, 2025

Looks like it has finished now, but I agree, the times seem to be ever climbing. I need to focus on finishing this rhel8 migration, but after that I will prioritize looking at this. Or, if I am halted on the rhel8 migration during a potential (likely?) shutdown, I can look at this then, since I think we aren't allowed to do anything production-related during a shutdown.

@HankHerr-NOAA
Contributor Author

Pre-release steps are done. I'm proceeding with deployment Step 3, initiating the release using Jenkins.

Hank

@HankHerr-NOAA
Contributor Author

I submitted the job, which has kicked off a full scenario test. The initiate-release-deployment is job 14.

Once deployed to staging, I'll shift my focus to the VLab ticket and UAT.

Hank

@epag
Collaborator

epag commented Mar 11, 2025

There were some failures and odd behavior, but I am aware of what caused those. Going to fix them quickly and restart a deploy.

@HankHerr-NOAA
Contributor Author

Evan gave me the thumbs up to try again, so I started initiate release job 15.

Hank

@HankHerr-NOAA
Contributor Author

The 900 tests are running particularly slowly due to issues obtaining data from USGS NWIS and may end up dying at some point. I'm told that that would result in the Jenkins workflow halting, preventing deployment to staging.

We can either agree to ignore the 900 series and force the staging deployment, or we can hold up the deployment with some sort of fix for that issue. Though I don't really think the NWIS failures are an issue we can fix.

Thoughts?

Hank

@HankHerr-NOAA
Contributor Author

It's the regular system tests that are taking a while. Specifically, 703. Again, do we wait it out (eventually, it may time out and fail), or do we just trigger the 900 series and move on with the deployment?

Hank

@HankHerr-NOAA
Contributor Author

Evan mentioned killing the system tests, moving to the 900 series, and then deploying. Since this is just going to staging, I think that's a good idea. However, it would be good to see 703 and the external services tests pass before we deploy to production. We can revisit that discussion later.

Hank

@james-d-brown
Collaborator

If this were a critical deploy to address a CVE, I would say go ahead, because the risk/reward would be stacked toward mitigating a bigger risk with a smaller one. However, in this case, I don't think we really want to set a precedent of ignoring gates, even if we can explain them. The problem is that these tests incorporate attributes other than NWIS, and our unit/integration test coverage is not (close to) 100%, more like 50%.

@james-d-brown
Collaborator

I appreciate that this makes us depend on live services (that can break) for our deploy, and it would be better if we could test these same features without that being the case, but, right now, we cannot. I have no issues with attempting to improve resilience in that regard, but that is for later.

@epag
Collaborator

epag commented Mar 11, 2025

It is also worth mentioning that, while the actual deploy artifacts didn't pass the sys tests, we did test the commit already as part of the automated testing pipeline:

[Successful] wres-automated-testing/full-scenario-testing - 69 PASSED, 0 FAILED for commit: 3bce492

Also, it seems like NWIS is now just outright reporting a service outage as well.

@epag
Collaborator

epag commented Mar 11, 2025

I don't have a super strong opinion on whether we wait or push through here in this specific case, but I think we should strive toward separating checks of partner live-service uptime from our deployment process. We shouldn't allow issues with partner services to stop us from making progress/deployments.

@james-d-brown
Collaborator

Agree with that, but there is a difference between that situation and the current situation and we are not deploying an urgent fix, so I think we should wait and then work towards a better separation of concerns. However, I will say that many of these tests against live services were specifically designed to test that capability, so it would require us to stand up an equivalent service/api, at least for the limited data required, which is a headache in itself. But, again, that is a future discussion.

@HankHerr-NOAA
Contributor Author

Since we are close to the end of today, anyway, I'm going to push the deployment to tomorrow morning and re-trigger the initiate-release Jenkins workflow when I get in in the morning. If it continues to not get past the system tests, then we'll probably have to hold off on deployment. If 703 is failing due to USGS NWIS, then some of our UAT is likely to fail, as well, and we'll be forced to delay regardless.

More tomorrow,

Hank

@epag
Collaborator

epag commented Mar 11, 2025

However, I will say that many of these tests against live services were specifically designed to test that capability, so it would require us to stand up an equivalent service/api, at least for the limited data required, which is a headache in itself. But, again, that is a future discussion.

Agreed. I think this feeds into some earlier conversations we have had around not using full/complete datasets for our tests and creating leaner tests/datasets, which it seems would be needed in order to separate live partner services from our deploy pipeline. Also, since we are on Jenkins now with its own host, we can download data there and rely on that instead.

Agree with that, but there is a difference between that situation and the current situation and we are not deploying an urgent fix, so I think we should wait and then work towards a better separation of concerns.

I have no qualms with this

Also, it looks like, based on this test, that we will retry calls for well over an hour... That seems somewhat excessive to me; let me know what you all think:

First failure: T17:23
Last failure: T18:40
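
For discussion, a minimal sketch of the kind of cap that could bound that window: exponential backoff between attempts plus an overall deadline after which we stop retrying. This is illustrative only, not the existing WebClient retry logic:

```java
// Illustrative only: retry an action with exponential backoff, but give up once an
// overall deadline has elapsed. Not the existing WRES WebClient behavior.
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

public class BoundedRetry
{
    public static <T> T callWithDeadline( Callable<T> action,
                                          Duration deadline,
                                          Duration initialBackoff ) throws Exception
    {
        Instant giveUpAt = Instant.now().plus( deadline );
        Duration backoff = initialBackoff;
        Exception last = null;

        while ( Instant.now().isBefore( giveUpAt ) )
        {
            try
            {
                return action.call();
            }
            catch ( Exception e )
            {
                last = e;
                long waitMillis = Math.max( 0, Math.min( backoff.toMillis(),
                        Duration.between( Instant.now(), giveUpAt ).toMillis() ) );
                Thread.sleep( waitMillis );
                backoff = backoff.multipliedBy( 2 ); // double the backoff between attempts
            }
        }

        throw new IllegalStateException( "Gave up after " + deadline, last );
    }
}
```

For example, wrapping a hypothetical request in `callWithDeadline( () -> client.send( request, handler ), Duration.ofMinutes( 10 ), Duration.ofSeconds( 5 ) )` would stop retrying after roughly ten minutes instead of an hour-plus.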

@HankHerr-NOAA
Contributor Author

Well, looks like the full-scenario-testing passed, followed by the 900 series. And it appears to have deployed to staging.

I'll note the details in the VLab ticket,

Hank

@HankHerr-NOAA
Contributor Author

I'll commence with UAT and standalone testing tomorrow morning. Staging COWRES is ready for use,

Hank

@HankHerr-NOAA
Contributor Author

I'm unchecking box 3 for now. The workflow corresponding to box 3 requires passage of the system tests. As Evan pointed out to me in a chat, scenario703 did not pass, so the system tests did not pass. The 900 series passed, and staging has the revision installed from the correct artifact.

Tomorrow, we just need to make sure the system tests can pass using the same artifacts. I'll work with Evan on how best to make sure that happens (he mentioned just stopping the workflow after the system tests pass and while it's in the middle of the 900 series tests). This is assuming USGS NWIS is up and running, of course.

Have a great evening!

Hank

@HankHerr-NOAA
Contributor Author

NWIS appears to be responsive today. I'm going to trigger the initiate-release-deployment workflow and see if the system tests succeed.

Hank

@HankHerr-NOAA
Contributor Author

Just triggered initiate-release-deployment, which triggered full-scenario-testing 16. I'll keep my eyes on it.

I also checked the overnight evaluations in staging and they all succeeded. That's a good sign.

I don't plan to stop the deployment partway, as mentioned yesterday. Instead, I'll start UAT once I've confirmed the system tests have succeeded, or at least that 703 passed, and then expect to be interrupted when it's redeployed to staging. Again, Jenkins is just pulling the artifacts from GitHub; it's not rebuilding them. So what is currently in staging is identical to what is being deployed now.

UAT to start in a bit,

Hank

@HankHerr-NOAA
Contributor Author

System testing passed. I confirmed it manually in the console output. I'm going to start UAT in staging,

Hank

@HankHerr-NOAA
Contributor Author

The artifacts were deployed to staging, again. That forced a pause in UAT. Continuing now,

Hank

@HankHerr-NOAA
Contributor Author

One thing to note: If the artifacts deployed were truly the same, then the images would be the same, and, when pushed to the registry, all of the layers should already exist, so the date/time associated with the images in the registry should not be changed.

I confirmed that this was the case for the worker and tasker images. Thus, I can be confident my tests earlier this morning before the deployment to staging are still valid. Cool.
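
For the record, a hedged sketch of how that could be checked directly, assuming the registry exposes the standard Docker Registry HTTP API v2 and allows anonymous reads (the registry URL, repository name, and tag below are placeholders, not the real WRES registry):

```java
// Hypothetical check: ask the registry for the manifest digest of an image tag. If the
// digest is unchanged after a redeploy, the image content is byte-for-byte identical.
// Registry URL, repository, and tag are placeholders; authentication is omitted.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ImageDigestCheck
{
    public static void main( String[] args ) throws Exception
    {
        String registry = "https://registry.example.gov"; // placeholder
        String repository = "wres/worker";                // placeholder
        String tag = "20250311-3bce492";                  // placeholder

        HttpRequest request = HttpRequest.newBuilder(
                URI.create( registry + "/v2/" + repository + "/manifests/" + tag ) )
                .header( "Accept", "application/vnd.docker.distribution.manifest.v2+json" )
                .GET()
                .build();

        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send( request, HttpResponse.BodyHandlers.discarding() );

        // The registry reports the content-addressable digest in this response header.
        System.out.println( "Digest: "
                + response.headers().firstValue( "Docker-Content-Digest" ).orElse( "<none>" ) );
    }
}
```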

Continuing with UAT,

Hank

@HankHerr-NOAA
Contributor Author

Pull request is in for the Docker files, compose-entry.yml, and compose-workers.yml. See #452. I'll wait for it to pass the checks.

Hank

@HankHerr-NOAA
Contributor Author

No database documentation changes are needed, so I'll check that box. The documentation changes for reference_date_pools are being pushed to 6.31. Checking those boxes.

Hank

HankHerr-NOAA added a commit that referenced this issue Mar 12, 2025
Docker files for v6.30; refs GitHub #448
@HankHerr-NOAA
Contributor Author

I think it's revision ec8f281 that has the Docker file changes, but let me confirm through GitHub.

Hank

@HankHerr-NOAA
Contributor Author

Yes, that's the commit after the merge. Now doing the tagging step,

Hank

@HankHerr-NOAA
Contributor Author

I'm doing another Docker pull request and merge. I took the easy way out and committed the compose from production, which uses different Docker network options, instead of the one from staging, which uses the default options. Evan recommended against that, so I'm pushing/merging again. Once it passes the checks, I'll capture the revision, tag it, and be done with it.

Hank

HankHerr-NOAA added a commit that referenced this issue Mar 12, 2025
Changed to use default net, not different net; refs GitHub #448
@HankHerr-NOAA
Contributor Author

The docker revision is c996384.

Tagging,

Hank

@HankHerr-NOAA
Contributor Author

@HankHerr-NOAA
Contributor Author

The only ticket remaining open is this one. Closing it now.

Hank
