As WRES team, I would like to identify a candidate for GitHub Release v6.30, test it, and if it passes, deploy it; candidate RC1: 2d40c6af06573e03e5a429d6ebf348e5194ca3d68b7531741706e7842538c390 wres-20250311-3bce492.zip #448
Comments
James: If you have a change that you think I should wait for, let me know. Otherwise, I hope to start the deployment process following the instructions in VLab in about a half-hour. Thanks, Hank
No, not especially.
Anyone know why the external services tests have failed twice in the past couple of days? The failure appears to be related to NWIS. Hank
The latest test reported a "server error" when reaching out to NWIS. Maybe this is related to the same NWIS issue that triggered retries, which resulted in a connection leak that James addressed. Point is, I don't think this is the WRES; I think it's NWIS. Hank
Yeah, just issues with NWIS, I think. Manually executed evaluations that rely on NWIS data were failing too. As an aside, these tests should be renamed as WRES external services tests.
Thanks, James. Waiting for the merge to complete. Looks like it needs to run through checks/testing using the master branch now. I'll also have to wait on a round of testing by Jenkins to complete. Hopefully, by 9:30 EST or sooner, we can start the deploy. Hank
Finally. Took over 17 minutes to complete the system tests for the merge, and GitHub only runs a subset of the system tests. Step 1: "Tag a commit as staging to kick off pre-release workflow". Doing that now, Hank
Commit 3bce492 tagged. Workflows are running. Hank
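For readers unfamiliar with this step, a minimal sketch of what it amounts to, assuming the pre-release workflow triggers on a push of a `staging` tag; the `--force` flags and the `origin` remote name are assumptions, not the documented WRES procedure:

```python
# Hypothetical sketch of the tagging step: apply a "staging" tag to the
# release-candidate commit and push it so the pre-release workflow fires.
import subprocess

COMMIT = "3bce492"  # release candidate commit from this thread

def tag_for_prerelease(commit: str, tag: str = "staging") -> None:
    # Move the tag to the candidate commit; --force assumes a prior
    # candidate may already have used the same tag name.
    subprocess.run(["git", "tag", "--force", tag, commit], check=True)
    # Pushing the tag is what kicks off the workflow.
    subprocess.run(["git", "push", "--force", "origin", tag], check=True)

if __name__ == "__main__":
    tag_for_prerelease(COMMIT)
```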
Updated the title of this ticket to include the hash of the .zip. Updated the VLab wiki to indicate where to find that hash. Hank
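With the hash in the title, anyone can verify a downloaded candidate against it; a minimal sketch, assuming the .zip sits in the working directory (the path is an assumption):

```python
# Verify the downloaded release candidate against the sha256 published
# in this ticket's title.
import hashlib
from pathlib import Path

EXPECTED = "2d40c6af06573e03e5a429d6ebf348e5194ca3d68b7531741706e7842538c390"
ZIP_PATH = Path("wres-20250311-3bce492.zip")

def sha256sum(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks to avoid loading the whole artifact at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256sum(ZIP_PATH) == EXPECTED, "artifact hash mismatch"
print("artifact hash verified")
```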
The pre-release database migration test is still ongoing. It's now on the "Fresh migration from upcoming release" step. Hopefully, this will complete by the afternoon. I guess we'll see. Hank
Oh dear, we may be back to the bad old days of a 24+ hour turnaround, even if RC1 passes all gates :( We'll need to investigate that issue as a matter of priority.
Looks like it has finished now, but I agree, the times seem to be ever climbing. I need to focus on finishing this rhel8 migration, but after that I will prioritize looking at this. Or, if I am halted on the rhel8 migration during a potential (likely?) shutdown, I can look at this then, since I think we aren't allowed to do anything production-related during a shutdown.
Pre-release steps done. I'm proceeding with deployment Step 3, initiating the release using Jenkins. Hank
I submitted the job, which has kicked off a full scenario test. Once deployed to staging, I'll shift my focus to the VLab ticket and UAT. Hank
There were some failures and odd behavior, but I am aware of what caused those; going to fix them quickly and restart a deploy.
Evan gave me the thumbs-up to try again, so I started initiate-release job 15. Hank
The 900 tests are running particularly slowly due to issues obtaining data from USGS NWIS and may end up dying at some point. I'm told that would result in the Jenkins workflow halting, preventing deployment to staging. We can either agree to ignore the 900 series and force the staging deployment, or we can hold up the deployment pending some sort of fix for that issue, though I don't really think the NWIS failures are something we can fix. Thoughts? Hank
It's the regular system tests that are taking a while. Specifically, 703. Again, do we wait it out (eventually, it may time out and fail), or do we just trigger the 900 series and move on with the deployment? Hank
Evan mentioned killing the system tests, moving to the 900 series, and then deploying. Since this is just going to staging, I think that's a good idea. However, it would be good to see 703 and the external services tests pass before we deploy to production. We can revisit that discussion later. Hank
If this were a critical deploy to address a CVE, I would say go ahead, because the risk/reward would favor mitigating a bigger risk with a smaller one. However, in this case, I don't think we really want to set a precedent of ignoring gates, even if we can explain them. The problem is that these tests cover capabilities other than NWIS, and our unit/integration test coverage is not (close to) 100%, more like 50%.
I appreciate that this makes us depend on live services (that can break) for our deploy, and it would be better if we could test these same features without that being the case, but, right now, we cannot. I have no issues with attempting to improve resilience in that regard, but that is for later.
It is also worth mentioning that, while the actual deploy artifacts didn't pass the sys tests, we did test the commit already as part of the automated testing pipeline: [Successful] wres-automated-testing/full-scenario-testing - 69 PASSED, 0 FAILED for commit: 3bce492. Also, it seems like NWIS is now hard-reporting a service outage as well.
I don't have a super strong opinion on whether we wait or push through in this specific case, but I think we should strive towards separating checks of partner live-service uptime from our deployment process. We shouldn't allow issues with partner services to stop us from making progress/deployments.
Agree with that, but there is a difference between that situation and the current situation, and we are not deploying an urgent fix, so I think we should wait and then work towards a better separation of concerns. However, I will say that many of these tests against live services were specifically designed to test that capability, so it would require us to stand up an equivalent service/API, at least for the limited data required, which is a headache in itself. But, again, that is a future discussion.
Since we are close to the end of today, anyway, I'm going to push the deployment to tomorrow morning and re-trigger the initiate-release Jenkins workflow when I get in in the morning. If it continues to not get past the system tests, then we'll probably have to hold off on deployment. If 703 is failing due to USGS NWIS, then some of our UAT is likely to fail, as well, and we'll be forced to delay regardless. More tomorrow, Hank
Agreed, I think this feeds into some earlier conversations that we have had around not using full/complete datasets for our tests and creating leaner tests/datasets, which it seems would be needed in order to separate live partner services from our deploy pipeline. Also, since we are on Jenkins now with its own host, we can download data there and rely on that instead.
I have no qualms with this. Also, it looks like, based on this test, we will retry calls for well over an hour... That seems somewhat excessive to me; let me know what you all think. First failure: T17:23
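To make the point concrete, here's a sketch of retrying under an overall time budget, so per-request retries against a down service (here, NWIS) can't accumulate past a fixed deadline. The function names and the 15-minute budget are illustrative assumptions, not WRES code:

```python
# Retry with capped exponential backoff plus an overall deadline, so a
# persistently failing service fails the call in bounded time.
import random
import time

def fetch_with_deadline(fetch, max_elapsed: float = 900.0,
                        base: float = 2.0, cap: float = 60.0):
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return fetch()
        except OSError as error:  # stand-in for an HTTP/connection failure
            attempt += 1
            elapsed = time.monotonic() - start
            if elapsed >= max_elapsed:
                raise TimeoutError(
                    f"giving up after {elapsed:.0f}s and {attempt} attempts"
                ) from error
            # Exponential backoff with full jitter, capped so individual
            # waits stay bounded even late in the retry sequence.
            time.sleep(min(cap, base * 2 ** attempt) * random.random())
```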
Well, looks like the full-scenario-testing passed, followed by the 900 series. And it appears to have deployed to staging. I'll note the details in the VLab ticket, Hank
I'll commence with UAT and standalone testing tomorrow morning. Staging COWRES is ready for use, Hank
I'm unchecking box 3 for now. The workflow corresponding to box 3 requires the system tests to pass. As Evan pointed out to me in a chat, scenario703 did not pass, so the system tests do not pass. The 900 series passed, and staging has the revision installed from the correct artifact. Tomorrow, we just need to make sure the system tests can pass using the same artifacts. I'll work with Evan on how best to make sure that happens (he mentioned just stopping the workflow after the system tests pass, while it's in the middle of the 900 series tests). This is assuming USGS NWIS is up and running, of course. Have a great evening! Hank
NWIS appears to be responsive today. I'm going to trigger the initiate-release Jenkins workflow, Hank
Just triggered. I also checked the overnight evaluations in staging, and they all succeeded. That's a good sign. I don't plan to stop the deployment partway, as mentioned yesterday. Instead, I'll start UAT once I've confirmed the system tests have succeeded, or at least that 703 passed, and then expect to be interrupted when it's redeployed to staging. Again, Jenkins is just pulling the artifacts from GitHub; it's not rebuilding them. So what is currently in staging is identical to what is being deployed now. UAT to start in a bit, Hank
System testing passed. I confirmed it manually in the console output. I'm going to start UAT in staging, Hank
The artifacts were deployed to staging again. That forced a pause in UAT. Continuing now, Hank
One thing to note: If the artifacts deployed were truly the same, then the images would be the same, and, when pushed to the registry, all of the layers should already exist, so the date/time associated with the images in the registry should not be changed. I confirmed that this was the case for the worker and tasker images. Thus, I can be confident my tests earlier this morning before the deployment to staging are still valid. Cool. Continuing with UAT, Hank
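For anyone wanting to reproduce that kind of check, a sketch that reads the registry content digest recorded for a local image via the Docker CLI; identical artifacts yield identical digests. The image names below are placeholders, not the real registry paths:

```python
# Read the registry digest of each image; byte-identical images share
# the same digest, so redeploying the same artifact changes nothing.
import subprocess

def image_digest(image: str) -> str:
    # RepoDigests holds the content digest recorded at push/pull time;
    # this errors if the image was never pushed to or pulled from a registry.
    out = subprocess.run(
        ["docker", "image", "inspect",
         "--format", "{{index .RepoDigests 0}}", image],
        check=True, capture_output=True, text=True)
    return out.stdout.strip()

for name in ("example-registry/wres-worker:6.30",
             "example-registry/wres-tasker:6.30"):
    print(name, image_digest(name))
```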
Pull request is in for the Docker files, compose-entry.yml, and compose-workers.yml. See #452. I'll wait for it to pass the checks. Hank
No database documentation changes are needed, so I'll check that box. Hank
Docker files for v6.30; refs GitHub #448
I think it's revision ec8f281 that has the Docker file changes, but let me confirm through GitHub. Hank
Yes, that's the commit after the merge. Now doing the tagging step, Hank
I'm doing another Docker pull request and merge. I took the easy way out and committed the compose from production, which uses different Docker network options, instead of the one from staging, which uses the default options. Evan recommended against that, so I'm pushing/merging again. Once it passes the checks, I'll capture the revision, tag it, and be done with it. Hank
Changed to use default net, not different net; refs GitHub #448
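As a guard against repeating that mix-up, a quick sketch that compares the top-level `networks` stanza of two compose files before committing one of them. The staging/production file names are assumptions, and this requires PyYAML:

```python
# Compare the "networks" stanza of two compose files to catch committing
# the production network options in place of the staging defaults.
import yaml  # PyYAML

def networks(path: str):
    with open(path) as f:
        return (yaml.safe_load(f) or {}).get("networks")

staging = networks("compose-workers.staging.yml")
production = networks("compose-workers.production.yml")
if staging != production:
    print("networks stanzas differ:", staging, "vs", production)
```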
The Docker revision is c996384. Tagging, Hank
https://github.com/NOAA-OWP/wres/releases/tag/v6.30 If something looks off, please let me know. Hank
The only ticket remaining open is this one. Closing it now. Hank
This deployment will not be preceded by a dependency update.