As a user, I want the WRES to retry WRDS requests when they fail, even when the failure occurs while the response body is in flight, after the TLS handshake has completed and the headers have been parsed. #73
Comments
Original Redmine Comment Is the production WRES still running? Yes, a smoke test evaluation ran successfully. Hmmm... Hank
Original Redmine Comment We've run this evaluation previously several times without issue, and Alex many times more. Let me see if a rerun succeeds. It's a long evaluation, so I may not know for a couple of hours. Thanks, Hank
Original Redmine Comment On the earlier run, I see this further down:
What is the timeout limit on our end for an NWM request? Hank
Original Redmine Comment The job id for my second attempt at the evaluation is 7173193725700182922 (production COWRES). Again, if either of you know the timeout we apply for WRES NWM service requests, please let me know. Thanks! Hank
Original Redmine Comment If I recall correctly, that looks like a request timeout, which is at a higher level, as opposed to a socket connect timeout or socket read timeout. Looking.
Original Redmine Comment In @wres.io.utilities.WebClient@:
Original Redmine Comment And looking at @git blame@ where @REQUEST_TIMEOUT@ is added as args to the OkHttp client, I find #69947, and looking over that, I think the above is one of those timeouts, i.e. request timeout.
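For reference, the timeout knobs on an OkHttp client look roughly like the sketch below. The values, and the assumption that @REQUEST_TIMEOUT@ maps to OkHttp's overall call timeout, are illustrative only and not the actual @wres.io.utilities.WebClient@ code:

```java
import java.time.Duration;

import okhttp3.OkHttpClient;

public class TimeoutSketch
{
    // Illustrative values only; the real WebClient defines its own constants.
    private static final Duration CONNECT_TIMEOUT = Duration.ofSeconds( 30 );
    private static final Duration READ_TIMEOUT = Duration.ofMinutes( 5 );
    private static final Duration REQUEST_TIMEOUT = Duration.ofMinutes( 20 );

    static OkHttpClient buildClient()
    {
        return new OkHttpClient.Builder()
                               // Time allowed to establish the TCP connection (and TLS).
                               .connectTimeout( CONNECT_TIMEOUT )
                               // Maximum quiet period between individual socket reads.
                               .readTimeout( READ_TIMEOUT )
                               // Overall deadline for the whole call: DNS, connect,
                               // headers, and the complete response body.
                               .callTimeout( REQUEST_TIMEOUT )
                               .build();
    }
}
```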
Original Redmine Comment So 20 minutes. Gautam tells me they have a 30-minute timeout on their end. I think the service hit a major hiccup in their response time. Sadly, Gautam said in the WRES/WRDS meeting that they have a very hard time monitoring this, because they don't have Check_MK and, imho, their logging is subpar. I asked him if they log time to fulfill requests, and he said no, but he'll look into adding something. Anyway, it sounds like a WRDS issue, but if you want to look into it further, please let me know. Thanks! Hank
Original Redmine Comment I looked for a JUnit test that covers some of this but did not find one that would easily reproduce the issue. I think the quickest way to confirm for sure would be to set that request limit to 1000 ms or 500 ms in a local development version, then make the request of WRDS and see what stack trace comes up.
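A rough sketch of that local reproduction, assuming the same OkHttp-style client; the endpoint URL and the 500 ms deadline are placeholders, not real WRDS or WRES values:

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.time.Duration;

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class ReproduceRequestTimeout
{
    public static void main( String[] args ) throws IOException
    {
        // Deliberately tiny overall deadline to force the same failure locally.
        OkHttpClient client = new OkHttpClient.Builder()
                                              .callTimeout( Duration.ofMillis( 500 ) )
                                              .build();

        // Placeholder URL: substitute a real WRDS endpoint when reproducing.
        Request request = new Request.Builder()
                                     .url( "https://wrds.example/api/some/slow/endpoint" )
                                     .build();

        try ( Response response = client.newCall( request ).execute() )
        {
            // Reading the body can also trip the deadline mid-stream.
            System.out.println( response.body().string().length() );
        }
        catch ( InterruptedIOException e )
        {
            // Expect "timeout" here, matching the production stack trace.
            e.printStackTrace();
        }
    }
}
```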
Original Redmine Comment Yes, it's a WRDS issue, but it would be ideal if WRES could retry in this case. I don't think there's an easy way to get WRES to retry here, or it might already be retrying. Was this after several retries or was it with no retries? You would have to look at the full log to find out.
Original Redmine Comment Looking, Hank
Original Redmine Comment From looking at the full stdout, it looks like more than one request at that same time failed, and it doesn't look like WRES retried. WRES does retry in almost all situations, but there is this narrow band of situations where it is difficult to retry.
Original Redmine Comment Thanks for looking. I was having a hard time following the chain of events. Hank
Original Redmine Comment I need to step away for a bit (meeting, then picking up kid from school), but if there is anything you'd like for me to check out, let me know. My second attempt at the evaluation, 7173193725700182922, is still progressing as expected. Thanks, Hank
Original Redmine Comment You're welcome, and I think I see what happened now. The stack traces that show @InterruptedIOException: null@ are where those threads got interrupted because the original/official/real @InterruptedIOException: timeout@ occurred in another thread. So this is the one that propagated and stopped the evaluation, and it probably happened first, chronologically:
And then, due to the above, which falls in the narrow case that is difficult to retry, the other threads got notified while it was propagating and printed these:
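To make that narrow case concrete, here is an illustrative sketch, not the actual @WebClient@ code, of where in the request lifecycle the failure landed:

```java
import java.io.IOException;
import java.io.InputStream;

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

final class WhereTheTimeoutLands
{
    // Hypothetical helper; sketches the lifecycle only, not WRES code.
    static byte[] fetch( OkHttpClient client, Request request ) throws IOException
    {
        // Failures before this call returns (connect, TLS) can be retried by
        // OkHttp itself via retryOnConnectionFailure, which is on by default.
        try ( Response response = client.newCall( request ).execute() )
        {
            // Reaching this line means the socket connected, the TLS handshake
            // completed, and the response headers were received and parsed.
            try ( InputStream body = response.body().byteStream() )
            {
                // The production failure happened about here: the overall
                // timeout fired while the body was still in flight. Nothing
                // retries this automatically; the caller has to re-issue the
                // whole request.
                return body.readAllBytes();
            }
        }
    }
}
```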
On the WRDS side: improve reliability; perhaps a reverse proxy would help if not already in place, or a Varnish cache, or some combination, maybe better tech underneath.
Original Redmine Comment I've updated the subject and description to focus this ticket on implementing the retry mechanism. The second evaluation attempt, 7173193725700182922, succeeded with output generated. I'm going to move up the schedule for today's evaluation to see if it succeeds this morning. Leaving this ticket as New, lowering it to High priority, and leaving it in the Backlog. Thanks, Hank
Author Name: Hank (Hank)
Original Redmine Issue: 101945, https://vlab.noaa.gov/redmine/issues/101945
Original Date: 2022-03-01
Per investigation by Jesse, the problem originally presented in this ticket occurred because of a WRDS hiccup and the WRES not retrying the request after that hiccup. The failure occurred while WRDS was returning the response body to the WRES, after the socket connection had been established, the TLS handshake completed, and the headers received and parsed. In that case, no retry happened and the evaluation failed. This ticket can be resolved when a retry capability is implemented to cover that possibility.
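A minimal sketch of one way such a retry could work, assuming an OkHttp-based client and wrapping the entire call, including the body read, in a retry loop. The class name, attempt limit, and backoff are illustrative assumptions, not the actual WRES API:

```java
import java.io.IOException;
import java.time.Duration;

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

final class RetryingWrdsReader
{
    // Illustrative limits; real values would need tuning against WRDS.
    private static final int MAX_ATTEMPTS = 4;
    private static final Duration BASE_BACKOFF = Duration.ofSeconds( 10 );

    /**
     * Reads the complete response body, re-issuing the request if any part of
     * the call fails, including a failure while the body is still in flight.
     */
    static String readWithRetries( OkHttpClient client, Request request ) throws IOException
    {
        IOException lastFailure = null;

        for ( int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++ )
        {
            try ( Response response = client.newCall( request ).execute() )
            {
                if ( !response.isSuccessful() )
                {
                    throw new IOException( "HTTP " + response.code() + " from " + request.url() );
                }

                // Consuming the body inside the try means a timeout during the
                // body read is also caught and retried below.
                return response.body().string();
            }
            catch ( IOException e )
            {
                lastFailure = e;

                if ( attempt == MAX_ATTEMPTS )
                {
                    break;
                }

                try
                {
                    // Simple linear backoff between attempts.
                    Thread.sleep( BASE_BACKOFF.toMillis() * attempt );
                }
                catch ( InterruptedException interrupted )
                {
                    Thread.currentThread().interrupt();
                    throw new IOException( "Interrupted while waiting to retry "
                                           + request.url(), interrupted );
                }
            }
        }

        throw new IOException( "Request to " + request.url() + " failed after "
                               + MAX_ATTEMPTS + " attempts.", lastFailure );
    }
}
```

In practice, WRES might want to retry only exceptions it considers transient (for example, timeout-related @InterruptedIOException@ subclasses such as @SocketTimeoutException@) rather than every @IOException@; that design choice is left open here.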
Original Description is below. Thanks,
Hank
===========================================
The job:
5318596096827668180 (production)
The error (URL domain omitted):
Manual execution of the request works fine. What went wrong?
Hank