Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

As a developer, if the core server completes an evaluation, I don't want the user to receive a failure because of a timeout during relaying outputs to the shim. #330

Open
HankHerr-NOAA opened this issue Oct 4, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@HankHerr-NOAA
Copy link
Contributor

See VLab #120579-238. The evaluation was pretty small and fast with little stdout and little output. In the stdout log, I saw this:

2024-10-04T17:15:08.530+0000  [pool-2-thread-1] INFO EvaluationService - Evaluation with internal ID 1901617939395097451 and evaluation ID of P6YTi1bggENyDcwc19rpTFNYIwY has returned
2024-10-04T17:15:08.876+0000  [Thread-9] WARN EvaluationService - There was not an expected status transition; timeout thread closing project in status COMPLETED
2024-10-04T17:15:08.876+0000  [Thread-9] INFO EvaluationService - Closing Evaluation

Note the warning. The status endpoint returns a failure (better messaging, above, about that failure would be nice, since it was just a warning).

In the shim, I saw this exception (in context):

2024-10-04T17:15:08.880+0000  [pool-4-thread-1] INFO JobStandardStreamMessenger - Finished sending STDOUT for job 6948032875964178351
2024-10-04T17:15:09.325+0000  [main] WARN JobStatusMessenger - Failed to create channel, declare exchange, or get status.
java.io.IOException: Unable to get status with given ID
    at wres.worker.JobStatusMessenger.getEvaluationStatus(JobStatusMessenger.java:175)
    at wres.worker.JobStatusMessenger.run(JobStatusMessenger.java:136)
    at wres.worker.WresEvaluationProcessor.executeEvaluation(WresEvaluationProcessor.java:257)
    at wres.worker.WresEvaluationProcessor.call(WresEvaluationProcessor.java:184)
    at wres.worker.Worker.main(Worker.java:141)
2024-10-04T17:15:09.327+0000  [main] INFO JobStatusMessenger - Finished sending STATUS for job 6948032875964178351

Evan postulated that something slowed the ability the shim to get the final status update and outputs from the core server within a set time out, and that led to an exception which lead to the status failure being returned to the user.

The problem is not easily reproducible. The exact same evaluation completed successfully minutes later and I have not seen that exception, or any related exception like that (i.e., representing a shim-server breakdown), in months.

I'll add a comment with notes from Evan and James in the ticket after posting,

Hank

@HankHerr-NOAA HankHerr-NOAA added the bug Something isn't working label Oct 4, 2024
@HankHerr-NOAA
Copy link
Contributor Author

From VLab...

Evan noted this:

What the issue looks like to me based on your description is the following:
- I think the communication error is a red harring, because if it was the root it would happen before the second wrning you sent.
- We have a timer in place between when a project is completed before it is closed to ensure unclosed projects don't perpetually hang a worker
- If that timer is triggered we close the project

The project being closed mid read of the results SHOULDNT cause any issue, but that is looking like what happened. I can look more into this next week, but we may need to extend the errant job timeout to allow for more time before forecably closing an evaluation

James responded:

If that is the cause, then I don't think the solution can be additional time - because there will always be a bigger evaluation than the arbitrary wait, and the wait cannot be indefinite - it must be to wait until all outputs have been served when the serving is in progress, in other words an event-based trigger, not a time trigger.

Thanks,

Hank

@epag
Copy link
Collaborator

epag commented Oct 4, 2024

To your comment James I guess we could do the following:

  • Start a timer when an evaluation completes (lets say 3 minutes)
  • Stop that timer when the call to get the results of an evaluation is started
  • Start a new timer when the evaluation results call returns (3 minutes again)
  • stop that timer when the close call is made

This way the time is truly just monitoring the state transitions and allow unlimited time for the output location to be retrieved.

I am going to test though if this is indeed what is causing it by reducing the time amount locally and seeing if I can reproduce

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants