logs: better capturing of Dask and job logs in error situations #739
Comments
Did you trim the logs between the job logs and the worker logs? The error is captured in the job logs, but you are looking at the worker logs in the output of the `rcg logs -w dask-coffea-serial-kubernetes` command. Can you paste a full log example for a complete reference? Note 1: judging by the three dots at the bottom of the job logs, I assumed you trimmed all the logs between the job logs and the scheduler logs.
Yes, the logs were trimmed, but I think the `OSError` was not there. The original tmux session is now gone, so I cannot double-check. When I now rerun the workflow a few times, the error is caught well, for example:
I checked the 4 most recent runs, out of which 1 succeeded and 3 failed, and the failed runs always had the error in their logs:

```console
$ for runnumber in $(rcg list --filter name=dask-coffea-serial-kubernetes --filter status=failed | awk ' NR>1 {print $2}'); do rcg logs -w dask-coffea-serial-kubernetes.$runnumber | grep -c ^OSError; done
1
1
1
```

So far so good then? We could keep this issue open for a week or so and see whether it reproduces?
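For reference, the same check can also be scripted. Here is a minimal Python sketch under the assumption that `rcg` resolves to an executable wrapper around a pre-configured `reana-client` (a shell alias would not be visible to `subprocess`):

```python
"""Illustrative sketch only: count OSError lines in the logs of every failed
run of the example workflow.  Assumes ``rcg`` is an executable wrapper around
a pre-configured ``reana-client``."""
import subprocess

WORKFLOW = "dask-coffea-serial-kubernetes"


def run(args: list) -> str:
    """Run a command and return its standard output as text."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout


# Collect run numbers of failed runs, skipping the header line of `rcg list`.
listing = run(["rcg", "list", "--filter", f"name={WORKFLOW}", "--filter", "status=failed"])
runnumbers = [line.split()[1] for line in listing.splitlines()[1:] if line.split()]

for runnumber in runnumbers:
    logs = run(["rcg", "logs", "-w", f"{WORKFLOW}.{runnumber}"])
    count = sum(line.startswith("OSError") for line in logs.splitlines())
    print(f"{WORKFLOW}.{runnumber}: {count} OSError line(s)")
```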
Current behaviour
When a workflow using Dask fails, the errors may not be fully captured and exposed back to users.
Here is one observation from running the reana-demo-dask-coffea example. The status shows that the workflow failed:

The logs show the same:

However, when one consults the Kubernetes pod logs directly, one can see that the root cause is in the job pod:
This error did not seem to have been captured and exposed back to the user.
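For completeness, here is a minimal sketch of how one might consult the runtime pod logs programmatically to locate such an error. This is only an illustration: the `reana-run-job` pod-name prefix and the `default` namespace are assumptions about the deployment and may differ.

```python
"""Illustrative sketch only: look for the data-access error directly in the
Kubernetes pod logs when the workflow logs do not show it.  The pod-name
prefix and namespace below are assumptions about the deployment."""
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()
namespace = "default"  # adjust to the REANA runtime namespace

for pod in v1.list_namespaced_pod(namespace=namespace).items:
    # Assumed naming convention for REANA job pods; adjust as needed.
    if not pod.metadata.name.startswith("reana-run-job"):
        continue
    log = v1.read_namespaced_pod_log(name=pod.metadata.name, namespace=namespace)
    if "OSError" in log or "Operation expired" in log:
        print(f"--- {pod.metadata.name} ---")
        print(log)
```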
Expected behaviour
The user should see in the logs that the workflow failed due to an XRootD "Operation expired" data access error.
Note
It looks like we may need to handle exceptional situations when capturing both the Dask worker logs and the REANA job logs and merging them.
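As an illustration of the kind of defensive handling this might require (a sketch only, not REANA's actual implementation), the merging step could tolerate a failure in any single log source and record that failure in the merged output instead of losing the other sources:

```python
"""Illustrative sketch only, not REANA's implementation: merge logs from
several sources defensively, so that a failure to fetch one source does not
hide the others and is itself recorded in the merged output."""
from typing import Callable, Dict


def merge_logs(sources: Dict[str, Callable[[], str]]) -> str:
    """Fetch each log source and concatenate them with headers.

    ``sources`` maps a human-readable label (e.g. "job logs", "dask worker
    logs") to a zero-argument callable returning the log text.
    """
    merged = []
    for label, fetch in sources.items():
        merged.append(f"==== {label} ====")
        try:
            merged.append(fetch() or "(no logs available)")
        except Exception as exc:  # keep going even if one source fails
            merged.append(f"(could not retrieve {label}: {exc})")
    return "\n".join(merged)


# Hypothetical usage, with fetchers for the job pod and the Dask worker pods:
# print(merge_logs({
#     "job logs": lambda: read_pod_log("reana-run-job-..."),
#     "dask worker logs": lambda: read_pod_log("dask-worker-..."),
# }))
```

Recording a failed fetch inline keeps the merged output self-describing, so an error like the XRootD "Operation expired" one would still surface to the user even if another log source could not be retrieved.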