logs: better capturing of Dask and job logs in error situations #739
Comments
Did you trim the logs between the job logs and the worker logs? The error is captured in the job logs, but you are looking at the worker logs in the output of the `rcg logs -w dask-coffea-serial-kubernetes` command. Can you paste a full log example for a complete reference? Note 1: judging by the three dots at the bottom of the job logs, I assumed you trimmed all the logs between the job logs and the scheduler logs.
Yes, the logs were trimmed, but I think the `OSError` was not there. The original tmux session is now gone, so I cannot double-check. When I now rerun the workflow a few times, the error is caught well, for example:
I checked the 4 most recent runs, out of which 1 succeeded and 3 failed, and the failed runs always had the error in their logs:

```console
$ for runnumber in $(rcg list --filter name=dask-coffea-serial-kubernetes --filter status=failed | awk ' NR>1 {print $2}'); do rcg logs -w dask-coffea-serial-kubernetes.$runnumber | grep -c ^OSError; done
1
1
1
```

So far so good then? We could keep this issue open for a week or so and see whether it reproduces?
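For reference, the same check can also be scripted. Here is a minimal Python sketch under the assumption that `rcg` resolves to an executable wrapper around a pre-configured `reana-client` (a shell alias would not be visible to `subprocess`):

```python
"""Illustrative sketch only: count OSError lines in the logs of every failed
run of the example workflow.  Assumes ``rcg`` is an executable wrapper around
a pre-configured ``reana-client``."""
import subprocess

WORKFLOW = "dask-coffea-serial-kubernetes"


def run(args: list) -> str:
    """Run a command and return its standard output as text."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout


# Collect run numbers of failed runs, skipping the header line of `rcg list`.
listing = run(["rcg", "list", "--filter", f"name={WORKFLOW}", "--filter", "status=failed"])
runnumbers = [line.split()[1] for line in listing.splitlines()[1:] if line.split()]

for runnumber in runnumbers:
    logs = run(["rcg", "logs", "-w", f"{WORKFLOW}.{runnumber}"])
    count = sum(line.startswith("OSError") for line in logs.splitlines())
    print(f"{WORKFLOW}.{runnumber}: {count} OSError line(s)")
```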
Current behaviour
When a workflow using Dask fails, the errors may not be fully captured and exposed back to users.
Here is one observation from running the reana-demo-dask-coffea example. The status shows that the workflow failed:

The logs show the same:

However, when one consults the Kubernetes pod logs directly, one can see that the root cause is in the job pod:
This error did not seem to have been captured and exposed back to the user.
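For completeness, here is a minimal sketch of how one might consult the runtime pod logs programmatically to locate such an error. This is only an illustration: the `reana-run-job` pod-name prefix and the `default` namespace are assumptions about the deployment and may differ.

```python
"""Illustrative sketch only: look for the data-access error directly in the
Kubernetes pod logs when the workflow logs do not show it.  The pod-name
prefix and namespace below are assumptions about the deployment."""
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()
namespace = "default"  # adjust to the REANA runtime namespace

for pod in v1.list_namespaced_pod(namespace=namespace).items:
    # Assumed naming convention for REANA job pods; adjust as needed.
    if not pod.metadata.name.startswith("reana-run-job"):
        continue
    log = v1.read_namespaced_pod_log(name=pod.metadata.name, namespace=namespace)
    if "OSError" in log or "Operation expired" in log:
        print(f"--- {pod.metadata.name} ---")
        print(log)
```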
Expected behaviour
The user should see in the logs that the workflow failed due to an XRootD "Operation expired" data access error.
Note
It looks like we may need to handle exceptional situations when capturing both the Dask worker logs and the REANA job logs and merging them.
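As an illustration of the kind of defensive handling this might require (a sketch only, not REANA's actual implementation), the merging step could tolerate a failure in any single log source and record that failure in the merged output instead of losing the other sources:

```python
"""Illustrative sketch only, not REANA's implementation: merge logs from
several sources defensively, so that a failure to fetch one source does not
hide the others and is itself recorded in the merged output."""
from typing import Callable, Dict


def merge_logs(sources: Dict[str, Callable[[], str]]) -> str:
    """Fetch each log source and concatenate them with headers.

    ``sources`` maps a human-readable label (e.g. "job logs", "dask worker
    logs") to a zero-argument callable returning the log text.
    """
    merged = []
    for label, fetch in sources.items():
        merged.append(f"==== {label} ====")
        try:
            merged.append(fetch() or "(no logs available)")
        except Exception as exc:  # keep going even if one source fails
            merged.append(f"(could not retrieve {label}: {exc})")
    return "\n".join(merged)


# Hypothetical usage, with fetchers for the job pod and the Dask worker pods:
# print(merge_logs({
#     "job logs": lambda: read_pod_log("reana-run-job-..."),
#     "dask worker logs": lambda: read_pod_log("dask-worker-..."),
# }))
```

Recording a failed fetch inline keeps the merged output self-describing, so an error like the XRootD "Operation expired" one would still surface to the user even if another log source could not be retrieved.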