missing jobstore #164

Open
ionox0 opened this issue Mar 22, 2021 · 1 comment

ionox0 commented Mar 22, 2021

One of our ACCESS SV jobs failed due to the jobstore folder going missing (or possibly never being created).

Here is the Run object and stack trace:

http://voyager:5001/admin/runner/run/7f264fb8-65c7-473a-a223-0399d414cff6/change/?_changelist_filters=app%3Df833115f-c440-4fbf-833f-e8332730e641%26p%3D1%26q%3D06302_AE

jx27 2021-03-20 02:59:25,688 MainThread DEBUG toil.common: Shutting down batch system ...
jx27 2021-03-20 02:59:25,716 MainThread DEBUG toil.common: Obtained node ID 14e14258-9b46-4ae7-b0e0-f48d45462195 from file /proc/sys/kernel/random/boot_id
jx27 2021-03-20 02:59:25,716 MainThread DEBUG toil.deferred: Cleaning up deferred functions system
jx27 2021-03-20 02:59:25,729 MainThread DEBUG toil.deferred: Running for file /work/ci/beagle/work/05ed461c-1d53-4a99-9b5d-15dbb0edc8b9/toil-9c4d9f09-c5bb-42fb-9645-4361a5f3d75f-14e14258-9b46-4ae7-b0e0-f48d45462195/deferred/tmpzFAIz6
jx27 2021-03-20 02:59:25,729 MainThread DEBUG toil.deferred: Running orphaned deferred functions
jx27 2021-03-20 02:59:25,729 MainThread DEBUG toil.deferred: Deleting /work/ci/beagle/work/05ed461c-1d53-4a99-9b5d-15dbb0edc8b9/toil-9c4d9f09-c5bb-42fb-9645-4361a5f3d75f-14e14258-9b46-4ae7-b0e0-f48d45462195/deferred/tmpzFAIz6
jx27 2021-03-20 02:59:26,038 Thread-415 DEBUG toil.batchSystems.abstractGridEngineBatchSystem: No activity, sleeping for 1s
jx27 2021-03-20 02:59:26,039 Thread-415 DEBUG toil.batchSystems.abstractGridEngineBatchSystem: Received queue sentinel.
jx27 2021-03-20 02:59:26,039 MainThread DEBUG toil.common: ... finished shutting down the batch system in 0.35135602951 seconds.
Traceback (most recent call last):
  File "/juno/home/accessbot/miniconda3/envs/ACCESS_2.0.0/bin/toil-cwl-runner", line 8, in <module>
    sys.exit(main())
  File "/home/accessbot/miniconda3/envs/ACCESS_2.0.0/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 1374, in main
    result = toil.start(wf1)
  File "/home/accessbot/miniconda3/envs/ACCESS_2.0.0/lib/python2.7/site-packages/toil/common.py", line 782, in start
    self._serialiseEnv()
  File "/home/accessbot/miniconda3/envs/ACCESS_2.0.0/lib/python2.7/site-packages/toil/common.py", line 1006, in _serialiseEnv
    with self._jobStore.writeSharedFileStream("environment.pickle") as fileHandle:
  File "/home/accessbot/miniconda3/envs/ACCESS_2.0.0/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/accessbot/miniconda3/envs/ACCESS_2.0.0/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 516, in writeSharedFileStream
    with open(self._getSharedFilePath(sharedFileName), 'wb') as f:
IOError: [Errno 2] No such file or directory: '/work/ci/beagle/job-store/05ed461c-1d53-4a99-9b5d-15dbb0edc8b9/files/shared/environment.pickle'
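
The `IOError` above comes from `open(..., 'wb')`, which raises `Errno 2` when a parent directory of the target file does not exist. To tell "never created" apart from "removed part-way", it may help to find the shallowest directory that is missing. The following is only a diagnostic sketch: `first_missing_ancestor` and `explain_enoent` are hypothetical helper names and are not part of Toil.

```python
import errno
import os


def first_missing_ancestor(path):
    """Return the shallowest component of `path` that does not exist, or None.

    Walks upward from `path` until an existing entry is found; the last
    non-existent path seen on the way up is the first missing one.
    """
    path = os.path.abspath(path)
    missing = None
    while not os.path.exists(path):
        missing = path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return missing


def explain_enoent(target):
    """Describe why opening `target` for writing would raise ENOENT."""
    parent = os.path.dirname(os.path.abspath(target))
    missing = first_missing_ancestor(parent)
    if missing is None:
        return "parent directory exists: %s" % parent
    return "%s: first missing directory is %s" % (
        os.strerror(errno.ENOENT), missing)
```

Calling `explain_enoent` with the path from the traceback on the affected node would show whether only `files/shared` is absent or whether the whole per-run job store directory is gone.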

ionox0 self-assigned this Mar 22, 2021

allanbolipata commented

@ionox0 We currently suspect a cluster issue, since the failure is intermittent.

The short-term fix is to re-run the job and monitor whether the failure recurs. We will raise this with the HPC team, because it seems to be happening more and more frequently.
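
As a rough illustration of the re-run-and-monitor approach, here is a minimal sketch that retries only on the "No such file or directory" error and counts how often it occurs. Everything in it is an assumption for illustration: `submit` stands in for whatever actually launches the run, and `run_with_enoent_retry` is a hypothetical helper, not something that exists in this project or in Toil.

```python
import errno
import logging
import time

log = logging.getLogger("jobstore_retry")


def run_with_enoent_retry(submit, attempts=3, delay_seconds=60, sleep=time.sleep):
    """Call `submit()` and retry only when it fails with ENOENT.

    `submit` is a hypothetical zero-argument callable that launches the run.
    Returns (result, failures), where `failures` is the number of ENOENT
    failures seen before the successful attempt. Any other error, or ENOENT
    on the final attempt, is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    failures = 0
    for attempt in range(1, attempts + 1):
        try:
            return submit(), failures
        except EnvironmentError as exc:  # covers IOError and OSError
            if exc.errno != errno.ENOENT:
                raise
            failures += 1
            log.warning("attempt %d/%d hit a missing path: %s",
                        attempt, attempts, exc)
            if attempt == attempts:
                raise
            sleep(delay_seconds)
```

Logging each occurrence gives a count to bring to the HPC discussion, and restricting the retry to `ENOENT` avoids masking unrelated failures.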
