Workflow invocation f4920dcfcf3a664a stopped #27

Open
hechth opened this issue Nov 29, 2024 · 8 comments
Comments

@hechth

hechth commented Nov 29, 2024

Somehow my workflow invocation seems to have just stalled and the jobs are not progressing - this hasn't happened before.

For example, job 12a2cf9000f34f9a is a job from the cut tool, and it has been running for 2 days.

@martindemko
Contributor

Hi,
Unfortunately, we are having trouble connecting some compute nodes to the storage where job data are held and processed. That is most probably why the job(s) stalled, and why the downstream jobs that depend on them could not continue.

@zsalvet

zsalvet commented Dec 2, 2024

Which PBS job is this?

@martindemko
Contributor

Hi,
it's 6965793.pbs-m1.metacentrum.cz, but the PBS job itself was fine. The trouble happened in Pulsar when it tried to copy the outputs back to Galaxy.

@hechth
Author

hechth commented Dec 3, 2024

This currently affects around 1000 jobs in the invocation mentioned in the issue title.

Should I re-run everything or do you think that this will finish at some point?

@martindemko
Contributor

To be honest, I'm new to workflow invocations. But @martenson was looking at it with me, and we found two things.

The older of your two current invocations has a canceled sub-workflow (canceled by user, so that must have been you). I don't know why the invocation is still hanging there and did not register that it should be canceled too. Question: did you really cancel the sub-workflow?

The newer invocation is still waiting for several jobs to finish, even though those jobs had already finished. Galaxy was then restarted and must have lost track of the workflow, because after the restart it marked those jobs as running. I found this in the logs, and I will try to dig deeper into it with Martin. Maybe it is actually a bug, because these invocations are fairly new.

So please don't cancel the newer invocation just yet. Maybe we will be able to restart it. Thank you.

@martenson
Member

martenson commented Dec 4, 2024

AFAIK @hechth currently has no limit on UMSA resources, so please do not cancel/delete/remove problematic invocations/jobs/datasets; that way we have the best information when debugging. You can always execute a new one.

@martindemko
Contributor

@martenson, could you please investigate further to see whether this could really be a bug? In any case, it is weird behavior for a completed job to be rescheduled on its own just because Galaxy was restarted.

@hechth
Author

hechth commented Dec 4, 2024

@martindemko the jobs are still idling. It is possible that I cancelled the previous workflow or sub-workflow in an attempt to free up some resources.

4 participants