
HQ batch job submission fails at Karolina #791

Closed
svatosFZU opened this issue Dec 9, 2024 · 9 comments · Fixed by #792

@svatosFZU

Recently, I have noticed that HyperQueue has a problem submitting jobs to Karolina's batch system. The stderr shows:

/var/spool/slurmd/job1984117/slurm_script: line 9: syntax error near unexpected token `('
/var/spool/slurmd/job1984117/slurm_script: line 9: `/scratch/project/open-29-6/session/ACmount.sh && RUST_LOG=hyperqueue=debug /home/svatosm/hq-v0.19.0-linux-x64/hq (deleted) worker start --idle-timeout "5m" --manager "slurm" --server-dir "/home/svatosm/.hq-server/001" --on-server-lost "finish-running" --time-limit "1day 23h 59m"; /scratch/project/open-29-6/session/ACumount.sh'

So, it seems the problem is that HyperQueue adds (deleted) to the hq command in hq-submit.sh for some reason, e.g.:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=hq-4-3
#SBATCH --output=/home/svatosm/.hq-server/001/autoalloc/4/003/stdout
#SBATCH --error=/home/svatosm/.hq-server/001/autoalloc/4/003/stderr
#SBATCH --time=48:00:00
#SBATCH -AOPEN-29-6 -pqcpu -c128

/scratch/project/open-29-6/session/ACmount.sh && RUST_LOG=hyperqueue=debug /home/svatosm/hq-v0.19.0-linux-x64/hq (deleted) worker start --idle-timeout "5m" --manager "slurm" --server-dir "/home/svatosm/.hq-server/001" --on-server-lost "finish-running" --time-limit "1day 23h 59m"; /scratch/project/open-29-6/session/ACumount.sh

Interestingly, I do not see this problem on Barbora, even though both clusters are running the same version (v0.20.0).

@Kobzol
Collaborator

Kobzol commented Dec 9, 2024

Hi, that looks a lot like this issue: #452. The (deleted) part is added by Linux, when it figures out that the executable has been removed. Is it possible that the hq binary has been removed or moved, or its working directory has been removed or moved, in the meantime?

@svatosFZU
Author

I see I have picked some old job from a previous version rather than a new one. Other than that, no; the HQ binaries are there:

[[email protected] ~]$ ls /home/svatosm/hq-v0.19.0-linux-x64/hq
/home/svatosm/hq-v0.19.0-linux-x64/hq
[[email protected] ~]$ ls /home/svatosm/hq-v0.20.0-linux-x64/hq
/home/svatosm/hq-v0.20.0-linux-x64/hq

@Kobzol
Collaborator

Kobzol commented Dec 9, 2024

I see. Probably it's some networked-filesystem issue then, where Linux thinks the file has been removed for some reason. Created #792 to try to work around this.

@svatosFZU
Author

OK, so can this be fixed within the HQ or should I relocate the file?

@Kobzol
Collaborator

Kobzol commented Dec 9, 2024

The proposed PR should fix this issue, in the sense that we will just ignore the (deleted) suffix. But it's unclear whether Linux will be able to execute the binary at the specified path if it seemingly considers it to be deleted 🤷‍♂️ So moving the file might also help, e.g. from the home directory to the PROJECT filesystem. This issue can also be transient and resolve itself.
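The direction of the workaround can be sketched as follows (a hypothetical illustration, not the actual code in PR #792): drop the " (deleted)" marker from the resolved executable path before it is embedded in the generated submit script.

```rust
// Hypothetical sketch: strip the " (deleted)" suffix Linux appends to
// the /proc/self/exe target when it considers the binary unlinked.
fn strip_deleted_suffix(path: &str) -> &str {
    path.strip_suffix(" (deleted)").unwrap_or(path)
}

fn main() {
    let raw = "/home/svatosm/hq-v0.19.0-linux-x64/hq (deleted)";
    println!("{}", strip_deleted_suffix(raw));
}
```

This only repairs the path string written into the script; as noted above, whether the kernel (or the networked filesystem) will actually execute the binary at that path is a separate question.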

@svatosFZU
Author

OK, I will try this first, see whether it fixes the problem, and move the binary based on that.

@svatosFZU
Author

One more question:
If something like this happens, is it somehow cached in the HyperQueue? I deleted the journal file, restarted the server and everything started working.

@spirali
Collaborator

spirali commented Dec 11, 2024

No, there is no persistent state except the journal and the access file (the access file only holds information about where the server is running, plus encryption keys; a new access file is generated by default when the server is started).

@Kobzol
Collaborator

Kobzol commented Dec 11, 2024

More likely it was the rate limiter. HQ tries to be very conservative with automatic allocations when it sees that they start failing.
