Add new 'RAW' variables for SHARE and WORK variables to ensure share/work dirs are created, not left with broken symlinks #5978
CYLC_WORKFLOW_SHARE_DIR_RAW and CYLC_WORKFLOW_WORK_DIR_RAW added. These are then used for 'mkdir' to ensure directories are created. Currently, if SHARE_DIR and WORK_DIR are symlinks, and the dir they point to does not exist, the dirs will not be created, and tasks may fail. The new variables ensure the correct directories are always created.
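The failure mode described above can be reproduced outside Cylc. The following is a minimal sketch, not Cylc code: it builds a dangling symlink in a temporary directory (the names `scratch` and `share` are illustrative assumptions) and shows that a `mkdir -p`-style call on the link fails without creating the target.

```python
import os
import tempfile


def demo_dangling_symlink() -> dict:
    """Show that makedirs on a dangling symlink does not create its target."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: "share" is a symlink to a directory on
        # another disk that no longer exists (e.g. after a failover).
        target = os.path.join(tmp, "scratch", "share")  # never created
        link = os.path.join(tmp, "share")
        os.symlink(target, link)

        result = {
            "is_link": os.path.islink(link),  # the link itself is present
            "exists": os.path.exists(link),   # follows the link: target missing
        }
        try:
            # Rough Python equivalent of `mkdir -p "$link"`.
            os.makedirs(link, exist_ok=True)
            result["mkdir_ok"] = True
        except FileExistsError:
            # The link occupies the name, so mkdir reports "File exists".
            result["mkdir_ok"] = False
        result["target_created"] = os.path.isdir(target)
        return result
```

Calling `demo_dangling_symlink()` reports that the link is present, that it does not resolve, and that the target directory is still missing after the attempt.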
@ColemanTom - we discussed this in the project meeting today. We wondered whether having the disk yanked out from underneath you is really something that Cylc should be expected to handle. @dpmatthews might have some more specific comments tomorrow. At this stage I don't see a problem with handling it if we can, in principle, but given the purpose of the share directory (shared space for workflow task IO) it seems likely that simply recreating it mid-run would be insufficient to allow the workflow to continue - wouldn't you expect critical input files for upcoming tasks to be disappeared by the disk failover? Also can you explain where you expect these new RAW variables to be defined?
Assuming workflows are using
It depends on the system and timing. Hypotheticals based on real models:
Hypothetical 1 - cold start, easy rewind
Hypothetical 2 - warm start, no work dir
Next question
Unfortunately I cleaned out my examples yesterday due to quotas being neared, so I can't show you a screenshot, but, in this current MR, they would be defined in the |
@@ -1345,6 +1348,9 @@ def get_job_conf(
'pre-script': rtconfig['pre-script'],
'script': rtconfig['script'],
'submit_num': itask.submit_num,
'symlink_dirs': get_dirs_to_symlink(
If this were to go forward, I imagine making sure get_dirs_to_symlink was cached would be a reasonable thing to do for a minor speedup with limited extra memory load.
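As a rough illustration of the caching idea, here is a sketch using `functools.lru_cache`. The function `resolve_symlink_dirs`, its arguments, and the paths it returns are hypothetical stand-ins and do not reflect the real signature or behaviour of `get_dirs_to_symlink`.

```python
from functools import lru_cache
from types import MappingProxyType


@lru_cache(maxsize=128)
def resolve_symlink_dirs(install_target: str, workflow_name: str):
    """Hypothetical stand-in for a config-driven symlink lookup.

    Arguments must be hashable for lru_cache to work. The result is
    wrapped in a read-only mapping because every caller receives the
    same cached object.
    """
    # Assumption: imagine an expensive global-config lookup here.
    base = f"/scratch/{install_target}/cylc-run/{workflow_name}"
    return MappingProxyType({
        "share": f"{base}/share",
        "work": f"{base}/work",
    })
```

Repeated calls with the same arguments return the cached mapping, and `resolve_symlink_dirs.cache_info()` reports hits and misses. A cache like this would also need clearing whenever the underlying configuration is reloaded.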
Note that
I wonder if it might actually be simplest and safest just to do the run-time equivalent of that, for workflows that are running when the failover occurs, rather than try to handle it automatically. I think that would be this:
# job.sh
# Create share and work directories
- mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}" || true
+ mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}"
Then jobs submitted after the failover will just fail, with a simple error message that explains what's wrong. Then it would be up to the user to diagnose the problem (pretty easy, given that you ought to know that disk failover is a possibility) and:
To me that seems a reasonable way to respond to something as drastic as the disk being replaced under the running workflow. What do you think?
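The difference between the two `mkdir` lines in that proposal can be checked with a small experiment. This is only a sketch, assuming a POSIX `sh` and `mkdir` are on the path; the helper name `run_mkdir` and the directory layout are illustrative.

```python
import os
import subprocess
import tempfile


def run_mkdir(swallow_errors: bool) -> int:
    """Run `mkdir -p` on a dangling symlink and return the exit status."""
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "share")
        # The link target is deliberately never created.
        os.symlink(os.path.join(tmp, "missing", "share"), link)

        cmd = 'mkdir -p "$1"'
        if swallow_errors:
            cmd += " || true"  # current behaviour: the failure is hidden
        proc = subprocess.run(
            ["sh", "-c", cmd, "sh", link],
            capture_output=True,
            text=True,
        )
        return proc.returncode
```

With `swallow_errors=True` the command exits with status 0 even though nothing was created; without `|| true` the non-zero status from `mkdir` propagates, which is what would let a job fail early with a visible error.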
Hard to do in our environment where there is no direct access to a user and access is via Cylc edit runs only. Can your proposal fix the dangling symlinks via an edit run? You can discuss our environment requirements more with @jarich as I'm definitely able to handle making directories myself because I work as myself. Although... what if a workflow is running under a different user, you have permissions to run their system, and they are on leave so can't make the folder themselves? Again, I guess it comes back to, it needs to be doable via a Cylc edit run still?
I do support failing early (removing |
This is something that our workflows need to handle. This isn't the same as it being something Cylc needs to handle, but it would be nice if it did. Disk failovers are rare, but they happen sometimes, and while we absolutely expect jobs will fail when we do a disk failover, we need them to succeed with no more actions required than to re-run the job. First level support will not have access to create any missing directories or symbolic links. There are other moments in time when we can make sure that these actions are carried out if Cylc isn't the right place to do it, but I would not have thought some basic disk filesystem failover was out of scope for how other agencies run their HPCs.
Yes, we do have code which makes them, it's just tedious having every workflow have to call a script/function (and I'm pretty sure some don't so this has caused us problems in the past).
What I was trying to get at above, in part, is whether or not having Cylc automatically fix the symlinks really is sufficient under the circumstances. Consider some generic Cylc workflow, regardless of model cold/warm starts.
These tasks will almost certainly be communicating through the share directory: foo's output files are bar's input files, etc. If I understand Tom's response above, some of your workflows may require rewinding to the previous cycle, and some may require manual run-dir manipulation to prepare these "intermediate files" after the fail-over? At least in the second case, I presume your first level support staff can't do that either? If so, maybe it's reasonable to say that disk fail-over isn't a "first level support" problem. Or if it is, those staff need to be given the power to fix it? (Note, I'm not playing devil's advocate for the hell of it, just trying to understand exactly what your situation is, to form an opinion on whether or not it should be a "Cylc problem" 😁 !)
Not to the previous cycle, just rerun the current cycle. If in Cylc7 world, right click on the cycle, reset status to waiting, let it run. No manual run-dir manipulation takes place. If the Cylc team don't want this sort of functionality, that is completely fine, it's a proposal/idea. I imagine I can easily modify our global-init-script to do mkdir on the resolved directories (as it's a known pattern in the
@ColemanTom - I'm not dismissing the idea outright, at least not yet, it seems worth considering, especially if it's true that no other external manual intervention would be required for your fail-over use case. Others on the team who work closer to the metal than me might have stronger opinions, let's see. At the very least, I think we should remove the |
I just assumed there was fear of a race condition - and perhaps in a non-POSIX Linux environment there is?
Ah, yeah .. maybe that's it (I'd have thought |
Changes like this are high risk and require a lot of thinking and testing (this may have been protecting us for some use cases).
It was added (without an explanatory comment) in PR #17 - merged 30 Sep 2011 🤯
I feel like the discussion around |
On the topic of this MR itself, the Issue is currently listed as against the |
See #6000
@ColemanTom - thanks for engaging on this (and other!) issues, much appreciated. However, I don't think we've reached a consensus on this one yet, so I've punted it back to 8.4.0 for now.
I would prefer the readlink approach suggested in the original issue.
# If the path does not resolve, create either the symlink target or the path itself
if [[ ! -e "${CYLC_WORKFLOW_SHARE_DIR}" ]]; then
    if [[ -L "${CYLC_WORKFLOW_SHARE_DIR}" ]]; then
        # Dangling symlink: create the directory it points to
        mkdir -p "$(readlink "${CYLC_WORKFLOW_SHARE_DIR}")" || true
    else
        mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}" || true
    fi
fi
As far as I can see this is safe and harmless,
Note: it is not safe to use the global config to determine what symlinks are in place - there are no guarantees it matches the reality (configs can change / existing directory structures can be used).
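The readlink approach inspects the filesystem itself rather than the configuration. A Python sketch of the same logic follows. It is not taken from Cylc, the function name is illustrative, and it swaps in `os.path.realpath` where the shell version uses `readlink`, so relative and chained links are resolved to an absolute path.

```python
import os


def ensure_dir_through_symlink(path: str) -> str:
    """Create `path`, or the target of `path` if it is a dangling symlink.

    Returns the resolved directory that now exists.
    """
    if os.path.exists(path):
        # exists() follows symlinks, so this covers real directories
        # and symlinks whose target is present.
        return os.path.realpath(path)
    if os.path.islink(path):
        # Dangling symlink: create whatever it points at.
        target = os.path.realpath(path)
        os.makedirs(target, exist_ok=True)
        return target
    # Nothing there at all: create the directory directly.
    os.makedirs(path, exist_ok=True)
    return os.path.realpath(path)
```

Because the decision is made from what is actually on disk, it does not depend on the global config matching the existing directory structure.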
Closes #5567
I've not fully tested this (and am honestly not sure where to test this in an automated sense or what I can leverage from the existing infrastructure) but wanted to open up for comment before I spend any more time on it.
Check List
CONTRIBUTING.md and added my name as a Code Contributor.
setup.cfg (and conda-environment.yml if present).
CHANGES.md entry included if this is a change that can affect users