Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle filesystem latency when creating RUN_COMPLETE file #803

Conversation

hwikle-lanl
Copy link
Collaborator

Code review checklist:

  • Code is generally sensical and well commented
  • Variable/function names all telegraph their purpose and contents
  • Functions/classes have useful doc strings
  • Function arguments are typed
  • Added/modified unit tests to cover changes.
  • New features have documentation added to the docs.
  • New features and backwards compatibility breaks are noted in the RELEASE.md

@hwikle-lanl hwikle-lanl self-assigned this Jan 24, 2025
@hwikle-lanl hwikle-lanl linked an issue Jan 24, 2025 that may be closed by this pull request
Copy link
Collaborator

@Paul-Ferrell Paul-Ferrell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like throwing an exception here.
I think we should try harder to force a cache update. Instead of just doing an exists, you might try tossing in a path.parent.iterdir(). Apparently opening the parent directory should cause the directory cache to refresh.

If that doesn't seem to be working, drop a symlink to the temp file location where you were going to move the original file, and walk away instead of throwing an exception.

Also, CREATION_ERROR is explicitly for test creation problems. I'd create a new 'WARNING' label for cases like this.

File system quirks mean that the RUN_COMPLETE file may not yet exist
on the file system after it is created. Since the file is used by
the same process relatively soon after creation, this can occassionally
cause a FileNotFoundError. It was therefore necessary to wait for
the file to be synced to the file system, and to impose a timeout.
The previous timeout value was too short and resulted in TimeoutErrors
relatively frequently. This commit increases the timeout value from
2 seconds to 30 seconds, per the suggestion of Paul Ferrell.
The idiosyncrocies of NFS means that when the RUN_COMPLETE file is
written, it may not appear on the local file system, which results
in a FileNotFoundError when that file is accessed. This commit adds
a fallback when the file does not exist on the local file system, in
the process reverting the previous solution which involved waiting
and eventually raising a timeout.

The fallback consists of the following steps:

1. Create the file as normal, and check whether it exists.

2. If it does not exist, attempt to force an NFS cache reset by listing
the contents of the parent directory.

3. If the file still does not exist, create a symlink to the location
where the file is expected eventually to be.
@hwikle-lanl hwikle-lanl force-pushed the 802-uncaught-exception-when-creation-of-run_complete-file-times-out branch from 25f6f34 to f7b6152 Compare February 4, 2025 20:12
@hwikle-lanl hwikle-lanl changed the title Catch unhandled TimeoutError and log status Handle filesystem latency when creating RUN_COMPLETE file Feb 5, 2025
@hwikle-lanl hwikle-lanl added the bug Something isn't working label Feb 5, 2025
Copy link
Collaborator

@smehta99 smehta99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hwikle-lanl hwikle-lanl merged commit 00aae1a into develop Feb 5, 2025
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Uncaught exception when creation of RUN_COMPLETE file times out
3 participants