Handle filesystem latency when creating `RUN_COMPLETE` file #803

hwikle-lanl · 2025-01-24T15:38:00Z

Code review checklist:

Code is generally sensical and well commented
Variable/function names all telegraph their purpose and contents
Functions/classes have useful doc strings
Function arguments are typed
Added/modified unit tests to cover changes.
New features have documentation added to the docs.
New features and backwards compatibility breaks are noted in the RELEASE.md

Paul-Ferrell

I really don't like throwing an exception here.
I think we should try harder to force a cache update. Instead of just doing an exists, you might try tossing in a path.parent.iterdir(). Apparently opening the parent directory should cause the directory cache to refresh.

If that doesn't seem to be working, drop a symlink to the temp file location where you were going to move the original file, and walk away instead of throwing an exception.

Also, CREATION_ERROR is explicitly for test creation problems. I'd create a new 'WARNING' label for cases like this.

File system quirks mean that the RUN_COMPLETE file may not yet exist on the file system after it is created. Since the file is used by the same process relatively soon after creation, this can occasionally cause a FileNotFoundError. It was therefore necessary to wait for the file to be synced to the file system, and to impose a timeout. The previous timeout value was too short and resulted in TimeoutErrors relatively frequently. This commit increases the timeout value from 2 seconds to 30 seconds, per the suggestion of Paul Ferrell.
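A minimal sketch of the wait-with-timeout approach this commit message describes. The helper name `wait_for_file`, the polling interval, and the error message are assumptions for illustration, not the code from this PR; only the 30 second timeout comes from the text above.

```python
import time
from pathlib import Path


def wait_for_file(path: Path, timeout: float = 30.0, interval: float = 0.1) -> None:
    """Block until `path` is visible, or raise TimeoutError after `timeout` seconds.

    Hypothetical helper: the name and the polling interval are illustrative.
    """
    deadline = time.monotonic() + timeout
    while not path.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} did not appear within {timeout} seconds")
        # Sleep briefly so the loop does not hammer the file system.
        time.sleep(interval)
```

A caller that does not catch the `TimeoutError` still fails the whole run, which is the problem reported in the linked issue.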

Because of the idiosyncrasies of NFS, the RUN_COMPLETE file may not appear on the local file system immediately after it is written, which results in a FileNotFoundError when that file is accessed. This commit adds a fallback for the case where the file does not exist on the local file system, and in the process reverts the previous solution, which involved waiting and eventually raising a timeout. The fallback consists of the following steps:

1. Create the file as normal, and check whether it exists.
2. If it does not exist, attempt to force an NFS cache reset by listing the contents of the parent directory.
3. If the file still does not exist, create a symlink to the location where the file is expected eventually to be.
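The three fallback steps could be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: the function name `create_with_fallback`, the `link_path` parameter (where the symlink itself is placed), the returned status strings, and the injectable `exists` check (included only so the fallback branches can be exercised on a local file system) are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def create_with_fallback(
    path: Path,
    text: str,
    link_path: Path,
    exists: Callable[[Path], bool] = Path.exists,
) -> str:
    """Write `path`, then fall back if it is not yet visible locally.

    Returns "created", "refreshed", or "symlinked" (hypothetical labels).
    """
    # Step 1: create the file as normal and check whether it exists.
    path.write_text(text)
    if exists(path):
        return "created"

    # Step 2: list the parent directory, which may prompt the client to
    # refresh its cached view of that directory.
    for _ in path.parent.iterdir():
        pass
    if exists(path):
        return "refreshed"

    # Step 3: leave a symlink pointing at the location where the file is
    # expected to appear, and return instead of raising.
    # Assumption: the link lives at `link_path`; the source does not say where.
    link_path.symlink_to(path.resolve())
    return "symlinked"
```

On a local file system the first branch is always taken, so passing a stub for `exists` is the simplest way to see the other two paths.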

smehta99

LGTM

hwikle-lanl requested a review from Paul-Ferrell January 24, 2025 15:38

hwikle-lanl self-assigned this Jan 24, 2025

hwikle-lanl linked an issue Jan 24, 2025 that may be closed by this pull request

Uncaught exception when creation of RUN_COMPLETE file times out #802

Closed

Paul-Ferrell reviewed Jan 24, 2025

hwikle-lanl added 6 commits February 4, 2025 12:47

Catch unhandled TimeoutError and log status

53ef32f

Add unit test for missing RUN_COMPLETE file fallback

6215557

Fix failing style tests

62df721

Fix error

f7b6152

hwikle-lanl force-pushed the 802-uncaught-exception-when-creation-of-run_complete-file-times-out branch from 25f6f34 to f7b6152 Compare February 4, 2025 20:12

hwikle-lanl requested review from smehta99 and Paul-Ferrell February 4, 2025 20:40

hwikle-lanl changed the title ~~Catch unhandled TimeoutError and log status~~ Handle filesystem latency when creating RUN_COMPLETE file Feb 5, 2025

hwikle-lanl added the bug ("Something isn't working") label Feb 5, 2025

smehta99 approved these changes Feb 5, 2025

hwikle-lanl merged commit 00aae1a into develop Feb 5, 2025
7 checks passed
