shared-cache - enable shared-cache support in SCR #481

Open
mcfadden8 opened this issue Feb 23, 2022 · 2 comments · May be fixed by #511

mcfadden8 (Collaborator) commented Feb 23, 2022

SCR can direct the application to write dataset files to subdirectories within a cache directory. SCR also stores its redundancy data in these subdirectories.

Question: Should it be considered an error to configure redundancy schemes when the cache is shared?

SCR currently constructs the full path of a cache directory by concatenating the cache base directory (SCR_CACHE_BASE), the name of the user running the application, and the job scheduler's resource allocation id. This presents a name collision problem when the cache is on a shared file system, because processes on different nodes then resolve to the same directory.

This ticket proposes that the cache directory name should also have the MPI rank number appended to the name above.

Question: Should we append this in general, or only when the cache is on a shared file system? The latter raises the question of how SCR can determine whether the file system is shared. My vote is to simply append the rank number after the session id as a general rule.
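To make the collision concrete, here is a minimal Python sketch of the two naming schemes. It is not SCR's actual code: the function names, the `rank.<N>` component, and the example base path are assumptions chosen for illustration, and the real separator and layout may differ.

```python
import posixpath


def cache_dir_current(cache_base, user, job_id):
    """Current scheme as described above: base + user + allocation id.

    Hypothetical helper; the exact layout SCR uses may differ.
    """
    return posixpath.join(cache_base, user, str(job_id))


def cache_dir_per_rank(cache_base, user, job_id, rank):
    """Proposed scheme: the same path with the MPI rank appended.

    The "rank.<N>" component is an assumed format, not one taken from SCR.
    """
    return posixpath.join(cache_dir_current(cache_base, user, job_id),
                          "rank.{}".format(rank))


# Four ranks of one job, all seeing the same shared cache base.
ranks = range(4)
current = {cache_dir_current("/shared/cache", "alice", 12345) for _ in ranks}
proposed = {cache_dir_per_rank("/shared/cache", "alice", 12345, r) for r in ranks}

print(len(current))   # every rank maps to one directory on a shared cache
print(len(proposed))  # each rank gets its own directory
```

With a node-local cache the current scheme is harmless, since each node has its own copy of the path. On a shared file system every rank lands in the same directory, which the per-rank suffix avoids without SCR needing to detect whether the file system is shared.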

@mcfadden8 self-assigned this on Feb 23, 2022
@mcfadden8 removed the WIP label on Mar 4, 2022
@mcfadden8 changed the title from "Append node-name to SCR_CACHE_BASE to make the name node unique" to "shared-cache - enable shared-cache support in SCR" on Jul 20, 2022
mcfadden8 (Collaborator, Author) commented Jul 27, 2022

@adammoody, I think that this change is required in order for SCR to support a shared cache. Do you agree?

If so, should a shared cache be a mode that SCR is configured in? Or, should we simply change the naming scheme in general so that it works in both a shared and non-shared cache?

adammoody (Contributor) commented Jul 27, 2022

To start with, let's only claim to support SINGLE when using a shared cache. We'll assume that the shared cache is reliable enough that redundancy is not necessary. Also, I think it'll be too complicated for us (and maybe not possible) to try to implement a redundancy scheme that could actually tolerate failures of the file system, e.g. in the case that a Lustre server drops out.

I don't know whether we can easily enforce that one only uses SINGLE, since we can't easily determine whether a storage location is node-local or global. Having said that, I think Dong had created something we may be able to use (https://computing.llnl.gov/projects/fast-global-file-status). But to keep things simple, let's pretend that we can't for now.

Instead, we can document that it is on the user to mark any shared storage as GLOBAL by defining a proper storage descriptor. We can then enforce that only the SINGLE redundancy scheme is valid to use with a GLOBAL storage descriptor.

  • It's on the user to define a store descriptor to specify a shared cache as GLOBAL. If they don't, SCR is allowed to blow up in some bad way.
  • SCR enforces that only SINGLE can be used with a GLOBAL store descriptor.
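The enforcement rule in the two bullets above could look roughly like the following Python sketch. It is only an illustration of the check, not SCR's implementation: `StoreDescriptor`, its `is_global` field, and `validate_redundancy` are hypothetical names, and how SCR actually represents a GLOBAL store descriptor is not specified in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreDescriptor:
    """Hypothetical stand-in for an SCR store descriptor."""
    path: str
    is_global: bool  # assumed flag: True when the user marks the store as GLOBAL


def validate_redundancy(store, scheme):
    """Return the normalized scheme name, or raise if it is not allowed.

    Rule from the discussion: only SINGLE is valid with a GLOBAL store.
    Node-local stores are left unrestricted in this sketch.
    """
    scheme = scheme.upper()
    if store.is_global and scheme != "SINGLE":
        raise ValueError(
            "redundancy scheme {} is not supported on GLOBAL store {}; "
            "use SINGLE".format(scheme, store.path)
        )
    return scheme


shared = StoreDescriptor(path="/shared/cache", is_global=True)
local = StoreDescriptor(path="/dev/shm", is_global=False)

print(validate_redundancy(shared, "single"))  # accepted
print(validate_redundancy(local, "XOR"))      # accepted on a node-local store

try:
    validate_redundancy(shared, "XOR")
except ValueError as err:
    print("rejected:", err)
```

Note that the check relies entirely on the user having marked the store as GLOBAL. An unmarked shared cache would pass validation, which matches the caveat in the first bullet.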

@mcfadden8 linked a pull request on Nov 2, 2022 that will close this issue