NOBatchSystem class: run a Rocoto workflow without a batch system #18
Comments
I am moving the MOABSHBatchSystem to its own issue. |
I want to emphasize here that the NoBatchSystem scheduler type should be designed and implemented as a stand-alone scheduler method selectable by users for use on any system, including a laptop/workstation. This has been requested before, but the use cases at the time were not compelling enough to devote time to it. Please ensure your solution will provide the capability in a general, robust, way for everyone, everywhere. |
Chris, Yes, that is how it is designed. We already have potential customers that need that to run the FV3 GFS Beta. It has no knowledge of batch systems. Instead, it tracks daemon processes using a directory on a filesystem to trade information. You can even kill the jobs from a remote machine by adding a $jobname.kill file in that directory. The best part is:
as in, "stop bugging me about a scheduler, and just run the jobs." Sincerely, |
Are you using ~/.rocoto for storing the information about the processes being tracked? |
Chris, No. I let the user specify the directory. Using the home directory for scrub space or metadata-heavy activities is risky because the home quota is often small and the home partition is often less capable than others on the machine.
The job_id_dir does not take a cyclestr because the BatchSystem classes do not know which cycle they are submitting for. Instead, it is all one area, as it would be for a real batch system. There is one file per job ($jobname.job), and rocotorewind kills a job by creating a kill file, "$jobname.kill". I'm thinking of having no default for the job_id_dir and forcing the user to specify it, as a safety measure. |
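The file-based protocol described above (one `$jobname.job` file per job, and a `$jobname.kill` file to request termination) can be illustrated with a minimal sketch. This is not the NOBatchSystem code itself, which is not shown in this thread; the function names `submit`, `request_kill`, and `supervise`, the choice to store a PID in the job file, and the polling interval are all assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def submit(job_dir: Path, jobname: str, command: list) -> subprocess.Popen:
    """Start a local process and record its PID in <jobname>.job (assumed file content)."""
    proc = subprocess.Popen(command)
    (job_dir / f"{jobname}.job").write_text(str(proc.pid))
    return proc


def request_kill(job_dir: Path, jobname: str) -> None:
    """Ask for a job to be stopped by creating <jobname>.kill.

    Any machine that can write to job_dir can do this, which is what makes
    a remote kill possible on a shared filesystem.
    """
    (job_dir / f"{jobname}.kill").touch()


def supervise(job_dir: Path, jobname: str, proc: subprocess.Popen,
              poll_seconds: float = 0.1) -> int:
    """Wait for the job to finish, terminating it if a kill file appears."""
    kill_file = job_dir / f"{jobname}.kill"
    while proc.poll() is None:
        if kill_file.exists():
            proc.terminate()
            proc.wait()
            break
        time.sleep(poll_seconds)
    return proc.returncode


if __name__ == "__main__":
    job_dir = Path(tempfile.mkdtemp())
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    proc = submit(job_dir, "demo", sleeper)
    request_kill(job_dir, "demo")
    print("exit status:", supervise(job_dir, "demo", proc))
```

Because the only shared state is the directory, the supervising process needs no knowledge of a batch system; the trade-off is that every poll is a metadata operation on that filesystem.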
Exposing those sorts of details makes Rocoto less usable for novices. Please keep the default path set to /tmp or ~/.rocoto. Also, the tag you suggest doesn't make any sense except when the NoBatchSystem scheduler is chosen. So, it needs to be specified in a different way. Please provide a ":TempDirectory" configuration option in the rocotorc file so that users who want to override the default can do so. |
Chris, That is an excellent idea. I'll work on that soon. |
Chris, I do like your idea of configuring this in ~/.rocoto, but on second thought, we will also need a way for users to modify the job pid directory on a per-workflow basis. Some workflows will have thousands or tens of thousands of active jobs, which will generate so much metadata access for pid tracking that you may have to split the files across multiple filesystems or filesets. I suggest we add a way to pass custom options to the batch system in a consistent manner. This would be documented as "advanced usage." Here is an example of how one might put the NOBatchSystem and LSFBatchSystem into one workflow and configure them separately.
This could have more interesting applications, like allowing the workflow to be split across multiple machines.
|
Un-closing. I closed this by accident. |
Running a workflow with thousands of active tasks without a batch system is madness. That is not something that should be supported. |
Chris, Well then, there's the other matter of cluttering up the user's ~/.rocoto directory. We've already had people hit their quota because Rocoto keeps making copies of its configuration file, and generating huge log files, every time it runs. If you add to this some pid files, which will not be deleted if a user prematurely ends a rocoto workflow, then the problem will get even worse. |
The pid files are extremely small. And there must be a way to ensure stale files do not accumulate. The other issues are/were bugs that have been or need to be fixed. You can put the pid files in a tmp directory of the user's choosing (via the config option) and you can create subdirectories under that if you want to group them by workflow. |
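One way to combine a configurable base directory with per-workflow subdirectories is sketched below. The `TempDirectory` key mirrors the ":TempDirectory" option proposed earlier in this thread; representing the rocotorc settings as a plain dictionary, the `rocoto-jobs` default under the system temp directory, the `override` argument, and the function name `resolve_job_dir` are assumptions for illustration only.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def resolve_job_dir(config: dict, workflow_name: str,
                    override: Optional[str] = None) -> Path:
    """Pick the directory that holds job-tracking files for one workflow.

    Precedence (an assumed policy): explicit per-workflow override,
    then the TempDirectory setting, then a default under the system
    temp directory.
    """
    base = (override
            or config.get("TempDirectory")
            or os.path.join(tempfile.gettempdir(), "rocoto-jobs"))
    # Group files by workflow so one workflow's files can be listed or
    # scrubbed without touching another's.
    job_dir = Path(base).expanduser() / workflow_name
    job_dir.mkdir(parents=True, exist_ok=True)
    return job_dir


if __name__ == "__main__":
    print(resolve_job_dir({}, "example_workflow"))
```

A novice never has to set anything under this scheme, while a user with a very large workflow can still point a single workflow at a different filesystem.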
Chris, The NOBatchSystem deletes the old files once it has recorded the job's status in the workflow database. If the user stops running rocotorun, then there will be some stale files lying around. If the user does that many times, for large workflows, then there will be thousands of files after a few months. Switching to /tmp eliminates the usefulness of the rocotorewind command, which is able to remotely kill a job by making a Sincerely, |
Then find a way to prevent that from happening or to clean up the stale files. |
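A possible cleanup pass is sketched below, under stated assumptions: each `.job` file contains the PID of a process on the local host, and a file is considered stale once it is older than a threshold and its process no longer exists. The function names, the one-hour default, and the file contents are hypothetical, and the liveness check is POSIX-only. It does not help when jobs run on a different host from the one doing the cleanup, and PID reuse can make a dead job look alive.

```python
import os
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """POSIX-only check: signal 0 tests for existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_job_files(job_dir: Path, min_age_seconds: float = 3600.0) -> list:
    """Delete old .job files (and matching .kill files) whose process is gone."""
    removed = []
    now = time.time()
    for job_file in job_dir.glob("*.job"):
        # Leave recent files alone so a job that is still starting is not scrubbed.
        if now - job_file.stat().st_mtime < min_age_seconds:
            continue
        try:
            pid = int(job_file.read_text().strip())
        except ValueError:
            pid = None  # unreadable content is treated as stale
        if pid is None or not pid_is_alive(pid):
            job_file.unlink()
            kill_file = job_file.with_suffix(".kill")
            if kill_file.exists():
                kill_file.unlink()
            removed.append(job_file.name)
    return sorted(removed)


if __name__ == "__main__":
    job_dir = Path(tempfile.mkdtemp())
    finished = subprocess.Popen([sys.executable, "-c", "pass"])
    finished.wait()
    (job_dir / "done.job").write_text(str(finished.pid))
    (job_dir / "live.job").write_text(str(os.getpid()))
    print(remove_stale_job_files(job_dir, min_age_seconds=0.0))
```

Running such a pass at the start of each invocation would bound the number of leftover files even if a user abandons a workflow midway.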
I would like to mention that this feature is extensively tested and very stable. Apart from the jobid directory and scrubbing changes requested by Chris, it is definitely ready for a pull request. |
@samtrahan Does the "NoBatchSystem" still work in rocoto? I tried specifying "no" / "none"/"" as the scheduler but rocoto complains at the line where that is done.
Line 112
where SCHED
You mentioned about the need to specify Thanks |
@danielabdi-noaa - This issue and the associated code are more than 4 years old. While maintenance of existing capabilities and bug fixes have been a high priority, time for development of major new features has not been available for the past few years. This feature has not been given the attention it deserves. I am not convinced the implementation here is as robust as it needs to be for supporting execution of the UFS. We can talk offline about this if you want. |
@christopherwharrop Thanks for the info. I thought the |
Yes, it's confusing. That file was created as a placeholder a long time ago but the feature never materialized. It probably should be deleted. If this is ever added, then the one Sam made would take its place. |
Looks like renewed interest in this feature from years back. Hope to see some action. 🤞🏾 |
Rocoto currently requires that all workflow tasks run as batch jobs. While Rocoto was designed for workflows running on HPC resources, which are always managed by batch systems, it is sometimes desirable to use Rocoto to manage small workflows on workstations or other systems that do not have a batch system. Additionally, there may be times, such as when tasks are extremely small computationally and of very short duration, when it is more appropriate to run a task on the local host rather than via the batch system.