-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial Job Sizing Infrastructure #488
base: dev
Are you sure you want to change the base?
Conversation
Added a dataclass to represent batch job sizes, including light validation.
Added a class method to `gempyor.batch.JobSize` class that is slightly more convenient to use from a CLI script. Provides a good home for future code realted to GH-360. Corresponding tests and docs.
Added internal `gempyor.batch._resolve_batch_system` utility to make sense of batch system CLI options.
Added a representation of batch job time limits, similar to `JobSize`, along with corresponding documentation and unit tests.
Added the `JobTimeLimit.from_per_simulation_time` class method to easily create instances from user provided inputs. Also added comparison methods for ease of unit testing.
Added an enum for batch systems to formalize the valid systems flepiMoP can run on.
Mainly reduced the functionality of the `size_from_jobs_sims_blocks` class method. Best to leave that to more feedback and GH-360.
Added a class to represent job resources required/requested for a batch job. Includes documentation and some unit tests, but not enough.
More complete set of unit tests for the `JobResources` class.
The `JobSize.from_jobs_simulations_blocks` now considers batch system, namely to limit the size of the job when running locally since local inference is meant for testing.
The `JobTimeLimit` class has specific formatting for `BatchSystem.SLURM`.
Thanks Tim, looks good to me. From what I can tell this is just translating some of the old inference launcher script to this infrastructure? For the defaults here you've set:
Is this just the infrastructure you're building here, rather than specifics of these? Also just to confirm some of the language here (can move this discussion elsewhere if better elsewhere, or to deal with later). For the R inference For emcee i also don't think we have any blocking capabilities afaik, and i don't think we use blocks on our HPCs - i think this was mainly an issue for AWS. A separate thought, but can consider (later) whether this is needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about the design here. Seems more like we should have a BatchSystem ABC rather than enum, and have BatchSystem's do their respective things?
aws: bool, | ||
local: bool, | ||
slurm: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why these flags? they are mutually exclusive, and the selection is captured by batch_system
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline, but adding note here for other reviewers.
flepiMoP/flepimop/gempyor_pkg/src/gempyor/batch.py
Lines 878 to 904 in ad11fae
click.Option( | |
param_decls=["--batch-system", "batch_system"], | |
default=None, | |
type=click.Choice(("aws", "local", "slurm"), case_sensitive=False), | |
help="The name of the batch system being used.", | |
), | |
click.Option( | |
param_decls=["--aws", "aws"], | |
default=False, | |
type=bool, | |
is_flag=True, | |
help="A flag indicating this is being run on AWS.", | |
), | |
click.Option( | |
param_decls=["--local", "local"], | |
default=False, | |
type=bool, | |
is_flag=True, | |
help="A flag indicating this is being run local.", | |
), | |
click.Option( | |
param_decls=["--slurm", "slurm"], | |
default=False, | |
type=bool, | |
is_flag=True, | |
help="A flag indicating this is being run on slurm.", | |
), |
The reason for this was to resolve these options from the prior pull request, GH-394, into a single consistent answer. Have removed for now and will rethink how to implement in a follow up PR to accommodate the dynamic batch systems added in 78d960a.
Replaced the `BatchSystem` enum with an ABC that is extended by classes representing those systems. The ABC defines clear defaults for common formatting methods. Removed references to the original enum in `JobResources`. Removed `JobTimeLimit` all together as its formatting logic has been consolidated into the batch system class instances.
Added unit tests for `LocalBatchSystem` and `SlurmBatchSystem` specific behavior as well as added warnings to `LocalBatchSystem` when overriding user inputs.
* Reorganized the code within the `gempyor.batch` module to better group similiar objects together, and * Added examples to all exported functionality to promote usability.
These are just "defaults", I expect this to look different once working in the resource estimation abilities. I'm also open to changing defaults to whatever within reason.
Yeah, the and flepiMoP/flepimop/gempyor_pkg/src/gempyor/batch.py Lines 1141 to 1154 in ad11fae
|
Removed old (already commented out) unit tests for `JobResources` and `JobSize`.
I've removed references to specific inference methods from this PR for now, I think incorporating inference job type constraints are better suited for a follow-up PR. |
* Switched `JobSize`, `JobResources` from data classes to pydantic's `BaseModel`. * Remove tests that already cover features provided by pydantic`. * Updated examples for those classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great!
Describe your changes.
This pull request helps to move GH-487 along by adding:
BatchSystem
enum to represent the three environments we expect to be running batch jobs on,JobSize
class to represent general properties of a batch job,JobResources
class to represent the general resource requirements of a batch job, andJobTimeLimit
class to represent the time limit a job is allowed to run for.Spun out of GH-394.
Does this pull request make any user interface changes? If so please describe.
The user interface changes are new functionality added to
gempyor.batch
and are documented via docstrings but not in GitBook yet. I think it's not quite ready for GitBook documentation yet because:flepimop submit
Action For Submitting Calibration Jobs To HPC Environments #487).What does your pull request address? Tag relevant issues.
This pull request addresses GH-487 by spinning out smaller infrastructure components from GH-394 to make that PR easier to review. With that said I expect these classes to change some between now and the final product for GH-487 when working on resource estimation for
flepimop submit
.