
Initial Job Sizing Infrastructure #488

Open
wants to merge 19 commits into dev
Conversation

TimothyWillard
Contributor

Describe your changes.

This pull request helps to move GH-487 along by adding:

  • The BatchSystem enum to represent the three environments we expect to be running batch jobs on,
  • The JobSize class to represent general properties of a batch job,
  • The JobResources class to represent the general resource requirements of a batch job, and
  • The JobTimeLimit class to represent the time limit a job is allowed to run for.

Spun out of GH-394.
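For concreteness, a minimal sketch of the shapes these classes might take — the class names come from the PR, but the field names, validation details, and enum values here are assumptions, not the merged API:

```python
from dataclasses import dataclass
from enum import Enum


class BatchSystem(Enum):
    """The three environments batch jobs are expected to run on."""

    AWS = "aws"
    LOCAL = "local"
    SLURM = "slurm"


@dataclass(frozen=True)
class JobSize:
    """General properties of a batch job, with light validation."""

    jobs: int
    simulations: int
    blocks: int

    def __post_init__(self) -> None:
        # Light validation: every dimension must be a positive integer.
        for name in ("jobs", "simulations", "blocks"):
            if getattr(self, name) < 1:
                raise ValueError(f"`{name}` must be a positive integer.")


size = JobSize(jobs=10, simulations=500, blocks=1)
print(size)
```

`JobResources` and `JobTimeLimit` would follow the same dataclass-with-light-validation pattern for resource and time fields.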

Does this pull request make any user interface changes? If so please describe.

The user interface changes are new functionality added to gempyor.batch and are documented via docstrings but not in GitBook yet. I think it's not quite ready for GitBook documentation yet because:

  1. These classes are much more developer oriented than user oriented; I expect that non-developers would never directly interact with these classes, and
  2. I expect these classes to change some in response to adding resource estimation abilities (see task 3 of [Feature request]: flepimop submit Action For Submitting Calibration Jobs To HPC Environments #487).

What does your pull request address? Tag relevant issues.

This pull request addresses GH-487 by spinning out smaller infrastructure components from GH-394 to make that PR easier to review. That said, I expect these classes to change somewhat between now and the final product for GH-487 while working on resource estimation for flepimop submit.

Added a dataclass to represent batch job sizes, including light
validation.
Added a class method to `gempyor.batch.JobSize` class that is slightly
more convenient to use from a CLI script. Provides a good home for
future code related to GH-360. Corresponding tests and docs.
Added internal `gempyor.batch._resolve_batch_system` utility to make
sense of batch system CLI options.
Added a representation of batch job time limits, similar to `JobSize`,
along with corresponding documentation and unit tests.
Added the `JobTimeLimit.from_per_simulation_time` class method to easily
create instances from user provided inputs. Also added comparison
methods for ease of unit testing.
Added an enum for batch systems to formalize the valid systems flepiMoP
can run on.
Mainly reduced the functionality of the `size_from_jobs_sims_blocks`
class method. Best to leave that to more feedback and GH-360.
Added a class to represent job resources required/requested for a batch
job. Includes documentation and some unit tests, but not enough.
More complete set of unit tests for the `JobResources` class.
The `JobSize.from_jobs_simulations_blocks` now considers batch system,
namely to limit the size of the job when running locally since local
inference is meant for testing.
The `JobTimeLimit` class has specific formatting for
`BatchSystem.SLURM`.
@TimothyWillard TimothyWillard added enhancement Request for improvement or addition of new feature(s). gempyor Concerns the Python core. batch Relating to batch processing. high priority High priority. next release Marks a PR as a target to include in the next release. labels Jan 27, 2025
@saraloo
Contributor

saraloo commented Jan 28, 2025

Thanks Tim, looks good to me. From what I can tell this is just translating some of the old inference launcher script to this infrastructure? For the defaults here you've set:

nodes = 1 if nodes is None else nodes
cpus = 2 * job_size.jobs if cpus is None else cpus
memory = 2 * 1024 * job_size.simulations if memory is None else memory

Is this just the infrastructure you're building here, rather than specifics of these?

Also, just to confirm some of the language here (can move this discussion elsewhere if better, or deal with it later): for the R inference, job = slot and simulations = sims_per_job in the old inference launcher script? I.e., in the case of a single block, simulations = iterations?
Not sure if we care about it right now, but later, when we harmonise across inference methods, we should stick to the same language. I think for emcee, job = walker and simulations = niterations?

For emcee I also don't think we have any blocking capabilities, and I don't think we use blocks on our HPCs; I think blocks were mainly an issue for AWS. A separate thought, but we can consider later whether this is needed.
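The defaults saraloo quotes follow a "use the user's value when given, else derive it from the job size" pattern. A minimal sketch of that pattern (the function name and tuple return are assumptions; the quoted memory formula presumably yields MB, i.e. 2 GB per simulation):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobSize:
    jobs: int
    simulations: int
    blocks: int


def resolve_resources(
    job_size: JobSize,
    nodes: Optional[int] = None,
    cpus: Optional[int] = None,
    memory: Optional[int] = None,
) -> tuple[int, int, int]:
    """Fill in any unset resources from the job size, mirroring the quoted defaults."""
    nodes = 1 if nodes is None else nodes
    cpus = 2 * job_size.jobs if cpus is None else cpus
    memory = 2 * 1024 * job_size.simulations if memory is None else memory
    return nodes, cpus, memory


# Derived defaults for a 4-job, 100-simulation run:
print(resolve_resources(JobSize(jobs=4, simulations=100, blocks=1)))
# An explicit value always wins over the derived default:
print(resolve_resources(JobSize(jobs=4, simulations=100, blocks=1), cpus=16))
```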

Contributor

@pearsonca pearsonca left a comment

Not sure about the design here. Seems more like we should have a BatchSystem ABC rather than an enum, and have the BatchSystems do their respective things?

Comment on lines 42 to 44
aws: bool,
local: bool,
slurm: bool,
Contributor
why these flags? they are mutually exclusive, and the selection is captured by batch_system

Contributor Author

Discussed offline, but adding note here for other reviewers.

click.Option(
    param_decls=["--batch-system", "batch_system"],
    default=None,
    type=click.Choice(("aws", "local", "slurm"), case_sensitive=False),
    help="The name of the batch system being used.",
),
click.Option(
    param_decls=["--aws", "aws"],
    default=False,
    type=bool,
    is_flag=True,
    help="A flag indicating this is being run on AWS.",
),
click.Option(
    param_decls=["--local", "local"],
    default=False,
    type=bool,
    is_flag=True,
    help="A flag indicating this is being run local.",
),
click.Option(
    param_decls=["--slurm", "slurm"],
    default=False,
    type=bool,
    is_flag=True,
    help="A flag indicating this is being run on slurm.",
),

The reason for this was to resolve these options from the prior pull request, GH-394, into a single consistent answer. Have removed for now and will rethink how to implement in a follow-up PR to accommodate the dynamic batch systems added in 78d960a.
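For reviewers following along, the removed helper's job was to collapse the `--batch-system` option and the three mutually exclusive boolean flags into one consistent answer. A hypothetical sketch of that resolution (the function name, precedence, default, and error messages are all assumptions):

```python
from typing import Optional


def resolve_batch_system(
    batch_system: Optional[str],
    aws: bool,
    local: bool,
    slurm: bool,
) -> str:
    """Collapse the `--batch-system` option and the boolean flags into one value."""
    flags = [name for name, given in (("aws", aws), ("local", local), ("slurm", slurm)) if given]
    if len(flags) > 1:
        raise ValueError(f"Mutually exclusive flags given together: {flags}.")
    if batch_system is not None:
        # The explicit option must agree with any flag that was also given.
        if flags and flags[0] != batch_system.lower():
            raise ValueError(
                f"The flag --{flags[0]} conflicts with --batch-system {batch_system}."
            )
        return batch_system.lower()
    # Assumption: fall back to local when nothing is specified.
    return flags[0] if flags else "local"


print(resolve_batch_system(None, False, False, True))
print(resolve_batch_system("SLURM", False, False, True))
```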

Replaced the `BatchSystem` enum with an ABC that is extended by classes
representing those systems. The ABC defines clear defaults for common
formatting methods. Removed references to the original enum in
`JobResources`. Removed `JobTimeLimit` altogether as its formatting
logic has been consolidated into the batch system class instances.
Added unit tests for `LocalBatchSystem` and `SlurmBatchSystem` specific
behavior as well as added warnings to `LocalBatchSystem` when overriding
user inputs.
* Reorganized the code within the `gempyor.batch` module to better group
  similar objects together, and
* Added examples to all exported functionality to promote usability.
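A minimal sketch of the ABC shape these commits describe — defaults on the base class, SLURM-specific time formatting, and a warning when the local system overrides user inputs (method names and exact formats here are assumptions):

```python
from abc import ABC, abstractmethod
import warnings


class BatchSystem(ABC):
    """Base class: each batch system supplies its own formatting behavior."""

    @abstractmethod
    def format_time_limit(self, minutes: int) -> str: ...

    def limit_jobs(self, jobs: int) -> int:
        # Default: no adjustment to the requested number of jobs.
        return jobs


class SlurmBatchSystem(BatchSystem):
    def format_time_limit(self, minutes: int) -> str:
        # SLURM accepts time limits in HH:MM:SS form.
        hours, mins = divmod(minutes, 60)
        return f"{hours:02d}:{mins:02d}:00"


class LocalBatchSystem(BatchSystem):
    def format_time_limit(self, minutes: int) -> str:
        return f"{minutes}min"

    def limit_jobs(self, jobs: int) -> int:
        # Local inference is meant for testing, so cap the size and warn
        # when overriding what the user asked for.
        if jobs > 1:
            warnings.warn("Local batch system caps jobs at 1 for testing.")
            return 1
        return jobs


print(SlurmBatchSystem().format_time_limit(90))
```

Dispatching through instances this way lets each system own its quirks instead of scattering `if batch_system == ...` branches through the module.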
@TimothyWillard
Contributor Author

Is this just the infrastructure you're building here, rather than specifics of these?

These are just "defaults", I expect this to look different once working in the resource estimation abilities. I'm also open to changing defaults to whatever within reason.

Also just to confirm some of the language here (can move this discussion elsewhere if better elsewhere, or to deal with later). For the R inference job = slot, and simulations = sims_per_job in the old inference launcher script? ie in the case of a single block simulations = iterations?
Idk if we care about it right now but for later when we harmonise across inference methods, we should stick to the same language. I think for emcee job=walker and simulations=niterations?

Yeah, the JobSize class is designed to address GH-395 which is about creating consistent vocab for these concepts. The prior PR, GH-394, in fact used these as you suggest. See:

https://github.com/HopkinsIDD/flepiMoP/blob/ad11faea053045f6e61fab546ef9335b518ff726/flepimop/gempyor_pkg/src/gempyor/templates/inference_command.bash.j2

and

# Job size
job_size = JobSize.from_jobs_simulations_blocks(
    kwargs["jobs"],
    kwargs["simulations"],
    kwargs["blocks"],
    inference_method,
    batch_system,
)
logger.info("Preparing a job with size %s", job_size)
if inference_method == "emcee" and job_size.blocks != 1:
    logger.warning(
        "When using EMCEE for inference the job size blocks is ignored, given %u.",
        job_size.blocks,
    )

Removed old (already commented out) unit tests for `JobResources` and
`JobSize`.
@TimothyWillard
Contributor Author

I've removed references to specific inference methods from this PR for now; I think incorporating inference job type constraints is better suited for a follow-up PR.

* Switched `JobSize`, `JobResources` from data classes to pydantic's
  `BaseModel`.
* Removed tests that already cover features provided by pydantic.
* Updated examples for those classes.
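The switch to pydantic replaces the hand-written checks with declared field constraints; a minimal sketch under pydantic v2 (the field names mirror this PR, `PositiveInt` is pydantic's built-in constrained integer type):

```python
from pydantic import BaseModel, PositiveInt


class JobSize(BaseModel):
    """Validation now comes from pydantic rather than hand-written checks."""

    jobs: PositiveInt
    simulations: PositiveInt
    blocks: PositiveInt


# Valid sizes construct normally; non-positive values raise a ValidationError.
size = JobSize(jobs=2, simulations=10, blocks=1)
print(size)
```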
Collaborator

@emprzy emprzy left a comment

This looks great!

Successfully merging this pull request may close these issues.

[Feature request]: flepimop submit Action For Submitting Calibration Jobs To HPC Environments
5 participants