
Initial Job Sizing Infrastructure #488

Open
wants to merge 19 commits into dev
Conversation

TimothyWillard
Contributor

Describe your changes.

This pull request helps to move GH-487 along by adding:

  • The BatchSystem enum to represent the three environments we expect to be running batch jobs on,
  • The JobSize class to represent general properties of a batch job,
  • The JobResources class to represent the general resource requirements of a batch job, and
  • The JobTimeLimit class to represent the time limit a job is allowed to run for.

Spun out of GH-394.
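For concreteness, a minimal sketch of the shapes these classes might take — the class names come from the PR, but the field names, validation details, and enum values here are assumptions, not the merged API:

```python
from dataclasses import dataclass
from enum import Enum


class BatchSystem(Enum):
    """The three environments batch jobs are expected to run on."""

    AWS = "aws"
    LOCAL = "local"
    SLURM = "slurm"


@dataclass(frozen=True)
class JobSize:
    """General properties of a batch job, with light validation."""

    jobs: int
    simulations: int
    blocks: int

    def __post_init__(self) -> None:
        # Light validation: every dimension must be a positive integer.
        for name in ("jobs", "simulations", "blocks"):
            if getattr(self, name) < 1:
                raise ValueError(f"`{name}` must be a positive integer.")


size = JobSize(jobs=10, simulations=500, blocks=1)
print(size)
```

`JobResources` and `JobTimeLimit` would follow the same dataclass-with-light-validation pattern for resource and time fields.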

Does this pull request make any user interface changes? If so please describe.

The user interface changes are new functionality added to gempyor.batch and are documented via docstrings but not in GitBook yet. I think it's not quite ready for GitBook documentation yet because:

  1. These classes are much more developer oriented than user oriented; I expect that non-developers would never directly interact with these classes, and
  2. I expect these classes to change some in response to adding resource estimation abilities (see task 3 of [Feature request]: flepimop submit Action For Submitting Calibration Jobs To HPC Environments #487).

What does your pull request address? Tag relevant issues.

This pull request addresses GH-487 by spinning out smaller infrastructure components from GH-394 to make that PR easier to review. That said, I expect these classes to change somewhat between now and the final product for GH-487 while working on resource estimation for flepimop submit.

Added a dataclass to represent batch job sizes, including light
validation.
Added a class method to `gempyor.batch.JobSize` class that is slightly
more convenient to use from a CLI script. Provides a good home for
future code related to GH-360. Corresponding tests and docs.
Added internal `gempyor.batch._resolve_batch_system` utility to make
sense of batch system CLI options.
Added a representation of batch job time limits, similar to `JobSize`,
along with corresponding documentation and unit tests.
Added the `JobTimeLimit.from_per_simulation_time` class method to easily
create instances from user provided inputs. Also added comparison
methods for ease of unit testing.
Added an enum for batch systems to formalize the valid systems flepiMoP
can run on.
Mainly reduced the functionality of the `size_from_jobs_sims_blocks`
class method. Best to leave that to more feedback and GH-360.
Added a class to represent job resources required/requested for a batch
job. Includes documentation and some unit tests, but not enough.
More complete set of unit tests for the `JobResources` class.
The `JobSize.from_jobs_simulations_blocks` now considers batch system,
namely to limit the size of the job when running locally since local
inference is meant for testing.
The `JobTimeLimit` class has specific formatting for
`BatchSystem.SLURM`.
@TimothyWillard TimothyWillard added enhancement Request for improvement or addition of new feature(s). gempyor Concerns the Python core. batch Relating to batch processing. high priority High priority. next release Marks a PR as a target to include in the next release. labels Jan 27, 2025
@saraloo
Contributor

saraloo commented Jan 28, 2025

Thanks Tim, looks good to me. From what I can tell this is just translating some of the old inference launcher script to this infrastructure? For the defaults here you've set:

nodes = 1 if nodes is None else nodes
cpus = 2 * job_size.jobs if cpus is None else cpus
memory = 2 * 1024 * job_size.simulations if memory is None else memory

Is this just the infrastructure you're building here, rather than specifics of these?

Also, just to confirm some of the language here (can move this discussion elsewhere if better, or deal with it later): for the R inference, job = slot and simulations = sims_per_job in the old inference launcher script? I.e., in the case of a single block, simulations = iterations?
Not sure if we care about it right now, but later, when we harmonise across inference methods, we should stick to the same language. I think for emcee, job = walker and simulations = niterations?

For emcee I also don't think we have any blocking capabilities, and I don't think we use blocks on our HPCs; I think blocks were mainly an issue for AWS. A separate thought, but we can consider later whether this is needed.
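The defaults saraloo quotes follow a "use the user's value when given, else derive it from the job size" pattern. A minimal sketch of that pattern (the function name and tuple return are assumptions; the quoted memory formula presumably yields MB, i.e. 2 GB per simulation):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobSize:
    jobs: int
    simulations: int
    blocks: int


def resolve_resources(
    job_size: JobSize,
    nodes: Optional[int] = None,
    cpus: Optional[int] = None,
    memory: Optional[int] = None,
) -> tuple[int, int, int]:
    """Fill in any unset resources from the job size, mirroring the quoted defaults."""
    nodes = 1 if nodes is None else nodes
    cpus = 2 * job_size.jobs if cpus is None else cpus
    memory = 2 * 1024 * job_size.simulations if memory is None else memory
    return nodes, cpus, memory


# Derived defaults for a 4-job, 100-simulation run:
print(resolve_resources(JobSize(jobs=4, simulations=100, blocks=1)))
# An explicit value always wins over the derived default:
print(resolve_resources(JobSize(jobs=4, simulations=100, blocks=1), cpus=16))
```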

Contributor

@pearsonca pearsonca left a comment

Not sure about the design here. Seems more like we should have a BatchSystem ABC rather than an enum, and have the BatchSystems do their respective things?

Comment on lines 42 to 44
aws: bool,
local: bool,
slurm: bool,
Contributor
why these flags? they are mutually exclusive, and the selection is captured by batch_system

Contributor Author

Discussed offline, but adding note here for other reviewers.

click.Option(
    param_decls=["--batch-system", "batch_system"],
    default=None,
    type=click.Choice(("aws", "local", "slurm"), case_sensitive=False),
    help="The name of the batch system being used.",
),
click.Option(
    param_decls=["--aws", "aws"],
    default=False,
    type=bool,
    is_flag=True,
    help="A flag indicating this is being run on AWS.",
),
click.Option(
    param_decls=["--local", "local"],
    default=False,
    type=bool,
    is_flag=True,
    help="A flag indicating this is being run local.",
),
click.Option(
    param_decls=["--slurm", "slurm"],
    default=False,
    type=bool,
    is_flag=True,
    help="A flag indicating this is being run on slurm.",
),

The reason for this was to resolve these options from the prior pull request, GH-394, into a single consistent answer. Have removed for now and will rethink how to implement in a follow-up PR to accommodate the dynamic batch systems added in 78d960a.
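For reviewers following along, the removed helper's job was to collapse the `--batch-system` option and the three mutually exclusive boolean flags into one consistent answer. A hypothetical sketch of that resolution (the function name, precedence, default, and error messages are all assumptions):

```python
from typing import Optional


def resolve_batch_system(
    batch_system: Optional[str],
    aws: bool,
    local: bool,
    slurm: bool,
) -> str:
    """Collapse the `--batch-system` option and the boolean flags into one value."""
    flags = [name for name, given in (("aws", aws), ("local", local), ("slurm", slurm)) if given]
    if len(flags) > 1:
        raise ValueError(f"Mutually exclusive flags given together: {flags}.")
    if batch_system is not None:
        # The explicit option must agree with any flag that was also given.
        if flags and flags[0] != batch_system.lower():
            raise ValueError(
                f"The flag --{flags[0]} conflicts with --batch-system {batch_system}."
            )
        return batch_system.lower()
    # Assumption: fall back to local when nothing is specified.
    return flags[0] if flags else "local"


print(resolve_batch_system(None, False, False, True))
print(resolve_batch_system("SLURM", False, False, True))
```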

Replaced the `BatchSystem` enum with an ABC that is extended by classes
representing those systems. The ABC defines clear defaults for common
formatting methods. Removed references to the original enum in
`JobResources`. Removed `JobTimeLimit` altogether as its formatting
logic has been consolidated into the batch system class instances.
Added unit tests for `LocalBatchSystem` and `SlurmBatchSystem` specific
behavior as well as added warnings to `LocalBatchSystem` when overriding
user inputs.
* Reorganized the code within the `gempyor.batch` module to better group
  similar objects together, and
* Added examples to all exported functionality to promote usability.
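A minimal sketch of the ABC shape these commits describe — defaults on the base class, SLURM-specific time formatting, and a warning when the local system overrides user inputs (method names and exact formats here are assumptions):

```python
from abc import ABC, abstractmethod
import warnings


class BatchSystem(ABC):
    """Base class: each batch system supplies its own formatting behavior."""

    @abstractmethod
    def format_time_limit(self, minutes: int) -> str: ...

    def limit_jobs(self, jobs: int) -> int:
        # Default: no adjustment to the requested number of jobs.
        return jobs


class SlurmBatchSystem(BatchSystem):
    def format_time_limit(self, minutes: int) -> str:
        # SLURM accepts time limits in HH:MM:SS form.
        hours, mins = divmod(minutes, 60)
        return f"{hours:02d}:{mins:02d}:00"


class LocalBatchSystem(BatchSystem):
    def format_time_limit(self, minutes: int) -> str:
        return f"{minutes}min"

    def limit_jobs(self, jobs: int) -> int:
        # Local inference is meant for testing, so cap the size and warn
        # when overriding what the user asked for.
        if jobs > 1:
            warnings.warn("Local batch system caps jobs at 1 for testing.")
            return 1
        return jobs


print(SlurmBatchSystem().format_time_limit(90))
```

Dispatching through instances this way lets each system own its quirks instead of scattering `if batch_system == ...` branches through the module.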
@TimothyWillard
Contributor Author

Is this just the infrastructure you're building here, rather than specifics of these?

These are just "defaults", I expect this to look different once working in the resource estimation abilities. I'm also open to changing defaults to whatever within reason.

Also just to confirm some of the language here (can move this discussion elsewhere if better elsewhere, or to deal with later). For the R inference job = slot, and simulations = sims_per_job in the old inference launcher script? ie in the case of a single block simulations = iterations?
Idk if we care about it right now but for later when we harmonise across inference methods, we should stick to the same language. I think for emcee job=walker and simulations=niterations?

Yeah, the JobSize class is designed to address GH-395 which is about creating consistent vocab for these concepts. The prior PR, GH-394, in fact used these as you suggest. See:

https://github.com/HopkinsIDD/flepiMoP/blob/ad11faea053045f6e61fab546ef9335b518ff726/flepimop/gempyor_pkg/src/gempyor/templates/inference_command.bash.j2

and

# Job size
job_size = JobSize.from_jobs_simulations_blocks(
    kwargs["jobs"],
    kwargs["simulations"],
    kwargs["blocks"],
    inference_method,
    batch_system,
)
logger.info("Preparing a job with size %s", job_size)
if inference_method == "emcee" and job_size.blocks != 1:
    logger.warning(
        "When using EMCEE for inference the job size blocks is ignored, given %u.",
        job_size.blocks,
    )

Removed old (already commented out) unit tests for `JobResources` and
`JobSize`.
@TimothyWillard
Contributor Author

I've removed references to specific inference methods from this PR for now; I think incorporating inference job type constraints is better suited for a follow-up PR.

* Switched `JobSize`, `JobResources` from data classes to pydantic's
  `BaseModel`.
* Removed tests that already cover features provided by pydantic.
* Updated examples for those classes.
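The switch to pydantic replaces the hand-written checks with declared field constraints; a minimal sketch under pydantic v2 (the field names mirror this PR, `PositiveInt` is pydantic's built-in constrained integer type):

```python
from pydantic import BaseModel, PositiveInt


class JobSize(BaseModel):
    """Validation now comes from pydantic rather than hand-written checks."""

    jobs: PositiveInt
    simulations: PositiveInt
    blocks: PositiveInt


# Valid sizes construct normally; non-positive values raise a ValidationError.
size = JobSize(jobs=2, simulations=10, blocks=1)
print(size)
```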
Collaborator

@emprzy emprzy left a comment

This looks great!

Successfully merging this pull request may close these issues.

[Feature request]: flepimop submit Action For Submitting Calibration Jobs To HPC Environments
5 participants