Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core Refactor SmartSim API updates #768

Closed
wants to merge 64 commits into from

Conversation

juliaputko
Copy link
Contributor

No description provided.

amandarichardsonn and others added 30 commits May 2, 2024 16:55
Move the test suite used for SmartSim v0.X.Y to a `tests/_legacy` dir.
This directory is explicitly skipped during test collection as these
tests are not expected to pass as SmartSim transitions to a new user
facing and core API.

Additional changes to CI to make sure the feature branch continues
to pass while in a chaotic development state.

[ committed by @MattToast ]
[ reviewed by @amandarichardsonn ]
Fixed inconsistency when adding run arguments N, nodes, or ntasks, to a RunSettings object with leading `-` characters.

 [ committed by @juliaputko ]
 [ reviewed by @amandarichardsonn  ]
BaseJobGroup, JobGroup and ColocatedJobGroup skeletons added.

[ Committed by @amandarichardsonn ]
[ Reviewed by @juliaputko ]
Addresses type errors found in the SmartSim core refactor branch.
Re-adds a static type check back into the CI to allow for gradual typing
as more parts of the API become more concrete

[ committed by @MattToast ]
[ reviewed by @amandarichardsonn ]
Creates a basic implementation for a `CompundEntity` base class that can
be used to build a launchable containing multiple processes (`Job`s,
`JobGroup`, etc) each with their own settings from a single
`LaunchSettings` instance.

This class is then subclassed to create an `Ensemble` class that mimics
the original implementation: I takes in an `Application` (`Model`) and
then maps settings/params/input files over the collection before
creating the launchable jobs.

[ committed by @MattToast ]
[ reviewed by @amandarichardsonn ]

---------

Co-authored-by: amandarichardsonn <[email protected]>
Entry point for running file operations move, remove, symlink, copy, 
and configure using command line arguments. Included tests.

 [ committed by @juliaputko ]
 [ reviewed by @MattToast , @amandarichardsonn, @mellis13 ]
Add headers/marks/boilerplate to test files for `smartsim-refactor`.

[ committed by @MattToast ]
[ reviewed by @amandarichardsonn ]
Create a `Dispatcher` class and a global `DEFAULT_DISPATCHER` instance.
This can be used by the experiment class to get which type of launcher
are to be used for a specific `LaunchSettings` by maintaining an
internal mapping of arguments type -> launcher type. This mapping can be
built in a type safe way by calling `Dispatcher.disptach`. Experiments
now use the `DEFAULT_DISPATCHER` to start jobs through an 
`Experiment.start` method

[ committed by @MattToast ]
[ reviewed by @amandarichardsonn @mellis13 ]
Running `pytest tests/test_file_operations.py` was overwriting a test
configuration file. Fixes the offending test and adds a missing
`assert` statement.

[ committed by @MattToast ]
[ reviewed by @juliaputko ]
juliaputko and others added 26 commits August 15, 2024 12:53
Update smartsim-refactor branch with the changes in develop

 [ committed by @juliaputko ]
 [ reviewed by @MattToast ]
Move Shelllauncher to _core/shell.

 [ committed by @juliaputko ]
 [ reviewed by @MattToast , @mellis13 ]
Remove slurm specific methods from LaunchSettings.  Add a ShellLaunchArguments class. 

[ committed by @juliaputko ]
[ reviewed by @MattToast ]
Removed all references to Redis and RedisAI, and other unused methods.

 [ committed by @juliaputko ]
 [ reviewed by @MattToast , @mellis13 ]
…abs#671)

This PR merges in additional context to the ShellLauncher.start function to open a subprocess. The working directory is now set in popen, the environment variables are now set in popen, and the file paths to which standard output and standard error streams should be redirected to have been set. Additionally this PR merges in Unit tests for the ShellLauncher class.

[ reviewed by @MattToast ]
[ committed by @amandarichardsonn ]
Refactor of Application

[ committed by @juliaputko ]
[ reviewed by @MattToast , @mellis13 ]
`Experiment` was given a `wait` method that takes a collection of
Launched Job IDs and will wait until the launch reaches a terminal state
by either completing or erroring out. Implements a polling based
solution.

[ committed by @MattToast ]
[ reviewed by @amandarichardsonn @juliaputko @mellis13 ]
Refactor SmartSimEntity class, remove ExecutableProtocol protocol,
Application becomes subclass of ABC SmartSImEntity.

[ reviewed by @MattToast @mellis13 ]
[ committed by @amandarichardsonn ]
Add an `Experiment.stop` method to stop jobs prematurely. Teach
launchers how to stop the jobs that they manage.

[ committed by @MattToast ]
[ reviewed by @mellis13 ]
Refactor Ensemble parameters

[ committed by @juliaputko ]
[ reviewed by @MattToast, @mellis13  ]
In the shell launcher, an unintended and unused type union sunk into the
``ShellLauncherCommand`` type for the ``command_tuple`` type, allowing
the attr to be of type ``Sequence[str]`` or ``tuple[str, tuple[str,
...]]``. The former works as expected, but the latter would error at
runtime when sent to the ``ShellLauncher`` when opening a subprocess.

```py
class ShellLauncher:
    ...
    def start(self, shell_command: ShellLauncherCommand) -> LaunchedJobID:
        ...
        exe, *rest = shell_command.command_tuple
        #     ^^^^ Mypy thinks this is `list[str] | list[tuple[str]]`
        expanded_exe = helpers.expand_exe_path(exe)
        # pylint: disable-next=consider-using-with
        self._launched[id_] = sp.Popen(
            (expanded_exe, *rest),
       #    ^^^^^^^^^^^^^^^^^^^^^
       #    And inadvertently casts this to `tuple[Any]` which errors
       #    at runtime
            cwd=shell_command.path,
            env={k: v for k, v in shell_command.env.items() if v is not None},
            stdout=shell_command.stdout,
            stderr=shell_command.stderr,
        )
        ...
```

Because this type was not being used, it can simply be removed from the
union.
[ committed by @MattToast ]
[ reviewed by @amandarichardsonn ]
This PR converts Camel case files to Snake case.

[reviewed by @mellis13 ]
[ committed by @amandarichardsonn ]
…rayLabs#693)

Remove EntityList, EntitySequence, JobManager, and Controller as well as any references. 

[ committed by @juliaputko ]
[ reviewed by @amandarichardsonn, @mellis13 ]
This PR makes small updates to the existing BatchSettings
implementation.

[ reviewed by @mellis13 @MattToast ]
[ committed by @amandarichardsonn ]
The Ensemble class is moved to /builders/ in this PR. The strategies module is moved to/builder/utils in this PR. as_jobs is changed to build_jobs. Documentation is updated.

[ reviewed by @mellis13 ]
[ committed by @amandarichardsonn ]
…ectory entrypoint (CrayLabs#695)

This PR adds a configure_directory entry point, as well as tests. It
also removes TaggedFilesHierarchy and replaces it with os.walk. Finally,
the Generator tests have been refactored.

[ reviewed by @MattToast @mellis13 @juliaputko ]
[ committed by @amandarichardsonn ]
@juliaputko juliaputko requested a review from ankona October 28, 2024 23:26
@juliaputko juliaputko closed this Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants