forked from IGITUGraz/L2L
Commit: first version of new runner explanation
Showing 1 changed file with 31 additions and 37 deletions.
@@ -247,7 +247,7 @@ the reason for the contract enforced on the Optimizee constructor

Note that all the (non-exploring) parameters to the `Optimizer` are passed in to its constructor through a
:func:`~collections.namedtuple` to keep the parameters documented. For examples see :class:`.GeneticAlgorithmParameters`
-or :class:`.SimulatedAnnealingParameters`
+or :class:`.CrossEntropyParameters`
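As an illustration of this pattern, a parameter ``namedtuple`` can be defined and filled as sketched below; the class and field names are purely illustrative and do not reproduce the actual fields of the documented parameter classes:

.. code-block:: python

    from collections import namedtuple

    # Illustrative only: stands in for classes such as GeneticAlgorithmParameters;
    # the field names here are hypothetical, not the real L2L definitions.
    MyOptimizerParameters = namedtuple(
        'MyOptimizerParameters', ['pop_size', 'n_iteration', 'seed'])

    parameters = MyOptimizerParameters(pop_size=32, n_iteration=100, seed=42)
    # The namedtuple is then handed to the Optimizer constructor, so every
    # (non-exploring) parameter stays named and documented.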

The :meth:`~l2l.optimizers.optimizer.Optimizer.post_process` function:
----------------------------------------------------------------------
@@ -295,43 +295,28 @@ logging and recording. See the source of :file:`bin/l2l-template.py` for more de

Execution setup
~~~~~~~~~~~~~~~
-The L2L framework works with JUBE in order to deploy the execution of the different instances of the optimizee on
-the available computational resources. This requires that the trajectory contains a parameter group called JUBE_params
+The L2L framework uses workers from a runner class to distribute the execution of the different instances of the optimizee on
+the available computational resources. This requires that the trajectory contains a parameter group called runner_params
which contains the details needed to execute the program correctly.
-**Mandatory** steps to define the execution of the optimizees (see the sketch after this list):
-
-1. Add a parameter group called JUBE_params to the :obj:`traj` using its :meth:`f_add_parameter_group`.
-2. Set up the execution command :attr:`exec` by using the trajectory :meth:`f_add_parameter_to_group`.
-   This method receives three parameters, which in this case should be specified as:
-   group_name=JUBE_params, key="exec", val=<execution command string>.
-   This <execution command string> will be used to launch individual optimizees. An example of a simple call without MPI
-   is: "python " + os.path.join(paths.simulation_path, "run_files/run_optimizee.py").
-3. Set up the ready and working paths :attr:`paths` by using the trajectory :meth:`f_add_parameter_to_group`.
-   This method receives three parameters, which in this case should be specified as:
-   group_name=JUBE_params, key="paths", val=<path object>.
-   <path object> should contain the root working path. An example of this path is:
-   paths = Paths(name, dict(run_num='test'), root_dir_path=<root_dir_path>, suffix="-example")
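A minimal sketch of this JUBE-based setup, assuming an existing trajectory ``traj`` and the ``paths`` object created above; the calls follow the steps described in the list rather than a verified API:

.. code-block:: python

    import os

    # Sketch of the JUBE-based setup described above; `traj` and `paths` are
    # assumed to exist as in the surrounding text.
    traj.f_add_parameter_group("JUBE_params", "Contains JUBE parameters")
    traj.f_add_parameter_to_group(
        "JUBE_params", "exec",
        "python " + os.path.join(paths.simulation_path, "run_files/run_optimizee.py"))
    traj.f_add_parameter_to_group("JUBE_params", "paths", paths)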

-In order to launch simulations on a laptop or a local cluster without a scheduler, only the mandatory parameters must
-be specified. These parameters are part of the template.

-To launch the simulations on a cluster with a scheduler, the following optional parameters must be defined (a sketch follows
-the list below). They currently match Slurm, but this can also be adjusted to other schedulers.
-
-1. Name of the scheduler, :attr:`scheduler`, e.g. "Slurm"
-2. Command to submit jobs to the scheduler, :attr:`submit_cmd`, e.g. "sbatch"
-3. Template file for the particular scheduler, :attr:`job_file`, e.g. "job.run"
-4. Number of nodes to request for each run, :attr:`nodes`, e.g. "1"
-5. Requested time for the compute resources, :attr:`walltime`, e.g. "00:01:00"
-6. MPI processes per node, :attr:`ppn`, e.g. "1"
-7. CPU cores per MPI process, :attr:`cpu_pp`, e.g. "1"
-8. Threads per process, :attr:`threads_pp`, e.g. "1"
-9. Type of emails to be sent by the scheduler, :attr:`mail_mode`, e.g. "ALL"
-10. Email address for scheduler notifications, :attr:`mail_address`, e.g. "[email protected]"
-11. Error file for the job, :attr:`err_file`, e.g. "stderr"
-12. Output file for the job, :attr:`out_file`, e.g. "stdout"
-13. MPI processes per job, :attr:`tasks_per_job`, e.g. "1"
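These optional keys can be added to the same parameter group in the same way as the mandatory ones. A hedged sketch, where the key names come from the list above and the values are only examples:

.. code-block:: python

    # Sketch: adds the scheduler-related keys listed above to the JUBE_params group,
    # following the same f_add_parameter_to_group pattern as the mandatory parameters.
    scheduler_params = {
        "scheduler": "Slurm",
        "submit_cmd": "sbatch",
        "job_file": "job.run",
        "nodes": "1",
        "walltime": "00:01:00",
        "ppn": "1",
        "cpu_pp": "1",
        "threads_pp": "1",
        "mail_mode": "ALL",
        "mail_address": "user@example.com",  # hypothetical address
        "err_file": "stderr",
        "out_file": "stdout",
        "tasks_per_job": "1",
    }
    for key, val in scheduler_params.items():
        traj.f_add_parameter_to_group("JUBE_params", key, val)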

-See the :file:`l2l-template-scheduler.py` for a base file with all these parameters.
+**Mandatory** steps to define the execution of the optimizees, if you do not want to use the default parameters
+(a sketch follows this list):
+
+1. Create a dictionary that contains the runner parameters:
+
+   * **srun**: The srun command that is used when running the program in parallel to execute an individual.
+     The default is an empty string, which means that the program is only executed locally.
+   * **exec**: The command used to execute an individual. The default is ``python``.
+   * **max_workers**: The maximum number of workers. This must be chosen by the user,
+     depending on how many compute resources were requested and how many are required per individual.
+     For example: if a total of 1 node with 100 cores was requested and 50 cores are required per individual,
+     then at most 2 workers may be used. The default is 32 workers.
+   * **work_path**: The path of the workspace. The simulation results,
+     the trajectories and the logs of the individual workers are stored here. The default is the ``root_dir_path``
+     of the experiment.
+   * **path_obj**: Stores the path object.
+
+2. Pass the dictionary to the experiment when calling ``experiment.prepare_experiment(runner_params=params)``.
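Putting both steps together, a minimal sketch might look as follows; ``experiment`` is an already created experiment object, ``paths`` is the path object of the experiment, and the concrete values are only illustrative:

.. code-block:: python

    # Sketch of a runner-parameter setup; the keys follow the list above and the
    # values are illustrative. `experiment` and `paths` are assumed to exist.
    runner_params = {
        "srun": "",                        # empty string: execute locally, without srun
        "exec": "python",                  # command used to launch an individual
        "max_workers": 2,                  # e.g. 100 requested cores / 50 cores per individual
        "work_path": "results/workspace",  # hypothetical path; defaults to the experiment's root_dir_path
        "path_obj": paths,                 # the path object of the experiment
    }

    # Other arguments of prepare_experiment are omitted in this sketch.
    experiment.prepare_experiment(runner_params=runner_params)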

+See the :file:`bin/l2l-template.py` for a base file with all these parameters.

Examples
********
@@ -348,12 +333,21 @@ Data postprocessing

Todo...
+.. _checkpointing:
+
+Checkpointing
+*************
+
+Currently, checkpointing is only available for genetic algorithms.
+A generation from a previous simulation can be read in and the optimization continued from there.
+For an example, see :file:`bin/l2l-fun-ga-checkpoint.py`.

.. _parallelization:

Parallelization
***************

-We also support running different instances of the experiments on different cores and hosts using Jube.
+We also support running different instances of the experiments on different cores.

.. _logging: