Update Scalability Tutorial (#262)
* add empty requirements file for cuda

* add requirements files and update pyproject toml

* update pyproject

* update installation in pyproject.toml

* update readme and horovod installation script

* update readme with horovod explanation

* update horovod installation script

* update readme with -e flag

* fix linter readme errors

* add more info to readme

* trailing whitespace 🙃

* trailing whitespace 🙃 (again)

* add draft of table of contents to readme

* update readme toc

* update readme toc again

* add section about uv lock to readme

* update toc of readme

* fix errors in readme

* add version numbers to packages in pyproject.toml

* remove uv.lock (for now)

* remove link from readme

* put toc in html comment

* remove toc, remove ds and horovod from reqs, add docs comment to pyproj

* Itwinai jlab Docker image (#236)

* Refactor Dockerfiles

* Refactor container gen script

* ADD jlab dockerfile

* First working version of jlab container

* ADD CMCC requirements

* update dockerfiles

* ADD nvconda and refactor

* Update containers

* ADD containers

* ADD simple plus dockerfile

* Update NV deps

* Update CUDA

* Add comment

* Cleanup

* Cleanup

* UPDATE README

* Refactor

* Fix linter

* Refactor dockerfiles and improve tests

* Refactor

* Refactor

* Fix

* Add first tests for HPC

* First broken tests for HPC

* Update tests and strategy

* UPDATE tests

* Update horovod tests

* Update tests and jlab deps

* Add MLflow tracking URI

* ADD distributed trainer tests

* mpirun container deepspeed

* Fix distributed strategy tests on multi-node

* ADD srun launcher

* Refactor jobscript

* Cleanup

* isort tests

* Refactor scripts

* Minor fixes

* Add logging to file for all workers

* Add jupyter base files

* Add jupyter base files

* spelling

* Update provenance deps

* Update DS version

* Update prov docs

* Cleanup

* add nvidia dep

* Remove incomplete work

* update pyproject

* ADD hadolint config file

* FIX flag

* Fix linters

* Refactor

* Update prov4ml

* Update pytest CI

* Minor fix

* Incorporate feedback

* Update Dockerfiles

* Incorporate feedback

* Update comments

* Refactor tests

* Virgo HDF5 file format (#240)

* update virgo generated dataset to use hdf5 format

* add functionality for selecting output location

* set new data format as standard

* make virgo work with new data loader and add progress bar

* remove old generation files and add script for concatenating hdf5 files

* remove old generation files and add script for concatenating hdf5 files

* rename folder using hyphens

* remove multiprocessing

* add multiprocessing at correct place

* update handling of seed and num processes

* Gpu monitoring (#237)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* remove redundant variable

* remove trailing whitespace

* fix issues from PR

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* add configurable and dynamic wait and warmup times for the profiler

* remove old plot

* move horovod import

* fix linting errors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* Scalability test wall clock (#239)

* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* temporary changes

* remove redundant variable

* add absolute time plot

* remove trailing whitespace

* remove redundant variable

* remove trailing whitespace

* begin implementation of backup

* fix issues from PR

* fix issues from PR

* add backup to gpu monitoring

* fix import in eurac trainer

* cleanup backup mechanism slightly

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup to gpu monitoring

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* add configurable and dynamic wait and warmup times for the profiler

* temporary changes

* add absolute time plot

* begin implementation of backup

* add backup to gpu monitoring

* cleanup backup mechanism slightly

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* fix linting errors

* fix more linting errors

* add utilization percentage plot

* run isort for linting

* update default save path for metrics

* add decorators to virgo and some cleanup

* add contributions and cleanup

* fix linting errors

* change 'credits' to 'credit'

* update communication plot style

* update function names

* update scalability function for a more streamlined approach

* run isort

* move horovod import

* fix linting errors

* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* make virgo work with new data loader and add progress bar

* add contributors

* update ruff settings in pyproject

* update virgo dataset concatenation

* add isort option to ruff

* break imports on purpose

* break more imports to test

* remove ruff config file

* 😀

* test linter 😁

* remove comment in github workflows

* add validation python to linter and make more mistakes

* add linting errors to trainer

* remove isort and flake8 and replace with ruff

* update linters

* run formatter on virgo folder

* fix linting errors and stuff from PR

* update config

* change config for timing code

* update profiler to use 'with' for context managing

* fix profiler.py

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>

* add requirements files and update pyproject toml

* update installation in pyproject.toml

* add pytorch extra to horovod and remove redundant script

* update readme tutorial with pip installation

* add uv tutorial in separate file

* fix linting errors

* update horovod install script

* fix dead link

* update readme

* add uv installation command to readme

* add requirements files and update pyproject toml

* update pyproject

* update installation in pyproject.toml

* add version numbers to packages in pyproject.toml

* update horovod install script and add pip as dependency

* formatting

* fix linting

* trailing whitespace

* remove comment from readme

* remove comments and small formatting difference

* fix profiler bug where profiler is never set to trainer

* begin refactoring the scaling tests

* add contributors

* fix linting errors

* update scaling test trainers

* update plotting code and small bugfix in profiler

* tiny update to requirements

* reformat wrt indentations and newlines

* fix layout of plot and use updated comm regexes

* more clean up [WIP]

* update deepspeed trainer

* some cleanup

* small cleanup

* fix deepspeed in scalability tutorial

* add subset to horovod so it finishes in time

* small cleanup in itwinai trainer

* update default slurm log dir name

* update slurm log directory in config files

* allow user to specify number of nodes for scalability analysis

* allow user to specify imagenet subset size

* enable epoch time logging for tutorial

* update readme

* add folder for scalability metrics

* fix linting errors

* remove import comments in itwinai trainer file

* sort imports

* small cleanup: comments from PR

* fix virgo config

---------

Co-authored-by: Matteo Bunino <[email protected]>
Co-authored-by: Anna Lappe <[email protected]>
3 people committed Jan 22, 2025
1 parent ebb868c commit 7481237
Showing 35 changed files with 1,994 additions and 3,714 deletions.
99 changes: 2 additions & 97 deletions docs/tutorials/distrib-ml/torch_scaling_test.rst
@@ -3,104 +3,9 @@ PyTorch scaling test
 
 .. include:: ../../../tutorials/distributed-ml/torch-scaling-test/README.md
    :parser: myst_parser.sphinx_
-   :end-before: Example of scalability plot generated by
+   :end-before: Below follows an example of
 
 
-Example of scalability plot generated by ``itwinai scalability-report``:
+Below follows an example of scalability plot generated by ``itwinai scalability-report``:
 
 .. image:: ../../../tutorials/distributed-ml/torch-scaling-test/img/report.png
-
-
-Configuration files
--------------------
-
-
-base.yaml
-+++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/config/base.yaml
-   :language: yaml
-
-
-ddp.yaml
-++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/config/ddp.yaml
-   :language: yaml
-
-
-deepspeed.yaml
-++++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/config/deepspeed.yaml
-   :language: yaml
-
-
-horovod.yaml
-++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/config/horovod.yaml
-   :language: yaml
-
-
-
-Training scripts and utils
---------------------------
-
-
-ddp_trainer.py
-++++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/ddp_trainer.py
-   :language: python
-
-
-deepspeed_trainer.py
-++++++++++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/deepspeed_trainer.py
-   :language: python
-
-
-horovod_trainer.py
-++++++++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/horovod_trainer.py
-   :language: python
-
-
-itwinai_trainer.py
-++++++++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/itwinai_trainer.py
-   :language: python
-
-
-utils.py
-++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/utils.py
-   :language: python
-
-
-runall.sh
-+++++++++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test//runall.sh
-   :language: bash
-
-
-scaling-test.sh
-+++++++++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/scaling-test.sh
-   :language: bash
-
-
-slurm.sh
-+++++++++++++++++++
-
-.. literalinclude:: ../../../tutorials/distributed-ml/torch-scaling-test/slurm.sh
-   :language: bash
-
10 changes: 10 additions & 0 deletions env-files/tensorflow/generic_tf.sh
@@ -1,5 +1,15 @@
 #!/bin/bash
 
+# --------------------------------------------------------------------------------------
+# Part of the interTwin Project: https://www.intertwin.eu/
+#
+# Created by: Matteo Bunino
+#
+# Credit:
+# - Jarl Sondre Sæther <[email protected]> - CERN
+# - Matteo Bunino <[email protected]> - CERN
+# --------------------------------------------------------------------------------------
+
 if [ -z "$ENV_NAME" ]; then
     ENV_NAME=".venv-tf"
 fi
11 changes: 11 additions & 0 deletions env-files/torch/generic_torch.sh
@@ -1,4 +1,15 @@
 #!/bin/bash
+
+# --------------------------------------------------------------------------------------
+# Part of the interTwin Project: https://www.intertwin.eu/
+#
+# Created by: Matteo Bunino
+#
+# Credit:
+# - Jarl Sondre Sæther <[email protected]> - CERN
+# - Matteo Bunino <[email protected]> - CERN
+# --------------------------------------------------------------------------------------
+
 if [ -z "$ENV_NAME" ]; then
     ENV_NAME=".venv-pytorch"
 fi
10 changes: 10 additions & 0 deletions env-files/torch/install-horovod-deepspeed-cuda.sh
@@ -1,5 +1,15 @@
 #!/bin/bash
 
+# --------------------------------------------------------------------------------------
+# Part of the interTwin Project: https://www.intertwin.eu/
+#
+# Created by: Jarl Sondre Sæther
+#
+# Credit:
+# - Jarl Sondre Sæther <[email protected]> - CERN
+# - Matteo Bunino <[email protected]> - CERN
+# --------------------------------------------------------------------------------------
+
 # DeepSpeed variables
 export DS_BUILD_CCL_COMM=1
 export DS_BUILD_UTILS=1
2 changes: 1 addition & 1 deletion src/itwinai/loggers.py
@@ -1177,7 +1177,7 @@ class EpochTimeTracker:
     """Tracker for epoch execution time during training."""
 
     def __init__(
-        self, strategy_name: str, save_path: Union[Path, str], num_nodes: int
+        self, strategy_name: str, save_path: Path | str, num_nodes: int
    ) -> None:
         if isinstance(save_path, str):
             save_path = Path(save_path)
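
Note that the PEP 604 spelling ``Path | str`` requires Python 3.10+ when evaluated in annotations (unless postponed evaluation is enabled). Either argument type still works, since strings are converted to ``Path`` internally. A minimal usage sketch; the strategy label, file name, and node count below are made up for illustration:

    from itwinai.loggers import EpochTimeTracker

    tracker = EpochTimeTracker(
        strategy_name="ddp",                             # hypothetical strategy label
        save_path="scalability-metrics/epoch-time.csv",  # a str is converted to Path internally
        num_nodes=2,
    )
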
2 changes: 2 additions & 0 deletions src/itwinai/scalability.py
@@ -108,6 +108,8 @@ def create_absolute_plot(avg_epoch_time_df: pd.DataFrame) -> None:
     ax.grid(True)
 
     output_path = Path("plots/absolute_scalability_plot.png")
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    plt.tight_layout()
     plt.savefig(output_path)
     print(f"Saving absolute plot to '{output_path.resolve()}'.")
     sns.reset_orig()
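
The added ``mkdir`` call prevents ``plt.savefig`` from failing with ``FileNotFoundError`` when the ``plots/`` directory does not exist yet. The pattern in isolation, as a runnable sketch with made-up data:

    from pathlib import Path

    import matplotlib
    matplotlib.use("Agg")  # headless backend, so this runs without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([1, 2, 4, 8], [100, 55, 30, 18])  # made-up node counts vs. epoch times
    output_path = Path("plots/absolute_scalability_plot.png")
    output_path.parent.mkdir(parents=True, exist_ok=True)  # create plots/ if missing
    plt.tight_layout()  # trim excess whitespace around the axes
    plt.savefig(output_path)
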
4 changes: 2 additions & 2 deletions src/itwinai/slurm/slurm_config.yaml
@@ -3,8 +3,8 @@ account: intertwin
 dist_strat: horovod
 time: 00:11:11
 
-std_out: slurm_jobs/job.out
-err_out: slurm_jobs/err.out
+std_out: slurm_job_logs/job.out
+err_out: slurm_job_logs/err.out
 
 num_nodes: 1
 num_tasks_per_node: 1
14 changes: 8 additions & 6 deletions src/itwinai/slurm/slurm_script_builder.py
@@ -172,6 +172,8 @@ def get_debug_command(self) -> str:
         echo ""
         echo "### Other Variables ###"
         echo "Distributed Strategy: {self.distributed_strategy}"
+        echo "Current working directory: $(pwd)"
+        echo "Which python: $(which python)"
         """
         debug_print_command = debug_print_command.strip()
         return remove_indentation_from_multiline_string(debug_print_command)
@@ -201,10 +203,10 @@ def process_slurm_script(
         self.slurm_script_configuration.job_name = self.generate_identifier()
 
         if self.slurm_script_configuration.std_out is None:
-            std_out_path = Path("slurm_jobs") / (self.generate_identifier() + ".out")
+            std_out_path = Path("slurm_job_logs") / (self.generate_identifier() + ".out")
             self.slurm_script_configuration.std_out = std_out_path
         if self.slurm_script_configuration.err_out is None:
-            err_out_path = Path("slurm_jobs") / (self.generate_identifier() + ".err")
+            err_out_path = Path("slurm_job_logs") / (self.generate_identifier() + ".err")
             self.slurm_script_configuration.err_out = err_out_path
 
         # Making sure the std out and err out folders exist
@@ -218,9 +220,9 @@
         # Generate the script using the given configuration
         script = self.slurm_script_configuration.format_script()
         if not submit_slurm_job and not retain_file:
-            print("#" * 30)
+            print("#" * 20, "SLURM Script Preview", "#"*20)
             print(script)
-            print("#" * 30)
+            print("#" * 62)
             return
 
         if file_path is None:
@@ -258,8 +260,8 @@ def run_slurm_script_all_strategies(
 
         # Overriding job_name, std_out and err_out
         self.slurm_script_configuration.job_name = self.generate_identifier()
-        std_out_path = Path("slurm_jobs") / (self.generate_identifier() + ".out")
-        err_out_path = Path("slurm_jobs") / (self.generate_identifier() + ".err")
+        std_out_path = Path("slurm_job_logs") / (self.generate_identifier() + ".out")
+        err_out_path = Path("slurm_job_logs") / (self.generate_identifier() + ".err")
         self.slurm_script_configuration.std_out = std_out_path
         self.slurm_script_configuration.err_out = err_out_path
 
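
The closing separator changed from 30 to 62 characters to line up with the new header: ``print`` joins its arguments with single spaces, and ``len("SLURM Script Preview") == 20``, so the header is 20 + 1 + 20 + 1 + 20 = 62 characters wide. A quick check:

    header = " ".join(["#" * 20, "SLURM Script Preview", "#" * 20])
    assert len(header) == 62
    print(header)
    print("#" * 62)  # the footer lines up exactly with the header
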
40 changes: 34 additions & 6 deletions src/itwinai/slurm/utils.py
@@ -7,6 +7,8 @@
 # - Jarl Sondre Sæther <[email protected]> - CERN
 # --------------------------------------------------------------------------------------
 
+from typing import List
+
 from itwinai.parser import ArgumentParser
 
 
@@ -18,6 +20,31 @@ def remove_indentation_from_multiline_string(multiline_string: str) -> str:
     return "\n".join([line.lstrip() for line in multiline_string.split("\n")])
 
 
+def scalability_nodes_list(value: str | List[int]) -> List[int]:
+    """Checks that the value it receives conforms to the comma-separated integer
+    constraint and returns the parsed list if successful.
+
+    Returns:
+        The list of integers that was parsed.
+
+    Raises:
+        ValueError: If unable to parse the integers e.g. due to formatting errors.
+    """
+
+    if isinstance(value, list):
+        if not all([isinstance(x, int) for x in value]):
+            raise ValueError(f"Provided list, '{value}', contains non-integer values.")
+        else:
+            return value
+
+    try:
+        return [int(n) for n in value.split(",")]
+    except ValueError:
+        raise ValueError(
+            f"Invalid input: '{value}', must be formatted as comma-separated integers."
+        )
+
+
 def get_slurm_job_parser() -> ArgumentParser:
     # Default arguments for the SLURM script configuration
     default_account = "intertwin"
@@ -38,16 +65,11 @@ def get_slurm_job_parser() -> ArgumentParser:
     default_pipe_key = "rnn_training_pipeline"
     default_training_command = None
     default_python_venv = ".venv"
+    default_scalability_nodes = "1,2,4,8"
 
     parser = ArgumentParser(parser_mode="omegaconf")
 
     # Arguments specific to the SLURM script configuration
-    parser.add_argument(
-        "--job_name",
-        type=str,
-        default=default_job_name,
-        help="The name of the SLURM job",
-    )
     parser.add_argument(
         "--job-name",
         type=str,
@@ -142,6 +164,12 @@
         default=default_python_venv,
         help="Which python venv to use for running the command.",
     )
+    parser.add_argument(
+        "--scalability-nodes",
+        type=scalability_nodes_list,
+        default=default_scalability_nodes,
+        help="A comma-separated list of node numbers to use for the scalability test.",
+    )
 
     # Boolean arguments where you only need to include the flag and not an actual value
     parser.add_argument(
5 changes: 3 additions & 2 deletions src/itwinai/torch/monitoring/plotting.py
@@ -107,6 +107,7 @@ def gpu_bar_plot(
         raise ValueError(
             f"DataFrame is missing the following columns: {missing_columns}"
         )
+
     sns.set_theme()
 
     strategies = data_df["strategy"].unique()
@@ -138,9 +139,9 @@
     ax.set_xticklabels(unique_gpu_counts)
     ax.legend(title="Strategy")
 
-    figure_width = int(1.5 * len(unique_gpu_counts))
-    fig.set_figheight(6)
+    figure_width = max(int(2 * len(unique_gpu_counts)), 8)
     fig.set_figwidth(figure_width)
+    fig.set_figheight(figure_width * 0.8)
 
     sns.reset_orig()
 
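
The new sizing clamps the figure to a minimum width of 8 (inches, matplotlib's unit) and derives the height from the width, so plots with few GPU counts no longer come out cramped. Evaluating the formula for a few cardinalities:

    for n in (1, 2, 4, 8):
        width = max(int(2 * n), 8)
        print(n, width, width * 0.8)  # -> (1, 8, 6.4), (2, 8, 6.4), (4, 8, 6.4), (8, 16, 12.8)
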
19 changes: 12 additions & 7 deletions src/itwinai/torch/profiling/communication_plot.py
@@ -16,8 +16,6 @@
 import seaborn as sns
 from matplotlib.patches import Patch
 
-# from itwinai.scalability import convert_matching_files_to_dataframe
-
 # Doing this because otherwise I get an error about X11 Forwarding which I believe
 # is due to the server trying to pass the image to the client computer
 matplotlib.use("Agg")
@@ -40,9 +38,15 @@ def calculate_comp_and_comm_time(df: pd.DataFrame) -> Tuple[float, float]:
             f"\nMissing columns: {missing_columns}"
         )
 
-    nccl_comm_pattern = (
-        r"ncclKernel_(?:AllReduce|Broadcast|Reduce|AllGather|ReduceScatter|SendRecv)"
-    )
+    comm_types = [
+        "AllReduce",
+        "Broadcast",
+        "Reduce",
+        "AllGather",
+        "Gather",
+        "ReduceScatter",
+    ]
+    nccl_comm_pattern = rf"(?:{'|'.join(comm_types)})"
     cuda_stream_pattern = r"cudaStream(?:WaitEvent|Synchronize)"
 
     # Any operation that is a part of PyTorch's ATen library is considered a computation
@@ -133,10 +137,11 @@ def communication_overhead_stacked_bar_plot(
     ax.legend(handles=ax.get_legend_handles_labels()[0] + [hatch_patch])
 
     # Dynamically adjusting the width of the figure
-    figure_width = int(1.5 * len(gpu_numbers))
-    fig.set_figheight(5)
+    figure_width = max(int(2 * len(gpu_numbers)), 8)
     fig.set_figwidth(figure_width)
+    fig.set_figheight(figure_width * 0.8)
 
+    # Resetting so that seaborn's theme doesn't affect other plots
     sns.reset_orig()
 
     return fig, ax
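
The rewritten pattern drops the hard-coded ``ncclKernel_`` prefix, so any profiler event whose name contains one of the collective names counts as communication, whatever kernel-name prefix the NCCL/PyTorch version emits. A runnable sketch of the difference; the two kernel names below are hypothetical examples, not taken from a real trace:

    import re

    comm_types = ["AllReduce", "Broadcast", "Reduce", "AllGather", "Gather", "ReduceScatter"]
    new_pattern = rf"(?:{'|'.join(comm_types)})"
    old_pattern = r"ncclKernel_(?:AllReduce|Broadcast|Reduce|AllGather|ReduceScatter|SendRecv)"

    for name in ["ncclKernel_AllReduce_RING_LL_Sum_float", "ncclDevKernel_AllGather_RING_LL"]:
        print(name, bool(re.search(old_pattern, name)), bool(re.search(new_pattern, name)))
    # The second name matches only the new pattern, since its prefix is not "ncclKernel_".
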
4 changes: 3 additions & 1 deletion src/itwinai/torch/profiling/profiler.py
@@ -89,13 +89,15 @@ def profiled_method(self: TorchTrainer, *args, **kwargs) -> Any:
             warmup_epochs=self.profiling_warmup_epochs,
         )
         with profile(
-            activities=[ProfilerActivity.CUDA],
+            activities=[ProfilerActivity.CUDA, ProfilerActivity.CPU],
             schedule=schedule(
                 wait=wait_epochs,
                 warmup=warmup_epochs,
                 active=active_epochs,
             ),
+            with_modules=True
         ) as profiler:
             self.profiler = profiler
             result = method(self, *args, **kwargs)
 
         strategy = self.strategy
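
For context, ``torch.profiler``'s ``schedule()`` is driven by ``profiler.step()`` calls: it idles for ``wait`` steps, warms up for ``warmup``, then records ``active`` steps. A self-contained sketch of the same wait/warmup/active pattern outside itwinai, CPU-only so it runs anywhere:

    import torch
    from torch.profiler import ProfilerActivity, profile, schedule

    with profile(
        activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on a GPU machine
        schedule=schedule(wait=1, warmup=1, active=2),
        with_modules=True,  # attribute events to the originating module hierarchy
    ) as prof:
        for epoch in range(5):
            torch.randn(256, 256) @ torch.randn(256, 256)  # stand-in for one training epoch
            prof.step()  # advances the wait/warmup/active schedule

    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
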
3 changes: 2 additions & 1 deletion src/itwinai/torch/trainer.py
@@ -422,7 +422,8 @@ def set_epoch(self, epoch: int) -> None:
         Args:
             epoch (int): epoch number, from 0 to ``epochs-1``.
         """
-        if self.profiler is not None:
+        if self.profiler is not None and epoch > 0:
+            # We don't want to start stepping until after the first epoch
             self.profiler.step()
         self._set_epoch_dataloaders(epoch)
 