Gpu monitoring #237
Merged
matbun
reviewed
Oct 28, 2024
matbun
reviewed
Oct 29, 2024
matbun
reviewed
Oct 29, 2024
matbun
reviewed
Oct 31, 2024
annaelisalappe
previously approved these changes
Nov 4, 2024
LGTM
matbun
approved these changes
Nov 7, 2024
jarlsondre
added a commit
that referenced
this pull request
Nov 8, 2024
* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* remove redundant variable
* remove trailing whitespace
* fix issues from PR
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* add configurable and dynamic wait and warmup times for the profiler
* remove old plot
* move horovod import
* fix linting errors

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino
jarlsondre
added a commit
that referenced
this pull request
Nov 15, 2024
* update virgo generated dataset to use hdf5 format
* add functionality for selecting output location
* set new data format as standard
* make virgo work with new data loader and add progress bar
* remove old generation files and add script for concatenating hdf5 files
* remove old generation files and add script for concatenating hdf5 files
* rename folder using hyphens
* remove multiprocessing
* add multiprocessing at correct place
* update handling of seed and num processes
* Gpu monitoring (#237)
* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* remove redundant variable
* remove trailing whitespace
* fix issues from PR
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* add configurable and dynamic wait and warmup times for the profiler
* remove old plot
* move horovod import
* fix linting errors

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino

* Scalability test wall clock (#239)
* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* temporary changes
* remove redundant variable
* add absolute time plot
* remove trailing whitespace
* remove redundant variable
* remove trailing whitespace
* begin implementation of backup
* fix issues from PR
* fix issues from PR
* add backup to gpu monitoring
* fix import in eurac trainer
* cleanup backup mechanism slightly
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* begin implementation of backup
* add backup to gpu monitoring
* add backup functionality to communication plot
* rewrite epochtimetracker and refactor scalability plot code
* cleanup scalability plot code
* updating some epochtimetracker dependencies
* add configurable and dynamic wait and warmup times for the profiler
* temporary changes
* add absolute time plot
* begin implementation of backup
* add backup to gpu monitoring
* cleanup backup mechanism slightly
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* begin implementation of backup
* add backup functionality to communication plot
* rewrite epochtimetracker and refactor scalability plot code
* cleanup scalability plot code
* updating some epochtimetracker dependencies
* fix linting errors
* fix more linting errors
* add utilization percentage plot
* run isort for linting
* update default save path for metrics
* add decorators to virgo and some cleanup
* add contributions and cleanup
* fix linting errors
* change 'credits' to 'credit'
* update communication plot style
* update function names
* update scalability function for a more streamlined approach
* run isort
* move horovod import
* fix linting errors
* add contributors

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino

* make virgo work with new data loader and add progress bar
* add contributors
* update ruff settings in pyproject
* update virgo dataset concatenation
* add isort option to ruff
* break imports on purpose
* break more imports to test
* remove ruff config file
* 😀
* test linter 😁
* remove comment in github workflows
* add validation python to linter and make more mistakes
* add linting errors to trainer
* remove isort and flake8 and replace with ruff
* update linters
* run formatter on virgo folder
* fix linting errors and stuff from PR
* update config
* change config for timing code
* update profiler to use 'with' for context managing
* fix profiler.py

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino
jarlsondre
added a commit
that referenced
this pull request
Nov 15, 2024
annaelisalappe
added a commit
that referenced
this pull request
Nov 15, 2024
annaelisalappe
added a commit
that referenced
this pull request
Nov 15, 2024
annaelisalappe
added a commit
that referenced
this pull request
Nov 15, 2024
annaelisalappe
added a commit
that referenced
this pull request
Nov 20, 2024
jarlsondre
added a commit
that referenced
this pull request
Dec 2, 2024
* add empty requirements file for cuda
* add requirements files and update pyproject toml
* update pyproject
* update installation in pyproject.toml
* update readme and horovod installation script
* update readme with horovod explanation
* update horovod installation script
* update readme with -e flag
* fix linter readme errors
* add more info to readme
* trailing whitespace 🙃
* trailing whitespace 🙃 (again)
* add draft of table of contents to readme
* update readme toc
* update readme toc again
* add section about uv lock to readme
* update toc of readme
* fix errors in readme
* add version numbers to packages in pyproject.toml
* remove uv.lock (for now)
* remove link from readme
* put toc in html comment
* remove toc, remove ds and horovod from reqs, add docs comment to pyproj
* Itwinai jlab Docker image (#236)
* Refactor Dockerfiles
* Refactor container gen script
* ADD jlab dockerfile
* First working version of jlab container
* ADD CMCC requirements
* update dockerfiles
* ADD nvconda and refactor
* Update containers
* ADD containers
* ADD simple plus dockerfile
* Update NV deps
* Update CUDA
* Add comment
* Cleanup
* Cleanup
* UPDATE README
* Refactor
* Fix linter
* Refactor dockerfiles and improve tests
* Refactor
* Refactor
* Fix
* Add first tests for HPC
* First broken tests for HPC
* Update tests and strategy
* UPDATE tests
* Update horovod tests
* Update tests and jlab deps
* Add MLFLow tracking URI
* ADD distributed trainer tests
* mpirun container deepspeed
* Fix distributed strategy tests on multi-node
* ADD srun launcher
* Refactor jobscript
* Cleanup
* isort tests
* Refactor scripts
* Minor fixes
* Add logging to file for all workers
* Add jupyter base files
* Add jupyter base files
* spelling
* Update provenance deps
* Update DS version
* Update prov docs
* Cleanup
* add nvidia dep
* Remove incomplete work
* update pyproject
* ADD hadolit config file
* FIX flag
* Fix linters
* Refactor
* Update prov4ml
* Update pytest CI
* Minor fix
* Incorporate feedback
* Update Dockerfiles
* Incorporate feedback
* Update comments
* Refactor tests
* Virgo HDF5 file format (#240)
* update virgo generated dataset to use hdf5 format
* add functionality for selecting output location
* set new data format as standard
* make virgo work with new data loader and add progress bar
* remove old generation files and add script for concatenating hdf5 files
* remove old generation files and add script for concatenating hdf5 files
* rename folder using hyphens
* remove multiprocessing
* add multiprocessing at correct place
* update handling of seed and num processes
* Gpu monitoring (#237)
* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* remove redundant variable
* remove trailing whitespace
* fix issues from PR
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* add configurable and dynamic wait and warmup times for the profiler
* remove old plot
* move horovod import
* fix linting errors

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino

* Scalability test wall clock (#239)
* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* temporary changes
* remove redundant variable
* add absolute time plot
* remove trailing whitespace
* remove redundant variable
* remove trailing whitespace
* begin implementation of backup
* fix issues from PR
* fix issues from PR
* add backup to gpu monitoring
* fix import in eurac trainer
* cleanup backup mechanism slightly
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* begin implementation of backup
* add backup to gpu monitoring
* add backup functionality to communication plot
* rewrite epochtimetracker and refactor scalability plot code
* cleanup scalability plot code
* updating some epochtimetracker dependencies
* add configurable and dynamic wait and warmup times for the profiler
* temporary changes
* add absolute time plot
* begin implementation of backup
* add backup to gpu monitoring
* cleanup backup mechanism slightly
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* begin implementation of backup
* add backup functionality to communication plot
* rewrite epochtimetracker and refactor scalability plot code
* cleanup scalability plot code
* updating some epochtimetracker dependencies
* fix linting errors
* fix more linting errors
* add utilization percentage plot
* run isort for linting
* update default save path for metrics
* add decorators to virgo and some cleanup
* add contributions and cleanup
* fix linting errors
* change 'credits' to 'credit'
* update communication plot style
* update function names
* update scalability function for a more streamlined approach
* run isort
* move horovod import
* fix linting errors
* add contributors

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino

* make virgo work with new data loader and add progress bar
* add contributors
* update ruff settings in pyproject
* update virgo dataset concatenation
* add isort option to ruff
* break imports on purpose
* break more imports to test
* remove ruff config file
* 😀
* test linter 😁
* remove comment in github workflows
* add validation python to linter and make more mistakes
* add linting errors to trainer
* remove isort and flake8 and replace with ruff
* update linters
* run formatter on virgo folder
* fix linting errors and stuff from PR
* update config
* change config for timing code
* update profiler to use 'with' for context managing
* fix profiler.py

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino

* add requirements files and update pyproject toml
* update installation in pyproject.toml
* add pytorch extra to horovod and remove redundant script
* update readme tutorial with pip installation
* add uv tutorial in separate file
* fix linting errors
* update horovod install script
* fix dead link
* update readme
* add uv installation command to readme
* add requirements files and update pyproject toml
* update pyproject
* update installation in pyproject.toml
* add version numbers to packages in pyproject.toml
* update horovod install script and add pip as dependency
* formatting
* fix linting
* trailing whitespace
* remove comment from readme
* remove comments and small formatting difference
* move uv tutorial under docs/
* update readme with nvidia and amd instead of linux
* remove duplicate entries in pyproject and reformat distributed file
* update readmes
* separate horovod ds installation script into two files
* fix linting errors and update dependencies
* fix tests and update lockfile
* fix linting errors
* update installation scripts for testing
* add local test command
* add tf to installation in readme
* add torch cuda to project dependencies
* remove index from tutorial
* remove unused comments and update tutorial

---------

Co-authored-by: Matteo Bunino
Co-authored-by: Anna Lappe
annaelisalappe
added a commit
that referenced
this pull request
Dec 10, 2024
* Added first draft of virgo raytorchtrainer integration
* Added MLFlow logger integration to raytorchtrainer
* Added ray components to main scripts
* Added first draft of raydeepspeed strategy
* Update createEnvVega.sh
* Gpu monitoring (#237)
* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* remove redundant variable
* remove trailing whitespace
* fix issues from PR
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* add configurable and dynamic wait and warmup times for the profiler
* remove old plot
* move horovod import
* fix linting errors

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino

* Scalability test wall clock (#239)
* add gpu utilization decorator and begin work on plots
* add decorator for gpu energy utilization
* Added config option to hpo script, styling (#235)
* Update README.md
* Update README.md
* Update createEnvVega.sh
* remove unused dist file
* run black and isort to fix linting errors
* temporary changes
* remove redundant variable
* add absolute time plot
* remove trailing whitespace
* remove redundant variable
* remove trailing whitespace
* begin implementation of backup
* fix issues from PR
* fix issues from PR
* add backup to gpu monitoring
* fix import in eurac trainer
* cleanup backup mechanism slightly
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* fix import in eurac trainer
* fix linting errors
* update logging directory and pattern
* update default pattern for gpu energy plots
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* begin implementation of backup
* add backup to gpu monitoring
* add backup functionality to communication plot
* rewrite epochtimetracker and refactor scalability plot code
* cleanup scalability plot code
* updating some epochtimetracker dependencies
* add configurable and dynamic wait and warmup times for the profiler
* temporary changes
* add absolute time plot
* begin implementation of backup
* add backup to gpu monitoring
* cleanup backup mechanism slightly
* fix isort linting
* add support for none pattern and general cleanup
* fix linting errors with black and isort
* begin implementation of backup
* add backup functionality to communication plot
* rewrite epochtimetracker and refactor scalability plot code
* cleanup scalability plot code
* updating some epochtimetracker dependencies
* fix linting errors
* fix more linting errors
* add utilization percentage plot
* run isort for linting
* update default save path for metrics
* add decorators to virgo and some cleanup
* add contributions and cleanup
* fix linting errors
* change 'credits' to 'credit'
* update communication plot style
* update function names
* update scalability function for a more streamlined approach
* run isort
* move horovod import
* fix linting errors
* add contributors

---------

Co-authored-by: Anna Lappe
Co-authored-by: Matteo Bunino

* Itwinai jlab Docker image (#236)
* Refactor Dockerfiles
* Refactor container gen script
* ADD jlab dockerfile
* First working version of jlab container
* ADD CMCC requirements
* update dockerfiles
* ADD nvconda and refactor
* Update containers
* ADD containers
* ADD simple plus dockerfile
* Update NV deps
* Update CUDA
* Add comment
* Cleanup
* Cleanup
* UPDATE README
* Refactor
* Fix linter
* Refactor
dockerfiles and improve tests * Refactor * Refactor * Fix * Add first tests for HPC * First broken tests for HPC * Update tests and strategy * UPDATE tests * Update horovod tests * Update tests and jlab deps * Add MLFLow tracking URI * ADD distributed trainer tests * mpirun container deepspeed * Fix distributed strategy tests on multi-node * ADD srun launcher * Refactor jobscript * Cleanup * isort tests * Refactor scripts * Minor fixes * Add logging to file for all workers * Add jupyter base files * Add jupyter base files * spelling * Update provenance deps * Update DS version * Update prov docs * Cleanup * add nvidia dep * Remove incomplete work * update pyproject * ADD hadolit config file * FIX flag * Fix linters * Refactor * Update prov4ml * Update pytest CI * Minor fix * Incorporate feedback * Update Dockerfiles * Incorporate feedback * Update comments * Refactor tests * Added first draft of virgo raytorchtrainer integration * Added ray components to main scripts * Added first draft of raydeepspeed strategy * Virgo HDF5 file format (#240) * update virgo generated dataset to use hdf5 format * add functionality for selecting output location * set new data format as standard * make virgo work with new data loader and add progress bar * remove old generation files and add script for concatenating hdf5 files * remove old generation files and add script for concatenating hdf5 files * rename folder using hyphens * remove multiprocessing * add multiprocessing at correct place * update handling of seed and num processes * Gpu monitoring (#237) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * remove redundant variable * remove trailing whitespace * fix issues from PR * fix import in eurac trainer * fix linting errors * update logging 
directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * add configurable and dynamic wait and warmup times for the profiler * remove old plot * move horovod import * fix linting errors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Scalability test wall clock (#239) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * temporary changes * remove redundant variable * add absolute time plot * remove trailing whitespace * remove redundant variable * remove trailing whitespace * begin implementation of backup * fix issues from PR * fix issues from PR * add backup to gpu monitoring * fix import in eurac trainer * cleanup backup mechanism slightly * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup to gpu monitoring * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * add configurable and dynamic wait and warmup times for the profiler * temporary changes * add absolute time plot * begin implementation of backup * add backup to gpu monitoring * cleanup backup mechanism 
slightly * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * fix linting errors * fix more linting errors * add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * make virgo work with new data loader and add progress bar * add contributors * update ruff settings in pyproject * update virgo dataset concatenation * add isort option to ruff * break imports on purpose * break more imports to test * remove ruff config file * 😀 * test linter 😁 * remove comment in github workflows * add validation python to linter and make more mistakes * add linting errors to trainer * remove isort and flake8 and replace with ruff * update linters * run formatter on virgo folder * fix linting errors and stuff from PR * update config * change config for timing code * update profiler to use 'with' for context managing * fix profiler.py --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Added first draft of virgo raytorchtrainer integration * Added ray components to main scripts * Gpu monitoring (#237) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * 
Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * remove redundant variable * remove trailing whitespace * fix issues from PR * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * add configurable and dynamic wait and warmup times for the profiler * remove old plot * move horovod import * fix linting errors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Added ray components to main scripts * Added first draft of raydeepspeed strategy * Added working version of deepspeed strategy, added RayDistributedStrategy as parent class * Scalability test wall clock (#239) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * temporary changes * remove redundant variable * add absolute time plot * remove trailing whitespace * remove redundant variable * remove trailing whitespace * begin implementation of backup * fix issues from PR * fix issues from PR * add backup to gpu monitoring * fix import in eurac trainer * cleanup backup mechanism slightly * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add 
backup to gpu monitoring * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * add configurable and dynamic wait and warmup times for the profiler * temporary changes * add absolute time plot * begin implementation of backup * add backup to gpu monitoring * cleanup backup mechanism slightly * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * fix linting errors * fix more linting errors * add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Itwinai jlab Docker image (#236) * Refactor Dockerfiles * Refactor container gen script * ADD jlab dockerfile * First working version of jlab container * ADD CMCC requirements * update dockerfiles * ADD nvconda and refactor * Update containers * ADD containers * ADD simple plus dockerfile * Update NV deps * Update CUDA * Add comment * Cleanup * Cleanup * UPDATE README * Refactor * Fix linter * Refactor dockerfiles and improve tests * Refactor * Refactor * Fix * Add first tests for HPC * First broken tests for HPC * Update tests and strategy * UPDATE tests * Update horovod tests * Update tests and jlab deps * Add MLFLow tracking URI * ADD distributed 
trainer tests * mpirun container deepspeed * Fix distributed strategy tests on multi-node * ADD srun launcher * Refactor jobscript * Cleanup * isort tests * Refactor scripts * Minor fixes * Add logging to file for all workers * Add jupyter base files * Add jupyter base files * spelling * Update provenance deps * Update DS version * Update prov docs * Cleanup * add nvidia dep * Remove incomplete work * update pyproject * ADD hadolit config file * FIX flag * Fix linters * Refactor * Update prov4ml * Update pytest CI * Minor fix * Incorporate feedback * Update Dockerfiles * Incorporate feedback * Update comments * Refactor tests * Virgo HDF5 file format (#240) * update virgo generated dataset to use hdf5 format * add functionality for selecting output location * set new data format as standard * make virgo work with new data loader and add progress bar * remove old generation files and add script for concatenating hdf5 files * remove old generation files and add script for concatenating hdf5 files * rename folder using hyphens * remove multiprocessing * add multiprocessing at correct place * update handling of seed and num processes * Gpu monitoring (#237) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * remove redundant variable * remove trailing whitespace * fix issues from PR * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * add configurable and dynamic wait and warmup times for the profiler * remove old plot * move horovod import * fix linting errors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo 
Bunino <[email protected]> * Scalability test wall clock (#239) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * temporary changes * remove redundant variable * add absolute time plot * remove trailing whitespace * remove redundant variable * remove trailing whitespace * begin implementation of backup * fix issues from PR * fix issues from PR * add backup to gpu monitoring * fix import in eurac trainer * cleanup backup mechanism slightly * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup to gpu monitoring * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * add configurable and dynamic wait and warmup times for the profiler * temporary changes * add absolute time plot * begin implementation of backup * add backup to gpu monitoring * cleanup backup mechanism slightly * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * fix linting errors * fix more linting errors * 
add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * make virgo work with new data loader and add progress bar * add contributors * update ruff settings in pyproject * update virgo dataset concatenation * add isort option to ruff * break imports on purpose * break more imports to test * remove ruff config file * 😀 * test linter 😁 * remove comment in github workflows * add validation python to linter and make more mistakes * add linting errors to trainer * remove isort and flake8 and replace with ruff * update linters * run formatter on virgo folder * fix linting errors and stuff from PR * update config * change config for timing code * update profiler to use 'with' for context managing * fix profiler.py --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Changed files to create docs env and build docs remotely on juwels (#251) * Changed files to create docs env and build docs remotely on juwels * add docs extra to pyproject.toml * Added updated information to README * Grammar * Trailing whitespaces (*melting emoji*) --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Jarl Sondre Sæther <[email protected]> * Fix scalability bug (#252) * add barrier implementation to distributed * fix profiler, seemingly * add print suppress to virgo * fix import bug * fix trailing whitespace 😇 * remove barrier method * Isort, format, delete old files * Formatting imports with ruff * Linting * Specified super linter should not use isort * 
Unm-messed up the cli.py file * Incorporated PR comments * Fixed & sorted imports * Remove horovod option from RayNoiseGeneratorTrainer * Typo * Incorporate PR comments (most importantly, change inheritance for ray strategies) * PR comments, refactored RayDDPStrategy * PR comments (super) * PR comments (refactored dataloader function in RayTorchTrainer) * PR comments * Linting * Remove else and line break * Removed patch version specifications, refactored slurm launcher script * Removed unused export in slurm script, removed abstract train method in RayTorchTrainer * Bugfix in the search alg/ scheduler setting and linting * Pyproject versions (PR comment) * I had already done that actually, so nevermind... Undoing the last commit * PR comments --------- Co-authored-by: Matteo Bunino <[email protected]> Co-authored-by: Jarl Sæther <[email protected]>
annaelisalappe
added a commit
that referenced
this pull request
Dec 10, 2024
add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * make virgo work with new data loader and add progress bar * add contributors * update ruff settings in pyproject * update virgo dataset concatenation * add isort option to ruff * break imports on purpose * break more imports to test * remove ruff config file * 😀 * test linter 😁 * remove comment in github workflows * add validation python to linter and make more mistakes * add linting errors to trainer * remove isort and flake8 and replace with ruff * update linters * run formatter on virgo folder * fix linting errors and stuff from PR * update config * change config for timing code * update profiler to use 'with' for context managing * fix profiler.py --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Changed files to create docs env and build docs remotely on juwels (#251) * Changed files to create docs env and build docs remotely on juwels * add docs extra to pyproject.toml * Added updated information to README * Grammar * Trailing whitespaces (*melting emoji*) --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Jarl Sondre Sæther <[email protected]> * Fix scalability bug (#252) * add barrier implementation to distributed * fix profiler, seemingly * add print suppress to virgo * fix import bug * fix trailing whitespace 😇 * remove barrier method * Isort, format, delete old files * Formatting imports with ruff * Linting * Specified super linter should not use isort * 
Unm-messed up the cli.py file * Incorporated PR comments * Fixed & sorted imports * Remove horovod option from RayNoiseGeneratorTrainer * Typo * HPO tutorial first draft * Incorporate PR comments (most importantly, change inheritance for ray strategies) * First draft tutorial * PR comments, refactored RayDDPStrategy * PR comments (super) * PR comments (refactored dataloader function in RayTorchTrainer) * PR comments * Linting * Remove else and line break * Removed patch version specifications, refactored slurm launcher script * Added how-it-works for HPO, updated HPO tutorial * Removed unused export in slurm script, removed abstract train method in RayTorchTrainer * Bugfix in the search alg/ scheduler setting and linting * Pyproject versions (PR comment) * I had already done that actually, so nevermind... Undoing the last commit * Updated tutorial * Link and reference fixes * bash linting error * Added MNIST data to gitignore * Removed MNIST files for testing * PR comments * Phrasing of one sentence * Updated .gitignore * Duplicate things --------- Co-authored-by: Matteo Bunino <[email protected]> Co-authored-by: Jarl Sæther <[email protected]>
annaelisalappe
added a commit
that referenced
this pull request
Dec 12, 2024
* Added first draft of virgo raytorchtrainer integration * Added MLFlow logger integration to raytorchtrainer * Added ray components to main scripts * Added first draft of raydeepspeed strategy * Update createEnvVega.sh * Gpu monitoring (#237) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * remove redundant variable * remove trailing whitespace * fix issues from PR * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * add configurable and dynamic wait and warmup times for the profiler * remove old plot * move horovod import * fix linting errors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Scalability test wall clock (#239) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * temporary changes * remove redundant variable * add absolute time plot * remove trailing whitespace * remove redundant variable * remove trailing whitespace * begin implementation of backup * fix issues from PR * fix issues from PR * add backup to gpu monitoring * fix import in eurac trainer * cleanup backup mechanism slightly * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * fix import in eurac trainer * 
fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup to gpu monitoring * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * add configurable and dynamic wait and warmup times for the profiler * temporary changes * add absolute time plot * begin implementation of backup * add backup to gpu monitoring * cleanup backup mechanism slightly * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * fix linting errors * fix more linting errors * add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Itwinai jlab Docker image (#236) * Refactor Dockerfiles * Refactor container gen script * ADD jlab dockerfile * First working version of jlab container * ADD CMCC requirements * update dockerfiles * ADD nvconda and refactor * Update containers * ADD containers * ADD simple plus dockerfile * Update NV deps * Update CUDA * Add comment * Cleanup * Cleanup * UPDATE README * Refactor * Fix linter * Refactor 
dockerfiles and improve tests * Refactor * Refactor * Fix * Add first tests for HPC * First broken tests for HPC * Update tests and strategy * UPDATE tests * Update horovod tests * Update tests and jlab deps * Add MLFLow tracking URI * ADD distributed trainer tests * mpirun container deepspeed * Fix distributed strategy tests on multi-node * ADD srun launcher * Refactor jobscript * Cleanup * isort tests * Refactor scripts * Minor fixes * Add logging to file for all workers * Add jupyter base files * Add jupyter base files * spelling * Update provenance deps * Update DS version * Update prov docs * Cleanup * add nvidia dep * Remove incomplete work * update pyproject * ADD hadolit config file * FIX flag * Fix linters * Refactor * Update prov4ml * Update pytest CI * Minor fix * Incorporate feedback * Update Dockerfiles * Incorporate feedback * Update comments * Refactor tests * Added first draft of virgo raytorchtrainer integration * Added ray components to main scripts * Added first draft of raydeepspeed strategy * Virgo HDF5 file format (#240) * update virgo generated dataset to use hdf5 format * add functionality for selecting output location * set new data format as standard * make virgo work with new data loader and add progress bar * remove old generation files and add script for concatenating hdf5 files * remove old generation files and add script for concatenating hdf5 files * rename folder using hyphens * remove multiprocessing * add multiprocessing at correct place * update handling of seed and num processes * Gpu monitoring (#237) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * remove redundant variable * remove trailing whitespace * fix issues from PR * fix import in eurac trainer * fix linting errors * update logging 
directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * add configurable and dynamic wait and warmup times for the profiler * remove old plot * move horovod import * fix linting errors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Scalability test wall clock (#239) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * temporary changes * remove redundant variable * add absolute time plot * remove trailing whitespace * remove redundant variable * remove trailing whitespace * begin implementation of backup * fix issues from PR * fix issues from PR * add backup to gpu monitoring * fix import in eurac trainer * cleanup backup mechanism slightly * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup to gpu monitoring * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * add configurable and dynamic wait and warmup times for the profiler * temporary changes * add absolute time plot * begin implementation of backup * add backup to gpu monitoring * cleanup backup mechanism 
slightly * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * fix linting errors * fix more linting errors * add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * make virgo work with new data loader and add progress bar * add contributors * update ruff settings in pyproject * update virgo dataset concatenation * add isort option to ruff * break imports on purpose * break more imports to test * remove ruff config file * 😀 * test linter 😁 * remove comment in github workflows * add validation python to linter and make more mistakes * add linting errors to trainer * remove isort and flake8 and replace with ruff * update linters * run formatter on virgo folder * fix linting errors and stuff from PR * update config * change config for timing code * update profiler to use 'with' for context managing * fix profiler.py --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Added first draft of virgo raytorchtrainer integration * Added ray components to main scripts * Gpu monitoring (#237) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * 
Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * remove redundant variable * remove trailing whitespace * fix issues from PR * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * add configurable and dynamic wait and warmup times for the profiler * remove old plot * move horovod import * fix linting errors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Added ray components to main scripts * Added first draft of raydeepspeed strategy * Added working version of deepspeed strategy, added RayDistributedStrategy as parent class * Scalability test wall clock (#239) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * temporary changes * remove redundant variable * add absolute time plot * remove trailing whitespace * remove redundant variable * remove trailing whitespace * begin implementation of backup * fix issues from PR * fix issues from PR * add backup to gpu monitoring * fix import in eurac trainer * cleanup backup mechanism slightly * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add 
backup to gpu monitoring * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * add configurable and dynamic wait and warmup times for the profiler * temporary changes * add absolute time plot * begin implementation of backup * add backup to gpu monitoring * cleanup backup mechanism slightly * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * fix linting errors * fix more linting errors * add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Itwinai jlab Docker image (#236) * Refactor Dockerfiles * Refactor container gen script * ADD jlab dockerfile * First working version of jlab container * ADD CMCC requirements * update dockerfiles * ADD nvconda and refactor * Update containers * ADD containers * ADD simple plus dockerfile * Update NV deps * Update CUDA * Add comment * Cleanup * Cleanup * UPDATE README * Refactor * Fix linter * Refactor dockerfiles and improve tests * Refactor * Refactor * Fix * Add first tests for HPC * First broken tests for HPC * Update tests and strategy * UPDATE tests * Update horovod tests * Update tests and jlab deps * Add MLFLow tracking URI * ADD distributed 
trainer tests * mpirun container deepspeed * Fix distributed strategy tests on multi-node * ADD srun launcher * Refactor jobscript * Cleanup * isort tests * Refactor scripts * Minor fixes * Add logging to file for all workers * Add jupyter base files * Add jupyter base files * spelling * Update provenance deps * Update DS version * Update prov docs * Cleanup * add nvidia dep * Remove incomplete work * update pyproject * ADD hadolit config file * FIX flag * Fix linters * Refactor * Update prov4ml * Update pytest CI * Minor fix * Incorporate feedback * Update Dockerfiles * Incorporate feedback * Update comments * Refactor tests * Virgo HDF5 file format (#240) * update virgo generated dataset to use hdf5 format * add functionality for selecting output location * set new data format as standard * make virgo work with new data loader and add progress bar * remove old generation files and add script for concatenating hdf5 files * remove old generation files and add script for concatenating hdf5 files * rename folder using hyphens * remove multiprocessing * add multiprocessing at correct place * update handling of seed and num processes * Gpu monitoring (#237) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * remove redundant variable * remove trailing whitespace * fix issues from PR * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * add configurable and dynamic wait and warmup times for the profiler * remove old plot * move horovod import * fix linting errors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo 
Bunino <[email protected]> * Scalability test wall clock (#239) * add gpu utilization decorator and begin work on plots * add decorator for gpu energy utilization * Added config option to hpo script, styling (#235) * Update README.md * Update README.md * Update createEnvVega.sh * remove unused dist file * run black and isort to fix linting errors * temporary changes * remove redundant variable * add absolute time plot * remove trailing whitespace * remove redundant variable * remove trailing whitespace * begin implementation of backup * fix issues from PR * fix issues from PR * add backup to gpu monitoring * fix import in eurac trainer * cleanup backup mechanism slightly * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * fix import in eurac trainer * fix linting errors * update logging directory and pattern * update default pattern for gpu energy plots * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup to gpu monitoring * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * add configurable and dynamic wait and warmup times for the profiler * temporary changes * add absolute time plot * begin implementation of backup * add backup to gpu monitoring * cleanup backup mechanism slightly * fix isort linting * add support for none pattern and general cleanup * fix linting errors with black and isort * begin implementation of backup * add backup functionality to communication plot * rewrite epochtimetracker and refactor scalability plot code * cleanup scalability plot code * updating some epochtimetracker dependencies * fix linting errors * fix more linting errors * 
add utilization percentage plot * run isort for linting * update default save path for metrics * add decorators to virgo and some cleanup * add contributions and cleanup * fix linting errors * change 'credits' to 'credit' * update communication plot style * update function names * update scalability function for a more streamlined approach * run isort * move horovod import * fix linting errors * add contributors --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * make virgo work with new data loader and add progress bar * add contributors * update ruff settings in pyproject * update virgo dataset concatenation * add isort option to ruff * break imports on purpose * break more imports to test * remove ruff config file * 😀 * test linter 😁 * remove comment in github workflows * add validation python to linter and make more mistakes * add linting errors to trainer * remove isort and flake8 and replace with ruff * update linters * run formatter on virgo folder * fix linting errors and stuff from PR * update config * change config for timing code * update profiler to use 'with' for context managing * fix profiler.py --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Matteo Bunino <[email protected]> * Changed files to create docs env and build docs remotely on juwels (#251) * Changed files to create docs env and build docs remotely on juwels * add docs extra to pyproject.toml * Added updated information to README * Grammar * Trailing whitespaces (*melting emoji*) --------- Co-authored-by: Anna Lappe <[email protected]> Co-authored-by: Jarl Sondre Sæther <[email protected]> * Fix scalability bug (#252) * add barrier implementation to distributed * fix profiler, seemingly * add print suppress to virgo * fix import bug * fix trailing whitespace 😇 * remove barrier method * Isort, format, delete old files * Formatting imports with ruff * Linting * Specified super linter should not use isort * 
Unm-messed up the cli.py file * Incorporated PR comments * Fixed & sorted imports * Remove horovod option from RayNoiseGeneratorTrainer * Typo * HPO tutorial first draft * Incorporate PR comments (most importantly, change inheritance for ray strategies) * First draft tutorial * PR comments, refactored RayDDPStrategy * PR comments (super) * PR comments (refactored dataloader function in RayTorchTrainer) * PR comments * Linting * Remove else and line break * Removed patch version specifications, refactored slurm launcher script * Added how-it-works for HPO, updated HPO tutorial * Removed unused export in slurm script, removed abstract train method in RayTorchTrainer * Bugfix in the search alg/ scheduler setting and linting * Pyproject versions (PR comment) * I had already done that actually, so nevermind... Undoing the last commit * Updated tutorial * Link and reference fixes * bash linting error * Added MNIST data to gitignore * Removed MNIST files for testing * PR comments * Phrasing of one sentence * Updated .gitignore * Added new tutorial on non-distributed HPO * Added tutorial code (there is still some bugs) * Fixed bugs in tutorial * Linting and making sure code is up-to-date in tutorial * Duplicates from merge ... --------- Co-authored-by: Matteo Bunino <[email protected]> Co-authored-by: Jarl Sæther <[email protected]>
Summary
Add a decorator and a CLI command for measuring GPU utilization. You can now add the decorator to the `execute()` method of any `TorchTrainer` class, and it will write its log files to a folder named `utilization_logs`. Then, when training is finished, you can create a plot using the `itwinai generate-gpu-energy-plot` command. An example of a resulting plot (real data from training on EURAC) can be seen here:
Related issue: #221
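To make the workflow in the summary concrete, here is a minimal sketch of how a decorator of this kind could work. It is not the code from this PR: the names `log_gpu_utilization`, `fake_gpu_probe` and `ToyTrainer`, the CSV layout, and the sampling interval are all assumptions made for illustration, and the probe returns placeholder values instead of querying a real GPU.

```python
import csv
import functools
import threading
import time
from pathlib import Path
from typing import Callable, Dict


def fake_gpu_probe() -> Dict[str, float]:
    """Hypothetical probe returning placeholder values.

    A real probe would query the GPU driver; that part is left out here.
    """
    return {"utilization_pct": 0.0, "power_w": 0.0}


def log_gpu_utilization(
    log_dir: str = "utilization_logs",
    interval_s: float = 0.05,
    probe: Callable[[], Dict[str, float]] = fake_gpu_probe,
):
    """Hypothetical decorator: sample `probe` in a background thread while
    the wrapped function runs, then write the samples to a CSV file."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rows = []
            stop = threading.Event()
            start = time.monotonic()

            def sample_loop():
                # Take one sample immediately, then one every `interval_s`
                # seconds until the stop event is set.
                while True:
                    rows.append({"elapsed_s": time.monotonic() - start, **probe()})
                    if stop.wait(interval_s):
                        break

            sampler = threading.Thread(target=sample_loop, daemon=True)
            sampler.start()
            try:
                return func(*args, **kwargs)
            finally:
                # Stop sampling and write the log even if `func` raised.
                stop.set()
                sampler.join()
                out_dir = Path(log_dir)
                out_dir.mkdir(parents=True, exist_ok=True)
                out_file = out_dir / f"{func.__name__}.csv"
                with out_file.open("w", newline="") as f:
                    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
                    writer.writeheader()
                    writer.writerows(rows)

        return wrapper

    return decorator


class ToyTrainer:
    """Stand-in for a trainer class; not the itwinai TorchTrainer."""

    @log_gpu_utilization()
    def execute(self, steps: int = 3) -> int:
        for _ in range(steps):
            time.sleep(0.02)  # pretend to run a training step
        return steps


if __name__ == "__main__":
    ToyTrainer().execute()
```

Running the sketch leaves a file `utilization_logs/execute.csv` behind. In the workflow described above, the plotting step is then handled by the `itwinai generate-gpu-energy-plot` command; the file format that command expects is not shown here and may differ from this CSV layout.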