Scalability test wall clock (#239)
* add gpu utilization decorator and begin work on plots

* add decorator for gpu energy utilization

* Added config option to hpo script, styling (#235)

* Update README.md

* Update README.md

* Update createEnvVega.sh

* remove unused dist file

* run black and isort to fix linting errors

* temporary changes

* remove redundant variable

* add absolute time plot

* remove trailing whitespace

* remove redundant variable

* remove trailing whitespace

* begin implementation of backup

* fix issues from PR

* fix issues from PR

* add backup to gpu monitoring

* fix import in eurac trainer

* cleanup backup mechanism slightly

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* fix import in eurac trainer

* fix linting errors

* update logging directory and pattern

* update default pattern for gpu energy plots

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup to gpu monitoring

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* add configurable and dynamic wait and warmup times for the profiler

* temporary changes

* add absolute time plot

* begin implementation of backup

* add backup to gpu monitoring

* cleanup backup mechanism slightly

* fix isort linting

* add support for none pattern and general cleanup

* fix linting errors with black and isort

* begin implementation of backup

* add backup functionality to communication plot

* rewrite epochtimetracker and refactor scalability plot code

* cleanup scalability plot code

* updating some epochtimetracker dependencies

* fix linting errors

* fix more linting errors

* add utilization percentage plot

* run isort for linting

* update default save path for metrics

* add decorators to virgo and some cleanup

* add contributions and cleanup

* fix linting errors

* change 'credits' to 'credit'

* update communication plot style

* update function names

* update scalability function for a more streamlined approach

* run isort

* move horovod import

* fix linting errors

* add contributors

---------

Co-authored-by: Anna Lappe <[email protected]>
Co-authored-by: Matteo Bunino <[email protected]>
3 people committed Nov 20, 2024
1 parent 1a34203 commit b2ceb4f
Showing 1 changed file with 11 additions and 12 deletions.
23 changes: 11 additions & 12 deletions src/itwinai/torch/profiling/communication_plot.py
@@ -142,39 +142,38 @@ def communication_overhead_stacked_bar_plot(
     return fig, ax


->>>>>> > d538510(Gpu monitoring( # 237))
 def get_comp_fraction_full_array(
-    df: pd.DataFrame, print_table: bool=False
+    df: pd.DataFrame, print_table: bool = False
 ) -> np.ndarray:
     """Creates a MxN NumPy array where M is the number of strategies and N is the
     number of GPU configurations. The strategies are sorted alphabetically and the GPU
     configurations are sorted in ascending number of GPUs.
     """
-    unique_num_gpus=sorted(df["num_gpus"].unique(), key=lambda x: int(x))
-    unique_strategies=sorted(df["strategy"].unique())
-    values=[]
+    unique_num_gpus = sorted(df["num_gpus"].unique(), key=lambda x: int(x))
+    unique_strategies = sorted(df["strategy"].unique())
+    values = []

-    table_string=""
+    table_string = ""

     for strategy in unique_strategies:
-        strategy_values=[]
+        strategy_values = []
         for num_gpus in unique_num_gpus:
-            filtered_df=df[
+            filtered_df = df[
                 (df["strategy"] == strategy) & (df["num_gpus"] == num_gpus)
             ]

-            row_string=f"{strategy:>12} | {num_gpus:>10}"
+            row_string = f"{strategy:>12} | {num_gpus:>10}"

             # Allows some strategies or num GPUs to not be included
             if len(filtered_df) == 0:
-                comp_time, comm_time=np.NaN, np.NaN
+                comp_time, comm_time = np.NaN, np.NaN
                 strategy_values.append(np.NaN)

                 row_string += f" | {'(NO DATA)':>15}"
             else:
-                comp_time, comm_time=calculate_comp_and_comm_time(df=filtered_df)
+                comp_time, comm_time = calculate_comp_and_comm_time(df=filtered_df)
                 # Avoid division-by-zero errors (1e-10)
-                comp_fraction=comp_time / (comp_time + comm_time + 1e-10)
+                comp_fraction = comp_time / (comp_time + comm_time + 1e-10)
                 strategy_values.append(comp_fraction)

                 row_string += f" | {comp_time:>8.2f}s"
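
The hunk is cut off before the function returns, so the following is only a minimal sketch of the logic visible in it, not the code from the repository. It builds the strategies-by-GPU-counts array of computation fractions and leaves NaN where a combination has no data. The helper calculate_comp_and_comm_time is not shown in the diff, so the sketch assumes hypothetical "comp_time" and "comm_time" columns and sums them instead. The strategy names and timings are made-up illustration data. It also uses np.nan rather than np.NaN, because the np.NaN alias was removed in NumPy 2.0.

```python
import numpy as np
import pandas as pd


def comp_fraction_array(df: pd.DataFrame) -> np.ndarray:
    """Simplified, hypothetical stand-in for get_comp_fraction_full_array.

    Assumes `df` has the columns "strategy", "num_gpus", "comp_time" and
    "comm_time". The last two are an assumption made for this sketch only.
    """
    unique_num_gpus = sorted(df["num_gpus"].unique(), key=int)
    unique_strategies = sorted(df["strategy"].unique())

    values = []
    for strategy in unique_strategies:
        strategy_values = []
        for num_gpus in unique_num_gpus:
            subset = df[(df["strategy"] == strategy) & (df["num_gpus"] == num_gpus)]
            if len(subset) == 0:
                # Missing (strategy, num_gpus) combinations stay as NaN
                strategy_values.append(np.nan)
                continue
            comp_time = subset["comp_time"].sum()
            comm_time = subset["comm_time"].sum()
            # The small constant avoids division by zero when both times are 0
            strategy_values.append(comp_time / (comp_time + comm_time + 1e-10))
        values.append(strategy_values)

    # Rows follow the sorted strategies, columns the sorted GPU counts
    return np.array(values)


# Made-up illustration data: "horovod" has no 8-GPU measurement
example_df = pd.DataFrame(
    {
        "strategy": ["ddp", "ddp", "horovod"],
        "num_gpus": [4, 8, 4],
        "comp_time": [3.0, 2.0, 1.0],
        "comm_time": [1.0, 2.0, 3.0],
    }
)
fractions = comp_fraction_array(example_df)
print(fractions)
```

With this input the result has shape (2, 2), and the entry for "horovod" on 8 GPUs is NaN. A plotting routine can then skip or mask that cell rather than fail.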