Support dumping training logs for TensorBoard visualization toolkit. (#1144)

*Issue #, if available:*
#988 

*Description of changes:*
Add a TensorBoard tracker that saves training loss, validation scores, and test scores into TensorBoard logs.
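
Under the hood the tracker builds on ``torch.utils.tensorboard.SummaryWriter`` (see ``tensorboard_tracker.py`` below). A minimal sketch of that write path, with illustrative directory and values:

```python
# Minimal sketch of the write path the new tracker relies on; the log
# directory, tag, value, and step below are illustrative only.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("./logs/")           # event files land in ./logs/
writer.add_scalar("loss/Train", 0.42, 100)  # tag, scalar value, global step
writer.close()
```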



By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Xiang Song <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
3 people authored Jan 28, 2025
1 parent c2148ae commit 7c053cf
Showing 12 changed files with 424 additions and 24 deletions.
2 changes: 1 addition & 1 deletion .github/workflow_scripts/pytest_check.sh
@@ -9,7 +9,7 @@ GS_HOME=$(pwd)
 # Add SageMaker launch scripts to make the scripts testable
 export PYTHONPATH="${PYTHONPATH}:${GS_HOME}/sagemaker/launch/"
 
-python3 -m pip install pytest
+python3 -m pip install pytest tensorboard
 FORCE_CUDA=1 python3 -m pip install -e '.[test]' --no-build-isolation
 
 # Run SageMaker tests
@@ -161,10 +161,10 @@ GraphStorm provides a set of parameters to control how and where to save and res
   - Yaml: ``save_perf_results_path: /model/results/``
   - Argument: ``--save-perf-results-path /model/results/``
   - Default value: ``None``
-- **task_tracker**: A task tracker used to formalize and report model performance metrics. Now GraphStorm only supports sagemaker_task_tracker which prints evaluation metrics in a formatted way so that a user can capture those metrics through SageMaker. See Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics for more details.
+- **task_tracker**: A task tracker used to formalize and report model performance metrics. GraphStorm now supports two task trackers: ``sagemaker_task_tracker`` and ``tensorboard_task_tracker``. ``sagemaker_task_tracker`` prints evaluation metrics in a formatted way so that a user can capture those metrics through SageMaker. (See Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics for more details.) ``tensorboard_task_tracker`` dumps evaluation metrics in a format that can be loaded by TensorBoard. The default path for storing the TensorBoard logs is ``./runs/`` under **workspace**. Users can define their own TensorBoard log directory by setting **task_tracker** to ``tensorboard_task_tracker:LOG_PATH``, where ``LOG_PATH`` is the TensorBoard log directory. (Note: to use ``tensorboard_task_tracker``, install the tensorboard Python package via ``pip install tensorboard``, or install GraphStorm with ``pip install graphstorm[tensorboard]``.)
 
-  - Yaml: ``task_tracker: sagemaker_task_tracker``
-  - Argument: ``--task_tracker sagemaker_task_tracker``
+  - Yaml: ``task_tracker: tensorboard_task_tracker:./logs/``
+  - Argument: ``--task_tracker tensorboard_task_tracker:./logs/``
   - Default value: ``sagemaker_task_tracker``
 - **restore_model_path**: A path where GraphStorm model parameters were saved. For training, if restore_model_path is set, GraphStorm will retrieve the model parameters from restore_model_path instead of initializing the parameters. For inference, restore_model_path must be provided.
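
As an editor's illustration (not part of the diff): the setting above would sit in the ``basic`` section of a GraphStorm training YAML. The sketch below parses such a fragment with PyYAML; the ``gsf``/``basic`` nesting is assumed from GraphStorm's config layout, so treat the exact structure as an assumption.

```python
# Hedged sketch: a YAML fragment enabling the TensorBoard tracker.
# The gsf/basic nesting is an assumption about GraphStorm's config layout.
import yaml  # requires PyYAML

fragment = """
gsf:
  basic:
    task_tracker: "tensorboard_task_tracker:./logs/"
"""
cfg = yaml.safe_load(fragment)
print(cfg["gsf"]["basic"]["task_tracker"])  # tensorboard_task_tracker:./logs/
```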
2 changes: 2 additions & 0 deletions python/graphstorm/config/__init__.py
@@ -56,5 +56,7 @@
                      BUILTIN_CLASS_LOSS_FUNCTION)
 from .config import (GRAPHSTORM_LP_EMB_L2_NORMALIZATION,
                      GRAPHSTORM_LP_EMB_NORMALIZATION_METHODS)
+from .config import (GRAPHSTORM_SAGEMAKER_TASK_TRACKER,
+                     GRAPHSTORM_TENSORBOARD_TASK_TRACKER)
 
 from .config import TaskInfo
58 changes: 54 additions & 4 deletions python/graphstorm/config/argument.py
@@ -1725,17 +1725,67 @@ def topk_model_to_save(self):
     @property
     def task_tracker(self):
         """ A task tracker used to formalize and report model performance metrics.
-            Default is ``sagemaker_task_tracker``.
+            The supported task trackers include SageMaker (``sagemaker_task_tracker``)
+            and TensorBoard (``tensorboard_task_tracker``). The user can specify it in
+            the yaml configuration as follows:
+
+            .. code:: yaml
+
+                basic:
+                    task_tracker: "tensorboard_task_tracker"
+
+            The default is ``sagemaker_task_tracker``, which logs the metrics using
+            the Python logging facility.
+
+            For the TensorBoard tracker, users can specify a directory to store the
+            logs by providing the path in the format of
+            ``tensorboard_task_tracker:FILE_PATH``. The TensorBoard logs will be
+            stored under ``FILE_PATH``.
+
+            .. versionchanged:: 0.4.1
+                Added support for the TensorBoard tracker.
         """
         # pylint: disable=no-member
         if hasattr(self, "_task_tracker"):
-            assert self._task_tracker in SUPPORTED_TASK_TRACKER
-            return self._task_tracker
+            tracker_info = self._task_tracker.split(":")
+            task_tracker_name = tracker_info[0]
+
+            assert task_tracker_name in SUPPORTED_TASK_TRACKER, \
+                f"Task tracker must be one of {SUPPORTED_TASK_TRACKER}, " \
+                f"but got {task_tracker_name}."
+            return task_tracker_name
 
         # By default, use SageMaker task tracker
         # It works as normal print
         return GRAPHSTORM_SAGEMAKER_TASK_TRACKER
 
+    @property
+    def task_tracker_logpath(self):
+        """ A path for the task tracker to store its logs.
+
+            SageMaker trackers will ignore this property. For the TensorBoard
+            tracker, users can specify a directory to store the logs by providing
+            the path in the format of ``tensorboard_task_tracker:FILE_PATH``;
+            task_tracker_logpath will then be set to ``FILE_PATH``.
+
+            Default: None
+
+            .. versionadded:: 0.4.1
+        """
+        # pylint: disable=no-member
+        if hasattr(self, "_task_tracker"):
+            tracker_info = self._task_tracker.split(":")
+            # task_tracker information in the format of
+            # tensorboard_task_tracker:FILE_PATH
+            if len(tracker_info) > 1:
+                return tracker_info[1]
+            else:
+                return None
+        return None
+
     @property
     def log_report_frequency(self):
         """ Get print/log frequency in number of iterations
@@ -3237,7 +3287,7 @@ def _add_output_args(parser):
 def _add_task_tracker(parser):
     group = parser.add_argument_group(title="task_tracker")
     group.add_argument("--task-tracker", type=str, default=argparse.SUPPRESS,
-                       help=f'Task tracker name. Now we only support {GRAPHSTORM_SAGEMAKER_TASK_TRACKER}')
+                       help=f'Task tracker name. Now we support {SUPPORTED_TASK_TRACKER}')
     group.add_argument("--log-report-frequency", type=int, default=argparse.SUPPRESS,
                        help="Task running log report frequency. "
                             "In training, every log_report_frequency, the task states are reported")
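
The two properties above split one configuration string into a tracker name and an optional log path. A standalone sketch of that parsing rule:

```python
# Mirrors the task_tracker / task_tracker_logpath properties above:
# everything before the first colon is the tracker name; the remainder,
# if present, is treated as the TensorBoard log directory.
spec = "tensorboard_task_tracker:./logs/"
tracker_info = spec.split(":")
tracker_name = tracker_info[0]                                 # "tensorboard_task_tracker"
log_path = tracker_info[1] if len(tracker_info) > 1 else None  # "./logs/"
```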
4 changes: 3 additions & 1 deletion python/graphstorm/config/config.py
@@ -77,8 +77,10 @@
 
 # Task tracker
 GRAPHSTORM_SAGEMAKER_TASK_TRACKER = "sagemaker_task_tracker"
+GRAPHSTORM_TENSORBOARD_TASK_TRACKER = "tensorboard_task_tracker"
 
-SUPPORTED_TASK_TRACKER = [GRAPHSTORM_SAGEMAKER_TASK_TRACKER]
+SUPPORTED_TASK_TRACKER = [GRAPHSTORM_SAGEMAKER_TASK_TRACKER,
+                          GRAPHSTORM_TENSORBOARD_TASK_TRACKER]
 
 # Link prediction decoder
 BUILTIN_LP_DOT_DECODER = "dot_product"
7 changes: 5 additions & 2 deletions python/graphstorm/gsf.py
@@ -1122,8 +1122,11 @@ def create_builtin_task_tracker(config):
     config: GSConfig
         Configurations
     """
-    tracker_class = get_task_tracker_class(config.task_tracker)
-    return tracker_class(config.eval_frequency)
+    task_tracker = config.task_tracker
+    log_dir = config.task_tracker_logpath
+    tracker_class = get_task_tracker_class(task_tracker)
+    return tracker_class(log_report_frequency=config.eval_frequency,
+                         log_dir=log_dir)
 
 def get_builtin_lp_eval_dataloader_class(config):
     """ Return a builtin link prediction evaluation dataloader
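
A hedged usage sketch of the factory above; ``_FakeConfig`` is a hypothetical stand-in for a loaded ``GSConfig`` that stubs exactly the attributes ``create_builtin_task_tracker`` reads:

```python
from graphstorm.gsf import create_builtin_task_tracker

class _FakeConfig:  # hypothetical stand-in for a loaded GSConfig
    task_tracker = "tensorboard_task_tracker"
    task_tracker_logpath = "./logs/"
    eval_frequency = 100

tracker = create_builtin_task_tracker(_FakeConfig())
tracker.log_train_metric("loss", 0.42, step=100, force_report=True)
```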
13 changes: 10 additions & 3 deletions python/graphstorm/tracker/__init__.py
@@ -18,8 +18,11 @@
     Builtin training tracker supports:
      - GSSageMakerTaskTracker: GraphStorm SageMaker Task Tracker
 """
 from .graphstorm_tracker import GSTaskTrackerAbc
+from ..config import (GRAPHSTORM_SAGEMAKER_TASK_TRACKER,
+                      GRAPHSTORM_TENSORBOARD_TASK_TRACKER)
 
 from .sagemaker_tracker import GSSageMakerTaskTracker
+from .tensorboard_tracker import GSTensorBoardTracker
 
 def get_task_tracker_class(tracker_name):
     """ Get builtin task tracker
@@ -29,10 +32,14 @@ def get_task_tracker_class(tracker_name):
     tracker_name: str
         task tracker name. 'SageMaker' for GSSageMakerTaskTracker
     """
-    if tracker_name == 'SageMaker':
+    if tracker_name == GRAPHSTORM_SAGEMAKER_TASK_TRACKER:
         # SageMaker tracker also works as normal print tracker
         return GSSageMakerTaskTracker
-    # TODO: Support mlflow, etc.
+    elif tracker_name == GRAPHSTORM_TENSORBOARD_TASK_TRACKER:
+        # Note: TensorBoard support is optional.
+        # To enable GSTensorBoardTracker, one should
+        # install the tensorboard Python package
+        return GSTensorBoardTracker
     else:
         # by default use GSSageMakerTaskTracker
         return GSSageMakerTaskTracker
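
A small usage sketch of the factory above; the constructor arguments mirror the call in ``gsf.create_builtin_task_tracker``:

```python
from graphstorm.tracker import get_task_tracker_class

# Resolve the tracker class by its registered name, then instantiate it.
tracker_class = get_task_tracker_class("tensorboard_task_tracker")
tracker = tracker_class(log_report_frequency=100, log_dir="./logs/")
```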
12 changes: 10 additions & 2 deletions python/graphstorm/tracker/graphstorm_tracker.py
@@ -23,12 +23,20 @@ class GSTaskTrackerAbc():
     Parameters
     ----------
     log_report_frequency: int
-        The frequency of reporting model performance metrics through task_tracker.
+        The frequency of reporting model performance metrics through task_tracker.
         The frequency is defined by using number of iterations, i.e., every N iterations
         the evaluation metrics will be reported.
+    log_dir: str
+        Directory to save the logs. TaskTrackers may store logs on disk for
+        visualization or offline analysis.
+        Default: None
+
+    .. versionchanged:: 0.4.1
+        Added argument ``log_dir``.
     """
-    def __init__(self, log_report_frequency):
+    def __init__(self, log_report_frequency, log_dir=None):
         self._report_frequency = log_report_frequency # Can be None if not provided
+        self._log_dir = log_dir
 
     @abc.abstractmethod
     def log_metric(self, metric_name, metric_value, step, force_report=False):
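
Since the base class now threads ``log_dir`` through its constructor, a custom tracker only needs to accept the same arguments and implement the abstract hooks. A minimal, hypothetical subclass sketch (note the base class may declare further abstract methods beyond ``log_metric`` that a real subclass would also override):

```python
import logging

from graphstorm.tracker import GSTaskTrackerAbc

class PrintOnlyTracker(GSTaskTrackerAbc):
    """Hypothetical tracker that only prints and deliberately ignores log_dir."""

    def log_metric(self, metric_name, metric_value, step, force_report=False):
        # No disk output: report through Python logging only.
        logging.info("Step %d | %s: %s", step, metric_name, metric_value)
```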
194 changes: 194 additions & 0 deletions python/graphstorm/tracker/tensorboard_tracker.py
@@ -0,0 +1,194 @@
"""
    Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

    TensorBoard task tracker
"""
import numbers
import logging
import importlib

from ..utils import get_rank
from .sagemaker_tracker import GSSageMakerTaskTracker

class GSTensorBoardTracker(GSSageMakerTaskTracker):
    """ GraphStorm builtin TensorBoard task tracker.

        GSTensorBoardTracker inherits from GSSageMakerTaskTracker.
        It follows the same logic as GSSageMakerTaskTracker to print logs.
        It uses torch.utils.tensorboard.SummaryWriter to dump training
        losses, validation and test scores into log files.

        Parameters
        ----------
        log_report_frequency: int
            The frequency of reporting model performance metrics through task_tracker.
            The frequency is defined by using number of iterations, i.e., every N iterations
            the evaluation metrics will be reported.
        log_dir: str
            Save directory location. The default setting is
            runs/**CURRENT_DATETIME_HOSTNAME**, which changes after each run.
            Use a hierarchical folder structure to compare between runs easily,
            e.g., pass in 'runs/exp1', 'runs/exp2', etc.
            See https://pytorch.org/docs/stable/tensorboard.html for more details.
            Default: None.

        .. versionadded:: 0.4.1
            The :py:class:`GSTensorBoardTracker`.
    """
    def __init__(self, log_report_frequency, log_dir=None):
        super().__init__(log_report_frequency, log_dir)
        try:
            tensorboard = importlib.import_module("torch.utils.tensorboard")
        except ImportError as err:
            msg = (
                "GSTensorBoardTracker requires tensorboard to run. "
                "Please install the tensorboard Python package.")
            raise ImportError(msg) from err
        self._writer = tensorboard.SummaryWriter(log_dir)

    def log_metric(self, metric_name, metric_value, step, force_report=False):
        """ Log validation or test metric

        Parameters
        ----------
        metric_name: str
            Validation or test metric name
        metric_value:
            Validation or test metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        if force_report or self._do_report(step):
            if metric_value is not None:
                if isinstance(metric_value, str):
                    # Only rank 0 will write log to TensorBoard
                    if get_rank() == 0:
                        self._writer.add_text(metric_name, metric_value, step)
                    logging.info("Step %d | %s: %s", step, metric_name, metric_value)
                elif isinstance(metric_value, numbers.Number):
                    # Only rank 0 will write log to TensorBoard
                    if get_rank() == 0:
                        self._writer.add_scalar(metric_name, metric_value, step)
                    logging.info("Step %d | %s: %.4f", step, metric_name, metric_value)
                else:
                    # Only rank 0 will write log to TensorBoard
                    if get_rank() == 0:
                        self._writer.add_text(metric_name, str(metric_value), step)
                    logging.info("Step %d | %s: %s", step, metric_name, str(metric_value))

    def log_train_metric(self, metric_name, metric_value, step, force_report=False):
        """ Log train metric

        Parameters
        ----------
        metric_name: str
            Train metric name
        metric_value:
            Train metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Train"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_best_test(self, metric_name, metric_value, step, force_report=False):
        """ Log best test score

        Parameters
        ----------
        metric_name: str
            Test metric name
        metric_value:
            Test metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Best Test"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_test_metric(self, metric_name, metric_value, step, force_report=False):
        """ Log test metric

        Parameters
        ----------
        metric_name: str
            Test metric name
        metric_value:
            Test metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Test"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_best_valid(self, metric_name, metric_value, step, force_report=False):
        """ Log best validation score

        Parameters
        ----------
        metric_name: str
            Validation metric name
        metric_value:
            Validation metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Best Validation"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_valid_metric(self, metric_name, metric_value, step, force_report=False):
        """ Log validation metric

        Parameters
        ----------
        metric_name: str
            Validation metric name
        metric_value: float
            Validation metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Validation"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_best_iter(self, metric_name, best_iter, step, force_report=False):
        """ Log best iteration

        Parameters
        ----------
        metric_name: str
            Metric name
        best_iter:
            Best iteration number
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Best Iteration"
        self.log_metric(metric_name, best_iter, step, force_report)
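
End to end, the new tracker can also be driven directly. A hedged sketch assuming a single-process run where GraphStorm's ``get_rank()`` resolves to 0; the resulting event files can then be inspected with ``tensorboard --logdir ./logs/``:

```python
from graphstorm.tracker import GSTensorBoardTracker

tracker = GSTensorBoardTracker(log_report_frequency=100, log_dir="./logs/")
tracker.log_train_metric("loss", 0.42, step=100, force_report=True)
tracker.log_valid_metric("accuracy", 0.91, step=100, force_report=True)
# Tags land as "loss/Train" and "accuracy/Validation" in TensorBoard.
```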
