Support dumping training logs for TensorBoard visualization toolkit. (#1144)

*Issue #, if available:*
#988 

*Description of changes:*
Add a TensorBoard tracker that saves training loss, validation scores, and test scores into TensorBoard logs.
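
Under the hood the tracker builds on ``torch.utils.tensorboard.SummaryWriter`` (see ``tensorboard_tracker.py`` below). A minimal sketch of that write path, with illustrative directory and values:

```python
# Minimal sketch of the write path the new tracker relies on; the log
# directory, tag, value, and step below are illustrative only.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("./logs/")           # event files land in ./logs/
writer.add_scalar("loss/Train", 0.42, 100)  # tag, scalar value, global step
writer.close()
```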



By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: Xiang Song <[email protected]>
Co-authored-by: Theodore Vasiloudis <[email protected]>
3 people authored Jan 28, 2025
1 parent c2148ae commit 7c053cf
Showing 12 changed files with 424 additions and 24 deletions.
2 changes: 1 addition & 1 deletion .github/workflow_scripts/pytest_check.sh
@@ -9,7 +9,7 @@ GS_HOME=$(pwd)
 # Add SageMaker launch scripts to make the scripts testable
 export PYTHONPATH="${PYTHONPATH}:${GS_HOME}/sagemaker/launch/"
 
-python3 -m pip install pytest
+python3 -m pip install pytest tensorboard
 FORCE_CUDA=1 python3 -m pip install -e '.[test]' --no-build-isolation
 
 # Run SageMaker tests
@@ -161,10 +161,10 @@ GraphStorm provides a set of parameters to control how and where to save and res
   - Yaml: ``save_perf_results_path: /model/results/``
   - Argument: ``--save-perf-results-path /model/results/``
   - Default value: ``None``
-- **task_tracker**: A task tracker used to formalize and report model performance metrics. Now GraphStorm only supports sagemaker_task_tracker which prints evaluation metrics in a formatted way so that a user can capture those metrics through SageMaker. See Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics for more details.
+- **task_tracker**: A task tracker used to formalize and report model performance metrics. GraphStorm now supports two task trackers: ``sagemaker_task_tracker`` and ``tensorboard_task_tracker``. ``sagemaker_task_tracker`` prints evaluation metrics in a formatted way so that a user can capture those metrics through SageMaker. (See Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics for more details.) ``tensorboard_task_tracker`` dumps evaluation metrics in a format that can be loaded by TensorBoard. The default path for storing the TensorBoard logs is ``./runs/`` under **workspace**. Users can define their own TensorBoard log directory by setting **task_tracker** to ``tensorboard_task_tracker:LOG_PATH``, where ``LOG_PATH`` is the TensorBoard log directory. (Note: to use ``tensorboard_task_tracker``, install the tensorboard Python package via ``pip install tensorboard``, or install GraphStorm with ``pip install graphstorm[tensorboard]``.)
 
-  - Yaml: ``task_tracker: sagemaker_task_tracker``
-  - Argument: ``--task_tracker sagemaker_task_tracker``
+  - Yaml: ``task_tracker: tensorboard_task_tracker:./logs/``
+  - Argument: ``--task_tracker tensorboard_task_tracker:./logs/``
   - Default value: ``sagemaker_task_tracker``
 - **restore_model_path**: A path where GraphStorm model parameters were saved. For training, if restore_model_path is set, GraphStorm will retrieve the model parameters from restore_model_path instead of initializing the parameters. For inference, restore_model_path must be provided.
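
As an editor's illustration (not part of the diff): the setting above would sit in the ``basic`` section of a GraphStorm training YAML. The sketch below parses such a fragment with PyYAML; the ``gsf``/``basic`` nesting is assumed from GraphStorm's config layout, so treat the exact structure as an assumption.

```python
# Hedged sketch: a YAML fragment enabling the TensorBoard tracker.
# The gsf/basic nesting is an assumption about GraphStorm's config layout.
import yaml  # requires PyYAML

fragment = """
gsf:
  basic:
    task_tracker: "tensorboard_task_tracker:./logs/"
"""
cfg = yaml.safe_load(fragment)
print(cfg["gsf"]["basic"]["task_tracker"])  # tensorboard_task_tracker:./logs/
```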
2 changes: 2 additions & 0 deletions python/graphstorm/config/__init__.py
@@ -56,5 +56,7 @@
                      BUILTIN_CLASS_LOSS_FUNCTION)
 from .config import (GRAPHSTORM_LP_EMB_L2_NORMALIZATION,
                      GRAPHSTORM_LP_EMB_NORMALIZATION_METHODS)
+from .config import (GRAPHSTORM_SAGEMAKER_TASK_TRACKER,
+                     GRAPHSTORM_TENSORBOARD_TASK_TRACKER)
 
 from .config import TaskInfo
58 changes: 54 additions & 4 deletions python/graphstorm/config/argument.py
@@ -1725,17 +1725,67 @@ def topk_model_to_save(self):
     @property
     def task_tracker(self):
         """ A task tracker used to formalize and report model performance metrics.
-            Default is ``sagemaker_task_tracker``.
+            The supported task trackers include SageMaker (``sagemaker_task_tracker``)
+            and TensorBoard (``tensorboard_task_tracker``). The user can specify it in
+            the yaml configuration as follows:
+
+            .. code:: yaml
+
+                basic:
+                    task_tracker: "tensorboard_task_tracker"
+
+            The default is ``sagemaker_task_tracker``, which logs the metrics using
+            the Python logging facility.
+
+            For the TensorBoard tracker, users can specify a directory to store the
+            logs by providing the path in the format of
+            ``tensorboard_task_tracker:FILE_PATH``. The TensorBoard logs will be
+            stored under ``FILE_PATH``.
+
+            .. versionchanged:: 0.4.1
+                Added support for the TensorBoard tracker.
         """
         # pylint: disable=no-member
         if hasattr(self, "_task_tracker"):
-            assert self._task_tracker in SUPPORTED_TASK_TRACKER
-            return self._task_tracker
+            tracker_info = self._task_tracker.split(":")
+            task_tracker_name = tracker_info[0]
+
+            assert task_tracker_name in SUPPORTED_TASK_TRACKER, \
+                f"Task tracker must be one of {SUPPORTED_TASK_TRACKER}, " \
+                f"but got {task_tracker_name}."
+            return task_tracker_name
 
         # By default, use SageMaker task tracker
         # It works as normal print
         return GRAPHSTORM_SAGEMAKER_TASK_TRACKER
 
+    @property
+    def task_tracker_logpath(self):
+        """ A path for the task tracker to store its logs.
+
+            SageMaker trackers will ignore this property. For the TensorBoard
+            tracker, users can specify a directory to store the logs by providing
+            the path in the format of ``tensorboard_task_tracker:FILE_PATH``;
+            task_tracker_logpath will then be set to ``FILE_PATH``.
+
+            Default: None
+
+            .. versionadded:: 0.4.1
+        """
+        # pylint: disable=no-member
+        if hasattr(self, "_task_tracker"):
+            tracker_info = self._task_tracker.split(":")
+            # task_tracker information in the format of
+            # tensorboard_task_tracker:FILE_PATH
+            if len(tracker_info) > 1:
+                return tracker_info[1]
+            else:
+                return None
+        return None
+
     @property
     def log_report_frequency(self):
         """ Get print/log frequency in number of iterations
@@ -3237,7 +3287,7 @@ def _add_output_args(parser):
 def _add_task_tracker(parser):
     group = parser.add_argument_group(title="task_tracker")
     group.add_argument("--task-tracker", type=str, default=argparse.SUPPRESS,
-                       help=f'Task tracker name. Now we only support {GRAPHSTORM_SAGEMAKER_TASK_TRACKER}')
+                       help=f'Task tracker name. Now we support {SUPPORTED_TASK_TRACKER}')
     group.add_argument("--log-report-frequency", type=int, default=argparse.SUPPRESS,
                        help="Task running log report frequency. "
                             "In training, every log_report_frequency, the task states are reported")
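
The two properties above split one configuration string into a tracker name and an optional log path. A standalone sketch of that parsing rule:

```python
# Mirrors the task_tracker / task_tracker_logpath properties above:
# everything before the first colon is the tracker name; the remainder,
# if present, is treated as the TensorBoard log directory.
spec = "tensorboard_task_tracker:./logs/"
tracker_info = spec.split(":")
tracker_name = tracker_info[0]                                 # "tensorboard_task_tracker"
log_path = tracker_info[1] if len(tracker_info) > 1 else None  # "./logs/"
```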
4 changes: 3 additions & 1 deletion python/graphstorm/config/config.py
@@ -77,8 +77,10 @@
 
 # Task tracker
 GRAPHSTORM_SAGEMAKER_TASK_TRACKER = "sagemaker_task_tracker"
+GRAPHSTORM_TENSORBOARD_TASK_TRACKER = "tensorboard_task_tracker"
 
-SUPPORTED_TASK_TRACKER = [GRAPHSTORM_SAGEMAKER_TASK_TRACKER]
+SUPPORTED_TASK_TRACKER = [GRAPHSTORM_SAGEMAKER_TASK_TRACKER,
+                          GRAPHSTORM_TENSORBOARD_TASK_TRACKER]
 
 # Link prediction decoder
 BUILTIN_LP_DOT_DECODER = "dot_product"
7 changes: 5 additions & 2 deletions python/graphstorm/gsf.py
@@ -1122,8 +1122,11 @@ def create_builtin_task_tracker(config):
     config: GSConfig
         Configurations
     """
-    tracker_class = get_task_tracker_class(config.task_tracker)
-    return tracker_class(config.eval_frequency)
+    task_tracker = config.task_tracker
+    log_dir = config.task_tracker_logpath
+    tracker_class = get_task_tracker_class(task_tracker)
+    return tracker_class(log_report_frequency=config.eval_frequency,
+                         log_dir=log_dir)
 
 def get_builtin_lp_eval_dataloader_class(config):
     """ Return a builtin link prediction evaluation dataloader
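
A hedged usage sketch of the factory above; ``_FakeConfig`` is a hypothetical stand-in for a loaded ``GSConfig`` that stubs exactly the attributes ``create_builtin_task_tracker`` reads:

```python
from graphstorm.gsf import create_builtin_task_tracker

class _FakeConfig:  # hypothetical stand-in for a loaded GSConfig
    task_tracker = "tensorboard_task_tracker"
    task_tracker_logpath = "./logs/"
    eval_frequency = 100

tracker = create_builtin_task_tracker(_FakeConfig())
tracker.log_train_metric("loss", 0.42, step=100, force_report=True)
```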
13 changes: 10 additions & 3 deletions python/graphstorm/tracker/__init__.py
@@ -18,8 +18,11 @@
     Builtin training tracker supports:
      - GSSageMakerTaskTracker: GraphStorm SageMaker Task Tracker
 """
 from .graphstorm_tracker import GSTaskTrackerAbc
+from ..config import (GRAPHSTORM_SAGEMAKER_TASK_TRACKER,
+                      GRAPHSTORM_TENSORBOARD_TASK_TRACKER)
 
 from .sagemaker_tracker import GSSageMakerTaskTracker
+from .tensorboard_tracker import GSTensorBoardTracker
 
 def get_task_tracker_class(tracker_name):
     """ Get builtin task tracker
@@ -29,10 +32,14 @@ def get_task_tracker_class(tracker_name):
     tracker_name: str
         task tracker name. 'SageMaker' for GSSageMakerTaskTracker
     """
-    if tracker_name == 'SageMaker':
+    if tracker_name == GRAPHSTORM_SAGEMAKER_TASK_TRACKER:
         # SageMaker tracker also works as normal print tracker
         return GSSageMakerTaskTracker
-    # TODO: Support mlflow, etc.
+    elif tracker_name == GRAPHSTORM_TENSORBOARD_TASK_TRACKER:
+        # Note: TensorBoard support is optional.
+        # To enable GSTensorBoardTracker, one should
+        # install the tensorboard Python package
+        return GSTensorBoardTracker
     else:
         # by default use GSSageMakerTaskTracker
         return GSSageMakerTaskTracker
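
A small usage sketch of the factory above; the constructor arguments mirror the call in ``gsf.create_builtin_task_tracker``:

```python
from graphstorm.tracker import get_task_tracker_class

# Resolve the tracker class by its registered name, then instantiate it.
tracker_class = get_task_tracker_class("tensorboard_task_tracker")
tracker = tracker_class(log_report_frequency=100, log_dir="./logs/")
```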
12 changes: 10 additions & 2 deletions python/graphstorm/tracker/graphstorm_tracker.py
@@ -23,12 +23,20 @@ class GSTaskTrackerAbc():
     Parameters
     ----------
     log_report_frequency: int
-        The frequency of reporting model performance metrics through task_tracker.
+        The frequency of reporting model performance metrics through task_tracker.
         The frequency is defined by using number of iterations, i.e., every N iterations
         the evaluation metrics will be reported.
+    log_dir: str
+        Directory to save the logs. TaskTrackers may store logs on disk for
+        visualization or offline analysis.
+        Default: None
+
+    .. versionchanged:: 0.4.1
+        Added argument ``log_dir``.
     """
-    def __init__(self, log_report_frequency):
+    def __init__(self, log_report_frequency, log_dir=None):
         self._report_frequency = log_report_frequency # Can be None if not provided
+        self._log_dir = log_dir
 
     @abc.abstractmethod
     def log_metric(self, metric_name, metric_value, step, force_report=False):
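
Since the base class now threads ``log_dir`` through its constructor, a custom tracker only needs to accept the same arguments and implement the abstract hooks. A minimal, hypothetical subclass sketch (note the base class may declare further abstract methods beyond ``log_metric`` that a real subclass would also override):

```python
import logging

from graphstorm.tracker import GSTaskTrackerAbc

class PrintOnlyTracker(GSTaskTrackerAbc):
    """Hypothetical tracker that only prints and deliberately ignores log_dir."""

    def log_metric(self, metric_name, metric_value, step, force_report=False):
        # No disk output: report through Python logging only.
        logging.info("Step %d | %s: %s", step, metric_name, metric_value)
```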
194 changes: 194 additions & 0 deletions python/graphstorm/tracker/tensorboard_tracker.py
@@ -0,0 +1,194 @@
"""
    Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

    TensorBoard task tracker
"""
import numbers
import logging
import importlib

from ..utils import get_rank
from .sagemaker_tracker import GSSageMakerTaskTracker

class GSTensorBoardTracker(GSSageMakerTaskTracker):
    """ GraphStorm builtin TensorBoard task tracker.

        GSTensorBoardTracker inherits from GSSageMakerTaskTracker.
        It follows the same logic as GSSageMakerTaskTracker to print logs.
        It uses torch.utils.tensorboard.SummaryWriter to dump training
        losses, validation and test scores into log files.

        Parameters
        ----------
        log_report_frequency: int
            The frequency of reporting model performance metrics through task_tracker.
            The frequency is defined by using number of iterations, i.e., every N iterations
            the evaluation metrics will be reported.
        log_dir: str
            Save directory location. The default setting is
            runs/**CURRENT_DATETIME_HOSTNAME**, which changes after each run.
            Use a hierarchical folder structure to compare between runs easily,
            e.g., pass in 'runs/exp1', 'runs/exp2', etc.
            See https://pytorch.org/docs/stable/tensorboard.html for more details.
            Default: None.

        .. versionadded:: 0.4.1
            The :py:class:`GSTensorBoardTracker`.
    """
    def __init__(self, log_report_frequency, log_dir=None):
        super().__init__(log_report_frequency, log_dir)
        try:
            tensorboard = importlib.import_module("torch.utils.tensorboard")
        except ImportError as err:
            msg = (
                "GSTensorBoardTracker requires tensorboard to run. "
                "Please install the tensorboard Python package.")
            raise ImportError(msg) from err
        self._writer = tensorboard.SummaryWriter(log_dir)

    def log_metric(self, metric_name, metric_value, step, force_report=False):
        """ Log validation or test metric

        Parameters
        ----------
        metric_name: str
            Validation or test metric name
        metric_value:
            Validation or test metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        if force_report or self._do_report(step):
            if metric_value is not None:
                if isinstance(metric_value, str):
                    # Only rank 0 will write log to TensorBoard
                    if get_rank() == 0:
                        self._writer.add_text(metric_name, metric_value, step)
                    logging.info("Step %d | %s: %s", step, metric_name, metric_value)
                elif isinstance(metric_value, numbers.Number):
                    # Only rank 0 will write log to TensorBoard
                    if get_rank() == 0:
                        self._writer.add_scalar(metric_name, metric_value, step)
                    logging.info("Step %d | %s: %.4f", step, metric_name, metric_value)
                else:
                    # Only rank 0 will write log to TensorBoard
                    if get_rank() == 0:
                        self._writer.add_text(metric_name, str(metric_value), step)
                    logging.info("Step %d | %s: %s", step, metric_name, str(metric_value))

    def log_train_metric(self, metric_name, metric_value, step, force_report=False):
        """ Log train metric

        Parameters
        ----------
        metric_name: str
            Train metric name
        metric_value:
            Train metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Train"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_best_test(self, metric_name, metric_value, step, force_report=False):
        """ Log best test score

        Parameters
        ----------
        metric_name: str
            Test metric name
        metric_value:
            Test metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Best Test"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_test_metric(self, metric_name, metric_value, step, force_report=False):
        """ Log test metric

        Parameters
        ----------
        metric_name: str
            Test metric name
        metric_value:
            Test metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Test"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_best_valid(self, metric_name, metric_value, step, force_report=False):
        """ Log best validation score

        Parameters
        ----------
        metric_name: str
            Validation metric name
        metric_value:
            Validation metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Best Validation"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_valid_metric(self, metric_name, metric_value, step, force_report=False):
        """ Log validation metric

        Parameters
        ----------
        metric_name: str
            Validation metric name
        metric_value: float
            Validation metric value
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Validation"
        self.log_metric(metric_name, metric_value, step, force_report)

    def log_best_iter(self, metric_name, best_iter, step, force_report=False):
        """ Log best iteration

        Parameters
        ----------
        metric_name: str
            Metric name
        best_iter:
            Best iteration number
        step: int
            The corresponding step/iteration in the training loop.
        force_report: bool
            If true, report the metric
        """
        metric_name = f"{metric_name}/Best Iteration"
        self.log_metric(metric_name, best_iter, step, force_report)
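
End to end, the new tracker can also be driven directly. A hedged sketch assuming a single-process run where GraphStorm's ``get_rank()`` resolves to 0; the resulting event files can then be inspected with ``tensorboard --logdir ./logs/``:

```python
from graphstorm.tracker import GSTensorBoardTracker

tracker = GSTensorBoardTracker(log_report_frequency=100, log_dir="./logs/")
tracker.log_train_metric("loss", 0.42, step=100, force_report=True)
tracker.log_valid_metric("accuracy", 0.91, step=100, force_report=True)
# Tags land as "loss/Train" and "accuracy/Validation" in TensorBoard.
```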
