[TRACKING] feat: Integrate SwanLab for experiment tracking with online/offline mode and local dashboard support #218

ShaohonChen · 2025-02-07T04:46:00Z

Pull Request Description

This PR introduces SwanLab, a lightweight open-source experiment tracking tool, as a new logging option for the training framework. The integration provides both online and offline tracking capabilities, along with a local dashboard for visualizing results. Below is a detailed overview of the changes and usage instructions:

Key Features of SwanLab Integration

- Online and Offline Tracking:
  - Online Mode: Track experiments remotely and store data on SwanLab's cloud platform.
  - Offline Mode: Use a local dashboard to visualize training logs without an internet connection.
- Hardware Monitoring:
  - Automatically tracks GPU usage, power consumption, temperature, and other hardware metrics.
  - Supports NVIDIA GPUs and Huawei Ascend NPUs.
- Remote Access:
  - View training progress remotely via the SwanLab web interface or mobile app.
- Local Dashboard:
  - Includes an open-source local dashboard for offline visualization of training logs.

Usage Instructions

Step 1: Set Up Online Tracking (Optional)

To use SwanLab's online tracking, log in to the SwanLab website and obtain your API key from the Settings page. Then, authenticate using the following command:

swanlab login

If you prefer offline mode, skip this step.

Step 2: Configure SwanLab as the Logger

To enable SwanLab as the experiment tracker, include 'swanlab' in the trainer.logger list of your training command, for example trainer.logger=['console','swanlab']. The following command uses the "Post-train a LLM using PPO with GSM8K dataset" workflow:

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.val_batch_size=1312 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['console','swanlab'] \
 +trainer.val_before_train=False \
 trainer.default_hdfs_dir=null \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log
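
Conceptually, listing several backends in trainer.logger means that each metrics dictionary is sent to every configured backend. The sketch below illustrates only that idea. `MetricFanout`, `ConsoleBackend`, and `RecordingBackend` are hypothetical names, the metric name is illustrative, and this is not the tracking code added by this PR; a real SwanLab backend would forward to the SwanLab client instead of recording in memory.

```python
from typing import Any, Dict, List, Optional, Tuple


class ConsoleBackend:
    """Hypothetical backend that prints metrics to stdout."""

    def log(self, data: Dict[str, Any], step: int) -> None:
        print(f"step {step}: {data}")

    def finish(self) -> None:
        # Nothing to flush for console output.
        pass


class RecordingBackend:
    """Hypothetical in-memory stand-in for a tracker such as SwanLab."""

    def __init__(self) -> None:
        self.records: List[Tuple[int, Dict[str, Any]]] = []
        self.finished = False

    def log(self, data: Dict[str, Any], step: int) -> None:
        self.records.append((step, dict(data)))

    def finish(self) -> None:
        self.finished = True


class MetricFanout:
    """Sends each metrics dict to every enabled backend."""

    def __init__(self, backends: Dict[str, Any]) -> None:
        self.backends = backends

    def log(self, data: Dict[str, Any], step: int,
            only: Optional[List[str]] = None) -> None:
        # `only` restricts logging to a subset of backend names.
        for name, backend in self.backends.items():
            if only is None or name in only:
                backend.log(data, step)

    def close(self) -> None:
        # Give every backend a chance to flush and close its run.
        for backend in self.backends.values():
            backend.finish()


tracker_stub = RecordingBackend()
fanout = MetricFanout({"console": ConsoleBackend(), "swanlab": tracker_stub})
fanout.log({"reward_mean": 0.42}, step=1)                    # goes to both backends
fanout.log({"reward_mean": 0.47}, step=2, only=["swanlab"])  # skips the console
fanout.close()
```

Calling an explicit `close()` at the end of training mirrors the intent of closing the SwanLab run when the tracker goes away, without relying on when the object is garbage-collected.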

If you are not logged in when training starts, SwanLab prompts you to choose a tracking mode:

- Cloud Mode: Upload logs to SwanLab's cloud platform.
- Cloud-Only Mode: Upload logs to the cloud but do not save them locally.
- Local Mode: Save logs locally for offline tracking.

Alternatively, you can configure SwanLab using environment variables:

export SWANLAB_API_KEY=<your_api_key>          # Set API key for online tracking
export SWANLAB_LOG_DIR=<local_log_path>        # Set local log directory
export SWANLAB_MODE=<mode>                    # Set tracking mode: cloud (default), cloud-only, local, or disabled
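
A launcher script can read and validate these settings before the run starts, so a mistyped mode fails early. This is a minimal sketch that assumes the variable names above; the helper name `read_swanlab_settings` and the fallback directory `swanlog` are assumptions for illustration and are not confirmed by this PR.

```python
import os
from typing import Dict, Mapping, Optional

# Tracking modes listed in the PR description.
VALID_MODES = ("cloud", "cloud-only", "local", "disabled")


def read_swanlab_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Collect SwanLab settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    mode = env.get("SWANLAB_MODE", "cloud")  # "cloud" is the stated default
    if mode not in VALID_MODES:
        raise ValueError(
            f"SWANLAB_MODE must be one of {VALID_MODES}, got {mode!r}"
        )
    return {
        "api_key": env.get("SWANLAB_API_KEY"),             # None when not set
        "log_dir": env.get("SWANLAB_LOG_DIR", "swanlog"),  # assumed fallback
        "mode": mode,
    }


# Example: an offline run that writes logs to a custom directory.
settings = read_swanlab_settings(
    {"SWANLAB_MODE": "local", "SWANLAB_LOG_DIR": "/tmp/swanlog"}
)
print(settings)
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without modifying the real environment.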

Step 3: View Training Logs

After logging in, you will see a confirmation message. How you view the logs then depends on the tracking mode:

Online Tracking: View logs on the SwanLab website. For more details, refer to the SwanLab Cloud Documentation.

Offline Tracking: Use the local dashboard to visualize logs:
```
swanlab watch
```
For advanced configurations, such as setting a custom port, refer to the Offline Dashboard Documentation and CLI Documentation.

Impact

- Provides a lightweight, open-source experiment tracking option alongside the existing loggers.
- Supports both online and offline use cases, making it suitable for environments with restricted internet access.
- Adds hardware monitoring (GPU/NPU usage, power consumption, temperature) to help identify under-utilized resources.

This PR is ready for review. Feedback and suggestions are welcome!

ShaohonChen · 2025-02-07T14:38:16Z

#198 SwanLab supports monitoring for Ascend hardware.

ShaohonChen added 5 commits February 7, 2025 00:31

add SwanLab integration for experiment tracking

2c6dbc0

call swanlab.finish when the tracker is destructed if swanlab is used

9d39079

add local logging path and local mode options for SwanLab

149f004

format code using scripts/format.sh

7190924

add SwanLab description to README

5c3e64a

ShaohonChen changed the title ~~[FEAT]: Integrate SwanLab for experiment tracking with online/offline mode and local dashboard support~~ [TRACKING] feat: Integrate SwanLab for experiment tracking with online/offline mode and local dashboard support Feb 7, 2025

vermouth1992 approved these changes Feb 7, 2025


vermouth1992 merged commit 958a326 into volcengine:main Feb 7, 2025
11 checks passed
