W&B wandb support #699
Conversation
torchtitan/metrics.py
Outdated
"W&B logging requested but wandb package is not installed. Continuing without W&B logging."
)
enable_wandb = False
elif enable_wandb:
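For context, a minimal sketch of the optional-import guard this hunk appears to be part of; it is reconstructed from the visible lines, so the surrounding names (`has_wandb`, the `elif` body) are assumptions rather than the PR's actual code:

```python
# Sketch only: reconstructed around the visible hunk; not the PR's exact code.
try:
    import wandb

    has_wandb = True
except ImportError:
    has_wandb = False

if enable_wandb and not has_wandb:
    logger.warning(
        "W&B logging requested but wandb package is not installed. "
        "Continuing without W&B logging."
    )
    enable_wandb = False
elif enable_wandb:
    # proceed with W&B setup here (details depend on the PR)
    ...
```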
Just a nit: W&B logs its own init statement, so having one here might be overkill.
yeah good point, w&b is actually quite loud with emojis
I personally use wandb and have integrated it into my code. However, the
I think it's a good idea to add support for wandb as it continues to gain popularity. Would it be possible to make wandb and tensorboard mutually exclusive? It isn't worth logging to two places.
@fegin Sure thing, I can look into that. Just to be clear, what's the gap to merge?
I would suggest making it an optional dependency. While wandb is becoming increasingly popular, most users are still using tensorboard.
Thanks, that would be good.
I think an e2e run command with a wandb screenshot would be good enough.
Name is okay, but please check
Thanks for helping add wandb!
README.md
Outdated
# Llama 3.2 tokenizer.model
python torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Llama-3.2-3B --hf_token=...
This was indeed for Llama 2, not for Llama 3.2.
I think we can remove Llama 2 files if they are not helpful anymore.
I can send a separate PR deprecating Llama 2.
Sounds good. Can we revert this change for now, as torchtitan doesn't support Llama 3.2 atm?
torchtitan/metrics.py
Outdated
wandb_metrics["step"] = step
wandb.log(wandb_metrics)

def log_memory_stats(self, memory_stats: DeviceMemStats, step: int):
this seems to not be called anywhere?
Made the checks mutually exclusive; if both are set, I default to W&B.
Would you please also add a short note on how to use W&B? Similar to
https://github.com/pytorch/torchtitan/blob/main/README.md?plain=1#L63
and
https://github.com/pytorch/torchtitan/blob/main/README.md?plain=1#L103
It might make sense to have a separate .md in docs/ containing both TB and W&B instructions.
torchtitan/config_manager.py
Outdated
# Logging backend validations
if hasattr(self.metrics, "enable_tensorboard") and hasattr(
    self.metrics, "enable_wandb"
):
    if self.metrics.enable_tensorboard and self.metrics.enable_wandb:
        logger.warning(
            "Both TensorBoard and WandB logging were enabled. Using WandB only."
        )
        # Modify the config to disable TensorBoard
        self.metrics.enable_tensorboard = False
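For illustration, the situation this validation guards against would look roughly like the following in a training `.toml`; `enable_wandb` is the key this PR appears to introduce, so the exact name is an assumption:

```toml
# Sketch of a metrics section with both backends enabled; with the validation
# above, TensorBoard would be switched off in favor of W&B.
[metrics]
enable_tensorboard = true
enable_wandb = true  # assumed key name added by this PR
```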
I have a slightly different opinion on this. I think we should let users control whether they want to use either or both. In rare cases, they may find enabling both to be useful (e.g. monitoring on wandb, but still keeping the TB logs).
Alright, that's easy to fix and it's my preference as well, but cc @fegin for visibility.
Hmm, unless they show different information that can be useful, I'm not sure why we need to enable both; for performance this is going to be bad (though probably not noticeable). But I don't have a strong opinion if people prefer to enable both.
torchtitan/metrics.py
Outdated
-return MetricLogger(log_dir, tag, enable_tb)
+return MetricLogger(log_dir, tag, enable_tb, enable_wandb, wandb_config)
From an extensibility perspective, it would be nicer to have this pattern:
if not should_log:
    return DummyLogger()
if enable_wandb:
    return WandbLogger(wandb_config)
if enable_tb:
    return TensorBoardLogger(log_dir, tag)
raise NotImplementedError
With
from typing import Any, Dict

from torch.utils.tensorboard import SummaryWriter


class DummyLogger:
    # No-op logger for ranks that should not log.
    def log(self, *args, **kwargs):
        pass

    def close(self):
        pass


class TensorBoardLogger:
    def __init__(self, log_dir, tag):
        self.tag = tag
        self.writer = SummaryWriter(log_dir, max_queue=1000)

    def log(self, metrics: Dict[str, Any], step: int):
        for k, v in metrics.items():
            tag = k if self.tag is None else f"{self.tag}/{k}"
            self.writer.add_scalar(tag, v, step)

    def close(self):
        self.writer.close()
Since each class has its own separate logic, it's easier for forks to have their custom loggers with reduced conflicts. It also avoids the problem of optional types inside the classes (avoids assertions).
You could then import wandb inside the WandbLogger class to avoid importing it during startup.
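For completeness, a minimal sketch of what such a `WandbLogger` could look like; this is not the PR's actual implementation, just an illustration of the lazy-import idea using the standard `wandb.init`/`wandb.log`/`wandb.finish` API, and it assumes `wandb_config` is a dict of keyword arguments for `wandb.init`:

```python
class WandbLogger:
    def __init__(self, wandb_config):
        # Import here so wandb is only required when this logger is actually used.
        import wandb

        self._wandb = wandb
        # Assumes wandb_config holds kwargs such as project/name.
        self._wandb.init(**wandb_config)

    def log(self, metrics: Dict[str, Any], step: int):
        self._wandb.log(metrics, step=step)

    def close(self):
        self._wandb.finish()
```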
I like the way this code looks, but it goes back to the same issue we're discussing above: whether we should have multiple loggers in case people enable both wandb and tensorboard.
Right now, implicitly enabling both would enable whichever is the topmost condition.
Also, more out of curiosity: are you deliberately not using inheritance here?
My bad, I did not look at the existing comments extensively.
For that case, you (or a fork) could add a logger class that takes a list of loggers and simply iterates over them.
loggers = []
if should_log:
    if enable_wandb:
        loggers.append(WandbLogger(wandb_config))
    if enable_tb:
        loggers.append(TensorBoardLogger(log_dir, tag))
return Loggers(loggers)
This removes the need for DummyLogger; iterating over an empty list does nothing.
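A sketch of such a composite class, assuming all loggers expose the same `log`/`close` interface as above; this is an illustration, not code from the PR:

```python
class Loggers:
    """Fans out every call to a (possibly empty) list of loggers."""

    def __init__(self, loggers):
        self.loggers = loggers

    def log(self, metrics: Dict[str, Any], step: int):
        for logger in self.loggers:
            logger.log(metrics, step)

    def close(self):
        for logger in self.loggers:
            logger.close()
```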
However, this only works well if the loggers agree on exposing the same interface under the hood, but my opinion is that this repo should stay simple and not think too much about this. From my previous experience, users don't enable two loggers at the same time.
I think you've convinced me; two loggers seems clunky. Incorporated your feedback.
The metrics.py structure LGTM now. Thanks!
Looks awesome! Thank you very much!
Had some final inline comments.
Co-authored-by: tianyu-l <[email protected]>
Thanks again for all the feedback. Had a final lint issue that I've fixed now. Would you be OK if I merge this, or would you rather do so?
action="store_true",
help="Whether to log metrics to TensorBoard",
default=True,
Was flipping the enable_color_printing default intended? If so, the action store_true does not fit well anymore. It's now true by default.
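A minimal illustration of the concern, using a hypothetical flag name rather than torchtitan's actual argument: with `action="store_true"` combined with `default=True`, argparse gives the flag no way to ever become `False` from the command line, so a `store_false` action (or an explicit `--no-...` option) would be expected instead.

```python
import argparse

# Sketch: hypothetical flag name, not torchtitan's actual argument.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-color-printing",
    action="store_true",
    default=True,  # combined with store_true, this flag can never be turned off
)
print(parser.parse_args([]))                           # Namespace(enable_color_printing=True)
print(parser.parse_args(["--enable-color-printing"]))  # still True
```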
This PR adds experimental wandb support; I'm not sure this is "landable" considering y'all use tensorboard by default. Personally, I vastly prefer wandb because I can share my training runs with a link and don't need to muck around with ssh tunneling, so I'm just opening this since I'm using it myself. If there's interest, I can work to land this.
To use this, you just kick off a training run as usual with
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
but also run wandb login and paste in your token. Changes in the logs will look like the output below.
Also only slightly related, but the Llama 3 tokenizer is not available on HF anymore, so I added instructions for 3.1 and 3.2.
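As a concrete command sequence (a sketch of the setup described above; `pip install wandb` and `wandb login` are the standard wandb setup steps, and the training command is the one from the README):

```bash
pip install wandb    # only needed if you want W&B logging (optional dependency)
wandb login          # paste your W&B API key when prompted
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
```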
Detailed logs:
[rank0]:2024-11-25 11:33:24,320 - root - INFO - Dumping traces at step 1000
[rank0]:2024-11-25 11:33:24,576 - root - INFO - Finished dumping traces in 0.26 seconds
[rank0]:2024-11-25 11:33:24,577 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:wandb:
[rank0]:wandb:
[rank0]:wandb: Run history:
[rank0]:wandb: loss_metrics/global_avg_loss █▆▅▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: loss_metrics/global_max_loss █▇▄▄▃▃▄▃▃▆▃▃▃▃▃▃▂▂▂▂▃▂▂▃▁▂▂▂▁▃▂▁▂▁▂▂▁▄▁▁
[rank0]:wandb: memory/max_active(%) ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: memory/max_active(GiB) ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: memory/max_reserved(%) ▁███████████████████████████████████████
[rank0]:wandb: memory/max_reserved(GiB) ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: memory/num_alloc_retries ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: memory/num_ooms ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: mfu(%) ▁███████▇██████▇█████████▇█▇████████████
[rank0]:wandb: step ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▆▆▆▆▇▇▇▇▇▇▇████
[rank0]:wandb: time_metrics/data_loading(%) ▁▁▁▁▂▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▂
[rank0]:wandb: time_metrics/data_loading(s) ▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂
[rank0]:wandb: time_metrics/end_to_end(s) ▁▇▇▇▇█▇▇▇█▇▇▇▇▇▇▇▇▇▇▇▇██▇▇▇█▇▇▇▇▇▇█▇█▇▇▇
[rank0]:wandb: wps ███▁████▄█▇▅████████▅▄████▇███▇▄████▇██▇
[rank0]:wandb:
[rank0]:wandb: Run summary:
[rank0]:wandb: loss_metrics/global_avg_loss 4.53519
[rank0]:wandb: loss_metrics/global_max_loss 4.99517
[rank0]:wandb: memory/max_active(%) 43.33611
[rank0]:wandb: memory/max_active(GiB) 41.17145
[rank0]:wandb: memory/max_reserved(%) 52.19301
[rank0]:wandb: memory/max_reserved(GiB) 49.58594
[rank0]:wandb: memory/num_alloc_retries 0
[rank0]:wandb: memory/num_ooms 0
[rank0]:wandb: mfu(%) 30.75216
[rank0]:wandb: step 1000
[rank0]:wandb: time_metrics/data_loading(%) 1.01461
[rank0]:wandb: time_metrics/data_loading(s) 0.01583
[rank0]:wandb: time_metrics/end_to_end(s) 1.55993
[rank0]:wandb: wps 5251.52034
[rank0]:wandb:
[rank0]:wandb: 🚀 View run skilled-glitter-1 at: https://wandb.ai/sahancpal-meta/torchtitan/runs/r1zqr75b