Only rank0 log metrics to console #453

Merged
merged 5 commits into from
Mar 12, 2024

Conversation

hxdtest
Contributor

@hxdtest hxdtest commented Feb 15, 2024

I use python -m torch.distributed.run xxx to launch the training processes. If reduce_global_loss is True, only rank0 reduces the global loss and the other ranks don't, so the metrics that the different ranks log to the console are confusing:

train/CrossEntropyLoss=0.0370
train/Perplexity=1.038
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0380
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0378
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=0.0383
train/Perplexity=1.039
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288
2024-02-12 15:14:50,866 - olmo.train - INFO - [step=43362/739328]
train/CrossEntropyLoss=2.421
train/Perplexity=11.25
throughput/total_tokens=181,873,410,048
throughput/device/tokens_per_second=14,991
throughput/device/batches_per_second=0.2288

Only rank0 should log metrics to console.

@epwalsh
Member

epwalsh commented Mar 8, 2024

You can set the environment variable LOG_FILTER_TYPE=rank0_only.
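
For illustration, a filter of this kind can be sketched with Python's standard logging module. This is a minimal sketch, not the project's actual implementation: the names Rank0OnlyFilter and install_log_filter are hypothetical, and it assumes the launcher exports a RANK environment variable (torch.distributed.run does).

```python
import logging
import os


class Rank0OnlyFilter(logging.Filter):
    """Hypothetical filter: let records through only on the rank-0 process."""

    def __init__(self, rank: int):
        super().__init__()
        self.rank = rank

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record on every rank other than 0.
        return self.rank == 0


def install_log_filter(handler: logging.Handler, env=os.environ) -> logging.Handler:
    """Attach the filter when LOG_FILTER_TYPE=rank0_only is set (assumed semantics)."""
    filter_type = env.get("LOG_FILTER_TYPE", "")
    rank = int(env.get("RANK", "0"))  # RANK is set by torch.distributed.run
    if filter_type == "rank0_only":
        handler.addFilter(Rank0OnlyFilter(rank))
    return handler
```

With this in place, launching with LOG_FILTER_TYPE=rank0_only silences the console handler on every rank except 0, while leaving the default behavior unchanged when the variable is unset.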

@dirkgr
Member

dirkgr commented Mar 9, 2024

That's true, but it's still not good that we're logging garbage values to the console.

@epwalsh
Member

epwalsh commented Mar 11, 2024

I would suggest we still log something, which is useful for debugging. We can omit all of the metrics that don't make sense.
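
One way to read this suggestion is that every rank keeps logging, but non-zero ranks drop the metrics that are only meaningful on rank0 after the reduce. The following is a minimal sketch under that reading, not the merged change: the function name console_metrics is hypothetical, and treating the train/ prefix as "the globally reduced metrics" is an assumption based on the log output above.

```python
def console_metrics(metrics: dict, rank: int, reduced_prefixes=("train/",)) -> dict:
    """Return the subset of metrics worth printing on this rank.

    Assumption: keys starting with one of `reduced_prefixes` hold values that
    are only valid on rank 0 (the destination of the reduce).
    """
    if rank == 0:
        return dict(metrics)
    prefixes = tuple(reduced_prefixes)
    # Non-zero ranks keep per-device metrics (e.g. throughput) for debugging.
    return {k: v for k, v in metrics.items() if not k.startswith(prefixes)}


# Example with values taken from the log output above.
step_metrics = {
    "train/CrossEntropyLoss": 0.0370,
    "train/Perplexity": 1.038,
    "throughput/device/tokens_per_second": 14991,
}
print(console_metrics(step_metrics, rank=1))
```

Here rank0 still prints everything, while the other ranks print only their throughput numbers, which remain useful for spotting a slow or stalled device.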

Member

@epwalsh epwalsh left a comment

LGTM

@dirkgr dirkgr merged commit ed47c29 into allenai:main Mar 12, 2024
11 checks passed
3 participants