Add Chakra callback #12514
Signed-off-by: Taekyung Heo <[email protected]>
Looks reasonable
from nemo.utils.get_rank import get_rank


class ChakraCallback(Callback):
Rename to PytorchProfilerCallback rather than some code name.
self.trace_observer = torch.profiler.ExecutionTraceObserver()

def trace_handler(prof):
Inner function cannot be serialized, is it necessary? Maybe move it out.
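To illustrate the concern: Python's pickle stores functions by qualified name, so a function defined inside another function cannot be pickled, while a module-level function (optionally bound to its arguments with functools.partial) can be looked up by name. A minimal sketch, assuming stand-in handlers; export_trace, make_inner_handler, and is_picklable are hypothetical names, not code from this PR:

```python
import functools
import pickle


def export_trace(trace_dir, step):
    """Module-level handler: pickle can refer to it by its qualified name."""
    return f"{trace_dir}/step_{step}.json"


def make_inner_handler(trace_dir):
    """Build a handler as a closure, mirroring the inner-function pattern."""

    def trace_handler(step):
        # Local function: its qualified name contains "<locals>",
        # so pickle has no importable name to store for it.
        return f"{trace_dir}/step_{step}.json"

    return trace_handler


def is_picklable(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Both handlers behave the same when called; only the closure resists pickling.
inner = make_inner_handler("traces")
outer = functools.partial(export_trace, "traces")
```

Moving the handler out to module level (or making it a method) keeps the behavior and removes the closure, which is the change the comment suggests.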
if end_step < start_step:
    raise ValueError("end_step must be greater than or equal to start_step.")

if not trace_dir or not os.path.isdir(trace_dir):
If trace_dir is required, should it have no default?
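One way to read this suggestion, as a minimal sketch: make trace_dir a required parameter so that omitting it fails at call time, and keep the directory check for invalid values. The function name validate_trace_args and the second error message are assumptions; the diff excerpt does not show what is raised when the directory check fails.

```python
import os


def validate_trace_args(start_step: int, end_step: int, trace_dir: str) -> str:
    """Validate profiling arguments; trace_dir deliberately has no default."""
    if end_step < start_step:
        raise ValueError("end_step must be greater than or equal to start_step.")
    if not trace_dir or not os.path.isdir(trace_dir):
        # Assumed message: the original diff is cut off before the raise.
        raise ValueError(f"trace_dir must be an existing directory, got {trace_dir!r}.")
    return os.path.abspath(trace_dir)
```

With no default, a caller who forgets trace_dir gets a TypeError from Python itself instead of reaching the directory check with None.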
Adding a short test would be good as well, to ensure that all the files expected to be created actually exist. Fine with a separate PR for that.
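Such a test could look roughly like the sketch below. The file names are hypothetical, since this page does not show the callback's naming scheme, and the training run is replaced by writing the files directly so that the sketch runs on its own.

```python
import os
import tempfile


def missing_files(trace_dir, expected_names):
    """Return the expected file names that are absent from trace_dir."""
    return [
        name
        for name in expected_names
        if not os.path.isfile(os.path.join(trace_dir, name))
    ]


def test_expected_trace_files_exist():
    # Hypothetical names: the real naming scheme is not shown in this PR page.
    expected = ["rank-0.json", os.path.join("device", "rank-0.json")]
    with tempfile.TemporaryDirectory() as trace_dir:
        # Stand-in for a short training run with the callback attached.
        os.makedirs(os.path.join(trace_dir, "device"))
        for name in expected:
            with open(os.path.join(trace_dir, name), "w") as f:
                f.write("{}")
        assert missing_files(trace_dir, expected) == []
```

In a real test, the loop that writes the files would be replaced by a few training steps with the callback enabled, and expected would list whatever files the callback is documented to produce.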
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
What does this PR do?
This PR introduces the ChakraCallback, a new PyTorch Lightning callback that collects Chakra host and device traces during training.
Collection: nemo.lightning.pytorch.callbacks
Changelog
- Add ChakraCallback to nemo.lightning.pytorch.callbacks.
- Update __init__.py to include ChakraCallback.
Usage
Example of how to use the ChakraCallback:
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".