-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BE] dump compile trace to CI output for debugging, and reduce uploaded folder size in CI #739
Conversation
[ghstack-poisoned]
ghstack-source-id: b8ca3dc898a688e89edfc64d0417ebe456d0f394 Pull Request resolved: #739
[ghstack-poisoned]
test_runner.py
Outdated
# TODO: temporarily disabling to let CI be able to run other tests | ||
# OverrideDefinitions( | ||
# [ | ||
# [ | ||
# "--training.compile", | ||
# "--activation_checkpoint.mode selective", | ||
# "--activation_checkpoint.selective_ac_option op", | ||
# ], | ||
# ], | ||
# "1D compile with selective op AC", | ||
# "1d_compile_sac_op", | ||
# ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll revert this temporary change before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pleaser revert the comment before landing.
@tianyu-l just curious: the time per PR on CI is significantly impacted by 'upload_artifacts' step even before your change. I was wondering if at some point this is a bottleneck and we need to be more picky about which files we upload. On that note, do you know how large the compile-trace artifacts are and whether it is making a difference to PR CI time? |
as titled, so that we can download the compile-related code after seeing a CI failure [ghstack-poisoned]
…duce uploaded folder size in CI" as titled, so that we can download the compile-related code after seeing a CI failure To answer wconstab's question, I downloaded the artifact with & without compile trace, and here’s my findings: 1. The compile trace is comparable to profiler trace, about <10MB for a job. And there are only two tests with torch.compile. -> so it should be fine to enable uploading compile traces. 2. The biggest overhead is from checkpoint, each job about 300MB or more. I forgot to remove `enable_checkpoint` flag for PP jobs. -> I removed it for PP jobs, which saves a lot. 3. For checkpointing-related tests, we still need `enable_checkpoint`, but I suggest we modify the default interval from 5-steps to 10-steps -> so that we halve the overhead in those tests. 4. I think we can by default disable TB logging and profiler trace logging -> only enable them on 1 job should be fine. results: 1. 4 GPU running time reduced from ~17 minutes to 15m 18s. 2. uploaded folder size dropped from 2.7 GB to 309 MB. [ghstack-poisoned]
…kload ghstack-source-id: fe1076b088646dd1d3f3da4bb646c17bd3fa555d Pull Request resolved: #739
Stack from ghstack (oldest at bottom):
as titled, so that we can download the compile-related code after seeing a CI failure
To answer @wconstab's question, I downloaded the artifact with & without compile trace, and here’s my findings:
enable_checkpoint
flag for PP jobs. -> I removed it for PP jobs, which saves a lot.enable_checkpoint
, but I suggest we modify the default interval from 5-steps to 10-steps -> so that we halve the overhead in those tests.results: