
[BE] dump compile trace to CI output for debugging, and reduce uploaded folder size in CI #739

Merged (4 commits) on Dec 17, 2024

Conversation

@tianyu-l (Contributor) commented Dec 16, 2024

Stack from ghstack (oldest at bottom):

As titled, so that we can download the compile-related code after seeing a CI failure.

To answer @wconstab's question, I downloaded the artifact with and without the compile trace; here are my findings:

  1. The compile trace is comparable in size to the profiler trace, under 10 MB per job, and only two tests use torch.compile, so it should be fine to enable uploading compile traces.
  2. The biggest overhead comes from checkpointing, about 300 MB or more per job. I had forgotten to remove the `enable_checkpoint` flag for PP jobs; removing it saves a lot.
  3. For checkpointing-related tests we still need `enable_checkpoint`, but I suggest changing the default interval from 5 steps to 10 steps, which halves the checkpoint overhead in those tests.
  4. We can disable TB logging and profiler trace logging by default; enabling them on one job should be fine (see the sketch after this list).
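
A minimal sketch of how points 2-4 could look as a test override in test_runner.py, mirroring the OverrideDefinitions entries used there; the flag names (--checkpoint.interval, --metrics.enable_tensorboard, --profiling.enable_profiling) follow the repo's --<section>.<option> convention but are assumptions for illustration, not the PR's actual diff:

# Hypothetical test_runner.py entry; flag names are assumptions that
# follow the existing "--<section>.<option>" override convention.
OverrideDefinitions(
    [
        [
            "--checkpoint.enable_checkpoint",
            "--checkpoint.interval 10",  # 10-step interval instead of 5
            "--metrics.enable_tensorboard",  # TB/profiler logging stays off
            "--profiling.enable_profiling",  # by default; only this job opts in
        ],
    ],
    "Checkpoint test with halved checkpoint frequency",
    "full_checkpoint",
),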

Results:

  1. 4-GPU running time reduced from ~17 minutes to 15m 18s.
  2. Uploaded folder size dropped from 2.7 GB to 309 MB.
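
For context on where the compile trace comes from: torch.compile writes its structured trace into whatever directory the TORCH_TRACE environment variable points at (viewable afterwards with tlparse). A minimal sketch of wiring that into a CI job, assuming the artifacts step uploads everything under ./outputs; the trace path and launch script are illustrative, not the PR's diff:

import os
import subprocess

# Point torch.compile's structured trace at the artifact folder so the CI
# upload step picks it up. TORCH_TRACE is a standard PyTorch env var; the
# trace path and launch script below are assumptions for illustration.
env = os.environ.copy()
env["TORCH_TRACE"] = "./outputs/compile_trace"
subprocess.run(["./run_llama_train.sh"], env=env, check=True)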

tianyu-l added a commit that referenced this pull request Dec 16, 2024
ghstack-source-id: b8ca3dc898a688e89edfc64d0417ebe456d0f394
Pull Request resolved: #739
@facebook-github-bot added the CLA Signed label on Dec 16, 2024
test_runner.py (Outdated), comment on lines 64 to 75:
# TODO: temporarily disabling to let CI be able to run other tests
# OverrideDefinitions(
# [
# [
# "--training.compile",
# "--activation_checkpoint.mode selective",
# "--activation_checkpoint.selective_ac_option op",
# ],
# ],
# "1D compile with selective op AC",
# "1d_compile_sac_op",
# ),
@tianyu-l (Author):

I'll revert this temporary change before merging
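
For reference, reverting the temporary change just restores the same entry uncommented:

OverrideDefinitions(
    [
        [
            "--training.compile",
            "--activation_checkpoint.mode selective",
            "--activation_checkpoint.selective_ac_option op",
        ],
    ],
    "1D compile with selective op AC",
    "1d_compile_sac_op",
),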

@tianyu-l tianyu-l changed the title dump compile trace to CI output for debugging [BE] dump compile trace to CI output for debugging Dec 16, 2024
@fegin (Contributor) left a comment:

LGTM, please revert the comment before landing.

@wconstab (Contributor) commented:

@tianyu-l just curious: the time per PR on CI is significantly impacted by the 'upload_artifacts' step even before your change. I was wondering if at some point this becomes a bottleneck and we need to be more picky about which files we upload. On that note, do you know how large the compile-trace artifacts are and whether they make a difference to PR CI time?

@tianyu-l tianyu-l changed the title [BE] dump compile trace to CI output for debugging [BE] dump compile trace to CI output for debugging, and reduce uploaded folder size in CI Dec 17, 2024
@tianyu-l tianyu-l merged commit 3fb2758 into gh/tianyu-l/28/base Dec 17, 2024
4 of 5 checks passed
tianyu-l added a commit that referenced this pull request Dec 17, 2024
…kload

ghstack-source-id: fe1076b088646dd1d3f3da4bb646c17bd3fa555d
Pull Request resolved: #739
@tianyu-l tianyu-l deleted the gh/tianyu-l/28/head branch December 17, 2024 23:53
@tianyu-l tianyu-l restored the gh/tianyu-l/28/head branch December 18, 2024 00:08
@tianyu-l tianyu-l deleted the gh/tianyu-l/28/head branch December 18, 2024 00:09
Labels: CLA Signed