-
Notifications
You must be signed in to change notification settings - Fork 253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BE] add integration test for the generation script #741
Conversation
[ghstack-poisoned]
ghstack-source-id: 28b322d8ce59ba501e948e999d198d32d6919c56 Pull Request resolved: #741
as titled [ghstack-poisoned]
ghstack-source-id: f037c5ad8bee21bf5e5d09a0a93ef780c967a8f2 Pull Request resolved: #741
scripts/generate/test_generate.py
Outdated
if seed is not None: | ||
torch.manual_seed(seed) | ||
# PYTHONHASHSEED can be a decimal number in the range [0, 2**32 - 1] | ||
os.environ["PYTHONHASHSEED"] = str(seed % 2**32) | ||
torch.use_deterministic_algorithms(True) | ||
torch.backends.cudnn.deterministic = True | ||
torch.backends.cudnn.benchmark = False | ||
# env var for deterministic CuBLAS | ||
# https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html | ||
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the set_determinism()
removed from the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call of set_determinism()
was added before @wconstab's PR to make RNG right, which broke this line. Given how seed
is explicitly used here, I'm not sure if the code over there would indicate (would it still be correct?). Need @XilunWu's help on understanding more.
This PR tries to restore the behavior before the "BC-breaking change" and guard on its running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the set_determinism change for correct RNG affecting this code? I would hope that calling `set_determinism(seed, deterministic=True) would be equivalent to some manual stuff done here. What is the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussed offline; refactored to use set_determinism
as titled [ghstack-poisoned]
ghstack-source-id: 431ed69d0babd8b16f1488b6bb8f90aea4c6f983 Pull Request resolved: #741
scripts/generate/test_generate.py
Outdated
@@ -147,6 +141,8 @@ def test_generate( | |||
# sequences would require https://github.com/pytorch/torchtitan/pull/686 | |||
apply_tp_minus_sp(model, world_mesh["tp"]) | |||
|
|||
utils.set_determinism(world_mesh, device, seed, deterministic=(seed is not None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"seed is not None" is equiavlent to the old logic, but i still think its better to separate the concepts.
it seems mandatory to use a seed for the token sampler so that each rank uses the same token. However, it should be optional to force kernels to run in the slower 'deterministic' mode. So it should be possible to run with a seed but not assume deterministic=True
"--optimizer.early_step_in_backward", | ||
], | ||
[ | ||
# placeholder for the generation script's generate step |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this part WIP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I wrote it this way because for generation, the first step is to call run_llama_train.sh
and the second step is to call the generating script
https://github.com/pytorch/torchtitan/pull/741/files#diff-3b751e36d12b5fa68ae66727b4d8c6ef2cce12d4f2444b46fd882942c9f4a87fR441-R447
To make it less hacky I think we need some surgery to the file & classes.
as titled [ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
as titled [ghstack-poisoned]
encounter ghstack issues, reopening in #749 |
Stack from ghstack (oldest at bottom):
as titled