is_local_main_process -> is_main_process in finetune.py #975
base: main
@@ -956,7 +956,7 @@ def main(args: FlatArguments, tc: TokenizerConfig):
         os.path.join(get_last_checkpoint_path(args, incomplete=True), "COMPLETED"), "w"
     ) as f:
         f.write("COMPLETED")  # annoyingly, empty files arent uploaded by beaker.
-    if accelerator.is_local_main_process:  # TODO: in mason local model this is gonna error out if using something like output/test; because mason used the same shared file ssytem.
+    if accelerator.is_main_process:  # TODO: in mason local model this is gonna error out if using something like output/test; because mason used the same shared file ssytem.
Review comment:
While this change to [...] To fully resolve the race conditions during checkpointing, this file-writing logic should also be restricted to the main process. Additionally, the [...] I recommend the following structure:

    if accelerator.is_main_process:
        # use this to mark the checkpoint as completely saved, to avoid restoring from garbled checkpoints
        with open(
            os.path.join(get_last_checkpoint_path(args, incomplete=True), "COMPLETED"), "w"
        ) as f:
            f.write("COMPLETED")  # annoyingly, empty files arent uploaded by beaker.
        clean_last_n_checkpoints(args.output_dir, args.keep_last_n_checkpoints)

This ensures both operations are performed atomically by a single process.
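For readers following the rename, here is a minimal, stdlib-only sketch of why the two guards differ on a multi-node job. It does not use Accelerate itself. The `Proc` class and `launch` helper are hypothetical names invented for illustration, and they only mirror the documented meaning of the two flags: `is_main_process` holds for the single process with global index 0, while `is_local_main_process` holds for the process with local index 0 on every node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """Hypothetical model of one training process (not an Accelerate class)."""

    process_index: int        # global rank across all nodes
    local_process_index: int  # rank within the process's own node

    @property
    def is_main_process(self) -> bool:
        # True for exactly one process in the whole job.
        return self.process_index == 0

    @property
    def is_local_main_process(self) -> bool:
        # True for one process on each node.
        return self.local_process_index == 0


def launch(num_nodes: int, procs_per_node: int) -> list[Proc]:
    """Enumerate the processes of an assumed multi-node job."""
    return [
        Proc(process_index=node * procs_per_node + local, local_process_index=local)
        for node in range(num_nodes)
        for local in range(procs_per_node)
    ]


procs = launch(num_nodes=2, procs_per_node=4)
local_mains = [p.process_index for p in procs if p.is_local_main_process]
global_mains = [p.process_index for p in procs if p.is_main_process]
print(local_mains)   # [0, 4]
print(global_mains)  # [0]
```

With two nodes writing to the same shared output directory, the old guard would let ranks 0 and 4 both run the checkpoint cleanup, which is presumably the race this PR is addressing.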
         clean_last_n_checkpoints(args.output_dir, args.keep_last_n_checkpoints)
     accelerator.wait_for_everyone()

@@ -971,7 +971,7 @@ def main(args: FlatArguments, tc: TokenizerConfig):
     # use this to mark the checkpoint as completely saved, to avoid restoring from garbled checkpoints
     with open(os.path.join(get_last_checkpoint_path(args, incomplete=True), "COMPLETED"), "w") as f:
Review comment:
I agree with gemini-bot here; we should probably put this COMPLETED write under the main process too.
         f.write("COMPLETED")  # annoyingly, empty files arent uploaded by beaker.
-    if accelerator.is_local_main_process:
+    if accelerator.is_main_process:
Review comment:
Similar to my other comment, there's a race condition here. The code that writes the [...] To prevent potential file system issues, this logic should be moved inside the [...] Here is the recommended change:

    if accelerator.is_main_process:
        # use this to mark the checkpoint as completely saved, to avoid restoring from garbled checkpoints
        with open(os.path.join(get_last_checkpoint_path(args, incomplete=True), "COMPLETED"), "w") as f:
            f.write("COMPLETED")  # annoyingly, empty files arent uploaded by beaker.
        clean_last_n_checkpoints(args.output_dir, args.keep_last_n_checkpoints)
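The pattern recommended above (one process writes the marker and prunes old checkpoints, then every process synchronizes) can be tried out without a distributed launcher. The sketch below is an illustration under stated assumptions, not the code in `finetune.py`: `clean_old_checkpoints` is a hypothetical stand-in for `clean_last_n_checkpoints` (whose real implementation is not shown in this diff), the `step_*` directory naming is assumed, and the `barrier` callable stands in for `accelerator.wait_for_everyone()`.

```python
import os
import shutil
import tempfile


def clean_old_checkpoints(output_dir: str, keep_last_n: int) -> None:
    """Hypothetical stand-in: keep only the newest `keep_last_n` step_* directories."""
    steps = sorted(
        (d for d in os.listdir(output_dir) if d.startswith("step_")),
        key=lambda d: int(d.split("_")[1]),
    )
    # steps[:-0] would be empty, so handle keep_last_n == 0 explicitly.
    to_remove = steps if keep_last_n == 0 else steps[:-keep_last_n]
    for d in to_remove:
        shutil.rmtree(os.path.join(output_dir, d))


def finalize_checkpoint(checkpoint_dir, output_dir, keep_last_n, is_main_process, barrier):
    """Write the COMPLETED marker and prune on one process only, then synchronize."""
    if is_main_process:
        with open(os.path.join(checkpoint_dir, "COMPLETED"), "w") as f:
            f.write("COMPLETED")
        clean_old_checkpoints(output_dir, keep_last_n)
    barrier()  # every process waits here, main or not


# Simulate four ranks sharing one output directory; the barrier is a no-op
# because the ranks run one after another in this single-process demo.
with tempfile.TemporaryDirectory() as output_dir:
    for step in (100, 200, 300):
        os.makedirs(os.path.join(output_dir, f"step_{step}"))
    latest = os.path.join(output_dir, "step_300")
    for rank in range(4):
        finalize_checkpoint(
            latest, output_dir, keep_last_n=2,
            is_main_process=(rank == 0), barrier=lambda: None,
        )
    remaining = sorted(os.listdir(output_dir))
    marker_written = os.path.exists(os.path.join(latest, "COMPLETED"))

print(remaining)       # ['step_200', 'step_300']
print(marker_written)  # True
```

Keeping the barrier outside the `if` matters: if only the main process reached it, the other ranks would never meet it and a real collective barrier would hang.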
         clean_last_n_checkpoints(args.output_dir, args.keep_last_n_checkpoints)
     accelerator.wait_for_everyone()

@@ -981,7 +981,7 @@ def main(args: FlatArguments, tc: TokenizerConfig):
 )

 # remove all checkpoints to save space
-if args.clean_checkpoints_at_end and accelerator.is_local_main_process:
+if args.clean_checkpoints_at_end and accelerator.is_main_process:
     clean_last_n_checkpoints(args.output_dir, keep_last_n_checkpoints=0)

 if (