
[Question] LLaVA pretraining was interrupted; resuming pretraining with the default model loading raises an error #1785

Open
cqray1990 opened this issue Nov 27, 2024 · 0 comments

cqray1990 commented Nov 27, 2024

Question

By default, resuming pretraining loads the checkpoint at deepspeed_checkpoint_dirs = sorted(glob.glob(f"{checkpoint_path}/global_step*")), but the checkpoints saved during pretraining do not contain any such global_step* directory at all.

    def deepspeed_load_checkpoint(deepspeed_engine, checkpoint_path, load_module_strict=True):
        # it's possible that the user is trying to resume from model_path, which doesn't necessarily
        # contain a deepspeed checkpoint. e.g. examples just check if the dir exists and assume it's
        # a resume from a checkpoint and not just a local pretrained weight. So we check here if the
        # path contains what looks like a deepspeed checkpoint
        import glob

        deepspeed_checkpoint_dirs = sorted(glob.glob(f"{checkpoint_path}/global_step*"))

        if len(deepspeed_checkpoint_dirs) > 0:
            logger.info(f"Attempting to resume from {checkpoint_path}")
            # this magically updates self.optimizer and self.lr_scheduler
            load_path, _ = deepspeed_engine.load_checkpoint(
                checkpoint_path,
                load_module_strict=load_module_strict,
                load_optimizer_states=True,
                load_lr_scheduler_states=True,
            )
            if load_path is None:
                raise ValueError(f"[deepspeed] failed to resume from checkpoint {checkpoint_path}")
        else:
            raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")