OOM after training completes #2

Open
uloveqian2021 opened this issue May 7, 2023 · 7 comments

Comments

@uloveqian2021

Hi, I get an OOM error right after the model finishes training. How can I fix this?
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 22.06 GiB total capacity;

@jiangxinyang227
Owner

Which model are you training, and with which method? Does the error occur when saving the model after training finishes?

@uloveqian2021
Author

> Which model are you training, and with which method? Does the error occur when saving the model after training finishes?

ChatGLM, using lora_deepspeed on 4 A10 GPUs (24 GB each). The OOM is raised as soon as training completes. The save-model part of the source code is commented out.

@jiangxinyang227
Owner

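You probably haven't configured the DeepSpeed config; you need to run accelerate config. In addition, DeepSpeed hits a bug at runtime that requires a change to its source code. For the exact steps, see this blog post: https://www.cnblogs.com/jiangxinyang/p/17330352.html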

@uloveqian2021
Author

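> You probably haven't configured the DeepSpeed config; you need to run accelerate config. In addition, DeepSpeed hits a bug at runtime that requires a change to its source code. For the exact steps, see this blog post: https://www.cnblogs.com/jiangxinyang/p/17330352.html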

OK, thanks for the guidance!

@uloveqian2021
Author

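> You probably haven't configured the DeepSpeed config; you need to run accelerate config. In addition, DeepSpeed hits a bug at runtime that requires a change to its source code. For the exact steps, see this blog post: https://www.cnblogs.com/jiangxinyang/p/17330352.html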

I configured the DeepSpeed config as you suggested, but running bash run.sh now raises a new error: Error: Incorrect padding (base64)

/opt/conda/lib/python3.7/site-packages/accelerate/utils/dataclasses.py:511 in __post_init__

  508             or isinstance(self.hf_ds_config, HfDeepSpeedConfig)
  509         ):
  510             if not isinstance(self.hf_ds_config, HfDeepSpeedConfig):
❱ 511                 self.hf_ds_config = HfDeepSpeedConfig(self.hf_ds_config)
  512             if "gradient_accumulation_steps" not in self.hf_ds_config.config:
  513                 self.hf_ds_config.config["gradient_accumulation_steps"] = 1
  514             if "zero_optimization" not in self.hf_ds_config.config:

/opt/conda/lib/python3.7/site-packages/accelerate/utils/deepspeed.py:52 in __init__

   49                 config = json.load(f)
   50         else:
   51             try:
❱  52                 config_decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("u
   53                 config = json.loads(config_decoded)
   54             except (UnicodeDecodeError, AttributeError):
   55                 raise ValueError(

/opt/conda/lib/python3.7/base64.py:133 in urlsafe_b64decode

  130     """
  131     s = _bytes_from_decode_data(s)
  132     s = s.translate(_urlsafe_decode_translation)
❱ 133     return b64decode(s)

/opt/conda/lib/python3.7/base64.py:87 in b64decode

   84         s = s.translate(bytes.maketrans(altchars, b'+/'))
   85     if validate and not re.fullmatch(b'[A-Za-z0-9+/]*={0,2}', s):
   86         raise binascii.Error('Non-base64 digit found')
❱  87     return binascii.a2b_base64(s)
   90 def standard_b64encode(s):
Error: Incorrect padding
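
The accelerate frames above hint at how this error can arise. When the config value is not opened as a file, it is passed to `base64.urlsafe_b64decode`, and the `except` clause on line 54 only catches `UnicodeDecodeError` and `AttributeError`, so a `binascii.Error` escapes. Below is a minimal sketch of that behaviour, not the actual accelerate code. The function name `load_config_like_traceback`, the example path, the `"utf-8"` codec, and the file-existence condition guarding the first branch are all assumptions, since the traceback does not show them in full.

```python
import base64
import binascii
import json
import os


def load_config_like_traceback(config_file_or_dict):
    """Hypothetical stand-in that mirrors the branch visible in the traceback."""
    # Assumption: the condition hidden above line 49 is a file-existence check.
    if os.path.exists(config_file_or_dict):
        with open(config_file_or_dict, encoding="utf-8") as f:
            return json.load(f)
    try:
        # Assumption: the truncated .decode("u... call uses UTF-8.
        decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("utf-8")
        return json.loads(decoded)
    except (UnicodeDecodeError, AttributeError):
        # binascii.Error is not in this tuple, so it propagates to the caller.
        raise ValueError("config is neither a readable file nor base64-encoded JSON")


def reproduce(path="./no_such_dir/ds_config.json"):
    """Return (exception type name, message) if a base64 error escapes, else None."""
    try:
        load_config_like_traceback(path)
    except binascii.Error as exc:
        return type(exc).__name__, str(exc)
    return None
```

If this reading is right, a path string that does not resolve to an existing file would fall through to the base64 branch and fail there, which would match the message reported above.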

@jiangxinyang227
Owner

This looks like an encoding problem in the training data. Try reading the data with UTF-8 encoding.

@uloveqian2021
Author

> This looks like an encoding problem in the training data. Try reading the data with UTF-8 encoding.

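So far it doesn't look like a UTF-8 encoding problem. The error only appears after configuring accelerate config. Without it, training runs normally but hits OOM when saving after training completes. I'm not sure whether this is related to the accelerate version.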
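
One thing that might be worth ruling out, offered as a sketch rather than a confirmed diagnosis: whether the DeepSpeed config path entered during accelerate config resolves to a readable JSON file from the directory where bash run.sh is launched. The helper name `check_ds_config` is hypothetical and uses only the standard library.

```python
import json
import os


def check_ds_config(path):
    """Return (ok, message) saying whether `path` is a loadable JSON object file."""
    # Resolve relative paths against the current working directory.
    resolved = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(resolved):
        return False, f"not a file: {resolved}"
    try:
        with open(resolved, encoding="utf-8") as f:
            config = json.load(f)
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return False, f"cannot parse {resolved}: {exc}"
    if not isinstance(config, dict):
        return False, f"top-level JSON value is not an object: {resolved}"
    return True, f"ok: {resolved} ({len(config)} top-level keys)"
```

Calling `check_ds_config` on the configured path from the launch directory, before starting training, would show whether the file is found at all.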
