OOM after training completes #2

uloveqian2021 · 2023-05-07T10:03:31Z

Hi, I get an OOM error as soon as the model finishes training. How can I fix it?
OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 22.06 GiB total capacity;

jiangxinyang227 · 2023-05-08T01:22:39Z

Which model are you training, and with which method? Does the error occur when saving the model after training finishes?

uloveqian2021 · 2023-05-08T23:48:17Z

> Which model are you training, and with which method? Does the error occur when saving the model after training finishes?

ChatGLM, using lora_deepspeed, on 4 A10 GPUs (24G each). The OOM is raised as soon as training completes; the save model part of the source code is commented out.

jiangxinyang227 · 2023-05-09T11:20:31Z

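You probably have not configured the deepspeed config; you need to run accelerate config. In addition, deepspeed hits a bug at runtime that requires a change to the source code. For the exact steps, see this blog post: https://www.cnblogs.com/jiangxinyang/p/17330352.html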

uloveqian2021 · 2023-05-09T13:12:09Z

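> You probably have not configured the deepspeed config; you need to run accelerate config. In addition, deepspeed hits a bug at runtime that requires a change to the source code. For the exact steps, see this blog post: https://www.cnblogs.com/jiangxinyang/p/17330352.html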

OK, thanks for the guidance!

uloveqian2021 · 2023-05-11T03:06:23Z

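> You probably have not configured the deepspeed config; you need to run accelerate config. In addition, deepspeed hits a bug at runtime that requires a change to the source code. For the exact steps, see this blog post: https://www.cnblogs.com/jiangxinyang/p/17330352.html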

I configured the deepspeed config as you suggested, but running bash run.sh now raises a new error: Error: Incorrect padding (base64)

/opt/conda/lib/python3.7/site-packages/accelerate/utils/dataclasses.py:511 in __post_init__

  508             or isinstance(self.hf_ds_config, HfDeepSpeedConfig)
  509         ):
  510             if not isinstance(self.hf_ds_config, HfDeepSpeedConfig):
❱ 511                 self.hf_ds_config = HfDeepSpeedConfig(self.hf_ds_config)
  512             if "gradient_accumulation_steps" not in self.hf_ds_config.config:
  513                 self.hf_ds_config.config["gradient_accumulation_steps"] = 1
  514             if "zero_optimization" not in self.hf_ds_config.config:

/opt/conda/lib/python3.7/site-packages/accelerate/utils/deepspeed.py:52 in __init__

   49                 config = json.load(f)
   50         else:
   51             try:
❱  52                 config_decoded = base64.urlsafe_b64decode(config_file_or_dict).decode("u
   53                 config = json.loads(config_decoded)
   54             except (UnicodeDecodeError, AttributeError):
   55                 raise ValueError(

/opt/conda/lib/python3.7/base64.py:133 in urlsafe_b64decode

  130     """
  131     s = _bytes_from_decode_data(s)
  132     s = s.translate(_urlsafe_decode_translation)
❱ 133     return b64decode(s)
  134
  135
  136

/opt/conda/lib/python3.7/base64.py:87 in b64decode

   84         s = s.translate(bytes.maketrans(altchars, b'+/'))
   85     if validate and not re.fullmatch(b'[A-Za-z0-9+/]*={0,2}', s):
   86         raise binascii.Error('Non-base64 digit found')
❱  87     return binascii.a2b_base64(s)
   88
   89
   90 def standard_b64encode(s):
Error: Incorrect padding

jiangxinyang227 · 2023-05-11T06:32:15Z

This looks like an encoding problem in the training data. Try reading the data with utf8 encoding.

uloveqian2021 · 2023-05-18T00:44:57Z

> This looks like an encoding problem in the training data. Try reading the data with utf8 encoding.

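So far it does not look like a utf8 encoding problem. The error appears as soon as accelerate config has been configured. Without configuring accelerate config, training runs normally, but saving after training completes hits the OOM. I am not sure whether this is related to the accelerate version.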
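
One possible reading of the traceback, which the thread does not confirm: the frame in accelerate/utils/deepspeed.py is in the else branch that follows `config = json.load(f)`, so the value it received may not have been opened as a file and was instead passed to base64.urlsafe_b64decode. The sketch below uses only the standard library to show how a config path that does not exist would fail in that way. The function name load_config and the path ./no_such_dir/ds_config.json are hypothetical, and this is a simplified stand-in, not accelerate's actual code.

```python
import base64
import binascii
import json
import os


def load_config(config_file_or_b64: str) -> dict:
    """Hypothetical stand-in for the file-or-base64 branch seen in the traceback."""
    if os.path.exists(config_file_or_b64):
        # An existing path is read as a JSON file.
        with open(config_file_or_b64, encoding="utf-8") as f:
            return json.load(f)
    # Anything else is treated as base64-encoded JSON, so a mistyped or
    # missing file path ends up in the base64 decoder.
    decoded = base64.urlsafe_b64decode(config_file_or_b64).decode("utf-8")
    return json.loads(decoded)


# A path that does not exist (hypothetical) is not valid base64 either.
missing_path = "./no_such_dir/ds_config.json"
try:
    load_config(missing_path)
except binascii.Error as exc:
    print(f"{type(exc).__name__}: {exc}")
```

If this reading applies, it may be worth checking that the deepspeed config path recorded by accelerate config exists on the machine where bash run.sh is executed.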
