qwen2vl-72b multi-GPU inference #295

ChinChyi · 2024-09-29T05:12:35Z

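I need to test a very long prompt, and the 72B model will certainly run out of memory on a single GPU. Is there an inference mode that distributes the tokens across multiple GPUs, similar to InternVL's no_split_module_classes?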


kq-chen · 2024-09-29T20:18:11Z

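There should be. Is it this one: modeling_qwen2_vl.py#L1039? Normally, running directly on a multi-GPU machine splits the model across the GPUs. Did you hit an error when running it?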

ChinChyi · 2024-09-30T05:03:28Z

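@kq-chen The error is out of memory. We built a benchmark and want to test Qwen on it. We are using accelerate for multi-node training, and even with batch=1 we always get an out-of-memory error.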

yhy-2000 · 2024-09-30T06:55:49Z

Could it be that bf16 and flash attention are not enabled?

kq-chen · 2024-10-01T18:50:59Z

What problem do you run into when using multiple GPUs? With device_map="auto", the model is normally split across multiple GPUs:

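import torch
from transformers import Qwen2VLForConditionalGeneration

# Load in bfloat16 with FlashAttention-2; device_map="auto" places the layers across the visible GPUs.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)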

Alternatively, you can try vLLM; see #260 (comment).
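For readers who want to try the vLLM route, the following is a minimal sketch and not the setup from #260. It assumes vLLM is installed and that its `LLM` class accepts `tensor_parallel_size`, `dtype`, and `max_model_len`; the GPU count and length cap are illustrative values, and `build_vllm_kwargs` and `load_with_vllm` are hypothetical helper names.

```python
def build_vllm_kwargs(model_path: str, num_gpus: int, max_model_len: int) -> dict:
    """Collect keyword arguments for vllm.LLM (the values used below are illustrative)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return {
        "model": model_path,
        "tensor_parallel_size": num_gpus,  # shard each layer across this many GPUs
        "dtype": "bfloat16",
        "max_model_len": max_model_len,    # cap on prompt plus generated tokens
    }


def load_with_vllm(kwargs: dict):
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM
    return LLM(**kwargs)


kwargs = build_vllm_kwargs("Qwen/Qwen2-VL-72B-Instruct", num_gpus=4, max_model_len=32768)
print(kwargs)
```

Calling `load_with_vllm(kwargs)` on a machine with the assumed four GPUs would then start the engine. Unlike layer-wise placement, tensor parallelism splits every layer across the GPUs, so activation memory for a long prompt is shared rather than concentrated on one device.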

Leon1207 · 2024-10-16T07:58:24Z

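Hello authors. In multi-GPU inference with device_map="auto" and limited resources, I find that the memory of one GPU grows sharply as the number of video frames increases. It grows more than on the other GPUs, so memory is allocated unevenly and that GPU runs out of memory. Is this related to the partitioning strategy of device_map? Is it unavoidable?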
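To get a feel for how quickly the visual input grows with the frame count, here is a rough back-of-the-envelope sketch. The constants (patch size 14, 2x2 spatial merge, two frames per temporal step) are assumptions about the Qwen2-VL vision preprocessing, and `estimate_video_tokens` is a hypothetical helper; the real processor also resizes frames, so treat the numbers as an order-of-magnitude estimate only.

```python
import math

PATCH = 14          # assumed ViT patch size in pixels
MERGE = 2           # assumed spatial merge factor (2x2 patches become 1 token)
TEMPORAL_PATCH = 2  # assumed number of frames fused into one temporal step


def estimate_video_tokens(num_frames: int, height: int, width: int) -> int:
    """Rough count of visual tokens the language model receives for a video."""
    unit = PATCH * MERGE  # 28 pixels per token side under the assumptions above
    grid_t = math.ceil(num_frames / TEMPORAL_PATCH)
    grid_h = height // unit
    grid_w = width // unit
    return grid_t * grid_h * grid_w


for frames in (16, 64, 256):
    print(frames, estimate_video_tokens(frames, 448, 448))
```

Under these assumptions the token count grows linearly with the number of frames. Because the vision tower is small, device_map="auto" typically places it on a single GPU, which may be one reason that GPU's memory grows faster than the others.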

jiah-li · 2024-10-17T07:47:52Z

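> Hello authors. In multi-GPU inference with device_map="auto" and limited resources, I find that the memory of one GPU grows sharply as the number of video frames increases. It grows more than on the other GPUs, so memory is allocated unevenly and that GPU runs out of memory. Is this related to the partitioning strategy of device_map? Is it unavoidable?

Hello, did you run into this problem during multi-GPU inference? How did you solve it?
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mask in method wrapper_CUDA__masked_scatter_)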
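The error above says that the mask and the tensor passed to masked_scatter_ are on different GPUs. One way to investigate, offered as a diagnostic sketch and not a confirmed fix, is to look at where device_map="auto" placed each module through the `hf_device_map` attribute of the loaded model. The module names in the example map (`visual`, `model.embed_tokens`, and so on) are assumptions and vary between transformers versions; `modules_by_device` and `same_device` are hypothetical helpers.

```python
from collections import defaultdict


def modules_by_device(device_map: dict) -> dict:
    """Group module names by the device they were placed on."""
    grouped = defaultdict(list)
    for name, device in device_map.items():
        grouped[device].append(name)
    return dict(grouped)


def same_device(device_map: dict, first: str, second: str) -> bool:
    """True if both module names are present and mapped to the same device."""
    return (
        first in device_map
        and second in device_map
        and device_map[first] == device_map[second]
    )


# Hypothetical placement for illustration; with a real model pass model.hf_device_map.
example_map = {
    "visual": 0,
    "model.embed_tokens": 1,
    "model.layers.0": 1,
    "lm_head": 1,
}
print(modules_by_device(example_map))
print(same_device(example_map, "visual", "model.embed_tokens"))
```

If the vision tower and the token embedding end up on different devices, passing an explicit device_map dict that keeps them together, or upgrading transformers, may be worth trying.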

jiah-li · 2024-10-18T02:51:52Z

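> Hello authors. In multi-GPU inference with device_map="auto" and limited resources, I find that the memory of one GPU grows sharply as the number of video frames increases. It grows more than on the other GPUs, so memory is allocated unevenly and that GPU runs out of memory. Is this related to the partitioning strategy of device_map? Is it unavoidable?

Hello, I also ran into this problem. Has it been solved?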
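For the uneven memory growth described above, one option that may help is to pass a `max_memory` dict together with device_map="auto", so that fewer weights are placed on the GPU that needs headroom for activations. This is a sketch under assumptions: the GPU count, the GiB limits, and the choice of GPU 0 as the one to relieve are illustrative, and `build_max_memory` and `load_model` are hypothetical helpers.

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int, reserved_gpu_gib: int,
                     reserved_gpu: int = 0) -> dict:
    """Per-GPU weight budgets, with a smaller budget on the GPU that needs headroom."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    limits = {i: f"{per_gpu_gib}GiB" for i in range(num_gpus)}
    limits[reserved_gpu] = f"{reserved_gpu_gib}GiB"
    return limits


def load_model(model_path: str, max_memory: dict):
    # Imported lazily so build_max_memory can be used without torch installed.
    import torch
    from transformers import Qwen2VLForConditionalGeneration

    return Qwen2VLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
        max_memory=max_memory,  # caps how much of each GPU the weights may occupy
    )


# Illustrative values: four 80 GB GPUs, leaving extra room on GPU 0.
max_memory = build_max_memory(num_gpus=4, per_gpu_gib=70, reserved_gpu_gib=40)
print(max_memory)
```

The budgets only limit where weights are placed; they do not bound activation memory, so a very long video can still exhaust a GPU.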

MaxSuperMax33 · 2024-10-24T05:01:17Z

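> > Hello authors. In multi-GPU inference with device_map="auto" and limited resources, I find that the memory of one GPU grows sharply as the number of video frames increases. It grows more than on the other GPUs, so memory is allocated unevenly and that GPU runs out of memory. Is this related to the partitioning strategy of device_map? Is it unavoidable?
>
> Hello, did you run into this problem during multi-GPU inference? How did you solve it? RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mask in method wrapper_CUDA__masked_scatter_)

Hello, I ran into the same two problems as you. Have you solved them?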

wjizhong · 2024-10-31T03:44:03Z

Hello, I ran into the same two problems as you. Have you solved them?

Leon1207 · 2024-10-31T03:46:57Z

> Hello, I ran into the same two problems as you. Have you solved them?

Maybe try flash attention and bfloat16:

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen2-VL-72B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

sjghh · 2024-12-19T14:01:43Z

> > Hello, I ran into the same two problems as you. Have you solved them?
>
> Maybe try flash attention and bfloat16:
>
> model = Qwen2VLForConditionalGeneration.from_pretrained(
>     "Qwen2-VL-72B-Instruct",
>     torch_dtype=torch.bfloat16,
>     attn_implementation="flash_attention_2",
>     device_map="auto",
> )

Could you tell me how to install what attn_implementation="flash_attention_2" requires? I got this error: ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
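The ImportError above means the flash_attn package is missing from the environment. The usual install command is `pip install flash-attn --no-build-isolation`, which needs a CUDA toolchain compatible with the installed PyTorch. As a sketch, assuming "sdpa" is an acceptable fallback for your transformers version, the attention backend can be chosen at runtime; `pick_attn_implementation` is a hypothetical helper name.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Use FlashAttention-2 when the flash_attn package is importable, else fall back."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch scaled-dot-product attention, assumed to be supported


attn = pick_attn_implementation()
print(attn)
```

The returned string can then be passed as attn_implementation to from_pretrained in place of the hard-coded "flash_attention_2".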

kq-chen closed this as completed Oct 1, 2024

kq-chen reopened this Oct 1, 2024
