Low GPU utilization during generation #965

Open
QN-time opened this issue Mar 18, 2025 · 1 comment

Comments

@QN-time

QN-time commented Mar 18, 2025

Model Series

Qwen2.5-VL-7B-Instruct

What are the models used?

Qwen2.5-VL-7B-Instruct

What is the scenario where the problem happened?

During generation I run the model on two RTX 4090 cards, but GPU utilization stays low. How can I raise GPU utilization and speed up generation?

Information about environment

2× RTX 4090
Python 3.10

Description

Steps to reproduce

I am running the Qwen2.5-VL-7B-Instruct model on two RTX 4090 cards and it produces output correctly. The only problem is that GPU utilization stays low: one card sits at about 1%, and the other stays at 48% to 49%.

[Attached image: screenshot of GPU utilization for the two cards]

Code

The following example code reproduces the issue (`messages` and `image_path` are supplied by the caller: the chat messages plus an optional local image path):

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


def segment_caption(messages, image_path=None):
    # Manual placement: the vision encoder and input embeddings go to GPU 0,
    # while all 28 decoder layers, the final norm, and lm_head go to GPU 1.
    device_map = {
        'visual': 0, 'model.embed_tokens': 0,
        **{f'model.layers.{i}': 1 for i in range(28)},
        'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1,
    }
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map=device_map
    )
    min_pixels = 256 * 28 * 28
    max_pixels = 1280 * 28 * 28
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
    )

    # Preparation for inference
    if image_path:
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(model.device)
    else:
        input_prompt = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = processor(
            text=[input_prompt], text_kwargs={"padding": False}, return_tensors="pt"
        )
        inputs = inputs.to("cuda")

    # Inference: generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=2048)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text)
    return output_text

Expected results

I would like utilization on both cards to increase so that generation runs faster.

@wulipc
Contributor

wulipc commented Mar 18, 2025

We recommend deploying the model with vLLM for inference; see the README for details.
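For reference, a minimal sketch of what a vLLM-based setup might look like (this is an assumption based on vLLM's offline LLM API, not from this thread; tensor_parallel_size=2 shards the weights across both 4090s instead of splitting by layer, which keeps both GPUs active during decoding; max_model_len here is an assumed value to fit 24 GB cards):

from vllm import LLM, SamplingParams

# Shard the model across both RTX 4090s with tensor parallelism.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    tensor_parallel_size=2,
    max_model_len=8192,  # assumed limit; adjust to fit GPU memory
)
sampling_params = SamplingParams(max_tokens=2048)
outputs = llm.generate(["Describe the attached image."], sampling_params)
print(outputs[0].outputs[0].text)

Image inputs would additionally need vLLM's multimodal prompt format (see the Qwen2.5-VL section of the README for the exact usage).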
