Low GPU utilization during generation #965

Open
QN-time opened this issue Mar 18, 2025 · 1 comment

Comments

@QN-time

QN-time commented Mar 18, 2025

Model Series

Qwen2.5-VL-7B-Instruct

What are the models used?

Qwen2.5-VL-7B-Instruct

What is the scenario where the problem happened?

During generation I run the model on two RTX 4090 cards, but GPU utilization stays low. How can I raise GPU utilization and speed up generation?

Information about environment

2× RTX 4090
Python 3.10

Description

Steps to reproduce

I am running the Qwen2.5-VL-7B-Instruct model on two RTX 4090 cards and it produces output correctly. The only problem is that GPU utilization stays low: one card sits at about 1%, and the other stays at 48% to 49%.

[Attached image: screenshot of GPU utilization for the two cards]

Code

The following example code reproduces the issue (`messages` and `image_path` are supplied by the caller: the chat messages plus an optional local image path):

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info


def segment_caption(messages, image_path=None):
    # Manual placement: the vision encoder and input embeddings go to GPU 0,
    # while all 28 decoder layers, the final norm, and lm_head go to GPU 1.
    device_map = {
        'visual': 0, 'model.embed_tokens': 0,
        **{f'model.layers.{i}': 1 for i in range(28)},
        'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1,
    }
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map=device_map
    )
    min_pixels = 256 * 28 * 28
    max_pixels = 1280 * 28 * 28
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
    )

    # Preparation for inference
    if image_path:
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )
        inputs = inputs.to(model.device)
    else:
        input_prompt = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = processor(
            text=[input_prompt], text_kwargs={"padding": False}, return_tensors="pt"
        )
        inputs = inputs.to("cuda")

    # Inference: generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=2048)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text)
    return output_text

Expected results

I would like utilization on both cards to increase so that generation runs faster.

@wulipc
Contributor

wulipc commented Mar 18, 2025

We recommend deploying the model with vLLM for inference; see the README for details.
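For reference, a minimal sketch of what a vLLM-based setup might look like (this is an assumption based on vLLM's offline LLM API, not from this thread; tensor_parallel_size=2 shards the weights across both 4090s instead of splitting by layer, which keeps both GPUs active during decoding; max_model_len here is an assumed value to fit 24 GB cards):

from vllm import LLM, SamplingParams

# Shard the model across both RTX 4090s with tensor parallelism.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    tensor_parallel_size=2,
    max_model_len=8192,  # assumed limit; adjust to fit GPU memory
)
sampling_params = SamplingParams(max_tokens=2048)
outputs = llm.generate(["Describe the attached image."], sampling_params)
print(outputs[0].outputs[0].text)

Image inputs would additionally need vLLM's multimodal prompt format (see the Qwen2.5-VL section of the README for the exact usage).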
