[Model] Adding support for MiniCPM-V #4087

Merged on Jul 25, 2024 (70 commits)

@HwwwwwwwH
Contributor

HwwwwwwwH commented Apr 15, 2024

Adding support for MiniCPM-V-2, please review.
HuggingFace Page: https://huggingface.co/openbmb/MiniCPM-V-2

FIX #4943
FIX #5808

NOTE: This model was added after the release of 0.5.3.post1, so it'll only be included in the next release (e.g. 0.5.4). If you want to use it now, please install vLLM from source (i.e. main branch).

@HwwwwwwwH
Contributor Author

There's a pip dependency conflict. My questions are:

  • MiniCPM-V needs the timm package. Where should I add this dependency? I see many different requirements files in the root directory of vllm.
  • The timm package needs torch==2.1.2, nvidia-nccl-cu12==2.18.1, triton==2.1.0, but vllm pins torch==2.2.1, nvidia-nccl-cu12==2.19.3, triton==2.2.0. How can I resolve this?

@youkaichao
Member

This seems related to @ywang96's RFC #4194 on multi-modality models.

@youkaichao
Member

> The timm package needs torch==2.1.2, nvidia-nccl-cu12==2.18.1, triton==2.1.0, but vllm pins torch==2.2.1, nvidia-nccl-cu12==2.19.3, triton==2.2.0. How can I resolve this?

We can't do anything until timm has the same dependencies as vllm. Alternatively, you can try removing the timm dependency.

@HwwwwwwwH
Contributor Author

Sorry, we were confused about this.
Actually, timm only requires torch >= 1.7, and we've added this dependency to requirements-common.txt.
Please review. @youkaichao @ywang96

@HwwwwwwwH HwwwwwwwH mentioned this pull request Apr 26, 2024

@esmeetu
Collaborator

esmeetu left a comment

Sorry for the delayed review, and thanks for your contribution! It looks good to me; I left some minor comments. However, there is so much custom code that it is hard to review carefully. IMO, it would be better to encapsulate this in your own package and import it into vllm for easier maintenance.

@jeejeelee
Collaborator

@HwwwwwwwH Thanks for your excellent work. May I ask what is blocking progress on this PR?

@HwwwwwwwH
Contributor Author

Very sorry for the delay! We've been working on the new VLM, MiniCPM-V-2.5, for the last few days.

I've pushed a new commit that addresses the review comments. I also see some new VLM features in vllm; are there any requirements for adapting to them?

Really sorry~

@jeejeelee
Collaborator

ping @ywang96

@DarkLight1337
Member

It should have been fixed last night. Please update to the latest main branch.

@ZHANG-SH97

ZHANG-SH97 commented Aug 1, 2024

@HwwwwwwwH I found Qwen2Model in init_llm. Are there any plans to release something like minicpmv3-Qwen2 in the future? *v*

@Howe-Young

> It should have been fixed last night. Please update to the latest main branch.

Thanks for your reply! The latest code runs normally, but the Chinese output contains stray '<|eot_id|><|eot_id|>' tokens, which does not happen with English output. What is the reason for this?
English output: [image]
Chinese output: [image]

@DarkLight1337
Member

Could you show the input prompt for each case?

@whyiug
Contributor

whyiug commented Aug 1, 2024

> Thanks for your reply! The latest code runs normally, but the Chinese output contains stray '<|eot_id|><|eot_id|>' tokens, which does not happen with English output. What is the reason for this?

@Howe-Young Perhaps you need to add stop tokens:

stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
)

@Howe-Young

> Could you show the input prompt for each case?

English prompt:

question = "please describe the image in detail"
messages = [{
    'role': 'user',
    'content': f'(<image>./</image>)\n{question}'
}]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)

Chinese prompt:

# The question means "Describe the image content in detail".
question = "详细描述图片内容"
messages = [{
    'role': 'user',
    'content': f'(<image>./</image>)\n{question}'
}]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)

@Howe-Young

> stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
> sampling_params = SamplingParams(
>     stop_token_ids=stop_token_ids,
> )

Thanks, adding stop_token_ids=stop_token_ids works!
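
For readers who already have outputs with stray markers, the following is a minimal post-processing sketch, not a replacement for setting `stop_token_ids`. The helper name `strip_trailing_markers` is hypothetical and not part of vLLM; it only trims marker strings such as '<|eot_id|>' from the end of a returned string.

```python
def strip_trailing_markers(text, markers=("<|eot_id|>",)):
    """Remove any run of the given marker strings from the end of `text`.

    Hypothetical helper for illustration only; the proper fix discussed in
    this thread is to pass stop_token_ids so generation stops earlier.
    """
    while True:
        text = text.rstrip()
        for marker in markers:
            if marker and text.endswith(marker):
                # Drop one trailing marker, then check the new ending again.
                text = text[: -len(marker)]
                break
        else:
            # No marker matched the current ending, so we are done.
            return text


print(strip_trailing_markers("A cat sits on the sofa.<|eot_id|><|eot_id|>"))
```

Trimming after the fact still pays for the extra generated tokens, which is why stopping at the token level is preferable.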

@1223243

1223243 commented Aug 2, 2024

Hello! MiniCPMv2_5 works well with the OpenAI-compatible API, but some API parameters do not seem to be supported yet. For example, when I want to get the logprobs for the question:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="openbmb/MiniCPM-Llama3-V-2_5",
    messages=[
        {"role": "user", "content": "Do you think 2 is larger than 1? Answer yes or no."}
    ],
    extra_body={
        "stop": ['<|eot_id|>'],
        "echo": True,
        "max_tokens": 1,
        "logprobs": True,
    },
)

The output only contains the logprobs of the generated text (e.g. "yes", logprob = "-0.0065") and not those of the input prompt. How can I solve this?

Another question: I installed vLLM with pip install vllm and got version 0.5.3.post1. Why can't I run python -m vllm.entrypoints.openai.api_server --model /home/nlp/xc/NLP/LLM/openLLM/MiniCPM-Llama3-V-2_5? It tells me this model is not supported.

@DarkLight1337
Member

DarkLight1337 commented Aug 2, 2024

> I installed vLLM with pip install vllm and got version 0.5.3.post1. Why can't I run python -m vllm.entrypoints.openai.api_server --model /home/nlp/xc/NLP/LLM/openLLM/MiniCPM-Llama3-V-2_5? It tells me this model is not supported.

This model was added after the release of 0.5.3.post1, so it'll only be included in the next release (e.g. 0.5.4). If you want to use it now, please install vLLM from source (i.e. main branch).

@ywang96
Member

ywang96 commented Aug 2, 2024

> I installed vLLM with pip install vllm and got version 0.5.3.post1. Why can't I run python -m vllm.entrypoints.openai.api_server --model /home/nlp/xc/NLP/LLM/openLLM/MiniCPM-Llama3-V-2_5? It tells me this model is not supported.
>
> This model was added after the release of 0.5.3.post1, so it'll only be included in the next release (e.g. 0.5.4). If you want to use it now, please install vLLM from source (i.e. main branch).

@DarkLight1337 I'm updating this PR description to link to this comment from you given how many times we had to answer the same question :P
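
Since this question keeps coming up, here is a small sketch for checking an installed version before trying the model. It assumes, based only on the "e.g. 0.5.4" wording above, that 0.5.4 is the first release containing MiniCPM-V; the names `FIRST_SUPPORTED`, `release_tuple` and `has_minicpmv` are hypothetical. Source builds from the main branch may report other version strings, so treat the result as a hint only.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumption: 0.5.4 is the first release that ships MiniCPM-V support.
FIRST_SUPPORTED = (0, 5, 4)


def release_tuple(version_string):
    """Return the leading numeric release, e.g. '0.5.3.post1' -> (0, 5, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_minicpmv(version_string):
    """True if the release is at or after the assumed first supported one."""
    return release_tuple(version_string) >= FIRST_SUPPORTED


try:
    installed = version("vllm")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("vllm is not installed in this environment")
elif has_minicpmv(installed):
    print(f"vllm {installed}: MiniCPM-V should be available")
else:
    print(f"vllm {installed}: upgrade or install from source")
```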

@PancakeAwesome

PancakeAwesome commented Aug 6, 2024

With offline vLLM inference, minicpmv2-6 keeps repeating the same passage in its output.
vllm==0.5.4
Inference code:

messages = [{
    'role': 'user',
    'content': f'(<image>./</image>)\n{question}'
}]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=True)

stop_token_ids = ['<|eot_id|>']
sampling_params = SamplingParams(temperature=0.7, max_tokens=8192, stop_token_ids=stop_token_ids)

inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}


outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

Could you please take a look? Thanks~ @ywang96

@ywang96
Member

ywang96 commented Aug 6, 2024

> With offline vLLM inference, minicpmv2-6 keeps repeating the same passage in its output. vllm==0.5.4

Could you share a sample input/output with the repetitive generation?

@whyiug
Contributor

whyiug commented Aug 7, 2024

> With offline vLLM inference, minicpmv2-6 keeps repeating the same passage in its output. vllm==0.5.4

Try this:

stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
)
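
One difference from the reported snippet is that `stop_token_ids` takes integer token IDs, while the report passed the string '<|eot_id|>'. The sketch below wraps the conversion in a check that fails loudly when a token string is unknown to the tokenizer. `to_stop_token_ids` and `FakeTokenizer` are hypothetical names, and the IDs in the stand-in vocabulary are made up so the example runs without downloading a model; with a real Hugging Face tokenizer only `convert_tokens_to_ids` and `unk_token_id` are used.

```python
from typing import Iterable, List


def to_stop_token_ids(tokenizer, stop_tokens: Iterable[str]) -> List[int]:
    """Convert stop-token strings to integer IDs, rejecting unknown tokens."""
    unk_id = getattr(tokenizer, "unk_token_id", None)
    ids = []
    for token in stop_tokens:
        token_id = tokenizer.convert_tokens_to_ids(token)
        if not isinstance(token_id, int) or token_id == unk_id:
            raise ValueError(f"{token!r} is not in this tokenizer's vocabulary")
        ids.append(token_id)
    return ids


class FakeTokenizer:
    """Stand-in tokenizer with made-up IDs, for illustration only."""

    unk_token_id = 0
    _vocab = {"<|im_end|>": 101, "<|endoftext|>": 102}

    def convert_tokens_to_ids(self, token):
        return self._vocab.get(token, self.unk_token_id)


print(to_stop_token_ids(FakeTokenizer(), ["<|im_end|>", "<|endoftext|>"]))
```

With the stand-in vocabulary, asking for '<|eot_id|>' raises a `ValueError`, which is the kind of mismatch that otherwise shows up only as generation that never stops.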

@PancakeAwesome

> Could you share a sample input/output with the repetitive generation?

The prompt is in Chinese; it roughly asks the model to produce some classic advertising copy.

@PancakeAwesome

By the way, how can I use minicpmv2-6's few-shot feature with vLLM?

@PancakeAwesome

Here is a best-practice example for minicpmv2-6 inference with vLLM:

from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# List of image file paths
IMAGES = [
    "/root/ld/ld_project/MiniCPM-V/assets/airplane.jpeg",  # local image path
]

# Model name or path
MODEL_NAME = "/root/ld/ld_model_pretrained/Minicpmv2_6"  # local model path or Hugging Face model name

# Open the image and convert it to RGB
image = Image.open(IMAGES[0]).convert("RGB")

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Initialize the LLM
llm = LLM(model=MODEL_NAME,
          gpu_memory_utilization=1,  # use all GPU memory
          trust_remote_code=True,
          max_model_len=2048)  # adjust this value based on available memory

# Build the chat messages (the question means "Please describe this image")
messages = [{'role': 'user', 'content': '(<image>./</image>)\n' + '请描述这张图片'}]

# Apply the chat template to the messages
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Set the stop token IDs
# 2.0
# stop_token_ids = [tokenizer.eos_id]
# 2.5
# stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
# 2.6
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

# Set the sampling parameters
sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
    # temperature=0.7,
    # top_p=0.8,
    # top_k=100,
    # seed=3472,
    max_tokens=1024,
    # min_tokens=150,
    temperature=0,
    use_beam_search=True,
    # length_penalty=1.2,
    best_of=3)

# Get the model output
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    }
}, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

@PancakeAwesome

> Try this:
>
> stop_tokens = ['<|im_end|>', '<|endoftext|>']
> stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
> sampling_params = SamplingParams(
>     stop_token_ids=stop_token_ids,
> )

Thank you very much. I think the problem is that each model version has different stop token IDs. This code should work.
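
The per-version settings mentioned in this thread can be collected in one place. This is only a sketch: the table mirrors the three variants posted here (2.0, 2.5, 2.6), while `STOP_ID_GETTERS`, `stop_token_ids_for` and the stand-in tokenizer with made-up IDs are hypothetical. With a real tokenizer, the attributes `eos_id` and `eot_id` are assumed to exist as in the snippets above.

```python
from types import SimpleNamespace

# One rule per MiniCPM-V version, as reported in this thread.
STOP_ID_GETTERS = {
    "2.0": lambda tok: [tok.eos_id],
    "2.5": lambda tok: [tok.eos_id, tok.eot_id],
    "2.6": lambda tok: [
        tok.convert_tokens_to_ids(t) for t in ("<|im_end|>", "<|endoftext|>")
    ],
}


def stop_token_ids_for(model_version, tokenizer):
    """Look up the stop-token rule for a model version and apply it."""
    try:
        getter = STOP_ID_GETTERS[model_version]
    except KeyError:
        raise ValueError(
            f"no stop-token rule recorded for version {model_version!r}"
        ) from None
    return getter(tokenizer)


# Stand-in tokenizer with made-up IDs so the sketch runs without a model.
fake_vocab = {"<|im_end|>": 11, "<|endoftext|>": 12}
fake_tokenizer = SimpleNamespace(
    eos_id=2, eot_id=7, convert_tokens_to_ids=fake_vocab.get
)

for model_version in ("2.0", "2.5", "2.6"):
    print(model_version, stop_token_ids_for(model_version, fake_tokenizer))
```

The returned list is what would be passed as `stop_token_ids` to `SamplingParams`.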

@PancakeAwesome

> By the way, how can I use minicpmv2-6's few-shot feature with vLLM?

Here is the official few-shot usage with transformers:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

@PancakeAwesome

> By the way, how can I use minicpmv2-6's few-shot feature with vLLM?

Looking forward to your reply~ Thank you. @ywang96 @whyiug

@xyfZzz

xyfZzz commented Aug 7, 2024

When deploying minicpm-v-2.6 through the OpenAI-compatible API server in vLLM 0.5.4, I hit this error. Please take a look:

Process Process-1:
Traceback (most recent call last):
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.init_device()
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/vllm/worker/worker.py", line 123, in init_device
    torch.cuda.set_device(self.device)
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/torch/cuda/__init__.py", line 420, in set_device
    torch._C._cuda_setDevice(device)
  File "/app/apps/anaconda3/envs/vllm_054_cu118/lib/python3.9/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

@AlphaINF

AlphaINF commented Aug 7, 2024

Running MiniCPM-V inference on version 0.5.4, deployed through the OpenAI-compatible API, runs out of memory.
OpenBMB/MiniCPM-V#392

@AlphaINF

AlphaINF commented Aug 7, 2024

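I tested on a single A100-80G GPU. When loading with vLLM, memory usage first rises to 16 GB (loading the model); at some moment right after loading finishes, it peaks at 29 GB and then drops back to 19 GB. The cause is unknown.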

@sfyumi

sfyumi commented Aug 7, 2024

How can I load the vision model on a separate GPU to avoid OOM?

@AlphaINF

AlphaINF commented Aug 7, 2024

@sfyumi I have a solution. vLLM's max-num-seqs defaults to 256, which is too large for a 3090. Lower max-num-seqs to 32 and raise gpu-memory-utilization to 1.
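
As a sketch of how those two settings could be passed to the OpenAI-compatible server, the snippet below only assembles and prints the command line; it does not start anything. The helper name `build_server_command` and the model identifier are assumptions for illustration, and other flags your setup may need are left out.

```python
import shlex


def build_server_command(model, max_num_seqs=32, gpu_memory_utilization=1.0):
    """Assemble the server command with the two settings suggested above."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# The model name here is an assumption; substitute your own path or repo ID.
cmd = build_server_command("openbmb/MiniCPM-V-2_6")
print(shlex.join(cmd))
```

Building the argument list first makes it easy to hand the same values to `subprocess.run` later without shell quoting issues.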

@HwwwwwwwH
Contributor Author

> By the way, how can I use minicpmv2-6's few-shot feature with vLLM?

I think this could work:

msgs = [
    {'role': 'user', 'content': "(<image>./</image>)" + question}, {'role': 'assistant', 'content': answer1},
    {'role': 'user', 'content': "(<image>./</image>)" + question}, {'role': 'assistant', 'content': answer2},
    {'role': 'user', 'content': "(<image>./</image>)" + question}
]
prompt = tokenizer.apply_chat_template(
    msgs,
    tokenize=False,
    add_generation_prompt=True
)
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": [image1, image2, image_test]
    },
}
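
The suggestion above pairs one '(<image>./</image>)' placeholder with each image, in order. A minimal sketch of that bookkeeping follows, assuming the same placeholder string; `build_few_shot_messages` and `count_placeholders` are hypothetical helpers, and plain strings stand in for the PIL images so the example runs on its own.

```python
IMAGE_PLACEHOLDER = "(<image>./</image>)"


def build_few_shot_messages(question, example_answers):
    """One user/assistant pair per worked example, then the final query."""
    msgs = []
    for answer in example_answers:
        msgs.append({"role": "user", "content": IMAGE_PLACEHOLDER + question})
        msgs.append({"role": "assistant", "content": answer})
    msgs.append({"role": "user", "content": IMAGE_PLACEHOLDER + question})
    return msgs


def count_placeholders(msgs):
    """Total number of image placeholders across all messages."""
    return sum(m["content"].count(IMAGE_PLACEHOLDER) for m in msgs)


msgs = build_few_shot_messages("production date", ["2023.08.04", "2007.04.24"])
images = ["image1", "image2", "image_test"]  # stand-ins for PIL images

# The image list must line up one-to-one with the placeholders.
if count_placeholders(msgs) != len(images):
    raise ValueError("number of images does not match number of placeholders")
print(len(msgs), "messages,", len(images), "images")
```

The resulting `msgs` would then go through `tokenizer.apply_chat_template` as in the snippet above.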

@HwwwwwwwH
Contributor Author

> I tested on a single A100-80G GPU. When loading with vLLM, memory usage first rises to 16 GB (loading the model); at some moment right after loading finishes, it peaks at 29 GB and then drops back to 19 GB. The cause is unknown.

vLLM sends dummy data (with multiple dummy images) to the model. Since MiniCPM-V uses only a few tokens per image, there may be a large number of dummy images, which could cause OOM. You can add max_model_len=2048 when initializing LLM.
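
To see why a smaller max_model_len helps, here is a rough back-of-the-envelope sketch. It is not vLLM's actual code: it simply assumes the dummy input is filled with as many images as fit in the context, and the value of 64 tokens per image is a hypothetical figure chosen for illustration.

```python
def estimate_dummy_images(max_model_len, tokens_per_image):
    """Upper bound on images that fit if the whole context were image tokens."""
    if tokens_per_image <= 0:
        raise ValueError("tokens_per_image must be positive")
    return max_model_len // tokens_per_image


# tokens_per_image=64 is an assumed value, not taken from the model config.
for length in (8192, 2048):
    print(length, estimate_dummy_images(length, tokens_per_image=64))
```

Under these assumptions, shrinking the context by a factor of four shrinks the number of dummy images, and the memory they need in the vision encoder, by the same factor.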

kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024