Commit 0.6.0
matatonic committed Apr 6, 2024
1 parent 3401b8e commit 7724a24
Showing 19 changed files with 287 additions and 140 deletions.
63 changes: 39 additions & 24 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,12 +7,12 @@ An OpenAI API compatible vision server, it functions like `gpt-4-vision-preview`
- Does not connect to the OpenAI API and does not require an OpenAI API Key
- Not affiliated with OpenAI in any way

Backend Model support:
Model support:
- [X] [InternLM-XComposer2](https://huggingface.co/internlm/internlm-xcomposer2-7b) [finetune] (multi-image chat model, lots of warnings on startup, but works fine)
- [X] [InternLM-XComposer2-VL](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b) [pretrain] *(only supports a single image, also lots of warnings)
- [X] [LlavaNext](https://huggingface.co/llava-hf) - (llava-v1.6-mistral-7b-hf, llava-v1.6-34b-hf - llava-v1.6-34b-hf is not working well yet) *(only supports a single image)
- [X] [Llava](https://huggingface.co/llava-hf) - (llava-v1.5-vicuna-7b-hf, llava-v1.5-vicuna-13b-hf, llava-v1.5-bakLlava-7b-hf) *(only supports a single image)
- [X] [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)
- [X] [InternLM-XComposer2](https://huggingface.co/internlm/internlm-xcomposer2-7b) [finetune] (multi-image chat model, you may need to add "in English" to the first prompt.)
- [X] [InternLM-XComposer2-VL](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b) [pretrain] *(only supports a single image)
- [X] Moondream2 - [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) *(only supports a single image)
- [ ] Moondream1 - [vikhyatk/moondream1](https://huggingface.co/vikhyatk/moondream1)
- [ ] Deepseek-VL - [deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)
Expand All @@ -27,20 +27,15 @@ Some vision systems include their own OpenAI compatible API server. Also include
- [X] [THUDM/CogVLM](https://github.com/THUDM/CogVLM) ([cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf), [cogagent-chat-hf](https://huggingface.co/THUDM/cogagent-chat-hf)), `docker-compose.cogvlm.yml` **Recommended for 16GB-40GB GPUs**
- [X] [01-ai](https://huggingface.co/01-ai)/Yi-VL ([Yi-VL-6B](https://huggingface.co/01-ai/Yi-VL-6B), [Yi-VL-34B](https://huggingface.co/01-ai/Yi-VL-34B)), `docker-compose.yi-vl.yml`

Version: 0.5.0
Version: 0.6.0

Recent updates:
- new backend: XComposer2 (multi-image finetuned chat model)
- new backend: XComposer2-VL (single image pretrained model)
- new backend: MiniCPM-V aka. OmniLMM-3B
- Yi-VL and CogVLM (docker containers only)
- new backend: Qwen-VL
- new backend: llava (1.5)
- new backend: llavanext (1.6+)
- multi-turn questions & answers
- chat_with_images.py test tool and code sample
- selectable chat formats
- flash attention 2, accelerate (device split), bitsandbytes (4bit, 8bit) support
- Automatic selection of backend, based on the model name
- Enable trust_remote_code by default
- Improved parameter support: temperature, top_p, max_tokens, system prompts
- Improved default generation parameters and sampler settings
- Improved system prompt for InternLM-XComposer2 & InternLM-XComposer2-VL: fewer refusals, and "In English" should rarely be needed while still supporting Chinese.
- Fix: chat_with_images.py url filename bug


See: [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)
Expand All @@ -54,36 +49,38 @@ API Documentation
Installation instructions
-------------------------

(**Docker Recommended**)

```shell
# install the python dependencies
pip install -r requirements.txt
# install backend-specific requirements (only for the backends you plan to use)
pip install -r requirements.moondream.txt -r requirements.qwen-vl.txt
# install the package
pip install .
# run the server
python vision.py
# run the server with your chosen model
python vision.py --model vikhyatk/moondream2
```
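
Optional flags from the Usage section below can be combined with `--model`; for example, 4-bit quantization or an explicit device (neither works with every model). A quick sketch:

```shell
# optional: 4-bit quantization and a pinned GPU (not supported by all models)
python vision.py --model llava-hf/llava-v1.6-mistral-7b-hf --load-in-4bit --device cuda:0
```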

Usage
-----

```
usage: vision.py [-h] [-m MODEL] [-b BACKEND] [-f FORMAT] [-d DEVICE] [--no-trust-remote-code] [-4] [-8] [-F] [-P PORT] [-H HOST] [--preload]
usage: vision.py [-h] -m MODEL [-b BACKEND] [-f FORMAT] [-d DEVICE] [--no-trust-remote-code] [-4] [-8] [-F] [-P PORT] [-H HOST] [--preload]
OpenedAI Vision API Server
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
The model to use, Ex. llava-hf/llava-v1.6-mistral-7b-hf (default: vikhyatk/moondream2)
The model to use, Ex. llava-hf/llava-v1.6-mistral-7b-hf (default: None)
-b BACKEND, --backend BACKEND
Force the backend to use (moondream1, moondream2, llavanext, llava, qwen-vl) (default: None)
-f FORMAT, --format FORMAT
Force a specific chat format. (vicuna, mistral, chatml, llama2, phi15, gemma) (doesn't work with all models) (default: None)
-d DEVICE, --device DEVICE
Set the torch device for the model. Ex. cuda:1 (default: auto)
--no-trust-remote-code
Don't trust remote code (required for some models) (default: False)
-4, --load-in-4bit load in 4bit (doesn't work with all models) (default: False)
-8, --load-in-8bit load in 8bit (doesn't work with all models) (default: False)
Expand All @@ -96,9 +93,15 @@ options:
Docker support
--------------

You can run the server via docker like so:
1) Edit the docker-compose file to suit your needs.

2) You can run the server via docker like so:
```shell
docker compose up
# for CogVLM
docker compose -f docker-compose.cogvlm.yml up
# for Yi-VL
docker compose -f docker-compose.yi-vl.yml up
```
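
To keep the containers running in the background, the standard compose flags apply:

```shell
# run detached and follow the logs
docker compose up -d
docker compose logs -f
```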

Sample API Usage
Expand All @@ -109,11 +112,23 @@ Sample API Usage
Example:
```
$ python chat_with_image.py https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg
Answer: This is a beautiful image of a wooden path leading through a lush green field. The path appears to be well-trodden, suggesting it's a popular route for walking or hiking. The sky is a clear blue with some scattered clouds, indicating a pleasant day with good weather. The field is vibrant and seems to be well-maintained, which could suggest it's part of a park or nature reserve. The overall scene is serene and inviting, perfect for a peaceful walk in nature.
Answer: The image captures a serene landscape of a grassy field, where a wooden walkway cuts through the center. The path is flanked by tall, lush green grass on either side, leading the eye towards the horizon. A few trees and bushes are scattered in the distance, adding depth to the scene. Above, the sky is a clear blue, dotted with white clouds that add to the tranquil atmosphere.
Question: Are there any animals in the picture?
Answer: No, there are no animals visible in the picture. The focus is on the path and the surrounding natural landscape.
Answer: No, there are no animals visible in the picture.
Question:
Question: ^D
$
```
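
Because the server speaks the OpenAI chat completions protocol, the standard `openai` Python client can also be pointed at it directly. A minimal sketch follows; the port, API key, and model string are assumptions, so match them to how you started the server:

```python
# Assumes the server is listening locally; adjust the port to your -P/--port setting.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="sk-unused")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder; the server answers with whatever model it loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```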

Known Bugs & Workarounds
------------------------

1. Related to the CUDA device split. If you get:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
```
Try specifying a single CUDA device with `CUDA_VISIBLE_DEVICES=1` (or the number of your GPU) before running the script, or set the device via `--device <device>` on the command line (see the sketch after this list).

2. 4bit/8bit and flash attention 2 don't work for all the models. No workaround.
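
For bug 1, a concrete sketch of the two workarounds (the GPU index and model are placeholders; substitute your own):

```shell
# pin the process to a single GPU before starting the server
CUDA_VISIBLE_DEVICES=1 python vision.py --model vikhyatk/moondream2
# or select the torch device explicitly
python vision.py --model vikhyatk/moondream2 --device cuda:1
```
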
2 changes: 1 addition & 1 deletion backend/deepseek-vl.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
async def chat_with_images(self, request: ImageChatRequest) -> str:
# XXX WIP
conversation = [
{
Expand Down
9 changes: 6 additions & 3 deletions backend/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,15 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
async def chat_with_images(self, request: ImageChatRequest) -> str:
images, prompt = await prompt_from_messages(request.messages, self.format)

encoded_images = self.model.encode_image(images)
inputs = self.tokenizer(prompt, encoded_images, return_tensors="pt")
output = self.model.generate(**inputs, max_new_tokens=max_tokens)

params = self.get_generation_params(request)

output = self.model.generate(**inputs, **params)
response = self.tokenizer.decode(output[0], skip_special_tokens=True)

return answer_from_response(response, self.format)
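
The change repeated across these backends is the same: `chat_with_images` now takes the whole `ImageChatRequest` and derives its sampler settings from a shared `get_generation_params` helper instead of a bare `max_tokens`. As a rough illustration only, since the real helper lives in the backend base class (not part of this diff) and these field names and defaults are assumptions, such a helper might look like:

```python
# Hypothetical sketch of a base-class helper like the one these backends call.
# Request fields mirror the OpenAI chat API; the defaults here are guesses.
def get_generation_params(self, request) -> dict:
    params = {
        "max_new_tokens": request.max_tokens or 512,
        "do_sample": False,
    }
    if request.temperature:  # a non-zero temperature enables sampling
        params.update(do_sample=True, temperature=request.temperature)
    if request.top_p is not None:
        params["top_p"] = request.top_p
    return params
```
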
8 changes: 5 additions & 3 deletions backend/llava.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,14 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
async def chat_with_images(self, request: ImageChatRequest) -> str:

images, prompt = await prompt_from_messages(messages, self.format)
images, prompt = await prompt_from_messages(request.messages, self.format)
inputs = self.processor(prompt, images, return_tensors="pt").to(self.device)

output = self.model.generate(**inputs, max_new_tokens=max_tokens)
params = self.get_generation_params(request)

output = self.model.generate(**inputs, **params)
response = self.processor.decode(output[0], skip_special_tokens=True)

return answer_from_response(response, self.format)
10 changes: 6 additions & 4 deletions backend/llavanext.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,14 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
async def chat_with_images(self, request: ImageChatRequest) -> str:

images, prompt = await prompt_from_messages(request.messages, self.format)
inputs = self.processor(prompt, images, return_tensors="pt").to(self.model.device)

output = self.model.generate(**inputs, max_new_tokens=max_tokens)
params = self.get_generation_params(request)

output = self.model.generate(**inputs, **params)
response = self.processor.decode(output[0], skip_special_tokens=True)

return answer_from_response(response, self.format)
13 changes: 7 additions & 6 deletions backend/minigemini.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,8 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
async def chat_with_images(self, request: ImageChatRequest) -> str:
images, prompt = await prompt_from_messages(request.messages, self.format)

#encoded_images = self.model.encode_image(images).to(self.device)
# square?
Expand All @@ -32,18 +32,19 @@ async def chat_with_images(self, messages: list[Message], max_tokens: int) -> st

input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(self.model.device)

params = self.get_generation_params(request)

with torch.inference_mode():
output_ids = self.model.generate(
input_ids,
images=image_tensor,
images_aux=None,
do_sample=False,
temperature=0.0,
max_new_tokens=max_tokens,
bos_token_id=self.tokenizer.bos_token_id, # Begin of sequence token
eos_token_id=self.tokenizer.eos_token_id, # End of sequence token
pad_token_id=self.tokenizer.pad_token_id, # Pad token
use_cache=True)
use_cache=True,
**params,
)

answer = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

Expand Down
11 changes: 6 additions & 5 deletions backend/monkey.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,19 +14,19 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)
super().__init__(model_id, device, extra_params, format)

# XXX currently bugged https://huggingface.co/echo840/Monkey/discussions/4
self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=self.params.get('trust_remote_code', False))
self.model = AutoModelForCausalLM.from_pretrained(**self.params).eval()
self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=self.params.get('trust_remote_code', False))

self.tokenizer.padding_side = 'left'
self.tokenizer.pad_token_id = self.tokenizer.eod_id

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
async def chat_with_images(self, request: ImageChatRequest) -> str:
files = []
prompt = ''

for m in messages:
for m in request.messages:
if m.role == 'user':
p = ''
for c in m.content:
Expand All @@ -48,19 +48,20 @@ async def chat_with_images(self, messages: list[Message], max_tokens: int) -> st
attention_mask = input_ids.attention_mask.to(self.model.device)
input_ids = input_ids.input_ids.to(self.model.device)

params = self.get_generation_params(request)

pred = self.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
do_sample=False,
num_beams=1,
max_new_tokens=512,
min_new_tokens=1,
length_penalty=1,
num_return_sequences=1,
output_hidden_states=True,
use_cache=True,
pad_token_id=self.tokenizer.eod_id,
eos_token_id=self.tokenizer.eod_id,
**params,
)
response = self.tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()

Expand Down
8 changes: 5 additions & 3 deletions backend/moondream1.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,10 +22,12 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
async def chat_with_images(self, request: ImageChatRequest) -> str:
images, prompt = await prompt_from_messages(request.messages, self.format)
encoded_images = self.model.encode_image(images[0]).to(self.model.device)

params = self.get_generation_params(request)

# XXX currently broken here...
"""
File "hf_home/modules/transformers_modules/vikhyatk/moondream1/f6e9da68e8f1b78b8f3ee10905d56826db7a5802/modeling_phi.py", line 318, in forward
Expand All @@ -37,7 +39,7 @@ async def chat_with_images(self, messages: list[Message], max_tokens: int) -> st
prompt,
eos_text="<END>",
tokenizer=self.tokenizer,
max_new_tokens=max_tokens,
**params,
)[0]
answer = re.sub("<$|<END$", "", answer).strip()
return answer
Expand Down
9 changes: 5 additions & 4 deletions backend/moondream2.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,18 +25,19 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await phi15_prompt_from_messages(messages)
async def chat_with_images(self, request: ImageChatRequest) -> str:
images, prompt = await prompt_from_messages(request.messages, format=self.format)

encoded_images = self.model.encode_image(images).to(self.device)

params = self.get_generation_params(request)

answer = self.model.generate(
encoded_images,
prompt,
eos_text="<END>",
tokenizer=self.tokenizer,
max_new_tokens=max_tokens,
#**kwargs,
**params,
)[0]
answer = re.sub("<$|<END$", "", answer).strip()
return answer
10 changes: 6 additions & 4 deletions backend/omnilmm12b.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,17 +10,17 @@ class VisionQnA(VisionQnABase):
def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

self.tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=2048) #trust_remote_code=self.params.get('trust_remote_code', False))
self.tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=2048, trust_remote_code=self.params.get('trust_remote_code', False))
self.model = AutoModel.from_pretrained(**self.params).to(dtype=self.params['torch_dtype']).eval()

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
async def chat_with_images(self, request: ImageChatRequest) -> str:
# 3B
image = None
msgs = []

for m in messages:
for m in request.messages:
if m.role == 'user':
for c in m.content:
if c.type == 'image_url':
Expand All @@ -32,12 +32,14 @@ async def chat_with_images(self, messages: list[Message], max_tokens: int) -> st
if c.type == 'text':
msgs.extend([{ 'role': 'assistant', 'content': c.text }])

params = self.get_generation_params(request)

answer, context, _ = self.model.chat(
image=image,
msgs=msgs,
context=None,
tokenizer=self.tokenizer,
max_new_tokens=max_tokens
**params,
)

return answer
Expand Down