Commit 0.6.0
matatonic committed Apr 6, 2024
1 parent 3401b8e commit 7724a24
Showing 19 changed files with 287 additions and 140 deletions.
63 changes: 39 additions & 24 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,12 +7,12 @@ An OpenAI API compatible vision server, it functions like `gpt-4-vision-preview`
- Does not connect to the OpenAI API and does not require an OpenAI API Key
- Not affiliated with OpenAI in any way

Backend Model support:
Model support:
- [X] [InternLM-XComposer2](https://huggingface.co/internlm/internlm-xcomposer2-7b) [finetune] (multi-image chat model, lots of warnings on startup, but works fine)
- [X] [InternLM-XComposer2-VL](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b) [pretrain] *(only supports a single image, also lots of warnings)
- [X] [LlavaNext](https://huggingface.co/llava-hf) - (llava-v1.6-mistral-7b-hf, llava-v1.6-34b-hf - llava-v1.6-34b-hf is not working well yet) *(only supports a single image)
- [X] [Llava](https://huggingface.co/llava-hf) - (llava-v1.5-vicuna-7b-hf, llava-v1.5-vicuna-13b-hf, llava-v1.5-bakLlava-7b-hf) *(only supports a single image)
- [X] [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)
- [X] [InternLM-XComposer2](https://huggingface.co/internlm/internlm-xcomposer2-7b) [finetune] (multi-image chat model, you may need to add "in English" to the first prompt.)
- [X] [InternLM-XComposer2-VL](https://huggingface.co/internlm/internlm-xcomposer2-vl-7b) [pretrain] *(only supports a single image)
- [X] Moondream2 - [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) *(only supports a single image)
- [ ] Moondream1 - [vikhyatk/moondream1](https://huggingface.co/vikhyatk/moondream1)
- [ ] Deepseek-VL - [deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)
Expand All @@ -27,20 +27,15 @@ Some vision systems include their own OpenAI compatible API server. Also include
- [X] [THUDM/CogVLM](https://github.com/THUDM/CogVLM) ([cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf), [cogagent-chat-hf](https://huggingface.co/THUDM/cogagent-chat-hf)), `docker-compose.cogvlm.yml` **Recommended for 16GB-40GB GPUs**
- [X] [01-ai](https://huggingface.co/01-ai)/Yi-VL ([Yi-VL-6B](https://huggingface.co/01-ai/Yi-VL-6B), [Yi-VL-34B](https://huggingface.co/01-ai/Yi-VL-34B)), `docker-compose.yi-vl.yml`

Version: 0.5.0
Version: 0.6.0

Recent updates:
- new backend: XComposer2 (multi-image finetuned chat model)
- new backend: XComposer2-VL (single image pretrained model)
- new backend: MiniCPM-V aka. OmniLMM-3B
- Yi-VL and CogVLM (docker containers only)
- new backend: Qwen-VL
- new backend: llava (1.5)
- new backend: llavanext (1.6+)
- multi-turn questions & answers
- chat_with_images.py test tool and code sample
- selectable chat formats
- flash attention 2, accelerate (device split), bitsandbytes (4bit, 8bit) support
- Automatic selection of backend, based on the model name
- Enable trust_remote_code by default
- Improved parameter support: temperature, top_p, max_tokens, system prompts
- Improved default generation parameters and sampler settings
- Improved system prompt for InternLM-XComposer2 & InternLM-XComposer2-VL: fewer refusals, and "In English" should rarely be needed while still supporting Chinese.
- Fix: chat_with_images.py url filename bug


See: [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)
Expand All @@ -54,36 +49,38 @@ API Documentation
Installation instructions
-------------------------

(**Docker Recommended**)

```shell
# install the python dependencies
pip install -r requirements.txt
# install backend-specific requirements (only for the backends you plan to use)
pip install -r requirements.moondream.txt -r requirements.qwen-vl.txt
# install the package
pip install .
# run the server
python vision.py
# run the server with your chosen model
python vision.py --model vikhyatk/moondream2
```
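
Optional flags from the Usage section below can be combined with `--model`; for example, 4-bit quantization or an explicit device (neither works with every model). A quick sketch:

```shell
# optional: 4-bit quantization and a pinned GPU (not supported by all models)
python vision.py --model llava-hf/llava-v1.6-mistral-7b-hf --load-in-4bit --device cuda:0
```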

Usage
-----

```
usage: vision.py [-h] [-m MODEL] [-b BACKEND] [-f FORMAT] [-d DEVICE] [--no-trust-remote-code] [-4] [-8] [-F] [-P PORT] [-H HOST] [--preload]
usage: vision.py [-h] -m MODEL [-b BACKEND] [-f FORMAT] [-d DEVICE] [--no-trust-remote-code] [-4] [-8] [-F] [-P PORT] [-H HOST] [--preload]
OpenedAI Vision API Server
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
The model to use, Ex. llava-hf/llava-v1.6-mistral-7b-hf (default: vikhyatk/moondream2)
The model to use, Ex. llava-hf/llava-v1.6-mistral-7b-hf (default: None)
-b BACKEND, --backend BACKEND
Force the backend to use (moondream1, moondream2, llavanext, llava, qwen-vl) (default: None)
-f FORMAT, --format FORMAT
Force a specific chat format. (vicuna, mistral, chatml, llama2, phi15, gemma) (doesn't work with all models) (default: None)
-d DEVICE, --device DEVICE
Set the torch device for the model. Ex. cuda:1 (default: auto)
--no-trust-remote-code
Don't trust remote code (required for some models) (default: False)
-4, --load-in-4bit load in 4bit (doesn't work with all models) (default: False)
-8, --load-in-8bit load in 8bit (doesn't work with all models) (default: False)
Expand All @@ -96,9 +93,15 @@ options:
Docker support
--------------

You can run the server via docker like so:
1) Edit the docker-compose file to suit your needs.

2) You can run the server via docker like so:
```shell
docker compose up
# for CogVLM
docker compose -f docker-compose.cogvlm.yml up
# for Yi-VL
docker compose -f docker-compose.yi-vl.yml up
```
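
To keep the containers running in the background, the standard compose flags apply:

```shell
# run detached and follow the logs
docker compose up -d
docker compose logs -f
```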

Sample API Usage
Expand All @@ -109,11 +112,23 @@ Sample API Usage
Example:
```
$ python chat_with_image.py https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg
Answer: This is a beautiful image of a wooden path leading through a lush green field. The path appears to be well-trodden, suggesting it's a popular route for walking or hiking. The sky is a clear blue with some scattered clouds, indicating a pleasant day with good weather. The field is vibrant and seems to be well-maintained, which could suggest it's part of a park or nature reserve. The overall scene is serene and inviting, perfect for a peaceful walk in nature.
Answer: The image captures a serene landscape of a grassy field, where a wooden walkway cuts through the center. The path is flanked by tall, lush green grass on either side, leading the eye towards the horizon. A few trees and bushes are scattered in the distance, adding depth to the scene. Above, the sky is a clear blue, dotted with white clouds that add to the tranquil atmosphere.
Question: Are there any animals in the picture?
Answer: No, there are no animals visible in the picture. The focus is on the path and the surrounding natural landscape.
Answer: No, there are no animals visible in the picture.
Question:
Question: ^D
$
```
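
Because the server speaks the OpenAI chat completions protocol, the standard `openai` Python client can also be pointed at it directly. A minimal sketch follows; the port, API key, and model string are assumptions, so match them to how you started the server:

```python
# Assumes the server is listening locally; adjust the port to your -P/--port setting.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5006/v1", api_key="sk-unused")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder; the server answers with whatever model it loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```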

Known Bugs & Workarounds
------------------------

1. Related to the CUDA device split. If you get:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
```
Try specifying a single CUDA device with `CUDA_VISIBLE_DEVICES=1` (or the number of your GPU) before running the script, or set the device via `--device <device>` on the command line (see the sketch after this list).

2. 4bit/8bit and flash attention 2 don't work for all the models. No workaround.
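
For bug 1, a concrete sketch of the two workarounds (the GPU index and model are placeholders; substitute your own):

```shell
# pin the process to a single GPU before starting the server
CUDA_VISIBLE_DEVICES=1 python vision.py --model vikhyatk/moondream2
# or select the torch device explicitly
python vision.py --model vikhyatk/moondream2 --device cuda:1
```
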
2 changes: 1 addition & 1 deletion backend/deepseek-vl.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
async def chat_with_images(self, request: ImageChatRequest) -> str:
# XXX WIP
conversation = [
{
Expand Down
9 changes: 6 additions & 3 deletions backend/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,15 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
async def chat_with_images(self, request: ImageChatRequest) -> str:
images, prompt = await prompt_from_messages(request.messages, self.format)

encoded_images = self.model.encode_image(images)
inputs = self.tokenizer(prompt, encoded_images, return_tensors="pt")
output = self.model.generate(**inputs, max_new_tokens=max_tokens)

params = self.get_generation_params(request)

output = self.model.generate(**inputs, **params)
response = self.tokenizer.decode(output[0], skip_special_tokens=True)

return answer_from_response(response, self.format)
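
The change repeated across these backends is the same: `chat_with_images` now takes the whole `ImageChatRequest` and derives its sampler settings from a shared `get_generation_params` helper instead of a bare `max_tokens`. As a rough illustration only, since the real helper lives in the backend base class (not part of this diff) and these field names and defaults are assumptions, such a helper might look like:

```python
# Hypothetical sketch of a base-class helper like the one these backends call.
# Request fields mirror the OpenAI chat API; the defaults here are guesses.
def get_generation_params(self, request) -> dict:
    params = {
        "max_new_tokens": request.max_tokens or 512,
        "do_sample": False,
    }
    if request.temperature:  # a non-zero temperature enables sampling
        params.update(do_sample=True, temperature=request.temperature)
    if request.top_p is not None:
        params["top_p"] = request.top_p
    return params
```
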
8 changes: 5 additions & 3 deletions backend/llava.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,14 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
async def chat_with_images(self, request: ImageChatRequest) -> str:

images, prompt = await prompt_from_messages(messages, self.format)
images, prompt = await prompt_from_messages(request.messages, self.format)
inputs = self.processor(prompt, images, return_tensors="pt").to(self.device)

output = self.model.generate(**inputs, max_new_tokens=max_tokens)
params = self.get_generation_params(request)

output = self.model.generate(**inputs, **params)
response = self.processor.decode(output[0], skip_special_tokens=True)

return answer_from_response(response, self.format)
10 changes: 6 additions & 4 deletions backend/llavanext.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,14 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
async def chat_with_images(self, request: ImageChatRequest) -> str:

images, prompt = await prompt_from_messages(request.messages, self.format)
inputs = self.processor(prompt, images, return_tensors="pt").to(self.model.device)

output = self.model.generate(**inputs, max_new_tokens=max_tokens)
params = self.get_generation_params(request)

output = self.model.generate(**inputs, **params)
response = self.processor.decode(output[0], skip_special_tokens=True)

return answer_from_response(response, self.format)
13 changes: 7 additions & 6 deletions backend/minigemini.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,8 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
async def chat_with_images(self, request: ImageChatRequest) -> str:
images, prompt = await prompt_from_messages(request.messages, self.format)

#encoded_images = self.model.encode_image(images).to(self.device)
# square?
Expand All @@ -32,18 +32,19 @@ async def chat_with_images(self, messages: list[Message], max_tokens: int) -> st

input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).to(self.model.device)

params = self.get_generation_params(request)

with torch.inference_mode():
output_ids = self.model.generate(
input_ids,
images=image_tensor,
images_aux=None,
do_sample=False,
temperature=0.0,
max_new_tokens=max_tokens,
bos_token_id=self.tokenizer.bos_token_id, # Begin of sequence token
eos_token_id=self.tokenizer.eos_token_id, # End of sequence token
pad_token_id=self.tokenizer.pad_token_id, # Pad token
use_cache=True)
use_cache=True,
**params,
)

answer = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

Expand Down
11 changes: 6 additions & 5 deletions backend/monkey.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,19 +14,19 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)
super().__init__(model_id, device, extra_params, format)

# XXX currently bugged https://huggingface.co/echo840/Monkey/discussions/4
self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=self.params.get('trust_remote_code', False))
self.model = AutoModelForCausalLM.from_pretrained(**self.params).eval()
self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=self.params.get('trust_remote_code', False))

self.tokenizer.padding_side = 'left'
self.tokenizer.pad_token_id = self.tokenizer.eod_id

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
async def chat_with_images(self, request: ImageChatRequest) -> str:
files = []
prompt = ''

for m in messages:
for m in request.messages:
if m.role == 'user':
p = ''
for c in m.content:
Expand All @@ -48,19 +48,20 @@ async def chat_with_images(self, messages: list[Message], max_tokens: int) -> st
attention_mask = input_ids.attention_mask.to(self.model.device)
input_ids = input_ids.input_ids.to(self.model.device)

params = self.get_generation_params(request)

pred = self.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
do_sample=False,
num_beams=1,
max_new_tokens=512,
min_new_tokens=1,
length_penalty=1,
num_return_sequences=1,
output_hidden_states=True,
use_cache=True,
pad_token_id=self.tokenizer.eod_id,
eos_token_id=self.tokenizer.eod_id,
**params,
)
response = self.tokenizer.decode(pred[0][input_ids.size(1):].cpu(), skip_special_tokens=True).strip()

Expand Down
8 changes: 5 additions & 3 deletions backend/moondream1.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,10 +22,12 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await prompt_from_messages(messages, self.format)
async def chat_with_images(self, request: ImageChatRequest) -> str:
images, prompt = await prompt_from_messages(request.messages, self.format)
encoded_images = self.model.encode_image(images[0]).to(self.model.device)

params = self.get_generation_params(request)

# XXX currently broken here...
"""
File "hf_home/modules/transformers_modules/vikhyatk/moondream1/f6e9da68e8f1b78b8f3ee10905d56826db7a5802/modeling_phi.py", line 318, in forward
Expand All @@ -37,7 +39,7 @@ async def chat_with_images(self, messages: list[Message], max_tokens: int) -> st
prompt,
eos_text="<END>",
tokenizer=self.tokenizer,
max_new_tokens=max_tokens,
**params,
)[0]
answer = re.sub("<$|<END$", "", answer).strip()
return answer
Expand Down
9 changes: 5 additions & 4 deletions backend/moondream2.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,18 +25,19 @@ def __init__(self, model_id: str, device: str, extra_params = {}, format = None)

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
images, prompt = await phi15_prompt_from_messages(messages)
async def chat_with_images(self, request: ImageChatRequest) -> str:
images, prompt = await prompt_from_messages(request.messages, format=self.format)

encoded_images = self.model.encode_image(images).to(self.device)

params = self.get_generation_params(request)

answer = self.model.generate(
encoded_images,
prompt,
eos_text="<END>",
tokenizer=self.tokenizer,
max_new_tokens=max_tokens,
#**kwargs,
**params,
)[0]
answer = re.sub("<$|<END$", "", answer).strip()
return answer
10 changes: 6 additions & 4 deletions backend/omnilmm12b.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,17 +10,17 @@ class VisionQnA(VisionQnABase):
def __init__(self, model_id: str, device: str, extra_params = {}, format = None):
super().__init__(model_id, device, extra_params, format)

self.tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=2048) #trust_remote_code=self.params.get('trust_remote_code', False))
self.tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=2048, trust_remote_code=self.params.get('trust_remote_code', False))
self.model = AutoModel.from_pretrained(**self.params).to(dtype=self.params['torch_dtype']).eval()

print(f"Loaded on device: {self.model.device} with dtype: {self.model.dtype}")

async def chat_with_images(self, messages: list[Message], max_tokens: int) -> str:
async def chat_with_images(self, request: ImageChatRequest) -> str:
# 3B
image = None
msgs = []

for m in messages:
for m in request.messages:
if m.role == 'user':
for c in m.content:
if c.type == 'image_url':
Expand All @@ -32,12 +32,14 @@ async def chat_with_images(self, messages: list[Message], max_tokens: int) -> st
if c.type == 'text':
msgs.extend([{ 'role': 'assistant', 'content': c.text }])

params = self.get_generation_params(request)

answer, context, _ = self.model.chat(
image=image,
msgs=msgs,
context=None,
tokenizer=self.tokenizer,
max_new_tokens=max_tokens
**params,
)

return answer
Expand Down