[Model] Initial support for LLaVA-NeXT #4199
Merged
Changes from all commits (178 commits)
a26badd
Support image processor
DarkLight1337 adf2b94
Support image processor
DarkLight1337 ea4f8ed
Add LLaVA-NeXT architecture
DarkLight1337 1a0ecca
Convert dtype in multi modal processing
DarkLight1337 45b6756
Move `MultiModalData` to new subpackage `multimodal`
DarkLight1337 6ed8397
Add multi-modal processor registry
DarkLight1337 8c48208
Initialize the processor only once
DarkLight1337 613ec1b
Merge branch 'upstream' into mm-data-processor
DarkLight1337 c48a7d4
Move processor to model runner
DarkLight1337 3232231
Refactor registry to plugin pattern in order to support specifying du…
DarkLight1337 92a0283
Merge branch 'upstream' into mm-data-processor
DarkLight1337 5d42800
Combine prompt inputs
DarkLight1337 5db2c5e
Fix a bunch of tests
DarkLight1337 74c5905
Fix LLaVA test
DarkLight1337 cd8917b
Merge branch 'upstream' into llm-inputs
DarkLight1337 b49aba7
Fix `benchmark_latency` test
DarkLight1337 bfd7295
Merge branch 'upstream' into llm-inputs
DarkLight1337 45c7f23
Merge branch 'upstream' into llm-inputs
DarkLight1337 493e6ed
Merge branch 'upstream' into llm-inputs
DarkLight1337 df1b20b
Merge branch 'upstream' into mm-data-processor
DarkLight1337 20aeceb
Merge branch 'upstream' into llm-inputs
DarkLight1337 0f46653
Merge branch 'upstream' into llm-inputs
DarkLight1337 c4f3540
Clarify tokenizer usage
DarkLight1337 ab8182c
Rename `encode_request -> process_model_inputs`
DarkLight1337 eac33e1
Support old API in `LLM.generate`
DarkLight1337 0ff8189
Merge branch 'upstream' into mm-data-processor
DarkLight1337 9663b50
Fix import error
DarkLight1337 703d318
Add tests to ensure old API still works
DarkLight1337 19d85f9
Let all entrypoints tests be run at the same time
DarkLight1337 0cf2dbe
Merge branch 'upstream' into mm-data-processor
DarkLight1337 554e8c5
Apply formatter
DarkLight1337 0921bad
Merge branch 'upstream' into mm-data-processor
DarkLight1337 baebd99
Merge branch 'upstream' into llm-inputs
DarkLight1337 2cc5498
Merge branch 'upstream' into mm-data-processor
DarkLight1337 dc9816f
Merge branch 'upstream' into llm-inputs
DarkLight1337 1c50600
Merge branch 'upstream' into llm-inputs
DarkLight1337 5759dfa
Add tests for LLM.encode and fix corresponding bugs
DarkLight1337 cc4bfb5
Apply formatter
DarkLight1337 6085b08
Merge branch 'upstream' into llm-inputs
DarkLight1337 d5c9731
Rename `_add_requests` to `_validate_and_add_requests` to be more sim…
DarkLight1337 4f218a5
Separate `entrypoints` tests into two groups
DarkLight1337 428df48
Merge branch 'upstream' into mm-data-processor
DarkLight1337 f153450
Remove duplicate comment
DarkLight1337 a9201d0
Fix memory profiling error
DarkLight1337 ceebfa6
Fix memory usage for embedding server
DarkLight1337 7d991cd
Update embeddings API to use new inputs
DarkLight1337 0e79dfb
Merge branch 'upstream' into llm-inputs
DarkLight1337 b867b5e
Merge branch 'upstream' into mm-data-processor
DarkLight1337 2c0d58f
Merge branch 'upstream' into llm-inputs
DarkLight1337 26f7253
Merge branch 'upstream' into mm-data-processor
DarkLight1337 d553693
Apply formatter
DarkLight1337 48e7a4a
Merge branch 'upstream' into llm-inputs
DarkLight1337 595654c
Merge branch 'upstream' into mm-data-processor
DarkLight1337 b6c0e29
Merge branch 'upstream' into llm-inputs
DarkLight1337 e055472
Avoid duplicate `Tensor.to` calls
DarkLight1337 3097582
Merge `llm` groups back into one by enabling gc
DarkLight1337 9fe9bed
Add test for image pixel processor
DarkLight1337 222cb90
Improve CLI args
DarkLight1337 33294d5
Rename `multi_modal_datas` parameter
DarkLight1337 31cedac
Rename `input_processor` to be more explicit
DarkLight1337 21a0218
Rename `multi_modal_data` to be more explicit
DarkLight1337 32ae773
Remove patch for LLaVA-NeXT
DarkLight1337 78450eb
Apply formatter
DarkLight1337 f4defe6
Apply multi-modal refactor to `CPUModelRunner`
DarkLight1337 c43173b
Fix multi-modal handling in `EmbeddingModelRunner`
DarkLight1337 4c8e64e
Merge branch 'upstream' into mm-data-processor
DarkLight1337 ce58b25
Move dummy image data generation to model-agnostic file
DarkLight1337 d81f9f1
Add multimodal docs
DarkLight1337 7bbd123
Improve documentation for LLM/engine
DarkLight1337 056eb61
Direct readers to the `PromptInputs` class
DarkLight1337 b3b990a
Separate `_run_engine` from `_validate_and_add_requests`
DarkLight1337 2169def
Add flag for deprecating legacy API
DarkLight1337 3dbded1
Add tests for `deprecate_kwargs`
DarkLight1337 8e20317
Apply formatter
DarkLight1337 fdccaa2
Rename attribute to be less misleading
DarkLight1337 77ee1c8
Re-enable using `'fork'` start method and improve speed by using `torch…
DarkLight1337 b1bcdd1
Simplify logic of casting request output
DarkLight1337 44b4681
Improve code readability
DarkLight1337 50343cb
Fix `multi_modal_data` being a required key
DarkLight1337 45aa420
Fix index out of range error
DarkLight1337 d4e2589
Use a flag to control whether to check output types
DarkLight1337 c07b579
Simplify flags
DarkLight1337 9d56eb0
Move output validation to a more appropriate location
DarkLight1337 bc05031
Add message to deprecation notice
DarkLight1337 95d4130
Apply formatter
DarkLight1337 cc84f65
Remove unused parameter in `_validate_and_add_requests` and fix test
DarkLight1337 6c5d4a6
Simplify code
DarkLight1337 fd2da12
Move attribute assignment outside `_init_tokenizer`
DarkLight1337 d78de94
Only emit warning once
DarkLight1337 8a86829
Simplify assignment expression
DarkLight1337 731ac0e
Place special case at the start
DarkLight1337 2d1a0bc
move API reference to under developer doc
ywang96 7b8ce2c
Fix links in docs
DarkLight1337 fff21a1
Remove unnecessary code to avoid repeated warning
DarkLight1337 82233ec
Merge branch 'llm-inputs' into mm-data-processor
DarkLight1337 797e8a5
Simplify code and fix type annotations
DarkLight1337 e10b3fc
Update docs
DarkLight1337 c6a9fcf
Use intersphinx and avoid long default values
DarkLight1337 a26e1e3
Merge branch 'upstream' into mm-data-processor
DarkLight1337 883bea4
Apply formatter
DarkLight1337 46bc1ea
Merge branch 'upstream' into mm-data-processor
DarkLight1337 d350bb3
Fix bad merge
DarkLight1337 2a166a7
Do not support multiple multimodal data in legacy API
DarkLight1337 db12c29
Reinstate whitespace
DarkLight1337 4a0a85c
Merge branch 'upstream' into mm-data-processor
DarkLight1337 6529280
Merge branch 'upstream' into mm-data-processor
DarkLight1337 dc6c5fd
Fix bad config dict
DarkLight1337 2ed2fdc
Fix tests
DarkLight1337 8d09112
Apply formatter
DarkLight1337 3fe1f61
Remove `multi_modal_data` support in legacy API
DarkLight1337 46af1ac
Add NOTE and TODO
DarkLight1337 f620a1b
Add missing type annotations
DarkLight1337 70b4165
Rename functions
DarkLight1337 87c2da4
Add NOTE
DarkLight1337 7fc620c
Fix multimodal inputs being on wrong device
DarkLight1337 cd63022
Rename `MM_REGISTRY` to be more explicit
DarkLight1337 19fea82
Merge branch 'upstream' into mm-data-processor
DarkLight1337 43f2660
fix upstream merge
ywang96 5d3a063
Merge branch 'upstream' into mm-data-processor
DarkLight1337 b6754a4
Enable passing tensor directly as image
DarkLight1337 01b0512
Add pillow to intersphinx and fix quote format
DarkLight1337 a996b34
Fix mock imports
DarkLight1337 52ed274
Trigger pipeline
DarkLight1337 559bd46
Automatically convert dtype
DarkLight1337 69c4ff6
Comment out failing test for now
DarkLight1337 960e5eb
Fix blank pages in docs
DarkLight1337 a3c6fdb
Use the module name, not package name
DarkLight1337 d78d456
Trigger pipeline
DarkLight1337 243eb90
Trigger pipeline 2
DarkLight1337 501b11c
Fix formatting [skip ci]
DarkLight1337 3d20f6d
Merge branch 'upstream' into mm-data-processor
DarkLight1337 680cee9
Merge branch 'upstream' into mm-data-processor
DarkLight1337 2f0178b
Merge branch 'mm-data-processor' into llava-next
DarkLight1337 dd461f3
Fix bad merge
DarkLight1337 91dc8a9
Fix bad merge
DarkLight1337 6ae4fc1
Merge branch 'upstream' into llava-next
DarkLight1337 89930a4
Run LLaVA-NeXT tests in CI
DarkLight1337 95c0469
Simplify test specification
DarkLight1337 456c180
Fix unable to initialize LLaVA-NeXT model
DarkLight1337 411eeb3
Fix OOM when loading LLaVA-NeXT on HuggingFace
DarkLight1337 93384b9
Fix LLaVA-NeXT not using multimodal registry
DarkLight1337 3a5bf29
Improve error message
DarkLight1337 193daa8
Fix `image_sizes` being missing when tensor is passed directly
DarkLight1337 3f3eccf
Fix incorrect dummy data
DarkLight1337 4ca713e
Add validation for `image_sizes`
DarkLight1337 6b8b850
Merge branch 'upstream' into llava-next
DarkLight1337 abd76a0
Fix model not being able to be split across GPUs
DarkLight1337 7b8a3df
Fix wrong shape
DarkLight1337 930aa4b
Test LLaVA-NeXT processor
DarkLight1337 cd60af8
Remove unnecessary `worker_use_ray`
DarkLight1337 d843b0b
Fix incorrect template for LLaVA(-NeXT) tests
DarkLight1337 cdb0699
Clean up model loading
DarkLight1337 7ea733a
Use a smaller LLaVA-NeXT model for testing
DarkLight1337 b5fbe46
Improve repr for easier debugging
DarkLight1337 246bf1b
Revert `device_map="auto"` since the model can fit in one GPU now
DarkLight1337 02f3ef5
Fix insufficient `max_model_len`
DarkLight1337 bc03534
Apply formatter
DarkLight1337 0586af9
Resize image to match the required number of tokens
DarkLight1337 7adcc79
Merge branch 'upstream' into llava-next
DarkLight1337 0cd4e25
Remove unnecessary gc
DarkLight1337 52e12cb
Remove `tp>1` test as it caused ray workers to hang at the end
DarkLight1337 ac3162f
Add xfail
DarkLight1337 8032ba9
Fix broken CI template
DarkLight1337 556e3fd
Also xfail LLaVA-NeXT processor
DarkLight1337 ec24033
Disallow image features in LLaVA-NeXT
DarkLight1337 4d1ce23
Merge branch 'upstream' into llava-next
DarkLight1337 c122409
Add reference
DarkLight1337 4d40449
Move input type check to initialization time
DarkLight1337 c748dd9
Add warning when image is resized
DarkLight1337 f235732
Avoid model inheritance
DarkLight1337 935a7f9
Apply formatter
DarkLight1337 f1dd1e3
Merge branch 'upstream' into llava-next
DarkLight1337 e586b81
Merge branch 'upstream' into llava-next
DarkLight1337 9afe5b4
Also use context manager in LLaVA-NeXT test
DarkLight1337 fa89a22
update supported models
ywang96 2df8398
Remove asterisk
DarkLight1337 23cb8fa
Use proper capitalization
DarkLight1337 1ed7bf2
Merge branch 'upstream' into llava-next
DarkLight1337
New test file (123 lines added):

from typing import List, Tuple

import pytest
from transformers import AutoTokenizer

from vllm.config import VisionLanguageConfig

from ..conftest import IMAGE_FILES

pytestmark = pytest.mark.llava

_PREFACE = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's "
    "questions.")

# The image token is placed before "user" on purpose so that the test can pass
HF_IMAGE_PROMPTS = [
    f"{_PREFACE} <image>\nUSER: What's the content of the image? ASSISTANT:",
    f"{_PREFACE} <image>\nUSER: What is the season? ASSISTANT:",
]

assert len(HF_IMAGE_PROMPTS) == len(IMAGE_FILES)


def iter_llava_next_configs(model_name: str):
    image_hw_to_feature_size = {
        (336, 336): 1176,
        (672, 672): 2928,
        (1344, 336): 1944,
        (336, 1344): 1890,
    }

    for (h, w), f in image_hw_to_feature_size.items():
        for input_type, input_shape in [
            (VisionLanguageConfig.ImageInputType.PIXEL_VALUES, (1, 3, h, w)),
        ]:
            yield (model_name,
                   VisionLanguageConfig(image_input_type=input_type,
                                        image_feature_size=f,
                                        image_token_id=32000,
                                        image_input_shape=input_shape,
                                        image_processor=model_name,
                                        image_processor_revision=None))


model_and_vl_config = [
    *iter_llava_next_configs("llava-hf/llava-v1.6-vicuna-7b-hf"),
]
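The feature sizes hardcoded in `image_hw_to_feature_size` are not arbitrary; they follow from LLaVA-NeXT's "anyres" scheme. A rough cross-check can be sketched, assuming a 336-pixel tile, 14-pixel patches, one newline token per feature row, and the tile grids that HF's best-resolution selection would pick for each size — the grids passed below are hardcoded assumptions for illustration, not values read from the model config:

```python
def anyres_feature_size(orig_h: int, orig_w: int,
                        grid_h: int, grid_w: int,
                        tile: int = 336, patch: int = 14) -> int:
    """Count image feature tokens for one image under the anyres scheme."""
    per_side = tile // patch            # 24 features per tile side
    base = per_side * per_side          # 576 features from the low-res pass
    feat_h = grid_h * per_side          # feature grid of the tiled pass
    feat_w = grid_w * per_side
    # Unpad the feature grid back to the original aspect ratio
    if orig_h / orig_w > feat_h / feat_w:
        feat_w = int(feat_h * orig_w / orig_h)   # width was padded
    else:
        feat_h = int(feat_w * orig_h / orig_w)   # height was padded
    # One newline token is appended per feature row
    return base + feat_h * feat_w + feat_h

# (h, w) -> assumed tile grid (grid_h, grid_w)
assert anyres_feature_size(336, 336, 1, 2) == 1176
assert anyres_feature_size(672, 672, 2, 2) == 2928
assert anyres_feature_size(1344, 336, 3, 1) == 1944
assert anyres_feature_size(336, 1344, 1, 3) == 1890
```

All four entries in the test's table are reproduced under these assumptions, which is why `image_feature_size` must vary with `image_input_shape` rather than being a single constant as in the original LLaVA.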
def vllm_to_hf_output(vllm_output: Tuple[List[int], str],
                      vlm_config: VisionLanguageConfig, model_id: str):
    """Sanitize vllm output to be comparable with hf output.
    The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
    x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
    It also reduces `output_str` from "<image><image>bla" to "bla".
    """
    input_ids, output_str = vllm_output
    image_token_id = vlm_config.image_token_id

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    image_token_str = tokenizer.decode(image_token_id)

    hf_input_ids = [
        input_id for idx, input_id in enumerate(input_ids)
        if input_id != image_token_id or input_ids[idx - 1] != image_token_id
    ]
    hf_output_str = output_str \
        .replace(image_token_str * vlm_config.image_feature_size, " ")

    return hf_input_ids, hf_output_str
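The comprehension in `vllm_to_hf_output` that collapses the run of image tokens can be checked in isolation on a toy sequence (the token ids below are illustrative):

```python
image_token_id = 32000
input_ids = [1, 32000, 32000, 32000, 5, 6]

# Keep an image token only when the previous token was not also one,
# so a run of repeated image tokens collapses to a single 32000.
hf_input_ids = [
    tok for idx, tok in enumerate(input_ids)
    if tok != image_token_id or input_ids[idx - 1] != image_token_id
]
assert hf_input_ids == [1, 32000, 5, 6]
```

Note that for `idx == 0` the condition reads `input_ids[-1]`, i.e. it wraps around to the last token; in these tests the prompt always begins with the BOS token (id 1), so the wraparound never changes the result.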
@pytest.mark.xfail(
    reason="Inconsistent image processor being used due to lack "
    "of support for dynamic image token replacement")
@pytest.mark.parametrize("model_and_config", model_and_vl_config)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [128])
def test_models(hf_runner, vllm_runner, hf_images, vllm_images,
                model_and_config, dtype: str, max_tokens: int) -> None:
    """Inference result should be the same between hf and vllm.

    All the image fixtures for the test are under tests/images.
    For the huggingface runner, we provide the PIL images as input.
    For the vllm runner, we provide MultiModalData objects and the
    corresponding vision language config as input.
    Note that the text input is also adjusted to abide by the vllm contract.
    The text output is sanitized so that it can be compared with hf.
    """
    model_id, vlm_config = model_and_config

    with hf_runner(model_id, dtype=dtype, is_vision_model=True) as hf_model:
        hf_outputs = hf_model.generate_greedy(HF_IMAGE_PROMPTS,
                                              max_tokens,
                                              images=hf_images)

    vllm_image_prompts = [
        p.replace("<image>", "<image>" * vlm_config.image_feature_size)
        for p in HF_IMAGE_PROMPTS
    ]

    with vllm_runner(
            model_id,
            dtype=dtype,
            # should be greater than image_feature_size
            max_model_len=4096,
            enforce_eager=True,
            **vlm_config.as_cli_args_dict(),
    ) as vllm_model:
        vllm_outputs = vllm_model.generate_greedy(vllm_image_prompts,
                                                  max_tokens,
                                                  images=vllm_images)

    for i in range(len(HF_IMAGE_PROMPTS)):
        hf_output_ids, hf_output_str = hf_outputs[i]
        vllm_output_ids, vllm_output_str = vllm_to_hf_output(
            vllm_outputs[i], vlm_config, model_id)
        assert hf_output_str == vllm_output_str, (
            f"Test{i}:\nHF: {hf_output_str!r}\nvLLM: {vllm_output_str!r}")
        assert hf_output_ids == vllm_output_ids, (
            f"Test{i}:\nHF: {hf_output_ids}\nvLLM: {vllm_output_ids}")
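The two halves of the test's contract — expanding each `<image>` placeholder to `image_feature_size` copies before generation, then collapsing the run again during sanitization — can be exercised as a toy string round-trip (the prompt and feature size below are hypothetical, not the test's actual values):

```python
feature_size = 4
image_token = "<image>"

prompt = "USER: <image>\nWhat is shown? ASSISTANT:"
# Expansion step, as in vllm_image_prompts above
expanded = prompt.replace(image_token, image_token * feature_size)
assert expanded.count(image_token) == feature_size

# Sanitization step, mirroring the string branch of vllm_to_hf_output:
# the run of repeated image tokens is replaced by a single space.
sanitized = expanded.replace(image_token * feature_size, " ")
assert sanitized == "USER:  \nWhat is shown? ASSISTANT:"
```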
Review comment: Is there a specific reason why we changed this, and why we changed it to 560?
Reply: I think I originally made a typo (it was meant to be 336, not 33). But I should have made the image larger than that anyway, to test whether HF does resizing in the same way.