Thanks for the great work! I was trying to reproduce your code, and I noticed that during pretraining, if you set mm_vision_output_token_count = 576, you get:
File "llava-token-compression/llava/model/multimodal_projector/quecc.py", line 74, in forward
query_states_2d = einops.rearrange(self.q_proj(x), 'b (h w) d -> b d h w',
File "llava-token-compression/moe/lib/python3.10/site-packages/einops/einops.py", line 483, in rearrange
return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)
File "llava-token-compression/moe/lib/python3.10/site-packages/einops/einops.py", line 420, in reduce
raise EinopsError(message + '\n {}'.format(e))
einops.EinopsError: Error while processing rearrange-reduction pattern "b (h w) d -> b d h w".
Input tensor shape: torch.Size([16, 256, 1024]). Additional info: {'h': 24, 'w': 24}.
Shape mismatch, 256 != 576
It seems it only works with --mm_vision_output_token_count 256 during pretraining and --mm_vision_output_token_count 576 during finetuning. How do I set it to 16, 4, or 1?
Also, during the evaluation for GQA:
Traceback (most recent call last):
File "llava-token-compression/moe/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "llava-token-compression/moe/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "llava-token-compression/llava/eval/model_vqa_loader.py", line 134, in <module>
eval_model(args)
File "llava-token-compression/llava/eval/model_vqa_loader.py", line 95, in eval_model
output_ids = model.generate(
File "llava-token-compression/moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "llava-token-compression/llava/model/language_model/llava_llama.py", line 135, in generate
) = self.prepare_inputs_labels_for_multimodal(
File "llava-token-compression/llava/model/llava_arch.py", line 243, in prepare_inputs_labels_for_multimodal
image_features = self.encode_images(images, query_text)
File "llava-token-compression/llava/model/llava_arch.py", line 152, in encode_images
features = self.get_model().get_vision_tower()(images, text)
File "llava-token-compression/moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "llava-token-compression/moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "llava-token-compression/moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "llava-token-compression/llava/model/multimodal_encoder/clip_encoder.py", line 101, in forward
text_input_tokens = self.llm_tokenizer(text, padding=True, return_tensors='pt').to(device=self.device)
TypeError: 'NoneType' object is not callable
mm_vision_output_token_count denotes the number of visual tokens the vision encoder (CLIP-L/14 in our case) produces. A ViT emits one token per image patch, so CLIP-L/14 at 336px resolution yields a 24x24 grid of 576 tokens. Your input tensor shape of 16x256x1024 corresponds to a 16x16 grid instead, so are you potentially using a different vision encoder or input resolution?
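For reference, the token count falls directly out of the ViT patch grid. A quick sanity check, assuming a standard CLIP ViT-L/14 with patch size 14:

```python
# Visual tokens from a ViT patch grid = (image_size / patch_size) ** 2
patch_size = 14
print((336 // patch_size) ** 2)  # 576 -- CLIP-L/14 at 336px, as expected here
print((224 // patch_size) ** 2)  # 256 -- CLIP-L/14 at 224px, which would explain
                                 #        the 16x256x1024 tensor in the traceback
```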
If you want to compress the tokens, adjust the mm_vision_token_compression_kernel_size and mm_vision_token_compression_stride variables. In our paper we set both to the same value: if you set both to x, the number of tokens is reduced by a factor of x**2 (see the sketch below).
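For illustration, here is a minimal sketch of that kind of reduction, with average pooling standing in for the actual projector logic in quecc.py (compress_tokens is a hypothetical helper, not the repo's API):

```python
import torch
import torch.nn.functional as F
import einops

def compress_tokens(x, k):
    """Pool a square token grid with kernel_size = stride = k (hypothetical helper)."""
    b, n, d = x.shape
    side = int(n ** 0.5)                                  # 24 for 576 tokens
    grid = einops.rearrange(x, 'b (h w) d -> b d h w', h=side, w=side)
    pooled = F.avg_pool2d(grid, kernel_size=k, stride=k)  # n -> n / k**2 tokens
    return einops.rearrange(pooled, 'b d h w -> b (h w) d')

tokens = torch.randn(2, 576, 1024)
for k in (6, 12, 24):
    print(k, compress_tokens(tokens, k).shape[1])  # 6 -> 16, 12 -> 4, 24 -> 1
```

So with 576 encoder tokens, setting kernel size and stride to 6, 12, or 24 would yield the 16, 4, or 1 tokens you asked about.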
Someone else mentioned issues during deployment/inference, which we are currently looking into. I believe they were able to solve it here: #4.
Thanks for the comment. I was able to fix my issue, but the inference issue you mentioned still persists, even though I added that line to the code.