Thanks for the great work! I was trying to reproduce your code, and I noticed that during pretraining, if you set mm_vision_output_token_count = 576, you get:
File "llava-token-compression/llava/model/multimodal_projector/quecc.py", line 74, in forward
query_states_2d = einops.rearrange(self.q_proj(x), 'b (h w) d -> b d h w',
File "llava-token-compression/moe/lib/python3.10/site-packages/einops/einops.py", line 483, in rearrange
return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)
File "llava-token-compression/moe/lib/python3.10/site-packages/einops/einops.py", line 420, in reduce
raise EinopsError(message + '\n {}'.format(e))
einops.EinopsError: Error while processing rearrange-reduction pattern "b (h w) d -> b d h w".
Input tensor shape: torch.Size([16, 256, 1024]). Additional info: {'h': 24, 'w': 24}.
Shape mismatch, 256 != 576
It seems it only works with --mm_vision_output_token_count 256 during pretraining and --mm_vision_output_token_count 576 during finetuning. How do I set it to 16, 4, or 1?
Also, during the evaluation for GQA:
Traceback (most recent call last):
File "llava-token-compression/moe/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "llava-token-compression/moe/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "llava-token-compression/llava/eval/model_vqa_loader.py", line 134, in <module>
eval_model(args)
File "llava-token-compression/llava/eval/model_vqa_loader.py", line 95, in eval_model
output_ids = model.generate(
File "llava-token-compression/moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "llava-token-compression/llava/model/language_model/llava_llama.py", line 135, in generate
) = self.prepare_inputs_labels_for_multimodal(
File "llava-token-compression/llava/model/llava_arch.py", line 243, in prepare_inputs_labels_for_multimodal
image_features = self.encode_images(images, query_text)
File "llava-token-compression/llava/model/llava_arch.py", line 152, in encode_images
features = self.get_model().get_vision_tower()(images, text)
File "llava-token-compression/moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "llava-token-compression/moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "llava-token-compression/moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "llava-token-compression/llava/model/multimodal_encoder/clip_encoder.py", line 101, in forward
text_input_tokens = self.llm_tokenizer(text, padding=True, return_tensors='pt').to(device=self.device)
TypeError: 'NoneType' object is not callable
mm_vision_output_token_count denotes the number of visual tokens the vision encoder (CLIP-L/14 in our case) produces. A ViT emits one token per image patch, so CLIP-L/14 at 336px resolution yields a 24x24 grid of 576 tokens. Your input tensor shape of 16x256x1024 corresponds to a 16x16 grid instead, so are you potentially using a different vision encoder or input resolution?
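For reference, the token count falls directly out of the ViT patch grid. A quick sanity check, assuming a standard CLIP ViT-L/14 with patch size 14:

```python
# Visual tokens from a ViT patch grid = (image_size / patch_size) ** 2
patch_size = 14
print((336 // patch_size) ** 2)  # 576 -- CLIP-L/14 at 336px, as expected here
print((224 // patch_size) ** 2)  # 256 -- CLIP-L/14 at 224px, which would explain
                                 #        the 16x256x1024 tensor in the traceback
```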
If you want to compress the tokens, adjust the mm_vision_token_compression_kernel_size and mm_vision_token_compression_stride variables. In our paper we set both to the same value: if you set both to x, the number of tokens is reduced by a factor of x**2 (see the sketch below).
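For illustration, here is a minimal sketch of that kind of reduction, with average pooling standing in for the actual projector logic in quecc.py (compress_tokens is a hypothetical helper, not the repo's API):

```python
import torch
import torch.nn.functional as F
import einops

def compress_tokens(x, k):
    """Pool a square token grid with kernel_size = stride = k (hypothetical helper)."""
    b, n, d = x.shape
    side = int(n ** 0.5)                                  # 24 for 576 tokens
    grid = einops.rearrange(x, 'b (h w) d -> b d h w', h=side, w=side)
    pooled = F.avg_pool2d(grid, kernel_size=k, stride=k)  # n -> n / k**2 tokens
    return einops.rearrange(pooled, 'b d h w -> b (h w) d')

tokens = torch.randn(2, 576, 1024)
for k in (6, 12, 24):
    print(k, compress_tokens(tokens, k).shape[1])  # 6 -> 16, 12 -> 4, 24 -> 1
```

So with 576 encoder tokens, setting kernel size and stride to 6, 12, or 24 would yield the 16, 4, or 1 tokens you asked about.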
Someone else mentioned issues during deployment/inference, which we are currently looking into. I believe they were able to solve it here: #4.
Thanks for the comment. I was able to fix my issue, but the inference issue you mentioned still persists, even though I added that line to the code.