pretraining Shape mismatch issue #6

Open
mzamini92 opened this issue Nov 26, 2024 · 2 comments

Comments


mzamini92 commented Nov 26, 2024

Thanks for the great work. While trying to reproduce your code, I noticed that during pretraining, if you set mm_vision_output_token_count = 576, you get:

  File "llava-token-compression/llava/model/multimodal_projector/quecc.py", line 74, in forward
    query_states_2d = einops.rearrange(self.q_proj(x), 'b (h w) d -> b d h w',
  File "llava-token-compression/moe/lib/python3.10/site-packages/einops/einops.py", line 483, in rearrange
    return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)
  File "llava-token-compression/moe/lib/python3.10/site-packages/einops/einops.py", line 420, in reduce
    raise EinopsError(message + '\n {}'.format(e))
einops.EinopsError:  Error while processing rearrange-reduction pattern "b (h w) d -> b d h w".
 Input tensor shape: torch.Size([16, 256, 1024]). Additional info: {'h': 24, 'w': 24}.
 Shape mismatch, 256 != 576
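
For what it's worth, the mismatch is easy to reproduce outside the repo; the rearrange only succeeds when h * w matches the token dimension (minimal sketch, shapes taken from the traceback above):

import torch
import einops

x = torch.randn(16, 256, 1024)  # [batch, tokens, dim] as reported in the error

# Works: 256 tokens form a 16x16 grid
grid = einops.rearrange(x, 'b (h w) d -> b d h w', h=16, w=16)
print(grid.shape)  # torch.Size([16, 1024, 16, 16])

# Fails with the same EinopsError: 24 * 24 = 576 != 256
einops.rearrange(x, 'b (h w) d -> b d h w', h=24, w=24)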

It seems to work only with --mm_vision_output_token_count 256 during pretraining and --mm_vision_output_token_count 576 during finetuning. How can I set it to 16, 4, or 1?
Also, during evaluation on GQA:

Traceback (most recent call last):
  File "llava-token-compression/moe/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "llava-token-compression/moe/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "llava-token-compression/llava/eval/model_vqa_loader.py", line 134, in <module>
    eval_model(args)
  File "llava-token-compression/llava/eval/model_vqa_loader.py", line 95, in eval_model
    output_ids = model.generate(
  File "llava-token-compression/moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "llava-token-compression/llava/model/language_model/llava_llama.py", line 135, in generate
    ) = self.prepare_inputs_labels_for_multimodal(
  File "llava-token-compression/llava/model/llava_arch.py", line 243, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images, query_text)
  File "llava-token-compression/llava/model/llava_arch.py", line 152, in encode_images
    features = self.get_model().get_vision_tower()(images, text)
  File "llava-token-compression/moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "llava-token-compression/moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "llava-token-compression/moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "llava-token-compression/llava/model/multimodal_encoder/clip_encoder.py", line 101, in forward
    text_input_tokens = self.llm_tokenizer(text, padding=True, return_tensors='pt').to(device=self.device)
TypeError: 'NoneType' object is not callable
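
From the traceback, self.llm_tokenizer on the CLIP encoder appears to be None at generation time. A quick sanity check after loading the model (attribute and accessor names are taken from the traceback above, so treat this as a guess) would be:

# Hypothetical check before calling model.generate():
vision_tower = model.get_model().get_vision_tower()
assert getattr(vision_tower, 'llm_tokenizer', None) is not None, \
    'llm_tokenizer was never attached to the vision tower'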

@kevinli573
Collaborator

mm_vision_output_token_count is used to denote the number of visual tokens that the vision encoder (CLIP-L-14 in our case) produces. Based on the input tensor shape of 16x256x1024, are you potentially using a different vision encoder?
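
For reference, the two counts line up with the two common CLIP ViT-L/14 input resolutions, so my guess is that a 224px checkpoint is being loaded where the 336px one is expected (simple patch arithmetic, assuming the default patch size of 14):

# Number of visual tokens = (image_size / patch_size) ** 2 for a ViT encoder
for image_size in (224, 336):
    print(image_size, (image_size // 14) ** 2)  # 224 -> 256, 336 -> 576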

If you want to compress the tokens, adjust the mm_vision_token_compression_kernel_size and mm_vision_token_compression_stride variables. In our paper, we set both values to be the same: if you set both to x, the number of tokens is reduced by a factor of x**2.
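
As a rough illustration of the arithmetic only (average pooling here stands in for the projector's actual compression), with the 24x24 grid from a 576-token encoder, kernel_size == stride == 6, 12, or 24 gives the 16, 4, or 1 tokens asked about above:

import torch
import torch.nn.functional as F
import einops

tokens = torch.randn(1, 576, 1024)                 # 24x24 grid of visual tokens
grid = einops.rearrange(tokens, 'b (h w) d -> b d h w', h=24, w=24)

for k in (2, 6, 12, 24):                           # kernel_size == stride == k
    pooled = F.avg_pool2d(grid, kernel_size=k, stride=k)
    out = einops.rearrange(pooled, 'b d h w -> b (h w) d')
    print(k, out.shape[1])                         # 144, 16, 4, 1 tokens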

Someone else mentioned there were issues during deployment/inference, which we are currently looking into. I believe they were able to solve it here: #4.

@mzamini92
Author

Thanks for the comment. I was able to fix my issue, but the inference issue you mentioned still persists, even though I added that line to the code.
