Question on Long-range Information in MambaVisionMixer with a Single Scan #62
Comments
Hi @zzzack66, thank you for the kind words; I truly appreciate your in-depth question.

In our view, the global mixing capability is primarily due to the design of MambaVisionMixer itself. Specifically, we allow for a non-SSM branch and mix it back in via concatenation to account for content that is lost through the SSM branch. In addition, the self-attention blocks mitigate the need for multi-pass scan approaches.

Your understanding of implicit position encoding via Mamba blocks is precisely correct! The dot-product in the self-attention mechanism is order-agnostic and treats a sequence of tokens as a set. Hence, Transformer-like architectures such as ViT must inject information about where each token occurs in the sequence (via positional encodings) so the model can distinguish positions. Mamba, on the other hand, is designed so that sequence ordering is already captured by its architecture; it does not rely on an external embedding trick. Specifically, Mamba incorporates ways to preserve token order (e.g., shift operations or other architectural cues), so a separate positional encoding step is unnecessary.

When you use a Mamba-based design like ours in conjunction with self-attention (a hybrid Mamba + self-attention architecture), you do not need a separate positional encoding because Mamba has already embedded the positional (order) information into the token representations. Once the input has passed through the Mamba module, each token vector inherently carries a notion of where it comes from in the sequence, so self-attention can exploit that without needing another positional encoding. This intuition has also been confirmed by several hybrid efforts in the NLP domain, where combining Mamba with self-attention, without RoPE or other positional encoding methods, has shown prominent results.

Kind Regards,
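For readers following the thread, here is a minimal, self-contained PyTorch sketch of the two-branch idea described above: the input is projected, split ("chunk") into an SSM branch and a non-SSM branch, and the two are concatenated ("cat") before the output projection. The `TwoBranchMixer` module and the toy `simple_scan` recurrence are illustrative stand-ins, not the repository's actual MambaVisionMixer or selective-scan kernel.

```python
import torch
import torch.nn as nn


def simple_scan(x, decay):
    """Toy recurrent scan: h_t = decay * h_{t-1} + x_t, one output per step."""
    B, L, C = x.shape
    h = x.new_zeros(B, C)
    outs = []
    for t in range(L):
        h = decay * h + x[:, t]
        outs.append(h)
    return torch.stack(outs, dim=1)                    # [B, L, C]


class TwoBranchMixer(nn.Module):
    """Illustrative mixer: an SSM branch plus a non-SSM branch, concatenated."""

    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.conv_ssm = nn.Conv1d(dim // 2, dim // 2, 3, padding=1, groups=dim // 2)
        self.conv_skip = nn.Conv1d(dim // 2, dim // 2, 3, padding=1, groups=dim // 2)
        self.out_proj = nn.Linear(dim, dim)
        self.decay = nn.Parameter(torch.tensor(0.9))

    def forward(self, x):                              # x: [B, L, C]
        x = self.in_proj(x)
        x_ssm, x_skip = x.chunk(2, dim=-1)             # "chunk": split channels in half
        # SSM branch: depthwise 1D conv over the token axis, then a sequential scan.
        x_ssm = self.conv_ssm(x_ssm.transpose(1, 2)).transpose(1, 2)
        x_ssm = simple_scan(x_ssm, self.decay)
        # Non-SSM branch: conv only, keeping content the scan may wash out.
        x_skip = self.conv_skip(x_skip.transpose(1, 2)).transpose(1, 2)
        # "cat": concatenate both branches along channels, then project out.
        return self.out_proj(torch.cat([x_ssm, x_skip], dim=-1))


tokens = torch.randn(2, 196, 64)                       # e.g. 14x14 patches, C=64
print(TwoBranchMixer(64)(tokens).shape)                # torch.Size([2, 196, 64])
```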
Dear Author,
Hi @FlxhSui,

The 1D convolution still operates in a sliding-window manner, which can implicitly encode relative positions within a local region: a token in the middle of the sequence sees a left neighbor at index offset -1, a right neighbor at offset +1, and so forth. As a result, the 1D convolution still provides the needed positional information in this case.

In addition, we still rely on the SSM in our MambaVisionMixer. Because the scan operation in the SSM processes tokens in sequence, each update is inherently tied to the step index n, so each token's representation is contextually unique to its position in the sequence. The hidden state at step i can only have come from the steps that precede it. As a result, the SSM scan itself does impose a positional structure, similar in spirit to what an RNN does, and that alone can serve as an implicit form of positional encoding.

I hope this is helpful!

Kind Regards
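As a small sanity check of the point above, the toy snippet below (assuming the same simple h_t = a·h_{t-1} + x_t recurrence as in the earlier sketch) shows that a sequential scan is order-sensitive, whereas an un-positioned reduction over the token set is not:

```python
import torch

x = torch.randn(1, 6, 4)                    # [B, L, C] toy tokens
perm = torch.arange(5, -1, -1)              # reverse the token order


def scan(x, a=0.9):
    """h_t = a * h_{t-1} + x_t, returning the hidden state at every step."""
    h = x.new_zeros(x.shape[0], x.shape[2])
    out = []
    for t in range(x.shape[1]):
        h = a * h + x[:, t]
        out.append(h)
    return torch.stack(out, dim=1)


# Scanning a reordered sequence does NOT just reorder the original outputs,
# so each token's representation depends on where it sits in the sequence.
print(torch.allclose(scan(x)[:, perm], scan(x[:, perm])))        # False

# An order-agnostic reduction (here, a plain mean over the token set) is
# identical for both orderings, which is why attention alone needs an
# explicit positional signal.
print(torch.allclose(x.mean(dim=1), x[:, perm].mean(dim=1)))     # True
```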
I appreciate your detailed explanation. I will try to play with the Mixer Block. |
Thank you for your response |
Hi there,
Thanks for your excellent work, and congratulations on the acceptance to CVPR.
Looking through your paper and code, I have been analyzing the difference between other Mamba-based vision backbones and MambaVisionMixer in terms of their global modeling mechanisms, particularly their scanning strategies. I found that MambaVisionMixer appears to achieve long-range information (global modeling) with just a single scan.
I would like to understand the key reasons why MambaVisionMixer can accomplish this efficiently. Is it primarily due to the use of 1D convolutions in the channel dimension before applying the selective scan, which mixes spatial information beforehand? Or does the "chunk and cat" operation in the Mixer contribute to capturing long-range dependencies in a more efficient way?
Besides, as you mentioned in other issues, Mamba blocks implicitly enforce position encoding, so the attention blocks don't need a positional embedding operation. The reason, as I understand it, is that the MambaVisionMixer block has already shaped the features as [B, L, C], and we don't need an explicit positional embedding in the attention block. Do I have a correct understanding of this part?
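To make my understanding concrete, here is a toy sketch (the block below is made up for illustration and is certainly not your actual implementation): the attention block simply consumes the [B, L, C] tokens produced by the mixer, without adding any positional embedding of its own.

```python
import torch
import torch.nn as nn


class PlainAttentionBlock(nn.Module):
    """Toy attention block: no positional embedding is added to its input."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: [B, L, C] from the mixer
        h = self.norm(x)
        out, _ = self.attn(h, h, h)              # relies on order info already in x
        return x + out


mixer_tokens = torch.randn(2, 196, 64)           # stand-in for mixer output
print(PlainAttentionBlock(64)(mixer_tokens).shape)   # torch.Size([2, 196, 64])
```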
I appreciate any insights you can share on this.
Best regards,
Zack