Question on Long-range Information in MambaVisionMixer with a Single Scan #62
Comments
Hi @zzzack66, thank you for the kind words; I truly appreciate your in-depth question.

In our view, the global mixing capability is primarily due to the design of MambaVisionMixer itself. Specifically, we allow for a non-SSM branch and mix it back in via concatenation to account for content that is lost through the SSM branch. In addition, the self-attention blocks mitigate the need for multi-pass scan approaches.

Your understanding of implicit position encoding via Mamba blocks is precisely correct! The dot-product in the self-attention mechanism is order-agnostic and treats a sequence of tokens as a set. Hence, Transformer-like architectures such as ViT must inject information about where each token occurs in the sequence (via positional encodings) so the model can distinguish positions. Mamba, on the other hand, is designed so that sequence ordering is already captured by its architecture; it does not rely on an external embedding trick. Specifically, Mamba incorporates ways to preserve token order (e.g., shift operations or other architectural cues), so a separate positional encoding step is unnecessary.

When you use a Mamba-based design like ours in conjunction with self-attention (a hybrid Mamba + self-attention architecture), you do not need a separate positional encoding because Mamba has already embedded the positional (order) information into the token representations. Once the input has passed through the Mamba module, each token vector inherently carries a notion of where it comes from in the sequence, so self-attention can exploit that without needing another positional encoding. This intuition has also been confirmed by several hybrid efforts in the NLP domain, where combining Mamba with self-attention, without RoPE or other positional encoding methods, has shown prominent results.

Kind Regards,
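For readers following the thread, here is a minimal, self-contained PyTorch sketch of the two-branch idea described above: the input is projected, split ("chunk") into an SSM branch and a non-SSM branch, and the two are concatenated ("cat") before the output projection. The `TwoBranchMixer` module and the toy `simple_scan` recurrence are illustrative stand-ins, not the repository's actual MambaVisionMixer or selective-scan kernel.

```python
import torch
import torch.nn as nn


def simple_scan(x, decay):
    """Toy recurrent scan: h_t = decay * h_{t-1} + x_t, one output per step."""
    B, L, C = x.shape
    h = x.new_zeros(B, C)
    outs = []
    for t in range(L):
        h = decay * h + x[:, t]
        outs.append(h)
    return torch.stack(outs, dim=1)                    # [B, L, C]


class TwoBranchMixer(nn.Module):
    """Illustrative mixer: an SSM branch plus a non-SSM branch, concatenated."""

    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.conv_ssm = nn.Conv1d(dim // 2, dim // 2, 3, padding=1, groups=dim // 2)
        self.conv_skip = nn.Conv1d(dim // 2, dim // 2, 3, padding=1, groups=dim // 2)
        self.out_proj = nn.Linear(dim, dim)
        self.decay = nn.Parameter(torch.tensor(0.9))

    def forward(self, x):                              # x: [B, L, C]
        x = self.in_proj(x)
        x_ssm, x_skip = x.chunk(2, dim=-1)             # "chunk": split channels in half
        # SSM branch: depthwise 1D conv over the token axis, then a sequential scan.
        x_ssm = self.conv_ssm(x_ssm.transpose(1, 2)).transpose(1, 2)
        x_ssm = simple_scan(x_ssm, self.decay)
        # Non-SSM branch: conv only, keeping content the scan may wash out.
        x_skip = self.conv_skip(x_skip.transpose(1, 2)).transpose(1, 2)
        # "cat": concatenate both branches along channels, then project out.
        return self.out_proj(torch.cat([x_ssm, x_skip], dim=-1))


tokens = torch.randn(2, 196, 64)                       # e.g. 14x14 patches, C=64
print(TwoBranchMixer(64)(tokens).shape)                # torch.Size([2, 196, 64])
```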
Dear Author,
Hi @FlxhSui,

The 1D convolution still operates in a sliding-window manner, which can implicitly encode relative positions within a local region: a token in the middle of the sequence sees a left neighbor at index offset -1, a right neighbor at offset +1, and so forth. As a result, the 1D convolution still provides the needed positional information in this case.

In addition, we still rely on the SSM in our MambaVisionMixer. Because the scan operation in the SSM processes tokens in sequence, each update is inherently tied to the step index n, so each token's representation is contextually unique to its position in the sequence. The hidden state at step i can only have come from the steps that precede it. As a result, the SSM scan itself does impose a positional structure, similar in spirit to what an RNN does, and that alone can serve as an implicit form of positional encoding.

I hope this is helpful!

Kind Regards
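As a small sanity check of the point above, the toy snippet below (assuming the same simple h_t = a·h_{t-1} + x_t recurrence as in the earlier sketch) shows that a sequential scan is order-sensitive, whereas an un-positioned reduction over the token set is not:

```python
import torch

x = torch.randn(1, 6, 4)                    # [B, L, C] toy tokens
perm = torch.arange(5, -1, -1)              # reverse the token order


def scan(x, a=0.9):
    """h_t = a * h_{t-1} + x_t, returning the hidden state at every step."""
    h = x.new_zeros(x.shape[0], x.shape[2])
    out = []
    for t in range(x.shape[1]):
        h = a * h + x[:, t]
        out.append(h)
    return torch.stack(out, dim=1)


# Scanning a reordered sequence does NOT just reorder the original outputs,
# so each token's representation depends on where it sits in the sequence.
print(torch.allclose(scan(x)[:, perm], scan(x[:, perm])))        # False

# An order-agnostic reduction (here, a plain mean over the token set) is
# identical for both orderings, which is why attention alone needs an
# explicit positional signal.
print(torch.allclose(x.mean(dim=1), x[:, perm].mean(dim=1)))     # True
```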
I appreciate your detailed explanation. I will try to play with the Mixer Block. |
Thank you for your response |
Hi there,
Thanks for your excellent work, and congratulations on the acceptance to CVPR.
Looking through your paper and code, I have been analyzing the difference between other Mamba-based vision backbones and MambaVisionMixer in terms of their global modeling mechanisms, particularly their scanning strategies. I found that MambaVisionMixer appears to achieve long-range information (global modeling) with just a single scan.
I would like to understand the key reasons why MambaVisionMixer can accomplish this efficiently. Is it primarily due to the use of 1D convolutions in the channel dimension before applying the selective scan, which mixes spatial information beforehand? Or does the "chunk and cat" operation in the Mixer contribute to capturing long-range dependencies in a more efficient way?
Besides, as you mentioned in other issues, Mamba blocks implicitly enforce position encoding, so the attention blocks don't need a positional embedding operation. The reason, as I understand it, is that the MambaVisionMixer block has already shaped the features as [B, L, C], and we don't need an explicit positional embedding in the attention block. Do I have a correct understanding of this part?
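To make my understanding concrete, here is a toy sketch (the block below is made up for illustration and is certainly not your actual implementation): the attention block simply consumes the [B, L, C] tokens produced by the mixer, without adding any positional embedding of its own.

```python
import torch
import torch.nn as nn


class PlainAttentionBlock(nn.Module):
    """Toy attention block: no positional embedding is added to its input."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: [B, L, C] from the mixer
        h = self.norm(x)
        out, _ = self.attn(h, h, h)              # relies on order info already in x
        return x + out


mixer_tokens = torch.randn(2, 196, 64)           # stand-in for mixer output
print(PlainAttentionBlock(64)(mixer_tokens).shape)   # torch.Size([2, 196, 64])
```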
I appreciate any insights you can share on this.
Best regards,
Zack