
Impact of Spatial Information #11

Open
znygithub opened this issue Feb 19, 2025 · 1 comment

Comments

@znygithub

First of all, I would like to thank you for open-sourcing such a great piece of work. It is an impressive effort in the Mamba field.

The part that confuses me is the statement that you experimented with other spatial token numbers. If the tensor after the attention_mask has shape [1, 96, 80, 16, 16] and 8×8 blocking is applied, it becomes [1, 80, 96, 64]. Is this also the shape of the input to Mamba? Or does the input to Mamba carry no spatial information?

Impact of Spatial Information. We replaced the frame global average pooling of the frame stem with patch embedding for comparative experiments, where both the position embedding and the temporal embedding were implemented with learnable parameters. For a fair comparison, the comparative experiments all use a vanilla FFN, as purely temporal token sequences may have an advantage under the one-dimensional FFT. As shown in Table 4, replacing average pooling with patch embedding caused significant performance degradation, even with temporal and position embeddings. From the experimental results, it appears that the spatial sequence may greatly interfere with Mamba's understanding of the temporal sequence. For linear Mamba, it may be advantageous to distribute multidimensional information across different channels, with each channel retaining single-dimensional information.
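For concreteness, here is a minimal PyTorch sketch of the two stems this ablation compares. It is an illustration, not the repository's actual code: the module names are made up, and realizing the 8×8 blocking as a 2×2 strided convolution is an assumption (the paper only says "patch embedding").

```python
import torch
import torch.nn as nn

B, C, T, H, W = 1, 96, 80, 16, 16            # the shapes discussed above
x = torch.randn(B, C, T, H, W)

# Stem A: frame global average pooling -> one token per frame (purely temporal).
temporal_tokens = x.mean(dim=(3, 4)).permute(0, 2, 1)      # [1, 80, 96]

# Stem B: patch embedding -> an 8x8 grid of tokens per frame (64 spatial tokens).
# Assumption: the blocking is a 2x2 strided conv, keeping the channel dim at 96.
patch_embed = nn.Conv2d(C, C, kernel_size=2, stride=2)
frames = x.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)  # fold time into batch
tokens = patch_embed(frames).flatten(2).reshape(B, T, C, 8 * 8)  # [1, 80, 96, 64]

# Position and temporal embeddings, both learnable parameters as the paper states.
pos_embed = nn.Parameter(torch.zeros(1, 1, C, 8 * 8))      # shared across frames
temporal_embed = nn.Parameter(torch.zeros(1, T, C, 1))     # shared across patches
tokens = tokens + pos_embed + temporal_embed               # [1, 80, 96, 64]
```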

@zizheng-guo
Owner

Both the Transformer and Mamba accept a three-dimensional tensor of shape (batch_size, sequence_length, dim) as input. The [1, 80, 96, 64] tensor therefore needs to be reshaped to [1, 80*64, 96] before being fed in.
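As a concrete sketch of that reshape (variable names are illustrative):

```python
import torch

x = torch.randn(1, 80, 96, 64)   # [batch, frames, channels, spatial_tokens]
x = x.permute(0, 1, 3, 2)        # [1, 80, 64, 96]: move channels to the last axis
x = x.reshape(1, 80 * 64, 96)    # [1, 5120, 96] = (batch_size, sequence_length, dim)
```

This lays the sequence out frame by frame (all 64 spatial tokens of frame 0, then frame 1, and so on); the comment above does not specify the token ordering, so this is one natural choice.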
