First of all, I would like to thank you for open-sourcing such a great piece of work. This is a great contribution to the Mamba line of research.
The part that confuses me is the statement that you experimented with different numbers of spatial tokens. If the tensor after the attention_mask has shape [1, 96, 80, 16, 16] and you use 8×8 blocking, it becomes [1, 80, 96, 64]. Is this also the shape that is fed into Mamba? Or does the input to Mamba carry no spatial information?
Impact of Spatial Information. We replaced the frame-level global average pooling in the frame stem with patch embedding for comparative experiments, where both the position embedding and the temporal embedding were implemented with learnable parameters. For a fair comparison, all comparative experiments use a vanilla FFN, since purely temporal token sequences may have an advantage in the one-dimensional FFT. As shown in Table 4, replacing average pooling with patch embedding resulted in significant performance degradation, even with temporal and position embeddings. The experimental results suggest that the spatial sequence may strongly interfere with Mamba's understanding of the temporal sequence. For linear Mamba, it may be advantageous to distribute multidimensional information across different channels, with each channel retaining single-dimensional information.
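For reference, a minimal sketch of the two stems being compared, assuming a [B, C, T, H, W] feature layout. The class names (GAPStem, PatchEmbedStem), the default patch size, and the embedding shapes are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class GAPStem(nn.Module):
    """Frame-level global average pooling: one token per frame, no spatial tokens."""
    def forward(self, x):                                # x: [B, C, T, H, W]
        return x.mean(dim=(3, 4)).transpose(1, 2)        # [B, T, C]

class PatchEmbedStem(nn.Module):
    """Patch embedding: each frame is split into patches, yielding T*P tokens."""
    def __init__(self, in_ch=96, dim=96, patch=8, t_max=80, p_max=4):
        super().__init__()
        self.dim = dim
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, 1, p_max, dim))   # learnable spatial position embedding
        self.time_embed = nn.Parameter(torch.zeros(1, t_max, 1, dim))  # learnable temporal embedding

    def forward(self, x):                                # x: [B, C, T, H, W]
        B, C, T, H, W = x.shape
        x = x.transpose(1, 2).reshape(B * T, C, H, W)    # merge batch and time
        x = self.proj(x)                                 # [B*T, dim, H/p, W/p]
        x = x.flatten(2).transpose(1, 2)                 # [B*T, P, dim]
        x = x.reshape(B, T, -1, self.dim)                # [B, T, P, dim]
        x = x + self.pos_embed[:, :, :x.shape[2]] + self.time_embed[:, :T]
        return x.reshape(B, -1, self.dim)                # [B, T*P, dim] token sequence
```

With a [1, 96, 80, 16, 16] input, GAPStem yields an 80-token sequence (one per frame), while PatchEmbedStem yields 80*4 = 320 spatio-temporal tokens, which is the setting the ablation compares.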
Both the transformer and Mamba accept a three-dimensional tensor as input (batch_size, sequence_length, dim). The tensor [1, 80, 96, 64] needs to be reshaped to [1, 80*64, 96] for input.
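A minimal sketch of that reshape, assuming [1, 80, 96, 64] is laid out as [batch, frames, dim, spatial_tokens] as in the discussion above:

```python
import torch

x = torch.randn(1, 80, 96, 64)                      # [batch, frames, dim, spatial_tokens]
B, T, D, S = x.shape

# Flatten frames and spatial tokens into a single sequence axis and move dim last,
# giving the (batch_size, sequence_length, dim) layout expected by Mamba / transformer blocks.
x_seq = x.permute(0, 1, 3, 2).reshape(B, T * S, D)  # [1, 80*64, 96] = [1, 5120, 96]
print(x_seq.shape)                                  # torch.Size([1, 5120, 96])
```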