First of all, I would like to thank you for open-sourcing such a great piece of work. This is a great contribution to the Mamba line of research.
The part that confuses me is the statement that you experimented with different numbers of spatial tokens. If the tensor after the attention_mask has shape [1, 96, 80, 16, 16] and you use 8×8 blocking, it becomes [1, 80, 96, 64]. Is this also the shape that is fed into Mamba? Or does the input to Mamba carry no spatial information?
Impact of Spatial Information. We replaced the frame-level global average pooling in the frame stem with patch embedding for comparative experiments, where both the position embedding and the temporal embedding were implemented with learnable parameters. For a fair comparison, all comparative experiments use a vanilla FFN, since purely temporal token sequences may have an advantage in the one-dimensional FFT. As shown in Table 4, replacing average pooling with patch embedding resulted in significant performance degradation, even with temporal and position embeddings. The experimental results suggest that the spatial sequence may strongly interfere with Mamba's understanding of the temporal sequence. For linear Mamba, it may be advantageous to distribute multidimensional information across different channels, with each channel retaining single-dimensional information.
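For reference, a minimal sketch of the two stems being compared, assuming a [B, C, T, H, W] feature layout. The class names (GAPStem, PatchEmbedStem), the default patch size, and the embedding shapes are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn

class GAPStem(nn.Module):
    """Frame-level global average pooling: one token per frame, no spatial tokens."""
    def forward(self, x):                                # x: [B, C, T, H, W]
        return x.mean(dim=(3, 4)).transpose(1, 2)        # [B, T, C]

class PatchEmbedStem(nn.Module):
    """Patch embedding: each frame is split into patches, yielding T*P tokens."""
    def __init__(self, in_ch=96, dim=96, patch=8, t_max=80, p_max=4):
        super().__init__()
        self.dim = dim
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, 1, p_max, dim))   # learnable spatial position embedding
        self.time_embed = nn.Parameter(torch.zeros(1, t_max, 1, dim))  # learnable temporal embedding

    def forward(self, x):                                # x: [B, C, T, H, W]
        B, C, T, H, W = x.shape
        x = x.transpose(1, 2).reshape(B * T, C, H, W)    # merge batch and time
        x = self.proj(x)                                 # [B*T, dim, H/p, W/p]
        x = x.flatten(2).transpose(1, 2)                 # [B*T, P, dim]
        x = x.reshape(B, T, -1, self.dim)                # [B, T, P, dim]
        x = x + self.pos_embed[:, :, :x.shape[2]] + self.time_embed[:, :T]
        return x.reshape(B, -1, self.dim)                # [B, T*P, dim] token sequence
```

With a [1, 96, 80, 16, 16] input, GAPStem yields an 80-token sequence (one per frame), while PatchEmbedStem yields 80*4 = 320 spatio-temporal tokens, which is the setting the ablation compares.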
Both the transformer and Mamba accept a three-dimensional tensor as input (batch_size, sequence_length, dim). The tensor [1, 80, 96, 64] needs to be reshaped to [1, 80*64, 96] for input.
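A minimal sketch of that reshape, assuming [1, 80, 96, 64] is laid out as [batch, frames, dim, spatial_tokens] as in the discussion above:

```python
import torch

x = torch.randn(1, 80, 96, 64)                      # [batch, frames, dim, spatial_tokens]
B, T, D, S = x.shape

# Flatten frames and spatial tokens into a single sequence axis and move dim last,
# giving the (batch_size, sequence_length, dim) layout expected by Mamba / transformer blocks.
x_seq = x.permute(0, 1, 3, 2).reshape(B, T * S, D)  # [1, 80*64, 96] = [1, 5120, 96]
print(x_seq.shape)                                  # torch.Size([1, 5120, 96])
```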