1 Why Mamba and Structured State Space Sequence Models?
-
The fundamental problem in deep sequence modelling is how to efficiently compress the context into a smaller learnable state representation whilst maintaining the quality of state representation. As seen in Figure 1.1, transformers have powerful in-context learning capabilties due to the inherent nature of attention but it’s uncompressed memory state (the attention matrix) makes for inefficient inference especially with long-range dependencies (LRD) or large context window settings. On the end, RNNs and S4 models may be efficient but fail to preserve context state required to perform well in tasks that require in-context reasoning. Mamba proposes a context-aware method to dynamically filter out inputs in the sequence to effectively compress the context.
+
The fundamental problem in deep sequence modelling is how to efficiently compress the context into a smaller learnable state representation whilst maintaining the quality of that representation. As seen in Figure 1.1, transformers have powerful in-context learning capabilities due to the inherent nature of attention, but their uncompressed memory state (the attention matrix) makes for inefficient inference, especially with long-range dependencies (LRD) or large context window settings. On the other end, RNNs and S4 models may be efficient but fail to preserve the context state required to perform well in tasks that require in-context reasoning. Mamba proposes a context-aware method to dynamically filter out inputs in the sequence to effectively compress the context.
-Figure 1.4: Mamba: Matching Transformer Performance with Efficiency in Training and Inference [3]
+Figure 1.4: Mamba: Matching Transformer Performance with Efficiency in Training and Inference [3]
@@ -348,16 +348,16 @@
-
-Head View: Visualising attention head activations between different layers. Connecting lines are weighted based on the attention score between respective words.
+
+Head View: Visualising attention head activations between different layers. Connecting lines are weighted based on the attention score between respective words.
-
-Neuron View: Visualising query, key and value embeddings when computing attention between each token and other tokens within the sequence. Positive values are colored as blue and negative values as orange.
+
+Neuron View: Visualising query, key and value embeddings when computing attention between each token and other tokens within the sequence. Positive values are colored as blue and negative values as orange.
@@ -368,7 +368,7 @@
Figure 1.6 that GPT4’s recall performance starts to degrade above 73K tokens where the low recall performance was placed between 7-50% document depth given. However, facts at the beginning of documents were recalled regardless of document length. This also seems to be the case for Anthropic’s Claude 2.1 model.
+
We can observe in the experiments in Figure 1.6 that GPT4’s recall performance starts to degrade above 73K tokens, with low recall observed when the fact is placed between 7-50% document depth. However, facts at the beginning of documents were recalled regardless of document length. This also seems to be the case for Anthropic’s Claude 2.1 model.
@@ -376,23 +376,23 @@
-
-OpenAI’s GPT-4-128K
+
+OpenAI’s GPT-4-128K Long Context Performance
-
-Anthropic’s Claude 2.1
+
+Anthropic’s Claude 2.1 Long Context Performance
-Figure 1.6: Needle In A Haystack - Pressure Testing LLMs Results for Long Context Retrieval [9]
+Figure 1.6: Needle In A Haystack: Pressure Testing LLMs Results for Long Context Retrieval [9]
@@ -400,10 +400,10 @@
-
+
-Figure 1.7: Lost in the Middle: Performance Degrades When Information Access is in the Middle of Document [10]
+Figure 1.7: Lost in the Middle: Performance Degrades When Information Access is in the Middle of Document [10]
@@ -416,7 +416,7 @@
-
+
Figure 1.8: Comparison of scaled dot-product attention with and without KV caching [12]
@@ -465,7 +465,7 @@
-
+
Figure 1.9: Unrolling Recurrent Neural Network Architecture Over Time
@@ -550,10 +550,10 @@
2 What are Struct
-
+
-Figure 2.1: The Three Representations of Linear State Space Layers in S4: (Left) State space models allow us to model continuous-time systems .(Center) The discretised recurrent format can be used for fast autoregressive inference. Recent theory on continuous-time memorisation of the hidden state transition matrix \(\mathbf{\bar{A}}\) enables us to capture LRDs mathematically and empirically. (Right) Unrolling the RNN into a global convolutional representation allows for efficient training by computing the layer depthwise in parallel [15].
+Figure 2.1: The Three Representations of Linear State Space Layers in S4: (Left) State space models allow us to model continuous-time systems. (Center) The discretised recurrent format can be used for fast autoregressive inference. Recent theory on continuous-time memorisation of the hidden state transition matrix \(\mathbf{\bar{A}}\) enables us to capture LRDs mathematically and empirically. (Right) Unrolling the RNN into a global convolutional representation allows for efficient training by computing the layer depthwise in parallel [15].
-Figure 2.3: From Continuous to Discrete SSMs With the Zero Order Hold Rule [1]
+Figure 2.3: From Continuous to Discrete SSMs With the Zero Order Hold Rule
significantly, and overcome the \(O(Td^2)\) computational complexity and \(O(Td)\) space complexity in applying \(\mathbf{\bar{A}}\) for each time step in the sequence.
-
The original S4 approach was to leverage the Diagonal Plus Low-Rank (DPLR) structure in complex space [18] which significantly reduces the space and time complexity as we only need to store and compute the diagonal elements and low-rank components of the dense matrix. It can be expressed as \(\mathbf{\bar{A}}=\mathbf{\Lambda}+ \mathbf{PQ^*}\) where \(\mathbf{\Lambda}\) is the diagonal matrix and \(\mathbf{PQ}\) are low-rank matrices (vectors for rank-1 updates). The addition of the low-rank term allows the DPLR matrix to capture more complex relationships in LRD compared to a simple diagonal matrix whilst specialised techniques like the Woodbury identity make operations on DPLR matrices feasible and efficient. This was followed by a paper that showed empirically that just using the diagonal matrix and removing the low-rank portion of the DPLR form of the HIPPO matrix, yielded similar results [18].
-
This work led to S4D used in Mamba [19], further improving the computational effiency and expressiveness of \(\mathbf{\bar{A}}\) by leveraging the Vandermonde Matrix to compute the diagonal matrix, leveraging the properties of eigenvectors and eigenvalues to efficiently capture more complex relationships between state variables (such as powers and exponentials). This is expressed as \(\mathbf{\bar{A}}=\mathbf{V \Lambda V^{-1}}\) where \(\mathbf{\Lambda}\) is the diagonal matrix of eigenvalues, \(\mathbf{V}\) is the Vandermonde matrix of eigenvectors and \(\mathbf{V^{-1}}\) is the inverse Vandermonde matrix.
+
The original S4 approach was to leverage the Diagonal Plus Low-Rank (DPLR) structure in complex space [19], which significantly reduces the space and time complexity as we only need to store and compute the diagonal elements and low-rank components of the dense matrix. It can be expressed as \(\mathbf{\bar{A}}=\mathbf{\Lambda}+ \mathbf{PQ^*}\) where \(\mathbf{\Lambda}\) is the diagonal matrix and \(\mathbf{P}\) and \(\mathbf{Q}\) are low-rank matrices (vectors for rank-1 updates). The addition of the low-rank term allows the DPLR matrix to capture more complex relationships in LRD compared to a simple diagonal matrix, whilst specialised techniques like the Woodbury identity make operations on DPLR matrices feasible and efficient. This was followed by a paper that showed empirically that using only the diagonal matrix, and removing the low-rank portion of the DPLR form of the HIPPO matrix, yielded similar results [19].
+
This work led to S4D, used in Mamba [20], which further improves the computational efficiency and expressiveness of \(\mathbf{\bar{A}}\) by using the Vandermonde matrix to compute the diagonal matrix, exploiting the properties of eigenvectors and eigenvalues to efficiently capture more complex relationships between state variables (such as powers and exponentials). This is expressed as \(\mathbf{\bar{A}}=\mathbf{V \Lambda V^{-1}}\) where \(\mathbf{\Lambda}\) is the diagonal matrix of eigenvalues, \(\mathbf{V}\) is the Vandermonde matrix of eigenvectors and \(\mathbf{V^{-1}}\) is the inverse Vandermonde matrix.
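To make the diagonal case concrete, the following is a minimal sketch (not the S4D reference implementation) of why a diagonal \(\mathbf{\bar{A}}\) is cheap to work with: each kernel entry reduces to a sum of powers of the diagonal values, which is a Vandermonde-style product. The function names, the toy values, and the assumption that the diagonal entries are already discretised are all illustrative choices, not taken from the paper.

```python
# Minimal sketch, assuming a diagonal state matrix with already-discretised
# entries `lam`. All names and values here are hypothetical illustrations.

def s4d_kernel(lam, B, C, L):
    """Convolution kernel K[l] = sum_n C[n] * lam[n]**l * B[n], for l = 0..L-1.

    The powers lam[n]**l form a Vandermonde matrix, so no dense matrix
    powers of the state matrix are ever needed.
    """
    return [sum(c * (a ** l) * b for a, b, c in zip(lam, B, C)) for l in range(L)]


def run_recurrence(lam, B, C, u):
    """Recurrent view: x_t = lam * x_{t-1} + B * u_t, y_t = C . x_t."""
    x = [0j] * len(lam)
    ys = []
    for u_t in u:
        x = [a * x_n + b * u_t for a, x_n, b in zip(lam, x, B)]
        ys.append(sum(c * x_n for c, x_n in zip(C, x)))
    return ys


def run_convolution(K, u):
    """Convolutional view: y_t = sum_{j=0..t} K[j] * u[t-j] (causal)."""
    return [sum(K[j] * u[t - j] for j in range(t + 1)) for t in range(len(u))]


if __name__ == "__main__":
    lam = [0.9, 0.5 + 0.2j]   # toy diagonal entries with |lam| < 1
    B = [1.0, 1.0]
    C = [0.5, -0.25j]
    u = [1.0, 2.0, 0.0, -1.0]
    K = s4d_kernel(lam, B, C, len(u))
    rec = run_recurrence(lam, B, C, u)
    conv = run_convolution(K, u)
    # Both views should agree up to floating-point error.
    print(max(abs(r - c) for r, c in zip(rec, conv)))
```

The recurrent and convolutional views produce the same outputs, which is the property that lets these layers train in parallel and run inference step by step. Practical implementations typically use conjugate-pair eigenvalues and keep only the real part of the output; that detail is left out here for brevity.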
@@ -742,7 +752,7 @@
-
+
Diagonal Plus Low-rank Approximation
@@ -754,17 +764,17 @@
-Figure 2.7: S4 vs S4D Architecture [19]
+Figure 2.7: S4 vs S4D Architecture
@@ -777,7 +787,7 @@
-
+
Visualising S4 vs S4D Results
@@ -789,7 +799,7 @@
-
+
S4 vs S4D Long Range Arena Results
@@ -799,7 +809,7 @@
-Figure 2.8: S4 vs S4D Results [19]
+Figure 2.8: S4 vs S4D Results [20]
@@ -812,7 +822,7 @@
3 How does Mamba
-
+
Figure 3.1: Differences between S4 and Mamba (S6) [3]
@@ -831,10 +841,10 @@
-
+
-Selective Copying: This requires time-varying models that can selectively remember or ignore inputs depending on their content.
+Selective Copying: This requires time-varying models that can selectively remember or ignore inputs depending on their content.
@@ -843,17 +853,17 @@
-
+
-Induction Heads: This is an associative recall task which requires retrieving an answer based on context, a key ability of LLMs.
+Induction Heads: This is an associative recall task which requires retrieving an answer based on context, a key ability of LLMs.
-Figure 3.2: Tasks to Demontrastae Context-Aware Reasoning [1]
+Figure 3.2: Tasks to Demonstrate Context-Aware Reasoning [1]
@@ -871,10 +881,10 @@
-
+
-Selective Copying Results: Accuracy for combinations of architectures
+Selective Copying Results: Accuracy for combinations of architectures
@@ -883,10 +893,10 @@
-
+
-Induction Heads Extrapolation: Mamba has ability to maintain high induction test accuracy for sequence length up to 1 million tokens
+Induction Heads Extrapolation: Mamba has the ability to maintain high induction test accuracy for sequence lengths up to 1 million tokens
@@ -900,7 +910,7 @@
3.2 Selective SSM Layer for Parallelised Training
-
However, making the system time-varying means we can no longer perform convolution in Equation 2.4 to parallelise training since it assumes a fixed kernel. To address this, Mamba introduces the selective scan layer. It is the implementation of a hard-aware selective parallel scan algorithm with the same GPU kernel fusion techniques in FlashAttention[20] for transformers, as a result of Mamba being a collaborative paper between Albert Gu (S4) and Tri Dao (FlashAttention). Therefore, the core optimisations for all three techniques, parallel scan, kernel fusion and recomputation in the selective SSM layer are to try and perform as many operations in the fast memory (SRAM) layer of the GPU before saving results back to high-bandwidth memory (HBM) of the GPU (see Figure 3.6). This reduces the data transfer (IO) between them, as loading is often the slowest process [21]. For more details on model optimisation on GPUs, this is a good read from first principles.
+
However, making the system time-varying means we can no longer perform convolution in Equation 2.4 to parallelise training since it assumes a fixed kernel. To address this, Mamba introduces the selective scan layer. It is the implementation of a hardware-aware selective parallel scan algorithm with the same GPU kernel fusion techniques used in FlashAttention [21] for transformers, as a result of Mamba being a collaborative paper between Albert Gu (S4) and Tri Dao (FlashAttention). Therefore, the core optimisation behind all three techniques in the selective SSM layer (parallel scan, kernel fusion and recomputation) is to perform as many operations as possible in the fast memory (SRAM) layer of the GPU before saving results back to high-bandwidth memory (HBM) (see Figure 3.6). This reduces the data transfer (IO) between them, as loading is often the slowest process [22]. For more details on model optimisation on GPUs, this is a good read from first principles.
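The reason a time-varying recurrence can still be parallelised is that each step \(h_t = a_t h_{t-1} + b_t\) can be represented as a pair \((a_t, b_t)\), and composing two such steps is an associative operation. The sketch below illustrates this in plain Python under stated assumptions: the scalar recurrence, the function names, and the simple Hillis-Steele schedule are illustrative choices, not Mamba's fused CUDA kernel, which is based on a work-efficient scan and keeps the state in SRAM.

```python
# Minimal sketch, assuming a scalar recurrence h_t = a_t * h_{t-1} + b_t
# with h_0 = 0. Names are hypothetical; this is not Mamba's GPU kernel.

def combine(left, right):
    """Compose two affine steps: apply `left` first, then `right`.

    h -> a1*h + b1 -> a2*(a1*h + b1) + b2 = (a1*a2)*h + (a2*b1 + b2)
    """
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference linear scan: one step after another."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def parallel_style_scan(a, b):
    """Hillis-Steele inclusive scan over (a_t, b_t) pairs.

    Each round doubles the span covered by every element. Within a round,
    all positions are independent, so on a GPU they could be updated
    concurrently; here they are simply computed in a list comprehension.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(len(elems))
        ]
        step *= 2
    # With h_0 = 0, the accumulated offset of each prefix is the state h_t.
    return [b_t for _, b_t in elems]


if __name__ == "__main__":
    a = [0.5, 0.9, 0.1, 0.7, 0.3]   # input-dependent decay per step
    b = [1.0, 2.0, 3.0, 4.0, 5.0]   # input-dependent update per step
    print(sequential_scan(a, b))
    print(parallel_style_scan(a, b))
```

The scan needs about \(\log_2 T\) rounds for a sequence of length \(T\), rather than \(T\) dependent steps. The Hillis-Steele schedule used here does more total work than the Blelloch algorithm, and was chosen only because it is shorter to write down.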
@@ -910,7 +920,7 @@
-
+
(Left): Average Memory Bandwidth for A100 (Right): Selective SSM Architecture Simplified: The select state layer is kept and computed in SRAM. [1]
@@ -924,10 +934,10 @@
-
+
-State Selection with Hardware-Aware State Expansion: The selection mechanism ensures the expanded matrix states only materialise in SRAM to reduce data transfer and computation between SRAM<>HBM. [3]
+State Selection with Hardware-Aware State Expansion: The selection mechanism ensures the expanded matrix states only materialise in SRAM to reduce data transfer and computation between SRAM<>HBM. [3]
-Figure 3.5: Visualising the Linear vs Parallel Associative Scan Operation [22]
+Figure 3.5: Visualising the Linear vs Parallel Associative Scan Operation [23]
@@ -987,15 +997,15 @@
-
+
-(Left): FlashAttention: The \(\mathbf{(QK)V}\) matrix of size \(N\times N\) is computed in SRAM using tiling before being written to HBM. (Right): Speedup of Attention on GPT-2
+(Left): FlashAttention: The \(\mathbf{(QK)V}\) matrix of size \(N^2\) is computed in SRAM using tiling before being written to HBM. (Right): Speedup of Attention on GPT-2
-Figure 3.6: Example of Kernel Fusion Enabling Efficient Operations in FlashAttention [20]
+Figure 3.6: Example of Kernel Fusion Enabling Efficient Operations in FlashAttention [21]
@@ -1012,7 +1022,7 @@
Recomputing of Activations on Backward Pass: Blue = forward, Red = backward Source
@@ -1040,10 +1050,10 @@
-
+
-Saving GPU Memory with Re-computation [23]
+Saving GPU Memory with Re-computation [24]
@@ -1058,7 +1068,7 @@
3.3 Mamba Architecture
-
The Mamba model is made by stacking multiple layers of Mamba blocks, similar to self-attention in the transformer. It is heavily inspired by its predecessor, the Hungry Hungry Hippo (H3) Architecture [24]. It starts with projecting inputs to hidden state, followed by convolution over projected dimensions with sigmoid-weighted linear unit (SILU) /Swish activation [25]. The SSM operation is then computed followed by the skip connection operation \(\mathbf{D}\) before downscaling for another linear projection.
+
The Mamba model is made by stacking multiple layers of Mamba blocks, similar to self-attention in the transformer. It is heavily inspired by its predecessor, the Hungry Hungry Hippo (H3) Architecture [25]. It starts with projecting inputs to hidden state, followed by convolution over projected dimensions with sigmoid-weighted linear unit (SILU) /Swish activation [26]. The SSM operation is then computed followed by the skip connection operation \(\mathbf{D}\) before downscaling for another linear projection.
The full architecture includes tokenising inputs into an embedding layer, followed by N stacked Mamba blocks, with the inclusion of a couple of RMSNorm normalisation layers and a softmax layer for choosing the next output token.
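The data flow described above (projection, causal convolution with SiLU, SSM with a skip term \(\mathbf{D}\), then an output projection) can be sketched on a single scalar channel as follows. This is a deliberately simplified illustration: the scalar weights and function names are hypothetical, the SSM parameters are fixed rather than input-dependent (so the selection mechanism is not shown), and the normalisation layers and any gating branch are omitted.

```python
import math

# Minimal single-channel sketch of the block's data flow. All weights and
# names are hypothetical; the SSM here is NOT selective (fixed parameters).

def silu(x):
    """Sigmoid-weighted linear unit: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def causal_conv1d(seq, kernel):
    """Causal 1D convolution: output at t only sees inputs at positions <= t."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(seq)
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(seq))]


def ssm_scan(seq, a_bar, b_bar, c, d):
    """Discrete SSM with skip term: h_t = a_bar*h_{t-1} + b_bar*u_t, y_t = c*h_t + d*u_t."""
    h, out = 0.0, []
    for u in seq:
        h = a_bar * h + b_bar * u
        out.append(c * h + d * u)
    return out


def mamba_block_sketch(x, w_in, kernel, a_bar, b_bar, c, d, w_out):
    """Input projection -> causal conv + SiLU -> SSM with skip -> output projection."""
    up = [w_in * v for v in x]
    conv = [silu(v) for v in causal_conv1d(up, kernel)]
    y = ssm_scan(conv, a_bar, b_bar, c, d)
    return [w_out * v for v in y]


if __name__ == "__main__":
    out = mamba_block_sketch(
        [1.0, -2.0, 0.5, 3.0],
        w_in=0.5, kernel=[0.2, 0.3, 0.5],
        a_bar=0.9, b_bar=0.1, c=1.0, d=0.5, w_out=2.0,
    )
    print(out)
```

Every stage in the sketch is causal, so the output at position t depends only on inputs up to t, which is what allows the full model to be used autoregressively.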
@@ -1069,10 +1079,10 @@
-
+
-From H3 to the Mamba Block [24]
+From H3 to the Mamba Block [25]
Comparison of Mamba variants with different popular 7B LLMs on Piqa, Winogrande, Lambada, and Hellaswag Source
@@ -1120,7 +1130,7 @@
-
+
Evaluation Comparison of Mamba variants with several similar-sized LLMs [3]
@@ -1149,10 +1159,10 @@
4 Conclusion and
-
+
-Timeline of SSM based Models [26]
+Timeline of SSM based Models [27]
@@ -1161,10 +1171,10 @@
4 Conclusion and
-
+
-SSM Model Landscape Over Various Domains [27]
+SSM Model Landscape Over Various Domains [28]
@@ -1177,9 +1187,9 @@
4 Conclusion and
4.1 Applications and Architectures
-
From a recent survey, there are still stability challenges scaling SSMs to the same network size as SoTA transformers especially in vision [26]. Fusion techniques may fill in each others’ shortcomings between CNNs, vision transformers and vision mamba models in future to allow for better generalisation performance with long-context dependencies. For example, this has lead to the open-source release of a new LLM foundation model, Jamba, from AI32 Labs fusing the Transformer, Mamba, and MoE (Mixture-of-Experts) architectures to enable context length of 256K tokens with performance reaching Mixtral-7B and Llama2-7B with a reduced KV cache memory footprint of only 4GB [28].
+
From a recent survey, there are still stability challenges in scaling SSMs to the same network size as SoTA transformers, especially in vision [27]. Fusion techniques may fill in each other’s shortcomings between CNNs, vision transformers and vision Mamba models in future to allow for better generalisation performance with long-context dependencies. For example, this has led to the open-source release of a new LLM foundation model, Jamba, from AI21 Labs, fusing the Transformer, Mamba, and MoE (Mixture-of-Experts) architectures to enable a context length of 256K tokens with performance reaching Mixtral-8x7B and Llama2-7B with a reduced KV cache memory footprint of only 4GB [29].
The plethora of Mamba vision variants of late extend the selective scan algorithm to 2 dimensions where the scan techniques can be categorised into four groups: scan mode, scan axis, scan continuity and scan sampling (see Figure 4.2).
-
However, a recent paper, MambaOut, highlights that Mamba models may not be needed for tasks that do not require long-sequence dependencies and autoregressive characteristics, such as image classification [29] which they prove by showing that MambaOut can outperform SoTA vision Mamba models on ImageNet-1K classification without the Mamba block. It will be fruitful, however, to evaluate Mamba’s performance on detection and segmentation in long-context settings such as with long-term video sequences (movies) or high-dimensional imagery (remote sensing).
+
However, a recent paper, MambaOut, highlights that Mamba models may not be needed for tasks that do not require long-sequence dependencies and autoregressive characteristics, such as image classification [30] which they prove by showing that MambaOut can outperform SoTA vision Mamba models on ImageNet-1K classification without the Mamba block. It will be fruitful, however, to evaluate Mamba’s performance on detection and segmentation in long-context settings such as with long-term video sequences (movies) or high-dimensional imagery (remote sensing).
Modifying Mamba’s inherently 1D selective scan, which is meant for a causal sequential stream, into a bi-directional 2D scan technique has posed algorithmic challenges in scalability and stability, as well as in maintaining spatial information without redundancy in computation. Therefore, advancements in the scanning operators are needed in order to apply Mamba to higher-dimensional non-causal visual data more effectively in future, and to capture more comprehensive skewed feature representations to enhance the feature learning in SSMs.
A. Gu et al., “Combining recurrent, convolutional, and continuous-time models with linear state-space layers.” 2021. Available: https://arxiv.org/abs/2110.13985
A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as effective as structured state spaces.” 2022. Available: https://arxiv.org/abs/2203.14343
+
[19]
A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as effective as structured state spaces.” 2022. Available: https://arxiv.org/abs/2203.14343
-
[19]
A. Gu, A. Gupta, K. Goel, and C. Ré, “On the parameterization and initialization of diagonal state space models.” 2022. Available: https://arxiv.org/abs/2206.11893
+
[20]
A. Gu, A. Gupta, K. Goel, and C. Ré, “On the parameterization and initialization of diagonal state space models.” 2022. Available: https://arxiv.org/abs/2206.11893
-
[20]
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness.” 2022. Available: https://arxiv.org/abs/2205.14135
+
[21]
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness.” 2022. Available: https://arxiv.org/abs/2205.14135
V. Korthikanti et al., “Reducing activation recomputation in large transformer models.” 2022. Available: https://arxiv.org/abs/2205.05198
+
[24]
V. Korthikanti et al., “Reducing activation recomputation in large transformer models.” 2022. Available: https://arxiv.org/abs/2205.05198
-
[24]
D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré, “Hungry hungry hippos: Towards language modeling with state space models.” 2023. Available: https://arxiv.org/abs/2212.14052
+
[25]
D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré, “Hungry hungry hippos: Towards language modeling with state space models.” 2023. Available: https://arxiv.org/abs/2212.14052
-
[25]
S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.” 2017. Available: https://arxiv.org/abs/1702.03118
+
[26]
S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.” 2017. Available: https://arxiv.org/abs/1702.03118
-
[26]
X. Wang et al., “State space model for new-generation network alternative to transformers: A survey.” 2024. Available: https://arxiv.org/abs/2404.09516
+
[27]
X. Wang et al., “State space model for new-generation network alternative to transformers: A survey.” 2024. Available: https://arxiv.org/abs/2404.09516
-
[27]
B. N. Patro and V. S. Agneeswaran, “Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges.” 2024. Available: https://arxiv.org/abs/2404.16112
+
[28]
B. N. Patro and V. S. Agneeswaran, “Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges.” 2024. Available: https://arxiv.org/abs/2404.16112
W. Yu and X. Wang, “MambaOut: Do we really need mamba for vision?” arXiv preprint arXiv:2405.07992, 2024.
+
[30]
W. Yu and X. Wang, “MambaOut: Do we really need mamba for vision?” arXiv preprint arXiv:2405.07992, 2024.
-
[30]
R. Xu, S. Yang, Y. Wang, B. Du, and H. Chen, “A survey on vision mamba: Models, applications and challenges.” 2024. Available: https://arxiv.org/abs/2404.18861
+
[31]
R. Xu, S. Yang, Y. Wang, B. Du, and H. Chen, “A survey on vision mamba: Models, applications and challenges.” 2024. Available: https://arxiv.org/abs/2404.18861
@@ -1320,50 +1333,50 @@
5 References
Figure 1.1: Spectrum of Efficiency vs Effectiveness of State Representation in Different Model Architecture Families [1]
Figure 1.2: Discrete - Continuous Spectrum of Data Sources and Examples [4]
-Figure 1.3: Long Range Arena: Benchmark Spanning Text Images, Symbolic Reasoning (1K-16K token length) [6]
-Figure 1.4: Mamba: Matching Transformer Performance with Efficiency in Training and Inference [3]
-Figure 1.4: Mamba: Matching Transformer Performance with Efficiency in Training and Inference [3]
-Head View: Visualising attention head activations between different layers. Connecting lines are weighted based on the attention score between respective words.
-Neuron View: Visualising query, key and value embeddings when computing attention between each token and other tokens within the sequence. Positive values are colored as blue and negative values as orange.
-OpenAI’s GPT-4-128K
-Anthropic’s Claude 2.1
-Figure 1.7: Lost in the Middle: Performance Degrades When Information Access is in the Middle of Document [10]
+Figure 1.3: Long Range Arena: Benchmark Spanning Text Images, Symbolic Reasoning (1K-16K token length) [6]
+Figure 1.4: Mamba: Matching Transformer Performance with Efficiency in Training and Inference [3]
+Figure 1.4: Mamba: Matching Transformer Performance with Efficiency in Training and Inference [3]
+Head View: Visualising attention head activations between different layers. Connecting lines are weighted based on the attention score between respective words.
+Neuron View: Visualising query, key and value embeddings when computing attention between each token and other tokens within the sequence. Positive values are colored as blue and negative values as orange.
+OpenAI’s GPT-4-128K Long Context Performance
+Anthropic’s Claude 2.1 Long Context Performance
+Figure 1.7: Lost in the Middle: Performance Degrades When Information Access is in the Middle of Document [10]Figure 1.8: Comparison of scaled dot-product attention with and without KV caching [12]Figure 1.9: Unrolling Recurrent Neural Network Architecture Over Time
-Figure 2.1: The Three Representations of Linear State Space Layers in S4: (Left) State space models allow us to model continuous-time systems .(Center) The discretised recurrent format can be used for fast autoregressive inference. Recent theory on continuous-time memorisation of the hidden state transition matrix \(\mathbf{\bar{A}}\) enables us to capture LRDs mathematically and empirically. (Right) Unrolling the RNN into a global convolutional representation allows for efficient training by computing the layer depthwise in parallel [15].
+Figure 2.1: The Three Representations of Linear State Space Layers in S4: (Left) State space models allow us to model continuous-time systems .(Center) The discretised recurrent format can be used for fast autoregressive inference. Recent theory on continuous-time memorisation of the hidden state transition matrix \(\mathbf{\bar{A}}\) enables us to capture LRDs mathematically and empirically. (Right) Unrolling the RNN into a global convolutional representation allows for efficient training by computing the layer depthwise in parallel [15].Figure 2.2: Visualising State Space Models [1]Figure 2.2: Visualising State Space Models [1]
-Figure 2.3: From Continuous to Discrete SSMs With the Zero Order Hold Rule [1]
-Figure 2.3: From Continuous to Discrete SSMs With the Zero Order Hold Rule [1]
+Zero Order Hold Sampling Function
+Discrete SSM Diagram [1]Figure 2.4: Visualising 1D Convolution with 1x3 Kernel [16]
-Signal in Time and Frequency Domain [16]
-Legendre Polynomials [17]
+Signal in Time and Frequency Domain [17]
+Legendre Polynomials [18]Figure 2.6: Generalised HIPPO Operator Performing Approximations Over Uniform and Time Varying Measures [4]Figure 2.6: Generalised HIPPO Operator Performing Approximations Over Uniform and Time Varying Measures [4]Figure 2.6: Generalised HIPPO Operator Performing Approximations Over Uniform and Time Varying Measures [4]Diagonal Plus Low-rank Approximation
-S4D Recurrent and Convolutional View: Colors denote independent 1D SSMs; purple denotes trainable parameters.
+S4D Recurrent and Convolutional View: Colors denote independent 1D SSMs; purple denotes trainable parameters [20]Visualising S4 vs S4D ResultsS4 vs S4D Long Range Arena ResultsFigure 3.1: Differences between S4 and Mamba (S6) [3]
-Selective Copying: This requires time-varying models that can selectively remember or ignore inputs depending on their content.
-Induction Heads: This is an associative recall task which requires retrieving an answer based on context, a key ability of LLMs.
-Selective Copying Results: Accuracy for combinations of architectures
-Induction Heads Extrapolation: Mamba has ability to maintain high induction test accuracy for sequence length up to 1 million tokens
+Selective Copying: This requires time-varying models that can selectively remember or ignore inputs depending on their content.
+Induction Heads: This is an associative recall task which requires retrieving an answer based on context, a key ability of LLMs.
+Selective Copying Results: Accuracy for combinations of architectures
+Induction Heads Extrapolation: Mamba has ability to maintain high induction test accuracy for sequence length up to 1 million tokens(Left): Average Memory Bandwidth for A100(Right): Selective SSM Architecture Simplified: The select state layer is kept and computed in SRAM. [1]
-State Selection with Hardware-Aware State Expansion: The selection mechanism ensures the expanded matrix states only materialise in SRAM to reduce data transfer and computation between SRAM<>HBM. [3]
+State Selection with Hardware-Aware State Expansion: The selection mechanism ensures the expanded matrix states only materialise in SRAM to reduce data transfer and computation between SRAM<>HBM. [3]Visualisation of Linear ScanVisualisation of Blelloch Algorithm (Work-Efficient Parallel Prefix Scan)
-(Left): FlashAttention: The \(\mathbf{(QK)V}\) matrix of size \(N\times N\) is computed in SRAM using tiling before being written to HBM. (Right): Speedup of Attention on GPT-2
+(Left): FlashAttention: The \(\mathbf{(QK)V}\) matrix of size \(N^2\) is computed in SRAM using tiling before being written to HBM. (Right): Speedup of Attention on GPT-2Neural Network Computation Graph SourceRecomputing of Activations on Backward Pass: Blue = forward, Red = backward Source
-Saving GPU Memory with Re-computation [23]
-From H3 to the Mamba Block [24]
+Saving GPU Memory with Re-computation [24]
+From H3 to the Mamba Block [25]Mamba Block Decoder Architecture [1]Comparison of Mamba variants with different popular 7B LLMs on Piqa, Winogrande, Lambada, and Hellaswag SourceEvaluation Comparison of Mamba variants with several similar-sized LLMs [3]
-Timeline of SSM based Models [26]
-SSM Model Landscape Over Various Domains [27]
+Timeline of SSM based Models [27]
+SSM Model Landscape Over Various Domains [28]Vision Mamba Scan TechniquesVision Mamba Model Landscape