diff --git a/.nojekyll b/.nojekyll
index def7a36..01e4d7c 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-586b03db
\ No newline at end of file
+3a52b6e6
\ No newline at end of file
diff --git a/blog/index.html b/blog/index.html
index 22572ab..09f7c64 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -215,7 +215,7 @@
-The fundamental problem in deep sequence modelling is how to efficiently compress the context into a smaller learnable state representation whilst maintaining the quality of that representation. As seen in Figure 1.1, transformers have powerful in-context learning capabilities due to the inherent nature of attention, but their uncompressed memory state (the attention matrix) makes for inefficient inference, especially with long-range dependencies (LRD) or large context window settings. On the end, RNNs and S4 models may be efficient but fail to preserve the context state required to perform well in tasks that require in-context reasoning. Mamba proposes a context-aware method to dynamically filter out inputs in the sequence to effectively compress the context.
+The fundamental problem in deep sequence modelling is how to efficiently compress the context into a smaller learnable state representation whilst maintaining the quality of that representation. As seen in Figure 1.1, transformers have powerful in-context learning capabilities due to the inherent nature of attention, but their uncompressed memory state (the attention matrix) makes for inefficient inference, especially with long-range dependencies (LRD) or large context window settings. On the other end, RNNs and S4 models may be efficient but fail to preserve the context state required to perform well in tasks that require in-context reasoning. Mamba proposes a context-aware method to dynamically filter out inputs in the sequence to effectively compress the context.
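To make this trade-off concrete, the toy sketch below (my own illustration, not code from the post) contrasts an attention-style cache, which stores every token and therefore grows with the sequence length, with a fixed-size recurrent state that squashes the whole context into a single buffer; all names and shapes here are hypothetical.

```python
import numpy as np

d_model, d_state = 64, 16

def attention_style_memory(tokens):
    """Keep every token around: the memory state grows as O(L * d_model)."""
    cache = []                          # uncompressed context
    for x in tokens:
        cache.append(x)                 # nothing is ever thrown away
    return np.stack(cache)              # shape (L, d_model)

def recurrent_style_memory(tokens, A, B):
    """Compress the whole context into a fixed O(d_state) vector."""
    h = np.zeros(d_state)
    for x in tokens:
        h = A @ h + B @ x               # old context is squashed into h
    return h                            # shape (d_state,) regardless of L

tokens = [np.random.randn(d_model) for _ in range(1000)]
A = 0.9 * np.eye(d_state)
B = np.random.randn(d_state, d_model) / np.sqrt(d_model)
print(attention_style_memory(tokens).shape)         # (1000, 64): grows with L
print(recurrent_style_memory(tokens, A, B).shape)   # (16,): constant size
```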
We can observe in the following experiments in Figure 1.6 that GPT4’s recall performance starts to degrade above 73K tokens, where we observe low recall performance when the fact is placed between 7–50% document depth. However, facts at the beginning of documents were recalled regardless of document length. This also seems to be the case for Anthropic’s Claude 2.1 model.
-The original S4 approach was to leverage the Diagonal Plus Low-Rank (DPLR) structure in complex space [18], which significantly reduces the space and time complexity as we only need to store and compute the diagonal elements and low-rank components of the dense matrix. It can be expressed as \(\mathbf{\bar{A}}=\mathbf{\Lambda}+\mathbf{PQ^*}\) where \(\mathbf{\Lambda}\) is the diagonal matrix and \(\mathbf{P}\), \(\mathbf{Q}\) are low-rank matrices (vectors for rank-1 updates). The addition of the low-rank term allows the DPLR matrix to capture more complex relationships in LRD compared to a simple diagonal matrix, whilst specialised techniques like the Woodbury identity make operations on DPLR matrices feasible and efficient. This was followed by a paper that showed empirically that just using the diagonal matrix, and removing the low-rank portion of the DPLR form of the HIPPO matrix, yielded similar results [18].
-This work led to S4D, used in Mamba [19], which further improves the computational efficiency and expressiveness of \(\mathbf{\bar{A}}\) by leveraging the Vandermonde matrix to compute the diagonal matrix, exploiting the properties of eigenvectors and eigenvalues to efficiently capture more complex relationships between state variables (such as powers and exponentials). This is expressed as \(\mathbf{\bar{A}}=\mathbf{V \Lambda V^{-1}}\) where \(\mathbf{\Lambda}\) is the diagonal matrix of eigenvalues, \(\mathbf{V}\) is the Vandermonde matrix of eigenvectors and \(\mathbf{V^{-1}}\) is the inverse Vandermonde matrix.
+The original S4 approach was to leverage the Diagonal Plus Low-Rank (DPLR) structure in complex space [19], which significantly reduces the space and time complexity as we only need to store and compute the diagonal elements and low-rank components of the dense matrix. It can be expressed as \(\mathbf{\bar{A}}=\mathbf{\Lambda}+\mathbf{PQ^*}\) where \(\mathbf{\Lambda}\) is the diagonal matrix and \(\mathbf{P}\), \(\mathbf{Q}\) are low-rank matrices (vectors for rank-1 updates). The addition of the low-rank term allows the DPLR matrix to capture more complex relationships in LRD compared to a simple diagonal matrix, whilst specialised techniques like the Woodbury identity make operations on DPLR matrices feasible and efficient. This was followed by a paper that showed empirically that just using the diagonal matrix, and removing the low-rank portion of the DPLR form of the HIPPO matrix, yielded similar results [19].
+This work led to S4D, used in Mamba [20], which further improves the computational efficiency and expressiveness of \(\mathbf{\bar{A}}\) by leveraging the Vandermonde matrix to compute the diagonal matrix, exploiting the properties of eigenvectors and eigenvalues to efficiently capture more complex relationships between state variables (such as powers and exponentials). This is expressed as \(\mathbf{\bar{A}}=\mathbf{V \Lambda V^{-1}}\) where \(\mathbf{\Lambda}\) is the diagonal matrix of eigenvalues, \(\mathbf{V}\) is the Vandermonde matrix of eigenvectors and \(\mathbf{V^{-1}}\) is the inverse Vandermonde matrix.
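As a rough numerical illustration of these two structures (a minimal sketch under my own simplifications, not the S4/S4D reference code), the snippet below builds a DPLR matrix \(\mathbf{\Lambda}+\mathbf{PQ^*}\) and then forms a diagonal-SSM convolution kernel from a Vandermonde matrix of powers of the (crudely discretised) eigenvalues, which is the structure that lets the kernel be computed without materialising powers of a dense \(\mathbf{\bar{A}}\); the step size and random parameters are placeholders.

```python
import numpy as np

N, L = 8, 32                                   # state size, kernel length
rng = np.random.default_rng(0)

# DPLR structure: diagonal Lambda plus a rank-1 correction P Q*.
Lam = np.diag(-1.0 - 1j * np.arange(N))        # diagonal part (complex plane)
P = rng.standard_normal((N, 1)) + 0j
Q = rng.standard_normal((N, 1)) + 0j
A_dplr = Lam + P @ Q.conj().T                  # dense here only for illustration;
                                               # S4 works with (Lambda, P, Q) directly

# Diagonal (S4D-style) kernel via a Vandermonde matrix of eigenvalue powers.
lam = np.diag(Lam)                             # (N,) eigenvalues
lam_bar = np.exp(0.01 * lam)                   # crude discretisation, step size 0.01
B = rng.standard_normal(N) + 0j
C = rng.standard_normal(N) + 0j
V = lam_bar[:, None] ** np.arange(L)[None, :]  # Vandermonde: V[n, l] = lam_bar_n ** l
K = ((C * B) @ V).real                         # kernel: K_l = sum_n C_n * lam_bar_n**l * B_n

print(A_dplr.shape, K.shape)                   # (8, 8) (32,)
```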
-However, making the system time-varying means we can no longer perform the convolution in Equation 2.4 to parallelise training, since it assumes a fixed kernel. To address this, Mamba introduces the selective scan layer. It is the implementation of a hardware-aware selective parallel scan algorithm using the same GPU kernel fusion techniques as FlashAttention [20] for transformers, a natural consequence of Mamba being a collaborative paper between Albert Gu (S4) and Tri Dao (FlashAttention). Therefore, the core optimisations in the selective SSM layer, namely parallel scan, kernel fusion and recomputation, all aim to perform as many operations as possible in the fast memory (SRAM) of the GPU before saving results back to high-bandwidth memory (HBM) of the GPU (see Figure 3.6). This reduces the data transfer (IO) between them, as loading is often the slowest process [21]. For more details on model optimisation on GPUs, this is a good read from first principles.
+However, making the system time-varying means we can no longer perform the convolution in Equation 2.4 to parallelise training, since it assumes a fixed kernel. To address this, Mamba introduces the selective scan layer. It is the implementation of a hardware-aware selective parallel scan algorithm using the same GPU kernel fusion techniques as FlashAttention [21] for transformers, a natural consequence of Mamba being a collaborative paper between Albert Gu (S4) and Tri Dao (FlashAttention). Therefore, the core optimisations in the selective SSM layer, namely parallel scan, kernel fusion and recomputation, all aim to perform as many operations as possible in the fast memory (SRAM) of the GPU before saving results back to high-bandwidth memory (HBM) (see Figure 3.6). This reduces the data transfer (IO) between them, as loading is often the slowest process [22]. For more details on model optimisation on GPUs, this is a good read from first principles.
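For intuition, here is a naive sequential reference of the selective (time-varying) recurrence, written as my own simplified sketch rather than the fused CUDA kernel: \(\Delta\), \(\mathbf{B}\) and \(\mathbf{C}\) are computed from the input at every step, and it is exactly this token-by-token loop that the hardware-aware parallel scan, kernel fusion and recomputation replace so that the state never has to round-trip through HBM. The projection callables and shapes below are hypothetical.

```python
import numpy as np

def selective_scan_ref(x, A, s_B, s_C, s_delta):
    """Naive O(L) loop over the input-dependent SSM: h_t = A_bar_t h_{t-1} + B_bar_t x_t.

    x: (L, D) inputs; A: (D, N) diagonal state matrix (one row per channel);
    s_B, s_C, s_delta: callables mapping a token x_t to B_t (N,), C_t (N,) and
    delta_t (D,), standing in for Mamba's learned selection projections.
    """
    L, D = x.shape
    h = np.zeros((D, A.shape[1]))
    y = np.zeros((L, D))
    for t in range(L):                      # the loop the parallel scan removes
        delta = s_delta(x[t])[:, None]      # (D, 1) input-dependent step size
        A_bar = np.exp(delta * A)           # (D, N) discretised A
        B_bar = delta * s_B(x[t])[None, :]  # (D, N) simplified discretised B
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = (h * s_C(x[t])[None, :]).sum(-1)
    return y

# Tiny usage example with random (hypothetical) projections.
rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
A = -np.exp(rng.standard_normal((D, N)))    # negative-real diagonal A for stability
Wb, Wc = rng.standard_normal((2, N, D)) * 0.1
Wd = rng.standard_normal((D, D)) * 0.1
y = selective_scan_ref(
    rng.standard_normal((L, D)), A,
    s_B=lambda xt: Wb @ xt,
    s_C=lambda xt: Wc @ xt,
    s_delta=lambda xt: np.log1p(np.exp(Wd @ xt)),  # softplus keeps delta positive
)
print(y.shape)                              # (16, 4)
```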
-The Mamba model is made by stacking multiple layers of Mamba blocks, similar to stacked self-attention blocks in the transformer. It is heavily inspired by its predecessor, the Hungry Hungry Hippos (H3) architecture [24]. It starts by projecting the input to a hidden state, followed by a convolution over the projected dimensions with a sigmoid-weighted linear unit (SiLU)/Swish activation [25]. The SSM operation is then computed, followed by the skip connection operation \(\mathbf{D}\), before downscaling with another linear projection.
+The Mamba model is made by stacking multiple layers of Mamba blocks, similar to stacked self-attention blocks in the transformer. It is heavily inspired by its predecessor, the Hungry Hungry Hippos (H3) architecture [25]. It starts by projecting the input to a hidden state, followed by a convolution over the projected dimensions with a sigmoid-weighted linear unit (SiLU)/Swish activation [26]. The SSM operation is then computed, followed by the skip connection operation \(\mathbf{D}\), before downscaling with another linear projection.
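That flow maps roughly onto the PyTorch-style sketch below. This is my own condensed reading of the description above, not the official implementation: the names, default sizes and the placeholder for the selective SSM are hypothetical, and the gating branch of the real block is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Condensed sketch of the block flow described above (not the reference code)."""

    def __init__(self, d_model=256, expand=2, d_conv=4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, d_inner)             # project up to hidden size
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)
        self.ssm = nn.Identity()                               # placeholder for the selective SSM
        self.D = nn.Parameter(torch.ones(d_inner))             # skip connection D
        self.out_proj = nn.Linear(d_inner, d_model)            # project back down

    def forward(self, x):                                      # x: (batch, length, d_model)
        h = self.in_proj(x)
        h = self.conv1d(h.transpose(1, 2))[..., :x.shape[1]]   # causal depthwise convolution
        h = F.silu(h.transpose(1, 2))                          # SiLU/Swish activation
        y = self.ssm(h) + self.D * h                           # SSM output plus D-weighted skip
        return self.out_proj(y)

print(MambaBlockSketch()(torch.randn(2, 32, 256)).shape)       # torch.Size([2, 32, 256])
```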
The full architecture includes tokenising the input and passing it through an embedding layer, followed by the Mamba block repeated N times, with the inclusion of a couple of RMSNorm normalisation layers and a softmax layer for choosing the next output token.
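Stacking this into the full model then looks roughly like the following outline (again my own hypothetical sketch, reusing MambaBlockSketch from the previous snippet; the vocabulary size, depth N and the hand-rolled RMSNorm are placeholders):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMS normalisation layer (stand-in for the RMSNorm mentioned above)."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class MambaLMSketch(nn.Module):
    """Embedding -> N x (RMSNorm + Mamba block with residual) -> RMSNorm -> softmax head."""
    def __init__(self, vocab_size=50_000, d_model=256, n_blocks=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.norms = nn.ModuleList([RMSNorm(d_model) for _ in range(n_blocks)])
        self.blocks = nn.ModuleList([MambaBlockSketch(d_model) for _ in range(n_blocks)])
        self.final_norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):                             # (batch, length) token ids
        h = self.embed(token_ids)
        for norm, block in zip(self.norms, self.blocks):
            h = h + block(norm(h))                            # pre-norm residual stacking
        return self.lm_head(self.final_norm(h)).softmax(-1)   # next-token distribution

probs = MambaLMSketch()(torch.randint(0, 50_000, (2, 32)))
print(probs.shape)                                            # torch.Size([2, 32, 50000])
```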
@@ -1081,7 +1091,7 @@
-From a recent survey, there are still stability challenges in scaling SSMs to the same network size as SoTA transformers, especially in vision [26]. Fusion techniques may allow CNNs, vision transformers and vision Mamba models to fill in each other’s shortcomings in future, allowing for better generalisation performance with long-context dependencies. For example, this has led to the open-source release of a new LLM foundation model, Jamba, from AI21 Labs, fusing the Transformer, Mamba, and MoE (Mixture-of-Experts) architectures to enable a context length of 256K tokens with performance reaching Mixtral-7B and Llama2-7B with a reduced KV cache memory footprint of only 4GB [28].
+From a recent survey, there are still stability challenges in scaling SSMs to the same network size as SoTA transformers, especially in vision [27]. Fusion techniques may allow CNNs, vision transformers and vision Mamba models to fill in each other’s shortcomings in future, allowing for better generalisation performance with long-context dependencies. For example, this has led to the open-source release of a new LLM foundation model, Jamba, from AI21 Labs, fusing the Transformer, Mamba, and MoE (Mixture-of-Experts) architectures to enable a context length of 256K tokens with performance reaching Mixtral-7B and Llama2-7B with a reduced KV cache memory footprint of only 4GB [29].
The plethora of recent vision Mamba variants extends the selective scan algorithm to two dimensions, where the scan techniques can be categorised into four groups: scan mode, scan axis, scan continuity and scan sampling (see Figure 4.2).
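As a concrete (and deliberately simplified) example of the scan-axis and scan-mode idea, the sketch below flattens a single H x W feature map into four 1D sequences, row-major and column-major, each traversed forwards and backwards, which is the kind of multi-directional ordering that cross-scan-style vision variants feed into the 1D selective scan; the function name and shapes are hypothetical.

```python
import numpy as np

def directional_scans(feat):
    """Return four 1D token orderings of a (H, W, C) feature map."""
    H, W, C = feat.shape
    row_major = feat.reshape(H * W, C)                     # scan axis: along each row
    col_major = feat.transpose(1, 0, 2).reshape(H * W, C)  # scan axis: along each column
    return [row_major, row_major[::-1],                    # forward / backward row scans
            col_major, col_major[::-1]]                    # forward / backward column scans

feat = np.random.randn(4, 4, 8)
for i, seq in enumerate(directional_scans(feat)):
    print(i, seq.shape)                                    # each a (16, 8) sequence for the 1D scan
```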
-However, a recent paper, MambaOut, highlights that Mamba models may not be needed for tasks that do not require long-sequence dependencies and autoregressive characteristics, such as image classification [29]. They demonstrate this by showing that MambaOut can outperform SoTA vision Mamba models on ImageNet-1K classification without the Mamba block. It will be fruitful, however, to evaluate Mamba’s performance on detection and segmentation in long-context settings such as with long-term video sequences (movies) or high-dimensional imagery (remote sensing).
+However, a recent paper, MambaOut, highlights that Mamba models may not be needed for tasks that do not require long-sequence dependencies and autoregressive characteristics, such as image classification [30]. They demonstrate this by showing that MambaOut can outperform SoTA vision Mamba models on ImageNet-1K classification without the Mamba block. It will be fruitful, however, to evaluate Mamba’s performance on detection and segmentation in long-context settings such as with long-term video sequences (movies) or high-dimensional imagery (remote sensing).
Modifying Mamba’s selective scan, which is inherently 1D and designed for a causal sequential stream, into a bi-directional 2D scan technique has posed algorithmic challenges in scalability and stability, as well as in maintaining spatial information without redundant computation. Therefore, there need to be advancements in the scanning operators in order to apply Mamba to higher-dimensional, non-causal visual data more effectively in future, and to capture more comprehensive feature representations to enhance feature learning in SSMs.