S4 vs S4D Long Range Arena Results
3 How does Mamba
Figure 3.1: Differences between S4 and Mamba (S6) [3]
Selective Copying: This requires time-varying models that can selectively remember or ignore inputs depending on their content.
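To make the task concrete, here is a minimal sketch of how one instance of the selective copying task can be generated: data tokens are scattered among noise tokens, and the model must output only the data tokens in order, which forces it to decide per-input what to remember. The token conventions (token 0 as noise, the helper name) are my own illustrative assumptions, not code from the Mamba paper.

```python
import random

def make_selective_copy_example(vocab_size=8, num_data=4, seq_len=16, seed=0):
    """Build one (input, target) pair for the selective copying task.

    Illustrative sketch: data tokens (1..vocab_size-1) are placed at random
    positions among noise tokens (0); the target is the data tokens in order.
    """
    rng = random.Random(seed)
    NOISE = 0  # token 0 acts as noise/filler
    data = [rng.randint(1, vocab_size - 1) for _ in range(num_data)]
    positions = sorted(rng.sample(range(seq_len), num_data))
    inputs = [NOISE] * seq_len
    for pos, tok in zip(positions, data):
        inputs[pos] = tok
    target = data  # the model must reproduce these, ignoring the noise
    return inputs, target

inputs, target = make_selective_copy_example()
```

A time-invariant (LTI) model like S4 applies the same dynamics at every step, so it cannot use the token's *content* to decide whether to store it; an input-dependent (selective) model can.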
Induction Heads: This is an associative recall task which requires retrieving an answer based on context, a key ability of LLMs.
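The behaviour being tested can be sketched in a few lines: an induction head, after seeing a pattern like "... a b ... a", predicts "b", i.e. it recalls the token that previously followed the query token. This toy function is my own illustration of the task, not code from the paper.

```python
def induction_recall(sequence, query):
    """Associative recall: return the token that followed `query` the last
    time it occurred in `sequence` -- the behaviour an induction head learns.
    Returns None if `query` was never followed by anything.
    """
    answer = None
    for prev, nxt in zip(sequence, sequence[1:]):
        if prev == query:
            answer = nxt
    return answer

# After seeing "a b c d a", predicting the next token for "a" should give "b".
seq = ["a", "b", "c", "d", "a"]
```

The extrapolation test measures whether a model keeps performing this lookup when the gap between the two occurrences of the query grows far beyond the training length.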
Selective Copying Results: Accuracy for combinations of architectures
Induction Heads Extrapolation: Mamba maintains high induction test accuracy for sequence lengths up to 1 million tokens
(Left): Average Memory Bandwidth for A100. (Right): Selective SSM Architecture Simplified: the selective state layer is kept and computed in SRAM. [1]
State Selection with Hardware-Aware State Expansion: The selection mechanism ensures the expanded matrix states only materialise in SRAM, reducing data transfer and computation between SRAM and HBM. [3]
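A reference (unfused) version of the selective SSM recurrence clarifies what the hardware-aware kernel avoids materialising: the step size Δ and the parameters B, C are functions of the input (the selection mechanism), and the expanded hidden state h has shape (D, N). Mamba's CUDA kernel keeps h in SRAM and never writes it to HBM; this NumPy loop materialises it only to show the maths. The projection shapes and names here are illustrative assumptions, not the paper's API.

```python
import numpy as np

def selective_scan(x, A, B_proj, C_proj, dt_proj):
    """Unfused reference for the selective SSM recurrence.

    x: (L, D) input sequence; A: (D, N) state matrix (negative real parts).
    B, C and the step size dt are computed from x[t] at every step, so the
    dynamics are time-varying -- this is the 'selection' mechanism.
    """
    L, D = x.shape
    N = A.shape[1]                 # state expansion factor
    h = np.zeros((D, N))           # expanded state: kept in SRAM in the kernel
    ys = []
    for t in range(L):
        dt = np.log1p(np.exp(x[t] @ dt_proj))      # softplus -> positive step
        B = x[t] @ B_proj                          # (N,), input-dependent
        C = x[t] @ C_proj                          # (N,), input-dependent
        A_bar = np.exp(dt[:, None] * A)            # discretised A, (D, N)
        h = A_bar * h + (dt[:, None] * B[None, :]) * x[t][:, None]
        ys.append(h @ C)                           # contract state back to (D,)
    return np.stack(ys)            # (L, D)
```

Only the (L, D) inputs and outputs cross the SRAM/HBM boundary in the fused kernel; the (D, N) state exists transiently per step, which is why selection adds little memory traffic despite the N-fold state expansion.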
(Left): FlashAttention: The \(\mathbf{(QK)V}\) matrix of size \(N^2\) is computed in SRAM using tiling before being written to HBM. (Right): Speedup of Attention on GPT-2