diff --git a/content/posts/flash_attention.md b/content/posts/flash_attention.md
index cf80758..eba9b4a 100644
--- a/content/posts/flash_attention.md
+++ b/content/posts/flash_attention.md
@@ -1,7 +1,7 @@
 ---
 title: Flash Attention
 description: Reduce the memory usage used to compute exact attention.
-date: 2024-03-26
+date: 2024-03-23
 tldr: Reduce the memory usage used to compute exact attention.
 draft: false
 tags: [attention, inference]
diff --git a/content/posts/mqa_gqa.md b/content/posts/mqa_gqa.md
index 67d9ed4..fbbc8da 100644
--- a/content/posts/mqa_gqa.md
+++ b/content/posts/mqa_gqa.md
@@ -33,7 +33,8 @@ For each head in a given group, we calculate attention outputs as
 
 The query matrices will be shared by all groups under a given head, and the key and value matrices will be used for all attention calculations within a given group.
 
-**Conversions from Multi Head Attention.** A natural question might be how one could take a model which uses multi-head attention and convert it to model using multi query attention or grouped query attention. To convert to multi query attention, we want to find a single representative matrix for both $K$ and $V$ from our set of $H$ different heads. We achieve this via mean pooling. For instance for $K$,
+#### Conversions from Multi Head Attention.
+A natural question might be how one could take a model which uses multi-head attention and convert it to a model using multi query attention or grouped query attention. To convert to multi query attention, we want to find a single representative matrix for both $K$ and $V$ from our set of $H$ different heads. We achieve this via mean pooling. For instance for $K$,
 
 \begin{equation}
 \text{mean pooling}(K_1,\dots,K_h) \rightarrow K'.
diff --git a/content/posts/sliding_window_attention.md b/content/posts/sliding_window_attention.md
index 6803214..f9f4c6d 100644
--- a/content/posts/sliding_window_attention.md
+++ b/content/posts/sliding_window_attention.md
@@ -1,7 +1,7 @@
 ---
 title: Sliding Window Attention
 description: Altering the tokens to which a token in the input sequence attends.
-date: 2024-03-22
+date: 2024-03-27
 tldr: Altering the tokens to which a token in the input sequence attends.
 draft: false
 tags: [attention]
diff --git a/content/posts/sparse_attention.md b/content/posts/sparse_attention.md
index 4689868..381f940 100644
--- a/content/posts/sparse_attention.md
+++ b/content/posts/sparse_attention.md
@@ -1,7 +1,7 @@
 ---
 title: Sparse Attention
 description: Reducing the number of calculations to compute attention.
-date: 2024-03-22
+date: 2024-03-25
 tldr: Reducing the number of calculations to compute attention.
 draft: false
 tags: [attention]
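
The mean-pooling conversion added in the mqa_gqa.md hunk can be sketched in a few lines. This is a minimal illustration assuming PyTorch, per-head K/V projection weights stacked along a leading head dimension, and an arbitrary group count `G`; none of these names, shapes, or values come from the post itself.

```python
import torch

H, d_model, d_head = 8, 512, 64  # assumed dimensions, for illustration only

# Per-head key/value projection weights from a multi-head attention layer
# (random tensors stand in for trained weights).
K_heads = torch.randn(H, d_model, d_head)
V_heads = torch.randn(H, d_model, d_head)

# Multi query attention: mean pool across all H heads to obtain a single
# representative K' and V' shared by every head.
K_prime = K_heads.mean(dim=0)  # [d_model, d_head]
V_prime = V_heads.mean(dim=0)  # [d_model, d_head]

# Grouped query attention: mean pool within each group of heads instead,
# giving one K/V pair per group.
G = 4  # assumed number of groups
K_groups = K_heads.reshape(G, H // G, d_model, d_head).mean(dim=1)  # [G, d_model, d_head]
V_groups = V_heads.reshape(G, H // G, d_model, d_head).mean(dim=1)  # [G, d_model, d_head]
```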