Attention 機制進化史：由 2014 到 2026 嘅演變

最近研究唔同 attention mechanisms,先發覺原來呢十幾年嚟出現咗咁多唔同嘅變種!有啲而家仍然係主流,有啲就已經冇乜人用。我覺得值得整理一個 timeline,記錄呢個領域點樣一步步演化。

呢篇文會按時間順序講解各種 attention 機制,並且會標明邊啲仍然活躍 ✅、邊啲已經過時 ❌、邊啲只係學術研究 🎓。

📑 目錄

2014-2015: Attention 嘅誕生 🐣

Seq2Seq 嘅瓶頸 (2014)

喺 attention 出現之前,機器翻譯用 sequence-to-sequence (Seq2Seq) 架構:

javascriptEncoder → Fixed-size vector → Decoder

問題:

成個輸入句子要壓縮成一個固定長度嘅 vector
句子越長,信息損失越嚴重
譯長句嗰陣,開頭嘅詞會俾遺忘

💡 想像你要將成本書嘅內容濃縮成一句說話,然後用嗰句說話嚟重構成本書——當然會失真!

Bahdanau Attention (2015) ✅ 仍在用(教學)

論文: Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)

核心 idea: Decoder 每一步都可以望返去 encoder 嘅所有 hidden states,動態決定要 focus 邊度。

計算方式

Alignment score (用 small neural network):

e_{ij} = a(s_{i-1}, h_j)

其中 $s_{i-1}$ 係 decoder 上一步嘅 state, $h_j$ 係 encoder 第 $j$ 個 hidden state。

Attention weights (softmax normalization):

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}

Context vector (weighted sum):

c_i = \sum_{j=1}^{T} \alpha_{ij} h_j

具體例子:翻譯 "I love AI"

假設 encoder 產生 3 個 hidden states,decoder 生成第一個字「我」:

Encoder hidden states (簡化成 2D):

H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = \begin{bmatrix} 0.8 & 0.2 & 0.9 \\ 0.6 & 0.7 & 0.3 \end{bmatrix}

Decoder state:

s_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}

Step 1: 計 alignment scores (假設用 dot product 簡化):

\begin{aligned} e_{01} &= s_0^T h_1 = 0.5(0.8) + 0.5(0.6) = 0.70 \\ e_{02} &= s_0^T h_2 = 0.5(0.2) + 0.5(0.7) = 0.45 \\ e_{03} &= s_0^T h_3 = 0.5(0.9) + 0.5(0.3) = 0.60 \end{aligned}

Step 2: Softmax → attention weights:

\begin{aligned} \alpha_{01} &= \frac{e^{0.70}}{e^{0.70} + e^{0.45} + e^{0.60}} = \frac{2.01}{2.01 + 1.57 + 1.82} = 0.37 \\ \alpha_{02} &= \frac{1.57}{5.40} = 0.29 \\ \alpha_{03} &= \frac{1.82}{5.40} = 0.34 \end{aligned}

Step 3: Context vector (weighted sum):

c_0 = \alpha_{01} h_1 + \alpha_{02} h_2 + \alpha_{03} h_3

= 0.37 \begin{bmatrix} 0.8 \\ 0.6 \end{bmatrix} + 0.29 \begin{bmatrix} 0.2 \\ 0.7 \end{bmatrix} + 0.34 \begin{bmatrix} 0.9 \\ 0.3 \end{bmatrix}

= \begin{bmatrix} 0.296 + 0.058 + 0.306 \\ 0.222 + 0.203 + 0.102 \end{bmatrix} = \begin{bmatrix} 0.66 \\ 0.53 \end{bmatrix}

Attention matrix 視覺化:

javascriptDecoder → | h₁(I)  h₂(love) h₃(AI)
"我"      | 0.37   0.29     0.34    ← 最關注 "I"
"愛"      | 0.05   0.82     0.13    ← 最關注 "love"
"人工"    | 0.08   0.15     0.77    ← 最關注 "AI"

狀態: ✅ 仍在用於教學

而家唔會用喺 production (太慢、效果唔夠好)
但係教科書同課程仍然會教,因為概念清晰
係理解 attention 嘅最佳起點

Luong Attention (2015) ✅ 仍在用(教學)

論文: Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)

改進: 提出多種 alignment function 同 global/local attention 變種。

Alignment Functions

Dot product:

\text{score}(s_i, h_j) = s_i^T h_j

General:

\text{score}(s_i, h_j) = s_i^T W h_j

Concat (Bahdanau 嘅方法):

\text{score}(s_i, h_j) = v^T \tanh(W[s_i; h_j])

三種 Scoring 對比例子

假設 decoder state 同 encoder hidden state:

s = \begin{bmatrix} 0.6 \\ 0.8 \end{bmatrix}, \quad h = \begin{bmatrix} 0.5 \\ 0.9 \end{bmatrix}

1. Dot product:

\text{score} = s^T h = 0.6(0.5) + 0.8(0.9) = 0.30 + 0.72 = 1.02

2. General (假設權重矩陣):

W = \begin{bmatrix} 1.2 & 0.3 \\ 0.4 & 1.1 \end{bmatrix}

Wh = \begin{bmatrix} 1.2(0.5) + 0.3(0.9) \\ 0.4(0.5) + 1.1(0.9) \end{bmatrix} = \begin{bmatrix} 0.87 \\ 1.19 \end{bmatrix}

\text{score} = s^T(Wh) = 0.6(0.87) + 0.8(1.19) = 0.522 + 0.952 = 1.474

3. Concat:

[s; h] = \begin{bmatrix} 0.6 \\ 0.8 \\ 0.5 \\ 0.9 \end{bmatrix}

經過 tanh(W[s;h]) 同 v^T 之後得出一個 scalar score。

分別:

Dot: 最簡單,但假設 s 同 h 嘅 dimensions 對齊
General: 有學習嘅權重,更 flexible
Concat: Bahdanau 方法,最 expressive 但計算最貴

Global vs Local Attention

Global: 同 Bahdanau 類似,attend 晒所有 positions。

Local: 只 attend 一個 window (例如前後各 5 個 positions),減少計算量。

狀態: ✅ 仍在特定場景使用

Local attention 嘅 idea 影響咗後來嘅 sparse attention
Dot product score 變成 Transformer 嘅標準做法
但原版 RNN-based 實現已經冇人用

2017: Transformer 革命 🚀

Scaled Dot-Product Attention (2017) ✅ 核心方法,仍廣泛使用

論文: Attention Is All You Need (Vaswani et al., 2017)

呢個係現代 NLP 嘅分水嶺!

Query, Key, Value 概念

首次引入 QKV (Query, Key, Value) 框架:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

點解要除 $sqrt{d_k}$ ?

當 $d_k$ (key dimension) 好大嗰陣,dot product 嘅值會好大,令 softmax 進入 saturation region (gradient 接近 0)。除以 $\sqrt{d_k}$ 可以穩定訓練。

完整矩陣例子

假設 4 個 tokens,每個 embedding 係 3D:

Input embeddings:

X = \begin{bmatrix} 0.5 & 0.2 & 0.8 \\ 0.3 & 0.9 & 0.1 \\ 0.7 & 0.4 & 0.6 \\ 0.2 & 0.8 & 0.5 \end{bmatrix} \quad \text{(4 tokens × 3 dim)}

Weight matrices (簡化成 3×3):

W^Q = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad W^K = \begin{bmatrix} 0.8 & 0.1 & 0.2 \\ 0.1 & 0.9 & 0.3 \\ 0.2 & 0.1 & 0.7 \end{bmatrix}, \quad W^V = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

Step 1: 計 Q, K, V

Q = XW^Q = X \quad \text{(用 identity 簡化)}

K = XW^K = \begin{bmatrix} 0.56 & 0.18 & 0.69 \\ 0.26 & 0.84 & 0.22 \\ 0.68 & 0.52 & 0.71 \\ 0.26 & 0.79 & 0.49 \end{bmatrix}

V = XW^V = X

Step 2: 計 attention scores

QK^T = \begin{bmatrix} 0.5 & 0.2 & 0.8 \\ 0.3 & 0.9 & 0.1 \\ 0.7 & 0.4 & 0.6 \\ 0.2 & 0.8 & 0.5 \end{bmatrix} \begin{bmatrix} 0.56 & 0.26 & 0.68 & 0.26 \\ 0.18 & 0.84 & 0.52 & 0.79 \\ 0.69 & 0.22 & 0.71 & 0.49 \end{bmatrix}

= \begin{bmatrix} 0.868 & 0.472 & 1.076 & 0.722 \\ 0.501 & 1.116 & 0.679 & 1.111 \\ 0.862 & 0.856 & 1.182 & 0.856 \\ 0.801 & 0.883 & 0.991 & 1.117 \end{bmatrix}

Step 3: Scaling (假設 $d_k = 3$ ,所以除 $\sqrt{3} \approx 1.73$ ):

\frac{QK^T}{\sqrt{d_k}} = \begin{bmatrix} 0.50 & 0.27 & 0.62 & 0.42 \\ 0.29 & 0.65 & 0.39 & 0.64 \\ 0.50 & 0.49 & 0.68 & 0.49 \\ 0.46 & 0.51 & 0.57 & 0.65 \end{bmatrix}

Step 4: Softmax (每行獨立做):

\text{Attention Weights} = \begin{bmatrix} 0.23 & 0.18 & 0.27 & 0.21 \\ 0.19 & 0.27 & 0.21 & 0.27 \\ 0.23 & 0.23 & 0.27 & 0.23 \\ 0.22 & 0.23 & 0.24 & 0.27 \end{bmatrix}

Step 5: 乘 V (weighted sum of values):

\text{Output} = \text{Attention Weights} \times V

= \begin{bmatrix} 0.23 & 0.18 & 0.27 & 0.21 \\ 0.19 & 0.27 & 0.21 & 0.27 \\ 0.23 & 0.23 & 0.27 & 0.23 \\ 0.22 & 0.23 & 0.24 & 0.27 \end{bmatrix} \begin{bmatrix} 0.5 & 0.2 & 0.8 \\ 0.3 & 0.9 & 0.1 \\ 0.7 & 0.4 & 0.6 \\ 0.2 & 0.8 & 0.5 \end{bmatrix}

= \begin{bmatrix} 0.48 & 0.45 & 0.54 \\ 0.41 & 0.59 & 0.44 \\ 0.48 & 0.50 & 0.53 \\ 0.43 & 0.54 & 0.51 \end{bmatrix}

視覺化 attention pattern:

javascriptToken0  Token1  Token2  Token3
Token0   0.23    0.18    0.27 ⬆  0.21
Token1   0.19    0.27 ⬆  0.21    0.27 ⬆
Token2   0.23    0.23    0.27 ⬆  0.23
Token3   0.22    0.23    0.24    0.27 ⬆

⬆ = 最高 attention weight

關鍵觀察:

每個 token 嘅 output 係所有 tokens 嘅加權平均
Attention weights 加埋等於 1 (因為 softmax)
唔同 token 可以 attend 去唔同位置

狀態: ✅ 仍然係標準做法

幾乎所有現代 LLM 都用呢個公式
GPT、BERT、T5、LLaMA 全部都係基於呢個
已經係 de facto standard

Multi-Head Attention (2017) ✅ 核心方法,仍廣泛使用

核心 idea: 用多個 attention heads 並行計算,捕捉唔同嘅 patterns。

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O

其中每個 head:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

好處:

唔同 heads 學習唔同 aspects (syntax, semantics, long-range dependencies)
可以並行計算,發揮 GPU 優勢
增加 model expressiveness

具體例子:2 個 Heads

假設 input 係 2 個 tokens,每個 embedding 6D,用 2 個 heads (每個 head 3D):

Input:

X = \begin{bmatrix} 0.5 & 0.2 & 0.8 & 0.1 & 0.9 & 0.3 \\ 0.7 & 0.4 & 0.6 & 0.8 & 0.2 & 0.5 \end{bmatrix} \quad \text{(2 tokens × 6 dim)}

Head 1 (用前 3 dims):

X_1 = \begin{bmatrix} 0.5 & 0.2 & 0.8 \\ 0.7 & 0.4 & 0.6 \end{bmatrix}

計 attention → 得出:

\text{head}_1 = \begin{bmatrix} 0.55 & 0.25 & 0.75 \\ 0.65 & 0.35 & 0.65 \end{bmatrix}

Head 2 (用後 3 dims):

X_2 = \begin{bmatrix} 0.1 & 0.9 & 0.3 \\ 0.8 & 0.2 & 0.5 \end{bmatrix}

計 attention → 得出:

\text{head}_2 = \begin{bmatrix} 0.45 & 0.65 & 0.35 \\ 0.70 & 0.30 & 0.50 \end{bmatrix}

Concatenate:

\text{Concat}(\text{head}_1, \text{head}_2) = \begin{bmatrix} 0.55 & 0.25 & 0.75 & 0.45 & 0.65 & 0.35 \\ 0.65 & 0.35 & 0.65 & 0.70 & 0.30 & 0.50 \end{bmatrix}

最後乘 output projection $W^O$ (6×6) 得出 final output。

關鍵點:

每個 head 獨立計算,可以並行
Head 1 可能學到 syntactic patterns,Head 2 學到 semantic patterns
Concat 之後保留晒所有 heads 嘅信息

狀態: ✅ 仍然係標準配置

現代模型通常用 8-96 個 heads
GPT-3: 96 heads
LLaMA-2 70B: 64 heads

Cross-Attention (2017) ✅ 仍廣泛使用

Transformer encoder-decoder 架構入面,decoder 用 cross-attention 嚟 attend 去 encoder 嘅 outputs。

特點:

Query 嚟自 decoder
Key 同 Value 嚟自 encoder
實現 source-target alignment (例如翻譯嗰陣)

具體例子:翻譯 "I love" → "我愛"

Encoder outputs (source: "I love"):

H_{\text{enc}} = \begin{bmatrix} 0.7 & 0.4 & 0.6 \\ 0.3 & 0.9 & 0.5 \end{bmatrix} \begin{matrix} \text{← "I"} \\ \text{← "love"} \end{matrix}

Decoder state (generating "愛"):

H_{\text{dec}} = \begin{bmatrix} 0.5 & 0.8 & 0.4 \end{bmatrix} \quad \text{← state after generating "我"}

Cross-Attention:

Query: 嚟自 decoder → $Q = H_{\text{dec}}W^Q$
Key, Value: 嚟自 encoder → $K = H_{\text{enc}}W^K, V = H_{\text{enc}}W^V$

假設用簡化權重,計 attention scores:

QK^T = \begin{bmatrix} 0.5 & 0.8 & 0.4 \end{bmatrix} \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.9 \\ 0.6 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.91 & 1.31 \end{bmatrix}

Softmax:

\text{Weights} = [0.35, 0.65]

解讀: 生成 "愛" 嗰陣,decoder attend 更多去 "love" (0.65) 而唔係 "I" (0.35),符合翻譯對齊!

Output (weighted sum of encoder values):

\text{context} = 0.35 \times \text{V}_{\text{I}} + 0.65 \times \text{V}_{\text{love}}

呢個 context vector 會 feed 去 decoder 嚟生成下一個字。

狀態: ✅ 仍在特定任務使用

Encoder-decoder models 仍然用: T5, BART, Whisper (語音識別)
多模態模型: CLIP, Flamingo (vision-language)
但係純語言模型 (GPT 系列) 唔使,因為係 decoder-only

2018-2019: 第一波優化 🔧

Self-Attention (2017+) ✅ 核心方法

Self-attention 係 Transformer 嘅精髓: Q, K, V 全部都嚟自同一個 input sequence。

\text{Self-Attention}(X) = \text{Attention}(XW^Q, XW^K, XW^V)

用途:

Encoder (BERT): bidirectional self-attention
Decoder (GPT): causal self-attention (只能望前面)

具體例子:句子 "I love AI"

假設每個 word embedding 係 3D (簡化):

X = \begin{bmatrix} 0.8 & 0.3 & 0.5 \\ 0.2 & 0.9 & 0.4 \\ 0.6 & 0.5 & 0.7 \end{bmatrix} \begin{matrix} \text{← "I"} \\ \text{← "love"} \\ \text{← "AI"} \end{matrix}

計 Q, K, V (用簡化權重):

假設 $W^Q = W^K = W^V = I$ (identity),所以:

Q = K = V = X

Step 1: Attention scores $QK^T$ :

QK^T = \begin{bmatrix} 0.8 & 0.3 & 0.5 \\ 0.2 & 0.9 & 0.4 \\ 0.6 & 0.5 & 0.7 \end{bmatrix} \begin{bmatrix} 0.8 & 0.2 & 0.6 \\ 0.3 & 0.9 & 0.5 \\ 0.5 & 0.4 & 0.7 \end{bmatrix}

= \begin{bmatrix} 0.98 & 0.63 & 1.06 \\ 0.63 & 1.21 & 0.85 \\ 1.06 & 0.85 & 1.35 \end{bmatrix}

Step 2: Softmax (每行):

\text{Attention Weights} = \begin{bmatrix} 0.31 & 0.22 & 0.47 \\ 0.22 & 0.47 & 0.31 \\ 0.31 & 0.24 & 0.45 \end{bmatrix}

解讀:

"I" (row 1) attends most to "AI" (0.47)
"love" (row 2) attends most to itself (0.47)
"AI" (row 3) attends most to itself (0.45)

Step 3: 乘 V:

\text{Output} = \text{Attention Weights} \times V

每個 token 嘅 output 係所有 tokens 嘅加權組合,捕捉到 contextual information!

狀態: ✅ 係現代 transformer 嘅核心

Transformer-XL: Relative Position (2019) ❌ 已少用

論文: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)

問題: 原本 Transformer 有固定嘅 context length,唔能處理超長序列。

解決方法:

Segment-level recurrence: 將前一個 segment 嘅 hidden states cache 住
Relative position encodings: 用相對位置而唔係絕對位置

點解冇人用?

Implementation 複雜
Recurrence 機制唔夠 parallelizable
後來 RoPE (2021) 提供咗更簡單嘅相對位置方案
ALiBi (2022) 嘅外推能力更強

狀態: ❌ 基本已被淘汰

但佢嘅 relative position encoding idea 影響咗後來嘅 RoPE 同 ALiBi!

Sparse Transformer (2019) 🎓 學術研究為主

論文: Generating Long Sequences with Sparse Transformers (Child et al., 2019)

核心 idea: 唔係每個 token 都 attend 所有其他 tokens,只 attend 一部分。

Sparse Patterns

Strided attention:

每隔 $k$ 個 positions attend 一次:

javascriptToken 0 attends: [0, 5, 10, 15, ...]  (stride=5)

Fixed attention:

某啲 positions 有 "global" attention,所有 tokens 都 attend 去佢哋。

複雜度: 由 $O(n^2)$ 降到 $O(n\sqrt{n})$ 或 $O(n \log n)$

狀態: 🎓 主要係學術研究

點解冇普及?

Implementation 複雜,需要 custom CUDA kernels
對於常見嘅序列長度 (2k-8k tokens),dense attention + Flash Attention 已經夠快
Sparse patterns 要針對 task 設計,唔夠 general

邊度仍在用?

BigBird (Google, 2020): 用咗 sparse attention
Longformer (AllenAI, 2020): 結合 local + global attention
但主流 LLMs (GPT, LLaMA) 都唔用

2020-2021: 長序列挑戰 📏

Longformer (2020) ✅ 仍在特定場景使用

論文: Longformer: The Long-Document Transformer (Beltagy et al., 2020)

核心 idea: 結合 local windowed attention + global attention。

Attention Pattern

Local attention: 每個 token attend 前後 $w$ 個 tokens (例如 $w=512$ )
Global attention: 某啲特殊 tokens (例如 [CLS]) attend 全部 tokens

複雜度: $O(n times w)$ ,linear in sequence length!

Attention Pattern 視覺化

假設 8 個 tokens,window size = 2 (前後各 1 個 token):

Full Attention (傳統):

javascript0  1  2  3  4  5  6  7
0  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]  ← token 0 attend 晒所有
1  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
2  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
3  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
4  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
5  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
6  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
7  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]

每個 token: 8 個 attention 計算
總計: 8 × 8 = 64 計算 → O(n²)

Longformer Local Attention (window=2):

javascript0  1  2  3  4  5  6  7
0  [ ✓  ✓  ·  ·  ·  ·  ·  · ]  ← token 0 只 attend [0,1]
1  [ ✓  ✓  ✓  ·  ·  ·  ·  · ]  ← token 1 attend [0,1,2]
2  [ ·  ✓  ✓  ✓  ·  ·  ·  · ]  ← attend [1,2,3]
3  [ ·  ·  ✓  ✓  ✓  ·  ·  · ]  ← attend [2,3,4]
4  [ ·  ·  ·  ✓  ✓  ✓  ·  · ]
5  [ ·  ·  ·  ·  ✓  ✓  ✓  · ]
6  [ ·  ·  ·  ·  ·  ✓  ✓  ✓ ]
7  [ ·  ·  ·  ·  ·  ·  ✓  ✓ ]  ← token 7 只 attend [6,7]

每個 token: 2-3 個 attention 計算
總計: ~8 × 2.5 = 20 計算 → O(n×w)

Longformer + Global Attention (token 0 係 [CLS]):

javascript0  1  2  3  4  5  6  7
0  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]  ← [CLS] attend 全部
1  [ ✓  ✓  ✓  ·  ·  ·  ·  · ]  ← local + attend to [CLS]
2  [ ✓  ✓  ✓  ✓  ·  ·  ·  · ]  ← local + attend to [CLS]
3  [ ✓  ·  ✓  ✓  ✓  ·  ·  · ]
4  [ ✓  ·  ·  ✓  ✓  ✓  ·  · ]
5  [ ✓  ·  ·  ·  ✓  ✓  ✓  · ]
6  [ ✓  ·  ·  ·  ·  ✓  ✓  ✓ ]
7  [ ✓  ·  ·  ·  ·  ·  ✓  ✓ ]
    ↑
   全部都 attend 去 [CLS]

實際應用 (4096 tokens, window=512):

Full attention: 4096² = 16M 計算
Longformer: 4096 × 512 = 2M 計算 (減少 8x!)

狀態: ✅ 文檔理解任務仍在用

LED (Longformer Encoder-Decoder): 用於文檔 summarization
適合處理超長文檔 (16k-32k tokens)
但係語言生成任務,大家仍然偏好 dense attention + better position encodings

BigBird (2020) 🎓 學術為主

論文: Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)

Sparse pattern: 結合三種 attention:

Random attention: 隨機 attend 幾個 tokens
Window attention: Local sliding window
Global attention: 特定 tokens attend 全部

狀態: 🎓 主要係研究用途

Google 有推出 BigBird models,但唔算主流
理論上有 universal approximation 保證,但實際應用有限

Linformer (2020) ❌ 已少用

論文: Linformer: Self-Attention with Linear Complexity (Wang et al., 2020)

核心 idea: 用 low-rank approximation 將 $n \times n$ 嘅 attention matrix 降到 $n \times k$ (其中 $k ll n$ )。

點做?

將 Key 同 Value 用 projection 降維:

K' = KE, \quad V' = VF

其中 $E, F in mathbb{R}^{n times k}$ 。

點解失敗?

信息損失: Low-rank approximation 會損失 long-range dependencies
唔夠 flexible: $k$ 係 hyperparameter,要手動 tune
Flash Attention 更好: 2022 年 Flash Attention 出現,做到 $O(n^2)$ 但係 memory-efficient,效果更好

狀態: ❌ 基本已被淘汰

Performer (2021) ❌ 已少用

論文: Rethinking Attention with Performers (Choromanski et al., 2021)

核心 idea: 用 kernel methods 將 attention 近似成 linear complexity。

數學技巧:

用 random feature maps $\phi$ 將 softmax 近似:

\text{Attention}(Q, K, V) \approx \frac{\phi(Q) (\phi(K)^T V)}{\phi(Q) \phi(K)^T}

計算順序改變: $(QK^T)V rightarrow Q(K^TV)$ ,複雜度由 $O(n^2 d)$ 降到 $O(nd^2)$ 。

點解冇普及?

近似誤差: 同 exact attention 有 gap
實作複雜: 需要 careful implementation
Flash Attention 出現,提供 exact attention 但更 memory-efficient

狀態: ❌ 基本已被淘汰

2021-2022: 位置編碼大戰 📍

Position encoding 係 attention 機制嘅重要組件,呢個時期出現咗好多新方法。

RoPE (2021) ✅ 現在主流!

論文: RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)

核心 idea: 通過旋轉 query 同 key 向量嚟注入相對位置信息。

\mathbf{q}_m = \mathbf{R}_m \mathbf{q}, \quad \mathbf{k}_n = \mathbf{R}_n \mathbf{k}

優點:

✅ 自然嘅相對位置建模
✅ 無額外參數
✅ 優秀嘅長度外推能力
✅ 計算高效

狀態: ✅ 而家最主流嘅位置編碼!

邊個用緊?

LLaMA / LLaMA 2 / LLaMA 3 (Meta)
PaLM / Gemini (Google)
GPT-NeoX (EleutherAI)
ChatGLM (清華)
Mistral (Mistral AI)
Qwen (阿里巴巴)

基本上 2021 之後嘅新模型,大部分都用 RoPE!

詳細可以睇我嘅另一篇文: Rotary Position Embeddings (RoPE)

ALiBi (2022) ✅ 第二主流

論文: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2022)

核心 idea: 喺 attention scores 加一個 linear bias term,bias 隨距離增加。

\text{softmax}(q_i K^T + m \cdot [-(i-1), -(i-2), \ldots, 0])

其中 $m$ 係 head-specific slope (唔同 head 用唔同 slope)。

具體例子:4 個 Tokens

假設有 4 個 tokens,計 token 3 嘅 attention:

Step 1: 原始 attention scores (假設已計咗 $q_3 K^T$ ):

\text{scores} = [2.1, 1.8, 2.5, 2.3]

Step 2: 加 ALiBi bias (假設 slope $m = 0.5$ ):

Bias 係根據相對距離:

\text{bias} = m \times [-(3-0), -(3-1), -(3-2), -(3-3)]

= 0.5 \times [-3, -2, -1, 0] = [-1.5, -1.0, -0.5, 0.0]

Step 3: 加埋:

\text{scores + bias} = [2.1-1.5, 1.8-1.0, 2.5-0.5, 2.3-0.0]

= [0.6, 0.8, 2.0, 2.3]

Step 4: Softmax:

\text{attention weights} = [0.12, 0.15, 0.31, 0.42]

效果: Token 3 對自己 (distance=0) 嘅 attention 最高,距離越遠嘅 token,attention 越低!

唔同 heads 用唔同 slopes:

Head 1: $m = 0.5$ (強 local bias)
Head 2: $m = 0.25$ (中等 bias)
Head 3: $m = 0.125$ (弱 bias)
Head 4: $m = 0.0625$ (好弱 bias)

每個 head 學習唔同 range 嘅 dependencies!

優點:

✅ 極強嘅外推能力: Train 2k, 可以推到 100k+ tokens 而 performance 唔跌太多
✅ Implementation 超簡單 (幾行 code)
✅ 無額外參數

缺點:

❌ 喺某啲任務,效果略遜於 RoPE

狀態: ✅ 第二主流,仍廣泛使用

邊個用緊?

BLOOM (BigScience)
MPT (MosaicML)
Falcon (TII UAE)
StarCoder (Hugging Face)

通常用於需要極長 context 嘅場景。

xPos (2022) 🎓 學術為主

論文: A Length-Extrapolatable Transformer (Sun et al., 2022)

核心 idea: 改進 RoPE,加入 exponential decay。

對於 query 同 key,分別乘以 decay factors:

\mathbf{q}_m = \mathbf{R}_m \mathbf{D}_m \mathbf{q}, \quad \mathbf{k}_n = \mathbf{R}_n \mathbf{D}_n^{-1} \mathbf{k}

好處: 更好嘅長度外推,特別係超長序列。

狀態: 🎓 主要係研究,少生產使用

Implementation 比 RoPE 複雜少少
效果提升唔算特別明顯
主流 LLMs 仍然偏好 RoPE 或 ALiBi

FIRE (2023) 🎓 研究階段

論文: Functional Interpolation for Relative Positions (Li et al., 2023)

核心 idea: 用可學習嘅函數嚟 interpolate 位置編碼。

狀態: 🎓 純學術研究

太新,未有大規模應用。

2022-2023: Memory Efficiency 革命 💾

呢個時期重點轉移到記憶體優化,因為 attention 嘅 $O(n^2)$ memory 係訓練長序列嘅最大瓶頸。

Flash Attention (2022) ✅ 革命性突破!

論文: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)

核心 idea: 通過 tiling 同 recomputation,減少 HBM (GPU global memory) 存取次數。

關鍵技術:

Block-sparse attention matrix: 將 $n \times n$ matrix 分成 blocks
Fused kernel: 將 softmax + matmul 合併成一個 CUDA kernel
Recomputation: Backward pass 時 recompute attention,唔需要 store 成個 $n \times n$ matrix

Tiling 例子

假設 sequence length = 1024,將佢分成 4 個 blocks (每個 256 tokens):

傳統 Attention - 一次過計成個矩陣:

python# 要 store 成個 1024×1024 attention matrix
S = Q @ K.T              # [1024×1024] 寫入 HBM
P = softmax(S)           # [1024×1024] 讀+寫 HBM  
O = P @ V                # [1024×1024] 讀 HBM

# Memory: O(n²) = O(1024²) = ~1M elements
# HBM 讀寫: 3 次大量存取

Flash Attention - Block-wise 計算:

python# 將 Q, K, V 分成 blocks
Q = [Q₁, Q₂, Q₃, Q₄]    # 每個 [256×d]
K = [K₁, K₂, K₃, K₄]  
V = [V₁, V₂, V₃, V₄]

# 逐個 block 計,keep 住喺 SRAM (fast memory)
for i in range(4):
    for j in range(4):
        # 只計 256×256 嘅 sub-block
        S_ij = Q_i @ K_j.T    # [256×256] stay in SRAM!
        P_ij = softmax(S_ij)  # stay in SRAM
        O_i += P_ij @ V_j     # accumulate

# Memory: O(n) = O(1024) - 只 store partial sums
# HBM 讀寫: 大幅減少!

Memory Hierarchy

javascriptSRAM (on-chip, 20 MB)    ← 快 ⚡⚡⚡ (19 TB/s)
  ↕  
HBM (GPU memory, 40 GB)  ← 慢 🐌 (1.5 TB/s)
  ↕
DRAM (CPU memory)

關鍵 insight: 盡量 keep 嘢喺 SRAM,減少 HBM 讀寫!

實際數字 (A100 GPU)

Sequence Length	Standard Attention	Flash Attention	Speedup
512	2.5 GB memory	0.8 GB	1.8x 快
1024	10 GB memory	1.2 GB	2.5x 快
2048	40 GB (OOM!)	2.8 GB	4x 快
4096	N/A (out of memory)	8 GB	可以跑!

效果:

✅ 2-4x speedup (training)
✅ 5-9x speedup (long sequences > 2k tokens)
✅ Memory 由 $O(n^2)$ 降到 $O(n)$
✅ Exact attention (唔係近似!)

狀態: ✅ 而家幾乎必用!

邊個用緊?

PyTorch 官方支持 (torch.nn.functional.scaled_dot_product_attention)
Hugging Face Transformers 預設開啟
vLLM (推理框架)
幾乎所有新訓練嘅大模型都用

Flash Attention 2 (2023) 同 Flash Attention 3 (2024) 進一步優化,速度更快!

詳細可以睇我嘅 Transformer 文章: Transformer 架構詳解：Attention、QKV 同 Multi-Head 機制

Memory-Efficient Attention (2022) ✅ 仍在用

來源: xFormers library (Meta)

核心 idea: 類似 Flash Attention,但用 different tiling strategies。

狀態: ✅ 仍在用,但 Flash Attention 更流行

Stable Diffusion 訓練用呢個
某啲 vision models 偏好用

PagedAttention (2023) ✅ 推理必備!

論文: Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023)

核心 idea: 將 KV cache 用 virtual memory paging 管理,減少記憶體碎片。

問題:

傳統推理,每個 request 要 pre-allocate 最大長度嘅 KV cache,浪費好多記憶體 (55-80% fragmentation!)

解決方法:

將 KV cache 分成固定 size 嘅 pages (例如每 page 64 tokens)
用 virtual memory 概念,dynamic allocation
唔同 requests 可以 share prefix pages (例如 system prompt)

效果:

✅ 2-4x throughput improvement
✅ 減少 55-80% 記憶體浪費
✅ 支持更多 concurrent requests

狀態: ✅ 推理框架標配!

邊個用緊?

vLLM (最流行嘅推理框架,原創者開發)
TensorRT-LLM (NVIDIA)
Text Generation Inference (Hugging Face)

如果你做 LLM serving,基本上必用!

Flash-Decoding (2023) ✅ 推理優化

論文: Flash-Decoding for Long-Context Inference (2023)

核心 idea: 針對 generation phase (decode) 優化 Flash Attention。

Decode 嗰陣每次只生成 1 個 token,但要 attend 去成個 context (可能幾萬 tokens),Flash-Decoding 用 parallelization 加速。

狀態: ✅ 推理框架逐步採用

整合進 Flash Attention 2/3
vLLM 等框架已支持

2023-2024: Multi-Query / Grouped-Query Attention 🚄

呢個時期重點係減少 KV cache size 嚟加快推理。

Multi-Query Attention (MQA) (2019/再度流行 2023) ✅ 推理優化常用

論文: Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019)

雖然 2019 年就提出,但係去到 2023 先開始流行!

核心 idea:

傳統 Multi-Head Attention: 每個 head 都有自己嘅 Q, K, V
MQA: 所有 heads share 同一組 K, V,只有 Q 係 per-head

對比例子:MHA vs MQA

假設 4 個 heads,sequence length = 3,hidden dim = 512:

Multi-Head Attention (MHA):

python# 每個 head 有自己嘅 K, V (head_dim = 128)
K_head1 = [3 × 128]  # Shape: [seq_len × head_dim]
V_head1 = [3 × 128]
K_head2 = [3 × 128]
V_head2 = [3 × 128]
K_head3 = [3 × 128]
V_head3 = [3 × 128]
K_head4 = [3 × 128]
V_head4 = [3 × 128]

# Total KV cache: 3 × 128 × 8 = 3,072 elements

Multi-Query Attention (MQA):

python# 所有 heads share 同一組 K, V
K_shared = [3 × 128]  # 只有一組!
V_shared = [3 × 128]

Q_head1 = [3 × 128]  # 每個 head 仍然有自己嘅 Q
Q_head2 = [3 × 128]
Q_head3 = [3 × 128]
Q_head4 = [3 × 128]

# Total KV cache: 3 × 128 × 2 = 768 elements
# 減少 4x!

實際計算 (token 0, head 1):

MHA:

\text{Attention}_{h1} = \text{softmax}(Q_{h1} K_{h1}^T) V_{h1}

MQA:

\text{Attention}_{h1} = \text{softmax}(Q_{h1} K_{\text{shared}}^T) V_{\text{shared}}

\text{Attention}_{h2} = \text{softmax}(Q_{h2} K_{\text{shared}}^T) V_{\text{shared}}

所有 heads 用同一組 K, V,但用唔同嘅 Q 去 query!

好處:

✅ KV cache size 減少 8-16x (如果有 8-16 個 heads)
✅ 推理速度快好多 (memory bandwidth bound → 減少記憶體存取)
❌ 但係效果會差少少 (quality degradation)

狀態: ✅ 某啲模型用,但 GQA 更常見

邊個用緊?

PaLM (Google)
Falcon 40B
StarCoder

Grouped-Query Attention (GQA) (2023) ✅ 而家最常見!

論文: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)

核心 idea: MQA 同 MHA 嘅折衷方案。

將 heads 分成 groups
每個 group share 同一組 K, V
例如:32 heads 分成 4 groups,每 group 8 heads share 一組 KV

對比例子:MHA vs GQA vs MQA

假設 8 個 query heads,sequence length = 3,head_dim = 128:

Multi-Head Attention (MHA) - 8 組 KV:

pythonK = [K₁, K₂, K₃, K₄, K₅, K₆, K₇, K₈]  # 每個 [3×128]
V = [V₁, V₂, V₃, V₄, V₅, V₆, V₇, V₈]

Q₁ → (K₁, V₁)
Q₂ → (K₂, V₂)
...  
Q₈ → (K₈, V₈)

# KV cache: 3 × 128 × 8 × 2 = 6,144 elements

Grouped-Query Attention (GQA) - 2 組 KV (每組 4 heads):

pythonK = [K_group1, K_group2]  # 只有 2 組!
V = [V_group1, V_group2]

# Group 1 (heads 1-4 share)
Q₁, Q₂, Q₃, Q₄ → (K_group1, V_group1)

# Group 2 (heads 5-8 share)  
Q₅, Q₆, Q₇, Q₈ → (K_group2, V_group2)

# KV cache: 3 × 128 × 2 × 2 = 1,536 elements  
# 減少 4x!

Multi-Query Attention (MQA) - 1 組 KV:

pythonK = [K_shared]  # 得一組!
V = [V_shared]

Q₁, Q₂, ..., Q₈ → (K_shared, V_shared)

# KV cache: 3 × 128 × 1 × 2 = 768 elements
# 減少 8x!

視覺化架構

javascriptMHA:  Q₁→K₁,V₁  Q₂→K₂,V₂  Q₃→K₃,V₃  Q₄→K₄,V₄  ...
      ━━━━━━━  ━━━━━━━  ━━━━━━━  ━━━━━━━
      獨立      獨立      獨立      獨立

GQA:  Q₁→╲      Q₂→╱      Q₃→╲      Q₄→╱
          K₁,V₁             K₂,V₂
      ━━━━━━━━━━━━━  ━━━━━━━━━━━━━
      Group 1 share    Group 2 share

MQA:  Q₁→╲  Q₂→╱  Q₃→╲  Q₄→╱  ...
            K_shared, V_shared  
      ━━━━━━━━━━━━━━━━━━━━━━━
      全部 share

Memory 對比表

方法	KV Groups	KV Cache Size	Quality	Speed
MHA (8 heads)	8	6,144	最好 ⭐⭐⭐	慢
GQA (2 groups)	2	1,536 (↓75%)	好 ⭐⭐	快 ⚡⚡
MQA (1 group)	1	768 (↓87.5%)	差啲 ⭐	最快 ⚡⚡⚡

LLaMA 2 70B 嘅選擇: 64 query heads, 8 KV groups → 每 group 8 heads share

好處:

✅ KV cache size 減少 4-8x (視乎 groups 數目)
✅ 效果比 MQA 好,接近 MHA
✅ 推理速度仍然快好多

狀態: ✅ 而家最主流嘅配置!

邊個用緊?

LLaMA 2 (Meta): 8 KV heads for 70B model
Mistral 7B: 8 KV heads
Qwen 2: GQA
Gemma: GQA

基本上 2023 之後嘅新模型,好多都用 GQA!

2024-2026: Mixture of Depths & 其他探索 🔬

Mixture of Depths (MoD) (2024) 🎓 研究階段

核心 idea: 唔係每個 token 都過完整嘅 transformer layer,根據 importance 決定邊啲 tokens 行 "express lane"。

動機:

某啲 tokens 重要 (例如 nouns, verbs),需要深度處理
某啲 tokens 唔重要 (例如 "the", "a"),可以 skip 某啲 layers

狀態: 🎓 太新,仍在研究

未有大規模 production 應用。

Infini-Attention (2024) 🎓 研究階段

論文: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (Google, 2024)

核心 idea: 結合 local attention + compressed memory,目標係處理無限長嘅 context。

狀態: 🎓 研究階段

Google 有做實驗,但未見普及。

Differential Attention (2024) 🎓 研究中

核心 idea: 用兩個 attention heads 做 difference,增強對 relevant info 嘅 focus。

狀態: 🎓 學術研究

Sliding Window Attention (Mistral, 2023) ✅ 仍在用

論文: Mistral 7B 引入

核心 idea: 每個 token 只 attend 前面 固定 window size 嘅 tokens (例如 4096)。

好處:

✅ KV cache 固定大小,唔會隨序列增長
✅ 推理速度穩定
❌ 但係失去真正嘅 long-range dependencies

狀態: ✅ Mistral 系列仍在用

但主流 LLMs 更偏好 full attention + better memory management。

已淘汰嘅方法總結 ❌

呢啲方法喺歷史上有貢獻,但而家基本上冇人用:

方法	年份	點解淘汰?	後繼者
Transformer-XL	2019	實作複雜,recurrence 唔夠平行	RoPE, ALiBi
Linformer	2020	信息損失,Flash Attention 更好	Flash Attention
Performer	2021	近似誤差,實作複雜	Flash Attention
Reformer	2020	LSH attention 唔穩定	Flash Attention
Synthesizer	2021	學習 attention 效果唔好	Standard attention
Routing Transformer	2021	實作複雜,收益有限	Dense attention

共同模式:

大部分被淘汰嘅方法都係試圖通過 approximation 嚟減少複雜度,但係:

2022 年 Flash Attention 出現,證明可以做到 exact attention 但 memory-efficient
Approximation 嘅 quality loss 唔值得

現役主流方法 ✅

如果你而家要 train 或 deploy 一個 transformer model,應該用咩組合?

訓練 (Training)

標準配置 (2026):

python# Position encoding
position_encoding = "RoPE"  # 或者 ALiBi

# Attention mechanism
attention = "Multi-Head Self-Attention"
attention_implementation = "Flash Attention 3"  # 必備!

# Query/Key/Value heads
use_GQA = True  # Grouped-Query Attention
num_query_heads = 32
num_kv_heads = 8  # 4x reduction

# Context length
max_seq_len = 8192  # 或更長

效果:

Flash Attention 3: 快 + memory-efficient
GQA: 推理快,效果好
RoPE: 位置建模強

推理 (Inference)

標配:

python# Attention optimization
attention_backend = "Flash Attention 3"

# KV cache management  
kv_cache_manager = "PagedAttention"  # vLLM

# Optional: 如果 context 超長
use_sliding_window = False  # 通常唔使

框架推薦:

vLLM: Flash Attention + PagedAttention,最快
TensorRT-LLM: NVIDIA 優化,A100/H100 最優
Text Generation Inference: Hugging Face,易用

Timeline 視覺化 📊

javascript2014 ■ Seq2Seq (RNN Encoder-Decoder)
       └─ 問題: fixed-size bottleneck

2015 ■ Bahdanau Attention ✅ (教學用)
     ■ Luong Attention ✅ (教學用)
       └─ 首次引入 attention 概念

2017 ████ Transformer 革命 ████
     ■ Scaled Dot-Product Attention ✅ 核心!
     ■ Multi-Head Attention ✅ 核心!
     ■ Self-Attention ✅ 核心!
     ■ Cross-Attention ✅ 仍在用
       └─ Attention is All You Need

2019 ■ Transformer-XL ❌ 已淘汰
     ■ Sparse Transformer 🎓 學術
     ■ Multi-Query Attention 🔄 2023 再流行

2020 ■ Longformer ✅ 文檔理解
     ■ BigBird 🎓 學術
     ■ Linformer ❌ 已淘汰
     ■ Reformer ❌ 已淘汰

2021 ■ Performer ❌ 已淘汰
     ■ RoPE ✅✅✅ 現在主流!
       └─ LLaMA, PaLM, GPT-NeoX 都用

2022 ████ Memory 革命 ████
     ■ Flash Attention ✅✅✅ 革命性!
     ■ ALiBi ✅✅ 第二主流
     ■ Memory-Efficient Attention ✅

2023 ■ Grouped-Query Attention (GQA) ✅✅✅ 推理標配!
     ■ PagedAttention ✅✅✅ 推理必備!
     ■ Flash Attention 2 ✅✅✅
     ■ Flash-Decoding ✅
     ■ Sliding Window (Mistral) ✅
       └─ LLaMA 2, Mistral 採用 GQA

2024 ■ Flash Attention 3 ✅✅✅
     ■ Mixture of Depths 🎓 研究
     ■ Infini-Attention 🎓 研究

2026 ← 你而家喺度!
     主流: Flash Attention 3 + GQA + RoPE/ALiBi + PagedAttention

技術選擇指南 🧭

點樣揀適合你嘅 attention 方案?

如果你係... 🤔

📚 學生 / 研究新手

學習路徑:

Bahdanau/Luong Attention (理解基本概念)
Scaled Dot-Product Attention (理解 QKV)
Multi-Head Attention (現代標準)
Flash Attention (理解 memory optimization)

🏢 訓練新模型 (Production)

推薦配置:

Position: RoPE (如果唔知揀邊個) 或 ALiBi (如果需要超長 context)
Attention: Multi-Head Self-Attention
Implementation: Flash Attention 3
Heads: GQA (例如 32 query heads, 8 KV heads)
Framework: PyTorch + Flash Attention official library

🚀 部署推理服務

推薦配置:

Framework: vLLM (最快)
KV Cache: PagedAttention (vLLM 內建)
Attention: Flash Attention 3
如果用 NVIDIA GPU: 可以試 TensorRT-LLM

📄 處理超長文檔 (32k+ tokens)

選項 1 (dense attention):

RoPE + Flash Attention + Ring Attention
需要多張 GPU,用 sequence parallelism

選項 2 (sparse attention):

Longformer (16k-32k tokens)
或者用 RAG (Retrieval-Augmented Generation) 切細文檔

🎨 多模態模型 (Vision + Language)

Cross-Attention 仍然需要 (連接 vision encoder 同 language decoder)
例子: CLIP, Flamingo, LLaVA

未來趨勢預測 🔮

基於而家嘅發展,我估未來幾年會有呢啲趨勢:

1. Flash Attention 系列繼續主導 ✅

Flash Attention 4, 5... 會繼續優化
更好嘅 hardware utilization (特別係新 GPU 架構)
可能會有 specialized hardware accelerators for attention

2. GQA 成為標配 ✅

幾乎所有新模型都會用 GQA
可能會有 dynamic GQA (根據 layer 調整 KV heads 數目)

3. 位置編碼趨向統一 🤔

RoPE 同 ALiBi 仍然會係主流
可能會出現統一兩者優點嘅新方法
RoPE scaling 技術會更成熟 (YaRN, NTK-aware 等)

4. Context Length 持續增長 📏

2026: 128k-256k tokens 會變常見
需要更好嘅 memory management (PagedAttention 改進版)
可能會有 hierarchical attention (唔同 layers 用唔同 context window)

5. Sparse Attention 再度回歸? 🔄

當 context length 去到 1M+ tokens,dense attention 會太貴
可能會有新一代 sparse patterns,配合 Flash Attention
Mixture of Depths 概念可能會成熟

6. 專門化 Attention 🎯

唔同 modalities 用唔同 attention mechanisms
例如:vision 用 local attention,language 用 global attention
Multi-modal models 會有更複雜嘅 attention routing

7. Hardware-Software Co-design ⚡

Attention 專用嘅 hardware accelerators
類似 TPU 專為 matrix multiplication 優化,未來可能有 "Attention Processing Unit"

我嘅睇法 💭

研究咗咁多 attention mechanisms,我有幾個感想:

1. 簡單通常更好

好多複雜嘅變種 (Linformer, Performer, Reformer) 最終都失敗咗,反而簡單嘅 dense attention + 優化 implementation (Flash Attention) 贏咗。

教訓: 唔好急住 approximate,先 optimize exact solution。

2. 工程優化同理論創新一樣重要

Flash Attention 冇改變 attention 嘅數學公式,只係改變咗計算順序同 memory access pattern,但帶嚟嘅影響比好多理論上 fancy 嘅方法都大。

教訓: 理解 hardware (GPU memory hierarchy) 好重要!

3. 位置編碼超級重要

好多人以為 attention 機制本身最重要,但其實 position encoding (RoPE, ALiBi) 對 performance 嘅影響好大,特別係長序列。

教訓: 唔好忽略 "配角" 組件。

4. 推理同訓練嘅需求好唔同

訓練時最緊要 throughput (Flash Attention),推理時最緊要 latency 同 memory (PagedAttention, GQA)。

教訓: 設計 model 要同時考慮訓練同部署。

5. 某啲 "舊" idea 會返嚟

Multi-Query Attention 喺 2019 年提出,但去到 2023 年先流行 (因為推理需求增加)。Sparse attention 可能都會喺未來再度興起 (當 context length 去到 millions)。

教訓: Keep an eye on "old" papers,佢哋可能會喺新 context 下有用。

參考資料 📚

經典論文:

Bahdanau et al. (2015): Neural Machine Translation by Jointly Learning to Align and Translate
Vaswani et al. (2017): Attention Is All You Need
Su et al. (2021): RoFormer: Enhanced Transformer with Rotary Position Embedding
Press et al. (2022): Train Short, Test Long: Attention with Linear Biases
Dao et al. (2022): FlashAttention: Fast and Memory-Efficient Exact Attention
Ainslie et al. (2023): GQA: Training Generalized Multi-Query Transformer Models
Kwon et al. (2023): Efficient Memory Management for LLM Serving with PagedAttention

相關文章:

框架 & 實現:

Attention 機制進化史：由 2014 到 2026 嘅演變

呢篇文會按時間順序講解各種 attention 機制,並且會標明邊啲仍然活躍 ✅、邊啲已經過時 ❌、邊啲只係學術研究 🎓。

📑 目錄

2014-2015: Attention 嘅誕生 🐣

Seq2Seq 嘅瓶頸 (2014)

喺 attention 出現之前,機器翻譯用 sequence-to-sequence (Seq2Seq) 架構:

javascriptEncoder → Fixed-size vector → Decoder

問題:

成個輸入句子要壓縮成一個固定長度嘅 vector
句子越長,信息損失越嚴重
譯長句嗰陣,開頭嘅詞會俾遺忘

💡 想像你要將成本書嘅內容濃縮成一句說話,然後用嗰句說話嚟重構成本書——當然會失真!

Bahdanau Attention (2015) ✅ 仍在用(教學)

論文: Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)

核心 idea: Decoder 每一步都可以望返去 encoder 嘅所有 hidden states,動態決定要 focus 邊度。

計算方式

Alignment score (用 small neural network):

e_{ij} = a(s_{i-1}, h_j)

其中 $s_{i-1}$ 係 decoder 上一步嘅 state, $h_j$ 係 encoder 第 $j$ 個 hidden state。

Attention weights (softmax normalization):

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}

Context vector (weighted sum):

c_i = \sum_{j=1}^{T} \alpha_{ij} h_j

具體例子:翻譯 "I love AI"

假設 encoder 產生 3 個 hidden states,decoder 生成第一個字「我」:

Encoder hidden states (簡化成 2D):

H = \begin{bmatrix} h_1 & h_2 & h_3 \end{bmatrix} = \begin{bmatrix} 0.8 & 0.2 & 0.9 \\ 0.6 & 0.7 & 0.3 \end{bmatrix}

Decoder state:

s_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}

Step 1: 計 alignment scores (假設用 dot product 簡化):

\begin{aligned} e_{01} &= s_0^T h_1 = 0.5(0.8) + 0.5(0.6) = 0.70 \\ e_{02} &= s_0^T h_2 = 0.5(0.2) + 0.5(0.7) = 0.45 \\ e_{03} &= s_0^T h_3 = 0.5(0.9) + 0.5(0.3) = 0.60 \end{aligned}

Step 2: Softmax → attention weights:

\begin{aligned} \alpha_{01} &= \frac{e^{0.70}}{e^{0.70} + e^{0.45} + e^{0.60}} = \frac{2.01}{2.01 + 1.57 + 1.82} = 0.37 \\ \alpha_{02} &= \frac{1.57}{5.40} = 0.29 \\ \alpha_{03} &= \frac{1.82}{5.40} = 0.34 \end{aligned}

Step 3: Context vector (weighted sum):

c_0 = \alpha_{01} h_1 + \alpha_{02} h_2 + \alpha_{03} h_3

= 0.37 \begin{bmatrix} 0.8 \\ 0.6 \end{bmatrix} + 0.29 \begin{bmatrix} 0.2 \\ 0.7 \end{bmatrix} + 0.34 \begin{bmatrix} 0.9 \\ 0.3 \end{bmatrix}

= \begin{bmatrix} 0.296 + 0.058 + 0.306 \\ 0.222 + 0.203 + 0.102 \end{bmatrix} = \begin{bmatrix} 0.66 \\ 0.53 \end{bmatrix}

Attention matrix 視覺化:

javascriptDecoder → | h₁(I)  h₂(love) h₃(AI)
"我"      | 0.37   0.29     0.34    ← 最關注 "I"
"愛"      | 0.05   0.82     0.13    ← 最關注 "love"
"人工"    | 0.08   0.15     0.77    ← 最關注 "AI"

狀態: ✅ 仍在用於教學

而家唔會用喺 production (太慢、效果唔夠好)
但係教科書同課程仍然會教,因為概念清晰
係理解 attention 嘅最佳起點

Luong Attention (2015) ✅ 仍在用(教學)

論文: Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)

改進: 提出多種 alignment function 同 global/local attention 變種。

Alignment Functions

Dot product:

\text{score}(s_i, h_j) = s_i^T h_j

General:

\text{score}(s_i, h_j) = s_i^T W h_j

Concat (Bahdanau 嘅方法):

\text{score}(s_i, h_j) = v^T \tanh(W[s_i; h_j])

三種 Scoring 對比例子

假設 decoder state 同 encoder hidden state:

s = \begin{bmatrix} 0.6 \\ 0.8 \end{bmatrix}, \quad h = \begin{bmatrix} 0.5 \\ 0.9 \end{bmatrix}

1. Dot product:

\text{score} = s^T h = 0.6(0.5) + 0.8(0.9) = 0.30 + 0.72 = 1.02

2. General (假設權重矩陣):

W = \begin{bmatrix} 1.2 & 0.3 \\ 0.4 & 1.1 \end{bmatrix}

Wh = \begin{bmatrix} 1.2(0.5) + 0.3(0.9) \\ 0.4(0.5) + 1.1(0.9) \end{bmatrix} = \begin{bmatrix} 0.87 \\ 1.19 \end{bmatrix}

\text{score} = s^T(Wh) = 0.6(0.87) + 0.8(1.19) = 0.522 + 0.952 = 1.474

3. Concat:

[s; h] = \begin{bmatrix} 0.6 \\ 0.8 \\ 0.5 \\ 0.9 \end{bmatrix}

經過 tanh(W[s;h]) 同 v^T 之後得出一個 scalar score。

分別:

Dot: 最簡單,但假設 s 同 h 嘅 dimensions 對齊
General: 有學習嘅權重,更 flexible
Concat: Bahdanau 方法,最 expressive 但計算最貴

Global vs Local Attention

Global: 同 Bahdanau 類似,attend 晒所有 positions。

Local: 只 attend 一個 window (例如前後各 5 個 positions),減少計算量。

狀態: ✅ 仍在特定場景使用

Local attention 嘅 idea 影響咗後來嘅 sparse attention
Dot product score 變成 Transformer 嘅標準做法
但原版 RNN-based 實現已經冇人用

2017: Transformer 革命 🚀

Scaled Dot-Product Attention (2017) ✅ 核心方法,仍廣泛使用

論文: Attention Is All You Need (Vaswani et al., 2017)

呢個係現代 NLP 嘅分水嶺!

Query, Key, Value 概念

首次引入 QKV (Query, Key, Value) 框架:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

點解要除 $sqrt{d_k}$ ?

當 $d_k$ (key dimension) 好大嗰陣,dot product 嘅值會好大,令 softmax 進入 saturation region (gradient 接近 0)。除以 $\sqrt{d_k}$ 可以穩定訓練。

完整矩陣例子

假設 4 個 tokens,每個 embedding 係 3D:

Input embeddings:

X = \begin{bmatrix} 0.5 & 0.2 & 0.8 \\ 0.3 & 0.9 & 0.1 \\ 0.7 & 0.4 & 0.6 \\ 0.2 & 0.8 & 0.5 \end{bmatrix} \quad \text{(4 tokens × 3 dim)}

Weight matrices (簡化成 3×3):

W^Q = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad W^K = \begin{bmatrix} 0.8 & 0.1 & 0.2 \\ 0.1 & 0.9 & 0.3 \\ 0.2 & 0.1 & 0.7 \end{bmatrix}, \quad W^V = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

Step 1: 計 Q, K, V

Q = XW^Q = X \quad \text{(用 identity 簡化)}

K = XW^K = \begin{bmatrix} 0.56 & 0.18 & 0.69 \\ 0.26 & 0.84 & 0.22 \\ 0.68 & 0.52 & 0.71 \\ 0.26 & 0.79 & 0.49 \end{bmatrix}

V = XW^V = X

Step 2: 計 attention scores

QK^T = \begin{bmatrix} 0.5 & 0.2 & 0.8 \\ 0.3 & 0.9 & 0.1 \\ 0.7 & 0.4 & 0.6 \\ 0.2 & 0.8 & 0.5 \end{bmatrix} \begin{bmatrix} 0.56 & 0.26 & 0.68 & 0.26 \\ 0.18 & 0.84 & 0.52 & 0.79 \\ 0.69 & 0.22 & 0.71 & 0.49 \end{bmatrix}

= \begin{bmatrix} 0.868 & 0.472 & 1.076 & 0.722 \\ 0.501 & 1.116 & 0.679 & 1.111 \\ 0.862 & 0.856 & 1.182 & 0.856 \\ 0.801 & 0.883 & 0.991 & 1.117 \end{bmatrix}

Step 3: Scaling (假設 $d_k = 3$ ,所以除 $\sqrt{3} \approx 1.73$ ):

\frac{QK^T}{\sqrt{d_k}} = \begin{bmatrix} 0.50 & 0.27 & 0.62 & 0.42 \\ 0.29 & 0.65 & 0.39 & 0.64 \\ 0.50 & 0.49 & 0.68 & 0.49 \\ 0.46 & 0.51 & 0.57 & 0.65 \end{bmatrix}

Step 4: Softmax (每行獨立做):

\text{Attention Weights} = \begin{bmatrix} 0.23 & 0.18 & 0.27 & 0.21 \\ 0.19 & 0.27 & 0.21 & 0.27 \\ 0.23 & 0.23 & 0.27 & 0.23 \\ 0.22 & 0.23 & 0.24 & 0.27 \end{bmatrix}

Step 5: 乘 V (weighted sum of values):

\text{Output} = \text{Attention Weights} \times V

= \begin{bmatrix} 0.23 & 0.18 & 0.27 & 0.21 \\ 0.19 & 0.27 & 0.21 & 0.27 \\ 0.23 & 0.23 & 0.27 & 0.23 \\ 0.22 & 0.23 & 0.24 & 0.27 \end{bmatrix} \begin{bmatrix} 0.5 & 0.2 & 0.8 \\ 0.3 & 0.9 & 0.1 \\ 0.7 & 0.4 & 0.6 \\ 0.2 & 0.8 & 0.5 \end{bmatrix}

= \begin{bmatrix} 0.48 & 0.45 & 0.54 \\ 0.41 & 0.59 & 0.44 \\ 0.48 & 0.50 & 0.53 \\ 0.43 & 0.54 & 0.51 \end{bmatrix}

視覺化 attention pattern:

javascriptToken0  Token1  Token2  Token3
Token0   0.23    0.18    0.27 ⬆  0.21
Token1   0.19    0.27 ⬆  0.21    0.27 ⬆
Token2   0.23    0.23    0.27 ⬆  0.23
Token3   0.22    0.23    0.24    0.27 ⬆

⬆ = 最高 attention weight

關鍵觀察:

每個 token 嘅 output 係所有 tokens 嘅加權平均
Attention weights 加埋等於 1 (因為 softmax)
唔同 token 可以 attend 去唔同位置

狀態: ✅ 仍然係標準做法

幾乎所有現代 LLM 都用呢個公式
GPT、BERT、T5、LLaMA 全部都係基於呢個
已經係 de facto standard

Multi-Head Attention (2017) ✅ 核心方法,仍廣泛使用

核心 idea: 用多個 attention heads 並行計算,捕捉唔同嘅 patterns。

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O

其中每個 head:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

好處:

唔同 heads 學習唔同 aspects (syntax, semantics, long-range dependencies)
可以並行計算,發揮 GPU 優勢
增加 model expressiveness

具體例子:2 個 Heads

假設 input 係 2 個 tokens,每個 embedding 6D,用 2 個 heads (每個 head 3D):

Input:

X = \begin{bmatrix} 0.5 & 0.2 & 0.8 & 0.1 & 0.9 & 0.3 \\ 0.7 & 0.4 & 0.6 & 0.8 & 0.2 & 0.5 \end{bmatrix} \quad \text{(2 tokens × 6 dim)}

Head 1 (用前 3 dims):

X_1 = \begin{bmatrix} 0.5 & 0.2 & 0.8 \\ 0.7 & 0.4 & 0.6 \end{bmatrix}

計 attention → 得出:

\text{head}_1 = \begin{bmatrix} 0.55 & 0.25 & 0.75 \\ 0.65 & 0.35 & 0.65 \end{bmatrix}

Head 2 (用後 3 dims):

X_2 = \begin{bmatrix} 0.1 & 0.9 & 0.3 \\ 0.8 & 0.2 & 0.5 \end{bmatrix}

計 attention → 得出:

\text{head}_2 = \begin{bmatrix} 0.45 & 0.65 & 0.35 \\ 0.70 & 0.30 & 0.50 \end{bmatrix}

Concatenate:

\text{Concat}(\text{head}_1, \text{head}_2) = \begin{bmatrix} 0.55 & 0.25 & 0.75 & 0.45 & 0.65 & 0.35 \\ 0.65 & 0.35 & 0.65 & 0.70 & 0.30 & 0.50 \end{bmatrix}

最後乘 output projection $W^O$ (6×6) 得出 final output。

關鍵點:

每個 head 獨立計算,可以並行
Head 1 可能學到 syntactic patterns,Head 2 學到 semantic patterns
Concat 之後保留晒所有 heads 嘅信息

狀態: ✅ 仍然係標準配置

現代模型通常用 8-96 個 heads
GPT-3: 96 heads
LLaMA-2 70B: 64 heads

Cross-Attention (2017) ✅ 仍廣泛使用

Transformer encoder-decoder 架構入面,decoder 用 cross-attention 嚟 attend 去 encoder 嘅 outputs。

特點:

Query 嚟自 decoder
Key 同 Value 嚟自 encoder
實現 source-target alignment (例如翻譯嗰陣)

具體例子:翻譯 "I love" → "我愛"

Encoder outputs (source: "I love"):

H_{\text{enc}} = \begin{bmatrix} 0.7 & 0.4 & 0.6 \\ 0.3 & 0.9 & 0.5 \end{bmatrix} \begin{matrix} \text{← "I"} \\ \text{← "love"} \end{matrix}

Decoder state (generating "愛"):

H_{\text{dec}} = \begin{bmatrix} 0.5 & 0.8 & 0.4 \end{bmatrix} \quad \text{← state after generating "我"}

Cross-Attention:

Query: 嚟自 decoder → $Q = H_{\text{dec}}W^Q$
Key, Value: 嚟自 encoder → $K = H_{\text{enc}}W^K, V = H_{\text{enc}}W^V$

假設用簡化權重,計 attention scores:

QK^T = \begin{bmatrix} 0.5 & 0.8 & 0.4 \end{bmatrix} \begin{bmatrix} 0.7 & 0.3 \\ 0.4 & 0.9 \\ 0.6 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.91 & 1.31 \end{bmatrix}

Softmax:

\text{Weights} = [0.35, 0.65]

解讀: 生成 "愛" 嗰陣,decoder attend 更多去 "love" (0.65) 而唔係 "I" (0.35),符合翻譯對齊!

Output (weighted sum of encoder values):

\text{context} = 0.35 \times \text{V}_{\text{I}} + 0.65 \times \text{V}_{\text{love}}

呢個 context vector 會 feed 去 decoder 嚟生成下一個字。

狀態: ✅ 仍在特定任務使用

Encoder-decoder models 仍然用: T5, BART, Whisper (語音識別)
多模態模型: CLIP, Flamingo (vision-language)
但係純語言模型 (GPT 系列) 唔使,因為係 decoder-only

2018-2019: 第一波優化 🔧

Self-Attention (2017+) ✅ 核心方法

Self-attention 係 Transformer 嘅精髓: Q, K, V 全部都嚟自同一個 input sequence。

\text{Self-Attention}(X) = \text{Attention}(XW^Q, XW^K, XW^V)

用途:

Encoder (BERT): bidirectional self-attention
Decoder (GPT): causal self-attention (只能望前面)

具體例子:句子 "I love AI"

假設每個 word embedding 係 3D (簡化):

X = \begin{bmatrix} 0.8 & 0.3 & 0.5 \\ 0.2 & 0.9 & 0.4 \\ 0.6 & 0.5 & 0.7 \end{bmatrix} \begin{matrix} \text{← "I"} \\ \text{← "love"} \\ \text{← "AI"} \end{matrix}

計 Q, K, V (用簡化權重):

假設 $W^Q = W^K = W^V = I$ (identity),所以:

Q = K = V = X

Step 1: Attention scores $QK^T$ :

QK^T = \begin{bmatrix} 0.8 & 0.3 & 0.5 \\ 0.2 & 0.9 & 0.4 \\ 0.6 & 0.5 & 0.7 \end{bmatrix} \begin{bmatrix} 0.8 & 0.2 & 0.6 \\ 0.3 & 0.9 & 0.5 \\ 0.5 & 0.4 & 0.7 \end{bmatrix}

= \begin{bmatrix} 0.98 & 0.63 & 1.06 \\ 0.63 & 1.21 & 0.85 \\ 1.06 & 0.85 & 1.35 \end{bmatrix}

Step 2: Softmax (每行):

\text{Attention Weights} = \begin{bmatrix} 0.31 & 0.22 & 0.47 \\ 0.22 & 0.47 & 0.31 \\ 0.31 & 0.24 & 0.45 \end{bmatrix}

解讀:

"I" (row 1) attends most to "AI" (0.47)
"love" (row 2) attends most to itself (0.47)
"AI" (row 3) attends most to itself (0.45)

Step 3: 乘 V:

\text{Output} = \text{Attention Weights} \times V

每個 token 嘅 output 係所有 tokens 嘅加權組合,捕捉到 contextual information!

狀態: ✅ 係現代 transformer 嘅核心

Transformer-XL: Relative Position (2019) ❌ 已少用

論文: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)

問題: 原本 Transformer 有固定嘅 context length,唔能處理超長序列。

解決方法:

Segment-level recurrence: 將前一個 segment 嘅 hidden states cache 住
Relative position encodings: 用相對位置而唔係絕對位置

點解冇人用?

Implementation 複雜
Recurrence 機制唔夠 parallelizable
後來 RoPE (2021) 提供咗更簡單嘅相對位置方案
ALiBi (2022) 嘅外推能力更強

狀態: ❌ 基本已被淘汰

但佢嘅 relative position encoding idea 影響咗後來嘅 RoPE 同 ALiBi!

Sparse Transformer (2019) 🎓 學術研究為主

論文: Generating Long Sequences with Sparse Transformers (Child et al., 2019)

核心 idea: 唔係每個 token 都 attend 所有其他 tokens,只 attend 一部分。

Sparse Patterns

Strided attention:

每隔 $k$ 個 positions attend 一次:

javascriptToken 0 attends: [0, 5, 10, 15, ...]  (stride=5)

Fixed attention:

某啲 positions 有 "global" attention,所有 tokens 都 attend 去佢哋。

複雜度: 由 $O(n^2)$ 降到 $O(n\sqrt{n})$ 或 $O(n \log n)$

狀態: 🎓 主要係學術研究

點解冇普及?

Implementation 複雜,需要 custom CUDA kernels
對於常見嘅序列長度 (2k-8k tokens),dense attention + Flash Attention 已經夠快
Sparse patterns 要針對 task 設計,唔夠 general

邊度仍在用?

BigBird (Google, 2020): 用咗 sparse attention
Longformer (AllenAI, 2020): 結合 local + global attention
但主流 LLMs (GPT, LLaMA) 都唔用

2020-2021: 長序列挑戰 📏

Longformer (2020) ✅ 仍在特定場景使用

論文: Longformer: The Long-Document Transformer (Beltagy et al., 2020)

核心 idea: 結合 local windowed attention + global attention。

Attention Pattern

Local attention: 每個 token attend 前後 $w$ 個 tokens (例如 $w=512$ )
Global attention: 某啲特殊 tokens (例如 [CLS]) attend 全部 tokens

複雜度: $O(n times w)$ ,linear in sequence length!

Attention Pattern 視覺化

假設 8 個 tokens,window size = 2 (前後各 1 個 token):

Full Attention (傳統):

javascript0  1  2  3  4  5  6  7
0  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]  ← token 0 attend 晒所有
1  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
2  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
3  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
4  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
5  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
6  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]
7  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]

每個 token: 8 個 attention 計算
總計: 8 × 8 = 64 計算 → O(n²)

Longformer Local Attention (window=2):

javascript0  1  2  3  4  5  6  7
0  [ ✓  ✓  ·  ·  ·  ·  ·  · ]  ← token 0 只 attend [0,1]
1  [ ✓  ✓  ✓  ·  ·  ·  ·  · ]  ← token 1 attend [0,1,2]
2  [ ·  ✓  ✓  ✓  ·  ·  ·  · ]  ← attend [1,2,3]
3  [ ·  ·  ✓  ✓  ✓  ·  ·  · ]  ← attend [2,3,4]
4  [ ·  ·  ·  ✓  ✓  ✓  ·  · ]
5  [ ·  ·  ·  ·  ✓  ✓  ✓  · ]
6  [ ·  ·  ·  ·  ·  ✓  ✓  ✓ ]
7  [ ·  ·  ·  ·  ·  ·  ✓  ✓ ]  ← token 7 只 attend [6,7]

每個 token: 2-3 個 attention 計算
總計: ~8 × 2.5 = 20 計算 → O(n×w)

Longformer + Global Attention (token 0 係 [CLS]):

javascript0  1  2  3  4  5  6  7
0  [ ✓  ✓  ✓  ✓  ✓  ✓  ✓  ✓ ]  ← [CLS] attend 全部
1  [ ✓  ✓  ✓  ·  ·  ·  ·  · ]  ← local + attend to [CLS]
2  [ ✓  ✓  ✓  ✓  ·  ·  ·  · ]  ← local + attend to [CLS]
3  [ ✓  ·  ✓  ✓  ✓  ·  ·  · ]
4  [ ✓  ·  ·  ✓  ✓  ✓  ·  · ]
5  [ ✓  ·  ·  ·  ✓  ✓  ✓  · ]
6  [ ✓  ·  ·  ·  ·  ✓  ✓  ✓ ]
7  [ ✓  ·  ·  ·  ·  ·  ✓  ✓ ]
    ↑
   全部都 attend 去 [CLS]

實際應用 (4096 tokens, window=512):

Full attention: 4096² = 16M 計算
Longformer: 4096 × 512 = 2M 計算 (減少 8x!)

狀態: ✅ 文檔理解任務仍在用

LED (Longformer Encoder-Decoder): 用於文檔 summarization
適合處理超長文檔 (16k-32k tokens)
但係語言生成任務,大家仍然偏好 dense attention + better position encodings

BigBird (2020) 🎓 學術為主

論文: Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)

Sparse pattern: 結合三種 attention:

Random attention: 隨機 attend 幾個 tokens
Window attention: Local sliding window
Global attention: 特定 tokens attend 全部

狀態: 🎓 主要係研究用途

Google 有推出 BigBird models,但唔算主流
理論上有 universal approximation 保證,但實際應用有限

Linformer (2020) ❌ 已少用

論文: Linformer: Self-Attention with Linear Complexity (Wang et al., 2020)

核心 idea: 用 low-rank approximation 將 $n \times n$ 嘅 attention matrix 降到 $n \times k$ (其中 $k ll n$ )。

點做?

將 Key 同 Value 用 projection 降維:

K' = KE, \quad V' = VF

其中 $E, F in mathbb{R}^{n times k}$ 。

點解失敗?

信息損失: Low-rank approximation 會損失 long-range dependencies
唔夠 flexible: $k$ 係 hyperparameter,要手動 tune
Flash Attention 更好: 2022 年 Flash Attention 出現,做到 $O(n^2)$ 但係 memory-efficient,效果更好

狀態: ❌ 基本已被淘汰

Performer (2021) ❌ 已少用

論文: Rethinking Attention with Performers (Choromanski et al., 2021)

核心 idea: 用 kernel methods 將 attention 近似成 linear complexity。

數學技巧:

用 random feature maps $\phi$ 將 softmax 近似:

\text{Attention}(Q, K, V) \approx \frac{\phi(Q) (\phi(K)^T V)}{\phi(Q) \phi(K)^T}

計算順序改變: $(QK^T)V rightarrow Q(K^TV)$ ,複雜度由 $O(n^2 d)$ 降到 $O(nd^2)$ 。

點解冇普及?

近似誤差: 同 exact attention 有 gap
實作複雜: 需要 careful implementation
Flash Attention 出現,提供 exact attention 但更 memory-efficient

狀態: ❌ 基本已被淘汰

2021-2022: 位置編碼大戰 📍

Position encoding 係 attention 機制嘅重要組件,呢個時期出現咗好多新方法。

RoPE (2021) ✅ 現在主流!

論文: RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)

核心 idea: 通過旋轉 query 同 key 向量嚟注入相對位置信息。

\mathbf{q}_m = \mathbf{R}_m \mathbf{q}, \quad \mathbf{k}_n = \mathbf{R}_n \mathbf{k}

優點:

✅ 自然嘅相對位置建模
✅ 無額外參數
✅ 優秀嘅長度外推能力
✅ 計算高效

狀態: ✅ 而家最主流嘅位置編碼!

邊個用緊?

LLaMA / LLaMA 2 / LLaMA 3 (Meta)
PaLM / Gemini (Google)
GPT-NeoX (EleutherAI)
ChatGLM (清華)
Mistral (Mistral AI)
Qwen (阿里巴巴)

基本上 2021 之後嘅新模型,大部分都用 RoPE!

詳細可以睇我嘅另一篇文: Rotary Position Embeddings (RoPE)

ALiBi (2022) ✅ 第二主流

論文: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2022)

核心 idea: 喺 attention scores 加一個 linear bias term,bias 隨距離增加。

\text{softmax}(q_i K^T + m \cdot [-(i-1), -(i-2), \ldots, 0])

其中 $m$ 係 head-specific slope (唔同 head 用唔同 slope)。

具體例子:4 個 Tokens

假設有 4 個 tokens,計 token 3 嘅 attention:

Step 1: 原始 attention scores (假設已計咗 $q_3 K^T$ ):

\text{scores} = [2.1, 1.8, 2.5, 2.3]

Step 2: 加 ALiBi bias (假設 slope $m = 0.5$ ):

Bias 係根據相對距離:

\text{bias} = m \times [-(3-0), -(3-1), -(3-2), -(3-3)]

= 0.5 \times [-3, -2, -1, 0] = [-1.5, -1.0, -0.5, 0.0]

Step 3: 加埋:

\text{scores + bias} = [2.1-1.5, 1.8-1.0, 2.5-0.5, 2.3-0.0]

= [0.6, 0.8, 2.0, 2.3]

Step 4: Softmax:

\text{attention weights} = [0.12, 0.15, 0.31, 0.42]

效果: Token 3 對自己 (distance=0) 嘅 attention 最高,距離越遠嘅 token,attention 越低!

唔同 heads 用唔同 slopes:

Head 1: $m = 0.5$ (強 local bias)
Head 2: $m = 0.25$ (中等 bias)
Head 3: $m = 0.125$ (弱 bias)
Head 4: $m = 0.0625$ (好弱 bias)

每個 head 學習唔同 range 嘅 dependencies!

優點:

✅ 極強嘅外推能力: Train 2k, 可以推到 100k+ tokens 而 performance 唔跌太多
✅ Implementation 超簡單 (幾行 code)
✅ 無額外參數

缺點:

❌ 喺某啲任務,效果略遜於 RoPE

狀態: ✅ 第二主流,仍廣泛使用

邊個用緊?

BLOOM (BigScience)
MPT (MosaicML)
Falcon (TII UAE)
StarCoder (Hugging Face)

通常用於需要極長 context 嘅場景。

xPos (2022) 🎓 學術為主

論文: A Length-Extrapolatable Transformer (Sun et al., 2022)

核心 idea: 改進 RoPE,加入 exponential decay。

對於 query 同 key,分別乘以 decay factors:

\mathbf{q}_m = \mathbf{R}_m \mathbf{D}_m \mathbf{q}, \quad \mathbf{k}_n = \mathbf{R}_n \mathbf{D}_n^{-1} \mathbf{k}

好處: 更好嘅長度外推,特別係超長序列。

狀態: 🎓 主要係研究,少生產使用

Implementation 比 RoPE 複雜少少
效果提升唔算特別明顯
主流 LLMs 仍然偏好 RoPE 或 ALiBi

FIRE (2023) 🎓 研究階段

論文: Functional Interpolation for Relative Positions (Li et al., 2023)

核心 idea: 用可學習嘅函數嚟 interpolate 位置編碼。

狀態: 🎓 純學術研究

太新,未有大規模應用。

2022-2023: Memory Efficiency 革命 💾

呢個時期重點轉移到記憶體優化,因為 attention 嘅 $O(n^2)$ memory 係訓練長序列嘅最大瓶頸。

Flash Attention (2022) ✅ 革命性突破!

論文: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)

核心 idea: 通過 tiling 同 recomputation,減少 HBM (GPU global memory) 存取次數。

關鍵技術:

Block-sparse attention matrix: 將 $n \times n$ matrix 分成 blocks
Fused kernel: 將 softmax + matmul 合併成一個 CUDA kernel
Recomputation: Backward pass 時 recompute attention,唔需要 store 成個 $n \times n$ matrix

Tiling 例子

假設 sequence length = 1024,將佢分成 4 個 blocks (每個 256 tokens):

傳統 Attention - 一次過計成個矩陣:

python# 要 store 成個 1024×1024 attention matrix
S = Q @ K.T              # [1024×1024] 寫入 HBM
P = softmax(S)           # [1024×1024] 讀+寫 HBM  
O = P @ V                # [1024×1024] 讀 HBM

# Memory: O(n²) = O(1024²) = ~1M elements
# HBM 讀寫: 3 次大量存取

Flash Attention - Block-wise 計算:

python# 將 Q, K, V 分成 blocks
Q = [Q₁, Q₂, Q₃, Q₄]    # 每個 [256×d]
K = [K₁, K₂, K₃, K₄]  
V = [V₁, V₂, V₃, V₄]

# 逐個 block 計,keep 住喺 SRAM (fast memory)
for i in range(4):
    for j in range(4):
        # 只計 256×256 嘅 sub-block
        S_ij = Q_i @ K_j.T    # [256×256] stay in SRAM!
        P_ij = softmax(S_ij)  # stay in SRAM
        O_i += P_ij @ V_j     # accumulate

# Memory: O(n) = O(1024) - 只 store partial sums
# HBM 讀寫: 大幅減少!

Memory Hierarchy

javascriptSRAM (on-chip, 20 MB)    ← 快 ⚡⚡⚡ (19 TB/s)
  ↕  
HBM (GPU memory, 40 GB)  ← 慢 🐌 (1.5 TB/s)
  ↕
DRAM (CPU memory)

關鍵 insight: 盡量 keep 嘢喺 SRAM,減少 HBM 讀寫!

實際數字 (A100 GPU)

Sequence Length	Standard Attention	Flash Attention	Speedup
512	2.5 GB memory	0.8 GB	1.8x 快
1024	10 GB memory	1.2 GB	2.5x 快
2048	40 GB (OOM!)	2.8 GB	4x 快
4096	N/A (out of memory)	8 GB	可以跑!

效果:

✅ 2-4x speedup (training)
✅ 5-9x speedup (long sequences > 2k tokens)
✅ Memory 由 $O(n^2)$ 降到 $O(n)$
✅ Exact attention (唔係近似!)

狀態: ✅ 而家幾乎必用!

邊個用緊?

PyTorch 官方支持 (torch.nn.functional.scaled_dot_product_attention)
Hugging Face Transformers 預設開啟
vLLM (推理框架)
幾乎所有新訓練嘅大模型都用

Flash Attention 2 (2023) 同 Flash Attention 3 (2024) 進一步優化,速度更快!

詳細可以睇我嘅 Transformer 文章: Transformer 架構詳解：Attention、QKV 同 Multi-Head 機制

Memory-Efficient Attention (2022) ✅ 仍在用

來源: xFormers library (Meta)

核心 idea: 類似 Flash Attention,但用 different tiling strategies。

狀態: ✅ 仍在用,但 Flash Attention 更流行

Stable Diffusion 訓練用呢個
某啲 vision models 偏好用

PagedAttention (2023) ✅ 推理必備!

論文: Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023)

核心 idea: 將 KV cache 用 virtual memory paging 管理,減少記憶體碎片。

問題:

傳統推理,每個 request 要 pre-allocate 最大長度嘅 KV cache,浪費好多記憶體 (55-80% fragmentation!)

解決方法:

將 KV cache 分成固定 size 嘅 pages (例如每 page 64 tokens)
用 virtual memory 概念,dynamic allocation
唔同 requests 可以 share prefix pages (例如 system prompt)

效果:

✅ 2-4x throughput improvement
✅ 減少 55-80% 記憶體浪費
✅ 支持更多 concurrent requests

狀態: ✅ 推理框架標配!

邊個用緊?

vLLM (最流行嘅推理框架,原創者開發)
TensorRT-LLM (NVIDIA)
Text Generation Inference (Hugging Face)

如果你做 LLM serving,基本上必用!

Flash-Decoding (2023) ✅ 推理優化

論文: Flash-Decoding for Long-Context Inference (2023)

核心 idea: 針對 generation phase (decode) 優化 Flash Attention。

Decode 嗰陣每次只生成 1 個 token,但要 attend 去成個 context (可能幾萬 tokens),Flash-Decoding 用 parallelization 加速。

狀態: ✅ 推理框架逐步採用

整合進 Flash Attention 2/3
vLLM 等框架已支持

2023-2024: Multi-Query / Grouped-Query Attention 🚄

呢個時期重點係減少 KV cache size 嚟加快推理。

Multi-Query Attention (MQA) (2019/再度流行 2023) ✅ 推理優化常用

論文: Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019)

雖然 2019 年就提出,但係去到 2023 先開始流行!

核心 idea:

傳統 Multi-Head Attention: 每個 head 都有自己嘅 Q, K, V
MQA: 所有 heads share 同一組 K, V,只有 Q 係 per-head

對比例子:MHA vs MQA

假設 4 個 heads,sequence length = 3,hidden dim = 512:

Multi-Head Attention (MHA):

python# 每個 head 有自己嘅 K, V (head_dim = 128)
K_head1 = [3 × 128]  # Shape: [seq_len × head_dim]
V_head1 = [3 × 128]
K_head2 = [3 × 128]
V_head2 = [3 × 128]
K_head3 = [3 × 128]
V_head3 = [3 × 128]
K_head4 = [3 × 128]
V_head4 = [3 × 128]

# Total KV cache: 3 × 128 × 8 = 3,072 elements

Multi-Query Attention (MQA):

python# 所有 heads share 同一組 K, V
K_shared = [3 × 128]  # 只有一組!
V_shared = [3 × 128]

Q_head1 = [3 × 128]  # 每個 head 仍然有自己嘅 Q
Q_head2 = [3 × 128]
Q_head3 = [3 × 128]
Q_head4 = [3 × 128]

# Total KV cache: 3 × 128 × 2 = 768 elements
# 減少 4x!

實際計算 (token 0, head 1):

MHA:

\text{Attention}_{h1} = \text{softmax}(Q_{h1} K_{h1}^T) V_{h1}

MQA:

\text{Attention}_{h1} = \text{softmax}(Q_{h1} K_{\text{shared}}^T) V_{\text{shared}}

\text{Attention}_{h2} = \text{softmax}(Q_{h2} K_{\text{shared}}^T) V_{\text{shared}}

所有 heads 用同一組 K, V,但用唔同嘅 Q 去 query!

好處:

✅ KV cache size 減少 8-16x (如果有 8-16 個 heads)
✅ 推理速度快好多 (memory bandwidth bound → 減少記憶體存取)
❌ 但係效果會差少少 (quality degradation)

狀態: ✅ 某啲模型用,但 GQA 更常見

邊個用緊?

PaLM (Google)
Falcon 40B
StarCoder

Grouped-Query Attention (GQA) (2023) ✅ 而家最常見!

論文: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)

核心 idea: MQA 同 MHA 嘅折衷方案。

將 heads 分成 groups
每個 group share 同一組 K, V
例如:32 heads 分成 4 groups,每 group 8 heads share 一組 KV

對比例子:MHA vs GQA vs MQA

假設 8 個 query heads,sequence length = 3,head_dim = 128:

Multi-Head Attention (MHA) - 8 組 KV:

pythonK = [K₁, K₂, K₃, K₄, K₅, K₆, K₇, K₈]  # 每個 [3×128]
V = [V₁, V₂, V₃, V₄, V₅, V₆, V₇, V₈]

Q₁ → (K₁, V₁)
Q₂ → (K₂, V₂)
...  
Q₈ → (K₈, V₈)

# KV cache: 3 × 128 × 8 × 2 = 6,144 elements

Grouped-Query Attention (GQA) - 2 組 KV (每組 4 heads):

pythonK = [K_group1, K_group2]  # 只有 2 組!
V = [V_group1, V_group2]

# Group 1 (heads 1-4 share)
Q₁, Q₂, Q₃, Q₄ → (K_group1, V_group1)

# Group 2 (heads 5-8 share)  
Q₅, Q₆, Q₇, Q₈ → (K_group2, V_group2)

# KV cache: 3 × 128 × 2 × 2 = 1,536 elements  
# 減少 4x!

Multi-Query Attention (MQA) - 1 組 KV:

pythonK = [K_shared]  # 得一組!
V = [V_shared]

Q₁, Q₂, ..., Q₈ → (K_shared, V_shared)

# KV cache: 3 × 128 × 1 × 2 = 768 elements
# 減少 8x!

視覺化架構

javascriptMHA:  Q₁→K₁,V₁  Q₂→K₂,V₂  Q₃→K₃,V₃  Q₄→K₄,V₄  ...
      ━━━━━━━  ━━━━━━━  ━━━━━━━  ━━━━━━━
      獨立      獨立      獨立      獨立

GQA:  Q₁→╲      Q₂→╱      Q₃→╲      Q₄→╱
          K₁,V₁             K₂,V₂
      ━━━━━━━━━━━━━  ━━━━━━━━━━━━━
      Group 1 share    Group 2 share

MQA:  Q₁→╲  Q₂→╱  Q₃→╲  Q₄→╱  ...
            K_shared, V_shared  
      ━━━━━━━━━━━━━━━━━━━━━━━
      全部 share

Memory 對比表

方法	KV Groups	KV Cache Size	Quality	Speed
MHA (8 heads)	8	6,144	最好 ⭐⭐⭐	慢
GQA (2 groups)	2	1,536 (↓75%)	好 ⭐⭐	快 ⚡⚡
MQA (1 group)	1	768 (↓87.5%)	差啲 ⭐	最快 ⚡⚡⚡

LLaMA 2 70B 嘅選擇: 64 query heads, 8 KV groups → 每 group 8 heads share

好處:

✅ KV cache size 減少 4-8x (視乎 groups 數目)
✅ 效果比 MQA 好,接近 MHA
✅ 推理速度仍然快好多

狀態: ✅ 而家最主流嘅配置!

邊個用緊?

LLaMA 2 (Meta): 8 KV heads for 70B model
Mistral 7B: 8 KV heads
Qwen 2: GQA
Gemma: GQA

基本上 2023 之後嘅新模型,好多都用 GQA!

2024-2026: Mixture of Depths & 其他探索 🔬

Mixture of Depths (MoD) (2024) 🎓 研究階段

核心 idea: 唔係每個 token 都過完整嘅 transformer layer,根據 importance 決定邊啲 tokens 行 "express lane"。

動機:

某啲 tokens 重要 (例如 nouns, verbs),需要深度處理
某啲 tokens 唔重要 (例如 "the", "a"),可以 skip 某啲 layers

狀態: 🎓 太新,仍在研究

未有大規模 production 應用。

Infini-Attention (2024) 🎓 研究階段

論文: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (Google, 2024)

核心 idea: 結合 local attention + compressed memory,目標係處理無限長嘅 context。

狀態: 🎓 研究階段

Google 有做實驗,但未見普及。

Differential Attention (2024) 🎓 研究中

核心 idea: 用兩個 attention heads 做 difference,增強對 relevant info 嘅 focus。

狀態: 🎓 學術研究

Sliding Window Attention (Mistral, 2023) ✅ 仍在用

論文: Mistral 7B 引入

核心 idea: 每個 token 只 attend 前面 固定 window size 嘅 tokens (例如 4096)。

好處:

✅ KV cache 固定大小,唔會隨序列增長
✅ 推理速度穩定
❌ 但係失去真正嘅 long-range dependencies

狀態: ✅ Mistral 系列仍在用

但主流 LLMs 更偏好 full attention + better memory management。

已淘汰嘅方法總結 ❌

呢啲方法喺歷史上有貢獻,但而家基本上冇人用:

方法	年份	點解淘汰?	後繼者
Transformer-XL	2019	實作複雜,recurrence 唔夠平行	RoPE, ALiBi
Linformer	2020	信息損失,Flash Attention 更好	Flash Attention
Performer	2021	近似誤差,實作複雜	Flash Attention
Reformer	2020	LSH attention 唔穩定	Flash Attention
Synthesizer	2021	學習 attention 效果唔好	Standard attention
Routing Transformer	2021	實作複雜,收益有限	Dense attention

共同模式:

大部分被淘汰嘅方法都係試圖通過 approximation 嚟減少複雜度,但係:

2022 年 Flash Attention 出現,證明可以做到 exact attention 但 memory-efficient
Approximation 嘅 quality loss 唔值得

現役主流方法 ✅

如果你而家要 train 或 deploy 一個 transformer model,應該用咩組合?

訓練 (Training)

標準配置 (2026):

python# Position encoding
position_encoding = "RoPE"  # 或者 ALiBi

# Attention mechanism
attention = "Multi-Head Self-Attention"
attention_implementation = "Flash Attention 3"  # 必備!

# Query/Key/Value heads
use_GQA = True  # Grouped-Query Attention
num_query_heads = 32
num_kv_heads = 8  # 4x reduction

# Context length
max_seq_len = 8192  # 或更長

效果:

Flash Attention 3: 快 + memory-efficient
GQA: 推理快,效果好
RoPE: 位置建模強

推理 (Inference)

標配:

python# Attention optimization
attention_backend = "Flash Attention 3"

# KV cache management  
kv_cache_manager = "PagedAttention"  # vLLM

# Optional: 如果 context 超長
use_sliding_window = False  # 通常唔使

框架推薦:

vLLM: Flash Attention + PagedAttention,最快
TensorRT-LLM: NVIDIA 優化,A100/H100 最優
Text Generation Inference: Hugging Face,易用

Timeline 視覺化 📊

javascript2014 ■ Seq2Seq (RNN Encoder-Decoder)
       └─ 問題: fixed-size bottleneck

2015 ■ Bahdanau Attention ✅ (教學用)
     ■ Luong Attention ✅ (教學用)
       └─ 首次引入 attention 概念

2017 ████ Transformer 革命 ████
     ■ Scaled Dot-Product Attention ✅ 核心!
     ■ Multi-Head Attention ✅ 核心!
     ■ Self-Attention ✅ 核心!
     ■ Cross-Attention ✅ 仍在用
       └─ Attention is All You Need

2019 ■ Transformer-XL ❌ 已淘汰
     ■ Sparse Transformer 🎓 學術
     ■ Multi-Query Attention 🔄 2023 再流行

2020 ■ Longformer ✅ 文檔理解
     ■ BigBird 🎓 學術
     ■ Linformer ❌ 已淘汰
     ■ Reformer ❌ 已淘汰

2021 ■ Performer ❌ 已淘汰
     ■ RoPE ✅✅✅ 現在主流!
       └─ LLaMA, PaLM, GPT-NeoX 都用

2022 ████ Memory 革命 ████
     ■ Flash Attention ✅✅✅ 革命性!
     ■ ALiBi ✅✅ 第二主流
     ■ Memory-Efficient Attention ✅

2023 ■ Grouped-Query Attention (GQA) ✅✅✅ 推理標配!
     ■ PagedAttention ✅✅✅ 推理必備!
     ■ Flash Attention 2 ✅✅✅
     ■ Flash-Decoding ✅
     ■ Sliding Window (Mistral) ✅
       └─ LLaMA 2, Mistral 採用 GQA

2024 ■ Flash Attention 3 ✅✅✅
     ■ Mixture of Depths 🎓 研究
     ■ Infini-Attention 🎓 研究

2026 ← 你而家喺度!
     主流: Flash Attention 3 + GQA + RoPE/ALiBi + PagedAttention

技術選擇指南 🧭

點樣揀適合你嘅 attention 方案?

如果你係... 🤔

📚 學生 / 研究新手

學習路徑:

Bahdanau/Luong Attention (理解基本概念)
Scaled Dot-Product Attention (理解 QKV)
Multi-Head Attention (現代標準)
Flash Attention (理解 memory optimization)

🏢 訓練新模型 (Production)

推薦配置:

Position: RoPE (如果唔知揀邊個) 或 ALiBi (如果需要超長 context)
Attention: Multi-Head Self-Attention
Implementation: Flash Attention 3
Heads: GQA (例如 32 query heads, 8 KV heads)
Framework: PyTorch + Flash Attention official library

🚀 部署推理服務

推薦配置:

Framework: vLLM (最快)
KV Cache: PagedAttention (vLLM 內建)
Attention: Flash Attention 3
如果用 NVIDIA GPU: 可以試 TensorRT-LLM

📄 處理超長文檔 (32k+ tokens)

選項 1 (dense attention):

RoPE + Flash Attention + Ring Attention
需要多張 GPU,用 sequence parallelism

選項 2 (sparse attention):

Longformer (16k-32k tokens)
或者用 RAG (Retrieval-Augmented Generation) 切細文檔

🎨 多模態模型 (Vision + Language)

Cross-Attention 仍然需要 (連接 vision encoder 同 language decoder)
例子: CLIP, Flamingo, LLaVA

未來趨勢預測 🔮

基於而家嘅發展,我估未來幾年會有呢啲趨勢:

1. Flash Attention 系列繼續主導 ✅

Flash Attention 4, 5... 會繼續優化
更好嘅 hardware utilization (特別係新 GPU 架構)
可能會有 specialized hardware accelerators for attention

2. GQA 成為標配 ✅

幾乎所有新模型都會用 GQA
可能會有 dynamic GQA (根據 layer 調整 KV heads 數目)

3. 位置編碼趨向統一 🤔

RoPE 同 ALiBi 仍然會係主流
可能會出現統一兩者優點嘅新方法
RoPE scaling 技術會更成熟 (YaRN, NTK-aware 等)

4. Context Length 持續增長 📏

2026: 128k-256k tokens 會變常見
需要更好嘅 memory management (PagedAttention 改進版)
可能會有 hierarchical attention (唔同 layers 用唔同 context window)

5. Sparse Attention 再度回歸? 🔄

當 context length 去到 1M+ tokens,dense attention 會太貴
可能會有新一代 sparse patterns,配合 Flash Attention
Mixture of Depths 概念可能會成熟

6. 專門化 Attention 🎯

唔同 modalities 用唔同 attention mechanisms
例如:vision 用 local attention,language 用 global attention
Multi-modal models 會有更複雜嘅 attention routing

7. Hardware-Software Co-design ⚡

Attention 專用嘅 hardware accelerators
類似 TPU 專為 matrix multiplication 優化,未來可能有 "Attention Processing Unit"

我嘅睇法 💭

研究咗咁多 attention mechanisms,我有幾個感想:

1. 簡單通常更好

好多複雜嘅變種 (Linformer, Performer, Reformer) 最終都失敗咗,反而簡單嘅 dense attention + 優化 implementation (Flash Attention) 贏咗。

教訓: 唔好急住 approximate,先 optimize exact solution。

2. 工程優化同理論創新一樣重要

Flash Attention 冇改變 attention 嘅數學公式,只係改變咗計算順序同 memory access pattern,但帶嚟嘅影響比好多理論上 fancy 嘅方法都大。

教訓: 理解 hardware (GPU memory hierarchy) 好重要!

3. 位置編碼超級重要

好多人以為 attention 機制本身最重要,但其實 position encoding (RoPE, ALiBi) 對 performance 嘅影響好大,特別係長序列。

教訓: 唔好忽略 "配角" 組件。

4. 推理同訓練嘅需求好唔同

訓練時最緊要 throughput (Flash Attention),推理時最緊要 latency 同 memory (PagedAttention, GQA)。

教訓: 設計 model 要同時考慮訓練同部署。

5. 某啲 "舊" idea 會返嚟

Multi-Query Attention 喺 2019 年提出,但去到 2023 年先流行 (因為推理需求增加)。Sparse attention 可能都會喺未來再度興起 (當 context length 去到 millions)。

教訓: Keep an eye on "old" papers,佢哋可能會喺新 context 下有用。

參考資料 📚

經典論文:

Bahdanau et al. (2015): Neural Machine Translation by Jointly Learning to Align and Translate
Vaswani et al. (2017): Attention Is All You Need
Su et al. (2021): RoFormer: Enhanced Transformer with Rotary Position Embedding
Press et al. (2022): Train Short, Test Long: Attention with Linear Biases
Dao et al. (2022): FlashAttention: Fast and Memory-Efficient Exact Attention
Ainslie et al. (2023): GQA: Training Generalized Multi-Query Transformer Models
Kwon et al. (2023): Efficient Memory Management for LLM Serving with PagedAttention

相關文章:

框架 & 實現: