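LLM Pre-training Architectures Explained: How Decoder-Only, Encoder-Only, and Encoder-Decoder Differ and How Each Is Trained

Why does GPT use a decoder-only design, BERT an encoder-only one, and T5 both? How do these architectures differ, and how do their training objectives differ? This article breaks down the three main LLM pre-training architectures.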

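TL;DR

Key points:

🎯 Three architectures: Decoder-Only (left-to-right generation), Encoder-Only (bidirectional understanding), Encoder-Decoder (understanding plus generation)
🔄 Training objectives: Causal LM (GPT), Masked LM (BERT), Span Corruption (T5)
✅ Trend: Decoder-Only is now the mainstream (GPT-4, Llama, DeepSeek) because one model can handle both understanding and generation
⚠️ Misconception: Decoder-Only is not limited to generation; through prompting, or a prefix-LM variant with bidirectional attention over the input, it handles understanding tasks too
📊 Choice: for a new project, start with Decoder-Only unless you only need embeddings or classification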

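Pre-training fundamentals
Historical timeline
Architecture 1: Decoder-Only (GPT family)
Architecture 2: Encoder-Only (BERT family)
Architecture 3: Encoder-Decoder (T5 family)
Comparing the three architectures
Why did Decoder-Only become the mainstream?
Practical use cases
Summary
Related resources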

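Pre-training fundamentals

What is pre-training?

Pre-training means training a model on large amounts of unlabeled text so that it learns the basic regularities of language. The model is then adapted to specific tasks through fine-tuning or prompting.

The core goals of pre-training:

Learn the statistical regularities of language (grammar, semantics, common sense)
Build rich representations of vocabulary and concepts
Provide a strong starting point for downstream tasks

Self-Supervised Learning

Pre-training uses self-supervised learning: the training signal is derived from the data itself, so no manual labeling is needed.

The three training approaches in everyday terms:

Decoder-Only (GPT): generate in order, one token after another
Encoder-Only (BERT): read the whole sentence, then fill in the blanks (context on both sides is visible)
Encoder-Decoder (T5): read and understand the whole sentence, then generate the answer in order

Core idea

The power of pre-training comes from scale: models trained on hundreds of billions of tokens show emergent abilities, skills they were never explicitly taught, such as in-context learning and chain-of-thought reasoning.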

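Token Representations: how does text become vectors?

Before looking at the three architectures, we need to understand how a Transformer processes input text. Every architecture has to turn text into numeric vectors (embeddings), and that process involves three components:

Token Embeddings

A token embedding maps each token (a character, word, or subword) to a vector of fixed dimension.

Everyday analogy: a dictionary lookup

Imagine a special dictionary in which every word maps to a list of numbers:

"today" → [0.2, -0.5, 0.8, ...] (d_model dimensions)
"weather" → [-0.1, 0.3, -0.2, ...]
"nice" → [0.5, 0.1, 0.4, ...]

These numbers are learned during training, and words with similar meanings end up with similar vectors.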

Implementation:

import torch
import torch.nn as nn

# Assume a vocabulary of 50,000 tokens and an embedding dimension of 768
vocab_size = 50000
d_model = 768

# Token embedding layer (effectively a lookup table)
token_embedding = nn.Embedding(vocab_size, d_model)

# Input: token IDs (arbitrary IDs standing in for "today", "weather", "nice")
input_ids = torch.tensor([1234, 5678, 9012])

# Output: embeddings
embedded = token_embedding(input_ids)
print(embedded.shape)  # torch.Size([3, 768])

Properties:

📏 Fixed dimension: whatever the length of the token's text, the output dimension is d_model
🎓 Learnable: the embedding matrix is a model parameter and is updated during training
🔢 Large matrix: vocab_size × d_model (for example 50K × 768 = 38.4 million parameters)

Segment Embeddings / Token Type Embeddings

A segment embedding distinguishes different sentences or segments. It is used mainly by BERT and by some Encoder-Decoder models.

Everyday analogy: colored labels

Imagine you need to compare two sentences:

Sentence A: "The weather is lovely today" → blue label (Segment ID = 0)
Sentence B: "Let's go hiking" → red label (Segment ID = 1)

The labels tell the model which tokens belong to which sentence.

How BERT uses it:

import torch
import torch.nn as nn

d_model = 768

# BERT input format
input_text = "[CLS] Sentence A [SEP] Sentence B [SEP]"

# Token IDs (illustrative values)
token_ids =     [101, 1234, 5678, 102, 9012, 3456, 102]
# Token type IDs (segment IDs)
segment_ids =   [  0,    0,    0,   0,    1,    1,   1]
#                [CLS]   A     A  [SEP]   B     B  [SEP]

# Segment embedding: only two segment types, 0 and 1
segment_embedding = nn.Embedding(2, d_model)
segment_embed = segment_embedding(torch.tensor(segment_ids))

Uses:

✅ Next Sentence Prediction (NSP): a BERT pre-training task
✅ Question Answering: separating the question from the context
✅ Multi-turn dialogue: separating different speakers

⚠️ Important:

The GPT family (Decoder-Only) generally does not use segment embeddings
T5 does not use segment embeddings either (it uses a text prefix instead)

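Positional Encodings

Positional encoding is one of the key ingredients of the Transformer. Self-attention has no built-in notion of order, so a positional encoding is needed to tell the model where each token sits.

Everyday analogy: seat numbers

Imagine a class of students standing in a line:

Without seat numbers: the teacher only knows which students are present, not where they stand
With seat numbers: the teacher knows that Alice is in position 1 and Bob is in position 2

A Transformer needs seat numbers (position information) before it can handle word order.

Three main approaches:

Sinusoidal positional encoding: position vectors are computed from a fixed formula, with no training:

import numpy as np
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """
    max_len: maximum sequence length
    d_model: embedding dimension (assumed to be even)
    """
    position = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # even dimensions use sin
    pe[:, 1::2] = np.cos(position * div_term)  # odd dimensions use cos
    
    return torch.FloatTensor(pe)

# Example
pe = sinusoidal_positional_encoding(max_len=512, d_model=768)
print(pe.shape)  # torch.Size([512, 768])

Formula:

\text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)

\text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)

Advantages:

✅ No training needed, so no extra parameters
✅ Can, in theory, handle sequences of any length
✅ Regular mathematical structure: the encoding at a fixed offset is a linear function of the current encoding

Disadvantages:

❌ Inflexible: the formula is fixed
❌ In practice, quality on very long sequences is mediocre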

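Learned positional embeddings: positions are treated as learnable parameters, just like token embeddings:

import torch
import torch.nn as nn

d_model = 768
max_position = 512  # BERT's maximum sequence length
position_embedding = nn.Embedding(max_position, d_model)

# Input: position IDs
seq_len = 16  # example sequence length, must not exceed max_position
position_ids = torch.arange(0, seq_len)  # [0, 1, 2, 3, ...]

# Output: position vectors, shape (seq_len, d_model)
pos_embed = position_embedding(position_ids)

Advantages:

✅ More flexible: the model learns its own representation of position
✅ Often works as well as or better than fixed encodings in practice

Disadvantages:

❌ Fixed maximum length (for example, BERT handles at most 512 tokens)
❌ Needs extra parameters (max_len × d_model)

Relative positional encoding: instead of recording absolute positions (1st, 2nd, 3rd), it records relative distance ("two positions apart").

Example:

# Relative position: the distance between token i and token j
i, j = 5, 2
relative_distance = i - j

# For example:
# token 5 attends to token 2 → relative position = 5 - 2 = 3
# token 2 attends to token 5 → relative position = 2 - 5 = -3

Advantages:

✅ Generalizes better (relative distances seen during training can be applied to longer sequences at inference time)
✅ Closer to linguistic intuition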
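
To make the relative-distance idea concrete, here is a minimal sketch that builds the matrix of offsets i - j for a short sequence. The clipping to `max_distance` is an assumption added for illustration (several relative-position schemes share one learned value for all distant positions); the function name and parameters are hypothetical and not taken from any particular model.

```python
def relative_position_matrix(seq_len, max_distance):
    """Return a seq_len x seq_len matrix of clipped relative offsets i - j."""
    matrix = []
    for i in range(seq_len):          # i: query position
        row = []
        for j in range(seq_len):      # j: key position
            offset = i - j
            # Clip so that all far-away positions fall into the same bucket.
            offset = max(-max_distance, min(max_distance, offset))
            row.append(offset)
        matrix.append(row)
    return matrix


rel = relative_position_matrix(seq_len=6, max_distance=3)
print(rel[5][2])  # token 5 attending to token 2
print(rel[2][5])  # token 2 attending to token 5
print(rel[5][0])  # distance 5, clipped to max_distance
```

A model would typically use each offset as an index into a small learned table and add the looked-up value to the attention score, so the same table works for any sequence length.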

Putting it together: the final input representation

The final input is the sum of these embeddings. BERT adds all three; models without segment embeddings add only the token and position embeddings:

import torch
import torch.nn as nn

class TransformerInputEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.position_embed = nn.Embedding(max_len, d_model)
        # Segment embedding (BERT-style models only)
        self.segment_embed = nn.Embedding(2, d_model)
    
    def forward(self, input_ids, segment_ids=None):
        # input_ids: (batch_size, seq_len)
        seq_len = input_ids.size(1)
        
        # 1. Token embeddings: (batch_size, seq_len, d_model)
        token_embeddings = self.token_embed(input_ids)
        
        # 2. Position embeddings: (seq_len, d_model), broadcast over the batch
        position_ids = torch.arange(seq_len, device=input_ids.device)
        position_embeddings = self.position_embed(position_ids)
        
        # 3. Segment embeddings (optional)
        if segment_ids is not None:
            segment_embeddings = self.segment_embed(segment_ids)
            embeddings = token_embeddings + position_embeddings + segment_embeddings
        else:
            embeddings = token_embeddings + position_embeddings
        
        return embeddings

Key takeaways

Input = Token Embed + Position Embed + (Segment Embed)

📝 Token embedding: the semantic vector of each token

📍 Position embedding: tells the model where each token sits

🏷️ Segment embedding: distinguishes sentences (BERT-style models)

The embeddings are added, not concatenated, so the result keeps dimension d_model.

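What if we trained on all of human knowledge?

A common question: why not simply train an AI on "all human knowledge"? Wouldn't it then know everything?

In practice:

Modern large LLMs (such as GPT-4 and Llama 3) already go a long way toward this goal, at least for public text.

Rough numbers:

GPT-3: a corpus of about 500B tokens, of which about 300B were consumed during training (on the order of 5 million books, assuming roughly 100K tokens per book)
Llama 3: about 15T training tokens (on the order of 150 million books under the same assumption)
All human publications: rough estimates are on the order of 100-200T tokens

So modern LLMs have been trained on a substantial share of humanity's publicly available text.

Fun fact

OpenAI has not published GPT-4's training data, but corpora for models of this class are generally understood to include:

📚 Nearly all of English Wikipedia

📖 Millions of books

💻 Large amounts of GitHub code

📄 Tens of millions of academic papers

🌐 Large amounts of web content

That is many times more than any one person could read in a lifetime.

But there are still limits:

Not truly "all" knowledge
- ❌ Unpublished research (trade secrets, military material)
- ❌ Personal experience and tacit knowledge (how to ride a bicycle, how to cook)
- ❌ Real-time information (today's weather, breaking news)
- ❌ Multimedia content (knowledge locked inside images, video, and music)
Data quality problems
- ⚠️ The web contains a great deal of misinformation and fake news
- ⚠️ Bias and stereotypes (the data reflects the biases of human society)
- ⚠️ Different sources may contradict each other
Knowledge freshness
- 📅 Knowledge cutoff: training data stops at a fixed date
- 📅 For example, the original GPT-4 had a September 2021 cutoff, and GPT-4 Turbo launched with an April 2023 cutoff
- 📅 The model does not know anything newer than that
Understanding vs memorization
- 🤔 Does an LLM "memorize" knowledge or "understand" it?
- 🤔 It may have memorized many facts without grasping the principles behind them
- 🤔 This is one reason hallucination (making things up) occurs

Why not just train on more data?

The real issue is not data volume, but:

Data Quality > Data Quantity
- 1T high-quality tokens can beat 10T low-quality tokens
- DeepSeek-V3 is often cited here: its report stresses curated, high-quality data (about 14.8T tokens) rather than sheer volume
Reasoning > Memorization
- Current research focuses on improving reasoning
- Techniques such as Chain-of-Thought and Self-Consistency
- Rather than simply adding more data
Alignment > Raw Knowledge
- RLHF (Reinforcement Learning from Human Feedback)
- Making sure the model's output matches human values
- Helpful, harmless, honest

Key takeaways

Modern LLMs have been trained on a large share of public human knowledge, but:

✅ Quantity is ample: trillions of tokens, far more than a person reads in a lifetime

⚠️ Quality is uneven: information on the web is a mix of true and false

📅 It goes stale: the knowledge cutoff problem

🤔 Knowing is not understanding: facts may be memorized without comprehension

The way forward is better data quality and better reasoning, not simply more data.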

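Historical timeline

How did the Transformer turn into three architectures?

Many people wonder why there are three different architectures and which came first. Here is how LLM pre-training architectures developed:

Key milestones

2017: the Transformer is born

Google publishes "Attention Is All You Need"
Original design: Encoder-Decoder (built for machine translation)
Introduces self-attention as a replacement for RNNs

2018: GPT-1 vs BERT

GPT-1 (Decoder-Only) and BERT (Encoder-Only) both appear, taking opposite halves of the original Transformer

2018-2019: the BERT era

BERT sets new records on 11 NLP tasks
Everyone builds BERT variants: RoBERTa, ALBERT, DistilBERT, ELECTRA
Prevailing view: Encoder-Only is the strongest, and Decoder-Only is little more than a toy

2019-2020: GPT-2 and GPT-3 strike back

GPT-2 (Feb 2019): 1.5B parameters, demonstrates zero-shot ability
T5 (Oct 2019): a unified text-to-text framework, Encoder-Decoder
GPT-3 (May 2020): 175B parameters, few-shot learning takes off
Turning point: the field notices that Decoder-Only models scale best

2021-2023: Decoder-Only takes the lead

2021: GitHub Copilot (built on OpenAI Codex) proves the commercial value of generative AI
2022: ChatGPT goes viral and draws worldwide attention
2023: Llama 1 and 2 are released openly, and Decoder-Only becomes the open-source default
2023: few new Encoder-Only models are released

2024-2026: Decoder-Only dominates

Llama 3, GPT-4, Mistral, Qwen, and DeepSeek are all Decoder-Only
Encoder-Decoder (T5/BART) is largely displaced for new general-purpose LLM work
Default choice for a new project: Decoder-Only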

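How did BERT go from mainstream to niche?

2018-2019: BERT dominates

Sets new records across NLP benchmarks

Citations surge

Everyone builds BERT variants

2020-2023: GPT-3 changes the rules

Few-shot learning with no fine-tuning

Scaling laws: bigger models perform better

ChatGPT proves the value of generative AI

2024-2026: Decoder-Only wins out

One model handles every task

The open-source community embraces it (Llama, Mistral)

BERT remains only in specific niches (embeddings, classification)

Key turning points

Turning point 1: GPT-3 and scaling laws (2020)

Shows that Decoder-Only models keep improving as they scale
Emergent abilities appear at scale (later including chain-of-thought reasoning)
In-context learning starts to replace fine-tuning

Turning point 2: the success of ChatGPT (2022)

Shows the general public that generative AI is useful
BERT can understand but not generate, which becomes a handicap
Conversational AI becomes the new standard

Turning point 3: Llama goes open (2023)

Meta releases the Llama family openly
Shows that Decoder-Only models can be reproduced in the open
The community shifts to Decoder-Only

Key takeaways

Timeline summary:

📅 2017: Transformer (Encoder-Decoder)

📅 2018: GPT-1 (Decoder-Only) vs BERT (Encoder-Only)

📅 2019: T5 (Encoder-Decoder) unifies tasks as text-to-text

📅 2020: GPT-3 shows that Decoder-Only scales best

📅 2022: ChatGPT ignites generative AI

📅 2023-2026: Decoder-Only dominates

How the logic evolved:

At first, bidirectional models (BERT) were assumed to be the strongest

Then unidirectional models (GPT) turned out to scale better

Prompting began to replace fine-tuning

In the end, Decoder-Only won out

Future trends

Possible developments after 2026:

Decoder-Only keeps leading: there is no sign of this changing
Hybrid architectures: new mixed designs may emerge (for example Mamba + Attention)
Encoder-Only becomes niche: used only in specific settings (embeddings, edge devices)
Encoder-Decoder fades: apart from maintaining older projects, it will see little new use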

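Architecture 1: Decoder-Only (GPT family)

Architecture

Decoder-Only is the simplest of the three. It uses only the Transformer decoder stack (masked self-attention plus feed-forward layers, with no cross-attention), and each token can attend only to the tokens before it (causal / unidirectional attention).

Everyday analogy: a story relay

Imagine playing a story relay with friends, where each person continues from what has already been written:

Friend A: "Once upon a time there was a brave knight"
Friend B: "He lived in a castle" (sees only "Once upon a time there was a brave knight")
Friend C: "and went out on adventures every day" (sees everything so far, but not what comes next)

That is Decoder-Only: you can only continue, in order, from what has already been written.

Causal attention mask:

# Each position can see only itself and the tokens before it
[1, 0, 0, 0]  # Token 0 sees only itself
[1, 1, 0, 0]  # Token 1 sees 0, 1
[1, 1, 1, 0]  # Token 2 sees 0, 1, 2
[1, 1, 1, 1]  # Token 3 sees 0, 1, 2, 3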
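
As a rough illustration of how such a mask is applied, the sketch below builds the lower-triangular mask and uses it inside a softmax so that masked positions receive zero attention weight. It uses plain Python lists rather than tensors, and the helper names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: row i has 1s for positions 0..i and 0s after."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked positions get weight 0."""
    visible = [s for s, m in zip(scores, mask_row) if m]
    peak = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
# Attention weights for token 1: only tokens 0 and 1 may receive weight.
weights = masked_softmax([0.5, 1.0, 2.0, 3.0], mask[1])
print(mask)
print(weights)
```

Real implementations usually add a large negative number to the masked scores before the softmax instead of zeroing the exponentials, which has the same effect.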

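Training objective: Causal Language Modeling (CLM)

Decoder-Only models are trained with next-token prediction, like writing forward one word at a time:

In simple terms

Imagine you are reading half a sentence, "The weather today is very", and have to guess the next word.

You can see only the words before it ("The weather today is very")

You cannot peek at what comes after

You generate one token at a time, in order: "The weather today is very" → "sunny"

That is how a Decoder-Only model is trained.

Training process:

# Input text: "The cat sat on the mat"
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Inputs and targets during training: the targets are the inputs shifted by one
inputs = tokens[:-1]   # [The, cat, sat, on, the]
targets = tokens[1:]   # [cat, sat, on, the, mat]

# Every position predicts the next token
# Position 0: "The"         → predict "cat"
# Position 1: "The cat"     → predict "sat"
# Position 2: "The cat sat" → predict "on"
# ...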
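
The shift between inputs and targets can be sketched as a small helper that lists every (context, next token) training pair for one sentence. This is an illustrative stand-in (the name `next_token_pairs` is hypothetical); a real training loop computes all of these predictions in a single parallel pass using the causal mask.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    targets = tokens[1:]
    return [(tokens[: i + 1], targets[i]) for i in range(len(targets))]


tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A sequence of n tokens yields n - 1 training pairs, which is why every position except the last contributes to the loss.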

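Loss function:

import torch.nn.functional as F

# Cross-entropy loss on the next token
def causal_lm_loss(logits, targets):
    """
    logits: (batch_size, seq_len, vocab_size)
    targets: (batch_size, seq_len), the same token IDs that were fed in as input
    """
    # Position t predicts token t+1, so drop the last logit and the first target
    shift_logits = logits[:, :-1, :].contiguous()
    shift_targets = targets[:, 1:].contiguous()
    vocab_size = shift_logits.size(-1)
    
    # Flatten batch and sequence dimensions, then average over all positions
    loss = F.cross_entropy(
        shift_logits.view(-1, vocab_size),
        shift_targets.view(-1)
    )
    return loss

Advantages

✅ Simple architecture: only a decoder is needed, so it is easy to implement and train

✅ Strong generation: designed for generation from the start

✅ Unified framework: one model can handle every task (understanding, generation, reasoning)

✅ Scales well: the larger the model, the stronger its emergent abilities

✅ In-context learning: few-shot learning through prompting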

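Drawbacks and technical limits

Although Decoder-Only is now the mainstream, it still has a number of limits:

Sequential generation bottleneck

❌ Autoregressive decoding: at inference time, tokens must be produced one at a time, so generation cannot be parallelized across output positions (training, by contrast, processes all positions in parallel)

# Decoder-Only generation
# t=0: generate token 0
# t=1: generate token 1 (must wait for token 0)
# t=2: generate token 2 (must wait for tokens 0 and 1)
# ... and so on

⏱️ Slow inference: generating 100 tokens takes 100 forward passes
💰 High cost: every token requires running the whole model
📉 Low throughput: batch size is capped by GPU memory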
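
The one-forward-pass-per-token pattern can be sketched with a toy stand-in for the model. Everything here is hypothetical: `BIGRAM_SCORES` is a tiny hand-written table playing the role of a language model, and `forward` stands in for a full network pass. Only the shape of the loop is the point.

```python
# Hypothetical toy "model": maps the last token to scores for the next token.
BIGRAM_SCORES = {
    "<bos>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "<eos>": 0.2},
    "sat": {"<eos>": 1.0},
}


def forward(tokens):
    """Stand-in for one full forward pass over the sequence so far."""
    return BIGRAM_SCORES[tokens[-1]]


def greedy_generate(prompt, max_new_tokens):
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        scores = forward(tokens)      # needs every previously generated token
        forward_passes += 1
        next_token = max(scores, key=scores.get)  # greedy: pick the top score
        tokens.append(next_token)
        if next_token == "<eos>":
            break
    return tokens, forward_passes


tokens, passes = greedy_generate(["<bos>"], max_new_tokens=10)
print(tokens, passes)
```

Each new token depends on the previous one, so the loop cannot be unrolled in parallel; a KV cache reduces the work done per pass but does not remove the sequential dependency.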

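Practical impact:

With a GPT-4-class model, generating a 1,000-word article can take on the order of 10-30 seconds
Long generations (10K+ tokens) can take several minutes

Context window limits

❌ Memory complexity: O(n²), growing with the square of the sequence length

Cause: self-attention computes a score for every pair of tokens
Consequences:
- GPT-3.5: 4K context (about 3,000 words)
- GPT-4: 8K-32K context (depending on the version)
- Llama 2: 4K context
- Llama 3.1: 128K context (but it needs a lot of GPU memory)

A concrete calculation:

# Assume d_model = 4096, batch_size = 1
seq_len = 4096  # tokens
# One attention matrix: 4096 × 4096 × 4 bytes (FP32) ≈ 67 MB
# Materializing it for 32 layers × 32 heads at once would take about 68.7 GB of GPU memory.
# Real implementations avoid this, for example by working layer by layer or using fused attention kernels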
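
The arithmetic above can be checked with a short sketch. It assumes FP32 scores (4 bytes each) and the naive case in which every layer and head keeps its full attention matrix in memory at once; the function names are hypothetical.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Memory for one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_value


def naive_attention_bytes(seq_len, n_layers, n_heads, bytes_per_value=4):
    """Upper bound if every layer and head stored its matrix simultaneously."""
    return attention_matrix_bytes(seq_len, bytes_per_value) * n_layers * n_heads


one_matrix = attention_matrix_bytes(4096)
all_matrices = naive_attention_bytes(4096, n_layers=32, n_heads=32)
print(one_matrix / 1e6, "MB per matrix")
print(all_matrices / 1e9, "GB if all are materialized")

# Quadratic growth: doubling the sequence length quadruples the memory.
print(attention_matrix_bytes(8192) / one_matrix)
```

The last line is the practical reason long contexts are expensive: going from 4K to 128K tokens multiplies the size of a naive attention matrix by about a thousand.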

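Limits of unidirectional attention

❌ Causal masking: in principle the model can see only the preceding text

It does worse on cloze-style tasks that need the following text
Example: "I ate ___ today, so I am very full" (the answer is a food, but the decoder cannot see "very full" when it predicts the blank)

Workarounds:

Prefix LM: use bidirectional attention over the input portion
Prompting: put the following-text information into the prompt
Limited impact in practice: most tasks do not need truly bidirectional understanding

High demand for training data

❌ Needs huge amounts of data: Decoder-Only scaling laws call for the data to grow along with the model size

GPT-3 (175B): about 300B training tokens, drawn from a corpus of about 500B
Llama 3 (70B): 15T tokens
Hard for small companies to afford: collecting, cleaning, and storing the data are all major costs

Generation quality problems

❌ Hallucination: the model "makes things up", producing content that is not true

Cause: it only predicts the next token from the preceding text and has no fact-checking mechanism
Consequence: it may produce wrong information or invented facts

❌ Repetition: it tends to repeat the same content

Cause: autoregressive decoding plus statistical biases in the training data
Remedies: repetition penalty, nucleus sampling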
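
As a rough sketch of nucleus (top-p) sampling, the filter below keeps the smallest set of highest-probability tokens whose combined probability reaches `top_p` and renormalizes them; the next token would then be sampled from this reduced set. The probabilities are invented for illustration, and `nucleus_filter` is a hypothetical helper rather than a library function.

```python
def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their total mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 again.
    return {token: p / mass for token, p in kept}


# Invented next-token distribution after "The weather today is very".
probs = {"sunny": 0.5, "rainy": 0.3, "cloudy": 0.15, "purple": 0.05}
filtered = nucleus_filter(probs, top_p=0.75)
print(filtered)
```

Cutting off the low-probability tail removes implausible continuations while still leaving room for variety, which is why it is commonly paired with a repetition penalty.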

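❌ Exposure bias

Problem: during training the model sees ground-truth tokens, but at inference time it sees its own generated tokens
Consequence: once it generates a wrong token, later tokens can go wrong in a chain

Very high compute requirements

❌ Training cost:

GPT-3: an estimated 4.6 million US dollars (for a single training run)
GPT-4: estimated at tens of millions of US dollars or more (OpenAI has not published a figure)
Llama 3 (70B): thousands of H100 GPUs for several weeks

❌ Inference cost:

GPT-4 API (launch pricing): $0.03 / 1K input tokens, $0.06 / 1K output tokens
Self-hosting: requires multiple A100/H100 GPUs

Knowledge cutoff

❌ Stale knowledge: the model knows only what existed before its training cutoff

GPT-4: September 2021 for the original model, April 2023 for GPT-4 Turbo at launch
It knows nothing about real-time information (news, stock prices, weather)

Workarounds:

RAG (Retrieval-Augmented Generation)
Calls to external APIs
Fine-tuning (but this is costly)

Representative models

GPT family: GPT-2, GPT-3, GPT-4
Llama family: Llama 2, Llama 3, Llama 3.1
DeepSeek family: DeepSeek-V2, DeepSeek-V3
Mistral family: Mistral 7B, Mixtral 8x7B
Qwen family: Qwen 2.5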

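Architecture 2: Encoder-Only (BERT family)

Architecture

Encoder-Only uses only the Transformer encoder. Each token can attend to every other token (bidirectional attention).

Everyday analogy: a reading comprehension exercise

Imagine a fill-in-the-blank reading exercise:

Question: "Today the ___ is very sunny, so I decided to go hiking"
You read the whole sentence, including "Today" before the blank and "very sunny" and "hiking" after it
From the context on both sides, you know the blank should be "weather"
You fill in the answer in one go, with no need to generate it token by token

That is Encoder-Only: it reads the context on both sides at once, which makes it well suited to understanding tasks such as classification and sentiment analysis.

Bidirectional attention:

# Each position can see every token
[1, 1, 1, 1]  # Token 0 sees everything
[1, 1, 1, 1]  # Token 1 sees everything
[1, 1, 1, 1]  # Token 2 sees everything
[1, 1, 1, 1]  # Token 3 sees everything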

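Training objective: Masked Language Modeling (MLM)

Encoder-Only models are trained with masked token prediction, like a fill-in-the-blank exercise:

In simple terms

Imagine a fill-in-the-blank question, "Today the ___ is very sunny", where you have to guess the missing word.

You can see the context on both sides ("Today the" and "is very sunny")

You fill in all the blanks in one go (not token by token)

The answer is clearly "weather"

That is how an Encoder-Only model is trained. Because it sees both sides of the context, it is strong at understanding.

Training process:

# Original text: "The cat sat on the mat"
# Randomly select 15% of the tokens for prediction

# Example 1:
# Input:  "The [MASK] sat on the mat"
# Target: predict "cat" at the [MASK] position

# Example 2:
# Input:  "The cat [MASK] on [MASK] mat"
# Target: predict "sat" and "the" at the [MASK] positions

Masking strategy (BERT), applied to the selected tokens:

80%: replaced with the [MASK] token
10%: replaced with a random token
10%: left unchanged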
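
The 80/10/10 rule can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: it selects each position independently with probability `mask_prob` (an assumption; implementations often pick a fixed number of positions instead) and works on whole words rather than subword IDs. The function name `bert_mask` is hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def bert_mask(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted_tokens, labels); labels are None where no loss is computed."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # position not selected for prediction
        labels[i] = token                     # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels


rng = random.Random(0)
tokens = "the cat sat on the mat".split()
# mask_prob is raised to 0.5 here only so that a six-token example shows some masking.
corrupted, labels = bert_mask(tokens, vocab=tokens, rng=rng, mask_prob=0.5)
print(corrupted)
print(labels)
```

Keeping some selected tokens unchanged, and swapping others for random tokens, means the model cannot assume that every non-[MASK] token is correct, which narrows the gap between pre-training inputs and real inputs.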

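Loss function:

import torch.nn.functional as F

def mlm_loss(logits, targets, mask_positions):
    """
    Compute the loss only at the masked positions.
    logits: (batch_size, seq_len, vocab_size)
    targets: (batch_size, seq_len)
    mask_positions: (batch_size, seq_len), binary mask (1 = position selected for prediction)
    """
    vocab_size = logits.size(-1)
    # Keep only the positions that were selected for prediction
    active_loss = mask_positions.view(-1) == 1
    active_logits = logits.view(-1, vocab_size)[active_loss]
    active_labels = targets.view(-1)[active_loss]
    
    loss = F.cross_entropy(active_logits, active_labels)
    return loss

Other training objectives

Next Sentence Prediction (NSP): used by BERT, but later found to add little

# Train the model to judge whether two sentences are consecutive
# Input:  "[CLS] Sentence A [SEP] Sentence B [SEP]"
# Target: IsNext=1 or IsNext=0

Advantages

✅ Bidirectional understanding: uses context on both sides at once, so it understands well

✅ High-quality embeddings: the contextualized embeddings it produces work well for downstream tasks

✅ Efficient inference: a single forward pass encodes the whole sequence in parallel, with no token-by-token decoding

✅ Good for classification: add a classification head and fine-tune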

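Drawbacks and technical limits

No practical generation ability

❌ Architectural limit: the encoder is trained to fill in masked positions using context on both sides, not to predict the next token from the left context alone, so it cannot generate text autoregressively in any natural way

Consequences:
- No text generation
- No machine translation
- No summarization (unless it is extractive rather than abstractive)

Comparison:

# Decoder-Only: the logits at the last position give the next-token distribution
# (`model` here stands for any causal language model)
logits = model(input_ids)  # (batch, seq_len, vocab_size)
next_token = torch.argmax(logits[:, -1, :], dim=-1)  # predict the next token

# Encoder-Only: the encoder returns contextual hidden states
hidden_states = model(input_ids)  # (batch, seq_len, d_model)
# An MLM head can map these to the vocabulary, but only to fill [MASK] positions,
# not to extend the sequence token by token

Pre-training / fine-tuning mismatch

❌ The [MASK] token problem:

Pre-training: the model learns to predict [MASK] tokens
Fine-tuning: real inputs contain no [MASK] tokens
Consequence: the input distribution the model sees during fine-tuning differs from what it saw during pre-training

Example:

# A sentence seen during pre-training
"The [MASK] sat on the [MASK]"

# A sentence seen during fine-tuning or inference
"The cat sat on the mat"  # no [MASK]

Practical impact:

BERT's own 80/10/10 rule is meant to reduce this mismatch, and ELECTRA avoids [MASK] inputs altogether by detecting replaced tokens
But a distribution gap remains for standard MLM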

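Fixed maximum length

❌ BERT handles at most 512 tokens (roughly 400 English words):

Cause: the position embeddings have a fixed size (learned embeddings)
Consequences:
- Long documents must be split (and may lose context)
- Poorly suited to long texts such as books and papers
- Extra sliding-window or hierarchical methods are needed

Comparison:

Longformer: 4096 tokens (sparse attention)
BigBird: 4096 tokens (random + sliding-window + global attention)
Still far short of modern Decoder-Only models (Llama 3.1: 128K tokens)

Training signal limits of MLM

❌ Training efficiency:

MLM computes the loss on only 15% of the tokens (the masked ones)
A Decoder-Only model computes the loss on every token
Lower data utilization: from the same amount of data, BERT gets less training signal

Calculation:

# BERT MLM
seq_len = 512
masked_tokens = seq_len * 0.15  # 76.8, so about 77 tokens contribute to the loss

# GPT causal LM
all_tokens = seq_len  # about 512 tokens contribute to the loss
ratio = all_tokens / masked_tokens  # about 6.67 times as many predictions per sequence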

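Downstream tasks need fine-tuning

❌ Little zero-shot / few-shot ability:

BERT generally has to be fine-tuned for each new task
It cannot do in-context learning through prompting
Consequences:
- Every task needs its own trained model
- Labeled data is required
- Deployment is more complex (multiple models)

Comparison:

from transformers import AutoModelForSequenceClassification, pipeline

# BERT for sentiment analysis: a classification head has to be fine-tuned
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# ... fine-tune on labeled data here (for example with transformers.Trainer)

# GPT-style model for sentiment analysis: prompt it directly
# ("gpt2" is only a small placeholder; a large instruction-tuned model would answer "Positive")
generator = pipeline("text-generation", model="gpt2")
prompt = "Sentiment of 'I love this movie':"
output = generator(prompt, max_new_tokens=3)

Single-vector sentence representation

❌ Limits of the [CLS] token:

BERT uses the embedding of the [CLS] token to represent the whole sentence
Problem: a single vector struggles to capture complex meaning
Consequences:
- Information from long sentences may be lost
- Different tasks may need different pooling strategies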
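Wasted computation

❌ Waste at inference time:

Even if you care about only a few tokens, attention is computed over the whole sequence
Example: NER only needs to identify entities, yet BERT processes the entire sentence

Poor fit for long-tail distributions

❌ Bias from random masking:

High-frequency words are masked often (and learned well)
Low-frequency words and proper nouns are masked rarely (and learned poorly)
Consequence: weaker understanding of domain-specific vocabulary

Why did BERT fall out of favor?

Taking these limits together, BERT's problems are:

❌ Narrow scope: understanding only, no generation

❌ Inflexible: needs fine-tuning, no in-context learning

❌ Low training efficiency: only 15% of tokens contribute to the loss

❌ Hard length limit: at most 512 tokens

A single Decoder-Only architecture addresses most of these problems, which is why it won out.

Representative models

BERT family: BERT-base, BERT-large, RoBERTa
ALBERT: a BERT variant with parameter sharing
DeBERTa: Microsoft's improved version (disentangled attention)
Domain-specific: BioBERT (biomedical), SciBERT (scientific)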

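Architecture 3: Encoder-Decoder (T5 family)

Architecture

Encoder-Decoder combines the two:

Encoder: bidirectional attention, to understand the input
Decoder: causal attention over its own output plus cross-attention to the encoder, to generate the output

Everyday analogy: a translation exam

Imagine a Chinese-to-English translation exam:

Step 1 (Encoder): read the whole Chinese sentence and understand it fully
- Chinese: 「今日天氣好晴朗」
- You read all of it, using context on both sides, to grasp the overall meaning
Step 2 (Decoder): then produce the English translation word by word
- Generate "Today" → then "the" → then "weather" → ...
- For each word, you look back at the Chinese original (cross-attention)
- And you see only the English you have already written (causal attention)

That is Encoder-Decoder: the encoder understands the input first, then the decoder generates the output step by step.

Cross-attention:

# Every decoder position can attend to all encoder outputs
# Query: from the decoder hidden states
# Key, Value: from the encoder hidden states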
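
A minimal single-head sketch of cross-attention follows, with queries taken from the decoder states and keys and values taken from the encoder states. To stay short it omits the learned projection matrices that a real layer applies to queries, keys, and values (a deliberate simplification), and it uses made-up two-dimensional vectors in plain Python lists.

```python
import math


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Scaled dot-product attention: decoder queries over encoder keys/values."""
    d = len(encoder_states[0])
    outputs, all_weights = [], []
    for query in decoder_states:                   # one query per decoder position
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in encoder_states              # keys come from the encoder
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * value[i] for w, value in zip(weights, encoder_states))
            for i in range(d)                      # values also come from the encoder
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens
decoder_states = [[2.0, 0.0], [0.0, 2.0]]              # 2 target tokens so far
outputs, weights = cross_attention(decoder_states, encoder_states)
print(weights)
```

The weight matrix has one row per decoder position and one column per encoder position, which is where the O(n_decoder × n_encoder) cost of cross-attention comes from.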

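Training objective: Span Corruption

T5 is trained with span corruption, which combines fill-in-the-blank with sequential generation:

In simple terms

Imagine an advanced fill-in-the-blank question: "Today ___ sunny"

It works in two stages:

Encoder (understand the question): read the whole sentence and take in the context on both sides ("Today", "sunny")

Decoder (generate the answer): produce the answer one token at a time → "the" → "weather" → "is"

The difference from BERT: BERT fills in "the weather is" in one go, while T5 generates it token by token.

The difference from GPT: GPT cannot see the "sunny" that comes after the blank, but T5's encoder can.

That is how an Encoder-Decoder model is trained.

Training process:

# Original text
"Thank you for inviting me to your party last week"

# Mask contiguous spans (marked with the sentinel tokens <X>, <Y>, <Z>)
# Input (Encoder):  "Thank you <X> me to your party <Y> week"
# Target (Decoder): "<X> for inviting <Y> last <Z>"

# The encoder sees the corrupted input
# The decoder has to generate the masked spans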
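
The example above can be reproduced with a small sketch. It assumes the spans to corrupt are given explicitly as (start, end) word indices instead of being sampled at random, and it supports at most two spans because it uses the three sentinels <X>, <Y>, <Z> from the text; the function name `span_corrupt` is hypothetical.

```python
SENTINELS = ["<X>", "<Y>", "<Z>"]


def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs, end exclusive."""
    encoder_input, decoder_target = [], []
    cursor = 0
    for sentinel, (start, end) in zip(SENTINELS, spans):
        encoder_input.extend(tokens[cursor:start])  # keep the uncorrupted text
        encoder_input.append(sentinel)              # one sentinel replaces the span
        decoder_target.append(sentinel)
        decoder_target.extend(tokens[start:end])    # the decoder must produce the span
        cursor = end
    encoder_input.extend(tokens[cursor:])
    decoder_target.append(SENTINELS[len(spans)])    # closing sentinel ends the target
    return " ".join(encoder_input), " ".join(decoder_target)


tokens = "Thank you for inviting me to your party last week".split()
source, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(source)
print(target)
```

Because each span collapses to a single sentinel, the encoder input gets shorter and the decoder only has to produce the missing pieces, not the whole sentence.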

Unified format: text-to-text

T5 casts every task as text-to-text:

# Translation
# Input:  "translate English to German: That is good."
# Output: "Das ist gut."

# Classification
# Input:  "cola sentence: The course is jumping well."
# Output: "not acceptable"

# Summarization
# Input:  "summarize: [long article]"
# Output: "summary text"

Advantages

✅ Most flexible: encoder understanding plus decoder generation, the best of both

✅ Good for Seq2Seq: performs well on translation, summarization, QA, and similar tasks

✅ Unified framework: T5 casts every task as text-to-text

✅ Cross-attention: the decoder can make full use of the encoder's information

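Drawbacks and technical limits

High architectural complexity

❌ Two separate stacks:

Encoder: N layers × (Self-Attention + FFN)
Decoder: N layers × (Self-Attention + Cross-Attention + FFN)
Consequences:
- Close to twice the parameters of a Decoder-Only stack of the same depth and width
- Longer training time
- Slower inference

Parameter comparison:

# T5-Base (Encoder-Decoder)
# Encoder: 12 layers × 12 heads × 768 dim
# Decoder: 12 layers × 12 heads × 768 dim, plus cross-attention in every layer
# Total: ~220M params

# GPT-2 Medium (Decoder-Only)
# 24 layers × 16 heads × 1024 dim = ~345M params
# More parameters, but a single, simpler stack

The cross-attention bottleneck

❌ Extra compute:

Every decoder layer performs cross-attention
Query from the decoder, Key/Value from the encoder
Complexity: O(n_decoder × n_encoder)

Practical impact:

# Assume encoder input = 512 tokens, decoder output = 128 tokens
# Cross-attention computes 512 × 128 = 65,536 attention scores per head
# This is repeated in every layer (for example 32 layers), which adds up

Lower training efficiency

❌ Two-stage forward pass:

Step 1: the encoder processes the input
Step 2: the decoder produces the output (in parallel with teacher forcing during training, token by token at inference)
Consequence: training can take noticeably longer than for a Decoder-Only model of similar size (a rough estimate is 1.5-2×, not a measured benchmark)

❌ Reliance on teacher forcing:

Training feeds the ground truth to the decoder as input
Inference feeds it its own generated tokens
Exposure bias applies here as well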

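Weaker scaling behavior

❌ Large-scale training is less cost-effective:

The common observation is that, for the same compute budget:

Decoder-Only: performance improves clearly
Encoder-Decoder: performance improves more slowly

Illustrative figures (no benchmark or source is given for them):

Model size: 1B → 10B → 100B

Decoder-Only:
1B: BLEU 30 → 10B: BLEU 40 → 100B: BLEU 48 (keeps improving)

Encoder-Decoder:
1B: BLEU 32 → 10B: BLEU 38 → 100B: BLEU 42 (gains level off)

Possible reasons:

Decoder-Only makes more effective use of its parameters
A uniform architecture makes scaling behavior more predictable
In-context learning improves markedly with scale

Slow inference

❌ Two bottlenecks:

Encoder forward pass: must process the full input
Decoder autoregressive generation: one token at a time

Speed comparison (assumed timings, for intuition only):

# Generating 100 tokens

# Decoder-Only:
# - Forward passes: 100 (decoder only)
# - Time: ~1 second (assumed)

# Encoder-Decoder:
# - Encoder forward: 1 pass (processes the input)
# - Decoder forward: 100 passes (generates the output)
# - Cross-attention: runs in every decoder step (the encoder keys and values are computed once and reused)
# - Time: ~1.8 seconds (assumed, about 80% slower)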

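Higher memory footprint

❌ Extra KV cache:

The encoder runs once, and its output is projected into cross-attention keys and values that stay cached for the whole generation
The decoder caches:
- Self-attention: K, V (grows with each generated token)
- Cross-attention: K, V (computed from the encoder output, fixed size)
Overall: cache memory can exceed that of a comparable Decoder-Only model, depending on input and output lengths

Practical impact:

# T5-Large (770M params), FP16
# Batch size = 8, seq_len = 512, d_model = 1024, 24 layers
# KV cache memory:
# Decoder KV (self): 8 × 512 × 1024 × 2 (K and V) × 24 layers × 2 bytes ≈ 403 MB
# Decoder KV (cross): same size for a 512-token input ≈ 403 MB
# Total: ~0.8 GB (for the KV cache alone)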
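
The cache size follows directly from the shapes, so it is easy to check. The sketch below evaluates the product batch × seq_len × d_model × 2 (K and V) × layers × bytes for the T5-Large settings quoted above, assuming FP16 (2 bytes per value); under those assumptions each 512-token cache comes to about 403 MB. The function name is hypothetical.

```python
def kv_cache_bytes(batch_size, seq_len, d_model, n_layers, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer."""
    keys_and_values = 2  # one tensor for K, one for V
    return batch_size * seq_len * d_model * keys_and_values * n_layers * bytes_per_value


# T5-Large-like settings from the text: batch 8, 512 tokens, d_model 1024, 24 layers.
self_kv = kv_cache_bytes(batch_size=8, seq_len=512, d_model=1024, n_layers=24)
cross_kv = kv_cache_bytes(batch_size=8, seq_len=512, d_model=1024, n_layers=24)

print(self_kv / 1e6, "MB for decoder self-attention")
print(cross_kv / 1e6, "MB for cross-attention")
print((self_kv + cross_kv) / 1e9, "GB in total")
```

The cache grows linearly with batch size and sequence length, so long inputs and large batches, rather than the architecture alone, are what push memory use up.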

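Weaker in-context learning

❌ Prompting works less well:

Encoder-Decoder models are designed around an input → output mapping
They are less suited to few-shot prompting

Example:

# GPT can do few-shot
prompt = "Translate: Hello→Bonjour, Hi→Salut, Good→"
# GPT would output "Bon"

# The original T5 was not trained for this
# It generally needs fine-tuning on explicit input-output pairs

Generation quality problems

❌ Length bias:

Encoder-Decoder models tend to produce shorter outputs
Cause: the training loss is per-token, so the model learns to "finish early"
Consequence: summaries and translations may be oversimplified

❌ Attention dilution:

Cross-attention has to attend over the entire encoder output
With long inputs, the attention is spread thin
Consequence: important information may be missed

Higher deployment complexity

❌ Two sets of parameters to manage:

Encoder weights
Decoder weights
Different optimization strategies

❌ Distributed inference is harder:

The encoder and decoder are awkward to deploy separately
Unlike Decoder-Only, where pipeline parallelism is straightforward

Shrinking community support

❌ Almost no new Encoder-Decoder models from 2024 onward:

New models on Hugging Face: the large majority are Decoder-Only (an informal estimate is 90%+)
Open-source tooling and optimization techniques: mainly target Decoder-Only
Consequences:
- Pre-trained models are harder to find
- Little support for state-of-the-art techniques
- Community discussion and tutorials are scarce

Why did Encoder-Decoder fall out of favor?

In short, the problems with Encoder-Decoder are:

❌ Complex without a clear payoff: two stacks, yet results that trail Decoder-Only

❌ High training cost: slow and memory-hungry

❌ Weaker scaling: large-scale training is less cost-effective

❌ Weak prompting: little in-context learning

Decoder-Only is simpler, faster, and stronger, so much of the industry has moved away from Encoder-Decoder.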

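Representative models

T5 family: T5-small, T5-base, T5-11B, Flan-T5
BART: Facebook's Encoder-Decoder (trained as a denoising autoencoder)
mT5: Multilingual T5
UL2: Unified Language Learner (mixes several training objectives)

BART vs BERT: why is one Encoder-Decoder and the other Encoder-Only?

BART and BERT are easy to confuse because the names are so similar, but they are quite different: BERT is Encoder-Only, while BART is Encoder-Decoder.

BART is also trained differently from BERT:

# BERT: randomly mask tokens
# Input:  "The [MASK] sat on the [MASK]"
# Target: predict "cat" and "mat"

# BART: corrupt the text, then reconstruct it
# Input (Encoder):  "The cat sat on" (part of the text deleted)
# Target (Decoder): "The cat sat on the mat" (reconstruct the full sentence)

# BART supports several kinds of corruption:
# - Token Masking (as in BERT)
# - Token Deletion (remove tokens)
# - Text Infilling (fill in contiguous spans)
# - Sentence Permutation (shuffle sentence order)
# - Document Rotation (rotate the starting point of the document)

Why does BART use Encoder-Decoder?

BART has to handle generation tasks, so it needs a decoder. BERT only handles understanding, so an encoder is enough.

Everyday analogy:

BERT: like reading comprehension, reading the whole passage and answering questions
BART: like restoring a document, repairing a damaged text back to its original form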

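Comparing the three architectures

An everyday summary

The three architectures are like three ways of writing

Decoder-Only (GPT) = a story relay

Encoder-Only (BERT) = reading comprehension

Encoder-Decoder (T5) = a translation exam

Technical comparison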

Why did Decoder-Only become the mainstream?

Over the past few years the industry has clearly moved toward Decoder-Only. Why?

1. Unified framework

Decoder-Only handles every task with the same architecture:

# `model.generate` stands for any text-generation call

# Generation
prompt = "Write a poem about the moon:"
output = model.generate(prompt)

# Understanding (through prompting)
prompt = "Classify sentiment: I love this movie! Sentiment:"
output = model.generate(prompt)  # "Positive"

# Reasoning
prompt = "Q: 2+2=? A:"
output = model.generate(prompt)  # "4"

By contrast, BERT can only understand and cannot generate, and T5 can do both but with a more complex architecture.

2. In-Context Learning

Decoder-Only models naturally support in-context learning (ICL): given a few examples, they can perform a new task without fine-tuning:

# Few-shot learning
prompt = """
Translate to French:
Hello -> Bonjour
Good morning -> Bonjour
Thank you -> Merci
Goodbye -> 
"""
# The model would output "Au revoir"

BERT cannot do this; it needs fine-tuning before it can handle a new task.
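
A few-shot prompt like the one above is just a formatted string, so it can be assembled from a list of examples. The helper below is a minimal sketch with a hypothetical name; the "source -> target" layout simply mirrors the example in the text and is not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format an instruction, worked examples, and an open query as one prompt."""
    lines = [instruction]
    lines += [f"{source} -> {target}" for source, target in examples]
    lines.append(f"{query} ->")  # left open for the model to complete
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate to French:",
    [("Hello", "Bonjour"), ("Thank you", "Merci")],
    "Goodbye",
)
print(prompt)
```

Switching to a different task only means changing the instruction and the examples; the model weights stay untouched, which is the practical appeal of in-context learning.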

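3. Best scaling behavior

Decoder-Only models show the most favorable scaling laws:

Model size ↑ → Performance ↑ (loss falls roughly as a power law, a near-straight line on a log-log plot)
Emergent abilities appear only at large scale
The gains are the most pronounced

BERT-style models have generally not matched GPT-style models when scaled up.

4. Simpler training and deployment

Training:

Decoder-Only: plain next-token prediction
Encoder-Only: needs a masking strategy
Encoder-Decoder: needs designed input-output pairs

Deployment:

Decoder-Only: a single model
Encoder-Decoder: two stacks, with higher memory and compute requirements

5. Prompting > Fine-tuning

The current trend is to prompt rather than fine-tune:

Prompting needs a generative model (Decoder-Only)
Instruction tuning (InstructGPT, Llama 2-Chat) makes models easier to use
Few-shot prompting of a large model can match or beat a fine-tuned BERT on many tasks

Key takeaways

Decoder-Only did not win because it is technically superior in every respect (BERT's bidirectional attention is arguably stronger for pure understanding), but because of:

Practicality: one model does everything

Scaling: it benefits most from large-scale training

Flexibility: prompting is more flexible than fine-tuning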

How does Decoder-Only handle understanding tasks?

A natural question: if Decoder-Only has only causal attention, how does it handle understanding tasks?

One answer: Prefix LM (non-causal prefix)

# Split the input into two parts:
# 1. Prefix (bidirectional attention)
# 2. Target (causal attention)

Input: "Translate to French: Hello"
       [-------Prefix-------]  [Target]
       Bidirectional ←→      Causal →

# Prefix tokens can attend to each other
# Target tokens can attend only to the prefix and to earlier target tokens
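
The prefix-LM attention pattern can be written as a small variation on the causal mask: positions inside the prefix are visible to every token, and everything after the prefix stays causal. This is a minimal sketch using nested lists, with a hypothetical function name.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] == 1 means position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_prefix = j < prefix_len  # prefix tokens are visible to everyone
            causal = j <= i             # otherwise only earlier positions are visible
            row.append(1 if in_prefix or causal else 0)
        mask.append(row)
    return mask


# Two prefix tokens followed by two target tokens.
for row in prefix_lm_mask(seq_len=4, prefix_len=2):
    print(row)
```

With `prefix_len=0` the function reduces to the ordinary causal mask, which shows that the prefix LM is a generalization of causal attention, not a separate architecture.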

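Most mainstream Decoder-Only models are trained with a purely causal objective and handle understanding through prompting. Some training recipes mix the two:

Part of the data uses causal LM (pure generation)
Part of the data uses prefix LM (understanding plus generation)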

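Practical use cases

When to use Decoder-Only

✅ Decoder-Only is recommended for:

New projects: use open models such as Llama or Mistral directly
Generation tasks: writing, translation, dialogue, code generation
In-context learning: few-shot prompting
Unified multi-task setups: one model for many tasks
Long-text processing: modern Decoder-Only models support very long contexts (128K+)

Typical scenarios:

ChatGPT-style conversational systems
Code assistants (GitHub Copilot)
Content creation (articles, scripts, poetry)
General-purpose AI agents

When to use Encoder-Only

✅ Encoder-Only is recommended for:

Pure classification: sentiment analysis, spam detection
High-quality embeddings: semantic search, clustering
Limited compute: BERT-base has only 110M parameters
No generation needed: pure understanding tasks

Typical scenarios:

Text classification (news categorization, sentiment analysis)
Named Entity Recognition (NER)
Question Answering (extractive, not generative)
Semantic search (similarity search over embeddings)

⚠️ Note: many embedding models have now moved to Decoder-Only backbones (for example Mistral Embed), so the advantage of Encoder-Only keeps shrinking.

When to use Encoder-Decoder

✅ Encoder-Decoder is recommended for:

Seq2Seq tasks (though Decoder-Only works too)
Cases where a fine-tuned T5/BART model already exists
Specific research projects

Typical scenarios:

Machine translation (though GPT-4 does it better)
Text summarization (though Llama works too)
Data-to-text generation

⚠️ Note: Encoder-Decoder is gradually being phased out and is not recommended for new projects.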

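Quick decision guide: how to choose an architecture

The golden rule for 2026:

Not sure what to pick? Default to Decoder-Only.

Unless you:

✅ Only do classification and have very limited resources → then use BERT

✅ Maintain an older project that already uses T5/BART → then keep using Encoder-Decoder

In every other case, use Decoder-Only (Llama 3, Mistral, Qwen).

Practical advice

Recommended strategy for 2026:

🎯 Default choice: Decoder-Only (Llama 3, Mistral, Qwen)

🎯 Pure classification / embeddings: Encoder-Only (BERT, DeBERTa) is worth considering, though a fine-tuned Decoder-Only model often does better

🎯 Avoid: Encoder-Decoder (unless you have a specific reason)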

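Summary

After this detailed comparison, we can draw the following conclusions:

Each architecture has its own strengths

1. Decoder-Only (GPT/Llama)

✅ Unified framework that handles every task
✅ Best scaling behavior
✅ Strong in-context learning
🎯 The current mainstream and the first choice for new projects

2. Encoder-Only (BERT)

✅ Bidirectional understanding, high-quality embeddings
✅ Few parameters, fast to train
⚠️ Understanding only, no generation
📉 Gradually being replaced by Decoder-Only

3. Encoder-Decoder (T5/BART)

✅ Performs well on Seq2Seq tasks
✅ Strong cross-attention mechanism
❌ Complex architecture, many parameters
📉 Largely phased out

Training objectives at a glance

Decoder-Only: Causal LM (predict the next token)
Encoder-Only: Masked LM (predict the masked tokens)
Encoder-Decoder: Span Corruption (regenerate the masked spans)

Final recommendations

🎯 LLM development guide for 2026:

Use Decoder-Only for new projects (Llama 3, Mistral, Qwen)
For pure classification, consider a fine-tuned Decoder-Only model or BERT
Avoid training from scratch; use an existing open model plus fine-tuning or prompting
Encoder-Decoder can be ignored (unless you maintain an older project)

Remember: there is no absolute right or wrong in choosing an architecture. What matters most is that it fits your use case and resource limits. But the trend is clear: Decoder-Only will keep leading LLM development over the next few years.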

TL;DR

核心重點：

🎯 三大架構：Decoder-Only（單向生成）、Encoder-Only（雙向理解）、Encoder-Decoder（理解+生成）
🔄 訓練方式：Causal LM (GPT)、Masked LM (BERT)、Span Corruption (T5)
✅ 趨勢：Decoder-Only 已成主流（GPT-4, Llama, DeepSeek），因為可以同時做理解同生成
⚠️ 誤解：Decoder-Only 唔係淨係識生成，加上 bidirectional attention 一樣可以做理解任務
📊 選擇：新 project 建議直接用 Decoder-Only，除非你只做 embedding/classification

Pre-training 基礎概念
歷史發展時間線
架構一：Decoder-Only（GPT 系列）
架構二：Encoder-Only（BERT 系列）
架構三：Encoder-Decoder（T5 系列）
三大架構比較
點解 Decoder-Only 成為主流？
實際應用場景
總結
相關資源

Pre-training 基礎概念

咩係 Pre-training？

Pre-training（預訓練） 係指用大量無標註文本訓練模型，令佢學會語言嘅基本規律，之後再透過 fine-tuning 或者 prompting 適應特定任務。

Pre-training 嘅核心目標：

學習語言嘅統計規律（文法、語義、常識）
建立豐富嘅詞彙同概念表徵
為下游任務提供強大嘅起點

Self-Supervised Learning

Pre-training 用嘅係 self-supervised learning，即係從數據本身產生訓練信號，唔需要人手標註。

用日常例子理解三種訓練方式：

三種方法嘅核心分別：

Decoder-Only (GPT)：順序生成，一個字接一個字
Encoder-Only (BERT)：睇晒成句，填返空格（可以睇前後文）
Encoder-Decoder (T5)：睇晒成句理解，然後順序生成答案

核心概念

Pre-training 嘅威力喺於規模效應：用數以千億計嘅 tokens 訓練出嚟嘅模型，會產生 emergent abilities（湧現能力），即係一啲冇明確教過佢嘅技能，例如 in-context learning、chain-of-thought reasoning 等。

Token Representations：點樣將文字變成向量？

Token Embeddings（詞嵌入）

Token Embedding 係將每個 token（字或詞）轉換成一個固定維度嘅向量。

生活例子：詞典查表

想像你有本特殊嘅詞典，每個字都對應一組數字：

「今日」→ [0.2, -0.5, 0.8, ...] (d_model 維度)
「天氣」→ [-0.1, 0.3, -0.2, ...]
「好」→ [0.5, 0.1, 0.4, ...]

呢啲數字係透過訓練學習出嚟，相似意思嘅字會有相似嘅向量。

實作：

pythonimport torch
import torch.nn as nn

# 假設詞彙表有 50,000 個 tokens，每個 token 嘅 embedding 維度係 768
vocab_size = 50000
d_model = 768

# Token Embedding Layer（其實就係個 lookup table）
token_embedding = nn.Embedding(vocab_size, d_model)

# 輸入：token IDs
input_ids = torch.tensor([1234, 5678, 9012])  # 對應 ["今日", "天氣", "好"]

# 輸出：embeddings
embedded = token_embedding(input_ids)
print(embedded.shape)  # torch.Size([3, 768])

特性：

📏 固定維度：無論 token 長短，output 維度都係 d_model
🎓 可學習：Embedding matrix 係模型參數之一，會隨訓練更新
🔢 巨大矩陣：vocab_size × d_model（例如 50K × 768 = 3840 萬個參數）

Segment Embeddings / Token Type Embeddings（句子分段嵌入）

Segment Embedding 用嚟區分唔同句子或片段，主要用於 BERT 同某啲 Encoder-Decoder 模型。

生活例子：顏色標籤

想像你要比較兩句說話：

Sentence A：「今日天氣好晴朗」→ 貼藍色標籤（Segment ID = 0）
Sentence B：「我哋去行山啦」→ 貼紅色標籤（Segment ID = 1）

模型透過呢啲標籤知道邊啲 tokens 屬於邊句。

BERT 嘅應用場景：

python# BERT Input Format
input_text = "[CLS] Sentence A [SEP] Sentence B [SEP]"

# Token IDs
token_ids =     [101, 1234, 5678, 102, 9012, 3456, 102]
# Token Type IDs (Segment IDs)
segment_ids =   [  0,    0,    0,   0,    1,    1,   1]
#                [CLS]   A     A  [SEP]   B     B  [SEP]

# Segment Embedding
segment_embedding = nn.Embedding(2, d_model)  # 只有兩種 segment：0 同 1
segment_embed = segment_embedding(torch.tensor(segment_ids))

用途：

✅ Next Sentence Prediction (NSP)：BERT 訓練任務
✅ Question Answering：區分 question 同 context
✅ 多輪對話：區分唔同說話者

⚠️ 重要：

GPT 系列（Decoder-Only） 通常唔使 Segment Embeddings
T5 都冇用 Segment Embeddings（用 text prefix 取代）

Positional Encodings（位置編碼）

Positional Encoding 係 Transformer 最關鍵嘅創新之一！因為 Self-Attention 本身冇順序概念，所以需要 positional encoding 嚟告訴模型每個 token 嘅位置。

生活例子：座位號碼

想像一班學生企成一行：

冇座位號：老師只知道「有邊啲學生」，但唔知佢哋企喺邊
有座位號：老師知道「Alice 企第 1 位，Bob 企第 2 位」

Transformer 需要座位號（位置資訊）先識得處理文字順序。

兩種主要方法：

用數學公式計算位置向量，唔使訓練：

pythonimport numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """
    max_len: 最大序列長度
    d_model: embedding 維度
    """
    position = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # 偶數維度用 sin
    pe[:, 1::2] = np.cos(position * div_term)  # 奇數維度用 cos
    
    return torch.FloatTensor(pe)

# 例子
pe = sinusoidal_positional_encoding(max_len=512, d_model=768)
print(pe.shape)  # torch.Size([512, 768])

數學公式：

\text{PE}(\text{pos}, 2i) = \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)

\text{PE}(\text{pos}, 2i+1) = \cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)

優點：

✅ 唔使訓練，節省參數
✅ 可以處理任意長度序列（理論上）
✅ 有數學規律，相對位置有明確關係

缺點：

❌ 冇彈性，固定公式
❌ 實際上對超長序列效果一般

將位置當成可學習嘅參數，同 token embedding 一樣：

python# Learned Positional Embedding
max_position = 512  # BERT 嘅 max sequence length
position_embedding = nn.Embedding(max_position, d_model)

# 輸入：位置 IDs
position_ids = torch.arange(0, seq_len)  # [0, 1, 2, 3, ...]

# 輸出：位置向量
pos_embed = position_embedding(position_ids)

優點：

✅ 更靈活，模型自己學習最佳表示
✅ 實際效果通常更好

缺點：

❌ 固定最大長度（例如 BERT 最多 512 tokens）
❌ 需要額外參數（max_len × d_model）

唔係記錄絕對位置（第 1、2、3 個），而係記錄相對距離（「相隔 2 個位置」）。

例子：

python# 相對位置：token i 同 token j 之間嘅距離
relative_distance = i - j

# 例如：
# token 5 attend 到 token 2 → 相對位置 = 5 - 2 = 3
# token 2 attend 到 token 5 → 相對位置 = 2 - 5 = -3

優點：

✅ 泛化能力更強（訓練時見過相對距離，推理時可以應用到更長序列）
✅ 更符合語言學直覺

組合三者：最終 Input Representation

所有架構都會將三種 embeddings 相加作為最終輸入：

```python
import torch
import torch.nn as nn

class TransformerInputEmbedding(nn.Module):
    def __init__(self, vocab_size, max_len, d_model):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.position_embed = nn.Embedding(max_len, d_model)
        # Segment embedding (used by BERT only)
        self.segment_embed = nn.Embedding(2, d_model)

    def forward(self, input_ids, segment_ids=None):
        seq_len = input_ids.size(1)  # input_ids: (batch_size, seq_len)

        # 1. Token Embeddings
        token_embeddings = self.token_embed(input_ids)

        # 2. Position Embeddings, broadcast over the batch dimension
        position_ids = torch.arange(seq_len, device=input_ids.device)
        position_embeddings = self.position_embed(position_ids)

        # 3. Segment Embeddings (optional)
        if segment_ids is not None:
            segment_embeddings = self.segment_embed(segment_ids)
            embeddings = token_embeddings + position_embeddings + segment_embeddings
        else:
            embeddings = token_embeddings + position_embeddings

        return embeddings
```


重點總結

Input = Token Embed + Position Embed + (Segment Embed)

📝 Token Embedding：每個字嘅語義向量

📍 Position Embedding：告訴模型每個字嘅位置

🏷️ Segment Embedding：區分唔同句子（BERT 專用）

所有 embeddings 相加（唔係 concat），維持 d_model 維度。

如果用全人類知識嚟訓練會點？

呢個係好多人都會問嘅問題：點解唔直接用「所有人類知識」嚟訓練 AI？咁佢咪乜都識？

現實情況：

Modern large LLMs (for example GPT-4 and Llama 3) are already trained on a substantial share of publicly available text.

實際數字：

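GPT-3: trained on about 300B tokens (roughly 3 million books, assuming about 100K tokens per book)
Llama 3: trained on about 15T tokens (roughly 150 million books under the same assumption)
All human publications: no reliable figure exists; rough estimates are on the order of 100-200T tokens

So modern LLMs are already trained on a large fraction of publicly available text.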

有趣事實

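OpenAI has not published GPT-4's training data, but large LLM training corpora of this kind are generally reported to include: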

📚 幾乎所有英文維基百科

📖 數以百萬計嘅書籍

💻 大量 GitHub 代碼

📄 數以千萬計嘅學術論文

🌐 大量網頁內容

呢個數據量已經超過任何一個人一生可以閱讀嘅量好多倍！

但係仍然有限制：

唔係真正嘅「所有」知識
- ❌ 未公開嘅研究（企業機密、軍事資料）
- ❌ 個人經驗同隱性知識（點樣踩單車、點樣煮飯）
- ❌ 實時資訊（今日天氣、最新新聞）
- ❌ 多媒體內容（圖片、影片、音樂入面嘅知識）
數據質素問題
- ⚠️ 網上有大量錯誤資訊同假新聞
- ⚠️ 偏見同刻板印象（數據反映人類社會嘅偏見）
- ⚠️ 唔同來源嘅資訊可能矛盾
知識更新問題
- 📅 Knowledge Cutoff：訓練數據有截止日期
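- 📅 For example, the original GPT-4 has a knowledge cutoff of September 2021; GPT-4 Turbo (November 2023 release) moved it to April 2023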
- 📅 之後嘅新知識佢唔會知道
理解 vs 記憶
- 🤔 LLM 係「記住」知識定係「理解」知識？
- 🤔 佢可能會背咗大量事實但唔理解背後原理
- 🤔 所以會出現 hallucination（作故仔）

點解唔訓練更多數據？

真正嘅問題唔係數據量，而係：

Data Quality > Data Quantity
- 高質素嘅 1T tokens 好過低質素嘅 10T tokens
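- Data curation matters: DeepSeek-V3 was still trained on about 14.8T tokens, so its gains come from data quality and training efficiency rather than from using less data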
Reasoning > Memorization
- 現代研究 focus 喺提升推理能力
- Chain-of-Thought, Self-Consistency 等技術
- 唔係單純增加數據量
Alignment > Raw Knowledge
- RLHF (Reinforcement Learning from Human Feedback)
- 確保 AI 嘅輸出符合人類價值觀
- 有用、無害、誠實

重點總結

現代 LLM 已經訓練咗相當於「大部分人類公開知識」，但：

✅ 量夠晒：數萬億 tokens，超過人一生所讀

⚠️ 質參差：網上資訊有真有假

📅 會過時：knowledge cutoff 問題

🤔 識唔等於明：可能背咗但唔理解

未來方向係提升質素同推理能力，唔係單純增加數據量。

歷史發展時間線

點樣由 Transformer 變成三大架構？

好多人會好奇：點解會有三種唔同嘅架構？邊個先出現？以下係 LLM Pre-training 架構嘅發展歷史：


重要里程碑

2017：Transformer 誕生

Google 發表 "Attention Is All You Need"
原始設計：Encoder-Decoder（為機器翻譯設計）
引入 self-attention 機制，取代 RNN

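2018: GPT-1 vs BERT

GPT-1 (June 2018, OpenAI): Decoder-Only, trained with causal language modeling
BERT (October 2018, Google): Encoder-Only, trained with masked language modeling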

2018-2019：BERT 時代

BERT 刷新 11 個 NLP 任務紀錄
所有人都在做 BERT 變體：RoBERTa, ALBERT, DistilBERT, ELECTRA
主流觀點：Encoder-Only 最強，Decoder-Only 只係玩具

2019-2020：GPT-2/GPT-3 逆襲

GPT-2（2019 Feb）：1.5B 參數，展示 zero-shot 能力
T5（2019 Oct）：統一 text-to-text 框架，Encoder-Decoder
GPT-3（2020 May）：175B 參數，few-shot learning 爆發
轉捩點：大家發現 Decoder-Only scaling 效果最好

2021-2023：Decoder-Only 主導

2021: GitHub Copilot (powered by OpenAI Codex, a GPT model fine-tuned on code) demonstrates the commercial value of generative AI
2022：ChatGPT 爆紅，全球關注
2023：Llama 1/2 開源，Decoder-Only 成為開源主流
2023: new Encoder-Only models become rare

2024-2026：Decoder-Only 一統江湖

Llama 3, GPT-4, Mistral, Qwen, DeepSeek 全部係 Decoder-Only
Encoder-Decoder (T5/BART) sees little new development among general-purpose text LLMs
新 project 預設選擇：Decoder-Only

點解 BERT 由主流變成式微？

2018-2019 年：BERT 全面主導

刷新所有 NLP benchmark

論文引用數爆升

人人都做 BERT 變體

2020-2023 年：GPT-3 改變遊戲規則

Few-shot learning 唔使 fine-tune

Scaling laws：model 越大越勁

ChatGPT 證明生成式 AI 嘅價值

2024-2026 年：Decoder-Only 完全勝出

一個 model 做晒所有任務

開源社群全面擁抱（Llama, Mistral）

BERT 只剩低特定場景（embeddings, 分類）

關鍵轉折點

轉折點 1：GPT-3 嘅 scaling laws（2020）

證明 Decoder-Only 隨規模提升效果最好
出現 emergent abilities（chain-of-thought 等）
In-context learning 取代 fine-tuning

轉折點 2：ChatGPT 嘅成功（2022）

向大眾證明生成式 AI 嘅實用性
BERT 只識理解，唔識生成，變成劣勢
對話式 AI 成為新標準

轉折點 3：Llama 開源（2023）

Meta 開源 Llama 系列
證明 Decoder-Only 可以開源複製
社群全面轉向 Decoder-Only

重點總結

時間線總結：

📅 2017：Transformer（Encoder-Decoder）

📅 2018：GPT-1（Decoder-Only）vs BERT（Encoder-Only）

📅 2019：T5（Encoder-Decoder）統一框架

📅 2020：GPT-3 證明 Decoder-Only scaling 最好

📅 2022：ChatGPT 引爆生成式 AI

📅 2023-2026：Decoder-Only 一統江湖

演變邏輯：

最初大家以為 bidirectional（BERT）最強

但發現 unidirectional（GPT）scaling 效果更好

加上 prompting 取代 fine-tuning

最終 Decoder-Only 勝出

未來趨勢

2026 年之後可能嘅發展：

Decoder-Only 繼續主導：冇跡象顯示會改變
Hybrid 架構：可能出現新嘅混合設計（例如 Mamba + Attention）
Encoder-Only 小眾化：只用於特定場景（embeddings, 邊緣裝置）
Encoder-Decoder 消失：除咗維護舊 project，基本唔會再用

架構一：Decoder-Only（GPT 系列）

架構設計

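Decoder-Only is the simplest of the three architectures: a stack of Transformer decoder blocks (masked self-attention plus a feed-forward network, with no cross-attention), in which each token can attend only to itself and the tokens before it (causal / unidirectional attention).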

生活例子：寫故事接龍

想像你同朋友玩緊故事接龍，每個人要根據前面嘅內容寫落去：

朋友 A：「從前有個勇敢嘅騎士」
朋友 B：「佢住喺一座城堡」（只睇到前面嘅「從前有個勇敢嘅騎士」）
朋友 C：「每日都出去冒險」（只睇到前面全部，但睇唔到後面會點）

呢個就係 Decoder-Only！你只能根據已經寫咗嘅內容，順序寫落去。


Causal Attention Mask：

```python
# Each position can see only itself and earlier tokens
[1, 0, 0, 0]  # Token 0 sees only itself
[1, 1, 0, 0]  # Token 1 sees 0, 1
[1, 1, 1, 0]  # Token 2 sees 0, 1, 2
[1, 1, 1, 1]  # Token 3 sees 0, 1, 2, 3
```
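In practice the mask is applied to the raw attention scores before the softmax, so that masked (future) positions receive zero weight. A minimal NumPy sketch of that step follows; the helper name `causal_attention_weights` is illustrative, and real implementations do the same thing per head and per batch element.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores (seq_len, seq_len), then softmax each row."""
    seq_len = scores.shape[0]
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # lower triangle = allowed
    masked = np.where(mask, scores, -np.inf)                 # future positions get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores make the effect of the mask easy to see
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

With uniform scores, token 0 puts all its weight on itself, token 1 splits it evenly over tokens 0 and 1, and so on, which mirrors the triangular pattern above.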

訓練方式：Causal Language Modeling (CLM)

Decoder-Only 用 next-token prediction 嚟訓練，好似順住寫落去：

簡單理解

想像你睇緊半句：「今日天氣好」，你要估下一個字。

你只能睇前面嘅字（「今日天氣好」）

你唔能偷睇後面會寫咩

你要逐個字順序生成：「今日天氣好」→「晴朗」

呢個就係 Decoder-Only 嘅訓練方式！

訓練過程：

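```python
# Input text: "The cat sat on the mat"
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Training input and target are the same sequence, offset by one token
inputs = tokens[:-1]   # [The, cat, sat, on, the]
targets = tokens[1:]   # [cat, sat, on, the, mat]

# Every position predicts the next token from everything before it
for pos, target in enumerate(targets):
    context = " ".join(tokens[: pos + 1])
    print(f"Position {pos}: {context!r} -> predict {target!r}")
```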

Loss Function：

```python
import torch.nn.functional as F

# Cross-entropy loss on next token
def causal_lm_loss(logits, targets):
    """
    logits: (batch_size, seq_len, vocab_size)
    targets: (batch_size, seq_len), the same token IDs that were fed as input
    """
    vocab_size = logits.size(-1)
    # Shift so that position t is scored against token t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_targets = targets[:, 1:].contiguous()

    # Compute loss
    loss = F.cross_entropy(
        shift_logits.view(-1, vocab_size),
        shift_targets.view(-1)
    )
    return loss
```
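A useful sanity check for this loss: a model that outputs uniform logits should score exactly ln(vocab_size), which corresponds to a perplexity equal to the vocabulary size. The NumPy sketch below re-implements the same shift-then-cross-entropy logic for a single sequence; the function name `next_token_loss` and the toy token IDs are illustrative.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: (seq_len, vocab_size), token_ids: (seq_len,)
    """
    shift_logits = logits[:-1]      # the last position has nothing to predict
    shift_targets = token_ids[1:]   # the first token is never a target
    # log-softmax over the vocabulary
    shifted = shift_logits - shift_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(shift_targets)), shift_targets]
    return -picked.mean()

vocab_size = 50
token_ids = np.array([3, 17, 42, 8, 25])
uniform_logits = np.zeros((len(token_ids), vocab_size))

loss = next_token_loss(uniform_logits, token_ids)
print(np.isclose(loss, np.log(vocab_size)))  # uniform guessing costs ln(vocab_size)
print(np.exp(loss))                          # perplexity of the uniform model
```

An untrained model should start near this value; a loss far above it usually points to a bug such as a missing shift.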

優點

✅ 架構簡單：只需要 decoder，易於實作同訓練

✅ 生成能力強：天生就係為生成任務設計

✅ 統一框架：可以用同一個模型做晒所有任務（理解、生成、推理）

✅ Scaling 效果好：規模越大，emergent abilities 越強

✅ In-context learning：可以透過 prompting 做 few-shot learning

缺點與技術限制

雖然 Decoder-Only 已成主流，但佢仍然有唔少限制：

Sequential Generation Bottleneck（順序生成瓶頸）

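❌ Autoregressive generation: at inference time tokens must be produced one at a time, so generation cannot be parallelized across positions (training is still parallel, because all target tokens are known in advance)

```python
# Decoder-Only generation process
# t=0: generate token 0
# t=1: generate token 1 (must wait for token 0)
# t=2: generate token 2 (must wait for tokens 0 and 1)
# ... and so on
```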

⏱️ 推理速度慢：生成 100 個 tokens 需要 100 次 forward pass
💰 成本高：每個 token 都要運行成個模型
📉 吞吐量低：batch size 受限於 GPU 記憶體
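The one-forward-pass-per-token pattern can be seen in a toy greedy decoding loop. The "model" below is a hypothetical bigram lookup table standing in for a real network, so `BIGRAM_SCORES` and `toy_forward` are purely illustrative; only the loop structure is the point.

```python
# Hypothetical bigram "model": maps the last token to scores for the next token.
BIGRAM_SCORES = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
}

def toy_forward(tokens):
    """Stand-in for one forward pass: returns next-token scores given the tokens so far."""
    return BIGRAM_SCORES[tokens[-1]]

def greedy_generate(max_new_tokens=10):
    tokens = ["<s>"]
    forward_passes = 0
    for _ in range(max_new_tokens):
        scores = toy_forward(tokens)  # step t cannot start before step t-1 has finished
        forward_passes += 1
        next_token = max(scores, key=scores.get)
        if next_token == "</s>":
            break
        tokens.append(next_token)
    return tokens[1:], forward_passes

generated, passes = greedy_generate()
print(generated, passes)  # ['the', 'cat', 'sat'] 4
```

Three tokens plus the end-of-sequence decision cost four sequential passes; batching can raise throughput across requests, but it does not shorten this per-request chain.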

現實影響：

GPT-4 生成 1000 字文章可能需要 10-30 秒
長文本生成（10K+ tokens）可能需要幾分鐘

Context Window 限制

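❌ Quadratic cost: attention compute, and the memory of a naively materialized attention matrix, grow as O(n²) with sequence length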

原因：Self-attention 需要計算每對 tokens 之間嘅關係
後果：
- GPT-3.5：4K context (約 3000 字)
- GPT-4：8K-32K context (取決於版本)
- Llama 2：4K context
- Llama 3.1：128K context (但需要大量 GPU)

實際問題：

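```python
# Assume batch_size = 1, FP32, and every attention matrix kept in memory
# (as in naive training without memory-efficient attention kernels)
seq_len = 4096
bytes_per_matrix = seq_len * seq_len * 4       # 67,108,864 bytes, about 67 MB
n_layers, n_heads = 32, 32
total_bytes = bytes_per_matrix * n_layers * n_heads
print(total_bytes / 1e9)                       # about 68.7 GB
```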

單向 Attention 嘅理解限制

❌ Causal Masking：理論上只能睇前文

對於完形填空類任務（需要睇後文）表現較差
例如：「我今日食咗___，所以好飽」（答案係「飯」，但 decoder 睇唔到「好飽」）

解決方法：

Prefix LM：input 部分用 bidirectional attention
Prompting：將後文資訊放入 prompt
實務上影響有限：大部分任務唔需要真正嘅雙向理解

訓練數據量要求高

❌ 需要海量數據：Decoder-Only scaling laws 要求數據量隨模型大小增長

GPT-3 (175B): about 300B training tokens
Llama 3 (70B)：15T tokens
小公司難以負擔：數據收集、清洗、存儲都係巨大成本

生成質量問題

❌ Hallucination（幻覺）：模型會「作故仔」，生成唔真實嘅內容

原因：只根據前文生成下一個 token，冇 fact-checking 機制
後果：可能生成錯誤資訊、虛構事實

❌ Repetition（重複）：容易重複相同內容

原因：Autoregressive 特性 + 訓練數據嘅統計偏差
解決方法：repetition penalty, nucleus sampling
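Nucleus (top-p) sampling, mentioned above, samples only from the smallest set of tokens whose cumulative probability reaches a threshold. A minimal NumPy sketch under that definition follows; `top_p_filter` and the toy distribution are illustrative, and production decoders typically combine this with temperature and a repetition penalty.

```python
import numpy as np

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    order = np.argsort(probs)[::-1]          # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    # number of tokens needed to reach top_p (always at least one)
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
filtered = top_p_filter(probs, top_p=0.9)

rng = np.random.default_rng(0)
next_token = rng.choice(len(probs), p=filtered)  # sample only from the nucleus
print(filtered)
```

Here the two least likely tokens are dropped and the remaining three are renormalized, which removes the low-probability tail while keeping some randomness.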

❌ Exposure Bias（曝光偏差）

問題：訓練時見到嘅係 ground truth，但推理時見到嘅係自己生成嘅 tokens
後果：一旦生成錯誤 token，後續可能連環出錯

計算資源需求極高

❌ 訓練成本：

GPT-3：估計 460 萬美元（一次訓練）
GPT-4：估計數千萬美元
Llama 3 (70B)：需要數千張 H100 GPU，訓練數週

❌ 推理成本：

GPT-4 API: $0.03 / 1K input tokens, $0.06 / 1K output tokens
自建服務：需要多張 A100/H100 GPU

Knowledge Cutoff 問題

❌ 知識過時：模型只知道訓練截止日期之前嘅資訊

GPT-4: September 2021 for the original release (April 2023 for the first GPT-4 Turbo release)
對於實時資訊（新聞、股價、天氣）完全唔識

解決方法：

RAG (Retrieval-Augmented Generation)
外部 API 調用
Fine-tuning（但成本高）

代表模型

GPT 系列：GPT-2, GPT-3, GPT-4
Llama 系列：Llama 2, Llama 3, Llama 3.1
DeepSeek 系列：DeepSeek-V2, DeepSeek-V3
Mistral 系列：Mistral 7B, Mixtral 8x7B
Qwen 系列：Qwen 2.5

架構二：Encoder-Only（BERT 系列）

架構設計

Encoder-Only 只用 Transformer Encoder，每個 token 可以 attend 到所有其他 tokens（bidirectional attention）。

生活例子：做閱讀理解

想像你做緊閱讀理解填充題：

題目：「今日___好晴朗，所以我決定去行山」
你會睇晒成句，包括前面嘅「今日」同後面嘅「好晴朗」「行山」
透過前後文理解，你知道空格應該係「天氣」
你一次過填返答案，唔使逐個字生成

呢個就係 Encoder-Only！佢可以同時睇晒前後文嚟理解，所以特別適合分類、情感分析等需要理解嘅任務。


Bidirectional Attention：

```python
# Each position can see all tokens
[1, 1, 1, 1]  # Token 0 sees everything
[1, 1, 1, 1]  # Token 1 sees everything
[1, 1, 1, 1]  # Token 2 sees everything
[1, 1, 1, 1]  # Token 3 sees everything
```

訓練方式：Masked Language Modeling (MLM)

Encoder-Only 用 masked token prediction 嚟訓練，好似 Fill in the blank（填充題）：

簡單理解

想像你做緊填充題：「今日___好晴朗」，你要估空格係咩字。

你可以睇前後文（「今日」同「好晴朗」）

你一次過填晒所有空格（唔係逐個字生成）

答案顯然係「天氣」

呢個就係 Encoder-Only 嘅訓練方式！因為可以睇前後文，所以理解能力好強。

訓練過程：

```python
# Original text: "The cat sat on the mat"
# Randomly select 15% of tokens for prediction

# Example 1
masked_input = "The [MASK] sat on the mat"
labels = {1: "cat"}  # position -> original token to predict

# Example 2
masked_input = "The cat [MASK] on [MASK] mat"
labels = {2: "sat", 4: "the"}
```

Masking 策略（BERT）：

80%：用 [MASK] token 取代
10%：用隨機 token 取代
10%：保持原 token
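The 80/10/10 rule applies only to the roughly 15% of positions selected for prediction. The NumPy sketch below shows one way to implement it; the function name `bert_mask`, the reserved ID range, and the use of `-100` as the "no loss here" label (a common convention that matches the default `ignore_index` of PyTorch's cross-entropy) are assumptions for illustration.

```python
import numpy as np

def bert_mask(token_ids, vocab_size, mask_id, rng, select_prob=0.15):
    """Return (corrupted_ids, labels); labels are -100 wherever no prediction is required."""
    token_ids = np.asarray(token_ids)
    corrupted = token_ids.copy()
    labels = np.full_like(token_ids, -100)

    selected = rng.random(token_ids.shape) < select_prob  # ~15% of positions
    labels[selected] = token_ids[selected]                # predict the original token there

    roll = rng.random(token_ids.shape)
    use_mask = selected & (roll < 0.8)                    # 80%: replace with [MASK]
    use_random = selected & (roll >= 0.8) & (roll < 0.9)  # 10%: replace with a random token
    # remaining 10%: keep the original token unchanged

    corrupted[use_mask] = mask_id
    corrupted[use_random] = rng.integers(0, vocab_size, size=int(use_random.sum()))
    return corrupted, labels

rng = np.random.default_rng(0)
ids = rng.integers(5, 1000, size=2000)  # fake token ids; 0-4 treated as special tokens here
corrupted, labels = bert_mask(ids, vocab_size=1000, mask_id=4, rng=rng)

selected = labels != -100
print(selected.mean())                    # close to 0.15
print((corrupted[selected] == 4).mean())  # close to 0.8
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot assume that only `[MASK]` positions matter, which softens the gap between pre-training and fine-tuning inputs.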

Loss Function：

```python
import torch.nn.functional as F

def mlm_loss(logits, targets, mask_positions):
    """
    Compute the loss only at masked positions.
    logits: (batch_size, seq_len, vocab_size)
    targets: (batch_size, seq_len)
    mask_positions: (batch_size, seq_len) - binary mask, 1 = masked
    """
    vocab_size = logits.size(-1)
    # Only compute loss on masked positions
    active_loss = mask_positions.view(-1) == 1
    active_logits = logits.view(-1, vocab_size)[active_loss]
    active_labels = targets.view(-1)[active_loss]

    loss = F.cross_entropy(active_logits, active_labels)
    return loss
```

其他訓練方式

Next Sentence Prediction (NSP)：（BERT 使用，但後來發現唔太有用）

```python
# Train the model to decide whether sentence B follows sentence A
# Input:  "[CLS] Sentence A [SEP] Sentence B [SEP]"
# Target: IsNext=1 or IsNext=0
```

優點

✅ 雙向理解：可以同時利用前後文，理解能力強

✅ 高質量 embeddings：產生嘅 contextualized embeddings 好適合做 downstream tasks

✅ Parallel encoding: all positions are processed in a single forward pass, with no token-by-token decoding at inference time

✅ 適合分類任務：加個 classification head 就可以 fine-tune

缺點與技術限制

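No Native Generation Ability

❌ Training objective limitation: the encoder is trained to fill in masked positions using context on both sides, not to predict the next token from the left context alone

Consequences:
- Poorly suited to open-ended text generation
- Poorly suited to machine translation
- Poorly suited to summarization (extractive works, abstractive does not)

Comparison:

```python
# Decoder-Only: trained so that the last position predicts the next token
logits = model(input_ids)  # (batch, seq_len, vocab_size)
next_token = torch.argmax(logits[:, -1, :], dim=-1)

# Encoder-Only: the MLM head also produces (batch, seq_len, vocab_size) logits,
# but only [MASK] positions are trained to predict anything, and each prediction
# assumes the rest of the sentence is already present
hidden_states = model(input_ids)  # (batch, seq_len, d_model) before the MLM head
```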

Pre-training / Fine-tuning Mismatch

❌ [MASK] Token 問題：

Pre-training：模型學習預測 [MASK] tokens
Fine-tuning：實際應用冇 [MASK] tokens
後果：模型喺 fine-tuning 時見到嘅 input distribution 同 pre-training 時唔同

例子：

```python
# Sentence seen during pre-training
"The [MASK] sat on the [MASK]"

# Sentence seen during fine-tuning / inference
"The cat sat on the mat"  # no [MASK]!
```

實際影響：

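BERT's own 80/10/10 masking rule is meant to soften this mismatch, and ELECTRA's discriminator is trained on replaced tokens rather than on [MASK] inputs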
但始終存在 distribution gap

固定最大長度限制

❌ BERT 最多 512 tokens（約 400 字）：

原因：Position embeddings 係固定大小（learned embeddings）
後果：
- 長文檔需要切割（可能失去 context）
- 唔適合處理書籍、論文等長文本
- 需要額外嘅 sliding window 或 hierarchical 方法

對比：

Longformer：4096 tokens（用 sparse attention）
BigBird：4096 tokens（用 random + sliding + global attention）
但仍然遠不及現代 Decoder-Only（Llama 3.1：128K tokens）

Bidirectional Attention 嘅訓練限制

❌ 訓練效率問題：

MLM 只計算 15% tokens 嘅 loss（被 mask 嘅部分）
相比 Decoder-Only 每個 token 都計 loss
資料利用率較低：同樣數據量，BERT 學到嘅嘢較少

計算：

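```python
# BERT MLM
seq_len = 512
masked_tokens = seq_len * 0.15      # 76.8 tokens contribute to the loss

# GPT Causal LM
all_tokens = seq_len                # all 512 tokens contribute to the loss
print(all_tokens / masked_tokens)   # about 6.67x more training signal per sequence
```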

下游任務需要 Fine-tuning

❌ 唔支援 Zero-shot / Few-shot Learning：

BERT 做新任務必須 fine-tune
冇辦法透過 prompting 做 in-context learning
後果：
- 每個任務都要訓練一個專用模型
- 需要標註數據
- 部署複雜（多個模型）

對比：

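```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BertForSequenceClassification

# BERT for sentiment analysis: attach a classification head, then fine-tune it
clf = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# ... fine-tune clf on labeled examples before use (labeled data required!)

# GPT-style model for sentiment analysis: prompt directly, no task-specific training
tok = AutoTokenizer.from_pretrained("gpt2")  # any causal LM checkpoint can be used here
lm = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Sentiment of 'I love this movie':", return_tensors="pt")
out = lm.generate(**inputs, max_new_tokens=3)
print(tok.decode(out[0], skip_special_tokens=True))  # larger instruction-tuned models answer "Positive" (zero-shot)
```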

Sentence-Level Representation Limits

❌ [CLS] Token 嘅限制：

BERT 用 [CLS] token 嘅 embedding 代表成句
問題：單一向量難以捕捉複雜語義
後果：
- 長句子資訊可能會丟失
- 唔同任務可能需要唔同嘅 pooling 策略

計算資源浪費

❌ Inference 時嘅浪費：

即使只關心某幾個 tokens，都要計算成條序列嘅 attention
例子：做 NER 只需要識別實體，但 BERT 要處理成句

唔適應長尾分佈

❌ Random Masking 嘅偏差：

高頻詞容易被 mask 到（被學習到）
低頻詞、專有名詞較少被 mask（學習不足）
後果：對領域特定詞彙嘅理解較差

為何 BERT 被淘汰？

綜合以上限制，BERT 嘅問題在於：

❌ 應用範圍窄：只做理解，唔做生成

❌ 唔夠靈活：需要 fine-tuning，冇 in-context learning

❌ 訓練效率低：只用 15% tokens 計 loss

❌ 長度限制死：最多 512 tokens

而 Decoder-Only 一個架構解決晒所有問題，所以最終勝出。

代表模型

BERT 系列：BERT-base, BERT-large, RoBERTa
ALBERT：參數共享版本的 BERT
DeBERTa：Microsoft 改進版（disentangled attention）
特定領域：BioBERT（生物醫學）、SciBERT（科學）

架構三：Encoder-Decoder（T5 系列）

架構設計

Encoder-Decoder 結合咗兩者：

Encoder：Bidirectional attention，理解輸入
Decoder：Causal attention on decoder side + cross-attention to encoder，生成輸出

生活例子：翻譯考試

想像你做緊中譯英考試：

第一步（Encoder）：先睇晒成句中文，完全理解意思
- 中文：「今日天氣好晴朗」
- 你會睇晒前後文，理解整體意思
第二步（Decoder）：然後逐個字生成英文翻譯
- 生成 "Today" → 再生成 "the" → 再生成 "weather" → ...
- 生成每個字時，你會參考返中文原文（cross-attention）
- 同時只睇前面已經寫咗嘅英文（causal attention）

呢個就係 Encoder-Decoder！先用 Encoder 理解輸入，再用 Decoder 逐步生成輸出。


Cross-Attention：

```python
# Each decoder position can attend to all encoder outputs
# Query:      from decoder hidden states
# Key, Value: from encoder hidden states
```
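The query/key/value split can be made concrete with a single-head cross-attention sketch in NumPy. The random projection matrices stand in for learned weights, and the sizes (7 source tokens, 3 target tokens, width 16) are arbitrary illustrative choices.

```python
import numpy as np

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross-attention: queries from the decoder, keys/values from the encoder."""
    q = decoder_states @ w_q   # (n_dec, d)
    k = encoder_states @ w_k   # (n_enc, d)
    v = encoder_states @ w_v   # (n_enc, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])               # (n_dec, n_enc)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 16
encoder_states = rng.normal(size=(7, d))  # 7 source tokens
decoder_states = rng.normal(size=(3, d))  # 3 target tokens generated so far
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

output, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (3, 16) (3, 7)
```

Note that no causal mask is needed here: every decoder position may look at the whole encoded input, and causality is enforced only in the decoder's own self-attention.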

訓練方式：Span Corruption

T5 使用 Span Corruption（片段破壞） 訓練，結合咗 Fill in the blank + 順序生成：

簡單理解

想像你做緊進階版填充題：「今日___晴朗」

分兩個階段：

Encoder（理解題目）：睇晒成句，明白前後文（「今日」「晴朗」）

Decoder（生成答案）：逐個字順序生成答案 →「天」→「氣」→「好」

同 BERT 嘅分別係：BERT 一次過填「天氣好」，但 T5 要逐個字生成。

同 GPT 嘅分別係：GPT 睇唔到後面嘅「晴朗」，但 T5 嘅 Encoder 睇到。

呢個就係 Encoder-Decoder 嘅訓練方式！

訓練過程：

```python
# Original text
"Thank you for inviting me to your party last week"

# Mask contiguous spans (marked with the sentinel tokens <X>, <Y>, <Z>)
# Input (Encoder):  "Thank you <X> me to your party <Y> week"
# Target (Decoder): "<X> for inviting <Y> last <Z>"

# The encoder sees the corrupted input
# The decoder must generate the masked spans
```
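The transformation from original text to encoder input and decoder target can be written as a short function. This is a sketch that takes the spans as given (real pre-training samples them randomly); the name `span_corrupt` and the end-exclusive span convention are illustrative choices.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel; return (encoder_input, decoder_target).

    spans must be sorted, non-overlapping, end-exclusive, and fewer than len(sentinels).
    """
    encoder_input, decoder_target = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        encoder_input.extend(tokens[cursor:start])
        encoder_input.append(sentinel)            # one sentinel stands in for the whole span
        decoder_target.append(sentinel)
        decoder_target.extend(tokens[start:end])  # the dropped tokens follow their sentinel
        cursor = end
    encoder_input.extend(tokens[cursor:])
    decoder_target.append(sentinels[len(spans)])  # final sentinel marks the end of the target
    return " ".join(encoder_input), " ".join(decoder_target)

text = "Thank you for inviting me to your party last week"
enc, dec = span_corrupt(text.split(), spans=[(2, 4), (8, 9)])
print(enc)  # Thank you <X> me to your party <Y> week
print(dec)  # <X> for inviting <Y> last <Z>
```

Because each span collapses to a single sentinel, both the encoder input and the decoder target are shorter than the original text, which keeps the objective cheap to train.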

統一格式：Text-to-Text

T5 將所有任務都轉成 text-to-text 格式：

```python
# Translation
# Input:  "translate English to German: That is good."
# Output: "Das ist gut."

# Classification
# Input:  "cola sentence: The course is jumping well."
# Output: "not acceptable"

# Summarization
# Input:  "summarize: [long article]"
# Output: "summary text"
```

優點

✅ 最靈活：Encoder 理解 + Decoder 生成，兩者兼得

✅ 適合 Seq2Seq：Translation, Summarization, QA 等任務表現好

✅ 統一框架：T5 將所有任務統一成 text-to-text

✅ Cross-attention：Decoder 可以充分利用 encoder 嘅資訊

缺點與技術限制

架構複雜度高

❌ 兩個獨立 Stack：

Encoder：N layers × (Self-Attention + FFN)
Decoder：N layers × (Self-Attention + Cross-Attention + FFN)
後果：
- Close to twice the parameters of a Decoder-Only model with the same depth and width per stack
- 訓練時間長
- 推理速度慢

參數量對比：

```python
# T5-Base (Encoder-Decoder)
# Encoder: 12 layers, 12 heads, d_model 768                       -> roughly 110M params
# Decoder: 12 layers, 12 heads, d_model 768, plus cross-attention -> roughly 110M params
# Total: ~220M params

# GPT-2 Medium (Decoder-Only)
# 24 layers, 16 heads, d_model 1024 -> ~345M params
# More parameters, but a single, simpler stack
```

Cross-Attention 嘅瓶頸

❌ 額外計算開銷：

Decoder 每層都要做 cross-attention
Query from decoder, Key/Value from encoder
複雜度：O(n_decoder × n_encoder)

實際影響：

```python
# Assume encoder input = 512 tokens, decoder output = 128 tokens
# Cross-attention computes 512 × 128 = 65,536 attention scores per head
# This is repeated in every decoder layer, so the cost adds up in deep models
```

訓練效率低

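❌ Two stacks per step:

Step 1: the encoder processes the input
Step 2: the decoder processes the target (in parallel under teacher forcing during training, token by token at inference)
Consequence: each training step runs two stacks plus cross-attention; a rough estimate is 1.5-2x the training time of a comparable Decoder-Only model

❌ Teacher Forcing dependence:

Training feeds the ground truth to the decoder
Inference feeds the model's own generated tokens
Exposure bias therefore applies here as well, just as it does for Decoder-Only models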

Scaling Laws 效果較差

❌ 大規模訓練唔划算：

研究發現，同樣計算資源下：

Decoder-Only：性能提升明顯
Encoder-Decoder：性能提升緩慢

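Illustrative trend (schematic numbers, not measured results):

```text
Model size: 1B → 10B → 100B

Decoder-Only:
1B: BLEU 30 → 10B: BLEU 40 → 100B: BLEU 48 (keeps improving)

Encoder-Decoder:
1B: BLEU 32 → 10B: BLEU 38 → 100B: BLEU 42 (gains slow down)
```

Reasons often given:

Decoder-Only may use parameters more efficiently
A single uniform stack makes scaling behavior more predictable
In-context learning ability improves markedly with scale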

Inference 速度慢

❌ 雙重瓶頸：

Encoder forward pass：需要處理完整 input
Decoder autoregressive generation：逐個 token 生成

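Speed comparison (hypothetical timings, for illustration only):

```python
# Generating 100 tokens
#
# Decoder-Only:
# - Forward passes: 100 (decoder only)
# - Time: ~1 s (assumed)
#
# Encoder-Decoder:
# - Encoder forward: 1 (processes the input)
# - Decoder forward: 100 (each step includes cross-attention over the encoder output)
# - Time: ~1.8 s (assumed, i.e. 80% slower)
```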

記憶體佔用大

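❌ Extra KV cache:

The encoder runs once and needs no KV cache of its own
The decoder caches two sets of tensors:
- Self-attention: K, V (grows with each generated token)
- Cross-attention: K, V (computed once from the encoder output)
Total: the cross-attention cache is extra memory on top of the decoder's own self-attention cache

Practical impact:

```python
# T5-Large: 24 decoder layers, d_model = 1024, FP16 (2 bytes per value)
batch, seq_len, d_model, n_layers = 8, 512, 1024, 24
per_cache = batch * seq_len * d_model * 2 * n_layers * 2  # K and V, 2 bytes each
print(per_cache / 1e6)       # about 403 MB: decoder self-attention after 512 generated tokens
print(2 * per_cache / 1e6)   # about 805 MB once the cross-attention cache (512 input tokens) is added
```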

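Weaker In-Context Learning

❌ Prompting works less well out of the box:

The original T5 is pre-trained and fine-tuned as an input → output mapping
It is not trained for few-shot prompting (instruction-tuned variants such as Flan-T5 narrow this gap)

Example:

```python
# A GPT-style model can do few-shot translation from the prompt alone
prompt = "Translate: Hello→Bonjour, Hi→Salut, Good→"
# A sufficiently large GPT-style model typically continues with "Bon"

# The original T5 instead expects a task prefix it was trained on,
# e.g. "translate English to French: Good"
```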

文本生成質素問題

❌ Length Bias：

Encoder-Decoder 傾向生成較短嘅輸出
原因：訓練時 loss 係 per-token，模型學會「快啲結束」
後果：生成嘅摘要、翻譯可能過度簡化

❌ Attention Dilution：

Cross-attention 需要 attend 到整個 encoder output
長 input 時，attention 分散
後果：可能遺漏重要資訊

部署複雜度高

❌ 兩套參數需要管理：

Encoder weights
Decoder weights
唔同嘅 optimization 策略

❌ 分散式推理困難：

Encoder 同 Decoder 難以分離部署
唔似 Decoder-Only 可以輕易做 pipeline parallelism

社群支援減少

❌ 2024+ 幾乎冇新 Encoder-Decoder 模型：

Newly published LLMs on Hugging Face: predominantly Decoder-Only
開源工具、優化技術：主要針對 Decoder-Only
後果：
- 難以搵到預訓練模型
- 缺乏 SOTA 技術支援
- 社群討論、教學資源稀缺

為何 Encoder-Decoder 被淘汰？

總結來講，Encoder-Decoder 嘅問題在於：

❌ 複雜但冇明顯優勢：雙 stack 但效果唔及 Decoder-Only

❌ 訓練成本高：慢 + 食記憶體

❌ Scaling 效果差：大規模訓練唔划算

❌ 唔支援 prompting：冇 in-context learning

而 Decoder-Only 更簡單、更快、更強，所以業界全面放棄 Encoder-Decoder。

代表模型

T5 系列：T5-small, T5-base, T5-11B, Flan-T5
BART：Facebook 的 Encoder-Decoder（用 denoising autoencoder 訓練）
mT5：Multilingual T5
UL2：Unified Language Learner（混合多種訓練目標）

BART vs BERT：點解一個係 Encoder-Decoder，一個係 Encoder-Only？

好多人會混淆 BART 同 BERT，因為名字好似，但佢哋其實完全唔同：

BART 嘅訓練方式同 BERT 唔同：

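```python
# BERT: randomly mask tokens
# Input:  "The [MASK] sat on the [MASK]"
# Target: predict "cat" and "mat"

# BART: corrupt the text, then reconstruct it
# Input (Encoder):  "The cat sat on"          (part of the text deleted)
# Target (Decoder): "The cat sat on the mat"  (full sentence rebuilt)

# BART supports several corruption schemes:
# - Token Masking (as in BERT)
# - Token Deletion (remove tokens)
# - Text Infilling (fill in contiguous spans)
# - Sentence Permutation (shuffle sentence order)
# - Document Rotation (rotate the starting point of the document)
```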

點解 BART 用 Encoder-Decoder？

因為 BART 要做生成任務，所以需要 Decoder。BERT 只做理解，所以只需要 Encoder。

生活例子：

BERT：好似閱讀理解，睇晒篇文答問題
BART：好似文章修復，將破爛嘅文章修復返好

三大架構比較

用生活例子總結

三種架構好似三種寫作方式

Decoder-Only（GPT）= 故事接龍

Encoder-Only（BERT）= 閱讀理解

Encoder-Decoder（T5）= 翻譯考試


技術規格比較

點解 Decoder-Only 成為主流？

過去幾年，業界明顯趨向 Decoder-Only 架構。點解？

1. 統一框架 (Unified Framework)

Decoder-Only 可以用同一個架構做晒所有任務：

```python
# `model.generate` is a placeholder for any text-generation call

# Generation
prompt = "Write a poem about the moon:"
output = model.generate(prompt)

# Understanding (via prompting)
prompt = "Classify sentiment: I love this movie! Sentiment:"
output = model.generate(prompt)  # "Positive"

# Reasoning
prompt = "Q: 2+2=? A:"
output = model.generate(prompt)  # "4"
```

相比之下，BERT 只能做理解，唔識生成；T5 雖然兩樣都得，但架構複雜。

2. In-Context Learning

Decoder-Only 天生支援 in-context learning（ICL），即係俾幾個 examples 就識做新任務，唔需要 fine-tuning：

```python
# Few-shot learning
prompt = """
Translate to French:
Hello -> Bonjour
Good morning -> Bonjour
Thank you -> Merci
Goodbye -> 
"""
# The model is expected to output "Au revoir"
```

BERT 做唔到呢樣，需要 fine-tune 先識新任務。

3. Scaling Laws 效果最好

研究發現，Decoder-Only 嘅 scaling laws 最理想：

Model size ↑ → Loss ↓, following a power law (close to a straight line on log-log axes)
Emergent abilities 喺大規模先出現
效果提升最明顯

BERT-style 嘅 model scale 上去效果唔及 GPT-style。
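Scaling-law results are usually written as a power law in parameter count, L(N) = (N_c / N)^α, which looks like a straight line only on log-log axes. The sketch below plugs in approximate constants of the kind reported by Kaplan et al. (2020) for non-embedding parameters; treat `ALPHA_N` and `N_C` as illustrative assumptions rather than exact values.

```python
ALPHA_N = 0.076   # assumed exponent (approximate, after Kaplan et al., 2020)
N_C = 8.8e13      # assumed scale constant, in non-embedding parameters

def predicted_loss(n_params):
    """Power-law loss estimate: L(N) = (N_c / N) ** alpha."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss {predicted_loss(n):.3f}")

# Every 10x increase in parameters multiplies the loss by the same factor
ratio = predicted_loss(1e10) / predicted_loss(1e9)
print(round(ratio, 3))  # 0.839
```

Under these assumptions each tenfold increase in size cuts the loss by about 16%, so gains are steady but far from linear in the raw parameter count.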

4. 簡化訓練同部署

訓練：

Decoder-Only：直接 next-token prediction
Encoder-Only：要設計 masking 策略
Encoder-Decoder：要設計 input-output pairs

部署：

Decoder-Only：單一 model
Encoder-Decoder：兩個 stack，記憶體同運算需求大

5. Prompting > Fine-tuning

現代趨勢係用 prompting 而唔係 fine-tuning：

Prompting 需要 generative model（Decoder-Only）
Instruction-tuning（Instruct-GPT, Llama-Chat）讓 model 更易用
Few-shot prompting of large models can match or beat fine-tuned BERT on many tasks

重點總結

Decoder-Only 勝出唔係因為技術上優越（BERT 嘅 bidirectional attention 理論上更強），而係因為：

實用性：一個 model 做晒所有野

Scaling：大規模訓練效果最好

Flexibility：Prompting 比 fine-tuning 更靈活

Decoder-Only 點樣做理解任務？

有人會問：Decoder-Only 只有 causal attention，點做理解任務？

答案：Prefix LM (Non-Causal Prefix)

```python
# Split the input into two parts:
# 1. Prefix (bidirectional attention)
# 2. Target (causal attention)
#
# Input: "Translate to French: Hello"
#        [-------Prefix-------]  [Target]
#        Bidirectional ←→      Causal →
#
# Prefix tokens can attend to each other
# Target tokens can attend only to the prefix and to earlier target tokens
```

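Note that most mainstream Decoder-Only models (GPT, Llama) are trained with plain causal LM and handle understanding tasks through prompting. Prefix LM is an optional variant, and some training recipes mix the two:

Part of the data uses causal LM (pure generation)
Part of the data uses prefix LM (understanding + generation)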
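The prefix-LM attention pattern differs from a causal mask only in the top-left block. A minimal NumPy sketch follows; the helper name `prefix_lm_mask` is illustrative.

```python
import numpy as np

def prefix_lm_mask(prefix_len, total_len):
    """mask[i, j] = 1 if position i may attend to position j."""
    mask = np.tril(np.ones((total_len, total_len), dtype=int))  # start from a causal mask
    mask[:prefix_len, :prefix_len] = 1  # prefix tokens see each other in both directions
    return mask

print(prefix_lm_mask(prefix_len=3, total_len=5))
# [[1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

The first three rows are the bidirectional prefix; the last two rows are target positions, which see the whole prefix plus earlier target tokens only.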

實際應用場景

何時用 Decoder-Only？

✅ 推薦用 Decoder-Only：

新 project：直接用 Llama, Mistral 等開源 model
生成任務：寫作、翻譯、對話、代碼生成
需要 in-context learning：Few-shot prompting
多任務統一：用一個 model 做多種任務
長文本處理：現代 Decoder-Only 支援超長 context（128K+）

代表場景：

ChatGPT 式對話系統
代碼助手（GitHub Copilot）
內容創作（文章、劇本、詩歌）
通用 AI Agent

何時用 Encoder-Only？

✅ 推薦用 Encoder-Only：

純分類任務：Sentiment analysis, spam detection
需要高質量 embeddings：Semantic search, clustering
計算資源有限：BERT-base 只有 110M 參數
唔需要生成：純理解任務

代表場景：

文本分類（新聞分類、情感分析）
Named Entity Recognition (NER)
Question Answering（抽取式，唔係生成式）
Semantic Search（用 embeddings 做相似度搜尋）

⚠️ 注意： 而家好多 embedding model 都轉咗用 Decoder-Only（例如 Mistral Embed），所以 Encoder-Only 嘅優勢越來越細。

何時用 Encoder-Decoder？

✅ 推薦用 Encoder-Decoder：

Seq2Seq 任務（但 Decoder-Only 一樣得）
已有 T5/BART 嘅 fine-tuned model
特定研究項目

代表場景：

機器翻譯（但 GPT-4 做得更好）
文本摘要（但 Llama 一樣得）
Data-to-text generation

⚠️ 注意： Encoder-Decoder 逐漸被淘汰，新 project 唔建議用。

快速決策指南：點樣揀架構？


2026 年黃金法則：

唔知揀咩？預設用 Decoder-Only！

除非你：

✅ 只做分類 同埋資源好有限 → 先用 BERT

✅ 維護緊舊 project 已經用緊 T5/BART → 先繼續用 Encoder-Decoder

其他情況一律用 Decoder-Only（Llama 3, Mistral, Qwen）！
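The rule of thumb above is simple enough to write down as a function. This is only a sketch of the article's heuristic; the function and argument names are illustrative, and real decisions will involve more factors (latency, licensing, available checkpoints).

```python
def choose_architecture(classification_only=False, limited_resources=False,
                        maintains_t5_or_bart=False):
    """Encode the rule of thumb above; the argument names are illustrative."""
    if maintains_t5_or_bart:
        return "Encoder-Decoder"   # keep an existing T5/BART project as it is
    if classification_only and limited_resources:
        return "Encoder-Only"      # e.g. BERT for cheap classification
    return "Decoder-Only"          # default choice

print(choose_architecture())                                                  # Decoder-Only
print(choose_architecture(classification_only=True, limited_resources=True))  # Encoder-Only
print(choose_architecture(maintains_t5_or_bart=True))                         # Encoder-Decoder
```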

實用建議

2026 年建議策略：

🎯 預設選擇：Decoder-Only（Llama 3, Mistral, Qwen）

🎯 純分類/Embedding：可以考慮 Encoder-Only（BERT, DeBERTa），但 Decoder-Only fine-tuned 版本更好

🎯 避免：Encoder-Decoder（除非有特殊原因）

總結

經過詳細比較，我哋可以得出以下結論：

三大架構各有特色

1. Decoder-Only（GPT/Llama）

✅ 統一框架，做晒所有任務
✅ Scaling laws 最好
✅ In-context learning 能力強
🎯 現時主流，新 project 首選

2. Encoder-Only（BERT）

✅ 雙向理解，高質量 embeddings
✅ 參數少，訓練快
⚠️ 只做理解，唔識生成
📉 逐漸被 Decoder-Only 取代

3. Encoder-Decoder（T5/BART）

✅ Seq2Seq 任務表現好
✅ Cross-attention 機制強
❌ 架構複雜，參數量大
📉 基本已被淘汰

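Training Objectives at a Glance

Decoder-Only (GPT): Causal LM, predict the next token
Encoder-Only (BERT): Masked LM, fill in masked tokens
Encoder-Decoder (T5): Span Corruption, regenerate masked spans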

最終建議

🎯 2026 年 LLM 開發指南：

新 project 一律用 Decoder-Only（Llama 3, Mistral, Qwen）
純分類任務可以考慮 fine-tuned Decoder-Only 或 BERT
避免從零訓練，用現成開源 model + fine-tune/prompting
Encoder-Decoder 可以忽略（除非維護舊 project）

記住：架構選擇冇絕對對錯，最重要係符合你嘅應用場景同資源限制。 但趨勢已經好明顯：Decoder-Only 會繼續主導未來幾年嘅 LLM 發展。
