DeepSeek-OCR 2: How Do You Teach an AI to "Read" Documents Like a Human? A Complete Breakdown of Visual Causal Flow

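Have you ever asked an AI to read a two-column PDF, only to have it splice the first line of the left column onto the first line of the right column? That is a fundamental problem with conventional VLMs: the way they "look" at an image is nothing like the way a person does. DeepSeek-OCR 2's message is that once you teach the model to look at things in the right order, a 3B-parameter model can beat a 235B one.

Paper source: DeepSeek-AI (2026-01-28)
arXiv: 2601.20552
GitHub: deepseek-ai/DeepSeek-OCR-2
HuggingFace: deepseek-ai/DeepSeek-OCR-2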

TL;DR

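DeepSeek-OCR 2 proposes a bold idea: use an LLM as the vision encoder. Conventional VLMs feed image patches into the LLM in a fixed raster-scan order (left to right, top to bottom), which ignores the semantic structure of the image. DeepSeek-OCR 2's DeepEncoder V2 introduces a visual causal flow mechanism that uses learnable queries to reorder visual tokens dynamically, imitating the semantics-driven scanning pattern of human vision.

Key points:

🎯 91.09% overall score on OmniDocBench v1.5 (SOTA among end-to-end models)
🚀 At most 1120 visual tokens, versus >6000 for comparable models
📖 Reading-order edit distance down to 0.057 (33% better than the DeepSeek-OCR baseline)
🧠 First demonstration that an LLM architecture can serve as a vision encoder
💡 Beats Qwen3-VL-235B with only 3B parameters (500M active)
⚡ The same encoder architecture could later go omni-modal (text + image + audio)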

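Background: What Is Wrong with Conventional VLMs?

Human Vision vs. Machine Vision

People do not scan an image pixel by pixel. Their gaze jumps around, guided by semantic logic.

Take a two-column academic paper:

👁️ Human: read the whole left column → jump to the top of the right column → read the right column
🤖 Conventional VLM: start at the top-left corner → sweep row by row → the first line of the left column runs straight into the first line of the right column → the order is scrambled

Or take text laid out in a spiral:

👁️ Human: the eyes follow the spiral's internal logic, and each fixation is causally linked to the previous one
🤖 Conventional VLM: rigidly scans from top-left to bottom-right and ignores the spiral structure entirely

💡 The core tension: an LLM is a 1D sequential model, but an image is a 2D structure. The conventional approach "flattens" 2D patches into a 1D sequence in a fixed raster-scan order, and that flattening introduces an unjustified inductive bias: it assumes that "spatially adjacent = semantically related". For documents with complex layouts, that assumption does not hold.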

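The Fundamental Problem with Raster Scan: A Concrete Example

Suppose you have a 4×4 image grid, that is, 16 patches, containing two columns of text:

┌─────────────────────┐
│  A1  A2  │  B1  B2  │
│  A3  A4  │  B3  B4  │
│  A5  A6  │  B5  B6  │
│  A7  A8  │  B7  B8  │
└─────────────────────┘
  Left (A)    Right (B)

Raster-scan order (conventional VLM):

A1 → A2 → B1 → B2 → A3 → A4 → B3 → B4 → ...

The problem is that B1 comes right after A2, even though A2 and B1 have no semantic relationship. A2 should be followed by A3 (the next line of the same column), but raster scan forces it next to B1.

Human reading order (semantically correct):

A1 → A2 → A3 → A4 → A5 → A6 → A7 → A8 → B1 → B2 → ...

⚠️ This is especially harmful inside an LLM
An LLM uses causal attention (each token can only attend to earlier tokens). Under raster-scan order, B1 can attend to A2, but A3 only shows up after B1 and B2 have been interleaved. That makes it hard for the LLM to build causal relationships within a single column.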
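To make the two orderings concrete, here is a small Python sketch that flattens the 4×4 grid from the example both ways. The `grid` literal, the helper names, and the fixed column split at index 2 are illustrative assumptions; a real model has to infer the layout rather than being told where the columns are.

```python
# Toy 4x4 patch grid from the example: patch columns 0-1 hold text column A,
# patch columns 2-3 hold text column B.
grid = [
    ["A1", "A2", "B1", "B2"],
    ["A3", "A4", "B3", "B4"],
    ["A5", "A6", "B5", "B6"],
    ["A7", "A8", "B7", "B8"],
]


def raster_order(grid):
    """Flatten row by row, left to right (what a fixed raster scan does)."""
    return [patch for row in grid for patch in row]


def column_reading_order(grid, split=2):
    """Read the left text column top to bottom, then the right one.

    `split` is the patch-column index where text column B starts
    (assumed to be known here, purely for illustration).
    """
    left = [patch for row in grid for patch in row[:split]]
    right = [patch for row in grid for patch in row[split:]]
    return left + right


raster = raster_order(grid)
reading = column_reading_order(grid)

print(" → ".join(raster[:8]))
print(" → ".join(reading[:10]))

# A2 and A3 are consecutive for a human reader. How far apart does each
# flattening place them?
gap_raster = raster.index("A3") - raster.index("A2")
gap_reading = reading.index("A3") - reading.index("A2")
print(gap_raster, gap_reading)  # 3 1
```

In the raster flattening, two patches from the other column sit between A2 and A3; in the reading-order flattening they are neighbours.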

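How bad is it? Look at the experimental numbers:

DeepSeek-OCR (raster scan): reading-order edit distance = 0.085

DeepSeek-OCR 2 (causal flow): reading-order edit distance = 0.057

That is a 33% improvement, since (0.085 - 0.057) / 0.085 ≈ 0.33, and it shows what semantic reordering buys you.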

The Full Pipeline of a Conventional VLM

A conventional vision-language model works like this:


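Cut the image into patches (for example 16×16)
Force them into top-left → bottom-right order
Add a fixed positional encoding (such as RoPE)
Feed them into the LLM for causal decoding

This approach has three major problems:

Problem	What happens	Impact
❌ Spatial bias	The forced spatial order ignores semantic relationships	Two-column documents are read in the wrong order
❌ 2D→1D mismatch	A 2D image is squeezed into a 1D causal LLM	Weak understanding of complex layouts
❌ Too many visual tokens	Most models need >6000 tokens	Memory and latency balloon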

DeepEncoder V2: The Core Innovation

Overall Architecture

DeepSeek-OCR 2 differs from a conventional VLM in one fundamental way: its encoder has causal reasoning ability of its own.


🔑 The fundamental difference from a conventional VLM

Conventional VLM: Non-causal Encoder (CLIP) → Causal Decoder (LLM)

DeepSeek-OCR 2: Causal Encoder (DeepEncoder V2) → Causal Decoder (DeepSeek-3B)

Both stages reason causally: the encoder reasons about reading order, and the decoder reasons about content.

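Component 1: Vision Tokenizer (80M params)

The Vision Tokenizer has a simple job: compress the image into a small number of visual tokens.

Architecture: SAM-base (80M params) + two Conv layers

Compression ratio: 16×

Concrete numbers:

1024×1024 image → 256 visual tokens
768×768 image → 144 visual tokens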
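The two token counts follow from simple arithmetic. The sketch below reproduces them under one assumption that the article does not state explicitly: the tokenizer first splits the image into 16×16-pixel patches, and the 16× ratio is then applied to the number of patch tokens. The function name is made up for illustration.

```python
def visual_token_count(side, patch=16, compression=16):
    """Tokens left after patchifying a square image and compressing the token count.

    patch=16 is an assumed patch size for the tokenizer backbone;
    compression=16 is the 16x ratio quoted in the text.
    """
    patches_per_side = side // patch
    raw_tokens = patches_per_side * patches_per_side
    return raw_tokens // compression


for side in (1024, 768):
    print(side, visual_token_count(side))
# 1024 256
# 768 144
```

Under that assumption, a 1024×1024 image yields 64×64 = 4096 patch tokens before compression and a 768×768 image yields 48×48 = 2304, which is why global attention over the compressed sequence is so much cheaper.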

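💡 Why not use a plain patch embedding?
Because 16× compression sharply cuts the compute cost and activation memory of the global attention that follows. The 80M parameters are in the same range as an LLM's text embedding (typically ~100M params), so they add little overhead.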

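Component 2: LLM-style Vision Encoder (500M params)

This is the central innovation of DeepSeek-OCR 2: Qwen2-0.5B replaces the usual CLIP ViT.

Why use an LLM as the vision encoder?

The trouble with CLIP ViT is that it is good at extracting features but does not reason. It uses bidirectional attention, so every token attends to every other token, and it has no notion of causal order.

An LLM comes with three abilities built in:

Ordering: causal attention forces it to learn sequence order
Logic: pretraining gives it a large amount of causal reasoning ability
Causality: each token depends on the tokens before it

These are exactly the abilities that visual token reordering needs. So DeepSeek-OCR 2 initializes from Qwen2-0.5B's pretrained weights and turns the model into a visual reasoning module rather than a mere feature extractor.

🎯 In one sentence
CLIP is a "photographer": it captures every detail but cannot put things in order.

DeepEncoder V2 is an "editor": it sees everything and also decides what the reader should look at first.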

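Dual-Stream Attention: How Does It Work?

DeepEncoder V2's attention mechanism is split into two streams:

[Visual Tokens (prefix)]  +  [Causal Flow Queries (suffix)]
         ↓                            ↓
  Bidirectional Attention       Causal Attention
  (each token sees all          (each query sees
   other visual tokens)          all visual tokens
                                 + earlier queries)

Why are visual tokens bidirectional?

Because every visual token needs a full-image receptive field, that is, it must see every part of the image. This matches what CLIP ViT offers and preserves global perception.

Why are causal flow queries causal?

Because the goal is to learn an order: "look here first, then there". Causal attention restricts each query to the queries before it (plus all visual tokens), which naturally yields a "reading order". The 5th query can decide where to focus next based on what the first 4 queries have already "looked at".

💡 An intuition for the dual stream
Imagine you are an editor deciding the paragraph order of an article:

Step 1 (bidirectional): you read every paragraph once to understand the whole

Step 2 (causal): then you start ordering: "The background should open the piece... after the background, the next part should be the method... after the method, the experiments should follow..."

Each decision depends on the decisions already made. That is causal reasoning.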

Attention Mask: A Concrete Example

Suppose there are $m = 4$ visual tokens and $n = 4$ causal flow queries. The attention mask $M$ is an $8 \times 8$ matrix:

$$M = \begin{bmatrix} \mathbf{1}_{4 \times 4} & \mathbf{0}_{4 \times 4} \\ \mathbf{1}_{4 \times 4} & \text{LowerTri}(4) \end{bmatrix}$$

Written out:

     V1  V2  V3  V4  │ Q1  Q2  Q3  Q4
    ─────────────────┼─────────────────
V1 │  1   1   1   1  │  0   0   0   0     ← V1 sees every V but no Q
V2 │  1   1   1   1  │  0   0   0   0     ← V2: same
V3 │  1   1   1   1  │  0   0   0   0     ← V3: same
V4 │  1   1   1   1  │  0   0   0   0     ← V4: same
    ─────────────────┼─────────────────
Q1 │  1   1   1   1  │  1   0   0   0     ← Q1 sees every V + itself
Q2 │  1   1   1   1  │  1   1   0   0     ← Q2 sees every V + Q1 + itself
Q3 │  1   1   1   1  │  1   1   1   0     ← Q3 sees every V + Q1, Q2 + itself
Q4 │  1   1   1   1  │  1   1   1   1     ← Q4 sees every V + all earlier Q + itself

🔑 The four quadrants of the mask

Top-left (V→V): all 1s = bidirectional attention, visual tokens see each other

Top-right (V→Q): all 0s = visual tokens cannot see the queries (queries do not affect visual tokens)

Bottom-left (Q→V): all 1s = every query can attend to every visual token

Bottom-right (Q→Q): lower triangular = causal attention, each query attends only to itself and earlier queries

The neat part of this design is that visual tokens are completely unaffected by the queries (they remain pure feature extraction), while each query sees both all the visual information and the decisions already made by earlier queries.

A Python implementation of the attention mask:

import torch

def create_deepencoder_v2_mask(m, n):
    """
    Create DeepEncoder V2 attention mask.
    m: number of visual tokens
    n: number of causal query tokens (n == m)
    """
    # Upper-left: visual tokens see each other (bidirectional)
    visual_mask = torch.ones(m, m)
    
    # Upper-right: visual tokens cannot see queries
    upper_right = torch.zeros(m, n)
    
    # Lower-left: queries can attend to ALL visual tokens
    lower_left = torch.ones(n, m)
    
    # Lower-right: causal mask for queries (lower triangular)
    lower_right = torch.tril(torch.ones(n, n))
    
    # Assemble the full mask
    top = torch.cat([visual_mask, upper_right], dim=1)
    bottom = torch.cat([lower_left, lower_right], dim=1)
    mask = torch.cat([top, bottom], dim=0)
    
    return mask  # shape: (m+n, m+n)

# Example: 256 visual tokens + 256 queries
mask = create_deepencoder_v2_mask(256, 256)  # 512 x 512
print(mask.shape)  # torch.Size([512, 512])

Component 3: Causal Flow Queries

Causal flow queries are DeepEncoder V2's "secret weapon": a set of learnable embeddings responsible for reordering the visual tokens.

Design decision 1: number of queries = number of visual tokens

This is deliberate: $n = m$ (as many queries as visual tokens).

Why not use fewer queries (like BLIP-2's 32)?

Because DeepSeek-OCR 2 is not only trying to compress; it is trying to reorder. You want to keep all the visual information and change only its order. With too few queries there is not enough capacity for re-fixation. Human vision revisits important regions repeatedly, and this design imitates that.

🎯 Comparison with other parallelized query designs

Model	Number of queries	Attention	Purpose
DETR	100 object queries	Bidirectional	Object detection
BLIP-2 Q-former	32 learnable queries	Bidirectional	Token compression
DeepEncoder V2	n queries (n = visual tokens)	Causal	Token reordering

DeepEncoder V2 is presented as the first architecture to apply causal attention to the queries of a vision encoder.

Design decision 2: multi-crop strategy

Images of different sizes produce different numbers of tokens:

Global view (always present):

Resolution: 1024×1024
Visual tokens: 256
Query embeddings: query_global (256)

Local crops (0 to 6):

Resolution: 768×768 each
Visual tokens: 144 each
Query embeddings: query_local (144, shared by all local views)
Cropping happens only when the image width or height is ≥768

Token count:

$$\text{Total tokens} = k \times 144 + 256 \quad (k = 0 \text{ to } 6)$$

Local crops (k)	Calculation	Total tokens	Use case
0	0×144 + 256	256 (minimum)	Small image (<768×768)
1	1×144 + 256	400	Medium image
3	3×144 + 256	688	Large image
6	6×144 + 256	1120 (maximum)	Very large / complex document
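The token budget for every crop count can be tabulated with a few lines of Python. The constants come straight from the figures above; the function and variable names are hypothetical.

```python
GLOBAL_TOKENS = 256   # one 1024x1024 global view, always present
LOCAL_TOKENS = 144    # each 768x768 local crop
MAX_CROPS = 6         # upper bound on local crops quoted in the text


def total_visual_tokens(num_crops):
    """Total visual tokens for one global view plus `num_crops` local crops."""
    if not 0 <= num_crops <= MAX_CROPS:
        raise ValueError(f"num_crops must be between 0 and {MAX_CROPS}")
    return num_crops * LOCAL_TOKENS + GLOBAL_TOKENS


budget = {k: total_visual_tokens(k) for k in range(MAX_CROPS + 1)}
print(budget)
# {0: 256, 1: 400, 2: 544, 3: 688, 4: 832, 5: 976, 6: 1120}
```

Because the number of causal flow queries equals the number of visual tokens, the same figures also give the number of tokens handed to the decoder.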

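💡 Why 1120 tokens matters

36 fewer than DeepSeek-OCR's 1156 tokens (Gundam mode)

The same as Gemini-3 Pro's maximum visual token budget

Qwen3-VL-235B, by contrast, needs >6000 tokens

Reaching SOTA with 1120 tokens suggests that semantic reordering matters far more than brute-force increases in token count.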

Two-Stage Cascade Causal Reasoning: Why Does This Decomposition Work?

The full DeepSeek-OCR 2 pipeline can be written as one formula:

$$\mathbf{O} = \mathcal{D}\left(\pi_Q\left(\mathcal{T}^L\left(\mathcal{E}(\mathbf{I}) \oplus \mathbf{Q}_0; \mathbf{M}\right)\right)\right)$$

Symbol by symbol:

Symbol	Meaning	Corresponding component
$\mathbf{I}$	Input image $H \times W \times 3$	The original picture
$\mathcal{E}$	Vision tokenizer	SAM-base + 2 Conv (80M params)
$\mathbf{Q}_0$	Learnable causal query embeddings	n learnable queries
$\oplus$	Sequence concatenation	Visual tokens as prefix, queries as suffix
$\mathcal{T}^L$	L-layer Transformer with masked attention $\mathbf{M}$	Qwen2-0.5B (500M params)
$\pi_Q$	Projection operator: keeps only the second half (the queries)	Discards the visual tokens, keeps only the causal flow outputs
$\mathcal{D}$	Language decoder	DeepSeek-3B MoE (500M active params)
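The encoder half of the formula can be traced with a toy, dependency-free sketch. Everything here is an illustrative assumption rather than the model's code: random vectors stand in for $\mathcal{E}(\mathbf{I})$ and $\mathbf{Q}_0$, a single parameter-free attention layer stands in for the $L$-layer Transformer $\mathcal{T}^L$, and the helper names are hypothetical. The point is only to show the concatenation, the mask $\mathbf{M}$, and the slice performed by $\pi_Q$.

```python
import math
import random


def build_mask(m, n):
    """mask[i][j] is True when position i may attend to position j."""
    size = m + n
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if j < m:
                mask[i][j] = True      # every position sees all visual tokens
            elif i >= m and j <= i:
                mask[i][j] = True      # queries see earlier queries and themselves
    return mask


def masked_attention(seq, mask):
    """One parameter-free attention layer (queries = keys = values = seq)."""
    dim = len(seq[0])
    out = []
    for i, q in enumerate(seq):
        allowed = [j for j in range(len(seq)) if mask[i][j]]
        scores = {j: sum(a * b for a, b in zip(q, seq[j])) / math.sqrt(dim)
                  for j in allowed}
        peak = max(scores.values())
        weights = {j: math.exp(s - peak) for j, s in scores.items()}
        total = sum(weights.values())
        out.append([sum(weights[j] / total * seq[j][d] for j in allowed)
                    for d in range(dim)])
    return out


def encode(visual_tokens, queries):
    """E(I) ⊕ Q0 -> masked layer -> split into visual part and pi_Q part."""
    m, n = len(visual_tokens), len(queries)
    hidden = masked_attention(visual_tokens + queries, build_mask(m, n))
    return hidden[:m], hidden[m:]      # pi_Q keeps only hidden[m:]


rng = random.Random(0)
m = n = 4
dim = 8
visual = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(m)]
queries_a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
queries_b = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

vis_a, flow_a = encode(visual, queries_a)
vis_b, flow_b = encode(visual, queries_b)

print(len(flow_a))       # number of vectors passed on to the decoder
print(vis_a == vis_b)    # visual positions are unaffected by the queries
print(flow_a == flow_b)  # query outputs do depend on the queries
```

Swapping the query set changes the outputs at the query positions but leaves the visual positions identical, which is exactly what the all-zero top-right quadrant of the mask guarantees.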

Two-stage causal reasoning

The pipeline splits naturally into two causal reasoning stages:


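Encoder stage: causal queries semantically reorder the visual tokens, deciding what to read first and what to read next
Decoder stage: the LLM reasons autoregressively over the reordered visual tokens to understand the content and generate the output

🔑 Why does this decomposition work?
The paper puts forward a bold hypothesis: 2D understanding = a cascade of two complementary 1D causal reasoners.

First 1D reasoner (encoder): converts the 2D spatial layout into a 1D semantic order

Second 1D reasoner (decoder): performs content understanding on that 1D semantic order

The two reason along "orthogonal" directions: one handles layout, the other handles content. The decomposition spares a single model from having to handle 2D layout and content understanding at the same time.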

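Why Prefix Concatenation Instead of Cross-Attention?

The paper reports an important experiment: the authors tried an mBART-style encoder-decoder with cross-attention (visual tokens and queries kept separate and interacting through cross-attention), and training failed to converge.

Analysis:

In the cross-attention design, visual tokens are isolated inside the encoder and interact with the queries only through the cross-attention layers. That interaction is neither frequent nor deep enough.

With prefix concatenation, visual tokens stay active in every layer: at each layer they share one sequence with the causal queries, which makes information exchange more effective.

💡 Analogy: cross-attention is like working with a colleague in a different room and communicating only by email. Prefix concatenation is like sitting at the same desk, where you can turn and ask a question at any time. The communication efficiency is on a different level.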

Training Pipeline: Three Carefully Designed Stages


Stage 1: Encoder Pretraining (40k iterations)

Goal: teach the Vision Tokenizer and the LLM Encoder basic feature extraction, token compression, and token reordering.

Method:

Train with a language modeling objective (next token prediction)
Pair the encoder with a lightweight decoder for joint optimization
Two dataloaders: 768×768 and 1024×1024 resolution
Vision Tokenizer: initialized from DeepEncoder (not from scratch)
LLM Encoder: initialized from Qwen2-0.5B-base

Resources: 160 A100 GPUs, batch size 640, about 100M image-text pair samples

💡 Why initialize from Qwen2-0.5B?
Qwen2-0.5B's pretrained weights carry a great deal of knowledge about ordering, logic, and causality, all learned from text pretraining. That knowledge can transfer to vision tasks, in particular to the causal reasoning that visual token reordering needs. It is direct evidence that LLM pretraining is valuable for multimodal models.

Stage 2: Query Enhancement (15k iterations)

Goal: further strengthen the reordering ability of the causal flow queries while improving visual knowledge compression.

Method:

Freeze the Vision Tokenizer (SAM-base + Conv)
Optimize the LLM Encoder and the DeepSeek-3B Decoder together
Use the multi-crop strategy uniformly to handle multiple resolutions
4-stage pipeline parallelism

Resources: 160 GPUs (40GB each), 40 data parallel replicas, global batch size 1280

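Stage 3: Decoder Specialization (20k iterations)

Goal: help DeepSeek-3B better understand the visual tokens reordered by DeepEncoder V2.

Method:

Freeze all encoder parameters (both the Vision Tokenizer and the LLM Encoder)
Update only the DeepSeek-LLM parameters
Training runs >2× faster (no backprop through the encoder)

🎯 Division of labor across the three stages

Stage 1: "teach the encoder the basics": seeing, compressing, and a first pass at ordering

Stage 2: "strengthen the ordering": encoder and decoder learn together so the ordering becomes more accurate

Stage 3: "teach the decoder to read reordered tokens": the encoder is done, and only the decoder has to adapt to the new input format

This staged approach is meant to keep training stable and scaling efficient.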
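The schedule can be summarized as a small configuration sketch. The module names, the `Stage` dataclass, and the reading that the lightweight Stage 1 decoder is itself trained are assumptions made for illustration; only the iteration counts, the freeze/unfreeze pattern, and the Stage 2 parallelism figures come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    iterations: int
    trainable: frozenset  # modules that receive gradient updates


STAGES = [
    # Stage 1 pairs the encoder with a lightweight decoder (assumed trainable).
    Stage("encoder_pretraining", 40_000,
          frozenset({"vision_tokenizer", "llm_encoder", "light_decoder"})),
    # Stage 2 freezes the tokenizer and trains encoder + DeepSeek-3B decoder.
    Stage("query_enhancement", 15_000,
          frozenset({"llm_encoder", "decoder"})),
    # Stage 3 freezes the whole encoder and trains only the decoder.
    Stage("decoder_specialization", 20_000,
          frozenset({"decoder"})),
]


def is_frozen(stage, module):
    """True when `module` gets no gradient updates in `stage`."""
    return module not in stage.trainable


for stage in STAGES:
    updated = sorted(stage.trainable)
    print(f"{stage.name}: {stage.iterations} iterations, updates {updated}")

total_iterations = sum(stage.iterations for stage in STAGES)
print(total_iterations)  # 75000

# Stage 2 parallelism: 160 GPUs split into 4 pipeline stages.
gpus, pipeline_stages, global_batch = 160, 4, 1280
data_parallel_replicas = gpus // pipeline_stages
per_replica_batch = global_batch // data_parallel_replicas
print(data_parallel_replicas, per_replica_batch)  # 40 32
```

The 40 data parallel replicas quoted for Stage 2 fall out of dividing 160 GPUs by the 4 pipeline stages, which leaves 32 samples per replica in each global batch of 1280.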

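Experimental Results

OmniDocBench v1.5: Main Metrics

OmniDocBench v1.5 contains 1,355 document pages across 9 major categories (magazines, academic papers, research reports, and so on), in a mix of Chinese and English.

End-to-end Models Comparison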

Model	V-token (max)	Overall ↑	Text ED ↓	Formula CDM ↑	Table TEDS ↑	R-order ED ↓
GPT-4o	-	75.02%	0.217	79.70%	67.07%	0.148
InternVL3.5-241B	>7000	82.67%	0.142	87.23%	75.00%	0.125
Qwen2.5-VL-72B	>6000	87.02%	0.094	88.27%	82.15%	0.102
DeepSeek-OCR	1156	87.36%	0.073	84.14%	85.25%	0.085
Gemini-2.5 Pro	-	88.03%	0.075	85.82%	85.71%	0.097
Qwen3-VL-235B	>6000	89.15%	0.069	88.14%	86.21%	0.068
DeepSeek-OCR 2	1120	91.09%	0.048	90.31%	87.75%	0.057

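🚀 Striking numbers

Beats Qwen3-VL-235B, which uses >6000 tokens, with only 1120 tokens (91.09% vs 89.15%)

Only 3B parameters (500M active), roughly 78 times fewer than Qwen3-VL-235B

Against Gemini-2.5 Pro: Overall is 3.06 points higher and reading-order edit distance is 41% lower

The result suggests that semantic reordering plus an efficient architecture beats brute-force scaling of parameters and tokens

Improvement over the DeepSeek-OCR baseline:

Overall: +3.73 points
Text ED: -0.025 (34% better)
Formula CDM: +6.17 points
Reading order: -0.028 (33% better)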
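These deltas can be re-derived from the two rows of the comparison table. The sketch below does so; the dictionary layout and function names are hypothetical, and the numbers are copied from the table.

```python
# Scores copied from the end-to-end comparison table.
baseline = {"overall": 87.36, "text_ed": 0.073, "formula_cdm": 84.14, "r_order_ed": 0.085}
ocr2 = {"overall": 91.09, "text_ed": 0.048, "formula_cdm": 90.31, "r_order_ed": 0.057}


def point_gain(metric):
    """Absolute gain in percentage points for a higher-is-better score."""
    return round(ocr2[metric] - baseline[metric], 2)


def relative_drop(metric):
    """Percent reduction for a lower-is-better edit distance."""
    return round(100 * (baseline[metric] - ocr2[metric]) / baseline[metric])


print(point_gain("overall"))        # 3.73
print(point_gain("formula_cdm"))    # 6.17
print(relative_drop("text_ed"))     # 34
print(relative_drop("r_order_ed"))  # 33
```

Note the two kinds of change: Overall and Formula CDM move in percentage points, while the edit distances are reported as relative reductions.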

Direct Comparison with Gemini-3 Pro (Same Token Budget)

Under the same visual token budget (~1120):

Model	V-token max	Overall Edit Distance ↓
Gemini-3 Pro	1120	0.115
DeepSeek-OCR 2	1120	0.100 ✅

Under the same visual token budget, DeepSeek-OCR 2 parses documents more accurately than Gemini-3 Pro (overall edit distance 0.100 vs 0.115).

Performance Across Document Types

Document Type	DeepSeek-OCR Text / R-order	DeepSeek-OCR 2 Text / R-order	Change
PPT	0.052 / 0.052	0.031 / 0.025 ✅	Text -40%, R-order -52%
Academic Paper	0.028 / 0.021	0.013 / 0.013 ✅	Text -54%, R-order -38%
Colorful Textbook	0.130 / 0.125	0.053 / 0.066 ✅	Text -59%, R-order -47%
Exam Paper	0.074 / 0.083	0.047 / 0.048 ✅	Text -36%, R-order -42%
Note	0.145 / 0.089	0.068 / 0.035 ✅	Text -53%, R-order -61%
Newspaper ⚠️	0.131 / 0.217	0.139 / 0.176	Text +6%, R-order -19%

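Almost every category improves clearly, especially on the reading-order metric. The one exception is Newspaper, where Text ED regresses slightly (see the limitations analysis below).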

Production Readiness

Environment	Metric	DeepSeek-OCR	DeepSeek-OCR 2	Change
Online user images	Repetition rate	6.25%	4.17%	↓ 2.08 points
Pretraining PDFs	Repetition rate	3.69%	2.88%	↓ 0.81 points

💡 Why does repetition rate matter?
Production environments have no ground truth, so the repetition rate (how often the model outputs the same text over and over) serves as the primary quality metric. Repetition usually signals that the model is "unsure" or "lost". Causal flow gives the model a clearer reading path, so repetition drops.
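The article does not say how the repetition rate is computed. As one possible heuristic, and purely an assumption, an output could be flagged when some block of words repeats back to back several times, and the rate taken as the share of flagged outputs. The function names, thresholds, and sample strings below are all made up for illustration.

```python
def has_repetition_loop(text, ngram=5, min_repeats=3):
    """True if some block of `ngram` words occurs `min_repeats` times in a row."""
    words = text.split()
    span = ngram * min_repeats
    for start in range(len(words) - span + 1):
        first = words[start:start + ngram]
        if all(words[start + r * ngram:start + (r + 1) * ngram] == first
               for r in range(1, min_repeats)):
            return True
    return False


def repetition_rate(outputs, **kwargs):
    """Share of outputs flagged as containing a repetition loop."""
    if not outputs:
        return 0.0
    flagged = sum(has_repetition_loop(text, **kwargs) for text in outputs)
    return flagged / len(outputs)


# Hypothetical OCR outputs, one of which is stuck in a loop.
samples = [
    "Total revenue rose in the third quarter compared with the previous year",
    "the table shows the table shows the table shows the table shows",
    "Figure 2 compares reading order across document types",
    "Section 3 describes the encoder",
]

print(repetition_rate(samples, ngram=3, min_repeats=3))  # 0.25
```

A check like this needs no reference transcription, which is what makes a repetition-style metric usable on live traffic.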

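In-Depth Analysis

Why Can an LLM Serve as a Vision Encoder?

The paper points to three key factors. Let me go through them one by one:

Factor 1: Prefix concatenation design

Treating visual tokens as a prefix (instead of separating them behind cross-attention) keeps them active in every layer and supports effective exchange with the causal queries.

Counter-example: an mBART-style encoder-decoder with cross-attention → failed to converge. Once the visual tokens are isolated, their interaction with the queries is not deep enough.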

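Factor 2: Equal cardinality

Number of queries = number of visual tokens, which provides enough capacity for re-fixation.

Why is re-fixation needed?

Imagine reading a complex table. You will not memorize every number in a single glance. Instead you:

Glance at the overall structure of the table (queries 1-3)
Read the first row carefully (queries 4-10)
Go back to the header to confirm the column names (queries 11-13) ← re-fixation!
Continue with the second row...

With too few queries (for example BLIP-2's 32) there is no room to "go back and look again". Equal cardinality ensures there is enough capacity for this kind of re-examination.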

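Factor 3: Dual-stream attention

Visual tokens use bidirectional attention → global understanding is preserved (like CLIP)
Queries use causal attention → sequential reasoning is introduced (like an LLM)

This hybrid design combines the strengths of a ViT and an LLM decoder: global perception plus sequential reasoning.

🎯 The three factors in one line each

Prefix concatenation: "sitting at the same desk", so visual tokens and queries can interact at any time

Equal cardinality: "having enough time", so queries can revisit important regions

Dual-stream: "two skills", global understanding + sequential reasoning

All three are needed. Without the prefix design → no convergence. Without equal cardinality → not enough capacity. Without dual-stream → no sequential reasoning.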

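Why Not Break the Encoder-Decoder Paradigm?

Conventional VLM design follows an implicit assumption: the encoder must be bidirectional (like CLIP ViT), because vision tasks need global understanding.

DeepSeek-OCR 2 challenges that assumption: an encoder can have both a bidirectional part and a causal part. The results also show that adding causal reasoning does not hurt the vision encoder and instead improves document understanding substantially.

This may have far-reaching implications for future VLM architectures: the encoder does not have to be bidirectional throughout, and introducing causality in the right place can pay off.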

Hands-on Guide: How to Use DeepSeek-OCR 2

DeepSeek-OCR 2 is fully open source. You can run inference with vLLM or Transformers, and fine-tuning is supported as well.

Requirements

CUDA 11.8 + PyTorch 2.6.0
GPU: A100 40GB or better recommended (the model has 3B params and needs roughly 7GB of VRAM in bfloat16)
HuggingFace model: deepseek-ai/DeepSeek-OCR-2

Method 1: Transformers Inference (simplest)

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'

# Step 1: Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Step 2: Run inference
# Document OCR (with layout detection):
prompt = "<image>\n<|grounding|>Convert the document to markdown."

# Plain-text OCR (no layout):
# prompt = "<image>\nFree OCR."

image_file = 'your_document.jpg'
output_path = 'output/'

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,       # global view resolution
    image_size=768,       # local crop resolution
    crop_mode=True,       # enable multi-crop
    save_results=True
)

print(res)  # OCR result in Markdown format

Method 2: vLLM Inference (high throughput / production)

vLLM supports concurrent batch processing, which suits production deployment:

from vllm import LLM, SamplingParams

# Step 1: start the vLLM engine
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR-2",
    trust_remote_code=True,
    dtype="bfloat16",
    gpu_memory_utilization=0.9,
)

# Step 2: configure the sampling params
sampling_params = SamplingParams(
    temperature=0.0,      # OCR does not need randomness
    max_tokens=4096,
)

# Step 3: batch inference
# See DeepSeek-OCR2-vllm/run_dpsk_ocr2_pdf.py for PDF batch processing

⚠️ A note on vLLM versions
DeepSeek-OCR 2 requires vLLM 0.8.5 or later. If you are on CUDA 11.8, you need to download the specific wheel:

pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl

Method 3: Fine-tuning with Unsloth (custom data)

If you have your own PDFs or documents, you can fine-tune with Unsloth:

from unsloth import FastVisionModel

# Load the model in 4-bit
model, tokenizer = FastVisionModel.from_pretrained(
    "deepseek-ai/DeepSeek-OCR-2",
    load_in_4bit=True,         # 4-bit quantization to save memory
    use_gradient_checkpointing="unsloth",
)

# Add LoRA adapters
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

# Fine-tune on your data...

✅ Hands-on checklist

✅ Install: git clone + conda env + requirements

✅ Quick test: run one image through Transformers inference

✅ Production: switch to vLLM for batch processing

✅ Custom data: fine-tune with Unsloth

✅ Prompts: use <|grounding|>Convert the document to markdown. for documents and Free OCR. for plain text

Main Prompts Cheat Sheet

Use case	Prompt	Notes
Document OCR (with layout)	`<image>\n<\|grounding\|>Convert the document to markdown.`	Outputs Markdown with headings, tables, and formulas
Plain-text OCR	`<image>\nFree OCR.`	Extracts text only, ignoring layout

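Limitations and Future Directions

Current Limitations

Newspaper performance still has room to improve

Text ED is still >0.13 (and even regresses slightly). Possible reasons:

The visual token cap (1120) is not enough for extremely text-dense newspapers
The training data contains only 250k newspaper samples, too few for DeepEncoder V2 to learn this kind of layout

Possible fixes: increase the number of local crops (for example k=8 or more) and add more newspaper training data.

Single-pass reordering

The current design performs semantic reordering only once and cannot support multi-hop reordering or multiple re-examinations. Genuine multi-round "looking back" may require far more causal flow tokens than visual tokens.

Validated only on document OCR

The paper uses document parsing as its testbed. Whether visual causal flow works for general visual reasoning tasks (such as VQA or image captioning) has not yet been verified.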

Future Directions

Towards Genuine 2D Reasoning

The paper puts forward a bold hypothesis: 2D reasoning = a cascade of two 1D causal reasoners.

Achieving genuine 2D reasoning may require:

A number of causal flow tokens >> the number of original visual tokens
Multi-hop reordering mechanisms
Validation on general visual reasoning tasks

Towards Native Multimodality

DeepEncoder V2 opens up the possibility of an omni-modal encoder:


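The core idea: one LLM encoder, driven by different modality-specific learnable queries, could handle text, audio, vision, and other modalities. The modalities would share $W_k, W_v$, the attention mechanisms, and the FFNs; only the query embeddings would differ.

Advantages of this architecture:

One unified architecture for multiple modalities
Naturally inherits optimizations from the LLM community (MoE, efficient attention, and so on)
Parameter sharing, hence better efficiency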

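Technical Takeaways

1. Breaking the assumption that the encoder must be bidirectional

Conventional VLMs: Non-causal Encoder → Causal Decoder

DeepSeek-OCR 2: Causal Encoder → Causal Decoder

This design is causal end to end, which reduces the 2D→1D mismatch.

2. More visual tokens is not always better

DeepSeek-OCR 2 beats models that use >6000 tokens (Qwen3-VL-235B, InternVL3.5, and others) with just 1120.

🎯 Key insight: semantic reordering > brute-force token count
Feeding more tokens into the LLM does not automatically help. What matters is that the token order agrees with the semantic logic. 1120 well-ordered tokens beat 6000 poorly ordered ones.

3. The value of LLM pretraining for multimodal models

Qwen2-0.5B's pretrained weights provide a strong initialization, which suggests that knowledge learned during LLM training transfers to vision tasks. That adds evidence for the "use an LLM for everything" trend.

4. Document OCR is an ideal testbed

The paper uses document parsing as its main experimental setting because it:

Involves complex layouts (multiple columns, tables, formulas)
Requires sophisticated causal reasoning (reading order)
Has clear ground truth

It is a smart choice: the architecture is validated in a setting that is well defined yet challenging.

5. Simple methods are often the most effective

At its core, DeepSeek-OCR 2 is simple: a customized attention mask. There is no complicated multi-stage pipeline, no new loss function, and no fancy data augmentation. It concatenates visual tokens with learnable queries and controls the information flow with a carefully designed attention mask. That kind of elegance is a hallmark of good architecture design.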

Comparison with Other Work

vs. Traditional OCR Pipelines

Pipeline	V-tokens	Overall	Architecture
PaddleOCR-VL	-	92.86%	Multi-stage pipeline
MinerU2.5	-	90.67%	Multi-stage pipeline
DeepSeek-OCR 2	1120	91.09%	End-to-end VLM ✅

As an end-to-end model, DeepSeek-OCR 2 matches or exceeds most multi-stage pipelines. Only PaddleOCR-VL (multi-stage with specialized modules) is still 1.77 points higher, but DeepSeek-OCR 2's architecture is much simpler and has more scaling potential.

vs. 1.5-bit LLMs and Other Quantization Methods

DeepSeek-OCR 2 and quantization techniques (for example TurboQuant) are complementary:

DeepSeek-OCR 2: a better architecture reduces the number of visual tokens (1120 vs >6000)
TurboQuant: better quantization reduces the bit-width of each token
The two can be stacked: DeepSeek-OCR 2's KV cache could be compressed further with TurboQuant

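Summary

DeepSeek-OCR 2 is not just a SOTA model for document parsing. More importantly, it proposes a new paradigm for vision encoding.

Core Contributions

Visual Causal Flow: first demonstration that an LLM architecture can serve as a vision encoder
Dual-Stream Attention: combines bidirectional and causal attention through one customized mask
Cascade Causal Reasoning: decomposes 2D understanding into two 1D causal tasks
High Efficiency: reaches end-to-end SOTA with the fewest visual tokens (1120)
Open Source: fully open, with support for vLLM / Transformers / fine-tuning

Why Should You Care?

If you:

Work on Document AI / OCR → DeepSeek-OCR 2 is currently the strongest open-source end-to-end model
Do VLM research → visual causal flow is a new research direction
Build LLM pretraining data → DeepSeek-OCR 2 can produce high-quality training data efficiently
Work on multimodal AI → the pathway to an omni-modal encoder is worth watching

🎯 DeepSeek-OCR 2 in one sentence:
Use a customized attention mask to turn an LLM into a vision encoder, and teach the AI to read documents in "human reading order".

3B parameters and 1120 tokens beat 235B parameters and 6000 tokens.

Surprisingly simple, and effective enough to reshuffle the document OCR leaderboard.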

TL;DR

核心重點：

🎯 喺 OmniDocBench v1.5 達到 91.09% 準確度（End-to-end model SOTA）
🚀 Visual tokens 上限只需 1120（比同類模型少 >6000）
📖 Reading order 錯誤率降至 0.057（比 baseline 改善 33%）
🧠 首次驗證 LLM 架構可作為 vision encoder
💡 只用 3B 參數（500M active）就打贏 Qwen3-VL-235B
⚡ 同一個 encoder 架構未來可以做 omni-modal（text + image + audio）

背景：點解傳統 VLMs 有問題？

人類視覺 vs. 機器視覺

人類睇圖像唔係一個 pixel 一個 pixel 咁掃描，而係根據語義邏輯跳躍式咁睇。

例如你睇一份兩欄嘅學術論文：

👁️ 人類：先讀完左欄全部 → 再跳去右欄頂部 → 讀完右欄
🤖 傳統 VLM：由左上角開始 → 一行一行掃 → 左欄第一行接右欄第一行 → 完全亂晒

再例如睇一個螺旋形嘅文字：

👁️ 人類：眼球沿住螺旋嘅內在邏輯移動，每次 fixation 都因果相關
🤖 傳統 VLM：死板咁由左上 → 右下掃描，完全無視螺旋結構

💡 核心矛盾：LLM 係 1D sequential model，但圖像係 2D 結構。傳統做法將 2D patches 用固定嘅 raster-scan 順序「壓扁」成 1D 序列，呢個「壓扁」嘅方式引入咗一個完全唔合理嘅 inductive bias——假設咗「空間上相鄰 = 語義上相關」，但呢個假設對複雜 layout 嘅文件完全唔成立。

Raster Scan 嘅根本問題：用具體例子睇

假設你有一個 4×4 嘅圖像 grid，分成 16 個 patches。圖像入面有兩欄文字：

plain┌─────────────────────┐
│  A1  A2  │  B1  B2  │
│  A3  A4  │  B3  B4  │
│  A5  A6  │  B5  B6  │
│  A7  A8  │  B7  B8  │
└─────────────────────┘
  左欄 (A)     右欄 (B)

Raster scan 順序（傳統 VLM）：

A1 → A2 → B1 → B2 → A3 → A4 → B3 → B4 → ...

呢個順序嘅問題係 A2 之後就接住 B1——但 A2 同 B1 完全冇語義關係！A2 應該接 A3（同一欄嘅下一行），但 raster scan 強制將佢同 B1 擺埋一齊。

人類閱讀順序（語義正確）：

A1 → A2 → A3 → A4 → A5 → A6 → A7 → A8 → B1 → B2 → ...

⚠️ 呢個問題喺 LLM 入面特別嚴重
因為 LLM 用 causal attention（每個 token 只能 attend 到前面嘅 tokens），raster scan 順序意味住 B1 可以 attend 到 A2，但 A3 要等到好後面先出現。呢個令 LLM 完全無法建立同一欄內嘅因果關係。

問題有幾嚴重？睇返實驗數據：

DeepSeek-OCR（用 raster scan）嘅 reading order edit distance = 0.085

DeepSeek-OCR 2（用 causal flow）嘅 reading order edit distance = 0.057

改善咗 33%！呢個就係語義重排嘅威力。

傳統 VLM 嘅完整 Pipeline

傳統 vision-language models 嘅做法係：

Loading diagram...

將圖像切成 patches（例如 16×16）
強制以 top-left → bottom-right 順序排列
加上固定嘅 positional encoding（如 RoPE）
餵入 LLM 做 causal decoding

呢種做法有三大問題：

問題	具體表現	影響
❌ Spatial bias	強制嘅空間順序忽略咗語義關係	兩欄文件讀錯順序
❌ 2D→1D mismatch	2D 圖像被硬塞入 1D causal LLM	複雜 layout 理解力差
❌ Visual tokens 過多	大部分模型需要 >6000 tokens	Memory 同 latency 暴增

DeepEncoder V2：核心創新

整體架構

DeepSeek-OCR 2 嘅架構同傳統 VLM 有一個根本分別：佢嘅 encoder 本身就有 causal reasoning 能力。

Loading diagram...

🔑 同傳統 VLM 嘅根本分別

傳統 VLM：Non-causal Encoder（CLIP）→ Causal Decoder（LLM）

DeepSeek-OCR 2：Causal Encoder（DeepEncoder V2）→ Causal Decoder（DeepSeek-3B）

兩個 stage 都有 causal reasoning 能力——encoder 做「閱讀順序推理」，decoder 做「內容理解推理」。

組件 1：Vision Tokenizer（80M params）

Vision Tokenizer 嘅工作好簡單：將圖像壓縮成少量嘅 visual tokens。

架構：SAM-base（80M params）+ 兩層 Conv layers

壓縮比：16×

具體數字：

1024×1024 圖像 → 256 visual tokens
768×768 圖像 → 144 visual tokens

💡 點解唔直接用 patch embedding？
因為 16× 壓縮可以大幅減少後面 global attention 嘅計算成本同 activation memory。80M 參數同 LLM 嘅 text embedding（通常 ~100M params）差唔多，所以唔算 overhead。

組件 2：LLM-style Vision Encoder（500M params）

呢個係 DeepSeek-OCR 2 最核心嘅創新——用 Qwen2-0.5B 取代傳統嘅 CLIP ViT！

點解用 LLM 做 Vision Encoder？

傳統用 CLIP ViT 嘅問題係：CLIP 擅長提取 features，但唔識做 reasoning。佢用 bidirectional attention，所有 tokens 互相 attend，冇因果順序嘅概念。

LLM 天生就識做三件事：

Ordering（排序）：因為 causal attention 強制學習序列順序
Logic（邏輯）：Pretraining 學到咗大量因果推理能力
Causality（因果性）：每個 token 依賴前面嘅 tokens

🎯 一句話概括
CLIP 係一個「影相師」——拍到所有細節但唔識排序。

DeepEncoder V2 係一個「編輯」——唔只睇到所有嘢，仲識得決定「讀者應該先睇邊度」。

Dual-Stream Attention：點樣運作？

DeepEncoder V2 嘅 attention mechanism 分成兩個 stream：

plain[Visual Tokens (prefix)]  +  [Causal Flow Queries (suffix)]
         ↓                            ↓
  Bidirectional Attention       Causal Attention
  (每個 token 睇到所有          (每個 query 只睇到
   其他 visual tokens)           所有 visual tokens
                                 + 前面嘅 queries)

點解 visual tokens 用 bidirectional？

因為你需要每個 visual token 都有 full-image receptive field——睇到成張圖嘅所有部分。呢個同 CLIP ViT 嘅能力一樣，保留咗全局感知能力。

點解 causal flow queries 用 causal？

💡 直覺理解 Dual-Stream
Imagine 你係一個編輯，要幫一篇文章決定段落順序：

第一步（Bidirectional）：你先將所有段落全部讀一次，了解全局內容

第二步（Causal）：然後你開始排序——「開頭應該放背景介紹…嗯，睇過背景之後，下一段應該係方法論…睇過方法論之後，接住應該係實驗…」

每一步嘅決定都取決於前面已經做嘅決定——呢個就係 causal reasoning。

Attention Mask：用具體數字睇

假設有 $m = 4$ 個 visual tokens 同 $n = 4$ 個 causal flow queries。Attention mask $M$ 係一個 $8 \times 8$ 嘅矩陣：

M = \begin{bmatrix} \mathbf{1}_{4 \times 4} & \mathbf{0}_{4 \times 4} \\ \mathbf{1}_{4 \times 4} & \text{LowerTri}(4) \end{bmatrix}

展開嚟睇：

plainV1  V2  V3  V4 │ Q1  Q2  Q3  Q4
    ─────────────────────┼─────────────────
V1 │  1   1   1   1  │  0   0   0   0     ← V1 睇到所有 V，但睇唔到 Q
V2 │  1   1   1   1  │  0   0   0   0     ← V2 同上
V3 │  1   1   1   1  │  0   0   0   0     ← V3 同上
V4 │  1   1   1   1  │  0   0   0   0     ← V4 同上
    ─────────────────────┼─────────────────
Q1 │  1   1   1   1  │  1   0   0   0     ← Q1 睇到所有 V + 自己
Q2 │  1   1   1   1  │  1   1   0   0     ← Q2 睇到所有 V + Q1 + 自己
Q3 │  1   1   1   1  │  1   1   1   0     ← Q3 睇到所有 V + Q1,Q2 + 自己
Q4 │  1   1   1   1  │  1   1   1   1     ← Q4 睇到所有 V + 所有前面嘅 Q

🔑 Mask 嘅四個象限

左上（V→V）：全 1 = bidirectional attention，visual tokens 互相睇到

右上（V→Q）：全 0 = visual tokens 睇唔到 queries（queries 唔影響 visual tokens）

左下（Q→V）：全 1 = 每個 query 都可以 attend 到所有 visual tokens

右下（Q→Q）：下三角 = causal attention，每個 query 只 attend 到前面嘅 queries

Attention Mask 嘅 Python 實現：

pythonimport torch

def create_deepencoder_v2_mask(m, n):
    """
    Create DeepEncoder V2 attention mask.
    m: number of visual tokens
    n: number of causal query tokens (n == m)
    """
    # Upper-left: visual tokens see each other (bidirectional)
    visual_mask = torch.ones(m, m)
    
    # Upper-right: visual tokens cannot see queries
    upper_right = torch.zeros(m, n)
    
    # Lower-left: queries can attend to ALL visual tokens
    lower_left = torch.ones(n, m)
    
    # Lower-right: causal mask for queries (lower triangular)
    lower_right = torch.tril(torch.ones(n, n))
    
    # Assemble the full mask
    top = torch.cat([visual_mask, upper_right], dim=1)
    bottom = torch.cat([lower_left, lower_right], dim=1)
    mask = torch.cat([top, bottom], dim=0)
    
    return mask  # shape: (m+n, m+n)

# Example: 256 visual tokens + 256 queries
mask = create_deepencoder_v2_mask(256, 256)  # 512 x 512
print(mask.shape)  # torch.Size([512, 512])

組件 3：Causal Flow Queries

Causal flow queries 係 DeepEncoder V2 嘅「秘密武器」——一組 learnable embeddings，負責將 visual tokens 重新排序。

設計決策 1：Queries 數量 = Visual Tokens 數量

呢個係一個刻意嘅設計： $n = m$ （queries 同 visual tokens 一樣多）。

點解唔用更少嘅 queries（好似 BLIP-2 嘅 32 個）？

🎯 同其他 Parallelized Query 設計嘅對比

Model Query 數量 Attention 目的
DETR 100 object queries Bidirectional Object detection
BLIP-2 Q-former 32 learnable queries Bidirectional Token compression
DeepEncoder V2 n queries（n = visual tokens） Causal Token reordering

DeepEncoder V2 係首個將 causal attention 應用於 vision encoder queries 嘅架構。

Model	Query 數量	Attention	目的
DETR	100 object queries	Bidirectional	Object detection
BLIP-2 Q-former	32 learnable queries	Bidirectional	Token compression
DeepEncoder V2	n queries（n = visual tokens）	Causal	Token reordering

設計決策 2：Multi-Crop 策略

唔同大小嘅圖像會產生唔同數量嘅 tokens：

Global view（必定有）：

Resolution：1024×1024
Visual tokens：256
Query embeddings：query_global（256 個）

Local crops（0-6 個）：

Resolution：768×768 each
Visual tokens：每個 144
Query embeddings：query_local（144 個，所有 local views 共用）
只有當圖像嘅寬或高 ≥768 先會 crop

Token count 計算：

\text{Total tokens} = k \times 144 + 256 \quad (k = 0 \text{ to } 6)

Local crops (k)	計算	Total tokens	適用場景
0	0×144 + 256	256（最少）	小圖（<768×768）
1	1×144 + 256	400	中圖
3	3×144 + 256	688	大圖
6	6×144 + 256	1120（最多）	超大 / 複雜文件

💡 1120 tokens 嘅設計意義

比 DeepSeek-OCR 嘅 1156 tokens（Gundam mode）少 36 個

同 Gemini-3 Pro 嘅 maximum visual token budget 一樣

但 Qwen3-VL-235B 需要 >6000 tokens！

用 1120 tokens 就做到 SOTA，證明咗 語義重排 >> 暴力增加 token 數量。

Two-Stage Cascade Causal Reasoning：點解呢個 Decomposition Work？

DeepSeek-OCR 2 嘅完整 pipeline 可以寫成一條公式：

\mathbf{O} = \mathcal{D}\left(\pi_Q\left(\mathcal{T}^L\left(\mathcal{E}(\mathbf{I}) \oplus \mathbf{Q}_0; \mathbf{M}\right)\right)\right)

等我逐個符號拆解：

符號	意思	對應組件
$\mathbf{I}$	輸入圖像 $H \times W \times 3$	原始圖片
$\mathcal{E}$	Vision tokenizer	SAM-base + 2 Conv（80M params）
$\mathbf{Q}_0$	Learnable causal query embeddings	n 個可學習嘅 queries
$\oplus$	Sequence concatenation	Visual tokens 做 prefix，queries 做 suffix
$\mathcal{T}^L$	L-layer Transformer with masked attention $\mathbf{M}$	Qwen2-0.5B（500M params）
$\pi_Q$	Projection operator：只抽取後半部分（queries）	丟棄 visual tokens，只保留 causal flow outputs
$\mathcal{D}$	Language decoder	DeepSeek-3B MoE（500M active params）

兩階段因果推理

呢條 pipeline 自然分成兩個 causal reasoning stage：

Loading diagram...

Encoder stage：透過 causal queries 對 visual tokens 做語義重排——決定「先讀邊度，後讀邊度」
Decoder stage：LLM 對已重排嘅 visual tokens 做 autoregressive reasoning——理解內容、生成輸出

🔑 點解呢個 Decomposition Work？
論文提出一個大膽嘅假設：2D 理解 = 兩個互補嘅 1D causal reasoning 嘅級聯。

第一個 1D reasoner（encoder）：將 2D spatial layout 轉換成 1D semantic order

第二個 1D reasoner（decoder）：喺 1D semantic order 上做 content understanding

兩者嘅 reasoning 方向係「正交」嘅——一個處理 layout，一個處理 content。呢個 decomposition 避免咗一個 single model 要同時處理 2D layout + content understanding 嘅複雜性。

點解 Prefix Concatenation 而唔係 Cross-Attention？

原因分析：

喺 cross-attention 設計入面，visual tokens 被隔離喺 encoder 入面，同 queries 嘅交互只透過 cross-attention layers——呢個交互唔夠頻繁、唔夠深入。

💡 類比：Cross-attention 就好似你同同事喺兩個唔同嘅房間，只能透過 email 溝通。Prefix concatenation 就好似你哋坐喺同一張桌，隨時可以轉頭問嘢——溝通效率完全唔同 level。

訓練流程：三個精心設計嘅 Stage

Loading diagram...

Stage 1: Encoder Pretraining（40k iterations）

目標：教識 Vision Tokenizer 同 LLM Encoder 基本嘅 feature extraction、token compression 同 token reordering 能力。

做法：

用 language modeling objective（next token prediction）訓練
配一個輕量級 decoder 做 joint optimization
兩個 dataloader：768×768 同 1024×1024 resolution
Vision Tokenizer：從 DeepEncoder 初始化（唔係從零開始）
LLM Encoder：從 Qwen2-0.5B-base 初始化

資源：160 A100 GPUs，batch size 640，約 100M image-text pair samples

💡 點解用 Qwen2-0.5B 做初始化？
Qwen2-0.5B 嘅 pretrained weights 包含咗大量關於 ordering、logic、causality 嘅知識——呢啲都係 text pretraining 學到嘅。呢啲知識可以 transfer 到 vision tasks，特別係 visual token reordering 需要嘅因果推理能力。呢個係「LLM pretraining 對 multimodal 有價值」嘅直接驗證。

Stage 2: Query Enhancement（15k iterations）

目標：進一步強化 causal flow queries 嘅 reordering 能力，同時提升 visual knowledge compression。

做法：

凍結 Vision Tokenizer（SAM-base + Conv）
同時優化 LLM Encoder + DeepSeek-3B Decoder
統一用 multi-crop 策略處理多種 resolution
4-stage pipeline parallelism

資源：160 GPUs（40GB each），40 data parallel replicas，global batch size 1280

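Stage 3: Decoder Specialization (20k iterations)

Goal: help DeepSeek-3B better understand the visual tokens reordered by DeepEncoder V2.

Method:

Freeze all encoder parameters (both the Vision Tokenizer and the LLM Encoder)
Update only the DeepSeek-LLM parameters
Training runs more than 2× faster (no backpropagation through the encoder)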

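🎯 Division of Labor Across the Three Stages

Stage 1: "teach the encoder the basics": how to see, compress, and produce a first-pass ordering.

Stage 2: "strengthen the ordering": the encoder and decoder learn together so that the ordering becomes more accurate.

Stage 3: "teach the decoder to read the reordered tokens": the encoder is done, and only the decoder needs to adapt to the new input format.

This staged approach keeps training stable and scaling efficient.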
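
One way to keep the three stages straight is to write the freeze/train schedule down as data. The sketch below is only a summary of the stage descriptions above. The component names (`vision_tokenizer`, `llm_encoder`, `lightweight_decoder`, `deepseek_3b_decoder`) are labels chosen for this example, not identifiers from the released code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    iterations: int
    trainable: FrozenSet[str]
    frozen: FrozenSet[str]


SCHEDULE: Tuple[Stage, ...] = (
    # Stage 1: tokenizer + encoder trained jointly with a lightweight decoder.
    Stage("encoder_pretraining", 40_000,
          trainable=frozenset({"vision_tokenizer", "llm_encoder", "lightweight_decoder"}),
          frozen=frozenset()),
    # Stage 2: tokenizer frozen; encoder and DeepSeek-3B decoder optimized together.
    Stage("query_enhancement", 15_000,
          trainable=frozenset({"llm_encoder", "deepseek_3b_decoder"}),
          frozen=frozenset({"vision_tokenizer"})),
    # Stage 3: the whole encoder is frozen; only the decoder is updated.
    Stage("decoder_specialization", 20_000,
          trainable=frozenset({"deepseek_3b_decoder"}),
          frozen=frozenset({"vision_tokenizer", "llm_encoder"})),
)


def is_trainable(stage: Stage, component: str) -> bool:
    """True if the component receives gradient updates in this stage."""
    return component in stage.trainable


total_iterations = sum(stage.iterations for stage in SCHEDULE)
print(total_iterations)  # 75000
```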

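Experimental Results

OmniDocBench v1.5: Main Metrics

OmniDocBench v1.5 contains 1,355 document pages across 9 major categories (magazines, academic papers, research reports, and so on), in a mix of Chinese and English.

Comparison of End-to-end Models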

Model	V-token (max)	Overall ↑	Text ED ↓	Formula CDM ↑	Table TEDS ↑	R-order ED ↓
GPT-4o	-	75.02%	0.217	79.70%	67.07%	0.148
InternVL3.5-241B	>7000	82.67%	0.142	87.23%	75.00%	0.125
Qwen2.5-VL-72B	>6000	87.02%	0.094	88.27%	82.15%	0.102
DeepSeek-OCR	1156	87.36%	0.073	84.14%	85.25%	0.085
Gemini-2.5 Pro	-	88.03%	0.075	85.82%	85.71%	0.097
Qwen3-VL-235B	>6000	89.15%	0.069	88.14%	86.21%	0.068
DeepSeek-OCR 2	1120	91.09%	0.048	90.31%	87.75%	0.057

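🚀 Striking Numbers

With 1120 tokens it beats Qwen3-VL-235B, which uses more than 6000 tokens (91.09% vs. 89.15%).

It has only 3B parameters (500M active), roughly 78× fewer than Qwen3-VL-235B.

Against Gemini-2.5 Pro, Overall is 3.06 points higher and the reading-order edit distance is 41% lower.

These results indicate that semantic reordering plus an efficient architecture can beat brute-force scaling of parameters and tokens.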

Improvement over the DeepSeek-OCR baseline:

Overall: +3.73 points
Text ED: -0.025 (34% lower)
Formula CDM: +6.17 points
Reading order ED: -0.028 (33% lower)
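
The deltas above follow directly from the two DeepSeek rows of the table. The short check below recomputes them: score metrics are compared as absolute differences, and edit distances as relative reductions against the baseline. The dictionary keys and the helper name `relative_reduction` are made up for this example.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100.0


# Values copied from the DeepSeek-OCR and DeepSeek-OCR 2 rows of the table above.
baseline = {"overall": 87.36, "text_ed": 0.073, "formula_cdm": 84.14, "r_order_ed": 0.085}
ocr2 = {"overall": 91.09, "text_ed": 0.048, "formula_cdm": 90.31, "r_order_ed": 0.057}

overall_gain = round(ocr2["overall"] - baseline["overall"], 2)
cdm_gain = round(ocr2["formula_cdm"] - baseline["formula_cdm"], 2)
text_ed_drop = round(relative_reduction(baseline["text_ed"], ocr2["text_ed"]), 1)
r_order_drop = round(relative_reduction(baseline["r_order_ed"], ocr2["r_order_ed"]), 1)

print(overall_gain, cdm_gain)      # 3.73 6.17
print(text_ed_drop, r_order_drop)  # 34.2 32.9
```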

Head-to-Head with Gemini-3 Pro (Same Token Budget)

Under the same visual token budget (~1120):

Model	V-token max	Overall Edit Distance ↓
Gemini-3 Pro	1120	0.115
DeepSeek-OCR 2	1120	0.100 ✅

With the same visual token budget, DeepSeek-OCR 2 reaches a lower overall edit distance than Gemini-3 Pro (0.100 vs. 0.115).

Performance Across Document Types

Document Type	DeepSeek-OCR Text / R-order	DeepSeek-OCR 2 Text / R-order	Change
PPT	0.052 / 0.052	0.031 / 0.025 ✅	Text -40%, R-order -52%
Academic Paper	0.028 / 0.021	0.013 / 0.013 ✅	Text -54%, R-order -38%
Colorful Textbook	0.130 / 0.125	0.053 / 0.066 ✅	Text -59%, R-order -47%
Exam Paper	0.074 / 0.083	0.047 / 0.048 ✅	Text -36%, R-order -42%
Note	0.145 / 0.089	0.068 / 0.035 ✅	Text -53%, R-order -61%
Newspaper ⚠️	0.131 / 0.217	0.139 / 0.176	Text +6%, R-order -19%

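Almost every category improves markedly, especially on the reading-order metric. The one exception is Newspaper, whose Text ED gets slightly worse (see the limitations analysis below).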

Production Readiness

Environment	Metric	DeepSeek-OCR	DeepSeek-OCR 2	Change
Online user images	Repetition rate	6.25%	4.17%	↓ 2.08 points
Pretraining PDFs	Repetition rate	3.69%	2.88%	↓ 0.81 points

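💡 Why Does Repetition Rate Matter?
Production environments have no ground truth, so repetition rate (how often the model outputs the same passage repeatedly) serves as the primary quality metric. Repetition is usually a sign that the model is "unsure" or "lost". Causal flow gives the model a clearer reading path, which is consistent with the drop in repetition.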
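
The source does not spell out how repetition is detected, so the following is only one plausible way to compute such a rate without ground truth. It assumes an output counts as "repetitive" when any word n-gram occurs more than a threshold number of times. The names `has_repetition` and `repetition_rate`, and the default `ngram` and `max_repeats` values, are illustrative choices, not the paper's definition.

```python
from collections import Counter
from typing import Iterable


def has_repetition(text: str, ngram: int = 8, max_repeats: int = 3) -> bool:
    """Flag an output if any word n-gram appears more than `max_repeats` times."""
    words = text.split()
    if len(words) < ngram:
        return False
    counts = Counter(tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1))
    return max(counts.values()) > max_repeats


def repetition_rate(outputs: Iterable[str], ngram: int = 8, max_repeats: int = 3) -> float:
    """Fraction of outputs flagged as repetitive (0.0 for an empty batch)."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    flagged = sum(has_repetition(o, ngram, max_repeats) for o in outputs)
    return flagged / len(outputs)


batch = [
    "the quick brown fox jumps over the lazy dog",  # clean output
    "total due " * 20,                              # degenerate loop
]
print(repetition_rate(batch, ngram=4, max_repeats=3))  # 0.5
```

Whitespace tokenization is a simplification. For Chinese or mixed-language pages, character-level n-grams would be a more sensible unit.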

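In-Depth Analysis

Why Can an LLM Work as a Vision Encoder?

The paper identifies three key factors. Let me explain them one by one.

Factor 1: Prefix Concatenation

Visual tokens are fed in as a prefix (rather than being separated out behind cross-attention), which keeps them active in every layer and promotes effective exchange with the causal queries.

Counterexample: an mBART-style encoder-decoder with cross-attention fails to converge, because once the visual tokens are isolated, their interaction with the queries is not deep enough.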

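Factor 2: Equal Cardinality

Number of queries = number of visual tokens, which provides enough capacity for re-fixation.

Why is re-fixation needed?

Imagine reading a complex table. You will not memorize every number in a single glance. Instead you:

First scan the overall structure of the table (queries 1-3)
Read the first row carefully (queries 4-10)
Go back to the header to confirm the column names (queries 11-13) ← re-fixation!
Continue with the second row...

If there are too few queries (for example, the 32 used by BLIP-2), there is no room to "go back and look again". Equal cardinality guarantees enough capacity for this kind of re-examination.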

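Factor 3: Dual-Stream Attention

Visual tokens use bidirectional attention → preserves global understanding (like CLIP)
Queries use causal attention → introduces sequential reasoning (like an LLM)

This hybrid design combines the strengths of a ViT and an LLM decoder: global perception plus sequential reasoning.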
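
A minimal sketch of an attention mask that would produce this dual-stream behavior. It assumes the sequence is laid out as [visual tokens | queries] and that visual tokens do not attend to the queries that follow them. That second point is an assumption drawn from the prefix design rather than something this article states. The function name `dual_stream_mask` is hypothetical.

```python
from typing import List


def dual_stream_mask(num_visual: int, num_queries: int) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True when position i may attend to position j."""
    total = num_visual + num_queries
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            if i < num_visual:
                # Visual token: bidirectional over all visual tokens, blind to queries (assumed).
                row.append(j < num_visual)
            else:
                # Query: sees every visual token, plus queries up to and including itself.
                row.append(j < num_visual or j <= i)
        mask.append(row)
    return mask


for row in dual_stream_mask(num_visual=3, num_queries=3):
    print("".join("1" if allowed else "0" for allowed in row))
# 111000
# 111000
# 111000
# 111100
# 111110
# 111111
```

The top-left block is the bidirectional (ViT-like) part, and the bottom-right triangle is the causal (LLM-like) part. Both live in one mask, so a standard decoder-only transformer can run it without architectural changes.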

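🎯 The Three Factors in One Line Each

Prefix concatenation: "sitting at the same desk". Visual tokens and queries can interact at any time.

Equal cardinality: "having enough time". Queries can fixate repeatedly on important regions.

Dual-stream: "two skills". Global understanding plus sequential reasoning.

All three are required. Without the prefix design, training fails to converge. Without equal cardinality, capacity is insufficient. Without dual-stream attention, there is no sequential reasoning.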

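Why Not Break the Encoder-Decoder Paradigm?

Traditional VLM design follows an implicit assumption: the encoder must be bidirectional (like the CLIP ViT), because vision tasks need global understanding.

DeepSeek-OCR 2 keeps the encoder-decoder split but relaxes that assumption: its encoder retains bidirectional attention over the visual tokens and adds causal attention over the queries.

This insight could have far-reaching influence on future VLM architecture design: the encoder does not need to be entirely bidirectional, and introducing causality where appropriate can bring large gains.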

Implementation Guide: How to Use DeepSeek-OCR 2

DeepSeek-OCR 2 is fully open source. You can run inference with vLLM or Transformers, and fine-tuning is supported as well.

Requirements

CUDA 11.8 + PyTorch 2.6.0
GPU: A100 40GB or better recommended (the model has 3B params, so bfloat16 needs roughly 7GB of VRAM)
HuggingFace model: deepseek-ai/DeepSeek-OCR-2
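
As a back-of-the-envelope check on the VRAM figure, the weights alone can be estimated from the parameter count and the bytes per parameter. This sketch counts weights only. Activations, the KV cache, and framework overhead come on top, which is presumably what accounts for the gap between this estimate and the ~7GB quoted above. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


bf16_weights = weight_memory_gb(3e9, 2)  # bfloat16 uses 2 bytes per parameter
fp32_weights = weight_memory_gb(3e9, 4)  # float32 uses 4 bytes per parameter
print(bf16_weights, fp32_weights)  # 6.0 12.0
```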

Method 1: Transformers Inference (Simplest)

```python
from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'

# Step 1: Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Step 2: Run inference
# Document OCR (with layout detection):
prompt = "<image>\n<|grounding|>Convert the document to markdown."

# Plain-text OCR (no layout):
# prompt = "<image>\nFree OCR."

image_file = 'your_document.jpg'
output_path = 'output/'

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,       # global view resolution
    image_size=768,       # local crop resolution
    crop_mode=True,       # enable multi-crop
    save_results=True
)

print(res)  # OCR result in Markdown format
```

Method 2: vLLM Inference (High Throughput / Production)

vLLM supports concurrent batch processing, which suits production deployment:

```python
from vllm import LLM, SamplingParams

# Step 1: Start the vLLM engine
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR-2",
    trust_remote_code=True,
    dtype="bfloat16",
    gpu_memory_utilization=0.9,
)

# Step 2: Configure sampling params
sampling_params = SamplingParams(
    temperature=0.0,      # OCR needs no randomness
    max_tokens=4096,
)

# Step 3: Batch inference
# See DeepSeek-OCR2-vllm/run_dpsk_ocr2_pdf.py for PDF batch processing
```

⚠️ vLLM Version Note
DeepSeek-OCR 2 requires vLLM 0.8.5 or later. If you are on CUDA 11.8, you need to download the specific wheel:

pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
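
Step 3 of the vLLM example above is left as a pointer to the repository script. As a rough sketch of what batch inference could look like, the helper below builds one request per image in vLLM's multimodal input format (a dict with `prompt` and `multi_modal_data`). Whether this exact path works for DeepSeek-OCR 2 depends on the vLLM build you install, so treat it as an assumption and prefer the repository script if the two disagree. `build_requests` and `load_rgb` are hypothetical helper names.

```python
from typing import Any, Callable, Dict, Iterable, List

PROMPT = "<image>\n<|grounding|>Convert the document to markdown."


def load_rgb(path: str) -> Any:
    """Open an image file as RGB. Pillow is imported lazily so the helper stays importable without it."""
    from PIL import Image
    return Image.open(path).convert("RGB")


def build_requests(
    image_paths: Iterable[str],
    prompt: str = PROMPT,
    loader: Callable[[str], Any] = load_rgb,
) -> List[Dict[str, Any]]:
    """Build one vLLM multimodal request per image, all sharing the same prompt."""
    return [
        {"prompt": prompt, "multi_modal_data": {"image": loader(path)}}
        for path in image_paths
    ]


# Usage with the `llm` and `sampling_params` objects from the example above
# (not executed here, since it needs a GPU and the model weights):
#
#   requests = build_requests(["page_001.jpg", "page_002.jpg"])
#   outputs = llm.generate(requests, sampling_params)
#   for out in outputs:
#       print(out.outputs[0].text)
```

Passing the whole list to a single `generate` call lets vLLM schedule the pages together, which is where the throughput gain over one-image-at-a-time inference comes from.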

Method 3: Fine-tuning with Unsloth (Custom Data)

If you have your own PDF / document data, you can fine-tune with Unsloth:

```python
from unsloth import FastVisionModel

# Load model with LoRA
model, tokenizer = FastVisionModel.from_pretrained(
    "deepseek-ai/DeepSeek-OCR-2",
    load_in_4bit=True,         # 4-bit quantization saves memory
    use_gradient_checkpointing="unsloth",
)

# Add LoRA adapters
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

# Fine-tune on your data...
```

✅ Implementation Checklist

✅ Install: git clone + conda env + requirements

✅ Quick test: run one image through Transformers inference

✅ Production: switch to vLLM for batch processing

✅ Custom data: fine-tune with Unsloth

✅ Prompts: use <|grounding|>Convert the document to markdown. for documents and Free OCR. for plain text

Main Prompts Cheat Sheet

Use case	Prompt	Notes
Document OCR (with layout)	`<image>\n<\|grounding\|>Convert the document to markdown.`	Outputs Markdown with headings, tables, and formulas
Plain-text OCR	`<image>\nFree OCR.`	Extracts text only and ignores layout

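Limitations and Future Directions

Current Limitations

Newspaper performance still has room to improve

Text ED is still above 0.13 (it even regresses slightly, from 0.131 to 0.139). Possible reasons:

The visual token cap (1120) is not enough for text-super-rich newspapers
The training data contains only 250k newspaper samples, too few for DeepEncoder V2 to learn this kind of layout

Possible fixes: increase the number of local crops (for example, k=8 or more) and add more newspaper training data.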
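
To see what a larger k would cost in tokens, the sketch below estimates the visual token budget from the two resolutions used in the inference example (`base_size=1024`, `image_size=768`). The 16-pixel patch size, the 16× token compression, and the current cap of 6 local crops are assumptions that are not stated in this section. They are chosen because they reproduce the 1120-token cap exactly.

```python
def tokens_per_view(side_px: int, patch_px: int = 16, compression: int = 16) -> int:
    """Visual tokens for one square view: patch grid size divided by the compression ratio."""
    patches = (side_px // patch_px) ** 2
    return patches // compression


def visual_token_budget(num_local_crops: int, base_size: int = 1024, image_size: int = 768) -> int:
    """One global view plus `num_local_crops` local crops."""
    return tokens_per_view(base_size) + num_local_crops * tokens_per_view(image_size)


print(tokens_per_view(1024), tokens_per_view(768))  # 256 144
print(visual_token_budget(6))                       # 1120
print(visual_token_budget(8))                       # 1408
```

Under these assumptions, moving from 6 to 8 local crops raises the cap from 1120 to 1408 tokens, which is still far below the >6000 tokens used by the larger models in the comparison table.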

Single-Pass Reordering

Validated Only on Document OCR

The paper chose document parsing as its testbed, but the effectiveness of visual causal flow on general visual reasoning tasks (such as VQA and image captioning) has not been verified yet.

Future Directions

Towards Genuine 2D Reasoning

The paper proposes a bold hypothesis: 2D reasoning = a cascade of two 1D causal reasoners.

Achieving genuine 2D reasoning may require:

Far more causal flow tokens than original visual tokens
Multi-hop reordering mechanisms
Validation on general visual reasoning tasks

Towards Native Multimodality

DeepEncoder V2 opens up the possibility of an omni-modal encoder.

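Advantages of this architecture:

A unified architecture that handles multiple modalities
Naturally inherits optimizations from the LLM community (MoE, efficient attention, and so on)
Parameter sharing, which makes it more efficient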

Technical Takeaways

1. Dropping the Assumption That the Encoder Must Be Bidirectional

Traditional VLMs: Non-causal Encoder → Causal Decoder

DeepSeek-OCR 2: Causal Encoder → Causal Decoder

This design is closer to end-to-end causal and reduces the 2D→1D mismatch.

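2. More Visual Tokens Is Not Always Better

With 1120 tokens, DeepSeek-OCR 2 beats models that use more than 6000 tokens (Qwen3-VL-235B, InternVL3.5, and others).

🎯 Key insight: semantic reordering > brute-force token scaling
Feeding more tokens into the LLM does not automatically help. What matters is that the token order matches the semantic logic. 1120 well-ordered tokens beat 6000 poorly ordered ones.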

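3. The Value of LLM Pretraining for Multimodal Models

4. Document OCR Is an Ideal Testbed

The paper chose document parsing as its main experimental setting because it:

Contains complex layouts (multiple columns, tables, formulas)
Requires sophisticated causal reasoning (reading order)
Has clear ground truth

It is a smart choice: the architecture is validated in a well-defined yet challenging setting.

5. Simple Methods Are Often the Most Effective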

Comparison with Other Work

vs. Traditional OCR Pipelines

Pipeline	V-tokens	Overall	Architecture
PaddleOCR-VL	-	92.86%	Multi-stage pipeline
MinerU2.5	-	90.67%	Multi-stage pipeline
DeepSeek-OCR 2	1120	91.09%	End-to-end VLM ✅

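vs. 1.5-bit LLMs and Other Quantization Methods

DeepSeek-OCR 2 and quantization techniques (such as TurboQuant) are complementary:

DeepSeek-OCR 2: uses a better architecture to reduce the number of visual tokens (1120 vs. >6000)
TurboQuant: uses better quantization to reduce the bit-width of each token
The two can stack: the KV cache of DeepSeek-OCR 2 could be compressed further with TurboQuant.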

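Summary

DeepSeek-OCR 2 is more than a SOTA document parsing model. More importantly, it proposes a new paradigm for vision encoding.

Core Contributions

Visual Causal Flow: the first demonstration that an LLM architecture can serve as a vision encoder
Dual-Stream Attention: combines bidirectional and causal attention, implemented with a single customized mask
Cascade Causal Reasoning: decomposes 2D understanding into two 1D causal tasks
High Efficiency: reaches end-to-end SOTA with the fewest visual tokens (1120)
Open Source: fully open source, with support for vLLM, Transformers, and fine-tuning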

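Why Should You Care?

If you work on:

Document AI / OCR → DeepSeek-OCR 2 is currently the strongest open-source end-to-end model
VLM research → visual causal flow is a brand-new research direction
LLM pretraining data → DeepSeek-OCR 2 can efficiently produce high-quality training data
Multimodal AI → the pathway to an omni-modal encoder is worth watching

🎯 DeepSeek-OCR 2 in one sentence:
Use a customized attention mask to turn an LLM into a vision encoder, and teach the AI to read documents in "human reading order".

3B parameters and 1120 tokens beat 235B parameters and 6000+ tokens.

Surprisingly simple, and effective enough to rewrite the document OCR leaderboard.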
