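Glyph and Vist: Why Does Turning Text into Images Save Tokens?

What if I told you that rendering text as an image and feeding it to a vision encoder can cost fewer tokens than feeding in the text tokens directly? Would you believe it?

It sounds counterintuitive. Isn't text already the most efficient representation? Why render it as an image and spend visual tokens on it?

Two recent papers (Glyph and Vist) show that text-to-image compression does work in specific scenarios. This post looks closely at how the approach works, where it applies, and how it fundamentally differs from DeepSeek OCR.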

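TL;DR
Where the Question Came From: A Contradictory Finding
Vist: Vision-Centric Token Compression
Glyph: Visual-Text Compression for Context Windows
Technical Deep Dive: Why Does This Work?
The Fundamental Difference from DeepSeek OCR
When to Use It and When Not To
Experimental Results and Performance
Technical Takeaways
Summary
Related Resources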

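TL;DR

Key findings:

🖼️ Text → image compression is feasible: Glyph and Vist show that rendering text as images can reduce token counts
🎯 Key scenario: distant, low-salience context
⚡ Slow-fast architecture: nearby context stays as text tokens (precise), distant context becomes visual tokens (compressed)
❌ Not suitable for: the current conversation, or content that needs exact reasoning
🔄 Different from DeepSeek OCR: DeepSeek OCR processes documents that are already images, while these methods actively render text into images

Results in practice:

Vist: roughly 10× compression of distant text, extending the context window to 100k+ tokens
Glyph: uses genetic search over rendering configurations to find the best trade-off between compression and accuracy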

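Where the Question Came From: A Contradictory Finding

The Initial Question

A natural question comes up when discussing visual token compression in DeepSeek OCR:

❓ Core question
If visual tokens can compress a document 10-20×, why not convert all text (system prompts, user input, conversation history) into images and compress that too?

The first answer was: "That won't work. Plain text has no visual structure, so rendering it as an image only adds tokens."

That answer turns out to be only partly right.

Resolving the Contradiction

The DeepSeek OCR blog post mentions a "visual forgetting" mechanism, which hints that old conversations could be compressed into low-resolution visual tokens. But that reading is a misunderstanding: the DeepSeek OCR paper never proposes rendering text conversations into images.

The idea is actually implemented by two recent papers:

Vist (Vision-centric Token Compression), arXiv 2502.00791
Glyph (Scaling Context Windows via Visual-Text Compression), GLM Team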

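Vist: Vision-Centric Token Compression

Core Idea: Imitating Human Reading

Vist proposes a slow-fast compression framework inspired by how people read:

How a person reads a long document:
  Current paragraph → read carefully (slow path)
  Earlier content → skim quickly (fast path)
  Distant content → keep only a rough impression

Vist applies this pattern to LLM context management:

🎯 Slow-fast architecture
Slow path (text tokens):

Handles the proximal window (the most recent conversation or question)

Uses full text tokens to keep fidelity high

The LLM does fine-grained reasoning

Fast path (visual tokens):

Renders distant tokens as images

Uses a lightweight vision encoder to "skim" the content

Compresses heavily, keeping only the semantic gist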

Architecture

LLM context window (100k+ tokens):
  ┌──────────────────────────────────────────┐
  │ Proximal window (text tokens)            │ ← Slow path
  │ - Last 5-10 conversation turns           │
  │ - Current user question                  │
  │ Token count: ~5,000                      │
  ├──────────────────────────────────────────┤
  │ Distant context (visual tokens)          │ ← Fast path
  │ - Text rendered as images                │
  │ - Compressed by the vision encoder       │
  │ - Originally 50,000 text tokens          │
  │   → 5,000 visual tokens (10×)            │
  └──────────────────────────────────────────┘

Total tokens: 5,000 (text) + 5,000 (visual) = 10,000
vs. all text: 5,000 + 50,000 = 55,000 ❌
Overall compression ratio: 5.5×
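The token accounting above can be reproduced with a few lines of Python. This is a minimal sketch: the `ContextBudget` class and its field names are illustrative and do not come from the Vist paper, and the 10× figure is passed in as an assumption rather than measured.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token accounting for a slow-fast context split (illustrative only)."""
    proximal_text_tokens: int   # kept verbatim on the slow path
    distant_text_tokens: int    # rendered to images on the fast path
    visual_compression: float   # assumed text-to-visual token ratio

    @property
    def visual_tokens(self) -> int:
        return round(self.distant_text_tokens / self.visual_compression)

    @property
    def hybrid_total(self) -> int:
        return self.proximal_text_tokens + self.visual_tokens

    @property
    def text_only_total(self) -> int:
        return self.proximal_text_tokens + self.distant_text_tokens

    @property
    def overall_ratio(self) -> float:
        return self.text_only_total / self.hybrid_total


# Same numbers as the diagram: 5k proximal, 50k distant, 10x on the fast path.
budget = ContextBudget(5_000, 50_000, 10.0)
print(budget.hybrid_total, budget.text_only_total, round(budget.overall_ratio, 1))
```

The overall ratio (5.5×) is lower than the fast-path ratio (10×) because the proximal window is deliberately left uncompressed. The larger the share of context kept as text, the further the overall figure falls below the fast-path figure.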

Probability-Informed Visual Enhancement (PVE)

Key innovation: how do you train a vision encoder to pick out the important information?

Vist proposes the PVE objective, which imitates what skilled readers do:

Frequency-based masking: during training, high-frequency tokens (function words such as "the", "is", "and") are masked out
This forces the Resampler to focus on semantically rich regions
Result: the vision encoder learns to gloss over unimportant words and concentrate on key information

Component	Details	Purpose
Vision Encoder	Frozen lightweight model	Fast image processing
Perceiver Resampler	Learnable queries	Compresses visual features
PVE Training	Masks high-frequency tokens	Learns semantic saliency
LLM Decoder	Full-scale LLM	Fine-grained reasoning
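To make frequency-based masking concrete, here is a toy sketch. It is not Vist's training code: it assumes whitespace-split words as tokens, a tiny hand-written corpus, and an arbitrary 10% frequency threshold. The function names are hypothetical.

```python
from collections import Counter


def build_frequency_table(corpus: list[list[str]]) -> dict[str, float]:
    """Relative frequency of every token across the corpus."""
    counts = Counter(tok for sentence in corpus for tok in sentence)
    total = sum(counts.values())
    return {tok: count / total for tok, count in counts.items()}


def mask_frequent_tokens(tokens, freq, threshold=0.1, mask="[MASK]"):
    """Replace tokens whose relative frequency is at or above the threshold."""
    return [mask if freq.get(tok, 0.0) >= threshold else tok for tok in tokens]


# Toy corpus: "the" makes up 6 of 18 tokens, every other word appears once.
corpus = [
    "the model reads the rendered page".split(),
    "the encoder compresses the distant context".split(),
    "the resampler keeps the key facts".split(),
]
freq = build_frequency_table(corpus)
masked = mask_frequent_tokens(corpus[0], freq)
print(masked)
```

Only the function word is masked. The content words survive, so a training signal built on the unmasked positions would come from them.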

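Glyph: Visual-Text Compression for Context Windows

The GLM Team's Approach

Glyph also renders text as images, but it focuses on optimizing the rendering configuration:

🔧 Core question
How should text be rendered into images to get the best combination of compression ratio and accuracy?

Factors to consider:

Font size

Line spacing

Image resolution

Characters per line

Background color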

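LLM-Driven Genetic Search

Glyph uses a genetic algorithm to search for the best rendering configuration:

def optimize_rendering_config(text, text_tokens, render_text_to_image, vision_encoder,
                              evaluate_reconstruction, evolve,
                              num_generations=10, alpha=1.0, beta=0.1):
    """
    Optimize rendering parameters with genetic search.

    The rendering, encoding, scoring, and evolution steps are passed in as
    callables; they stand for components this post does not spell out.
    """
    # Initial population
    configs = [
        {"font_size": 12, "line_spacing": 1.5, "resolution": 1024},
        {"font_size": 14, "line_spacing": 1.2, "resolution": 768},
        # ...
    ]
    best_config, best_score = None, float("-inf")

    for generation in range(num_generations):
        # Evaluate each config
        scores = []
        for config in configs:
            rendered_image = render_text_to_image(text, config)
            visual_tokens = vision_encoder(rendered_image)
            accuracy = evaluate_reconstruction(text, visual_tokens)
            compression = len(text_tokens) / len(visual_tokens)

            # Fitness = weighted balance of accuracy and compression
            score = alpha * accuracy + beta * compression
            scores.append(score)

            # Track the best config seen in any generation
            if score > best_score:
                best_config, best_score = config, score

        # Selection + crossover + mutation
        configs = evolve(configs, scores)

    return best_config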
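The selection, crossover, and mutation step can be sketched end to end with a toy fitness function. Everything here is an assumption made for illustration: `fitness` is a hand-written stand-in for "render, encode, and score", the 9pt legibility sweet spot is invented, and plain random mutation replaces whatever LLM-guided proposals Glyph actually uses.

```python
import random


def fitness(cfg: dict) -> float:
    """Toy stand-in for 'render, encode, score'. NOT a real measurement."""
    # Smaller fonts pack more characters per image (better compression)...
    compression = 16.0 / cfg["font_size"]
    # ...but accuracy drops below an assumed 9pt legibility threshold.
    accuracy = 1.0 - max(0.0, 9.0 - cfg["font_size"]) * 0.25
    spacing_penalty = abs(cfg["line_spacing"] - 1.2) * 0.5
    return accuracy + 0.3 * compression - spacing_penalty


def mutate(cfg: dict, rng: random.Random) -> dict:
    """Nudge each parameter and clamp it to its allowed range."""
    return {
        "font_size": min(16.0, max(6.0, cfg["font_size"] + rng.uniform(-1, 1))),
        "line_spacing": min(2.0, max(1.0, cfg["line_spacing"] + rng.uniform(-0.1, 0.1))),
    }


def crossover(a: dict, b: dict, rng: random.Random) -> dict:
    """Take each parameter from one of the two parents at random."""
    return {key: rng.choice((a[key], b[key])) for key in a}


def genetic_search(pop_size=8, generations=20, seed=0):
    rng = random.Random(seed)
    population = [
        {"font_size": rng.uniform(6, 16), "line_spacing": rng.uniform(1.0, 2.0)}
        for _ in range(pop_size)
    ]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        elite = ranked[: pop_size // 2]  # selection: top half survives unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness), history


best, history = genetic_search()
print(best, round(fitness(best), 3))
```

Because the top half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. That property is what makes the search safe to stop early.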

A Practical Scenario

Typical workflow:

User question: "Summarize the key points from our last 50 conversations"

Conventional LLM:
  → Load 50 conversations as text tokens
  → Total: 100,000+ tokens
  → ❌ Exceeds the context window, or costs a great deal

Glyph:
  → Render the oldest 45 conversations as images
  → Compress with the vision encoder (10× compression)
  → Keep the most recent 5 conversations as text tokens
  → Total: 10,000 (visual) + 10,000 (text) = 20,000 tokens ✅
  → Fits!

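Technical Deep Dive: Why Does This Work?

The Counterintuitive Truth

The naive view:

"Hello world" (2 text tokens)
    ↓ Render as image
Image → 20-50 visual tokens ❌
Compression fails!

But for long text the picture changes:

10,000 text tokens
    ↓ Render as one or a few images
5-10 images → 200 visual tokens each
= 1,000-2,000 visual tokens ✅
Compression ratio: 5-10×

Key insights:

💡 Why long-text compression works

Batch rendering: many tokens are packed into the same image

Visual redundancy: rendered text has a lot of spatial redundancy, which a vision encoder can compress efficiently

Selective fidelity: 100% reconstruction is not required, only the meaning has to survive

Fixed visual tokens per image: each image maps to a fixed number of visual tokens, however much text it holds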

The Math

Text token efficiency:

$$\text{Text tokens} = \frac{\text{Characters}}{\text{avg\_chars\_per\_token}} \approx \frac{\text{Characters}}{4}$$

1000 characters ≈ 250 text tokens

Visual token efficiency (Vist/Glyph):

$$\text{Visual tokens} = \text{num\_images} \times \text{tokens\_per\_image}$$

Assume:

Each image can hold 500 rendered characters
Each image → 50 visual tokens

1000 characters → 2 images → 100 visual tokens

Compression ratio:

$$\text{Compression} = \frac{250}{100} = 2.5\times$$

In practice, though, because of:

Larger batch sizes: longer text → higher efficiency
Optimized rendering: genetic search finds the best configuration
Selective compression: only low-salience content is compressed

5-10× compression is achievable.
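The same arithmetic in code makes it easy to see where the break-even point sits. This sketch uses the assumptions stated above (4 characters per text token, 500 characters and 50 visual tokens per image) as defaults. The function names are illustrative, and a partly filled image is assumed to cost as much as a full one.

```python
import math


def text_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    return math.ceil(num_chars / chars_per_token)


def visual_tokens(num_chars: int, chars_per_image: int = 500,
                  tokens_per_image: int = 50) -> int:
    # A partly filled image still costs the full per-image token budget.
    num_images = max(1, math.ceil(num_chars / chars_per_image))
    return num_images * tokens_per_image


def compression_ratio(num_chars: int, **render_kwargs) -> float:
    return text_tokens(num_chars) / visual_tokens(num_chars, **render_kwargs)


for n in (11, 200, 1_000, 40_000):
    print(n, text_tokens(n), visual_tokens(n), round(compression_ratio(n), 2))

# Denser rendering (an assumed 1,000 characters per image) doubles the ratio.
print(compression_ratio(40_000, chars_per_image=1_000))
```

Under these assumptions, a string as short as "Hello world" (11 characters) costs far more visual tokens than text tokens, and break-even arrives at 200 characters. Once the images are full the ratio stops growing with length. Past that point, gains have to come from packing more characters into each image or spending fewer tokens per image, which is what the rendering search is tuning.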

Comparison with Traditional Text Compression

Method	Compression	LLM Compatible	Semantic Preservation
gzip/zlib	3-5×	❌ Must be decompressed first	✅ Lossless
Text summarization	5-20×	✅ Direct use	⚠️ Loses detail
Vist/Glyph	5-10×	✅ Native VLM support	✅ High semantic fidelity

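The Fundamental Difference from DeepSeek OCR

Where the Confusion Comes From

DeepSeek OCR, Vist, and Glyph all involve visual tokens and compression, but they solve entirely different problems:

Aspect	DeepSeek OCR	Vist/Glyph
Input	Documents that are already images (PDFs, scans, screenshots)	Plain text (text strings)
Operation	Compresses existing images	Actively renders text into images
Goal	Document understanding	Long-context compression
Information preserved	Visual structure (layout, formulas, tables)	Semantic content (text meaning)
Best fit	The document itself is an image	Text conversations and memory management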

Side-by-Side Workflows

DeepSeek OCR workflow:

PDF document (already an image)
    ↓
[DeepEncoder]
    ↓
Visual tokens (10× compression)
    ↓
[LLM Decoder]
    ↓
Document understanding / OCR reconstruction

Vist/Glyph workflow:

Plain text string (conversation history)
    ↓
[Text Rendering Engine]
    ↓
Generated image (rendered on purpose)
    ↓
[Vision Encoder]
    ↓
Visual tokens (5-10× compression)
    ↓
[LLM Decoder]
    ↓
Understanding of the compressed context

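Key Differences

🎯 The core distinction
DeepSeek OCR:

The input already has visual structure (layout, formulas, tables)

Compression is there to preserve that visual information

It does not convert plain text to images

Vist/Glyph:

The input is plain text with no visual structure

Text is deliberately rendered as images to exploit the vision encoder's compression ability

Aimed at managing very long text contexts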

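When to Use It and When Not To

✅ Scenarios Where Text-to-Image Compression Fits

1. Distant conversation history

Context:
  Last 5 turns → text tokens (must be understood exactly)
  10-50 turns ago → visual tokens (a rough memory is enough)
  50+ turns ago → heavily compressed visual tokens (a vague impression)

2. Large bodies of reference documents (plain text)

Legal case libraries
Chat history archives
Code repositories (read-only, no editing needed)

3. Low-salience background information

Meeting transcripts (from several days ago)
Old email threads
Reference material (only needs to be "searchable", not quoted exactly)

4. Very long context windows

100k+ tokens need to be processed
Memory budget is limited
Some semantic loss is acceptable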

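❌ Scenarios Where Text-to-Image Compression Does Not Fit

1. The current conversation

User: "Based on my previous question, please revise point 3..."
           ↑
    Needs exact understanding, must not be compressed!

2. Content that must be quoted exactly

Legal provisions (must be word-for-word)
Code editing (a single typo breaks things)
Mathematical proofs (every symbol matters)

3. Structured data

JSON/XML parsing
Database queries
API responses

4. Short text

Text under 1,000 tokens
    → Rendering overhead > compression gain
    → Plain text tokens are the better choice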

Hybrid Strategy (Best Practice)

💡 Hybrid compression strategy

LLM context management:
  ┌────────────────────────────────────────┐
  │ Tier 1: Text Tokens (High Fidelity)    │
  │ - Last 3-5 conversation turns          │
  │ - Content for the current task         │
  │ - Information that needs exact quoting │
  │ Cost: ~5,000 tokens                    │
  ├────────────────────────────────────────┤
  │ Tier 2: Visual Tokens (5× compression) │
  │ - Conversation from 10-20 turns ago    │
  │ - Reference documents (plain text)     │
  │ - Medium-importance information        │
  │ Cost: ~5,000 visual tokens             │
  │ (equivalent to 25,000 text tokens)     │
  ├────────────────────────────────────────┤
  │ Tier 3: High-Compression Visual        │
  │ - Conversation from 50+ turns ago      │
  │ - Rarely accessed data                 │
  │ - Background context                   │
  │ Cost: ~2,000 visual tokens             │
  │ (equivalent to 40,000 text tokens)     │
  └────────────────────────────────────────┘

Total cost: 12,000 tokens (mixed)
vs. text only: 70,000 tokens
Savings: 5.8×
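A tiering policy like this one reduces to a lookup from message age to compression ratio. The sketch below is one possible encoding, not code from either paper. The cut-offs of 10 and 50 turns come from the recommendations in this post, the 20× ratio for the oldest tier is implied by the diagram (40,000 text tokens stored as 2,000), and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    compression: float  # text tokens per stored token; 1.0 means plain text


TEXT = Tier("text", 1.0)
VISUAL = Tier("visual", 5.0)
VISUAL_HIGH = Tier("visual-high", 20.0)


def tier_for_age(turns_ago: int) -> Tier:
    """Pick a tier from how many turns ago a message was sent.

    The cut-offs are illustrative thresholds, not tuned values.
    """
    if turns_ago < 10:
        return TEXT
    if turns_ago < 50:
        return VISUAL
    return VISUAL_HIGH


def context_cost(segments: list[tuple[int, int]]) -> tuple[int, int]:
    """Each segment is a (turns_ago, text_token_count) pair.

    Returns (hybrid_cost, text_only_cost) in tokens.
    """
    hybrid = sum(round(n / tier_for_age(age).compression) for age, n in segments)
    text_only = sum(n for _, n in segments)
    return hybrid, text_only


# The three tiers from the diagram: 5k recent, 25k mid-distance, 40k far.
hybrid, text_only = context_cost([(2, 5_000), (15, 25_000), (60, 40_000)])
print(hybrid, text_only, round(text_only / hybrid, 1))
```

Keeping the policy in one small function makes it easy to swap age for another salience signal later, such as relevance to the current query.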

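Experimental Results and Performance

Vist Results

Long-context QA tasks:

Task	Baseline (Full Text)	Vist	Compression
NarrativeQA	67.3%	65.8%	8.5× ✅
Qasper	43.2%	41.7%	9.2× ✅
Multi-doc QA	72.1%	69.4%	10.1× ✅

Key findings:

Accuracy drops slightly (1.5-2.7 percentage points)
Compression ratio reaches roughly 8.5-10×
Inference is 3-4× faster (far fewer tokens to process)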

Glyph Results

Context window scaling:

Context Length	Text-only Prefill	Glyph Prefill	Speedup
50k tokens	12.3s	8.1s	1.5× ⚡
100k tokens	45.7s	18.2s	2.5× ⚡
200k tokens	OOM ❌	42.5s	✅ Feasible

Prefill efficiency improves markedly, and the gain grows with context length.

Memory Usage

Example: 100k context window

Baseline (text-only):
  KV cache: 100k × hidden_dim × num_layers
  Memory: ~45 GB (A100 GPU)

Vist/Glyph (hybrid):
  Text KV cache: 10k × hidden_dim × num_layers
  Visual KV cache: 10k × hidden_dim × num_layers  # compressed from 90k
  Memory: ~12 GB ✅

Savings: 3.75×
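The KV cache term can be estimated with the standard size formula: one key tensor and one value tensor per layer, each of shape sequence length × KV heads × head dimension. The model shape below (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical example chosen for illustration, not the configuration behind the figures above.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for a single sequence."""
    # The factor 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape (an assumption for this sketch).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

full = kv_cache_bytes(100_000, LAYERS, KV_HEADS, HEAD_DIM)
hybrid = kv_cache_bytes(10_000 + 10_000, LAYERS, KV_HEADS, HEAD_DIM)

print(f"text-only: {full / 1e9:.1f} GB")
print(f"hybrid:    {hybrid / 1e9:.1f} GB")
print(f"cache reduction: {full / hybrid:.1f}x")
```

The cache grows linearly with sequence length, so going from 100k to 20k cached positions shrinks the cache itself by 5×. A lower total saving such as 3.75× would be consistent with the totals including fixed costs like model weights, though the figures quoted here are not broken down that way.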

Technical Takeaways

1. More Compression Is Not Always Better

⚠️ Trade-off awareness
Compression ratio vs. semantic fidelity:

5× compression: ~98% semantic accuracy

10× compression: ~92% semantic accuracy

20× compression: ~75% semantic accuracy ⚠️

Strategy: adjust dynamically according to information salience
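One way to turn "adjust according to salience" into a rule is to set an accuracy floor per segment and pick the highest compression ratio that still clears it. The sketch below uses the three trade-off points quoted above. The linear mapping from salience to accuracy floor is an assumption made purely for illustration.

```python
# (compression ratio, approximate semantic accuracy) pairs quoted in this post.
TRADE_OFF = [(5, 0.98), (10, 0.92), (20, 0.75)]


def max_compression(min_accuracy: float) -> int:
    """Largest listed ratio whose accuracy meets the floor; 1 means keep as text."""
    acceptable = [ratio for ratio, accuracy in TRADE_OFF if accuracy >= min_accuracy]
    return max(acceptable, default=1)


def accuracy_floor(salience: float) -> float:
    """Map salience in [0, 1] to a required accuracy (assumed linear rule)."""
    return 0.70 + 0.30 * salience


for salience in (1.0, 0.9, 0.5, 0.1):
    print(salience, max_compression(accuracy_floor(salience)))
```

With this rule the most salient content is never compressed, because no listed ratio reaches 100% accuracy. The 20× setting is reserved for content whose loss would barely matter.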

2. Multimodal Is More Than "Looking at Pictures"

Vist and Glyph show that the vision encoder in a multimodal model can be used on any information that needs compressing, not only on content that started out as an image.

Future directions:

Audio → spectrogram → visual compression?
Structured data → visualization → visual compression?
Code → syntax-highlighted rendering → visual compression?

3. LLM Context Management Is a Tiered System

Information in different tiers needs different fidelity:

Context hierarchy:
  L1 cache (text tokens) ← highest fidelity
    ↓ demote
  L2 cache (visual tokens, 5×) ← medium fidelity
    ↓ demote
  L3 cache (visual tokens, 10×) ← low fidelity
    ↓ evict
  Disk (permanent storage) ← reloaded when needed

This tiered approach is key to future long-context LLMs.
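The demote-and-evict flow in the hierarchy can be sketched as a chain of bounded queues. This is a minimal illustration under assumed rules: each level's capacity is counted in messages, the oldest message is demoted first, and a plain list stands in for the "disk". Actual rendering and encoding are left out, and only the token accounting is modelled.

```python
from collections import deque

# (level name, compression ratio) following the hierarchy above.
LEVELS = [("L1 text", 1), ("L2 visual", 5), ("L3 visual", 10)]


class TieredContext:
    def __init__(self, capacities=(2, 2, 2)):
        self.capacities = capacities           # max messages per level (assumed)
        self.levels = [deque() for _ in LEVELS]
        self.disk = []                         # evicted messages, oldest first

    def add(self, message: str, text_tokens: int) -> None:
        """Insert the newest message at L1, then cascade any overflow downward."""
        self.levels[0].appendleft((message, text_tokens))
        for i, capacity in enumerate(self.capacities):
            while len(self.levels[i]) > capacity:
                oldest = self.levels[i].pop()
                if i + 1 < len(self.levels):
                    self.levels[i + 1].appendleft(oldest)   # demote
                else:
                    self.disk.append(oldest)                # evict

    def token_cost(self) -> int:
        """Tokens held in context, after each level's compression ratio."""
        return sum(
            round(tokens / ratio)
            for (_, ratio), level in zip(LEVELS, self.levels)
            for _, tokens in level
        )


ctx = TieredContext()
for turn in range(7):
    ctx.add(f"m{turn}", 1_000)

print([[name for name, _ in level] for level in ctx.levels], ctx.disk)
print(ctx.token_cost())
```

After seven 1,000-token messages, the two newest sit in L1 as text, the next two in each visual level, and the oldest has been evicted. The in-context cost is 2,600 tokens instead of 7,000.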

4. Rendering Is a Science

Glyph's use of genetic search to optimize rendering shows that:

Font choice matters
Line spacing affects compression
Resolution has to be balanced

Specialized text renderers for LLM compression may appear in the future.

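Summary

Key Points

Text-to-image compression is real: Vist and Glyph show that rendering text as images can compress tokens effectively
The use cases are well defined:
- ✅ Distant, low-salience context
- ✅ Very long context windows
- ❌ The current conversation, exact reasoning
It is not the same as DeepSeek OCR:
- DeepSeek OCR: processes documents that are already images
- Vist/Glyph: actively render plain text
The slow-fast architecture is the key:
- Near context: text tokens (high fidelity)
- Distant context: visual tokens (compressed)
The results are substantial:
- 8-10× compression ratio
- Slight accuracy loss (1.5-2.7 percentage points)
- Notable speedup (1.5-4×)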

When to Use It?

Scenario	Recommendation	Reason
Current conversation	❌ Text tokens	Needs exact understanding
Recent context (< 10 turns)	⚠️ Text tokens	May need to be quoted
Distant context (10-50 turns)	✅ Visual tokens (5×)	Balances compression and accuracy
Far context (50+ turns)	✅ Visual tokens (10×)	Maximizes compression
Structured data	❌ Text tokens	Needs parsing
Reference docs (pure text)	✅ Visual tokens	Semantic loss is acceptable

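Looking Ahead

Vist and Glyph open up a new direction: actively using multimodal capabilities to make LLMs more efficient.

We may see:

Adaptive compression: adjust the compression ratio dynamically based on the query
Learned rendering: a neural network learns the best text-to-image conversion
Omni-modal compression: unified handling of text, audio, and video
Hardware acceleration: dedicated vision encoder hardware

The field is still very new (2025-2026), and there is a lot left to explore.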

