TurboQuant：Google 用 3-bit 壓縮 KV Cache 做到零精度損失嘅突破

論文來源：Google Research（ICLR 2026） arXiv: 2504.19874 Blog: Google Research Blog

TL;DR

Google Research 提出咗 TurboQuant，一個理論上近乎最優嘅 vector quantization 框架，專門解決 LLM 嘅 KV cache memory bottleneck 同 vector search 嘅效率問題。

核心重點：

🎯 3-bit 零損失壓縮：KV cache 壓縮到 3.5 bits/channel 完全唔影響模型質量
🚀 8x Speedup：喺 H100 GPU 上 4-bit TurboQuant 比 32-bit unquantized keys 快 8 倍
📊 6x 記憶體減少：KV cache memory 至少減少 6 倍
🧮 理論最優：同 information-theoretic lower bound 只差 ≈2.7× constant factor
⚡ 零 indexing 時間：喺 nearest neighbor search 任務，indexing 時間由幾百秒降到 0.001 秒

背景：點解 KV Cache 量化咁重要？

LLM 嘅 Memory Bottleneck

現代 LLM 運行嘅時候需要維護一個 Key-Value (KV) cache，用嚟儲存已處理 token 嘅 key 同 value vectors。每個新 token 要同之前所有 tokens 做 attention，所以呢個 cache 會隨住 context length 不斷增長。

問題有幾嚴重？

假設一個 LLM 有 32 layers，每層有 32 attention heads，每個 head 嘅 dimension 係 128：

每個 token 嘅 KV cache： $32 \times 32 \times 128 \times 2 \times 4$ bytes = 1 MB
128K context window：128 GB 淨係 KV cache！

呢個就係點解長 context LLM 咁食 memory 嘅原因。

現有量化方法嘅問題

傳統嘅量化方法（例如 per-channel 或 per-block quantization）都有一個共同問題：quantization constants overhead。

每個 data block 都要儲存至少兩個 full-precision 嘅常數（zero point 同 scale），視乎 block size，呢個 overhead 可以加 1-2 bits per number。如果你本身想壓縮到 4-bit，呢個 overhead 等於增加咗 25-50% 嘅儲存成本。

Scale 同 Zero Point 係咩？具體例子

等我哋用一個具體例子解釋。假設你有一個 4 個元素嘅 float vector：

pythonx = [-3.2,  0.0,  1.5,  4.8]

你想將佢壓縮成 4-bit integers（範圍 0–15）。

Step 1 — 計算 Scale（縮放比例）

Scale 話俾你知：每個 integer step 代表幾多 float 值：

\text{scale} = \frac{\text{max} - \text{min}}{2^{\text{bits}} - 1} = \frac{4.8 - (-3.2)}{15} = \frac{8.0}{15} \approx 0.533

Step 2 — 計算 Zero Point（零點偏移）

Zero point 話俾你知：邊個 integer 對應 float 嘅 $0.0$ ：

\text{zero\_point} = \text{round}\left(\frac{-\text{min}}{\text{scale}}\right) = \text{round}\left(\frac{3.2}{0.533}\right) \approx 6

Step 3 — 量化

pythonx_int = round(x / scale + zero_point)

原始 float	計算	量化後 (4-bit int)
−3.2	round(−3.2 / 0.533 + 6) = round(0.0)	0
0.0	round(0.0 / 0.533 + 6) = round(6.0)	6
1.5	round(1.5 / 0.533 + 6) = round(8.8)	9
4.8	round(4.8 / 0.533 + 6) = round(15.0)	15

Step 4 — 還原（Dequantize）

pythonx_reconstructed = scale * (x_int - zero_point)
# → [-3.198, 0.0, 1.599, 4.797]  ← 同原始值非常接近

🎯 Scale 同 Zero Point 嘅直觀理解

Scale：就係一把「尺」，話你知 float 世界同 integer 世界嘅比例關係

Zero Point：就係「原點對齊」，確保 float $0.0$ 可以被精確表示，避免系統性偏差

兩者合起來定義咗一個從 integer 空間到 float 空間嘅 affine mapping（線性映射）。

呢度嘅問題係咩？

你需要為每個 block 儲存 scale（float32）同 zero_point（float32 或 int32）。假設 block size = 128 個元素：

部分	大小
128 個 4-bit integers（實際數據）	64 bytes
scale（float32）	4 bytes
zero_point（float32）	4 bytes
總計	72 bytes → 有效 bit-width ≈ 4.5 bits ⚠️

你以為係 4-bit，實際係 4.5 bits。Block size 越細（為咗提高精度），overhead 比例越大，甚至可以去到 5-6 bits。

⚠️ 傳統量化嘅兩難

Block size 太大 → 量化誤差大（同一個 block 入面數值差異大）

Block size 太細 → Constants overhead 大（每個 block 都要 full-precision constants）

兩邊都唔理想！

另一個核心問題係：大部分 MSE-optimal quantizers 喺做 inner product estimation 嘅時候會引入 bias。Attention score 本質上就係 query 同 key 嘅 inner product，有 bias 嘅 estimator 會令 attention 結果 systematically 偏離，影響模型質量。

TurboQuant：核心創新

設計理念

TurboQuant 嘅核心思路可以用一句話概括：

💡 核心洞察：隨機旋轉 + 最優 scalar quantization + 殘差修正
透過 random rotation 將 vector 嘅座標分佈變得「可預測」，然後對每個座標獨立做最優量化，最後用 1-bit QJL 修正 inner product bias。

TurboQuant 係一個 data-oblivious 算法，即係唔需要任何 training data 或者 codebook learning。呢個特性令佢非常適合 online applications——新嘅 vector 一嚟就可以即時量化，唔使等。

Data-Oblivious 係咩意思？

量化算法可以分成兩大類：

Data-dependent（依賴數據）：需要先睇大量樣本數據，從中「學習」數據嘅分佈，然後根據呢個分佈設計最優嘅 quantizer。例如 Product Quantization (PQ) 要先用 k-means clustering 喺 dataset 上 train 一個 codebook，之後先可以做量化。
Data-oblivious（不依賴數據）：唔需要任何預先收集嘅數據。量化嘅規則完全由數學推導決定，對任何輸入都適用。TurboQuant 就係呢類——random rotation 之後座標分佈係已知嘅 Beta distribution，所以 Lloyd-Max quantizer 嘅最優 levels 可以純粹從理論計算出嚟，完全唔需要睇實際數據。

💡 類比：一個 data-dependent 嘅快遞分揀員要先研究成個月嘅包裹數據，搞清楚最常見嘅目的地係邊度，先設計分揀規則。一個 data-oblivious 嘅分揀員直接用郵政編碼規則——唔需要睇任何歷史數據，第一個包裹嚟就即刻可以分揀。

點解 Data-Oblivious 係優勢？

喺實際應用入面，data-dependent 方法有幾個根本問題：

需要 warm-up 數據：你要先收集夠代表性嘅樣本先可以 train codebook。喺 LLM inference 嘅場景，每個用戶嘅 query 都唔同，難以預先知道 KV cache 嘅分佈。
Distribution shift：如果實際數據嘅分佈同 training 時嘅分佈唔同（例如新領域嘅問題），codebook 就會 stale，量化質量暴跌。
無法做 online quantization：新 vector 嚟嘅時候，你要先等 codebook ready 先可以量化。對於實時系統，呢個 latency 係不可接受嘅。

TurboQuant 完全冇呢啲問題——任何 vector、任何時間、即刻量化，質量都有理論保證。

三個關鍵組件

Random Rotation（隨機旋轉）

呢個係成個方法嘅基礎。當你對一個 high-dimensional vector 做 random orthogonal rotation 之後，每個座標嘅分佈會 converge 到一個 concentrated Beta distribution。

點解 random rotation 之後會係 Beta distribution？

直覺上，對一個 vector 做 random orthogonal rotation，等同於喺 $d$ -dimensional unit sphere 上 uniform 咁 sample 一個方向。而高維 sphere 有一個 classical result：如果你喺 sphere 上 uniform sample 一個點 $z$ ，每個座標嘅平方 $z_i^2$ 佔 total squared norm 嘅比例，根據定義就係：

z_i^2 \sim \text{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)

呢個唔係 approximation，係 exact result。而且因為 concentration of measure，維度越高，呢個分佈越 concentrated——所有座標嘅值都被迫集中喺 $\approx \pm 1/\sqrt{d}$ 嘅窄範圍。詳細數學推導（包括點樣從 Gaussian 構造 sphere uniform distribution 再得出 Beta）見下方深入分析一節。

💡 一句話直覺：Random rotation 將 vector 嘅「能量」均勻地 spread 到每個座標上——每個座標分到大約 $1/d$ 嘅 total energy，而且愈高維呢個分配愈穩定、愈可預測。呢個令到所有座標嘅分佈係 已知同固定嘅，唔受原始 vector 嘅值影響。

點解呢個有用？

原始 vector：座標分佈未知，可能好 skewed，有 outliers
旋轉後 vector：座標分佈變得高度集中同可預測
獨立性：高維空間入面，旋轉後嘅座標幾乎 互相獨立

x \in \mathbb{R}^d \xrightarrow{\text{Random Rotation } R} Rx \sim \text{Concentrated Beta Distribution}

呢個性質意味住我哋可以用一個 universal scalar quantizer 處理所有座標，而唔需要 per-block 嘅 normalization constants。

Lloyd-Max Optimal Scalar Quantizer

在講 PolarQuant 之前，先要理解 TurboQuant 嘅核心壓縮工具：Lloyd-Max quantizer。

Scalar quantization 基本概念：

假設你有一個 continuous 嘅數值（例如 3.7），但你只能用 2-bit 儲存，即係得 4 個可能嘅值。你要揀最接近嘅一個儲存。問題係：呢 4 個值應該擺喺邊？

🎯 Lloyd-Max Quantizer 解決嘅問題
Given 一個已知嘅概率分佈，Lloyd-Max algorithm 會找到一組 最優嘅 quantization levels（reconstruction points）同 decision boundaries，使得 mean squared error (MSE) 最小化。

具體嚟講，Lloyd-Max 做兩件事：

Decision boundaries：將數值線切分成 $2^b$ 個區間（ $b$ = bit-width）
Reconstruction levels：每個區間內擺一個「代表值」，位置喺該區間嘅 probability-weighted centroid

javascript原始數值線：  ----[---x---]----[--x--]----[----x----]----[--x--]----
                  ^           ^              ^           ^
              level 0     level 1        level 2     level 3
              
        區間寬度根據概率密度調整：密度高嘅地方切得密，密度低嘅切得疊

點解係「optimal」？用一個實際例子嚟睇：

想像你用麥克風錄人聲，電壓訊號範圍係 $-10V$ 到 $+10V$ 。但人講嘢嘅音量分佈好唔均勻——90% 嘅時間電壓都集中喺 $-2V$ 到 $+2V$ （正常講嘢），只有 10% 嘅時間先會去到極端值（大叫或者爆破音）。

假設我哋只有 2-bit（4 個 quantization levels）嚟壓縮呢個訊號。

Uniform Quantizer 嘅災難：

佢會死板咁將 $-10V$ 到 $+10V$ 平均切成 4 等分：

區間	範圍	代表值	問題
1	[-10, -5]	-7.5V	幾乎冇訊號用到 ❌
2	[-5, 0]	-2.5V	90% 訊號迫晒入呢兩個區間 😱
3	[0, 5]	+2.5V	90% 訊號迫晒入呢兩個區間 😱
4	[5, 10]	+7.5V	幾乎冇訊號用到 ❌

一個微弱嘅 $0.1V$ 訊號（正常講嘢音量）會被歸入 $[0, 5]$ 區間，然後被強制轉換成 $2.5V$ ——誤差係 2.4V，原本嘅聲音被放大咗 25 倍！代表值 $\pm 7.5V$ 就白白浪費咗，幾乎冇訊號會用到佢哋。

Lloyd-Max Quantizer 嘅解法：

Lloyd-Max 嘅核心思想係：資源應該花喺刀口上。

佢透過迭代演算法自動調整：

尋找質心（Centroid Condition）：發現 $[0, 5]$ 區間入面，訊號幾乎全部擠喺 $0V$ 附近，所以將代表值從 $2.5V$ 拉到訊號真正嘅重心，例如 $0.8V$
重新劃分邊界（Nearest Neighbor Condition）：根據新嘅代表值重新劃分區間邊界
反覆迭代直到收斂

最終 Lloyd-Max 分配嘅 4 個代表值可能係：-4.0V, -0.8V, +0.8V, +4.0V

而家同一個 $0.1V$ 訊號會被轉換成 $0.8V$ ——誤差只有 0.7V，比 uniform 嘅 2.4V 細咗 3 倍以上！

💡 點解整體 MSE 更低？
對於偶爾出現嘅極端值（例如 $9V$ ），Lloyd-Max 可能轉換成 $4.0V$ ，單點誤差確實好大。但係，因為 $9V$ 出現嘅機率極低（不到 1%），而 $0.1V$ 出現嘅機率極高（90%+），從整體 MSE 嚟睇，Lloyd-Max 將高頻區域嘅誤差最小化，令整體平均失真降到最低。呢個就係佢被稱為「最優」嘅原因。

總結比較：

方法	分割策略	0.1V 訊號嘅誤差	整體 MSE
Uniform Quantization	均勻分割，唔理分佈	2.4V 😱	較高
Lloyd-Max Quantization	根據分佈密度調整分割	0.7V ✅	最低（理論最優）

喺 TurboQuant 入面嘅角色：

因為 random rotation 之後每個座標嘅分佈係已知嘅 Beta distribution，Lloyd-Max 嘅最優 levels 可以 提前計算好（只取決於維度 $d$ 同 bit-width $b$ ）。即係話，唔需要 runtime 做任何 optimization，直接 lookup table 就搞定。呢個係 TurboQuant 能夠做到 微秒級 indexing 嘅核心原因之一。

PolarQuant：極座標量化

PolarQuant 係 TurboQuant 嘅第一階段壓縮方法。佢將傳統嘅 Cartesian coordinates 轉換成 polar coordinates：

(x_1, x_2) \rightarrow (r, \theta) \quad \text{where } r = \sqrt{x_1^2 + x_2^2}, \quad \theta = \arctan(x_2 / x_1)

點解用 polar coordinates？

類比：

Cartesian："向東行 3 個 block，向北行 4 個 block"（兩個 unbounded 數值）
Polar："行 5 步，角度 37°"（一個 magnitude + 一個 bounded angle）

因為 random rotation 之後，angles 嘅分佈係已知同高度集中嘅，PolarQuant 可以：

唔需要 per-block normalization（angles 嘅 range 已知）
消除 quantization constants overhead
對每個 angle 獨立用 Lloyd-Max optimal scalar quantizer

但 PolarQuant 唔只係做一次 polar transform——佢係 recursive（遞迴） 咁做，直到只剩一個 final radius 同一堆 angles。

點解要轉成 (d-1) angles + 1 radius 呢個格式？

呢個係 PolarQuant 最核心嘅設計動機。傳統 Cartesian 量化有一個根本問題：每個座標嘅 range 取決於 vector 嘅 magnitude（模長）。一個 norm = 100 嘅 vector 同 norm = 0.01 嘅 vector，座標值差幾千倍。所以你必須為每個 block 儲存 scale 同 zero-point constants，呢個就係前面提到嘅 quantization constants overhead。

Recursive polar transform 做嘅嘢本質上係 將 magnitude（大小）同 direction（方向）徹底分離：

d-1 個 angles = 方向資訊：angles 嘅 range 永遠係固定嘅（ $0$ 到 $\pi$ 或 $0$ 到 $2\pi$ ），而且 random rotation 之後佢哋嘅分佈係已知嘅 Beta distribution。呢個意味住你可以用 同一個 pre-computed Lloyd-Max quantizer 處理所有 angles，零 overhead——唔需要任何 per-vector 或 per-block 嘅 normalization constants。
1 個 radius = magnitude 資訊：呢個就係 vector 嘅 norm $\|x\|$ ，佢嘅 range 係 unbounded 嘅，確實需要額外儲存。但只有 1 個數字！

🎯 Overhead 對比

傳統方法： $d$ 個座標，每個 block 需要 2 個 full-precision constants（scale + zero-point）→ overhead 可以加 1-2 bits per number

PolarQuant： $d-1$ 個 angles 用 universal quantizer（0 overhead）+ 1 個 radius 需要儲存 → overhead = 只有 1 個數字

呢個就係 PolarQuant 消除 quantization constants overhead 嘅核心原因——將「需要額外資訊先量化到」嘅部分壓縮到極致（ $d$ 維 → 1 個數字），而將「唔需要額外資訊」嘅部分最大化（ $d-1$ 個 angles）。

類比：描述一個目的地，你可以講「向東行 3km，向北行 4km」（Cartesian——兩個數都同 scale 有關，量化時要知道「最遠會行幾遠」），或者講「方向 37°，行 5km」（Polar——方向同距離分離咗，量化 37° 呢個角度唔需要知道你行幾遠）。

Recursive Process 具體點運作？用 8D 做例子：

假設 random rotation 之後你有一個 8 維 vector：

y = (y_1, y_2, y_3, y_4, y_5, y_6, y_7, y_8)

Round 1：將 8 個座標兩兩配對

將相鄰座標 pair up，每對轉成 polar form $(r, \theta)$ ：

Pair	輸入（Cartesian）	輸出（Polar）
Pair 1	$(y_1, y_2)$	$r_1 = \sqrt{y_1^2 + y_2^2}$ ， $\theta_1 = \arctan(y_2/y_1)$
Pair 2	$(y_3, y_4)$	$r_2 = \sqrt{y_3^2 + y_4^2}$ ， $\theta_2 = \arctan(y_4/y_3)$
Pair 3	$(y_5, y_6)$	$r_3 = \sqrt{y_5^2 + y_6^2}$ ， $\theta_3 = \arctan(y_6/y_5)$
Pair 4	$(y_7, y_8)$	$r_4 = \sqrt{y_7^2 + y_8^2}$ ， $\theta_4 = \arctan(y_8/y_7)$

Round 1 之後你有：4 個 angles $(\theta_1, \theta_2, \theta_3, \theta_4)$ + 4 個 radii $(r_1, r_2, r_3, r_4)$ 。

Angles 嘅分佈已知（Beta distribution），即刻用 Lloyd-Max quantize。但 4 個 radii 點算？佢哋係 普通嘅正數，可以當做新嘅「座標」繼續配對！

🔑 關鍵洞察：Radii 就係新嘅數值
每個 $r_i$ 只係一個正實數（例如 $r_1 = 2.3, r_2 = 1.8, r_3 = 0.9, r_4 = 3.1$ ）。佢哋可以同任何數字一樣被 pair up，再做一次 polar transform。你唔需要 radii 有任何特殊性質——只要有兩個數字，就可以算佢哋嘅 $(r, \theta)$ 。

Round 2：將 4 個 radii 兩兩配對

Pair	輸入（兩個 radii）	輸出（Polar）
Pair A	$(r_1, r_2)$ 例如 $(2.3, 1.8)$	$R_1 = \sqrt{2.3^2 + 1.8^2} = 2.92$ ， $\phi_1 = \arctan(1.8/2.3) = 38.0°$
Pair B	$(r_3, r_4)$ 例如 $(0.9, 3.1)$	$R_2 = \sqrt{0.9^2 + 3.1^2} = 3.23$ ， $\phi_2 = \arctan(3.1/0.9) = 73.8°$

Round 2 之後你又得到 2 個新 angles $(\phi_1, \phi_2)$ + 2 個新 radii $(R_1, R_2)$ 。同樣，angles 即刻 quantize，radii 繼續配對。

Round 3（最後一輪）：將 2 個 radii 配對

Pair	輸入（兩個 radii）	輸出（Polar）
Final	$(R_1, R_2)$ 即 $(2.92, 3.23)$	$R_{\text{final}} = \sqrt{2.92^2 + 3.23^2} = 4.35$ ， $\psi = \arctan(3.23/2.92) = 47.9°$

而家只剩 1 個 angle $\psi$ + 1 個 radius $R_{\text{final}}$ 。冇得再 pair 喇！

Binary Tree 全景圖：

javascriptRound 1:  y₁ y₂   y₃ y₄   y₅ y₆   y₇ y₈     ← 8 coordinates
           ↓  ↓     ↓  ↓     ↓  ↓     ↓  ↓
          r₁,θ₁   r₂,θ₂   r₃,θ₃   r₄,θ₄      ← 4 angles ✅ quantize
                                                   4 radii → 繼續
Round 2:  r₁  r₂   r₃  r₄
           ↓   ↓     ↓   ↓
          R₁, φ₁   R₂, φ₂                       ← 2 angles ✅ quantize
                                                   2 radii → 繼續
Round 3:  R₁  R₂
           ↓   ↓
        R_final, ψ                               ← 1 angle  ✅ quantize
                                                   1 radius → 冇得再 pair！

最終結果：7 angles (θ₁,θ₂,θ₃,θ₄,φ₁,φ₂,ψ) + 1 final radius (R_final)
         = (d-1) angles + 1 radius  ✅

💡 點解最後一定淨返 1 個 radius？
每一輪 polar transform 都將 $n$ 個數字變成 $n/2$ 個 angles + $n/2$ 個 radii。Angles 即刻 quantize 咗，radii 繼續 pair。所以 radii 嘅數量每輪減半： $4 \to 2 \to 1$ 。呢個係一棵 binary tree，最尾一定 converge 到 1 個 root。

數學上，呢個 final radius 就係原始 vector 嘅 norm（模長）：
$R_{\text{final}} = \|y\| = \sqrt{y_1^2 + y_2^2 + \cdots + y_8^2}$
因為每次 $\sqrt{a^2 + b^2}$ 都係 partial norm，一路 combine 上去就係 total norm。呢個其實就係 hyperspherical coordinates（超球座標） 嘅遞迴定義！

Final radius 點處理？

呢個 single radius = vector 嘅 norm，佢嘅 range 同其他 angles 唔同，需要另外處理（例如用更多 bits 或者 separate quantization scheme）。但重點係：只有 1 個，overhead 極低。

QJL：1-bit 殘差修正

就算 PolarQuant 嘅 MSE 壓縮質量好高，佢仍然會喺 inner product estimation 引入 bias。TurboQuant 用 Quantized Johnson-Lindenstrauss (QJL) transform 解決呢個問題。

點解需要 QJL？先理解 Residual 同 Bias 問題

首先要搞清楚一件事：residual（殘差）係咩？佢係邊度嚟嘅？

PolarQuant 嘅整個流程入面，只有一個步驟會 lose information——就係 Lloyd-Max rounding（將 continuous angle round 到最近嘅 quantization level）。呢個 rounding 產生嘅誤差就係 residual：

python# Residual 嘅來源：rounding error
original_key = k                              # 原始 key vector
rotated = R @ k                               # Step 1: rotation（lossless）
angles, radius = polar_transform(rotated)     # Step 2: polar transform（lossless）
quantized_angles = lloyd_max_round(angles)     # Step 3: rounding ⚠️ 呢度產生誤差！
reconstructed = inverse_all(quantized_angles, radius, R)  # 還原

residual = k - reconstructed                  # ← 呢個就係 residual = rounding error

即係話：Residual = 原始 key − 量化後還原嘅 key = rounding 造成嘅誤差。Rotation 同 polar transform 本身都係 lossless 嘅，唯一嘅 information loss 發生喺 rounding 嗰一步。

而呢個 residual 會引入一個隱藏問題——inner product bias。

PolarQuant 壓縮質量高（低 MSE），但有一個隱藏問題：當你用壓縮後嘅 vector 計算 inner product（即係 attention score = query · key），結果會有系統性偏差（bias）：

\underbrace{q \cdot k}_{\text{真實 attention score}} = 0.85 \quad \neq \quad \underbrace{q \cdot \hat{k}}_{\text{PolarQuant 估算}} = 0.97

呢個偏差唔係隨機噪聲——係每次都咁偏嘅系統性錯誤。喺 attention 入面，呢個會令某啲 token 嘅 attention weight 被系統性高估，模型行為 systematically drift，影響生成質量。

⚠️ 點解 MSE 低但 inner product 仍有 bias？
MSE 衡量嘅係 $\|x - \hat{x}\|^2$ （向量空間嘅重建誤差），但 inner product $q \cdot \hat{k}$ 嘅誤差取決於殘差同 query 嘅方向關係。PolarQuant 嘅量化誤差喺某個方向上有系統性偏向，導致 inner product 嘅估算偏高或偏低。兩個目標（低 MSE vs unbiased inner product）唔係同一回事。

QJL 點樣修正？

核心思路係：用極少嘅 bits 捕捉殘差嘅方向資訊，然後用呢個資訊修正 inner product 估算。

python# Step 1: 計算殘差（量化丟咗嘅部分）
residual = x_rotated - x_dequantized   # 原始 rotated vector - PolarQuant 還原值

# Step 2: 對 residual 做 random projection（JL transform）
projected = random_matrix @ residual   # 降維到低維空間

# Step 3: 只保留 sign bit（+1 或 -1）
qjl_sign = sign(projected)             # 每個維度只用 1 bit！

用呢個 qjl_sign 修正 inner product 估算之後，數學上可以證明估算變成 unbiased：

\mathbb{E}[q \cdot \hat{k}_{\text{corrected}}] = q \cdot k \quad \text{（期望值等於真實值）}

💡 直覺類比：想像你估算兩個城市之間嘅距離。PolarQuant 估算就好似用地圖粗略量度——準確但因地圖比例尺會系統性偏高。QJL correction 就好似一張只寫住「偏長咗定偏短咗」（+ 或 −）嘅修正紙條。紙條只有一個 bit，但已經夠你知道要調整嘅方向，統計上就可以消除系統性偏差。

點解只需要 1 bit？用具體數字睇

Johnson-Lindenstrauss 定理保證：隨機投影之後，sign 已經包含足夠嘅方向資訊嚟做 unbiased correction。聽落好 magical——淨係知道 + 或 − 就夠？等我用一個 2D 嘅數字例子示範。

假設 PolarQuant rounding 之後，residual（rounding error）同 query 如下：

Residual（rounding 造成嘅誤差）： $e = [-0.3, +0.1]$
Query： $q = [0.8, 0.6]$
真實 inner product 誤差： $q \cdot e = 0.8 \times (-0.3) + 0.6 \times 0.1 = -0.18$

呢個 $-0.18$ 就係 PolarQuant 估算嘅 bias——我哋想用 1 bit 嚟修正佢。

QJL 揀一個 random direction $r$ （每個 component 隨機 $\pm 1$ ），計算 $r \cdot e$ ，然後只保留 sign（+ 或 −）。修正公式：

\text{correction} = \text{sign}(r \cdot e) \times (r \cdot q)

Dot product 點計？ 就係逐個 component 相乘再加埋：

$r \cdot e$ ：將 $r$ 同 residual $e$ 嘅對應 component 逐個乘，再求和。例如 $r = (+1, +1)$ ， $e = [-0.3, +0.1]$ → $(+1)(-0.3) + (+1)(+0.1) = -0.3 + 0.1 = -0.2$
$r \cdot q$ ：同理，將 $r$ 同 query $q$ 嘅對應 component 逐個乘再求和。例如 $r = (+1, +1)$ ， $q = [0.8, 0.6]$ → $(+1)(0.8) + (+1)(0.6) = 0.8 + 0.6 = 1.4$

2D 入面 $r$ 有 4 種等概率嘅可能，逐個計：

Random direction $r$	$r \cdot e$ 計算過程	sign（1 bit）	$r \cdot q$ 計算過程	sign × (r·q)
$(+1, +1)$	$(+1)(-0.3) + (+1)(+0.1) = -0.2$	−1	$(+1)(0.8) + (+1)(0.6) = 1.4$	−1.4
$(+1, -1)$	$(+1)(-0.3) + (-1)(+0.1) = -0.4$	−1	$(+1)(0.8) + (-1)(0.6) = 0.2$	−0.2
$(-1, +1)$	$(-1)(-0.3) + (+1)(+0.1) = +0.4$	+1	$(-1)(0.8) + (+1)(0.6) = -0.2$	−0.2
$(-1, -1)$	$(-1)(-0.3) + (-1)(+0.1) = +0.2$	+1	$(-1)(0.8) + (-1)(0.6) = -1.4$	−1.4

所有 4 種情況嘅 correction 都係負數！ 平均值 = $(-1.4 - 0.2 - 0.2 - 1.4) / 4 = -0.8$ 。

注意 $-0.8 \neq -0.18$ ——raw correction 嘅 magnitude 唔會直接等於真實誤差。實際使用時需要乘一個 scaling factor（同維度 $d$ 同 norm 有關），令期望值等於真實誤差。但最重要嘅係：correction 嘅期望值同真實誤差 $q \cdot e$ 之間有一個 精確嘅數學比例關係，唔只係「方向一樣」咁簡單。

點解期望值會正比於 $q \cdot e$ ？逐步拆解：

修正公式係 $\text{sign}(r \cdot e) \times (r \cdot q)$ 。將 $r \cdot q = \sum_i r_i q_i$ 展開，用 linearity of expectation（期望值嘅線性性質）：

\mathbb{E}[\text{sign}(r \cdot e) \times (r \cdot q)] = \sum_i q_i \times \mathbb{E}[\text{sign}(r \cdot e) \times r_i]

而家個關鍵問題變成： $\mathbb{E}[\text{sign}(r \cdot e) \times r_i]$ 等於咩？答案係：

\mathbb{E}[\text{sign}(r \cdot e) \times r_i] = c \times e_i

其中 $c > 0$ 係一個只取決於維度同 norm 嘅已知常數。

點解 $\mathbb{E}[\text{sign}(r \cdot e) \times r_i] \propto e_i$ ？直覺解釋：

回想 $r \cdot e = r_1 e_1 + r_2 e_2 + \cdots + r_d e_d$ ，而 $r_i$ 係其中一個 component（隨機 $\pm 1$ ）。

當 $e_i$ 大且正嘅時候： $r_i = +1$ 會將 $r \cdot e$ 推向正方向 → sign 更可能係 $+1$ → $\text{sign} \times r_i$ 傾向正數
當 $e_i$ 大且負嘅時候： $r_i = +1$ 會將 $r \cdot e$ 推向負方向 → sign 更可能係 $-1$ → $\text{sign} \times r_i$ 傾向負數
當 $e_i = 0$ ： $r_i$ 對 sign 冇任何影響 → $\text{sign} \times r_i$ 期望值 = 0

呢個 $r_i$ 同 sign 之間嘅「共振」大小，正正同 $e_i$ 嘅大小成正比！

最後一步：代入求和

\mathbb{E}[\text{correction}] = \sum_i q_i \times (c \times e_i) = c \times \sum_i q_i \, e_i = c \times (q \cdot e)

所以 correction 嘅期望值 = 已知常數 $c$ × 真實誤差 $q \cdot e$ 。因為 $c$ 係已知嘅（只取決於維度 $d$ 同 residual 嘅 norm），你只要除以 $c$ 就得到一個 unbiased estimator：

\mathbb{E}\left[\frac{\text{correction}}{c}\right] = q \cdot e \quad \checkmark

驗證返我哋嘅 2D 例子： $\mathbb{E}[\text{correction}] = -0.8$ ， $q \cdot e = -0.18$ 。比例 = $-0.8 / -0.18 \approx 4.44$ 。呢個 $4.44$ 就係嗰個常數 $c$ ——佢只同 $d = 2$ 同 $\|e\|$ 有關，唔會因為 $q$ 或者 $e$ 嘅具體值改變。所以系統事先知道要除以 $4.44$ ，就可以精確 recover 到 $-0.18$ 。

🎯 一句話總結：點解「只知方向」就夠 unbiased？
因為 sign bit 捕捉嘅唔只係「正定負」——佢捕捉嘅係 residual 嘅每個 component $e_i$ 同隨機方向 $r_i$ 之間嘅 correlation 強度。呢個 correlation 正正同 $e_i$ 成正比。乘以 $q_i$ 再求和之後，期望值自然等於 $q \cdot e$ （差一個已知常數）。所以 1 bit 嘅 sign 已經包含咗足夠嘅 方向 + 比例 資訊，唔只係 ± 咁簡單。

🔑 點解 work？直覺解釋
Sign bit 本質上回答咗一個問題：「residual 喺呢條隨機線嘅邊一邊？」

如果 residual 同 query 指向同一邊（兩者都同 $r$ align 或者都 oppose），correction 係正數 → 加大估算

如果指向唔同邊，correction 係負數 → 減少估算

因為 residual 同 query 嘅相對方向決定咗 inner product 誤差嘅正負，而 sign bit 正正捕捉咗呢個「相對方向」資訊。每個 1-bit 答案都好 noisy，但 期望值（平均）已經指向正確方向——呢個就係 unbiased 嘅定義。

多用幾個 random directions（更多 bits）= 更多「投票」= 更低 variance（更穩定）。但 1 bit 已經足以消除 bias——期望值等於真實誤差。呢個就係 JL 定理嘅 magic。

QJL 點樣運作（完整流程）：

計算 PolarQuant 嘅 residual（原始 vector - 量化後 vector）
對 residual 做 JL transform（random projection）
只保留 sign bit（+1 或 -1）

呢個 1-bit 嘅修正項作為 zero-bias estimator，確保 attention score 嘅估算係 unbiased 嘅。

python# TurboQuant 兩階段量化（概念版）
def turboquant_encode(x, R, num_bits):
    """
    x: 原始 vector (d-dimensional)
    R: Random rotation matrix
    num_bits: 目標 bit-width
    """
    # Step 1: Random rotation
    x_rotated = R @ x
    
    # Step 2: PolarQuant (用 num_bits - 1 bits)
    x_polar = polar_transform(x_rotated)
    x_quantized = lloyd_max_quantize(x_polar, bits=num_bits - 1)
    
    # Step 3: QJL on residual (用 1 bit)
    residual = x_rotated - inverse_polar(x_quantized)
    qjl_sign = sign(jl_transform(residual))  # 1-bit!
    
    return x_quantized, qjl_sign

def turboquant_decode(x_quantized, qjl_sign, R):
    """
    x_quantized: 量化後嘅 polar coordinates（angles + final radius）
    qjl_sign:    1-bit QJL correction（TurboQuant_Prod 先需要）
    R:           同一個 random rotation matrix（encode 時用嘅）
    """
    # Step 1: Dequantize — 用 Lloyd-Max 反向 mapping 還原 angles 同 radius
    x_polar_reconstructed = lloyd_max_dequantize(x_quantized)

    # Step 2: Inverse polar transform — 從 (angles + radius) 遞迴還原 rotated vector
    y_hat = inverse_polar_transform(x_polar_reconstructed)  # shape: (d,)

    # Step 3: (TurboQuant_Prod only) Apply QJL correction to remove bias
    # y_hat = apply_qjl_correction(y_hat, qjl_sign)  # 需要 custom attention kernel

    # Step 4: Inverse rotation — R 係 orthogonal，所以 R^T = R^{-1}
    x_reconstructed = R.T @ y_hat  # 完美還原到原始空間

    return x_reconstructed

# 唯一有 information loss 嘅步驟係 lloyd_max_quantize / lloyd_max_dequantize（rounding）
# Random rotation 同 polar transform 本身都係 lossless 同 invertible

TurboQuant 嘅兩個變體

變體	Bit 分配	目標	用途
TurboQuant_MSE	所有 bits → Lloyd-Max	最小化 MSE	Vector reconstruction、KV cache drop-in
TurboQuant_Prod	(b-1) bits Lloyd-Max + 1 bit QJL	Unbiased inner product	Custom attention kernel、vector search

⚠️ 實作注意
如果你想做 KV cache 嘅 drop-in replacement，用 TurboQuant_MSE。因為將 QJL correction 加返入 reconstructed vector 會注入 noise，反而令 cosine similarity 暴跌。TurboQuant_Prod 需要 custom attention kernel 先可以正確使用。

理論分析：點解 TurboQuant 近乎最優？

Information-Theoretic Lower Bound

論文提供咗一個重要嘅理論貢獻：證明咗任何 vector quantizer 能達到嘅最佳 distortion rate 嘅 lower bound。

對於 $b$ -bit quantization 嘅 $d$ -dimensional vector：

\text{MSE}_{\text{optimal}} \geq \frac{\|x\|^2}{d} \cdot \frac{1}{2^{2b/d} \cdot c}

TurboQuant 達到嘅 distortion rate 同呢個 lower bound 只差一個 ≈2.7× constant factor。

呢個意味住咩？ 你已經冇辦法設計一個比 TurboQuant 好好多嘅 vector quantizer——最多只能再改善 2.7 倍。喺 quantization 嘅世界入面，呢個差距已經非常細。

點解 Data-Oblivious 都可以咁好？

通常我哋會覺得 data-dependent 方法（例如 Product Quantization 需要 train codebook）應該表現更好。但 TurboQuant 用 random rotation 嘅幾何性質繞過咗呢個問題：

高維集中現象：高維空間嘅 random rotation 令座標分佈高度 concentrated
Near-independence：旋轉後嘅座標幾乎獨立，可以安全地做 per-coordinate quantization
Universal quantizer：因為分佈已知，一個 fixed quantizer 就夠

實驗結果

KV Cache 量化

論文用 Gemma、Mistral 同 Llama 模型，喺五個 long-context benchmarks 上測試（LongBench、Needle In A Haystack、ZeroSCROLLS、RULER、L-Eval）。

LongBench 結果（Llama-3.1-8B-Instruct）

方法	Bit-width	Memory 減少	質量影響
Unquantized (FP32)	32-bit	1x（baseline）	-
KIVI	4-bit	~8x	⚠️ 有明顯退化
PolarQuant	4-bit	~8x	✅ 接近無損
TurboQuant	3.5-bit	≥6x	✅ 完全無損
TurboQuant	2.5-bit	~12x	⚠️ 輕微退化

關鍵發現：

🎯 3.5 bits/channel：absolute quality neutrality（同 FP32 一樣！）
📉 2.5 bits/channel：marginal quality degradation（仍然可用）
🔍 Needle In A Haystack：完美通過，喺超長 context 入面搵到「針」

Attention Speedup（H100 GPU）

Bit-width	相對 FP32 Speedup
8-bit	~2x
4-bit	~8x
3-bit	~6x

4-bit TurboQuant 喺 H100 上達到 8x speedup，呢個數字非常 impressive。

Nearest Neighbor Search

喺 vector search 任務上，TurboQuant 嘅表現更加震撼：

Recall 比較（GloVe dataset, d=200）

TurboQuant 喺所有 bit-widths 同 top-k 設定下都 consistently 超越 Product Quantization (PQ) 同 RabbiQ 等 data-dependent baselines。

Indexing Time 比較

方法	d=200	d=1536	d=3072
Product Quantization	37.04s	239.75s	494.42s
TurboQuant	0.0007s	0.0013s	0.0021s

🚀 Indexing 時間差距
TurboQuant 嘅 indexing 時間比 Product Quantization 快 50,000x — 230,000x！

因為 TurboQuant 係 data-oblivious，唔需要 train codebook，indexing 基本上只係做一次 random rotation + scalar quantization。呢個對 real-time vector search 系統嚟講係 game changer。

深入分析

點解 Random Rotation 會令座標 Follow Beta Distribution？

呢個係 TurboQuant 最 elegant 嘅地方，亦係成個方法嘅數學核心。等我一步步解釋。

Step 0：Random Rotation 會唔會 Lose Information？

首先解答一個好自然嘅疑問：做 random rotation 會唔會丟失方向資訊？

答案係唔會。Orthogonal rotation 係一個 bijective（一一對應）同 invertible（可逆） 嘅操作：

\text{Encode: } y = Rx \quad \longleftrightarrow \quad \text{Decode: } x = R^T y

因為 rotation matrix $R$ 係 orthogonal 嘅（ $R^T R = I$ ），你隨時可以用 $R^T$ 完美還原原始 vector。

💡 類比：想像你有一張相片（= 原始 vector），你將佢旋轉咗 37°（= random rotation）。張相嘅所有資訊都仲喺度，只係角度變咗。你隨時可以旋轉返 -37° 完美還原。

Random rotation 唔係「丟掉」方向資訊，而係將方向資訊 重新分配 到所有座標上面。原本可能集中喺幾個座標嘅資訊，旋轉後均勻咁 spread 到每個座標。

所以 TurboQuant 嘅流程入面：

Random rotation（ $Rx$ ）→ ✅ Lossless，只係改變 representation
Quantization（rounding）→ ⚠️ 呢度先會 lose 少量 information
Dequantize + 反旋轉（ $R^T \hat{y}$ ）→ 還原到原始空間

唯一 lose information 嘅步驟係 quantization，唔係 rotation。 而 rotation 嘅 magic 正正係令 quantization loss 變到最小——因為旋轉後座標分佈 known 同 concentrated，同一套 quantizer 就可以 near-optimally 壓縮所有座標，唔使浪費 bits 喺 per-block constants 上面。

簡單嚟講：rotation 係「免費改變 representation 令壓縮更高效」，唔係「犧牲方向換取分佈」。

Step 1：Random Rotation = Uniform on Sphere

假設你有一個 unit vector $x \in \mathbb{R}^d$ ，佢住喺一個 $d$ -dimensional unit sphere 上面。當你做一個 uniformly random orthogonal rotation $R$ ，效果等同於將 $x$ 隨機映射到 sphere 上嘅另一個點——而且呢個映射係 uniform over the sphere。

即係話， $Rx$ 嘅分佈同「喺 unit sphere 上 uniformly sample 一個點」係一樣嘅。

Step 2：Sphere 上嘅 Uniform Point → Beta Distribution

呢度有一個 classical result：如果你喺 $d$ -dimensional unit sphere 上 uniformly sample 一個點 $z$ ，每個座標 $z_i$ 嘅 squared value 嘅 marginal distribution 係：

z_i^2 \sim \text{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)

呢個唔係 approximation，係 exact result，源自 sphere 嘅幾何對稱性。

Step 3：點解係 Beta？完整數學推導

首先要理解 Beta distribution 嘅本質。Beta distribution $\text{Beta}(\alpha, \beta)$ 係一個定義喺 $[0, 1]$ 上面嘅分佈，PDF 係：

f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}

佢有一個好重要嘅 構造性定義：如果你有兩個獨立嘅 Gamma random variables：

X \sim \text{Gamma}(\alpha), \quad Y \sim \text{Gamma}(\beta)

咁：

\frac{X}{X+Y} \sim \text{Beta}(\alpha, \beta)

💡 核心洞察：Beta distribution 本質上就係「一個量佔總量嘅比例」嘅分佈。呢個觀察係成個推導嘅關鍵。

Step 3a：Sphere 上嘅 Uniform Point 可以用 Gaussian 構造

如果你 sample $d$ 個 i.i.d. standard Gaussians $g_1, g_2, \dots, g_d \sim \mathcal{N}(0,1)$ ，然後 normalize：

z = \frac{(g_1, g_2, \dots, g_d)}{\sqrt{g_1^2 + g_2^2 + \cdots + g_d^2}}

呢個 $z$ 就係 unit sphere 上嘅 uniform random point。點解？因為 multivariate standard Gaussian 係 spherically symmetric（各個方向嘅密度一樣），normalize 之後自然就 uniform on sphere。

Step 3b：將 $z_i^2$ 拆開嚟睇

因為 $z_i = g_i / \|g\|$ ，所以：

z_i^2 = \frac{g_i^2}{g_1^2 + g_2^2 + \cdots + g_d^2} = \frac{g_i^2}{g_i^2 + \sum_{j \neq i} g_j^2}

呢個 fraction 嘅形式正正就係 「一個量佔總量嘅比例」！

Step 3c：分析分子同分母

分子 $g_i^2$ ：一個 standard Gaussian 嘅平方 = $\chi^2(1) = \text{Gamma}(1/2, 1/2)$
分母嘅其餘部分 $\sum_{j \neq i} g_j^2$ ： $(d-1)$ 個獨立 $\chi^2(1)$ 嘅和 = $\chi^2(d-1) = \text{Gamma}((d-1)/2, 1/2)$

因為 $g_i$ 同其他 $g_j$ 獨立，分子同分母嘅其餘部分亦獨立。根據 Beta 嘅構造性定義：

z_i^2 = \frac{\text{Gamma}(1/2)}{\text{Gamma}(1/2) + \text{Gamma}((d-1)/2)} \sim \text{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)

Step 3d：點解 $\alpha = 1/2$ ？

呢個 $1/2$ 嚟自 $\chi^2(1)$ 嘅 shape parameter。 $\chi^2(k)$ 等價於 $\text{Gamma}(k/2, 1/2)$ ，所以 $\chi^2(1) = \text{Gamma}(1/2, 1/2)$ 。

直覺上， $\alpha = 1/2 < 1$ 代表分佈喺 $0$ 附近有一個 spike——即係話大部分座標嘅 $z_i^2$ 都好細，集中喺接近 $0$ 嘅位置。呢個完全 make sense：一個 $d$ -維 sphere 上嘅 uniform point，每個座標只「分到」大約 $1/d$ 嘅 total energy。

🎯 一句話總結：因為 sphere 上嘅 uniform distribution 可以用獨立 Gaussians normalize 嚟構造，而「一個 Gaussian 平方佔總平方和嘅比例」根據定義就係 Beta distribution。呢個唔係 approximation，係 exact 嘅結果。

💡 直覺理解：Concentration of Measure

2D：Unit circle 上嘅 uniform point = $(\cos\theta, \sin\theta)$ ，座標值分佈得幾 spread out（ $-1$ 到 $+1$ ）

3D：Unit sphere 上嘅 uniform point，座標已經開始 concentrate around 0

1000D：每個座標 極度集中 喺 $\approx 0$ 附近，大約 $\pm 1/\sqrt{d}$ 嘅範圍

維度越高，sphere 嘅「面積」越集中喺赤道附近，所以每個座標嘅值都被迫落入一個好窄嘅已知範圍。

當 $d$ 好大嘅時候，呢個 Beta distribution concentrate around $1/d$ ，所以每個座標大約 $\pm 1/\sqrt{d}$ 。呢個亦解釋咗點解 Gaussian approximation 喺高維度下 work：

[Rx]_i \approx \mathcal{N}\left(0, \frac{\|x\|^2}{d}\right) \quad \text{（高維 Gaussian approximation）}

Step 4：TurboQuant 點樣利用呢個性質

因為 rotation 之後每個座標嘅分佈係 已知同固定嘅（Beta distribution，只取決於維度 $d$ ），TurboQuant 可以：

所有座標嘅 range 幾乎一樣 → 唔需要 per-block scaling
分佈已知 → 可以 precompute 最優嘅 Lloyd-Max quantization levels
座標近乎獨立 → per-coordinate quantization 幾乎最優

簡單嚟講：random rotation 係一個 distribution normalizer——將任何未知分佈嘅 vector 變成一個已知分佈嘅 vector，令 quantization 變得 tractable。

TurboQuant vs Product Quantization

特性	Product Quantization (PQ)	TurboQuant
Training	需要 k-means clustering on dataset	唔需要（data-oblivious）
Codebook	Large codebook（dataset-specific）	固定 scalar quantizer levels
Indexing Time	秒級到分鐘級	微秒級
Online 能力	❌ 需要 pre-trained codebook	✅ 即時量化
理論保證	❌ 無 distortion rate 保證	✅ 近乎 information-theoretic optimal
Recall	⚠️ 依賴 codebook 質量	✅ consistently 更高

PQ 嘅核心問題係佢係 data-dependent：你需要先有一個 representative dataset 嚟 train codebook。如果 data distribution shift 咗，codebook 就會 stale。TurboQuant 完全冇呢個問題。

實際影響同應用場景

1. LLM Inference 優化

呢個係最直接嘅應用。TurboQuant 可以：

將 KV cache memory 減少 6x+
令 LLM 喺同一個 GPU 上支持更長嘅 context
減少 inference cost（memory bandwidth 係主要瓶頸）

具體數字：

原本需要 128 GB memory 嘅 128K context → 用 TurboQuant 只需 ~21 GB
H100 80GB GPU 原本只能 serve 1 個 request → 可能 serve 3-4 個

2. Vector Search / RAG

TurboQuant 嘅 data-oblivious 特性對 vector database 非常有價值：

即時 indexing：新文檔加入後毫秒內可搜索
更高 recall：比 PQ 更準確
更低 memory：同樣 recall 下用更少 storage

3. Edge / Mobile Deployment

3-bit quantization 令 LLM 喺 resource-constrained 環境更加可行：

Mobile devices
Embedded systems
IoT edge computing

實作指南：點樣喺 HuggingFace 模型上用 TurboQuant？

呢個係好多人最關心嘅問題：我有一個 HuggingFace 上嘅 LLM，點樣實際 apply TurboQuant？

核心前提：零 Training，純 Inference-Time

首先要強調：TurboQuant 完全唔需要 retrain 或 fine-tune 模型。唔似 GPTQ / AWQ（需要 calibration data 去量化 weights），TurboQuant 係純粹喺 inference 時壓縮 KV cache。你嘅 model weights 一個 bit 都唔使動。

呢個係因為 TurboQuant 係 data-oblivious 嘅——所有嘢都係數學推導出嚟：

Random rotation matrix $R$ ：server 啟動時隨機 generate 一次，之後固定
Lloyd-Max quantization levels：只取決於維度 $d$ 同 bit-width $b$ ，可以 hardcode 做 lookup table
QJL projection matrix：同樣隨機 generate，唔需要 train

目前嘅實作生態

TurboQuant 論文喺 2026 年 3 月正式發佈，社區已經有幾個實作方向：

平台	狀態	連結
Triton Kernel	✅ 已實作（fused encode/decode）	Dejan.ai Blog
Apple MLX	✅ 已有 prototype（Qwen3.5）	HuggingFace Repo
llama.cpp	💬 Discussion 階段	GitHub Discussion
Weight Quantization	✅ 社區擴展到 weight compression	turboquant-model GitHub
HuggingFace Transformers	⏳ 原生支援待整合	KV Cache Quant Blog

方法一：用 HuggingFace 現有嘅 KV Cache Quantization（最簡單）

HuggingFace Transformers 已經有內建嘅 KV cache quantization 功能。雖然目前用嘅係傳統 affine quantization（唔係 TurboQuant），但呢個 API 展示咗 KV cache 量化嘅基本用法：

pythonimport torch
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantizedCacheConfig

# Step 1: 正常 load HuggingFace 模型（weights 完全唔動）
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Step 2: 配置 KV cache 量化
cache_config = QuantizedCacheConfig(
    backend="quanto",          # 量化 backend
    nbits=4,                   # 目標 bit-width
    residual_length=128,       # 保留幾多 token 用 full precision
)

# Step 3: Generate 時傳入 cache config
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1000,
    cache_implementation="quantized",
    cache_config=cache_config,
)

💡 重點：呢個 API 目前用嘅係傳統 affine quantization（有 scale/zero-point overhead）。一旦社區將 TurboQuant 整合入 HuggingFace，你只需要換 backend 同 config 就可以無縫切換到 TurboQuant——model code 完全唔使改。

方法二：自己實作 TurboQuant KV Cache Wrapper（進階）

如果你想而家就試 TurboQuant，可以自己寫一個 KV cache wrapper。下面提供兩個版本嘅 rotation——標準 random matrix 同快速 Hadamard transform。

版本 A：標準 Random Rotation（簡單但慢）

pythonimport torch
import numpy as np
from scipy.stats import beta as beta_dist

class TurboQuantKVCache:
    """TurboQuant_MSE KV Cache — drop-in replacement"""
    
    def __init__(self, head_dim, num_bits=4, device="cuda"):
        self.head_dim = head_dim
        self.num_bits = num_bits
        self.device = device
        
        # Step 1: 生成 random rotation matrix（一次性）
        # 用 QR decomposition 確保 orthogonality
        random_matrix = torch.randn(head_dim, head_dim, device=device)
        self.R, _ = torch.linalg.qr(random_matrix)
        
        # Step 2: 預計算 Lloyd-Max quantization levels
        # 基於 Beta(1/2, (d-1)/2) distribution
        self.boundaries, self.levels = self._precompute_lloyd_max(
            alpha=0.5, beta_param=(head_dim - 1) / 2, num_levels=2**num_bits
        )
        
        # Storage
        self.quantized_keys = []    # 壓縮後嘅 keys
        self.signs = []             # 每個座標嘅正負號（1 bit each）
        self.radii = []             # 每個 vector 嘅 norm
    
    def _precompute_lloyd_max(self, alpha, beta_param, num_levels):
        """用 Beta distribution 預計算最優 quantization levels"""
        # Lloyd-Max iterative algorithm
        # 初始化：uniform spacing on Beta CDF
        cdf_points = np.linspace(0, 1, num_levels + 1)
        boundaries = beta_dist.ppf(cdf_points, alpha, beta_param)
        levels = np.zeros(num_levels)
        
        for _ in range(100):  # 迭代直到收斂
            # Centroid condition: 每個區間嘅 probability-weighted mean
            for i in range(num_levels):
                lo, hi = boundaries[i], boundaries[i + 1]
                # E[X | lo < X < hi] for Beta distribution
                levels[i] = self._conditional_mean(lo, hi, alpha, beta_param)
            
            # Nearest neighbor condition: 重新劃分邊界
            for i in range(1, num_levels):
                boundaries[i] = (levels[i - 1] + levels[i]) / 2
        
        return (
            torch.tensor(boundaries, device=self.device, dtype=torch.float32),
            torch.tensor(levels, device=self.device, dtype=torch.float32)
        )
    
    def _conditional_mean(self, lo, hi, alpha, beta_param):
        """計算 E[X | lo < X < hi] for Beta(alpha, beta_param)"""
        from scipy.integrate import quad
        if hi <= lo:
            return (lo + hi) / 2
        numer, _ = quad(lambda t: t * beta_dist.pdf(t, alpha, beta_param), lo, hi)
        denom = beta_dist.cdf(hi, alpha, beta_param) - beta_dist.cdf(lo, alpha, beta_param)
        return numer / denom if denom > 1e-12 else (lo + hi) / 2
    
    def encode(self, key_vectors):
        """
        將 key vectors 壓縮儲存
        key_vectors: shape (batch, num_heads, seq_len, head_dim)
        """
        # Step 1: Random rotation
        rotated = key_vectors @ self.R.T  # (B, H, S, D) @ (D, D)
        
        # Step 2: 記錄 norm（final radius）
        norms = torch.norm(rotated, dim=-1, keepdim=True)  # (B, H, S, 1)
        normalized = rotated / (norms + 1e-8)  # unit vectors
        
        # Step 3: 儲存正負號（每個座標 1 bit）
        coord_signs = (normalized >= 0)
        
        # Step 4: 對 z_i² 做 Lloyd-Max quantization
        # Beta(1/2, (d-1)/2) 描述嘅係 z_i²，所以量化 squared values
        squared = normalized.square()
        quantized_indices = torch.bucketize(squared, self.boundaries) - 1
        quantized_indices = quantized_indices.clamp(0, len(self.levels) - 1)
        
        self.quantized_keys.append(quantized_indices.to(torch.uint8))
        self.signs.append(coord_signs)
        self.radii.append(norms.squeeze(-1).half())
    
    def decode(self, layer_idx):
        """
        還原壓縮嘅 key vectors
        """
        indices = self.quantized_keys[layer_idx]
        norms = self.radii[layer_idx]
        
        # Step 1: Dequantize z_i² → 取 sqrt 得到 |z_i|
        reconstructed_sq = self.levels[indices.long()]
        reconstructed = reconstructed_sq.sqrt()
        
        # Step 2: 恢復正負號
        coord_signs = self.signs[layer_idx]
        reconstructed = torch.where(coord_signs, reconstructed, -reconstructed)
        
        # Step 3: 乘返 norm
        reconstructed = reconstructed * norms.unsqueeze(-1).float()
        
        # Step 4: Inverse rotation
        decoded = reconstructed @ self.R  # R^T 嘅 inverse = R
        
        return decoded

版本 B：Randomized Hadamard Transform（快 ~18x）

標準 random rotation 嘅瓶頸係 dense matrix multiplication： $O(d^2)$ 運算。對於 head_dim=128，呢個係 $128^2 = 16{,}384$ 次乘法。Randomized Hadamard Transform (RHT) 可以將呢個降到 $O(d \log d)$ ，即 $128 \times 7 = 896$ 次——快咗大約 18 倍。

RHT 嘅原理：

\text{RHT}(x) = H_d \cdot D \cdot x

其中：

$D$ = diagonal matrix，對角線係隨機 $\pm 1$ （random sign flip）
$H_d$ = Hadamard matrix，可以用 Fast Walsh-Hadamard Transform (FWHT) 喺 $O(d \log d)$ 計算，唔使儲存成個 $d \times d$ matrix

點解 RHT 同 random rotation 效果幾乎一樣？

Random sign flip ( $D$ ) 已經將座標嘅 sign 打亂
Hadamard transform ( $H_d$ ) 將每個座標嘅值均勻 spread 到所有維度
喺高維度（ $d \geq 64$ ，typical attention head dim），concentration of measure 效應令 RHT 後嘅座標分佈同 exact random rotation 幾乎無法區分
QuIP# 等量化論文已經驗證咗呢個 approach 嘅有效性

pythonimport torch
import numpy as np
from scipy.stats import beta as beta_dist

def fast_walsh_hadamard_transform(x):
    """
    Fast Walsh-Hadamard Transform — O(d log d)
    x: tensor of shape (..., d) where d is a power of 2
    """
    d = x.shape[-1]
    h = 1
    while h < d:
        # Butterfly operation: 每次將相鄰 h 個元素做 ± 組合
        x = x.view(*x.shape[:-1], -1, 2 * h)
        a = x[..., :h]
        b = x[..., h:]
        x = torch.cat([a + b, a - b], dim=-1)
        x = x.view(*x.shape[:-2], -1)
        h *= 2
    return x / (d ** 0.5)  # normalize


class TurboQuantKVCache_Hadamard:
    """TurboQuant with Randomized Hadamard Transform — O(d log d) rotation"""
    
    def __init__(self, head_dim, num_bits=4, device="cuda"):
        assert head_dim & (head_dim - 1) == 0, "head_dim must be power of 2"
        self.head_dim = head_dim
        self.num_bits = num_bits
        self.device = device
        
        # Step 1: 生成 random sign vector（一次性，只需 d 個 ±1）
        # 唔使儲存 d×d matrix！只需 d 個 bits
        self.had_signs = (torch.randint(0, 2, (head_dim,), device=device) * 2 - 1).float()
        
        # Step 2: 預計算 Lloyd-Max quantization levels（同版本 A 一樣）
        self.boundaries, self.levels = self._precompute_lloyd_max(
            alpha=0.5, beta_param=(head_dim - 1) / 2, num_levels=2**num_bits
        )
        
        # Storage
        self.quantized_keys = []
        self.signs = []
        self.radii = []
    
    def _conditional_mean(self, lo, hi, alpha, beta_param):
        from scipy.integrate import quad
        if hi <= lo:
            return (lo + hi) / 2
        numer, _ = quad(lambda t: t * beta_dist.pdf(t, alpha, beta_param), lo, hi)
        denom = beta_dist.cdf(hi, alpha, beta_param) - beta_dist.cdf(lo, alpha, beta_param)
        return numer / denom if denom > 1e-12 else (lo + hi) / 2
    
    def _precompute_lloyd_max(self, alpha, beta_param, num_levels):
        cdf_points = np.linspace(0, 1, num_levels + 1)
        boundaries = beta_dist.ppf(cdf_points, alpha, beta_param)
        boundaries = np.clip(boundaries, 0, 1)
        levels = np.zeros(num_levels)
        for _ in range(100):
            for i in range(num_levels):
                lo, hi = boundaries[i], boundaries[i + 1]
                levels[i] = self._conditional_mean(lo, hi, alpha, beta_param)
            for i in range(1, num_levels):
                boundaries[i] = (levels[i - 1] + levels[i]) / 2
        return (
            torch.tensor(boundaries, device=self.device, dtype=torch.float32),
            torch.tensor(levels, device=self.device, dtype=torch.float32)
        )
    
    def _rht_forward(self, x):
        """Randomized Hadamard Transform: H @ D @ x"""
        # Step 1: Random sign flip（element-wise，O(d)）
        x_flipped = x * self.had_signs
        # Step 2: Fast Walsh-Hadamard（O(d log d)，冇 matrix storage）
        return fast_walsh_hadamard_transform(x_flipped)
    
    def _rht_inverse(self, y):
        """Inverse RHT: D^T @ H^T @ y = D @ H @ y（因為 H 同 D 都係 involution）"""
        # H 係 symmetric 同 orthogonal → H^{-1} = H
        # D 係 diagonal 同 self-inverse → D^{-1} = D
        y_unhadamard = fast_walsh_hadamard_transform(y)
        return y_unhadamard * self.had_signs
    
    def encode(self, key_vectors):
        """同版本 A 一樣，但用 RHT 代替 dense rotation"""
        # RHT rotation — O(d log d) instead of O(d²)
        rotated = self._rht_forward(key_vectors)
        
        norms = torch.norm(rotated, dim=-1, keepdim=True)
        normalized = rotated / (norms + 1e-8)
        
        coord_signs = (normalized >= 0)
        squared = normalized.square()
        quantized_indices = torch.bucketize(squared, self.boundaries) - 1
        quantized_indices = quantized_indices.clamp(0, len(self.levels) - 1)
        
        self.quantized_keys.append(quantized_indices.to(torch.uint8))
        self.signs.append(coord_signs)
        self.radii.append(norms.squeeze(-1).half())
    
    def decode(self, layer_idx):
        """同版本 A 一樣，但用 inverse RHT"""
        indices = self.quantized_keys[layer_idx]
        norms = self.radii[layer_idx]
        
        reconstructed_sq = self.levels[indices.long()]
        reconstructed = reconstructed_sq.sqrt()
        coord_signs = self.signs[layer_idx]
        reconstructed = torch.where(coord_signs, reconstructed, -reconstructed)
        reconstructed = reconstructed * norms.unsqueeze(-1).float()
        
        # Inverse RHT — 同樣 O(d log d)
        decoded = self._rht_inverse(reconstructed)
        return decoded

兩個版本嘅對比：

特性	版本 A：Dense Random Rotation	版本 B：Randomized Hadamard (RHT)
Rotation 複雜度	$O(d^2)$	$O(d \log d)$ ✅
head_dim=128 嘅運算量	16,384 次乘法	896 次加減法 ✅
Memory（儲存 rotation）	$d \times d$ matrix = 65KB	$d$ 個 sign bits = 16 bytes ✅
數學保證	Exact uniform on sphere ✅	Approximate（高維幾乎無差）
量化質量	理論最優	實際幾乎一樣（ $d \geq 64$ ）✅
限制	無	$d$ 必須係 2 嘅冪次（LLM 通常滿足）

🚀 建議：喺 production 環境用 版本 B（Hadamard）。大部分 LLM 嘅 head_dim 都係 64、128 或 256（都係 2 嘅冪次），完美符合要求。速度快 ~18x，memory 少 4000x，quantization 質量幾乎無差異。QuIP#、AQLM 等 SOTA 量化方法都已經採用咗 RHT 作為標準 rotation scheme。

⚠️ 注意：以上兩個版本都係簡化版嘅概念 code，省略咗 recursive polar transform 同 sign handling。實際生產級實作建議參考 Dejan.ai 嘅 Triton kernel，佢哋已經做咗 fused encode/decode，性能更好。

方法三：用 Triton Kernel 做高性能實作（生產級）

咩係 Triton？

如果你之前淨係寫過 PyTorch，可能會問：Triton 係咩嚟？點解唔直接用 CUDA？

Triton 係 OpenAI 喺 2021 年開源嘅一個 Python-based GPU programming language 同 compiler。佢嘅目標好簡單：令你唔使學 CUDA 都可以寫出接近 expert-level 性能嘅 GPU kernel。

💡 類比：CUDA 就好似用 Assembly 寫程式——你要自己管理 thread blocks、shared memory、memory coalescing、register allocation⋯⋯每一步都要手動優化。Triton 就好似用 Python 寫程式——你只需要講清楚「我想做咩計算」，compiler 會自動幫你處理底層嘅 GPU 優化。

Triton 同 CUDA 嘅核心分別：

特性	CUDA	Triton
語言	C/C++ 擴展	Python（用 decorator `@triton.jit`）
並行模型	手動管理 thread blocks、warps、threads	自動管理，你只需定義 tile/block 級別嘅操作
Memory 管理	手動搬 data（global → shared → register）	自動優化 memory access patterns
學習曲線	極陡峭（需要深入理解 GPU 架構）	相對平緩（Python 開發者可以快速上手）
性能	理論上最優（如果你係 expert）	大部分情況同 expert CUDA 一樣快
Debug	困難（GPU debugger 唔好用）	相對容易（Python 環境）

點解 Triton 對 TurboQuant 特別重要？

TurboQuant 嘅 encode/decode 涉及多個步驟（rotation → polar transform → quantize → dequantize → inverse），如果每步都用 PyTorch 嘅 built-in ops，中間結果要不斷喺 GPU 嘅 HBM（高帶寬記憶體） 同 SRAM（片上快取） 之間搬嚟搬去，bandwidth 會成為瓶頸。

Triton 嘅 kernel fusion 可以將成條 pipeline 寫成一個 GPU kernel——所有中間結果都留喺 SRAM 入面，避免 HBM round trip。呢個就係點解 Triton 實作可以比 naive PyTorch 快好多倍嘅原因。

🎯 一句話總結：Triton = 用 Python 寫 GPU kernel 嘅語言。佢令 ML researcher 唔需要學 CUDA 就可以寫出高性能嘅 custom operation（例如 TurboQuant 嘅 fused encode/decode）。PyTorch 2.0+ 嘅 torch.compile() 底層都係用 Triton 生成 GPU code。

Triton 實作 TurboQuant

如果你要喺 NVIDIA GPU 上做 production deployment，Triton kernel 係最佳選擇。Dejan.ai 已經開源咗一個 fused implementation：

python# 概念：將 TurboQuant encode/decode fuse 入 attention kernel
# 唔需要 dequantize 到 full precision 再做 attention
# 直接喺 quantized domain 計算 attention score

import triton
import triton.language as tl

@triton.jit
def turboquant_attention_kernel(
    Q, K_quantized, K_norms, V, Output,
    rotation_matrix, lloyd_max_levels,
    seq_len, head_dim, num_bits,
    BLOCK_SIZE: tl.constexpr
):
    # Fused: dequantize K + compute Q·K + softmax + V aggregation
    # 所有嘢喺 GPU SRAM 入面完成，避免 HBM ↔ SRAM round trip
    ...

實際部署 Checklist

✅ TurboQuant 部署步驟（TurboQuant_MSE，drop-in）

正常 load HuggingFace model（weights 唔動）

生成 random rotation matrix $R$ （per attention head，一次性）

預計算 Lloyd-Max levels（based on head_dim 同目標 bits，hardcode 做 lookup table）

Hook 入 model 嘅 KV cache：encode 時做 rotate → polar → quantize，decode 時做 dequantize → inverse polar → inverse rotate

搞定！Model 正常 generate，KV cache 自動壓縮

對比	傳統方法（GPTQ/AWQ）	TurboQuant
量化目標	Model weights	KV cache（runtime data）
需要 calibration data？	✅ 需要（通常跑 128 個 samples）	❌ 完全唔需要
需要 retrain？	❌（但 QAT 需要）	❌
可以同時用？	✅ 可以 stack！用 GPTQ 量化 weights + TurboQuant 壓縮 KV cache = 雙重壓縮
Apply 時機	Offline（量化後儲存新 model）	Online（inference 時即時壓縮）

🚀 最佳實踐組合
如果你想最大化壓縮：

先用 GPTQ/AWQ 將 model weights 量化到 4-bit（減少 model size）

再用 TurboQuant 喺 inference 時壓縮 KV cache 到 3.5-bit（減少 runtime memory）

兩者完全獨立，可以自由組合，唔會互相影響

例如：Llama-3.1-70B 用 GPTQ-4bit（35GB weights）+ TurboQuant-3.5bit KV cache = 喺單張 H100 80GB 上跑 128K context！

限制同未來方向

當前限制

1. Random Rotation 嘅 Overhead

雖然 quantization 本身好快，但 random rotation 需要一次 matrix multiplication（ $O(d^2)$ ）。對於非常高維嘅 vectors，呢個可能係瓶頸。

解決思路：用 structured random matrices（如 Hadamard transform），可以將 rotation 降到 $O(d \log d)$ 。

2. 需要 Custom Kernel

TurboQuant_Prod（帶 QJL 嘅版本）需要 custom attention kernel 先可以正確使用。Drop-in replacement 只能用 TurboQuant_MSE。

3. 2-bit 以下仍有挑戰

雖然 2.5-bit 已經可用，但極低 bit-width（1-2 bit）嘅質量退化仍然明顯。呢個可能係 information-theoretic 嘅限制。

未來方向

Hardware-Aware Optimization

將 TurboQuant 嘅 quantization scheme 同 GPU kernel 深度整合，進一步減少 dequantization overhead。已經有人喺 Triton kernel 上實現咗 fused implementation。

同其他壓縮技術結合

TurboQuant 可以同 model pruning、knowledge distillation 等技術 stack：

Weight quantization + KV cache quantization（TurboQuant）
Structured pruning + TurboQuant
多層壓縮進一步減少 memory footprint

Broader Applications

除咗 KV cache 同 vector search，TurboQuant 嘅理論可以應用喺：

Embedding compression
Gradient quantization（distributed training）
Sensor data compression（IoT）

技術啟示

1. Data-Oblivious 唔代表差

傳統觀念認為 data-dependent 方法一定好過 data-oblivious。TurboQuant 用數學證明咗：透過巧妙利用高維空間嘅幾何性質，data-oblivious 方法可以達到 near-optimal performance。

2. 理論同實踐嘅完美結合

TurboQuant 唔只係一個 practical hack，佢有嚴格嘅 information-theoretic 證明。呢種 theory-driven approach 喺 ML 領域越嚟越重要——唔只係「work」，而且「知道點解 work」同「知道幾好」。

3. 簡單嘅方法往往最有效

TurboQuant 嘅核心其實好簡單：rotate → quantize per coordinate → fix bias。冇複雜嘅 codebook、冇 training loop、冇 hyperparameter tuning。呢種 elegance 正正係好嘅 algorithm design 嘅標誌。

4. Quantization 係 LLM 效率嘅關鍵戰場

隨住 context window 越嚟越長（已經去到 1M+ tokens），KV cache 嘅 memory 問題只會越嚟越嚴重。TurboQuant 呢類工作代表咗一個重要方向：唔係靠更大嘅 GPU，而係靠更聰明嘅壓縮。

總結

TurboQuant 係一個將 classical information theory 同 modern LLM 需求完美結合嘅工作。

核心貢獻

PolarQuant：用 polar coordinates + random rotation 消除 quantization overhead
QJL：1-bit 殘差修正確保 unbiased inner product estimation
TurboQuant：兩者結合，達到 near-optimal distortion rate
理論證明：首次提供 vector quantization 嘅 tight information-theoretic lower bounds

實用價值

🎯 3.5-bit KV cache 量化，零精度損失
🚀 H100 上 8x attention speedup
⚡ Vector search indexing 快 200,000x
📊 唔需要 training data，即插即用

點解你應該 care？

如果你：

做 LLM serving → TurboQuant 可以大幅降低你嘅 GPU memory 成本
做 vector search → 即時 indexing + 更高 recall
做 edge AI → 3-bit 壓縮令 on-device LLM 更加可行
做 ML research → 呢篇論文嘅理論分析非常 elegant，值得學習

TL;DR

Google Research 提出咗 TurboQuant，一個理論上近乎最優嘅 vector quantization 框架，專門解決 LLM 嘅 KV cache memory bottleneck 同 vector search 嘅效率問題。

核心重點：

🎯 3-bit 零損失壓縮：KV cache 壓縮到 3.5 bits/channel 完全唔影響模型質量
🚀 8x Speedup：喺 H100 GPU 上 4-bit TurboQuant 比 32-bit unquantized keys 快 8 倍
📊 6x 記憶體減少：KV cache memory 至少減少 6 倍
🧮 理論最優：同 information-theoretic lower bound 只差 ≈2.7× constant factor
⚡ 零 indexing 時間：喺 nearest neighbor search 任務，indexing 時間由幾百秒降到 0.001 秒

背景：點解 KV Cache 量化咁重要？

LLM 嘅 Memory Bottleneck

問題有幾嚴重？

假設一個 LLM 有 32 layers，每層有 32 attention heads，每個 head 嘅 dimension 係 128：

每個 token 嘅 KV cache： $32 \times 32 \times 128 \times 2 \times 4$ bytes = 1 MB
128K context window：128 GB 淨係 KV cache！

呢個就係點解長 context LLM 咁食 memory 嘅原因。

現有量化方法嘅問題

傳統嘅量化方法（例如 per-channel 或 per-block quantization）都有一個共同問題：quantization constants overhead。

Scale 同 Zero Point 係咩？具體例子

等我哋用一個具體例子解釋。假設你有一個 4 個元素嘅 float vector：

pythonx = [-3.2,  0.0,  1.5,  4.8]

你想將佢壓縮成 4-bit integers（範圍 0–15）。

Step 1 — 計算 Scale（縮放比例）

Scale 話俾你知：每個 integer step 代表幾多 float 值：

\text{scale} = \frac{\text{max} - \text{min}}{2^{\text{bits}} - 1} = \frac{4.8 - (-3.2)}{15} = \frac{8.0}{15} \approx 0.533

Step 2 — 計算 Zero Point（零點偏移）

Zero point 話俾你知：邊個 integer 對應 float 嘅 $0.0$ ：

\text{zero\_point} = \text{round}\left(\frac{-\text{min}}{\text{scale}}\right) = \text{round}\left(\frac{3.2}{0.533}\right) \approx 6

Step 3 — 量化

pythonx_int = round(x / scale + zero_point)

原始 float	計算	量化後 (4-bit int)
−3.2	round(−3.2 / 0.533 + 6) = round(0.0)	0
0.0	round(0.0 / 0.533 + 6) = round(6.0)	6
1.5	round(1.5 / 0.533 + 6) = round(8.8)	9
4.8	round(4.8 / 0.533 + 6) = round(15.0)	15

Step 4 — 還原（Dequantize）

pythonx_reconstructed = scale * (x_int - zero_point)
# → [-3.198, 0.0, 1.599, 4.797]  ← 同原始值非常接近

🎯 Scale 同 Zero Point 嘅直觀理解

Scale：就係一把「尺」，話你知 float 世界同 integer 世界嘅比例關係

Zero Point：就係「原點對齊」，確保 float $0.0$ 可以被精確表示，避免系統性偏差

兩者合起來定義咗一個從 integer 空間到 float 空間嘅 affine mapping（線性映射）。

呢度嘅問題係咩？

你需要為每個 block 儲存 scale（float32）同 zero_point（float32 或 int32）。假設 block size = 128 個元素：

部分	大小
128 個 4-bit integers（實際數據）	64 bytes
scale（float32）	4 bytes
zero_point（float32）	4 bytes
總計	72 bytes → 有效 bit-width ≈ 4.5 bits ⚠️

你以為係 4-bit，實際係 4.5 bits。Block size 越細（為咗提高精度），overhead 比例越大，甚至可以去到 5-6 bits。

⚠️ 傳統量化嘅兩難

Block size 太大 → 量化誤差大（同一個 block 入面數值差異大）

Block size 太細 → Constants overhead 大（每個 block 都要 full-precision constants）

兩邊都唔理想！

TurboQuant：核心創新

設計理念

TurboQuant 嘅核心思路可以用一句話概括：

💡 核心洞察：隨機旋轉 + 最優 scalar quantization + 殘差修正
透過 random rotation 將 vector 嘅座標分佈變得「可預測」，然後對每個座標獨立做最優量化，最後用 1-bit QJL 修正 inner product bias。

Data-Oblivious 係咩意思？

量化算法可以分成兩大類：

Data-dependent（依賴數據）：需要先睇大量樣本數據，從中「學習」數據嘅分佈，然後根據呢個分佈設計最優嘅 quantizer。例如 Product Quantization (PQ) 要先用 k-means clustering 喺 dataset 上 train 一個 codebook，之後先可以做量化。
Data-oblivious（不依賴數據）：唔需要任何預先收集嘅數據。量化嘅規則完全由數學推導決定，對任何輸入都適用。TurboQuant 就係呢類——random rotation 之後座標分佈係已知嘅 Beta distribution，所以 Lloyd-Max quantizer 嘅最優 levels 可以純粹從理論計算出嚟，完全唔需要睇實際數據。

💡 類比：一個 data-dependent 嘅快遞分揀員要先研究成個月嘅包裹數據，搞清楚最常見嘅目的地係邊度，先設計分揀規則。一個 data-oblivious 嘅分揀員直接用郵政編碼規則——唔需要睇任何歷史數據，第一個包裹嚟就即刻可以分揀。

點解 Data-Oblivious 係優勢？

喺實際應用入面，data-dependent 方法有幾個根本問題：

需要 warm-up 數據：你要先收集夠代表性嘅樣本先可以 train codebook。喺 LLM inference 嘅場景，每個用戶嘅 query 都唔同，難以預先知道 KV cache 嘅分佈。
Distribution shift：如果實際數據嘅分佈同 training 時嘅分佈唔同（例如新領域嘅問題），codebook 就會 stale，量化質量暴跌。
無法做 online quantization：新 vector 嚟嘅時候，你要先等 codebook ready 先可以量化。對於實時系統，呢個 latency 係不可接受嘅。

TurboQuant 完全冇呢啲問題——任何 vector、任何時間、即刻量化，質量都有理論保證。

三個關鍵組件

Random Rotation（隨機旋轉）

呢個係成個方法嘅基礎。當你對一個 high-dimensional vector 做 random orthogonal rotation 之後，每個座標嘅分佈會 converge 到一個 concentrated Beta distribution。

點解 random rotation 之後會係 Beta distribution？

z_i^2 \sim \text{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)

💡 一句話直覺：Random rotation 將 vector 嘅「能量」均勻地 spread 到每個座標上——每個座標分到大約 $1/d$ 嘅 total energy，而且愈高維呢個分配愈穩定、愈可預測。呢個令到所有座標嘅分佈係 已知同固定嘅，唔受原始 vector 嘅值影響。

點解呢個有用？

原始 vector：座標分佈未知，可能好 skewed，有 outliers
旋轉後 vector：座標分佈變得高度集中同可預測
獨立性：高維空間入面，旋轉後嘅座標幾乎 互相獨立

x \in \mathbb{R}^d \xrightarrow{\text{Random Rotation } R} Rx \sim \text{Concentrated Beta Distribution}

呢個性質意味住我哋可以用一個 universal scalar quantizer 處理所有座標，而唔需要 per-block 嘅 normalization constants。

Lloyd-Max Optimal Scalar Quantizer

在講 PolarQuant 之前，先要理解 TurboQuant 嘅核心壓縮工具：Lloyd-Max quantizer。

Scalar quantization 基本概念：

假設你有一個 continuous 嘅數值（例如 3.7），但你只能用 2-bit 儲存，即係得 4 個可能嘅值。你要揀最接近嘅一個儲存。問題係：呢 4 個值應該擺喺邊？

🎯 Lloyd-Max Quantizer 解決嘅問題
Given 一個已知嘅概率分佈，Lloyd-Max algorithm 會找到一組 最優嘅 quantization levels（reconstruction points）同 decision boundaries，使得 mean squared error (MSE) 最小化。

具體嚟講，Lloyd-Max 做兩件事：

Decision boundaries：將數值線切分成 $2^b$ 個區間（ $b$ = bit-width）
Reconstruction levels：每個區間內擺一個「代表值」，位置喺該區間嘅 probability-weighted centroid

javascript原始數值線：  ----[---x---]----[--x--]----[----x----]----[--x--]----
                  ^           ^              ^           ^
              level 0     level 1        level 2     level 3
              
        區間寬度根據概率密度調整：密度高嘅地方切得密，密度低嘅切得疊

點解係「optimal」？用一個實際例子嚟睇：

假設我哋只有 2-bit（4 個 quantization levels）嚟壓縮呢個訊號。

Uniform Quantizer 嘅災難：

佢會死板咁將 $-10V$ 到 $+10V$ 平均切成 4 等分：

區間	範圍	代表值	問題
1	[-10, -5]	-7.5V	幾乎冇訊號用到 ❌
2	[-5, 0]	-2.5V	90% 訊號迫晒入呢兩個區間 😱
3	[0, 5]	+2.5V	90% 訊號迫晒入呢兩個區間 😱
4	[5, 10]	+7.5V	幾乎冇訊號用到 ❌

Lloyd-Max Quantizer 嘅解法：

Lloyd-Max 嘅核心思想係：資源應該花喺刀口上。

佢透過迭代演算法自動調整：

尋找質心（Centroid Condition）：發現 $[0, 5]$ 區間入面，訊號幾乎全部擠喺 $0V$ 附近，所以將代表值從 $2.5V$ 拉到訊號真正嘅重心，例如 $0.8V$
重新劃分邊界（Nearest Neighbor Condition）：根據新嘅代表值重新劃分區間邊界
反覆迭代直到收斂

最終 Lloyd-Max 分配嘅 4 個代表值可能係：-4.0V, -0.8V, +0.8V, +4.0V

而家同一個 $0.1V$ 訊號會被轉換成 $0.8V$ ——誤差只有 0.7V，比 uniform 嘅 2.4V 細咗 3 倍以上！

💡 點解整體 MSE 更低？
對於偶爾出現嘅極端值（例如 $9V$ ），Lloyd-Max 可能轉換成 $4.0V$ ，單點誤差確實好大。但係，因為 $9V$ 出現嘅機率極低（不到 1%），而 $0.1V$ 出現嘅機率極高（90%+），從整體 MSE 嚟睇，Lloyd-Max 將高頻區域嘅誤差最小化，令整體平均失真降到最低。呢個就係佢被稱為「最優」嘅原因。

總結比較：

方法	分割策略	0.1V 訊號嘅誤差	整體 MSE
Uniform Quantization	均勻分割，唔理分佈	2.4V 😱	較高
Lloyd-Max Quantization	根據分佈密度調整分割	0.7V ✅	最低（理論最優）

喺 TurboQuant 入面嘅角色：

PolarQuant：極座標量化

PolarQuant 係 TurboQuant 嘅第一階段壓縮方法。佢將傳統嘅 Cartesian coordinates 轉換成 polar coordinates：

(x_1, x_2) \rightarrow (r, \theta) \quad \text{where } r = \sqrt{x_1^2 + x_2^2}, \quad \theta = \arctan(x_2 / x_1)

點解用 polar coordinates？

類比：

Cartesian："向東行 3 個 block，向北行 4 個 block"（兩個 unbounded 數值）
Polar："行 5 步，角度 37°"（一個 magnitude + 一個 bounded angle）

因為 random rotation 之後，angles 嘅分佈係已知同高度集中嘅，PolarQuant 可以：

唔需要 per-block normalization（angles 嘅 range 已知）
消除 quantization constants overhead
對每個 angle 獨立用 Lloyd-Max optimal scalar quantizer

但 PolarQuant 唔只係做一次 polar transform——佢係 recursive（遞迴） 咁做，直到只剩一個 final radius 同一堆 angles。

點解要轉成 (d-1) angles + 1 radius 呢個格式？

Recursive polar transform 做嘅嘢本質上係 將 magnitude（大小）同 direction（方向）徹底分離：

d-1 個 angles = 方向資訊：angles 嘅 range 永遠係固定嘅（ $0$ 到 $\pi$ 或 $0$ 到 $2\pi$ ），而且 random rotation 之後佢哋嘅分佈係已知嘅 Beta distribution。呢個意味住你可以用 同一個 pre-computed Lloyd-Max quantizer 處理所有 angles，零 overhead——唔需要任何 per-vector 或 per-block 嘅 normalization constants。
1 個 radius = magnitude 資訊：呢個就係 vector 嘅 norm $\|x\|$ ，佢嘅 range 係 unbounded 嘅，確實需要額外儲存。但只有 1 個數字！

🎯 Overhead 對比

傳統方法： $d$ 個座標，每個 block 需要 2 個 full-precision constants（scale + zero-point）→ overhead 可以加 1-2 bits per number

PolarQuant： $d-1$ 個 angles 用 universal quantizer（0 overhead）+ 1 個 radius 需要儲存 → overhead = 只有 1 個數字

呢個就係 PolarQuant 消除 quantization constants overhead 嘅核心原因——將「需要額外資訊先量化到」嘅部分壓縮到極致（ $d$ 維 → 1 個數字），而將「唔需要額外資訊」嘅部分最大化（ $d-1$ 個 angles）。

Recursive Process 具體點運作？用 8D 做例子：

假設 random rotation 之後你有一個 8 維 vector：

y = (y_1, y_2, y_3, y_4, y_5, y_6, y_7, y_8)

Round 1：將 8 個座標兩兩配對

將相鄰座標 pair up，每對轉成 polar form $(r, \theta)$ ：

Pair	輸入（Cartesian）	輸出（Polar）
Pair 1	$(y_1, y_2)$	$r_1 = \sqrt{y_1^2 + y_2^2}$ ， $\theta_1 = \arctan(y_2/y_1)$
Pair 2	$(y_3, y_4)$	$r_2 = \sqrt{y_3^2 + y_4^2}$ ， $\theta_2 = \arctan(y_4/y_3)$
Pair 3	$(y_5, y_6)$	$r_3 = \sqrt{y_5^2 + y_6^2}$ ， $\theta_3 = \arctan(y_6/y_5)$
Pair 4	$(y_7, y_8)$	$r_4 = \sqrt{y_7^2 + y_8^2}$ ， $\theta_4 = \arctan(y_8/y_7)$

Round 1 之後你有：4 個 angles $(\theta_1, \theta_2, \theta_3, \theta_4)$ + 4 個 radii $(r_1, r_2, r_3, r_4)$ 。

Angles 嘅分佈已知（Beta distribution），即刻用 Lloyd-Max quantize。但 4 個 radii 點算？佢哋係 普通嘅正數，可以當做新嘅「座標」繼續配對！

🔑 關鍵洞察：Radii 就係新嘅數值
每個 $r_i$ 只係一個正實數（例如 $r_1 = 2.3, r_2 = 1.8, r_3 = 0.9, r_4 = 3.1$ ）。佢哋可以同任何數字一樣被 pair up，再做一次 polar transform。你唔需要 radii 有任何特殊性質——只要有兩個數字，就可以算佢哋嘅 $(r, \theta)$ 。

Round 2：將 4 個 radii 兩兩配對

Pair	輸入（兩個 radii）	輸出（Polar）
Pair A	$(r_1, r_2)$ 例如 $(2.3, 1.8)$	$R_1 = \sqrt{2.3^2 + 1.8^2} = 2.92$ ， $\phi_1 = \arctan(1.8/2.3) = 38.0°$
Pair B	$(r_3, r_4)$ 例如 $(0.9, 3.1)$	$R_2 = \sqrt{0.9^2 + 3.1^2} = 3.23$ ， $\phi_2 = \arctan(3.1/0.9) = 73.8°$

Round 2 之後你又得到 2 個新 angles $(\phi_1, \phi_2)$ + 2 個新 radii $(R_1, R_2)$ 。同樣，angles 即刻 quantize，radii 繼續配對。

Round 3（最後一輪）：將 2 個 radii 配對

Pair	輸入（兩個 radii）	輸出（Polar）
Final	$(R_1, R_2)$ 即 $(2.92, 3.23)$	$R_{\text{final}} = \sqrt{2.92^2 + 3.23^2} = 4.35$ ， $\psi = \arctan(3.23/2.92) = 47.9°$

而家只剩 1 個 angle $\psi$ + 1 個 radius $R_{\text{final}}$ 。冇得再 pair 喇！

Binary Tree 全景圖：

javascriptRound 1:  y₁ y₂   y₃ y₄   y₅ y₆   y₇ y₈     ← 8 coordinates
           ↓  ↓     ↓  ↓     ↓  ↓     ↓  ↓
          r₁,θ₁   r₂,θ₂   r₃,θ₃   r₄,θ₄      ← 4 angles ✅ quantize
                                                   4 radii → 繼續
Round 2:  r₁  r₂   r₃  r₄
           ↓   ↓     ↓   ↓
          R₁, φ₁   R₂, φ₂                       ← 2 angles ✅ quantize
                                                   2 radii → 繼續
Round 3:  R₁  R₂
           ↓   ↓
        R_final, ψ                               ← 1 angle  ✅ quantize
                                                   1 radius → 冇得再 pair！

最終結果：7 angles (θ₁,θ₂,θ₃,θ₄,φ₁,φ₂,ψ) + 1 final radius (R_final)
         = (d-1) angles + 1 radius  ✅

💡 點解最後一定淨返 1 個 radius？
每一輪 polar transform 都將 $n$ 個數字變成 $n/2$ 個 angles + $n/2$ 個 radii。Angles 即刻 quantize 咗，radii 繼續 pair。所以 radii 嘅數量每輪減半： $4 \to 2 \to 1$ 。呢個係一棵 binary tree，最尾一定 converge 到 1 個 root。

數學上，呢個 final radius 就係原始 vector 嘅 norm（模長）：
$R_{\text{final}} = \|y\| = \sqrt{y_1^2 + y_2^2 + \cdots + y_8^2}$
因為每次 $\sqrt{a^2 + b^2}$ 都係 partial norm，一路 combine 上去就係 total norm。呢個其實就係 hyperspherical coordinates（超球座標） 嘅遞迴定義！

Final radius 點處理？

QJL：1-bit 殘差修正

就算 PolarQuant 嘅 MSE 壓縮質量好高，佢仍然會喺 inner product estimation 引入 bias。TurboQuant 用 Quantized Johnson-Lindenstrauss (QJL) transform 解決呢個問題。

點解需要 QJL？先理解 Residual 同 Bias 問題

首先要搞清楚一件事：residual（殘差）係咩？佢係邊度嚟嘅？

python# Residual 嘅來源：rounding error
original_key = k                              # 原始 key vector
rotated = R @ k                               # Step 1: rotation（lossless）
angles, radius = polar_transform(rotated)     # Step 2: polar transform（lossless）
quantized_angles = lloyd_max_round(angles)     # Step 3: rounding ⚠️ 呢度產生誤差！
reconstructed = inverse_all(quantized_angles, radius, R)  # 還原

residual = k - reconstructed                  # ← 呢個就係 residual = rounding error

而呢個 residual 會引入一個隱藏問題——inner product bias。

\underbrace{q \cdot k}_{\text{真實 attention score}} = 0.85 \quad \neq \quad \underbrace{q \cdot \hat{k}}_{\text{PolarQuant 估算}} = 0.97

⚠️ 點解 MSE 低但 inner product 仍有 bias？
MSE 衡量嘅係 $\|x - \hat{x}\|^2$ （向量空間嘅重建誤差），但 inner product $q \cdot \hat{k}$ 嘅誤差取決於殘差同 query 嘅方向關係。PolarQuant 嘅量化誤差喺某個方向上有系統性偏向，導致 inner product 嘅估算偏高或偏低。兩個目標（低 MSE vs unbiased inner product）唔係同一回事。

QJL 點樣修正？

核心思路係：用極少嘅 bits 捕捉殘差嘅方向資訊，然後用呢個資訊修正 inner product 估算。

python# Step 1: 計算殘差（量化丟咗嘅部分）
residual = x_rotated - x_dequantized   # 原始 rotated vector - PolarQuant 還原值

# Step 2: 對 residual 做 random projection（JL transform）
projected = random_matrix @ residual   # 降維到低維空間

# Step 3: 只保留 sign bit（+1 或 -1）
qjl_sign = sign(projected)             # 每個維度只用 1 bit！

用呢個 qjl_sign 修正 inner product 估算之後，數學上可以證明估算變成 unbiased：

\mathbb{E}[q \cdot \hat{k}_{\text{corrected}}] = q \cdot k \quad \text{（期望值等於真實值）}

💡 直覺類比：想像你估算兩個城市之間嘅距離。PolarQuant 估算就好似用地圖粗略量度——準確但因地圖比例尺會系統性偏高。QJL correction 就好似一張只寫住「偏長咗定偏短咗」（+ 或 −）嘅修正紙條。紙條只有一個 bit，但已經夠你知道要調整嘅方向，統計上就可以消除系統性偏差。

點解只需要 1 bit？用具體數字睇

假設 PolarQuant rounding 之後，residual（rounding error）同 query 如下：

Residual（rounding 造成嘅誤差）： $e = [-0.3, +0.1]$
Query： $q = [0.8, 0.6]$
真實 inner product 誤差： $q \cdot e = 0.8 \times (-0.3) + 0.6 \times 0.1 = -0.18$

呢個 $-0.18$ 就係 PolarQuant 估算嘅 bias——我哋想用 1 bit 嚟修正佢。

QJL 揀一個 random direction $r$ （每個 component 隨機 $\pm 1$ ），計算 $r \cdot e$ ，然後只保留 sign（+ 或 −）。修正公式：

\text{correction} = \text{sign}(r \cdot e) \times (r \cdot q)

Dot product 點計？ 就係逐個 component 相乘再加埋：

$r \cdot e$ ：將 $r$ 同 residual $e$ 嘅對應 component 逐個乘，再求和。例如 $r = (+1, +1)$ ， $e = [-0.3, +0.1]$ → $(+1)(-0.3) + (+1)(+0.1) = -0.3 + 0.1 = -0.2$
$r \cdot q$ ：同理，將 $r$ 同 query $q$ 嘅對應 component 逐個乘再求和。例如 $r = (+1, +1)$ ， $q = [0.8, 0.6]$ → $(+1)(0.8) + (+1)(0.6) = 0.8 + 0.6 = 1.4$

2D 入面 $r$ 有 4 種等概率嘅可能，逐個計：

Random direction $r$	$r \cdot e$ 計算過程	sign（1 bit）	$r \cdot q$ 計算過程	sign × (r·q)
$(+1, +1)$	$(+1)(-0.3) + (+1)(+0.1) = -0.2$	−1	$(+1)(0.8) + (+1)(0.6) = 1.4$	−1.4
$(+1, -1)$	$(+1)(-0.3) + (-1)(+0.1) = -0.4$	−1	$(+1)(0.8) + (-1)(0.6) = 0.2$	−0.2
$(-1, +1)$	$(-1)(-0.3) + (+1)(+0.1) = +0.4$	+1	$(-1)(0.8) + (+1)(0.6) = -0.2$	−0.2
$(-1, -1)$	$(-1)(-0.3) + (-1)(+0.1) = +0.2$	+1	$(-1)(0.8) + (-1)(0.6) = -1.4$	−1.4

所有 4 種情況嘅 correction 都係負數！ 平均值 = $(-1.4 - 0.2 - 0.2 - 1.4) / 4 = -0.8$ 。

點解期望值會正比於 $q \cdot e$ ？逐步拆解：

修正公式係 $\text{sign}(r \cdot e) \times (r \cdot q)$ 。將 $r \cdot q = \sum_i r_i q_i$ 展開，用 linearity of expectation（期望值嘅線性性質）：

\mathbb{E}[\text{sign}(r \cdot e) \times (r \cdot q)] = \sum_i q_i \times \mathbb{E}[\text{sign}(r \cdot e) \times r_i]

而家個關鍵問題變成： $\mathbb{E}[\text{sign}(r \cdot e) \times r_i]$ 等於咩？答案係：

\mathbb{E}[\text{sign}(r \cdot e) \times r_i] = c \times e_i

其中 $c > 0$ 係一個只取決於維度同 norm 嘅已知常數。

點解 $\mathbb{E}[\text{sign}(r \cdot e) \times r_i] \propto e_i$ ？直覺解釋：

回想 $r \cdot e = r_1 e_1 + r_2 e_2 + \cdots + r_d e_d$ ，而 $r_i$ 係其中一個 component（隨機 $\pm 1$ ）。

當 $e_i$ 大且正嘅時候： $r_i = +1$ 會將 $r \cdot e$ 推向正方向 → sign 更可能係 $+1$ → $\text{sign} \times r_i$ 傾向正數
當 $e_i$ 大且負嘅時候： $r_i = +1$ 會將 $r \cdot e$ 推向負方向 → sign 更可能係 $-1$ → $\text{sign} \times r_i$ 傾向負數
當 $e_i = 0$ ： $r_i$ 對 sign 冇任何影響 → $\text{sign} \times r_i$ 期望值 = 0

呢個 $r_i$ 同 sign 之間嘅「共振」大小，正正同 $e_i$ 嘅大小成正比！

最後一步：代入求和

\mathbb{E}[\text{correction}] = \sum_i q_i \times (c \times e_i) = c \times \sum_i q_i \, e_i = c \times (q \cdot e)

\mathbb{E}\left[\frac{\text{correction}}{c}\right] = q \cdot e \quad \checkmark

🎯 一句話總結：點解「只知方向」就夠 unbiased？
因為 sign bit 捕捉嘅唔只係「正定負」——佢捕捉嘅係 residual 嘅每個 component $e_i$ 同隨機方向 $r_i$ 之間嘅 correlation 強度。呢個 correlation 正正同 $e_i$ 成正比。乘以 $q_i$ 再求和之後，期望值自然等於 $q \cdot e$ （差一個已知常數）。所以 1 bit 嘅 sign 已經包含咗足夠嘅 方向 + 比例 資訊，唔只係 ± 咁簡單。

🔑 點解 work？直覺解釋
Sign bit 本質上回答咗一個問題：「residual 喺呢條隨機線嘅邊一邊？」

如果 residual 同 query 指向同一邊（兩者都同 $r$ align 或者都 oppose），correction 係正數 → 加大估算

如果指向唔同邊，correction 係負數 → 減少估算

因為 residual 同 query 嘅相對方向決定咗 inner product 誤差嘅正負，而 sign bit 正正捕捉咗呢個「相對方向」資訊。每個 1-bit 答案都好 noisy，但 期望值（平均）已經指向正確方向——呢個就係 unbiased 嘅定義。

多用幾個 random directions（更多 bits）= 更多「投票」= 更低 variance（更穩定）。但 1 bit 已經足以消除 bias——期望值等於真實誤差。呢個就係 JL 定理嘅 magic。

QJL 點樣運作（完整流程）：

計算 PolarQuant 嘅 residual（原始 vector - 量化後 vector）
對 residual 做 JL transform（random projection）
只保留 sign bit（+1 或 -1）

呢個 1-bit 嘅修正項作為 zero-bias estimator，確保 attention score 嘅估算係 unbiased 嘅。

python# TurboQuant 兩階段量化（概念版）
def turboquant_encode(x, R, num_bits):
    """
    x: 原始 vector (d-dimensional)
    R: Random rotation matrix
    num_bits: 目標 bit-width
    """
    # Step 1: Random rotation
    x_rotated = R @ x
    
    # Step 2: PolarQuant (用 num_bits - 1 bits)
    x_polar = polar_transform(x_rotated)
    x_quantized = lloyd_max_quantize(x_polar, bits=num_bits - 1)
    
    # Step 3: QJL on residual (用 1 bit)
    residual = x_rotated - inverse_polar(x_quantized)
    qjl_sign = sign(jl_transform(residual))  # 1-bit!
    
    return x_quantized, qjl_sign

def turboquant_decode(x_quantized, qjl_sign, R):
    """
    x_quantized: 量化後嘅 polar coordinates（angles + final radius）
    qjl_sign:    1-bit QJL correction（TurboQuant_Prod 先需要）
    R:           同一個 random rotation matrix（encode 時用嘅）
    """
    # Step 1: Dequantize — 用 Lloyd-Max 反向 mapping 還原 angles 同 radius
    x_polar_reconstructed = lloyd_max_dequantize(x_quantized)

    # Step 2: Inverse polar transform — 從 (angles + radius) 遞迴還原 rotated vector
    y_hat = inverse_polar_transform(x_polar_reconstructed)  # shape: (d,)

    # Step 3: (TurboQuant_Prod only) Apply QJL correction to remove bias
    # y_hat = apply_qjl_correction(y_hat, qjl_sign)  # 需要 custom attention kernel

    # Step 4: Inverse rotation — R 係 orthogonal，所以 R^T = R^{-1}
    x_reconstructed = R.T @ y_hat  # 完美還原到原始空間

    return x_reconstructed

# 唯一有 information loss 嘅步驟係 lloyd_max_quantize / lloyd_max_dequantize（rounding）
# Random rotation 同 polar transform 本身都係 lossless 同 invertible

TurboQuant 嘅兩個變體

變體	Bit 分配	目標	用途
TurboQuant_MSE	所有 bits → Lloyd-Max	最小化 MSE	Vector reconstruction、KV cache drop-in
TurboQuant_Prod	(b-1) bits Lloyd-Max + 1 bit QJL	Unbiased inner product	Custom attention kernel、vector search

⚠️ 實作注意
如果你想做 KV cache 嘅 drop-in replacement，用 TurboQuant_MSE。因為將 QJL correction 加返入 reconstructed vector 會注入 noise，反而令 cosine similarity 暴跌。TurboQuant_Prod 需要 custom attention kernel 先可以正確使用。

理論分析：點解 TurboQuant 近乎最優？

Information-Theoretic Lower Bound

論文提供咗一個重要嘅理論貢獻：證明咗任何 vector quantizer 能達到嘅最佳 distortion rate 嘅 lower bound。

對於 $b$ -bit quantization 嘅 $d$ -dimensional vector：

\text{MSE}_{\text{optimal}} \geq \frac{\|x\|^2}{d} \cdot \frac{1}{2^{2b/d} \cdot c}

TurboQuant 達到嘅 distortion rate 同呢個 lower bound 只差一個 ≈2.7× constant factor。

點解 Data-Oblivious 都可以咁好？

通常我哋會覺得 data-dependent 方法（例如 Product Quantization 需要 train codebook）應該表現更好。但 TurboQuant 用 random rotation 嘅幾何性質繞過咗呢個問題：

高維集中現象：高維空間嘅 random rotation 令座標分佈高度 concentrated
Near-independence：旋轉後嘅座標幾乎獨立，可以安全地做 per-coordinate quantization
Universal quantizer：因為分佈已知，一個 fixed quantizer 就夠

實驗結果

KV Cache 量化

論文用 Gemma、Mistral 同 Llama 模型，喺五個 long-context benchmarks 上測試（LongBench、Needle In A Haystack、ZeroSCROLLS、RULER、L-Eval）。

LongBench 結果（Llama-3.1-8B-Instruct）

方法	Bit-width	Memory 減少	質量影響
Unquantized (FP32)	32-bit	1x（baseline）	-
KIVI	4-bit	~8x	⚠️ 有明顯退化
PolarQuant	4-bit	~8x	✅ 接近無損
TurboQuant	3.5-bit	≥6x	✅ 完全無損
TurboQuant	2.5-bit	~12x	⚠️ 輕微退化

關鍵發現：

🎯 3.5 bits/channel：absolute quality neutrality（同 FP32 一樣！）
📉 2.5 bits/channel：marginal quality degradation（仍然可用）
🔍 Needle In A Haystack：完美通過，喺超長 context 入面搵到「針」

Attention Speedup（H100 GPU）

Bit-width	相對 FP32 Speedup
8-bit	~2x
4-bit	~8x
3-bit	~6x

4-bit TurboQuant 喺 H100 上達到 8x speedup，呢個數字非常 impressive。

Nearest Neighbor Search

喺 vector search 任務上，TurboQuant 嘅表現更加震撼：

Recall 比較（GloVe dataset, d=200）

TurboQuant 喺所有 bit-widths 同 top-k 設定下都 consistently 超越 Product Quantization (PQ) 同 RabbiQ 等 data-dependent baselines。

Indexing Time 比較

方法	d=200	d=1536	d=3072
Product Quantization	37.04s	239.75s	494.42s
TurboQuant	0.0007s	0.0013s	0.0021s

🚀 Indexing 時間差距
TurboQuant 嘅 indexing 時間比 Product Quantization 快 50,000x — 230,000x！

因為 TurboQuant 係 data-oblivious，唔需要 train codebook，indexing 基本上只係做一次 random rotation + scalar quantization。呢個對 real-time vector search 系統嚟講係 game changer。

深入分析

點解 Random Rotation 會令座標 Follow Beta Distribution？

呢個係 TurboQuant 最 elegant 嘅地方，亦係成個方法嘅數學核心。等我一步步解釋。

Step 0：Random Rotation 會唔會 Lose Information？

首先解答一個好自然嘅疑問：做 random rotation 會唔會丟失方向資訊？

答案係唔會。Orthogonal rotation 係一個 bijective（一一對應）同 invertible（可逆） 嘅操作：

\text{Encode: } y = Rx \quad \longleftrightarrow \quad \text{Decode: } x = R^T y

因為 rotation matrix $R$ 係 orthogonal 嘅（ $R^T R = I$ ），你隨時可以用 $R^T$ 完美還原原始 vector。

💡 類比：想像你有一張相片（= 原始 vector），你將佢旋轉咗 37°（= random rotation）。張相嘅所有資訊都仲喺度，只係角度變咗。你隨時可以旋轉返 -37° 完美還原。

所以 TurboQuant 嘅流程入面：

Random rotation（ $Rx$ ）→ ✅ Lossless，只係改變 representation
Quantization（rounding）→ ⚠️ 呢度先會 lose 少量 information
Dequantize + 反旋轉（ $R^T \hat{y}$ ）→ 還原到原始空間

簡單嚟講：rotation 係「免費改變 representation 令壓縮更高效」，唔係「犧牲方向換取分佈」。

Step 1：Random Rotation = Uniform on Sphere

即係話， $Rx$ 嘅分佈同「喺 unit sphere 上 uniformly sample 一個點」係一樣嘅。

Step 2：Sphere 上嘅 Uniform Point → Beta Distribution

呢度有一個 classical result：如果你喺 $d$ -dimensional unit sphere 上 uniformly sample 一個點 $z$ ，每個座標 $z_i$ 嘅 squared value 嘅 marginal distribution 係：

z_i^2 \sim \text{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)

呢個唔係 approximation，係 exact result，源自 sphere 嘅幾何對稱性。

Step 3：點解係 Beta？完整數學推導

首先要理解 Beta distribution 嘅本質。Beta distribution $\text{Beta}(\alpha, \beta)$ 係一個定義喺 $[0, 1]$ 上面嘅分佈，PDF 係：

f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}

佢有一個好重要嘅 構造性定義：如果你有兩個獨立嘅 Gamma random variables：

X \sim \text{Gamma}(\alpha), \quad Y \sim \text{Gamma}(\beta)

咁：

\frac{X}{X+Y} \sim \text{Beta}(\alpha, \beta)

💡 核心洞察：Beta distribution 本質上就係「一個量佔總量嘅比例」嘅分佈。呢個觀察係成個推導嘅關鍵。

Step 3a：Sphere 上嘅 Uniform Point 可以用 Gaussian 構造

如果你 sample $d$ 個 i.i.d. standard Gaussians $g_1, g_2, \dots, g_d \sim \mathcal{N}(0,1)$ ，然後 normalize：

z = \frac{(g_1, g_2, \dots, g_d)}{\sqrt{g_1^2 + g_2^2 + \cdots + g_d^2}}

Step 3b：將 $z_i^2$ 拆開嚟睇

因為 $z_i = g_i / \|g\|$ ，所以：

z_i^2 = \frac{g_i^2}{g_1^2 + g_2^2 + \cdots + g_d^2} = \frac{g_i^2}{g_i^2 + \sum_{j \neq i} g_j^2}

呢個 fraction 嘅形式正正就係 「一個量佔總量嘅比例」！

Step 3c：分析分子同分母

分子 $g_i^2$ ：一個 standard Gaussian 嘅平方 = $\chi^2(1) = \text{Gamma}(1/2, 1/2)$
分母嘅其餘部分 $\sum_{j \neq i} g_j^2$ ： $(d-1)$ 個獨立 $\chi^2(1)$ 嘅和 = $\chi^2(d-1) = \text{Gamma}((d-1)/2, 1/2)$

因為 $g_i$ 同其他 $g_j$ 獨立，分子同分母嘅其餘部分亦獨立。根據 Beta 嘅構造性定義：

z_i^2 = \frac{\text{Gamma}(1/2)}{\text{Gamma}(1/2) + \text{Gamma}((d-1)/2)} \sim \text{Beta}\left(\frac{1}{2}, \frac{d-1}{2}\right)

Step 3d：點解 $\alpha = 1/2$ ？

呢個 $1/2$ 嚟自 $\chi^2(1)$ 嘅 shape parameter。 $\chi^2(k)$ 等價於 $\text{Gamma}(k/2, 1/2)$ ，所以 $\chi^2(1) = \text{Gamma}(1/2, 1/2)$ 。

🎯 一句話總結：因為 sphere 上嘅 uniform distribution 可以用獨立 Gaussians normalize 嚟構造，而「一個 Gaussian 平方佔總平方和嘅比例」根據定義就係 Beta distribution。呢個唔係 approximation，係 exact 嘅結果。

💡 直覺理解：Concentration of Measure

2D：Unit circle 上嘅 uniform point = $(\cos\theta, \sin\theta)$ ，座標值分佈得幾 spread out（ $-1$ 到 $+1$ ）

3D：Unit sphere 上嘅 uniform point，座標已經開始 concentrate around 0

1000D：每個座標 極度集中 喺 $\approx 0$ 附近，大約 $\pm 1/\sqrt{d}$ 嘅範圍

維度越高，sphere 嘅「面積」越集中喺赤道附近，所以每個座標嘅值都被迫落入一個好窄嘅已知範圍。

當 $d$ 好大嘅時候，呢個 Beta distribution concentrate around $1/d$ ，所以每個座標大約 $\pm 1/\sqrt{d}$ 。呢個亦解釋咗點解 Gaussian approximation 喺高維度下 work：

[Rx]_i \approx \mathcal{N}\left(0, \frac{\|x\|^2}{d}\right) \quad \text{（高維 Gaussian approximation）}

Step 4：TurboQuant 點樣利用呢個性質

因為 rotation 之後每個座標嘅分佈係 已知同固定嘅（Beta distribution，只取決於維度 $d$ ），TurboQuant 可以：

所有座標嘅 range 幾乎一樣 → 唔需要 per-block scaling
分佈已知 → 可以 precompute 最優嘅 Lloyd-Max quantization levels
座標近乎獨立 → per-coordinate quantization 幾乎最優

簡單嚟講：random rotation 係一個 distribution normalizer——將任何未知分佈嘅 vector 變成一個已知分佈嘅 vector，令 quantization 變得 tractable。

TurboQuant vs Product Quantization

特性	Product Quantization (PQ)	TurboQuant
Training	需要 k-means clustering on dataset	唔需要（data-oblivious）
Codebook	Large codebook（dataset-specific）	固定 scalar quantizer levels
Indexing Time	秒級到分鐘級	微秒級
Online 能力	❌ 需要 pre-trained codebook	✅ 即時量化
理論保證	❌ 無 distortion rate 保證	✅ 近乎 information-theoretic optimal
Recall	⚠️ 依賴 codebook 質量	✅ consistently 更高

實際影響同應用場景

1. LLM Inference 優化

呢個係最直接嘅應用。TurboQuant 可以：

將 KV cache memory 減少 6x+
令 LLM 喺同一個 GPU 上支持更長嘅 context
減少 inference cost（memory bandwidth 係主要瓶頸）

具體數字：

原本需要 128 GB memory 嘅 128K context → 用 TurboQuant 只需 ~21 GB
H100 80GB GPU 原本只能 serve 1 個 request → 可能 serve 3-4 個

2. Vector Search / RAG

TurboQuant 嘅 data-oblivious 特性對 vector database 非常有價值：

即時 indexing：新文檔加入後毫秒內可搜索
更高 recall：比 PQ 更準確
更低 memory：同樣 recall 下用更少 storage

3. Edge / Mobile Deployment

3-bit quantization 令 LLM 喺 resource-constrained 環境更加可行：

Mobile devices
Embedded systems
IoT edge computing

實作指南：點樣喺 HuggingFace 模型上用 TurboQuant？

呢個係好多人最關心嘅問題：我有一個 HuggingFace 上嘅 LLM，點樣實際 apply TurboQuant？

核心前提：零 Training，純 Inference-Time

呢個係因為 TurboQuant 係 data-oblivious 嘅——所有嘢都係數學推導出嚟：

Random rotation matrix $R$ ：server 啟動時隨機 generate 一次，之後固定
Lloyd-Max quantization levels：只取決於維度 $d$ 同 bit-width $b$ ，可以 hardcode 做 lookup table
QJL projection matrix：同樣隨機 generate，唔需要 train

目前嘅實作生態

TurboQuant 論文喺 2026 年 3 月正式發佈，社區已經有幾個實作方向：

平台	狀態	連結
Triton Kernel	✅ 已實作（fused encode/decode）	Dejan.ai Blog
Apple MLX	✅ 已有 prototype（Qwen3.5）	HuggingFace Repo
llama.cpp	💬 Discussion 階段	GitHub Discussion
Weight Quantization	✅ 社區擴展到 weight compression	turboquant-model GitHub
HuggingFace Transformers	⏳ 原生支援待整合	KV Cache Quant Blog

方法一：用 HuggingFace 現有嘅 KV Cache Quantization（最簡單）

pythonimport torch
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantizedCacheConfig

# Step 1: 正常 load HuggingFace 模型（weights 完全唔動）
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Step 2: 配置 KV cache 量化
cache_config = QuantizedCacheConfig(
    backend="quanto",          # 量化 backend
    nbits=4,                   # 目標 bit-width
    residual_length=128,       # 保留幾多 token 用 full precision
)

# Step 3: Generate 時傳入 cache config
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=1000,
    cache_implementation="quantized",
    cache_config=cache_config,
)

💡 重點：呢個 API 目前用嘅係傳統 affine quantization（有 scale/zero-point overhead）。一旦社區將 TurboQuant 整合入 HuggingFace，你只需要換 backend 同 config 就可以無縫切換到 TurboQuant——model code 完全唔使改。

方法二：自己實作 TurboQuant KV Cache Wrapper（進階）

如果你想而家就試 TurboQuant，可以自己寫一個 KV cache wrapper。下面提供兩個版本嘅 rotation——標準 random matrix 同快速 Hadamard transform。

版本 A：標準 Random Rotation（簡單但慢）

pythonimport torch
import numpy as np
from scipy.stats import beta as beta_dist

class TurboQuantKVCache:
    """TurboQuant_MSE KV Cache — drop-in replacement"""
    
    def __init__(self, head_dim, num_bits=4, device="cuda"):
        self.head_dim = head_dim
        self.num_bits = num_bits
        self.device = device
        
        # Step 1: 生成 random rotation matrix（一次性）
        # 用 QR decomposition 確保 orthogonality
        random_matrix = torch.randn(head_dim, head_dim, device=device)
        self.R, _ = torch.linalg.qr(random_matrix)
        
        # Step 2: 預計算 Lloyd-Max quantization levels
        # 基於 Beta(1/2, (d-1)/2) distribution
        self.boundaries, self.levels = self._precompute_lloyd_max(
            alpha=0.5, beta_param=(head_dim - 1) / 2, num_levels=2**num_bits
        )
        
        # Storage
        self.quantized_keys = []    # 壓縮後嘅 keys
        self.signs = []             # 每個座標嘅正負號（1 bit each）
        self.radii = []             # 每個 vector 嘅 norm
    
    def _precompute_lloyd_max(self, alpha, beta_param, num_levels):
        """用 Beta distribution 預計算最優 quantization levels"""
        # Lloyd-Max iterative algorithm
        # 初始化：uniform spacing on Beta CDF
        cdf_points = np.linspace(0, 1, num_levels + 1)
        boundaries = beta_dist.ppf(cdf_points, alpha, beta_param)
        levels = np.zeros(num_levels)
        
        for _ in range(100):  # 迭代直到收斂
            # Centroid condition: 每個區間嘅 probability-weighted mean
            for i in range(num_levels):
                lo, hi = boundaries[i], boundaries[i + 1]
                # E[X | lo < X < hi] for Beta distribution
                levels[i] = self._conditional_mean(lo, hi, alpha, beta_param)
            
            # Nearest neighbor condition: 重新劃分邊界
            for i in range(1, num_levels):
                boundaries[i] = (levels[i - 1] + levels[i]) / 2
        
        return (
            torch.tensor(boundaries, device=self.device, dtype=torch.float32),
            torch.tensor(levels, device=self.device, dtype=torch.float32)
        )
    
    def _conditional_mean(self, lo, hi, alpha, beta_param):
        """計算 E[X | lo < X < hi] for Beta(alpha, beta_param)"""
        from scipy.integrate import quad
        if hi <= lo:
            return (lo + hi) / 2
        numer, _ = quad(lambda t: t * beta_dist.pdf(t, alpha, beta_param), lo, hi)
        denom = beta_dist.cdf(hi, alpha, beta_param) - beta_dist.cdf(lo, alpha, beta_param)
        return numer / denom if denom > 1e-12 else (lo + hi) / 2
    
    def encode(self, key_vectors):
        """
        將 key vectors 壓縮儲存
        key_vectors: shape (batch, num_heads, seq_len, head_dim)
        """
        # Step 1: Random rotation
        rotated = key_vectors @ self.R.T  # (B, H, S, D) @ (D, D)
        
        # Step 2: 記錄 norm（final radius）
        norms = torch.norm(rotated, dim=-1, keepdim=True)  # (B, H, S, 1)
        normalized = rotated / (norms + 1e-8)  # unit vectors
        
        # Step 3: 儲存正負號（每個座標 1 bit）
        coord_signs = (normalized >= 0)
        
        # Step 4: 對 z_i² 做 Lloyd-Max quantization
        # Beta(1/2, (d-1)/2) 描述嘅係 z_i²，所以量化 squared values
        squared = normalized.square()
        quantized_indices = torch.bucketize(squared, self.boundaries) - 1
        quantized_indices = quantized_indices.clamp(0, len(self.levels) - 1)
        
        self.quantized_keys.append(quantized_indices.to(torch.uint8))
        self.signs.append(coord_signs)
        self.radii.append(norms.squeeze(-1).half())
    
    def decode(self, layer_idx):
        """
        還原壓縮嘅 key vectors
        """
        indices = self.quantized_keys[layer_idx]
        norms = self.radii[layer_idx]
        
        # Step 1: Dequantize z_i² → 取 sqrt 得到 |z_i|
        reconstructed_sq = self.levels[indices.long()]
        reconstructed = reconstructed_sq.sqrt()
        
        # Step 2: 恢復正負號
        coord_signs = self.signs[layer_idx]
        reconstructed = torch.where(coord_signs, reconstructed, -reconstructed)
        
        # Step 3: 乘返 norm
        reconstructed = reconstructed * norms.unsqueeze(-1).float()
        
        # Step 4: Inverse rotation
        decoded = reconstructed @ self.R  # R^T 嘅 inverse = R
        
        return decoded

版本 B：Randomized Hadamard Transform（快 ~18x）

RHT 嘅原理：

\text{RHT}(x) = H_d \cdot D \cdot x

其中：

$D$ = diagonal matrix，對角線係隨機 $\pm 1$ （random sign flip）
$H_d$ = Hadamard matrix，可以用 Fast Walsh-Hadamard Transform (FWHT) 喺 $O(d \log d)$ 計算，唔使儲存成個 $d \times d$ matrix

點解 RHT 同 random rotation 效果幾乎一樣？

Random sign flip ( $D$ ) 已經將座標嘅 sign 打亂
Hadamard transform ( $H_d$ ) 將每個座標嘅值均勻 spread 到所有維度
喺高維度（ $d \geq 64$ ，typical attention head dim），concentration of measure 效應令 RHT 後嘅座標分佈同 exact random rotation 幾乎無法區分
QuIP# 等量化論文已經驗證咗呢個 approach 嘅有效性

pythonimport torch
import numpy as np
from scipy.stats import beta as beta_dist

def fast_walsh_hadamard_transform(x):
    """
    Fast Walsh-Hadamard Transform — O(d log d)
    x: tensor of shape (..., d) where d is a power of 2
    """
    d = x.shape[-1]
    h = 1
    while h < d:
        # Butterfly operation: 每次將相鄰 h 個元素做 ± 組合
        x = x.view(*x.shape[:-1], -1, 2 * h)
        a = x[..., :h]
        b = x[..., h:]
        x = torch.cat([a + b, a - b], dim=-1)
        x = x.view(*x.shape[:-2], -1)
        h *= 2
    return x / (d ** 0.5)  # normalize


class TurboQuantKVCache_Hadamard:
    """TurboQuant with Randomized Hadamard Transform — O(d log d) rotation"""
    
    def __init__(self, head_dim, num_bits=4, device="cuda"):
        assert head_dim & (head_dim - 1) == 0, "head_dim must be power of 2"
        self.head_dim = head_dim
        self.num_bits = num_bits
        self.device = device
        
        # Step 1: 生成 random sign vector（一次性，只需 d 個 ±1）
        # 唔使儲存 d×d matrix！只需 d 個 bits
        self.had_signs = (torch.randint(0, 2, (head_dim,), device=device) * 2 - 1).float()
        
        # Step 2: 預計算 Lloyd-Max quantization levels（同版本 A 一樣）
        self.boundaries, self.levels = self._precompute_lloyd_max(
            alpha=0.5, beta_param=(head_dim - 1) / 2, num_levels=2**num_bits
        )
        
        # Storage
        self.quantized_keys = []
        self.signs = []
        self.radii = []
    
    def _conditional_mean(self, lo, hi, alpha, beta_param):
        from scipy.integrate import quad
        if hi <= lo:
            return (lo + hi) / 2
        numer, _ = quad(lambda t: t * beta_dist.pdf(t, alpha, beta_param), lo, hi)
        denom = beta_dist.cdf(hi, alpha, beta_param) - beta_dist.cdf(lo, alpha, beta_param)
        return numer / denom if denom > 1e-12 else (lo + hi) / 2
    
    def _precompute_lloyd_max(self, alpha, beta_param, num_levels):
        cdf_points = np.linspace(0, 1, num_levels + 1)
        boundaries = beta_dist.ppf(cdf_points, alpha, beta_param)
        boundaries = np.clip(boundaries, 0, 1)
        levels = np.zeros(num_levels)
        for _ in range(100):
            for i in range(num_levels):
                lo, hi = boundaries[i], boundaries[i + 1]
                levels[i] = self._conditional_mean(lo, hi, alpha, beta_param)
            for i in range(1, num_levels):
                boundaries[i] = (levels[i - 1] + levels[i]) / 2
        return (
            torch.tensor(boundaries, device=self.device, dtype=torch.float32),
            torch.tensor(levels, device=self.device, dtype=torch.float32)
        )
    
    def _rht_forward(self, x):
        """Randomized Hadamard Transform: H @ D @ x"""
        # Step 1: Random sign flip（element-wise，O(d)）
        x_flipped = x * self.had_signs
        # Step 2: Fast Walsh-Hadamard（O(d log d)，冇 matrix storage）
        return fast_walsh_hadamard_transform(x_flipped)
    
    def _rht_inverse(self, y):
        """Inverse RHT: D^T @ H^T @ y = D @ H @ y（因為 H 同 D 都係 involution）"""
        # H 係 symmetric 同 orthogonal → H^{-1} = H
        # D 係 diagonal 同 self-inverse → D^{-1} = D
        y_unhadamard = fast_walsh_hadamard_transform(y)
        return y_unhadamard * self.had_signs
    
    def encode(self, key_vectors):
        """同版本 A 一樣，但用 RHT 代替 dense rotation"""
        # RHT rotation — O(d log d) instead of O(d²)
        rotated = self._rht_forward(key_vectors)
        
        norms = torch.norm(rotated, dim=-1, keepdim=True)
        normalized = rotated / (norms + 1e-8)
        
        coord_signs = (normalized >= 0)
        squared = normalized.square()
        quantized_indices = torch.bucketize(squared, self.boundaries) - 1
        quantized_indices = quantized_indices.clamp(0, len(self.levels) - 1)
        
        self.quantized_keys.append(quantized_indices.to(torch.uint8))
        self.signs.append(coord_signs)
        self.radii.append(norms.squeeze(-1).half())
    
    def decode(self, layer_idx):
        """同版本 A 一樣，但用 inverse RHT"""
        indices = self.quantized_keys[layer_idx]
        norms = self.radii[layer_idx]
        
        reconstructed_sq = self.levels[indices.long()]
        reconstructed = reconstructed_sq.sqrt()
        coord_signs = self.signs[layer_idx]
        reconstructed = torch.where(coord_signs, reconstructed, -reconstructed)
        reconstructed = reconstructed * norms.unsqueeze(-1).float()
        
        # Inverse RHT — 同樣 O(d log d)
        decoded = self._rht_inverse(reconstructed)
        return decoded

兩個版本嘅對比：

特性	版本 A：Dense Random Rotation	版本 B：Randomized Hadamard (RHT)
Rotation 複雜度	$O(d^2)$	$O(d \log d)$ ✅
head_dim=128 嘅運算量	16,384 次乘法	896 次加減法 ✅
Memory（儲存 rotation）	$d \times d$ matrix = 65KB	$d$ 個 sign bits = 16 bytes ✅
數學保證	Exact uniform on sphere ✅	Approximate（高維幾乎無差）
量化質量	理論最優	實際幾乎一樣（ $d \geq 64$ ）✅
限制	無	$d$ 必須係 2 嘅冪次（LLM 通常滿足）

🚀 建議：喺 production 環境用 版本 B（Hadamard）。大部分 LLM 嘅 head_dim 都係 64、128 或 256（都係 2 嘅冪次），完美符合要求。速度快 ~18x，memory 少 4000x，quantization 質量幾乎無差異。QuIP#、AQLM 等 SOTA 量化方法都已經採用咗 RHT 作為標準 rotation scheme。

⚠️ 注意：以上兩個版本都係簡化版嘅概念 code，省略咗 recursive polar transform 同 sign handling。實際生產級實作建議參考 Dejan.ai 嘅 Triton kernel，佢哋已經做咗 fused encode/decode，性能更好。

方法三：用 Triton Kernel 做高性能實作（生產級）

咩係 Triton？

如果你之前淨係寫過 PyTorch，可能會問：Triton 係咩嚟？點解唔直接用 CUDA？

💡 類比：CUDA 就好似用 Assembly 寫程式——你要自己管理 thread blocks、shared memory、memory coalescing、register allocation⋯⋯每一步都要手動優化。Triton 就好似用 Python 寫程式——你只需要講清楚「我想做咩計算」，compiler 會自動幫你處理底層嘅 GPU 優化。

Triton 同 CUDA 嘅核心分別：

特性	CUDA	Triton
語言	C/C++ 擴展	Python（用 decorator `@triton.jit`）
並行模型	手動管理 thread blocks、warps、threads	自動管理，你只需定義 tile/block 級別嘅操作
Memory 管理	手動搬 data（global → shared → register）	自動優化 memory access patterns
學習曲線	極陡峭（需要深入理解 GPU 架構）	相對平緩（Python 開發者可以快速上手）
性能	理論上最優（如果你係 expert）	大部分情況同 expert CUDA 一樣快
Debug	困難（GPU debugger 唔好用）	相對容易（Python 環境）

點解 Triton 對 TurboQuant 特別重要？

🎯 一句話總結：Triton = 用 Python 寫 GPU kernel 嘅語言。佢令 ML researcher 唔需要學 CUDA 就可以寫出高性能嘅 custom operation（例如 TurboQuant 嘅 fused encode/decode）。PyTorch 2.0+ 嘅 torch.compile() 底層都係用 Triton 生成 GPU code。

Triton 實作 TurboQuant

如果你要喺 NVIDIA GPU 上做 production deployment，Triton kernel 係最佳選擇。Dejan.ai 已經開源咗一個 fused implementation：

python# 概念：將 TurboQuant encode/decode fuse 入 attention kernel
# 唔需要 dequantize 到 full precision 再做 attention
# 直接喺 quantized domain 計算 attention score

import triton
import triton.language as tl

@triton.jit
def turboquant_attention_kernel(
    Q, K_quantized, K_norms, V, Output,
    rotation_matrix, lloyd_max_levels,
    seq_len, head_dim, num_bits,
    BLOCK_SIZE: tl.constexpr
):
    # Fused: dequantize K + compute Q·K + softmax + V aggregation
    # 所有嘢喺 GPU SRAM 入面完成，避免 HBM ↔ SRAM round trip
    ...

實際部署 Checklist

✅ TurboQuant 部署步驟（TurboQuant_MSE，drop-in）

正常 load HuggingFace model（weights 唔動）

生成 random rotation matrix $R$ （per attention head，一次性）

預計算 Lloyd-Max levels（based on head_dim 同目標 bits，hardcode 做 lookup table）

Hook 入 model 嘅 KV cache：encode 時做 rotate → polar → quantize，decode 時做 dequantize → inverse polar → inverse rotate

搞定！Model 正常 generate，KV cache 自動壓縮

對比	傳統方法（GPTQ/AWQ）	TurboQuant
量化目標	Model weights	KV cache（runtime data）
需要 calibration data？	✅ 需要（通常跑 128 個 samples）	❌ 完全唔需要
需要 retrain？	❌（但 QAT 需要）	❌
可以同時用？	✅ 可以 stack！用 GPTQ 量化 weights + TurboQuant 壓縮 KV cache = 雙重壓縮
Apply 時機	Offline（量化後儲存新 model）	Online（inference 時即時壓縮）

🚀 最佳實踐組合
如果你想最大化壓縮：

先用 GPTQ/AWQ 將 model weights 量化到 4-bit（減少 model size）

再用 TurboQuant 喺 inference 時壓縮 KV cache 到 3.5-bit（減少 runtime memory）

兩者完全獨立，可以自由組合，唔會互相影響

例如：Llama-3.1-70B 用 GPTQ-4bit（35GB weights）+ TurboQuant-3.5bit KV cache = 喺單張 H100 80GB 上跑 128K context！

限制同未來方向

當前限制

1. Random Rotation 嘅 Overhead

雖然 quantization 本身好快，但 random rotation 需要一次 matrix multiplication（ $O(d^2)$ ）。對於非常高維嘅 vectors，呢個可能係瓶頸。

解決思路：用 structured random matrices（如 Hadamard transform），可以將 rotation 降到 $O(d \log d)$ 。

2. 需要 Custom Kernel

TurboQuant_Prod（帶 QJL 嘅版本）需要 custom attention kernel 先可以正確使用。Drop-in replacement 只能用 TurboQuant_MSE。

3. 2-bit 以下仍有挑戰

雖然 2.5-bit 已經可用，但極低 bit-width（1-2 bit）嘅質量退化仍然明顯。呢個可能係 information-theoretic 嘅限制。

未來方向

Hardware-Aware Optimization

將 TurboQuant 嘅 quantization scheme 同 GPU kernel 深度整合，進一步減少 dequantization overhead。已經有人喺 Triton kernel 上實現咗 fused implementation。

同其他壓縮技術結合

TurboQuant 可以同 model pruning、knowledge distillation 等技術 stack：

Weight quantization + KV cache quantization（TurboQuant）
Structured pruning + TurboQuant
多層壓縮進一步減少 memory footprint

Broader Applications

除咗 KV cache 同 vector search，TurboQuant 嘅理論可以應用喺：

Embedding compression
Gradient quantization（distributed training）
Sensor data compression（IoT）

技術啟示

1. Data-Oblivious 唔代表差

2. 理論同實踐嘅完美結合

3. 簡單嘅方法往往最有效

4. Quantization 係 LLM 效率嘅關鍵戰場

總結

TurboQuant 係一個將 classical information theory 同 modern LLM 需求完美結合嘅工作。

核心貢獻

PolarQuant：用 polar coordinates + random rotation 消除 quantization overhead
QJL：1-bit 殘差修正確保 unbiased inner product estimation
TurboQuant：兩者結合，達到 near-optimal distortion rate
理論證明：首次提供 vector quantization 嘅 tight information-theoretic lower bounds

實用價值

🎯 3.5-bit KV cache 量化，零精度損失
🚀 H100 上 8x attention speedup
⚡ Vector search indexing 快 200,000x
📊 唔需要 training data，即插即用

點解你應該 care？

如果你：

做 LLM serving → TurboQuant 可以大幅降低你嘅 GPU memory 成本
做 vector search → 即時 indexing + 更高 recall
做 edge AI → 3-bit 壓縮令 on-device LLM 更加可行
做 ML research → 呢篇論文嘅理論分析非常 elegant，值得學習

TL;DR

目錄

背景：點解 KV Cache 量化咁重要？

LLM 嘅 Memory Bottleneck

現有量化方法嘅問題

Scale 同 Zero Point 係咩？具體例子

TurboQuant：核心創新

設計理念

三個關鍵組件

TurboQuant 嘅兩個變體

理論分析：點解 TurboQuant 近乎最優？

Information-Theoretic Lower Bound

點解 Data-Oblivious 都可以咁好？

實驗結果

KV Cache 量化

Nearest Neighbor Search

深入分析

點解 Random Rotation 會令座標 Follow Beta Distribution？

TurboQuant vs Product Quantization

實際影響同應用場景

1. LLM Inference 優化

2. Vector Search / RAG

3. Edge / Mobile Deployment

實作指南：點樣喺 HuggingFace 模型上用 TurboQuant？

核心前提：零 Training，純 Inference-Time

目前嘅實作生態

方法一：用 HuggingFace 現有嘅 KV Cache Quantization（最簡單）

方法二：自己實作 TurboQuant KV Cache Wrapper（進階）

方法三：用 Triton Kernel 做高性能實作（生產級）

實際部署 Checklist

限制同未來方向

當前限制

未來方向

技術啟示

1. Data-Oblivious 唔代表差

2. 理論同實踐嘅完美結合

3. 簡單嘅方法往往最有效

4. Quantization 係 LLM 效率嘅關鍵戰場

總結

核心貢獻

實用價值

點解你應該 care？

相關資源

TL;DR

目錄

背景：點解 KV Cache 量化咁重要？

LLM 嘅 Memory Bottleneck

現有量化方法嘅問題

Scale 同 Zero Point 係咩？具體例子

TurboQuant：核心創新

設計理念

三個關鍵組件

TurboQuant 嘅兩個變體

理論分析：點解 TurboQuant 近乎最優？

Information-Theoretic Lower Bound

點解 Data-Oblivious 都可以咁好？

實驗結果

KV Cache 量化

Nearest Neighbor Search

深入分析

點解 Random Rotation 會令座標 Follow Beta Distribution？

TurboQuant vs Product Quantization

實際影響同應用場景

1. LLM Inference 優化

2. Vector Search / RAG

3. Edge / Mobile Deployment

實作指南：點樣喺 HuggingFace 模型上用 TurboQuant？

核心前提：零 Training，純 Inference-Time

目前嘅實作生態

方法一：用 HuggingFace 現有嘅 KV Cache Quantization（最簡單）

方法二：自己實作 TurboQuant KV Cache Wrapper（進階）

方法三：用 Triton Kernel 做高性能實作（生產級）

實際部署 Checklist

限制同未來方向

當前限制

未來方向

技術啟示

1. Data-Oblivious 唔代表差

2. 理論同實踐嘅完美結合

3. 簡單嘅方法往往最有效