Paper source: University of Surrey / Queen Mary University of London / Samsung AI Center. arXiv:1905.00953 (ICCV 2019). TPAMI 2021 extension: Learning Generalisable Omni-Scale Representations. GitHub: KaiyangZhou/deep-person-reid. HuggingFace: kaiyangzhou/osnet
TL;DR
OSNet is a lightweight CNN designed specifically for Person Re-Identification (ReID). Its core innovation is learning omni-scale features: features that cover single scales and mixtures of multiple scales at the same time.
Key points:
- 🎯 Omni-Scale Features: captures both local (shoes, logos) and global (whole-body silhouette) cues, and can dynamically mix different scales
- 🏗️ Only 2.2M parameters: more than 10× smaller than ResNet50 (24M+), yet performs better
- 🚀 SOTA on 6 datasets: Market1501 Rank-1 = 94.8%, beating almost every large model
- ⚡ Unified Aggregation Gate: one shared mini-network dynamically decides which scale each channel should focus on
- 📱 Edge-friendly: the tiny model size suits deployment on surveillance cameras
Background: What Is Person Re-ID, and Why Is It Hard?
Problem definition
The goal of Person Re-Identification is simple to state: given footage from different cameras, find the same person again.
Imagine a shopping mall with 50 CCTV cameras. A suspicious person appears on Camera A, and you want to know where they went next, i.e. find them again in the feeds of Cameras B, C, D, and so on.
💡 One-line definition: ReID = the cross-camera "is this the same person?" problem. It is not face recognition: surveillance resolution is too low, and faces are usually not visible.
Why is it hard? Two main challenges
Challenge 1: the same person can look very different (large intra-class variation)
The same person can have a completely different appearance across cameras:
- Front vs back view (a backpack is only visible from behind)
- Lighting differences (dark indoors vs bright outdoors)
- Occlusion (half the body blocked by other people)
Challenge 2: different people can look very similar (small inter-class variation)
In public spaces many people dress alike: the white T-shirt + jeans combination can match dozens of people. From a distance they look almost identical.
🎯 The core difficulty: you need features that can simultaneously be:
Global: recognising the overall "white T-shirt + grey shorts" outfit
Local: noticing details like "sneakers vs sandals"
Mixed: capturing cross-scale combinations like "a specific logo on a white T-shirt"
Many models can handle the first two, but the third, heterogeneous-scale features, is OSNet's unique contribution.
A concrete example
Suppose you need to match a person. Your query image shows a man in a white T-shirt. The gallery contains two candidates who both wear white T-shirts:
| Feature type | Scale | Example | Discriminative? |
|---|---|---|---|
| Global | Whole body | "White T-shirt + grey shorts" | ❌ Both candidates match |
| Local | Small region | "A logo on the T-shirt" | ⚠️ The logo is tiny and may be confused with other patterns |
| Heterogeneous (mixed) | Cross-scale | "White T-shirt + a specific logo on the front" | ✅ Only this combination is unique! |
The logo alone is useless (too generic), and the white T-shirt alone is useless (too common). But the combination of a white T-shirt with this specific logo is highly distinctive. Such cross-scale mixed features are exactly what omni-scale features are.
Problems with Existing Methods
The dilemma of borrowing ImageNet models
Most ReID models directly reuse backbones designed for ImageNet, such as ResNet50 or Inception. But those models were built for category-level recognition (telling cats from dogs), which is a fundamentally different task from instance-level recognition (telling apart two people in white shirts).
| Task | Goal | Features needed |
|---|---|---|
| ImageNet classification | Distinguish "cat" from "dog" | Category-level (ear shape, fur colour) |
| Person ReID | Distinguish "person A" from "person B" | Instance-level (a specific logo, shoe style, backpack details) |
Limitations of existing multi-scale methods
Some ReID models do try to learn multi-scale features, but each has a flaw:
| Method | Problem |
|---|---|
| ResNeXt-based (MLFN) | All streams share the same scale, so different scales are never learned |
| Inception-based (MuDeep) | Hand-designed mixed ops, fused with fixed weights |
| Multi-level fusion | Fusion happens only at specific layers, not in every layer |
| Part-based models (PCB) | Depend on an external pose detector; not end-to-end |
⚠️ The core gap: no existing model simultaneously achieves all of:
Learning features of different scales in every layer
Fusing scales dynamically (input-dependent)
Learning heterogeneous-scale features (mixtures of multiple scales)
OSNet is the first architecture to solve all three.
OSNet: Core Design

Design philosophy
💡 Core insight: use multiple streams with different receptive fields, plus one shared dynamic gate, so that each feature channel automatically picks the scale combination that suits it best.
The OSNet building block has three key components:
- Lite 3×3: a lightweight depthwise separable convolution
- Multi-Scale Streams: 4 convolutional streams of different depths (t = 1, 2, 3, 4)
- Unified Aggregation Gate (AG): a shared mini-network that dynamically fuses multi-scale features
Component 1: Lite 3×3, the lightweight convolution
A standard 3×3 convolution mapping c input channels to c' output channels costs 9·c·c' parameters. OSNet splits it into two steps with a depthwise separable convolution, cutting the count to c·c' + 9·c':
🔑 OSNet's subtle twist: MobileNet uses the depthwise → pointwise order, but OSNet uses pointwise → depthwise (first expand the channels, then do spatial aggregation). The authors found this order more effective for omni-scale feature learning.
Parameter comparison:
| Type | Parameters | At c = c' = 256 |
|---|---|---|
| Standard 3×3 conv | 9·c·c' | 589,824 |
| Lite 3×3 (OSNet) | c·c' + 9·c' | 67,840 (8.7× fewer) |
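The two counts in the table can be checked with a few lines of arithmetic. A quick sketch (the helper names are mine; BN and bias parameters are ignored):

```python
# Parameter count of a standard 3x3 conv vs OSNet's Lite 3x3
# (1x1 pointwise followed by 3x3 depthwise). BN and biases ignored.
def standard_3x3_params(c_in: int, c_out: int) -> int:
    return 9 * c_in * c_out              # full 3x3 connectivity

def lite_3x3_params(c_in: int, c_out: int) -> int:
    pointwise = c_in * c_out             # 1x1 conv mixes channels
    depthwise = 9 * c_out                # 3x3 conv with groups=c_out
    return pointwise + depthwise

print(standard_3x3_params(256, 256))     # 589824
print(lite_3x3_params(256, 256))         # 67840
```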
Component 2: Multi-Scale Streams, 4 routes at different scales

The OSNet building block (OSBlock) contains 4 parallel streams, each a stack of a different number of Lite 3×3 layers:
| Stream | Exponent t | Lite 3×3 layers | Receptive field | Captures |
|---|---|---|---|---|
| Stream 1 | t=1 | 1 | 3×3 | Very local (texture, edges) |
| Stream 2 | t=2 | 2 | 5×5 | Local (logos, accessories) |
| Stream 3 | t=3 | 3 | 7×7 | Medium (upper/lower body) |
| Stream 4 | t=4 | 4 | 9×9 | Global (whole-body silhouette) |
Why this design?
Each stream's receptive field is (2t+1)×(2t+1), controlled by the exponent t. By increasing t linearly, OSNet ensures every block covers the full scale range from fine-grained to coarse.
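The (2t+1) growth is easy to verify: each extra 3×3 conv at stride 1 widens the receptive field by 2 pixels. A tiny sketch (the function name is mine):

```python
def stacked_3x3_rf(t: int) -> int:
    """Receptive field of t stacked 3x3 convs with stride 1."""
    rf = 1
    for _ in range(t):
        rf += 2            # each 3x3 layer extends the RF by 2
    return rf

print([stacked_3x3_rf(t) for t in range(1, 5)])  # [3, 5, 7, 9]
```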
Component 3: Unified Aggregation Gate, the heart of dynamic scale fusion
This is OSNet's most important innovation. The AG is a shared mini-network that generates channel-wise weights for each stream, dynamically deciding which scale each feature channel should use.
AG structure: GAP → FC → ReLU → FC → sigmoid, shared across all streams.
Mathematically, the fused feature is
$$\bar{x} = \sum_{t=1}^{T} G(x^t) \odot x^t$$
where $G(x^t)$ is the channel-wise weight vector the AG produces for the $t$-th stream, and $\odot$ is the Hadamard product (element-wise multiplication).
🎯 Why is the AG shared (unified)? Three reasons:
Parameter count is independent of T: no matter how many streams you have, the AG's parameter count stays the same → more scalable
Better gradient flow: during backprop, the supervision signals of all streams converge on the same AG → more informative gradients
Encourages cross-scale comparison: one network processes the features of every scale → it naturally learns to compare and fuse different scales
Why channel-wise weights instead of a stream-wise scalar?
One scalar weight per stream is too coarse: it can only say "this whole stream matters or it doesn't". Channel-wise weights, by contrast, allow:
- Channel 1 to rely mainly on Stream 1 (local texture)
- Channel 2 to rely mainly on Stream 3 (a medium-scale body part)
- Channel 3 to mix Stream 1 + Stream 4 (local + global = heterogeneous scale!)
This fine-grained fusion is the key that lets OSNet learn heterogeneous-scale features.
A worked numeric example
Assume mid_channels = 64 (each stream outputs 64 channels):
Step 1: the 4 streams each output feature maps
| Stream | Output shape | Receptive field |
|---|---|---|
| Stream 1 | 32×16×64 | 3×3 (shoe buckles, buttons) |
| Stream 2 | 32×16×64 | 5×5 (logos, pockets) |
| Stream 3 | 32×16×64 | 7×7 (upper body) |
| Stream 4 | 32×16×64 | 9×9 (whole body) |
Step 2: the shared AG generates a 64-dim weight vector per stream
Suppose the input is an image of "a man in a white T-shirt with a logo":
| Channel | AG weight for Stream 1 | AG weight for Stream 2 | AG weight for Stream 3 | AG weight for Stream 4 | What is learned? |
|---|---|---|---|---|---|
| Ch. 1 | 0.1 | 0.8 | 0.3 | 0.1 | Focus on the logo (5×5 scale) |
| Ch. 2 | 0.1 | 0.2 | 0.3 | 0.9 | Focus on the whole-body silhouette (9×9 scale) |
| Ch. 3 | 0.7 | 0.1 | 0.8 | 0.2 | Mixed: texture + upper body = heterogeneous! |
Step 3: weighted sum
Channel 3 mixes Stream 1 (texture) with Stream 3 (upper body), so that channel captures exactly the heterogeneous-scale feature "a specific texture on a white T-shirt"!
💡 What "dynamic" means: the weights above are input-dependent. Swap in a picture of a person without a logo and the AG will assign completely different weights. This is a fundamental difference from fixed fusion (e.g. simple addition or concatenation).
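The three steps above can be replayed with toy numbers (all values invented for illustration; one scalar stands in for each channel's whole feature map):

```python
# Toy replay of the AG fusion: 4 streams x 3 channels, using the
# hypothetical gate weights from the table above.
streams = {            # one made-up scalar per channel
    "s1": [1.0, 0.0, 1.0],
    "s2": [1.0, 0.0, 0.0],
    "s3": [0.0, 0.0, 1.0],
    "s4": [0.0, 1.0, 0.0],
}
gates = {              # channel-wise AG weights (rows Ch.1-3 above)
    "s1": [0.1, 0.1, 0.7],
    "s2": [0.8, 0.2, 0.1],
    "s3": [0.3, 0.3, 0.8],
    "s4": [0.1, 0.9, 0.2],
}
# Step 3: weighted sum over streams, independently per channel
fused = [sum(gates[k][c] * streams[k][c] for k in streams) for c in range(3)]
print(fused)  # channel 3 mixes s1 and s3: a heterogeneous-scale response
```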
The Complete Network Architecture
Architecture overview
OSNet builds the whole network in the simplest possible way: stacking identical OSBlocks layer by layer, with no fancy per-stage customisation.
| Stage | Output size | Operation | Channels |
|---|---|---|---|
| conv1 | 128×64 | 7×7 conv, stride 2 | 64 |
| maxpool | 64×32 | 3×3 max pool, stride 2 | 64 |
| conv2 | 64×32 | OSBlock × 2 | 256 |
| transition | 32×16 | 1×1 conv + 2×2 avg pool | 256 |
| conv3 | 32×16 | OSBlock × 2 | 384 |
| transition | 16×8 | 1×1 conv + 2×2 avg pool | 384 |
| conv4 | 16×8 | OSBlock × 2 | 512 |
| conv5 | 16×8 | 1×1 conv | 512 |
| GAP | 1×1 | Global Average Pooling | 512 |
| FC | - | Fully Connected | 512 |
Model complexity:
- Parameters: 2.2M (ResNet50 = 23.5M, a 10.7× gap)
- Mult-Adds: 978.9M
- Feature dim: 512 (matching is done by feature distance)
Width multiplier: flexible scaling
OSNet supports a width multiplier β to scale the model size:
| Model | β | Params | Mult-Adds | Market1501 R1 |
|---|---|---|---|---|
| osnet_x1_0 | 1.0 | 2.2M | 978.9M | 94.8% |
| osnet_x0_75 | 0.75 | 1.3M | 571.8M | 94.5% |
| osnet_x0_5 | 0.5 | 0.6M | 272.9M | 93.4% |
| osnet_x0_25 | 0.25 | 0.2M | 82.3M | 92.2% |
🚀 The 0.2M-parameter osnet_x0_25 still reaches 92.2% Rank-1! That beats many 24M+ parameter ResNet50-based models, proof that OSNet's design is genuinely efficient.
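As a rough sanity check on the table (my own back-of-envelope estimate, not from the paper): convolution parameters scale roughly with β², because both input and output channel counts shrink by β:

```python
# Rough estimate: params(beta) ~ params(1.0) * beta^2.
base_params_m = 2.2                      # osnet_x1_0, in millions
for beta in (1.0, 0.75, 0.5, 0.25):
    print(beta, round(base_params_m * beta ** 2, 2))
# Tracks the table well down to beta = 0.5; at 0.25 the estimate
# (~0.14M) undershoots slightly because some layers do not scale.
```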
How OSNet Differs from Inception / ResNeXt / SENet
OSNet's multi-stream design superficially resembles Inception and ResNeXt, but the differences are fundamental:
| Property | Inception | ResNeXt | SENet | OSNet |
|---|---|---|---|---|
| Purpose of multi-stream | Reduce computation | Increase width | Re-calibrate channels | Capture different scales |
| Stream scales | Hand-designed mixed ops | All streams at one scale | Only 1 stream | Each stream at a different scale |
| Fusion | Concatenation | Addition | Channel re-scaling | Dynamic channel-wise AG |
| Dynamic fusion? | ❌ Fixed | ❌ Fixed | ✅ Dynamic (but not multi-scale) | ✅ Dynamic + multi-scale |
| Heterogeneous scale? | ❌ | ❌ | ❌ | ✅ |
🎯 One-line summary: Inception pursues efficiency, ResNeXt pursues width, SENet pursues channel attention, OSNet pursues omni-scale feature learning. Four completely different goals.
Ablation Study: How Much Does Each Design Decision Matter?
The authors ran a thorough ablation; every conclusion is backed by experiments:
1. Effect of multi-scale streams
| Number of streams T | Market1501 R1 | mAP |
|---|---|---|
| T=1 (single scale) | 86.5% | 67.7% |
| T=2 + AG | 91.7% | 77.0% |
| T=3 + AG | 92.8% | 79.9% |
| T=4 + unified AG | 93.6% | 81.0% |
From T=1 to T=4, Rank-1 improves by 7.1% and mAP by 13.3%. Every added stream brings a clear gain.
2. Comparing fusion strategies
| Fusion method | R1 | mAP | Character |
|---|---|---|---|
| Concatenation | 91.4% | 77.4% | Fixed, coarse |
| Addition | 92.0% | 78.2% | Fixed, equal weights |
| Separate AGs | 92.9% | 80.2% | Dynamic, but no cross-stream gradient |
| Unified AG (stream-wise scalar) | 92.6% | 80.0% | Dynamic, but too coarse |
| Learned-and-fixed gates | 91.6% | 77.5% | Learned in training, frozen at test time |
| Unified AG (channel-wise, dynamic) | 93.6% | 81.0% | All of the above ✅ |
💡 Three key takeaways:
Dynamic > fixed: learned-and-fixed gates trail the unified AG by 2.0% R1 → adaptive fusion matters a lot
Channel-wise > stream-wise: the stream-wise scalar trails by 1.0% R1 → fine-grained fusion matters
Unified > separate: the unified AG beats separate AGs by 0.7% R1 → the advantage of shared gradients
3. Lite 3×3 vs standard convolution
Standard convolution buys only 0.4% R1 at 3× the model size. Lite 3×3 loses almost nothing!
Experimental Results

Large datasets (same-domain)
| Method | Backbone | Params | Market1501 R1/mAP | CUHK03 R1/mAP | Duke R1/mAP | MSMT17 R1/mAP |
|---|---|---|---|---|---|---|
| PCB | ResNet50 | ~24M | 93.8 / 81.6 | 63.7 / 57.5 | 83.3 / 69.2 | 68.2 / 40.4 |
| DGNet | ResNet50 | ~24M | 94.8 / 86.0 | - / - | 86.6 / 74.8 | 77.2 / 52.3 |
| IANet | ResNet50 | ~24M | 94.4 / 83.1 | - / - | 87.1 / 73.4 | 75.5 / 46.8 |
| MobileNetV2 | MobileNetV2 | 2.2M | 87.0 / 69.5 | 46.5 / 46.0 | 75.2 / 55.8 | 50.9 / 27.0 |
| ShuffleNet | ShuffleNet | ~2M | 84.8 / 65.0 | 38.4 / 37.2 | 71.6 / 49.9 | 41.5 / 19.9 |
| OSNet | OSNet | 2.2M | 94.8 / 84.9 | 72.3 / 67.8 | 88.6 / 73.5 | 78.7 / 52.9 |
🚀 Key observations:
With 2.2M params, OSNet beats every 24M+ param ResNet50-based model
It leads the comparably sized MobileNetV2 (also 2.2M params) by 7.8% R1 on Market1501
On MSMT17, the largest and hardest dataset, OSNet's 78.7% R1 leads all methods
Its 72.3% R1 on CUHK03 beats the runner-up CAMA (66.6%) by 5.7%
Small datasets (VIPeR & GRID)
| Method | VIPeR R1 | GRID R1 |
|---|---|---|
| HydraPlus-Net | 56.6% | - |
| GLAD | 54.8% | - |
| JLML | 50.2% | 37.5% |
| OSNet | 68.0% | 38.2% |
OSNet leads the VIPeR runner-up by 11.4%! This dataset has only a few hundred training images, showing that OSNet's lightweight design effectively prevents overfitting.
Cross-Domain Generalization: OSNet-AIN
The ICCV 2019 version focused on the same-domain setting. The 2021 TPAMI extension further tackles the cross-domain problem: train on Dataset A, deploy directly on Dataset B with no adaptation at all.
Why does cross-domain matter?
In the real world you cannot collect labelled data for every new scene. You need a model trained in mall A to work out of the box in airport B.
OSNet-AIN's solution: Instance Normalization
OSNet-AIN (Adaptive Instance Normalization) inserts Instance Normalization (IN) into the network's early layers, teaching the model to ignore domain-specific style information (e.g. lighting, colour distribution) and focus on domain-invariant structural features.
| Model | MSMT17→Market R1/mAP | MSMT17→Duke R1/mAP |
|---|---|---|
| ResNet50 | 46.3 / 22.8 | 52.3 / 32.1 |
| OSNet x1.0 | 66.6 / 37.5 | 66.0 / 45.3 |
| OSNet-IBN x1.0 | 66.5 / 37.2 | 67.4 / 45.6 |
| OSNet-AIN x1.0 | 70.1 / 43.3 | 71.1 / 52.7 |
OSNet-AIN beats ResNet50's cross-domain performance by 23.8% R1 (MSMT17→Market)!
Deeper Analysis
Why does the unified AG get better gradients?
This is an elegant mathematical property. Per the paper's Eq. 3, the fused feature is $\bar{x} = \sum_{t=1}^{T} G(x^t) \odot x^t$; because the AG $G$ is shared, the gradient of its parameters collects contributions from every stream's features. In other words, the AG gets to "see" information from all scales at once when updating itself.
With separate AGs, each gate only "sees" its own stream's gradient; there is no cross-scale supervision signal.
💡 Analogy: a unified AG is like one judge who watches all 4 contestants before scoring; separate AGs are like 4 judges who each watch only 1 contestant. The former's scores are naturally more comprehensive and consistent.
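The shared-gradient argument can be seen in a one-scalar caricature (made up purely to show the mechanism): if a single shared weight w gates every stream, the derivative of the fused output with respect to w sums over all streams, whereas separate weights each see only their own stream:

```python
# Caricature: scalar "features" x_t, fused = sum_t (w * x_t).
# d(fused)/dw = sum_t x_t  -> the shared gate sees every stream.
def shared_gate_grad(xs):
    return sum(xs)

# With separate gates w_t, fused = sum_t (w_t * x_t), and
# d(fused)/dw_t = x_t  -> each gate sees only one stream.
def separate_gate_grads(xs):
    return list(xs)

xs = [1.0, 2.0, 3.0, 4.0]
print(shared_gate_grad(xs))      # 10.0
print(separate_gate_grads(xs))   # [1.0, 2.0, 3.0, 4.0]
```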
What does the AG learn? Visualisation
The authors ran k-means clustering on test images by their AG gating vectors and found that images within a cluster share similar visual patterns:
- Cluster A: all back views with backpacks → the AG emphasises the global scale
- Cluster B: all T-shirts with prominent logos/patterns → the AG mixes local and medium scales
- Cluster C: all people dressed entirely in black → the AG emphasises local details (the global view carries little discriminative signal)
This shows the AG really has learned to adapt its scale-fusion strategy to each person's appearance.
Attention map comparison
Comparing activation maps of OSNet against a single-scale baseline:
- Single-scale: over-concentrates on the face region (but surveillance face resolution is too low to be reliable)
- OSNet: detects local discriminative patterns such as shirt logos and shoe styles, while maintaining whole-body context
Implementation Guide: Using Torchreid + OSNet
Option 1: Torchreid (the official framework, most complete)

```python
# Install:
#   git clone https://github.com/KaiyangZhou/deep-person-reid.git
#   cd deep-person-reid && python setup.py develop
import torchreid

# ===== Step 1: load the data =====
datamanager = torchreid.data.ImageDataManager(
    root="reid-data",
    sources="market1501",        # training dataset
    targets="market1501",        # test dataset
    height=256,
    width=128,
    batch_size_train=32,
    batch_size_test=100,
    transforms=["random_flip", "random_erase"]
)

# ===== Step 2: build the OSNet model =====
model = torchreid.models.build_model(
    name="osnet_x1_0",                       # standard OSNet
    num_classes=datamanager.num_train_pids,  # number of identities
    loss="softmax",                          # classification loss
    pretrained=True                          # ImageNet pretrained weights
)
model = model.cuda()

# ===== Step 3: optimizer + scheduler =====
optimizer = torchreid.optim.build_optimizer(
    model, optim="amsgrad", lr=0.0015
)
scheduler = torchreid.optim.build_lr_scheduler(
    optimizer, lr_scheduler="cosine"
)

# ===== Step 4: training engine =====
engine = torchreid.engine.ImageSoftmaxEngine(
    datamanager, model,
    optimizer=optimizer,
    scheduler=scheduler,
    label_smooth=True
)

# ===== Step 5: train + evaluate =====
engine.run(
    save_dir="log/osnet_x1_0_market1501",
    max_epoch=150,
    eval_freq=10,
    print_freq=10,
    test_only=False
)
```
Option 2: the feature extraction API (you already have a model and only want features)

```python
import torchreid
from torchreid.utils import FeatureExtractor

# Load a pretrained OSNet
extractor = FeatureExtractor(
    model_name="osnet_x1_0",
    model_path="path/to/osnet_x1_0_market.pth.tar",
    device="cuda"
)

# Extract features (512-dim vectors)
image_list = ["person_001.jpg", "person_002.jpg", "person_003.jpg"]
features = extractor(image_list)  # shape: (3, 512)

# Compute similarity
import torch
from torch.nn.functional import cosine_similarity

sim = cosine_similarity(
    features[0].unsqueeze(0),
    features[1].unsqueeze(0)
)
print(f"Similarity: {sim.item():.4f}")
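In a real retrieval setting the extracted features would then be ranked against an entire gallery. A minimal sketch with dummy vectors (NumPy stand-ins for real 512-D OSNet features; the function name is mine):

```python
import numpy as np

def rank_gallery(query, gallery):
    """Rank gallery rows by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims), sims   # best match first

# Dummy 2-D "features" instead of real 512-D OSNet outputs
query = np.array([1.0, 0.0])
gallery = np.array([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
order, sims = rank_gallery(query, gallery)
print(order)  # gallery item 0 is the closest match, item 1 the farthest
```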
Option 3: cross-domain (with OSNet-AIN)

```bash
# Train on DukeMTMC, test directly on Market1501 (zero adaptation)
python scripts/main.py \
    --config-file configs/im_osnet_ain_x1_0_softmax_256x128_amsgrad_cosine.yaml \
    -s dukemtmcreid \
    -t market1501 \
    --transforms random_flip color_jitter \
    --root $PATH_TO_DATA
```
💡 Cross-domain tips:
- Use color_jitter instead of random_erase (improves generalization)
- Use cosine distance instead of euclidean (AIN works best with cosine)
- Training on multiple source datasets (-s msmt17 dukemtmcreid cuhk03) works even better
Option 4: export to ONNX / OpenVINO / TFLite (edge deployment)
Torchreid ships with a built-in export script:

```bash
# Export to ONNX
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format onnx \
    --input-size 256 128

# Export to OpenVINO (for Intel hardware)
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format openvino

# Export to TFLite (for mobile)
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format tflite
```
🚀 Edge-deployment recommendation: use osnet_x0_25 (0.2M params, 82M mult-adds), which can run on embedded devices such as STM32. STMicroelectronics has added OSNet to its AI Model Zoo.
Available Pretrained Models
| Model | Params | Use case | Download |
|---|---|---|---|
| osnet_x1_0 | 2.2M | Same-domain ReID (best accuracy) | HuggingFace / Torchreid Model Zoo |
| osnet_x0_25 | 0.2M | Edge / mobile deployment | Torchreid Model Zoo |
| osnet_ibn_x1_0 | 2.2M | Moderate cross-domain | Torchreid Model Zoo |
| osnet_ain_x1_0 | 2.2M | Best cross-domain | Torchreid Model Zoo |
Reading OSNet's Core Code
The best way to understand OSNet is to read the source. Below is a simplified version of the core building block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightConv3x3(nn.Module):
    """Lite 3x3: pointwise -> depthwise (OSNet's unusual order)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)   # 1x1 pointwise
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1,   # 3x3 depthwise
                               bias=False, groups=out_ch)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv2(self.conv1(x))))


class ChannelGate(nn.Module):
    """Unified Aggregation Gate: GAP -> FC -> ReLU -> FC -> sigmoid."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1, bias=True)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1, bias=True)

    def forward(self, x):
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(self.gap(x)))))
        return x * w  # channel-wise reweighting


class OSBlock(nn.Module):
    """Omni-Scale Feature Learning Block."""

    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        mid = out_ch // reduction
        # Dimension reduction
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU()
        )
        # 4 streams with increasing receptive fields
        self.stream1 = LightConv3x3(mid, mid)      # RF = 3x3
        self.stream2 = nn.Sequential(              # RF = 5x5
            LightConv3x3(mid, mid), LightConv3x3(mid, mid)
        )
        self.stream3 = nn.Sequential(              # RF = 7x7
            LightConv3x3(mid, mid), LightConv3x3(mid, mid),
            LightConv3x3(mid, mid)
        )
        self.stream4 = nn.Sequential(              # RF = 9x9
            LightConv3x3(mid, mid), LightConv3x3(mid, mid),
            LightConv3x3(mid, mid), LightConv3x3(mid, mid)
        )
        # Shared Aggregation Gate (the key piece!)
        self.gate = ChannelGate(mid)
        # Dimension restoration
        self.conv3 = nn.Sequential(
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch)
        )
        # Residual connection
        self.downsample = None
        if in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        identity = x
        x1 = self.conv1(x)
        # Each stream extracts features at a different scale
        s1 = self.stream1(x1)  # 3x3 scale
        s2 = self.stream2(x1)  # 5x5 scale
        s3 = self.stream3(x1)  # 7x7 scale
        s4 = self.stream4(x1)  # 9x9 scale
        # The same AG processes all 4 streams -> channel-wise fusion
        fused = self.gate(s1) + self.gate(s2) + self.gate(s3) + self.gate(s4)
        out = self.conv3(fused)
        if self.downsample is not None:
            identity = self.downsample(identity)
        return F.relu(out + identity)
```
🔑 Note the line `fused = self.gate(s1) + self.gate(s2) + self.gate(s3) + self.gate(s4)`: self.gate is called 4 times, i.e. the same AG processes all 4 streams. That is what "unified" means. On every forward pass, the AG produces different channel-wise weights from each stream's input, implementing dynamic, input-dependent scale fusion.
Limitations and Future Directions
Current limitations
1. No temporal information
OSNet is designed for image-based ReID. For video-based ReID it has no temporal aggregation mechanism (although others have extended OSNet to video with decent results).
2. Fixed T=4
The number of streams T=4 is hard-coded. The ablation shows T=4 is good enough, but different datasets may have different optimal T. The follow-up OSNet-AIN uses NAS (Neural Architecture Search) to automatically search for the optimal block structure.
3. No occlusion handling
Partial occlusion of pedestrians is common in the real world. OSNet has no dedicated occlusion mechanism (e.g. part-visibility estimation).
Follow-up developments
- OSNet-AIN (TPAMI 2021): adds Instance Normalization to tackle the cross-domain problem
- MixStyle (ICLR 2021): a domain generalization method by the same author that can be combined with OSNet
- STM32 AI Model Zoo: OSNet has been officially adopted by STMicroelectronics for its embedded AI platform
- 4,793 GitHub stars: an active, continuously maintained community
Technical lessons
1. Lightweight ≠ weak
OSNet's 2.2M parameters beat ResNet50's 24M, proving that task-specific design is far more effective than blindly scaling up. Bigger is not better; a better fit is better.
2. Dynamic > static
Every ablation points to the same conclusion: dynamic, input-dependent fusion consistently beats static fusion. The same insight was later validated at scale by attention mechanisms (Transformers).
3. The magic of shared parameters
The unified AG's shared-parameter design saves more than memory: it mathematically guarantees better gradient flow (the paper's Eq. 4). This "less is more" design philosophy is worth remembering.
4. Designing for edge AI matters
OSNet considered edge deployment from day one. That design principle means it is not just good numbers on paper; it can genuinely be deployed in real-world surveillance systems. The 0.2M osnet_x0_25 runs on an STM32, which most ReID models cannot do.
Summary
OSNet is a lightweight CNN that pushes multi-scale feature learning to its limit.
Core contributions
- Omni-scale feature learning: the first to learn homogeneous and heterogeneous scale features simultaneously
- Unified Aggregation Gate: a shared, dynamic, channel-wise multi-scale fusion mechanism
- Lite 3×3: factorized convolution delivering 3× model compression at almost zero accuracy loss
- SOTA on 6 datasets: beats all ResNet50-based models with 10× fewer parameters
Practical value
- 🎯 High accuracy: Market1501 Rank-1 = 94.8%, MSMT17 Rank-1 = 78.7%
- 📱 Edge-ready: osnet_x0_25 has only 0.2M params and runs on embedded devices
- 🌍 Cross-domain: OSNet-AIN generalizes to new scenes with zero adaptation
- 🛠️ Production-ready: Torchreid provides a complete training / evaluation / export pipeline
- 📦 Active community: 4.8K stars, 20 contributors, continuously updated
Why should you care?
If you:
- work on surveillance / security → OSNet is the go-to ReID baseline
- work on edge AI → a 0.2M-param model can run on the camera itself
- work on multi-scale feature learning → the AG design insight transfers to other tasks
- are learning CNN architecture design → OSNet's ablation study is a textbook example of design validation
Related Resources
- 📄 Original paper (ICCV 2019): arXiv:1905.00953
- 📄 Extended paper (TPAMI 2021): Learning Generalisable Omni-Scale Representations
- 💻 Code + pretrained models: github.com/KaiyangZhou/deep-person-reid
- 🤗 HuggingFace weights: kaiyangzhou/osnet
- 📊 Model Zoo: torchreid Model Zoo
- 📚 Tech report: arXiv:1910.10093
- 🏭 Edge deployment: STM32 AI Model Zoo - OSNet
OSNet's message: the key to ReID is not how big the model is, but how "complete" the features are. A 2.2M model can learn better features than a 24M one, as long as the architecture lets every channel freely pick the scale combination that suits it. That is the power of omni-scale. 🔍✨