Paper source: University of Surrey / Queen Mary University of London / Samsung AI Center. arXiv:1905.00953 (ICCV 2019). TPAMI 2021 extension: Learning Generalisable Omni-Scale Representations. GitHub: KaiyangZhou/deep-person-reid. HuggingFace: kaiyangzhou/osnet
TL;DR
OSNet is a lightweight CNN designed specifically for Person Re-Identification (ReID). Its core innovation is learning omni-scale features: features that cover single scales and mixtures of multiple scales at the same time.
Key points:
- 🎯 Omni-Scale Features: captures both local (shoes, logos) and global (whole-body silhouette) cues, and can dynamically mix different scales
- 🏗️ Only 2.2M parameters: more than 10× smaller than ResNet50 (24M+), yet performs better
- 🚀 SOTA on 6 datasets: Market1501 Rank-1 = 94.8%, beating almost every large model
- ⚡ Unified Aggregation Gate: one shared mini-network dynamically decides which scale each channel should focus on
- 📱 Edge-friendly: the tiny model size suits deployment on surveillance cameras
Background: What Is Person Re-ID, and Why Is It Hard?
Problem definition
The goal of Person Re-Identification is simple to state: given footage from different cameras, find the same person again.
Imagine a shopping mall with 50 CCTV cameras. A suspicious person appears on Camera A, and you want to know where they went next, i.e. find them again in the feeds of Cameras B, C, D, and so on.
💡 One-line definition: ReID = the cross-camera "is this the same person?" problem. It is not face recognition: surveillance resolution is too low, and faces are usually not visible.
Why is it hard? Two main challenges
Challenge 1: the same person can look very different (large intra-class variation)
The same person can have a completely different appearance across cameras:
- Front vs back view (a backpack is only visible from behind)
- Lighting differences (dark indoors vs bright outdoors)
- Occlusion (half the body blocked by other people)
Challenge 2: different people can look very similar (small inter-class variation)
In public spaces many people dress alike: the white T-shirt + jeans combination can match dozens of people. From a distance they look almost identical.
🎯 The core difficulty: you need features that can simultaneously be:
Global: recognising the overall "white T-shirt + grey shorts" outfit
Local: noticing details like "sneakers vs sandals"
Mixed: capturing cross-scale combinations like "a specific logo on a white T-shirt"
Many models can handle the first two, but the third, heterogeneous-scale features, is OSNet's unique contribution.
A concrete example
Suppose you need to match a person. Your query image shows a man in a white T-shirt. The gallery contains two candidates who both wear white T-shirts:
| Feature type | Scale | Example | Discriminative? |
|---|---|---|---|
| Global | Whole body | "White T-shirt + grey shorts" | ❌ Both candidates match |
| Local | Small region | "A logo on the T-shirt" | ⚠️ The logo is tiny and may be confused with other patterns |
| Heterogeneous (mixed) | Cross-scale | "White T-shirt + a specific logo on the front" | ✅ Only this combination is unique! |
The logo alone is useless (too generic), and the white T-shirt alone is useless (too common). But the combination of a white T-shirt with this specific logo is highly distinctive. Such cross-scale mixed features are exactly what omni-scale features are.
Problems with Existing Methods
The dilemma of borrowing ImageNet models
Most ReID models directly reuse backbones designed for ImageNet, such as ResNet50 or Inception. But those models were built for category-level recognition (telling cats from dogs), which is a fundamentally different task from instance-level recognition (telling apart two people in white shirts).
| Task | Goal | Features needed |
|---|---|---|
| ImageNet classification | Distinguish "cat" from "dog" | Category-level (ear shape, fur colour) |
| Person ReID | Distinguish "person A" from "person B" | Instance-level (a specific logo, shoe style, backpack details) |
Limitations of existing multi-scale methods
Some ReID models do try to learn multi-scale features, but each has a flaw:
| Method | Problem |
|---|---|
| ResNeXt-based (MLFN) | All streams share the same scale, so different scales are never learned |
| Inception-based (MuDeep) | Hand-designed mixed ops, fused with fixed weights |
| Multi-level fusion | Fusion happens only at specific layers, not in every layer |
| Part-based models (PCB) | Depend on an external pose detector; not end-to-end |
⚠️ The core gap: no existing model simultaneously achieves all of:
Learning features of different scales in every layer
Fusing scales dynamically (input-dependent)
Learning heterogeneous-scale features (mixtures of multiple scales)
OSNet is the first architecture to solve all three.
OSNet: Core Design

Design philosophy
💡 Core insight: use multiple streams with different receptive fields, plus one shared dynamic gate, so that each feature channel automatically picks the scale combination that suits it best.
The OSNet building block has three key components:
- Lite 3×3: a lightweight depthwise separable convolution
- Multi-Scale Streams: 4 convolutional streams of different depths (t = 1, 2, 3, 4)
- Unified Aggregation Gate (AG): a shared mini-network that dynamically fuses multi-scale features
Component 1: Lite 3×3, the lightweight convolution
A standard 3×3 convolution mapping c input channels to c' output channels costs 9·c·c' parameters. OSNet splits it into two steps with a depthwise separable convolution, cutting the count to c·c' + 9·c':
🔑 OSNet's subtle twist: MobileNet uses the depthwise → pointwise order, but OSNet uses pointwise → depthwise (first expand the channels, then do spatial aggregation). The authors found this order more effective for omni-scale feature learning.
Parameter comparison:
| Type | Parameters | At c = c' = 256 |
|---|---|---|
| Standard 3×3 conv | 9·c·c' | 589,824 |
| Lite 3×3 (OSNet) | c·c' + 9·c' | 67,840 (8.7× fewer) |
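The two counts in the table can be checked with a few lines of arithmetic. A quick sketch (the helper names are mine; BN and bias parameters are ignored):

```python
# Parameter count of a standard 3x3 conv vs OSNet's Lite 3x3
# (1x1 pointwise followed by 3x3 depthwise). BN and biases ignored.
def standard_3x3_params(c_in: int, c_out: int) -> int:
    return 9 * c_in * c_out              # full 3x3 connectivity

def lite_3x3_params(c_in: int, c_out: int) -> int:
    pointwise = c_in * c_out             # 1x1 conv mixes channels
    depthwise = 9 * c_out                # 3x3 conv with groups=c_out
    return pointwise + depthwise

print(standard_3x3_params(256, 256))     # 589824
print(lite_3x3_params(256, 256))         # 67840
```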
Component 2: Multi-Scale Streams, 4 routes at different scales

The OSNet building block (OSBlock) contains 4 parallel streams, each a stack of a different number of Lite 3×3 layers:
| Stream | Exponent t | Lite 3×3 layers | Receptive field | Captures |
|---|---|---|---|---|
| Stream 1 | t=1 | 1 | 3×3 | Very local (texture, edges) |
| Stream 2 | t=2 | 2 | 5×5 | Local (logos, accessories) |
| Stream 3 | t=3 | 3 | 7×7 | Medium (upper/lower body) |
| Stream 4 | t=4 | 4 | 9×9 | Global (whole-body silhouette) |
Why this design?
Each stream's receptive field is (2t+1)×(2t+1), controlled by the exponent t. By increasing t linearly, OSNet ensures every block covers the full scale range from fine-grained to coarse.
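The (2t+1) growth is easy to verify: each extra 3×3 conv at stride 1 widens the receptive field by 2 pixels. A tiny sketch (the function name is mine):

```python
def stacked_3x3_rf(t: int) -> int:
    """Receptive field of t stacked 3x3 convs with stride 1."""
    rf = 1
    for _ in range(t):
        rf += 2            # each 3x3 layer extends the RF by 2
    return rf

print([stacked_3x3_rf(t) for t in range(1, 5)])  # [3, 5, 7, 9]
```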
Component 3: Unified Aggregation Gate, the heart of dynamic scale fusion
This is OSNet's most important innovation. The AG is a shared mini-network that generates channel-wise weights for each stream, dynamically deciding which scale each feature channel should use.
AG structure: GAP → FC → ReLU → FC → sigmoid, shared across all streams.
Mathematically, the fused feature is
$$\bar{x} = \sum_{t=1}^{T} G(x^t) \odot x^t$$
where $G(x^t)$ is the channel-wise weight vector the AG produces for the $t$-th stream, and $\odot$ is the Hadamard product (element-wise multiplication).
🎯 Why is the AG shared (unified)? Three reasons:
Parameter count is independent of T: no matter how many streams you have, the AG's parameter count stays the same → more scalable
Better gradient flow: during backprop, the supervision signals of all streams converge on the same AG → more informative gradients
Encourages cross-scale comparison: one network processes the features of every scale → it naturally learns to compare and fuse different scales
Why channel-wise weights instead of a stream-wise scalar?
One scalar weight per stream is too coarse: it can only say "this whole stream matters or it doesn't". Channel-wise weights, by contrast, allow:
- Channel 1 to rely mainly on Stream 1 (local texture)
- Channel 2 to rely mainly on Stream 3 (a medium-scale body part)
- Channel 3 to mix Stream 1 + Stream 4 (local + global = heterogeneous scale!)
This fine-grained fusion is the key that lets OSNet learn heterogeneous-scale features.
A worked numeric example
Assume mid_channels = 64 (each stream outputs 64 channels):
Step 1: the 4 streams each output feature maps
| Stream | Output shape | Receptive field |
|---|---|---|
| Stream 1 | 32×16×64 | 3×3 (shoe buckles, buttons) |
| Stream 2 | 32×16×64 | 5×5 (logos, pockets) |
| Stream 3 | 32×16×64 | 7×7 (upper body) |
| Stream 4 | 32×16×64 | 9×9 (whole body) |
Step 2: the shared AG generates a 64-dim weight vector per stream
Suppose the input is an image of "a man in a white T-shirt with a logo":
| Channel | AG weight for Stream 1 | AG weight for Stream 2 | AG weight for Stream 3 | AG weight for Stream 4 | What is learned? |
|---|---|---|---|---|---|
| Ch. 1 | 0.1 | 0.8 | 0.3 | 0.1 | Focus on the logo (5×5 scale) |
| Ch. 2 | 0.1 | 0.2 | 0.3 | 0.9 | Focus on the whole-body silhouette (9×9 scale) |
| Ch. 3 | 0.7 | 0.1 | 0.8 | 0.2 | Mixed: texture + upper body = heterogeneous! |
Step 3: weighted sum
Channel 3 mixes Stream 1 (texture) with Stream 3 (upper body), so that channel captures exactly the heterogeneous-scale feature "a specific texture on a white T-shirt"!
💡 What "dynamic" means: the weights above are input-dependent. Swap in a picture of a person without a logo and the AG will assign completely different weights. This is a fundamental difference from fixed fusion (e.g. simple addition or concatenation).
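The three steps above can be replayed with toy numbers (all values invented for illustration; one scalar stands in for each channel's whole feature map):

```python
# Toy replay of the AG fusion: 4 streams x 3 channels, using the
# hypothetical gate weights from the table above.
streams = {            # one made-up scalar per channel
    "s1": [1.0, 0.0, 1.0],
    "s2": [1.0, 0.0, 0.0],
    "s3": [0.0, 0.0, 1.0],
    "s4": [0.0, 1.0, 0.0],
}
gates = {              # channel-wise AG weights (rows Ch.1-3 above)
    "s1": [0.1, 0.1, 0.7],
    "s2": [0.8, 0.2, 0.1],
    "s3": [0.3, 0.3, 0.8],
    "s4": [0.1, 0.9, 0.2],
}
# Step 3: weighted sum over streams, independently per channel
fused = [sum(gates[k][c] * streams[k][c] for k in streams) for c in range(3)]
print(fused)  # channel 3 mixes s1 and s3: a heterogeneous-scale response
```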
The Complete Network Architecture
Architecture overview
OSNet builds the whole network in the simplest possible way: stacking identical OSBlocks layer by layer, with no fancy per-stage customisation.
| Stage | Output size | Operation | Channels |
|---|---|---|---|
| conv1 | 128×64 | 7×7 conv, stride 2 | 64 |
| maxpool | 64×32 | 3×3 max pool, stride 2 | 64 |
| conv2 | 64×32 | OSBlock × 2 | 256 |
| transition | 32×16 | 1×1 conv + 2×2 avg pool | 256 |
| conv3 | 32×16 | OSBlock × 2 | 384 |
| transition | 16×8 | 1×1 conv + 2×2 avg pool | 384 |
| conv4 | 16×8 | OSBlock × 2 | 512 |
| conv5 | 16×8 | 1×1 conv | 512 |
| GAP | 1×1 | Global Average Pooling | 512 |
| FC | - | Fully Connected | 512 |
Model complexity:
- Parameters: 2.2M (ResNet50 = 23.5M, a 10.7× gap)
- Mult-Adds: 978.9M
- Feature dim: 512 (matching is done by feature distance)
Width multiplier: flexible scaling
OSNet supports a width multiplier β to scale the model size:
| Model | β | Params | Mult-Adds | Market1501 R1 |
|---|---|---|---|---|
| osnet_x1_0 | 1.0 | 2.2M | 978.9M | 94.8% |
| osnet_x0_75 | 0.75 | 1.3M | 571.8M | 94.5% |
| osnet_x0_5 | 0.5 | 0.6M | 272.9M | 93.4% |
| osnet_x0_25 | 0.25 | 0.2M | 82.3M | 92.2% |
🚀 The 0.2M-parameter osnet_x0_25 still reaches 92.2% Rank-1! That beats many 24M+ parameter ResNet50-based models, proof that OSNet's design is genuinely efficient.
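As a rough sanity check on the table (my own back-of-envelope estimate, not from the paper): convolution parameters scale roughly with β², because both input and output channel counts shrink by β:

```python
# Rough estimate: params(beta) ~ params(1.0) * beta^2.
base_params_m = 2.2                      # osnet_x1_0, in millions
for beta in (1.0, 0.75, 0.5, 0.25):
    print(beta, round(base_params_m * beta ** 2, 2))
# Tracks the table well down to beta = 0.5; at 0.25 the estimate
# (~0.14M) undershoots slightly because some layers do not scale.
```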
How OSNet Differs from Inception / ResNeXt / SENet
OSNet's multi-stream design superficially resembles Inception and ResNeXt, but the differences are fundamental:
| Property | Inception | ResNeXt | SENet | OSNet |
|---|---|---|---|---|
| Purpose of multi-stream | Reduce computation | Increase width | Re-calibrate channels | Capture different scales |
| Stream scales | Hand-designed mixed ops | All streams at one scale | Only 1 stream | Each stream at a different scale |
| Fusion | Concatenation | Addition | Channel re-scaling | Dynamic channel-wise AG |
| Dynamic fusion? | ❌ Fixed | ❌ Fixed | ✅ Dynamic (but not multi-scale) | ✅ Dynamic + multi-scale |
| Heterogeneous scale? | ❌ | ❌ | ❌ | ✅ |
🎯 One-line summary: Inception pursues efficiency, ResNeXt pursues width, SENet pursues channel attention, OSNet pursues omni-scale feature learning. Four completely different goals.
Ablation Study: How Much Does Each Design Decision Matter?
The authors ran a thorough ablation; every conclusion is backed by experiments:
1. Effect of multi-scale streams
| Number of streams T | Market1501 R1 | mAP |
|---|---|---|
| T=1 (single scale) | 86.5% | 67.7% |
| T=2 + AG | 91.7% | 77.0% |
| T=3 + AG | 92.8% | 79.9% |
| T=4 + unified AG | 93.6% | 81.0% |
From T=1 to T=4, Rank-1 improves by 7.1% and mAP by 13.3%. Every added stream brings a clear gain.
2. Comparing fusion strategies
| Fusion method | R1 | mAP | Character |
|---|---|---|---|
| Concatenation | 91.4% | 77.4% | Fixed, coarse |
| Addition | 92.0% | 78.2% | Fixed, equal weights |
| Separate AGs | 92.9% | 80.2% | Dynamic, but no cross-stream gradient |
| Unified AG (stream-wise scalar) | 92.6% | 80.0% | Dynamic, but too coarse |
| Learned-and-fixed gates | 91.6% | 77.5% | Learned in training, frozen at test time |
| Unified AG (channel-wise, dynamic) | 93.6% | 81.0% | All of the above ✅ |
💡 Three key takeaways:
Dynamic > fixed: learned-and-fixed gates trail the unified AG by 2.0% R1 → adaptive fusion matters a lot
Channel-wise > stream-wise: the stream-wise scalar trails by 1.0% R1 → fine-grained fusion matters
Unified > separate: the unified AG beats separate AGs by 0.7% R1 → the advantage of shared gradients
3. Lite 3×3 vs standard convolution
Standard convolution buys only 0.4% R1 at 3× the model size. Lite 3×3 loses almost nothing!
Experimental Results

Large datasets (same-domain)
| Method | Backbone | Params | Market1501 R1/mAP | CUHK03 R1/mAP | Duke R1/mAP | MSMT17 R1/mAP |
|---|---|---|---|---|---|---|
| PCB | ResNet50 | ~24M | 93.8 / 81.6 | 63.7 / 57.5 | 83.3 / 69.2 | 68.2 / 40.4 |
| DGNet | ResNet50 | ~24M | 94.8 / 86.0 | - / - | 86.6 / 74.8 | 77.2 / 52.3 |
| IANet | ResNet50 | ~24M | 94.4 / 83.1 | - / - | 87.1 / 73.4 | 75.5 / 46.8 |
| MobileNetV2 | MobileNetV2 | 2.2M | 87.0 / 69.5 | 46.5 / 46.0 | 75.2 / 55.8 | 50.9 / 27.0 |
| ShuffleNet | ShuffleNet | ~2M | 84.8 / 65.0 | 38.4 / 37.2 | 71.6 / 49.9 | 41.5 / 19.9 |
| OSNet | OSNet | 2.2M | 94.8 / 84.9 | 72.3 / 67.8 | 88.6 / 73.5 | 78.7 / 52.9 |
🚀 Key observations:
With 2.2M params, OSNet beats every 24M+ param ResNet50-based model
It leads the comparably sized MobileNetV2 (also 2.2M params) by 7.8% R1 on Market1501
On MSMT17, the largest and hardest dataset, OSNet's 78.7% R1 leads all methods
Its 72.3% R1 on CUHK03 beats the runner-up CAMA (66.6%) by 5.7%
Small datasets (VIPeR & GRID)
| Method | VIPeR R1 | GRID R1 |
|---|---|---|
| HydraPlus-Net | 56.6% | - |
| GLAD | 54.8% | - |
| JLML | 50.2% | 37.5% |
| OSNet | 68.0% | 38.2% |
OSNet leads the VIPeR runner-up by 11.4%! This dataset has only a few hundred training images, showing that OSNet's lightweight design effectively prevents overfitting.
Cross-Domain Generalization: OSNet-AIN
The ICCV 2019 version focused on the same-domain setting. The 2021 TPAMI extension further tackles the cross-domain problem: train on Dataset A, deploy directly on Dataset B with no adaptation at all.
Why does cross-domain matter?
In the real world you cannot collect labelled data for every new scene. You need a model trained in mall A to work out of the box in airport B.
OSNet-AIN's solution: Instance Normalization
OSNet-AIN (Adaptive Instance Normalization) inserts Instance Normalization (IN) into the network's early layers, teaching the model to ignore domain-specific style information (e.g. lighting, colour distribution) and focus on domain-invariant structural features.
| Model | MSMT17→Market R1/mAP | MSMT17→Duke R1/mAP |
|---|---|---|
| ResNet50 | 46.3 / 22.8 | 52.3 / 32.1 |
| OSNet x1.0 | 66.6 / 37.5 | 66.0 / 45.3 |
| OSNet-IBN x1.0 | 66.5 / 37.2 | 67.4 / 45.6 |
| OSNet-AIN x1.0 | 70.1 / 43.3 | 71.1 / 52.7 |
OSNet-AIN beats ResNet50's cross-domain performance by 23.8% R1 (MSMT17→Market)!
Deeper Analysis
Why does the unified AG get better gradients?
This is an elegant mathematical property. Per the paper's Eq. 3, the fused feature is $\bar{x} = \sum_{t=1}^{T} G(x^t) \odot x^t$; because the AG $G$ is shared, the gradient of its parameters collects contributions from every stream's features. In other words, the AG gets to "see" information from all scales at once when updating itself.
With separate AGs, each gate only "sees" its own stream's gradient; there is no cross-scale supervision signal.
💡 Analogy: a unified AG is like one judge who watches all 4 contestants before scoring; separate AGs are like 4 judges who each watch only 1 contestant. The former's scores are naturally more comprehensive and consistent.
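The shared-gradient argument can be seen in a one-scalar caricature (made up purely to show the mechanism): if a single shared weight w gates every stream, the derivative of the fused output with respect to w sums over all streams, whereas separate weights each see only their own stream:

```python
# Caricature: scalar "features" x_t, fused = sum_t (w * x_t).
# d(fused)/dw = sum_t x_t  -> the shared gate sees every stream.
def shared_gate_grad(xs):
    return sum(xs)

# With separate gates w_t, fused = sum_t (w_t * x_t), and
# d(fused)/dw_t = x_t  -> each gate sees only one stream.
def separate_gate_grads(xs):
    return list(xs)

xs = [1.0, 2.0, 3.0, 4.0]
print(shared_gate_grad(xs))      # 10.0
print(separate_gate_grads(xs))   # [1.0, 2.0, 3.0, 4.0]
```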
What does the AG learn? Visualisation
The authors ran k-means clustering on test images by their AG gating vectors and found that images within a cluster share similar visual patterns:
- Cluster A: all back views with backpacks → the AG emphasises the global scale
- Cluster B: all T-shirts with prominent logos/patterns → the AG mixes local and medium scales
- Cluster C: all people dressed entirely in black → the AG emphasises local details (the global view carries little discriminative signal)
This shows the AG really has learned to adapt its scale-fusion strategy to each person's appearance.
Attention map comparison
Comparing activation maps of OSNet against a single-scale baseline:
- Single-scale: over-concentrates on the face region (but surveillance face resolution is too low to be reliable)
- OSNet: detects local discriminative patterns such as shirt logos and shoe styles, while maintaining whole-body context
Implementation Guide: Using Torchreid + OSNet
Option 1: Torchreid (the official framework, most complete)

```python
# Install:
#   git clone https://github.com/KaiyangZhou/deep-person-reid.git
#   cd deep-person-reid && python setup.py develop
import torchreid

# ===== Step 1: load the data =====
datamanager = torchreid.data.ImageDataManager(
    root="reid-data",
    sources="market1501",        # training dataset
    targets="market1501",        # test dataset
    height=256,
    width=128,
    batch_size_train=32,
    batch_size_test=100,
    transforms=["random_flip", "random_erase"]
)

# ===== Step 2: build the OSNet model =====
model = torchreid.models.build_model(
    name="osnet_x1_0",                       # standard OSNet
    num_classes=datamanager.num_train_pids,  # number of identities
    loss="softmax",                          # classification loss
    pretrained=True                          # ImageNet pretrained weights
)
model = model.cuda()

# ===== Step 3: optimizer + scheduler =====
optimizer = torchreid.optim.build_optimizer(
    model, optim="amsgrad", lr=0.0015
)
scheduler = torchreid.optim.build_lr_scheduler(
    optimizer, lr_scheduler="cosine"
)

# ===== Step 4: training engine =====
engine = torchreid.engine.ImageSoftmaxEngine(
    datamanager, model,
    optimizer=optimizer,
    scheduler=scheduler,
    label_smooth=True
)

# ===== Step 5: train + evaluate =====
engine.run(
    save_dir="log/osnet_x1_0_market1501",
    max_epoch=150,
    eval_freq=10,
    print_freq=10,
    test_only=False
)
```
Option 2: the feature extraction API (you already have a model and only want features)

```python
import torchreid
from torchreid.utils import FeatureExtractor

# Load a pretrained OSNet
extractor = FeatureExtractor(
    model_name="osnet_x1_0",
    model_path="path/to/osnet_x1_0_market.pth.tar",
    device="cuda"
)

# Extract features (512-dim vectors)
image_list = ["person_001.jpg", "person_002.jpg", "person_003.jpg"]
features = extractor(image_list)  # shape: (3, 512)

# Compute similarity
import torch
from torch.nn.functional import cosine_similarity

sim = cosine_similarity(
    features[0].unsqueeze(0),
    features[1].unsqueeze(0)
)
print(f"Similarity: {sim.item():.4f}")
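In a real retrieval setting the extracted features would then be ranked against an entire gallery. A minimal sketch with dummy vectors (NumPy stand-ins for real 512-D OSNet features; the function name is mine):

```python
import numpy as np

def rank_gallery(query, gallery):
    """Rank gallery rows by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims), sims   # best match first

# Dummy 2-D "features" instead of real 512-D OSNet outputs
query = np.array([1.0, 0.0])
gallery = np.array([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
order, sims = rank_gallery(query, gallery)
print(order)  # gallery item 0 is the closest match, item 1 the farthest
```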
Option 3: cross-domain (with OSNet-AIN)

```bash
# Train on DukeMTMC, test directly on Market1501 (zero adaptation)
python scripts/main.py \
    --config-file configs/im_osnet_ain_x1_0_softmax_256x128_amsgrad_cosine.yaml \
    -s dukemtmcreid \
    -t market1501 \
    --transforms random_flip color_jitter \
    --root $PATH_TO_DATA
```
💡 Cross-domain tips:
- Use color_jitter instead of random_erase (improves generalization)
- Use cosine distance instead of euclidean (AIN works best with cosine)
- Training on multiple source datasets (-s msmt17 dukemtmcreid cuhk03) works even better
Option 4: export to ONNX / OpenVINO / TFLite (edge deployment)
Torchreid ships with a built-in export script:

```bash
# Export to ONNX
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format onnx \
    --input-size 256 128

# Export to OpenVINO (for Intel hardware)
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format openvino

# Export to TFLite (for mobile)
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format tflite
```
🚀 Edge-deployment recommendation: use osnet_x0_25 (0.2M params, 82M mult-adds), which can run on embedded devices such as STM32. STMicroelectronics has added OSNet to its AI Model Zoo.
Available Pretrained Models
| Model | Params | Use case | Download |
|---|---|---|---|
| osnet_x1_0 | 2.2M | Same-domain ReID (best accuracy) | HuggingFace / Torchreid Model Zoo |
| osnet_x0_25 | 0.2M | Edge / mobile deployment | Torchreid Model Zoo |
| osnet_ibn_x1_0 | 2.2M | Moderate cross-domain | Torchreid Model Zoo |
| osnet_ain_x1_0 | 2.2M | Best cross-domain | Torchreid Model Zoo |
Reading OSNet's Core Code
The best way to understand OSNet is to read the source. Below is a simplified version of the core building block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightConv3x3(nn.Module):
    """Lite 3x3: pointwise -> depthwise (OSNet's unusual order)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)   # 1x1 pointwise
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1,   # 3x3 depthwise
                               bias=False, groups=out_ch)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv2(self.conv1(x))))


class ChannelGate(nn.Module):
    """Unified Aggregation Gate: GAP -> FC -> ReLU -> FC -> sigmoid."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1, bias=True)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1, bias=True)

    def forward(self, x):
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(self.gap(x)))))
        return x * w  # channel-wise reweighting


class OSBlock(nn.Module):
    """Omni-Scale Feature Learning Block."""

    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        mid = out_ch // reduction
        # Dimension reduction
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU()
        )
        # 4 streams with increasing receptive fields
        self.stream1 = LightConv3x3(mid, mid)      # RF = 3x3
        self.stream2 = nn.Sequential(              # RF = 5x5
            LightConv3x3(mid, mid), LightConv3x3(mid, mid)
        )
        self.stream3 = nn.Sequential(              # RF = 7x7
            LightConv3x3(mid, mid), LightConv3x3(mid, mid),
            LightConv3x3(mid, mid)
        )
        self.stream4 = nn.Sequential(              # RF = 9x9
            LightConv3x3(mid, mid), LightConv3x3(mid, mid),
            LightConv3x3(mid, mid), LightConv3x3(mid, mid)
        )
        # Shared Aggregation Gate (the key piece!)
        self.gate = ChannelGate(mid)
        # Dimension restoration
        self.conv3 = nn.Sequential(
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch)
        )
        # Residual connection
        self.downsample = None
        if in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        identity = x
        x1 = self.conv1(x)
        # Each stream extracts features at a different scale
        s1 = self.stream1(x1)  # 3x3 scale
        s2 = self.stream2(x1)  # 5x5 scale
        s3 = self.stream3(x1)  # 7x7 scale
        s4 = self.stream4(x1)  # 9x9 scale
        # The same AG processes all 4 streams -> channel-wise fusion
        fused = self.gate(s1) + self.gate(s2) + self.gate(s3) + self.gate(s4)
        out = self.conv3(fused)
        if self.downsample is not None:
            identity = self.downsample(identity)
        return F.relu(out + identity)
```
🔑 Note the line `fused = self.gate(s1) + self.gate(s2) + self.gate(s3) + self.gate(s4)`: self.gate is called 4 times, i.e. the same AG processes all 4 streams. That is what "unified" means. On every forward pass, the AG produces different channel-wise weights from each stream's input, implementing dynamic, input-dependent scale fusion.
Limitations and Future Directions
Current limitations
1. No temporal information
OSNet is designed for image-based ReID. For video-based ReID it has no temporal aggregation mechanism (although others have extended OSNet to video with decent results).
2. Fixed T=4
The number of streams T=4 is hard-coded. The ablation shows T=4 is good enough, but different datasets may have different optimal T. The follow-up OSNet-AIN uses NAS (Neural Architecture Search) to automatically search for the optimal block structure.
3. No occlusion handling
Partial occlusion of pedestrians is common in the real world. OSNet has no dedicated occlusion mechanism (e.g. part-visibility estimation).
Follow-up developments
- OSNet-AIN (TPAMI 2021): adds Instance Normalization to tackle the cross-domain problem
- MixStyle (ICLR 2021): a domain generalization method by the same author that can be combined with OSNet
- STM32 AI Model Zoo: OSNet has been officially adopted by STMicroelectronics for its embedded AI platform
- 4,793 GitHub stars: an active, continuously maintained community
Technical lessons
1. Lightweight ≠ weak
OSNet's 2.2M parameters beat ResNet50's 24M, proving that task-specific design is far more effective than blindly scaling up. Bigger is not better; a better fit is better.
2. Dynamic > static
Every ablation points to the same conclusion: dynamic, input-dependent fusion consistently beats static fusion. The same insight was later validated at scale by attention mechanisms (Transformers).
3. The magic of shared parameters
The unified AG's shared-parameter design saves more than memory: it mathematically guarantees better gradient flow (the paper's Eq. 4). This "less is more" design philosophy is worth remembering.
4. Designing for edge AI matters
OSNet considered edge deployment from day one. That design principle means it is not just good numbers on paper; it can genuinely be deployed in real-world surveillance systems. The 0.2M osnet_x0_25 runs on an STM32, which most ReID models cannot do.
Summary
OSNet is a lightweight CNN that pushes multi-scale feature learning to its limit.
Core contributions
- Omni-scale feature learning: the first to learn homogeneous and heterogeneous scale features simultaneously
- Unified Aggregation Gate: a shared, dynamic, channel-wise multi-scale fusion mechanism
- Lite 3×3: factorized convolution delivering 3× model compression at almost zero accuracy loss
- SOTA on 6 datasets: beats all ResNet50-based models with 10× fewer parameters
Practical value
- 🎯 High accuracy: Market1501 Rank-1 = 94.8%, MSMT17 Rank-1 = 78.7%
- 📱 Edge-ready: osnet_x0_25 has only 0.2M params and runs on embedded devices
- 🌍 Cross-domain: OSNet-AIN generalizes to new scenes with zero adaptation
- 🛠️ Production-ready: Torchreid provides a complete training / evaluation / export pipeline
- 📦 Active community: 4.8K stars, 20 contributors, continuously updated
Why should you care?
If you:
- work on surveillance / security → OSNet is the go-to ReID baseline
- work on edge AI → a 0.2M-param model can run on the camera itself
- work on multi-scale feature learning → the AG design insight transfers to other tasks
- are learning CNN architecture design → OSNet's ablation study is a textbook example of design validation
Related Resources
- 📄 Original paper (ICCV 2019): arXiv:1905.00953
- 📄 Extended paper (TPAMI 2021): Learning Generalisable Omni-Scale Representations
- 💻 Code + pretrained models: github.com/KaiyangZhou/deep-person-reid
- 🤗 HuggingFace weights: kaiyangzhou/osnet
- 📊 Model Zoo: torchreid Model Zoo
- 📚 Tech report: arXiv:1910.10093
- 🏭 Edge deployment: STM32 AI Model Zoo - OSNet
OSNet's message: the key to ReID is not how big the model is, but how "complete" the features are. A 2.2M model can learn better features than a 24M one, as long as the architecture lets every channel freely pick the scale combination that suits it. That is the power of omni-scale. 🔍✨