Billy Tse


April 4, 2026 • 30 min read

OSNet: How Can a Lightweight 2.2M-Parameter CNN Beat ResNet50 at Person Re-ID?

A deep dive into OSNet (Omni-Scale Network), a lightweight CNN designed specifically for person re-identification. Learn how multi-scale streams plus a unified aggregation gate let it learn omni-scale features, beating a 24M-parameter ResNet50 with only 2.2M parameters.

Computer Vision · AI · Image Processing

Paper sources:

  • Affiliations: University of Surrey / Queen Mary University of London / Samsung AI Center
  • arXiv: 1905.00953 (ICCV 2019)
  • TPAMI 2021 extension: Learning Generalisable Omni-Scale Representations
  • GitHub: KaiyangZhou/deep-person-reid
  • HuggingFace: kaiyangzhou/osnet

TL;DR

OSNet is a lightweight CNN designed specifically for Person Re-Identification (ReID). Its core innovation is learning omni-scale features: features that span both single scales and mixtures of multiple scales.

Key points:

  • 🎯 Omni-Scale Features: captures both local cues (shoes, logos) and global cues (whole-body silhouette), and can dynamically mix different scales
  • 🏗️ Only 2.2M parameters: more than 10x smaller than ResNet50 (24M+), yet performs better
  • 🚀 SOTA on 6 datasets: Market1501 Rank-1 = 94.8%, beating almost every large model
  • ⚡ Unified Aggregation Gate: one shared mini-network dynamically decides which scale each channel should focus on
  • 📱 Edge-friendly: the tiny model size suits deployment on surveillance cameras


Background: What Is Person Re-ID, and Why Is It So Hard?

Problem Definition

The goal of Person Re-Identification is simple to state: find the same person again across footage from different cameras.

Imagine a shopping mall with 50 CCTV cameras. A suspicious person shows up on Camera A and you want to know where they went next, i.e., find them again in the footage from Cameras B, C, D, and so on.

💡 One-sentence definition: ReID is the cross-camera question "is this the same person?". It is not face recognition: surveillance resolution is too low, so faces are usually not visible.

Why Is It So Hard? Two Big Challenges

Challenge 1: the same person can look very different (large intra-class variation)

The same person can look completely different across cameras:

  • Front vs back view (a backpack is only visible from behind)
  • Lighting differences (dim indoors vs bright outdoors)
  • Occlusion (half the body blocked by other people)

Challenge 2: different people can look alike (small inter-class variation)

Public spaces are full of people dressed similarly: dozens of people may wear the white-T-shirt-and-jeans combination, and from a distance they look almost identical.

🎯 The core difficulty: you need features that can simultaneously be:

  1. Global: recognize the overall "white T-shirt + grey shorts" outfit

  2. Local: notice details such as "sneakers or sandals"

  3. Mixed: capture cross-scale combinations like "a white T-shirt with a particular logo on it"

Many models manage the first two, but the third, heterogeneous-scale features, is OSNet's distinctive contribution.

A Concrete Example

Suppose you need to match a person. Your query image shows a guy in a white T-shirt, and the gallery contains two candidates who both wear white T-shirts:

| Feature type | Scale | Example | Discriminative? |
| --- | --- | --- | --- |
| Global | whole body | "white T-shirt + grey shorts" | ❌ both candidates match |
| Local | small region | "a logo on the T-shirt" | ⚠️ the logo is tiny and easily confused with other patterns |
| Heterogeneous (mixed) | cross-scale | "white T-shirt with a particular logo on the front" | ✅ only this combination is unique! |

The logo alone is useless (too generic), and the white T-shirt alone is useless (too common). But the combination of this white T-shirt with this particular logo is highly distinctive. Such cross-scale mixed features are exactly what omni-scale features means.

Problems with Existing Methods

The Dilemma of Borrowed ImageNet Models

Most ReID models directly reuse backbones designed for ImageNet, such as ResNet50 or Inception. But those models were built for category-level recognition (telling cats from dogs), a fundamentally different task from instance-level recognition (telling apart two people in white shirts).

| Task | Goal | Features needed |
| --- | --- | --- |
| ImageNet classification | tell "cat" from "dog" | category-level (ear shape, fur color) |
| Person ReID | tell "person A" from "person B" | instance-level (a specific logo, shoe style, backpack details) |

Limitations of Existing Multi-Scale Methods

Some ReID models do try to learn multi-scale features, but all have shortcomings:

| Method | Problem |
| --- | --- |
| ResNeXt-based (MLFN) | all streams use the same scale, so different scales are never learned |
| Inception-based (MuDeep) | hand-designed mixed ops, fused with fixed weights |
| Multi-level fusion | fusion only happens at specific layers, not in every layer |
| Part-based models (PCB) | rely on an external pose detector; not end-to-end |

⚠️ The core gap: no existing model could simultaneously:

  1. Learn features at different scales in every layer

  2. Fuse scales dynamically (input-dependent)

  3. Learn heterogeneous-scale features (mixtures of multiple scales)

OSNet is the first architecture to solve all three.

OSNet: Core Design

Design Philosophy

💡 Core insight: use multiple streams with different receptive fields, plus one shared dynamic gate, so that each feature channel automatically picks the scale combination that suits it best.

OSNet's building block has three key components:

  1. Lite 3×3: a lightweight depthwise separable convolution
  2. Multi-Scale Streams: 4 convolutional streams of different depths (T = 1, 2, 3, 4)
  3. Unified Aggregation Gate (AG): a shared mini-network that dynamically fuses multi-scale features

Component 1: Lite 3×3, a Lightweight Convolution

A standard 3×3 convolution has $k^2 \times c \times c'$ parameters. OSNet uses a depthwise separable convolution to split it into two steps, sharply cutting the parameter count.

🔑 OSNet's subtle difference: MobileNet uses the depthwise → pointwise order, but OSNet uses pointwise → depthwise (expand the channels first, then aggregate spatially). The authors found this order more effective for omni-scale feature learning.

Parameter comparison:

| Type | Parameters | At c = 256, c' = 256 |
| --- | --- | --- |
| Standard 3×3 conv | $k^2 \times c \times c'$ | 589,824 |
| Lite 3×3 (OSNet) | $(k^2 + c) \times c'$ | 67,840 (8.7x fewer) |
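The counts in the table above are easy to verify with a quick back-of-the-envelope check (plain Python, nothing framework-specific):

```python
def standard_conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (ignoring bias)."""
    return k * k * c_in * c_out

def lite_conv_params(k, c_in, c_out):
    """Lite 3x3: a 1x1 pointwise conv (c_in * c_out parameters)
    followed by a k x k depthwise conv (k^2 * c_out parameters)."""
    return c_in * c_out + k * k * c_out

std = standard_conv_params(3, 256, 256)
lite = lite_conv_params(3, 256, 256)
print(std, lite, f"{std / lite:.1f}x fewer")  # 589824 67840 8.7x fewer
```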

Component 2: Multi-Scale Streams, 4 Routes at Different Scales

OSNet's building block (OSBlock) contains 4 parallel streams, each a stack of a different number of Lite 3×3 layers:

| Stream | Exponent t | Lite 3×3 layers | Receptive field | Features captured |
| --- | --- | --- | --- | --- |
| Stream 1 | t=1 | 1 | 3×3 | very local (texture, edges) |
| Stream 2 | t=2 | 2 | 5×5 | local (logos, accessories) |
| Stream 3 | t=3 | 3 | 7×7 | medium (upper body, lower body) |
| Stream 4 | t=4 | 4 | 9×9 | global (whole-body silhouette) |

Why this design?

Each stream's receptive field is $(2t + 1) \times (2t + 1)$, controlled by the exponent $t$. By increasing $t$ linearly, OSNet ensures every block covers the full scale range, from fine-grained to coarse.
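The receptive-field growth follows directly from stacking 3×3 convolutions; a trivial sketch makes the linear progression explicit:

```python
def receptive_field(num_3x3_layers):
    """Stacking t stride-1 3x3 convs yields a (2t + 1) x (2t + 1)
    receptive field: each extra layer grows the field by 2 pixels."""
    rf = 1
    for _ in range(num_3x3_layers):
        rf += 2  # a 3x3 kernel adds (k - 1) = 2 to the receptive field
    return rf

sizes = [receptive_field(t) for t in range(1, 5)]
print(sizes)  # [3, 5, 7, 9]  -- streams 1..4 cover the full scale range
```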

Component 3: Unified Aggregation Gate, the Heart of Dynamic Scale Fusion

This is OSNet's most important innovation. The AG is a shared mini-network that generates channel-wise weights for each stream, dynamically deciding which scale each feature channel should use.

The AG's structure: GAP → FC → ReLU → FC → Sigmoid.

The math:

$$\tilde{x} = \sum_{t=1}^{T} G(x^t) \odot x^t$$

where $G(x^t)$ is the channel-wise weight vector the AG generates for the $t$-th stream, and $\odot$ is the Hadamard product (element-wise multiplication).
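A minimal NumPy sketch of this fusion (illustrative only: the toy `gate` below stands in for the AG's GAP → FC → ReLU → FC → sigmoid mini-network, and all shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H, W = 4, 64, 32, 16  # streams, channels, height, width

def gate(x):
    """Toy stand-in for G(.): global average pool, then squash to (0, 1).
    The real AG inserts two shared FC layers between GAP and the sigmoid."""
    squeezed = x.mean(axis=(1, 2))          # GAP over spatial dims -> (C,)
    return 1.0 / (1.0 + np.exp(-squeezed))  # sigmoid -> channel-wise weights

streams = [rng.standard_normal((C, H, W)) for _ in range(T)]

# x_tilde = sum_t G(x^t) (Hadamard) x^t; weights broadcast over H, W
x_tilde = sum(gate(x)[:, None, None] * x for x in streams)

print(x_tilde.shape)  # (64, 32, 16)
```

Because `gate` reads its input before producing weights, the fusion is input-dependent: different streams (and different images) get different channel-wise weights from the same shared function.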

🎯 Why is the AG shared (unified)? Three reasons:

  1. Parameter count independent of T: no matter how many streams you add, the AG's parameter count stays the same → more scalable

  2. Better gradient flow: during backprop, supervision signals from all streams converge on the same AG → more informative gradients

  3. Encourages cross-scale comparison: one network handles features at every scale → it naturally learns to compare and fuse different scales

Why channel-wise weights instead of a stream-wise scalar?

One scalar weight per stream is too coarse: it can only say "this whole stream matters" or "it doesn't". Channel-wise weights allow:

  • Channel 1 to rely mostly on Stream 1 (local texture)
  • Channel 2 to rely mostly on Stream 3 (medium body part)
  • Channel 3 to mix Stream 1 + Stream 4 (local + global = heterogeneous scale!)

This fine-grained fusion is exactly what lets OSNet learn heterogeneous-scale features.

A Worked Example with Concrete Numbers

Suppose mid_channels = 64 (i.e., each stream outputs 64 channels):

Step 1: the 4 streams each output their own feature maps

| Stream | Output shape | Receptive field |
| --- | --- | --- |
| $x^1$ = Stream 1 | 32×16×64 | 3×3 (shoe buckles, buttons) |
| $x^2$ = Stream 2 | 32×16×64 | 5×5 (logos, pockets) |
| $x^3$ = Stream 3 | 32×16×64 | 7×7 (upper body) |
| $x^4$ = Stream 4 | 32×16×64 | 9×9 (whole body) |

Step 2: the shared AG generates a 64-dim weight vector for each stream

Suppose the input is a picture of "a guy in a white T-shirt with a logo":

| Channel | AG weight for Stream 1 | Stream 2 | Stream 3 | Stream 4 | What it learns |
| --- | --- | --- | --- | --- | --- |
| Ch. 1 | 0.1 | 0.8 | 0.3 | 0.1 | focus on the logo (5×5 scale) |
| Ch. 2 | 0.1 | 0.2 | 0.3 | 0.9 | focus on the whole-body silhouette (9×9 scale) |
| Ch. 3 | 0.7 | 0.1 | 0.8 | 0.2 | mixed: texture + upper body = heterogeneous! |

Step 3: weighted sum

$$\tilde{x}_{\text{ch.3}} = 0.7 \cdot x^1_{\text{ch.3}} + 0.1 \cdot x^2_{\text{ch.3}} + 0.8 \cdot x^3_{\text{ch.3}} + 0.2 \cdot x^4_{\text{ch.3}}$$

Channel 3 mixes Stream 1 (texture) with Stream 3 (upper body): what this channel captures is the heterogeneous-scale feature "a white T-shirt with a particular texture on it"!
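Plugging toy numbers into the Step 3 sum (hypothetical per-channel activations, just to check the arithmetic):

```python
weights = [0.7, 0.1, 0.8, 0.2]      # AG weights for channel 3, streams 1..4
activations = [1.0, 2.0, 3.0, 4.0]  # hypothetical x^t values at one position

x_tilde = sum(w * a for w, a in zip(weights, activations))
print(round(x_tilde, 2))  # 0.7*1 + 0.1*2 + 0.8*3 + 0.2*4 = 4.1
```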

💡 What "dynamic" means: the weights above are input-dependent. Feed in a picture of someone without a logo and the AG assigns completely different weights. This is fundamentally different from fixed fusion (e.g., simple addition or concatenation).

The Full Network Architecture

Architecture Overview

OSNet builds the whole network in the simplest way possible: stacking identical OSBlocks layer by layer, with no fancy per-stage customization.

| Stage | Output size | Operation | Channels |
| --- | --- | --- | --- |
| conv1 | 128×64 | 7×7 conv, stride 2 | 64 |
| maxpool | 64×32 | 3×3 max pool, stride 2 | 64 |
| conv2 | 64×32 | OSBlock × 2 | 256 |
| transition | 32×16 | 1×1 conv + 2×2 avg pool | 256 |
| conv3 | 32×16 | OSBlock × 2 | 384 |
| transition | 16×8 | 1×1 conv + 2×2 avg pool | 384 |
| conv4 | 16×8 | OSBlock × 2 | 512 |
| conv5 | 16×8 | 1×1 conv | 512 |
| GAP | 1×1 | global average pooling | 512 |
| FC | - | fully connected | 512 |

Model complexity:

  • Parameters: 2.2M (vs ResNet50's 23.5M, a 10.7x gap)
  • Mult-Adds: 978.9M
  • Feature dim: 512 (matching uses $\ell_2$ distance)
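Matching at inference time then reduces to nearest-neighbour search in that 512-dim space. A sketch with dummy embeddings (random vectors standing in for real extractor outputs):

```python
import numpy as np

rng = np.random.default_rng(42)
query = rng.standard_normal(512)          # one query embedding
gallery = rng.standard_normal((10, 512))  # ten gallery embeddings

# l2 distance between the query and every gallery feature
dists = np.linalg.norm(gallery - query, axis=1)

# rank the gallery: smallest distance first (Rank-1 = best match)
ranking = np.argsort(dists)
print("Rank-1 match: gallery index", ranking[0])
```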

Width Multiplier: Flexible Scaling

OSNet supports a width multiplier $\beta$ for scaling the model size:

| Model | β | Params | Mult-Adds | Market1501 R1 |
| --- | --- | --- | --- | --- |
| osnet_x1_0 | 1.0 | 2.2M | 978.9M | 94.8% |
| osnet_x0_75 | 0.75 | 1.3M | 571.8M | 94.5% |
| osnet_x0_5 | 0.5 | 0.6M | 272.9M | 93.4% |
| osnet_x0_25 | 0.25 | 0.2M | 82.3M | 92.2% |

🚀 The 0.2M-parameter osnet_x0_25 still reaches 92.2% Rank-1! That beats many ResNet50-based models with 24M+ parameters, proof that OSNet's design is genuinely efficient.

How It Differs from Inception / ResNeXt / SENet

OSNet's multi-stream design superficially resembles Inception and ResNeXt, but the differences are fundamental:

| Property | Inception | ResNeXt | SENet | OSNet |
| --- | --- | --- | --- | --- |
| Purpose of multi-stream | reduce computation | increase width | re-calibrate channels | capture different scales |
| Stream scales | hand-designed mixed ops | all streams same scale | only 1 stream | each stream a different scale |
| Fusion | concatenation | addition | channel re-scaling | dynamic channel-wise AG |
| Dynamic fusion? | ❌ fixed | ❌ fixed | ✅ dynamic (but not multi-scale) | ✅ dynamic + multi-scale |
| Heterogeneous scale? | ❌ | ❌ | ❌ | ✅ |

🎯 In one sentence: Inception pursues efficiency, ResNeXt pursues width, SENet pursues channel attention, and OSNet pursues omni-scale feature learning. Four completely different goals.

Ablation Study: How Much Does Each Design Decision Matter?

The authors ran a thorough ablation; every conclusion is backed by experiments:

1. Effect of Multi-Scale Streams

| Streams T | Market1501 R1 | mAP |
| --- | --- | --- |
| T=1 (single scale) | 86.5% | 67.7% |
| T=2 + AG | 91.7% | 77.0% |
| T=3 + AG | 92.8% | 79.9% |
| T=4 + unified AG | 93.6% | 81.0% |

Going from T=1 to T=4 lifts Rank-1 by 7.1% and mAP by 13.3%. Every extra stream brings a clear gain.

2. Comparing Fusion Strategies

| Fusion method | R1 | mAP | Notes |
| --- | --- | --- | --- |
| Concatenation | 91.4% | 77.4% | fixed, coarse |
| Addition | 92.0% | 78.2% | fixed, equal weights |
| Separate AGs | 92.9% | 80.2% | dynamic, but no cross-stream gradient |
| Unified AG (stream-wise scalar) | 92.6% | 80.0% | dynamic, but too coarse |
| Learned-and-fixed gates | 91.6% | 77.5% | learned in training, frozen at test time |
| Unified AG (channel-wise, dynamic) | 93.6% | 81.0% | all of the above ✅ |

💡 Three key takeaways:

  1. Dynamic > fixed: learned-and-fixed gates trail the unified AG by 2.0% R1 → adaptive fusion matters a lot

  2. Channel-wise > stream-wise: a stream-wise scalar costs 1.0% R1 → fine-grained fusion matters

  3. Unified > separate: the unified AG beats separate AGs by 0.7% R1 → the shared-gradient advantage

3. Lite 3×3 vs Standard Convolution

Standard convolution only gains 0.4% R1 while tripling the model size. Lite 3×3 is almost lossless!

Experimental Results

Large Datasets (Same-Domain)

| Method | Backbone | Params | Market1501 R1/mAP | CUHK03 R1/mAP | Duke R1/mAP | MSMT17 R1/mAP |
| --- | --- | --- | --- | --- | --- | --- |
| PCB | ResNet50 | ~24M | 93.8 / 81.6 | 63.7 / 57.5 | 83.3 / 69.2 | 68.2 / 40.4 |
| DGNet | ResNet50 | ~24M | 94.8 / 86.0 | - / - | 86.6 / 74.8 | 77.2 / 52.3 |
| IANet | ResNet50 | ~24M | 94.4 / 83.1 | - / - | 87.1 / 73.4 | 75.5 / 46.8 |
| MobileNetV2 | MobileNetV2 | 2.2M | 87.0 / 69.5 | 46.5 / 46.0 | 75.2 / 55.8 | 50.9 / 27.0 |
| ShuffleNet | ShuffleNet | ~2M | 84.8 / 65.0 | 38.4 / 37.2 | 71.6 / 49.9 | 41.5 / 19.9 |
| OSNet | OSNet | 2.2M | 94.8 / 84.9 | 72.3 / 67.8 | 88.6 / 73.5 | 78.7 / 52.9 |

🚀 Key observations:

  • With 2.2M params, OSNet beats every 24M+ param ResNet50-based model

  • It outperforms the comparably sized MobileNetV2 (also 2.2M params) by 7.8% R1 on Market1501

  • On MSMT17, the largest and hardest dataset, OSNet's 78.7% R1 leads all methods

  • Its 72.3% R1 on CUHK03 beats the runner-up CAMA (66.6%) by 5.7%

Small Datasets (VIPeR & GRID)

| Method | VIPeR R1 | GRID R1 |
| --- | --- | --- |
| HydraPlus-Net | 56.6% | - |
| GLAD | 54.8% | - |
| JLML | 50.2% | 37.5% |
| OSNet | 68.0% | 38.2% |

OSNet leads the runner-up on VIPeR by 11.4%! This dataset has only a few hundred training images, showing that OSNet's lightweight design effectively prevents overfitting.

Cross-Domain Generalization: OSNet-AIN

The ICCV 2019 version focuses on same-domain ReID. The 2021 TPAMI extension tackles the cross-domain problem: train on Dataset A, deploy directly on Dataset B with no adaptation at all.

Why Cross-Domain?

In the real world you cannot collect labelled data for every new scene. You need a model trained in mall A to keep working when moved to airport B.

The OSNet-AIN Approach: Instance Normalization

OSNet-AIN (Adaptive Instance Normalization) inserts Instance Normalization (IN) into the network's early layers, teaching the model to ignore domain-specific style (e.g., lighting, color distribution) and focus on domain-invariant structural features.

| Model | MSMT17→Market R1/mAP | MSMT17→Duke R1/mAP |
| --- | --- | --- |
| ResNet50 | 46.3 / 22.8 | 52.3 / 32.1 |
| OSNet x1.0 | 66.6 / 37.5 | 66.0 / 45.3 |
| OSNet-IBN x1.0 | 66.5 / 37.2 | 67.4 / 45.6 |
| OSNet-AIN x1.0 | 70.1 / 43.3 | 71.1 / 52.7 |

OSNet-AIN beats ResNet50's cross-domain performance by 23.8% R1 (MSMT17→Market)!

A Closer Look

Why Does the Unified AG Get Better Gradients?

This is an elegant mathematical property. From Eq. 3:

$$\frac{\partial \mathcal{L}}{\partial G} = \frac{\partial \mathcal{L}}{\partial \tilde{x}} \cdot \left( \sum_{t=1}^{T} x^t \right)$$

Because the AG is shared, the gradient contains the sum of all streams' features. In other words, the AG gets to "see" information at every scale when updating itself.

With separate AGs, each gate only "sees" the gradient from its own stream; there is no cross-scale supervision signal.
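A sketch of where this comes from (using the paper's notation and treating the shared gate output $G$ as the differentiated quantity, which glosses over $G$'s own dependence on each $x^t$):

```latex
\tilde{x} = \sum_{t=1}^{T} G \odot x^t
\;\Longrightarrow\;
\frac{\partial \mathcal{L}}{\partial G}
  = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial \tilde{x}} \odot x^t
  = \frac{\partial \mathcal{L}}{\partial \tilde{x}} \cdot \left( \sum_{t=1}^{T} x^t \right)
```

With separate gates $G_t$, each gradient is just $\partial \mathcal{L} / \partial G_t = \partial \mathcal{L} / \partial \tilde{x} \odot x^t$, touching only its own stream.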

💡 Analogy: the unified AG is one judge who watches all four contestants before scoring; separate AGs are four judges each watching only one contestant. The former's scores are naturally more holistic and consistent.

What Did the AG Learn? Visualisation

The authors ran k-means clustering on the AG's gating vectors over test images, and found that images in the same cluster share similar visual patterns:

  • Cluster A: all back views with backpacks → the AG emphasizes the global scale
  • Cluster B: all T-shirts with prominent logos/prints → the AG emphasizes a local + medium mix
  • Cluster C: all people dressed entirely in black → the AG emphasizes local details (the global view carries little discriminative information)

This shows the AG really has learned to adapt its scale-fusion strategy to each person's appearance.

Attention Map Comparison

Comparing the activation maps of OSNet against a single-scale baseline:

  • Single-scale: over-concentrates on the face region (but surveillance face resolution is too low to be reliable)
  • OSNet: detects local discriminative patterns such as shirt logos and shoe styles, while maintaining whole-body context

Hands-On Guide: Using Torchreid + OSNet

Option 1: Torchreid (the Official Framework, Most Complete)

```python
# Installation
# git clone https://github.com/KaiyangZhou/deep-person-reid.git
# cd deep-person-reid && python setup.py develop

import torchreid

# ===== Step 1: Load data =====
datamanager = torchreid.data.ImageDataManager(
    root="reid-data",
    sources="market1501",        # training dataset
    targets="market1501",        # test dataset
    height=256,
    width=128,
    batch_size_train=32,
    batch_size_test=100,
    transforms=["random_flip", "random_erase"]
)

# ===== Step 2: Build the OSNet model =====
model = torchreid.models.build_model(
    name="osnet_x1_0",                       # standard OSNet
    num_classes=datamanager.num_train_pids,  # number of identities
    loss="softmax",                          # classification loss
    pretrained=True                          # ImageNet-pretrained weights
)
model = model.cuda()

# ===== Step 3: Optimizer + scheduler =====
optimizer = torchreid.optim.build_optimizer(
    model, optim="amsgrad", lr=0.0015
)
scheduler = torchreid.optim.build_lr_scheduler(
    optimizer, lr_scheduler="cosine"
)

# ===== Step 4: Training engine =====
engine = torchreid.engine.ImageSoftmaxEngine(
    datamanager, model,
    optimizer=optimizer,
    scheduler=scheduler,
    label_smooth=True
)

# ===== Step 5: Train + evaluate =====
engine.run(
    save_dir="log/osnet_x1_0_market1501",
    max_epoch=150,
    eval_freq=10,
    print_freq=10,
    test_only=False
)
```

Option 2: The Feature Extraction API (You Already Have a Model and Just Want Features)

```python
import torch
from torch.nn.functional import cosine_similarity

import torchreid
from torchreid.utils import FeatureExtractor

# Load pretrained OSNet
extractor = FeatureExtractor(
    model_name="osnet_x1_0",
    model_path="path/to/osnet_x1_0_market.pth.tar",
    device="cuda"
)

# Extract features (512-dim vectors)
image_list = ["person_001.jpg", "person_002.jpg", "person_003.jpg"]
features = extractor(image_list)  # shape: (3, 512)

# Compute similarity
sim = cosine_similarity(
    features[0].unsqueeze(0), features[1].unsqueeze(0)
)
print(f"Similarity: {sim.item():.4f}")
```

Option 3: Cross-Domain (with OSNet-AIN)

```bash
# Train on DukeMTMC, test directly on Market1501 (zero adaptation)
python scripts/main.py \
    --config-file configs/im_osnet_ain_x1_0_softmax_256x128_amsgrad_cosine.yaml \
    -s dukemtmcreid \
    -t market1501 \
    --transforms random_flip color_jitter \
    --root $PATH_TO_DATA
```

💡 Cross-domain tips:

  • Use color_jitter instead of random_erase (improves generalization)

  • Use cosine distance instead of euclidean (AIN works best with cosine)

  • Train on multiple source datasets (-s msmt17 dukemtmcreid cuhk03) for even better results

Option 4: Export to ONNX / OpenVINO / TFLite (Edge Deployment)

Torchreid ships with a built-in export script:

```bash
# Export to ONNX
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format onnx \
    --input-size 256 128

# Export to OpenVINO (for Intel hardware)
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format openvino

# Export to TFLite (for mobile)
python tools/export.py \
    --model-name osnet_x0_25 \
    --model-path path/to/model.pth.tar \
    --export-format tflite
```

🚀 Edge deployment recommendation: use osnet_x0_25 (0.2M params, 82M mult-adds), which can run on embedded devices like the STM32. STMicroelectronics has added OSNet to their AI Model Zoo.

Available Pretrained Models

| Model | Params | Use case | Download |
| --- | --- | --- | --- |
| osnet_x1_0 | 2.2M | same-domain ReID (best accuracy) | HuggingFace / Torchreid Model Zoo |
| osnet_x0_25 | 0.2M | edge / mobile deployment | Torchreid Model Zoo |
| osnet_ibn_x1_0 | 2.2M | moderate cross-domain | Torchreid Model Zoo |
| osnet_ain_x1_0 | 2.2M | best cross-domain | Torchreid Model Zoo |

Reading OSNet's Core Code

The best way to understand OSNet is to read the source. Here is a simplified version of the core building block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightConv3x3(nn.Module):
    """Lite 3x3: pointwise -> depthwise (OSNet's distinctive order)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)   # 1x1 pointwise
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1,   # 3x3 depthwise
                               bias=False, groups=out_ch)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv2(self.conv1(x))))

class ChannelGate(nn.Module):
    """Unified Aggregation Gate: GAP -> FC -> ReLU -> FC -> Sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1, bias=True)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1, bias=True)

    def forward(self, x):
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(self.gap(x)))))
        return x * w  # channel-wise reweighting

class OSBlock(nn.Module):
    """Omni-Scale Feature Learning Block."""
    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        mid = out_ch // reduction
        # Dimension reduction
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU()
        )
        # 4 streams with increasing receptive fields
        self.stream1 = LightConv3x3(mid, mid)        # RF = 3x3
        self.stream2 = nn.Sequential(                # RF = 5x5
            LightConv3x3(mid, mid), LightConv3x3(mid, mid))
        self.stream3 = nn.Sequential(                # RF = 7x7
            LightConv3x3(mid, mid), LightConv3x3(mid, mid),
            LightConv3x3(mid, mid))
        self.stream4 = nn.Sequential(                # RF = 9x9
            LightConv3x3(mid, mid), LightConv3x3(mid, mid),
            LightConv3x3(mid, mid), LightConv3x3(mid, mid))
        # Shared Aggregation Gate (the key component!)
        self.gate = ChannelGate(mid)
        # Dimension restoration
        self.conv3 = nn.Sequential(
            nn.Conv2d(mid, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch)
        )
        # Residual connection
        self.downsample = None
        if in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        identity = x
        x1 = self.conv1(x)
        # Each stream extracts features at a different scale
        s1 = self.stream1(x1)  # 3x3 scale
        s2 = self.stream2(x1)  # 5x5 scale
        s3 = self.stream3(x1)  # 7x7 scale
        s4 = self.stream4(x1)  # 9x9 scale
        # The same AG processes all 4 streams -> channel-wise fusion
        fused = self.gate(s1) + self.gate(s2) + self.gate(s3) + self.gate(s4)
        out = self.conv3(fused)
        if self.downsample is not None:
            identity = self.downsample(identity)
        return F.relu(out + identity)
```

🔑 Note that in forward, self.gate is called 4 times: the same AG processes all 4 streams. That is what "unified" means. On each forward pass the AG generates different channel-wise weights according to each stream's input, achieving dynamic, input-dependent scale fusion.

Limitations and Future Directions

Current Limitations

1. No handling of temporal information

OSNet is designed for image-based ReID. For video-based ReID it has no temporal aggregation mechanism (though others have extended OSNet to video with decent results).

2. Fixed T=4

The stream count T=4 is hard-coded. The ablation shows T=4 is good enough, but different datasets may have a different optimal T. The follow-up OSNet-AIN uses NAS (Neural Architecture Search) to find the optimal block structure automatically.

3. No occlusion handling

Partially occluded pedestrians are a common real-world problem, and OSNet has no dedicated occlusion mechanism (such as part-visibility estimation).

Follow-Up Developments

  • OSNet-AIN (TPAMI 2021): adds Instance Normalization to tackle cross-domain ReID
  • MixStyle (ICLR 2021): a domain generalization method by the same author, combinable with OSNet
  • STM32 AI Model Zoo: OSNet has been officially adopted by STMicroelectronics for their embedded AI platform
  • 4793 GitHub stars: an active, continuously maintained community

Takeaways

1. Lightweight ≠ weak

OSNet beats a 24M-parameter ResNet50 with 2.2M parameters, proving that task-specific design beats blindly scaling up. It is not the bigger the better; it is the better the fit, the better the result.

2. Dynamic > static

Every ablation points to the same conclusion: dynamic, input-dependent fusion beats static fusion. The same insight was later validated at massive scale by attention mechanisms (Transformers).

3. The magic of shared parameters

The unified AG's shared-parameter design does more than save memory: it mathematically guarantees better gradient flow (Eq. 4). This less-is-more design philosophy is worth remembering.

4. Designing for edge AI matters

OSNet considered edge deployment from the start. That design principle means it is not just good numbers on paper; it can genuinely be deployed in real-world surveillance systems. The 0.2M osnet_x0_25 runs on an STM32, which most ReID models cannot do.

Summary

OSNet is a lightweight CNN that takes multi-scale feature learning to its limit.

Core Contributions

  1. Omni-scale feature learning: the first to learn homogeneous + heterogeneous scale features together
  2. Unified Aggregation Gate: a shared, dynamic, channel-wise multi-scale fusion mechanism
  3. Lite 3×3: factorized convolution giving 3x model compression with almost no loss
  4. SOTA on 6 datasets: beats all ResNet50-based models with 10x fewer parameters

Practical Value

  • 🎯 High accuracy: Market1501 Rank-1 = 94.8%, MSMT17 Rank-1 = 78.7%
  • 📱 Edge-ready: osnet_x0_25 has only 0.2M params and runs on embedded devices
  • 🌍 Cross-domain: OSNet-AIN generalizes to new scenes with zero adaptation
  • 🛠️ Production-ready: Torchreid provides a complete training / evaluation / export pipeline
  • 📦 Active community: 4.8K stars, 20 contributors, continuously updated

Why Should You Care?

If you work on:

  • surveillance / security → OSNet is the go-to ReID baseline
  • edge AI → a 0.2M-param model can run on the camera itself
  • multi-scale feature learning → the AG's design insight applies to other tasks
  • CNN architecture design → OSNet's ablation study is textbook-grade design validation

Resources

  • 📄 Original paper (ICCV 2019): arXiv:1905.00953
  • 📄 Extended paper (TPAMI 2021): Learning Generalisable Omni-Scale Representations
  • 💻 Code + pretrained models: github.com/KaiyangZhou/deep-person-reid
  • 🤗 HuggingFace weights: kaiyangzhou/osnet
  • 📊 Model Zoo: Torchreid Model Zoo
  • 📚 Tech report: arXiv:1910.10093
  • 🏭 Edge deployment: STM32 AI Model Zoo - OSNet

OSNet's message: the key to ReID is not how big the model is, but how "omni" the features are. A 2.2M model can learn better features than a 24M one, as long as the architecture is designed right, letting every channel freely pick the scale combination that fits it best. That is the power of omni-scale. 🔍✨
