Billy Tse
April 5, 2026 · 35 min read

The Evolution of Person ReID: From TransReID to SOLIDER to DINOv2, and How Transformers Took Over Person Re-Identification

A deep dive into the three milestones that carried Person ReID from the CNN era into the Transformer era: TransReID (the first pure-Transformer ReID framework), SOLIDER (semantically controllable self-supervised pre-training), and DINOv2 (a general-purpose vision foundation model). Covers the core designs, including JPM, SIE, and the semantic controller, with a complete implementation guide.

Computer Vision · AI · Transformer · Image Processing

This post covers three core papers:

  • TransReID — arXiv:2102.04378 (ICCV 2021) | GitHub
  • SOLIDER — arXiv:2303.17602 (CVPR 2023) | GitHub
  • DINOv2 — arXiv:2304.07193 (Meta AI, 2023) | GitHub

TL;DR

In the previous post we saw how OSNet, a 2.2M-parameter CNN, beat ResNet50 at Person ReID. But from 2021 onward, Transformers completely rewrote the rules of the game.

This post walks through three Transformer-era milestones in ReID:

Key points:

  • 🏗️ TransReID (ICCV 2021): the first pure-Transformer ReID framework, introducing two ReID-specific modules, JPM and SIE
  • 🧠 SOLIDER (CVPR 2023): semantically controllable self-supervised pre-training; one model handles 6 human-centric tasks
  • 🦕 DINOv2 (Meta AI 2023): a general-purpose vision foundation model whose frozen features are enough for ReID
  • 📊 The trend: from hand-designed backbones → ReID-specific Transformers → general self-supervised pre-training → foundation models


Background: The Three Eras of ReID

Before diving into each model, here is the big picture of how Person ReID has evolved:

| Era | Representative | Core Idea | Pre-training Data | MSMT17 mAP |
| --- | --- | --- | --- | --- |
| CNN | OSNet (ICCV 2019) | Task-specific CNN design | ImageNet-1K | 52.9 |
| Transformer | TransReID (ICCV 2021) | Pure ViT + ReID modules | ImageNet-21K | 67.4 |
| Pre-training | SOLIDER (CVPR 2023) | Human-centric SSL | LUPerson (4M) | 77.1 |
| Foundation | DINOv2 (Meta 2023) | Universal visual features | LVD-142M | ~70+ (frozen) |

🎯 The core trend: the field has moved from designing ReID-specific architectures toward using better pre-training to supply more universal features. Models keep getting more general, yet performance keeps improving; that is the magic of the foundation-model era.

Part 1: TransReID — The First Pure Transformer for ReID

Why Use a Transformer for ReID?

CNNs have a fundamental limitation: the local receptive field. Even though OSNet widens its receptive field with multi-scale streams, each convolution layer can still only "see" nearby pixels.

Self-attention in Transformers solves this by construction: every token can directly attend to any position in the image.

| Property | CNN (OSNet) | Transformer (TransReID) |
| --- | --- | --- |
| Receptive field | Local, widened layer by layer | Global from layer 1 |
| Long-range dependencies | Require deep stacking | Present in every layer |
| Position encoding | Implicit (from conv locality) | Explicit (side info can be added) |
| Part-level features | Need an extra branch | Patch tokens map naturally to body parts |

💡 TransReID's insight: ViT cuts the image into patches, and each patch naturally corresponds to a different part of the pedestrian (head, torso, legs). That lines up perfectly with ReID's need for part-level features!

The TransReID Architecture

TransReID's baseline is a standard ViT-B/16, plus two ReID-specific modules:


Innovation 1: Side Information Embedding (SIE)

This is TransReID's most practical innovation. ReID has a problem the CNN era never properly solved: camera bias.

The problem: the same person can look very different under Camera A (indoor, warm light) and Camera B (outdoor, cool light). A model easily learns shortcuts like "people from Camera A look yellowish, people from Camera B look bluish."

SIE's solution: encode the camera ID and viewpoint as learnable embeddings and add them directly to the patch embeddings:

$$z_0 = [x_{\text{cls}}; \, x_1^E; \, x_2^E; \, \ldots; \, x_N^E] + E_{\text{pos}} + \lambda_1 \cdot S_{\text{cam}} + \lambda_2 \cdot S_{\text{view}}$$

where:

  • $S_{\text{cam}}$ = camera ID embedding (each camera gets its own learnable vector)
  • $S_{\text{view}}$ = viewpoint embedding (front / back / side)
  • $\lambda_1, \lambda_2$ = scaling hyperparameters that balance the side-information terms

🎯 Why does SIE work?

  • It tells the model which camera shot the image, so the model can learn to compensate for camera-specific bias

  • It is analogous to segment embeddings in NLP (what BERT uses to distinguish sentence A from sentence B)

  • It requires no extra annotation: camera IDs already come with standard ReID datasets
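The SIE addition above can be sketched in a few lines of PyTorch. This is a minimal illustration, not TransReID's actual code; the class name, embedding counts, and fixed λ values are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SideInfoEmbedding(nn.Module):
    """Minimal SIE sketch: add learnable camera/viewpoint embeddings
    to the token sequence, scaled by lambda coefficients."""

    def __init__(self, num_cameras=6, num_views=3, dim=768,
                 lambda_cam=1.0, lambda_view=1.0):
        super().__init__()
        self.cam_embed = nn.Embedding(num_cameras, dim)   # S_cam
        self.view_embed = nn.Embedding(num_views, dim)    # S_view
        self.lambda_cam = lambda_cam
        self.lambda_view = lambda_view

    def forward(self, tokens, cam_id, view_id):
        # tokens: (B, N+1, dim) = [CLS; patch embeddings] + positional embedding
        side = (self.lambda_cam * self.cam_embed(cam_id)
                + self.lambda_view * self.view_embed(view_id))
        # broadcast the same side-info vector onto every token of the image
        return tokens + side.unsqueeze(1)

sie = SideInfoEmbedding()
tokens = torch.randn(2, 129, 768)     # batch of 2 images, 129 tokens each
cam_id = torch.tensor([0, 3])         # which camera shot each image
view_id = torch.tensor([1, 2])        # front / back / side, if available
out = sie(tokens, cam_id, view_id)
print(out.shape)                      # torch.Size([2, 129, 768])
```

Because the side-info vector is shared across all tokens of an image, the model can subtract camera-specific bias in a single learned step.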

Innovation 2: Jigsaw Patch Module (JPM)

JPM tackles a different problem: ViT's patch tokens are too regular. Each patch is tied to a fixed image location, so the model easily overfits to specific spatial patterns.

What JPM does:

  1. Shift: circularly shift the patch embeddings by k positions
  2. Patch shuffle: split the shifted patches randomly into K groups
  3. Classify each group independently: every group must identify who the person is

💡 Why does shuffling help? Three benefits:

  1. Robustness: forcing the model to recognize a person from scrambled local features makes it more robust against occlusion

  2. Diversity: each group contains different body parts, so each branch learns more diverse features

  3. Regularization: it stops the model from overfitting to the pattern "the head is always at the top, the feet always at the bottom"

Analogy: imagine having to recognize a friend while only ever being shown a few random parts of their body at a time. You are forced to learn the distinctive features of every part.

JPM by the Numbers

Say ViT-B/16 cuts a 256×128 image into 16×8 = 128 patches; with 1 CLS token that makes 129 tokens.

Step 1: Shift

With shift = 5:

  • Original token order: [CLS, P1, P2, ..., P128]
  • After shifting: [CLS, P6, P7, ..., P128, P1, P2, P3, P4, P5]

Step 2: Split into K = 4 groups

  • Group 1: P6, P10, P14, ... (every 4th patch)
  • Group 2: P7, P11, P15, ...
  • Group 3: P8, P12, P16, ...
  • Group 4: P9, P13, P17, ...

Each group contains patches from all over the image, giving heterogeneous spatial coverage.

Step 3: Classify each group separately

  • Each group has its own learnable token (CLS-like) that produces one feature vector
  • Together with the global CLS token, that gives K+1 feature vectors
  • Each one must identify the person correctly, yielding multiple complementary features
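The shift-and-regroup steps above are easy to verify with plain Python, using patch names instead of tensors. Note that TransReID's actual shuffle is random; the deterministic interleave here simply mirrors the worked example.

```python
# Reproduce the JPM worked example: 128 patch tokens, shift=5, K=4 groups.
patches = [f"P{i}" for i in range(1, 129)]   # P1 … P128 (CLS handled separately)

# Step 1: circular shift by 5, so P6 moves to the front
shift = 5
shifted = patches[shift:] + patches[:shift]

# Step 2: deal the shifted patches into K=4 interleaved groups
K = 4
groups = [shifted[g::K] for g in range(K)]

print(shifted[:4])     # ['P6', 'P7', 'P8', 'P9']
print(groups[0][:3])   # ['P6', 'P10', 'P14']
print(groups[3][:3])   # ['P9', 'P13', 'P17']
print(len(groups[0]))  # 32 patches per group
```

Each of the 4 groups ends up with 32 patches drawn from across the whole image, which is exactly the heterogeneous coverage JPM is after.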

A Subtle but Important TransReID Trick: Overlapping Patches

A standard ViT uses stride = patch_size (i.e. 16), but TransReID found that stride = 12 (overlapping patches) noticeably improves performance:

| Setting | Stride | Market1501 R1/mAP | MSMT17 R1/mAP |
| --- | --- | --- | --- |
| ViT-B/16 | 16 (no overlap) | 94.6 / 87.1 | 81.8 / 61.0 |
| ViT-B/16, s=12 | 12 (overlap) | 95.0 / 88.2 | 83.4 / 64.5 |

🔑 Why does overlap help? Because the boundaries of non-overlapping patches cut straight through important features (a logo sliced in half, for example). Overlapping makes each patch share pixels with its neighbors, preserving more boundary information.
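The token count follows the usual sliding-window formula, floor((size − patch) / stride) + 1 per dimension, the same arithmetic a Conv2d patch embedding performs. A quick check for a 256×128 input:

```python
def num_patches(h, w, patch=16, stride=16):
    """Patch-grid size for a sliding-window (Conv2d-style) patch embedding."""
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    return rows, cols, rows * cols

print(num_patches(256, 128, stride=16))  # (16, 8, 128)  — standard ViT-B/16
print(num_patches(256, 128, stride=12))  # (21, 10, 210) — overlapping, s=12
```

So overlapping patches roughly 1.6× the token count (128 → 210), which is where the accuracy gain comes from, at the cost of extra attention compute.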

Full TransReID Results

| Method | Backbone | Market1501 R1/mAP | MSMT17 R1/mAP |
| --- | --- | --- | --- |
| OSNet (ICCV 2019) | OSNet | 94.8 / 84.9 | 78.7 / 52.9 |
| BoT (CVPRW 2019) | ResNet50 | 94.5 / 85.9 | – / – |
| ABDNet (ICCV 2019) | ResNet50 | 95.6 / 88.3 | 82.3 / 60.8 |
| TransReID Baseline | ViT-B/16 | 94.6 / 87.1 | 81.8 / 61.0 |
| TransReID | ViT-B/16, s=12 | 95.2 / 89.5 | 85.3 / 67.4 |

🚀 Key observations:

  • On MSMT17 (the largest, hardest dataset), TransReID reaches mAP = 67.4, a full 14.5 points above OSNet

  • The ImageNet-21K pre-trained ViT-B/16 backbone is already very strong on its own

  • JPM + SIE add a further 3-6 mAP points on top of the baseline

  • This was the first proof that a pure Transformer can dominate ReID

Part 2: SOLIDER — Semantically Controllable Self-Supervised Pre-training

From Architecture Design to Pre-training

TransReID proved the potential of Transformers for ReID, but it still relied on ImageNet pre-training, a dataset designed for object classification.

Here is the catch: ImageNet contains very few images of people. The domain gap between the pre-training data and the downstream task is large.

SOLIDER's core question: what if we pre-trained self-supervised on a huge pile of unlabeled person images instead?

The answer: it works dramatically better.

SOLIDER's Core Design

SOLIDER = Semantic cOntrollable seLf-supervIseD lEaRning

It has three core innovations:


Innovation 1: Pseudo Semantic Labels from Human Prior Knowledge

Generic self-supervised learning (DINO, MAE, etc.) learns pure appearance features: which pixels resemble which other pixels. But person images have a very clear semantic structure: head, upper body, lower body, shoes.

SOLIDER exploits that prior knowledge:

  1. Run an off-the-shelf human parsing model (e.g. SCHP) over the training images to produce pseudo semantic labels
  2. Partition pixels into semantic groups (head, upper body, lower body, etc.)
  3. During contrastive learning, features within the same semantic group should be more similar

💡 Why not use ground-truth labels? Because LUPerson has 4 million unlabeled images; nobody can annotate that many by hand. Pseudo labels are the only feasible option, and the parsing model's pseudo labels are already accurate enough.
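The grouping step can be sketched with NumPy: given a per-pixel parsing map, pool pixel features into one prototype per semantic part. Everything here (the array shapes, the part IDs, the tiny 4×4 map) is illustrative, not SOLIDER's actual implementation.

```python
import numpy as np

# Toy setup: a 4×4 feature map with 8-dim features, plus a pseudo parsing map
# from an off-the-shelf parser (0 = background, 1 = head, 2 = upper, 3 = lower).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))
parsing = np.array([[1, 1, 0, 0],
                    [2, 2, 2, 0],
                    [2, 2, 2, 0],
                    [3, 3, 3, 3]])

# One prototype per foreground part: the mean feature over its pixels.
prototypes = {part: features[parsing == part].mean(axis=0)
              for part in [1, 2, 3]}

# During pre-training, pixels of the same part are pulled toward their
# prototype (a contrastive target); background pixels are ignored.
for part, proto in prototypes.items():
    print(part, proto.shape)  # each prototype is an 8-dim vector
```

The pseudo labels thus turn a pure appearance objective into one with part-level semantic structure, without any manual annotation.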

Innovation 2: The Semantic Controller — One Model for Every Task

This is SOLIDER's most unique design. Different downstream tasks need different kinds of features:

| Task | Features Needed | Semantic vs Appearance |
| --- | --- | --- |
| Person ReID | Clothing color, texture, logos | Appearance-leaning (λ ≈ 0.2) |
| Human Parsing | Body-part boundaries | Semantic-leaning (λ ≈ 0.8) |
| Pedestrian Detection | Overall human shape | Balanced (λ ≈ 0.5) |
| Attribute Recognition | Clothing type, color names | Semantic-leaning (λ ≈ 0.6) |

SOLIDER introduces a semantic controller, a conditional network that takes $\lambda \in [0, 1]$ as input:

  • $\lambda = 0$: output pure appearance features (suited to ReID)
  • $\lambda = 1$: output pure semantic features (suited to parsing)
  • $\lambda = 0.5$: balanced features (suited to detection)

🎯 Why is this design so elegant?

  1. During pre-training: conditioning on $\lambda$ teaches the model to encode appearance and semantic information simultaneously

  2. During fine-tuning: users just set a $\lambda$ value to adjust the features; no retraining needed

  3. One model covers every human-centric task, which is enormously practical
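One way to picture the controller is as a λ-conditioned blend of two projections of the backbone feature. This is a deliberately simplified sketch (SOLIDER's real controller conditions the network internals on λ rather than interpolating two heads), and all the module names here are made up for illustration.

```python
import torch
import torch.nn as nn

class ToySemanticController(nn.Module):
    """Illustrative only: interpolate between an 'appearance' projection and
    a 'semantic' projection according to lambda. Not SOLIDER's real mechanism."""

    def __init__(self, dim=256):
        super().__init__()
        self.appearance_head = nn.Linear(dim, dim)
        self.semantic_head = nn.Linear(dim, dim)

    def forward(self, feat, lam):
        # lam = 0 → pure appearance, lam = 1 → pure semantic
        return (1 - lam) * self.appearance_head(feat) + lam * self.semantic_head(feat)

ctrl = ToySemanticController()
feat = torch.randn(4, 256)                # a batch of backbone features
for lam in (0.0, 0.2, 0.5, 1.0):          # 0.2 ≈ the ReID setting
    out = ctrl(feat, lam)
    print(lam, out.shape)                 # shape is unchanged; content shifts
```

The key property this toy model shares with the real controller: changing λ at inference time changes the feature content without touching any weights.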

Innovation 3: A Swin Transformer Backbone

SOLIDER picked Swin Transformer rather than ViT as its backbone, because:

| Property | ViT | Swin Transformer |
| --- | --- | --- |
| Attention scope | Global (all tokens) | Shifted windows (local → global) |
| Resolution | Fixed patch size | Hierarchical (multi-scale feature maps) |
| Dense prediction | Needs a decoder | Native support (FPN-friendly) |
| Downstream flexibility | Mostly classification | Detection and segmentation also work |

Since SOLIDER must support 6 downstream tasks (including detection and parsing), Swin's hierarchical design is essential.

SOLIDER Results

| Method | Pre-training | Backbone | Market1501 mAP/R1 | MSMT17 mAP/R1 |
| --- | --- | --- | --- | --- |
| TransReID | ImageNet-21K | ViT-B/16 | 89.5 / 95.2 | 67.4 / 85.3 |
| TransReID-SSL | LUPerson (SSL) | ViT-B/16 | 90.0 / 95.6 | 68.7 / 86.1 |
| PASS (ECCV 2022) | LUPerson (part-aware SSL) | ViT-B/16 | 90.3 / 95.8 | 70.0 / 86.8 |
| SOLIDER (Swin-T) | LUPerson (semantic SSL) | Swin-Tiny | 91.6 / 96.1 | 67.4 / 85.9 |
| SOLIDER (Swin-S) | LUPerson (semantic SSL) | Swin-Small | 93.3 / 96.6 | 76.9 / 90.8 |
| SOLIDER (Swin-B) | LUPerson (semantic SSL) | Swin-Base | 93.9 / 96.9 | 77.1 / 90.7 |

🚀 Key observations:

  • SOLIDER Swin-B hits MSMT17 mAP = 77.1, a full 9.7 points above TransReID!

  • Even the smallest variant, Swin-Tiny, already reaches 91.6 mAP on Market1501

  • With re-ranking, MSMT17 mAP climbs to 86.5

  • The same pre-trained model also handles detection, parsing, pose estimation, attribute recognition, and person search

SOLIDER Across Tasks

| Task | Dataset | Metric | ImageNet Pre-train | SOLIDER Swin-B |
| --- | --- | --- | --- | --- |
| Person ReID | Market1501 | mAP | ~88 | 93.9 |
| Person ReID | MSMT17 | mAP | ~62 | 77.1 |
| Pedestrian Detection | CityPersons | MR⁻² (lower is better) | ~11 | 9.7 |
| Human Parsing | LIP | mIoU | ~56 | 60.5 |
| Attribute Recognition | PA100K | mA | ~82 | 86.4 |
| Pose Estimation | COCO | AP | ~74 | 76.6 |

🎯 One model, state of the art on six tasks: that is the power of human-centric pre-training.

Part 3: DINOv2 — Universal Features from a Foundation Model

From Human-Specific to Universal

SOLIDER pre-trains on human images, so it excels at human-centric tasks. But what about a model that performs well on any vision task?

That model is DINOv2.

What Is DINOv2?

DINOv2 is a self-supervised vision foundation model released by Meta AI in 2023, trained on 142M diverse images (the LVD-142M dataset).

Key traits:

  • 🦕 Self-supervised: needs no labels at all; trained via self-distillation (teacher-student)
  • 🌍 Universal features: one frozen backbone handles classification, segmentation, depth estimation, retrieval...
  • 📐 ViT backbones: ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14
  • ❄️ Frozen features work: no fine-tuning required; a linear probe or kNN already performs well

How DINOv2 Is Trained


DINOv2 combines several self-supervised objectives:

  1. DINO loss (self-distillation): the student's output on local views must match the teacher's output on global views
  2. iBOT loss (masked image modeling): mask some patches and ask the model to predict the masked features
  3. KoLeo regularizer: keeps the feature space spread out uniformly

💡 Why are DINOv2's features so good? Three reasons:

  1. Data scale: 142M images covering nature, cities, interiors, people, and more

  2. Data curation: the training set was carefully filtered via self-supervised retrieval (not random web scraping)

  3. Training recipe: combining multiple SSL objectives gives features both local and global discriminability
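The DINO loss in objective 1 can be sketched as a cross-entropy between sharpened softmax outputs of teacher and student, with the teacher's output centered to prevent collapse. The temperatures and dimensions below follow the general DINO recipe but the exact values are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Self-distillation: the student's distribution should match the
    teacher's centered, sharpened distribution. Teacher gets no gradient."""
    t = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    log_s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

B, D = 8, 64                       # batch of 8, 64-dim projection-head output
student_out = torch.randn(B, D, requires_grad=True)
teacher_out = torch.randn(B, D)    # in practice, the teacher is an EMA of the student
center = teacher_out.mean(dim=0)   # running center, updated with momentum in practice

loss = dino_loss(student_out, teacher_out, center)
loss.backward()                    # gradients flow only into the student
```

Centering subtracts the batch-mean logits from the teacher, and the low teacher temperature sharpens its distribution; together they stop both networks from collapsing to a constant output.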

ReID with DINOv2: The Power of Frozen Features

Recent work has begun exploring DINOv2 as a ReID backbone. The biggest difference from traditional approaches: you do not fine-tune the model at all.

How to use it:

  1. Extract frozen features with DINOv2 ViT-L/14
  2. Add a lightweight head (a linear layer or small MLP)
  3. Train only the head

Advantages:

  • Extremely low training cost (only a few layers are trained)
  • Little ReID-specific training data required
  • Especially strong cross-domain generalization (the features are already universal)

| Method | Backbone | Training | Market1501 R1 | Cross-domain |
| --- | --- | --- | --- | --- |
| OSNet | OSNet (2.2M) | Full fine-tune | 94.8 | ⚠️ Needs AIN |
| TransReID | ViT-B/16 (86M) | Full fine-tune | 95.2 | ⚠️ Needs SIE |
| SOLIDER | Swin-B (88M) | Fine-tune | 96.9 | ✅ Good |
| DINOv2 | ViT-L/14 (300M) | Frozen + linear probe | ~93-95 | ✅✅ Best |

🎯 DINOv2's trade-offs:

  • ✅ Strongest cross-domain generalization: the features are not fine-tuned to any particular dataset

  • ✅ Zero-shot / few-shot capability: almost no target-domain data needed

  • ⚠️ Not the best same-domain accuracy: a fine-tuned SOLIDER still leads on same-domain benchmarks

  • ⚠️ Large model: ViT-L has 300M params, unsuitable for edge deployment

DINOv2's Attention Maps: Why It Is a Natural Fit for ReID

The attention maps DINOv2 learns have a striking property: they automatically segment objects, including the different parts of a human body.

This ability emerges purely from self-supervision, with no segmentation labels whatsoever. For ReID, that means:

  • Automatic part-level attention: no JPM or external pose detector required
  • Semantic understanding: it knows which patches belong to the same body part
  • Background suppression: background clutter is naturally ignored

Deep Comparison: What Fundamentally Separates the Four Methods

Architecture Comparison

| Property | OSNet | TransReID | SOLIDER | DINOv2 |
| --- | --- | --- | --- | --- |
| Architecture | Custom CNN | ViT-B/16 | Swin-T/S/B | ViT-S/B/L/g |
| Pre-training data | ImageNet-1K | ImageNet-21K | LUPerson (4M people) | LVD-142M (general) |
| Pre-training method | Supervised | Supervised | Self-supervised + semantic | Self-supervised |
| ReID-specific design | AG + multi-scale | JPM + SIE | Semantic controller | None (general purpose) |
| Params | 2.2M | ~86M | ~88M | 300M+ |
| Edge-friendly? | ✅✅ | ⚠️ | ⚠️ | ❌ |
| Cross-domain | Needs AIN | SIE helps | Good | Best |

Which One Should You Pick? A Decision Framework


Implementation Guide

Method 1: TransReID (ViT-based ReID)

```python
# ===== Setup =====
# git clone https://github.com/damo-cv/TransReID.git
# pip install yacs timm

# ===== Training =====
# Download the ViT-B/16 pretrained weights (ImageNet-21K):
# wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p16_224-80ecf9dd.pth

import torch
from config import cfg
from model import make_model
from processor import do_train

# Configuration
cfg.merge_from_file("configs/Market/vit_transreid.yml")
cfg.MODEL.PRETRAIN_PATH = "jx_vit_base_p16_224-80ecf9dd.pth"
cfg.MODEL.SIE_CAMERA = True       # enable camera SIE
cfg.MODEL.SIE_VIEW = False        # person ReID datasets usually lack viewpoint labels
cfg.MODEL.JPM = True              # enable the Jigsaw Patch Module
cfg.MODEL.STRIDE_SIZE = [12, 12]  # overlapping patches

# Build the model
model = make_model(cfg, num_class=751, camera_num=6, view_num=0)
model = model.cuda()
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
# → Parameters: 86.5M
```
```bash
# Full training command
python train.py --config_file configs/Market/vit_transreid.yml \
    MODEL.DEVICE_ID "('0')" \
    MODEL.PRETRAIN_PATH 'jx_vit_base_p16_224-80ecf9dd.pth' \
    OUTPUT_DIR './logs/market_transreid'
```

Method 2: SOLIDER Pre-trained Model (Most Recommended)

```python
# ===== ReID with SOLIDER pretrained weights =====
# git clone https://github.com/tinyvision/SOLIDER-REID.git
# Download the SOLIDER Swin-Base weights

import torch
from model import make_model
from config import cfg

# SOLIDER's semantic controller:
# λ ≈ 0.2 for ReID (appearance-leaning)
cfg.merge_from_file("configs/market/swin_base.yml")
cfg.MODEL.PRETRAIN_PATH = "solider_swin_base.pth"
cfg.MODEL.SEMANTIC_WEIGHT = 0.2  # semantic controller λ

model = make_model(cfg, num_class=751)
model = model.cuda()

# ===== Quick demo: SOLIDER's semantic controller =====
def demo_semantic_control():
    """Show the effect of λ."""
    import torch
    from solider import build_model

    model = build_model("swin_base", pretrained="solider_swin_base.pth")
    model.eval()
    dummy_input = torch.randn(1, 3, 256, 128)

    # λ = 0.0 → pure appearance (best for ReID)
    feat_appearance = model(dummy_input, semantic_weight=0.0)
    # λ = 0.5 → balanced (suited to detection)
    feat_balanced = model(dummy_input, semantic_weight=0.5)
    # λ = 1.0 → pure semantic (best for parsing)
    feat_semantic = model(dummy_input, semantic_weight=1.0)

    print(f"Appearance feat: {feat_appearance.shape}")
    print(f"Balanced feat: {feat_balanced.shape}")
    print(f"Semantic feat: {feat_semantic.shape}")
```
```bash
# Training SOLIDER-REID
python train.py --config_file configs/market/swin_base.yml \
    MODEL.PRETRAIN_PATH 'solider_swin_base.pth' \
    MODEL.SEMANTIC_WEIGHT 0.2 \
    OUTPUT_DIR './logs/market_solider'
```

💡 SOLIDER fine-tuning tips:

  • For ReID, use SEMANTIC_WEIGHT = 0.2 (appearance-leaning)

  • If your dataset is small, try 0.3-0.4 (more semantic info helps against overfitting)

  • For cross-domain evaluation, try 0.4-0.5 (more generalizable)

  • A cosine learning-rate scheduler with warmup works best

Method 3: DINOv2 Frozen Features (Simplest)

```python
# ===== ReID with DINOv2: frozen backbone + linear probe =====
import torch
import torch.nn as nn
from torchvision import transforms

# Step 1: load DINOv2
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2 = dinov2.cuda().eval()

# Freeze all backbone parameters
for param in dinov2.parameters():
    param.requires_grad = False

# Step 2: define the ReID head
class DINOv2ReID(nn.Module):
    def __init__(self, backbone, feat_dim=1024, num_classes=751):
        super().__init__()
        self.backbone = backbone
        self.bn = nn.BatchNorm1d(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            features = self.backbone(x)  # CLS token features
        features = self.bn(features)
        if self.training:
            logits = self.classifier(features)
            return features, logits
        return features

model = DINOv2ReID(dinov2, feat_dim=1024, num_classes=751)
model = model.cuda()

# Step 3: train only the BN layer and classifier
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad],
    lr=3.5e-4, weight_decay=5e-4
)

# Step 4: data preprocessing (DINOv2 works best around 518×518 input)
transform = transforms.Compose([
    transforms.Resize((518, 518)),  # DINOv2's optimal resolution
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.1f}M")
print(f"Total params: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
# → Trainable params: ~0.8M  (BatchNorm1d + the 1024→751 classifier)
# → Total params: ~305M
```

🚀 Benefits of DINOv2 for ReID:

  • Training is extremely fast: under 1M trainable params (vs 86M for TransReID)

  • Low data requirements: the frozen features are already highly discriminative

  • Strongest cross-domain performance: no target-domain adaptation needed

  • The cost: the 300M-param backbone still runs at inference time, so a GPU is required

Technical Deep Dive: How Do the Three Feature Spaces Differ?

Differences in Attention Patterns

TransReID: learns person-specific attention patterns via self-attention

  • The CLS token attends to the whole body
  • Attention shifts from local (texture) to global (silhouette) across layers
  • JPM makes attention more diverse (no over-concentration on any one body part)

SOLIDER: Swin's shifted-window attention plus semantic conditioning

  • The semantic controller adjusts what level the attention focuses on
  • Low $\lambda$: attention leans toward texture/color (appearance)
  • High $\lambda$: attention leans toward body-part boundaries (semantic)

DINOv2: object-centric patterns emerge naturally from self-supervised attention

  • It can segment the parts of a person without any labels
  • Its attention maps are highly consistent (stable across domains)
  • But it may miss ReID-specific fine-grained details (it was never trained for ReID)

The Impact of Pre-training Data

| Pre-training | Data | Domain Match | Feature Quality | Generalization |
| --- | --- | --- | --- | --- |
| ImageNet-1K supervised | 1.2M images, 1K classes | ❌ No people | ⭐⭐ | ⭐⭐ |
| ImageNet-21K supervised | 14M images, 21K classes | ❌ No people | ⭐⭐⭐ | ⭐⭐⭐ |
| LUPerson SSL (SOLIDER) | 4M person images | ✅✅ All people | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| LVD-142M SSL (DINOv2) | 142M diverse images | ⚠️ Some people | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

💡 Insight: SOLIDER and DINOv2 represent two different scaling strategies:

  • SOLIDER: domain-specific data + domain-specific SSL → strongest same-domain performance

  • DINOv2: massive diverse data + general SSL → strongest generalization

The best recipe may ultimately combine both: DINOv2-style general pre-training followed by SOLIDER-style human-centric fine-tuning.

Limitations and Future Directions

TransReID's Limitations

  1. Depends on ImageNet pre-training: ViT trained from scratch on small datasets performs poorly
  2. SIE needs camera metadata: without camera IDs, it cannot be used
  3. JPM increases inference cost: multiple branches mean more forward computation

SOLIDER's Limitations

  1. The LUPerson dataset is not public: you must request it from the authors
  2. Pseudo-label quality: errors from the parsing model propagate into pre-training
  3. Swin is no longer the newest backbone: stronger backbones have appeared since 2024

DINOv2's Limitations

  1. The model is large: ViT-L has 300M params, unsuitable for the edge
  2. Not ReID-specific: same-domain accuracy trails SOLIDER
  3. Resolution sensitivity: its optimal input is 518×518, while ReID typically uses 256×128

Future Directions

  • DINOv2 + human-centric fine-tuning: combine DINOv2's universal features with SOLIDER's semantic control
  • Efficient ViTs for ReID: distill large models down to edge-friendly sizes
  • Multi-modal ReID: incorporate text descriptions (CLIP-ReID) or LLM scene graphs
  • Video-based foundation models: extend DINOv2 into the temporal domain

Technical Takeaways

1. Pre-training Data > Architecture Design

SOLIDER took a "plain" Swin Transformer plus the "right" pre-training data, and beat TransReID's carefully engineered JPM + SIE. This matches the lesson from NLP: data + scale > clever architecture.

2. The Power of Self-Supervised Learning

Neither DINOv2 nor SOLIDER uses labels for pre-training, yet both learn features far better than supervised ImageNet features. The trend suggests labels are becoming less important; future computer vision may rely less and less on manual annotation.

3. The General-vs-Specific Spectrum

ReID's evolution shows that the most specific model (OSNet) is edge-friendly but has a low performance ceiling, while the most general model (DINOv2) has a high ceiling but is resource-heavy. Real deployments must find the right point on this spectrum.

4. The Value of Camera-Aware Learning

TransReID's SIE proves that exploiting metadata (camera ID, viewpoint) can significantly boost ReID performance. This insight is especially valuable in production systems, where you usually know which camera captured each image.

Conclusion

Core Takeaways

  1. TransReID (ICCV 2021): proved a pure Transformer can dominate ReID; JPM provides robust part features, SIE tackles camera bias
  2. SOLIDER (CVPR 2023): human-centric SSL pre-training + a semantic controller = one model for 6 tasks, same-domain SOTA
  3. DINOv2 (Meta 2023): universal visual features; a frozen backbone with a linear probe already delivers competitive ReID and the strongest cross-domain generalization

How to Choose?

| Scenario | Recommendation | Why |
| --- | --- | --- |
| Edge / on-camera deployment | OSNet x0.25 | 0.2M params, runs even on an STM32 |
| Plenty of labeled data | SOLIDER Swin-B + fine-tune | Highest same-domain performance |
| Little or no labeled data | DINOv2 ViT-L frozen + probe | No target-domain labels needed |
| Multiple human-centric tasks | SOLIDER | One pre-trained model covers them all |
| Balanced (GPU available) | TransReID + ImageNet-21K | Good accuracy, well documented |

Resources

TransReID

  • 📄 Paper: arXiv:2102.04378 (ICCV 2021, 1500+ citations)
  • 💻 Code: github.com/damo-cv/TransReID (MIT License)
  • 📊 Results: Market1501 95.2/89.5, MSMT17 85.3/67.4

SOLIDER

  • 📄 Paper: arXiv:2303.17602 (CVPR 2023, 200+ citations)
  • 💻 Code: github.com/tinyvision/SOLIDER (1.5K stars, Apache 2.0)
  • 💻 ReID downstream: github.com/tinyvision/SOLIDER-REID
  • 📊 Results: Market1501 93.9/96.9, MSMT17 77.1/90.7

DINOv2

  • 📄 Paper: arXiv:2304.07193 (Meta AI 2023)
  • 💻 Code: github.com/facebookresearch/dinov2 (Apache 2.0)
  • 🔗 Demo: dinov2.metademolab.com
  • 📊 Models: ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14

Further Reading

  • 📄 PASS (ECCV 2022): Part-Aware Self-Supervised Pre-Training — arXiv:2203.03931
  • 📄 CLIP-ReID (AAAI 2023): ReID via CLIP's text-image alignment
  • 📄 OSNet (ICCV 2019): the previous post — OSNet: How Can a Lightweight 2.2M-Parameter CNN Beat ResNet50 at Person Re-ID?

The evolution of Person ReID teaches a profound lesson: in AI, the right pre-training often matters more than a clever architecture. From OSNet's hand-crafted multi-scale CNN, to TransReID's Transformer adaptation, to SOLIDER's human-centric SSL, to DINOv2's universal foundation model: every step has been a victory of data and scale. Future ReID may need no ReID-specific design at all, just ever-better foundation models. 🔄✨
