Billy Tse
April 5, 2026 · 35 min read

The Evolution of Person ReID: From TransReID to SOLIDER to DINOv2, and How Transformers Took Over Person Re-Identification

A deep dive into the three milestones that carried Person ReID from the CNN era into the Transformer era: TransReID (the first pure-Transformer ReID framework), SOLIDER (semantically controllable self-supervised pre-training), and DINOv2 (a general-purpose vision foundation model). Covers the core designs, including JPM, SIE, and the semantic controller, with a complete implementation guide.

Computer Vision · AI · Transformer · Image Processing

This post covers three core papers:

  • TransReID — arXiv:2102.04378 (ICCV 2021) | GitHub
  • SOLIDER — arXiv:2303.17602 (CVPR 2023) | GitHub
  • DINOv2 — arXiv:2304.07193 (Meta AI, 2023) | GitHub

TL;DR

In the previous post we saw how OSNet, a 2.2M-parameter CNN, beat ResNet50 at Person ReID. But from 2021 onward, Transformers completely rewrote the rules of the game.

This post walks through three Transformer-era milestones in ReID:

Key points:

  • 🏗️ TransReID (ICCV 2021): the first pure-Transformer ReID framework, introducing two ReID-specific modules, JPM and SIE
  • 🧠 SOLIDER (CVPR 2023): semantically controllable self-supervised pre-training; one model handles 6 human-centric tasks
  • 🦕 DINOv2 (Meta AI 2023): a general-purpose vision foundation model whose frozen features are enough for ReID
  • 📊 The trend: from hand-designed backbones → ReID-specific Transformers → general self-supervised pre-training → foundation models


Background: The Three Eras of ReID

Before diving into each model, here is the big picture of how Person ReID has evolved:

| Era | Representative | Core Idea | Pre-training Data | MSMT17 mAP |
| --- | --- | --- | --- | --- |
| CNN | OSNet (ICCV 2019) | Task-specific CNN design | ImageNet-1K | 52.9 |
| Transformer | TransReID (ICCV 2021) | Pure ViT + ReID modules | ImageNet-21K | 67.4 |
| Pre-training | SOLIDER (CVPR 2023) | Human-centric SSL | LUPerson (4M) | 77.1 |
| Foundation | DINOv2 (Meta 2023) | Universal visual features | LVD-142M | ~70+ (frozen) |

🎯 The core trend: the field has moved from designing ReID-specific architectures toward using better pre-training to supply more universal features. Models keep getting more general, yet performance keeps improving; that is the magic of the foundation-model era.

Part 1: TransReID — The First Pure Transformer for ReID

Why Use a Transformer for ReID?

CNNs have a fundamental limitation: the local receptive field. Even though OSNet widens its receptive field with multi-scale streams, each convolution layer can still only "see" nearby pixels.

Self-attention in Transformers solves this by construction: every token can directly attend to any position in the image.

| Property | CNN (OSNet) | Transformer (TransReID) |
| --- | --- | --- |
| Receptive field | Local, widened layer by layer | Global from layer 1 |
| Long-range dependencies | Require deep stacking | Present in every layer |
| Position encoding | Implicit (from conv locality) | Explicit (side info can be added) |
| Part-level features | Need an extra branch | Patch tokens map naturally to body parts |

💡 TransReID's insight: ViT cuts the image into patches, and each patch naturally corresponds to a different part of the pedestrian (head, torso, legs). That lines up perfectly with ReID's need for part-level features!

The TransReID Architecture

TransReID's baseline is a standard ViT-B/16, plus two ReID-specific modules:


Innovation 1: Side Information Embedding (SIE)

This is TransReID's most practical innovation. ReID has a problem the CNN era never properly solved: camera bias.

The problem: the same person can look very different under Camera A (indoor, warm light) and Camera B (outdoor, cool light). A model easily learns shortcuts like "people from Camera A look yellowish, people from Camera B look bluish."

SIE's solution: encode the camera ID and viewpoint as learnable embeddings and add them directly to the patch embeddings:

$$z_0 = [x_{\text{cls}}; \, x_1^E; \, x_2^E; \, \ldots; \, x_N^E] + E_{\text{pos}} + \lambda_1 \cdot S_{\text{cam}} + \lambda_2 \cdot S_{\text{view}}$$

where:

  • $S_{\text{cam}}$ = camera ID embedding (each camera gets its own learnable vector)
  • $S_{\text{view}}$ = viewpoint embedding (front / back / side)
  • $\lambda_1, \lambda_2$ = scaling hyperparameters that balance the side-information terms

🎯 Why does SIE work?

  • It tells the model which camera shot the image, so the model can learn to compensate for camera-specific bias

  • It is analogous to segment embeddings in NLP (what BERT uses to distinguish sentence A from sentence B)

  • It requires no extra annotation: camera IDs already come with standard ReID datasets
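The SIE addition above can be sketched in a few lines of PyTorch. This is a minimal illustration, not TransReID's actual code; the class name, embedding counts, and fixed λ values are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SideInfoEmbedding(nn.Module):
    """Minimal SIE sketch: add learnable camera/viewpoint embeddings
    to the token sequence, scaled by lambda coefficients."""

    def __init__(self, num_cameras=6, num_views=3, dim=768,
                 lambda_cam=1.0, lambda_view=1.0):
        super().__init__()
        self.cam_embed = nn.Embedding(num_cameras, dim)   # S_cam
        self.view_embed = nn.Embedding(num_views, dim)    # S_view
        self.lambda_cam = lambda_cam
        self.lambda_view = lambda_view

    def forward(self, tokens, cam_id, view_id):
        # tokens: (B, N+1, dim) = [CLS; patch embeddings] + positional embedding
        side = (self.lambda_cam * self.cam_embed(cam_id)
                + self.lambda_view * self.view_embed(view_id))
        # broadcast the same side-info vector onto every token of the image
        return tokens + side.unsqueeze(1)

sie = SideInfoEmbedding()
tokens = torch.randn(2, 129, 768)     # batch of 2 images, 129 tokens each
cam_id = torch.tensor([0, 3])         # which camera shot each image
view_id = torch.tensor([1, 2])        # front / back / side, if available
out = sie(tokens, cam_id, view_id)
print(out.shape)                      # torch.Size([2, 129, 768])
```

Because the side-info vector is shared across all tokens of an image, the model can subtract camera-specific bias in a single learned step.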

Innovation 2: Jigsaw Patch Module (JPM)

JPM tackles a different problem: ViT's patch tokens are too regular. Each patch is tied to a fixed image location, so the model easily overfits to specific spatial patterns.

What JPM does:

  1. Shift: circularly shift the patch embeddings by k positions
  2. Patch shuffle: split the shifted patches randomly into K groups
  3. Classify each group independently: every group must identify who the person is

💡 Why does shuffling help? Three benefits:

  1. Robustness: forcing the model to recognize a person from scrambled local features makes it more robust against occlusion

  2. Diversity: each group contains different body parts, so each branch learns more diverse features

  3. Regularization: it stops the model from overfitting to the pattern "the head is always at the top, the feet always at the bottom"

Analogy: imagine having to recognize a friend while only ever being shown a few random parts of their body at a time. You are forced to learn the distinctive features of every part.

JPM by the Numbers

Say ViT-B/16 cuts a 256×128 image into 16×8 = 128 patches; with 1 CLS token that makes 129 tokens.

Step 1: Shift

With shift = 5:

  • Original token order: [CLS, P1, P2, ..., P128]
  • After shifting: [CLS, P6, P7, ..., P128, P1, P2, P3, P4, P5]

Step 2: Split into K = 4 groups

  • Group 1: P6, P10, P14, ... (every 4th patch)
  • Group 2: P7, P11, P15, ...
  • Group 3: P8, P12, P16, ...
  • Group 4: P9, P13, P17, ...

Each group contains patches from all over the image, giving heterogeneous spatial coverage.

Step 3: Classify each group separately

  • Each group has its own learnable token (CLS-like) that produces one feature vector
  • Together with the global CLS token, that gives K+1 feature vectors
  • Each one must identify the person correctly, yielding multiple complementary features
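The shift-and-regroup steps above are easy to verify with plain Python, using patch names instead of tensors. Note that TransReID's actual shuffle is random; the deterministic interleave here simply mirrors the worked example.

```python
# Reproduce the JPM worked example: 128 patch tokens, shift=5, K=4 groups.
patches = [f"P{i}" for i in range(1, 129)]   # P1 … P128 (CLS handled separately)

# Step 1: circular shift by 5, so P6 moves to the front
shift = 5
shifted = patches[shift:] + patches[:shift]

# Step 2: deal the shifted patches into K=4 interleaved groups
K = 4
groups = [shifted[g::K] for g in range(K)]

print(shifted[:4])     # ['P6', 'P7', 'P8', 'P9']
print(groups[0][:3])   # ['P6', 'P10', 'P14']
print(groups[3][:3])   # ['P9', 'P13', 'P17']
print(len(groups[0]))  # 32 patches per group
```

Each of the 4 groups ends up with 32 patches drawn from across the whole image, which is exactly the heterogeneous coverage JPM is after.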

A Subtle but Important TransReID Trick: Overlapping Patches

A standard ViT uses stride = patch_size (i.e. 16), but TransReID found that stride = 12 (overlapping patches) noticeably improves performance:

| Setting | Stride | Market1501 R1/mAP | MSMT17 R1/mAP |
| --- | --- | --- | --- |
| ViT-B/16 | 16 (no overlap) | 94.6 / 87.1 | 81.8 / 61.0 |
| ViT-B/16, s=12 | 12 (overlap) | 95.0 / 88.2 | 83.4 / 64.5 |

🔑 Why does overlap help? Because the boundaries of non-overlapping patches cut straight through important features (a logo sliced in half, for example). Overlapping makes each patch share pixels with its neighbors, preserving more boundary information.
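The token count follows the usual sliding-window formula, floor((size − patch) / stride) + 1 per dimension, the same arithmetic a Conv2d patch embedding performs. A quick check for a 256×128 input:

```python
def num_patches(h, w, patch=16, stride=16):
    """Patch-grid size for a sliding-window (Conv2d-style) patch embedding."""
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    return rows, cols, rows * cols

print(num_patches(256, 128, stride=16))  # (16, 8, 128)  — standard ViT-B/16
print(num_patches(256, 128, stride=12))  # (21, 10, 210) — overlapping, s=12
```

So overlapping patches roughly 1.6× the token count (128 → 210), which is where the accuracy gain comes from, at the cost of extra attention compute.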

Full TransReID Results

| Method | Backbone | Market1501 R1/mAP | MSMT17 R1/mAP |
| --- | --- | --- | --- |
| OSNet (ICCV 2019) | OSNet | 94.8 / 84.9 | 78.7 / 52.9 |
| BoT (CVPRW 2019) | ResNet50 | 94.5 / 85.9 | – / – |
| ABDNet (ICCV 2019) | ResNet50 | 95.6 / 88.3 | 82.3 / 60.8 |
| TransReID Baseline | ViT-B/16 | 94.6 / 87.1 | 81.8 / 61.0 |
| TransReID | ViT-B/16, s=12 | 95.2 / 89.5 | 85.3 / 67.4 |

🚀 Key observations:

  • On MSMT17 (the largest, hardest dataset), TransReID reaches mAP = 67.4, a full 14.5 points above OSNet

  • The ImageNet-21K pre-trained ViT-B/16 backbone is already very strong on its own

  • JPM + SIE add a further 3-6 mAP points on top of the baseline

  • This was the first proof that a pure Transformer can dominate ReID

Part 2: SOLIDER — Semantically Controllable Self-Supervised Pre-training

From Architecture Design to Pre-training

TransReID proved the potential of Transformers for ReID, but it still relied on ImageNet pre-training, a dataset designed for object classification.

Here is the catch: ImageNet contains very few images of people. The domain gap between the pre-training data and the downstream task is large.

SOLIDER's core question: what if we pre-trained self-supervised on a huge pile of unlabeled person images instead?

The answer: it works dramatically better.

SOLIDER's Core Design

SOLIDER = Semantic cOntrollable seLf-supervIseD lEaRning

It has three core innovations:


Innovation 1: Pseudo Semantic Labels from Human Prior Knowledge

Generic self-supervised learning (DINO, MAE, etc.) learns pure appearance features: which pixels resemble which other pixels. But person images have a very clear semantic structure: head, upper body, lower body, shoes.

SOLIDER exploits that prior knowledge:

  1. Run an off-the-shelf human parsing model (e.g. SCHP) over the training images to produce pseudo semantic labels
  2. Partition pixels into semantic groups (head, upper body, lower body, etc.)
  3. During contrastive learning, features within the same semantic group should be more similar

💡 Why not use ground-truth labels? Because LUPerson has 4 million unlabeled images; nobody can annotate that many by hand. Pseudo labels are the only feasible option, and the parsing model's pseudo labels are already accurate enough.
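The grouping step can be sketched with NumPy: given a per-pixel parsing map, pool pixel features into one prototype per semantic part. Everything here (the array shapes, the part IDs, the tiny 4×4 map) is illustrative, not SOLIDER's actual implementation.

```python
import numpy as np

# Toy setup: a 4×4 feature map with 8-dim features, plus a pseudo parsing map
# from an off-the-shelf parser (0 = background, 1 = head, 2 = upper, 3 = lower).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))
parsing = np.array([[1, 1, 0, 0],
                    [2, 2, 2, 0],
                    [2, 2, 2, 0],
                    [3, 3, 3, 3]])

# One prototype per foreground part: the mean feature over its pixels.
prototypes = {part: features[parsing == part].mean(axis=0)
              for part in [1, 2, 3]}

# During pre-training, pixels of the same part are pulled toward their
# prototype (a contrastive target); background pixels are ignored.
for part, proto in prototypes.items():
    print(part, proto.shape)  # each prototype is an 8-dim vector
```

The pseudo labels thus turn a pure appearance objective into one with part-level semantic structure, without any manual annotation.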

Innovation 2: The Semantic Controller — One Model for Every Task

This is SOLIDER's most unique design. Different downstream tasks need different kinds of features:

| Task | Features Needed | Semantic vs Appearance |
| --- | --- | --- |
| Person ReID | Clothing color, texture, logos | Appearance-leaning (λ ≈ 0.2) |
| Human Parsing | Body-part boundaries | Semantic-leaning (λ ≈ 0.8) |
| Pedestrian Detection | Overall human shape | Balanced (λ ≈ 0.5) |
| Attribute Recognition | Clothing type, color names | Semantic-leaning (λ ≈ 0.6) |

SOLIDER introduces a semantic controller, a conditional network that takes $\lambda \in [0, 1]$ as input:

  • $\lambda = 0$: output pure appearance features (suited to ReID)
  • $\lambda = 1$: output pure semantic features (suited to parsing)
  • $\lambda = 0.5$: balanced features (suited to detection)

🎯 Why is this design so elegant?

  1. During pre-training: conditioning on $\lambda$ teaches the model to encode appearance and semantic information simultaneously

  2. During fine-tuning: users just set a $\lambda$ value to adjust the features; no retraining needed

  3. One model covers every human-centric task, which is enormously practical
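One way to picture the controller is as a λ-conditioned blend of two projections of the backbone feature. This is a deliberately simplified sketch (SOLIDER's real controller conditions the network internals on λ rather than interpolating two heads), and all the module names here are made up for illustration.

```python
import torch
import torch.nn as nn

class ToySemanticController(nn.Module):
    """Illustrative only: interpolate between an 'appearance' projection and
    a 'semantic' projection according to lambda. Not SOLIDER's real mechanism."""

    def __init__(self, dim=256):
        super().__init__()
        self.appearance_head = nn.Linear(dim, dim)
        self.semantic_head = nn.Linear(dim, dim)

    def forward(self, feat, lam):
        # lam = 0 → pure appearance, lam = 1 → pure semantic
        return (1 - lam) * self.appearance_head(feat) + lam * self.semantic_head(feat)

ctrl = ToySemanticController()
feat = torch.randn(4, 256)                # a batch of backbone features
for lam in (0.0, 0.2, 0.5, 1.0):          # 0.2 ≈ the ReID setting
    out = ctrl(feat, lam)
    print(lam, out.shape)                 # shape is unchanged; content shifts
```

The key property this toy model shares with the real controller: changing λ at inference time changes the feature content without touching any weights.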

Innovation 3: A Swin Transformer Backbone

SOLIDER picked Swin Transformer rather than ViT as its backbone, because:

| Property | ViT | Swin Transformer |
| --- | --- | --- |
| Attention scope | Global (all tokens) | Shifted windows (local → global) |
| Resolution | Fixed patch size | Hierarchical (multi-scale feature maps) |
| Dense prediction | Needs a decoder | Native support (FPN-friendly) |
| Downstream flexibility | Mostly classification | Detection and segmentation also work |

Since SOLIDER must support 6 downstream tasks (including detection and parsing), Swin's hierarchical design is essential.

SOLIDER Results

| Method | Pre-training | Backbone | Market1501 mAP/R1 | MSMT17 mAP/R1 |
| --- | --- | --- | --- | --- |
| TransReID | ImageNet-21K | ViT-B/16 | 89.5 / 95.2 | 67.4 / 85.3 |
| TransReID-SSL | LUPerson (SSL) | ViT-B/16 | 90.0 / 95.6 | 68.7 / 86.1 |
| PASS (ECCV 2022) | LUPerson (part-aware SSL) | ViT-B/16 | 90.3 / 95.8 | 70.0 / 86.8 |
| SOLIDER (Swin-T) | LUPerson (semantic SSL) | Swin-Tiny | 91.6 / 96.1 | 67.4 / 85.9 |
| SOLIDER (Swin-S) | LUPerson (semantic SSL) | Swin-Small | 93.3 / 96.6 | 76.9 / 90.8 |
| SOLIDER (Swin-B) | LUPerson (semantic SSL) | Swin-Base | 93.9 / 96.9 | 77.1 / 90.7 |

🚀 Key observations:

  • SOLIDER Swin-B hits MSMT17 mAP = 77.1, a full 9.7 points above TransReID!

  • Even the smallest variant, Swin-Tiny, already reaches 91.6 mAP on Market1501

  • With re-ranking, MSMT17 mAP climbs to 86.5

  • The same pre-trained model also handles detection, parsing, pose estimation, attribute recognition, and person search

SOLIDER Across Tasks

| Task | Dataset | Metric | ImageNet Pre-train | SOLIDER Swin-B |
| --- | --- | --- | --- | --- |
| Person ReID | Market1501 | mAP | ~88 | 93.9 |
| Person ReID | MSMT17 | mAP | ~62 | 77.1 |
| Pedestrian Detection | CityPersons | MR⁻² (lower is better) | ~11 | 9.7 |
| Human Parsing | LIP | mIoU | ~56 | 60.5 |
| Attribute Recognition | PA100K | mA | ~82 | 86.4 |
| Pose Estimation | COCO | AP | ~74 | 76.6 |

🎯 One model, state of the art on six tasks: that is the power of human-centric pre-training.

Part 3: DINOv2 — Universal Features from a Foundation Model

From Human-Specific to Universal

SOLIDER pre-trains on human images, so it excels at human-centric tasks. But what about a model that performs well on any vision task?

That model is DINOv2.

What Is DINOv2?

DINOv2 is a self-supervised vision foundation model released by Meta AI in 2023, trained on 142M diverse images (the LVD-142M dataset).

Key traits:

  • 🦕 Self-supervised: needs no labels at all; trained via self-distillation (teacher-student)
  • 🌍 Universal features: one frozen backbone handles classification, segmentation, depth estimation, retrieval...
  • 📐 ViT backbones: ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14
  • ❄️ Frozen features work: no fine-tuning required; a linear probe or kNN already performs well

How DINOv2 Is Trained


DINOv2 combines several self-supervised objectives:

  1. DINO loss (self-distillation): the student's output on local views must match the teacher's output on global views
  2. iBOT loss (masked image modeling): mask some patches and ask the model to predict the masked features
  3. KoLeo regularizer: keeps the feature space spread out uniformly

💡 Why are DINOv2's features so good? Three reasons:

  1. Data scale: 142M images covering nature, cities, interiors, people, and more

  2. Data curation: the training set was carefully filtered via self-supervised retrieval (not random web scraping)

  3. Training recipe: combining multiple SSL objectives gives features both local and global discriminability
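The DINO loss in objective 1 can be sketched as a cross-entropy between sharpened softmax outputs of teacher and student, with the teacher's output centered to prevent collapse. The temperatures and dimensions below follow the general DINO recipe but the exact values are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Self-distillation: the student's distribution should match the
    teacher's centered, sharpened distribution. Teacher gets no gradient."""
    t = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    log_s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

B, D = 8, 64                       # batch of 8, 64-dim projection-head output
student_out = torch.randn(B, D, requires_grad=True)
teacher_out = torch.randn(B, D)    # in practice, the teacher is an EMA of the student
center = teacher_out.mean(dim=0)   # running center, updated with momentum in practice

loss = dino_loss(student_out, teacher_out, center)
loss.backward()                    # gradients flow only into the student
```

Centering subtracts the batch-mean logits from the teacher, and the low teacher temperature sharpens its distribution; together they stop both networks from collapsing to a constant output.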

ReID with DINOv2: The Power of Frozen Features

Recent work has begun exploring DINOv2 as a ReID backbone. The biggest difference from traditional approaches: you do not fine-tune the model at all.

How to use it:

  1. Extract frozen features with DINOv2 ViT-L/14
  2. Add a lightweight head (a linear layer or small MLP)
  3. Train only the head

Advantages:

  • Extremely low training cost (only a few layers are trained)
  • Little ReID-specific training data required
  • Especially strong cross-domain generalization (the features are already universal)

| Method | Backbone | Training | Market1501 R1 | Cross-domain |
| --- | --- | --- | --- | --- |
| OSNet | OSNet (2.2M) | Full fine-tune | 94.8 | ⚠️ Needs AIN |
| TransReID | ViT-B/16 (86M) | Full fine-tune | 95.2 | ⚠️ Needs SIE |
| SOLIDER | Swin-B (88M) | Fine-tune | 96.9 | ✅ Good |
| DINOv2 | ViT-L/14 (300M) | Frozen + linear probe | ~93-95 | ✅✅ Best |

🎯 DINOv2's trade-offs:

  • ✅ Strongest cross-domain generalization: the features are not fine-tuned to any particular dataset

  • ✅ Zero-shot / few-shot capability: almost no target-domain data needed

  • ⚠️ Not the best same-domain accuracy: a fine-tuned SOLIDER still leads on same-domain benchmarks

  • ⚠️ Large model: ViT-L has 300M params, unsuitable for edge deployment

DINOv2's Attention Maps: Why It Is a Natural Fit for ReID

The attention maps DINOv2 learns have a striking property: they automatically segment objects, including the different parts of a human body.

This ability emerges purely from self-supervision, with no segmentation labels whatsoever. For ReID, that means:

  • Automatic part-level attention: no JPM or external pose detector required
  • Semantic understanding: it knows which patches belong to the same body part
  • Background suppression: background clutter is naturally ignored

Deep Comparison: What Fundamentally Separates the Four Methods

Architecture Comparison

| Property | OSNet | TransReID | SOLIDER | DINOv2 |
| --- | --- | --- | --- | --- |
| Architecture | Custom CNN | ViT-B/16 | Swin-T/S/B | ViT-S/B/L/g |
| Pre-training data | ImageNet-1K | ImageNet-21K | LUPerson (4M people) | LVD-142M (general) |
| Pre-training method | Supervised | Supervised | Self-supervised + semantic | Self-supervised |
| ReID-specific design | AG + multi-scale | JPM + SIE | Semantic controller | None (general purpose) |
| Params | 2.2M | ~86M | ~88M | 300M+ |
| Edge-friendly? | ✅✅ | ⚠️ | ⚠️ | ❌ |
| Cross-domain | Needs AIN | SIE helps | Good | Best |

Which One Should You Pick? A Decision Framework


Implementation Guide

Method 1: TransReID (ViT-based ReID)

```python
# ===== Setup =====
# git clone https://github.com/damo-cv/TransReID.git
# pip install yacs timm

# ===== Training =====
# Download the ViT-B/16 pretrained weights (ImageNet-21K):
# wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p16_224-80ecf9dd.pth

import torch
from config import cfg
from model import make_model
from processor import do_train

# Configuration
cfg.merge_from_file("configs/Market/vit_transreid.yml")
cfg.MODEL.PRETRAIN_PATH = "jx_vit_base_p16_224-80ecf9dd.pth"
cfg.MODEL.SIE_CAMERA = True       # enable camera SIE
cfg.MODEL.SIE_VIEW = False        # person ReID datasets usually lack viewpoint labels
cfg.MODEL.JPM = True              # enable the Jigsaw Patch Module
cfg.MODEL.STRIDE_SIZE = [12, 12]  # overlapping patches

# Build the model
model = make_model(cfg, num_class=751, camera_num=6, view_num=0)
model = model.cuda()
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
# → Parameters: 86.5M
```
```bash
# Full training command
python train.py --config_file configs/Market/vit_transreid.yml \
    MODEL.DEVICE_ID "('0')" \
    MODEL.PRETRAIN_PATH 'jx_vit_base_p16_224-80ecf9dd.pth' \
    OUTPUT_DIR './logs/market_transreid'
```

Method 2: SOLIDER Pre-trained Model (Most Recommended)

```python
# ===== ReID with SOLIDER pretrained weights =====
# git clone https://github.com/tinyvision/SOLIDER-REID.git
# Download the SOLIDER Swin-Base weights

import torch
from model import make_model
from config import cfg

# SOLIDER's semantic controller:
# λ ≈ 0.2 for ReID (appearance-leaning)
cfg.merge_from_file("configs/market/swin_base.yml")
cfg.MODEL.PRETRAIN_PATH = "solider_swin_base.pth"
cfg.MODEL.SEMANTIC_WEIGHT = 0.2  # semantic controller λ

model = make_model(cfg, num_class=751)
model = model.cuda()

# ===== Quick demo: SOLIDER's semantic controller =====
def demo_semantic_control():
    """Show the effect of λ."""
    import torch
    from solider import build_model

    model = build_model("swin_base", pretrained="solider_swin_base.pth")
    model.eval()
    dummy_input = torch.randn(1, 3, 256, 128)

    # λ = 0.0 → pure appearance (best for ReID)
    feat_appearance = model(dummy_input, semantic_weight=0.0)
    # λ = 0.5 → balanced (suited to detection)
    feat_balanced = model(dummy_input, semantic_weight=0.5)
    # λ = 1.0 → pure semantic (best for parsing)
    feat_semantic = model(dummy_input, semantic_weight=1.0)

    print(f"Appearance feat: {feat_appearance.shape}")
    print(f"Balanced feat: {feat_balanced.shape}")
    print(f"Semantic feat: {feat_semantic.shape}")
```
```bash
# Training SOLIDER-REID
python train.py --config_file configs/market/swin_base.yml \
    MODEL.PRETRAIN_PATH 'solider_swin_base.pth' \
    MODEL.SEMANTIC_WEIGHT 0.2 \
    OUTPUT_DIR './logs/market_solider'
```

💡 SOLIDER fine-tuning tips:

  • For ReID, use SEMANTIC_WEIGHT = 0.2 (appearance-leaning)

  • If your dataset is small, try 0.3-0.4 (more semantic info helps against overfitting)

  • For cross-domain evaluation, try 0.4-0.5 (more generalizable)

  • A cosine learning-rate scheduler with warmup works best

Method 3: DINOv2 Frozen Features (Simplest)

```python
# ===== ReID with DINOv2: frozen backbone + linear probe =====
import torch
import torch.nn as nn
from torchvision import transforms

# Step 1: load DINOv2
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2 = dinov2.cuda().eval()

# Freeze all backbone parameters
for param in dinov2.parameters():
    param.requires_grad = False

# Step 2: define the ReID head
class DINOv2ReID(nn.Module):
    def __init__(self, backbone, feat_dim=1024, num_classes=751):
        super().__init__()
        self.backbone = backbone
        self.bn = nn.BatchNorm1d(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            features = self.backbone(x)  # CLS token features
        features = self.bn(features)
        if self.training:
            logits = self.classifier(features)
            return features, logits
        return features

model = DINOv2ReID(dinov2, feat_dim=1024, num_classes=751)
model = model.cuda()

# Step 3: train only the BN layer and classifier
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad],
    lr=3.5e-4, weight_decay=5e-4
)

# Step 4: data preprocessing (DINOv2 works best around 518×518 input)
transform = transforms.Compose([
    transforms.Resize((518, 518)),  # DINOv2's optimal resolution
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
])

print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6:.1f}M")
print(f"Total params: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
# → Trainable params: ~0.8M  (BatchNorm1d + the 1024→751 classifier)
# → Total params: ~305M
```

🚀 Benefits of DINOv2 for ReID:

  • Training is extremely fast: under 1M trainable params (vs 86M for TransReID)

  • Low data requirements: the frozen features are already highly discriminative

  • Strongest cross-domain performance: no target-domain adaptation needed

  • The cost: the 300M-param backbone still runs at inference time, so a GPU is required

Technical Deep Dive: How Do the Three Feature Spaces Differ?

Differences in Attention Patterns

TransReID: learns person-specific attention patterns via self-attention

  • The CLS token attends to the whole body
  • Attention shifts from local (texture) to global (silhouette) across layers
  • JPM makes attention more diverse (no over-concentration on any one body part)

SOLIDER: Swin's shifted-window attention plus semantic conditioning

  • The semantic controller adjusts what level the attention focuses on
  • Low $\lambda$: attention leans toward texture/color (appearance)
  • High $\lambda$: attention leans toward body-part boundaries (semantic)

DINOv2: object-centric patterns emerge naturally from self-supervised attention

  • It can segment the parts of a person without any labels
  • Its attention maps are highly consistent (stable across domains)
  • But it may miss ReID-specific fine-grained details (it was never trained for ReID)

The Impact of Pre-training Data

| Pre-training | Data | Domain Match | Feature Quality | Generalization |
| --- | --- | --- | --- | --- |
| ImageNet-1K supervised | 1.2M images, 1K classes | ❌ No people | ⭐⭐ | ⭐⭐ |
| ImageNet-21K supervised | 14M images, 21K classes | ❌ No people | ⭐⭐⭐ | ⭐⭐⭐ |
| LUPerson SSL (SOLIDER) | 4M person images | ✅✅ All people | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| LVD-142M SSL (DINOv2) | 142M diverse images | ⚠️ Some people | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

💡 Insight: SOLIDER and DINOv2 represent two different scaling strategies:

  • SOLIDER: domain-specific data + domain-specific SSL → strongest same-domain performance

  • DINOv2: massive diverse data + general SSL → strongest generalization

The best recipe may ultimately combine both: DINOv2-style general pre-training followed by SOLIDER-style human-centric fine-tuning.

Limitations and Future Directions

TransReID's Limitations

  1. Depends on ImageNet pre-training: ViT trained from scratch on small datasets performs poorly
  2. SIE needs camera metadata: without camera IDs, it cannot be used
  3. JPM increases inference cost: multiple branches mean more forward computation

SOLIDER's Limitations

  1. The LUPerson dataset is not public: you must request it from the authors
  2. Pseudo-label quality: errors from the parsing model propagate into pre-training
  3. Swin is no longer the newest backbone: stronger backbones have appeared since 2024

DINOv2's Limitations

  1. The model is large: ViT-L has 300M params, unsuitable for the edge
  2. Not ReID-specific: same-domain accuracy trails SOLIDER
  3. Resolution sensitivity: its optimal input is 518×518, while ReID typically uses 256×128

Future Directions

  • DINOv2 + human-centric fine-tuning: combine DINOv2's universal features with SOLIDER's semantic control
  • Efficient ViTs for ReID: distill large models down to edge-friendly sizes
  • Multi-modal ReID: incorporate text descriptions (CLIP-ReID) or LLM scene graphs
  • Video-based foundation models: extend DINOv2 into the temporal domain

Technical Takeaways

1. Pre-training Data > Architecture Design

SOLIDER took a "plain" Swin Transformer plus the "right" pre-training data, and beat TransReID's carefully engineered JPM + SIE. This matches the lesson from NLP: data + scale > clever architecture.

2. The Power of Self-Supervised Learning

Neither DINOv2 nor SOLIDER uses labels for pre-training, yet both learn features far better than supervised ImageNet features. The trend suggests labels are becoming less important; future computer vision may rely less and less on manual annotation.

3. The General-vs-Specific Spectrum

ReID's evolution shows that the most specific model (OSNet) is edge-friendly but has a low performance ceiling, while the most general model (DINOv2) has a high ceiling but is resource-heavy. Real deployments must find the right point on this spectrum.

4. The Value of Camera-Aware Learning

TransReID's SIE proves that exploiting metadata (camera ID, viewpoint) can significantly boost ReID performance. This insight is especially valuable in production systems, where you usually know which camera captured each image.

Conclusion

Core Takeaways

  1. TransReID (ICCV 2021): proved a pure Transformer can dominate ReID; JPM provides robust part features, SIE tackles camera bias
  2. SOLIDER (CVPR 2023): human-centric SSL pre-training + a semantic controller = one model for 6 tasks, same-domain SOTA
  3. DINOv2 (Meta 2023): universal visual features; a frozen backbone with a linear probe already delivers competitive ReID and the strongest cross-domain generalization

How to Choose?

| Scenario | Recommendation | Why |
| --- | --- | --- |
| Edge / on-camera deployment | OSNet x0.25 | 0.2M params, runs even on an STM32 |
| Plenty of labeled data | SOLIDER Swin-B + fine-tune | Highest same-domain performance |
| Little or no labeled data | DINOv2 ViT-L frozen + probe | No target-domain labels needed |
| Multiple human-centric tasks | SOLIDER | One pre-trained model covers them all |
| Balanced (GPU available) | TransReID + ImageNet-21K | Good accuracy, well documented |

Resources

TransReID

  • 📄 Paper: arXiv:2102.04378 (ICCV 2021, 1500+ citations)
  • 💻 Code: github.com/damo-cv/TransReID (MIT License)
  • 📊 Results: Market1501 95.2/89.5, MSMT17 85.3/67.4

SOLIDER

  • 📄 Paper: arXiv:2303.17602 (CVPR 2023, 200+ citations)
  • 💻 Code: github.com/tinyvision/SOLIDER (1.5K stars, Apache 2.0)
  • 💻 ReID downstream: github.com/tinyvision/SOLIDER-REID
  • 📊 Results: Market1501 93.9/96.9, MSMT17 77.1/90.7

DINOv2

  • 📄 Paper: arXiv:2304.07193 (Meta AI 2023)
  • 💻 Code: github.com/facebookresearch/dinov2 (Apache 2.0)
  • 🔗 Demo: dinov2.metademolab.com
  • 📊 Models: ViT-S/14, ViT-B/14, ViT-L/14, ViT-g/14

Further Reading

  • 📄 PASS (ECCV 2022): Part-Aware Self-Supervised Pre-Training — arXiv:2203.03931
  • 📄 CLIP-ReID (AAAI 2023): ReID via CLIP's text-image alignment
  • 📄 OSNet (ICCV 2019): the previous post — OSNet: How Can a Lightweight 2.2M-Parameter CNN Beat ResNet50 at Person Re-ID?

The evolution of Person ReID teaches a profound lesson: in AI, the right pre-training often matters more than a clever architecture. From OSNet's hand-crafted multi-scale CNN, to TransReID's Transformer adaptation, to SOLIDER's human-centric SSL, to DINOv2's universal foundation model: every step has been a victory of data and scale. Future ReID may need no ReID-specific design at all, just ever-better foundation models. 🔄✨
