"Prompt Engineering is dead. Context Engineering is already showing its limits. The future is Harness Engineering." — a synthesis of views from Andrej Karpathy, Tobi Lütke (Shopify CEO), and Harrison Chase (LangChain CEO)
TL;DR
In 2022 you learned how to talk to AI. In 2025 you learned how to give AI the right information. In 2026 you learn how to supervise AI while it works.
Key points:
- 🎯 Prompt Engineering (2022-2024): "how to ask" — craft a well-worded instruction so the LLM complies
- 🧠 Context Engineering (2025): "what the model knows before you ask" — design the complete information environment the LLM sees at inference time
- 🔧 Harness Engineering (2026): "how to supervise the work" — build the full infrastructure an AI agent runs inside
- 📊 Core formula: Agent = Model + Harness (LangChain's definition)
- 🚀 The trend: "write better prompts" → "provide better context" → "build better systems"
Background: The Three Eras of AI Engineering
If you started playing with ChatGPT back in 2022, you have lived through this evolution:
| Era | Core question | Key people / companies | Analogy |
|---|---|---|---|
| Prompt Engineering | How do you talk to the model? | OpenAI, assorted "prompt cheat sheets" | Writing one perfect email |
| Context Engineering | What does the model know? | Karpathy, Tobi Lütke, Anthropic | Preparing the whole briefing folder |
| Harness Engineering | How do you supervise the agent? | LangChain, OpenAI Codex, Martin Fowler | Designing the whole office plus its management system |
💡 One-line summary:
Prompt Engineering = how you ask
Context Engineering = what the model knows before you ask
Harness Engineering = how the whole system operates
Part 1: Prompt Engineering — Where It All Began (2022-2024)
What Is Prompt Engineering?
Prompt Engineering is the craft of communicating with an LLM: carefully wording a text instruction so the model produces the output you want.
In the early days of ChatGPT, everyone obsessed over how to write prompts:
```python
# "Top prompt tricks" circa 2023

# ❌ A weak prompt
prompt_bad = "Write me an email"

# ✅ A strong prompt (role + instructions + format)
prompt_good = """
You are a professional business-communication consultant.
Please write a follow-up email to a client with these requirements:
- Tone: professional but friendly
- Length: 150-200 words
- Include: key takeaways from the last meeting
- Closing: propose a time for the next meeting
Background: the last meeting was about the Q3 marketing campaign budget.
"""
```
Core Prompt Engineering Techniques
| Technique | Principle | Example |
|---|---|---|
| Role Prompting | Assign the model a role | "You are a senior Python developer" |
| Few-shot Examples | Provide a few worked examples | "Input: X → Output: Y" × 3 |
| Chain-of-Thought | Ask the model to reason step by step | "Let's think step by step" |
| Output Formatting | Specify the output format | "Reply in JSON format" |
| Constraints | Set limits | "Do not exceed 100 words" |
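These techniques compose into a single prompt. A minimal sketch of assembling a role, a format constraint, and few-shot examples together (the helper name and the example data are hypothetical, not from any library):

```python
# Sketch: composing role + few-shot examples + an output-format constraint.
# The helper and the sentiment examples are hypothetical illustrations.
def build_few_shot_prompt(role: str, examples: list[tuple[str, str]], query: str) -> str:
    lines = [f"You are {role}.", "Reply in JSON format.", ""]
    for inp, out in examples:  # few-shot: show the input → output mapping
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    lines.append(f"Input: {query}")
    lines.append("Output:")   # leave the final answer for the model to fill in
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    role="a sentiment classifier",
    examples=[("great product", '{"label": "positive"}'),
              ("arrived broken", '{"label": "negative"}')],
    query="works as advertised",
)
```

The role line sets behavior, the examples pin down the mapping, and the trailing `Output:` invites the model to complete the pattern in the same JSON shape.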
The Limits of Prompt Engineering: Why Isn't It Enough?
The moment you move from "ChatGPT as a toy" to "AI system in production", prompt engineering's problems surface:
Problem 1: The model doesn't remember previous conversations
```python
# Session 1
user: "My project uses FastAPI + PostgreSQL"
assistant: "OK, I suggest SQLAlchemy as the ORM..."

# Session 2 (a fresh session; the model remembers nothing)
user: "How do I add authentication?"
assistant: "Which framework are you using?"  # 😤 explaining everything again
```
Problem 2: A static prompt can't adapt to different situations
```python
# The same customer-service bot facing completely different customers
prompt = "You are a customer-service assistant. Please answer politely."

# Customer A: a VIP complaining for the third time → needs the case history
# Customer B: a new customer with basic questions → needs product information
# Customer C: a technical issue → needs the API documentation

# How could one static prompt possibly cover three totally different scenarios?
```
Problem 3: Prompts don't scale
In mid-2024, Andrej Karpathy (former Tesla AI Director and OpenAI co-founder) began making the point that real engineering is not just writing the prompt, but designing everything that goes inside the context window.
🎯 The fundamental problem with Prompt Engineering: it only optimizes how you ask, not what the model knows. It's like writing a perfect email to a recipient who has none of your background information: the finest wording won't help.
Part 2: Context Engineering — The Real Paradigm Shift (2025)
How Karpathy and Tobi Lütke Define It
In June 2025, statements from two heavyweights made the term "Context Engineering" go viral:
Tobi Lütke (Shopify CEO):
"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."
Andrej Karpathy (formerly Tesla AI / OpenAI):
"+1 for 'context engineering' over 'prompt engineering'. People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."
What Is Context Engineering?
Anthropic (the company behind Claude) gives a precise definition:
Context refers to the set of tokens included when sampling from a large-language model. Context Engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference.
In plain terms:
- Prompt Engineering = the instruction you write
- Context Engineering = every token in the context window at the moment the model runs inference, including the system prompt, conversation history, retrieved documents, tool results, metadata, and more
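Concretely, "the context" is just the full message array handed to the model, not the last user question alone. A minimal sketch of assembling one (the retrieved documents, tool results, and history below are hypothetical placeholders):

```python
# Sketch: everything assembled here IS the context the model sees.
# All data values are hypothetical placeholders for illustration.
def assemble_context(system_prompt: str, history: list[dict],
                     retrieved_docs: list[str], tool_results: list[str],
                     user_query: str) -> list[dict]:
    docs_block = "\n\n".join(retrieved_docs)   # retrieved knowledge (RAG)
    tools_block = "\n".join(tool_results)      # results of earlier tool calls
    return (
        [{"role": "system", "content": system_prompt}]
        + history                              # conversation so far
        + [{"role": "user", "content":
            f"Reference documents:\n{docs_block}\n\n"
            f"Tool results:\n{tools_block}\n\n"
            f"Question: {user_query}"}]
    )

messages = assemble_context(
    system_prompt="You are a helpful assistant.",
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello!"}],
    retrieved_docs=["Annual leave: 15 days per year."],
    tool_results=["calendar: 3 days already booked"],
    user_query="How many leave days do I have left?",
)
```

The prompt you write is one string inside this structure; Context Engineering is the curation of everything else in it.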
The Four Strategies of Context Engineering
LangChain CEO Harrison Chase distills Context Engineering into four strategies: Write, Select, Compress, Isolate.
Strategy 1: Write (persist important information)
Don't rely on ephemeral context; write important information to durable storage.
```python
# ===== Scratchpad Pattern =====
# The agent writes intermediate results to a scratchpad for use in later steps
class AgentScratchpad:
    def __init__(self):
        self.entries = []

    def write(self, key: str, value: str):
        self.entries.append({"key": key, "value": value})

    def read(self) -> str:
        return "\n".join(
            f"[{e['key']}]: {e['value']}" for e in self.entries
        )

# An agent working through a research task
scratchpad = AgentScratchpad()

# Step 1: Search
scratchpad.write("search_results", "Found 3 relevant papers on RAG...")

# Step 2: Analyze
scratchpad.write("analysis", "Paper A uses dense retrieval, Paper B uses sparse...")

# Step 3: The next LLM call gets the full context
next_prompt = f"""
Based on the following research notes, write a summary:
{scratchpad.read()}
"""
```
Strategy 2: Select (pick only the relevant information)
The context window is finite; you can't stuff everything in. Select the most relevant information.
```python
# ===== RAG: the classic Select strategy =====
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Build the vector store
vectorstore = Chroma.from_documents(
    documents=company_docs,
    embedding=OpenAIEmbeddings()
)

# When the user asks, retrieve only the most relevant documents
relevant_docs = vectorstore.similarity_search(
    query="How do I apply for annual leave?",
    k=3  # keep only the top-3 documents
)

# Combine into context
context = "\n\n".join([doc.page_content for doc in relevant_docs])
prompt = f"""Answer the question using these company policy documents:
{context}

Question: How do I apply for annual leave?
"""
```
Strategy 3: Compress (shrink an overlong context)
When the conversation history grows too long, compress it.
```python
# ===== Context Compaction =====
# Anthropic uses this technique heavily in their long-running-agent research
def compact_context(conversation_history: list, model) -> list:
    """Compress an overlong conversation history into a summary."""
    if count_tokens(conversation_history) < 80000:
        return conversation_history  # short enough, no compaction needed

    summary_prompt = """
    Please compress the following conversation history into a concise summary,
    preserving every important decision, result, and unfinished task.

    Conversation history:
    {history}
    """
    summary = model.generate(summary_prompt.format(
        history=conversation_history
    ))
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}]
```
Strategy 4: Isolate (separate the context of different tasks)
Different sub-tasks need different context; mixing them causes mutual interference.
```python
# ===== Sub-agent Context Isolation =====
# Each sub-agent gets its own independent context window
async def research_with_isolation(topic: str):
    # Sub-agent 1: gather material (needs only search context)
    researcher = Agent(
        system_prompt="You are a research assistant. Find relevant material.",
        tools=[web_search, arxiv_search]
    )
    findings = await researcher.run(f"Find the latest research on {topic}")

    # Sub-agent 2: write the report (needs only the research results)
    writer = Agent(
        system_prompt="You are a technical writer. Write a report from research results.",
        tools=[]
    )
    report = await writer.run(f"Write a report based on these findings:\n{findings}")
    return report
```
🎯 The essence of Context Engineering:
Prompt Engineering is what you say; Context Engineering is what the model knows. A perfect prompt drowned under 6K tokens of irrelevant conversation history performs just as badly. Context Engineering builds the container; Prompt Engineering is only a small part of what goes inside it.
Part 3: Harness Engineering — Engineering Discipline for the Agent Era (2026)
From Context to Harness: Why Evolve Again?
Context Engineering solved "what does the model know". But once AI agents started running autonomously for hours or even days, new problems appeared:
- The agent drifts around step 50 and forgets its original goal
- The agent loops furiously, repeating the same mistake
- The agent calls the wrong tool and nobody catches it
- State is lost across context windows
- With no feedback loop, the agent never learns it did something wrong
These are not problems context can solve; what you need is the design of the entire runtime environment.
Harness Engineering Defined
In March 2026, LangChain CEO Harrison Chase proposed a formula with far-reaching consequences:
Agent = Model + Harness
"If you're not the model, you're the harness."
Martin Fowler (one of software engineering's most influential figures) elaborated in an April 2026 article:
The word "harness" is borrowed from horsemanship: a horse is powerful and fast, but without reins, saddle, and bridle it runs wild. The AI model is the horse. The harness is all the equipment that channels its power. The engineer is the rider.
Which Companies Are Driving Harness Engineering?
| Company / person | Contribution | Key article / date |
|---|---|---|
| LangChain (Harrison Chase) | Proposed the Agent = Model + Harness formula; defined the six harness components | The Anatomy of an Agent Harness (2026-03) |
| Anthropic | Long-running agent harness research; context engineering for agents | Effective harnesses for long-running agents (2025-11) |
| OpenAI | The Codex team's hands-on harness engineering experience | Harness engineering: leveraging Codex (2026) |
| Martin Fowler (ThoughtWorks) | Defined harness engineering from a software-engineering perspective | Harness engineering for coding agent users (2026-04) |
| Phil Schmid (Hugging Face) | Harness as the primary tool against model drift | The importance of Agent Harness in 2026 (2026) |
The Six Core Components of a Harness
LangChain decomposes a complete agent harness into six components:
Component 1: System Prompts + Skills (progressive instructions)
Don't dump every instruction at once; progressively load the skills relevant to the task.
```python
# ===== Progressive Skill Loading =====
# The agent loads a skill only when it is needed
SKILLS = {
    "code_review": """
## Code Review Skill
- Check for security vulnerabilities
- Verify test coverage > 80%
- Flag any hardcoded credentials
""",
    "bug_fix": """
## Bug Fix Skill
- Reproduce the bug first
- Write a failing test
- Fix and verify
""",
    "feature_dev": """
## Feature Development Skill
- Start with the spec
- Write tests first (TDD)
- Implement incrementally
"""
}

def build_system_prompt(task_type: str) -> str:
    base_prompt = "You are a senior software engineer."
    # Load only the relevant skill
    if task_type in SKILLS:
        return base_prompt + "\n\n" + SKILLS[task_type]
    return base_prompt
```
Component 2: Tools + Execution Environment
An agent needs tools to act, but managing the tools' execution environment matters even more.
```python
# ===== Sandboxed Tool Execution =====
import subprocess

class SandboxedCodeRunner:
    """Execute agent-generated code in an isolated environment."""
    def __init__(self, timeout: int = 30):
        self.timeout = timeout

    def run(self, code: str) -> dict:
        try:
            result = subprocess.run(
                ["python", "-c", code],
                capture_output=True,
                text=True,
                timeout=self.timeout,
                # Key point: restrict the environment
                env={"PATH": "/usr/bin"},
            )
            return {
                "success": result.returncode == 0,
                "stdout": result.stdout,
                "stderr": result.stderr
            }
        except subprocess.TimeoutExpired:
            return {
                "success": False,
                "error": f"Execution timed out after {self.timeout}s"
            }
```
Component 3: Hooks + Middleware (deterministic behavior injection)
This is Harness Engineering's key innovation: injecting deterministic logic between each of the model's steps.
```python
# ===== Middleware Pattern =====
# The core design of LangChain's Deep Agents
class LoopDetectionMiddleware:
    """Detect whether the agent is stuck in a loop."""
    def __init__(self, max_similar_actions: int = 3):
        self.recent_actions = []
        self.max_similar = max_similar_actions

    def on_tool_call(self, tool_name: str, tool_args: dict) -> str | None:
        action_signature = f"{tool_name}:{hash(str(tool_args))}"
        self.recent_actions.append(action_signature)
        # Check the most recent actions for repetition
        recent = self.recent_actions[-self.max_similar:]
        if len(recent) == self.max_similar and len(set(recent)) == 1:
            return (
                "⚠️ LOOP DETECTED: you have performed the same action "
                f"{self.max_similar} times in a row. Change your strategy."
            )
        return None  # all clear, continue

class AutoLintMiddleware:
    """Automatically run the linter whenever the agent writes code."""
    def on_file_write(self, filepath: str, content: str) -> str | None:
        if filepath.endswith(".py"):
            lint_result = run_ruff(content)  # run_ruff: a project-specific helper
            if lint_result.errors:
                return f"Lint errors found:\n{lint_result.errors}\nPlease fix."
        return None
```
💡 The power of middleware: using middleware alone (loop detection + auto-lint + self-verification), LangChain's coding agent went from 52.8% to 66.5% on the Terminal Bench 2.0 benchmark, with the model completely unchanged; only the harness improved. That is the core thesis of Harness Engineering: the bottleneck is not the model, it is the system around the model.
Component 4: Sub-agent Coordination
Complex tasks need multiple specialized agents working together.
```python
# ===== Anthropic's Three-Agent Pattern =====
# Generator → Evaluator → Planner
class ThreeAgentHarness:
    def __init__(self):
        self.planner = Agent(role="planner")
        self.generator = Agent(role="generator")
        self.evaluator = Agent(role="evaluator")

    async def run_task(self, feature_spec: str):
        # Step 1: the Planner decomposes the task
        plan = await self.planner.run(
            f"Break the following feature into executable steps:\n{feature_spec}"
        )
        for step in plan.steps:
            # Step 2: the Generator executes
            result = await self.generator.run(
                f"Execute the following step:\n{step}"
            )
            # Step 3: the Evaluator checks quality
            evaluation = await self.evaluator.run(
                f"Evaluate this result:\n{result}\n\nCriteria: {step.criteria}"
            )
            # If it falls short, make the Generator redo it
            if evaluation.score < 0.8:
                result = await self.generator.run(
                    f"Your previous result needs improvement:\n"
                    f"Feedback: {evaluation.feedback}\n"
                    f"Please redo the step."
                )
```
🎯 Anthropic's key finding: a model cannot reliably evaluate its own output, so you need an independent Evaluator agent, similar to a GAN's Generator/Discriminator architecture. This insight reshaped the entire harness design.
Component 5: Persistent Memory (memory across sessions)
```python
# ===== Convention Memory =====
# The agent remembers the project's conventions and preferences
import json
import os
from datetime import datetime

class ProjectMemory:
    """Durable project memory."""
    def __init__(self, project_path: str):
        self.memory_file = f"{project_path}/.agent_memory.json"
        self.conventions = self.load()

    def load(self) -> dict:
        if os.path.exists(self.memory_file):
            with open(self.memory_file) as f:
                return json.load(f)
        return {"style": {}, "decisions": [], "known_issues": []}

    def save(self):
        with open(self.memory_file, "w") as f:
            json.dump(self.conventions, f, indent=2)

    def add_convention(self, key: str, value: str):
        """Remember a project convention."""
        self.conventions["style"][key] = value
        self.save()

    def add_decision(self, decision: str, reason: str):
        """Remember a design decision."""
        self.conventions["decisions"].append({
            "decision": decision,
            "reason": reason,
            "timestamp": datetime.now().isoformat()
        })
        self.save()

    def get_context(self) -> str:
        """Build a context string for the agent."""
        decisions = "\n".join(
            d["decision"] for d in self.conventions["decisions"][-10:]
        )
        return f"""
Project Conventions:
{json.dumps(self.conventions['style'], indent=2)}

Previous Decisions:
{decisions}
"""

# Usage
memory = ProjectMemory("/my-project")
memory.add_convention("naming", "Use snake_case for Python files")
memory.add_convention("testing", "Every feature needs unit + integration tests")
memory.add_decision(
    "Used FastAPI over Flask",
    "Better async support + OpenAPI auto-generation"
)
```
Component 6: Evaluation Loops (automated quality assurance)
```python
# ===== Self-Verification Loop =====
# The agent automatically verifies code after writing it
class VerificationLoop:
    def __init__(self, sandbox: SandboxedCodeRunner):
        self.sandbox = sandbox

    async def verify_code_change(self, code_path: str, test_file: str) -> dict:
        # (assumes the sandbox can execute shell commands, not just Python source)
        # Step 1: run the existing tests
        test_result = self.sandbox.run(f"pytest {test_file} -v")
        # Step 2: run the type checker
        type_result = self.sandbox.run(f"mypy {code_path} --strict")
        # Step 3: run a security scan
        security_result = self.sandbox.run(f"bandit -r {code_path}")
        return {
            "tests_pass": test_result["success"],
            "type_safe": type_result["success"],
            "security_clean": security_result["success"],
            "all_pass": all([
                test_result["success"],
                type_result["success"],
                security_result["success"]
            ])
        }
```
OpenAI's Hands-on Experience: What Did the Codex Team Learn?
In early 2026 OpenAI published an important article, "Harness engineering: leveraging Codex in an agent-first world", describing their experience writing over a million lines of code with Codex.
Key Lessons
Lesson 1: Entropy and Garbage Collection
Codex (the agent) copies whatever patterns already exist in the repo, even suboptimal ones. Over time this causes code drift.
The OpenAI team's initial fix was to spend 20% of every Friday cleaning up "AI slop". That approach obviously doesn't scale.
Lesson 2: Tests as the Core of the Harness
Without good test coverage, the agent acts blindly. Tests are the agent's guardrails: they define the boundary of "correct".
Lesson 3: Review Becomes the New Bottleneck
When agents write code faster than humans can review it, the bottleneck shifts from coding to reviewing. This is the fundamental shift of agent-first development.
🚀 OpenAI's core insight: their biggest challenge wasn't model capability, but "designing environments, feedback loops, and control systems" — in other words, Harness Engineering.
Anthropic's Research on Long-Running Agents
In November 2025, Anthropic published "Effective harnesses for long-running agents", which surfaced a counterintuitive finding:
A larger context window often makes an agent perform worse, not better.
Why doesn't more context mean better results?
Imagine a software engineer on an 8-hour shift:
- First few hours: clear thinking, high output
- Hour 5: starting to forget earlier decisions
- Hour 8: fatigued, making careless mistakes
AI agents behave exactly the same inside a long context window: their attention drifts and they forget important early information.
Anthropic's Answer: a Shift-based Harness
```python
# ===== Anthropic's Shift-based Pattern =====
# Modeled on engineers working in shifts
class ShiftBasedHarness:
    def __init__(self, max_steps_per_shift: int = 50):
        self.max_steps = max_steps_per_shift
        self.current_step = 0
        self.handoff_notes = ""

    async def run_long_task(self, task: str):
        while not self.is_complete(task):
            # Start a new "shift"
            agent = self.create_fresh_agent()

            # Give the new agent the previous shift's handoff notes
            context = f"""
Task: {task}

Handoff notes from the previous shift:
{self.handoff_notes}

Please continue the work.
"""
            # The agent works until the shift ends
            result = await agent.run(
                context,
                max_steps=self.max_steps
            )

            # Shift over: produce handoff notes for the next one
            self.handoff_notes = await agent.run(
                "Write handoff notes covering:\n"
                "1. Work completed\n"
                "2. Current state\n"
                "3. Next steps\n"
                "4. Known issues"
            )
            self.current_step += self.max_steps
```
💡 Core insight: agent reliability comes not from a smarter model but from better scaffolding. That is the essence of Harness Engineering: build an environment in which even an imperfect model can complete the task reliably.
A Deep Comparison: What Fundamentally Separates the Three Eras
| Dimension | Prompt Engineering | Context Engineering | Harness Engineering |
|---|---|---|---|
| Period | 2022-2024 | 2025 | 2026+ |
| Core question | How do you ask? | What does the model know? | How does the system operate? |
| Optimization target | Quality of a single response | Information quality of the context window | Long-run agent reliability |
| Scope | One instruction | The whole context window | Everything outside the model |
| Interaction mode | One-turn Q&A | Multi-turn + RAG | Autonomous agent loops |
| Relationship | The foundation | Contains Prompt Engineering | Contains Context Engineering |
| Analogy | Writing a beautiful letter | Preparing a full briefing package | Building the whole office plus its management system |
| Failure mode | Model misreads the instruction | Context is missing key information | Agent drift, loops, tool misuse |
| Required skills | Verbal precision, NLP intuition | Information architecture | Systems engineering |
They Build on Each Other; Nothing Gets Replaced
🎯 How the three nest:
Prompt Engineering is a subset of Context Engineering
Context Engineering is a subset of Harness Engineering
No layer disappears; each is simply absorbed into a larger framework
A Practical Guide: How to Start Doing Harness Engineering
Step 1: Assess which stage you are at
Step 2: An evolution path from simple to complex
```python
# ===== Level 1: Prompt Engineering =====
# Simplest: just tune the prompt
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant. Answer in Traditional Chinese."
    }, {
        "role": "user",
        "content": "How do I read a CSV in Python?"
    }]
)

# ===== Level 2: Context Engineering =====
# Add RAG, memory, and structured context
from langchain.chains import RetrievalQA

# Build the retrieval pipeline
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(),
    chain_type="stuff"
)

# The context automatically includes retrieved documents
result = qa.invoke({"query": "What is the company's annual leave policy?"})

# ===== Level 3: Harness Engineering =====
# A full agent harness with tools, middleware, and evals
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

# Define tools
tools = [web_search, code_runner, file_manager]

# Build the agent with persistent memory
agent = create_react_agent(
    model=ChatAnthropic(model="claude-sonnet-4-20250514"),
    tools=tools,
    checkpointer=MemorySaver(),  # persisted state
)

# Add middleware (illustrative API)
agent.add_middleware(LoopDetectionMiddleware(max_similar_actions=3))
agent.add_middleware(AutoLintMiddleware())
agent.add_middleware(TokenBudgetMiddleware(max_tokens=100000))

# Add an evaluation hook
agent.add_hook(
    "on_task_complete",
    lambda result: verify_with_tests(result)
)

# Run
result = await agent.ainvoke({
    "messages": [{"role": "user", "content": "Implement user authentication"}]
})
```
Step 3: A Production Harness Checklist
- Loop Detection: intervene once the agent repeats the same action 3+ times
- Token Budget: cap the maximum token spend per task
- Timeout: every tool call has a timeout
- Self-verification: every agent output is verified automatically (tests, linting, type checks)
- Handoff Notes: long-running tasks get context compaction + handoffs
- Observability: log every tool call and model interaction
- Guardrails: restrict which tools and resources the agent can access
- Human-in-the-loop: high-risk actions require human approval
- Retry Logic: tool failures retry gracefully
- Memory Persistence: conventions and decisions persist across sessions
Looking Ahead
1. Training and Inference Environments Converge
Phil Schmid (Hugging Face) predicts the harness will become the main tool for detecting model drift. Labs will feed the data harnesses collect straight back into training, producing models that don't "get tired" on long tasks.
2. Standardized Agent Harness Frameworks
Just as web development got React/Django/Rails, AI agent development will get standardized harness frameworks. LangChain's Deep Agents and Anthropic's Claude Agent SDK are early movers in this trend.
3. Harness-as-a-Service
Managed harness platforms may emerge: you define the agent's goals and constraints, and the platform supplies the full middleware stack, evaluation pipeline, and monitoring dashboard.
4. Agent-to-Agent Harnesses
Once multiple agents must collaborate (say, a research agent + coding agent + review agent), harness design evolves from "supervising one agent" to "supervising a team of agents".
Technical Takeaways
1. The Bottleneck Has Moved from the Model to the Harness
LangChain's experiment proved it: the same model with a better harness (middleware + self-verification) lifted its benchmark score from 52.8% to 66.5%. The model is not the limiting factor; the harness is.
2. Software-Engineering Discipline Returns
Harness Engineering is, at heart, the application of traditional software-engineering best practices (testing, monitoring, state management, error handling) to AI agent development. AI engineering is no longer "magic prompt alchemy"; it is genuine engineering.
3. Constraining = Empowering
A counterintuitive finding: narrowing an agent's action space raises its productivity. It's like a good CI/CD pipeline: it limits what a developer can deploy, yet makes the whole team faster and more confident.
4. The Model Is the Horse; You're the Rider
Martin Fowler's analogy captures this era perfectly: the model is the horse, strong, fast, intelligent. But without a harness (reins, saddle, bridle) it runs wild. Your job is not to train the horse but to design the system that channels its power.
Summary
Core Takeaways
- Prompt Engineering (2022-2024): learn how to talk to AI. The core techniques (CoT, few-shot, role prompting) still matter today, but they are only the bottom layer of the stack.
- Context Engineering (2025): learn how to give AI the right information. The paradigm shift championed by Karpathy and Tobi Lütke: from "write better instructions" to "design a better information environment". RAG, memory, tool results, and conversation management all live here.
- Harness Engineering (2026): learn how to supervise AI while it works. A new discipline driven jointly by LangChain, Anthropic, and OpenAI: build all the infrastructure outside the model, including middleware, evaluation loops, sub-agent coordination, guardrails, and observability.
The Evolution in Three Sentences
2022: "I wrote a killer prompt and ChatGPT finally behaves!"
2025: "I designed a context pipeline so the model has the right information to make decisions."
2026: "I built a harness, and the agent runs on its own for 8 hours without going off the rails."
Related Resources
Context Engineering
- 📄 Anthropic: Effective context engineering for AI agents
- 🐦 Andrej Karpathy: Context engineering tweet (2025-06-25)
- 🐦 Tobi Lütke (Shopify CEO): Context engineering tweet (2025-06-18)
- 📄 LangChain: Context Engineering for Agents
- 📄 Neo4j: Why AI Teams Are Moving From Prompt to Context Engineering
Harness Engineering
- 📄 LangChain: The Anatomy of an Agent Harness (2026-03-10)
- 📄 OpenAI: Harness engineering: leveraging Codex in an agent-first world (2026)
- 📄 Anthropic: Effective harnesses for long-running agents (2025-11-26)
- 📄 Martin Fowler: Harness engineering for coding agent users (2026-04-02)
- 📄 Phil Schmid: The importance of Agent Harness in 2026
- 📄 LangChain: Improving Deep Agents with harness engineering
Further Reading
- 📄 Epsilla: The Third Evolution: Why Harness Engineering Replaced Prompting in 2026
- 📄 LangChain: State of Agent Engineering 2025 Report (survey of 1,300+ developers)
- 📄 Elastic: Context engineering vs. prompt engineering
- 📺 IBM Technology: Context Engineering vs. Prompt Engineering: Smarter AI with RAG & Agents
The evolution of AI engineering teaches a deep lesson: the model's intelligence is only the starting point. What really decides whether an AI system succeeds is everything you build around the model, from a single prompt, to a full context pipeline, to a complete agent harness. As the OpenAI Codex team's experience shows, their biggest challenge wasn't model capability but "designing environments, feedback loops, and control systems". Welcome to the era of Harness Engineering. 🧬✨