Billy Tse
April 6, 2026 · 35 min read

From Prompt Engineering to Context Engineering to Harness Engineering: The Three-Act Evolution of AI Engineering

A deep dive into how AI engineering evolved from Prompt Engineering (2022-24) to Context Engineering (2025) to Harness Engineering (2026). Learn how Karpathy, Tobi Lütke, Anthropic, OpenAI, and LangChain define these three eras, with a full hands-on guide and code examples.

AI · NLP

"Prompt Engineering is dead. Context Engineering is already showing its limits. The future is Harness Engineering." — a synthesis of views from Andrej Karpathy, Tobi Lütke (Shopify CEO), and Harrison Chase (LangChain CEO)

TL;DR

In 2022 you learned how to talk to AI. In 2025 you learned how to give AI the right information. In 2026 you learn how to manage what AI does.

Key takeaways:

  • 🎯 Prompt Engineering (2022-2024): "how to ask". Craft a single polished instruction that gets the LLM to comply
  • 🧠 Context Engineering (2025): "what the model knows before you ask". Design the complete information environment the LLM sees at inference time
  • 🔧 Harness Engineering (2026): "how to manage what it does". Build the full infrastructure an AI agent runs inside
  • 📊 Core formula: Agent = Model + Harness (LangChain's definition)
  • 🚀 The trend: from "write better prompts" → "provide better context" → "build better systems"


Background: The Three Eras of AI Engineering

If you started playing with ChatGPT in 2022, you've definitely lived through this evolution:

| Era | Core question | Key people / companies | Analogy |
|---|---|---|---|
| Prompt Engineering | How do you talk to the model? | OpenAI, assorted "prompt cheat sheets" | Writing one perfect email |
| Context Engineering | What does the model know? | Karpathy, Tobi Lütke, Anthropic | Preparing the whole briefing folder |
| Harness Engineering | How do you manage what the agent does? | LangChain, OpenAI Codex, Martin Fowler | Designing the entire office plus its management system |

💡 One-sentence summary:

  • Prompt Engineering = how you ask

  • Context Engineering = what the model knows before you ask

  • Harness Engineering = how the whole system operates

Part 1: Prompt Engineering, Where It All Began (2022-2024)

What Is Prompt Engineering?

Prompt Engineering is the art of communicating with an LLM: crafting text instructions carefully so the model generates the output you want.

In the early days of ChatGPT, everyone was obsessed with how to write prompts:

```python
# 2023's "top prompt techniques"

# ❌ A bad prompt
prompt_bad = "Help me write an email"

# ✅ A good prompt (adds a role, instructions, and a format)
prompt_good = """
You are a professional business communication consultant.
Write a follow-up email to a client with these requirements:
- Tone: professional but friendly
- Length: 150-200 words
- Include: the key takeaways from our last meeting
- Closing: propose a time for the next meeting

Background: the last meeting was about the Q3 marketing campaign budget.
"""
```

Core Prompt Engineering Techniques

| Technique | Principle | Example |
|---|---|---|
| Role Prompting | Assign the model a role | "You are a senior Python developer" |
| Few-shot Examples | Give a few worked examples | "Input: X → Output: Y" × 3 |
| Chain-of-Thought | Ask the model to reason step by step | "Let's think step by step" |
| Output Formatting | Specify the output format | "Reply in JSON" |
| Constraints | Set limits | "Use no more than 100 words" |
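These techniques compose: as an illustration, here is a minimal sketch that combines few-shot examples with a chain-of-thought cue. The example reviews and labels are invented for demonstration, not taken from any real dataset:

```python
# Hypothetical labeled examples for a sentiment task
EXAMPLES = [
    ("The food was amazing!", "positive"),
    ("Terrible service, never again.", "negative"),
]

def build_few_shot_prompt(examples, query: str) -> str:
    """Assemble a few-shot prompt with a chain-of-thought cue at the end."""
    lines = ["Classify the sentiment of each review."]
    for text, label in examples:
        # Each example shows the input → output mapping
        lines.append(f"Review: {text}\nSentiment: {label}")
    # The final slot is left open for the model, with a CoT nudge
    lines.append(f"Review: {query}\nLet's think step by step.\nSentiment:")
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(EXAMPLES, "Great value for money.")
```

The point is only that a "prompt" is often assembled programmatically from parts, which already hints at why the field drifted toward context engineering.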

The Limits of Prompt Engineering: Why Isn't It Enough?

The moment you graduate from "ChatGPT as a toy" to "production AI system", the problems with prompt engineering become obvious:

Problem 1: The model doesn't remember previous conversations

```python
# Session 1
user: "My project uses FastAPI + PostgreSQL"
assistant: "Great, I'd suggest SQLAlchemy as the ORM..."

# Session 2 (new session — the model remembers nothing)
user: "How do I add authentication?"
assistant: "Which framework are you using?"  # 😤 Explaining everything again
```

Problem 2: A static prompt can't adapt to different situations

```python
# The same customer service bot faces completely different customers
prompt = "You are a customer service assistant. Please answer politely."

# Customer A: VIP, complaining for the third time → needs the account history
# Customer B: new customer, basic question → needs the product info
# Customer C: technical issue → needs the API documentation

# How can one prompt possibly handle three completely different scenarios?
```

Problem 3: Prompts don't scale

By mid-2024, Andrej Karpathy (former Tesla AI Director and OpenAI co-founder) began making the case that real engineering isn't just writing the prompt, it's designing everything that goes into the context window.

🎯 The fundamental problem with Prompt Engineering: it only optimizes "how you ask", not "what the model knows". It's like writing a perfect email to a recipient who has none of your background information: even the finest wording won't help.

Part 2: Context Engineering, the Real Paradigm Shift (2025)

How Karpathy and Tobi Lütke Define It

In June 2025, statements from two heavyweight figures made the term "Context Engineering" take off:

Tobi Lütke (Shopify CEO):

"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."

Andrej Karpathy (formerly Tesla AI / OpenAI):

"+1 for 'context engineering' over 'prompt engineering'. People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."

What Is Context Engineering?

Anthropic (the company behind Claude) gives a precise definition:

Context refers to the set of tokens included when sampling from a large-language model. Context Engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference.

In plain terms:

  • Prompt Engineering = the instruction you write
  • Context Engineering = every token in the context window at the moment the model runs inference, including the system prompt, conversation history, retrieved documents, tool results, metadata, and more
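To make that definition concrete, here is a minimal sketch of how such a context might be assembled from its parts at inference time. The message schema, function name, and sample data are illustrative, not any specific framework's API:

```python
def assemble_context(system_prompt, history, retrieved_docs, tool_results, user_msg):
    """Build the full message list the model will actually see."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history  # prior turns (possibly compacted)
    if retrieved_docs:
        # Retrieved documents become part of the context, not the prompt
        docs = "\n\n".join(retrieved_docs)
        messages.append({"role": "system", "content": f"Relevant documents:\n{docs}"})
    for name, result in tool_results:
        # Tool outputs are tokens in the window too
        messages.append({"role": "tool", "name": name, "content": result})
    messages.append({"role": "user", "content": user_msg})
    return messages

ctx = assemble_context(
    "You are a support agent.",
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    ["Refund policy: 30 days."],
    [("order_lookup", "Order #123: shipped")],
    "Can I get a refund?",
)
```

The user's question is one line out of six messages; everything else is context engineering.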

The Four Core Strategies of Context Engineering

LangChain CEO Harrison Chase distills Context Engineering into four strategies: Write, Select, Compress, Isolate.

Strategy 1: Write (persist information)

Don't rely on ephemeral context; write important information to persistent storage.

```python
# ===== Scratchpad Pattern =====
# The agent writes intermediate results to a scratchpad for use in later steps

class AgentScratchpad:
    def __init__(self):
        self.entries = []

    def write(self, key: str, value: str):
        self.entries.append({"key": key, "value": value})

    def read(self) -> str:
        return "\n".join(
            f"[{e['key']}]: {e['value']}" for e in self.entries
        )

# An agent doing research
scratchpad = AgentScratchpad()

# Step 1: Search
scratchpad.write("search_results", "Found 3 relevant papers on RAG...")

# Step 2: Analyze
scratchpad.write("analysis", "Paper A uses dense retrieval, Paper B uses sparse...")

# Step 3: The next LLM call gets the full context
next_prompt = f"""
Write a summary based on the following research notes:

{scratchpad.read()}
"""
```

Strategy 2: Select (choose the relevant information intelligently)

The context window is finite, so you can't stuff everything in. Pick the most relevant information.

```python
# ===== RAG: the classic Select strategy =====
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Build the vector store
vectorstore = Chroma.from_documents(
    documents=company_docs,
    embedding=OpenAIEmbeddings()
)

# When the user asks a question, retrieve only the most relevant documents
relevant_docs = vectorstore.similarity_search(
    query="How do I apply for annual leave?",
    k=3  # only the 3 most relevant documents
)

# Assemble into context
context = "\n\n".join([doc.page_content for doc in relevant_docs])

prompt = f"""Answer the question based on these company policy documents:

{context}

Question: How do I apply for annual leave?
"""
```

Strategy 3: Compress (shrink an overlong context)

When the conversation history gets too long, compress it.

```python
# ===== Context Compaction =====
# Anthropic uses this technique heavily in their long-running agent research

def compact_context(conversation_history: list, model) -> list:
    """Compress an overlong conversation history into a summary."""
    if count_tokens(conversation_history) < 80000:
        return conversation_history  # short enough, no compaction needed

    summary_prompt = """
    Compress the following conversation history into a concise summary,
    preserving every important decision, result, and unfinished task.

    Conversation history:
    {history}
    """
    summary = model.generate(summary_prompt.format(
        history=conversation_history
    ))
    return [{"role": "system", "content": f"Summary of the conversation so far: {summary}"}]
```

Strategy 4: Isolate (separate contexts for separate tasks)

Different sub-tasks need different contexts; mixing them together causes interference.

```python
# ===== Sub-agent Context Isolation =====
# Each sub-agent gets its own independent context window

async def research_with_isolation(topic: str):
    # Sub-agent 1: find sources (only needs search context)
    researcher = Agent(
        system_prompt="You are a research assistant. Find relevant sources.",
        tools=[web_search, arxiv_search]
    )
    findings = await researcher.run(f"Find the latest research on {topic}")

    # Sub-agent 2: write the report (only needs the research results)
    writer = Agent(
        system_prompt="You are a technical writer. Write reports from research findings.",
        tools=[]
    )
    report = await writer.run(f"Write a report based on these findings:\n{findings}")
    return report
```

🎯 The essence of Context Engineering:
Prompt Engineering is what you say; Context Engineering is what the model knows.

A perfect prompt drowned in 6K tokens of irrelevant conversation history performs just as badly. Context Engineering builds the container; Prompt Engineering is just one small part of what goes inside it.

Part 3: Harness Engineering, Engineering Discipline for the Agent Era (2026)

From Context to Harness: Why Evolve Again?

Context Engineering solved the "what does the model know" problem. But once AI agents started running autonomously for hours or even days, new problems appeared:

  • The agent drifts by step 50 and forgets its original goal
  • The agent loops wildly, repeating the same mistake
  • The agent calls the wrong tool and nobody catches it
  • State is lost across context windows
  • With no feedback loop, the agent doesn't know it's doing it wrong

These aren't problems context can solve; what you need is to design the entire runtime environment.

What Is Harness Engineering?

In March 2026, LangChain CEO Harrison Chase proposed a formula with far-reaching influence:

Agent = Model + Harness

「If you're not the model, you're the harness.」

Martin Fowler (a godfather figure of software engineering) elaborated in an April 2026 article:

The word "harness" is borrowed from horsemanship: a horse is powerful and fast, but without reins, saddle, and bridle it runs wild. The AI model is the horse. The harness is all the equipment that channels its power. The engineer is the rider.


Who Is Driving Harness Engineering?

| Company / person | Contribution | Key article / date |
|---|---|---|
| LangChain (Harrison Chase) | Proposed the Agent = Model + Harness formula; defined the six harness components | The Anatomy of an Agent Harness (2026-03) |
| Anthropic | Long-running agent harness research; context engineering for agents | Effective harnesses for long-running agents (2025-11) |
| OpenAI | Hands-on harness engineering lessons from the Codex team | Harness engineering: leveraging Codex (2026) |
| Martin Fowler (ThoughtWorks) | Defined harness engineering from a software engineering perspective | Harness engineering for coding agent users (2026-04) |
| Phil Schmid (Hugging Face) | Harnesses as the main tool for tackling model drift | The importance of Agent Harness in 2026 (2026) |

The Six Core Components of a Harness

LangChain breaks a complete agent harness into six components:

Component 1: System Prompts + Skills (progressive instructions)

Don't dump every instruction up front; progressively load the relevant skills based on the task.

```python
# ===== Progressive Skill Loading =====
# The agent only loads a skill when it's needed

SKILLS = {
    "code_review": """
    ## Code Review Skill
    - Check for security vulnerabilities
    - Verify test coverage > 80%
    - Flag any hardcoded credentials
    """,
    "bug_fix": """
    ## Bug Fix Skill
    - Reproduce the bug first
    - Write a failing test
    - Fix and verify
    """,
    "feature_dev": """
    ## Feature Development Skill
    - Start with the spec
    - Write tests first (TDD)
    - Implement incrementally
    """
}

def build_system_prompt(task_type: str) -> str:
    base_prompt = "You are a senior software engineer."
    # Only load the relevant skill
    if task_type in SKILLS:
        return base_prompt + "\n\n" + SKILLS[task_type]
    return base_prompt
```

Component 2: Tools + Execution Environment

An agent needs tools to act, but managing the tools' execution environment matters even more.

```python
# ===== Sandboxed Tool Execution =====
import subprocess

class SandboxedCodeRunner:
    """Run agent-generated code in an isolated environment."""

    def __init__(self, timeout: int = 30):
        self.timeout = timeout

    def run(self, code: str) -> dict:
        try:
            result = subprocess.run(
                ["python", "-c", code],
                capture_output=True,
                text=True,
                timeout=self.timeout,
                env={"PATH": "/usr/bin"},  # key point: restrict privileges
            )
            return {
                "success": result.returncode == 0,
                "stdout": result.stdout,
                "stderr": result.stderr
            }
        except subprocess.TimeoutExpired:
            return {
                "success": False,
                "error": f"Execution timed out after {self.timeout}s"
            }
```

Component 3: Hooks + Middleware (deterministic behavior injection)

This is Harness Engineering's most important innovation: injecting deterministic logic between every step the model takes.

```python
# ===== Middleware Pattern =====
# The core design of LangChain's Deep Agents

class LoopDetectionMiddleware:
    """Detect whether the agent is stuck in a loop."""

    def __init__(self, max_similar_actions: int = 3):
        self.recent_actions = []
        self.max_similar = max_similar_actions

    def on_tool_call(self, tool_name: str, tool_args: dict) -> str | None:
        action_signature = f"{tool_name}:{hash(str(tool_args))}"
        self.recent_actions.append(action_signature)

        # Check whether the most recent actions are all identical
        recent = self.recent_actions[-self.max_similar:]
        if len(recent) == self.max_similar and len(set(recent)) == 1:
            return (
                "⚠️ LOOP DETECTED: you have performed the same action "
                f"{self.max_similar} times in a row. Change strategy."
            )
        return None  # all good, keep going


class AutoLintMiddleware:
    """Automatically run the linter every time the agent writes code."""

    def on_file_write(self, filepath: str, content: str) -> str | None:
        if filepath.endswith(".py"):
            lint_result = run_ruff(content)
            if lint_result.errors:
                return f"Lint errors found:\n{lint_result.errors}\nPlease fix."
        return None
```

💡 The power of middleware: using middleware alone (loop detection + auto-lint + self-verification), LangChain's coding agent went from 52.8% to 66.5% on the Terminal Bench 2.0 benchmark, with no change to the model, only improvements to the harness. That is the core thesis of Harness Engineering: the bottleneck isn't the model, it's the system around the model.

Component 4: Sub-agent Coordination

Complex tasks need multiple specialized agents working together.

```python
# ===== Anthropic's Three-Agent Pattern =====
# Generator → Evaluator → Planner

class ThreeAgentHarness:
    def __init__(self):
        self.planner = Agent(role="planner")
        self.generator = Agent(role="generator")
        self.evaluator = Agent(role="evaluator")

    async def run_task(self, feature_spec: str):
        # Step 1: The Planner breaks the task down
        plan = await self.planner.run(
            f"Break this feature into executable steps:\n{feature_spec}"
        )

        for step in plan.steps:
            # Step 2: The Generator executes
            result = await self.generator.run(
                f"Execute this step:\n{step}"
            )

            # Step 3: The Evaluator checks quality
            evaluation = await self.evaluator.run(
                f"Evaluate this result:\n{result}\n\nCriteria: {step.criteria}"
            )

            # If it falls short, ask the Generator to redo it
            if evaluation.score < 0.8:
                result = await self.generator.run(
                    f"Your previous result needs improvement:\n"
                    f"Feedback: {evaluation.feedback}\n"
                    f"Please redo this step."
                )
```

🎯 Anthropic's key finding: a model cannot reliably evaluate its own output, so you need an independent Evaluator agent, much like the Generator/Discriminator architecture of a GAN. This insight reshaped the whole harness design.

Component 5: Persistent Memory (cross-session memory)

```python
# ===== Convention Memory =====
# The agent remembers the project's conventions and preferences
import os
import json
from datetime import datetime

class ProjectMemory:
    """Persistent project memory."""

    def __init__(self, project_path: str):
        self.memory_file = f"{project_path}/.agent_memory.json"
        self.conventions = self.load()

    def load(self) -> dict:
        if os.path.exists(self.memory_file):
            with open(self.memory_file) as f:
                return json.load(f)
        return {"style": {}, "decisions": [], "known_issues": []}

    def save(self):
        with open(self.memory_file, "w") as f:
            json.dump(self.conventions, f, indent=2)

    def add_convention(self, key: str, value: str):
        """Remember a project convention."""
        self.conventions["style"][key] = value
        self.save()

    def add_decision(self, decision: str, reason: str):
        """Remember a design decision."""
        self.conventions["decisions"].append({
            "decision": decision,
            "reason": reason,
            "timestamp": datetime.now().isoformat()
        })
        self.save()

    def get_context(self) -> str:
        """Build a context string for the agent."""
        decisions = "\n".join(
            d["decision"] for d in self.conventions["decisions"][-10:]
        )
        return f"""
Project Conventions:
{json.dumps(self.conventions['style'], indent=2)}

Previous Decisions:
{decisions}
"""

# Usage
memory = ProjectMemory("/my-project")
memory.add_convention("naming", "Use snake_case for Python files")
memory.add_convention("testing", "Every feature needs unit + integration tests")
memory.add_decision(
    "Used FastAPI over Flask",
    "Better async support + OpenAPI auto-generation"
)
```

Component 6: Evaluation Loops (automated quality assurance)

```python
# ===== Self-Verification Loop =====
# The agent automatically verifies its code after writing it

class VerificationLoop:
    def __init__(self, sandbox: SandboxedCodeRunner):
        self.sandbox = sandbox

    async def verify_code_change(self, code: str, test_file: str) -> dict:
        # Step 1: Run the existing tests
        test_result = self.sandbox.run(f"pytest {test_file} -v")

        # Step 2: Run the type checker
        type_result = self.sandbox.run(f"mypy {code} --strict")

        # Step 3: Run a security scan
        security_result = self.sandbox.run(f"bandit -r {code}")

        return {
            "tests_pass": test_result["success"],
            "type_safe": type_result["success"],
            "security_clean": security_result["success"],
            "all_pass": all([
                test_result["success"],
                type_result["success"],
                security_result["success"]
            ])
        }
```

OpenAI in Practice: What Did the Codex Team Learn?

In early 2026 OpenAI published an important article, "Harness engineering: leveraging Codex in an agent-first world", describing their experience writing over a million lines of code with Codex.

Key Lessons

Lesson 1: Entropy and Garbage Collection

Codex (the agent) copies patterns that already exist in the repo, even when those patterns are suboptimal. Over time, this leads to code drift.

The OpenAI team's initial fix: spend 20% of every Friday cleaning up "AI slop". That approach clearly doesn't scale.

Lesson 2: Tests as the Core of the Harness

If your codebase lacks good test coverage, the agent acts blindly. Tests are the agent's guardrails: they define the boundary of "correct".

Lesson 3: Review Becomes the New Bottleneck

When agents write code faster than humans can review it, the bottleneck moves from coding to reviewing. That is the fundamental shift of agent-first development.

🚀 OpenAI's core insight: their biggest challenge wasn't model capability but "designing environments, feedback loops, and control systems"; in other words, Harness Engineering.

Anthropic's Research on Long-Running Agents

In November 2025 Anthropic published "Effective harnesses for long-running agents", which surfaced a counterintuitive finding:

A bigger context window often makes an agent perform worse, not better.

Why Doesn't More Context Mean Better?

Picture a software engineer on an 8-hour shift:

  • First few hours: clear thinking, high output
  • Hour 5: starting to forget earlier decisions
  • Hour 8: fatigued, making rookie mistakes

AI agents behave exactly the same way inside a long context window: their attention drifts and they forget important early information.

Anthropic's Solution: A Shift-based Harness

```python
# ===== Anthropic's Shift-based Pattern =====
# Modeled on engineers working in shifts

class ShiftBasedHarness:
    def __init__(self, max_steps_per_shift: int = 50):
        self.max_steps = max_steps_per_shift
        self.current_step = 0
        self.handoff_notes = ""

    async def run_long_task(self, task: str):
        while not self.is_complete(task):
            # Start a new "shift" with a fresh agent
            agent = self.create_fresh_agent()

            # Give the new agent the previous shift's handoff notes
            context = f"""
            Task: {task}

            Handoff notes from the previous shift:
            {self.handoff_notes}

            Please continue the work.
            """

            # The agent works until the shift ends
            result = await agent.run(
                context,
                max_steps=self.max_steps
            )

            # Shift over: produce handoff notes
            self.handoff_notes = await agent.run(
                "Write handoff notes covering:\n"
                "1. Work completed\n"
                "2. Current state\n"
                "3. Next steps\n"
                "4. Known issues"
            )
            self.current_step += self.max_steps
```

💡 Core insight: agent reliability doesn't come from a smarter model; it comes from better scaffolding. That is the essence of Harness Engineering: build an environment in which even an imperfect model can complete tasks reliably.

A Deep Comparison: What Fundamentally Separates the Three Eras

| Dimension | Prompt Engineering | Context Engineering | Harness Engineering |
|---|---|---|---|
| Period | 2022-2024 | 2025 | 2026+ |
| Core question | How do I ask? | What does the model know? | How does the system run? |
| Optimization target | Quality of a single response | Quality of the information in the context window | Long-run agent reliability |
| Scope | One instruction | The whole context window | Everything outside the model |
| Interaction model | One-turn Q&A | Multi-turn + RAG | Autonomous agent loops |
| Relationship | Foundation | Encompasses Prompt Engineering | Encompasses Context Engineering |
| Analogy | Writing one great letter | Preparing a full briefing package | Building the whole office + management system |
| Failure mode | Model misreads the instruction | Context is missing key information | Agent drift, loops, tool misuse |
| Skills required | Language craft, NLP intuition | Information architecture | Systems engineering |

They Build on Each Other; They Don't Replace Each Other


🎯 How the three nest:

  • Prompt Engineering is a subset of Context Engineering

  • Context Engineering is a subset of Harness Engineering

  • No layer ever disappears; each is simply absorbed into a larger framework

A Hands-on Guide: How to Start Doing Harness Engineering

Step 1: Figure Out Which Stage You're At


Step 2: An Evolution Path from Simple to Complex

```python
# ===== Level 1: Prompt Engineering =====
# Simplest: just tune the prompt
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant. Answer in Traditional Chinese."
    }, {
        "role": "user",
        "content": "How do I read a CSV in Python?"
    }]
)

# ===== Level 2: Context Engineering =====
# Add RAG, memory, structured context
from langchain.chains import RetrievalQA

# Build a retrieval pipeline
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(),
    chain_type="stuff"
)

# The context automatically includes the retrieved documents
result = qa.invoke({"query": "What is the company's annual leave policy?"})

# ===== Level 3: Harness Engineering =====
# A full agent harness with tools, middleware, and evals
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

# Define tools
tools = [web_search, code_runner, file_manager]

# Build an agent with persistent memory
agent = create_react_agent(
    model=ChatAnthropic(model="claude-sonnet-4-20250514"),
    tools=tools,
    checkpointer=MemorySaver(),  # persist state
)

# Add middleware
agent.add_middleware(LoopDetectionMiddleware(max_similar_actions=3))
agent.add_middleware(AutoLintMiddleware())
agent.add_middleware(TokenBudgetMiddleware(max_tokens=100000))

# Add an evaluation hook
agent.add_hook(
    "on_task_complete",
    lambda result: verify_with_tests(result)
)

# Run
result = await agent.ainvoke({
    "messages": [{"role": "user", "content": "Implement user authentication"}]
})
```

Step 3: Production Harness Checklist

  • Loop Detection: intervene when the agent repeats the same action 3+ times in a row
  • Token Budget: cap the maximum token spend per task
  • Timeout: every tool call gets a timeout
  • Self-verification: the agent's output is verified automatically (tests, linting, type checks)
  • Handoff Notes: long-running tasks get context compaction plus handoffs
  • Observability: log every tool call and model interaction
  • Guardrails: restrict which tools and resources the agent can access
  • Human-in-the-loop: high-risk actions require human approval
  • Retry Logic: tool failures retry gracefully
  • Memory Persistence: conventions and decisions persist across sessions

Looking Ahead

1. Training and Inference Environments Converge

Phil Schmid (Hugging Face) predicts that harnesses will become the primary tool for detecting model drift. Labs will feed the data their harnesses collect straight back into training, creating models that don't "get tired" on long tasks.

2. Standardized Agent Harness Frameworks

Just as web development got React/Django/Rails, AI agent development will get standardized harness frameworks. LangChain's Deep Agents and Anthropic's Claude Agent SDK are the forerunners of this trend.

3. Harness-as-a-Service

Managed harness platforms may emerge: you define the agent's goals and constraints, and the platform provides the full middleware stack, evaluation pipeline, and monitoring dashboard.

4. Agent-to-Agent Harnesses

Once multiple agents need to collaborate (say, a research agent + coding agent + review agent), harness design evolves from "managing one agent" to "managing an agent team".

Technical Lessons

1. The Bottleneck Has Shifted from the Model to the Harness

LangChain's experiment proved it: the same model, paired with a better harness (middleware + self-verification), went from 52.8% to 66.5% on the benchmark. The model isn't the limiting factor; the harness is.

2. The Return of Software Engineering Discipline

Harness Engineering is essentially classic software engineering best practices (testing, monitoring, state management, error handling) applied to AI agent development. AI engineering is no longer "magic prompt alchemy"; it's real engineering.

3. Constraining = Empowering

A counterintuitive finding: restricting an agent's action space actually raises its productivity. It's like a good CI/CD pipeline that limits what a developer can deploy, yet makes the whole team faster and more confident.

4. The Model Is the Horse, You're the Rider

Martin Fowler's analogy sums up this era perfectly: the model is the horse, powerful, fast, intelligent. But without a harness (reins, saddle, bridle) it runs wild. Your job isn't to train the horse; it's to design the system that channels its power.

Summary

Core Takeaways

  1. Prompt Engineering (2022-2024): learn how to talk to AI. The core techniques (CoT, few-shot, role prompting) still matter today, but they're only the bottom layer of the stack.
  2. Context Engineering (2025): learn how to give AI the right information. The paradigm shift championed by Karpathy and Tobi Lütke: from "write better instructions" to "design a better information environment". RAG, memory, tool results, and conversation management all live here.
  3. Harness Engineering (2026): learn how to manage what AI does. The new discipline driven jointly by LangChain, Anthropic, and OpenAI: build all the infrastructure outside the model, including middleware, evaluation loops, sub-agent coordination, guardrails, and observability.

The Evolution in Three Sentences

2022: "I wrote a killer prompt and ChatGPT finally does what I say!"
2025: "I designed a context pipeline so the model has the right information to make decisions."
2026: "I built a harness so the agent can run on its own for 8 hours without going off the rails."

Related Resources

Context Engineering

  • 📄 Anthropic: Effective context engineering for AI agents
  • 🐦 Andrej Karpathy: Context engineering tweet (2025-06-25)
  • 🐦 Tobi Lütke (Shopify CEO): Context engineering tweet (2025-06-18)
  • 📄 LangChain: Context Engineering for Agents
  • 📄 Neo4j: Why AI Teams Are Moving From Prompt to Context Engineering

Harness Engineering

  • 📄 LangChain: The Anatomy of an Agent Harness (2026-03-10)
  • 📄 OpenAI: Harness engineering: leveraging Codex in an agent-first world (2026)
  • 📄 Anthropic: Effective harnesses for long-running agents (2025-11-26)
  • 📄 Martin Fowler: Harness engineering for coding agent users (2026-04-02)
  • 📄 Phil Schmid: The importance of Agent Harness in 2026
  • 📄 LangChain: Improving Deep Agents with harness engineering

Further Reading

  • 📄 Epsilla: The Third Evolution: Why Harness Engineering Replaced Prompting in 2026
  • 📄 LangChain: State of Agent Engineering 2025 Report (survey of 1,300+ developers)
  • 📄 Elastic: Context engineering vs. prompt engineering
  • 📺 IBM Technology: Context Engineering vs. Prompt Engineering: Smarter AI with RAG & Agents

The evolution of AI engineering teaches a profound lesson: the model's intelligence is only the starting point. What really decides whether an AI system succeeds is everything you build around the model, from a single prompt, to a full context pipeline, to a complete agent harness. As the OpenAI Codex team's experience shows, their biggest challenge wasn't model capability but "designing environments, feedback loops, and control systems". Welcome to the era of Harness Engineering. 🧬✨
