"Prompt Engineering is dead. Context Engineering is already showing its limits. The future is Harness Engineering." — a synthesis of views from Andrej Karpathy, Tobi Lütke (Shopify CEO), and Harrison Chase (LangChain CEO)
TL;DR
In 2022 you learned how to talk to AI. In 2025 you learned how to give AI the right information. In 2026 you learn how to supervise AI while it works.
Key points:
- 🎯 Prompt Engineering (2022-2024): "how to ask" — craft a well-worded instruction so the LLM complies
- 🧠 Context Engineering (2025): "what the model knows before you ask" — design the complete information environment the LLM sees at inference time
- 🔧 Harness Engineering (2026): "how to supervise the work" — build the full infrastructure an AI agent runs inside
- 📊 Core formula: Agent = Model + Harness (LangChain's definition)
- 🚀 The trend: "write better prompts" → "provide better context" → "build better systems"
Background: The Three Eras of AI Engineering
If you started playing with ChatGPT back in 2022, you have lived through this evolution:
| Era | Core question | Key people / companies | Analogy |
|---|---|---|---|
| Prompt Engineering | How do you talk to the model? | OpenAI, assorted "prompt cheat sheets" | Writing one perfect email |
| Context Engineering | What does the model know? | Karpathy, Tobi Lütke, Anthropic | Preparing the whole briefing folder |
| Harness Engineering | How do you supervise the agent? | LangChain, OpenAI Codex, Martin Fowler | Designing the whole office plus its management system |
💡 One-line summary:
Prompt Engineering = how you ask
Context Engineering = what the model knows before you ask
Harness Engineering = how the whole system operates
Part 1: Prompt Engineering — Where It All Began (2022-2024)
What Is Prompt Engineering?
Prompt Engineering is the craft of communicating with an LLM: carefully wording a text instruction so the model produces the output you want.
In the early days of ChatGPT, everyone obsessed over how to write prompts:
```python
# "Top prompt tricks" circa 2023

# ❌ A weak prompt
prompt_bad = "Write me an email"

# ✅ A strong prompt (role + instructions + format)
prompt_good = """
You are a professional business-communication consultant.
Please write a follow-up email to a client with these requirements:
- Tone: professional but friendly
- Length: 150-200 words
- Include: key takeaways from the last meeting
- Closing: propose a time for the next meeting
Background: the last meeting was about the Q3 marketing campaign budget.
"""
```
Core Prompt Engineering Techniques
| Technique | Principle | Example |
|---|---|---|
| Role Prompting | Assign the model a role | "You are a senior Python developer" |
| Few-shot Examples | Provide a few worked examples | "Input: X → Output: Y" × 3 |
| Chain-of-Thought | Ask the model to reason step by step | "Let's think step by step" |
| Output Formatting | Specify the output format | "Reply in JSON format" |
| Constraints | Set limits | "Do not exceed 100 words" |
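These techniques compose into a single prompt. A minimal sketch of assembling a role, a format constraint, and few-shot examples together (the helper name and the example data are hypothetical, not from any library):

```python
# Sketch: composing role + few-shot examples + an output-format constraint.
# The helper and the sentiment examples are hypothetical illustrations.
def build_few_shot_prompt(role: str, examples: list[tuple[str, str]], query: str) -> str:
    lines = [f"You are {role}.", "Reply in JSON format.", ""]
    for inp, out in examples:  # few-shot: show the input → output mapping
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    lines.append(f"Input: {query}")
    lines.append("Output:")   # leave the final answer for the model to fill in
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    role="a sentiment classifier",
    examples=[("great product", '{"label": "positive"}'),
              ("arrived broken", '{"label": "negative"}')],
    query="works as advertised",
)
```

The role line sets behavior, the examples pin down the mapping, and the trailing `Output:` invites the model to complete the pattern in the same JSON shape.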
The Limits of Prompt Engineering: Why Isn't It Enough?
The moment you move from "ChatGPT as a toy" to "AI system in production", prompt engineering's problems surface:
Problem 1: The model doesn't remember previous conversations
```python
# Session 1
user: "My project uses FastAPI + PostgreSQL"
assistant: "OK, I suggest SQLAlchemy as the ORM..."

# Session 2 (a fresh session; the model remembers nothing)
user: "How do I add authentication?"
assistant: "Which framework are you using?"  # 😤 explaining everything again
```
Problem 2: A static prompt can't adapt to different situations
```python
# The same customer-service bot facing completely different customers
prompt = "You are a customer-service assistant. Please answer politely."

# Customer A: a VIP complaining for the third time → needs the case history
# Customer B: a new customer with basic questions → needs product information
# Customer C: a technical issue → needs the API documentation

# How could one static prompt possibly cover three totally different scenarios?
```
Problem 3: Prompts don't scale
In mid-2024, Andrej Karpathy (former Tesla AI Director and OpenAI co-founder) began making the point that real engineering is not just writing the prompt, but designing everything that goes inside the context window.
🎯 The fundamental problem with Prompt Engineering: it only optimizes how you ask, not what the model knows. It's like writing a perfect email to a recipient who has none of your background information: the finest wording won't help.
Part 2: Context Engineering — The Real Paradigm Shift (2025)
How Karpathy and Tobi Lütke Define It
In June 2025, statements from two heavyweights made the term "Context Engineering" go viral:
Tobi Lütke (Shopify CEO):
"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."
Andrej Karpathy (formerly Tesla AI / OpenAI):
"+1 for 'context engineering' over 'prompt engineering'. People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."
What Is Context Engineering?
Anthropic (the company behind Claude) gives a precise definition:
Context refers to the set of tokens included when sampling from a large-language model. Context Engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference.
In plain terms:
- Prompt Engineering = the instruction you write
- Context Engineering = every token in the context window at the moment the model runs inference, including the system prompt, conversation history, retrieved documents, tool results, metadata, and more
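Concretely, "the context" is just the full message array handed to the model, not the last user question alone. A minimal sketch of assembling one (the retrieved documents, tool results, and history below are hypothetical placeholders):

```python
# Sketch: everything assembled here IS the context the model sees.
# All data values are hypothetical placeholders for illustration.
def assemble_context(system_prompt: str, history: list[dict],
                     retrieved_docs: list[str], tool_results: list[str],
                     user_query: str) -> list[dict]:
    docs_block = "\n\n".join(retrieved_docs)   # retrieved knowledge (RAG)
    tools_block = "\n".join(tool_results)      # results of earlier tool calls
    return (
        [{"role": "system", "content": system_prompt}]
        + history                              # conversation so far
        + [{"role": "user", "content":
            f"Reference documents:\n{docs_block}\n\n"
            f"Tool results:\n{tools_block}\n\n"
            f"Question: {user_query}"}]
    )

messages = assemble_context(
    system_prompt="You are a helpful assistant.",
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello!"}],
    retrieved_docs=["Annual leave: 15 days per year."],
    tool_results=["calendar: 3 days already booked"],
    user_query="How many leave days do I have left?",
)
```

The prompt you write is one string inside this structure; Context Engineering is the curation of everything else in it.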
The Four Strategies of Context Engineering
LangChain CEO Harrison Chase distills Context Engineering into four strategies: Write, Select, Compress, Isolate.
Strategy 1: Write (persist important information)
Don't rely on ephemeral context; write important information to durable storage.
```python
# ===== Scratchpad Pattern =====
# The agent writes intermediate results to a scratchpad for use in later steps
class AgentScratchpad:
    def __init__(self):
        self.entries = []

    def write(self, key: str, value: str):
        self.entries.append({"key": key, "value": value})

    def read(self) -> str:
        return "\n".join(
            f"[{e['key']}]: {e['value']}" for e in self.entries
        )

# An agent working through a research task
scratchpad = AgentScratchpad()

# Step 1: Search
scratchpad.write("search_results", "Found 3 relevant papers on RAG...")

# Step 2: Analyze
scratchpad.write("analysis", "Paper A uses dense retrieval, Paper B uses sparse...")

# Step 3: The next LLM call gets the full context
next_prompt = f"""
Based on the following research notes, write a summary:
{scratchpad.read()}
"""
```
Strategy 2: Select (pick only the relevant information)
The context window is finite; you can't stuff everything in. Select the most relevant information.
```python
# ===== RAG: the classic Select strategy =====
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Build the vector store
vectorstore = Chroma.from_documents(
    documents=company_docs,
    embedding=OpenAIEmbeddings()
)

# When the user asks, retrieve only the most relevant documents
relevant_docs = vectorstore.similarity_search(
    query="How do I apply for annual leave?",
    k=3  # keep only the top-3 documents
)

# Combine into context
context = "\n\n".join([doc.page_content for doc in relevant_docs])
prompt = f"""Answer the question using these company policy documents:
{context}

Question: How do I apply for annual leave?
"""
```
Strategy 3: Compress (shrink an overlong context)
When the conversation history grows too long, compress it.
```python
# ===== Context Compaction =====
# Anthropic uses this technique heavily in their long-running-agent research
def compact_context(conversation_history: list, model) -> list:
    """Compress an overlong conversation history into a summary."""
    if count_tokens(conversation_history) < 80000:
        return conversation_history  # short enough, no compaction needed

    summary_prompt = """
    Please compress the following conversation history into a concise summary,
    preserving every important decision, result, and unfinished task.

    Conversation history:
    {history}
    """
    summary = model.generate(summary_prompt.format(
        history=conversation_history
    ))
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}]
```
Strategy 4: Isolate (separate the context of different tasks)
Different sub-tasks need different context; mixing them causes mutual interference.
```python
# ===== Sub-agent Context Isolation =====
# Each sub-agent gets its own independent context window
async def research_with_isolation(topic: str):
    # Sub-agent 1: gather material (needs only search context)
    researcher = Agent(
        system_prompt="You are a research assistant. Find relevant material.",
        tools=[web_search, arxiv_search]
    )
    findings = await researcher.run(f"Find the latest research on {topic}")

    # Sub-agent 2: write the report (needs only the research results)
    writer = Agent(
        system_prompt="You are a technical writer. Write a report from research results.",
        tools=[]
    )
    report = await writer.run(f"Write a report based on these findings:\n{findings}")
    return report
```
🎯 The essence of Context Engineering:
Prompt Engineering is what you say; Context Engineering is what the model knows. A perfect prompt drowned under 6K tokens of irrelevant conversation history performs just as badly. Context Engineering builds the container; Prompt Engineering is only a small part of what goes inside it.
Part 3: Harness Engineering — Engineering Discipline for the Agent Era (2026)
From Context to Harness: Why Evolve Again?
Context Engineering solved "what does the model know". But once AI agents started running autonomously for hours or even days, new problems appeared:
- The agent drifts around step 50 and forgets its original goal
- The agent loops furiously, repeating the same mistake
- The agent calls the wrong tool and nobody catches it
- State is lost across context windows
- With no feedback loop, the agent never learns it did something wrong
These are not problems context can solve; what you need is the design of the entire runtime environment.
Harness Engineering Defined
In March 2026, LangChain CEO Harrison Chase proposed a formula with far-reaching consequences:
Agent = Model + Harness
"If you're not the model, you're the harness."
Martin Fowler (one of software engineering's most influential figures) elaborated in an April 2026 article:
The word "harness" is borrowed from horsemanship: a horse is powerful and fast, but without reins, saddle, and bridle it runs wild. The AI model is the horse. The harness is all the equipment that channels its power. The engineer is the rider.
Which Companies Are Driving Harness Engineering?
| Company / person | Contribution | Key article / date |
|---|---|---|
| LangChain (Harrison Chase) | Proposed the Agent = Model + Harness formula; defined the six harness components | The Anatomy of an Agent Harness (2026-03) |
| Anthropic | Long-running agent harness research; context engineering for agents | Effective harnesses for long-running agents (2025-11) |
| OpenAI | The Codex team's hands-on harness engineering experience | Harness engineering: leveraging Codex (2026) |
| Martin Fowler (ThoughtWorks) | Defined harness engineering from a software-engineering perspective | Harness engineering for coding agent users (2026-04) |
| Phil Schmid (Hugging Face) | Harness as the primary tool against model drift | The importance of Agent Harness in 2026 (2026) |
The Six Core Components of a Harness
LangChain decomposes a complete agent harness into six components:
Component 1: System Prompts + Skills (progressive instructions)
Don't dump every instruction at once; progressively load the skills relevant to the task.
```python
# ===== Progressive Skill Loading =====
# The agent loads a skill only when it is needed
SKILLS = {
    "code_review": """
## Code Review Skill
- Check for security vulnerabilities
- Verify test coverage > 80%
- Flag any hardcoded credentials
""",
    "bug_fix": """
## Bug Fix Skill
- Reproduce the bug first
- Write a failing test
- Fix and verify
""",
    "feature_dev": """
## Feature Development Skill
- Start with the spec
- Write tests first (TDD)
- Implement incrementally
"""
}

def build_system_prompt(task_type: str) -> str:
    base_prompt = "You are a senior software engineer."
    # Load only the relevant skill
    if task_type in SKILLS:
        return base_prompt + "\n\n" + SKILLS[task_type]
    return base_prompt
```
Component 2: Tools + Execution Environment
An agent needs tools to act, but managing the tools' execution environment matters even more.
```python
# ===== Sandboxed Tool Execution =====
import subprocess

class SandboxedCodeRunner:
    """Execute agent-generated code in an isolated environment."""
    def __init__(self, timeout: int = 30):
        self.timeout = timeout

    def run(self, code: str) -> dict:
        try:
            result = subprocess.run(
                ["python", "-c", code],
                capture_output=True,
                text=True,
                timeout=self.timeout,
                # Key point: restrict the environment
                env={"PATH": "/usr/bin"},
            )
            return {
                "success": result.returncode == 0,
                "stdout": result.stdout,
                "stderr": result.stderr
            }
        except subprocess.TimeoutExpired:
            return {
                "success": False,
                "error": f"Execution timed out after {self.timeout}s"
            }
```
Component 3: Hooks + Middleware (deterministic behavior injection)
This is Harness Engineering's key innovation: injecting deterministic logic between each of the model's steps.
```python
# ===== Middleware Pattern =====
# The core design of LangChain's Deep Agents
class LoopDetectionMiddleware:
    """Detect whether the agent is stuck in a loop."""
    def __init__(self, max_similar_actions: int = 3):
        self.recent_actions = []
        self.max_similar = max_similar_actions

    def on_tool_call(self, tool_name: str, tool_args: dict) -> str | None:
        action_signature = f"{tool_name}:{hash(str(tool_args))}"
        self.recent_actions.append(action_signature)
        # Check the most recent actions for repetition
        recent = self.recent_actions[-self.max_similar:]
        if len(recent) == self.max_similar and len(set(recent)) == 1:
            return (
                "⚠️ LOOP DETECTED: you have performed the same action "
                f"{self.max_similar} times in a row. Change your strategy."
            )
        return None  # all clear, continue

class AutoLintMiddleware:
    """Automatically run the linter whenever the agent writes code."""
    def on_file_write(self, filepath: str, content: str) -> str | None:
        if filepath.endswith(".py"):
            lint_result = run_ruff(content)  # run_ruff: a project-specific helper
            if lint_result.errors:
                return f"Lint errors found:\n{lint_result.errors}\nPlease fix."
        return None
```
💡 The power of middleware: using middleware alone (loop detection + auto-lint + self-verification), LangChain's coding agent went from 52.8% to 66.5% on the Terminal Bench 2.0 benchmark, with the model completely unchanged; only the harness improved. That is the core thesis of Harness Engineering: the bottleneck is not the model, it is the system around the model.
Component 4: Sub-agent Coordination
Complex tasks need multiple specialized agents working together.
```python
# ===== Anthropic's Three-Agent Pattern =====
# Generator → Evaluator → Planner
class ThreeAgentHarness:
    def __init__(self):
        self.planner = Agent(role="planner")
        self.generator = Agent(role="generator")
        self.evaluator = Agent(role="evaluator")

    async def run_task(self, feature_spec: str):
        # Step 1: the Planner decomposes the task
        plan = await self.planner.run(
            f"Break the following feature into executable steps:\n{feature_spec}"
        )
        for step in plan.steps:
            # Step 2: the Generator executes
            result = await self.generator.run(
                f"Execute the following step:\n{step}"
            )
            # Step 3: the Evaluator checks quality
            evaluation = await self.evaluator.run(
                f"Evaluate this result:\n{result}\n\nCriteria: {step.criteria}"
            )
            # If it falls short, make the Generator redo it
            if evaluation.score < 0.8:
                result = await self.generator.run(
                    f"Your previous result needs improvement:\n"
                    f"Feedback: {evaluation.feedback}\n"
                    f"Please redo the step."
                )
```
🎯 Anthropic's key finding: a model cannot reliably evaluate its own output, so you need an independent Evaluator agent, similar to a GAN's Generator/Discriminator architecture. This insight reshaped the entire harness design.
Component 5: Persistent Memory (memory across sessions)
```python
# ===== Convention Memory =====
# The agent remembers the project's conventions and preferences
import json
import os
from datetime import datetime

class ProjectMemory:
    """Durable project memory."""
    def __init__(self, project_path: str):
        self.memory_file = f"{project_path}/.agent_memory.json"
        self.conventions = self.load()

    def load(self) -> dict:
        if os.path.exists(self.memory_file):
            with open(self.memory_file) as f:
                return json.load(f)
        return {"style": {}, "decisions": [], "known_issues": []}

    def save(self):
        with open(self.memory_file, "w") as f:
            json.dump(self.conventions, f, indent=2)

    def add_convention(self, key: str, value: str):
        """Remember a project convention."""
        self.conventions["style"][key] = value
        self.save()

    def add_decision(self, decision: str, reason: str):
        """Remember a design decision."""
        self.conventions["decisions"].append({
            "decision": decision,
            "reason": reason,
            "timestamp": datetime.now().isoformat()
        })
        self.save()

    def get_context(self) -> str:
        """Build a context string for the agent."""
        decisions = "\n".join(
            d["decision"] for d in self.conventions["decisions"][-10:]
        )
        return f"""
Project Conventions:
{json.dumps(self.conventions['style'], indent=2)}

Previous Decisions:
{decisions}
"""

# Usage
memory = ProjectMemory("/my-project")
memory.add_convention("naming", "Use snake_case for Python files")
memory.add_convention("testing", "Every feature needs unit + integration tests")
memory.add_decision(
    "Used FastAPI over Flask",
    "Better async support + OpenAPI auto-generation"
)
```
Component 6: Evaluation Loops (automated quality assurance)
```python
# ===== Self-Verification Loop =====
# The agent automatically verifies code after writing it
class VerificationLoop:
    def __init__(self, sandbox: SandboxedCodeRunner):
        self.sandbox = sandbox

    async def verify_code_change(self, code_path: str, test_file: str) -> dict:
        # (assumes the sandbox can execute shell commands, not just Python source)
        # Step 1: run the existing tests
        test_result = self.sandbox.run(f"pytest {test_file} -v")
        # Step 2: run the type checker
        type_result = self.sandbox.run(f"mypy {code_path} --strict")
        # Step 3: run a security scan
        security_result = self.sandbox.run(f"bandit -r {code_path}")
        return {
            "tests_pass": test_result["success"],
            "type_safe": type_result["success"],
            "security_clean": security_result["success"],
            "all_pass": all([
                test_result["success"],
                type_result["success"],
                security_result["success"]
            ])
        }
```
OpenAI's Hands-on Experience: What Did the Codex Team Learn?
In early 2026 OpenAI published an important article, "Harness engineering: leveraging Codex in an agent-first world", describing their experience writing over a million lines of code with Codex.
Key Lessons
Lesson 1: Entropy and Garbage Collection
Codex (the agent) copies whatever patterns already exist in the repo, even suboptimal ones. Over time this causes code drift.
The OpenAI team's initial fix was to spend 20% of every Friday cleaning up "AI slop". That approach obviously doesn't scale.
Lesson 2: Tests as the Core of the Harness
Without good test coverage, the agent acts blindly. Tests are the agent's guardrails: they define the boundary of "correct".
Lesson 3: Review Becomes the New Bottleneck
When agents write code faster than humans can review it, the bottleneck shifts from coding to reviewing. This is the fundamental shift of agent-first development.
🚀 OpenAI's core insight: their biggest challenge wasn't model capability, but "designing environments, feedback loops, and control systems" — in other words, Harness Engineering.
Anthropic's Research on Long-Running Agents
In November 2025, Anthropic published "Effective harnesses for long-running agents", which surfaced a counterintuitive finding:
A larger context window often makes an agent perform worse, not better.
Why doesn't more context mean better results?
Imagine a software engineer on an 8-hour shift:
- First few hours: clear thinking, high output
- Hour 5: starting to forget earlier decisions
- Hour 8: fatigued, making careless mistakes
AI agents behave exactly the same inside a long context window: their attention drifts and they forget important early information.
Anthropic's Answer: a Shift-based Harness
```python
# ===== Anthropic's Shift-based Pattern =====
# Modeled on engineers working in shifts
class ShiftBasedHarness:
    def __init__(self, max_steps_per_shift: int = 50):
        self.max_steps = max_steps_per_shift
        self.current_step = 0
        self.handoff_notes = ""

    async def run_long_task(self, task: str):
        while not self.is_complete(task):
            # Start a new "shift"
            agent = self.create_fresh_agent()

            # Give the new agent the previous shift's handoff notes
            context = f"""
Task: {task}

Handoff notes from the previous shift:
{self.handoff_notes}

Please continue the work.
"""
            # The agent works until the shift ends
            result = await agent.run(
                context,
                max_steps=self.max_steps
            )

            # Shift over: produce handoff notes for the next one
            self.handoff_notes = await agent.run(
                "Write handoff notes covering:\n"
                "1. Work completed\n"
                "2. Current state\n"
                "3. Next steps\n"
                "4. Known issues"
            )
            self.current_step += self.max_steps
```
💡 Core insight: agent reliability comes not from a smarter model but from better scaffolding. That is the essence of Harness Engineering: build an environment in which even an imperfect model can complete the task reliably.
A Deep Comparison: What Fundamentally Separates the Three Eras
| Dimension | Prompt Engineering | Context Engineering | Harness Engineering |
|---|---|---|---|
| Period | 2022-2024 | 2025 | 2026+ |
| Core question | How do you ask? | What does the model know? | How does the system operate? |
| Optimization target | Quality of a single response | Information quality of the context window | Long-run agent reliability |
| Scope | One instruction | The whole context window | Everything outside the model |
| Interaction mode | One-turn Q&A | Multi-turn + RAG | Autonomous agent loops |
| Relationship | The foundation | Contains Prompt Engineering | Contains Context Engineering |
| Analogy | Writing a beautiful letter | Preparing a full briefing package | Building the whole office plus its management system |
| Failure mode | Model misreads the instruction | Context is missing key information | Agent drift, loops, tool misuse |
| Required skills | Verbal precision, NLP intuition | Information architecture | Systems engineering |
They Build on Each Other; Nothing Gets Replaced
🎯 How the three nest:
Prompt Engineering is a subset of Context Engineering
Context Engineering is a subset of Harness Engineering
No layer disappears; each is simply absorbed into a larger framework
A Practical Guide: How to Start Doing Harness Engineering
Step 1: Assess which stage you are at
Step 2: An evolution path from simple to complex
```python
# ===== Level 1: Prompt Engineering =====
# Simplest: just tune the prompt
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant. Answer in Traditional Chinese."
    }, {
        "role": "user",
        "content": "How do I read a CSV in Python?"
    }]
)

# ===== Level 2: Context Engineering =====
# Add RAG, memory, and structured context
from langchain.chains import RetrievalQA

# Build the retrieval pipeline
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(),
    chain_type="stuff"
)

# The context automatically includes retrieved documents
result = qa.invoke({"query": "What is the company's annual leave policy?"})

# ===== Level 3: Harness Engineering =====
# A full agent harness with tools, middleware, and evals
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

# Define tools
tools = [web_search, code_runner, file_manager]

# Build the agent with persistent memory
agent = create_react_agent(
    model=ChatAnthropic(model="claude-sonnet-4-20250514"),
    tools=tools,
    checkpointer=MemorySaver(),  # persisted state
)

# Add middleware (illustrative API)
agent.add_middleware(LoopDetectionMiddleware(max_similar_actions=3))
agent.add_middleware(AutoLintMiddleware())
agent.add_middleware(TokenBudgetMiddleware(max_tokens=100000))

# Add an evaluation hook
agent.add_hook(
    "on_task_complete",
    lambda result: verify_with_tests(result)
)

# Run
result = await agent.ainvoke({
    "messages": [{"role": "user", "content": "Implement user authentication"}]
})
```
Step 3: A Production Harness Checklist
- Loop Detection: intervene once the agent repeats the same action 3+ times
- Token Budget: cap the maximum token spend per task
- Timeout: every tool call has a timeout
- Self-verification: every agent output is verified automatically (tests, linting, type checks)
- Handoff Notes: long-running tasks get context compaction + handoffs
- Observability: log every tool call and model interaction
- Guardrails: restrict which tools and resources the agent can access
- Human-in-the-loop: high-risk actions require human approval
- Retry Logic: tool failures retry gracefully
- Memory Persistence: conventions and decisions persist across sessions
Looking Ahead
1. Training and Inference Environments Converge
Phil Schmid (Hugging Face) predicts the harness will become the main tool for detecting model drift. Labs will feed the data harnesses collect straight back into training, producing models that don't "get tired" on long tasks.
2. Standardized Agent Harness Frameworks
Just as web development got React/Django/Rails, AI agent development will get standardized harness frameworks. LangChain's Deep Agents and Anthropic's Claude Agent SDK are early movers in this trend.
3. Harness-as-a-Service
Managed harness platforms may emerge: you define the agent's goals and constraints, and the platform supplies the full middleware stack, evaluation pipeline, and monitoring dashboard.
4. Agent-to-Agent Harnesses
Once multiple agents must collaborate (say, a research agent + coding agent + review agent), harness design evolves from "supervising one agent" to "supervising a team of agents".
Technical Takeaways
1. The Bottleneck Has Moved from the Model to the Harness
LangChain's experiment proved it: the same model with a better harness (middleware + self-verification) lifted its benchmark score from 52.8% to 66.5%. The model is not the limiting factor; the harness is.
2. Software-Engineering Discipline Returns
Harness Engineering is, at heart, the application of traditional software-engineering best practices (testing, monitoring, state management, error handling) to AI agent development. AI engineering is no longer "magic prompt alchemy"; it is genuine engineering.
3. Constraining = Empowering
A counterintuitive finding: narrowing an agent's action space raises its productivity. It's like a good CI/CD pipeline: it limits what a developer can deploy, yet makes the whole team faster and more confident.
4. The Model Is the Horse; You're the Rider
Martin Fowler's analogy captures this era perfectly: the model is the horse, strong, fast, intelligent. But without a harness (reins, saddle, bridle) it runs wild. Your job is not to train the horse but to design the system that channels its power.
Summary
Core Takeaways
- Prompt Engineering (2022-2024): learn how to talk to AI. The core techniques (CoT, few-shot, role prompting) still matter today, but they are only the bottom layer of the stack.
- Context Engineering (2025): learn how to give AI the right information. The paradigm shift championed by Karpathy and Tobi Lütke: from "write better instructions" to "design a better information environment". RAG, memory, tool results, and conversation management all live here.
- Harness Engineering (2026): learn how to supervise AI while it works. A new discipline driven jointly by LangChain, Anthropic, and OpenAI: build all the infrastructure outside the model, including middleware, evaluation loops, sub-agent coordination, guardrails, and observability.
The Evolution in Three Sentences
2022: "I wrote a killer prompt and ChatGPT finally behaves!"
2025: "I designed a context pipeline so the model has the right information to make decisions."
2026: "I built a harness, and the agent runs on its own for 8 hours without going off the rails."
Related Resources
Context Engineering
- 📄 Anthropic: Effective context engineering for AI agents
- 🐦 Andrej Karpathy: Context engineering tweet (2025-06-25)
- 🐦 Tobi Lütke (Shopify CEO): Context engineering tweet (2025-06-18)
- 📄 LangChain: Context Engineering for Agents
- 📄 Neo4j: Why AI Teams Are Moving From Prompt to Context Engineering
Harness Engineering
- 📄 LangChain: The Anatomy of an Agent Harness (2026-03-10)
- 📄 OpenAI: Harness engineering: leveraging Codex in an agent-first world (2026)
- 📄 Anthropic: Effective harnesses for long-running agents (2025-11-26)
- 📄 Martin Fowler: Harness engineering for coding agent users (2026-04-02)
- 📄 Phil Schmid: The importance of Agent Harness in 2026
- 📄 LangChain: Improving Deep Agents with harness engineering
Further Reading
- 📄 Epsilla: The Third Evolution: Why Harness Engineering Replaced Prompting in 2026
- 📄 LangChain: State of Agent Engineering 2025 Report (survey of 1,300+ developers)
- 📄 Elastic: Context engineering vs. prompt engineering
- 📺 IBM Technology: Context Engineering vs. Prompt Engineering: Smarter AI with RAG & Agents
The evolution of AI engineering teaches a deep lesson: the model's intelligence is only the starting point. What really decides whether an AI system succeeds is everything you build around the model, from a single prompt, to a full context pipeline, to a complete agent harness. As the OpenAI Codex team's experience shows, their biggest challenge wasn't model capability but "designing environments, feedback loops, and control systems". Welcome to the era of Harness Engineering. 🧬✨