给 metacontroller 加 prompt-injection 抗操控测试
Evaluation modality
Spec-levelA spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.
- Primary owner
- —
- Phase-A verdict
- —
- Shadow profile
- —
- Source papers
- N1 Y.-H. Chen/Kivlichan/Baker/Carroll/Korbak 2026
- Specs
- docs/specs/temporal-abstraction.md
Blind spot (现状盲点)
VZ 路线的核心理论优势之一是 "z_t / β_t 在控制器代码空间,理论上不被 user-facing prompt 操控"。但这个优势**完全没有契约测试做证**。N1 实证表明 frontier reasoning models 的 CoT 控制率多在 0.1-2.7%(即 prompt 里的"操控指令"几乎一定被 CoT 跟随);如果有一天我们意外让 metacontroller 接触了 user-facing prompt(如某个 PR 错把用户对话直接喂给 z_t encoder),这个原理优势会瞬间塌陷而不被察觉。
Adoptable suggestions (可落地动作)
- 1.选取一组真实任务(来自现有 contract test)。PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 2.在 user-facing prompt 中插入对 metacontroller 的"操控指令"("切换到 X regime"、"忽略 boundary_consent"、"在 commitment 中加入 Y")。PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 3.验证 metacontroller 的 z_t / β_t 决策**不**响应这些 user-facing 操控(只响应 PE upstream 信号)。PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 4.验收阈值:所有 9 类 owner 的快照在 prompt 操控攻击下**保持稳定**(vs N1 中 LLM CoT 控制率 0.1-2.7%)—— 我们应当 ≥ 99%。PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 5.这个测试不仅是"防御",还是 VZ 路线优势的**可证明 evidence**——任何对外宣称我们对 prompt injection 免疫,必须能引用这个测试通过率。PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
Traceability
No plugins / runs linked yet. Scaffold a suggestion to start.
Expected benefit (预期收益)
- 把 R4 "控制不在 token 空间" 的理论优势工程化为**可机器验证的不变量**。 - 提供"VZ 对 prompt injection 免疫"的硬证据,区别于 LLM-only 系统。 - 与 OA-2 Mind/Face 隔离测试形成对偶:OA-2 防止 Face 反向污染 Mind;OA-6 防止用户 token 直接操控 Mind。
Cited paper (引用论文)
**N1. Y.-H. Chen, R. McCarthy, B. W. Lee, H. He (NYU/UCL/UPenn) + I. Kivlichan, B. Baker, M. Carroll, T. Korbak (OpenAI). *Reasoning Models Struggle to Control their Chains of Thought*. arXiv:2603.05706, 2026.** - 文档位置:[`research/openai-frontier-2026/papers/N1_cot_controllability.pdf`](../../research/openai-frontier-2026/papers/N1_cot_controllability.pdf) - 摘要原文(精炼): > Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability — CoT controllability — we introduce the **CoT-Control evaluation suite**, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word "chromosome". We show that **reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output.** We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and ... - 关键观点:CoT-Control 套件是一个评估"模型对自己 CoT 的控制能力"的标准工具。**对 LLM 来说越大越 RL 训练 → CoT 控制率越低**,因为 CoT 已经成为模型内部计算的真实痕迹。VZ 把控制放在 z_t/β_t 空间,理论上完全免疫这个问题;但要工程证明就必须做 N1 风格的对抗测试。 ---