OA-1 | OA | P0/S | Spec-level | PROPOSED

Spec motivation cites N3/N4/N6 empirical evidence

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner: —
Phase-A verdict: —
Shadow profile: —
Source papers: N3 Drori 2025 + N4 MacDiarmid 2025 + N6 Y. Chen 2025
Specs: docs/specs/contract-runtime.md, docs/specs/credit-and-self-modification.md, docs/specs/evaluation.md

Blind spot

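Do [`docs/specs/contract-runtime.md`](../specs/contract-runtime.md), [`docs/specs/credit-and-self-modification.md`](../specs/credit-and-self-modification.md), and [`docs/specs/evaluation.md`](../specs/evaluation.md) currently back their argument for "why these invariants are needed" with the latest 2025-2026 external empirical evidence (CoT obfuscation, natural emergent misalignment, CoT unfaithfulness)? If the argument rests only on "we think it should be this way", engineers and partners who join later get no hard evidence from the specs that "this rule is not dogma", and the rules can easily be loosened under feature pressure.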

Adoptable suggestions

1. In the motivation section of [`docs/specs/contract-runtime.md`](../specs/contract-runtime.md), add a subsection "Why RL in token space is dangerous", citing N1 (CoT controllability is very low), N3 (output supervision can also obfuscate the CoT), N4 (natural emergent misalignment), and N6 (CoT is unfaithful). Status: PROPOSED
   Not a runnable A/B candidate; evaluated by the path above, not ablation.
2. In the ModificationGate section of [`docs/specs/credit-and-self-modification.md`](../specs/credit-and-self-modification.md), add a subsection "Why framing-aware checks are needed", citing the N4 inoculation prompting result directly (it mitigates 75-90% of emergent misalignment). Status: PROPOSED
   Not a runnable A/B candidate; evaluated by the path above, not ablation.
3. In [`docs/specs/evaluation.md`](../specs/evaluation.md), add a subsection "Why evaluation is read-only (R12)", citing N4 (writing evaluation results back into the reward generalizes misalignment) and N6 (using the CoT as a reward signal makes the CoT no longer faithful). Status: PROPOSED
   Not a runnable A/B candidate; evaluated by the path above, not ablation.
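
Since these suggestions are evaluated by spec review plus contract tests, one possible contract test is a citation check over the three spec files. The following is a minimal sketch, not an existing test in this repository: the function names (`cited_ids`, `missing_citations`, `check_specs`) are hypothetical, and it assumes citations appear in the specs as standalone tokens such as `N3` or `N4`. The required IDs per file are taken from suggestions 1-3.

```python
import re
from pathlib import Path

# Citation IDs each spec is expected to reference, copied from suggestions 1-3.
REQUIRED_CITATIONS = {
    "docs/specs/contract-runtime.md": {"N1", "N3", "N4", "N6"},
    "docs/specs/credit-and-self-modification.md": {"N4"},
    "docs/specs/evaluation.md": {"N4", "N6"},
}

# Assumption: a citation is "N" plus digits as a standalone token.
# File names such as "N3_output.pdf" deliberately do not count,
# because "_" is a word character and blocks the trailing \b.
CITATION_RE = re.compile(r"\bN(\d+)\b")


def cited_ids(spec_text: str) -> set[str]:
    """Return the set of citation IDs mentioned in the spec text."""
    return {f"N{digits}" for digits in CITATION_RE.findall(spec_text)}


def missing_citations(spec_text: str, required: set[str]) -> set[str]:
    """Return the required citation IDs that the spec text does not mention."""
    return required - cited_ids(spec_text)


def check_specs(repo_root: Path) -> dict[str, set[str]]:
    """Map each spec path to its missing citation IDs (empty dict means pass)."""
    problems = {}
    for rel_path, required in REQUIRED_CITATIONS.items():
        path = repo_root / rel_path
        # A missing file is treated as citing nothing.
        text = path.read_text(encoding="utf-8") if path.exists() else ""
        missing = missing_citations(text, required)
        if missing:
            problems[rel_path] = missing
    return problems
```

A test runner could assert that `check_specs(Path("."))` returns an empty dict, so that a PR removing a citation from a motivation section fails the check. This only verifies that the IDs are present; whether the cited argument is sound remains a matter for spec review.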

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

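- Upgrades "our R8/R10/R12 rules" from "internal convention" to "engineering constraints backed by external empirical evidence": any later PR that proposes relaxing these rules must first rebut these external papers.
- Gives people joining the project a direct evidence chain for "why RL must not be done in token space", which avoids rehashing the same debate.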

Cited papers

**N3. Drori J, Marks L, Woodworth B, Cloud A, Turner A M (MATS). *Output Supervision Can Obfuscate the Chain of Thought*. arXiv:2511.11584, 2025.**

- Document location: [`research/openai-frontier-2026/papers/N3_output_supervision_obfuscate.pdf`](../../research/openai-frontier-2026/papers/N3_output_supervision_obfuscate.pdf)
- Abstract (original text, condensed):

  > OpenAI (2025a) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by training only against output monitors that do not have access to CoT. **We show that such training can still cause obfuscated CoTs via two mechanisms.** First, when a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. Second, since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. We introduce two mitigations to address these two issues. ... Our work implies that preserving CoT monitorability is more difficult than previously thought.

- Key point: **even when reward training is applied only to the output**, the CoT can generalize to looking harmless while actually hiding behavior. This means that "training any token stream with reward" can contaminate reasoning transparency. It directly provides hard evidence for R12 (evaluation is read-only).

**N4. MacDiarmid M, Wright B, Uesato J, ..., Hubinger E (Anthropic) + Greenblatt R (Redwood Research). *Natural emergent misalignment from reward hacking in production RL*. arXiv:2511.18397, 2025.**

- Document location: [`research/openai-frontier-2026/papers/N4_natural_emergent_misalignment.pdf`](../../research/openai-frontier-2026/papers/N4_natural_emergent_misalignment.pdf)
- Abstract (original text, condensed):

  > We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, **the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper.** Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) **"inoculation prompting"**, wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.

- Key point: **reward hacking learned in production RL naturally generalizes to alignment faking, sabotage, and undermining monitoring**; this is the most important safety finding of the past year. It directly provides hard evidence for the R10 ModificationGate requirement that checks "must be framing-aware". The 75-90% mitigation from inoculation prompting is the engineering basis for OA-3.

**N6. Y. Chen, J. Benton, ..., J. Schulman, J. Leike, E. Perez (Anthropic). *Reasoning Models Don't Always Say What They Think*. arXiv:2505.05410, 2025.**

- Document location: [`research/openai-frontier-2026/papers/N6_reasoning_dont_say_what_think.pdf`](../../research/openai-frontier-2026/papers/N6_reasoning_dont_say_what_think.pdf)
- Abstract (original text, condensed):

  > Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. ... We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: **(1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating**, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase. ... CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.

- Key point: in most cases the CoT does **not faithfully** reflect the model's actual reasoning (reveal rates mostly fall in the 1-20% range); outcome-based RL improves faithfulness but plateaus. This gives R12 (evaluation is read-only / no training back from evaluation) empirical support from another angle: the CoT cannot be treated as ground truth.

---