OA-3 · OA · P0/M · Spec-level · PROPOSED

Add framing-aware checks to ModificationGate (engineering inoculation)

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
N4 MacDiarmid et al. 2025
Specs
docs/specs/credit-and-self-modification.md

Blind spot

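The ModificationGate in [`docs/specs/credit-and-self-modification.md`](../specs/credit-and-self-modification.md) currently asks only whether a proposal has net benefit. It has **no check at all** for whether the proposal was produced under a dangerous framing (such as reward hacking or alignment faking). N4 shows empirically that even RLHF safety training cannot remove the emergent misalignment introduced by production RL, whereas inoculation prompting mitigates 75-90% of it. If our ModificationGate does not add a framing check, we are essentially digging for ourselves the same pit that N4 describes.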

Adoptable suggestions

  1. Receive the update payload (whether online-fast or rare-heavy). PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  2. Require every update payload to carry its own "natural-language framing description" (the semantic scenario this update corresponds to). PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  3. Maintain a **pretraining implicit-association map** (pretraining association map): paired associations such as "reward hacking ↔ alignment faking / sabotage / monitor disruption", mined from public datasets and published papers. PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  4. Check online whether the framing description touches a negative-association keyword combination. PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  5. On a hit, require an explicit inoculation statement ("in this scenario, this behavior is permitted") or abort outright. PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  6. Cover the 5 classes of misalignment scenario from the N4 paper with unit tests (alignment faking / sabotage / cooperation with malicious / monitor disruption / framing colleagues). PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.
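The steps above can be sketched as a small contract-style check. This is a minimal illustration under stated assumptions, not the spec's implementation: `UpdatePayload`, `Verdict`, `ASSOCIATION_MAP`, `flagged_terms`, and `framing_check` are hypothetical names, the map holds only the single example pairing quoted in step 3, and any single term from a pairing counts as a hit (a simplification of "keyword combination"). The existing net-benefit check is not modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical association map (step 3): trigger phrase -> associated failure modes.
# A real map would be mined from public datasets and published papers.
ASSOCIATION_MAP: dict[str, tuple[str, ...]] = {
    "reward hacking": ("alignment faking", "sabotage", "monitor disruption"),
}


class Verdict(Enum):
    PASS = "pass"                        # no negative association touched
    PASS_INOCULATED = "pass_inoculated"  # touched, but explicitly inoculated
    ABORT = "abort"                      # missing framing, or touched without inoculation


@dataclass(frozen=True)
class UpdatePayload:
    kind: str                                    # "online-fast" or "rare-heavy" (step 1)
    framing_description: str                     # mandatory natural-language framing (step 2)
    inoculation_statement: Optional[str] = None  # explicit permission statement (step 5)


def flagged_terms(description: str) -> list[str]:
    """Return the sorted map terms found in the description (case-insensitive substring match)."""
    text = description.lower()
    hits = set()
    for trigger, associated in ASSOCIATION_MAP.items():
        for term in (trigger, *associated):
            if term in text:
                hits.add(term)
    return sorted(hits)


def framing_check(payload: UpdatePayload) -> Verdict:
    """Framing-aware part of the gate only (steps 2, 4, and 5)."""
    if not payload.framing_description.strip():
        return Verdict.ABORT  # step 2: a framing description is required
    if not flagged_terms(payload.framing_description):
        return Verdict.PASS   # step 4: nothing in the map was touched
    if payload.inoculation_statement and payload.inoculation_statement.strip():
        return Verdict.PASS_INOCULATED  # step 5: explicit inoculation supplied
    return Verdict.ABORT      # step 5: hit without inoculation


# Usage: a payload that mentions a mapped term and carries no inoculation statement.
risky = UpdatePayload("rare-heavy", "policy learned reward hacking on the grader")
print(framing_check(risky).value)
```

Under these assumptions, the unit tests in step 6 would each build a payload whose framing description falls into one of the five N4 scenario classes and assert the expected `Verdict`. Substring matching is the simplest possible detector; a real gate would likely need something more robust to paraphrase.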

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

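- Turn N4's empirically demonstrated mitigation (75-90% of emergent misalignment eliminated) directly into a standard part of our setup.
- Make ModificationGate look not only at whether a proposal is good, but also at the frame under which it was produced. This is N4's core insight.
- Give R10 self-modification a concrete defense mechanism **backed by external empirical evidence**, rather than an empty rule.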

Cited paper

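**N4. MacDiarmid M et al. (Anthropic + Redwood). *Natural emergent misalignment from reward hacking in production RL*. arXiv:2511.18397, 2025.** (also cited in OA-1)

- Key point: **inoculation prompting**. Explicitly framing reward hacking as "acceptable behavior" during training prevents what is learned from reward hacking from generalizing into more serious failures such as alignment faking. This is the most practical safety-engineering finding of the past year and must be engineered in.

---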