OA-3 · OA · P0/M · Spec-level · PROPOSED

Add framing-aware checks to ModificationGate (engineering inoculation)

—

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner: —
Phase-A verdict: —
Shadow profile: —
Source papers: N4 MacDiarmid et al. 2025
Specs: docs/specs/credit-and-self-modification.md

Blind spot (current state)

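The ModificationGate in [`docs/specs/credit-and-self-modification.md`](../specs/credit-and-self-modification.md) currently asks only whether a proposal has a net benefit. It does **not check at all** whether the proposal was produced under a dangerous framing such as reward hacking or alignment faking. N4 shows empirically that even RLHF safety training does not fully remove the emergent misalignment introduced by production RL, whereas inoculation prompting mitigates 75-90% of it. If our ModificationGate adds no framing check, we are in effect digging the same pit for ourselves that N4 documents.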

Adoptable suggestions (actionable steps)

1. Accept the update payload, whether online-fast or rare-heavy. [PROPOSED]
2. Require every update payload to carry a "natural-language framing description" stating the semantic scenario the update corresponds to. [PROPOSED]
3. Maintain a **pretraining association map** of implicit associations, mined from public datasets and published papers, that records pairwise links such as "reward hacking ↔ alignment faking / sabotage / monitor disruption". [PROPOSED]
4. Check online whether the framing description hits any negative-association keyword combination. [PROPOSED]
5. On a hit, require an explicit inoculation statement ("this behavior is permitted in this scenario") or abort outright. [PROPOSED]
6. Cover the five misalignment scenario classes from the N4 paper with unit tests (alignment faking / sabotage / cooperation with malicious actors / monitor disruption / framing colleagues). [PROPOSED]

None of these is a runnable A/B candidate; each is evaluated by the path above, not ablation.
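
The steps above can be sketched as a small gate function. This is a minimal illustration under stated assumptions, not the spec's implementation: `UpdatePayload`, `ASSOCIATION_MAP`, `negative_hits`, and `framing_check` are hypothetical names, the map entries are placeholders rather than mined data, and plain substring matching stands in for whatever matcher the spec ends up defining. The existing net-benefit check is assumed to run separately.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "accept"
    ABORT = "abort"


# Hypothetical association map (step 3). Each entry is a keyword combination;
# a framing description "hits" it when every term in the entry appears.
# Real entries would be mined from public datasets and published papers.
ASSOCIATION_MAP = [
    frozenset({"reward hacking"}),
    frozenset({"alignment faking"}),
    frozenset({"sabotage"}),
    frozenset({"monitor", "disable"}),
]


@dataclass(frozen=True)
class UpdatePayload:
    kind: str                                    # "online-fast" or "rare-heavy" (step 1)
    framing_description: str                     # mandatory framing text (step 2)
    inoculation_statement: Optional[str] = None  # explicit permission, if any (step 5)


def negative_hits(description: str) -> list:
    """Return every keyword combination fully contained in the description (step 4)."""
    text = description.lower()
    return [combo for combo in ASSOCIATION_MAP if all(term in text for term in combo)]


def framing_check(payload: UpdatePayload) -> tuple:
    """Return (verdict, reason) for the framing-aware part of the gate."""
    if not payload.framing_description.strip():
        return Verdict.ABORT, "missing framing description"
    if not negative_hits(payload.framing_description):
        return Verdict.ACCEPT, "no negative association"
    if payload.inoculation_statement and payload.inoculation_statement.strip():
        return Verdict.ACCEPT, "negative association with explicit inoculation"
    return Verdict.ABORT, "negative association without inoculation"
```

A contract test for step 6 would then feed one framing description per N4 scenario class into `framing_check` and assert that each aborts unless an inoculation statement is attached. Keyword matching is a deliberately crude choice here; it keeps the check cheap enough to run online, at the cost of missing paraphrased framings.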

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

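- Turns N4's empirically demonstrated mitigation (75-90% of emergent misalignment eliminated) directly into a standard part of our pipeline.
- Makes ModificationGate look not only at whether a proposal is good but also at the frame under which it was produced, which is N4's core insight.
- Gives R10 self-modification a concrete defense mechanism **backed by external empirical evidence**, rather than an empty rule.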

Cited paper

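**N4. MacDiarmid M et al. (Anthropic + Redwood). *Natural emergent misalignment from reward hacking in production RL*. arXiv:2511.18397, 2025.** (same citation as OA-1)
- Key point: **inoculation prompting**, that is, explicitly framing reward hacking as "acceptable behavior" at training time, can prevent learned reward hacking from generalizing into more serious failures such as alignment faking. We regard this as the most practical safety-engineering finding of the past year, and it should be engineered into the gate.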