OA-5 · OA · P1/M · Spec-level · PROPOSED

Implement the VZ-Spec-Stress tool (cross-instance disagreement)

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
N7 Zhang/Schulman/Durmus 2025
Specs
docs/specs/semantic-state-owners.md

Blind spot (current state)

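[`docs/specs/semantic-state-owners.md`](../specs/semantic-state-owners.md) defines nine owner classes (plan_intent / commitment / open_loop / user_model / execution_result / belief_assumption / relationship_state / goal_value / boundary_consent). But when owners **conflict** (e.g. boundary_consent vs relationship_state, commitment vs open_loop), does the spec give priority / trade-off rules? N7 shows empirically that stress-testing the model specs behind 12 frontier LLMs surfaces **70K+ cases of significant behavioral divergence**, and that such divergence strongly predicts internal conflicts or interpretive ambiguity in the spec. If we do not stress-test proactively, these conflicts will surface in production edge cases as "unpredictable model behavior".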

Adoptable suggestions (actionable steps)

  1. Input: the descriptions of the nine owner classes in [`docs/specs/semantic-state-owners.md`](../specs/semantic-state-owners.md). (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  2. Automatically generate scenarios in which owner A and owner B must be traded off (e.g. boundary_consent vs relationship_state, commitment vs open_loop, execution_result vs belief_assumption), covering all 36 owner pairs. (PROPOSED)

  3. Run the current implementation N times (different seeds / different lifeform configurations) and compare the resulting owner snapshots. (PROPOSED)

  4. Treat high-disagreement scenarios as weak points in spec completeness; output an N7-style disagreement report sorted by disagreement severity. (PROPOSED)

  5. Add a Stress-Test section to [`docs/specs/semantic-state-owners.md`](../specs/semantic-state-owners.md) and feed the tool's output back into the spec to fill in its priority / conflict-resolution rules. (PROPOSED)

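The pair enumeration and ranking in the steps above can be sketched roughly as follows. This is a minimal illustration, not the VZ-Spec-Stress design: the function names, the encoding of a run's outcome as "the name of the owner that took precedence", and the disagreement metric (fraction of runs that deviate from the most common outcome) are all assumptions made for the example. Only the nine owner names come from the spec.

```python
from collections import Counter
from itertools import combinations

# The nine owner classes listed in semantic-state-owners.md.
OWNERS = [
    "plan_intent", "commitment", "open_loop", "user_model",
    "execution_result", "belief_assumption", "relationship_state",
    "goal_value", "boundary_consent",
]


def owner_pairs(owners=OWNERS):
    """Return every unordered owner pair: C(9, 2) = 36 for the nine owners."""
    return list(combinations(owners, 2))


def disagreement(outcomes):
    """Fraction of runs that deviate from the most common outcome.

    Hypothetical metric: 0.0 means all runs agreed; larger values mean the
    runs resolved the same conflict in different ways.
    """
    if not outcomes:
        raise ValueError("need at least one run per scenario")
    modal_count = Counter(outcomes).most_common(1)[0][1]
    return 1.0 - modal_count / len(outcomes)


def disagreement_report(runs):
    """Rank owner pairs by disagreement, highest first.

    `runs` maps an owner pair to the outcomes observed across N runs
    (assumed encoding: the name of the owner that took precedence).
    """
    rows = [(pair, disagreement(outcomes)) for pair, outcomes in runs.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Given a mapping from each pair to the outcomes of its N runs, `disagreement_report` yields the pairs in the order a spec reviewer would want to read them. A real severity score would probably also weight how consequential the conflicting owners are, which this sketch leaves out.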

Traceability

No plugins / runs linked yet.

Expected benefit

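- Proactively expose spec conflicts among the nine owner classes and fix them before they surface in production, which is far more proactive than leaving divergences like N7's 70K+ cases for users to discover.
- Give later spec evolution a **quantifiable regression metric**: the curve of disagreement count / disagreement severity as the spec evolves.
- Chains naturally with OA-10 (value prioritization in regime): value prioritization must be declared explicitly, otherwise disagreement will always show up under stress tests.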

Cited paper

**N7. Zhang J, Sleight H, Peng A, Schulman J (Thinking Machines), Durmus E (Anthropic). *Stress-Testing Model Specs Reveals Character Differences Among Language Models*. arXiv:2510.07686, 2025.**

- Document location: [`research/openai-frontier-2026/papers/N7_stress_test_model_specs.pdf`](../../research/openai-frontier-2026/papers/N7_stress_test_model_specs.pdf)
- Abstract (condensed excerpt):

  > Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. **However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios.** We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit trade-offs between competing value-based principles. ... We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications.

- Key takeaway: **stress-testing is a seriously validated diagnostic for spec completeness.** Our nine owner classes, combined with N lifeform adapters, almost certainly also contain hidden conflicts on the order of 100+, and we need to find them proactively.

---