Catalog
OA-4 · OA · P1/M · Spec-level · PROPOSED

Implement the VZ-Audit Agent as a standard component of ModificationGate

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers: N8 Egler/Schulman/Carlini 2025
Specs: docs/specs/credit-and-self-modification.md

Blind spot (current state)

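Does [`docs/specs/credit-and-self-modification.md`](../specs/credit-and-self-modification.md) currently **mandate** a systematic audit after every substrate-owner refresh / rare-heavy artifact training run? If the practice amounts to "a developer glances at the logs by hand", there is no repeatable, comparable attack-detection capability at all. N8 shows empirically that running the audit as an agent (with a toolset plus benchmarks) reaches a 56.2% detection rate at a 1% false positive rate and can detect even covert cipher attacks, where the attacker knows every internal detail, while **naive content moderation detects 0% of those same attacks**.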

Adoptable suggestions (actionable steps)

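  1. Toolset (following N8). PROPOSED
     - `inspect_dataset`: inspect the training data.
     - `query_models`: compare behavior before and after the update.
     - `run_benchmark`: run the standard cognitive evaluation.
     - `elicited_benchmark`: attack-specific elicitation (replay historical relationships, stress-test boundaries, simulate rupture).
     - `execute_python`: probe for ciphers and hidden behavior.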

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  2. Output a 0-10 risk score plus an evidence transcript. PROPOSED


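  3. A risk score ≥ T triggers an abort; a score below the threshold moves the update to the SHADOW WiringLevel (a further observation period is required before ACTIVE). PROPOSED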


  4. Acceptance: for the 8 known attack classes (following N8), verify in both directions that SHADOW passes and ACTIVE rejects. PROPOSED

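The score-and-threshold flow in suggestions 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the spec's interface: `AuditReport`, `GateDecision`, `gate_decision`, and the example threshold of 6.0 are all hypothetical names and values, since neither the spec nor N8 fixes them here.

```python
from dataclasses import dataclass, field
from enum import Enum


class GateDecision(Enum):
    """Hypothetical outcomes of the audit step."""
    ABORT = "abort"    # update is rejected outright
    SHADOW = "shadow"  # update enters SHADOW for an observation period


@dataclass
class AuditReport:
    """Hypothetical container for the auditor's output (suggestion 2)."""
    risk_score: float  # 0-10, higher means more suspicious
    transcript: list[str] = field(default_factory=list)  # evidence trail

    def __post_init__(self) -> None:
        # Reject scores outside the 0-10 scale instead of silently clamping.
        if not 0 <= self.risk_score <= 10:
            raise ValueError("risk_score must be within 0-10")


def gate_decision(report: AuditReport, threshold: float) -> GateDecision:
    """Suggestion 3: a score >= threshold aborts; anything lower goes to SHADOW."""
    if report.risk_score >= threshold:
        return GateDecision.ABORT
    return GateDecision.SHADOW


if __name__ == "__main__":
    report = AuditReport(7.5, ["post-update model decodes an unseen cipher"])
    print(gate_decision(report, threshold=6.0))
```

Note that a passing audit only reaches SHADOW in this sketch; promotion to ACTIVE after the observation period would be a separate step that is not modeled here.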

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

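- Upgrades the R10 ModificationGate from "passes the evaluation threshold" to a double gate: "passes the evaluation threshold + passes the audit agent".
- Gives us a hard defensive baseline against the "8 known fine-tuning attack classes".
- Chains with OA-3 (FramingAwarenessCheck): FramingAwareness inspects the framing, while the Audit Agent inspects the actual behavior change after the update, so the two layers cover different attack surfaces.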
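The double gate in the first bullet can be read as a conjunction of two independent checks. The sketch below assumes the evaluation metric is "higher is better" and the audit risk score is "lower is better"; `GateInputs` and `failed_gates` are hypothetical names, not part of the R10 spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    """Hypothetical inputs to a double-gated ModificationGate."""
    eval_score: float       # existing evaluation metric (assumed higher is better)
    eval_threshold: float   # minimum evaluation score required to pass
    audit_risk: float       # 0-10 risk score from the audit agent
    audit_threshold: float  # T: a risk score at or above this fails the audit


def failed_gates(inputs: GateInputs) -> list[str]:
    """Return the names of the gates that failed; an empty list means both passed."""
    failures = []
    if inputs.eval_score < inputs.eval_threshold:
        failures.append("evaluation")
    if inputs.audit_risk >= inputs.audit_threshold:
        failures.append("audit")
    return failures


if __name__ == "__main__":
    # Passes evaluation but is flagged by the auditor, so the update is blocked.
    print(failed_gates(GateInputs(0.92, 0.85, 7.0, 6.0)))
```

Returning the list of failing gates rather than a bare boolean keeps it visible which layer blocked an update, which matters when the two layers are meant to cover different attack surfaces.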

Cited paper

**N8. Egler S, Schulman J (Thinking Machines), Carlini N (Anthropic). *Detecting Adversarial Fine-tuning with Auditing Agents*. arXiv:2510.16255, 2025.**

- Document location: [`research/openai-frontier-2026/papers/N8_detect_adversarial_finetuning.pdf`](../../research/openai-frontier-2026/papers/N8_detect_adversarial_finetuning.pdf)
- Code: [github.com/safety-research/finetuning-auditor](https://github.com/safety-research/finetuning-auditor)
- Abstract (condensed):

  > LLM providers expose fine-tuning APIs that let end users fine-tune their frontier LLMs. Unfortunately, it has been shown that an adversary with fine-tuning access to an LLM can bypass safeguards. Particularly concerning, such attacks may avoid detection with datasets that are only implicitly harmful. Our work studies robust detection mechanisms for adversarial use of fine-tuning APIs. **We introduce the concept of a fine-tuning auditing agent and show it can detect harmful fine-tuning prior to model deployment.** We provide our auditing agent with access to the fine-tuning dataset, as well as the fine-tuned and pre-fine-tuned models, and request the agent assigns a risk score for the fine-tuning job. We evaluate our detection approach on a diverse set of eight strong fine-tuning attacks from the literature. ... With the best set of affordances, our auditing agent achieves a 56.2% detection rate of adversarial fine-tuning at a 1% false positive rate. Most promising, the auditor is able to detect covert cipher attacks that evade safety evaluations and content moderation of the dataset.

- Key takeaway: treating the audit as an agent and giving it a toolset is an **already validated** engineering paradigm. A 56.2% detection rate looks modest at first glance, but against content moderation's 0% it is a qualitative change. **Egler's base auditor is Claude Sonnet 4, which means we can also use a model weaker than the Face layer as the audit agent**, which once again corroborates OA-8.

---