Mechanistic interpretability as an owner health / drift readout, not arbitrary runtime steering
Evaluation modality
Spec-level. A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.
- Primary owner
- —
- Phase-A verdict
- —
- Shadow profile
- —
- Source papers
- SAE circuits + Gemma Scope + Function Vectors + KV Cache Steering
- Specs
- docs/specs/semantic-state-owners.md, docs/specs/expression-layer.md, docs/specs/evaluation.md, docs/specs/temporal-abstraction.md
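Blind spot (current state)
The C2 line currently serves mostly as corroborating evidence for R4: latent space does contain function / refusal / persona directions. Treating it purely as a steering technique, however, would be dangerous. A better fit for VZ is to absorb mechanistic interpretability as a read-only toolchain for owner health, drift, and bounded injection, rather than as magic that rewrites activations arbitrarily at runtime.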
Adoptable suggestions (actionable steps)
- 1. In semantic-state-owners / evaluation, define "owner internal readout evidence": SAE features, function vectors, and persona vectors may serve only as health evidence for an owner snapshot. (PROPOSED)
Not a runnable A/B candidate — evaluated by the path above, not ablation.
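- 2. If KV Cache Steering / activation steering enters the candidate pool, it must run only as a bounded-injection-port experiment outside the substrate residual, and it must pass ModificationGate and the R2/R4 checks. (PROPOSED)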
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 3. Build a drift report: check whether changes in the owner snapshot are consistent with changes in the latent readout; if they are not, flag a PE faithfulness / owner health risk. (PROPOSED)
Not a runnable A/B candidate — evaluated by the path above, not ablation.
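The read-only evidence record from suggestion 1 and the consistency check from suggestion 3 can be sketched as follows. This is a minimal illustration, not the project's schema: `ReadoutEvidence`, `l2_change`, `drift_report`, the flat `name -> float` representation of both the owner snapshot and the latent readouts, the L2 distance, and the tolerance values are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Mapping, Optional


@dataclass(frozen=True)
class ReadoutEvidence:
    """Read-only health evidence for one owner snapshot (hypothetical schema)."""
    owner: str
    snapshot_id: str
    # Named latent readouts, e.g. an SAE feature activation or a projection
    # onto a function / persona vector. The keys are illustrative only.
    readouts: Mapping[str, float]


def l2_change(before: Mapping[str, float], after: Mapping[str, float]) -> float:
    """Euclidean distance over the union of keys; a missing key counts as 0.0."""
    keys = set(before) | set(after)
    return sqrt(sum((after.get(k, 0.0) - before.get(k, 0.0)) ** 2 for k in keys))


def drift_report(
    snapshot_before: Mapping[str, float],
    snapshot_after: Mapping[str, float],
    evidence_before: ReadoutEvidence,
    evidence_after: ReadoutEvidence,
    snapshot_tol: float = 0.1,  # assumed threshold, not taken from any spec
    readout_tol: float = 0.1,   # assumed threshold, not taken from any spec
) -> dict:
    """Compare snapshot movement with readout movement and flag disagreement."""
    snapshot_delta = l2_change(snapshot_before, snapshot_after)
    readout_delta = l2_change(evidence_before.readouts, evidence_after.readouts)
    snapshot_moved = snapshot_delta > snapshot_tol
    readout_moved = readout_delta > readout_tol
    # Consistent means both moved or both stayed put.
    consistent = snapshot_moved == readout_moved
    risk: Optional[str] = None if consistent else "pe_faithfulness_or_owner_health"
    return {
        "owner": evidence_after.owner,
        "snapshot_delta": snapshot_delta,
        "readout_delta": readout_delta,
        "consistent": consistent,
        "risk": risk,
    }
```

The function only reads its inputs and returns a report, which keeps the interpretability signal on the evidence side; nothing in the sketch writes back into activations or owner state.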
Traceability
No plugins / runs linked yet.
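Expected benefit
- Turn interpretability from a research toy into owner health monitoring.
- Add geometric evidence for R11, "internal state is nameable and publishable".
- Prevent direct steering from bypassing the owner / ModificationGate.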
Cited papers
**Sparse Feature Circuits**, **Scaling Monosemanticity**, **Gemma Scope**, **Function Vectors**, **KV Cache Steering**, **Persona Vectors**. For details, see [`research/probe/02_axis_walkthrough.md`](../../research/probe/02_axis_walkthrough.md), section C2.

---