COG-8 · COG · P1/M · Spec-level · PROPOSED

Mechanistic interpretability as an owner health / drift readout, not arbitrary runtime steering

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
SAE circuits + Gemma Scope + Function Vectors + KV Cache Steering
Specs
docs/specs/semantic-state-owners.md, docs/specs/expression-layer.md, docs/specs/evaluation.md, docs/specs/temporal-abstraction.md

Blind spot (current state)

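The C2 line currently serves mostly as supporting evidence for R4: latent space does contain function / refusal / persona directions. Treating it purely as a steering technique, however, would be dangerous. The better way for VZ to absorb it is to use mechanistic interpretability as a read-only toolchain for owner health, drift, and bounded injection, not as magic for arbitrarily rewriting activations at runtime.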
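
As a minimal sketch of what "read-only" could mean in practice, assume a latent direction (function, refusal, or persona) is available as a plain vector and that the readout is the scalar projection of an activation onto it. The names `ReadoutEvidence`, `project`, and `read_out` are hypothetical and do not come from any VZ spec; the point is only that the readout computes a value and never writes back into the activation.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReadoutEvidence:
    """Hypothetical immutable record of one latent readout (illustrative fields)."""
    owner: str
    direction_name: str
    score: float


def project(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of `activation` onto `direction` (direction is normalized here)."""
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / norm


def read_out(owner: str, direction_name: str,
             activation: Sequence[float],
             direction: Sequence[float]) -> ReadoutEvidence:
    """Produce evidence from an activation without modifying the activation."""
    return ReadoutEvidence(owner, direction_name, project(activation, direction))


# Usage: the activation list is left untouched; only a score is emitted.
activation = [3.0, 4.0]
evidence = read_out("persona_owner", "persona", activation, [1.0, 0.0])
```

The frozen dataclass is a deliberate choice in this sketch: evidence that cannot be mutated after creation is easier to attach to an owner snapshot as an audit record.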

Adoptable suggestions (actionable steps)

  1. In semantic-state-owners / evaluation, define "owner internal readout evidence": SAE features, function vectors, and persona vectors may serve only as health evidence for an owner snapshot. PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

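  2. If KV Cache Steering / activation steering enters the candidate set, it must do so only as a bounded injection-port experiment outside the substrate residual, and it must pass ModificationGate and the R2/R4 checks. PROPOSED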

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  3. Establish a drift report: check whether changes in the owner snapshot are consistent with changes in the latent readout; if they are not, flag a PE faithfulness / owner health risk. PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.
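
The drift report in suggestion 3 can be sketched as a simple consistency check. This is an illustration under stated assumptions, not a spec: it assumes each owner's snapshot change and latent readout change have already been reduced to one scalar delta each, and that "consistent" means both moved or both stayed within tolerance. `DriftEntry`, `drift_report`, and both tolerance parameters are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DriftEntry:
    """Hypothetical row of a drift report (field names are illustrative)."""
    owner: str
    snapshot_delta: float
    readout_delta: float
    consistent: bool
    risk: Optional[str]


def drift_report(deltas: Dict[str, Tuple[float, float]],
                 snapshot_tol: float = 0.1,
                 readout_tol: float = 0.1) -> List[DriftEntry]:
    """Compare snapshot change against readout change for each owner.

    `deltas` maps owner name -> (snapshot_delta, readout_delta).
    An entry is consistent when both signals moved or both stayed still.
    """
    report = []
    for owner, (snapshot_delta, readout_delta) in deltas.items():
        snapshot_moved = abs(snapshot_delta) > snapshot_tol
        readout_moved = abs(readout_delta) > readout_tol
        consistent = snapshot_moved == readout_moved
        # Assumed risk label, taken from the wording of suggestion 3.
        risk = None if consistent else "PE faithfulness / owner health"
        report.append(
            DriftEntry(owner, snapshot_delta, readout_delta, consistent, risk)
        )
    return report


# Usage: owner "b" changed its snapshot while its latent readout did not.
report = drift_report({"a": (0.5, 0.6), "b": (0.5, 0.0), "c": (0.0, 0.0)})
flagged = [entry.owner for entry in report if not entry.consistent]
```

A real report would likely need per-owner tolerances and a richer distance than a scalar delta; the sketch only fixes the shape of the comparison.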

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

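- Turns interpretability from a research toy into owner health monitoring.
- Adds geometric evidence for R11 ("internal state is nameable and publishable").
- Prevents direct steering from bypassing the owner / ModificationGate.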

Cited papers

**Sparse Feature Circuits**, **Scaling Monosemanticity**, **Gemma Scope**, **Function Vectors**, **KV Cache Steering**, **Persona Vectors**. For details, see [`research/probe/02_axis_walkthrough.md`](../../research/probe/02_axis_walkthrough.md), section C2.