COG-8 · COG · P1/M · Spec-level · PROPOSED

Mechanistic interpretability as an owner health / drift readout, not arbitrary runtime steering

—

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner: —
Phase-A verdict: —
Shadow profile: —
Source papers: SAE circuits + Gemma Scope + Function Vectors + KV Cache Steering
Specs: docs/specs/semantic-state-owners.md, docs/specs/expression-layer.md, docs/specs/evaluation.md, docs/specs/temporal-abstraction.md

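Blind spot

The C2 line currently serves mostly as corroborating evidence for R4: latent space does contain function / refusal / persona directions. Treating it purely as a steering technique, however, would be dangerous. A better way for VZ to absorb it is to use mechanistic interpretability as a read-only toolchain for owner health, drift, and bounded injection, rather than as magic for arbitrarily modifying activations at runtime.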

Adoptable suggestions (actionable steps)

1. In semantic-state-owners / evaluation, define "owner internal readout evidence": an SAE feature, function vector, or persona vector may serve only as health evidence for an owner snapshot. (PROPOSED)
2. If KV Cache Steering / activation steering becomes a candidate, it must enter only as a bounded-injection-port experiment outside the substrate residual, and must pass ModificationGate and the R2/R4 checks. (PROPOSED)
3. Establish a drift report: check whether changes in the owner snapshot agree with changes in the latent readout; if they disagree, flag a PE faithfulness / owner health risk. (PROPOSED)

None of these is a runnable A/B candidate; each is evaluated by the path above, not ablation.
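The drift report in suggestion 3 can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the specs: the `ReadoutEvidence` and `DriftReport` schemas, the field-diff measure of snapshot change, cosine distance as the readout metric, and both thresholds are assumptions chosen only to make the consistency check concrete. The evidence record is frozen to mirror the read-only constraint in suggestion 1.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt
from typing import Mapping

_MISSING = object()


@dataclass(frozen=True)
class ReadoutEvidence:
    """Read-only latent readout tied to one owner snapshot (hypothetical schema)."""
    owner: str
    snapshot_id: str
    kind: str                  # e.g. "sae_feature", "function_vector", "persona_vector"
    values: tuple[float, ...]  # projections onto the named latent directions


@dataclass(frozen=True)
class DriftReport:
    """Result of comparing snapshot change with readout change (hypothetical schema)."""
    snapshot_delta: float
    readout_delta: float
    consistent: bool
    risk_flags: tuple[str, ...]


def snapshot_delta(before: Mapping[str, object], after: Mapping[str, object]) -> float:
    """Fraction of snapshot fields that changed (an assumed, deliberately crude measure)."""
    keys = set(before) | set(after)
    if not keys:
        return 0.0
    changed = sum(1 for k in keys if before.get(k, _MISSING) != after.get(k, _MISSING))
    return changed / len(keys)


def readout_delta(before: ReadoutEvidence, after: ReadoutEvidence) -> float:
    """Cosine distance between two readouts of the same kind (assumed metric)."""
    if before.kind != after.kind or len(before.values) != len(after.values):
        raise ValueError("readouts are not comparable")
    dot = sum(x * y for x, y in zip(before.values, after.values))
    norm = sqrt(sum(x * x for x in before.values)) * sqrt(sum(y * y for y in after.values))
    if norm == 0.0:
        return 0.0 if before.values == after.values else 1.0
    return 1.0 - dot / norm


def drift_report(
    snap_before: Mapping[str, object],
    snap_after: Mapping[str, object],
    ev_before: ReadoutEvidence,
    ev_after: ReadoutEvidence,
    snap_threshold: float = 0.1,     # assumed threshold, not from the specs
    readout_threshold: float = 0.1,  # assumed threshold, not from the specs
) -> DriftReport:
    """Flag a risk when exactly one of (snapshot, readout) moved past its threshold."""
    s = snapshot_delta(snap_before, snap_after)
    r = readout_delta(ev_before, ev_after)
    consistent = (s > snap_threshold) == (r > readout_threshold)
    flags = () if consistent else ("pe_faithfulness", "owner_health")
    return DriftReport(s, r, consistent, flags)


# Usage: the published snapshot is unchanged while the latent readout rotated.
snap = {"mood": "calm", "goal": "assist"}
ev1 = ReadoutEvidence("persona", "s1", "persona_vector", (1.0, 0.0))
ev2 = ReadoutEvidence("persona", "s2", "persona_vector", (0.0, 1.0))
report = drift_report(snap, dict(snap), ev1, ev2)
print(report)
```

The sketch only reads and compares; it never writes activations, which keeps it on the readout side of the line drawn above. Any steering experiment would still go through ModificationGate separately.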

Traceability

No plugins / runs linked yet.

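Expected benefit

- Turns interpretability from a research toy into owner health monitoring.
- Adds geometric evidence for R11, "internal state is nameable and publishable".
- Prevents direct steering from bypassing the owner / ModificationGate.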

Cited papers

**Sparse Feature Circuits**, **Scaling Monosemanticity**, **Gemma Scope**, **Function Vectors**, **KV Cache Steering**, **Persona Vectors**. See [`research/probe/02_axis_walkthrough.md`](../../research/probe/02_axis_walkthrough.md), C2, for details.