EVO-2 · EVO · P0/M · Spec-level · PROPOSED

Evaluation cascade (cheap → expensive) + LLM judge as **readout only** (naturalness / conciseness)

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
AlphaEvolve §2.4
Specs
docs/specs/evaluation.md

Blind spot

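If [`docs/specs/evaluation.md`](../specs/evaluation.md) lists the six major families **side by side** with no **cheap→expensive** ordering, evaluation will either waste compute or **wrongly reject** genuine improvements on the strength of noisy metrics. The **evaluation cascade** in AlphaEvolve §2.4 explicitly filters candidates **easy checks first, hard checks later**.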
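
A minimal sketch of cheap-to-expensive ordering, assuming each evaluation family can be wrapped as a pass/fail stage with a rough relative cost. The names `Stage` and `run_cascade`, the stage names, and the cost figures are hypothetical illustrations; none of them come from `docs/specs/evaluation.md` or from AlphaEvolve.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    cost: float                      # hypothetical relative compute cost
    passes: Callable[[dict], bool]   # True if the candidate survives this stage


def run_cascade(candidate: dict, stages: List[Stage]) -> Tuple[bool, float, List[str]]:
    """Run stages from cheapest to most expensive and stop at the first failure.

    Returns (accepted, total_cost_spent, names_of_stages_run).
    """
    spent = 0.0
    ran: List[str] = []
    for stage in sorted(stages, key=lambda s: s.cost):
        spent += stage.cost
        ran.append(stage.name)
        if not stage.passes(candidate):
            return False, spent, ran
    return True, spent, ran


# Declared in arbitrary order on purpose; run_cascade sorts by cost.
stages = [
    Stage("expensive_head_to_head", 100.0, lambda c: c["score"] > 0.5),
    Stage("cheap_contract_test", 1.0, lambda c: c["schema_ok"]),
    Stage("mid_snapshot_diff", 10.0, lambda c: c["diff_ok"]),
]

broken = {"schema_ok": False, "diff_ok": True, "score": 0.9}
good = {"schema_ok": True, "diff_ok": True, "score": 0.9}

print(run_cascade(broken, stages))  # rejected after the cheapest stage only
print(run_cascade(good, stages))    # survives all three stages
```

In this toy setup a candidate that fails the cheapest check costs 1 unit instead of 111, which is the compute saving the ordering is meant to buy. A flat, unordered list would spend the full budget on every candidate.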

Adoptable suggestions

  1. Define a fixed **cascade**: contract tests → SHADOW snapshot comparison → long-window MemProbe / VZ-MemProbe subset → cross-generation head-to-head. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

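  2. **Optionally**, an LLM judge produces only **readouts** such as **naturalness / conciseness**, which are written into the evaluation snapshot; these are **forbidden** from entering the main credit chain or serving as a reward for Face gradients (R12). (PROPOSED)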

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  3. Evaluation parallelization runs only in a **background queue** and does not block the turn path. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.
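
Suggestions 2 and 3 can be illustrated together with a rough sketch: judge scores live in a separate `readouts` field that the credit path never reads, and evaluation work is pushed onto a background queue so the turn path returns immediately. Everything named here (`EvaluationSnapshot`, `credit_inputs`, `fake_llm_judge`, `handle_turn`) is a hypothetical stand-in, and the judge returns fixed placeholder scores rather than calling a real model.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvaluationSnapshot:
    candidate_id: str
    gate_results: Dict[str, bool] = field(default_factory=dict)  # cascade outcomes
    readouts: Dict[str, float] = field(default_factory=dict)     # judge scores, informational only

    def credit_inputs(self) -> Dict[str, bool]:
        """Only gate results feed credit; readouts are deliberately excluded."""
        return dict(self.gate_results)


def fake_llm_judge(candidate_id: str) -> Dict[str, float]:
    # Stand-in for an optional LLM judge; returns fixed placeholder scores.
    return {"naturalness": 0.8, "conciseness": 0.6}


eval_queue: queue.Queue = queue.Queue()
snapshots: List[EvaluationSnapshot] = []
_STOP = object()  # sentinel that tells the worker to exit


def worker() -> None:
    while True:
        item = eval_queue.get()
        try:
            if item is _STOP:
                return
            snap = EvaluationSnapshot(candidate_id=item)
            snap.gate_results["contract_tests"] = True  # placeholder gate result
            snap.readouts.update(fake_llm_judge(item))
            snapshots.append(snap)
        finally:
            eval_queue.task_done()


def handle_turn(candidate_id: str) -> str:
    """Turn path: enqueue the evaluation and return without waiting for it."""
    eval_queue.put_nowait(candidate_id)
    return f"turn done for {candidate_id}"


thread = threading.Thread(target=worker, daemon=True)
thread.start()
reply = handle_turn("cand-1")

eval_queue.join()      # demo only: wait so the snapshot can be inspected
eval_queue.put(_STOP)
thread.join()

print(reply)
print(snapshots[0].readouts)
print(snapshots[0].credit_inputs())
```

Keeping readouts out of `credit_inputs()` by construction means a later change cannot feed judge scores into credit by accident; it would have to edit that method explicitly, which a contract test can guard.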

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

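- Aligns with AlphaEvolve's **hypothesis-testing-style cascade**; lowers false rejections and shrinks the reward-hacking attack surface (in coordination with the OA group).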

Cited paper

**AlphaEvolve** §2.4 Evaluation (cascade, parallel, multi-score, LLM-generated feedback).