EVO-2 · EVO · P0/M · Spec-level · PROPOSED
Evaluation cascade (cheap → expensive) + LLM judge as **readout only** (naturalness / conciseness)
—
Evaluation modality
Spec-level. A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.
- Primary owner
- —
- Phase-A verdict
- —
- Shadow profile
- —
- Source papers
- AlphaEvolve §2.4
- Specs
- docs/specs/evaluation.md
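Blind spot (current state)
If [`docs/specs/evaluation.md`](../specs/evaluation.md) lists the six major families **side by side** with no **cheap→expensive** ordering, evaluation either wastes compute or **wrongly kills** genuine improvements on noisy metrics. The **evaluation cascade** in AlphaEvolve §2.4 explicitly filters candidates **easy first, hard later**.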
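The compute half of this argument can be made concrete with a small cost model. This is an illustrative sketch, not something stated in the spec or in AlphaEvolve. The symbols are assumptions: `c_i` is the cost of stage `i`, and `p_j` is the fraction of candidates reaching stage `j` that pass it. Running all `n` stages on every candidate costs the full sum. A cascade pays for stage `i` only on the candidates that survived the earlier stages:

```latex
\mathbb{E}[C_{\text{cascade}}]
  = \sum_{i=1}^{n} c_i \prod_{j=1}^{i-1} p_j
  \;\le\; \sum_{i=1}^{n} c_i
  = C_{\text{parallel}}
```

With made-up numbers chosen only for illustration (four stages costing 1, 10, 100, 1000 units and a pass rate of 0.5 at every stage), the cascade costs 1 + 5 + 25 + 125 = 156 units per candidate on average, against 1111 units when every stage always runs.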
Adoptable suggestions (actionable steps)
- 1. Define a fixed **cascade**: contract tests → SHADOW snapshot comparison → long-window MemProbe / VZ-MemProbe subset → cross-generation head-to-head. (PROPOSED)
Not a runnable A/B candidate — evaluated by the path above, not ablation.
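- 2. **Optionally**, an LLM judge produces only **readouts** such as **naturalness / conciseness**, which are written into the evaluation snapshot. These readouts are **forbidden** from entering the main credit chain or from serving as a reward for Face gradients (R12). (PROPOSED)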
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 3. Parallelize evaluation only in a **background queue**, so that it never blocks the turn path. (PROPOSED)
Not a runnable A/B candidate — evaluated by the path above, not ablation.
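The three suggestions above can be sketched together in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's implementation: `STAGE_ORDER`, `Snapshot`, `run_cascade`, and `start_background_evaluator` are hypothetical names, and each stage is a stand-in callable that returns pass/fail rather than the real contract tests, SHADOW comparison, or MemProbe runs.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical stage names, ordered cheap -> expensive as in suggestion 1.
STAGE_ORDER = (
    "contract_tests",
    "shadow_snapshot_diff",
    "memprobe_subset",
    "cross_generation_h2h",
)

Stage = Callable[[str], bool]               # candidate id -> pass/fail
Judge = Callable[[str], Dict[str, float]]   # candidate id -> readout scores


@dataclass
class Snapshot:
    candidate_id: str
    stage_results: Dict[str, bool] = field(default_factory=dict)
    readouts: Dict[str, float] = field(default_factory=dict)  # informational only

    @property
    def passed(self) -> bool:
        # Gating looks at cascade stages only; readouts are deliberately ignored.
        return (len(self.stage_results) == len(STAGE_ORDER)
                and all(self.stage_results.values()))


def run_cascade(candidate_id: str, stages: Dict[str, Stage],
                judge: Optional[Judge] = None) -> Snapshot:
    snap = Snapshot(candidate_id)
    for name in STAGE_ORDER:
        ok = stages[name](candidate_id)
        snap.stage_results[name] = ok
        if not ok:
            break  # early exit: the more expensive stages are never run
    if judge is not None:
        # Assumption: the judge runs regardless of the cascade outcome.
        # Its scores are stored in the snapshot and never feed `passed`.
        snap.readouts = judge(candidate_id)
    return snap


def start_background_evaluator(stages: Dict[str, Stage],
                               judge: Optional[Judge] = None):
    """Start one worker thread that drains a job queue off the turn path."""
    jobs: "queue.Queue[Optional[str]]" = queue.Queue()
    results: List[Snapshot] = []

    def worker() -> None:
        while True:
            candidate_id = jobs.get()
            if candidate_id is None:  # sentinel: stop the worker
                break
            results.append(run_cascade(candidate_id, stages, judge))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return jobs, results, thread
```

The turn path would only call `jobs.put(candidate_id)`, which returns immediately on an unbounded queue; the cascade itself runs on the worker thread. Keeping `readouts` in a separate field from `stage_results` is one way to make the R12 constraint structural: nothing downstream of `passed` can read a judge score by accident.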
Traceability
No plugins / runs linked yet. Scaffold a suggestion to start.
Expected benefit
- Aligns with AlphaEvolve's **hypothesis-testing-style cascade**; reduces false rejections and the reward-hacking attack surface (in coordination with the OA group).
Cited paper
**AlphaEvolve** §2.4 Evaluation (cascade, parallel, multi-score, LLM-generated feedback).