EVO-2 · EVO · P0/M · Spec-level · PROPOSED

Evaluation cascade (cheap → expensive) + LLM judge as readout only (naturalness / conciseness)

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner: —
Phase-A verdict: —
Shadow profile: —
Source papers: AlphaEvolve §2.4
Specs: docs/specs/evaluation.md

Blind spot (current state)

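If [`docs/specs/evaluation.md`](../specs/evaluation.md) lists the six major families **side by side** with no **cheap→expensive** ordering, evaluation either wastes compute or **wrongly rejects** genuine improvements on the basis of noisy metrics. The **evaluation cascade** in AlphaEvolve §2.4 explicitly filters candidates **easy stages first, hard stages later**.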
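As a rough illustration of why ordering matters, the sketch below runs stages cheapest-first and stops at the first failure, so a candidate that fails a cheap check never pays for an expensive one. The `Stage` type, the cost units, and the stage names are all hypothetical; nothing here is taken from `docs/specs/evaluation.md` or from the AlphaEvolve implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    cost: float                     # relative compute cost (hypothetical units)
    passes: Callable[[dict], bool]  # True if the candidate survives this stage


def run_cascade(candidate: dict, stages: List[Stage]) -> Tuple[bool, float, List[str]]:
    """Run stages cheapest-first and stop at the first failure.

    Returns (survived, total_cost_spent, names_of_stages_run).
    """
    spent = 0.0
    ran: List[str] = []
    for stage in sorted(stages, key=lambda s: s.cost):
        spent += stage.cost
        ran.append(stage.name)
        if not stage.passes(candidate):
            return False, spent, ran
    return True, spent, ran


# Stages deliberately declared out of order; the cascade sorts them by cost.
stages = [
    Stage("expensive_probe", 100.0, lambda c: c["probe_score"] >= 0.5),
    Stage("contract_tests", 1.0, lambda c: c["contracts_ok"]),
]

bad = {"contracts_ok": False, "probe_score": 0.9}
good = {"contracts_ok": True, "probe_score": 0.9}

print(run_cascade(bad, stages))   # (False, 1.0, ['contract_tests'])
print(run_cascade(good, stages))  # (True, 101.0, ['contract_tests', 'expensive_probe'])
```

In this toy setup the failing candidate costs 1 unit instead of 101, which is the saving a parallel, unordered listing would give up.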

Adoptable suggestions (actionable steps)

1. Define a fixed **cascade**: contract tests → SHADOW snapshot comparison → long-window MemProbe / VZ-MemProbe subset → cross-generation head-to-head. (PROPOSED)
   Not a runnable A/B candidate; evaluated by the path above, not ablation.
2. An **optional** LLM judge produces only **readouts** such as **naturalness / conciseness**, which are written to the evaluation snapshot; it is **forbidden** from entering the main credit chain or from serving as the reward for Face gradients (R12). (PROPOSED)
   Not a runnable A/B candidate; evaluated by the path above, not ablation.
3. Evaluation parallelization runs only in a **background queue** and does not block the turn path. (PROPOSED)
   Not a runnable A/B candidate; evaluated by the path above, not ablation.
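The three suggestions can be sketched together as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the stage identifiers in `CASCADE_ORDER`, the `EvaluationSnapshot` fields, the `on_turn` hook, and the stub stages and judge are all hypothetical names introduced for illustration. The points it tries to show are that judge output lives in a separate `readouts` field that the credit path never reads, and that the turn path only enqueues work.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Fixed stage order from item 1; the identifiers are hypothetical.
CASCADE_ORDER = (
    "contract_tests",
    "shadow_snapshot_diff",
    "memprobe_subset",
    "cross_generation_h2h",
)


@dataclass
class EvaluationSnapshot:
    candidate_id: str
    stage_results: Dict[str, bool] = field(default_factory=dict)  # cascade outcomes
    readouts: Dict[str, float] = field(default_factory=dict)      # judge output only

    def credit_inputs(self) -> Dict[str, bool]:
        # Only cascade results are exposed to credit; readouts are excluded.
        return dict(self.stage_results)


StageFn = Callable[[str], bool]
JudgeFn = Callable[[str], Dict[str, float]]


def evaluate(candidate_id: str, stage_fns: Dict[str, StageFn],
             judge: Optional[JudgeFn] = None) -> EvaluationSnapshot:
    snap = EvaluationSnapshot(candidate_id)
    for name in CASCADE_ORDER:
        passed = stage_fns[name](candidate_id)
        snap.stage_results[name] = passed
        if not passed:
            break  # later, more expensive stages are skipped
    if judge is not None:
        snap.readouts.update(judge(candidate_id))  # recorded, never used as reward
    return snap


# Stub stages and judge, for illustration only.
stage_fns: Dict[str, StageFn] = {name: (lambda cid: True) for name in CASCADE_ORDER}
stage_fns["memprobe_subset"] = lambda cid: cid != "cand-bad"
stub_judge: JudgeFn = lambda cid: {"naturalness": 0.8, "conciseness": 0.6}

jobs: "queue.Queue[Optional[str]]" = queue.Queue()
snapshots: List[EvaluationSnapshot] = []


def worker() -> None:
    # Background consumer: evaluation happens here, off the turn path.
    while True:
        cid = jobs.get()
        try:
            if cid is None:  # sentinel: shut down
                return
            snapshots.append(evaluate(cid, stage_fns, stub_judge))
        finally:
            jobs.task_done()


def on_turn(candidate_id: str) -> str:
    jobs.put(candidate_id)  # enqueue only; no evaluation runs in the turn
    return "turn-complete"


thread = threading.Thread(target=worker, daemon=True)
thread.start()
on_turn("cand-ok")
on_turn("cand-bad")
jobs.put(None)
jobs.join()  # wait for the background queue to drain (demo only)
```

After the queue drains, `snapshots[1].stage_results` stops at `memprobe_subset`, and `credit_inputs()` contains no judge scores. Whether the judge should still run for a candidate that failed the cascade is a design choice this sketch leaves open.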

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

- Aligns with AlphaEvolve's **hypothesis-testing-style cascade**; reduces false rejections and the reward-hacking attack surface (in coordination with the OA group).

Cited paper

**AlphaEvolve** §2.4 Evaluation (cascade, parallel, multi-score, LLM-generated feedback).