EVO-5 · EVO · P2/M · Spec-level · PROPOSED
Asymmetric proposer / verifier co-evolution (PAIRED-style regret-guided scenario library; evaluator is read-only)
—
Evaluation modality
Spec-level. A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.
- Primary owner
- —
- Phase-A verdict
- —
- Shadow profile
- —
- Source papers
- AlphaGeometry 2 + PAIRED UED
- Specs
- docs/specs/multi-timescale-learning.md, docs/specs/evaluation.md
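Blind spot (status quo)
Static benchmarks cannot adapt their pressure to **the system's current weaknesses**. An evolvable source of **curriculum- and environment-side pressure** is needed, but the evaluator must not be turned into an **online reward** (R-PE / R12).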
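Adoptable suggestions (actionable steps)
- 1. The **scenario proposer** (which generates session challenge configurations) and the **read-only evaluator family** are **prohibited** from fine-tuning each other's base. PROPOSED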
Not a runnable A/B candidate — evaluated by the path above, not ablation.
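- 2. Use **PAIRED**-style **regret** (or a simplified proxy: the distribution of failure modes on held-out data) to update the **sampling preference of the scenario archive**, steering it toward **high-regret** regions. PROPOSED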
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 3. Conceptually, align with **AlphaGeometry 2**: **the verifier is hard, the proposer explores**. In implementation, keep **evaluator = readout**. PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
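Suggestion 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the names `Readout`, `regret`, `sampling_weights`, and `sample_scenarios`, the softmax form, the `temperature` parameter, and the zero floor on regret are all hypothetical choices made here. Regret is taken as the antagonist's score minus the protagonist's score, following the PAIRED idea. The evaluator output is modelled as a frozen record so that downstream code can read it but not mutate it.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Readout:
    """Hypothetical read-only evaluator output for one scenario."""
    scenario_id: str
    protagonist_score: float  # score of the system under test
    antagonist_score: float   # score of a reference agent on the same scenario


def regret(r: Readout) -> float:
    """Regret proxy: antagonist minus protagonist, floored at 0 (assumption)."""
    return max(0.0, r.antagonist_score - r.protagonist_score)


def sampling_weights(readouts: List[Readout],
                     temperature: float = 0.5) -> Dict[str, float]:
    """Softmax over regret, so high-regret scenarios get sampled more often."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not readouts:
        return {}
    regrets = {r.scenario_id: regret(r) for r in readouts}
    peak = max(regrets.values())  # subtract the max for numerical stability
    exps = {sid: math.exp((g - peak) / temperature)
            for sid, g in regrets.items()}
    total = sum(exps.values())
    return {sid: e / total for sid, e in exps.items()}


def sample_scenarios(weights: Dict[str, float], k: int,
                     rng: random.Random) -> List[str]:
    """Draw k scenario ids from the archive according to the weights."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=k)


if __name__ == "__main__":
    archive = [
        Readout("easy", protagonist_score=0.9, antagonist_score=0.95),
        Readout("hard", protagonist_score=0.2, antagonist_score=0.9),
        Readout("solved", protagonist_score=1.0, antagonist_score=1.0),
    ]
    weights = sampling_weights(archive)
    print(weights)
    print(sample_scenarios(weights, k=5, rng=random.Random(0)))
```

Only the archive's sampling weights change here. The evaluator scores are consumed as a readout and never fed back as a training reward, which is the separation suggestions 1 and 3 ask for.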
Traceability
No plugins / runs linked yet.
Expected benefit
- Establishes **adaptive stress testing** on open-ended tasks without introducing a SIMA2-style **LLM-as-reward-generator** (which would conflict with R-PE).
Cited papers
**PAIRED**: Dennis et al., arXiv:2012.02096. **AlphaGeometry 2**: Chervonyi, Trinh et al., arXiv:2502.03544 (the asymmetric proposer/verifier idea).