DM-7 · DM · Spec-level · PROPOSED

evaluation: introduce "cross-generation comparison" as the progress metric for open-ended tasks

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
Open-Ended Learning Team 2021 (B3)
Specs
docs/specs/evaluation.md

Blind spot (current state)

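Does the multi-family evaluation in [`docs/specs/evaluation.md`](../specs/evaluation.md) currently handle the case where rewards for open-dialogue / relationship tasks are not comparable? Real relationship/EQ tasks have no unified scalar reward: the "good outcome" each session should reach differs from session to session, so sessions cannot be compared directly. With absolute scores alone, long-term progress cannot be quantified, and the R-PE route's "PE reduction = learning" still lacks a matching rollup metric.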

Adoptable suggestions (actionable steps)

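  1. Add a new metric family, "iterative cross-generation comparison", to [`docs/specs/evaluation.md`](../specs/evaluation.md). It does not require every session to have a homogeneous reward, but it does require "the preference win rate of the new-generation agent over the old-generation agent on held-out scenarios" as the long-term progress metric. (PROPOSED)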

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

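  2. Implementation layer: take a generation snapshot before and after every rare-heavy artifact switch (covering not only the model/controller parameters but also the owner snapshot configuration), and at evaluation time run pairwise head-to-head comparison replays across the snapshot set. (PROPOSED)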

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  3. Chain with §4 ModificationGate: the cross-generation win rate is one of the hard-evidence inputs for opening the ModificationGate. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.
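The three suggestions above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the spec's definition: `GenerationSnapshot`, its fields, the `Judge` callable, the half-credit treatment of ties, and the Wilson lower-bound threshold used for the gate are all hypothetical choices that `docs/specs/evaluation.md` would need to pin down.

```python
from dataclasses import dataclass
from itertools import combinations
from math import sqrt
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class GenerationSnapshot:
    """Hypothetical snapshot record (field names are assumptions)."""
    generation: int
    params_ref: str        # reference to model/controller parameters
    owner_config_ref: str  # reference to the owner snapshot configuration


# Hypothetical judge: +1 if `new` is preferred on the scenario,
# -1 if `old` is preferred, 0 for a tie.
Judge = Callable[[GenerationSnapshot, GenerationSnapshot, str], int]


def head_to_head(new: GenerationSnapshot, old: GenerationSnapshot,
                 scenarios: Sequence[str], judge: Judge) -> Tuple[int, int, int]:
    """Replay every held-out scenario and count (wins, losses, ties) for `new`."""
    wins = losses = ties = 0
    for scenario in scenarios:
        verdict = judge(new, old, scenario)
        if verdict > 0:
            wins += 1
        elif verdict < 0:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties


def win_rate(wins: int, losses: int, ties: int) -> float:
    """Preference win rate; a tie counts as half a win (an assumption)."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("no scenarios were compared")
    return (wins + 0.5 * ties) / total


def wilson_lower_bound(successes: float, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a proportion.

    Half-credit ties make `successes` fractional, so this is an approximation.
    """
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def pairwise_win_rates(snapshots: Sequence[GenerationSnapshot],
                       scenarios: Sequence[str],
                       judge: Judge) -> Dict[Tuple[int, int], float]:
    """Win rate of each newer generation over each older one, keyed (new, old)."""
    ordered = sorted(snapshots, key=lambda s: s.generation)
    rates = {}
    for old, new in combinations(ordered, 2):
        rates[(new.generation, old.generation)] = win_rate(
            *head_to_head(new, old, scenarios, judge))
    return rates


def gate_evidence_ok(wins: int, losses: int, ties: int,
                     threshold: float = 0.5) -> bool:
    """Hypothetical gate input: require the lower bound to clear the threshold."""
    n = wins + losses + ties
    return wilson_lower_bound(wins + 0.5 * ties, n) > threshold
```

Using a lower confidence bound rather than the raw win rate is one way to keep a small held-out set from opening the gate on noise; the 0.5 threshold and the 95% level are placeholders, and the gate is assumed to combine this signal with its other evidence sources.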

Traceability

No plugins / runs linked yet.

Expected benefit

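- Addresses the core difficulty of "no single reward, yet progress must still be quantified", which is the essential challenge for relationship/EQ products.
- Gives R10's ModificationGate a methodologically grounded source of evidence.
- Fully compatible with R-PE's "PE reduction = learning": the cross-generation win rate is essentially a readout of PE being smaller in the new generation.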

Cited paper

**B3. Open-Ended Learning Team (DeepMind London): Stooke A, Mahajan A, Barros C, Deck C, Bauer J, Sygnowski J, Trebacz M, Jaderberg M, Mathieu M, McAleese N, Bradley-Schmieg N, Wong N, Porcel N, Raileanu R, Hughes-Fitt S, Dalibard V, Czarnecki W M. *Open-Ended Learning Leads to Generally Capable Agents*. arXiv:2107.12808, 2021.**

- Document location: [`research/papers/dm/open-ended-learning-generally-capable-agents-2107.12808.pdf`](../../research/papers/dm/open-ended-learning-generally-capable-agents-2107.12808.pdf)
- Abstract (condensed excerpt):

  > In this work we create agents that can perform well beyond a single, individual task, that exhibit much wider generalisation of behaviour to a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. ... The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. **We propose an iterative notion of improvement between successive generations of agents, rather than seeking to maximise a singular objective, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards.** Training an agent that is performant across such a vast space of tasks is a central challenge, one we find that pure reinforcement learning on a fixed distribution of training tasks does not succeed in. We show that through constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives such that the agent never stops learning, we achieve consistent learning of new behaviours.

- Key points:
  1. When task rewards are not comparable, **measure progress by "surpassing the previous generation" rather than by "absolute score"**. This is the standard answer for today's relationship/EQ products.
  2. "Fixed distribution training" fails, so our multi-timescale learning must guarantee that the training data distribution changes dynamically (scenario library + lifeform-vitals feedback + rolling open-dialogue sessions).