DM-2 · DM · Spec-level · PROPOSED

Adopt a successor-feature behavior basis for the z_t interface

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
Alegre/Barreto et al., NeurIPS 2025 (A3)
Specs
docs/specs/temporal-abstraction.md
docs/specs/dual-track-learning.md

Blind spot (current state)

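Does the controller code z_t interface in [`docs/specs/temporal-abstraction.md`](../specs/temporal-abstraction.md) currently define **compositionality**, or is it merely a one-hot selection among K independent controllers? If it is the latter, we are missing two major benefits of the successor features (SF) + Generalized Policy Improvement (GPI) framework: (1) a formal guarantee that **the composed policy is at least as good as any base policy**; (2) the dual tracks (world / self) can share a single SF basis, with each track learning only its own reward feature.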
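For reference, the standard SF and GPI definitions behind these two benefits can be written as below. The symbols $\phi$, $\psi$, $w$, $\gamma$ and $\pi_i$ are notation introduced here for illustration and do not come from the spec; reading z_t as the weight vector $w$ is an assumption based on the suggestions in this entry.

$$
r(s,a,s') = \phi(s,a,s')^{\top} w, \qquad
\psi^{\pi}(s,a) = \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,\phi(s_t,a_t,s_{t+1}) \,\middle|\, s_0 = s,\ a_0 = a\right], \qquad
Q^{\pi}_{w}(s,a) = \psi^{\pi}(s,a)^{\top} w
$$

$$
\pi_{\mathrm{GPI}}(s) \in \arg\max_{a}\, \max_{i}\, \psi^{\pi_i}(s,a)^{\top} w
\quad\Longrightarrow\quad
Q^{\pi_{\mathrm{GPI}}}_{w}(s,a) \;\ge\; \max_{i}\, Q^{\pi_i}_{w}(s,a) \quad \forall (s,a)
$$

The inequality holds as stated when the successor features $\psi^{\pi_i}$ are exact; with learned approximations the guarantee weakens by a term that depends on the approximation error.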

Adoptable suggestions (actionable items)

  1. Add a "behavior basis as z_t parameterization" paragraph to [`docs/specs/temporal-abstraction.md`](../specs/temporal-abstraction.md): define z_t as the linear-combination weights over reward features, rather than as a discrete choice among independent controllers. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  2. State explicitly in [`docs/specs/dual-track-learning.md`](../specs/dual-track-learning.md) that `world_temporal` and `self_temporal` share the SF basis interface but each maintain their own reward feature; snapshots are still published separately (so R8 SSOT is not broken). (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  3. Evidence first: run a two-arm comparison in SHADOW mode, "SF basis vs. the current discrete controller", and compare policy quality and compositional generalization on held-out scenarios. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.
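Suggestions 1 and 2 can be illustrated together with a minimal Python sketch. It is not the spec's interface: `gpi_action`, `one_hot_action`, `PSI_S` and `TRACK_WEIGHTS` are hypothetical names, the SF table holds made-up numbers for a single state, and the SFs are assumed to be precomputed. It only shows z_t acting as linear weights over a shared basis, with one weight vector per track.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def gpi_action(psi_s: List[List[Vector]], z_t: Vector) -> Tuple[int, float]:
    """GPI rule for one state: argmax over actions of the max over base policies.

    psi_s[i][a] is the (assumed precomputed) successor-feature vector of
    base policy i for action a in the current state.
    """
    best_action, best_value = -1, float("-inf")
    for policy_sfs in psi_s:
        for action, psi in enumerate(policy_sfs):
            q = dot(psi, z_t)  # estimate of Q_i(s, a) under the task encoded by z_t
            if q > best_value:
                best_action, best_value = action, q
    return best_action, best_value


def one_hot_action(psi_s: List[List[Vector]], k: int, z_t: Vector) -> Tuple[int, float]:
    """Baseline: commit to base policy k and act greedily on its own estimate."""
    values = [dot(psi, z_t) for psi in psi_s[k]]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values[best]


# Hypothetical SF table: 2 base policies x 3 actions x 2 reward features.
PSI_S = [
    [[1.0, 0.0], [0.6, 0.5], [0.0, 0.1]],  # base policy 0
    [[0.0, 0.1], [0.5, 0.6], [0.0, 1.0]],  # base policy 1
]

# Both tracks read the same basis; each owns only its weight vector (made-up values).
TRACK_WEIGHTS = {
    "world_temporal": [1.0, 0.0],
    "self_temporal": [0.0, 1.0],
}

for track, z_t in TRACK_WEIGHTS.items():
    print(track, gpi_action(PSI_S, z_t))
print("blend", gpi_action(PSI_S, [0.5, 0.5]))
```

With these toy numbers the two tracks pick different actions from the same table, and a blended z_t picks a third action that neither pure weight vector selects. The value returned by `gpi_action` is a maximum over the base policies' estimates, so it is trivially no smaller than any `one_hot_action` value; the GPI theorem itself concerns the true return of the resulting policy, which this sketch does not verify.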

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

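- **Far fewer parameters**: instead of training K controllers per track (`world_temporal` × K + `self_temporal` × K), we train one shared SF basis plus two small reward feature heads.
- **Clearer R8 SSOT boundary**: the basis is maintained under the shared `temporal_abstraction` owner, the reward features belong to the world and self temporal owners respectively, and cross-module access goes through read-only snapshots.
- **Formal guarantee**: the GPI property guarantees that the composed policy is never worse than any single base policy, which gives us a formal tool for our requirement that "controller switching must not be worse than a single controller".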
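The parameter argument in the first bullet can be made concrete with a back-of-the-envelope count. The sketch below is illustrative only: the function names and every number are hypothetical, and it assumes each reward feature head is a single weight vector of length `feature_dim`, which the specs do not state.

```python
def params_discrete(k: int, controller_params: int, tracks: int = 2) -> int:
    """K independent controllers per track, each with its own parameters."""
    return tracks * k * controller_params


def params_shared_basis(basis_params: int, feature_dim: int, tracks: int = 2) -> int:
    """One shared SF basis plus one small weight vector (head) per track."""
    return basis_params + tracks * feature_dim


# Hypothetical sizes, chosen only to show the shape of the comparison.
discrete = params_discrete(k=8, controller_params=100_000)
shared = params_shared_basis(basis_params=300_000, feature_dim=16)
print(discrete, shared)
```

Whether the saving materializes depends on how large the shared basis has to be relative to K controllers, which is one of the things the SHADOW comparison would need to measure.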

Cited paper

**Alegre L N, Bazzan A L C, Barreto A, da Silva B C. *Constructing an Optimal Behavior Basis for the Option Keyboard*. arXiv:2505.00787, NeurIPS 2025.**

- Document location: [`research/papers/dm/option-keyboard-controllable-world-models-small-2505.00787.pdf`](../../research/papers/dm/option-keyboard-controllable-world-models-small-2505.00787.pdf)
- Author weight: André Barreto (DeepMind London, a long-standing lead contributor to the successor features line of work).
- Abstract (original text, condensed):

  > Multi-task reinforcement learning aims to quickly identify solutions for new tasks with minimal or no additional interaction with the environment. Generalized Policy Improvement (GPI) addresses this by combining a set of base policies to produce a new one that is at least as good — though not necessarily optimal — as any individual base policy. Optimality can be ensured, particularly in the linear-reward case, via techniques that compute a Convex Coverage Set (CCS). However, these are computationally expensive and do not scale to complex domains. The Option Keyboard (OK) improves upon GPI by producing policies that are at least as good — and often better. ... This raises a key question: is there an optimal set of base policies — an optimal behavior basis — that enables zero-shot identification of optimal solutions for any linear tasks? We solve this open problem by introducing a novel method that efficiently constructs such an optimal behavior basis.

- Key takeaway: composable skills do not require "many skills", only the "right skill basis". Our z_t design should focus on the optimality of the basis rather than the number of skills. GPI's "at least as good" guarantee gives us a formal tool for controller switching.

---