DM-5 · DM · Spec-level · PROPOSED

Introduce imagination rollout into background-slow reflection

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
Hafner/Yan/Lillicrap 2025, Dreamer 4 (A1)
Specs
docs/specs/multi-timescale-learning.md

Blind spot (current state)

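Does the ReflectionEngine in [`packages/vz-cognition/`](../../packages/vz-cognition/) currently only replay historical logs? Has an imagination mode (rolling out hypothetical scenarios inside the world model) been discussed? If it only replays history, the reflection layer can "summarize the past" but cannot "explore hypothetical futures", which leaves both R10's counterfactual credit and R-PE's counterfactual PE estimation without an underlying mechanism.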

Adoptable suggestions (actionable steps)

  1. Add a candidate "imagination-based reflection" paragraph to the background-slow section of [`docs/specs/multi-timescale-learning.md`](../specs/multi-timescale-learning.md). · PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

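  2. **Key constraint**: the adoption path must be "frozen substrate + controller-internal rollout"; do **not** train a giant latent world model (this keeps it compatible with R2). One possible shape: maintain a lightweight predictor inside the `world_temporal` / `self_temporal` controllers, and have ReflectionEngine call it at background-slow ticks for short-horizon imagination. · PROPOSED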

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

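  3. Borrow the idea behind Dreamer 4's **shortcut forcing objective** to prevent imagination drift: calibrate every imagination step against real transitions so that the rollout does not wander further from reality with each step. · PROPOSED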

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  4. Evidence first: in SHADOW mode, run a comparison of "pure log-replay reflection vs imagination-augmented reflection" and check whether PE decreases faster. · PROPOSED

    Not a runnable A/B candidate — evaluated by the path above, not ablation.
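Suggestions 2 and 3 can be illustrated with a minimal sketch, under loud assumptions: `Transition`, `nearest_logged`, and `grounded_rollout` are hypothetical names, not existing APIs of `vz-cognition`, and states are plain float tuples. The grounding step shown here is a simple residual correction (measure the predictor's error on the nearest real transition, then add a fraction of it back). It stands in for the "calibrate against real transitions" idea and is not Dreamer 4's shortcut forcing objective itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[float, ...]  # assumption: controller state as a small float vector


@dataclass(frozen=True)
class Transition:
    """One real (logged) transition: start state, action label, observed next state."""
    state: State
    action: str
    next_state: State


def distance(a: State, b: State) -> float:
    """Euclidean distance between two state vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_logged(state: State, action: str,
                   log: Sequence[Transition]) -> Optional[Transition]:
    """Closest logged transition that used the same action, or None."""
    candidates = [t for t in log if t.action == action]
    if not candidates:
        return None
    return min(candidates, key=lambda t: distance(t.state, state))


def grounded_rollout(
    predictor: Callable[[State, str], State],  # hypothetical lightweight predictor
    start: State,
    actions: Sequence[str],
    log: Sequence[Transition],
    blend: float = 0.5,      # 0.0 = trust the predictor, 1.0 = full residual correction
    max_drift: float = 1.0,  # stop once no real transition starts this close
) -> List[State]:
    """Short-horizon imagination with per-step grounding against logged transitions."""
    trajectory = [start]
    state = start
    for action in actions:
        anchor = nearest_logged(state, action, log)
        if anchor is None or distance(anchor.state, state) > max_drift:
            break  # nothing real nearby: end the rollout instead of drifting
        predicted = predictor(state, action)
        # Predictor error measured on the real transition, reused as a correction.
        on_anchor = predictor(anchor.state, action)
        state = tuple(
            p + blend * (real - q)
            for p, real, q in zip(predicted, anchor.next_state, on_anchor)
        )
        trajectory.append(state)
    return trajectory


# Toy usage: a predictor that underestimates each step, and a two-entry log.
def toy_predictor(state: State, action: str) -> State:
    return (state[0] + 0.8,)


toy_log = [
    Transition((0.0,), "advance", (1.0,)),
    Transition((1.0,), "advance", (2.0,)),
]
trajectory = grounded_rollout(
    toy_predictor, (0.0,), ["advance"] * 3, toy_log, blend=1.0, max_drift=0.5
)
print(trajectory)
```

In this toy run the third step is refused because no logged transition starts within `max_drift` of the imagined state, so the rollout stays short-horizon by construction. The substrate is never touched: only the caller-supplied predictor and the log are read.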

Traceability

No plugins / runs linked yet.

Expected benefit

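- The reflection layer is no longer only a "summary of past logs" but also an "exploration of hypothetical futures", which supports R10's counterfactual credit and R-PE's counterfactual PE estimation.
- The background-slow loop can produce evidence about "what would have happened under a different regime or a different set of commitments at that moment", rather than only "which turns of that past conversation had high PE".
- Fully compatible with R2 (rollouts happen only on the controller side; the substrate is untouched).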
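As an illustration of what such counterfactual evidence might look like as data, here is a small sketch. Everything in it is an assumption made for the example: `LoggedTurn`, `CounterfactualEvidence`, `counterfactual_evidence`, the regime labels, and the caller-supplied `estimate_pe` callable (a stand-in for whatever imagination rollout would produce the estimate) do not come from the spec or the codebase.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class LoggedTurn:
    """One turn from the real log (hypothetical shape)."""
    turn: int
    regime: str
    pe: float  # prediction error actually observed on this turn


@dataclass(frozen=True)
class CounterfactualEvidence:
    """Compares the logged PE with the PE estimated under another regime."""
    turn: int
    logged_regime: str
    alt_regime: str
    logged_pe: float
    imagined_pe: float

    @property
    def pe_delta(self) -> float:
        """Negative means the alternative regime is estimated to lower PE."""
        return self.imagined_pe - self.logged_pe


def counterfactual_evidence(
    turns: Sequence[LoggedTurn],
    alt_regime: str,
    estimate_pe: Callable[[LoggedTurn, str], float],  # assumption: imagination-backed estimator
    min_pe: float = 0.0,
) -> List[CounterfactualEvidence]:
    """Build evidence records for turns worth revisiting, most promising first."""
    records = []
    for t in turns:
        if t.regime == alt_regime or t.pe < min_pe:
            continue  # nothing counterfactual to ask, or PE too low to matter
        records.append(
            CounterfactualEvidence(
                turn=t.turn,
                logged_regime=t.regime,
                alt_regime=alt_regime,
                logged_pe=t.pe,
                imagined_pe=estimate_pe(t, alt_regime),
            )
        )
    return sorted(records, key=lambda r: r.pe_delta)
```

A pure log-replay reflection can only fill in `logged_pe`. The `imagined_pe` field is what requires some form of rollout, which is the gap this proposal targets.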

Cited paper

**A1. Hafner D, Yan W, Lillicrap T (DeepMind). *Training Agents Inside of Scalable World Models* (Dreamer 4). arXiv:2509.24527, 2025.**

- Document location: [`research/papers/dm/dreamer4-training-agents-inside-scalable-world-models-2509.24527.pdf`](../../research/papers/dm/dreamer4-training-agents-inside-scalable-world-models-2509.24527.pdf)
- Abstract (condensed):
  > World models learn general knowledge from videos and simulate experience for training behaviors in imagination, offering a path towards intelligent agents. However, previous world models have been unable to accurately predict object interactions in complex environments. We introduce Dreamer 4, a scalable agent that learns to solve control tasks by reinforcement learning inside of a fast and accurate world model. ... The world model achieves real-time interactive inference on a single GPU through a shortcut forcing objective and an efficient transformer architecture. Moreover, the world model learns general action conditioning from only a small amount of data, allowing it to extract the majority of its knowledge from diverse unlabeled videos. ... By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction.
- Key points:
  - (1) The world model and the policy are decoupled: the world model is trained first, and imagination RL then runs inside it, so PE does not flow back and contaminate the substrate (fully compatible with our R2).
  - (2) **Shortcut forcing** keeps imagination rollouts from diverging, which gives our "imagination-style background-slow reflection" an engineering reference (do not let simulated trajectories drift away).
  - (3) Action conditioning can be learned from very few labels, which supports training our affordance representation with a small amount of supervision.

---