DM-5 · DM · Spec-level · PROPOSED

Introduce imagination rollout into background-slow reflection

—

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner: —
Phase-A verdict: —
Shadow profile: —
Source papers: Hafner/Yan/Lillicrap 2025, Dreamer 4 (A1)
Specs: docs/specs/multi-timescale-learning.md

Blind spot (current state)

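Does the ReflectionEngine in [`packages/vz-cognition/`](../../packages/vz-cognition/) currently only replay historical logs? Has an imagination mode (rolling out hypothetical scenarios inside a world model) been discussed? If the engine only replays history, the reflection layer can only "summarize the past" and cannot "explore hypothetical futures", which leaves R10's counterfactual credit and R-PE's counterfactual PE estimation without a mechanistic basis.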

Adoptable suggestions (actionable steps)

1. Add a candidate paragraph on "imagination-based reflection" to the background-slow section of [`docs/specs/multi-timescale-learning.md`](../specs/multi-timescale-learning.md). (PROPOSED)
Not a runnable A/B candidate — evaluated by the path above, not ablation.
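2. **Key constraint**: the implementation path must be "frozen base + controller-internal rollout"; do **not** train a giant latent world model (this keeps the proposal compatible with R2). One concrete form: maintain a lightweight predictor inside the `world_temporal` / `self_temporal` controllers, which the ReflectionEngine calls at background-slow time for short-horizon imagination. (PROPOSED)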
Not a runnable A/B candidate — evaluated by the path above, not ablation.
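3. Borrow the idea behind Dreamer 4's **shortcut forcing objective** to prevent imagination drift: calibrate every imagination step against a real transition as ground truth, so that the rollout does not wander further and further from reality. (PROPOSED)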
Not a runnable A/B candidate — evaluated by the path above, not ablation.
4. Evidence first: in SHADOW mode, run a comparison of "pure log-replay reflection vs. imagination-augmented reflection" and check whether PE decreases faster. (PROPOSED)
Not a runnable A/B candidate — evaluated by the path above, not ablation.
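
Suggestions 2 and 3 can be sketched as a toy example. This is a minimal illustration under stated assumptions, not the ReflectionEngine's actual design: `Transition`, `LightweightPredictor`, `grounded_rollout`, the per-action offset model, and the `max_horizon` cap are all hypothetical names and choices introduced here. "Grounding" is interpreted as recalibrating the predictor on one logged transition before each imagined step; this is one possible reading of the suggestion, not Dreamer 4's training objective.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

State = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    """One real (logged) step: state, action taken, observed next state."""
    state: State
    action: int
    next_state: State


class LightweightPredictor:
    """Hypothetical controller-side predictor: next_state ~= state + delta[action].

    It owns only a small per-action offset table; the frozen base is never touched.
    """

    def __init__(self, dim: int, n_actions: int, lr: float = 0.5) -> None:
        self.lr = lr
        self.delta: List[List[float]] = [[0.0] * dim for _ in range(n_actions)]

    def predict(self, state: State, action: int) -> State:
        return tuple(s + d for s, d in zip(state, self.delta[action]))

    def calibrate(self, t: Transition) -> float:
        """Move delta[action] toward a real transition; return the pre-update L2 error."""
        pred = self.predict(t.state, t.action)
        err = [n - p for n, p in zip(t.next_state, pred)]
        self.delta[t.action] = [
            d + self.lr * e for d, e in zip(self.delta[t.action], err)
        ]
        return sum(e * e for e in err) ** 0.5


def grounded_rollout(
    predictor: LightweightPredictor,
    log: Sequence[Transition],
    start: int,
    actions: Sequence[int],
    max_horizon: int = 3,
) -> List[State]:
    """Imagine a short trajectory from log[start].state under hypothetical actions.

    Before each imagined step, the predictor is calibrated on the logged
    transition aligned with that step (if the log still has one), and the
    rollout is cut off at max_horizon steps to bound drift.
    """
    state = log[start].state
    trajectory = [state]
    for k, action in enumerate(actions[:max_horizon]):
        if start + k < len(log):
            predictor.calibrate(log[start + k])  # grounding on a real transition
        state = predictor.predict(state, action)  # imagined step
        trajectory.append(state)
    return trajectory
```

In this toy model an action that never appears in the log keeps a zero offset, so a counterfactual rollout over it predicts no change. A real predictor would need some way to generalize across actions, which this sketch deliberately leaves open.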

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

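- The reflection layer becomes more than a "summary of past logs"; it becomes an "exploration of hypothetical futures", which supports R10's counterfactual credit and R-PE's counterfactual PE estimation.
- The background-slow loop can produce evidence of the form "what would have happened under a different regime or a different set of commitments", not only "which turns of that past conversation had high PE".
- Fully compatible with R2 (rollouts happen only on the controller side; the base is not touched).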

Cited paper

**A1. Hafner D, Yan W, Lillicrap T (DeepMind). *Training Agents Inside of Scalable World Models* (Dreamer 4). arXiv:2509.24527, 2025.**

- Document location: [`research/papers/dm/dreamer4-training-agents-inside-scalable-world-models-2509.24527.pdf`](../../research/papers/dm/dreamer4-training-agents-inside-scalable-world-models-2509.24527.pdf)
- Abstract (condensed):
  > World models learn general knowledge from videos and simulate experience for training behaviors in imagination, offering a path towards intelligent agents. However, previous world models have been unable to accurately predict object interactions in complex environments. We introduce Dreamer 4, a scalable agent that learns to solve control tasks by reinforcement learning inside of a fast and accurate world model. ... The world model achieves real-time interactive inference on a single GPU through a shortcut forcing objective and an efficient transformer architecture. Moreover, the world model learns general action conditioning from only a small amount of data, allowing it to extract the majority of its knowledge from diverse unlabeled videos. ... By learning behaviors in imagination, Dreamer 4 is the first agent to obtain diamonds in Minecraft purely from offline data, without environment interaction.
- Key points:
  1. The world model and the policy are decoupled: the world model is trained first, and imagination RL then runs inside it, which keeps PE from propagating back and contaminating the base (fully compatible with our R2).
  2. **Shortcut forcing** keeps imagination rollouts from diverging, which gives our "imagination-style background-slow reflection" an engineering reference (do not let simulated trajectories drift away).
  3. Action conditioning can be learned from very few labels, which supports training our affordance representation with a small amount of supervision.

---