SYS-5 | SYS | P0/L | Spec-level | PROPOSED

Latent Action RL as the controller reinforcement baseline: run RL with a tiny controller attached to a frozen LLM

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner: —
Phase-A verdict: —
Shadow profile: —
Source papers: CoLA 2025 + FR-Ponder 2025
Specs: docs/specs/temporal-abstraction.md, docs/specs/multi-timescale-learning.md

Blind spot

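COG-7 mentions the latent reasoning budget (how long to think) but does not specify how to run reinforcement learning in latent space. We need a concrete engineering reference implementation to demonstrate that R4 (internal control does not happen in token space) is feasible.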

Adoptable suggestions

1. In [`docs/specs/temporal-abstraction.md`](../specs/temporal-abstraction.md), cite the CoLA and FR-Ponder architectures as the baseline references for reinforcement learning in the $z_t$ space. PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
2. Pin down the architecture: **on top of a frozen LLM, attach a tiny controller (<1M parameters) that runs RL directly in the latent action space**. PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
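3. Use this as the engineering-level reference implementation for `vz-temporal` (the temporal abstraction layer), guiding the Metacontroller's concrete network design and training pipeline. PROPOSED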
Not a runnable A/B candidate — evaluated by the path above, not ablation.
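
The architecture in suggestion 2 can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the CoLA or FR-Ponder implementation: the frozen LLM is replaced by a fixed random projection (`FROZEN_W`), the latent action set is a hypothetical discrete set of size `N_ACTIONS`, the reward function is invented for illustration, and the controller is a single linear softmax layer trained with plain REINFORCE. The only point it tries to show is the division of labor: the base weights are never written to, and only the small controller receives gradient updates.

```python
import math
import random

random.seed(0)

HIDDEN = 8      # toy hidden size; a real LLM hidden state is far larger
N_ACTIONS = 4   # hypothetical size of the discrete latent action set

# Stand-in for the frozen base model: a fixed random projection, never updated.
FROZEN_W = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

def frozen_hidden(x):
    """Toy substitute for the frozen LLM's hidden state h_t."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]

# Tiny controller: one linear layer giving logits over latent actions.
theta = [[0.0] * HIDDEN for _ in range(N_ACTIONS)]

def param_count():
    """Number of trainable parameters (controller only)."""
    return N_ACTIONS * HIDDEN

def policy(h):
    """Softmax distribution over latent actions given hidden state h."""
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in theta]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(h, action):
    """Hypothetical reward: the correct action encodes the signs of h[0] and h[1]."""
    target = (1 if h[0] > 0 else 0) + 2 * (1 if h[1] > 0 else 0)
    return 1.0 if action == target else 0.0

def train(steps=4000, lr=0.1):
    """REINFORCE with a running-mean baseline; updates theta only."""
    baseline = 0.0
    history = []
    for _ in range(steps):
        x = [random.gauss(0, 1) for _ in range(HIDDEN)]
        h = frozen_hidden(x)
        probs = policy(h)
        a = random.choices(range(N_ACTIONS), weights=probs)[0]
        r = reward(h, a)
        adv = r - baseline
        baseline += 0.05 * (r - baseline)
        # d log pi(a|h) / d logit_k = 1[k == a] - probs[k]
        for k in range(N_ACTIONS):
            g = ((1.0 if k == a else 0.0) - probs[k]) * adv
            for j in range(HIDDEN):
                theta[k][j] += lr * g * h[j]
        history.append(r)
    return history
```

Calling `train()` returns the per-step reward history, and `param_count()` gives the trainable parameter count, which can be checked against the <1M budget in a contract test. In a real setting the controller would read actual LLM hidden states and the chosen latent action would condition the frozen model's next step; neither is modeled here.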

Traceability

No plugins / runs linked yet.

Expected benefit

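- Provides a practical network structure and training paradigm for running RL in latent space, reducing engineering exploration risk.
- Demonstrates the advantage of a small-parameter controller on a frozen large base model, in both compute cost and task performance.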

Cited papers

**CoLA: Controlling LLMs with Latent Actions. arXiv:2503.21383, 2025.**

**FR-Ponder: Learning to Ponder: Adaptive Reasoning in Latent Space. arXiv:2509.24238, 2025.**

- Key point: embed a latent action space in a frozen LLM and run RL over the latent actions; a lightweight controller (<1M parameters) adaptively allocates reasoning compute.
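
The idea of a lightweight controller adaptively allocating reasoning compute can be sketched as a halting loop. This is a generic illustration, not the FR-Ponder algorithm: `refine` is a hypothetical stand-in for one frozen latent reasoning step, the controller is a two-parameter logistic unit on the squared norm of the latent state, and its parameters `w` and `b` are hand-set here, whereas in practice they would be learned (for example by RL).

```python
import math

MAX_STEPS = 8  # hypothetical cap on latent pondering steps

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def refine(h):
    """Stand-in for one frozen latent reasoning step: contracts h toward zero."""
    return [0.5 * v for v in h]

def halt_probability(h, w, b):
    """Tiny controller: logistic unit on the squared norm of h (parameters w, b)."""
    energy = sum(v * v for v in h)
    return sigmoid(w * energy + b)

def ponder(h, w=-4.0, b=2.0, threshold=0.5):
    """Keep refining until the controller votes to halt or the step cap is reached."""
    steps = 0
    while steps < MAX_STEPS and halt_probability(h, w, b) < threshold:
        h = refine(h)
        steps += 1
    return h, steps

_, easy_steps = ponder([0.1, 0.1])        # already "settled": halts immediately
_, hard_steps = ponder([3.0, 3.0])        # needs a few refinement steps
_, capped_steps = ponder([1000.0, 1000.0])  # hits the MAX_STEPS budget
```

With these hand-set parameters the controller halts once the squared norm drops to 0.5 or below, so inputs that start far from that region consume more steps, up to the fixed budget. The step count is the quantity a latent reasoning budget would constrain.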