CMA-2 | CMA | P1/M | Spec-level | PROPOSED

Implement the VZ-MemProbe evaluation suite (VZ adaptations of the four CMA probes)

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
Logan 2026 CMA
Specs
docs/specs/evaluation.md, docs/specs/continuum-memory.md

Blind spot (current state)

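VZ's memory spec is strong: `continuum_profile`, multi-band CMS, the session-post slow loop, bounded reflection apply, and the owner-side recall signal all have clear contracts. What is still missing is a **long-time-window, behavior-level evaluation suite** aimed specifically at R5/R6. Does the system actually prioritize new facts? Can it answer "what else happened around then"? Can it make multi-hop associations across owners? Can it avoid contamination when the same word appears in different contexts? Without this set of probes, R5/R6 remain more a spec-level belief than reproducible evidence.

CMA's four probes (Knowledge Updates / Temporal Association / Associative Recall / Disambiguation) are worth adopting before its concrete implementation, because they are read-only evaluation: they do not change the runtime or touch owner boundaries, so the benefit is high and the risk is low.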

Adoptable suggestions (actionable items)

  1. **VZ-Probe-Update**: The user first states a commitment / preference / belief A, then updates it to B. A query must return B first and be able to explain that A has expired. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

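  2. **VZ-Probe-Temporal**: Construct adjacent events before and after a session event. A query for "what else happened around X" must return the temporal neighborhood, not only semantically similar items. (PROPOSED)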

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  3. **VZ-Probe-Assoc**: Construct a multi-hop chain across owners (e.g. user_model → relationship_state → boundary_consent). A query must not stop at the first hop. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

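  4. **VZ-Probe-Context**: The same keyword means different things under different regimes / scenes (e.g. Python = animal / language, repair = relationship repair / code repair). Retrieval must not contaminate across contexts. (PROPOSED)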

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  5. Do not copy CMA's single GPT-4o judge for evaluation. Prefer deterministic contract assertions; for open-ended items, use an LLM judge plus cross-instance disagreement. All results serve only as an R12 readout and do not feed into reward. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.
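
The four probes and the evaluation rule in item 5 can be sketched as deterministic contract assertions. What follows is a minimal sketch under stated assumptions, not VZ's implementation: `Fact`, `ToyMemory`, `build_fixture`, the `probe_*` functions, and `disagreement_rate` are all hypothetical names introduced for illustration, and the toy store only stands in for whatever read-only interface the real memory exposes. The owner names and the Python example come from the probe descriptions; every other fixture value is invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    t: int                          # logical time, e.g. a session event index
    owner: str = "user_model"
    scene: str = "default"
    next_key: Optional[str] = None  # link to a fact held by another owner


@dataclass
class ToyMemory:
    """Hypothetical read-only stand-in for the memory under test."""
    facts: list = field(default_factory=list)

    def latest(self, key, scene="default"):
        hits = [f for f in self.facts if f.key == key and f.scene == scene]
        return max(hits, key=lambda f: f.t) if hits else None

    def expired(self, key, scene="default"):
        newest = self.latest(key, scene)
        return [f for f in self.facts
                if f.key == key and f.scene == scene and f != newest]

    def around(self, t, window=1):
        # Temporal neighbors only: selected by time distance, not similarity.
        return sorted((f for f in self.facts if 0 < abs(f.t - t) <= window),
                      key=lambda f: f.t)

    def chain(self, key, max_hops=3):
        # Follow cross-owner links and record which owners were visited.
        owners, fact = [], self.latest(key)
        while fact is not None and len(owners) < max_hops:
            owners.append(fact.owner)
            fact = self.latest(fact.next_key) if fact.next_key else None
        return owners


def build_fixture():
    return ToyMemory([
        Fact("editor_pref", "A", t=1),
        Fact("editor_pref", "B", t=2),             # update supersedes A
        Fact("before_x", "coffee", t=4),
        Fact("x", "standup", t=5),
        Fact("after_x", "deploy", t=6),
        Fact("unrelated", "standup notes", t=9),   # similar wording, not adjacent
        Fact("trust", "high", t=3, next_key="rapport"),
        Fact("rapport", "warm", t=3, owner="relationship_state",
             next_key="consent"),
        Fact("consent", "granted", t=3, owner="boundary_consent"),
        Fact("python", "animal", t=7, scene="zoo"),
        Fact("python", "language", t=8, scene="coding"),
    ])


def probe_update(mem):
    newest = mem.latest("editor_pref")
    stale = [f.value for f in mem.expired("editor_pref")]
    return newest.value == "B" and stale == ["A"]


def probe_temporal(mem):
    return [f.key for f in mem.around(t=5)] == ["before_x", "after_x"]


def probe_assoc(mem):
    return mem.chain("trust") == [
        "user_model", "relationship_state", "boundary_consent"]


def probe_context(mem):
    return (mem.latest("python", scene="zoo").value == "animal"
            and mem.latest("python", scene="coding").value == "language")


def disagreement_rate(verdicts):
    """Fraction of judge instances that differ from the majority verdict."""
    majority = max(set(verdicts), key=verdicts.count)
    return sum(v != majority for v in verdicts) / len(verdicts)


if __name__ == "__main__":
    mem = build_fixture()
    for probe in (probe_update, probe_temporal, probe_assoc, probe_context):
        print(probe.__name__, "pass" if probe(mem) else "fail")
```

Each probe only reads from the store and returns a boolean, so it can run as a contract test without touching the runtime. For open-ended items, `disagreement_rate` would be applied to the verdicts of several independent judge instances, and a high value flags the item for review rather than feeding any score back into reward.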

Traceability

No plugins / runs linked yet.

Expected benefit

- Upgrade R5/R6 from "correct by design" to "demonstrable in behavior".
- Give CMA-1's spreading activation shadow path an objective measure of benefit, so the mechanism is not adopted for its own sake.
- Together with [`tests/contracts/test_common_ground_active_matched_control.py`](../../tests/contracts/test_common_ground_active_matched_control.py) and [`tests/contracts/test_session_post_slow_loop_active_matched_control.py`](../../tests/contracts/test_session_post_slow_loop_active_matched_control.py), form an evidence surface that combines short- and long-horizon tests.

Cited paper

**D1. Logan J. *Continuum Memory Architectures for Long-Horizon LLM Agents*. arXiv:2601.09913, 2026.**

- Document location: [`research/openai-frontier-2026/papers/D1_continuum_memory_architectures.pdf`](../../research/openai-frontier-2026/papers/D1_continuum_memory_architectures.pdf)
- Abstract excerpt (condensed):
  > Across all probes, CMA won 82 of 92 decisive trials with large or very large effect sizes. Latency increased by roughly 2.4× due to graph traversal and post-retrieval updates.
- Key point: What is most worth adopting first from CMA is not the graph mutation implementation but the four behavioral probes. VZ needs its own VZ-MemProbe to show that continuum memory really does beat static RAG / simple summarization on long-term behavior.

---