CMA-2 · CMA · P1/M · Spec-level · PROPOSED

Implement the VZ-MemProbe evaluation suite (a VZ adaptation of CMA's four probes)

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner: —
Phase-A verdict: —
Shadow profile: —
Source papers: Logan 2026 CMA
Specs: docs/specs/evaluation.md, docs/specs/continuum-memory.md

Blind spot (current state)

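VZ's memory spec is strong: `continuum_profile`, multi-band CMS, the session-post slow loop, bounded reflection apply, and the owner-side recall signal all have clear contracts. What is still missing is a **long-time-window, behavior-level evaluation suite** aimed specifically at R5/R6. Does the system really prioritize newer facts? Can it answer "what else happened around then"? Can it make multi-hop associations across owners? Can it avoid contamination when the same word appears in different contexts? Without this set of probes, R5/R6 remain closer to spec-level belief than to reproducible evidence. CMA's four probes (Knowledge Updates / Temporal Association / Associative Recall / Disambiguation) are worth absorbing before its concrete implementation because they are read-only evaluations: they do not change the runtime or touch owner boundaries, so the payoff is high and the risk is low.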
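To make the four probe families concrete, the sketch below models a probe case as an immutable record with a deterministic pass check. It is an illustration under stated assumptions only: `ProbeKind`, `ProbeCase`, `passes`, and all field names are hypothetical and do not come from the VZ specs or the CMA paper.

```python
from dataclasses import dataclass
from enum import Enum


class ProbeKind(Enum):
    """The four CMA probe families named above."""
    KNOWLEDGE_UPDATE = "knowledge_update"
    TEMPORAL_ASSOCIATION = "temporal_association"
    ASSOCIATIVE_RECALL = "associative_recall"
    DISAMBIGUATION = "disambiguation"


@dataclass(frozen=True)  # frozen: a probe case cannot be mutated after creation
class ProbeCase:
    kind: ProbeKind
    setup_events: tuple[str, ...]  # hypothetical: events replayed before the query
    query: str
    must_include: frozenset[str]   # items the retrieval has to surface
    must_exclude: frozenset[str] = frozenset()  # items that count as contamination


def passes(case: ProbeCase, retrieved: set[str]) -> bool:
    """Deterministic check: every required item present, no forbidden item present."""
    return case.must_include <= retrieved and not (case.must_exclude & retrieved)


# Hypothetical disambiguation case: the same word in two contexts.
case = ProbeCase(
    kind=ProbeKind.DISAMBIGUATION,
    setup_events=("python (animal) seen at the zoo", "python (language) used for a script"),
    query="Which python bit the keeper?",
    must_include=frozenset({"python:animal"}),
    must_exclude=frozenset({"python:language"}),
)
print(passes(case, {"python:animal"}))                     # True
print(passes(case, {"python:animal", "python:language"}))  # False
```

Because the check only reads a set of retrieved identifiers, a probe built this way observes the memory system without writing to it, which matches the read-only property described above.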

Adoptable suggestions (actionable items)

1. **VZ-Probe-Update**: the user first states a commitment / preference / belief A, then updates it to B; a query must return B first and be able to explain that A has expired. (PROPOSED)
2. **VZ-Probe-Temporal**: construct adjacent events before and after a given session event; a query for "what else happened around X" must return the temporal neighborhood, not only semantically similar items. (PROPOSED)
3. **VZ-Probe-Assoc**: construct a multi-hop chain across owners (for example user_model → relationship_state → boundary_consent); a query must not stop at the first hop. (PROPOSED)
4. **VZ-Probe-Context**: the same keyword means different things in different regimes / scenes (for example Python = animal / language, repair = relationship repair / code repair); retrieval must not contaminate across contexts. (PROPOSED)
5. Do not copy CMA's single GPT-4o judge for evaluation: prefer deterministic contract assertions; for open-ended items use an LLM judge plus cross-instance disagreement; all results serve only as an R12 readout and never enter the reward. (PROPOSED)

None of these five items is a runnable A/B candidate; each is evaluated by the path above, not ablation.
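The Update and Temporal probes, together with the evaluation rule in item 5, can be sketched as plain contract assertions. Everything here is assumed for illustration: `MemoryItem`, `check_update`, `check_temporal`, `judge_disagreement`, and the field names are hypothetical, and the LLM judges are represented only by verdicts that have already been collected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """Hypothetical shape of one retrieved memory record."""
    key: str        # what the fact is about, e.g. "preferred_language"
    value: str
    timestamp: int  # event time; larger means more recent
    superseded: bool = False


def check_update(results: list[MemoryItem], key: str, new_value: str) -> bool:
    """Update probe: the top hit for `key` is B, and any older value returned is marked expired."""
    hits = [r for r in results if r.key == key]
    if not hits or hits[0].value != new_value:
        return False
    return all(r.superseded for r in hits[1:] if r.value != new_value)


def check_temporal(results: list[MemoryItem], anchor_time: int,
                   window: int, expected_keys: set[str]) -> bool:
    """Temporal probe: every expected neighbour inside the time window is returned."""
    nearby = {r.key for r in results if abs(r.timestamp - anchor_time) <= window}
    return expected_keys <= nearby


def judge_disagreement(verdicts: list[bool]) -> float:
    """Fraction of judge instances that disagree with the majority verdict."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    yes = sum(verdicts)
    return min(yes, len(verdicts) - yes) / len(verdicts)


# Hypothetical ranked retrieval result, best hit first.
results = [
    MemoryItem("preferred_language", "Rust", timestamp=20),
    MemoryItem("preferred_language", "Python", timestamp=10, superseded=True),
    MemoryItem("standup_moved", "to 10:00", timestamp=21),
]

# Readout only: nothing in this dict feeds a reward signal.
readout = {
    "update": check_update(results, "preferred_language", "Rust"),
    "temporal": check_temporal(results, anchor_time=20, window=2,
                               expected_keys={"standup_moved"}),
    "judge_disagreement": judge_disagreement([True, True, False]),
}
print(readout)
```

The deterministic checks return booleans that a contract test can assert directly, while the disagreement score stays a separate number, so an open-ended item on which judge instances split can be flagged rather than silently counted as a pass.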

Traceability

No plugins / runs linked yet.

Expected benefit

- Upgrades R5/R6 from "correct by design" to "behaviorally demonstrable".
- Gives CMA-1's spreading activation shadow path an objective measure of benefit, avoiding mechanism for mechanism's sake.
- Together with [`tests/contracts/test_common_ground_active_matched_control.py`](../../tests/contracts/test_common_ground_active_matched_control.py) and [`tests/contracts/test_session_post_slow_loop_active_matched_control.py`](../../tests/contracts/test_session_post_slow_loop_active_matched_control.py), forms an evidence surface that combines short-horizon and long-horizon checks.

Cited paper

**D1. Logan J. *Continuum Memory Architectures for Long-Horizon LLM Agents*. arXiv:2601.09913, 2026.**

- Document location: [`research/openai-frontier-2026/papers/D1_continuum_memory_architectures.pdf`](../../research/openai-frontier-2026/papers/D1_continuum_memory_architectures.pdf)
- Abstract excerpt (condensed):
  > Across all probes, CMA won 82 of 92 decisive trials with large or very large effect sizes. Latency increased by roughly 2.4× due to graph traversal and post-retrieval updates.
- Key takeaway: what is most worth absorbing from CMA first is not its graph mutation implementation but its four behavioral probes. VZ needs its own VZ-MemProbe to show that continuum memory really outperforms static RAG / simple summarization in long-horizon behavior.

---