COG-11 · COG · P1/M · Spec-level · PROPOSED

Agentic EQ / relationship long-horizon evaluation: proving the quality of relational presence, not just task success

—

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner: —
Phase-A verdict: —
Shadow profile: —
Source papers: COMPEER 2025 + RLFF-ESC 2025 + SocialSim 2025 + AgencyBench 2026 + Gaia2 2025
Specs: docs/specs/evaluation.md, docs/specs/evidence_program.md, docs/specs/dual-track-learning.md, docs/specs/social_cognition/04_common_ground.md

Blind spot (status quo)

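R12 already requires evaluation to cover "presence"; CMA-2 demonstrates memory continuity; EVO-2 / EVO-6 demonstrate the gate and generational progress. But the product core of digital life is relationship and agency, so it needs dedicated evidence of its own: does the system actually provide better support, repair, boundary maintenance, and common ground over a long-horizon relationship, rather than only succeeding at tasks or recalling memories correctly?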

Adoptable suggestions (actionable steps)

1. Add an "agentic EQ / long-horizon relationship" readout family or sub-family to evaluation: future-oriented support, repair quality, boundary-preserving empathy, common-ground continuity. · PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
2. Borrow COMPEER's psychological steps, but only as a readout rubric, not as a reward to train on directly. · PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
3. Borrow the future-oriented reward idea from RLFF-ESC: evaluate how a support strategy affects future turns / future relationship state, rather than how comforting it feels in the moment. · PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
4. Use SocialSim / AgencyBench / Gaia2 as references for the evaluation map: cover asynchrony, resource efficiency, and multi-agent / noisy settings. · PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
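
As a rough illustration of suggestions 1 to 3, the sketch below shows one way the readout family could be represented and contract-checked. Every identifier here (`SUB_READOUTS`, `Readout`, `validate`, `future_oriented_support`), the [0, 1] score scale, and the idea of a per-turn relationship-state score are assumptions made for the example; none of them come from the specs or the cited papers.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical names: the four sub-readouts mirror suggestion 1,
# but these identifiers are not defined in any spec.
SUB_READOUTS = (
    "future_oriented_support",
    "repair_quality",
    "boundary_preserving_empathy",
    "common_ground_continuity",
)


@dataclass(frozen=True)
class Readout:
    sub_readout: str              # expected to be one of SUB_READOUTS
    score: float                  # rubric score, assumed to lie in [0, 1]
    used_as_reward: bool = False  # suggestion 2: rubric only, never a reward


def validate(readout: Readout) -> None:
    """Contract check: known sub-readout, bounded score, not used as reward."""
    if readout.sub_readout not in SUB_READOUTS:
        raise ValueError(f"unknown sub-readout: {readout.sub_readout}")
    if not 0.0 <= readout.score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if readout.used_as_reward:
        raise ValueError("readouts are rubric-only and must not feed a reward")


def future_oriented_support(
    state_by_turn: list[float], support_turn: int, horizon: int
) -> float:
    """Suggestion 3, simplified: mean relationship-state score over the next
    `horizon` turns minus the score at the support turn itself."""
    future = state_by_turn[support_turn + 1 : support_turn + 1 + horizon]
    if not future:
        raise ValueError("no future turns to evaluate")
    return mean(future) - state_by_turn[support_turn]
```

A positive return value from `future_oriented_support` means the relationship state improved over the following turns. The value is a raw delta in [-1, 1], so it would need rescaling before being stored as a `Readout` score.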

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

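- Gives the claim "the product is the relationship, not IQ" its own independent evaluation evidence.
- Prevents the system from damaging trust / boundary / repair capabilities while its task-benchmark scores improve.
- Forms a direct closed loop with real-open-dialogue-learning-loop.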
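
The second benefit can be pictured as a regression check that runs alongside task benchmarks. The following is a minimal sketch under stated assumptions: the function name `regressed_readouts`, the readout keys, the scores, and the 0.02 tolerance are all hypothetical and chosen only for illustration.

```python
def regressed_readouts(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # assumed noise margin, not a spec value
) -> list[str]:
    """Return the readouts where the candidate is missing or scores more
    than `tolerance` below the baseline, sorted by name."""
    regressed = []
    for name, base_score in baseline.items():
        cand_score = candidate.get(name)
        if cand_score is None or cand_score < base_score - tolerance:
            regressed.append(name)
    return sorted(regressed)


# Illustrative numbers only: the candidate improves slightly on trust
# but drops sharply on boundary maintenance.
baseline = {"trust": 0.80, "boundary": 0.75, "repair": 0.70}
candidate = {"trust": 0.81, "boundary": 0.60, "repair": 0.69}
print(regressed_readouts(baseline, candidate))  # ['boundary']
```

A candidate that raises task-benchmark scores but returns a non-empty list here would be flagged for review rather than treated as a clear improvement.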

Cited papers

**COMPEER** (2025), **RLFF-ESC** (2025), **SocialSim** (2025), **AgencyBench** (2026), **Gaia2 / ARE platform** (2025). See [`research/arxiv-survey-2026-05.md`](../../research/arxiv-survey-2026-05.md) §8-§9.