Agentic EQ / long-horizon relationship evaluation: demonstrating relationship quality, not just task success
Evaluation modality
Spec-level. A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.
- Primary owner
- —
- Phase-A verdict
- —
- Shadow profile
- —
- Source papers
- COMPEER 2025 + RLFF-ESC 2025 + SocialSim 2025 + AgencyBench 2026 + Gaia2 2025
- Specs
- docs/specs/evaluation.md, docs/specs/evidence_program.md, docs/specs/dual-track-learning.md, docs/specs/social_cognition/04_common_ground.md
Blind spot (current state)
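R12 already requires evaluation to cover "existence"; CMA-2 demonstrates memory continuity; and EVO-2 / EVO-6 demonstrate the gate and generational progress. But the product core of digital life is relationship and agency, so one more thing still needs dedicated evidence: whether the system actually provides better support, repair, boundary maintenance, and common ground over a long-horizon relationship, rather than only succeeding at tasks or recalling memories correctly.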
Adoptable suggestions (actionable steps)
- 1. Add an "agentic EQ / long-horizon relationship" readout family or sub-family to evaluation: future-oriented support, repair quality, boundary-preserving empathy, common-ground continuity. PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 2. Borrow COMPEER's psychological steps, but only as a readout rubric, not as a reward to train on directly. PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 3. Borrow the future-oriented reward idea from RLFF-ESC: evaluate how a support strategy affects future turns / future relationship state, rather than how comforting it feels in the moment. PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
- 4. Use SocialSim / AgencyBench / Gaia2 as references for the evaluation map: cover asynchronous settings, resource efficiency, and multi-agent / noisy settings. PROPOSED
Not a runnable A/B candidate — evaluated by the path above, not ablation.
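To make suggestions 1 and 3 concrete, here is a minimal sketch of what a readout family and a future-oriented score could look like. It assumes per-turn rubric scores in [0, 1] already exist from some rater, which the source does not specify. Every name here (`SUB_READOUTS`, `TurnScores`, `future_oriented_score`, `readout_family`, `horizon`, `discount`) is hypothetical and not taken from docs/specs/evaluation.md.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical identifiers for the four sub-readouts named in suggestion 1.
SUB_READOUTS = (
    "future_oriented_support",
    "repair_quality",
    "boundary_preserving_empathy",
    "common_ground_continuity",
)


@dataclass(frozen=True)
class TurnScores:
    """Rubric scores for one turn, each assumed to lie in [0, 1]."""
    turn: int
    scores: dict  # sub-readout name -> score


def future_oriented_score(history, start, sub_readout, horizon=3, discount=0.8):
    """Discounted mean of one sub-readout over the turns AFTER `start`.

    The turn at `start` itself is excluded, so in-the-moment comfort
    does not count; only what follows the support strategy does.
    """
    future = [t for t in history if start < t.turn <= start + horizon]
    if not future:
        return None  # no follow-up evidence yet
    weights = [discount ** (t.turn - start - 1) for t in future]
    total = sum(w * t.scores[sub_readout] for w, t in zip(weights, future))
    return total / sum(weights)


def readout_family(history):
    """Per-sub-readout mean over an episode; a readout, not a training reward."""
    return {name: mean(t.scores[name] for t in history) for name in SUB_READOUTS}
```

A caller would score an episode once with `readout_family(history)` and then query `future_oriented_score(history, start=k, sub_readout="repair_quality")` for the turn where a support strategy was applied. Returning `None` when no later turns exist keeps "no evidence yet" distinct from a low score.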
Traceability
No plugins / runs linked yet.
Expected benefit
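- Gives the claim "the product is relationship, not IQ" its own independent evaluation evidence.
- Prevents the system from damaging trust / boundary / repair capabilities while task benchmarks improve.
- Forms a direct closed loop with real-open-dialogue-learning-loop.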
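The second benefit can be read as a regression guard. The sketch below assumes readouts are plain name-to-score mappings with a `task_success` key alongside relationship readouts. The key names, the `tolerance` default, and both function names are hypothetical illustrations, not values from the specs.

```python
def relationship_regressions(baseline, candidate, tolerance=0.02):
    """Return the relationship readouts that dropped by more than `tolerance`."""
    return sorted(
        name
        for name, base in baseline.items()
        if name != "task_success" and candidate[name] < base - tolerance
    )


def task_up_relationship_down(baseline, candidate, tolerance=0.02):
    """True when task success improved while a relationship readout regressed."""
    improved = candidate["task_success"] > baseline["task_success"]
    return improved and bool(relationship_regressions(baseline, candidate, tolerance))
```

Reporting which readouts regressed, rather than only a boolean, lets a reviewer see whether the damage was to trust, boundary, or repair.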
Cited papers
**COMPEER** (2025), **RLFF-ESC** (2025), **SocialSim** (2025), **AgencyBench** (2026), **Gaia2 / ARE platform** (2025). For details, see [`research/arxiv-survey-2026-05.md`](../../research/arxiv-survey-2026-05.md) §8-§9.