COG-11 · COG · P1/M · Spec-level · PROPOSED

Agentic EQ / long-horizon relationship evaluation: demonstrate the quality of relational presence, not just task success

Evaluation modality

Spec-level

A spec-motivation / governance borrow. Evaluated by spec review + contract tests, not A/B or ablation.

Primary owner
Phase-A verdict
Shadow profile
Source papers
COMPEER 2025 + RLFF-ESC 2025 + SocialSim 2025 + AgencyBench 2026 + Gaia2 2025
Specs
docs/specs/evaluation.md
docs/specs/evidence_program.md
docs/specs/dual-track-learning.md
docs/specs/social_cognition/04_common_ground.md

Blind spot (current state)

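R12 already requires evaluation to cover "presence", CMA-2 demonstrates memory continuity, and EVO-2 / EVO-6 demonstrate gating and cross-generation progress. But the product core of digital life is relationship and agency, so a dedicated demonstration is still needed: does the system actually provide better support, repair, boundary maintenance, and common ground over a long-horizon relationship, rather than merely succeeding at tasks or recalling memories correctly?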

Adoptable suggestions (actionable steps)

  1. Add an "agentic EQ / long-horizon relationship" readout family (or sub-family) to evaluation, covering future-oriented support, repair quality, boundary-preserving empathy, and common-ground continuity. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  2. Borrow COMPEER's psychological steps, but only as a readout rubric, not as a reward to train on directly. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  3. Borrow the future-oriented reward idea from RLFF-ESC: evaluate how a support strategy affects future turns / the future relationship state, rather than the comfort felt in the moment. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.

  4. Use SocialSim / AgencyBench / Gaia2 as references for the evaluation map: cover asynchrony, resource efficiency, and multi-agent / noisy settings. (PROPOSED)

    Not a runnable A/B candidate — evaluated by the path above, not ablation.
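
As a rough illustration of suggestions 1 and 3, the sketch below shows one possible shape for the readout family and a future-oriented comparison. Everything in it is an assumption: the class name `RelationshipReadout`, the function `future_impact`, the [0, 1] score range, and the idea that a rubric scorer produces one readout per turn are hypothetical and not taken from the linked specs. In line with suggestion 2, the scores are treated as read-only measurements, not as a training reward.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class RelationshipReadout:
    """Hypothetical rubric-scored readout for one turn; each score is in [0, 1]."""

    future_oriented_support: float
    repair_quality: float
    boundary_preserving_empathy: float
    common_ground_continuity: float

    def __post_init__(self) -> None:
        # Reject out-of-range scores so downstream comparisons stay meaningful.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")


def future_impact(
    readouts: list[RelationshipReadout], turn: int, horizon: int
) -> dict[str, float]:
    """Per-dimension change: mean over the next `horizon` turns minus the score at `turn`.

    A positive value means later turns scored higher than the turn where the
    support strategy was applied (an assumed proxy for "future relationship state").
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    future = readouts[turn + 1 : turn + 1 + horizon]
    if len(future) < horizon:
        raise ValueError("not enough future turns to score this horizon")
    now = readouts[turn]
    return {
        f.name: mean(getattr(r, f.name) for r in future) - getattr(now, f.name)
        for f in fields(RelationshipReadout)
    }
```

Calling `future_impact(readouts, turn=0, horizon=2)` on a scored session compares the two turns after turn 0 against turn 0 itself, so a strategy that feels comforting in the moment but is followed by lower `repair_quality` would show a negative value for that dimension.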

Traceability

No plugins / runs linked yet. Scaffold a suggestion to start.

Expected benefit

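- Gives the claim "the product is the relationship, not IQ" independent evaluation evidence.
- Prevents the system from damaging trust / boundary / repair capabilities while it improves on task benchmarks.
- Forms a direct closed loop with real-open-dialogue-learning-loop.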
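
The second benefit can be read as a regression guard. The sketch below is a minimal illustration, assuming readouts are available as name-to-score mappings for a baseline and a candidate system. The function name `relationship_regressions`, the readout names in the usage note, and the `tolerance` default are hypothetical and not defined by the specs.

```python
def relationship_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return the readouts where the candidate scores worse than baseline by more than `tolerance`.

    Hypothetical guard: an empty list means no relationship readout regressed,
    whatever happened to task-benchmark scores.
    """
    missing = set(baseline) - set(candidate)
    if missing:
        # A readout that silently disappears should fail loudly, not pass.
        raise KeyError(f"candidate is missing readouts: {sorted(missing)}")
    return sorted(
        name for name, base in baseline.items() if base - candidate[name] > tolerance
    )
```

For example, with a baseline of `{"trust": 0.8, "boundary": 0.7, "repair": 0.6}` and a candidate of `{"trust": 0.81, "boundary": 0.6, "repair": 0.59}`, only `boundary` is flagged: its drop exceeds the tolerance, while the small `repair` drop does not.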

Cited papers

**COMPEER** (2025), **RLFF-ESC** (2025), **SocialSim** (2025), **AgencyBench** (2026), **Gaia2 / ARE platform** (2025). For details, see [`research/arxiv-survey-2026-05.md`](../../research/arxiv-survey-2026-05.md) §8-§9.