united-workforce/.cards/eval-architecture.md at dfce57e9ca3c13d1b24abfca57b867e324462854

shazhou/united-workforce

xiaomo 678823d291

chore: remove completed eval plan, extract architecture to .cards

Plan fully implemented: CLI (run/report/diff/list), runner, 4 builtin judges,
CAS storage, task loader, 6 test files. New card captures design decisions.

2026-06-07 15:02:59 +00:00

---
title: "Eval Architecture — Task + Judge + CAS"
created: 2026-06-07
source: openclaw-xiaomo
tags: [architecture, decision]
category: architecture
links:
  - eval-closes-the-trust-chain
  - agent-cli-protocol
  - frontmatter-fast-path
---

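uwf-eval has a three-layer architecture:

- Task: the distributable unit of evaluation (task.yaml + a fixture directory + judge scripts). It defines the prompt, the workflow reference, the limits, and the list of judges with their weights.
- Judge: a standalone scoring script, invoked as `node <entry> <cwd> <thread-id>`, that writes a `{score, data}` JSON object to stdout. Judges are either builtin (frontmatter compliance, upstream consumption, hallucination detection, token statistics) or task-specific.
- CAS storage: the result of each eval run is stored as an OCAS typed node, so different runs can be compared with diff.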
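The Judge contract above can be sketched as a small Node script. Only the invocation shape (`node <entry> <cwd> <thread-id>`) and the `{score, data}` stdout payload come from this card. The function names, the `steps` input shape, and the 0-to-1 score range are assumptions made for illustration.

```javascript
// Hypothetical judge entry point. Only the CLI shape and the {score, data}
// stdout contract come from the card; everything else is illustrative.

// Pure scoring function so it can be tested without touching the filesystem.
// `steps` is an assumed input shape: one object per workflow step.
function scoreSteps(steps) {
  const total = steps.length;
  const passed = steps.filter((s) => s.ok === true).length;
  // Assumption: score is a fraction in [0, 1]; the card does not state a range.
  const score = total === 0 ? 0 : passed / total;
  return { score, data: { total, passed } };
}

// Parse the positional arguments of `node <entry> <cwd> <thread-id>`.
function parseArgs(argv) {
  const [cwd, threadId] = argv.slice(2);
  if (!cwd || !threadId) {
    throw new Error("usage: node <entry> <cwd> <thread-id>");
  }
  return { cwd, threadId };
}

// A real judge would load the thread's step outputs from `cwd` here;
// this sketch uses placeholder data instead.
function main(argv) {
  const { cwd, threadId } = parseArgs(argv);
  const steps = [{ ok: true }, { ok: false }]; // placeholder data
  const result = scoreSteps(steps);
  result.data.cwd = cwd;
  result.data.threadId = threadId;
  // The JSON result is the only thing written to stdout.
  console.log(JSON.stringify(result));
}

// Run only when both positional arguments are present.
if (process.argv.length >= 4) {
  main(process.argv);
}
```

Keeping the scoring logic in a pure function separate from argument parsing and I/O makes a task-specific judge easy to unit-test, which fits the card's point that judges are independent scripts.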

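Key design decision: uwf-eval is not part of uwf. It is a standalone package that shells out to the uwf CLI, which keeps the two decoupled. Judges are independent of one another and can run in parallel.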
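Because judges are independent, a runner can launch them concurrently. The following is a minimal sketch of that idea, not the actual uwf-eval runner: `runJudges`, `runJudgeProcess`, and the `{ name, entry }` judge shape are hypothetical names, and the choice to keep partial results when one judge fails is an assumption.

```javascript
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");

const execFileAsync = promisify(execFile);

// Spawn one judge with the CLI shape from the card and parse its stdout.
async function runJudgeProcess(entry, cwd, threadId) {
  const { stdout } = await execFileAsync(process.execPath, [entry, cwd, threadId]);
  return JSON.parse(stdout);
}

// Run all judges concurrently. `judges` is a hypothetical list of
// { name, entry } objects; `runOne` is injectable so the runner can be
// exercised without real judge scripts.
async function runJudges(judges, cwd, threadId, runOne = runJudgeProcess) {
  const settled = await Promise.allSettled(
    judges.map((j) => runOne(j.entry, cwd, threadId))
  );
  // Assumption: one failing judge does not discard the others' results.
  return judges.map((j, i) => {
    const s = settled[i];
    return s.status === "fulfilled"
      ? { name: j.name, ok: true, score: s.value.score, data: s.value.data }
      : { name: j.name, ok: false, error: String(s.reason) };
  });
}
```

`Promise.allSettled` is used instead of `Promise.all` so that a crash in one judge is recorded as a failure for that judge only, which matches the independence described above.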

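The four builtin judges:

- frontmatter: deterministic check of whether each step's frontmatter is compliant
- upstream: LLM-as-judge, checks whether upstream information was consumed
- hallucination: LLM-as-judge, checks whether the output contains hallucinations
- token-stats: informational metric, not included in the score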
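The card states that a task assigns weights to its judges and that token-stats does not count toward the score, but it does not give the aggregation formula. A minimal sketch, assuming a weighted mean over the scored judges and a default weight of 1 (the function name and result shape are hypothetical), might look like this:

```javascript
// Hypothetical aggregation. Assumption: the final score is the weighted mean
// of the scored judges; informational judges are reported but never scored.
function aggregate(results, weights, informational = ["token-stats"]) {
  let weightedSum = 0;
  let weightTotal = 0;
  const info = {};
  for (const r of results) {
    if (informational.includes(r.name)) {
      info[r.name] = r.data; // kept for reporting only
      continue;
    }
    const w = weights[r.name] ?? 1; // assumed default weight
    weightedSum += w * r.score;
    weightTotal += w;
  }
  const score = weightTotal === 0 ? 0 : weightedSum / weightTotal;
  return { score, info };
}

// Example with made-up scores and weights.
const example = aggregate(
  [
    { name: "frontmatter", score: 1, data: {} },
    { name: "upstream", score: 0.5, data: {} },
    { name: "hallucination", score: 0.5, data: {} },
    { name: "token-stats", score: 0, data: { tokens: 1200 } },
  ],
  { frontmatter: 2, upstream: 1, hallucination: 1 }
);
console.log(example.score); // 0.75
```

In this example the token-stats entry changes nothing in `score`; its data only appears under `info`.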
