test: E2E eval (real-agent effectiveness evaluation) #34

2026-06-04T04:16:12Z

xiaoju commented

2026-06-04 04:16:12 +00:00

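## Eval Framework: real-agent effectiveness evaluation

The test layer guarantees that the logic does not break; the eval layer guarantees that the agent "can do the job".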

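## Design

- `uwf-eval`: a standalone CLI (the examiner) that shells out to `uwf` (the examinee)
- Task = npm package (tarball) containing a fixture, `task.yaml`, and judge scripts
- Judge = Node script; input is (cwd, thread-id), output is `{score, data}` JSON
- All output is strongly typed with OCAS (`eval-run` plus each judge's own schema)
- Built-in judges: frontmatter compliance, upstream consumption, hallucination detection, token statistics
- Task bank: `shazhou/uwf-eval-tasks` monorepo (managed with proman)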
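The judge contract above can be sketched as a pair of types plus a toy judge. This is only an illustration: the names `JudgeInput`, `JudgeResult`, `threadIdPresent`, and `toJudgeOutput` are hypothetical, and the assumption that `score` lies in [0, 1] is mine, since the issue only fixes the input `(cwd, thread-id)` and the output shape `{score, data}`.

```typescript
// Hypothetical types: the issue only states the input (cwd, thread-id)
// and the output shape {score, data}.
interface JudgeInput {
  cwd: string; // working directory the agent operated in
  threadId: string; // identifier of the uwf thread being judged
}

interface JudgeResult<D> {
  score: number; // assumed here to lie in [0, 1]
  data: D; // judge-specific payload, described by the judge's own schema
}

type Judge<D> = (input: JudgeInput) => JudgeResult<D>;

// Toy judge: scores 1 when the thread id is non-empty, else 0.
const threadIdPresent: Judge<{ threadId: string }> = ({ threadId }) => ({
  score: threadId.length > 0 ? 1 : 0,
  data: { threadId },
});

// A judge script would print its result as a single JSON document.
function toJudgeOutput<D>(result: JudgeResult<D>): string {
  return JSON.stringify(result);
}

const output = toJudgeOutput(
  threadIdPresent({ cwd: "/tmp/fixture", threadId: "t-1" }),
);
console.log(output);
```

Keeping `data` generic lets each judge attach its own schema while the runner only relies on `score`.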

## Control variables

Each eval run can switch any of: workflow definition / uwf engine version / model / agent adapter
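One way to make the four switchable variables explicit is to record them per run and check which ones differ between two runs. The sketch below is an assumption about how that could look; the `EvalVariables` shape, its field names, and the sample values are all hypothetical and not taken from `uwf-eval`.

```typescript
// Hypothetical shape: field names are illustrative, not from uwf-eval.
interface EvalVariables {
  workflow: string; // workflow definition under test
  engineVersion: string; // uwf engine version
  model: string; // model name
  adapter: string; // agent adapter
}

// Returns the names of the variables that differ between two runs.
function changedVariables(
  a: EvalVariables,
  b: EvalVariables,
): (keyof EvalVariables)[] {
  const keys: (keyof EvalVariables)[] = [
    "workflow",
    "engineVersion",
    "model",
    "adapter",
  ];
  return keys.filter((key) => a[key] !== b[key]);
}

// A comparison isolates one variable when exactly one field differs.
function isControlled(a: EvalVariables, b: EvalVariables): boolean {
  return changedVariables(a, b).length === 1;
}

const baseline: EvalVariables = {
  workflow: "wf-a",
  engineVersion: "1.0.0",
  model: "model-x",
  adapter: "adapter-1",
};
const candidate: EvalVariables = { ...baseline, model: "model-y" };
console.log(changedVariables(baseline, candidate));
```

A `diff` between two eval runs is easiest to interpret when `isControlled` holds, because any score change can then be attributed to the single variable that moved.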

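## Evaluation dimensions

- **Process**: frontmatter compliance, consumption of upstream information, hallucination detection, token consumption, number of turns
- **Outcome**: cwd diff (code changes), tests passing, task completion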
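As an illustration of a process-dimension check, here is a minimal sketch of a frontmatter compliance judge. It assumes frontmatter is a leading `---` block of top-level `key: value` lines and that the caller supplies the required keys; the issue does not list them, so `status` and `summary` below are hypothetical, and a real judge would likely use a proper YAML parser.

```typescript
// Assumption: required keys are passed in by the caller.
interface FrontmatterReport {
  score: number; // fraction of required keys that are present
  data: { missing: string[] };
}

// Extracts top-level `key:` names from a leading `---` block.
function frontmatterKeys(text: string): string[] {
  const lines = text.split(/\r?\n/);
  if ((lines[0] ?? "").trim() !== "---") return [];
  const keys: string[] = [];
  for (const line of lines.slice(1)) {
    if (line.trim() === "---") return keys; // closing delimiter found
    const key = /^([A-Za-z$_][\w$-]*)\s*:/.exec(line)?.[1];
    if (key !== undefined) keys.push(key);
  }
  return []; // unterminated block: treat as no frontmatter
}

function judgeFrontmatter(text: string, required: string[]): FrontmatterReport {
  const present = new Set(frontmatterKeys(text));
  const missing = required.filter((key) => !present.has(key));
  const score =
    required.length === 0
      ? 1
      : (required.length - missing.length) / required.length;
  return { score, data: { missing } };
}

const sample = ["---", "status: done", "$usage: 1200", "---", "Body text."].join("\n");
const report = judgeFrontmatter(sample, ["status", "$usage", "summary"]);
console.log(JSON.stringify(report));
```

Returning the missing keys in `data` keeps the score explainable: a reader of the report can see which fields caused the deduction.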

## Sub-issues

| Phase | Issue | Description |
|-------|-------|-------------|
| 1a | #69 | eval package scaffold + CLI skeleton + OCAS schemas |
| 1b | #70 | `uwf-eval run`: prepare, execute, collect |
| 1c | #71 | builtin judges (frontmatter, upstream, hallucination, token-stats) |
| 1d | #72 | `uwf-eval report`, `diff`, `list` commands |
| 2 | #73 | uwf-eval-tasks monorepo + first task (fix-off-by-one) |
| dep | #68 | adapter $usage token reporting |

## Implementation plan

`.hermes/plans/2026-06-04-eval-framework.md`

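## Open Questions

1. LLM-as-judge provider config: reuse uwf's `config.yaml`, or give eval its own configuration?
2. Workflow file: bundled inside the tarball, or a reference to a registered workflow name? Support both?
3. Parallel judge execution: judges are independent of each other, so is parallelism worth it?
4. LLM judge consistency: run several times and take the mean, or accept the variance?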

— 小橘 🍊（NEKO Team）

xiaoju referenced this issue

2026-06-04 11:14:52 +00:00

chore: remove integration tests, migrate to eval framework #60

xiaoju referenced this issue from a commit

2026-06-04 12:21:29 +00:00

chore: remove integration tests, clean up CI exclusion

xiaoju referenced this issue

2026-06-04 12:21:44 +00:00

chore: remove integration tests, migrate to eval framework #62

xiaoju referenced this issue

2026-06-04 14:54:25 +00:00

feat: agent adapter token usage reporting ($usage in frontmatter) #68

xiaoju referenced this issue

2026-06-04 15:13:19 +00:00

feat: eval package scaffold + CLI skeleton + OCAS schemas #69

xiaoju referenced this issue

2026-06-04 15:13:20 +00:00

feat: uwf-eval run command — prepare, execute, collect #70

xiaoju referenced this issue

2026-06-04 15:13:23 +00:00

feat: builtin judges — frontmatter, upstream, hallucination, token-stats #71

xiaoju referenced this issue

2026-06-04 15:13:24 +00:00

feat: uwf-eval report, diff, list commands #72

xiaoju referenced this issue

2026-06-04 15:13:42 +00:00

feat: uwf-eval-tasks monorepo + first task (fix-off-by-one) #73

xiaoju commented

2026-06-04 15:15:00 +00:00

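## Open Questions: Resolved

1. **LLM-as-judge provider**: each judge script defines its own (it ships its own LLM call logic and configuration)
2. **Workflow file**: bundled inside the task package
3. **Parallel judges**: may be supported, not mandatory
4. **Averaging multiple LLM judge runs**: may be supported, not mandatory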
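Both optional features from items 3 and 4 are small to sketch. The example below is a hedged illustration, not the package's implementation: `AsyncJudge`, `runJudges`, `repeatJudge`, and `summarize` are hypothetical names, and a judge is simplified to a function that resolves to a numeric score.

```typescript
// Mean and sample variance of repeated judge scores.
function summarize(scores: number[]): { mean: number; variance: number } {
  if (scores.length === 0) throw new Error("no scores to summarize");
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.length < 2
      ? 0
      : scores.reduce((sum, s) => sum + (s - mean) * (s - mean), 0) /
        (scores.length - 1);
  return { mean, variance };
}

// Simplification: a judge is any function resolving to a score.
type AsyncJudge = () => Promise<number>;

// Runs independent judges concurrently; results keep the input order.
async function runJudges(judges: AsyncJudge[]): Promise<number[]> {
  return Promise.all(judges.map((judge) => judge()));
}

// Repeats one judge `times` times and summarizes the scores.
async function repeatJudge(judge: AsyncJudge, times: number) {
  const scores = await runJudges(Array.from({ length: times }, () => judge));
  return summarize(scores);
}

const stats = summarize([0.5, 1, 0.75]);
console.log(stats.mean, stats.variance);
```

Reporting the variance next to the mean makes it visible when an LLM judge is too unstable for its score to be trusted from a single run.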

— 小橘 🍊（NEKO Team）

xiaoju referenced this issue

2026-06-04 15:46:38 +00:00

feat: add $usage field to adapter protocol #80

xiaoju referenced this issue

2026-06-04 23:46:13 +00:00

feat: eval package scaffold — CLI + schemas + types + task loader #85

xiaoju commented

2026-06-05 02:05:17 +00:00

Eval framework complete ✅

- `@united-workforce/eval` package: prepare → execute → collect pipeline
- Built-in judges: frontmatter-compliance, token-stats, test-pass
- CAS-stored eval-run results
- Full pipeline validated with the `fix-add-bug` task
- Task monorepo: `xiaoju/uwf-eval-tasks`
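The prepare → execute → collect pipeline can be pictured as three injected stages. This is a sketch under stated assumptions: the `Stages` interface, the stage signatures, and the stub values are hypothetical, since the thread does not show the real `@united-workforce/eval` API.

```typescript
// Hypothetical stage signatures; the real package's API is not shown here.
interface Prepared {
  cwd: string;
}
interface Executed extends Prepared {
  threadId: string;
}
interface EvalRun {
  threadId: string;
  scores: Record<string, number>;
}

interface Stages {
  prepare: (taskName: string) => Prepared; // e.g. unpack the task fixture
  execute: (prepared: Prepared) => Executed; // e.g. shell out to uwf
  collect: (executed: Executed) => EvalRun; // e.g. run judges, build the result
}

// Chains the three stages in order and returns the collected result.
function runEval(taskName: string, stages: Stages): EvalRun {
  const prepared = stages.prepare(taskName);
  const executed = stages.execute(prepared);
  return stages.collect(executed);
}

// Stub stages so the sketch runs without uwf installed.
const stubStages: Stages = {
  prepare: (taskName) => ({ cwd: `/tmp/${taskName}` }),
  execute: (prepared) => ({ ...prepared, threadId: "thread-1" }),
  collect: (executed) => ({
    threadId: executed.threadId,
    scores: { "test-pass": 1 },
  }),
};

const run = runEval("fix-add-bug", stubStages);
console.log(JSON.stringify(run));
```

Injecting the stages keeps the orchestration testable with stubs, while the real `execute` stage would be the only part that needs `uwf` on the machine.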

— 小橘 🍊（NEKO Team）

xiaoju closed this issue

2026-06-05 02:05:18 +00:00

Reference: shazhou/united-workforce#34