feat: eval package scaffold — CLI + schemas + types + task loader #85

Merged
xiaomo merged 4 commits from feat/69-eval-scaffold into main 2026-06-05 00:23:57 +00:00
Owner

What

New package @united-workforce/eval — the eval framework skeleton for #34.

Why

Need a separate package to evaluate uwf workflow quality with real agents. The examiner stays out of the examinee: eval is independent from the engine.

Changes

  • packages/eval/ — new package with uwf-eval CLI binary
  • src/cli.ts — commander-based CLI with run/report/diff/list subcommands (stubs)
  • src/task/ — TaskManifest type + parseTaskManifest() YAML parser/validator
  • src/judge/ — JudgeOutput<T> / JudgeInput types defining judge contract
  • src/storage/schemas.ts — 5 OCAS JSONSchema definitions
  • src/storage/types.ts — EvalRunPayload, EvalRunConfig, EvalJudgeRecord
  • 19 unit tests (task loader + schema definitions)

775 tests passing

Refs #69, parent #34

— 小橘 🍊(NEKO Team)

xiaoju added 1 commit 2026-06-04 23:46:13 +00:00
New package @united-workforce/eval (uwf-eval CLI):

- CLI skeleton: run/report/diff/list subcommands (stubs)
- 5 OCAS schemas: eval-run, judge-frontmatter, judge-upstream,
  judge-hallucination, judge-token-stats
- TaskManifest type + parser/validator for task.yaml
- JudgeOutput/JudgeInput types for judge contract
- EvalRunPayload/EvalRunConfig/EvalJudgeRecord storage types
- 19 unit tests: task loader validation + schema definitions

Refs #69
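
For readers unfamiliar with the task loader, the validation step might look roughly like the sketch below. The PR does not show the real `TaskManifest` fields, so the `name` / `judges[].name` / `judges[].weight` shape and the `validateTaskManifest` helper are assumptions made for illustration. The sketch also starts from an already-parsed YAML document instead of raw text.

```typescript
// Hypothetical manifest shape; the actual TaskManifest in src/task/ may differ.
interface JudgeSpec {
  name: string;   // judge identifier, e.g. "frontmatter-compliance"
  weight: number; // assumed: 0 marks an informational judge
}

interface TaskManifest {
  name: string;
  judges: JudgeSpec[];
}

// Validate a parsed YAML document (a plain object) and narrow it to TaskManifest.
// Throws an Error describing the first problem found.
function validateTaskManifest(doc: unknown): TaskManifest {
  if (typeof doc !== "object" || doc === null) {
    throw new Error("task manifest must be a mapping");
  }
  const record = doc as Record<string, unknown>;
  if (typeof record.name !== "string" || record.name.trim() === "") {
    throw new Error("task manifest: 'name' must be a non-empty string");
  }
  if (!Array.isArray(record.judges) || record.judges.length === 0) {
    throw new Error("task manifest: 'judges' must be a non-empty list");
  }
  const judges = record.judges.map((entry: unknown, i: number): JudgeSpec => {
    const j = (entry ?? {}) as Record<string, unknown>;
    if (typeof j.name !== "string" || j.name === "") {
      throw new Error(`task manifest: judges[${i}].name must be a string`);
    }
    if (typeof j.weight !== "number" || !Number.isFinite(j.weight) || j.weight < 0) {
      throw new Error(`task manifest: judges[${i}].weight must be a non-negative number`);
    }
    return { name: j.name, weight: j.weight };
  });
  return { name: record.name, judges };
}
```

Failing fast with a field-specific message keeps a malformed task.yaml from reaching the execute phase.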
xiaoju added 1 commit 2026-06-04 23:59:30 +00:00
feat: eval run command — prepare, execute, collect pipeline
CI / check (pull_request) Successful in 1m45s
fae9e9ed3a
Implement the uwf-eval run <task-dir> command with 3-phase pipeline:

- prepare: read task.yaml, copy fixture/ to temp workdir
- execute: shell out to uwf thread start + exec
- collect: run judges, compute weighted score, store CAS node,
  set @uwf/eval/<task>/latest variable

Changes:
- src/runner/ — types, prepare, execute, collect, index
- src/storage/store.ts — createEvalStore(), setEvalLatest()
- src/commands/run.ts — full pipeline wiring with --agent/--model/--count
- 9 new tests (prepare + collect + weighted scoring)

Builtin judges return placeholder score 0 (Phase 1c).

Refs #70
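
The weighted score in the collect phase can be pictured as a weight-normalized average. This is a minimal sketch, not the code in `src/runner/collect.ts`: the `JudgeResult` shape, the `weightedScore` name, and the choice to return 0 when no judge carries weight are all assumptions.

```typescript
// Hypothetical result record for one judge.
interface JudgeResult {
  name: string;
  score: number;  // assumed to lie in [0, 1]
  weight: number; // 0 = informational only, excluded from the overall score
}

// Weighted average over judges with positive weight.
// Zero-weight judges are skipped so they cannot move the overall score.
function weightedScore(results: JudgeResult[]): number {
  let weightSum = 0;
  let scoreSum = 0;
  for (const r of results) {
    if (r.weight <= 0) continue;
    weightSum += r.weight;
    scoreSum += r.weight * r.score;
  }
  // Assumption: with no weighted judges there is nothing to average, so report 0.
  return weightSum === 0 ? 0 : scoreSum / weightSum;
}

const overall = weightedScore([
  { name: "frontmatter-compliance", score: 0.5, weight: 2 },
  { name: "hallucination", score: 1.0, weight: 2 },
  { name: "token-stats", score: 1.0, weight: 0 }, // informational, ignored
]);
console.log(overall); // 0.75
```

Guarding the division also avoids a NaN overall score when every configured judge is informational.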
xiaoju added 1 commit 2026-06-05 00:09:13 +00:00
Implement 4 builtin judges for eval framework:

- frontmatter-compliance: validates YAML frontmatter with $status field,
  score = stepsValid / stepsTotal
- token-stats: aggregates Usage from step nodes, always score 1.0
  (informational only)
- upstream-consumption: LLM-as-judge stub (score 0, TODO)
- hallucination: LLM-as-judge stub (score 0, TODO)

Infrastructure:
- judge/builtin/read-steps.ts — shell out to uwf step list
- judge/builtin/types.ts — BuiltinJudge, BuiltinJudgeOutput
- runner/collect.ts — dispatch builtin judges by name

9 new tests (frontmatter validation + token aggregation)

Refs #71
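
The frontmatter-compliance rule (score = stepsValid / stepsTotal) can be sketched as follows. This is an illustration under stated assumptions: step outputs are taken as plain strings, the frontmatter check is a simple regex instead of a real YAML parse, and the helper names and the zero-step behavior are hypothetical.

```typescript
// True when the text opens with a `---` delimited block that contains a
// non-empty top-level `$status:` key. A regex stands in for a YAML parser here.
function hasStatusFrontmatter(stepOutput: string): boolean {
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(stepOutput);
  if (match === null) return false;
  const body = match[1] ?? "";
  return body.split(/\r?\n/).some((line) => /^\$status\s*:\s*\S/.test(line));
}

interface ComplianceResult {
  score: number;
  stepsValid: number;
  stepsTotal: number;
}

// score = stepsValid / stepsTotal, as described in the commit message.
function frontmatterCompliance(stepOutputs: string[]): ComplianceResult {
  const stepsTotal = stepOutputs.length;
  const stepsValid = stepOutputs.filter(hasStatusFrontmatter).length;
  // Assumption: a thread with no steps scores 0 instead of dividing by zero.
  const score = stepsTotal === 0 ? 0 : stepsValid / stepsTotal;
  return { score, stepsValid, stepsTotal };
}

const compliance = frontmatterCompliance([
  "---\n$status: done\n---\nbody",
  "---\ntitle: missing status\n---\nbody",
  "no frontmatter at all",
  "---\n$status: failed\nnote: x\n---\n",
]);
console.log(compliance); // { score: 0.5, stepsValid: 2, stepsTotal: 4 }
```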
xiaoju added 1 commit 2026-06-05 00:19:32 +00:00
feat: eval report, diff, list commands
CI / check (pull_request) Successful in 1m44s
ae81e4b5ac
Implement the 3 read commands for eval framework:

- report: read eval-run from CAS, render formatted text
  (task, overall, config, judges table, thread ID)
- diff: side-by-side comparison with ▲/▼ delta indicators
  and config change markers
- list: scan @uwf/eval/*/latest variables, sort by timestamp desc,
  --task filter, --limit pagination

Architecture: pure formatting functions (format.ts) + data access
(read.ts) + thin CLI handlers. Types in types.ts.

11 new tests (formatReport, formatDiff, formatList, selectEntries)

Refs #72
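
The list and diff behavior described above (sort by timestamp descending, `--task` filter, `--limit`, and ▲/▼ deltas) lends itself to pure functions. The sketch below is only illustrative: the `EvalEntry` fields, the ISO 8601 timestamp format, the two-decimal delta format, and the `*Sketch` function names are assumptions, not the signatures in `format.ts` or `read.ts`.

```typescript
// Hypothetical summary of one stored eval run.
interface EvalEntry {
  task: string;
  timestamp: string; // assumed ISO 8601
  overall: number;
}

// Filter by task (optional), sort newest first, then apply the limit (optional).
function selectEntriesSketch(
  entries: EvalEntry[],
  opts: { task?: string; limit?: number } = {},
): EvalEntry[] {
  const filtered =
    opts.task === undefined ? entries : entries.filter((e) => e.task === opts.task);
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...filtered].sort(
    (a, b) => Date.parse(b.timestamp) - Date.parse(a.timestamp),
  );
  return opts.limit === undefined ? sorted : sorted.slice(0, Math.max(0, opts.limit));
}

// Render a score change with an up or down marker; "=" when unchanged.
function formatDeltaSketch(before: number, after: number): string {
  const delta = after - before;
  if (delta === 0) return "=";
  const arrow = delta > 0 ? "▲" : "▼";
  return `${arrow} ${Math.abs(delta).toFixed(2)}`;
}

const runs: EvalEntry[] = [
  { task: "a", timestamp: "2026-06-01T00:00:00Z", overall: 0.5 },
  { task: "b", timestamp: "2026-06-03T00:00:00Z", overall: 0.75 },
  { task: "a", timestamp: "2026-06-02T00:00:00Z", overall: 0.75 },
];
console.log(selectEntriesSketch(runs, { limit: 2 }).map((e) => e.task)); // [ 'b', 'a' ]
console.log(formatDeltaSketch(0.5, 0.75)); // ▲ 0.25
```

Keeping selection and formatting free of I/O matches the split the commit describes, where data access lives in a separate module and the CLI handlers stay thin.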
xiaomo approved these changes 2026-06-05 00:23:56 +00:00
xiaomo left a comment
Owner

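LGTM. The eval framework scaffold is cleanly designed: the task loader validates strictly, the judge system is pluggable (builtin dispatch + task script), and collect's weighted score calculation correctly handles weight=0 informational judges. OCAS storage and the @uwf/eval/*/latest index are in place. The 19 tests cover the core paths.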

xiaomo merged commit f373945304 into main 2026-06-05 00:23:57 +00:00
xiaomo deleted branch feat/69-eval-scaffold 2026-06-05 00:23:57 +00:00

Reference: shazhou/united-workforce#85