diff --git a/.cards/eval-architecture.md b/.cards/eval-architecture.md
new file mode 100644
index 0000000..e2cce78
--- /dev/null
+++ b/.cards/eval-architecture.md
@@ -0,0 +1,25 @@
+---
+title: "Eval Architecture — Task + Judge + CAS"
+created: "2026-06-07"
+source: "openclaw-xiaomo"
+tags: [architecture, decision]
+category: "architecture"
+links:
+  - eval-closes-the-trust-chain
+  - agent-cli-protocol
+  - frontmatter-fast-path
+---
+
+The three-layer architecture of uwf-eval:
+
+1. **Task = a distributable evaluation unit** (task.yaml + fixture directory + judge scripts). It defines the prompt, the workflow reference, limits, and the list of judges with their weights.
+2. **Judge = a standalone scoring script**. Invoked as `node `, it writes `{score, data}` JSON to stdout. Judges come in two kinds: builtin (frontmatter compliance, upstream consumption, hallucination detection, token statistics) and task-specific.
+3. **CAS storage**: the result of each eval run is an OCAS typed node, so different runs can be compared with diff.
+
+Key design decision: uwf-eval is **not part of uwf**. It is a separate package that shells out to the uwf CLI, which keeps the two decoupled. Judges are independent of one another and can run in parallel.
+
+The four builtin judges:
+- `frontmatter`: deterministic check of whether each step's frontmatter is compliant
+- `upstream`: LLM-as-judge, whether upstream information was consumed
+- `hallucination`: LLM-as-judge, whether the output contains hallucinations
+- `token-stats`: informational metric, not included in the score
diff --git a/.hermes/plans/2026-06-04-eval-framework.md b/.hermes/plans/2026-06-04-eval-framework.md
deleted file mode 100644
index 882be16..0000000
--- a/.hermes/plans/2026-06-04-eval-framework.md
+++ /dev/null
@@ -1,226 +0,0 @@
-# Eval Framework Implementation Plan
-
-## Goal
-
-Build `uwf-eval` CLI + eval task infrastructure for evaluating uwf workflow quality with real agents.
-
-## Architecture
-
-```
-uwf-eval (runner)          task package (npm)          OCAS (storage)
-    │                           │                           │
-    ├─ unpack tarball ───────► fixture/ → tmp cwd           │
-    ├─ read task.yaml           │                           │
-    ├─ uwf thread start/exec    │                           │
-    ├─ run judges ───────────► dist/judges/*.js             │
-    ├─ collect scores           │                           │
-    └─ store results ─────────────────────────────────────► CAS nodes + variables
-```
-
-### Key Design Decisions
-
-- **uwf-eval is NOT part of uwf** — separate package, shells out to uwf CLI
-- **Task = npm package** — fixture + task.yaml + judge scripts, distributable as tarball
-- **Judge = Node script** — `node `, outputs `{score, data}` JSON
-- **Every output is OCAS typed** — eval-run, judge results all have registered schemas
-- **Builtin judges** — frontmatter compliance, upstream consumption, hallucination, token stats
-- **Task-specific judges** — bundled in the task package, custom schema per judge
-
-## Deliverables
-
-### Phase 1: Foundation (`@united-workforce/eval`)
-
-New package in the uwf monorepo.
-
-```
-packages/eval/
-  src/
-    cli.ts                  # uwf-eval entry point
-    commands/
-      run.ts                # uwf-eval run
-      report.ts             # uwf-eval report
-      diff.ts               # uwf-eval diff
-      list.ts               # uwf-eval list
-    runner/
-      prepare.ts            # unpack tarball/dir → tmp cwd
-      execute.ts            # shell out to uwf thread start/exec
-      collect.ts            # run judges, collect scores
-    judge/
-      types.ts              # JudgeInput, JudgeOutput types
-      builtin/
-        frontmatter.ts      # frontmatter compliance check
-        upstream.ts         # upstream info consumption (LLM-as-judge)
-        hallucination.ts    # hallucination detection (LLM-as-judge)
-        token-stats.ts      # token usage from $usage field (#68)
-    storage/
-      schemas.ts            # OCAS schema definitions
-      store.ts              # CAS read/write helpers
-      index.ts              # variable indexing (@uwf/eval/*)
-    task/
-      types.ts              # TaskManifest type (task.yaml)
-      loader.ts             # parse task.yaml, validate
-  package.json
-  tsconfig.json
-```
-
-#### OCAS Schemas to Register
-
-1. `@uwf/eval-run` — full eval execution record
-   ```
-   { task, config: {agent, model, engineVersion}, threadId,
-     judges: [{name, score, weight, dataHash}], overall, timestamp }
-   ```
-
-2. `@uwf/eval-judge-frontmatter` — frontmatter judge data
-   ```
-   { stepsTotal, stepsValid, invalidSteps: [{stepIndex, role, errors: string[]}] }
-   ```
-
-3. `@uwf/eval-judge-upstream` — upstream consumption judge data
-   ```
-   { perStep: [{role, consumed: string[], missed: string[], score}] }
-   ```
-
-4. `@uwf/eval-judge-hallucination` — hallucination judge data
-   ```
-   { perStep: [{role, hallucinations: string[], score}] }
-   ```
-
-5. `@uwf/eval-judge-token-stats` — token stats (not scored, informational)
-   ```
-   { totalInput, totalOutput, totalTurns, perStep: [{role, input, output, turns, duration}] }
-   ```
-
-#### CLI Design
-
-```bash
-# Run eval
-uwf-eval run [--agent hermes] [--model claude-sonnet-4] [--count 20]
-
-# View results
-uwf-eval report   # render via ocas render
-uwf-eval diff     # side-by-side comparison
-uwf-eval list     # list past runs
-```
-
-### Phase 2: Task Package Scaffold
-
-Template for creating eval tasks. Also serves as the first real task.
-
-```
-eval-tasks/                     # shazhou/uwf-eval-tasks monorepo
-  packages/
-    _template/                  # copypaste template
-      package.json
-      task.yaml
-      fixture/
-      src/judges/
-      tsconfig.json
-    fix-off-by-one/             # first real task
-      package.json              # @uwf-eval/fix-off-by-one
-      task.yaml
-      fixture/
-        src/calc.ts             # buggy calculator
-        src/calc.test.ts        # test that exposes the bug
-        package.json
-      src/judges/
-        test-pass.ts            # runs pnpm test, checks exit code
-        code-quality.ts         # LLM judge: minimal change, correct fix
-      schemas/
-        test-pass.json          # OCAS schema for test-pass data
-        code-quality.json       # OCAS schema for code-quality data
-      tsconfig.json
-  pnpm-workspace.yaml
-  tsconfig.json
-  biome.json
-```
-
-#### task.yaml Format
-
-```yaml
-name: fix-off-by-one
-description: Fix an off-by-one error in a calculator's add function
-workflow: solve-issue  # registered workflow name, or relative path to .yaml
-prompt: "Fix the bug: add(1,2) returns 4 instead of 3"
-limits:
-  maxSteps: 15
-  timeoutMinutes: 30
-judges:
-  - name: frontmatter-compliance
-    weight: 0.15
-    builtin: true
-  - name: upstream-consumption
-    weight: 0.15
-    builtin: true
-  - name: hallucination
-    weight: 0.1
-    builtin: true
-  - name: token-stats
-    weight: 0  # informational, not scored
-    builtin: true
-  - name: test-pass
-    weight: 0.3
-    entry: dist/judges/test-pass.js
-    schema: schemas/test-pass.json
-  - name: code-quality
-    weight: 0.3
-    entry: dist/judges/code-quality.js
-    schema: schemas/code-quality.json
-```
-
-#### Judge Script Contract
-
-```typescript
-// Input: process.argv = [node, script, cwd, threadId]
-// Output: stdout JSON
-// Exit 0 = success, non-zero = judge error (not low score)
-
-import type { JudgeOutput } from "@united-workforce/eval";
-
-const result: JudgeOutput = {
-  score: 1.0,  // 0.0 - 1.0
-  data: {      // typed per judge schema
-    command: "pnpm test",
-    exitCode: 0,
-    output: "3 tests passed"
-  }
-};
-
-console.log(JSON.stringify(result));
-```
-
-### Phase 3: Prerequisite — $usage in Adapter Protocol (#68)
-
-Blocked by #68. Token stats judge needs `$usage` in step nodes.
-
-Can proceed with Phase 1+2 without it — token-stats judge just returns zeros until adapters report usage.
-
-## Implementation Order
-
-1. **Phase 1a**: `@united-workforce/eval` package scaffold + CLI skeleton + OCAS schemas
-2. **Phase 1b**: `run` command — prepare, execute, collect flow
-3. **Phase 1c**: Builtin judges — frontmatter (deterministic), upstream + hallucination (LLM-as-judge)
-4. **Phase 2a**: Create `shazhou/uwf-eval-tasks` monorepo with proman
-5. **Phase 2b**: First task `fix-off-by-one` with fixture repo + 2 custom judges
-6. **Phase 2c**: End-to-end test: `uwf-eval run packages/fix-off-by-one --agent hermes`
-7. **Phase 1d**: `report`, `diff`, `list` commands (read from CAS, render via ocas render)
-
-## Dependencies
-
-- `@ocas/core` + `@ocas/fs` — CAS storage
-- `@united-workforce/protocol` — step node types
-- `commander` — CLI framework (consistent with uwf)
-- LLM API access — for LLM-as-judge (upstream, hallucination, task-specific quality judges)
-
-## Open Questions
-
-1. **LLM-as-judge provider config** — reuse uwf's `~/.uwf/config.yaml` provider settings? Or separate config?
-2. **Workflow file location** — task.yaml references a workflow. Should the workflow YAML be inside the tarball, or reference a registered workflow by name?
-3. **Non-coding tasks** — debate workflow has no fixture repo. task.yaml needs `fixture: null` or simply omit the `fixture/` dir. Runner creates empty cwd.
-4. **Parallel judge execution** — judges are independent, can run in parallel. Worth the complexity?
-
-## Risks
-
-- LLM-as-judge consistency — same input may get different scores. Mitigation: run judge multiple times, take average? Or accept variance.
-- Token cost of judges — each LLM judge call costs tokens. For a 10-step workflow with 2 LLM judges = 20 LLM calls just for judging. Acceptable?
-- Fixture repo drift — if the fixture evolves, old eval runs become non-comparable. Pin fixture version in task.yaml.
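
Reviewer note: neither the card nor the removed plan says how the per-judge `weight` values in task.yaml combine into the `overall` field of `@uwf/eval-run`. The sketch below shows one plausible reading, a weighted average that skips zero-weight judges such as `token-stats`. The `JudgeResult` type and the `overallScore` function are hypothetical names introduced here for illustration, and normalizing by total weight is an assumption, not something the diff confirms.

```typescript
// Hypothetical shape: only `name`, `weight`, and `score` appear in the plan.
interface JudgeResult {
  name: string;
  weight: number; // taken from task.yaml
  score: number;  // 0.0 - 1.0, taken from the judge's stdout JSON
}

// Weighted average over judges whose weight is greater than zero.
// Assumption: dividing by the total weight keeps the result in [0, 1]
// even when the weights in task.yaml do not sum to exactly 1.
function overallScore(results: JudgeResult[]): number {
  const scored = results.filter((r) => r.weight > 0);
  const totalWeight = scored.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0; // nothing contributes to the score
  const weighted = scored.reduce((sum, r) => sum + r.weight * r.score, 0);
  return weighted / totalWeight;
}

// Weights mirror the fix-off-by-one task.yaml; the scores are made up.
const example: JudgeResult[] = [
  { name: "frontmatter-compliance", weight: 0.15, score: 1.0 },
  { name: "upstream-consumption", weight: 0.15, score: 0.8 },
  { name: "hallucination", weight: 0.1, score: 0.5 },
  { name: "token-stats", weight: 0, score: 0 },
  { name: "test-pass", weight: 0.3, score: 1.0 },
  { name: "code-quality", weight: 0.3, score: 0.7 },
];

console.log(overallScore(example).toFixed(2));
```

If the runner instead required weights to sum to 1 and rejected any other task.yaml, the normalization step would be unnecessary. Either way, the card might state the chosen rule explicitly so that scores stay comparable across runs.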