chore: remove completed eval plan, extract to .cards #166
@@ -0,0 +1,25 @@
|
|||||||
|
---
|
||||||
|
title: "Eval Architecture — Task + Judge + CAS"
|
||||||
|
created: "2026-06-07"
|
||||||
|
source: "openclaw-xiaomo"
|
||||||
|
tags: [architecture, decision]
|
||||||
|
category: "architecture"
|
||||||
|
links:
|
||||||
|
- eval-closes-the-trust-chain
|
||||||
|
- agent-cli-protocol
|
||||||
|
- frontmatter-fast-path
|
||||||
|
---
|
||||||
|
|
||||||
|
uwf-eval 的三层架构:
|
||||||
|
|
||||||
|
1. **Task = 可分发的评估单元**(task.yaml + fixture 目录 + judge 脚本)。定义 prompt、workflow 引用、limits、judges 列表及权重。
|
||||||
|
2. **Judge = 独立评分脚本**。`node <entry> <cwd> <thread-id>`,stdout 输出 `{score, data}` JSON。分 builtin(frontmatter 合规、upstream 消费、幻觉检测、token 统计)和 task-specific 两类。
|
||||||
|
3. **CAS 存储**:每次 eval run 的结果是 OCAS typed node,支持 diff 对比不同 run。
|
||||||
|
|
||||||
|
关键设计:uwf-eval **不是 uwf 的一部分**——它作为独立包 shell out 到 uwf CLI,保持解耦。Judge 之间独立,可并行执行。
|
||||||
|
|
||||||
|
四个 builtin judges:
|
||||||
|
- `frontmatter` — 确定性校验,每步 frontmatter 是否合规
|
||||||
|
- `upstream` — LLM-as-judge,上游信息是否被消费
|
||||||
|
- `hallucination` — LLM-as-judge,是否有幻觉
|
||||||
|
- `token-stats` — 信息性指标,不参与评分
|
||||||
@@ -1,226 +0,0 @@
|
|||||||
# Eval Framework Implementation Plan
|
|
||||||
|
|
||||||
## Goal
|
|
||||||
|
|
||||||
Build `uwf-eval` CLI + eval task infrastructure for evaluating uwf workflow quality with real agents.
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
```
|
|
||||||
uwf-eval (runner) task package (npm) OCAS (storage)
|
|
||||||
│ │ │
|
|
||||||
├─ unpack tarball ───────► fixture/ → tmp cwd │
|
|
||||||
├─ read task.yaml │ │
|
|
||||||
├─ uwf thread start/exec │ │
|
|
||||||
├─ run judges ───────────► dist/judges/*.js │
|
|
||||||
├─ collect scores │ │
|
|
||||||
└─ store results ─────────────────────────────────────► CAS nodes + variables
|
|
||||||
```
|
|
||||||
|
|
||||||
### Key Design Decisions
|
|
||||||
|
|
||||||
- **uwf-eval is NOT part of uwf** — separate package, shells out to uwf CLI
|
|
||||||
- **Task = npm package** — fixture + task.yaml + judge scripts, distributable as tarball
|
|
||||||
- **Judge = Node script** — `node <entry> <cwd> <thread-id>`, outputs `{score, data}` JSON
|
|
||||||
- **Every output is OCAS typed** — eval-run, judge results all have registered schemas
|
|
||||||
- **Builtin judges** — frontmatter compliance, upstream consumption, hallucination, token stats
|
|
||||||
- **Task-specific judges** — bundled in the task package, custom schema per judge
|
|
||||||
|
|
||||||
## Deliverables
|
|
||||||
|
|
||||||
### Phase 1: Foundation (`@united-workforce/eval`)
|
|
||||||
|
|
||||||
New package in the uwf monorepo.
|
|
||||||
|
|
||||||
```
|
|
||||||
packages/eval/
|
|
||||||
src/
|
|
||||||
cli.ts # uwf-eval entry point
|
|
||||||
commands/
|
|
||||||
run.ts # uwf-eval run
|
|
||||||
report.ts # uwf-eval report <hash>
|
|
||||||
diff.ts # uwf-eval diff <hash> <hash>
|
|
||||||
list.ts # uwf-eval list
|
|
||||||
runner/
|
|
||||||
prepare.ts # unpack tarball/dir → tmp cwd
|
|
||||||
execute.ts # shell out to uwf thread start/exec
|
|
||||||
collect.ts # run judges, collect scores
|
|
||||||
judge/
|
|
||||||
types.ts # JudgeInput, JudgeOutput types
|
|
||||||
builtin/
|
|
||||||
frontmatter.ts # frontmatter compliance check
|
|
||||||
upstream.ts # upstream info consumption (LLM-as-judge)
|
|
||||||
hallucination.ts # hallucination detection (LLM-as-judge)
|
|
||||||
token-stats.ts # token usage from $usage field (#68)
|
|
||||||
storage/
|
|
||||||
schemas.ts # OCAS schema definitions
|
|
||||||
store.ts # CAS read/write helpers
|
|
||||||
index.ts # variable indexing (@uwf/eval/*)
|
|
||||||
task/
|
|
||||||
types.ts # TaskManifest type (task.yaml)
|
|
||||||
loader.ts # parse task.yaml, validate
|
|
||||||
package.json
|
|
||||||
tsconfig.json
|
|
||||||
```
|
|
||||||
|
|
||||||
#### OCAS Schemas to Register
|
|
||||||
|
|
||||||
1. `@uwf/eval-run` — full eval execution record
|
|
||||||
```
|
|
||||||
{ task, config: {agent, model, engineVersion}, threadId,
|
|
||||||
judges: [{name, score, weight, dataHash}], overall, timestamp }
|
|
||||||
```
|
|
||||||
|
|
||||||
2. `@uwf/eval-judge-frontmatter` — frontmatter judge data
|
|
||||||
```
|
|
||||||
{ stepsTotal, stepsValid, invalidSteps: [{stepIndex, role, errors: string[]}] }
|
|
||||||
```
|
|
||||||
|
|
||||||
3. `@uwf/eval-judge-upstream` — upstream consumption judge data
|
|
||||||
```
|
|
||||||
{ perStep: [{role, consumed: string[], missed: string[], score}] }
|
|
||||||
```
|
|
||||||
|
|
||||||
4. `@uwf/eval-judge-hallucination` — hallucination judge data
|
|
||||||
```
|
|
||||||
{ perStep: [{role, hallucinations: string[], score}] }
|
|
||||||
```
|
|
||||||
|
|
||||||
5. `@uwf/eval-judge-token-stats` — token stats (not scored, informational)
|
|
||||||
```
|
|
||||||
{ totalInput, totalOutput, totalTurns, perStep: [{role, input, output, turns, duration}] }
|
|
||||||
```
|
|
||||||
|
|
||||||
#### CLI Design
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Run eval
|
|
||||||
uwf-eval run <task-dir-or-tarball> [--agent hermes] [--model claude-sonnet-4] [--count 20]
|
|
||||||
|
|
||||||
# View results
|
|
||||||
uwf-eval report <run-hash> # render via ocas render
|
|
||||||
uwf-eval diff <hash1> <hash2> # side-by-side comparison
|
|
||||||
uwf-eval list # list past runs
|
|
||||||
```
|
|
||||||
|
|
||||||
### Phase 2: Task Package Scaffold
|
|
||||||
|
|
||||||
Template for creating eval tasks. Also serves as the first real task.
|
|
||||||
|
|
||||||
```
|
|
||||||
eval-tasks/ # shazhou/uwf-eval-tasks monorepo
|
|
||||||
packages/
|
|
||||||
_template/ # copypaste template
|
|
||||||
package.json
|
|
||||||
task.yaml
|
|
||||||
fixture/
|
|
||||||
src/judges/
|
|
||||||
tsconfig.json
|
|
||||||
fix-off-by-one/ # first real task
|
|
||||||
package.json # @uwf-eval/fix-off-by-one
|
|
||||||
task.yaml
|
|
||||||
fixture/
|
|
||||||
src/calc.ts # buggy calculator
|
|
||||||
src/calc.test.ts # test that exposes the bug
|
|
||||||
package.json
|
|
||||||
src/judges/
|
|
||||||
test-pass.ts # runs pnpm test, checks exit code
|
|
||||||
code-quality.ts # LLM judge: minimal change, correct fix
|
|
||||||
schemas/
|
|
||||||
test-pass.json # OCAS schema for test-pass data
|
|
||||||
code-quality.json # OCAS schema for code-quality data
|
|
||||||
tsconfig.json
|
|
||||||
pnpm-workspace.yaml
|
|
||||||
tsconfig.json
|
|
||||||
biome.json
|
|
||||||
```
|
|
||||||
|
|
||||||
#### task.yaml Format
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
name: fix-off-by-one
|
|
||||||
description: Fix an off-by-one error in a calculator's add function
|
|
||||||
workflow: solve-issue # registered workflow name, or relative path to .yaml
|
|
||||||
prompt: "Fix the bug: add(1,2) returns 4 instead of 3"
|
|
||||||
limits:
|
|
||||||
maxSteps: 15
|
|
||||||
timeoutMinutes: 30
|
|
||||||
judges:
|
|
||||||
- name: frontmatter-compliance
|
|
||||||
weight: 0.15
|
|
||||||
builtin: true
|
|
||||||
- name: upstream-consumption
|
|
||||||
weight: 0.15
|
|
||||||
builtin: true
|
|
||||||
- name: hallucination
|
|
||||||
weight: 0.1
|
|
||||||
builtin: true
|
|
||||||
- name: token-stats
|
|
||||||
weight: 0 # informational, not scored
|
|
||||||
builtin: true
|
|
||||||
- name: test-pass
|
|
||||||
weight: 0.3
|
|
||||||
entry: dist/judges/test-pass.js
|
|
||||||
schema: schemas/test-pass.json
|
|
||||||
- name: code-quality
|
|
||||||
weight: 0.3
|
|
||||||
entry: dist/judges/code-quality.js
|
|
||||||
schema: schemas/code-quality.json
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Judge Script Contract
|
|
||||||
|
|
||||||
```typescript
|
|
||||||
// Input: process.argv = [node, script, cwd, threadId]
|
|
||||||
// Output: stdout JSON
|
|
||||||
// Exit 0 = success, non-zero = judge error (not low score)
|
|
||||||
|
|
||||||
import type { JudgeOutput } from "@united-workforce/eval";
|
|
||||||
|
|
||||||
const result: JudgeOutput<TestPassData> = {
|
|
||||||
score: 1.0, // 0.0 - 1.0
|
|
||||||
data: { // typed per judge schema
|
|
||||||
command: "pnpm test",
|
|
||||||
exitCode: 0,
|
|
||||||
output: "3 tests passed"
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
console.log(JSON.stringify(result));
|
|
||||||
```
|
|
||||||
|
|
||||||
### Phase 3: Prerequisite — $usage in Adapter Protocol (#68)
|
|
||||||
|
|
||||||
Blocked by #68. Token stats judge needs `$usage` in step nodes.
|
|
||||||
|
|
||||||
Can proceed with Phase 1+2 without it — token-stats judge just returns zeros until adapters report usage.
|
|
||||||
|
|
||||||
## Implementation Order
|
|
||||||
|
|
||||||
1. **Phase 1a**: `@united-workforce/eval` package scaffold + CLI skeleton + OCAS schemas
|
|
||||||
2. **Phase 1b**: `run` command — prepare, execute, collect flow
|
|
||||||
3. **Phase 1c**: Builtin judges — frontmatter (deterministic), upstream + hallucination (LLM-as-judge)
|
|
||||||
4. **Phase 2a**: Create `shazhou/uwf-eval-tasks` monorepo with proman
|
|
||||||
5. **Phase 2b**: First task `fix-off-by-one` with fixture repo + 2 custom judges
|
|
||||||
6. **Phase 2c**: End-to-end test: `uwf-eval run packages/fix-off-by-one --agent hermes`
|
|
||||||
7. **Phase 1d**: `report`, `diff`, `list` commands (read from CAS, render via ocas render)
|
|
||||||
|
|
||||||
## Dependencies
|
|
||||||
|
|
||||||
- `@ocas/core` + `@ocas/fs` — CAS storage
|
|
||||||
- `@united-workforce/protocol` — step node types
|
|
||||||
- `commander` — CLI framework (consistent with uwf)
|
|
||||||
- LLM API access — for LLM-as-judge (upstream, hallucination, task-specific quality judges)
|
|
||||||
|
|
||||||
## Open Questions
|
|
||||||
|
|
||||||
1. **LLM-as-judge provider config** — reuse uwf's `~/.uwf/config.yaml` provider settings? Or separate config?
|
|
||||||
2. **Workflow file location** — task.yaml references a workflow. Should the workflow YAML be inside the tarball, or reference a registered workflow by name?
|
|
||||||
3. **Non-coding tasks** — debate workflow has no fixture repo. task.yaml needs `fixture: null` or simply omit the `fixture/` dir. Runner creates empty cwd.
|
|
||||||
4. **Parallel judge execution** — judges are independent, can run in parallel. Worth the complexity?
|
|
||||||
|
|
||||||
## Risks
|
|
||||||
|
|
||||||
- LLM-as-judge consistency — same input may get different scores. Mitigation: run judge multiple times, take average? Or accept variance.
|
|
||||||
- Token cost of judges — each LLM judge call costs tokens. For a 10-step workflow with 2 LLM judges = 20 LLM calls just for judging. Acceptable?
|
|
||||||
- Fixture repo drift — if the fixture evolves, old eval runs become non-comparable. Pin fixture version in task.yaml.
|
|
||||||
Reference in New Issue
Block a user