feat: uwf-eval run command — prepare, execute, collect #70


2026-06-04T15:13:20Z

xiaoju commented

2026-06-04 15:13:20 +00:00

## Phase 1b of eval framework (#34)

Implement `uwf-eval run <task-dir-or-tarball> [--agent] [--model] [--count]`:

- [ ] **Prepare**: unpack tarball or copy task dir → tmp cwd, copy fixture/ into workspace
- [ ] **Execute**: shell out to `uwf thread start` + `uwf thread exec -c N`
- [ ] **Collect**: iterate task.yaml judges, run each (`node <entry> <cwd> <thread-id>`), collect `{score, data}`
- [ ] **Store**: weighted score → overall, store `@uwf/eval-run` CAS node, update `@uwf/eval/<task>/latest` variable
- [ ] **Output**: stdout summary JSON with run-hash
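
The "weighted score → overall" part of the Store step could be sketched as below. This is a minimal illustration, not the actual implementation: the per-judge `weight` field, its default of 1, and normalizing by the sum of weights are all assumptions, since the issue only says the overall score is weighted.

```javascript
// Sketch only. Assumptions (not confirmed by the issue):
//  - each judge result carries a `weight` (defaulting to 1 when absent)
//  - the overall score is the weight-normalized mean of judge scores
function overallScore(results) {
  if (results.length === 0) {
    throw new Error("no judge results to aggregate");
  }
  let weightedSum = 0;
  let totalWeight = 0;
  for (const { name, score, weight = 1 } of results) {
    if (!Number.isFinite(score)) {
      throw new Error(`judge ${name}: score is not a finite number`);
    }
    if (!Number.isFinite(weight) || weight < 0) {
      throw new Error(`judge ${name}: weight must be a non-negative number`);
    }
    weightedSum += score * weight;
    totalWeight += weight;
  }
  if (totalWeight === 0) {
    throw new Error("total judge weight is zero");
  }
  return weightedSum / totalWeight;
}

// Hypothetical judge names, for illustration only.
const example = [
  { name: "tests-pass", score: 1, weight: 3 },
  { name: "style", score: 0.5, weight: 1 },
];
console.log(overallScore(example)); // (1*3 + 0.5*1) / 4 = 0.875
```

Normalizing by the total weight keeps the overall score in the same range as the individual judge scores, which would make runs comparable even when a task adds or removes judges.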

## Judge execution

- Builtin judges: resolved from uwf-eval internals
- Task judges: resolved from `<task-dir>/dist/judges/<entry>`
- All judges: `node <script> <cwd> <thread-id>` → stdout JSON `{score, data}`
- Judge data stored as typed CAS node (schema from task.yaml), dataHash referenced in eval-run
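
The judge contract above (`node <script> <cwd> <thread-id>` printing `{score, data}` as JSON on stdout) might be driven from the runner roughly as follows. This is a hedged sketch: `runJudge` and `parseJudgeOutput` are hypothetical helper names, and treating a non-zero exit code or a missing `data` field as an error is an assumption rather than something the issue specifies.

```javascript
const { spawnSync } = require("node:child_process");

// Parse and validate the JSON object a judge prints on stdout.
// Assumption: both `score` (finite number) and `data` must be present.
function parseJudgeOutput(stdout) {
  let parsed;
  try {
    parsed = JSON.parse(stdout);
  } catch (err) {
    throw new Error(`judge stdout is not valid JSON: ${err.message}`);
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    throw new Error("judge output must be a JSON object");
  }
  if (typeof parsed.score !== "number" || !Number.isFinite(parsed.score)) {
    throw new Error("judge output is missing a numeric score");
  }
  if (!("data" in parsed)) {
    throw new Error("judge output is missing data");
  }
  return { score: parsed.score, data: parsed.data };
}

// Run one judge as `node <script> <cwd> <thread-id>` and return {score, data}.
// Assumption: a non-zero exit status means the judge itself failed.
function runJudge(script, cwd, threadId) {
  const result = spawnSync(process.execPath, [script, cwd, threadId], {
    encoding: "utf8",
  });
  if (result.error) {
    throw result.error;
  }
  if (result.status !== 0) {
    throw new Error(`judge exited with status ${result.status}: ${result.stderr}`);
  }
  return parseJudgeOutput(result.stdout);
}
```

Keeping the parsing separate from the process spawn means the same validation could apply to builtin and task judges alike, whichever way their script path is resolved.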

## Acceptance Criteria

- `uwf-eval run ./some-task/ --agent hermes` completes end-to-end
- Result stored in OCAS with correct schema
- `@uwf/eval/<task>/latest` variable points to run hash

Depends on: #69
Ref: #34

— 小橘 🍊（NEKO Team）


xiaoju referenced this issue

2026-06-04 15:13:24 +00:00

feat: uwf-eval report, diff, list commands #72

xiaoju referenced this issue

2026-06-04 15:14:04 +00:00

test: E2E eval — real agent effectiveness evaluation #34

xiaoju referenced this issue from a commit

2026-06-04 23:59:30 +00:00

feat: eval run command — prepare, execute, collect pipeline

xiaoju closed this issue

2026-06-05 00:33:03 +00:00
