feat: uwf-eval run command — prepare, execute, collect #70

Closed
opened 2026-06-04 15:13:20 +00:00 by xiaoju · 0 comments
Owner

Phase 1b of eval framework (#34)

Implement uwf-eval run <task-dir-or-tarball> [--agent] [--model] [--count]:

  • Prepare: unpack tarball or copy task dir → tmp cwd, copy fixture/ into workspace
  • Execute: shell out to uwf thread start + uwf thread exec -c N
  • Collect: iterate task.yaml judges, run each (node <entry> <cwd> <thread-id>), collect {score, data}
  • Store: combine judge scores by weight into an overall score, store @uwf/eval-run CAS node, update @uwf/eval/<task>/latest variable
  • Output: stdout summary JSON with run-hash
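
The Store step's "weighted score → overall" could be sketched as below. This is a minimal sketch under assumptions, not the actual implementation: the issue does not say where weights come from or how they are combined, so the `weight` field, the `JudgeResult` shape, and the choice of a normalized weighted mean are all hypothetical.

```typescript
// Hypothetical shape: the issue only states that each judge yields {score, data}.
// The `weight` field (presumably configured per judge in task.yaml) is an assumption.
interface JudgeResult {
  name: string;
  score: number;
  weight: number;
}

// Normalized weighted mean of judge scores.
// Returns 0 when the total weight is 0 (an arbitrary choice for this sketch).
function overallScore(results: JudgeResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) {
    return 0;
  }
  const weightedSum = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weightedSum / totalWeight;
}
```

Normalizing by the total weight keeps the overall score in the same range as the individual judge scores, whatever scale the weights use.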

Judge execution

  • Builtin judges: resolved from uwf-eval internals
  • Task judges: resolved from <task-dir>/dist/judges/<entry>
  • All judges: node <script> <cwd> <thread-id> → stdout JSON {score, data}
  • Judge data stored as typed CAS node (schema from task.yaml), dataHash referenced in eval-run
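
The judge contract above (invoke `node <script> <cwd> <thread-id>`, read `{score, data}` from stdout) could be sketched as follows. The helper names `taskJudgePath`, `judgeArgv`, and `parseJudgeOutput` are hypothetical, as is the strictness of the validation; the issue specifies only the path layout, the argument order, and the output keys.

```typescript
// Output contract from the issue: stdout JSON {score, data}.
interface JudgeOutput {
  score: number;
  data: unknown;
}

// Task judges resolve to <task-dir>/dist/judges/<entry> (layout from the issue).
// Trailing slashes on taskDir are trimmed so "./some-task/" also works.
function taskJudgePath(taskDir: string, entry: string): string {
  return `${taskDir.replace(/\/+$/, "")}/dist/judges/${entry}`;
}

// Argument vector for one judge run: node <script> <cwd> <thread-id>.
function judgeArgv(script: string, cwd: string, threadId: string): string[] {
  return ["node", script, cwd, threadId];
}

// Parse and validate a judge's stdout. Rejecting output without a finite
// numeric score or without a `data` key is an assumption of this sketch.
function parseJudgeOutput(stdout: string): JudgeOutput {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("judge output must be a JSON object");
  }
  const obj = parsed as Record<string, unknown>;
  const score = obj["score"];
  if (typeof score !== "number" || !Number.isFinite(score)) {
    throw new Error("judge output needs a finite numeric `score`");
  }
  if (!("data" in obj)) {
    throw new Error("judge output needs a `data` field");
  }
  return { score, data: obj["data"] };
}
```

Validating at the boundary means a misbehaving judge fails the run with a clear message instead of storing a malformed record.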

Acceptance Criteria

  • uwf-eval run ./some-task/ --agent hermes completes end-to-end
  • Result stored in OCAS with correct schema
  • @uwf/eval/<task>/latest variable points to run hash

Depends on: #69
Ref: #34

— 小橘 🍊(NEKO Team)

Reference: shazhou/united-workforce#70