xiaoju
|
5edb67b79d
|
chore: prepare 0.1.0 release
CI / check (pull_request) Successful in 2m12s
- Remove legacy .changeset/ directory (no longer used)
- Add eval package to proman.yaml
- Set eval package to public for npm publishing
|
2026-06-05 02:21:24 +00:00 |
|
xiaoju
|
ae81e4b5ac
|
feat: eval report, diff, list commands
CI / check (pull_request) Successful in 1m44s
Implement the 3 read commands for the eval framework:
- report: read eval-run from CAS, render formatted text (task, overall, config, judges table, thread ID)
- diff: side-by-side comparison with ▲/▼ delta indicators and config change markers
- list: scan @uwf/eval/*/latest variables, sort by timestamp desc, --task filter, --limit pagination
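A minimal sketch of how the list selection and the diff delta indicator might work. The name selectEntries appears in the test list of this commit, but its signature here, the EvalEntry shape, the formatDelta helper, and the "=" marker for unchanged scores are all assumptions for illustration, not the code in format.ts or read.ts.

```typescript
// Hypothetical entry shape; the real types live in types.ts and may differ.
interface EvalEntry {
  task: string;
  timestamp: string; // assumed ISO 8601 in UTC, so string order is chronological
  overall: number;
}

interface SelectOptions {
  task?: string;  // mirrors the --task filter
  limit?: number; // mirrors the --limit option
}

// Filter by task, sort newest first, then truncate to the limit.
function selectEntries(entries: EvalEntry[], opts: SelectOptions = {}): EvalEntry[] {
  const filtered =
    opts.task === undefined ? entries : entries.filter((e) => e.task === opts.task);
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...filtered].sort((a, b) =>
    a.timestamp < b.timestamp ? 1 : a.timestamp > b.timestamp ? -1 : 0,
  );
  return opts.limit === undefined ? sorted : sorted.slice(0, Math.max(0, opts.limit));
}

// Render a score change with an up or down marker (hypothetical helper).
function formatDelta(before: number, after: number): string {
  const delta = after - before;
  if (delta === 0) return "=";
  const arrow = delta > 0 ? "▲" : "▼";
  return `${arrow} ${Math.abs(delta).toFixed(2)}`;
}
```

Keeping both functions pure (no CAS or variable access) matches the split described in this commit: formatting and selection can be unit-tested without touching storage.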
Architecture: pure formatting functions (format.ts) + data access
(read.ts) + thin CLI handlers. Types in types.ts.
11 new tests (formatReport, formatDiff, formatList, selectEntries)
Refs #72
|
2026-06-05 00:19:25 +00:00 |
|
xiaoju
|
8c26f16716
|
feat: builtin judges — frontmatter + token-stats (deterministic) + upstream/hallucination (stubs)
CI / check (pull_request) Successful in 1m45s
Implement 4 builtin judges for the eval framework:
- frontmatter-compliance: validates YAML frontmatter with $status field, score = stepsValid / stepsTotal
- token-stats: aggregates Usage from step nodes, always score 1.0 (informational only)
- upstream-consumption: LLM-as-judge stub (score 0, TODO)
- hallucination: LLM-as-judge stub (score 0, TODO)
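The two deterministic judges above could be sketched roughly as follows. The Step shape, the usage field names, the line-based frontmatter check (a real implementation would likely use a YAML parser), and the choice to score 0 when there are no steps are assumptions, not the actual BuiltinJudge code.

```typescript
// Hypothetical step shape; the real step nodes come from `uwf step list`.
interface Step {
  output: string;
  usage?: { inputTokens: number; outputTokens: number };
}

// True when the text opens with a `---` frontmatter block containing a
// non-empty `$status:` key. This is a line-based check, not a YAML parse.
function hasStatusFrontmatter(text: string): boolean {
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(text);
  if (match === null) return false;
  const body = match[1] ?? "";
  return body.split(/\r?\n/).some((line) => /^\$status\s*:\s*\S/.test(line));
}

// score = stepsValid / stepsTotal, as stated in the commit message.
function frontmatterCompliance(steps: Step[]) {
  const stepsTotal = steps.length;
  const stepsValid = steps.filter((s) => hasStatusFrontmatter(s.output)).length;
  // Assumption: an empty run scores 0 rather than dividing by zero.
  const score = stepsTotal === 0 ? 0 : stepsValid / stepsTotal;
  return { score, stepsValid, stepsTotal };
}

// Sums token usage across steps; the score is always 1.0 (informational only).
function tokenStats(steps: Step[]) {
  let inputTokens = 0;
  let outputTokens = 0;
  for (const step of steps) {
    inputTokens += step.usage?.inputTokens ?? 0;
    outputTokens += step.usage?.outputTokens ?? 0;
  }
  return { score: 1.0, inputTokens, outputTokens };
}
```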
Infrastructure:
- judge/builtin/read-steps.ts — shell out to uwf step list
- judge/builtin/types.ts — BuiltinJudge, BuiltinJudgeOutput
- runner/collect.ts — dispatch builtin judges by name
9 new tests (frontmatter validation + token aggregation)
Refs #71
|
2026-06-05 00:09:06 +00:00 |
|
xiaoju
|
fae9e9ed3a
|
feat: eval run command — prepare, execute, collect pipeline
CI / check (pull_request) Successful in 1m45s
Implement the uwf-eval run <task-dir> command with a 3-phase pipeline:
- prepare: read task.yaml, copy fixture/ to temp workdir
- execute: shell out to uwf thread start + exec
- collect: run judges, compute weighted score, store CAS node, set @uwf/eval/<task>/latest variable
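The weighted score in the collect phase might be computed along these lines. This sketch assumes a weighted mean normalized by total weight and a hypothetical JudgeResult shape; the commit message does not specify the formula, so the actual collect.ts may differ.

```typescript
// Hypothetical per-judge result; weights are assumed to come from task.yaml.
interface JudgeResult {
  name: string;
  score: number;  // expected in [0, 1]
  weight: number; // non-negative
}

// Weighted mean of judge scores: sum(score * weight) / sum(weight).
function weightedScore(results: JudgeResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  // Assumption: with no judges or zero total weight, report 0.
  if (totalWeight <= 0) return 0;
  const weighted = results.reduce((sum, r) => sum + r.score * r.weight, 0);
  return weighted / totalWeight;
}
```

With weights 3 and 1 and scores 1 and 0, this yields 3 / 4 = 0.75. Under this formula, a placeholder judge that returns 0 lowers the overall score in proportion to its weight.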
Changes:
- src/runner/ — types, prepare, execute, collect, index
- src/storage/store.ts — createEvalStore(), setEvalLatest()
- src/commands/run.ts — full pipeline wiring with --agent/--model/--count
- 9 new tests (prepare + collect + weighted scoring)
Builtin judges return placeholder score 0 (Phase 1c).
Refs #70
|
2026-06-04 23:59:21 +00:00 |
|
xiaoju
|
99619d85db
|
feat: eval package scaffold with CLI, schemas, types, task loader
CI / check (pull_request) Successful in 1m42s
New package @united-workforce/eval (uwf-eval CLI):
- CLI skeleton: run/report/diff/list subcommands (stubs)
- 5 OCAS schemas: eval-run, judge-frontmatter, judge-upstream, judge-hallucination, judge-token-stats
- TaskManifest type + parser/validator for task.yaml
- JudgeOutput/JudgeInput types for judge contract
- EvalRunPayload/EvalRunConfig/EvalJudgeRecord storage types
- 19 unit tests: task loader validation + schema definitions
Refs #69
|
2026-06-04 23:42:16 +00:00 |
|