united-workforce

Author	SHA1	Message	Date
xiaoju	0e7e3ea44b	fix: invalid Crockford Base32 log tag in eval list command. L is not a valid Crockford Base32 character. Replace with H. 小橘 🍊（NEKO Team）	2026-06-06 07:57:00 +00:00
xiaoju	1cf8f350d0	fix: read eval CLI version from package.json. Fixes #95 小橘 🍊（NEKO Team）	2026-06-05 06:43:27 +00:00
xiaoju	825f0c641a	fix: resolve --agent override via config alias before raw command. When --agent is passed to uwf thread exec, try config.agents[alias] first (e.g. 'hermes' → config.agents.hermes = {command: 'uwf-hermes'}), then fall back to parseAgentOverride for raw command names. Also change eval CLI default --agent from 'hermes' to 'uwf-hermes' so it works without config alias lookup. Refs #91	2026-06-05 04:20:09 +00:00
xiaoju	a08775896f	fix: frontmatter judge handles parsed object output. The extract pipeline stores step output as a JSON object in CAS, but the frontmatter judge only checked for raw markdown strings. Now accepts both formats: parsed objects check $status directly, raw strings go through YAML frontmatter extraction. Fixes eval frontmatter-compliance scoring 0 on valid outputs.	2026-06-05 02:55:58 +00:00
xiaoju	ae81e4b5ac	feat: eval report, diff, list commands. Implement the 3 read commands for eval framework: - report: read eval-run from CAS, render formatted text (task, overall, config, judges table, thread ID) - diff: side-by-side comparison with ▲/▼ delta indicators and config change markers - list: scan @uwf/eval/*/latest variables, sort by timestamp desc, --task filter, --limit pagination. Architecture: pure formatting functions (format.ts) + data access (read.ts) + thin CLI handlers. Types in types.ts. 11 new tests (formatReport, formatDiff, formatList, selectEntries). Refs #72	2026-06-05 00:19:25 +00:00
xiaoju	8c26f16716	feat: builtin judges — frontmatter + token-stats (deterministic) + upstream/hallucination (stubs). Implement 4 builtin judges for eval framework: - frontmatter-compliance: validates YAML frontmatter with $status field, score = stepsValid / stepsTotal - token-stats: aggregates Usage from step nodes, always score 1.0 (informational only) - upstream-consumption: LLM-as-judge stub (score 0, TODO) - hallucination: LLM-as-judge stub (score 0, TODO). Infrastructure: - judge/builtin/read-steps.ts — shell out to uwf step list - judge/builtin/types.ts — BuiltinJudge, BuiltinJudgeOutput - runner/collect.ts — dispatch builtin judges by name. 9 new tests (frontmatter validation + token aggregation). Refs #71	2026-06-05 00:09:06 +00:00
xiaoju	fae9e9ed3a	feat: eval run command — prepare, execute, collect pipeline. Implement the uwf-eval run <task-dir> command with 3-phase pipeline: - prepare: read task.yaml, copy fixture/ to temp workdir - execute: shell out to uwf thread start + exec - collect: run judges, compute weighted score, store CAS node, set @uwf/eval/<task>/latest variable. Changes: - src/runner/ — types, prepare, execute, collect, index - src/storage/store.ts — createEvalStore(), setEvalLatest() - src/commands/run.ts — full pipeline wiring with --agent/--model/--count - 9 new tests (prepare + collect + weighted scoring). Builtin judges return placeholder score 0 (Phase 1c). Refs #70	2026-06-04 23:59:21 +00:00
xiaoju	99619d85db	feat: eval package scaffold with CLI, schemas, types, task loader. New package @united-workforce/eval (uwf-eval CLI): - CLI skeleton: run/report/diff/list subcommands (stubs) - 5 OCAS schemas: eval-run, judge-frontmatter, judge-upstream, judge-hallucination, judge-token-stats - TaskManifest type + parser/validator for task.yaml - JudgeOutput/JudgeInput types for judge contract - EvalRunPayload/EvalRunConfig/EvalJudgeRecord storage types - 19 unit tests: task loader validation + schema definitions. Refs #69	2026-06-04 23:42:16 +00:00
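
Note on commit 825f0c641a: the lookup order it describes (config alias first, raw command second) might be sketched roughly as below. This is an illustration only, not the repository's code. The `AgentConfig` and `Config` shapes, the `resolveAgentOverride` name, and the body of `parseAgentOverride` are all assumptions; only the `config.agents[alias]` lookup, the `{command: 'uwf-hermes'}` example, and the `parseAgentOverride` name come from the commit message.

```typescript
// Assumed shapes; the real config types are not shown in the log.
interface AgentConfig {
  command: string;
}

interface Config {
  agents: Record<string, AgentConfig>;
}

// Stand-in for the real parseAgentOverride, whose signature is unknown.
// Here it simply treats the raw string as the command name.
function parseAgentOverride(raw: string): AgentConfig {
  return { command: raw.trim() };
}

// Hypothetical resolver: prefer a configured alias, else fall back to the raw command.
function resolveAgentOverride(config: Config, override: string): AgentConfig {
  // Own-property check so inherited names such as "toString" are never treated as aliases.
  if (Object.prototype.hasOwnProperty.call(config.agents, override)) {
    return config.agents[override];
  }
  return parseAgentOverride(override);
}
```

With `config.agents.hermes = {command: 'uwf-hermes'}`, passing `hermes` would resolve to `uwf-hermes`, while an unconfigured name such as `uwf-hermes` itself would pass through the fallback unchanged.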

8 Commits
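
Note on commit 0e7e3ea44b: the Crockford Base32 symbol set is the ten digits plus the uppercase letters with I, L, O and U left out, which is why an `L` tag is invalid and `H` is not. A minimal strict check might look like the following. `isCrockfordBase32` is a hypothetical helper, not a function from the repository, and it deliberately ignores the lenient decoding rules (where I and L are read as 1, and O as 0).

```typescript
// The 32 canonical Crockford Base32 symbols: 0-9 and A-Z without I, L, O, U.
const CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// Hypothetical helper: true only when `tag` is non-empty and every character
// is a canonical (uppercase) Crockford Base32 symbol.
function isCrockfordBase32(tag: string): boolean {
  if (tag.length === 0) {
    return false;
  }
  for (let i = 0; i < tag.length; i++) {
    if (CROCKFORD_ALPHABET.indexOf(tag.charAt(i)) === -1) {
      return false;
    }
  }
  return true;
}
```

Under this check `isCrockfordBase32("L")` is `false` and `isCrockfordBase32("H")` is `true`, matching the replacement made in the fix.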