18 Commits

Author SHA1 Message Date
xingyue df244c52e8 Revert "Merge pull request 'chore: release — bump @ocas/* ^0.4.0, @shazhou/proman ^0.6.3' (#150) from release/bump-ocas-proman into main"
CI / check (pull_request) Successful in 3m45s
This reverts commit 9d0c6df62c, reversing
changes made to 00d960daba.
2026-06-07 15:25:31 +08:00
xingyue 0f5bb1f191 chore: release — bump @ocas/* ^0.4.0, @shazhou/proman ^0.6.3
CI / check (pull_request) Successful in 2m35s
Published:
- @united-workforce/protocol@0.1.1
- @united-workforce/util-agent@0.1.2
- @united-workforce/agent-builtin@0.1.3
- @united-workforce/agent-claude-code@0.1.4
- @united-workforce/agent-hermes@0.1.5
- @united-workforce/agent-mock@0.1.3
- @united-workforce/cli@0.3.1
- @united-workforce/eval@0.1.6
2026-06-07 15:06:43 +08:00
xingyue 3a26285872 chore: bump @ocas/* to ^0.4.0 and @shazhou/proman to ^0.6.3
CI / check (pull_request) Successful in 3m28s
2026-06-07 14:12:03 +08:00
xiaoju aa732f5466 chore: bump eval to 0.1.5
CI / check (push) Successful in 3m56s
Fix workspace:^ not being replaced in 0.1.4 publish (was published with npm instead of pnpm).

小橘 🍊
2026-06-06 08:57:24 +00:00
xiaoju e354fc4341 chore: bump eval to 0.1.4
CI / check (push) Successful in 3m1s
小橘 🍊(NEKO Team)
2026-06-06 08:02:33 +00:00
xiaoju 0e7e3ea44b fix: invalid Crockford Base32 log tag in eval list command
CI / check (pull_request) Successful in 3m57s
CI / check (push) Successful in 3m31s
L is not a valid Crockford Base32 character. Replace with H.

小橘 🍊(NEKO Team)
2026-06-06 07:57:00 +00:00
xiaoju 9260d81084 chore: version bump for --version fix
CI / check (push) Successful in 3m2s
agent-hermes@0.1.2 agent-claude-code@0.1.1 agent-builtin@0.1.1
agent-mock@0.1.1 eval@0.1.3 util@0.1.1

小橘 🍊(NEKO Team)
2026-06-05 08:12:50 +00:00
xiaoju 1cf8f350d0 fix: read eval CLI version from package.json
CI / check (pull_request) Successful in 3m30s
Fixes #95

小橘 🍊(NEKO Team)
2026-06-05 06:43:27 +00:00
xiaoju 427568a21d chore: version bump agent-hermes@0.1.1 cli@0.1.1 eval@0.1.2
CI / check (push) Successful in 2m37s
小橘 🍊(NEKO Team)
2026-06-05 06:29:25 +00:00
xiaoju 825f0c641a fix: resolve --agent override via config alias before raw command
CI / check (pull_request) Successful in 3m37s
When --agent is passed to uwf thread exec, try config.agents[alias]
first (e.g. 'hermes' → config.agents.hermes = {command: 'uwf-hermes'}),
then fall back to parseAgentOverride for raw command names.

Also change eval CLI default --agent from 'hermes' to 'uwf-hermes'
so it works without config alias lookup.

Refs #91
2026-06-05 04:20:09 +00:00
xiaoju 81bbe1178f chore: release @united-workforce/eval@0.1.1
CI / check (push) Successful in 2m45s
2026-06-05 03:02:05 +00:00
xiaoju a08775896f fix: frontmatter judge handles parsed object output
CI / check (pull_request) Successful in 2m38s
The extract pipeline stores step output as a JSON object in CAS,
but the frontmatter judge only checked for raw markdown strings.
Now accepts both formats: parsed objects check $status directly,
raw strings go through YAML frontmatter extraction.

Fixes eval frontmatter-compliance scoring 0 on valid outputs.
2026-06-05 02:55:58 +00:00
xiaoju c892b9125b chore: remove prepublishOnly guards (proman handles release)
CI / check (push) Successful in 2m26s
2026-06-05 02:29:53 +00:00
xiaoju 5edb67b79d chore: prepare 0.1.0 release
CI / check (pull_request) Successful in 2m12s
- Remove legacy .changeset/ directory (no longer used)
- Add eval package to proman.yaml
- Set eval package to public for npm publishing
2026-06-05 02:21:24 +00:00
xiaoju ae81e4b5ac feat: eval report, diff, list commands
CI / check (pull_request) Successful in 1m44s
Implement the 3 read commands for eval framework:

- report: read eval-run from CAS, render formatted text
  (task, overall, config, judges table, thread ID)
- diff: side-by-side comparison with ▲/▼ delta indicators
  and config change markers
- list: scan @uwf/eval/*/latest variables, sort by timestamp desc,
  --task filter, --limit pagination

Architecture: pure formatting functions (format.ts) + data access
(read.ts) + thin CLI handlers. Types in types.ts.

11 new tests (formatReport, formatDiff, formatList, selectEntries)

Refs #72
2026-06-05 00:19:25 +00:00
xiaoju 8c26f16716 feat: builtin judges — frontmatter + token-stats (deterministic) + upstream/hallucination (stubs)
CI / check (pull_request) Successful in 1m45s
Implement 4 builtin judges for eval framework:

- frontmatter-compliance: validates YAML frontmatter with $status field,
  score = stepsValid / stepsTotal
- token-stats: aggregates Usage from step nodes, always score 1.0
  (informational only)
- upstream-consumption: LLM-as-judge stub (score 0, TODO)
- hallucination: LLM-as-judge stub (score 0, TODO)

Infrastructure:
- judge/builtin/read-steps.ts — shell out to uwf step list
- judge/builtin/types.ts — BuiltinJudge, BuiltinJudgeOutput
- runner/collect.ts — dispatch builtin judges by name

9 new tests (frontmatter validation + token aggregation)

Refs #71
2026-06-05 00:09:06 +00:00
xiaoju fae9e9ed3a feat: eval run command — prepare, execute, collect pipeline
CI / check (pull_request) Successful in 1m45s
Implement the uwf-eval run <task-dir> command with 3-phase pipeline:

- prepare: read task.yaml, copy fixture/ to temp workdir
- execute: shell out to uwf thread start + exec
- collect: run judges, compute weighted score, store CAS node,
  set @uwf/eval/<task>/latest variable

Changes:
- src/runner/ — types, prepare, execute, collect, index
- src/storage/store.ts — createEvalStore(), setEvalLatest()
- src/commands/run.ts — full pipeline wiring with --agent/--model/--count
- 9 new tests (prepare + collect + weighted scoring)

Builtin judges return placeholder score 0 (Phase 1c).

Refs #70
2026-06-04 23:59:21 +00:00
xiaoju 99619d85db feat: eval package scaffold with CLI, schemas, types, task loader
CI / check (pull_request) Successful in 1m42s
New package @united-workforce/eval (uwf-eval CLI):

- CLI skeleton: run/report/diff/list subcommands (stubs)
- 5 OCAS schemas: eval-run, judge-frontmatter, judge-upstream,
  judge-hallucination, judge-token-stats
- TaskManifest type + parser/validator for task.yaml
- JudgeOutput/JudgeInput types for judge contract
- EvalRunPayload/EvalRunConfig/EvalJudgeRecord storage types
- 19 unit tests: task loader validation + schema definitions

Refs #69
2026-06-04 23:42:16 +00:00