Compare commits

..

48 Commits

Author SHA1 Message Date
xiaoju 427568a21d chore: version bump agent-hermes@0.1.1 cli@0.1.1 eval@0.1.2
CI / check (push) Successful in 2m37s
小橘 🍊(NEKO Team)
2026-06-05 06:29:25 +00:00
xiaomo d3a2353acf Merge pull request 'fix: read token usage from ACP response instead of DB' (#94) from fix/usage-tokens-from-acp into main
CI / check (push) Successful in 3m25s
2026-06-05 06:18:05 +00:00
xiaoju 8085d1d6e0 fix: read token usage from ACP response instead of DB
CI / check (pull_request) Successful in 3m10s
Tokens (inputTokens, outputTokens) now come from ACP PromptResponse.usage
which is populated synchronously from run_conversation() — no WAL race.
Turns still come from DB before/after snapshot.

Previously both were read from hermes state.db after ACP prompt returned,
but WAL write lag caused incomplete token data (e.g. 235 vs actual 26,080).

Refs #91
2026-06-05 06:08:11 +00:00
xiaomo 8764d7bda3 Merge pull request 'chore: add changeset for #92 agent override alias fix' (#93) from chore/changeset-agent-override into main
CI / check (push) Successful in 3m33s
2026-06-05 05:17:36 +00:00
xiaoju 850a3b2f25 chore: add changeset for #92 agent override alias fix
CI / check (pull_request) Successful in 3m8s
2026-06-05 04:36:41 +00:00
xiaomo 3d6a517e83 Merge pull request 'fix: resolve --agent override via config alias before raw command' (#92) from fix/agent-override-alias into main
CI / check (push) Successful in 3m30s
2026-06-05 04:31:50 +00:00
xiaoju 825f0c641a fix: resolve --agent override via config alias before raw command
CI / check (pull_request) Successful in 3m37s
When --agent is passed to uwf thread exec, try config.agents[alias]
first (e.g. 'hermes' → config.agents.hermes = {command: 'uwf-hermes'}),
then fall back to parseAgentOverride for raw command names.

Also change eval CLI default --agent from 'hermes' to 'uwf-hermes'
so it works without config alias lookup.

Refs #91
2026-06-05 04:20:09 +00:00
xiaoju 81bbe1178f chore: release @united-workforce/eval@0.1.1
CI / check (push) Successful in 2m45s
2026-06-05 03:02:05 +00:00
xiaoju a0e139935e Merge pull request 'fix: frontmatter judge handles parsed object output' (#90) from fix/frontmatter-judge-object-output into main
CI / check (push) Successful in 2m12s
2026-06-05 03:01:30 +00:00
xiaoju a08775896f fix: frontmatter judge handles parsed object output
CI / check (pull_request) Successful in 2m38s
The extract pipeline stores step output as a JSON object in CAS,
but the frontmatter judge only checked for raw markdown strings.
Now accepts both formats: parsed objects check $status directly,
raw strings go through YAML frontmatter extraction.

Fixes eval frontmatter-compliance scoring 0 on valid outputs.
2026-06-05 02:55:58 +00:00
xiaoju c892b9125b chore: remove prepublishOnly guards (proman handles release)
CI / check (push) Successful in 2m26s
2026-06-05 02:29:53 +00:00
xiaoju 8c5e12c5c8 Merge pull request 'chore: prepare 0.1.0 release' (#89) from chore/prepare-release into main
CI / check (push) Failing after 12s
2026-06-05 02:28:08 +00:00
xiaoju 5edb67b79d chore: prepare 0.1.0 release
CI / check (pull_request) Successful in 2m12s
- Remove legacy .changeset/ directory (no longer used)
- Add eval package to proman.yaml
- Set eval package to public for npm publishing
2026-06-05 02:21:24 +00:00
xiaoju 3d8df5c8e2 Merge pull request 'fix: remove _ single-exit for user roles' (#88) from fix/86-remove-single-exit-underscore into main
CI / check (push) Successful in 2m16s
2026-06-05 02:09:50 +00:00
xiaoju 63cb4d3645 fix: remove _ single-exit for user roles
CI / check (pull_request) Successful in 3m7s
$START keeps _ (special entry node). All user-defined roles now require
explicit $status enum in frontmatter + matching graph keys.

- moderator: remove UNIT_STATUS fallback, error on missing $status
- validate: reject _ graph keys for non-$START roles
- validate-semantic: remove checkSingleExitRole(), require $status enum
- update all test fixtures to use explicit status values
- fix examples/analyze-topic.yaml

Fixes #86
2026-06-05 02:00:45 +00:00
xiaomo f373945304 Merge pull request 'feat: eval package scaffold — CLI + schemas + types + task loader' (#85) from feat/69-eval-scaffold into main
CI / check (push) Successful in 1m46s
feat: eval package scaffold — CLI + schemas + types + task loader (#85)
2026-06-05 00:23:56 +00:00
xiaoju ae81e4b5ac feat: eval report, diff, list commands
CI / check (pull_request) Successful in 1m44s
Implement the 3 read commands for eval framework:

- report: read eval-run from CAS, render formatted text
  (task, overall, config, judges table, thread ID)
- diff: side-by-side comparison with ▲/▼ delta indicators
  and config change markers
- list: scan @uwf/eval/*/latest variables, sort by timestamp desc,
  --task filter, --limit pagination

Architecture: pure formatting functions (format.ts) + data access
(read.ts) + thin CLI handlers. Types in types.ts.

11 new tests (formatReport, formatDiff, formatList, selectEntries)

Refs #72
2026-06-05 00:19:25 +00:00
xiaoju 8c26f16716 feat: builtin judges — frontmatter + token-stats (deterministic) + upstream/hallucination (stubs)
CI / check (pull_request) Successful in 1m45s
Implement 4 builtin judges for eval framework:

- frontmatter-compliance: validates YAML frontmatter with $status field,
  score = stepsValid / stepsTotal
- token-stats: aggregates Usage from step nodes, always score 1.0
  (informational only)
- upstream-consumption: LLM-as-judge stub (score 0, TODO)
- hallucination: LLM-as-judge stub (score 0, TODO)

Infrastructure:
- judge/builtin/read-steps.ts — shell out to uwf step list
- judge/builtin/types.ts — BuiltinJudge, BuiltinJudgeOutput
- runner/collect.ts — dispatch builtin judges by name

9 new tests (frontmatter validation + token aggregation)

Refs #71
2026-06-05 00:09:06 +00:00
xiaoju fae9e9ed3a feat: eval run command — prepare, execute, collect pipeline
CI / check (pull_request) Successful in 1m45s
Implement the uwf-eval run <task-dir> command with 3-phase pipeline:

- prepare: read task.yaml, copy fixture/ to temp workdir
- execute: shell out to uwf thread start + exec
- collect: run judges, compute weighted score, store CAS node,
  set @uwf/eval/<task>/latest variable

Changes:
- src/runner/ — types, prepare, execute, collect, index
- src/storage/store.ts — createEvalStore(), setEvalLatest()
- src/commands/run.ts — full pipeline wiring with --agent/--model/--count
- 9 new tests (prepare + collect + weighted scoring)

Builtin judges return placeholder score 0 (Phase 1c).

Refs #70
2026-06-04 23:59:21 +00:00
xiaoju 99619d85db feat: eval package scaffold with CLI, schemas, types, task loader
CI / check (pull_request) Successful in 1m42s
New package @united-workforce/eval (uwf-eval CLI):

- CLI skeleton: run/report/diff/list subcommands (stubs)
- 5 OCAS schemas: eval-run, judge-frontmatter, judge-upstream,
  judge-hallucination, judge-token-stats
- TaskManifest type + parser/validator for task.yaml
- JudgeOutput/JudgeInput types for judge contract
- EvalRunPayload/EvalRunConfig/EvalJudgeRecord storage types
- 19 unit tests: task loader validation + schema definitions

Refs #69
2026-06-04 23:42:16 +00:00
xiaomo b94234652a Merge pull request 'feat: agent-hermes reads real token counts from session DB' (#84) from feat/76-hermes-real-tokens into main
CI / check (push) Successful in 1m41s
feat: agent-hermes reads real token counts from session DB (#84)
2026-06-04 23:31:09 +00:00
xiaoju 1593dbb521 fix: compute usage as delta for session re-entry
CI / check (pull_request) Successful in 1m41s
On session resume, turns/inputTokens/outputTokens were cumulative
(entire session history) instead of per-step increments. Now we
snapshot metrics before prompt, compare after, and report the delta.

Changes:
- acp-client: add getSessionId() accessor
- hermes: extract snapshotUsage() + computeUsageDelta() pure functions
- hermes: runPrompt/runHermes/continueHermes use before/after snapshots
- 9 new unit tests for usage delta computation

Refs #68
2026-06-04 23:22:16 +00:00
xiaoju d1c523c442 feat: agent-hermes reads real token counts from session DB
CI / check (pull_request) Successful in 1m41s
- Add inputTokens/outputTokens to HermesSessionJson type
- Query input_tokens, output_tokens from sessions table in loadHermesSessionFromDb
- Update test fixture schema with token columns
- runPrompt now reports real token counts from Hermes state.db

Refs #76, #68
2026-06-04 23:06:52 +00:00
xiaomo 4283e6766b Merge pull request 'feat: agent-claude-code reports real $usage from stream-json' (#83) from feat/77-claude-code-usage into main
CI / check (push) Successful in 1m42s
feat: agent-claude-code reports real $usage from stream-json (#83)
2026-06-04 22:55:15 +00:00
xiaomo 4e4fb61ff5 Merge pull request 'feat: agent-hermes reports $usage (turns + duration)' (#82) from feat/76-hermes-usage into main
CI / check (push) Successful in 1m40s
feat: agent-hermes reports $usage (turns + duration) (#82)
2026-06-04 22:55:13 +00:00
xiaoju be92cb2dd2 feat: agent-claude-code reports real $usage from stream-json output
CI / check (pull_request) Successful in 1m40s
- Map parsed numTurns, inputTokens, outputTokens, durationMs to Usage type
- Add @united-workforce/protocol dependency + tsconfig reference
- 747 tests pass

Fixes #77
Refs #68
2026-06-04 22:36:44 +00:00
xiaoju 7681e8b8e2 feat: agent-hermes reports $usage (turns + duration)
CI / check (pull_request) Successful in 1m40s
- Count assistant turns from session messages
- Measure wall-clock duration per prompt call
- inputTokens/outputTokens remain 0 (ACP protocol doesn't expose token data yet)
- Both runPrompt and continueHermes report usage

Fixes #76
Refs #68
2026-06-04 22:30:14 +00:00
xiaomo 780005ad65 Merge pull request 'feat: agent-mock emits fixed $usage stats' (#81) from feat/75-mock-usage into main
CI / check (push) Successful in 1m42s
feat: agent-mock emits fixed $usage stats (#81)
2026-06-04 22:23:42 +00:00
xiaoju 248ac710fd feat: agent-mock emits fixed $usage stats
CI / check (pull_request) Successful in 1m41s
- Mock agent returns {turns:1, inputTokens:0, outputTokens:0, duration:0}
- E2E test 1 (linear workflow) asserts usage in CAS step nodes
- 747 tests pass

Fixes #75
Refs #68
2026-06-04 22:19:29 +00:00
xiaomo 172c232e61 Merge pull request 'feat: add $usage field to adapter protocol' (#80) from feat/74-usage-in-protocol into main
CI / check (push) Successful in 1m41s
feat: add $usage field to adapter protocol (#80)
2026-06-04 22:14:12 +00:00
xiaomo 5fe97591de Merge pull request 'fix: agent bin fields point to dist/cli.js instead of src/cli.ts' (#79) from fix/agent-bin-78 into main
CI / check (push) Successful in 2m55s
fix: agent bin fields point to dist/cli.js instead of src/cli.ts (#79)
2026-06-04 15:41:45 +00:00
xiaoju 99f40c2488 feat: add $usage field to adapter protocol
CI / check (pull_request) Successful in 2m28s
- Add Usage type to protocol (turns, inputTokens, outputTokens, duration)
- Add usage to StepRecord, StepNodePayload, StepEntry, STEP_NODE_SCHEMA
- Thread usage through util-agent extract pipeline (writeStepNode → persistStep → createAgent)
- All adapters return usage: null as placeholder (mock, hermes, claude-code, builtin)
- 746 tests pass, no breaking changes (usage not in schema required array)

Fixes #74
Refs #68
2026-06-04 15:41:07 +00:00
xingyue bf489c59a5 fix: agent bin fields point to dist/cli.js instead of src/cli.ts
CI / check (pull_request) Successful in 3m23s
All three agent packages had bin pointing to ./src/cli.ts (bun-era
leftover). Node cannot execute .ts files directly, causing
ERR_MODULE_NOT_FOUND when spawning agents.

Closes #78
2026-06-04 23:25:39 +08:00
xiaomo 9908d069ec Merge pull request 'refactor(prompt): rename subcommands and add frontmatter output' (#67) from feat/prompt-refactor-66 into main
CI / check (push) Successful in 5m15s
refactor(prompt): rename subcommands and add frontmatter output (#67)
2026-06-04 14:51:12 +00:00
xingyue 83bcda60ff refactor(prompt): rename subcommands and add frontmatter output
CI / check (pull_request) Successful in 3m1s
- Rename: user→usage-reference, author→workflow-authoring, adapter→adapter-developing
- Remove: developer (content lives in CLAUDE.md)
- All prompts output complete SKILL.md with YAML frontmatter
- Setup instructions simplified: uwf prompt bootstrap > SKILL.md
- Remove all bun references, use pnpm/npm
- Fix CLAUDE.md: fixed→independent versioning
- Delete old reference files (user/author/developer/adapter)

Closes #66
2026-06-04 22:46:11 +08:00
xiaomo 17f7f44c43 Merge pull request 'chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs' (#64) from chore/rebranding-cleanup into main
CI / check (push) Successful in 3m5s
chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs (#64)
2026-06-04 13:13:03 +00:00
xiaoju 3401873051 chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs
CI / check (pull_request) Successful in 2m49s
- All 9 packages reset to version 0.1.0
- CLAUDE.md: bun→pnpm, fixed→independent versioning, proman commands
- docs/architecture.md: bun→pnpm in toolchain table
- docs/sync-readme.md: bun→pnpm in conventions
2026-06-04 13:05:26 +00:00
xiaomo 7fc02e50c0 Merge pull request 'refactor: extract validateCount, replace CLI spawn with direct import' (#63) from chore/61-spawn-to-direct-import into main
CI / check (push) Successful in 3m0s
refactor: extract validateCount, replace CLI spawn with direct import (#63)
2026-06-04 12:41:42 +00:00
xiaoju 18170a4313 refactor: extract validateCount, replace CLI spawn with direct import
CI / check (pull_request) Successful in 2m24s
- Extract validateCount() from cmdThreadExec (throw instead of process.exit)
- 5 validation tests now import validateCount directly (no subprocess)
- Only --help tests still spawn CLI (need Commander output)
- Test time: 1.7s → 475ms

Fixes #61
2026-06-04 12:31:17 +00:00
xiaomo 1ce0b9b9ee Merge pull request 'chore: remove integration tests, migrate to eval framework' (#62) from chore/60-remove-integration-tests into main
CI / check (push) Successful in 2m18s
chore: remove integration tests, migrate to eval framework (#62)
2026-06-04 12:25:39 +00:00
xiaoju 8bf5b88172 chore: remove integration tests, clean up CI exclusion
CI / check (pull_request) Successful in 2m41s
Deleted:
- acp-client.integration.test.ts (3 cases)
- resume-e2e.integration.test.ts (1 case, already skipped)

These tests spawn a real hermes CLI and hit live LLM,
belonging to the eval layer (#34), not CI.

ACP protocol parsing is already covered by unit test
acp-client.test.ts.

Also removed the --exclude integration/ hack from test:ci.

Fixes #60
2026-06-04 12:19:24 +00:00
xiaomo 9fbdd1dd2c Merge pull request 'fix: OCAS_DIR → OCAS_HOME in test helpers' (#59) from fix/58-test-isolation into main
CI / check (push) Successful in 2m44s
fix: OCAS_DIR → OCAS_HOME in test helpers (#59)
2026-06-04 12:16:20 +00:00
xiaoju 66c2e2a79b fix: use node dist/cli.js instead of npx tsx in thread-step-count tests
CI / check (pull_request) Successful in 3m30s
npx tsx hangs in CI Docker (30s+ timeout). node dist/cli.js runs in <2s.
2026-06-04 11:57:32 +00:00
xiaoju 58b58d511e fix: add timeout to cmdThreadExec count logic tests
CI / check (pull_request) Failing after 4m17s
2026-06-04 11:48:46 +00:00
xiaoju 596c05bfcc fix: use node dist/cli.js instead of npx tsx in prompt help test
CI / check (pull_request) Failing after 3m40s
npx tsx fails in CI (tsx not found, npm tries to install it)
2026-06-04 11:32:09 +00:00
xiaoju d26f54e8ea fix: biome format + remove unused noConsole suppressions
CI / check (pull_request) Failing after 3m58s
2026-06-04 11:22:46 +00:00
xiaoju 883bd79bcb fix: add timeout to CI-slow tests + check stderr for help output
CI / check (pull_request) Failing after 1m55s
2026-06-04 11:18:49 +00:00
xiaoju 63454a4cfd fix: OCAS_DIR → OCAS_HOME in test helpers + exclude integration tests from CI
CI / check (pull_request) Failing after 2m27s
- Remaining OCAS_DIR references caused test isolation failures
- agent-hermes integration tests need 'hermes' CLI, skip in CI

Fixes #58
2026-06-04 11:06:42 +00:00
111 changed files with 3365 additions and 659 deletions
-8
View File
@@ -1,8 +0,0 @@
# Changesets
Hello and welcome! This folder has been automatically generated by `@changesets/cli`, a build tool that works
with multi-package repos, or single-package repos to help you version and publish your code. You can
find the full documentation for it [in our repository](https://github.com/changesets/changesets).
We have a quick list of common questions to get you started engaging with this project in
[our documentation](https://github.com/changesets/changesets/blob/main/docs/common-questions.md).
-11
View File
@@ -1,11 +0,0 @@
{
"$schema": "https://unpkg.com/@changesets/config@3.1.4/schema.json",
"changelog": "@changesets/cli/changelog",
"commit": false,
"fixed": [["@united-workforce/*"]],
"linked": [],
"access": "public",
"baseBranch": "main",
"updateInternalDependencies": "patch",
"ignore": ["@united-workforce/dashboard"]
}
-30
View File
@@ -1,30 +0,0 @@
{
"mode": "exit",
"tag": "alpha",
"initialVersions": {
"@uncaged/cli": "0.4.5",
"@uncaged/workflow-agent-cursor": "0.4.5",
"@uncaged/agent-hermes": "0.4.5",
"@uncaged/workflow-agent-llm": "0.4.5",
"@uncaged/workflow-agent-react": "0.4.5",
"@uncaged/workflow-cas": "0.4.5",
"@uncaged/dashboard": "0.1.0",
"@uncaged/workflow-execute": "0.4.5",
"@uncaged/workflow-gateway": "0.4.5",
"@uncaged/protocol": "0.4.5",
"@uncaged/workflow-reactor": "0.4.5",
"@uncaged/workflow-register": "0.4.5",
"@uncaged/workflow-runtime": "0.4.5",
"@uncaged/workflow-template-develop": "0.4.5",
"@uncaged/workflow-template-solve-issue": "0.4.5",
"@uncaged/util": "0.4.5",
"@uncaged/util-agent": "0.4.5"
},
"changesets": [
"env-api-unify",
"fix-internal-deps",
"fix-publish-src",
"fix-workspace-deps",
"rfc-252-agent-fn"
]
}
+226
View File
@@ -0,0 +1,226 @@
# Eval Framework Implementation Plan
## Goal
Build `uwf-eval` CLI + eval task infrastructure for evaluating uwf workflow quality with real agents.
## Architecture
```
uwf-eval (runner) task package (npm) OCAS (storage)
│ │ │
├─ unpack tarball ───────► fixture/ → tmp cwd │
├─ read task.yaml │ │
├─ uwf thread start/exec │ │
├─ run judges ───────────► dist/judges/*.js │
├─ collect scores │ │
└─ store results ─────────────────────────────────────► CAS nodes + variables
```
### Key Design Decisions
- **uwf-eval is NOT part of uwf** — separate package, shells out to uwf CLI
- **Task = npm package** — fixture + task.yaml + judge scripts, distributable as tarball
- **Judge = Node script** — `node <entry> <cwd> <thread-id>`, outputs `{score, data}` JSON
- **Every output is OCAS typed** — eval-run, judge results all have registered schemas
- **Builtin judges** — frontmatter compliance, upstream consumption, hallucination, token stats
- **Task-specific judges** — bundled in the task package, custom schema per judge
## Deliverables
### Phase 1: Foundation (`@united-workforce/eval`)
New package in the uwf monorepo.
```
packages/eval/
src/
cli.ts # uwf-eval entry point
commands/
run.ts # uwf-eval run
report.ts # uwf-eval report <hash>
diff.ts # uwf-eval diff <hash> <hash>
list.ts # uwf-eval list
runner/
prepare.ts # unpack tarball/dir → tmp cwd
execute.ts # shell out to uwf thread start/exec
collect.ts # run judges, collect scores
judge/
types.ts # JudgeInput, JudgeOutput types
builtin/
frontmatter.ts # frontmatter compliance check
upstream.ts # upstream info consumption (LLM-as-judge)
hallucination.ts # hallucination detection (LLM-as-judge)
token-stats.ts # token usage from $usage field (#68)
storage/
schemas.ts # OCAS schema definitions
store.ts # CAS read/write helpers
index.ts # variable indexing (@uwf/eval/*)
task/
types.ts # TaskManifest type (task.yaml)
loader.ts # parse task.yaml, validate
package.json
tsconfig.json
```
#### OCAS Schemas to Register
1. `@uwf/eval-run` — full eval execution record
```
{ task, config: {agent, model, engineVersion}, threadId,
judges: [{name, score, weight, dataHash}], overall, timestamp }
```
2. `@uwf/eval-judge-frontmatter` — frontmatter judge data
```
{ stepsTotal, stepsValid, invalidSteps: [{stepIndex, role, errors: string[]}] }
```
3. `@uwf/eval-judge-upstream` — upstream consumption judge data
```
{ perStep: [{role, consumed: string[], missed: string[], score}] }
```
4. `@uwf/eval-judge-hallucination` — hallucination judge data
```
{ perStep: [{role, hallucinations: string[], score}] }
```
5. `@uwf/eval-judge-token-stats` — token stats (not scored, informational)
```
{ totalInput, totalOutput, totalTurns, perStep: [{role, input, output, turns, duration}] }
```
#### CLI Design
```bash
# Run eval
uwf-eval run <task-dir-or-tarball> [--agent hermes] [--model claude-sonnet-4] [--count 20]
# View results
uwf-eval report <run-hash> # render via ocas render
uwf-eval diff <hash1> <hash2> # side-by-side comparison
uwf-eval list # list past runs
```
### Phase 2: Task Package Scaffold
Template for creating eval tasks. Also serves as the first real task.
```
eval-tasks/ # shazhou/uwf-eval-tasks monorepo
packages/
_template/ # copypaste template
package.json
task.yaml
fixture/
src/judges/
tsconfig.json
fix-off-by-one/ # first real task
package.json # @uwf-eval/fix-off-by-one
task.yaml
fixture/
src/calc.ts # buggy calculator
src/calc.test.ts # test that exposes the bug
package.json
src/judges/
test-pass.ts # runs pnpm test, checks exit code
code-quality.ts # LLM judge: minimal change, correct fix
schemas/
test-pass.json # OCAS schema for test-pass data
code-quality.json # OCAS schema for code-quality data
tsconfig.json
pnpm-workspace.yaml
tsconfig.json
biome.json
```
#### task.yaml Format
```yaml
name: fix-off-by-one
description: Fix an off-by-one error in a calculator's add function
workflow: solve-issue # registered workflow name, or relative path to .yaml
prompt: "Fix the bug: add(1,2) returns 4 instead of 3"
limits:
maxSteps: 15
timeoutMinutes: 30
judges:
- name: frontmatter-compliance
weight: 0.15
builtin: true
- name: upstream-consumption
weight: 0.15
builtin: true
- name: hallucination
weight: 0.1
builtin: true
- name: token-stats
weight: 0 # informational, not scored
builtin: true
- name: test-pass
weight: 0.3
entry: dist/judges/test-pass.js
schema: schemas/test-pass.json
- name: code-quality
weight: 0.3
entry: dist/judges/code-quality.js
schema: schemas/code-quality.json
```
#### Judge Script Contract
```typescript
// Input: process.argv = [node, script, cwd, threadId]
// Output: stdout JSON
// Exit 0 = success, non-zero = judge error (not low score)
import type { JudgeOutput } from "@united-workforce/eval";
const result: JudgeOutput<TestPassData> = {
score: 1.0, // 0.0 - 1.0
data: { // typed per judge schema
command: "pnpm test",
exitCode: 0,
output: "3 tests passed"
}
};
console.log(JSON.stringify(result));
```
### Phase 3: Prerequisite — $usage in Adapter Protocol (#68)
Blocked by #68. Token stats judge needs `$usage` in step nodes.
Can proceed with Phase 1+2 without it — token-stats judge just returns zeros until adapters report usage.
## Implementation Order
1. **Phase 1a**: `@united-workforce/eval` package scaffold + CLI skeleton + OCAS schemas
2. **Phase 1b**: `run` command — prepare, execute, collect flow
3. **Phase 1c**: Builtin judges — frontmatter (deterministic), upstream + hallucination (LLM-as-judge)
4. **Phase 2a**: Create `shazhou/uwf-eval-tasks` monorepo with proman
5. **Phase 2b**: First task `fix-off-by-one` with fixture repo + 2 custom judges
6. **Phase 2c**: End-to-end test: `uwf-eval run packages/fix-off-by-one --agent hermes`
7. **Phase 1d**: `report`, `diff`, `list` commands (read from CAS, render via ocas render)
## Dependencies
- `@ocas/core` + `@ocas/fs` — CAS storage
- `@united-workforce/protocol` — step node types
- `commander` — CLI framework (consistent with uwf)
- LLM API access — for LLM-as-judge (upstream, hallucination, task-specific quality judges)
## Open Questions
1. **LLM-as-judge provider config** — reuse uwf's `~/.uwf/config.yaml` provider settings? Or separate config?
2. **Workflow file location** — task.yaml references a workflow. Should the workflow YAML be inside the tarball, or reference a registered workflow by name?
3. **Non-coding tasks** — debate workflow has no fixture repo. task.yaml needs `fixture: null` or simply omit the `fixture/` dir. Runner creates empty cwd.
4. **Parallel judge execution** — judges are independent, can run in parallel. Worth the complexity?
## Risks
- LLM-as-judge consistency — same input may get different scores. Mitigation: run judge multiple times, take average? Or accept variance.
- Token cost of judges — each LLM judge call costs tokens. For a 10-step workflow with 2 LLM judges = 20 LLM calls just for judging. Acceptable?
- Fixture repo drift — if the fixture evolves, old eval runs become non-comparable. Pin fixture version in task.yaml.
+25
View File
@@ -0,0 +1,25 @@
# Changelog
## 0.1.0 (2026-06-05)
Initial release of `@united-workforce/*` — a stateless workflow engine for AI agent orchestration.
### Packages
- **@united-workforce/protocol** — shared types (WorkflowPayload, StepNode, etc.)
- **@united-workforce/util** — Crockford Base32, ULID, structured logger, frontmatter parsing
- **@united-workforce/util-agent** — agent factory, context builder, extract pipeline
- **@united-workforce/cli** — `uwf` CLI (thread lifecycle, status-based moderator, workflow registry)
- **@united-workforce/eval** — `uwf-eval` CLI (prepare → execute → collect eval pipeline)
- **@united-workforce/agent-hermes** — `uwf-hermes` adapter (Hermes Agent)
- **@united-workforce/agent-claude-code** — `uwf-claude-code` adapter (Claude Code CLI)
- **@united-workforce/agent-builtin** — `uwf-builtin` adapter (built-in LLM agent)
- **@united-workforce/agent-mock** — `uwf-mock` adapter (deterministic test agent)
### Highlights
- Status-based graph routing (no LLM moderator cost)
- CAS-backed immutable thread chains (`@ocas/core`)
- Real token usage tracking (Hermes + Claude Code)
- Eval framework with built-in judges (frontmatter, token-stats, test-pass)
- `$SUSPEND` / resume for human-in-the-loop workflows
+17 -16
View File
@@ -222,41 +222,42 @@ Test files (`__tests__/**`) are exempt.
| Tool | Purpose | | Tool | Purpose |
|------|---------| |------|---------|
| **bun** | Package manager + runtime | | **pnpm** | Package manager |
| **TypeScript** | Type checking (strict mode) | | **TypeScript** | Type checking (strict mode) |
| **Biome** | Lint + format (replaces ESLint + Prettier) | | **Biome** | Lint + format (replaces ESLint + Prettier) |
| **vitest** | Test runner (`cli` uses vitest; other packages use `bun test`) | | **vitest** | Test runner (all packages) |
### Development Workflow ### Development Workflow
```bash ```bash
# ── Setup ── # ── Setup ──
bun install # install all workspace dependencies pnpm install # install all workspace dependencies
# ── Daily development ── # ── Daily development ──
bun run build # tsc --build (all packages, dependency order) pnpm run build # build all packages (dependency order)
bun run check # tsc --build + biome check + lint-log-tags pnpm run check # biome check + lint-log-tags
bun run format # biome format --write pnpm run typecheck # tsc --build
bun test # run tests across all packages pnpm run test # run tests across all packages
# ── Before committing ── # ── Before committing ──
bun run check # must pass — typecheck + lint + log tag validation pnpm run check # must pass — lint + log tag validation
bun test # must pass — all package tests pnpm run typecheck # must pass — type checking
pnpm run test # must pass — all package tests
``` ```
### Publishing ### Publishing
All public `@united-workforce/*` packages are published to **npmjs.org** with **fixed mode** (all packages share the same version number). All public `@united-workforce/*` packages are published to **npmjs.org** with **independent versioning**.
```bash ```bash
# 1. Add a changeset describing the change # 1. Add a changeset describing the change
bun changeset npx changeset
# 2. Bump all package versions + generate CHANGELOGs # 2. Bump versions + generate CHANGELOGs
bun version proman bump
# 3. Build, test, and publish (runs scripts/publish-all.mjs) # 3. Build, test, and publish
bun release proman publish
# Or publish manually with a tag: # Or publish manually with a tag:
node scripts/publish-all.mjs --tag alpha node scripts/publish-all.mjs --tag alpha
@@ -265,7 +266,7 @@ node scripts/publish-all.mjs --dry-run # preview without publishing
- `workspace:^` dependencies resolve to `^x.y.z` on publish - `workspace:^` dependencies resolve to `^x.y.z` on publish
- Publish order defined in `scripts/publish-all.mjs` (dependency order) - Publish order defined in `scripts/publish-all.mjs` (dependency order)
- Changesets config: `.changeset/config.json` (fixed mode, public access) - Changesets config: `.changeset/config.json` (independent versioning, public access)
### End-to-end: Author → Register → Run ### End-to-end: Author → Register → Run
+1 -1
View File
@@ -470,7 +470,7 @@ Use the `ocas` CLI for direct CAS operations (`~/.ocas/` store, shared with `uwf
| Tool | Purpose | | Tool | Purpose |
|------|---------| |------|---------|
| **bun** | Package manager + runtime | | **pnpm** | Package manager |
| **TypeScript** | Type checking (strict mode) | | **TypeScript** | Type checking (strict mode) |
| **Biome** | Lint + format | | **Biome** | Lint + format |
| **vitest** | Test runner | | **vitest** | Test runner |
+3 -3
View File
@@ -17,7 +17,7 @@ The root README should have these sections in order:
4. **Packages** — table with ALL packages from packages/ directory, columns: Package, Description, Type (cli/lib/agent/app) 4. **Packages** — table with ALL packages from packages/ directory, columns: Package, Description, Type (cli/lib/agent/app)
5. **Quick Start** — install, build, register workflow, start thread, run step 5. **Quick Start** — install, build, register workflow, start thread, run step
6. **CLI Reference** — brief command list, detailed usage in cli README 6. **CLI Reference** — brief command list, detailed usage in cli README
7. **Development**bun install / build / check / test 7. **Development**pnpm install / build / check / test
## Per-Package README Structure ## Per-Package README Structure
@@ -26,7 +26,7 @@ Each package README should have:
1. **Title** — package name 1. **Title** — package name
2. **One-line description** — matching package.json 2. **One-line description** — matching package.json
3. **Overview** — what it does, where it sits in the architecture, dependencies 3. **Overview** — what it does, where it sits in the architecture, dependencies
4. **Installation**bun add (for libs) or "included as binary" (for cli/agents) 4. **Installation**pnpm add (for libs) or "included as binary" (for cli/agents)
5. **API** (lib packages) — all exports from src/index.ts with type signatures, grouped by category, minimal usage examples 5. **API** (lib packages) — all exports from src/index.ts with type signatures, grouped by category, minimal usage examples
6. **CLI Usage** (cli/agent packages) — command reference with examples 6. **CLI Usage** (cli/agent packages) — command reference with examples
7. **Internal Structure** — brief src/ file organization 7. **Internal Structure** — brief src/ file organization
@@ -56,7 +56,7 @@ For each package read:
- All relative links work - All relative links work
- Package names match package.json - Package names match package.json
- No references to removed/renamed packages - No references to removed/renamed packages
- bun run build still passes - pnpm run build still passes
## Guidelines ## Guidelines
+2 -2
View File
@@ -23,7 +23,7 @@ roles:
type: object type: object
properties: properties:
$status: $status:
enum: ["_"] enum: ["done"]
thesis: thesis:
type: string type: string
keyPoints: keyPoints:
@@ -37,4 +37,4 @@ graph:
$START: $START:
_: { role: "analyst", prompt: "Analyze the topic in the task and produce a structured summary with key points." } _: { role: "analyst", prompt: "Analyze the topic in the task and produce a structured summary with key points." }
analyst: analyst:
_: { role: "$END", prompt: "Analysis complete. Finish the workflow." } done: { role: "$END", prompt: "Analysis complete. Finish the workflow." }
+30
View File
@@ -0,0 +1,30 @@
name: eval-simple
description: "Single-role eval workflow: fixer takes prompt, fixes code, done."
roles:
fixer:
description: "Fixes the code based on the prompt"
goal: |
You are a code fixer. Read the prompt, understand the bug, fix it, and verify by running the tests.
capabilities:
- code-editing
- test-running
procedure: |
1. Read the prompt to understand what needs to be fixed
2. Fix the bug in the source code
3. Run the tests mentioned in the prompt to verify
4. Output $status=done when tests pass
output: "Describe what you fixed and confirm tests pass. Set $status to done."
frontmatter:
type: object
properties:
$status:
type: string
enum: [done]
summary:
type: string
required: [$status, summary]
graph:
$START:
_: { role: "fixer", prompt: "Fix the code issue described in the task prompt." }
fixer:
done: { role: "$END", prompt: "Fix complete." }
+2 -3
View File
@@ -1,6 +1,6 @@
{ {
"name": "@united-workforce/agent-builtin", "name": "@united-workforce/agent-builtin",
"version": "0.5.0", "version": "0.1.0",
"files": [ "files": [
"src", "src",
"dist", "dist",
@@ -8,7 +8,7 @@
], ],
"type": "module", "type": "module",
"bin": { "bin": {
"uwf-builtin": "./src/cli.ts" "uwf-builtin": "./dist/cli.js"
}, },
"exports": { "exports": {
".": { ".": {
@@ -17,7 +17,6 @@
} }
}, },
"scripts": { "scripts": {
"prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
"test": "vitest run __tests__/", "test": "vitest run __tests__/",
"test:ci": "vitest run __tests__/" "test:ci": "vitest run __tests__/"
}, },
+8 -1
View File
@@ -82,7 +82,13 @@ async function runBuiltinWithMessages(
if (loopResult.turnCount === 0) { if (loopResult.turnCount === 0) {
log("5RWTK9NB", "no turns produced, returning empty output"); log("5RWTK9NB", "no turns produced, returning empty output");
return { output: "", detailHash: "", sessionId: session.sessionId, assembledPrompt: "" }; return {
output: "",
detailHash: "",
sessionId: session.sessionId,
assembledPrompt: "",
usage: null,
};
} }
// Read jsonl → persist turns to CAS → store detail // Read jsonl → persist turns to CAS → store detail
@@ -99,6 +105,7 @@ async function runBuiltinWithMessages(
detailHash, detailHash,
sessionId: session.sessionId, sessionId: session.sessionId,
assembledPrompt: "", assembledPrompt: "",
usage: null,
}; };
} }
+2 -2
View File
@@ -8,7 +8,7 @@
], ],
"type": "module", "type": "module",
"bin": { "bin": {
"uwf-claude-code": "./src/cli.ts" "uwf-claude-code": "./dist/cli.js"
}, },
"exports": { "exports": {
".": { ".": {
@@ -17,12 +17,12 @@
} }
}, },
"scripts": { "scripts": {
"prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
"test": "vitest run __tests__/", "test": "vitest run __tests__/",
"test:ci": "vitest run __tests__/" "test:ci": "vitest run __tests__/"
}, },
"dependencies": { "dependencies": {
"@ocas/core": "^0.3.0", "@ocas/core": "^0.3.0",
"@united-workforce/protocol": "workspace:^",
"@united-workforce/util": "workspace:^", "@united-workforce/util": "workspace:^",
"@united-workforce/util-agent": "workspace:^" "@united-workforce/util-agent": "workspace:^"
}, },
@@ -1,5 +1,6 @@
import { spawn } from "node:child_process"; import { spawn } from "node:child_process";
import type { Store } from "@ocas/core"; import type { Store } from "@ocas/core";
import type { Usage } from "@united-workforce/protocol";
import { createLogger } from "@united-workforce/util"; import { createLogger } from "@united-workforce/util";
import { import {
type AgentContext, type AgentContext,
@@ -145,7 +146,14 @@ async function processClaudeOutput(
); );
} }
return { output, detailHash, sessionId, assembledPrompt }; const usage: Usage = {
turns: parsed.numTurns,
inputTokens: parsed.usage.inputTokens,
outputTokens: parsed.usage.outputTokens,
duration: Math.round(parsed.durationMs / 1000),
};
return { output, detailHash, sessionId, assembledPrompt, usage };
} }
// Truly unparseable output - provide enhanced error message // Truly unparseable output - provide enhanced error message
+1 -1
View File
@@ -2,5 +2,5 @@
"extends": "../../tsconfig.json", "extends": "../../tsconfig.json",
"compilerOptions": { "rootDir": "src", "outDir": "dist" }, "compilerOptions": { "rootDir": "src", "outDir": "dist" },
"include": ["src"], "include": ["src"],
"references": [{ "path": "../util-agent" }] "references": [{ "path": "../protocol" }, { "path": "../util-agent" }]
} }
+18
View File
@@ -0,0 +1,18 @@
# @united-workforce/agent-hermes
## 0.1.1
### Patch Changes
- 8085d1d: fix: read token usage from ACP PromptResponse instead of DB
Token counts (inputTokens, outputTokens) now come from the ACP
`PromptResponse.usage` field, which is populated synchronously from
`run_conversation()` return data — no WAL race condition.
Turns (assistant message count) still come from the DB via
`snapshotTurns()` before/after delta.
Previously both tokens and turns were read from the Hermes state DB
after the ACP prompt returned, but due to WAL write lag the DB often
had incomplete token data at read time (e.g. 235 vs actual 26,080).
@@ -1,55 +0,0 @@
import { afterEach, beforeEach, describe, expect, it } from "vitest";
import { HermesAcpClient } from "../../src/acp-client.js";
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
describe("HermesAcpClient", () => {
let client: HermesAcpClient;
beforeEach(() => {
client = new HermesAcpClient();
});
afterEach(async () => {
await client.close();
});
it(
"connect() returns a UUID sessionId",
async () => {
const sessionId = await client.connect(process.cwd());
expect(typeof sessionId).toBe("string");
expect(sessionId).toMatch(UUID_RE);
},
{ timeout: 2 * 60 * 1000 },
);
it(
"prompt() returns a non-empty text response",
async () => {
await client.connect(process.cwd());
const result = await client.prompt("Reply with exactly the word: PONG");
expect(typeof result.text).toBe("string");
expect(result.text.length).toBeGreaterThan(0);
expect(typeof result.sessionId).toBe("string");
expect(result.sessionId).toMatch(UUID_RE);
},
{ timeout: 2 * 60 * 1000 },
);
it(
"prompt() can be called twice on the same session (resume)",
async () => {
await client.connect(process.cwd());
const first = await client.prompt("Say the word ALPHA and nothing else.");
expect(first.text.length).toBeGreaterThan(0);
const second = await client.prompt("Now say the word BETA and nothing else.");
expect(second.text.length).toBeGreaterThan(0);
expect(first.sessionId).toBe(second.sessionId);
},
{ timeout: 2 * 60 * 1000 },
);
});
@@ -1,56 +0,0 @@
import { afterEach, describe, expect, it } from "vitest";
import { HermesAcpClient } from "../../src/acp-client.js";
/**
* E2E test for cross-process session resume.
*
* Simulates the workflow re-entry scenario:
* 1. Client A: connect → prompt → close (developer first run)
* 2. Client B: resume(sessionId) → prompt (developer re-entry after reviewer reject)
*
* This is what happens when uwf thread step spawns uwf-hermes twice for the same role.
*/
describe("HermesAcpClient cross-process resume", () => {
const clients: HermesAcpClient[] = [];
afterEach(async () => {
for (const c of clients) {
await c.close();
}
clients.length = 0;
});
// TODO(#435): flaky — depends on live LLM; mock or move to integration suite
it.skip(
"resume() after close — second prompt returns non-empty text",
async () => {
// --- Client A: first run ---
const clientA = new HermesAcpClient();
clients.push(clientA);
await clientA.connect(process.cwd());
const first = await clientA.prompt(
"Remember the secret code: WATERMELON. Reply with exactly: ACKNOWLEDGED",
);
expect(first.text.length).toBeGreaterThan(0);
const sessionId = first.sessionId;
// Close client A (simulates uwf-hermes process exit)
await clientA.close();
// --- Client B: resume (simulates re-entry) ---
const clientB = new HermesAcpClient();
clients.push(clientB);
await clientB.resume(sessionId, process.cwd());
const second = await clientB.prompt(
"What was the secret code I told you earlier? Reply with just the code word.",
);
// The critical assertion: resumed session produces non-empty output
expect(second.text.length).toBeGreaterThan(0);
expect(second.sessionId).toBe(sessionId);
},
{ timeout: 3 * 60 * 1000 },
);
});
@@ -140,7 +140,9 @@ function createTestDb(dbPath: string): TestDb {
db.exec(`CREATE TABLE sessions ( db.exec(`CREATE TABLE sessions (
id TEXT PRIMARY KEY, id TEXT PRIMARY KEY,
model TEXT NOT NULL, model TEXT NOT NULL,
started_at INTEGER NOT NULL started_at INTEGER NOT NULL,
input_tokens INTEGER DEFAULT 0,
output_tokens INTEGER DEFAULT 0
)`); )`);
db.exec(`CREATE TABLE messages ( db.exec(`CREATE TABLE messages (
id INTEGER PRIMARY KEY AUTOINCREMENT, id INTEGER PRIMARY KEY AUTOINCREMENT,
@@ -0,0 +1,122 @@
import { describe, expect, test } from "vitest";
import type { AcpUsage } from "../src/acp-client.js";
import { buildUsage, snapshotTurns } from "../src/hermes.js";
import type { HermesSessionJson } from "../src/types.js";
function makeSession(overrides: Partial<HermesSessionJson> = {}): HermesSessionJson {
return {
session_id: "test-session",
model: "test-model",
session_start: "2026-01-01T00:00:00Z",
messages: [],
inputTokens: 0,
outputTokens: 0,
...overrides,
};
}
describe("snapshotTurns", () => {
test("returns zero for null session", () => {
const result = snapshotTurns(null);
expect(result).toEqual({ turns: 0 });
});
test("returns zero for empty session", () => {
const result = snapshotTurns(makeSession());
expect(result).toEqual({ turns: 0 });
});
test("counts assistant messages as turns", () => {
const result = snapshotTurns(
makeSession({
messages: [
{ role: "user", content: "hello", reasoning: null, tool_calls: null },
{ role: "assistant", content: "hi", reasoning: null, tool_calls: null },
{ role: "user", content: "do X", reasoning: null, tool_calls: null },
{ role: "tool", content: "result", reasoning: null, tool_calls: null },
{ role: "assistant", content: "done", reasoning: null, tool_calls: null },
],
inputTokens: 1000,
outputTokens: 500,
}),
);
expect(result).toEqual({ turns: 2 });
});
test("ignores non-assistant messages for turn count", () => {
const result = snapshotTurns(
makeSession({
messages: [
{ role: "user", content: "hello", reasoning: null, tool_calls: null },
{ role: "tool", content: "result", reasoning: null, tool_calls: null },
],
}),
);
expect(result.turns).toBe(0);
});
});
describe("buildUsage", () => {
const acpUsage: AcpUsage = { inputTokens: 5000, outputTokens: 2000, totalTokens: 7000 };
test("first visit: tokens from ACP, turns from DB delta", () => {
const beforeTurns = { turns: 0 };
const afterTurns = { turns: 3 };
const result = buildUsage(acpUsage, beforeTurns, afterTurns, 12.5);
expect(result).toEqual({
turns: 3,
inputTokens: 5000,
outputTokens: 2000,
duration: 13,
});
});
test("re-entry: turn delta computed correctly, tokens from ACP", () => {
const beforeTurns = { turns: 2 };
const afterTurns = { turns: 4 };
const acpDelta: AcpUsage = { inputTokens: 8000, outputTokens: 3500, totalTokens: 11500 };
const result = buildUsage(acpDelta, beforeTurns, afterTurns, 7.3);
expect(result).toEqual({
turns: 2,
inputTokens: 8000,
outputTokens: 3500,
duration: 7,
});
});
test("floors negative turn deltas at 0, then defaults to 1", () => {
const beforeTurns = { turns: 5 };
const afterTurns = { turns: 3 };
const result = buildUsage(acpUsage, beforeTurns, afterTurns, 1.0);
// turns would be negative (-2), floored to 0, then || 1 gives 1
expect(result.turns).toBe(1);
});
test("zero turns delta defaults to 1 (at least one turn happened)", () => {
const beforeTurns = { turns: 3 };
const afterTurns = { turns: 3 };
const result = buildUsage(acpUsage, beforeTurns, afterTurns, 5.0);
// turns delta is 0, || 1 gives 1
expect(result.turns).toBe(1);
});
test("null ACP usage yields zero tokens", () => {
const beforeTurns = { turns: 0 };
const afterTurns = { turns: 2 };
const result = buildUsage(null, beforeTurns, afterTurns, 10.0);
expect(result).toEqual({
turns: 2,
inputTokens: 0,
outputTokens: 0,
duration: 10,
});
});
test("duration is rounded", () => {
const beforeTurns = { turns: 0 };
const afterTurns = { turns: 1 };
expect(buildUsage(acpUsage, beforeTurns, afterTurns, 3.7).duration).toBe(4);
expect(buildUsage(acpUsage, beforeTurns, afterTurns, 3.2).duration).toBe(3);
expect(buildUsage(acpUsage, beforeTurns, afterTurns, 0.0).duration).toBe(0);
});
});
+2 -3
View File
@@ -1,6 +1,6 @@
{ {
"name": "@united-workforce/agent-hermes", "name": "@united-workforce/agent-hermes",
"version": "0.5.0", "version": "0.1.1",
"files": [ "files": [
"src", "src",
"dist", "dist",
@@ -8,7 +8,7 @@
], ],
"type": "module", "type": "module",
"bin": { "bin": {
"uwf-hermes": "./src/cli.ts" "uwf-hermes": "./dist/cli.js"
}, },
"exports": { "exports": {
".": { ".": {
@@ -17,7 +17,6 @@
} }
}, },
"scripts": { "scripts": {
"prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
"test": "vitest run __tests__/", "test": "vitest run __tests__/",
"test:ci": "vitest run __tests__/" "test:ci": "vitest run __tests__/"
}, },
+29
View File
@@ -17,9 +17,17 @@ type PendingRequest = {
reject: (reason: Error) => void; reject: (reason: Error) => void;
}; };
/** Token usage returned by ACP PromptResponse. */
export type AcpUsage = {
inputTokens: number;
outputTokens: number;
totalTokens: number;
};
export type AcpPromptResult = { export type AcpPromptResult = {
text: string; text: string;
sessionId: string; sessionId: string;
usage: AcpUsage | null;
}; };
export class HermesAcpClient { export class HermesAcpClient {
@@ -72,6 +80,11 @@ export class HermesAcpClient {
return sessionId; return sessionId;
} }
/** Return the current session ID, or null if not connected. */
getSessionId(): string | null {
return this.sessionId;
}
/** Send prompt and collect final assistant text from ACP stream chunks. */ /** Send prompt and collect final assistant text from ACP stream chunks. */
async prompt(text: string): Promise<AcpPromptResult> { async prompt(text: string): Promise<AcpPromptResult> {
if (this.sessionId === null) { if (this.sessionId === null) {
@@ -91,9 +104,25 @@ export class HermesAcpClient {
); );
} }
// Extract token usage from ACP PromptResponse.result.usage (camelCase wire format)
const result = (response as { result?: Record<string, unknown> }).result;
const rawUsage = result?.usage as Record<string, unknown> | undefined;
const usage: AcpUsage | null =
rawUsage !== undefined &&
typeof rawUsage.inputTokens === "number" &&
typeof rawUsage.outputTokens === "number" &&
typeof rawUsage.totalTokens === "number"
? {
inputTokens: rawUsage.inputTokens,
outputTokens: rawUsage.outputTokens,
totalTokens: rawUsage.totalTokens,
}
: null;
return { return {
text: this.messageChunks.join(""), text: this.messageChunks.join(""),
sessionId: this.sessionId, sessionId: this.sessionId,
usage,
}; };
} }
+80 -8
View File
@@ -1,4 +1,5 @@
import type { Store } from "@ocas/core"; import type { Store } from "@ocas/core";
import type { Usage } from "@united-workforce/protocol";
import { createLogger } from "@united-workforce/util"; import { createLogger } from "@united-workforce/util";
import { import {
type AgentContext, type AgentContext,
@@ -7,13 +8,50 @@ import {
buildRolePrompt, buildRolePrompt,
createAgent, createAgent,
} from "@united-workforce/util-agent"; } from "@united-workforce/util-agent";
import type { AcpUsage } from "./acp-client.js";
import { HermesAcpClient } from "./acp-client.js"; import { HermesAcpClient } from "./acp-client.js";
import { getCachedSessionId, setCachedSessionId } from "./session-cache.js"; import { getCachedSessionId, setCachedSessionId } from "./session-cache.js";
import { loadHermesSession, storeHermesSessionDetail } from "./session-detail.js"; import { loadHermesSession, storeHermesSessionDetail } from "./session-detail.js";
import type { HermesSessionJson } from "./types.js";
const log = createLogger({ sink: { kind: "stderr" } }); const log = createLogger({ sink: { kind: "stderr" } });
/** Snapshot of session metrics taken before and after a prompt call. */
type TurnsSnapshot = {
turns: number;
};
const ZERO_TURNS: TurnsSnapshot = { turns: 0 };
/** Extract assistant turn count from a session. Returns zero for null sessions. */
export function snapshotTurns(session: HermesSessionJson | null): TurnsSnapshot {
if (session === null) {
return ZERO_TURNS;
}
return {
turns: session.messages.filter((m) => m.role === "assistant").length,
};
}
/**
* Build Usage from ACP token data + DB turn delta.
* Tokens come from ACP PromptResponse (synchronous, accurate).
* Turns come from DB before/after snapshots (may have WAL lag, but acceptable).
*/
export function buildUsage(
acpUsage: AcpUsage | null,
beforeTurns: TurnsSnapshot,
afterTurns: TurnsSnapshot,
durationSec: number,
): Usage {
return {
turns: Math.max(0, afterTurns.turns - beforeTurns.turns) || 1,
inputTokens: acpUsage?.inputTokens ?? 0,
outputTokens: acpUsage?.outputTokens ?? 0,
duration: Math.round(durationSec),
};
}
/** Assemble system prompt, task, and prior step outputs for Hermes. */ /** Assemble system prompt, task, and prior step outputs for Hermes. */
export function buildHermesPrompt(ctx: AgentContext): string { export function buildHermesPrompt(ctx: AgentContext): string {
const parts: string[] = []; const parts: string[] = [];
@@ -108,25 +146,45 @@ export function createHermesAgent(resumeDisabled: boolean): () => Promise<void>
void client.close(); void client.close();
}); });
async function runPrompt(ctx: AgentContext, useContinuation: boolean): Promise<AgentRunResult> { async function runPrompt(
ctx: AgentContext,
useContinuation: boolean,
beforeTurns: TurnsSnapshot,
): Promise<AgentRunResult> {
const effectiveCtx = useContinuation ? ctx : { ...ctx, isFirstVisit: true }; const effectiveCtx = useContinuation ? ctx : { ...ctx, isFirstVisit: true };
const fullPrompt = buildHermesPrompt(effectiveCtx); const fullPrompt = buildHermesPrompt(effectiveCtx);
const { text, sessionId } = await client.prompt(fullPrompt); const startMs = Date.now();
const { text, sessionId, usage: acpUsage } = await client.prompt(fullPrompt);
const durationSec = (Date.now() - startMs) / 1000;
const { detailHash } = await storePromptResult(ctx.store, sessionId); const { detailHash } = await storePromptResult(ctx.store, sessionId);
if (!resumeDisabled) { if (!resumeDisabled) {
await setCachedSessionId(ctx.threadId, ctx.role, sessionId, ctx.storageRoot); await setCachedSessionId(ctx.threadId, ctx.role, sessionId, ctx.storageRoot);
} }
return { output: text, detailHash, sessionId, assembledPrompt: fullPrompt }; // Turns from DB (may lag slightly due to WAL, but acceptable)
const afterSession = await loadHermesSession(sessionId);
const afterTurns = snapshotTurns(afterSession);
const usage = buildUsage(acpUsage, beforeTurns, afterTurns, durationSec);
return { output: text, detailHash, sessionId, assembledPrompt: fullPrompt, usage };
} }
async function runHermes(ctx: AgentContext): Promise<AgentRunResult> { async function runHermes(ctx: AgentContext): Promise<AgentRunResult> {
const cwd = process.cwd(); const cwd = process.cwd();
const attempt = await prepareSession(client, ctx, cwd, resumeDisabled); const attempt = await prepareSession(client, ctx, cwd, resumeDisabled);
// Snapshot before prompt: for resumed sessions, captures cumulative state
// so we can compute the turn delta. For new sessions, this is ZERO_TURNS.
const currentSessionId = client.getSessionId();
const beforeSession =
attempt.resumed && currentSessionId !== null
? await loadHermesSession(currentSessionId)
: null;
const beforeTurns = snapshotTurns(beforeSession);
try { try {
return await runPrompt(ctx, attempt.useContinuation); return await runPrompt(ctx, attempt.useContinuation, beforeTurns);
} catch (error) { } catch (error) {
if (!attempt.resumed) { if (!attempt.resumed) {
throw error; throw error;
@@ -136,7 +194,8 @@ export function createHermesAgent(resumeDisabled: boolean): () => Promise<void>
log("8FQW2R6N", `continuation prompt failed, retrying with initial prompt: ${message}`); log("8FQW2R6N", `continuation prompt failed, retrying with initial prompt: ${message}`);
await client.close(); await client.close();
await client.connect(cwd); await client.connect(cwd);
return runPrompt(ctx, false); // Fresh session after retry — reset snapshot to zero
return runPrompt(ctx, false, ZERO_TURNS);
} }
} }
@@ -147,9 +206,22 @@ export function createHermesAgent(resumeDisabled: boolean): () => Promise<void>
): Promise<AgentRunResult> { ): Promise<AgentRunResult> {
// Client is already connected from runHermes — same ACP session, // Client is already connected from runHermes — same ACP session,
// so the agent sees the full conversation history (crucial for retries). // so the agent sees the full conversation history (crucial for retries).
const { text, sessionId } = await client.prompt(message); // Snapshot turns before the continuation prompt for delta computation.
const currentSessionId = client.getSessionId();
const beforeSession =
currentSessionId !== null ? await loadHermesSession(currentSessionId) : null;
const beforeTurns = snapshotTurns(beforeSession);
const startMs = Date.now();
const { text, sessionId, usage: acpUsage } = await client.prompt(message);
const durationSec = (Date.now() - startMs) / 1000;
const { detailHash } = await storePromptResult(store, sessionId); const { detailHash } = await storePromptResult(store, sessionId);
return { output: text, detailHash, sessionId, assembledPrompt: "" };
const afterSession = await loadHermesSession(sessionId);
const afterTurns = snapshotTurns(afterSession);
const usage = buildUsage(acpUsage, beforeTurns, afterTurns, durationSec);
return { output: text, detailHash, sessionId, assembledPrompt: "", usage };
} }
const agentMain = createAgent({ const agentMain = createAgent({
+7 -1
View File
@@ -1,2 +1,8 @@
export type { AcpUsage } from "./acp-client.js";
export { HermesAcpClient } from "./acp-client.js"; export { HermesAcpClient } from "./acp-client.js";
export { buildHermesPrompt, createHermesAgent } from "./hermes.js"; export {
buildHermesPrompt,
buildUsage,
createHermesAgent,
snapshotTurns,
} from "./hermes.js";
+8 -2
View File
@@ -106,7 +106,7 @@ function parseSessionJson(raw: unknown): HermesSessionJson | null {
messages.push(msg); messages.push(msg);
} }
} }
return { session_id, model, session_start, messages }; return { session_id, model, session_start, messages, inputTokens: 0, outputTokens: 0 };
} }
export function getHermesDbPath(): string { export function getHermesDbPath(): string {
@@ -117,6 +117,8 @@ type DbSessionRow = {
id: string; id: string;
model: string; model: string;
started_at: number; started_at: number;
input_tokens: number;
output_tokens: number;
}; };
type DbMessageRow = { type DbMessageRow = {
@@ -156,7 +158,9 @@ export function loadHermesSessionFromDb(
try { try {
db = new DatabaseSync(resolvedPath, { readOnly: true }); db = new DatabaseSync(resolvedPath, { readOnly: true });
const session = db const session = db
.prepare("SELECT id, model, started_at FROM sessions WHERE id = ?") .prepare(
"SELECT id, model, started_at, input_tokens, output_tokens FROM sessions WHERE id = ?",
)
.get(sessionId) as DbSessionRow | null; .get(sessionId) as DbSessionRow | null;
if (session === null) { if (session === null) {
return null; return null;
@@ -181,6 +185,8 @@ export function loadHermesSessionFromDb(
model: session.model, model: session.model,
session_start: new Date(session.started_at * 1000).toISOString(), session_start: new Date(session.started_at * 1000).toISOString(),
messages, messages,
inputTokens: session.input_tokens ?? 0,
outputTokens: session.output_tokens ?? 0,
}; };
} catch { } catch {
return null; return null;
+2
View File
@@ -40,4 +40,6 @@ export type HermesSessionJson = {
model: string; model: string;
session_start: string; session_start: string;
messages: HermesSessionMessage[]; messages: HermesSessionMessage[];
inputTokens: number;
outputTokens: number;
}; };
+1 -2
View File
@@ -1,6 +1,6 @@
{ {
"name": "@united-workforce/agent-mock", "name": "@united-workforce/agent-mock",
"version": "0.5.0", "version": "0.1.0",
"files": [ "files": [
"src", "src",
"dist", "dist",
@@ -17,7 +17,6 @@
} }
}, },
"scripts": { "scripts": {
"prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
"test": "vitest run __tests__/", "test": "vitest run __tests__/",
"test:ci": "vitest run __tests__/" "test:ci": "vitest run __tests__/"
}, },
+1
View File
@@ -103,6 +103,7 @@ export function createMockAgent(mockDataPath: string): () => Promise<void> {
detailHash, detailHash,
sessionId, sessionId,
assembledPrompt: "", assembledPrompt: "",
usage: { turns: 1, inputTokens: 0, outputTokens: 0, duration: 0 },
}; };
lastResult = result; lastResult = result;
return result; return result;
+9
View File
@@ -0,0 +1,9 @@
# @united-workforce/cli
## 0.1.1
### Patch Changes
- 850a3b2: fix: resolve --agent override via config alias before raw command
`resolveAgentConfig()` now checks `config.agents[alias]` first before falling back to `parseAgentOverride()`. Eval CLI default `--agent` changed from `"hermes"` to `"uwf-hermes"`.
+1 -2
View File
@@ -1,6 +1,6 @@
{ {
"name": "@united-workforce/cli", "name": "@united-workforce/cli",
"version": "0.5.0", "version": "0.1.1",
"files": [ "files": [
"src", "src",
"dist", "dist",
@@ -22,7 +22,6 @@
"yaml": "^2.8.4" "yaml": "^2.8.4"
}, },
"scripts": { "scripts": {
"prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
"test": "vitest run src/", "test": "vitest run src/",
"test:ci": "vitest run src/" "test:ci": "vitest run src/"
}, },
+10 -10
View File
@@ -42,7 +42,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["done"] }
graph: graph:
$START: $START:
_: _:
@@ -59,7 +59,7 @@ graph:
prompt: "Try again" prompt: "Try again"
location: null location: null
roleB: roleB:
_: done:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
@@ -92,7 +92,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["done"] }
roleC: roleC:
description: Fail role description: Fail role
goal: Do C goal: Do C
@@ -104,7 +104,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["done"] }
graph: graph:
$START: $START:
_: _:
@@ -121,12 +121,12 @@ graph:
prompt: "Do C (fail)" prompt: "Do C (fail)"
location: null location: null
roleB: roleB:
_: done:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
roleC: roleC:
_: done:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
@@ -147,7 +147,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["done"] }
graph: graph:
$START: $START:
_: _:
@@ -155,7 +155,7 @@ graph:
prompt: "Work" prompt: "Work"
location: null location: null
worker: worker:
_: done:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
@@ -426,8 +426,8 @@ describe("currentRole field", () => {
await writeFile(wf, SINGLE_ROLE_WORKFLOW_YAML, "utf8"); await writeFile(wf, SINGLE_ROLE_WORKFLOW_YAML, "utf8");
const { thread } = await cmdThreadStart(storageRoot, wf, "test", tmpDir); const { thread } = await cmdThreadStart(storageRoot, wf, "test", tmpDir);
// worker → _ maps to $END // worker → done maps to $END
await insertStepNode(storageRoot, thread as ThreadId, "worker", {}); await insertStepNode(storageRoot, thread as ThreadId, "worker", { $status: "done" });
const result = await cmdThreadShow(storageRoot, thread as ThreadId); const result = await cmdThreadShow(storageRoot, thread as ThreadId);
expect(result.currentRole).toBe(null); expect(result.currentRole).toBe(null);
@@ -229,6 +229,10 @@ describe("E2E mock-agent: full uwf pipeline", () => {
expect(getStatus(store, s1.output)).toBe("ready"); expect(getStatus(store, s1.output)).toBe("ready");
expect(getStatus(store, s2.output)).toBe("done"); expect(getStatus(store, s2.output)).toBe("done");
// Mock agent reports usage stats in step nodes.
expect(s1.usage).toEqual({ turns: 1, inputTokens: 0, outputTokens: 0, duration: 0 });
expect(s2.usage).toEqual({ turns: 1, inputTokens: 0, outputTokens: 0, duration: 0 });
// The start node points at the registered workflow. // The start node points at the registered workflow.
const startNode = store.cas.get(startHash as CasRef); const startNode = store.cas.get(startHash as CasRef);
expect((startNode!.payload as StartNodePayload).workflow).toBe(workflowHash); expect((startNode!.payload as StartNodePayload).workflow).toBe(workflowHash);
@@ -241,7 +245,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
expect(finalEntry!.head).toBe(step2.head); expect(finalEntry!.head).toBe(step2.head);
}); });
test("2. branching workflow loops developer→reviewer→developer→reviewer→$END", async () => { test("2. branching workflow loops developer→reviewer→developer→reviewer→$END", {
timeout: 30_000,
}, async () => {
await writeMockConfig("e2e-loop.mock.yaml"); await writeMockConfig("e2e-loop.mock.yaml");
const workflowHash = await addWorkflow("e2e-loop.workflow.yaml", "test-loop"); const workflowHash = await addWorkflow("e2e-loop.workflow.yaml", "test-loop");
@@ -299,7 +305,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
expect(finalEntry!.status).toBe("completed"); expect(finalEntry!.status).toBe("completed");
}); });
test("3. role mismatch in mock data makes the agent exit with an error", async () => { test("3. role mismatch in mock data makes the agent exit with an error", {
timeout: 30_000,
}, async () => {
// Reuses the linear workflow but with a mock whose step[1].role is wrong. // Reuses the linear workflow but with a mock whose step[1].role is wrong.
await writeMockConfig("e2e-mismatch.mock.yaml"); await writeMockConfig("e2e-mismatch.mock.yaml");
const workflowHash = await addWorkflow("e2e-linear.workflow.yaml", "test-linear"); const workflowHash = await addWorkflow("e2e-linear.workflow.yaml", "test-linear");
@@ -325,7 +333,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
expect(entry!.head).toBe(step1.head); expect(entry!.head).toBe(step1.head);
}); });
test("4. planner $SUSPEND then resume re-runs planner and reaches $END", async () => { test("4. planner $SUSPEND then resume re-runs planner and reaches $END", {
timeout: 30_000,
}, async () => {
await writeMockConfig("e2e-suspend.mock.yaml"); await writeMockConfig("e2e-suspend.mock.yaml");
const workflowHash = await addWorkflow("e2e-suspend.workflow.yaml", "test-suspend"); const workflowHash = await addWorkflow("e2e-suspend.workflow.yaml", "test-suspend");
@@ -372,7 +382,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
expect(finalEntry!.head).toBe(resumeOut.head); expect(finalEntry!.head).toBe(resumeOut.head);
}); });
test("5. --count 3 runs the whole linear pipeline in one invocation", async () => { test("5. --count 3 runs the whole linear pipeline in one invocation", {
timeout: 30_000,
}, async () => {
await writeMockConfig("e2e-count.mock.yaml"); await writeMockConfig("e2e-count.mock.yaml");
const workflowHash = await addWorkflow("e2e-count.workflow.yaml", "test-count"); const workflowHash = await addWorkflow("e2e-count.workflow.yaml", "test-count");
@@ -412,7 +424,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
expect(finalEntry!.head).toBe(results[2].head); expect(finalEntry!.head).toBe(results[2].head);
}); });
test("6. mustache edge prompt renders planner variables into the worker step", async () => { test("6. mustache edge prompt renders planner variables into the worker step", {
timeout: 30_000,
}, async () => {
await writeMockConfig("e2e-mustache.mock.yaml"); await writeMockConfig("e2e-mustache.mock.yaml");
const workflowHash = await addWorkflow("e2e-mustache.workflow.yaml", "test-mustache"); const workflowHash = await addWorkflow("e2e-mustache.workflow.yaml", "test-mustache");
@@ -441,7 +455,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
expect(workerStep.edgePrompt).toBe("Work on branch fix/42-auth in /tmp/my-repo"); expect(workerStep.edgePrompt).toBe("Work on branch fix/42-auth in /tmp/my-repo");
}); });
test("7. completed thread can be resumed (衔尾蛇: end → start)", async () => { test("7. completed thread can be resumed (衔尾蛇: end → start)", {
timeout: 30_000,
}, async () => {
// Reuse the suspend workflow (planner with ready → $END), but mock data // Reuse the suspend workflow (planner with ready → $END), but mock data
// goes straight to ready on first run, then ready again after resume. // goes straight to ready on first run, then ready again after resume.
await writeMockConfig("e2e-completed-resume.mock.yaml"); await writeMockConfig("e2e-completed-resume.mock.yaml");
@@ -8,10 +8,10 @@ const solveIssueGraph: WorkflowPayload["graph"] = {
_: { role: "planner", prompt: "Start planning from the issue in the task.", location: null }, _: { role: "planner", prompt: "Start planning from the issue in the task.", location: null },
}, },
planner: { planner: {
_: { role: "developer", prompt: "Implement the plan: {{plan}}", location: null }, planned: { role: "developer", prompt: "Implement the plan: {{plan}}", location: null },
}, },
developer: { developer: {
_: { role: "reviewer", prompt: "Review the changes: {{summary}}", location: null }, implemented: { role: "reviewer", prompt: "Review the changes: {{summary}}", location: null },
}, },
reviewer: { reviewer: {
approved: { role: "$END", prompt: "Done.", location: null }, approved: { role: "$END", prompt: "Done.", location: null },
@@ -112,7 +112,7 @@ describe("evaluate", () => {
test("mustache template rendering with simple fields", () => { test("mustache template rendering with simple fields", () => {
const result = evaluate(solveIssueGraph, "planner", { const result = evaluate(solveIssueGraph, "planner", {
$status: "_", $status: "planned",
plan: "Add auth middleware", plan: "Add auth middleware",
}); });
expect(result).toEqual({ expect(result).toEqual({
@@ -139,11 +139,11 @@ describe("evaluate", () => {
test("triple mustache also works for unescaped output", () => { test("triple mustache also works for unescaped output", () => {
const graph: Record<string, Record<string, Target>> = { const graph: Record<string, Record<string, Target>> = {
reviewer: { reviewer: {
_: { role: "developer", prompt: "Fix: {{{comments}}}", location: null }, rejected: { role: "developer", prompt: "Fix: {{{comments}}}", location: null },
}, },
}; };
const result = evaluate(graph, "reviewer", { const result = evaluate(graph, "reviewer", {
$status: "_", $status: "rejected",
comments: "<script>alert(1)</script>", comments: "<script>alert(1)</script>",
}); });
expect(result).toEqual({ expect(result).toEqual({
@@ -152,24 +152,22 @@ describe("evaluate", () => {
}); });
}); });
test("missing $status defaults to _ (unit routing)", () => { test("missing $status → error (no unit fallback)", () => {
const result = evaluate(solveIssueGraph, "planner", { const result = evaluate(solveIssueGraph, "planner", {
plan: "Add auth middleware", plan: "Add auth middleware",
}); });
expect(result).toEqual({ expect(result.ok).toBe(false);
ok: true, if (!result.ok) {
value: { expect(result.error.message).toBe(
role: "developer", 'agent output for role "planner" is missing required "$status" string',
prompt: "Implement the plan: Add auth middleware", );
location: null, }
},
});
}); });
test("mustache template with nested object paths", () => { test("mustache template with nested object paths", () => {
const graph: Record<string, Record<string, Target>> = { const graph: Record<string, Record<string, Target>> = {
reviewer: { reviewer: {
_: { rejected: {
role: "developer", role: "developer",
prompt: "Address: {{review.comments}}", prompt: "Address: {{review.comments}}",
location: null, location: null,
@@ -177,7 +175,7 @@ describe("evaluate", () => {
}, },
}; };
const result = evaluate(graph, "reviewer", { const result = evaluate(graph, "reviewer", {
$status: "_", $status: "rejected",
review: { comments: "refactor the handler" }, review: { comments: "refactor the handler" },
}); });
expect(result).toEqual({ expect(result).toEqual({
+63 -40
View File
@@ -6,101 +6,124 @@ import { describe, expect, test } from "vitest";
const __dirname = dirname(fileURLToPath(import.meta.url)); const __dirname = dirname(fileURLToPath(import.meta.url));
import { import {
cmdPromptAdapter, cmdPromptAdapterDeveloping,
cmdPromptAuthor, cmdPromptBootstrap,
cmdPromptDeveloper,
cmdPromptList, cmdPromptList,
cmdPromptSetup, cmdPromptSetup,
cmdPromptUsage, cmdPromptUsage,
cmdPromptUser, cmdPromptUsageReference,
cmdPromptWorkflowAuthoring,
} from "../commands/prompt.js"; } from "../commands/prompt.js";
describe("prompt commands", () => { describe("prompt commands", () => {
test("prompt list returns all prompt names", () => { test("prompt list returns new prompt names", () => {
const result = cmdPromptList(); const result = cmdPromptList();
expect(result).toBeInstanceOf(Array); expect(result).toBeInstanceOf(Array);
expect(result).toContain("user"); expect(result).toContain("usage");
expect(result).toContain("author"); expect(result).toContain("workflow-authoring");
expect(result).toContain("developer"); expect(result).toContain("adapter-developing");
expect(result).toContain("adapter"); expect(result).toContain("bootstrap");
expect(result).not.toContain("user");
expect(result).not.toContain("author");
expect(result).not.toContain("developer");
expect(result).not.toContain("adapter");
for (const name of result) { for (const name of result) {
expect(name).toMatch(/^\S+$/); expect(name).toMatch(/^\S+$/);
} }
}); });
test("prompt user returns non-empty markdown string", () => { test("prompt usage-reference returns non-empty markdown string with frontmatter", () => {
const result = cmdPromptUser(); const result = cmdPromptUsageReference();
expect(typeof result).toBe("string"); expect(typeof result).toBe("string");
expect(result).toContain("uwf"); expect(result).toContain("uwf");
expect(result).toContain("thread"); expect(result).toContain("thread");
expect(result).toContain("workflow"); expect(result).toContain("workflow");
expect(result).toContain("Quick Start"); expect(result).toContain("Quick Start");
expect(result).toContain("---");
expect(result).toContain("name:");
expect(result).toContain("version:");
expect(result.length).toBeGreaterThan(500); expect(result.length).toBeGreaterThan(500);
}); });
test("prompt author returns non-empty markdown string", () => { test("prompt workflow-authoring returns non-empty markdown string with frontmatter", () => {
const result = cmdPromptAuthor(); const result = cmdPromptWorkflowAuthoring();
expect(typeof result).toBe("string"); expect(typeof result).toBe("string");
expect(result).toContain("frontmatter"); expect(result).toContain("frontmatter");
expect(result).toContain("graph"); expect(result).toContain("graph");
expect(result).toContain("$START"); expect(result).toContain("$START");
expect(result).toContain("$END"); expect(result).toContain("$END");
expect(result).toContain("$status"); expect(result).toContain("$status");
expect(result).toContain("---");
expect(result).toContain("name:");
expect(result).toContain("version:");
expect(result.length).toBeGreaterThan(500); expect(result.length).toBeGreaterThan(500);
}); });
test("prompt developer returns non-empty markdown string", () => { test("prompt adapter-developing returns non-empty markdown string with frontmatter", () => {
const result = cmdPromptDeveloper(); const result = cmdPromptAdapterDeveloping();
expect(typeof result).toBe("string");
expect(result).toContain("Monorepo");
expect(result).toContain("CAS");
expect(result).toContain("Biome");
expect(result.length).toBeGreaterThan(500);
});
test("prompt adapter returns non-empty markdown string", () => {
const result = cmdPromptAdapter();
expect(typeof result).toBe("string"); expect(typeof result).toBe("string");
expect(result).toContain("createAgent"); expect(result).toContain("createAgent");
expect(result).toContain("AgentContext"); expect(result).toContain("AgentContext");
expect(result).toContain("frontmatter"); expect(result).toContain("frontmatter");
expect(result).toContain("---");
expect(result).toContain("name:");
expect(result).toContain("version:");
expect(result.length).toBeGreaterThan(500); expect(result.length).toBeGreaterThan(500);
}); });
test("prompt usage combines all references", () => { test("prompt bootstrap returns non-empty skill with frontmatter", () => {
const result = cmdPromptBootstrap();
expect(typeof result).toBe("string");
expect(result).toContain("uwf");
expect(result).toContain("---");
expect(result.length).toBeGreaterThan(100);
});
test("prompt usage combines remaining references (no developer)", () => {
const result = cmdPromptUsage(); const result = cmdPromptUsage();
expect(typeof result).toBe("string"); expect(typeof result).toBe("string");
expect(result).toContain("User Reference"); expect(result).toContain("Usage Reference");
expect(result).toContain("Author Reference"); expect(result).toContain("Workflow Authoring Reference");
expect(result).toContain("Developer Reference"); expect(result).toContain("Adapter Developing Reference");
expect(result).toContain("Adapter Reference"); expect(result).not.toContain("Developer Reference");
expect(result).toContain("---"); expect(result).toContain("---");
expect(result.length).toBeGreaterThan(2000); expect(result.length).toBeGreaterThan(2000);
}); });
test("prompt setup returns setup instructions", () => { test("prompt setup returns simplified setup instructions", () => {
const result = cmdPromptSetup(); const result = cmdPromptSetup();
expect(typeof result).toBe("string"); expect(typeof result).toBe("string");
expect(result).toContain("uwf Skill Setup"); expect(result).toContain("uwf Skill Setup");
expect(result).toContain("uwf prompt usage"); expect(result).toContain("uwf prompt bootstrap");
expect(result).toContain("uwf prompt setup");
expect(result).toContain("SKILL.md"); expect(result).toContain("SKILL.md");
expect(result).toContain("version"); expect(result).toContain("version");
expect(result).not.toMatch(/\bbun (install|run|test|changeset|version|release)\b/);
}); });
test("prompt help subcommand is suppressed", () => { test("prompt setup references new subcommand names", () => {
const output = execFileSync("npx", ["tsx", "src/cli.ts", "prompt", "--help"], { const result = cmdPromptSetup();
cwd: join(__dirname, "..", ".."), expect(result).toContain("uwf prompt usage");
expect(result).toContain("uwf prompt workflow-authoring");
expect(result).toContain("uwf prompt adapter-developing");
expect(result).not.toContain("uwf prompt user");
expect(result).not.toContain("uwf prompt author");
expect(result).not.toContain("uwf prompt developer");
expect(result).not.toMatch(/uwf prompt adapter\b(?!-developing)/);
});
test("prompt help subcommand is suppressed", { timeout: 30_000 }, () => {
const cliPath = join(__dirname, "..", "..", "dist", "cli.js");
const output = execFileSync("node", [cliPath, "prompt", "--help"], {
encoding: "utf-8", encoding: "utf-8",
env: { ...process.env, PATH: `/opt/homebrew/bin:${process.env.PATH}` }, env: { ...process.env },
}); });
expect(output).not.toMatch(/help\s+\[command\]/i); expect(output).not.toMatch(/help\s+\[command\]/i);
expect(output).toContain("usage"); expect(output).toContain("usage");
expect(output).toContain("setup"); expect(output).toContain("setup");
expect(output).toContain("user"); expect(output).toContain("workflow-authoring");
expect(output).toContain("author"); expect(output).toContain("adapter-developing");
expect(output).toContain("developer"); expect(output).toContain("bootstrap");
expect(output).toContain("adapter");
expect(output).toContain("list"); expect(output).toContain("list");
expect(output).not.toContain("developer");
}); });
}); });
@@ -118,6 +118,7 @@ async function createTestStep(
completedAtMs: Date.now() + 1000, completedAtMs: Date.now() + 1000,
assembledPrompt: null, assembledPrompt: null,
cwd: "/tmp", cwd: "/tmp",
usage: null,
}; };
return store.cas.put(schemas.stepNode, stepPayload); return store.cas.put(schemas.stepNode, stepPayload);
} }
@@ -96,6 +96,7 @@ describe("protocol types", () => {
completedAtMs: 2000, completedAtMs: 2000,
assembledPrompt: null, assembledPrompt: null,
cwd: "/test/path", cwd: "/test/path",
usage: null,
}; };
expect(record.startedAtMs).toBe(1000); expect(record.startedAtMs).toBe(1000);
expect(record.completedAtMs).toBe(2000); expect(record.completedAtMs).toBe(2000);
@@ -110,6 +111,7 @@ describe("protocol types", () => {
agent: "uwf-test", agent: "uwf-test",
timestamp: 123, timestamp: 123,
durationMs: 5000, durationMs: 5000,
usage: null,
}; };
expect(entry.durationMs).toBe(5000); expect(entry.durationMs).toBe(5000);
}); });
@@ -252,7 +254,7 @@ describe("thread read timing", () => {
}, },
graph: { graph: {
$START: { _: { role: "worker", prompt: "go", location: null } }, $START: { _: { role: "worker", prompt: "go", location: null } },
worker: { _: { role: "$END", prompt: "", location: null } }, worker: { done: { role: "$END", prompt: "", location: null } },
}, },
}); });
@@ -318,7 +320,7 @@ describe("thread read timing", () => {
}, },
graph: { graph: {
$START: { _: { role: "worker", prompt: "go", location: null } }, $START: { _: { role: "worker", prompt: "go", location: null } },
worker: { _: { role: "$END", prompt: "", location: null } }, worker: { done: { role: "$END", prompt: "", location: null } },
}, },
}); });
@@ -54,7 +54,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["ready"] }
graph: graph:
$START: $START:
_: _:
@@ -62,7 +62,7 @@ graph:
prompt: "Plan the work" prompt: "Plan the work"
location: null location: null
planner: planner:
_: ready:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
@@ -110,7 +110,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["ready"] }
graph: graph:
$START: $START:
_: _:
@@ -118,7 +118,7 @@ graph:
prompt: "Plan" prompt: "Plan"
location: null location: null
planner: planner:
_: ready:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
@@ -153,7 +153,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["ready"] }
graph: graph:
$START: $START:
_: _:
@@ -161,7 +161,7 @@ graph:
prompt: "Plan" prompt: "Plan"
location: null location: null
planner: planner:
_: ready:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
@@ -79,7 +79,7 @@ async function setupSuspendedThread(mode: MockAgentMode): Promise<{
}, },
ok: { role: "reviewer", prompt: "Review the work", location: null }, ok: { role: "reviewer", prompt: "Review the work", location: null },
}, },
reviewer: { _: { role: "$END", prompt: "Done", location: null } }, reviewer: { done: { role: "$END", prompt: "Done", location: null } },
}, },
}); });
@@ -234,7 +234,7 @@ describe("uwf thread resume", () => {
}, },
graph: { graph: {
$START: { _: { role: "worker", prompt: "Start", location: null } }, $START: { _: { role: "worker", prompt: "Start", location: null } },
worker: { _: { role: "$END", prompt: "Done", location: null } }, worker: { done: { role: "$END", prompt: "Done", location: null } },
}, },
}); });
@@ -480,8 +480,8 @@ describe("uwf thread resume - completed threads", () => {
}, },
graph: { graph: {
$START: { _: { role: "worker", prompt: "Start work", location: null } }, $START: { _: { role: "worker", prompt: "Start work", location: null } },
worker: { _: { role: "reviewer", prompt: "Review the work", location: null } }, worker: { done: { role: "reviewer", prompt: "Review the work", location: null } },
reviewer: { _: { role: "$END", prompt: "Done", location: null } }, reviewer: { done: { role: "$END", prompt: "Done", location: null } },
}, },
}); });
@@ -493,8 +493,8 @@ describe("uwf thread resume - completed threads", () => {
process.env.OCAS_HOME = casDir; process.env.OCAS_HOME = casDir;
const workerOutputHash = await store.cas.put(outputSchemaHash, { $status: "_" }); const workerOutputHash = await store.cas.put(outputSchemaHash, { $status: "done" });
const reviewerOutputHash = await store.cas.put(outputSchemaHash, { $status: "_" }); const reviewerOutputHash = await store.cas.put(outputSchemaHash, { $status: "done" });
const detailHash = await store.cas.put(schemas.text, "mock detail"); const detailHash = await store.cas.put(schemas.text, "mock detail");
const workerStepHash = await store.cas.put(schemas.stepNode, { const workerStepHash = await store.cas.put(schemas.stepNode, {
@@ -539,9 +539,7 @@ describe("uwf thread resume - completed threads", () => {
const { createUwfStore, getThread } = await import("../store.js"); const { createUwfStore, getThread } = await import("../store.js");
const verifyUwf = await createUwfStore(tmpDir); const verifyUwf = await createUwfStore(tmpDir);
const verifyEntry = getThread(verifyUwf.varStore, THREAD_ID); const verifyEntry = getThread(verifyUwf.varStore, THREAD_ID);
// biome-ignore lint/suspicious/noConsole: test debugging
console.log("Seeded entry status:", verifyEntry?.status); console.log("Seeded entry status:", verifyEntry?.status);
// biome-ignore lint/suspicious/noConsole: test debugging
console.log("Seeded entry:", JSON.stringify(verifyEntry, null, 2)); console.log("Seeded entry:", JSON.stringify(verifyEntry, null, 2));
const promptCapturePath = join(tmpDir, "captured-prompt-completed.txt"); const promptCapturePath = join(tmpDir, "captured-prompt-completed.txt");
@@ -565,7 +563,7 @@ describe("uwf thread resume - completed threads", () => {
stepHash: newWorkerStepHash, stepHash: newWorkerStepHash,
detailHash, detailHash,
role: "worker", role: "worker",
frontmatter: { $status: "_" }, frontmatter: { $status: "done" },
body: "", body: "",
startedAtMs: 1716600003000, startedAtMs: 1716600003000,
completedAtMs: 1716600004000, completedAtMs: 1716600004000,
@@ -601,7 +599,6 @@ echo '${adapterJson}'
); );
if (result.status !== 0) { if (result.status !== 0) {
// biome-ignore lint/suspicious/noConsole: test debugging
console.error("Command failed:", result.stderr); console.error("Command failed:", result.stderr);
} }
@@ -644,7 +641,7 @@ echo '${adapterJson}'
}, },
graph: { graph: {
$START: { _: { role: "worker", prompt: "Start", location: null } }, $START: { _: { role: "worker", prompt: "Start", location: null } },
worker: { _: { role: "$END", prompt: "Done", location: null } }, worker: { done: { role: "$END", prompt: "Done", location: null } },
}, },
}); });
@@ -692,7 +689,7 @@ echo '${adapterJson}'
}, },
graph: { graph: {
$START: { _: { role: "worker", prompt: "Start", location: null } }, $START: { _: { role: "worker", prompt: "Start", location: null } },
worker: { _: { role: "$END", prompt: "Done", location: null } }, worker: { done: { role: "$END", prompt: "Done", location: null } },
}, },
}); });
@@ -31,7 +31,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["ready"] }
graph: graph:
$START: $START:
_: _:
@@ -39,7 +39,7 @@ graph:
prompt: "Plan the work" prompt: "Plan the work"
location: null location: null
planner: planner:
_: ready:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
@@ -54,7 +54,7 @@ roles:
type: object type: object
required: ["$status"] required: ["$status"]
properties: properties:
$status: { type: string } $status: { type: string, enum: ["ready"] }
graph: graph:
$START: $START:
_: _:
@@ -62,7 +62,7 @@ graph:
prompt: "Plan the work" prompt: "Plan the work"
location: null location: null
planner: planner:
_: ready:
role: $END role: $END
prompt: "Done" prompt: "Done"
location: null location: null
@@ -2,19 +2,28 @@ import { execFileSync } from "node:child_process";
import { dirname, join } from "node:path"; import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url"; import { fileURLToPath } from "node:url";
import { describe, expect, test } from "vitest"; import { describe, expect, test } from "vitest";
import { validateCount } from "../commands/thread.js";
const CLI_PATH = join(dirname(fileURLToPath(import.meta.url)), "..", "cli.js"); const CLI_PATH = join(dirname(fileURLToPath(import.meta.url)), "..", "..", "dist", "cli.js");
function runCli(args: string[]): { stdout: string; stderr: string; exitCode: number } { function runCli(args: string[]): {
stdout: string;
stderr: string;
exitCode: number;
} {
try { try {
const stdout = execFileSync("npx", ["tsx", CLI_PATH, ...args], { const stdout = execFileSync("node", [CLI_PATH, ...args], {
encoding: "utf8", encoding: "utf8",
env: { ...process.env, UWF_HOME: "/tmp/uwf-test-nonexistent" }, env: { ...process.env, UWF_HOME: "/tmp/uwf-test-nonexistent" },
stdio: ["ignore", "pipe", "pipe"], stdio: ["ignore", "pipe", "pipe"],
}); });
return { stdout, stderr: "", exitCode: 0 }; return { stdout, stderr: "", exitCode: 0 };
} catch (e: unknown) { } catch (e: unknown) {
const err = e as NodeJS.ErrnoException & { stdout?: string; stderr?: string; status?: number }; const err = e as NodeJS.ErrnoException & {
stdout?: string;
stderr?: string;
status?: number;
};
return { return {
stdout: err.stdout ?? "", stdout: err.stdout ?? "",
stderr: err.stderr ?? "", stderr: err.stderr ?? "",
@@ -23,50 +32,39 @@ function runCli(args: string[]): { stdout: string; stderr: string; exitCode: num
} }
} }
describe("thread exec --count CLI parsing", () => { describe("thread exec --count CLI parsing", { timeout: 30_000 }, () => {
test("--help shows -c/--count option", () => { test("--help shows -c/--count option", () => {
const result = runCli(["thread", "exec", "--help"]); const result = runCli(["thread", "exec", "--help"]);
expect(result.stdout).toContain("--count"); const combined = result.stdout + result.stderr;
expect(result.stdout).toContain("-c"); expect(combined).toContain("--count");
expect(combined).toContain("-c");
}); });
test("description says 'one or more steps'", () => { test("description says 'one or more steps'", () => {
const result = runCli(["thread", "exec", "--help"]); const result = runCli(["thread", "exec", "--help"]);
expect(result.stdout).toContain("one or more steps"); const combined = result.stdout + result.stderr;
expect(combined).toContain("one or more steps");
}); });
}); });
describe("cmdThreadExec count logic", () => { describe("validateCount", () => {
test("count=0 fails with validation error", () => { test("count=0 throws validation error", () => {
const result = runCli(["thread", "exec", "FAKE_THREAD_ID", "-c", "0"]); expect(() => validateCount(0)).toThrow("positive integer");
expect(result.exitCode).not.toBe(0);
expect(result.stderr).toContain("positive integer");
}); });
test("negative count fails with validation error", () => { test("negative count throws validation error", () => {
const result = runCli(["thread", "exec", "FAKE_THREAD_ID", "-c", "-1"]); expect(() => validateCount(-1)).toThrow("positive integer");
expect(result.exitCode).not.toBe(0);
expect(result.stderr).toContain("positive integer");
}); });
test("non-integer count fails with validation error", () => { test("non-integer count throws validation error", () => {
const result = runCli(["thread", "exec", "FAKE_THREAD_ID", "-c", "1.5"]); expect(() => validateCount(1.5)).toThrow("positive integer");
expect(result.exitCode).not.toBe(0);
expect(result.stderr).toContain("positive integer");
}); });
test("count=1 is the default (no -c flag)", () => { test("count=1 passes validation", () => {
// Without -c, it should attempt to run 1 step (failing on missing thread, not on count validation) expect(() => validateCount(1)).not.toThrow();
const result = runCli(["thread", "exec", "FAKE_THREAD_ID"]);
expect(result.exitCode).not.toBe(0);
// Should NOT contain "positive integer" error — should fail on thread lookup instead
expect(result.stderr).not.toContain("positive integer");
}); });
test("count=3 passes validation (fails on thread lookup)", () => { test("count=3 passes validation", () => {
const result = runCli(["thread", "exec", "FAKE_THREAD_ID", "-c", "3"]); expect(() => validateCount(3)).not.toThrow();
expect(result.exitCode).not.toBe(0);
// Should NOT contain "positive integer" error — should fail on thread/storage lookup
expect(result.stderr).not.toContain("positive integer");
}); });
}); });
@@ -17,7 +17,7 @@ function makeWorkflow(overrides?: Partial<WorkflowPayload>): WorkflowPayload {
frontmatter: { frontmatter: {
type: "object", type: "object",
properties: { properties: {
$status: { enum: ["_"] }, $status: { enum: ["done"] },
plan: { type: "string" }, plan: { type: "string" },
}, },
required: ["$status", "plan"], required: ["$status", "plan"],
@@ -52,7 +52,7 @@ function makeWorkflow(overrides?: Partial<WorkflowPayload>): WorkflowPayload {
}, },
graph: { graph: {
$START: { _: { role: "writer", prompt: "Begin writing", location: null } }, $START: { _: { role: "writer", prompt: "Begin writing", location: null } },
writer: { _: { role: "reviewer", prompt: "Review this: {{{plan}}}", location: null } }, writer: { done: { role: "reviewer", prompt: "Review this: {{{plan}}}", location: null } },
reviewer: { reviewer: {
approved: { role: "$END", prompt: "Done: {{{summary}}}", location: null }, approved: { role: "$END", prompt: "Done: {{{summary}}}", location: null },
rejected: { role: "writer", prompt: "Fix: {{{reason}}}", location: null }, rejected: { role: "writer", prompt: "Fix: {{{reason}}}", location: null },
@@ -82,7 +82,7 @@ describe("Suite 1: Role Reference Integrity", () => {
output: "None", output: "None",
frontmatter: { frontmatter: {
type: "object", type: "object",
properties: { $status: { enum: ["_"] } }, properties: { $status: { enum: ["done"] } },
required: ["$status"], required: ["$status"],
} as unknown as string, } as unknown as string,
}; };
@@ -173,11 +173,11 @@ describe("Suite 2: Graph Structure", () => {
output: "Isolated", output: "Isolated",
frontmatter: { frontmatter: {
type: "object", type: "object",
properties: { $status: { enum: ["_"] } }, properties: { $status: { enum: ["done"] } },
required: ["$status"], required: ["$status"],
} as unknown as string, } as unknown as string,
}; };
wf.graph.isolated = { _: { role: "$END", prompt: "done", location: null } }; wf.graph.isolated = { done: { role: "$END", prompt: "done", location: null } };
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect(errors.some((e) => e.includes('role "isolated" is not reachable from $START'))).toBe( expect(errors.some((e) => e.includes('role "isolated" is not reachable from $START'))).toBe(
true, true,
@@ -186,34 +186,34 @@ describe("Suite 2: Graph Structure", () => {
test("2.6 edge target references invalid role", () => { test("2.6 edge target references invalid role", () => {
const wf = makeWorkflow(); const wf = makeWorkflow();
wf.graph.writer = { _: { role: "ghost", prompt: "Go to ghost", location: null } }; wf.graph.writer = { done: { role: "ghost", prompt: "Go to ghost", location: null } };
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect(errors.some((e) => e.includes('unknown target role "ghost"'))).toBe(true); expect(errors.some((e) => e.includes('unknown target role "ghost"'))).toBe(true);
}); });
}); });
describe("Suite 3: Status-Edge Consistency", () => { describe("Suite 3: Status-Edge Consistency", () => {
test("3.1 single-exit role with multiple graph keys", () => { test("3.1 user role using _ graph key is rejected", () => {
const wf = makeWorkflow(); const wf = makeWorkflow();
wf.graph.writer = { wf.graph.writer = { _: { role: "reviewer", prompt: "Review", location: null } };
_: { role: "reviewer", prompt: "Review", location: null },
extra: { role: "$END", prompt: "Done", location: null },
};
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect( expect(
errors.some((e) => errors.some((e) =>
e.includes('role "writer" is single-exit but has status keys other than "_"'), e.includes('role "writer" must use explicit $status keys in graph, not "_"'),
), ),
).toBe(true); ).toBe(true);
}); });
test("3.2 single-exit role missing _ key", () => { test("3.2 user role graph key not matching $status enum", () => {
const wf = makeWorkflow(); const wf = makeWorkflow();
wf.graph.writer = { done: { role: "reviewer", prompt: "Review", location: null } }; wf.graph.writer = { wrong: { role: "reviewer", prompt: "Review", location: null } };
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect( expect(errors.some((e) => e.includes('role "writer" graph has extra status keys: wrong'))).toBe(
errors.some((e) => e.includes('role "writer" is single-exit but graph has no "_" key')), true,
).toBe(true); );
expect(errors.some((e) => e.includes('role "writer" graph is missing status keys: done'))).toBe(
true,
);
}); });
test("3.3 multi-exit role with extra statuses", () => { test("3.3 multi-exit role with extra statuses", () => {
@@ -244,9 +244,11 @@ describe("Suite 3: Status-Edge Consistency", () => {
const wf = makeWorkflow(); const wf = makeWorkflow();
wf.graph.reviewer = { _: { role: "$END", prompt: "Done", location: null } }; wf.graph.reviewer = { _: { role: "$END", prompt: "Done", location: null } };
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect(errors.some((e) => e.includes('role "reviewer" is multi-exit but graph uses "_"'))).toBe( expect(
true, errors.some((e) =>
); e.includes('role "reviewer" must use explicit $status keys in graph, not "_"'),
),
).toBe(true);
}); });
}); });
@@ -314,20 +316,20 @@ describe("Suite 3b: Enum-Based Multi-Exit", () => {
expect(errors.some((e) => e.includes("missing status keys: rejected"))).toBe(true); expect(errors.some((e) => e.includes("missing status keys: rejected"))).toBe(true);
}); });
test("3b.4 enum with single value (not multi-exit) treated as single-exit", () => { test("3b.4 enum with single explicit value passes", () => {
const wf = makeWorkflow(); const wf = makeWorkflow();
wf.roles.writer = { wf.roles.writer = {
...wf.roles.writer, ...wf.roles.writer,
frontmatter: { frontmatter: {
type: "object", type: "object",
properties: { properties: {
$status: { enum: ["_"] }, $status: { enum: ["ready"] },
plan: { type: "string" }, plan: { type: "string" },
}, },
required: ["$status", "plan"], required: ["$status", "plan"],
} as unknown as string, } as unknown as string,
}; };
wf.graph.writer = { _: { role: "reviewer", prompt: "Review: {{{plan}}}", location: null } }; wf.graph.writer = { ready: { role: "reviewer", prompt: "Review: {{{plan}}}", location: null } };
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect(errors).toEqual([]); expect(errors).toEqual([]);
}); });
@@ -355,13 +357,15 @@ describe("Suite 3b: Enum-Based Multi-Exit", () => {
}); });
describe("Suite 4: Mustache Template Variable Existence", () => { describe("Suite 4: Mustache Template Variable Existence", () => {
test("4.1 prompt references nonexistent variable (single-exit)", () => { test("4.1 prompt references nonexistent variable (enum status)", () => {
const wf = makeWorkflow(); const wf = makeWorkflow();
wf.graph.writer = { _: { role: "reviewer", prompt: "Review: {{{branch}}}", location: null } }; wf.graph.writer = {
done: { role: "reviewer", prompt: "Review: {{{branch}}}", location: null },
};
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect( expect(
errors.some((e) => errors.some(
e.includes('prompt variable "branch" not found in role "writer" frontmatter'), (e) => e.includes('prompt variable "branch"') && e.includes('role "writer" frontmatter'),
), ),
).toBe(true); ).toBe(true);
}); });
@@ -388,7 +392,7 @@ describe("Suite 4: Mustache Template Variable Existence", () => {
test("4.4 $status variable is always valid", () => { test("4.4 $status variable is always valid", () => {
const wf = makeWorkflow(); const wf = makeWorkflow();
wf.graph.writer = { _: { role: "reviewer", prompt: "Status: {{$status}}", location: null } }; wf.graph.writer = { done: { role: "reviewer", prompt: "Status: {{$status}}", location: null } };
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect(errors).toEqual([]); expect(errors).toEqual([]);
}); });
@@ -456,14 +460,14 @@ describe("Suite 6: Multiple Errors Collection", () => {
output: "None", output: "None",
frontmatter: { frontmatter: {
type: "object", type: "object",
properties: { $status: { enum: ["_"] } }, properties: { $status: { enum: ["done"] } },
required: ["$status"], required: ["$status"],
} as unknown as string, } as unknown as string,
}; };
// unknown graph reference // unknown graph reference
wf.graph.nonexistent = { _: { role: "$END", prompt: "done", location: null } }; wf.graph.nonexistent = { done: { role: "$END", prompt: "done", location: null } };
// bad mustache var // bad mustache var
wf.graph.writer = { _: { role: "reviewer", prompt: "{{{badvar}}}", location: null } }; wf.graph.writer = { done: { role: "reviewer", prompt: "{{{badvar}}}", location: null } };
const errors = validateWorkflow(wf); const errors = validateWorkflow(wf);
expect(errors.length).toBeGreaterThanOrEqual(3); expect(errors.length).toBeGreaterThanOrEqual(3);
}); });
@@ -31,7 +31,7 @@ function makeMinimalPayload(name: string, description: string): WorkflowPayload
frontmatter: { frontmatter: {
type: "object", type: "object",
properties: { properties: {
$status: { type: "string" }, $status: { type: "string", enum: ["done"] },
}, },
required: ["$status"], required: ["$status"],
} as unknown as CasRef, } as unknown as CasRef,
@@ -39,7 +39,7 @@ function makeMinimalPayload(name: string, description: string): WorkflowPayload
}, },
graph: { graph: {
$START: { _: { role: "worker", prompt: "start working", location: null } }, $START: { _: { role: "worker", prompt: "start working", location: null } },
worker: { _: { role: "$END", prompt: "done", location: null } }, worker: { done: { role: "$END", prompt: "done", location: null } },
}, },
}; };
} }
+12 -20
View File
@@ -5,14 +5,13 @@ import { Command } from "commander";
import { cmdConfigGet, cmdConfigList, cmdConfigSet } from "./commands/config.js"; import { cmdConfigGet, cmdConfigList, cmdConfigSet } from "./commands/config.js";
import { cmdLogClean, cmdLogList, cmdLogShow } from "./commands/log.js"; import { cmdLogClean, cmdLogList, cmdLogShow } from "./commands/log.js";
import { import {
cmdPromptAdapter, cmdPromptAdapterDeveloping,
cmdPromptAuthor,
cmdPromptBootstrap, cmdPromptBootstrap,
cmdPromptDeveloper,
cmdPromptList, cmdPromptList,
cmdPromptSetup, cmdPromptSetup,
cmdPromptUsage, cmdPromptUsage,
cmdPromptUser, cmdPromptUsageReference,
cmdPromptWorkflowAuthoring,
} from "./commands/prompt.js"; } from "./commands/prompt.js";
import { cmdSetup, cmdSetupInteractive } from "./commands/setup.js"; import { cmdSetup, cmdSetupInteractive } from "./commands/setup.js";
import { cmdStepFork, cmdStepList, cmdStepRead, cmdStepShow } from "./commands/step.js"; import { cmdStepFork, cmdStepList, cmdStepRead, cmdStepShow } from "./commands/step.js";
@@ -523,31 +522,24 @@ prompt
}); });
prompt prompt
.command("adapter") .command("usage-reference")
.description("Print the adapter reference (building agent adapters)") .description("Print the usage reference (CLI guide + typical workflows)")
.action(() => { .action(() => {
console.log(cmdPromptAdapter()); console.log(cmdPromptUsageReference());
}); });
prompt prompt
.command("author") .command("workflow-authoring")
.description("Print the author reference (workflow YAML design guide)") .description("Print the workflow authoring reference (YAML design guide)")
.action(() => { .action(() => {
console.log(cmdPromptAuthor()); console.log(cmdPromptWorkflowAuthoring());
}); });
prompt prompt
.command("developer") .command("adapter-developing")
.description("Print the developer reference (coding conventions + architecture)") .description("Print the adapter developing reference (building agent adapters)")
.action(() => { .action(() => {
console.log(cmdPromptDeveloper()); console.log(cmdPromptAdapterDeveloping());
});
prompt
.command("user")
.description("Print the user reference (CLI guide + typical workflows)")
.action(() => {
console.log(cmdPromptUser());
}); });
prompt prompt
+23 -43
View File
@@ -1,24 +1,21 @@
import { import {
generateAdapterReference, generateAdapterDevelopingReference,
generateAuthorReference,
generateBootstrapReference, generateBootstrapReference,
generateDeveloperReference, generateUsageReference,
generateUserReference, generateWorkflowAuthoringReference,
} from "@united-workforce/util"; } from "@united-workforce/util";
export { export {
generateAdapterReference as cmdPromptAdapter, generateAdapterDevelopingReference as cmdPromptAdapterDeveloping,
generateAuthorReference as cmdPromptAuthor,
generateBootstrapReference as cmdPromptBootstrap, generateBootstrapReference as cmdPromptBootstrap,
generateDeveloperReference as cmdPromptDeveloper, generateUsageReference as cmdPromptUsageReference,
generateUserReference as cmdPromptUser, generateWorkflowAuthoringReference as cmdPromptWorkflowAuthoring,
}; };
const PROMPT_ENTRIES: ReadonlyArray<{ name: string; generate: () => string }> = [ const PROMPT_ENTRIES: ReadonlyArray<{ name: string; generate: () => string }> = [
{ name: "user", generate: generateUserReference }, { name: "usage", generate: generateUsageReference },
{ name: "author", generate: generateAuthorReference }, { name: "workflow-authoring", generate: generateWorkflowAuthoringReference },
{ name: "developer", generate: generateDeveloperReference }, { name: "adapter-developing", generate: generateAdapterDevelopingReference },
{ name: "adapter", generate: generateAdapterReference },
{ name: "bootstrap", generate: generateBootstrapReference }, { name: "bootstrap", generate: generateBootstrapReference },
]; ];
@@ -27,7 +24,9 @@ export function cmdPromptList(): ReadonlyArray<string> {
} }
export function cmdPromptUsage(): string { export function cmdPromptUsage(): string {
return PROMPT_ENTRIES.map((e) => e.generate()).join("\n\n---\n\n"); return PROMPT_ENTRIES.filter((e) => e.name !== "bootstrap")
.map((e) => e.generate())
.join("\n\n---\n\n");
} }
export function cmdPromptSetup(): string { export function cmdPromptSetup(): string {
@@ -42,7 +41,7 @@ so that you know how to use the \`uwf\` CLI for workflow orchestration.
\`\`\`bash \`\`\`bash
uwf --help uwf --help
\`\`\` \`\`\`
If not installed: \`bun add -g @united-workforce/cli\` If not installed: \`npm install -g @united-workforce/cli\`
2. **Clean up old versions of the skill** (if any exist): 2. **Clean up old versions of the skill** (if any exist):
- Look for any existing uwf/workflow skill files in your skill directories - Look for any existing uwf/workflow skill files in your skill directories
@@ -51,43 +50,24 @@ so that you know how to use the \`uwf\` CLI for workflow orchestration.
- If the version matches the current CLI version (\`uwf --version\`), you're done — no update needed. - If the version matches the current CLI version (\`uwf --version\`), you're done — no update needed.
- If the version is older, **delete the old skill file** before proceeding. - If the version is older, **delete the old skill file** before proceeding.
3. **Get the usage skill content:** 3. **Install the bootstrap skill:**
\`\`\`bash \`\`\`bash
uwf prompt usage uwf prompt bootstrap > ~/.hermes/skills/uwf/SKILL.md
\`\`\` \`\`\`
The bootstrap prompt already includes complete YAML frontmatter no editing needed.
4. **Create a SKILL.md file** in your skill directory (e.g. \`~/.hermes/skills/uwf/SKILL.md\`). 4. **Verify** the skill is loadable by your agent framework.
The output of \`uwf prompt usage\` is the skill body WITHOUT frontmatter.
You must prepend the following YAML frontmatter:
\`\`\`yaml
---
name: uwf
description: "Use when orchestrating multi-agent workflows with uwf (United Workforce). Covers CLI usage, workflow YAML authoring, moderator logic, and agent development."
version: <VERSION>
author: United Workforce
license: MIT
metadata:
hermes:
tags: [workflow, agents, orchestration, cli, yaml]
---
\`\`\`
Replace \`<VERSION>\` with the installed CLI version (from \`uwf --version\`).
5. **Verify** the skill is loadable by your agent framework.
## Individual prompts ## Individual prompts
You can also get individual reference sections: Each prompt outputs a complete SKILL.md with frontmatter pipe directly to a file:
\`\`\`bash \`\`\`bash
uwf prompt list # list available prompt names uwf prompt list # list available prompt names
uwf prompt user # user reference (CLI guide + typical workflows) uwf prompt usage > ~/.hermes/skills/uwf-usage/SKILL.md # CLI usage guide
uwf prompt author # author reference (workflow YAML design guide) uwf prompt workflow-authoring > ~/.hermes/skills/uwf-workflow-authoring/SKILL.md
uwf prompt developer # developer reference (coding conventions + architecture) uwf prompt adapter-developing > ~/.hermes/skills/uwf-adapter-developing/SKILL.md
uwf prompt adapter # adapter reference (building agent adapters) uwf prompt bootstrap > ~/.hermes/skills/uwf/SKILL.md # bootstrap skill
uwf prompt bootstrap # bootstrap skill YAML for Hermes agents
\`\`\` \`\`\`
## Notes ## Notes
+1
View File
@@ -66,6 +66,7 @@ export async function cmdStepList(
agent: item.payload.agent, agent: item.payload.agent,
timestamp: item.timestamp, timestamp: item.timestamp,
durationMs: item.payload.completedAtMs - item.payload.startedAtMs, durationMs: item.payload.completedAtMs - item.payload.startedAtMs,
usage: item.payload.usage ?? null,
}); });
} }
+13 -3
View File
@@ -961,6 +961,12 @@ function resolveAgentConfig(
agentOverride: string | null, agentOverride: string | null,
): AgentConfig { ): AgentConfig {
if (agentOverride !== null) { if (agentOverride !== null) {
// Try config alias first (e.g. "hermes" → config.agents.hermes),
// then fall back to raw command name (e.g. "uwf-hermes" or "/usr/bin/agent").
const fromAlias = config.agents[agentOverride as AgentAlias];
if (fromAlias !== undefined) {
return fromAlias;
}
return parseAgentOverride(agentOverride); return parseAgentOverride(agentOverride);
} }
@@ -1128,6 +1134,12 @@ export async function cmdThreadResume(
}); });
} }
export function validateCount(count: number): void {
if (count < 1 || !Number.isInteger(count)) {
throw new Error(`--count must be a positive integer, got: ${count}`);
}
}
export async function cmdThreadExec( export async function cmdThreadExec(
storageRoot: string, storageRoot: string,
threadId: ThreadId, threadId: ThreadId,
@@ -1136,9 +1148,7 @@ export async function cmdThreadExec(
background: boolean, background: boolean,
backgroundWorker: boolean, backgroundWorker: boolean,
): Promise<StepOutput[]> { ): Promise<StepOutput[]> {
if (count < 1 || !Number.isInteger(count)) { validateCount(count);
fail(`--count must be a positive integer, got: ${count}`);
}
// Check if thread is already running in background (unless we ARE the background worker) // Check if thread is already running in background (unless we ARE the background worker)
if (!backgroundWorker) { if (!backgroundWorker) {
+13 -7
View File
@@ -8,7 +8,8 @@ mustache.escape = (text: string) => text;
const START_ROLE = "$START"; const START_ROLE = "$START";
const SUSPEND_ROLE = "$SUSPEND"; const SUSPEND_ROLE = "$SUSPEND";
const UNIT_STATUS = "_"; // $START is a special entry node with no agent output — it always uses this key.
const START_STATUS = "_";
type LastOutput = Record<string, unknown>; type LastOutput = Record<string, unknown>;
@@ -19,12 +20,17 @@ export function evaluate(
lastRole: string, lastRole: string,
lastOutput: LastOutput, lastOutput: LastOutput,
): Result<EvaluateResult, Error> { ): Result<EvaluateResult, Error> {
const status = let status: string;
lastRole === START_ROLE if (lastRole === START_ROLE) {
? UNIT_STATUS status = START_STATUS;
: typeof lastOutput[STATUS_KEY] === "string" } else if (typeof lastOutput[STATUS_KEY] === "string") {
? (lastOutput[STATUS_KEY] as string) status = lastOutput[STATUS_KEY] as string;
: UNIT_STATUS; } else {
return {
ok: false,
error: new Error(`agent output for role "${lastRole}" is missing required "$status" string`),
};
}
const roleTargets = graph[lastRole]; const roleTargets = graph[lastRole];
if (roleTargets === undefined) { if (roleTargets === undefined) {
+19 -46
View File
@@ -24,17 +24,13 @@ function isOneOfSchema(fm: unknown): fm is SchemaObj & { oneOf: SchemaObj[] } {
return Array.isArray(obj.oneOf); return Array.isArray(obj.oneOf);
} }
/** Check if a frontmatter schema uses enum-based multi-exit ($status with multiple enum values). */ /** Check if a frontmatter schema declares "$status" as an enum (the required form for user roles). */
function isEnumMultiExit(fm: unknown): boolean { function hasStatusEnum(fm: unknown): boolean {
if (typeof fm !== "object" || fm === null) return false; if (typeof fm !== "object" || fm === null) return false;
const obj = fm as SchemaObj; const obj = fm as SchemaObj;
const props = obj.properties as Record<string, SchemaObj> | undefined; const props = obj.properties as Record<string, SchemaObj> | undefined;
if (!props?.$status) return false; if (!props?.$status) return false;
const statusDef = props.$status; return Array.isArray(props.$status.enum);
if (!Array.isArray(statusDef.enum)) return false;
// Filter out "_" (wildcard) — if remaining values > 1, it's multi-exit
const statuses = (statusDef.enum as string[]).filter((s) => s !== "_");
return statuses.length > 1;
} }
/** Extract status values from an enum-based $status field. */ /** Extract status values from an enum-based $status field. */
@@ -43,7 +39,7 @@ function getEnumStatuses(fm: SchemaObj): string[] {
if (!props?.$status) return []; if (!props?.$status) return [];
const statusDef = props.$status; const statusDef = props.$status;
if (!Array.isArray(statusDef.enum)) return []; if (!Array.isArray(statusDef.enum)) return [];
return (statusDef.enum as string[]).filter((s) => s !== "_"); return statusDef.enum as string[];
} }
/** Get property names from a schema object. */ /** Get property names from a schema object. */
@@ -194,15 +190,19 @@ function checkOneOfDiscriminant(
} }
} }
/** Check status-edge consistency for a multi-exit role. */ /** Check status-edge consistency for a user role. "_" is reserved for $START and rejected here. */
function checkMultiExitEdges( function checkStatusEdges(
roleName: string, roleName: string,
graphKeys: Set<string>, graphKeys: Set<string>,
statusSet: Set<string>, statusSet: Set<string>,
errors: string[], errors: string[],
): void { ): void {
if (graphKeys.has("_")) { if (graphKeys.has("_")) {
errors.push(`role "${roleName}" is multi-exit but graph uses "_"`); errors.push(`role "${roleName}" must use explicit $status keys in graph, not "_"`);
return;
}
if (statusSet.has("_")) {
errors.push(`role "${roleName}" $status enum must use explicit values, not "_"`);
return; return;
} }
@@ -255,50 +255,23 @@ function checkRoleConsistency(payload: WorkflowPayload, errors: string[]): void
const statuses = getOneOfStatuses(variants); const statuses = getOneOfStatuses(variants);
checkOneOfDiscriminant(roleName, variants, statuses, errors); checkOneOfDiscriminant(roleName, variants, statuses, errors);
checkMultiExitEdges(roleName, graphKeys, new Set(statuses), errors); checkStatusEdges(roleName, graphKeys, new Set(statuses), errors);
checkMultiExitMustache(roleName, graphEntry, variants, errors); checkMultiExitMustache(roleName, graphEntry, variants, errors);
} else if (isEnumMultiExit(fm)) { } else if (hasStatusEnum(fm)) {
const statuses = getEnumStatuses(fm as SchemaObj); const statuses = getEnumStatuses(fm as SchemaObj);
checkMultiExitEdges(roleName, graphKeys, new Set(statuses), errors); checkStatusEdges(roleName, graphKeys, new Set(statuses), errors);
// For enum-based schemas, mustache vars come from the flat properties // For enum-based schemas, mustache vars come from the flat properties
checkSingleExitMustache(roleName, graphEntry, fm as SchemaObj, errors); checkEnumMustache(roleName, graphEntry, fm as SchemaObj, errors);
} else { } else {
checkSingleExitRole(roleName, graphKeys, graphEntry, fm as SchemaObj | null, errors); errors.push(
} `role "${roleName}" must define "$status" as an enum (or oneOf const) in frontmatter`,
} );
}
/** Check single-exit role status and mustache. */
function checkSingleExitRole(
roleName: string,
graphKeys: Set<string>,
graphEntry: Record<string, { role: string; prompt: string }>,
fm: SchemaObj | null,
errors: string[],
): void {
if (graphKeys.size > 1 || (graphKeys.size === 1 && !graphKeys.has("_"))) {
if (!graphKeys.has("_")) {
errors.push(`role "${roleName}" is single-exit but graph has no "_" key`);
} else {
errors.push(`role "${roleName}" is single-exit but has status keys other than "_"`);
}
}
const singleTarget = graphEntry._;
if (!singleTarget) return;
const vars = extractMustacheVars(singleTarget.prompt);
const propNames = fm ? getPropertyNames(fm) : new Set<string>();
for (const v of vars) {
if (v === "$status") continue;
if (!propNames.has(v)) {
errors.push(`prompt variable "${v}" not found in role "${roleName}" frontmatter`);
} }
} }
} }
/** Check mustache vars in all edge prompts against flat schema properties. */ /** Check mustache vars in all edge prompts against flat schema properties. */
function checkSingleExitMustache( function checkEnumMustache(
roleName: string, roleName: string,
graphEntry: Record<string, { role: string; prompt: string }>, graphEntry: Record<string, { role: string; prompt: string }>,
fm: SchemaObj, fm: SchemaObj,
+12 -3
View File
@@ -57,9 +57,18 @@ function isGraph(value: unknown): boolean {
if (!isRecord(value)) { if (!isRecord(value)) {
return false; return false;
} }
return Object.values(value).every( return Object.entries(value).every(([node, statusMap]) => {
(statusMap) => isRecord(statusMap) && Object.values(statusMap).every((t) => isTarget(t)), if (!isRecord(statusMap)) {
); return false;
}
return Object.entries(statusMap).every(([status, target]) => {
// "_" is only valid as a status key for the $START entry node.
if (status === "_" && node !== "$START") {
return false;
}
return isTarget(target);
});
});
} }
/** /**
+1 -1
View File
@@ -1,6 +1,6 @@
{ {
"name": "@united-workforce/dashboard", "name": "@united-workforce/dashboard",
"version": "0.5.0-alpha.4", "version": "0.1.0",
"private": true, "private": true,
"type": "module", "type": "module",
"scripts": { "scripts": {
+9
View File
@@ -0,0 +1,9 @@
# @united-workforce/eval
## 0.1.2
### Patch Changes
- 850a3b2: fix: resolve --agent override via config alias before raw command
`resolveAgentConfig()` now checks `config.agents[alias]` first before falling back to `parseAgentOverride()`. Eval CLI default `--agent` changed from `"hermes"` to `"uwf-hermes"`.
@@ -0,0 +1,219 @@
import type { StepEntry } from "@united-workforce/protocol";
import { beforeEach, describe, expect, test, vi } from "vitest";
import {
runFrontmatterJudge,
runHallucinationJudge,
runTokenStatsJudge,
runUpstreamJudge,
} from "../src/judge/builtin/index.js";
// Mock the shared read-steps helper so the judges never shell out to `uwf`.
vi.mock("../src/judge/builtin/read-steps.js", () => ({
readThreadSteps: vi.fn(),
}));
import { readThreadSteps } from "../src/judge/builtin/read-steps.js";
const mockedReadSteps = vi.mocked(readThreadSteps);
function makeStep(overrides: Partial<StepEntry>): StepEntry {
return {
hash: "HASH000000000",
role: "worker",
output: "---\n$status: done\n---\n\nbody",
detail: "DETAIL0000000",
agent: "hermes",
timestamp: 0,
durationMs: 0,
usage: null,
...overrides,
};
}
beforeEach(() => {
mockedReadSteps.mockReset();
});
describe("frontmatter-compliance judge", () => {
test("all steps have valid frontmatter → score 1.0", async () => {
mockedReadSteps.mockReturnValue([
makeStep({ role: "a", output: "---\n$status: done\n---\n\nwork" }),
makeStep({ role: "b", output: "---\n$status: needs_input\n---\nmore" }),
]);
const result = await runFrontmatterJudge("T1");
const data = result.data as { stepsTotal: number; stepsValid: number; invalidSteps: unknown[] };
expect(result.score).toBe(1.0);
expect(data.stepsTotal).toBe(2);
expect(data.stepsValid).toBe(2);
expect(data.invalidSteps).toHaveLength(0);
});
test("some steps missing $status → partial score", async () => {
mockedReadSteps.mockReturnValue([
makeStep({ role: "a", output: "---\n$status: done\n---\nok" }),
makeStep({ role: "b", output: "---\nfoo: bar\n---\nmissing status" }),
makeStep({ role: "c", output: "no frontmatter at all" }),
]);
const result = await runFrontmatterJudge("T2");
const data = result.data as {
stepsTotal: number;
stepsValid: number;
invalidSteps: Array<{ stepIndex: number; role: string; errors: string[] }>;
};
expect(result.score).toBeCloseTo(1 / 3, 10);
expect(data.stepsTotal).toBe(3);
expect(data.stepsValid).toBe(1);
expect(data.invalidSteps).toHaveLength(2);
expect(data.invalidSteps[0]).toMatchObject({ stepIndex: 1, role: "b" });
expect(data.invalidSteps[1]).toMatchObject({ stepIndex: 2, role: "c" });
});
test("no steps → score 0 (0/0 edge case)", async () => {
mockedReadSteps.mockReturnValue([]);
const result = await runFrontmatterJudge("T3");
const data = result.data as { stepsTotal: number; stepsValid: number; invalidSteps: unknown[] };
expect(result.score).toBe(0);
expect(data.stepsTotal).toBe(0);
expect(data.stepsValid).toBe(0);
expect(data.invalidSteps).toHaveLength(0);
});
test("empty-string $status counts as invalid", async () => {
mockedReadSteps.mockReturnValue([makeStep({ role: "a", output: '---\n$status: ""\n---\nx' })]);
const result = await runFrontmatterJudge("T4");
expect(result.score).toBe(0);
});
test("parsed object output with $status → score 1.0", async () => {
mockedReadSteps.mockReturnValue([
makeStep({ role: "a", output: { $status: "done", summary: "fixed" } as unknown as string }),
makeStep({ role: "b", output: { $status: "reviewed" } as unknown as string }),
]);
const result = await runFrontmatterJudge("T5");
const data = result.data as { stepsTotal: number; stepsValid: number; invalidSteps: unknown[] };
expect(result.score).toBe(1.0);
expect(data.stepsTotal).toBe(2);
expect(data.stepsValid).toBe(2);
});
test("parsed object output missing $status → score 0", async () => {
mockedReadSteps.mockReturnValue([
makeStep({ role: "a", output: { summary: "no status field" } as unknown as string }),
]);
const result = await runFrontmatterJudge("T6");
expect(result.score).toBe(0);
});
});
describe("token-stats judge", () => {
test("steps with usage → sums correctly", async () => {
mockedReadSteps.mockReturnValue([
makeStep({
role: "a",
usage: { turns: 2, inputTokens: 100, outputTokens: 50, duration: 1.5 },
}),
makeStep({
role: "b",
usage: { turns: 3, inputTokens: 200, outputTokens: 75, duration: 2.0 },
}),
]);
const result = await runTokenStatsJudge("T1");
const data = result.data as {
totalInput: number;
totalOutput: number;
totalTurns: number;
perStep: Array<{ role: string; inputTokens: number; outputTokens: number; turns: number }>;
};
expect(result.score).toBe(1.0);
expect(data.totalInput).toBe(300);
expect(data.totalOutput).toBe(125);
expect(data.totalTurns).toBe(5);
expect(data.perStep).toHaveLength(2);
expect(data.perStep[0]).toEqual({
role: "a",
inputTokens: 100,
outputTokens: 50,
turns: 2,
duration: 1.5,
});
});
test("steps with null usage → zeros", async () => {
mockedReadSteps.mockReturnValue([
makeStep({ role: "a", usage: null }),
makeStep({ role: "b", usage: null }),
]);
const result = await runTokenStatsJudge("T2");
const data = result.data as {
totalInput: number;
totalOutput: number;
totalTurns: number;
perStep: Array<{
inputTokens: number;
outputTokens: number;
turns: number;
duration: number;
}>;
};
expect(result.score).toBe(1.0);
expect(data.totalInput).toBe(0);
expect(data.totalOutput).toBe(0);
expect(data.totalTurns).toBe(0);
expect(data.perStep[0]).toEqual({
role: "a",
inputTokens: 0,
outputTokens: 0,
turns: 0,
duration: 0,
});
});
test("empty steps → all zeros, score 1.0", async () => {
mockedReadSteps.mockReturnValue([]);
const result = await runTokenStatsJudge("T3");
const data = result.data as {
totalInput: number;
totalOutput: number;
totalTurns: number;
perStep: unknown[];
};
expect(result.score).toBe(1.0);
expect(data.totalInput).toBe(0);
expect(data.totalOutput).toBe(0);
expect(data.totalTurns).toBe(0);
expect(data.perStep).toHaveLength(0);
});
});
describe("LLM-as-judge stubs", () => {
test("upstream-consumption returns a stub", async () => {
const result = await runUpstreamJudge("T1");
expect(result.score).toBe(0);
expect(result.data).toEqual({ perStep: [] });
expect(result.schema.title).toBe("@uwf/eval-judge-upstream");
});
test("hallucination returns a stub", async () => {
const result = await runHallucinationJudge("T1");
expect(result.score).toBe(0);
expect(result.data).toEqual({ perStep: [] });
expect(result.schema.title).toBe("@uwf/eval-judge-hallucination");
});
});
+152
View File
@@ -0,0 +1,152 @@
import { bootstrap, createMemoryStore } from "@ocas/core";
import { describe, expect, test } from "vitest";
import type { JudgeRunner } from "../src/runner/index.js";
import { collect, computeOverall } from "../src/runner/index.js";
import type { EvalRunConfig, EvalStore } from "../src/storage/index.js";
import type { JudgeEntry, TaskManifest } from "../src/task/index.js";
function makeJudge(name: string, weight: number, builtin: boolean): JudgeEntry {
return {
name,
weight,
builtin,
entry: builtin ? null : `dist/judges/${name}.js`,
schema: null,
};
}
function makeManifest(judges: JudgeEntry[]): TaskManifest {
return {
name: "fix-off-by-one",
description: "test task",
workflow: "solve-issue",
prompt: "Fix the bug",
limits: { maxSteps: 10, timeoutMinutes: 30 },
judges,
};
}
function makeEvalStore(): EvalStore {
const store = createMemoryStore();
bootstrap(store);
return { store, varStore: store.var };
}
const CONFIG: EvalRunConfig = {
agent: "hermes",
model: "claude-sonnet-4",
engineVersion: "test",
};
/** Returns a fixed score per judge name. */
function scriptedRunner(scores: Record<string, number>): JudgeRunner {
return async (_taskDir, _workDir, _threadId, judge) => ({
score: scores[judge.name] ?? 0,
data: { judged: judge.name },
schema: { type: "object" },
});
}
describe("computeOverall", () => {
test("computes the weighted average correctly", () => {
const overall = computeOverall([
{ score: 0.8, weight: 0.3 },
{ score: 0.6, weight: 0.3 },
{ score: 1.0, weight: 0.4 },
]);
// 0.24 + 0.18 + 0.4 = 0.82
expect(overall).toBeCloseTo(0.82, 10);
});
test("a weight-0 judge does not affect the result", () => {
const withInformational = computeOverall([
{ score: 1.0, weight: 1.0 },
{ score: 0.0, weight: 0.0 },
]);
expect(withInformational).toBe(1.0);
});
test("returns 0 when total weight is 0", () => {
expect(computeOverall([{ score: 0.5, weight: 0 }])).toBe(0);
});
});
describe("collect", () => {
test("computes weighted score correctly across judges", async () => {
const evalStore = makeEvalStore();
const manifest = makeManifest([
makeJudge("test-pass", 0.6, false),
makeJudge("code-quality", 0.4, false),
]);
const runJudge = scriptedRunner({ "test-pass": 1.0, "code-quality": 0.5 });
const result = await collect(
{
evalStore,
taskDir: "/tmp/task",
workDir: "/tmp/work",
threadId: "THREAD123",
manifest,
config: CONFIG,
},
runJudge,
);
// 1.0 * 0.6 + 0.5 * 0.4 = 0.8
expect(result.overall).toBeCloseTo(0.8, 10);
expect(result.runHash).toBeTruthy();
expect(result.judges).toHaveLength(2);
expect(result.judges[0]).toEqual({ name: "test-pass", score: 1.0, weight: 0.6 });
const latest = evalStore.varStore.list({
exactName: "@uwf/eval/fix-off-by-one/latest",
});
expect(latest[0]?.value).toBe(result.runHash);
});
test("handles a judge with weight 0 (informational)", async () => {
const evalStore = makeEvalStore();
const manifest = makeManifest([
makeJudge("test-pass", 1.0, false),
makeJudge("token-stats", 0, true),
]);
// token-stats is builtin → default runner would score 0; give scripted score
// that would skew the result if it were counted.
const runJudge = scriptedRunner({ "test-pass": 0.5, "token-stats": 1.0 });
const result = await collect(
{
evalStore,
taskDir: "/tmp/task",
workDir: "/tmp/work",
threadId: "THREAD123",
manifest,
config: CONFIG,
},
runJudge,
);
// Only test-pass (weight 1.0) counts → overall = 0.5
expect(result.overall).toBeCloseTo(0.5, 10);
expect(result.judges).toHaveLength(2);
const tokenStats = result.judges.find((j) => j.name === "token-stats");
expect(tokenStats?.weight).toBe(0);
});
test("unknown builtin judge name throws via the default runner", async () => {
const evalStore = makeEvalStore();
const manifest = makeManifest([makeJudge("not-a-real-judge", 1.0, true)]);
// Use the default runner (no injected runner) → builtin dispatch → unknown name throws.
await expect(
collect({
evalStore,
taskDir: "/tmp/task",
workDir: "/tmp/work",
threadId: "THREAD123",
manifest,
config: CONFIG,
}),
).rejects.toThrow(/unknown builtin judge/);
});
});
+171
View File
@@ -0,0 +1,171 @@
import { bootstrap, createMemoryStore, putSchema } from "@ocas/core";
import type { CasRef } from "@united-workforce/protocol";
import { describe, expect, test } from "vitest";
import {
formatDiff,
formatList,
formatReport,
readEvalEntries,
readEvalRun,
selectEntries,
} from "../src/commands/index.js";
import type { EvalRunPayload, EvalStore } from "../src/storage/index.js";
import { EVAL_RUN_SCHEMA, setEvalLatest } from "../src/storage/index.js";
function makeEvalStore(): EvalStore {
const store = createMemoryStore();
bootstrap(store);
return { store, varStore: store.var };
}
function makePayload(
task: string,
overall: number,
timestamp: number,
judges: EvalRunPayload["judges"] = [
{
name: "frontmatter-compliance",
score: 1.0,
weight: 0.6,
dataHash: "AAAAAAAAAAAAA" as CasRef,
},
{ name: "token-stats", score: 0.5, weight: 0, dataHash: "BBBBBBBBBBBBB" as CasRef },
],
config: EvalRunPayload["config"] = {
agent: "hermes",
model: "claude-sonnet-4",
engineVersion: "1.0.0",
},
): EvalRunPayload {
return { task, config, threadId: "THREAD0123456789", judges, overall, timestamp };
}
/** Store an eval-run node in CAS and index it under @uwf/eval/<task>/latest. */
function storeRun(evalStore: EvalStore, payload: EvalRunPayload): string {
const schemaHash = putSchema(evalStore.store, EVAL_RUN_SCHEMA);
const hash = evalStore.store.cas.put(schemaHash, payload);
setEvalLatest(evalStore.varStore, payload.task, hash);
return hash;
}
describe("formatReport", () => {
test("includes task, overall, config and judges", () => {
const payload = makePayload("fix-off-by-one", 0.8, Date.UTC(2026, 0, 2, 3, 4, 5));
const output = formatReport(payload, "RUNHASH123456");
expect(output).toContain("fix-off-by-one");
expect(output).toContain("0.8000");
expect(output).toContain("hermes");
expect(output).toContain("claude-sonnet-4");
expect(output).toContain("1.0.0");
expect(output).toContain("frontmatter-compliance");
expect(output).toContain("token-stats");
expect(output).toContain("THREAD0123456789");
expect(output).toContain("RUNHASH123456");
});
test("round-trips a stored run via readEvalRun", () => {
const evalStore = makeEvalStore();
const payload = makePayload("fix-off-by-one", 0.75, Date.now());
const hash = storeRun(evalStore, payload);
const loaded = readEvalRun(evalStore, hash);
expect(loaded).not.toBeNull();
const output = formatReport(loaded as EvalRunPayload, hash);
expect(output).toContain("fix-off-by-one");
expect(output).toContain("0.7500");
});
test("readEvalRun returns null for a missing hash", () => {
const evalStore = makeEvalStore();
expect(readEvalRun(evalStore, "NOPENOPENOPE0")).toBeNull();
});
});
describe("list", () => {
test("lists eval runs stored under different tasks", () => {
const evalStore = makeEvalStore();
storeRun(evalStore, makePayload("fix-off-by-one", 0.8, 2000));
storeRun(evalStore, makePayload("write-docs", 0.6, 1000));
const entries = readEvalEntries(evalStore);
expect(entries).toHaveLength(2);
const output = formatList(selectEntries(entries, null, 20));
expect(output).toContain("fix-off-by-one");
expect(output).toContain("write-docs");
});
test("sorts newest-first by timestamp", () => {
const evalStore = makeEvalStore();
storeRun(evalStore, makePayload("old-task", 0.5, 1000));
storeRun(evalStore, makePayload("new-task", 0.5, 2000));
const selected = selectEntries(readEvalEntries(evalStore), null, 20);
expect(selected[0]?.task).toBe("new-task");
expect(selected[1]?.task).toBe("old-task");
});
test("--task filter only shows the matching task", () => {
const evalStore = makeEvalStore();
storeRun(evalStore, makePayload("fix-off-by-one", 0.8, 2000));
storeRun(evalStore, makePayload("write-docs", 0.6, 1000));
const output = formatList(selectEntries(readEvalEntries(evalStore), "write-docs", 20));
expect(output).toContain("write-docs");
expect(output).not.toContain("fix-off-by-one");
});
test("--limit caps the number of rows", () => {
const evalStore = makeEvalStore();
storeRun(evalStore, makePayload("task-a", 0.8, 3000));
storeRun(evalStore, makePayload("task-b", 0.6, 2000));
storeRun(evalStore, makePayload("task-c", 0.4, 1000));
const selected = selectEntries(readEvalEntries(evalStore), null, 2);
expect(selected).toHaveLength(2);
expect(selected.map((e) => e.task)).toEqual(["task-a", "task-b"]);
});
test("empty store renders a placeholder", () => {
const evalStore = makeEvalStore();
const output = formatList(selectEntries(readEvalEntries(evalStore), null, 20));
expect(output).toContain("(no eval runs found)");
});
});
describe("formatDiff", () => {
test("shows an upward delta when B scores higher", () => {
const a = makePayload("fix-off-by-one", 0.6, 1000);
const b = makePayload("fix-off-by-one", 0.8, 2000);
const output = formatDiff(a, "HASHA00000000", b, "HASHB00000000");
expect(output).toContain("▲");
expect(output).toContain("HASHA00000000");
expect(output).toContain("HASHB00000000");
});
test("shows a downward delta when B scores lower", () => {
const a = makePayload("fix-off-by-one", 0.9, 1000);
const b = makePayload("fix-off-by-one", 0.4, 2000);
const output = formatDiff(a, "HASHA00000000", b, "HASHB00000000");
expect(output).toContain("▼");
});
test("marks differing config values", () => {
const a = makePayload("fix-off-by-one", 0.6, 1000, undefined, {
agent: "hermes",
model: "claude-sonnet-4",
engineVersion: "1.0.0",
});
const b = makePayload("fix-off-by-one", 0.6, 2000, undefined, {
agent: "claude-code",
model: "claude-sonnet-4",
engineVersion: "1.0.0",
});
const output = formatDiff(a, "HASHA00000000", b, "HASHB00000000");
expect(output).toContain("≠");
expect(output).toContain("claude-code");
});
});
+74
View File
@@ -0,0 +1,74 @@
import { mkdir, mkdtemp, readFile, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { afterEach, beforeEach, describe, expect, test } from "vitest";
import { prepare } from "../src/runner/index.js";
const TASK_YAML = `
name: fix-off-by-one
description: Fix an off-by-one error
workflow: solve-issue
prompt: "Fix the bug"
limits:
maxSteps: 12
timeoutMinutes: 20
judges:
- name: frontmatter-compliance
weight: 0.5
builtin: true
- name: test-pass
weight: 0.5
entry: dist/judges/test-pass.js
`;
let taskDir: string;
beforeEach(async () => {
taskDir = await mkdtemp(join(tmpdir(), "uwf-eval-task-"));
await writeFile(join(taskDir, "task.yaml"), TASK_YAML, "utf8");
const fixtureDir = join(taskDir, "fixture");
await mkdir(join(fixtureDir, "src"), { recursive: true });
await writeFile(join(fixtureDir, "src", "calc.ts"), "export const add = (a, b) => a + b + 1;\n");
await writeFile(join(fixtureDir, "package.json"), '{ "name": "fixture" }\n');
});
afterEach(async () => {
await rm(taskDir, { recursive: true, force: true });
});
describe("prepare", () => {
test("returns the parsed manifest", async () => {
const result = await prepare(taskDir);
expect(result.taskDir).toBe(taskDir);
expect(result.manifest.name).toBe("fix-off-by-one");
expect(result.manifest.workflow).toBe("solve-issue");
expect(result.manifest.limits.maxSteps).toBe(12);
expect(result.manifest.judges).toHaveLength(2);
});
test("copies fixture into a fresh temp work dir", async () => {
const result = await prepare(taskDir);
expect(result.workDir).not.toBe(taskDir);
expect(result.workDir.startsWith(tmpdir())).toBe(true);
const calc = await readFile(join(result.workDir, "src", "calc.ts"), "utf8");
expect(calc).toContain("export const add");
const pkg = await readFile(join(result.workDir, "package.json"), "utf8");
expect(pkg).toContain("fixture");
await rm(result.workDir, { recursive: true, force: true });
});
test("creates an empty work dir when no fixture/ exists", async () => {
const noFixtureDir = await mkdtemp(join(tmpdir(), "uwf-eval-nofix-"));
await writeFile(join(noFixtureDir, "task.yaml"), TASK_YAML, "utf8");
const result = await prepare(noFixtureDir);
expect(result.workDir.startsWith(tmpdir())).toBe(true);
await rm(noFixtureDir, { recursive: true, force: true });
await rm(result.workDir, { recursive: true, force: true });
});
});
+63
View File
@@ -0,0 +1,63 @@
import { describe, expect, test } from "vitest";
import {
EVAL_JUDGE_FRONTMATTER_SCHEMA,
EVAL_JUDGE_HALLUCINATION_SCHEMA,
EVAL_JUDGE_TOKEN_STATS_SCHEMA,
EVAL_JUDGE_UPSTREAM_SCHEMA,
EVAL_RUN_SCHEMA,
} from "../src/storage/index.js";
describe("OCAS schema definitions", () => {
test("eval-run schema has correct title and required fields", () => {
expect(EVAL_RUN_SCHEMA.title).toBe("@uwf/eval-run");
const required = EVAL_RUN_SCHEMA.required as string[];
expect(required).toContain("task");
expect(required).toContain("config");
expect(required).toContain("threadId");
expect(required).toContain("judges");
expect(required).toContain("overall");
expect(required).toContain("timestamp");
});
test("frontmatter judge schema has correct title", () => {
expect(EVAL_JUDGE_FRONTMATTER_SCHEMA.title).toBe("@uwf/eval-judge-frontmatter");
const required = EVAL_JUDGE_FRONTMATTER_SCHEMA.required as string[];
expect(required).toContain("stepsTotal");
expect(required).toContain("stepsValid");
expect(required).toContain("invalidSteps");
});
test("upstream judge schema has correct title", () => {
expect(EVAL_JUDGE_UPSTREAM_SCHEMA.title).toBe("@uwf/eval-judge-upstream");
const required = EVAL_JUDGE_UPSTREAM_SCHEMA.required as string[];
expect(required).toContain("perStep");
});
test("hallucination judge schema has correct title", () => {
expect(EVAL_JUDGE_HALLUCINATION_SCHEMA.title).toBe("@uwf/eval-judge-hallucination");
const required = EVAL_JUDGE_HALLUCINATION_SCHEMA.required as string[];
expect(required).toContain("perStep");
});
test("token-stats judge schema has correct title", () => {
expect(EVAL_JUDGE_TOKEN_STATS_SCHEMA.title).toBe("@uwf/eval-judge-token-stats");
const required = EVAL_JUDGE_TOKEN_STATS_SCHEMA.required as string[];
expect(required).toContain("totalInput");
expect(required).toContain("totalOutput");
expect(required).toContain("totalTurns");
expect(required).toContain("perStep");
});
test("all schemas have type object at root", () => {
const schemas = [
EVAL_RUN_SCHEMA,
EVAL_JUDGE_FRONTMATTER_SCHEMA,
EVAL_JUDGE_UPSTREAM_SCHEMA,
EVAL_JUDGE_HALLUCINATION_SCHEMA,
EVAL_JUDGE_TOKEN_STATS_SCHEMA,
];
for (const s of schemas) {
expect(s.type).toBe("object");
}
});
});
+163
View File
@@ -0,0 +1,163 @@
import { describe, expect, test } from "vitest";
import { parseTaskManifest } from "../src/task/index.js";
const VALID_YAML = `
name: fix-off-by-one
description: Fix an off-by-one error in a calculator
workflow: solve-issue
prompt: "Fix the bug: add(1,2) returns 4 instead of 3"
limits:
maxSteps: 15
timeoutMinutes: 30
judges:
- name: frontmatter-compliance
weight: 0.15
builtin: true
- name: test-pass
weight: 0.3
entry: dist/judges/test-pass.js
schema: schemas/test-pass.json
`;
describe("parseTaskManifest", () => {
test("parses valid task.yaml", () => {
const manifest = parseTaskManifest(VALID_YAML);
expect(manifest.name).toBe("fix-off-by-one");
expect(manifest.description).toBe("Fix an off-by-one error in a calculator");
expect(manifest.workflow).toBe("solve-issue");
expect(manifest.prompt).toBe("Fix the bug: add(1,2) returns 4 instead of 3");
expect(manifest.limits).toEqual({ maxSteps: 15, timeoutMinutes: 30 });
expect(manifest.judges).toHaveLength(2);
});
test("parses builtin judge", () => {
const manifest = parseTaskManifest(VALID_YAML);
const builtin = manifest.judges[0];
expect(builtin).toBeDefined();
expect(builtin!.name).toBe("frontmatter-compliance");
expect(builtin!.weight).toBe(0.15);
expect(builtin!.builtin).toBe(true);
expect(builtin!.entry).toBeNull();
});
test("parses custom judge with entry + schema", () => {
const manifest = parseTaskManifest(VALID_YAML);
const custom = manifest.judges[1];
expect(custom).toBeDefined();
expect(custom!.name).toBe("test-pass");
expect(custom!.weight).toBe(0.3);
expect(custom!.builtin).toBe(false);
expect(custom!.entry).toBe("dist/judges/test-pass.js");
expect(custom!.schema).toBe("schemas/test-pass.json");
});
test("defaults limits when omitted", () => {
const yaml = `
name: minimal
workflow: solve-issue
prompt: do something
judges:
- name: check
builtin: true
`;
const manifest = parseTaskManifest(yaml);
expect(manifest.limits).toEqual({ maxSteps: 20, timeoutMinutes: 30 });
});
test("defaults description to empty string", () => {
const yaml = `
name: no-desc
workflow: solve-issue
prompt: do something
judges:
- name: check
builtin: true
`;
const manifest = parseTaskManifest(yaml);
expect(manifest.description).toBe("");
});
test("rejects missing name", () => {
const yaml = `
workflow: solve-issue
prompt: do something
judges:
- name: check
builtin: true
`;
expect(() => parseTaskManifest(yaml)).toThrow("name is required");
});
test("rejects missing workflow", () => {
const yaml = `
name: test
prompt: do something
judges:
- name: check
builtin: true
`;
expect(() => parseTaskManifest(yaml)).toThrow("workflow is required");
});
test("rejects missing prompt", () => {
const yaml = `
name: test
workflow: solve-issue
judges:
- name: check
builtin: true
`;
expect(() => parseTaskManifest(yaml)).toThrow("prompt is required");
});
test("rejects empty judges array", () => {
const yaml = `
name: test
workflow: solve-issue
prompt: do something
judges: []
`;
expect(() => parseTaskManifest(yaml)).toThrow("at least one judge");
});
test("rejects non-builtin judge without entry", () => {
const yaml = `
name: test
workflow: solve-issue
prompt: do something
judges:
- name: custom-check
weight: 0.5
`;
expect(() => parseTaskManifest(yaml)).toThrow("non-builtin judge must have entry");
});
test("rejects non-object YAML root", () => {
expect(() => parseTaskManifest("just a string")).toThrow("must be a YAML mapping");
});
test("rejects judge without name", () => {
const yaml = `
name: test
workflow: solve-issue
prompt: do something
judges:
- weight: 0.5
builtin: true
`;
expect(() => parseTaskManifest(yaml)).toThrow("name is required");
});
test("defaults weight to 0 when omitted", () => {
const yaml = `
name: test
workflow: solve-issue
prompt: do something
judges:
- name: token-stats
builtin: true
`;
const manifest = parseTaskManifest(yaml);
expect(manifest.judges[0]!.weight).toBe(0);
});
});
+45
View File
@@ -0,0 +1,45 @@
{
"name": "@united-workforce/eval",
"version": "0.1.2",
"private": false,
"files": [
"src",
"dist",
"package.json"
],
"type": "module",
"bin": {
"uwf-eval": "./dist/cli.js"
},
"exports": {
".": {
"types": "./dist/index.d.ts",
"import": "./dist/index.js"
}
},
"scripts": {
"test": "vitest run __tests__/",
"test:ci": "vitest run __tests__/"
},
"dependencies": {
"@ocas/core": "^0.3.0",
"@ocas/fs": "^0.3.0",
"@united-workforce/protocol": "workspace:^",
"@united-workforce/util": "workspace:^",
"commander": "^14.0.3",
"yaml": "^2.9.0"
},
"devDependencies": {
"typescript": "^5.8.3"
},
"repository": {
"type": "git",
"url": "https://git.shazhou.work/shazhou/united-workforce.git",
"directory": "packages/eval"
},
"homepage": "https://git.shazhou.work/shazhou/united-workforce#readme",
"bugs": {
"url": "https://git.shazhou.work/shazhou/united-workforce/issues"
},
"license": "MIT"
}
+22
View File
@@ -0,0 +1,22 @@
#!/usr/bin/env node
import { Command } from "commander";
import {
registerDiffCommand,
registerListCommand,
registerReportCommand,
registerRunCommand,
} from "./commands/index.js";
const program = new Command();
program
.name("uwf-eval")
.description("Evaluate uwf workflow quality with real agents")
.version("0.1.0");
registerRunCommand(program);
registerReportCommand(program);
registerDiffCommand(program);
registerListCommand(program);
program.parse();
+38
View File
@@ -0,0 +1,38 @@
import { createLogger } from "@united-workforce/util";
import type { Command } from "commander";
import { createEvalStore } from "../storage/index.js";
import { formatDiff } from "./format.js";
import { readEvalRun } from "./read.js";
const log = createLogger({ sink: { kind: "stderr" } });
const LOG_DIFF = "D3WZ8N5T";
export function registerDiffCommand(program: Command): void {
program
.command("diff <hash1> <hash2>")
.description("Compare two eval runs side-by-side")
.action(async (hash1: string, hash2: string) => {
try {
const evalStore = await createEvalStore();
const payloadA = readEvalRun(evalStore, hash1);
if (payloadA === null) {
process.stderr.write(`eval run not found: ${hash1}\n`);
process.exitCode = 1;
return;
}
const payloadB = readEvalRun(evalStore, hash2);
if (payloadB === null) {
process.stderr.write(`eval run not found: ${hash2}\n`);
process.exitCode = 1;
return;
}
log(LOG_DIFF, `diff a=${hash1} b=${hash2}`);
process.stdout.write(formatDiff(payloadA, hash1, payloadB, hash2));
} catch (e) {
const message = e instanceof Error ? e.message : String(e);
process.stderr.write(`${message}\n`);
process.exitCode = 1;
}
});
}
+148
View File
@@ -0,0 +1,148 @@
import type { EvalRunPayload } from "../storage/index.js";
import type { EvalListEntry } from "./types.js";
const NAME_WIDTH = 28;
const SCORE_WIDTH = 10;
const TIMESTAMP_WIDTH = 26;
/** Format a 0..1 score (or weight) with fixed precision. */
function formatScore(value: number): string {
return value.toFixed(4);
}
/** Human-readable ISO-8601 timestamp from epoch milliseconds. */
function formatTimestamp(ms: number): string {
return new Date(ms).toISOString();
}
/** Right-pad to a fixed column width (with a trailing space if already full). */
function pad(value: string, width: number): string {
return value.length >= width ? `${value} ` : value.padEnd(width);
}
/** Directional indicator for a score delta (B relative to A). */
function formatDelta(delta: number): string {
if (delta > 0) {
return `▲ +${formatScore(delta)}`;
}
if (delta < 0) {
return `${formatScore(delta)}`;
}
return `= ${formatScore(0)}`;
}
/** Render a single eval run as a human-readable report. */
export function formatReport(payload: EvalRunPayload, runHash: string): string {
const lines: string[] = [];
lines.push("=== Eval Report ===");
lines.push(`Task: ${payload.task}`);
lines.push(`Overall: ${formatScore(payload.overall)}`);
lines.push(`Timestamp: ${formatTimestamp(payload.timestamp)}`);
lines.push("");
lines.push("Config:");
lines.push(` Agent: ${payload.config.agent}`);
lines.push(` Model: ${payload.config.model}`);
lines.push(` Engine: ${payload.config.engineVersion}`);
lines.push("");
lines.push("Judges:");
lines.push(` ${pad("NAME", NAME_WIDTH)}${pad("SCORE", SCORE_WIDTH)}WEIGHT`);
for (const judge of payload.judges) {
lines.push(
` ${pad(judge.name, NAME_WIDTH)}${pad(formatScore(judge.score), SCORE_WIDTH)}${formatScore(judge.weight)}`,
);
}
lines.push("");
lines.push(`Thread: ${payload.threadId}`);
lines.push(`Run: ${runHash}`);
return `${lines.join("\n")}\n`;
}
/** Render a side-by-side comparison of two eval runs. */
export function formatDiff(
payloadA: EvalRunPayload,
hashA: string,
payloadB: EvalRunPayload,
hashB: string,
): string {
const lines: string[] = [];
lines.push("=== Eval Diff ===");
lines.push(`A: ${hashA} (${payloadA.task})`);
lines.push(`B: ${hashB} (${payloadB.task})`);
lines.push("");
const overallDelta = payloadB.overall - payloadA.overall;
lines.push("Overall:");
lines.push(
` A=${formatScore(payloadA.overall)} B=${formatScore(payloadB.overall)} ${formatDelta(overallDelta)}`,
);
lines.push("");
lines.push("Config:");
lines.push(configLine("Agent", payloadA.config.agent, payloadB.config.agent));
lines.push(configLine("Model", payloadA.config.model, payloadB.config.model));
lines.push(configLine("Engine", payloadA.config.engineVersion, payloadB.config.engineVersion));
lines.push("");
lines.push("Judges:");
lines.push(` ${pad("NAME", NAME_WIDTH)}${pad("A", SCORE_WIDTH)}${pad("B", SCORE_WIDTH)}DELTA`);
const scoresA = new Map(payloadA.judges.map((judge) => [judge.name, judge.score]));
const scoresB = new Map(payloadB.judges.map((judge) => [judge.name, judge.score]));
for (const name of unionJudgeNames(payloadA, payloadB)) {
const scoreA = scoresA.get(name);
const scoreB = scoresB.get(name);
const cellA = scoreA === undefined ? "—" : formatScore(scoreA);
const cellB = scoreB === undefined ? "—" : formatScore(scoreB);
const delta = scoreA !== undefined && scoreB !== undefined ? formatDelta(scoreB - scoreA) : "";
lines.push(
` ${pad(name, NAME_WIDTH)}${pad(cellA, SCORE_WIDTH)}${pad(cellB, SCORE_WIDTH)}${delta}`,
);
}
return `${lines.join("\n")}\n`;
}
/** Render a table of indexed eval runs. */
export function formatList(entries: ReadonlyArray<EvalListEntry>): string {
const lines: string[] = [];
lines.push(
` ${pad("TASK", NAME_WIDTH)}${pad("OVERALL", SCORE_WIDTH)}${pad("TIMESTAMP", TIMESTAMP_WIDTH)}HASH`,
);
if (entries.length === 0) {
lines.push(" (no eval runs found)");
}
for (const entry of entries) {
lines.push(
` ${pad(entry.task, NAME_WIDTH)}${pad(formatScore(entry.overall), SCORE_WIDTH)}${pad(formatTimestamp(entry.timestamp), TIMESTAMP_WIDTH)}${entry.hash}`,
);
}
return `${lines.join("\n")}\n`;
}
/** Sort newest-first, then apply optional task filter and result limit. */
export function selectEntries(
entries: ReadonlyArray<EvalListEntry>,
task: string | null,
limit: number | null,
): EvalListEntry[] {
const sorted = [...entries].sort((a, b) => b.timestamp - a.timestamp);
const filtered = task !== null ? sorted.filter((entry) => entry.task === task) : sorted;
return limit !== null ? filtered.slice(0, limit) : filtered;
}
/** Ordered union of judge names: A's order first, then B-only names. */
function unionJudgeNames(payloadA: EvalRunPayload, payloadB: EvalRunPayload): string[] {
const names: string[] = [];
const seen = new Set<string>();
for (const judge of [...payloadA.judges, ...payloadB.judges]) {
if (!seen.has(judge.name)) {
seen.add(judge.name);
names.push(judge.name);
}
}
return names;
}
/** One config row: `=` when equal, `≠` otherwise. */
function configLine(label: string, valueA: string, valueB: string): string {
const marker = valueA === valueB ? "=" : "≠";
return ` ${pad(`${label}:`, SCORE_WIDTH)}${marker} A=${valueA} B=${valueB}`;
}
+7
View File
@@ -0,0 +1,7 @@
export { registerDiffCommand } from "./diff.js";
export { formatDiff, formatList, formatReport, selectEntries } from "./format.js";
export { registerListCommand } from "./list.js";
export { readEvalEntries, readEvalRun } from "./read.js";
export { registerReportCommand } from "./report.js";
export { registerRunCommand } from "./run.js";
export type { EvalListEntry } from "./types.js";
+43
View File
@@ -0,0 +1,43 @@
import { createLogger } from "@united-workforce/util";
import type { Command } from "commander";
import { createEvalStore } from "../storage/index.js";
import { formatList, selectEntries } from "./format.js";
import { readEvalEntries } from "./read.js";
const log = createLogger({ sink: { kind: "stderr" } });
const LOG_LIST = "L5KX9R2B";
type ListCliOptions = {
task: string | undefined;
limit: string;
};
export function registerListCommand(program: Command): void {
program
.command("list")
.description("List past eval runs")
.option("--task <name>", "filter by task name")
.option("--limit <n>", "max results", "20")
.action(async (opts: ListCliOptions) => {
const limit = Number.parseInt(opts.limit, 10);
if (!Number.isInteger(limit) || limit < 1) {
process.stderr.write("--limit must be a positive integer\n");
process.exitCode = 1;
return;
}
try {
const evalStore = await createEvalStore();
const entries = readEvalEntries(evalStore);
const task = opts.task ?? null;
const selected = selectEntries(entries, task, limit);
log(LOG_LIST, `list task=${task ?? "*"} found=${entries.length} shown=${selected.length}`);
process.stdout.write(formatList(selected));
} catch (e) {
const message = e instanceof Error ? e.message : String(e);
process.stderr.write(`${message}\n`);
process.exitCode = 1;
}
});
}
+41
View File
@@ -0,0 +1,41 @@
import type { EvalRunPayload, EvalStore } from "../storage/index.js";
import type { EvalListEntry } from "./types.js";
/** Variable prefix and suffix for eval run pointers (`@uwf/eval/<task>/latest`). */
const EVAL_VAR_PREFIX = "@uwf/eval/";
const EVAL_VAR_SUFFIX = "/latest";
/** Read a single eval-run payload from CAS. Returns null when the node is absent. */
export function readEvalRun(evalStore: EvalStore, hash: string): EvalRunPayload | null {
const node = evalStore.store.cas.get(hash);
if (node === null) {
return null;
}
return node.payload as EvalRunPayload;
}
/**
* Read every indexed eval run by scanning `@uwf/eval/*\/latest` variables and
* loading the referenced CAS node. Dangling pointers are skipped.
*/
export function readEvalEntries(evalStore: EvalStore): EvalListEntry[] {
const { store, varStore } = evalStore;
const entries: EvalListEntry[] = [];
for (const variable of varStore.list()) {
if (!variable.name.startsWith(EVAL_VAR_PREFIX) || !variable.name.endsWith(EVAL_VAR_SUFFIX)) {
continue;
}
const node = store.cas.get(variable.value);
if (node === null) {
continue;
}
const payload = node.payload as EvalRunPayload;
entries.push({
task: payload.task,
overall: payload.overall,
timestamp: payload.timestamp,
hash: variable.value,
});
}
return entries;
}
+32
View File
@@ -0,0 +1,32 @@
import { createLogger } from "@united-workforce/util";
import type { Command } from "commander";
import { createEvalStore } from "../storage/index.js";
import { formatReport } from "./format.js";
import { readEvalRun } from "./read.js";
const log = createLogger({ sink: { kind: "stderr" } });
const LOG_REPORT = "R7QP2M4K";
export function registerReportCommand(program: Command): void {
program
.command("report <hash>")
.description("Show eval run results")
.action(async (hash: string) => {
try {
const evalStore = await createEvalStore();
const payload = readEvalRun(evalStore, hash);
if (payload === null) {
process.stderr.write(`eval run not found: ${hash}\n`);
process.exitCode = 1;
return;
}
log(LOG_REPORT, `report task=${payload.task} hash=${hash}`);
process.stdout.write(formatReport(payload, hash));
} catch (e) {
const message = e instanceof Error ? e.message : String(e);
process.stderr.write(`${message}\n`);
process.exitCode = 1;
}
});
}
+84
View File
@@ -0,0 +1,84 @@
import { resolve } from "node:path";
import type { Command } from "commander";
import type { RunResult } from "../runner/index.js";
import { collect, execute, getEngineVersion, prepare } from "../runner/index.js";
import type { EvalRunConfig } from "../storage/index.js";
import { createEvalStore } from "../storage/index.js";
type RunCliOptions = {
agent: string;
model: string | undefined;
count: string;
};
async function runOnce(
taskDir: string,
agent: string,
model: string,
engineVersion: string,
): Promise<RunResult> {
const prepared = await prepare(taskDir);
const { manifest, workDir } = prepared;
const { threadId } = await execute({
workDir,
workflow: manifest.workflow,
prompt: manifest.prompt,
agent,
maxSteps: manifest.limits.maxSteps,
});
const evalStore = await createEvalStore();
const config: EvalRunConfig = { agent, model, engineVersion };
const collected = await collect({
evalStore,
taskDir: prepared.taskDir,
workDir,
threadId,
manifest,
config,
});
return {
runHash: collected.runHash,
overall: collected.overall,
task: manifest.name,
judges: collected.judges,
};
}
export function registerRunCommand(program: Command): void {
program
.command("run <task>")
.description("Run eval on a task directory or tarball")
.option("--agent <name>", "agent adapter to use", "uwf-hermes")
.option("--model <model>", "model override")
.option("--count <n>", "number of eval runs", "1")
.action(async (task: string, opts: RunCliOptions) => {
const taskDir = resolve(task);
const agent = opts.agent;
const model = opts.model ?? "";
const count = Number.parseInt(opts.count, 10);
if (!Number.isInteger(count) || count < 1) {
process.stderr.write("--count must be a positive integer\n");
process.exitCode = 1;
return;
}
const engineVersion = getEngineVersion();
try {
const results: RunResult[] = [];
for (let i = 0; i < count; i++) {
results.push(await runOnce(taskDir, agent, model, engineVersion));
}
const output = count === 1 ? results[0] : results;
process.stdout.write(`${JSON.stringify(output)}\n`);
} catch (e) {
const message = e instanceof Error ? e.message : String(e);
process.stderr.write(`${message}\n`);
process.exitCode = 1;
}
});
}
+9
View File
@@ -0,0 +1,9 @@
import type { CasRef } from "@united-workforce/protocol";
/** Summary row for the `list` command: one indexed eval run. */
export type EvalListEntry = {
task: string;
overall: number;
timestamp: number;
hash: CasRef;
};
+34
View File
@@ -0,0 +1,34 @@
// Judge types
export type { JudgeInput, JudgeOutput } from "./judge/index.js";
export type {
CollectInput,
CollectResult,
ExecuteInput,
ExecuteResult,
JudgeRunner,
JudgeRunOutput,
JudgeSummary,
PrepareResult,
RunOptions,
RunResult,
} from "./runner/index.js";
// Runner (prepare → execute → collect)
export { collect, computeOverall, execute, getEngineVersion, prepare } from "./runner/index.js";
export type {
EvalJudgeRecord,
EvalRunConfig,
EvalRunPayload,
EvalStore,
} from "./storage/index.js";
// Storage schemas and types
export {
createEvalStore,
EVAL_JUDGE_FRONTMATTER_SCHEMA,
EVAL_JUDGE_HALLUCINATION_SCHEMA,
EVAL_JUDGE_TOKEN_STATS_SCHEMA,
EVAL_JUDGE_UPSTREAM_SCHEMA,
EVAL_RUN_SCHEMA,
setEvalLatest,
} from "./storage/index.js";
export type { JudgeEntry, TaskLimits, TaskManifest } from "./task/index.js";
export { loadTaskManifest, parseTaskManifest } from "./task/index.js";
@@ -0,0 +1,105 @@
import { createLogger } from "@united-workforce/util";
import { parse as parseYaml } from "yaml";
import { EVAL_JUDGE_FRONTMATTER_SCHEMA } from "../../storage/index.js";
import { readThreadSteps } from "./read-steps.js";
import type { BuiltinJudgeOutput } from "./types.js";
const log = createLogger({ sink: { kind: "stderr" } });
const LOG_RESULT = "F2QH7R4M";
const FENCE = "---";
type InvalidStep = {
stepIndex: number;
role: string;
errors: string[];
};
/**
* Extract the YAML frontmatter block from a step output. Returns the inner YAML
* string when the output starts with a `---\n` block closed by a `\n---` fence,
* otherwise null.
*/
function extractFrontmatterYaml(output: unknown): string | null {
if (typeof output !== "string") {
return null;
}
if (!output.startsWith(`${FENCE}\n`)) {
return null;
}
const rest = output.slice(FENCE.length + 1);
const closeIndex = rest.indexOf(`\n${FENCE}`);
if (closeIndex === -1) {
return null;
}
return rest.slice(0, closeIndex);
}
/** Validate a single step's frontmatter, returning a list of errors (empty = valid). */
function validateStepFrontmatter(output: unknown): string[] {
// CAS stores the extracted output as a JSON object after the extract pipeline.
// Accept both: parsed object (from step.output) or raw markdown string.
if (typeof output === "object" && output !== null && !Array.isArray(output)) {
const status = (output as Record<string, unknown>).$status;
if (typeof status !== "string" || status.trim() === "") {
return ["$status field is missing or not a non-empty string"];
}
return [];
}
const yaml = extractFrontmatterYaml(output);
if (yaml === null) {
return ["output does not begin with a valid '---' frontmatter block"];
}
let parsed: unknown;
try {
parsed = parseYaml(yaml);
} catch (e) {
const message = e instanceof Error ? e.message : String(e);
return [`frontmatter YAML failed to parse: ${message}`];
}
if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
return ["frontmatter is not a YAML mapping"];
}
const status = (parsed as Record<string, unknown>).$status;
if (typeof status !== "string" || status.trim() === "") {
return ["$status field is missing or not a non-empty string"];
}
return [];
}
/**
* Deterministic judge: every step's agent output must contain valid YAML
* frontmatter with a non-empty `$status` field. Score = stepsValid / stepsTotal
* (0 when there are no steps).
*/
export async function runFrontmatterJudge(threadId: string): Promise<BuiltinJudgeOutput> {
const steps = readThreadSteps(threadId);
const invalidSteps: InvalidStep[] = [];
for (let i = 0; i < steps.length; i++) {
const step = steps[i];
const errors = validateStepFrontmatter(step.output);
if (errors.length > 0) {
invalidSteps.push({ stepIndex: i, role: step.role, errors });
}
}
const stepsTotal = steps.length;
const stepsValid = stepsTotal - invalidSteps.length;
const score = stepsTotal > 0 ? stepsValid / stepsTotal : 0;
log(LOG_RESULT, `frontmatter thread=${threadId} valid=${stepsValid}/${stepsTotal}`);
return {
score,
data: { stepsTotal, stepsValid, invalidSteps },
schema: EVAL_JUDGE_FRONTMATTER_SCHEMA,
};
}
@@ -0,0 +1,17 @@
import { EVAL_JUDGE_HALLUCINATION_SCHEMA } from "../../storage/index.js";
import type { BuiltinJudgeOutput } from "./types.js";
/**
* LLM-as-judge: detects claims in each step's output that are not grounded in
* the available context (hallucinations).
*
* TODO: LLM-as-judge needs provider config to call LLM API. Returns a stub
* (score 0, empty perStep) until the LLM call path is wired up.
*/
export async function runHallucinationJudge(_threadId: string): Promise<BuiltinJudgeOutput> {
return {
score: 0,
data: { perStep: [] },
schema: EVAL_JUDGE_HALLUCINATION_SCHEMA,
};
}
+6
View File
@@ -0,0 +1,6 @@
export { runFrontmatterJudge } from "./frontmatter.js";
export { runHallucinationJudge } from "./hallucination.js";
export { readThreadSteps } from "./read-steps.js";
export { runTokenStatsJudge } from "./token-stats.js";
export type { BuiltinJudge, BuiltinJudgeOutput } from "./types.js";
export { runUpstreamJudge } from "./upstream.js";
@@ -0,0 +1,14 @@
import { execFileSync } from "node:child_process";
import type { StepEntry, ThreadStepsOutput } from "@united-workforce/protocol";
/** Shell out to `uwf step list` and return the parsed step entries (excludes start entry). */
export function readThreadSteps(threadId: string): StepEntry[] {
const stdout = execFileSync("uwf", ["step", "list", threadId], {
encoding: "utf8",
stdio: ["ignore", "pipe", "pipe"],
}).trim();
const parsed = JSON.parse(stdout) as ThreadStepsOutput;
// steps[0] is the StartEntry; the rest are StepEntry records.
return parsed.steps.slice(1) as StepEntry[];
}
@@ -0,0 +1,53 @@
import { createLogger } from "@united-workforce/util";
import { EVAL_JUDGE_TOKEN_STATS_SCHEMA } from "../../storage/index.js";
import { readThreadSteps } from "./read-steps.js";
import type { BuiltinJudgeOutput } from "./types.js";
const log = createLogger({ sink: { kind: "stderr" } });
const LOG_RESULT = "T7KQ3M9P";
type PerStepStats = {
role: string;
inputTokens: number;
outputTokens: number;
turns: number;
duration: number;
};
/**
* Informational judge: aggregate token usage across every step. Always scores
* 1.0 it never penalizes a run, it only reports usage. Steps with null usage
* contribute zeros.
*/
export async function runTokenStatsJudge(threadId: string): Promise<BuiltinJudgeOutput> {
const steps = readThreadSteps(threadId);
let totalInput = 0;
let totalOutput = 0;
let totalTurns = 0;
const perStep: PerStepStats[] = [];
for (const step of steps) {
const usage = step.usage;
const inputTokens = usage !== null ? usage.inputTokens : 0;
const outputTokens = usage !== null ? usage.outputTokens : 0;
const turns = usage !== null ? usage.turns : 0;
const duration = usage !== null ? usage.duration : 0;
totalInput += inputTokens;
totalOutput += outputTokens;
totalTurns += turns;
perStep.push({ role: step.role, inputTokens, outputTokens, turns, duration });
}
log(LOG_RESULT, `token-stats thread=${threadId} in=${totalInput} out=${totalOutput}`);
return {
score: 1.0,
data: { totalInput, totalOutput, totalTurns, perStep },
schema: EVAL_JUDGE_TOKEN_STATS_SCHEMA,
};
}
+16
View File
@@ -0,0 +1,16 @@
import type { JSONSchema } from "@ocas/core";
/**
* Output produced by a builtin judge. Structurally identical to the runner's
* `JudgeRunOutput`; defined locally to keep the judge module free of a
* dependency on the runner module.
*/
export type BuiltinJudgeOutput = {
score: number;
data: unknown;
/** Schema describing `data`, used when persisting to CAS. */
schema: JSONSchema;
};
/** A builtin judge analyzes a thread's steps and returns a scored result. */
export type BuiltinJudge = (threadId: string) => Promise<BuiltinJudgeOutput>;
@@ -0,0 +1,17 @@
import { EVAL_JUDGE_UPSTREAM_SCHEMA } from "../../storage/index.js";
import type { BuiltinJudgeOutput } from "./types.js";
/**
* LLM-as-judge: measures how well each role consumed the relevant outputs from
* upstream steps.
*
* TODO: LLM-as-judge needs provider config to call LLM API. Returns a stub
* (score 0, empty perStep) until the LLM call path is wired up.
*/
export async function runUpstreamJudge(_threadId: string): Promise<BuiltinJudgeOutput> {
return {
score: 0,
data: { perStep: [] },
schema: EVAL_JUDGE_UPSTREAM_SCHEMA,
};
}
+10
View File
@@ -0,0 +1,10 @@
export {
type BuiltinJudge,
type BuiltinJudgeOutput,
readThreadSteps,
runFrontmatterJudge,
runHallucinationJudge,
runTokenStatsJudge,
runUpstreamJudge,
} from "./builtin/index.js";
export type { JudgeInput, JudgeOutput } from "./types.js";
+15
View File
@@ -0,0 +1,15 @@
/** Output shape every judge must produce on stdout (JSON). */
export type JudgeOutput<T = unknown> = {
/** Score between 0.0 and 1.0. */
score: number;
/** Judge-specific structured data, stored in CAS with its own schema. */
data: T;
};
/** Input context passed to judge scripts via argv. */
export type JudgeInput = {
/** Working directory where the task was executed. */
cwd: string;
/** Thread ID of the eval run. */
threadId: string;
};
+172
View File
@@ -0,0 +1,172 @@
import { execFileSync } from "node:child_process";
import { readFile } from "node:fs/promises";
import { resolve } from "node:path";
import type { JSONSchema, Store } from "@ocas/core";
import { putSchema } from "@ocas/core";
import type { CasRef } from "@united-workforce/protocol";
import { createLogger } from "@united-workforce/util";
import type { JudgeOutput } from "../judge/index.js";
import {
runFrontmatterJudge,
runHallucinationJudge,
runTokenStatsJudge,
runUpstreamJudge,
} from "../judge/index.js";
import type { EvalJudgeRecord, EvalRunPayload } from "../storage/index.js";
import { EVAL_RUN_SCHEMA, setEvalLatest } from "../storage/index.js";
import type { JudgeEntry } from "../task/index.js";
import type {
CollectInput,
CollectResult,
JudgeRunner,
JudgeRunOutput,
JudgeSummary,
} from "./types.js";
const log = createLogger({ sink: { kind: "stderr" } });
const LOG_JUDGE = "CT6N3P2K";
const LOG_STORED = "CT9V2Q7M";
/** Permissive schema for judge data without a dedicated schema (e.g. builtin placeholders). */
const GENERIC_DATA_SCHEMA: JSONSchema = { type: "object" };
/**
* Compute the weighted overall score. Judges with weight 0 are informational
* and do not affect the result (they contribute 0 to both numerator and
* denominator). Returns 0 when total weight is 0.
*/
export function computeOverall(judges: ReadonlyArray<{ score: number; weight: number }>): number {
let totalWeight = 0;
let weighted = 0;
for (const judge of judges) {
totalWeight += judge.weight;
weighted += judge.score * judge.weight;
}
return totalWeight > 0 ? weighted / totalWeight : 0;
}
/** Run a task-provided judge script: `node <entry> <cwd> <threadId>`. */
async function runTaskJudge(
taskDir: string,
workDir: string,
threadId: string,
judge: JudgeEntry,
): Promise<JudgeRunOutput> {
if (judge.entry === null) {
throw new Error(`judge "${judge.name}" is not builtin but has no entry`);
}
const entryPath = resolve(taskDir, judge.entry);
let stdout: string;
try {
stdout = execFileSync("node", [entryPath, workDir, threadId], {
encoding: "utf8",
stdio: ["ignore", "pipe", "pipe"],
maxBuffer: 50 * 1024 * 1024,
});
} catch (e) {
const message = e instanceof Error ? e.message : String(e);
throw new Error(`judge "${judge.name}" failed: ${message}`);
}
const line = stdout.trim().split("\n").pop()?.trim() ?? "";
let parsed: unknown;
try {
parsed = JSON.parse(line);
} catch {
throw new Error(`judge "${judge.name}" stdout is not valid JSON: ${line || "(empty)"}`);
}
const output = parsed as JudgeOutput;
if (typeof output.score !== "number") {
throw new Error(`judge "${judge.name}" output missing numeric score`);
}
const schema =
judge.schema !== null ? await loadSchema(resolve(taskDir, judge.schema)) : GENERIC_DATA_SCHEMA;
return { score: output.score, data: output.data, schema };
}
/** Load and parse an OCAS JSON Schema file. */
async function loadSchema(path: string): Promise<JSONSchema> {
const text = await readFile(path, "utf8");
return JSON.parse(text) as JSONSchema;
}
/** Dispatch a builtin judge by name. Throws on an unknown builtin name. */
async function runBuiltinJudge(name: string, threadId: string): Promise<JudgeRunOutput> {
switch (name) {
case "frontmatter-compliance":
return runFrontmatterJudge(threadId);
case "upstream-consumption":
return runUpstreamJudge(threadId);
case "hallucination":
return runHallucinationJudge(threadId);
case "token-stats":
return runTokenStatsJudge(threadId);
default:
throw new Error(`unknown builtin judge "${name}"`);
}
}
/**
* Default judge runner. Builtin judges are dispatched by name; task judges spawn
* their entry script.
*/
const defaultJudgeRunner: JudgeRunner = async (taskDir, workDir, threadId, judge) => {
if (judge.builtin) {
return runBuiltinJudge(judge.name, threadId);
}
return runTaskJudge(taskDir, workDir, threadId, judge);
};
/** Persist judge data to CAS under its schema and return the CAS hash. */
async function storeJudgeData(store: Store, schema: JSONSchema, data: unknown): Promise<CasRef> {
const schemaHash = await putSchema(store, schema);
return (await store.cas.put(schemaHash, data)) as CasRef;
}
/**
* Run all judges, store their data and the overall eval-run record in CAS, then
* index the run under `@uwf/eval/<task>/latest`.
*/
export async function collect(
input: CollectInput,
runJudge: JudgeRunner = defaultJudgeRunner,
): Promise<CollectResult> {
const { evalStore, taskDir, workDir, threadId, manifest, config } = input;
const { store, varStore } = evalStore;
const records: EvalJudgeRecord[] = [];
for (const judge of manifest.judges) {
const result = await runJudge(taskDir, workDir, threadId, judge);
const dataHash = await storeJudgeData(store, result.schema, result.data);
records.push({ name: judge.name, score: result.score, weight: judge.weight, dataHash });
log(LOG_JUDGE, `judge=${judge.name} score=${result.score} weight=${judge.weight}`);
}
const overall = computeOverall(records);
const payload: EvalRunPayload = {
task: manifest.name,
config,
threadId,
judges: records,
overall,
timestamp: Date.now(),
};
const schemaHash = await putSchema(store, EVAL_RUN_SCHEMA);
const runHash = (await store.cas.put(schemaHash, payload)) as string;
setEvalLatest(varStore, manifest.name, runHash);
log(LOG_STORED, `stored eval-run task=${manifest.name} hash=${runHash} overall=${overall}`);
const judges: JudgeSummary[] = records.map((r) => ({
name: r.name,
score: r.score,
weight: r.weight,
}));
return { runHash, overall, judges };
}
+87
View File
@@ -0,0 +1,87 @@
import { execFileSync } from "node:child_process";
import { createLogger } from "@united-workforce/util";
import type { ExecuteInput, ExecuteResult } from "./types.js";
const log = createLogger({ sink: { kind: "stderr" } });
const LOG_START = "EX5M2T9V";
const LOG_EXEC = "EX7Q4K2N";
/** Resolve the uwf CLI binary. Override with `UWF_BIN` for testing. */
function uwfBin(): string {
const override = process.env.UWF_BIN;
return override !== undefined && override !== "" ? override : "uwf";
}
/** Run a uwf subcommand and return trimmed stdout. */
function runUwf(args: string[], cwd: string): string {
try {
return execFileSync(uwfBin(), args, {
encoding: "utf8",
stdio: ["ignore", "pipe", "pipe"],
maxBuffer: 50 * 1024 * 1024,
cwd,
}).trim();
} catch (e) {
const err = e as NodeJS.ErrnoException & { stderr?: Buffer | string | null };
const stderr =
err.stderr == null
? ""
: typeof err.stderr === "string"
? err.stderr
: err.stderr.toString("utf8");
const detail = stderr.trim() !== "" ? `: ${stderr.trim()}` : "";
throw new Error(`uwf ${args[0]} ${args[1]} failed${detail}`);
}
}
/** Parse the thread ID from `uwf thread start` JSON output (`{ workflow, thread }`). */
function parseThreadId(stdout: string): string {
let parsed: unknown;
try {
parsed = JSON.parse(stdout);
} catch {
throw new Error(`uwf thread start did not emit valid JSON: ${stdout || "(empty)"}`);
}
const obj = parsed as Record<string, unknown>;
const thread = obj.thread;
if (typeof thread !== "string" || thread === "") {
throw new Error(`uwf thread start output missing thread id: ${stdout}`);
}
return thread;
}
/**
* Execute a workflow: create a thread, then run it for up to `maxSteps` steps.
* Shells out to the uwf CLI rather than importing it directly.
*/
export async function execute(input: ExecuteInput): Promise<ExecuteResult> {
const startOut = runUwf(
["thread", "start", input.workflow, "-p", input.prompt, "--cwd", input.workDir],
input.workDir,
);
const threadId = parseThreadId(startOut);
log(LOG_START, `thread started thread=${threadId} workflow=${input.workflow}`);
runUwf(
["thread", "exec", threadId, "--agent", input.agent, "-c", String(input.maxSteps)],
input.workDir,
);
log(LOG_EXEC, `thread executed thread=${threadId} maxSteps=${input.maxSteps}`);
return { threadId };
}
/** Best-effort lookup of the uwf engine version (`uwf -V`); "unknown" on failure. */
export function getEngineVersion(): string {
try {
return execFileSync(uwfBin(), ["-V"], {
encoding: "utf8",
stdio: ["ignore", "pipe", "ignore"],
}).trim();
} catch {
return "unknown";
}
}
+15
View File
@@ -0,0 +1,15 @@
export { collect, computeOverall } from "./collect.js";
export { execute, getEngineVersion } from "./execute.js";
export { prepare } from "./prepare.js";
export type {
CollectInput,
CollectResult,
ExecuteInput,
ExecuteResult,
JudgeRunner,
JudgeRunOutput,
JudgeSummary,
PrepareResult,
RunOptions,
RunResult,
} from "./types.js";
+45
View File
@@ -0,0 +1,45 @@
import { access, cp, mkdir, mkdtemp } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { createLogger } from "@united-workforce/util";
import { loadTaskManifest } from "../task/index.js";
import type { PrepareResult } from "./types.js";
const log = createLogger({ sink: { kind: "stderr" } });
const LOG_PREPARE = "PRE4K2NQ";
const LOG_FIXTURE = "PRE7M3VX";
/** Check whether a path exists. */
async function pathExists(path: string): Promise<boolean> {
try {
await access(path);
return true;
} catch {
return false;
}
}
/**
* Prepare a task for execution: read its manifest and copy the fixture
* directory into a fresh temp working directory.
*/
export async function prepare(taskDir: string): Promise<PrepareResult> {
const manifest = await loadTaskManifest(taskDir);
log(LOG_PREPARE, `loaded task manifest name=${manifest.name} workflow=${manifest.workflow}`);
const workDir = await mkdtemp(join(tmpdir(), "uwf-eval-"));
const fixtureDir = join(taskDir, "fixture");
if (await pathExists(fixtureDir)) {
await cp(fixtureDir, workDir, { recursive: true });
log(LOG_FIXTURE, `copied fixture into workDir=${workDir}`);
} else {
await mkdir(workDir, { recursive: true });
log(LOG_FIXTURE, `no fixture/ found, using empty workDir=${workDir}`);
}
return { taskDir, workDir, manifest };
}
+85
View File
@@ -0,0 +1,85 @@
import type { JSONSchema } from "@ocas/core";
import type { EvalRunConfig, EvalStore } from "../storage/index.js";
import type { JudgeEntry, TaskManifest } from "../task/index.js";
/** Result of the prepare phase: task dir, temp working dir, parsed manifest. */
export type PrepareResult = {
taskDir: string;
workDir: string;
manifest: TaskManifest;
};
/** Input to the execute phase. */
export type ExecuteInput = {
/** Working directory the workflow runs in (the prepared temp dir). */
workDir: string;
/** Workflow name or path (from task.yaml). */
workflow: string;
/** Initial prompt for the thread. */
prompt: string;
/** Agent adapter to use. */
agent: string;
/** Maximum number of steps to execute. */
maxSteps: number;
};
/** Result of the execute phase. */
export type ExecuteResult = {
threadId: string;
};
/** Output produced by running a single judge. */
export type JudgeRunOutput = {
score: number;
data: unknown;
/** Schema describing `data`, used when persisting to CAS. */
schema: JSONSchema;
};
/** Pluggable judge execution strategy (injectable for testing). */
export type JudgeRunner = (
taskDir: string,
workDir: string,
threadId: string,
judge: JudgeEntry,
) => Promise<JudgeRunOutput>;
/** Input to the collect phase. */
export type CollectInput = {
evalStore: EvalStore;
taskDir: string;
workDir: string;
threadId: string;
manifest: TaskManifest;
config: EvalRunConfig;
};
/** A single judge's summarized result in the run output. */
export type JudgeSummary = {
name: string;
score: number;
weight: number;
};
/** Result of the collect phase. */
export type CollectResult = {
runHash: string;
overall: number;
judges: JudgeSummary[];
};
/** Options for a full eval run (from CLI flags). */
export type RunOptions = {
agent: string;
model: string;
count: number;
};
/** Final result of a full eval run. */
export type RunResult = {
runHash: string;
overall: number;
task: string;
judges: JudgeSummary[];
};
+9
View File
@@ -0,0 +1,9 @@
export {
EVAL_JUDGE_FRONTMATTER_SCHEMA,
EVAL_JUDGE_HALLUCINATION_SCHEMA,
EVAL_JUDGE_TOKEN_STATS_SCHEMA,
EVAL_JUDGE_UPSTREAM_SCHEMA,
EVAL_RUN_SCHEMA,
} from "./schemas.js";
export { createEvalStore, setEvalLatest } from "./store.js";
export type { EvalJudgeRecord, EvalRunConfig, EvalRunPayload, EvalStore } from "./types.js";
+123
View File
@@ -0,0 +1,123 @@
import type { JSONSchema } from "@ocas/core";
export const EVAL_RUN_SCHEMA: JSONSchema = {
title: "@uwf/eval-run",
type: "object",
required: ["task", "config", "threadId", "judges", "overall", "timestamp"],
properties: {
task: { type: "string" },
config: {
type: "object",
required: ["agent", "model", "engineVersion"],
properties: {
agent: { type: "string" },
model: { type: "string" },
engineVersion: { type: "string" },
},
},
threadId: { type: "string" },
judges: {
type: "array",
items: {
type: "object",
required: ["name", "score", "weight", "dataHash"],
properties: {
name: { type: "string" },
score: { type: "number" },
weight: { type: "number" },
dataHash: { type: "string" },
},
},
},
overall: { type: "number" },
timestamp: { type: "integer" },
},
};
export const EVAL_JUDGE_FRONTMATTER_SCHEMA: JSONSchema = {
title: "@uwf/eval-judge-frontmatter",
type: "object",
required: ["stepsTotal", "stepsValid", "invalidSteps"],
properties: {
stepsTotal: { type: "integer" },
stepsValid: { type: "integer" },
invalidSteps: {
type: "array",
items: {
type: "object",
required: ["stepIndex", "role", "errors"],
properties: {
stepIndex: { type: "integer" },
role: { type: "string" },
errors: { type: "array", items: { type: "string" } },
},
},
},
},
};
export const EVAL_JUDGE_UPSTREAM_SCHEMA: JSONSchema = {
title: "@uwf/eval-judge-upstream",
type: "object",
required: ["perStep"],
properties: {
perStep: {
type: "array",
items: {
type: "object",
required: ["role", "consumed", "missed", "score"],
properties: {
role: { type: "string" },
consumed: { type: "array", items: { type: "string" } },
missed: { type: "array", items: { type: "string" } },
score: { type: "number" },
},
},
},
},
};
export const EVAL_JUDGE_HALLUCINATION_SCHEMA: JSONSchema = {
title: "@uwf/eval-judge-hallucination",
type: "object",
required: ["perStep"],
properties: {
perStep: {
type: "array",
items: {
type: "object",
required: ["role", "hallucinations", "score"],
properties: {
role: { type: "string" },
hallucinations: { type: "array", items: { type: "string" } },
score: { type: "number" },
},
},
},
},
};
export const EVAL_JUDGE_TOKEN_STATS_SCHEMA: JSONSchema = {
title: "@uwf/eval-judge-token-stats",
type: "object",
required: ["totalInput", "totalOutput", "totalTurns", "perStep"],
properties: {
totalInput: { type: "integer" },
totalOutput: { type: "integer" },
totalTurns: { type: "integer" },
perStep: {
type: "array",
items: {
type: "object",
required: ["role", "inputTokens", "outputTokens", "turns", "duration"],
properties: {
role: { type: "string" },
inputTokens: { type: "integer" },
outputTokens: { type: "integer" },
turns: { type: "integer" },
duration: { type: "number" },
},
},
},
},
};
+42
View File
@@ -0,0 +1,42 @@
import { mkdir } from "node:fs/promises";
import { homedir } from "node:os";
import { join } from "node:path";
import type { VarStore } from "@ocas/core";
import { bootstrap, type Store } from "@ocas/core";
import { createFsStore, createSqliteVarStore } from "@ocas/fs";
import type { EvalStore } from "./types.js";
/** Variable name prefix for eval run pointers (`@uwf/eval/<task>/latest`). */
const EVAL_VAR_PREFIX = "@uwf/eval/";
/**
* Resolve the global CAS directory shared by all uwf and ocas tools.
* Priority: `OCAS_HOME` default ~/.ocas (matches uwf CLI's getGlobalCasDir).
*/
function getGlobalCasDir(): string {
const primary = process.env.OCAS_HOME;
if (primary !== undefined && primary !== "") {
return primary;
}
return join(homedir(), ".ocas");
}
/**
* Open the unified OCAS store on the filesystem.
* Shares the same CAS + variable backend as the uwf CLI.
*/
export async function createEvalStore(): Promise<EvalStore> {
const casDir = getGlobalCasDir();
await mkdir(casDir, { recursive: true });
const cas = createFsStore(casDir);
const { var: varStore, tag } = createSqliteVarStore(join(casDir, "vars"), cas);
const store: Store = { cas, var: varStore, tag };
bootstrap(store);
return { store, varStore };
}
/** Set the `@uwf/eval/<task>/latest` variable to point at a run hash. */
export function setEvalLatest(varStore: VarStore, taskName: string, runHash: string): void {
varStore.set(`${EVAL_VAR_PREFIX}${taskName}/latest`, runHash);
}
+33
View File
@@ -0,0 +1,33 @@
import type { Store, VarStore } from "@ocas/core";
import type { CasRef } from "@united-workforce/protocol";
/** Handle to the OCAS store used for eval persistence. */
export type EvalStore = {
store: Store;
varStore: VarStore;
};
/** A single judge result within an eval run. */
export type EvalJudgeRecord = {
name: string;
score: number;
weight: number;
dataHash: CasRef;
};
/** Config snapshot for an eval run. */
export type EvalRunConfig = {
agent: string;
model: string;
engineVersion: string;
};
/** Full eval run record stored in CAS. */
export type EvalRunPayload = {
task: string;
config: EvalRunConfig;
threadId: string;
judges: EvalJudgeRecord[];
overall: number;
timestamp: number;
};
+2
View File
@@ -0,0 +1,2 @@
export { loadTaskManifest, parseTaskManifest } from "./loader.js";
export type { JudgeEntry, TaskLimits, TaskManifest } from "./types.js";
+74
View File
@@ -0,0 +1,74 @@
import { readFile } from "node:fs/promises";
import { join } from "node:path";
import { parse as parseYaml } from "yaml";
import type { JudgeEntry, TaskLimits, TaskManifest } from "./types.js";
function isRecord(value: unknown): value is Record<string, unknown> {
return typeof value === "object" && value !== null && !Array.isArray(value);
}
function parseJudgeEntry(raw: unknown, index: number): JudgeEntry {
if (!isRecord(raw)) {
throw new Error(`judges[${index}]: expected object`);
}
const name = raw.name;
if (typeof name !== "string" || name === "") {
throw new Error(`judges[${index}]: name is required`);
}
const weight = typeof raw.weight === "number" ? raw.weight : 0;
const builtin = raw.builtin === true;
const entry = typeof raw.entry === "string" ? raw.entry : null;
const schema = typeof raw.schema === "string" ? raw.schema : null;
if (!builtin && entry === null) {
throw new Error(`judges[${index}] "${name}": non-builtin judge must have entry`);
}
return { name, weight, builtin, entry, schema };
}
function parseLimits(raw: unknown): TaskLimits {
if (!isRecord(raw)) {
return { maxSteps: 20, timeoutMinutes: 30 };
}
return {
maxSteps: typeof raw.maxSteps === "number" ? raw.maxSteps : 20,
timeoutMinutes: typeof raw.timeoutMinutes === "number" ? raw.timeoutMinutes : 30,
};
}
/** Parse and validate a task.yaml file into a TaskManifest. */
export function parseTaskManifest(yamlText: string): TaskManifest {
const raw = parseYaml(yamlText) as unknown;
if (!isRecord(raw)) {
throw new Error("task.yaml must be a YAML mapping");
}
const name = raw.name;
if (typeof name !== "string" || name === "") {
throw new Error("task.yaml: name is required");
}
const description = typeof raw.description === "string" ? raw.description : "";
const workflow = raw.workflow;
if (typeof workflow !== "string" || workflow === "") {
throw new Error("task.yaml: workflow is required");
}
const prompt = raw.prompt;
if (typeof prompt !== "string" || prompt === "") {
throw new Error("task.yaml: prompt is required");
}
const limits = parseLimits(raw.limits);
const judgesRaw = raw.judges;
if (!Array.isArray(judgesRaw) || judgesRaw.length === 0) {
throw new Error("task.yaml: at least one judge is required");
}
const judges: JudgeEntry[] = [];
for (let i = 0; i < judgesRaw.length; i++) {
judges.push(parseJudgeEntry(judgesRaw[i], i));
}
return { name, description, workflow, prompt, limits, judges };
}
/** Load and parse task.yaml from a directory. */
export async function loadTaskManifest(taskDir: string): Promise<TaskManifest> {
const yamlPath = join(taskDir, "task.yaml");
const text = await readFile(yamlPath, "utf8");
return parseTaskManifest(text);
}
+28
View File
@@ -0,0 +1,28 @@
/** Judge entry in task.yaml */
export type JudgeEntry = {
name: string;
weight: number;
builtin: boolean;
/** Path to judge entry script (relative to task root). Required for non-builtin judges. */
entry: string | null;
/** Path to OCAS schema JSON for judge data. Required for non-builtin judges. */
schema: string | null;
};
/** Limits for eval execution. */
export type TaskLimits = {
maxSteps: number;
timeoutMinutes: number;
};
/** Parsed task.yaml manifest. */
export type TaskManifest = {
name: string;
description: string;
/** Workflow name or relative path to .yaml file. */
workflow: string;
/** Initial prompt for thread start. */
prompt: string;
limits: TaskLimits;
judges: JudgeEntry[];
};
+9
View File
@@ -0,0 +1,9 @@
{
"extends": "../../tsconfig.json",
"compilerOptions": {
"rootDir": "src",
"outDir": "dist"
},
"include": ["src"],
"references": [{ "path": "../protocol" }, { "path": "../util" }]
}
+1 -2
View File
@@ -1,6 +1,6 @@
{ {
"name": "@united-workforce/protocol", "name": "@united-workforce/protocol",
"version": "0.5.0", "version": "0.1.0",
"files": [ "files": [
"src", "src",
"dist", "dist",
@@ -14,7 +14,6 @@
} }
}, },
"scripts": { "scripts": {
"prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
"test": "vitest run src/__tests__/", "test": "vitest run src/__tests__/",
"test:ci": "vitest run src/__tests__/" "test:ci": "vitest run src/__tests__/"
}, },
@@ -27,6 +27,7 @@ describe("Protocol types for thread/edge location", () => {
completedAtMs: Date.now() + 1000, completedAtMs: Date.now() + 1000,
assembledPrompt: null, assembledPrompt: null,
cwd: "/home/user/project", cwd: "/home/user/project",
usage: null,
}; };
expect(record.cwd).toBe("/home/user/project"); expect(record.cwd).toBe("/home/user/project");
+1
View File
@@ -44,6 +44,7 @@ export type {
ThreadStatus, ThreadStatus,
ThreadStepsOutput, ThreadStepsOutput,
ThreadsIndex, ThreadsIndex,
Usage,
WorkflowConfig, WorkflowConfig,
WorkflowName, WorkflowName,
WorkflowPayload, WorkflowPayload,
+16
View File
@@ -91,6 +91,22 @@ export const STEP_NODE_SCHEMA: JSONSchema = {
assembledPrompt: { assembledPrompt: {
anyOf: [{ type: "string", format: "ocas_ref" }, { type: "null" }], anyOf: [{ type: "string", format: "ocas_ref" }, { type: "null" }],
}, },
usage: {
anyOf: [
{
type: "object",
required: ["turns", "inputTokens", "outputTokens", "duration"],
properties: {
turns: { type: "integer" },
inputTokens: { type: "integer" },
outputTokens: { type: "integer" },
duration: { type: "number" },
},
additionalProperties: false,
},
{ type: "null" },
],
},
}, },
additionalProperties: false, additionalProperties: false,
}; };
+12
View File
@@ -22,6 +22,17 @@ export type StepRecord = {
cwd: string; cwd: string;
/** CAS ref to the fully assembled prompt sent to the agent. null for legacy steps. */ /** CAS ref to the fully assembled prompt sent to the agent. null for legacy steps. */
assembledPrompt: CasRef | null; assembledPrompt: CasRef | null;
/** Token usage statistics reported by the agent adapter. null for legacy steps. */
usage: Usage | null;
};
/** Token usage statistics reported by agent adapters. */
export type Usage = {
turns: number;
inputTokens: number;
outputTokens: number;
/** Wall-clock duration in seconds. */
duration: number;
}; };
// ── 4.2 Workflow 定义 ─────────────────────────────────────────────── // ── 4.2 Workflow 定义 ───────────────────────────────────────────────
@@ -131,6 +142,7 @@ export type StepEntry = {
agent: string; agent: string;
timestamp: number; timestamp: number;
durationMs: number; durationMs: number;
usage: Usage | null;
}; };
/** uwf thread steps — start entry */ /** uwf thread steps — start entry */
+1 -2
View File
@@ -1,6 +1,6 @@
{ {
"name": "@united-workforce/util-agent", "name": "@united-workforce/util-agent",
"version": "0.5.0", "version": "0.1.0",
"files": [ "files": [
"src", "src",
"dist", "dist",
@@ -14,7 +14,6 @@
} }
}, },
"scripts": { "scripts": {
"prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
"test": "vitest run __tests__/ src/__tests__/", "test": "vitest run __tests__/ src/__tests__/",
"test:ci": "vitest run __tests__/ src/__tests__/" "test:ci": "vitest run __tests__/ src/__tests__/"
}, },
+1
View File
@@ -132,6 +132,7 @@ async function buildHistory(
completedAtMs: step.completedAtMs, completedAtMs: step.completedAtMs,
cwd: step.cwd ?? "", cwd: step.cwd ?? "",
assembledPrompt: step.assembledPrompt ?? null, assembledPrompt: step.assembledPrompt ?? null,
usage: step.usage ?? null,
content, content,
}); });
} }
+8 -1
View File
@@ -1,5 +1,5 @@
import { getSchema, validate } from "@ocas/core"; import { getSchema, validate } from "@ocas/core";
import type { CasRef, StepNodePayload, ThreadId } from "@united-workforce/protocol"; import type { CasRef, StepNodePayload, ThreadId, Usage } from "@united-workforce/protocol";
import { config as loadDotenv } from "dotenv"; import { config as loadDotenv } from "dotenv";
import { buildOutputFormatInstruction } from "./build-output-format-instruction.js"; import { buildOutputFormatInstruction } from "./build-output-format-instruction.js";
import { buildContextWithMeta } from "./context.js"; import { buildContextWithMeta } from "./context.js";
@@ -65,6 +65,7 @@ async function writeStepNode(options: {
startedAtMs: number; startedAtMs: number;
completedAtMs: number; completedAtMs: number;
assembledPromptHash: CasRef | null; assembledPromptHash: CasRef | null;
usage: Usage | null;
}): Promise<CasRef> { }): Promise<CasRef> {
const payload: StepNodePayload = { const payload: StepNodePayload = {
start: options.startHash, start: options.startHash,
@@ -78,6 +79,7 @@ async function writeStepNode(options: {
completedAtMs: options.completedAtMs, completedAtMs: options.completedAtMs,
cwd: process.cwd(), cwd: process.cwd(),
assembledPrompt: options.assembledPromptHash, assembledPrompt: options.assembledPromptHash,
usage: options.usage,
}; };
const hash = await options.store.cas.put(options.schemas.stepNode, payload); const hash = await options.store.cas.put(options.schemas.stepNode, payload);
const node = options.store.cas.get(hash); const node = options.store.cas.get(hash);
@@ -117,6 +119,7 @@ async function persistStep(options: {
startedAtMs: number; startedAtMs: number;
completedAtMs: number; completedAtMs: number;
assembledPromptHash: CasRef | null; assembledPromptHash: CasRef | null;
usage: Usage | null;
}): Promise<CasRef> { }): Promise<CasRef> {
const { store, schemas, chain, headHash } = options.ctx.meta; const { store, schemas, chain, headHash } = options.ctx.meta;
return writeStepNode({ return writeStepNode({
@@ -132,6 +135,7 @@ async function persistStep(options: {
startedAtMs: options.startedAtMs, startedAtMs: options.startedAtMs,
completedAtMs: options.completedAtMs, completedAtMs: options.completedAtMs,
assembledPromptHash: options.assembledPromptHash, assembledPromptHash: options.assembledPromptHash,
usage: options.usage,
}); });
} }
@@ -200,6 +204,7 @@ export function createAgent(options: AgentOptions): () => Promise<void> {
); );
} }
const completedAtMs = Date.now(); const completedAtMs = Date.now();
const usage = agentResult.usage;
// Store the assembled prompt in CAS for later inspection via `step read --prompt` // Store the assembled prompt in CAS for later inspection via `step read --prompt`
const promptText = agentResult.assembledPrompt; const promptText = agentResult.assembledPrompt;
@@ -220,6 +225,7 @@ export function createAgent(options: AgentOptions): () => Promise<void> {
startedAtMs, startedAtMs,
completedAtMs, completedAtMs,
assembledPromptHash, assembledPromptHash,
usage,
}); });
const adapterOutput: AdapterOutput = { const adapterOutput: AdapterOutput = {
@@ -230,6 +236,7 @@ export function createAgent(options: AgentOptions): () => Promise<void> {
body: extracted.body, body: extracted.body,
startedAtMs, startedAtMs,
completedAtMs, completedAtMs,
usage,
}; };
process.stdout.write(`${JSON.stringify(adapterOutput)}\n`); process.stdout.write(`${JSON.stringify(adapterOutput)}\n`);
}; };
+9 -1
View File
@@ -1,5 +1,10 @@
import type { Store } from "@ocas/core"; import type { Store } from "@ocas/core";
import type { ModeratorContext, ThreadId, WorkflowPayload } from "@united-workforce/protocol"; import type {
ModeratorContext,
ThreadId,
Usage,
WorkflowPayload,
} from "@united-workforce/protocol";
export type AgentContext = ModeratorContext & { export type AgentContext = ModeratorContext & {
threadId: ThreadId; threadId: ThreadId;
@@ -33,6 +38,8 @@ export type AgentRunResult = {
sessionId: string; sessionId: string;
/** The fully assembled prompt that was sent to the agent. */ /** The fully assembled prompt that was sent to the agent. */
assembledPrompt: string; assembledPrompt: string;
/** Token usage statistics for this run. null when the adapter does not report usage. */
usage: Usage | null;
}; };
export type AgentContinueFn = ( export type AgentContinueFn = (
@@ -51,6 +58,7 @@ export type AdapterOutput = {
body: string; body: string;
startedAtMs: number; startedAtMs: number;
completedAtMs: number; completedAtMs: number;
usage: Usage | null;
}; };
export type AgentOptions = { export type AgentOptions = {

Some files were not shown because too many files have changed in this diff Show More