chore: version bump for --version fix

agent-hermes@0.1.2 agent-claude-code@0.1.1 agent-builtin@0.1.1 agent-mock@0.1.1 eval@0.1.3 util@0.1.1 小橘 🍊（NEKO Team）
Merge pull request 'fix: acp-client reports agent-hermes own version in MCP clientInfo' (#98) from fix/acp-client-own-version into main
2026-06-05 08:12:50 +00:00 · 2026-06-05 08:10:57 +00:00 · 2026-06-05 07:50:03 +00:00 · 2026-06-05 07:36:15 +00:00 · 2026-06-05 07:29:54 +00:00 · 2026-06-05 06:46:32 +00:00
117 changed files with 3424 additions and 666 deletions
@@ -1,8 +0,0 @@
-# Changesets
-
-Hello and welcome! This folder has been automatically generated by `@changesets/cli`, a build tool that works
-with multi-package repos, or single-package repos to help you version and publish your code. You can
-find the full documentation for it [in our repository](https://github.com/changesets/changesets).
-
-We have a quick list of common questions to get you started engaging with this project in
-[our documentation](https://github.com/changesets/changesets/blob/main/docs/common-questions.md).
@@ -1,11 +0,0 @@
-{
-  "$schema": "https://unpkg.com/@changesets/config@3.1.4/schema.json",
-  "changelog": "@changesets/cli/changelog",
-  "commit": false,
-  "fixed": [["@united-workforce/*"]],
-  "linked": [],
-  "access": "public",
-  "baseBranch": "main",
-  "updateInternalDependencies": "patch",
-  "ignore": ["@united-workforce/dashboard"]
-}
@@ -1,30 +0,0 @@
-{
-  "mode": "exit",
-  "tag": "alpha",
-  "initialVersions": {
-    "@uncaged/cli": "0.4.5",
-    "@uncaged/workflow-agent-cursor": "0.4.5",
-    "@uncaged/agent-hermes": "0.4.5",
-    "@uncaged/workflow-agent-llm": "0.4.5",
-    "@uncaged/workflow-agent-react": "0.4.5",
-    "@uncaged/workflow-cas": "0.4.5",
-    "@uncaged/dashboard": "0.1.0",
-    "@uncaged/workflow-execute": "0.4.5",
-    "@uncaged/workflow-gateway": "0.4.5",
-    "@uncaged/protocol": "0.4.5",
-    "@uncaged/workflow-reactor": "0.4.5",
-    "@uncaged/workflow-register": "0.4.5",
-    "@uncaged/workflow-runtime": "0.4.5",
-    "@uncaged/workflow-template-develop": "0.4.5",
-    "@uncaged/workflow-template-solve-issue": "0.4.5",
-    "@uncaged/util": "0.4.5",
-    "@uncaged/util-agent": "0.4.5"
-  },
-  "changesets": [
-    "env-api-unify",
-    "fix-internal-deps",
-    "fix-publish-src",
-    "fix-workspace-deps",
-    "rfc-252-agent-fn"
-  ]
-}
@@ -0,0 +1,226 @@
+# Eval Framework Implementation Plan
+
+## Goal
+
+Build `uwf-eval` CLI + eval task infrastructure for evaluating uwf workflow quality with real agents.
+
+## Architecture
+
+```
+uwf-eval (runner)          task package (npm)          OCAS (storage)
+  │                          │                           │
+  ├─ unpack tarball ───────► fixture/ → tmp cwd          │
+  ├─ read task.yaml          │                           │
+  ├─ uwf thread start/exec  │                           │
+  ├─ run judges ───────────► dist/judges/*.js            │
+  ├─ collect scores          │                           │
+  └─ store results ─────────────────────────────────────► CAS nodes + variables
+```
+
+### Key Design Decisions
+
+- **uwf-eval is NOT part of uwf** — separate package, shells out to uwf CLI
+- **Task = npm package** — fixture + task.yaml + judge scripts, distributable as tarball
+- **Judge = Node script** — `node <entry> <cwd> <thread-id>`, outputs `{score, data}` JSON
+- **Every output is OCAS typed** — eval-run, judge results all have registered schemas
+- **Builtin judges** — frontmatter compliance, upstream consumption, hallucination, token stats
+- **Task-specific judges** — bundled in the task package, custom schema per judge
+
+## Deliverables
+
+### Phase 1: Foundation (`@united-workforce/eval`)
+
+New package in the uwf monorepo.
+
+```
+packages/eval/
+  src/
+    cli.ts                    # uwf-eval entry point
+    commands/
+      run.ts                  # uwf-eval run
+      report.ts               # uwf-eval report <hash>
+      diff.ts                 # uwf-eval diff <hash> <hash>
+      list.ts                 # uwf-eval list
+    runner/
+      prepare.ts              # unpack tarball/dir → tmp cwd
+      execute.ts              # shell out to uwf thread start/exec
+      collect.ts              # run judges, collect scores
+    judge/
+      types.ts                # JudgeInput, JudgeOutput types
+      builtin/
+        frontmatter.ts        # frontmatter compliance check
+        upstream.ts           # upstream info consumption (LLM-as-judge)
+        hallucination.ts      # hallucination detection (LLM-as-judge)
+        token-stats.ts        # token usage from $usage field (#68)
+    storage/
+      schemas.ts              # OCAS schema definitions
+      store.ts                # CAS read/write helpers
+      index.ts                # variable indexing (@uwf/eval/*)
+    task/
+      types.ts                # TaskManifest type (task.yaml)
+      loader.ts               # parse task.yaml, validate
+  package.json
+  tsconfig.json
+```
+
+#### OCAS Schemas to Register
+
+1. `@uwf/eval-run` — full eval execution record
+   ```
+   { task, config: {agent, model, engineVersion}, threadId,
+     judges: [{name, score, weight, dataHash}], overall, timestamp }
+   ```
+
+2. `@uwf/eval-judge-frontmatter` — frontmatter judge data
+   ```
+   { stepsTotal, stepsValid, invalidSteps: [{stepIndex, role, errors: string[]}] }
+   ```
+
+3. `@uwf/eval-judge-upstream` — upstream consumption judge data
+   ```
+   { perStep: [{role, consumed: string[], missed: string[], score}] }
+   ```
+
+4. `@uwf/eval-judge-hallucination` — hallucination judge data
+   ```
+   { perStep: [{role, hallucinations: string[], score}] }
+   ```
+
+5. `@uwf/eval-judge-token-stats` — token stats (not scored, informational)
+   ```
+   { totalInput, totalOutput, totalTurns, perStep: [{role, input, output, turns, duration}] }
+   ```
+
+#### CLI Design
+
+```bash
+# Run eval
+uwf-eval run <task-dir-or-tarball> [--agent hermes] [--model claude-sonnet-4] [--count 20]
+
+# View results
+uwf-eval report <run-hash>        # render via ocas render
+uwf-eval diff <hash1> <hash2>     # side-by-side comparison
+uwf-eval list                     # list past runs
+```
+
+### Phase 2: Task Package Scaffold
+
+Template for creating eval tasks. Also serves as the first real task.
+
+```
+eval-tasks/                        # shazhou/uwf-eval-tasks monorepo
+  packages/
+    _template/                     # copy-paste template
+      package.json
+      task.yaml
+      fixture/
+      src/judges/
+      tsconfig.json
+    fix-off-by-one/                # first real task
+      package.json                 # @uwf-eval/fix-off-by-one
+      task.yaml
+      fixture/
+        src/calc.ts                # buggy calculator
+        src/calc.test.ts           # test that exposes the bug
+        package.json
+      src/judges/
+        test-pass.ts               # runs pnpm test, checks exit code
+        code-quality.ts            # LLM judge: minimal change, correct fix
+      schemas/
+        test-pass.json             # OCAS schema for test-pass data
+        code-quality.json          # OCAS schema for code-quality data
+      tsconfig.json
+  pnpm-workspace.yaml
+  tsconfig.json
+  biome.json
+```
+
+#### task.yaml Format
+
+```yaml
+name: fix-off-by-one
+description: Fix an off-by-one error in a calculator's add function
+workflow: solve-issue              # registered workflow name, or relative path to .yaml
+prompt: "Fix the bug: add(1,2) returns 4 instead of 3"
+limits:
+  maxSteps: 15
+  timeoutMinutes: 30
+judges:
+  - name: frontmatter-compliance
+    weight: 0.15
+    builtin: true
+  - name: upstream-consumption
+    weight: 0.15
+    builtin: true
+  - name: hallucination
+    weight: 0.1
+    builtin: true
+  - name: token-stats
+    weight: 0                      # informational, not scored
+    builtin: true
+  - name: test-pass
+    weight: 0.3
+    entry: dist/judges/test-pass.js
+    schema: schemas/test-pass.json
+  - name: code-quality
+    weight: 0.3
+    entry: dist/judges/code-quality.js
+    schema: schemas/code-quality.json
+```
+
+#### Judge Script Contract
+
+```typescript
+// Input: process.argv = [node, script, cwd, threadId]
+// Output: stdout JSON
+// Exit 0 = success, non-zero = judge error (not low score)
+
+import type { JudgeOutput } from "@united-workforce/eval";
+
+const result: JudgeOutput<TestPassData> = {
+  score: 1.0,      // 0.0 - 1.0
+  data: {           // typed per judge schema
+    command: "pnpm test",
+    exitCode: 0,
+    output: "3 tests passed"
+  }
+};
+
+console.log(JSON.stringify(result));
+```
+
+### Phase 3: Prerequisite — $usage in Adapter Protocol (#68)
+
+Blocked by #68. Token stats judge needs `$usage` in step nodes.
+
+Can proceed with Phase 1+2 without it — token-stats judge just returns zeros until adapters report usage.
+
+## Implementation Order
+
+1. **Phase 1a**: `@united-workforce/eval` package scaffold + CLI skeleton + OCAS schemas
+2. **Phase 1b**: `run` command — prepare, execute, collect flow
+3. **Phase 1c**: Builtin judges — frontmatter (deterministic), upstream + hallucination (LLM-as-judge)
+4. **Phase 2a**: Create `shazhou/uwf-eval-tasks` monorepo with proman
+5. **Phase 2b**: First task `fix-off-by-one` with fixture repo + 2 custom judges
+6. **Phase 2c**: End-to-end test: `uwf-eval run packages/fix-off-by-one --agent hermes`
+7. **Phase 1d**: `report`, `diff`, `list` commands (read from CAS, render via ocas render)
+
+## Dependencies
+
+- `@ocas/core` + `@ocas/fs` — CAS storage
+- `@united-workforce/protocol` — step node types
+- `commander` — CLI framework (consistent with uwf)
+- LLM API access — for LLM-as-judge (upstream, hallucination, task-specific quality judges)
+
+## Open Questions
+
+1. **LLM-as-judge provider config** — reuse uwf's `~/.uwf/config.yaml` provider settings? Or separate config?
+2. **Workflow file location** — task.yaml references a workflow. Should the workflow YAML be inside the tarball, or reference a registered workflow by name?
+3. **Non-coding tasks** — debate workflow has no fixture repo. task.yaml needs `fixture: null` or simply omit the `fixture/` dir. Runner creates empty cwd.
+4. **Parallel judge execution** — judges are independent, can run in parallel. Worth the complexity?
+
+## Risks
+
+- LLM-as-judge consistency — same input may get different scores. Mitigation: run judge multiple times, take average? Or accept variance.
+- Token cost of judges — each LLM judge call costs tokens. For a 10-step workflow with 2 LLM judges = 20 LLM calls just for judging. Acceptable?
+- Fixture repo drift — if the fixture evolves, old eval runs become non-comparable. Pin fixture version in task.yaml.
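Note on the plan above: the `@uwf/eval-run` schema carries a per-judge `score` and `weight` plus an `overall` field, but the plan does not say how `overall` is derived. A minimal sketch, assuming a weight-normalized mean (the function name `overallScoreSketch` and the `JudgeScore` type are hypothetical, not part of this diff):

```typescript
// Hypothetical aggregation; the plan does not specify the real formula.
type JudgeScore = { name: string; score: number; weight: number };

function overallScoreSketch(judges: JudgeScore[]): number {
  // Sum of weights; judges with weight 0 (e.g. token-stats) contribute nothing.
  const totalWeight = judges.reduce((sum, j) => sum + j.weight, 0);
  // Assumption: with no scored judges, report 0 rather than dividing by zero.
  if (totalWeight === 0) return 0;
  const weighted = judges.reduce((sum, j) => sum + j.score * j.weight, 0);
  // Normalize so the result stays in 0.0 - 1.0 even if weights do not sum to 1.
  return weighted / totalWeight;
}

// Example input using the weights from the task.yaml sample (scores are made up).
const exampleOverall = overallScoreSketch([
  { name: "frontmatter-compliance", score: 1, weight: 0.15 },
  { name: "upstream-consumption", score: 0.5, weight: 0.15 },
  { name: "hallucination", score: 1, weight: 0.1 },
  { name: "token-stats", score: 0, weight: 0 },
  { name: "test-pass", score: 1, weight: 0.3 },
  { name: "code-quality", score: 0.5, weight: 0.3 },
]);
console.log(exampleOverall.toFixed(3));
```

Normalizing by the total weight means a task author can add or drop a judge without rebalancing the remaining weights, at the cost of hiding weights that were mistyped.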
@@ -0,0 +1,25 @@
+# Changelog
+
+## 0.1.0 (2026-06-05)
+
+Initial release of `@united-workforce/*` — a stateless workflow engine for AI agent orchestration.
+
+### Packages
+
+- **@united-workforce/protocol** — shared types (WorkflowPayload, StepNode, etc.)
+- **@united-workforce/util** — Crockford Base32, ULID, structured logger, frontmatter parsing
+- **@united-workforce/util-agent** — agent factory, context builder, extract pipeline
+- **@united-workforce/cli** — `uwf` CLI (thread lifecycle, status-based moderator, workflow registry)
+- **@united-workforce/eval** — `uwf-eval` CLI (prepare → execute → collect eval pipeline)
+- **@united-workforce/agent-hermes** — `uwf-hermes` adapter (Hermes Agent)
+- **@united-workforce/agent-claude-code** — `uwf-claude-code` adapter (Claude Code CLI)
+- **@united-workforce/agent-builtin** — `uwf-builtin` adapter (built-in LLM agent)
+- **@united-workforce/agent-mock** — `uwf-mock` adapter (deterministic test agent)
+
+### Highlights
+
+- Status-based graph routing (no LLM moderator cost)
+- CAS-backed immutable thread chains (`@ocas/core`)
+- Real token usage tracking (Hermes + Claude Code)
+- Eval framework with built-in judges (frontmatter, token-stats, test-pass)
+- `$SUSPEND` / resume for human-in-the-loop workflows
@@ -222,41 +222,42 @@ Test files (`__tests__/**`) are exempt.

 | Tool | Purpose |
 |------|---------|
-| **bun** | Package manager + runtime |
+| **pnpm** | Package manager |
 | **TypeScript** | Type checking (strict mode) |
 | **Biome** | Lint + format (replaces ESLint + Prettier) |
-| **vitest** | Test runner (`cli` uses vitest; other packages use `bun test`) |
+| **vitest** | Test runner (all packages) |

 ### Development Workflow

 ```bash
 # ── Setup ──
-bun install                 # install all workspace dependencies
+pnpm install                # install all workspace dependencies

 # ── Daily development ──
-bun run build               # tsc --build (all packages, dependency order)
-bun run check               # tsc --build + biome check + lint-log-tags
-bun run format              # biome format --write
-bun test                    # run tests across all packages
+pnpm run build              # build all packages (dependency order)
+pnpm run check              # biome check + lint-log-tags
+pnpm run typecheck          # tsc --build
+pnpm run test               # run tests across all packages

 # ── Before committing ──
-bun run check               # must pass — typecheck + lint + log tag validation
-bun test                    # must pass — all package tests
+pnpm run check              # must pass — lint + log tag validation
+pnpm run typecheck          # must pass — type checking
+pnpm run test               # must pass — all package tests
 ```

 ### Publishing

-All public `@united-workforce/*` packages are published to **npmjs.org** with **fixed mode** (all packages share the same version number).
+All public `@united-workforce/*` packages are published to **npmjs.org** with **independent versioning**.

 ```bash
 # 1. Add a changeset describing the change
-bun changeset
+npx changeset

-# 2. Bump all package versions + generate CHANGELOGs
-bun version
+# 2. Bump versions + generate CHANGELOGs
+proman bump

-# 3. Build, test, and publish (runs scripts/publish-all.mjs)
-bun release
+# 3. Build, test, and publish
+proman publish

 # Or publish manually with a tag:
 node scripts/publish-all.mjs --tag alpha
@@ -265,7 +266,7 @@ node scripts/publish-all.mjs --dry-run    # preview without publishing

 - `workspace:^` dependencies resolve to `^x.y.z` on publish
 - Publish order defined in `scripts/publish-all.mjs` (dependency order)
-- Changesets config: `.changeset/config.json` (fixed mode, public access)
+- Changesets config: `.changeset/config.json` (independent versioning, public access)

 ### End-to-end: Author → Register → Run

@@ -470,7 +470,7 @@ Use the `ocas` CLI for direct CAS operations (`~/.ocas/` store, shared with `uwf

 | Tool | Purpose |
 |------|---------|
-| **bun** | Package manager + runtime |
+| **pnpm** | Package manager |
 | **TypeScript** | Type checking (strict mode) |
 | **Biome** | Lint + format |
 | **vitest** | Test runner |
@@ -17,7 +17,7 @@ The root README should have these sections in order:
 4. **Packages** — table with ALL packages from packages/ directory, columns: Package, Description, Type (cli/lib/agent/app)
 5. **Quick Start** — install, build, register workflow, start thread, run step
 6. **CLI Reference** — brief command list, detailed usage in cli README
-7. **Development** — bun install / build / check / test
+7. **Development** — pnpm install / build / check / test

 ## Per-Package README Structure

@@ -26,7 +26,7 @@ Each package README should have:
 1. **Title** — package name
 2. **One-line description** — matching package.json
 3. **Overview** — what it does, where it sits in the architecture, dependencies
-4. **Installation** — bun add (for libs) or "included as binary" (for cli/agents)
+4. **Installation** — pnpm add (for libs) or "included as binary" (for cli/agents)
 5. **API** (lib packages) — all exports from src/index.ts with type signatures, grouped by category, minimal usage examples
 6. **CLI Usage** (cli/agent packages) — command reference with examples
 7. **Internal Structure** — brief src/ file organization
@@ -56,7 +56,7 @@ For each package read:
 - All relative links work
 - Package names match package.json
 - No references to removed/renamed packages
-- bun run build still passes
+- pnpm run build still passes

 ## Guidelines

@@ -23,7 +23,7 @@ roles:
      type: object
      properties:
        $status:
-          enum: ["_"]
+          enum: ["done"]
        thesis:
          type: string
        keyPoints:
@@ -37,4 +37,4 @@ graph:
  $START:
    _: { role: "analyst", prompt: "Analyze the topic in the task and produce a structured summary with key points." }
  analyst:
-    _: { role: "$END", prompt: "Analysis complete. Finish the workflow." }
+    done: { role: "$END", prompt: "Analysis complete. Finish the workflow." }
@@ -0,0 +1,30 @@
+name: eval-simple
+description: "Single-role eval workflow: fixer takes prompt, fixes code, done."
+roles:
+  fixer:
+    description: "Fixes the code based on the prompt"
+    goal: |
+      You are a code fixer. Read the prompt, understand the bug, fix it, and verify by running the tests.
+    capabilities:
+      - code-editing
+      - test-running
+    procedure: |
+      1. Read the prompt to understand what needs to be fixed
+      2. Fix the bug in the source code
+      3. Run the tests mentioned in the prompt to verify
+      4. Output $status=done when tests pass
+    output: "Describe what you fixed and confirm tests pass. Set $status to done."
+    frontmatter:
+      type: object
+      properties:
+        $status:
+          type: string
+          enum: [done]
+        summary:
+          type: string
+      required: [$status, summary]
+graph:
+  $START:
+    _: { role: "fixer", prompt: "Fix the code issue described in the task prompt." }
+  fixer:
+    done: { role: "$END", prompt: "Fix complete." }
@@ -1,6 +1,6 @@
 {
  "name": "@united-workforce/agent-builtin",
-  "version": "0.5.0",
+  "version": "0.1.1",
  "files": [
    "src",
    "dist",
@@ -8,7 +8,7 @@
  ],
  "type": "module",
  "bin": {
-    "uwf-builtin": "./src/cli.ts"
+    "uwf-builtin": "./dist/cli.js"
  },
  "exports": {
    ".": {
@@ -17,7 +17,6 @@
    }
  },
  "scripts": {
-    "prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
    "test": "vitest run __tests__/",
    "test:ci": "vitest run __tests__/"
  },
@@ -82,7 +82,13 @@ async function runBuiltinWithMessages(

  if (loopResult.turnCount === 0) {
    log("5RWTK9NB", "no turns produced, returning empty output");
-    return { output: "", detailHash: "", sessionId: session.sessionId, assembledPrompt: "" };
+    return {
+      output: "",
+      detailHash: "",
+      sessionId: session.sessionId,
+      assembledPrompt: "",
+      usage: null,
+    };
  }

  // Read jsonl → persist turns to CAS → store detail
@@ -99,6 +105,7 @@ async function runBuiltinWithMessages(
    detailHash,
    sessionId: session.sessionId,
    assembledPrompt: "",
+    usage: null,
  };
 }

@@ -1,5 +1,12 @@
 #!/usr/bin/env node

+// eslint-disable-next-line -- dynamic import for version
+const pkg = await import("../package.json", { with: { type: "json" } });
+if (process.argv.includes("--version") || process.argv.includes("-V")) {
+  process.stdout.write(`${pkg.default.version}\n`);
+  process.exit(0);
+}
+
 import { createBuiltinAgent } from "./agent.js";

 const main = createBuiltinAgent();
@@ -1,6 +1,6 @@
 {
  "name": "@united-workforce/agent-claude-code",
-  "version": "0.1.0",
+  "version": "0.1.1",
  "files": [
    "src",
    "dist",
@@ -8,7 +8,7 @@
  ],
  "type": "module",
  "bin": {
-    "uwf-claude-code": "./src/cli.ts"
+    "uwf-claude-code": "./dist/cli.js"
  },
  "exports": {
    ".": {
@@ -17,12 +17,12 @@
    }
  },
  "scripts": {
-    "prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
    "test": "vitest run __tests__/",
    "test:ci": "vitest run __tests__/"
  },
  "dependencies": {
    "@ocas/core": "^0.3.0",
+    "@united-workforce/protocol": "workspace:^",
    "@united-workforce/util": "workspace:^",
    "@united-workforce/util-agent": "workspace:^"
  },
@@ -1,5 +1,6 @@
 import { spawn } from "node:child_process";
 import type { Store } from "@ocas/core";
+import type { Usage } from "@united-workforce/protocol";
 import { createLogger } from "@united-workforce/util";
 import {
  type AgentContext,
@@ -145,7 +146,14 @@ async function processClaudeOutput(
      );
    }

-    return { output, detailHash, sessionId, assembledPrompt };
+    const usage: Usage = {
+      turns: parsed.numTurns,
+      inputTokens: parsed.usage.inputTokens,
+      outputTokens: parsed.usage.outputTokens,
+      duration: Math.round(parsed.durationMs / 1000),
+    };
+
+    return { output, detailHash, sessionId, assembledPrompt, usage };
  }

  // Truly unparseable output - provide enhanced error message
@@ -1,5 +1,12 @@
 #!/usr/bin/env node

+// eslint-disable-next-line -- dynamic import for version
+const pkg = await import("../package.json", { with: { type: "json" } });
+if (process.argv.includes("--version") || process.argv.includes("-V")) {
+  process.stdout.write(`${pkg.default.version}\n`);
+  process.exit(0);
+}
+
 import { createClaudeCodeAgent } from "./claude-code.js";

 const model = process.env.CLAUDE_MODEL ?? null;
@@ -2,5 +2,5 @@
  "extends": "../../tsconfig.json",
  "compilerOptions": { "rootDir": "src", "outDir": "dist" },
  "include": ["src"],
-  "references": [{ "path": "../util-agent" }]
+  "references": [{ "path": "../protocol" }, { "path": "../util-agent" }]
 }
@@ -0,0 +1,18 @@
+# @united-workforce/agent-hermes
+
+## 0.1.1
+
+### Patch Changes
+
+- 8085d1d: fix: read token usage from ACP PromptResponse instead of DB
+
+  Token counts (inputTokens, outputTokens) now come from the ACP
+  `PromptResponse.usage` field, which is populated synchronously from
+  `run_conversation()` return data — no WAL race condition.
+
+  Turns (assistant message count) still come from the DB via
+  `snapshotTurns()` before/after delta.
+
+  Previously both tokens and turns were read from the Hermes state DB
+  after the ACP prompt returned, but due to WAL write lag the DB often
+  had incomplete token data at read time (e.g. 235 vs actual 26,080).
@@ -1,55 +0,0 @@
-import { afterEach, beforeEach, describe, expect, it } from "vitest";
-import { HermesAcpClient } from "../../src/acp-client.js";
-
-const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
-
-describe("HermesAcpClient", () => {
-  let client: HermesAcpClient;
-
-  beforeEach(() => {
-    client = new HermesAcpClient();
-  });
-
-  afterEach(async () => {
-    await client.close();
-  });
-
-  it(
-    "connect() returns a UUID sessionId",
-    async () => {
-      const sessionId = await client.connect(process.cwd());
-      expect(typeof sessionId).toBe("string");
-      expect(sessionId).toMatch(UUID_RE);
-    },
-    { timeout: 2 * 60 * 1000 },
-  );
-
-  it(
-    "prompt() returns a non-empty text response",
-    async () => {
-      await client.connect(process.cwd());
-      const result = await client.prompt("Reply with exactly the word: PONG");
-      expect(typeof result.text).toBe("string");
-      expect(result.text.length).toBeGreaterThan(0);
-      expect(typeof result.sessionId).toBe("string");
-      expect(result.sessionId).toMatch(UUID_RE);
-    },
-    { timeout: 2 * 60 * 1000 },
-  );
-
-  it(
-    "prompt() can be called twice on the same session (resume)",
-    async () => {
-      await client.connect(process.cwd());
-
-      const first = await client.prompt("Say the word ALPHA and nothing else.");
-      expect(first.text.length).toBeGreaterThan(0);
-
-      const second = await client.prompt("Now say the word BETA and nothing else.");
-      expect(second.text.length).toBeGreaterThan(0);
-
-      expect(first.sessionId).toBe(second.sessionId);
-    },
-    { timeout: 2 * 60 * 1000 },
-  );
-});
@@ -1,56 +0,0 @@
-import { afterEach, describe, expect, it } from "vitest";
-import { HermesAcpClient } from "../../src/acp-client.js";
-
-/**
- * E2E test for cross-process session resume.
- *
- * Simulates the workflow re-entry scenario:
- * 1. Client A: connect → prompt → close (developer first run)
- * 2. Client B: resume(sessionId) → prompt (developer re-entry after reviewer reject)
- *
- * This is what happens when uwf thread step spawns uwf-hermes twice for the same role.
- */
-describe("HermesAcpClient cross-process resume", () => {
-  const clients: HermesAcpClient[] = [];
-
-  afterEach(async () => {
-    for (const c of clients) {
-      await c.close();
-    }
-    clients.length = 0;
-  });
-
-  // TODO(#435): flaky — depends on live LLM; mock or move to integration suite
-  it.skip(
-    "resume() after close — second prompt returns non-empty text",
-    async () => {
-      // --- Client A: first run ---
-      const clientA = new HermesAcpClient();
-      clients.push(clientA);
-
-      await clientA.connect(process.cwd());
-      const first = await clientA.prompt(
-        "Remember the secret code: WATERMELON. Reply with exactly: ACKNOWLEDGED",
-      );
-      expect(first.text.length).toBeGreaterThan(0);
-      const sessionId = first.sessionId;
-
-      // Close client A (simulates uwf-hermes process exit)
-      await clientA.close();
-
-      // --- Client B: resume (simulates re-entry) ---
-      const clientB = new HermesAcpClient();
-      clients.push(clientB);
-
-      await clientB.resume(sessionId, process.cwd());
-      const second = await clientB.prompt(
-        "What was the secret code I told you earlier? Reply with just the code word.",
-      );
-
-      // The critical assertion: resumed session produces non-empty output
-      expect(second.text.length).toBeGreaterThan(0);
-      expect(second.sessionId).toBe(sessionId);
-    },
-    { timeout: 3 * 60 * 1000 },
-  );
-});
@@ -140,7 +140,9 @@ function createTestDb(dbPath: string): TestDb {
  db.exec(`CREATE TABLE sessions (
    id TEXT PRIMARY KEY,
    model TEXT NOT NULL,
-    started_at INTEGER NOT NULL
+    started_at INTEGER NOT NULL,
+    input_tokens INTEGER DEFAULT 0,
+    output_tokens INTEGER DEFAULT 0
  )`);
  db.exec(`CREATE TABLE messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
@@ -0,0 +1,122 @@
+import { describe, expect, test } from "vitest";
+import type { AcpUsage } from "../src/acp-client.js";
+import { buildUsage, snapshotTurns } from "../src/hermes.js";
+import type { HermesSessionJson } from "../src/types.js";
+
+function makeSession(overrides: Partial<HermesSessionJson> = {}): HermesSessionJson {
+  return {
+    session_id: "test-session",
+    model: "test-model",
+    session_start: "2026-01-01T00:00:00Z",
+    messages: [],
+    inputTokens: 0,
+    outputTokens: 0,
+    ...overrides,
+  };
+}
+
+describe("snapshotTurns", () => {
+  test("returns zero for null session", () => {
+    const result = snapshotTurns(null);
+    expect(result).toEqual({ turns: 0 });
+  });
+
+  test("returns zero for empty session", () => {
+    const result = snapshotTurns(makeSession());
+    expect(result).toEqual({ turns: 0 });
+  });
+
+  test("counts assistant messages as turns", () => {
+    const result = snapshotTurns(
+      makeSession({
+        messages: [
+          { role: "user", content: "hello", reasoning: null, tool_calls: null },
+          { role: "assistant", content: "hi", reasoning: null, tool_calls: null },
+          { role: "user", content: "do X", reasoning: null, tool_calls: null },
+          { role: "tool", content: "result", reasoning: null, tool_calls: null },
+          { role: "assistant", content: "done", reasoning: null, tool_calls: null },
+        ],
+        inputTokens: 1000,
+        outputTokens: 500,
+      }),
+    );
+    expect(result).toEqual({ turns: 2 });
+  });
+
+  test("ignores non-assistant messages for turn count", () => {
+    const result = snapshotTurns(
+      makeSession({
+        messages: [
+          { role: "user", content: "hello", reasoning: null, tool_calls: null },
+          { role: "tool", content: "result", reasoning: null, tool_calls: null },
+        ],
+      }),
+    );
+    expect(result.turns).toBe(0);
+  });
+});
+
+describe("buildUsage", () => {
+  const acpUsage: AcpUsage = { inputTokens: 5000, outputTokens: 2000, totalTokens: 7000 };
+
+  test("first visit: tokens from ACP, turns from DB delta", () => {
+    const beforeTurns = { turns: 0 };
+    const afterTurns = { turns: 3 };
+    const result = buildUsage(acpUsage, beforeTurns, afterTurns, 12.5);
+    expect(result).toEqual({
+      turns: 3,
+      inputTokens: 5000,
+      outputTokens: 2000,
+      duration: 13,
+    });
+  });
+
+  test("re-entry: turn delta computed correctly, tokens from ACP", () => {
+    const beforeTurns = { turns: 2 };
+    const afterTurns = { turns: 4 };
+    const acpDelta: AcpUsage = { inputTokens: 8000, outputTokens: 3500, totalTokens: 11500 };
+    const result = buildUsage(acpDelta, beforeTurns, afterTurns, 7.3);
+    expect(result).toEqual({
+      turns: 2,
+      inputTokens: 8000,
+      outputTokens: 3500,
+      duration: 7,
+    });
+  });
+
+  test("floors negative turn deltas at 0, then defaults to 1", () => {
+    const beforeTurns = { turns: 5 };
+    const afterTurns = { turns: 3 };
+    const result = buildUsage(acpUsage, beforeTurns, afterTurns, 1.0);
+    // turns would be negative (-2), floored to 0, then || 1 gives 1
+    expect(result.turns).toBe(1);
+  });
+
+  test("zero turns delta defaults to 1 (at least one turn happened)", () => {
+    const beforeTurns = { turns: 3 };
+    const afterTurns = { turns: 3 };
+    const result = buildUsage(acpUsage, beforeTurns, afterTurns, 5.0);
+    // turns delta is 0, || 1 gives 1
+    expect(result.turns).toBe(1);
+  });
+
+  test("null ACP usage yields zero tokens", () => {
+    const beforeTurns = { turns: 0 };
+    const afterTurns = { turns: 2 };
+    const result = buildUsage(null, beforeTurns, afterTurns, 10.0);
+    expect(result).toEqual({
+      turns: 2,
+      inputTokens: 0,
+      outputTokens: 0,
+      duration: 10,
+    });
+  });
+
+  test("duration is rounded", () => {
+    const beforeTurns = { turns: 0 };
+    const afterTurns = { turns: 1 };
+    expect(buildUsage(acpUsage, beforeTurns, afterTurns, 3.7).duration).toBe(4);
+    expect(buildUsage(acpUsage, beforeTurns, afterTurns, 3.2).duration).toBe(3);
+    expect(buildUsage(acpUsage, beforeTurns, afterTurns, 0.0).duration).toBe(0);
+  });
+});
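Note on the tests above: the `buildUsage` implementation in `src/hermes.ts` is not part of the visible hunks, so the behaviour has to be read off the assertions. A minimal sketch that would satisfy them, assuming the argument order shown in the tests (the name `buildUsageSketch` and the local type aliases are hypothetical; the real function may differ):

```typescript
// Hypothetical re-implementation inferred from the test expectations only.
type AcpUsage = { inputTokens: number; outputTokens: number; totalTokens: number };
type TurnSnapshot = { turns: number };
type Usage = { turns: number; inputTokens: number; outputTokens: number; duration: number };

function buildUsageSketch(
  acpUsage: AcpUsage | null,
  before: TurnSnapshot,
  after: TurnSnapshot,
  durationSeconds: number,
): Usage {
  // Turn delta from the DB snapshots: floor negatives at 0, then treat 0 as 1
  // because at least one turn must have happened for the prompt to return.
  const turns = Math.max(0, after.turns - before.turns) || 1;
  return {
    turns,
    // Tokens come from the ACP response; a missing usage object yields zeros.
    inputTokens: acpUsage?.inputTokens ?? 0,
    outputTokens: acpUsage?.outputTokens ?? 0,
    // Duration is reported in whole seconds.
    duration: Math.round(durationSeconds),
  };
}

const firstVisit = buildUsageSketch(
  { inputTokens: 5000, outputTokens: 2000, totalTokens: 7000 },
  { turns: 0 },
  { turns: 3 },
  12.5,
);
console.log(JSON.stringify(firstVisit));
```

Keeping tokens and turns on separate inputs mirrors the changeset note: token counts are taken from the synchronous ACP response, while only the turn count still depends on the state DB.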
@@ -1,6 +1,6 @@
 {
  "name": "@united-workforce/agent-hermes",
-  "version": "0.5.0",
+  "version": "0.1.2",
  "files": [
    "src",
    "dist",
@@ -8,7 +8,7 @@
  ],
  "type": "module",
  "bin": {
-    "uwf-hermes": "./src/cli.ts"
+    "uwf-hermes": "./dist/cli.js"
  },
  "exports": {
    ".": {
@@ -17,7 +17,6 @@
    }
  },
  "scripts": {
-    "prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
    "test": "vitest run __tests__/",
    "test:ci": "vitest run __tests__/"
  },
@@ -1,6 +1,16 @@
 import type { ChildProcess } from "node:child_process";
 import { spawn } from "node:child_process";
+import { readFileSync } from "node:fs";
+import { dirname, join } from "node:path";
 import { createInterface } from "node:readline";
+import { fileURLToPath } from "node:url";
+
+const __dirname = dirname(fileURLToPath(import.meta.url));
+const OWN_VERSION = (
+  JSON.parse(readFileSync(join(__dirname, "..", "package.json"), "utf-8")) as {
+    version: string;
+  }
+).version;

 const HERMES_COMMAND = "hermes";
 const PROTOCOL_VERSION = 1;
@@ -17,9 +27,17 @@ type PendingRequest = {
  reject: (reason: Error) => void;
 };

+/** Token usage returned by ACP PromptResponse. */
+export type AcpUsage = {
+  inputTokens: number;
+  outputTokens: number;
+  totalTokens: number;
+};
+
 export type AcpPromptResult = {
  text: string;
  sessionId: string;
+  usage: AcpUsage | null;
 };

 export class HermesAcpClient {
@@ -72,6 +90,11 @@ export class HermesAcpClient {
    return sessionId;
  }

+  /** Return the current session ID, or null if not connected. */
+  getSessionId(): string | null {
+    return this.sessionId;
+  }
+
  /** Send prompt and collect final assistant text from ACP stream chunks. */
  async prompt(text: string): Promise<AcpPromptResult> {
    if (this.sessionId === null) {
@@ -91,9 +114,25 @@ export class HermesAcpClient {
      );
    }

+    // Extract token usage from ACP PromptResponse.result.usage (camelCase wire format)
+    const result = (response as { result?: Record<string, unknown> }).result;
+    const rawUsage = result?.usage as Record<string, unknown> | undefined;
+    const usage: AcpUsage | null =
+      rawUsage !== undefined &&
+      typeof rawUsage.inputTokens === "number" &&
+      typeof rawUsage.outputTokens === "number" &&
+      typeof rawUsage.totalTokens === "number"
+        ? {
+            inputTokens: rawUsage.inputTokens,
+            outputTokens: rawUsage.outputTokens,
+            totalTokens: rawUsage.totalTokens,
+          }
+        : null;
+
    return {
      text: this.messageChunks.join(""),
      sessionId: this.sessionId,
+      usage,
    };
  }

@@ -270,7 +309,7 @@ export class HermesAcpClient {
  private async initialize(): Promise<void> {
    const initResponse = await this.sendRequest("initialize", {
      protocolVersion: PROTOCOL_VERSION,
-      clientInfo: { name: "uwf", version: "0.1.0" },
+      clientInfo: { name: "uwf-hermes", version: OWN_VERSION },
      capabilities: {},
    });

@@ -1,5 +1,12 @@
 #!/usr/bin/env node

+// eslint-disable-next-line -- dynamic import for version
+const pkg = await import("../package.json", { with: { type: "json" } });
+if (process.argv.includes("--version") || process.argv.includes("-V")) {
+  process.stdout.write(`${pkg.default.version}\n`);
+  process.exit(0);
+}
+
 import { createHermesAgent } from "./hermes.js";
 import { isResumeDisabled } from "./session-cache.js";

@@ -1,4 +1,5 @@
 import type { Store } from "@ocas/core";
+import type { Usage } from "@united-workforce/protocol";
 import { createLogger } from "@united-workforce/util";
 import {
  type AgentContext,
@@ -7,13 +8,50 @@ import {
  buildRolePrompt,
  createAgent,
 } from "@united-workforce/util-agent";
-
+import type { AcpUsage } from "./acp-client.js";
 import { HermesAcpClient } from "./acp-client.js";
 import { getCachedSessionId, setCachedSessionId } from "./session-cache.js";
 import { loadHermesSession, storeHermesSessionDetail } from "./session-detail.js";
+import type { HermesSessionJson } from "./types.js";

 const log = createLogger({ sink: { kind: "stderr" } });

+/** Snapshot of session metrics taken before and after a prompt call. */
+type TurnsSnapshot = {
+  turns: number;
+};
+
+const ZERO_TURNS: TurnsSnapshot = { turns: 0 };
+
+/** Extract assistant turn count from a session. Returns zero for null sessions. */
+export function snapshotTurns(session: HermesSessionJson | null): TurnsSnapshot {
+  if (session === null) {
+    return ZERO_TURNS;
+  }
+  return {
+    turns: session.messages.filter((m) => m.role === "assistant").length,
+  };
+}
+
+/**
+ * Build Usage from ACP token data + DB turn delta.
+ * Tokens come from ACP PromptResponse (synchronous, accurate).
+ * Turns come from DB before/after snapshots (may have WAL lag, but acceptable).
+ */
+export function buildUsage(
+  acpUsage: AcpUsage | null,
+  beforeTurns: TurnsSnapshot,
+  afterTurns: TurnsSnapshot,
+  durationSec: number,
+): Usage {
+  return {
+    turns: Math.max(0, afterTurns.turns - beforeTurns.turns) || 1,
+    inputTokens: acpUsage?.inputTokens ?? 0,
+    outputTokens: acpUsage?.outputTokens ?? 0,
+    duration: Math.round(durationSec),
+  };
+}
+
 /** Assemble system prompt, task, and prior step outputs for Hermes. */
 export function buildHermesPrompt(ctx: AgentContext): string {
  const parts: string[] = [];
@@ -108,25 +146,45 @@ export function createHermesAgent(resumeDisabled: boolean): () => Promise<void>
    void client.close();
  });

-  async function runPrompt(ctx: AgentContext, useContinuation: boolean): Promise<AgentRunResult> {
+  async function runPrompt(
+    ctx: AgentContext,
+    useContinuation: boolean,
+    beforeTurns: TurnsSnapshot,
+  ): Promise<AgentRunResult> {
    const effectiveCtx = useContinuation ? ctx : { ...ctx, isFirstVisit: true };
    const fullPrompt = buildHermesPrompt(effectiveCtx);
-    const { text, sessionId } = await client.prompt(fullPrompt);
+    const startMs = Date.now();
+    const { text, sessionId, usage: acpUsage } = await client.prompt(fullPrompt);
+    const durationSec = (Date.now() - startMs) / 1000;
    const { detailHash } = await storePromptResult(ctx.store, sessionId);

    if (!resumeDisabled) {
      await setCachedSessionId(ctx.threadId, ctx.role, sessionId, ctx.storageRoot);
    }

-    return { output: text, detailHash, sessionId, assembledPrompt: fullPrompt };
+    // Turns from DB (may lag slightly due to WAL, but acceptable)
+    const afterSession = await loadHermesSession(sessionId);
+    const afterTurns = snapshotTurns(afterSession);
+    const usage = buildUsage(acpUsage, beforeTurns, afterTurns, durationSec);
+
+    return { output: text, detailHash, sessionId, assembledPrompt: fullPrompt, usage };
  }

  async function runHermes(ctx: AgentContext): Promise<AgentRunResult> {
    const cwd = process.cwd();
    const attempt = await prepareSession(client, ctx, cwd, resumeDisabled);

+    // Snapshot before prompt: for resumed sessions, captures cumulative state
+    // so we can compute the turn delta. For new sessions, this is ZERO_TURNS.
+    const currentSessionId = client.getSessionId();
+    const beforeSession =
+      attempt.resumed && currentSessionId !== null
+        ? await loadHermesSession(currentSessionId)
+        : null;
+    const beforeTurns = snapshotTurns(beforeSession);
+
    try {
-      return await runPrompt(ctx, attempt.useContinuation);
+      return await runPrompt(ctx, attempt.useContinuation, beforeTurns);
    } catch (error) {
      if (!attempt.resumed) {
        throw error;
@@ -136,7 +194,8 @@ export function createHermesAgent(resumeDisabled: boolean): () => Promise<void>
      log("8FQW2R6N", `continuation prompt failed, retrying with initial prompt: ${message}`);
      await client.close();
      await client.connect(cwd);
-      return runPrompt(ctx, false);
+      // Fresh session after retry — reset snapshot to zero
+      return runPrompt(ctx, false, ZERO_TURNS);
    }
  }

@@ -147,9 +206,22 @@ export function createHermesAgent(resumeDisabled: boolean): () => Promise<void>
  ): Promise<AgentRunResult> {
    // Client is already connected from runHermes — same ACP session,
    // so the agent sees the full conversation history (crucial for retries).
-    const { text, sessionId } = await client.prompt(message);
+    // Snapshot turns before the continuation prompt for delta computation.
+    const currentSessionId = client.getSessionId();
+    const beforeSession =
+      currentSessionId !== null ? await loadHermesSession(currentSessionId) : null;
+    const beforeTurns = snapshotTurns(beforeSession);
+
+    const startMs = Date.now();
+    const { text, sessionId, usage: acpUsage } = await client.prompt(message);
+    const durationSec = (Date.now() - startMs) / 1000;
    const { detailHash } = await storePromptResult(store, sessionId);
-    return { output: text, detailHash, sessionId, assembledPrompt: "" };
+
+    const afterSession = await loadHermesSession(sessionId);
+    const afterTurns = snapshotTurns(afterSession);
+    const usage = buildUsage(acpUsage, beforeTurns, afterTurns, durationSec);
+
+    return { output: text, detailHash, sessionId, assembledPrompt: "", usage };
  }

  const agentMain = createAgent({
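Reviewer note: the turn-delta and fallback behaviour of `buildUsage` in the hunks above can be exercised in isolation. The sketch below is a standalone approximation, not the shipped module. The three types are local, hypothetical copies of `AcpUsage`, `TurnsSnapshot`, and the protocol `Usage` shape, inferred from the fields visible in this diff.

```typescript
// Hypothetical local copies of the types; the real ones live in the packages touched by this diff.
type AcpUsage = { inputTokens: number; outputTokens: number; totalTokens: number };
type TurnsSnapshot = { turns: number };
type Usage = { turns: number; inputTokens: number; outputTokens: number; duration: number };

function buildUsage(
  acpUsage: AcpUsage | null,
  before: TurnsSnapshot,
  after: TurnsSnapshot,
  durationSec: number,
): Usage {
  // Clamp negative deltas to 0, then report at least one turn per prompt call.
  const delta = Math.max(0, after.turns - before.turns);
  return {
    turns: delta === 0 ? 1 : delta,
    // Missing ACP usage degrades to zero tokens rather than failing the step.
    inputTokens: acpUsage?.inputTokens ?? 0,
    outputTokens: acpUsage?.outputTokens ?? 0,
    duration: Math.round(durationSec),
  };
}

// Resumed session: two new assistant turns since the "before" snapshot.
const resumed = buildUsage(
  { inputTokens: 120, outputTokens: 30, totalTokens: 150 },
  { turns: 4 },
  { turns: 6 },
  3.7,
);

// DB has not caught up yet (equal snapshots) and ACP returned no usage.
const lagging = buildUsage(null, { turns: 4 }, { turns: 4 }, 0.2);

console.log(resumed, lagging);
```

The `delta === 0 ? 1 : delta` branch mirrors the `|| 1` in the diff: a prompt that completed always counts as at least one turn, even when the "after" snapshot lags behind.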
@@ -1,2 +1,8 @@
+export type { AcpUsage } from "./acp-client.js";
 export { HermesAcpClient } from "./acp-client.js";
-export { buildHermesPrompt, createHermesAgent } from "./hermes.js";
+export {
+  buildHermesPrompt,
+  buildUsage,
+  createHermesAgent,
+  snapshotTurns,
+} from "./hermes.js";
@@ -106,7 +106,7 @@ function parseSessionJson(raw: unknown): HermesSessionJson | null {
      messages.push(msg);
    }
  }
-  return { session_id, model, session_start, messages };
+  return { session_id, model, session_start, messages, inputTokens: 0, outputTokens: 0 };
 }

 export function getHermesDbPath(): string {
@@ -117,6 +117,8 @@ type DbSessionRow = {
  id: string;
  model: string;
  started_at: number;
+  input_tokens: number;
+  output_tokens: number;
 };

 type DbMessageRow = {
@@ -156,7 +158,9 @@ export function loadHermesSessionFromDb(
  try {
    db = new DatabaseSync(resolvedPath, { readOnly: true });
    const session = db
-      .prepare("SELECT id, model, started_at FROM sessions WHERE id = ?")
+      .prepare(
+        "SELECT id, model, started_at, input_tokens, output_tokens FROM sessions WHERE id = ?",
+      )
      .get(sessionId) as DbSessionRow | null;
    if (session === null) {
      return null;
@@ -181,6 +185,8 @@ export function loadHermesSessionFromDb(
      model: session.model,
      session_start: new Date(session.started_at * 1000).toISOString(),
      messages,
+      inputTokens: session.input_tokens ?? 0,
+      outputTokens: session.output_tokens ?? 0,
    };
  } catch {
    return null;
@@ -40,4 +40,6 @@ export type HermesSessionJson = {
  model: string;
  session_start: string;
  messages: HermesSessionMessage[];
+  inputTokens: number;
+  outputTokens: number;
 };
@@ -1,6 +1,6 @@
 {
  "name": "@united-workforce/agent-mock",
-  "version": "0.5.0",
+  "version": "0.1.1",
  "files": [
    "src",
    "dist",
@@ -17,7 +17,6 @@
    }
  },
  "scripts": {
-    "prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
    "test": "vitest run __tests__/",
    "test:ci": "vitest run __tests__/"
  },
@@ -1,5 +1,12 @@
 #!/usr/bin/env node

+// eslint-disable-next-line -- dynamic import for version
+const pkg = await import("../package.json", { with: { type: "json" } });
+if (process.argv.includes("--version") || process.argv.includes("-V")) {
+  process.stdout.write(`${pkg.default.version}\n`);
+  process.exit(0);
+}
+
 import { createMockAgent } from "./mock-agent.js";

 const USAGE = "usage: uwf-mock --mock-data <path> --thread <id> --role <role> --prompt <text>";
@@ -103,6 +103,7 @@ export function createMockAgent(mockDataPath: string): () => Promise<void> {
      detailHash,
      sessionId,
      assembledPrompt: "",
+      usage: { turns: 1, inputTokens: 0, outputTokens: 0, duration: 0 },
    };
    lastResult = result;
    return result;
@@ -0,0 +1,9 @@
+# @united-workforce/cli
+
+## 0.1.1
+
+### Patch Changes
+
+- 850a3b2: fix: resolve --agent override via config alias before raw command
+
+  `resolveAgentConfig()` now checks `config.agents[alias]` first before falling back to `parseAgentOverride()`. Eval CLI default `--agent` changed from `"hermes"` to `"uwf-hermes"`.
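Reviewer note: the alias-first lookup described in this changelog entry might look roughly like the sketch below. Only the names `resolveAgentConfig`, `parseAgentOverride`, and `config.agents` come from the entry. The `AgentConfig` shape, the function signatures, and the whitespace-splitting fallback are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real config types are not shown in this diff.
type AgentConfig = { command: string; args: string[] };
type Config = { agents: Record<string, AgentConfig> };

// Hypothetical stand-in for parseAgentOverride(): treat the override as a raw command line.
function parseAgentOverride(raw: string): AgentConfig {
  const [command = "", ...args] = raw.trim().split(/\s+/);
  return { command, args };
}

function resolveAgentConfig(config: Config, override: string): AgentConfig {
  // Check own keys only, so an override like "toString" is never treated as an alias.
  const aliased = Object.prototype.hasOwnProperty.call(config.agents, override)
    ? config.agents[override]
    : undefined;
  // Alias wins; anything else falls back to raw-command parsing.
  return aliased ?? parseAgentOverride(override);
}

const config: Config = {
  agents: { "uwf-hermes": { command: "uwf-hermes", args: ["--resume"] } },
};

const viaAlias = resolveAgentConfig(config, "uwf-hermes");
const viaRaw = resolveAgentConfig(config, "node ./agent.js");
console.log(viaAlias, viaRaw);
```

With this ordering, `--agent uwf-hermes` picks up the configured arguments instead of being run as a bare command.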
@@ -1,6 +1,6 @@
 {
  "name": "@united-workforce/cli",
-  "version": "0.5.0",
+  "version": "0.1.1",
  "files": [
    "src",
    "dist",
@@ -22,7 +22,6 @@
    "yaml": "^2.8.4"
  },
  "scripts": {
-    "prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
    "test": "vitest run src/",
    "test:ci": "vitest run src/"
  },
@@ -42,7 +42,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["done"] }
 graph:
  $START:
    _:
@@ -59,7 +59,7 @@ graph:
      prompt: "Try again"
      location: null
  roleB:
-    _:
+    done:
      role: $END
      prompt: "Done"
      location: null
@@ -92,7 +92,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["done"] }
  roleC:
    description: Fail role
    goal: Do C
@@ -104,7 +104,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["done"] }
 graph:
  $START:
    _:
@@ -121,12 +121,12 @@ graph:
      prompt: "Do C (fail)"
      location: null
  roleB:
-    _:
+    done:
      role: $END
      prompt: "Done"
      location: null
  roleC:
-    _:
+    done:
      role: $END
      prompt: "Done"
      location: null
@@ -147,7 +147,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["done"] }
 graph:
  $START:
    _:
@@ -155,7 +155,7 @@ graph:
      prompt: "Work"
      location: null
  worker:
-    _:
+    done:
      role: $END
      prompt: "Done"
      location: null
@@ -324,7 +324,7 @@ describe("currentRole field", () => {
    try {
      const wf = join(tmpDir, "test-current-role.yaml");
      await writeFile(wf, SIMPLE_WORKFLOW_YAML, "utf8");
-      const { thread } = await cmdThreadStart(storageRoot, wf, "test", tmpDir);
+      const { thread, workflow } = await cmdThreadStart(storageRoot, wf, "test", tmpDir);
      const tid = thread as ThreadId;

      await createMarker(storageRoot, {
@@ -426,8 +426,8 @@ describe("currentRole field", () => {
      await writeFile(wf, SINGLE_ROLE_WORKFLOW_YAML, "utf8");

      const { thread } = await cmdThreadStart(storageRoot, wf, "test", tmpDir);
-      // worker → _ maps to $END
-      await insertStepNode(storageRoot, thread as ThreadId, "worker", {});
+      // worker → done maps to $END
+      await insertStepNode(storageRoot, thread as ThreadId, "worker", { $status: "done" });

      const result = await cmdThreadShow(storageRoot, thread as ThreadId);
      expect(result.currentRole).toBe(null);
@@ -229,6 +229,10 @@ describe("E2E mock-agent: full uwf pipeline", () => {
    expect(getStatus(store, s1.output)).toBe("ready");
    expect(getStatus(store, s2.output)).toBe("done");

+    // Mock agent reports usage stats in step nodes.
+    expect(s1.usage).toEqual({ turns: 1, inputTokens: 0, outputTokens: 0, duration: 0 });
+    expect(s2.usage).toEqual({ turns: 1, inputTokens: 0, outputTokens: 0, duration: 0 });
+
    // The start node points at the registered workflow.
    const startNode = store.cas.get(startHash as CasRef);
    expect((startNode!.payload as StartNodePayload).workflow).toBe(workflowHash);
@@ -241,7 +245,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
    expect(finalEntry!.head).toBe(step2.head);
  });

-  test("2. branching workflow loops developer→reviewer→developer→reviewer→$END", async () => {
+  test("2. branching workflow loops developer→reviewer→developer→reviewer→$END", {
+    timeout: 30_000,
+  }, async () => {
    await writeMockConfig("e2e-loop.mock.yaml");
    const workflowHash = await addWorkflow("e2e-loop.workflow.yaml", "test-loop");

@@ -299,7 +305,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
    expect(finalEntry!.status).toBe("completed");
  });

-  test("3. role mismatch in mock data makes the agent exit with an error", async () => {
+  test("3. role mismatch in mock data makes the agent exit with an error", {
+    timeout: 30_000,
+  }, async () => {
    // Reuses the linear workflow but with a mock whose step[1].role is wrong.
    await writeMockConfig("e2e-mismatch.mock.yaml");
    const workflowHash = await addWorkflow("e2e-linear.workflow.yaml", "test-linear");
@@ -325,7 +333,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
    expect(entry!.head).toBe(step1.head);
  });

-  test("4. planner $SUSPEND then resume re-runs planner and reaches $END", async () => {
+  test("4. planner $SUSPEND then resume re-runs planner and reaches $END", {
+    timeout: 30_000,
+  }, async () => {
    await writeMockConfig("e2e-suspend.mock.yaml");
    const workflowHash = await addWorkflow("e2e-suspend.workflow.yaml", "test-suspend");

@@ -372,7 +382,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
    expect(finalEntry!.head).toBe(resumeOut.head);
  });

-  test("5. --count 3 runs the whole linear pipeline in one invocation", async () => {
+  test("5. --count 3 runs the whole linear pipeline in one invocation", {
+    timeout: 30_000,
+  }, async () => {
    await writeMockConfig("e2e-count.mock.yaml");
    const workflowHash = await addWorkflow("e2e-count.workflow.yaml", "test-count");

@@ -412,7 +424,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
    expect(finalEntry!.head).toBe(results[2].head);
  });

-  test("6. mustache edge prompt renders planner variables into the worker step", async () => {
+  test("6. mustache edge prompt renders planner variables into the worker step", {
+    timeout: 30_000,
+  }, async () => {
    await writeMockConfig("e2e-mustache.mock.yaml");
    const workflowHash = await addWorkflow("e2e-mustache.workflow.yaml", "test-mustache");

@@ -441,7 +455,9 @@ describe("E2E mock-agent: full uwf pipeline", () => {
    expect(workerStep.edgePrompt).toBe("Work on branch fix/42-auth in /tmp/my-repo");
  });

-  test("7. completed thread can be resumed (ouroboros: end → start)", async () => {
+  test("7. completed thread can be resumed (ouroboros: end → start)", {
+    timeout: 30_000,
+  }, async () => {
    // Reuse the suspend workflow (planner with ready → $END), but mock data
    // goes straight to ready on first run, then ready again after resume.
    await writeMockConfig("e2e-completed-resume.mock.yaml");
@@ -8,10 +8,10 @@ const solveIssueGraph: WorkflowPayload["graph"] = {
    _: { role: "planner", prompt: "Start planning from the issue in the task.", location: null },
  },
  planner: {
-    _: { role: "developer", prompt: "Implement the plan: {{plan}}", location: null },
+    planned: { role: "developer", prompt: "Implement the plan: {{plan}}", location: null },
  },
  developer: {
-    _: { role: "reviewer", prompt: "Review the changes: {{summary}}", location: null },
+    implemented: { role: "reviewer", prompt: "Review the changes: {{summary}}", location: null },
  },
  reviewer: {
    approved: { role: "$END", prompt: "Done.", location: null },
@@ -112,7 +112,7 @@ describe("evaluate", () => {

  test("mustache template rendering with simple fields", () => {
    const result = evaluate(solveIssueGraph, "planner", {
-      $status: "_",
+      $status: "planned",
      plan: "Add auth middleware",
    });
    expect(result).toEqual({
@@ -139,11 +139,11 @@ describe("evaluate", () => {
  test("triple mustache also works for unescaped output", () => {
    const graph: Record<string, Record<string, Target>> = {
      reviewer: {
-        _: { role: "developer", prompt: "Fix: {{{comments}}}", location: null },
+        rejected: { role: "developer", prompt: "Fix: {{{comments}}}", location: null },
      },
    };
    const result = evaluate(graph, "reviewer", {
-      $status: "_",
+      $status: "rejected",
      comments: "<script>alert(1)</script>",
    });
    expect(result).toEqual({
@@ -152,24 +152,22 @@ describe("evaluate", () => {
    });
  });

-  test("missing $status defaults to _ (unit routing)", () => {
+  test("missing $status → error (no unit fallback)", () => {
    const result = evaluate(solveIssueGraph, "planner", {
      plan: "Add auth middleware",
    });
-    expect(result).toEqual({
-      ok: true,
-      value: {
-        role: "developer",
-        prompt: "Implement the plan: Add auth middleware",
-        location: null,
-      },
-    });
+    expect(result.ok).toBe(false);
+    if (!result.ok) {
+      expect(result.error.message).toBe(
+        'agent output for role "planner" is missing required "$status" string',
+      );
+    }
  });

  test("mustache template with nested object paths", () => {
    const graph: Record<string, Record<string, Target>> = {
      reviewer: {
-        _: {
+        rejected: {
          role: "developer",
          prompt: "Address: {{review.comments}}",
          location: null,
@@ -177,7 +175,7 @@ describe("evaluate", () => {
      },
    };
    const result = evaluate(graph, "reviewer", {
-      $status: "_",
+      $status: "rejected",
      review: { comments: "refactor the handler" },
    });
    expect(result).toEqual({
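Reviewer note: the updated tests above imply that routing now requires an explicit `$status` string and no longer falls back to the `_` edge. A minimal sketch of that rule follows, under stated assumptions. `routeByStatus` is a hypothetical name and not the project's `evaluate`, mustache rendering is omitted, the `location` type is a guess, and the "no edge" message is invented. Only the missing-`$status` message is taken from the test.

```typescript
// Assumed shapes, inferred from the fixtures in the test diff.
type Target = { role: string; prompt: string; location: string | null };
type Graph = Record<string, Record<string, Target>>;
type RouteResult = { ok: true; value: Target } | { ok: false; error: Error };

function routeByStatus(graph: Graph, role: string, output: Record<string, unknown>): RouteResult {
  const status = output["$status"];
  if (typeof status !== "string") {
    // Message copied from the expectation in the updated test.
    return {
      ok: false,
      error: new Error(`agent output for role "${role}" is missing required "$status" string`),
    };
  }
  const target = graph[role]?.[status];
  if (target === undefined) {
    // Hypothetical message; the real wording is not visible in this diff.
    return { ok: false, error: new Error(`no edge from role "${role}" for status "${status}"`) };
  }
  return { ok: true, value: target };
}

const graph: Graph = {
  planner: { planned: { role: "developer", prompt: "Implement the plan", location: null } },
};

const routed = routeByStatus(graph, "planner", { $status: "planned" });
const missing = routeByStatus(graph, "planner", { plan: "Add auth middleware" });
console.log(routed, missing);
```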
@@ -6,101 +6,124 @@ import { describe, expect, test } from "vitest";
 const __dirname = dirname(fileURLToPath(import.meta.url));

 import {
-  cmdPromptAdapter,
-  cmdPromptAuthor,
-  cmdPromptDeveloper,
+  cmdPromptAdapterDeveloping,
+  cmdPromptBootstrap,
  cmdPromptList,
  cmdPromptSetup,
  cmdPromptUsage,
-  cmdPromptUser,
+  cmdPromptUsageReference,
+  cmdPromptWorkflowAuthoring,
 } from "../commands/prompt.js";

 describe("prompt commands", () => {
-  test("prompt list returns all prompt names", () => {
+  test("prompt list returns new prompt names", () => {
    const result = cmdPromptList();
    expect(result).toBeInstanceOf(Array);
-    expect(result).toContain("user");
-    expect(result).toContain("author");
-    expect(result).toContain("developer");
-    expect(result).toContain("adapter");
+    expect(result).toContain("usage");
+    expect(result).toContain("workflow-authoring");
+    expect(result).toContain("adapter-developing");
+    expect(result).toContain("bootstrap");
+    expect(result).not.toContain("user");
+    expect(result).not.toContain("author");
+    expect(result).not.toContain("developer");
+    expect(result).not.toContain("adapter");
    for (const name of result) {
      expect(name).toMatch(/^\S+$/);
    }
  });

-  test("prompt user returns non-empty markdown string", () => {
-    const result = cmdPromptUser();
+  test("prompt usage-reference returns non-empty markdown string with frontmatter", () => {
+    const result = cmdPromptUsageReference();
    expect(typeof result).toBe("string");
    expect(result).toContain("uwf");
    expect(result).toContain("thread");
    expect(result).toContain("workflow");
    expect(result).toContain("Quick Start");
+    expect(result).toContain("---");
+    expect(result).toContain("name:");
+    expect(result).toContain("version:");
    expect(result.length).toBeGreaterThan(500);
  });

-  test("prompt author returns non-empty markdown string", () => {
-    const result = cmdPromptAuthor();
+  test("prompt workflow-authoring returns non-empty markdown string with frontmatter", () => {
+    const result = cmdPromptWorkflowAuthoring();
    expect(typeof result).toBe("string");
    expect(result).toContain("frontmatter");
    expect(result).toContain("graph");
    expect(result).toContain("$START");
    expect(result).toContain("$END");
    expect(result).toContain("$status");
+    expect(result).toContain("---");
+    expect(result).toContain("name:");
+    expect(result).toContain("version:");
    expect(result.length).toBeGreaterThan(500);
  });

-  test("prompt developer returns non-empty markdown string", () => {
-    const result = cmdPromptDeveloper();
-    expect(typeof result).toBe("string");
-    expect(result).toContain("Monorepo");
-    expect(result).toContain("CAS");
-    expect(result).toContain("Biome");
-    expect(result.length).toBeGreaterThan(500);
-  });
-
-  test("prompt adapter returns non-empty markdown string", () => {
-    const result = cmdPromptAdapter();
+  test("prompt adapter-developing returns non-empty markdown string with frontmatter", () => {
+    const result = cmdPromptAdapterDeveloping();
    expect(typeof result).toBe("string");
    expect(result).toContain("createAgent");
    expect(result).toContain("AgentContext");
    expect(result).toContain("frontmatter");
+    expect(result).toContain("---");
+    expect(result).toContain("name:");
+    expect(result).toContain("version:");
    expect(result.length).toBeGreaterThan(500);
  });

-  test("prompt usage combines all references", () => {
+  test("prompt bootstrap returns non-empty skill with frontmatter", () => {
+    const result = cmdPromptBootstrap();
+    expect(typeof result).toBe("string");
+    expect(result).toContain("uwf");
+    expect(result).toContain("---");
+    expect(result.length).toBeGreaterThan(100);
+  });
+
+  test("prompt usage combines remaining references (no developer)", () => {
    const result = cmdPromptUsage();
    expect(typeof result).toBe("string");
-    expect(result).toContain("User Reference");
-    expect(result).toContain("Author Reference");
-    expect(result).toContain("Developer Reference");
-    expect(result).toContain("Adapter Reference");
+    expect(result).toContain("Usage Reference");
+    expect(result).toContain("Workflow Authoring Reference");
+    expect(result).toContain("Adapter Developing Reference");
+    expect(result).not.toContain("Developer Reference");
    expect(result).toContain("---");
    expect(result.length).toBeGreaterThan(2000);
  });

-  test("prompt setup returns setup instructions", () => {
+  test("prompt setup returns simplified setup instructions", () => {
    const result = cmdPromptSetup();
    expect(typeof result).toBe("string");
    expect(result).toContain("uwf Skill Setup");
-    expect(result).toContain("uwf prompt usage");
-    expect(result).toContain("uwf prompt setup");
+    expect(result).toContain("uwf prompt bootstrap");
    expect(result).toContain("SKILL.md");
    expect(result).toContain("version");
+    expect(result).not.toMatch(/\bbun (install|run|test|changeset|version|release)\b/);
  });

-  test("prompt help subcommand is suppressed", () => {
-    const output = execFileSync("npx", ["tsx", "src/cli.ts", "prompt", "--help"], {
-      cwd: join(__dirname, "..", ".."),
+  test("prompt setup references new subcommand names", () => {
+    const result = cmdPromptSetup();
+    expect(result).toContain("uwf prompt usage");
+    expect(result).toContain("uwf prompt workflow-authoring");
+    expect(result).toContain("uwf prompt adapter-developing");
+    expect(result).not.toContain("uwf prompt user");
+    expect(result).not.toContain("uwf prompt author");
+    expect(result).not.toContain("uwf prompt developer");
+    expect(result).not.toMatch(/uwf prompt adapter\b(?!-developing)/);
+  });
+
+  test("prompt help subcommand is suppressed", { timeout: 30_000 }, () => {
+    const cliPath = join(__dirname, "..", "..", "dist", "cli.js");
+    const output = execFileSync("node", [cliPath, "prompt", "--help"], {
      encoding: "utf-8",
-      env: { ...process.env, PATH: `/opt/homebrew/bin:${process.env.PATH}` },
+      env: { ...process.env },
    });
    expect(output).not.toMatch(/help\s+\[command\]/i);
    expect(output).toContain("usage");
    expect(output).toContain("setup");
-    expect(output).toContain("user");
-    expect(output).toContain("author");
-    expect(output).toContain("developer");
-    expect(output).toContain("adapter");
+    expect(output).toContain("workflow-authoring");
+    expect(output).toContain("adapter-developing");
+    expect(output).toContain("bootstrap");
    expect(output).toContain("list");
+    expect(output).not.toContain("developer");
  });
 });
@@ -118,6 +118,7 @@ async function createTestStep(
    completedAtMs: Date.now() + 1000,
    assembledPrompt: null,
    cwd: "/tmp",
+    usage: null,
  };
  return store.cas.put(schemas.stepNode, stepPayload);
 }
@@ -96,6 +96,7 @@ describe("protocol types", () => {
      completedAtMs: 2000,
      assembledPrompt: null,
      cwd: "/test/path",
+      usage: null,
    };
    expect(record.startedAtMs).toBe(1000);
    expect(record.completedAtMs).toBe(2000);
@@ -110,6 +111,7 @@ describe("protocol types", () => {
      agent: "uwf-test",
      timestamp: 123,
      durationMs: 5000,
+      usage: null,
    };
    expect(entry.durationMs).toBe(5000);
  });
@@ -252,7 +254,7 @@ describe("thread read timing", () => {
      },
      graph: {
        $START: { _: { role: "worker", prompt: "go", location: null } },
-        worker: { _: { role: "$END", prompt: "", location: null } },
+        worker: { done: { role: "$END", prompt: "", location: null } },
      },
    });

@@ -318,7 +320,7 @@ describe("thread read timing", () => {
      },
      graph: {
        $START: { _: { role: "worker", prompt: "go", location: null } },
-        worker: { _: { role: "$END", prompt: "", location: null } },
+        worker: { done: { role: "$END", prompt: "", location: null } },
      },
    });

@@ -15,7 +15,7 @@ import {
 async function makeUwfStore(storageRoot: string) {
  const casDir = join(storageRoot, "cas");
  await mkdir(casDir, { recursive: true });
-  process.env.OCAS_DIR = casDir;
+  process.env.OCAS_HOME = casDir;
  return createUwfStore(storageRoot);
 }

@@ -54,7 +54,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["ready"] }
 graph:
  $START:
    _:
@@ -62,7 +62,7 @@ graph:
      prompt: "Plan the work"
      location: null
  planner:
-    _:
+    ready:
      role: $END
      prompt: "Done"
      location: null
@@ -110,7 +110,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["ready"] }
 graph:
  $START:
    _:
@@ -118,7 +118,7 @@ graph:
      prompt: "Plan"
      location: null
  planner:
-    _:
+    ready:
      role: $END
      prompt: "Done"
      location: null
@@ -153,7 +153,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["ready"] }
 graph:
  $START:
    _:
@@ -161,7 +161,7 @@ graph:
      prompt: "Plan"
      location: null
  planner:
-    _:
+    ready:
      role: $END
      prompt: "Done"
      location: null
@@ -79,7 +79,7 @@ async function setupSuspendedThread(mode: MockAgentMode): Promise<{
        },
        ok: { role: "reviewer", prompt: "Review the work", location: null },
      },
-      reviewer: { _: { role: "$END", prompt: "Done", location: null } },
+      reviewer: { done: { role: "$END", prompt: "Done", location: null } },
    },
  });

@@ -234,7 +234,7 @@ describe("uwf thread resume", () => {
      },
      graph: {
        $START: { _: { role: "worker", prompt: "Start", location: null } },
-        worker: { _: { role: "$END", prompt: "Done", location: null } },
+        worker: { done: { role: "$END", prompt: "Done", location: null } },
      },
    });

@@ -480,8 +480,8 @@ describe("uwf thread resume - completed threads", () => {
      },
      graph: {
        $START: { _: { role: "worker", prompt: "Start work", location: null } },
-        worker: { _: { role: "reviewer", prompt: "Review the work", location: null } },
-        reviewer: { _: { role: "$END", prompt: "Done", location: null } },
+        worker: { done: { role: "reviewer", prompt: "Review the work", location: null } },
+        reviewer: { done: { role: "$END", prompt: "Done", location: null } },
      },
    });

@@ -491,10 +491,10 @@ describe("uwf thread resume - completed threads", () => {
      cwd: tmpDir,
    });

-    process.env.OCAS_DIR = casDir;
+    process.env.OCAS_HOME = casDir;

-    const workerOutputHash = await store.cas.put(outputSchemaHash, { $status: "_" });
-    const reviewerOutputHash = await store.cas.put(outputSchemaHash, { $status: "_" });
+    const workerOutputHash = await store.cas.put(outputSchemaHash, { $status: "done" });
+    const reviewerOutputHash = await store.cas.put(outputSchemaHash, { $status: "done" });
    const detailHash = await store.cas.put(schemas.text, "mock detail");

    const workerStepHash = await store.cas.put(schemas.stepNode, {
@@ -539,9 +539,7 @@ describe("uwf thread resume - completed threads", () => {
    const { createUwfStore, getThread } = await import("../store.js");
    const verifyUwf = await createUwfStore(tmpDir);
    const verifyEntry = getThread(verifyUwf.varStore, THREAD_ID);
-    // biome-ignore lint/suspicious/noConsole: test debugging
    console.log("Seeded entry status:", verifyEntry?.status);
-    // biome-ignore lint/suspicious/noConsole: test debugging
    console.log("Seeded entry:", JSON.stringify(verifyEntry, null, 2));

    const promptCapturePath = join(tmpDir, "captured-prompt-completed.txt");
@@ -565,7 +563,7 @@ describe("uwf thread resume - completed threads", () => {
      stepHash: newWorkerStepHash,
      detailHash,
      role: "worker",
-      frontmatter: { $status: "_" },
+      frontmatter: { $status: "done" },
      body: "",
      startedAtMs: 1716600003000,
      completedAtMs: 1716600004000,
@@ -601,7 +599,6 @@ echo '${adapterJson}'
    );

    if (result.status !== 0) {
-      // biome-ignore lint/suspicious/noConsole: test debugging
      console.error("Command failed:", result.stderr);
    }

@@ -644,7 +641,7 @@ echo '${adapterJson}'
      },
      graph: {
        $START: { _: { role: "worker", prompt: "Start", location: null } },
-        worker: { _: { role: "$END", prompt: "Done", location: null } },
+        worker: { done: { role: "$END", prompt: "Done", location: null } },
      },
    });

@@ -654,7 +651,7 @@ echo '${adapterJson}'
      cwd: tmpDir,
    });

-    process.env.OCAS_DIR = casDir;
+    process.env.OCAS_HOME = casDir;
    await seedThreads(tmpDir, {
      [THREAD_ID]: {
        head: startHash,
@@ -692,7 +689,7 @@ echo '${adapterJson}'
      },
      graph: {
        $START: { _: { role: "worker", prompt: "Start", location: null } },
-        worker: { _: { role: "$END", prompt: "Done", location: null } },
+        worker: { done: { role: "$END", prompt: "Done", location: null } },
      },
    });

@@ -702,7 +699,7 @@ echo '${adapterJson}'
      cwd: tmpDir,
    });

-    process.env.OCAS_DIR = casDir;
+    process.env.OCAS_HOME = casDir;
    await seedThreads(tmpDir, { [THREAD_ID]: startHash });

    const result = runUwf(["thread", "resume", THREAD_ID], casDir);
@@ -31,7 +31,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["ready"] }
 graph:
  $START:
    _:
@@ -39,7 +39,7 @@ graph:
      prompt: "Plan the work"
      location: null
  planner:
-    _:
+    ready:
      role: $END
      prompt: "Done"
      location: null
@@ -54,7 +54,7 @@ roles:
      type: object
      required: ["$status"]
      properties:
-        $status: { type: string }
+        $status: { type: string, enum: ["ready"] }
 graph:
  $START:
    _:
@@ -62,7 +62,7 @@ graph:
      prompt: "Plan the work"
      location: null
  planner:
-    _:
+    ready:
      role: $END
      prompt: "Done"
      location: null
@@ -2,19 +2,28 @@ import { execFileSync } from "node:child_process";
 import { dirname, join } from "node:path";
 import { fileURLToPath } from "node:url";
 import { describe, expect, test } from "vitest";
+import { validateCount } from "../commands/thread.js";

-const CLI_PATH = join(dirname(fileURLToPath(import.meta.url)), "..", "cli.js");
+const CLI_PATH = join(dirname(fileURLToPath(import.meta.url)), "..", "..", "dist", "cli.js");

-function runCli(args: string[]): { stdout: string; stderr: string; exitCode: number } {
+function runCli(args: string[]): {
+  stdout: string;
+  stderr: string;
+  exitCode: number;
+} {
  try {
-    const stdout = execFileSync("npx", ["tsx", CLI_PATH, ...args], {
+    const stdout = execFileSync("node", [CLI_PATH, ...args], {
      encoding: "utf8",
      env: { ...process.env, UWF_HOME: "/tmp/uwf-test-nonexistent" },
      stdio: ["ignore", "pipe", "pipe"],
    });
    return { stdout, stderr: "", exitCode: 0 };
  } catch (e: unknown) {
-    const err = e as NodeJS.ErrnoException & { stdout?: string; stderr?: string; status?: number };
+    const err = e as NodeJS.ErrnoException & {
+      stdout?: string;
+      stderr?: string;
+      status?: number;
+    };
    return {
      stdout: err.stdout ?? "",
      stderr: err.stderr ?? "",
@@ -23,50 +32,39 @@ function runCli(args: string[]): { stdout: string; stderr: string; exitCode: num
  }
 }

-describe("thread exec --count CLI parsing", () => {
+describe("thread exec --count CLI parsing", { timeout: 30_000 }, () => {
  test("--help shows -c/--count option", () => {
    const result = runCli(["thread", "exec", "--help"]);
-    expect(result.stdout).toContain("--count");
-    expect(result.stdout).toContain("-c");
+    const combined = result.stdout + result.stderr;
+    expect(combined).toContain("--count");
+    expect(combined).toContain("-c");
  });

  test("description says 'one or more steps'", () => {
    const result = runCli(["thread", "exec", "--help"]);
-    expect(result.stdout).toContain("one or more steps");
+    const combined = result.stdout + result.stderr;
+    expect(combined).toContain("one or more steps");
  });
 });

-describe("cmdThreadExec count logic", () => {
-  test("count=0 fails with validation error", () => {
-    const result = runCli(["thread", "exec", "FAKE_THREAD_ID", "-c", "0"]);
-    expect(result.exitCode).not.toBe(0);
-    expect(result.stderr).toContain("positive integer");
+describe("validateCount", () => {
+  test("count=0 throws validation error", () => {
+    expect(() => validateCount(0)).toThrow("positive integer");
  });

-  test("negative count fails with validation error", () => {
-    const result = runCli(["thread", "exec", "FAKE_THREAD_ID", "-c", "-1"]);
-    expect(result.exitCode).not.toBe(0);
-    expect(result.stderr).toContain("positive integer");
+  test("negative count throws validation error", () => {
+    expect(() => validateCount(-1)).toThrow("positive integer");
  });

-  test("non-integer count fails with validation error", () => {
-    const result = runCli(["thread", "exec", "FAKE_THREAD_ID", "-c", "1.5"]);
-    expect(result.exitCode).not.toBe(0);
-    expect(result.stderr).toContain("positive integer");
+  test("non-integer count throws validation error", () => {
+    expect(() => validateCount(1.5)).toThrow("positive integer");
  });

-  test("count=1 is the default (no -c flag)", () => {
-    // Without -c, it should attempt to run 1 step (failing on missing thread, not on count validation)
-    const result = runCli(["thread", "exec", "FAKE_THREAD_ID"]);
-    expect(result.exitCode).not.toBe(0);
-    // Should NOT contain "positive integer" error — should fail on thread lookup instead
-    expect(result.stderr).not.toContain("positive integer");
+  test("count=1 passes validation", () => {
+    expect(() => validateCount(1)).not.toThrow();
  });

-  test("count=3 passes validation (fails on thread lookup)", () => {
-    const result = runCli(["thread", "exec", "FAKE_THREAD_ID", "-c", "3"]);
-    expect(result.exitCode).not.toBe(0);
-    // Should NOT contain "positive integer" error — should fail on thread/storage lookup
-    expect(result.stderr).not.toContain("positive integer");
+  test("count=3 passes validation", () => {
+    expect(() => validateCount(3)).not.toThrow();
  });
 });
@@ -17,7 +17,7 @@ function makeWorkflow(overrides?: Partial<WorkflowPayload>): WorkflowPayload {
        frontmatter: {
          type: "object",
          properties: {
-            $status: { enum: ["_"] },
+            $status: { enum: ["done"] },
            plan: { type: "string" },
          },
          required: ["$status", "plan"],
@@ -52,7 +52,7 @@ function makeWorkflow(overrides?: Partial<WorkflowPayload>): WorkflowPayload {
    },
    graph: {
      $START: { _: { role: "writer", prompt: "Begin writing", location: null } },
-      writer: { _: { role: "reviewer", prompt: "Review this: {{{plan}}}", location: null } },
+      writer: { done: { role: "reviewer", prompt: "Review this: {{{plan}}}", location: null } },
      reviewer: {
        approved: { role: "$END", prompt: "Done: {{{summary}}}", location: null },
        rejected: { role: "writer", prompt: "Fix: {{{reason}}}", location: null },
@@ -82,7 +82,7 @@ describe("Suite 1: Role Reference Integrity", () => {
      output: "None",
      frontmatter: {
        type: "object",
-        properties: { $status: { enum: ["_"] } },
+        properties: { $status: { enum: ["done"] } },
        required: ["$status"],
      } as unknown as string,
    };
@@ -173,11 +173,11 @@ describe("Suite 2: Graph Structure", () => {
      output: "Isolated",
      frontmatter: {
        type: "object",
-        properties: { $status: { enum: ["_"] } },
+        properties: { $status: { enum: ["done"] } },
        required: ["$status"],
      } as unknown as string,
    };
-    wf.graph.isolated = { _: { role: "$END", prompt: "done", location: null } };
+    wf.graph.isolated = { done: { role: "$END", prompt: "done", location: null } };
    const errors = validateWorkflow(wf);
    expect(errors.some((e) => e.includes('role "isolated" is not reachable from $START'))).toBe(
      true,
@@ -186,34 +186,34 @@ describe("Suite 2: Graph Structure", () => {

  test("2.6 edge target references invalid role", () => {
    const wf = makeWorkflow();
-    wf.graph.writer = { _: { role: "ghost", prompt: "Go to ghost", location: null } };
+    wf.graph.writer = { done: { role: "ghost", prompt: "Go to ghost", location: null } };
    const errors = validateWorkflow(wf);
    expect(errors.some((e) => e.includes('unknown target role "ghost"'))).toBe(true);
  });
 });

 describe("Suite 3: Status-Edge Consistency", () => {
-  test("3.1 single-exit role with multiple graph keys", () => {
+  test("3.1 user role using _ graph key is rejected", () => {
    const wf = makeWorkflow();
-    wf.graph.writer = {
-      _: { role: "reviewer", prompt: "Review", location: null },
-      extra: { role: "$END", prompt: "Done", location: null },
-    };
+    wf.graph.writer = { _: { role: "reviewer", prompt: "Review", location: null } };
    const errors = validateWorkflow(wf);
    expect(
      errors.some((e) =>
-        e.includes('role "writer" is single-exit but has status keys other than "_"'),
+        e.includes('role "writer" must use explicit $status keys in graph, not "_"'),
      ),
    ).toBe(true);
  });

-  test("3.2 single-exit role missing _ key", () => {
+  test("3.2 user role graph key not matching $status enum", () => {
    const wf = makeWorkflow();
-    wf.graph.writer = { done: { role: "reviewer", prompt: "Review", location: null } };
+    wf.graph.writer = { wrong: { role: "reviewer", prompt: "Review", location: null } };
    const errors = validateWorkflow(wf);
-    expect(
-      errors.some((e) => e.includes('role "writer" is single-exit but graph has no "_" key')),
-    ).toBe(true);
+    expect(errors.some((e) => e.includes('role "writer" graph has extra status keys: wrong'))).toBe(
+      true,
+    );
+    expect(errors.some((e) => e.includes('role "writer" graph is missing status keys: done'))).toBe(
+      true,
+    );
  });

  test("3.3 multi-exit role with extra statuses", () => {
@@ -244,9 +244,11 @@ describe("Suite 3: Status-Edge Consistency", () => {
    const wf = makeWorkflow();
    wf.graph.reviewer = { _: { role: "$END", prompt: "Done", location: null } };
    const errors = validateWorkflow(wf);
-    expect(errors.some((e) => e.includes('role "reviewer" is multi-exit but graph uses "_"'))).toBe(
-      true,
-    );
+    expect(
+      errors.some((e) =>
+        e.includes('role "reviewer" must use explicit $status keys in graph, not "_"'),
+      ),
+    ).toBe(true);
  });
 });

@@ -314,20 +316,20 @@ describe("Suite 3b: Enum-Based Multi-Exit", () => {
    expect(errors.some((e) => e.includes("missing status keys: rejected"))).toBe(true);
  });

-  test("3b.4 enum with single value (not multi-exit) treated as single-exit", () => {
+  test("3b.4 enum with single explicit value passes", () => {
    const wf = makeWorkflow();
    wf.roles.writer = {
      ...wf.roles.writer,
      frontmatter: {
        type: "object",
        properties: {
-          $status: { enum: ["_"] },
+          $status: { enum: ["ready"] },
          plan: { type: "string" },
        },
        required: ["$status", "plan"],
      } as unknown as string,
    };
-    wf.graph.writer = { _: { role: "reviewer", prompt: "Review: {{{plan}}}", location: null } };
+    wf.graph.writer = { ready: { role: "reviewer", prompt: "Review: {{{plan}}}", location: null } };
    const errors = validateWorkflow(wf);
    expect(errors).toEqual([]);
  });
@@ -355,13 +357,15 @@ describe("Suite 3b: Enum-Based Multi-Exit", () => {
 });

 describe("Suite 4: Mustache Template Variable Existence", () => {
-  test("4.1 prompt references nonexistent variable (single-exit)", () => {
+  test("4.1 prompt references nonexistent variable (enum status)", () => {
    const wf = makeWorkflow();
-    wf.graph.writer = { _: { role: "reviewer", prompt: "Review: {{{branch}}}", location: null } };
+    wf.graph.writer = {
+      done: { role: "reviewer", prompt: "Review: {{{branch}}}", location: null },
+    };
    const errors = validateWorkflow(wf);
    expect(
-      errors.some((e) =>
-        e.includes('prompt variable "branch" not found in role "writer" frontmatter'),
+      errors.some(
+        (e) => e.includes('prompt variable "branch"') && e.includes('role "writer" frontmatter'),
      ),
    ).toBe(true);
  });
@@ -388,7 +392,7 @@ describe("Suite 4: Mustache Template Variable Existence", () => {

  test("4.4 $status variable is always valid", () => {
    const wf = makeWorkflow();
-    wf.graph.writer = { _: { role: "reviewer", prompt: "Status: {{$status}}", location: null } };
+    wf.graph.writer = { done: { role: "reviewer", prompt: "Status: {{$status}}", location: null } };
    const errors = validateWorkflow(wf);
    expect(errors).toEqual([]);
  });
@@ -456,14 +460,14 @@ describe("Suite 6: Multiple Errors Collection", () => {
      output: "None",
      frontmatter: {
        type: "object",
-        properties: { $status: { enum: ["_"] } },
+        properties: { $status: { enum: ["done"] } },
        required: ["$status"],
      } as unknown as string,
    };
    // unknown graph reference
-    wf.graph.nonexistent = { _: { role: "$END", prompt: "done", location: null } };
+    wf.graph.nonexistent = { done: { role: "$END", prompt: "done", location: null } };
    // bad mustache var
-    wf.graph.writer = { _: { role: "reviewer", prompt: "{{{badvar}}}", location: null } };
+    wf.graph.writer = { done: { role: "reviewer", prompt: "{{{badvar}}}", location: null } };
    const errors = validateWorkflow(wf);
    expect(errors.length).toBeGreaterThanOrEqual(3);
  });
@@ -31,7 +31,7 @@ function makeMinimalPayload(name: string, description: string): WorkflowPayload
        frontmatter: {
          type: "object",
          properties: {
-            $status: { type: "string" },
+            $status: { type: "string", enum: ["done"] },
          },
          required: ["$status"],
        } as unknown as CasRef,
@@ -39,7 +39,7 @@ function makeMinimalPayload(name: string, description: string): WorkflowPayload
    },
    graph: {
      $START: { _: { role: "worker", prompt: "start working", location: null } },
-      worker: { _: { role: "$END", prompt: "done", location: null } },
+      worker: { done: { role: "$END", prompt: "done", location: null } },
    },
  };
 }
@@ -5,14 +5,13 @@ import { Command } from "commander";
 import { cmdConfigGet, cmdConfigList, cmdConfigSet } from "./commands/config.js";
 import { cmdLogClean, cmdLogList, cmdLogShow } from "./commands/log.js";
 import {
-  cmdPromptAdapter,
-  cmdPromptAuthor,
+  cmdPromptAdapterDeveloping,
  cmdPromptBootstrap,
-  cmdPromptDeveloper,
  cmdPromptList,
  cmdPromptSetup,
  cmdPromptUsage,
-  cmdPromptUser,
+  cmdPromptUsageReference,
+  cmdPromptWorkflowAuthoring,
 } from "./commands/prompt.js";
 import { cmdSetup, cmdSetupInteractive } from "./commands/setup.js";
 import { cmdStepFork, cmdStepList, cmdStepRead, cmdStepShow } from "./commands/step.js";
@@ -523,31 +522,24 @@ prompt
  });

 prompt
-  .command("adapter")
-  .description("Print the adapter reference (building agent adapters)")
+  .command("usage-reference")
+  .description("Print the usage reference (CLI guide + typical workflows)")
  .action(() => {
-    console.log(cmdPromptAdapter());
+    console.log(cmdPromptUsageReference());
  });

 prompt
-  .command("author")
-  .description("Print the author reference (workflow YAML design guide)")
+  .command("workflow-authoring")
+  .description("Print the workflow authoring reference (YAML design guide)")
  .action(() => {
-    console.log(cmdPromptAuthor());
+    console.log(cmdPromptWorkflowAuthoring());
  });

 prompt
-  .command("developer")
-  .description("Print the developer reference (coding conventions + architecture)")
+  .command("adapter-developing")
+  .description("Print the adapter developing reference (building agent adapters)")
  .action(() => {
-    console.log(cmdPromptDeveloper());
-  });
-
-prompt
-  .command("user")
-  .description("Print the user reference (CLI guide + typical workflows)")
-  .action(() => {
-    console.log(cmdPromptUser());
+    console.log(cmdPromptAdapterDeveloping());
  });

 prompt
@@ -1,24 +1,21 @@
 import {
-  generateAdapterReference,
-  generateAuthorReference,
+  generateAdapterDevelopingReference,
  generateBootstrapReference,
-  generateDeveloperReference,
-  generateUserReference,
+  generateUsageReference,
+  generateWorkflowAuthoringReference,
 } from "@united-workforce/util";

 export {
-  generateAdapterReference as cmdPromptAdapter,
-  generateAuthorReference as cmdPromptAuthor,
+  generateAdapterDevelopingReference as cmdPromptAdapterDeveloping,
  generateBootstrapReference as cmdPromptBootstrap,
-  generateDeveloperReference as cmdPromptDeveloper,
-  generateUserReference as cmdPromptUser,
+  generateUsageReference as cmdPromptUsageReference,
+  generateWorkflowAuthoringReference as cmdPromptWorkflowAuthoring,
 };

 const PROMPT_ENTRIES: ReadonlyArray<{ name: string; generate: () => string }> = [
-  { name: "user", generate: generateUserReference },
-  { name: "author", generate: generateAuthorReference },
-  { name: "developer", generate: generateDeveloperReference },
-  { name: "adapter", generate: generateAdapterReference },
+  { name: "usage", generate: generateUsageReference },
+  { name: "workflow-authoring", generate: generateWorkflowAuthoringReference },
+  { name: "adapter-developing", generate: generateAdapterDevelopingReference },
  { name: "bootstrap", generate: generateBootstrapReference },
 ];

@@ -27,7 +24,9 @@ export function cmdPromptList(): ReadonlyArray<string> {
 }

 export function cmdPromptUsage(): string {
-  return PROMPT_ENTRIES.map((e) => e.generate()).join("\n\n---\n\n");
+  return PROMPT_ENTRIES.filter((e) => e.name !== "bootstrap")
+    .map((e) => e.generate())
+    .join("\n\n---\n\n");
 }

 export function cmdPromptSetup(): string {
@@ -42,7 +41,7 @@ so that you know how to use the \`uwf\` CLI for workflow orchestration.
   \`\`\`bash
   uwf --help
   \`\`\`
-   If not installed: \`bun add -g @united-workforce/cli\`
+   If not installed: \`npm install -g @united-workforce/cli\`

 2. **Clean up old versions of the skill** (if any exist):
   - Look for any existing uwf/workflow skill files in your skill directories
@@ -51,43 +50,24 @@ so that you know how to use the \`uwf\` CLI for workflow orchestration.
   - If the version matches the current CLI version (\`uwf --version\`), you're done — no update needed.
   - If the version is older, **delete the old skill file** before proceeding.

-3. **Get the usage skill content:**
+3. **Install the bootstrap skill:**
   \`\`\`bash
-   uwf prompt usage
+   uwf prompt bootstrap > ~/.hermes/skills/uwf/SKILL.md
   \`\`\`
+   The bootstrap prompt already includes complete YAML frontmatter — no editing needed.

-4. **Create a SKILL.md file** in your skill directory (e.g. \`~/.hermes/skills/uwf/SKILL.md\`).
-   The output of \`uwf prompt usage\` is the skill body WITHOUT frontmatter.
-   You must prepend the following YAML frontmatter:
-
-   \`\`\`yaml
-   ---
-   name: uwf
-   description: "Use when orchestrating multi-agent workflows with uwf (United Workforce). Covers CLI usage, workflow YAML authoring, moderator logic, and agent development."
-   version: <VERSION>
-   author: United Workforce
-   license: MIT
-   metadata:
-     hermes:
-       tags: [workflow, agents, orchestration, cli, yaml]
-   ---
-   \`\`\`
-
-   Replace \`<VERSION>\` with the installed CLI version (from \`uwf --version\`).
-
-5. **Verify** the skill is loadable by your agent framework.
+4. **Verify** the skill is loadable by your agent framework.

 ## Individual prompts

-You can also get individual reference sections:
+Each prompt outputs a complete SKILL.md with frontmatter, so you can redirect it straight to a file:

 \`\`\`bash
 uwf prompt list                                              # list available prompt names
-uwf prompt user                # user reference (CLI guide + typical workflows)
-uwf prompt author              # author reference (workflow YAML design guide)
-uwf prompt developer           # developer reference (coding conventions + architecture)
-uwf prompt adapter             # adapter reference (building agent adapters)
-uwf prompt bootstrap           # bootstrap skill YAML for Hermes agents
+uwf prompt usage > ~/.hermes/skills/uwf-usage/SKILL.md      # CLI usage guide
+uwf prompt workflow-authoring > ~/.hermes/skills/uwf-workflow-authoring/SKILL.md
+uwf prompt adapter-developing > ~/.hermes/skills/uwf-adapter-developing/SKILL.md
+uwf prompt bootstrap > ~/.hermes/skills/uwf/SKILL.md        # bootstrap skill
 \`\`\`

 ## Notes
@@ -66,6 +66,7 @@ export async function cmdStepList(
      agent: item.payload.agent,
      timestamp: item.timestamp,
      durationMs: item.payload.completedAtMs - item.payload.startedAtMs,
+      usage: item.payload.usage ?? null,
    });
  }

@@ -961,6 +961,12 @@ function resolveAgentConfig(
  agentOverride: string | null,
 ): AgentConfig {
  if (agentOverride !== null) {
+    // Try config alias first (e.g. "hermes" → config.agents.hermes),
+    // then fall back to raw command name (e.g. "uwf-hermes" or "/usr/bin/agent").
+    const fromAlias = config.agents[agentOverride as AgentAlias];
+    if (fromAlias !== undefined) {
+      return fromAlias;
+    }
    return parseAgentOverride(agentOverride);
  }

@@ -1128,6 +1134,12 @@ export async function cmdThreadResume(
  });
 }

+export function validateCount(count: number): void {
+  if (count < 1 || !Number.isInteger(count)) {
+    throw new Error(`--count must be a positive integer, got: ${count}`);
+  }
+}
+
 export async function cmdThreadExec(
  storageRoot: string,
  threadId: ThreadId,
@@ -1136,9 +1148,7 @@ export async function cmdThreadExec(
  background: boolean,
  backgroundWorker: boolean,
 ): Promise<StepOutput[]> {
-  if (count < 1 || !Number.isInteger(count)) {
-    fail(`--count must be a positive integer, got: ${count}`);
-  }
+  validateCount(count);

  // Check if thread is already running in background (unless we ARE the background worker)
  if (!backgroundWorker) {
@@ -8,7 +8,8 @@ mustache.escape = (text: string) => text;

 const START_ROLE = "$START";
 const SUSPEND_ROLE = "$SUSPEND";
-const UNIT_STATUS = "_";
+// $START is a special entry node with no agent output — it always uses this key.
+const START_STATUS = "_";

 type LastOutput = Record<string, unknown>;

@@ -19,12 +20,17 @@ export function evaluate(
  lastRole: string,
  lastOutput: LastOutput,
 ): Result<EvaluateResult, Error> {
-  const status =
-    lastRole === START_ROLE
-      ? UNIT_STATUS
-      : typeof lastOutput[STATUS_KEY] === "string"
-        ? (lastOutput[STATUS_KEY] as string)
-        : UNIT_STATUS;
+  let status: string;
+  if (lastRole === START_ROLE) {
+    status = START_STATUS;
+  } else if (typeof lastOutput[STATUS_KEY] === "string") {
+    status = lastOutput[STATUS_KEY] as string;
+  } else {
+    return {
+      ok: false,
+      error: new Error(`agent output for role "${lastRole}" is missing required "$status" string`),
+    };
+  }

  const roleTargets = graph[lastRole];
  if (roleTargets === undefined) {
@@ -24,17 +24,13 @@ function isOneOfSchema(fm: unknown): fm is SchemaObj & { oneOf: SchemaObj[] } {
  return Array.isArray(obj.oneOf);
 }

-/** Check if a frontmatter schema uses enum-based multi-exit ($status with multiple enum values). */
-function isEnumMultiExit(fm: unknown): boolean {
+/** Check if a frontmatter schema declares "$status" as an enum (the required form for user roles). */
+function hasStatusEnum(fm: unknown): boolean {
  if (typeof fm !== "object" || fm === null) return false;
  const obj = fm as SchemaObj;
  const props = obj.properties as Record<string, SchemaObj> | undefined;
  if (!props?.$status) return false;
-  const statusDef = props.$status;
-  if (!Array.isArray(statusDef.enum)) return false;
-  // Filter out "_" (wildcard) — if remaining values > 1, it's multi-exit
-  const statuses = (statusDef.enum as string[]).filter((s) => s !== "_");
-  return statuses.length > 1;
+  return Array.isArray(props.$status.enum);
 }

 /** Extract status values from an enum-based $status field. */
@@ -43,7 +39,7 @@ function getEnumStatuses(fm: SchemaObj): string[] {
  if (!props?.$status) return [];
  const statusDef = props.$status;
  if (!Array.isArray(statusDef.enum)) return [];
-  return (statusDef.enum as string[]).filter((s) => s !== "_");
+  return statusDef.enum as string[];
 }

 /** Get property names from a schema object. */
@@ -194,15 +190,19 @@ function checkOneOfDiscriminant(
  }
 }

-/** Check status-edge consistency for a multi-exit role. */
-function checkMultiExitEdges(
+/** Check status-edge consistency for a user role. "_" is reserved for $START and rejected here. */
+function checkStatusEdges(
  roleName: string,
  graphKeys: Set<string>,
  statusSet: Set<string>,
  errors: string[],
 ): void {
  if (graphKeys.has("_")) {
-    errors.push(`role "${roleName}" is multi-exit but graph uses "_"`);
+    errors.push(`role "${roleName}" must use explicit $status keys in graph, not "_"`);
+    return;
+  }
+  if (statusSet.has("_")) {
+    errors.push(`role "${roleName}" $status enum must use explicit values, not "_"`);
    return;
  }

@@ -255,50 +255,23 @@ function checkRoleConsistency(payload: WorkflowPayload, errors: string[]): void
      const statuses = getOneOfStatuses(variants);

      checkOneOfDiscriminant(roleName, variants, statuses, errors);
-      checkMultiExitEdges(roleName, graphKeys, new Set(statuses), errors);
+      checkStatusEdges(roleName, graphKeys, new Set(statuses), errors);
      checkMultiExitMustache(roleName, graphEntry, variants, errors);
-    } else if (isEnumMultiExit(fm)) {
+    } else if (hasStatusEnum(fm)) {
      const statuses = getEnumStatuses(fm as SchemaObj);
-      checkMultiExitEdges(roleName, graphKeys, new Set(statuses), errors);
+      checkStatusEdges(roleName, graphKeys, new Set(statuses), errors);
      // For enum-based schemas, mustache vars come from the flat properties
-      checkSingleExitMustache(roleName, graphEntry, fm as SchemaObj, errors);
+      checkEnumMustache(roleName, graphEntry, fm as SchemaObj, errors);
    } else {
-      checkSingleExitRole(roleName, graphKeys, graphEntry, fm as SchemaObj | null, errors);
-    }
-  }
-}
-
-/** Check single-exit role status and mustache. */
-function checkSingleExitRole(
-  roleName: string,
-  graphKeys: Set<string>,
-  graphEntry: Record<string, { role: string; prompt: string }>,
-  fm: SchemaObj | null,
-  errors: string[],
-): void {
-  if (graphKeys.size > 1 || (graphKeys.size === 1 && !graphKeys.has("_"))) {
-    if (!graphKeys.has("_")) {
-      errors.push(`role "${roleName}" is single-exit but graph has no "_" key`);
-    } else {
-      errors.push(`role "${roleName}" is single-exit but has status keys other than "_"`);
-    }
-  }
-
-  const singleTarget = graphEntry._;
-  if (!singleTarget) return;
-
-  const vars = extractMustacheVars(singleTarget.prompt);
-  const propNames = fm ? getPropertyNames(fm) : new Set<string>();
-  for (const v of vars) {
-    if (v === "$status") continue;
-    if (!propNames.has(v)) {
-      errors.push(`prompt variable "${v}" not found in role "${roleName}" frontmatter`);
+      errors.push(
+        `role "${roleName}" must define "$status" as an enum (or oneOf const) in frontmatter`,
+      );
    }
  }
 }

 /** Check mustache vars in all edge prompts against flat schema properties. */
-function checkSingleExitMustache(
+function checkEnumMustache(
  roleName: string,
  graphEntry: Record<string, { role: string; prompt: string }>,
  fm: SchemaObj,
@@ -57,9 +57,18 @@ function isGraph(value: unknown): boolean {
  if (!isRecord(value)) {
    return false;
  }
-  return Object.values(value).every(
-    (statusMap) => isRecord(statusMap) && Object.values(statusMap).every((t) => isTarget(t)),
-  );
+  return Object.entries(value).every(([node, statusMap]) => {
+    if (!isRecord(statusMap)) {
+      return false;
+    }
+    return Object.entries(statusMap).every(([status, target]) => {
+      // "_" is only valid as a status key for the $START entry node.
+      if (status === "_" && node !== "$START") {
+        return false;
+      }
+      return isTarget(target);
+    });
+  });
 }

 /**
@@ -1,6 +1,6 @@
 {
  "name": "@united-workforce/dashboard",
-  "version": "0.5.0-alpha.4",
+  "version": "0.1.0",
  "private": true,
  "type": "module",
  "scripts": {
@@ -0,0 +1,9 @@
+# @united-workforce/eval
+
+## 0.1.2
+
+### Patch Changes
+
+- 850a3b2: fix: resolve --agent override via config alias before raw command
+
+  `resolveAgentConfig()` now checks `config.agents[alias]` before falling back to `parseAgentOverride()`. The eval CLI default for `--agent` changed from `"hermes"` to `"uwf-hermes"`.
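A minimal sketch of the lookup order described in this changeset, under simplified assumptions. The `AgentConfig` and `Config` shapes are guesses for illustration, and `resolveOverrideSketch` and `parseRawCommandSketch` are hypothetical stand-ins rather than the package's actual `resolveAgentConfig()` and `parseAgentOverride()`:

```typescript
// Simplified, assumed shapes; the real types live in the CLI package.
type AgentConfig = { command: string; args: string[] };
type Config = { agents: Record<string, AgentConfig | undefined> };

// Hypothetical stand-in for parseAgentOverride(): treat the value as a raw command.
function parseRawCommandSketch(raw: string): AgentConfig {
  return { command: raw, args: [] };
}

// Alias lookup first, raw command second, mirroring the order in the changeset.
function resolveOverrideSketch(config: Config, agentOverride: string): AgentConfig {
  const fromAlias = config.agents[agentOverride];
  if (fromAlias !== undefined) {
    return fromAlias;
  }
  return parseRawCommandSketch(agentOverride);
}

const exampleConfig: Config = {
  agents: { hermes: { command: "uwf-hermes", args: [] } },
};

// "hermes" is a configured alias, so its stored config is returned.
console.log(resolveOverrideSketch(exampleConfig, "hermes").command);
// "/usr/bin/agent" is not an alias, so it is used as the command itself.
console.log(resolveOverrideSketch(exampleConfig, "/usr/bin/agent").command);
```

With this ordering, a short alias such as `hermes` and a raw command such as `uwf-hermes` can both be passed to `--agent`. An alias takes precedence when a raw command happens to share its name.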
@@ -0,0 +1,219 @@
+import type { StepEntry } from "@united-workforce/protocol";
+import { beforeEach, describe, expect, test, vi } from "vitest";
+
+import {
+  runFrontmatterJudge,
+  runHallucinationJudge,
+  runTokenStatsJudge,
+  runUpstreamJudge,
+} from "../src/judge/builtin/index.js";
+
+// Mock the shared read-steps helper so the judges never shell out to `uwf`.
+vi.mock("../src/judge/builtin/read-steps.js", () => ({
+  readThreadSteps: vi.fn(),
+}));
+
+import { readThreadSteps } from "../src/judge/builtin/read-steps.js";
+
+const mockedReadSteps = vi.mocked(readThreadSteps);
+
+function makeStep(overrides: Partial<StepEntry>): StepEntry {
+  return {
+    hash: "HASH000000000",
+    role: "worker",
+    output: "---\n$status: done\n---\n\nbody",
+    detail: "DETAIL0000000",
+    agent: "hermes",
+    timestamp: 0,
+    durationMs: 0,
+    usage: null,
+    ...overrides,
+  };
+}
+
+beforeEach(() => {
+  mockedReadSteps.mockReset();
+});
+
+describe("frontmatter-compliance judge", () => {
+  test("all steps have valid frontmatter → score 1.0", async () => {
+    mockedReadSteps.mockReturnValue([
+      makeStep({ role: "a", output: "---\n$status: done\n---\n\nwork" }),
+      makeStep({ role: "b", output: "---\n$status: needs_input\n---\nmore" }),
+    ]);
+
+    const result = await runFrontmatterJudge("T1");
+    const data = result.data as { stepsTotal: number; stepsValid: number; invalidSteps: unknown[] };
+
+    expect(result.score).toBe(1.0);
+    expect(data.stepsTotal).toBe(2);
+    expect(data.stepsValid).toBe(2);
+    expect(data.invalidSteps).toHaveLength(0);
+  });
+
+  test("some steps missing $status → partial score", async () => {
+    mockedReadSteps.mockReturnValue([
+      makeStep({ role: "a", output: "---\n$status: done\n---\nok" }),
+      makeStep({ role: "b", output: "---\nfoo: bar\n---\nmissing status" }),
+      makeStep({ role: "c", output: "no frontmatter at all" }),
+    ]);
+
+    const result = await runFrontmatterJudge("T2");
+    const data = result.data as {
+      stepsTotal: number;
+      stepsValid: number;
+      invalidSteps: Array<{ stepIndex: number; role: string; errors: string[] }>;
+    };
+
+    expect(result.score).toBeCloseTo(1 / 3, 10);
+    expect(data.stepsTotal).toBe(3);
+    expect(data.stepsValid).toBe(1);
+    expect(data.invalidSteps).toHaveLength(2);
+    expect(data.invalidSteps[0]).toMatchObject({ stepIndex: 1, role: "b" });
+    expect(data.invalidSteps[1]).toMatchObject({ stepIndex: 2, role: "c" });
+  });
+
+  test("no steps → score 0 (0/0 edge case)", async () => {
+    mockedReadSteps.mockReturnValue([]);
+
+    const result = await runFrontmatterJudge("T3");
+    const data = result.data as { stepsTotal: number; stepsValid: number; invalidSteps: unknown[] };
+
+    expect(result.score).toBe(0);
+    expect(data.stepsTotal).toBe(0);
+    expect(data.stepsValid).toBe(0);
+    expect(data.invalidSteps).toHaveLength(0);
+  });
+
+  test("empty-string $status counts as invalid", async () => {
+    mockedReadSteps.mockReturnValue([makeStep({ role: "a", output: '---\n$status: ""\n---\nx' })]);
+
+    const result = await runFrontmatterJudge("T4");
+    expect(result.score).toBe(0);
+  });
+
+  test("parsed object output with $status → score 1.0", async () => {
+    mockedReadSteps.mockReturnValue([
+      makeStep({ role: "a", output: { $status: "done", summary: "fixed" } as unknown as string }),
+      makeStep({ role: "b", output: { $status: "reviewed" } as unknown as string }),
+    ]);
+
+    const result = await runFrontmatterJudge("T5");
+    const data = result.data as { stepsTotal: number; stepsValid: number; invalidSteps: unknown[] };
+
+    expect(result.score).toBe(1.0);
+    expect(data.stepsTotal).toBe(2);
+    expect(data.stepsValid).toBe(2);
+  });
+
+  test("parsed object output missing $status → score 0", async () => {
+    mockedReadSteps.mockReturnValue([
+      makeStep({ role: "a", output: { summary: "no status field" } as unknown as string }),
+    ]);
+
+    const result = await runFrontmatterJudge("T6");
+    expect(result.score).toBe(0);
+  });
+});
+
+describe("token-stats judge", () => {
+  test("steps with usage → sums correctly", async () => {
+    mockedReadSteps.mockReturnValue([
+      makeStep({
+        role: "a",
+        usage: { turns: 2, inputTokens: 100, outputTokens: 50, duration: 1.5 },
+      }),
+      makeStep({
+        role: "b",
+        usage: { turns: 3, inputTokens: 200, outputTokens: 75, duration: 2.0 },
+      }),
+    ]);
+
+    const result = await runTokenStatsJudge("T1");
+    const data = result.data as {
+      totalInput: number;
+      totalOutput: number;
+      totalTurns: number;
+      perStep: Array<{ role: string; inputTokens: number; outputTokens: number; turns: number }>;
+    };
+
+    expect(result.score).toBe(1.0);
+    expect(data.totalInput).toBe(300);
+    expect(data.totalOutput).toBe(125);
+    expect(data.totalTurns).toBe(5);
+    expect(data.perStep).toHaveLength(2);
+    expect(data.perStep[0]).toEqual({
+      role: "a",
+      inputTokens: 100,
+      outputTokens: 50,
+      turns: 2,
+      duration: 1.5,
+    });
+  });
+
+  test("steps with null usage → zeros", async () => {
+    mockedReadSteps.mockReturnValue([
+      makeStep({ role: "a", usage: null }),
+      makeStep({ role: "b", usage: null }),
+    ]);
+
+    const result = await runTokenStatsJudge("T2");
+    const data = result.data as {
+      totalInput: number;
+      totalOutput: number;
+      totalTurns: number;
+      perStep: Array<{
+        inputTokens: number;
+        outputTokens: number;
+        turns: number;
+        duration: number;
+      }>;
+    };
+
+    expect(result.score).toBe(1.0);
+    expect(data.totalInput).toBe(0);
+    expect(data.totalOutput).toBe(0);
+    expect(data.totalTurns).toBe(0);
+    expect(data.perStep[0]).toEqual({
+      role: "a",
+      inputTokens: 0,
+      outputTokens: 0,
+      turns: 0,
+      duration: 0,
+    });
+  });
+
+  test("empty steps → all zeros, score 1.0", async () => {
+    mockedReadSteps.mockReturnValue([]);
+
+    const result = await runTokenStatsJudge("T3");
+    const data = result.data as {
+      totalInput: number;
+      totalOutput: number;
+      totalTurns: number;
+      perStep: unknown[];
+    };
+
+    expect(result.score).toBe(1.0);
+    expect(data.totalInput).toBe(0);
+    expect(data.totalOutput).toBe(0);
+    expect(data.totalTurns).toBe(0);
+    expect(data.perStep).toHaveLength(0);
+  });
+});
+
+describe("LLM-as-judge stubs", () => {
+  test("upstream-consumption returns a stub", async () => {
+    const result = await runUpstreamJudge("T1");
+    expect(result.score).toBe(0);
+    expect(result.data).toEqual({ perStep: [] });
+    expect(result.schema.title).toBe("@uwf/eval-judge-upstream");
+  });
+
+  test("hallucination returns a stub", async () => {
+    const result = await runHallucinationJudge("T1");
+    expect(result.score).toBe(0);
+    expect(result.data).toEqual({ perStep: [] });
+    expect(result.schema.title).toBe("@uwf/eval-judge-hallucination");
+  });
+});
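The frontmatter-compliance tests above pin down a scoring rule: a step counts as valid only when it carries a non-empty `$status`, the score is `stepsValid / stepsTotal`, and an empty step list scores 0. A minimal sketch of that rule follows. `extractStatus` and `frontmatterScore` are hypothetical names, and the line-based frontmatter scan is an assumption that stands in for whatever YAML parsing the real judge uses.

```typescript
// Hypothetical sketch, not the package's actual judge implementation.
type StepOutput = string | Record<string, unknown>;

/** Extract `$status` from a step output, or null when it is absent. */
function extractStatus(output: StepOutput): string | null {
  if (typeof output !== "string") {
    // Already-parsed object output: read the field directly.
    const value = output["$status"];
    return typeof value === "string" ? value : null;
  }
  // Frontmatter must open on the first line and close with a second `---`.
  const match = /^---\n([\s\S]*?)\n---(?:\n|$)/.exec(output);
  if (match === null) return null;
  for (const line of (match[1] ?? "").split("\n")) {
    const field = /^\$status:\s*(.*)$/.exec(line);
    if (field !== null) {
      // Strip one pair of surrounding quotes so `""` becomes an empty string.
      return (field[1] ?? "").trim().replace(/^(["'])(.*)\1$/, "$2");
    }
  }
  return null;
}

/** Fraction of steps with a non-empty `$status`; 0 when there are no steps. */
function frontmatterScore(outputs: StepOutput[]): number {
  if (outputs.length === 0) return 0; // 0/0 edge case
  const valid = outputs.filter((output) => {
    const status = extractStatus(output);
    return status !== null && status.length > 0;
  }).length;
  return valid / outputs.length;
}
```

The object branch mirrors the "parsed object output" tests, where `output` arrives as an object rather than a string.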
@@ -0,0 +1,152 @@
+import { bootstrap, createMemoryStore } from "@ocas/core";
+import { describe, expect, test } from "vitest";
+import type { JudgeRunner } from "../src/runner/index.js";
+import { collect, computeOverall } from "../src/runner/index.js";
+import type { EvalRunConfig, EvalStore } from "../src/storage/index.js";
+import type { JudgeEntry, TaskManifest } from "../src/task/index.js";
+
+function makeJudge(name: string, weight: number, builtin: boolean): JudgeEntry {
+  return {
+    name,
+    weight,
+    builtin,
+    entry: builtin ? null : `dist/judges/${name}.js`,
+    schema: null,
+  };
+}
+
+function makeManifest(judges: JudgeEntry[]): TaskManifest {
+  return {
+    name: "fix-off-by-one",
+    description: "test task",
+    workflow: "solve-issue",
+    prompt: "Fix the bug",
+    limits: { maxSteps: 10, timeoutMinutes: 30 },
+    judges,
+  };
+}
+
+function makeEvalStore(): EvalStore {
+  const store = createMemoryStore();
+  bootstrap(store);
+  return { store, varStore: store.var };
+}
+
+const CONFIG: EvalRunConfig = {
+  agent: "hermes",
+  model: "claude-sonnet-4",
+  engineVersion: "test",
+};
+
+/** Returns a fixed score per judge name. */
+function scriptedRunner(scores: Record<string, number>): JudgeRunner {
+  return async (_taskDir, _workDir, _threadId, judge) => ({
+    score: scores[judge.name] ?? 0,
+    data: { judged: judge.name },
+    schema: { type: "object" },
+  });
+}
+
+describe("computeOverall", () => {
+  test("computes the weighted average correctly", () => {
+    const overall = computeOverall([
+      { score: 0.8, weight: 0.3 },
+      { score: 0.6, weight: 0.3 },
+      { score: 1.0, weight: 0.4 },
+    ]);
+    // 0.24 + 0.18 + 0.4 = 0.82
+    expect(overall).toBeCloseTo(0.82, 10);
+  });
+
+  test("a weight-0 judge does not affect the result", () => {
+    const withInformational = computeOverall([
+      { score: 1.0, weight: 1.0 },
+      { score: 0.0, weight: 0.0 },
+    ]);
+    expect(withInformational).toBe(1.0);
+  });
+
+  test("returns 0 when total weight is 0", () => {
+    expect(computeOverall([{ score: 0.5, weight: 0 }])).toBe(0);
+  });
+});
+
+describe("collect", () => {
+  test("computes weighted score correctly across judges", async () => {
+    const evalStore = makeEvalStore();
+    const manifest = makeManifest([
+      makeJudge("test-pass", 0.6, false),
+      makeJudge("code-quality", 0.4, false),
+    ]);
+    const runJudge = scriptedRunner({ "test-pass": 1.0, "code-quality": 0.5 });
+
+    const result = await collect(
+      {
+        evalStore,
+        taskDir: "/tmp/task",
+        workDir: "/tmp/work",
+        threadId: "THREAD123",
+        manifest,
+        config: CONFIG,
+      },
+      runJudge,
+    );
+
+    // 1.0 * 0.6 + 0.5 * 0.4 = 0.8
+    expect(result.overall).toBeCloseTo(0.8, 10);
+    expect(result.runHash).toBeTruthy();
+    expect(result.judges).toHaveLength(2);
+    expect(result.judges[0]).toEqual({ name: "test-pass", score: 1.0, weight: 0.6 });
+
+    const latest = evalStore.varStore.list({
+      exactName: "@uwf/eval/fix-off-by-one/latest",
+    });
+    expect(latest[0]?.value).toBe(result.runHash);
+  });
+
+  test("handles a judge with weight 0 (informational)", async () => {
+    const evalStore = makeEvalStore();
+    const manifest = makeManifest([
+      makeJudge("test-pass", 1.0, false),
+      makeJudge("token-stats", 0, true),
+    ]);
+    // token-stats is builtin, but the injected runner bypasses builtin dispatch; give it a
+    // scripted score that would skew the result if its weight were counted.
+    const runJudge = scriptedRunner({ "test-pass": 0.5, "token-stats": 1.0 });
+
+    const result = await collect(
+      {
+        evalStore,
+        taskDir: "/tmp/task",
+        workDir: "/tmp/work",
+        threadId: "THREAD123",
+        manifest,
+        config: CONFIG,
+      },
+      runJudge,
+    );
+
+    // Only test-pass (weight 1.0) counts → overall = 0.5
+    expect(result.overall).toBeCloseTo(0.5, 10);
+    expect(result.judges).toHaveLength(2);
+    const tokenStats = result.judges.find((j) => j.name === "token-stats");
+    expect(tokenStats?.weight).toBe(0);
+  });
+
+  test("unknown builtin judge name throws via the default runner", async () => {
+    const evalStore = makeEvalStore();
+    const manifest = makeManifest([makeJudge("not-a-real-judge", 1.0, true)]);
+
+    // Use the default runner (no injected runner) → builtin dispatch → unknown name throws.
+    await expect(
+      collect({
+        evalStore,
+        taskDir: "/tmp/task",
+        workDir: "/tmp/work",
+        threadId: "THREAD123",
+        manifest,
+        config: CONFIG,
+      }),
+    ).rejects.toThrow(/unknown builtin judge/);
+  });
+});
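The `computeOverall` tests above are consistent with a weighted average that returns 0 when the total weight is 0. A minimal sketch under that reading follows. `weightedOverall` is a hypothetical name, and dividing by the total weight is an assumption: the tests would pass equally with a plain sum of `score * weight`, since their non-zero weights always sum to 1.

```typescript
// Hypothetical sketch; the real computeOverall may differ (see the note on normalization).
interface WeightedScore {
  score: number;
  weight: number;
}

/** Weighted average of judge scores; weight-0 (informational) judges drop out. */
function weightedOverall(entries: ReadonlyArray<WeightedScore>): number {
  const totalWeight = entries.reduce((sum, entry) => sum + entry.weight, 0);
  if (totalWeight === 0) return 0; // avoid 0/0 when every judge is informational
  const weightedSum = entries.reduce((sum, entry) => sum + entry.score * entry.weight, 0);
  return weightedSum / totalWeight;
}
```

With normalization, a manifest whose weights do not sum to 1 still yields an overall score in the same range as the individual scores.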
@@ -0,0 +1,171 @@
+import { bootstrap, createMemoryStore, putSchema } from "@ocas/core";
+import type { CasRef } from "@united-workforce/protocol";
+import { describe, expect, test } from "vitest";
+
+import {
+  formatDiff,
+  formatList,
+  formatReport,
+  readEvalEntries,
+  readEvalRun,
+  selectEntries,
+} from "../src/commands/index.js";
+import type { EvalRunPayload, EvalStore } from "../src/storage/index.js";
+import { EVAL_RUN_SCHEMA, setEvalLatest } from "../src/storage/index.js";
+
+function makeEvalStore(): EvalStore {
+  const store = createMemoryStore();
+  bootstrap(store);
+  return { store, varStore: store.var };
+}
+
+function makePayload(
+  task: string,
+  overall: number,
+  timestamp: number,
+  judges: EvalRunPayload["judges"] = [
+    {
+      name: "frontmatter-compliance",
+      score: 1.0,
+      weight: 0.6,
+      dataHash: "AAAAAAAAAAAAA" as CasRef,
+    },
+    { name: "token-stats", score: 0.5, weight: 0, dataHash: "BBBBBBBBBBBBB" as CasRef },
+  ],
+  config: EvalRunPayload["config"] = {
+    agent: "hermes",
+    model: "claude-sonnet-4",
+    engineVersion: "1.0.0",
+  },
+): EvalRunPayload {
+  return { task, config, threadId: "THREAD0123456789", judges, overall, timestamp };
+}
+
+/** Store an eval-run node in CAS and index it under @uwf/eval/<task>/latest. */
+function storeRun(evalStore: EvalStore, payload: EvalRunPayload): string {
+  const schemaHash = putSchema(evalStore.store, EVAL_RUN_SCHEMA);
+  const hash = evalStore.store.cas.put(schemaHash, payload);
+  setEvalLatest(evalStore.varStore, payload.task, hash);
+  return hash;
+}
+
+describe("formatReport", () => {
+  test("includes task, overall, config and judges", () => {
+    const payload = makePayload("fix-off-by-one", 0.8, Date.UTC(2026, 0, 2, 3, 4, 5));
+    const output = formatReport(payload, "RUNHASH123456");
+
+    expect(output).toContain("fix-off-by-one");
+    expect(output).toContain("0.8000");
+    expect(output).toContain("hermes");
+    expect(output).toContain("claude-sonnet-4");
+    expect(output).toContain("1.0.0");
+    expect(output).toContain("frontmatter-compliance");
+    expect(output).toContain("token-stats");
+    expect(output).toContain("THREAD0123456789");
+    expect(output).toContain("RUNHASH123456");
+  });
+
+  test("round-trips a stored run via readEvalRun", () => {
+    const evalStore = makeEvalStore();
+    const payload = makePayload("fix-off-by-one", 0.75, Date.now());
+    const hash = storeRun(evalStore, payload);
+
+    const loaded = readEvalRun(evalStore, hash);
+    expect(loaded).not.toBeNull();
+    const output = formatReport(loaded as EvalRunPayload, hash);
+    expect(output).toContain("fix-off-by-one");
+    expect(output).toContain("0.7500");
+  });
+
+  test("readEvalRun returns null for a missing hash", () => {
+    const evalStore = makeEvalStore();
+    expect(readEvalRun(evalStore, "NOPENOPENOPE0")).toBeNull();
+  });
+});
+
+describe("list", () => {
+  test("lists eval runs stored under different tasks", () => {
+    const evalStore = makeEvalStore();
+    storeRun(evalStore, makePayload("fix-off-by-one", 0.8, 2000));
+    storeRun(evalStore, makePayload("write-docs", 0.6, 1000));
+
+    const entries = readEvalEntries(evalStore);
+    expect(entries).toHaveLength(2);
+
+    const output = formatList(selectEntries(entries, null, 20));
+    expect(output).toContain("fix-off-by-one");
+    expect(output).toContain("write-docs");
+  });
+
+  test("sorts newest-first by timestamp", () => {
+    const evalStore = makeEvalStore();
+    storeRun(evalStore, makePayload("old-task", 0.5, 1000));
+    storeRun(evalStore, makePayload("new-task", 0.5, 2000));
+
+    const selected = selectEntries(readEvalEntries(evalStore), null, 20);
+    expect(selected[0]?.task).toBe("new-task");
+    expect(selected[1]?.task).toBe("old-task");
+  });
+
+  test("--task filter only shows the matching task", () => {
+    const evalStore = makeEvalStore();
+    storeRun(evalStore, makePayload("fix-off-by-one", 0.8, 2000));
+    storeRun(evalStore, makePayload("write-docs", 0.6, 1000));
+
+    const output = formatList(selectEntries(readEvalEntries(evalStore), "write-docs", 20));
+    expect(output).toContain("write-docs");
+    expect(output).not.toContain("fix-off-by-one");
+  });
+
+  test("--limit caps the number of rows", () => {
+    const evalStore = makeEvalStore();
+    storeRun(evalStore, makePayload("task-a", 0.8, 3000));
+    storeRun(evalStore, makePayload("task-b", 0.6, 2000));
+    storeRun(evalStore, makePayload("task-c", 0.4, 1000));
+
+    const selected = selectEntries(readEvalEntries(evalStore), null, 2);
+    expect(selected).toHaveLength(2);
+    expect(selected.map((e) => e.task)).toEqual(["task-a", "task-b"]);
+  });
+
+  test("empty store renders a placeholder", () => {
+    const evalStore = makeEvalStore();
+    const output = formatList(selectEntries(readEvalEntries(evalStore), null, 20));
+    expect(output).toContain("(no eval runs found)");
+  });
+});
+
+describe("formatDiff", () => {
+  test("shows an upward delta when B scores higher", () => {
+    const a = makePayload("fix-off-by-one", 0.6, 1000);
+    const b = makePayload("fix-off-by-one", 0.8, 2000);
+    const output = formatDiff(a, "HASHA00000000", b, "HASHB00000000");
+
+    expect(output).toContain("▲");
+    expect(output).toContain("HASHA00000000");
+    expect(output).toContain("HASHB00000000");
+  });
+
+  test("shows a downward delta when B scores lower", () => {
+    const a = makePayload("fix-off-by-one", 0.9, 1000);
+    const b = makePayload("fix-off-by-one", 0.4, 2000);
+    const output = formatDiff(a, "HASHA00000000", b, "HASHB00000000");
+    expect(output).toContain("▼");
+  });
+
+  test("marks differing config values", () => {
+    const a = makePayload("fix-off-by-one", 0.6, 1000, undefined, {
+      agent: "hermes",
+      model: "claude-sonnet-4",
+      engineVersion: "1.0.0",
+    });
+    const b = makePayload("fix-off-by-one", 0.6, 2000, undefined, {
+      agent: "claude-code",
+      model: "claude-sonnet-4",
+      engineVersion: "1.0.0",
+    });
+    const output = formatDiff(a, "HASHA00000000", b, "HASHB00000000");
+    expect(output).toContain("≠");
+    expect(output).toContain("claude-code");
+  });
+});
@@ -0,0 +1,74 @@
+import { mkdir, mkdtemp, readFile, rm, writeFile } from "node:fs/promises";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+
+import { afterEach, beforeEach, describe, expect, test } from "vitest";
+
+import { prepare } from "../src/runner/index.js";
+
+const TASK_YAML = `
+name: fix-off-by-one
+description: Fix an off-by-one error
+workflow: solve-issue
+prompt: "Fix the bug"
+limits:
+  maxSteps: 12
+  timeoutMinutes: 20
+judges:
+  - name: frontmatter-compliance
+    weight: 0.5
+    builtin: true
+  - name: test-pass
+    weight: 0.5
+    entry: dist/judges/test-pass.js
+`;
+
+let taskDir: string;
+
+beforeEach(async () => {
+  taskDir = await mkdtemp(join(tmpdir(), "uwf-eval-task-"));
+  await writeFile(join(taskDir, "task.yaml"), TASK_YAML, "utf8");
+  const fixtureDir = join(taskDir, "fixture");
+  await mkdir(join(fixtureDir, "src"), { recursive: true });
+  await writeFile(join(fixtureDir, "src", "calc.ts"), "export const add = (a, b) => a + b + 1;\n");
+  await writeFile(join(fixtureDir, "package.json"), '{ "name": "fixture" }\n');
+});
+
+afterEach(async () => {
+  await rm(taskDir, { recursive: true, force: true });
+});
+
+describe("prepare", () => {
+  test("returns the parsed manifest", async () => {
+    const result = await prepare(taskDir);
+    expect(result.taskDir).toBe(taskDir);
+    expect(result.manifest.name).toBe("fix-off-by-one");
+    expect(result.manifest.workflow).toBe("solve-issue");
+    expect(result.manifest.limits.maxSteps).toBe(12);
+    expect(result.manifest.judges).toHaveLength(2);
+  });
+
+  test("copies fixture into a fresh temp work dir", async () => {
+    const result = await prepare(taskDir);
+    expect(result.workDir).not.toBe(taskDir);
+    expect(result.workDir.startsWith(tmpdir())).toBe(true);
+
+    const calc = await readFile(join(result.workDir, "src", "calc.ts"), "utf8");
+    expect(calc).toContain("export const add");
+    const pkg = await readFile(join(result.workDir, "package.json"), "utf8");
+    expect(pkg).toContain("fixture");
+
+    await rm(result.workDir, { recursive: true, force: true });
+  });
+
+  test("creates an empty work dir when no fixture/ exists", async () => {
+    const noFixtureDir = await mkdtemp(join(tmpdir(), "uwf-eval-nofix-"));
+    await writeFile(join(noFixtureDir, "task.yaml"), TASK_YAML, "utf8");
+
+    const result = await prepare(noFixtureDir);
+    expect(result.workDir.startsWith(tmpdir())).toBe(true);
+
+    await rm(noFixtureDir, { recursive: true, force: true });
+    await rm(result.workDir, { recursive: true, force: true });
+  });
+});
@@ -0,0 +1,63 @@
+import { describe, expect, test } from "vitest";
+import {
+  EVAL_JUDGE_FRONTMATTER_SCHEMA,
+  EVAL_JUDGE_HALLUCINATION_SCHEMA,
+  EVAL_JUDGE_TOKEN_STATS_SCHEMA,
+  EVAL_JUDGE_UPSTREAM_SCHEMA,
+  EVAL_RUN_SCHEMA,
+} from "../src/storage/index.js";
+
+describe("OCAS schema definitions", () => {
+  test("eval-run schema has correct title and required fields", () => {
+    expect(EVAL_RUN_SCHEMA.title).toBe("@uwf/eval-run");
+    const required = EVAL_RUN_SCHEMA.required as string[];
+    expect(required).toContain("task");
+    expect(required).toContain("config");
+    expect(required).toContain("threadId");
+    expect(required).toContain("judges");
+    expect(required).toContain("overall");
+    expect(required).toContain("timestamp");
+  });
+
+  test("frontmatter judge schema has correct title", () => {
+    expect(EVAL_JUDGE_FRONTMATTER_SCHEMA.title).toBe("@uwf/eval-judge-frontmatter");
+    const required = EVAL_JUDGE_FRONTMATTER_SCHEMA.required as string[];
+    expect(required).toContain("stepsTotal");
+    expect(required).toContain("stepsValid");
+    expect(required).toContain("invalidSteps");
+  });
+
+  test("upstream judge schema has correct title", () => {
+    expect(EVAL_JUDGE_UPSTREAM_SCHEMA.title).toBe("@uwf/eval-judge-upstream");
+    const required = EVAL_JUDGE_UPSTREAM_SCHEMA.required as string[];
+    expect(required).toContain("perStep");
+  });
+
+  test("hallucination judge schema has correct title", () => {
+    expect(EVAL_JUDGE_HALLUCINATION_SCHEMA.title).toBe("@uwf/eval-judge-hallucination");
+    const required = EVAL_JUDGE_HALLUCINATION_SCHEMA.required as string[];
+    expect(required).toContain("perStep");
+  });
+
+  test("token-stats judge schema has correct title", () => {
+    expect(EVAL_JUDGE_TOKEN_STATS_SCHEMA.title).toBe("@uwf/eval-judge-token-stats");
+    const required = EVAL_JUDGE_TOKEN_STATS_SCHEMA.required as string[];
+    expect(required).toContain("totalInput");
+    expect(required).toContain("totalOutput");
+    expect(required).toContain("totalTurns");
+    expect(required).toContain("perStep");
+  });
+
+  test("all schemas have type object at root", () => {
+    const schemas = [
+      EVAL_RUN_SCHEMA,
+      EVAL_JUDGE_FRONTMATTER_SCHEMA,
+      EVAL_JUDGE_UPSTREAM_SCHEMA,
+      EVAL_JUDGE_HALLUCINATION_SCHEMA,
+      EVAL_JUDGE_TOKEN_STATS_SCHEMA,
+    ];
+    for (const s of schemas) {
+      expect(s.type).toBe("object");
+    }
+  });
+});
@@ -0,0 +1,163 @@
+import { describe, expect, test } from "vitest";
+import { parseTaskManifest } from "../src/task/index.js";
+
+const VALID_YAML = `
+name: fix-off-by-one
+description: Fix an off-by-one error in a calculator
+workflow: solve-issue
+prompt: "Fix the bug: add(1,2) returns 4 instead of 3"
+limits:
+  maxSteps: 15
+  timeoutMinutes: 30
+judges:
+  - name: frontmatter-compliance
+    weight: 0.15
+    builtin: true
+  - name: test-pass
+    weight: 0.3
+    entry: dist/judges/test-pass.js
+    schema: schemas/test-pass.json
+`;
+
+describe("parseTaskManifest", () => {
+  test("parses valid task.yaml", () => {
+    const manifest = parseTaskManifest(VALID_YAML);
+    expect(manifest.name).toBe("fix-off-by-one");
+    expect(manifest.description).toBe("Fix an off-by-one error in a calculator");
+    expect(manifest.workflow).toBe("solve-issue");
+    expect(manifest.prompt).toBe("Fix the bug: add(1,2) returns 4 instead of 3");
+    expect(manifest.limits).toEqual({ maxSteps: 15, timeoutMinutes: 30 });
+    expect(manifest.judges).toHaveLength(2);
+  });
+
+  test("parses builtin judge", () => {
+    const manifest = parseTaskManifest(VALID_YAML);
+    const builtin = manifest.judges[0];
+    expect(builtin).toBeDefined();
+    expect(builtin!.name).toBe("frontmatter-compliance");
+    expect(builtin!.weight).toBe(0.15);
+    expect(builtin!.builtin).toBe(true);
+    expect(builtin!.entry).toBeNull();
+  });
+
+  test("parses custom judge with entry + schema", () => {
+    const manifest = parseTaskManifest(VALID_YAML);
+    const custom = manifest.judges[1];
+    expect(custom).toBeDefined();
+    expect(custom!.name).toBe("test-pass");
+    expect(custom!.weight).toBe(0.3);
+    expect(custom!.builtin).toBe(false);
+    expect(custom!.entry).toBe("dist/judges/test-pass.js");
+    expect(custom!.schema).toBe("schemas/test-pass.json");
+  });
+
+  test("defaults limits when omitted", () => {
+    const yaml = `
+name: minimal
+workflow: solve-issue
+prompt: do something
+judges:
+  - name: check
+    builtin: true
+`;
+    const manifest = parseTaskManifest(yaml);
+    expect(manifest.limits).toEqual({ maxSteps: 20, timeoutMinutes: 30 });
+  });
+
+  test("defaults description to empty string", () => {
+    const yaml = `
+name: no-desc
+workflow: solve-issue
+prompt: do something
+judges:
+  - name: check
+    builtin: true
+`;
+    const manifest = parseTaskManifest(yaml);
+    expect(manifest.description).toBe("");
+  });
+
+  test("rejects missing name", () => {
+    const yaml = `
+workflow: solve-issue
+prompt: do something
+judges:
+  - name: check
+    builtin: true
+`;
+    expect(() => parseTaskManifest(yaml)).toThrow("name is required");
+  });
+
+  test("rejects missing workflow", () => {
+    const yaml = `
+name: test
+prompt: do something
+judges:
+  - name: check
+    builtin: true
+`;
+    expect(() => parseTaskManifest(yaml)).toThrow("workflow is required");
+  });
+
+  test("rejects missing prompt", () => {
+    const yaml = `
+name: test
+workflow: solve-issue
+judges:
+  - name: check
+    builtin: true
+`;
+    expect(() => parseTaskManifest(yaml)).toThrow("prompt is required");
+  });
+
+  test("rejects empty judges array", () => {
+    const yaml = `
+name: test
+workflow: solve-issue
+prompt: do something
+judges: []
+`;
+    expect(() => parseTaskManifest(yaml)).toThrow("at least one judge");
+  });
+
+  test("rejects non-builtin judge without entry", () => {
+    const yaml = `
+name: test
+workflow: solve-issue
+prompt: do something
+judges:
+  - name: custom-check
+    weight: 0.5
+`;
+    expect(() => parseTaskManifest(yaml)).toThrow("non-builtin judge must have entry");
+  });
+
+  test("rejects non-object YAML root", () => {
+    expect(() => parseTaskManifest("just a string")).toThrow("must be a YAML mapping");
+  });
+
+  test("rejects judge without name", () => {
+    const yaml = `
+name: test
+workflow: solve-issue
+prompt: do something
+judges:
+  - weight: 0.5
+    builtin: true
+`;
+    expect(() => parseTaskManifest(yaml)).toThrow("name is required");
+  });
+
+  test("defaults weight to 0 when omitted", () => {
+    const yaml = `
+name: test
+workflow: solve-issue
+prompt: do something
+judges:
+  - name: token-stats
+    builtin: true
+`;
+    const manifest = parseTaskManifest(yaml);
+    expect(manifest.judges[0]!.weight).toBe(0);
+  });
+});
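The judge-related cases above imply a small normalization step per judge entry: `name` is required, `weight` defaults to 0, `builtin` defaults to false, and a non-builtin judge must supply `entry`. A minimal sketch of that step on an already-parsed YAML mapping follows. `normalizeJudge` and `JudgeEntrySketch` are hypothetical names, and the exact error wording beyond the substrings the tests match is an assumption.

```typescript
// Hypothetical sketch of per-judge normalization; not the package's parseTaskManifest.
interface JudgeEntrySketch {
  name: string;
  weight: number;
  builtin: boolean;
  entry: string | null;
  schema: string | null;
}

function normalizeJudge(raw: Record<string, unknown>): JudgeEntrySketch {
  const name = raw["name"];
  if (typeof name !== "string" || name.length === 0) {
    throw new Error("judge: name is required");
  }
  const rawWeight = raw["weight"];
  const rawEntry = raw["entry"];
  const rawSchema = raw["schema"];
  const builtin = raw["builtin"] === true; // anything else is treated as non-builtin
  const entry = typeof rawEntry === "string" ? rawEntry : null;
  if (!builtin && entry === null) {
    throw new Error(`judge ${name}: non-builtin judge must have entry`);
  }
  return {
    name,
    weight: typeof rawWeight === "number" ? rawWeight : 0, // informational by default
    builtin,
    entry,
    schema: typeof rawSchema === "string" ? rawSchema : null,
  };
}
```

Defaulting `weight` to 0 means a judge listed without a weight is recorded but does not move the overall score.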
@@ -0,0 +1,45 @@
+{
+  "name": "@united-workforce/eval",
+  "version": "0.1.3",
+  "private": false,
+  "files": [
+    "src",
+    "dist",
+    "package.json"
+  ],
+  "type": "module",
+  "bin": {
+    "uwf-eval": "./dist/cli.js"
+  },
+  "exports": {
+    ".": {
+      "types": "./dist/index.d.ts",
+      "import": "./dist/index.js"
+    }
+  },
+  "scripts": {
+    "test": "vitest run __tests__/",
+    "test:ci": "vitest run __tests__/"
+  },
+  "dependencies": {
+    "@ocas/core": "^0.3.0",
+    "@ocas/fs": "^0.3.0",
+    "@united-workforce/protocol": "workspace:^",
+    "@united-workforce/util": "workspace:^",
+    "commander": "^14.0.3",
+    "yaml": "^2.9.0"
+  },
+  "devDependencies": {
+    "typescript": "^5.8.3"
+  },
+  "repository": {
+    "type": "git",
+    "url": "https://git.shazhou.work/shazhou/united-workforce.git",
+    "directory": "packages/eval"
+  },
+  "homepage": "https://git.shazhou.work/shazhou/united-workforce#readme",
+  "bugs": {
+    "url": "https://git.shazhou.work/shazhou/united-workforce/issues"
+  },
+  "license": "MIT"
+}
@@ -0,0 +1,25 @@
+#!/usr/bin/env node
+import { Command } from "commander";
+import {
+  registerDiffCommand,
+  registerListCommand,
+  registerReportCommand,
+  registerRunCommand,
+} from "./commands/index.js";
+
+// eslint-disable-next-line -- dynamic import for version
+const pkg = await import("../package.json", { with: { type: "json" } });
+
+const program = new Command();
+
+program
+  .name("uwf-eval")
+  .description("Evaluate uwf workflow quality with real agents")
+  .version(pkg.default.version, "-V, --version");
+
+registerRunCommand(program);
+registerReportCommand(program);
+registerDiffCommand(program);
+registerListCommand(program);
+
+await program.parseAsync();
@@ -0,0 +1,38 @@
+import { createLogger } from "@united-workforce/util";
+import type { Command } from "commander";
+
+import { createEvalStore } from "../storage/index.js";
+import { formatDiff } from "./format.js";
+import { readEvalRun } from "./read.js";
+
+const log = createLogger({ sink: { kind: "stderr" } });
+const LOG_DIFF = "D3WZ8N5T";
+
+export function registerDiffCommand(program: Command): void {
+  program
+    .command("diff <hash1> <hash2>")
+    .description("Compare two eval runs side-by-side")
+    .action(async (hash1: string, hash2: string) => {
+      try {
+        const evalStore = await createEvalStore();
+        const payloadA = readEvalRun(evalStore, hash1);
+        if (payloadA === null) {
+          process.stderr.write(`eval run not found: ${hash1}\n`);
+          process.exitCode = 1;
+          return;
+        }
+        const payloadB = readEvalRun(evalStore, hash2);
+        if (payloadB === null) {
+          process.stderr.write(`eval run not found: ${hash2}\n`);
+          process.exitCode = 1;
+          return;
+        }
+        log(LOG_DIFF, `diff a=${hash1} b=${hash2}`);
+        process.stdout.write(formatDiff(payloadA, hash1, payloadB, hash2));
+      } catch (e) {
+        const message = e instanceof Error ? e.message : String(e);
+        process.stderr.write(`${message}\n`);
+        process.exitCode = 1;
+      }
+    });
+}
@@ -0,0 +1,148 @@
+import type { EvalRunPayload } from "../storage/index.js";
+import type { EvalListEntry } from "./types.js";
+
+const NAME_WIDTH = 28;
+const SCORE_WIDTH = 10;
+const TIMESTAMP_WIDTH = 26;
+
+/** Format a 0..1 score (or weight) with fixed precision. */
+function formatScore(value: number): string {
+  return value.toFixed(4);
+}
+
+/** Human-readable ISO-8601 timestamp from epoch milliseconds. */
+function formatTimestamp(ms: number): string {
+  return new Date(ms).toISOString();
+}
+
+/** Right-pad to a fixed column width (with a trailing space if already full). */
+function pad(value: string, width: number): string {
+  return value.length >= width ? `${value} ` : value.padEnd(width);
+}
+
+/** Directional indicator for a score delta (B relative to A). */
+function formatDelta(delta: number): string {
+  if (delta > 0) {
+    return `▲ +${formatScore(delta)}`;
+  }
+  if (delta < 0) {
+    return `▼ ${formatScore(delta)}`;
+  }
+  return `= ${formatScore(0)}`;
+}
+
+/** Render a single eval run as a human-readable report. */
+export function formatReport(payload: EvalRunPayload, runHash: string): string {
+  const lines: string[] = [];
+  lines.push("=== Eval Report ===");
+  lines.push(`Task:       ${payload.task}`);
+  lines.push(`Overall:    ${formatScore(payload.overall)}`);
+  lines.push(`Timestamp:  ${formatTimestamp(payload.timestamp)}`);
+  lines.push("");
+  lines.push("Config:");
+  lines.push(`  Agent:    ${payload.config.agent}`);
+  lines.push(`  Model:    ${payload.config.model}`);
+  lines.push(`  Engine:   ${payload.config.engineVersion}`);
+  lines.push("");
+  lines.push("Judges:");
+  lines.push(`  ${pad("NAME", NAME_WIDTH)}${pad("SCORE", SCORE_WIDTH)}WEIGHT`);
+  for (const judge of payload.judges) {
+    lines.push(
+      `  ${pad(judge.name, NAME_WIDTH)}${pad(formatScore(judge.score), SCORE_WIDTH)}${formatScore(judge.weight)}`,
+    );
+  }
+  lines.push("");
+  lines.push(`Thread:     ${payload.threadId}`);
+  lines.push(`Run:        ${runHash}`);
+  return `${lines.join("\n")}\n`;
+}
+
+/** Render a side-by-side comparison of two eval runs. */
+export function formatDiff(
+  payloadA: EvalRunPayload,
+  hashA: string,
+  payloadB: EvalRunPayload,
+  hashB: string,
+): string {
+  const lines: string[] = [];
+  lines.push("=== Eval Diff ===");
+  lines.push(`A: ${hashA}  (${payloadA.task})`);
+  lines.push(`B: ${hashB}  (${payloadB.task})`);
+  lines.push("");
+
+  const overallDelta = payloadB.overall - payloadA.overall;
+  lines.push("Overall:");
+  lines.push(
+    `  A=${formatScore(payloadA.overall)}  B=${formatScore(payloadB.overall)}  ${formatDelta(overallDelta)}`,
+  );
+  lines.push("");
+
+  lines.push("Config:");
+  lines.push(configLine("Agent", payloadA.config.agent, payloadB.config.agent));
+  lines.push(configLine("Model", payloadA.config.model, payloadB.config.model));
+  lines.push(configLine("Engine", payloadA.config.engineVersion, payloadB.config.engineVersion));
+  lines.push("");
+
+  lines.push("Judges:");
+  lines.push(`  ${pad("NAME", NAME_WIDTH)}${pad("A", SCORE_WIDTH)}${pad("B", SCORE_WIDTH)}DELTA`);
+  const scoresA = new Map(payloadA.judges.map((judge) => [judge.name, judge.score]));
+  const scoresB = new Map(payloadB.judges.map((judge) => [judge.name, judge.score]));
+  for (const name of unionJudgeNames(payloadA, payloadB)) {
+    const scoreA = scoresA.get(name);
+    const scoreB = scoresB.get(name);
+    const cellA = scoreA === undefined ? "—" : formatScore(scoreA);
+    const cellB = scoreB === undefined ? "—" : formatScore(scoreB);
+    const delta = scoreA !== undefined && scoreB !== undefined ? formatDelta(scoreB - scoreA) : "";
+    lines.push(
+      `  ${pad(name, NAME_WIDTH)}${pad(cellA, SCORE_WIDTH)}${pad(cellB, SCORE_WIDTH)}${delta}`,
+    );
+  }
+  return `${lines.join("\n")}\n`;
+}
+
+/** Render a table of indexed eval runs. */
+export function formatList(entries: ReadonlyArray<EvalListEntry>): string {
+  const lines: string[] = [];
+  lines.push(
+    `  ${pad("TASK", NAME_WIDTH)}${pad("OVERALL", SCORE_WIDTH)}${pad("TIMESTAMP", TIMESTAMP_WIDTH)}HASH`,
+  );
+  if (entries.length === 0) {
+    lines.push("  (no eval runs found)");
+  }
+  for (const entry of entries) {
+    lines.push(
+      `  ${pad(entry.task, NAME_WIDTH)}${pad(formatScore(entry.overall), SCORE_WIDTH)}${pad(formatTimestamp(entry.timestamp), TIMESTAMP_WIDTH)}${entry.hash}`,
+    );
+  }
+  return `${lines.join("\n")}\n`;
+}
+
+/** Sort newest-first, then apply optional task filter and result limit. */
+export function selectEntries(
+  entries: ReadonlyArray<EvalListEntry>,
+  task: string | null,
+  limit: number | null,
+): EvalListEntry[] {
+  const sorted = [...entries].sort((a, b) => b.timestamp - a.timestamp);
+  const filtered = task !== null ? sorted.filter((entry) => entry.task === task) : sorted;
+  return limit !== null ? filtered.slice(0, limit) : filtered;
+}
+
+/** Ordered union of judge names: A's order first, then B-only names. */
+function unionJudgeNames(payloadA: EvalRunPayload, payloadB: EvalRunPayload): string[] {
+  const names: string[] = [];
+  const seen = new Set<string>();
+  for (const judge of [...payloadA.judges, ...payloadB.judges]) {
+    if (!seen.has(judge.name)) {
+      seen.add(judge.name);
+      names.push(judge.name);
+    }
+  }
+  return names;
+}
+
+/** One config row: `=` when equal, `≠` otherwise. */
+function configLine(label: string, valueA: string, valueB: string): string {
+  const marker = valueA === valueB ? "=" : "≠";
+  return `  ${pad(`${label}:`, SCORE_WIDTH)}${marker} A=${valueA}  B=${valueB}`;
+}
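Review note: the ordering in `selectEntries` (sort, then filter, then limit) is easiest to see on a small input. The sketch below restates the function from this hunk with a simplified `EvalListEntry` (the `hash` field is typed as a plain `string` here instead of `CasRef`, which is an assumption made only to keep the example self-contained); the sample entries and hash values are made up.

```typescript
// Simplified stand-in for EvalListEntry; `hash` is a plain string here (assumption).
type EvalListEntry = { task: string; overall: number; timestamp: number; hash: string };

// Same logic as the hunk above: newest first, optional task filter, optional limit.
function selectEntries(
  entries: ReadonlyArray<EvalListEntry>,
  task: string | null,
  limit: number | null,
): EvalListEntry[] {
  const sorted = [...entries].sort((a, b) => b.timestamp - a.timestamp);
  const filtered = task !== null ? sorted.filter((entry) => entry.task === task) : sorted;
  return limit !== null ? filtered.slice(0, limit) : filtered;
}

// Made-up sample data, deliberately out of chronological order.
const sample: EvalListEntry[] = [
  { task: "refactor", overall: 0.5, timestamp: 100, hash: "hash-a" },
  { task: "triage", overall: 0.9, timestamp: 300, hash: "hash-b" },
  { task: "refactor", overall: 0.7, timestamp: 200, hash: "hash-c" },
];

// No task filter, limit 2: the two newest entries across all tasks.
const newestTwo = selectEntries(sample, null, 2).map((entry) => entry.hash);
// Task filter, no limit: only "refactor" runs, still newest first.
const refactorOnly = selectEntries(sample, "refactor", null).map((entry) => entry.hash);

console.log(newestTwo, refactorOnly);
```

Because the limit is applied after the filter, `--task` combined with `--limit` returns up to `n` runs of that task rather than filtering the `n` newest runs overall.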
@@ -0,0 +1,7 @@
+export { registerDiffCommand } from "./diff.js";
+export { formatDiff, formatList, formatReport, selectEntries } from "./format.js";
+export { registerListCommand } from "./list.js";
+export { readEvalEntries, readEvalRun } from "./read.js";
+export { registerReportCommand } from "./report.js";
+export { registerRunCommand } from "./run.js";
+export type { EvalListEntry } from "./types.js";
@@ -0,0 +1,43 @@
+import { createLogger } from "@united-workforce/util";
+import type { Command } from "commander";
+
+import { createEvalStore } from "../storage/index.js";
+import { formatList, selectEntries } from "./format.js";
+import { readEvalEntries } from "./read.js";
+
+const log = createLogger({ sink: { kind: "stderr" } });
+const LOG_LIST = "L5KX9R2B";
+
+type ListCliOptions = {
+  task: string | undefined;
+  limit: string;
+};
+
+export function registerListCommand(program: Command): void {
+  program
+    .command("list")
+    .description("List past eval runs")
+    .option("--task <name>", "filter by task name")
+    .option("--limit <n>", "max results", "20")
+    .action(async (opts: ListCliOptions) => {
+      const limit = Number.parseInt(opts.limit, 10);
+      if (!Number.isInteger(limit) || limit < 1) {
+        process.stderr.write("--limit must be a positive integer\n");
+        process.exitCode = 1;
+        return;
+      }
+
+      try {
+        const evalStore = await createEvalStore();
+        const entries = readEvalEntries(evalStore);
+        const task = opts.task ?? null;
+        const selected = selectEntries(entries, task, limit);
+        log(LOG_LIST, `list task=${task ?? "*"} found=${entries.length} shown=${selected.length}`);
+        process.stdout.write(formatList(selected));
+      } catch (e) {
+        const message = e instanceof Error ? e.message : String(e);
+        process.stderr.write(`${message}\n`);
+        process.exitCode = 1;
+      }
+    });
+}
@@ -0,0 +1,41 @@
+import type { EvalRunPayload, EvalStore } from "../storage/index.js";
+import type { EvalListEntry } from "./types.js";
+
+/** Variable prefix and suffix for eval run pointers (`@uwf/eval/<task>/latest`). */
+const EVAL_VAR_PREFIX = "@uwf/eval/";
+const EVAL_VAR_SUFFIX = "/latest";
+
+/** Read a single eval-run payload from CAS. Returns null when the node is absent. */
+export function readEvalRun(evalStore: EvalStore, hash: string): EvalRunPayload | null {
+  const node = evalStore.store.cas.get(hash);
+  if (node === null) {
+    return null;
+  }
+  return node.payload as EvalRunPayload;
+}
+
+/**
+ * Read every indexed eval run by scanning `@uwf/eval/*\/latest` variables and
+ * loading the referenced CAS node. Dangling pointers are skipped.
+ */
+export function readEvalEntries(evalStore: EvalStore): EvalListEntry[] {
+  const { store, varStore } = evalStore;
+  const entries: EvalListEntry[] = [];
+  for (const variable of varStore.list()) {
+    if (!variable.name.startsWith(EVAL_VAR_PREFIX) || !variable.name.endsWith(EVAL_VAR_SUFFIX)) {
+      continue;
+    }
+    const node = store.cas.get(variable.value);
+    if (node === null) {
+      continue;
+    }
+    const payload = node.payload as EvalRunPayload;
+    entries.push({
+      task: payload.task,
+      overall: payload.overall,
+      timestamp: payload.timestamp,
+      hash: variable.value,
+    });
+  }
+  return entries;
+}
@@ -0,0 +1,32 @@
+import { createLogger } from "@united-workforce/util";
+import type { Command } from "commander";
+
+import { createEvalStore } from "../storage/index.js";
+import { formatReport } from "./format.js";
+import { readEvalRun } from "./read.js";
+
+const log = createLogger({ sink: { kind: "stderr" } });
+const LOG_REPORT = "R7QP2M4K";
+
+export function registerReportCommand(program: Command): void {
+  program
+    .command("report <hash>")
+    .description("Show eval run results")
+    .action(async (hash: string) => {
+      try {
+        const evalStore = await createEvalStore();
+        const payload = readEvalRun(evalStore, hash);
+        if (payload === null) {
+          process.stderr.write(`eval run not found: ${hash}\n`);
+          process.exitCode = 1;
+          return;
+        }
+        log(LOG_REPORT, `report task=${payload.task} hash=${hash}`);
+        process.stdout.write(formatReport(payload, hash));
+      } catch (e) {
+        const message = e instanceof Error ? e.message : String(e);
+        process.stderr.write(`${message}\n`);
+        process.exitCode = 1;
+      }
+    });
+}
@@ -0,0 +1,84 @@
+import { resolve } from "node:path";
+
+import type { Command } from "commander";
+import type { RunResult } from "../runner/index.js";
+import { collect, execute, getEngineVersion, prepare } from "../runner/index.js";
+import type { EvalRunConfig } from "../storage/index.js";
+import { createEvalStore } from "../storage/index.js";
+
+type RunCliOptions = {
+  agent: string;
+  model: string | undefined;
+  count: string;
+};
+
+async function runOnce(
+  taskDir: string,
+  agent: string,
+  model: string,
+  engineVersion: string,
+): Promise<RunResult> {
+  const prepared = await prepare(taskDir);
+  const { manifest, workDir } = prepared;
+
+  const { threadId } = await execute({
+    workDir,
+    workflow: manifest.workflow,
+    prompt: manifest.prompt,
+    agent,
+    maxSteps: manifest.limits.maxSteps,
+  });
+
+  const evalStore = await createEvalStore();
+  const config: EvalRunConfig = { agent, model, engineVersion };
+  const collected = await collect({
+    evalStore,
+    taskDir: prepared.taskDir,
+    workDir,
+    threadId,
+    manifest,
+    config,
+  });
+
+  return {
+    runHash: collected.runHash,
+    overall: collected.overall,
+    task: manifest.name,
+    judges: collected.judges,
+  };
+}
+
+export function registerRunCommand(program: Command): void {
+  program
+    .command("run <task>")
+    .description("Run eval on a task directory or tarball")
+    .option("--agent <name>", "agent adapter to use", "uwf-hermes")
+    .option("--model <model>", "model override")
+    .option("--count <n>", "number of eval runs", "1")
+    .action(async (task: string, opts: RunCliOptions) => {
+      const taskDir = resolve(task);
+      const agent = opts.agent;
+      const model = opts.model ?? "";
+      const count = Number.parseInt(opts.count, 10);
+      if (!Number.isInteger(count) || count < 1) {
+        process.stderr.write("--count must be a positive integer\n");
+        process.exitCode = 1;
+        return;
+      }
+
+      const engineVersion = getEngineVersion();
+
+      try {
+        const results: RunResult[] = [];
+        for (let i = 0; i < count; i++) {
+          results.push(await runOnce(taskDir, agent, model, engineVersion));
+        }
+        const output = count === 1 ? results[0] : results;
+        process.stdout.write(`${JSON.stringify(output)}\n`);
+      } catch (e) {
+        const message = e instanceof Error ? e.message : String(e);
+        process.stderr.write(`${message}\n`);
+        process.exitCode = 1;
+      }
+    });
+}
@@ -0,0 +1,9 @@
+import type { CasRef } from "@united-workforce/protocol";
+
+/** Summary row for the `list` command: one indexed eval run. */
+export type EvalListEntry = {
+  task: string;
+  overall: number;
+  timestamp: number;
+  hash: CasRef;
+};
@@ -0,0 +1,34 @@
+// Judge types
+export type { JudgeInput, JudgeOutput } from "./judge/index.js";
+export type {
+  CollectInput,
+  CollectResult,
+  ExecuteInput,
+  ExecuteResult,
+  JudgeRunner,
+  JudgeRunOutput,
+  JudgeSummary,
+  PrepareResult,
+  RunOptions,
+  RunResult,
+} from "./runner/index.js";
+// Runner (prepare → execute → collect)
+export { collect, computeOverall, execute, getEngineVersion, prepare } from "./runner/index.js";
+export type {
+  EvalJudgeRecord,
+  EvalRunConfig,
+  EvalRunPayload,
+  EvalStore,
+} from "./storage/index.js";
+// Storage schemas and types
+export {
+  createEvalStore,
+  EVAL_JUDGE_FRONTMATTER_SCHEMA,
+  EVAL_JUDGE_HALLUCINATION_SCHEMA,
+  EVAL_JUDGE_TOKEN_STATS_SCHEMA,
+  EVAL_JUDGE_UPSTREAM_SCHEMA,
+  EVAL_RUN_SCHEMA,
+  setEvalLatest,
+} from "./storage/index.js";
+export type { JudgeEntry, TaskLimits, TaskManifest } from "./task/index.js";
+export { loadTaskManifest, parseTaskManifest } from "./task/index.js";
@@ -0,0 +1,105 @@
+import { createLogger } from "@united-workforce/util";
+import { parse as parseYaml } from "yaml";
+
+import { EVAL_JUDGE_FRONTMATTER_SCHEMA } from "../../storage/index.js";
+import { readThreadSteps } from "./read-steps.js";
+import type { BuiltinJudgeOutput } from "./types.js";
+
+const log = createLogger({ sink: { kind: "stderr" } });
+
+const LOG_RESULT = "F2QH7R4M";
+
+const FENCE = "---";
+
+type InvalidStep = {
+  stepIndex: number;
+  role: string;
+  errors: string[];
+};
+
+/**
+ * Extract the YAML frontmatter block from a step output. Returns the inner YAML
+ * string when the output starts with a `---\n` block closed by a `\n---` fence,
+ * otherwise null.
+ */
+function extractFrontmatterYaml(output: unknown): string | null {
+  if (typeof output !== "string") {
+    return null;
+  }
+  if (!output.startsWith(`${FENCE}\n`)) {
+    return null;
+  }
+  const rest = output.slice(FENCE.length + 1);
+  const closeIndex = rest.indexOf(`\n${FENCE}`);
+  if (closeIndex === -1) {
+    return null;
+  }
+  return rest.slice(0, closeIndex);
+}
+
+/** Validate a single step's frontmatter, returning a list of errors (empty = valid). */
+function validateStepFrontmatter(output: unknown): string[] {
+  // CAS stores the extracted output as a JSON object after the extract pipeline.
+  // Accept both: parsed object (from step.output) or raw markdown string.
+  if (typeof output === "object" && output !== null && !Array.isArray(output)) {
+    const status = (output as Record<string, unknown>).$status;
+    if (typeof status !== "string" || status.trim() === "") {
+      return ["$status field is missing or not a non-empty string"];
+    }
+    return [];
+  }
+
+  const yaml = extractFrontmatterYaml(output);
+  if (yaml === null) {
+    return ["output does not begin with a valid '---' frontmatter block"];
+  }
+
+  let parsed: unknown;
+  try {
+    parsed = parseYaml(yaml);
+  } catch (e) {
+    const message = e instanceof Error ? e.message : String(e);
+    return [`frontmatter YAML failed to parse: ${message}`];
+  }
+
+  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
+    return ["frontmatter is not a YAML mapping"];
+  }
+
+  const status = (parsed as Record<string, unknown>).$status;
+  if (typeof status !== "string" || status.trim() === "") {
+    return ["$status field is missing or not a non-empty string"];
+  }
+
+  return [];
+}
+
+/**
+ * Deterministic judge: every step's output must carry a non-empty `$status`,
+ * either as an already-extracted object or inside valid YAML frontmatter.
+ * Score = stepsValid / stepsTotal (0 when there are no steps).
+ */
+export async function runFrontmatterJudge(threadId: string): Promise<BuiltinJudgeOutput> {
+  const steps = readThreadSteps(threadId);
+
+  const invalidSteps: InvalidStep[] = [];
+  for (let i = 0; i < steps.length; i++) {
+    const step = steps[i];
+    const errors = validateStepFrontmatter(step.output);
+    if (errors.length > 0) {
+      invalidSteps.push({ stepIndex: i, role: step.role, errors });
+    }
+  }
+
+  const stepsTotal = steps.length;
+  const stepsValid = stepsTotal - invalidSteps.length;
+  const score = stepsTotal > 0 ? stepsValid / stepsTotal : 0;
+
+  log(LOG_RESULT, `frontmatter thread=${threadId} valid=${stepsValid}/${stepsTotal}`);
+
+  return {
+    score,
+    data: { stepsTotal, stepsValid, invalidSteps },
+    schema: EVAL_JUDGE_FRONTMATTER_SCHEMA,
+  };
+}
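Review note: the fence matching in `extractFrontmatterYaml` is subtle enough that a few concrete inputs help. The sketch below copies the function from this hunk unchanged and runs it on made-up step outputs; YAML parsing is left out so the example has no dependencies.

```typescript
const FENCE = "---";

// Copied from the hunk above: returns the text between the opening `---\n`
// and the first following `\n---`, or null when either fence is missing.
function extractFrontmatterYaml(output: unknown): string | null {
  if (typeof output !== "string") {
    return null;
  }
  if (!output.startsWith(`${FENCE}\n`)) {
    return null;
  }
  const rest = output.slice(FENCE.length + 1);
  const closeIndex = rest.indexOf(`\n${FENCE}`);
  if (closeIndex === -1) {
    return null;
  }
  return rest.slice(0, closeIndex);
}

// Made-up step outputs.
const extracted = extractFrontmatterYaml("---\n$status: done\n---\nbody text");
const missingFence = extractFrontmatterYaml("body text only");
const unclosedFence = extractFrontmatterYaml("---\n$status: done\n");
// An empty block has no newline before its closing fence, so it is rejected too.
const emptyBlock = extractFrontmatterYaml("---\n---\nbody text");

console.log(extracted, missingFence, unclosedFence, emptyBlock);
```

Only the first case yields YAML to validate. The other three return `null`, which `validateStepFrontmatter` reports as a missing frontmatter block and which counts the step as invalid in the score.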
@@ -0,0 +1,17 @@
+import { EVAL_JUDGE_HALLUCINATION_SCHEMA } from "../../storage/index.js";
+import type { BuiltinJudgeOutput } from "./types.js";
+
+/**
+ * LLM-as-judge: detects claims in each step's output that are not grounded in
+ * the available context (hallucinations).
+ *
+ * TODO: LLM-as-judge — needs provider config to call LLM API. Returns a stub
+ * (score 0, empty perStep) until the LLM call path is wired up.
+ */
+export async function runHallucinationJudge(_threadId: string): Promise<BuiltinJudgeOutput> {
+  return {
+    score: 0,
+    data: { perStep: [] },
+    schema: EVAL_JUDGE_HALLUCINATION_SCHEMA,
+  };
+}
@@ -0,0 +1,6 @@
+export { runFrontmatterJudge } from "./frontmatter.js";
+export { runHallucinationJudge } from "./hallucination.js";
+export { readThreadSteps } from "./read-steps.js";
+export { runTokenStatsJudge } from "./token-stats.js";
+export type { BuiltinJudge, BuiltinJudgeOutput } from "./types.js";
+export { runUpstreamJudge } from "./upstream.js";
@@ -0,0 +1,14 @@
+import { execFileSync } from "node:child_process";
+
+import type { StepEntry, ThreadStepsOutput } from "@united-workforce/protocol";
+
+/** Shell out to `uwf step list` (honors `UWF_BIN`) and return the step entries, minus the start entry. */
+export function readThreadSteps(threadId: string): StepEntry[] {
+  const stdout = execFileSync(process.env.UWF_BIN || "uwf", ["step", "list", threadId], {
+    encoding: "utf8",
+    stdio: ["ignore", "pipe", "pipe"],
+  }).trim();
+  const parsed = JSON.parse(stdout) as ThreadStepsOutput;
+  // steps[0] is the StartEntry; the rest are StepEntry records.
+  return parsed.steps.slice(1) as StepEntry[];
+}
@@ -0,0 +1,53 @@
+import { createLogger } from "@united-workforce/util";
+
+import { EVAL_JUDGE_TOKEN_STATS_SCHEMA } from "../../storage/index.js";
+import { readThreadSteps } from "./read-steps.js";
+import type { BuiltinJudgeOutput } from "./types.js";
+
+const log = createLogger({ sink: { kind: "stderr" } });
+
+const LOG_RESULT = "T7KQ3M9P";
+
+type PerStepStats = {
+  role: string;
+  inputTokens: number;
+  outputTokens: number;
+  turns: number;
+  duration: number;
+};
+
+/**
+ * Informational judge: aggregate token usage across every step. Always scores
+ * 1.0 — it never penalizes a run, it only reports usage. Steps with null usage
+ * contribute zeros.
+ */
+export async function runTokenStatsJudge(threadId: string): Promise<BuiltinJudgeOutput> {
+  const steps = readThreadSteps(threadId);
+
+  let totalInput = 0;
+  let totalOutput = 0;
+  let totalTurns = 0;
+  const perStep: PerStepStats[] = [];
+
+  for (const step of steps) {
+    const usage = step.usage;
+    const inputTokens = usage !== null ? usage.inputTokens : 0;
+    const outputTokens = usage !== null ? usage.outputTokens : 0;
+    const turns = usage !== null ? usage.turns : 0;
+    const duration = usage !== null ? usage.duration : 0;
+
+    totalInput += inputTokens;
+    totalOutput += outputTokens;
+    totalTurns += turns;
+
+    perStep.push({ role: step.role, inputTokens, outputTokens, turns, duration });
+  }
+
+  log(LOG_RESULT, `token-stats thread=${threadId} in=${totalInput} out=${totalOutput}`);
+
+  return {
+    score: 1.0,
+    data: { totalInput, totalOutput, totalTurns, perStep },
+    schema: EVAL_JUDGE_TOKEN_STATS_SCHEMA,
+  };
+}
@@ -0,0 +1,16 @@
+import type { JSONSchema } from "@ocas/core";
+
+/**
+ * Output produced by a builtin judge. Structurally identical to the runner's
+ * `JudgeRunOutput`; defined locally to keep the judge module free of a
+ * dependency on the runner module.
+ */
+export type BuiltinJudgeOutput = {
+  score: number;
+  data: unknown;
+  /** Schema describing `data`, used when persisting to CAS. */
+  schema: JSONSchema;
+};
+
+/** A builtin judge analyzes a thread's steps and returns a scored result. */
+export type BuiltinJudge = (threadId: string) => Promise<BuiltinJudgeOutput>;
@@ -0,0 +1,17 @@
+import { EVAL_JUDGE_UPSTREAM_SCHEMA } from "../../storage/index.js";
+import type { BuiltinJudgeOutput } from "./types.js";
+
+/**
+ * LLM-as-judge: measures how well each role consumed the relevant outputs from
+ * upstream steps.
+ *
+ * TODO: LLM-as-judge — needs provider config to call LLM API. Returns a stub
+ * (score 0, empty perStep) until the LLM call path is wired up.
+ */
+export async function runUpstreamJudge(_threadId: string): Promise<BuiltinJudgeOutput> {
+  return {
+    score: 0,
+    data: { perStep: [] },
+    schema: EVAL_JUDGE_UPSTREAM_SCHEMA,
+  };
+}
@@ -0,0 +1,10 @@
+export {
+  type BuiltinJudge,
+  type BuiltinJudgeOutput,
+  readThreadSteps,
+  runFrontmatterJudge,
+  runHallucinationJudge,
+  runTokenStatsJudge,
+  runUpstreamJudge,
+} from "./builtin/index.js";
+export type { JudgeInput, JudgeOutput } from "./types.js";
@@ -0,0 +1,15 @@
+/** Output shape every judge must produce on stdout (JSON). */
+export type JudgeOutput<T = unknown> = {
+  /** Score between 0.0 and 1.0. */
+  score: number;
+  /** Judge-specific structured data, stored in CAS with its own schema. */
+  data: T;
+};
+
+/** Input context passed to judge scripts via argv. */
+export type JudgeInput = {
+  /** Working directory where the task was executed. */
+  cwd: string;
+  /** Thread ID of the eval run. */
+  threadId: string;
+};
@@ -0,0 +1,172 @@
+import { execFileSync } from "node:child_process";
+import { readFile } from "node:fs/promises";
+import { resolve } from "node:path";
+
+import type { JSONSchema, Store } from "@ocas/core";
+import { putSchema } from "@ocas/core";
+import type { CasRef } from "@united-workforce/protocol";
+import { createLogger } from "@united-workforce/util";
+
+import type { JudgeOutput } from "../judge/index.js";
+import {
+  runFrontmatterJudge,
+  runHallucinationJudge,
+  runTokenStatsJudge,
+  runUpstreamJudge,
+} from "../judge/index.js";
+import type { EvalJudgeRecord, EvalRunPayload } from "../storage/index.js";
+import { EVAL_RUN_SCHEMA, setEvalLatest } from "../storage/index.js";
+import type { JudgeEntry } from "../task/index.js";
+import type {
+  CollectInput,
+  CollectResult,
+  JudgeRunner,
+  JudgeRunOutput,
+  JudgeSummary,
+} from "./types.js";
+
+const log = createLogger({ sink: { kind: "stderr" } });
+
+const LOG_JUDGE = "CT6N3P2K";
+const LOG_STORED = "CT9V2Q7M";
+
+/** Permissive schema for judge data without a dedicated schema (e.g. builtin placeholders). */
+const GENERIC_DATA_SCHEMA: JSONSchema = { type: "object" };
+
+/**
+ * Compute the weighted overall score. Judges with weight 0 are informational
+ * and do not affect the result (they contribute 0 to both numerator and
+ * denominator). Returns 0 when total weight is 0.
+ */
+export function computeOverall(judges: ReadonlyArray<{ score: number; weight: number }>): number {
+  let totalWeight = 0;
+  let weighted = 0;
+  for (const judge of judges) {
+    totalWeight += judge.weight;
+    weighted += judge.score * judge.weight;
+  }
+  return totalWeight > 0 ? weighted / totalWeight : 0;
+}
+
+/** Run a task judge script (`node <entry> <cwd> <threadId>`) and parse its last stdout line as JSON. */
+async function runTaskJudge(
+  taskDir: string,
+  workDir: string,
+  threadId: string,
+  judge: JudgeEntry,
+): Promise<JudgeRunOutput> {
+  if (judge.entry === null) {
+    throw new Error(`judge "${judge.name}" is not builtin but has no entry`);
+  }
+  const entryPath = resolve(taskDir, judge.entry);
+
+  let stdout: string;
+  try {
+    stdout = execFileSync("node", [entryPath, workDir, threadId], {
+      encoding: "utf8",
+      stdio: ["ignore", "pipe", "pipe"],
+      maxBuffer: 50 * 1024 * 1024,
+    });
+  } catch (e) {
+    const message = e instanceof Error ? e.message : String(e);
+    throw new Error(`judge "${judge.name}" failed: ${message}`);
+  }
+
+  const line = stdout.trim().split("\n").pop()?.trim() ?? "";
+  let parsed: unknown;
+  try {
+    parsed = JSON.parse(line);
+  } catch {
+    throw new Error(`judge "${judge.name}" stdout is not valid JSON: ${line || "(empty)"}`);
+  }
+  const output = parsed as JudgeOutput | null;
+  if (output === null || typeof output.score !== "number") {
+    throw new Error(`judge "${judge.name}" output missing numeric score`);
+  }
+
+  const schema =
+    judge.schema !== null ? await loadSchema(resolve(taskDir, judge.schema)) : GENERIC_DATA_SCHEMA;
+  return { score: output.score, data: output.data, schema };
+}
+
+/** Load and parse an OCAS JSON Schema file. */
+async function loadSchema(path: string): Promise<JSONSchema> {
+  const text = await readFile(path, "utf8");
+  return JSON.parse(text) as JSONSchema;
+}
+
+/** Dispatch a builtin judge by name. Throws on an unknown builtin name. */
+async function runBuiltinJudge(name: string, threadId: string): Promise<JudgeRunOutput> {
+  switch (name) {
+    case "frontmatter-compliance":
+      return runFrontmatterJudge(threadId);
+    case "upstream-consumption":
+      return runUpstreamJudge(threadId);
+    case "hallucination":
+      return runHallucinationJudge(threadId);
+    case "token-stats":
+      return runTokenStatsJudge(threadId);
+    default:
+      throw new Error(`unknown builtin judge "${name}"`);
+  }
+}
+
+/**
+ * Default judge runner. Builtin judges are dispatched by name; task judges spawn
+ * their entry script.
+ */
+const defaultJudgeRunner: JudgeRunner = async (taskDir, workDir, threadId, judge) => {
+  if (judge.builtin) {
+    return runBuiltinJudge(judge.name, threadId);
+  }
+  return runTaskJudge(taskDir, workDir, threadId, judge);
+};
+
+/** Persist judge data to CAS under its schema and return the CAS hash. */
+async function storeJudgeData(store: Store, schema: JSONSchema, data: unknown): Promise<CasRef> {
+  const schemaHash = await putSchema(store, schema);
+  return (await store.cas.put(schemaHash, data)) as CasRef;
+}
+
+/**
+ * Run all judges, store their data and the overall eval-run record in CAS, then
+ * index the run under `@uwf/eval/<task>/latest`.
+ */
+export async function collect(
+  input: CollectInput,
+  runJudge: JudgeRunner = defaultJudgeRunner,
+): Promise<CollectResult> {
+  const { evalStore, taskDir, workDir, threadId, manifest, config } = input;
+  const { store, varStore } = evalStore;
+
+  const records: EvalJudgeRecord[] = [];
+  for (const judge of manifest.judges) {
+    const result = await runJudge(taskDir, workDir, threadId, judge);
+    const dataHash = await storeJudgeData(store, result.schema, result.data);
+    records.push({ name: judge.name, score: result.score, weight: judge.weight, dataHash });
+    log(LOG_JUDGE, `judge=${judge.name} score=${result.score} weight=${judge.weight}`);
+  }
+
+  const overall = computeOverall(records);
+
+  const payload: EvalRunPayload = {
+    task: manifest.name,
+    config,
+    threadId,
+    judges: records,
+    overall,
+    timestamp: Date.now(),
+  };
+
+  const schemaHash = await putSchema(store, EVAL_RUN_SCHEMA);
+  const runHash = (await store.cas.put(schemaHash, payload)) as string;
+  setEvalLatest(varStore, manifest.name, runHash);
+  log(LOG_STORED, `stored eval-run task=${manifest.name} hash=${runHash} overall=${overall}`);
+
+  const judges: JudgeSummary[] = records.map((r) => ({
+    name: r.name,
+    score: r.score,
+    weight: r.weight,
+  }));
+  return { runHash, overall, judges };
+}
@@ -0,0 +1,87 @@
+import { execFileSync } from "node:child_process";
+
+import { createLogger } from "@united-workforce/util";
+
+import type { ExecuteInput, ExecuteResult } from "./types.js";
+
+const log = createLogger({ sink: { kind: "stderr" } });
+
+const LOG_START = "EX5M2T9V";
+const LOG_EXEC = "EX7Q4K2N";
+
+/** Resolve the uwf CLI binary. Override with `UWF_BIN` for testing. */
+function uwfBin(): string {
+  const override = process.env.UWF_BIN;
+  return override !== undefined && override !== "" ? override : "uwf";
+}
+
+/** Run a uwf subcommand and return trimmed stdout. */
+function runUwf(args: string[], cwd: string): string {
+  try {
+    return execFileSync(uwfBin(), args, {
+      encoding: "utf8",
+      stdio: ["ignore", "pipe", "pipe"],
+      maxBuffer: 50 * 1024 * 1024,
+      cwd,
+    }).trim();
+  } catch (e) {
+    const err = e as NodeJS.ErrnoException & { stderr?: Buffer | string | null };
+    const stderr =
+      err.stderr == null
+        ? ""
+        : typeof err.stderr === "string"
+          ? err.stderr
+          : err.stderr.toString("utf8");
+    const detail = stderr.trim() !== "" ? `: ${stderr.trim()}` : "";
+    throw new Error(`uwf ${args[0]} ${args[1]} failed${detail}`);
+  }
+}
+
+/** Parse the thread ID from `uwf thread start` JSON output (`{ workflow, thread }`). */
+function parseThreadId(stdout: string): string {
+  let parsed: unknown;
+  try {
+    parsed = JSON.parse(stdout);
+  } catch {
+    throw new Error(`uwf thread start did not emit valid JSON: ${stdout || "(empty)"}`);
+  }
+  const obj = parsed as Record<string, unknown>;
+  const thread = obj.thread;
+  if (typeof thread !== "string" || thread === "") {
+    throw new Error(`uwf thread start output missing thread id: ${stdout}`);
+  }
+  return thread;
+}
+
+/**
+ * Execute a workflow: create a thread, then run it for up to `maxSteps` steps.
+ * Shells out to the uwf CLI rather than importing it directly.
+ */
+export async function execute(input: ExecuteInput): Promise<ExecuteResult> {
+  const startOut = runUwf(
+    ["thread", "start", input.workflow, "-p", input.prompt, "--cwd", input.workDir],
+    input.workDir,
+  );
+  const threadId = parseThreadId(startOut);
+  log(LOG_START, `thread started thread=${threadId} workflow=${input.workflow}`);
+
+  runUwf(
+    ["thread", "exec", threadId, "--agent", input.agent, "-c", String(input.maxSteps)],
+    input.workDir,
+  );
+  log(LOG_EXEC, `thread executed thread=${threadId} maxSteps=${input.maxSteps}`);
+
+  return { threadId };
+}
+
+/** Best-effort lookup of the uwf engine version (`uwf -V`); "unknown" on failure. */
+export function getEngineVersion(): string {
+  try {
+    return execFileSync(uwfBin(), ["-V"], {
+      encoding: "utf8",
+      stdio: ["ignore", "pipe", "ignore"],
+    }).trim();
+  } catch {
+    return "unknown";
+  }
+}
@@ -0,0 +1,15 @@
+export { collect, computeOverall } from "./collect.js";
+export { execute, getEngineVersion } from "./execute.js";
+export { prepare } from "./prepare.js";
+export type {
+  CollectInput,
+  CollectResult,
+  ExecuteInput,
+  ExecuteResult,
+  JudgeRunner,
+  JudgeRunOutput,
+  JudgeSummary,
+  PrepareResult,
+  RunOptions,
+  RunResult,
+} from "./types.js";
@@ -0,0 +1,45 @@
+import { access, cp, mkdir, mkdtemp } from "node:fs/promises";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+
+import { createLogger } from "@united-workforce/util";
+
+import { loadTaskManifest } from "../task/index.js";
+import type { PrepareResult } from "./types.js";
+
+const log = createLogger({ sink: { kind: "stderr" } });
+
+const LOG_PREPARE = "PRE4K2NQ";
+const LOG_FIXTURE = "PRE7M3VX";
+
+/** Check whether a path exists. */
+async function pathExists(path: string): Promise<boolean> {
+  try {
+    await access(path);
+    return true;
+  } catch {
+    return false;
+  }
+}
+
+/**
+ * Prepare a task for execution: read its manifest and copy the fixture
+ * directory into a fresh temp working directory.
+ */
+export async function prepare(taskDir: string): Promise<PrepareResult> {
+  const manifest = await loadTaskManifest(taskDir);
+  log(LOG_PREPARE, `loaded task manifest name=${manifest.name} workflow=${manifest.workflow}`);
+
+  const workDir = await mkdtemp(join(tmpdir(), "uwf-eval-"));
+
+  const fixtureDir = join(taskDir, "fixture");
+  if (await pathExists(fixtureDir)) {
+    await cp(fixtureDir, workDir, { recursive: true });
+    log(LOG_FIXTURE, `copied fixture into workDir=${workDir}`);
+  } else {
+    await mkdir(workDir, { recursive: true });
+    log(LOG_FIXTURE, `no fixture/ found, using empty workDir=${workDir}`);
+  }
+
+  return { taskDir, workDir, manifest };
+}
@@ -0,0 +1,85 @@
+import type { JSONSchema } from "@ocas/core";
+
+import type { EvalRunConfig, EvalStore } from "../storage/index.js";
+import type { JudgeEntry, TaskManifest } from "../task/index.js";
+
+/** Result of the prepare phase: task dir, temp working dir, parsed manifest. */
+export type PrepareResult = {
+  taskDir: string;
+  workDir: string;
+  manifest: TaskManifest;
+};
+
+/** Input to the execute phase. */
+export type ExecuteInput = {
+  /** Working directory the workflow runs in (the prepared temp dir). */
+  workDir: string;
+  /** Workflow name or path (from task.yaml). */
+  workflow: string;
+  /** Initial prompt for the thread. */
+  prompt: string;
+  /** Agent adapter to use. */
+  agent: string;
+  /** Maximum number of steps to execute. */
+  maxSteps: number;
+};
+
+/** Result of the execute phase. */
+export type ExecuteResult = {
+  threadId: string;
+};
+
+/** Output produced by running a single judge. */
+export type JudgeRunOutput = {
+  score: number;
+  data: unknown;
+  /** Schema describing `data`, used when persisting to CAS. */
+  schema: JSONSchema;
+};
+
+/** Pluggable judge execution strategy (injectable for testing). */
+export type JudgeRunner = (
+  taskDir: string,
+  workDir: string,
+  threadId: string,
+  judge: JudgeEntry,
+) => Promise<JudgeRunOutput>;
+
+/** Input to the collect phase. */
+export type CollectInput = {
+  evalStore: EvalStore;
+  taskDir: string;
+  workDir: string;
+  threadId: string;
+  manifest: TaskManifest;
+  config: EvalRunConfig;
+};
+
+/** A single judge's summarized result in the run output. */
+export type JudgeSummary = {
+  name: string;
+  score: number;
+  weight: number;
+};
+
+/** Result of the collect phase. */
+export type CollectResult = {
+  runHash: string;
+  overall: number;
+  judges: JudgeSummary[];
+};
+
+/** Options for a full eval run (from CLI flags). */
+export type RunOptions = {
+  agent: string;
+  model: string;
+  count: number;
+};
+
+/** Final result of a full eval run. */
+export type RunResult = {
+  runHash: string;
+  overall: number;
+  task: string;
+  judges: JudgeSummary[];
+};
@@ -0,0 +1,9 @@
+export {
+  EVAL_JUDGE_FRONTMATTER_SCHEMA,
+  EVAL_JUDGE_HALLUCINATION_SCHEMA,
+  EVAL_JUDGE_TOKEN_STATS_SCHEMA,
+  EVAL_JUDGE_UPSTREAM_SCHEMA,
+  EVAL_RUN_SCHEMA,
+} from "./schemas.js";
+export { createEvalStore, setEvalLatest } from "./store.js";
+export type { EvalJudgeRecord, EvalRunConfig, EvalRunPayload, EvalStore } from "./types.js";
@@ -0,0 +1,123 @@
+import type { JSONSchema } from "@ocas/core";
+
+export const EVAL_RUN_SCHEMA: JSONSchema = {
+  title: "@uwf/eval-run",
+  type: "object",
+  required: ["task", "config", "threadId", "judges", "overall", "timestamp"],
+  properties: {
+    task: { type: "string" },
+    config: {
+      type: "object",
+      required: ["agent", "model", "engineVersion"],
+      properties: {
+        agent: { type: "string" },
+        model: { type: "string" },
+        engineVersion: { type: "string" },
+      },
+    },
+    threadId: { type: "string" },
+    judges: {
+      type: "array",
+      items: {
+        type: "object",
+        required: ["name", "score", "weight", "dataHash"],
+        properties: {
+          name: { type: "string" },
+          score: { type: "number" },
+          weight: { type: "number" },
+          dataHash: { type: "string" },
+        },
+      },
+    },
+    overall: { type: "number" },
+    timestamp: { type: "integer" },
+  },
+};
+
+export const EVAL_JUDGE_FRONTMATTER_SCHEMA: JSONSchema = {
+  title: "@uwf/eval-judge-frontmatter",
+  type: "object",
+  required: ["stepsTotal", "stepsValid", "invalidSteps"],
+  properties: {
+    stepsTotal: { type: "integer" },
+    stepsValid: { type: "integer" },
+    invalidSteps: {
+      type: "array",
+      items: {
+        type: "object",
+        required: ["stepIndex", "role", "errors"],
+        properties: {
+          stepIndex: { type: "integer" },
+          role: { type: "string" },
+          errors: { type: "array", items: { type: "string" } },
+        },
+      },
+    },
+  },
+};
+
+export const EVAL_JUDGE_UPSTREAM_SCHEMA: JSONSchema = {
+  title: "@uwf/eval-judge-upstream",
+  type: "object",
+  required: ["perStep"],
+  properties: {
+    perStep: {
+      type: "array",
+      items: {
+        type: "object",
+        required: ["role", "consumed", "missed", "score"],
+        properties: {
+          role: { type: "string" },
+          consumed: { type: "array", items: { type: "string" } },
+          missed: { type: "array", items: { type: "string" } },
+          score: { type: "number" },
+        },
+      },
+    },
+  },
+};
+
+export const EVAL_JUDGE_HALLUCINATION_SCHEMA: JSONSchema = {
+  title: "@uwf/eval-judge-hallucination",
+  type: "object",
+  required: ["perStep"],
+  properties: {
+    perStep: {
+      type: "array",
+      items: {
+        type: "object",
+        required: ["role", "hallucinations", "score"],
+        properties: {
+          role: { type: "string" },
+          hallucinations: { type: "array", items: { type: "string" } },
+          score: { type: "number" },
+        },
+      },
+    },
+  },
+};
+
+export const EVAL_JUDGE_TOKEN_STATS_SCHEMA: JSONSchema = {
+  title: "@uwf/eval-judge-token-stats",
+  type: "object",
+  required: ["totalInput", "totalOutput", "totalTurns", "perStep"],
+  properties: {
+    totalInput: { type: "integer" },
+    totalOutput: { type: "integer" },
+    totalTurns: { type: "integer" },
+    perStep: {
+      type: "array",
+      items: {
+        type: "object",
+        required: ["role", "inputTokens", "outputTokens", "turns", "duration"],
+        properties: {
+          role: { type: "string" },
+          inputTokens: { type: "integer" },
+          outputTokens: { type: "integer" },
+          turns: { type: "integer" },
+          duration: { type: "number" },
+        },
+      },
+    },
+  },
+};
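
`EVAL_JUDGE_TOKEN_STATS_SCHEMA` above requires three totals plus a `perStep` array. As a rough illustration of how a payload of that shape could be assembled, the sketch below sums per-step values into the totals. `aggregateTokenStats` and the type names are hypothetical; the real judge implementation is not part of this diff.

```typescript
// Hypothetical sketch: field names follow EVAL_JUDGE_TOKEN_STATS_SCHEMA, the function itself is assumed.
type StepTokenStats = {
  role: string;
  inputTokens: number;
  outputTokens: number;
  turns: number;
  duration: number;
};

type TokenStatsData = {
  totalInput: number;
  totalOutput: number;
  totalTurns: number;
  perStep: StepTokenStats[];
};

/** Sum per-step usage into the totals the schema marks as required. */
function aggregateTokenStats(perStep: StepTokenStats[]): TokenStatsData {
  let totalInput = 0;
  let totalOutput = 0;
  let totalTurns = 0;
  for (const step of perStep) {
    totalInput += step.inputTokens;
    totalOutput += step.outputTokens;
    totalTurns += step.turns;
  }
  return { totalInput, totalOutput, totalTurns, perStep };
}

// Example data with made-up role names and counts.
const exampleStats = aggregateTokenStats([
  { role: "planner", inputTokens: 1200, outputTokens: 300, turns: 2, duration: 4.5 },
  { role: "writer", inputTokens: 800, outputTokens: 500, turns: 3, duration: 7.25 },
]);
```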
@@ -0,0 +1,42 @@
+import { mkdir } from "node:fs/promises";
+import { homedir } from "node:os";
+import { join } from "node:path";
+import type { VarStore } from "@ocas/core";
+import { bootstrap, type Store } from "@ocas/core";
+import { createFsStore, createSqliteVarStore } from "@ocas/fs";
+
+import type { EvalStore } from "./types.js";
+
+/** Variable name prefix for eval run pointers (`@uwf/eval/<task>/latest`). */
+const EVAL_VAR_PREFIX = "@uwf/eval/";
+
+/**
+ * Resolve the global CAS directory shared by all uwf and ocas tools.
+ * Priority: `OCAS_HOME` → default ~/.ocas (matches uwf CLI's getGlobalCasDir).
+ */
+function getGlobalCasDir(): string {
+  const primary = process.env.OCAS_HOME;
+  if (primary !== undefined && primary !== "") {
+    return primary;
+  }
+  return join(homedir(), ".ocas");
+}
+
+/**
+ * Open the unified OCAS store on the filesystem.
+ * Shares the same CAS + variable backend as the uwf CLI.
+ */
+export async function createEvalStore(): Promise<EvalStore> {
+  const casDir = getGlobalCasDir();
+  await mkdir(casDir, { recursive: true });
+  const cas = createFsStore(casDir);
+  const { var: varStore, tag } = createSqliteVarStore(join(casDir, "vars"), cas);
+  const store: Store = { cas, var: varStore, tag };
+  bootstrap(store);
+  return { store, varStore };
+}
+
+/** Set the `@uwf/eval/<task>/latest` variable to point at a run hash. */
+export function setEvalLatest(varStore: VarStore, taskName: string, runHash: string): void {
+  varStore.set(`${EVAL_VAR_PREFIX}${taskName}/latest`, runHash);
+}
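
`setEvalLatest` above writes a variable named `@uwf/eval/<task>/latest`, and the commit log mentions a `list` command that scans `@uwf/eval/*/latest` variables. A minimal sketch of building and parsing such names follows. Only the prefix and the `/latest` suffix come from the diff; `evalLatestVarName` and `taskNameFromVar` are hypothetical helpers.

```typescript
// Hypothetical helpers, not part of the diff. Only the prefix and suffix are taken from store.ts.
const EVAL_VAR_PREFIX = "@uwf/eval/";
const LATEST_SUFFIX = "/latest";

/** Build the variable name that points at the latest run of a task. */
function evalLatestVarName(taskName: string): string {
  return `${EVAL_VAR_PREFIX}${taskName}${LATEST_SUFFIX}`;
}

/** Recover the task name from a variable name, or null if it does not match the pattern. */
function taskNameFromVar(varName: string): string | null {
  if (!varName.startsWith(EVAL_VAR_PREFIX) || !varName.endsWith(LATEST_SUFFIX)) {
    return null;
  }
  // slice() yields "" when prefix and suffix overlap or nothing sits between them.
  const task = varName.slice(EVAL_VAR_PREFIX.length, varName.length - LATEST_SUFFIX.length);
  return task === "" ? null : task;
}

const exampleVar = evalLatestVarName("fix-bug");
const exampleTask = taskNameFromVar(exampleVar);
```

Keeping the name construction in one place would let a writer and a scanner agree on the pattern without repeating string literals.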
@@ -0,0 +1,33 @@
+import type { Store, VarStore } from "@ocas/core";
+import type { CasRef } from "@united-workforce/protocol";
+
+/** Handle to the OCAS store used for eval persistence. */
+export type EvalStore = {
+  store: Store;
+  varStore: VarStore;
+};
+
+/** A single judge result within an eval run. */
+export type EvalJudgeRecord = {
+  name: string;
+  score: number;
+  weight: number;
+  dataHash: CasRef;
+};
+
+/** Config snapshot for an eval run. */
+export type EvalRunConfig = {
+  agent: string;
+  model: string;
+  engineVersion: string;
+};
+
+/** Full eval run record stored in CAS. */
+export type EvalRunPayload = {
+  task: string;
+  config: EvalRunConfig;
+  threadId: string;
+  judges: EvalJudgeRecord[];
+  overall: number;
+  timestamp: number;
+};
@@ -0,0 +1,2 @@
+export { loadTaskManifest, parseTaskManifest } from "./loader.js";
+export type { JudgeEntry, TaskLimits, TaskManifest } from "./types.js";
@@ -0,0 +1,74 @@
+import { readFile } from "node:fs/promises";
+import { join } from "node:path";
+import { parse as parseYaml } from "yaml";
+import type { JudgeEntry, TaskLimits, TaskManifest } from "./types.js";
+
+function isRecord(value: unknown): value is Record<string, unknown> {
+  return typeof value === "object" && value !== null && !Array.isArray(value);
+}
+
+function parseJudgeEntry(raw: unknown, index: number): JudgeEntry {
+  if (!isRecord(raw)) {
+    throw new Error(`judges[${index}]: expected object`);
+  }
+  const name = raw.name;
+  if (typeof name !== "string" || name === "") {
+    throw new Error(`judges[${index}]: name is required`);
+  }
+  const weight = typeof raw.weight === "number" ? raw.weight : 0;
+  const builtin = raw.builtin === true;
+  const entry = typeof raw.entry === "string" ? raw.entry : null;
+  const schema = typeof raw.schema === "string" ? raw.schema : null;
+  if (!builtin && entry === null) {
+    throw new Error(`judges[${index}] "${name}": non-builtin judge must have entry`);
+  }
+  return { name, weight, builtin, entry, schema };
+}
+
+function parseLimits(raw: unknown): TaskLimits {
+  if (!isRecord(raw)) {
+    return { maxSteps: 20, timeoutMinutes: 30 };
+  }
+  return {
+    maxSteps: typeof raw.maxSteps === "number" ? raw.maxSteps : 20,
+    timeoutMinutes: typeof raw.timeoutMinutes === "number" ? raw.timeoutMinutes : 30,
+  };
+}
+
+/** Parse and validate a task.yaml file into a TaskManifest. */
+export function parseTaskManifest(yamlText: string): TaskManifest {
+  const raw = parseYaml(yamlText) as unknown;
+  if (!isRecord(raw)) {
+    throw new Error("task.yaml must be a YAML mapping");
+  }
+  const name = raw.name;
+  if (typeof name !== "string" || name === "") {
+    throw new Error("task.yaml: name is required");
+  }
+  const description = typeof raw.description === "string" ? raw.description : "";
+  const workflow = raw.workflow;
+  if (typeof workflow !== "string" || workflow === "") {
+    throw new Error("task.yaml: workflow is required");
+  }
+  const prompt = raw.prompt;
+  if (typeof prompt !== "string" || prompt === "") {
+    throw new Error("task.yaml: prompt is required");
+  }
+  const limits = parseLimits(raw.limits);
+  const judgesRaw = raw.judges;
+  if (!Array.isArray(judgesRaw) || judgesRaw.length === 0) {
+    throw new Error("task.yaml: at least one judge is required");
+  }
+  const judges: JudgeEntry[] = [];
+  for (let i = 0; i < judgesRaw.length; i++) {
+    judges.push(parseJudgeEntry(judgesRaw[i], i));
+  }
+  return { name, description, workflow, prompt, limits, judges };
+}
+
+/** Load and parse task.yaml from a directory. */
+export async function loadTaskManifest(taskDir: string): Promise<TaskManifest> {
+  const yamlPath = join(taskDir, "task.yaml");
+  const text = await readFile(yamlPath, "utf8");
+  return parseTaskManifest(text);
+}
@@ -0,0 +1,28 @@
+/** Judge entry in task.yaml */
+export type JudgeEntry = {
+  name: string;
+  weight: number;
+  builtin: boolean;
+  /** Path to judge entry script (relative to task root). Required for non-builtin judges. */
+  entry: string | null;
+  /** Path to OCAS schema JSON for judge data. Intended for non-builtin judges; the loader does not enforce it and yields null when absent. */
+  schema: string | null;
+};
+
+/** Limits for eval execution. */
+export type TaskLimits = {
+  maxSteps: number;
+  timeoutMinutes: number;
+};
+
+/** Parsed task.yaml manifest. */
+export type TaskManifest = {
+  name: string;
+  description: string;
+  /** Workflow name or relative path to .yaml file. */
+  workflow: string;
+  /** Initial prompt for thread start. */
+  prompt: string;
+  limits: TaskLimits;
+  judges: JudgeEntry[];
+};
@@ -0,0 +1,9 @@
+{
+  "extends": "../../tsconfig.json",
+  "compilerOptions": {
+    "rootDir": "src",
+    "outDir": "dist"
+  },
+  "include": ["src"],
+  "references": [{ "path": "../protocol" }, { "path": "../util" }]
+}
@@ -1,6 +1,6 @@
 {
  "name": "@united-workforce/protocol",
-  "version": "0.5.0",
+  "version": "0.1.0",
  "files": [
    "src",
    "dist",
@@ -14,7 +14,6 @@
    }
  },
  "scripts": {
-    "prepublishOnly": "echo 'Use pnpm run release from repo root' && exit 1",
    "test": "vitest run src/__tests__/",
    "test:ci": "vitest run src/__tests__/"
  },
@@ -27,6 +27,7 @@ describe("Protocol types for thread/edge location", () => {
        completedAtMs: Date.now() + 1000,
        assembledPrompt: null,
        cwd: "/home/user/project",
+        usage: null,
      };

      expect(record.cwd).toBe("/home/user/project");
@@ -44,6 +44,7 @@ export type {
  ThreadStatus,
  ThreadStepsOutput,
  ThreadsIndex,
+  Usage,
  WorkflowConfig,
  WorkflowName,
  WorkflowPayload,
@@ -91,6 +91,22 @@ export const STEP_NODE_SCHEMA: JSONSchema = {
    assembledPrompt: {
      anyOf: [{ type: "string", format: "ocas_ref" }, { type: "null" }],
    },
+    usage: {
+      anyOf: [
+        {
+          type: "object",
+          required: ["turns", "inputTokens", "outputTokens", "duration"],
+          properties: {
+            turns: { type: "integer" },
+            inputTokens: { type: "integer" },
+            outputTokens: { type: "integer" },
+            duration: { type: "number" },
+          },
+          additionalProperties: false,
+        },
+        { type: "null" },
+      ],
+    },
  },
  additionalProperties: false,
 };
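
The `usage` object added to `STEP_NODE_SCHEMA` above records `turns`, `inputTokens`, `outputTokens`, and `duration` per step. Commit 1593dbb521 describes reporting these as a delta between a snapshot taken before the prompt and one taken after it, so that a resumed session does not report cumulative totals. The sketch below illustrates that idea only: the function name `computeUsageDelta` appears in the commit message, but its signature, the `UsageSnapshot` type, and the clamping to zero are assumptions, and the unit of `duration` is left as the source leaves it.

```typescript
// Sketch under stated assumptions: signature, UsageSnapshot, and clamping are not confirmed by the diff.
type Usage = { turns: number; inputTokens: number; outputTokens: number; duration: number };
type UsageSnapshot = { turns: number; inputTokens: number; outputTokens: number };

/** Per-step usage as the difference between two cumulative snapshots. */
function computeUsageDelta(before: UsageSnapshot, after: UsageSnapshot, duration: number): Usage {
  return {
    // Clamp at 0 so a stale "after" snapshot never yields negative counts (assumed behavior).
    turns: Math.max(0, after.turns - before.turns),
    inputTokens: Math.max(0, after.inputTokens - before.inputTokens),
    outputTokens: Math.max(0, after.outputTokens - before.outputTokens),
    duration,
  };
}

// Example: a resumed session that already had 3 turns before this step ran.
const exampleDelta = computeUsageDelta(
  { turns: 3, inputTokens: 1000, outputTokens: 200 },
  { turns: 5, inputTokens: 1600, outputTokens: 450 },
  12.5,
);
```

Note that a later commit in the list (8085d1d6e0) moves token counts to the ACP response and keeps the snapshot approach for turns only, so this sketch reflects the earlier design.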
Author	SHA1	Message	Date
xiaoju	9260d81084	chore: version bump for --version fix CI / check (push) Successful in 3m2s Details agent-hermes@0.1.2 agent-claude-code@0.1.1 agent-builtin@0.1.1 agent-mock@0.1.1 eval@0.1.3 util@0.1.1 小橘 🍊（NEKO Team）	2026-06-05 08:12:50 +00:00
xiaomo	c8d884072a	Merge pull request 'fix: acp-client reports agent-hermes own version in MCP clientInfo' (#98 ) from fix/acp-client-own-version into main CI / check (push) Successful in 2m27s Details	2026-06-05 08:10:57 +00:00
xiaoju	abeb465f46	fix: acp-client reports own package version, not util VERSION CI / check (pull_request) Successful in 2m36s Details Address review nit from PR #97: clientInfo.version should be agent-hermes's own version for correct identification under independent versioning. 小橘 🍊（NEKO Team）	2026-06-05 07:50:03 +00:00
xiaomo	28427a973f	Merge pull request 'fix: add --version to adapter CLIs, read VERSION from package.json' (#97 ) from fix/adapter-version into main CI / check (push) Successful in 3m3s Details	2026-06-05 07:36:15 +00:00
xiaoju	794f9db568	fix: add --version to adapter CLIs, read VERSION from package.json CI / check (pull_request) Successful in 3m29s Details - All uwf-* adapter CLIs now support --version / -V - util VERSION constant reads from package.json at runtime - agent-hermes ACP clientInfo uses dynamic VERSION 小橘 🍊（NEKO Team）	2026-06-05 07:29:54 +00:00
xiaoju	cd585a26f1	Merge pull request 'fix: read eval CLI version from package.json' (#96 ) from fix/95-eval-version into main CI / check (push) Successful in 3m28s Details	2026-06-05 06:46:32 +00:00
xiaoju	1cf8f350d0	fix: read eval CLI version from package.json CI / check (pull_request) Successful in 3m30s Details Fixes #95 小橘 🍊（NEKO Team）	2026-06-05 06:43:27 +00:00
xiaoju	427568a21d	chore: version bump agent-hermes@0.1.1 cli@0.1.1 eval@0.1.2 CI / check (push) Successful in 2m37s Details 小橘 🍊（NEKO Team）	2026-06-05 06:29:25 +00:00
xiaomo	d3a2353acf	Merge pull request 'fix: read token usage from ACP response instead of DB' (#94 ) from fix/usage-tokens-from-acp into main CI / check (push) Successful in 3m25s Details	2026-06-05 06:18:05 +00:00
xiaoju	8085d1d6e0	fix: read token usage from ACP response instead of DB CI / check (pull_request) Successful in 3m10s Details Tokens (inputTokens, outputTokens) now come from ACP PromptResponse.usage which is populated synchronously from run_conversation() — no WAL race. Turns still come from DB before/after snapshot. Previously both were read from hermes state.db after ACP prompt returned, but WAL write lag caused incomplete token data (e.g. 235 vs actual 26,080). Refs #91	2026-06-05 06:08:11 +00:00
xiaomo	8764d7bda3	Merge pull request 'chore: add changeset for #92 agent override alias fix' (#93 ) from chore/changeset-agent-override into main CI / check (push) Successful in 3m33s Details	2026-06-05 05:17:36 +00:00
xiaoju	850a3b2f25	chore: add changeset for #92 agent override alias fix CI / check (pull_request) Successful in 3m8s Details	2026-06-05 04:36:41 +00:00
xiaomo	3d6a517e83	Merge pull request 'fix: resolve --agent override via config alias before raw command' (#92 ) from fix/agent-override-alias into main CI / check (push) Successful in 3m30s Details	2026-06-05 04:31:50 +00:00
xiaoju	825f0c641a	fix: resolve --agent override via config alias before raw command CI / check (pull_request) Successful in 3m37s Details When --agent is passed to uwf thread exec, try config.agents[alias] first (e.g. 'hermes' → config.agents.hermes = {command: 'uwf-hermes'}), then fall back to parseAgentOverride for raw command names. Also change eval CLI default --agent from 'hermes' to 'uwf-hermes' so it works without config alias lookup. Refs #91	2026-06-05 04:20:09 +00:00
xiaoju	81bbe1178f	chore: release @united-workforce/eval@0.1.1 CI / check (push) Successful in 2m45s Details	2026-06-05 03:02:05 +00:00
xiaoju	a0e139935e	Merge pull request 'fix: frontmatter judge handles parsed object output' (#90 ) from fix/frontmatter-judge-object-output into main CI / check (push) Successful in 2m12s Details	2026-06-05 03:01:30 +00:00
xiaoju	a08775896f	fix: frontmatter judge handles parsed object output CI / check (pull_request) Successful in 2m38s Details The extract pipeline stores step output as a JSON object in CAS, but the frontmatter judge only checked for raw markdown strings. Now accepts both formats: parsed objects check $status directly, raw strings go through YAML frontmatter extraction. Fixes eval frontmatter-compliance scoring 0 on valid outputs.	2026-06-05 02:55:58 +00:00
xiaoju	c892b9125b	chore: remove prepublishOnly guards (proman handles release) CI / check (push) Successful in 2m26s Details	2026-06-05 02:29:53 +00:00
xiaoju	8c5e12c5c8	Merge pull request 'chore: prepare 0.1.0 release' (#89 ) from chore/prepare-release into main CI / check (push) Failing after 12s Details	2026-06-05 02:28:08 +00:00
xiaoju	5edb67b79d	chore: prepare 0.1.0 release CI / check (pull_request) Successful in 2m12s Details - Remove legacy .changeset/ directory (no longer used) - Add eval package to proman.yaml - Set eval package to public for npm publishing	2026-06-05 02:21:24 +00:00
xiaoju	3d8df5c8e2	Merge pull request 'fix: remove _ single-exit for user roles' (#88 ) from fix/86-remove-single-exit-underscore into main CI / check (push) Successful in 2m16s Details	2026-06-05 02:09:50 +00:00
xiaoju	63cb4d3645	fix: remove _ single-exit for user roles CI / check (pull_request) Successful in 3m7s Details $START keeps _ (special entry node). All user-defined roles now require explicit $status enum in frontmatter + matching graph keys. - moderator: remove UNIT_STATUS fallback, error on missing $status - validate: reject _ graph keys for non-$START roles - validate-semantic: remove checkSingleExitRole(), require $status enum - update all test fixtures to use explicit status values - fix examples/analyze-topic.yaml Fixes #86	2026-06-05 02:00:45 +00:00
xiaomo	f373945304	Merge pull request 'feat: eval package scaffold — CLI + schemas + types + task loader' (#85 ) from feat/69-eval-scaffold into main CI / check (push) Successful in 1m46s Details feat: eval package scaffold — CLI + schemas + types + task loader (#85)	2026-06-05 00:23:56 +00:00
xiaoju	ae81e4b5ac	feat: eval report, diff, list commands CI / check (pull_request) Successful in 1m44s Details Implement the 3 read commands for eval framework: - report: read eval-run from CAS, render formatted text (task, overall, config, judges table, thread ID) - diff: side-by-side comparison with ▲/▼ delta indicators and config change markers - list: scan @uwf/eval/*/latest variables, sort by timestamp desc, --task filter, --limit pagination Architecture: pure formatting functions (format.ts) + data access (read.ts) + thin CLI handlers. Types in types.ts. 11 new tests (formatReport, formatDiff, formatList, selectEntries) Refs #72	2026-06-05 00:19:25 +00:00
xiaoju	8c26f16716	feat: builtin judges — frontmatter + token-stats (deterministic) + upstream/hallucination (stubs) CI / check (pull_request) Successful in 1m45s Details Implement 4 builtin judges for eval framework: - frontmatter-compliance: validates YAML frontmatter with $status field, score = stepsValid / stepsTotal - token-stats: aggregates Usage from step nodes, always score 1.0 (informational only) - upstream-consumption: LLM-as-judge stub (score 0, TODO) - hallucination: LLM-as-judge stub (score 0, TODO) Infrastructure: - judge/builtin/read-steps.ts — shell out to uwf step list - judge/builtin/types.ts — BuiltinJudge, BuiltinJudgeOutput - runner/collect.ts — dispatch builtin judges by name 9 new tests (frontmatter validation + token aggregation) Refs #71	2026-06-05 00:09:06 +00:00
xiaoju	fae9e9ed3a	feat: eval run command — prepare, execute, collect pipeline CI / check (pull_request) Successful in 1m45s Details Implement the uwf-eval run <task-dir> command with 3-phase pipeline: - prepare: read task.yaml, copy fixture/ to temp workdir - execute: shell out to uwf thread start + exec - collect: run judges, compute weighted score, store CAS node, set @uwf/eval/<task>/latest variable Changes: - src/runner/ — types, prepare, execute, collect, index - src/storage/store.ts — createEvalStore(), setEvalLatest() - src/commands/run.ts — full pipeline wiring with --agent/--model/--count - 9 new tests (prepare + collect + weighted scoring) Builtin judges return placeholder score 0 (Phase 1c). Refs #70	2026-06-04 23:59:21 +00:00
xiaoju	99619d85db	feat: eval package scaffold with CLI, schemas, types, task loader CI / check (pull_request) Successful in 1m42s Details New package @united-workforce/eval (uwf-eval CLI): - CLI skeleton: run/report/diff/list subcommands (stubs) - 5 OCAS schemas: eval-run, judge-frontmatter, judge-upstream, judge-hallucination, judge-token-stats - TaskManifest type + parser/validator for task.yaml - JudgeOutput/JudgeInput types for judge contract - EvalRunPayload/EvalRunConfig/EvalJudgeRecord storage types - 19 unit tests: task loader validation + schema definitions Refs #69	2026-06-04 23:42:16 +00:00
xiaomo	b94234652a	Merge pull request 'feat: agent-hermes reads real token counts from session DB' (#84 ) from feat/76-hermes-real-tokens into main CI / check (push) Successful in 1m41s Details feat: agent-hermes reads real token counts from session DB (#84)	2026-06-04 23:31:09 +00:00
xiaoju	1593dbb521	fix: compute usage as delta for session re-entry CI / check (pull_request) Successful in 1m41s Details On session resume, turns/inputTokens/outputTokens were cumulative (entire session history) instead of per-step increments. Now we snapshot metrics before prompt, compare after, and report the delta. Changes: - acp-client: add getSessionId() accessor - hermes: extract snapshotUsage() + computeUsageDelta() pure functions - hermes: runPrompt/runHermes/continueHermes use before/after snapshots - 9 new unit tests for usage delta computation Refs #68	2026-06-04 23:22:16 +00:00
xiaoju	d1c523c442	feat: agent-hermes reads real token counts from session DB CI / check (pull_request) Successful in 1m41s Details - Add inputTokens/outputTokens to HermesSessionJson type - Query input_tokens, output_tokens from sessions table in loadHermesSessionFromDb - Update test fixture schema with token columns - runPrompt now reports real token counts from Hermes state.db Refs #76, #68	2026-06-04 23:06:52 +00:00
xiaomo	4283e6766b	Merge pull request 'feat: agent-claude-code reports real $usage from stream-json' (#83 ) from feat/77-claude-code-usage into main CI / check (push) Successful in 1m42s Details feat: agent-claude-code reports real $usage from stream-json (#83)	2026-06-04 22:55:15 +00:00
xiaomo	4e4fb61ff5	Merge pull request 'feat: agent-hermes reports $usage (turns + duration)' (#82 ) from feat/76-hermes-usage into main CI / check (push) Successful in 1m40s Details feat: agent-hermes reports $usage (turns + duration) (#82)	2026-06-04 22:55:13 +00:00
xiaoju	be92cb2dd2	feat: agent-claude-code reports real $usage from stream-json output CI / check (pull_request) Successful in 1m40s Details - Map parsed numTurns, inputTokens, outputTokens, durationMs to Usage type - Add @united-workforce/protocol dependency + tsconfig reference - 747 tests pass Fixes #77 Refs #68	2026-06-04 22:36:44 +00:00
xiaoju	7681e8b8e2	feat: agent-hermes reports $usage (turns + duration) CI / check (pull_request) Successful in 1m40s Details - Count assistant turns from session messages - Measure wall-clock duration per prompt call - inputTokens/outputTokens remain 0 (ACP protocol doesn't expose token data yet) - Both runPrompt and continueHermes report usage Fixes #76 Refs #68	2026-06-04 22:30:14 +00:00
xiaomo	780005ad65	Merge pull request 'feat: agent-mock emits fixed $usage stats' (#81 ) from feat/75-mock-usage into main CI / check (push) Successful in 1m42s Details feat: agent-mock emits fixed $usage stats (#81)	2026-06-04 22:23:42 +00:00
xiaoju	248ac710fd	feat: agent-mock emits fixed $usage stats CI / check (pull_request) Successful in 1m41s Details - Mock agent returns {turns:1, inputTokens:0, outputTokens:0, duration:0} - E2E test 1 (linear workflow) asserts usage in CAS step nodes - 747 tests pass Fixes #75 Refs #68	2026-06-04 22:19:29 +00:00
xiaomo	172c232e61	Merge pull request 'feat: add $usage field to adapter protocol' (#80 ) from feat/74-usage-in-protocol into main CI / check (push) Successful in 1m41s Details feat: add $usage field to adapter protocol (#80)	2026-06-04 22:14:12 +00:00
xiaomo	5fe97591de	Merge pull request 'fix: agent bin fields point to dist/cli.js instead of src/cli.ts' (#79 ) from fix/agent-bin-78 into main CI / check (push) Successful in 2m55s Details fix: agent bin fields point to dist/cli.js instead of src/cli.ts (#79)	2026-06-04 15:41:45 +00:00
xiaoju	99f40c2488	feat: add $usage field to adapter protocol CI / check (pull_request) Successful in 2m28s Details - Add Usage type to protocol (turns, inputTokens, outputTokens, duration) - Add usage to StepRecord, StepNodePayload, StepEntry, STEP_NODE_SCHEMA - Thread usage through util-agent extract pipeline (writeStepNode → persistStep → createAgent) - All adapters return usage: null as placeholder (mock, hermes, claude-code, builtin) - 746 tests pass, no breaking changes (usage not in schema required array) Fixes #74 Refs #68	2026-06-04 15:41:07 +00:00
xingyue	bf489c59a5	fix: agent bin fields point to dist/cli.js instead of src/cli.ts CI / check (pull_request) Successful in 3m23s Details All three agent packages had bin pointing to ./src/cli.ts (bun-era leftover). Node cannot execute .ts files directly, causing ERR_MODULE_NOT_FOUND when spawning agents. Closes #78	2026-06-04 23:25:39 +08:00
xiaomo	9908d069ec	Merge pull request 'refactor(prompt): rename subcommands and add frontmatter output' (#67 ) from feat/prompt-refactor-66 into main CI / check (push) Successful in 5m15s Details refactor(prompt): rename subcommands and add frontmatter output (#67)	2026-06-04 14:51:12 +00:00
xingyue	83bcda60ff	refactor(prompt): rename subcommands and add frontmatter output CI / check (pull_request) Successful in 3m1s Details - Rename: user→usage-reference, author→workflow-authoring, adapter→adapter-developing - Remove: developer (content lives in CLAUDE.md) - All prompts output complete SKILL.md with YAML frontmatter - Setup instructions simplified: uwf prompt bootstrap > SKILL.md - Remove all bun references, use pnpm/npm - Fix CLAUDE.md: fixed→independent versioning - Delete old reference files (user/author/developer/adapter) Closes #66	2026-06-04 22:46:11 +08:00
xiaomo	17f7f44c43	Merge pull request 'chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs' (#64 ) from chore/rebranding-cleanup into main CI / check (push) Successful in 3m5s Details chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs (#64)	2026-06-04 13:13:03 +00:00
xiaoju	3401873051	chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs CI / check (pull_request) Successful in 2m49s Details - All 9 packages reset to version 0.1.0 - CLAUDE.md: bun→pnpm, fixed→independent versioning, proman commands - docs/architecture.md: bun→pnpm in toolchain table - docs/sync-readme.md: bun→pnpm in conventions	2026-06-04 13:05:26 +00:00
xiaomo	7fc02e50c0	Merge pull request 'refactor: extract validateCount, replace CLI spawn with direct import' (#63 ) from chore/61-spawn-to-direct-import into main CI / check (push) Successful in 3m0s Details refactor: extract validateCount, replace CLI spawn with direct import (#63)	2026-06-04 12:41:42 +00:00
xiaoju	18170a4313	refactor: extract validateCount, replace CLI spawn with direct import CI / check (pull_request) Successful in 2m24s Details - Extract validateCount() from cmdThreadExec (throw instead of process.exit) - 5 validation tests now import validateCount directly (no subprocess) - Only --help tests still spawn CLI (need Commander output) - Test time: 1.7s → 475ms Fixes #61	2026-06-04 12:31:17 +00:00
xiaomo	1ce0b9b9ee	Merge pull request 'chore: remove integration tests, migrate to eval framework' (#62 ) from chore/60-remove-integration-tests into main CI / check (push) Successful in 2m18s Details chore: remove integration tests, migrate to eval framework (#62)	2026-06-04 12:25:39 +00:00
xiaoju	8bf5b88172	chore: remove integration tests, clean up CI exclusion CI / check (pull_request) Successful in 2m41s Details Deleted: - acp-client.integration.test.ts (3 cases) - resume-e2e.integration.test.ts (1 case, already skipped) These tests spawn a real hermes CLI and hit live LLM, belonging to the eval layer (#34), not CI. ACP protocol parsing is already covered by unit test acp-client.test.ts. Also removed the --exclude integration/ hack from test:ci. Fixes #60	2026-06-04 12:19:24 +00:00
xiaomo	9fbdd1dd2c	Merge pull request 'fix: OCAS_DIR → OCAS_HOME in test helpers' (#59 ) from fix/58-test-isolation into main CI / check (push) Successful in 2m44s Details fix: OCAS_DIR → OCAS_HOME in test helpers (#59)	2026-06-04 12:16:20 +00:00
xiaoju	66c2e2a79b	fix: use node dist/cli.js instead of npx tsx in thread-step-count tests CI / check (pull_request) Successful in 3m30s Details npx tsx hangs in CI Docker (30s+ timeout). node dist/cli.js runs in <2s.	2026-06-04 11:57:32 +00:00
xiaoju	58b58d511e	fix: add timeout to cmdThreadExec count logic tests CI / check (pull_request) Failing after 4m17s Details	2026-06-04 11:48:46 +00:00
xiaoju	596c05bfcc	fix: use node dist/cli.js instead of npx tsx in prompt help test CI / check (pull_request) Failing after 3m40s Details npx tsx fails in CI (tsx not found, npm tries to install it)	2026-06-04 11:32:09 +00:00
xiaoju	d26f54e8ea	fix: biome format + remove unused noConsole suppressions CI / check (pull_request) Failing after 3m58s Details	2026-06-04 11:22:46 +00:00
xiaoju	883bd79bcb	fix: add timeout to CI-slow tests + check stderr for help output CI / check (pull_request) Failing after 1m55s Details	2026-06-04 11:18:49 +00:00
xiaoju	63454a4cfd	fix: OCAS_DIR → OCAS_HOME in test helpers + exclude integration tests from CI CI / check (pull_request) Failing after 2m27s Details - Remaining OCAS_DIR references caused test isolation failures - agent-hermes integration tests need 'hermes' CLI, skip in CI Fixes #58	2026-06-04 11:06:42 +00:00
xiaoju	5fe492c011	Merge pull request 'fix: add missing workflow destructure in current-role test' (#57 ) from fix/56-ts-compile-error into main CI / check (push) Failing after 1m35s Details	2026-06-04 11:00:25 +00:00
xiaoju	9f5891169e	fix: add missing workflow destructure in current-role test CI / check (pull_request) Failing after 1m37s Details The createMarker call used shorthand 'workflow' but the variable was not destructured from cmdThreadStart. Fixes #56	2026-06-04 10:56:44 +00:00
xiaoju	0470d9445a	Merge pull request 'fix: disable pnpm minimumReleaseAge in CI' (#55 ) from fix/ci-disable-release-age into main CI / check (push) Failing after 1m45s Details	2026-06-04 10:32:51 +00:00
xiaoju	07128b89af	fix: pnpm 11 CI compatibility CI / check (pull_request) Failing after 1m27s Details - Set minimumReleaseAge: 0 (pnpm 11 defaults to 1440 min) - Add allowBuilds for esbuild and msw (pnpm 11 blocks build scripts by default, config moved from package.json to pnpm-workspace.yaml)	2026-06-04 10:23:02 +00:00