united-workforce

Author	SHA1	Message	Date
xiaoju	ae81e4b5ac	feat: eval report, diff, list commands CI / check (pull_request) Successful in 1m44s Details Implement the 3 read commands for eval framework: - report: read eval-run from CAS, render formatted text (task, overall, config, judges table, thread ID) - diff: side-by-side comparison with ▲/▼ delta indicators and config change markers - list: scan @uwf/eval/*/latest variables, sort by timestamp desc, --task filter, --limit pagination Architecture: pure formatting functions (format.ts) + data access (read.ts) + thin CLI handlers. Types in types.ts. 11 new tests (formatReport, formatDiff, formatList, selectEntries) Refs #72	2026-06-05 00:19:25 +00:00
xiaoju	8c26f16716	feat: builtin judges — frontmatter + token-stats (deterministic) + upstream/hallucination (stubs) CI / check (pull_request) Successful in 1m45s Details Implement 4 builtin judges for eval framework: - frontmatter-compliance: validates YAML frontmatter with $status field, score = stepsValid / stepsTotal - token-stats: aggregates Usage from step nodes, always score 1.0 (informational only) - upstream-consumption: LLM-as-judge stub (score 0, TODO) - hallucination: LLM-as-judge stub (score 0, TODO) Infrastructure: - judge/builtin/read-steps.ts — shell out to uwf step list - judge/builtin/types.ts — BuiltinJudge, BuiltinJudgeOutput - runner/collect.ts — dispatch builtin judges by name 9 new tests (frontmatter validation + token aggregation) Refs #71	2026-06-05 00:09:06 +00:00
xiaoju	fae9e9ed3a	feat: eval run command — prepare, execute, collect pipeline CI / check (pull_request) Successful in 1m45s Details Implement the uwf-eval run <task-dir> command with 3-phase pipeline: - prepare: read task.yaml, copy fixture/ to temp workdir - execute: shell out to uwf thread start + exec - collect: run judges, compute weighted score, store CAS node, set @uwf/eval/<task>/latest variable Changes: - src/runner/ — types, prepare, execute, collect, index - src/storage/store.ts — createEvalStore(), setEvalLatest() - src/commands/run.ts — full pipeline wiring with --agent/--model/--count - 9 new tests (prepare + collect + weighted scoring) Builtin judges return placeholder score 0 (Phase 1c). Refs #70	2026-06-04 23:59:21 +00:00
xiaoju	99619d85db	feat: eval package scaffold with CLI, schemas, types, task loader CI / check (pull_request) Successful in 1m42s Details New package @united-workforce/eval (uwf-eval CLI): - CLI skeleton: run/report/diff/list subcommands (stubs) - 5 OCAS schemas: eval-run, judge-frontmatter, judge-upstream, judge-hallucination, judge-token-stats - TaskManifest type + parser/validator for task.yaml - JudgeOutput/JudgeInput types for judge contract - EvalRunPayload/EvalRunConfig/EvalJudgeRecord storage types - 19 unit tests: task loader validation + schema definitions Refs #69	2026-06-04 23:42:16 +00:00
xiaomo	b94234652a	Merge pull request 'feat: agent-hermes reads real token counts from session DB' (#84 ) from feat/76-hermes-real-tokens into main CI / check (push) Successful in 1m41s Details feat: agent-hermes reads real token counts from session DB (#84)	2026-06-04 23:31:09 +00:00
xiaoju	1593dbb521	fix: compute usage as delta for session re-entry CI / check (pull_request) Successful in 1m41s Details On session resume, turns/inputTokens/outputTokens were cumulative (entire session history) instead of per-step increments. Now we snapshot metrics before prompt, compare after, and report the delta. Changes: - acp-client: add getSessionId() accessor - hermes: extract snapshotUsage() + computeUsageDelta() pure functions - hermes: runPrompt/runHermes/continueHermes use before/after snapshots - 9 new unit tests for usage delta computation Refs #68	2026-06-04 23:22:16 +00:00
xiaoju	d1c523c442	feat: agent-hermes reads real token counts from session DB CI / check (pull_request) Successful in 1m41s Details - Add inputTokens/outputTokens to HermesSessionJson type - Query input_tokens, output_tokens from sessions table in loadHermesSessionFromDb - Update test fixture schema with token columns - runPrompt now reports real token counts from Hermes state.db Refs #76, #68	2026-06-04 23:06:52 +00:00
xiaomo	4283e6766b	Merge pull request 'feat: agent-claude-code reports real $usage from stream-json' (#83 ) from feat/77-claude-code-usage into main CI / check (push) Successful in 1m42s Details feat: agent-claude-code reports real $usage from stream-json (#83)	2026-06-04 22:55:15 +00:00
xiaomo	4e4fb61ff5	Merge pull request 'feat: agent-hermes reports $usage (turns + duration)' (#82 ) from feat/76-hermes-usage into main CI / check (push) Successful in 1m40s Details feat: agent-hermes reports $usage (turns + duration) (#82)	2026-06-04 22:55:13 +00:00
xiaoju	be92cb2dd2	feat: agent-claude-code reports real $usage from stream-json output CI / check (pull_request) Successful in 1m40s Details - Map parsed numTurns, inputTokens, outputTokens, durationMs to Usage type - Add @united-workforce/protocol dependency + tsconfig reference - 747 tests pass Fixes #77 Refs #68	2026-06-04 22:36:44 +00:00
xiaoju	7681e8b8e2	feat: agent-hermes reports $usage (turns + duration) CI / check (pull_request) Successful in 1m40s Details - Count assistant turns from session messages - Measure wall-clock duration per prompt call - inputTokens/outputTokens remain 0 (ACP protocol doesn't expose token data yet) - Both runPrompt and continueHermes report usage Fixes #76 Refs #68	2026-06-04 22:30:14 +00:00
xiaomo	780005ad65	Merge pull request 'feat: agent-mock emits fixed $usage stats' (#81 ) from feat/75-mock-usage into main CI / check (push) Successful in 1m42s Details feat: agent-mock emits fixed $usage stats (#81)	2026-06-04 22:23:42 +00:00
xiaoju	248ac710fd	feat: agent-mock emits fixed $usage stats CI / check (pull_request) Successful in 1m41s Details - Mock agent returns {turns:1, inputTokens:0, outputTokens:0, duration:0} - E2E test 1 (linear workflow) asserts usage in CAS step nodes - 747 tests pass Fixes #75 Refs #68	2026-06-04 22:19:29 +00:00
xiaomo	172c232e61	Merge pull request 'feat: add $usage field to adapter protocol' (#80 ) from feat/74-usage-in-protocol into main CI / check (push) Successful in 1m41s Details feat: add $usage field to adapter protocol (#80)	2026-06-04 22:14:12 +00:00
xiaomo	5fe97591de	Merge pull request 'fix: agent bin fields point to dist/cli.js instead of src/cli.ts' (#79 ) from fix/agent-bin-78 into main CI / check (push) Successful in 2m55s Details fix: agent bin fields point to dist/cli.js instead of src/cli.ts (#79)	2026-06-04 15:41:45 +00:00
xiaoju	99f40c2488	feat: add $usage field to adapter protocol CI / check (pull_request) Successful in 2m28s Details - Add Usage type to protocol (turns, inputTokens, outputTokens, duration) - Add usage to StepRecord, StepNodePayload, StepEntry, STEP_NODE_SCHEMA - Thread usage through util-agent extract pipeline (writeStepNode → persistStep → createAgent) - All adapters return usage: null as placeholder (mock, hermes, claude-code, builtin) - 746 tests pass, no breaking changes (usage not in schema required array) Fixes #74 Refs #68	2026-06-04 15:41:07 +00:00
xingyue	bf489c59a5	fix: agent bin fields point to dist/cli.js instead of src/cli.ts CI / check (pull_request) Successful in 3m23s Details All three agent packages had bin pointing to ./src/cli.ts (bun-era leftover). Node cannot execute .ts files directly, causing ERR_MODULE_NOT_FOUND when spawning agents. Closes #78	2026-06-04 23:25:39 +08:00
xiaomo	9908d069ec	Merge pull request 'refactor(prompt): rename subcommands and add frontmatter output' (#67 ) from feat/prompt-refactor-66 into main CI / check (push) Successful in 5m15s Details refactor(prompt): rename subcommands and add frontmatter output (#67)	2026-06-04 14:51:12 +00:00
xingyue	83bcda60ff	refactor(prompt): rename subcommands and add frontmatter output CI / check (pull_request) Successful in 3m1s Details - Rename: user→usage-reference, author→workflow-authoring, adapter→adapter-developing - Remove: developer (content lives in CLAUDE.md) - All prompts output complete SKILL.md with YAML frontmatter - Setup instructions simplified: uwf prompt bootstrap > SKILL.md - Remove all bun references, use pnpm/npm - Fix CLAUDE.md: fixed→independent versioning - Delete old reference files (user/author/developer/adapter) Closes #66	2026-06-04 22:46:11 +08:00
xiaomo	17f7f44c43	Merge pull request 'chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs' (#64 ) from chore/rebranding-cleanup into main CI / check (push) Successful in 3m5s Details chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs (#64)	2026-06-04 13:13:03 +00:00
xiaoju	3401873051	chore: rebranding cleanup — reset versions to 0.1.0, bun→pnpm in docs CI / check (pull_request) Successful in 2m49s Details - All 9 packages reset to version 0.1.0 - CLAUDE.md: bun→pnpm, fixed→independent versioning, proman commands - docs/architecture.md: bun→pnpm in toolchain table - docs/sync-readme.md: bun→pnpm in conventions	2026-06-04 13:05:26 +00:00
xiaomo	7fc02e50c0	Merge pull request 'refactor: extract validateCount, replace CLI spawn with direct import' (#63 ) from chore/61-spawn-to-direct-import into main CI / check (push) Successful in 3m0s Details refactor: extract validateCount, replace CLI spawn with direct import (#63)	2026-06-04 12:41:42 +00:00
xiaoju	18170a4313	refactor: extract validateCount, replace CLI spawn with direct import CI / check (pull_request) Successful in 2m24s Details - Extract validateCount() from cmdThreadExec (throw instead of process.exit) - 5 validation tests now import validateCount directly (no subprocess) - Only --help tests still spawn CLI (need Commander output) - Test time: 1.7s → 475ms Fixes #61	2026-06-04 12:31:17 +00:00
xiaomo	1ce0b9b9ee	Merge pull request 'chore: remove integration tests, migrate to eval framework' (#62 ) from chore/60-remove-integration-tests into main CI / check (push) Successful in 2m18s Details chore: remove integration tests, migrate to eval framework (#62)	2026-06-04 12:25:39 +00:00
xiaoju	8bf5b88172	chore: remove integration tests, clean up CI exclusion CI / check (pull_request) Successful in 2m41s Details Deleted: - acp-client.integration.test.ts (3 cases) - resume-e2e.integration.test.ts (1 case, already skipped) These tests spawn a real hermes CLI and hit live LLM, belonging to the eval layer (#34), not CI. ACP protocol parsing is already covered by unit test acp-client.test.ts. Also removed the --exclude integration/ hack from test:ci. Fixes #60	2026-06-04 12:19:24 +00:00
xiaomo	9fbdd1dd2c	Merge pull request 'fix: OCAS_DIR → OCAS_HOME in test helpers' (#59 ) from fix/58-test-isolation into main CI / check (push) Successful in 2m44s Details fix: OCAS_DIR → OCAS_HOME in test helpers (#59)	2026-06-04 12:16:20 +00:00
xiaoju	66c2e2a79b	fix: use node dist/cli.js instead of npx tsx in thread-step-count tests CI / check (pull_request) Successful in 3m30s Details npx tsx hangs in CI Docker (30s+ timeout). node dist/cli.js runs in <2s.	2026-06-04 11:57:32 +00:00
xiaoju	58b58d511e	fix: add timeout to cmdThreadExec count logic tests CI / check (pull_request) Failing after 4m17s Details	2026-06-04 11:48:46 +00:00
xiaoju	596c05bfcc	fix: use node dist/cli.js instead of npx tsx in prompt help test CI / check (pull_request) Failing after 3m40s Details npx tsx fails in CI (tsx not found, npm tries to install it)	2026-06-04 11:32:09 +00:00
xiaoju	d26f54e8ea	fix: biome format + remove unused noConsole suppressions CI / check (pull_request) Failing after 3m58s Details	2026-06-04 11:22:46 +00:00
xiaoju	883bd79bcb	fix: add timeout to CI-slow tests + check stderr for help output CI / check (pull_request) Failing after 1m55s Details	2026-06-04 11:18:49 +00:00
xiaoju	63454a4cfd	fix: OCAS_DIR → OCAS_HOME in test helpers + exclude integration tests from CI CI / check (pull_request) Failing after 2m27s Details - Remaining OCAS_DIR references caused test isolation failures - agent-hermes integration tests need 'hermes' CLI, skip in CI Fixes #58	2026-06-04 11:06:42 +00:00
xiaoju	5fe492c011	Merge pull request 'fix: add missing workflow destructure in current-role test' (#57 ) from fix/56-ts-compile-error into main CI / check (push) Failing after 1m35s Details	2026-06-04 11:00:25 +00:00
xiaoju	9f5891169e	fix: add missing workflow destructure in current-role test CI / check (pull_request) Failing after 1m37s Details The createMarker call used shorthand 'workflow' but the variable was not destructured from cmdThreadStart. Fixes #56	2026-06-04 10:56:44 +00:00
xiaoju	0470d9445a	Merge pull request 'fix: disable pnpm minimumReleaseAge in CI' (#55 ) from fix/ci-disable-release-age into main CI / check (push) Failing after 1m45s Details	2026-06-04 10:32:51 +00:00
xiaoju	07128b89af	fix: pnpm 11 CI compatibility CI / check (pull_request) Failing after 1m27s Details - Set minimumReleaseAge: 0 (pnpm 11 defaults to 1440 min) - Add allowBuilds for esbuild and msw (pnpm 11 blocks build scripts by default, config moved from package.json to pnpm-workspace.yaml)	2026-06-04 10:23:02 +00:00
xiaoju	1fdeb716ca	Merge pull request 'fix: migrate CI from bun to pnpm' (#54 ) from fix/52-ci-bun-to-pnpm into main CI / check (push) Failing after 51s Details	2026-06-04 10:05:35 +00:00
xiaoju	1b99f0e2c1	fix: migrate CI from bun to pnpm CI / check (pull_request) Failing after 1m44s Details Closes #52	2026-06-04 10:05:02 +00:00
xiaomo	f56e24cf82	Merge pull request 'test: expand E2E coverage — suspend, count, mustache, completed resume' (#51 ) from test/33-more-e2e into main CI / check (push) Failing after 1m28s Details test: expand E2E coverage — suspend, count, mustache, completed resume (#51)	2026-06-04 09:04:09 +00:00
xiaoju	974c2b8f1b	test: add E2E tests for suspend/resume, --count, mustache, and completed resume (#33 ) CI / check (pull_request) Failing after 1m40s Details 4 new E2E scenarios: 4. $SUSPEND → resume lifecycle (suspendedRole/suspendMessage metadata) 5. --count 3 runs entire pipeline in one invocation 6. mustache template variables rendered into edgePrompt 7. completed thread resume (ouroboros: end → start, CAS chain preserved) Total: 7 E2E scenarios, all passing.	2026-06-04 09:03:01 +00:00
xiaomo	6e7276425d	Merge pull request 'chore: fix biome check errors (40 → 0)' (#50 ) from chore/fix-biome-check into main CI / check (push) Failing after 1m16s Details chore: fix biome check errors (40 → 0) (#50)	2026-06-04 09:01:49 +00:00
xingyue	dbb7885ffd	chore: fix biome check errors (40 → 0) CI / check (pull_request) Failing after 1m39s Details - Auto-fix: import sorting, formatting (17 files) - Unsafe auto-fix: unused vars, template literals (7 files) - Manual: nursery/noConsole → suspicious/noConsole suppression - Manual: suppress noExcessiveCognitiveComplexity for cmdThreadResume and parseWorkflowPayload - Manual: remove unused destructured vars in current-role tests Closes #48	2026-06-04 16:45:45 +08:00
xiaomo	cd7e4e77ff	Merge pull request 'feat: agent-mock package for deterministic E2E testing (#33 )' (#44 ) from test/33-mock-agent into main CI / check (push) Failing after 1m38s Details feat: agent-mock package for deterministic E2E testing (#44)	2026-06-04 08:38:51 +00:00
xiaomo	64a8bab5ce	Merge pull request 'fix: resolve workflow from CAS chain in collectCompletedThreads' (#47 ) from fix/completed-thread-workflow into main CI / check (push) Failing after 1m33s Details fix: resolve workflow from CAS chain in collectCompletedThreads (#47)	2026-06-04 08:38:06 +00:00
xiaoju	80e8efb05e	test: E2E integration tests with uwf-mock agent (#33 ) CI / check (pull_request) Failing after 2m30s Details Three scenarios testing the full CLI pipeline: 1. Linear workflow (planner → worker → $END): CAS chain integrity 2. Loop workflow (developer ↔ reviewer): moderator routing through cycles 3. Role mismatch detection: agent catches routing bugs Uses workflow add → thread start → thread exec with uwf-mock, verifying CAS state, thread lifecycle, and error handling. Updated assertions to use getThread().status === 'completed' (aligned with PR #45 unified thread storage). Refs #33	2026-06-04 08:06:22 +00:00
xiaoju	75fb752a82	feat: add agent-mock package for deterministic E2E testing (#33 ) New package @united-workforce/agent-mock (uwf-mock CLI): - Reads pre-scripted outputs from a YAML mock data file (--mock-data) - Counts existing CAS chain steps to determine step index - Validates expected role matches actual moderator routing - Stores minimal detail node in CAS for valid step refs - Zero LLM, instant execution, 100% deterministic Usage in config.yaml: agents: mock: command: uwf-mock args: ["--mock-data", "./fixtures/scenario.yaml"] Refs #33	2026-06-04 08:00:07 +00:00
xingyue	06af1dc668	fix: resolve workflow from CAS chain in collectCompletedThreads CI / check (pull_request) Failing after 1m28s Details Instead of hardcoding workflow as empty string for completed/cancelled threads, use resolveWorkflowFromHead to get the actual workflow hash from the CAS chain, consistent with active thread handling. Closes #46	2026-06-04 15:35:08 +08:00
xiaomo	bbea89c067	Merge pull request 'refactor: unified thread storage + resume completed threads' (#45 ) from refactor/39-unified-thread-storage into main CI / check (push) Failing after 1m26s Details refactor: unified thread storage + resume completed threads (#45)	2026-06-04 07:25:56 +00:00
xingyue	bda3e3a861	feat(cli): resume completed threads (ouroboros: end → start) CI / check (pull_request) Failing after 3m45s Details uwf thread resume now supports completed threads: - Evaluates workflow graph from $START to find first role - Clears completed state (status → idle, completedAt → null) - Builds resume prompt with supplement context - Full CAS chain preserved for rich context Suspended resume behavior unchanged. Cancelled/idle threads still rejected. 425 tests pass. Part of #39, closes #43	2026-06-04 15:13:47 +08:00
xingyue	ca7b68ca5f	refactor(cli): unify thread storage, remove history prefix - store.ts: all threads in @uwf/thread/* with status tag - Remove HISTORY_VAR_PREFIX, ThreadHistoryLine, deleteThread - Add loadActiveThreads, loadHistoryThreads, completeThread - Add migrateHistoryVarsToThreadVars migration - thread.ts: replace deleteThread+addHistoryEntry with completeThread - shared.ts: remove findHistoryEntry fallback - Update all tests for unified storage model 422 tests pass. Part of #39, closes #41, closes #42	2026-06-04 15:01:20 +08:00
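
Commit 1593dbb521 above describes reporting usage as a per-step delta: snapshot the cumulative session metrics before the prompt, read them again afterwards, and report the difference. A minimal sketch of that idea follows. The field names come from the Usage type listed in commit 99f40c2488; the `UsageSnapshot` type, the function signature, and the clamp at zero are assumptions made for illustration and may not match the real `computeUsageDelta()` in the repository.

```typescript
// Field names follow the Usage type named in the log (turns, inputTokens,
// outputTokens, duration). Everything else is an illustrative assumption.
interface Usage {
  turns: number;
  inputTokens: number;
  outputTokens: number;
  duration: number;
}

// Hypothetical: cumulative session counters read before or after a prompt.
type UsageSnapshot = Omit<Usage, "duration">;

// Subtract the "before" snapshot from the "after" snapshot so that a resumed
// session reports only what the current step consumed.
function computeUsageDelta(
  before: UsageSnapshot,
  after: UsageSnapshot,
  durationMs: number,
): Usage {
  // Assumption: clamp at zero in case counters were reset between snapshots.
  const diff = (a: number, b: number): number => Math.max(0, a - b);
  return {
    turns: diff(after.turns, before.turns),
    inputTokens: diff(after.inputTokens, before.inputTokens),
    outputTokens: diff(after.outputTokens, before.outputTokens),
    duration: durationMs,
  };
}

// Example: a session that already had 3 turns before this step ran.
const before: UsageSnapshot = { turns: 3, inputTokens: 1000, outputTokens: 400 };
const after: UsageSnapshot = { turns: 5, inputTokens: 1600, outputTokens: 650 };
console.log(computeUsageDelta(before, after, 1200));
```

Keeping the delta computation a pure function, as the commit message indicates, lets it be unit-tested without a live session.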


993 Commits
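
The eval commits above (fae9e9ed3a and 8c26f16716) mention two scoring rules: the frontmatter-compliance judge scores `stepsValid / stepsTotal`, and the collect phase computes a weighted score across judges. The log does not give the weighting formula, so the sketch below assumes a plain weighted average. The `JudgeResult` type, both function names, and the handling of empty inputs are hypothetical.

```typescript
// Hypothetical record for one judge's result; not the repository's type.
interface JudgeResult {
  name: string;
  score: number; // expected in [0, 1]
  weight: number; // relative importance, non-negative
}

// Ratio stated in the log for frontmatter-compliance.
// Assumption: a run with no steps scores 0 rather than dividing by zero.
function frontmatterScore(stepsValid: number, stepsTotal: number): number {
  return stepsTotal === 0 ? 0 : stepsValid / stepsTotal;
}

// Assumed formula: sum(weight * score) / sum(weight).
// Assumption: returns 0 when the total weight is 0.
function weightedScore(results: JudgeResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weightedSum = results.reduce((sum, r) => sum + r.weight * r.score, 0);
  return weightedSum / totalWeight;
}

const results: JudgeResult[] = [
  { name: "frontmatter-compliance", score: frontmatterScore(3, 4), weight: 2 },
  { name: "token-stats", score: 1.0, weight: 1 },
];
console.log(weightedScore(results));
```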