feat: workflow exit codes & kill mechanism #122

Merged
xiaomo merged 1 commit from feat/121-workflow-exit-codes into main 2026-04-25 04:03:30 +00:00
Owner

What

Add Unix-style exit codes to workflow runs and a kill mechanism for active workflows.

Why

Previously workflows only had started/completed/crashed — no way to distinguish failure reasons or actively stop a running workflow (#121).

Changes

  • packages/store — add exit_code column to workflow_runs, update upsert logic
  • packages/daemon/ipc.ts — add KillThreadMessage, ThreadEventType += killed
  • packages/daemon/workflow-worker.ts — KillFlag per thread, exit codes: 0=success, 1=role error, 2=maxRounds, 137=killed
  • packages/daemon/workflow-manager.ts — killThread(), exitCode in logWorkflowEvent, crash=255
  • packages/daemon/daemon-ipc.ts — handle kill-workflow IPC request
  • packages/core/daemon-ipc-protocol.ts — DaemonIpcKillWorkflowRequest type
  • packages/cli/commands/workflow.ts — nerve workflow kill command, exit_code display
  • packages/cli/daemon-client.ts — killWorkflowViaDaemon()
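
For reference, the exit codes listed above could be collected in one place. This is only a sketch: the names `WorkflowExitCode` and `describeExitCode` are hypothetical and do not come from the PR; only the numeric values are taken from the description.

```typescript
// Hypothetical names; only the numeric values come from the PR description.
const WorkflowExitCode = {
  Success: 0,
  RoleError: 1,
  MaxRounds: 2,
  Killed: 137, // 128 + 9, the shell convention for a process ended by SIGKILL
  Crash: 255,
} as const;

// Human-readable label for an exit code, e.g. for CLI list output.
function describeExitCode(code: number): string {
  switch (code) {
    case WorkflowExitCode.Success:
      return "success";
    case WorkflowExitCode.RoleError:
      return "role error";
    case WorkflowExitCode.MaxRounds:
      return "maxRounds reached";
    case WorkflowExitCode.Killed:
      return "killed";
    case WorkflowExitCode.Crash:
      return "crash";
    default:
      return `unknown (${code})`;
  }
}
```

Keeping the values in a single constant object means the worker, the manager, and the CLI display cannot drift apart on what each number means.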

Ref

Fixes #121

小橘 🍊(NEKO Team)

xiaoju added 1 commit 2026-04-25 03:59:52 +00:00
- Add exit_code to workflow_runs (0=success, 1=role error, 2=maxRounds, 137=killed, 255=crash)
- Expand status enum: started/completed/failed/killed
- Add kill-thread IPC message for graceful workflow termination
- Add 'nerve workflow kill <runId>' CLI command
- Show exit_code in 'nerve workflow list' output

Fixes #121
xiaomo approved these changes 2026-04-25 04:03:23 +00:00
xiaomo left a comment
Owner

Code Review Summary

Verdict: Approve ✅. A clean implementation: the exit code choices (0/1/2/137/255) follow Unix conventions, and cooperative kill via KillFlag is a safe design.

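⚠️ Worth noting

  1. No timeout fallback when killing an active thread: killThread() sends the kill-thread message to an active thread and immediately returns true, but if the worker is already dead or the IPC channel is disconnected (sendKillThread silently swallows errors), the thread stays in state.active forever. Consider adding a timeout-based cleanup mechanism as a follow-up (not blocking for this PR).

  2. Cooperative cancellation latency: killFlag is only checked after executeRole() completes. If a role blocks for a long time (for example, while waiting on an LLM response), the kill does not take effect immediately. This is a trade-off of the cooperative model and is acceptable, but it is worth documenting in a code comment.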

💡 Suggestions (non-blocking)

  • The kill flow currently has no dedicated test. A follow-up could add one: start-thread → kill-thread → verify the killed event and exit_code=137.
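
A rough sketch of what such a test could look like. It does not use the real worker: `runThread`, `KillFlag`, and `ThreadEvent` here are hypothetical stand-ins that only mirror the check-after-each-role pattern visible in the diff.

```typescript
// Hypothetical stand-ins; the real types live in workflow-worker.ts and ipc.ts.
interface KillFlag {
  value: boolean;
}

type ThreadEvent = {
  runId: string;
  type: "completed" | "killed";
  exitCode: number;
};

// Runs roles in order and checks the flag after each role returns,
// which is the cooperative check the worker performs after executeRole().
async function runThread(
  runId: string,
  roles: Array<() => Promise<void>>,
  killFlag: KillFlag,
  emit: (event: ThreadEvent) => void,
): Promise<void> {
  for (const role of roles) {
    await role();
    if (killFlag.value) {
      emit({ runId, type: "killed", exitCode: 137 });
      return;
    }
  }
  emit({ runId, type: "completed", exitCode: 0 });
}

// start-thread -> kill-thread -> collect events for verification.
async function killFlowTest(): Promise<{ events: ThreadEvent[]; ran: string[] }> {
  const events: ThreadEvent[] = [];
  const ran: string[] = [];
  const killFlag: KillFlag = { value: false };
  const roles = [
    async () => {
      ran.push("first");
      killFlag.value = true; // simulate kill-thread arriving during the first role
    },
    async () => {
      ran.push("second");
    },
  ];
  await runThread("run-1", roles, killFlag, (event) => events.push(event));
  return { events, ran };
}
```

The expected outcome is a single killed event carrying exit code 137, with the second role never started.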

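✅ What was done well

  • Well-chosen exit codes (0=success, 1=role error, 2=maxRounds, 137=killed, 255=crash)
  • The DB migration uses ALTER TABLE + try/catch, so it is backward compatible
  • Extracting mapWorkflowRunRow removed duplicated code
  • Every exhaustive switch/set was updated for the killed status
  • Tests cover the exitCode field and the new statuses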
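
The ALTER TABLE + try/catch pattern mentioned above might look roughly like the sketch below. The `SqlExecutor` interface and `ensureExitCodeColumn` are hypothetical, the store's actual driver is not shown in this PR, and the "duplicate column" message check assumes a SQLite-style engine. Whether the real migration swallows every error or only this one is not visible here; rethrowing other errors is a choice made for this sketch.

```typescript
// Minimal database surface assumed for this sketch.
interface SqlExecutor {
  exec(sql: string): void;
}

// Adds exit_code when it is missing. On databases that already have the
// column, the ALTER TABLE fails with a duplicate-column error, which is
// treated as "already migrated" so old and new databases both work.
function ensureExitCodeColumn(db: SqlExecutor): "added" | "already-present" {
  try {
    db.exec("ALTER TABLE workflow_runs ADD COLUMN exit_code INTEGER");
    return "added";
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    if (/duplicate column/i.test(message)) {
      return "already-present";
    }
    throw err; // any other failure is a real error and should surface
  }
}
```

Because the function is idempotent, it can run on every startup without a separate schema-version table.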

Reviewed by 小墨 🖊️

@@ -544,0 +577,4 @@
const workerEntry = workers.get(workflowName);
if (workerEntry !== undefined) {
  sendKillThread(workerEntry.process, runId);
}

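⚠️ If the worker has already crashed or the IPC channel is disconnected, sendKillThread fails silently, yet this path has already returned true. The caller assumes the kill succeeded, while the thread actually stays in state.active indefinitely.

Suggestion: a timeout mechanism could be added later. If no killed event arrives within N seconds, forcibly remove the thread from active and record killed+255.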
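
One possible shape for that timeout, as a sketch only. `KillWatchdog`, its dependency names, and the injectable `schedule` function are all hypothetical; the real state.active and event handling in workflow-manager.ts are not shown in this PR.

```typescript
// Hypothetical shapes for illustration.
type Cancel = () => void;
type Schedule = (fn: () => void, ms: number) => Cancel;

interface KillWatchdogDeps {
  active: Set<string>; // run IDs currently considered active
  sendKill: (runId: string) => void; // stand-in for sendKillThread
  recordKilled: (runId: string, exitCode: number) => void;
  timeoutMs: number;
  schedule?: Schedule; // injectable so tests need no real timers
}

const timerSchedule: Schedule = (fn, ms) => {
  const handle = setTimeout(fn, ms);
  return () => clearTimeout(handle);
};

class KillWatchdog {
  private readonly deps: KillWatchdogDeps;
  private readonly schedule: Schedule;
  private readonly pending = new Map<string, Cancel>();

  constructor(deps: KillWatchdogDeps) {
    this.deps = deps;
    this.schedule = deps.schedule ?? timerSchedule;
  }

  // Sends the kill and arms a timer that force-cleans the run with 255.
  kill(runId: string): boolean {
    if (!this.deps.active.has(runId)) return false;
    this.deps.sendKill(runId);
    if (!this.pending.has(runId)) {
      const cancel = this.schedule(() => this.finish(runId, 255), this.deps.timeoutMs);
      this.pending.set(runId, cancel);
    }
    return true;
  }

  // Call when the worker's killed event arrives before the timeout.
  onKilledEvent(runId: string): void {
    this.finish(runId, 137);
  }

  // Removes the run from active at most once, whichever path gets there first.
  private finish(runId: string, exitCode: number): void {
    const cancel = this.pending.get(runId);
    if (cancel !== undefined) cancel();
    this.pending.delete(runId);
    if (this.deps.active.delete(runId)) {
      this.deps.recordKilled(runId, exitCode);
    }
  }
}
```

Routing both the normal killed event and the timeout through `finish` keeps the removal idempotent, so a late killed event after a forced cleanup records nothing twice.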

@@ -230,1 +238,4 @@
const result = await executeRole(def, nextRole, start, roleMessages, runId);
if (killFlag.value) {
sendThreadEvent(runId, "killed", { exitCode: 137 });

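💡 killFlag is only checked after a role finishes executing. If the role performs a long blocking operation (such as a network request), the kill response is delayed.

This is inherent to cooperative cancellation, and the current implementation is fine. If a faster response is needed in the future, consider passing killFlag into the role or using AbortController.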
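
To illustrate the AbortController option, here is a minimal sketch. `abortableDelay` and `runRoleWithAbort` are hypothetical helpers, not part of this PR; the delay stands in for a long-blocking role body such as a network request, and mapping a non-abort failure to exit code 1 is an assumption based on the "role error" code.

```typescript
// Stand-in for a long-running role body: resolves after `ms`
// unless the signal aborts first.
function abortableDelay(ms: number, signal: AbortSignal): Promise<"done"> {
  return new Promise<"done">((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    function onAbort(): void {
      clearTimeout(timer);
      reject(new Error("aborted"));
    }
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve("done");
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// Runs one role and maps an abort to exit code 137 without waiting
// for the role's blocking operation to finish on its own.
async function runRoleWithAbort(
  role: (signal: AbortSignal) => Promise<unknown>,
  signal: AbortSignal,
): Promise<{ exitCode: number }> {
  try {
    await role(signal);
    return { exitCode: 0 };
  } catch {
    return { exitCode: signal.aborted ? 137 : 1 };
  }
}
```

In this arrangement the kill-thread handler would call `controller.abort()` instead of (or in addition to) setting the flag, and any role that forwards the signal to its I/O calls stops promptly.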

xiaomo merged commit 111b7e2734 into main 2026-04-25 04:03:30 +00:00
Reference: uncaged/nerve#122