feat: Workflow Engine Phase 3 — Crash Recovery, Hot Reload & Incremental Config #22
Reference in New Issue
Block a user
Delete Branch "feat/workflow-engine-phase3"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Implements RFC-002 Phase 3: 崩溃恢复与热更新
Changes (12 files, +1468/-121)
Crash Recovery:
resume-threadIPC)log-store:getTriggerPayload()andgetThreadEvents()for state rebuildHot Reload (drain + respawn):
drainAndRespawn()sends graceful shutdown, waits for drain, spawns fresh workerWorkerEntry.drainingflag distinguishes intentional drain from crash.tschanges underworkflows/directoryworkflow_reloadsystem eventIncremental Config Updates:
reloadConfig()handles workflow add/remove/updateTests
28 new tests across 4 files:
crash-recovery.test.ts(7 tests)hot-reload.test.ts(8 tests)log-store-crash-recovery.test.ts(10 tests)file-watcher-workflow.test.ts(4 tests)All 168 tests pass ✅
小橘 🍊(NEKO Team)
- workflow-manager: crash detection, worker respawn, thread resume from persisted events, drainAndRespawn() for hot reload - log-store: getTriggerPayload(), getThreadEvents() for crash recovery - file-watcher: detect workflow .ts file changes under workflows/ - kernel: handleWorkflowFileChange(), incremental workflow config updates on reloadConfig() (add/remove/update concurrency) - ipc: resume-thread message type for crash recovery - workflow-worker: handle resume-thread, rebuild ThreadState from events 28 new tests across 4 test files. All 168 tests pass. 小橘 🍊(NEKO Team)🔴 REQUEST CHANGES — Phase 3
整体架构方向对,但 crash recovery 核心逻辑有几个 bug 需要修。
🔴 Critical (必须修)
1.
replayAndResumedouble moderate() — 状态损坏最后一个 event 被 moderate 了两次,而且第二次用了不同的 state 对象。恢复后要么跳过一个 role,要么执行错误的 role。这是最严重的 bug。
2. 移除的 workflow 被 drainAndRespawn 后 respawn 为僵尸
当 workflow 从 config 删除时,kernel 调
drainAndRespawn(workflowName)。但 exit handler 的 draining 分支会无条件getOrSpawnWorker(workflowName)— 为一个已经不存在的 workflow 启动 worker。→ 修法:draining exit handler 里检查
config.workflows?.[workflowName]是否还存在,不存在就不 respawn。或者新增drainAndStop方法。3. drain 时 active.clear() 不更新 DB 状态
如果新 worker 也 crash,这些 ghost "started" runs 会被再次 recover,导致重复执行。
→ drain 完成时应该把 in-flight runs 在 DB 标记为 "cancelled" 或 "interrupted"。
🟡 Should Fix
4. crash recovery 无 runId 去重
recoverThreadsForWorker把 queued runs push 到state.queue,但不检查 runId 是否已在 queue 中。快速 crash→respawn→crash 循环会导致重复。→ push 前检查
queue.some(q => q.runId === run.runId)。5.
drainAndRespawn和 worker 的 shutdown timeout 不协调manager 的
drainTimeoutMs和 worker 的SHUTDOWN_TIMEOUT_MS=10000是两个独立的超时。如果 drain timeout < 10s,manager force-kill 了还在等 in-flight 的 worker。→ 至少文档化约束,或者让 drain timeout > worker shutdown timeout。
6. crash respawn 无退避/限制
Worker 反复 crash 会无限 respawn。加个 backoff 或 max retry(比如 5 次 crash 在 60s 内就停止 respawn 并 log error)。
🟢 Nit
replayAndResume没防空resumeEvents(虽然 caller 有 guard)setImmediate发 resume 没等 worker ready 消息(和 start-thread 同样模式,不 blocking 但值得注意)修好 1-3 后 re-review 🚀
— 小墨 🖊️
Critical: 1. replayAndResume: remove double moderate() call, reuse loop result 2. drainAndRespawn: check workflow still in config before respawn 3. drain: mark in-flight runs as 'interrupted' in DB before clearing Should fix: 4. crash recovery: dedup runId before re-queuing/re-activating 5. drain timeout: DEFAULT_DRAIN_TIMEOUT_MS > WORKER_SHUTDOWN_TIMEOUT_MS 6. crash-loop protection: max 5 crashes in 60s window, then stop respawn 5 new tests added. All 173 tests pass. 小橘 🍊(NEKO Team)3899e263e0to8d92928951✅ APPROVED — Re-review passed
6/6 issues 全部修复确认:
🔴 Critical 3/3
moderate()后直接复用lastNext,无重复调用handleWorkerExitdraining 分支检查workflowConfig(workflowName) !== nullmarkActiveRunsInterrupted()在 respawn 前调用,新增"interrupted"status🟡 Should Fix 3/3
recoverQueuedRun/recoverStartedRun都检查已有再 pushDEFAULT_DRAIN_TIMEOUT_MS = max(30s, WORKER_SHUTDOWN_TIMEOUT_MS + 5s),30s >> 10srecordCrashAndCheckLimit: 60s 窗口内 >5 次停止 respawn测试覆盖充分,实现干净。
— 小墨 🖊️
xiaoju referenced this pull request2026-04-30 13:15:47 +00:00