fix: thread status detection — crashed threads stuck as 'running' #170

Closed
opened 2026-05-09 12:19:45 +00:00 by xiaoju · 0 comments
Owner

Problem

When a workflow worker crashes (e.g. module resolution failure), the .running marker file is never cleaned up. The dashboard keeps showing the thread as "running" forever.

Example: thread 06F0S8TQYHKR9Z829NNQAB3RZ8 failed, but the dashboard still shows it as running.

Root Cause

packages/cli-workflow/src/serve/routes-thread.ts determines status purely by checking the .running file:

const status = r.source === "history" ? "completed" : isRunning ? "running" : "active";

If the worker process dies without cleanup, .running persists → stale "running" status.

Fix

  1. Status detection: Even if the .running file exists, check the chain for a terminal node (__end__). If one is found, the status is completed or failed, not running.
  2. Worker cleanup: Ensure the worker always removes .running on exit (add a finally block or a process signal handler).
  3. Distinguish completed vs failed: If the last node indicates an error, the status should be failed, not completed.
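
Items 1 and 3 above could be sketched roughly as follows. This is a minimal illustration, not the actual code in routes-thread.ts: the `ChainNode` and `ThreadRecord` shapes, the `error` field, the `"live"` source label, and the `resolveStatus` name are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real chain and record types in cli-workflow may differ.
type ThreadStatus = "completed" | "failed" | "running" | "active";

interface ChainNode {
  name: string; // "__end__" marks the terminal node
  error?: string; // assumed: present when the node records a failure
}

interface ThreadRecord {
  source: "history" | "live"; // "live" is an assumed label for non-history threads
  chain: ChainNode[];
}

function resolveStatus(r: ThreadRecord, isRunning: boolean): ThreadStatus {
  const last = r.chain.length > 0 ? r.chain[r.chain.length - 1] : undefined;

  // A terminal node takes precedence over a (possibly stale) .running marker.
  if (last !== undefined && last.name === "__end__") {
    return last.error !== undefined ? "failed" : "completed";
  }

  // Otherwise fall back to the existing logic.
  if (r.source === "history") return "completed";
  return isRunning ? "running" : "active";
}
```

With this ordering, a thread whose chain already ends in `__end__` is never reported as running, even when the marker file was left behind. A crash that writes no terminal node at all would still depend on the worker cleanup in item 2.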

Acceptance Criteria

  • Crashed/failed threads show the correct status (failed)
  • Normally completed threads show completed
  • Only actively running threads show running
  • Worker cleans up the .running file on any exit (success, error, signal)
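
The last criterion (cleanup on success, error, and signal) might look something like the sketch below. It assumes a Node.js worker; `runWithMarkerCleanup`, `markerPath`, and `work` are hypothetical names for illustration, not existing functions in the worker.

```typescript
import { rmSync } from "node:fs";

// Hypothetical helper: runs `work` and removes the marker file on every exit path
// that the process can observe.
async function runWithMarkerCleanup(
  markerPath: string,
  work: () => Promise<void>,
): Promise<void> {
  // With force: true, rmSync does not throw if the file is already gone.
  const cleanup = (): void => rmSync(markerPath, { force: true });

  // On SIGINT/SIGTERM, remove the marker and exit with the conventional 128 + signal code.
  const onSignal = (signal: string): void => {
    cleanup();
    process.exit(signal === "SIGINT" ? 130 : 143);
  };
  process.once("SIGINT", onSignal);
  process.once("SIGTERM", onSignal);

  try {
    await work();
  } finally {
    // Runs whether `work` resolved or threw.
    cleanup();
    process.removeListener("SIGINT", onSignal);
    process.removeListener("SIGTERM", onSignal);
  }
}
```

A handler like this cannot run on SIGKILL or when the process dies before the wrapper is reached, so the status detection change is still needed as a backstop for stale markers.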

小橘 🍊(NEKO Team)


Reference: uncaged/workflow#170