Compare commits

..

9 Commits

Author SHA1 Message Date
xiaomo 0455f928f5 fix: address review feedback (星月) for Phase 3
1. sendToWorker: IPC send failure now marks thread as failed + dequeues next
2. crashLimitBlocked Set: prevents new startWorkflow from bypassing crash limit
3. "respawning" log skipped when crash limit is active
4. logWorkflowEvent payload: unknown | null (project convention, not ?:)
2026-04-30 14:10:06 +00:00
xiaomo dc4454d23e refactor: migrate workflow-manager process logic to WorkerRuntime (RFC-006 Phase 3)
- workflow-manager.ts: 792 → 498 lines, no more fork/ChildProcess/crash counting
- Extract pure functions to workflow-manager-support.ts (256 lines)
- WorkerRuntime gains: onCrashLimitReached callback, WorkerDrainOpts for per-call timeout
- worker-pool.ts updated for new evict/drain signature (null opts)
- All 167 daemon tests pass

Closes #282
2026-04-30 13:56:43 +00:00
xingyue 082d2e72f2 Merge pull request 'refactor: align develop prompts and .knowledge with flat workspace' (#288) from refactor/287-align-prompts-knowledge into main 2026-04-30 13:46:33 +00:00
xingyue fbf63e0266 Merge pull request 'RFC-006 Phase 2: Migrate SenseWorkerPool to WorkerRuntime' (#292) from refactor/rfc-006-worker-runtime into main 2026-04-30 13:44:56 +00:00
xingyue 7d89e8ab61 Merge pull request 'feat(cli): add hermes nerve skill — Phase 1 of RFC #289' (#291) from feat/agent-inject-phase1 into main 2026-04-30 13:42:05 +00:00
xiaomo e67ddc58d8 fix: address review feedback (星月)
1. trySendSync: wrap child.send in try/catch — IPC race between connected check and send
2. gracefulStop: same try/catch for shutdown send
3. Remove crashTimestamps reset on ready — crash window detection was being bypassed
2026-04-30 13:41:31 +00:00
xiaoju 06b1e3d785 refactor(cli,workflow-meta): scaffold AGENT.md on init; align develop prompts
Generate AGENT.md at ~/.uncaged-nerve root during nerve init (layout, verb-first workflows, createRole four-tuple, root build, coding style). Role prompts instruct agents to use cat AGENT.md instead of node_modules nerve-skills paths.

E2E init test asserts AGENT.md. Retain .knowledge workflow/adapter updates and flat single-file roles guidance from the branch.

Fixes #287

Made-with: Cursor
2026-04-30 13:41:24 +00:00
xiaomo 4dffcb636b fix: resolve 2 failing tests after WorkerRuntime migration
- Add trySendSync() for synchronous send when worker is ready+connected
- sendCompute uses sync path first, async fallback for cold start
- Add forwardStderr, allowRespawn, hasDisconnectedChild, onReady(key,msg)
- Tests: add connected:true to mocks, flush async fork microtasks
- All 167 daemon tests pass
2026-04-30 13:34:10 +00:00
xiaomo c34ec46416 feat(daemon): WorkerRuntime — generic message-routed process manager (closes #280)
RFC-006 Phase 1: ManagedWorker state machine + WorkerRuntime<K> with
cold start, crash respawn, drain/evict, graceful shutdown.
8 test cases covering all lifecycle scenarios.
2026-04-30 13:09:19 +00:00
26 changed files with 1423 additions and 616 deletions
+22 -1
View File
@@ -28,8 +28,29 @@ For long-running or incremental agent outputs:
|---------|---------|------|
| `@uncaged/nerve-adapter-cursor` | `cursorAdapter` / `createCursorAdapter()` | cursor-agent CLI |
| `@uncaged/nerve-adapter-hermes` | `hermesAdapter` / `createHermesAdapter()` | hermes chat CLI |
| `@uncaged/nerve-workflow-utils` | `createLlmAdapter(provider)` | OpenAI-compatible HTTP chat (single-turn) |
Each exports a **default instance** (sensible defaults) and a **factory** for custom config.
The Cursor and Hermes adapter packages each export a **default instance** (sensible defaults) and a **factory** for custom config. `createLlmAdapter` is a factory on `@uncaged/nerve-workflow-utils` only.
## createLlmAdapter
`createLlmAdapter` builds an `AgentFn` from an `LlmProvider` (`baseUrl`, `apiKey`, `model`). One chat completion per role step: **system** = the string passed by `createRole` (your prompt); **user** = `ctx.start.content` (the thread’s start frame). On failure it throws with a formatted LLM error.
```ts
import { createLlmAdapter, createRole } from "@uncaged/nerve-workflow-utils";
import { z } from "zod";
const metaSchema = z.object({ ok: z.boolean() });
const planner = createRole(
createLlmAdapter({ baseUrl: "https://api.example.com/v1", apiKey: "…", model: "gpt-4o-mini" }),
"You are a planner…",
metaSchema,
extractConfig,
);
```
Use this when you want a role backed by an HTTP LLM instead of a subprocess CLI adapter.
## Usage in Workflows
+14
View File
@@ -2,6 +2,20 @@
Stateful multi-step execution driven by Roles and a Moderator.
## Workspace Layout (authoring)
User Nerve workspaces use a **flat** build: one root `package.json`, one root bundle script (typically `scripts/build.mjs` wired from `scripts.build`), and **no** per-workflow `package.json` or `tsconfig.json`.
| Location | Purpose |
|----------|---------|
| `workflows/<name>/index.ts` | Default export: `WorkflowDefinition` (moderator + role map). |
| `workflows/<name>/roles/<role>.ts` | One module per role — schemas, prompts, `createRole` factories, or hand-written async role functions. |
| `dist/workflows/<name>/index.js` | Emit of the root build; this is what the daemon loads. |
**Naming:** Workflow ids should be **verb-first** kebab-case phrases (e.g. `deploy-staging`, `scan-dependencies`), not opaque nouns alone.
Senses follow the same flat pattern: `senses/<name>/src/*.ts`, `migrations/`, root build → `dist/senses/<name>/index.js`. See `.knowledge/sense.md`.
## Core Concepts
- **Workflow** — definition with concurrency strategy
@@ -205,6 +205,10 @@ describe("e2e init", () => {
expect(existsSync(join(nerveRoot, "scripts", "build.mjs"))).toBe(true);
expect(existsSync(join(nerveRoot, "biome.json"))).toBe(true);
expect(existsSync(join(nerveRoot, ".gitignore"))).toBe(true);
expect(existsSync(join(nerveRoot, "AGENT.md"))).toBe(true);
const agentMd = readFileSync(join(nerveRoot, "AGENT.md"), "utf8");
expect(agentMd).toContain("verb-first");
expect(agentMd).toContain("createRole");
expect(existsSync(join(nerveRoot, "senses", "cpu-usage", "src", "index.ts"))).toBe(true);
expect(existsSync(join(nerveRoot, "senses", "cpu-usage", "src", "schema.ts"))).toBe(true);
expect(existsSync(join(nerveRoot, "senses", "cpu-usage", "migrations", "0001_init.sql"))).toBe(
+65
View File
@@ -127,6 +127,70 @@ node_modules/
knowledge.db
`;
/** Generated at workspace root so agents can \`cat AGENT.md\` instead of npm skill paths. */
const AGENT_MD = `# Nerve workspace — agent guide
This file is created by \`nerve init\`. Read it before implementing senses or workflows.
## Directory layout
| Path | Purpose |
|------|---------|
| \`nerve.yaml\` | Senses, workflows, intervals, groups |
| \`package.json\` | Single root package — no per-sense/per-workflow packages |
| \`scripts/build.mjs\` | Root esbuild step; output under \`dist/\` |
| \`senses/<name>/src/index.ts\` | Sense \`compute()\` entry |
| \`senses/<name>/src/schema.ts\` | Drizzle SQLite schema (TypeScript) |
| \`senses/<name>/migrations/*.sql\` | SQL migrations (next to \`src/\`, not inside it) |
| \`workflows/<name>/index.ts\` | Default export: \`WorkflowDefinition\` |
| \`workflows/<name>/roles/<role>.ts\` | One TypeScript file per role |
| \`dist/senses/<name>/index.js\` | Bundled sense (after build) |
| \`dist/workflows/<name>/index.js\` | Bundled workflow (after build) |
There is **no** \`package.json\` or \`tsconfig.json\` inside individual senses or workflows.
## Naming
- **Workflows:** verb-first kebab-case (e.g. \`review-pull-request\`, \`deploy-staging\`). Avoid bare nouns like \`notifications\`.
- **Senses:** kebab-case descriptive nouns (e.g. \`cpu-usage\`).
## Workflow roles — four-tuple pattern
Wire each role with \`createRole\` from \`@uncaged/nerve-workflow-utils\`:
1. **Adapter** — \`AgentFn\` (LLM call)
2. **Prompt builder** — \`async (ctx: ThreadContext) => string\`
3. **Meta schema** — Zod object (routing / structured output from the model)
4. **Extractor config** — how JSON meta is parsed from replies
Keep meta small (often one boolean per role). The **moderator** in \`WorkflowDefinition\` routes between role names.
## Build commands
Always run from the **workspace root**:
\`\`\`bash
pnpm run build
# or: npm run build
\`\`\`
Fix errors until this succeeds. New workflows must appear under \`workflows/<name>/\` and be registered in \`nerve.yaml\`; new senses under \`senses/<name>/\` with matching \`nerve.yaml\` entries.
## Coding style (Nerve conventions)
- Use \`type\`, not \`interface\`; prefer \`function\` over classes (except errors / library requirements).
- **Named exports only** — no \`export default\` (exception: \`workflows/<name>/index.ts\` uses default export for the daemon loader).
- Nullable fields: \`T | null\`, not TypeScript optional \`?:\`.
- No dynamic \`import()\` in workspace code (bundling and tooling assume static imports).
- Use \`async\`/\`await\`; use a \`Result\` type for expected failures instead of control-flow try/catch.
## Extra references (optional)
- \`CONVENTIONS.md\` — project-specific overrides at repo root.
- \`.knowledge/*.md\` — deeper docs when working inside the Nerve monorepo.
- \`.cursor/skills/\` — Cursor Agent Skills (\`SKILL.md\` per skill).
`;
const NERVE_SKILLS_MDC = `---
description: >-
Where Agent Skills live in this Nerve workspace and how to use them with Cursor
@@ -362,6 +426,7 @@ async function runInitWorkspace(force: boolean, skipInstall = false): Promise<vo
writeFile(join(nerveRoot, "scripts", "build.mjs"), BUILD_MJS);
writeFile(join(nerveRoot, "biome.json"), BIOME_JSON);
writeFile(join(nerveRoot, ".gitignore"), GITIGNORE);
writeFile(join(nerveRoot, "AGENT.md"), AGENT_MD);
writeFile(join(nerveRoot, "senses", "cpu-usage", "src", "index.ts"), CPU_INDEX_TS);
writeFile(join(nerveRoot, "senses", "cpu-usage", "src", "schema.ts"), CPU_SCHEMA_TS);
writeFile(
@@ -28,6 +28,11 @@ function makeMockChild(pid = 1): MockChild {
child.connected = true;
child.exitCode = null;
child.pid = pid;
setImmediate(() => {
if (child.connected) {
child.emit("message", { type: "ready" });
}
});
child.send = vi.fn((msg: unknown) => {
if (
msg !== null &&
@@ -132,6 +137,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
mgr.startWorkflow("my-wf", { prompt: "test 1", maxRounds: 10, dryRun: false });
mgr.startWorkflow("my-wf", { prompt: "test 2", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mgr.activeCount("my-wf")).toBe(2);
// Simulate unexpected exit (not shutdown)
@@ -159,6 +165,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mgr.activeCount("my-wf")).toBe(2);
const child = mockChildren[0];
@@ -183,6 +190,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mockChildren).toHaveLength(1);
const child = mockChildren[0];
@@ -216,6 +224,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const firstChild = mockChildren[0];
firstChild.exitCode = 1;
firstChild.connected = false;
@@ -260,6 +269,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
// Start one thread to fill the concurrency slot (so queued run stays queued on respawn)
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const firstChild = mockChildren[0];
firstChild.exitCode = 1;
firstChild.connected = false;
@@ -285,6 +295,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const child = mockChildren[0];
const startCall = (child.send as ReturnType<typeof vi.fn>).mock.calls[0];
@@ -322,6 +333,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
const launch = { prompt: "build-docker for myrepo", maxRounds: 10, dryRun: false };
mgr.startWorkflow("my-wf", launch);
await vi.runAllTimersAsync();
const startedCall = logStore.upsertWorkflowRun.mock.calls.find(
(args: any[]) => (args[0] as { type: string }).type === "started",
@@ -357,6 +369,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
// Start one thread to fill the concurrency slot
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const firstChild = mockChildren[0];
// Crash once → respawn → crash again → second respawn
@@ -398,6 +411,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const firstChild = mockChildren[0];
firstChild.exitCode = 1;
firstChild.connected = false;
@@ -428,6 +442,7 @@ describe("WorkflowManager — crash recovery (Phase 3)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("crash-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
// Crash the worker 6 times in rapid succession (within CRASH_WINDOW_MS = 60s)
for (let i = 0; i < 6; i++) {
@@ -0,0 +1,9 @@
// Ready then crashes on a timer; still echoes IPC so parent tests can send after respawn
process.on("message", (msg) => {
if (msg && msg.type === "shutdown") {
process.exit(0);
}
process.send({ type: "echo", payload: msg });
});
process.send({ type: "ready" });
setTimeout(() => process.exit(1), 50);
@@ -0,0 +1,9 @@
// Simple test worker: sends ready, echoes messages, handles shutdown
process.on("message", (msg) => {
if (msg && msg.type === "shutdown") {
process.exit(0);
}
// Echo back with 'echo' type
process.send({ type: "echo", payload: msg });
});
process.send({ type: "ready" });
@@ -0,0 +1,9 @@
// Like echo-worker but writes stderr for tail diagnostics
console.error("stderr-marker");
process.on("message", (msg) => {
if (msg && msg.type === "shutdown") {
process.exit(0);
}
process.send({ type: "echo", payload: msg });
});
process.send({ type: "ready" });
@@ -33,6 +33,11 @@ function makeMockChild(pid = 1): MockChild {
child.connected = true;
child.exitCode = null;
child.pid = pid;
setImmediate(() => {
if (child.connected) {
child.emit("message", { type: "ready" });
}
});
child.send = vi.fn((msg: unknown) => {
if (
msg !== null &&
@@ -114,6 +119,7 @@ describe("WorkflowManager — drainAndRespawn (Phase 3 hot reload)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mockChildren).toHaveLength(1);
// Remove workflow from config before drain completes
@@ -134,6 +140,7 @@ describe("WorkflowManager — drainAndRespawn (Phase 3 hot reload)", () => {
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mgr.activeCount("my-wf")).toBe(2);
const drainPromise = mgr.drainAndRespawn("my-wf", 5000);
@@ -165,6 +172,7 @@ describe("WorkflowManager — drainAndRespawn (Phase 3 hot reload)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mockChildren).toHaveLength(1);
const drainPromise = mgr.drainAndRespawn("my-wf", 5000);
@@ -181,6 +189,7 @@ describe("WorkflowManager — drainAndRespawn (Phase 3 hot reload)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mockChildren).toHaveLength(1);
const drainPromise = mgr.drainAndRespawn("my-wf", 5000);
@@ -198,6 +207,7 @@ describe("WorkflowManager — drainAndRespawn (Phase 3 hot reload)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const drainPromise = mgr.drainAndRespawn("my-wf", 5000);
await vi.runAllTimersAsync();
@@ -223,6 +233,7 @@ describe("WorkflowManager — drainAndRespawn (Phase 3 hot reload)", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "first", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const drainPromise = mgr.drainAndRespawn("my-wf", 5000);
await vi.runAllTimersAsync();
@@ -230,6 +241,7 @@ describe("WorkflowManager — drainAndRespawn (Phase 3 hot reload)", () => {
// Start a new thread on the fresh worker
mgr.startWorkflow("my-wf", { prompt: "second", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const newChild = mockChildren[1];
const startCalls = (newChild.send as ReturnType<typeof vi.fn>).mock.calls.filter(
@@ -257,12 +269,13 @@ describe("WorkflowManager — drainWhenIdle (hot reload without interrupting in-
vi.clearAllMocks();
});
it("does not send shutdown while a thread is still active", () => {
it("does not send shutdown while a thread is still active", async () => {
const logStore = makeLogStore();
const config = makeWfConfig({ "my-wf": { concurrency: 1, overflow: "drop" } });
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const child = mockChildren[0];
mgr.drainWhenIdle("my-wf");
@@ -282,6 +295,7 @@ describe("WorkflowManager — drainWhenIdle (hot reload without interrupting in-
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const child = mockChildren[0];
const runId = (child.send as ReturnType<typeof vi.fn>).mock.calls[0][0] as { runId: string };
@@ -311,6 +325,7 @@ describe("WorkflowManager — drainWhenIdle (hot reload without interrupting in-
mgr.startWorkflow("my-wf", { prompt: "a", maxRounds: 10, dryRun: false });
mgr.startWorkflow("my-wf", { prompt: "b", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const child = mockChildren[0];
const sendMock = child.send as ReturnType<typeof vi.fn>;
const runIdA = (sendMock.mock.calls[0][0] as { runId: string }).runId;
@@ -355,6 +370,7 @@ describe("WorkflowManager — drainWhenIdle (hot reload without interrupting in-
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const child = mockChildren[0];
const runId = (child.send as ReturnType<typeof vi.fn>).mock.calls[0][0] as { runId: string };
@@ -388,6 +404,7 @@ describe("WorkflowManager — drainWhenIdle (hot reload without interrupting in-
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const child = mockChildren[0];
const runId = (child.send as ReturnType<typeof vi.fn>).mock.calls[0][0] as { runId: string };
@@ -414,6 +431,7 @@ describe("WorkflowManager — drainWhenIdle (hot reload without interrupting in-
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-wf", { prompt: "once", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const firstChild = mockChildren[0];
const runId = (firstChild.send as ReturnType<typeof vi.fn>).mock.calls[0][0] as {
runId: string;
@@ -471,6 +489,7 @@ describe("Kernel — workflow hot reload via file-watcher (Phase 3)", () => {
// Trigger a workflow thread so a worker is spawned
kernel.workflowManager.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
// Manually call drainAndRespawn (simulating what kernel does on workflow file change)
const drainPromise = kernel.workflowManager.drainAndRespawn("my-wf", 1000);
@@ -511,6 +530,7 @@ describe("Kernel — workflow hot reload via file-watcher (Phase 3)", () => {
maxRounds: 10,
dryRun: false,
});
await vi.runAllTimersAsync();
expect(mockChildren).toHaveLength(1);
// Reload config without old-wf
@@ -551,6 +571,7 @@ describe("Kernel — workflow hot reload via file-watcher (Phase 3)", () => {
});
kernel.workflowManager.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const workersBefore = mockChildren.length;
// Reload with updated concurrency — should NOT spawn a new workflow worker
@@ -573,6 +594,7 @@ describe("Kernel — workflow hot reload via file-watcher (Phase 3)", () => {
// Can now start up to 5 concurrent threads (previously only 1)
kernel.workflowManager.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
kernel.workflowManager.startWorkflow("my-wf", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(kernel.workflowManager.activeCount("my-wf")).toBe(3);
const stopPromise = kernel.stop();
@@ -70,6 +70,14 @@ const { createKernel } = await import("../kernel.js");
// Helpers
// ---------------------------------------------------------------------------
/** Sense worker `fork` runs on the next microtask per scheduled `start`. */
async function flushSenseWorkerForkMicrotasks(kernel: { groups: Set<string> }): Promise<void> {
const n = kernel.groups.size;
for (let i = 0; i < n; i++) {
await Promise.resolve();
}
}
function makeConfig(overrides: Partial<NerveConfig> = {}): NerveConfig {
return {
senses: {
@@ -142,6 +150,8 @@ describe("kernel — getHealth", () => {
},
});
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
const health = kernel.getHealth();
expect(health.activeSenses).toBe(3);
@@ -171,6 +181,8 @@ describe("kernel — restartGroup", () => {
it("sends shutdown to old worker and spawns new one", async () => {
const config = makeConfig();
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
expect(mockChildren.length).toBe(1);
const oldChild = mockChildren[0];
@@ -178,6 +190,7 @@ describe("kernel — restartGroup", () => {
const restartPromise = kernel.restartGroup("system");
// The shutdown message triggers exit in the mock
await restartPromise;
await vi.runAllTimersAsync();
// A new child should have been spawned
expect(mockChildren.length).toBe(2);
@@ -191,6 +204,8 @@ describe("kernel — restartGroup", () => {
it("restartGroup on unknown group does nothing", async () => {
const config = makeConfig();
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
expect(mockChildren.length).toBe(1);
await kernel.restartGroup("nonexistent");
@@ -218,6 +233,8 @@ describe("kernel — reloadConfig", () => {
it("adds new group worker when new sense group appears", async () => {
const config = makeConfig();
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
expect(mockChildren.length).toBe(1); // only system group
expect(kernel.groups.has("network")).toBe(false);
@@ -249,6 +266,9 @@ describe("kernel — reloadConfig", () => {
api: { port: null, token: null, host: "127.0.0.1" },
});
await Promise.resolve();
await vi.runAllTimersAsync();
expect(kernel.groups.has("network")).toBe(true);
expect(mockChildren.length).toBe(2); // system + network
@@ -283,6 +303,8 @@ describe("kernel — reloadConfig", () => {
api: { port: null, token: null, host: "127.0.0.1" },
};
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
expect(mockChildren.length).toBe(2);
expect(kernel.groups.has("network")).toBe(true);
@@ -308,6 +330,7 @@ describe("kernel — reloadConfig", () => {
});
expect(kernel.groups.has("network")).toBe(false);
await vi.runAllTimersAsync();
// Network child should have received shutdown
expect(networkChild.send).toHaveBeenCalledWith(expect.objectContaining({ type: "shutdown" }));
@@ -317,6 +340,8 @@ describe("kernel — reloadConfig", () => {
it("health reflects updated sense count after reloadConfig", async () => {
const config = makeConfig();
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
expect(kernel.getHealth().activeSenses).toBe(1);
@@ -29,6 +29,9 @@ type MockChild = EventEmitter & {
function makeMockChild(pid = 1): MockChild {
const child = new EventEmitter() as MockChild;
child.connected = true;
setImmediate(() => {
child.emit("message", { type: "ready" });
});
child.send = vi.fn((msg: unknown) => {
if (
msg !== null &&
@@ -136,6 +139,7 @@ describe("kernel.triggerSense()", () => {
logStore: makeMockLogStore() as never,
});
await vi.runAllTimersAsync();
expect(() => kernel.triggerSense("no-such-sense")).toThrow(/Unknown sense/);
await kernel.stop();
@@ -169,6 +173,7 @@ describe("kernel.triggerSense()", () => {
logStore: makeMockLogStore() as never,
});
await vi.runAllTimersAsync();
// Two groups → two workers
expect(mockChildren.length).toBe(2);
@@ -214,6 +219,7 @@ describe("kernel.triggerSense()", () => {
logStore: makeMockLogStore() as never,
});
await vi.runAllTimersAsync();
// Both senses share the "system" group → one worker only
expect(mockChildren.length).toBe(1);
const worker = mockChildren[0];
@@ -237,6 +243,7 @@ describe("kernel.triggerSense()", () => {
logStore: makeMockLogStore() as never,
});
await new Promise<void>((resolve) => setImmediate(resolve));
const worker = mockChildren[0];
worker.connected = false;
@@ -102,6 +102,13 @@ function makeLogStore() {
};
}
async function flushSenseWorkerForkMicrotasks(kernel: { groups: Set<string> }): Promise<void> {
const n = kernel.groups.size;
for (let i = 0; i < n; i++) {
await Promise.resolve();
}
}
function makeConfig(overrides: Partial<NerveConfig> = {}): NerveConfig {
return {
senses: {
@@ -164,6 +171,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
// Simulate a sense worker sending a signal with workflow launch payload
// The kernel's handleWorkerMessage processes "signal" type messages
@@ -185,6 +194,8 @@ describe("kernel + workflowManager integration", () => {
});
}
await vi.runAllTimersAsync();
// A workflow worker should be spawned and a start-thread message sent
const workflowWorker = mockChildren.find((c) =>
(c.send as ReturnType<typeof vi.fn>).mock.calls.some(
@@ -222,6 +233,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
// Simulate sense worker returning a signal plus workflow launch
const workerPool = mockChildren[0];
@@ -241,6 +254,8 @@ describe("kernel + workflowManager integration", () => {
});
}
await vi.runAllTimersAsync();
// Find the start-thread call and verify triggerPayload
const startThreadCall = mockChildren
.flatMap((c) => (c.send as ReturnType<typeof vi.fn>).mock.calls as [unknown][])
@@ -275,6 +290,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
const workerPool = mockChildren[0];
if (workerPool) {
@@ -293,6 +310,8 @@ describe("kernel + workflowManager integration", () => {
});
}
await vi.runAllTimersAsync();
const senseEntries = logStore.append.mock.calls
.map((c) => c[0] as { source: string; type: string; refId: string | null })
.filter((e) => e.source === "sense" && e.refId === "cpu-usage");
@@ -337,6 +356,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
// Emit a regular signal (shorthand payload) — should NOT trigger any workflow
const workerPool = mockChildren[0];
@@ -387,6 +408,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
// Simulate sense compute returning a signal plus workflow launch
const workerPool = mockChildren[0];
@@ -406,6 +429,8 @@ describe("kernel + workflowManager integration", () => {
});
}
await vi.runAllTimersAsync();
expect(logStore.upsertWorkflowRun).toHaveBeenCalledWith(
expect.objectContaining({ source: "workflow", type: "started" }),
expect.objectContaining({ workflow: "log-test-workflow", status: "started" }),
@@ -440,6 +465,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
// Reload with a workflow added
const newConfig: NerveConfig = {
@@ -479,6 +506,8 @@ describe("kernel + workflowManager integration", () => {
});
}
await vi.runAllTimersAsync();
const startThreadCall = mockChildren
.flatMap((c) => (c.send as ReturnType<typeof vi.fn>).mock.calls as [unknown][])
.find(
@@ -517,6 +546,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
// Reload with the workflow removed
const newConfig: NerveConfig = {
@@ -561,6 +592,8 @@ describe("kernel + workflowManager integration", () => {
});
}
await vi.runAllTimersAsync();
const startThreadCall = mockChildren
.flatMap((c) => (c.send as ReturnType<typeof vi.fn>).mock.calls as [unknown][])
.find(
@@ -600,6 +633,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
// Trigger a workflow via sense compute return value
const workerPool = mockChildren[0];
@@ -619,6 +654,8 @@ describe("kernel + workflowManager integration", () => {
});
}
await vi.runAllTimersAsync();
const stopPromise = kernel.stop();
await vi.runAllTimersAsync();
await expect(stopPromise).resolves.toBeUndefined();
@@ -664,6 +701,8 @@ describe("kernel + workflowManager integration", () => {
workerScript: "fake-worker.js",
logStore,
});
await flushSenseWorkerForkMicrotasks(kernel);
await vi.runAllTimersAsync();
const health = kernel.getHealth();
expect(health).toHaveProperty("activeWorkflows");
+22 -2
View File
@@ -16,10 +16,12 @@ type MockChild = EventEmitter & {
send: ReturnType<typeof vi.fn>;
kill: ReturnType<typeof vi.fn>;
pid: number;
connected: boolean;
};
function makeMockChild(pid = 1): MockChild {
const child = new EventEmitter() as MockChild;
child.connected = true;
setImmediate(() => {
child.emit("message", { type: "ready" });
});
@@ -27,7 +29,10 @@ function makeMockChild(pid = 1): MockChild {
if (msg === null || typeof msg !== "object") return;
const m = msg as Record<string, unknown>;
if (m.type === "shutdown") {
setImmediate(() => child.emit("exit", 0, null));
setImmediate(() => {
child.connected = false;
child.emit("exit", 0, null);
});
return;
}
if (m.type === "compute" && typeof m.sense === "string") {
@@ -37,6 +42,7 @@ function makeMockChild(pid = 1): MockChild {
}
});
child.kill = vi.fn((_signal?: string) => {
child.connected = false;
child.emit("exit", null, _signal ?? "SIGKILL");
});
child.pid = pid;
@@ -59,6 +65,14 @@ const { createLogStore } = await import("@uncaged/nerve-store");
// Helpers
// ---------------------------------------------------------------------------
/** `WorkerRuntime.start` schedules `fork` on the next microtask — flush one tick per initial group. */
async function flushSenseWorkerForkMicrotasks(kernel: { groups: Set<string> }): Promise<void> {
const n = kernel.groups.size;
for (let i = 0; i < n; i++) {
await Promise.resolve();
}
}
function makeConfig(overrides: Partial<NerveConfig> = {}): NerveConfig {
return {
senses: {
@@ -173,6 +187,7 @@ describe("kernel — message routing", () => {
},
});
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
const child = mockChildren[0];
child.emit("message", { type: "error", sense: "cpu-usage", error: "compute failed" });
@@ -201,6 +216,7 @@ describe("kernel — message routing", () => {
},
});
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
const child = mockChildren[0];
const callsBefore = stderrSpy.mock.calls.length;
@@ -228,6 +244,7 @@ describe("kernel — message routing", () => {
},
});
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
const child = mockChildren[0];
expect(() => child.emit("message", { type: "unknown-type" })).not.toThrow();
@@ -290,6 +307,7 @@ describe("kernel — groupForSense mapping", () => {
api: { port: null, token: null, host: "127.0.0.1" },
};
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
// system and network = 2 unique groups
expect(mockChildren.length).toBe(2);
@@ -311,8 +329,10 @@ describe("kernel — groupForSense mapping", () => {
},
});
const kernel = createKernel(config, nerveRoot);
await flushSenseWorkerForkMicrotasks(kernel);
const child = mockChildren[0];
child.emit("message", { type: "ready" });
vi.advanceTimersByTime(500);
expect(child.send).toHaveBeenCalledWith(
@@ -50,6 +50,7 @@ async function startWorkerWithReady(
group: string,
): Promise<void> {
const pr = pool.startWorker(group);
await Promise.resolve();
const child = mockChildren[mockChildren.length - 1];
child.emit("message", { type: "ready" });
await pr;
@@ -137,6 +138,7 @@ describe("createSenseWorkerPool", () => {
expect(pool.activeGroupCount()).toBe(1);
pool.evictGroup("x");
expect(pool.hasWorkerForGroup("x")).toBe(false);
await Promise.resolve();
expect(mockChildren[0].send).toHaveBeenCalledWith(
expect.objectContaining({ type: "shutdown" }),
);
@@ -159,6 +161,7 @@ describe("createSenseWorkerPool", () => {
const p = pool.restartGroup("g");
expect(onBeforeGroupRestart).toHaveBeenCalledWith("g");
await Promise.resolve();
expect(mockChildren[0].send).toHaveBeenCalledWith(
expect.objectContaining({ type: "shutdown" }),
);
@@ -171,7 +174,7 @@ describe("createSenseWorkerPool", () => {
});
it("onWorkerCrashed runs and schedules respawn after non-zero exit", async () => {
vi.useFakeTimers({ shouldAdvanceTime: true });
vi.useFakeTimers();
const onWorkerCrashed = vi.fn();
const pool = createSenseWorkerPool({
nerveRoot: "/tmp/n",
@@ -0,0 +1,181 @@
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import { afterEach, describe, expect, it, vi } from "vitest";
import { createWorkerRuntime } from "../worker-runtime.js";
const fixturesDir = join(dirname(fileURLToPath(import.meta.url)), "fixtures");
const echoWorkerPath = join(fixturesDir, "echo-worker.js");
const crashWorkerPath = join(fixturesDir, "crash-worker.js");
const stderrWorkerPath = join(fixturesDir, "stderr-worker.js");
function baseConfig(script: string) {
return {
script,
argsForKey: () => [],
forwardStderr: true,
onMessage: vi.fn(),
onReady: vi.fn(),
onExit: vi.fn(),
onCrashLimitReached: null,
respawn: {
enabled: true,
maxCrashes: 6,
windowMs: 60_000,
delayMs: 80,
allowRespawn: null,
},
shutdownTimeoutMs: 5000,
};
}
describe("createWorkerRuntime", () => {
const runtimes: Array<{ shutdown: () => Promise<void> }> = [];
afterEach(async () => {
await Promise.all(runtimes.splice(0).map((r) => r.shutdown()));
});
function track<R extends { shutdown: () => Promise<void> }>(r: R): R {
runtimes.push(r);
return r;
}
it("start + send message + receive echo", async () => {
const incoming: unknown[] = [];
const rt = track(
createWorkerRuntime({
...baseConfig(echoWorkerPath),
onMessage: (_key, msg) => {
incoming.push(msg);
},
}),
);
await rt.start("a");
expect(rt.has("a")).toBe(true);
await rt.send("a", { type: "ping", n: 1 });
await vi.waitFor(() => {
expect(incoming.some((m) => isEchoOf(m, { type: "ping", n: 1 }))).toBe(true);
});
await rt.shutdown();
});
it("cold start on send (no explicit start)", async () => {
const incoming: unknown[] = [];
const rt = track(
createWorkerRuntime({
...baseConfig(echoWorkerPath),
onMessage: (_key, msg) => {
incoming.push(msg);
},
}),
);
expect(rt.has("x")).toBe(false);
await rt.send("x", { type: "hi" });
await vi.waitFor(() => {
expect(rt.has("x")).toBe(true);
expect(incoming.some((m) => isEchoOf(m, { type: "hi" }))).toBe(true);
});
await rt.shutdown();
});
it("evict stops worker; has() is false", async () => {
const rt = track(createWorkerRuntime(baseConfig(echoWorkerPath)));
await rt.start("k");
expect(rt.has("k")).toBe(true);
await rt.evict("k", null);
expect(rt.has("k")).toBe(false);
await rt.shutdown();
});
it("drain stops and respawns (new pid)", async () => {
const rt = track(createWorkerRuntime(baseConfig(echoWorkerPath)));
await rt.start("k");
const before = rt.pid("k");
expect(before).not.toBeNull();
await rt.drain("k", null);
const after = rt.pid("k");
expect(after).not.toBeNull();
expect(after).not.toBe(before);
await rt.shutdown();
});
it("crash triggers auto-respawn", async () => {
const incoming: unknown[] = [];
const onExit = vi.fn();
const rt = track(
createWorkerRuntime({
...baseConfig(crashWorkerPath),
onExit,
onMessage: (_key, msg) => {
incoming.push(msg);
},
}),
);
await rt.start("c");
await vi.waitFor(() => expect(onExit.mock.calls.length).toBeGreaterThanOrEqual(1), {
timeout: 3000,
});
await vi.waitFor(() => expect(rt.has("c")).toBe(true), { timeout: 3000 });
await rt.send("c", { type: "after-crash" });
await vi.waitFor(() => {
expect(incoming.some((m) => isEchoOf(m, { type: "after-crash" }))).toBe(true);
});
await rt.shutdown();
});
it("crash limit reached → no more automatic respawns", async () => {
const rt = track(
createWorkerRuntime({
...baseConfig(crashWorkerPath),
respawn: {
enabled: true,
maxCrashes: 2,
windowMs: 60_000,
delayMs: 50,
allowRespawn: null,
},
}),
);
await rt.start("z");
await vi.waitFor(() => expect(rt.has("z")).toBe(false), { timeout: 8000 });
await rt.shutdown();
});
it("shutdown stops all workers", async () => {
const rt = track(createWorkerRuntime(baseConfig(echoWorkerPath)));
await rt.start("a");
await rt.start("b");
expect(rt.keys().sort()).toEqual(["a", "b"].sort());
await rt.shutdown();
expect(rt.keys()).toEqual([]);
expect(rt.has("a")).toBe(false);
expect(rt.has("b")).toBe(false);
});
it("stderrTail captures stderr output", async () => {
const rt = track(createWorkerRuntime(baseConfig(stderrWorkerPath)));
await rt.start("s");
await vi.waitFor(() => {
expect(rt.stderrTail("s")).toContain("stderr-marker");
});
await rt.shutdown();
});
});
function isEchoOf(msg: unknown, payload: unknown): boolean {
return (
typeof msg === "object" &&
msg !== null &&
(msg as Record<string, unknown>).type === "echo" &&
JSON.stringify((msg as Record<string, unknown>).payload) === JSON.stringify(payload)
);
}
@@ -26,6 +26,11 @@ function makeMockChild(pid = 1): MockChild {
child.connected = true;
child.exitCode = null;
child.pid = pid;
setImmediate(() => {
if (child.connected) {
child.emit("message", { type: "ready" });
}
});
child.send = vi.fn((msg: unknown) => {
if (
msg !== null &&
@@ -110,7 +115,7 @@ describe("WorkflowManager", () => {
});
describe("startWorkflow under concurrency limit dispatches thread", () => {
it("forks a worker and sends start-thread when active < concurrency", () => {
it("forks a worker and sends start-thread when active < concurrency", async () => {
const logStore = makeLogStore();
const config = makeConfig({
"my-workflow": { concurrency: 2, overflow: "drop" },
@@ -118,6 +123,7 @@ describe("WorkflowManager", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-workflow", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mockChildren).toHaveLength(1);
expect(mockChildren[0].send).toHaveBeenCalledWith(
@@ -126,7 +132,7 @@ describe("WorkflowManager", () => {
expect(mgr.activeCount("my-workflow")).toBe(1);
});
it("reuses the same worker for a second thread under the limit", () => {
it("reuses the same worker for a second thread under the limit", async () => {
const logStore = makeLogStore();
const config = makeConfig({
"my-workflow": { concurrency: 3, overflow: "drop" },
@@ -135,6 +141,7 @@ describe("WorkflowManager", () => {
mgr.startWorkflow("my-workflow", { prompt: "test 1", maxRounds: 10, dryRun: false });
mgr.startWorkflow("my-workflow", { prompt: "test 2", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
// Only one forked child — worker is reused
expect(mockChildren).toHaveLength(1);
@@ -142,7 +149,7 @@ describe("WorkflowManager", () => {
expect(mgr.activeCount("my-workflow")).toBe(2);
});
it("logs a 'started' event for each dispatched thread", () => {
it("logs a 'started' event for each dispatched thread", async () => {
const logStore = makeLogStore();
const config = makeConfig({
"my-workflow": { concurrency: 2, overflow: "drop" },
@@ -150,6 +157,7 @@ describe("WorkflowManager", () => {
const mgr = createWorkflowManager("/nerve-root", config, logStore);
mgr.startWorkflow("my-workflow", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(logStore.upsertWorkflowRun).toHaveBeenCalledWith(
expect.objectContaining({ source: "workflow", type: "started" }),
@@ -159,7 +167,7 @@ describe("WorkflowManager", () => {
});
describe("startWorkflow at limit with drop overflow drops the request", () => {
it("does NOT send start-thread when at concurrency limit with overflow=drop", () => {
it("does NOT send start-thread when at concurrency limit with overflow=drop", async () => {
const logStore = makeLogStore();
const config = makeConfig({
"drop-wf": { concurrency: 1, overflow: "drop" },
@@ -169,6 +177,7 @@ describe("WorkflowManager", () => {
mgr.startWorkflow("drop-wf", { prompt: "first", maxRounds: 10, dryRun: false });
// now at limit — second call should be dropped
mgr.startWorkflow("drop-wf", { prompt: "second", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mgr.activeCount("drop-wf")).toBe(1);
expect(mgr.queueLength("drop-wf")).toBe(0);
@@ -254,7 +263,7 @@ describe("WorkflowManager", () => {
});
describe("completing a thread dequeues the next one", () => {
it("dispatches the next queued thread when the active thread sends completed", () => {
it("dispatches the next queued thread when the active thread sends completed", async () => {
const logStore = makeLogStore();
const config = makeConfig({
"queue-wf": { concurrency: 1, overflow: "queue", maxQueue: 5 },
@@ -263,6 +272,7 @@ describe("WorkflowManager", () => {
mgr.startWorkflow("queue-wf", { prompt: "first", maxRounds: 10, dryRun: false });
mgr.startWorkflow("queue-wf", { prompt: "second", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
expect(mgr.activeCount("queue-wf")).toBe(1);
expect(mgr.queueLength("queue-wf")).toBe(1);
@@ -289,7 +299,7 @@ describe("WorkflowManager", () => {
);
});
it("dispatches next queued thread when active thread sends failed", () => {
it("dispatches next queued thread when active thread sends failed", async () => {
const logStore = makeLogStore();
const config = makeConfig({
"queue-wf": { concurrency: 1, overflow: "queue", maxQueue: 5 },
@@ -298,6 +308,7 @@ describe("WorkflowManager", () => {
mgr.startWorkflow("queue-wf", { prompt: "first", maxRounds: 10, dryRun: false });
mgr.startWorkflow("queue-wf", { prompt: "second", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
const child = mockChildren[0];
const firstRunId = (child.send as ReturnType<typeof vi.fn>).mock.calls[0][0].runId as string;
@@ -325,6 +336,7 @@ describe("WorkflowManager", () => {
mgr.startWorkflow("wf-a", { prompt: "test", maxRounds: 10, dryRun: false });
mgr.startWorkflow("wf-b", { prompt: "test", maxRounds: 10, dryRun: false });
await vi.runAllTimersAsync();
// Two distinct workers should have been forked
expect(mockChildren).toHaveLength(2);
+74 -122
View File
@@ -1,19 +1,13 @@
/**
* Sense worker pool — forked child processes per sense group (IPC lifecycle).
* Sense worker pool — thin wrapper around WorkerRuntime (RFC-006): one fork per sense group.
*/
import { fork } from "node:child_process";
import type { ChildProcess } from "node:child_process";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import type { ComputeMessage, ShutdownMessage } from "./ipc.js";
import { parseWorkerMessage } from "./ipc.js";
import {
formatCapturedStderrTail,
formatChildExitSummary,
teeCapturedStderr,
} from "./worker-fork-support.js";
import type { ComputeMessage } from "./ipc.js";
import { formatCapturedStderrTail, formatChildExitSummary } from "./worker-fork-support.js";
import { createWorkerRuntime } from "./worker-runtime.js";
export function resolveWorkerScript(): string {
const __filename = fileURLToPath(import.meta.url);
@@ -21,17 +15,12 @@ export function resolveWorkerScript(): string {
return join(__dir, "sense-worker.js");
}
type WorkerEntry = {
group: string;
process: ChildProcess;
};
export type SenseWorkerPoolOptions = {
nerveRoot: string;
workerScript: string;
/** Invoked for every IPC message from a worker (including ready / signal / error). */
onWorkerMessage: (raw: unknown) => void;
/** Sense names in a group — used when clearing scheduler state on crash or restart. */
/** Sense names in a group — reserved for scheduler-aligned cleanup (kernel passes current config). */
sensesForGroup: (group: string) => string[];
/**
* Called when a worker exits with non-zero code before scheduling a respawn
@@ -58,144 +47,107 @@ export type SenseWorkerPool = {
activeGroupCount: () => number;
};
function spawnWorker(
nerveRoot: string,
group: string,
workerScript: string,
stderrTail: { value: string },
): ChildProcess {
const child = fork(workerScript, ["--group", group, "--root", nerveRoot], {
stdio: ["ignore", "inherit", "pipe", "ipc"],
});
teeCapturedStderr(child, stderrTail);
child.on("error", (err) => {
if ((err as NodeJS.ErrnoException).code !== "EPIPE") {
console.error("[worker] error:", err.message);
}
});
return child;
}
function sendComputeToProcess(worker: ChildProcess, senseName: string): void {
if (worker.connected === false) return;
const msg: ComputeMessage = { type: "compute", sense: senseName };
try {
worker.send(msg);
} catch {
// IPC channel closed between connected check and send
}
}
function sendShutdownToProcess(worker: ChildProcess): void {
if (worker.connected === false) return;
const msg: ShutdownMessage = { type: "shutdown" };
try {
worker.send(msg);
} catch {
// IPC channel closed between connected check and send
}
}
function waitForExit(child: ChildProcess, timeoutMs: number): Promise<void> {
return new Promise((resolve) => {
const timer = setTimeout(() => {
child.kill("SIGKILL");
resolve();
}, timeoutMs);
child.once("exit", () => {
clearTimeout(timer);
resolve();
});
});
}
/** Matches legacy pool: long crash window, 1s respawn delay, practical unlimited respawns. */
const SENSE_WORKER_RESPAWN = {
enabled: true,
maxCrashes: 100_000,
windowMs: 86_400_000,
delayMs: 1000,
} as const;
export function createSenseWorkerPool(options: SenseWorkerPoolOptions): SenseWorkerPool {
const workers = new Map<string, WorkerEntry>();
function startWorker(group: string): Promise<void> {
const stderrTail = { value: "" };
const child = spawnWorker(options.nerveRoot, group, options.workerScript, stderrTail);
let workerReadyResolve: (() => void) | undefined;
const workerReady = new Promise<void>((resolve) => {
workerReadyResolve = resolve;
});
child.on("message", (raw: unknown) => {
const result = parseWorkerMessage(raw);
if (result.ok && result.value.type === "ready") {
workerReadyResolve?.();
}
const runtime = createWorkerRuntime<string>({
script: options.workerScript,
argsForKey: (group) => ["--group", group, "--root", options.nerveRoot],
forwardStderr: true,
onMessage: (_key, raw) => {
options.onWorkerMessage(raw);
});
child.on("exit", (code, signal) => {
const summary = formatChildExitSummary(code, signal ?? null);
},
onReady: (_key, msg) => {
options.onWorkerMessage(msg);
},
onCrashLimitReached: null,
onExit: (group, code, signal) => {
const sig =
signal === null || signal === undefined || signal === ""
? null
: (signal as NodeJS.Signals);
const summary = formatChildExitSummary(code, sig);
process.stderr.write(
`[kernel] worker for group "${group}" exited (${summary})${formatCapturedStderrTail(stderrTail.value)}\n`,
`[kernel] worker for group "${group}" exited (${summary})${formatCapturedStderrTail(runtime.stderrTail(group))}\n`,
);
workerReadyResolve?.();
if (!options.isStopped() && code !== 0) {
process.stderr.write(`[kernel] respawning worker for group "${group}" in 1s\n`);
options.onWorkerCrashed(group);
setTimeout(() => {
if (!options.isStopped()) {
startWorker(group);
}
}, 1000);
}
});
},
respawn: {
...SENSE_WORKER_RESPAWN,
allowRespawn: (_group) => !options.isStopped(),
},
shutdownTimeoutMs: 5000,
});
workers.set(group, { group, process: child });
return workerReady;
/** Groups we have ever started — mirrors legacy Map presence for `restartGroup` no-op when unknown. */
const trackedGroups = new Set<string>();
/** Marks groups mid-evict so `hasWorkerForGroup` drops immediately (legacy synchronous eviction). */
const evicting = new Set<string>();
async function startWorker(group: string): Promise<void> {
trackedGroups.add(group);
await runtime.start(group);
}
async function restartGroup(group: string): Promise<void> {
const entry = workers.get(group);
if (entry === undefined) return;
options.onBeforeGroupRestart(group);
sendShutdownToProcess(entry.process);
await waitForExit(entry.process, 5000);
if (!options.isStopped()) {
await startWorker(group);
if (!trackedGroups.has(group)) {
return;
}
options.onBeforeGroupRestart(group);
await runtime.drain(group, null);
}
function evictGroup(group: string): void {
const entry = workers.get(group);
if (entry === undefined) return;
sendShutdownToProcess(entry.process);
workers.delete(group);
trackedGroups.delete(group);
evicting.add(group);
void runtime.evict(group, null).finally(() => {
evicting.delete(group);
});
}
async function shutdownAll(): Promise<void> {
const exitPromises: Promise<void>[] = [];
for (const entry of workers.values()) {
sendShutdownToProcess(entry.process);
exitPromises.push(waitForExit(entry.process, 5000));
}
await Promise.all(exitPromises);
await runtime.shutdown();
trackedGroups.clear();
evicting.clear();
}
function sendCompute(group: string, senseName: string): void {
const entry = workers.get(group);
if (entry === undefined) return;
sendComputeToProcess(entry.process, senseName);
if (!trackedGroups.has(group) || evicting.has(group)) {
return;
}
// Legacy pool: `child.send` no-op when IPC is closed (still allow cold start: child === null).
if (runtime.hasDisconnectedChild(group)) {
return;
}
const msg: ComputeMessage = { type: "compute", sense: senseName };
if (!runtime.trySendSync(group, msg)) {
void runtime.send(group, msg).catch(() => {
// IPC channel may close between scheduling and send — same as legacy try/catch on child.send
});
}
}
function getWorkerPid(group: string): number | null {
return workers.get(group)?.process.pid ?? null;
return runtime.pid(group);
}
/** True once `startWorker` has been called for the group and it is not mid-evict (matches legacy Map key). */
function hasWorkerForGroup(group: string): boolean {
return workers.has(group);
return trackedGroups.has(group) && !evicting.has(group);
}
/** Count of sense groups with a worker slot (includes not-yet-ready), excluding evicted keys. */
function activeGroupCount(): number {
return workers.size;
return trackedGroups.size;
}
return {
+422
View File
@@ -0,0 +1,422 @@
/**
* Generic message-routed worker process manager (RFC-006).
* One forked Node child per key; cold start, crash respawn, drain/evict, shutdown.
*/
import { type ChildProcess, type Serializable, fork } from "node:child_process";
import { isPlainRecord } from "@uncaged/nerve-core";
const STDERR_TAIL_MAX_CHARS = 2048;
export type WorkerDrainOpts = {
shutdownTimeoutMs: number | null;
};
export type WorkerRuntimeConfig<K extends string> = {
script: string;
argsForKey: (key: K) => string[];
/** When false, stderr is not captured into `stderrTail` (e.g. tests without a pipe). */
forwardStderr: boolean;
onMessage: (key: K, msg: unknown) => void;
onReady: (key: K, msg: unknown) => void;
onExit: (key: K, code: number | null, signal: string | null) => void;
/** Invoked when automatic respawn is skipped because `maxCrashes` was exceeded in `windowMs`. */
onCrashLimitReached: ((key: K) => void) | null;
respawn: {
enabled: boolean;
maxCrashes: number;
windowMs: number;
delayMs: number;
/** When non-null, return false to skip automatic respawn after an unexpected exit. */
allowRespawn: ((key: K) => boolean) | null;
};
shutdownTimeoutMs: number;
};
export type WorkerRuntime<K extends string> = {
send: (key: K, msg: unknown) => Promise<void>;
/** When the worker is already ready and IPC-connected, sends synchronously (returns true). Otherwise false — caller may fall back to `send`. */
trySendSync: (key: K, msg: unknown) => boolean;
start: (key: K) => Promise<void>;
evict: (key: K, opts: WorkerDrainOpts | null) => Promise<void>;
drain: (key: K, opts: WorkerDrainOpts | null) => Promise<void>;
shutdown: () => Promise<void>;
has: (key: K) => boolean;
/** True when a child exists but IPC is disconnected (legacy pool skipped sends in this case). */
hasDisconnectedChild: (key: K) => boolean;
pid: (key: K) => number | null;
keys: () => K[];
stderrTail: (key: K) => string;
};
type WorkerMachineState = "stopped" | "starting" | "ready" | "draining";
type ReadyWaiter = {
resolve: () => void;
reject: (err: Error) => void;
};
/** Internal: one forked process slot (ManagedWorker). */
type WorkerSlot<K extends string> = {
key: K;
state: WorkerMachineState;
child: ChildProcess | null;
pid: number | null;
stderrTail: string;
crashTimestamps: number[];
expectExit: boolean;
readyWaiters: ReadyWaiter[];
opChain: Promise<void>;
};
function isReadyIpcMessage(raw: unknown): boolean {
return isPlainRecord(raw) && raw.type === "ready";
}
function signalToString(signal: NodeJS.Signals | null): string | null {
if (signal === null) {
return null;
}
return String(signal);
}
function attachStderrTail<K extends string>(child: ChildProcess, slot: WorkerSlot<K>): void {
const stream = child.stderr;
if (stream == null) {
return;
}
stream.setEncoding("utf8");
stream.on("data", (chunk: string | Buffer) => {
const text = typeof chunk === "string" ? chunk : chunk.toString("utf8");
slot.stderrTail = (slot.stderrTail + text).slice(-STDERR_TAIL_MAX_CHARS);
});
}
function enqueueOp<K extends string>(slot: WorkerSlot<K>, fn: () => Promise<void>): Promise<void> {
const run = slot.opChain.then(fn, fn);
slot.opChain = run.then(
() => {},
() => {},
);
return run;
}
function resolveReadyWaiters<K extends string>(slot: WorkerSlot<K>): void {
const waiters = slot.readyWaiters;
slot.readyWaiters = [];
for (const w of waiters) {
w.resolve();
}
}
function rejectReadyWaiters<K extends string>(slot: WorkerSlot<K>, err: Error): void {
const waiters = slot.readyWaiters;
slot.readyWaiters = [];
for (const w of waiters) {
w.reject(err);
}
}
function waitForReady<K extends string>(
slot: WorkerSlot<K>,
shutdownTimeoutMs: number,
): Promise<void> {
if (slot.state === "ready" && slot.child !== null && slot.child.connected) {
return Promise.resolve();
}
return new Promise((resolve, reject) => {
let settled = false;
const timer = setTimeout(() => {
if (!settled) {
settled = true;
reject(new Error(`Worker "${String(slot.key)}" ready timeout`));
}
}, shutdownTimeoutMs);
slot.readyWaiters.push({
resolve: () => {
if (settled) {
return;
}
settled = true;
clearTimeout(timer);
resolve();
},
reject: (err: Error) => {
if (settled) {
return;
}
settled = true;
clearTimeout(timer);
reject(err);
},
});
});
}
async function waitForChildExit(child: ChildProcess, timeoutMs: number): Promise<void> {
await new Promise<void>((resolve) => {
const timer = setTimeout(() => {
child.kill("SIGKILL");
}, timeoutMs);
child.once("exit", () => {
clearTimeout(timer);
resolve();
});
});
}
export function createWorkerRuntime<K extends string>(
config: WorkerRuntimeConfig<K>,
): WorkerRuntime<K> {
const workers = new Map<K, WorkerSlot<K>>();
function getOrCreateSlot(key: K): WorkerSlot<K> {
let slot = workers.get(key);
if (slot === undefined) {
slot = {
key,
state: "stopped",
child: null,
pid: null,
stderrTail: "",
crashTimestamps: [],
expectExit: false,
readyWaiters: [],
opChain: Promise.resolve(),
};
workers.set(key, slot);
}
return slot;
}
function handleWorkerMessage(slot: WorkerSlot<K>, msg: unknown): void {
if (isReadyIpcMessage(msg)) {
if (slot.state === "starting") {
slot.state = "ready";
config.onReady(slot.key, msg);
resolveReadyWaiters(slot);
}
return;
}
config.onMessage(slot.key, msg);
}
function onChildExit(
slot: WorkerSlot<K>,
code: number | null,
signal: NodeJS.Signals | null,
): void {
config.onExit(slot.key, code, signalToString(signal));
if (slot.child !== null) {
slot.child.removeAllListeners("message");
slot.child.removeAllListeners("exit");
}
const wasExpect = slot.expectExit;
slot.expectExit = false;
slot.child = null;
slot.pid = null;
if (wasExpect) {
slot.state = "stopped";
return;
}
rejectReadyWaiters(slot, new Error(`Worker "${String(slot.key)}" exited unexpectedly`));
slot.state = "stopped";
void enqueueOp(slot, async () => {
await handleUnexpectedCrashRecovery(slot);
});
}
function registerChild(slot: WorkerSlot<K>, child: ChildProcess): void {
slot.child = child;
slot.pid = child.pid ?? null;
if (config.forwardStderr) {
attachStderrTail(child, slot);
}
child.on("message", (msg: unknown) => {
handleWorkerMessage(slot, msg);
});
child.on("exit", (code, sig) => {
onChildExit(slot, code, sig ?? null);
});
}
async function forkAndWaitReady(slot: WorkerSlot<K>): Promise<void> {
if (slot.state === "ready" && slot.child !== null && slot.child.connected) {
return;
}
slot.state = "starting";
let child: ChildProcess;
try {
child = fork(config.script, config.argsForKey(slot.key), {
stdio: ["ignore", "inherit", "pipe", "ipc"],
env: process.env,
});
} catch (e) {
slot.state = "stopped";
const err = e instanceof Error ? e : new Error(String(e));
rejectReadyWaiters(slot, err);
throw err;
}
registerChild(slot, child);
await waitForReady(slot, config.shutdownTimeoutMs);
}
function resolveShutdownTimeoutMs(opts: WorkerDrainOpts | null): number {
if (opts !== null && opts.shutdownTimeoutMs !== null) {
return opts.shutdownTimeoutMs;
}
return config.shutdownTimeoutMs;
}
async function gracefulStop(slot: WorkerSlot<K>, shutdownTimeoutMs: number): Promise<void> {
if (slot.child === null) {
return;
}
slot.expectExit = true;
slot.state = "draining";
const child = slot.child;
try {
child.send({ type: "shutdown" });
} catch {
// IPC channel may have closed between null-check and send
}
await waitForChildExit(child, shutdownTimeoutMs);
}
async function handleUnexpectedCrashRecovery(slot: WorkerSlot<K>): Promise<void> {
if (!config.respawn.enabled) {
return;
}
if (config.respawn.allowRespawn !== null && !config.respawn.allowRespawn(slot.key)) {
return;
}
const now = Date.now();
slot.crashTimestamps.push(now);
slot.crashTimestamps = slot.crashTimestamps.filter((t) => now - t <= config.respawn.windowMs);
if (slot.crashTimestamps.length >= config.respawn.maxCrashes) {
console.error(
`[WorkerRuntime] worker "${String(slot.key)}" exceeded crash limit (${String(config.respawn.maxCrashes)} in ${String(config.respawn.windowMs)}ms); not respawning`,
);
if (config.onCrashLimitReached !== null) {
config.onCrashLimitReached(slot.key);
}
return;
}
await new Promise<void>((resolve) => setTimeout(resolve, config.respawn.delayMs));
await forkAndWaitReady(slot);
}
async function shutdownWorker(slot: WorkerSlot<K>): Promise<void> {
await gracefulStop(slot, config.shutdownTimeoutMs);
workers.delete(slot.key);
}
function isActive(slot: WorkerSlot<K>): boolean {
return slot.state === "ready" && slot.child !== null && slot.child.connected;
}
return {
send: async (key: K, msg: unknown) => {
const slot = getOrCreateSlot(key);
await enqueueOp(slot, async () => {
await forkAndWaitReady(slot);
const child = slot.child;
if (child === null || !child.connected) {
throw new Error(`Worker "${String(key)}" is not connected`);
}
child.send(msg as Serializable);
});
},
trySendSync: (key: K, msg: unknown): boolean => {
const slot = workers.get(key);
if (slot === undefined || !isActive(slot)) {
return false;
}
const child = slot.child;
if (child === null || !child.connected) {
return false;
}
try {
child.send(msg as Serializable);
return true;
} catch {
return false;
}
},
start: async (key: K) => {
const slot = getOrCreateSlot(key);
await enqueueOp(slot, async () => {
await forkAndWaitReady(slot);
});
},
evict: async (key: K, opts: WorkerDrainOpts | null) => {
const slot = getOrCreateSlot(key);
const shutdownMs = resolveShutdownTimeoutMs(opts);
await enqueueOp(slot, async () => {
await gracefulStop(slot, shutdownMs);
workers.delete(key);
});
},
drain: async (key: K, opts: WorkerDrainOpts | null) => {
const slot = getOrCreateSlot(key);
const shutdownMs = resolveShutdownTimeoutMs(opts);
await enqueueOp(slot, async () => {
if (slot.child === null) {
await forkAndWaitReady(slot);
return;
}
await gracefulStop(slot, shutdownMs);
await forkAndWaitReady(slot);
});
},
shutdown: async () => {
const snapshot = [...workers.values()];
await Promise.all(snapshot.map((slot) => enqueueOp(slot, () => shutdownWorker(slot))));
},
has: (key: K) => {
const slot = workers.get(key);
return slot !== undefined && isActive(slot);
},
hasDisconnectedChild: (key: K): boolean => {
const slot = workers.get(key);
if (slot === undefined || slot.child === null) {
return false;
}
return !slot.child.connected;
},
pid: (key: K) => {
const slot = workers.get(key);
if (slot === undefined || !isActive(slot) || slot.pid === null) {
return null;
}
return slot.pid;
},
keys: () => [...workers.values()].filter((slot) => isActive(slot)).map((slot) => slot.key),
stderrTail: (key: K) => {
const slot = workers.get(key);
return slot === undefined ? "" : slot.stderrTail;
},
};
}
@@ -0,0 +1,256 @@
/**
* Pure helpers and IPC branching for workflow-manager (keeps workflow-manager.ts lean).
*/
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import type { WorkflowMessage } from "@uncaged/nerve-core";
import { START, isPlainRecord } from "@uncaged/nerve-core";
import type { LogStore, WorkflowRunStatus } from "@uncaged/nerve-store";
import type { ResumeThreadMessage, ThreadEventMessage } from "./ipc.js";
import type { WorkerToParentMessage } from "./ipc.js";
export type PendingThread = {
runId: string;
prompt: string;
maxRounds: number;
dryRun: boolean;
};
export type WorkflowState = {
active: Set<string>;
queue: PendingThread[];
};
/** Matches legacy manager: 6 crashes within 60s stops respawn (was `length > 5`). */
export const WORKFLOW_WORKER_RESPAWN = {
enabled: true,
maxCrashes: 6,
windowMs: 60_000,
delayMs: 0,
} as const;
/**
* Worker shutdown timeout — must stay in sync with SHUTDOWN_TIMEOUT_MS in workflow-worker.ts.
* The drain timeout passed to drainAndRespawn must be >= this value so the worker has
* enough time to finish in-flight threads before the parent force-kills it.
*/
export const WORKER_SHUTDOWN_TIMEOUT_MS = 10_000;
export const DEFAULT_MAX_QUEUE = 100;
export function readLaunchFromTriggerPayload(
raw: unknown,
engineDefaultMaxRounds: number,
): { prompt: string; maxRounds: number; dryRun: boolean } {
if (isPlainRecord(raw)) {
const o = raw;
if (typeof o.prompt === "string" && typeof o.maxRounds === "number") {
const dryRun = typeof o.dryRun === "boolean" ? o.dryRun : false;
return { prompt: o.prompt, maxRounds: o.maxRounds, dryRun };
}
}
return { prompt: "", maxRounds: engineDefaultMaxRounds, dryRun: false };
}
export function ensureThreadMessagesWithStart(
messages: Array<{ role: string; content: string; meta: unknown; timestamp: number }>,
threadId: string,
fallbackPrompt: string,
fallbackMaxRounds: number,
): WorkflowMessage[] {
const mapped: WorkflowMessage[] = messages.map((m) => ({
role: m.role,
content: m.content,
meta: m.meta,
timestamp: m.timestamp,
}));
if (mapped.length > 0 && mapped[0].role === START) {
return mapped;
}
const start: WorkflowMessage = {
role: START,
content: fallbackPrompt,
meta: { maxRounds: fallbackMaxRounds, threadId },
timestamp: Date.now(),
};
return [start, ...mapped];
}
export function resolveWorkflowWorkerScript(): string {
const __filename = fileURLToPath(import.meta.url);
const __dir = dirname(__filename);
return join(__dir, "workflow-worker.js");
}
export function mapWorkflowRunStatus(eventType: string): WorkflowRunStatus | null {
const map: Record<string, WorkflowRunStatus> = {
started: "started",
queued: "queued",
completed: "completed",
failed: "failed",
crashed: "crashed",
dropped: "dropped",
interrupted: "interrupted",
killed: "killed",
};
return map[eventType] ?? null;
}
export function extractExitCodeFromPayload(payload: unknown): number | null {
if (isPlainRecord(payload) && typeof payload.exitCode === "number") {
return payload.exitCode;
}
return null;
}
export function appendWorkflowRunLog(
logStore: LogStore,
workflowName: string,
runId: string,
eventType: string,
payload: unknown | undefined,
exitCode: number | null,
): void {
const timestamp = Date.now();
const serialised = payload !== undefined ? JSON.stringify(payload) : null;
const status = mapWorkflowRunStatus(eventType);
if (status !== null) {
logStore.upsertWorkflowRun(
{
source: "workflow",
type: eventType,
refId: runId,
payload: serialised,
timestamp,
},
{ runId, workflow: workflowName, status, timestamp, exitCode },
);
} else {
logStore.append({
source: "workflow",
type: eventType,
refId: runId,
payload: serialised,
timestamp,
});
}
}
export function recoverQueuedRun(
workflowName: string,
runId: string,
state: WorkflowState,
logStore: LogStore,
engineMaxRounds: number,
): void {
if (state.queue.some((q) => q.runId === runId)) return;
const launch = readLaunchFromTriggerPayload(logStore.getTriggerPayload(runId), engineMaxRounds);
state.queue.push({
runId,
prompt: launch.prompt,
maxRounds: launch.maxRounds,
dryRun: launch.dryRun,
});
process.stderr.write(
`[workflow-manager] crash-recovery: re-queued thread "${runId}" for "${workflowName}"\n`,
);
}
export function recoverStartedRun(
workflowName: string,
runId: string,
state: WorkflowState,
logStore: LogStore,
engineMaxRounds: number,
sendResume: (wf: string, msg: ResumeThreadMessage) => void,
): void {
if (state.active.has(runId)) return;
const rawMessages = logStore.getThreadMessages(runId);
const launch = readLaunchFromTriggerPayload(logStore.getTriggerPayload(runId), engineMaxRounds);
const messages = ensureThreadMessagesWithStart(
rawMessages,
runId,
launch.prompt,
launch.maxRounds,
);
state.active.add(runId);
const msg: ResumeThreadMessage = {
type: "resume-thread",
runId,
messages,
maxRounds: launch.maxRounds,
dryRun: launch.dryRun,
};
sendResume(workflowName, msg);
process.stderr.write(
`[workflow-manager] crash-recovery: resuming thread "${runId}" for "${workflowName}" (${String(messages.length)} messages)\n`,
);
}
export function recoverThreadsFromStore(
workflowName: string,
logStore: LogStore,
engineMaxRounds: number,
getOrCreateState: (name: string) => WorkflowState,
sendResume: (wf: string, msg: ResumeThreadMessage) => void,
): void {
const activeRuns = logStore.getActiveWorkflowRuns(workflowName);
const state = getOrCreateState(workflowName);
for (const run of activeRuns) {
if (run.status === "queued") {
recoverQueuedRun(workflowName, run.runId, state, logStore, engineMaxRounds);
} else if (run.status === "started") {
recoverStartedRun(workflowName, run.runId, state, logStore, engineMaxRounds, sendResume);
}
}
}
export type WorkflowManagerMessageDeps = {
logStore: LogStore;
handleThreadEvent: (workflowName: string, msg: ThreadEventMessage) => void;
onWorkflowRoleError: (
workflowName: string,
runId: string,
error: string,
exitCode: number,
) => void;
};
export function dispatchWorkflowWorkerMessage(
workflowName: string,
msg: WorkerToParentMessage,
deps: WorkflowManagerMessageDeps,
): void {
if (msg.type === "thread-event") {
deps.handleThreadEvent(workflowName, msg);
return;
}
if (msg.type === "thread-workflow-message") {
deps.logStore.append({
source: "workflow",
type: "thread_workflow_message",
refId: msg.runId,
payload: JSON.stringify(msg.message),
timestamp: Date.now(),
});
return;
}
if (msg.type === "workflow-error") {
process.stderr.write(
`[workflow-manager] workflow-error for runId "${msg.runId}" in "${workflowName}": ${msg.error}\n`,
);
deps.onWorkflowRoleError(workflowName, msg.runId, msg.error, msg.exitCode);
return;
}
if (msg.type === "error") {
process.stderr.write(`[workflow-manager] error from "${workflowName}" worker: ${msg.error}\n`);
}
}
+174 -447
View File
@@ -6,33 +6,24 @@
* Concurrency and overflow (drop/queue) are enforced here in the parent process.
*/
import { fork } from "node:child_process";
import type { ChildProcess } from "node:child_process";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import type { NerveConfig, WorkflowConfig, WorkflowStatus } from "@uncaged/nerve-core";
import type {
NerveConfig,
WorkflowConfig,
WorkflowMessage,
WorkflowStatus,
} from "@uncaged/nerve-core";
import { START, isPlainRecord } from "@uncaged/nerve-core";
import type { LogStore, WorkflowRunStatus } from "@uncaged/nerve-store";
import type {
KillThreadMessage,
ResumeThreadMessage,
ShutdownMessage,
StartThreadMessage,
ThreadEventMessage,
} from "./ipc.js";
import type { LogStore } from "@uncaged/nerve-store";
import type { KillThreadMessage, StartThreadMessage, ThreadEventMessage } from "./ipc.js";
import { parseWorkerMessage } from "./ipc.js";
import { formatCapturedStderrTail, formatChildExitSummary } from "./worker-fork-support.js";
import { createWorkerRuntime } from "./worker-runtime.js";
import {
formatCapturedStderrTail,
formatChildExitSummary,
teeCapturedStderr,
} from "./worker-fork-support.js";
DEFAULT_MAX_QUEUE,
WORKER_SHUTDOWN_TIMEOUT_MS,
WORKFLOW_WORKER_RESPAWN,
type WorkflowState,
appendWorkflowRunLog,
dispatchWorkflowWorkerMessage,
extractExitCodeFromPayload,
recoverThreadsFromStore,
resolveWorkflowWorkerScript,
} from "./workflow-manager-support.js";
export type WorkflowLaunchParams = {
prompt: string;
@@ -74,169 +65,109 @@ export type WorkflowManager = {
stop: () => Promise<void>;
};
type PendingThread = {
runId: string;
prompt: string;
maxRounds: number;
dryRun: boolean;
};
type WorkflowState = {
active: Set<string>;
queue: PendingThread[];
};
type WorkerEntry = {
workflowName: string;
process: ChildProcess;
stopping: boolean;
/** When set, the worker is draining before a hot-reload respawn. */
draining: boolean;
stderrTail: { value: string };
};
// Crash respawn backoff: track crash timestamps per workflow.
const MAX_CRASHES_IN_WINDOW = 5;
const CRASH_WINDOW_MS = 60_000;
/**
* Worker shutdown timeout — must stay in sync with SHUTDOWN_TIMEOUT_MS in workflow-worker.ts.
* The drain timeout passed to drainAndRespawn must be >= this value so the worker has
* enough time to finish in-flight threads before the parent force-kills it.
*/
const WORKER_SHUTDOWN_TIMEOUT_MS = 10_000;
const DEFAULT_MAX_QUEUE = 100;
function readLaunchFromTriggerPayload(
raw: unknown,
engineDefaultMaxRounds: number,
): { prompt: string; maxRounds: number; dryRun: boolean } {
if (isPlainRecord(raw)) {
const o = raw;
if (typeof o.prompt === "string" && typeof o.maxRounds === "number") {
const dryRun = typeof o.dryRun === "boolean" ? o.dryRun : false;
return { prompt: o.prompt, maxRounds: o.maxRounds, dryRun };
}
}
return { prompt: "", maxRounds: engineDefaultMaxRounds, dryRun: false };
}
function ensureThreadMessagesWithStart(
messages: Array<{ role: string; content: string; meta: unknown; timestamp: number }>,
threadId: string,
fallbackPrompt: string,
fallbackMaxRounds: number,
): WorkflowMessage[] {
const mapped: WorkflowMessage[] = messages.map((m) => ({
role: m.role,
content: m.content,
meta: m.meta,
timestamp: m.timestamp,
}));
if (mapped.length > 0 && mapped[0].role === START) {
return mapped;
}
const start: WorkflowMessage = {
role: START,
content: fallbackPrompt,
meta: { maxRounds: fallbackMaxRounds, threadId },
timestamp: Date.now(),
};
return [start, ...mapped];
}
function resolveWorkerScript(): string {
const __filename = fileURLToPath(import.meta.url);
const __dir = dirname(__filename);
return join(__dir, "workflow-worker.js");
}
function spawnWorkflowWorker(
nerveRoot: string,
workflowName: string,
workerScript: string,
stderrTail: { value: string },
): ChildProcess {
const child = fork(workerScript, ["--workflow", workflowName, "--root", nerveRoot], {
stdio: ["ignore", "inherit", "pipe", "ipc"],
});
teeCapturedStderr(child, stderrTail);
// Prevent unhandled EPIPE when writing to a child whose IPC channel closed
child.on("error", (err) => {
if ((err as NodeJS.ErrnoException).code !== "EPIPE") {
console.error("[worker] error:", err.message);
}
});
return child;
}
function sendStartThread(worker: ChildProcess, msg: StartThreadMessage): void {
if (worker.connected === false) return;
try {
worker.send(msg);
} catch {
// IPC channel closed between connected check and send
}
}
function sendShutdown(worker: ChildProcess, entry: WorkerEntry): void {
entry.stopping = true;
if (worker.connected === false) return;
const msg: ShutdownMessage = { type: "shutdown" };
try {
worker.send(msg);
} catch {
// IPC channel closed between connected check and send
}
}
function sendResumeThread(worker: ChildProcess, msg: ResumeThreadMessage): void {
if (worker.connected === false) return;
try {
worker.send(msg);
} catch {
// IPC channel closed between connected check and send
}
}
function sendKillThread(worker: ChildProcess, runId: string): void {
if (worker.connected === false) return;
const msg: KillThreadMessage = { type: "kill-thread", runId };
try {
worker.send(msg);
} catch {
// IPC channel closed between connected check and send
}
}
function waitForExit(child: ChildProcess, timeoutMs: number): Promise<void> {
return new Promise((resolve) => {
const timer = setTimeout(() => {
child.kill("SIGKILL");
resolve();
}, timeoutMs);
child.once("exit", () => {
clearTimeout(timer);
resolve();
});
});
}
export function createWorkflowManager(
nerveRoot: string,
initialConfig: NerveConfig,
logStore: LogStore,
): WorkflowManager {
const workerScript = resolveWorkerScript();
const workerScript = resolveWorkflowWorkerScript();
/**
* Default drain timeout must be at least WORKER_SHUTDOWN_TIMEOUT_MS so the worker
* has enough time to finish in-flight threads before the parent force-kills it.
*/
const DEFAULT_DRAIN_TIMEOUT_MS = Math.max(30_000, WORKER_SHUTDOWN_TIMEOUT_MS + 5_000);
const states = new Map<string, WorkflowState>();
const workers = new Map<string, WorkerEntry>();
const crashTimestamps = new Map<string, number[]>();
const trackedWorkflows = new Set<string>();
const hotReloadEvicting = new Set<string>();
const crashRecoveryPending = new Set<string>();
const crashLimitBlocked = new Set<string>();
let stopped = false;
let config = initialConfig;
const pendingDrains = new Set<string>();
function logWorkflowEvent(
workflowName: string,
runId: string,
eventType: string,
payload: unknown | null = null,
exitCode: number | null = null,
): void {
appendWorkflowRunLog(logStore, workflowName, runId, eventType, payload, exitCode);
}
const runtime = createWorkerRuntime<string>({
script: workerScript,
argsForKey: (workflowName) => ["--workflow", workflowName, "--root", nerveRoot],
forwardStderr: true,
onMessage: (workflowName, raw) => {
handleWorkerMessage(workflowName, raw);
},
onReady: (workflowName, _msg) => {
if (crashRecoveryPending.has(workflowName)) {
crashRecoveryPending.delete(workflowName);
recoverThreadsFromStore(
workflowName,
logStore,
config.maxRounds,
getOrCreateState,
(wf, msg) => {
sendToWorker(wf, msg);
},
);
}
},
onExit: (workflowName, code, signalStr) => {
const sig =
signalStr === null || signalStr === undefined || signalStr === ""
? null
: (signalStr as NodeJS.Signals);
if (hotReloadEvicting.has(workflowName)) {
hotReloadEvicting.delete(workflowName);
markActiveRunsInterrupted(workflowName);
if (!stopped && workflowConfig(workflowName) !== null) {
process.stderr.write(
`[workflow-manager] worker for "${workflowName}" drained, respawning\n`,
);
}
return;
}
if (stopped) {
const state = states.get(workflowName);
if (state !== undefined) {
state.active.clear();
}
crashRecoveryPending.delete(workflowName);
return;
}
const summary = formatChildExitSummary(code, sig);
const stderrExtra = formatCapturedStderrTail(runtime.stderrTail(workflowName));
process.stderr.write(
`[workflow-manager] worker for "${workflowName}" exited (${summary})${stderrExtra}\n`,
);
cleanupAfterUnexpectedWorkerExit(workflowName);
crashRecoveryPending.add(workflowName);
},
onCrashLimitReached: (workflowName) => {
crashRecoveryPending.delete(workflowName);
trackedWorkflows.delete(workflowName);
crashLimitBlocked.add(workflowName);
process.stderr.write(
`[workflow-manager] worker for "${workflowName}" exceeded crash limit (${String(WORKFLOW_WORKER_RESPAWN.maxCrashes)} in ${String(WORKFLOW_WORKER_RESPAWN.windowMs)}ms) — stopping respawn\n`,
);
},
respawn: {
...WORKFLOW_WORKER_RESPAWN,
allowRespawn: (_wf) => !stopped,
},
shutdownTimeoutMs: DEFAULT_DRAIN_TIMEOUT_MS,
});
function getOrCreateState(workflowName: string): WorkflowState {
let state = states.get(workflowName);
if (state === undefined) {
@@ -250,60 +181,38 @@ export function createWorkflowManager(
return config.workflows[workflowName] ?? null;
}
function toWorkflowRunStatus(eventType: string): WorkflowRunStatus | null {
const map: Record<string, WorkflowRunStatus> = {
started: "started",
queued: "queued",
completed: "completed",
failed: "failed",
crashed: "crashed",
dropped: "dropped",
interrupted: "interrupted",
killed: "killed",
};
return map[eventType] ?? null;
}
function extractExitCode(payload: unknown): number | null {
if (isPlainRecord(payload) && typeof payload.exitCode === "number") {
return payload.exitCode;
/** IPC send — matches legacy pool: no-op when IPC is disconnected; cold-start via WorkerRuntime.send. */
function sendToWorker(workflowName: string, msg: unknown): void {
if (crashLimitBlocked.has(workflowName)) {
return;
}
return null;
}
function logWorkflowEvent(
workflowName: string,
runId: string,
eventType: string,
payload?: unknown,
exitCode: number | null = null,
): void {
const timestamp = Date.now();
const serialised = payload !== undefined ? JSON.stringify(payload) : null;
const status = toWorkflowRunStatus(eventType);
if (status !== null) {
logStore.upsertWorkflowRun(
{
source: "workflow",
type: eventType,
refId: runId,
payload: serialised,
timestamp,
},
{ runId, workflow: workflowName, status, timestamp, exitCode },
);
} else {
logStore.append({
source: "workflow",
type: eventType,
refId: runId,
payload: serialised,
timestamp,
trackedWorkflows.add(workflowName);
if (runtime.hasDisconnectedChild(workflowName)) {
return;
}
if (!runtime.trySendSync(workflowName, msg)) {
void runtime.send(workflowName, msg).catch(() => {
// IPC channel closed — mark any thread from this message as failed
if (isStartThreadMsg(msg)) {
const state = states.get(workflowName);
if (state?.active.has(msg.runId)) {
state.active.delete(msg.runId);
logWorkflowEvent(workflowName, msg.runId, "failed", { error: "IPC channel closed" }, 1);
dequeueNext(workflowName);
}
}
});
}
}
function isStartThreadMsg(msg: unknown): msg is StartThreadMessage {
return (
msg !== null &&
typeof msg === "object" &&
(msg as Record<string, unknown>).type === "start-thread"
);
}
function dispatchThread(
workflowName: string,
runId: string,
@@ -314,7 +223,6 @@ export function createWorkflowManager(
const state = getOrCreateState(workflowName);
state.active.add(runId);
const worker = getOrSpawnWorker(workflowName);
const msg: StartThreadMessage = {
type: "start-thread",
runId,
@@ -323,7 +231,7 @@ export function createWorkflowManager(
maxRounds,
dryRun,
};
sendStartThread(worker.process, msg);
sendToWorker(workflowName, msg);
logWorkflowEvent(workflowName, runId, "started", { prompt, maxRounds, dryRun });
}
@@ -367,92 +275,20 @@ export function createWorkflowManager(
if (msg.eventType === "completed" || msg.eventType === "failed" || msg.eventType === "killed") {
state.active.delete(msg.runId);
dequeueNext(workflowName);
const exitCode = extractExitCode(msg.payload);
const exitCode = extractExitCodeFromPayload(msg.payload);
logWorkflowEvent(workflowName, msg.runId, msg.eventType, msg.payload, exitCode);
maybeDeferredHotReloadDrain(workflowName);
}
}
function recoverQueuedRun(workflowName: string, runId: string, state: WorkflowState): void {
if (state.queue.some((q) => q.runId === runId)) return;
const launch = readLaunchFromTriggerPayload(
logStore.getTriggerPayload(runId),
config.maxRounds,
);
state.queue.push({
runId,
prompt: launch.prompt,
maxRounds: launch.maxRounds,
dryRun: launch.dryRun,
});
process.stderr.write(
`[workflow-manager] crash-recovery: re-queued thread "${runId}" for "${workflowName}"\n`,
);
}
function recoverStartedRun(
workflowName: string,
runId: string,
state: WorkflowState,
worker: WorkerEntry,
): void {
if (state.active.has(runId)) return;
const rawMessages = logStore.getThreadMessages(runId);
const launch = readLaunchFromTriggerPayload(
logStore.getTriggerPayload(runId),
config.maxRounds,
);
const messages = ensureThreadMessagesWithStart(
rawMessages,
runId,
launch.prompt,
launch.maxRounds,
);
state.active.add(runId);
const msg: ResumeThreadMessage = {
type: "resume-thread",
runId,
messages,
maxRounds: launch.maxRounds,
dryRun: launch.dryRun,
};
sendResumeThread(worker.process, msg);
process.stderr.write(
`[workflow-manager] crash-recovery: resuming thread "${runId}" for "${workflowName}" (${messages.length} messages)\n`,
);
}
function recoverThreadsForWorker(workflowName: string, worker: WorkerEntry): void {
const activeRuns = logStore.getActiveWorkflowRuns(workflowName);
const state = getOrCreateState(workflowName);
for (const run of activeRuns) {
if (run.status === "queued") {
recoverQueuedRun(workflowName, run.runId, state);
} else if (run.status === "started") {
recoverStartedRun(workflowName, run.runId, state, worker);
}
}
}
function recordCrashAndCheckLimit(workflowName: string): boolean {
const now = Date.now();
const timestamps = (crashTimestamps.get(workflowName) ?? []).filter(
(t) => now - t < CRASH_WINDOW_MS,
);
timestamps.push(now);
crashTimestamps.set(workflowName, timestamps);
return timestamps.length > MAX_CRASHES_IN_WINDOW;
}
function handleWorkerCrash(workflowName: string): void {
function cleanupAfterUnexpectedWorkerExit(workflowName: string): void {
const state = states.get(workflowName);
if (state === undefined) return;
const crashedCount = state.active.size;
if (crashedCount > 0) {
process.stderr.write(
`[workflow-manager] worker for "${workflowName}" crashed with ${crashedCount} active thread(s)\n`,
`[workflow-manager] worker for "${workflowName}" crashed with ${String(crashedCount)} active thread(s)\n`,
);
for (const runId of state.active) {
logWorkflowEvent(workflowName, runId, "crashed", undefined, 255);
@@ -460,26 +296,13 @@ export function createWorkflowManager(
}
state.active.clear();
workers.delete(workflowName);
pendingDrains.delete(workflowName);
if (stopped || workflowConfig(workflowName) === null) return;
if (recordCrashAndCheckLimit(workflowName)) {
const count = crashTimestamps.get(workflowName)?.length ?? 0;
if (!stopped && !crashLimitBlocked.has(workflowName) && workflowConfig(workflowName) !== null) {
process.stderr.write(
`[workflow-manager] worker for "${workflowName}" crashed ${count} times in ${CRASH_WINDOW_MS}ms — stopping respawn\n`,
`[workflow-manager] respawning worker for "${workflowName}" after crash\n`,
);
return;
}
process.stderr.write(
`[workflow-manager] respawning worker for "${workflowName}" after crash\n`,
);
const newWorker = getOrSpawnWorker(workflowName);
setImmediate(() => {
recoverThreadsForWorker(workflowName, newWorker);
});
}
function handleWorkerMessage(workflowName: string, raw: unknown): void {
@@ -490,43 +313,19 @@ export function createWorkflowManager(
);
return;
}
const msg = result.value;
if (msg.type === "thread-event") {
handleThreadEvent(workflowName, msg);
return;
}
if (msg.type === "thread-workflow-message") {
logStore.append({
source: "workflow",
type: "thread_workflow_message",
refId: msg.runId,
payload: JSON.stringify(msg.message),
timestamp: Date.now(),
});
return;
}
if (msg.type === "workflow-error") {
process.stderr.write(
`[workflow-manager] workflow-error for runId "${msg.runId}" in "${workflowName}": ${msg.error}\n`,
);
const state = states.get(workflowName);
if (state !== undefined) {
state.active.delete(msg.runId);
dequeueNext(workflowName);
}
logWorkflowEvent(workflowName, msg.runId, "failed", { error: msg.error }, msg.exitCode);
maybeDeferredHotReloadDrain(workflowName);
return;
}
if (msg.type === "error") {
process.stderr.write(
`[workflow-manager] error from "${workflowName}" worker: ${msg.error}\n`,
);
}
dispatchWorkflowWorkerMessage(workflowName, result.value, {
logStore,
handleThreadEvent,
onWorkflowRoleError: (wf, runId, error, exitCode) => {
const state = states.get(wf);
if (state !== undefined) {
state.active.delete(runId);
dequeueNext(wf);
}
logWorkflowEvent(wf, runId, "failed", { error }, exitCode);
maybeDeferredHotReloadDrain(wf);
},
});
}
function markActiveRunsInterrupted(workflowName: string): void {
@@ -538,67 +337,6 @@ export function createWorkflowManager(
state.active.clear();
}
function handleWorkerExit(
workflowName: string,
code: number | null,
signal: NodeJS.Signals | null,
): void {
const entry = workers.get(workflowName);
if (entry?.draining) {
workers.delete(workflowName);
markActiveRunsInterrupted(workflowName);
if (!stopped && workflowConfig(workflowName) !== null) {
process.stderr.write(
`[workflow-manager] worker for "${workflowName}" drained, respawning\n`,
);
getOrSpawnWorker(workflowName);
}
return;
}
if (entry?.stopping) {
workers.delete(workflowName);
const state = states.get(workflowName);
if (state !== undefined) {
state.active.clear();
}
return;
}
const summary = formatChildExitSummary(code, signal);
const stderrExtra = entry !== undefined ? formatCapturedStderrTail(entry.stderrTail.value) : "";
process.stderr.write(
`[workflow-manager] worker for "${workflowName}" exited (${summary})${stderrExtra}\n`,
);
handleWorkerCrash(workflowName);
}
function getOrSpawnWorker(workflowName: string): WorkerEntry {
const existing = workers.get(workflowName);
if (existing !== undefined && existing.process.exitCode === null) {
return existing;
}
const stderrTail = { value: "" };
const child = spawnWorkflowWorker(nerveRoot, workflowName, workerScript, stderrTail);
child.on("message", (raw: unknown) => {
handleWorkerMessage(workflowName, raw);
});
child.on("exit", (code, signal) => {
handleWorkerExit(workflowName, code, signal ?? null);
});
const entry: WorkerEntry = {
workflowName,
process: child,
stopping: false,
draining: false,
stderrTail,
};
workers.set(workflowName, entry);
return entry;
}
function killThread(runId: string): boolean {
for (const [workflowName, state] of states) {
const queueIdx = state.queue.findIndex((q) => q.runId === runId);
@@ -609,10 +347,8 @@ export function createWorkflowManager(
}
if (state.active.has(runId)) {
const workerEntry = workers.get(workflowName);
if (workerEntry !== undefined) {
sendKillThread(workerEntry.process, runId);
}
const msg: KillThreadMessage = { type: "kill-thread", runId };
sendToWorker(workflowName, msg);
return true;
}
}
@@ -663,7 +399,7 @@ export function createWorkflowManager(
state.queue.push({ runId, prompt, maxRounds, dryRun });
logWorkflowEvent(workflowName, runId, "queued");
process.stderr.write(
`[workflow-manager] queued thread for "${workflowName}" runId "${runId}" (queue length: ${state.queue.length})\n`,
`[workflow-manager] queued thread for "${workflowName}" runId "${runId}" (queue length: ${String(state.queue.length)})\n`,
);
}
@@ -707,35 +443,29 @@ export function createWorkflowManager(
config = newConfig;
}
/**
* Default drain timeout must be at least WORKER_SHUTDOWN_TIMEOUT_MS so the worker
* has enough time to finish in-flight threads before the parent force-kills it.
*/
const DEFAULT_DRAIN_TIMEOUT_MS = Math.max(30_000, WORKER_SHUTDOWN_TIMEOUT_MS + 5_000);
async function drainAndRespawn(
workflowName: string,
drainTimeoutMs: number = DEFAULT_DRAIN_TIMEOUT_MS,
): Promise<void> {
const entry = workers.get(workflowName);
if (entry === undefined) {
// No active worker — nothing to drain
if (!trackedWorkflows.has(workflowName)) {
return;
}
entry.draining = true;
// Send shutdown without setting stopping=true (so the exit handler uses the draining branch)
if (entry.process.connected) {
const msg: ShutdownMessage = { type: "shutdown" };
try {
entry.process.send(msg);
} catch {
// IPC closed
const shutdownMs = Math.max(drainTimeoutMs, WORKER_SHUTDOWN_TIMEOUT_MS);
hotReloadEvicting.add(workflowName);
try {
await runtime.evict(workflowName, { shutdownTimeoutMs: shutdownMs });
trackedWorkflows.delete(workflowName);
if (!stopped && workflowConfig(workflowName) !== null) {
trackedWorkflows.add(workflowName);
await runtime.start(workflowName);
}
} finally {
hotReloadEvicting.delete(workflowName);
}
await waitForExit(entry.process, drainTimeoutMs);
// The exit handler (draining branch) will respawn the worker automatically
}
function drainWhenIdle(workflowName: string): void {
const state = states.get(workflowName);
const hasActiveRuns = state !== undefined && state.active.size > 0;
@@ -761,20 +491,17 @@ export function createWorkflowManager(
pendingDrains.add(workflowName);
process.stderr.write(
`[workflow-manager] deferring hot-reload for "${workflowName}" until ${state.active.size} active run(s) complete\n`,
`[workflow-manager] deferring hot-reload for "${workflowName}" until ${String(state.active.size)} active run(s) complete\n`,
);
}
async function stop(): Promise<void> {
stopped = true;
pendingDrains.clear();
const exitPromises: Promise<void>[] = [];
for (const entry of workers.values()) {
sendShutdown(entry.process, entry);
exitPromises.push(waitForExit(entry.process, 5000));
}
await Promise.all(exitPromises);
workers.clear();
hotReloadEvicting.clear();
crashRecoveryPending.clear();
await runtime.shutdown();
trackedWorkflows.clear();
}
return {
@@ -10,7 +10,7 @@ export type CoderMeta = z.infer<typeof coderMetaSchema>;
export function coderPrompt({ threadId }: { threadId: string }): string {
return `Read the workflow thread for the planner's sense design and any tester feedback: \`nerve thread ${threadId}\`
Read the nerve-dev skill for sense file structure and conventions: \`cat node_modules/@uncaged/nerve-skills/nerve-dev/SKILL.md\`
Read \`cat AGENT.md\` from the repository root, then \`CONVENTIONS.md\` and \`.knowledge/sense.md\` if present.
## Your task
@@ -20,21 +20,21 @@ Implement (or fix) the sense the planner designed. If there is tester feedback i
You do NOT need to finish everything in one pass. You may return \`done: false\` to continue in the next iteration.
## File structure for each sense
## File structure for each sense (flat workspace)
- \`senses/<name>/src/index.ts\` — TypeScript compute source; import schema as \`./schema.ts\`
The workspace has **one root** \`package.json\` and root \`scripts/build.mjs\` (or equivalent) that bundles all senses. There is **no** per-sense \`package.json\`. Bundled output is \`dist/senses/<name>/index.js\` after a root build.
- \`senses/<name>/src/index.ts\` — compute entry; import schema as \`./schema.ts\`
- \`senses/<name>/src/schema.ts\` — Drizzle schema (TypeScript)
- \`senses/<name>/migrations/\`Drizzle migration files (at sense root, not inside src/)
- \`senses/<name>/package.json\` — with esbuild build script
- \`senses/<name>/index.js\` — bundled output generated by \`pnpm build\` (do NOT edit by hand)
- \`senses/<name>/migrations/\`SQL migration files (at sense root, not inside \`src/\`)
Look at existing senses for the package.json template and patterns.
Look at existing senses for patterns.
## When to return done: true
Return \`done: true\` ONLY when ALL of the following are true:
- All required files are created
- \`pnpm install --no-cache && pnpm build\` succeeds (run it!)
- From the **workspace root**, \`pnpm run build\` or \`npm run build\` succeeds (run it!) and \`dist/senses/<name>/index.js\` exists
- \`nerve.yaml\` is updated with the sense config
Return \`done: false\` if you made progress but there is still work to do.`;
@@ -12,7 +12,7 @@ export function plannerPrompt({ threadId }: { threadId: string }): string {
return `You are planning a new Nerve sense.
Read the workflow thread for the user's request: \`nerve thread ${threadId}\`
Read the nerve-dev skill for sense conventions: \`cat node_modules/@uncaged/nerve-skills/nerve-dev/SKILL.md\`
Read the workspace guide: \`cat AGENT.md\` from the repository root (created by \`nerve init\`). Also read \`CONVENTIONS.md\` and \`.knowledge/sense.md\` if present. Optional skills live under \`.cursor/skills/\`.
Also look at existing senses in the \`senses/\` directory for patterns.
Pick a good kebab-case name for this sense. Produce a PLAN (not code) in markdown:
@@ -17,21 +17,20 @@ export function testerPrompt({
**IMPORTANT: The Nerve workspace is at \`${nerveRoot}\`. All paths below are relative to this directory. Always \`cd ${nerveRoot}\` first.**
Read the workflow thread for context: \`nerve thread ${threadId}\`
Read the nerve-dev skill for expected file structure: \`cat ${nerveRoot}/node_modules/@uncaged/nerve-skills/nerve-dev/SKILL.md\`
Read \`cat ${nerveRoot}/AGENT.md\`, then \`${nerveRoot}/CONVENTIONS.md\` and \`${nerveRoot}/.knowledge/sense.md\` if they exist.
Verify the full lifecycle in this order:
1. **File check** — all required sense files exist:
1. **File check** — all required sense files exist (no per-sense \`package.json\`):
- \`senses/<name>/src/index.ts\`
- \`senses/<name>/src/schema.ts\`
- \`senses/<name>/migrations/\`
- \`senses/<name>/package.json\`
2. **Build** — run inside the sense directory:
2. **Build** — from the workspace root:
\`\`\`
cd ${nerveRoot}/senses/<name> && pnpm install --no-cache && pnpm build
cd ${nerveRoot} && pnpm run build
\`\`\`
Must produce \`index.js\` at sense root without errors.
(or \`npm run build\` per root \`package.json\`.) Must produce \`${nerveRoot}/dist/senses/<name>/index.js\` without errors.
3. **Config check** — \`nerve validate\` passes, confirming nerve.yaml is valid.
@@ -10,7 +10,7 @@ export type CoderMeta = z.infer<typeof coderMetaSchema>;
export function coderPrompt({ threadId }: { threadId: string }): string {
return `Read the workflow thread to get the planner's design and any reviewer/tester/committer feedback: \`nerve thread ${threadId}\`
Read the nerve-dev skill for workflow file structure and conventions: \`cat node_modules/@uncaged/nerve-skills/nerve-dev/SKILL.md\`
Read \`cat AGENT.md\` from the repository root, then \`CONVENTIONS.md\` and \`.knowledge/workflow.md\` if present. Optional skills live under \`.cursor/skills/\`.
Also look at existing workflows in the \`workflows/\` directory for patterns.
## Your task
@@ -29,15 +29,13 @@ You do NOT need to finish everything in one pass. You may return \`done: false\`
2. Second pass: implement role logic
3. Third pass: fix build/lint errors
## Workflow file structure
## Workflow file structure (flat workspace)
The workspace has **one root** \`package.json\` and **one** root build (\`pnpm run build\` or \`npm run build\`), implemented by \`scripts/build.mjs\`, which emits bundles under \`dist/workflows/<name>/index.js\`. There is **no** per-workflow \`package.json\` or \`tsconfig.json\`.
Each workflow must have:
- \`workflows/<name>/index.ts\`WorkflowDefinition default export
- \`workflows/<name>/build.ts\` — factory function
- \`workflows/<name>/moderator.ts\` — moderator + meta types
- \`workflows/<name>/roles/<role>.ts\` — meta schema and prompt function per role
- \`workflows/<name>/package.json\` — with esbuild build script
- \`workflows/<name>/tsconfig.json\` — TypeScript config
- \`workflows/<name>/index.ts\`default export \`WorkflowDefinition\` (moderator and meta types typically live here or are imported from co-located modules)
- \`workflows/<name>/roles/<role>.ts\` — one TypeScript file per role (schemas, prompts, \`createRole\` wiring, or plain async role functions)
For **new workflows**, also update \`nerve.yaml\` with \`workflows.<name>\`.
@@ -53,7 +51,7 @@ For **new workflows**, also update \`nerve.yaml\` with \`workflows.<name>\`.
Return \`done: true\` ONLY when ALL of the following are true:
- All changes from the plan are implemented
- \`cd workflows/<name> && pnpm install --no-cache && pnpm build\` succeeds (run it!)
- From the **workspace root**, \`pnpm run build\` or \`npm run build\` succeeds (run it!) so \`dist/workflows/<name>/index.js\` is produced
- No lint or type errors remain
Return \`done: false\` if you made progress but there is still work to do, or if build/lint has errors you plan to fix in the next iteration.`;
@@ -12,18 +12,18 @@ export function plannerPrompt({ threadId }: { threadId: string }): string {
return `You are a Nerve workflow planner. You can **create new workflows** or **modify existing ones**.
Read the workflow thread for the user's request: \`nerve thread ${threadId}\`
Read the nerve-dev skill for workflow conventions: \`cat node_modules/@uncaged/nerve-skills/nerve-dev/SKILL.md\`
Read the workspace guide: \`cat AGENT.md\` from the repository root (created by \`nerve init\`). Also read \`CONVENTIONS.md\` if it exists; if \`.knowledge/workflow.md\` exists (e.g. Nerve monorepo), read it for layout and engine behavior. Optional Cursor skills live under \`.cursor/skills/\`.
List existing workflows: \`ls workflows/\`
## Determine the task type
1. If the user wants to **modify an existing workflow** — read its current code (\`cat workflows/<name>/moderator.ts\`, \`cat workflows/<name>/build.ts\`, \`ls workflows/<name>/roles/\`, etc.) and understand its current structure before planning changes.
1. If the user wants to **modify an existing workflow** — read its current code (\`cat workflows/<name>/index.ts\`, \`ls workflows/<name>/roles/\`, \`cat workflows/<name>/roles/<role>.ts\`, etc.) and understand its current structure before planning changes.
2. If the user wants to **create a new workflow** — look at existing workflows in \`workflows/\` for patterns to follow.
## Produce a PLAN (not code) in markdown
For **new workflows**:
- Workflow name (kebab-case)
- Workflow name — **verb-first** kebab-case phrase (e.g. \`review-pull-request\`, \`deploy-staging\`), not a bare noun
- Roles list (name, purpose, tool)
- Flow transitions / moderator routing logic
- Validation loops design
@@ -17,24 +17,22 @@ export function testerPrompt({
**IMPORTANT: The Nerve workspace is at \`${nerveRoot}\`. All paths below are relative to this directory. Always \`cd ${nerveRoot}\` first.**
Read the workflow thread for context: \`nerve thread ${threadId}\`
Read the nerve-dev skill for expected file structure: \`cat ${nerveRoot}/node_modules/@uncaged/nerve-skills/nerve-dev/SKILL.md\`
Read \`cat ${nerveRoot}/AGENT.md\`, then \`${nerveRoot}/CONVENTIONS.md\` and \`${nerveRoot}/.knowledge/workflow.md\` if they exist.
Get the workflow name from the thread (the planner's output).
Verify the full lifecycle in this order:
1. **File check** — all required workflow files exist (under \`${nerveRoot}/\`):
1. **File check** — all required workflow sources exist (under \`${nerveRoot}/\`):
- \`workflows/<name>/index.ts\`
- \`workflows/<name>/build.ts\`
- \`workflows/<name>/moderator.ts\`
- \`workflows/<name>/roles/\` with one \`.ts\` file per role
- \`workflows/<name>/package.json\`
- \`workflows/<name>/roles/\` with one \`.ts\` file per role (flat files, not per-role packages)
- **No** \`workflows/<name>/package.json\` or \`tsconfig.json\` expected
2. **Build** — run inside the workflow directory:
2. **Build** — from the workspace root:
\`\`\`
cd ${nerveRoot}/workflows/<name> && pnpm install --no-cache && pnpm build
cd ${nerveRoot} && pnpm run build
\`\`\`
Must produce \`dist/index.js\` without errors.
(or \`npm run build\` if that is what the root \`package.json\` defines.) Must produce \`${nerveRoot}/dist/workflows/<name>/index.js\` without errors.
3. **Config check** — \`cd ${nerveRoot} && nerve validate\` passes, confirming nerve.yaml is valid.