# Worker Isolation

Nerve's worker architecture runs each type of user code in its own worker process, so a failure in one worker does not destabilize the kernel or other workers.

## Process Architecture

```
Kernel (Main Process)
├── Sense Worker (Group A) ── sense-1, sense-2
├── Sense Worker (Group B) ── sense-3, sense-4
├── Workflow Worker (cleanup) ── cleanup workflow instances
└── Workflow Worker (review) ── review workflow instances
```

### WorkerRuntime (RFC-006)

Forked worker processes are managed by **`WorkerRuntime`** (`worker-runtime.ts`), which handles one Node child per logical key, cold start, optional respawn after a crash, drain/evict, and coordinated shutdown over IPC. **`worker-pool.ts`** (sense groups) and **`workflow-manager.ts`** (workflow types) both configure and delegate to `createWorkerRuntime` instead of owning ad-hoc fork logic.

Worker **entrypoints** (`sense-worker.ts`, `workflow-worker.ts`) import only lightweight helpers, such as `worker-signals.ts` for session broadcast signal handling, so they do not pull in the parent-side runtime module.

## Isolation Boundaries

### 1. Sense Workers
- **One worker per sense group** (configured in `nerve.yaml`)
- Senses in the same group share a child process but have isolated execution contexts
- A crash in one sense doesn't affect other groups
- Each group has its own SQLite database

### 2. Workflow Workers
- **One worker per workflow type** (spawned on demand)
- Multiple threads of the same workflow share a worker process
- Concurrency limits are enforced at the workflow level
- Workers terminate when no active threads remain

### 3. Kernel Protection
- **User code never runs in the kernel process**
- All `compute()` and workflow role functions run in workers
- The kernel only handles IPC, scheduling, and coordination
- The system remains stable even with infinite loops or crashes in user code

## Worker Lifecycle

### Sense Workers
```
nerve daemon start → spawn worker per group → long-lived process
                                            → hot reload on file changes
                                            → respawn on crash
```

### Workflow Workers
```
workflow trigger → check existing worker → reuse or spawn
                                         → execute thread
                                         → terminate when idle
```

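
The reuse-or-spawn flow for workflow workers can be sketched as a keyed, reference-counted registry. This is a minimal illustration rather than Nerve's actual `WorkerRuntime`: the `WorkerRegistry` class, its `acquire`/`release`/`has` methods, and the injected `spawn`/`terminate` callbacks are all hypothetical names chosen for this example.

```typescript
// Hypothetical sketch: one worker per workflow type, reference-counted by
// active threads. `spawn` and `terminate` are injected so the logic can be
// exercised without forking real processes.
interface WorkerEntry<W> {
  worker: W;
  activeThreads: number;
}

class WorkerRegistry<W> {
  private readonly entries = new Map<string, WorkerEntry<W>>();
  private readonly spawn: (key: string) => W;
  private readonly terminate: (worker: W) => void;

  constructor(spawn: (key: string) => W, terminate: (worker: W) => void) {
    this.spawn = spawn;
    this.terminate = terminate;
  }

  // Reuse the worker for `key` if one exists, otherwise cold-start one.
  acquire(key: string): W {
    let entry = this.entries.get(key);
    if (!entry) {
      entry = { worker: this.spawn(key), activeThreads: 0 };
      this.entries.set(key, entry);
    }
    entry.activeThreads += 1;
    return entry.worker;
  }

  // Mark one thread as finished; terminate the worker once it is idle.
  release(key: string): void {
    const entry = this.entries.get(key);
    if (!entry) return;
    entry.activeThreads -= 1;
    if (entry.activeThreads <= 0) {
      this.entries.delete(key);
      this.terminate(entry.worker);
    }
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }
}
```

In a real kernel, `spawn` would presumably wrap `child_process.fork` and `terminate` would send the IPC `shutdown` message; both are left abstract here.
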
## Communication Patterns

### Kernel ↔ Sense Worker
- IPC via child process stdio
- JSON-formatted messages
- Worker reports signals back to kernel
- Bidirectional: kernel can request immediate computes

### Kernel ↔ Workflow Worker
- Similar IPC protocol
- Workflow definition loaded in worker
- Role execution results streamed back
- Thread state managed in kernel

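
To make the JSON message exchange concrete, the sketch below models a few messages as a discriminated union with a defensive decoder. Apart from `shutdown`, which this document names as an IPC message, every message name and field here (`compute`, `signal`, `senseId`, `payload`) is an assumption made for illustration; the real protocol may look different.

```typescript
// Hypothetical message shapes; the real protocol's names and fields may differ.
type KernelToWorker =
  | { type: "compute"; senseId: string } // kernel requests an immediate compute
  | { type: "shutdown" };                // kernel coordinates shutdown

type WorkerToKernel = { type: "signal"; senseId: string; payload: unknown };

function encodeMessage(msg: KernelToWorker | WorkerToKernel): string {
  return JSON.stringify(msg);
}

// Parse and validate a message from the kernel; returns null on malformed input.
function decodeKernelMessage(raw: string): KernelToWorker | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof value !== "object" || value === null) return null;
  const msg = value as Record<string, unknown>;
  if (msg["type"] === "shutdown") return { type: "shutdown" };
  const senseId = msg["senseId"];
  if (msg["type"] === "compute" && typeof senseId === "string") {
    return { type: "compute", senseId };
  }
  return null; // unknown message type or missing fields
}
```

Validating at the boundary keeps a malformed message from crashing the worker's message loop; the decoder simply rejects anything it does not recognize.
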
## Resource Limits & Control

### Timeout Enforcement
Configurable timeouts per sense (in `nerve.yaml`):
```yaml
senses:
  my-sense:
    timeout: 30000      # Execution timeout (ms)
    gracePeriod: 5000   # Grace period before hard kill (ms)
```

**Timeout Implementation:**
- `AbortController` for async operations
- `Promise.race()` between compute and timeout
- When the grace period expires, the worker calls `process.exit(1)`, which kills the entire worker group

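
The three timeout steps can be sketched together as follows. This is an illustration under stated assumptions, not Nerve's implementation: `runWithTimeout`, `ComputeTimeoutError`, and the `onGraceExpired` callback are hypothetical names, and in a real worker the callback would presumably be `() => process.exit(1)`.

```typescript
// Hypothetical sketch: race a compute against a timer, signal cancellation via
// AbortController, and escalate if the compute ignores the abort.
class ComputeTimeoutError extends Error {}

async function runWithTimeout<T>(
  compute: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  gracePeriodMs: number,
  onGraceExpired: () => void, // assumed hard-kill hook
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask cooperative code to stop
      reject(new ComputeTimeoutError(`compute exceeded ${timeoutMs} ms`));
    }, timeoutMs);
  });
  const work = compute(controller.signal);
  try {
    return await Promise.race([work, timeout]);
  } catch (err) {
    if (err instanceof ComputeTimeoutError) {
      // Give the compute a grace period to settle before the hard kill.
      const grace = setTimeout(onGraceExpired, gracePeriodMs);
      const cancelGrace = () => clearTimeout(grace);
      work.then(cancelGrace, cancelGrace);
    }
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

Note that `AbortController` only helps code that observes the signal. A synchronous infinite loop blocks the event loop, so timers inside the same process never fire; that case needs enforcement from outside the worker.
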
### Memory & CPU Limits
**No Application-Level Resource Quotas**:
- No memory caps, CPU throttling, or disk I/O limits enforced by Nerve
- Workers can consume arbitrary system resources until they hit OS limits
- No cgroup/container isolation: workers have full filesystem access within user permissions
- No syscall filtering (no seccomp restrictions)

**OS-Level Constraints Only**:
- Process memory is limited only by OS settings such as `ulimit` (note that `ulimit -m` is not enforced on modern Linux kernels; `ulimit -v` is)
- CPU usage is bounded by the scheduler only
- Network requests are unrestricted
- Workers can spawn additional processes (not tracked by Nerve)

### Concurrency Control

#### Sense Workers
- One active compute per sense at a time (serialized via promise chains)
- No memory sharing between sense groups
- Crash isolation: one sense crash doesn't affect other groups

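
Serializing computes "via promise chains" typically means each run is chained onto the tail of the previous one for the same sense. The sketch below shows that technique in isolation; `SenseSerializer` and its `run` method are hypothetical names, not Nerve's API.

```typescript
// Hypothetical sketch: at most one active compute per sense, enforced by
// chaining each run onto the previous run for the same sense ID.
class SenseSerializer {
  private readonly tails = new Map<string, Promise<unknown>>();

  run<T>(senseId: string, compute: () => Promise<T>): Promise<T> {
    const previous = this.tails.get(senseId) ?? Promise.resolve();
    // Start only after the previous run for this sense has settled.
    const next = previous.then(() => compute());
    // Store a non-rejecting tail so one failure does not poison the chain.
    this.tails.set(senseId, next.catch(() => undefined));
    return next;
  }
}
```

Runs for different sense IDs use separate chains, so a slow compute in one sense does not delay another sense in this sketch.
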
#### Workflow Workers
Per-workflow limits configured in `nerve.yaml`:
```yaml
workflows:
  my-workflow:
    concurrency: 2     # Max parallel threads
    overflow: "drop"   # or "queue"
    maxQueue: 10       # Queue size limit
```

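
One way to read these three settings is as an admission gate in front of each workflow. The sketch below is an interpretation, not Nerve's code: `WorkflowGate`, `submit`, and `finish` are hypothetical names, and the choice to drop new threads once a `"queue"` workflow's queue is full is an assumption.

```typescript
// Hypothetical sketch of per-workflow admission control.
type Overflow = "drop" | "queue";
type Admission = "started" | "queued" | "dropped";

interface WorkflowLimits {
  concurrency: number; // max parallel threads
  overflow: Overflow;  // what to do when all slots are busy
  maxQueue: number;    // queue size limit (only used with "queue")
}

class WorkflowGate {
  private active = 0;
  private readonly queue: Array<() => void> = [];
  private readonly limits: WorkflowLimits;

  constructor(limits: WorkflowLimits) {
    this.limits = limits;
  }

  // Try to start a thread; `start` runs now, or later when a slot frees up.
  submit(start: () => void): Admission {
    if (this.active < this.limits.concurrency) {
      this.active += 1;
      start();
      return "started";
    }
    if (this.limits.overflow === "queue" && this.queue.length < this.limits.maxQueue) {
      this.queue.push(start);
      return "queued";
    }
    return "dropped"; // assumption: a full queue also drops
  }

  // Call when a thread finishes; hands the slot to the oldest queued thread.
  finish(): void {
    const next = this.queue.shift();
    if (next) {
      next(); // the freed slot passes directly to the queued thread
    } else if (this.active > 0) {
      this.active -= 1;
    }
  }
}
```

With `concurrency: 2`, `overflow: "queue"`, and `maxQueue: 1`, four back-to-back submissions would be admitted as started, started, queued, and dropped under these assumptions.
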
### Process Management

#### Signal Handling
Workers ignore session broadcast signals (SIGINT/SIGTERM) via `ignoreSessionBroadcastSignals()` in `worker-signals.ts`:
```typescript
// Workers ignore terminal signals; kernel coordinates shutdown
process.on("SIGINT", () => {});
process.on("SIGTERM", () => {});
```

#### Graceful Shutdown & State Handoff
**Sense Workers**:
- IPC `shutdown` message → `process.exit(0)` (immediate)
- No graceful termination period for senses
- State rebuilt from SQLite on respawn (no handoff needed)

**Workflow Workers**:
- IPC `shutdown` → wait for in-flight threads to complete
- Drain timeout: `WORKER_SHUTDOWN_TIMEOUT_MS` (10s)
- If threads don't complete within the drain timeout → `SIGKILL` force termination
- Thread state preserved in log store for crash recovery

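
The workflow drain sequence amounts to racing "all threads finished" against a deadline. The following is a minimal sketch of that idea; `drainOrKill` and its parameters are hypothetical, and only the constant name and its 10s value come from this document.

```typescript
// Hypothetical sketch: wait for a worker to drain, force-kill on deadline.
const WORKER_SHUTDOWN_TIMEOUT_MS = 10000; // 10s, as stated in this document

async function drainOrKill(
  drained: Promise<void>, // resolves when in-flight threads have completed
  forceKill: () => void,  // e.g. () => child.kill("SIGKILL")
  timeoutMs: number = WORKER_SHUTDOWN_TIMEOUT_MS,
): Promise<"drained" | "killed"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<"killed">((resolve) => {
    timer = setTimeout(() => resolve("killed"), timeoutMs);
  });
  try {
    const outcome = await Promise.race([
      drained.then(() => "drained" as const),
      deadline,
    ]);
    if (outcome === "killed") forceKill();
    return outcome;
  } finally {
    clearTimeout(timer); // avoid keeping the event loop alive after a clean drain
  }
}
```

Because thread state lives in the log store, a forced kill in this model loses only in-progress work, which the kernel can then recover.
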
**State Handoff Mechanism**:
- No explicit state transfer between old/new workers
- Sense workers: SQLite database contains full state
- Workflow workers: Log store contains thread message history
- Kernel coordinates recovery via `recoverThreadsForWorker()`

## Failure Handling

### Worker Crashes
- **Sense workers**: Automatic respawn after 1s delay, state rebuilt from DB
- **Workflow workers**: Crash recovery from log store thread messages
- **Kernel protection**: Main process continues, marks affected runs as crashed
- **Crash limits**: Max 5 crashes per workflow in 60s window (prevents infinite respawn)

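
The crash limit is a sliding-window counter, which can be sketched as below. `CrashLimiter` and `recordCrash` are hypothetical names, and treating the sixth crash inside the window as the one that blocks a respawn is an assumption about how "max 5" is counted.

```typescript
// Hypothetical sketch: allow respawns until a workflow exceeds `maxCrashes`
// crashes within a sliding window of `windowMs` milliseconds.
class CrashLimiter {
  private readonly crashes = new Map<string, number[]>();
  private readonly maxCrashes: number;
  private readonly windowMs: number;

  constructor(maxCrashes: number = 5, windowMs: number = 60000) {
    this.maxCrashes = maxCrashes;
    this.windowMs = windowMs;
  }

  // Record a crash at time `now` (ms) and report whether a respawn is allowed.
  recordCrash(key: string, now: number = Date.now()): boolean {
    const previous = this.crashes.get(key) ?? [];
    // Keep only crashes that are still inside the window.
    const recent = previous.filter((t) => now - t < this.windowMs);
    recent.push(now);
    this.crashes.set(key, recent);
    return recent.length <= this.maxCrashes;
  }
}
```

Passing `now` explicitly keeps the limiter deterministic in tests; production code would rely on the `Date.now()` default.
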
### Resource Exhaustion
- **Memory**: Worker process is killed by the OS; the Nerve kernel respawns it automatically
- **Compute timeout**: Grace period → hard kill → respawn
- **Infinite loops**: Timeout enforcement prevents hanging indefinitely

This architecture lets Nerve run experimental or misbehaving user code without taking down the kernel. It provides crash and hang isolation between processes, not a security sandbox.