nerve/.knowledge/worker-isolation.md

# Worker Isolation

Nerve's worker architecture ensures complete isolation between different types of user code while maintaining system stability.

## Process Architecture

```
Kernel (Main Process)
├── Sense Worker (Group A) ── sense-1, sense-2
├── Sense Worker (Group B) ── sense-3, sense-4
├── Workflow Worker (cleanup) ── cleanup workflow instances
└── Workflow Worker (review) ── review workflow instances
```

### WorkerRuntime (RFC-006)

Forked worker processes are managed by **`WorkerRuntime`** (`worker-runtime.ts`): one Node child per logical key, cold start, optional respawn after crash, drain/evict, and coordinated shutdown over IPC. **`worker-pool.ts`** (sense groups) and **`workflow-manager.ts`** (workflow types) both configure and delegate to `createWorkerRuntime` instead of owning ad-hoc fork logic.

Worker **entrypoints** (`sense-worker.ts`, `workflow-worker.ts`) import lightweight helpers only — e.g. `worker-signals.ts` for session broadcast signal handling — so they do not pull in the parent-side runtime module.

## Isolation Boundaries

### 1. Sense Workers
- **One worker per sense group** (configured in `nerve.yaml`)
- Groups share a child process but have isolated execution contexts
- Crash in one sense doesn't affect other groups
- Each group has its own SQLite database

### 2. Workflow Workers
- **One worker per workflow type** (spawned on-demand)
- Multiple threads of the same workflow share a worker process
- Concurrency limits enforced at the workflow level
- Workers terminate when no active threads remain

### 3. Kernel Protection
- **User code never runs in kernel process**
- All `compute()` and workflow role functions run in workers
- Kernel only handles IPC, scheduling, and coordination
- System remains stable even with infinite loops or crashes in user code

## Worker Lifecycle

### Sense Workers
```
nerve daemon start → spawn worker per group → long-lived process
                   → hot reload on file changes
                   → respawn on crash
```

### Workflow Workers
```
workflow trigger → check existing worker → reuse or spawn
                                       → execute thread
                                       → terminate when idle
```

## Communication Patterns

### Kernel ↔ Sense Worker
- IPC via child process stdio
- JSON-formatted messages
- Worker reports signals back to kernel
- Bidirectional: kernel can request immediate computes

### Kernel ↔ Workflow Worker
- Similar IPC protocol
- Workflow definition loaded in worker
- Role execution results streamed back
- Thread state managed in kernel

## Resource Limits & Control

### Timeout Enforcement
Configurable timeouts per sense (in `nerve.yaml`):
```yaml
senses:
  my-sense:
    timeout: 30000      # Execution timeout (ms)
    gracePeriod: 5000   # Grace period before hard kill
```

**Timeout Implementation:**
- `AbortController` for async operations
- `Promise.race()` between compute and timeout
- Grace period triggers `process.exit(1)` to kill entire worker group

### Memory & CPU Limits
**No Application-Level Resource Quotas**:
- No memory caps, CPU throttling, or disk I/O limits enforced by Nerve
- Workers can consume arbitrary system resources until OS limits
- No cgroup/container isolation — full filesystem access within user permissions
- No syscall filtering (no seccomp restrictions)

**OS-Level Constraints Only**:
- Process memory limited by system `ulimit -m`
- CPU usage bounded by scheduler only
- Network requests unrestricted
- Can spawn additional processes (not tracked by Nerve)

### Concurrency Control

#### Sense Workers
- One active compute per sense at a time (serialized via promise chains)
- No memory sharing between sense groups
- Crash isolation: one sense crash doesn't affect other groups

#### Workflow Workers
Per-workflow limits configured in `nerve.yaml`:
```yaml
workflows:
  my-workflow:
    concurrency: 2        # Max parallel threads
    overflow: "drop"      # or "queue"
    maxQueue: 10         # Queue size limit
```

### Process Management

#### Signal Handling
Workers ignore session broadcast signals (SIGINT/SIGTERM) via `ignoreSessionBroadcastSignals()` in `worker-signals.ts`:
```typescript
// Workers ignore terminal signals; kernel coordinates shutdown
process.on("SIGINT", () => {});
process.on("SIGTERM", () => {});
```

#### Graceful Shutdown & State Handoff
**Sense Workers**:
- IPC `shutdown` message → `process.exit(0)` (immediate)
- No graceful termination period for senses
- State rebuilt from SQLite on respawn (no handoff needed)

**Workflow Workers**:
- IPC `shutdown` → wait for in-flight threads to complete
- Drain timeout: `WORKER_SHUTDOWN_TIMEOUT_MS` (10s)
- If threads don't complete → `SIGKILL` force termination
- Thread state preserved in log store for crash recovery

**State Handoff Mechanism**:
- No explicit state transfer between old/new workers
- Sense workers: SQLite database contains full state
- Workflow workers: Log store contains thread message history
- Kernel coordinates recovery via `recoverThreadsForWorker()`

## Failure Handling

### Worker Crashes
- **Sense workers**: Automatic respawn after 1s delay, state rebuilt from DB
- **Workflow workers**: Crash recovery from log store thread messages
- **Kernel protection**: Main process continues, marks affected runs as crashed
- **Crash limits**: Max 5 crashes per workflow in 60s window (prevents infinite respawn)

### Resource Exhaustion
- **Memory**: Worker process killed by OS, kernel respawns automatically
- **Compute timeout**: Grace period → hard kill → respawn
- **Infinite loops**: Timeout enforcement prevents hanging indefinitely

This architecture allows Nerve to run untrusted or experimental code safely while maintaining system availability.