This repository has been archived on 2026-06-01. You can view files and clone it. You cannot open issues or pull requests or push a commit.
Files
nerve/.knowledge/worker-isolation.md
xiaomo a1b1d5eaf1 chore: RFC-006 Phase 4 cleanup — delete worker-fork-support.ts
- Move formatChildExitSummary/formatCapturedStderrTail to worker-runtime.ts
- Move ignoreSessionBroadcastSignals to new worker-signals.ts
- Delete worker-fork-support.ts (teeCapturedStderr no longer used)
- Update .knowledge/worker-isolation.md and architecture.md for WorkerRuntime
- All 167 tests pass, biome check clean

Closes #283
2026-04-30 14:17:16 +00:00

158 lines
5.7 KiB
Markdown

# Worker Isolation
Nerve's worker architecture ensures complete isolation between different types of user code while maintaining system stability.
## Process Architecture
```
Kernel (Main Process)
├── Sense Worker (Group A) ── sense-1, sense-2
├── Sense Worker (Group B) ── sense-3, sense-4
├── Workflow Worker (cleanup) ── cleanup workflow instances
└── Workflow Worker (review) ── review workflow instances
```
### WorkerRuntime (RFC-006)
Forked worker processes are managed by **`WorkerRuntime`** (`worker-runtime.ts`): one Node child per logical key, cold start, optional respawn after crash, drain/evict, and coordinated shutdown over IPC. **`worker-pool.ts`** (sense groups) and **`workflow-manager.ts`** (workflow types) both configure and delegate to `createWorkerRuntime` instead of owning ad-hoc fork logic.
Worker **entrypoints** (`sense-worker.ts`, `workflow-worker.ts`) import lightweight helpers only — e.g. `worker-signals.ts` for session broadcast signal handling — so they do not pull in the parent-side runtime module.
## Isolation Boundaries
### 1. Sense Workers
- **One worker per sense group** (configured in `nerve.yaml`)
- Groups share a child process but have isolated execution contexts
- Crash in one sense doesn't affect other groups
- Each group has its own SQLite database
### 2. Workflow Workers
- **One worker per workflow type** (spawned on-demand)
- Multiple threads of the same workflow share a worker process
- Concurrency limits enforced at the workflow level
- Workers terminate when no active threads remain
### 3. Kernel Protection
- **User code never runs in kernel process**
- All `compute()` and workflow role functions run in workers
- Kernel only handles IPC, scheduling, and coordination
- System remains stable even with infinite loops or crashes in user code
## Worker Lifecycle
### Sense Workers
```
nerve daemon start → spawn worker per group → long-lived process
→ hot reload on file changes
→ respawn on crash
```
### Workflow Workers
```
workflow trigger → check existing worker → reuse or spawn
→ execute thread
→ terminate when idle
```
## Communication Patterns
### Kernel ↔ Sense Worker
- IPC via child process stdio
- JSON-formatted messages
- Worker reports signals back to kernel
- Bidirectional: kernel can request immediate computes
### Kernel ↔ Workflow Worker
- Similar IPC protocol
- Workflow definition loaded in worker
- Role execution results streamed back
- Thread state managed in kernel
## Resource Limits & Control
### Timeout Enforcement
Configurable timeouts per sense (in `nerve.yaml`):
```yaml
senses:
my-sense:
timeout: 30000 # Execution timeout (ms)
gracePeriod: 5000 # Grace period before hard kill
```
**Timeout Implementation:**
- `AbortController` for async operations
- `Promise.race()` between compute and timeout
- Grace period triggers `process.exit(1)` to kill entire worker group
### Memory & CPU Limits
**No Application-Level Resource Quotas**:
- No memory caps, CPU throttling, or disk I/O limits enforced by Nerve
- Workers can consume arbitrary system resources until OS limits
- No cgroup/container isolation — full filesystem access within user permissions
- No syscall filtering (no seccomp restrictions)
**OS-Level Constraints Only**:
- Process memory limited by system `ulimit -m`
- CPU usage bounded by scheduler only
- Network requests unrestricted
- Can spawn additional processes (not tracked by Nerve)
### Concurrency Control
#### Sense Workers
- One active compute per sense at a time (serialized via promise chains)
- No memory sharing between sense groups
- Crash isolation: one sense crash doesn't affect other groups
#### Workflow Workers
Per-workflow limits configured in `nerve.yaml`:
```yaml
workflows:
my-workflow:
concurrency: 2 # Max parallel threads
overflow: "drop" # or "queue"
maxQueue: 10 # Queue size limit
```
### Process Management
#### Signal Handling
Workers ignore session broadcast signals (SIGINT/SIGTERM) via `ignoreSessionBroadcastSignals()` in `worker-signals.ts`:
```typescript
// Workers ignore terminal signals; kernel coordinates shutdown
process.on("SIGINT", () => {});
process.on("SIGTERM", () => {});
```
#### Graceful Shutdown & State Handoff
**Sense Workers**:
- IPC `shutdown` message → `process.exit(0)` (immediate)
- No graceful termination period for senses
- State rebuilt from SQLite on respawn (no handoff needed)
**Workflow Workers**:
- IPC `shutdown` → wait for in-flight threads to complete
- Drain timeout: `WORKER_SHUTDOWN_TIMEOUT_MS` (10s)
- If threads don't complete → `SIGKILL` force termination
- Thread state preserved in log store for crash recovery
**State Handoff Mechanism**:
- No explicit state transfer between old/new workers
- Sense workers: SQLite database contains full state
- Workflow workers: Log store contains thread message history
- Kernel coordinates recovery via `recoverThreadsForWorker()`
## Failure Handling
### Worker Crashes
- **Sense workers**: Automatic respawn after 1s delay, state rebuilt from DB
- **Workflow workers**: Crash recovery from log store thread messages
- **Kernel protection**: Main process continues, marks affected runs as crashed
- **Crash limits**: Max 5 crashes per workflow in 60s window (prevents infinite respawn)
### Resource Exhaustion
- **Memory**: Worker process killed by OS, kernel respawns automatically
- **Compute timeout**: Grace period → hard kill → respawn
- **Infinite loops**: Timeout enforcement prevents hanging indefinitely
This architecture allows Nerve to run untrusted or experimental code safely while maintaining system availability.