docs(knowledge): update cards via knowledge-extraction workflow (5q/round)

7 cards updated, 4 new cards added. Topics: signal-routing, worker-isolation, storage-layer, adapter-isolation, sense contracts, workflow runtime enforcement, coding conventions details.
小橘 <xiaoju@shazhou.work>
2026-04-30 05:56:29 +00:00
parent 2387b73141
commit 9c832b0e21
11 changed files with 731 additions and 14 deletions
@@ -0,0 +1,171 @@
+# Adapter Process Isolation
+
+Describes sandboxing, process isolation, resource limits, and timeout enforcement for adapter invocations in the Nerve workflow system.
+
+## Process Isolation Model
+
+Adapters run in a **two-tier isolation** model:
+
+1. **Workflow Worker Process** — Each workflow runs in a dedicated Node.js worker process (`workflow-worker.ts`) forked from the main daemon
+2. **Adapter Child Process** — Each adapter spawns CLI tools as child processes via `spawnSafe()` with `shell: false`
+
+## Resource Limits & Timeouts
+
+### Adapter-Level Timeouts
+
+- **Default timeout**: 300 seconds (300,000ms) for both cursor and hermes adapters
+- **Configurable** via `AgentConfig.timeout` in adapter factory functions
+- **Wall-clock enforcement** using `setTimeout()` — kills child process with `SIGTERM` on timeout
+- **AbortSignal support** — external cancellation triggers immediate `SIGTERM`
+
+### Timeout Behavior
+
+```ts
+// Timeout resolution priority (packages/core/src/spawn-safe.ts):
+// 1. Explicit timeoutMs value
+// 2. AbortSignal presence → no internal timer (relies on external abort)
+// 3. DEFAULT_TIMEOUT_MS (300_000) fallback
+```
+
+- Child process terminated with `SIGTERM` on timeout/abort
+- Returns `{ kind: "timeout", stdout, stderr }` error result
+- **No grace period** — immediate kill
+- **No SIGKILL escalation** — relies entirely on `SIGTERM` effectiveness
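The resolution order in the comment block above can be sketched as a small pure function. This is a minimal sketch, not the code in `spawn-safe.ts`: the name `resolveTimeoutMs` and its parameter shape are hypothetical, and only the priority order and the 300,000 ms default come from this card.

```typescript
// Default taken from this card; the real constant lives in spawn-safe.ts.
const DEFAULT_TIMEOUT_MS = 300_000;

// Hypothetical helper: returns the timer duration in ms, or null when no
// internal timer should be armed and cancellation is left to the AbortSignal.
function resolveTimeoutMs(
  timeoutMs: number | null,
  abortSignal: AbortSignal | null,
): number | null {
  if (timeoutMs !== null) return timeoutMs; // 1. explicit value wins
  if (abortSignal !== null) return null;    // 2. signal present: no internal timer
  return DEFAULT_TIMEOUT_MS;                // 3. fallback
}
```

Under this reading, a caller that passes only an `AbortSignal` gets no wall-clock limit at all, so the external abort is the only thing that ends a hung child.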
+
+#### SIGTERM Limitations
+
+If a child process **ignores or blocks `SIGTERM`** (e.g., signal handlers, blocked delivery):
+
+- **No fallback to `SIGKILL`** — process may remain alive indefinitely
+- **No escalation timer** — spawnSafe() does not implement progressive signal escalation
+- **Runaway-process risk**: a child that never exits keeps running and consuming resources (strictly a runaway or orphan, not a zombie, since a zombie has already exited)
+- **OS-level cleanup only** — relies on parent process death or OS reaping mechanisms
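For contrast, the escalation that `spawnSafe()` is described as lacking could look roughly like the sketch below. Everything here is hypothetical: `terminateWithEscalation`, the `Killable` interface, and the injectable `schedule` parameter are illustrative names, not part of the codebase.

```typescript
// Minimal structural view of a child process; Node's ChildProcess has these members.
interface Killable {
  exitCode: number | null;
  signalCode: string | null;
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
}

type Schedule = (fn: () => void, ms: number) => void;

// Hypothetical helper, NOT part of spawnSafe(): send SIGTERM, then SIGKILL
// if the child has still not exited after graceMs.
function terminateWithEscalation(
  child: Killable,
  graceMs: number,
  schedule: Schedule = (fn, ms) => { setTimeout(fn, ms); },
): void {
  child.kill("SIGTERM");
  schedule(() => {
    // Both fields stay null while the process is still running.
    const stillRunning = child.exitCode === null && child.signalCode === null;
    if (stillRunning) child.kill("SIGKILL");
  }, graceMs);
}
```

The scheduler is injected only so the behaviour can be exercised without real timers; with the default, the pending timer keeps the event loop alive for up to `graceMs`.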
+
+## Sandboxing Characteristics
+
+### What's Isolated
+
+- **File system**: Child process starts in the specified `cwd` (workflow working directory); this is a starting directory, not a confinement
+- **Environment**: Env vars are set via `nerveCommandEnv()` plus optional overrides
+- **Network**: Not isolated; the child inherits the parent process's network access
+- **Process tree**: Children are direct child processes, not containerized
+
+### What's NOT Sandboxed
+
+- **No resource quotas** (CPU, memory, disk I/O limits)
+- **No filesystem chroot/containers** — full filesystem access within user permissions
+- **No network isolation** — can make arbitrary network calls
+- **No syscall filtering** — no seccomp or similar restrictions
+
+#### Runtime Resource Enforcement
+
+**No active resource monitoring or constraints**:
+
+- **No cgroups** (Linux) — no CPU, memory, or I/O limits enforced
+- **No job objects** (Windows) — no resource quotas or process tree limits
+- **No worker_threads resource tracking** — Node.js worker processes run unrestricted
+- **Pure timeout-based enforcement** — only wall-clock time limits via `setTimeout()`
+- **OS-scheduled resource sharing** — relies entirely on operating system process scheduling
+
+Adapters can consume unlimited:
+- **CPU time** (until timeout)
+- **Memory** (until OOM)
+- **Disk I/O** (no quotas)
+- **Network bandwidth** (no throttling)
+- **File descriptors** (until ulimit)
+
+#### Environment Variable Security
+
+The `nerveCommandEnv()` function provides **minimal sanitization**:
+
+```ts
+// spawn-safe.ts lines 47-55
+export function nerveCommandEnv(): SpawnEnv {
+  const home = homedir();
+  const pnpmHome = join(home, ".local/share/pnpm");
+  return {
+    ...process.env,           // ← Full parent environment inherited
+    PNPM_HOME: pnpmHome,
+    PATH: `${pnpmHome}:${process.env.PATH ?? ""}`,
+  };
+}
+```
+
+- **No filtering of sensitive keys** — `NODE_OPTIONS`, `LD_PRELOAD`, `PYTHONPATH` passed through unchanged
+- **Full environment inheritance** — all parent process environment variables copied
+- **Injection risk** — malicious env vars (e.g., `NODE_OPTIONS=--require=evil.js`) affect Node.js child processes
+- **Path manipulation** — sensitive PATH entries remain accessible to adapters
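One way to narrow the inheritance described above is a denylist applied before spawning. The sketch below is an assumption-laden illustration: `stripSensitiveEnv` and `SENSITIVE_KEYS` do not exist in `spawn-safe.ts`, and the set of keys worth removing depends on the tools being launched.

```typescript
type SpawnEnv = Record<string, string | undefined>;

// Illustrative denylist; which keys count as sensitive is an assumption.
const SENSITIVE_KEYS: readonly string[] = ["NODE_OPTIONS", "LD_PRELOAD", "PYTHONPATH"];

// Hypothetical helper: copy the environment, leaving out denylisted keys.
function stripSensitiveEnv(
  env: SpawnEnv,
  deny: readonly string[] = SENSITIVE_KEYS,
): SpawnEnv {
  const result: SpawnEnv = {};
  for (const [key, value] of Object.entries(env)) {
    if (!deny.includes(key)) result[key] = value;
  }
  return result;
}
```

An allowlist (copying only known-safe keys) is the stricter alternative, at the cost of breaking tools that rely on variables nobody thought to list.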
+
+## Security Model
+
+### Execution Context
+
+- Uses `shell: false` to prevent shell injection attacks
+- Arguments passed as separate array elements (not shell-parsed)
+- PATH includes `~/.local/share/pnpm` for tool discovery
+- Inherits parent process user/group permissions
+
+#### File Descriptor Management
+
+```ts
+// spawn-safe.ts line 122
+stdio: ["ignore", "pipe", "pipe"]
+```
+
+- **stdin closed**: Child receives no input (`stdio[0]: "ignore"`)
+- **stdout/stderr captured**: Piped to parent for collection (`stdio[1,2]: "pipe"`)
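+- **No explicit fd closing**: `spawnSafe()` does not close descriptors itself; only the three `stdio` entries are passed deliberately
+- **Other parent descriptors**: Node.js (via libuv) opens its own sockets and files close-on-exec, so they are not normally inherited; descriptors opened without that flag (e.g., by native addons) could still leak to the child
+- **Security risk**: Limited to descriptors created outside Node.js's own APIs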
+
+### Attack Surface
+
+- CLI tools have **full user-level filesystem access**
+- Can spawn additional processes (not tracked/limited)
+- Network requests unrestricted
+- Resource consumption relies on OS-level limits
+
+## Worker Process Management
+
+### Workflow Isolation
+
+- Each workflow type gets dedicated worker process
+- Worker processes handle multiple concurrent threads (runIds)
+- Kill flags enable per-thread cancellation without killing worker
+- Graceful shutdown waits up to 10 seconds for in-flight operations
+
+#### Cross-RunId Contamination Risks
+
+**Shared mutable state** poses contamination risks between concurrent runIds:
+
+- **`process.env` mutations**: Environment changes affect all subsequent runIds in same worker
+- **`require.cache` pollution**: Module cache shared across all runIds — side effects persist
+- **Global variables**: Any global state mutations from one runId visible to others
+- **`process.cwd()` changes**: Working directory changes affect entire worker process
+- **File descriptors**: Open files/sockets shared between runId executions
+
+**No runId-specific scoping** implemented:
+- Worker reuses single Node.js process for efficiency
+- Each role execution sees cumulative environment from previous runIds
+- **Mitigation relies on adapter discipline** — clean implementations avoid global mutations
+
+### Error Handling
+
+- Adapter failures don't crash the worker process
+- Timeout/abort errors are isolated to specific role execution
+- Worker process survives adapter failures and continues serving other threads
+
+## Configuration
+
+```yaml
+# Example nerve.yaml configuration for timeout overrides
+workflows:
+  my-workflow:
+    roles:
+      coder:
+        adapter:
+          type: cursor
+          timeout: 600000  # 10 minutes in milliseconds
+```
+
+Timeout configuration happens at the adapter creation level, not as a system-wide sandbox policy.
@@ -9,9 +9,19 @@ type AgentFn = (prompt: string, context: WorkflowContext) => Promise<string>
 ```

 - Input: prompt + context (start frame, messages, workdir, AbortSignal)
- Output: raw string — structured extraction is separate
+- Output: **single-shot `Promise<string>`** — no streaming support
 - Adapter handles tool-specific details internally

+### Streaming Limitations
+
+The `AgentFn` protocol does **not** support streaming responses (`AsyncIterable<string>` or `ReadableStream`). It's strictly limited to single-shot `Promise<string>` returns.
+
+For long-running or incremental agent outputs:
+- CLI tools buffer full output until completion
+- Timeout enforcement via `timeoutMs` (default 300s)
+- No intermediate results exposed to workflow logic
+- Progress indication happens at the CLI tool level only
+
 ## Available Adapters

 | Package | Adapter | Tool |
@@ -45,3 +55,14 @@ extract:
 ```

 Two-level merge: global → role override. Retry once on parse failure (feeds error back to LLM), then throw `ExtractError`.
+
+## Error Handling
+
+When an adapter's underlying CLI tool (e.g., `cursor-agent` or `hermes`) fails, the error surfaces as a **rejected promise** from the `AgentFn` call, with no fallback or retry logic:
+
+- **Missing/unavailable tool**: `spawn_failed` error when CLI binary not found in `$PATH`
+- **Non-zero exit code**: `non_zero_exit` error with captured stdout/stderr
+- **Timeout**: `timeout` error when execution exceeds configured `timeoutMs`
+- **Abort signal**: `aborted` error when `AbortSignal` triggers cancellation
+
+All errors are immediately thrown as `Error` instances with descriptive messages (e.g., `"cursor-agent: exitCode=7 stdout=... stderr=..."`). No automatic retries or fallback adapters.
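The four failure kinds listed above map naturally onto a discriminated union. The following is a minimal sketch under stated assumptions: the field layout of `SpawnFailure` and the helper `toAdapterError` are hypothetical, and only the `kind` names and the `exitCode=... stdout=... stderr=...` message shape come from this card.

```typescript
// Failure kinds named in this card; the exact fields are an assumption.
type SpawnFailure =
  | { kind: "spawn_failed"; message: string }
  | { kind: "non_zero_exit"; exitCode: number; stdout: string; stderr: string }
  | { kind: "timeout"; stdout: string; stderr: string }
  | { kind: "aborted"; stdout: string; stderr: string };

// Hypothetical mapper from a failure result to the Error an adapter rejects with.
function toAdapterError(tool: string, failure: SpawnFailure): Error {
  switch (failure.kind) {
    case "spawn_failed":
      return new Error(`${tool}: spawn failed: ${failure.message}`);
    case "non_zero_exit":
      return new Error(
        `${tool}: exitCode=${failure.exitCode} stdout=${failure.stdout} stderr=${failure.stderr}`,
      );
    case "timeout":
      return new Error(`${tool}: timed out stderr=${failure.stderr}`);
    case "aborted":
      return new Error(`${tool}: aborted`);
  }
}
```

Because the `switch` covers every `kind`, adding a fifth failure kind to the union makes the function stop compiling until the new case is handled.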
@@ -33,3 +33,14 @@ Senses own both the "what" (compute logic) and the "when" (config-driven schedul
 - One worker per Workflow type (on-demand)
 - Workers never talk to each other
 - All user code runs in isolated Workers; kernel never loads user code directly
+
+## Storage Systems
+
+- **Log Store** — SQLite with WAL mode for audit trails and workflow state
+- **Sense Databases** — Isolated SQLite per sense group for private data
+- **Knowledge Store** — Vector search index for project context
+- **Blob Store** — Content-addressable storage for large artifacts
+
+## Signal Flow
+
+A non-null Sense compute output always emits a signal; if the output also carries a valid workflow trigger, the workflow is started in addition to the signal.
@@ -6,6 +6,8 @@

 ```bash
 nerve init                    # scaffold a new workspace (nerve.yaml, senses/, workflows/)
+nerve init --force            # reinitialize workspace even if ~/.uncaged-nerve/ exists (preserves data/)
+nerve init --from <git-url>   # clone existing workspace from git repository
 nerve validate                # validate nerve.yaml config
 nerve dev                     # run kernel foreground (development, Ctrl+C to stop)
 nerve start                   # start daemon (background)
@@ -14,6 +16,14 @@ nerve status                  # check daemon health (uptime, senses, workflows)
 nerve daemon                  # restart daemon (stop + start)
 ```

+### Init Behavior
+
+**Default `nerve init`**: Creates workspace at `~/.uncaged-nerve/`. If this directory already exists and is non-empty, **exits with error** requiring `--force` flag. No merge/overwrite logic — prevents accidental workspace destruction.
+
+**Force mode `nerve init --force`**: Reinitializes workspace even if `~/.uncaged-nerve/` exists. **Preserves `data/` directory** (containing sense SQLite databases and logs) but overwrites all config files (`nerve.yaml`, `package.json`, etc.) and example senses.
+
+**Git clone `nerve init --from <url>`**: Clones existing repository to `~/.uncaged-nerve/`. Requires empty target directory — fails if workspace already exists and is non-empty.
+
 ## Sense Management

 ```bash
@@ -24,6 +24,21 @@ type Config = { throttle?: string }
 - `throw` only for programmer errors (bugs)
 - No try-catch for flow control

+### Result<T, E> Type
+
+Defined in `@uncaged/nerve-core` (`packages/core/src/result.ts`):
+
+```ts
+export type Result<T, E = Error> = { ok: true; value: T } | { ok: false; error: E };
+```
+
+**Discriminated union** with tagged `ok` field. Helper functions:
+- `ok(value)` → `{ ok: true, value }`
+- `err(error)` → `{ ok: false, error }`
+
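+**Exhaustive handling**: Pattern is `if (!result.ok) { handle error }`, then access `result.value`.
+TypeScript's control-flow analysis narrows the union, so reading `result.value` before checking `ok` is a type error. Nothing forces a caller to inspect a returned `Result` at all; that part relies on manual discipline.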
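A short usage sketch of the pattern follows. The `Result` type is copied from above; the helper signatures are an assumption about how `ok` and `err` might be typed, and `parsePort` and `describePort` are hypothetical examples, not functions from `@uncaged/nerve-core`.

```typescript
type Result<T, E = Error> = { ok: true; value: T } | { ok: false; error: E };

// Assumed helper signatures; the card only documents the shapes they return.
const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Hypothetical example: parse a TCP port without throwing.
function parsePort(input: string): Result<number, string> {
  const port = Number(input);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    return err(`invalid port: ${input}`);
  }
  return ok(port);
}

function describePort(input: string): string {
  const result = parsePort(input);
  if (!result.ok) return `error: ${result.error}`; // narrowed to the error branch
  return `port ${result.value}`;                    // narrowed to the success branch
}
```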
+
 ## Naming

 | Type | Style |
@@ -38,9 +53,25 @@ type Config = { throttle?: string }
 - Always named exports, never default
 - One module = one responsibility

+### Module Naming Conventions
+
+**Primary exports** use descriptive, unambiguous names:
+- Functions: `createXxx()`, `parseXxx()`, `xxxAgent()` (e.g., `createCursorAdapter`, `cursorAgent`)
+- Types: Domain-specific prefixes (e.g., `CursorAgentOptions`, `SenseComputeFn`, `WorkflowContext`)
+- Constants: `UPPER_SNAKE_CASE` with context (e.g., `DEFAULT_SENSE_SIGNAL_RETENTION`, `CURSOR_ADAPTER_DEFAULT_MS`)
+
+**Avoiding ambiguity**:
+- Package-scoped naming: `@uncaged/nerve-adapter-cursor` exports `cursorAgent`, `createCursorAdapter`
+- Factory pattern: `createXxxAdapter()` for configurable instances, `xxxAdapter` for defaults
+- Descriptive type prefixes prevent collision (e.g., `CursorAgentOptions` vs `HermesAgentOptions`)
+
 ## Async

 - Always `async/await`, never `.then()` chains
+- Use `AbortSignal` for cancellation: `AbortController` to create signals, pass to long-running operations
+- `spawn-safe.ts` and adapter functions accept `abortSignal: AbortSignal | null` parameter
+- On abort: child processes receive `SIGTERM`, async operations should check `signal.aborted`
+- No enforced Biome/Vitest rules for AbortSignal usage (manual discipline required)

 ## No Dynamic Import

@@ -19,10 +19,20 @@ nerve knowledge query --repo /path "query"  # search specific repo

 ## Embedding

- Remote service: configured via `EMBED_SERVICE_URL` env var (self-hosted Cloudflare Worker + KV cache)
- Model: Dashscope text-embedding-v3 (1024 dims)
- Cache: content-addressable (sha256 of model+text), never expires
- Fallback: word-overlap scoring when embed service not configured
+- **Default model**: Dashscope text-embedding-v3 (1024 dimensions)
+- **Remote service**: configured via `EMBED_SERVICE_URL` env var (self-hosted Cloudflare Worker + KV cache)
+- **Model configuration**: No mechanism to specify alternate models — hardcoded to text-embedding-v3 in remote service
+- **Vector dimensions**: Fixed at 1024 (Float32Array, stored as 4096-byte Buffer blobs in SQLite)
+- **Cache**: content-addressable (sha256 of model+text), never expires
+- **Fallback**: word-overlap scoring when embed service not configured
+
+### Configuration
+
+The embedding model is **not configurable** through `knowledge.yaml` or other config files. The remote service at `embed.shazhou.workers.dev` uses Dashscope text-embedding-v3 exclusively. To use different models, you would need to:
+
+1. Deploy your own embedding service compatible with the same API
+2. Point `EMBED_SERVICE_URL` to your service
+3. Ensure vector dimensions match (1024) or modify knowledge database schema
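The 4096-byte figure above is just 1024 components times 4 bytes per float32. A minimal sketch of the round trip, assuming host byte order and using `Uint8Array` in place of Node's `Buffer` (which is a `Uint8Array` subclass); `embeddingToBytes` and `bytesToEmbedding` are hypothetical names:

```typescript
const DIMENSIONS = 1024;

// Serialize an embedding to raw bytes (4 bytes per float32 component).
function embeddingToBytes(vector: Float32Array): Uint8Array {
  if (vector.length !== DIMENSIONS) {
    throw new Error(`expected ${DIMENSIONS} dimensions, got ${vector.length}`);
  }
  // slice() copies, so the result does not alias the caller's buffer.
  return new Uint8Array(vector.buffer, vector.byteOffset, vector.byteLength).slice();
}

// Inverse: reinterpret the stored bytes as float32 values.
function bytesToEmbedding(bytes: Uint8Array): Float32Array {
  const copy = bytes.slice(); // fresh buffer starting at offset 0, so it is 4-byte aligned
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}
```

A service returning a different dimension count would fail the length check here, which is the mismatch step 3 warns about.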

 ## Chunking

@@ -12,15 +12,29 @@ export { snapshots as table } from "./schema.ts";  // drizzle table for runtime
 export async function compute(): Promise<ComputeResult<T>> { ... }  // pure, no args
 ```

- `compute()` is a **pure function with no arguments** — no db, no peers, no signal
- Returns `ComputeResult<T>` = `null | { signal: T; workflow: WorkflowTrigger | null }`
-  - `null` → silent, no storage, no signal
-  - `{ signal: data, workflow: null }` → persist data, emit signal
-  - `{ signal: data, workflow: { name, prompt } }` → persist data, emit signal, AND trigger workflow
- **Runtime handles persistence** — `db.insert(table).values(result.signal)` is done by `sense-runtime`, not by the sense itself
- Each Sense has its own **independent SQLite database**
- Schema defined with Drizzle ORM (`schema.ts` is single source of truth)
- Types: `SenseComputeFn`, `SenseModule`, `ComputeResult` exported from `@uncaged/nerve-core`
+**Function Signature & Input Schema:**
+- `compute()` is **parameterless**: it takes no direct inputs, though environment variables are available
+- No database access within compute; the runtime provides an isolated execution context
+- Must be a pure function (no side effects, no external API calls)
+
+**Return Value Contract:**
+- `ComputeResult<T>` = `null | { signal: T; workflow: WorkflowTrigger | null }`
+  - `null` → silent, no storage, no signal  
+  - `{ signal: data, workflow: null }` → persist + emit signal
+  - `{ signal, workflow: WorkflowTrigger }` → persist + emit signal + trigger workflow
+  - Any other value → treated as `{ signal: value, workflow: null }`
+
+**Error Handling & Serialization:**
+- Exceptions caught by worker, logged as errors (no signal emitted)
+- Signal payload must be JSON-serializable (passed via IPC)
+- Invalid workflow triggers silently dropped (signal still emitted)
+
+**Timeout & Scheduling Semantics:**
+- Timeout priority: explicit config → AbortSignal → DEFAULT_TIMEOUT_MS (30s)
+- Enforced via `Promise.race()` with timeout promise
+- Grace period can trigger `process.exit(1)` after timeout (kills worker group)
+- Interval translation: YAML config values used directly as milliseconds in `setInterval()`
+- Jitter control: throttle mechanism prevents rapid-fire, single deferred trigger per throttle window

 ## Config (nerve.yaml)

@@ -34,3 +48,14 @@ senses:
    interval: 30s        # periodic trigger (optional)
    on: [disk-pressure]  # trigger on signals from other senses (optional)
 ```
+
+## Manual Trigger Context
+
+**`nerve sense trigger <name>`** sends IPC message to running daemon. The compute context is initialized as follows:
+
+- **SQLite Database**: Opened in **read-write mode** at `data/senses/<name>.db`
+- **Migrations**: All `*.sql` files in `senses/<name>/migrations/` applied in lexicographic order
+- **Environment**: Inherits daemon process environment (no special secrets injection)
+- **Arguments**: No runtime arguments or mock inputs are supported; `compute()` is always a parameterless pure function
+- **Isolation**: Runs in forked child process (worker) with full filesystem access within user permissions
+- **Persistence**: Runtime automatically calls `db.insert(table).values(result.signal)` if compute returns non-null signal
@@ -0,0 +1,91 @@
+# Signal Routing
+
+Signal routing is the core mechanism that determines how Sense outputs flow through the Nerve system.
+
+## Routing Logic
+
+When a Sense `compute()` function returns non-null, the output goes through `routeSenseComputeOutput()` in `packages/core/src/sense-workflow-directive.ts`:
+
+```
+Sense compute() → non-null → routeSenseComputeOutput() → { signal, workflow }
+                           ↓
+          kernel.ts → signal ALWAYS emitted + optional workflow start
+```
+
+## Two Output Formats
+
+### 1. Explicit Format
+```typescript
+{
+  signal: any,           // emitted as signal
+  workflow: {           // optional workflow trigger
+    name: string,
+    maxRounds: number,
+    prompt: string,
+    dryRun: boolean
+  } | null
+}
+```
+
+### 2. Shorthand Format
+Any other value is treated as:
+```typescript
+{ signal: payload, workflow: null }
+```
+
+## Concrete Routing Predicates
+
+The routing decision is implemented in `routeSenseComputeOutput()` using these exact matching criteria:
+
+### 1. Explicit Format Detection
+```typescript
+if (isPlainRecord(payload) && Object.hasOwn(payload, "signal"))
+```
+- Payload must be a plain object
+- Must have `signal` property (any value)
+- Workflow extracted from `workflow` property or defaults to null
+
+### 2. Workflow Validation
+When workflow is non-null, it's validated via `parseWorkflowTrigger()`:
+- `name`: non-empty string (trimmed)
+- `maxRounds`: integer >= 1
+- `prompt`: string
+- `dryRun`: boolean
+
+**Critical behavior**: Invalid workflows are silently dropped (become null) but signal emission continues. This prevents malformed workflow config from blocking signals.
+
+### 3. Fallback to Shorthand
+Any value that doesn't match explicit format becomes:
+```typescript
+{ signal: payload, workflow: null }
+```
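Taken together, the three predicates above can be sketched as one function. This is an illustrative re-implementation, not the source of `routeSenseComputeOutput()`: the `*Sketch` names are hypothetical, the definition of "plain record" is an assumption, and whether the real parser returns the trimmed name is not stated in this card.

```typescript
type WorkflowTrigger = { name: string; maxRounds: number; prompt: string; dryRun: boolean };
type Routed = { signal: unknown; workflow: WorkflowTrigger | null };

// Assumed definition of "plain record": a non-null, non-array object.
function isPlainRecordSketch(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Applies the validation rules listed above; anything invalid becomes null.
function parseTriggerSketch(raw: unknown): WorkflowTrigger | null {
  if (!isPlainRecordSketch(raw)) return null;
  const { name, maxRounds, prompt, dryRun } = raw;
  if (typeof name !== "string" || name.trim() === "") return null;
  if (typeof maxRounds !== "number" || !Number.isInteger(maxRounds) || maxRounds < 1) return null;
  if (typeof prompt !== "string" || typeof dryRun !== "boolean") return null;
  return { name: name.trim(), maxRounds, prompt, dryRun };
}

function routeOutputSketch(payload: unknown): Routed {
  // The card shows Object.hasOwn; hasOwnProperty.call is equivalent on older targets.
  if (isPlainRecordSketch(payload) && Object.prototype.hasOwnProperty.call(payload, "signal")) {
    return { signal: payload["signal"], workflow: parseTriggerSketch(payload["workflow"] ?? null) };
  }
  return { signal: payload, workflow: null }; // shorthand fallback
}
```

Note that an invalid trigger only nulls out `workflow`; the `signal` field is returned unchanged, which matches the "silently dropped" behavior described above.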
+
+## Processing Flow
+
+```typescript
+// In kernel.ts handleSenseWorkerSignal()
+const { signal: signalPayload, workflow } = routeResult.value;
+
+// Signal is ALWAYS emitted when compute returns non-null
+bus.emit({ id, senseId, payload: signalPayload, timestamp });
+
+// Workflow is started ONLY if workflow is non-null
+if (workflow !== null) {
+  workflowManager.startWorkflow(workflow.name, { ... });
+}
+```
+
+## Legacy String Format (Deprecated)
+
+The old `"name|maxRounds|prompt"` string format is converted to the structured format internally but should not be used in new code.
+
+## Key Behaviors
+
+1. **Signal priority**: Every non-null compute result emits a signal, regardless of workflow
+2. **Additive behavior**: Valid workflow triggers are executed in addition to signal emission
+3. **Failure tolerance**: Invalid workflow directives are silently ignored, signal still emits
+4. **Structure-based routing**: No complex predicates; routing only checks object structure and property existence
+
+This routing mechanism ensures clean separation between perception (signals) and action (workflows) while maintaining backward compatibility.
@@ -0,0 +1,132 @@
+# Storage Layer
+
+Nerve uses multiple storage systems designed for different data types and access patterns.
+
+## Core Storage Components
+
+### 1. Log Store (`logs.db`)
+Append-only audit trail implemented in SQLite with WAL mode.
+
+**Schema:**
+- `logs` — all system events (signals, workflow transitions, sense outputs)
+- `meta` — key-value store for system metadata
+- `workflow_runs` — materialized view of workflow execution state
+
+**Key Features:**
+- Atomic workflow state updates via transactions
+- Thread message persistence for crash recovery
+- Configurable log archival to JSONL files
+- Full-text search across log entries
+
+### 2. Sense Databases
+Each sense group gets its own SQLite database for private state.
+
+**Characteristics:**
+- Isolated per sense group (e.g., `system-senses.db`)
+- Managed by individual sense compute functions
+- Drizzle ORM integration for schema management
+- No cross-sense data sharing
+
+### 3. Knowledge Store (`knowledge.db`)
+Vector-enabled search index for project context.
+
+**Contents:**
+- Chunked source files with embeddings
+- Curated knowledge cards from `.knowledge/`
+- Semantic search capabilities
+- Global vs. repo-scoped search modes
+
+### 4. Blob Store (CAS)
+Content-addressable storage for large artifacts.
+
+**Design:**
+- SHA-256 based file naming
+- Automatic deduplication
+- Used for workflow artifacts and large payloads
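Content addressing means the storage key is derived from the bytes themselves, which is what makes deduplication automatic. A minimal sketch using Node's built-in `node:crypto`; the helper name `blobKey` and the choice of a hex-encoded digest as the file name are assumptions, not details confirmed by this card.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a blob's storage key from its content.
function blobKey(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Identical content always yields the same key, so a second write of the
// same artifact can be skipped when a file with that key already exists.
const first = blobKey("workflow artifact");
const second = blobKey("workflow artifact");
console.log(first === second);
```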
+
+## Consistency & Isolation Mechanisms
+
+### SQLite WAL Mode
+All SQLite databases use `PRAGMA journal_mode=WAL` for:
+- **Writer-reader concurrency** — readers don't block writers
+- **Atomic writes** — each transaction is fully applied or rolled back
+- **Crash recovery** — WAL provides consistent state after crashes
+
+### Transaction Management
+
+#### Log Store Transactions
+Uses `BEGIN IMMEDIATE` transactions (`packages/store/src/log-store.ts`):
+```typescript
+function runInTransaction<T>(db: DatabaseSync, fn: () => T): T {
+  db.exec("BEGIN IMMEDIATE");  // Start a write transaction now; other writers get SQLITE_BUSY, WAL readers are not blocked
+  try {
+    const result = fn();
+    db.exec("COMMIT");
+    return result;
+  } catch (e) {
+    db.exec("ROLLBACK");
+    throw e;
+  }
+}
+```
+
+**Key Operations:**
+- `upsertWorkflowRun()` — atomically writes log entry + workflow state
+- `archiveLogs()` — transactional export + delete + watermark update
+
+#### Sense Database Isolation
+- Each sense group has its own SQLite file (e.g., `system-senses.db`)
+- No cross-sense transactions or coordination required
+- Independent schema migrations per sense
+- Private `_signals` table for signal history retention
+
+### Process-Level Isolation
+
+#### Worker Process Architecture
+- **One worker per sense group** — prevents data races within group
+- **One worker per workflow type** — isolated execution contexts  
+- **No shared memory** — all communication via IPC messages
+
+#### Concurrency Control
+Workflow manager enforces limits per workflow:
+```yaml
+workflows:
+  my-workflow:
+    concurrency: 2        # Max parallel threads
+    overflow: "queue"     # or "drop" 
+    maxQueue: 10         # Queue depth limit
+```
+
+### Consistency Guarantees & Failure Modes
+
+**Strong Consistency (Single Database)**:
+1. **Within Log Store** — ACID transactions with immediate consistency
+2. **Within Sense DB** — WAL mode ensures atomic commits per database
+3. **Workflow State** — `upsertWorkflowRun()` atomically updates log + materialized view
+
+**No Cross-Database Consistency**:
+- No distributed transactions across multiple SQLite files
+- Log Store and Sense Databases can temporarily diverge during failures
+- Signal emission and workflow triggering are separate, non-atomic operations
+
+**Failure Recovery Mechanisms**:
+- **Sense worker crash**: State rebuilt from sense SQLite database on respawn
+- **Workflow worker crash**: Thread state recovered from log store message history
+- **Kernel crash**: All workers respawned, state recovered from persistent stores
+- **Log Store corruption**: WAL recovery on database open
+- **Sense DB corruption**: Migrations re-run, `_signals` table rebuilt if needed
+
+**Rollback Scenarios**:
+- **Log write failure**: Transaction rolled back, no state changes persisted
+- **Sense compute failure**: Error logged, no signal/workflow emitted
+- **Workflow failure**: Thread marked as failed in materialized view
+- **IPC failure**: Worker respawned, pending operations lost (not rolled back)
+
+## Archive Strategy
+
+Logs older than retention window (default 30 days) are:
+1. Exported to `data/archive/logs/YYYY-MM-DD.jsonl`
+2. Deleted from active database
+3. Watermark updated to prevent re-processing
+
+This keeps the active database size bounded while preserving audit trails.
@@ -0,0 +1,152 @@
+# Worker Isolation
+
+Nerve's worker architecture provides process-level isolation between different kinds of user code and keeps the kernel stable when user code misbehaves.
+
+## Process Architecture
+
+```
+Kernel (Main Process)
+├── Sense Worker (Group A) ── sense-1, sense-2
+├── Sense Worker (Group B) ── sense-3, sense-4
+├── Workflow Worker (cleanup) ── cleanup workflow instances
+└── Workflow Worker (review) ── review workflow instances
+```
+
+## Isolation Boundaries
+
+### 1. Sense Workers
+- **One worker per sense group** (configured in `nerve.yaml`)
+- Senses within a group share one child process; each group's process is isolated from the others
+- Crash in one sense doesn't affect other groups
+- Each group has its own SQLite database
+
+### 2. Workflow Workers  
+- **One worker per workflow type** (spawned on-demand)
+- Multiple threads of the same workflow share a worker process
+- Concurrency limits enforced at the workflow level
+- Workers terminate when no active threads remain
+
+### 3. Kernel Protection
+- **User code never runs in kernel process**
+- All `compute()` and workflow role functions run in workers
+- Kernel only handles IPC, scheduling, and coordination
+- System remains stable even with infinite loops or crashes in user code
+
+## Worker Lifecycle
+
+### Sense Workers
+```
+nerve start → spawn worker per group → long-lived process
+            → hot reload on file changes
+            → respawn on crash
+```
+
+### Workflow Workers
+```
+workflow trigger → check existing worker → reuse or spawn
+                                       → execute thread
+                                       → terminate when idle
+```
+
+## Communication Patterns
+
+### Kernel ↔ Sense Worker
+- IPC via child process stdio
+- JSON-formatted messages
+- Worker reports signals back to kernel
+- Bidirectional: kernel can request immediate computes
+
+### Kernel ↔ Workflow Worker  
+- Similar IPC protocol
+- Workflow definition loaded in worker
+- Role execution results streamed back
+- Thread state managed in kernel
+
+## Resource Limits & Control
+
+### Timeout Enforcement
+Configurable timeouts per sense (in `nerve.yaml`):
+```yaml
+senses:
+  my-sense:
+    timeout: 30000      # Execution timeout (ms)
+    gracePeriod: 5000   # Grace period before hard kill
+```
+
+**Timeout Implementation:**
+- `AbortController` for async operations 
+- `Promise.race()` between compute and timeout
+- Grace period triggers `process.exit(1)` to kill entire worker group
+
+### Memory & CPU Limits
+**No Application-Level Resource Quotas**:
+- No memory caps, CPU throttling, or disk I/O limits enforced by Nerve
+- Workers can consume arbitrary system resources until OS limits
+- No cgroup/container isolation — full filesystem access within user permissions
+- No syscall filtering (no seccomp restrictions)
+
+**OS-Level Constraints Only**:
+- Process memory limited only by OS mechanisms (e.g., `ulimit -v` or the kernel OOM killer; `ulimit -m` is not enforced on modern Linux)
+- CPU usage bounded by scheduler only
+- Network requests unrestricted
+- Can spawn additional processes (not tracked by Nerve)
+
+### Concurrency Control
+
+#### Sense Workers
+- One active compute per sense at a time (serialized via promise chains)
+- No memory sharing between sense groups
+- Crash isolation: one sense crash doesn't affect other groups
+
+#### Workflow Workers
+Per-workflow limits configured in `nerve.yaml`:
+```yaml
+workflows:
+  my-workflow:
+    concurrency: 2        # Max parallel threads
+    overflow: "drop"      # or "queue"
+    maxQueue: 10         # Queue size limit
+```
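How the three settings interact can be sketched as a single admission decision. This is a guess at the semantics, not the workflow manager's code: `admitThread` is a hypothetical name, and the assumption that a full queue falls back to dropping is not confirmed by this card.

```typescript
type OverflowPolicy = "queue" | "drop";
type WorkflowLimits = { concurrency: number; overflow: OverflowPolicy; maxQueue: number };
type Admission = "run" | "queue" | "drop";

// Hypothetical decision function; the manager's real logic may differ.
function admitThread(active: number, queued: number, limits: WorkflowLimits): Admission {
  if (active < limits.concurrency) return "run";
  // Assumption: once the queue is full, further triggers are dropped.
  if (limits.overflow === "queue" && queued < limits.maxQueue) return "queue";
  return "drop";
}
```

With the configuration shown above (`concurrency: 2`, `overflow: "drop"`), a third simultaneous trigger would be dropped and `maxQueue` would have no effect.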
+
+### Process Management
+
+#### Signal Handling
+Workers ignore session broadcast signals (SIGINT/SIGTERM):
+```typescript
+// Workers ignore terminal signals; kernel coordinates shutdown
+process.on("SIGINT", () => {}); 
+process.on("SIGTERM", () => {});
+```
+
+#### Graceful Shutdown & State Handoff
+**Sense Workers**:
+- IPC `shutdown` message → `process.exit(0)` (immediate)
+- No graceful termination period for senses
+- State rebuilt from SQLite on respawn (no handoff needed)
+
+**Workflow Workers**:
+- IPC `shutdown` → wait for in-flight threads to complete
+- Drain timeout: `WORKER_SHUTDOWN_TIMEOUT_MS` (10s)
+- If threads don't complete → `SIGKILL` force termination
+- Thread state preserved in log store for crash recovery
+
+**State Handoff Mechanism**:
+- No explicit state transfer between old/new workers
+- Sense workers: SQLite database contains full state
+- Workflow workers: Log store contains thread message history
+- Kernel coordinates recovery via `recoverThreadsForWorker()`
+
+## Failure Handling
+
+### Worker Crashes
+- **Sense workers**: Automatic respawn after 1s delay, state rebuilt from DB
+- **Workflow workers**: Crash recovery from log store thread messages  
+- **Kernel protection**: Main process continues, marks affected runs as crashed
+- **Crash limits**: Max 5 crashes per workflow in 60s window (prevents infinite respawn)
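The crash limit above behaves like a sliding-window counter. A minimal sketch, assuming the limit means "up to 5 crashes are tolerated and the 6th inside the window stops respawning"; the class `CrashWindow` is hypothetical, and the boundary interpretation is a guess.

```typescript
const MAX_CRASHES = 5;
const WINDOW_MS = 60_000;

// Hypothetical tracker: records crash timestamps and reports whether another
// respawn is still allowed.
class CrashWindow {
  private crashes: number[] = [];

  // Record a crash at `now` (ms since any fixed epoch); true means respawn is allowed.
  recordCrash(now: number): boolean {
    // Forget crashes that have aged out of the window.
    this.crashes = this.crashes.filter((t) => now - t < WINDOW_MS);
    this.crashes.push(now);
    return this.crashes.length <= MAX_CRASHES;
  }
}
```

Passing the timestamp in, rather than calling `Date.now()` inside, keeps the counter deterministic under test.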
+
+### Resource Exhaustion
+- **Memory**: Worker process killed by OS, kernel respawns automatically
+- **Compute timeout**: Grace period → hard kill → respawn
+- **Infinite loops**: Timeout enforcement prevents hanging indefinitely
+
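+This architecture keeps the kernel available when user code crashes or hangs. It is not a security sandbox: workers run with the user's full filesystem and network permissions, so it should not be relied on to contain untrusted code.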
@@ -57,3 +57,66 @@ const workflow: WorkflowDefinition<MyMeta> = {
 - `prompt: string | ((start, messages) => Promise<string>)` — static or dynamic
 - `meta: z.ZodType<M>` — Zod schema, directly (no wrapper needed)
 - `extract: LlmExtractorConfig` — provider for structured extraction
+
+## Runtime Enforcement Mechanisms
+
+### Role Authority & Validation
+
+**Role Function Lookup**:
+- Roles accessed via `def.roles[nextRole]` dictionary lookup
+- Unknown roles trigger immediate workflow error (`Unknown role: ${nextRole}`)
+- No dynamic role registration during execution
+
+**Result Validation** (`validateRoleResult()`):
+```typescript
+// Required return shape from every role function
+{ content: string, meta: Record<string, unknown> }
+```
+- `content` must be string (non-string → workflow error)
+- `meta` must be plain object (array/null/primitive → workflow error)
+- Validation failure terminates thread immediately
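The shape checks listed above can be sketched as follows. This is an illustrative re-implementation rather than the source of `validateRoleResult()`: `checkRoleResult` is a hypothetical name, and it returns an error message where the real runtime is described as terminating the thread.

```typescript
type RoleResult = { content: string; meta: Record<string, unknown> };
type Checked = { ok: true; value: RoleResult } | { ok: false; error: string };

// Hypothetical re-implementation of the documented checks.
function checkRoleResult(value: unknown): Checked {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return { ok: false, error: "role result must be an object" };
  }
  const { content, meta } = value as Record<string, unknown>;
  if (typeof content !== "string") {
    return { ok: false, error: "content must be a string" };
  }
  // Arrays and null both have typeof "object", so they need explicit rejection.
  if (typeof meta !== "object" || meta === null || Array.isArray(meta)) {
    return { ok: false, error: "meta must be a plain object" };
  }
  return { ok: true, value: { content, meta: meta as Record<string, unknown> } };
}
```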
+
+### Moderator Authority & Routing Control
+
+**Next Role Selection**:
+- Moderator must return role name from `roles` keys OR `END` symbol
+- Called after every role completion (receives full context)
+- No validation of role name until execution attempt
+- Pure function constraint: cannot perform side effects
+
+**Causal Chain Integrity**:
+- Moderator receives immutable history: `{ start, steps }`
+- Steps array contains ALL role outputs in chronological order
+- No role can modify prior steps or start metadata
+- Thread context built from log store on crash recovery
+
+### Unauthorized Command Event Prevention
+
+**Message Flow Control**:
+- Role functions have NO direct access to kernel IPC
+- All outputs flow through `sendWorkflowMessage()` wrapper
+- Worker process validates messages before kernel transmission
+- No direct log store database access from roles
+
+**Process Isolation**:
+- Roles execute in forked worker processes (not kernel)
+- File system access limited to user permissions
+- No network isolation (roles can make arbitrary HTTP calls)
+- Worker starts in the workflow workspace, but file access outside it is not restricted beyond user permissions
+
+### Concurrent Thread Management
+
+**Kill Flag Implementation**:
+```typescript
+type KillFlag = { value: boolean };
+// Checked before role execution and after completion
+if (killFlag.value) {
+  sendThreadEvent(runId, "killed", { exitCode: 137 });
+  return;
+}
+```
+
+**Concurrency Enforcement**:
+- Workflow manager enforces per-workflow limits in kernel
+- Excess threads queued/dropped per overflow policy
+- No role can spawn additional threads (no access to workflow manager)