diff --git a/.knowledge/adapter-isolation.md b/.knowledge/adapter-isolation.md new file mode 100644 index 0000000..e7f9c75 --- /dev/null +++ b/.knowledge/adapter-isolation.md @@ -0,0 +1,171 @@ +# Adapter Process Isolation + +Describes sandboxing, process isolation, resource limits, and timeout enforcement for adapter invocations in the Nerve workflow system. + +## Process Isolation Model + +Adapters run in a **two-tier isolation** model: + +1. **Workflow Worker Process** — Each workflow runs in a dedicated Node.js worker process (`workflow-worker.ts`) forked from the main daemon +2. **Adapter Child Process** — Each adapter spawns CLI tools as child processes via `spawnSafe()` with `shell: false` + +## Resource Limits & Timeouts + +### Adapter-Level Timeouts + +- **Default timeout**: 300 seconds (300,000ms) for both cursor and hermes adapters +- **Configurable** via `AgentConfig.timeout` in adapter factory functions +- **Wall-clock enforcement** using `setTimeout()` — kills child process with `SIGTERM` on timeout +- **AbortSignal support** — external cancellation triggers immediate `SIGTERM` + +### Timeout Behavior + +```ts +// Timeout resolution priority (packages/core/src/spawn-safe.ts): +// 1. Explicit timeoutMs value +// 2. AbortSignal presence → no internal timer (relies on external abort) +// 3. 
DEFAULT_TIMEOUT_MS (300_000) fallback +``` + +- Child process terminated with `SIGTERM` on timeout/abort +- Returns `{ kind: "timeout", stdout, stderr }` error result +- **No grace period** — immediate kill +- **No SIGKILL escalation** — relies entirely on `SIGTERM` effectiveness + +#### SIGTERM Limitations + +If a child process **ignores or blocks `SIGTERM`** (e.g., signal handlers, blocked delivery): + +- **No fallback to `SIGKILL`** — process may remain alive indefinitely +- **No escalation timer** — spawnSafe() does not implement progressive signal escalation +- **Potential zombie/orphan risk** — unresponsive processes continue consuming resources +- **OS-level cleanup only** — relies on parent process death or OS reaping mechanisms + +## Sandboxing Characteristics + +### What's Isolated + +- **File system**: Child process runs in specified `cwd` (workflow working directory) +- **Environment**: Controlled env vars via `nerveCommandEnv()` + optional overrides +- **Network**: No explicit restrictions (inherits parent process network access) +- **Process tree**: Child processes are direct children, not containerized + +### What's NOT Sandboxed + +- **No resource quotas** (CPU, memory, disk I/O limits) +- **No filesystem chroot/containers** — full filesystem access within user permissions +- **No network isolation** — can make arbitrary network calls +- **No syscall filtering** — no seccomp or similar restrictions + +#### Runtime Resource Enforcement + +**No active resource monitoring or constraints**: + +- **No cgroups** (Linux) — no CPU, memory, or I/O limits enforced +- **No job objects** (Windows) — no resource quotas or process tree limits +- **No worker_threads resource tracking** — Node.js worker processes run unrestricted +- **Pure timeout-based enforcement** — only wall-clock time limits via `setTimeout()` +- **OS-scheduled resource sharing** — relies entirely on operating system process scheduling + +Adapters can consume unlimited: +- **CPU time** (until 
timeout) +- **Memory** (until OOM) +- **Disk I/O** (no quotas) +- **Network bandwidth** (no throttling) +- **File descriptors** (until ulimit) + +#### Environment Variable Security + +The `nerveCommandEnv()` function provides **minimal sanitization**: + +```ts +// spawn-safe.ts lines 47-55 +export function nerveCommandEnv(): SpawnEnv { + const home = homedir(); + const pnpmHome = join(home, ".local/share/pnpm"); + return { + ...process.env, // ← Full parent environment inherited + PNPM_HOME: pnpmHome, + PATH: `${pnpmHome}:${process.env.PATH ?? ""}`, + }; +} +``` + +- **No filtering of sensitive keys** — `NODE_OPTIONS`, `LD_PRELOAD`, `PYTHONPATH` passed through unchanged +- **Full environment inheritance** — all parent process environment variables copied +- **Injection risk** — malicious env vars (e.g., `NODE_OPTIONS=--require=evil.js`) affect Node.js child processes +- **Path manipulation** — sensitive PATH entries remain accessible to adapters + +## Security Model + +### Execution Context + +- Uses `shell: false` to prevent shell injection attacks +- Arguments passed as separate array elements (not shell-parsed) +- PATH includes `~/.local/share/pnpm` for tool discovery +- Inherits parent process user/group permissions + +#### File Descriptor Management + +```ts +// spawn-safe.ts line 122 +stdio: ["ignore", "pipe", "pipe"] +``` + +- **stdin closed**: Child receives no input (`stdio[0]: "ignore"`) +- **stdout/stderr captured**: Piped to parent for collection (`stdio[1,2]: "pipe"`) +- **No explicit fd closing**: Node.js default behavior — inherits other file descriptors +- **Parent sockets/pipes accessible**: Child can access parent's open network connections, database handles, etc. 
+- **Security risk**: Adapter processes may access unintended parent file descriptors + +### Attack Surface + +- CLI tools have **full user-level filesystem access** +- Can spawn additional processes (not tracked/limited) +- Network requests unrestricted +- Resource consumption relies on OS-level limits + +## Worker Process Management + +### Workflow Isolation + +- Each workflow type gets dedicated worker process +- Worker processes handle multiple concurrent threads (runIds) +- Kill flags enable per-thread cancellation without killing worker +- Graceful shutdown waits up to 10 seconds for in-flight operations + +#### Cross-RunId Contamination Risks + +**Shared mutable state** poses contamination risks between concurrent runIds: + +- **`process.env` mutations**: Environment changes affect all subsequent runIds in same worker +- **`require.cache` pollution**: Module cache shared across all runIds — side effects persist +- **Global variables**: Any global state mutations from one runId visible to others +- **`process.cwd()` changes**: Working directory changes affect entire worker process +- **File descriptors**: Open files/sockets shared between runId executions + +**No runId-specific scoping** implemented: +- Worker reuses single Node.js process for efficiency +- Each role execution sees cumulative environment from previous runIds +- **Mitigation relies on adapter discipline** — clean implementations avoid global mutations + +### Error Handling + +- Adapter failures don't crash the worker process +- Timeout/abort errors are isolated to specific role execution +- Worker process survives adapter failures and continues serving other threads + +## Configuration + +```yaml +# Example nerve.yaml configuration for timeout overrides +workflows: + my-workflow: + roles: + coder: + adapter: + type: cursor + timeout: 600000 # 10 minutes in milliseconds +``` + +Timeout configuration happens at the adapter creation level, not as a system-wide sandbox policy. 
\ No newline at end of file diff --git a/.knowledge/adapter.md b/.knowledge/adapter.md index 7fe1c2f..db124b0 100644 --- a/.knowledge/adapter.md +++ b/.knowledge/adapter.md @@ -9,9 +9,19 @@ type AgentFn = (prompt: string, context: WorkflowContext) => Promise ``` - Input: prompt + context (start frame, messages, workdir, AbortSignal) -- Output: raw string — structured extraction is separate +- Output: **single-shot `Promise`** — no streaming support - Adapter handles tool-specific details internally +### Streaming Limitations + +The `AgentFn` protocol does **not** support streaming responses (`AsyncIterable` or `ReadableStream`). It's strictly limited to single-shot `Promise` returns. + +For long-running or incremental agent outputs: +- CLI tools buffer full output until completion +- Timeout enforcement via `timeoutMs` (default 300s) +- No intermediate results exposed to workflow logic +- Progress indication happens at the CLI tool level only + ## Available Adapters | Package | Adapter | Tool | @@ -45,3 +55,14 @@ extract: ``` Two-level merge: global → role override. Retry once on parse failure (feeds error back to LLM), then throw `ExtractError`. + +## Error Handling + +When adapters' underlying CLI tools (e.g., `cursor-agent` or `hermes`) fail, errors are surfaced **as rejections of the returned `Promise`** with no fallback/retry logic: + +- **Missing/unavailable tool**: `spawn_failed` error when CLI binary not found in `$PATH` +- **Non-zero exit code**: `non_zero_exit` error with captured stdout/stderr +- **Timeout**: `timeout` error when execution exceeds configured `timeoutMs` +- **Abort signal**: `aborted` error when `AbortSignal` triggers cancellation + +All errors are thrown as `Error` instances as soon as the failure is detected, with descriptive messages (e.g., `"cursor-agent: exitCode=7 stdout=... stderr=..."`). No automatic retries or fallback adapters. 
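The mapping from failure kinds to thrown errors described above can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: the `AdapterFailure` type, its field names, and both helper functions are hypothetical, and only the four `kind` values and the `exitCode=... stdout=... stderr=...` message style come from the text.

```typescript
// Hypothetical failure shape for illustration. Only the four `kind` values and
// the exitCode/stdout/stderr message style are taken from the text above.
type AdapterFailure =
  | { kind: "spawn_failed"; reason: string }
  | { kind: "non_zero_exit"; exitCode: number; stdout: string; stderr: string }
  | { kind: "timeout"; stdout: string; stderr: string }
  | { kind: "aborted" };

// Builds the descriptive message for one failure kind.
function describeFailure(tool: string, failure: AdapterFailure): string {
  switch (failure.kind) {
    case "spawn_failed":
      return `${tool}: spawn_failed ${failure.reason}`;
    case "non_zero_exit":
      return `${tool}: exitCode=${failure.exitCode} stdout=${failure.stdout} stderr=${failure.stderr}`;
    case "timeout":
      return `${tool}: timeout stdout=${failure.stdout} stderr=${failure.stderr}`;
    case "aborted":
      return `${tool}: aborted`;
  }
}

// Every kind becomes a plain Error; there is no retry or fallback branch.
function toError(tool: string, failure: AdapterFailure): Error {
  return new Error(describeFailure(tool, failure));
}
```

Because all four kinds collapse into a plain `Error`, a caller that wants to retry on `timeout` but not on `spawn_failed` would have to inspect the message or wrap the adapter itself.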
diff --git a/.knowledge/architecture.md b/.knowledge/architecture.md index 0942c30..16fad6b 100644 --- a/.knowledge/architecture.md +++ b/.knowledge/architecture.md @@ -33,3 +33,14 @@ Senses own both the "what" (compute logic) and the "when" (config-driven schedul - One worker per Workflow type (on-demand) - Workers never talk to each other - All user code runs in isolated Workers; kernel never loads user code directly + +## Storage Systems + +- **Log Store** — SQLite with WAL mode for audit trails and workflow state +- **Sense Databases** — Isolated SQLite per sense group for private data +- **Knowledge Store** — Vector search index for project context +- **Blob Store** — Content-addressable storage for large artifacts + +## Signal Flow + +Sense compute outputs pass through signal routing logic: every non-null output emits a signal, and a valid workflow directive additionally triggers a workflow (see `signal-routing.md`). diff --git a/.knowledge/cli.md b/.knowledge/cli.md index a76cbc0..7bc7a11 100644 --- a/.knowledge/cli.md +++ b/.knowledge/cli.md @@ -6,6 +6,8 @@ ```bash nerve init # scaffold a new workspace (nerve.yaml, senses/, workflows/) +nerve init --force # reinitialize workspace even if ~/.uncaged-nerve/ exists (preserves data/) +nerve init --from # clone existing workspace from git repository nerve validate # validate nerve.yaml config nerve dev # run kernel foreground (development, Ctrl+C to stop) nerve start # start daemon (background) @@ -14,6 +16,14 @@ nerve status # check daemon health (uptime, senses, workflows) nerve daemon # restart daemon (stop + start) ``` +### Init Behavior + +**Default `nerve init`**: Creates workspace at `~/.uncaged-nerve/`. If this directory already exists and is non-empty, **exits with error** requiring `--force` flag. No merge/overwrite logic — prevents accidental workspace destruction. + +**Force mode `nerve init --force`**: Reinitializes workspace even if `~/.uncaged-nerve/` exists. 
**Preserves `data/` directory** (containing sense SQLite databases and logs) but overwrites all config files (`nerve.yaml`, `package.json`, etc.) and example senses. + +**Git clone `nerve init --from `**: Clones existing repository to `~/.uncaged-nerve/`. Requires empty target directory — fails if workspace already exists and is non-empty. + ## Sense Management ```bash diff --git a/.knowledge/coding-conventions.md b/.knowledge/coding-conventions.md index b14253b..f037779 100644 --- a/.knowledge/coding-conventions.md +++ b/.knowledge/coding-conventions.md @@ -24,6 +24,21 @@ type Config = { throttle?: string } - `throw` only for programmer errors (bugs) - No try-catch for flow control +### Result Type + +Defined in `@uncaged/nerve-core` (`packages/core/src/result.ts`): + +```ts +export type Result<T, E> = { ok: true; value: T } | { ok: false; error: E }; +``` + +**Discriminated union** with tagged `ok` field. Helper functions: +- `ok(value)` → `{ ok: true, value }` +- `err(error)` → `{ ok: false, error }` + +**Exhaustive handling**: Pattern is `if (!result.ok) { handle error }` then access `result.value`. +Nothing forces callers to check `ok` before proceeding; the pattern relies on manual discipline, with TypeScript's control-flow narrowing only guarding access to `result.value` and `result.error`. 
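The handling pattern above can be illustrated with a short sketch. The `Result` shape and the `ok`/`err` helpers mirror the description; the generic parameters and the `parseTimeoutMs` and `describeTimeout` functions are assumptions made up for this example and are not part of `@uncaged/nerve-core`.

```typescript
// Assumed shape, mirroring the Result type described above; the generic
// parameters T and E are an assumption of this sketch.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Hypothetical example function: parses a timeout string such as "600000".
function parseTimeoutMs(raw: string): Result<number, string> {
  const n = Number(raw);
  if (!isFinite(n) || Math.floor(n) !== n || n <= 0) {
    return err(`invalid timeout: ${raw}`);
  }
  return ok(n);
}

// Early-return handling. The explicit `=== false` comparison plays the role of
// `!result.ok` and is used here so narrowing does not depend on compiler
// strictness settings.
function describeTimeout(raw: string): string {
  const result = parseTimeoutMs(raw);
  if (result.ok === false) {
    return `error: ${result.error}`;
  }
  return `timeout is ${result.value} ms`;
}
```

Expected failures travel through the return value, which leaves `throw` for programmer errors as the convention above asks.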
+ ## Naming | Type | Style | @@ -38,9 +53,25 @@ type Config = { throttle?: string } - Always named exports, never default - One module = one responsibility +### Module Naming Conventions + +**Primary exports** use descriptive, unambiguous names: +- Functions: `createXxx()`, `parseXxx()`, `xxxAgent()` (e.g., `createCursorAdapter`, `cursorAgent`) +- Types: Domain-specific prefixes (e.g., `CursorAgentOptions`, `SenseComputeFn`, `WorkflowContext`) +- Constants: `UPPER_SNAKE_CASE` with context (e.g., `DEFAULT_SENSE_SIGNAL_RETENTION`, `CURSOR_ADAPTER_DEFAULT_MS`) + +**Avoiding ambiguity**: +- Package-scoped naming: `@uncaged/nerve-adapter-cursor` exports `cursorAgent`, `createCursorAdapter` +- Factory pattern: `createXxxAdapter()` for configurable instances, `xxxAdapter` for defaults +- Descriptive type prefixes prevent collision (e.g., `CursorAgentOptions` vs `HermesAgentOptions`) + ## Async - Always `async/await`, never `.then()` chains +- Use `AbortSignal` for cancellation: `AbortController` to create signals, pass to long-running operations +- `spawn-safe.ts` and adapter functions accept `abortSignal: AbortSignal | null` parameter +- On abort: child processes receive `SIGTERM`, async operations should check `signal.aborted` +- No enforced Biome/Vitest rules for AbortSignal usage (manual discipline required) ## No Dynamic Import diff --git a/.knowledge/knowledge-layer.md b/.knowledge/knowledge-layer.md index 85dc229..50df031 100644 --- a/.knowledge/knowledge-layer.md +++ b/.knowledge/knowledge-layer.md @@ -19,10 +19,20 @@ nerve knowledge query --repo /path "query" # search specific repo ## Embedding -- Remote service: configured via `EMBED_SERVICE_URL` env var (self-hosted Cloudflare Worker + KV cache) -- Model: Dashscope text-embedding-v3 (1024 dims) -- Cache: content-addressable (sha256 of model+text), never expires -- Fallback: word-overlap scoring when embed service not configured +- **Default model**: Dashscope text-embedding-v3 (1024 dimensions) +- **Remote 
service**: configured via `EMBED_SERVICE_URL` env var (self-hosted Cloudflare Worker + KV cache) +- **Model configuration**: No mechanism to specify alternate models — hardcoded to text-embedding-v3 in remote service +- **Vector dimensions**: Fixed at 1024 (Float32Array, stored as 4096-byte Buffer blobs in SQLite) +- **Cache**: content-addressable (sha256 of model+text), never expires +- **Fallback**: word-overlap scoring when embed service not configured + +### Configuration + +The embedding model is **not configurable** through `knowledge.yaml` or other config files. The remote service at `embed.shazhou.workers.dev` uses Dashscope text-embedding-v3 exclusively. To use different models, you would need to: + +1. Deploy your own embedding service compatible with the same API +2. Point `EMBED_SERVICE_URL` to your service +3. Ensure vector dimensions match (1024) or modify knowledge database schema ## Chunking diff --git a/.knowledge/sense.md b/.knowledge/sense.md index 2625582..2c1d646 100644 --- a/.knowledge/sense.md +++ b/.knowledge/sense.md @@ -12,15 +12,29 @@ export { snapshots as table } from "./schema.ts"; // drizzle table for runtime export async function compute(): Promise> { ... 
} // pure, no args ``` -- `compute()` is a **pure function with no arguments** — no db, no peers, no signal -- Returns `ComputeResult` = `null | { signal: T; workflow: WorkflowTrigger | null }` - - `null` → silent, no storage, no signal - - `{ signal: data, workflow: null }` → persist data, emit signal - - `{ signal: data, workflow: { name, prompt } }` → persist data, emit signal, AND trigger workflow -- **Runtime handles persistence** — `db.insert(table).values(result.signal)` is done by `sense-runtime`, not by the sense itself -- Each Sense has its own **independent SQLite database** -- Schema defined with Drizzle ORM (`schema.ts` is single source of truth) -- Types: `SenseComputeFn`, `SenseModule`, `ComputeResult` exported from `@uncaged/nerve-core` +**Function Signature & Input Schema:** +- `compute()` is **parameterless** — no direct inputs, environment variables available +- No database access within compute — runtime provides isolated execution context +- Must be pure function (no side effects, no external API calls) + +**Return Value Contract:** +- `ComputeResult` = `null | { signal: T; workflow: WorkflowTrigger | null }` + - `null` → silent, no storage, no signal + - `{ signal: data, workflow: null }` → persist + emit signal + - `{ signal, workflow: WorkflowTrigger }` → persist + emit signal + trigger workflow + - Any other value → treated as `{ signal: value, workflow: null }` + +**Error Handling & Serialization:** +- Exceptions caught by worker, logged as errors (no signal emitted) +- Signal payload must be JSON-serializable (passed via IPC) +- Invalid workflow triggers silently dropped (signal still emitted) + +**Timeout & Scheduling Semantics:** +- Timeout priority: explicit config → AbortSignal → DEFAULT_TIMEOUT_MS (30s) +- Enforced via `Promise.race()` with timeout promise +- Grace period can trigger `process.exit(1)` after timeout (kills worker group) +- Interval translation: YAML config values used directly as milliseconds in `setInterval()` +- 
Jitter control: throttle mechanism prevents rapid-fire, single deferred trigger per throttle window ## Config (nerve.yaml) @@ -34,3 +48,14 @@ senses: interval: 30s # periodic trigger (optional) on: [disk-pressure] # trigger on signals from other senses (optional) ``` + +## Manual Trigger Context + +**`nerve sense trigger `** sends IPC message to running daemon. The compute context is initialized as follows: + +- **SQLite Database**: Opened in **read-write mode** at `data/senses/.db` +- **Migrations**: All `*.sql` files in `senses//migrations/` applied in lexicographic order +- **Environment**: Inherits daemon process environment (no special secrets injection) +- **Arguments**: No runtime arguments or mock inputs supported — `compute()` is always pure function with no parameters +- **Isolation**: Runs in forked child process (worker) with full filesystem access within user permissions +- **Persistence**: Runtime automatically calls `db.insert(table).values(result.signal)` if compute returns non-null signal diff --git a/.knowledge/signal-routing.md b/.knowledge/signal-routing.md new file mode 100644 index 0000000..8481730 --- /dev/null +++ b/.knowledge/signal-routing.md @@ -0,0 +1,91 @@ +# Signal Routing + +Signal routing is the core mechanism that determines how Sense outputs flow through the Nerve system. + +## Routing Logic + +When a Sense `compute()` function returns non-null, the output goes through `routeSenseComputeOutput()` in `packages/core/src/sense-workflow-directive.ts`: + +``` +Sense compute() → non-null → routeSenseComputeOutput() → { signal, workflow } + ↓ + kernel.ts → signal ALWAYS emitted + optional workflow start +``` + +## Two Output Formats + +### 1. Explicit Format +```typescript +{ + signal: any, // emitted as signal + workflow: { // optional workflow trigger + name: string, + maxRounds: number, + prompt: string, + dryRun: boolean + } | null +} +``` + +### 2. 
Shorthand Format +Any other value is treated as: +```typescript +{ signal: payload, workflow: null } +``` + +## Workflow Directive Parsing + +## Concrete Routing Predicates + +The routing decision is implemented in `routeSenseComputeOutput()` using these exact matching criteria: + +### 1. Explicit Format Detection +```typescript +if (isPlainRecord(payload) && Object.hasOwn(payload, "signal")) +``` +- Payload must be a plain object +- Must have `signal` property (any value) +- Workflow extracted from `workflow` property or defaults to null + +### 2. Workflow Validation +When workflow is non-null, it's validated via `parseWorkflowTrigger()`: +- `name`: non-empty string (trimmed) +- `maxRounds`: positive integer >= 1 +- `prompt`: string +- `dryRun`: boolean + +**Critical behavior**: Invalid workflows are silently dropped (become null) but signal emission continues. This prevents malformed workflow config from blocking signals. + +### 3. Fallback to Shorthand +Any value that doesn't match explicit format becomes: +```typescript +{ signal: payload, workflow: null } +``` + +## Processing Flow + +```typescript +// In kernel.ts handleSenseWorkerSignal() +const { signal: signalPayload, workflow } = routeResult.value; + +// Signal is ALWAYS emitted when compute returns non-null +bus.emit({ id, senseId, payload: signalPayload, timestamp }); + +// Workflow is started ONLY if workflow is non-null +if (workflow !== null) { + workflowManager.startWorkflow(workflow.name, { ... }); +} +``` + +## Legacy String Format (Deprecated) + +The old `"name|maxRounds|prompt"` string format is converted to the structured format internally but should not be used in new code. + +## Key Behaviors + +1. **Signal priority**: Every non-null compute result emits a signal, regardless of workflow +2. **Additive behavior**: Valid workflow triggers are executed in addition to signal emission +3. **Failure tolerance**: Invalid workflow directives are silently ignored, signal still emits +4. 
**Structure-based routing**: No complex predicates - simply checks object structure and property existence + +This routing mechanism ensures clean separation between perception (signals) and action (workflows) while maintaining backward compatibility. \ No newline at end of file diff --git a/.knowledge/storage-layer.md b/.knowledge/storage-layer.md new file mode 100644 index 0000000..c6e4ae0 --- /dev/null +++ b/.knowledge/storage-layer.md @@ -0,0 +1,132 @@ +# Storage Layer + +Nerve uses multiple storage systems designed for different data types and access patterns. + +## Core Storage Components + +### 1. Log Store (`logs.db`) +Append-only audit trail implemented in SQLite with WAL mode. + +**Schema:** +- `logs` — all system events (signals, workflow transitions, sense outputs) +- `meta` — key-value store for system metadata +- `workflow_runs` — materialized view of workflow execution state + +**Key Features:** +- Atomic workflow state updates via transactions +- Thread message persistence for crash recovery +- Configurable log archival to JSONL files +- Full-text search across log entries + +### 2. Sense Databases +Each sense group gets its own SQLite database for private state. + +**Characteristics:** +- Isolated per sense group (e.g., `system-senses.db`) +- Managed by individual sense compute functions +- Drizzle ORM integration for schema management +- No cross-sense data sharing + +### 3. Knowledge Store (`knowledge.db`) +Vector-enabled search index for project context. + +**Contents:** +- Chunked source files with embeddings +- Curated knowledge cards from `.knowledge/` +- Semantic search capabilities +- Global vs. repo-scoped search modes + +### 4. Blob Store (CAS) +Content-addressable storage for large artifacts. 
+ +**Design:** +- SHA-256 based file naming +- Automatic deduplication +- Used for workflow artifacts and large payloads + +## Consistency & Isolation Mechanisms + +### SQLite WAL Mode +All SQLite databases use `PRAGMA journal_mode=WAL` for: +- **Writer-reader concurrency** — readers don't block the writer, and the writer doesn't block readers +- **Atomic writes** — each transaction is fully applied or rolled back +- **Crash recovery** — WAL provides consistent state after crashes + +### Transaction Management + +#### Log Store Transactions +Uses `BEGIN IMMEDIATE` transactions (`packages/store/src/log-store.ts`): +```typescript +function runInTransaction<T>(db: DatabaseSync, fn: () => T): T { + db.exec("BEGIN IMMEDIATE"); // Acquire the write lock immediately (other writers wait; WAL readers continue) + try { + const result = fn(); + db.exec("COMMIT"); + return result; + } catch (e) { + db.exec("ROLLBACK"); + throw e; + } +} +``` + +**Key Operations:** +- `upsertWorkflowRun()` — atomically writes log entry + workflow state +- `archiveLogs()` — transactional export + delete + watermark update + +#### Sense Database Isolation +- Each sense group has its own SQLite file (e.g., `system-senses.db`) +- No cross-sense transactions or coordination required +- Independent schema migrations per sense +- Private `_signals` table for signal history retention + +### Process-Level Isolation + +#### Worker Process Architecture +- **One worker per sense group** — prevents data races within group +- **One worker per workflow type** — isolated execution contexts +- **No shared memory** — all communication via IPC messages + +#### Concurrency Control +Workflow manager enforces limits per workflow: +```yaml +workflows: + my-workflow: + concurrency: 2 # Max parallel threads + overflow: "queue" # or "drop" + maxQueue: 10 # Queue depth limit +``` + +### Consistency Guarantees & Failure Modes + +**Strong Consistency (Single Database)**: +1. **Within Log Store** — ACID transactions with immediate consistency +2. **Within Sense DB** — WAL mode ensures atomic commits per database +3. 
**Workflow State** — `upsertWorkflowRun()` atomically updates log + materialized view + +**No Cross-Database Consistency**: +- No distributed transactions across multiple SQLite files +- Log Store and Sense Databases can temporarily diverge during failures +- Signal emission and workflow triggering are separate, non-atomic operations + +**Failure Recovery Mechanisms**: +- **Sense worker crash**: State rebuilt from sense SQLite database on respawn +- **Workflow worker crash**: Thread state recovered from log store message history +- **Kernel crash**: All workers respawned, state recovered from persistent stores +- **Log Store corruption**: WAL recovery on database open +- **Sense DB corruption**: Migrations re-run, `_signals` table rebuilt if needed + +**Rollback Scenarios**: +- **Log write failure**: Transaction rolled back, no state changes persisted +- **Sense compute failure**: Error logged, no signal/workflow emitted +- **Workflow failure**: Thread marked as failed in materialized view +- **IPC failure**: Worker respawned, pending operations lost (not rolled back) + +## Archive Strategy + +Logs older than retention window (default 30 days) are: +1. Exported to `data/archive/logs/YYYY-MM-DD.jsonl` +2. Deleted from active database +3. Watermark updated to prevent re-processing + +This keeps the active database size bounded while preserving audit trails. \ No newline at end of file diff --git a/.knowledge/worker-isolation.md b/.knowledge/worker-isolation.md new file mode 100644 index 0000000..f7b26c9 --- /dev/null +++ b/.knowledge/worker-isolation.md @@ -0,0 +1,152 @@ +# Worker Isolation + +Nerve's worker architecture ensures complete isolation between different types of user code while maintaining system stability. 
+ +## Process Architecture + +``` +Kernel (Main Process) +├── Sense Worker (Group A) ── sense-1, sense-2 +├── Sense Worker (Group B) ── sense-3, sense-4 +├── Workflow Worker (cleanup) ── cleanup workflow instances +└── Workflow Worker (review) ── review workflow instances +``` + +## Isolation Boundaries + +### 1. Sense Workers +- **One worker per sense group** (configured in `nerve.yaml`) +- Groups share a child process but have isolated execution contexts +- Crash in one sense doesn't affect other groups +- Each group has its own SQLite database + +### 2. Workflow Workers +- **One worker per workflow type** (spawned on-demand) +- Multiple threads of the same workflow share a worker process +- Concurrency limits enforced at the workflow level +- Workers terminate when no active threads remain + +### 3. Kernel Protection +- **User code never runs in kernel process** +- All `compute()` and workflow role functions run in workers +- Kernel only handles IPC, scheduling, and coordination +- System remains stable even with infinite loops or crashes in user code + +## Worker Lifecycle + +### Sense Workers +``` +nerve daemon start → spawn worker per group → long-lived process + → hot reload on file changes + → respawn on crash +``` + +### Workflow Workers +``` +workflow trigger → check existing worker → reuse or spawn + → execute thread + → terminate when idle +``` + +## Communication Patterns + +### Kernel ↔ Sense Worker +- IPC via child process stdio +- JSON-formatted messages +- Worker reports signals back to kernel +- Bidirectional: kernel can request immediate computes + +### Kernel ↔ Workflow Worker +- Similar IPC protocol +- Workflow definition loaded in worker +- Role execution results streamed back +- Thread state managed in kernel + +## Resource Limits & Control + +### Timeout Enforcement +Configurable timeouts per sense (in `nerve.yaml`): +```yaml +senses: + my-sense: + timeout: 30000 # Execution timeout (ms) + gracePeriod: 5000 # Grace period before hard kill 
+``` + +**Timeout Implementation:** +- `AbortController` for async operations +- `Promise.race()` between compute and timeout +- Grace period triggers `process.exit(1)` to kill entire worker group + +### Memory & CPU Limits +**No Application-Level Resource Quotas**: +- No memory caps, CPU throttling, or disk I/O limits enforced by Nerve +- Workers can consume arbitrary system resources until OS limits +- No cgroup/container isolation — full filesystem access within user permissions +- No syscall filtering (no seccomp restrictions) + +**OS-Level Constraints Only**: +- Process memory limited by system `ulimit -m` +- CPU usage bounded by scheduler only +- Network requests unrestricted +- Can spawn additional processes (not tracked by Nerve) + +### Concurrency Control + +#### Sense Workers +- One active compute per sense at a time (serialized via promise chains) +- No memory sharing between sense groups +- Crash isolation: one sense crash doesn't affect other groups + +#### Workflow Workers +Per-workflow limits configured in `nerve.yaml`: +```yaml +workflows: + my-workflow: + concurrency: 2 # Max parallel threads + overflow: "drop" # or "queue" + maxQueue: 10 # Queue size limit +``` + +### Process Management + +#### Signal Handling +Workers ignore session broadcast signals (SIGINT/SIGTERM): +```typescript +// Workers ignore terminal signals; kernel coordinates shutdown +process.on("SIGINT", () => {}); +process.on("SIGTERM", () => {}); +``` + +#### Graceful Shutdown & State Handoff +**Sense Workers**: +- IPC `shutdown` message → `process.exit(0)` (immediate) +- No graceful termination period for senses +- State rebuilt from SQLite on respawn (no handoff needed) + +**Workflow Workers**: +- IPC `shutdown` → wait for in-flight threads to complete +- Drain timeout: `WORKER_SHUTDOWN_TIMEOUT_MS` (10s) +- If threads don't complete → `SIGKILL` force termination +- Thread state preserved in log store for crash recovery + +**State Handoff Mechanism**: +- No explicit state 
transfer between old/new workers +- Sense workers: SQLite database contains full state +- Workflow workers: Log store contains thread message history +- Kernel coordinates recovery via `recoverThreadsForWorker()` + +## Failure Handling + +### Worker Crashes +- **Sense workers**: Automatic respawn after 1s delay, state rebuilt from DB +- **Workflow workers**: Crash recovery from log store thread messages +- **Kernel protection**: Main process continues, marks affected runs as crashed +- **Crash limits**: Max 5 crashes per workflow in 60s window (prevents infinite respawn) + +### Resource Exhaustion +- **Memory**: Worker process killed by OS, kernel respawns automatically +- **Compute timeout**: Grace period → hard kill → respawn +- **Infinite loops**: Timeout enforcement prevents hanging indefinitely + +This architecture contains crashes and hangs in experimental user code so that the kernel stays available. It provides crash isolation, not a security sandbox: as noted above, workers keep full user-level filesystem, network, and process access. \ No newline at end of file diff --git a/.knowledge/workflow.md b/.knowledge/workflow.md index 0d58aa6..f9d0d50 100644 --- a/.knowledge/workflow.md +++ b/.knowledge/workflow.md @@ -57,3 +57,66 @@ const workflow: WorkflowDefinition = { - `prompt: string | ((start, messages) => Promise)` — static or dynamic - `meta: z.ZodType` — Zod schema, directly (no wrapper needed) - `extract: LlmExtractorConfig` — provider for structured extraction + +## Runtime Enforcement Mechanisms + +### Role Authority & Validation + +**Role Function Lookup**: +- Roles accessed via `def.roles[nextRole]` dictionary lookup +- Unknown roles trigger immediate workflow error (`Unknown role: ${nextRole}`) +- No dynamic role registration during execution + +**Result Validation** (`validateRoleResult()`): +```typescript +// Required return shape from every role function +{ content: string, meta: Record } +``` +- `content` must be string (non-string → workflow error) +- `meta` must be plain object (array/null/primitive → workflow error) +- Validation failure terminates thread immediately + 
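The validation rules listed above could be checked along these lines. This is a hedged sketch only: `checkRoleResult` and `isObjectRecord` are hypothetical stand-ins, the real `validateRoleResult()` may differ, and the `Record<string, unknown>` type arguments are an assumption because the source elides them.

```typescript
// Assumed result shape; the type arguments of `Record` are a guess for this sketch.
type RoleResult = { content: string; meta: Record<string, unknown> };
type Checked = { ok: true; value: RoleResult } | { ok: false; error: string };

// Accepts non-null, non-array objects. Class instances would also pass this simple check.
function isObjectRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical stand-in for validateRoleResult(); the real function may differ.
function checkRoleResult(raw: unknown): Checked {
  if (!isObjectRecord(raw)) {
    return { ok: false, error: "role result must be an object" };
  }
  const content = raw["content"];
  const meta = raw["meta"];
  if (typeof content !== "string") {
    return { ok: false, error: "content must be a string" };
  }
  if (!isObjectRecord(meta)) {
    return { ok: false, error: "meta must be a plain object" };
  }
  return { ok: true, value: { content, meta } };
}
```

In this sketch an invalid result is reported as a value, so the caller decides how to terminate the thread.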
+### Moderator Authority & Routing Control + +**Next Role Selection**: +- Moderator must return role name from `roles` keys OR `END` symbol +- Called after every role completion (receives full context) +- No validation of role name until execution attempt +- Pure function constraint: cannot perform side effects + +**Causal Chain Integrity**: +- Moderator receives immutable history: `{ start, steps }` +- Steps array contains ALL role outputs in chronological order +- No role can modify prior steps or start metadata +- Thread context built from log store on crash recovery + +### Unauthorized Command Event Prevention + +**Message Flow Control**: +- Role functions have NO direct access to kernel IPC +- All outputs flow through `sendWorkflowMessage()` wrapper +- Worker process validates messages before kernel transmission +- No direct log store database access from roles + +**Process Isolation**: +- Roles execute in forked worker processes (not kernel) +- File system access limited to user permissions +- No network isolation (roles can make arbitrary HTTP calls) +- Worker has read/write access to workflow workspace only + +### Concurrent Thread Management + +**Kill Flag Implementation**: +```typescript +type KillFlag = { value: boolean }; +// Checked before role execution and after completion +if (killFlag.value) { + sendThreadEvent(runId, "killed", { exitCode: 137 }); + return; +} +``` + +**Concurrency Enforcement**: +- Workflow manager enforces per-workflow limits in kernel +- Excess threads queued/dropped per overflow policy +- No role can spawn additional threads (no access to workflow manager)
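
The per-workflow limits and overflow policy above can be modelled with a small admission gate. This is a minimal sketch under stated assumptions: `createThreadGate`, `Limits`, and `Admission` are hypothetical names, and dropping a thread once the queue is full under the `"queue"` policy is an assumption, since the text does not say what happens in that case.

```typescript
type Limits = { concurrency: number; overflow: "queue" | "drop"; maxQueue: number };
type Admission = "run" | "queued" | "dropped";

// Hypothetical gate; the real workflow manager's bookkeeping may differ.
function createThreadGate(limits: Limits) {
  let active = 0;
  const queue: string[] = [];
  return {
    // Decides what happens to a newly triggered thread.
    admit(runId: string): Admission {
      if (active < limits.concurrency) {
        active += 1;
        return "run";
      }
      if (limits.overflow === "queue" && queue.length < limits.maxQueue) {
        queue.push(runId);
        return "queued";
      }
      // Assumption: a full queue behaves like the "drop" policy.
      return "dropped";
    },
    // Called when a running thread ends; returns the promoted runId, or null if the queue is empty.
    finish(): string | null {
      active -= 1;
      const next = queue.shift();
      if (next === undefined) {
        return null;
      }
      active += 1;
      return next;
    },
  };
}
```

Keeping this state in the kernel rather than in the worker matches the point above that roles have no access to the workflow manager.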