docs(knowledge): update cards via knowledge-extraction workflow (5q/round)

7 cards updated, 4 new cards added. Topics: signal-routing,
worker-isolation, storage-layer, adapter-isolation, sense contracts,
workflow runtime enforcement, coding conventions details.

小橘 <xiaoju@shazhou.work>
This commit is contained in:
2026-04-30 05:56:29 +00:00
parent 2387b73141
commit 9c832b0e21
11 changed files with 731 additions and 14 deletions
+171
View File
@@ -0,0 +1,171 @@
# Adapter Process Isolation
Describes sandboxing, process isolation, resource limits, and timeout enforcement for adapter invocations in the Nerve workflow system.
## Process Isolation Model
Adapters run in a **two-tier isolation** model:
1. **Workflow Worker Process** — Each workflow runs in a dedicated Node.js worker process (`workflow-worker.ts`) forked from the main daemon
2. **Adapter Child Process** — Each adapter spawns CLI tools as child processes via `spawnSafe()` with `shell: false`
## Resource Limits & Timeouts
### Adapter-Level Timeouts
- **Default timeout**: 300 seconds (300,000ms) for both cursor and hermes adapters
- **Configurable** via `AgentConfig.timeout` in adapter factory functions
- **Wall-clock enforcement** using `setTimeout()` — kills child process with `SIGTERM` on timeout
- **AbortSignal support** — external cancellation triggers immediate `SIGTERM`
### Timeout Behavior
```ts
// Timeout resolution priority (packages/core/src/spawn-safe.ts):
// 1. Explicit timeoutMs value
// 2. AbortSignal presence → no internal timer (relies on external abort)
// 3. DEFAULT_TIMEOUT_MS (300_000) fallback
```
- Child process terminated with `SIGTERM` on timeout/abort
- Returns `{ kind: "timeout", stdout, stderr }` error result
- **No grace period** — immediate kill
- **No SIGKILL escalation** — relies entirely on `SIGTERM` effectiveness
#### SIGTERM Limitations
If a child process **ignores or blocks `SIGTERM`** (e.g., signal handlers, blocked delivery):
- **No fallback to `SIGKILL`** — process may remain alive indefinitely
- **No escalation timer** — spawnSafe() does not implement progressive signal escalation
- **Potential zombie/orphan risk** — unresponsive processes continue consuming resources
- **OS-level cleanup only** — relies on parent process death or OS reaping mechanisms
## Sandboxing Characteristics
### What's Isolated
- **File system**: Child process runs in specified `cwd` (workflow working directory)
- **Environment**: Controlled env vars via `nerveCommandEnv()` + optional overrides
- **Network**: No explicit restrictions (inherits parent process network access)
- **Process tree**: Child processes are direct children, not containerized
### What's NOT Sandboxed
- **No resource quotas** (CPU, memory, disk I/O limits)
- **No filesystem chroot/containers** — full filesystem access within user permissions
- **No network isolation** — can make arbitrary network calls
- **No syscall filtering** — no seccomp or similar restrictions
#### Runtime Resource Enforcement
**No active resource monitoring or constraints**:
- **No cgroups** (Linux) — no CPU, memory, or I/O limits enforced
- **No job objects** (Windows) — no resource quotas or process tree limits
- **No worker_threads resource tracking** — Node.js worker processes run unrestricted
- **Pure timeout-based enforcement** — only wall-clock time limits via `setTimeout()`
- **OS-scheduled resource sharing** — relies entirely on operating system process scheduling
Adapters can consume unlimited:
- **CPU time** (until timeout)
- **Memory** (until OOM)
- **Disk I/O** (no quotas)
- **Network bandwidth** (no throttling)
- **File descriptors** (until ulimit)
#### Environment Variable Security
The `nerveCommandEnv()` function provides **minimal sanitization**:
```ts
// spawn-safe.ts lines 47-55
export function nerveCommandEnv(): SpawnEnv {
const home = homedir();
const pnpmHome = join(home, ".local/share/pnpm");
return {
...process.env, // ← Full parent environment inherited
PNPM_HOME: pnpmHome,
PATH: `${pnpmHome}:${process.env.PATH ?? ""}`,
};
}
```
- **No filtering of sensitive keys** — `NODE_OPTIONS`, `LD_PRELOAD`, `PYTHONPATH` passed through unchanged
- **Full environment inheritance** — all parent process environment variables copied
- **Injection risk** — malicious env vars (e.g., `NODE_OPTIONS=--require=evil.js`) affect Node.js child processes
- **Path manipulation** — sensitive PATH entries remain accessible to adapters
## Security Model
### Execution Context
- Uses `shell: false` to prevent shell injection attacks
- Arguments passed as separate array elements (not shell-parsed)
- PATH includes `~/.local/share/pnpm` for tool discovery
- Inherits parent process user/group permissions
#### File Descriptor Management
```ts
// spawn-safe.ts line 122
stdio: ["ignore", "pipe", "pipe"]
```
- **stdin closed**: Child receives no input (`stdio[0]: "ignore"`)
- **stdout/stderr captured**: Piped to parent for collection (`stdio[1,2]: "pipe"`)
- **No explicit fd closing**: Node.js default behavior — inherits other file descriptors
- **Parent sockets/pipes accessible**: Child can access parent's open network connections, database handles, etc.
- **Security risk**: Adapter processes may access unintended parent file descriptors
### Attack Surface
- CLI tools have **full user-level filesystem access**
- Can spawn additional processes (not tracked/limited)
- Network requests unrestricted
- Resource consumption relies on OS-level limits
## Worker Process Management
### Workflow Isolation
- Each workflow type gets dedicated worker process
- Worker processes handle multiple concurrent threads (runIds)
- Kill flags enable per-thread cancellation without killing worker
- Graceful shutdown waits up to 10 seconds for in-flight operations
#### Cross-RunId Contamination Risks
**Shared mutable state** poses contamination risks between concurrent runIds:
- **`process.env` mutations**: Environment changes affect all subsequent runIds in same worker
- **`require.cache` pollution**: Module cache shared across all runIds — side effects persist
- **Global variables**: Any global state mutations from one runId visible to others
- **`process.cwd()` changes**: Working directory changes affect entire worker process
- **File descriptors**: Open files/sockets shared between runId executions
**No runId-specific scoping** implemented:
- Worker reuses single Node.js process for efficiency
- Each role execution sees cumulative environment from previous runIds
- **Mitigation relies on adapter discipline** — clean implementations avoid global mutations
### Error Handling
- Adapter failures don't crash the worker process
- Timeout/abort errors are isolated to specific role execution
- Worker process survives adapter failures and continues serving other threads
## Configuration
```yaml
# Example nerve.yaml configuration for timeout overrides
workflows:
my-workflow:
roles:
coder:
adapter:
type: cursor
timeout: 600000 # 10 minutes in milliseconds
```
Timeout configuration happens at the adapter creation level, not as a system-wide sandbox policy.
+22 -1
View File
@@ -9,9 +9,19 @@ type AgentFn = (prompt: string, context: WorkflowContext) => Promise<string>
```
- Input: prompt + context (start frame, messages, workdir, AbortSignal)
- Output: raw string — structured extraction is separate
- Output: **single-shot `Promise<string>`** — no streaming support
- Adapter handles tool-specific details internally
### Streaming Limitations
The `AgentFn` protocol does **not** support streaming responses (`AsyncIterable<string>` or `ReadableStream`). It's strictly limited to single-shot `Promise<string>` returns.
For long-running or incremental agent outputs:
- CLI tools buffer full output until completion
- Timeout enforcement via `timeoutMs` (default 300s)
- No intermediate results exposed to workflow logic
- Progress indication happens at the CLI tool level only
## Available Adapters
| Package | Adapter | Tool |
@@ -45,3 +55,14 @@ extract:
```
Two-level merge: global → role override. Retry once on parse failure (feeds error back to LLM), then throw `ExtractError`.
## Error Handling
When adapters' underlying CLI tools (e.g., `cursor-agent` or `hermes`) fail, errors are surfaced **synchronously via rejection** with no fallback/retry logic:
- **Missing/unavailable tool**: `spawn_failed` error when CLI binary not found in `$PATH`
- **Non-zero exit code**: `non_zero_exit` error with captured stdout/stderr
- **Timeout**: `timeout` error when execution exceeds configured `timeoutMs`
- **Abort signal**: `aborted` error when `AbortSignal` triggers cancellation
All errors are immediately thrown as `Error` instances with descriptive messages (e.g., `"cursor-agent: exitCode=7 stdout=... stderr=..."`). No automatic retries or fallback adapters.
+11
View File
@@ -33,3 +33,14 @@ Senses own both the "what" (compute logic) and the "when" (config-driven schedul
- One worker per Workflow type (on-demand)
- Workers never talk to each other
- All user code runs in isolated Workers; kernel never loads user code directly
## Storage Systems
- **Log Store** — SQLite with WAL mode for audit trails and workflow state
- **Sense Databases** — Isolated SQLite per sense group for private data
- **Knowledge Store** — Vector search index for project context
- **Blob Store** — Content-addressable storage for large artifacts
## Signal Flow
Sense compute outputs are routed through signal routing logic that determines whether to emit a signal or trigger a workflow—never both simultaneously.
+10
View File
@@ -6,6 +6,8 @@
```bash
nerve init # scaffold a new workspace (nerve.yaml, senses/, workflows/)
nerve init --force # reinitialize workspace even if ~/.uncaged-nerve/ exists (preserves data/)
nerve init --from <git-url> # clone existing workspace from git repository
nerve validate # validate nerve.yaml config
nerve dev # run kernel foreground (development, Ctrl+C to stop)
nerve start # start daemon (background)
@@ -14,6 +16,14 @@ nerve status # check daemon health (uptime, senses, workflows)
nerve daemon # restart daemon (stop + start)
```
### Init Behavior
**Default `nerve init`**: Creates workspace at `~/.uncaged-nerve/`. If this directory already exists and is non-empty, **exits with error** requiring `--force` flag. No merge/overwrite logic — prevents accidental workspace destruction.
**Force mode `nerve init --force`**: Reinitializes workspace even if `~/.uncaged-nerve/` exists. **Preserves `data/` directory** (containing sense SQLite databases and logs) but overwrites all config files (`nerve.yaml`, `package.json`, etc.) and example senses.
**Git clone `nerve init --from <url>`**: Clones existing repository to `~/.uncaged-nerve/`. Requires empty target directory — fails if workspace already exists and is non-empty.
## Sense Management
```bash
+31
View File
@@ -24,6 +24,21 @@ type Config = { throttle?: string }
- `throw` only for programmer errors (bugs)
- No try-catch for flow control
### Result<T, E> Type
Defined in `@uncaged/nerve-core` (`packages/core/src/result.ts`):
```ts
export type Result<T, E = Error> = { ok: true; value: T } | { ok: false; error: E };
```
**Discriminated union** with tagged `ok` field. Helper functions:
- `ok(value)``{ ok: true, value }`
- `err(error)``{ ok: false, error }`
**Exhaustive handling**: Pattern is `if (!result.ok) { handle error }` then access `result.value`.
No compiler enforcement - relies on manual discipline and TypeScript's flow control analysis.
## Naming
| Type | Style |
@@ -38,9 +53,25 @@ type Config = { throttle?: string }
- Always named exports, never default
- One module = one responsibility
### Module Naming Conventions
**Primary exports** use descriptive, unambiguous names:
- Functions: `createXxx()`, `parseXxx()`, `xxxAgent()` (e.g., `createCursorAdapter`, `cursorAgent`)
- Types: Domain-specific prefixes (e.g., `CursorAgentOptions`, `SenseComputeFn`, `WorkflowContext`)
- Constants: `UPPER_SNAKE_CASE` with context (e.g., `DEFAULT_SENSE_SIGNAL_RETENTION`, `CURSOR_ADAPTER_DEFAULT_MS`)
**Avoiding ambiguity**:
- Package-scoped naming: `@uncaged/nerve-adapter-cursor` exports `cursorAgent`, `createCursorAdapter`
- Factory pattern: `createXxxAdapter()` for configurable instances, `xxxAdapter` for defaults
- Descriptive type prefixes prevent collision (e.g., `CursorAgentOptions` vs `HermesAgentOptions`)
## Async
- Always `async/await`, never `.then()` chains
- Use `AbortSignal` for cancellation: `AbortController` to create signals, pass to long-running operations
- `spawn-safe.ts` and adapter functions accept `abortSignal: AbortSignal | null` parameter
- On abort: child processes receive `SIGTERM`, async operations should check `signal.aborted`
- No enforced Biome/Vitest rules for AbortSignal usage (manual discipline required)
## No Dynamic Import
+14 -4
View File
@@ -19,10 +19,20 @@ nerve knowledge query --repo /path "query" # search specific repo
## Embedding
- Remote service: configured via `EMBED_SERVICE_URL` env var (self-hosted Cloudflare Worker + KV cache)
- Model: Dashscope text-embedding-v3 (1024 dims)
- Cache: content-addressable (sha256 of model+text), never expires
- Fallback: word-overlap scoring when embed service not configured
- **Default model**: Dashscope text-embedding-v3 (1024 dimensions)
- **Remote service**: configured via `EMBED_SERVICE_URL` env var (self-hosted Cloudflare Worker + KV cache)
- **Model configuration**: No mechanism to specify alternate models — hardcoded to text-embedding-v3 in remote service
- **Vector dimensions**: Fixed at 1024 (Float32Array, stored as 4096-byte Buffer blobs in SQLite)
- **Cache**: content-addressable (sha256 of model+text), never expires
- **Fallback**: word-overlap scoring when embed service not configured
### Configuration
The embedding model is **not configurable** through `knowledge.yaml` or other config files. The remote service at `embed.shazhou.workers.dev` uses Dashscope text-embedding-v3 exclusively. To use different models, you would need to:
1. Deploy your own embedding service compatible with the same API
2. Point `EMBED_SERVICE_URL` to your service
3. Ensure vector dimensions match (1024) or modify knowledge database schema
## Chunking
+34 -9
View File
@@ -12,15 +12,29 @@ export { snapshots as table } from "./schema.ts"; // drizzle table for runtime
export async function compute(): Promise<ComputeResult<T>> { ... } // pure, no args
```
- `compute()` is a **pure function with no arguments** — no db, no peers, no signal
- Returns `ComputeResult<T>` = `null | { signal: T; workflow: WorkflowTrigger | null }`
- `null` → silent, no storage, no signal
- `{ signal: data, workflow: null }` → persist data, emit signal
- `{ signal: data, workflow: { name, prompt } }` → persist data, emit signal, AND trigger workflow
- **Runtime handles persistence** — `db.insert(table).values(result.signal)` is done by `sense-runtime`, not by the sense itself
- Each Sense has its own **independent SQLite database**
- Schema defined with Drizzle ORM (`schema.ts` is single source of truth)
- Types: `SenseComputeFn`, `SenseModule`, `ComputeResult` exported from `@uncaged/nerve-core`
**Function Signature & Input Schema:**
- `compute()` is **parameterless** — no direct inputs, environment variables available
- No database access within compute — runtime provides isolated execution context
- Must be pure function (no side effects, no external API calls)
**Return Value Contract:**
- `ComputeResult<T>` = `null | { signal: T; workflow: WorkflowTrigger | null }`
- `null` silent, no storage, no signal
- `{ signal: data, workflow: null }` → persist + emit signal
- `{ signal, workflow: WorkflowTrigger }` → persist + emit signal + trigger workflow
- Any other value → treated as `{ signal: value, workflow: null }`
**Error Handling & Serialization:**
- Exceptions caught by worker, logged as errors (no signal emitted)
- Signal payload must be JSON-serializable (passed via IPC)
- Invalid workflow triggers silently dropped (signal still emitted)
**Timeout & Scheduling Semantics:**
- Timeout priority: explicit config → AbortSignal → DEFAULT_TIMEOUT_MS (30s)
- Enforced via `Promise.race()` with timeout promise
- Grace period can trigger `process.exit(1)` after timeout (kills worker group)
- Interval translation: YAML config values used directly as milliseconds in `setInterval()`
- Jitter control: throttle mechanism prevents rapid-fire, single deferred trigger per throttle window
## Config (nerve.yaml)
@@ -34,3 +48,14 @@ senses:
interval: 30s # periodic trigger (optional)
on: [disk-pressure] # trigger on signals from other senses (optional)
```
## Manual Trigger Context
**`nerve sense trigger <name>`** sends IPC message to running daemon. The compute context is initialized as follows:
- **SQLite Database**: Opened in **read-write mode** at `data/senses/<name>.db`
- **Migrations**: All `*.sql` files in `senses/<name>/migrations/` applied in lexicographic order
- **Environment**: Inherits daemon process environment (no special secrets injection)
- **Arguments**: No runtime arguments or mock inputs supported — `compute()` is always pure function with no parameters
- **Isolation**: Runs in forked child process (worker) with full filesystem access within user permissions
- **Persistence**: Runtime automatically calls `db.insert(table).values(result.signal)` if compute returns non-null signal
+91
View File
@@ -0,0 +1,91 @@
# Signal Routing
Signal routing is the core mechanism that determines how Sense outputs flow through the Nerve system.
## Routing Logic
When a Sense `compute()` function returns non-null, the output goes through `routeSenseComputeOutput()` in `packages/core/src/sense-workflow-directive.ts`:
```
Sense compute() → non-null → routeSenseComputeOutput() → { signal, workflow }
kernel.ts → signal ALWAYS emitted + optional workflow start
```
## Two Output Formats
### 1. Explicit Format
```typescript
{
signal: any, // emitted as signal
workflow: { // optional workflow trigger
name: string,
maxRounds: number,
prompt: string,
dryRun: boolean
} | null
}
```
### 2. Shorthand Format
Any other value is treated as:
```typescript
{ signal: payload, workflow: null }
```
## Workflow Directive Parsing
## Concrete Routing Predicates
The routing decision is implemented in `routeSenseComputeOutput()` using these exact matching criteria:
### 1. Explicit Format Detection
```typescript
if (isPlainRecord(payload) && Object.hasOwn(payload, "signal"))
```
- Payload must be a plain object
- Must have `signal` property (any value)
- Workflow extracted from `workflow` property or defaults to null
### 2. Workflow Validation
When workflow is non-null, it's validated via `parseWorkflowTrigger()`:
- `name`: non-empty string (trimmed)
- `maxRounds`: positive integer >= 1
- `prompt`: string
- `dryRun`: boolean
**Critical behavior**: Invalid workflows are silently dropped (become null) but signal emission continues. This prevents malformed workflow config from blocking signals.
### 3. Fallback to Shorthand
Any value that doesn't match explicit format becomes:
```typescript
{ signal: payload, workflow: null }
```
## Processing Flow
```typescript
// In kernel.ts handleSenseWorkerSignal()
const { signal: signalPayload, workflow } = routeResult.value;
// Signal is ALWAYS emitted when compute returns non-null
bus.emit({ id, senseId, payload: signalPayload, timestamp });
// Workflow is started ONLY if workflow is non-null
if (workflow !== null) {
workflowManager.startWorkflow(workflow.name, { ... });
}
```
## Legacy String Format (Deprecated)
The old `"name|maxRounds|prompt"` string format is converted to the structured format internally but should not be used in new code.
## Key Behaviors
1. **Signal priority**: Every non-null compute result emits a signal, regardless of workflow
2. **Additive behavior**: Valid workflow triggers are executed in addition to signal emission
3. **Failure tolerance**: Invalid workflow directives are silently ignored, signal still emits
4. **Structure-based routing**: No complex predicates - simply checks object structure and property existence
This routing mechanism ensures clean separation between perception (signals) and action (workflows) while maintaining backward compatibility.
+132
View File
@@ -0,0 +1,132 @@
# Storage Layer
Nerve uses multiple storage systems designed for different data types and access patterns.
## Core Storage Components
### 1. Log Store (`logs.db`)
Append-only audit trail implemented in SQLite with WAL mode.
**Schema:**
- `logs` — all system events (signals, workflow transitions, sense outputs)
- `meta` — key-value store for system metadata
- `workflow_runs` — materialized view of workflow execution state
**Key Features:**
- Atomic workflow state updates via transactions
- Thread message persistence for crash recovery
- Configurable log archival to JSONL files
- Full-text search across log entries
### 2. Sense Databases
Each sense group gets its own SQLite database for private state.
**Characteristics:**
- Isolated per sense group (e.g., `system-senses.db`)
- Managed by individual sense compute functions
- Drizzle ORM integration for schema management
- No cross-sense data sharing
### 3. Knowledge Store (`knowledge.db`)
Vector-enabled search index for project context.
**Contents:**
- Chunked source files with embeddings
- Curated knowledge cards from `.knowledge/`
- Semantic search capabilities
- Global vs. repo-scoped search modes
### 4. Blob Store (CAS)
Content-addressable storage for large artifacts.
**Design:**
- SHA-256 based file naming
- Automatic deduplication
- Used for workflow artifacts and large payloads
## Consistency & Isolation Mechanisms
### SQLite WAL Mode
All SQLite databases use `PRAGMA journal_mode=WAL` for:
- **Writer-reader concurrency** — readers don't block writers
- **Atomic writes** — each transaction is fully applied or rolled back
- **Crash recovery** — WAL provides consistent state after crashes
### Transaction Management
#### Log Store Transactions
Uses `BEGIN IMMEDIATE` transactions (`packages/store/src/log-store.ts`):
```typescript
function runInTransaction<T>(db: DatabaseSync, fn: () => T): T {
db.exec("BEGIN IMMEDIATE"); // Exclusive write lock
try {
const result = fn();
db.exec("COMMIT");
return result;
} catch (e) {
db.exec("ROLLBACK");
throw e;
}
}
```
**Key Operations:**
- `upsertWorkflowRun()` — atomically writes log entry + workflow state
- `archiveLogs()` — transactional export + delete + watermark update
#### Sense Database Isolation
- Each sense group has its own SQLite file (e.g., `system-senses.db`)
- No cross-sense transactions or coordination required
- Independent schema migrations per sense
- Private `_signals` table for signal history retention
### Process-Level Isolation
#### Worker Process Architecture
- **One worker per sense group** — prevents data races within group
- **One worker per workflow type** — isolated execution contexts
- **No shared memory** — all communication via IPC messages
#### Concurrency Control
Workflow manager enforces limits per workflow:
```yaml
workflows:
my-workflow:
concurrency: 2 # Max parallel threads
overflow: "queue" # or "drop"
maxQueue: 10 # Queue depth limit
```
### Consistency Guarantees & Failure Modes
**Strong Consistency (Single Database)**:
1. **Within Log Store** — ACID transactions with immediate consistency
2. **Within Sense DB** — WAL mode ensures atomic commits per database
3. **Workflow State**`upsertWorkflowRun()` atomically updates log + materialized view
**No Cross-Database Consistency**:
- No distributed transactions across multiple SQLite files
- Log Store and Sense Databases can temporarily diverge during failures
- Signal emission and workflow triggering are separate, non-atomic operations
**Failure Recovery Mechanisms**:
- **Sense worker crash**: State rebuilt from sense SQLite database on respawn
- **Workflow worker crash**: Thread state recovered from log store message history
- **Kernel crash**: All workers respawned, state recovered from persistent stores
- **Log Store corruption**: WAL recovery on database open
- **Sense DB corruption**: Migrations re-run, `_signals` table rebuilt if needed
**Rollback Scenarios**:
- **Log write failure**: Transaction rolled back, no state changes persisted
- **Sense compute failure**: Error logged, no signal/workflow emitted
- **Workflow failure**: Thread marked as failed in materialized view
- **IPC failure**: Worker respawned, pending operations lost (not rolled back)
## Archive Strategy
Logs older than retention window (default 30 days) are:
1. Exported to `data/archive/logs/YYYY-MM-DD.jsonl`
2. Deleted from active database
3. Watermark updated to prevent re-processing
This keeps the active database size bounded while preserving audit trails.
+152
View File
@@ -0,0 +1,152 @@
# Worker Isolation
Nerve's worker architecture ensures complete isolation between different types of user code while maintaining system stability.
## Process Architecture
```
Kernel (Main Process)
├── Sense Worker (Group A) ── sense-1, sense-2
├── Sense Worker (Group B) ── sense-3, sense-4
├── Workflow Worker (cleanup) ── cleanup workflow instances
└── Workflow Worker (review) ── review workflow instances
```
## Isolation Boundaries
### 1. Sense Workers
- **One worker per sense group** (configured in `nerve.yaml`)
- Groups share a child process but have isolated execution contexts
- Crash in one sense doesn't affect other groups
- Each group has its own SQLite database
### 2. Workflow Workers
- **One worker per workflow type** (spawned on-demand)
- Multiple threads of the same workflow share a worker process
- Concurrency limits enforced at the workflow level
- Workers terminate when no active threads remain
### 3. Kernel Protection
- **User code never runs in kernel process**
- All `compute()` and workflow role functions run in workers
- Kernel only handles IPC, scheduling, and coordination
- System remains stable even with infinite loops or crashes in user code
## Worker Lifecycle
### Sense Workers
```
nerve daemon start → spawn worker per group → long-lived process
→ hot reload on file changes
→ respawn on crash
```
### Workflow Workers
```
workflow trigger → check existing worker → reuse or spawn
→ execute thread
→ terminate when idle
```
## Communication Patterns
### Kernel ↔ Sense Worker
- IPC via child process stdio
- JSON-formatted messages
- Worker reports signals back to kernel
- Bidirectional: kernel can request immediate computes
### Kernel ↔ Workflow Worker
- Similar IPC protocol
- Workflow definition loaded in worker
- Role execution results streamed back
- Thread state managed in kernel
## Resource Limits & Control
### Timeout Enforcement
Configurable timeouts per sense (in `nerve.yaml`):
```yaml
senses:
my-sense:
timeout: 30000 # Execution timeout (ms)
gracePeriod: 5000 # Grace period before hard kill
```
**Timeout Implementation:**
- `AbortController` for async operations
- `Promise.race()` between compute and timeout
- Grace period triggers `process.exit(1)` to kill entire worker group
### Memory & CPU Limits
**No Application-Level Resource Quotas**:
- No memory caps, CPU throttling, or disk I/O limits enforced by Nerve
- Workers can consume arbitrary system resources until OS limits
- No cgroup/container isolation — full filesystem access within user permissions
- No syscall filtering (no seccomp restrictions)
**OS-Level Constraints Only**:
- Process memory limited by system `ulimit -m`
- CPU usage bounded by scheduler only
- Network requests unrestricted
- Can spawn additional processes (not tracked by Nerve)
### Concurrency Control
#### Sense Workers
- One active compute per sense at a time (serialized via promise chains)
- No memory sharing between sense groups
- Crash isolation: one sense crash doesn't affect other groups
#### Workflow Workers
Per-workflow limits configured in `nerve.yaml`:
```yaml
workflows:
my-workflow:
concurrency: 2 # Max parallel threads
overflow: "drop" # or "queue"
maxQueue: 10 # Queue size limit
```
### Process Management
#### Signal Handling
Workers ignore session broadcast signals (SIGINT/SIGTERM):
```typescript
// Workers ignore terminal signals; kernel coordinates shutdown
process.on("SIGINT", () => {});
process.on("SIGTERM", () => {});
```
#### Graceful Shutdown & State Handoff
**Sense Workers**:
- IPC `shutdown` message → `process.exit(0)` (immediate)
- No graceful termination period for senses
- State rebuilt from SQLite on respawn (no handoff needed)
**Workflow Workers**:
- IPC `shutdown` → wait for in-flight threads to complete
- Drain timeout: `WORKER_SHUTDOWN_TIMEOUT_MS` (10s)
- If threads don't complete → `SIGKILL` force termination
- Thread state preserved in log store for crash recovery
**State Handoff Mechanism**:
- No explicit state transfer between old/new workers
- Sense workers: SQLite database contains full state
- Workflow workers: Log store contains thread message history
- Kernel coordinates recovery via `recoverThreadsForWorker()`
## Failure Handling
### Worker Crashes
- **Sense workers**: Automatic respawn after 1s delay, state rebuilt from DB
- **Workflow workers**: Crash recovery from log store thread messages
- **Kernel protection**: Main process continues, marks affected runs as crashed
- **Crash limits**: Max 5 crashes per workflow in 60s window (prevents infinite respawn)
### Resource Exhaustion
- **Memory**: Worker process killed by OS, kernel respawns automatically
- **Compute timeout**: Grace period → hard kill → respawn
- **Infinite loops**: Timeout enforcement prevents hanging indefinitely
This architecture allows Nerve to run untrusted or experimental code safely while maintaining system availability.
+63
View File
@@ -57,3 +57,66 @@ const workflow: WorkflowDefinition<MyMeta> = {
- `prompt: string | ((start, messages) => Promise<string>)` — static or dynamic
- `meta: z.ZodType<M>` — Zod schema, directly (no wrapper needed)
- `extract: LlmExtractorConfig` — provider for structured extraction
## Runtime Enforcement Mechanisms
### Role Authority & Validation
**Role Function Lookup**:
- Roles accessed via `def.roles[nextRole]` dictionary lookup
- Unknown roles trigger immediate workflow error (`Unknown role: ${nextRole}`)
- No dynamic role registration during execution
**Result Validation** (`validateRoleResult()`):
```typescript
// Required return shape from every role function
{ content: string, meta: Record<string, unknown> }
```
- `content` must be string (non-string → workflow error)
- `meta` must be plain object (array/null/primitive → workflow error)
- Validation failure terminates thread immediately
### Moderator Authority & Routing Control
**Next Role Selection**:
- Moderator must return role name from `roles` keys OR `END` symbol
- Called after every role completion (receives full context)
- No validation of role name until execution attempt
- Pure function constraint: cannot perform side effects
**Causal Chain Integrity**:
- Moderator receives immutable history: `{ start, steps }`
- Steps array contains ALL role outputs in chronological order
- No role can modify prior steps or start metadata
- Thread context built from log store on crash recovery
### Unauthorized Command Event Prevention
**Message Flow Control**:
- Role functions have NO direct access to kernel IPC
- All outputs flow through `sendWorkflowMessage()` wrapper
- Worker process validates messages before kernel transmission
- No direct log store database access from roles
**Process Isolation**:
- Roles execute in forked worker processes (not kernel)
- File system access limited to user permissions
- No network isolation (roles can make arbitrary HTTP calls)
- Worker has read/write access to workflow workspace only
### Concurrent Thread Management
**Kill Flag Implementation**:
```typescript
type KillFlag = { value: boolean };
// Checked before role execution and after completion
if (killFlag.value) {
sendThreadEvent(runId, "killed", { exitCode: 137 });
return;
}
```
**Concurrency Enforcement**:
- Workflow manager enforces per-workflow limits in kernel
- Excess threads queued/dropped per overflow policy
- No role can spawn additional threads (no access to workflow manager)