RFC: Cloud Workflow Orchestrator for Cross-Agent Coordination #115

tuanzi commented

2026-04-25 02:29:34 +00:00

Summary

Design a cloud-native (Cloudflare Workers) reactive orchestrator for cross-agent coordination, sharing nerve's workflow/moderator semantics but operating in a purely event-driven (passive) mode — no sense layer.

Motivation

Nerve daemons currently run as single-machine processes with an in-memory signal bus, so there is no mechanism for agents to coordinate across machines. Real scenarios include:

Broadcast notifications — tell all agents to perform some setup
Collaborative workflows — e.g., code review where a reviewer is recruited from a pool, then author and reviewer interact until completion

Core Model

The cloud orchestrator is a subset of nerve: only reflex + workflow, no sense.

Sense is active (polling/sampling) — not needed here
Cloud orchestrator is purely reactive — agents POST events, orchestrator responds

Primitives

event in → reflex → workflow → (emit events / spawn workflows)
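
To make the pipeline above concrete, here is a minimal sketch of a reflex table that maps incoming event kinds to workflows. The RFC defines none of these shapes: `OrchestratorEvent`, `WorkflowResult`, `ReflexTable`, and the event kinds `pr.needs_review` and `recruit.reviewer` are all hypothetical names chosen for illustration.

```typescript
// Hypothetical shapes: the RFC does not define these types.
interface OrchestratorEvent {
  kind: string;
  from: string; // id of the sending agent
  payload: unknown;
}

interface WorkflowResult {
  emit: OrchestratorEvent[]; // events to deliver to agents
  spawn: string[]; // names of child workflows to start
}

type Workflow = (event: OrchestratorEvent) => WorkflowResult;

// A reflex table maps an incoming event kind to the workflow that handles it.
class ReflexTable {
  private routes = new Map<string, Workflow>();

  on(kind: string, workflow: Workflow): void {
    this.routes.set(kind, workflow);
  }

  // Purely reactive: nothing runs unless an event arrives.
  handle(event: OrchestratorEvent): WorkflowResult {
    const workflow = this.routes.get(event.kind);
    if (!workflow) return { emit: [], spawn: [] };
    return workflow(event);
  }
}

const reflexes = new ReflexTable();
reflexes.on("pr.needs_review", (event) => ({
  emit: [{ kind: "recruit.reviewer", from: "orchestrator", payload: event.payload }],
  spawn: ["recruitment"],
}));
```

An event with no registered reflex produces an empty result here; whether the real orchestrator should store, reject, or ignore such events is not settled by the RFC.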

Workflow Roles

Two types:

named — bound to a specific agent at definition time
open — unbound, filled via recruitment (broadcast + claim)

The "Dungeon Queue" Pattern

Cross-agent workflows follow a recruitment → execution lifecycle, modeled as nested workflows:

Initiation — an agent POSTs a task event (e.g., "PR needs review")
Recruitment workflow — orchestrator broadcasts a recruitment (call-up) event to eligible agents, collects claim events, and manages the queue
Team-up check — when all open roles are filled, the recruitment workflow spawns the actual task workflow
Task workflow — a moderator (JSONata / simple automaton) orchestrates role turns. Each role turn becomes an event delivered to the bound agent. The agent processes it locally (possibly triggering local nerve workflows), then POSTs a response event back.
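
The recruitment and team-up steps above could be sketched roughly as follows. This is only an illustration under assumptions: the `Role` shape, the `Recruitment` class, and the first-come-first-served claim policy are hypothetical, and the RFC does not say how the queue is ordered or what happens to late claims.

```typescript
// Hypothetical role shape: "named" roles start bound, "open" roles start unbound.
interface Role {
  name: string;
  type: "named" | "open";
  agent: string | null;
}

class Recruitment {
  private roles: Role[];
  private queue: string[] = []; // agents that claimed after every open role was filled

  constructor(roles: Role[]) {
    this.roles = roles;
  }

  // Returns the role the agent was bound to, or null if the agent was queued.
  claim(agent: string): string | null {
    const slot = this.roles.find((r) => r.type === "open" && r.agent === null);
    if (!slot) {
      this.queue.push(agent);
      return null;
    }
    slot.agent = agent;
    return slot.name;
  }

  // Team-up check: every role, named or open, has a bound agent.
  isTeamComplete(): boolean {
    return this.roles.every((r) => r.agent !== null);
  }

  waiting(): string[] {
    return [...this.queue];
  }
}

const review = new Recruitment([
  { name: "author", type: "named", agent: "agent-a" },
  { name: "reviewer", type: "open", agent: null },
]);
```

Once `isTeamComplete()` returns true, the recruitment workflow would spawn the task workflow with the bound roles.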

Agent-side Integration

Each nerve daemon gets a cloud-workflow-adapter (peer to local workflow-manager):

Receives role-turn events from cloud orchestrator
Executes local logic (may involve local sense/signal/workflow)
Posts response events back
Can initiate cloud workflows by POSTing task events

Architecture

┌─────────────┐     events      ┌──────────────────────┐     events      ┌─────────────┐
│  Agent A    │ ──────────────→ │  Cloud Orchestrator  │ ──────────────→ │  Agent B    │
│  (nerve)    │ ←────────────── │  (CF Worker + DO)    │ ←────────────── │  (nerve)    │
│             │                 │                      │                 │             │
│ - sense     │                 │  - reflex (event→wf) │                 │ - sense     │
│ - signal    │                 │  - workflow engine   │                 │ - signal    │
│ - reflex    │                 │  - moderator         │                 │ - reflex    │
│ - workflow  │                 │  - role dispatch     │                 │ - workflow  │
│ - cloud     │                 │  - recruitment       │                 │ - cloud     │
│   adapter   │                 │                      │                 │   adapter   │
└─────────────┘                 └──────────────────────┘                 └─────────────┘
                                         │
                                    CF D1 (events)
                                    CF DO (workflow state)

Relationship to Pulseflare

Pulseflare (currently a passive event store on CF D1) evolves into this orchestrator. The existing event append/query API becomes the foundation; workflow engine and recruitment logic are built on top.

Schema Validation

Consumer-side validation (Robustness Principle):

Event payloads are schemaless at the orchestrator level
Each agent's nerve config declares what event kinds it accepts and their expected shape
Invalid events are dropped with a warning log
Future: optional schema registry for stronger guarantees
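
A minimal sketch of the consumer-side check described above. The RFC only says that each agent's config declares accepted kinds and their expected shape; the field-type map used here (`AcceptedKinds`), the function names, and the `role.turn` kind are assumptions for illustration, not nerve's actual config format.

```typescript
// Hypothetical config shape: kind -> field name -> expected primitive type.
type FieldType = "string" | "number" | "boolean";
type AcceptedKinds = Map<string, Record<string, FieldType>>;

interface IncomingEvent {
  kind: string;
  payload: Record<string, unknown>;
}

// Returns a description of the first problem found, or null if the event is acceptable.
function findProblem(event: IncomingEvent, accepted: AcceptedKinds): string | null {
  const shape = accepted.get(event.kind);
  if (shape === undefined) return `unaccepted kind "${event.kind}"`;
  for (const field of Object.keys(shape)) {
    const actual = typeof event.payload[field];
    if (actual !== shape[field]) {
      return `field "${field}": expected ${shape[field]}, got ${actual}`;
    }
  }
  return null; // extra fields are tolerated
}

// Invalid events are dropped with a warning log; valid ones pass through.
function acceptOrDrop(events: IncomingEvent[], accepted: AcceptedKinds): IncomingEvent[] {
  return events.filter((event) => {
    const problem = findProblem(event, accepted);
    if (problem !== null) console.warn(`dropping event: ${problem}`);
    return problem === null;
  });
}

const accepted: AcceptedKinds = new Map();
accepted.set("role.turn", { workflowId: "string", turn: "number" });
```

Tolerating unknown extra fields keeps the consumer liberal in what it accepts, in line with the Robustness Principle the section cites.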

Open Questions

Transport: REST polling vs WebSocket (CF DO) vs both?
Event ordering: is causal ordering sufficient, or do we need total ordering?
Failure handling: what happens when an agent goes offline mid-workflow? Timeout + reassignment?
Workflow definition format: reuse nerve's YAML workflow format directly, or a variant?
Auth: how do agents authenticate with the orchestrator? Per-agent tokens?
Observability: how to trace a workflow that spans multiple agents?

Next Steps

Settle on transport mechanism
Define event schema conventions (kind naming, metadata fields)
Prototype recruitment workflow on CF Worker
Design cloud-workflow-adapter for nerve daemon
Define moderator format for cloud workflows


tuanzi commented

2026-04-25 02:33:47 +00:00

Updates from discussion

Naming

The cloud orchestrator lives in the nerve monorepo as a new package: nerveflare (packages/nerveflare). Replaces pulseflare.

Agent-side Model

The local nerve daemon interacts with nerveflare through standard sense/signal/reflex — no special adapter needed:

NerveflareSense — periodically polls nerveflare for events on two channels:
- my channel — events addressed to this specific agent (role turn responses, directed messages)
- open channel — broadcast recruitment events (open roles waiting to be claimed)
Polled events become local signals
Reflex routes them:
- My channel signal → trigger workflow to process and respond
- Open channel signal → check local capacity, if available → claim via POST to nerveflare → on success, enter collaborative workflow
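
The routing above might look roughly like this on the agent side. It is a sketch under assumptions: the `Signal` and `RemoteEvent` shapes and the `maxThreads` threshold are hypothetical, and the actual HTTP polling of nerveflare is left out so that only the signal tagging and the reflex decision are shown.

```typescript
type Channel = "my" | "open";

// Hypothetical event and signal shapes; nerve's real signal type may differ.
interface RemoteEvent {
  id: string;
  kind: string;
  payload: unknown;
}

interface Signal {
  source: "nerveflare";
  channel: Channel;
  event: RemoteEvent;
}

// Turn one polled batch into local signals, tagged with the channel they came from.
function toSignals(channel: Channel, events: RemoteEvent[]): Signal[] {
  return events.map((event): Signal => ({ source: "nerveflare", channel, event }));
}

type Action = "process" | "claim" | "skip";

// Reflex routing: my-channel signals are always processed; open-channel
// signals lead to a claim only when local capacity is available.
function route(signal: Signal, activeThreads: number, maxThreads: number): Action {
  if (signal.channel === "my") return "process";
  return activeThreads < maxThreads ? "claim" : "skip";
}
```

A `claim` result would then trigger the POST to nerveflare; only a successful claim leads into the collaborative workflow.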

Capacity Awareness

"Am I busy?" can itself be a sense — e.g., count active local workflow threads. Reflex for open channel claims only fires when capacity sense reports availability. This keeps the claim decision within nerve's own model.

Revised Architecture (agent side)

┌─ Local Nerve Daemon ───────────────────────────────┐
│                                                    │
│  NerveflareSense (poll my channel + open channel)  │
│       │                                            │
│       ▼                                            │
│  Signal Bus                                        │
│       │                                            │
│       ▼                                            │
│  Reflex                                            │
│   ├─ my-channel signal → process workflow          │
│   └─ open-channel signal                           │
│       └─ if CapacitySense says available           │
│           → claim workflow → process workflow      │
│                                                    │
│  Process Workflow:                                 │
│   - run local logic (may involve other senses)     │
│   - POST response event back to nerveflare         │
│                                                    │
└────────────────────────────────────────────────────┘

This means zero new primitives on the agent side. Cross-agent coordination is just another sense source + workflow target. The entire nerve model (sense → signal → reflex → workflow) stays intact.


xiaomo commented

2026-04-25 02:44:25 +00:00

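Design Review Discussion Notes

Highlights 👍

The "subset of nerve" positioning is clear: keep only reflex + workflow and drop sense, which fits the purely reactive nature of the cloud side
The "Dungeon Queue" recruitment pattern is clever; the three nested workflow layers (recruitment → team-up → task) have clear responsibilities
Evolving Pulseflare from a passive event store into an orchestrator is a natural incremental path

Discussion Points

1. Transport choice
Suggest supporting both REST and CF DO WebSocket. REST serves as the fallback; WebSocket (DO hibernation API) handles real-time role-turn push. Pure polling has high latency, and pure WS requires reconnection logic.

2. Agent going offline mid-workflow

The most critical open question, but it can be tackled later; get the core path working first
Initial idea: timeout → release the slot → reopen recruitment ("a role has been vacated, who will take it?")
The hard part: telling a truly offline agent apart from a long-running step. Agents may need to actively send heartbeat/progress events, and be considered offline only when heartbeats stop
Needs careful design; no rush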
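
One way the heartbeat idea could work, as a sketch only: a slot is released when its agent's last heartbeat is older than a timeout, so a long-running step that keeps reporting progress stays bound. The `Slot` shape, the function name, and the timeout value are all hypothetical; the thread deliberately leaves the real design open.

```typescript
// Hypothetical slot record kept by the orchestrator for each bound role.
interface Slot {
  role: string;
  agent: string | null;
  lastHeartbeatMs: number; // timestamp of the most recent heartbeat/progress event
}

// Releases every slot whose agent has been silent for longer than timeoutMs
// and returns the names of the roles that need to be recruited again.
function releaseStaleSlots(slots: Slot[], nowMs: number, timeoutMs: number): string[] {
  const reopened: string[] = [];
  for (const slot of slots) {
    if (slot.agent !== null && nowMs - slot.lastHeartbeatMs > timeoutMs) {
      slot.agent = null;
      reopened.push(slot.role);
    }
  }
  return reopened;
}

const slots: Slot[] = [
  { role: "author", agent: "agent-a", lastHeartbeatMs: 9_000 }, // slow step, still reporting
  { role: "reviewer", agent: "agent-b", lastHeartbeatMs: 1_000 }, // silent for a long time
];
```

The roles returned by `releaseStaleSlots` would feed back into a new open recruitment round.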

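3. Event ordering
Causal ordering is sufficient. Cross-agent work is inherently asynchronous and role turns are ordered by the moderator, so a global total order is unnecessary (and would become a bottleneck).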
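
As a rough illustration of ordering without a global sequence, the moderator could number turns per workflow, so that turns in one workflow are ordered relative to each other while unrelated workflows never contend on a shared counter. `TurnLog` and its fields are hypothetical names, and per-workflow sequence numbers are just one possible mechanism, not something the thread specifies.

```typescript
interface Turn {
  workflowId: string;
  seq: number; // position within this workflow only
  role: string;
}

class TurnLog {
  private nextSeq = new Map<string, number>();

  // Assigns the next sequence number for the given workflow; no global counter exists.
  append(workflowId: string, role: string): Turn {
    const seq = this.nextSeq.get(workflowId) ?? 0;
    this.nextSeq.set(workflowId, seq + 1);
    return { workflowId, seq, role };
  }
}

const log = new TurnLog();
const a0 = log.append("wf-a", "author");
const b0 = log.append("wf-b", "author");
const a1 = log.append("wf-a", "reviewer");
```

Here `a0` and `a1` are ordered within `wf-a`, while `a0` and `b0` are simply not comparable.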

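4. Workflow definition format
Suggest reusing nerve YAML with a binding: cloud marker to distinguish cloud workflows, so workflow authors do not need to learn two syntaxes.

5. Auth
Per-agent token + agent-id claim. The orchestrator maintains an agent registry (which agents may take part in which roles); the token is generated and registered at nerve init.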
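
A minimal sketch of the registry check, assuming an in-memory map. `AgentRegistry`, `register`, and `canClaim` are hypothetical names; a real deployment would presumably persist the registry, store token hashes rather than raw tokens, and compare them in constant time, none of which is shown here.

```typescript
// Hypothetical registry entry: the agent's token and the roles it may take.
interface AgentRecord {
  token: string;
  roles: Set<string>;
}

class AgentRegistry {
  private agents = new Map<string, AgentRecord>();

  // Would be called when an agent registers its token (e.g. during nerve init).
  register(agentId: string, token: string, roles: string[]): void {
    this.agents.set(agentId, { token, roles: new Set(roles) });
  }

  // A claim is allowed only if the agent is known, the token matches,
  // and the agent is eligible for the requested role.
  canClaim(agentId: string, token: string, role: string): boolean {
    const record = this.agents.get(agentId);
    if (record === undefined) return false;
    // Plain comparison for brevity; not constant-time.
    return record.token === token && record.roles.has(role);
  }
}

const registry = new AgentRegistry();
registry.register("agent-b", "example-token", ["reviewer"]);
```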

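6. Relationship between local nerve and cloud workflows (clarified)
The local nerve is unaware of cloud workflows. The sense layer simply gains one more "cloud event source": a task arrives or a recruitment claim succeeds → reflex triggers a local workflow → on completion, the result is POSTed back. This is essentially no different from handling a local event such as a CPU alert. The two workflows do not overlap; at most they share some TS types and code logic.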


tuanzi referenced this issue

2026-04-25 03:00:20 +00:00

RFC: Khala — Stateless Agent Pool Cloud Workflow Orchestrator #119

tuanzi closed this issue

2026-04-25 03:00:31 +00:00

tuanzi commented

2026-04-25 03:00:32 +00:00

Superseded by #119 — simplified design with stateless agent pool model, no named channels or recruitment phases.


Reference: uncaged/nerve#115