RFC: Cloud Workflow Orchestrator for Cross-Agent Coordination #115

Closed
opened 2026-04-25 02:29:34 +00:00 by tuanzi · 3 comments
Owner

Summary

Design a cloud-native (Cloudflare Workers) reactive orchestrator for cross-agent coordination, sharing nerve's workflow/moderator semantics but operating in a purely event-driven (passive) mode — no sense layer.

Motivation

Nerve daemons currently run as single-machine processes with an in-memory signal bus. There is no mechanism for agents to coordinate across machines. Real scenarios include:

  1. Broadcast notifications — tell all agents to perform some setup
  2. Collaborative workflows — e.g., code review where a reviewer is recruited from a pool, then author and reviewer interact until completion

Core Model

The cloud orchestrator is a subset of nerve: only reflex + workflow, no sense.

  • Sense is active (polling/sampling) — not needed here
  • Cloud orchestrator is purely reactive — agents POST events, orchestrator responds

Primitives

event in → reflex → workflow → (emit events / spawn workflows)
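The pipeline above can be sketched as a lookup table from event kind to workflow. This is only an illustration under assumed names: `OrchestratorEvent`, `Workflow`, `reflexes`, `handleEvent`, and the event kinds `task.review_requested` and `recruitment.open` are all hypothetical and do not come from nerve or Pulseflare. Spawning nested workflows is left out for brevity.

```typescript
// Hypothetical event and workflow shapes, for illustration only.
interface OrchestratorEvent {
  kind: string;
  payload: unknown;
}

// A workflow consumes the triggering event and returns the events it emits.
type Workflow = (ev: OrchestratorEvent) => OrchestratorEvent[];

// Reflex table: event kind -> workflow to start (names are assumptions).
const reflexes = new Map<string, Workflow>();

reflexes.set("task.review_requested", (ev) => [
  // Emit a recruitment broadcast carrying the original payload.
  { kind: "recruitment.open", payload: ev.payload },
]);

// Purely reactive: nothing runs unless an event comes in.
function handleEvent(ev: OrchestratorEvent): OrchestratorEvent[] {
  const workflow = reflexes.get(ev.kind);
  return workflow ? workflow(ev) : [];
}
```

Events with no matching reflex simply produce no output, which matches the passive mode described above.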

Workflow Roles

Two types:

  • named — bound to a specific agent at definition time
  • open — unbound, filled via recruitment (broadcast + claim)

The "Dungeon Queue" Pattern

Cross-agent workflows follow a recruitment → execution lifecycle, modeled as nested workflows:

  1. Initiation: an agent POSTs a task event (e.g., "PR needs review")
  2. Recruitment workflow: the orchestrator broadcasts a recruitment event to eligible agents, collects claim events, and manages the queue
  3. Team-up check: when all open roles are filled, the recruitment workflow spawns the actual task workflow
  4. Task workflow: a moderator (JSONata / simple automaton) orchestrates role turns. Each role turn becomes an event delivered to the bound agent. The agent processes it locally (possibly triggering local nerve workflows), then POSTs a response event back.
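The claim and team-up steps above could look roughly like the following. It is a minimal in-memory sketch, not the orchestrator's actual design: `Role`, `Recruitment`, `claim`, and `isTeamComplete` are hypothetical names, the "first claim wins" rule is an assumption, and a real deployment would keep this state somewhere durable rather than in a plain object.

```typescript
// Hypothetical shapes; none of these names come from nerve itself.
type Role =
  | { kind: "named"; name: string; agentId: string } // bound at definition time
  | { kind: "open"; name: string; agentId?: string }; // filled by a claim

interface Recruitment {
  taskId: string;
  roles: Role[];
}

// Assumed policy: first claim wins. Binds the agent only if the role
// exists, is open, and has not been filled yet.
function claim(r: Recruitment, roleName: string, agentId: string): boolean {
  const role = r.roles.find((x) => x.name === roleName);
  if (!role || role.kind !== "open" || role.agentId !== undefined) {
    return false;
  }
  role.agentId = agentId;
  return true;
}

// Team-up check: every role, named or open, has an agent bound.
function isTeamComplete(r: Recruitment): boolean {
  return r.roles.every((x) => x.agentId !== undefined);
}

// Example: a code review with a named author and an open reviewer slot.
const review: Recruitment = {
  taskId: "pr-review-1",
  roles: [
    { kind: "named", name: "author", agentId: "agent-a" },
    { kind: "open", name: "reviewer" },
  ],
};
```

Once `isTeamComplete(review)` returns true, the recruitment workflow would spawn the task workflow with the bound agents.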

Agent-side Integration

Each nerve daemon gets a cloud-workflow-adapter (peer to local workflow-manager):

  • Receives role-turn events from cloud orchestrator
  • Executes local logic (may involve local sense/signal/workflow)
  • Posts response events back
  • Can initiate cloud workflows by POSTing task events

Architecture

┌─────────────┐     events      ┌──────────────────────┐     events      ┌─────────────┐
│  Agent A    │ ──────────────→ │  Cloud Orchestrator  │ ──────────────→ │  Agent B    │
│  (nerve)    │ ←────────────── │  (CF Worker + DO)    │ ←────────────── │  (nerve)    │
│             │                 │                      │                 │             │
│ - sense     │                 │  - reflex (event→wf) │                 │ - sense     │
│ - signal    │                 │  - workflow engine   │                 │ - signal    │
│ - reflex    │                 │  - moderator         │                 │ - reflex    │
│ - workflow  │                 │  - role dispatch     │                 │ - workflow  │
│ - cloud     │                 │  - recruitment       │                 │ - cloud     │
│   adapter   │                 │                      │                 │   adapter   │
└─────────────┘                 └──────────────────────┘                 └─────────────┘
                                         │
                                    CF D1 (events)
                                    CF DO (workflow state)

Relationship to Pulseflare

Pulseflare (currently a passive event store on CF D1) evolves into this orchestrator. The existing event append/query API becomes the foundation; workflow engine and recruitment logic are built on top.

Schema Validation

Consumer-side validation (Robustness Principle):

  • Event payloads are schemaless at the orchestrator level
  • Each agent's nerve config declares what event kinds it accepts and their expected shape
  • Invalid events are dropped with a warning log
  • Future: optional schema registry for stronger guarantees
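As a rough illustration of consumer-side validation, an agent could keep a table of accepted event kinds with a shape check for each and drop anything else with a warning. Everything named here is an assumption made for the example: `OrchestratorEvent`, `ShapeCheck`, `accepted`, `acceptEvent`, the `review.turn` kind, and the `prUrl` field are hypothetical, and nerve's real config format is not shown in this RFC.

```typescript
// Hypothetical event shape; the orchestrator itself treats payloads as opaque.
interface OrchestratorEvent {
  kind: string;
  payload: unknown;
}

type ShapeCheck = (payload: unknown) => boolean;

// Assumed per-agent declaration: accepted kinds and their expected shape.
const accepted = new Map<string, ShapeCheck>([
  [
    "review.turn",
    (p) =>
      typeof p === "object" &&
      p !== null &&
      typeof (p as { prUrl?: unknown }).prUrl === "string",
  ],
]);

// Returns true if the event should be processed; otherwise logs a warning
// and signals that the event is dropped.
function acceptEvent(
  ev: OrchestratorEvent,
  warn: (msg: string) => void = console.warn,
): boolean {
  const check = accepted.get(ev.kind);
  if (!check || !check(ev.payload)) {
    warn(`dropping invalid event of kind "${ev.kind}"`);
    return false;
  }
  return true;
}
```

Passing the logger as a parameter keeps the check easy to test and lets a daemon route the warning into its own log.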

Open Questions

  1. Transport: REST polling vs WebSocket (CF DO) vs both?
  2. Event ordering: is causal ordering sufficient, or do we need total ordering?
  3. Failure handling: what happens when an agent goes offline mid-workflow? Timeout + reassignment?
  4. Workflow definition format: reuse nerve's YAML workflow format directly, or a variant?
  5. Auth: how do agents authenticate with the orchestrator? Per-agent tokens?
  6. Observability: how to trace a workflow that spans multiple agents?

Next Steps

  • Settle on transport mechanism
  • Define event schema conventions (kind naming, metadata fields)
  • Prototype recruitment workflow on CF Worker
  • Design cloud-workflow-adapter for nerve daemon
  • Define moderator format for cloud workflows
Author
Owner

Updates from discussion

Naming

The cloud orchestrator lives in the nerve monorepo as a new package, nerveflare (packages/nerveflare), which replaces pulseflare.

Agent-side Model

The local nerve daemon interacts with nerveflare through standard sense/signal/reflex — no special adapter needed:

  1. NerveflareSense — periodically polls nerveflare for events on two channels:
    • my channel — events addressed to this specific agent (role turn responses, directed messages)
    • open channel — broadcast recruitment events (open roles waiting to be claimed)
  2. Polled events become local signals
  3. Reflex routes them:
    • My channel signal → trigger workflow to process and respond
    • Open channel signal → check local capacity, if available → claim via POST to nerveflare → on success, enter collaborative workflow
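The routing in step 3 can be expressed as a small pure function. This is a sketch under assumptions: `Signal`, `Action`, and `routeSignal` are hypothetical names, and the `ignore` outcome for an open-channel signal without capacity is one possible choice (the discussion does not say what happens in that case).

```typescript
// Hypothetical signal produced from a polled nerveflare event.
type Channel = "my" | "open";

interface Signal {
  channel: Channel;
  eventId: string;
}

// What the reflex decides to do with a signal (names are assumptions).
type Action =
  | { type: "process"; eventId: string } // run the processing workflow
  | { type: "claim"; eventId: string } // try to claim the open role
  | { type: "ignore"; eventId: string }; // assumed fallback when busy

function routeSignal(sig: Signal, hasCapacity: boolean): Action {
  if (sig.channel === "my") {
    // Events addressed to this agent are always processed.
    return { type: "process", eventId: sig.eventId };
  }
  // Open-channel recruitment: only claim when there is local capacity.
  return hasCapacity
    ? { type: "claim", eventId: sig.eventId }
    : { type: "ignore", eventId: sig.eventId };
}
```

A `claim` action would then POST to nerveflare, and the agent would enter the collaborative workflow only if that POST succeeds.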

Capacity Awareness

"Am I busy?" can itself be a sense — e.g., count active local workflow threads. Reflex for open channel claims only fires when capacity sense reports availability. This keeps the claim decision within nerve's own model.

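One way to picture a capacity sense is a counter over active local workflow threads that reports availability when sampled. The class below is a hypothetical sketch: `CapacitySense` is named in the diagram, but its methods (`workflowStarted`, `workflowFinished`, `sample`) and the `maxThreads` limit are assumptions, not nerve's sense API.

```typescript
// Hypothetical capacity sense: tracks active local workflow threads.
class CapacitySense {
  private readonly active = new Set<string>();
  private readonly maxThreads: number;

  // maxThreads is an assumed, locally configured limit.
  constructor(maxThreads: number) {
    this.maxThreads = maxThreads;
  }

  workflowStarted(id: string): void {
    this.active.add(id);
  }

  workflowFinished(id: string): void {
    this.active.delete(id);
  }

  // Sampled like any other sense; the reflex for open-channel claims
  // would only fire when `available` is true.
  sample(): { activeThreads: number; available: boolean } {
    return {
      activeThreads: this.active.size,
      available: this.active.size < this.maxThreads,
    };
  }
}
```

Using a set of workflow ids rather than a bare counter keeps the count correct if a start or finish notification is delivered twice.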
Revised Architecture (agent side)

┌─ Local Nerve Daemon ─────────────────────────────┐
│                                                  │
│  NerveflareSense (poll my channel + open channel)│
│       │                                          │
│       ▼                                          │
│  Signal Bus                                      │
│       │                                          │
│       ▼                                          │
│  Reflex                                          │
│   ├─ my-channel signal → process workflow        │
│   └─ open-channel signal                         │
│       └─ if CapacitySense says available         │
│           → claim workflow → process workflow    │
│                                                  │
│  Process Workflow:                               │
│   - run local logic (may involve other senses)   │
│   - POST response event back to nerveflare       │
│                                                  │
└──────────────────────────────────────────────────┘

This means zero new primitives on the agent side. Cross-agent coordination is just another sense source + workflow target. The entire nerve model (sense → signal → reflex → workflow) stays intact.

Owner

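Design Review Discussion Notes

Highlights 👍

  • The "nerve subset" positioning is clear: keep only reflex + workflow and drop sense, which matches the purely reactive nature of the cloud side
  • The "Dungeon Queue" recruitment pattern is neat: the three nested workflow layers (recruitment → team-up → task) have clearly separated responsibilities
  • Evolving Pulseflare from a passive event store into an orchestrator is a natural incremental path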

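Discussion Points

1. Transport choice
Suggest supporting both REST and CF DO WebSocket. REST serves as the fallback; WebSocket (via the DO hibernation API) handles real-time role-turn push. Pure polling has high latency, and pure WebSocket requires reconnection logic.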

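2. Agent goes offline mid-workflow

  • This is the most critical open question, but it can be deferred until the core path works
  • Initial idea: timeout → release the slot → reopen recruitment ("a role has opened up, who will take it?")
  • The hard part is telling a genuinely offline agent from a long-running step. Agents may need to actively send heartbeat/progress events, and only count as offline when heartbeats stop
  • This needs careful design; no rush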
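The timeout idea above can be sketched as a sweep over bound role slots that releases any slot whose last heartbeat is too old. This is only one possible reading of the initial idea: `RoleSlot`, `releaseStaleSlots`, and the `timeoutMs` parameter are hypothetical, and the thread explicitly leaves the real design open.

```typescript
// Hypothetical record of a role bound to an agent in a running workflow.
interface RoleSlot {
  role: string;
  agentId: string;
  lastHeartbeatMs: number; // time of the last heartbeat/progress event
}

// Splits slots into those still considered alive and the roles that
// should go back to open recruitment. A long-running step stays alive
// as long as the agent keeps sending heartbeats.
function releaseStaleSlots(
  slots: RoleSlot[],
  nowMs: number,
  timeoutMs: number,
): { active: RoleSlot[]; reopenedRoles: string[] } {
  const active: RoleSlot[] = [];
  const reopenedRoles: string[] = [];
  for (const slot of slots) {
    if (nowMs - slot.lastHeartbeatMs > timeoutMs) {
      reopenedRoles.push(slot.role); // release the slot
    } else {
      active.push(slot);
    }
  }
  return { active, reopenedRoles };
}
```

Each entry in `reopenedRoles` would trigger a fresh recruitment broadcast for that role.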

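3. Event ordering
Causal ordering is sufficient. Cross-agent interaction is inherently asynchronous and role turns are sequenced by the moderator, so a global total order is unnecessary (and would become a bottleneck).

4. Workflow definition format
Suggest reusing nerve's YAML format and distinguishing cloud workflows with a binding: cloud marker, so workflow authors do not have to learn two syntaxes.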

5. Auth
Per-agent token + agent-id claim. The orchestrator maintains an agent registry (which agents may take part in which roles); the token is generated and registered at nerve init time.
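A registry of this kind might be modeled as a map from agent id to a token and a set of permitted roles. The sketch below is illustrative only: `AgentRecord`, `registerAgent`, and `authorize` are hypothetical names, and comparing plain-text tokens is a simplification made for the example (a real service would more likely store hashed tokens and compare them in constant time).

```typescript
// Hypothetical registry entry: one token and the roles the agent may fill.
interface AgentRecord {
  token: string;
  roles: Set<string>;
}

const registry = new Map<string, AgentRecord>();

// Assumed to be called once when the agent is initialized and registered.
function registerAgent(agentId: string, token: string, roles: string[]): void {
  registry.set(agentId, { token, roles: new Set(roles) });
}

// An agent is authorized for a role only if it is registered, presents
// the matching token, and the role is in its permitted set.
// Simplification: plain string comparison, for illustration only.
function authorize(agentId: string, token: string, role: string): boolean {
  const record = registry.get(agentId);
  return (
    record !== undefined && record.token === token && record.roles.has(role)
  );
}
```

With this shape, a claim for an open role would be rejected before it ever reaches the recruitment queue if the agent is not allowed to fill that role.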

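6. Relationship between local nerve and cloud workflows (clarified)
The local nerve is unaware that cloud workflows exist. The sense layer simply gains one more "cloud event source": a task is received or a recruitment claim succeeds → reflex triggers a local workflow → the result is POSTed back when done. This is essentially no different from handling a local event (such as a CPU alert). The two workflows do not overlap; at most they share some TS types and code logic.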

Author
Owner

Superseded by #119 — simplified design with stateless agent pool model, no named channels or recruitment phases.


Reference: uncaged/nerve#115