RFC: Khala — Stateless Agent Pool Cloud Workflow Orchestrator #119

tuanzi commented

2026-04-25 03:00:20 +00:00

Summary

Khala is a cloud-native (Cloudflare Workers + D1 + Durable Objects) reactive workflow orchestrator for cross-agent coordination. It treats all agents as a stateless, homogeneous worker pool — any agent can execute any role turn at any time.

Supersedes #115 (original RFC with named channels and recruitment phases).

Key Insight

Two properties of our agent fleet make the design radically simple:

Shared skills (shazhou/skills) — all agents have equivalent capabilities
Workflow thread as context — work state lives in the thread, not the agent session

Therefore agents are interchangeable work units. No need for named channels, role binding, recruitment phases, or offline recovery.

Architecture

┌─────────────────────────────────────────────┐
│                 Khala (CF)                  │
│                                             │
│  ┌─────────────┐    ┌────────────────────┐  │
│  │  Task Queue  │    │  Workflow Engine   │  │
│  │  (open only) │    │  (DO per thread)   │  │
│  └──────┬──────┘    │  - moderator       │  │
│         │           │  - thread state    │  │
│         │           │  - message history │  │
│         │           └────────┬───────────┘  │
│         │                    │              │
│         └────────────────────┘              │
│              ▲            │                 │
│              │ response   │ turn event      │
└──────────────┼────────────┼─────────────────┘
               │            ▼
         ┌─────┴──────────────────┐
         │    Any Agent (nerve)   │
         │                        │
         │  Sense: poll task queue │
         │  → Signal → Reflex     │
         │  → Local workflow      │
         │    (execute turn)      │
         │  → POST response back  │
         └────────────────────────┘

Cloud Workflow Definition

workflow: code-review
moderator: ./moderators/code-review.jsonata   # pure state machine

roles:
  author:
    prompt: |
      You are the PR author. Explain your changes, respond to
      reviewer feedback, and apply suggested fixes.
  reviewer:
    prompt: |
      You are the code reviewer. Review code quality, architecture,
      potential bugs. Give specific, actionable suggestions.
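
A minimal sketch of how a parsed definition like the one above might be typed and sanity-checked before khala accepts it. The interface names and `validateWorkflowDefinition` are hypothetical; only the keys (`workflow`, `moderator`, `roles`, `prompt`) come from the YAML, and the example prompts are shortened.

```typescript
// Hypothetical shape of a parsed cloud workflow definition; keys mirror the YAML.
interface RoleDefinition {
  prompt: string;
}

interface WorkflowDefinition {
  workflow: string;
  moderator: string; // path to the moderator expression
  roles: Record<string, RoleDefinition>;
}

// Returns a list of problems; an empty list means the definition looks usable.
function validateWorkflowDefinition(input: unknown): string[] {
  if (typeof input !== "object" || input === null) {
    return ["definition must be an object"];
  }
  const def = input as Record<string, unknown>;
  const problems: string[] = [];
  const workflowName = def["workflow"];
  if (typeof workflowName !== "string" || workflowName.length === 0) {
    problems.push("workflow: expected a non-empty name");
  }
  const moderatorPath = def["moderator"];
  if (typeof moderatorPath !== "string" || moderatorPath.length === 0) {
    problems.push("moderator: expected a path");
  }
  const roles = def["roles"];
  if (typeof roles !== "object" || roles === null || Object.keys(roles).length === 0) {
    problems.push("roles: expected at least one role");
    return problems;
  }
  for (const [roleName, role] of Object.entries(roles as Record<string, unknown>)) {
    const prompt = (role as { prompt?: unknown } | null)?.prompt;
    if (typeof prompt !== "string" || prompt.trim() === "") {
      problems.push(`roles.${roleName}: expected a non-empty prompt`);
    }
  }
  return problems;
}

// The code-review workflow from the RFC, with shortened prompts.
const codeReviewDefinition: WorkflowDefinition = {
  workflow: "code-review",
  moderator: "./moderators/code-review.jsonata",
  roles: {
    author: { prompt: "You are the PR author." },
    reviewer: { prompt: "You are the code reviewer." },
  },
};
```

Because a role is only a prompt, validation stays structural: nothing here inspects how an agent would carry the role out.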

Role Definition

Local workflow: role = async function (code)
Cloud workflow: role = prompt (declarative)

A role is "who you are and what you're responsible for", not "how to do it". Execution details come from the agent's local skills and tools.

Turn Event

When the moderator assigns a turn, the agent receives:

role prompt — your role in this workflow
turn instruction — what to do this step (generated by moderator)
thread id — for querying context
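
The three fields above could be carried in a payload along these lines. This is a sketch under assumptions: the field names (`threadId`, `rolePrompt`, `turnInstruction`) and the `buildTurnPrompt` helper are illustrative, not a wire format defined by this RFC.

```typescript
// Hypothetical turn event payload; field names are assumptions.
interface TurnEvent {
  threadId: string; // used to query thread context on demand
  rolePrompt: string; // who the agent is in this workflow
  turnInstruction: string; // what to do in this step, generated by the moderator
}

// Combines role and instruction into the prompt the local workflow starts from.
// Thread history is deliberately absent; the agent pulls it by thread id.
function buildTurnPrompt(turn: TurnEvent): string {
  return [
    turn.rolePrompt.trim(),
    "",
    `Instruction for this turn: ${turn.turnInstruction.trim()}`,
    `Thread: ${turn.threadId} (query it for context as needed)`,
  ].join("\n");
}
```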

Thread Context: Query, Don't Dump

Thread history is NOT bulk-loaded into agent context. Instead, agents get a query interface to pull what they need:

GET /threads/:id/messages?role=author&last=3
GET /threads/:id/messages?since=<timestamp>
GET /threads/:id/messages?step=2

This:

Prevents token explosion on long workflows
Lets agents decide what context is relevant
Removes dependency on agent continuity (any agent can pick up any turn, pull context on demand)

The query capability is injected as a tool/context-provider into the local workflow that executes the turn.
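
As a sketch of the tool an agent might be handed, the helper below builds the three query URLs listed above. The `MessageQuery` type, the function name, and the `https://khala.example` base URL are placeholders; only the path and the `role`, `last`, `since`, and `step` parameters come from the RFC.

```typescript
// One variant per query form shown in the RFC.
type MessageQuery =
  | { role: string; last: number }
  | { since: string }
  | { step: number };

// Builds the URL for GET /threads/:id/messages with the given filter.
function threadMessagesUrl(baseUrl: string, threadId: string, query: MessageQuery): string {
  const url = new URL(`/threads/${encodeURIComponent(threadId)}/messages`, baseUrl);
  for (const [key, value] of Object.entries(query)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Example: the last three messages written by the author role.
const lastAuthorMessagesUrl = threadMessagesUrl(
  "https://khala.example",
  "thread-123",
  { role: "author", last: 3 },
);
```

Keeping the filter as a small union makes it harder for an agent tool to combine parameters the API may not support together.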

Agent-Side Integration

Zero new primitives. Standard nerve sense/signal/reflex:

1. KhalaSense: polls the task queue periodically
2. Turn event → local signal
3. Reflex triggers local workflow to:
   - Read role prompt + turn instruction
   - Query thread context as needed
   - Execute with local LLM + tools
   - POST response event back to khala

Capacity management: a CapacitySense checks local workflow load. Reflex only fires when the agent has bandwidth.
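
The cycle described above can be sketched as a single `tick` with a capacity gate. Everything here is illustrative: `KhalaAgentLoop`, `AgentDeps`, and the outcome strings are hypothetical names, and the network and LLM calls are injected rather than implemented.

```typescript
interface PolledTurn {
  turnId: string;
  threadId: string;
  rolePrompt: string;
  turnInstruction: string;
}

// Anything that touches the network or the LLM is injected, so the loop stays testable.
interface AgentDeps {
  pollQueue(): Promise<PolledTurn | null>;
  executeTurn(turn: PolledTurn): Promise<string>;
  postResponse(turn: PolledTurn, responseText: string): Promise<void>;
}

type TickOutcome = "at-capacity" | "queue-empty" | "turn-completed";

class KhalaAgentLoop {
  private activeTurns = 0;
  private readonly deps: AgentDeps;
  private readonly maxConcurrentTurns: number;

  constructor(deps: AgentDeps, maxConcurrentTurns: number) {
    this.deps = deps;
    this.maxConcurrentTurns = maxConcurrentTurns;
  }

  // One sense → signal → reflex cycle.
  async tick(): Promise<TickOutcome> {
    // Capacity gate: the reflex does not fire while the agent is saturated.
    if (this.activeTurns >= this.maxConcurrentTurns) return "at-capacity";
    this.activeTurns += 1; // reserve the slot before the first await
    try {
      const turn = await this.deps.pollQueue();
      if (turn === null) return "queue-empty";
      const responseText = await this.deps.executeTurn(turn);
      await this.deps.postResponse(turn, responseText);
      return "turn-completed";
    } finally {
      this.activeTurns -= 1; // release the slot on success and on failure
    }
  }
}
```

A daemon would call `tick` on a timer. The slot is reserved before polling so that overlapping ticks cannot both pass the gate.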

Khala Internals

Khala is a nerve subset: only reflex + workflow, no sense.

Purely reactive — agents POST events in, khala responds
Moderator — JSONata or simple automaton, runs in CF DO
Thread state — persisted in D1, keyed by thread id
Task queue — D1 table, agents poll for unclaimed turns
Turn lifecycle: moderator emits turn → queue → agent claims → executes → posts response → moderator routes next
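
To make "moderator routes next" concrete, here is a stand-in automaton for the code-review workflow. The RFC proposes JSONata for this; the TypeScript below only illustrates the state-machine idea. The state names and the "LGTM" approval convention are assumptions.

```typescript
type ReviewState = "awaiting_author" | "awaiting_reviewer" | "done";

interface NextTurn {
  role: "author" | "reviewer";
  instruction: string;
}

interface ModeratorDecision {
  state: ReviewState;
  nextTurn: NextTurn | null; // null means no further turn is queued
}

// Pure function: current state + latest response in, next state + next turn out.
function moderateCodeReview(state: ReviewState, responseText: string): ModeratorDecision {
  if (state === "awaiting_author") {
    return {
      state: "awaiting_reviewer",
      nextTurn: { role: "reviewer", instruction: "Review the author's latest message." },
    };
  }
  // Assumed convention: the reviewer approves by writing "LGTM".
  if (state === "awaiting_reviewer" && !/\bLGTM\b/i.test(responseText)) {
    return {
      state: "awaiting_author",
      nextTurn: { role: "author", instruction: "Address the reviewer's feedback." },
    };
  }
  return { state: "done", nextTurn: null };
}
```

Because the function holds no state of its own, it can run inside the per-thread Durable Object with the thread state passed in on every call.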

Transport

REST as primary (simple, reliable)
CF DO WebSocket as optional upgrade for low-latency push (hibernation API for cost efficiency)

Event Ordering

Causal ordering only. The moderator serializes turns within a thread, so no global total order is needed.
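
One way to picture per-thread ordering is a sequence counter kept per thread and never compared across threads. The `ThreadSequencer` class is a hypothetical illustration, not a described component of khala.

```typescript
class ThreadSequencer {
  private readonly nextSeqByThread = new Map<string, number>();

  // Returns the sequence number for the next turn in this thread (1, 2, 3, ...).
  nextTurnSeq(threadId: string): number {
    const seq = this.nextSeqByThread.get(threadId) ?? 1;
    this.nextSeqByThread.set(threadId, seq + 1);
    return seq;
  }

  // A response is causally valid only if it answers the most recently issued turn.
  isCurrent(threadId: string, turnSeq: number): boolean {
    const next = this.nextSeqByThread.get(threadId);
    return next !== undefined && turnSeq === next - 1;
  }
}
```

Two threads can both be at sequence 1 at the same moment, and nothing needs to decide which came first.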

Auth

Per-agent tokens. Each agent registers with khala on nerve init, and the orchestrator maintains an agent registry.
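
A minimal in-memory sketch of the registry lookup. The RFC does not say how the token is transmitted; the `Authorization: Bearer <token>` header and the `AgentRegistry` class are assumptions, and a real deployment would store token hashes rather than raw tokens.

```typescript
class AgentRegistry {
  // Simplification: raw tokens as keys. Store hashes in anything real.
  private readonly agentIdByToken = new Map<string, string>();

  register(agentId: string, token: string): void {
    this.agentIdByToken.set(token, agentId);
  }

  // Returns the agent id for a valid "Bearer <token>" header, or null otherwise.
  authenticate(authorizationHeader: string | null): string | null {
    if (authorizationHeader === null) return null;
    const match = /^Bearer (\S+)$/.exec(authorizationHeader);
    const token = match?.[1];
    if (token === undefined) return null;
    return this.agentIdByToken.get(token) ?? null;
  }
}
```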

Relationship to Pulseflare

Khala replaces pulseflare. The D1 event store from pulseflare becomes the persistence layer for thread messages.

Open Questions

Moderator format: JSONata sufficient, or need a small DSL?
Turn timeout: how long before an unclaimed/unfinished turn is re-queued?
Result aggregation: how does a workflow return a final result to the initiator?
Workflow initiation: REST API? Or also via agent event?
Observability: tracing a multi-agent workflow execution

Next Steps

Define khala package structure in nerve monorepo
Design D1 schema (threads, messages, task_queue)
Prototype moderator engine on CF DO
Build KhalaSense for local nerve daemon
End-to-end demo: two agents doing a code review


tuanzi referenced this issue

2026-04-25 03:00:32 +00:00

RFC: Cloud Workflow Orchestrator for Cross-Agent Coordination #115

xiaomo commented

2026-04-25 03:23:31 +00:00

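Design Review: Additional Discussion

Turn Timeout + Optimistic Locking

Each turn carries a claim_id. When an agent POSTs its response, the claim_id is checked against the current holder:

Match → accept
Mismatch (the turn timed out and was handed to another agent) → reject with 409; the agent discards its result

Two levels of timeout:

Claim timeout: no response within N minutes of the claim → the turn is released back to the queue (re-queued)
Idle timeout: nobody claims the turn from the queue → notify the initiator

Timeout values are configurable in the workflow definition.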
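
The claim check and the two timeouts can be sketched with an in-memory queue. This is an illustration under assumptions: `TurnQueue` and its method names are hypothetical, time is passed in as milliseconds so the behaviour is deterministic, and the real queue would be a D1 table.

```typescript
type SubmitStatus = 200 | 409;

interface QueuedTurn {
  status: "open" | "claimed" | "done";
  claimId: string | null;
  since: number; // ms timestamp of the last enqueue, claim, or re-queue
}

class TurnQueue {
  private readonly turns = new Map<string, QueuedTurn>();
  private claimCounter = 0;
  private readonly claimTimeoutMs: number;
  private readonly idleTimeoutMs: number;

  constructor(claimTimeoutMs: number, idleTimeoutMs: number) {
    this.claimTimeoutMs = claimTimeoutMs;
    this.idleTimeoutMs = idleTimeoutMs;
  }

  enqueue(turnId: string, now: number): void {
    this.turns.set(turnId, { status: "open", claimId: null, since: now });
  }

  // Hands the turn to one agent and returns a fresh claim id, or null if it is not open.
  claim(turnId: string, now: number): string | null {
    const turn = this.turns.get(turnId);
    if (turn === undefined || turn.status !== "open") return null;
    this.claimCounter += 1;
    const claimId = `claim-${this.claimCounter}`;
    turn.status = "claimed";
    turn.claimId = claimId;
    turn.since = now;
    return claimId;
  }

  // Optimistic lock: only the current claim holder may complete the turn.
  submit(turnId: string, claimId: string): SubmitStatus {
    const turn = this.turns.get(turnId);
    if (turn === undefined || turn.status !== "claimed" || turn.claimId !== claimId) return 409;
    turn.status = "done";
    return 200;
  }

  // Periodic sweep: re-queue expired claims and report turns nobody has claimed.
  sweep(now: number): { requeued: string[]; idle: string[] } {
    const requeued: string[] = [];
    const idle: string[] = [];
    this.turns.forEach((turn, turnId) => {
      if (turn.status === "claimed" && now - turn.since > this.claimTimeoutMs) {
        turn.status = "open";
        turn.claimId = null;
        turn.since = now;
        requeued.push(turnId);
      } else if (turn.status === "open" && now - turn.since > this.idleTimeoutMs) {
        idle.push(turnId); // the caller would notify the initiator
      }
    });
    return { requeued, idle };
  }
}
```

A slow agent whose claim expired gets 409 on submit, while the agent holding the newer claim gets 200, so a late result can never overwrite the current holder's work.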

Workflow Initiation: API Only

Cloud workflows are purely reactive, so there is no need to bring in the sense/reflex concepts. POST /workflows to create a thread is enough.

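Result Aggregation

The moderator defines a terminal state. When it is reached, the last message (or a moderator-produced summary) becomes the workflow result, which is written back to D1, and the initiator is notified (webhook or polling).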
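
The selection rule can be sketched as a small pure function. The type and field names (`ThreadMessage`, `TerminalDecision`, `summary`) are assumptions; persisting to D1 and notifying the initiator are left out.

```typescript
interface ThreadMessage {
  role: string;
  text: string;
}

interface TerminalDecision {
  terminal: boolean;
  summary?: string; // optional moderator-produced summary
}

// Returns the workflow result once the terminal state is reached, otherwise null.
function workflowResult(messages: ThreadMessage[], decision: TerminalDecision): string | null {
  if (!decision.terminal) return null;
  if (decision.summary !== undefined) return decision.summary;
  const lastMessage = messages[messages.length - 1];
  return lastMessage === undefined ? null : lastMessage.text;
}
```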

Observability

The thread_id is a natural trace ID. Every turn event carries timestamp + agent_id + role, which forms an audit log in D1. Adding GET /threads/:id/trace is all that is needed.
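
A sketch of what the trace endpoint might compute from the audit rows. The `TurnAuditRow` shape and the output format are assumptions; only the three fields and the thread id come from the comment.

```typescript
interface TurnAuditRow {
  threadId: string;
  timestamp: string; // ISO 8601
  agentId: string;
  role: string;
}

// Filters the audit log to one thread and orders it by time.
function buildTrace(rows: TurnAuditRow[], threadId: string): string[] {
  return rows
    .filter((row) => row.threadId === threadId) // filter copies, so sort leaves the input untouched
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp))
    .map((row, index) => `${index + 1}. ${row.timestamp} ${row.role} (${row.agentId})`);
}
```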

Unified Workflow Naming: Git Semantics

Local and cloud workflows are managed together, using git-branch-style names:

$ nerve workflow list
  code-review          # local
  deploy-check         # local
  origin/code-review   # remote
  origin/design-critique

No prefix = local; prefixed names are reserved for remotes
The default remote is called origin
nerve workflow run origin/code-review → POST to nerveflare to create a thread
nerve workflow logs origin/code-review#thread-123 → view the thread history

Config leaves room for multiple remotes (only origin is implemented at first):

remotes:
  origin: https://nerveflare.shazhou.workers.dev

Workflow definitions are unified as YAML, distinguished by binding: local or binding: cloud.
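
The naming rule can be sketched as a small parser. `WorkflowRef` and `parseWorkflowRef` are hypothetical names; the grammar (optional `remote/` prefix, optional `#thread` suffix) is inferred from the examples in this comment.

```typescript
interface WorkflowRef {
  remote: string | null; // null means a local workflow
  workflow: string;
  threadId: string | null;
}

// Parses "code-review", "origin/code-review", or "origin/code-review#thread-123".
function parseWorkflowRef(ref: string): WorkflowRef {
  const hashIndex = ref.indexOf("#");
  const path = hashIndex === -1 ? ref : ref.slice(0, hashIndex);
  const threadId = hashIndex === -1 ? null : ref.slice(hashIndex + 1);
  const slashIndex = path.indexOf("/");
  if (slashIndex === -1) {
    return { remote: null, workflow: path, threadId };
  }
  return {
    remote: path.slice(0, slashIndex),
    workflow: path.slice(slashIndex + 1),
    threadId,
  };
}
```

The CLI could then look up `remote` in the `remotes` config to find the base URL, and treat a null remote as a local run.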

Moderator Format

Start with JSONata and see whether it is enough. Avoid abstracting a DSL prematurely.

CapacitySense

Suggest skipping this in the first version: hardcode max concurrent turns, then optimize once it is running and the real bottlenecks are visible.


xiaomo referenced this issue

2026-04-25 03:36:36 +00:00

CLI 统一设计：Local + Remote Workflow (git-style 命名) — Khala #120

tuanzi changed title from ~~RFC: Nerveflare — Stateless Agent Pool Cloud Workflow Orchestrator~~ to RFC: Khala — Stateless Agent Pool Cloud Workflow Orchestrator

2026-04-25 03:57:19 +00:00

tuanzi commented

2026-04-25 03:58:01 +00:00

Naming Update

Renamed from Nerveflare to Khala (卡拉).

Inspired by StarCraft's Protoss Khala — a psychic link connecting all individuals as equals, sharing knowledge and consciousness.

nerve = individual agent's nervous system (local sensing)
khala = the shared consciousness network (cross-agent workflow orchestration)

Package: packages/khala
Deployment: khala.shazhou.workers.dev


tuanzi referenced this issue

2026-04-25 04:33:39 +00:00

Khala Phase 0: Project scaffolding — CF Worker + Hono + D1 #124

tuanzi referenced this issue

2026-04-25 04:33:40 +00:00

Khala Phase 1: D1 schema & data access layer #125

tuanzi referenced this issue

2026-04-25 04:33:41 +00:00

Khala Phase 2: Agent auth middleware & admin API #127

tuanzi referenced this issue

2026-04-25 04:33:43 +00:00

Khala Phase 3: Workflow engine — ThreadDO + JSONata moderator #128

tuanzi referenced this issue

2026-04-25 04:33:45 +00:00

Khala Phase 4: Task queue API + timeout sweep #129

This repo is archived. You cannot comment on issues.

Reference: uncaged/nerve#119