refactor: unify .data.jsonl row format with inline $cas pointers #154

Closed
opened 2026-05-09 05:47:32 +00:00 by xiaoju · 0 comments
Owner

Background

Currently, .data.jsonl has three different row formats: a start record (with parameters.prompt inline), role steps (with contentHash + refs[]), and an end/completion record (yet another shape). This makes GC scanning, tooling, and replay unnecessarily complex.

Proposal: Unified Row Format

Every line in .data.jsonl becomes:

{ "role": "<string>", "content": <CasRef | string>, "meta": <object>, "timestamp": <number> }

$cas Pointer

Instead of a top-level refs[] array, CAS references are expressed inline wherever they appear:

{ "$cas": "<hash>" }

This applies to content and any value inside meta.
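As a rough sketch of how the row and pointer shapes above might look as TypeScript types: the names CasRef and DataRow come from this issue, while the isCasRef guard and its strictness (exactly one key, string value) are assumptions made for illustration.

```typescript
// Sketch only: type names follow this proposal, not the actual workflow-protocol source.
type CasRef = { $cas: string };

interface DataRow {
  role: string;
  content: CasRef | string;
  meta: Record<string, unknown>;
  timestamp: number;
}

// Hypothetical type guard. Assumption: a pointer is an object whose only key
// is "$cas" and whose value is a string.
function isCasRef(value: unknown): value is CasRef {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const record = value as Record<string, unknown>;
  const keys = Object.keys(record);
  return keys.length === 1 && keys[0] === "$cas" && typeof record["$cas"] === "string";
}

// Example row using the unified shape.
const exampleRow: DataRow = {
  role: "coder",
  content: { $cas: "DEF456" },
  meta: { plan: { $cas: "GHI789" } },
  timestamp: 1235,
};
```

A strict guard like this rejects objects that merely happen to contain a "$cas" key alongside other fields; whether the real format should be that strict is an open choice.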

Examples

{"role":"__start__","content":{"$cas":"PROMPT_HASH"},"meta":{"maxRounds":10,"name":"solve-issue","hash":"ABC"},"timestamp":1234}
{"role":"coder","content":{"$cas":"DEF456"},"meta":{"plan":{"$cas":"GHI789"}},"timestamp":1235}
{"role":"__end__","content":{"$cas":"SUMMARY_HASH"},"meta":{"returnCode":0},"timestamp":1236}

Benefits

  1. GC simplicity: recursively scan all { "$cas": hash } in any row to find live refs. No need to understand per-role schema.
  2. Uniform parsing: one row parser for all record types.
  3. Refs co-located with usage: $cas appears exactly where the data is used, not in a detached refs[] array.

Design Decisions Needed

Content field strategy

  • Option A (Always CAS): content is always { "$cas": hash }, even for tiny strings. Simplest and most uniform.
  • Option B (Threshold): small content (< 256 bytes) can be an inline string; larger content uses { "$cas": hash }. More compact, but adds a branching path.
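To make the trade-off between the two options concrete, here is a minimal sketch of both write paths. Several things here are assumptions: SHA-256 hex is used only as a stand-in for whatever hash the CAS actually uses, the put callback is a hypothetical way to store an object, and the function names are invented for illustration.

```typescript
import { createHash } from "node:crypto";

type CasRef = { $cas: string };

// Threshold from Option B in this issue.
const INLINE_THRESHOLD_BYTES = 256;

// Assumption: SHA-256 hex as a placeholder for the real CAS hash function.
function hashOf(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hypothetical callback that stores text in CAS under the given hash.
type PutFn = (hash: string, text: string) => void;

// Option A: every piece of content goes through CAS.
function encodeAlwaysCas(text: string, put: PutFn): CasRef {
  const hash = hashOf(text);
  put(hash, text);
  return { $cas: hash };
}

// Option B: content under the threshold stays inline; the rest goes to CAS.
function encodeWithThreshold(text: string, put: PutFn): CasRef | string {
  if (Buffer.byteLength(text, "utf8") < INLINE_THRESHOLD_BYTES) {
    return text;
  }
  return encodeAlwaysCas(text, put);
}
```

Option B costs one extra branch on write, and every reader then has to handle both a string and a pointer, which is the branching path mentioned above.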

Changes Required

workflow-protocol

  • Remove StartStep.content: string — replace with content: CasRef | string
  • Remove RoleStep.contentHash + refs[] — replace with content: CasRef, refs become $cas inside meta
  • Add CasRef type: { $cas: string }
  • Unify all step types into one DataRow type

workflow-execute

  • engine.ts: write start/end rows in unified format, put prompt into CAS before writing
  • fork-thread.ts: parse unified format when restoring history
  • gc.ts: simplify — just recurse all $cas pointers in each row

workflow-runtime

  • createWorkflow: receive contentHash in start instead of content string
  • All ctx.start.content reads → resolve from CAS
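Resolving content from CAS on the read side might look roughly like the following. The CasStore interface and resolveContent are hypothetical and kept synchronous for brevity; the real workflow-cas API is not shown in this issue and may well be asynchronous.

```typescript
type CasRef = { $cas: string };

// Hypothetical minimal store interface, not the actual workflow-cas API.
interface CasStore {
  get(hash: string): string | undefined;
}

// Returns the text behind content, whether it is inline or a $cas pointer.
function resolveContent(content: CasRef | string, cas: CasStore): string {
  if (typeof content === "string") {
    return content; // inline string, only possible if Option B is chosen
  }
  const value = cas.get(content.$cas);
  if (value === undefined) {
    throw new Error(`CAS object not found: ${content.$cas}`);
  }
  return value;
}

// In-memory store used only to demonstrate the call.
const demoObjects = new Map<string, string>([["PROMPT_HASH", "Fix the failing test"]]);
const demoStore: CasStore = { get: (hash) => demoObjects.get(hash) };
const prompt = resolveContent({ $cas: "PROMPT_HASH" }, demoStore);
```

Centralizing the lookup in one helper would keep templates and agents from handling pointers directly.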

workflow-cas

  • Add helper: findCasRefs(obj: unknown): string[] — recursively extract all $cas hashes from any object
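A possible shape for this helper, together with the GC scan it enables, is sketched below. The signature findCasRefs(obj: unknown): string[] is the one proposed in this issue; the traversal details and the liveHashes function are assumptions. In particular, this version treats any object with a string "$cas" property as a pointer, even if it has other keys.

```typescript
// Recursively collects every $cas hash found anywhere inside obj.
function findCasRefs(obj: unknown): string[] {
  const found: string[] = [];

  const visit = (value: unknown): void => {
    if (Array.isArray(value)) {
      for (const item of value) visit(item);
      return;
    }
    if (typeof value !== "object" || value === null) return;

    const record = value as Record<string, unknown>;
    const hash = record["$cas"];
    if (typeof hash === "string") found.push(hash);

    // Keep descending so pointers nested at any depth are found.
    for (const child of Object.values(record)) visit(child);
  };

  visit(obj);
  return found;
}

// Hypothetical GC helper: the set of live hashes referenced by a .data.jsonl text.
function liveHashes(jsonl: string): Set<string> {
  const live = new Set<string>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    for (const hash of findCasRefs(JSON.parse(line))) live.add(hash);
  }
  return live;
}
```

Because the scan never looks at role or at the layout of meta, new meta fields that carry pointers would be picked up without touching gc.ts.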

Templates & Agents

  • Update all ctx.start.content usages
  • Update meta schemas if they currently reference refs

Breaking Change

This is intentionally a breaking change to the .data.jsonl format. Old JSONL files will not be compatible. This is acceptable per project convention (no backward-compatibility fallbacks).

Ref

小橘 🍊(NEKO Team)


Reference: uncaged/workflow#154