#71 - feat: builtin judges — frontmatter, upstream, hallucination, token-stats - united-workforce

xiaoju commented

2026-06-04 15:13:23 +00:00

Owner

Phase 1c of eval framework (#34)

Implement 4 builtin judges:

1. frontmatter-compliance (deterministic)

Read each step node from CAS via thread-id
Check: $status exists and is a valid enum value; required fields are present per the role schema
Output: {stepsTotal, stepsValid, invalidSteps: [{stepIndex, role, errors}]}
Score: stepsValid / stepsTotal
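
A minimal sketch of the check above, assuming the step nodes have already been loaded from CAS into plain dicts. The `VALID_STATUSES` values, the `REQUIRED_FIELDS` table, and the `role`/`frontmatter` keys are hypothetical placeholders, since the real enum and role schemas are not listed in this issue. The score for an empty thread (1.0) is also an assumption.

```python
from typing import Any

# Hypothetical values: the real $status enum and role schemas are not given in the issue.
VALID_STATUSES = {"done", "failed", "blocked"}
REQUIRED_FIELDS = {"planner": ["plan"], "coder": ["files_changed"]}


def judge_frontmatter(steps: list[dict[str, Any]]) -> dict[str, Any]:
    """Validate each step's frontmatter and return the judge output plus score."""
    invalid = []
    for index, step in enumerate(steps):
        role = step.get("role", "")
        frontmatter = step.get("frontmatter", {})
        errors = []

        # $status must exist and be one of the allowed enum values.
        status = frontmatter.get("$status")
        if status is None:
            errors.append("missing $status")
        elif status not in VALID_STATUSES:
            errors.append(f"invalid $status: {status!r}")

        # Every field required by the role schema must be present.
        for field in REQUIRED_FIELDS.get(role, []):
            if field not in frontmatter:
                errors.append(f"missing required field: {field}")

        if errors:
            invalid.append({"stepIndex": index, "role": role, "errors": errors})

    total = len(steps)
    valid = total - len(invalid)
    # Assumption: a thread with no steps scores 1.0 rather than dividing by zero.
    score = valid / total if total else 1.0
    return {
        "stepsTotal": total,
        "stepsValid": valid,
        "invalidSteps": invalid,
        "score": score,
    }
```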

2. upstream-consumption (LLM-as-judge)

For each step N>0: extract key outputs from step N-1, then ask the LLM whether step N references or uses them
Output: {perStep: [{role, consumed, missed, score}]}
Score: average of per-step scores
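
One possible shape for this judge, with the LLM calls injected as callables so the scoring logic can be tested without a provider. `extract_key_outputs` and `is_used` are hypothetical wrappers around the LLM, and the `output` key on each step is an assumed field name. The per-step score (consumed / total key outputs) is also an assumption, since the issue only specifies the average.

```python
from typing import Any, Callable


def judge_upstream(
    steps: list[dict[str, Any]],
    extract_key_outputs: Callable[[str], list[str]],
    is_used: Callable[[str, str], bool],
) -> dict[str, Any]:
    """Score how much of step N-1's key output is referenced by step N."""
    per_step = []
    # Pair each step with its predecessor; step 0 has no upstream and is skipped.
    for previous, current in zip(steps, steps[1:]):
        keys = extract_key_outputs(previous["output"])
        consumed = [k for k in keys if is_used(k, current["output"])]
        missed = [k for k in keys if k not in consumed]
        # Assumption: a step with nothing to consume gets a perfect score.
        score = len(consumed) / len(keys) if keys else 1.0
        per_step.append(
            {
                "role": current.get("role", ""),
                "consumed": consumed,
                "missed": missed,
                "score": score,
            }
        )

    # Overall score is the average of per-step scores, as stated in the issue.
    overall = sum(s["score"] for s in per_step) / len(per_step) if per_step else 1.0
    return {"perStep": per_step, "score": overall}
```

In tests, the two callables can be replaced with simple string-matching stubs; in production they would wrap whatever provider the open question below settles on.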

3. hallucination (LLM-as-judge)

For each step: ask LLM to identify references to non-existent files/functions/variables given the cwd contents
Output: {perStep: [{role, hallucinations, score}]}
Score: average of per-step scores (1.0 = no hallucinations)
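
A rough sketch of the aggregation, again with the LLM call injected. `find_hallucinations` is a hypothetical wrapper that takes a step's output and the cwd listing and returns the references it believes do not exist. The issue only fixes 1.0 as the no-hallucination score, so the per-step formula `1 / (1 + count)` used here is an assumption, as is the `output` field name.

```python
from typing import Any, Callable


def judge_hallucination(
    steps: list[dict[str, Any]],
    cwd_files: list[str],
    find_hallucinations: Callable[[str, list[str]], list[str]],
) -> dict[str, Any]:
    """Score each step by how many non-existent references it contains."""
    per_step = []
    for step in steps:
        found = find_hallucinations(step["output"], cwd_files)
        # Assumption: the score decays with the number of hallucinations,
        # and is exactly 1.0 when none are found.
        score = 1.0 / (1 + len(found))
        per_step.append(
            {"role": step.get("role", ""), "hallucinations": found, "score": score}
        )

    # Overall score is the average of per-step scores.
    overall = sum(s["score"] for s in per_step) / len(per_step) if per_step else 1.0
    return {"perStep": per_step, "score": overall}
```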

4. token-stats (informational, weight=0)

Read $usage from each step node (requires #68)
Output: {totalInput, totalOutput, totalTurns, perStep: [{role, input, output, turns, duration}]}
Score: always 1.0 (informational only)
Graceful fallback: report zeros if $usage is not present
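
The aggregation and fallback could look roughly like this. The field names inside `$usage` (`input`, `output`, `turns`, `duration`) are assumed to mirror the output shape above; the actual layout is defined by #68 and may differ.

```python
from typing import Any


def judge_token_stats(steps: list[dict[str, Any]]) -> dict[str, Any]:
    """Collect token usage per step; informational only, so the score is fixed."""
    per_step = []
    for step in steps:
        # Graceful fallback: a missing or empty $usage yields zeros.
        usage = step.get("$usage") or {}
        per_step.append(
            {
                "role": step.get("role", ""),
                "input": usage.get("input", 0),
                "output": usage.get("output", 0),
                "turns": usage.get("turns", 0),
                "duration": usage.get("duration", 0),
            }
        )

    return {
        "totalInput": sum(s["input"] for s in per_step),
        "totalOutput": sum(s["output"] for s in per_step),
        "totalTurns": sum(s["turns"] for s in per_step),
        "perStep": per_step,
        # Always 1.0: this judge carries weight 0 and never affects the total.
        "score": 1.0,
    }
```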

Open question

LLM-as-judge provider config: reuse the uwf config.yaml, or add an eval-specific config?

Depends on: #69
Ref: #34

— 小橘 🍊（NEKO Team）
