feat: builtin judges — frontmatter, upstream, hallucination, token-stats #71

Closed
opened 2026-06-04 15:13:23 +00:00 by xiaoju · 0 comments
Owner

Phase 1c of eval framework (#34)

Implement 4 builtin judges:

1. frontmatter-compliance (deterministic)

  • Read each step node from CAS via thread-id
  • Check: $status exists and is a valid enum value, and all fields required by the role's schema are present
  • Output: {stepsTotal, stepsValid, invalidSteps: [{stepIndex, role, errors}]}
  • Score: stepsValid / stepsTotal
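
A minimal sketch of the deterministic check described above, assuming each step has already been loaded from CAS into a dict with `role` and `frontmatter` keys. The enum values in `VALID_STATUSES` and the per-role `REQUIRED_FIELDS` are hypothetical placeholders, since the issue does not list them. Scoring an empty thread as 1.0 is also an assumption.

```python
from typing import Any

# Hypothetical values: the issue names neither the $status enum nor the role schemas.
VALID_STATUSES = {"done", "failed", "blocked"}
REQUIRED_FIELDS = {"planner": ["plan"], "coder": ["files"]}


def judge_frontmatter(steps: list[dict[str, Any]]) -> tuple[dict[str, Any], float]:
    """Return (output, score) in the shape proposed for frontmatter-compliance."""
    invalid_steps = []
    for index, step in enumerate(steps):
        role = step.get("role", "")
        frontmatter = step.get("frontmatter", {})
        errors = []
        # $status must exist and be one of the allowed enum values.
        if "$status" not in frontmatter:
            errors.append("missing $status")
        elif frontmatter["$status"] not in VALID_STATUSES:
            errors.append(f"invalid $status: {frontmatter['$status']!r}")
        # Every field required by the role's schema must be present.
        for field in REQUIRED_FIELDS.get(role, []):
            if field not in frontmatter:
                errors.append(f"missing required field: {field}")
        if errors:
            invalid_steps.append({"stepIndex": index, "role": role, "errors": errors})

    total = len(steps)
    valid = total - len(invalid_steps)
    output = {"stepsTotal": total, "stepsValid": valid, "invalidSteps": invalid_steps}
    # Assumption: a thread with no steps scores 1.0 rather than dividing by zero.
    score = valid / total if total else 1.0
    return output, score
```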

2. upstream-consumption (LLM-as-judge)

  • For each step N>0: extract key outputs from step N-1, ask LLM if step N references/uses them
  • Output: {perStep: [{role, consumed, missed, score}]}
  • Score: average of per-step scores
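
One way the per-step loop and averaging could look, as a sketch only. The LLM call is injected as a hypothetical `uses(key_output, step_text)` callable so the provider question stays open, and each step is assumed to carry a pre-extracted `keyOutputs` list and a `body` string. The per-step formula (consumed divided by total key outputs) is an assumption; the issue only specifies the final average.

```python
from typing import Any, Callable


def judge_upstream_consumption(
    steps: list[dict[str, Any]],
    uses: Callable[[str, str], bool],
) -> tuple[dict[str, Any], float]:
    """Return (output, score) for upstream-consumption.

    `uses` stands in for the LLM judgment: does the step text reference
    or use the given key output from the previous step?
    """
    per_step = []
    # Pair each step N > 0 with its predecessor N - 1.
    for previous, current in zip(steps, steps[1:]):
        keys = previous["keyOutputs"]
        consumed = [key for key in keys if uses(key, current["body"])]
        missed = [key for key in keys if key not in consumed]
        # Assumption: a step with nothing to consume scores 1.0.
        score = len(consumed) / len(keys) if keys else 1.0
        per_step.append(
            {"role": current["role"], "consumed": consumed, "missed": missed, "score": score}
        )

    # Overall score is the average of per-step scores (1.0 if there is only one step).
    overall = sum(entry["score"] for entry in per_step) / len(per_step) if per_step else 1.0
    return {"perStep": per_step}, overall
```

In tests, `uses` can be a plain substring check; in production it would wrap whichever LLM provider is chosen.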

3. hallucination (LLM-as-judge)

  • For each step: ask LLM to identify references to non-existent files/functions/variables given the cwd contents
  • Output: {perStep: [{role, hallucinations, score}]}
  • Score: average of per-step scores (1.0 = no hallucinations)
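
A rough sketch of how this judge might build its prompt and aggregate results. Everything here is an assumption beyond the output shape: the prompt wording, the injected `ask_llm(prompt) -> str` callable, the convention that the LLM replies with a JSON array of hallucinated references, and the binary per-step score (1.0 if the array is empty, else 0.0).

```python
import json
from typing import Any, Callable


def build_prompt(step_text: str, cwd_files: list[str]) -> str:
    """Hypothetical prompt: list the cwd contents, then the step output to inspect."""
    listing = "\n".join(sorted(cwd_files))
    return (
        "Files in the working directory:\n" + listing
        + "\n\nStep output:\n" + step_text
        + "\n\nReturn a JSON array of files, functions, or variables that the "
        "step output references but that do not exist. Return [] if there are none."
    )


def judge_hallucination(
    steps: list[dict[str, Any]],
    cwd_files: list[str],
    ask_llm: Callable[[str], str],
) -> tuple[dict[str, Any], float]:
    """Return (output, score) for the hallucination judge."""
    per_step = []
    for step in steps:
        reply = ask_llm(build_prompt(step["body"], cwd_files))
        hallucinations = json.loads(reply)
        if not isinstance(hallucinations, list):
            raise ValueError("expected a JSON array from the LLM")
        # Assumption: binary per-step score; the issue does not define a formula.
        score = 1.0 if not hallucinations else 0.0
        per_step.append(
            {"role": step["role"], "hallucinations": hallucinations, "score": score}
        )

    # Assumption: an empty thread scores 1.0.
    overall = sum(entry["score"] for entry in per_step) / len(per_step) if per_step else 1.0
    return {"perStep": per_step}, overall
```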

4. token-stats (informational, weight=0)

  • Read $usage from each step node (requires #68)
  • Output: {totalInput, totalOutput, totalTurns, perStep: [{role, input, output, turns, duration}]}
  • Score: always 1.0 (informational only)
  • Graceful fallback: zeros if $usage not present
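
The aggregation and zero fallback could be sketched as follows. The field names inside `$usage` (`input`, `output`, `turns`, `duration`) are assumed to mirror the per-step output shape; the actual layout depends on #68, and the unit of `duration` is not specified here.

```python
from typing import Any


def judge_token_stats(steps: list[dict[str, Any]]) -> tuple[dict[str, Any], float]:
    """Return (output, score) for token-stats; the score is always 1.0."""
    per_step = []
    for step in steps:
        # Graceful fallback: a missing or empty $usage yields zeros.
        usage = step.get("$usage") or {}
        per_step.append(
            {
                "role": step.get("role", ""),
                "input": usage.get("input", 0),
                "output": usage.get("output", 0),
                "turns": usage.get("turns", 0),
                "duration": usage.get("duration", 0),
            }
        )

    output = {
        "totalInput": sum(entry["input"] for entry in per_step),
        "totalOutput": sum(entry["output"] for entry in per_step),
        "totalTurns": sum(entry["turns"] for entry in per_step),
        "perStep": per_step,
    }
    # Informational only: weight=0, so the score is fixed at 1.0.
    return output, 1.0
```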

Open question

  • LLM-as-judge provider config: reuse the uwf config.yaml, or introduce an eval-specific config?

Depends on: #69
Ref: #34

— 小橘 🍊(NEKO Team)

Reference: shazhou/united-workforce#71