feat: uwf-eval-tasks monorepo + first task (fix-off-by-one) #73

Closed
opened 2026-06-04 15:13:42 +00:00 by xiaoju · 0 comments
Owner

Phase 2 of eval framework (#34)

Create shazhou/uwf-eval-tasks monorepo (new repo) with proman scaffold:

2a: Repo scaffold

  • Init monorepo: pnpm-workspace.yaml, tsconfig, biome, proman
  • Task package template (packages/_template/)

2b: First task — fix-off-by-one

  • packages/fix-off-by-one/package.json: package name @uwf-eval/fix-off-by-one
  • fixture/ — buggy calculator repo (src/calc.ts with off-by-one, src/calc.test.ts that fails)
  • task.yaml — workflow: solve-issue, prompt, limits, judge declarations
  • src/judges/test-pass.ts — runs pnpm test in cwd, checks exit code
  • src/judges/code-quality.ts — LLM judge: minimal change, correct fix
  • schemas/test-pass.json + schemas/code-quality.json — OCAS schemas for judge data
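
As a rough illustration of the test-pass judge described above, the sketch below runs a command in a working directory and checks its exit code. The `TestPassResult` shape and the `runTestPass` name are hypothetical; the actual judge interface and the OCAS schema in schemas/test-pass.json are not shown in this issue, so treat this as one possible starting point rather than the intended implementation.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical result shape; the real schema (schemas/test-pass.json) is not shown here.
interface TestPassResult {
  passed: boolean;
  exitCode: number | null;
}

// Runs a command in `cwd` and reports whether it exited with code 0.
// The defaults mirror the "pnpm test" step from the checklist; the command is
// a parameter (an assumption of this sketch) so it can be swapped in tests.
function runTestPass(
  cwd: string,
  command: string = "pnpm",
  args: string[] = ["test"],
): TestPassResult {
  const proc = spawnSync(command, args, { cwd, stdio: "ignore" });
  // `status` is null when the process was killed by a signal or could not be spawned.
  return { passed: proc.status === 0, exitCode: proc.status };
}
```

Calling `runTestPass(fixtureDir)` against the unfixed fixture should report `passed: false`, since src/calc.test.ts is expected to fail until the agent fixes the bug.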

2c: End-to-end validation

  • uwf-eval run packages/fix-off-by-one --agent hermes completes and stores result

task.yaml format

name: fix-off-by-one
description: Fix an off-by-one error in calculator add function
workflow: solve-issue
prompt: "Fix the bug: add(1,2) returns 4 instead of 3"
limits:
  maxSteps: 15
  timeoutMinutes: 30
judges:
  - name: frontmatter-compliance
    weight: 0.15
    builtin: true
  - name: upstream-consumption
    weight: 0.15
    builtin: true
  - name: hallucination
    weight: 0.1
    builtin: true
  - name: token-stats
    weight: 0
    builtin: true
  - name: test-pass
    weight: 0.3
    entry: dist/judges/test-pass.js
    schema: schemas/test-pass.json
  - name: code-quality
    weight: 0.3
    entry: dist/judges/code-quality.js
    schema: schemas/code-quality.json
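
The judge weights above add up to 1.0, with token-stats at weight 0. A minimal sketch of how such weights might be checked and combined follows. It assumes each judge yields a score in [0, 1] and that the final score is a plain weighted sum; neither is stated in this issue, and `JudgeDecl`, `totalWeight`, and `weightedScore` are hypothetical names.

```typescript
// Hypothetical shape: only the YAML is given in the issue, not the framework's types.
interface JudgeDecl {
  name: string;
  weight: number;
}

// Weights copied from the task.yaml above.
const judges: JudgeDecl[] = [
  { name: "frontmatter-compliance", weight: 0.15 },
  { name: "upstream-consumption", weight: 0.15 },
  { name: "hallucination", weight: 0.1 },
  { name: "token-stats", weight: 0 },
  { name: "test-pass", weight: 0.3 },
  { name: "code-quality", weight: 0.3 },
];

// Sum of declared weights. Floating-point addition is inexact, so callers
// should compare against 1 with a small tolerance rather than with ===.
function totalWeight(decls: JudgeDecl[]): number {
  return decls.reduce((sum, j) => sum + j.weight, 0);
}

// Weighted sum of per-judge scores; a judge with no reported score counts as 0.
// Assumes scores are in [0, 1] (an assumption of this sketch).
function weightedScore(
  decls: JudgeDecl[],
  scores: Record<string, number>,
): number {
  return decls.reduce((sum, j) => sum + j.weight * (scores[j.name] ?? 0), 0);
}
```

Under these assumptions, a zero-weight judge such as token-stats can still record data without moving the final score.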

Ref: #34

— 小橘 🍊(NEKO Team)

Reference: shazhou/united-workforce#73