Skill Evaluation
Evaluate skill quality by running test cases against your skill instructions. The evaluation pipeline compares “with skill” versus “baseline” outputs and auto-grades each test case using LLM-as-judge assertions.
Overview
Skill evaluation integrates with the Prompt Comparison system to measure how much a skill improves agent output quality. Each evaluation:
- Creates a PromptComparison with two variants: with-skill (includes skill instructions) and baseline (no skill)
- Runs both variants against your test cases
- Auto-grades assertions using LLM-as-judge after all runs complete
- Computes benchmark stats: pass rate delta, latency comparison
Evaluation File Format
Skills store test cases in `evals/evals.json` following the agentskills.io specification:
```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Review this function for bugs",
      "expected_output": "A list of identified issues with severity and fix suggestions",
      "files": ["evals/files/sample.py"],
      "assertions": [
        "The output identifies at least one bug",
        "Each issue includes a suggested fix",
        "Output is structured with clear sections"
      ]
    }
  ]
}
```

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique numeric identifier |
| `prompt` | Yes | The user task to evaluate |
| `expected_output` | Yes | Description of what correct output looks like |
| `files` | No | Referenced files for context |
| `assertions` | Yes | Grading criteria (pass/fail per assertion) |
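The schema above can be captured as a TypeScript type with a small structural check. This is a sketch: the field names follow the table, but the types and the validator are illustrative, not the actual implementation:

```typescript
// Sketch of the evals.json schema described above (agentskills.io spec).
interface SkillEval {
  id: number;              // unique numeric identifier (required)
  prompt: string;          // the user task to evaluate (required)
  expected_output: string; // what correct output looks like (required)
  files?: string[];        // referenced files for context (optional)
  assertions: string[];    // grading criteria, judged pass/fail (required)
}

interface EvalsFile {
  skill_name: string;
  evals: SkillEval[];
}

// Minimal structural check before saving or running an evaluation
// (illustrative; not the actual validator).
function validateEvals(data: EvalsFile): string[] {
  const errors: string[] = [];
  const seen = new Set<number>();
  for (const e of data.evals) {
    if (seen.has(e.id)) errors.push(`duplicate id ${e.id}`);
    seen.add(e.id);
    if (!e.prompt.trim()) errors.push(`eval ${e.id}: empty prompt`);
    if (e.assertions.length === 0) errors.push(`eval ${e.id}: no assertions`);
  }
  return errors;
}
```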
Workflow
Create Test Cases
Open the skill editor, switch to the Evaluation tab, and click Add Test Case. Each test case needs:
- A realistic user prompt
- An expected output description
- One or more assertions for grading
Click Save to write the test cases to evals/evals.json.
Run Evaluation
Click Run Evaluation to create a PromptComparison with two variants:
- With skill — Your skill instructions are prepended to each prompt
- Baseline — The same prompts without skill instructions
Then click Start Run to execute the comparison.
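In other words, the two variants differ only in whether the skill instructions are prepended. A sketch of that construction (the function and field names are assumptions, not the actual implementation):

```typescript
// Build the two comparison variants from skill instructions and test-case
// prompts (illustrative names; the real system works on PromptComparison records).
interface ComparisonVariant {
  name: "with-skill" | "baseline";
  prompts: string[];
}

function buildVariants(skillInstructions: string, evalPrompts: string[]): ComparisonVariant[] {
  return [
    // With skill: instructions are prepended to each prompt.
    { name: "with-skill", prompts: evalPrompts.map((p) => `${skillInstructions}\n\n${p}`) },
    // Baseline: the same prompts, unchanged.
    { name: "baseline", prompts: [...evalPrompts] },
  ];
}
```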
Review Results
Results appear inline in the Evaluation tab:
- Status banner — Shows running/completed/failed with a link to the full Prompt Compare view
- Benchmark summary — Pass rate and latency for both variants with delta
- Per-test results — Expandable rows showing each variant’s output and assertion pass/fail
Iterate
Update your skill instructions, add or modify test cases, then run a new evaluation. Each run creates a separate comparison, so you can track improvement over time.
Auto-Grading
When comparison runs complete, the system automatically:
- Detects skill eval datasets (columns include `assertions`)
- Parses assertions from each dataset row
- Calls `gradeAssertions()` using LLM-as-judge (low temperature for consistency)
- Records pass/fail with evidence in `SkillEvalGrading` records
- Updates the comparison summary with pass rates per variant
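Conceptually, the grading step iterates over a row's assertions and asks a judge model to verify each one against the run output. The sketch below injects the judge call, since the real `gradeAssertions()` signature is not shown here; the evidence strings are placeholders for what a real judge would return:

```typescript
// Sketch of LLM-as-judge assertion grading (judge call is injected; in the
// real pipeline it would be an LLM invocation at low temperature).
interface AssertionResult {
  index: number;
  passed: boolean;
  evidence: string;
}

type Judge = (assertion: string, output: string) => Promise<boolean>;

async function gradeAssertions(
  assertions: string[],
  output: string,
  judge: Judge,
): Promise<{ results: AssertionResult[]; passRate: number }> {
  const results: AssertionResult[] = [];
  for (const [index, assertion] of assertions.entries()) {
    const passed = await judge(assertion, output);
    // Placeholder evidence; a real judge would record its reasoning here.
    results.push({ index, passed, evidence: passed ? "criterion met" : "criterion not met" });
  }
  const passRate =
    results.length === 0 ? 0 : results.filter((r) => r.passed).length / results.length;
  return { results, passRate };
}
```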
Auto-grading happens server-side after runs complete. No manual grading step is needed — results appear automatically when you check back.
Benchmark Summary
The benchmark card shows three columns:
| Metric | With Skill | Baseline | Delta |
|---|---|---|---|
| Pass rate | 85% | 60% | +25% |
| Avg latency | 1.2s | 0.8s | +0.4s |
- Positive delta (green) means the skill improves output quality
- Negative delta (red) means the baseline outperforms the skill
- Latency increase is expected since skill instructions add tokens
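For the example table above, the deltas are straightforward arithmetic over the per-variant stats:

```typescript
// Deltas from the benchmark example above (85% vs 60% pass rate, 1.2s vs 0.8s latency).
const withSkill = { passRate: 0.85, avgLatencyS: 1.2 };
const baseline = { passRate: 0.6, avgLatencyS: 0.8 };

const passRateDelta = withSkill.passRate - baseline.passRate;      // +0.25 → "+25%"
const latencyDelta = withSkill.avgLatencyS - baseline.avgLatencyS; // +0.4s

console.log(`+${(passRateDelta * 100).toFixed(0)}%`, `+${latencyDelta.toFixed(1)}s`);
```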
Assertion Design
Write assertions that test specific, observable properties of the output:
Good assertions:
- “Output includes a summary section with key findings”
- “All code examples include error handling”
- “Response is under 500 words”
Weak assertions:
- “Output is good” (too vague for LLM-as-judge)
- “Better than baseline” (comparative, not absolute)
Start with 2-3 test cases covering the most common use case. Add edge cases after confirming the basic flow works.
API Reference
tRPC Endpoints
| Endpoint | Type | Description |
|---|---|---|
| `skills.getEvals` | Query | Load `evals.json` for a skill |
| `skills.saveEvals` | Mutation | Write `evals.json` (workspace skills only) |
| `skills.createEvalComparison` | Mutation | Create comparison from evals |
| `skills.startEvalRun` | Mutation | Start the comparison runner |
| `skills.getEvalResults` | Query | Fetch latest results with grading + benchmark |
| `skills.gradeSkillRun` | Mutation | Manually grade a single run |
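An end-to-end client flow chains these endpoints in the order the workflow describes. This is a sketch: the tRPC client setup and the exact input/output shapes (such as `{ skillId }`) are assumptions, not documented here:

```typescript
// Hypothetical client-side flow over the endpoints above
// (input shapes like { skillId } are assumed).
async function runSkillEvaluation(trpc: any, skillId: string) {
  // Load the skill's test cases (evals.json).
  const evals = await trpc.skills.getEvals.query({ skillId });
  if (!evals) throw new Error(`no evals found for skill ${skillId}`);

  // Create the with-skill vs baseline comparison, then start the runner.
  const comparison = await trpc.skills.createEvalComparison.mutate({ skillId });
  await trpc.skills.startEvalRun.mutate({ comparisonId: comparison.id });

  // Auto-grading happens server-side after runs complete; fetch latest results.
  return trpc.skills.getEvalResults.query({ skillId });
}
```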
Database Model
Grading results are stored in `SkillEvalGrading`, linked 1:1 with `PromptComparisonRun`:

```
SkillEvalGrading
├── promptComparisonRunId (unique, FK → PromptComparisonRun)
├── assertions (JSON array of assertion strings)
├── results (JSON array of {index, passed, evidence})
└── passRate (Float, 0-1)
```
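Given that `results` shape, `passRate` is just the fraction of entries with `passed: true`; a sketch:

```typescript
// Derive the stored passRate (0-1) from a SkillEvalGrading results array
// (shape taken from the model above; function name is illustrative).
interface GradingResult {
  index: number;
  passed: boolean;
  evidence: string;
}

function computePassRate(results: GradingResult[]): number {
  if (results.length === 0) return 0; // no assertions graded yet
  return results.filter((r) => r.passed).length / results.length;
}
```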