Skill Evaluation
Evaluate skill quality by running test cases against your skill instructions. The evaluation pipeline compares “with skill” versus “baseline” outputs and auto-grades each test case using LLM-as-judge assertions.
Overview
Skill evaluation integrates with the Prompt Comparison system to measure how much a skill improves agent output quality. Each evaluation:
- Creates a PromptComparison with two variants: with-skill (includes skill instructions) and baseline (no skill)
- Runs both variants against your test cases
- Auto-grades assertions using LLM-as-judge after all runs complete
- Computes benchmark stats: pass rate delta, latency comparison
Evaluation File Format
Skills store test cases in `evals/evals.json` following the agentskills.io specification:
```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Review this function for bugs",
      "expected_output": "A list of identified issues with severity and fix suggestions",
      "files": ["evals/files/sample.py"],
      "assertions": [
        "The output identifies at least one bug",
        "Each issue includes a suggested fix",
        "Output is structured with clear sections"
      ]
    }
  ]
}
```

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique numeric identifier |
| `prompt` | Yes | The user task to evaluate |
| `expected_output` | Yes | Description of what correct output looks like |
| `files` | No | Referenced files for context |
| `assertions` | Yes | Grading criteria (pass/fail per assertion) |
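The schema above can be captured as a TypeScript type with a small structural check. This is a sketch: the field names follow the table, but the types and the validator are illustrative, not the actual implementation:

```typescript
// Sketch of the evals.json schema described above (agentskills.io spec).
interface SkillEval {
  id: number;              // unique numeric identifier (required)
  prompt: string;          // the user task to evaluate (required)
  expected_output: string; // what correct output looks like (required)
  files?: string[];        // referenced files for context (optional)
  assertions: string[];    // grading criteria, judged pass/fail (required)
}

interface EvalsFile {
  skill_name: string;
  evals: SkillEval[];
}

// Minimal structural check before saving or running an evaluation
// (illustrative; not the actual validator).
function validateEvals(data: EvalsFile): string[] {
  const errors: string[] = [];
  const seen = new Set<number>();
  for (const e of data.evals) {
    if (seen.has(e.id)) errors.push(`duplicate id ${e.id}`);
    seen.add(e.id);
    if (!e.prompt.trim()) errors.push(`eval ${e.id}: empty prompt`);
    if (e.assertions.length === 0) errors.push(`eval ${e.id}: no assertions`);
  }
  return errors;
}
```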
Workflow
Create Test Cases
Open the skill editor, switch to the Evaluation tab, and click Add Test Case. Each test case needs:
- A realistic user prompt
- An expected output description
- One or more assertions for grading
Click Save to write the test cases to evals/evals.json.
Run Evaluation
Click Run Evaluation to create a PromptComparison with two variants:
- With skill — Your skill instructions are prepended to each prompt
- Baseline — The same prompts without skill instructions
Then click Start Run to execute the comparison.
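In other words, the two variants differ only in whether the skill instructions are prepended. A sketch of that construction (the function and field names are assumptions, not the actual implementation):

```typescript
// Build the two comparison variants from skill instructions and test-case
// prompts (illustrative names; the real system works on PromptComparison records).
interface ComparisonVariant {
  name: "with-skill" | "baseline";
  prompts: string[];
}

function buildVariants(skillInstructions: string, evalPrompts: string[]): ComparisonVariant[] {
  return [
    // With skill: instructions are prepended to each prompt.
    { name: "with-skill", prompts: evalPrompts.map((p) => `${skillInstructions}\n\n${p}`) },
    // Baseline: the same prompts, unchanged.
    { name: "baseline", prompts: [...evalPrompts] },
  ];
}
```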
Review Results
Results appear inline in the Evaluation tab:
- Status banner — Shows running/completed/failed with a link to the full Prompt Compare view
- Benchmark summary — Pass rate and latency for both variants with delta
- Per-test results — Expandable rows showing each variant’s output and assertion pass/fail
Iterate
Update your skill instructions, add or modify test cases, then run a new evaluation. Each run creates a separate comparison, so you can track improvement over time.
Auto-Grading
When comparison runs complete, the system automatically:
- Detects skill eval datasets (columns include `assertions`)
- Parses assertions from each dataset row
- Calls `gradeAssertions()` using LLM-as-judge (low temperature for consistency)
- Records pass/fail with evidence in `SkillEvalGrading` records
- Updates the comparison summary with pass rates per variant
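Conceptually, the grading step iterates over a row's assertions and asks a judge model to verify each one against the run output. The sketch below injects the judge call, since the real `gradeAssertions()` signature is not shown here; the evidence strings are placeholders for what a real judge would return:

```typescript
// Sketch of LLM-as-judge assertion grading (judge call is injected; in the
// real pipeline it would be an LLM invocation at low temperature).
interface AssertionResult {
  index: number;
  passed: boolean;
  evidence: string;
}

type Judge = (assertion: string, output: string) => Promise<boolean>;

async function gradeAssertions(
  assertions: string[],
  output: string,
  judge: Judge,
): Promise<{ results: AssertionResult[]; passRate: number }> {
  const results: AssertionResult[] = [];
  for (const [index, assertion] of assertions.entries()) {
    const passed = await judge(assertion, output);
    // Placeholder evidence; a real judge would record its reasoning here.
    results.push({ index, passed, evidence: passed ? "criterion met" : "criterion not met" });
  }
  const passRate =
    results.length === 0 ? 0 : results.filter((r) => r.passed).length / results.length;
  return { results, passRate };
}
```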
Auto-grading happens server-side after runs complete. No manual grading step is needed — results appear automatically when you check back.
Benchmark Summary
The benchmark card shows three columns:
| Metric | With Skill | Baseline | Delta |
|---|---|---|---|
| Pass rate | 85% | 60% | +25% |
| Avg latency | 1.2s | 0.8s | +0.4s |
- Positive delta (green) means the skill improves output quality
- Negative delta (red) means the baseline outperforms the skill
- Latency increase is expected since skill instructions add tokens
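For the example table above, the deltas are straightforward arithmetic over the per-variant stats:

```typescript
// Deltas from the benchmark example above (85% vs 60% pass rate, 1.2s vs 0.8s latency).
const withSkill = { passRate: 0.85, avgLatencyS: 1.2 };
const baseline = { passRate: 0.6, avgLatencyS: 0.8 };

const passRateDelta = withSkill.passRate - baseline.passRate;      // +0.25 → "+25%"
const latencyDelta = withSkill.avgLatencyS - baseline.avgLatencyS; // +0.4s

console.log(`+${(passRateDelta * 100).toFixed(0)}%`, `+${latencyDelta.toFixed(1)}s`);
```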
Assertion Design
Write assertions that test specific, observable properties of the output:
Good assertions:
- “Output includes a summary section with key findings”
- “All code examples include error handling”
- “Response is under 500 words”
Weak assertions:
- “Output is good” (too vague for LLM-as-judge)
- “Better than baseline” (comparative, not absolute)
Start with 2-3 test cases covering the most common use case. Add edge cases after confirming the basic flow works.
API Reference
tRPC Endpoints
| Endpoint | Type | Description |
|---|---|---|
| `skills.getEvals` | Query | Load `evals.json` for a skill |
| `skills.saveEvals` | Mutation | Write `evals.json` (workspace skills only) |
| `skills.createEvalComparison` | Mutation | Create comparison from evals |
| `skills.startEvalRun` | Mutation | Start the comparison runner |
| `skills.getEvalResults` | Query | Fetch latest results with grading + benchmark |
| `skills.gradeSkillRun` | Mutation | Manually grade a single run |
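An end-to-end client flow chains these endpoints in the order the workflow describes. This is a sketch: the tRPC client setup and the exact input/output shapes (such as `{ skillId }`) are assumptions, not documented here:

```typescript
// Hypothetical client-side flow over the endpoints above
// (input shapes like { skillId } are assumed).
async function runSkillEvaluation(trpc: any, skillId: string) {
  // Load the skill's test cases (evals.json).
  const evals = await trpc.skills.getEvals.query({ skillId });
  if (!evals) throw new Error(`no evals found for skill ${skillId}`);

  // Create the with-skill vs baseline comparison, then start the runner.
  const comparison = await trpc.skills.createEvalComparison.mutate({ skillId });
  await trpc.skills.startEvalRun.mutate({ comparisonId: comparison.id });

  // Auto-grading happens server-side after runs complete; fetch latest results.
  return trpc.skills.getEvalResults.query({ skillId });
}
```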
Database Model
Grading results are stored in `SkillEvalGrading`, linked 1:1 with `PromptComparisonRun`:

```
SkillEvalGrading
├── promptComparisonRunId (unique, FK → PromptComparisonRun)
├── assertions (JSON array of assertion strings)
├── results (JSON array of {index, passed, evidence})
└── passRate (Float, 0-1)
```
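Given that `results` shape, `passRate` is just the fraction of entries with `passed: true`; a sketch:

```typescript
// Derive the stored passRate (0-1) from a SkillEvalGrading results array
// (shape taken from the model above; function name is illustrative).
interface GradingResult {
  index: number;
  passed: boolean;
  evidence: string;
}

function computePassRate(results: GradingResult[]): number {
  if (results.length === 0) return 0; // no assertions graded yet
  return results.filter((r) => r.passed).length / results.length;
}
```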