
Skill Evaluation

Evaluate skill quality by running test cases against your skill instructions. The evaluation pipeline compares “with skill” versus “baseline” outputs and auto-grades each test case using LLM-as-judge assertions.

Overview

Skill evaluation integrates with the Prompt Comparison system to measure how much a skill improves agent output quality. Each evaluation:

  1. Creates a PromptComparison with two variants: with-skill (includes skill instructions) and baseline (no skill)
  2. Runs both variants against your test cases
  3. Auto-grades assertions using LLM-as-judge after all runs complete
  4. Computes benchmark stats: pass rate delta, latency comparison

Evaluation File Format

Skills store test cases in `evals/evals.json` following the agentskills.io specification:

```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Review this function for bugs",
      "expected_output": "A list of identified issues with severity and fix suggestions",
      "files": ["evals/files/sample.py"],
      "assertions": [
        "The output identifies at least one bug",
        "Each issue includes a suggested fix",
        "Output is structured with clear sections"
      ]
    }
  ]
}
```
| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique numeric identifier |
| `prompt` | Yes | The user task to evaluate |
| `expected_output` | Yes | Description of what correct output looks like |
| `files` | No | Referenced files for context |
| `assertions` | Yes | Grading criteria (pass/fail per assertion) |
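The schema above can be expressed as TypeScript types with a minimal validator. This is a sketch based solely on the field table; the product may define its own types and validation internally.

```typescript
// Sketch of the evals.json schema from the field table above.
// The validator is illustrative, not the product's actual implementation.
interface SkillEvalCase {
  id: number;              // unique numeric identifier
  prompt: string;          // the user task to evaluate
  expected_output: string; // what correct output looks like
  files?: string[];        // optional referenced files for context
  assertions: string[];    // grading criteria, pass/fail each
}

interface SkillEvalFile {
  skill_name: string;
  evals: SkillEvalCase[];
}

// Returns a list of human-readable problems; an empty list means valid.
function validateEvalFile(data: SkillEvalFile): string[] {
  const problems: string[] = [];
  const seen = new Set<number>();
  for (const c of data.evals) {
    if (seen.has(c.id)) problems.push(`duplicate id ${c.id}`);
    seen.add(c.id);
    if (!c.prompt.trim()) problems.push(`eval ${c.id}: empty prompt`);
    if (!c.expected_output.trim()) problems.push(`eval ${c.id}: empty expected_output`);
    if (c.assertions.length === 0) problems.push(`eval ${c.id}: needs at least one assertion`);
  }
  return problems;
}
```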

Workflow

Create Test Cases

Open the skill editor, switch to the Evaluation tab, and click Add Test Case. Each test case needs:

  • A realistic user prompt
  • An expected output description
  • One or more assertions for grading

Click Save to write the test cases to evals/evals.json.

Run Evaluation

Click Run Evaluation to create a PromptComparison with two variants:

  • With skill — Your skill instructions are prepended to each prompt
  • Baseline — The same prompts without skill instructions

Then click Start Run to execute the comparison.

Review Results

Results appear inline in the Evaluation tab:

  • Status banner — Shows running/completed/failed with a link to the full Prompt Compare view
  • Benchmark summary — Pass rate and latency for both variants with delta
  • Per-test results — Expandable rows showing each variant’s output and assertion pass/fail

Iterate

Update your skill instructions, add or modify test cases, then run a new evaluation. Each run creates a separate comparison, so you can track improvement over time.

Auto-Grading

When comparison runs complete, the system automatically:

  1. Detects skill eval datasets (columns include assertions)
  2. Parses assertions from each dataset row
  3. Calls gradeAssertions() using LLM-as-judge (low temperature for consistency)
  4. Records pass/fail with evidence in SkillEvalGrading records
  5. Updates the comparison summary with pass rates per variant

Auto-grading happens server-side after runs complete. No manual grading step is needed — results appear automatically when you check back.
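Steps 4 and 5 can be pictured with a small sketch: given judged assertion results in the shape the `SkillEvalGrading` record documents (`{index, passed, evidence}`), the per-run pass rate and the variant-level summary are simple averages. The helper names here are hypothetical, not the server's actual code.

```typescript
// Hypothetical shapes mirroring the documented SkillEvalGrading record.
interface AssertionResult {
  index: number;
  passed: boolean;
  evidence: string;
}

// Per-run pass rate as stored in SkillEvalGrading.passRate:
// fraction of assertions that passed, in the range 0-1.
function computePassRate(results: AssertionResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

// Variant-level pass rate for the comparison summary:
// average of per-run pass rates across all runs of that variant.
function variantPassRate(runPassRates: number[]): number {
  if (runPassRates.length === 0) return 0;
  return runPassRates.reduce((a, b) => a + b, 0) / runPassRates.length;
}
```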

Benchmark Summary

The benchmark card shows three columns:

| Metric | With Skill | Baseline | Delta |
| --- | --- | --- | --- |
| Pass rate | 85% | 60% | +25% |
| Avg latency | 1.2s | 0.8s | +0.4s |

  • Positive delta (green) means the skill improves output quality
  • Negative delta (red) means the baseline outperforms the skill
  • Latency increase is expected since skill instructions add tokens
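The delta column in the example table reduces to a one-line computation. This is an illustrative sketch of the display logic, not the product's actual rendering code.

```typescript
// Sketch of the benchmark delta shown in the summary card.
// A positive delta means the with-skill variant scored higher.
function formatDelta(withSkill: number, baseline: number, unit: string): string {
  const delta = withSkill - baseline;
  const sign = delta >= 0 ? "+" : "";
  // Round to one decimal place to avoid float noise in the display.
  return `${sign}${Math.round(delta * 10) / 10}${unit}`;
}
```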

Assertion Design

Write assertions that test specific, observable properties of the output:

Good assertions:

  • “Output includes a summary section with key findings”
  • “All code examples include error handling”
  • “Response is under 500 words”

Weak assertions:

  • “Output is good” (too vague for LLM-as-judge)
  • “Better than baseline” (comparative, not absolute)

Start with 2-3 test cases covering the most common use case. Add edge cases after confirming the basic flow works.
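To see why specific, observable assertions grade better, it helps to picture how an LLM-as-judge grader frames each one. The prompt builder below is purely illustrative; the actual `gradeAssertions()` implementation is server-side and may differ.

```typescript
// Illustrative judge-prompt builder for one assertion. Vague criteria
// like "Output is good" give the judge nothing concrete to verify,
// which is why observable, absolute assertions grade more reliably.
function buildJudgePrompt(assertion: string, output: string): string {
  return [
    "You are grading a model output against one assertion.",
    `Assertion: ${assertion}`,
    "Output to grade:",
    output,
    'Answer "pass" or "fail" and quote the evidence from the output.',
  ].join("\n");
}
```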

API Reference

tRPC Endpoints

| Endpoint | Type | Description |
| --- | --- | --- |
| `skills.getEvals` | Query | Load `evals.json` for a skill |
| `skills.saveEvals` | Mutation | Write `evals.json` (workspace skills only) |
| `skills.createEvalComparison` | Mutation | Create comparison from evals |
| `skills.startEvalRun` | Mutation | Start the comparison runner |
| `skills.getEvalResults` | Query | Fetch latest results with grading + benchmark |
| `skills.gradeSkillRun` | Mutation | Manually grade a single run |

Database Model

Grading results are stored in SkillEvalGrading, linked 1:1 with PromptComparisonRun:

```
SkillEvalGrading
├── promptComparisonRunId (unique, FK → PromptComparisonRun)
├── assertions (JSON array of assertion strings)
├── results (JSON array of {index, passed, evidence})
└── passRate (Float, 0-1)
```