Skillgrade
Mar 14, 2026 · 3 minute read
A few weeks ago I wrote about Skill Eval, a framework for testing AI agent skills. The idea resonated — skills are becoming a critical part of how teams work with agents, and without a way to measure whether they work, you’re guessing.
The problem was that Skill Eval required too much setup. You had to clone a repo, understand a specific directory structure, write TypeScript config, and wire everything together before you could run your first eval. The barrier to entry was high for something that should be simple.
I rewrote it from scratch as Skillgrade — a CLI tool you install globally and run from your skill directory. The workflow is now three commands:
```shell
npm i -g skillgrade
cd my-skill/
GEMINI_API_KEY=your-key skillgrade init
GEMINI_API_KEY=your-key skillgrade --smoke
```
That’s it. skillgrade init reads your SKILL.md, generates an eval.yaml with AI-powered tasks and graders, and you’re ready to run. No cloning, no boilerplate, no project structure to learn.
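If your skill doesn't have a SKILL.md yet, a minimal one is enough for init to work from. The frontmatter below follows the common SKILL.md convention of a name plus a description; whether skillgrade expects anything beyond that is an assumption on my part:

```markdown
---
name: superlint
description: Fix coding standard violations in JavaScript files with the superlint CLI.
---

# superlint

1. Run superlint check to list violations.
2. Fix each reported violation in place.
3. Run superlint check again to verify a clean result.
```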
What Changed
The biggest shift is moving from a framework you build on top of to a tool you point at your skill. Everything lives in a single eval.yaml file:
```yaml
version: "1"
defaults:
  agent: gemini
  provider: docker
  trials: 5
  timeout: 300
  threshold: 0.8

tasks:
  - name: fix-linting-errors
    instruction: |
      Use the superlint tool to fix coding standard violations in app.js.
    workspace:
      - src: fixtures/broken-app.js
        dest: app.js
    graders:
      - type: deterministic
        run: bash graders/check.sh
        weight: 0.7
      - type: llm_rubric
        rubric: |
          Did the agent follow the check → fix → verify workflow?
        weight: 0.3
```
Tasks, graders, Docker config, agent selection — all in one place. String values like instruction and rubric support file references, so you can point to external markdown files when things get complex.
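A deterministic grader is just a script whose exit code decides pass or fail; that contract, and the specific checks below, are my assumptions rather than documented behavior. A sketch of what graders/check.sh might look like for the linting task:

```shell
#!/usr/bin/env bash
# Hypothetical graders/check.sh for the fix-linting-errors task.
# Assumption: skillgrade treats exit 0 as pass and any other code as fail.
set -euo pipefail

# Illustrative check: the agent should have replaced every `var` with
# let/const. Fail if any `var ` declaration survives in app.js.
if grep -q 'var ' app.js; then
  echo "FAIL: app.js still contains var declarations" >&2
  exit 1
fi

echo "PASS"
```

Because the grader runs inside the task workspace, it can assume app.js is in the current directory.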
The agent is auto-detected from your API key. Set GEMINI_API_KEY and it uses Gemini. Set ANTHROPIC_API_KEY and it uses Claude. Set OPENAI_API_KEY and it uses Codex. Override with --agent=claude if you need to.
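The post doesn't say what happens when several keys are set, so the precedence below (explicit --agent first, then Gemini, Claude, Codex) is an assumption. A sketch of the detection logic:

```shell
# Hypothetical sketch of skillgrade's agent auto-detection. The check
# order is an assumption; only the key-to-agent mapping is from the post.
detect_agent() {
  local override="${1:-}"   # e.g. "claude", as passed via --agent=claude
  if [ -n "$override" ]; then echo "$override"; return 0; fi
  if [ -n "${GEMINI_API_KEY:-}" ]; then echo "gemini"; return 0; fi
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then echo "claude"; return 0; fi
  if [ -n "${OPENAI_API_KEY:-}" ]; then echo "codex"; return 0; fi
  echo "no supported API key set" >&2
  return 1
}
```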
Presets
One thing I noticed when using Skill Eval was that I kept running the same configurations over and over. Quick smoke test during development. Thorough regression suite before merging. Skillgrade bakes these into presets:
- --smoke — fast feedback, fewer trials
- --reliable — balanced run for development
- --regression — thorough evaluation before shipping
Instead of remembering the right combination of --trials and thresholds, you pick the preset that matches your intent.
CI Mode
Running evals in CI was possible before but required more work than it should. Now it’s a flag:
```yaml
# .github/workflows/skillgrade.yml
- run: |
    npm i -g skillgrade
    cd skills/superlint
    GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }} skillgrade --regression --ci --provider=local
```
--ci exits with code 1 if the pass rate falls below --threshold (default: 0.8). Use --provider=local in CI — the runner is already an ephemeral sandbox, so Docker adds overhead without benefit.
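The gating logic is simple enough to sketch. The ci_gate function and its passed/total inputs are hypothetical, not skillgrade's real code; only the threshold semantics (exit 1 below threshold, default 0.8) come from the post:

```shell
# Hypothetical reimplementation of the --ci gate:
# exit non-zero when passed/total falls below the threshold.
ci_gate() {
  local passed=$1 total=$2 threshold=${3:-0.8}
  # awk handles the floating-point comparison
  if awk -v p="$passed" -v t="$total" -v th="$threshold" \
       'BEGIN { exit !(p / t >= th) }'; then
    echo "pass rate ok"
  else
    echo "pass rate below $threshold" >&2
    return 1
  fi
}
```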
Preview
After a run, you can review results in two ways:
```shell
skillgrade preview          # CLI report
skillgrade preview browser  # web UI → http://localhost:3847
```
Reports are saved to $TMPDIR/skillgrade/<skill-name>/results/ by default. Override with --output=DIR.
Getting Started
```shell
npm i -g skillgrade

# Go to your skill directory (must have SKILL.md)
cd my-skill/
GEMINI_API_KEY=your-key skillgrade init

# Run a smoke test
GEMINI_API_KEY=your-key skillgrade --smoke
```
Check out the examples in the repo — superlint for a simple setup and angular-modern for a TypeScript grader.
If you were using Skill Eval, the migration is straightforward: move your task config into eval.yaml and drop the framework dependency. The grading model is the same — deterministic and LLM rubric graders with configurable weights.
The core idea hasn’t changed. Skills are instructions for agents, and instructions need tests. Skillgrade just makes it easier to write and run those tests. Don’t ship skills without evals.