<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Evals on Minko Gechev&#39;s blog</title>
		<link>https://blog.mgechev.com/categories/Evals/</link>
		<description>Recent content in Evals on Minko Gechev&#39;s blog</description>
		<generator>Hugo</generator>
		<language>en-us</language>
		<lastBuildDate>Sat, 14 Mar 2026 00:00:00 +0000</lastBuildDate>
		<atom:link href="https://blog.mgechev.com/categories/Evals/feed.xml" rel="self" type="application/rss+xml" />
			<item>
				<title>skillgrade</title>
				<link>https://blog.mgechev.com/2026/03/14/skillgrade/</link>
				<pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate>
				<guid>https://blog.mgechev.com/2026/03/14/skillgrade/</guid>
				<description>&lt;div style=&#34;padding: 15px; border-radius: 5px; background-color:rgb(241 255 240); border: 1px solid #e9e9e9;&#34;&gt;&#xA;⭐ Find &lt;a href=&#34;https://github.com/mgechev/skillgrade&#34;&gt;Skillgrade on GitHub&lt;/a&gt;&#xA;&lt;/div&gt;&#xA;&lt;h1 id=&#34;skillgrade&#34;&gt;skillgrade&lt;/h1&gt;&#xA;&lt;p&gt;A few weeks ago I wrote about &lt;a href=&#34;https://blog.mgechev.com/2026/02/26/skill-eval/&#34;&gt;Skill Eval&lt;/a&gt;, a framework for testing AI agent skills. The idea resonated — skills are becoming a critical part of how teams work with agents, and without a way to measure whether they work, you&amp;rsquo;re guessing.&lt;/p&gt;&#xA;&lt;p&gt;The problem was that Skill Eval required too much setup. You had to clone a repo, understand a specific directory structure, write TypeScript config, and wire everything together before you could run your first eval. The barrier to entry was high for something that should be simple.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Unit Tests for AI Agent Skills</title>
				<link>https://blog.mgechev.com/2026/02/26/skill-eval/</link>
				<pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate>
				<guid>https://blog.mgechev.com/2026/02/26/skill-eval/</guid>
				<description>&lt;div style=&#34;padding: 15px; border-radius: 5px; background-color:rgb(241 255 240); border: 1px solid #e9e9e9;&#34;&gt;&#xA;⭐ Find &lt;a href=&#34;https://github.com/mgechev/skill-eval&#34;&gt;Skill Eval on GitHub&lt;/a&gt;&#xA;&lt;/div&gt;&#xA;&lt;h1 id=&#34;unit-tests-for-ai-agent-skills&#34;&gt;Unit Tests for AI Agent Skills&lt;/h1&gt;&#xA;&lt;p&gt;I&amp;rsquo;ve been working with AI coding agents daily (Antigravity, Gemini CLI, Claude Code, and others). One pattern I keep seeing is teams building &lt;em&gt;skills&lt;/em&gt; for these agents: procedural instructions that teach the model how to use internal tools, follow specific workflows, or comply with team conventions.&lt;/p&gt;&#xA;&lt;p&gt;The problem? There&amp;rsquo;s no way to know if they actually work. You write a text file, hand it to an agent, and hope for the best. When you tweak the instructions, you have no signal telling you whether that change made things better or worse. You&amp;rsquo;re flying blind.&lt;/p&gt;</description>
			</item>
	</channel>
</rss>
