We shipped G-Eval drift detection this week. It's an LLM-as-judge system that monitors whether our autonomous agents are maintaining quality over time. Not uptime. Not error rates. Quality.
The Problem: Silent Quality Degradation
Traditional monitoring tells you when something breaks. It doesn't tell you when an agent starts writing worse code, or when its pull request descriptions become less helpful, or when it begins missing context it used to catch.
This matters more in an autonomous organization. When I review PRs from Strug Works agents, I need to know if the quality I saw last week is the quality I'm getting today. Model updates, prompt changes, context window shifts—any of these can cause subtle drift that's invisible to standard metrics but obvious to humans.
The Solution: LLM-as-Judge
G-Eval is a framework that uses language models to evaluate language model outputs: the judge model is given the evaluation criteria, reasons through evaluation steps, and emits a numeric score. Instead of checking for errors, it scores outputs against defined quality criteria: coherence, completeness, correctness, relevance.
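A minimal sketch of that scoring step. The criteria names and weights are illustrative, and `call_judge_model` is a hypothetical stand-in for the real judge-model API call, stubbed here so the example runs standalone:

```python
# Sketch of an LLM-as-judge scoring pass. Criteria weights are
# illustrative assumptions, not a production configuration.
CRITERIA = {
    "coherence": 0.25,
    "completeness": 0.25,
    "correctness": 0.30,
    "relevance": 0.20,
}

def call_judge_model(criterion: str, output: str) -> float:
    """Stub: a real implementation would prompt the judge LLM to rate
    `output` on `criterion` from 1-10 and parse the numeric score."""
    return 8.0  # placeholder score for illustration

def quality_score(output: str) -> float:
    """Weighted average of per-criterion judge scores (1-10 scale)."""
    return sum(
        weight * call_judge_model(criterion, output)
        for criterion, weight in CRITERIA.items()
    )
```

The weighted average collapses the per-criterion judgments into the single number that gets tracked over time.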
Our implementation runs automatically after task completion. It evaluates the agent's output, assigns a quality score, and tracks that score over time. When scores drop below baseline, we get an alert. Not when something crashes—when something gets worse.
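The drift check itself can be sketched as a rolling-baseline comparison. Window size, minimum history, and the alert threshold below are illustrative assumptions, not our actual settings:

```python
from collections import deque
from statistics import mean, stdev

class DriftDetector:
    """Flag quality scores that fall below a rolling baseline.

    Defaults (window, min_history, sigmas) are illustrative,
    not production values.
    """

    def __init__(self, window: int = 20, min_history: int = 5,
                 sigmas: float = 1.5):
        self.history = deque(maxlen=window)
        self.min_history = min_history
        self.sigmas = sigmas

    def record(self, score: float) -> bool:
        """Record a new score; return True if it signals drift."""
        drifted = False
        if len(self.history) >= self.min_history:
            baseline = mean(self.history)
            spread = max(stdev(self.history), 0.1)  # floor avoids zero spread
            drifted = score < baseline - self.sigmas * spread
        self.history.append(score)
        return drifted
```

With five scores hovering around 8, a sudden 4.0 trips the alert while normal variation does not; nothing has crashed, the output just got worse.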
Why It Matters
This is defensive infrastructure for autonomous work. We're building an organization where agents ship code unattended. That requires trust, and trust requires verification. G-Eval gives us a quantitative baseline for "is this agent performing like it did yesterday?"
It also creates a feedback loop. We can correlate quality scores with specific model versions, prompt templates, or context strategies. When we experiment with changes to agent behavior, we now have a consistent way to measure whether the change improved output quality or degraded it.
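That correlation step can be as simple as grouping scored runs by the configuration that produced them. The record fields here are hypothetical:

```python
from collections import defaultdict
from statistics import mean

def mean_score_by(records: list[dict], key: str) -> dict:
    """Group scored runs by a config field (e.g. a hypothetical
    "model_version" or "prompt_template" field) and compute the
    mean quality score per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {value: mean(scores) for value, scores in groups.items()}
```

Calling `mean_score_by(runs, "model_version")` on a batch of scored runs makes a regression after a model update show up as a per-version mean, not a vague impression.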
What's Next
Right now, this runs on Sabine's task execution pipeline. We're seeing consistent scores in the 7-9 range (out of 10) for routine tasks. Next step is integrating it into Strug Works' engineering workflow—evaluating code quality, PR descriptions, and test coverage decisions.
Longer term, we want per-agent quality profiles. Not just "did quality drop," but "did sc-backend's code review quality drop while sc-frontend stayed consistent?" That level of granularity turns monitoring into continuous improvement.
Building an autonomous organization means building the infrastructure to trust that autonomy. G-Eval drift detection is one piece of that foundation.