
How We're Measuring Retrieval Quality in Production


Building reliable AI products means measuring what matters. This week, Strug Works shipped a retrieval quality evaluation framework for Sabine, our AI partnership platform. The goal: ensure that as our knowledge base grows, the context we retrieve stays relevant and our answers stay accurate.

The Problem: Scaling Without Losing Precision

Retrieval-augmented generation (RAG) is only as good as the context you retrieve. When you're working with a small, curated knowledge base, manual spot-checks work. But as you scale—more documents, more users, more edge cases—you need systematic evaluation.

We were seeing the classic symptoms: occasionally irrelevant chunks surfacing in responses, answers that were technically correct but missed the user's intent, and no quantitative way to track whether changes to our retrieval pipeline were improving things or making them worse.

The Solution: G-Eval Framework + Batch Testing

We implemented a G-Eval-based evaluation framework that uses LLMs to assess retrieval quality at scale. G-Eval treats evaluation itself as a language-modeling task: we define quality criteria in natural language, and an LLM judge returns consistent, nuanced scores against those criteria.
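To make that concrete, here's a minimal sketch of the judge side of the idea. The prompt wording, the 1-5 scale, and the function names (`build_geval_prompt`, `parse_score`) are illustrative, not our production code:

```python
import re


def build_geval_prompt(criterion: str, query: str, context: str, answer: str) -> str:
    """Compose an LLM-judge prompt that scores one quality criterion on a 1-5 scale."""
    return (
        "You are evaluating a RAG system's output.\n"
        f"Criterion: {criterion}\n"
        f"Query: {query}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n"
        "Rate how well the output satisfies the criterion on a 1-5 scale.\n"
        "Reply with the number only."
    )


def parse_score(reply: str) -> int:
    """Pull the first digit 1-5 out of the judge's reply."""
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return int(match.group())
```

The key property is that the criterion is plain English, so adding a new quality dimension is a one-line change rather than new scoring code.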

The new tooling includes:

  • Batch evaluation scripts that run test queries through our retrieval pipeline and score the results
  • G-Eval reporting that measures context relevance, answer quality, and factual accuracy
  • Structured output formats (JSON, CSV) for tracking quality metrics over time
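The batch loop itself is simple. Here's a hedged sketch of the shape, with `retrieve` and `judge` as stand-ins for the real retrieval pipeline and G-Eval judge, writing both JSON and CSV as described above:

```python
import csv
import json


def retrieve(query: str) -> list[str]:
    """Stub: stands in for the real retrieval pipeline."""
    return ["chunk about " + query]


def judge(query: str, chunks: list[str]) -> dict[str, int]:
    """Stub: stands in for the G-Eval judge, returning per-criterion 1-5 scores."""
    return {"context_relevance": 4, "answer_quality": 4, "factual_accuracy": 5}


def run_batch(queries, json_path="eval.json", csv_path="eval.csv"):
    """Run every test query through retrieval + judging; write JSON and CSV reports."""
    rows = [{"query": q, **judge(q, retrieve(q))} for q in queries]
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

One row per test query keeps the output trivially diffable between runs, which is what makes trend tracking over time cheap.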

The scripts are designed to run in CI, so we can catch retrieval regressions before they hit production. We're starting with a curated set of test queries that represent common user intents, and we'll expand the test suite as we identify new edge cases.
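A CI gate on top of those scores can be as small as this sketch (the metric name and threshold here are hypothetical, not our actual cutoffs):

```python
from statistics import mean


def ci_gate(results, metric="context_relevance", threshold=3.5) -> int:
    """Return a nonzero exit code if the mean score falls below the threshold."""
    avg = mean(r[metric] for r in results)
    print(f"{metric}: {avg:.2f} (threshold {threshold})")
    return 0 if avg >= threshold else 1
```

Wired into the pipeline's exit code, a retrieval regression fails the build the same way a broken unit test would.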

Impact: From Gut Feel to Data-Driven Iteration

This isn't a flashy feature. Users won't see a new button or a redesigned interface. But it fundamentally changes how we iterate on Sabine's retrieval pipeline.

Now we can:

  • Quantify the impact of retrieval changes (chunk size, embedding model, ranking algorithm) before deploying them
  • Track quality trends over time and catch degradation early
  • Build confidence in our retrieval pipeline as the knowledge base scales
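Quantifying a change before deploying it reduces to comparing score distributions from two batch runs. A minimal sketch of that comparison (the names and the mean-delta summary are illustrative; in practice you'd likely also want a significance check):

```python
from statistics import mean


def compare_configs(baseline_scores, candidate_scores):
    """Return the mean-score delta between two pipeline configurations."""
    delta = mean(candidate_scores) - mean(baseline_scores)
    verdict = "improvement" if delta > 0 else "regression or no change"
    return delta, verdict
```

Running the same test set against, say, two chunk sizes and comparing the outputs turns "this feels better" into a number you can put in a PR description.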

Early results show a baseline relevance score of 78% on our test set, with clear opportunities to improve by tuning chunk overlap and reranking thresholds. That's exactly the kind of insight we couldn't get from manual testing.

What's Next

We're expanding the test suite to cover more user intents and edge cases. We're also exploring automated test generation—using LLMs to create diverse, realistic queries based on our knowledge base content.

Longer term, we want to integrate these evaluations into a continuous monitoring dashboard in Strug Central, so we can track retrieval quality as a first-class product metric alongside latency and error rates.

If you're building RAG systems and struggling with evaluation, this framework is open source in the sabine-super-agent repo. We'd love to hear what quality metrics matter most for your use cases.