Sometimes you ship a regression so severe it's almost impressive. That's what happened with v4.0 of our retrieval system: 0% recall across every test query. Zero. Not "slightly worse than expected"—completely broken.
How We Caught It: TDD Simulation as Truth-Teller
We caught this through Phase 0 simulation testing—running real queries against the system and measuring what came back. The results were unambiguous: nothing worked. Every query that should have returned relevant context returned nothing useful. Strug Recall, the memory browser that powers agent intelligence, was effectively blind.
This is the kind of regression that doesn't show up in unit tests. Each component worked in isolation. The failure emerged only when the full retrieval pipeline ran end-to-end. That's why we build simulations: to catch systemic failures that slip through component-level testing.
Three Root Causes
Root cause analysis identified three independent failures, each sufficient to break retrieval on its own:
1. Memory retrieval logic failed to properly filter and rank results by relevance score, returning empty result sets even when relevant entries existed.
2. Embedding similarity calculation had a dimension mismatch between query and stored embeddings, causing silent failures in vector comparison.
3. Agent core invocation logic didn't properly pass retrieval context through to downstream reasoning steps, even when retrieval succeeded.
Each issue lived in a different layer of the stack: data access, vector math, and orchestration. That's why the simulation failure was total—every path to successful retrieval was blocked.
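The post doesn't include the actual retrieval code, but the second failure mode is worth illustrating because of how quietly it happens. A minimal sketch, with hypothetical function names: a pure-Python cosine similarity built on `zip()` silently truncates to the shorter vector, so a query embedding of the wrong dimension still produces a plausible-looking score instead of an error.

```python
import math

def cosine_sim_zip(a, b):
    # BUG pattern: zip() silently truncates to the shorter input, so a
    # query embedding compared against stored embeddings of a different
    # dimension "works" -- it just measures only a prefix. No exception,
    # wrong score, and the failure surfaces far downstream.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_sim_checked(a, b):
    # Fix pattern: treat a dimension mismatch as a bug, not a zero match.
    if len(a) != len(b):
        raise ValueError(f"embedding dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

The checked version fails loudly at the comparison site, which is exactly where a simulation (or a unit test) can catch it.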
The Fix
We addressed all three issues in a single commit (49201e6). The memory module now correctly filters and sorts results. Embedding dimensions are validated and normalized. The agent core properly threads retrieval context through its execution flow.
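As a rough sketch of what "correctly filters and sorts" means here (the names, threshold, and top-k value are illustrative, not the actual module's API): drop entries below a minimum relevance score, sort the survivors by score descending, and cap the result count.

```python
def rank_results(scored_entries, min_score=0.35, top_k=5):
    """Filter and rank (entry, score) pairs by relevance.

    Illustrative sketch: min_score and top_k are hypothetical defaults,
    not values from the real memory module.
    """
    kept = [(entry, score) for entry, score in scored_entries if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

The key property is that a populated store with relevant entries can no longer yield an empty result set unless every score genuinely falls below the threshold.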
Cross-compatibility simulation tests now pass, confirming that the retrieval pipeline works end-to-end. We've expanded the simulation suite to include more edge cases and failure modes, so regressions of this magnitude shouldn't slip through again.
What's Next
This fix restores baseline functionality, but we're not stopping there. We're adding continuous simulation testing to our CI pipeline so every commit runs against the full suite of retrieval scenarios. If recall drops below threshold, the build fails.
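A recall gate of this kind can be sketched in a few lines (threshold and data shapes are assumptions for illustration, not our actual CI configuration): compute per-scenario recall, average it, and fail the build when the average dips below the bar.

```python
RECALL_THRESHOLD = 0.8  # illustrative threshold, not the real CI setting

def recall(retrieved_ids, relevant_ids):
    # Fraction of known-relevant entries the pipeline actually returned.
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

def gate(scenario_results):
    # scenario_results: list of (retrieved_ids, relevant_ids) per query.
    scores = [recall(got, want) for got, want in scenario_results]
    avg = sum(scores) / len(scores)
    if avg < RECALL_THRESHOLD:
        # Nonzero exit fails the CI job.
        raise SystemExit(f"recall {avg:.2f} below threshold {RECALL_THRESHOLD}")
    return avg
```

A v4.0-style total regression would score 0.0 on every scenario and fail the build on the first run.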
We're also instrumenting Strug Recall with observability so we can track retrieval quality in production. Metrics on recall rate, latency, and result relevance will surface in Strug Central, giving us real-time visibility into memory system health.
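In spirit, that instrumentation is a rolling window of per-query measurements that can be snapshotted for a dashboard. A minimal sketch (class and field names are hypothetical; the real integration with Strug Central will differ):

```python
from collections import deque

class RetrievalMetrics:
    """Rolling window of retrieval quality stats -- an illustrative
    stand-in for real instrumentation, not the production code."""

    def __init__(self, window=100):
        # deque(maxlen=...) discards the oldest sample automatically.
        self.recalls = deque(maxlen=window)
        self.latencies_ms = deque(maxlen=window)

    def record(self, recall, latency_ms):
        self.recalls.append(recall)
        self.latencies_ms.append(latency_ms)

    def snapshot(self):
        n = len(self.recalls)
        if n == 0:
            return {"recall_rate": None, "median_latency_ms": None}
        return {
            "recall_rate": sum(self.recalls) / n,
            # Upper median; good enough for a health dashboard sketch.
            "median_latency_ms": sorted(self.latencies_ms)[n // 2],
        }
```

The point is that recall becomes a number you watch continuously in production, not something you only learn about when a simulation suite reports zero.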
Regressions like this are expensive—not just in engineering time, but in lost agent capability. Memory is foundational. When it breaks, everything downstream suffers. The fix is in. The guardrails are going up. And we're building in public, breakage and all.