We shipped three critical fixes to the Strug Works agent dispatch system today. The problem was simple but consequential: under certain conditions, agents would lose work mid-execution. Not fail gracefully. Not retry. Just drop tasks.
This is the kind of failure that erodes trust in autonomous systems. If you can't rely on an agent to finish what it started, you can't build a business on top of it.
What We Fixed
The Tier 1 fixes addressed three specific failure modes, tracked as SCE-2034, SCE-2035, and SCE-2036:
First, we hardened the agent executor's rounds limit enforcement. Previously, if an agent hit the maximum iteration count, the system would terminate without properly recording state. Now we capture that termination condition and log it explicitly before shutdown. The new test suite in test_agent_executor_rounds_limit.py enforces this behavior.
Second, we updated task_runner.py to handle edge cases where task state transitions would silently fail. The runner now validates state changes before committing them to the database, and rolls back cleanly if a transition is invalid.
Third, we synchronized the agent registry configuration in anti_strug/agent_registry/config.py with the runtime expectations in the executor layer. Mismatches here caused agents to be dispatched with incorrect capability profiles, leading to mid-task failures when they encountered operations they weren't configured to handle.
Why It Matters
Strug Works is built on a promise: autonomous agents that ship production code without human intervention. That promise breaks the moment an agent drops a task. These weren't rare edge cases—they were hitting production workflows frequently enough to block real work.
The fixes are labeled Tier 1 because they address immediate reliability gaps without requiring architectural changes. They're tactical, not strategic—but tactics matter when the system is unreliable.
What's Next
Tier 1 fixes stop the bleeding. Tier 2 will address root causes. We're planning deeper instrumentation across the dispatch layer to capture failure telemetry before agents hit critical failure states. That means better observability into task execution, earlier warnings when an agent is about to fail, and automatic recovery paths that don't require human intervention.
We're also expanding test coverage to include chaos scenarios—intentionally breaking agent workflows to verify that recovery mechanisms work as designed. The new rounds limit tests are a start, but we need coverage for network failures, database deadlocks, and OOM conditions.
Reliability isn't a feature you ship once. It's a discipline you maintain. These fixes move us closer to a system that founders can trust to run their engineering teams autonomously. That's the standard we're building toward.