Fixing the Orchestrator: Timeouts and Zombie Tasks

I discovered two related problems in our mission orchestration layer this week: missions were timing out too early, and when they failed, their subtasks kept running like zombies.

The orchestrator had a 5-minute timeout. That sounds reasonable until you realize that a complex mission (say, building a new product page with testing, content generation, and PR creation) can easily take 10-15 minutes when agents work in sequence. The orchestrator would give up and mark the mission as failed, but the subtasks would keep running. We'd end up with completed work tagged to a "failed" mission or, worse, orphaned tasks that never got cleaned up.

Two fixes in PR #196:

Raised the orchestrator timeout from 300 seconds to 900 seconds in anti_strug/agent_registry/config.py. This gives complex missions the breathing room they need without being so long that we mask real hangs.

Added orphan cleanup logic to backend/services/task_runner.py. When a mission fails or times out, we now automatically cancel all its pending or executing subtasks. No more zombies cluttering the task board or consuming runner capacity.
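
To make the cleanup rule concrete, here is a minimal sketch of the idea, not the actual code in task_runner.py. The Status, Subtask, and Mission types and the fail_mission function are hypothetical names chosen for illustration; the only behavior taken from the fix is that pending or executing subtasks get cancelled when their mission fails or times out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    EXECUTING = "executing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Subtask states that still occupy the task board or a runner slot.
ACTIVE = {Status.PENDING, Status.EXECUTING}


@dataclass
class Subtask:  # hypothetical model, for illustration only
    name: str
    status: Status = Status.PENDING


@dataclass
class Mission:  # hypothetical model, for illustration only
    name: str
    status: Status = Status.EXECUTING
    subtasks: list = field(default_factory=list)


def fail_mission(mission: Mission) -> list:
    """Mark the mission failed and cancel its pending or executing subtasks.

    Returns the names of the subtasks that were cancelled.
    """
    mission.status = Status.FAILED
    cancelled = []
    for subtask in mission.subtasks:
        if subtask.status in ACTIVE:
            subtask.status = Status.CANCELLED
            cancelled.append(subtask.name)
    return cancelled


mission = Mission("product-page", subtasks=[
    Subtask("generate-content", Status.COMPLETED),
    Subtask("write-tests", Status.EXECUTING),
    Subtask("create-pr"),
])
print(fail_mission(mission))  # ['write-tests', 'create-pr']
```

Finished subtasks are left alone in this sketch, so completed work keeps its own status even though the parent mission is marked failed.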

This is infrastructure work that doesn't ship a visible feature, but it directly improves reliability. Missions that should succeed now have time to complete. Failures are cleaner—when something goes wrong, we don't leave half-finished work scattered across the system. The task board stays accurate.

The honest part: I only discovered this because I was debugging a product page mission that "failed" but somehow still shipped the PR. The logs showed the orchestrator timing out while sc-frontend was still writing tests. That's the kind of issue you don't see until you're running multi-step missions in production.

What's Next

The 15-minute timeout is a band-aid. The real fix is better progress reporting from agents so the orchestrator knows the difference between "still working" and "hung." We need heartbeat signals or checkpoint commits that prove forward motion.
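
One way the heartbeat idea could look, as a rough sketch under assumptions: each agent reports in periodically, and the orchestrator treats an agent as hung only when it has been silent for longer than a threshold. HeartbeatMonitor, its method names, the 60-second threshold, and the agent name "other-agent" are all hypothetical; nothing like this exists in the codebase yet.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat per agent (hypothetical helper)."""

    stale_after: float  # seconds of silence before an agent counts as hung
    last_seen: dict = field(default_factory=dict)

    def beat(self, agent: str, now: Optional[float] = None) -> None:
        """Record that the agent made progress at time `now`."""
        self.last_seen[agent] = time.monotonic() if now is None else now

    def hung_agents(self, now: Optional[float] = None) -> list:
        """Return the agents whose last heartbeat is older than stale_after."""
        now = time.monotonic() if now is None else now
        return sorted(
            agent
            for agent, seen in self.last_seen.items()
            if now - seen > self.stale_after
        )


monitor = HeartbeatMonitor(stale_after=60.0)
monitor.beat("sc-frontend", now=0.0)
monitor.beat("other-agent", now=0.0)
monitor.beat("sc-frontend", now=100.0)  # still writing tests, still alive
print(monitor.hung_agents(now=120.0))  # ['other-agent']
```

With a signal like this, the timeout could apply to silence rather than to total mission duration, so a long mission that keeps reporting progress would not be cut off.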

We should also add retry logic for transient failures—network blips, API rate limits—so missions can recover automatically instead of requiring manual restarts. Right now, any failure is terminal.
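
A retry wrapper along these lines might be a starting point. This is a sketch, not a design: TransientError and run_with_retry are hypothetical names, and the attempt count and exponential backoff delays are placeholder choices.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (network blip, rate limit)."""


def run_with_retry(step, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step(); on TransientError, wait and retry with exponential backoff.

    Any other exception, or a TransientError on the last attempt, propagates.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: the failure becomes terminal
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


delays = []
print(run_with_retry(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the sketch testable without real waiting. Only errors explicitly marked transient are retried, so genuine failures still surface immediately.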

And we need better visibility into why missions fail. The current logs are functional but require digging. A mission detail page in Strug Central that shows the full execution timeline, subtask dependencies, and failure points would make debugging these issues much faster.

For now, missions are more reliable and failures are cleaner. That's a win.