Real-time updates are the heartbeat of Strug Central. When you're watching tasks flow through the Task Board or monitoring agent activity in the Strug Stream, you expect those updates to arrive instantly and reliably. Behind the scenes, that responsiveness is powered by Server-Sent Events (SSE)—a lightweight HTTP streaming protocol that pushes updates from our backend to your browser.
But we noticed a problem. When our backend experienced load spikes—say, during a mission dispatch that spawned dozens of tasks—SSE connections would drop. The useTaskStream hook would immediately try to reconnect. If the backend was still struggling, that connection would fail too. And the hook would try again. And again.
This created a feedback loop: the very act of reconnecting added more load to an already-stressed backend, making recovery harder. Users saw stuttering updates, and our logs filled with connection churn.
What Shipped
We added exponential backoff to the useTaskStream hook. Now, when a connection fails, the hook waits before retrying—starting with a short delay (say, 1 second), then doubling that delay with each subsequent failure (2 seconds, 4 seconds, 8 seconds, and so on), up to a maximum interval.
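The retry schedule described above can be sketched as a small helper. This is a hypothetical sketch, not the actual useTaskStream implementation; the 1-second base and 30-second cap are assumed values for illustration:

```typescript
// Compute the delay before the next reconnect attempt.
// attempt is 0-based: attempt 0 waits baseMs, attempt 1 waits 2x baseMs, etc.
// maxMs caps the wait so clients never back off unboundedly.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// First six retries: 1s, 2s, 4s, 8s, 16s, then capped at 30s.
const schedule = Array.from({ length: 6 }, (_, i) => backoffDelay(i));
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 30000]
```

The cap matters: without it, a long outage would leave clients waiting minutes between attempts, delaying recovery once the backend is healthy again.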
This simple change breaks the connection storm. When the backend struggles, frontend clients back off gracefully, giving the system room to recover. When the backend stabilizes, connections resume automatically.
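Wired into a connection loop, the pattern looks roughly like this. It's a minimal sketch, assuming a hypothetical connect() function that resolves on a clean close and rejects on failure; the real hook's internals aren't shown here:

```typescript
// Keep a stream alive, doubling the wait after each consecutive failure.
// A production version would also reset `attempt` once a connection has
// stayed healthy for a while, so one late drop doesn't inherit a long delay.
async function streamWithBackoff(
  connect: () => Promise<void>, // resolves on clean close, rejects on failure
  baseMs = 1000,
  maxMs = 30000,
): Promise<void> {
  let attempt = 0;
  for (;;) {
    try {
      await connect(); // stream until the server closes cleanly
      return;
    } catch {
      // Wait before retrying; the delay doubles each failure, up to maxMs.
      const delay = Math.min(baseMs * 2 ** attempt, maxMs);
      attempt += 1;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Because the waiting happens client-side, a struggling backend sees a thinning trickle of reconnect attempts instead of a synchronized stampede.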
We also updated gtm_scheduler.py to better handle SSE lifecycle events, ensuring the backend cleanly closes streams when tasks complete instead of leaving stale connections open.
Why It Matters
For technical founders running autonomous engineering teams, reliability isn't a nice-to-have—it's the foundation of trust. If you can't see what your agents are doing in real time, you lose visibility into the most critical workflows in your product.
Exponential backoff is a battle-tested pattern in distributed systems—it's the standard way production services handle transient failures without amplifying them. By bringing that pattern into our SSE implementation, we're making Strug Central behave like infrastructure you can depend on—even when things get chaotic.
What's Next
This fix sets the stage for more resilient real-time features. We're exploring enhanced connection health monitoring—surfacing SSE status directly in the UI so you know when you're connected, reconnecting, or offline. We're also evaluating WebSocket fallback for environments where SSE is unreliable (corporate proxies, aggressive CDNs).
Longer term, we're thinking about optimistic UI patterns—showing task state changes immediately in the interface, then reconciling with the backend asynchronously. That would make Strug Central feel instant even when the network isn't.
Small changes, big impact. That's how we build.