I broke the Dream Team deployment pipeline yesterday. Not in a dramatic, site-down kind of way—in the quiet, insidious way where things stop working and you don't notice until you try to ship something new.
PR #151 looked innocent: clean up railway.toml, remove some redundant config. The dockerfilePath directive was among the removals; dropping it seemed reasonable, since Railway should auto-detect Dockerfiles anyway. Merged it without a second thought.
What I didn't anticipate: Railway's Railpack buildpack is smart. Too smart. Without explicit Dockerfile configuration, it scanned the repo, detected Node.js and a Next.js frontend, and decided to be helpful. Instead of using our containerized builds—which handle the backend services, the agent runtime, the entire orchestration layer—it tried to build just the Next.js app. As a Node.js project. Without Docker.
The frontend built fine. Everything else didn't. The backend services that power Strug Works—the autonomous engineering team that's supposed to ship code without me—were suddenly stuck in a broken deploy state. The irony was not lost on me.
The fix was straightforward once I understood what Railway was doing: add per-service configuration to railway.toml. Instead of a single top-level config that Railway interprets however it wants, each service now gets explicit instructions.
For the backend services: use the Dockerfile, look in the right directory, build the container exactly as specified. For the frontend: sure, use Railpack's Node.js detection, that's fine for a Next.js app. No more guessing. No more auto-magic that breaks when you're not looking.
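As a rough sketch, the per-service configuration can look something like the following. This assumes Railway's config-as-code schema (the [build] table with builder, dockerfilePath, and watchPatterns keys); the file paths, directory names, and watch patterns here are illustrative, not the actual Strug Works layout, and in Railway each service is pointed at its own config file from the dashboard.

```toml
# backend/railway.toml -- attached to a backend service.
# Forces the containerized build instead of letting Railpack guess.
[build]
builder = "DOCKERFILE"                  # explicit: build from our Dockerfile
dockerfilePath = "backend/Dockerfile"   # illustrative path, not the real repo layout
watchPatterns = ["backend/**"]          # only rebuild when backend files change

# frontend/railway.toml -- attached to the Next.js service.
# Auto-detection is fine here, so we opt into it explicitly.
[build]
builder = "RAILPACK"                    # Railpack's Node.js detection, on purpose this time
watchPatterns = ["frontend/**"]
```

The point isn't the exact keys; it's that every service states its builder explicitly, so nothing depends on what Railway happens to detect.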
The lesson isn't 'don't trust cleanup PRs' or 'Railway bad.' It's that infrastructure abstraction is a double-edged sword. Railway's auto-detection works great—until it doesn't. And when it doesn't, you need to be explicit. Spell it out. Don't leave room for interpretation.
This is the kind of thing that wouldn't make it into a polished case study or a conference talk. But it's real. It happened. And documenting it—publicly, specifically, with the actual commit hash and PR number—is part of building in public. The unglamorous part.
What's Next
We're tightening up the deploy pipeline testing. The fact that this slipped through means our pre-merge checks aren't comprehensive enough. I want the agents on Strug Works to catch this kind of regression before it hits main—ideally with automated deploy previews that validate the full service mesh, not just the frontend.
Also considering whether Railway is the right long-term home for this infrastructure. It's been great for moving fast, but as Strug Works matures and we add more autonomous services, we might need something with less magic and more control. That's a decision for later. For now, the builds are working again, and that's enough.