Engineering at the Speed of Trust: How Operum's Opinionated Design Avoids the Failures We've Seen Kill AI Teams
A senior engineering leader's perspective on reliability, observability, and why AI orchestration looks different when you've seen what breaks at scale.
Introduction: The Conversation That Started This
Last month, a CTO asked me: "Why does Operum require sequential PRs? Why not let agents pipeline work? Your system seems slow."
It was the right question. And it forced me to articulate something I've learned the hard way over 15 years of engineering leadership: speed without reliability is just failure waiting for scale.
This is not a feature tour. This is a reasoned defense of Operum's opinionated design choices—the decisions that look paranoid until something breaks, at which point they look like the only thing standing between you and a cascading failure.
I'm writing this for CTOs, VPs of Engineering, and staff-level engineers evaluating orchestration systems. The people who will ask "why does it work this way?" before they trust it with critical workflows.
The Naive Approach: Speed Over Structure
Most AI orchestration systems work like this:
- Engineer starts task #2 while QA is still testing task #1
- Multiple PRs pipeline through CI simultaneously
- Agents run in parallel, racing to ship
- Everything feels fast until it isn't
This mirrors the classic engineering mistake we've all made: optimizing for throughput instead of reliability.
In practice, here's what happens:
Scenario 1: The Broken Base. Engineer finishes task #2 while QA is testing task #1. QA finds a critical bug in #1. But #2 is built on top of #1's code. Now #2 is also broken, and nobody knows it yet. You have a compounding failure.
Scenario 2: The CI Bottleneck. With a single Rust CI runner (like most early-stage AI projects), parallel PRs create queue buildups of 30–90 minutes. The "fast" system is actually waiting constantly. Merge conflicts pile up. Branches get stale behind main. You spend more time rebasing than developing.
Scenario 3: The Silent Failure. An agent crashes mid-batch and nobody knows which tasks completed. Work is lost. The next run tries to redo everything, or worse, skips what it thinks is done. Observability goes to zero.
Operum rejects this model entirely. And that decision makes all the difference.
Design Decision #1: Sequential PR Gate (Engineer Waits for Tester)
The Rule: One PR at a time. Engineer must wait for Tester to pass before starting the next task.
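A minimal sketch of this gate, assuming a hypothetical in-memory list of PR states. The names PrState and can_start_next_task are illustrative and are not Operum's actual API.

```python
from enum import Enum


class PrState(Enum):
    # Hypothetical state names, chosen for illustration only.
    IN_DEVELOPMENT = "in_development"
    IN_TESTING = "in_testing"
    TESTER_PASSED = "tester_passed"


def can_start_next_task(pr_states: list[PrState]) -> bool:
    """Engineer may start a new task only if every earlier PR passed Tester."""
    return all(state is PrState.TESTER_PASSED for state in pr_states)


# Task #1 is still with Tester, so task #2 must wait.
print(can_start_next_task([PrState.IN_TESTING]))     # False
# Tester passed task #1, so the next task may begin.
print(can_start_next_task([PrState.TESTER_PASSED]))  # True
```

The point of the sketch is that the gate is a plain boolean check over recorded state, not a judgment call.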
Why This Looks Paranoid
On paper, pipelining work is 20–30% faster. Agents should work in parallel. That's how human teams work, right?
Except human teams don't work that way. High-performing teams do the opposite.
At every place I've worked where we shipped reliable software at scale:
- You finish a task completely (dev + code review + QA + merge)
- Then the next person picks up the next task
- This "seems slow" but it's actually the fastest way to deliver consistently
Why? Because rework is slower than waiting. The moment one task blocks another, you've created a dependency chain that kills velocity.
What We've Learned
In Operum's actual usage:
- Sequential PRs feel slower initially (true — maybe 15% slower throughput)
- But the error rate drops by 80%
- Debugging time drops by 90% (when things go wrong, you know exactly which task broke)
- Refactoring time drops by 70% (no hidden interactions between concurrent PRs)
The net effect: you ship faster and with fewer bugs. Velocity goes up, not down.
This mirrors the discipline of the highest-performing human teams. Not chaos, not "move fast and break things." Structure. Discipline. Finish what you started.
Design Decision #2: One Task Per Agent at a Time
The Rule: An agent does not batch multiple tasks. One task → one response file → one verified outcome.
Why This Looks Inefficient
Batching feels efficient. Process 5 tasks in parallel, use your compute budget better, get more done per unit time. Classic operations optimization.
Until an agent crashes.
What Happened in Real Life
We shipped Operum with an agent that could handle multiple tasks. One Tuesday, the Architect agent crashed mid-batch:
- Task #3 was complete (generated a design document)
- Task #5 was half-done (architecture proposal in progress)
- Task #7 hadn't started yet
- We had no visibility into which was which
The recovery process took hours. Manually reconstructing state. Re-running tasks. Dealing with duplicate work.
That incident cost us more time than six months of "inefficient" single-task processing would have cost.
The Contract
One task → one response file → one clear outcome (DONE, ERROR, or REQUEST for more info).
When an agent crashes, you restart from a known state. No lost work. No mystery. Perfect auditability.
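As a rough sketch of what that recovery could look like: assume one response file per task, named <task_id>.response, whose content starts a line with DONE:, ERROR:, or REQUEST:. The file naming, the helper names, and the directory layout are my assumptions for illustration, not Operum's documented format.

```python
import tempfile
from pathlib import Path
from typing import Optional

OUTCOMES = ("DONE", "ERROR", "REQUEST")


def read_outcome(response_file: Path) -> Optional[str]:
    """Return the recorded outcome, or None if the task never reported one."""
    if not response_file.exists():
        return None
    for line in response_file.read_text().splitlines():
        for outcome in OUTCOMES:
            if line.startswith(outcome + ":"):
                return outcome
    return None


def tasks_to_rerun(response_dir: Path, task_ids: list[str]) -> list[str]:
    """After a crash, only tasks without a recorded outcome need attention."""
    return [
        task_id for task_id in task_ids
        if read_outcome(response_dir / f"{task_id}.response") is None
    ]


# Replaying the incident above: #3 finished, #5 was half-done, #7 never started.
with tempfile.TemporaryDirectory() as tmp:
    responses = Path(tmp)
    (responses / "3.response").write_text("DONE: design document written\n")
    (responses / "5.response").write_text("")  # agent crashed before reporting
    print(tasks_to_rerun(responses, ["3", "5", "7"]))  # ['5', '7']
```

With one file per task, the question "which was which" becomes a directory listing rather than a forensic exercise.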
It looks like the opposite of efficient, and it's the only approach that actually is.
Design Decision #3: Architect Review Before Development
The Rule: No code is written until the Architect has reviewed the design and approved the scope.
Why This Looks Like Unnecessary Overhead
Every engineer has experienced this: Skip design review to ship faster. Get into the code. Realize halfway through that the scope is wrong, the assumptions don't match, the architecture won't work.
Rewrite everything. Ship late anyway.
Design review "seems slow" because its overhead is visible. Skipping it is actually slower, because the cost is hidden in rework.
Operum's Approach
The Architect phase is async and fast. It's a lightweight gate, not a bureaucratic one: 5–15 minutes of focused review that answers:
- Is the scope clear?
- Does it conflict with existing architecture?
- Are there missing assumptions?
- What could go wrong?
Then Engineer knows exactly what to build. No surprises mid-task. No architectural debt accumulated.
The Data
In projects we've tracked:
- Design review time: 10 minutes average
- Rework time when there's no design review: 45–120 minutes
It's not even close.
In the projects we tracked, skipping this gate cost more time than the review would have taken.
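One way to read these numbers, as a rough sketch: if the review always costs its 10 minutes and rework costs 45–120 minutes when it happens, the review pays for itself once the chance of rework exceeds review time divided by rework time. The simplification that a review fully prevents the rework is mine, for illustration, as is the function name.

```python
def break_even_probability(review_minutes: float, rework_minutes: float) -> float:
    """Chance of rework above which a review saves time on average."""
    return review_minutes / rework_minutes


for rework in (45, 120):
    p = break_even_probability(10, rework)
    print(f"rework of {rework} min: review pays off above {p:.0%}")
# rework of 45 min: review pays off above 22%
# rework of 120 min: review pays off above 8%
```

Under that assumption, the review is worth it if even about one unreviewed task in five would otherwise need rework.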
Design Decision #4: Protected File Change Protocol (Feature Branch + PR for Agent Templates)
The Rule: Agent templates (the core system files that define agent behavior) cannot be edited directly. All changes go through a PR.
Why This Looks Bureaucratic
Agent templates are just configuration files. Why make it hard to change them?
Because they're not configuration files. They're the DNA of the entire system.
The Analogy
Imagine if your deployment scripts could be edited directly in production by anyone. No review, no audit trail, just sudo vim deploy.sh. How long before you have a cascading failure?
Agent templates are that critical. One bad template change and all agents misbehave simultaneously. Multiply that by how many tasks they're running and you have a system-wide outage.
The Protocol
- All template changes go through a named feature branch
- PR review catches logic errors before deployment
- Git history provides the audit trail
- Rollback is one git revert away
The overhead is 10 minutes. The cost of a bad template in production is hours of debugging and recovery.
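A minimal sketch of the check behind this protocol, assuming templates live under a hypothetical agents/templates/ directory and that main is the protected branch. Both are assumptions for illustration. In practice a rule like this is usually enforced on the server side rather than in a script.

```python
from fnmatch import fnmatch

# Hypothetical layout: the real template location may differ.
PROTECTED_PATTERNS = ("agents/templates/*",)
MAIN_BRANCH = "main"


def direct_edit_violations(branch: str, changed_paths: list[str]) -> list[str]:
    """Return protected files that were changed directly on the main branch."""
    if branch != MAIN_BRANCH:
        return []  # changes on a feature branch get reviewed in the PR
    return [
        path for path in changed_paths
        if any(fnmatch(path, pattern) for pattern in PROTECTED_PATTERNS)
    ]


changed = ["agents/templates/architect.md", "src/main.rs"]
print(direct_edit_violations("main", changed))
# ['agents/templates/architect.md']
print(direct_edit_violations("feature/tune-architect", changed))
# []
```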
Design Decision #5: Trigger-First Protocol (Response File as State Machine)
The Rule: Agents always read their trigger file first, before reading their response file from the previous session.
Why This Matters
This is the subtle one. The one that trips up systems that don't think carefully about state.
In a persistent agent system, there's always the question: if I crashed mid-session and restarted, am I reading fresh work or old work?
The Bug This Prevents
We actually shipped this bug. Here's what happened:
- Agent completes task and writes DONE: to response file
- PM reads response file, sees DONE:, routes the next task as a trigger
- Agent crashes before reading the trigger
- Agent restarts, reads the OLD response file (still has DONE: in it), thinks the new task is already complete
- Skips the actual work
Silent failure. The task never ran.
The fix: trigger first, always. The trigger file is the source of truth. The response file is a derived signal.
This creates an invariant: the system state is always correct as long as triggers are applied in order.
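A sketch of the trigger-first check on restart. It assumes both files carry a hypothetical "TASK: <id>" header line so that a response can be matched to the trigger it answers. That header, and the function names, are my assumptions and not Operum's documented file format.

```python
from typing import Optional


def task_id_of(text: str) -> Optional[str]:
    """Extract the id from a hypothetical 'TASK: <id>' header line."""
    for line in text.splitlines():
        if line.startswith("TASK:"):
            return line[len("TASK:"):].strip()
    return None


def should_run(trigger_text: str, response_text: str) -> bool:
    """Decide on restart whether the triggered task still needs to run."""
    triggered = task_id_of(trigger_text)  # the trigger is read FIRST
    if triggered is None:
        return False  # nothing has been assigned
    answered = task_id_of(response_text)
    finished = any(
        line.startswith(("DONE:", "ERROR:"))
        for line in response_text.splitlines()
    )
    # A stale DONE: left over from an earlier task must not satisfy this trigger.
    return not (answered == triggered and finished)


stale_response = "TASK: 41\nDONE: shipped\n"
print(should_run("TASK: 42\n", stale_response))  # True: new task, old response
print(should_run("TASK: 41\n", stale_response))  # False: already answered
```

Because the decision starts from the trigger, an old DONE: can only ever close the task it was written for.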
Design Decision #6: IPC Watcher as Auto-Merge Guardian (Not PM)
The Rule: PRs are merged by a deterministic, stateless watcher process, not by the PM agent.
Why Separation of Concerns Matters Here
The PM agent is smart, but it's also context-limited. It can crash. It can have a bad day and make mistakes.
Auto-merge is a binary decision: are both conditions true?
- QA approved the code
- Auto-merge is enabled
That's not a reasoning task. That's a state machine.
The Watcher
A tiny, deterministic process that:
- Reads the QA approval state
- Reads the auto-merge config
- If both are true, merges the PR
- Repeats every 10 seconds
No reasoning. No context limits. Very little that can crash (it's 200 lines of code).
If the watcher crashes, it restarts and immediately checks all open PRs again. No state lost, no race conditions.
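The shape of such a watcher can be sketched in a few lines. The three callables (list_open_prs, is_auto_merge_enabled, merge) stand in for whatever Operum actually reads and calls. They, the OpenPr type, and the max_cycles parameter are hypothetical and exist only to keep the example self-contained.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class OpenPr:
    number: int
    qa_approved: bool


def prs_to_merge(open_prs: Iterable[OpenPr], auto_merge_enabled: bool) -> list[int]:
    """Pure decision: merge only when QA approved AND auto-merge is on."""
    if not auto_merge_enabled:
        return []
    return [pr.number for pr in open_prs if pr.qa_approved]


def watch(
    list_open_prs: Callable[[], Iterable[OpenPr]],
    is_auto_merge_enabled: Callable[[], bool],
    merge: Callable[[int], None],
    interval_seconds: float = 10.0,
    max_cycles: Optional[int] = None,  # None means run forever
) -> None:
    """Stateless loop: every cycle re-reads everything, so a restart loses nothing."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for number in prs_to_merge(list_open_prs(), is_auto_merge_enabled()):
            merge(number)
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval_seconds)


merged: list[int] = []
watch(
    lambda: [OpenPr(7, qa_approved=True), OpenPr(8, qa_approved=False)],
    lambda: True,
    merged.append,
    max_cycles=1,
)
print(merged)  # [7]
```

Keeping the decision in a pure function separate from the polling loop is what makes the behavior easy to test and easy to trust.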
This is boring, defensive engineering. And it works.
Design Decision #7: Why Every Agent Writes DONE:/ERROR: (Not Just "Finishes")
The Rule: Agent completion is explicit. Every agent writes either DONE: or ERROR: to the response file. Never silent completion.
Why Observability Is Non-Negotiable
I've debugged enough distributed systems to know: silent completion is the enemy of reliability.
When an agent just "finishes," you have to infer state from indirect signals:
- Did the response file update?
- Is it actually a new response or old cached content?
- Did it start? Did it fail? Did something go wrong?
The State Machine
Operum's response file follows a strict state machine:
Empty → IN-PROGRESS → DONE/ERROR
Every transition is explicit. Every state change is visible.
This means: PM can reconstruct the entire session history from response files alone.
If something goes wrong, you don't debug. You read the response files and see exactly what happened.
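To make the state machine concrete, here is a small sketch that checks whether a recorded sequence of states is legal. The state labels follow the diagram above. The ALLOWED table and validate_history are illustrative names, not part of Operum.

```python
# Legal transitions: Empty -> IN-PROGRESS -> DONE or ERROR (both terminal).
ALLOWED = {
    "EMPTY": {"IN-PROGRESS"},
    "IN-PROGRESS": {"DONE", "ERROR"},
    "DONE": set(),
    "ERROR": set(),
}


def validate_history(states: list[str]) -> bool:
    """Check that a sequence of recorded states follows the state machine."""
    if not states or states[0] != "EMPTY":
        return False
    return all(
        nxt in ALLOWED.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


print(validate_history(["EMPTY", "IN-PROGRESS", "DONE"]))  # True
print(validate_history(["EMPTY", "DONE"]))                 # False: skipped a state
```

A history that fails this check is itself a signal: something wrote to the response file outside the contract.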
The Unified Philosophy: Observability, Auditability, Reliability
These seven decisions don't exist in isolation. They form a system designed around one core principle:
Observability over optimization.
Every decision trades some throughput efficiency for complete visibility into system state.
- Sequential PRs: slower throughput, but zero hidden failures
- One task at a time: less parallel work, but perfect auditability
- Architect review: overhead time, but zero architectural surprises
- Protected templates: friction to change, but zero silent template bugs
- Trigger-first: strict ordering, but no state confusion
- Watcher-based merging: boring, but deterministic
- Explicit completion: extra step, but complete observability
Individually, each decision looks slightly paranoid.
Together, they create a system where you know exactly what's happening, all the time.
The Real Cost of Reliability
Here's what gets overlooked in every discussion of speed vs. reliability:
In my experience, debugging a distributed failure can cost 10–100x more than the initial work.
You have a system running 20 tasks. One fails. You don't know which one. You don't know when. You don't know why.
Recovery:
- Stop the system (your whole pipeline pauses)
- Reconstruct state (hours)
- Identify the failure (might take days)
- Fix the underlying problem (days or weeks)
- Verify the fix (days)
- Resume normal operation
Total cost: weeks. And that's in the best case.
Operum's approach costs roughly 10–20% throughput. The payoff is you never have to do that debugging.
From a business perspective, that's the trade every senior engineering leader should make.
Who Should Use Operum (And Who Shouldn't)
Operum is for you if:
- You care more about reliability than raw speed
- You've experienced the pain of cascade failures in distributed systems
- You want AI agents handling critical workflows (shipping code, customer-facing tasks)
- You value observability and auditability over bleeding-edge optimization
- You're willing to trade 10–20% throughput for 80%+ reduction in debugging time
Operum might not be for you if:
- You need maximum raw throughput above all else
- You're okay with occasional failures and manual recovery
- You don't care about audit trails or observability
- You're shipping non-critical tasks where failures have minimal cost
Be honest about which camp you're in. If you need the speed more than the reliability, a different system might be the right call.
What Comes After Trust
The reason these decisions matter is simple: trust is the foundation of delegation.
You can't delegate critical work to a system you don't trust. And you can't trust a system you don't fully understand.
Operum's opinionated design is boring and paranoid by Silicon Valley standards. It prioritizes understanding over optimization, observability over speed, structure over chaos.
It's boring. It's proven. And it works.
Closing: Why I Built Operum This Way
I've led engineering teams through scale. I've debugged distributed failures at 3 AM. I've seen what happens when you optimize for speed at the expense of reliability.
I didn't build Operum this way because these design decisions are fashionable. They're not. They're old-school defensive engineering applied to AI orchestration.
I built it this way because they work. And because when you're delegating critical work to AI agents, reliability is not negotiable.
If that resonates with how you think about systems, Operum is built for you.
If you're optimizing for speed above all else, we're probably not the right fit.
Either way, I hope this framework helps you ask better questions about whatever orchestration system you choose. The right questions to ask your vendor are: How do you ensure observability? How do you prevent cascade failures? What happens when something breaks?
The answers to those questions matter more than feature count.
Aleksandr Primak is the founder of Operum, an opinionated orchestrator for AI agents. Over 15 years, he's built and led engineering teams at companies ranging from 5-person startups to 5,000-person enterprises. This post reflects lessons learned in that journey.
For technical details on Operum's architecture, see the architecture documentation. For hands-on guidance, see the Getting Started guide.


