Structured Handoffs
Seven specialized agents, each with one job. Work flows between them through structured handoffs — never improvised, never blind.
That one sentence is the whole idea. Safety in an agentic workflow does not come from trusting any individual agent to behave — it comes from dividing the work so that shipping requires several independent actors to agree, and a human to give final sign-off. The rest of this page makes that concrete.
The roster
| Agent | Role | Hands off to |
|---|---|---|
| Product Manager | Strategy, vision, BUILD/DEFER/DECLINE decisions on new ideas. | Product Owner |
| Product Owner | Backlog management, ticket quality, Definition of Ready enforcement. | Worker |
| System Architect | Tech-stack decisions, design guidance, pattern library. | Product Owner / Worker |
| Quality Engineer | BDD test plans before implementation. Shift-left testing. | Worker / Reviewer |
| Worker | Picks one ticket. Implements. Opens a PR. | Reviewer |
| PR Reviewer | Reads the diff. Issues GO or NO-GO with reasons. | Human |
| DevOps Engineer | Deployment, previews, infra changes. | Human |
The pattern is deliberate: the agent that decides what to build is never the agent that decides how to build it, and the agent that writes code is never the agent that approves it. Authority is split along the seams where mistakes are most expensive: the PM sets direction but does not manage the backlog; the PO shapes the backlog but does not make strategic calls; the Worker writes code but cannot merge; the Reviewer judges but also cannot merge.
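The separation of duties described above can be written down as data and checked mechanically. The following is a minimal sketch, not Gemba Flow's actual implementation: the `Role` class, its capability flags, and `separation_violations` are hypothetical names introduced only for illustration, and the human is modeled as holding merge rights and nothing else to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """A role and the sensitive capabilities it holds (hypothetical flags)."""
    name: str
    can_write_code: bool = False
    can_approve: bool = False
    can_merge: bool = False


# Assumed capability assignment, derived from the prose above:
# the Worker writes, the Reviewer approves, only the human merges.
ROLES = [
    Role("Worker", can_write_code=True),
    Role("PR Reviewer", can_approve=True),
    Role("Human", can_merge=True),
]


def separation_violations(roles):
    """Return the names of roles holding more than one of write/approve/merge."""
    return [
        role.name
        for role in roles
        if sum([role.can_write_code, role.can_approve, role.can_merge]) > 1
    ]


if __name__ == "__main__":
    print(separation_violations(ROLES))
```

A check like this is only a model of the policy. The enforcement itself has to live in the platform, since a table of flags cannot stop an agent from acting.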
Why specialize?
A general-purpose agent tends to be mediocre at everything. A specialized agent with one job, one context, and one set of guardrails is sharp. The cost of specialization is more handoffs, which is why every handoff is structured.
Specialization is also a control. The worker cannot review its own PRs. The reviewer cannot merge. Roles enforce separation of duties at the prompt level, with bot account separation enforcing it at the platform level.
The pipeline
    Idea ──▶ PM ──▶ PO ──▶ Architect ─┐
                                      ▼
                                    Worker ──▶ Reviewer ──▶ Human ──▶ merged
                                      ▲          │
                                      └──── QE ──┘

Each arrow is a structured handoff with a documented contract. Read the PR template, the ticket format, and the review template to see how each contract is enforced.
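To make "structured handoff with a documented contract" concrete, here is a minimal sketch of a handoff record and a validator. The allowed edges are transcribed from the "Hands off to" column of the roster (with "Reviewer" spelled out as "PR Reviewer"). The `Handoff` fields and the `validate_handoff` function are assumptions made for illustration; the real contracts are the PR template, ticket format, and review template.

```python
from dataclasses import dataclass

# Allowed sender -> receivers, taken from the roster's "Hands off to" column.
ALLOWED_HANDOFFS = {
    "Product Manager": {"Product Owner"},
    "Product Owner": {"Worker"},
    "System Architect": {"Product Owner", "Worker"},
    "Quality Engineer": {"Worker", "PR Reviewer"},
    "Worker": {"PR Reviewer"},
    "PR Reviewer": {"Human"},
    "DevOps Engineer": {"Human"},
}


@dataclass
class Handoff:
    """Hypothetical handoff record; field names are illustrative only."""
    sender: str
    receiver: str
    artifact: str  # e.g. a reference to a ticket or pull request
    summary: str


def validate_handoff(handoff: Handoff) -> list[str]:
    """Return a list of contract violations; an empty list means valid."""
    errors = []
    if handoff.receiver not in ALLOWED_HANDOFFS.get(handoff.sender, set()):
        errors.append(f"{handoff.sender} may not hand off to {handoff.receiver}")
    if not handoff.artifact.strip():
        errors.append("handoff is missing an artifact reference")
    if not handoff.summary.strip():
        errors.append("handoff is missing a summary")
    return errors
```

Returning a list of violations rather than raising on the first one lets the receiving agent report everything that is wrong with a handoff in a single pass.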
Where humans live in the loop
You sit at exactly three places: defining what to build (with the PM), approving the Ready column (with the PO), and merging PRs (after the Reviewer). Everything in between runs on agents. That is the point of the harness.
What makes it real, not aspirational
A mental model is only reassuring if it is enforced. In Gemba Flow the roster is backed by overlapping platform controls, so that even an agent that misreads its instructions cannot cross the lines above. The layers that matter most to an evaluator:
- Branch protection on `main`. Direct pushes are blocked; every change must arrive through a reviewed pull request with passing checks. This is the hard boundary: if it holds, most other failure modes are contained.
- Account separation. The human operator, the worker bot, and the reviewer bot are three distinct accounts with scoped permissions. The worker cannot review its own PRs; the reviewer cannot merge; neither bot has admin rights. A pre-operation hook switches to the correct account automatically, so attribution in the audit trail is always honest.
- A merge deny rule. The `gh pr merge` command is denied at the framework level. Even if an agent is told to merge, the tool call is blocked before it runs. This is the hard enforcement behind “only humans merge”; it does not depend on the agent choosing to comply.
- Drift detection in CI. A policy linter runs on every pull request and fails the build if a safety instruction (such as a “NEVER merge” rule) is ever weakened or removed, so the guardrails cannot quietly erode over time.
- Local and scheduled checks. A pre-push hook runs lint and tests before code reaches GitHub, and weekly audits scan for any restricted action (a bot merge, a direct push to `main`) and raise an alert if one slips through.
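The merge deny rule in the list above blocks a tool call before it runs. A minimal sketch of that idea follows, assuming the framework passes each shell command to a check as a string. The function name `is_denied` and the `DENIED_SEQUENCES` list are hypothetical, and the real rule in Gemba Flow may be configured quite differently.

```python
import shlex

# Hypothetical deny list; the text above names only `gh pr merge`.
DENIED_SEQUENCES = [("gh", "pr", "merge")]


def is_denied(command: str) -> bool:
    """Return True if the command contains a denied token sequence.

    Unparseable input is denied as well, so the check fails closed.
    """
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True
    for sequence in DENIED_SEQUENCES:
        width = len(sequence)
        # Slide a window over the tokens so chained commands are caught too.
        for start in range(len(tokens) - width + 1):
            if tuple(tokens[start:start + width]) == sequence:
                return True
    return False
```

This sketch matches only the literal token sequence, so it is not exhaustive on its own. That limitation is one reason the deny rule sits alongside branch protection and account permissions instead of replacing them.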
The design principle is defense in depth: prompt-level rules handle nuance, platform rules handle the hard boundaries, CI catches drift, and audits catch whatever prevention missed. No single layer is trusted to be sufficient on its own. See Layered Controls for the full layered architecture.
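The drift-detection layer can be illustrated with a very small policy linter. This is a sketch under stated assumptions, not the linter the project ships: `REQUIRED_RULES`, `lint`, and the file names in the usage example are hypothetical, and the only rule listed is the “NEVER merge” example quoted above.

```python
# Hypothetical list of phrases that must remain in every agent prompt file.
REQUIRED_RULES = ["NEVER merge"]


def missing_rules(text: str, required=REQUIRED_RULES) -> list[str]:
    """Return the required rules that do not appear verbatim in the text."""
    return [rule for rule in required if rule not in text]


def lint(files: dict[str, str]) -> dict[str, list[str]]:
    """Map each file name to the required rules it no longer contains.

    Files that still contain every rule are left out of the report.
    """
    report = {}
    for name, text in files.items():
        missing = missing_rules(text)
        if missing:
            report[name] = missing
    return report


if __name__ == "__main__":
    # Illustrative input: file names and contents are made up.
    prompts = {
        "worker.md": "You must NEVER merge a pull request.",
        "reviewer.md": "Merge when the checks look fine.",
    }
    problems = lint(prompts)
    print(problems)
    # A CI job would exit non-zero here to fail the build.
    raise SystemExit(1 if problems else 0)
```

An exact, case-sensitive match is deliberately strict: rewording a safety rule counts as drift and forces a human to look at the change.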
Where to go next
- Read Honest Limits for where this model intentionally stops — the cases where you should not hand work to the agents.
- Ready to try it? The Quickstart takes you from install to your first agent-authored pull request.