Overview
A source-backed method for deciding when one agent should become a team, then designing roles, handoffs, context boundaries, evals, and human checkpoints.
When to use this
Use this when a single agent is struggling with too many tools, too much context, parallel research paths, or specialized subtasks that need different prompts and guardrails.
Fill Out The Baseline Record
Multi-agent systems add coordination cost before they add intelligence. Start with a baseline record for the current single-agent attempt: task, current prompt, tools allowed, failure evidence, latency, cost, quality score, and a keep/kill rule for any proposed team.
- Failure evidence: what the single agent missed, repeated, hallucinated, or could not fit in context.
- Team hypothesis: which pressure a team should solve, such as parallel search or independent review.
- Keep/kill rule: keep the team only if it beats the baseline on the metric that matters.
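A baseline record can be kept as a small structured object so the keep/kill rule is checked mechanically rather than argued after the fact. The following is a minimal sketch, not a prescribed schema: the field names, the `keep_team` helper, and the two thresholds (`min_quality_gain`, `max_cost_ratio`) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class BaselineRecord:
    """Snapshot of the single-agent attempt. All field names are illustrative."""
    task: str
    current_prompt: str
    tools_allowed: list[str]
    failure_evidence: list[str]
    latency_s: float
    cost_usd: float
    quality_score: float      # score from your own eval rubric, e.g. 0.0 to 1.0
    team_hypothesis: str      # which pressure the team is supposed to relieve
    min_quality_gain: float   # keep/kill: required improvement over baseline
    max_cost_ratio: float     # keep/kill: tolerated cost multiple of baseline


def keep_team(baseline: BaselineRecord, team_quality: float, team_cost_usd: float) -> bool:
    """Keep the team only if it beats the baseline by the agreed margin and cost cap."""
    gain = team_quality - baseline.quality_score
    within_cost = team_cost_usd <= baseline.cost_usd * baseline.max_cost_ratio
    return gain >= baseline.min_quality_gain and within_cost


# Hypothetical example values.
baseline = BaselineRecord(
    task="Summarize vendor security questionnaires",
    current_prompt="You are a careful analyst...",
    tools_allowed=["read_file", "web_search"],
    failure_evidence=["missed two of five source documents", "ran out of context"],
    latency_s=42.0,
    cost_usd=0.40,
    quality_score=0.62,
    team_hypothesis="parallel search across documents",
    min_quality_gain=0.10,
    max_cost_ratio=3.0,
)
```

Writing the thresholds down before the team exists keeps the later comparison honest: the rule cannot drift toward whatever the team happens to achieve.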
Choose A Pattern With One Example
Agent teams fail when every agent can talk to every other agent without a control model. Choose the coordination pattern first, then name roles. A router classifies and dispatches. A manager calls specialists as tools. Handoffs move the active conversation. A workflow fixes the sequence. An evaluator loop improves an output against criteria.
- Use a manager when one agent should own the final synthesis.
- Use handoffs when a specialist should respond directly to the user or own the next state.
- Use explicit workflows when ordering, approvals, or checkpoints matter more than autonomy.
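The three rules above can be written as a small decision helper. This is a sketch under assumptions: the `Pattern` enum, the `choose_pattern` function, and especially the priority order (workflow first, then handoffs, then manager) are illustrative choices, not something the sources prescribe.

```python
from enum import Enum
from typing import Optional


class Pattern(Enum):
    MANAGER = "manager-with-specialists"
    HANDOFFS = "handoffs"
    WORKFLOW = "explicit workflow"


def choose_pattern(
    *,
    ordering_or_approvals_matter: bool,
    specialist_owns_conversation: bool,
    one_agent_owns_synthesis: bool,
) -> Optional[Pattern]:
    """Map the three questions to a pattern; None means no team pattern is indicated."""
    # Assumed priority: fixed ordering and checkpoints outrank autonomy.
    if ordering_or_approvals_matter:
        return Pattern.WORKFLOW
    if specialist_owns_conversation:
        return Pattern.HANDOFFS
    if one_agent_owns_synthesis:
        return Pattern.MANAGER
    return None


# Example: a research task where one agent writes the final report.
example_choice = choose_pattern(
    ordering_or_approvals_matter=False,
    specialist_owns_conversation=False,
    one_agent_owns_synthesis=True,
)
```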
Write A Role Card For Each Agent
A role card should be short enough to inspect before a run. Include the agent's job, input packet, allowed tools, return format, evidence requirements, and what it is not allowed to do. Role cards are clearer than loose titles like researcher, planner, or reviewer.
- Job: one responsibility and one success criterion.
- Input: task brief, constraints, sources, and context it may see.
- Return: findings, evidence, confidence, unresolved questions, and next recommended action.
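One way to keep a role card inspectable is to store it as data and check worker output against it. The sketch below assumes a hypothetical `RoleCard` class; the field names and the `missing_return_fields` helper are illustrative, and the default return fields mirror the Return bullet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleCard:
    """A short, inspectable contract for one agent. Names are illustrative."""
    name: str
    job: str                        # one responsibility
    success_criterion: str          # one success criterion
    input_packet: tuple[str, ...]   # what context the agent may see
    allowed_tools: frozenset[str]
    not_allowed: tuple[str, ...] = ()
    return_fields: tuple[str, ...] = (
        "findings",
        "evidence",
        "confidence",
        "unresolved_questions",
        "next_action",
    )

    def missing_return_fields(self, output: dict) -> list[str]:
        """List required return fields that the worker output left out."""
        return [f for f in self.return_fields if f not in output]


# Hypothetical card for a search specialist.
card = RoleCard(
    name="source_finder",
    job="Find primary sources for the task brief",
    success_criterion="Every claim links to at least one primary source",
    input_packet=("task_brief", "constraints"),
    allowed_tools=frozenset({"web_search"}),
    not_allowed=("send external messages", "edit files"),
)

incomplete_output = {"findings": ["..."], "evidence": ["..."]}
gaps = card.missing_return_fields(incomplete_output)
```

Rejecting output with gaps at the boundary is cheaper than letting a manager synthesize from a partial return.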
Set Stop Conditions Before Tools
A team needs traceability because mistakes can compound across agents. Define every stop condition before live tools are available: budget exhausted, loop count reached, evidence missing, worker outputs conflict, unsafe tool request, or human approval needed.
- Set token, time, loop, and subagent-count budgets before execution.
- Require approval for writes, external messages, spending, deletes, and credential changes.
- Capture traces that show who delegated what, which tools ran, and why the run stopped.
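The stop conditions listed above can be evaluated by one function that the orchestrator calls before every step. This is a minimal sketch under assumptions: `Budget`, `RunState`, `APPROVAL_REQUIRED`, the tool names, and the check order are all hypothetical and would be adapted to whatever framework runs the team.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    max_tokens: int
    max_seconds: float
    max_loops: int
    max_subagents: int


@dataclass
class RunState:
    tokens_used: int = 0
    seconds_elapsed: float = 0.0
    loops: int = 0
    subagents_spawned: int = 0
    evidence_missing: bool = False
    outputs_conflict: bool = False
    pending_tool: Optional[str] = None  # tool the team wants to call next


# Hypothetical names for side-effecting tools that need a human decision.
APPROVAL_REQUIRED = {"write_file", "send_message", "spend", "delete", "change_credentials"}


def stop_reason(state: RunState, budget: Budget) -> Optional[str]:
    """Return why the run must stop, or None if it may continue."""
    if state.tokens_used >= budget.max_tokens:
        return "budget exhausted: tokens"
    if state.seconds_elapsed >= budget.max_seconds:
        return "budget exhausted: time"
    if state.loops >= budget.max_loops:
        return "loop count reached"
    if state.subagents_spawned > budget.max_subagents:
        return "subagent count exceeded"
    if state.evidence_missing:
        return "evidence missing"
    if state.outputs_conflict:
        return "worker outputs conflict"
    if state.pending_tool in APPROVAL_REQUIRED:
        return f"human approval needed: {state.pending_tool}"
    return None


budget = Budget(max_tokens=200_000, max_seconds=600.0, max_loops=5, max_subagents=4)
```

Logging the returned reason alongside the delegation and tool-call trace gives the "why the run stopped" record the last bullet asks for.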
Compare Team Against Baseline
Do not judge a multi-agent system by a demo run. Reuse the single-agent eval set, add adversarial and messy cases, and compare quality, cost, latency, and recovery behavior against the baseline record. Apply the keep/kill rule after the same tasks run through both systems.
- Measure both answer quality and coordination quality.
- Track duplicate work, missing handoffs, bad merges, runaway loops, and unsupported claims.
- Promote from sandbox only after traces and outputs pass human review.
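A paired comparison over the same eval tasks might look like the sketch below. The result format (a dict per task with `quality`, `cost`, and `latency` keys) and the `compare_runs` function are assumptions for illustration; coordination metrics such as duplicate work or bad merges would need their own columns.

```python
from statistics import mean


def compare_runs(baseline: dict[str, dict], team: dict[str, dict]) -> dict:
    """Compare two systems on identical task IDs; refuse mismatched eval sets."""
    if baseline.keys() != team.keys():
        raise ValueError("both systems must run the same eval tasks")
    tasks = sorted(baseline)
    return {
        "quality_delta": mean(team[t]["quality"] for t in tasks)
        - mean(baseline[t]["quality"] for t in tasks),
        "cost_ratio": sum(team[t]["cost"] for t in tasks)
        / sum(baseline[t]["cost"] for t in tasks),
        "latency_ratio": sum(team[t]["latency"] for t in tasks)
        / sum(baseline[t]["latency"] for t in tasks),
        # Tasks where the team did worse than the single agent.
        "regressions": [t for t in tasks if team[t]["quality"] < baseline[t]["quality"]],
    }


# Hypothetical results for two tasks.
single_agent = {
    "t1": {"quality": 0.5, "cost": 1.0, "latency": 10.0},
    "t2": {"quality": 0.75, "cost": 1.0, "latency": 10.0},
}
agent_team = {
    "t1": {"quality": 0.75, "cost": 2.0, "latency": 15.0},
    "t2": {"quality": 0.5, "cost": 2.0, "latency": 15.0},
}
report = compare_runs(single_agent, agent_team)
```

In this made-up example the team wins one task and loses the other at twice the cost, so an average-only view would hide the regression that the per-task list exposes.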
Method
- Fill out the baseline record first: task, current prompt, tools, failure evidence, cost, latency, quality, and keep/kill rule.
- Prove that a team is needed by identifying the exact pressure: parallel breadth, specialist context, tool confusion, sequential constraints, or independent review.
- Choose the smallest coordination pattern that fits the pressure: router, manager-with-specialists, handoffs, explicit workflow, or evaluator-optimizer loop.
- Write one role card per agent with job, input, tools, return format, evidence rules, and not-allowed actions.
- Define each stop condition: budget, loop count, unresolved conflict, missing evidence, unsafe tool request, or human approval needed.
- Add guardrails before live tools: human approvals for side effects, sandboxed execution, credential boundaries, budget limits, and trace logging.
- Run the same eval set against the single agent and team, then keep the team only if it improves quality enough to justify added cost and complexity.
Before you start
What to write down first
- Representative tasks: use real tasks so the team is judged on work it actually needs to perform.
- Baseline record: capture how the single agent performs before adding coordination cost.
- Specialist role cards: give each agent one job, allowed context, return format, and actions it cannot take.
- Permission map: decide which tools each role can use and where human approval is required.
- Stopping conditions: define when the run stops for cost, loops, missing evidence, conflict, or unsafe action.
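A permission map can be as simple as a nested dictionary with a deny-by-default lookup. The role names, tool names, and the three outcome strings below are hypothetical; the point of the sketch is only that unknown roles and unlisted tools fall through to "deny".

```python
# Hypothetical permission map: role -> tool -> decision.
PERMISSIONS = {
    "researcher": {"web_search": "allow", "read_file": "allow"},
    "publisher": {"read_file": "allow", "send_email": "needs_approval"},
}


def check_tool(role: str, tool: str) -> str:
    """Return 'allow', 'needs_approval', or 'deny' (the default for anything unlisted)."""
    return PERMISSIONS.get(role, {}).get(tool, "deny")
```

For example, `check_tool("publisher", "send_email")` routes to a human, while `check_tool("researcher", "send_email")` is refused outright because the tool is not on that role's list.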
Useful review material
- Task transcript: review the single-agent run to find where context, tools, or quality broke down.
- Context map: draw what each role can see so private or irrelevant context does not spread everywhere.
- Eval set: reuse the same tasks for the single agent and team so the comparison is fair.
- Sandbox environment: test tool calls, handoffs, and approvals away from production systems.
Decision points
- Is this really a multi-agent task?
Use a team when the work benefits from parallel branches, specialized context, independent critique, or stateful handoffs. Stay single-agent when the task is linear, low value, highly dependent on shared context, or fixable with clearer tools.
- Should specialists be tools or active handoff targets?
Use specialists as tools when the manager should synthesize and enforce shared guardrails. Use handoffs when the specialist should own the conversation, collect user input, or operate under a different prompt and tool set for multiple turns.
- What context crosses agent boundaries?
Pass only the task brief, relevant evidence, constraints, and required output schema unless the specialist truly needs full history. Require compact findings back, with links or artifacts for inspection.
- Where does a human approve or stop the run?
Add checkpoints before external side effects, irreversible file/database changes, financial actions, credential use, or broad outreach. Also stop when budgets, uncertainty, or conflict thresholds are reached.
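The context-boundary rule above (pass only the brief, evidence, constraints, and output schema by default) can be enforced with an allowlist rather than left to prompt wording. This is a sketch with assumed key names and a hypothetical `build_packet` helper.

```python
# Keys a specialist receives by default; names are illustrative.
DEFAULT_KEYS = ("task_brief", "evidence", "constraints", "output_schema")


def build_packet(full_context: dict, *, needs_full_history: bool = False) -> dict:
    """Copy only allowlisted keys into the packet handed to a specialist."""
    packet = {k: full_context[k] for k in DEFAULT_KEYS if k in full_context}
    if needs_full_history:
        # Opt-in only, for specialists that truly need the whole conversation.
        packet["history"] = full_context.get("history", [])
    return packet


manager_context = {
    "task_brief": "Compare three vendors on data retention",
    "evidence": ["vendor_a_policy.pdf"],
    "constraints": ["cite every claim"],
    "output_schema": {"findings": "list", "evidence": "list"},
    "history": ["turn 1", "turn 2"],
    "internal_notes": "pricing discussion, not for workers",
}
packet = build_packet(manager_context)
```

Because the function copies by allowlist, new keys added to the manager's context later stay private unless someone deliberately extends `DEFAULT_KEYS`.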
Common mistakes
- Splitting one vague prompt into several vague role prompts and expecting coordination to emerge.
- Letting every subagent see the full transcript, which destroys context isolation and raises cost.
- Giving workers broad tools when their job only needs one or two safe capabilities.
- Measuring only final answer quality while ignoring token spend, latency, traceability, and recovery behavior.
- Skipping the single-agent baseline, so there is no proof that the team was worth building.
Troubleshooting
- Subagents duplicate each other's work.
Make the orchestrator assign non-overlapping briefs, require each worker to state its search angle, and merge findings by evidence category instead of by agent name.
- The manager produces a confident synthesis from weak worker outputs.
Require workers to return evidence, uncertainty, and unresolved questions, then add an evaluator step that checks whether the synthesis is supported.
- The team is slower and more expensive than the single agent.
Remove agents that do not improve the eval score, switch handoffs to tool calls where direct interaction is unnecessary, and cap parallel branches.
- A specialist takes actions outside its intended role.
Narrow its tool set, rewrite the role prompt around explicit allowed actions, and move side-effecting tools behind human approval.
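For the duplicate-work case, merging by evidence category instead of by agent name could be sketched as follows. The worker output shape (`agent`, `findings`, `category`, `claim`) and the exact-match deduplication on normalized claim text are assumptions; real findings would likely need fuzzier matching.

```python
from collections import defaultdict


def merge_findings(worker_outputs: list[dict]) -> dict[str, list[dict]]:
    """Group findings by category and drop repeated claims across workers."""
    merged: dict[str, list[dict]] = defaultdict(list)
    seen: set[tuple[str, str]] = set()
    for output in worker_outputs:
        for finding in output["findings"]:
            # Normalize so trivially different spellings of a claim collide.
            key = (finding["category"], finding["claim"].strip().lower())
            if key in seen:
                continue
            seen.add(key)
            merged[finding["category"]].append(
                {"claim": finding["claim"], "reported_by": output["agent"]}
            )
    return dict(merged)


# Hypothetical outputs from two workers with one overlapping claim.
outputs = [
    {"agent": "worker_a", "findings": [
        {"category": "pricing", "claim": "Plan X costs $10/month"},
        {"category": "security", "claim": "Supports SSO"},
    ]},
    {"agent": "worker_b", "findings": [
        {"category": "pricing", "claim": "plan x costs $10/month "},
        {"category": "pricing", "claim": "Annual discount is 20%"},
    ]},
]
merged = merge_findings(outputs)
```

Counting how many findings were dropped as duplicates also gives a rough duplicate-work metric for the eval report.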
Sources
This playbook is authored from multiple references. Open the originals to inspect details, examples, and current guidance before adapting it.
- How we built our multi-agent research system
Primary source for the orchestrator-worker pattern, parallel subagents, context isolation, token-cost tradeoffs, and production lessons.
- Building effective agents
Baseline framework for workflows versus agents, simple composable patterns, tool design, and when added autonomy is worth it.
- OpenAI Agents SDK agent orchestration
Reference for agents-as-tools, handoffs, specialist agents, and eval-driven iteration in the OpenAI Agents SDK.
- LangChain multi-agent patterns
Pattern comparison across subagents, handoffs, skills, routers, custom workflows, context engineering, and cost tradeoffs.
- AutoGen AgentChat teams
Microsoft-maintained docs for teams, group chat presets, handoff-style swarms, observation, and the single-agent-first warning.
- CrewAI crews
CrewAI's model of crews, tasks, processes, manager agents, memory, planning, callbacks, and hierarchical coordination.
Notes
Agent teams can multiply token spend, permissions, tool calls, and compounding errors. Test in a sandbox, keep source traces, and require human approval before side effects.