When Do You Need Multi-agent: It Is Not Division of Labor, It Is Runtime Boundaries

When do you actually need multi-agent?

A common answer is: when the task can be split into roles. One agent plans, one searches, one writes code, one reviews, one releases. This answer feels natural because it borrows the language of human organizations. It also makes for a tidy diagram.

But it is not a strong enough engineering criterion.

The real question is not “How many roles can I name?” The better question is: “Are there multiple runtime boundaries that should not be mixed?” If two steps should see different context, use different tools, hold different permissions, optimize for different goals, follow different failure constraints, or carry different approval authority, then multi-agent starts to become an architectural option.

In other words, the core of multi-agent design is not division of labor. It is boundary design.

1. The Question Is Not “How Many Roles?” but “How Many Boundaries?”

1.1 Role language is useful, but it can mislead

Words like planner, executor, reviewer, and releaser are useful. They help us describe workflows, and they help prompts express what a particular step is supposed to do. Their danger is that they make it too easy to assume that every distinct responsibility deserves a separate agent.

That assumption often adds complexity too early.

A single agent can move through different phases with different instructions, output schemas, and tool-selection rules. A normal workflow can insert tests, static analysis, permission checks, and human approvals between steps. Many systems need better context management, clearer tool descriptions, and stronger verification. They do not necessarily need more agent personas.

If all you are doing is asking the same model to speak first like a product manager and then like an engineer, that is not a sufficient reason to build a multi-agent architecture. It is usually prompt-level mode switching.

1.2 Architecture starts with boundaries

Multi-agent becomes interesting when questions like these start to matter:

Should this step be prevented from seeing the previous step’s full reasoning and bias?
Should this step have only a narrow set of tools instead of the whole toolbox?
Does this step need different system instructions, policy constraints, or compliance rules?
Does this step need an independent judgment rather than inheriting the generator’s objective?
If this step fails, should it stop and escalate rather than retry?
Does this step hold high-impact authority such as approval, deployment, payment, deletion, or external communication?

If the answer is yes, you are no longer dealing with simple role decomposition. You are dealing with runtime boundaries.

This is why software development is such a useful case. Requirements clarification, test design, implementation, code review, and release look like one continuous chain. But they should not automatically share the same harness with authority to approve and ship. Otherwise the system quietly becomes: understand the requirement yourself, define the standard yourself, write the implementation yourself, judge it yourself, and finally release it yourself.

That is not merely a “self-review” problem. More precisely, it is a problem of mixed context, mixed permissions, mixed objectives, and mixed risk boundaries.

2. First, Define the Terms: Model, Agent, Harness, and Orchestrator

2.1 A model is not an agent

The vocabulary matters here.

A model is the language model itself. It receives input and produces output. It may express an intention to call a tool, but the model itself does not open files, execute commands, access databases, remember long-term state, or decide when to stop an execution loop.

The OpenAI Agents SDK describes an agent as an LLM configured with instructions, tools, handoffs, guardrails, structured outputs, and related runtime behavior. That is one SDK’s formulation, but it captures the key point: an agent is not a raw model. It is a model plus the surrounding configuration and runtime machinery that lets it act.

So when we discuss multi-agent, we are not merely talking about calling a model several times. We are talking about multiple runtime entities that can act independently under different constraints.

2.2 Agent = Model + Harness

A useful shorthand is: Agent = Model + Harness. The model provides language understanding, reasoning, and generation. The harness is the system layer outside the model that turns that capability into an agent that can act, observe, remember, call tools, handle errors, and continue toward a goal.

LangChain’s article on agent harnesses uses that equation directly: if it is not the model, it is the harness. It defines the harness as the code, configuration, and execution logic outside the model, including state, tool execution, feedback loops, and enforceable constraints. The harness engineering article on Martin Fowler’s site uses a similar framing: harness is the shorthand for everything in an AI agent except the model itself.

At a finer level, the Hugging Face agent glossary distinguishes scaffolding from harness: scaffolding is closer to prompts, tool descriptions, output formats, and context organization; the harness is closer to the execution layer that calls the model, handles tool calls, and decides when to stop. The same glossary also notes that in product contexts such as Claude Code and Codex, harness is often used broadly for the whole non-model agentic layer. Databricks’ explanation of AI agent harnesses matches the same engineering intuition: the harness connects the model to tools, memory, execution environments, and safety controls so that it can do work rather than merely answer.

This distinction locates runtime boundaries inside agent architecture. A boundary is not a synonym for the harness; it is the result of harness design. What an agent can see, which tools it can call, where it can execute, whether it may retry after failure, and whether it needs approval are all harness-level design decisions.

The same model placed in two different harnesses can become two very different agents. One may only read documents and suggest changes. Another may edit code, run tests, operate a browser, and create a pull request. Their difference is not personality. It is the whole non-model harness around the model.
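
To make that concrete, here is a minimal sketch of one model wrapped in two harnesses. It is an illustration only, not any SDK's API: the names Harness, Agent, echo_model, and every tool name are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a language model: text in, text out.
Model = Callable[[str], str]


@dataclass(frozen=True)
class Harness:
    """Everything outside the model (illustrative fields only)."""
    instructions: str
    allowed_tools: frozenset[str]
    max_retries: int
    requires_approval: bool


@dataclass(frozen=True)
class Agent:
    model: Model
    harness: Harness

    def can_use(self, tool: str) -> bool:
        # The harness, not the model, decides which tools are reachable.
        return tool in self.harness.allowed_tools


def echo_model(prompt: str) -> str:
    return f"(model output for: {prompt})"


# Same model, read-only harness.
advisor = Agent(echo_model, Harness(
    instructions="Read documents and suggest changes.",
    allowed_tools=frozenset({"read_file"}),
    max_retries=0,
    requires_approval=False,
))

# Same model, harness that can change things.
implementer = Agent(echo_model, Harness(
    instructions="Edit code, run tests, open a pull request.",
    allowed_tools=frozenset({"read_file", "write_file", "run_tests", "create_pr"}),
    max_retries=3,
    requires_approval=True,
))
```

Both agents share echo_model; everything that distinguishes them lives in the Harness value.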

2.3 An orchestrator manages agents, not one model response

An orchestrator sits one level above the agents. It does not merely send a prompt to a model. It manages calls, routing, handoffs, state synchronization, and result aggregation across agents.

Two common patterns matter.

The first is agents-as-tools. A main agent keeps ownership of the final answer while calling specialist agents as bounded capabilities. OpenAI’s orchestration guide describes this as a manager-style workflow: the main agent remains responsible for the final reply, while specialists are invoked as tools.

The second is handoff. Control moves to a specialist agent that owns the next branch of the conversation or work. This is a better fit when the next branch genuinely needs a different policy, tool environment, or context owner.

Neither pattern is just “more roles.” Both answer sharper questions: Who owns the next step? Who owns the context? Who is responsible for the final output? Who enforces the policy boundary?
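
The ownership difference between the two patterns can be sketched with plain functions. This is a toy under stated assumptions, not the OpenAI SDK's interface: the specialists, the Result type, and the keyword routing are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical specialist: takes a task string, returns a result string.
Specialist = Callable[[str], str]


def billing_specialist(task: str) -> str:
    return f"billing analysis of '{task}'"


def refund_specialist(task: str) -> str:
    return f"refund decision for '{task}'"


@dataclass(frozen=True)
class Result:
    owner: str   # who is responsible for the final output
    output: str


def manager_with_tools(task: str, tools: dict[str, Specialist]) -> Result:
    """Agents-as-tools: the manager calls specialists and keeps ownership."""
    findings = [tool(task) for tool in tools.values()]
    return Result(owner="manager", output="; ".join(findings))


def triage_with_handoff(task: str, routes: dict[str, Specialist]) -> Result:
    """Handoff: control moves to the specialist, which owns the reply."""
    for keyword, specialist in routes.items():
        if keyword in task.lower():
            return Result(owner=specialist.__name__, output=specialist(task))
    # No route matched, so the triage step still owns the answer.
    return Result(owner="triage", output=f"no specialist matched '{task}'")
```

In the first function the owner field never changes, whichever specialists run. In the second, ownership follows the route.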

3. The Core of Multi-agent Design: Split Runtime Boundaries

3.1 Context boundaries: not every fact belongs in one model’s view

The LangChain multi-agent documentation says that context engineering sits at the center of multi-agent design: deciding what each agent sees. That is the heart of the matter.

More context is not always better. Information that helps a generator may contaminate a reviewer. Debug logs needed by a diagnostic agent may not belong in the context of an external-response agent. Sensitive customer data visible to a support agent may not belong in the context of a marketing-analysis agent.

Software development has the same pattern. An implementation agent may need requirements, code structure, test feedback, and local design tradeoffs. A review agent should not necessarily inherit the implementation agent’s full reasoning path. It should usually see the requirement, the diff, test results, risk criteria, and review standards. Otherwise it can be anchored by the implementer’s own story and continue along the same mistaken path.

Context boundaries are not just token optimization. They are cognitive isolation.
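
One way to enforce that isolation is to build the reviewer's input from an explicit projection rather than from the implementer's transcript. The sketch below assumes hypothetical record types; the field names are illustrative, not taken from any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImplementationRecord:
    requirement: str
    diff: str
    test_results: str
    reasoning_trail: str   # the implementer's exploration and self-justification


@dataclass(frozen=True)
class ReviewContext:
    requirement: str
    diff: str
    test_results: str
    review_standards: str


def build_review_context(record: ImplementationRecord, standards: str) -> ReviewContext:
    # Copy only the evidence the reviewer should judge. The reasoning
    # trail is deliberately left behind so it cannot anchor the review.
    return ReviewContext(
        requirement=record.requirement,
        diff=record.diff,
        test_results=record.test_results,
        review_standards=standards,
    )
```

Because ReviewContext has no field for the reasoning trail, the omission is structural rather than a matter of prompt discipline.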

3.2 Tool and permission boundaries: not every capability should be open at once

If an agent can read requirements, edit code, modify CI, manage secrets, merge pull requests, and deploy to production, it is certainly powerful. It is also dangerous. One mistaken inference, one prompt injection, or one poisoned tool result can propagate through a large permission surface.

The more tools an agent has, the easier it is to choose the wrong one. The more authority it has, the more expensive a wrong action becomes. The value of multi-agent here is not that the system appears more intelligent. It is that capabilities can be scoped into smaller authorization surfaces.

For example, an implementation agent may have workspace write access and permission to run tests, but not permission to merge into the main branch or deploy to production. A release agent may read build artifacts and deployment status, but should not casually edit business logic. A security review agent may read the diff and scanner output, but may not need write access at all.

This split does not always require another LLM agent. Sometimes the best boundary is operating-system permissions, CI required checks, branch protection, a policy engine, or human approval.
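
When the boundary does live in the agent layer, a per-agent allowlist checked in the harness is one simple form it can take. The following is a sketch with hypothetical agent and tool names; the lambdas stand in for real tools.

```python
from typing import Callable


class ToolNotPermitted(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


# Hypothetical tools; real ones would touch a workspace, CI, or a deploy system.
TOOLS: dict[str, Callable[[], str]] = {
    "write_workspace": lambda: "file written",
    "run_tests": lambda: "tests passed",
    "read_diff": lambda: "diff text",
    "read_scanner_output": lambda: "no findings",
    "read_deploy_status": lambda: "healthy",
    "merge_to_main": lambda: "merged",
}

# Each agent gets the smallest surface its step needs. No agent here holds
# merge_to_main: in this sketch that is left to branch protection or a human.
ALLOWLIST: dict[str, frozenset[str]] = {
    "implementer": frozenset({"write_workspace", "run_tests", "read_diff"}),
    "security_reviewer": frozenset({"read_diff", "read_scanner_output"}),
    "releaser": frozenset({"read_deploy_status"}),
}


def call_tool(agent: str, tool: str) -> str:
    # Checked in the harness, so a confused or injected model cannot opt out.
    if tool not in ALLOWLIST.get(agent, frozenset()):
        raise ToolNotPermitted(f"{agent} may not call {tool}")
    return TOOLS[tool]()
```

The check runs before the tool is looked up, so an unknown agent or an unlisted tool fails closed.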

3.3 Objective boundaries: generators and validators should not optimize the same local goal

An implementer is naturally biased toward making the solution work. It explains its tradeoffs, tries to prove that the current path is viable, and treats passing tests as strong evidence. A reviewer should optimize for something different: omissions, counterexamples, risk, edge cases, and unacceptable side effects.

If one harness both generates and validates while preserving the full self-justifying context, validation easily becomes “checking the same argument one more time.” This is not because the model is malicious. It is because the objective boundary was never split.

Research on LLM-as-judge systems offers a related warning. The paper “Self-Preference Bias in LLM-as-a-Judge” studies the risk that models prefer their own outputs or outputs in more familiar styles. This does not mean a second agent is automatically objective. It means independent evaluation needs more than the sentence “review this objectively.” It needs different context, different criteria, different evidence, and, in high-risk cases, human or deterministic gates.

3.4 Failure boundaries: who may retry, and who must stop

Many agent failures are not caused by the first error. They are caused by what the system does after the error.

An implementation agent may respond to a test failure by reading the error, changing code, and running tests again. A research agent that fails to find sources may change keywords, broaden scope, and try again. A release agent that sees a permission error, deployment anomaly, or production verification failure should usually not keep retrying or bypass approval.

Failure constraints are part of the harness. Whether an agent may retry, how many times it may retry, whether it may degrade, when it must escalate, and whether it may call higher-privilege tools are all runtime rules.

When two steps have different failure semantics, they should not be packed into the same undifferentiated loop.
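
Different failure semantics can be expressed as different policies applied by the same runner. This is a minimal sketch: FailurePolicy, Escalation, and the attempt counts are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


class Escalation(Exception):
    """The step must stop and hand the failure to a human or policy system."""


@dataclass(frozen=True)
class FailurePolicy:
    max_attempts: int   # 1 means "never retry"


# Illustrative values: the implementer may iterate, the releaser may not.
IMPLEMENT_POLICY = FailurePolicy(max_attempts=3)
RELEASE_POLICY = FailurePolicy(max_attempts=1)


def run_with_policy(step: Callable[[], str], policy: FailurePolicy) -> str:
    last_error: Exception | None = None
    for _ in range(policy.max_attempts):
        try:
            return step()
        except Exception as exc:
            # Remember the failure; the loop decides whether another try is allowed.
            last_error = exc
    # Attempts exhausted: escalate instead of looping or reaching for more privilege.
    raise Escalation(f"stopped after {policy.max_attempts} attempt(s): {last_error}")
```

The same failing step is retried under IMPLEMENT_POLICY and stopped after one attempt under RELEASE_POLICY, which is the point: the rule belongs to the harness, not to the model's judgment in the moment.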

4. Why Software Development Is the Canonical Case

4.1 Requirements, tests, implementation, review, and release are not the same authority

Software development makes the boundary issue visible because it looks like a linear process while containing very different kinds of authority:

Requirements clarification decides what problem should be solved.
Test design decides what evidence counts as completion.
Implementation decides how to solve it.
Code review decides whether the solution is acceptable.
Release decides whether the change may affect real users or production systems.

A single agent can assist across all of these, especially in a low-risk personal project. But in a serious engineering environment, they should not share one unconstrained harness with authority to ship.

The reason is simple: setting the standard, generating the implementation, judging acceptance, and releasing to production are different powers. Giving all of them to one agent is not advanced automation. It is concentrated control.

4.2 Self-approval is really a mixed-boundary problem

“The agent writes the code and reviews its own code” sounds like the obvious problem. The deeper issue is that several boundaries have collapsed:

It shares the same understanding of the requirement.
It shares the same implementation assumptions.
It shares the same context and reasoning path.
It shares the same local goal of completing the task.
If it also has commit, merge, or deployment rights, it shares the same release path.

That makes errors not only possible, but inheritable. A wrong assumption can be carried forward and reinforced by later steps.

A better design is not necessarily “open five agents.” The better design is to separate the evidence chain between generation and acceptance. Tests should be verifiable. Review should read the diff and standards, not the implementer’s self-defense. Release should depend on independent CI, approval, and audit records.

On platforms like GitHub, branch protection and required reviews are traditional engineering expressions of the same idea: writers do not automatically merge, and merge requires checks, approvals, and policy satisfaction.

4.3 AI-written code still needs separation of duties

AI-generated code does not remove the need for separation of duties. It makes it more important.

OWASP AISVS Appendix C, which covers AI-assisted secure coding controls, says AI-generated code should go through review by a qualified human engineer, that the reviewer should not be the same identity that requested the AI generation, and that the AI agent itself does not count as the human reviewer. It also explicitly names autonomous agents approving their own work as a threat scenario.

The point is not anti-automation. The point is anti-collapse: do not compress generation, validation, and approval into one unconstrained actor.

In practice, many combinations are possible:

An agent writes code, and a human reviews it.
An agent writes code, a separate read-only agent performs preliminary review, and a human gives final approval.
An agent writes code, CI runs tests and scanners, and branch protection blocks non-compliant changes.
An agent drafts release notes, but the release action requires human confirmation or policy-engine approval.

The important question is not the number of agents. It is whether each step has clear permission and accountability.
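
A merge gate that encodes this separation might look like the sketch below. It is inspired by the separation of duties described above, not a rendering of any OWASP control text; Identity, ChangeRequest, and may_merge are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    name: str
    is_human: bool


@dataclass(frozen=True)
class ChangeRequest:
    requested_by: Identity            # who asked the AI to generate the change
    generated_by: Identity            # the agent (or person) that wrote it
    approvals: tuple[Identity, ...]
    ci_passed: bool


def may_merge(change: ChangeRequest) -> bool:
    # Count only approvals from humans who neither requested nor wrote the change.
    independent_humans = [
        approver for approver in change.approvals
        if approver.is_human
        and approver != change.requested_by
        and approver != change.generated_by
    ]
    return change.ci_passed and len(independent_humans) >= 1
```

An agent's approval is still recorded in approvals, so a preliminary agent review remains visible in the audit trail; it simply does not satisfy the gate.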

5. When Not to Use Multi-agent

5.1 If only the tone changes, do not split the agent

If your “multi-agent” system only asks the same model to role-play product manager, architect, developer, and tester while sharing the same full context, tools, permissions, and stop conditions, it is probably not architectural isolation.

It may have prompt-level value, but do not overstate its governance value. It does not automatically create independent review. It does not automatically create permission isolation.

In many cases, one agent with clear phase instructions, structured outputs, tool gating, test commands, and human confirmation is enough.

5.2 When the problem is narrow, low-risk, and stable, workflow is usually better

Microsoft’s single-agent vs multi-agent guidance makes a practical point: do not assume role separation requires multi-agent architecture. Many scenarios should start with a single-agent prototype. Move to multi-agent only when single-agent optimization cannot solve persistent limits in context, performance, permissions, or scale.

If the task scope is narrow, the input and output are stable, the risk is low, and the tool set is small, one agent is often easier to control. It is easier to debug, cheaper to run, faster to respond, and simpler to observe.

Summarizing a small public document set, or drafting an internal email in a fixed format, usually does not require multiple agents. Splitting it into a reader agent, a summarizer agent, and a reviewer agent may only add logs, prompts, error surfaces, and token cost.

5.3 If a deterministic gate can solve it, do not make another agent perform it

Some boundaries are not cognitive boundaries. They are deterministic verification boundaries.

Code formatting belongs to a formatter. Known vulnerable dependencies belong to SCA. Test pass/fail belongs to CI. Merge permission belongs to branch protection. Deployment approval may belong to environment protection rules or a release system.

These should not first be handed to another agent to “judge.” An agent can explain the failure and suggest a fix, but the gate itself should usually be deterministic, auditable, and repeatable.

Good multi-agent design does not turn every control into an LLM. It keeps LLMs inside the boundaries where they add value.
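
A deterministic gate can be as plain as a pure function over recorded check results. The sketch below uses hypothetical field names; a real pipeline would fill them from the formatter, CI, and SCA tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResults:
    formatter_clean: bool
    tests_passed: bool
    known_vulnerable_dependencies: int
    approvals: int


def evaluate_gate(results: CheckResults, required_approvals: int = 1) -> list[str]:
    """Return the blocking reasons; an empty list means the gate is open."""
    reasons = []
    if not results.formatter_clean:
        reasons.append("formatter reported changes")
    if not results.tests_passed:
        reasons.append("tests failed")
    if results.known_vulnerable_dependencies > 0:
        reasons.append("known vulnerable dependencies present")
    if results.approvals < required_approvals:
        reasons.append("not enough approvals")
    return reasons
```

The same inputs always produce the same reasons, which is what makes the gate auditable. An agent can read the returned list and propose a fix, but it has no say in the outcome.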

6. When You Really Should Split

6.1 Split when the context must differ

When two steps should not share the same context, splitting agents becomes meaningful.

Generation and review are the classic example. The implementation agent may preserve a large exploration trail. The review agent should start from the requirement, diff, tests, and risk rules. Another example is privacy handling: one agent may need raw sensitive input, while another should only see a redacted summary.

Context isolation is not merely efficiency. It is security and cognitive independence.

6.2 Split when tools, credentials, or execution environments must differ

If two steps need materially different tool sets, consider splitting.

One agent may need browser and local workspace access, while another only needs to read the PR diff. One agent may run tests, while another only reads test results. One agent may access staging, while another must never touch production credentials.

This kind of split has direct value: it reduces tool misuse and permission spread.

6.3 Split when evaluation, approval, or audit must be independent

When a step carries evaluation, approval, or audit responsibility, it often needs an independent boundary.

Independence here is not just another prompt. It means different inputs, different standards, different permissions, and traceable records. A reviewer should know what it is reviewing, what criteria it is applying, what it can do, what it cannot do, and how the decision will be logged.

In high-risk systems, final approval often should not be performed by an LLM agent alone. The agent may provide analysis, but the approval path should be owned by humans, policy systems, or organizational authorization mechanisms.

6.4 Split when broad parallel exploration is the real bottleneck

Multi-agent also has a very practical benefit: parallel exploration.

Anthropic’s engineering write-up on its multi-agent research system explains that complex research tasks naturally require exploring many sources, and parallel subagents can dramatically improve coverage and speed. The same write-up also emphasizes the fast growth of coordination complexity and token cost.

So parallelism is a valid reason, but not a free reason. It fits broad problem spaces: open-ended research, complex incident investigation, multi-source evidence gathering, or cross-module code review. It should not be used to decorate a narrow task.
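
A bounded fan-out is one way to keep parallel exploration from running away. The sketch below uses Python's asyncio with a hypothetical explore coroutine standing in for a subagent; the cap and the timeout are illustrative values, not recommendations from the Anthropic write-up.

```python
import asyncio


async def explore(source: str) -> str:
    # Hypothetical subagent: a real one would call a model and tools here.
    await asyncio.sleep(0)
    return f"notes from {source}"


async def explore_in_parallel(
    sources: list[str], max_subagents: int, timeout_s: float = 5.0
) -> list[str]:
    # Cap the fan-out up front: coordination and token cost grow with it.
    selected = sources[:max_subagents]
    # Give every subagent a stop condition so none can search forever.
    tasks = [asyncio.wait_for(explore(s), timeout=timeout_s) for s in selected]
    # gather runs the subagents concurrently and keeps results in input order.
    return await asyncio.gather(*tasks)
```

Calling asyncio.run(explore_in_parallel(["docs", "issues", "logs"], max_subagents=2)) explores only the first two sources. Which sources to drop is itself an orchestration decision that this sketch does not attempt.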

7. The Problems You Create After Splitting

7.1 Coordination complexity grows

Multi-agent does not remove complexity. It moves complexity from inside one agent to the space between agents.

You must decide who assigns work, who merges results, who resolves conflicts, who decides a subtask has enough evidence, who prevents duplicate work, and who recovers when a subagent fails. In early versions of Anthropic’s research system, agents could spawn too many subagents for simple queries, search endlessly for nonexistent sources, or distract one another with excessive updates.

Without clear task boundaries, output formats, and stop conditions, multi-agent systems become harder to debug.

7.2 Cost, latency, and state synchronization get worse

More agents mean more prompts, more context copying, more tool calls, and more intermediate results. Microsoft’s decision framework also warns that multi-agent systems increase protocol design, error handling, state synchronization, monitoring, and debugging costs.

This is not abstract. Every handoff can add latency. Every subagent can reread background material. Every aggregation point can lose detail or introduce inconsistency.

Multi-agent should therefore serve a clear benefit: better isolation, stronger parallelism, clearer auditability, or safer permissions. It should not exist merely because it looks more like a team.

7.3 The security surface expands

More agents also mean more data flow. Data passed from one agent to another may contain unsanitized web content, pull request comments, issue text, tool output, log fragments, or third-party documentation. Any of these can become a prompt-injection carrier.

The OWASP AI Agent Security Cheat Sheet warns against giving agents unrestricted tool access, letting agents make high-impact decisions without human oversight, and passing unsanitized data between agents in multi-agent systems.

So multi-agent is not merely “several intelligent workers collaborating.” It is also several identities, communication channels, authorization points, logs, and attack surfaces.
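
One mitigation that fits this picture is to carry provenance with every message and shrink an agent's tool set once untrusted content enters its context. The sketch below is an assumption-laden illustration, not a control taken from the OWASP cheat sheet: the source names, the trusted set, and the high-impact tool list are all hypothetical, and provenance tagging does not by itself neutralize prompt injection.

```python
from dataclasses import dataclass

# Assumption for this sketch: only these internal sources count as trusted.
TRUSTED_SOURCES = frozenset({"orchestrator", "policy_engine"})

# Hypothetical tools whose misuse would be expensive.
HIGH_IMPACT_TOOLS = frozenset({"deploy", "send_external_email", "delete_data"})


@dataclass(frozen=True)
class Message:
    source: str    # e.g. "web_page", "pr_comment", "tool_output"
    content: str

    @property
    def trusted(self) -> bool:
        return self.source in TRUSTED_SOURCES


def effective_tools(allowed: frozenset[str], inputs: list[Message]) -> frozenset[str]:
    # If any input came from an untrusted source, drop high-impact tools
    # for this run. Human oversight of those tools is a separate control.
    if any(not message.trusted for message in inputs):
        return allowed - HIGH_IMPACT_TOOLS
    return allowed
```

The restriction is computed from where the data came from, not from what the data says, so injected text cannot argue its way back to the removed tools.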

8. An Engineering Checklist

8.1 Ask about boundaries before counting agents

Before deciding to use multi-agent, ask:

Should these steps see different context?
Do they need different tools, credentials, or execution environments?
Do they perform actions at different risk levels?
Should they follow different recovery strategies when they fail?
Do they require independent evaluation, approval, or audit?
Do they need parallel exploration across a broad problem space?
If you do not use multi-agent, can a workflow, CI check, policy gate, or human approval express the boundary more simply?

These questions get closer to the architecture than “How many roles can I name?”

8.2 Choose the right boundary mechanism

A practical decision guide looks like this:

Use a single agent when the task has one coherent context, one permission set, low-risk output, and no strong need for independent review.
Use workflow or CI gates when the boundary is deterministic, verifiable, and repeatable, such as tests, scanning, formatting, or approval rules.
Use agents-as-tools when a manager agent should retain final-output ownership while calling bounded specialist capabilities.
Use handoffs when a specialist should own the next branch of conversation, policy, or tool environment.
Use full multi-agent orchestration when multiple independently bounded agents must work in parallel, hand off work, aggregate evidence, and enforce separate authority.

The point of the checklist is not to make multi-agent sound difficult. It is to make it honest. Split when the boundary is real. Stay restrained when it is not.
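
The guide above can be written down as a small function, which makes the choice reviewable. This is a sketch under stated assumptions: the Needs fields are hypothetical, and the order of the checks is an assumption of the example rather than something the guide prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    deterministic_check_is_enough: bool = False
    independent_agents_in_parallel: bool = False
    specialist_must_own_next_branch: bool = False
    bounded_specialist_help: bool = False


def choose_mechanism(needs: Needs) -> str:
    # Assumed ordering: a deterministic gate wins when it suffices;
    # otherwise the most demanding requirement decides.
    if needs.deterministic_check_is_enough:
        return "workflow or CI gate"
    if needs.independent_agents_in_parallel:
        return "multi-agent orchestration"
    if needs.specialist_must_own_next_branch:
        return "handoff"
    if needs.bounded_specialist_help:
        return "agents-as-tools"
    # No boundary was claimed, so the default is the simplest option.
    return "single agent"
```

The default branch matters most: if no boundary can be named, the function returns a single agent.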

Closing Thoughts

Multi-agent is often explained as an organizational metaphor: a team with a planner, an implementer, a reviewer, and a releaser. But the useful engineering question is not “How many people are on the team?” It is “Which powers should not be held by the same actor, which information should not be seen by the same subject, and which actions should not be approved by the same harness?”

When context, tools, permissions, objectives, failure constraints, and approval responsibility must be isolated, multi-agent is a powerful architecture. When the task merely has different tones, different steps, or a slightly longer flow, a single agent plus workflow is often simpler, cheaper, and more reliable.

This framework draws on OpenAI’s documentation for Agents and orchestration, LangChain’s multi-agent design guide and agent harness definition, the harness engineering article on Martin Fowler’s site, Microsoft’s single-agent versus multi-agent guidance, Anthropic’s multi-agent research system write-up, Databricks’ explanation of the AI agent harness, the Hugging Face agent glossary, OWASP’s AI Agent Security Cheat Sheet and AISVS Appendix C, and research on LLM-as-a-Judge self-preference bias.

So when do you need multi-agent?

Not when you can draw several roles.

When you can explain the runtime boundaries that should not be mixed.