The pattern everyone tried first, and mostly got wrong
By late 2024, every enterprise AI architecture deck had the same slide: a supervisor agent at the top, a handful of specialised sub-agents below, arrows criss-crossing in every direction. By mid-2025, the same teams were quietly rewriting those systems as single-agent pipelines with well-defined tool catalogues. Not because multi-agent is wrong as an architecture – but because most teams shipped the pattern before they needed the complexity, and paid for it in latency, cost, debugging time, and operational fragility they had not budgeted for.
In 2026 the picture has clarified. Multi-agent orchestration is a genuinely useful pattern for specific problem shapes. It is also the wrong answer for the majority of enterprise use cases, where a single well-tooled agent with a strong evaluation harness beats a supervisor-and-workers setup on every axis that matters – latency, cost, observability, debuggability, and reliability. Deciding which camp a given problem is in is the architectural decision most teams still get wrong.
The framework that follows walks through when a single agent is enough, the four problem shapes where multi-agent earns its complexity, the three production-validated topologies, the failure modes specific to multi-agent systems, the evaluation discipline that mature deployments require, the disciplined upgrade path from single to multi, and the architectural decisions that age gracefully versus the ones that calcify around the current fashion.
When a single agent is enough
The default architectural choice for enterprise AI deployments in 2026 should be a single agent with a carefully curated tool catalogue, a strong system prompt, a typed-output contract, and a well-maintained evaluation suite. That architecture handles a surprisingly large share of real enterprise work – customer support, sales assistance, internal knowledge search, document review, data extraction, structured-output generation – with materially less complexity and lower operating cost than a multi-agent equivalent.
The reason is structural: modern foundation models are good at tool use. A capable model can reliably orchestrate 10–30 tools inside a single context, handle multi-step plans, recover from tool failures, and maintain coherent state across long conversations. The overhead of splitting those capabilities across multiple agents – prompt boilerplate for each agent, inter-agent communication protocols, duplicated context across hand-offs, added end-to-end latency, multiplied LLM-call cost – is a real tax, and for most workloads the tax exceeds the architectural benefit.
The cost economics make the case sharper. A single-agent request typically makes 3–10 LLM calls behind a user action. A multi-agent request behind the same user action routinely makes 30–80. Even at the same price per call, the multi-agent system costs roughly 5–10x more per request, with materially worse p95 and p99 latency. The defensible default is single-agent until a specific pressure justifies splitting.
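The arithmetic behind that multiplier can be sketched in a few lines. This is an illustrative calculation only: the `CallProfile` type, the call counts (picked from inside the ranges quoted above), and the flat per-call price are all assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    """Hypothetical cost profile for one user action."""
    llm_calls: int        # LLM calls made behind a single user action
    cost_per_call: float  # assumed flat price per call, in USD

    def cost_per_request(self) -> float:
        return self.llm_calls * self.cost_per_call


def cost_ratio(multi: CallProfile, single: CallProfile) -> float:
    """How many times more the multi-agent request costs."""
    return multi.cost_per_request() / single.cost_per_request()


# Illustrative values only, chosen from within the 3-10 and 30-80 ranges.
single = CallProfile(llm_calls=6, cost_per_call=0.01)
multi = CallProfile(llm_calls=48, cost_per_call=0.01)

print(f"multi-agent costs {cost_ratio(multi, single):.1f}x the single-agent request")
```

With a flat per-call price the ratio reduces to the ratio of call counts, which is why the call count per user action is the number worth tracking first.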
When multi-agent earns its complexity
There are four problem shapes where multi-agent is genuinely the right architecture rather than a fashionable one. The shared property is that the complexity premium pays back in concrete, measurable benefit on the workload.
- True parallel work. If a task decomposes into independent subtasks that can run concurrently – research across five different knowledge sources, generate four variants of a design, validate against three separate rule sets, query three databases simultaneously – parallel agents deliver wall-clock speedups that a sequential single agent cannot match. The latency benefit is the architectural justification.
- Heterogeneous skill profiles. When different subtasks genuinely benefit from different models – a cheap fast model for routing and intent classification, a frontier model for complex reasoning, a specialised code model for code generation, a specialised vision model for image understanding – a multi-agent architecture lets each step run on the right cost-quality point. The cost economics are the architectural justification.
- Isolation of privileged capability. If one subtask requires elevated permissions (write to production, send external email, move money, access PII) and others do not, separating them into distinct agents with distinct permission scopes is a security and audit win, not just a complexity cost. The defence-in-depth argument is the architectural justification.
- Organisational or cross-team workflows. When "agent A" is owned by one team and "agent B" by another, each with its own lifecycle, evaluation suite, ownership, and release cadence, protocol-level separation lets them evolve independently without coordinating rewrites. The organisational scaling is the architectural justification.
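The wall-clock argument for the first shape, true parallel work, can be shown with a minimal sketch. Everything here is a stand-in: `query_source` sleeps instead of calling a model or a knowledge source, and the source names and delays are hypothetical.

```python
import asyncio
import time


async def query_source(name: str, delay_s: float) -> str:
    # Stand-in for an independent sub-agent or tool call.
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def run_sequential(sources):
    # One call after another: total time is the sum of the delays.
    return [await query_source(name, delay) for name, delay in sources]


async def run_parallel(sources):
    # Independent calls run concurrently: total time is roughly the longest delay.
    return await asyncio.gather(*(query_source(n, d) for n, d in sources))


async def timed(coro):
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start


SOURCES = [("wiki", 0.05), ("tickets", 0.05), ("crm", 0.05)]


async def main():
    seq, seq_t = await timed(run_sequential(SOURCES))
    par, par_t = await timed(run_parallel(SOURCES))
    return seq, seq_t, par, par_t


seq, seq_t, par, par_t = asyncio.run(main())
print(f"sequential {seq_t:.2f}s, parallel {par_t:.2f}s")
```

The speedup only materialises when the subtasks are genuinely independent; if one subtask needs another's output, the fan-out collapses back into a sequence.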
The three topologies that survive production
Across the multi-agent deployments we have worked on through 2024–2026, three topologies have survived contact with production. Other topologies exist in research papers but rarely land cleanly in enterprise environments without operational fragility.
- Supervisor / workers. A single orchestrator agent decides which specialised worker to invoke, aggregates results, and presents the final response. Best for heterogeneous-skill workloads with predictable routing. The supervisor handles cross-worker state; the workers are stateless specialists. Operationally the easiest multi-agent pattern to maintain.
- Pipeline / hand-off. Agents pass control sequentially, each transforming the state and handing off to the next. Best for workflows with clear stages (triage → research → resolve → respond → close, or extract → validate → enrich → store) and stable stage definitions that do not require backtracking. The hand-off schema is the load-bearing contract.
- Parallel / voter. Multiple agents tackle the same task independently; a judge or voter picks the winner. Best for high-stakes decisions where diversity of approach improves robustness – legal review with multiple analysis angles, adversarial classification with diverse classifiers, safety evaluations with diverse rubrics. Cost-heavy; reserve for decisions where the cost of error is meaningfully higher than the cost of running multiple agents.
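As a rough illustration of the supervisor / workers topology described above, the sketch below keeps cross-worker state in the supervisor and makes the workers stateless functions. The worker names are hypothetical, and the keyword-based `route` function stands in for what would normally be a model-driven routing decision.

```python
from typing import Callable, Dict, List, Tuple

Worker = Callable[[str], str]


def billing_worker(task: str) -> str:
    # Stateless specialist: depends only on the task it is handed.
    return f"[billing] handled: {task}"


def tech_worker(task: str) -> str:
    return f"[tech] handled: {task}"


WORKERS: Dict[str, Worker] = {"billing": billing_worker, "tech": tech_worker}


def route(task: str) -> str:
    # Keyword stand-in for a routing model; an assumption for illustration.
    return "billing" if "invoice" in task.lower() else "tech"


class Supervisor:
    def __init__(self, workers: Dict[str, Worker]) -> None:
        self.workers = workers
        # Cross-worker state lives in the supervisor, not in the workers.
        self.history: List[Tuple[str, str]] = []

    def handle(self, task: str) -> str:
        name = route(task)
        result = self.workers[name](task)
        self.history.append((name, task))
        return result


supervisor = Supervisor(WORKERS)
print(supervisor.handle("Invoice 1042 was charged twice"))
print(supervisor.handle("The app crashes on login"))
```

Because the workers hold no state, any one of them can be replaced or re-evaluated in isolation, which is much of what makes this pattern the easiest to maintain.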
The failure modes to design against
Multi-agent systems fail in ways single-agent systems do not. Five recurring failure modes are common enough that designing against them from day one saves most of the operational pain:
- Context sprawl. Each agent needs relevant context, and hand-offs either duplicate it (expensive in tokens and latency) or compress it (lossy and biased). Design the hand-off schema up-front, treat it as a versioned contract, test it with adversarial payloads, and measure context-utilisation per agent so the architectural team can see where context is being wasted.
- Cascading errors. An early agent's small mistake becomes a later agent's confident premise, and the error compounds across the pipeline. Build per-step confidence scoring, a supervisor-level "am I still on track" check, and the option to roll back to an earlier state when downstream confidence collapses.
- Runaway loops. Two agents deferring to each other, a supervisor re-dispatching the same task, or a planning agent revisiting the same options will burn tokens fast and produce nothing. Hard step-count caps, diversity checks on successive tool calls, and explicit loop-detection heuristics are a baseline rather than optional.
- Cost opacity. A multi-agent request can make 30–80 LLM calls behind a single user action. Per-request cost tracking, with token count rolled up across all agents, belongs in observability from day one rather than as a later addition when the bill arrives. Cost-per-task distribution (not average) is the right operational metric.
- Permission leakage. In a multi-agent system, sensitive data and elevated permissions cross agent boundaries. When the permission model is implicit ("the worker only does what the supervisor delegates"), the system is one prompt injection away from a privilege escalation. Explicit per-agent permission scopes and audit logging are the structural fix.
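The loop safeguards named in the list above (a hard step-count cap plus a repeat check on successive tool calls) might look like the following. This is a minimal sketch under stated assumptions: `LoopGuard`, `LoopDetected`, and the default thresholds are hypothetical names and values, and exact-match repetition is only the simplest possible diversity check.

```python
import json
from collections import Counter


class LoopDetected(RuntimeError):
    """Raised when a request exceeds its step cap or keeps repeating itself."""


class LoopGuard:
    def __init__(self, max_steps: int = 20, max_repeats: int = 2) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen: Counter = Counter()

    def check(self, agent: str, tool: str, args: dict) -> None:
        """Call once before every tool call or agent dispatch."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise LoopDetected(f"step cap of {self.max_steps} exceeded")
        # Identical (agent, tool, arguments) triples count as repeats.
        key = (agent, tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise LoopDetected(f"{agent} called {tool} {self.seen[key]} times with the same arguments")


guard = LoopGuard(max_steps=10, max_repeats=2)
try:
    for _ in range(5):
        guard.check("planner", "search", {"query": "refund policy"})
except LoopDetected as exc:
    print(f"terminated: {exc}")
```

One guard instance per user request keeps the counters scoped to a single trajectory; near-duplicate calls would need a fuzzier comparison than exact argument matching.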
Evaluation is harder, and mandatory
A single-agent system has one thing to evaluate: the final response. A multi-agent system has many things to evaluate: the planning decomposition, the routing decisions, each worker's output, the aggregation, the final response, and the trajectory itself. Teams that treat multi-agent evaluation as "same as single-agent evaluation" ship brittle systems and then cannot diagnose regressions when they occur.
The practical evaluation framework for multi-agent deployments:
- Final-output evaluation. The headline metric on user-visible quality.
- Planning-quality evaluation. Did the orchestrator pick a sensible decomposition for this task? Did it route to the right specialists? Did it avoid unnecessary steps?
- Per-worker evaluation. Each worker is evaluated independently with rubrics narrow to its specialisation, against its own regression set.
- Trajectory metrics. Steps taken, tools used, retries attempted, parallelism achieved (for parallel topologies), termination correctness.
- Cost-and-latency distributions. Per-task cost variance is the cheapest signal for unhealthy agent behaviour; p95 and p99 latency reveal the tail that dominates user experience on multi-agent workflows.
- Cross-agent consistency. Does the supervisor's understanding of task state match the workers'? When the agents disagree, the architecture has a coordination problem that single-output evaluation cannot detect.
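To see why cost and latency are better read as distributions than as averages, consider the sketch below. The per-request costs are invented for illustration, and `percentile` is a simple nearest-rank implementation rather than any particular monitoring tool's definition.

```python
from statistics import mean
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


# Hypothetical per-request costs in USD: most requests are cheap,
# two long trajectories are expensive.
costs = [0.05] * 18 + [1.50] * 2

print(f"mean {mean(costs):.3f}")
print(f"p50  {percentile(costs, 50):.2f}")
print(f"p95  {percentile(costs, 95):.2f}")
print(f"p99  {percentile(costs, 99):.2f}")
```

In this made-up sample the mean sits well above what a typical request costs and well below what the expensive ones cost, so it describes neither; the gap between p50 and p95 is the signal that some trajectories are running long.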
The disciplined upgrade path from single to multi
A useful piece of architectural discipline: start single, measure pressure on the architecture, then split. If a single-agent implementation is hitting a real wall – sequential latency on calls that could be parallel, a capability that genuinely requires permission isolation, a workload where half the steps could run on a much cheaper model – those are the pressures that justify splitting into multi-agent. If the push toward multi-agent is coming from "it feels more sophisticated" or "competitors are doing it", the upgrade is almost always premature.
The architectures that age well are the ones that can collapse back. Write each agent's tool catalogue as a coherent capability module that the orchestration layer can consume either as a single-agent tool set or as distinct specialised agents. That single architectural discipline – tools as the durable asset, agent topology as the replaceable scaffolding – is what makes the difference between architectures that evolve gracefully across two or three model generations and architectures that calcify around the current fashion and become expensive to migrate when the fashion shifts.
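One way to picture "tools as the durable asset, agent topology as the replaceable scaffolding" is to define capability modules once and derive either topology from them. The sketch below is an assumption about how that could be structured; `CapabilityModule`, the module names, and the toy tools are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CapabilityModule:
    """A coherent group of tools that can be owned and tested together."""
    name: str
    tools: Dict[str, Callable[..., object]]


def as_single_agent(modules: List[CapabilityModule]) -> Dict[str, Callable[..., object]]:
    # Flatten every module into one namespaced catalogue for a single agent.
    return {
        f"{module.name}.{tool_name}": fn
        for module in modules
        for tool_name, fn in module.tools.items()
    }


def as_specialised_agents(modules: List[CapabilityModule]) -> Dict[str, Dict[str, Callable[..., object]]]:
    # One agent per module, each seeing only its own tools.
    return {module.name: dict(module.tools) for module in modules}


MODULES = [
    CapabilityModule("search", {"lookup": lambda q: f"results for {q}"}),
    CapabilityModule("billing", {"refund": lambda order_id: f"refunded {order_id}"}),
]

print(sorted(as_single_agent(MODULES)))
print(sorted(as_specialised_agents(MODULES)))
```

Splitting into multiple agents or collapsing back to one then becomes a change in how the modules are wired, not a rewrite of the tools themselves.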
Operational considerations the architecture decks miss
Six operational dimensions distinguish a production-ready multi-agent deployment from a research prototype, and most architecture decks do not discuss them:
- Observability per agent step. Each agent action produces a traceable event with input context, tool calls, output, and reasoning. The trace is what makes the system debuggable when a complex production failure occurs.
- State management. Where the inter-agent state lives – in a shared store, in hand-off messages, in the supervisor's context – is an architectural decision that affects every other operational property. Make it explicit.
- Cost budgeting per request. Hard cost caps per user action prevent runaway loops from producing surprise bills. The cap should fail closed (terminate the request) rather than silently truncating output.
- Retry and graceful degradation. When a worker fails, what happens? Retry on a different model? Fall back to a deterministic path? Surface the failure to the user? Each option is a defensible engineering choice; not deciding is not.
- Audit trail for regulated workflows. For deployments touching regulated data, the per-agent action log is regulatory evidence. The schema and retention policy belong in the architecture, not as an afterthought.
- Model-swap discipline. The agent topology should survive a model upgrade in any single agent. If swapping the supervisor or any single worker requires re-engineering the topology, the architecture is too tightly coupled to the current model generation.
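The per-request cost cap from the list above, with spend rolled up across agents and a fail-closed policy, could be sketched as follows. `RequestBudget`, `BudgetExceeded`, the agent names, and the dollar figures are illustrative assumptions; a real deployment would derive the charge from token counts and its provider's pricing.

```python
from collections import defaultdict
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised to terminate the request once the cost cap would be breached."""


class RequestBudget:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.by_agent: Dict[str, float] = defaultdict(float)

    @property
    def spent_usd(self) -> float:
        # Total spend rolled up across every agent in this request.
        return sum(self.by_agent.values())

    def charge(self, agent: str, cost_usd: float) -> None:
        """Reserve spend before a call; fail closed if it would exceed the cap."""
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(
                f"{agent} needs {cost_usd:.2f} USD but only "
                f"{self.cap_usd - self.spent_usd:.2f} USD remains"
            )
        self.by_agent[agent] += cost_usd


budget = RequestBudget(cap_usd=0.10)
try:
    budget.charge("supervisor", 0.04)
    budget.charge("researcher", 0.04)
    budget.charge("researcher", 0.04)  # would take the request past the cap
except BudgetExceeded as exc:
    print(f"request terminated: {exc}")
print(dict(budget.by_agent))
```

Raising an exception rather than trimming the response is what makes the cap fail closed: the caller has to handle the termination explicitly instead of receiving silently truncated output.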
Frequently asked questions
Common questions raised by AI architects scoping a multi-agent deployment:
- How do I know if my use case justifies multi-agent? Ask whether the four pressures (parallel work, heterogeneous skill profiles, privilege isolation, organisational separation) apply. If two or more apply, multi-agent is probably worth the complexity. If exactly one applies, measure that pressure on the single-agent baseline before splitting. If none apply, single-agent with good tools is the right default.
- Which topology should I default to? Supervisor / workers for most enterprise heterogeneous-skill workloads. Pipeline / hand-off for stable-stage workflows where the stages do not need to revisit each other. Parallel / voter for genuinely high-stakes decisions where diversity matters.
- How do I cost a multi-agent system? Per-request cost = the sum of LLM-call costs across all agents per user action, plus the tool-call costs (retrieval, API calls, code execution). Multi-agent requests routinely cost 5–10x the equivalent single-agent requests, and the cost is concentrated in the tail of long trajectories. A per-request cost cap is the operational baseline.
- How do I handle inter-agent communication? Typed message schemas. JSON Schema or equivalent contract for each hand-off, versioned in source control, validated on every transmission. Free-form natural-language hand-offs are the source of most multi-agent reliability issues in production.
- What is the typical timeline to move from single-agent to multi-agent in production? 4–8 weeks if the existing system is well-instrumented and the multi-agent pressures are clear; 3–6 months if the existing system has no evaluation suite or observability. The evaluation and observability work is the rate-limiting factor, not the model engineering.
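The typed hand-off contract recommended in the inter-agent communication answer can be illustrated with a minimal sketch. It uses only the Python standard library as a stand-in for a JSON Schema validator; the `TriageHandoff` fields, the category list, and the version number are hypothetical.

```python
import json
from dataclasses import dataclass

SCHEMA_VERSION = 1  # bumped whenever the hand-off contract changes
ALLOWED_CATEGORIES = {"billing", "technical", "account"}
EXPECTED_FIELDS = {"schema_version", "ticket_id", "category", "confidence"}


@dataclass(frozen=True)
class TriageHandoff:
    """Hypothetical message passed from a triage agent to the next stage."""
    schema_version: int
    ticket_id: str
    category: str
    confidence: float


def parse_handoff(raw: str) -> TriageHandoff:
    """Validate a hand-off on receipt; reject anything off-contract."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        raise ValueError("hand-off fields do not match the contract")
    if data["schema_version"] != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema version: {data['schema_version']}")
    if not isinstance(data["ticket_id"], str) or not data["ticket_id"]:
        raise ValueError("ticket_id must be a non-empty string")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {data['category']}")
    confidence = data["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return TriageHandoff(**data)


message = json.dumps(
    {"schema_version": 1, "ticket_id": "T-1", "category": "billing", "confidence": 0.9}
)
print(parse_handoff(message))
```

Validating on every transmission means a malformed or off-version message is rejected at the boundary instead of becoming the next agent's confident premise.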


