How to Build a Multi-Agent AI System: Architecture + Code Patterns (2026)

Most multi-agent walkthroughs in 2026 stop at the demo. Two agents pass JSON, the audience claps, and nobody asks what happens at run number 4,000 when the third specialist times out and the orchestrator has not been told the difference between a tool error and a model refusal. This piece is the part the demos skip — written from the operator side, by people who run agent fleets in production every day.

We will cover single-agent vs multi-agent, the foreman pattern that beats peer-to-peer for business workflows, how to design role cards that survive prompt drift, the communication protocols that matter (MCP, A2A, structured outputs), why graph memory beats vector stores for cross-agent context, the five layers of a production architecture, the failure modes you will hit before month two, and what we learned shipping Knowlee 4Sales.

Why multi-agent is genuinely different

Single-agent and multi-agent are different categories of system, not different sizes of the same one. Treating them interchangeably is the most expensive mistake we see operators make.

A single-agent system is one prompt-and-tool loop. The model receives an input, decides on actions, calls tools, observes results, and produces an output. Reasoning, planning, and execution share one context window. This works extremely well when the work fits inside one context window, the tool surface is bounded (under twelve tools is a useful threshold), and the workflow is mostly linear. A coding assistant operating on one repo, a research helper writing one report, a support bot answering one ticket — these are single-agent problems. Multi-agent here is overengineering.

A multi-agent system is a runtime in which multiple specialized loops collaborate on goals no single loop can complete alone. The runtime decides which agent runs when, what state passes between them, what each one is allowed to touch, and what evidence of the work is captured. You move to multi-agent when at least one of these is true:

The workflow has more than one mode of expertise. Lead discovery is research-heavy. Outreach personalization is writing-heavy. Reply handling is decision-heavy. Forcing one agent to be good at all three produces an agent that is mediocre at each.
The tool surface exceeds what one agent can hold in context. In our experience, once an agent carries more than roughly fifteen to twenty tools, its accuracy on tool selection degrades noticeably. Specialization narrows each agent's tool surface and recovers the accuracy.
Different parts of the workflow have different governance requirements. A research agent that reads public web data is low-risk. An outreach agent that sends external email is medium-risk. A negotiation agent that quotes price is high-risk. Different risk levels demand different oversight, logging, and human-in-the-loop policies. Splitting agents lets you set those policies cleanly.
The work has to keep running while one part is paused. A pipeline-style multi-agent system can hold a draft in review while continuing to discover new leads. A monolith cannot.
Auditability matters. Regulators and enterprise buyers want to know which decisions a system made and on what evidence. Per-role logs are easy to read; one massive single-agent transcript is not.

If none of these apply, build a single-agent system. You can always split later. A useful test: list every distinct decision the system has to make in one run, group by the kind of expertise required, and count the groups. One group means one agent. Three or more means multi-agent, and you should skip the intermediate step where you pretend two of the groups are the same.
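The counting test above can be sketched in a few lines. The decision names and expertise labels below are made up for illustration, and the two-group case is left as a judgment call because the rule of thumb does not settle it.

```python
from collections import defaultdict

# Hypothetical decisions for one run, each tagged with the kind of
# expertise it needs. The labels are illustrative, not a fixed taxonomy.
decisions = [
    ("pick target companies", "research"),
    ("score ICP fit", "research"),
    ("write first-touch email", "writing"),
    ("classify inbound reply", "decision"),
    ("propose next action", "decision"),
]

def expertise_groups(items):
    """Group decision names by the expertise they require."""
    groups = defaultdict(list)
    for name, expertise in items:
        groups[expertise].append(name)
    return dict(groups)

def recommend(items):
    """One group -> single agent; three or more -> multi-agent."""
    count = len(expertise_groups(items))
    if count <= 1:
        return "single-agent"
    if count >= 3:
        return "multi-agent"
    return "judgment call"  # two groups: the rule of thumb gives no hard answer

print(recommend(decisions))  # multi-agent
```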

The foreman / manager pattern

The single most useful architectural choice in a business multi-agent system is to put one agent in charge — the foreman, the manager, the orchestrator, choose the metaphor — and have everyone else report to them.

Peer-to-peer multi-agent systems, where any agent can call any other agent, look elegant in a diagram and produce loops in production. Two agents will talk to each other indefinitely if neither has been given a stop condition the other respects. Three agents will form a triangle that the operator notices only when the token bill arrives. Free-form agent communication is a research pattern, not a production one.

The foreman pattern fixes this. The foreman owns task decomposition, routing, handoff validation, and failure handling. Specialists own execution within their narrow scope. Specialists never talk to each other directly; they hand work back to the foreman, who decides what runs next.

                ┌───────────────┐
                │   Operator    │
                └──────┬────────┘
                       │ goal
                       ▼
                ┌───────────────┐
                │   Foreman     │  decomposes, routes, validates
                └──┬─────────┬──┘
       work item  │         │ work item
                  ▼         ▼
        ┌─────────────┐ ┌─────────────┐
        │ Specialist  │ │ Specialist  │ ... narrow scope, narrow tools
        │     A       │ │     B       │
        └─────┬───────┘ └─────┬───────┘
              │ result        │ result
              └────────┬──────┘
                       ▼
                ┌───────────────┐
                │   Foreman     │  decides next step or completes
                └───────────────┘

A frontend can render this as Mermaid; the topology is what matters. The foreman is the only agent allowed to spawn or terminate other agents. Specialists return their output and exit. The foreman validates, decides, and either dispatches the next specialist or returns the final result to the operator.

Why this beats peer-to-peer:

Specialist-to-specialist loops are impossible by construction. Specialists cannot call each other, so they cannot ping-pong. The one loop that remains is the foreman's own dispatch cycle, and that has an explicit max-step counter; if it hits the limit, the run terminates with a clear failure mode.
Handoffs are inspectable in one place. Every transition between agents goes through the foreman. The foreman log is the audit trail. You do not have to reconstruct the order of operations from per-agent traces.
Cost is bounded. The foreman owns the budget. It can refuse to dispatch a specialist if the run has already consumed too many tokens, time, or tool calls.
Errors have one recovery point. When a specialist fails, the foreman decides whether to retry, dispatch a different specialist, escalate to the operator, or abort. That decision lives in one prompt and is testable.
Specialists can be swapped. Because specialists do not depend on each other, you can replace any one of them — different model, different prompt, different tools — without rewriting the system.
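A minimal sketch of the dispatch cycle these properties rest on, assuming specialists are plain callables and the routing decision is a function (in a real system it would be a model call). `Result`, `Foreman`, and the `route` callback are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    ok: bool
    payload: dict
    tokens_used: int

@dataclass
class Foreman:
    specialists: dict            # role name -> callable(work_item) -> Result
    max_steps: int = 10
    token_budget: int = 10_000
    log: list = field(default_factory=list)  # one audit trail for every handoff

    def run(self, route) -> str:
        """`route(results)` returns (role, work_item), or None when the goal is met."""
        results, spent, steps = [], 0, 0
        while True:
            decision = route(results)
            if decision is None:
                return "completed"
            if steps >= self.max_steps:
                return "failed: max steps exceeded"
            if spent >= self.token_budget:
                return "failed: token budget exhausted"
            role, work_item = decision
            result = self.specialists[role](work_item)
            steps += 1
            spent += result.tokens_used
            self.log.append((steps, role, result.ok))
            if not result.ok:
                return f"failed: {role} returned an error"
            results.append(result)

# Stand-in specialists; real ones would call a model.
def discover(item):
    return Result(True, {"candidates": ["acme.example"]}, tokens_used=400)

def draft(item):
    return Result(True, {"body": f"Hello {item['domain']}"}, tokens_used=600)

def two_step_route(results):
    if len(results) == 0:
        return ("discover", {"industry": "logistics"})
    if len(results) == 1:
        return ("draft", {"domain": results[0].payload["candidates"][0]})
    return None

foreman = Foreman({"discover": discover, "draft": draft})
print(foreman.run(two_step_route))                     # completed

runaway = Foreman({"discover": discover}, max_steps=3)
print(runaway.run(lambda results: ("discover", {})))   # failed: max steps exceeded
```

Specialists only ever receive a work item and return a result, so the only loop in the system is the `while` in `run`, and that loop is bounded by both counters.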

The foreman has one cost: it is the bottleneck and the single point of failure. We mitigate this two ways. First, the foreman is intentionally smaller and faster than its specialists — a routing-and-validation prompt, not a doing-the-work prompt. Second, its prompt is the most carefully tested artifact in the system, with replay tests for every handoff.

For deeper coverage of the orchestration patterns and where each fits, see our multi-agent orchestration explainer.

Designing role cards

A role card is the contract that defines a specialist agent. Get it right and the agent behaves predictably. Get it wrong and the agent will wander, expand its scope, or refuse work it should accept. Every specialist in our production fleet has a role card with these five fields.

Identity. One sentence. What this agent is and who it reports to. Example: "You are the Lead-Discovery agent. You report to the Sales Foreman. You find companies that match the operator's ICP and return a structured list of candidates."

Scope. What this agent does and explicitly does not do. Scope keeps the agent from leaking into adjacent work. The boundary is more important than the inclusion list. Example: "You discover and qualify candidate companies. You do not enrich contacts, write outreach copy, or contact anyone. If you receive a request to do those things, return a scope-violation error to the foreman."

Inputs. The schema of the work item the foreman will hand you. Always typed. Always validated. The agent should refuse work that does not match the schema, not improvise around it. Example: ICP definition (industry, headcount range, geography), max candidates, exclusion list.

Outputs. The schema of what you return. Same rules: typed, validated, never freeform. The output schema is what the foreman will validate when work comes back; if the agent returns a shape that does not match, the foreman rejects it and either retries or fails the run.

Escalation. What this agent does when it cannot complete the work. Three concrete branches: retryable failure (transient tool error, return a retryable error code), unrecoverable failure (input is invalid, return an unrecoverable error code), human-needed (judgment call outside the agent's scope, return a request-for-human-review). Specialists never silently fail and never improvise around blockers.
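The five fields map naturally onto a small data structure. The sketch below is one possible encoding using only the standard library; the class and field names are assumptions for illustration, and the example values are taken from the Lead-Discovery description in this article.

```python
from dataclasses import dataclass
from enum import Enum

class Escalation(Enum):
    RETRYABLE = "retryable"
    UNRECOVERABLE = "unrecoverable"
    HUMAN_NEEDED = "human_needed"

@dataclass(frozen=True)
class RoleCard:
    identity: str
    does: tuple
    does_not: tuple
    inputs: dict        # field name -> expected Python type
    outputs: dict       # field name -> expected Python type
    escalation: dict    # condition -> Escalation branch
    version: str = "1.0.0"

    def input_problems(self, work_item: dict) -> list:
        """Return a list of schema problems; an empty list means the item is acceptable."""
        problems = []
        for name, expected in self.inputs.items():
            if name not in work_item:
                problems.append(f"missing field: {name}")
            elif not isinstance(work_item[name], expected):
                problems.append(f"wrong type for {name}: expected {expected.__name__}")
        for name in work_item:
            if name not in self.inputs:
                problems.append(f"unexpected field: {name}")
        return problems

lead_discovery = RoleCard(
    identity="Lead-Discovery agent, reports to the Sales Foreman",
    does=("discover candidate companies", "qualify candidate companies"),
    does_not=("enrich contacts", "write outreach copy", "contact anyone"),
    inputs={"icp": dict, "max_candidates": int, "exclusions": list},
    outputs={"candidates": list},
    escalation={
        "search rate-limit": Escalation.RETRYABLE,
        "malformed ICP": Escalation.UNRECOVERABLE,
        "too few candidates pass the fit threshold": Escalation.HUMAN_NEEDED,
    },
)

good = {"icp": {"industry": "logistics"}, "max_candidates": 50, "exclusions": []}
print(lead_discovery.input_problems(good))  # []
```

Because the card is frozen and carries a `version`, the same object can be serialized next to the prompt and referenced from run logs.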

Three worked examples from our 4Sales fleet:

Lead-Discovery agent. Identity: finds companies matching ICP. Scope: discovery only — no enrichment, no contact, no outreach. Inputs: ICP schema, max candidates (default 50), exclusion list. Outputs: array of candidate companies with name, domain, evidence URL, fit score. Escalation: retryable on search rate-limit, unrecoverable on malformed ICP, human-needed when fewer than ten candidates pass the fit threshold.

Outreach-Personalization agent. Identity: writes the first-touch message for a contact. Scope: one message per call — never schedules, never sends. Inputs: contact record, company record, recent signal (job change, news, funding), campaign template. Outputs: draft subject, draft body, signal-citation, confidence score. Escalation: retryable on model timeout, unrecoverable on missing signal, human-needed when confidence falls below threshold.

Reply-Handler agent. Identity: classifies and routes inbound replies. Scope: classification and proposed next action — never sends a response and never books a meeting without operator confirmation. Inputs: inbound email, thread history, contact record. Outputs: intent label (interested / not-interested / unsubscribe / out-of-office / question), proposed next action, confidence. Escalation: retryable on parse error, unrecoverable on empty body, human-needed when intent is ambiguous or when the message contains a price question.

Two non-obvious lessons: write the escalation field before inputs and outputs — it forces you to think about failure. And version your role cards. We commit them as YAML alongside the prompts and tag the version in every run log so we can replay old work against old cards when investigating regressions.

For the broader vocabulary of agent roles and how they fit together, see our multi-agent orchestration glossary.

Communication protocols (MCP, A2A, structured outputs)

In 2026, multi-agent systems have two communication protocols that matter, MCP and A2A, plus structured outputs as the foundation everything sits on. Knowing when each fits is the difference between a fragile pipeline and a system that survives swapping models.

Structured outputs. Every agent in a production system emits typed, schema-validated output. Not free text, not "JSON inside a code block, hopefully," but a real schema (JSON Schema, Pydantic, Zod) enforced at the model layer where possible and at the application layer always. This is the single highest-leverage discipline in multi-agent systems. If your agents return free text, your foreman has to parse it; parsing is where regressions hide; regressions in parsing surface as silent quality drops two weeks later. Make this non-negotiable.

# Pydantic example for a Lead-Discovery agent's output
from pydantic import BaseModel, HttpUrl, Field

class CandidateCompany(BaseModel):
    name: str
    domain: str
    evidence_url: HttpUrl
    fit_score: float = Field(ge=0, le=1)

class DiscoveryResult(BaseModel):
    candidates: list[CandidateCompany]
    notes: str | None = None

MCP — Model Context Protocol. MCP is the open protocol Anthropic published in late 2024 that has become the default tool-and-data interface for agents. The mental model is simple: an MCP server exposes a set of tools (and resources) to any MCP-aware agent client. Your agent does not import a Postgres driver; it talks to a Postgres MCP server. Your agent does not call Slack's REST API; it talks to a Slack MCP server.

Why this matters for multi-agent systems: MCP gives every specialist agent the same plug-shape for every tool. Swapping a specialist from Claude to GPT to a self-hosted model does not break tool access, because the tools live behind MCP, not inside the agent. Auth and credentials live with the MCP server, not in agent prompts, which is the security posture you want anyway. The official spec lives at modelcontextprotocol.io.

// Example MCP tool call from an agent client (MCP messages are JSON-RPC 2.0)
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "supabase.execute_sql",
    "arguments": {
      "query": "select id, domain from companies where icp_match = true limit 50"
    }
  }
}

A2A — Agent-to-Agent. A2A is an open protocol for agents to discover and call other agents. Google announced it in April 2025 and donated it to the Linux Foundation later that year; adoption has broadened through 2025-2026. Where MCP is "agent calls a tool," A2A is "agent calls another agent." A2A is the right protocol when specialists live in different runtimes, different organizations, or behind different security boundaries; for example, your foreman calling a vendor's enrichment agent over HTTPS.

In a single-runtime foreman pattern, you do not need A2A — the foreman calls the specialist as a function. You start needing A2A when specialists run somewhere else: a partner exposes their own agent, you split your fleet across runtimes for compliance reasons, or you want a public-facing agent that customers' agents can call directly. The A2A spec is at a2aprotocol.org.

Choosing between them. A short decision rule: structured outputs are always on. MCP is for an agent calling a tool, a database, an API, a file system. A2A is for an agent calling another agent, especially across a trust boundary. In a typical 2026 production fleet you have one A2A endpoint (the foreman, optionally), several MCP servers (one per data source or tool family), and structured outputs everywhere. Anthropic's agent-design documentation is the canonical reference for the patterns that sit on top.

The mistake we see most often: teams skip MCP and let each agent import its own SDKs, then spend the next quarter debugging cross-agent credential leakage. Centralize tool access. Five hours wrapping a service in MCP saves fifty hours of incident response later.

Memory layer + persistent state

Every multi-agent system needs a memory layer. The choice between vector store and graph database is one of the highest-leverage architectural decisions you will make, and the default answer most tutorials give — "use a vector store" — is wrong for cross-agent business workflows.

Vector stores are optimized for similarity search over chunks of text. They are excellent for "find passages relevant to this question" and they underpin most RAG patterns. They are weak at three things multi-agent systems need: representing relationships ("which contacts at this company replied to which campaigns"), enforcing referential integrity ("a deal must belong to a company"), and traversing causality ("what sequence of signals preceded this won deal"). Vectors collapse everything to a high-dimensional point and lose the structure.

Graph databases keep the structure. A node is a thing (company, contact, deal, signal, campaign, agent run). An edge is a relationship between things (contact WORKS_AT company, deal HAS_CONTACT contact, run PRODUCED_OUTPUT lead, signal PRECEDED reply). Queries traverse the graph: "for every contact at companies in our ICP that received a personalized outreach in the last 30 days and replied, return the originating signal." That query is one Cypher statement against a graph; a vector store has no way to express it, because similarity search does not follow relationships.
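To make the traversal concrete without a database, here is a toy in-memory version in Python. The edge names `TRIGGERED`, `SENT_TO`, and `REPLIES_TO` and all IDs are invented for this sketch, and the 30-day window is omitted for brevity; a real deployment would run the equivalent query in the graph database itself.

```python
# Edges as (source, relationship, target) triples. All IDs are made up.
EDGES = [
    ("contact:ana", "WORKS_AT", "company:acme"),
    ("contact:bo", "WORKS_AT", "company:globex"),
    ("signal:funding-1", "TRIGGERED", "outreach:1"),
    ("signal:jobchange-2", "TRIGGERED", "outreach:2"),
    ("outreach:1", "SENT_TO", "contact:ana"),
    ("outreach:2", "SENT_TO", "contact:bo"),
    ("reply:9", "REPLIES_TO", "outreach:1"),
    ("reply:10", "REPLIES_TO", "outreach:2"),
]
ICP_COMPANIES = {"company:acme"}  # globex is deliberately outside the ICP

def targets(source, relationship):
    """Follow outgoing edges of one type from a node."""
    return [t for s, r, t in EDGES if s == source and r == relationship]

def sources(relationship, target):
    """Follow incoming edges of one type into a node."""
    return [s for s, r, t in EDGES if r == relationship and t == target]

def signals_behind_replies():
    """Signals that led to outreach which got a reply from a contact at an ICP company."""
    found = []
    for _reply, relationship, outreach in EDGES:
        if relationship != "REPLIES_TO":
            continue
        for contact in targets(outreach, "SENT_TO"):
            employers = targets(contact, "WORKS_AT")
            if any(company in ICP_COMPANIES for company in employers):
                found.extend(sources("TRIGGERED", outreach))
    return found

print(signals_behind_replies())  # ['signal:funding-1']
```

Each hop follows a typed relationship, which is exactly the information a nearest-neighbor lookup over embeddings does not carry.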

The Knowlee Brain — the cross-vertical memory layer in our platform — is an Enterprise Knowledge Graph + RAG and is exposed to every agent through the standardized tool-orchestration layer. Every agent reads from and writes to the same graph. The lead-discovery agent writes candidate-company nodes. The outreach agent reads them, writes outreach-attempt edges. The reply-handler reads the outreach edges, writes reply nodes. Two months later, the analytics agent traverses the whole graph to answer "what kinds of signals produced what kinds of replies for what kinds of companies."

This compounds in a way vector search does not. Each agent's output enriches the substrate for every other agent. New verticals plug into the same graph and immediately benefit from — and contribute to — the accumulated context.

A short rule: if agents need to find similar text, use a vector store. If agents need to reason about entities and relationships, use a graph. Most production systems eventually want both: the graph as source of truth, and a vector store as an index over unstructured chunks attached to graph nodes. Build the graph first; bolt on vectors when a specific agent needs similarity search.

For a deeper definition of agentic memory and how it fits into the broader stack, see our agentic operating system glossary entry.

Production architecture: the five layers

Every multi-agent system that survives in production has five layers. You can build them in any order, but every one of them has to exist before you can call the system production-grade.

Layer 1 — Data foundation. The graph (or graph + vector store) plus the relational tables of record. Every entity the system reasons about lives here, with referential integrity, change-data-capture, and a clear ownership model. Agents read from and write to this layer; they do not keep authoritative state in their own context. This is also where deletes (for example, GDPR data subject erasure requests) actually take effect.

Layer 2 — Decision engine. The foreman, the specialist agents, their role cards, their prompts, their model bindings. This layer does the cognitive work and changes the most. Versioning everything with the same rigor as application code is non-optional. Every prompt is a file in git. Every model binding is in a config file. Every role card is YAML alongside its prompt. The state of the decision engine at any point in time is reproducible from a commit hash.

Layer 3 — Workflow layer. The orchestration runtime: a job scheduler that triggers foreman runs (cron, queue, webhook), an agent fleet dashboard that surfaces in-flight work to operators, a human-in-the-loop approval flow where agents can suggest new work for review, a pause-resume mechanism for long-running processes. We use a single fleet dashboard as the control plane — one board, one source of truth, no parallel queues.

Layer 4 — Execution surface. The MCP servers, A2A endpoints, tool wrappers, browser automation, email-sending, database connections. Anything an agent needs to do lives here. Centralizing the execution surface is what makes auth, rate limiting, and authorization tractable. An agent never holds a credential; the MCP server it calls does.

Layer 5 — Audit trail and governance. Per-run structured logs, per-decision explainability (which signals, which evidence, which tool calls produced this output), risk classification on every job (minimal / limited / high / unacceptable, matching the EU AI Act tiers), data-category declarations, human-oversight gates, and a tamper-evident archive. In an EU AI Act context this is what keeps your system on the right side of the high-risk requirements; in any context it is what lets you debug a regression three weeks after the fact.

Two principles tie the layers together. First, agents own no state — state lives in the data foundation, agents are stateless workers that read state in and write state out. Second, the workflow layer is the only place humans touch — operators watch the fleet dashboard, approve human-in-the-loop proposals, review high-risk outputs. Anything that requires a human goes through one surface.

Common failure modes and defenses

Eight months of running multi-agent systems in production produces a list of failure modes you will hit before month two. The defenses are not exotic; they just have to be in place before the failure, not after.

Agent loops. Two agents that ping-pong, or one agent that retries a failing tool until the budget is exhausted. Defenses: foreman pattern (specialists cannot call each other), max-step counter on every foreman run, max-token budget on every run, exponential backoff with hard cap on retries, kill switch that stops a run when the budget is exhausted regardless of state.
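A minimal sketch of the retry defense, assuming tool wrappers raise a dedicated exception for transient failures. `RetryableError` and `call_with_backoff` are hypothetical names, and the `sleep` parameter is injectable so the behavior can be tested without waiting.

```python
import time

class RetryableError(Exception):
    """Raised by a tool wrapper for transient failures such as rate limits."""

def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn, retrying on RetryableError with exponential backoff and a hard attempt cap."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts:
                raise  # hard cap reached: surface the failure to the foreman
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))

# Demo: a tool that fails twice, then succeeds. Delays are recorded, not slept.
calls = {"count": 0}
def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RetryableError("rate limited")
    return "ok"

recorded = []
print(call_with_backoff(flaky_search, sleep=recorded.append), recorded)  # ok [0.5, 1.0]
```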

Context bleed. A specialist sees information meant for another agent because the foreman handed off too much state. The specialist then makes decisions based on context it should not have. Defenses: role cards with strict input schemas — the foreman validates the work item against the schema before dispatching, and the schema only contains what that role needs. Never pass the whole conversation. Never pass shared scratch space.
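One way to make "the schema only contains what that role needs" mechanical is to build each work item by projection rather than by passing the run state through. The function and field names below are illustrative assumptions.

```python
def build_work_item(run_state: dict, input_fields: tuple) -> dict:
    """Copy only the fields a role declares as inputs; fail loudly if any is absent."""
    missing = sorted(set(input_fields) - set(run_state))
    if missing:
        raise ValueError(f"run state is missing fields: {missing}")
    return {name: run_state[name] for name in input_fields}

# Hypothetical foreman-side state, including things no specialist should see.
run_state = {
    "icp": {"industry": "logistics", "geography": "EU"},
    "max_candidates": 50,
    "exclusions": ["acme.example"],
    "conversation": ["operator: focus on mid-market"],
    "scratch": {"draft_pricing": 1200},
}
DISCOVERY_INPUTS = ("icp", "max_candidates", "exclusions")

work_item = build_work_item(run_state, DISCOVERY_INPUTS)
print(sorted(work_item))  # ['exclusions', 'icp', 'max_candidates']
```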

Tool authorization drift. An agent gets access to a tool it should not have because someone added it to the allow-list "temporarily" and never removed it. Six months later, the lead-discovery agent has write access to the production database. Defenses: allow-lists per role, committed in code, reviewed in PRs. No runtime tool grants. Periodic audits where the audit job lists every agent's allowed tools and flags additions since last audit.
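The periodic audit can be as simple as diffing two snapshots of the committed allow-lists. The sketch below assumes allow-lists are stored as a mapping from role name to tool names; the role and tool names are invented.

```python
def audit_allow_lists(previous: dict, current: dict) -> dict:
    """Return the tools added per role since the previous audit snapshot."""
    additions = {}
    for role, tools in current.items():
        added = set(tools) - set(previous.get(role, ()))
        if added:
            additions[role] = sorted(added)
    return additions

last_audit = {
    "lead_discovery": ["web_search", "read_companies"],
    "outreach": ["read_contacts", "draft_email"],
}
today = {
    "lead_discovery": ["web_search", "read_companies", "write_production_db"],
    "outreach": ["read_contacts", "draft_email"],
}

print(audit_allow_lists(last_audit, today))
# {'lead_discovery': ['write_production_db']}
```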

Output validation gaps. An agent returns slightly malformed output and the foreman parses it anyway. The malformation grows. Two weeks later, downstream consumers crash. Defenses: schema validation on every handoff. The foreman rejects malformed output, retries with a corrective prompt, or fails the run. Never silently coerce. Log every validation failure as a first-class event.
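A sketch of the reject-then-retry handoff, under the assumption that a specialist can be called again with corrective feedback. The schema format (field name to Python type) and all names here are illustrative; a production system would more likely use a schema library for the validation step.

```python
class HandoffRejected(Exception):
    pass

def schema_problems(output, schema):
    """Compare output against schema (field name -> type); return a list of problems."""
    if not isinstance(output, dict):
        return ["output is not an object"]
    problems = [f"missing field: {k}" for k in schema if k not in output]
    problems += [
        f"wrong type for {k}" for k, t in schema.items()
        if k in output and not isinstance(output[k], t)
    ]
    problems += [f"unexpected field: {k}" for k in output if k not in schema]
    return problems

def accept_handoff(call_specialist, schema, validation_log, max_retries=1):
    """Accept only schema-valid output; log every rejection; never coerce."""
    feedback = None
    for attempt in range(max_retries + 1):
        output = call_specialist(feedback)
        problems = schema_problems(output, schema)
        if not problems:
            return output
        validation_log.append({"attempt": attempt + 1, "problems": problems})
        feedback = "Your previous output was rejected: " + "; ".join(problems)
    raise HandoffRejected(f"still malformed after {max_retries + 1} attempts")

# Stand-in specialist: malformed on the first call, corrected once given feedback.
def reply_handler(feedback):
    if feedback is None:
        return {"intent": "interested", "confidence": "high"}
    return {"intent": "interested", "confidence": 0.9}

REPLY_SCHEMA = {"intent": str, "confidence": float}
events = []
print(accept_handoff(reply_handler, REPLY_SCHEMA, events))
# {'intent': 'interested', 'confidence': 0.9}
```

The first rejection is recorded in `events` even though the run ultimately succeeds, which is what makes slow drift visible before it becomes a crash.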

Silent regressions on model updates. A model provider updates a snapshot, behavior shifts subtly, quality drops in a way no single test catches. Defenses: pin model versions explicitly. Maintain a fixture-based regression suite: a frozen set of representative inputs with known-good outputs, run before any model change rolls to production. Eval on the fixtures, not on a live customer run.
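A fixture suite can start very small. In this sketch the "agents" are deterministic stand-in functions so the example runs without a model, and outputs are compared by exact match, which is an assumption that only suits label-style outputs; free-text outputs would need a scoring function instead.

```python
def failing_fixtures(agent, fixtures):
    """Return IDs of fixtures whose output differs from the known-good output."""
    return [f["id"] for f in fixtures if agent(f["input"]) != f["expected"]]

# A frozen set of representative inputs with known-good outputs (invented examples).
FIXTURES = [
    {"id": "reply-001", "input": "Please remove me from your list", "expected": "unsubscribe"},
    {"id": "reply-002", "input": "I am out of office until Monday", "expected": "out-of-office"},
    {"id": "reply-003", "input": "Sounds interesting, can we talk Thursday?", "expected": "interested"},
]

def pinned_classifier(text):
    """Stand-in for the currently pinned model version."""
    lowered = text.lower()
    if "remove me" in lowered:
        return "unsubscribe"
    if "out of office" in lowered:
        return "out-of-office"
    return "interested"

def candidate_classifier(text):
    """Stand-in for a new snapshot with a subtle behavior shift."""
    if "?" in text:
        return "question"
    return pinned_classifier(text)

print(failing_fixtures(pinned_classifier, FIXTURES))     # []
print(failing_fixtures(candidate_classifier, FIXTURES))  # ['reply-003']
```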

Runaway cost. A specialist gets stuck on a hard problem and burns through tokens. The bill arrives at the end of the month. Defenses: per-run cost cap, per-job daily cost cap, per-tenant monthly cost cap. The foreman tracks cost in real time and aborts when limits are hit. Cost reports surface in the workflow layer alongside quality reports.
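The three caps can be checked in one place before any spend is recorded. `CostGuard` and `BudgetExceeded` are hypothetical names, amounts are in arbitrary currency units, and persistence of the daily and monthly totals is left out of this sketch.

```python
from dataclasses import dataclass

class BudgetExceeded(Exception):
    pass

@dataclass
class CostGuard:
    run_cap: float
    job_daily_cap: float
    tenant_monthly_cap: float
    run_spent: float = 0.0
    job_spent_today: float = 0.0
    tenant_spent_month: float = 0.0

    def charge(self, amount: float) -> None:
        """Record a cost, or raise before recording if any cap would be exceeded."""
        limits = (
            ("run", self.run_spent, self.run_cap),
            ("job-daily", self.job_spent_today, self.job_daily_cap),
            ("tenant-monthly", self.tenant_spent_month, self.tenant_monthly_cap),
        )
        for scope, spent, cap in limits:
            if spent + amount > cap:
                raise BudgetExceeded(f"{scope} cap of {cap} would be exceeded")
        self.run_spent += amount
        self.job_spent_today += amount
        self.tenant_spent_month += amount

guard = CostGuard(run_cap=1.0, job_daily_cap=5.0, tenant_monthly_cap=100.0)
guard.charge(0.5)
guard.charge(0.5)          # exactly at the run cap: still allowed
try:
    guard.charge(0.25)     # would push the run past its cap
except BudgetExceeded as err:
    print(err)             # run cap of 1.0 would be exceeded
```

The foreman would call `charge` before dispatching each specialist and treat `BudgetExceeded` as an abort signal.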

Data-category violations. An agent that was supposed to operate on public data ends up reading personal data because someone joined two tables. Violations like this are non-theoretical; under GDPR and the AI Act they can trigger fines. Defenses: declare data categories on every job in metadata, enforce them at the data-foundation layer (row-level security, view-based access), audit them in the audit-trail layer.

Hallucinated tool calls. A model invents a tool that does not exist, or a parameter shape that does not match. Defenses: MCP-style tool servers that reject unknown tools and validate parameters at the protocol layer. Never let the agent's tool-calling output go straight to a code-eval surface.
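The same defense can be illustrated without a protocol stack: a dispatcher that only executes tools from a fixed registry and checks argument names and types first. This is a simplified stand-in for what a tool server does at the protocol layer; the registry format and tool are invented for the example.

```python
class ToolCallRejected(Exception):
    pass

def search_companies(industry, limit):
    """Stand-in tool; a real one would query a data source."""
    return [f"{industry}-co-{i}" for i in range(limit)]

# Hypothetical registry: tool name -> (handler, {parameter name: expected type}).
REGISTRY = {
    "search_companies": (search_companies, {"industry": str, "limit": int}),
}

def dispatch(name, arguments):
    """Execute a model-proposed tool call only if the tool and its arguments are known."""
    if name not in REGISTRY:
        raise ToolCallRejected(f"unknown tool: {name}")
    handler, params = REGISTRY[name]
    if not isinstance(arguments, dict) or set(arguments) != set(params):
        raise ToolCallRejected(f"{name} expects exactly {sorted(params)}")
    for key, expected in params.items():
        if not isinstance(arguments[key], expected):
            raise ToolCallRejected(f"{name}.{key} must be {expected.__name__}")
    return handler(**arguments)

print(dispatch("search_companies", {"industry": "logistics", "limit": 2}))
# ['logistics-co-0', 'logistics-co-1']
```

Model output reaches `handler` only after passing all three checks, so an invented tool name or parameter shape becomes a rejected call rather than executed code.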

These eight produce the bulk of incident reports. Defenses against all of them belong in your week-one architecture, not a backlog ticket.

Operator's perspective: shipping Knowlee 4Sales

We built and ship Knowlee 4Sales — an agentic operating system for sales — and the architecture above is not theoretical. It is the system that runs on our infrastructure today.

Three opinionated decisions shaped what we shipped.

Pipeline-based, not conversational. Most agent products in 2026 still default to chat. Sales work is not a conversation; it is a pipeline of named stages with deliverables, deadlines, and handoffs. We built 4Sales as a pipeline-first system: each stage is a job in our automation registry, each job is a foreman run, each run produces structured output that becomes the input to the next stage. The operator sees the pipeline state on the agent fleet dashboard; the agents do the work between stages. This is dramatically more legible than chat for the operator and dramatically more debuggable for us.

Fleet dashboard as the control plane. There is one board. Every running job, every human-in-the-loop proposal, every strategic task lives on it. An approved proposal becomes an active card on the spot; a finished job moves its card to review; an operator can park, amend, or dismiss work from one surface. This is the pattern from Knowlee OS, our orchestration layer for agentic work, and we ported it directly to 4Sales because the alternative (separate queues for each kind of work) is what every other product gets wrong. For the broader frame, see our agentic operating system for business explainer.

Real artifacts, not magic. Every agent run produces a structured report (or a draft, or an updated record). Operators can read the reports; auditors can verify the trail; we can replay the run against fixtures. There is no hidden state, no opaque "the AI handled it." The five-layer architecture is what makes that possible — and what makes it possible to commercialize the same system into adjacent verticals (recruiting, client services, marketing) without rebuilding the foundations.

If you are about to build your first multi-agent system, the shortest path is: pick the foreman pattern, write the role cards before the prompts, route every tool through MCP, store entities in a graph, version everything in the decision engine, and instrument the audit trail before the second specialist ships. Do that and the rest is iteration. Skip any of those and you will spend month four rebuilding what should have been week one.

We learned each of these the hard way so the next operator does not have to.