How to Manage Multiple AI Agents: An Operator's Manual
Building a multi-agent system is the first half of the work. Running it — for months, across hundreds of runs a day, in production, without surprising your customers or your finance team — is the second half, and the part most write-ups skip. By the time you ship your second specialist, the question is no longer "does the architecture work" but "what does the operator actually do every Tuesday morning to keep the fleet healthy?"
This is the operator-level manual we wrote for ourselves. It covers the six things you have to manage every week to keep an agent fleet productive: a kanban as the single control plane, observability that makes incidents debuggable, escalation patterns that route the right calls to humans, cost tracking that catches runaway spend before the bill arrives, the failure modes you will see by month two, and the team rituals that keep institutional knowledge growing.
If you have not yet built the underlying architecture, start with our how to build a multi-agent AI system guide, the foreman / manager pattern explainer, and the multi-agent role cards guide. This piece assumes those foundations are in place.
Section 1 — Kanban as the control plane
The single most important operational decision in running an agent fleet is having one kanban that shows every piece of work, regardless of who or what initiated it. Not three boards (one for scheduled jobs, one for ad-hoc tasks, one for proposals from agents). One board.
The reason is selection bias in the operator's attention. When an operator has three boards, they look at the most familiar one and miss the work that lives on the others. The proposals queue gets stale because no one checks it. The scheduled-jobs dashboard gets ignored because all the items there are "supposed to just run." The strategic-tasks list lives in someone's head. Two weeks in, the operator's mental model of the fleet's state is a partial picture, and the missing parts are exactly where things go wrong.
One board fixes the selection bias. Every active agent run is a card. Every proposal an agent makes ("I noticed something worth your attention") is a card in a "draft" column. Every strategic task the operator wants to dispatch is a card. Every scheduled job, when it runs, becomes a card in the "running" column and transitions to "review" on completion. The operator looks at one place and sees everything.
The columns we use, and what each one means:
- Backlog. Strategic tasks the operator has scoped but not yet dispatched. Proposals an agent has drafted but the operator has not yet approved. Anything that exists but is not running.
- Running. Active agent runs, plus any human-in-progress work for the same fleet. Each card shows the foreman's current dispatch (which specialist is working) and elapsed time.
- Review. Completed runs awaiting operator review. Drafts that need approval, results that need a quality check, exceptions that need a decision. This is where the operator spends most of their attention.
- Done. Completed and reviewed work, archived but searchable. The audit trail flows from here to long-term storage.
- Blocked. Runs paused on something — a missing input, an external dependency, a human-needed escalation that has not yet been resolved. Blocked work is the most important column to watch; an aging blocked card is a process failure.
Two principles that make the kanban work in practice. First, the kanban is the single place humans touch. Operators do not edit agent prompts directly during a run; they comment on a card, the foreman reads the comment, the foreman adjusts. Operators do not approve work outside the kanban; they approve cards. This concentration of human attention into one surface is what makes the system manageable.
Second, agents create cards, but operators move them. When an agent emits a proposal, it lands in Backlog as a draft card with a "from agent" badge. The operator decides whether to approve (the draft becomes a Running card), park (it stays in Backlog with a tag), amend (approve with modifications), or skip (dismiss). The agent does not push work into Running unilaterally; the operator does. The asymmetry — agents propose, operators dispose — is what keeps the human in the loop without making them a bottleneck.
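A minimal sketch of that discipline in code, where the column names mirror the board above and the actor labels and enforcement point are our own illustration, not a fixed schema:

# Allowed column transitions; agents create cards but never move them.
ALLOWED_MOVES = {
    ("backlog", "running"),
    ("running", "review"),
    ("running", "blocked"),
    ("blocked", "running"),
    ("review", "done"),
}

def move_card(card: dict, to: str, actor: str) -> None:
    # Guard every move at the board's API boundary.
    if actor == "agent":
        # Agents propose (new cards land in Backlog); operators dispose (moves).
        raise PermissionError("agents propose; operators dispose")
    if (card["column"], to) not in ALLOWED_MOVES:
        raise ValueError(f"illegal transition {card['column']} -> {to}")
    card["column"] = to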
Two-way navigation is non-negotiable. Every Running card carries a link back to whatever produced it (the schedule that fired, the proposal that was approved, the operator who dispatched it). Every Done card is searchable by the run that produced it. When something goes wrong three weeks later, the path from "this customer-visible issue" to "this specific run that caused it" should be a single search, not a forensic investigation.
For the broader architectural picture the kanban sits inside, see our agentic operating system for business explainer.
Section 2 — Observability: logs, metrics, traces
You cannot operate what you cannot see. Agent fleets produce a different observability profile than traditional services, and the instrumentation you carry over from your microservice stack does not cover the right things. Three layers, each non-negotiable.
Logs. Every agent run produces a structured log with a known schema: run ID, foreman version, role-card versions of every specialist invoked, model bindings, tool calls (with arguments and results), schema-validation events, escalations, final outcome. Plain-text logs are not enough. The log has to be queryable on every field — "find all runs where the personalization specialist failed schema validation in the last seven days" must be a query, not a grep.
The log shape we use:
{
  "run_id": "run_01HXYZABC",
  "started_at": "2026-04-30T08:42:11Z",
  "foreman": { "version": "2.3.1", "model": "claude-sonnet-4.7" },
  "goal": { "kind": "outbound_campaign", "campaign_id": "camp_42" },
  "events": [
    {
      "kind": "dispatch",
      "specialist": "lead_discovery",
      "role_card_version": "1.4.0",
      "input_hash": "sha256:..."
    },
    {
      "kind": "result",
      "specialist": "lead_discovery",
      "validated": true,
      "duration_ms": 14210,
      "tokens": { "in": 1842, "out": 4612 },
      "tool_calls": 3
    },
    { "kind": "decision", "next": "qualification", "rationale_hash": "sha256:..." },
    "..."
  ],
  "outcome": "completed",
  "total_tokens": { "in": 18420, "out": 12830 },
  "total_duration_ms": 92410,
  "total_cost_usd": 0.42
}
Metrics. Time-series rollups over the logs that surface the fleet's health. The metrics we watch every day:
- Runs per hour, per role, per outcome. A drop in completed runs is the first signal of a regression.
- Latency p50/p95/p99 per role. Latency spikes track to specific specialists; a p95 spike on one role usually indicates a model-side change or a tool-side degradation.
- Schema-validation failure rate per role. Any non-zero rate is a problem; a rising rate is a regression.
- Escalation rate per role. Rising escalations mean the role is encountering work it cannot handle; either the role card is too narrow or the upstream dispatch is wrong.
- Cost per role per run. Trending cost is how runaway spend hides; track this daily.
- Tool-call error rate per tool. Tool errors propagate into agent failures; instrumenting the tool layer catches them upstream.
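As a concrete sketch of how one of these rollups falls out of the structured logs — here the schema-validation failure rate per role, assuming runs have been parsed into dicts with the event shape shown earlier:

from collections import Counter

def validation_failure_rate(runs: list[dict]) -> dict[str, float]:
    # Count result events per specialist and the share that failed validation.
    totals, failures = Counter(), Counter()
    for run in runs:
        for event in run["events"]:
            if isinstance(event, dict) and event.get("kind") == "result":
                role = event["specialist"]
                totals[role] += 1
                if not event.get("validated", False):
                    failures[role] += 1
    return {role: failures[role] / totals[role] for role in totals}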
Traces. Per-run request flows that show the foreman's dispatch graph and each specialist's tool calls in time order. Traces are how you debug a specific incident — "this campaign produced bad output, what happened on this run." A good trace makes the foreman's decision sequence inspectable in one view: dispatch, result, decision, dispatch, result, decision, completion.
We use OpenTelemetry as the trace backbone because every observability stack speaks it. Foreman dispatches become spans with the specialist as the operation name; tool calls become nested spans inside specialist spans; schema validations become events on the parent span. The result is a flame graph that makes it obvious where time and tokens went.
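A minimal sketch of that span shape with the OpenTelemetry Python SDK; the span names and attribute keys are our own convention, not an OTel standard:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("fleet.foreman")

# One run = one root span; each dispatch is a child span named after the
# specialist; tool calls nest inside it; validations become span events.
with tracer.start_as_current_span("run") as run_span:
    run_span.set_attribute("run.id", "run_01HXYZABC")
    with tracer.start_as_current_span("lead_discovery") as dispatch:
        dispatch.set_attribute("role_card.version", "1.4.0")
        with tracer.start_as_current_span("tool:crm_search"):
            pass  # the actual tool call executes here
        dispatch.add_event("schema_validation", {"validated": True})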
Two non-obvious observability patterns that pay back the investment:
Per-decision provenance. When the foreman makes a decision (which specialist to dispatch next, whether to retry, whether to escalate), log the decision and a hash of the prompt that produced it. Three weeks later, when a regression appears, you can trace the decision back to the exact prompt version. Without this, you can see what the foreman did but not why, and "why" is most of the debugging value.
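A sketch of the provenance record; the only hard requirement is that the hash covers the exact prompt bytes the foreman saw, and the field names here are illustrative:

import hashlib
import time

def log_decision(prompt: str, next_specialist: str, events: list) -> None:
    # Record the decision plus a hash of the exact prompt that produced it,
    # so a regression weeks later traces back to a specific prompt version.
    events.append({
        "kind": "decision",
        "next": next_specialist,
        "rationale_hash": "sha256:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "at": time.time(),
    })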
Frozen replay fixtures. A small set of representative runs (maybe twenty across the system) is captured as fixtures: input, expected outcome shape, expected role-card versions, expected tool-call sequence. Before any prompt or model change rolls to production, the replay suite runs against the fixtures. Drift in the replay is the early warning that a change has unintended consequences. A live regression suite tied to fixtures has caught silent quality regressions for us multiple times where standard metrics did not.
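A sketch of what a fixture check can look like; the field names mirror the log schema above, and the comparison is deliberately shape-level rather than byte-level:

def check_fixture(fixture: dict, run: dict) -> list[str]:
    # Compare a fresh run against a frozen fixture; return human-readable drift.
    drift = []
    if run["outcome"] != fixture["expected_outcome"]:
        drift.append(f"outcome: {run['outcome']} != {fixture['expected_outcome']}")
    dispatched = [e["specialist"] for e in run["events"]
                  if isinstance(e, dict) and e.get("kind") == "dispatch"]
    if dispatched != fixture["expected_dispatch_sequence"]:
        drift.append(f"dispatch sequence changed: {dispatched}")
    return drift  # any non-empty result blocks the rollout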
Section 3 — Human escalation patterns
The hardest operational discipline is deciding what humans handle versus what the fleet handles autonomously. Get it wrong toward "more humans" and the operator becomes a bottleneck. Get it wrong toward "more autonomous" and you ship a quality regression to a customer that your audit trail traces but does not prevent.
Three escalation patterns, each fitting a different situation.
Per-decision human-in-the-loop. Specific decisions always require a human. Sending an outbound email to an external recipient. Quoting a price in a negotiation context. Applying a state-changing mitigation during an incident. Updating a customer record with personal data. The agent never executes these autonomously; it produces a draft or proposal, the operator approves on the kanban, the action then proceeds.
Per-decision HITL is the right pattern when (a) the action is irreversible, (b) the cost of being wrong is bounded by the operator's review velocity, and (c) the operator's decision is a fast yes/no on the agent's draft, not a from-scratch evaluation. Outbound message approval, contract redline acceptance, and incident-mitigation approval all fit this shape.
Confidence-based escalation. The agent emits a confidence score on every output. Above a threshold, the foreman accepts and proceeds. Below the threshold, the foreman escalates to the operator with the agent's draft, the confidence score, and the rationale. The operator can override (proceed despite low confidence), correct (revise the draft and proceed), or abandon (mark the run as failed).
Confidence-based escalation works when the agent has a meaningful sense of when it is uncertain. Calibrate the threshold against actual operator review patterns: if the operator is overriding too many low-confidence escalations unchanged, lower the threshold so fewer things escalate; if too many above-threshold outputs are turning out wrong, raise it so more do. We re-tune thresholds quarterly per role, not as a one-time setting.
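A sketch of the quarterly re-tuning, assuming each reviewed output carries its confidence score and an operator verdict ("fine" for overridden escalations, "wrong" for autonomous outputs that failed review); the rates and step size are illustrative:

def retune_threshold(t: float, reviews: list[dict], step: float = 0.05) -> float:
    # reviews: [{"confidence": 0.74, "verdict": "fine" | "wrong" | "ok"}, ...]
    # Below t the foreman escalated; at or above t it proceeded autonomously.
    escalated = [r for r in reviews if r["confidence"] < t]
    accepted = [r for r in reviews if r["confidence"] >= t]
    needless = sum(r["verdict"] == "fine" for r in escalated) / max(len(escalated), 1)
    missed = sum(r["verdict"] == "wrong" for r in accepted) / max(len(accepted), 1)
    if needless > 0.5:  # most escalations were fine: escalate less
        return t - step
    if missed > 0.05:  # too many autonomous outputs were wrong: escalate more
        return t + step
    return t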
Policy-triggered escalation. Specific patterns in the input or output trigger mandatory human review, regardless of confidence. Inbound message contains a price question. Output contains a banned phrase. Action involves a data category flagged as sensitive. These are the rules you encode in the role card's escalation field; the agent surfaces them, the foreman routes them, the operator handles them.
Policy-triggered is the right pattern for risk-driven escalations — anything regulatory, anything reputational, anything where "the agent can be reasonably confident but should not be trusted" applies. Most AI Act-relevant decisions live here.
A composite that works in practice: per-decision HITL on terminal actions (sending, applying, persisting), confidence-based on judgment-heavy outputs (drafts, classifications), policy-triggered on sensitive categories. Layer them; do not pick one. The operator's experience is "the kanban shows me drafts when I need to approve, ambiguous results when the agent is unsure, and flagged items when policy demands it."
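A sketch of the layered check, in the order the layers bind — policy first, terminal actions second, confidence last; the category and action names are illustrative:

SENSITIVE_CATEGORIES = {"personal_data", "pricing", "legal"}
TERMINAL_ACTIONS = {"send_email", "apply_mitigation", "persist_record"}

def route(action: str, output: dict, confidence: float, threshold: float) -> str:
    # Policy-triggered: mandatory review, regardless of confidence.
    if output.get("data_category") in SENSITIVE_CATEGORIES or output.get("policy_flags"):
        return "escalate:policy"
    # Per-decision HITL: a human approves before any terminal action fires.
    if action in TERMINAL_ACTIONS:
        return "escalate:approval"
    # Confidence-based: the agent is unsure, so the operator decides.
    if confidence < threshold:
        return "escalate:confidence"
    return "proceed"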
A failure mode worth naming: escalation overload. When too many things escalate, the operator either rubber-stamps (defeating the purpose) or falls behind (delaying the work). The defense is to track escalation rate per role and tune the thresholds; if a role's escalation rate exceeds the operator's review capacity, either the role card is wrong or the workload is wrong, and the architectural fix is more important than the operational one.
Section 4 — Cost tracking
Multi-agent systems can produce surprising bills. Higher per-run token usage, parallelization across specialists, and runaway runs that hit budget caps too late all add up. The operator's job is to know the cost shape of the fleet at a daily granularity and catch outliers before the month ends.
The cost-tracking layers we maintain:
Per-run cost cap. Every run has a hard cap. The foreman tracks tokens and tool-call costs in real time and aborts if the run exceeds the cap. The cap is set at roughly 3x the median cost of a successful run for that goal type — high enough that legitimate slow runs do not trip it, low enough that runaway runs are stopped before they consume meaningful budget.
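A minimal sketch of the cap, charged after every specialist result and tool call; the 3x multiplier matches the rule above and everything else is illustrative:

class BudgetExceeded(Exception):
    pass

class RunBudget:
    # Hard per-run cap at ~3x the median cost of a successful run
    # for this goal type; the foreman charges it in real time.
    def __init__(self, median_cost_usd: float, multiplier: float = 3.0):
        self.cap = median_cost_usd * multiplier
        self.spent = 0.0

    def charge(self, cost_usd: float) -> None:
        self.spent += cost_usd
        if self.spent > self.cap:
            raise BudgetExceeded(
                f"run cost ${self.spent:.2f} exceeded cap ${self.cap:.2f}"
            )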
Per-job daily cost cap. Each scheduled job has a daily budget. If the day's runs collectively exceed it, the scheduler pauses the job and surfaces a card on the operator's kanban. This catches the case where individual runs are within their per-run cap but the volume has spiked unexpectedly.
Per-tenant monthly cost cap. For multi-tenant systems, each tenant has a monthly budget. The system surfaces a warning at 80% of the cap, pauses non-critical jobs at 95%, and pauses everything at 100%. Tenants are notified at each threshold.
Per-role cost reporting. Daily and weekly rollups of cost by role. This is how we catch the case where one role's prompt has gotten longer over time and is now using twice the tokens per call. Rising cost on one role with stable run volume is the signature of prompt bloat, and the fix is in the role card, not the budget.
The cost surfaces we look at:
- A daily fleet-cost dashboard showing total spend, by role, by tenant, with the trend over the last 30 days.
- A per-role weekly diff: cost change versus prior week, normalized for run count. A double-digit increase on a stable role gets investigated.
- An anomaly alert when a single run exceeds 5x the median; even if the cap caught it, the cap-hit itself is worth investigating.
A pattern we have seen in customer systems and in our own: a slow drift up in cost over weeks, caused by gradual prompt expansion, that no individual change accounts for but that aggregates into a 2x bill increase over a quarter. The defense is the weekly diff. Without it, the drift compounds invisibly until the month-end bill makes it obvious.
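The weekly diff itself is a few lines over the per-role rollups; a sketch, assuming cost and run counts aggregated per role per week:

def weekly_cost_diff(this_week: dict, last_week: dict) -> dict[str, float]:
    # Inputs: {"role": {"cost_usd": ..., "runs": ...}} for each week.
    # Output: percent change in cost per run, normalized for run count,
    # filtered to the double-digit moves that get investigated.
    diffs = {}
    for role, cur in this_week.items():
        prev = last_week.get(role)
        if not prev or not prev["runs"] or not cur["runs"]:
            continue
        now, then = cur["cost_usd"] / cur["runs"], prev["cost_usd"] / prev["runs"]
        diffs[role] = 100.0 * (now - then) / then
    return {r: d for r, d in diffs.items() if abs(d) >= 10.0}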
For the architectural choices that affect cost shape, see our single-agent vs multi-agent decision framework — multi-agent has higher per-run cost than single-agent on the same task, and that cost should be in your budget from day one, not a surprise at month two.
Section 5 — Failure modes
Eight months of running multi-agent systems in production produces a list of failure modes you will see by month two. The ones below are the most common; we have hit each of them ourselves and seen each of them in customer systems.
Loops. Two agents that ping-pong, one agent that retries a failing tool until the budget is exhausted. The foreman pattern prevents inter-agent loops by construction (specialists cannot call each other), but intra-agent loops still happen — a specialist that retries its own tool call without a backoff. Defenses: max-step counter on every foreman run, max-retry counter inside every specialist's tool-calling loop, per-run cost cap as a backstop. When a loop trips a defense, the run aborts cleanly with a clear log entry; the kanban shows the abort; the operator knows immediately.
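A sketch of the intra-agent defense — bounded retries with backoff inside the specialist's tool loop; the counts and delays are illustrative, and the per-run cost cap stays in place as the backstop:

import time

def call_tool_with_retry(call, max_retries: int = 3, base_delay: float = 1.0):
    # Bounded retry with exponential backoff; on exhaustion the exception
    # propagates so the foreman logs the failure and aborts the run cleanly.
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)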
Drift. Quality of output gradually degrades over weeks without any obvious change. Drift is the hardest failure mode because there is no incident; there is just a slow decline that no one notices until a customer complains. Causes: model snapshot updates from the provider, prompt bloat, role-boundary erosion, training-data shift. Defenses: frozen replay fixtures (the regression suite catches drift before customers do), per-role quality metrics with weekly review, version pinning where the provider supports it.
Context bleed. A specialist sees information meant for another agent because the foreman handed off too much state. The specialist then makes decisions based on context it should not have. Defenses: typed input schemas with closed objects (additionalProperties: false), foreman validates work items against the schema before dispatch, schema only contains what the role needs. Never pass the whole conversation. Never pass shared scratch space.
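A sketch of the closed-object check with the jsonschema library; the field names are illustrative:

from jsonschema import validate  # raises jsonschema.ValidationError on failure

# Closed object: the specialist sees exactly these fields and nothing else.
LEAD_DISCOVERY_INPUT = {
    "type": "object",
    "properties": {
        "segment": {"type": "string"},
        "max_leads": {"type": "integer"},
    },
    "required": ["segment", "max_leads"],
    "additionalProperties": False,  # stray state is rejected, not passed through
}

def validate_work_item(item: dict) -> None:
    # Foreman-side check before dispatch; any extra field fails loudly here.
    validate(instance=item, schema=LEAD_DISCOVERY_INPUT)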
Tool-authorization drift. An agent gets access to a tool it should not have because someone added it to the allow-list "temporarily" and never removed it. Six months later, a low-risk research agent has write access to a production database. Defenses: allow-lists per role committed in source control, no runtime tool grants, periodic audits where the audit job lists every agent's allowed tools and flags additions since the last audit.
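The audit job itself can be a set diff over the committed allow-lists; a sketch, with role and tool names as placeholders:

def audit_allow_lists(
    current: dict[str, set[str]], last_audit: dict[str, set[str]]
) -> dict[str, set[str]]:
    # Flag every tool added to any role's allow-list since the last audit;
    # each flagged entry needs either an explanation or a removal.
    flagged = {}
    for role, tools in current.items():
        added = tools - last_audit.get(role, set())
        if added:
            flagged[role] = added
    return flagged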
Output-validation gaps. An agent returns slightly malformed output and the foreman parses it anyway. The malformation grows. Two weeks later, downstream consumers crash. Defenses: schema validation on every handoff, foreman rejects malformed output rather than coercing, every validation failure logged as a first-class event for review.
Silent regressions on model updates. A model provider updates a snapshot, behavior shifts subtly, quality drops in a way no single test catches. Defenses: pin model versions explicitly where supported, maintain the frozen replay-fixture suite, gate any model change on a fixture-pass.
Runaway cost. A specialist gets stuck on a hard problem and burns through tokens. Defenses are all in section 4; the operational one is to actually look at the daily dashboard, not just configure the alerts.
Data-category violations. An agent that was supposed to operate on public data ends up reading personal data because someone joined two tables. AI Act category violations are non-theoretical. Defenses: declare data categories on every job in metadata, enforce them at the data-foundation layer (row-level security, view-based access), audit them in the audit-trail layer.
Hallucinated tool calls. A model invents a tool that does not exist, or a parameter shape that does not match. Defenses: tool servers (typically MCP-style) that reject unknown tools and validate parameters at the protocol layer. Never let the agent's tool-calling output go straight to a code-eval surface.
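A sketch of the guard at the dispatch layer, again with the jsonschema library; the tool names and schemas are illustrative:

from jsonschema import validate

class ToolRegistry:
    # Protocol-layer guard: unknown tool names and malformed parameters are
    # rejected before anything executes, so a hallucinated call dies here.
    def __init__(self):
        self._tools = {}  # name -> (param_schema, handler)

    def register(self, name: str, param_schema: dict, handler) -> None:
        self._tools[name] = (param_schema, handler)

    def dispatch(self, name: str, params: dict):
        if name not in self._tools:
            raise LookupError(f"unknown tool: {name}")
        schema, handler = self._tools[name]
        validate(instance=params, schema=schema)  # raises on bad parameters
        return handler(**params)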
Operator burnout from over-escalation. A real failure mode at the human level. When the fleet escalates too much, the operator's review queue grows faster than they can clear it; quality of review drops; rubber-stamping creeps in. Defenses are operational: track operator review velocity as a metric, tune escalation thresholds against capacity, escalate the right things rather than everything that is uncertain.
For each of these, the defense lives in the architecture before the failure occurs, not in incident response after. The operational discipline is keeping the defenses in place as the system evolves; one of the rituals in the next section is exactly that audit.
Section 6 — Team rituals
A multi-agent fleet is a system that needs sustained operational care. The teams we have seen succeed have a small set of rituals that turn one-time discipline into compounding institutional knowledge. Three rituals matter most.
Daily review. Fifteen to thirty minutes, every working day. The operator scans the kanban: what is in Review (does it need their attention now), what is Blocked (is anything aging too long), what completed yesterday (anything surprising). They look at the daily cost dashboard, the daily run-volume metric, the daily escalation-rate metric. They flag anything that does not match expectation as a card for deeper investigation later.
The point of the daily review is not to fix things; it is to maintain situational awareness. Most days, nothing is wrong. The discipline is showing up anyway, because the day something is wrong, the time-to-detect should be hours, not weeks. Our daily review takes about twenty minutes and has caught regressions weeks before any automated alert would have fired.
Weekly tuning. One hour per week. The operator and the engineer who owns the fleet sit together and look at the week's metrics in detail: cost per role, escalation rate per role, schema-validation failure rate, frozen-fixture drift, replay-suite results. They pick one or two things to adjust — a confidence threshold, an escalation rule, a prompt revision, a role-card scope tightening — and queue them as work for the coming week.
The weekly tuning ritual is what keeps the fleet in calibration as the workload evolves. New patterns of work emerge; old patterns recede; the agents have to be re-aimed periodically. Without the ritual, the fleet drifts out of calibration over weeks and the operator notices only when the drift is severe enough to cause a customer-visible issue.
Monthly audit. Two to four hours, once a month. A more thorough review of the system itself, not just its current state:
- Tool-allow-list audit. List every agent's allowed tools, diff against last month, investigate every addition.
- Role-card review. Read every role card in the fleet. Does the scope still match what the role actually does? Are the inputs still the right schema? Has the escalation list grown stale? Tag any cards that need revision.
- Cost trend review. Plot monthly cost per role against monthly run volume. Investigate any role whose cost is growing faster than its volume.
- Quality-fixture review. Run the full frozen-fixture suite. Investigate any drift. Add new fixtures for any new categories of work that have emerged.
- Incident retrospective. For every incident in the past month (defined as: a run that escalated unexpectedly, a regression that customers noticed, a cost spike that tripped a cap), write a one-paragraph retrospective: what happened, what we changed, what would catch it earlier next time.
The monthly audit is institutional memory. Without it, lessons learned in week 3 are forgotten by week 12. With it, the fleet's operating manual gets thicker over time and the team's mental model stays current with the system's actual behavior.
A fourth ritual we have started practicing more recently: quarterly architecture review. Once a quarter, the team revisits the architectural decisions: are we still in the foreman pattern (or have specialists started peer-calling somehow); have new roles emerged that should be split into separate specialists; have old roles converged that should be merged; are the protocols (MCP, A2A, structured outputs) still placed at the right layers. The quarterly review is slower-cadence but bigger-impact than the weekly tuning; it is where the system's shape evolves intentionally rather than by accumulation.
What "well-managed" looks like in steady state
A multi-agent fleet that is being managed well has these properties at any given moment:
- An operator can answer "what is the fleet doing right now" in under a minute by looking at the kanban.
- An engineer can answer "why did this specific output happen" in under fifteen minutes by looking at the run's logs, traces, and role-card versions.
- The daily cost is predictable to within ±15% week over week; surprises are investigated, not normalized.
- The escalation queue clears within the operator's review-day; there is no chronic backlog.
- New runs of the same goal type produce qualitatively similar outputs to runs from a month ago; drift is detected by the replay-fixture suite, not by customer complaints.
- A new engineer can read the role cards, the foreman prompt, and the kanban, and trace a full run end to end within a day of joining.
Hitting all six of these is not automatic. It is the cumulative result of the disciplines above: the kanban as a single surface, observability instrumented from week one, escalation patterns tuned to the operator's actual capacity, cost tracked at the granularity that catches drift, defenses in place against the failure modes that you will hit, and rituals that turn one-time decisions into ongoing practice.
A multi-agent fleet is more like running a small operations team than running a piece of software. The agents do the work; the operator runs the operation. The operating manual is what makes the operation sustainable.
For the architectural foundations the operating manual sits on top of, see our how to build a multi-agent AI system guide and the foreman / manager pattern explainer. For the role-design discipline that makes specialists manageable, see our multi-agent role cards guide. For the protocols that make tool and agent boundaries uniform, see our communication protocols piece. For the architecture-vs-architecture decision that determines whether multi-agent is even right for the workload, see our single-agent vs multi-agent decision framework. For project-level patterns these systems can implement, see our 10 multi-agent project ideas piece. For the broader vocabulary, see our multi-agent orchestration glossary entry and our agentic AI glossary entry. For the wider workforce frame these fleets sit inside, see our agentic workforce 2026 piece and our AI workforce architecture explainer.
The operators who run agent fleets well in 2026 are not the ones with the most sophisticated architectures or the most aggressive automation. They are the ones with one kanban, instrumented observability, calibrated escalations, daily cost discipline, and a Tuesday-morning ritual that has not slipped in six months. Boring operations beat clever automation every quarter we have measured it.