The Foreman / Manager Agent Pattern: Why Orchestrator-Worker Beats Peer-to-Peer (2026)

If you have spent any time building multi-agent systems in 2026, you have already noticed the gap between architecture diagrams and production behavior. The diagrams show three boxes connected by clean arrows. The production system shows two agents that have been talking to each other for nineteen minutes and a token bill that no one wants to forward to finance.

The pattern that closes that gap is older than agent frameworks: put one agent in charge, have everyone else report to it, and never let workers talk to each other directly. We call it the foreman pattern. Some teams call it the manager pattern. Anthropic's engineering write-ups call it orchestrator-worker. The labels do not matter. The discipline does.

This piece makes the operator-level case for the foreman pattern in business workflows: why it wins, what the alternatives actually look like in production, the anti-patterns that destroy a foreman setup from the inside, and a worked example from the sales pipeline we run on it every day.


What the foreman pattern actually is

A foreman-pattern multi-agent system has exactly one agent at the top of every run. That agent — the foreman — receives the goal, decomposes it into work items, dispatches each work item to a specialist, validates the specialist's output, decides what runs next, and either dispatches again or returns the final result to the operator. The specialists do narrow, well-scoped work and then exit. They never call each other. They never know what the next step is. They do not retain state between dispatches.

In prose, the topology looks like this: the operator hands a goal to the foreman. The foreman reads the goal, picks the first specialist, hands it a typed work item, waits for a typed result, validates the result, and either dispatches the next specialist with a new work item or terminates the run. If a specialist fails, the foreman decides whether to retry, swap to a different specialist, escalate to a human, or abort. The foreman is the only agent that holds the run-level plan; specialists hold only the per-task input they were handed.
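
The loop described above can be sketched in a few dozen lines. This is an illustrative sketch, not any framework's API: `WorkItem`, `Result`, `Specialist`, and the `plan_next` callback are all assumed names.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class WorkItem:
    role: str      # which specialist this dispatch targets
    payload: dict  # typed input for that specialist

@dataclass
class Result:
    ok: bool
    data: dict

class Specialist(Protocol):
    def run(self, item: WorkItem) -> Result: ...

def foreman_run(goal: dict, specialists: dict[str, Specialist],
                plan_next, max_steps: int = 20) -> dict:
    """Hub-and-spoke loop: every transition passes through the foreman."""
    state = {"goal": goal, "artifacts": []}
    for step in range(max_steps):
        item = plan_next(state)  # only the foreman decides what runs next
        if item is None:
            return {"status": "done", "artifacts": state["artifacts"]}
        result = specialists[item.role].run(item)
        if not result.ok:
            # single recovery point: retry, swap, escalate, or abort lives here
            return {"status": "escalated", "failed_step": step, "role": item.role}
        state["artifacts"].append(result.data)  # run-level state stays with the foreman
    return {"status": "aborted", "reason": "max_steps reached"}
```

Note that the max-step counter makes runaway runs terminate by construction, independent of anything a specialist does.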

A frontend can render this as a directed acyclic flow with the foreman at the center, two to seven specialist boxes radiating around it, and arrows that always go through the foreman. The shape is not an organizational chart; it is a hub-and-spoke. That hub-and-spoke property is what makes the rest of the pattern work.

Three properties define a system as foreman-pattern, not just multi-agent:

  • Single decomposer. Only one agent decides what to do next. Everyone else executes.
  • No peer calls. Specialists cannot call each other, directly or transitively. The communication graph is a tree, not a mesh.
  • Typed handoffs. Every input the foreman gives a specialist and every output it returns is a validated structured object, not free text.

Drop any of those three and you do not have the foreman pattern; you have an orchestrator-shaped diagram with peer-to-peer plumbing underneath.
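
The typed-handoff property can be made concrete with plain stdlib dataclasses. The schemas below are hypothetical (the field names echo the sales example later in this piece); the point is the shape: every payload is validated against a schema before it crosses the foreman boundary, and mismatches raise instead of coercing.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Candidate:
    name: str
    domain: str
    evidence_url: str
    fit_score: float

def parse(cls, raw: dict):
    """Validate a handoff payload against its schema; never silently coerce."""
    expected = {f.name: f.type for f in fields(cls)}
    if set(raw) != set(expected):
        missing = set(expected) - set(raw)
        extra = set(raw) - set(expected)
        raise ValueError(f"{cls.__name__}: missing {missing}, unexpected {extra}")
    for name, typ in expected.items():
        if not isinstance(raw[name], typ):
            raise TypeError(f"{cls.__name__}.{name}: expected {typ.__name__}")
    return cls(**raw)
```

In a real system you would likely reach for a validation library, but the contract is the same: free text never crosses an agent boundary unchecked.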


What the alternatives actually look like

Four other patterns show up in production systems and in the writing about them. Knowing what each one fails at is what makes the foreman case obvious.

Peer-to-peer. Every agent can call every other agent. The framework provides a directory or a shared message bus, and any agent can decide that the next step needs the help of any other agent. This pattern looks elegant on a whiteboard because every box can talk to every other box. In production, it produces three failure modes within a week. First, two agents will negotiate with each other indefinitely if neither has been given a stop condition the other respects. Second, three agents will form a triangle that the operator notices only when the token bill arrives — agent A asks B, B asks C, C asks A, and the loop goes on until a budget timer trips. Third, when something goes wrong, reconstructing the order of operations across three or four asynchronous transcripts is the kind of debugging task that consumes whole afternoons.

Swarm. Many copies of similar agents work in parallel on the same problem, with results aggregated downstream. Swarms are useful for narrow, embarrassingly parallel work — running the same evaluation across a thousand candidate companies, drafting forty variants of a subject line, scoring a backlog of tickets — but they do not handle workflows where the next step depends on the previous step. A swarm cannot do "discover companies, then pick the best ten, then find the right contacts at those ten." It can do the first step in parallel, but it needs an orchestrator above it to do the picking and the next-step routing. A swarm without an orchestrator is a parallel computer without a CPU.

Hierarchical (foremen of foremen). A foreman delegates a sub-goal to a junior foreman, who delegates to specialists. This works well at large scale — when one foreman cannot hold the whole plan in context — but it introduces the same problems the foreman pattern was designed to solve, one level down. Every layer of foremen-of-foremen needs the same discipline: typed handoffs, no peer calls between sublevels, clean failure handling. We use it sparingly, only when the top-level foreman has more than seven first-level specialists or more than three distinct phases of work; below that threshold, it adds latency without adding clarity.

Pure event-driven / blackboard. Agents read from and write to a shared scratchpad and trigger themselves on events. This is closer to peer-to-peer than to foreman-pattern in failure mode and produces the same kinds of loops, but it is also notoriously hard to debug because the state is implicit in the blackboard's history rather than explicit in a foreman's plan. We have seen this work at research labs; we have not seen it work in revenue-bearing business workflows.


Why foreman wins for business workflows

The case for the foreman pattern is six concrete properties. Each one closes a specific failure mode that peer-to-peer or swarm exhibits.

Loops are impossible by construction. Specialists cannot call each other. There is no edge in the communication graph between specialist A and specialist B. The foreman has an explicit max-step counter; if it hits the limit, the run terminates with a clear failure mode and a clear log entry. This is not "loops are unlikely"; it is "loops cannot occur as a property of the topology." That structural guarantee is worth more than any number of guardrails grafted onto a peer-to-peer system.

Handoffs are inspectable in one place. Every transition between agents goes through the foreman. The foreman log is the audit trail. You do not have to reconstruct the order of operations from per-agent traces; you read one stream of events. When a regression appears three weeks after deploy, the time-to-root-cause is measured in minutes instead of hours.

Cost is bounded. The foreman owns the budget. It can refuse to dispatch a specialist if the run has already consumed too many tokens, too much wall-clock time, or too many tool calls. In peer-to-peer systems, no agent owns the budget; budgets become a per-process timer that fires after the damage is done.

Errors have one recovery point. When a specialist fails, the foreman decides whether to retry, dispatch a different specialist, escalate to the operator, or abort. That decision lives in one prompt and is testable. In peer-to-peer systems, every agent is partially responsible for error handling, which means none of them are fully responsible, which means the recovery logic is whatever the model felt like generating that turn.
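
Because the recovery decision lives in one place, it can be a deterministic, unit-testable policy rather than whatever the model generates. A minimal sketch, with assumed error kinds:

```python
from enum import Enum

class Recovery(Enum):
    RETRY = "retry"
    SWAP = "swap"
    ESCALATE = "escalate"
    ABORT = "abort"

def decide_recovery(error_kind: str, attempts: int, has_alternate: bool,
                    max_retries: int = 2) -> Recovery:
    """One recovery point: a deterministic policy, testable outside any model."""
    if error_kind == "transient" and attempts < max_retries:
        return Recovery.RETRY      # e.g. timeout or rate limit
    if has_alternate:
        return Recovery.SWAP       # dispatch a different specialist
    if error_kind == "scope_violation":
        return Recovery.ABORT      # a mis-routed task is a foreman bug, not a retry case
    return Recovery.ESCALATE       # surface to the operator
```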

Specialists are swappable. Because specialists do not depend on each other, you can replace any one of them — different model, different prompt, different tools — without rewriting the system. The foreman's contract with each specialist is a typed input schema and a typed output schema; everything inside the specialist is implementation detail. We swap models on specialists routinely; we never swap a peer-to-peer agent without a full integration test, because the agent's behavior is entangled with its peers.

Governance maps onto roles. Different specialists do different categories of work, with different risk profiles. A research specialist that reads public web pages is low-risk. An outreach specialist that sends external email is medium-risk. A negotiation specialist that quotes price is high-risk. With a foreman pattern, you set per-role policies — token budgets, allow-listed tools, human-in-the-loop gates — and the foreman enforces them. With peer-to-peer, governance is per-conversation, which is unusable for AI Act-grade audit.
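
Per-role policy enforcement can be as simple as a lookup the foreman consults before every dispatch. The roles, budgets, and tool names below are illustrative, not a real configuration:

```python
# Hypothetical per-role policies: budgets, allow-listed tools, human gates.
ROLE_POLICIES = {
    "research":    {"token_budget": 20_000, "tools": {"web_search"},  "human_gate": False},
    "outreach":    {"token_budget": 5_000,  "tools": {"email_draft"}, "human_gate": True},
    "negotiation": {"token_budget": 5_000,  "tools": {"quote"},       "human_gate": True},
}

def enforce(role: str, requested_tool: str, tokens_used: int) -> str:
    """The foreman checks policy before dispatch; specialists never self-police."""
    p = ROLE_POLICIES[role]
    if requested_tool not in p["tools"]:
        return "deny:tool_not_allowlisted"
    if tokens_used >= p["token_budget"]:
        return "deny:budget_exhausted"
    return "gate:human_review" if p["human_gate"] else "allow"
```

The audit story falls out for free: every `enforce` decision is one line in the foreman log.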

There is one cost to the foreman pattern, and it is not theoretical: the foreman is the bottleneck and the single point of failure. We mitigate this in two ways. First, the foreman is intentionally smaller and faster than its specialists — a routing-and-validation prompt, not a doing-the-work prompt. The foreman's job is to read a result, validate it against a schema, decide on the next dispatch, and emit a decision. That is a short-context, short-output routing task on a fast model, not a multi-thousand-token reasoning task. Second, the foreman's prompt is the most carefully tested artifact in the system, with replay tests for every handoff path and a frozen fixture suite for regression. The foreman is the place where investment in robustness pays the highest interest.


A working example: the sales pipeline foreman

The system that ships in our 4Sales product runs on this pattern, and walking through the actual decisions is the clearest argument for it.

The goal a 4Sales operator hands to the system is a campaign objective: "find ten companies in fintech mid-market that match our ICP, write a personalized first-touch message for one decision-maker at each, and surface them on my kanban for review before sending." That is one goal, and it is too big for one agent.

The foreman decomposes it into four phases:

  1. Discovery. Find candidate companies that match the ICP.
  2. Qualification. Score each candidate against fit criteria, rank, and pick the top ten.
  3. Contact selection. For each of the top ten, find one decision-maker who matches the buyer persona.
  4. Personalization. For each contact, draft a first-touch message grounded in a recent signal.

Each phase has a specialist. The discovery specialist's role card limits it to discovery and only discovery. It receives an ICP schema and returns an array of candidate companies with name, domain, evidence URL, and a fit score. It does not enrich contacts; it does not draft messages; it does not contact anyone. If a request asks it to do those things, it returns a scope-violation error.

The qualification specialist receives the array of candidates and the ranking criteria and returns the top ten with evidence. It cannot call discovery to find more candidates; if there are not enough candidates, it returns a request-for-human-review and the foreman escalates to the operator.

The contact-selection specialist is dispatched once per qualified company. Its scope is narrow: given a company and a buyer persona definition, find one contact and return their name, role, and corroborating evidence. The foreman dispatches ten of these in parallel, validates each return, and either accepts the contact, requests a re-search, or escalates if no plausible contact exists.
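
The fan-out shape maps directly onto `asyncio.gather`. The `find_contact` stand-in below is a placeholder for a real specialist dispatch; the key detail is `return_exceptions=True`, so one failed company does not sink the other nine:

```python
import asyncio

async def find_contact(company: str, persona: str) -> dict:
    """Stand-in for one contact-selection dispatch (a real one awaits a model call)."""
    await asyncio.sleep(0)
    return {"company": company, "contact": f"vp-of-{persona}@{company}", "ok": True}

async def fan_out(companies: list[str], persona: str) -> dict:
    # The foreman owns the parallelization; specialists never see their siblings.
    results = await asyncio.gather(
        *(find_contact(c, persona) for c in companies),
        return_exceptions=True,  # isolate per-company failures
    )
    accepted, escalated = [], []
    for company, res in zip(companies, results):
        if isinstance(res, Exception) or not res.get("ok"):
            escalated.append(company)  # surfaced to the operator as a single card
        else:
            accepted.append(res)
    return {"accepted": accepted, "escalated": escalated}
```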

The personalization specialist is also dispatched once per contact, in parallel. Its scope is one message per call — never schedules, never sends, never combines multiple contacts into one outreach. It receives the contact, the company, a recent signal, and the campaign template. It returns a draft subject line, a draft body, the signal it cited, and a confidence score.

When all four phases complete, the foreman assembles the result and writes ten draft messages to the operator's kanban as cards in the "review" column. The operator approves, amends, parks, or skips each one. Anything approved is dispatched to the sending stage, which is a separate run with its own foreman because the risk profile changed.

What this pattern buys at run time, in practice:

  • Every phase produces a structured artifact that is independently inspectable. The discovery phase's output is a list of companies. The qualification phase's output is a ranked subset. The contact-selection phase's output is a contact-per-company table. The personalization phase's output is a draft set. Every artifact is queryable in our memory layer; every regression debug session starts from the artifact, not from a transcript.
  • Specialists run in parallel where the workflow allows. The contact-selection specialist runs ten copies in parallel; the personalization specialist runs ten copies in parallel. The foreman owns the parallelization decision; the specialists do not know they are running concurrently with siblings.
  • Cost is measured per phase. When discovery costs more than expected, we know which specialist's prompt to investigate. When personalization costs are stable but quality drops, we know to look at the model binding for the personalization role, not at the whole pipeline.
  • Failures are localized. If contact-selection cannot find a decision-maker for one of the ten companies, the foreman escalates that single card to the operator and continues with the other nine. A peer-to-peer system would entangle the failure with the rest of the run; the foreman pattern keeps it surgical.

The pattern also makes commercial expansion tractable. The same foreman shape, with different specialists, runs the recruiting pipeline in our 4Talents product (sourcing, screening, scheduling, briefing) and the marketing pipeline in 4Marketing (audience building, brief generation, draft creation, distribution). The contract between foreman and specialist is the unit of reuse. For the broader frame around how these pipelines fit into a unified platform, see our agentic operating system for business explainer.


Anti-patterns: how foremen go wrong

The foreman pattern is not self-enforcing. We have built foremen that broke in production. The five anti-patterns below are the ones we have seen consistently — in our own builds and in code reviews of customer systems.

Foreman becomes the bottleneck. The most common failure mode. The foreman starts as a thin router and grows. Someone adds a piece of validation logic; later someone adds a piece of result-summarization logic; later someone adds a piece of "let me just check this myself" logic. Three months in, the foreman is doing more work than any specialist, runs are slow because every step waits on it, and replacing it requires understanding everything it has accumulated. Defense: the foreman's prompt has a strict token budget — we cap at roughly 1,200 tokens — and a strict tool budget — three or four tools, never more. Anything beyond that becomes a specialist. We review the foreman prompt every two weeks for scope creep; if a new responsibility cannot be removed, we lift it into a new role.

Role bleed. Two specialists overlap. The discovery specialist starts doing light qualification because it "already has the data." The qualification specialist starts doing fresh discovery because the candidate set looks thin. Within weeks, neither role is doing a clean version of its own job, and the foreman cannot reason about which one to call for what. Defense: the role card's scope field is the most important field, not the inputs. Write what each role does not do, and write it in adversarial language ("if you are asked to enrich contacts, return a scope-violation error"). Test the boundary explicitly with replay fixtures that include scope-violating inputs.

Escalation loop. Specialist A escalates to the foreman; the foreman dispatches specialist B; specialist B does not have what it needs and escalates back to the foreman; the foreman dispatches specialist A again with the same input; specialist A escalates again. The system has not made any progress, and tokens keep burning. Defense: the foreman tracks a per-run history of escalations. If the same dispatch happens twice with no new information, the foreman aborts the run with a clear "stuck" state and surfaces it to the operator. Never let the foreman be the only stop condition; add a max-escalation counter on top of the max-step counter.
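
The "same dispatch twice with no new information" check can be a small guard the foreman consults before every dispatch. A minimal sketch, assuming dispatch payloads are JSON-serializable:

```python
import json

class EscalationGuard:
    """Abort the run when the foreman is about to repeat an identical dispatch."""
    def __init__(self):
        self.seen: dict[str, int] = {}

    def check(self, role: str, payload: dict) -> bool:
        """True if the dispatch may proceed; False means the run is stuck."""
        key = json.dumps([role, payload], sort_keys=True)  # dispatch fingerprint
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] == 1  # an identical repeat means no new information
```

Any change to the payload produces a new fingerprint, so re-dispatching with genuinely new information still proceeds.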

Foreman pretending to be a specialist. A subtle one. The foreman starts taking on a small piece of work itself — "I will just personalize this message inline since the personalization specialist is slow today" — and stops being a pure router. Once the foreman holds doing-the-work prompts, it cannot also be a fast, focused, well-tested router. The two purposes are different artifacts and they fight each other inside one prompt. Defense: the foreman never produces final artifacts. It dispatches, validates, decides, and returns. If a phase is slow, the answer is a faster specialist or a smaller specialist scope, not the foreman doing the work.

Schema-less handoffs. The foreman dispatches a specialist with a free-text description of the task and gets back a free-text result. Parsing happens implicitly. The system works for two months and then starts producing weird outputs after a model snapshot rolls. Defense: every handoff is typed both ways. Inputs are validated before dispatch; outputs are validated before next-step decision. Validation failures are first-class log events and trigger either retry-with-correction or run abort, not silent coercion.
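
The retry-with-correction path can be sketched as a wrapper around a specialist call. The `call` and `validate` callables are assumptions; the point is that validation failure is an explicit event that feeds back into exactly one retry, never a silent coercion:

```python
def dispatch_with_correction(call, raw_input: dict, validate, max_attempts: int = 2):
    """Validate a specialist's output; on failure, retry once with the error
    attached so the specialist can self-correct, then abort."""
    feedback = None
    for attempt in range(max_attempts):
        raw_output = call(raw_input, feedback)
        try:
            return {"status": "ok", "result": validate(raw_output)}
        except (ValueError, TypeError) as e:
            feedback = f"previous output failed validation: {e}"  # first-class log event
    return {"status": "aborted", "reason": feedback}
```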

These five account for most of the foreman-pattern failures we have seen. None of them are unfixable; all of them are expensive to fix after the system has been in production for a quarter.


When the foreman pattern is the wrong tool

The foreman pattern is opinionated, and for some workloads the opinions cost more than they save. Three categories where we deliberately do not use it:

  • Single-task workloads. If a run has exactly one specialist, the foreman is a layer of indirection with no benefit. Run a single-agent system with a guardrail wrapper instead. We make the call by counting distinct roles in the workflow; one role means one agent.
  • Massive embarrassingly parallel work. Running the same evaluation across ten thousand items does not need a foreman; it needs a queue and a worker pool of identical specialists. The foreman pattern introduces sequential overhead that the workload does not require.
  • Truly conversational work. A genuine open-ended conversation — a research assistant talking with a user across a long session — does not decompose into discrete handoffs. A foreman in the middle of that conversation slows it down without adding clarity. Use a single agent with a strong system prompt and well-scoped tools.

In every other category — pipeline-shaped work, mixed-expertise work, anything with auditability requirements, anything regulated — the foreman pattern is the default. We have not encountered a business workflow where peer-to-peer or swarm beat foreman on the metrics that matter (correctness, debuggability, cost, governance), and we have looked.

For the broader vocabulary of orchestration patterns and where each fits in a production stack, see our multi-agent orchestration explainer and the multi-agent orchestration glossary entry. For the implementation walkthrough that puts a foreman into code, see our how to build a multi-agent AI system guide.


What to take into your week-one architecture

If you are about to design your first foreman-pattern system, the short list is: write the role cards before the prompts, cap the foreman at a thin router (under 1,200 tokens, under five tools), make every handoff typed both ways, instrument the foreman log as a single audit stream from day one, and test scope-violation explicitly with adversarial fixtures. Do that and the rest is iteration. Skip any of those and you will spend month four refactoring what should have been week one.

The foreman pattern is not glamorous. It does not generate the screenshots that go viral. What it does is run, every day, for months, without surprising you — and "without surprising you" is the property that distinguishes a production multi-agent system from a demo. We have shipped hundreds of foreman runs per day across our production fleet for over a year on the same five layers and the same orchestrator-worker discipline. The pattern compounds; the alternatives do not.

For the bigger architectural picture this fits into, see our agentic workforce 2026 piece and our AI workforce architecture explainer.