# When to use a workflow and when to use an agent: the decision that determines everything else

> The field has reached consensus: reach for a deterministic workflow first, every time. An agent — an LLM wired to tools, memory, and goals that runs its own observe-think-act loop — earns its place only when the process is genuinely unpredictable and a workflow cannot be written. Getting this decision wrong costs months: over-agentified pipelines are slower, harder to debug, more expensive, and less reliable than clean rule-based logic. This playbook documents the scoring method, the accuracy-threshold sequence, the four-part agent anatomy, the definition-of-done constraint that most builders skip, and the failure signatures that tell you when you built the wrong thing.
>
> https://pravda.systems/notes/workflow-vs-agent-decision-framework · 2026-06-17

**The single most consequential architectural decision in any automation project is not which tool to use — it is whether the task needs an agent at all.** The field, having run enough production systems to know, has converged on one rule: deterministic workflow first. An agent earns its place only when the process is genuinely unpredictable and a fixed sequence cannot be written. Everything else in this playbook flows from that starting point.

## What is the actual difference between a workflow and an agent?

A **workflow** is a fixed, linear, guard-railed sequence. New lead arrives → research company → draft email → send. It cannot deviate. That constraint is the feature, not the limitation: fixed sequences are cheaper to run, trivial to debug, and they fail loudly when something breaks rather than silently doing the wrong thing with high confidence.
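A minimal sketch of the lead example as a fixed sequence. The `Lead` type, the three step functions, and the in-memory `outbox` are hypothetical stand-ins for whatever CRM, enrichment, and email services a real pipeline would call; the point is only that the order is hard-coded and bad input raises immediately.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    name: str
    company: str
    email: str


def research_company(lead: Lead) -> str:
    # Hypothetical stand-in: a real step would query a CRM or enrichment API.
    if not lead.company:
        raise ValueError(f"lead {lead.name!r} has no company to research")
    return f"{lead.company} is the employer of {lead.name}."


def draft_email(lead: Lead, research: str) -> str:
    # Template-based draft; no model call is needed for this step.
    return f"Hi {lead.name},\n\n{research}\n\nWould a short call be useful?"


def send_email(lead: Lead, body: str, outbox: list) -> None:
    # Hypothetical stand-in for an email API; appends to a list instead.
    if "@" not in lead.email:
        raise ValueError(f"invalid address for {lead.name!r}: {lead.email!r}")
    outbox.append((lead.email, body))


def run_lead_workflow(lead: Lead, outbox: list) -> None:
    # Fixed order, no deviation: research -> draft -> send.
    research = research_company(lead)
    body = draft_email(lead, research)
    send_email(lead, body, outbox)


outbox: list = []
run_lead_workflow(Lead("Ada", "Example Corp", "ada@example.com"), outbox)
```

A lead with a missing company or a malformed address stops the run with a `ValueError` at a known step, which is the "fails loudly" property in practice.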

An **agent** is built from four components:

| Component | Role |
|---|---|
| **LLM** | the reasoning engine — decides which step to take next |
| **Tools** | terminal, browser, file system, external APIs |
| **Memory** | short-term (context window) + long-term (files, databases, vector stores) |
| **Goals** | a specific outcome with a measurable definition of done |

The functional difference: a workflow follows a script; an agent decides its own script on each run, using tools to observe reality and act until its goals are met. That autonomy is what makes agents genuinely useful for unpredictable processes — and genuinely dangerous for predictable ones.

The practitioners who have built the most production systems put it plainly: *"Some of the most valuable automations ever built were purely rule-based. No LLMs, no fancy prompts — just clean logic that moves data from A to B."* Reach for an LLM node last, not first. This is the most useful contrarian point in the field because most public discourse sells the opposite.

## Why is deterministic-first the field's consensus?

The bet is not that agents are bad. The bet is that most tasks that look like they need an agent are actually well-specified enough for a workflow — and operators find this out the hard way after paying agent costs for months.

The evidence behind deterministic-first:

**Cost.** An LLM node on every document in a large batch costs real money. A conditional branch costs nothing. Any task that can be expressed as "if field X contains Y, do Z" should be expressed that way.
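As an illustration of "if field X contains Y, do Z", here is a rule-based router. The field name `subject`, the keywords, and the queue names are invented for the example.

```python
def route_message(record: dict) -> str:
    """Route a record with plain conditionals; no model call is involved."""
    subject = record.get("subject", "").lower()
    if "invoice" in subject:
        return "accounts_payable"
    if "refund" in subject:
        return "support_refunds"
    # Anything the rules do not cover is surfaced to a human, not guessed at.
    return "manual_review"


queue = route_message({"subject": "Invoice #1042 attached"})
```

Only the records that land in `manual_review` are candidates for an LLM node, so the model cost applies to the residue rather than to the whole batch.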

**Debuggability.** When a deterministic pipeline produces the wrong output, there is one path through the code to inspect. When an agent produces the wrong output, the reasoning chain that produced it may differ on every run. Deterministic pipelines fail loudly and at known points; agents fail silently at unpredictable ones.

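**Reliability under scale.** A workflow that handles an input structure correctly once will handle the same structure correctly a million times. An agent that achieves 90% accuracy on individual tasks compounds its errors across a multi-step sequence: when every step has to succeed, the chance of a fully correct run falls with each step added. The math is unforgiving at production volume.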
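The compounding can be made concrete with a small calculation. It assumes that steps fail independently and that a run only counts as correct when every step is correct; real pipelines may be better or worse than this simplification.

```python
def chain_success_rate(step_accuracy: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent failures."""
    return step_accuracy ** steps


for steps in (1, 5, 10, 20):
    rate = chain_success_rate(0.90, steps)
    print(f"{steps:>2} steps at 90% per step: {rate:.1%} fully correct runs")
```

Under those assumptions a ten-step sequence at 90% per step completes correctly only about 35% of the time, and a twenty-step sequence about 12%.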

**Scalability.** Deterministic workflows scale horizontally with no additional reasoning cost. Adding parallelization to an agent system requires coordination logic, shared memory architecture, and error propagation strategy.

Agents genuinely earn their place when: the task structure varies run-to-run in ways that cannot be enumerated in advance; the process involves genuine judgment calls with no correct rule; or the system must respond to novel inputs that a workflow author could not anticipate. For everything else, a workflow is the correct tool.

## What is the step-by-step plan: from manual process to correct automation?

### Step 1 — Document every manual step before touching any tool

Before scoring anything, write down every step a human currently takes to complete the task. Include decision points, lookup actions, edge cases, and the tools touched. This documentation pass kills two problems simultaneously: it surfaces redundancies you did not know existed (steps that exist because of how a previous tool was structured, not because the work requires them), and it gives you the raw material for scoring.

Do not skip this. The operators who build wrong spend the least time here.

### Step 2 — Score each task on four dimensions

For each candidate task, score it on:

1. **Frequency** — how often does this run? (daily beats weekly beats monthly; hourly beats daily)
2. **Time-intensity** — how long does a manual run take?
3. **Structure** — how well-defined is the input and the expected output? (a form submission is structured; an inbound email with variable intent is not)
4. **Precision tolerance** — what accuracy is acceptable for this task to be useful?

The highest-value automation targets are high-frequency, time-intensive tasks with structured data and clear success metrics. Those are the ones to automate first, before any complex decision about workflow-vs-agent.
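One way to turn the four dimensions into a ranking is a simple scorecard. The 1 to 5 scale, the unweighted sum, and the example tasks are assumptions made for this sketch; the scoring method itself does not prescribe weights.

```python
from dataclasses import dataclass


@dataclass
class TaskScore:
    name: str
    frequency: int            # 1 = rare, 5 = hourly or more
    time_intensity: int       # 1 = seconds per run, 5 = hours per run
    structure: int            # 1 = free-form input, 5 = fully structured
    precision_tolerance: int  # 1 = near-perfect required, 5 = about 90% is fine

    def total(self) -> int:
        # Unweighted sum: an assumption, adjust to your own priorities.
        return (self.frequency + self.time_intensity
                + self.structure + self.precision_tolerance)


def rank_tasks(tasks: list[TaskScore]) -> list[TaskScore]:
    """Highest-scoring automation candidates first."""
    return sorted(tasks, key=lambda t: t.total(), reverse=True)


candidates = [
    TaskScore("quarterly board memo", 1, 5, 2, 1),
    TaskScore("invoice data entry", 5, 4, 5, 3),
    TaskScore("inbound email triage", 5, 3, 2, 4),
]
ranked = rank_tasks(candidates)
```

Here `ranked[0]` is the invoice task: frequent, time-consuming, and structured, which matches the profile of a first automation target.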

### Step 3 — Apply the accuracy-threshold sequence

This is the sequencing rule that most practitioners learn only after a painful lesson:

**Start with tasks where 90% accuracy is acceptable.** Scheduling summaries, content drafts, classification at volume, routine notifications — tasks where an occasional error is caught by a human in the loop and does not cause downstream damage. These tasks reach useful accuracy within days of deployment.

**High-precision tasks take months, not days.** A task that requires very high accuracy — financial data extraction, compliance-adjacent actions, client-facing outputs that cannot be verified before delivery — typically reaches a workable baseline quickly and then plateaus. The remaining gap requires iteration, prompt refinement, output validation layers, and often a structural redesign of how the task is broken down. Operators who deploy high-precision tasks first spend months debugging instead of delivering.

The practical sequence: deploy the lower-precision tasks to get real production data and build operational confidence, then invest the time the high-precision tasks genuinely require. Do not attempt both simultaneously.
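The sequencing rule can be sketched as a split into deployment waves. The task names and the required-accuracy figures below are illustrative assumptions, as is the choice to order the later wave from least to most demanding.

```python
def deployment_waves(
    required_accuracy: dict[str, float], threshold: float = 0.90
) -> tuple[list[str], list[str]]:
    """Split tasks into a first wave (threshold or lower) and a later wave."""
    first = sorted(n for n, a in required_accuracy.items() if a <= threshold)
    later = sorted(
        (n for n, a in required_accuracy.items() if a > threshold),
        key=lambda n: required_accuracy[n],  # least demanding first
    )
    return first, later


tasks = {
    "scheduling summaries": 0.90,
    "content drafts": 0.85,
    "financial data extraction": 0.99,
    "client-facing reports": 0.97,
}
first_wave, later_wave = deployment_waves(tasks)
```

`first_wave` holds the tasks that can ship now with a human in the loop; `later_wave` is the backlog that gets dedicated time once the first wave is stable.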

### Step 4 — Make the workflow-vs-agent call

With documented steps and accuracy scores in hand, the decision comes down to two questions:

- **Can you write the complete decision logic in advance?** Use a workflow. If the answer is "mostly yes, except for this one filtering step," use a workflow with a single LLM node at that filtering step — not an agent.
- **Does the process genuinely require judgment on inputs you cannot enumerate?** Use an agent, but apply the anatomy and definition-of-done constraint described below before writing a single prompt.

The most useful diagnostic: *if you could write this as a flowchart with finite branches, it is a workflow.* Agents are for tasks where the branches cannot be enumerated, or where they depend on context in ways no author can specify in advance.
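The call can be written down as a small function. The two inputs (whether the flow fits a finite flowchart, and how many individual steps need classification or generation) are one reading of the questions above, not a formal test.

```python
from enum import Enum


class Architecture(Enum):
    WORKFLOW = "workflow"
    WORKFLOW_WITH_LLM_NODES = "workflow with LLM nodes at specific steps"
    AGENT = "agent"


def choose_architecture(branches_enumerable: bool, judgment_steps: int) -> Architecture:
    """Pick the simplest structure that fits the documented process."""
    if judgment_steps < 0:
        raise ValueError("judgment_steps cannot be negative")
    if not branches_enumerable:
        # No finite flowchart exists: the process needs an agent.
        return Architecture.AGENT
    if judgment_steps == 0:
        return Architecture.WORKFLOW
    # Finite flowchart, but some steps need a model: keep the workflow
    # and place an LLM node only at those steps.
    return Architecture.WORKFLOW_WITH_LLM_NODES


decision = choose_architecture(branches_enumerable=True, judgment_steps=1)
```

The "mostly yes, except for this one filtering step" case maps to `WORKFLOW_WITH_LLM_NODES`, never to `AGENT`.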

## How do you define "done" for an agent prompt?

This is the highest-impact single concept for anyone building agents: **the definition of done.**

Most agent prompts specify what to do. The definition of done specifies what done looks like — the completion criteria and the constraints that must hold when the task is complete. Without it, agents loop indefinitely, halt too early, or produce outputs that satisfy the letter of the instruction while missing its intent.

A definition of done for an agent prompt has two parts:

**Completion criteria** — the specific, observable state that signals the task is finished. Not "research this topic" but "research this topic until you have identified at least 8 empirical sources published after 2024 with verifiable citations, then produce a structured summary."

**Constraint set** — the conditions that must remain true throughout. "Do not send any email without first showing me the draft." "Do not delete any file without writing a backup path to the log." "Stop and surface a question if you encounter an input that does not match any of the three categories."

Bake the definition of done into every agent prompt. This single change eliminates most of the dissatisfaction operators report with agents — the sense that the agent "kind of did the task but not really." The agent did exactly what was asked; what was asked was underspecified.
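A sketch of keeping the definition of done as data and rendering it into every prompt. The class name, the rendering format, and the `build_agent_prompt` helper are assumptions for illustration; the criteria and constraints are taken from the examples above.

```python
from dataclasses import dataclass, field


@dataclass
class DefinitionOfDone:
    completion_criteria: list[str]
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Refuse to build a prompt that has no exit condition.
        if not self.completion_criteria:
            raise ValueError("a definition of done needs at least one completion criterion")
        lines = ["You are done when ALL of the following hold:"]
        lines += [f"- {c}" for c in self.completion_criteria]
        if self.constraints:
            lines.append("At every step, these constraints must hold:")
            lines += [f"- {c}" for c in self.constraints]
        return "\n".join(lines)


def build_agent_prompt(task: str, done: DefinitionOfDone) -> str:
    return f"Task: {task}\n\n{done.render()}"


done = DefinitionOfDone(
    completion_criteria=[
        "At least 8 empirical sources published after 2024 are identified, each with a verifiable citation.",
        "A structured summary of those sources is produced.",
    ],
    constraints=[
        "Do not send any email without first showing me the draft.",
        "Stop and surface a question if an input matches none of the three categories.",
    ],
)
prompt = build_agent_prompt("Research this topic.", done)
```

Because `render` raises when the criteria list is empty, a prompt cannot be built without writing the definition of done first.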

## How does an agent actually run?

Every agent, regardless of platform, executes the same loop:

1. **Observe** — read the current state: files, tool outputs, previous step results, multimodal data
2. **Think** — plan the next action against the observed reality and the goal
3. **Act** — call a tool, write a file, run a command, send a message
4. **Repeat** — until the definition of done is satisfied or a stop condition fires

The loop has one failure mode that is specific to under-specified prompts: without a definition of done, the agent has no exit condition and will loop until it hits a token limit or a cost ceiling. With a definition of done, the loop is bounded and the agent terminates cleanly.
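The loop and its two exits can be sketched in a few lines. The callables passed in are hypothetical: in a real agent `think` would be a model call and `act` a tool call, while here they are trivial lambdas so the control flow is visible. The `max_steps` ceiling is the stop condition that fires when the definition of done is never met.

```python
from typing import Callable


def run_agent(
    observe: Callable[[], dict],
    think: Callable[[dict], str],
    act: Callable[[str], None],
    is_done: Callable[[dict], bool],
    max_steps: int = 20,
) -> tuple[bool, int]:
    """Observe, think, act until is_done(state) holds or max_steps is hit.

    Returns (finished, steps_taken).
    """
    for step in range(max_steps):
        state = observe()
        if is_done(state):          # definition of done satisfied: clean exit
            return True, step
        action = think(state)       # plan the next action against the goal
        act(action)                 # change the world, then loop and re-observe
    return is_done(observe()), max_steps  # stop condition: step ceiling


sources: list[str] = []

finished, steps = run_agent(
    observe=lambda: {"source_count": len(sources)},
    think=lambda state: f"find source #{state['source_count'] + 1}",
    act=lambda action: sources.append(action),
    is_done=lambda state: state["source_count"] >= 8,
)
```

With the toy callables above the loop exits cleanly after eight actions; lower `max_steps` below eight and it returns `finished=False` instead of running on.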

The memory layer is what separates a one-shot agent from a system that improves. Short-term memory is the context window — everything the agent can see in a single loop. Long-term memory is the persistent layer: files, databases, vector stores. The agents that produce the most consistent, business-specific output are the ones with a persistent memory layer that accumulates context across sessions. Without it, every run starts cold.
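A persistent layer does not need to start as a database. The sketch below uses an append-only JSON Lines file; the file name, the record fields, and the `remember`/`recall` helpers are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def remember(path: Path, decision: str, outcome: str) -> None:
    """Append one record; earlier records are never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"decision": decision, "outcome": outcome}) + "\n")


def recall(path: Path, limit: int = 20) -> list[dict]:
    """Return the most recent records, oldest first, for the next prompt."""
    if not path.exists():
        return []  # first run: the agent starts cold
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-limit:] if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "agent_memory.jsonl"
    cold_start = recall(memory_file)
    remember(memory_file, "used formal tone for legal leads", "reply rate improved")
    remember(memory_file, "sent follow-up after 2 days", "no change")
    recent = recall(memory_file, limit=1)
```

Feeding the output of `recall` into the next session's context is what lets a run start from past decisions rather than from nothing.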

## When does parallelization earn its place?

Parallelization is the core operational advantage of agent systems — multiple instances running simultaneously on tasks that would take hours in sequence. But it is not free and not always the right structure.

**When parallelization is worth the overhead:**

- The task set is genuinely independent (results of task A do not affect the inputs to task B)
- Volume is high enough that sequential execution creates unacceptable latency
- Individual task accuracy is high enough that aggregating results is meaningful

**The three parallelization patterns the field uses:**

1. **Fan-out and aggregate** — the same prompt runs against N inputs simultaneously; outputs are aggregated or compared. Useful for research, classification at volume, competitive monitoring across many sources.

2. **Specialist sub-agents under a director** — a coordinating agent routes work to specialized agents based on task type (research to the research agent, writing to the writing agent). Each specialist has a narrow scope and a tight definition of done. The director does not do the work; it delegates and verifies. This pattern is appropriate for sustained content or lead-generation operations where general-purpose single-agent systems fail because they lack cross-session strategy.

3. **Verification loops** — sub-agents check each other's outputs before anything is accepted. One agent produces; another critiques; a third resolves disagreements. This is the pattern for high-precision tasks where you are spending the time required to reach very high accuracy.
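The first pattern, fan-out and aggregate, can be sketched with the standard library's thread pool. The `classify` function is a hypothetical stand-in for one model call per input, and the inputs are invented; the tasks are independent, which is the precondition for running them in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def classify(text: str) -> str:
    # Hypothetical stand-in for one model call per input.
    return "complaint" if "refund" in text.lower() else "other"


def fan_out(inputs: list[str], worker: Callable[[str], str], max_workers: int = 4) -> list[str]:
    """Run the same worker over every input concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, not completion order.
        return list(pool.map(worker, inputs))


tickets = [
    "I want a refund for my order",
    "How do I change my password?",
    "Refund still not received",
]
labels = fan_out(tickets, classify)
summary = Counter(labels)  # the aggregate step
```

Threads suit this case because a model call is I/O-bound; the aggregate is only meaningful if each individual label is accurate enough to trust.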

The division of labor the field has converged on: **humans for judgment, agents for execution.** A human defines the goal, reviews the output of the first week of any new deployment, and approves anything customer-facing or financially consequential. An agent handles the execution steps — the fetching, drafting, classifying, formatting, posting. The moment you ask an agent to make judgment calls that belong to a human, the error rate climbs and the results stop being trustworthy.

## How do you run it: practical cadence and targets?

**Week 1 — establish the baseline.** Deploy the first workflow or agent on real input. Do not run it autonomously. Review every output. This is not excessive caution; it is calibration. The first week of reviewing real output is where you catch the 10–15% of cases your prompt did not handle, the edge cases in the input structure you did not know existed, and the ways your definition of done was underspecified. A human-in-the-loop approval gate during the first week of any new deployment is standard operational practice, not a sign the system is not ready.

**Week 2–4 — tighten and extend.** Fix the edge cases found in week 1. Add output validation where the accuracy data shows consistent failure modes. Extend to a higher volume once the error rate on real inputs is below your stated tolerance.

**Ongoing — weekly review cadence.** Check error rates, spot-check outputs, and review whether the task structure has changed in ways the original prompt did not anticipate. The 6-step build method that generalizes across any tool: define trigger → identify data sources → specify action → test on real input → schedule → monitor and refine weekly.

**Measurable targets to track:**

- Error rate per task type, measured on a held-out sample weekly
- Cost per completed task (token cost + any API calls), tracked against a weekly ceiling
- Time-to-completion compared to the manual baseline
- Number of tasks reaching your precision threshold without human correction
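The four targets can be computed from a per-run log. The `RunRecord` fields and the sample figures are assumptions; in particular, `correct` is assumed to come from the weekly human spot-check rather than from the system grading itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunRecord:
    task_type: str
    correct: bool          # verdict from the weekly human spot-check
    cost_usd: float        # token cost plus any API calls
    seconds: float         # wall-clock time for the automated run
    human_corrected: bool  # True if someone had to fix the output


def weekly_metrics(records: list[RunRecord], manual_seconds: float) -> dict:
    if not records:
        raise ValueError("no records to summarise")
    by_type: dict[str, list[RunRecord]] = defaultdict(list)
    for r in records:
        by_type[r.task_type].append(r)
    n = len(records)
    return {
        "error_rate_by_type": {
            t: sum(not r.correct for r in rs) / len(rs) for t, rs in by_type.items()
        },
        "cost_per_task_usd": sum(r.cost_usd for r in records) / n,
        "speedup_vs_manual": manual_seconds / (sum(r.seconds for r in records) / n),
        "uncorrected_tasks": sum(not r.human_corrected for r in records),
    }


records = [
    RunRecord("classify", True, 0.25, 2.0, False),
    RunRecord("classify", False, 0.25, 2.0, True),
    RunRecord("draft", True, 0.5, 4.0, False),
    RunRecord("draft", True, 0.5, 4.0, False),
]
metrics = weekly_metrics(records, manual_seconds=300.0)
```

Comparing `cost_per_task_usd` multiplied by weekly volume against the ceiling, and `error_rate_by_type` against the stated tolerance, gives the two checks that gate any increase in volume.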

## How do you verify you built the right thing?

A correctly-assigned workflow shows:

- Consistent, predictable output across runs with the same input structure
- Failures that are immediate, loud, and located at a specific node
- Linear cost that scales with volume without surprises
- No drift in output quality over time on the same inputs

A correctly-assigned agent shows:

- Output quality that improves over the first several runs as the definition of done gets refined
- Graceful handling of novel inputs that were not in the original test set
- Clear stop conditions that fire reliably, not infinite loops
- Audit trail of the observe-think-act steps taken on each run

**If you are unsure which you built, run the same input twice.** A workflow produces identical outputs. An agent may produce different outputs because its reasoning path is non-deterministic. If your task requires identical outputs for identical inputs, you need a workflow, not an agent — and if you built an agent, you built the wrong thing.
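The run-it-twice check is easy to automate. Both functions under test are hypothetical: `workflow_like` is a pure transformation, and `agent_like` simulates a varying reasoning path with a call counter rather than a real model.

```python
import itertools
from typing import Callable


def same_output_twice(fn: Callable[[str], str], sample_input: str) -> bool:
    """Run fn on the same input twice and compare the outputs."""
    return fn(sample_input) == fn(sample_input)


def workflow_like(text: str) -> str:
    # Pure function of its input: same input, same output.
    return text.strip().upper()


_call_counter = itertools.count()


def agent_like(text: str) -> str:
    # Simulated non-determinism: the output depends on hidden state,
    # standing in for a reasoning path that differs between runs.
    return f"{text.strip()} (variant {next(_call_counter)})"


workflow_stable = same_output_twice(workflow_like, " new lead ")
agent_stable = same_output_twice(agent_like, " new lead ")
```

A mismatch is conclusive; a match on one input is not, so a real check would repeat this over a sample of representative inputs.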

## What are the failure modes to watch for?

**Over-agentifying deterministic data plumbing.** The most common failure. A document extraction task, a CRM upsert, a scheduled report — tasks with fully enumerable structure get wrapped in an agent because agents feel more capable. The result is a system that costs more, fails less clearly, and provides no benefit over a workflow. Diagnostic: if you can describe every step as a conditional branch, replace the agent with a workflow and keep the LLM only where genuine classification or generation is required.

**Skipping the definition of done.** The second most common failure. An agent deployed without explicit completion criteria produces output that is technically plausible but practically useless — or runs until it hits a ceiling. Fix: before writing any agent prompt, write the definition of done first. If you cannot write it, the task is not ready to automate.

**Only ever giving the system easy tasks.** A less obvious failure, but one that costs months of runway. Operators who automate only the easy, low-precision tasks never discover what the system is capable of on hard tasks, and never build the validation infrastructure that hard tasks require. The accuracy-threshold sequence is a starting point, not a ceiling. After the lower-precision tasks are stable, systematically move up the precision ladder with the validation infrastructure the early deployments taught you to build.

**Skipping the memory layer.** An agent without persistent memory produces generic output that ignores everything the business learned in previous runs. Memory is not optional for any agent running repeatedly on business-specific inputs. Even a simple append-only file that records past decisions and outcomes materially improves output quality on subsequent runs.

**Trusting output without verification.** The field's honest position: AI systems are confidently incorrect more often than operators expect, particularly on tasks requiring factual precision or domain-specific knowledge. Verification is not a phase you exit — it is the ongoing discipline. Spot-check outputs. Demand citations. Define what a wrong answer looks like before you deploy, so you can recognize one when it appears.

The decision that determines everything else is not glamorous. It is a scoring rubric, a sequencing rule, and a habit of writing down what done looks like before writing a single prompt. The operators who get this right spend their time directing systems that work. The ones who get it wrong spend their time debugging systems that almost work.
