From One-Off to Repeatable Workflows

Intermediate

After this you can take a chat sequence that worked once and turn it into a workflow you can rerun, test, and debug — a deterministic structure that calls the model at fixed points instead of a fresh improvisation every time.

Understand

A thing that worked once in a chat is not a thing that works. You coaxed a good result out of a back-and-forth, the answer was right, and the instinct is to keep that conversation as the recipe. But the conversation is the least repeatable artifact you produced. Rerun it tomorrow with a slightly different input and the model takes a different turn three messages in, and the careful sequence that landed yesterday now drifts somewhere you did not intend.

The move that makes it repeatable is to stop running the work as a conversation and start running it as a workflow. Anthropic draws the line cleanly: a workflow is a system where the model and its tools are orchestrated through code paths a human wrote, and an agent is a system where the model directs its own control flow, picking the next step and deciding when it is done. The difference is who owns the sequence. In a workflow you own it. You decided there are three steps, in this order, with this check between step two and step three. The model only fills in the cognitive blank at each node, so the unreliable part stays fenced inside a structure you control.
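A minimal sketch of what "you own the sequence" can look like. It assumes a hypothetical `call_model` function standing in for whatever model client you use (a canned stub here, so the example runs offline); the step prompts and the category check are illustrative, not a prescribed design.

```python
from typing import Callable

# Hypothetical signature for a model client: prompt in, text out.
ModelFn = Callable[[str], str]


def run_workflow(text: str, call_model: ModelFn) -> str:
    """Three fixed steps with one check between step two and step three."""
    # Step 1: the model fills one blank; the code decides what comes next.
    summary = call_model(f"Summarize in one sentence:\n{text}")

    # Step 2: another fixed node, fed by step 1's output.
    raw_label = call_model(f"Label this as 'bug' or 'question':\n{summary}")
    category = raw_label.strip().lower()

    # The check between step two and step three is written by a human.
    if category not in {"bug", "question"}:
        raise ValueError(f"unexpected category: {category!r}")

    # Step 3: only runs if the check passed.
    return call_model(f"Write a short reply to this {category}:\n{summary}")


def stub_model(prompt: str) -> str:
    """Canned responses so the sketch runs without any API."""
    if prompt.startswith("Summarize"):
        return "The export button crashes the app."
    if prompt.startswith("Label"):
        return "Bug"
    return "Thanks for the report; we are looking into the crash."


reply = run_workflow("When I click export the app closes.", stub_model)
```

The model never chooses the next step here. It answers three questions, and the order and the check live in `run_workflow`.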

That framing also tells you why the conversation drifted. A multi-turn chat sits at the agent end of that line whether it feels like it or not. The path through the work is being improvised as you go, one message at a time. There is a measured cost to that improvisation. When Microsoft and Salesforce ran the same tasks two ways in their sharded-prompt study, once as a single assembled prompt and once with the same information spread across conversational turns, average performance dropped 39%, and the fall was steep even on strong reasoning models: o3 went from 98.1 to 64.1. Their summary is the part to keep: "when LLMs take a wrong turn in a conversation, they get lost and do not recover." The drift is not bad luck. It is a structural property of running work as a chat, and assembling the same information into one call built by your code is what recovers most of the lost ground.
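As a small illustration of "one assembled prompt" versus information spread across turns, the sketch below joins a task and its requirements into a single prompt string. The function name `assemble_prompt` and the example requirements are made up for this illustration; the study's own prompt format is not reproduced here.

```python
def assemble_prompt(task: str, shards: list[str]) -> str:
    """Join a task and every known requirement into one prompt string."""
    requirements = "\n".join(f"- {s.strip()}" for s in shards if s.strip())
    return f"{task.strip()}\n\nRequirements:\n{requirements}"


# Details that would otherwise arrive one chat message at a time.
task = "Write a Python function that parses a date string."
shards = [
    "Accept ISO 8601 input only.",
    "Return None on invalid input.",
    "Do not use third-party libraries.",
]

prompt = assemble_prompt(task, shards)
```

The model sees every requirement before it produces anything, so there is no early turn for it to go wrong on.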

Figure: Why the chat drifts and the workflow holds. The same task as an improvised multi-turn chat (each turn can veer, and a wrong turn doesn't recover) versus a fixed workflow where the model is called at set nodes and a gate catches a bad intermediate before it propagates.

So you wrap the probabilistic calls in deterministic structure, and the cleanest version of that structure is a fixed, schema-validated container the model fills rather than a free-form exchange. A well-run internal command library tends to converge on this shape. Each runnable command is a file with a fixed anatomy: capture the current state, plan, execute, verify, summarize. The model does the thinking inside each section, while the set of sections, their order, and the validation are all decided in advance. Written that way, the command stops being documentation about a workflow and becomes the workflow, in a form an executor with no memory of the original chat can run correctly, including you next month.
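One way to picture the "fixed container" is a validator that rejects any model output not matching the agreed anatomy. This is a sketch under assumptions: the section keys below are invented names for the five sections described above, and the model is assumed to return a JSON object.

```python
import json

# Hypothetical section keys, in the order the command file fixes them.
SECTIONS = ("state", "plan", "execute", "verify", "summary")


def validate_command_output(raw: str) -> dict[str, str]:
    """Accept the model's output only if it fills every section, in order."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    # json.loads preserves key order, so this checks both names and order.
    if tuple(data) != SECTIONS:
        raise ValueError(f"expected sections {SECTIONS}, got {tuple(data)}")
    for name in SECTIONS:
        if not isinstance(data[name], str) or not data[name].strip():
            raise ValueError(f"section {name!r} must be non-empty text")
    return data


good = json.dumps({name: f"text for {name}" for name in SECTIONS})
validated = validate_command_output(good)
```

The content of each section still varies from run to run. What cannot vary is which sections exist and in what order.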

Figure: Probabilistic call inside a deterministic shell. The operator move: the model's varying output sits inside a fixed structure (defined input contract, named step, schema-validated output, content check), so the unreliable part is fenced and the rest is stable.

There is a ladder of structure, and the operator skill is reading which rung a task actually needs. It runs roughly: a single prompt, then a prompt with one tool, then a fixed chain of prompts, then routing an input to the right specialized prompt, then orchestrator-workers where a coordinating call decomposes the task and farms pieces out, and finally the autonomous agent that owns its own loop. The rungs are ordered by how much control you hand the model: the first three keep control flow in your code, while the last three start giving it away. The discipline is to climb only when the rung you are on demonstrably fails, with evidence, not because the higher rung sounds more capable. Chip Huyen's observation from the field is that most teams overshoot by two rungs: they build an orchestrator-workers system for a task a three-step chain would have handled, and then spend the life of the project debugging flexibility they never needed.
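The ladder and the "climb only on evidence" rule can be written down as data. This is an illustrative sketch; the names `Rung`, `control_flow_owner`, and `next_rung` are hypothetical, and the split after the fixed chain follows the ordering described above.

```python
from enum import IntEnum


class Rung(IntEnum):
    """The six rungs, ordered by how much control the model is handed."""
    SINGLE_PROMPT = 1
    PROMPT_WITH_TOOL = 2
    FIXED_CHAIN = 3
    ROUTING = 4
    ORCHESTRATOR_WORKERS = 5
    AUTONOMOUS_AGENT = 6


def control_flow_owner(rung: Rung) -> str:
    """Rungs 1 to 3 keep control flow in code; higher rungs hand some away."""
    return "your code" if rung <= Rung.FIXED_CHAIN else "partly or fully the model"


def next_rung(current: Rung, failure_evidence: list[str]) -> Rung:
    """Climb one rung only when there is recorded evidence the current one fails."""
    if not failure_evidence or current is Rung.AUTONOMOUS_AGENT:
        return current
    return Rung(current + 1)


# No evidence of failure, so stay put even if a higher rung sounds better.
stay = next_rung(Rung.FIXED_CHAIN, [])
climb = next_rung(Rung.FIXED_CHAIN, ["inputs need different prompts by type"])
```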

Figure: The complexity ladder, and where most teams land. The six rungs ordered by how much control flow you hand the model; the deterministic-vs-model-driven split between rungs 3 and 4; and the common error of overshooting the task's real need by two rungs.

The fastest filter for the whole decision is a question you can answer before writing any code: can you draw the flowchart? If you can enumerate the steps and the branches in advance, the task shape is known, and you should build that flowchart as a workflow rather than hire a model to rediscover it on every run. You reach for an agent only when you genuinely cannot draw the flowchart ahead of time — when the steps depend on what earlier steps turn up in a way you cannot enumerate. Most production work that feels like it wants an agent is a flowchart you have not drawn yet.

Where it breaks

The headline failure is over-engineering: building the autonomous agent when a deterministic workflow would have been more reliable, cheaper, and far easier to debug. Handing the model control flow buys flexibility you pay for in every dimension that matters in production, and you keep paying it long after the demo. If the task shape was enumerable, the agent was a mistake even when it works, because a workflow would have worked more often.

A separate trap runs the other direction: freezing the shape too early. The instinct after reading this is to take the chat that worked once and immediately harden it into a rigid pipeline. But the first version that worked was probably one example out of a class you do not understand yet. Freeze it now and you have abstracted over a pattern that has not stabilized, baking in assumptions the second and third real inputs will break. Let the shape prove itself across a handful of genuinely different runs before you commit it to code. Premature abstraction is as expensive as no abstraction.

Two quieter breaks round it out. The first is naive retries, which turn into a correctness hazard the moment a step has side effects. Rerunning a model call resamples it, so you can get a different plan on the retry, and retrying after a partial write risks a double-write or an inconsistent one. The fix is to separate the model decision from the side effect and make the action idempotent; the automation lesson covers it in full.

The second is the gate problem, which cuts both ways. Put a human checkpoint on every step and you have rebuilt a slow manual process with extra friction, but omit the gate before an irreversible action and you ship errors at machine speed. Gate on irreversibility and blast radius (the email that sends, the card that charges, the table that drops) rather than on whether the model happened to sound unsure. A confident model is not a correct one, and a long chain decays quietly even at 95% per-step reliability: if steps fail independently, ten of them all succeed only about 60% of the time.

A last trap worth naming is the heavy orchestration framework that hides the actual prompts and responses behind its own abstractions, so when the workflow misbehaves you are debugging the framework instead of the call. Build the bare version against the raw API once before you reach for a framework, so you know what it is hiding.
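The compounding behind the 95% per-step figure is easy to check. A minimal sketch, assuming every step succeeds or fails independently of the others (real steps often do not, so treat the numbers as a rough guide); `chain_reliability` is a hypothetical helper name.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step ** steps


for n in (3, 5, 10, 20):
    print(f"{n:>2} steps at 95% each: {chain_reliability(0.95, n):.1%}")

# Prints:
#  3 steps at 95% each: 85.7%
#  5 steps at 95% each: 77.4%
# 10 steps at 95% each: 59.9%
# 20 steps at 95% each: 35.8%
```

This is why a check on the intermediate output matters more as the chain gets longer: each gate resets the compounding instead of letting a bad result flow downstream.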

Do it now

Before you turn a one-off into a workflow, run it through this filter. Paste it into a fresh note, answer the lines honestly, and let the rungs and the freeze-check stop you from building the wrong thing:

Paste this
WORKFLOW PROMOTION CHECK — fill before writing any orchestration code

1. The one-off I'm promoting: <the chat sequence / task that worked once>

2. Can I draw the flowchart?  (list the steps + branches in order)
   - Step 1 -> Step 2 -> ... :
   - Branches / conditions :
   - If I CANNOT enumerate these in advance, stop: this may genuinely
     need an agent. If I CAN, I'm building a workflow, not an agent.

3. Lowest rung that actually works (pick ONE, justify):
   [ ] single prompt        [ ] prompt + one tool     [ ] fixed chain
   [ ] routing              [ ] orchestrator-workers   [ ] autonomous agent
   - Why the rung BELOW this one fails (with evidence, not a guess):

4. Freeze check — has the shape stabilized?
   - How many genuinely DIFFERENT inputs have I run it on? (need >= 3)
   - If < 3: do NOT codify yet. Run it on real variety first.

5. Side effects in any step? (writes, sends, charges, deletes)
   - If yes: separate the model's decision from the action and make the
     action idempotent. The automation lesson covers the full pattern.

6. Gates — where does a human or a check sit?
   - After which step does an INVALID intermediate get caught? :
   - Before which IRREVERSIBLE action does a checkpoint sit? :
   - (Gate on irreversibility x blast radius, never on model "confidence".)

Build only the rung from line 3. If lines 2 or 4 fail, you are not ready.

The check earns its keep on lines 2 and 4. Line 2 is the flowchart filter, and most of the time drawing it out reveals the task was a workflow all along. Line 4 is the brake on the other failure — it refuses to let you freeze a pattern you have seen exactly once. Lines 5 and 6 are what separate a workflow that survives a retry and a real side effect from one that double-charges a card the first time a step times out.
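If you would rather have the check refuse to pass than rely on self-discipline, its blocking lines can be encoded directly. This is a sketch only: the `PromotionCheck` fields and the `blockers` function are hypothetical names that mirror lines 2, 4, 5, and 6 of the template.

```python
from dataclasses import dataclass, field


@dataclass
class PromotionCheck:
    steps: list[str] = field(default_factory=list)  # line 2: the flowchart
    distinct_inputs_run: int = 0                     # line 4: freeze check
    has_side_effects: bool = False                   # line 5
    side_effects_idempotent: bool = False            # line 5
    irreversible_actions_gated: bool = True          # line 6


def blockers(check: PromotionCheck) -> list[str]:
    """Return the reasons this one-off is not ready to become a workflow."""
    problems = []
    if not check.steps:
        problems.append("line 2: cannot enumerate the steps in advance")
    if check.distinct_inputs_run < 3:
        problems.append("line 4: run it on at least 3 different inputs first")
    if check.has_side_effects and not check.side_effects_idempotent:
        problems.append("line 5: make the side-effecting action idempotent")
    if not check.irreversible_actions_gated:
        problems.append("line 6: add a gate before the irreversible action")
    return problems


# A chat that worked exactly once: the flowchart is clear, the shape is not proven.
first_pass = PromotionCheck(steps=["classify", "draft", "refund"], distinct_inputs_run=1)
found = blockers(first_pass)
```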

Worked example

Illustrative. A constructed before/after to show the promotion, not a real system.

A support team has a chat sequence that worked: paste a customer ticket, ask the model to classify it, ask it to draft a reply, ask it to flag anything needing a refund. Three good answers, one conversation. The proposal on the table is to build an autonomous agent that "handles tickets end to end."

Run the check first. Line 2: can you draw the flowchart? Yes — classify, then draft, then a refund decision. It is three boxes and one branch. That single answer kills the agent proposal; the task shape is fully enumerable, so the rung is a fixed chain, not an autonomous loop. Line 4: the sequence has been run on exactly one ticket, so it is not ready to freeze. The team runs it on a dozen real tickets first and discovers the class it had not seen — tickets that are two issues in one, which the single-classify step mangles. The shape was not stable. Codifying after the first success would have baked in the bug.

The promoted version, once the shape settles, is a chain the team owns rather than a chat the model improvises:

Step 1: classify (model call). Input: ticket text. Output validated against a fixed enum of categories; an unknown category fails the gate and routes to a human instead of guessing.

Gate: if the ticket matches two categories, split it and run the chain twice. This is the case the dozen-ticket run surfaced.

Step 2: draft reply (model call). Reason in plain language first, then the reply is extracted into the response field, because forcing the draft straight into a rigid template degrades the writing.

Step 3: refund decision (model call), then the action. The model's decision is cached on a hash of the ticket id, so a retry after a timeout reuses the same decision instead of resampling a different one. Issuing the refund is idempotent, keyed on the ticket id, so a double-run refunds once. And because issuing a refund is irreversible, this step gates to a human above a dollar threshold no matter how confident the model is.

It is the same three model calls the original chat made. What changed is that the order, the validation, the retry behavior, and the one human checkpoint now live in code the team wrote, so the fortieth ticket runs the way the first good one did, which the improvised chat could never promise.
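The classify gate and the refund step of that chain can be sketched as follows. Everything named here is an assumption made for illustration: the category set, the dollar threshold, the in-memory dictionaries standing in for a durable cache and a payment system, and the stub model. The draft step and the two-category split are left out to keep the sketch short.

```python
import hashlib
from typing import Callable

CATEGORIES = {"billing", "shipping", "technical"}  # hypothetical fixed enum
HUMAN_REVIEW_THRESHOLD = 50.00                     # hypothetical dollar limit

decision_cache: dict[str, bool] = {}    # stand-in for a durable store
refunds_issued: dict[str, float] = {}   # stand-in for a payment system


def classify(ticket_text: str, call_model: Callable[[str], str]) -> str:
    """Step 1: the label must be in the enum, otherwise fail the gate."""
    label = call_model(f"Classify this ticket: {ticket_text}").strip().lower()
    if label not in CATEGORIES:
        raise ValueError(f"unknown category {label!r}; route to a human")
    return label


def refund_decision(ticket_id: str, ticket_text: str,
                    call_model: Callable[[str], str]) -> bool:
    """Step 3a: decide once per ticket; a retry reuses the cached decision."""
    key = hashlib.sha256(ticket_id.encode()).hexdigest()
    if key not in decision_cache:
        answer = call_model(f"Refund needed? Answer yes or no: {ticket_text}")
        decision_cache[key] = answer.strip().lower() == "yes"
    return decision_cache[key]


def issue_refund(ticket_id: str, amount: float) -> str:
    """Step 3b: idempotent on ticket id, gated to a human above the threshold."""
    if ticket_id in refunds_issued:
        return "already_refunded"
    if amount > HUMAN_REVIEW_THRESHOLD:
        return "needs_human"
    refunds_issued[ticket_id] = amount
    return "refunded"


model_calls: list[str] = []


def stub_model(prompt: str) -> str:
    """Canned model that records each call, so the sketch runs offline."""
    model_calls.append(prompt)
    return "billing" if prompt.startswith("Classify") else "yes"


category = classify("I was charged twice.", stub_model)
first = refund_decision("T-1001", "I was charged twice.", stub_model)
retry = refund_decision("T-1001", "I was charged twice.", stub_model)  # cached
status_1 = issue_refund("T-1001", 20.00)
status_2 = issue_refund("T-1001", 20.00)   # double-run, refunds once
status_3 = issue_refund("T-2002", 500.00)  # above threshold, goes to a human
```

The retry of `refund_decision` makes no second model call, and the second `issue_refund` changes nothing, which is the decision/action split the refund step describes.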

Figure: The promoted chain. The three model calls from the original chat, now fixed in order with a split-gate after classify, free-text-then-extract at draft, and a decision/action split plus an irreversibility gate at the refund step.