PLAYBOOK — AGENTIC ENGINEERING — 2026-06-17
How to run CLI coding agents without burning your budget or your codebase
Running a CLI coding agent effectively requires treating it as a contractor, not a vending machine. Separate planning from execution, gate every session with a structured project-brain file, watch your context percentage like a fuel gauge, and set a definition of done before you type the first word. Done that way, agentic coding delivers a compounding productivity gain. Done without discipline, it generates thousands of unreviewed lines, burns through a month's token budget in an afternoon, and leaves the codebase worse than before.

This is the discipline layer that most tutorials skip.
CH.01
What is the core bet, and why does it actually work?
A CLI coding agent is not a faster autocomplete. It is a reasoning loop: observe the codebase state, think through a plan, act by editing files or running commands, then repeat until the task is done. Every iteration consumes tokens from a finite context window. That loop can run well or badly depending almost entirely on how you set it up — the model's raw capability is rarely the bottleneck.
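The loop described above can be sketched in a few lines of TypeScript. This is only an illustration under stated assumptions: `AgentStep`, `StepResult`, and `runLoop` are hypothetical names, not part of any real agent SDK, and the token counts are supplied by the caller rather than measured.

```typescript
// Hypothetical shapes for illustration; no real agent SDK is assumed.
interface StepResult {
  tokensUsed: number; // tokens this iteration added to the context
  done: boolean;      // true once the definition of done is met
}

// One observe-think-act iteration, identified by its index.
type AgentStep = (iteration: number) => StepResult;

interface LoopOutcome {
  iterations: number;
  tokensSpent: number;
  finished: boolean; // false means the budget ran out first
}

// Repeat iterations until the task is done or the context budget is used up.
// The budget is checked before each iteration, so the last one may overshoot.
function runLoop(step: AgentStep, contextBudget: number): LoopOutcome {
  let tokensSpent = 0;
  let iterations = 0;
  while (tokensSpent < contextBudget) {
    const result = step(iterations);
    iterations += 1;
    tokensSpent += result.tokensUsed;
    if (result.done) {
      return { iterations, tokensSpent, finished: true };
    }
  }
  return { iterations, tokensSpent, finished: false };
}
```

The point of the sketch is the second exit path: a loop that runs out of budget before reaching done has spent the tokens and delivered nothing, which is why setup matters more than raw model capability.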
The bet that works in practice: separate the planning pass from the execution pass. Let the heavier, slower model (the one best at multi-step reasoning) produce a written plan without touching any files. Then hand that plan to the faster, cheaper model to execute one change at a time. This is not a cost-saving trick — it is a quality move. Planning under execution pressure produces worse plans. Execution without a written plan produces drift: the agent pursues a locally-sensible sequence of edits that, viewed from above, is heading in the wrong direction.
The contractor analogy is the right mental model. You would not hand a builder a one-line brief and say "build me a house, here are the keys." You would agree scope, review the blueprint, then let them work. The same protocol produces better results from a coding agent.
CH.02
How do you set up the project brain?
The single most important file in an agentic coding setup is the persistent memory file the agent reads at the start of every session — call it CLAUDE.md, agents.md, or whatever your tool expects. It is the project brief your contractor reads before picking up a tool.
Four things this file must do:
Compress the tech stack and conventions: language, framework version, naming rules, which directories are off-limits, how tests are run. The agent cannot be allowed to guess.
State preferences explicitly: commit message format, PR size limits, which linter runs in CI, the fact that this project uses tabs not spaces. Anything that would cause a code review rejection.
Declare what the agent is capable of: agents tend to underestimate their own reach. Explicitly stating "you have write access to src/, can run npm test, and may create new files under components/" prevents the passive, over-cautious behavior of an agent that is afraid to act.
Log every failure: this is the idea that separates good setups from great ones. When the agent tries an approach and it fails (a migration strategy that broke the schema, a regex that misfired on edge cases), record it. The log is a map of invalid solution space. On the next session, the agent does not waste time rediscovering dead ends; it reads the log and skips the approaches it already knows fail. Every logged failure makes the file more valuable.
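One way to picture the failure log is as a list of structured entries that later sessions filter against. The sketch below assumes a hypothetical `FailureEntry` shape and helper names; in practice the log would more likely live as plain text inside the project-brain file.

```typescript
// Hypothetical entry shape; the field names are assumptions for illustration.
interface FailureEntry {
  approach: string; // short label for what was tried
  reason: string;   // why it failed
  date: string;     // when it failed, as an ISO date string
}

// Return a new log with the entry added, unless that approach is already recorded.
function logFailure(log: FailureEntry[], entry: FailureEntry): FailureEntry[] {
  const known = log.some((e) => e.approach === entry.approach);
  return known ? log : [...log, entry];
}

// Drop candidate approaches that the log already marks as dead ends.
function viableApproaches(candidates: string[], log: FailureEntry[]): string[] {
  const dead = log.map((e) => e.approach);
  return candidates.filter((c) => dead.indexOf(c) === -1);
}
```

Matching on an exact label is the simplest possible rule and is an assumption here; a real log read by an agent would rely on the agent recognising similar approaches from the prose description.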
Cap the file at 300 lines and treat that cap as a hard rule. More context does not mean better output. Once the file grows too long, the agent starts averaging across everything it read rather than following specific instructions. Keep the file dense and current. Maintain two levels: a global file for rules that apply to every project (commit style, security constraints, no force-push), and a project file for local context (this schema, this API, this team's conventions). When the two conflict, the project file wins.
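The two rules in this paragraph, the 300-line cap and project-over-global precedence, are mechanical enough to check in code. The following is a minimal sketch; modelling rules as key-value pairs is an assumption made for illustration, since real project-brain files are free-form text.

```typescript
const MAX_BRAIN_LINES = 300; // the cap stated in the text

// True when the file has more lines than the cap.
// A single trailing newline is not counted as an extra line.
function exceedsCap(fileText: string, cap: number = MAX_BRAIN_LINES): boolean {
  if (fileText.length === 0) {
    return false;
  }
  const lines = fileText.replace(/\n$/, "").split("\n");
  return lines.length > cap;
}

// Hypothetical representation: one string value per rule name.
type Rules = Record<string, string>;

// Merge the two levels. Project rules are spread last, so they win on conflict.
function resolveRules(globalRules: Rules, projectRules: Rules): Rules {
  return { ...globalRules, ...projectRules };
}
```

A check like `exceedsCap` could run in a pre-commit hook so that the file cannot quietly grow past the limit.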
CH.03
What does the step-by-step plan mode workflow look like?
Before any edit touches source files, the agent should research the task and produce a written plan. Most CLI agents have an explicit planning mode — a setting that locks file writes and lets the agent read, reason, and propose without acting. Use it on every non-trivial task.
A sound execution loop has four steps:
Define the task in one sentence with a clear definition of done. "Refactor the auth module" is not a task. "Extract the token-refresh logic from auth.ts into a standalone refreshToken(token: string): Promise<string> function, keep the existing tests green, and add one test for the expired-token case" is a task. The definition of done prevents the agent from stopping too early (partial refactor) or too late (sweeping the whole module).
Run plan mode: read the codebase, identify which files are involved, propose the sequence of changes. Review the plan. Push back if it touches files outside scope or proposes an abstraction you did not ask for.
Execute one change at a time: the build-validate loop. Instruct the agent to make one logical change, run the compiler or test suite, and self-correct on errors before moving to the next change. Do not let it queue up five changes and run tests once at the end. Bugs compound; catching them at each step is orders of magnitude cheaper than untangling a chain of interdependent changes.
Use surgical file references: when an edit should be confined to a single file, say so explicitly with the filename. Scope creep inside a session is one of the most common ways agentic edits introduce regressions in files the user never intended to touch.
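The build-validate loop from the third step can be sketched as follows. The `Change` shape and `runChanges` are hypothetical names for illustration; `validate` stands in for whatever compiler or test command the project uses, and `repair` stands in for the agent's self-correction attempt.

```typescript
// Hypothetical shape of one logical change; not taken from any real tool.
interface Change {
  name: string;
  apply: () => void;       // make one logical edit
  validate: () => boolean; // e.g. run the compiler or the test suite
  repair: () => void;      // self-correction attempt after a failed validation
}

interface LoopReport {
  applied: string[];       // changes that validated successfully
  failedAt: string | null; // first change that could not be made to pass
}

// Apply changes one at a time, validating after each.
// Stop at the first change that still fails after maxRepairs repair attempts.
function runChanges(changes: Change[], maxRepairs: number): LoopReport {
  const applied: string[] = [];
  for (const change of changes) {
    change.apply();
    let ok = change.validate();
    for (let attempt = 0; !ok && attempt < maxRepairs; attempt++) {
      change.repair();
      ok = change.validate();
    }
    if (!ok) {
      return { applied, failedAt: change.name };
    }
    applied.push(change.name);
  }
  return { applied, failedAt: null };
}
```

Stopping at the first failure is the design choice that matters: later changes never get stacked on top of a broken one, so a bug is always traceable to a single step.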
Screenshot debugging belongs in this step too. Drag a screenshot of a broken UI into the terminal. The agent reads the visual state and patches the CSS without you describing it in words. This works better than descriptions for layout bugs because you stop translating a visual problem into text that the agent then re-translates back into a visual fix.
CH.04
What are realistic goals and targets?
Reports from operators across use cases converge on a consistent range: expect a 30–60% productivity improvement on tasks that are well-scoped and within the agent's capability range. Not five times. Not ten times. Thirty to sixty percent is a large, compounding gain, and it is worth pursuing seriously. Operators who report much higher multipliers are usually doing one of three things: comparing their current agentic workflow to a much slower baseline, leaving out the time spent reviewing and fixing agent output, or measuring lines generated rather than verified working features shipped.
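A small worked calculation shows how leaving out review and fix time inflates a reported gain. The formula is an assumption chosen for illustration (gain as hours by hand divided by total assisted hours, minus one), and the numbers are invented, not measurements.

```typescript
// Hypothetical timing record for one task; all values are in hours.
interface TaskTiming {
  baselineHours: number; // doing the task by hand
  agentHours: number;    // prompting and waiting on the agent
  reviewHours: number;   // reading the agent's output
  fixHours: number;      // repairing what review found
}

// Gain when review and fix time are counted as part of the assisted workflow.
function netGain(t: TaskTiming): number {
  const assistedHours = t.agentHours + t.reviewHours + t.fixHours;
  return t.baselineHours / assistedHours - 1;
}

// Gain when only generation time is counted, which overstates the result.
function naiveGain(t: TaskTiming): number {
  return t.baselineHours / t.agentHours - 1;
}

// Illustrative numbers only.
const exampleTask: TaskTiming = { baselineHours: 10, agentHours: 4, reviewHours: 2, fixHours: 1 };
const netPercent = Math.round(netGain(exampleTask) * 100);     // 43
const naivePercent = Math.round(naiveGain(exampleTask) * 100); // 150
```

With these invented inputs the same task reads as a 150% gain if only generation time is counted and a 43% gain once review and fixes are included, which lands inside the 30–60% range quoted above.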
Two other targets worth setting explicitly:
The 3–4 month survival rule. A new agentic workflow or tool pattern is worth adopting only after it has survived in real use for three to four months. The space moves fast. A technique that looks transformative in a demonstration in month one is sometimes abandoned by its own advocates by month four. Adopt patterns that have proven durable, not patterns that are trending. Ground your prompts in established engineering literature (clean architecture, proven design patterns) rather than AI-specific persona prompts whose behavior degrades as model versions change.
The context-window fuel gauge. Treat the context percentage the agent displays as your primary operational metric. When it climbs above 70%, the quality of the current session degrades noticeably — the agent starts losing earlier instructions. Either summarize and compact, or close the session and start fresh with a clean context. A single session that consumes the equivalent of half a monthly subscription budget in a few hours is a real failure mode, not an edge case. Watch the gauge.
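The gauge rule can be written down as a small decision function. This is a sketch under assumptions: the 70% threshold and the half-budget alarm come from the paragraph above, while `SessionState`, the `Advice` labels, and the idea of reading spend as a number are hypothetical.

```typescript
const CONTEXT_LIMIT_PERCENT = 70;  // from the text: quality degrades above this
const BUDGET_ALARM_FRACTION = 0.5; // from the text: half a monthly budget in one session

// Hypothetical snapshot of a running session.
interface SessionState {
  contextPercent: number; // the percentage the agent displays
  sessionSpend: number;   // cost of this session so far, in any currency unit
  monthlyBudget: number;  // budget for the month, in the same unit
}

// Hypothetical labels for what the operator should do next.
type Advice = "continue" | "compact-or-restart" | "stop-and-review-spend";

// The spend alarm is checked first because it is the more expensive failure.
function advise(s: SessionState): Advice {
  if (s.sessionSpend >= s.monthlyBudget * BUDGET_ALARM_FRACTION) {
    return "stop-and-review-spend";
  }
  if (s.contextPercent > CONTEXT_LIMIT_PERCENT) {
    return "compact-or-restart";
  }
  return "continue";
}
```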
CH.05
How do you know the agent is doing the right work?
The hardest part of agentic coding is that the agent can be confidently wrong. It will tell you the refactor is complete and the tests pass. Sometimes it is right. Sometimes it has quietly changed the test to match the broken output. Trust requires a verification protocol.
Spot-check, do not rubber-stamp. After every execution pass, read at least one file the agent changed that you did not specifically direct it to touch. If it changed something outside scope, that is a signal to tighten the plan next time, not just to revert. Out-of-scope edits tend to recur until the plan explicitly rules them out.
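Finding the files to spot-check can be partly automated. The sketch below assumes the list of changed paths is already available (for example from the output of `git diff --name-only`) and that scope is expressed as path prefixes; `outOfScopeFiles` is a hypothetical helper.

```typescript
// Return the changed files that fall outside every allowed path prefix.
// indexOf(prefix) === 0 means the path starts with that prefix.
function outOfScopeFiles(changed: string[], allowedPrefixes: string[]): string[] {
  return changed.filter(
    (file) => !allowedPrefixes.some((prefix) => file.indexOf(prefix) === 0)
  );
}

// Illustrative paths only.
const changedFiles = ["src/auth.ts", "src/auth.test.ts", "package.json"];
const surprises = outOfScopeFiles(changedFiles, ["src/auth"]);
// surprises is ["package.json"]: the file to read first.
```

A non-empty result is not proof of a mistake, only a pointer to where reading should start.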
Demand citations for any non-obvious decision. If the agent chooses an approach that is not obvious — a specific algorithm, a non-standard library, an architectural choice — ask it to justify the choice and name the trade-offs. An agent that cannot articulate why it made a decision is an agent that guessed. Guesses in codebases accumulate into technical debt.
Watch for the confidently-incorrect failure mode. The agent's fluent prose and clean formatting are independent of its correctness. A well-structured explanation of a wrong approach reads exactly like a well-structured explanation of a right one. Four signals should trigger skepticism: the plan touches more files than you expected; the agent proposes an abstraction for a single-use case; the test count went up, but the new tests look too easy to pass given the complexity of the code they cover; the output is much longer than the task description suggested it would be.
The dual-critic pattern can help on complex tasks: after the main agent produces a plan or output, route the same output to a second agent instance with the instruction "find the flaws in this approach." The critique pass is cheap; the rework from a missed flaw is not.
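As a rough sketch, the dual-critic pattern is two calls in sequence. The `Agent` type here is a hypothetical stand-in for however your tool invokes a model, and it is synchronous only to keep the example short; real agent calls are normally asynchronous.

```typescript
// Hypothetical agent interface: prompt in, text out. Simplified to synchronous.
type Agent = (prompt: string) => string;

// The critic instruction quoted in the text.
const CRITIC_INSTRUCTION = "Find the flaws in this approach.";

interface Reviewed {
  output: string;   // what the main agent produced
  critique: string; // what the second instance said about it
}

// Run the main agent, then hand its output to a second instance for critique.
function withCritic(task: string, main: Agent, critic: Agent): Reviewed {
  const output = main(task);
  const critique = critic(`${CRITIC_INSTRUCTION}\n\n${output}`);
  return { output, critique };
}
```

The critic receives only the output and the instruction, not the main agent's reasoning, so it cannot simply agree with an argument it has already been shown.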
CH.06
How do you actually run it — tools, parallelism, cost?
The model handoff. Use your most capable reasoning model for the planning pass — plan mode, architecture decisions, complex debugging. Use the faster execution-tier model for the edit-compile-test loop. This is not primarily a cost decision; it is a quality decision. Heavy reasoning on simple edits produces over-engineered solutions. Efficient execution models on well-defined plans produce clean, minimal changes.
The effort dial. Most agents expose a setting that trades reasoning depth against token spend. Reserve the highest effort setting for the planning pass and for complex bugs where the root cause is genuinely unclear. Use a lower effort setting for mechanical changes: adding a field to a schema, writing a unit test for a function that already exists, renaming a variable throughout a file. Matching effort to task complexity is the single most effective cost-control lever available.
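The model handoff and the effort dial together amount to a routing table. A minimal sketch follows; the task kinds, tier names, and effort labels are hypothetical, and real tools expose these settings under their own names.

```typescript
// Hypothetical categories, limited to the kinds of work named in the text.
type TaskKind = "planning" | "complex-debugging" | "mechanical-edit";
type ModelTier = "reasoning" | "execution";
type Effort = "high" | "low";

interface Route {
  model: ModelTier;
  effort: Effort;
}

// Planning and unclear bugs get the reasoning model at high effort;
// mechanical changes get the execution model at low effort.
function routeTask(kind: TaskKind): Route {
  switch (kind) {
    case "planning":
    case "complex-debugging":
      return { model: "reasoning", effort: "high" };
    case "mechanical-edit":
      return { model: "execution", effort: "low" };
  }
}
```

Keeping the mapping in one place makes it easy to audit: if a mechanical edit is running at high effort, the cost is visible as a routing mistake rather than a surprise on the bill.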
Parallelism and worktrees. The practical ceiling for useful parallel agent streams is around four, bounded by human review capacity, not by what the tool can spawn. Running more streams than you can review is how you generate unreviewed output that accumulates silently until it breaks something. For each parallel stream, use a git worktree — a separate checked-out working copy of the repository. This prevents agents from colliding on shared files. After the work merges, delete the worktree and its dependencies. Each new worktree gets a fresh install.
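The worktree lifecycle can be sketched as a function that produces the git commands for each stream. `git worktree add -b` and `git worktree remove` are standard git subcommands; the `Stream` shape, the example paths, and the choice to throw above four streams are assumptions for illustration. The function only builds command strings and does not run them.

```typescript
const MAX_STREAMS = 4; // the practical review ceiling given in the text

// Hypothetical description of one parallel agent stream.
interface Stream {
  branch: string; // new branch the agent will work on
  path: string;   // directory for the separate working copy
}

interface WorktreePlan {
  setup: string[];   // run before the agents start
  cleanup: string[]; // run after the work merges
}

// Build the setup and cleanup commands, refusing more streams than can be reviewed.
function planWorktrees(streams: Stream[]): WorktreePlan {
  if (streams.length > MAX_STREAMS) {
    throw new Error(`Too many streams: ${streams.length} exceeds ${MAX_STREAMS}`);
  }
  return {
    setup: streams.map((s) => `git worktree add -b ${s.branch} ${s.path}`),
    cleanup: streams.map((s) => `git worktree remove ${s.path}`),
  };
}
```

For example, a stream with branch `agent/auth` and path `../wt-auth` yields `git worktree add -b agent/auth ../wt-auth` for setup and `git worktree remove ../wt-auth` for cleanup. The dependency install for each fresh worktree is left out because it depends on the project's package manager.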
MCP versus bash: a judgment call. MCP (Model Context Protocol) servers are a general-purpose integration layer for connecting agents to external services. They are excellent for unfamiliar internal systems or specialized retrieval. For common platforms and simple one-off operations (a single API call, a GitHub push, a database query), a bash command, a curl call, or a CLI tool is often faster to set up and easier to debug. Do not reach for MCP by default. Reach for the simplest tool that does the job.
The memory layer is not optional. Context window is short-term memory. The project-brain file is medium-term memory. For anything that runs across multiple sessions, you need a third layer: a persistent, searchable log of decisions, outputs, and failures that the agent reads at the start of each session. An agent without persistent memory starts every session as a new hire on their first day. An agent with good persistent memory starts every session already knowing your codebase, your preferences, and what did not work last time. The memory layer is what turns a tool into an asset that compounds.
CH.07
What are the failure modes worth naming?
Generating more than reviewers can read. Agentic coding can produce thousands of lines per day. In a team setting, output that cannot be reviewed cannot be merged safely, and output that is merged without review is future technical debt. The constraint is review bandwidth, not generation speed. Calibrate your daily output target to what you can actually read carefully.
Skipping plan mode. When a task feels simple, plan mode feels like overhead. This is when skipping it causes the most damage. Simple-seeming tasks are often the ones where the agent's interpretation of scope diverges most from yours, because the briefer the instruction, the more the agent fills in from its own assumptions. Plan mode takes two minutes. A wrong execution pass takes two hours to unwind.
Context-window neglect. Starting a new, unrelated task in a session that is already at 80% context is one of the most reliable ways to get confused output. The agent is still carrying the weight of the previous task — its tools, its files, its constraints — even though you have moved on. New task: fresh session.
The intermediate plateau. A pattern that appears after the first few weeks of agentic coding: you find the tasks the agent does well and you stop there. The agent handles boilerplate, test generation, and small refactors. You never push it into architecture decisions, cross-file analysis, or complex debugging — the high-impact work. The plateau is comfortable and feels like success. It is not. The real productivity gain comes from gradually expanding the scope of what you give the agent to plan, not just to execute. That expansion requires the verification discipline above, and it requires you to get comfortable pushing back on agent plans that are wrong.
Adopting every new pattern immediately. The agentic tooling space releases new techniques, wrappers, and workflow patterns at a rate that makes sequential adoption impossible. Most of them are not wrong — they are just untested over time. The operators who have sustained the highest productivity gains are not the ones who adopted the most tools; they are the ones who picked a small, stable set and got very good at using them. New pattern: wait, observe whether it survives in real codebases, then decide.