Agentic engineering · 2026-06-18

Running Claude Code and CLI coding agents without burning your budget: harness design and context discipline

Most people drive Claude Code and CLI coding agents like autocomplete and watch the bill climb. The operators who ship treat the harness — context discipline, plan-then-execute, model routing, permissions, verification — as the real edge. Here's how to run agents cheap and ship production code.



The invoice is where it gets real. One operator watched a single agent-team run eat roughly 5 million input tokens — half a $200 monthly limit — in one sitting; another burned $80 in fifteen minutes letting a team of agents argue with itself. @Mnilax watched his Opus month climb to $340 before he found the one setting that dropped it to $87. None of them had a worse model than you. They had a worse harness. Claude Code, Codex, Gemini CLI and their cousins can write a whole application from a one-line brief — yet most people still drive them like advanced autocomplete: type a prompt, watch the spinner, hope it compiles, and pay for every wrong turn. The operators who actually ship stopped treating the agent like a chatbot and started treating it like a brilliant junior with no memory, no context, and no sense of boundaries — then built the memory, the context, and the boundaries by hand. That discipline layer is the whole game, and it's where the budget is won or lost.


CH.01

What actually separates shipping from tinkering?

The dividing line is not skill with the model. It is the architecture you build around it. Every serious operator in the field lands on the same boundary, and it has nothing to do with which model you pay for.

Tinkering is "vibe coding" — Andrej Karpathy's term for the state where the model produces code on its own and you rarely intervene. Shipping is agentic engineering: the deliberate design of loops, memory, and verification so the system finishes the task while you manage the constraints. One operator drew the line cleanly: you can outsource your thinking, but you can't outsource understanding. The agent handles implementation; you keep oversight of architecture, judgment, and taste. Karpathy frames the same shift as Software 1.0 → Software 3.0 — your lever over the computer is no longer the code you write but the context you manage. Manage it and the agent stays on spec. Neglect it and the agent fills the window with noise and drifts.

The gap shows up in three concrete places.

Gap	Tinkerer	Shipper
Context	lets the window bloat until the model forgets the spec, re-introduces fixed bugs, and loops	compacts, clears, and routes work to fresh windows
Verification	accepts the first output	runs adversarial review and automated QA before anything ships
Delegation	runs one thread	runs parallel streams on isolated git worktrees with specialized sub-agents

Everything below is the harness that closes those three gaps.

CH.02

Why does every long session eventually rot?

The dominant failure mode has a name — context rot — and the second is the vague prompt aimed at an autonomous agent. Learn to spot both and most of the chaos disappears.

Context rot: every file the agent reads, every command it runs, and every error it hits gets stuffed into one opaque window, on top of a system prompt that is already thousands of tokens. Once the window fills, the model drifts — it forgets the spec, re-introduces bugs it just fixed, and starts looping. Repeated mistakes, forgotten instructions, hallucinated file contents: that is the single most-cited reason long sessions fall apart. Treat the context percentage the agent shows you as your primary operational metric, not a curiosity.

The second failure mode is the vague prompt against an autonomous agent. Tell one to "build a landing page" and it will scaffold a React app, install Tailwind, wire up routing, and write five components before you can blink. You didn't ask for React. You didn't ask for Tailwind. But the agent has autonomy and you gave it no constraints — so you lit money on fire, confidently, building the wrong thing. The fix for both failures is the same: stop prompting into the void and start running a defined loop against a defined brain.

CH.03

Which slash commands actually earn the keystrokes?

@charliejhills' command reference is the vocabulary — but the commands are not equally important, and their value shifts with where you are in a project. Here is the working set.

Command	What it does	When to fire it
`/init`	reads your local files and writes a `CLAUDE.md` project brief	first move on any new codebase
`/plan`	switches to read-only mode; maps every step before touching files	before any non-trivial task — "five minutes of planning saves an hour"
`/context`	shows token-window usage	check at ~15% — most people are 15–20% in before they look
`/compact`	compresses history into a dense summary	at ~50% usage (some say 60%); never wait for 95%
`/clear`	full reset between distinct tasks	switching from the auth module to the pricing page
`/goal`	defines a finish line and a done-criterion	autonomous, long-horizon work (see goals, below)
`/agents`, `/workflows`	spawn and orchestrate specialized sub-agents	big tasks (see scaling)
`/model`	switches Opus / Sonnet / Haiku	match model to task horizon (see routing)
`/effort`	sets reasoning depth (low / medium / high / max)	Low for speed, Max for hard problems
`/permissions`	pre-approves safe commands	kill prompt fatigue
`/rewind`	undoes recent actions	recover from a bad edit
`/simplify`	three agents review code in parallel for architecture issues and duplicates	after a messy build
`ultracode` (keyword)	triggers a full agent workflow automatically	only after `/plan` is done — never as a replacement for it

/compact is not a cleanup tool; it is a production necessity once a session exceeds ~20 messages, where token cost compounds non-linearly. And old context degrades new output: if you just finished an auth module, /clear before the pricing page so auth logic doesn't bleed into pricing.
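The compounding is easy to see with arithmetic. The sketch below is a simplified model, not a billing calculator: it assumes every turn resends the full history as input, ignores prompt caching, and uses made-up sizes (2,000 tokens per message, a 3,000-token summary after each compaction). The function name and parameters are hypothetical.

```python
def cumulative_input_tokens(turns, tokens_per_turn, compact_every=None, summary_tokens=0):
    """Total input tokens across a session under a simplified model.

    Assumption: each turn resends the whole history, so a turn pays for
    everything accumulated before it plus its own message.
    """
    total = 0
    history = 0
    for turn in range(1, turns + 1):
        history += tokens_per_turn   # this turn's message joins the window
        total += history             # the whole window is sent as input
        if compact_every and turn % compact_every == 0:
            history = summary_tokens  # compaction swaps history for a summary
    return total


no_compact = cumulative_input_tokens(40, 2_000)
with_compact = cumulative_input_tokens(40, 2_000, compact_every=10, summary_tokens=3_000)
print(no_compact)    # 1640000
print(with_compact)  # 530000
```

Under these assumptions, doubling the session from 20 to 40 messages roughly quadruples the input total, while compacting every ten messages cuts the 40-message total to about a third.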

CH.04

Plan first, execute surgically — the four-step loop

Shipping looks like plan → pick the model → execute one change at a time → compact. It does not look like prompt, watch, prompt again. The planning step pays back the most, which is why the field keeps repeating that one minute of planning saves ten minutes of building.

Most CLI agents expose a plan mode (Shift+Tab in the terminal in most demos, or a toggle in the desktop app). It locks file writes and lets the agent read, research interfaces, and reason into a written plan before it touches a single source file. You review and approve before anything changes. @Voxyz_ai pushes this into a full "shift-left on architecture" pattern: the agent must first analyze requirements, list edge cases, design the architecture, and plan — delivering folder structure, schema, API design, and performance notes — before writing code.

A sound execution loop has four steps:

1. Define the task in one sentence with a definition of done. "Refactor the auth module" is not a task. "Extract the token-refresh logic into a standalone `refreshToken(token: string): Promise<string>`, keep the existing tests green, and add one test for the expired-token case" is a task. The definition of done stops the agent quitting early (a partial refactor) or sweeping too far (rewriting the module).
2. Run plan mode. Let it read the codebase, identify the files, and propose the sequence. Review it. Push back if it touches files outside scope or proposes an abstraction you never asked for. A thin plan that fails on the first build is a planning bug to fix upstream, not an execution bug.
3. Execute one change at a time: the build-validate loop. One logical change, run the compiler or tests, self-correct, move on. Do not let the agent queue five changes and test once at the end; bugs compound, and catching them per step is far cheaper than untangling a chain.
4. Use surgical file references. `@styles.css make all the buttons gold` restricts the edit to that one file. Drop the `@` and the agent may decide to "improve" five other files on the way. Scope creep inside a session is one of the most common ways agentic edits introduce regressions in files you never meant to touch.

Then compact — or /clear for a clean break, but first write a short hand-off note so the next session resumes without rediscovery. If you work by heavy voice dictation, pipe the raw transcript through a cheaper model in a separate tab first, then send only the tight summary to your primary agent — don't pay your best model to read your rambling. Verify the plan worked by checking the generated file structure matches the proposal and the first compile or test run goes green.
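The build-validate step can be sketched as a small driver: apply one change, run the tests, hand the failure output back, and stop after a bounded number of retries. This is an illustration only; `build_validate`, the `changes` callables, and the `fix` callback are hypothetical stand-ins for whatever your agent exposes, and the test command is whatever your project already uses.

```python
import subprocess
import sys
from typing import Callable, Sequence


def build_validate(
    changes: Sequence[Callable[[], None]],
    test_cmd: Sequence[str],
    fix: Callable[[str], None],
    max_retries: int = 2,
) -> bool:
    """Apply changes one at a time, validating after each one.

    `fix` is a hypothetical hook that receives the failing output and
    attempts a correction (in practice, the agent reading its error log).
    Returns False as soon as a change cannot be made green.
    """
    for apply_change in changes:
        apply_change()
        for attempt in range(max_retries + 1):
            result = subprocess.run(test_cmd, capture_output=True, text=True)
            if result.returncode == 0:
                break  # this change is green, move to the next one
            if attempt == max_retries:
                return False  # stop here instead of stacking more changes
            fix(result.stdout + result.stderr)
    return True


if __name__ == "__main__":
    # Trivial demo: a no-op change and a "test suite" that always passes.
    print(build_validate([lambda: None], [sys.executable, "-c", "pass"], fix=print))
```

The point of the shape is the early `return False`: a red step halts the queue, so a failure never has four later changes piled on top of it.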

CH.05

Which model should plan, and which should execute?

The most consistent pattern in the entire field: a heavy model thinks, a cheap model executes — roughly four-fifths of the work goes to the cheap one. And this is a quality decision as much as a cost one.

Plan and architect with Opus, because it reasons deeper and catches edge cases. Lock the plan, then switch to Sonnet for execution — faster, cheaper — and command it to follow the plan exactly. Heavy reasoning on simple edits produces over-engineered solutions; an efficient execution model working from a clear plan produces clean, minimal changes. The /effort dial (low / medium / high / max) is the finer control: reserve the highest setting for the planning pass and for bugs with an unclear root cause, a lower setting for mechanical work. Matching effort to task complexity is the single most effective cost-control lever available.

Situation	Mode	Model tier	Effort
New feature, real architecture	Plan mode, then execute	Opus to plan; Sonnet to build	High for the plan
Quick edits, iterative refinement	Direct execution	Sonnet	Low
Strict instruction-following, ship-to-prod	Plan, then autonomous execution in worktrees	Opus plan; Sonnet build	High plan, low build
Complex bug, unclear root cause	Plan mode, deep reasoning	Opus	Max
Research-heavy, document generation	Default	Codex (more token-efficient)	—
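The table reads naturally as a lookup. A minimal sketch, assuming you want the routing decision written down rather than remembered: the situation keys, the `Route` type, and the lowercase model labels are hypothetical names, not identifiers any CLI accepts, and the Codex row is left out because it routes to a different tool. Where the table gives one model and one effort (the unclear-bug row), the sketch applies them to both phases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    plan_model: Optional[str]    # None means no separate planning pass
    build_model: str
    plan_effort: Optional[str]
    build_effort: Optional[str]  # None means the table does not specify it


# Illustrative encoding of the first four table rows.
ROUTES = {
    "new_feature": Route("opus", "sonnet", "high", None),
    "quick_edit": Route(None, "sonnet", None, "low"),
    "ship_to_prod": Route("opus", "sonnet", "high", "low"),
    "unclear_bug": Route("opus", "opus", "max", "max"),
}


def route(situation: str) -> Route:
    """Return the routing for a situation, or fail loudly on an unknown one."""
    try:
        return ROUTES[situation]
    except KeyError:
        raise ValueError(f"unknown situation {situation!r}; known: {sorted(ROUTES)}")


print(route("ship_to_prod"))
```

Failing on an unknown situation is deliberate: a silent default is how a rename ends up on the most expensive tier.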

The advanced move is routing across tools: keep planning and frontend on your premium session (its visual output is worth the price), then write a clear spec and dispatch backend and heavier implementation to a cheaper model — or to a quota that's sitting unused. Voxyz_ai's exact dispatch instruction:

That multi-model routing — Opus/Fable for planning and frontend, Codex/GPT for backend — is the pattern professionals use the moment they start hitting token limits on high-end models.

CH.06

CLAUDE.md: the brain you write by hand

CLAUDE.md is not documentation. It is the agent's operational brain, injected at the very top of context before your first prompt, narrowing the model's output paths onto the trajectory you want. Run /init and the agent scans the codebase and writes a starter; you refine it by hand.

The field runs two competing philosophies, and both are right — at different times.

Minimalism (@BorisCherny, Claude Code's creator): "My entire CLAUDE.md is two lines. With every new model you need less instruction, not more, most people are over-engineering the one thing that should stay tiny."
Systematization (@Mnilax): a machine-readable design system — typography, spacing, cards, buttons, inputs, hierarchy — that cut a 10-screen dashboard rebuild from ~6 hours to ~40 minutes, because the model stopped guessing the look fresh every time.

The resolution is project phase. For exploration and prototyping, Cherny wins — minimal context lets the model's general capability shine. For production systems with multiple contributors or frequent UI changes, Mnilax's approach dominates, and the design system has to be machine-readable and referenced by name in prompts. @humzaakhalid's 65-line file is the practical middle ground, with four enforceable rules: think before coding (state assumptions, ask before assuming), simplicity first (minimum code, nothing speculative), surgical changes (touch only what's required), goal-driven execution (define success, loop until verified). Those four map onto the four failure modes the field keeps documenting: silent assumptions, code bloat, collateral damage, infinite loops.

What goes in it: the tech stack and conventions; the build, test, and lint commands (this is what powers the self-correcting loop); the directory structure so the agent stops dropping files at random; security rules (never hardcode keys, use .env, never commit it); a plain statement of what the agent is capable of (agents act timidly and underestimate their reach — tell it it has write access here, may run the test command, may create files there); and a running log of failure modes.

Two refinements raise this from a config file to a brain.

Keep it capped and hierarchical. Past ~500 lines the file bloats the window, raises cost, and degrades reasoning — the model averages across everything rather than following specific instructions. Consensus lands at 200–500 lines, with one staff-engineer voice arguing for around 300: the more bloat in your context, the less likely the AI does exactly what you want. Keep a global file for universal rules and a project file for local context; when they conflict, the project file wins. Put the critical guardrails at the very top to exploit the model's primacy bias. For large projects, split a monolith into a rules/ folder (code-style.md, security.md, testing.md) and load only what's relevant.

Make it self-modifying — the strongest idea in the batch. Instruct the agent: whenever I correct a mistake, or you hit a bug from a wrong assumption, append a rule to a "Learned Rules" section in a strict shape — Category: [never/always] do X because Y. The sharpest articulation of why: logging failures mathematically carves out invalid solution space, letting the agent skip the 80% it already knows fails. When the same mistake recurs two or three times, that's the signal to write the rule, not repeat the correction.

# Learned Rules
- Files: never delete the .env file because it holds local secrets and is not recoverable.
- Tooling: always use the project's package manager, never a different one, because lockfiles diverge.
- Migrations: never edit a committed migration; always add a new one because old ones are already applied elsewhere.
- Style: always write succinctly because verbose replies waste the context budget.
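The strict shape is what makes the log usable, so it is worth enforcing mechanically. The sketch below formats a rule as `Category: [never/always] do X because Y` and appends it once. It assumes "Learned Rules" is the last section of the file; `format_rule` and `append_rule` are hypothetical helpers, not part of any agent tooling.

```python
import re
import tempfile
from pathlib import Path

# "- Category: never|always <action> because <reason>."
RULE_SHAPE = re.compile(r"^- [^:]+: (never|always) .+ because .+\.$")
HEADING = "# Learned Rules"


def format_rule(category: str, polarity: str, action: str, reason: str) -> str:
    """Build one rule line and reject anything outside the strict shape."""
    rule = f"- {category.strip()}: {polarity} {action.strip()} because {reason.strip().rstrip('.')}."
    if not RULE_SHAPE.match(rule):
        raise ValueError(f"rule does not match the required shape: {rule!r}")
    return rule


def append_rule(path: Path, rule: str) -> bool:
    """Append a rule under the heading; return False if it is already logged.

    Assumption: the Learned Rules section sits at the end of the file.
    """
    text = path.read_text() if path.exists() else ""
    if rule in text.splitlines():
        return False
    if HEADING not in text:
        separator = "\n\n" if text.strip() else ""
        text = text.rstrip("\n") + separator + HEADING + "\n"
    path.write_text(text.rstrip("\n") + "\n" + rule + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        brain = Path(tmp) / "CLAUDE.md"
        rule = format_rule("Tooling", "always", "run the linter before committing",
                           "CI rejects unformatted code")
        print(append_rule(brain, rule), append_rule(brain, rule))
```

Returning `False` on a duplicate keeps the file from growing every time the same correction recurs.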

Verify the brain works by opening a fresh session and asking the agent to describe the project architecture without reading individual files. If the description matches reality, the file is carrying its weight.

CH.07

How do you cut the token bill without cutting quality?

This is where the budget is actually won — Mnilax's "tokenmaxxing": audit the settings, route by task, and stop paying migration prices for renames. The savings come from configuration and routing, not from typing less.

Start with the settings audit. ~/.claude/settings.json has ~125 keys, ~40 documented, and 18 that impact billing.

Setting	Action	Effect
`enabledPlugins`	reduce to those actively used	variable
`mcpServers.enabled`	set to `false` rather than deleting unused servers	each enabled server loads its schema at session start; nine unused cost ~30K tokens/session

The single biggest lever is where the response cache breakpoint sits:

Then route by task horizon. The seductive mistake is reaching for maximum reasoning on everything.

Task type	Model / effort	Cost vs. quality
Standard tasks	`high`	baseline
A normal task at `xhigh` (Opus 4.8)	`xhigh`	4× the cost for ~2 quality points out of 100 — not worth it. Paying migration prices for a rename.
Cross-file migration	`xhigh`	quality jumped 78 → 87 — pays back

The rule: high for standard work, xhigh only for long-horizon autonomous jobs like cross-file migrations. And a note on the build-vs-buy line for heavy users:

The metric to drive down is cost per successful task, not cost per token. Tokens saved on work that never shipped are not a win.
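Cost per successful task is a one-line calculation, which is a good reason to actually track it. The numbers below are invented for illustration (100 tasks, a 4× price difference, and 90 versus 92 successes); they are not measurements from the article, and counting a "success" is left to you.

```python
def cost_per_successful_task(total_cost_usd: float, successful_tasks: int) -> float:
    """Spend divided by tasks that actually shipped."""
    if successful_tasks == 0:
        return float("inf")  # money spent, nothing shipped
    return total_cost_usd / successful_tasks


# Hypothetical comparison: same 100 routine tasks at two effort levels.
baseline = cost_per_successful_task(100.0, 90)   # $1 per attempt, 90 ship
premium = cost_per_successful_task(400.0, 92)    # $4 per attempt, 92 ship
print(round(baseline, 2), round(premium, 2))     # 1.11 4.35
```

With these made-up figures the premium setting ships two more tasks and nearly quadruples the cost of each one that ships, which is the same shape as the routine-task row in the table above.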

CH.08

When is MCP worth it, and when is bash enough?

MCP unlocks real capability — but the contrarian call is the right default: don't reach for it first. Most of the field over-reaches here, and it costs both tokens and debugging time.

Model Context Protocol (MCP) is the standardized layer between the agent and external tools: GitHub, Slack, Gmail, databases, browsers. A useful representative stack is Playwright (browser automation), Context7 (live docs, so the agent stops citing deprecated APIs), GitHub MCP (repo access via a PAT), and a Postgres connector for direct reads. Install at local, project, or global scope deliberately to avoid tool bloat. Concrete wins the field has shown:

Shopify: enable Products, Inventory, Orders, Customers — the agent reads live store data and surfaces inventory gaps, SEO weaknesses, missing tags, unfulfilled orders.
Figma: the Chrome extension captures live sites as editable layers; read the structure via Figma MCP and rebuild in your brand.
GitHub: the agent reads and writes issues, PRs, Actions, releases — the repo "stops being a folder of files and becomes something the agent actually runs."
Browser / computer use: the agent opens browsers, navigates, fills forms, screenshots.

The claude-code-setup plugin will scan a project and wire MCP servers, skills, hooks, and sub-agents automatically.

But the contrarian note cuts against the reflex: for common platforms and simple operations, plain bash beats MCP. A curl call, a versioned CLI, or a short script is often faster to set up and easier to debug — direct API calls are reported at 5–10× faster than the MCP path. Reserve MCP for unknown internal services or specialized retrieval where the connector earns its overhead. The decision rule is whether the integration is novel enough to justify a server, or routine enough that one curl already solves it.

One security rule, from @AtsuyaYamakawa's Workato implementation: never run agent API calls through a single shared service account with no tracking. Capture who the agent is representing so logs and tickets reflect the real human actor behind every action.

CH.09

How do you design sub-agents and skills that actually work?

One job, one bounded input, one structured output. The generalist "smart" sub-agent is a token sink that returns vibes. Mnilax's "Barry vs. Mahesh" framework is the cleanest articulation of the failure.

A Mahesh re-confirms what the parent session already knows and returns prose — and each one costs ~20K tokens of spawn overhead for nothing. A Barry does exactly one thing and hands back structured data.

Mahesh (bad)	Barry (good)
"Help me with this code"	"Review this function against these 5 rules, return pass/fail with line references"
"Analyze this project"	"Scan `/src/components` for unused imports, return JSON with file paths and line numbers"
"Improve performance"	"Compare bundle size before/after this change, return the exact byte difference"

Skills are folders of instructions Claude loads on demand to become a specialist without manual prompting, installed via a command, per @charliejhills. The field shows skills for marketing (7 internal agents for SEO/ads), email/Slack/web building, Remotion video editing, and Nano Banana image generation. For production, skills should be versioned alongside prompts and agent logic, tested against golden datasets before deployment, and observable, with every step traced: User query → Agent thought → Tool call → LLM call → Final answer.
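The Barry contract (one job, one bounded input, one structured output) is easiest to enforce by refusing anything that does not parse. A minimal sketch, assuming a sub-agent was asked for the unused-imports scan from the table above; the JSON shape (`file`, `line`) and the names `Finding` and `parse_unused_imports` are choices made for this example, not a format any tool prescribes.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    line: int


def parse_unused_imports(raw: str) -> list[Finding]:
    """Accept only a JSON list of {"file": str, "line": int} objects.

    Prose, partial JSON, or extra keys raise ValueError, so a "Mahesh"
    answer fails fast instead of leaking into the parent context.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of findings")
    findings = []
    for item in data:
        if not isinstance(item, dict) or set(item) != {"file", "line"}:
            raise ValueError(f"unexpected finding shape: {item!r}")
        line = item["line"]
        if not isinstance(item["file"], str) or isinstance(line, bool) or not isinstance(line, int):
            raise ValueError(f"wrong field types in finding: {item!r}")
        findings.append(Finding(item["file"], line))
    return findings


print(parse_unused_imports('[{"file": "src/components/Nav.tsx", "line": 3}]'))
```

The same parser doubles as a golden-dataset check: store a known input, store the expected findings, and compare.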

CH.10

How do you set production-grade goals?

A goal without a definition of done lets the agent quit early or sprawl forever. Voxyz_ai's templates make "done" non-negotiable. The basic /goal refactor until you are happy with the architecture is not enough — it needs testing, review, progress logging, and an explicit completion bar baked in.

The production-grade template requires real-world testing after every meaningful step, auto-review then commit, progress written somewhere sensible, and a final dedicated review pass:

For parallel agent workflows, the goal splits itself into independent pieces and gives each spawned agent its own dedicated /goal with its expected deliverable, verification, and completion standard:

For this task, write yourself a new end-to-end /goal: complete the whole plan, not just the next step, until the architecture, implementation, tests, review, and final result meet the standard. Split that goal into independent pieces, spawn as many parallel agents as needed to do it better and faster, and give each agent its own dedicated /goal that includes its expected deliverable, verification, and completion standard.

For a "YOLO" refactor with banked rate-limit resets — /goal refactor until you are happy with the architecture. ensure you live test after each significant step and autoreview/commit. track progress in /tmp/refactor-{projectname}.md … run high or xhigh reasoning. — and for trend-driven frontend, redesign {your page} … full creative freedom, but it has to be visually striking and interactive, with motion effects and a hidden easter egg. search 2026 design trends first and use them. Two non-negotiables from Voxyz_ai: always use a worktree for heavy refactors to keep the main branch clean, and make done explicit — every dimension at 100%, production-grade, a real user can walk in and use it.

CH.11

How do you scale past one agent without lighting money on fire?

Parallelism is the core strength of CLI agents and the fastest way to set money on fire. The honest ceiling is your review capacity, not the tool's. A single agent on a project with a front end and a back end context-switches between the two and bloats its own window — so you delegate. But you delegate inside guardrails.

Sub-agents protect context and cut cost. A parent agent calls a child — often on a cheaper model like Haiku or Sonnet — and the child runs in its own fresh context window (the standard 200,000 tokens), returning only a summary, not its full transcript. A main Opus agent handed "build a competitor analysis report" can spawn three Sonnet/Haiku researchers (scrape competitor A, scrape competitor B, analyze sentiment) and synthesize their summaries. A reviewer sub-agent spun up with zero prior context is especially valuable because it cannot be biased by the reasoning that produced the code.
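The context saving comes from what crosses the boundary. The sketch below is a toy model of that fan-out: `run_subagent` is a hypothetical stand-in for a child agent (no model is called), its long transcript stays local, and the parent only ever receives the short summary.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> dict:
    """Hypothetical child agent: does its work in its own scope.

    The transcript stands in for everything the child read and ran;
    it is never returned to the parent.
    """
    transcript = [f"{task}: step {i}" for i in range(1000)]
    return {"task": task, "summary": f"{task}: finished in {len(transcript)} steps"}


def fan_out(tasks: list[str]) -> list[dict]:
    """Run children in parallel and collect only their summaries, in order."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        return list(pool.map(run_subagent, tasks))


results = fan_out(["scrape competitor A", "scrape competitor B", "analyze sentiment"])
for result in results:
    print(result["summary"])
```

Three thousand transcript lines were produced and three summary lines reached the parent; that ratio is the whole argument for delegating.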

For structured teams, /agents spawns specialized roles (Planner, Coder, Tester, Reviewer) with file-level permissions and /workflows orchestrates them. In RoundtableSpace's demonstration — a QA engineer, a frontend developer, and a backend developer — the QA agent found three bugs, sent feedback straight to the developers, they fixed them, and the app shipped in a single pass. That only works with strict file-level permissions so agents don't overwrite each other's critical code.

Git worktrees prevent collisions. Two agents editing the same files in the same directory overwrite each other; a worktree gives each its own branch and directory:

git worktree add ../feature-x -b feature-x
# the agent works in ../feature-x on its own branch
# when done, merge back to main, then delete the worktree

Worktrees are disposable: delete them after use and reinstall dependencies in fresh ones. The most grounded operator in the field runs up to four parallel windows at mixed effort (say two high, one medium, one low) and is candid about the limit: "my mental capacity is pretty much four parallel work streams." Run more than you can review and you generate unreviewed output that accumulates silently until it breaks something.
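If you script the setup, the review ceiling can be built into it. A minimal sketch, assuming one worktree per branch next to the main checkout: it only builds the `git worktree add` and `git worktree remove` argument lists (it does not run them), and the `MAX_STREAMS` cap of four is taken from the operator quoted above, not from any git limit. The function names are hypothetical.

```python
MAX_STREAMS = 4  # review capacity, per the operator above; not a git limit


def worktree_commands(branches: list[str], parent_dir: str = "..") -> list[list[str]]:
    """Build one `git worktree add <path> -b <branch>` command per stream."""
    if len(branches) > MAX_STREAMS:
        raise ValueError(f"{len(branches)} streams requested; cap is {MAX_STREAMS}")
    return [["git", "worktree", "add", f"{parent_dir}/{b}", "-b", b] for b in branches]


def cleanup_commands(branches: list[str], parent_dir: str = "..") -> list[list[str]]:
    """Build the matching `git worktree remove <path>` commands."""
    return [["git", "worktree", "remove", f"{parent_dir}/{b}"] for b in branches]


for cmd in worktree_commands(["feature-x", "feature-y"]):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root once you are happy with what it prints.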

Agent teams are the heavy-artillery variant and the wallet hazard. For a full security audit, a team lead splits the work: spawn ten agents to scan the codebase, four to document issues, two "devil's advocate" agents to argue over whether the findings are real. The field is blunt about cost: one demonstrated session ate ~5 million input tokens (half a $200 monthly limit), and team runs have reportedly burned $80 in fifteen minutes. The warning that comes with the demo: "If you're not careful, you will eat up your session limit very, very quick." This is the layer the field calls "a nuclear weapon for your wallet": occasionally worth it, never a default. Shut teams down the instant the work is done.

All of this rests on permissions as real kill switches, a three-tier risk ladder. Ask before edits is the safe default — the agent prompts before modifying any file. Edit automatically lets it change existing files but still prompts before creating new ones. Bypass permissions is full autonomy and can run sudo rm -rf if it misreads a request — so you only run bypass inside a sandboxed container, never against a directory you care about. For finer control, set per-tool rules: allow ls and cat silently, force a prompt before bash or curl. This is also what makes the self-correcting loop safe — you grant "allow" on exactly the trusted operations the agent needs to read an error log and retry, without handing it the whole keyboard. Widen autonomy exactly as far as the blast radius is contained, and no further.
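The per-tool idea can be sketched as a tiny policy function. This is an illustration of the logic, not the configuration format of Claude Code or any other CLI: the rule table, the mode names (`ask`, `auto_edit`, `bypass`), and both functions are hypothetical. Unknown commands and anything containing shell operators fall back to a prompt, and bypass is only honored inside a sandbox.

```python
import shlex
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"


# Hypothetical per-tool rules mirroring the text: reads pass, the rest prompt.
TOOL_RULES = {"ls": Decision.ALLOW, "cat": Decision.ALLOW,
              "bash": Decision.ASK, "curl": Decision.ASK}
SHELL_OPERATORS = (";", "&&", "||", "|", ">", "<", "`", "$(")
MODES = ("ask", "auto_edit", "bypass")


def decide(command: str) -> Decision:
    """Allow only plain, single, pre-approved commands; prompt for the rest."""
    if any(op in command for op in SHELL_OPERATORS):
        return Decision.ASK  # chained or redirected commands are never silent
    try:
        argv = shlex.split(command)
    except ValueError:
        return Decision.ASK  # unbalanced quotes: do not guess
    if not argv:
        return Decision.ASK
    return TOOL_RULES.get(argv[0], Decision.ASK)


def effective_mode(requested: str, sandboxed: bool) -> str:
    """Downgrade bypass to ask unless the session runs in a sandbox."""
    if requested not in MODES:
        raise ValueError(f"unknown mode {requested!r}")
    if requested == "bypass" and not sandboxed:
        return "ask"
    return requested


print(decide("ls -la"), decide("cat notes.txt; rm -rf /"), effective_mode("bypass", False))
```

The operator check matters more than the allow-list: without it, `cat notes.txt; rm -rf /` would ride through on the strength of its first word.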

CH.12

How do you verify before you ship?

The agent can be confidently wrong — it will tell you the refactor is complete and the tests pass, after quietly editing the test to match the broken output. Fluent prose is independent of correctness; a well-structured explanation of a wrong approach reads exactly like one of a right approach. Trust requires a protocol, and the field has converged on three patterns.

Pattern	What it is	Why it works
Self-correcting build-validate loop	the agent reads its own error log, finds the cause, tries an alternative, and re-runs until the suite is green	catches obvious failures automatically — needs the narrow permissions pre-granted so the retry doesn't stall on a prompt
Adversarial review	a critic agent with zero prior context is told to attack the first agent's output; a third "fixer" addresses what survives	a critic that never saw the original reasoning cannot rationalize the original mistake
Automated QA / the screenshot loop	a browser tool drives the running app — navigating, clicking, screenshotting — or you paste a reference image or an error screenshot and the agent diffs it against the render and fixes the gap	the screenshot is the spec — you don't need to know CSS or media queries, you just show Claude the problem

Two habits sit alongside these. Spot-check, don't rubber-stamp: after every execution pass, read at least one file the agent changed that you did not direct it to touch; if it strayed, tighten the next plan. Demand citations for non-obvious decisions: an agent that can't say why it chose a specific algorithm, library, or architecture guessed, and guesses become debt. Signals to distrust: the plan touches more files than expected; an abstraction appears for a single-use case; the test count rose but the tests look easier to pass than the code is to write.
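The first distrust signal, a plan that touches more files than expected, can be checked mechanically before you read a single diff. A minimal sketch, assuming the approved plan lists the files it intends to edit: `out_of_scope` is a hypothetical helper, and `changed_files` relies on `git diff --name-only HEAD`, which reports modified tracked files but not new untracked ones.

```python
import subprocess


def changed_files(repo_dir: str = ".") -> list[str]:
    """Tracked files that differ from HEAD (untracked files are not listed)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def out_of_scope(changed: list[str], planned: list[str]) -> list[str]:
    """Files the agent touched that the approved plan never mentioned."""
    return sorted(set(changed) - set(planned))


# Example with literal lists, so it runs without a repository.
planned = ["src/auth.ts", "tests/auth.test.ts"]
touched = ["src/auth.ts", "src/pricing.ts", "tests/auth.test.ts"]
print(out_of_scope(touched, planned))  # ['src/pricing.ts']
```

Anything this returns is a good candidate for the one unrequested file you read after each execution pass.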

And the hard rule before anything goes public: never ship vibe-coded code without a manual security audit. AI-generated code is prone to prompt injection and logic flaws — add authentication, audit with the agent and then yourself, and treat every output as a draft.

CH.13

Claude Code or Codex — the honest comparison

Choose Claude Code when you want to compose your own harness; choose Codex when you want the harness pre-built and opinionated. Treat this as a fast-moving snapshot — Claude Code shipping its own /goal to match Codex already dates one row — but the shape holds.

Dimension	Claude Code	ChatGPT Codex
Customization	~30 hook events (~5× granularity)	~6 hook events
Sub-agents	spawns autonomously (planner / explorer / reviewer)	requires an explicit user prompt
Parallelism	worktrees via prompting	native git worktrees built in
Browser / QA	separate Chrome extension	built-in in-app browser + computer-use QA
Images	no native model	native image generation
Enterprise auth	Bedrock, Vertex AI, Microsoft Foundry	less flexible
3rd-party routing	forbidden without approval	permits wrapper clients
Pricing (reported)	Pro $20 / Max5x $100 / Max20x $200	Plus $20 / Pro $200

The character summary is worth keeping verbatim: Claude Code feels like a workflow system you're building out; Codex feels like an opinionated machine designed to take you from agent-is-done all the way to shipped-to-production. Both support MCP, CLIs, VS Code extensions, markdown/YAML skills, plugin marketplaces, and cloud delegation.

CH.14

The learning path — and what's just hype

Skill compounds in order, and binging the curriculum is exactly how most people stall. @eng_khairallah1's tiered path:

Level	Duration	Focus
Level 1	24 min	basics and setup
Level 2	1 hour	real workflows (cowork, teams, design, projects, slides, skills)
Level 3	3.5 hours	pro moves (avoiding sycophancy, Claude Code, limit management)
Level 4	8 hours	expert mode (Claude Computer, API building)

Don't binge it. Do one level per sitting. Actually apply each guide before moving to the next. The Level 2 → 3 jump is where most people fail: Level 2 is about using features; Level 3 is about understanding failure modes — fixing sycophancy (Claude agrees too easily by default; tell it to push back, question assumptions, and flag weak thinking), designing workflows that stay under limits by default, and moving from typing one question at a time to building automated loops.

Now the honest read. Realistic agentic-coding gains are 30–60%, not 5× — and the people doing the best agentic coding aren't screaming from the rooftops that they're five times more productive; the calm operators undersell. The "10x developer" framing ignores the verification and context overhead the same demos spend most of their runtime describing, and generating 10,000 lines a day fails in any team setting because nobody can review it — the constraint is review bandwidth, not generation speed. Karpathy himself notes models still fail at simple logic and spatial reasoning, so the "vibe coding replaces engineering" narrative is overstated. Several of the loudest tutorials are course-length funnels from the same instructor cluster; the recurring revenue figures — "$4M/year," "$10–15K/month," "2,000 students taught" — are creator-reported, recycled across videos, and never independently verified. Weight the technique, discount the income story. And any local-AI throughput number tied to a specific GPU and quant will age within months. The meta-rule the calm operators share: wait three to four months to see whether a new pattern survives, and ground your prompts in established engineering literature rather than "you are a senior engineer" persona theatre.

CH.15

Your first week: from tinkering to shipping

Day 1 — foundation. Install Claude Code (npm install -g @anthropic-ai/claude-code), run /init, then write CLAUDE.md by hand: stack, build commands, directory structure, security rules, critical guardrails at the top, plus Humzaakhalid's four rules. Initialize git and push the initial state. Verify: a fresh session can describe the architecture without reading individual files.

Days 2–3 — harden the workflow. Make plan mode mandatory for every non-trivial feature — no code generation without an approved plan. Set the Opus-plans / Sonnet-executes handoff as the default. Build one reusable Barry-style skill (one job, one bounded input, one structured output) and one post-edit hook that runs tests and reports failures back to the agent. Verify: the hook fires and the agent reads the failure.

Days 4–5 — MCP and parallelize. Audit settings.json with Mnilax's framework and measure the token impact with /context before and after. Enable one MCP server for a tool you use daily — or skip it and use curl if the integration is routine. Then find a task with genuinely independent sub-tasks, create a worktree per sub-task, spawn cheaper sub-agents in parallel, and merge. Stay at or below four streams. Verify: the full suite passes after the merge; if not, find which worktree introduced the regression and re-run that agent with corrected instructions.

Days 6–7 — verify and iterate. Wire the self-correcting loop (test fails → agent reads error → proposes fix → retries). Run one adversarial review on a complex or security-sensitive change. Rewrite one existing prompt using Voxyz_ai's production-grade /goal template. Fold every new mistake into the "Learned Rules" log. Verify: the critic's valid findings get fixed, not waved off.

Ongoing. Monitor context with /context and compact at roughly 60% of the window. Prune CLAUDE.md weekly to stay under ~300 lines. Route by difficulty — Haiku for trivial work, Sonnet for the bulk, Opus for hard logic — and track cost per successful task, not per token. Then keep expanding scope: the comfortable trap after a few weeks is using the agent only for boilerplate and small refactors and never pushing it into architecture, cross-file analysis, or hard debugging. That high-impact work is exactly where the real gain lives, and reaching it takes the verification discipline above plus the willingness to push back on a wrong plan.

The operators who ship are not the ones with the most powerful models. They are the ones with the most disciplined harness. Build the brain, plan the work, guard the permissions, compact the context, verify before you merge — then ship the code.
