Agentic engineering · 2026-06-18 · PUBLIC

From vibe coding to agentic loop engineering: the loop patterns, the discipline, and why it compounds

Vibe coding made you faster at approving code you can't verify. Agentic loop engineering reinstalls the skipped discipline as infrastructure — spec, harness, self-verifying loops with disk state and exit conditions. The loop becomes the unit of work, and the skills compound while models churn.

You uninstalled the IDE. You typed one sentence — "build me a dashboard with auth" — and a working app appeared. You shipped it. For about a week you felt like a ten-x engineer. Then a user hit an edge case nobody considered, you opened the file to fix it, and you didn't recognize your own codebase. You hadn't written it. You'd approved it. And approving something you can't verify isn't engineering — it's gambling with extra steps.

That gap is the whole story. The people actually shipping at scale stopped prompting a while ago; they write loops. The move from single-shot prompting to agentic loop engineering isn't a better technique — it's a different unit of work, as fundamental as going from batch jobs to time-sharing. The cruel part is that getting faster at vibe coding just gets you to the wrong place sooner. This note merges what the practitioners building these systems actually do: the loop patterns, the discipline that has to go in front of them, and why — done right — the whole thing compounds instead of rotting with the next model release.

CH.01

Why does vibe coding feel like progress and produce liabilities?

Vibe coding feels productive because it deletes steps — and every deleted step was where understanding got built. The output exists, so you ship it. You didn't produce something; you approved something, and until you learn the difference, every line is a liability.

@pmitu counts eight pre-development stages vibe coding has effectively killed: thinking before building, talking to users first, building waitlists, analyzing audience segments, idea validation, competitor research, prioritizing features before coding, and feedback-loop-driven building. None of those were ceremony. They were where the product got a stake in whether it succeeded. Outsource them to a model's pattern-matching and the model produces something impressive in isolation and catastrophic in context.

@rewind02 draws the opposite posture: "you define the spec, the architecture, the top-level design decisions... you treat agents as fallible, stochastic, powerful interns." The human keeps judgment, taste, and correctness; the machine gets execution. The two approaches even fail differently — and that difference is the tell:

Vibe coding fails silently. The code works until it doesn't; the product ships until a user finds the edge no one considered.
Agentic engineering fails loudly and recoverably. The spec is explicit enough that deviations are detectable, the verification is automated enough that regressions get caught.

The deeper trap is cognitive. @JuliannPod names the fluency illusion: information that feels easy to process gets mistaken for knowledge. You read the generated code, it looks right, so you assume it is — and your brain stops encoding because it assumes something else is doing the work. The cited number is brutal: retention drops from 68.5% to 57.5% when you offload cognition this way. The output of that habit is what he calls "slop" — content that cost the author nothing and reads like it.

What vibe coding really strips out is the muscle that builds taste. "Teaching prompts is two hours. Teaching taste is the rest." Most people skip evaluation entirely — the output exists, so they ship it — and never develop the standard that rejects the first mediocre version. They protect the wrong thing: "The job is patient care. The task is reading a scan. The job is strategy. The task is building a spreadsheet." Protect the task and you get replaced. Protect the judgment and you get amplified.

Which brings up the distinction that decides what you should ever hand off:

Friction	What it is	What to do
Vicious	Busywork — formatting, status updates, boilerplate	Automate it today
Virtuous	The 20 minutes of being stuck. The bad first draft. The part where you don't have the answer yet	Keep it, at least until you've built the skill

"When both types of friction are outsourced, you don't gain time; you lose the version of yourself the friction would have made." — @JuliannPod

The test before you delegate: would doing this myself make me better at my job? If yes, that's virtuous friction — keep it. If no, automate it and move on.

CH.02

What replaced the prompt — and why is the loop the new unit of work?

A single model call has no memory of what it tried, no way to check its own work, and no way to recover from failure. Input → Model → Output is a coin flip that costs tokens. A loop fixes all three by making each output the next input.

@0xwhrrari puts the industry signal plainly: "Anthropic and OpenAI are both telling engineers to write loops. Not prompts. Not agents. Loops." The identity shift is the real change — @0xCodez: "I don't prompt Claude anymore. I write loops — and the loops do the work. My job is to write loops." Write a prompt and you're deciding what comes next. Write a loop and you're architecting the system that decides what comes next. Operator becomes designer.

@_avichawla gives the loop four parts: "A schedule decides what to run, Loop is the maker that produces the work, A separate checker agent grades the output, A file on disk holds the state they both read." That structure is what turns guessing into measuring. @bcherny describes the behavior it produces: "It is the first model I have used that was so methodical and precise, taking measurements and adding logs then verifying that it truly fixed the issue before declaring victory." That's not a property of the weights. It's a property of the loop you build around them.

CH.03

What are the four loop patterns, and when do you reach for each?

Not all loops are equal — four patterns each solve a specific failure mode of autonomous work. Pick by what's most likely to break.

Pattern	Solves	Mechanism
Loop-Until-Done	Open-ended work	Maker produces, checker grades; below the bar, output + critique feed back; exits on pass, max iterations, or token budget
Fan-Out	Goal drift over long context	Split the goal into independent pieces, spawn parallel agents, each with its own bounded context and deliverable; main agent synthesizes
Tournament	Unreliable scoring	Generate multiple candidates, run them against each other; expensive in tokens, reliable in outcome
Adversarial Verification	The model grading itself too kindly	A separate critic instructed to find flaws, not confirm quality

Loop-Until-Done is the foundation, and its one critical decision is the exit condition — set it before the loop starts or you burn tokens forever on work that's "almost there." Fan-out has a sharp template from @Voxyz_ai: "Split that goal into independent pieces, spawn as many parallel agents as needed to do it better and faster, and give each agent its own dedicated /goal that includes its expected deliverable, verification, and completion standard." Adversarial verification exists because of a default flaw — @_avichawla: "Models grading their own output often justify what they already did rather than catching failures." Friendly verification is no verification at all.
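
A minimal sketch of Loop-Until-Done, assuming the maker and checker are plain callables. `demo_maker` and `demo_checker` are hypothetical stand-ins for real model calls, and the default limits are illustrative. What it shows is the shape: critique feeds back into the maker, and all three exits (pass, iteration cap, token cap) are fixed before the first iteration runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    critique: str  # fed back to the maker when the bar is not met


@dataclass
class LoopResult:
    output: Optional[str]
    reason: str  # "passed", "token_budget", or "max_iterations"
    iterations: int
    tokens_used: int


def loop_until_done(
    maker: Callable[[str, str], tuple[str, int]],
    checker: Callable[[str], Verdict],
    task: str,
    max_iterations: int = 5,
    token_budget: int = 50_000,
) -> LoopResult:
    """Alternate maker and checker until a pass or a hard stop."""
    critique, output, tokens_used = "", None, 0
    for iteration in range(1, max_iterations + 1):
        # The maker sees the task plus the previous critique (empty at first).
        output, cost = maker(task, critique)
        tokens_used += cost
        verdict = checker(output)
        if verdict.passed:
            return LoopResult(output, "passed", iteration, tokens_used)
        if tokens_used >= token_budget:
            return LoopResult(output, "token_budget", iteration, tokens_used)
        critique = verdict.critique
    return LoopResult(output, "max_iterations", max_iterations, tokens_used)


# Hypothetical stand-ins for real model calls, so the sketch runs offline.
def demo_maker(task: str, critique: str) -> tuple[str, int]:
    draft = f"{task} v2" if critique else f"{task} v1"
    return draft, 1_000  # (output, tokens spent)


def demo_checker(output: str) -> Verdict:
    ok = output.endswith("v2")
    return Verdict(ok, "" if ok else "bump to v2")


result = loop_until_done(demo_maker, demo_checker, "changelog")
print(result.reason, result.iterations, result.tokens_used)
```

With these stubs the first draft fails, the critique comes back, and the second draft passes, so the loop exits on "passed" after two iterations. Swap in a checker that never passes and the same loop stops at the iteration cap instead of running forever.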

CH.04

Where does the loop's memory live?

The single most important technical decision in loop engineering is where state lives — and the answer is disk, never the context window.

@_avichawla is unequivocal: "State has to live on disk, not in context. The model forgets everything between runs, so an MD file or a knowledge graph holds what is done and what is still open." Store state in the context and you lose it the moment the session ends; store it in conversation history and you pay to re-read it every turn while the model drifts anyway.

The implementation is humbler than it sounds. A Markdown file with sections — Completed, In Progress, Blocked, Next Steps — is enough for most loops. Each iteration reads it, the maker works, the checker grades, both write back, and the next iteration resumes from there. A loop can pick up hours or days later with zero context loss. For richer state, a lightweight knowledge graph or structured JSON. The principle holds either way: the file is the source of truth, not the model's memory.
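
One way the Markdown state file could look in code. This is a sketch under assumptions, not a reference implementation: the `## Section` headings with `- item` bullets, and the helper names `load_state`, `save_state`, and `complete_task`, are all invented here for illustration.

```python
import tempfile
from pathlib import Path

SECTIONS = ["Completed", "In Progress", "Blocked", "Next Steps"]


def load_state(path: Path) -> dict[str, list[str]]:
    """Parse the state file into {section: [items]}; a missing file is empty state."""
    state: dict[str, list[str]] = {name: [] for name in SECTIONS}
    if not path.exists():
        return state
    current = None
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.startswith("## "):
            heading = line[3:].strip()
            current = heading if heading in state else None
        elif line.startswith("- ") and current is not None:
            state[current].append(line[2:].strip())
    return state


def save_state(path: Path, state: dict[str, list[str]]) -> None:
    """Write all sections to a temp file, then swap it in so a crash never leaves half a file."""
    lines: list[str] = []
    for name in SECTIONS:
        lines.append(f"## {name}")
        lines.extend(f"- {item}" for item in state[name])
        lines.append("")
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text("\n".join(lines), encoding="utf-8")
    tmp.replace(path)


def complete_task(state: dict[str, list[str]], item: str) -> None:
    """Move an item from In Progress to Completed."""
    state["In Progress"].remove(item)
    state["Completed"].append(item)


# Simulate two separate sessions sharing nothing but the file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "progress.md"
    first = load_state(path)  # no file yet: every section empty
    first["In Progress"].append("migrate auth module")
    save_state(path, first)

    resumed = load_state(path)  # what a fresh session would see
    complete_task(resumed, "migrate auth module")
    save_state(path, resumed)
    final = load_state(path)
```

The second "session" knows nothing except what it reads from disk, which is the point: the resume test is just `load_state` on a file the previous run left behind.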

What that buys you, at the extreme, is @Mnilax's result: "the same shape rewrote bun from zig to rust, 750k lines, test suite green." That isn't a prompt. It's a disk-stated loop crossing an enormous codebase, persisting progress between iterations, and verifying at each step.

CH.05

How high should you set the verification bar?

Set the bar low for boring work and high for judgment — and only loop as far as the checker can objectively confirm, then hand off.

The insight is counterintuitive. @_avichawla: "The lower the verification bar, the safer the loop. Boring, repetitive checks like a stale version string or a missing test are trivial to verify." A checker can confirm those objectively and fast, so the loop runs unattended with confidence. Architectural decisions can't be confirmed that cleanly, so a human stays in the seat.

Work type	Verification bar	Run unattended?
Stale version strings, missing tests, lint fixes	Low	Yes
Refactoring against a clear spec	Medium	Mostly, with spot checks
Architectural decisions, creative work	High	No — human in the loop

Real self-verification is not asking the model "is this good?" — that's self-preference wearing a verification badge. It's the model taking observable actions: running tests, reading logs, comparing against a spec, executing the code and reading the output. @Voxyz_ai encodes it: "after every meaningful step: real-time test the real thing (full end-to-end, plus computer use, browser, keystrokes, whatever it needs), auto review then commit, write progress somewhere sensible in the project." @bcherny's "methodical and precise" behavior comes from exactly this grounding in reality, not from the model's personality.
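
A small sketch of a checker grounded in observable actions rather than opinion. It assumes the thing being verified can be exercised by a command whose exit code means pass or fail. `run_check` and `CheckResult` are hypothetical names, and the two inline Python commands stand in for a project's real test runner.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    evidence: str  # captured output, kept so a human can audit the verdict


def run_check(command: list[str], timeout_s: float = 60.0) -> CheckResult:
    """Run a real command and grade on its exit code, not on model opinion."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return CheckResult(False, f"timed out after {timeout_s}s")
    return CheckResult(proc.returncode == 0, (proc.stdout + proc.stderr).strip())


# Stand-ins for a test suite: one command that succeeds, one that fails.
good = run_check([sys.executable, "-c", "assert 1 + 1 == 2"])
bad = run_check([sys.executable, "-c", "assert 1 + 1 == 3"])
print(good.passed, bad.passed)
```

The failing run carries its traceback in `evidence`, which is what gets fed back to the maker as critique. The timeout matters as much as the exit code: a check that hangs is a failed check.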

The best loops harden themselves over time. Every failure becomes a permanent test case — the system gets more reliable with each break, not less.

CH.06

What are the Four Horsemen of loop failure?

Loops fail in four predictable ways. Name each one and you can design it out.

Goal Drift. Over a long run the agent loses the thread. Solution: fan out into bounded pieces, and define "done" as the bar, not the build — @Voxyz_ai: "keep going until the architecture and result meet the bar, not just until it runs." Compiling is not done.

Self-Preference. The model grades itself generously. Solution: adversarial verification — a critic structurally separate from the maker, with a different prompt and a different objective.

Context Window Exhaustion. The agent forgets earlier work or starts repeating itself. Solution: disk state, plus disciplined compaction.

Infinite Loops. The loop never crosses the finish line. Solution: hard stops — max iterations, token budget, time limit. @_avichawla: "Exit must be set before the loop starts." @0xLogicrw sharpens it: set a fixed financial budget rather than a step count, so cheaper models can iterate more within the same envelope.
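
The budget-over-step-count idea can be sketched like this. The `Budget` class, its field names, and the prices are assumptions made for illustration, not real vendor rates. The arithmetic is the part that carries over: a fixed spend buys more iterations from a cheaper model.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Budget:
    max_usd: float
    max_seconds: float
    spent_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int, usd_per_million: float) -> None:
        """Record the cost of one model call."""
        self.spent_usd += tokens * usd_per_million / 1_000_000

    def stop_reason(self) -> Optional[str]:
        """Return why the loop must stop, or None if it may continue."""
        if self.spent_usd >= self.max_usd:
            return "budget"
        if time.monotonic() - self.started_at >= self.max_seconds:
            return "time"
        return None


def affordable_iterations(
    max_usd: float, tokens_per_iter: int, usd_per_million: float
) -> int:
    """How many iterations a fixed spend buys at a given price."""
    return int(max_usd * 1_000_000 // (tokens_per_iter * usd_per_million))


# Illustrative prices only: same $2 envelope, two hypothetical models.
cheap = affordable_iterations(2.0, 100_000, 0.5)
pricey = affordable_iterations(2.0, 100_000, 2.5)
print(cheap, pricey)
```

Under these made-up prices the cheaper model gets 40 iterations where the expensive one gets 8. A loop would call `charge` after every model call and break as soon as `stop_reason()` returns something other than `None`.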

CH.07

How do you wire many agents without chaos?

When one loop can't hold the complexity, the communication medium matters more than the orchestration style — files beat messages.

Start with the right kind of sub-agent. @Mnilax: "Every sub-agent I deleted was a Mahesh: smart, general, no domain shape. Every survivor was a Barry: one job, one bounded input, one structured output." General-purpose sub-agents burn ~20K tokens of spawn overhead per invocation without producing useful work. Narrow skill-agents are dramatically more efficient.

The Barry pattern runs on a file bus. No agent needs to know what the others are doing beyond what's on disk. @RoundtableSpace shows the peer-review variant — "The QA agent found three bugs, sent feedback directly to the developers, they fixed the issues and the app shipped in a single pass" — faster for small teams, harder to debug at scale.

@AiCamila_ sorts the three orchestration styles: Orchestrator (one agent routes to sub-agents), Hierarchical (supervisors manage workers), Swarm (peers collaborate over shared memory). Start with Orchestrator for debuggability; reach for Swarm only when you need emergent behavior and can stomach the complexity. But the non-obvious truth across all three: file-based communication is more reliable than message-passing because it survives crashes, is inspectable by humans, and doesn't depend on the model formatting messages correctly. @0xLogicrw's DecentMem reinforces the point at the memory layer — a dual-pool design (an E-pool for historical experience, an X-pool for generating new candidate ideas, an online decoder adjusting the weights between them). The hard problem in multi-agent systems is how agents store and retrieve shared state, not which org chart you draw.
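
A minimal sketch of a file bus, assuming one JSON file per agent in a shared directory. The `publish` and `consume` helpers and the payload keys are hypothetical. It illustrates the three properties claimed above: writes are atomic so a crash leaves the last good file, the files are readable by a human, and the reader validates shape instead of trusting the writer.

```python
import json
import os
import tempfile
from pathlib import Path


def publish(bus_dir: Path, agent: str, payload: dict) -> Path:
    """Write one agent's structured output to the bus atomically."""
    bus_dir.mkdir(parents=True, exist_ok=True)
    target = bus_dir / f"{agent}.json"
    fd, tmp_name = tempfile.mkstemp(dir=bus_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2, sort_keys=True)
    os.replace(tmp_name, target)  # readers see the old file or the new one, never a partial
    return target


def consume(bus_dir: Path, agent: str, required_keys: set[str]) -> dict:
    """Read another agent's output and reject it if the expected shape is missing."""
    data = json.loads((bus_dir / f"{agent}.json").read_text(encoding="utf-8"))
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"{agent} output missing keys: {sorted(missing)}")
    return data


with tempfile.TemporaryDirectory() as tmp_dir:
    bus = Path(tmp_dir) / "bus"
    publish(bus, "qa", {"status": "fail", "bugs": ["login redirect loops"]})
    review = consume(bus, "qa", {"status", "bugs"})
```

The developer agent never talks to the QA agent. It reads `qa.json`, and if the file is malformed it fails loudly at `consume` rather than acting on a half-understood message.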

CH.08

What discipline goes in before the loop runs?

Before a single line of code, install the architecture the prompt skipped: spec, plan, and a definition of done. This is the step vibe coding deleted, and it's where most of the edge hides.

@faraznaqvi0 gives a three-prompt pre-coding sequence that stabilizes results before any code is generated. Prompt 1 forces the model to read the relevant files and restate the goal, patterns, constraints, and assumptions, and to ask only the questions that change the implementation. Prompt 2 asks for the smallest safe plan: files touched, order of work, edge cases, risks, and a recommendation among viable approaches. Prompt 3 defines done before coding starts.

Then @faraznaqvi0 runs chunked-sequential work with save points: "break down bigger tasks with claude code to bite-sized chunks, then knock them out sequentially, after each chunk i can run tests and commit, so I'm never losing progress if something goes wrong... it's like having save points in a video game but for actual work." It costs more time than typing "build me a Twitter clone." It produces something you can debug, extend, and trust.

The prompt skeleton underneath all of this is RCT — Role, Context, Task. Tell the model who it is, give it the situation, state the bounded ask. (One creator who teaches this expands it to four letters with "Format" tacked on — ignore that; format and constraints are real levers, but they belong to a separate Briefing Method: audience, outcome, format, constraints. RCT is three letters.) A bare "write me a blog post about X" is a Google search aimed at a colleague; RCT is a brief. The non-obvious payoff is recursive: RCT is also the skeleton of every agent system prompt — the Role is the job title, the Context is the memory it reads, the Task is its one-line job, and the format constraints are the skill file dictating output shape. Beginner template and orchestration primitive, same shape at different scales.

Chain-of-thought is the same idea grown up. Taught as a one-liner — "think step by step, show your reasoning, then give the final answer" — it's worth doing. But it scales into the operating logic of an agent system at three levels: a clarifying-question loop (one model interrogates the spec before another executes — running the loop in a desktop model before handing a coding agent a detailed spec yields cleaner code and fewer bugs than typing a quick brief into the CLI); multi-step reasoning held across a sequence of tool calls; and trace-map debugging. When an agent run goes wrong, don't rebuild the chain — open the full trace (prompts, tool calls, results, failures, model switches, retrieved memory) and read it backwards from the final output until you hit the step that looks off. "Nine times out of ten the weak link is one or two steps before the final answer." The reported speedup (roughly an hour of guess-and-rerun down to about five minutes) is one creator's own stopwatch, unaudited — but the practice is exactly how anyone who has actually debugged agent runs already works. Steal the practice; ignore the timer.

CH.09

Which model, and what does the wrong one cost?

Agentic engineering looks more expensive than vibe coding until you count total cost — then the discipline is what makes it cheap.

The settings themselves are an engineering surface.

Wrong model, not wrong architecture, is the common waste. @Mnilax again: "CC Fable 5 (Extra) used 1.4M tokens, 5 subagents, and an hour and a half to clean up 21 basic Resharper warnings" — work Codex with GPT 5.5 would have closed in 10 minutes under 300K tokens. The loop was right; the model choice was extravagant. Use the cheapest model that can do the job and reserve the expensive one for decisions that actually need it. @Sprytixl shows the routed version: "routine tasks route to Kimi K2.6 at $0.09 — complex work stays on Opus at $1.41 — context never drops between switches."
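
"Cheapest model that can do the job" reduces to a few lines once each tier declares what it can be trusted with. Everything in this sketch is an assumption: the tier names, the 1-to-5 complexity scale, and the relative costs are placeholders, not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    max_complexity: int   # highest task complexity this tier is trusted with (1-5)
    relative_cost: float  # cost per task relative to the cheapest tier


# Hypothetical tiers; a real table would come from your own evals.
TIERS = [
    ModelTier("cheap-model", max_complexity=2, relative_cost=1.0),
    ModelTier("mid-model", max_complexity=4, relative_cost=5.0),
    ModelTier("frontier-model", max_complexity=5, relative_cost=15.0),
]


def route(complexity: int, tiers: list[ModelTier] = TIERS) -> ModelTier:
    """Pick the cheapest tier whose ceiling covers the task."""
    capable = [tier for tier in tiers if tier.max_complexity >= complexity]
    if not capable:
        raise ValueError(f"no tier can handle complexity {complexity}")
    return min(capable, key=lambda tier: tier.relative_cost)


print(route(1).name, route(3).name, route(5).name)
```

The hard part is not the routing function but the `max_complexity` column. That number should come from evals on your own tasks, not from a vendor's positioning.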

That routing instinct rests on a thesis worth internalizing: no single model is best at every job, so single-model prompt habits are a liability. A prompt tuned for one model's long-form register isn't the one that gets the best brainstorm out of another. The table below comes from one creator's self-run, six-week, four-model benchmark across five workflows. Read it as one bench, not a leaderboard, though it is directionally consistent with independent testing:

Model	Reported result	Grade
Claude	Held the thread through 13 sub-steps on a 15-source dossier; 47 tool calls, only 2 redundant; shipped a working dashboard panel from one paragraph, first try; recalled correct detail from token 1,000 of a 22,000-token context at session end	A
GPT	Dependable all-rounder; better early-context recall than the big-window model; lacked the writing nuance	strong
Gemini	Biggest context window, yet missed more early-context detail than GPT — the counterintuitive result	mixed
Local Llama (agentic-tuned)	Held only 3 sub-steps; 31 tool calls, 11 redundant + 4 wrong-tool; panel didn't compile; couldn't hold the full 22,000-token system prompt	C−

The shape is the lesson: frontier models hold long agentic threads, local models truncate and lose the plot, and a bigger context window does not guarantee better recall. Local inference still has its place — @0xKnzo and @adiix_official make the math work with Mac Mini clusters for high-volume, low-complexity work — but @ai_appreciator's testing shows the quality gap on complex coding is still real. Local for grunt work, frontier for architecture. Don't pretend they're interchangeable.

CH.10

Why does any of this compound?

The frameworks expire; the skills and the memory don't. That's the entire compounding argument.

@DeRonin_ names what survives a model release: "context engineering, tool design, orchestrator-subagent pattern, eval discipline, the harness mindset (harness > model, always), and MCP as the protocol layer." These are model-agnostic and tool-agnostic. Invest your architecture in a vendor framework and you've built on sand; invest it in these and every new model makes you stronger. @rewind02's "context minimalism" closes the loop: "with every new model you need less instruction, not more." The prompt should shrink as the model improves. The system around the prompt should grow more sophisticated, precisely because more capable models produce more complex output that's harder to verify by eye.

The deepest moat is memory. The one layer beginners skip is persistent, business-specific context — and it's the hardest thing for a competitor to replicate. "Context is the single biggest driver of AI performance. Without it, agents produce generic output; with it, output is specific to your business, customers, and voice." The engineering pattern, worth stealing wholesale, splits memory three ways:

Episodic — an append-only memory.md, timestamped, never overwritten.
Semantic — a user.md of stable preferences (e.g. "UK English," "most productive in mornings") injected into every system prompt.
Procedural — versioned skills/ files updated as the agent learns better methods.
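
The episodic and semantic layers above could be sketched as two small helpers. The file names follow the text, but the entry format, the function names, and the prompt wording are assumptions. The procedural `skills/` layer is left out because it is ordinary versioned files.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_episode(memory_path: Path, note: str) -> None:
    """Append one timestamped entry; existing entries are never rewritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    with memory_path.open("a", encoding="utf-8") as handle:
        handle.write(f"- [{stamp}] {note}\n")


def build_system_prompt(role: str, user_path: Path) -> str:
    """Prefix every system prompt with the stable preferences, if any exist."""
    prefs = ""
    if user_path.exists():
        prefs = user_path.read_text(encoding="utf-8").strip()
    if not prefs:
        return role
    return f"{role}\n\nStable user preferences:\n{prefs}"


with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "user.md").write_text("- UK English\n", encoding="utf-8")
    append_episode(root / "memory.md", "Client asked for weekly summaries")
    append_episode(root / "memory.md", "Switched report format to PDF")
    episodes = (root / "memory.md").read_text(encoding="utf-8").splitlines()
    prompt = build_system_prompt("You are a reporting assistant.", root / "user.md")
```

Opening the file in append mode is what enforces "never overwritten": the helper has no code path that can truncate history.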

Context loading stays semantically compressed — recent messages verbatim, relevant past sessions retrieved by search, older context aggressively summarized — so tokens don't explode while long-range coherence holds. And the memory is maintained, not set-and-forget: quarterly semantic compression (summarize the last ~90 days, keep names/projects/decisions/lessons, drop small talk), daily Git backup to a private remote, and non-negotiable security — files local or encrypted (FileVault / git-crypt / LUKS) before any push, with sensitive apps excluded from any capture and weekly memory audits. These files hold real business intelligence; treat them like it.

Compounding shows up twice. Each failure locked as a regression test makes the system harder to break over time (see the verification chapter). And the learning order itself is a dependency chain — prompt fluency, then model fluency, then automation, then monetization — where skipping a rung produces a predictable downstream failure: structured prompting without model fluency gives brittle output; automation without structured prompts gives broken workflows; monetization without automation means you're still selling hours.

CH.11

What's the action plan?

Here's the do-this-then-that, with a way to check each step actually worked.

Phase 1 — install virtuous friction. Audit your last three AI-assisted projects: for each, did you understand every line before you shipped it? Where the answer is no, write down the steps you skipped. Then categorize your weekly friction — vicious (automate today) vs virtuous (keep until the skill is yours). Set a verification bar: not "a working app" but "a schema that passes these three queries," "an endpoint that returns 200 for valid and 401 for invalid."

Phase 2 — run the 3-prompt pre-coding sequence. Spec, architecture, definition of done (see the discipline chapter). Argue with each output. Ask "why?" three times. Verify: can you explain every architectural decision to a colleague without the model's justification?

Phase 3 — build the harness.

Write state to disk. A progress.md the agents update after every step. Verify: kill the session, start fresh — can it read the file and resume?
Separate maker and checker. The checker sees only the output, never the maker's reasoning. Verify: does it catch errors the maker missed?
Set exit conditions — max iterations, token budget, verification bar — before the first iteration. Verify: does the loop stop, or run forever?
Build Barries, not Maheshes. One job, one bounded input, one structured output, file-based communication. Verify: can each sub-agent's output be checked without reading its internal reasoning?

Phase 4 — iterate and compound. Tokenmax your settings and select model by task horizon (see the cost chapter). Run evals, not vibes — @AndrewYNg: "The single biggest predictor of whether someone executes well is their ability to drive a disciplined process for evals and error analysis." Build a golden dataset, run it after every change, ship only if the metric moves. Then shift to judgment-only supervision — @Mnilax's cowork templates run an 8-hour workday in the background against ~47 minutes of human steering and approving — but only because the harness, exit conditions, and maker/checker split were built first. Without that infrastructure, "judgment-only" is just vibe coding with extra steps.
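
"Ship only if the metric moves" can be written down as a gate. This sketch assumes exact-match scoring against a tiny golden dataset, which is the simplest possible metric and often too strict for real output. The `accuracy` and `should_ship` names and the toy systems are invented for illustration.

```python
from typing import Callable

GoldenSet = list[tuple[str, str]]  # (input, expected output) pairs


def accuracy(system: Callable[[str], str], golden: GoldenSet) -> float:
    """Fraction of golden examples the system answers exactly right."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(1 for prompt, expected in golden if system(prompt) == expected)
    return hits / len(golden)


def should_ship(
    candidate: Callable[[str], str],
    baseline: Callable[[str], str],
    golden: GoldenSet,
    min_gain: float = 0.0,
) -> bool:
    """Ship only when the candidate beats the baseline by more than min_gain."""
    return accuracy(candidate, golden) > accuracy(baseline, golden) + min_gain


# Toy systems standing in for "before the change" and "after the change".
golden = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]
baseline = lambda q: {"2+2": "4"}.get(q, "")
candidate = lambda q: {"2+2": "4", "3*3": "9"}.get(q, "")

print(should_ship(candidate, baseline, golden))
```

Here the candidate answers two of three and the baseline one of three, so the gate opens. A change that merely ties the baseline does not ship, which is the difference between an eval and a vibe.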

How to verify the whole loop works: it terminates before hitting hard stops on most runs; the checker catches real failures instead of rubber-stamping; the state file reflects reality; and the output meets the exit condition from Phase 1. If it keeps hitting hard stops, your exit condition is too vague or the bar is above the checker's capability. If it approves bad output, you need adversarial verification.

A maturity rubric to grade yourself against:

Phase	Success signal	Failure mode
Model fluency	You can predict the right model for a new task	One model for everything
Structured prompting	Briefs replace "write me a…"	Still firing naked questions
Automation	Runs 7 days untouched	Breaks the moment you look away
Agent stack	Completes a 10-step workflow without losing the thread	Agents repeat work, drop context
Memory layer	References specific detail from 30+ days back	Generic output despite weeks of use
Revenue task	Client-ready deliverable in <50% of manual time	Faster but worse than by hand

CH.12

What's the honest assessment?

The edge is real; so are the ways it breaks. Here's what to distrust.

The harness is boring, and that's where agents die. @NeoOpenGPU: most AI agents fail at the plumbing, not the reasoning — auth tokens expiring mid-run, webhooks that drop, state that drifts. Skip the unglamorous part and the agent fails in production while you blame the model.

Self-preference is the default, not the exception. Models justify their own output unless you build adversarial verification, and even checkers get fooled by plausible-looking work. For anything high-stakes, keep a human in the loop. @BratDotAI's guardrail is non-negotiable: never let an agent run production deploys, database changes, or billing changes without your approval.

"100% AI-written code" is marketing. @0xCodez's uninstalled-IDE, zero-lines-by-hand claim sells a brand; @AtsuyaYamakawa has it right: "AI coding empowers people who can design. It does not replace engineers." If you can't verify the logic, 100% AI-written code is 100% unverified code — and @codingtechnyks cites studies showing up to 45% of AI-written code can be insecure. Audit every commit.

The frameworks expire. @DeRonin_: "Avoid AutoGen / AG2: moved to community maintenance, releases stalled. CrewAI: demos well, breaks in production." Not technically inferior — they expire. The reliable path is file-based communication between narrow agents, not message-passing between general ones.

And distrust round numbers, including the impressive ones in this note. The productivity creator whose memory architecture and cross-model framing are folded in here is at his most substantive on the mechanics — but every quantity he attaches traces back to himself: a "5X quality" multiplier on prompts, the benchmark grades above, the hour-to-five-minutes debugging figure, the saved-minutes-per-day claims. Creator-reported, unaudited, often self-promotional. Worse, his "run all four models in production" stack quietly smuggles in two of his own paid products as if they were neutral pillars — read those as "a scheduled-agent layer" and "a browser-automation layer," route by fit, and don't buy the brand names. The architecture is sound. The numbers are his. Use the first; check the second.

The shift from prompting to loop engineering is the most consequential change in how we work with AI since the prompt itself. The machine handles the repetition; you handle the judgment. But judgment isn't a switch you flip — it's a muscle you build by doing the hard parts yourself before you delegate them. Start with virtuous friction. Build the harness. Run the evals. Compound the skills that survive the next model release. Everything else is vibes.
