PETBRAINS


How to Build With AI Coding Agents

Plausible is not correct

LLMs generate plausible code, not correct code. The techniques that make AI coding agents work all reduce to one move: constrain and verify.

An AI coding agent doesn’t know whether your code is correct. It knows what correct code usually looks like. Those are different things, and the gap between them is where most AI-assisted development falls apart.

Once you see that gap, a lot of disconnected advice collapses into one idea. Context management, spec-driven development, formal verification: people present them as separate disciplines. They’re the same move at three different layers.

Why AI coding agents default to plausible

A model generates the most likely continuation of the text in front of it. That’s the whole mechanism. It is not reasoning toward a goal and checking its work. It is predicting what comes next.

Feed it a vague prompt and the most likely continuation is plausible-shaped: code that imports a library that was popular in the training data, ignores the conventions your project actually follows, and solves the problem next to yours instead of yours. It compiles. It looks right in review. It’s wrong in a way that takes you an hour to find.

This is the same failure that researchers studying LLM planning hit with ordinary chain-of-thought. The model produces reasoning that reads as logical and is logically broken: steps that don’t follow, preconditions it never checked. Plausible prose, wrong conclusion. The surface is convincing precisely because a convincing surface is what the objective rewards.

So the job is never “get a better continuation.” It’s “make the wrong continuations harder to produce and easier to catch.” Two halves: constrain, then verify.

Three fixes that are one fix

Constrain the context

A context window is finite, and everything in it competes. A 2,000-line CLAUDE.md written just in case, forty messages of history, a few tool outputs you stopped needing twenty minutes ago: all of it sits in the window and shapes the next token. The more irrelevant material is present, the more plausible-but-irrelevant continuations are on the menu.

The fix is not a bigger window. It’s a smaller, relevant one. Keep memory to a skeleton and pull detail on demand with imports. Isolate a hard subtask in a subagent with its own clean window so its noise never touches the main thread. Run /clear between unrelated tasks instead of dragging the tail of the last one into the next. Plan first, in a mode that can read and analyze but can’t write, so exploration never silently turns into edits.

Every one of those removes wrong answers from the space before the model picks.
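As a rough illustration of "smaller and relevant", here is a minimal sketch that keeps only the material sharing words with the current task, within a budget. The `Snippet` type, the word-overlap scoring, and the one-token-per-word estimate are all assumptions made for the example; a real agent's memory and import mechanism works differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    name: str  # hypothetical label, e.g. a file path or note title
    text: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly one token per whitespace-separated word.
    return len(text.split())


def select_context(task: str, snippets: list[Snippet], budget: int) -> list[Snippet]:
    """Keep snippets that share words with the task, best first, within budget."""
    task_words = set(task.lower().split())

    def overlap(snippet: Snippet) -> int:
        return len(task_words & set(snippet.text.lower().split()))

    chosen, used = [], 0
    for snippet in sorted(snippets, key=overlap, reverse=True):
        cost = estimate_tokens(snippet.text)
        if overlap(snippet) == 0 or used + cost > budget:
            continue  # irrelevant or too big: keep it out of the window
        chosen.append(snippet)
        used += cost
    return chosen


snippets = [
    Snippet("auth.md", "tokens expire after one hour and carry the user id"),
    Snippet("css.md", "buttons use eight pixel rounded corners"),
    Snippet("history", "earlier we discussed the photo upload flow at length"),
]
picked = select_context("write a function that signs tokens for a user", snippets, budget=20)
print([s.name for s in picked])  # prints ['auth.md']
```

The scoring is deliberately naive. The point is only that the filter runs before generation, so irrelevant material never gets a vote.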

Constrain the spec

Spec-driven development gets a bad reputation because most specs are essays. “A modern solution for efficiently managing user data with cutting-edge technology” is not a specification. It’s plausibility fuel: the model has seen that sentence ten thousand times, so it generates ten thousand more like it. The vaguer the spec, the larger the space of likely outputs, and the lower the odds any of them is the one you meant.

A spec that constrains looks like code. Data shapes with field types. Endpoint signatures. Component dimensions and loading behavior. Album: {id, name, created, photos: Photo[]} tells the model something. “Organize digital memories into albums” tells it nothing it didn’t already assume.
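One way to write that Album shape down so it can be checked is as typed data. This is a sketch under assumptions: the field types, the `Photo` fields, and the validation rules are invented for illustration, since the text only names the fields.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Photo:
    # Assumed fields: the article names Photo but does not define it.
    id: str
    url: str


@dataclass
class Album:
    # Field names come from the Album shape above; the types are assumptions.
    id: str
    name: str
    created: datetime
    photos: list[Photo] = field(default_factory=list)


def validate_album(album: Album) -> list[str]:
    """Return spec violations; an empty list means the album conforms."""
    problems = []
    if not album.id:
        problems.append("id must be non-empty")
    if not album.name.strip():
        problems.append("name must be non-empty")
    if len({p.id for p in album.photos}) != len(album.photos):
        problems.append("photo ids must be unique")
    return problems
```

A shape like this gives the model fewer degrees of freedom than the prose version, and `validate_album` gives you something to run against whatever it produces.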

And write the spec after you understand the system, not before. You cannot constrain what you cannot describe. Open the editor, build the skeleton by hand, find out which libraries you actually need and where the hard parts are. Then write the spec. It documents understanding. It does not manufacture it.

Verify the output

This is the half people skip, and it’s the one with the hardest evidence behind it. In the PDDL-INSTRUCT work, a model’s plan is checked transition by transition against a formal verifier, and when a step fails the model gets the specific reason (this precondition wasn’t satisfied) instead of a flat “wrong.” Trained with that loop, Llama-3-8B went from 28% to 94% on a standard planning benchmark. Same base model. The verifier and the structured feedback did the work.

The verifier doesn’t make the model smarter. It makes plausible-but-wrong detectable, and detectable is correctable.
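To make "checked transition by transition, with a specific reason" concrete, here is a toy sketch. It is not the verifier used in that research: the `Action` type, the fact strings, and the two block-stacking actions are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold before the action
    adds: frozenset           # facts the action makes true
    deletes: frozenset        # facts the action makes false


def verify_plan(state: set, plan: list, goal: set) -> Optional[str]:
    """Check each transition; return None if valid, else the specific failure."""
    current = set(state)
    for step, action in enumerate(plan, start=1):
        missing = action.preconditions - current
        if missing:
            return f"step {step} ({action.name}): precondition not satisfied: {sorted(missing)}"
        current = (current - action.deletes) | action.adds
    unmet = goal - current
    if unmet:
        return f"plan ends without reaching goal: {sorted(unmet)}"
    return None


pickup_a = Action(
    "pickup A",
    frozenset({"clear A", "hand empty"}),
    frozenset({"holding A"}),
    frozenset({"clear A", "hand empty"}),
)
stack_a_on_b = Action(
    "stack A on B",
    frozenset({"holding A", "clear B"}),
    frozenset({"A on B", "hand empty"}),
    frozenset({"holding A", "clear B"}),
)

start = {"clear A", "clear B", "hand empty"}
goal = {"A on B"}
print(verify_plan(start, [stack_a_on_b], goal))
# prints: step 1 (stack A on B): precondition not satisfied: ['holding A']
print(verify_plan(start, [pickup_a, stack_a_on_b], goal))  # prints: None
```

The failure message names the step and the missing fact, which is the part a flat "wrong" leaves out.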

The catch, and the cheap way around it

Here’s the honest limitation, and the researchers say it themselves: that approach needs an external verifier, and most domains don’t have one. There’s a formal checker for planning problems. There isn’t one for “is this the right React architecture.” For most of what you build, no formal verifier is coming.

So build the cheapest verifier you can, and attach one to every task.

This is where tight specs pay off a second time. Split the work into atomic tasks, and give each one a result you can actually check. Not “implement authentication.” That’s a system, and models are bad at systems. Instead: “write generateToken(userId) that returns a JWT with payload {userId, exp},” with a check that runs. One file, one function, one verifiable outcome. The check is your homemade verifier. It’s weaker than the formal kind and infinitely better than nothing.

Then automate the checks so they fire without you remembering. A hook that lints and formats after every write. A gate that refuses to act before a plan is approved. The point of the planning research isn’t the specific tool. It’s the shape: generate, check against ground truth, feed the failure back. You can reproduce that shape with a test runner and a git hook.
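The generate, check, feed-back shape can be sketched in a few lines. Everything named here is an assumption for illustration: `generate` stands in for whatever call drives your agent, and the `CHECKS` list assumes pytest is installed and is never run by the sketch itself.

```python
import subprocess
import sys

# Assumed check commands; swap in whatever linter or test runner the project uses.
CHECKS = [[sys.executable, "-m", "pytest", "-q"]]


def run_checks(commands: list) -> list:
    """Run each check command; return one feedback message per failure."""
    failures = []
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            output = (result.stdout + result.stderr).strip()
            failures.append(f"{' '.join(command)} failed:\n{output}")
    return failures


def generate_until_green(generate, commands: list, max_rounds: int = 3) -> bool:
    """Generate, check, feed failures back; stop when checks pass or rounds run out."""
    feedback = []
    for _ in range(max_rounds):
        generate(feedback)  # hypothetical agent call that writes files
        feedback = run_checks(commands)
        if not feedback:
            return True
    return False
```

The same `run_checks` function can back a git pre-commit hook: a script that ends with `sys.exit(1 if run_checks(CHECKS) else 0)` blocks the commit whenever a check fails, without anyone remembering to run it.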

The loop

Put together, the method is short.

1. Understand before you spec: prototype the skeleton by hand so you know what you’re constraining.
2. Shrink the context to what the current task needs and isolate the rest.
3. Write the spec as constraints, not prose: shapes, signatures, acceptance criteria.
4. Split into atomic tasks, each carrying its own check.
5. Automate the checks with plan gates and hooks so the rules enforce themselves.
6. Test the loop on one small feature before you trust it with the whole app.

None of this is about prompting harder. The teams shipping real software with AI aren’t the ones with clever prompts. They’re the ones who treat “plausible” as a raw input that needs a verifier bolted to it.

The model’s ceiling is rarely the thing holding you back. Your willingness to constrain it and check it is. A cheap verifier changes the output more than a better model would. And unlike a better model, you can build it this afternoon.
