Written February 23rd, 2026.
I know many engineers who haven't manually written a line of code in months[1]. Instead, they leverage agents like Claude Code or Codex, iterating with them on a feature until it is mergeable.
A typical workflow looks like this[2]:
Even with extremely detailed plans, the same themes keep coming up:
It can feel like a slot machine. When the agent nails it, you've saved a ton of time. If not, you're stuck in a frustrating steering process, trying to pull the agent back towards a working implementation.
The solution?
Automating the feedback loops.
By "verification", I'm referring to the agent's ability to check its outputs, and self-correct if necessary.
Let's take a look at an example. Below, I pose a question to "Kimi 2.5 Instant" (a small, non-thinking model). It should count the number of words in a sentence, then return the answer (10 words).
Word count: 9 words
Here they are:
Wait, there are actually 10 words in the sentence. (The commas after "big" and "fluffy" are punctuation marks attached to the words, not separate words themselves.)
In the first case, Kimi went with its gut... which was wrong.
The second case exemplifies how most modern thinking-based LLMs work: the model uses a simple verification strategy - explicitly numbering each word - to identify its mistake and calculate the correct answer.
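The same check can be made fully deterministic with a few lines of code. Here's a minimal sketch - the sentence below is a placeholder with the same shape as the one in the prompt (which isn't reproduced here), and splitting on whitespace stands in for the model's "number each word" strategy:

```python
# Placeholder sentence with 10 whitespace-separated words; the commas stay
# attached to their words, just like in the example above.
sentence = "The big, fluffy dog ran across the yard chasing squirrels."

# Enumerate each word explicitly -- the same strategy the thinking model used.
words = sentence.split()
for i, word in enumerate(words, start=1):
    print(f"{i}. {word}")

print(f"Word count: {len(words)} words")  # -> Word count: 10 words
```

A three-line script never "goes with its gut" - which is exactly the property we want from a verifier.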
Coding agents have this exact same problem.
Luckily, software engineers already have robust, automated tools for verifying program correctness. Most teams have a CI process which runs a variety of checks on each change:
To be effective, your agents will need access to all of these tools. Let's consider a common scenario where a coding agent is implementing a new function:
```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class User:
    name: str
    email: str
    role: Literal["admin", "user"]

def create_user(name: str, email: str) -> User:
    return User(name=name, email=email)
```
```
error[missing-argument]: No argument provided for required parameter `role`
  --> src/users.py:11:12
   |
11 |     return User(name=name, email=email)
   |            ^^^^ Missing required argument `role`
```
```diff
 def create_user(name: str, email: str) -> User:
-    return User(name=name, email=email)
+    return User(name=name, email=email, role="user")
```

All checks passed!
After implementing the code, the agent used ty[3] to run type-checking on the codebase, and identified a bug. It used the information from the error message to immediately self-correct.
This is the reason that agents like Claude Code have caught on like wildfire.
The "harness" facilitates these automated feedback loops, where the agent can:
This enables agents to make progress on much longer tasks, without needing human intervention.
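To make that loop concrete, here's a minimal sketch in Python. The `run_agent_step` stub and the verifier commands are illustrative placeholders, not any particular harness's API - the point is just the shape: edit, verify, feed failures back, repeat.

```python
import subprocess

# Illustrative verifier commands -- swap in whatever your CI actually runs.
VERIFIERS = [
    ["ty", "check"],   # type-checking (assumes ty is installed)
    ["pytest", "-q"],  # unit tests
]

def run_agent_step(task: str, feedback: str) -> None:
    """Placeholder for one agent iteration (e.g. an API call to your model).

    In a real harness this is where the agent reads the verifier feedback
    and edits files; it's stubbed out here so the loop structure is clear.
    """
    ...

def run_verifiers() -> list[str]:
    """Run each verifier and collect the output of any that fail."""
    failures = []
    for cmd in VERIFIERS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(result.stdout + result.stderr)
    return failures

def agent_loop(task: str, max_iterations: int = 5) -> bool:
    """Let the agent iterate until every verifier passes (or we give up)."""
    feedback = ""
    for _ in range(max_iterations):
        run_agent_step(task, feedback)
        failures = run_verifiers()
        if not failures:
            return True  # all checks passed; the change is ready for review
        feedback = "\n\n".join(failures)
    return False
```

Every modern harness implements some variant of this loop; the differences are mostly in how the feedback is summarized and how much of the environment the agent controls.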
This also reinforces how important effective verifiers are. Statically typed languages like Rust and TypeScript - and Python with type annotations - have continued to gain popularity because their tooling gives LLMs such an effective verification signal.
One thing I'll reiterate: the agent needs to be able to run independently.
This means:
Your infrastructure should be runnable in a container. Most modern harnesses now offer cloud-hosted platforms ("Claude Code Web", "Codex Web", "Cursor Cloud Agents", ...), so I expect this to become even more important for scaling agents in the future. If you haven't done this already, I'd recommend:
Once you have a basic loop working, it becomes important to focus on your use-case. The biggest decision: do you point your agent at the full stack, or carve out smaller surfaces?
A few heuristics:
The general rule: the smaller the surface, the tighter the feedback loop, the better the agent performs.
Now, you've got your agent-in-a-box which can effectively write and test code. This is great for CRUD-style apps with deterministic inputs and outputs.
Unfortunately, most production apps have to deal with real-world workloads. Unit and integration tests often fail to capture two important criteria:
There are a few strategies for closing this gap. The two I'd highlight:
Benchmarks
Performance regressions are easy for agents to introduce and easy for humans to miss. Agents will happily wrap a hot path in a triple-nested loop and call it done - if your tests pass, the verifier signal says "ship it".
A few benchmarks worth automating:
Wire these into the agent's verification loop the same way you'd wire in tests. If a change pushes a benchmark past a threshold, the agent sees it and self-corrects.
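For a sense of what "wire these in" can look like, here's a minimal sketch: a timing check that exits non-zero when a (hypothetical) hot path regresses past a threshold, so the agent sees it exactly like a failing test. Tools like pytest-benchmark or hyperfine give you this with much better statistics; this just shows the shape of the signal.

```python
import statistics
import time

# Hypothetical hot path -- replace with the function you actually care about.
def hot_path(items: list[int]) -> int:
    return sum(x * x for x in items)

def benchmark(fn, *args, runs: int = 50) -> float:
    """Return the median wall-clock time (in seconds) over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Threshold chosen for illustration; in practice you'd track a baseline per
# benchmark and fail on regressions relative to it.
THRESHOLD_SECONDS = 0.01

if __name__ == "__main__":
    elapsed = benchmark(hot_path, list(range(100_000)))
    if elapsed > THRESHOLD_SECONDS:
        raise SystemExit(
            f"Benchmark regression: hot_path took {elapsed:.4f}s "
            f"(threshold {THRESHOLD_SECONDS}s)"
        )
    print(f"hot_path: {elapsed:.4f}s (OK)")
```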
Evals
Evals are the answer for anything non-deterministic - search, chat, tool use, anything where "correct" is fuzzy.
The pattern is straightforward:
Evals have become table stakes for any team shipping LLM-powered features. They're also one of the highest-leverage things an agent can run for itself - if your eval suite is fast and your dataset is representative, the agent gets a tight quality signal on every iteration.
If you're new to this, autoevals is a solid starting point - a library of pre-built graders for common eval patterns (factuality, JSON validity, embedding similarity, ...).
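As a concrete (if simplified) illustration of that loop, here's a tiny eval harness with a hand-rolled grader. The dataset, `run_task` stub, and pass threshold are all placeholders for whatever your feature actually does; in practice you'd swap in a grader from a library like autoevals or a proper LLM-as-judge.

```python
# Each case pairs an input with what a "good" answer should contain.
# These rows are placeholders -- a real dataset would come from production
# traffic or hand-labeled examples.
DATASET = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]

def run_task(prompt: str) -> str:
    """Placeholder for the LLM-powered feature under test."""
    ...
    return ""

def grade(output: str, expected: str) -> float:
    """Trivially simple grader: 1.0 if the expected answer appears.

    A real suite would use richer graders (exact match, embedding
    similarity, LLM-as-judge, ...).
    """
    return 1.0 if expected.lower() in output.lower() else 0.0

def run_evals(threshold: float = 0.9) -> bool:
    scores = [grade(run_task(case["input"]), case["expected"]) for case in DATASET]
    mean_score = sum(scores) / len(scores)
    print(f"eval score: {mean_score:.2f} over {len(scores)} cases")
    return mean_score >= threshold
```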
A few common questions I get:
How do you prevent agents from making sweeping, unrelated changes?
Scope the task. Tight plan, tight diff, tight verifiers. If the agent is touching files you didn't expect, that's a signal the plan was too vague.
How do you prevent agents from gaming the verifiers?
The classic failure mode: agent deletes the failing test, marks the task done. Two defenses: (1) restrict which files the agent can edit, and (2) run a meta-check that fails if test count drops or coverage tanks.
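For (2), the meta-check can be as simple as comparing the number of collected tests against a baseline committed to the repo. A sketch, assuming pytest (the baseline file name is just a convention for this example):

```python
import subprocess
import sys
from pathlib import Path

# The baseline count lives in the repo and is only updated intentionally,
# by a human.
BASELINE_FILE = Path("tests/.test_count_baseline")

def count_tests() -> int:
    """Count collected tests via pytest's quiet collection output."""
    result = subprocess.run(
        ["pytest", "--collect-only", "-q"],
        capture_output=True,
        text=True,
    )
    # In -q mode, each collected test is printed as one node ID per line
    # (e.g. "tests/test_users.py::test_create_user").
    return len([line for line in result.stdout.splitlines() if "::" in line])

if __name__ == "__main__":
    baseline = int(BASELINE_FILE.read_text().strip())
    current = count_tests()
    if current < baseline:
        sys.exit(f"Test count dropped: {current} < baseline {baseline}")
    print(f"Test count OK: {current} (baseline {baseline})")
```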
How do you prevent agents from drowning in long contexts?
Decompose. If a task can't fit in a single, well-scoped plan, it's actually multiple tasks. Spawn separate agent runs with clean context for each subtask.
When should you NOT use an agent?
Effective agents need:
By providing tools for verification, you let agents automate this feedback loop, which massively improves their reliability:
Please note that this makes a very important assumption:
YOU are coming up with a detailed plan.
YOU still have intimate knowledge of the codebase, patterns, best-practices, ...
The step beyond this is the promise of the self-driving codebase[4] - "taking your hands off the wheel":
I'm planning to follow up with two posts on this topic:
Meanwhile, some recommended reading:
[1] As of March 2026, it has been nearly a year since Claude Code was released. This is about when I started to see some programmers transition from being "in-the-loop" (e.g. Cursor + Tab) towards fully-agentic workflows.
[2] Boris Tane recently shared a great writeup of this kind of workflow.
[3] Edward Li had a great post from last year (2025), comparing various type-checkers.
[4] Wilson Lin has written a few fascinating posts on the cursor blog, including this one about Self Driving Codebases.