Scaling Agents: Verification Loops

February 23rd, 2026


I know many engineers who haven't manually written a line of code in months[1]. Instead, they leverage agents like Claude Code or Codex, iterating with them on a feature until it is mergeable.


A typical workflow looks like this[2]:

  • The engineer scopes a feature
  • They iterate with an agent to produce a concrete plan
  • Once approved, the agent implements the plan
  • The engineer reviews the diff, checks behavior, and gives targeted feedback
  • Repeat until it's mergeable


Even with extremely detailed plans, the same themes keep coming up:

  • "Agents seem to be fantastic at some tasks, but terrible at others"
  • "I've used agents in personal projects, but they're horrible in our company's [brownfield] codebase"

It can feel like a slot machine. When the agent nails it, you've saved a ton of time. If not, you're stuck in a frustrating steering process, trying to pull the agent back towards a working implementation.


The solution?
Automating the feedback loops.






I) Verification

By "verification", I'm referring to the agent's ability to check its outputs, and self-correct if necessary.


Let's take a look at an example. Below, I pose a question to "Kimi 2.5 Instant" (a small, non-thinking model). The model should count the number of words in a sentence, then return the answer (10 words).


Without Verification
User
How many words are in the sentence "The brown fox jumps over the big, fluffy, lazy dog". Only respond with the answer.
Model
9

With Verification
User
How many words are in the sentence "The brown fox jumps over the big, fluffy, lazy dog".
Model

Word count: 9 words


Here they are:

  1. The
  2. brown
  3. fox
  4. jumps
  5. over
  6. the
  7. big,
  8. fluffy,
  9. lazy
  10. dog

Wait, there are actually 10 words in the sentence. (The commas after "big" and "fluffy" are punctuation marks attached to the words, not separate words themselves.)


In the first case, Kimi went with its gut... which was wrong.


The second case exemplifies how most modern thinking-based LLMs work: the model uses a simple verification strategy - explicitly numbering each word - to identify its mistake and calculate the correct answer.
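Note that this particular task also has a trivial deterministic verifier - the kind of check that a single tool call makes instant. A sketch (str.split keeps punctuation attached to words, matching the counting rule above):

sentence = "The brown fox jumps over the big, fluffy, lazy dog"
# Splitting on whitespace keeps "big," and "fluffy," attached to their words.
print(len(sentence.split()))  # 10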



II) Programming

Coding agents have this exact same problem.


Luckily, software engineers already have robust, automated tools for verifying program correctness. Most teams have a CI process which runs a variety of checks on each change:

  • Code Compilation
  • Type Checking
  • Linting
  • Testing
  • ...

To be effective, your agents will need access to all of these tools. Let's consider a common scenario where a coding agent is implementing a new function:


>Write a create_user function for the User type in src/users.py
Write(src/users.py)
from dataclasses import dataclass
from typing import Literal

@dataclass
class User:
    name: str
    email: str
    role: Literal["admin", "user"]

def create_user(name: str, email: str) -> User:
    return User(name=name, email=email)
Bash(ty check src/users.py)

error[missing-argument]: No argument provided for required parameter `role`
--> src/users.py:11:12
|
11 | return User(name=name, email=email)
| ^^^^ Missing required argument `role`

Edit(src/users.py)
def create_user(name: str, email: str) -> User:
-    return User(name=name, email=email)
+    return User(name=name, email=email, role="user")
Bash(ty check src/users.py)

All checks passed!

Done. Created the create_user function in src/users.py.

After implementing the code, the agent used ty[3] to run type-checking on the codebase and identified a bug. It then used the information from the error message to immediately self-correct.


This is the reason that agents like Claude Code have caught on like wildfire.


The "harness" facilitates these automated feedback loops, where the agent can:

  • make changes
  • run the verifiers
  • make changes based on the feedback
  • run the verifiers
  • ...

This enables agents to make progress on much longer tasks, without needing human intervention.
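To make the shape of that loop concrete, here's a minimal sketch. run_agent_step and scripts/verify.py are hypothetical stand-ins (for the harness's model call and your project's verifier entrypoint, respectively) - no particular harness's API is implied:

import subprocess

def run_agent_step(task: str, feedback: str) -> None:
    """Hypothetical stand-in: ask the model to edit files, given the
    task and the most recent verifier output."""
    ...

def agent_loop(task: str, max_iterations: int = 10) -> bool:
    feedback = ""
    for _ in range(max_iterations):
        run_agent_step(task, feedback)
        # Run the project's verifiers (see the entrypoint sketch in section III).
        result = subprocess.run(
            ["python", "scripts/verify.py"], capture_output=True, text=True
        )
        if result.returncode == 0:
            return True                           # all verifiers passed
        feedback = result.stdout + result.stderr  # errors feed the next step
    return False                                  # stuck: escalate to a human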



This also reinforces how important effective verifiers are. Typed languages like Rust, TypeScript, and typed Python have continued to gain popularity because their tooling provides such effective feedback for LLMs.



III) Infrastructure


One thing I'll reiterate: the agent needs to be able to run independently.


This means:

  • It can spin up your stack from a clean checkout
  • It can run your verifiers (lint, types, tests) end-to-end
  • It can iterate without you babysitting auth keys, env vars, or one-off setup steps

Your infrastructure should be runnable in a container. Most modern harnesses are adopting cloud-hosted execution platforms ("Claude Code Web", "Codex Web", "Cursor Cloud Agents", ...), so I expect this to become even more important for scaling agents in the future. If you haven't done this already, I'd recommend:


  • Adopt a basic monorepo folder structure
  • Set up a devcontainer
  • Have a human walk through the flow on a new machine, and document/codify the setup steps (tooling, permissions, ...)
  • Verify that your agent can boot in a container, and run all of your verification tools
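
That last step is easier if there's exactly one command to run. A minimal sketch of a shared verifier entrypoint - the commands are placeholders for your own toolchain:

#!/usr/bin/env python3
"""scripts/verify.py - run every verifier with one command (humans, CI, and agents)."""
import subprocess
import sys

# Placeholder commands - substitute your project's actual tools.
CHECKS = [
    ("lint", ["ruff", "check", "."]),
    ("types", ["ty", "check", "src/"]),
    ("tests", ["pytest", "-q"]),
]

exit_code = 0
for name, cmd in CHECKS:
    print(f"--- {name}: {' '.join(cmd)}", flush=True)
    if subprocess.run(cmd).returncode != 0:
        exit_code = 1  # keep going, so the agent sees every failure at once

sys.exit(exit_code)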

Once you have a basic loop working, it becomes important to focus on your use-case. The biggest decision: do you point your agent at the full stack, or carve out smaller surfaces?


A few heuristics:

  • If you're building an isolated module (a parser, a CLI, a pure-data pipeline), let the agent loop on the full thing - the verifier signal is tight.
  • If you're working in a sprawling brownfield codebase, the agent will drown. Decompose first: split the task into smaller, contained changes with their own verifiers (a single service, a single endpoint, a single component).
  • If your stack involves a UI, give the agent eyes - Playwright and screenshot tools dramatically improve its ability to verify visual changes (see the sketch below).

The general rule: the smaller the surface, the tighter the feedback loop, the better the agent performs.
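
On that last heuristic: "eyes" can be as simple as a screenshot command the agent can invoke after a change. A sketch using Playwright's Python API (the URL and output path are placeholders for your dev server and artifact location):

from playwright.sync_api import sync_playwright

# Capture a full-page screenshot for the agent to inspect after a UI change.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # placeholder: your local dev server
    page.screenshot(path="after_change.png", full_page=True)
    browser.close()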



IV) Benchmarks & Evals


Now, you've got your agent-in-a-box which can effectively write and test code. This is great for CRUD-style apps with deterministic inputs and outputs.


Unfortunately, most production apps have to deal with real-world workloads. Unit and integration tests often fail to capture two important criteria:

  • Performance on real-world data
  • Quality on real-world data

There are a few strategies for closing this gap. The two I'd highlight:


Benchmarks


Performance regressions are easy for agents to introduce and easy for humans to miss. Agents will happily wrap a hot path in a triple-nested loop and call it done - if your tests pass, the verifier signal says "ship it".


A few benchmarks worth automating:

  • Boot time / cold start
  • First contentful paint (use react-scan or similar for client perf)
  • Critical-path latency (p50, p95, p99)
  • Memory usage on representative payloads
  • ...

Wire these into the agent's verification loop the same way you'd wire in tests. If a change pushes a benchmark past a threshold, the agent sees it and self-corrects.
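
As a sketch of what that looks like for critical-path latency - the endpoint, sample count, and budget below are placeholders, not recommendations:

import statistics
import sys
import time
import urllib.request

def measure_latency_ms(url: str, samples: int = 50) -> list[float]:
    """Time repeated requests against a locally-running endpoint."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        urllib.request.urlopen(url).read()
        timings.append((time.perf_counter() - start) * 1000)
    return timings

timings = measure_latency_ms("http://localhost:8000/search")  # placeholder endpoint
p95 = statistics.quantiles(timings, n=20)[18]  # 19 cut points; index 18 is p95
print(f"p95: {p95:.1f}ms")
if p95 > 250:  # placeholder budget - past the threshold, the verifier fails
    sys.exit(1)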


Evals


Evals are the answer for anything non-deterministic - search, chat, tool use, anything where "correct" is fuzzy.


The pattern is straightforward:

  • Curate a golden dataset of representative inputs
  • Define grading criteria (LLM-as-judge, exact match, custom heuristics)
  • Run the eval on every change, score the diff

Evals have become table stakes for any team shipping LLM-powered features. They're also one of the highest-leverage things an agent can run for itself - if your eval suite is fast and your dataset is representative, the agent gets a tight quality signal on every iteration.


If you're new to this, autoevals is a solid starting point - a library of pre-built graders for common eval patterns (factuality, JSON validity, embedding similarity, ...).
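
To make the pattern concrete, here's a minimal sketch using autoevals' Factuality grader (an LLM-as-judge, so it needs model API credentials). The dataset, my_app, and the threshold are placeholders:

from autoevals import Factuality

def my_app(question: str) -> str:
    """Hypothetical stand-in for the LLM-powered feature under test."""
    return "China"

# Placeholder golden dataset - in practice, curated from real traffic.
dataset = [
    {"input": "Which country has the highest population?", "expected": "China"},
]

grader = Factuality()
scores = []
for example in dataset:
    output = my_app(example["input"])
    result = grader(input=example["input"], output=output, expected=example["expected"])
    scores.append(result.score)

average = sum(scores) / len(scores)
print(f"eval score: {average:.2f}")
assert average >= 0.8, "quality regression"  # placeholder threshold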



V) Lightning Round


A few common questions I get:


How do you prevent agents from making sweeping, unrelated changes?
Scope the task. Tight plan, tight diff, tight verifiers. If the agent is touching files you didn't expect, that's a signal the plan was too vague.


How do you prevent agents from gaming the verifiers?
The classic failure mode: agent deletes the failing test, marks the task done. Two defenses: (1) restrict which files the agent can edit, and (2) run a meta-check that fails if test count drops or coverage tanks.
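
A sketch of defense (2), assuming a baseline test count committed to the repo (the baseline file and the regex over pytest's summary line are this sketch's assumptions, not a standard tool):

#!/usr/bin/env python3
"""meta_check.py - fail if the collected test count drops below the baseline."""
import pathlib
import re
import subprocess
import sys

# Count collected tests without running them.
out = subprocess.run(
    ["pytest", "--collect-only", "-q"], capture_output=True, text=True
).stdout
match = re.search(r"(\d+) tests? collected", out)
count = int(match.group(1)) if match else 0

baseline_file = pathlib.Path("test_baseline.txt")  # assumed committed to the repo
baseline = int(baseline_file.read_text())

if count < baseline:
    print(f"test count dropped: {count} < baseline {baseline}")
    sys.exit(1)
baseline_file.write_text(str(count))  # ratchet the baseline upward on success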


How do you prevent agents from drowning in long contexts?
Decompose. If a task can't fit in a single, well-scoped plan, it's actually multiple tasks. Spawn separate agent runs with clean context for each subtask.
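
In practice, this can be as simple as a driver script that launches one fresh agent process per subtask. A sketch using Claude Code's non-interactive print mode (claude -p); the subtasks themselves are placeholders:

import subprocess

# Placeholder decomposition - each subtask gets its own plan and clean context.
subtasks = [
    "Implement the create_user endpoint per plan.md, section 1",
    "Wire the endpoint into the router per plan.md, section 2",
]

for task in subtasks:
    # `claude -p` runs Claude Code non-interactively; substitute your harness's CLI.
    subprocess.run(["claude", "-p", task], check=True)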


When should you NOT use an agent?

  • When you don't have working verifiers (you're flying blind)
  • When the task requires deep institutional context the agent can't access
  • When you're exploring a problem and don't yet know what "correct" looks like


VI) Recap

Effective agents need:

  • A goal
  • A feedback loop (are they getting closer to their goal?)

By providing tools for verification, agents can automate this feedback loop, which massively improves their reliability:

  • Linter, type-checker, fast test suites
  • The ability to run your "stack" locally, with debug logs
  • Utilities like playwright to let your agent "see"
  • ...

Please note that this makes a very important assumption:

YOU are coming up with a detailed plan.
YOU still have intimate knowledge of the codebase, patterns, best-practices, ...


The step beyond this is the promise of the self-driving codebase[4] - "taking your hands off the wheel":

  • What if you don't know which code has to change?
  • What if you weren't even allowed to see the code?
  • How would new functionality be introduced into that system?
  • How would you maintain quality over time?

I'm planning to follow up with two posts on this topic:

  • Spec Driven Development (Coming soon...)
  • Deslopification (Coming soon...)

Meanwhile, some recommended reading:




[1] As of February 2026, it has been nearly a year since Claude Code was released. This is about when I started to see some programmers transition from being "in-the-loop" (e.g. Cursor + Tab) towards fully-agentic workflows.

[2] Boris Tane recently shared a great writeup of this kind of workflow.

[3] Edward Li had a great post from last year (2025), comparing various type-checkers.

[4] Wilson Lin has written a few fascinating posts on the cursor blog, including this one about Self Driving Codebases.