The Elastic Loop
The Elastic Loop · Part two

Why

Here is the question I would put to anyone who just learned they can stretch a loop: why would you? Running an agent loose for hours means trading control you are used to for a payoff you have to take partly on trust. So is the trade worth it, and when? My honest answer is that it comes down to one thing: how well you can judge what comes back. This page is the argument for why that became the thing it all rests on. Part of it is economic: build got cheap, and the cost moved to the ends of the loop. Part of it is technical: agentic engineering is a search you evaluate, not a spec you transcribe. The reason I trust it now and did not months ago is that both parts have measurement behind them.

The middle shrinks, the ends get expensive

Build gets cheaper. Product intent and evaluation become the constraint.

When AI accelerates analysis, specification, implementation, testing, and review, the bottlenecks migrate to the ends of the loop.

The middle of the loop, the part we spent many, many years optimizing with frameworks and ceremonies and career ladders, shrinks. The edges become the expensive part.

You have probably felt this already. An agent hands you a finished implementation in minutes, and you spend the next hour figuring out whether it is the thing you actually wanted. That hour was the work! The typing was never the constraint; we just could not see that clearly while typing was slow.

Agentic engineering is machine learning

The technical half of the argument comes from François Chollet, in a post on X from May 2026. His frame: agentic engineering is a form of machine learning. The engineer defines the goal, the constraints, and the search space. An optimization process generates code. The result should be treated as a blackbox artifact whose behavior and generalization you evaluate empirically.

Three terms carry that frame. The search space is the set of possible solutions a task allows. An agent does not translate a spec line by line; it samples from this space, pulled toward one solution by your context, or proposing a few in a planning step. Generating many on purpose and selecting the best is a discipline you opt into. There are many valid ways to implement a story; the agent is choosing among them, not transcribing the one right answer. A blackbox artifact is something you judge by how it behaves, not by reading how it was made, like a dependency you trust through its tests and its behavior rather than by reading its source. Generalization is whether something keeps working on cases it was not specifically built or tested against: does the fix hold for the inputs you did not think of, not just the one in the ticket?

That quietly reclassifies the job. Classic spec-driven development treats software as deterministic translation: requirements in, code out, and if the output is wrong, the spec was wrong. Plenty of teams are retreating to that little waterfall right now, pinning their hopes on better specs with agents doing the typing. The reflex even has a Radar entry: ThoughtWorks lists spec-driven development in Assess, caveats attached, and the full sorting of where specs help and where they harm lives on Grading. Agentic engineering treats software as something else: an optimization problem with a defined search space, a search process, and an evaluation function. Those are two different professions. Many of us, myself included, come from the deterministic one, and the reflexes we built there do not transfer cleanly. What does transfer is judgment about whether a result is any good.

That reframing also moves the target you optimize for. The reflex with any productivity tool is to chase output: more pull requests, more velocity on the dashboard. But if building is a search process, throughput is the wrong objective function. You would never rate a machine-learning model by how many predictions it emits per second, only by how well its output holds up against an evaluation. Optimize agentic work for raw productivity and the search converges on more of everything, which is the exact shape of sprawl and slop. The objective worth optimizing is solution quality, and that is something you have to be able to measure, which brings me to the consequence.

Outcome grading is the new specification

If the build step is a search process, the leverage moves to whoever can grade what comes out of it. Grade the outcome precisely and you can let machines iterate against that grade, for hours, in parallel, while you do something else. Tests are the primitive form of this (they check known cases). The broader form covers classes of behavior: rubrics, scenarios, golden examples, counterexamples, LLM judges as adversarial reviewers, human review, production signals.
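To make that concrete, here is a minimal sketch in Python of grading several candidates against a rubric and keeping the best one. Everything in it is an assumption for illustration: the `Check` type, the `grade` and `select_best` helpers, the rubric items, and the two candidate texts are hypothetical, and a real grader covers far more than substring checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """One rubric item (hypothetical shape): a name, a weight, a predicate."""
    name: str
    weight: float
    passes: Callable[[str], bool]


def grade(candidate: str, rubric: list[Check]) -> float:
    """Weighted share of rubric checks the candidate passes, in [0, 1]."""
    total = sum(check.weight for check in rubric)
    earned = sum(check.weight for check in rubric if check.passes(candidate))
    return earned / total


def select_best(candidates: list[str], rubric: list[Check]) -> str:
    """Pick the candidate with the highest grade (first one wins on ties)."""
    return max(candidates, key=lambda text: grade(text, rubric))


# Illustrative rubric for a refund-policy snippet; the items are made up.
rubric = [
    Check("mentions refund window", 2.0, lambda t: "30 days" in t),
    Check("names a contact channel", 1.0, lambda t: "support@" in t),
    Check("stays under 200 characters", 1.0, lambda t: len(t) <= 200),
]

candidates = [
    "We value your feedback and strive to help.",
    "Refunds are available within 30 days. Write to support@example.com.",
]

best = select_best(candidates, rubric)
print(best)
```

The generic first candidate passes only the length check, so it loses to the specific one. The point of the shape is that the grade, not the generation step, decides what survives.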

One caveat I want to plant early, because the whole framework gets dangerous without it: a rubric is not automatically truth. Good outcome graders are engineering artifacts in their own right, calibrated with counterexamples, failure taxonomies, human review, and drift control. Skip that calibration and the agent will optimize toward output that sounds plausible to the grader, and the grader will applaud. You will have built a machine for generating confident mediocrity, with charts that say everything is fine.
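One way to picture that calibration step, as a sketch under stated assumptions: run the grader over a small labeled set that includes deliberate counterexamples, and look at what it wrongly accepts and wrongly rejects. The `naive_grader`, the `calibrate` helper, the threshold, and the labeled texts below are all hypothetical. The grader is weak on purpose so the failure is visible.

```python
from typing import Callable

Grader = Callable[[str], float]


def naive_grader(text: str) -> float:
    """Hypothetical weak grader: rewards confident-sounding vocabulary."""
    buzzwords = ("robust", "seamless", "scalable")
    hits = sum(word in text.lower() for word in buzzwords)
    return hits / len(buzzwords)


def calibrate(
    grader: Grader, labeled: list[tuple[str, bool]], threshold: float
) -> dict[str, list[str]]:
    """Compare grader verdicts with human labels and collect the misses."""
    false_accepts = [
        text for text, is_good in labeled
        if not is_good and grader(text) >= threshold
    ]
    false_rejects = [
        text for text, is_good in labeled
        if is_good and grader(text) < threshold
    ]
    return {"false_accepts": false_accepts, "false_rejects": false_rejects}


# Human-labeled examples; the second one is the deliberate counterexample.
labeled = [
    ("Retries are capped at 3 with exponential backoff.", True),
    ("A robust, seamless, scalable solution for all needs.", False),
    ("Robust retries capped at 3, scalable to 10 workers, seamless failover.", True),
]

report = calibrate(naive_grader, labeled, threshold=0.5)
print(report)
```

Here the grader applauds the empty counterexample and rejects the plain but specific answer. Both misses are the signal that the grader, not the agent, needs work first.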

Only retained feedback counts

There is now a precise measurement language for this shift. Zhang et al. (2026) propose Effective Feedback Compute, which counts neither raw tokens nor tool calls nor wall time. It counts only feedback that is informative, valid, non-redundant, and retained in the agent’s state for later decisions. Feedback that changes nothing downstream is noise, however expensive it was to produce.
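The four criteria can be read as a filter. The sketch below is a toy counter, not the paper's definition or formula: it assumes each feedback event has already been labeled on the four criteria, and the `FeedbackEvent` fields, the `effective_feedback` helper, and the sample events are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One piece of feedback in a loop (hypothetical, pre-labeled shape)."""
    signal: str        # normalized content of the feedback
    informative: bool  # did it tell the agent something it could act on?
    valid: bool        # was it actually correct?
    retained: bool     # did it end up in state that later decisions read?


def effective_feedback(events: list[FeedbackEvent]) -> list[FeedbackEvent]:
    """Keep events that are informative, valid, retained, and not repeats."""
    seen: set[str] = set()
    kept: list[FeedbackEvent] = []
    for event in events:
        if not (event.informative and event.valid and event.retained):
            continue
        if event.signal in seen:  # same signal already counted: redundant
            continue
        seen.add(event.signal)
        kept.append(event)
    return kept


events = [
    FeedbackEvent("test_login fails on empty password", True, True, True),
    FeedbackEvent("test_login fails on empty password", True, True, True),  # repeat
    FeedbackEvent("lint: line too long", False, True, True),         # not informative
    FeedbackEvent("flaky timeout in CI", True, False, True),         # not valid
    FeedbackEvent("reviewer asked for pagination", True, True, False),  # not retained
    FeedbackEvent("rubric: missing error state", True, True, True),
]

print(len(events), "events produced,", len(effective_feedback(events)), "counted")
```

Six events were paid for and two count. The gap between those numbers is the thing worth watching.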

That is the empirical version of the backpressure thesis. Backpressure is the resistance an agent works against while it builds, well before any review at the end: a failing test or type error on the technical side, an acceptance scenario or rubric on the product side. Some of it reaches the agent automatically as a signal in the loop; some it imposes on itself by following a discipline set at the start, like writing the failing test first and working until it goes green. The more of it you can encode, the longer you can let the loop run. A red build is backpressure; the agent reads it and fixes the code. So are acceptance criteria: write them well and the agent works against your definition of good as it goes, instead of a person catching the miss at the end.

Bad loops burn compute; good loops store useful pressure as changed plans, updated memory, better tests, sharper rubrics, and rejected variants that stay rejected. The line I would put on a wall:

The unit of agentic progress is retained feedback, not generated output.

Feedback only counts when it changes a later decision: an updated plan, a sharper test, a rejected option that stays rejected. The rest is noise, however expensive it was to produce. A review comment matters only if it changes the next commit.
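A small sketch of what "rejected variants that stay rejected" can look like in a loop, assuming a deterministic stand-in for the agent. The `run_loop` helper, the `propose` and `check` functions, and the three string-normalizing variants are hypothetical. A real agent proposes code and a real check is a test suite or rubric.

```python
from typing import Callable, Optional

# Hypothetical variants an agent might try for "normalize a name".
variants: dict[str, Callable[[str], str]] = {
    "lower_only": lambda s: s.lower(),
    "strip_only": lambda s: s.strip(),
    "strip_lower": lambda s: s.strip().lower(),
}


def propose(rejected: set[str]) -> Optional[str]:
    """Stand-in for the agent: offer the first variant not yet rejected."""
    for name in variants:
        if name not in rejected:
            return name
    return None


def check(name: str) -> bool:
    """The backpressure: a failing or passing test for the variant."""
    return variants[name]("  Hello World ") == "hello world"


def run_loop(
    propose: Callable[[set[str]], Optional[str]],
    check: Callable[[str], bool],
    max_steps: int,
) -> tuple[Optional[str], set[str]]:
    """Iterate until a variant passes; remember every rejection."""
    rejected: set[str] = set()
    for _ in range(max_steps):
        candidate = propose(rejected)
        if candidate is None:      # search space exhausted
            break
        if check(candidate):
            return candidate, rejected
        rejected.add(candidate)    # retained feedback: never retried
    return None, rejected


winner, rejected = run_loop(propose, check, max_steps=10)
print(winner, sorted(rejected))
```

The `rejected` set is the retained part. Drop it and the loop can spend every step re-proposing the same failing variant, which is compute burned with nothing stored.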

The lever sits in the harness

Theory is nice, but where is the measurement? clawbench, the agent benchmark from the OpenClaw ecosystem, measures the combination that actually ships: harness plus config plus model, analyzed through traces rather than isolated model scores. One of its findings: swapping the plugin config moves scores ten times more than swapping the model.

Ten times! While the industry refreshes leaderboards and debates which frontier model to standardize on, or whether SWE-Bench was fine with slop all along, the variable that dominates the result is the loop infrastructure around the model. Chollet supplies the theory (it is a search process, so the search setup matters), clawbench supplies the measurement. The lever sits in the harness, not in the model.

If that has the ring of something you have heard before, it should. Long before software, factories learned that the output of a line depends less on any single machine than on the system around it: how the process is maintained, measured, and corrected as it runs. The Toyota Production System made a named discipline of it decades ago, and the agentic version is the same move on a new substrate. The full mapping, from statistical process control through to continuous improvement, lives on Harness.

The explosion and the slow collision

The short version: sprawl is the explosion, slop is the slow collision, both are what a search process does without backpressure. Here is the deeper anatomy.

For sprawl, the pressure reactor image holds up surprisingly well when you push on it. Agents generate pressure: output, speed, options. The harness is the containment wall, and it is the wall, not fear of the explosion, that makes a productive reaction possible at all. Outcome grading is the pressure sensors, telling you what is happening inside. Approval gates are the safety valves. Rollback is the emergency shutdown. Observability is the cooling. None of these components is interesting alone; a reactor is the whole assembly or it is a crater.

Slop needs the front of the loop to explain, which surprised me when I first worked through it. Slop is sampling from the statistical middle of everything the model has ever read: output that converges on the bland average, plausible, smooth, and indistinguishable from anyone else’s, like the onboarding text that reads like every onboarding text ever written. The agent lands there because it lacks the specific context that would pull it out: your customers, your constraints, your domain’s edge cases. Which means slop gets squeezed from both sides. Context at the front keeps the agent from starting in the middle at all (the grounding step in the formula), and product and domain backpressure at the back pulls the output out of it.

And here is the uncomfortable dependency between the two: backpressure without context is a filter without signal. Your graders either reject endlessly, which looks exactly like sprawl, or they wave plausible-generic output through because they were calibrated on the middle themselves. You cannot buy your way out of missing context with more gates.

I closed the original talk with a line I still believe: build things that would not exist otherwise. Slop gives that line its negative, and honestly its urgency. Slop is precisely what would exist otherwise, the statistical middle arriving on schedule, with or without you. So the question this whole page boils down to is a simple one to ask and an expensive one to answer:

What, in your loop, would pull the output anywhere else?