<!-- Page: index -->

# The Elastic Loop

> A framework for everyone delegating work to machines. When to keep the loop tight, when to let it stretch, and what has to be in place before you let go.

Part of The Elastic Loop · https://elastic-loop.robert-glaser.de · last updated 2026-06-12

---
We are sitting in the best restaurant in the world. Every dish is suddenly affordable, every once-scarce ingredient abundant, and the kitchen can execute anything you name. There is just no menu. So out of safety, risk aversion, and a plain lack of imagination, most people keep ordering what they know. The schnitzel was fine last time. Let’s do the schnitzel again.

That is about where building software landed this year. Once the kitchen can cook anything, the meal turns on two judgments: what is worth ordering, and whether the plate that came back is any good. In the actual work, that is what is worth building, and whether what came back is any good. The Elastic Loop is a working model for those two judgments, for everyone delegating work, specifically building software products, to AI agents. It is for you if you want to learn when to keep the loop tight, when to let it stretch, and what has to be in place before you let go. It is for engineers, but just as much for the people who decide what gets built and judge whether it is any good: product, design, the domain experts. And for the people who set the conditions, whoever shapes how a team works and whoever decides where the money goes. The breadth is the point: both judgments are spread across every one of those roles, not concentrated in whoever sits at the keyboard.

## How far can you let go?

Chances are you have done this today: you asked an AI for something, read the answer, winced, rephrased, read again. That round trip is the loop, and it is the unit everything here is built from. You put in intent, the agent does some work, you check what came back, and what you learned shapes the next ask. The size of a loop is how much work happens between two of your looks: a sentence, a feature, a week of output.

The version you ran this morning is the tightest one. Every turn under your eyes, every result checked before it counts, and that mode works well. The question this framework keeps circling is how far the loop can stretch beyond it: first elastic, where you hand over a bounded chunk of work and check in at agreed points, then loose, where the agent works for multiple hours on its own while you do something else. And what has to be in place before you can let go that far.

## The whole framework in one sentence

Intent **opens** the loop. Context **grounds** it. Backpressure **keeps it useful**. Verification **closes** it.

**Opens.** Somebody decides to delegate and states what good would actually look like.

**Grounds.** Context is what lets the agent stop guessing and start working from your specifics: your codebase, your customers, your constraints.

**Keeps it useful.** Backpressure is the word that stuck in agentic engineering for resistance: anything an agent’s output has to survive before it counts ([Geoffrey Huntley](https://ghuntley.com/ralph/) and [Moss Banay](https://banay.me/dont-waste-your-backpressure/) did much of the early work on it). A failing test is backpressure, and so is a designer’s critique or an acceptance scenario from someone who knows the domain. Plenty of the resistance that matters never touches a compiler. The challenge is how much of this backpressure can be automated, so it does not need a human in the loop stopping the belt.

**Closes.** The outcome gets checked, and what the loop learned survives it. How expensive that check is for the human decides how many loops you can run at once; more on that in [Harness](/harness).

## Where does your next task belong?

Two questions decide where a task sits: how much leash does the agent get, and what holds the output other than you? Read the grid on those two axes. Each step up adds one more dam between the agent and shipped product.

The master grid. Loop size runs across the columns; backpressure depth runs down the rows.

| Backpressure ↓ / Loop size → | Tight | Elastic | Loose |
| --- | --- | --- | --- |
| **Full** (technical + product / domain) | High-risk work, research, regulated change | A delegated feature with product backpressure | A (dark) factory: agents delivering asynchronously against automated checks, humans reviewing outcomes |
| **Technical only** (tests, CI, linters) | Pair coding with tests, linting etc. | Delegated implementation, system-checked but slop-exposed | **Slop at scale**: looks like productivity while the output drifts generic |
| **No automated checks** (you hold the line) | You are the dam, reading every turn | Output piles up between check-ins, nobody on the volume | **Sprawl at scale**: output multiplying faster than anyone can contain it |

Column timings: Tight is every turn supervised, minutes. Elastic is checkpoints at agreed points, minutes to hours. Loose is multiple hours of autonomy, outcomes only.

A context floor sits beneath the grid and rises left to right: the thicker the floor, the more the context can serve itself.

```text
  Tight     Elastic      Loose
                         ██████
            ██████       ██████
  ██████    ██████       ██████
  thin      rising       thick
```

Loose only opens once the context can serve itself.

### What is context?

A longer leash does not call for more context. The amount a task needs is fixed. What changes is how much of it the agent has to reach on its own, without you handing it over turn by turn from your own head or systems the agent can not use on its own.

So what is *context*? Basically, everything the agent would need to stop guessing: the decision history that evaporated in a chat thread three months ago, the business rules that live only in your most experienced colleague’s head, the design system, API docs that describe what the system does today rather than two years ago, who the customer is and what they have already tried. Most of this exists in any team that has shipped something. It just rarely exists in a form an agent can read, because heads, kitchen conversations, and meeting recordings nobody rewatches do not count.

Freshness is part of the bar, and on the right side of the grid it is the sharp part. In a tight loop a stale document costs you one correction, because you watch the wrong turn happen. In a loose loop a stale document is worse than none, because the agent follows it with full conviction and nobody contradicts it for hours.

Moving that knowledge into a state where it stays available and current without a human pumping it in has become a discipline of its own, usually filed under *context engineering*. It is slower work than adding another check, which is why the floor sits where it does.

## Two ways this goes wrong

> Sprawl is the explosion. Slop is the slow collision.<br />Both are what happens when a search process runs without backpressure, and only one of them announces itself.

A search process, because that is what an agentic loop is, essentially: it explores a space of possible solutions and looks for one or many variants that fit your intent (see [Why](/why)).

### Sprawl

Output without containment: variant inflation, refactoring PRs nobody asked for, backlogs gathering mold. It happens far from any repo too. Picture the team that generates thirty landing-page variants in an afternoon while nobody is assigned to review even one. The good news is that sprawl announces itself. You can see it in pull request volume, and pull people back. Think of a pressure reactor: agents generate pressure, and the containment wall is what makes a productive reaction possible in the first place.

### Slop

Here is the schnitzel again, the order everyone defaults to when the menu is missing and imagination runs short. Output that looks plausible, reads smoothly, and could have come from any team prompting any model on any given day. Lots of noise, almost no signal. The everyday version is the onboarding text that sounds like every onboarding text ever written. Slop does not announce itself. The damage shows up quarters later, in the market, in usage signals, in the competitor who built the thing properly. Colleagues you avoid texting with, because they sent you uncanny copywriting, lots of it. It gets squeezed from both sides: context up front keeps the agent from starting in the statistical middle of everything it has ever read, and product backpressure at the back pulls the output out of it.

Sprawl has two natures, though, and the grid only catches one. The kind that comes from missing checks is what the rows measure: output that does not fit the system, broken refactors, architectural drift. It shows up as a cell. The other kind comes from missing closure: output that passes every check, but that nobody is assigned to read. Those thirty landing pages again. It can happen anywhere, even under full backpressure, because depth of checking is not the same as having someone there to shut the loop. That second kind rides on closure, which I keep off the axes on purpose, since it applies to every cell equally.

![The squeeze](https://elastic-loop.robert-glaser.de/diagrams/squeeze.png)

*[Figure: The squeeze. Context positions the agent’s start outside the statistical middle before any work happens; product and domain backpressure pulls the output the rest of the way toward your specific solution.]*

## Three images I keep reaching for

### The truffle

Delegation runs on taste, and taste is typically acquired. The line truffle people pass around is that the thing smells like a bicycle accident in the forest: wet metal, earth, fog, the forest floor after rain. You only taste the whole of it if you have met those smells before. That is what acquired taste means, qualitative judgment built through sensory contact with the work, the kind you get from reading diffs, shipping mistakes, and watching real users get stuck. I reach for this one with engineers who review every turn because letting go feels like losing the craft. Tight is a legitimate mode, just not the only one. And I reach for it with everyone who judges output without building it: designers, domain experts, anyone whose “this is off” is worth more than a passing test suite. The talk this framework grew out of, [How Does Truffle Taste?](https://www.robert-glaser.de/the-elastic-loop-introducing-agentic-engineering-strategically/), was built around it.

### The mirror

AI amplifies and reflects whatever organizational structure it lands in, including the parts you would rather it didn’t. This one is for leadership teams shopping for tools when the question really sits in the operating model, and for the people who own how the team works, scrum masters and coaches included, because the loop will mirror their rituals right back at them.

### The pressure reactor

Backpressure is containment, and containment is what allows a productive reaction at all. This is the image for architecture and harness debates, where “guardrails” sounds like bureaucracy until you see the wall as the thing that lets you run the reactor hot.

## Where to start

If you are new to all of this, start with Loops.

- [Loops](https://elastic-loop.robert-glaser.de/loops.md): Tight, elastic, loose: three zones and how to size the loop for the task in front of you. None of the zones outranks the others, each just comes with different preconditions.
- [Why](https://elastic-loop.robert-glaser.de/why.md): Why stretch a loop past tight at all, and is the risk worth it? The payoff comes down to how well you can judge what comes back. The economic and technical case, now with measurement behind it.
- [Harness](https://elastic-loop.robert-glaser.de/harness.md): The backpressure layers in full: what holds agent output honest against the system, and what holds it honest against the product.
- [Grading](https://elastic-loop.robert-glaser.de/grading.md): Outcome grading is the new specification. Tests, rubrics, scenarios, golden examples of known-good output, and why a rubric is not automatically truth.
- [Roles](https://elastic-loop.robert-glaser.de/roles.md): Every role carries judgment about agent work that nobody else can supply. What engineers, product people, designers, domain experts, the people who run the process, and leaders each bring to the loop.

## A framework for everyone delegating work to machines. Even machines.

Agents have started applying this loop to themselves. [Cursor’s First Proof](https://1stproof.org/) and [OpenAI’s harness engineering](https://openai.com/index/harness-engineering/) arrived at the same shape independently: decompose, parallelize, verify, iterate. Anthropic’s Fable model shows strong signs of being post-trained to put itself in reinforced loops, and the [dynamic workflows](https://code.claude.com/docs/en/workflows) feature builds harnessed loops on the fly for the task at hand. I am keeping this as a coda rather than a headline, because it is the most interesting thread here and the one I am least sure about.

## Glossary

- **Backpressure**: The resistance an agent works against while it builds, well before any review at the end: a failing test or type error on the technical side, an acceptance scenario or rubric on the product side. Some of it reaches the agent automatically as a signal in the loop; some it imposes on itself by following a discipline set at the start, like writing the failing test first and working until it goes green. The more of it you can encode, the longer you can let the loop run. A red build is backpressure; the agent reads it and fixes the code. So are acceptance criteria: write them well and the agent works against your definition of good as it goes, instead of a person catching the miss at the end.
- **Drift**: Slow movement away from what is correct or wanted, over time, without any single obvious break. The product or codebase quietly getting less coherent release after release.
- **Harness**: The scaffold that turns a model into an agent, assembled from many parts. Among those: the loop it works in, the tools it can reach, how its context is managed as a run grows long (compression, retrieval), the hooks that fire on what it does, subagents, and guardrails. Backpressure and other resistance attach here, and beyond it. An interactive agent tool like Claude Code, Codex, or Pi is a harness. You have been working inside one all along.
- **Search space**: The set of possible solutions a task allows. An agent does not translate a spec line by line; it samples from this space, pulled toward one solution by your context, or proposing a few in a planning step. Generating many on purpose and selecting the best is a discipline you opt into. There are many valid ways to implement a story. The agent is choosing among them, not transcribing the one right answer.
- **The statistical middle (slop)**: Output that converges on the bland average of everything the model has ever read: plausible, smooth, and indistinguishable from anyone else’s. The onboarding text that reads like every onboarding text ever written.

---

<!-- Page: loops -->

# Loops

> Tight, elastic, loose: three zones and how to size the agentic loop for the task in front of you. None of the zones outranks the others, each just comes with different preconditions.

Part of The Elastic Loop · https://elastic-loop.robert-glaser.de · last updated 2026-06-12

---
Let me start with the word itself, because everything in this framework, and agentic work in general, hangs off it. A loop is the round trip between you and a machine working for you: you say what you want, the agent produces something, you check it, and what you learned feeds the next ask. If you have ever rephrased a prompt because the first answer missed the point, you have closed a loop. The size of a loop is how much work happens between two of your looks: a paragraph, a feature, a week of output. As long as humans did the work, nobody had to build that loop. It just happened. Once a machine does the work, the loop becomes something you build on purpose, and everything below, the three zones, the sizing criteria, the closure question, is the craft of building it deliberately instead of inheriting it from habit.

None of this is new, and I would rather say so myself than have someone say it for me. The feedback loop is the beating heart of every iterative method we have practiced since the [Agile Manifesto](https://agilemanifesto.org/) put it in writing in 2001: build something small, look at it, adjust, go again. A sprint is a loop. Continuous delivery runs the same loop tighter, and build-measure-learn is that shape again under a different name. The loop is old. What is new is two things. One you can feel already: who works inside it, and how fast one person can open one. The other is quieter.

You already run more loops at once than you ever named. You ship a change to production, then wait on the UX team running sessions with real users. Weeks later they come back, in a meeting or over the phone, and what they found sends you to fix the code, cut the feature, or reshape it. That is a loop, interlocked with the fast one you ran while building, and it turns over weeks where the other turned over minutes. Nobody drew it on a wall. A person held it together in their head and their calendar. It was implicit, and it worked because a human was the glue.

That glue is what a machine cannot supply. When the worker is a machine you summon on demand, the loop has to become explicit, something you build, because the machine does not know your users, your release rhythm, or which meeting the real verdict shows up in. You do. So you choose loop size per task where the calendar used to decide for you, and you build the longer loops out to production and back instead of trusting them to memory. The far end of that range, many hours of unattended agent work, is the part most of us genuinely have not done before.

The question worth taking seriously here is the one this whole framework circles: if the tight loop you probably ran this morning (ask, read, wince, rephrase) works so well, how far can it stretch? Tight is the one mode everyone knows from their own hands. The other two zones are the same loop with more leash paid out, and every extra meter changes what has to be in place around the work.

![Three sizes of the same loop](https://elastic-loop.robert-glaser.de/diagrams/loop-sizes.png)

*[Figure: Three sizes of the same loop. Tight: minutes, in the loop at every turn. Elastic: minutes to hours, at checkpoints on the edge. Loose: multiple hours, only at the outcome gate.]*

## Three zones, none of them a ladder

**Tight** runs in minutes. You are embedded in the cycle, reviewing every turn, checking hypotheses as they form, working in something close to pair-coding mode. The cognitive load is high and the control is maximal. That trade is exactly right for a whole class of work. Anything high in ambiguity or risk, where a wrong turn is expensive and the path is not clear yet. Legacy brownfield where the documentation lies and problems nobody has seen before, the ground you have to feel your way across. And anything where you are the one who needs to learn something from contact with the material, because handing that off hands off the learning too.

**Elastic** runs in minutes to hours. You hand over a bounded chunk of work, structured delegation with guardrails and checkpoints, and you come back at agreed points instead of every turn. The work that belongs here is understood enough to hand over, but not safe enough to ignore. (This carries more weight than it looks; most delegation failures I see come from misjudging one of its two halves.)

**Loose** runs in multiple hours. A single agent or fleets of agents work asynchronously under policy constraints, and the human reviews outcomes rather than turns. You may have heard of the term *dark factory* for these kinds of loops.

I want to be blunt about a misreading the three zones invite: tight is not a beginner mode you graduate out of, and loose is not a badge of being advanced. Think gears, not a ladder. You shift to match the road. An engineer who keeps a highly regulated change in a tight loop with almost no backpressure at hand is in the right gear. A team that pushes everything loose because it feels like progress is about to find out why the criteria below exist. Misjudging the size does not fail loudly. The agent keeps running, sure of itself, while nobody backfills what it is missing, and the bill arrives later as drift that has compounded past a cheap fix. The only question the zones answer is how much loop this particular task can carry.

## Are your loop sizes choices or accidents?

People keep collapsing two questions into one, so let us keep them apart.

### 1. What state are the loops you already run in?

Four strategic questions get at it:

1. How volatile is the domain?
2. Is the context in place for agents to work from?
3. Where does trust actually live in the organization?
4. And which metrics would catch drift fast enough to matter?

Ask these about a team, and you learn whether its current loop sizes are choices or accidents.

### 2. The loop size of the task in front of you

For that I use seven sizing criteria.

1. **Ambiguity:** unclear intent pulls the loop tighter.
2. **Risk and blast radius:** the bigger the damage a wrong move can do, the tighter the loop and the stronger the gates.
3. **Context availability and freshness:** undocumented or volatile territory pulls tighter.
4. **Verification quality:** strong tests, evals, and scenarios are what permit looser loops in the first place.
5. **Reversibility:** cheap rollback buys you elasticity.
6. **Learning goal:** if a human needs to understand this deeply afterward (comprehension, or cognitive debt [is a thing](https://arxiv.org/abs/2603.22106)), do not over-delegate it away from them.
7. **Agent capability:** [long task horizons](https://metr.org/time-horizons/) only pay off when the closure infrastructure can keep up with them.

Diagnosis tells you where you stand. Sizing is about the next task, and what to do with it. Mix the two and a team ends up arguing about its maturity, when the fight should be over one task’s blast radius.

## Every zone has to close the loop

Whatever size you pick, the loop has to close: the outcome gets checked, and what was learned survives. What changes per zone is the machinery doing the closing.

**In the tight loop**, you are the iteration mechanism. Hypothesis, attempt, check, correction, over and over, backed by the instant signals you get for free while you sit there: the LSP and compiler, the linter and formatter, the diff in front of you. Direct material contact runs through an interactive agent harness like Claude Code, Codex, or Pi, and through tests, UI, and logs.

**In the elastic loop**, intent-carrying artifacts like specs and user stories become the steering artifacts, carrying boundaries, constraints, and risks. Around it: a reproducible setup with CI, acceptance criteria the work is held to, and review at the checkpoints, by a human and by reviewer subagents.

**In the loose loop**, the machinery has to stand in for your absence, and there is a lot of it. Containment first, so a bad run cannot escape: sandboxes and worktrees, rollback paths. Then the judgment you would have applied by hand, now encoded: outcome graders (rubrics, scenarios, golden examples, counterexamples), fix-loops, adversarial agent reviews with their own approval gates, drift monitoring, automated regression gates. And finally the legibility artifacts that make the final human check cheap: screenshots, screen recordings, trace summaries. That last item matters more than it sounds: how many loose loops one person can run in parallel is capped less by containment and more by the cost of verifying each one at the gate (the [Harness](/harness) page takes this apart).

And here is the sharpening that took me a while to see. In the tight loop you supply the micro-iterations implicitly, just by sitting there. In the loose loop nobody supplies them. So the harness, the machinery around the agent described above, has to encode those cycles in advance, decompose, parallelize, verify, iterate, as explicit architecture rather than an emergent property of human presence. [Cursor’s First Proof](https://1stproof.org/), [OpenAI’s harness engineering](https://openai.com/index/harness-engineering/), and [Anthropic’s harness design for long-running apps](https://www.anthropic.com/engineering/harness-design-long-running-apps) all arrived at exactly this shape independently.

## Some loops you build run for weeks

The loop out to production and back, the one the UX team used to carry, you can now build explicitly too. A background agent reads your production signals and user feedback on a weekly beat, turns what it finds into experiments with a clear measure of success, hands those to build agents to ship, and then watches over the following weeks what became of them. Its loop does not close when the build goes green. It closes when the production verdict arrives.

And you rarely build just one. Different signals move at different speeds, and the loops that watch them inherit that: a crash report closes in a day, a retention question in weeks, architecture decay in a quarter. You match each loop’s cadence to its signal instead of to the calendar. The fast loop tells you the code works. The slow one tells you whether it was worth building, and that was always the more important answer. It just used to live in someone’s head instead of in a loop you built.

## Could an agent even work here yet?

One more thing, which sits underneath everything above. Context behaves differently from the choices we have been making so far. Nobody chooses bad context, because it is a maturity state rather than a decision, whether you like it or not. A tight loop survives thin context, because you backfill the missing knowledge turn by turn from your own head; a loose loop has nobody doing that. So bad context collapses the entire loose loop area out of reach, no matter how good your backpressure is.

That is why I treat **context as a gate with two layers**.

- Layer 1 is **agent readiness**: can the agent even work here, with the context quality, freshness, availability, and relevance it would need?
- Layer 2 is **loop health**: does the team trust itself enough to let go? You can fail the second while passing the first, and plenty of teams do. But failing the first is how loose loops produce confident nonsense for hours.

So before you size your next task loose, ask the readiness question honestly: could an agent work here at all yet? Could a new flesh-and-blood colleague work here at all, without endless getting-knowledge-out-of-people’s-heads sessions? And if it could, here is the harder question: would you let it?

## Glossary

- **Backpressure**: The resistance an agent works against while it builds, well before any review at the end: a failing test or type error on the technical side, an acceptance scenario or rubric on the product side. Some of it reaches the agent automatically as a signal in the loop; some it imposes on itself by following a discipline set at the start, like writing the failing test first and working until it goes green. The more of it you can encode, the longer you can let the loop run. A red build is backpressure; the agent reads it and fixes the code. So are acceptance criteria: write them well and the agent works against your definition of good as it goes, instead of a person catching the miss at the end.
- **Counterexample**: A plausible-looking but wrong output you keep on file, so the system learns never to produce that kind of thing again. The bug you once shipped and then wrote a regression test for.
- **Dark factory**: The loose loop run to its end: agents delivering on their own against automated checks, with humans reviewing outcomes rather than every turn. Named after lights-out manufacturing, where the line runs without people on the floor.
- **Drift**: Slow movement away from what is correct or wanted, over time, without any single obvious break. The product or codebase quietly getting less coherent release after release.
- **Golden**: An output you have blessed as correct and keep around as the answer key, to compare new output against. The trusted fixture in an integration test: the known-good result everything else is measured against.
- **Harness**: The scaffold that turns a model into an agent, assembled from many parts. Among those: the loop it works in, the tools it can reach, how its context is managed as a run grows long (compression, retrieval), the hooks that fire on what it does, subagents, and guardrails. Backpressure and other resistance attach here, and beyond it. An interactive agent tool like Claude Code, Codex, or Pi is a harness. You have been working inside one all along.
- **Outcome grading**: Judging whether a result is good, precisely enough that the judgment can be applied again and again, by a person or a machine. You already do the simplest version every time you write a test or a definition of done.
- **Rubric**: A written list of what “good” means for a kind of output, so the same standard can be applied to every result. Acceptance criteria, reused as a grading checklist instead of a one-off.

---

<!-- Page: why -->

# Why

> Why stretch a loop past tight at all, and is the risk worth it? The payoff comes down to how well you can judge what comes back. The economic and technical case, now with measurement behind it.

Part of The Elastic Loop · https://elastic-loop.robert-glaser.de · last updated 2026-06-12

---
Here is the question I would ask anyone who just learned they can stretch a loop: why would you? You give up control you are used to. The payoff comes back hours later, and part of it you have to take on trust. So why do it, and when? The whole thing rests on one ability: judging what comes back. Half of it is about money. Building got cheap, so the cost slid to the two ends of the loop. The other half is stranger. The work itself changed shape, closer to training a model than to writing a spec. Months ago I would not have trusted any of this. What changed is that both halves now have measurement under them.

## The middle shrinks, the ends get expensive

When AI speeds up the whole middle, analysis through review, the work does not vanish. It moves to the two ends.

- At the front: what is worth building, who it is for, what good would look like.
- At the back: whether what came back is any good, and what production teaches you once it ships.

The middle is the part we spent twenty years optimizing. Frameworks, ceremonies, career ladders, all of it aimed at the stretch that just got cheap. A system has one bottleneck at a time, and only that bottleneck decides how fast finished work comes out. Speed up anything else and work piles up in front of it. Eliyahu Goldratt’s [Theory of Constraints](https://en.wikipedia.org/wiki/Theory_of_constraints) named this decades ago. Building was that bottleneck, the slow step everything else waited on. AI sped it up almost overnight, and the constraint jumped to the two ends a cheap middle had hidden: what is worth building, and whether what came back is any good.

You have probably felt this already. An agent hands you a finished feature in minutes. Then you spend an hour working out whether it is the thing you actually wanted. That hour was the work. The typing was never the constraint. We just could not see that while typing was slow.

## Agentic engineering is machine learning

The technical half of the argument comes from François Chollet, in a [post on X](https://x.com/fchollet/status/2053234697392754701) from May 2026. His frame: agentic engineering is a form of machine learning. The engineer defines the goal, the constraints, and the search space. An optimization process generates the code. You then treat the result as a blackbox artifact whose behavior and generalization you check empirically, the way you would with any model.

That quietly reclassifies jobs. Classic spec-driven development treats software as deterministic translation: requirements in, code out, and if the output is wrong, the spec was wrong. Plenty of teams are retreating to that little waterfall right now, pinning their hopes on better specs with agents doing the typing. The reflex even has a Radar entry: ThoughtWorks lists [spec-driven development](https://www.thoughtworks.com/radar/techniques/spec-driven-development) in Assess, caveats attached. The full sorting of where specs help and where they harm lives on [Grading](/grading). Agentic engineering treats software as an optimization problem instead: a defined search space, a search process, an evaluation function. Those are two different professions. Many of us, myself included, come from the deterministic one. Most of those reflexes do not survive the move. The one that does is judgment about whether a result is any good.

That reframing also moves the target you optimize for. The reflex with any productivity tool is to chase output: more pull requests, more green on the dashboard. I find that framing boring, and worse, wrong. If building is a search, throughput is the wrong thing to measure. You would never rate a machine-learning model by how many predictions it emits per second. You rate it on whether the output holds up. 

> Chase raw productivity and the search just hands you more of everything. That is the exact shape of sprawl and slop. 

There is a deeper reflex hiding in that, carried over from deterministic work: variance looks like waste, something to stamp out. But a search runs on variance. It hands you many candidate solutions on purpose, and the skill is to bound that spread and keep the best of it, not to drive it to zero the way a factory drives a part to spec. That distinction comes back when we reach the factory floor on [Harness](/harness).

The objective worth optimizing is solution quality, and the field evidence backs that. In an experiment at Procter & Gamble, [Dell’Acqua, Mollick et al. (2025)](https://www.hbs.edu/faculty/Pages/item.aspx?num=67197) ran 776 professionals through real product work. One person with AI matched a team of two without it. And the gain was better solutions, not faster ones. But solution quality only counts if you can measure it, which is the harder problem.

## Outcome Grading is the new specification

If the build step is a search process, the leverage moves to whoever can grade what comes out of it. Grade the outcome precisely and you can let machines iterate against that grade, for hours, in parallel, while you do something else. Tests are the primitive form of this: this input, this expected output. The broader forms cover whole classes of behavior: rubrics and scenarios, the golden and counter examples that calibrate them, an LLM judge playing adversary, a human where the call is genuinely a judgment call, and production itself, grading whatever slipped through.

One caveat I want to plant early, because the whole framework gets dangerous without it: a rubric is not automatically truth. Good outcome graders are engineering artifacts in their own right, calibrated against counterexamples, failure taxonomies, the occasional human read. Skip that calibration and the agent will optimize toward output that sounds plausible to the grader, and the grader will applaud. You will have built a machine for generating confident mediocrity.

## Only retained feedback counts

There is now a precise measurement language for this shift. [Zhang et al. (2026)](https://arxiv.org/html/2605.29682v1) propose **Effective Feedback Compute**, which counts neither raw tokens nor tool calls nor wall time. It counts only feedback that is informative, valid, and actually retained in the agent’s state for later decisions. Feedback that changes nothing downstream is noise, however expensive it was to produce.

That is the empirical version of the backpressure thesis. A bad loop burns compute and keeps nothing. A good loop stores the pressure: a changed plan, a sharper rubric, a rejected variant that stays rejected. Which is the whole point:

> The unit of agentic progress is retained feedback, not generated output.

## The setup around the model beats the model

Chollet gives the theory; there is now a measurement to go with it. [clawbench](https://github.com/openclaw/clawbench), the agent benchmark from the OpenClaw ecosystem, scores the combination you ship: the model together with the loop and configuration around it, not the model alone. Its finding: swapping the configuration moves the result ten times more than swapping the model does.

Ten times! The whole of that setup has a name, the harness, and it is the variable that dominates the outcome, not the model the industry keeps arguing about. That claim, and the measurement under it, earns its own chapter: [Harness](/harness).

## The explosion and the slow collision

Sprawl is the explosion, slop is the slow collision. Both are what a search process does with nothing pushing back. The anatomy of each is worth a look.

For sprawl, the pressure-reactor image holds up the more you push on it. Agents generate pressure — output, speed, options — and the harness is the containment wall that makes a productive reaction possible instead of just an explosion to fear. The rest of the assembly maps over too: grading is the sensors that tell you what is happening inside, approval gates are the safety valves, rollback is the emergency shutdown. None of it is interesting on its own. A reactor is the whole assembly, or it is a crater.

Slop needs the front of the loop to explain, which surprised me when I first worked through it. Slop is sampling from the statistical middle of everything the model has ever read, and the agent lands there because it lacks the specific context that would pull it out: your customers, your constraints, the edge cases your domain knows by heart. So slop gets squeezed from two sides. Context at the front keeps the agent from starting in the middle at all (the grounding step in the formula), and product and domain backpressure at the back pulls whatever comes out the rest of the way toward your product.

And the two depend on each other: backpressure without context is a filter with no signal. Your graders either reject endlessly, which looks exactly like sprawl, or they wave plausible-generic output through because they were calibrated on the middle themselves. You cannot buy your way out of missing context with more gates.

I closed the original talk that gave birth to this book with a line I still believe: 

> Build things that would not exist otherwise.

Slop gives that line its negative, and honestly its urgency. Slop is precisely what would exist otherwise, the statistical middle arriving on schedule, with or without you. But slop is also the raw material that makes variants and creativity possible at all. So this whole page comes down to one question, easy to ask and expensive to answer:

> What, in your loop, would pull the output anywhere else?

## Glossary

- **Backpressure**: The resistance an agent works against while it builds, well before any review at the end: a failing test or type error on the technical side, an acceptance scenario or rubric on the product side. Some of it reaches the agent automatically as a signal in the loop; some it imposes on itself by following a discipline set at the start, like writing the failing test first and working until it goes green. The more of it you can encode, the longer you can let the loop run. A red build is backpressure; the agent reads it and fixes the code. So are acceptance criteria: write them well and the agent works against your definition of good as it goes, instead of a person catching the miss at the end.
- **Blackbox artifact**: Something you judge by how it behaves, not by reading how it was made. A dependency you trust through its tests and its behavior, not by reading its source.
- **Counterexample**: A plausible-looking but wrong output you keep on file, so the system learns never to produce that kind of thing again. The bug you once shipped and then wrote a regression test for.
- **Evaluation function**: The function that scores how good a candidate solution is, so the search can tell a better result from a worse one and steer toward it. Without one, generating many solutions tells you nothing about which to keep. A test suite or a definition of done, read as a score the search aims for instead of a yes/no check at the very end.
- **Generalization**: Whether something keeps working on cases it was not specifically built or tested against. Does the fix hold for the inputs you did not think of, not just the one in the ticket?
- **Golden**: An output you have blessed as correct and keep around as the answer key, to compare new output against. The trusted fixture in an integration test: the known-good result everything else is measured against.
- **Harness**: The scaffold that turns a model into an agent, assembled from many parts. Among those: the loop it works in, the tools it can reach, how its context is managed as a run grows long (compression, retrieval), the hooks that fire on what it does, subagents, and guardrails. Backpressure and other resistance attach here, and beyond it. An interactive agent tool like Claude Code, Codex, or Pi is a harness. You have been working inside one all along.
- **LLM judge**: A second AI you ask to grade the first one’s output against your criteria. It scales review, but has blind spots of its own. A tireless reviewer who needs a clear checklist, or it will rubber-stamp anything that looks plausible.
- **Retained feedback**: Feedback only counts when it changes a later decision: an updated plan, a sharper test, a rejected option that stays rejected. The rest is noise, however expensive it was to produce. A review comment matters only if it changes the next commit.
- **Rubric**: A written list of what “good” means for a kind of output, so the same standard can be applied to every result. Acceptance criteria, reused as a grading checklist instead of a one-off.
- **Search space**: The set of possible solutions a task allows. An agent does not translate a spec line by line; it samples from this space, pulled toward one solution by your context, or proposing a few in a planning step. Generating many on purpose and selecting the best is a discipline you opt into. There are many valid ways to implement a story. The agent is choosing among them, not transcribing the one right answer.
- **The statistical middle (slop)**: Output that converges on the bland average of everything the model has ever read: plausible, smooth, and indistinguishable from anyone else’s. The onboarding text that reads like every onboarding text ever written.

---

<!-- Page: harness -->

# Harness

> The backpressure layers in full: what holds agent output honest against the system, and what holds it honest against the product.

Part of The Elastic Loop · https://elastic-loop.robert-glaser.de · last updated 2026-06-12

---
A harness is the scaffold that turns a model into an agent, assembled from many parts. Among those: the loop it works in, the tools it can reach, how its context is kept manageable as a run grows long, the hooks that fire on what it does, subagents, and guardrails. A raw model answers once. A harness is what lets it take a hundred turns on a task while you look away. It is also where you start attaching resistance. A hook can block a bad commit, an AGENTS.md can carry rules the agent is meant to follow but can still quietly ignore, and plenty of resistance lives outside the harness too, in a git hook, in CI, in a human reviewer. Backpressure is the name for all of it, and the more of it you can encode, the more leash you can pay out, from tight to elastic to loose. On the [master grid](/#where-does-your-next-task-belong), the vertical axis is exactly this: no automated checks, then technical ones, then full backpressure that adds product and domain judgment on top. This page is about that top level. What actually goes into it, and what does it cost the human to keep running?

Before the parts, it helps to know why they are worth the trouble at all. [clawbench](https://github.com/openclaw/clawbench), the agent benchmark from the OpenClaw ecosystem, scores the combination that actually ships, harness and configuration and model together, and finds that the configuration around the model moves the result about ten times more than the model choice does. While the industry refreshes leaderboards and debates which frontier model to standardize on, or whether SWE-Bench was fine with slop all along, the variable that dominates is the loop infrastructure around the model. Chollet supplies the theory on [Why](/why), that building is a search process and so the setup around the search matters, and clawbench supplies the measurement. The lever sits here, in the harness, not in the model. Which makes it worth knowing exactly what goes into one.

I work with four clusters, and each one owns a different question about the same piece of agent output:

- **Engineering + Architecture** asks: is it correct?
- **Product + Domain** asks: is it right?
- **Security + Compliance** asks: is it allowed?
- **Operations** asks: does it hold up in production?

Correct and right are not the same question. A change can pass every test, satisfy every type checker, and still solve a problem no user has. Say it is right, too. It can still touch data it had no business touching, or clear all three and quietly fall over under real load three weeks later. Four clusters, then. Four ways for plausible output to be wrong, and the work has to survive all four before it counts.

## One layer compiles, the other one doesn’t

The technical layer is the one you have probably seen versions of already, because it is where the public harness debate lives. Its building blocks:

- repo rules, AGENTS.md, skills and documentation (actually, this is context the agent uses as backpressure to hold on itself)
- language servers, compilers, linters
- tests in all their forms
- CI, trial runs in ephemeral containers, security checks
- PR diffs and reviewer subagents
- logs, traces, and UI checks
- hooks that fire on what the agent does, blocking a bad commit or write before it lands
- sandbox boundaries, approval gates, rollback paths

Each of these is something the agent’s work has to get past without a human watching every turn.

The product and domain layer is built from different stuff, and almost none of it compiles:

1. the actual user problem and the job to be done
2. explicit non-goals (the solution spaces you deliberately ruled out)
3. domain rules and edge cases
4. acceptance scenarios and counterexamples
5. quality standards from the people who own product, UX, risk, and operations
6. variant comparisons with explicit trade-offs
7. and production signals like rework, support feedback, and usage patterns

Notice how much of this list is knowledge that lives in someone’s head until somebody does the work of writing it down in a form an agent can be held against.

## Your test suite will not catch slop

Here is the job each layer actually does, which is what makes the two more than a tidy taxonomy. The technical layer mostly fights sprawl, the explosion: compilers, tests, CI, architecture fitness graders filter out what does not fit the system, and they push back against “too much, too wild”. Slop is the other layer’s problem. Scenarios, counterexamples, and domain rubrics (written grading criteria) pull the output away from the statistical middle and into the specific solution space of this product, this domain, this user. That is the back half of the squeeze: context decides where in the possibility space the agent starts, and this layer pulls whatever comes out the rest of the way toward your product.

> Technical backpressure keeps the work correct against the system. Whether the product ever needed it is the other layer’s call.

You need both, and they are not interchangeable. A perfect test suite will wave through onboarding copy that sounds like every onboarding copy ever written, because nothing in it knows what your product is for.

## What the next model takes away

Both layers grow while you pay out leash, and neither grows forever. They also age differently, which is the part worth planning for. Anthropic’s own [harness write-up](https://www.anthropic.com/engineering/harness-design-long-running-apps) argues for shrinking the scaffold as the model improves: every piece you bolt on encodes an assumption about what the model cannot do yet, and those assumptions expire. Between two model generations they tore out their task decomposition and folded a per-stage evaluator back into a single pass at the end, because the newer model held coherence for hours on its own.

The line that decides what you can remove cuts across both layers, not between them. Some of the scaffold only compensates for a capability gap: planning horizon, coherence over a long run, context-window juggling. A better model closes those gaps, so the crutches built for them are the first thing to strip out. The rest is backpressure that encodes a target the model had no way to know. That part stays, however capable the model gets.

The trap is reading the whole technical layer as the disposable side. Judgment lives inside it too. Whether a change fits your architecture, respects your bounded contexts, cuts the seam where you wanted it: a fitness function catches the mechanical part of that question, no cycles in the dependency graph, and stops there. The rest is a call graded by a rubric, an LLM judge, or a human, and a stronger model does not retire it, because it was never a capability gap to begin with. Architecture conformance lives in the technical layer and ages like a domain rule, not like a crutch. So “simplify as the model improves” holds with one correction: strip the scaffolding, keep the judgment, wherever in the harness it happens to sit.

## Backpressure only counts if it sticks

There is a condition both layers share: backpressure is only effective once the signal changes a later decision. A failing test that gets retried into silence is noise. So is a rejected variant nobody learns anything from, however much compute it burned to produce. The signal has to land somewhere it changes the next run: a changed plan, a sharper rubric, a new counterexample. And a wall the agent hit once should leave a guard behind, a new test or a hook that fires before the bad write lands. Retained feedback is what connects the two layers, because retention works the same way whether the signal came from a compiler or from a domain expert.

> Agent struggle becomes harness backlog.

**Example: Agent memory and skills**

Agent memory is the most tangible retention site I know, and a skill is one form of it, procedural memory in executable form. A skill that does not get better after repeated use, or a memory that never gets corrected, is just static documentation wearing a different hat. The raw material for improving them sits right in the agent traces: retry loops and dead ends, wrong or outdated commands, missing setup assumptions, the places where the agent made a best-guess decision because the skill left a gap. Treat those signals as deltas, patch immediately when risk and evidence allow it, route the rest through a review gate, and maintenance becomes a closure mechanism inside the loop. You could even optimize the token efficiency of your skill by telling the agent to use the initial version as a baseline, and run experiments on it until token consumption is improved while the result stays the same (don’t let it rephrase the skill in caveman lingo).

## The quiet layer

Why does the product and domain layer get so little attention? Because its failure mode is quiet. Sprawl shows up in CI logs and PR volume; you can see it and pull people back. Slop surfaces in the market, quarters later, or in some people’s stomachs, the longer they look at it. The engineering mainstream of the harness debate (OpenAI’s [harness engineering](https://openai.com/index/harness-engineering/) write-up, Addy Osmani’s [work on agent harnesses](https://addyosmani.com/blog/agent-harness-engineering/)) covers the technical layer well. But the other layer needs different disciplines and different roles: product people, designers, domain experts, the people whose “this is off” never touches a compiler. Their judgment is the backpressure itself, the thing the output has to clear while it is being made, well before any review step at the end.

## The wins and the costs land on different desks

There is a version of this that has nothing to do with tooling. Charity Majors [wrote in June](https://charity.wtf/2026/06/02/ai-enthusiasts-are-in-a-race-against-time-ai-skeptics-are-in-a-race-against-entropy-xpost/) about the gap between the people experiencing AI wins and the people carrying the costs: “There is no natural feedback loop.” The wins land on one team’s dashboard while the cleanup, the reliability erosion, and the on-call pain land on someone else entirely, and the organization polarizes instead of learning. Seen through this framework’s lens, the skeptics in that fight are often the people carrying the backpressure the system failed to encode. They hold the realistic information about failure modes precisely because the consequences flow to them.

Her question for the enthusiasts is the one I would put on the wall:

> “What would it take for you to feel comfortable shipping code to production without reading it?”

That single question turns a culture war into a loop-sizing requirement. Answer it honestly and you get a concrete list: the evals, the feature flags, the blast-radius limits, the rollback paths. Which is to say, you get a harness spec.

## How many agents can one person actually check?

Context and backpressure both work on the content of the loop: context is the gate in front of it, deciding whether the agent can do the work at all, and backpressure is the continuous pressure inside it, holding that work against reality. Neither says anything about how expensive it is for the human to close the loop at the human gate. Even the best-calibrated loose loop ends in a human decision point, and when closing it means wading through diffs, logs, and pull requests, the ceiling on parallel loops is set by the verification cost per judgment rather than by containment. That cost is the third lever.

Three families of mechanisms, and they sit on different levers. Self-inspection by the agent (opening the build in a browser, reading the DOM snapshot or screenshot, running Lighthouse, axe, Core Web Vitals) is backpressure in the engineering cluster, already covered above, no new lever required. The numeric signals are the clean ones here: a performance budget that speaks when broken and stays silent when held is sounder than an LLM judge grading aesthetics. Self-demonstration sits on both levers at once. Simon Willison’s [Showboat](https://github.com/simonw/showboat) is a CLI through which the agent presents its own work; Peter Steinberger has agents record a video demo of what they built. The demo makes the gate cheaper, since the human grades an outcome instead of digging through the artifact, and the act of building the demo is forced self-confrontation: while assembling it, the agent stumbles over its own defects, feedback that would never have existed without the obligation to present. Legibility artifacts (screenshots, recordings, trace summaries) are pure gate relief and carry no backpressure at all.

The trap deserves its own paragraph. A polished demo can hide slop, and verification stress is partly productive: it is the friction where bad judgment gets caught. Self-demonstration makes the human gate cheaper to clear without making the judgment behind it any sharper. Demo mechanisms lower the cost of good judgment, while the judgment itself still has to come from outcome grading against goldens and counterexamples.

## Does any of this exist outside slide decks?

OpenClaw is the closest thing we have to a running reference in the wild: a 375k-star project with roughly 7,000 open issues and PRs, operated deep in the loose zone. There is a fascinating ring of [metatools](https://openclaw.ai/ecosystem/) the team built around the agent. Almost none of it is the harness itself. It is what has to surround a harness before you can let an agent run loose for hours and trust what comes back, and it does two jobs at once: supply backpressure, and give the agent enough infrastructure to reach its own verdict.

- Context tooling like the crawl family that externalizes scattered knowledge into searchable stores, so the agent can ground itself instead of guessing
- Execution machinery like crabbox, fanning tests across operating systems and clouds in reproducible containers, so the agent can answer what no prompt can answer for it: does this bug actually reproduce?
- Backpressure like ClawPatch (a reviewer subagent with a fix loop) and clawbench, which puts pressure on the harness itself
- Verification through evidence ledgers and human gates that carry the final judgment back to a person

The price of that autonomy was an explosion of infrastructure around a comparatively small harness.

My favorite vignette is Vincent Koc [running 24 tmux panes](https://x.com/vincent_koc/status/2059996577561755684) of agent sessions from a Mac Mini, reading failures, deciding what runs next. Rather than leaving the loop, the human steps one level up, and his judgment moves to the orchestration layer. Those 24 panes are also the verification cost ceiling from the section above. Containment is not what caps him. The cost per human judgment at the gate is, and every mechanism that lowers that cost lets the same attention orchestrate more panes.

## An old discipline on a new substrate

None of this is unprecedented, and it helps to know that. Long before anyone wired a compiler into an agent loop, factories had already worked out how to run a process that produces variable output: how to catch defects without inspecting every part by hand, and how to feed what each run teaches back into the line. The discipline has names, and most of them trace to one place, the Toyota Production System and the lean tradition that grew from it. Lay the two side by side and the mapping is close enough to be useful rather than cute.

| Factory floor | Harness equivalent |
|---|---|
| Total productive maintenance | Keeping the agent-legible environment in repair: AGENTS.md, skills, reproducible setup |
| Statistical process control | Watching loops for drift and rework signals instead of inspecting every output |
| Kaizen | The learning loop: production signals distilled back into skills, tests, and docs |
| Andon | Backpressure that stays silent on success and speaks only on failure |
| Poka-yoke | Deterministic backpressure that makes a class of mistake impossible: compilers, type checkers, scope denial |
| Jidoka | Human-in-the-loop gates: the line stops and escalates to a person on a real anomaly |
| Overall equipment effectiveness | Loop metrics: validated loops per month, loop closure rate |

The parallel is fascinating. We’re looking at the industrialization of software product development. The reassurance: these are not hypotheses. Manufacturing has run on statistical process control and continuous improvement for decades, so the shape of the answer is known even where the agentic details still are not. And it is an orthogonal lineage. Most of the harness debate argues from the software tradition of Agile, DevOps, and CI/CD, while this one comes in from the factory floor, which is exactly why it tends to land with people who have run a real process.

The factory and the loop part ways on one thing, though. Dave Snowden’s Cynefin framework ([HBR, 2007](https://hbr.org/2007/11/a-leaders-framework-for-decision-making)) sorts problems by how knowable the answer is before you start:

- **Clear:** cause and effect are obvious and there is one right way, like following a recipe.
- **Complicated:** a right answer exists, but it takes an expert to find it, like a mechanic diagnosing an engine.
- **Complex:** you cannot work the answer out in advance. You try something, watch what happens, and adjust, the way you learn what a market wants.

Classic manufacturing lives almost entirely in the first two. The ten-thousandth car door has a known target, and the whole Toyota line exists to hit it every time, every deviation a defect.

Most of what we point a loop at is the third kind, because building a software product is complex work, and so is most knowledge work. You don’t know the answer up front. It shows itself once you have tried a few. So we are building the same learning factory and the same hard-won discipline of running a process without checking every part by hand, and aiming it at complex work. That changes what we want from variance. A Toyota line fights to drive variance out. A loop is here to harvest it, generating several real variants and keeping the one that holds. The machinery carries straight over, and here the variance is the product. Backpressure is the wall that keeps it inside useful bounds. Some tasks really are ordered, well specified and fully tested, and there the loop behaves like the classic line and drives variance out. Complex is the common case, though, and complex is where the factory’s instinct flips.

The warning is the one the lean tradition keeps relearning. None of these mechanisms is a tool you install once. They are practices a team keeps alive, or they quietly rot, and a rotting harness is worse than none, because it still looks like it is working. Which is the same thing the [roles](/roles) page says from the other direction: you build and evolve the factory alongside the products it turns out, or you do not really have one.

## Glossary

- **Andon**: A signal on a line that stays dark while things run normally and lights up only when something goes wrong, so attention lands where it is needed. A pipeline that says nothing on green and pings you only on a failure.
- **Backpressure**: The resistance an agent works against while it builds, well before any review at the end: a failing test or type error on the technical side, an acceptance scenario or rubric on the product side. Some of it reaches the agent automatically as a signal in the loop; some it imposes on itself by following a discipline set at the start, like writing the failing test first and working until it goes green. The more of it you can encode, the longer you can let the loop run. A red build is backpressure; the agent reads it and fixes the code. So are acceptance criteria: write them well and the agent works against your definition of good as it goes, instead of a person catching the miss at the end.
- **Counterexample**: A plausible-looking but wrong output you keep on file, so the system learns never to produce that kind of thing again. The bug you once shipped and then wrote a regression test for.
- **Cynefin**: A Welsh word (roughly 'kuh-NEV-in') for sorting problems by how knowable the answer is before you start: clear (one obvious right way), complicated (a right answer an expert can find), complex (no answer you can work out in advance, so you try something and learn from what happens), and chaotic (no stable cause and effect at all). Different kinds of problem call for different approaches. Tying your shoes is clear, fixing an engine is complicated, taking a product into a market is complex. You already switch approaches for each without naming the buckets.
- **Golden**: An output you have blessed as correct and keep around as the answer key, to compare new output against. The trusted fixture in an integration test: the known-good result everything else is measured against.
- **Harness**: The scaffold that turns a model into an agent, assembled from many parts. Among those: the loop it works in, the tools it can reach, how its context is managed as a run grows long (compression, retrieval), the hooks that fire on what it does, subagents, and guardrails. Backpressure and other resistance attach here, and beyond it. An interactive agent tool like Claude Code, Codex, or Pi is a harness. You have been working inside one all along.
- **Jidoka**: Machines that detect an abnormality and stop on their own, handing the decision to a person rather than producing defects at full speed. A check that halts the pipeline and escalates to a human when it hits something it should not decide alone.
- **Kaizen**: Continuous improvement: many small, steady changes to how the work is done, driven by the people doing it and by what each run of the process reveals. A retrospective that actually changes how you work next time, repeated forever.
- **LLM judge**: A second AI you ask to grade the first one’s output against your criteria. It scales review, but has blind spots of its own. A tireless reviewer who needs a clear checklist, or it will rubber-stamp anything that looks plausible.
- **Outcome grading**: Judging whether a result is good, precisely enough that the judgment can be applied again and again, by a person or a machine. You already do the simplest version every time you write a test or a definition of done.
- **Overall equipment effectiveness (OEE)**: A single measure of how productive a line really is, combining how often it runs, how fast it runs, and how much of its output is good. Tracking not just how many pull requests ship, but how many survive review and hold up in production.
- **Poka-yoke**: Designing the work so a given mistake simply cannot happen, instead of relying on people to remember not to make it. A type checker that refuses to compile the bug, rather than a comment asking you to be careful.
- **Retained feedback**: Feedback only counts when it changes a later decision: an updated plan, a sharper test, a rejected option that stays rejected. The rest is noise, however expensive it was to produce. A review comment matters only if it changes the next commit.
- **Rubric**: A written list of what “good” means for a kind of output, so the same standard can be applied to every result. Acceptance criteria, reused as a grading checklist instead of a one-off.
- **Statistical process control (SPC)**: Watching a process with statistics to catch it drifting out of its normal range, rather than inspecting every finished part by hand. Tracking error rates or latency over time and reacting when the trend moves, instead of checking every single request.
- **The statistical middle (slop)**: Output that converges on the bland average of everything the model has ever read: plausible, smooth, and indistinguishable from anyone else’s. The onboarding text that reads like every onboarding text ever written.
- **Total productive maintenance (TPM)**: Keeping every machine on a line in good working order continuously, so production does not stop for failures that routine care would have prevented. Keeping your build scripts, docs, and dev setup current, so the next run does not break on something avoidable.

---

<!-- Page: grading -->

# Grading

> Outcome grading is the new specification. Tests, rubrics, scenarios, golden examples of known-good output, and why a rubric is not automatically truth.

Part of The Elastic Loop · https://elastic-loop.robert-glaser.de · last updated 2026-06-12

---
If you come from deterministic software development, where a requirement translates into code and the code either meets it or it doesn’t, the rest of this framework probably feels like it has a hole in the middle. Agents search a space of possible solutions (the longer argument lives on [Why](/why)), and a search process needs something to search against. That something is outcome grading: the ability to judge whether a result is good, precisely enough that a machine can iterate against the judgment. It is the bridge between the spec world and the search-space world, and without it the framework falls apart for exactly the people I most want to reach. So this page exists to build the bridge properly.

## You have been grading outcomes all along

Grading comes in instruments, and they form a ladder from primitive to broad.

- **Tests** check known cases. If you write software, you have been doing outcome grading all along, just in its most primitive form: this input, this expected output, pass or fail.
- **Rubrics** widen that to qualitative criteria for whole classes of behavior (does the answer cite the source, does the refund flow respect the policy, does the tone match the brand). Think acceptance criteria, reused as a grading checklist. Anthropic turned a question as soft as “is this beautiful?” into four graded criteria, one of them [originality, scored against template defaults and AI slop](https://www.anthropic.com/engineering/harness-design-long-running-apps), which is about as concrete as anti-slop grading gets.
- **Scenarios** generalize beyond the cases you already thought of: walkthroughs of situations the system should handle, written by someone who knows where the domain gets weird.
- **Goldens and counterexamples** calibrate everything else. Goldens are outputs you have blessed as correct and keep as the answer key; counterexamples are the plausible-looking wrong ones, which in my experience are worth more per line than almost any other artifact.
- **LLM judges** make qualitative grading scalable, with failure modes of their own (more on that in a moment).
- **Human review** stays irreplaceable wherever the call is genuinely a judgment call.
- **Production signals** are reality itself as the last layer: usage, rework, support tickets, exceptions, logs, the market quietly grading what everything upstream let through.

Notice who can contribute to this ladder. Tests belong to engineers (remember product owners writing Cucumber tests?). Scenarios, counterexamples, and rubrics very often belong to the domain expert, the designer, the product person. That is the point [Roles](/roles) makes in full.

## The spec changes jobs as the loop stretches

So where does that leave the specification? I keep meeting two camps: one that wants to write everything down before the agent moves, and one that has declared specs dead. Both are answering a sizing question with a doctrine. The spec changes its job as the loop stretches:

| Loop zone | Role of the spec | Harness focus | Grading focus |
|---|---|---|---|
| Tight | Starter grip: intent, examples, non-goals | IDE, tests, UI, logs | Human judgment, small tests, diff feedback |
| Elastic | Steering contract: boundaries, acceptance, constraints, risks, search space | Reproducible setup, CI, PR review, reviewer subagents, checkpoints | Tests, acceptance criteria, rubrics, review checklist |
| Loose | Executable spec in the production system | Sandbox, worktrees, isolated runtime, permissions, audit, rollback, observability | Outcome rubrics, scenarios, goldens, counterexamples, automated graders, abort criteria, regression gates |
| *Learning loop (wraps every zone)* | Feedback distillate for the next piece of work | Stored traces, error classes, skills, docs | Rework, production signals, drift, pattern learning |

Read the spec column top to bottom and you can watch the artifact change character: from a starter grip to a steering contract to something executable, and finally, in the learning loop that wraps every zone, to distilled feedback for the next round. Keep in mind, though:

**A spec earns its keep when it structures the search space. The moment it stands in for that space, it has turned back into a waterfall.**

The spec can and should never be the beginning of a waterfall, but rather a living artifact humans (and agents) iterate on. Zero-shotting is not a sport. Well, most times it’s not. And yet the pendulum is swinging back toward exactly that little waterfall right now: write everything down first, let the agent execute, and treat anyone who wants to feed learning back mid-run as undisciplined. Organizations rehearsed this movement for twenty years of doing agile without becoming agile, adopting the ceremony and dropping the feedback loop.

A spec that pins down every implementation detail has quietly turned the agent back into a typist, and you paid for a search process you never ran.

The reflex has a name by now. ThoughtWorks tracks [spec-driven development](https://www.thoughtworks.com/radar/techniques/spec-driven-development) on the Radar, notably no further than Assess, warning that teams might “relearn a bitter lesson, that handcrafted detailed rules for AI ultimately do not scale”. Birgitta Böckeler [sorts the practice](https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html) into three levels of rigor: spec-first writes the spec up front and archives it once the code exists, spec-anchored keeps it in the repo, maintained with every change and validated against the implementation, and spec-as-source treats code as mere generated output of the spec, a future no tool delivers in production today. Spec-anchored is the close relative of this framework: a living, versioned artifact instead of one-way planning.

What this framework adds is a shift of attention from the input layer to the evaluation layer. Whichever rigor level you run, the artifacts humans iterate on in wide loops are the grading material, the spec and everything that calibrates it, while the agents iterate on the output itself. When you catch yourself patching the output instead of sharpening the grader, you are working on the wrong layer, and the loop will keep producing the thing you just fixed.

![Two iteration layers](https://elastic-loop.robert-glaser.de/diagrams/two-iteration-layers.svg)

*[Figure: Two iteration layers, drawn as two concentric loops. The inner agent loop runs fast over the output; the outer human loop runs slowly over the grading material: spec, rubrics, scenarios, goldens, counterexamples. Output flows outward to evaluation; sharpened graders flow back inward. Patch the output and the loop reproduces it; sharpen the grader and it learns.]*

## A rubric is not automatically truth

Here is the caveat I want printed in bold on every eval dashboard: writing a rubric does not make the rubric true. Good graders are engineering artifacts themselves, calibrated against counterexamples and failure taxonomies, with the occasional human read and drift control. Skip that work and the agent will optimize against the grader you actually built, which usually rewards whatever sounds plausible long before anything that holds up. An LLM judge calibrated on nothing will happily wave through the statistical middle, and slop walks in wearing a passing grade. A polished demo of the agent’s own work deserves the same skepticism: it makes your judgment cheaper to apply, while the grading substance still has to come from goldens, counterexamples, and scenarios.

One failure mode hides inside the word grader: the thing that grades cannot be the thing that made the work. Models reliably praise their own output, and the leniency deepens when the work under review was itself machine-made. Anthropic had to tune their evaluator over several rounds because early versions talked themselves into approving bugs they had just flagged. Whatever closes the loop, a separate reviewer agent, a held-out judge, a human, has to sit outside the run that produced the work, or you have automated self-congratulation.

There is also a structural answer to the agent satisfying the letter of your grader while missing the point (reward hacking, if you want the ML name, the AI cousin of teaching to the test). StrongDM’s software factory shows what it looks like when you take it seriously ([Simon Willison has a good writeup](https://simonwillison.net/2026/Feb/7/software-factory/)). Their internal charter reads

> “Code must not be written by humans. Code must not be reviewed by humans,”

which is too radical for most organizations I work with, but the design vocabulary travels well. Their scenario holdouts keep the test sets outside the repo, so the agent cannot tune itself to pass an exam it can read, the same instinct that made machine-learning teams guard their test data for decades. And their satisfaction testing asks a second AI, across thousands of runs, how often the agent’s path through a scenario would actually leave the user happy.

## What you do with cheap search

Once grading works, the economics get interesting. When search gets cheap, the move is to generate several variants and select, rather than pushing one ticket through faster. Best-of-n becomes a delivery discipline. This is the variance from [Why](/why) put to work: the spread a search throws off, harvested on purpose instead of stamped flat. Basically, what digital product development has always preached, except that building variants with the actual material (code) was slow and expensive. That’s why the Design Sprint exists.

But (and this is the part I see teams get wrong in the first month) variants are only valuable when they are born close to the decision and close to the system. They have to live in the real material: the architecture and design system they target, the real APIs and data they run against, the same tests and scenarios that grade everything else. The anti-pattern is the gorgeous prototype with no system contact, a demo that nobody can merge and nobody learns from. A variant only becomes learning once a decision rides on it; without that, it is just more output.

Which leaves the question I would ask your team before any tooling discussion:

> If an agent handed you five solutions tomorrow, could you say which one is best, and could you say why in a form a machine could check next time?

## Glossary

- **Best-of-n**: Generate several independent attempts at a task, then pick the best, instead of pushing one attempt through faster. Spiking three solutions and choosing one, except cheap enough to do as a routine.
- **Counterexample**: A plausible-looking but wrong output you keep on file, so the system learns never to produce that kind of thing again. The bug you once shipped and then wrote a regression test for.
- **Drift**: Slow movement away from what is correct or wanted, over time, without any single obvious break. The product or codebase quietly getting less coherent release after release.
- **Golden**: An output you have blessed as correct and keep around as the answer key, to compare new output against. The trusted fixture in an integration test: the known-good result everything else is measured against.
- **Holdout**: Test cases you deliberately keep out of the agent’s reach, so it cannot tune itself to pass an exam it has already seen. Keeping the real exam questions out of the study guide.
- **LLM judge**: A second AI you ask to grade the first one’s output against your criteria. It scales review, but has blind spots of its own. A tireless reviewer who needs a clear checklist, or it will rubber-stamp anything that looks plausible.
- **Outcome grading**: Judging whether a result is good, precisely enough that the judgment can be applied again and again, by a person or a machine. You already do the simplest version every time you write a test or a definition of done.
- **Reward hacking**: When the agent learns to satisfy the letter of your grader while missing the point. The AI version of teaching to the test. Code that passes the test by hardcoding the expected value instead of actually solving the problem.
- **Rubric**: A written list of what “good” means for a kind of output, so the same standard can be applied to every result. Acceptance criteria, reused as a grading checklist instead of a one-off.
- **Search space**: The set of possible solutions a task allows. An agent does not translate a spec line by line; it samples from this space, pulled toward one solution by your context, or proposing a few in a planning step. Generating many on purpose and selecting the best is a discipline you opt into. There are many valid ways to implement a story. The agent is choosing among them, not transcribing the one right answer.
- **The statistical middle (slop)**: Output that converges on the bland average of everything the model has ever read: plausible, smooth, and indistinguishable from anyone else’s. The onboarding text that reads like every onboarding text ever written.

---

<!-- Page: roles -->

# Roles

> Every role carries judgment about agent work that nobody else can supply. What engineers, product people, designers, domain experts, the people who run the process, and leaders each bring to the loop.

Part of The Elastic Loop · https://elastic-loop.robert-glaser.de · last updated 2026-06-12

---
Every role on a team carries judgment about agent work that nobody else can supply: what is worth building, which of several solutions is the right one, and whether the result actually holds up. Checking the output is only the last of those. It is the first thing that gets lost when companies roll out agents: engineering builds the [harness](/harness), dashboards fill up with velocity, and the people who actually know the product, the users, and the domain stand around wondering whether their job now requires learning Python. It does not. What the loop needs from them runs its whole length. Someone has to set the intent that opens it and bring the context that grounds it. Someone has to supply the backpressure that keeps it honest. And someone has to have the taste to tell a strong variant from a merely plausible one. The product and domain side of that judgment is also what keeps fast output from sliding into slop (the plausible, generic stuff that passes every technical check and dies quietly in the market). Treating non-engineering roles as smaller coders wastes them. Their leverage is the judgment no compiler or linter can provide, across the loop and not just at its end.

## The engineer: designer of the loop

The shift here is an identity question. If your professional identity was “I type the implementation”, agentic work feels brutal. If it was “I turn ambiguity into reliable systems”, the canvas just got larger. The new disciplines have names by now: intent design, context engineering, harness engineering, backpressure design, verification and evaluation, outcome grading, variant generation and selection, production learning, and plain judgment about what deserves to exist. Nine is a lot, and nobody masters all of them. The shortest version I have: The developer becomes less like a manual fabricator and more like a designer of executable learning factories.

## The product owner: one discipline, two outputs

If you have ever written acceptance criteria, you have already produced product backpressure. You produced it for humans, late in the loop, as part of a handoff. The change is that this material becomes a steering instrument: agents will iterate against whatever definition of good you can make explicit, and they will iterate against the gaps in it too.

POs are not replaced by AI. Their bottleneck moves to intent and verification.

In practice that splits into seven jobs:

1. Make intent explicit: problem, user, impact, assumptions, non-goals, which constraints are hard and which are negotiable.
2. Curate product context in agent-legible form (scenarios, examples, decision history, business rules) instead of letting it evaporate in chat threads.
3. Make options and trade-offs visible before committing.
4. Test variants in real software, close to the actual architecture and data, because a variant far from the system is output rather than learning.
5. Prevent backlog inflation, the sprawl pattern in product clothing, since an agent with an unclear mandate generates artifacts faster than any grooming session can absorb.
6. Co-design evaluation with UX, engineering, QA, and the business side.
7. Feed production learning back into the next loop: usage signals, rework, support pain.

Notice what these seven have in common. They are one discipline with two outputs. The PO defines the search space, which is intent and context and the non-goals that fence it. And the PO supplies what the result gets graded against: acceptance scenarios, counterexamples, rubrics. That sits closer to machine learning engineering than to classic product management. Whether the role’s name survives the shift is a question I cannot answer yet.

## The designer

Interaction vignettes can become rubrics an agent’s output gets graded against. Evaluating generated variants is grading work, and designers have been doing the qualitative version of it for years. Brand and voice consistency is domain-specific backpressure of the purest kind, the “this is off” that no test suite produces. And the design system needs to become a constraint in the search space, walls the agent works within rather than guidelines it might read. Not a bin of Lego bricks to snap together and nothing more. That degrades the agent to an assembler and wastes most of what it can do. I am keeping this section deliberately short: this role is less mapped than the PO’s, and I would rather leave it open for now.

## The domain expert

Edge cases, professional rules, failure taxonomies, regulatory and operational limits, decision history, the war stories about why the obvious solution was wrong the last time someone tried it. Golden examples of known-good output, counterexamples of plausible-but-wrong output. This is the most valuable grading material in the whole loop, for a simple reason. It is the material agents are least able to generate themselves. That makes it the hardest part of the harness to bootstrap, and the part most worth a person’s time to supply.

## The people who run the process

Two anchors hold what I can say so far. First, the mirror logic: the loop reflects the team’s rituals back at it, so whatever a scrum master or coach has built into how the team works, agents will amplify. Second, [Charity Majors’ observation](https://charity.wtf/2026/06/02/ai-enthusiasts-are-in-a-race-against-time-ai-skeptics-are-in-a-race-against-entropy-xpost/) that AI wins and AI costs often land with different people, so “there is no natural feedback loop”. The backpressure discipline I see emerging is organizational loop closure: making sure what individual loops learn lands with the team instead of staying private practice, and turning loop sizing (tight, elastic, loose: how much line the agent gets) into an explicit team decision rather than something each person quietly settles alone. What does a retrospective look like when half the iterations happened inside an agent run? I do not know yet. Anyone selling a finished framework for it this early is guessing. But if a team builds and evolves a factory (the harness) alongside the products the factory produces, there must be mechanisms in place to identify failures and harden the factory with every iteration.

## The engineering leader

The engineering leader carries four jobs in this model.

- **Loop intelligence:** what are we actually learning from the real loops running in our teams, which ones close, which stay open, which decay?
- **Backpressure maturity diagnosis:** which teams are stuck in tight supervision because the resistance layer is missing, not because the people are cautious?
- **Deciding where backpressure investment goes**, because the resistance layer is built, not bought incidentally.
- **Capability building over tool procurement:** the lever is rarely the next license, it is whether the team can encode its judgment into something a loop can use.

## Do these become new jobs?

So does this spawn new roles, or just reshape the ones you have? Mostly it reshapes them. A new title earns its place only when an important task needs a scarce skill no existing role reliably supplies, and someone will fund it and answer for it when it rots. Most candidates fail that test. The skill folds back into a senior version of a job that already exists, or the next model absorbs it. Everything in the sections above is an existing role whose center of gravity moved, from making artifacts to encoding judgment a loop can use.

One job does pass the test, and it is new. The harness engineer owns the shared scaffold your agents run inside, the loop shapes and tools and graders and guardrails, as a product with a roadmap and a pager. The scarce skill is debugging a probabilistic system end to end and knowing which knob actually moves the failure rate, a reliability instinct most product teams have never had to build. It earns a title the way shared infrastructure eventually does. The harness crosses team lines and rots when it belongs to no one, so a company ends up putting someone on it. Below platform scale it stays a hat a staff engineer wears.

One more sits at the edge: who decides what the fleet is allowed to touch, and who answers when an agent writes to production it should never have reached. That usually attaches to a security function you already have, and someone there has to own it. Past that, the honest answer is I don’t know yet. Whether grading and eval work becomes its own seat, or stays a hat the product owner, designer, and domain expert pass around, depends on scale and on how fast the tooling commoditizes the one hard part, calibrating the judge. So the org chart barely grows. One new title, maybe two at scale, and underneath them the old roles carrying more weight than they used to.

## When handoffs stop making sense

Here is the reframe I want to leave you with. Where people expect AI to dissolve silos by making everyone do everything, the actual mechanism is that AI materializes the intermediate steps. The spec draft, the prototype, the test, the review artifact, the things that used to justify a handoff, now appear in hours inside the loop. The old sequence (PO formulates, UX designs, engineering builds, QA checks, operations learns about it later) assumed those steps were expensive enough to deserve their own stations. The loop model runs intent, context, variants, verification, decision, production learning, and the roles gather around it rather than queueing along it. The bottleneck moves to the work everyone now does at once: sharpening the context and the assumptions, weighing the options, agreeing what good looks like.

![Stations versus loop](https://elastic-loop.robert-glaser.de/diagrams/stations-vs-loop.svg)

*[Figure: Stations versus loop. Before: the old sequence of roles as a handoff line, PO to UX to Engineering to QA to Ops, each step expensive enough to deserve its own station. After: the same people gathered around a central loop of six uniform steps (intent, context, variants, verification, decision, production learning), each role labelled with the backpressure it supplies. AI materializes the intermediate steps; each role brings judgment no one else can.]*

Which brings this page back to where it started: every role carries judgment nobody else can supply. The question for your team is:

> Whose judgment is still trapped in someone’s head, where no loop can reach it?

## Glossary

- **Backpressure**: The resistance an agent works against while it builds, well before any review at the end: a failing test or type error on the technical side, an acceptance scenario or rubric on the product side. Some of it reaches the agent automatically as a signal in the loop; some it imposes on itself by following a discipline set at the start, like writing the failing test first and working until it goes green. The more of it you can encode, the longer you can let the loop run. A red build is backpressure; the agent reads it and fixes the code. So are acceptance criteria: write them well and the agent works against your definition of good as it goes, instead of a person catching the miss at the end.
- **Counterexample**: A plausible-looking but wrong output you keep on file, so the system learns never to produce that kind of thing again. The bug you once shipped and then wrote a regression test for.
- **Golden**: An output you have blessed as correct and keep around as the answer key, to compare new output against. The trusted fixture in an integration test: the known-good result everything else is measured against.
- **Harness**: The scaffold that turns a model into an agent, assembled from many parts. Among those: the loop it works in, the tools it can reach, how its context is managed as a run grows long (compression, retrieval), the hooks that fire on what it does, subagents, and guardrails. Backpressure and other resistance attach here, and beyond it. An interactive agent tool like Claude Code, Codex, or Pi is a harness. You have been working inside one all along.
- **Outcome grading**: Judging whether a result is good, precisely enough that the judgment can be applied again and again, by a person or a machine. You already do the simplest version every time you write a test or a definition of done.
- **Rubric**: A written list of what “good” means for a kind of output, so the same standard can be applied to every result. Acceptance criteria, reused as a grading checklist instead of a one-off.
- **Search space**: The set of possible solutions a task allows. An agent does not translate a spec line by line; it samples from this space, pulled toward one solution by your context, or proposing a few in a planning step. Generating many on purpose and selecting the best is a discipline you opt into. There are many valid ways to implement a story. The agent is choosing among them, not transcribing the one right answer.
- **The statistical middle (slop)**: Output that converges on the bland average of everything the model has ever read: plausible, smooth, and indistinguishable from anyone else’s. The onboarding text that reads like every onboarding text ever written.

---

<!-- Page: faq -->

# FAQ

> How to apply the framework, starting with the question I get most: read the book, then put your own agent on it and let it work the framework against your team, your readiness, and your tasks.

Part of The Elastic Loop · https://elastic-loop.robert-glaser.de · last updated 2026-06-19

---
This page answers the practical questions that come up once the framework clicks, starting with: how do I apply this to my team? The book is written for you, a human reader, first. But it also assumes you will put your own agent to work on it, and it is built so an agent can pick it up and carry the framework straight into your situation.

## How do I apply this framework?

Start with something built into the site that is easy to miss. Every page here has a plain-markdown twin (`/loops.md`, `/grading.md`, and so on for each page), the whole book is concatenated at [/llms-full.txt](/llms-full.txt), and there is a map at [/llms.txt](/llms.txt). Because agents love markdown, because everyone loves markdown, right? Hand the material to your agent and let it reason about your situation in these terms.

I think of the book as a *context trajectory* as much as a text: the same pages you read are also a body of context you can hand to an agent, so that its questions, its diagnoses, the artifacts it drafts, all come back shaped by the framework instead of the statistical middle of everything ever written about AI and teams. Reading it yourself and putting an agent on it are not in competition. The reading is where your own judgment forms; the agent is how you bring that judgment to bear on your specifics without working through all of it by hand.

What that looks like in practice, roughly in the order the book moves:

- **Organizational readiness.** Hand the agent [Loops](/loops) and its four strategic questions, then your real context: the artifacts (architecture docs, codebases, a handful of recent pull requests), the work in flight (open stories, your recent agent sessions), and the thing none of those show, how decisions actually get made. Then ask it where your current loop sizes are choices and where they are accidents.
- **Loop sizing.** Give it a specific task off your board and the seven sizing criteria, and make it argue for a size and a set of gates instead of reaching for “loose” because that feels like progress.
- **Harness inventory.** Walk it through the [harness](/harness) clusters and the two backpressure layers against what you actually have wired up, so the gaps in product and domain backpressure stop being invisible. Why not let your agent propose extensions for the harness you’re using right now?
- **Grading material.** Point it at [Grading](/grading) and a slice of your domain, and have it draft the boring artifacts (rubrics, scenarios, and especially counterexamples) a loose loop needs before it can run.

One caveat, because the framework turns on it: an agent applying this to your org is itself a loop, and its read of your situation is a starting grip, not truth, in the way a rubric is not automatically truth. It will sound confident about a trust problem it cannot actually see, or wave through a loop size your blast radius does not support. The judgment about whether its diagnosis is any good is still yours, and supplying that judgment is the whole skill this framework is about. So treat the output as a strong first draft to argue with, rather than a verdict to adopt.

## Do I need to be an engineer to apply it?

No, and treating the framework as an engineering-exclusive topic would make me half-sad. Half of it is judgment about results rather than code: whether the thing was worth building, which of several variants is the right one, whether the output is the real thing or the plausible-generic version of it. [Roles](/roles) is the long version. The short version is: product people, designers, and domain experts supply resistance no compiler, no test suite can produce, and they supply it across the whole loop, not just for the review at the end.

## Isn’t this just spec-driven development?

No, though the two get confused right now, because the industry is mid-swing back toward writing everything down before the agent moves. A spec is useful when it structures the search space. Replace the space with it, though, and you’ve turned the agent into a typist and paid for a search you never ran. [Grading](/grading) sorts out where specs help and where they hurt, and why it’s important that you move from the spec you write up front to the grading material you keep iterating on. The spec alone is just context.

## Is Scrum dead?

Yes, but it matters which part dies. Scrum is risk containment, and the two-week sprint bundles two different risks under one cadence. The first is building the thing wrong: the wrong implementation, the wrong technical path. That was expensive to undo when a person typed it slowly by hand, so capping the damage at two weeks was a sound economic bet. Agents collapse that bet, because redoing the implementation is cheap now.

The second risk is building the wrong thing entirely: the wrong feature, a wrong read of what the work was even for. The sprint review caught that one through stakeholder and PO feedback. It never had anything to do with typing speed. Agents arguably make it worse, because building the wrong thing fast is how you end up with sprawl and slop. So this half of the bet does not collapse. It gets sharper, and the check Scrum ran every two weeks now has to run closer to continuously.

What dies, then, is the calendar: the fixed ceremony that paced both risks at one rhythm. The judgment the review carried survives. It just moves into [Grading](/grading) and product backpressure instead of a recurring slot. [Loops](/loops) lets you set that cadence per task, where the sprint used to set one rhythm for everything at once. Loop sizing is a per-task dial, so it answers the cadence question and not the separate one of coordinating several people who each drive their own agents, which Scrum also handled.

In a sense, you can say that agentic engineering is the biggest lever to finally work truly agile. [Iteration is all we need](https://www.robert-glaser.de/what-if-iteration-is-all-we-need/).
## We are debating which model to standardize on. Where does that fit?

Lower on the list. The measurement on [Why](/why) and [Harness](/harness) (clawbench) finds that the setup around the model, the harness, moves the result about ten times more than the model choice does. Pick a capable frontier model, then put the argument where the leverage sits: the loop, the backpressure, and the grading around it.

## Should our harnesses be standardized, or custom to each use case?

Both, but split along the right seam. The harness itself has to be specific. It encodes this codebase, this domain’s rules, these graders. A one-size harness handed to every team is the statistical middle in tooling form, the same slop you are working to keep out of the output. There is no generic harness worth having, for the same reason there is no generic product worth shipping.

What standardizes is the layer underneath, the disciplines rather than the device: how a team builds a harness and keeps it from rotting, the backpressure clusters it works through, the grading vocabulary, the readiness gate before a task goes loose. A hospital is the closest picture. Every patient gets a custom course of treatment, and nobody franchises that. What is standard is the process around it, the checklists, the sterile technique, the handover discipline. You standardize the process that produces safe variation, not the variation itself.

In between sits a parts catalog: shared skills, templates, reviewer subagents, grading sets other teams can lift and adapt. Treat those as building blocks to compose, never a finished machine to install. The finished machine is always local.

## Why is the book so short?

Because tactics age in months and strategy doesn’t. Anything I could pin down about today’s exact tools, the flags, the model names, the way one harness wires up its hooks, would be wrong by the time you read it, and the book is meant to outlive that. So it stays at the level that holds (probably): how to size a loop, what context buys you, how to codify judgment. The book lives and grows, but it grows along that line.

There is a second reason, and it is the framework turned on itself. The book is context, and context you hand to an agent costs something. A bloated, tactical book would clog your own context window and your agent’s, and most of what clogged it would be the [statistical middle](/why) anyway. I would rather it ground you: a small, dense body of context that shapes the questions your agent asks and the diagnoses it offers. It can synthesize your specifics on its own, once the framework has it in its grip. The book does not need to carry them.

## Why is this book a website?

Because the good old web still serves more devices, and more agents, than any other format. A PDF or an EPUB freezes the moment it ships; this thing keeps living, and the web is the only place that update reaches you without a re-download. Every page here already comes with a plain-markdown twin, and the whole book concatenates at [/llms-full.txt](/llms-full.txt), so your agent reads it as easily as you do.

Everyone is using their own agent now to bend concepts onto their own situation, their role, what they already know. A book fights that. Who wants to copy-and-paste their way through context-managing a book by hand, one that keeps changing under them? Hand an agent a URL and that problem disappears: you point it at the page, and it does the bending for you.

## Glossary

- **Backpressure**: The resistance an agent works against while it builds, well before any review at the end: a failing test or type error on the technical side, an acceptance scenario or rubric on the product side. Some of it reaches the agent automatically as a signal in the loop; some it imposes on itself by following a discipline set at the start, like writing the failing test first and working until it goes green. The more of it you can encode, the longer you can let the loop run. A red build is backpressure; the agent reads it and fixes the code. So are acceptance criteria: write them well and the agent works against your definition of good as it goes, instead of a person catching the miss at the end.
- **Counterexample**: A plausible-looking but wrong output you keep on file, so the system learns never to produce that kind of thing again. The bug you once shipped and then wrote a regression test for.
- **Harness**: The scaffold that turns a model into an agent, assembled from many parts. Among those: the loop it works in, the tools it can reach, how its context is managed as a run grows long (compression, retrieval), the hooks that fire on what it does, subagents, and guardrails. Backpressure and other resistance attach here, and beyond it. An interactive agent tool like Claude Code, Codex, or Pi is a harness. You have been working inside one all along.
- **Rubric**: A written list of what “good” means for a kind of output, so the same standard can be applied to every result. Acceptance criteria, reused as a grading checklist instead of a one-off.
- **Search space**: The set of possible solutions a task allows. An agent does not translate a spec line by line; it samples from this space, pulled toward one solution by your context, or proposing a few in a planning step. Generating many on purpose and selecting the best is a discipline you opt into. There are many valid ways to implement a story. The agent is choosing among them, not transcribing the one right answer.
- **The statistical middle (slop)**: Output that converges on the bland average of everything the model has ever read: plausible, smooth, and indistinguishable from anyone else’s. The onboarding text that reads like every onboarding text ever written.

---

<!-- Page: skill -->

# The readiness skill

> A companion skill that turns the framework on your own team. Point your agent at your project, and it reads the live book, then grills you about how much loop you can carry, which loop sizes are available to you here, and the one constraint gating the next.

Part of The Elastic Loop · https://elastic-loop.robert-glaser.de · last updated 2026-06-17

---
The book is written for you to read first. This is the companion you put to work afterwards: a skill that turns the framework on your own team instead of on a single task. It reads the live book, then sits you down and works through how much loop you can actually carry here, which [loop sizes](/loops) are available to you in this codebase, and the one constraint gating the next size up. What you get back is a readiness verdict. Sometimes that verdict is a plain “loose is not available to you here yet”, when that is the honest read. Never a score, a level, or a maturity badge.

## Point your agent at the repo

The skill lives in the project’s GitHub repository, in its own folder. Hand any capable agent that folder and ask it to install the skill it finds there. It works in any harness that can read a URL.

```
Install this skill: https://github.com/youngbrioche/elastic-loop/tree/main/goodies/elastic-loop-readiness
```

## A few prompts to get going

Install it. Point your agent at the folder and let it take the skill from there:

```
Read https://github.com/youngbrioche/elastic-loop/tree/main/goodies/elastic-loop-readiness and install the skill you find there.
```

Then run the diagnosis, and say who you are so it grills the part of the loop you can actually speak to:

```
Run an elastic-loop readiness diagnosis with me. I am a product owner on a team that has started letting agents touch our codebase.
```

Or skip the self-report and let your own session history set the baseline:

```
Analyze my last 100 sessions in this project, take that as a starting point from an individual team member and start diagnosing our elastic loop readiness.
```

Or aim it straight at a doubt you already carry:

```
We let agents do the work but never merge without reading every line. Where are our loop sizes a choice, and where are they an accident?
```

## Or install it as a Claude Code plugin

If you are in Claude Code, there is a native path. Add the marketplace, install the plugin, reload:

```
/plugin marketplace add youngbrioche/elastic-loop
/plugin install elastic-loop-readiness@elastic-loop
/reload-plugins
```

Then ask for a loop-readiness diagnosis and Claude picks it up, or run it directly with `/elastic-loop-readiness:elastic-loop-readiness`. That is not a typo: the command reads `plugin:skill`, and here the plugin and the skill carry the same name.

Run it with the team, not for the team. Get the people who see different stretches of the loop into one room: an engineer, a product owner, a domain expert. Where they disagree about how far to let go is half of what you came for, because that disagreement is where the team’s trust actually sits. One person quietly filling it in for everyone else gets a tidy answer and skips the conversation that was the point. A single seat’s view is still worth having. It just sees one stretch of the loop and misses the rest.

One thing to keep in mind, because the framework turns on it: the skill running on your team is itself a loop. Its read is a strong first draft to argue with. The judgment about whether it holds is still yours.

## Glossary

- **Harness**: The scaffold that turns a model into an agent, assembled from many parts. Among those: the loop it works in, the tools it can reach, how its context is managed as a run grows long (compression, retrieval), the hooks that fire on what it does, subagents, and guardrails. Backpressure and other resistance attach here, and beyond it. An interactive agent tool like Claude Code, Codex, or Pi is a harness. You have been working inside one all along.

---

## License

© 2026 Robert Glaser. Code is licensed under Apache 2.0; site content is licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).