The Elastic Loop
The Elastic Loop · Part four

Grading

If you come from deterministic software development, where a requirement translates into code and the code either meets it or it doesn’t, the rest of this framework probably feels like it has a hole in the middle. Agents search a space of possible solutions (the longer argument lives on Why), and a search process needs something to search against. That something is Outcome gradingJudging whether a result is good, precisely enough that the judgment can be applied again and again, by a person or a machine.You already do the simplest version every time you write a test or a definition of done.: the ability to judge whether a result is good, precisely enough that a machine can iterate against the judgment. It is the bridge between the spec world and the search-space world, and without it the framework falls apart for exactly the people I most want to reach. So this page exists to build the bridge properly.

You have been grading outcomes all along

Grading comes in instruments, and they form a ladder from primitive to broad.

Notice who can contribute to this ladder. Tests belong to engineers (remember product owners writing Cucumber tests?). Scenarios, counterexamples, and rubrics very often belong to the domain expert, the designer, the product person. That is the point Roles makes in full.

The spec changes jobs as the loop stretches

So where does that leave the specification? I keep meeting two camps: one that wants to write everything down before the agent moves, and one that has declared specs dead. Both are answering a sizing question with a doctrine. The spec changes its job as the loop stretches:

Loop modeRole of the specHarness focusGrading focus
Tight / ExploratoryStarter grip: intent, examples, non-goalsIDE, tests, UI, logsHuman judgment, small tests, diff feedback
Task / ImplementationSteering artifact: boundaries, acceptance, contextReproducible setup, CI, PR reviewTests, acceptance criteria, review checklist
Feature / DelegationDelegation contract: goal, constraints, risks, search spaceWorktree, sandbox, traces, checkpointsOutcome rubric, scenarios, counterexamples, human review
Dark factoryThe loose loop run to its end: agents delivering on their own against automated checks, with humans reviewing outcomes rather than every turn.Named after lights-out manufacturing, where the line runs without people on the floor.Executable spec in the production systemIsolated runtime, permissions, audit, rollback, observabilityAutomated graders, goldens, abort criteria, regression gates
Learning LoopFeedback distillate for the next piece of workStored traces, error classes, skills, docsRework, production signals, drift, pattern learning

Read the spec column top to bottom and you can watch the artifact change character: from a starter grip to a steering artifact to a contract to something executable, and finally to distilled feedback for the next round. Keep in mind, though:

The spec is useful when it structures the Search spaceThe set of possible solutions a task allows. An agent does not translate a spec line by line; it samples from this space, pulled toward one solution by your context, or proposing a few in a planning step. Generating many on purpose and selecting the best is a discipline you opt into.There are many valid ways to implement a story. The agent is choosing among them, not transcribing the one right answer.. It becomes harmful when it replaces the search space.

The spec can and should never be the beginning of a waterfall, but rather a living artifact humans (and agents) iterate on. Zero-shotting is not a sport! Well, most times it’s not. And yet the pendulum is swinging back toward exactly that little waterfall right now: write everything down first, let the agent execute, and treat anyone who wants to feed learning back mid-run as undisciplined. Organizations rehearsed this movement for twenty years of doing agile without becoming agile, adopting the ceremony and dropping the feedback loop.

A spec that pins down every implementation detail has quietly turned the agent back into a typist, and you paid for a search process you never ran.

The reflex has a name by now. ThoughtWorks tracks spec-driven development on the Radar, notably no further than Assess, warning that teams might “relearn a bitter lesson, that handcrafted detailed rules for AI ultimately do not scale”. Birgitta Böckeler sorts the practice into three levels of rigor: spec-first writes the spec up front and archives it once the code exists, spec-anchored keeps it in the repo, maintained with every change and validated against the implementation, and spec-as-source treats code as mere generated output of the spec, a future no tool delivers in production today. Spec-anchored is the close relative of this framework: a living, versioned artifact instead of one-way planning.

What this framework adds is a shift of attention from the input layer to the evaluation layer. Whichever rigor level you run, the artifacts humans iterate on in wide loops are the spec, the rubrics, the scenarios, the goldens, and the counterexamples, in short the grading material, while the agents iterate on the output itself. When you catch yourself patching the output instead of sharpening the grader, you are working on the wrong layer, and the loop will keep producing the thing you just fixed.

A rubric is not automatically truth

Here is the caveat I want printed in bold on every eval dashboard: writing a rubric does not make the rubric true. Good graders are engineering artifacts themselves, calibrated with counterexamples, error taxonomies, human review, and DriftSlow movement away from what is correct or wanted, over time, without any single obvious break.The product or codebase quietly getting less coherent release after release. control. Skip that work and the agent will optimize against the grader you actually built, which usually means output that sounds plausible rather than output that is right. An LLM judge calibrated on nothing will happily wave through the The statistical middle (slop)Output that converges on the bland average of everything the model has ever read: plausible, smooth, and indistinguishable from anyone else’s.The onboarding text that reads like every onboarding text ever written., and slop walks in wearing a passing grade! A polished demo of the agent’s own work deserves the same skepticism: it makes your judgment cheaper to apply, while the grading substance still has to come from goldens, counterexamples, and scenarios.

There is also a structural answer to the agent satisfying the letter of your grader while missing the point (Reward hackingWhen the agent learns to satisfy the letter of your grader while missing the point. The AI version of teaching to the test.Code that passes the test by hardcoding the expected value instead of actually solving the problem., if you want the ML name, the AI cousin of teaching to the test). StrongDM’s software factory shows what it looks like when you take it seriously (Simon Willison has a good writeup). Their internal charter reads

“Code must not be written by humans. Code must not be reviewed by humans,”

which is too radical for most organizations I work with, but the design vocabulary travels well. Their scenario HoldoutTest cases you deliberately keep out of the agent’s reach, so it cannot tune itself to pass an exam it has already seen.Keeping the real exam questions out of the study guide. keep the test sets outside the repo, so the agent cannot tune itself to pass an exam it can read, the same instinct that made machine-learning teams guard their test data for decades. And their satisfaction testing asks a second AI, across thousands of runs, how often the agent’s path through a scenario would actually leave the user happy.

Once grading works, the economics get interesting. When search gets cheap, the move is to generate several variants and select, rather than pushing one ticket through faster. Best-of-nGenerate several independent attempts at a task, then pick the best, instead of pushing one attempt through faster.Spiking three solutions and choosing one, except cheap enough to do as a routine. becomes a delivery discipline. Basically, what digital product development has always preached, except that building variants with the actual material (code) was slow and expensive. That’s why the Design Sprint exists.

But (and this is the part I see teams get wrong in the first month) variants are only valuable when they are born close to the decision and close to the system: near the target architecture, inside the design system, against real APIs and data structures, checkable through the same tests and scenarios that grade everything else. The anti-pattern is the gorgeous prototype with no system contact, a demo that nobody can merge and nobody learns from. A variant only becomes learning once a decision rides on it; without that, it is just more output.

Which leaves the question I would ask your team before any tooling discussion:

If an agent handed you five solutions tomorrow, could you say which one is best, and could you say why in a form a machine could check next time?