Grading
If you come from deterministic software development, where a requirement translates into code and the code either meets it or it doesn’t, the rest of this framework probably feels like it has a hole in the middle. Agents search a space of possible solutions (the longer argument lives on Why), and a search process needs something to search against. That something is outcome grading: the ability to judge whether a result is good, precisely enough that a machine can iterate against the judgment. It is the bridge between the spec world and the search-space world, and without it the framework falls apart for exactly the people I most want to reach. So this page exists to build the bridge properly.
You have been grading outcomes all along
Grading comes in instruments, and they form a ladder from primitive to broad.
- Tests check known cases. If you write software, you have been doing outcome grading all along, just in its most primitive form: this input, this expected output, pass or fail.
- Rubrics widen that to qualitative criteria for whole classes of behavior (does the answer cite the source, does the refund flow respect the policy, does the tone match the brand). Think acceptance criteria, reused as a grading checklist.
- Scenarios generalize beyond the cases you already thought of: walkthroughs of situations the system should handle, written by someone who knows where the domain gets weird.
- Goldens and counterexamples calibrate everything else. Goldens are outputs you have blessed as correct and keep as the answer key; counterexamples are the plausible-looking wrong ones, which in my experience are worth more per line than almost any other artifact.
- LLM judges, a second AI you ask to grade the first one’s output against your criteria, make qualitative grading scalable, with failure modes of their own (more on that in a moment).
- Human review stays irreplaceable wherever the call is genuinely a judgment call.
- Production signals are reality itself as the last layer: usage, rework, support tickets, exceptions, logs, the market quietly grading what everything upstream let through.
Notice who can contribute to this ladder. Tests belong to engineers (remember product owners writing Cucumber tests?). Scenarios, counterexamples, and rubrics very often belong to the domain expert, the designer, the product person. That is the point Roles makes in full.
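The lower rungs of the ladder can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the `Criterion` type, the `RUBRIC` contents, and the example strings are all hypothetical, and real rubric criteria are rarely reducible to substring checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One line of a rubric: a name plus a check that can be re-applied."""
    name: str
    check: Callable[[str], bool]


# Hypothetical rubric for a refund answer; the criteria are illustrative only.
RUBRIC = [
    Criterion("cites a source", lambda text: "source:" in text.lower()),
    Criterion("states the refund window", lambda text: "30 days" in text),
    Criterion("avoids absolute promises", lambda text: "guarantee" not in text.lower()),
]


def grade(output: str, rubric: list[Criterion]) -> dict[str, bool]:
    """Apply every criterion and report each result by name."""
    return {criterion.name: criterion.check(output) for criterion in rubric}


def passes(output: str, rubric: list[Criterion]) -> bool:
    """A result passes only if it meets every criterion."""
    return all(grade(output, rubric).values())


# A golden (blessed as correct) and a counterexample (plausible but wrong).
golden = "You can return the item within 30 days. Source: refund policy."
counterexample = "We guarantee a full refund at any time."

golden_report = grade(golden, RUBRIC)
counterexample_report = grade(counterexample, RUBRIC)
```

A unit test is the special case with one input and one expected output; the rubric applies the same standard to every result of a kind, and the golden and counterexample are what you use to check the rubric itself.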
The spec changes jobs as the loop stretches
So where does that leave the specification? I keep meeting two camps: one that wants to write everything down before the agent moves, and one that has declared specs dead. Both are answering a sizing question with a doctrine. The spec changes its job as the loop stretches:
| Loop mode | Role of the spec | Harness focus | Grading focus |
|---|---|---|---|
| Tight / Exploratory | Starter grip: intent, examples, non-goals | IDE, tests, UI, logs | Human judgment, small tests, diff feedback |
| Task / Implementation | Steering artifact: boundaries, acceptance, context | Reproducible setup, CI, PR review | Tests, acceptance criteria, review checklist |
| Feature / Delegation | Delegation contract: goal, constraints, risks, search space | Worktree, sandbox, traces, checkpoints | Outcome rubric, scenarios, counterexamples, human review |
| Dark factory | Executable spec in the production system | Isolated runtime, permissions, audit, rollback, observability | Automated graders, goldens, abort criteria, regression gates |
| Learning Loop | Feedback distillate for the next piece of work | Stored traces, error classes, skills, docs | Rework, production signals, drift, pattern learning |
Read the spec column top to bottom and you can watch the artifact change character: from a starter grip to a steering artifact to a contract to something executable, and finally to distilled feedback for the next round. Keep in mind, though:
The spec is useful when it structures the search space. It becomes harmful when it replaces the search space.
The spec should never be the beginning of a waterfall; it should be a living artifact that humans (and agents) iterate on. Zero-shotting is not a sport! Well, most times it’s not. And yet the pendulum is swinging back toward exactly that little waterfall right now: write everything down first, let the agent execute, and treat anyone who wants to feed learning back mid-run as undisciplined. Organizations rehearsed this movement during twenty years of doing agile without becoming agile, adopting the ceremony and dropping the feedback loop.
A spec that pins down every implementation detail has quietly turned the agent back into a typist, and you paid for a search process you never ran.
The reflex has a name by now. Thoughtworks tracks spec-driven development on the Radar, notably no further than Assess, warning that teams might “relearn a bitter lesson, that handcrafted detailed rules for AI ultimately do not scale”. Birgitta Böckeler sorts the practice into three levels of rigor: spec-first writes the spec up front and archives it once the code exists; spec-anchored keeps it in the repo, maintained with every change and validated against the implementation; and spec-as-source treats code as mere generated output of the spec, a future no tool delivers in production today. Spec-anchored is the close relative of this framework: a living, versioned artifact instead of one-way planning.
What this framework adds is a shift of attention from the input layer to the evaluation layer. Whichever rigor level you run, the artifacts humans iterate on in wide loops are the spec, the rubrics, the scenarios, the goldens, and the counterexamples, in short the grading material, while the agents iterate on the output itself. When you catch yourself patching the output instead of sharpening the grader, you are working on the wrong layer, and the loop will keep producing the thing you just fixed.
A rubric is not automatically truth
Here is the caveat I want printed in bold on every eval dashboard: writing a rubric does not make the rubric true. Good graders are engineering artifacts themselves, calibrated with counterexamples, error taxonomies, human review, and drift control. Skip that work and the agent will optimize against the grader you actually built, which usually means output that sounds plausible rather than output that is right. An LLM judge calibrated on nothing will happily wave through the statistical middle (slop), the bland average of everything the model has ever read, and slop walks in wearing a passing grade! A polished demo of the agent’s own work deserves the same skepticism: it makes your judgment cheaper to apply, while the grading substance still has to come from goldens, counterexamples, and scenarios.
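One way to make that calibration concrete is to run the grader over material whose verdict you already know. The sketch below is an assumption-laden illustration: `naive_judge` is a hypothetical stand-in for an uncalibrated LLM judge (no model is called), and the two example outputs are invented.

```python
from typing import Callable, Sequence

# A judge takes an output and returns True if it would pass it.
Judge = Callable[[str], bool]


def calibrate(
    judge: Judge,
    goldens: Sequence[str],
    counterexamples: Sequence[str],
) -> dict[str, float]:
    """Measure how often the judge accepts goldens and rejects counterexamples."""
    if not goldens or not counterexamples:
        raise ValueError("calibration needs at least one golden and one counterexample")
    accepted_goldens = sum(1 for output in goldens if judge(output))
    rejected_counterexamples = sum(1 for output in counterexamples if not judge(output))
    return {
        "golden_accept_rate": accepted_goldens / len(goldens),
        "counterexample_reject_rate": rejected_counterexamples / len(counterexamples),
    }


def naive_judge(output: str) -> bool:
    """Hypothetical uncalibrated judge: approves anything long enough to look thorough."""
    return len(output.split()) >= 8


goldens = ["Refunds are available within 30 days of delivery; see the refund policy, section 2."]
counterexamples = ["Refunds are always available at any time, no questions asked, for every customer."]

report = calibrate(naive_judge, goldens, counterexamples)
# report["golden_accept_rate"] is 1.0, report["counterexample_reject_rate"] is 0.0
```

The golden accept rate alone looks perfect here; only the counterexample shows that the judge rubber-stamps anything plausible, which is why the counterexamples carry so much of the calibration work.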
There is also a structural answer to the agent satisfying the letter of your grader while missing the point (reward hacking, if you want the ML name, the AI cousin of teaching to the test). StrongDM’s software factory shows what it looks like when you take it seriously (Simon Willison has a good writeup). Their internal charter reads
“Code must not be written by humans. Code must not be reviewed by humans,”
which is too radical for most organizations I work with, but the design vocabulary travels well. Their scenario holdouts keep the test sets outside the repo, so the agent cannot tune itself to pass an exam it can read, the same instinct that made machine-learning teams guard their test data for decades. And their satisfaction testing asks a second AI, across thousands of runs, how often the agent’s path through a scenario would actually leave the user happy.
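The arithmetic behind a satisfaction figure of that kind is simple, and a rough sketch may help. This is not StrongDM’s implementation: `satisfaction_rate`, `stub_agent`, and `stub_judge` are hypothetical names, the agent and the judge are deterministic stand-ins rather than model calls, and the scenario text is invented.

```python
from typing import Callable

# An agent run takes a scenario and a run index and returns a transcript.
AgentRun = Callable[[str, int], str]
# A judge takes the scenario and a transcript and says whether the user would be happy.
SatisfactionJudge = Callable[[str, str], bool]


def satisfaction_rate(
    run_agent: AgentRun,
    judge: SatisfactionJudge,
    scenario: str,
    runs: int,
) -> float:
    """Fraction of independent runs through one scenario that the judge accepts."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    satisfied = sum(
        1 for index in range(runs) if judge(scenario, run_agent(scenario, index))
    )
    return satisfied / runs


def stub_agent(scenario: str, index: int) -> str:
    """Stand-in agent: every fourth run takes a path that leaves the user stranded."""
    return "ticket closed without refund" if index % 4 == 0 else "refund issued"


def stub_judge(scenario: str, transcript: str) -> bool:
    """Stand-in for the second AI: satisfied only if the refund was issued."""
    return "refund issued" in transcript


# In the holdout setup, this scenario text would live outside the agent's reach.
held_out_scenario = "Customer requests a refund 29 days after delivery."
rate = satisfaction_rate(stub_agent, stub_judge, held_out_scenario, runs=100)
# rate is 0.75
```

The point of the rate over a single pass or fail is that one lucky run proves little; a scenario graded across many runs shows how reliably the agent gets there.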
What you do with cheap search
Once grading works, the economics get interesting. When search gets cheap, the move is to generate several variants and select, rather than pushing one ticket through faster. Best-of-n becomes a delivery discipline. It is basically what digital product development has always preached, except that building variants with the actual material (code) was slow and expensive. That’s why the Design Sprint exists.
But (and this is the part I see teams get wrong in the first month) variants are only valuable when they are born close to the decision and close to the system: near the target architecture, inside the design system, against real APIs and data structures, checkable through the same tests and scenarios that grade everything else. The anti-pattern is the gorgeous prototype with no system contact, a demo that nobody can merge and nobody learns from. A variant only becomes learning once a decision rides on it; without that, it is just more output.
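Best-of-n as described above reduces to "score every variant with the same grader, keep the top one". A minimal sketch, assuming the variants are plain strings and the grader is a hypothetical list of scenario checks (`SCENARIO_CHECKS` and the candidate texts are invented for illustration):

```python
from typing import Callable, Sequence

# Hypothetical scenario checks; in practice these would be the same tests and
# scenarios that grade everything else in the system.
SCENARIO_CHECKS: list[Callable[[str], bool]] = [
    lambda text: "30 days" in text,
    lambda text: "source:" in text.lower(),
    lambda text: len(text.split()) <= 30,
]


def score(candidate: str) -> float:
    """Fraction of scenario checks the candidate passes."""
    return sum(1 for check in SCENARIO_CHECKS if check(candidate)) / len(SCENARIO_CHECKS)


def best_of_n(
    candidates: Sequence[str],
    grader: Callable[[str], float],
) -> tuple[str, float]:
    """Grade every candidate and return the best one with its score."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    scored = [(grader(candidate), candidate) for candidate in candidates]
    # max() keeps the first candidate on ties, so ordering is deterministic.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


candidates = [
    "Refunds are possible.",
    "Refunds within 30 days.",
    "Refunds within 30 days. Source: refund policy.",
]
winner, winner_score = best_of_n(candidates, score)
# winner is the third candidate, winner_score is 1.0
```

The selection step is trivial; everything interesting sits in `score`, which is only as good as the grading material behind it.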
Which leaves the question I would ask your team before any tooling discussion:
If an agent handed you five solutions tomorrow, could you say which one is best, and could you say why in a form a machine could check next time?