Trust but Verify
Trust but Verify
The defining hazard of generative coding is plausible-but-wrong output. Diff looks right, names match the convention, tests pass — and a spec requirement is missing, or a contract is silently broken.
Trust-but-verify is the workflow's answer: an independent adversarial reviewer between implement and ship.
The principle
- Hallucinations don't announce themselves. You can't ask the same agent "did you do this correctly?" and trust the answer.
- An independent reviewer is cheap. One Opus invocation against a finished diff costs a fraction of a re-implementation.
Every workflow that ships AI-written code needs a verify phase. Not "we should code-review more" — a phase, in the pipeline, with a structured output and a pass/fail gate.
The phase
flowchart LR
Impl[/implement/] --> V[/verify/]
Spec & Plan & Tasks & Diff --> V
V --> Findings[BLOCKER / WARN / NOTE]
Findings --> Verdict{PASS / FAIL?}
Verdict -->|FAIL| Fix[Back to implement]
Verdict -->|PASS| Retro[/retrospective/]
Inputs: spec, plan, tasks, diff. Output: structured findings + verdict.
Five dimensions
| # | Dimension | What it catches |
|---|---|---|
| 1 | Spec divergence | Requirements not implemented |
| 2 | Missing tasks | Tasks marked [X] without matching diff content |
| 3 | Security | Auth, injection, secrets, scopes, logging |
| 4 | Correctness | Off-by-one, races, edge cases, swallowed errors, false idempotency claims |
| 5 | Contract breaks | Schema, struct, signature, response-shape changes that break callers |
Spec divergence catches the most common failure. Contract breaks catch the most expensive.
Severity
| Severity | Effect |
|---|---|
BLOCKER |
Must fix. Verdict = FAIL. |
WARN |
Real but non-blocking. Verdict still PASS if no BLOCKERs. |
NOTE |
Observation. Informational. |
Verdict math: BLOCKER == 0 → PASS.
The reviewer's prompt
The single most important property: look for what's missing, not validate what's present.
"You are an adversarial reviewer. Find missing items, contract breaks, security regressions, correctness bugs. Quote line ranges for every finding. Do not praise correctness — that is not your role."
Without this framing, the reviewer falls into the same trap as the implementer: reads code, sees plausible patterns, approves. Same model, same diff, different prompt → different catch-rate.
Example finding
### Spec divergence — FR-005 (hard delete on file removal)
**[BLOCKER]** Spec FR-005 requires that when a feature folder disappears,
all rows for that slug are removed. The diff implements file-level deletion
in `ScanFeature`, but `ScanAll` does not call `DeleteAEFeature` for slugs
that have disappeared. Quote: importer.go:50-60.
Composition
Verify is the last line. Earlier gates do the heavy lifting:
- Plan review catches design-level errors before code is written.
- Pre-fetched context prevents drift into unrelated files.
- Constitution rules block recurring mistakes by category.
Verify catches the residual.
Anti-patterns
- "We have tests, we don't need verify." Tests check the test's claims. Verify checks the spec's claims. Different.
- Skip when in a hurry. Same trap as The Illusion of Speed. Verify is faster than the bug-hunt cycle it replaces.
- Same model as implement. Lower yield. Opus for verify even when implement ran Sonnet.
Hallucinations don't announce themselves. The verifier is the announcement.