Trust but Verify

Updated Jun 9, 2026 History

Trust but Verify

The defining hazard of generative coding is plausible-but-wrong output. Diff looks right, names match the convention, tests pass — and a spec requirement is missing, or a contract is silently broken.

Trust-but-verify is the workflow's answer: an independent adversarial reviewer between implement and ship.

The principle

Hallucinations don't announce themselves. You can't ask the same agent "did you do this correctly?" and trust the answer.
An independent reviewer is cheap. One Opus invocation against a finished diff costs a fraction of a re-implementation.

Every workflow that ships AI-written code needs a verify phase. Not "we should code-review more" — a phase, in the pipeline, with a structured output and a pass/fail gate.

The phase

flowchart LR
    Impl[/implement/] --> V[/verify/]
    Spec & Plan & Tasks & Diff --> V
    V --> Findings[BLOCKER / WARN / NOTE]
    Findings --> Verdict{PASS / FAIL?}
    Verdict -->|FAIL| Fix[Back to implement]
    Verdict -->|PASS| Retro[/retrospective/]

Inputs: spec, plan, tasks, diff. Output: structured findings + verdict.

Five dimensions

#	Dimension	What it catches
1	Spec divergence	Requirements not implemented
2	Missing tasks	Tasks marked `[X]` without matching diff content
3	Security	Auth, injection, secrets, scopes, logging
4	Correctness	Off-by-one, races, edge cases, swallowed errors, false idempotency claims
5	Contract breaks	Schema, struct, signature, response-shape changes that break callers

Spec divergence catches the most common failure. Contract breaks catch the most expensive.

Severity

Severity	Effect
`BLOCKER`	Must fix. Verdict = FAIL.
`WARN`	Real but non-blocking. Verdict still PASS if no BLOCKERs.
`NOTE`	Observation. Informational.

Verdict math: BLOCKER == 0 → PASS.

The reviewer's prompt

The single most important property: look for what's missing, not validate what's present.

"You are an adversarial reviewer. Find missing items, contract breaks, security regressions, correctness bugs. Quote line ranges for every finding. Do not praise correctness — that is not your role."

Without this framing, the reviewer falls into the same trap as the implementer: reads code, sees plausible patterns, approves. Same model, same diff, different prompt → different catch-rate.

Example finding

### Spec divergence — FR-005 (hard delete on file removal)
**[BLOCKER]** Spec FR-005 requires that when a feature folder disappears,
all rows for that slug are removed. The diff implements file-level deletion
in `ScanFeature`, but `ScanAll` does not call `DeleteAEFeature` for slugs
that have disappeared. Quote: importer.go:50-60.

Composition

Verify is the last line. Earlier gates do the heavy lifting:

Plan review catches design-level errors before code is written.
Pre-fetched context prevents drift into unrelated files.
Constitution rules block recurring mistakes by category.

Verify catches the residual.

Anti-patterns

"We have tests, we don't need verify." Tests check the test's claims. Verify checks the spec's claims. Different.
Skip when in a hurry. Same trap as The Illusion of Speed. Verify is faster than the bug-hunt cycle it replaces.
Same model as implement. Lower yield. Opus for verify even when implement ran Sonnet.

Hallucinations don't announce themselves. The verifier is the announcement.