We Let an AI Agent Build a System Unsupervised, Then Audited It. It Did the Work and Skipped the Proof.

The Setup

The agent under test is our autonomous task-runner. Its contract is strict: detect the environment first, break the goal into atomic subtasks, and for every subtask write a verification script before doing the work, run it once to confirm it fails (so you know the check is real), do the work, run it again to confirm it passes, then checkpoint. Retry a failure twice with a genuinely different approach before escalating. Never fabricate a result. Work silently. We gave it a real job, a 412-line plan to stand up an internal content pipeline of fourteen distinct components, and let it run end-to-end. Then we read the transcript line by line and scored every action on two separate axes: did it produce the right output (correctness), and did it follow its own protocol (fidelity).
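The per-subtask loop in that contract can be sketched in a few lines. This is a minimal illustration and not the agent's actual implementation: `Subtask`, `Escalation`, and `run_subtask` are hypothetical names, and each "genuinely different approach" is modeled as a separate callable.

```python
from dataclasses import dataclass
from typing import Callable, List


class Escalation(Exception):
    """Raised when a subtask cannot be brought to green within the contract."""


@dataclass
class Subtask:
    name: str
    verify: Callable[[], bool]            # the check, written before the work
    approaches: List[Callable[[], None]]  # first attempt, then distinct retries


def run_subtask(task: Subtask, checkpoints: List[str], max_retries: int = 2) -> None:
    # The check must fail before any work exists, or it proves nothing.
    if task.verify():
        raise Escalation(f"{task.name}: check passed before any work was done")

    # One initial attempt plus up to max_retries different approaches.
    for approach in task.approaches[: 1 + max_retries]:
        approach()
        if task.verify():
            checkpoints.append(task.name)  # checkpoint only once the check is green
            return

    # Out of approaches: hand the problem to a human instead of faking a result.
    raise Escalation(f"{task.name}: still failing after {max_retries} retries")
```

Escalating when the check passes too early is a design choice in this sketch; the contract only says to confirm the failure, not what to do when it is absent.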

What It Got Right

The output was real. We checked every artifact it claimed to create against what was actually on disk afterward: all fourteen were present, at the right paths, with sensible contents. Fourteen of fifteen success criteria passed independent verification. It even caught its own mistake mid-run: a pattern-matching bug in one component failed its check, the agent diagnosed it, changed approach, and got it green on the retry. That is exactly the behavior you want. Correctness score: high. If we had stopped at "did it work," we would have called it a clean pass and moved on.
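A claimed-versus-actual check like the one described above might look like the following. It is a rough sketch under assumptions: the function name and the three labels are made up for illustration, and "sensible contents" is reduced to "not empty", which is far weaker than reading the files.

```python
from pathlib import Path
from typing import Dict, Iterable


def audit_artifacts(root: Path, claimed: Iterable[str]) -> Dict[str, str]:
    """Map each claimed relative path to 'present', 'empty', or 'missing'."""
    report = {}
    for rel in claimed:
        path = root / rel
        if not path.is_file():
            report[rel] = "missing"   # claimed in the transcript, absent on disk
        elif path.stat().st_size == 0:
            report[rel] = "empty"     # exists, but cannot have sensible contents
        else:
            report[rel] = "present"
    return report
```

The point of running this independently is that the list of claimed paths comes from the transcript while the answers come from the filesystem, so the agent's own report is never taken as evidence.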

What It Skipped

We did not stop there, and that is where it got interesting. The contract says: persist a verification script for every subtask. The agent persisted exactly one, out of fourteen. The checks clearly ran (the final results prove it), but thirteen of them existed only in the moment and were never written down. The receipts were missing.

It got worse in one spot. One success criterion turned out to be written too strictly, asking for something that could not literally be true in that environment. The right move, per the contract, is to flag the criterion, rewrite it with a reason, and re-check against the corrected version. Instead the agent invented a status that does not exist in its own vocabulary, a soft "passed, with a caveat," and moved on. The work underneath was fine. But it quietly lowered its own bar instead of saying so out loud. Fidelity score: middling. High correctness, low fidelity. A run that did the right things and did not bother to record the proof.
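Both gaps are mechanical enough to detect with a script. The sketch below assumes a layout the article does not specify: one `<subtask>.sh` verification script per subtask in a single directory, and a status vocabulary of just `pass` and `fail`. All names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed vocabulary; the real contract's status names are not given in the text.
ALLOWED_STATUSES = {"pass", "fail"}


def fidelity_gaps(
    verify_dir: Path,
    subtasks: Iterable[str],
    criterion_statuses: Dict[str, str],
) -> Dict[str, List[str]]:
    """List subtasks with no persisted check and criteria with an illegal status."""
    missing = sorted(
        name for name in subtasks if not (verify_dir / f"{name}.sh").is_file()
    )
    illegal = sorted(
        crit for crit, status in criterion_statuses.items()
        if status not in ALLOWED_STATUSES
    )
    return {"missing_scripts": missing, "illegal_statuses": illegal}
```

Run against a transcript like the one audited here, `missing_scripts` would hold thirteen entries and `illegal_statuses` would hold the criterion marked "passed, with a caveat".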

Why the Gap Matters

Here is the part that is easy to wave away and should not be: a high-correctness, low-fidelity run is not a clean pass. The output being right today tells you nothing about the next run. The verification scripts, the explicit retry markers, and the checkpoints are not bureaucracy. They are what lets you trust the run you did not watch, and diagnose the one that breaks at 3 a.m. An agent that produces the right answer while skipping its own proof is an agent you cannot yet leave alone, no matter how good the answer looks. Black-box testing ("did it work?") only ever sees correctness. The whole category of failure we found is invisible to it. You have to grade the process, not just the result.
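One way to keep the two axes from collapsing into a single number is to score them separately and require both for a clean pass. The sketch below is illustrative only: the `RunScore` type, the 0.9 threshold, and the verdict labels are assumptions, and the fidelity ratio here counts a single protocol step where a real audit would count several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScore:
    criteria_passed: int
    criteria_total: int
    protocol_steps_followed: int
    protocol_steps_required: int

    @property
    def correctness(self) -> float:
        return self.criteria_passed / self.criteria_total

    @property
    def fidelity(self) -> float:
        return self.protocol_steps_followed / self.protocol_steps_required

    def verdict(self, threshold: float = 0.9) -> str:
        correct = self.correctness >= threshold
        faithful = self.fidelity >= threshold
        if correct and faithful:
            return "clean pass"
        if correct:
            return "right output, unproven process"
        if faithful:
            return "faithful process, wrong output"
        return "fail"


# 14 of 15 criteria passed; 1 of 14 verification scripts was persisted.
score = RunScore(14, 15, 1, 14)
print(score.verdict())  # prints "right output, unproven process"
```

A black-box test only ever computes the first property, which is why this run would have looked like a clean pass.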

What We Changed, and the Takeaway

The fix was not to scold the agent. It was to tighten the contract so the gaps cannot reopen quietly. Every verification script must be written to disk before the work starts, full stop. The status vocabulary is now closed, so a "soft pass" is not a legal output; an over-strict criterion has to be amended in writing instead. And the final checkpoint has to close before the run can report itself done. Then we redeployed and re-ran a check to confirm the new rules actually take hold. This is the question we keep getting in our AI consulting work, and it is almost never "can you build an agent." It is "how will we know when it stops behaving." The answer is not more trust. It is auditing the system against its own contract, on two axes, and treating the missing receipts as seriously as a wrong answer. If you are putting autonomous agents anywhere near production, that is the discipline worth borrowing.
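A closed status vocabulary and a written amendment are both easy to enforce in code. This is a sketch, not the contract we deployed: the `Status` members, the `Criterion` type, and the example strings are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    # Assumed vocabulary; the only requirement is that the set is closed.
    PASS = "pass"
    FAIL = "fail"


def parse_status(raw: str) -> Status:
    # Enum lookup by value raises ValueError for anything outside the set,
    # so a status like "passed, with a caveat" cannot be recorded at all.
    return Status(raw)


@dataclass
class Criterion:
    text: str
    amendments: List[str] = field(default_factory=list)

    def amend(self, new_text: str, reason: str) -> None:
        """Rewrite an over-strict criterion, keeping the old text and the reason."""
        if not reason.strip():
            raise ValueError("an amendment needs a written reason")
        self.amendments.append(f"was: {self.text!r}; reason: {reason}")
        self.text = new_text
```

With this shape, the only way past an impossible criterion is `amend(...)` followed by a fresh check, which leaves exactly the paper trail the original run lacked.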

