We Let an AI Agent Build a System Unsupervised, Then Audited It. It Did the Work and Skipped the Proof.
We run agents that take a goal and work unsupervised: decompose it into steps, write a check for each step, retry what fails, and report back when it is done. The appeal is obvious. Hand over a job, walk away, come back to a finished thing. The risk is just as obvious: when nobody is watching, how do you know the agent actually did what it said? "It reported success" is not the same as "it succeeded," and neither is the same as "it followed the process you would want if you were standing over its shoulder." So we did something we would recommend to anyone running agents in production. We audited one. Not the prompt, not the design on paper, but an actual run, action by action, against the contract the agent is supposed to obey.
By Ivaylo Tsvetkov, Co-Founder

The Setup
The agent under test is our autonomous task-runner. Its contract is strict: detect the environment first, break the goal into atomic subtasks, and for every subtask write a verification script before doing the work, run it once to confirm it fails (so you know the check is real), do the work, run it again to confirm it passes, then checkpoint. Retry a failure twice with a genuinely different approach before escalating. Never fabricate a result. Work silently. We gave it a real job, a 412-line plan to stand up an internal content pipeline of fourteen distinct components, and let it run end-to-end. Then we read the transcript line by line and scored every action on two separate axes: did it produce the right output (correctness), and did it follow its own protocol (fidelity).
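The per-subtask loop in that contract can be sketched in a few lines. This is a minimal illustration, not the task-runner's actual code: the names `Subtask`, `RunLog`, and `run_subtask` are hypothetical, and escalating when a check passes before any work is done is our assumption about how "confirm it fails" would be enforced.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Subtask:
    name: str
    check: Callable[[], bool]             # verification: True once the subtask is done
    approaches: List[Callable[[], None]]  # first attempt, then genuinely different retries


@dataclass
class RunLog:
    events: List[str] = field(default_factory=list)


def run_subtask(task: Subtask, log: RunLog, max_retries: int = 2) -> str:
    # The check must fail before any work happens, otherwise it proves nothing.
    if task.check():
        log.events.append(f"{task.name}: check passed before work, check is not trustworthy")
        return "escalated"
    log.events.append(f"{task.name}: check fails as expected")

    # One initial attempt plus up to max_retries retries, each a different approach.
    for attempt, approach in enumerate(task.approaches[: max_retries + 1], start=1):
        approach()
        if task.check():
            log.events.append(f"{task.name}: passed on attempt {attempt}, checkpoint")
            return "passed"
        log.events.append(f"{task.name}: attempt {attempt} failed")

    return "escalated"
```

Every branch writes to the log, so the record of what happened exists whether or not the subtask succeeds.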
What It Got Right
The output was real. We checked every artifact it claimed to create against what was actually on disk afterward: all fourteen were present, at the right paths, with sensible contents. Fourteen of fifteen success criteria passed independent verification. It even caught its own mistake mid-run: a pattern-matching bug in one component failed its check, the agent diagnosed it, changed approach, and got it green on the retry. That is exactly the behavior you want. Correctness score: high. If we had stopped at "did it work," we would have called it a clean pass and moved on.
What It Skipped
We did not stop there, and that is where it got interesting. The contract says: persist a verification script for every subtask. The agent persisted exactly one, out of fourteen. The checks clearly ran, the final results prove it, but thirteen of them existed only in the moment and were never written down. The receipts were missing. It got worse in one spot. One success criterion turned out to be written too strictly, asking for something that could not literally be true in that environment. The right move, per the contract, is to flag the criterion, rewrite it with a reason, and re-check against the corrected version. Instead the agent invented a status that does not exist in its own vocabulary, a soft "passed, with a caveat," and moved on. The work underneath was fine. But it quietly lowered its own bar instead of saying so out loud. Fidelity score: middling. High correctness, low fidelity. A run that did the right things for reasons it did not bother to record.
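A missing-receipts check of this kind is easy to automate. The sketch below assumes a hypothetical layout in which each subtask's verification script is saved as `<subtask>.check` in one directory; the function name, the suffix, and the layout are illustrative, not the agent's real conventions.

```python
from pathlib import Path
from typing import Iterable, List


def missing_receipts(
    subtasks: Iterable[str], checks_dir: Path, suffix: str = ".check"
) -> List[str]:
    """Return the subtasks that have no persisted verification script on disk."""
    return [
        name
        for name in subtasks
        if not (checks_dir / f"{name}{suffix}").is_file()
    ]
```

Run against a directory holding one script for fourteen subtasks, it would return the other thirteen names, which is the gap described above.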
Why the Gap Matters
Here is the part that is easy to wave away and should not be: a high-correctness, low-fidelity run is not a clean pass. The output being right today tells you nothing about the next run. The verification scripts, the explicit retry markers, the checkpoints, those are not bureaucracy. They are what lets you trust the run you did not watch, and diagnose the one that breaks at 3am. An agent that produces the right answer while skipping its own proof is an agent you cannot yet leave alone, no matter how good the answer looks. Black-box testing, "did it work?", only ever sees correctness. The whole category of failure we found is invisible to it. You have to grade the process, not just the result.
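One way to keep the two axes from collapsing into a single grade is to score them separately and require both to clear the bar. The sketch below is illustrative only: `AxisScore`, `verdict`, the 0.9 threshold, and the verdict labels are all assumptions, and a real fidelity score would count more than persisted scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScore:
    met: int    # items that satisfied the axis
    total: int  # items that were graded

    @property
    def ratio(self) -> float:
        return self.met / self.total if self.total else 0.0


def verdict(correctness: AxisScore, fidelity: AxisScore, threshold: float = 0.9) -> str:
    correct_ok = correctness.ratio >= threshold
    faithful_ok = fidelity.ratio >= threshold
    if correct_ok and faithful_ok:
        return "clean pass"
    if correct_ok:
        return "correct but unproven"  # right output, missing receipts
    if faithful_ok:
        return "well documented but wrong"
    return "fail"
```

Fed only the two counts reported above, `verdict(AxisScore(14, 15), AxisScore(1, 14))` returns "correct but unproven" rather than a pass.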
What We Changed, and the Takeaway
The fix was not to scold the agent. It was to tighten the contract so the gaps cannot reopen quietly: every verification script must be written to disk before the work starts, full stop; the status vocabulary is now closed, so a "soft pass" is not a legal output, and an over-strict criterion has to be amended in writing instead; and the final checkpoint has to close before the run can report itself done. Then we redeployed and re-ran a check to confirm the new rules actually take effect. This is the question we keep getting in our AI consulting work, and it is almost never "can you build an agent." It is "how will we know when it stops behaving." The answer is not more trust. It is auditing the system against its own contract, on two axes, and treating the missing receipts as seriously as a wrong answer. If you are putting autonomous agents anywhere near production, that is the discipline worth borrowing.
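A closed status vocabulary is straightforward to enforce in code. In this minimal sketch the three status names are placeholders, since the contract's actual vocabulary is not listed here, and `parse_status` is a hypothetical helper.

```python
from enum import Enum


class Status(Enum):
    # Placeholder vocabulary; the real contract's statuses may differ.
    PASSED = "passed"
    FAILED = "failed"
    ESCALATED = "escalated"


def parse_status(raw: str) -> Status:
    """Accept only statuses in the closed vocabulary; reject everything else."""
    try:
        return Status(raw.strip().lower())
    except ValueError:
        raise ValueError(
            f"illegal status {raw!r}: amend the criterion in writing instead"
        ) from None
```

With this in place, a report of "passed, with a caveat" raises an error instead of slipping through as a softer kind of pass.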