The hard part of building AI agents is not getting a polished output.
It is getting the system to say:
“I don’t know.” “This source is weak.” “These agents disagree.” “This section failed.” “This claim needs human review.”
This is Gate #4 of the ten an AI agent has to clear before it belongs near a regulated decision. Post #1 laid out the checklist. Post #2 was provenance, every claim knowing where it came from. Post #3 was per-claim confidence, asking how sure you are one claim at a time. This one is the gate underneath all of them: how the system behaves when something goes wrong. Because a confidence score you can trust and a provenance chain you can follow are both worthless if the system can fail without telling you.
The report that looked complete
In my multi-agent due-diligence pipeline, five specialist agents research a company in parallel and a synthesis agent reconciles their findings. The early version had a flaw I did not notice until I went looking for it: it failed silently.
If one of the five agents timed out, the synthesis agent did its job anyway. It reconciled the four sets of findings it did have, wrote a clean executive summary, and handed me a report with a financial section, a risk section, and a recommendation. It looked complete. It had no hole in it where the fifth agent’s work should have been. Nothing on the surface said one fifth of this analysis never ran.
That is the most dangerous shape a failure can take. Not a crash. Not gibberish. A document that looks finished and is quietly missing a quarter of its evidence. In a demo you would never catch it, because the demo is the four agents that worked. The first time it costs anything is the first time someone makes a decision on a report whose silence they mistook for coverage.
Silence is not the same as a finding
The whole point of the earlier gates was to stop the system from passing off a guess as a fact. A silent failure does something worse. It passes off an absence as a completed analysis. The report does not say something false. It just does not say that it is incomplete, and a reader has no way to tell the difference between “this risk section is thin because there is little risk” and “this risk section is thin because the agent that fills it died.”
Those two reports look identical. They should never read the same way. The difference between them is the entire difference between a finding and a blank space wearing a finding’s clothes.
So the real engineering work on this project became less about adding features and more about adding accountability. Six things, in order of how much they changed the system:
- Make failures loud. A failed agent, a timed-out tool call, a section that came back empty — every one of these now shows up in the report, marked, not swallowed into a clean-looking summary.
- Trace every LLM and tool call. Every call is logged and replayable, so a failure is something I can reconstruct rather than guess at after the fact.
- Capture cost, tokens, sources, and run history. The run accounts for what it spent and what it touched, so an incomplete run is visible as an incomplete run.
- Bound confidence by source quality. A claim cannot be more confident than the evidence under it, which means a thin section cannot dress itself up.
- Surface gaps, conflicts, and unanswered questions. The things the system could not resolve are output, not hidden.
- Require evidence before confidence. No source, no score.
Making the gap visible
The fix for the silent timeout was not to make the fifth agent more reliable. Agents will always fail sometimes — that is a given, not a bug to be patched away. The fix was to make its failure impossible to miss.
Now, when an agent does not return, the report says so. The section it would have written carries a marker that the analysis did not complete, not a plausible paragraph stitched from whatever else was lying around. The synthesis agent is no longer allowed to quietly paper over a missing input. A four-of-five run is a valid thing to produce — but it has to announce that it is four of five.
This is the same principle as the gap from the provenance post, pointed at a different failure. There, the gap was a claim the system could not verify, and the right move was to leave the gap and label it rather than fill it with a guess. Here, the gap is a whole section that did not run, and the right move is the same. A known unknown is safe. A reader can see it, weigh it, decide to go find the answer themselves. A hidden unknown is the one that hurts, because it spends its credibility on you before you ever learn it was empty.
The production checklist underneath all of this
This is the lesson I keep coming back to from regulated financial services. A useful agent is not just one that produces an answer. It is one that can account for itself when the answer may be incomplete, uncertain, or wrong.
For teams putting agents into high-trust workflows, I think that is the real production checklist, and none of it is about the quality of the prose:
- Can you audit it?
- Can you trace it?
- Can you challenge it?
- Can you see what failed?
- Can you tell where confidence should stop?
Every one of those questions is about the system’s behavior at its edges — at the moments it does not know, cannot verify, or did not finish. A demo never visits those edges. Production lives there.
Where this fits
Graceful failure is Gate #4 of ten, and it is the one that holds up the others. Audit trail lets you reconstruct a run. Provenance ties each claim to a source. Per-claim confidence tells you how far to trust each one. But all three assume the system will tell you when something broke. Take that away and a perfect audit trail just faithfully records a report that lied by looking complete.
The throughline of the series holds here too. A demo optimizes for looking right, and looking complete is the most seductive form of looking right, because completeness reads as diligence. The fix is not a more finished-looking report. It is a report honest enough to show you its own holes — the failed agent, the weak source, the section that never ran — so the silence in it is something you chose to accept, not something the system hid from you.
The project is a personal one, built in the open on public data, and is a demonstration, not investment advice. Code and a sample report are in the GitHub repository. I am curious how others are handling calibrated confidence, graceful failure, and auditability in agent systems.