That agent demo that got applause at your last all-hands?
It would not survive a compliance review.
I build agentic AI platforms in regulated financial services. Here’s the uncomfortable truth I’ve learned: getting the agent to work was never the hard part. The hard part was getting compliance, risk, and audit to let it touch real customer workflows.
Before an agent ships to regulated production, it needs to pass all ten of these:
- Audit trail — every LLM call and tool call logged and replayable, not just app logs
- Provenance — every claim in the output traceable to a source
- Per-claim confidence — calibrated scores as an output guardrail, not vibes
- Human approval gates — before any consequential action, a human signs off
- Safe action boundaries — an explicit allowlist of what the agent can touch
- Rollback path — every write action must be reversible
- Cost and token ceilings — hard stops, because runaway loops are real
- Ground-truth evals — run against a golden dataset before every release
- Defined failure behavior — what the agent does when it doesn’t know
- Named ownership — one accountable human for the agent’s output
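To make a few of these concrete, here is a minimal sketch of how four of them (audit trail, approval gates, action allowlist, token ceiling) can sit in one wrapper around tool calls. Everything in it is an assumption for illustration: `GuardedToolRunner`, `PolicyViolation`, the `approve` hook, and the two example tools are hypothetical names, not any framework's API. The hash chain only makes tampering detectable in this in-memory list; a real audit trail would also need durable, access-controlled storage and would have to cover LLM calls, which this sketch does not.

```python
from __future__ import annotations

import hashlib
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable


class PolicyViolation(Exception):
    """Raised when a call would break one of the configured guardrails."""


@dataclass
class GuardedToolRunner:
    tools: dict[str, Callable[..., Any]]   # explicit allowlist: name -> callable
    needs_approval: set[str]               # tools treated as consequential
    approve: Callable[[str, dict], bool]   # human sign-off hook (hypothetical)
    token_ceiling: int                     # hard stop for the whole run
    tokens_used: int = 0
    audit_log: list[dict] = field(default_factory=list)

    def _record(self, event: str, **details: Any) -> None:
        # Each entry stores the previous entry's hash, so edits break the chain.
        prev_hash = self.audit_log[-1]["hash"] if self.audit_log else ""
        entry = {"ts": time.time(), "event": event, "details": details, "prev": prev_hash}
        payload = json.dumps(entry, sort_keys=True, default=str)
        entry["hash"] = hashlib.sha256(payload.encode()).hexdigest()
        self.audit_log.append(entry)

    def charge_tokens(self, n: int) -> None:
        # Refuse the spend before it happens instead of reporting it afterwards.
        if self.tokens_used + n > self.token_ceiling:
            self._record("token_ceiling_hit", requested=n, used=self.tokens_used)
            raise PolicyViolation("token ceiling exceeded")
        self.tokens_used += n
        self._record("tokens_charged", n=n, used=self.tokens_used)

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        if name not in self.tools:
            self._record("tool_denied", tool=name, reason="not on allowlist")
            raise PolicyViolation(f"tool {name!r} is not on the allowlist")
        if name in self.needs_approval and not self.approve(name, kwargs):
            self._record("tool_denied", tool=name, reason="approval refused")
            raise PolicyViolation(f"approval refused for {name!r}")
        result = self.tools[name](**kwargs)
        self._record("tool_called", tool=name, args=kwargs, result=result)
        return result


# Example wiring with stand-in tools and an approver that always says no.
runner = GuardedToolRunner(
    tools={
        "lookup_balance": lambda account_id: 100,
        "issue_refund": lambda account_id, amount: "refunded",
    },
    needs_approval={"issue_refund"},
    approve=lambda tool, args: False,  # stand-in for a real review queue
    token_ceiling=1000,
)
```

With this wiring, `runner.call_tool("lookup_balance", account_id="A1")` goes through, while `issue_refund` and any tool missing from the dict raise `PolicyViolation`. Denials are logged as well as successes, because a reviewer will want to see what the agent tried to do, not only what it did.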
Most demos I’ve reviewed score 2 out of 10.
Regulated production requires 10 out of 10.
The gap between those two numbers is where agentic AI projects go to die — and where the most valuable engineering work is happening right now.
If you’re building agents in a regulated environment: which of these ten is the hardest at your org?
This is the first in a series on making agentic AI survive regulated production. Next up: per-claim confidence, the output guardrail almost nobody implements.