Frontier models are failing one in three production attempts — and getting harder to audit

AI agents are now embedded in real enterprise workflows, and they're still failing roughly one in three attempts on structured benchmarks.