Distinguish Robust Capabilities from Brittle Ones
The gap between how AI presents and how it performs is not a bug. It is an architectural feature — and it mimics human signals of competence so precisely that your normal defences do not work.
Consider what large language models are actually doing. They generate outputs through next-token prediction optimised for distributional plausibility — a process that produces fluent, confident-sounding text whether or not the content is factually accurate. Fluency and reliability are formally separable properties of these systems: a model can be simultaneously maximally fluent and substantively wrong. This is not a defect awaiting correction. It is the mechanism. The system is optimising for plausibility, and we are pattern-matching plausibility to trustworthiness, because for most of human history that heuristic worked.
— Farkas et al., arXiv:2509.13345, 2025
A robust capability is one where the model's extrapolation tends to be well-calibrated — reliably predictable across variations in input. A brittle capability is one where performance is high within the evaluation envelope and collapses outside it, often without any signal to the user that this is happening. That last part is the important one. Traditional software fails loudly. Brittle AI fails fluently — it produces a confident, well-formed wrong answer, which is the one failure mode your quality controls were not designed to catch.
Even systems that excel at complex tasks may generate non-existent citations, biographies, or facts (hallucination). Performance can be inconsistent: accuracy on mathematics problems can decrease significantly when irrelevant information is inserted into the problem description. This brittleness extends to multimodal capabilities, where models often have low performance on spatial reasoning tasks such as basic counting of objects in a scene.
Crucially, expert human oversight can mitigate some of these risks, but introduces a corresponding danger of over-reliance, where humans accept AI-generated outputs without adequate verification.
DOI: 10.48550/arXiv.2602.21012 · Bengio et al. (2026), International AI Safety Report 2026
A further dimension of brittleness concerns safety measures themselves. Research from Carnegie Mellon University demonstrates that widely used post-training alignment methods (the techniques vendors use to make models "safe") can themselves be brittle. A model can pass safety evaluations under normal testing conditions while failing when the input distribution shifts slightly. Unsafe behaviours embedded during pretraining are difficult to remove through post-hoc alignment.
Safety measures applied after model development (through RLHF, red-teaming, and guardrails) are often reactive, brittle, expensive, and struggle to address unknown or emerging risks. The dominant "Make AI Safe" paradigm addresses a model after it is built; the consequence is that safety is a thin layer over a system whose core behaviours were not designed for safety.
The leading models — GPT-5, Claude 4, Gemini 2.5 — demonstrate capability scores that consistently outpace safety scores. This structural gap has not closed as capabilities have increased.
See: arXiv:2509.06786 (R²AI); IBM Think: What is AI Safety?, May 2026
Cao and Yousefzadeh (2023) show that AI models extrapolate outside their training distribution frequently and without notifying users or stakeholders. They argue that whether a model has extrapolated for a given input should be a mandatory part of any explainability system: the model should declare when it is operating beyond its training support, just as a human expert would caveat advice that goes beyond their direct knowledge. A doctor who gives confident opinions outside their specialty without flagging the limitation is not being helpful. They are being dangerous. The fact that AI systems do this by default, without any caveat, is a governance problem masquerading as a technical one.
AI models extrapolate outside their range of familiar data, frequently and without notifying the users and stakeholders. Knowing whether a model has extrapolated is a fundamental insight that should be included in explaining AI models in favour of transparency, accountability, and fairness. The right to AI explainability, as consolidated in the research community and in policy, is incomplete without this component.
DOI: 10.1177/20539517231169731 · Cao, X. & Yousefzadeh, R. (2023), Big Data & Society, 10(1)

| Capability | Stability signal | Assessment |
|---|---|---|
| Pattern extraction on familiar structured data | Consistent across similar inputs; verifiable against source | Robust |
| Summarisation of provided text (with grounding) | Performance measurable and stable | Robust |
| Factual recall under pressure or topic drift | Fluent but unverifiable; hallucination-prone | Brittle |
| Mathematical reasoning with irrelevant insertions | Degrades under perturbation (IASR 2026) | Brittle |
| Spatial/counting tasks (multimodal) | Low baseline; documented failure mode | Brittle |
| Safety behaviour under adversarial prompting | Post-hoc alignment is itself brittle (CMU/IBM) | Brittle |
| Consistent probabilistic outputs | LLMs fail to follow target distributions (Yang & Zhang 2025) | Brittle |
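
The "degrades under perturbation" row can be checked empirically rather than taken on trust. What follows is a minimal sketch, assuming a hypothetical `ask_model` callable that wraps whichever system is under evaluation. The toy problems, the distractor sentence, and the stub model are illustrative only and are not drawn from any cited study.

```python
from typing import Callable, Sequence

def accuracy(ask_model: Callable[[str], str],
             problems: Sequence[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected_answer) pairs answered exactly right."""
    if not problems:
        raise ValueError("need at least one problem")
    correct = sum(ask_model(prompt).strip() == expected
                  for prompt, expected in problems)
    return correct / len(problems)

def perturbation_gap(ask_model: Callable[[str], str],
                     problems: Sequence[tuple[str, str]],
                     distractor: str) -> dict[str, float]:
    """Compare accuracy on clean prompts with accuracy after appending
    an irrelevant sentence to each prompt."""
    perturbed = [(f"{prompt} {distractor}", expected)
                 for prompt, expected in problems]
    clean_acc = accuracy(ask_model, problems)
    perturbed_acc = accuracy(ask_model, perturbed)
    return {"clean": clean_acc,
            "perturbed": perturbed_acc,
            "gap": clean_acc - perturbed_acc}

# Hypothetical stand-in for a real model: it answers correctly only when
# the prompt matches a known question exactly, so any distractor breaks it.
def brittle_stub(prompt: str) -> str:
    answers = {"What is 2 + 3?": "5", "What is 7 - 4?": "3"}
    return answers.get(prompt, "unsure")

PROBLEMS = [("What is 2 + 3?", "5"), ("What is 7 - 4?", "3")]
DISTRACTOR = "Note that the shop closes at 6 pm."

result = perturbation_gap(brittle_stub, PROBLEMS, DISTRACTOR)
print(result)
```

A large gap between the clean and perturbed scores suggests the capability sits on the brittle side of the table for your inputs. In practice the problems and distractors should come from your own deployment data, not from a public benchmark.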
Interrogate Vendor Claims with Rigour
Vendors are not deceiving you. They are showing you the results that make them look best. The problem is structural — and the solution is methodological rigour from the buyer, not moral expectations of the seller.
A vendor benchmark is an experimental result. The question is whether that result has internal validity (the benchmark measures what it claims to measure, rather than something else that correlates with it) and external validity (it generalises to your actual deployment conditions). Both are routinely violated. This is not unique to AI — any industry with significant information asymmetry between seller and buyer produces this dynamic. The appropriate response is not cynicism about vendors. It is methodological rigour from the buying side.
The internal validity of AI benchmarks is systematically compromised by what researchers call the "Clever Hans" problem — named after the horse that appeared to do arithmetic but was actually reading its trainer's body language. Models appear to solve tasks by learning spurious statistical features of the test set rather than the underlying capability being measured. Schlegel et al. (2024) find that simple surface features of benchmark items predict LLM performance — meaning high scores can be achieved without the model possessing the capacity the benchmark purports to test. The horse passes the exam without learning the subject.
AI benchmark internal validity requires that observed performance be attributable to the specific capability under test, not to other unrelated capabilities, systematic errors, or biases in the benchmark data. This condition is routinely unmet. Simple features of benchmark items predict LLM answers with significant accuracy, indicating that benchmark performance is partly a measure of surface-pattern exploitation rather than genuine capability.
External validity — how benchmark results generalise to real-world performance — has received even less scrutiny. A model that scores 95% on a benchmark can still fail specific user segments in production because test sets rarely capture the full complexity of live environments.
arXiv: 2410.11672 · Schlegel et al. (2024), Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers
A KPMG survey found that most organisations completing AI evaluations rely solely on vendor-supplied benchmarks and documentation. Gartner (2024) cites misalignment between vendor claims and production performance as a primary cause of enterprise AI deployment failure. Think about what a vendor-supplied benchmark actually is: a restaurant showing you photographs of its best dishes. The photographs are real. The dish exists. The question is whether it will be on the menu when you arrive, whether the kitchen can produce it consistently, and whether your particular dietary requirements were accounted for in the recipe. None of that is captured in the photograph.
A formal affordance analysis of OpenAI's Preparedness Framework Version 2 (April 2025) found that the framework: (1) requests evaluation of only a small minority of AI risks; (2) encourages deployment of systems with capabilities for unintentionally enabling "severe harm" (defined as >1,000 deaths or >$100B in damages) at the "Medium" risk level; and (3) allows the CEO to deploy even more dangerous capabilities, especially if other companies do so. This was not a selective reading — it is what the framework formally permits, as a matter of logical structure.
The finding is generalisable: vendor safety frameworks, as currently published, are voluntary, largely unaudited, and structured to allow deployment to proceed. They do not constitute independent safety assurance.
arXiv: 2509.24394 · Affordance analysis of OpenAI Preparedness Framework V2 (2025)
These responses indicate that a vendor cannot or will not substantiate their claims under scientific standards:
- "Our benchmark results speak for themselves" — without specifying which benchmark, on what data, under what conditions, or what failure rate is tolerated
- "We use state-of-the-art safety techniques" — without describing the techniques, their evaluation methodology, or whether they were tested pre- or post-mitigation
- "Accuracy is 97%" — without stating the base rate, the distribution of the 3%, or whether errors are randomly or systematically distributed
- "Human oversight is built in" — without defining what a human reviewer can detect, given the opacity of the model's reasoning
- "This is a foundational model question, not ours to answer" — when asked about failure modes of the specific deployment you are evaluating
Treat these answers as evidence, not reassurance. Write them into the procurement record alongside the accuracy claims at signature. Re-test against them quarterly. Cite them in any subsequent dispute. This is not bureaucracy for its own sake — it is the only mechanism by which you can tell, six months from now, whether the system you deployed is still the system you evaluated. AI models update. Vendor terms shift. The baseline you relied on at procurement becomes invisible unless you made it explicit at the time.
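One way to make the baseline explicit is to store the vendor's claim and your own measurement as a dated record, then compare each quarterly re-test against it. The sketch below is a minimal illustration under stated assumptions: the field names, the 0.02 tolerance, the version strings, and the accuracy figures are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class BaselineRecord:
    """What was claimed and measured at signature (fields are illustrative)."""
    vendor_claim: str
    model_version: str
    measured_accuracy: float  # measured by the buyer on its own test set
    test_set_id: str          # identifies the frozen test set that was used
    recorded_on: date

@dataclass(frozen=True)
class RetestResult:
    model_version: str
    measured_accuracy: float
    tested_on: date

def compare_to_baseline(baseline: BaselineRecord,
                        retest: RetestResult,
                        tolerance: float = 0.02) -> list[str]:
    """Return human-readable findings; an empty list means no drift was found."""
    findings = []
    if retest.model_version != baseline.model_version:
        findings.append(
            f"model version changed: {baseline.model_version} -> {retest.model_version}")
    drop = baseline.measured_accuracy - retest.measured_accuracy
    if drop > tolerance:
        findings.append(
            f"accuracy dropped by {drop:.3f} (tolerance {tolerance:.3f})")
    return findings

baseline = BaselineRecord(
    vendor_claim="Accuracy is 97%",
    model_version="v1.0",
    measured_accuracy=0.94,
    test_set_id="support-tickets-frozen",
    recorded_on=date(2025, 1, 15),
)
retest = RetestResult(model_version="v1.2", measured_accuracy=0.90,
                      tested_on=date(2025, 4, 15))

findings = compare_to_baseline(baseline, retest)
for finding in findings:
    print(finding)
```

The record is frozen on purpose: the value of a baseline lies in the fact that nobody can quietly revise it after signature.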
Surface the Assumptions Your Strategy Rests On
Strategy documents describe what the technology will do. What they rarely specify is the conditions under which it won't. That omission is not accidental — it is the part that would make the strategy harder to approve.
There is something deeply human about the AI strategy document. It is a form of collective self-reassurance — a document that says "we have thought about this" in a way that satisfies governance committees without necessarily satisfying the question. The claim "this system will accurately process customer support enquiries" is, in the language of science, a hypothesis about future experimental conditions. Like any hypothesis, it has a scope (the conditions under which it holds) and a failure surface (the conditions under which it breaks down). Strategy documents routinely specify the former and omit the latter — not out of negligence but because specifying failure surfaces is psychologically uncomfortable and politically inconvenient. It is much easier to get a strategy approved if the downside scenarios are left implicit.
Several hidden assumptions recur across enterprise AI strategies and deserve explicit audit:
The fluency-reliability assumption. The most pervasive error is assuming that a system that sounds authoritative is authoritative. This is the same error we make with human experts — we treat confident delivery as evidence of sound reasoning, and we are wrong often enough that entire fields (medicine, finance, law) have developed formal second-opinion protocols to compensate for it. Hallucination in LLMs is not a bug awaiting a fix; it is a structural property of probabilistic token prediction. The IAPP (2025) states plainly that the capacity to generate fabricated information is "not just a failure of learning specific data, but a characteristic tied to the model's operational mechanics." It cannot be trained away. The next model version will not solve it. Any strategy that depends on this assumption being wrong is load-bearing on a foundation that does not exist.
Hallucinations in LLMs cannot be entirely "fixed" or "trained away" because they arise from the model's operational mechanics and its inherent limits in mapping the vast space of language and knowledge. This challenges the assumption widespread in strategy documents that hallucination is a temporary limitation of current models. It is a characteristic of the probabilistic architecture.
iapp.org · Hallucinations in LLMs, July 2025
The benchmark-production equivalence assumption. Strategy documents that cite benchmark performance as justification for deployment implicitly assume the benchmark's experimental conditions are representative of production conditions. They rarely are. Distribution shift, the difference between test conditions and deployment conditions, is the primary source of AI production failures.
The stable-performance assumption. Traditional software fails loudly: it crashes, throws an error, returns a null. You know something has gone wrong because something has visibly gone wrong. AI systems fail differently. They fail fluently. Probabilistic systems can deteriorate behaviourally without crossing any operational alert threshold, producing outputs that are worse than last month's without triggering a single incident ticket. VentureBeat (2026) documents silent partial failure as a distinct failure mode: the system degrades behaviourally before it degrades operationally, and "by the time the signal reaches a postmortem, the erosion has been happening for weeks." The first signal is often user mistrust, a qualitative, distributed, hard-to-aggregate phenomenon that most monitoring systems are not built to detect. Strategies that include no mechanism for catching this are not managing risk. They are hoping it doesn't happen.
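Catching this kind of silent degradation needs a quality signal that is independent of uptime. The sketch below is one minimal way to do it, assuming a sample of production outputs is human-verified on a regular cadence and scored as correct or incorrect. The class name, the window size, and the thresholds are placeholders, not recommendations.

```python
from collections import deque
from typing import Optional

class QualityMonitor:
    """Tracks the rolling share of verified-correct outputs and flags when
    it falls below the recorded baseline by more than an allowed margin."""

    def __init__(self, baseline_rate: float, max_drop: float, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        self.baseline_rate = baseline_rate
        self.max_drop = max_drop
        # A bounded deque discards the oldest result once the window is full.
        self.results = deque(maxlen=window)

    def record(self, verified_correct: bool) -> None:
        self.results.append(verified_correct)

    def rolling_rate(self) -> Optional[float]:
        """None until the window is full, so early noise raises no alerts."""
        if len(self.results) < self.results.maxlen:
            return None
        return sum(self.results) / len(self.results)

    def degraded(self) -> bool:
        rate = self.rolling_rate()
        return rate is not None and (self.baseline_rate - rate) > self.max_drop

monitor = QualityMonitor(baseline_rate=0.95, max_drop=0.05, window=20)

# First 20 verified outputs: 19 correct, 1 wrong, matching the baseline.
for ok in [True] * 19 + [False]:
    monitor.record(ok)
healthy_flag = monitor.degraded()

# Four further wrong outputs push the rolling rate to 15 / 20.
for ok in [False] * 4:
    monitor.record(ok)
degraded_flag = monitor.degraded()
```

Nothing in this check depends on the system crashing or timing out, which is the point: it can raise an alert while every operational dashboard is still green.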
AI companies claim they will achieve artificial general intelligence within the decade, yet none scored above D in Existential Safety planning. One reviewer described this disconnect as "deeply disturbing": despite racing toward human-level AI, "none of the companies has anything like a coherent, actionable plan" for ensuring such systems remain safe and controllable.
If the developers of these systems have not formally addressed what their most extreme claims imply, enterprise strategies that treat those claims as reliable planning inputs are building on an unsupported foundation.
futureoflife.org/ai-safety-index-summer-2025 · Future of Life Institute, July 2025
Frame Governance Around What AI Can Be Held To
The governance documents that get signed off are almost always written to be approved, not to work. These are different objectives, and the gap between them is where the liability accumulates.
The most common failure in AI governance is specifying accountability in terms that do not map onto how AI systems actually work. "Human oversight" is the canonical example — it appears in almost every governance framework, repeated as if the phrase itself were a control. It is not. A phrase is not a mechanism. The question is not whether a human is nominally present in the process. The question is whether that human is in a position to catch the specific failure modes the system exhibits, given the information they have, the time they are allocated, and the opacity of the reasoning they are being asked to review.
Human oversight is meaningful only if the human is in a position to falsify the system's output — to detect when it is wrong and act on that detection. If they cannot do this — if they are reviewing outputs they cannot independently verify, in a timeframe that does not permit genuine scrutiny — then their presence in the loop is the appearance of oversight: a control that satisfies a checkbox but provides no actual protection. This is not an edge case. It is the normal condition for most enterprise AI deployments.
— Lumenova AI, Behavioral AI Testing: The Importance of Control Conditions, 2026
What can a reviewer actually see? In opaque systems, reviewers see outputs, not reasoning. They can judge whether an output looks plausible — but, as Principle I established, an AI system is specifically optimised to produce outputs that look plausible. Asking someone to evaluate outputs for accuracy when the outputs are designed to appear accurate is not oversight. It is a reading comprehension exercise dressed up as quality control.
What does the error distribution look like? A 3% error rate is not the same thing regardless of which 3% fails. If errors are randomly distributed, a sampling-based review process can catch them. If errors are systematically concentrated in specific input types — which AI failures very often are — a generic review process will miss them at precisely the highest-stakes moments. This is not a theoretical concern. It is the normal pattern of AI production failures (see the MIT / Deloitte findings below). The governance question is not "what is the error rate?" It is "where do the errors land?"
According to MIT Technology Review, 80% of AI system failures in production environments stem from scenarios not adequately represented in training data. A Deloitte study found that 37% of enterprises experienced significant operational disruptions due to AI system failures in edge case scenarios. In regulated industries, edge cases often involve precisely the scenarios where compliance is non-negotiable — meaning the failure distribution is inverse to the stakes.
The implication for governance: a review process calibrated against average-case performance provides no protection for the cases that matter most.
MIT Technology Review / Deloitte cited in: CloudFactory, Edge Case Handling, September 2025
LLM errors shift from predictable (factual inaccuracy, unstable reasoning) to hermeneutic forms, where linguistic fluency, structural coherence, and superficially plausible citations conceal deeper distortions of meaning. Evaluators, including experienced domain specialists, frequently conflated criteria such as correctness, relevance, and consistency, indicating that human judgement collapses analytical distinctions into intuitive heuristics shaped by form and fluency.
The finding implies that "human oversight" is not a reliable safeguard when the human reviewer is subject to the same fluency-reliability conflation described in Principle I. The oversight mechanism must be designed to overcome this cognitive tendency, not assume its absence.
arXiv: 2512.16750 · Farkas et al. (2025), Plausibility as Failure: How LLMs and Humans Co-Construct Epistemic Error
Specify what a human reviewer can actually detect given the information available and the time allocated. "Human in the loop" is not a governance control if the loop cannot surface the failure mode.
A system that cannot explain its reasoning cannot be held accountable for that reasoning. The humans who deployed it can. Accountability must follow traceability — map decisions to decision-makers, not to models.
Build monitoring that catches behavioural degradation before operational failure. Set thresholds against verified output quality, not only uptime. Re-verify against the original diligence record quarterly.
Any accuracy figure must be accompanied by an analysis of where failures concentrate. Are they random or systematic? Do they cluster in high-stakes input types? Precision vs recall trade-offs must be made explicit.
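The failure-concentration analysis can start as a simple per-segment breakdown of verified errors. The sketch below is a minimal illustration, assuming each evaluated output carries an input-segment label and a verified correct/incorrect flag. The segment names and counts are invented for illustration and do not come from any cited study.

```python
from collections import Counter
from typing import Iterable

def error_rates_by_segment(records: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """records: (segment, is_correct) pairs. Returns the error rate per segment."""
    totals, errors = Counter(), Counter()
    for segment, is_correct in records:
        totals[segment] += 1
        if not is_correct:
            errors[segment] += 1
    return {segment: errors[segment] / totals[segment] for segment in totals}

def overall_error_rate(records: Iterable[tuple[str, bool]]) -> float:
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    return sum(not is_correct for _, is_correct in records) / len(records)

# Illustrative data: 1,000 outputs with 30 errors in total, i.e. 3% overall.
records = (
    [("routine", True)] * 940 + [("routine", False)] * 10
    + [("regulated", True)] * 30 + [("regulated", False)] * 20
)

overall = overall_error_rate(records)          # 30 errors / 1,000 outputs
by_segment = error_rates_by_segment(records)   # routine: 10/950, regulated: 20/50

for segment, rate in sorted(by_segment.items(), key=lambda item: -item[1]):
    print(f"{segment}: {rate:.1%}")
```

In this invented data the headline 3% error rate conceals a 40% error rate in the "regulated" segment, which is the pattern a single accuracy figure cannot reveal.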