§ Free research briefing

Using AI Safely:
A Framework for Decision-Makers

Distinguishing robust from brittle capabilities. Interrogating vendor claims with method. Surfacing hidden strategic assumptions. Building governance that is formally accountable.

Format

Four principles · Research briefing

For

Leadership, procurement & governance teams

Sources

15 peer-reviewed references · June 2026

The decision-makers signing off on AI systems are not, for the most part, naive. They are using perfectly sensible heuristics — heuristics that have served them well for decades. The problem is that those heuristics were designed for a world in which things that look competent usually are competent, and things that sound confident usually have something to back that confidence up. AI systems have learned to trigger exactly those signals without the underlying substance. The demo is not dishonest. It is, however, optimised to make you feel a kind of trust that the technology has not earned.

This briefing applies frameworks from the philosophy of science to four failure patterns that are already causing real organisational harm. The philosophical framing is not decoration: treating an AI deployment as you would any scientific experiment — with explicit conditions, defined scope, and genuine exposure to falsification — is the most rigorous methodology available for separating what a system can actually do from what it has been constructed to appear to do.

The four failure patterns this framework addresses

01 · Procurement

Fluency Mistaken for Reliability

A team approves a tool because its demo was impressive, without asking whether impressiveness is evidence of reliability. It is not. People have long treated fluent speech as a proxy for competence, and AI has reverse-engineered that proxy.

02 · Boardroom

Strategy Built on Unexamined Claims

A board signs off on an AI strategy built on claims that no one in the room was equipped to interrogate. This is not a failure of intelligence. It is a predictable consequence of information asymmetry — the same dynamic that makes it very hard to buy a second-hand car from a dealer who knows more about the car than you do.

03 · Governance

"Human Oversight" Without Definition

A governance document specifies "human oversight" without defining what oversight means when the system's reasoning is opaque. This is bureaucratic magic: an incantation that satisfies a committee without protecting anyone. The phrase stands in for a policy so that no one has to write one.

04 · Reporting

Accuracy Figures Treated as Readiness

A 97% accuracy figure is treated as evidence of readiness. Nobody asks where the 3% lands. A six-foot-tall person can still drown in a river with an average depth of five feet. The distribution is everything, and the distribution is precisely what the headline number conceals.

Foundational framework · Philosophy of science

AI Deployment as Experiment: Why This Framing Is Not Metaphor

One clarification is worth making before the four principles, because it reframes the whole problem. When we ask whether to trust an AI system's output in a given context, we are asking the same question as: should we trust an experimental result beyond the conditions under which it was obtained? Both questions turn on the same distinction: internal validity (does the evidence hold within the test?) versus external validity (does it generalise to the situation you actually care about?). Most AI evaluation concentrates on the first and treats the second as someone else's problem.

AI models are typically trained on a corpus of data and evaluated on held-out samples from the same distribution. When the system is deployed, it encounters the real world, which is a different distribution. This is the classic problem of generalisation under distribution shift. It is structurally identical to asking whether something that works in a lab will work in the field. (Spoiler: sometimes, depending on how carefully you specified the lab conditions and how honest you were about the difference.)
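The gap between held-out evaluation and deployment can be shown with a toy sketch. It is not a model of any real system: a one-threshold classifier is fitted on synthetic data, then scored on a held-out sample from the same distribution and on a shifted sample. The function names, the `shift` parameter and every number are assumptions chosen for illustration.

```python
import random

def sample(n, shift, rng):
    """Draw n (feature, label) pairs; `shift` moves both class means."""
    data = []
    for _ in range(n):
        label = rng.random() < 0.5
        mean = (2.0 if label else 0.0) + shift
        data.append((rng.gauss(mean, 1.0), label))
    return data

def fit_threshold(train):
    """Place the decision threshold midway between the two class means."""
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(data, threshold):
    """Share of points where 'feature above threshold' matches the label."""
    return sum((x > threshold) == y for x, y in data) / len(data)

rng = random.Random(0)
threshold = fit_threshold(sample(2000, 0.0, rng))

# Same distribution as training: the figure a benchmark would report.
held_out_acc = accuracy(sample(2000, 0.0, rng), threshold)
# Shifted distribution: a stand-in for the deployment environment.
shifted_acc = accuracy(sample(2000, 1.5, rng), threshold)

print(f"held-out: {held_out_acc:.2f}  shifted: {shifted_acc:.2f}")
```

The classifier itself is unchanged between the two scores. Only the data moved, which is why the held-out figure alone says little about deployment.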

Popper's demarcation criterion is directly useful here: a claim counts as scientific only if it is falsifiable, that is, only if there exist conditions under which it could be shown to be wrong. A vendor claim that cannot specify the conditions under which a system would fail is not a scientific claim. It is unfalsifiable, which means that, in Popper's framework, it carries no empirical content and provides no genuine epistemic warrant. The correct response is not scepticism about AI in general. It is to insist that every claim comes with the experimental conditions under which it was established and the conditions under which it breaks down.

Balestriero, Pesenti, and LeCun (2021) add a further complication. There is a reassuring story vendors often tell, implicitly, through the language of "training" and "testing": that a well-trained model is operating within its knowledge, filling gaps between things it has already seen. They demonstrated, theoretically and empirically, that in high-dimensional spaces (which practical models occupy) this story is formally false. Interpolation almost surely never occurs: a new input almost surely lies outside the convex hull of the training data. Models are extrapolating, in this precise geometric sense, even on tasks they appear to handle confidently. The safety implied by "it's been trained on this" simply does not hold.
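The geometric point can be made concrete with a rough sketch. This is not the paper's method: it uses the axis-aligned bounding box of the training sample as a stand-in for the convex hull. The box contains the hull, so any point outside the box is certainly outside the hull, and the measured share is a lower bound on hull extrapolation. The Gaussian data, sample sizes and function name are assumptions for illustration.

```python
import random

def fraction_outside_box(dim, n_train=200, n_test=300, seed=0):
    """Share of fresh points falling outside the training bounding box.

    The bounding box is a superset of the convex hull, so this is a
    lower bound on the share of inputs that lie outside the hull.
    """
    rng = random.Random(seed)
    lows = [float("inf")] * dim
    highs = [float("-inf")] * dim
    # Track the per-coordinate range covered by the training sample.
    for _ in range(n_train):
        for j in range(dim):
            v = rng.gauss(0.0, 1.0)
            lows[j] = min(lows[j], v)
            highs[j] = max(highs[j], v)
    outside = 0
    for _ in range(n_test):
        point = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # One coordinate out of range is enough to leave the box.
        if any(not (lo <= v <= hi) for v, lo, hi in zip(point, lows, highs)):
            outside += 1
    return outside / n_test

low_dim = fraction_outside_box(2)
high_dim = fraction_outside_box(500)
print(f"2 dims: {low_dim:.2f} outside   500 dims: {high_dim:.2f} outside")
```

For independent continuous coordinates, the chance that a fresh point lands inside the box is ((n-1)/(n+1))^d for n training points in d dimensions. That is about 0.98 for n = 200, d = 2, and under 0.01 for d = 500. Test and training points come from the same distribution here, yet in high dimension almost every test point falls outside the range the training data covered.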

The practical upshot for a decision-maker: every AI system, on every input, is making a bet about structure it has never directly observed. The question is not whether it is extrapolating (it is, always) but how far — and whether the extrapolation is the kind that tends to succeed or fail. Benchmarks measure performance inside a specific experimental envelope. Deployment extends that envelope. AI safety is, in large part, the question of how honestly the evaluation envelope was specified and how much it resembles where you are actually deploying.

Principle I

Distinguish Robust Capabilities from Brittle Ones

The gap between how AI presents and how it performs is not a bug. It is an architectural feature — and it mimics human signals of competence so precisely that your normal defences do not work.

Consider what large language models are actually doing. They generate outputs through next-token prediction optimised for distributional plausibility — a process that produces fluent, confident-sounding text whether or not the content is factually accurate. Fluency and reliability are formally separable properties of these systems: a model can be simultaneously maximally fluent and substantively wrong. This is not a defect awaiting correction. It is the mechanism. The system is optimising for plausibility, and we are pattern-matching plausibility to trustworthiness, because for most of human history that heuristic worked.

"LLMs generate coherence not from communicative intent, but from token-level prediction trained to maximise fluency. Users tend to interpret this fluency as indicative of understanding or reliability. As a result, an accuracy paradox occurs: improvements only in accuracy misplace user trust, as narrowly defined accuracy metrics fail to capture [the full failure space]."

— Farkas et al., arXiv:2509.13345, 2025

A robust capability is one where the model's extrapolation tends to be well-calibrated — reliably predictable across variations in input. A brittle capability is one where performance is high within the evaluation envelope and collapses outside it, often without any signal to the user that this is happening. That last part is the important one. Traditional software fails loudly. Brittle AI fails fluently — it produces a confident, well-formed wrong answer, which is the one failure mode your quality controls were not designed to catch.

Finding · Bengio et al. · International AI Safety Report 2026

Even systems that excel at complex tasks may generate non-existent citations, biographies, or facts (hallucination). Performance can be inconsistent: accuracy on mathematics problems can decrease significantly when irrelevant information is inserted into the problem description. This brittleness extends to multimodal capabilities, where models often have low performance on spatial reasoning tasks such as basic counting of objects in a scene.

Crucially, expert human oversight can mitigate some of these risks, but introduces a corresponding danger of over-reliance, where humans accept AI-generated outputs without adequate verification.

DOI: 10.48550/arXiv.2602.21012 · Bengio et al. (2026), International AI Safety Report 2026

A further dimension of brittleness concerns safety measures themselves. Research from Carnegie Mellon University demonstrates that widely used post-training alignment methods — the techniques vendors use to make models "safe" — can themselves be brittle. A model can pass safety evaluations under normal testing conditions while failing when the input distribution shifts slightly. Unsafe behaviours embedded during pretraining are difficult to remove through post-hoc alignment.

Finding · Safety Pretraining Research · IBM / CMU

Safety measures applied after model development (through RLHF, red-teaming, and guardrails) are often reactive, brittle, expensive, and struggle to address unknown or emerging risks. The dominant "Make AI Safe" paradigm addresses a model after it is built; the consequence is that safety is a thin layer over a system whose core behaviours were not designed for safety.

The leading models — GPT-5, Claude 4, Gemini 2.5 — demonstrate capability scores that consistently outpace safety scores. This structural gap has not closed as capabilities have increased.

See: arXiv:2509.06786 (R²AI); IBM Think: What is AI Safety?, May 2026

Cao and Yousefzadeh (2023) show that AI models extrapolate outside their training distribution frequently and without notifying users or stakeholders. They argue that whether a model has extrapolated for a given input should be a mandatory part of any explainability system — the model should declare when it is operating beyond its training support, just as a human expert would caveat advice that goes beyond their direct knowledge. A doctor who gives confident opinions outside their specialty without flagging the limitation is not being helpful. They are being dangerous. The fact that AI systems do this by default, without any caveat, is a governance problem masquerading as a technical one.

Finding · Cao & Yousefzadeh (2023) · Big Data & Society

AI models extrapolate outside their range of familiar data, frequently and without notifying the users and stakeholders. Knowing whether a model has extrapolated is a fundamental insight that should be included in explaining AI models in favour of transparency, accountability, and fairness. The right to AI explainability, as consolidated in the research community and in policy, is incomplete without this component.

DOI: 10.1177/20539517231169731 · Cao, X. & Yousefzadeh, R. (2023), Big Data & Society, 10(1)

Capability	Stability signal	Assessment
Pattern extraction on familiar structured data	Consistent across similar inputs; verifiable against source	Robust
Summarisation of provided text (with grounding)	Performance measurable and stable	Robust
Factual recall under pressure or topic drift	Fluent but unverifiable; hallucination-prone	Brittle
Mathematical reasoning with irrelevant insertions	Degrades under perturbation (IASR 2026)	Brittle
Spatial/counting tasks (multimodal)	Low baseline; documented failure mode	Brittle
Safety behaviour under adversarial prompting	Post-hoc alignment is itself brittle (CMU/IBM)	Brittle
Consistent probabilistic outputs	LLMs fail to follow target distributions (Yang & Zhang 2025)	Brittle

Questions that reveal brittleness — and which a responsible vendor can answer

Q1 · Show the same task with slight variation in input format, topic, or phrasing. Does performance hold, and is that stability documented?

Q2 · What are the documented failure modes, and under precisely what conditions do they occur? (A vendor who cannot enumerate failure modes has not tested them.)

Q3 · Was safety evaluated on pre-mitigation or post-mitigation models? Was red-teaming conducted by an independent party?

Q4 · Is there any mechanism by which the system signals when it is extrapolating beyond its training support?

Q5 · When the system does not know the answer, does it refuse or does it confabulate fluently? Ask for documented examples of each.
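The first question above can be turned into a small repeatable check. The sketch below is one possible shape for it, not a standard tool: `stability_rate` and `toy_model` are hypothetical names, and `toy_model` is a deliberately brittle stand-in for whatever system is under evaluation.

```python
from typing import Callable, Iterable

def stability_rate(model: Callable[[str], str],
                   variants: Iterable[str],
                   expected: str) -> float:
    """Share of input variants for which the model returns the expected answer."""
    variants = list(variants)
    hits = sum(model(v).strip() == expected for v in variants)
    return hits / len(variants)

def toy_model(prompt: str) -> str:
    """Stand-in for a real system: keys on surface wording, not meaning.

    It always returns a confident-looking answer, right or wrong.
    """
    return "4" if "2 + 2" in prompt else "5"

# The same task, rephrased and reformatted.
variants = [
    "What is 2 + 2?",
    "what is 2 + 2",
    "What is 2+2?",
    "What is two plus two?",
]

rate = stability_rate(toy_model, variants, expected="4")
print(f"stable on {rate:.0%} of variants")
```

In practice the variants would come from the buyer's own production inputs, and the stability rate would be recorded alongside the headline accuracy figure.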

Principle II

Interrogate Vendor Claims with Rigour

Vendors are not deceiving you. They are showing you the results that make them look best. The problem is structural — and the solution is methodological rigour from the buyer, not moral expectations of the seller.

A vendor benchmark is an experimental result. The question is whether that result has internal validity (the benchmark measures what it claims to measure, rather than something else that correlates with it) and external validity (it generalises to your actual deployment conditions). Both are routinely violated. This is not unique to AI — any industry with significant information asymmetry between seller and buyer produces this dynamic. The appropriate response is not cynicism about vendors. It is methodological rigour from the buying side.

The internal validity of AI benchmarks is systematically compromised by what researchers call the "Clever Hans" problem — named after the horse that appeared to do arithmetic but was actually reading its trainer's body language. Models appear to solve tasks by learning spurious statistical features of the test set rather than the underlying capability being measured. Schlegel et al. (2024) find that simple surface features of benchmark items predict LLM performance — meaning high scores can be achieved without the model possessing the capacity the benchmark purports to test. The horse passes the exam without learning the subject.

Finding · Schlegel et al. (2024) · Benchmark Validity

AI benchmark internal validity requires that observed performance be attributable to the specific capability under test, not to other unrelated capabilities, systematic errors, or biases in the benchmark data. This condition is routinely unmet. Simple features of benchmark items predict LLM answers with significant accuracy, indicating that benchmark performance is partly a measure of surface-pattern exploitation rather than genuine capability.

External validity — how benchmark results generalise to real-world performance — has received even less scrutiny. A model that scores 95% on a benchmark can still fail specific user segments in production because test sets rarely capture the full complexity of live environments.

arXiv: 2410.11672 · Schlegel et al. (2024), Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers

A KPMG survey found that most organisations completing AI evaluations rely solely on vendor-supplied benchmarks and documentation. Gartner (2024) cites misalignment between vendor claims and production performance as a primary cause of enterprise AI deployment failure. Think about what a vendor-supplied benchmark actually is: a restaurant showing you photographs of its best dishes. The photographs are real. The dish exists. The question is whether it will be on the menu when you arrive, whether the kitchen can produce it consistently, and whether your particular dietary requirements were accounted for in the recipe. None of that is captured in the photograph.

Finding · OpenAI Preparedness Framework Analysis (2025) · Affordance Analysis

A formal affordance analysis of OpenAI's Preparedness Framework Version 2 (April 2025) found that the framework: (1) requests evaluation of only a small minority of AI risks; (2) encourages deployment of systems with capabilities for unintentionally enabling "severe harm" (defined as >1,000 deaths or >$100B in damages) at the "Medium" risk level; and (3) allows the CEO to deploy even more dangerous capabilities, especially if other companies do so. This was not a selective reading — it is what the framework formally permits, as a matter of logical structure.

The finding is generalisable: vendor safety frameworks, as currently published, are voluntary, largely unaudited, and structured to allow deployment to proceed. They do not constitute independent safety assurance.

arXiv: 2509.24394 · Affordance analysis of OpenAI Preparedness Framework V2 (2025)

Deflection patterns to recognise and document

These responses indicate a vendor cannot or will not substantiate their claims under scientific standards:

"Our benchmark results speak for themselves" — without specifying which benchmark, on what data, under what conditions, or what failure rate is tolerated
"We use state-of-the-art safety techniques" — without describing the techniques, their evaluation methodology, or whether they were tested pre- or post-mitigation
"Accuracy is 97%" — without stating the base rate, the distribution of the 3%, or whether errors are randomly or systematically distributed
"Human oversight is built in" — without defining what a human reviewer can detect, given the opacity of the model's reasoning
"This is a foundational model question, not ours to answer" — when asked about failure modes of the specific deployment you are evaluating

Six diligence questions — the procurement record must contain written answers to all of these

D1 · What training data underpins this system, and has it been independently audited? What is the data's coverage of the specific domain and distribution we will be deploying into?

D2 · On what dataset was the headline performance figure measured, and how closely does that dataset resemble our production environment in format, topic, and distributional characteristics?

D3 · What are the documented failure modes? Provide concrete examples of the cases the model gets wrong. What is the severity distribution of those failures?

D4 · Was safety evaluated on pre- or post-mitigation models? Who conducted the evaluation, and what access did evaluators have to the model? Was red-teaming conducted by an independent party?

D5 · What contractual liability do you accept for errors in decisions made on the basis of this system's outputs? Is this documented in the contract?

D6 · How will you notify us when model updates change the performance characteristics we have relied on? What is your process for communicating capability degradation?

Treat these answers as evidence, not reassurance. Write them into the procurement record alongside the accuracy claims at signature. Re-test against them quarterly. Cite them in any subsequent dispute. This is not bureaucracy for its own sake — it is the only mechanism by which you can tell, six months from now, whether the system you deployed is still the system you evaluated. AI models update. Vendor terms shift. The baseline you relied on at procurement becomes invisible unless you made it explicit at the time.

Principle III

Surface the Assumptions Your Strategy Rests On

Strategy documents describe what the technology will do. What they rarely specify is the conditions under which it won't. That omission is not accidental — it is the part that would make the strategy harder to approve.

There is something deeply human about the AI strategy document. It is a form of collective self-reassurance — a document that says "we have thought about this" in a way that satisfies governance committees without necessarily satisfying the question. The claim "this system will accurately process customer support enquiries" is, in the language of science, a hypothesis about future experimental conditions. Like any hypothesis, it has a scope (the conditions under which it holds) and a failure surface (the conditions under which it breaks down). Strategy documents routinely specify the former and omit the latter — not out of negligence but because specifying failure surfaces is psychologically uncomfortable and politically inconvenient. It is much easier to get a strategy approved if the downside scenarios are left implicit.

The epistemic problem here is structurally similar to what Popper called naive inductivism: the inference from "this system has performed well on N cases" to "this system will perform well generally." Popper's response — that no number of confirming instances warrants a universal claim, and that a single genuine counter-instance falsifies it — applies directly. A single high-stakes failure can destroy the operational and reputational value of a successful deployment. The question is whether your strategy is designed to surface that failure before or after it happens.

Several hidden assumptions recur across enterprise AI strategies and deserve explicit audit:

The fluency-reliability assumption. The most pervasive error is assuming that a system that sounds authoritative is authoritative. This is the same error we make with human experts: we treat confident delivery as evidence of sound reasoning, and we are wrong often enough that entire fields (medicine, finance, law) have developed formal second-opinion protocols to compensate for it. Hallucination in LLMs is not a bug awaiting a fix; it is a structural property of probabilistic token prediction. The IAPP (2025) states plainly that the capacity to generate fabricated information is "not just a failure of learning specific data, but a characteristic tied to the model's operational mechanics." It cannot be entirely trained away, and the next model version should not be expected to eliminate it. Any strategy that depends on hallucination disappearing is resting its weight on a foundation that does not exist.

Finding · IAPP (2025) · Hallucinations in LLMs: Technical Challenges, Systemic Risks and AI Governance Implications

Hallucinations in LLMs cannot be entirely "fixed" or "trained away" because they arise from the model's operational mechanics and its inherent limits in mapping the vast space of language and knowledge. This challenges the assumption widespread in strategy documents that hallucination is a temporary limitation of current models. It is a characteristic of the probabilistic architecture.

iapp.org · Hallucinations in LLMs, July 2025

The benchmark-production equivalence assumption. Strategy documents that cite benchmark performance as justification for deployment implicitly assume the benchmark's experimental conditions are representative of production conditions. They rarely are. Distribution shift — the difference between test conditions and deployment conditions — is the primary source of AI production failures.

The stable-performance assumption. Traditional software fails loudly: it crashes, throws an error, returns a null. You know something has gone wrong because something has visibly gone wrong. AI systems fail differently. They fail fluently. Probabilistic systems can deteriorate behaviourally without crossing any operational alert threshold, producing outputs that are worse than last month's without triggering a single incident ticket. VentureBeat (2026) documents silent partial failure as a distinct failure mode: the system degrades behaviourally before it degrades operationally, and "by the time the signal reaches a postmortem, the erosion has been happening for weeks." The first signal is often user mistrust, a qualitative, distributed, hard-to-aggregate phenomenon that most monitoring systems are not built to detect. Strategies that include no mechanism for catching this are not managing risk. They are hoping it doesn't happen.
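One way to catch this kind of drift is to score a rolling sample of outputs against verified ground truth and compare the result with the accuracy recorded at procurement. The sketch below is a minimal illustration under that assumption. The class name, the baseline of 0.97, the tolerance and the window size are all hypothetical choices, not recommended values.

```python
from collections import deque

class QualityMonitor:
    """Rolling accuracy against verified ground truth, compared to a baseline."""

    def __init__(self, baseline: float, tolerance: float, window: int):
        self.baseline = baseline      # accuracy recorded at procurement
        self.tolerance = tolerance    # acceptable drop before alerting
        self.results = deque(maxlen=window)

    def record(self, output_correct: bool) -> None:
        """Log whether one human-verified output was correct."""
        self.results.append(output_correct)

    def rolling_accuracy(self) -> float:
        return sum(self.results) / len(self.results)

    def degraded(self) -> bool:
        # Only judge once the window is full, to avoid noisy early alerts.
        full = len(self.results) == self.results.maxlen
        return full and self.rolling_accuracy() < self.baseline - self.tolerance

monitor = QualityMonitor(baseline=0.97, tolerance=0.03, window=100)

# Simulated month: every tenth verified output is wrong. Nothing crashes,
# no error is thrown, and uptime is unaffected.
for i in range(100):
    monitor.record(i % 10 != 0)

alert = monitor.degraded()
print(f"rolling accuracy: {monitor.rolling_accuracy():.2f}  alert: {alert}")
```

The check depends on someone verifying a sample of outputs on a schedule. Without that verified sample there is nothing for the monitor to measure.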

Assumption audit — questions to put to every strategic AI claim before sign-off

A1 · Does this claim rest on benchmark performance? If so: what is the distributional overlap between the benchmark dataset and our production environment?

A2 · Does this claim assume the model behaves consistently outside the conditions under which it was evaluated? What evidence supports that generalisation?

A3 · Does this claim assume hallucination rates will decrease to negligible levels with future model versions? Is that assumption load-bearing, i.e., does the strategy fail if it doesn't hold?

A4 · Does our roadmap include monitoring for silent degradation: not just uptime and API response, but output quality against verified ground truth on a rolling basis?

A5 · Does the strategy depend on vendor safety frameworks providing genuine protection? Have we examined what those frameworks actually permit, not just what they claim?

Finding · Future of Life Institute · AI Safety Index, Summer 2025

AI companies claim they will achieve artificial general intelligence within the decade, yet none scored above D in Existential Safety planning. One reviewer described this disconnect as "deeply disturbing": despite racing toward human-level AI, "none of the companies has anything like a coherent, actionable plan" for ensuring such systems remain safe and controllable.

If the developers of these systems have not formally addressed what their most extreme claims imply, enterprise strategies that treat those claims as reliable planning inputs are building on an unsupported foundation.

futureoflife.org/ai-safety-index-summer-2025 · Future of Life Institute, July 2025

Principle IV

Frame Governance Around What AI Can Be Held To

The governance documents that get signed off are almost always written to be approved, not to work. These are different objectives, and the gap between them is where the liability accumulates.

The most common failure in AI governance is specifying accountability in terms that do not map onto how AI systems actually work. "Human oversight" is the canonical example — it appears in almost every governance framework, repeated as if the phrase itself were a control. It is not. A phrase is not a mechanism. The question is not whether a human is nominally present in the process. The question is whether that human is in a position to catch the specific failure modes the system exhibits, given the information they have, the time they are allocated, and the opacity of the reasoning they are being asked to review.

Human oversight is meaningful only if the human is in a position to falsify the system's output — to detect when it is wrong and act on that detection. If they cannot do this — if they are reviewing outputs they cannot independently verify, in a timeframe that does not permit genuine scrutiny — then their presence in the loop is the appearance of oversight: a control that satisfies a checkbox but provides no actual protection. This is not an edge case. It is the normal condition for most enterprise AI deployments.

"Asking AI systems to behave differently under alternative framings will typically produce performed expectations, not genuine behavioural shifts. Without controls, you cannot distinguish the two. Any study probing latent capabilities or alternative processing modes should implement a robust control condition."

— Lumenova AI, Behavioral AI Testing: The Importance of Control Conditions, 2026

What can a reviewer actually see? In opaque systems, reviewers see outputs, not reasoning. They can judge whether an output looks plausible — but, as Principle I established, an AI system is specifically optimised to produce outputs that look plausible. Asking someone to evaluate outputs for accuracy when the outputs are designed to appear accurate is not oversight. It is a reading comprehension exercise dressed up as quality control.

What does the error distribution look like? A 3% error rate is not the same thing regardless of which 3% fails. If errors are randomly distributed, a sampling-based review process can catch them. If errors are systematically concentrated in specific input types — which AI failures very often are — a generic review process will miss them at precisely the highest-stakes moments. This is not a theoretical concern. It is the normal pattern of AI production failures (see the MIT / Deloitte findings below). The governance question is not "what is the error rate?" It is "where do the errors land?"
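A short sketch shows how a single headline rate can hide concentration. The review log below is invented for illustration: the segment names, counts and the function `error_rate_by_segment` are assumptions, not data from any cited study.

```python
from collections import defaultdict

def error_rate_by_segment(records):
    """records: iterable of (segment, is_correct). Returns {segment: error rate}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for segment, is_correct in records:
        totals[segment] += 1
        if not is_correct:
            errors[segment] += 1
    return {s: errors[s] / totals[s] for s in totals}

# Hypothetical review log: 1,000 decisions, 30 errors (97% accurate overall).
records = (
    [("routine", True)] * 945 + [("routine", False)] * 5
    + [("high_stakes", True)] * 25 + [("high_stakes", False)] * 25
)

overall_error = sum(not ok for _, ok in records) / len(records)
by_segment = error_rate_by_segment(records)

print(f"overall error rate: {overall_error:.1%}")
for segment, rate in by_segment.items():
    print(f"  {segment}: {rate:.1%}")
```

In this invented log, a uniform random sample of 100 decisions would contain about five high-stakes cases on average. A review process sized for the 3% headline rate would therefore see very little of the segment where half the outputs are wrong.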

Finding · Edge Case Production Failures · MIT Technology Review / Deloitte (2024–25)

According to MIT Technology Review, 80% of AI system failures in production environments stem from scenarios not adequately represented in training data. A Deloitte study found that 37% of enterprises experienced significant operational disruptions due to AI system failures in edge case scenarios. In regulated industries, edge cases often involve precisely the scenarios where compliance is non-negotiable, meaning failures concentrate where the stakes are highest.

The implication for governance: a review process calibrated against average-case performance provides no protection for the cases that matter most.

MIT Technology Review / Deloitte cited in: CloudFactory, Edge Case Handling, September 2025

Finding · Epistemic Failure as Co-constructed · Farkas et al. (2025)

LLM errors shift from predictable (factual inaccuracy, unstable reasoning) to hermeneutic forms, where linguistic fluency, structural coherence, and superficially plausible citations conceal deeper distortions of meaning. Evaluators — including experienced domain specialists — frequently conflated criteria such as correctness, relevance, and consistency, indicating that human judgement collapses analytical distinctions into intuitive heuristics shaped by form and fluency.

The finding implies that "human oversight" is not a reliable safeguard when the human reviewer is subject to the same fluency-reliability conflation described in Principle I. The oversight mechanism must be designed to overcome this cognitive tendency, not assume its absence.

arXiv: 2512.16750 · Farkas et al. (2025), Plausibility as Failure: How LLMs and Humans Co-Construct Epistemic Error

Define oversight operationally

Specify what a human reviewer can actually detect given the information available and the time allocated. "Human in the loop" is not a governance control if the loop cannot surface the failure mode.

Assign accountability by traceable capacity

A system that cannot explain its reasoning cannot be held accountable for that reasoning. The humans who deployed it can. Accountability must follow traceability — map decisions to decision-makers, not to models.

Monitor for silent degradation

Build monitoring that catches behavioural degradation before operational failure. Set thresholds against verified output quality, not only uptime. Re-verify against the original diligence record quarterly.

Analyse the failure distribution, not just the rate

Any accuracy figure must be accompanied by an analysis of where failures concentrate. Are they random or systematic? Do they cluster in high-stakes input types? Precision vs recall trade-offs must be made explicit.
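The need to state precision and recall alongside accuracy can be illustrated with a deliberately extreme case. The data below is hypothetical: 3% of cases are the ones that matter, and the "model" never flags anything. The function name `precision_recall` is an assumption for this sketch.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for boolean labels and predictions."""
    tp = sum(y and p for y, p in zip(labels, predictions))
    fp = sum((not y) and p for y, p in zip(labels, predictions))
    fn = sum(y and (not p) for y, p in zip(labels, predictions))
    # Report 0.0 when a ratio is undefined (nothing flagged, or no positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: 1,000 cases, of which 30 (3%) are the ones that matter.
labels = [True] * 30 + [False] * 970
always_negative = [False] * 1000   # a "model" that never flags anything

accuracy = sum(y == p for y, p in zip(labels, always_negative)) / len(labels)
precision, recall = precision_recall(labels, always_negative)

print(f"accuracy: {accuracy:.0%}  precision: {precision:.0%}  recall: {recall:.0%}")
```

The accuracy figure matches the base rate of the majority class exactly, while the system catches none of the cases it was presumably bought to catch. An accuracy claim that arrives without the base rate cannot be told apart from this one.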

Governance specification checklist — against ISO/IEC 42001 and EU AI Act requirements

G1 · Does our governance document define oversight in terms of what a human reviewer can detect, not just that a human is present in the process?

G2 · Is there a named accountable function for AI accuracy diligence: not a committee, but a named individual?

G3 · Are performance thresholds defined by business risk and failure cost, not by statistical defaults or vendor suggestions?

G4 · Do our incident response processes cover silent and partial failure, or only system outages and API errors?

G5 · Have we mapped where AI-driven decisions carry regulatory liability under the EU AI Act, and confirmed our governance controls are sufficient for those specific use cases?

G6 · Is our oversight process documented in a form that would withstand external audit, and have we tested it, not just written it down?

References & Further Reading

IASR 2026 · Bengio, Y. et al. (2026). International AI Safety Report 2026. arXiv:2602.21012. DOI: 10.48550/arXiv.2602.21012. Led by Yoshua Bengio; 100+ contributing experts across 29 nations, UN, OECD, EU.
IASR 2025 · Bengio, Y. et al. (2025). International AI Safety Report. arXiv:2501.17805. DOI: 10.48550/arXiv.2501.17805.
Balestriero 2021 · Balestriero, R., Pesenti, J. & LeCun, Y. (2021). Learning in High Dimension Always Amounts to Extrapolation. arXiv:2110.09485. Geometric proof that interpolation almost surely never occurs in high-dimensional spaces.
Cao 2023 · Cao, X. & Yousefzadeh, R. (2023). Extrapolation and AI transparency. Big Data & Society, 10(1). DOI: 10.1177/20539517231169731.
Schlegel 2024 · Schlegel, V. et al. (2024). Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers. arXiv:2410.11672.
Farkas 2025 · Farkas, L. et al. (2025). Plausibility as Failure: How LLMs and Humans Co-Construct Epistemic Error. arXiv:2512.16750.
Prep. FW 2025 · Affordance analysis of OpenAI Preparedness Framework V2 (April 2025). arXiv:2509.24394.
FLI 2025 · Future of Life Institute. (2025). AI Safety Index: Summer 2025. futureoflife.org. No major AI company scored above D in Existential Safety planning.
Yang 2025 · Yang, I.Y. & Zhang, D.Y. (2025). Failure to Mix: Large language models struggle to answer according to desired probability distributions. arXiv:2511.14630.
IAPP 2025 · IAPP. (2025). Hallucinations in LLMs: Technical Challenges, Systemic Risks and AI Governance Implications. iapp.org, July 2025.
R²AI 2025 · arXiv:2509.06786. (2025). R²AI: Towards Resistant and Resilient AI in an Evolving World. Safety scores consistently trail capability scores across GPT-5, Claude 4, Gemini 2.5.
Lumenova 2026 · Lumenova AI. (2026). Behavioral AI Testing: The Importance of Control Conditions. lumenova.ai.
EU AI Act · Regulation (EU) 2024/1689. The AI Act. Entered into force 1 August 2024. Human-in-the-loop requirements for high-risk AI systems.
ISO 42001 · ISO/IEC 42001:2023. Artificial Intelligence: Management System. International standard for AI governance, transparency, and accountability frameworks.

§ Take this further with your team

Most of the errors described in this briefing are entirely preventable.

They are not caused by technical ignorance. They are caused by the absence of a shared framework for asking the right questions — before the vendor presentation, before the board sign-off, before the deployment. A half-day workshop gives leadership and strategy teams exactly that: a structured method, a shared vocabulary, and the specific questions that responsible vendors can answer and evasive ones cannot.

