Artificial intelligence promises to revolutionise legal practice, but its tendency to "hallucinate" false information is a fundamental flaw that threatens to undermine that promise. In Nevada earlier this year, District Judge David Hardy uncovered 14 fictitious case citations in a single filing. The fallout was swift: an associate was dismissed, senior lawyers were sanctioned, and the firm was ordered to notify law school deans of the misconduct.
Meanwhile, across the Atlantic, an English tax tribunal judge disclosed that he had used AI in drafting a decision, raising an uncomfortable question: if judges can use AI transparently and within guidelines, does this legitimise its place in judicial decision-making, despite the cautionary tales from legal practice?
New research from OpenAI, led by Adam Tauman Kalai, offers a mathematical explanation for why hallucinations in LLMs are not just common but inevitable. For the legal profession, the implications are clear: fabricated content will continue to surface in courtrooms worldwide unless lawyers adapt their practices.
The OpenAI study: why hallucinations are inevitable
OpenAI’s research team has shown that hallucinations are an unavoidable feature of current large language models (Kalai et al, 2025). Because these systems generate text one word at a time based on probabilities, error rates compound over long passages.
The mathematics are telling: even with a 99% per-token accuracy rate, a 1,000-word response has a near-certainty of containing at least one error. The result: whilst a model might be reasonably accurate on a yes/no question, a 20-page legal memo generated under the same conditions has a high likelihood of containing multiple fabricated claims or citations.
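The arithmetic behind that near-certainty can be checked directly. Under the simplifying assumption that each word is independently correct with probability 0.99, the chance that a 1,000-word passage is entirely error-free is 0.99 raised to the 1,000th power, a vanishingly small number. A minimal sketch (the figures are illustrative, and real token errors are not fully independent):

```python
# Probability that a long passage contains at least one error,
# assuming each token is independently correct with probability p_correct.
def p_at_least_one_error(p_correct: float, n_tokens: int) -> float:
    return 1 - p_correct ** n_tokens

# 99% per-token accuracy over a 1,000-word response:
print(f"{p_at_least_one_error(0.99, 1000):.5f}")  # prints 0.99996
```

In other words, even a model that is right 99 times out of 100 at the word level is almost guaranteed to slip somewhere in a document of any length.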
The unreliability is evident even in simple factual tasks. When asked the birthday of one of the paper's authors, DeepSeek-V3 produced three confidently stated but incorrect dates. None were even close.
Why does the problem persist?
Hallucinations remain stubbornly common. According to Kalai et al, the reasons are structural:
- Model incentives: Current evaluation benchmarks penalise uncertainty. A system that admits “I don’t know” receives the same score as one that provides an entirely wrong answer. This creates powerful incentives for models to guess confidently (see p.1).
- User pressures: Under tight deadlines and economic constraints, lawyers often rely on outputs without performing verification.
- Compounding risk: Even the best-performing models still hallucinate. Google’s Gemini-2.0-Flash shows a 0.7% hallucination rate, while GPT-4o demonstrates 1.5% (Vectara, 2025). In long legal briefs, those small percentages translate into multiple errors.
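To see how such small percentages scale, one can treat each factual claim in a brief as carrying an independent error chance. This is a deliberate simplification (the Vectara figures measure hallucination per summarisation response, not per claim), but it illustrates the compounding:

```python
# Expected number of hallucinated claims in a document, under the
# simplifying assumption of an independent per-claim error rate.
def expected_errors(rate: float, n_claims: int) -> float:
    return rate * n_claims

# 200 factual assertions at a 1.5% benchmark rate:
print(expected_errors(0.015, 200))  # roughly 3 expected fabrications
```

A brief dense with citations and factual assertions therefore cannot rely on a "low" headline rate: the expected number of fabrications grows linearly with the document's length.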
The core issue is therefore not just technical accuracy, but structural misalignment: today’s AI systems are optimised for confidence rather than for truth.
Hallucinations in courtrooms worldwide
AI is designed to sound confident even when wrong. The problem is compounded when AI-generated submissions come from litigants in person—individuals without formal legal representation who may lack awareness of AI's limitations. Insufficient public understanding of how these systems work means fabricated evidence can appear in proceedings with serious consequences for families and vulnerable parties.
Courts are punishing lawyers for trusting this confidence without checking.
United States
Courts and professional bodies have issued explicit warnings: Texas Bar Ethics Opinion 705 requires lawyers to independently verify any AI-generated information, whilst the Eastern District of Texas mandates that lawyers certify they have reviewed and verified all computer-generated content. The American Bar Association's Formal Opinion 512 similarly emphasises that attorneys remain fully responsible for verifying the accuracy of all AI outputs.
Yet, at the time of writing, there are over 280 recorded instances of hallucinations in U.S. court filings (Charlotin, 2025). These include fictional case citations, misrepresented precedents and even fictitious evidence.
The professional failure lies squarely with the lawyers themselves: despite widespread knowledge that AI systems hallucinate, practitioners continue to submit AI-generated content to courts without independent verification, treating probabilistic outputs as authoritative submissions. However, the systems are operating exactly as their probabilistic design dictates: generating plausible-sounding text without any mechanism for truth verification.
The problem first gained widespread attention in 2023 with Mata v Avianca, where lawyers submitted a brief citing six non-existent judicial opinions generated by ChatGPT. The Southern District of New York imposed sanctions, and the case became a cautionary tale that reverberated throughout the legal profession. Yet the warnings have proven insufficient. Similar incidents have continued across jurisdictions, with judges in Arizona and Alabama imposing increasingly creative sanctions to deter AI misuse.
United Kingdom
The Bar Council of England and Wales explicitly warns barristers that the “ability of LLMs to generate convincing but false content raises ethical concerns”, further cautioning members not to take “such systems’ outputs on trust and certainly not at face value” (para. 17). Advice worth heeding. Though fewer in number than in the United States, incidents continue to appear before the English courts.
There are currently over 15 documented incidents before the English courts in which generative AI produced fictitious or false content. Some resulted in adverse costs orders and, where relevant, referral to the Bar Standards Board for investigation (Charlotin, 2025).
In the meantime, in VP Evans (as executrix of HB Evans, deceased) & Ors v The Commissioners for HMRC [2025] UKFTT 1112 (TC), Tribunal Judge Christopher McNall confirmed using Copilot Chat to draft a case management decision. In justifying his use of AI, the Judge referred to the relevant Practice Direction as well as "AI: Guidance for Judicial Office Holders", published by the senior Courts and Tribunals judiciary. He ruled that for the “discrete case management matter” it was appropriate to use AI as a tool for the swift production of decisions. He underlined that, above all, he remained the “decision-maker”, “responsible for [the decision]”, and that “the evaluative faculty, weighing-up the arguments, and framing the terms of the order” was his and not the AI’s. In other words, he did not delegate the decision-making process to the AI tool (at [42]-[49]). This development highlights the technology's growing integration into judicial processes whilst simultaneously raising questions about verification protocols.
Can Retrieval-Augmented Generation solve matters?
Some suggest that Retrieval-Augmented Generation (RAG) (where AI searches documents or databases before responding) offers a solution. Research confirms that RAG can reduce factual errors by grounding answers in external sources (Lewis et al, 2020; Shuster et al, 2021). For legal applications, this might mean searching case databases or verified statutes before generating outputs.
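At its core, RAG simply retrieves relevant source text first and then instructs the model to answer from those sources, so that outputs can be grounded in (and checked against) real documents. A toy sketch of the retrieve-then-generate pattern (the corpus, the keyword-overlap scoring, and the prompt format are purely illustrative, not any vendor's API):

```python
# Toy retrieve-then-generate pipeline: score documents by keyword
# overlap with the query, then build a prompt grounded in the sources.
def retrieve(query: str, corpus: dict[str, str], top_k: int = 1) -> list[str]:
    q_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc_id: len(q_words & set(corpus[doc_id].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

corpus = {
    "mata_v_avianca": "Sanctions imposed for citing six non-existent opinions.",
    "texas_op_705": "Lawyers must independently verify any AI-generated content.",
}
hits = retrieve("verify AI-generated content", corpus)
# A grounded prompt would then be passed to the language model:
prompt = "Answer using ONLY these sources:\n" + "\n".join(corpus[h] for h in hits)
```

Even in this simplified form, the limitation is visible: retrieval narrows what the model reads, but the generation step remains probabilistic and can still misquote or misapply whatever it retrieves.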
But RAG is no panacea. OpenAI’s mathematical analysis applies to all language models, including those with retrieval (Kalai et al, 2025). RAG cannot:
- Correct reasoning errors or misapplications.
- Prevent misquoting or misinterpreting sources.
- Help in genuinely novel cases with no precedent available.
For lawyers, the message is clear: RAG may reduce risk, but verification remains non-negotiable.
Practical strategies for lawyers
Just as our review of the MIT “cognitive debt” study urged lawyers to “Think First” before delegating to AI, OpenAI’s hallucination research points to practical safeguards:
- Verify before you file
Treat every AI-generated citation, statute, or factual claim as unverified until independently checked. Failure to do so risks not only embarrassment but professional sanctions.
- Redesign your prompts
- Design prompts that minimise hallucination risk. Instead of "Find cases supporting X" (which invites fabrication), try "Draft an argument structure for X, leaving citation brackets for me to fill in manually" or "Suggest search terms for finding cases about X." This keeps AI in an assistive role where you control verification.
- Request AI outputs in a format that makes verification easier, for instance: "List all case citations separately with court, year, and holding".
- Alternatively, use AI only for tasks where errors are immediately detectable (drafting templates, brainstorming arguments) rather than factual research.
- Use AI as scout, not advocate
Treat outputs as research leads, not authoritative content. AI can accelerate discovery, but the responsibility for accuracy always rests with the lawyer.
The path forward
Hallucinations are not a passing glitch. Rather, they are mathematically inevitable under current AI architectures. Waiting for “better models” will not remove the risk.
For the legal profession, this means that whilst AI can accelerate research and other time-consuming tasks, it cannot replace professional judgment. Furthermore, courts will continue to punish misuse. The real question is whether lawyers can integrate these tools responsibly before further eroding public trust in the justice system.
Your thoughts?
How is your practice addressing the verification challenge? Do you treat AI as a useful assistant, or as an unreliable risk? We’d love to hear your thoughts: info@deep-lex.com
Sources
- Merken, Sara, 'Some US Judges Move Beyond Fines to Keep Lawyers' AI Errors in Check' Reuters (16 September 2025) https://www.reuters.com/legal/government/some-judges-move-beyond-fines-keep-lawyers-ai-errors-check-2025-09-16/ accessed 3 October 2025; see also Ambrogi, Bob, 'Nevada Judge Takes Creative and Unusual Approach to Combat AI-Generated Fictitious Citations' LawSites (September 2025) https://www.lawnext.com/2025/09/nevada-judge-takes-creative-and-unusual-approach-to-combat-ai-generated-fictitious-citations.html accessed 3 October 2025.
- VP Evans (as executrix of HB Evans, deceased) & Ors v The Commissioners for HMRC [2025] UKFTT 1112 (TC) https://caselaw.nationalarchives.gov.uk/ accessed 3 October 2025.
- Kalai, Adam Tauman and others, 'Why Language Models Hallucinate' (OpenAI, 2025) arXiv:2509.04664 https://arxiv.org/pdf/2509.04664v1 accessed 3 October 2025.
- State Bar of Texas Professional Ethics Committee, Opinion No 705 (April 2025) https://www.legalethicstexas.com/resources/opinions/opinion-705/ accessed 3 October 2025.
- American Bar Association Standing Committee on Ethics and Professional Responsibility, Formal Opinion 512: Generative Artificial Intelligence Tools (29 July 2024) https://www.americanbar.org/content/dam/aba/administrative/professional_responsibility/ethics-opinions/aba-formal-opinion-512.pdf accessed 3 October 2025.
- Charlotin, Damien, 'AI Hallucination Cases' (Damien Charlotin, 2025) https://www.damiencharlotin.com/hallucinations/ accessed 3 October 2025.
- Mata v Avianca, Inc No 22-cv-1461, 2023 WL 4114965 (SDNY 22 June 2023); Weiser, Benjamin, 'Here's What Happens When Your Lawyer Uses ChatGPT' The New York Times (New York, 27 May 2023) https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html accessed 3 October 2025.
- Vectara, 'Hallucination Leaderboard: Model Hallucination and Factual Consistency Benchmarks' (2025) https://github.com/vectara/hallucination-leaderboard/ accessed September 2025.
- Lewis, Patrick and others, 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks' (2020) 33 Advances in Neural Information Processing Systems 9459.
- Shuster, Kurt and others, 'Retrieval Augmentation Reduces Hallucination in Conversation' in Findings of the Association for Computational Linguistics: EMNLP 2021 (Association for Computational Linguistics 2021) 3784.
