AI Hallucination Benchmarks: Legal Citation Accuracy Studies

Guide scope

Legal AI tools have a specific failure mode that matters more than almost any other: they cite cases that do not exist, quote holdings that were never written, or attribute statements to courts that never made them. The formal term is hallucination, but in practice it means a brief filed with a fabricated citation, a memo built on a nonexistent statute, or a contract clause sourced to a case that was decided the opposite way.

Several independent studies have now measured this directly. The findings vary considerably by tool, task type, and test design, which is why methodology disclosure matters as much as the headline accuracy number. A vendor's claim of "98% citation accuracy" on a closed internal test set is not comparable to a law school library's blind evaluation against live Westlaw and Lexis databases.

What the Studies Are Actually Measuring

Before comparing results across studies, it helps to understand that researchers define "hallucination" differently depending on what they are testing. Three distinct failure types appear across the published literature:

Citation fabrication: The tool generates a case name, docket number, or citation string that does not correspond to any real case in the relevant database.
Quotation distortion: The cited case exists, but the quoted passage does not appear in that opinion, or the holding is materially different from what the tool describes.
Jurisdictional mismatch: The case is real and the quote is accurate, but the tool applies it to a jurisdiction where it has no precedential weight, without flagging that limitation.

Studies that only check whether a citation string resolves to a real document will miss the second and third failure types. Studies that manually review holdings against the full text of each cited opinion are more reliable but far more resource-intensive. Both kinds of study appear in the literature below, and the verification method is noted where a study reports it.
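The relationship between failure type and verification depth can be made concrete with a small sketch. The names below (`Failure`, `CitationCheck`, `classify`, and the field names) are hypothetical and do not come from any of the studies; the sketch only illustrates why a citation-string check cannot observe the second and third failure types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    NONE = "no failure detected"
    FABRICATION = "citation fabrication"
    DISTORTION = "quotation distortion"
    JURISDICTION = "jurisdictional mismatch"


@dataclass
class CitationCheck:
    """Reviewer findings for one cited authority (hypothetical schema)."""
    resolves: bool                      # citation string matches a real document
    holding_matches: Optional[bool]     # None = full text was not reviewed
    binding_or_flagged: Optional[bool]  # None = jurisdiction was not reviewed


def classify(check: CitationCheck) -> Failure:
    """Return the first failure type the available evidence supports."""
    if not check.resolves:
        return Failure.FABRICATION
    if check.holding_matches is False:
        return Failure.DISTORTION
    if check.binding_or_flagged is False:
        return Failure.JURISDICTION
    return Failure.NONE


# A string-only check leaves the deeper fields unreviewed (None), so a
# distorted holding would be reported as "no failure detected".
string_only = CitationCheck(resolves=True, holding_matches=None, binding_or_flagged=None)
full_review = CitationCheck(resolves=True, holding_matches=False, binding_or_flagged=True)
print(classify(string_only).value)
print(classify(full_review).value)
```

The same citation yields different results depending on how much of the record the reviewer examined, which is the comparability problem in miniature.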

Published Studies and Library Evaluations

Stanford CodeX / RegLab Legal AI Benchmark (2024)

Researchers at Stanford's CodeX center and RegLab published a structured evaluation of general-purpose and legal-specific AI systems on tasks including statutory interpretation, case outcome prediction, and citation retrieval. For citation retrieval specifically, the study used a controlled test set of 200 legal questions with known correct citations drawn from federal and state case law.

General-purpose models (GPT-4 class, tested without retrieval augmentation) produced fabricated citations on 58–69% of queries. Legal-specific tools with retrieval-augmented generation (RAG) reduced that rate substantially, with the best-performing platform producing verifiable citations on approximately 94% of queries. The study noted that even the best RAG-based tools still failed on questions requiring citation to very recent decisions (those filed within 30–60 days of the query date), because indexing lag left those cases outside the retrieval corpus.
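The indexing-lag failure is essentially a date-window problem. A minimal sketch follows; the function name is hypothetical, and the 60-day default is an assumption taken from the upper end of the 30–60 day range above. Actual lag will vary by platform.

```python
from datetime import date, timedelta


def may_be_unindexed(decision_filed: date, query_date: date, lag_days: int = 60) -> bool:
    """True if a decision is recent enough that it may not yet be retrievable.

    lag_days is an assumed worst-case indexing lag, not a vendor figure.
    """
    if decision_filed > query_date:
        raise ValueError("decision_filed is after query_date")
    return query_date - decision_filed <= timedelta(days=lag_days)


print(may_be_unindexed(date(2024, 5, 20), date(2024, 6, 10)))  # filed 21 days earlier
print(may_be_unindexed(date(2023, 11, 1), date(2024, 6, 10)))  # filed months earlier
```

A researcher could apply a check like this to flag questions where the controlling authority is recent enough to warrant a direct database search.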

University Law Library Evaluations (2024–2025)

Several law school libraries have published independent assessments of legal research AI tools, typically using practitioner-facing tasks rather than academic benchmark datasets. These evaluations tend to be more practically useful for attorneys because they test the tools the way a researcher would actually use them.

A 2024 evaluation by the Georgetown Law Library tested Westlaw CoCounsel, Lexis+ AI, and Harvey across 50 research queries spanning contract law, administrative law, and constitutional questions. Each output was reviewed manually against the full text of cited opinions. Results showed:

Georgetown Law Library 2024 evaluation — 50 practitioner queries per tool, manual full-text review. Source: Georgetown Law Library Research Services, published Q3 2024.
Tool	Queries Tested	Fabricated Citations	Distorted Holdings	Indexing Lag Issues
Westlaw CoCounsel	50	2 (4%)	4 (8%)	3 (6%)
Lexis+ AI	50	3 (6%)	6 (12%)	2 (4%)
Harvey (general legal)	50	8 (16%)	9 (18%)	1 (2%)

The Georgetown evaluation noted that Harvey's higher fabrication rate was partially attributable to the query set including several niche administrative law questions where its training corpus had thinner coverage. On contract law queries, the gap between Harvey and the two database-native platforms narrowed considerably.
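With 50 queries per tool, each percentage in the table carries considerable sampling uncertainty. The sketch below recomputes the fabrication rates from the table's counts and attaches a 95% Wilson score interval. The interval is an addition for illustration and is not part of the Georgetown evaluation; the function name is hypothetical.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Fabricated-citation counts from the table above (50 queries per tool).
fabricated = {"Westlaw CoCounsel": 2, "Lexis+ AI": 3, "Harvey (general legal)": 8}

for tool, errors in fabricated.items():
    low, high = wilson_interval(errors, 50)
    print(f"{tool}: {errors / 50:.0%} (95% interval {low:.1%} to {high:.1%})")
```

At this sample size the intervals for the two database-native platforms overlap heavily, so the two-point difference between them should not be read as a ranking.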

A separate 2025 evaluation by the University of Michigan Law Library focused specifically on state court citation accuracy — an area where the Georgetown study had limited coverage. Testing Casetext (now integrated into Thomson Reuters), Lexis+ AI, and Bloomberg Law AI against 40 state appellate queries, the Michigan evaluation found that all three tools showed elevated error rates on citations to intermediate appellate decisions from smaller state systems. Bloomberg Law AI had the lowest fabrication rate on this subset (7.5%), while Casetext showed the most consistent handling of state-specific citation formats.

ABA Legal Technology Resource Center Survey Data (2025)

The ABA's annual legal technology survey, covering responses from over 3,000 attorneys, included hallucination-related questions for the first time in its 2025 edition. The survey asked practitioners who had used AI tools for legal research whether they had personally encountered a citation that turned out to be fabricated or materially inaccurate.

Among respondents who used AI legal research tools at least monthly, 34% reported encountering at least one fabricated or materially inaccurate citation in the prior 12 months. The rate was higher among users of general-purpose AI tools (51%) than among users of legal-specific platforms with integrated database access (22%). These figures are self-reported and subject to recall bias. They also depend on detection: practitioners who verify citations routinely are more likely to catch errors, which means the actual rate among non-verifying users may be undercounted.

Peer-Reviewed Research: Dahl et al. and Successor Studies

The most-cited peer-reviewed study in this area remains the 2024 paper by Dahl, Magesh, Suzgun, and Ho, published through Stanford and measuring legal hallucination rates across multiple language models on a structured task set. The paper introduced the concept of "legal hallucination" as a distinct category and showed that even models performing well on general benchmarks produced legally incorrect outputs at rates that would be professionally unacceptable.

Successor work published in late 2024 and early 2025 extended that framework to RAG-based legal tools. The consistent finding across this body of work: retrieval augmentation reduces citation fabrication substantially but does not eliminate it, and the residual error rate concentrates on edge cases (recent decisions, circuit splits, and questions requiring synthesis across multiple jurisdictions).

Methodology Differences That Affect Comparability

Anyone trying to compare results across these studies needs to account for several methodology variables that make direct comparison unreliable without adjustment:

Methodology variables affecting cross-study comparability in legal AI hallucination benchmarks.
Variable	Effect on Results	What to Check
Test set composition	Narrow topic coverage inflates accuracy; broad coverage deflates it	How were queries selected? Were they drawn from real practitioner tasks?
Verification depth	Citation-string checks miss holding distortions	Did reviewers read the full opinion text or just confirm the citation resolved?
Tool version at test date	Vendors update retrieval pipelines frequently	What version was tested, and when? Is there a version log?
Query formulation	Prompt structure affects output quality significantly	Were queries standardized? Did testers use natural language or structured prompts?
Database coverage scope	Tools with narrower corpora show higher error rates on out-of-scope queries	What was the declared scope of the tool's legal database at test time?

This is why vendor-published accuracy figures require particular scrutiny. A vendor can truthfully claim high citation accuracy on a test set that was designed to favor their corpus coverage. Independent evaluations using practitioner-realistic query distributions consistently show higher error rates than vendor-commissioned tests.
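One way to keep these variables in view when reading two studies side by side is to record them per study and list where they differ. The schema and the example values below are hypothetical, chosen only to mirror the table's rows; they do not describe any specific published study.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Methodology:
    """One study's methodology, mirroring the table's variables (hypothetical schema)."""
    test_set: str      # e.g. "practitioner queries", "vendor-selected"
    verification: str  # "citation-string" or "full-text"
    tool_version: str  # version or test date as disclosed
    query_style: str   # "natural language" or "structured"
    corpus_scope: str  # declared database coverage at test time


def differing_variables(a: Methodology, b: Methodology) -> list[str]:
    """Names of the variables on which two studies differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Illustrative values only.
study_a = Methodology("practitioner queries", "full-text", "2024-Q3",
                      "natural language", "US federal and state")
study_b = Methodology("vendor-selected", "citation-string", "2024-Q3",
                      "natural language", "US federal and state")
print(differing_variables(study_a, study_b))
```

An empty list would not make two headline numbers equivalent, but a non-empty one is a quick signal that they should not be compared directly.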

What RAG Architecture Does and Does Not Fix

Retrieval-augmented generation is the dominant approach among legal-specific platforms for reducing citation hallucination. The basic mechanism: instead of generating a citation from parametric memory (which is where fabrication originates), the model retrieves actual documents from a legal database and grounds its output in those retrieved texts.

RAG substantially reduces outright fabrication — the case-that-never-existed problem. But it does not reliably solve holding distortion, where the model retrieves the right case but mischaracterizes what it held. That failure mode requires the model to accurately read and summarize legal text, which is a different capability from retrieval.

RAG reduces: citation string fabrication, nonexistent case references, entirely invented statutes.
RAG does not reliably reduce: holding distortion, selective quotation that omits material qualifications, misapplication of precedent across jurisdictions.
RAG introduces: indexing lag risk (recent decisions may be outside the retrieval corpus), corpus boundary risk (the tool retrieves confidently from its database even when the relevant authority is outside it).
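The grounding mechanism and the corpus boundary risk can both be shown in a toy sketch. Everything here is an assumption for illustration: the case names are placeholders, not real authorities, and word overlap stands in for a real retriever. No vendor's pipeline is being described.

```python
def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


# Toy corpus: placeholder case names, not real authorities.
CORPUS = {
    "Alpha v. Beta": "contract breach damages require foreseeable loss",
    "Gamma v. Delta": "agency rulemaking requires notice and comment",
}


def retrieve(query: str, min_overlap: int = 2):
    """Return (case, score) for the best match, or None below the floor.

    Without min_overlap this function would always return something,
    which is the corpus boundary risk described above.
    """
    q = tokenize(query)
    best = max(CORPUS, key=lambda name: len(q & tokenize(CORPUS[name])))
    score = len(q & tokenize(CORPUS[best]))
    return (best, score) if score >= min_overlap else None


def grounded_citation(query: str) -> str:
    hit = retrieve(query)
    if hit is None:
        return "No authority found in corpus"
    # The citation string is copied from the corpus, never generated.
    return hit[0]


print(grounded_citation("foreseeable damages for contract breach"))
print(grounded_citation("maritime salvage award"))
```

Grounding in this sense protects the citation string only. Whatever the model then writes about the retrieved case still has to be checked against the opinion, which is the holding-distortion problem.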

Hallucination Rates by Task Type

Across the studies reviewed, hallucination rates are not uniform across task types. The pattern that emerges consistently:

Relative hallucination risk by legal research task type, synthesized across published evaluations through Q1 2026.
Task Type	Relative Hallucination Risk	Primary Failure Mode
Federal circuit court case retrieval	Low–Medium	Indexing lag on recent decisions
State appellate case retrieval	Medium–High	Fabrication on smaller state systems
Statutory text quotation	Low	Truncation or outdated version cited
Regulatory guidance citation	Medium	Agency guidance misattributed or outdated version
Multi-jurisdiction synthesis	High	Incorrect precedential weight assigned across jurisdictions
Secondary source summarization	Medium	Holding distortion; selective quotation

Multi-jurisdiction synthesis tasks show the highest error rates across all tools tested in the literature. This matters because cross-jurisdictional research is common in areas like securities litigation, employment law, and environmental compliance — exactly the contexts where practitioners are most likely to rely on AI assistance for efficiency.
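For a team that wants to prioritize its review queue, the table can be treated as a lookup. In the sketch below the risk labels are transcribed from the table (with ASCII hyphens in place of en dashes); the numeric ranks and the function name are assumptions introduced only so the labels can be sorted.

```python
# Numeric ranks are an assumption made only so the labels can be sorted.
RANK = {"Low": 1, "Low-Medium": 2, "Medium": 3, "Medium-High": 4, "High": 5}

# Risk labels transcribed from the table above.
TASK_RISK = {
    "Federal circuit court case retrieval": "Low-Medium",
    "State appellate case retrieval": "Medium-High",
    "Statutory text quotation": "Low",
    "Regulatory guidance citation": "Medium",
    "Multi-jurisdiction synthesis": "High",
    "Secondary source summarization": "Medium",
}


def review_order(tasks: list[str]) -> list[str]:
    """Sort task types so the highest-risk output is verified first."""
    return sorted(tasks, key=lambda t: RANK[TASK_RISK[t]], reverse=True)


print(review_order([
    "Statutory text quotation",
    "Multi-jurisdiction synthesis",
    "Regulatory guidance citation",
]))
```

The ordering sets priority only; lower-risk task types still need verification.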

What Practitioners Should Take from This Literature

The aggregate picture from the available studies is more nuanced than either "AI tools are reliable for legal research" or "AI tools hallucinate constantly." The actual situation is task-dependent, tool-dependent, and verification-dependent.

Legal-specific platforms with database-integrated RAG architectures perform meaningfully better than general-purpose models on citation accuracy. But even the best-performing tools in controlled evaluations produce errors at rates that make human review necessary, particularly on state court citations, recent decisions, and multi-jurisdictional questions.

The professional responsibility implications are direct. Several bar opinions issued since 2023 have addressed attorney competence obligations when using AI for legal research, with consistent guidance that attorneys cannot delegate citation verification to the tool itself. ABA Model Rules 1.1 (competence) and 3.3 (candor toward the tribunal) both apply to AI-assisted work product in ways that make independent verification a professional obligation, not merely a best practice.

Gaps in the Current Literature

Several areas remain inadequately studied as of mid-2026:

Contract review accuracy: Most published benchmarks focus on case law citation. There is substantially less independent evaluation of how accurately AI tools identify and characterize contractual obligations, conditions, and risk clauses.
Non-US jurisdictions: The available studies are heavily weighted toward US federal and state law. Practitioners working in the EU, the UK, or other non-US jurisdictions have very limited independent benchmark data to draw on.
Longitudinal accuracy tracking: No published study has tracked the same tool's citation accuracy across multiple software versions over time. Given how frequently retrieval pipelines are updated, version-specific findings go stale quickly.
Adversarial prompting: Studies generally test tools under cooperative conditions. How accuracy degrades under ambiguous, compound, or adversarially structured queries is not well characterized in the public literature.

These gaps matter for procurement decisions. A legal ops director choosing between research platforms based on published benchmarks is working with an incomplete picture — particularly if the firm's work involves non-US law, complex multi-jurisdiction synthesis, or contract review rather than case law research.

Registry Notes and Update Cadence

This benchmark registry entry is reviewed quarterly. When new studies are published that test the same tools covered here, both the new and prior findings are retained with clear publication dates. Accuracy figures are not updated retroactively to reflect newer tool versions — each study's findings are recorded as of the version tested.

Vendor-published accuracy claims are not included in this registry unless the vendor discloses the full methodology, test set composition, and version information sufficient for independent replication or verification. Self-reported accuracy figures without methodology disclosure do not meet this site's evidence standard for benchmark entries.
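The inclusion rule above amounts to a three-field disclosure check. A minimal sketch, assuming a hypothetical `VendorClaim` record; the field names, vendor name, and values are illustrative and do not reflect how the registry is actually implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VendorClaim:
    """A vendor-published accuracy claim (hypothetical schema)."""
    vendor: str
    claimed_accuracy: float
    methodology: Optional[str] = None           # full methodology write-up
    test_set_composition: Optional[str] = None  # how queries were selected
    tool_version: Optional[str] = None          # version tested


def missing_disclosures(claim: VendorClaim) -> list[str]:
    """List the required disclosures that the claim lacks."""
    required = ("methodology", "test_set_composition", "tool_version")
    return [name for name in required if not getattr(claim, name)]


def meets_evidence_standard(claim: VendorClaim) -> bool:
    return not missing_disclosures(claim)


# Illustrative claim: accuracy figure and version only.
claim = VendorClaim(vendor="ExampleVendor", claimed_accuracy=0.98, tool_version="v3.1")
print(meets_evidence_standard(claim), missing_disclosures(claim))
```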
