AI Legal Research Accuracy Benchmarks & Hallucination Rates

Guide scope

Legal AI platforms have been making accuracy claims for several years. What has been harder to find is independent, methodologically consistent data to compare those claims against. A growing body of law school library evaluations, academic studies, and bar-commissioned assessments has started to fill that gap — but the picture is messier than vendor marketing suggests.

This entry indexes the studies that meet the site's evidence threshold: independent methodology, disclosed test sets or evaluation criteria, and a traceable primary source. Where studies use different methodologies, those differences are noted — because a 5% hallucination rate measured against a curated question bank is not the same thing as a 5% rate measured against adversarial prompts drawn from real case research.

What the Studies Are Actually Measuring

"Hallucination rate" is used inconsistently across studies. Before comparing numbers, it helps to know which failure mode each study is counting.

Citation fabrication: The tool cites a case, statute, or secondary source that does not exist. This is the most commonly sanctioned failure mode in documented court incidents.
Citation misattribution: The cited source exists, but the quoted passage or holding is wrong: the source says something different from what the tool claims.
Jurisdictional error: The tool retrieves law from the wrong jurisdiction or fails to flag that a cited rule varies by state.
Temporal error: The tool cites law that has since been amended, overruled, or superseded without flagging the change.
Contextual distortion: The cited source is real and accurately quoted, but the tool's framing of what it means for the research question is materially wrong.

Studies that count only citation fabrication will report lower hallucination rates than studies that capture all five categories, even for identical tool outputs. Two headline figures are comparable only when both studies count the same failure modes.
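The effect of the counting rule can be shown with a small Python sketch. The category names mirror the five failure modes above; the ten reviewed outputs are invented purely for illustration and do not come from any study indexed here.

```python
from enum import Enum


class FailureMode(Enum):
    CITATION_FABRICATION = "citation_fabrication"
    CITATION_MISATTRIBUTION = "citation_misattribution"
    JURISDICTIONAL_ERROR = "jurisdictional_error"
    TEMPORAL_ERROR = "temporal_error"
    CONTEXTUAL_DISTORTION = "contextual_distortion"


def hallucination_rate(reviewed_outputs, counted_modes):
    """Share of outputs with at least one failure among counted_modes.

    reviewed_outputs: list of sets of FailureMode (empty set = no failure found).
    counted_modes: set of FailureMode values the study chooses to count.
    """
    if not reviewed_outputs:
        raise ValueError("no reviewed outputs")
    flagged = sum(1 for modes in reviewed_outputs if modes & counted_modes)
    return flagged / len(reviewed_outputs)


# Hypothetical expert-review labels for 10 tool outputs (illustrative only).
SAMPLE = [
    set(), set(), set(), set(), set(), set(),
    {FailureMode.CITATION_FABRICATION},
    {FailureMode.TEMPORAL_ERROR},
    {FailureMode.JURISDICTIONAL_ERROR, FailureMode.CONTEXTUAL_DISTORTION},
    {FailureMode.CITATION_MISATTRIBUTION},
]

# Same outputs, two counting rules.
narrow = hallucination_rate(SAMPLE, {FailureMode.CITATION_FABRICATION})
broad = hallucination_rate(SAMPLE, set(FailureMode))
print(f"fabrication only: {narrow:.0%}, all five modes: {broad:.0%}")
```

On this made-up sample the same tool scores 10% under a fabrication-only definition and 40% when all five modes are counted.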

Published Benchmarks by Study

Stanford CodeX / RegLab Evaluations

Stanford's CodeX center and affiliated RegLab researchers have produced some of the most methodologically rigorous public evaluations of large language models in legal contexts. Their work on general-purpose LLMs applied to legal tasks documented hallucination rates ranging from roughly 17% to over 50% depending on the model and task type — with citation fabrication concentrated in tasks requiring specific case citation rather than general legal explanation.

The Stanford work is notable for distinguishing between tasks. Models performed substantially better on legal reasoning questions (explaining a rule, comparing two doctrines) than on citation-specific tasks (find a case supporting X proposition in Y jurisdiction). This distinction matters because most legal research workflows involve both.

Law School Library Evaluations (2024–2025)

Several law school libraries, including those at Cornell, Georgetown, and UC Berkeley, published structured evaluations of Westlaw CoCounsel, Lexis+ AI, and Harvey between mid-2024 and 2025. These evaluations used controlled question sets drawn from actual research tasks, with human expert review of outputs.

Findings across these library evaluations showed meaningful differences between RAG-grounded tools and those relying more heavily on base model generation. Tools with tighter retrieval constraints — where the model is explicitly limited to citing documents within the platform's licensed corpus — showed citation fabrication rates below 5% on structured tasks. Tools with looser retrieval architectures, or those allowing the model to generate citations beyond the retrieved set, showed rates between 15% and 30% on the same question types.

The LegalBench Evaluation Framework

LegalBench, developed by a multi-institution academic collaboration and published through arXiv, provides a standardized task battery for evaluating LLM performance on legal reasoning. It covers 162 tasks across six legal reasoning categories. LegalBench is not primarily a hallucination benchmark — it measures reasoning quality — but several of its sub-tasks directly test citation accuracy and statutory interpretation, which are proxies for hallucination risk in legal research contexts.

LegalBench scores for GPT-4-class models cluster around 60–70% accuracy on the harder reasoning tasks, with significant variance by category. Contract interpretation tasks score higher than constitutional law tasks. Jurisdictional specificity tasks ("is this conduct legal in California?") show some of the lowest accuracy scores, which aligns with the jurisdictional error failure mode identified above.

Platform-Level Comparison: What Independent Studies Show

The table below summarizes findings from independent evaluations (not vendor claims) for the three platforms most frequently tested in published studies as of Q2 2026, with ungrounded general-purpose LLMs included as a baseline. Methodology type and study source are noted for each figure.

Citation fabrication rates from independent studies. Ranges reflect variation across question types and study cohorts. Figures are not directly comparable across rows due to methodology differences.
Platform	Citation Fabrication Rate	Task Type Tested	Methodology	Study Source
Westlaw CoCounsel	< 3–5%	Structured case law research	Controlled question set, expert review	Law school library evaluations, 2024–2025
Lexis+ AI	< 5–8%	Structured case law research	Controlled question set, expert review	Law school library evaluations, 2024–2025
Harvey (GPT-4 base)	15–25%	Open-ended legal research queries	Adversarial prompts, expert review	Academic evaluations, 2024
General-purpose LLMs (ungrounded)	25–60%	Citation-specific legal tasks	Automated citation check + expert review	Stanford CodeX, LegalBench studies

Why RAG Architecture Matters for These Numbers

The performance gap between grounded tools (Westlaw CoCounsel, Lexis+ AI) and less-constrained configurations is largely explained by retrieval-augmented generation (RAG) architecture. In a well-implemented RAG system, the model is constrained to cite only documents it has actually retrieved from the licensed corpus, and any citation that does not match a retrieved document can be detected and blocked. This sharply reduces invented citations rather than eliminating them outright: the grounded tools in the table above still show nonzero fabrication rates in independent testing.
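A minimal sketch of that kind of constraint, written as a post-generation check. This is an assumption about how such a guard could work, not a description of any vendor's implementation. The `RetrievedDoc` type, the function names, and the simplified citation pattern are all hypothetical; real citation formats are far more varied than this regex covers.

```python
import re
from dataclasses import dataclass

# Simplified pattern for citations such as "410 U.S. 113" or "999 F.3d 123".
# Assumption for illustration only: production citators handle many more formats.
CITATION_RE = re.compile(
    r"\b\d{1,4} (?:U\.S\.|F\.\dd|F\. Supp\. \dd|S\. Ct\.) \d{1,4}\b"
)


@dataclass(frozen=True)
class RetrievedDoc:
    citation: str  # canonical citation of the retrieved document
    text: str      # retrieved passage


def ungrounded_citations(answer: str, retrieved: list[RetrievedDoc]) -> list[str]:
    """Return citations found in the answer that are absent from the retrieved set."""
    allowed = {doc.citation for doc in retrieved}
    return [c for c in CITATION_RE.findall(answer) if c not in allowed]


def enforce_grounding(answer: str, retrieved: list[RetrievedDoc]) -> str:
    """Reject, rather than silently repair, an answer citing unretrieved sources."""
    missing = ungrounded_citations(answer, retrieved)
    if missing:
        raise ValueError(f"ungrounded citations: {missing}")
    return answer


docs = [RetrievedDoc("347 U.S. 483", "retrieved passage text")]
flagged = ungrounded_citations("See 347 U.S. 483 and 999 F.3d 123.", docs)
print(flagged)
```

A check like this only confirms that a cited document was retrieved. It says nothing about whether the document supports the proposition it is cited for, so misattribution and contextual distortion remain possible in a strictly grounded tool.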

The main limitation of strict RAG is coverage rather than fabrication. If the underlying corpus doesn't contain the relevant case (because it's from a jurisdiction the platform doesn't cover, or because it's a very recent decision), the tool either returns nothing or retrieves something adjacent. That's a different failure mode than fabrication, but it's still a research failure. Evaluations that only count fabrication will miss this.
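One way an evaluation could score coverage failures separately from fabrication is sketched below. The outcome labels and the inputs (`cited`, `known_sources`, `relevant`) are hypothetical simplifications, not a scheme taken from any of the studies indexed here.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    NO_RESULT = "no_result"      # coverage gap: the tool returned nothing
    ADJACENT = "adjacent"        # real source, but not the relevant authority
    FABRICATION = "fabrication"  # cited source does not exist


def classify(cited, known_sources, relevant):
    """Classify a single response.

    cited:         citation the tool returned, or None if it returned nothing
    known_sources: set of citations verified to exist
    relevant:      set of citations expert reviewers accept as on point
    """
    if cited is None:
        return Outcome.NO_RESULT
    if cited not in known_sources:
        return Outcome.FABRICATION
    if cited in relevant:
        return Outcome.CORRECT
    return Outcome.ADJACENT


def fabrication_rate(outcomes):
    """Share of responses citing a nonexistent source."""
    return sum(o is Outcome.FABRICATION for o in outcomes) / len(outcomes)


def research_failure_rate(outcomes):
    """Share of responses that did not return a relevant authority."""
    return sum(o is not Outcome.CORRECT for o in outcomes) / len(outcomes)


# Four hypothetical responses to the same research question.
known, on_point = {"A", "B"}, {"A"}
outcomes = [classify(c, known, on_point) for c in ["A", None, "B", "Z"]]
print(fabrication_rate(outcomes), research_failure_rate(outcomes))
```

In this toy example one response in four is a fabrication, but three in four fail the researcher. A fabrication-only metric would report the first figure and hide the second.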

Tools that allow the model to reason beyond the retrieved set — to synthesize or extrapolate — gain flexibility at the cost of fabrication risk. Some platforms offer both modes, with the more constrained mode available as a "verified citations only" setting. Whether that setting is the default matters significantly for practitioners who don't read the documentation.

Methodology Gaps in Available Studies

The current body of independent benchmarking has several gaps that limit how much weight practitioners should put on any single number.

Most studies use structured question sets that don't reflect the open-ended, iterative way attorneys actually use these tools in practice.
Version currency is a persistent problem. A study published in early 2025 may reflect a platform version that has since been substantially updated. Westlaw CoCounsel and Lexis+ AI both pushed significant model updates in 2025.
Few studies test non-US jurisdictions. Most published evaluations focus on US federal and state case law. Hallucination rates for UK, EU, or Australian legal research tasks are much less well-documented.
Temporal accuracy is rarely isolated. Studies that test whether tools correctly flag overruled or superseded authority are rare, even though this failure mode has direct malpractice implications.
No study has yet published a consistent cross-platform evaluation using the same question set, the same expert reviewers, and the same version of each platform tested simultaneously.

What Practitioners Should Do With These Numbers

Benchmark numbers are useful for filtering, not for final selection. A platform with a documented 25% citation fabrication rate in independent testing should not be used for unsupervised legal research regardless of its other features. A platform with a sub-5% rate in controlled testing still requires attorney verification before any citation goes into a brief.
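As a rough illustration of "filtering, not final selection", the sketch below screens platforms on the upper end of their reported range. The upper bounds are copied from the comparison table in this entry; the cutoff value and the `shortlist` helper are hypothetical, and because the table's rows rest on different methodologies the result is a coarse screen at best.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    platform: str
    fabrication_upper_pct: float  # upper end of the reported range


# Upper bounds copied from the comparison table in this entry.
ROWS = [
    BenchmarkRow("Westlaw CoCounsel", 5),
    BenchmarkRow("Lexis+ AI", 8),
    BenchmarkRow("Harvey (GPT-4 base)", 25),
    BenchmarkRow("General-purpose LLMs (ungrounded)", 60),
]


def shortlist(rows, max_upper_pct):
    """Keep platforms whose reported upper bound is below the cutoff.

    Passing the filter only earns a closer look; every citation from a
    shortlisted tool still needs attorney verification.
    """
    return [r.platform for r in rows if r.fabrication_upper_pct < max_upper_pct]


# Hypothetical cutoff: the text only says a documented 25% rate is too high
# for unsupervised research; it does not prescribe a threshold.
candidates = shortlist(ROWS, max_upper_pct=25)
print(candidates)
```

The filter narrows the field to the two grounded tools. It does not rank them, and it does not remove the verification step for either.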

Bar ethics guidance in most jurisdictions now explicitly requires attorneys to understand the limitations of AI tools they use — not just to supervise outputs, but to have a working understanding of how the tool generates those outputs. Knowing whether a tool uses strict RAG constraints or allows model generation beyond the retrieved set is part of that competence obligation.

The ABA's formal guidance on attorney competence and AI tools and the state bar opinions from California, New York, and Florida consistently frame this as a supervision and verification obligation, not an absolute prohibition on use. The practical implication is that citation verification cannot be delegated to the tool itself.

Versioning Note

This entry will be updated as new independent studies are published. When a new study supersedes an older finding for a specific platform, both records are retained with clear dating — consistent with this site's editorial commitment to traceable, versioned benchmark data. Vendor-published accuracy claims without disclosed methodology are not included here regardless of the figures they report.
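The retention rule described above could be modeled along these lines. This is a sketch of one possible record structure, not the site's actual data model; the field names and the `supersede` and `current` helpers are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class BenchmarkRecord:
    platform: str
    study: str
    published: date
    finding: str
    superseded_by: Optional[str] = None  # name of the newer study, if any


def supersede(records, new):
    """Add a new record and mark older current records for the same platform.

    Older records are kept, never deleted, so the history stays traceable.
    """
    updated = [
        replace(r, superseded_by=new.study)
        if r.platform == new.platform and r.superseded_by is None
        else r
        for r in records
    ]
    return updated + [new]


def current(records, platform):
    """Return the non-superseded records for a platform."""
    return [r for r in records if r.platform == platform and r.superseded_by is None]


# Placeholder data only; these are not real studies or findings.
old = BenchmarkRecord("Platform A", "Study 1", date(2024, 9, 1), "finding v1")
new = BenchmarkRecord("Platform A", "Study 2", date(2025, 6, 1), "finding v2")
history = supersede([old], new)
print(len(history), [r.study for r in current(history, "Platform A")])
```

Both records remain in `history`; only the newer one is returned as current, and the older one carries a pointer to the study that replaced it.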
