The citation problem in medical AI
Every clinician has been trained to cite properly. When you make a claim in a presentation, a paper, or a clinical discussion, you are expected to back it with a reference — not a vague gesture toward "the literature" but a specific source that your audience can look up and verify. This is foundational to evidence-based medicine. It is how we distinguish opinion from evidence, how we build on prior work, and how we hold each other accountable.
AI tools have introduced a strange regression in citation standards. We now have systems that generate confident, well-structured medical answers — and then either cite nothing, cite vaguely, or cite sources that turn out to be fabricated. The answer sounds authoritative. The evidence behind it is absent or illusory.
This should concern every clinician who uses AI tools in their work. Not because AI is inherently unreliable, but because the current standard for AI citations in medicine is far below what we would accept from a colleague, a trainee, or a textbook. And yet we are increasingly treating AI-generated answers as if they meet that standard.
The three failures of AI citations
AI citation failures in medicine fall into three categories, and each is dangerous in a different way.
Failure one: hallucinated references. This is the best-documented problem. Large language models sometimes generate references that do not exist — plausible-sounding journal titles, realistic author names, convincing DOIs that lead nowhere. The reference looks real. It is formatted correctly. But the paper was never written, the study was never conducted, and the data was never collected.
Hallucinated references are not rare edge cases. Studies examining AI-generated citations have found fabrication rates that should alarm anyone relying on these tools for clinical or academic work. The danger is compounded by the fact that checking a hallucinated reference requires going to a database, searching for it, and confirming it does not exist — a process most users skip because the reference looks legitimate on its face.
Failure two: shallow citations. Even when the reference is real, AI tools frequently cite at the wrong level of granularity. They cite a journal article but not the page. They cite a guideline but not the section. They cite a systematic review but not the specific analysis within it that supports their claim.
In practice, a citation to "Smith et al., Lancet, 2023" without a page number or specific finding is barely more useful than no citation at all. The paper might be 15 pages long. The claim the AI made might come from a single sentence in the discussion section — or it might not appear in the paper at all. Without granularity, you cannot verify without re-reading the entire paper. That defeats the purpose of having a citation.
Failure three: wrong sources entirely. This is the subtlest and arguably most damaging failure. The AI gives you a citation that is real and that does say something related to the topic — but it is not the right kind of source for a clinical claim. It cites a review article instead of the original RCT. It cites a blog post instead of the guideline. It cites an abstract instead of the full-text paper with the methods and results you need to evaluate the evidence.
A citation is not just a reference. It is a claim about the provenance of evidence. When the citation is wrong — hallucinated, shallow, or misdirected — the entire chain of reasoning it supports is compromised.
The sourcing problem: health media vs peer-reviewed literature
Beyond outright fabrication, there is a systemic sourcing problem that most clinicians are not yet aware of. When general AI tools like ChatGPT and Gemini cite sources in response to health-related queries, the majority of those sources are not peer-reviewed research.
Analyses of AI chatbot citations have documented a striking pattern. Nearly a third of citations in health-related queries point to health media sites — hospital blogs, wellness portals, patient-education pages. Another quarter come from commercial and affiliate-driven websites. The remaining citations are spread across government sites, social media, and miscellaneous web content. Actual academic and research sources account for less than a quarter of all citations.
Put concretely: when you ask ChatGPT about the management of muscle-invasive bladder cancer, the sources underpinning its answer are more likely to be a hospital's patient information page than the EAU guideline, more likely to be a health news article about a trial than the trial itself. The AI is drawing from a web that is dominated by secondary and tertiary interpretations of medical evidence, not from the evidence itself.
This matters because each layer of interpretation introduces distortion. The original RCT reports a hazard ratio with a confidence interval. The journal news piece rounds the numbers and drops the confidence interval. The hospital blog simplifies further. The wellness portal reframes it as "promising results." By the time the AI synthesises from these web sources, the nuance of the original evidence — the nuance that matters for clinical decisions — has been stripped away.
The AI does not cite the science. It cites the commentary on the science. For clinical decisions, that distance between source and citation is the difference between evidence-based medicine and well-formatted opinion.
Why page-level granularity is the minimum standard
In clinical practice, a citation serves one purpose: it lets you verify the claim. If the citation does not enable verification, it is decorative — it creates the appearance of evidence without the substance.
Page-level granularity means the citation tells you not just which document, but which page within that document, and ideally which passage on that page. This is the level of granularity that makes verification possible in seconds rather than minutes. You open the document, go to the page, read the passage, and confirm the claim. Or — equally important — you discover that the AI's interpretation of the passage was wrong, and you correct course before the error propagates into your presentation, your paper, or your clinical decision.
This is not a theoretical concern. AI tools regularly make subtle errors in synthesis — merging findings from two different studies, overstating the strength of a recommendation, or presenting a finding from one subgroup analysis as if it applies to the entire study population. These errors are detectable only if you can check the source quickly. If checking the source requires searching through a 200-page guideline or re-reading a 15-page paper, most clinicians will not do it. They will trust the AI — and the error will go undetected.
Page-level citation is not a premium feature. It is the minimum standard for any AI tool that claims to support evidence-based medicine. Without it, "evidence-based" is just a marketing term.
Why most AI tools lack proper citations
Understanding why most AI tools have poor citations helps explain why the problem is not simple to fix — and why architectural choices matter.
General-purpose AI models like GPT-4 or Gemini are trained on massive corpora of web text. They learn patterns, not sources. When they generate a response, they are predicting the most likely next token based on patterns in their training data — they are not retrieving a specific passage from a specific document. They have no concept of "page 47 of this guideline." They have a statistical summary of everything they were trained on, blended together.
This is why hallucination happens. The model generates text that is statistically plausible — including reference-like text that follows the pattern of a citation — without any mechanism to verify that the reference exists. It is not lying. It is pattern-matching, and sometimes the pattern it matches is a reference that was never written.
Adding web search to these models (as ChatGPT has done) partially addresses the problem by grounding some answers in real web pages. But it does not solve the sourcing problem — web search finds web pages, which are overwhelmingly health media and commercial content, not peer-reviewed literature. And it does not provide page-level granularity, because web pages do not have page numbers.
The architectural problem is fundamental: a model that answers from general knowledge cannot cite a specific page in a specific document, because it never read a specific page in a specific document. It read a statistical summary of the entire internet. Page-level citation requires a fundamentally different architecture.
How RAG architecture solves the citation problem
Retrieval-Augmented Generation — RAG — is the architectural approach that makes proper citations possible. The difference is structural, not cosmetic.
In a RAG system, the AI does not answer from general knowledge. Instead, it first retrieves specific passages from a defined document collection, and then generates an answer using those passages as context. Each passage carries metadata — which document it came from, which page, which section. When the AI uses a passage to support a claim, that metadata becomes the citation.
This is fundamentally different from a model that generates citations after the fact. In a RAG system, the citation is not an afterthought — it is baked into the retrieval step. The system found a passage on page 47, used that passage to generate part of the answer, and reports page 47 as the citation. The citation is a byproduct of the retrieval process, not a separate generation step that might hallucinate.
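The mechanism can be made concrete with a minimal sketch. This is illustrative pseudocode-quality Python, not a real library: the names (`Passage`, `retrieve`, `answer_with_citations`) are hypothetical, and the toy word-overlap ranking stands in for a real retriever. The point it demonstrates is structural — every citation is read off the metadata of a retrieved passage, so it can never point to a document the system did not actually consult.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    """One indexed chunk, carrying its provenance metadata."""
    text: str
    document: str
    page: int

def retrieve(query: str, index: list[Passage], top_k: int = 3) -> list[Passage]:
    """Toy retrieval: rank passages by overlap with the query words.
    A real system would use embeddings; the metadata flow is the same."""
    words = set(query.lower().split())
    ranked = sorted(index, key=lambda p: -len(words & set(p.text.lower().split())))
    return ranked[:top_k]

def answer_with_citations(query: str, index: list[Passage]) -> dict:
    hits = retrieve(query, index)
    # The generation step (an LLM call in a real system) is omitted here;
    # what matters is that each citation is a byproduct of retrieval,
    # not a separately generated string that could be hallucinated.
    citations = [f"{p.document}, p. {p.page}" for p in hits]
    return {"context": [p.text for p in hits], "citations": citations}
```

Because the citation string is derived from the retrieved passage's own metadata, the answer may still misread the passage, but the pointer back to document and page is always real.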
RAG does not eliminate all error. The retrieval step might find a passage that is tangentially relevant rather than directly relevant. The AI might misinterpret the passage. But the citation — the pointer back to the source — is always real. It always points to an actual passage in an actual document on an actual page. You can always verify.
In RAG, the citation comes from retrieval, not generation. The AI cites what it found, not what it imagined. This is the architectural difference that makes medical AI citations trustworthy.
The verification loop: citations as invitations to read
There is a deeper reason why page-level citations matter, beyond simply checking whether the AI got it right. Citations change how you interact with the tool.
When an AI gives you an answer without a citation, you are in a trust-or-reject binary. You either accept the answer at face value or you discard it entirely. There is no middle ground, because there is nothing to check.
When an AI gives you an answer with a page-level citation, you have a third option: verify and refine. Click the citation. Read the original passage. Confirm the claim, or discover nuance the AI missed, or find that the context changes the interpretation. This verification loop takes seconds — because you are going directly to the right page — and it transforms the tool from an oracle you must trust into a research assistant whose work you can check.
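The "verify and refine" step is essentially a mechanical lookup, sketched below under stated assumptions: `Citation` and `verify_citation` are hypothetical names, and `pages` stands in for the extracted text of a document keyed by page number. In practice a human does this by opening the PDF at the cited page; the sketch shows why page-level granularity makes that check fast.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    document: str
    page: int
    passage: str  # the exact span the answer claims support from

def verify_citation(cit: Citation, pages: dict[int, str]) -> bool:
    """Does the cited passage actually appear on the cited page?

    `pages` maps page numbers to extracted text for `cit.document`.
    With a page number, this is one lookup plus one read; with only a
    document title, it would mean scanning every page of the source.
    """
    return cit.passage in pages.get(cit.page, "")
```

A document-level citation forces the reader to search the whole file; a page-level one reduces verification to a single targeted read.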
In clinical practice, this distinction is everything. We do not make decisions based on unchecked claims. We verify. We consider context. We weigh evidence against our own clinical experience. A tool that supports this process — by making verification fast and easy — is infinitely more useful than a tool that demands blind trust.
Properly cited AI answers also have a paradoxical effect: they make you read more, not less. Every citation is an invitation to revisit the source material. In my own experience, using a citation-first tool has led me back into documents I had not opened in months — re-reading passages I had forgotten, discovering adjacent content I had overlooked, building a richer understanding of literature I thought I already knew.
Citation-first as a requirement, not a feature
The medical community is still in the early stages of figuring out how AI fits into clinical practice. Guidelines on AI use in medicine are being drafted. Professional bodies are issuing position statements. Hospitals are developing policies. This is the right time to establish expectations.
My position is that page-level citation should be a hard requirement for any AI tool used in clinical or academic medical work. Not a premium feature. Not an optional add-on. A requirement. If a tool cannot tell you exactly where its answer came from — document, page, passage — it should not be used for clinical decision support, for academic writing, or for teaching.
This is not a high bar. It is the same standard we apply to colleagues, to trainees, to textbooks, and to ourselves: show me where, let me verify, and then I will incorporate it into my thinking.
We would not accept "I read it somewhere" from a trainee. We should not accept it from an AI tool. The standard for evidence is the same regardless of whether the answer comes from a person or a machine.
What this means for how you choose tools
When evaluating any AI tool for medical literature work, ask three questions. First: does it answer from your own documents, or from general knowledge? General knowledge means web sources, which means health media and commercial content, not peer-reviewed literature. Second: does it cite specific pages and passages, or just document titles? Document-level citation is better than nothing, but it is not enough for practical verification. Third: can you verify the citation in seconds? If checking a citation requires re-reading an entire paper, the citation is not doing its job.
Medevidex was built around these three requirements because I needed a tool that met them. Every answer cites the document, the page, and the passage. Every citation links back to the source in context. Verification takes seconds. The tool retrieves from your uploaded documents — not from the web, not from training data, not from a curated corpus you cannot inspect or control.
This is not the only architecture that could work. But it is an architecture that solves the citation problem at its root, rather than applying cosmetic fixes to a fundamentally uncitable system. And in medicine, where the chain from evidence to decision to patient outcome is direct and consequential, solving the citation problem is not optional.
Read more
Why I Built Medevidex · Medevidex vs ChatGPT, Consensus, and OpenEvidence · What Is RAG and Why It Matters for Clinical AI