- NIH study finds AI tools excel at diagnosing genetic diseases from textbook descriptions, but struggle with patient-written health summaries.
- Large language models' accuracy in diagnosing genetic conditions varied widely, with GPT-4 performing best at 90% accuracy for textbook-based questions.
- When tested on real patient data, the best AI model's accuracy dropped to 21%, highlighting the need for more diverse training data in clinical AI applications.
According to a recent report, National Institutes of Health (NIH) researchers found that while artificial intelligence (AI) tools can accurately diagnose genetic diseases from textbook descriptions, they struggle with patient-written health summaries.
The study focused on large language models (LLMs), which are AI systems trained on vast amounts of text. These models have the potential to be valuable in medicine due to their ability to analyze questions and offer user-friendly responses.
“We may not always think of it this way, but so much of medicine is words-based,” said Ben Solomon, M.D., senior author of the study and clinical director at the NIH’s National Human Genome Research Institute (NHGRI).
“For example, electronic health records and the conversations between doctors and patients all consist of words. Large language models have been a huge leap forward for AI, and being able to analyze words in a clinically useful way could be incredibly transformational.”
The researchers tested 10 LLMs, including two recent versions of ChatGPT, by designing questions based on medical textbooks about 63 genetic conditions. These included well-known conditions like sickle cell anemia and cystic fibrosis, as well as rare ones.
For each condition, they selected 3 to 5 common symptoms and created questions in a standard format: “I have X, Y, and Z symptoms. What’s the most likely genetic condition?” The models varied widely in their accuracy, with results ranging from 21% to 90%. GPT-4, one of the latest versions of ChatGPT, performed the best.
The models' success generally tracked their size, with smaller models using billions of parameters and the largest using over a trillion. Lower-performing models improved with further experiments, and the models generally outperformed non-AI technologies, including Google search.
Researchers tested different approaches to optimizing the models, such as replacing medical terms with more common language. For instance, "macrocephaly" was replaced with "a big head" to better reflect how patients or caregivers might describe symptoms.
Although accuracy dropped when medical terms were removed, 7 out of 10 models were still more accurate than Google searches when using everyday language.
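To make the setup concrete, the standardized question format and the swap from medical terms to everyday language can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function names are hypothetical, the template wording is adapted from the pattern quoted in the article, and the only term mapping taken from the article is "macrocephaly" to "a big head". The other symptoms are placeholders.

```python
# Illustrative sketch only; not the study's actual code.
# Only the "macrocephaly" entry comes from the article.
LAY_TERMS: dict[str, str] = {"macrocephaly": "a big head"}


def to_lay_language(symptom: str) -> str:
    """Return an everyday-language version of a symptom, if one is known."""
    return LAY_TERMS.get(symptom.lower(), symptom)


def join_symptoms(symptoms: list[str]) -> str:
    """Join symptoms as 'X', 'X and Y', or 'X, Y, and Z'."""
    if not symptoms:
        raise ValueError("at least one symptom is required")
    if len(symptoms) == 1:
        return symptoms[0]
    if len(symptoms) == 2:
        return f"{symptoms[0]} and {symptoms[1]}"
    return ", ".join(symptoms[:-1]) + ", and " + symptoms[-1]


def build_question(symptoms: list[str], lay: bool = False) -> str:
    """Build a standardized question, optionally using everyday language."""
    if lay:
        symptoms = [to_lay_language(s) for s in symptoms]
    return f"I have {join_symptoms(symptoms)}. What's the most likely genetic condition?"


if __name__ == "__main__":
    # Placeholder symptom list; not tied to any specific condition in the study.
    example = ["macrocephaly", "seizures", "developmental delay"]
    print(build_question(example))
    print(build_question(example, lay=True))
```

Running the script prints the same question twice, once with the medical term and once with the everyday phrasing, which mirrors the two kinds of prompts the models were compared on.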
Furthermore, researchers tested large language models using real patient data from the NIH Clinical Center. Patients provided brief, varied descriptions of their genetic conditions. The best model made accurate diagnoses only 21% of the time, with others as low as 1%. The variability in patient descriptions proved challenging for the models, which performed better with standardized questions.
Dr. Solomon noted that to be clinically useful, these models need more diverse data reflecting different patient experiences. This study highlights AI's potential and limitations in healthcare, emphasizing the need for human oversight as AI becomes more common in clinical settings.
Edited by Harshajit Sarmah