AI Falls Short in Real-World Genetic Disease Diagnosis

NIH study finds AI tools excel at diagnosing genetic diseases from textbook descriptions, but struggle with patient-written health summaries.
Large language models' accuracy in diagnosing genetic conditions varied widely, with GPT-4 performing best at 90% accuracy for textbook-based questions.
When tested on real patient data, the best AI model's accuracy dropped to 21%, highlighting the need for more diverse training data in clinical AI applications.

According to a recent report, National Institutes of Health (NIH) researchers found that while artificial intelligence (AI) tools can accurately diagnose genetic diseases from textbook descriptions, they struggle with patient-written health summaries.

The study focused on large language models (LLMs), which are AI systems trained on vast amounts of text. These models have the potential to be valuable in medicine due to their ability to analyze questions and offer user-friendly responses.

“We may not always think of it this way, but so much of medicine is words-based,” said Ben Solomon, M.D., senior author of the study and clinical director at the NIH’s National Human Genome Research Institute (NHGRI).

“For example, electronic health records and the conversations between doctors and patients all consist of words. Large language models have been a huge leap forward for AI, and being able to analyze words in a clinically useful way could be incredibly transformational.”

The researchers tested 10 LLMs, including two recent versions of ChatGPT, by designing questions based on medical textbooks about 63 genetic conditions. These included well-known conditions like sickle cell anemia and cystic fibrosis, as well as rare ones.

For each condition, they selected 3 to 5 common symptoms and created questions in a standard format: “I have X, Y, and Z symptoms. What’s the most likely genetic condition?” The models varied widely in their accuracy, with results ranging from 21% to 90%. GPT-4, one of the latest versions of ChatGPT, performed the best.
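
The standard question format lends itself to a short illustration. The Python sketch below is not the researchers' code; the function names `build_question` and `accuracy` are hypothetical, and exact-match scoring is an assumption made here for simplicity, since the article does not say how answers were graded.

```python
def build_question(symptoms):
    """Format 3 to 5 symptoms into the standard question quoted in the article."""
    if not 3 <= len(symptoms) <= 5:
        raise ValueError("expected 3 to 5 symptoms")
    # Join as "X, Y, and Z" (the last item is preceded by "and").
    listed = ", ".join(symptoms[:-1]) + ", and " + symptoms[-1]
    return f"I have {listed} symptoms. What's the most likely genetic condition?"


def accuracy(predictions, answers):
    """Fraction of predictions that match the expected condition, ignoring case.

    Exact string matching is an assumption for this sketch only.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Placeholder symptoms, mirroring the "X, Y, and Z" wording of the template.
print(build_question(["X", "Y", "Z"]))
```

With the placeholders above, the sketch prints the same template sentence the researchers are described as using.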

The models' success generally tracked their size: the smaller models have billions of parameters, while the largest have over a trillion. Lower-performing models improved with further experiments, and the models still outperformed non-AI technologies, including Google search.

Researchers optimized the models by testing different approaches, such as replacing medical terms with more common language. For instance, "macrocephaly" was replaced with "a big head" to better reflect how patients or caregivers might describe symptoms.

Although accuracy dropped when medical terms were removed, 7 out of 10 models were still more accurate than Google searches when everyday language was used.
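
The swap from medical terms to everyday language can be pictured as a simple lookup. The Python sketch below is only an illustration under stated assumptions: the name `to_lay_language` and the dictionary `LAY_TERMS` are hypothetical, the only term pair taken from the article is "macrocephaly" and "a big head", and the article does not describe how the researchers actually performed the replacement.

```python
import re

# The one substitution named in the article; a real list would be far longer.
LAY_TERMS = {"macrocephaly": "a big head"}


def to_lay_language(text, terms=LAY_TERMS):
    """Replace whole-word medical terms with everyday wording, ignoring case.

    Keys in `terms` are expected to be lowercase.
    """
    if not terms:
        return text
    # Try longer terms first so a short term never pre-empts a longer one.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: terms[m.group(0).lower()], text)


print(to_lay_language("I have macrocephaly, Y, and Z symptoms."))
```

Here the placeholder question becomes "I have a big head, Y, and Z symptoms.", which is closer to how a patient or caregiver might phrase it.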

Furthermore, researchers tested large language models using real patient data from the NIH Clinical Center. Patients provided brief, varied descriptions of their genetic conditions. The best model made accurate diagnoses only 21% of the time, with others as low as 1%. The variability in patient descriptions proved challenging for the models, which performed better with standardized questions.

Dr. Solomon noted that to be clinically useful, these models need more diverse data reflecting different patient experiences. This study highlights AI's potential and limitations in healthcare, emphasizing the need for human oversight as AI becomes more common in clinical settings.

Edited by Harshajit Sarmah
