News Release
Wednesday, August 14, 2024
NIH researchers found that large language models rely on succinct, textbook-like language to evaluate medical questions.
Researchers at the National Institutes of Health (NIH) found that while artificial intelligence (AI) tools can accurately diagnose genetic diseases from textbook-like descriptions, their accuracy drops significantly when analyzing summaries that patients write about their own health. These findings, reported in the American Journal of Human Genetics, suggest that these AI tools need to be improved before they can be applied in clinical settings to help make diagnoses and answer patient questions.
The researchers studied a type of AI known as large language models, which are trained on huge amounts of text-based data. These models could be extremely useful in healthcare because of their ability to analyze and answer questions, often through a user-friendly interface.
“We may not always think of it this way, but so much of healthcare is based on language,” said Ben Solomon, M.D., senior author of the study and clinical director of NIH's National Human Genome Research Institute (NHGRI). “For example, electronic medical records and doctor-patient conversations are all made up of language. Large language models are a giant leap for AI, and being able to analyze language in a clinically useful way could be incredibly transformative.”
The researchers tested 10 large language models, including the two latest versions of ChatGPT. Using medical textbooks and other reference materials, the researchers designed questions about 63 different genetic diseases, including well-known diseases such as sickle cell anemia, cystic fibrosis, and Marfan syndrome, as well as a number of rare genetic diseases.
These conditions can manifest in different ways depending on the patient, and the researchers aimed to capture some of the most common symptoms. They selected three to five symptoms for each condition and created questions in the standard format: “I have symptoms X, Y, and Z. What is the most likely genetic condition?”
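The standard question format described above can be illustrated with a minimal sketch. It is not the researchers' actual code: the function name `build_question` and the placeholder symptoms are assumptions made only for illustration, and only the sentence template comes from the description above.

```python
def build_question(symptoms):
    """Format a list of symptoms into the standardized question.

    Hypothetical helper: only the sentence template is taken from the
    study description; the name and the validation are assumptions.
    """
    # The study description mentions three to five symptoms per condition.
    if not 3 <= len(symptoms) <= 5:
        raise ValueError("expected three to five symptoms")
    # Join as "a, b, and c" to match the template's phrasing.
    listed = ", ".join(symptoms[:-1]) + ", and " + symptoms[-1]
    return f"I have {listed}. What is the most likely genetic condition?"


# Placeholder symptoms, not taken from the study.
print(build_question(["symptom X", "symptom Y", "symptom Z"]))
```

Replacing each placeholder with a plain-language description rather than a medical term would correspond to the common-language variant of the questions.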
When presented with these questions, the large language models varied widely in their ability to pinpoint the correct genetic diagnosis, with initial accuracies ranging from 21% to 90%. The best-performing model was GPT-4, one of the most recent versions of ChatGPT.
The success of the models generally correlated with their size, that is, the amount of data they were trained on. The smallest models had billions of parameters, and the largest had over a trillion parameters. For many of the poorly performing models, the researchers were able to improve their accuracy in subsequent experiments, and overall the models provided more accurate responses than non-AI technologies, including a standard Google search.
The researchers optimized and tested the models in different ways, including replacing medical terminology with more common words. For example, instead of stating that a child has “macrocephaly,” the question states that the child has a “large head,” more closely matching how patients and caregivers might describe symptoms to a doctor.
Overall, the accuracy of the models decreased when medical terminology was removed. However, even when common language was used, 7 of the 10 models were more accurate than Google Search.
“It's important that these tools are accessible to people without a medical background,” said Kendall Flaherty, a graduate researcher at NHGRI who led the study. “There aren't many clinical geneticists in the world, and some states and countries don't have access to these experts. AI tools may help people get some of their questions answered without having to wait years for a consultation.”
To test the effectiveness of large language models with information from real patients, the researchers asked patients at the NIH Clinical Center to provide short descriptions of their genetic conditions and symptoms. These descriptions ranged from one sentence to several paragraphs and were more diverse in style and content than the textbook-like questions.
When presented with these descriptions from real patients, the best-performing model gave the correct diagnosis only 21% of the time. Many models performed much worse, with accuracy as low as about 1%.
The researchers expected that interpreting patient-written summaries would be more difficult because patients at the NIH Clinical Center often have extremely rare conditions, so the models may not have enough information to make a diagnosis for these conditions.
But when the researchers wrote standardized questions about the same ultra-rare genetic disorders seen in NIH patients, accuracy improved. This suggests that because the models were trained on textbooks and other reference sources, which tend to be more concise and standardized, they had difficulty interpreting the varied wording and formats of patient descriptions.
“For these models to be clinically useful in the future, we need more data, and that data needs to reflect the diversity of patients,” said Dr. Solomon. “Not only should it represent all known medical conditions, but it should also represent variations in age, race, gender, cultural background, and more, so that we can capture the diversity of patient experiences in the data. Then these models can learn how different people talk about their conditions.”
In addition to showing room for improvement, this study highlights the current limitations of large language models and the ongoing need for human oversight when applying AI to healthcare.
“These technologies are already being implemented in clinical practice,” Dr. Solomon added. “The biggest question is no longer whether clinicians will use AI, but where and how clinicians should use AI, and where they should not use AI, to provide the best care for patients.”
The National Human Genome Research Institute (NHGRI) is one of 27 Institutes and Centers of the NIH, an agency of the U.S. Department of Health and Human Services. NHGRI's Intramural Research Division develops and implements technologies to understand, diagnose, and treat genomic and genetic diseases. Additional information about NHGRI can be found at https://www.genome.gov/.
About the National Institutes of Health (NIH): NIH is the nation's medical research agency, comprised of 27 Institutes and Centers, and is part of the U.S. Department of Health and Human Services. NIH is the primary federal agency conducting and supporting basic, clinical, and translational medical research, investigating the causes, treatments, and cures for both common and rare diseases. For more information about NIH and its programs, visit www.nih.gov.
NIH…Transforming Discovery into Health®