The use of artificial intelligence, particularly large language models (LLMs) like ChatGPT, is becoming increasingly prevalent. As a result, there is growing interest in using AI models to interpret medical information as a tool for making critical medical decisions. A research team at Ben-Gurion University of the Negev set out to examine and compare the capabilities of LLMs that specialize in medical information. The surprising findings were published in the journal Computers in Biology and Medicine.
Artificial intelligence applied to medical information has become a common tool used to answer patient questions via medical chatbots, predict diseases, create synthetic data to protect patient privacy, or generate medical questions and answers for medical students.
AI models that process textual data have proven effective in classifying information.
However, when the data in question is life-saving clinical information, a model must grasp the precise meaning of medical codes and the differences between them.
Doctoral student Ofir Ben Shoham and Dr. Nadav Rappoport from the Department of Software and Information Systems Engineering at Ben-Gurion University of the Negev set out to examine how well large language models understand the medical world and can answer questions about it. To do this, they compared general-purpose models with models that were fine-tuned on medical information.
To this end, the researchers built a dedicated evaluation benchmark, MedConceptsQA (https://github.com/nadavlab/MedConceptsQA/), for answering questions about medical concepts. Using an algorithm they developed, they automatically generated over 800,000 closed, multiple-choice questions covering international medical concepts, such as diagnoses, procedures, and drugs, at three difficulty levels, in order to assess how well language models interpret medical terms and distinguish between medical concepts. Easy questions require only basic knowledge; medium questions require somewhat more; and difficult questions require detailed understanding and the ability to identify small differences between similar medical concepts. The questions were built from existing clinical data standards for clinical codes, allowing the benchmark to test the ability to distinguish between medical concepts for tasks such as medical coding practice, summarization, automatic billing, and more.
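To make the setup concrete, here is a minimal sketch of this kind of automatic question generation. The vocabulary, field names, and distractor strategy below are illustrative assumptions, not the authors' exact algorithm, which the paper and repository describe in full.

```python
import random

# Toy vocabulary mapping medical codes to their official descriptions
# (in practice these come from standard vocabularies such as ICD-10-CM).
ICD10_DESCRIPTIONS = {
    "I21.0": "ST elevation (STEMI) myocardial infarction of anterior wall",
    "I21.1": "ST elevation (STEMI) myocardial infarction of inferior wall",
    "I21.4": "Non-ST elevation (NSTEMI) myocardial infarction",
    "I25.2": "Old myocardial infarction",
}

def make_question(code, vocab, n_choices=4, seed=0):
    """Build one closed (multiple-choice) question: given a code, the
    model must pick its correct description among distractors drawn
    from the same vocabulary. A harder variant would sample distractors
    only from codes with very similar descriptions."""
    rng = random.Random(seed)
    correct = vocab[code]
    distractors = rng.sample(
        [d for c, d in vocab.items() if c != code], n_choices - 1
    )
    choices = distractors + [correct]
    rng.shuffle(choices)
    return {
        "question": f"What is the description of the ICD-10-CM code {code}?",
        "choices": choices,
        "answer": choices.index(correct),  # index of the correct option
    }

q = make_question("I21.0", ICD10_DESCRIPTIONS)
print(q["question"])
for i, choice in enumerate(q["choices"]):
    print(f"  {chr(65 + i)}. {choice}")
```

Distractors drawn from closely related codes (here, neighboring myocardial-infarction codes) are what make the difficult questions hard: the model must notice fine-grained differences such as the infarction site.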
The research findings indicated that most models performed poorly, at a level equivalent to random guessing, including models trained on medical data. The one exception was ChatGPT-4, which performed better than the others with an average accuracy of about 60%, though still far from satisfactory.
“It seems that, for the most part, models specially trained for medical purposes achieved accuracy close to random guessing on this benchmark, despite being pre-trained on medical data,” noted Dr. Rappoport.
It should be noted that models created for general purposes (such as Llama3-70B and ChatGPT-4) achieved better results than the clinical ones. ChatGPT-4 demonstrated the best performance overall, although its accuracy remained insufficient on some of the specific medical-code questions the researchers built; it achieved an average improvement of 9-11% over Llama3-OpenBioLLM-70B, the best-performing clinical language model.
“Our benchmark serves as a valuable resource for evaluating the ability of large language models to interpret medical codes and distinguish between medical concepts. We show that most clinical language models perform at the level of random guessing, while ChatGPT-3.5, ChatGPT-4, and Llama3-70B outperform these clinical models, even though their focus is not the medical field at all,” explained doctoral student Ben Shoham. “With our question bank, we can very easily, at the push of a button, evaluate other models as they are released, and compare them.”
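As an illustration of such push-button evaluation, the following sketch scores an arbitrary model against a list of questions in the format generated above. `query_model` is a hypothetical stand-in for whatever interface the model under test exposes; it is not part of the published benchmark code.

```python
def evaluate(query_model, questions):
    """Return the accuracy of `query_model` (a function from prompt
    string to answer string) over multiple-choice questions shaped
    like the `make_question` output above."""
    correct = 0
    for q in questions:
        options = "\n".join(
            f"{chr(65 + i)}. {c}" for i, c in enumerate(q["choices"])
        )
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        predicted = query_model(prompt).strip()[:1].upper()
        if predicted == chr(65 + q["answer"]):
            correct += 1
    return correct / len(questions)

# With four options per question, random guessing yields about 25%
# accuracy, the baseline that most clinical models failed to beat.
```

Because the questions and gold answers are fixed, evaluating a newly released model only requires supplying a new `query_model` function.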
Clinical data often includes both standard medical codes and natural-language text. This research highlights the need for models with a broader grasp of clinical language in order to understand medical information, and the caution required in their widespread use. “We present a benchmark for evaluating how well models understand medical codes, and we highlight for users the need for caution when relying on this information,” concluded Dr. Rappoport.