2023
DOI: 10.1101/2023.10.26.23297629
Preprint

Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations

Zhichao Yang,
Zonghai Yao,
Mahbuba Tasmin
et al.

Abstract: Importance Using artificial intelligence (AI) to support clinical diagnosis has been an active research topic for more than six decades. Little research, however, has achieved the scale and accuracy required for clinical practice. The tide may be turning today with the power of large language models (LLMs). In this application, we evaluated accuracy on the medical licensing exam using the newly released Generative Pre-trained Transformer 4 with vision (GPT-4V), a large multimodal model trained to analyze image inputs w…


Cited by 6 publications (4 citation statements)
References 52 publications
“…The best performance under the open-book setting is achieved by human physicians (95%, CI: 91-99%), though not significantly different from GPT-4V. Our findings, therefore, align with the previous ones, which show the superior performance of GPT-4V in the closed-book setting 12,13 .…”
supporting
confidence: 89%
“…Wu et al examine GPT-4V's potential in multimodal medical diagnosis, demonstrating substantial promise yet revealing limitations in high-stakes domains [6]. Echoing this view, Yang et al assessed the performance of Multimodal GPT-4V in medical licensing exams, particularly in imaging diagnostics, offering a glimpse into future support systems for medical professionals [7]. The concept of "Socratic models" by Zeng et al brings forth the idea of zero-shot multimodal reasoning, allowing models to compose answers from disparate sources without explicit training [8].…”
Section: Novel Roles for Multimodal Large Language and Vision Models
mentioning
confidence: 99%
“…The creative maker spaces have become vibrant hubs of 21st-century innovation, merging the traditional tactile experience with digital fabrication and design. However, integrating new artificial intelligence (AI) tools and, in particular, the current generation of multimodal large language models (LLMs) [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17] into these environments has the potential to enhance human creativity and innovation [18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. In recent years, the intersection of AI and multimodal (MM) learning has spawned a generation of models that integrate and interpret information across various forms of data, including text, images, and speech.…”
Section: Introduction
mentioning
confidence: 99%
“…Recent studies have further explored the diagnostic application of multimodal LLMs (also called 'vision-language models') that are able to ingest not only text but also image data as input (12)(13)(14)(15)(16)(17)(18)(19)(20). However, several studies demonstrated low performance of Generative Pretrained Transformer 4 Vision (GPT-4V) by OpenAI in differential diagnosis based on various types of radiological images (12,16,18,20,21).…”
Section: Introduction
mentioning
confidence: 99%