no code implementations • 29 May 2025 • Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Jin Vivian Lee, Daniel Alexander Alber, Karl L. Sangwon, Douglas Kondziolka, Eric Karl Oermann
This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements.
1 code implementation • 1 Apr 2025 • Krithik Vishwanath, Anton Alyakin, Daniel Alexander Alber, Jin Vivian Lee, Douglas Kondziolka, Eric Karl Oermann
Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance.
no code implementations • 13 Mar 2025 • Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, Eric Karl Oermann
Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10⁻⁵), which was greater than the human performance decline of 22.29%.
no code implementations • 26 Feb 2025 • Anton Alyakin, Jaden Stryker, Daniel Alexander Alber, Karl L. Sangwon, Jin Vivian Lee, Brandon Duderstadt, Akshay Save, David Kurland, Spencer Frome, Shrutika Singh, Jeff Zhang, Eunice Yang, Ki Yun Park, Cordelia Orillac, Aly A. Valliani, Sean Neifert, Albert Liu, Aneek Patel, Christopher Livia, Darryl Lau, Ilya Laufer, Peter A. Rozman, Eveline Teresa Hidalgo, Howard Riina, Rui Feng, Todd Hollon, Yindalon Aphinyanaphongs, John G. Golfinos, Laura Snyder, Eric Leuthardt, Douglas Kondziolka, Eric Karl Oermann
Using NeuroPubs, VLMs generated publication-ready graphical abstracts (70% of 100 abstracts) and board-style questions indistinguishable from human-written ones (54% of 89,587 questions).
no code implementations • 14 Dec 2024 • Gabriel R. Rosenbaum, Lavender Yao Jiang, Ivaxi Sheth, Jaden Stryker, Anton Alyakin, Daniel Alexander Alber, Nicolas K. Goff, Young Joon Fred Kwon, John Markert, Mustafa Nasir-Moin, Jan Moritz Niehues, Karl L. Sangwon, Eunice Yang, Eric Karl Oermann
We test GPT-4, Llama3-70b, and PalmyraMed-70b, a specialized medical model.
1 code implementation • Findings (EMNLP) 2021 • Kelly Marchisio, Youngser Park, Ali Saad-Eldin, Anton Alyakin, Kevin Duh, Carey Priebe, Philipp Koehn
Alternatively, word embeddings may be understood as nodes in a weighted graph.
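The graph view described above can be sketched in a few lines: each word becomes a node, and edges are weighted by embedding similarity. The cosine measure, the neighbor count `k`, and the toy embeddings below are illustrative assumptions, not details from the paper:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def embedding_graph(embeddings, k=2):
    """Treat each embedding as a node; connect it to its k most-similar
    neighbors, weighting each edge by cosine similarity."""
    n = len(embeddings)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        sims = [(cosine(embeddings[i], embeddings[j]), j)
                for j in range(n) if j != i]
        for weight, j in sorted(sims, reverse=True)[:k]:
            graph[i][j] = weight
    return graph

# Toy 3-d embeddings for four "words" (values are illustrative only).
emb = [[1.0, 0.1, 0.0],
       [0.9, 0.2, 0.1],
       [0.0, 1.0, 0.2],
       [0.1, 0.9, 0.3]]
g = embedding_graph(emb, k=1)
```

With `k=1`, each node keeps only its single nearest neighbor, so the first two (mutually similar) toy vectors link to each other, as do the last two.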
1 code implementation • 27 Nov 2019 • Anton Alyakin, Yichen Qin, Carey E. Priebe
To the extent that the robustness of the Wilcoxon test (minimum asymptotic relative efficiency (ARE) of the Wilcoxon test vs the t-test is 0.864) suggests that the Wilcoxon test should be the default test of choice (rather than "use Wilcoxon if there is evidence of non-normality", the default position should be "use Wilcoxon unless there is good reason to believe the normality assumption"), the results in this article suggest that the LqRT is potentially the new default go-to test for practitioners.
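For intuition, the rank-sum statistic underlying the two-sample Wilcoxon test mentioned above can be computed directly. This is a plain sketch of the classical statistic, not the paper's LqRT; the sample sizes and Gaussian draws are illustrative assumptions:

```python
import random

def wilcoxon_rank_sum(x, y):
    """Rank-sum statistic W: the sum of the ranks of the x-sample within
    the pooled, sorted data (assumes no ties, as with continuous data)."""
    pooled = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    return sum(rank for rank, (_, label) in enumerate(pooled, start=1)
               if label == "x")

random.seed(0)
# Two shifted samples; with heavy-tailed noise instead of Gaussian,
# the rank-based test's robustness advantage over the t-test shows up.
x = [random.gauss(0.0, 1.0) for _ in range(20)]
y = [random.gauss(0.5, 1.0) for _ in range(20)]
w = wilcoxon_rank_sum(x, y)
```

Because ranks, not raw values, enter the statistic, a single extreme outlier shifts W by at most one rank position per observation, which is the robustness property the ARE bound of 0.864 quantifies.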
Methodology