Search Results for author: Anton Alyakin

Found 7 papers, 3 papers with code

Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

no code implementations • 29 May 2025 • Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Jin Vivian Lee, Daniel Alexander Alber, Karl L. Sangwon, Douglas Kondziolka, Eric Karl Oermann

This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements.

Medical large language models are easily distracted

1 code implementation • 1 Apr 2025 • Krithik Vishwanath, Anton Alyakin, Daniel Alexander Alber, Jin Vivian Lee, Douglas Kondziolka, Eric Karl Oermann

Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance.

RAG Retrieval-augmented Generation

It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

no code implementations • 13 Mar 2025 • Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, Eric Karl Oermann

Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 × 10^-5), which was greater than the human performance decline of 22.29%.

Multiple-choice

LqRT: Robust Hypothesis Testing of Location Parameters using Lq-Likelihood-Ratio-Type Test in Python

1 code implementation • 27 Nov 2019 • Anton Alyakin, Yichen Qin, Carey E. Priebe

To the extent that the robustness of the Wilcoxon test (minimum asymptotic relative efficiency (ARE) of the Wilcoxon test vs the t-test is 0.864) suggests that the Wilcoxon test should be the default test of choice (rather than "use Wilcoxon if there is evidence of non-normality", the default position should be "use Wilcoxon unless there is good reason to believe the normality assumption"), the results in this article suggest that the LqRT is potentially the new default go-to test for practitioners.

Methodology
