1 code implementation • 20 Jan 2025 • Nishant Balepur, Vishakh Padmakumar, Fumeng Yang, Shi Feng, Rachel Rudinger, Jordan Lee Boyd-Graber
However, this preference data format does not convey why users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs.
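A minimal sketch of the gap this points at, assuming the common chosen/rejected preference-record format; the field names and the augmented "user need" fields are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative sketch (not the paper's format): a standard preference record
# stores which response was chosen, but not why the user preferred it.
standard_record = {
    "prompt": "Explain overfitting.",
    "chosen": "Overfitting is when a model memorizes training data instead of generalizing...",
    "rejected": "Overfitting is bad.",
}

# Hypothetical augmented record that also captures the user's need and the
# reason for the preference, so a model could learn to tailor responses.
# These extra fields are assumptions for illustration only.
augmented_record = {
    **standard_record,
    "user_need": "concise, beginner-friendly explanation",
    "preference_reason": "chosen answer defines the term and gives an example",
}
```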
no code implementations • 14 Nov 2024 • Feng Gu, Wichayaporn Wongkamjan, Jordan Lee Boyd-Graber, Jonathan K. Kummerfeld, Denis Peskoff, Jonathan May
AIs can beat humans in game environments; however, how helpful those agents are to humans remains understudied.
no code implementations • 24 Jun 2024 • Yoo yeon Sung, Maharshi Gor, Eve Fleisig, Ishani Mondal, Jordan Lee Boyd-Graber
Adversarial datasets should ensure that AI robustness matches human performance.
2 code implementations • 16 Jun 2024 • Xiyang Wu, Tianrui Guan, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha
This motivates the development of AutoHallusion, the first automated benchmark generation approach that employs several key strategies to create a diverse range of hallucination examples.
Ranked #1 on Visual Question Answering (VQA) on AutoHallusion
no code implementations • 7 Jun 2024 • Wichayaporn Wongkamjan, Feng Gu, Yanze Wang, Ulf Hermjakob, Jonathan May, Brandon M. Stewart, Jonathan K. Kummerfeld, Denis Peskoff, Jordan Lee Boyd-Graber
The boardgame Diplomacy is a challenging setting for communicative and cooperative artificial intelligence.
1 code implementation • 17 Feb 2024 • Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber
Question answering (QA) can only make progress if we know if an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs).
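A minimal sketch of the failure mode described above, assuming normalized exact match as the baseline answer-correctness metric; the metric choice and the example answer are illustrative assumptions, not taken from the paper.

```python
# Sketch: why string-matching AC metrics struggle with verbose LLM answers.
def exact_match(prediction: str, gold: str) -> bool:
    """Baseline AC metric: normalized exact string match."""
    normalize = lambda s: " ".join(s.lower().strip().split())
    return normalize(prediction) == normalize(gold)

gold = "Paris"
llm_answer = "The capital of France is Paris, and it has been for centuries."

print(exact_match(llm_answer, gold))       # False, even though the answer is correct
print(gold.lower() in llm_answer.lower())  # naive containment passes this case, but
                                           # over-credits answers that merely mention
                                           # the gold string
```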
1 code implementation • 3 Dec 2023 • Hyojung Han, Jordan Lee Boyd-Graber, Marine Carpuat
Translations help people understand content written in another language.
1 code implementation • NeurIPS 2021 • Alexander Hoyle, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Lee Boyd-Graber, Philip Resnik
To address the standardization gap, we systematically evaluate a dominant classical model and two state-of-the-art neural models on two commonly used datasets.
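A minimal sketch of automated topic-model evaluation in this spirit, assuming LDA (via gensim) as the classical model and NPMI coherence as the automated metric; both choices are assumptions for illustration, not details confirmed by the snippet above.

```python
# Sketch: fit a classical topic model and score it with an automated
# coherence metric (NPMI). Real evaluations would use full corpora and
# compare against human judgments of topic quality.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [
    ["topic", "model", "evaluation", "coherence"],
    ["neural", "topic", "model", "document"],
    ["classical", "model", "lda", "corpus"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

npmi = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                      coherence="c_npmi").get_coherence()
print(f"NPMI coherence: {npmi:.3f}")
```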