1 code implementation • 17 Feb 2024 • Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber
Question answering (QA) can only make progress if we know whether an answer is correct, but for many of the most challenging and interesting QA examples, current answer correctness (AC) metrics do not align with human judgments, particularly for verbose, free-form answers from large language models (LLMs).
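The sketch below illustrates the misalignment this entry describes, assuming standard SQuAD-style lexical metrics (exact match and token F1) as the answer-correctness scorers; the function names and the QA pair are illustrative, not the paper's implementation.

```python
# Minimal sketch: lexical answer-correctness metrics can disagree with human
# judgment on verbose, free-form LLM answers. Metrics follow the common
# SQuAD-style exact match and token F1; names here are illustrative.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

gold = "Neil Armstrong"
llm_answer = ("The first person to walk on the Moon was the American astronaut "
              "Neil Armstrong, during the Apollo 11 mission in July 1969.")

# A human accepts the verbose answer, yet both lexical scores stay low.
print(exact_match(llm_answer, gold))          # 0.0
print(round(token_f1(llm_answer, gold), 2))   # ~0.2
```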
1 code implementation • 29 Jan 2024 • Zongxia Li, Andrew Mao, Daniel Stephens, Pranav Goel, Emily Walpole, Alden Dima, Juan Fung, Jordan Boyd-Graber
Topic models are a popular tool for understanding text collections, but their evaluation has been a point of contention.
no code implementations • 24 Jan 2024 • Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Boyd-Graber
Question answering (QA) can only make progress if we know whether an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics for determining answer equivalence (AE) often do not align with human judgments, particularly for more verbose, free-form answers from large language models (LLMs).
6 code implementations • 23 Oct 2023 • Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs.
Ranked #1 on Visual Question Answering (VQA) on HallusionBench
1 code implementation • 11 Jul 2023 • Fuxiao Liu, Paiheng Xu, Zongxia Li, Yue Feng
We investigate the role of various demonstration components in the in-context learning (ICL) performance of large language models (LLMs).
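As a rough illustration of what "demonstration components" means here, the sketch below builds a few-shot ICL prompt from its usual parts (task instruction, demonstration inputs, demonstration labels, and the input-label pairing format); the exact decomposition studied in the paper may differ, and the task and examples are made up.

```python
# Illustrative sketch of the components an in-context learning prompt is
# assembled from; the specific components ablated in the paper may differ.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out after twenty minutes.", "negative"),
]
instruction = "Classify the sentiment of the review as positive or negative."
query = "The plot dragged, but the acting was superb."

prompt_lines = [instruction, ""]
for text, label in demonstrations:
    # Each demonstration pairs an input with its label in a fixed format.
    prompt_lines.append(f"Review: {text}")
    prompt_lines.append(f"Sentiment: {label}")
    prompt_lines.append("")
prompt_lines.append(f"Review: {query}")
prompt_lines.append("Sentiment:")  # the LLM completes the label for the query

print("\n".join(prompt_lines))
```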
1 code implementation • 13 Oct 2022 • Haozhe An, Zongxia Li, Jieyu Zhao, Rachel Rudinger
A common limitation of diagnostic tests for detecting social biases in NLP models is that they may only detect stereotypic associations that are pre-specified by the designer of the test.