no code implementations • 12 Apr 2023 • Anjana Arunkumar, Shubham Sharma, Rakhi Agrawal, Sriram Chandrasekaran, Chris Bryan
Cross-task generalization is a significant outcome that defines mastery in natural language understanding.
1 code implementation • 9 Feb 2023 • Anjana Arunkumar, Swaroop Mishra, Bhavdeep Sachdeva, Chitta Baral, Chris Bryan
In pursuit of better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies.
no code implementations • 14 Oct 2022 • Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral
Evaluation of models on benchmarks is unreliable without knowing the degree of sample hardness; this subsequently overestimates the capability of AI systems and limits their adoption in real-world applications.
no code implementations • 14 Oct 2022 • Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral
Inspired by successful quality indices in several domains such as power, food, and water, we take the first step towards a metric by identifying certain language properties that can represent various possible interactions leading to biases in a benchmark.
no code implementations • 10 Oct 2022 • Swaroop Mishra, Anjana Arunkumar, Chitta Baral
We find limitations in AUC; e.g., a model with a higher AUC is not always better at selective answering.
10 code implementations • 16 Apr 2022 • Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi, Daniel Khashabi
This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions -- training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones.
no code implementations • 12 Mar 2022 • Swaroop Mishra, Anjana Arunkumar
Our hypothesis is based on the fact that deep neural networks are data-driven models, and data is what leads or misleads models.
no code implementations • 10 Jun 2021 • Swaroop Mishra, Anjana Arunkumar
Models that top leaderboards often perform unsatisfactorily when deployed in real-world applications; this has necessitated rigorous and expensive pre-deployment model testing.
no code implementations • 10 Jun 2021 • Swaroop Mishra, Anjana Arunkumar
We show that our algorithm produces the exact same output as BP, in contrast to several recently proposed algorithms approximating BP.
no code implementations • 10 Aug 2020 • Swaroop Mishra, Anjana Arunkumar, Bhavdeep Sachdeva, Chris Bryan, Chitta Baral
A "state of the art" model A surpasses humans in a benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not?
no code implementations • 14 Jul 2020 • Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral
In order to stop the inflation in model performance -- and thus overestimation in AI systems' capabilities -- we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.
1 code implementation • 2 May 2020 • Swaroop Mishra, Anjana Arunkumar, Bhavdeep Sachdeva, Chris Bryan, Chitta Baral
The data creation paradigm consists of several data visualizations to help data creators (i) understand the quality of data and (ii) visualize the impact of the created data instance on the overall quality.