no code implementations • 13 May 2025 • Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate.
no code implementations • 24 Mar 2025 • Sky CH-Wang, Darshan Deshpande, Smaranda Muresan, Anand Kannappan, Rebecca Qian
We introduce Browsing Lost Unformed Recollections, a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants.
no code implementations • 18 Dec 2024 • Darshan Deshpande, Selvan Sunitha Ravi, Sky CH-Wang, Bartosz Mielczarek, Anand Kannappan, Rebecca Qian
The LLM-as-judge paradigm is increasingly being adopted for automated evaluation of model outputs.
no code implementations • 11 Jul 2024 • Selvan Sunitha Ravi, Bartosz Mielczarek, Anand Kannappan, Douwe Kiela, Rebecca Qian
Retrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs).
2 code implementations • 20 Nov 2023 • Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, Bertie Vidgen
We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400).
Question Answering
no code implementations • 14 Nov 2023 • Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, Paul Röttger
While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme.
no code implementations • 11 Nov 2023 • Hsuan Su, Rebecca Qian, Chinnadhurai Sankar, Shahin Shayandeh, Shang-Tse Chen, Hung-Yi Lee, Daniel M. Bikel
In this paper, we propose a diagnosis method to attribute bias to each component of a TOD system.
1 code implementation • 25 May 2022 • Rebecca Qian, Candace Ross, Jude Fernandes, Eric Smith, Douwe Kiela, Adina Williams
Unwanted and often harmful social biases are becoming ever more salient in NLP research, affecting both models and datasets.
no code implementations • 19 Apr 2022 • Yuxuan Sun, Ethan Carlson, Rebecca Qian, Kavya Srinet, Arthur Szlam
In this work we give a case study of an embodied machine-learning (ML) powered agent that improves itself via interactions with crowd-workers.
no code implementations • NLP4ConvAI (ACL) 2022 • Eric Michael Smith, Orion Hsu, Rebecca Qian, Stephen Roller, Y-Lan Boureau, Jason Weston
At the heart of improving conversational AI is the open problem of how to evaluate conversations.
1 code implementation • 25 Jan 2021 • Anurag Pratik, Soumith Chintala, Kavya Srinet, Dhiraj Gandhi, Rebecca Qian, Yuxuan Sun, Ryan Drew, Sara Elkafrawy, Anoushka Tiwari, Tucker Hart, Mary Williamson, Abhinav Gupta, Arthur Szlam
In recent years, there have been significant advances in building end-to-end Machine Learning (ML) systems that learn at scale.