no code implementations • CoNLL (EMNLP) 2021 • Daniel Deutsch, Dan Roth
Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference.
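The core idea of a reference-based content metric can be sketched as a token-overlap F1 between the candidate summary and the reference. This is a simplified illustration in the spirit of ROUGE-1, not the official ROUGE implementation (which adds stemming, stopword options, and n-gram variants); the function name is illustrative.

```python
from collections import Counter

def unigram_f1(summary: str, reference: str) -> float:
    """Toy content-overlap score in the spirit of ROUGE-1:
    F1 over the unigram counts shared by summary and reference."""
    sum_counts = Counter(summary.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each shared token counts at most min(count_s, count_r) times.
    overlap = sum((sum_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sum_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

BERTScore replaces the exact-match overlap above with soft similarity between contextual token embeddings, but the precision/recall/F1 framing is the same.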
1 code implementation • 2 Apr 2024 • Marcel Nawrath, Agnieszka Nowak, Tristan Ratz, Danilo C. Walenta, Juri Opitz, Leonardo F. R. Ribeiro, João Sedoc, Daniel Deutsch, Simon Mille, Yixin Liu, Lining Zhang, Sebastian Gehrmann, Saad Mahamood, Miruna Clinciu, Khyathi Chandu, Yufang Hou
At the heart of the Pyramid evaluation method for text summarization lie human-written summary content units (SCUs).
no code implementations • 1 Apr 2024 • Parker Riley, Daniel Deutsch, George Foster, Viresh Ratnakar, Ali Dabirmoghaddam, Markus Freitag
Reliable human evaluation is critical to the development of successful natural language generation models, but achieving it is notoriously difficult.
no code implementations • 15 Nov 2023 • Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, Markus Freitag
Recent large language models (LLMs) are leveraging human feedback to improve their generation quality.
no code implementations • 9 Nov 2023 • Jan-Thorsten Peter, David Vilar, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Markus Freitag
Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen large improvements in recent years with the use of neural metrics.
1 code implementation • 30 Oct 2023 • Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger
Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting.
no code implementations • 25 Aug 2023 • Daniel Deutsch, Juraj Juraska, Mara Finkelstein, Markus Freitag
As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations.
no code implementations • 14 Aug 2023 • Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat
Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems.
1 code implementation • 23 May 2023 • Daniel Deutsch, George Foster, Markus Freitag
Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations.
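Kendall's tau compares every pair of translations: it rewards pairs the metric ranks in the same order as the human scores and penalizes pairs it ranks in the opposite order. A minimal tau-a sketch (ignoring the tie corrections that the paper's proposed variants address):

```python
from itertools import combinations

def kendall_tau(metric_scores: list, human_scores: list) -> float:
    """Tau-a: (concordant - discordant) pairs over all pairs.
    Tied pairs count as neither concordant nor discordant."""
    assert len(metric_scores) == len(human_scores)
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_scores), 2):
        s = (m1 - m2) * (h1 - h2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(metric_scores) * (len(metric_scores) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A perfectly agreeing metric scores 1.0, a perfectly reversed one -1.0; how ties are handled is exactly where the common tau variants differ.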
1 code implementation • 20 Dec 2022 • Lining Zhang, Simon Mille, Yufang Hou, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Saad Mahamood, Sebastian Gehrmann, Miruna Clinciu, Khyathi Chandu, João Sedoc
To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization.
no code implementations • 22 Oct 2022 • Daniel Deutsch, Rotem Dror, Dan Roth
There is significant interest in developing evaluation metrics that accurately estimate the quality of generated text without the aid of a human-written reference text, which can be time-consuming and expensive to collect or entirely unavailable in online applications.
no code implementations • 22 Jun 2022 • Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Laura Perez-Beltrachini, Leonardo F. R. Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qi Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella, Shailza, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Ying Xu, Yisi Sang, Yixin Liu, Yufang Hou
This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims.
1 code implementation • 29 Apr 2022 • Daniel Deutsch, Dan Roth
We introduce Repro, an open-source library that aims to improve the reproducibility and usability of research code.
no code implementations • Findings (ACL) 2022 • Daniel Deutsch, Dan Roth
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not, a task known as answer verification.
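A common baseline for answer verification is SQuAD-style normalized exact match: the QA model's prediction counts as correct if it equals the expected answer after lowercasing and stripping punctuation, articles, and extra whitespace. A minimal sketch of that baseline (not the verification method the paper proposes):

```python
import re
import string

def normalize(answer: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles, and collapse whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match(prediction: str, gold: str) -> bool:
    """Answer verification via normalized exact match."""
    return normalize(prediction) == normalize(gold)
```

Exact match is brittle for paraphrased answers, which is precisely why learned verification is worth studying.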
no code implementations • NAACL 2022 • Daniel Deutsch, Rotem Dror, Dan Roth
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
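A system-level correlation is typically computed by averaging each system's per-summary scores and correlating those per-system averages with the corresponding human averages. A minimal sketch under that standard setup (function names are illustrative):

```python
from statistics import mean

def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def system_level_correlation(metric_scores: dict, human_scores: dict) -> float:
    """metric_scores / human_scores map system name -> list of
    per-summary scores; correlate the per-system means."""
    systems = sorted(metric_scores)
    metric_means = [mean(metric_scores[s]) for s in systems]
    human_means = [mean(human_scores[s]) for s in systems]
    return pearson(metric_means, human_means)
```

Because each system's scores are averaged before correlating, the result depends heavily on how many summaries per system are annotated, which is one source of the reliability concerns the paper examines.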
no code implementations • 15 Nov 2021 • Daniel Deutsch, Dan Roth
In this work, we propose a method for incorporating question-answering (QA) signals into a summarization model.
1 code implementation • 31 Mar 2021 • Daniel Deutsch, Rotem Dror, Dan Roth
After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations.
no code implementations • COLING 2020 • Disha Jindal, Daniel Deutsch, Dan Roth
Identifying the key events in a document is critical to holistically understanding its important information.
1 code implementation • 23 Oct 2020 • Daniel Deutsch, Dan Roth
Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference.
2 code implementations • 1 Oct 2020 • Daniel Deutsch, Tania Bedrax-Weiss, Dan Roth
A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information that summary has in common with a reference.
1 code implementation • EMNLP (NLPOSS) 2020 • Daniel Deutsch, Dan Roth
We present SacreROUGE, an open-source library for using and developing summarization evaluation metrics.
no code implementations • CoNLL 2019 • Daniel Deutsch, Shyam Upadhyay, Dan Roth
We experimentally show the benefits of our algorithm on constituency parsing and semantic role labeling.
no code implementations • IJCNLP 2019 • Daniel Deutsch, Dan Roth
A key challenge in topic-focused summarization is determining what information should be included in the summary, a problem known as content selection.
1 code implementation • ACL 2018 • Daniel Deutsch, John Hewitt, Dan Roth
Modeling derivational morphology to generate words with particular semantics is useful in many text generation tasks, such as machine translation or abstractive question answering.