Search Results for author: Daniel Deutsch

Found 24 papers, 10 papers with code

Understanding the Extent to which Content Quality Metrics Measure the Information Quality of Summaries

no code implementations • CoNLL (EMNLP) 2021 • Daniel Deutsch, Dan Roth

Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference.

Question Answering
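
As a rough illustration of what a reference-based metric in the spirit of ROUGE computes, here is a minimal unigram-recall sketch in Python. It is a simplification for intuition only (real ROUGE adds stemming, multiple n-gram orders, and other variants) and is not the method proposed in the paper.

```python
# Minimal sketch of a reference-based content metric (ROUGE-1-recall style):
# score a summary by the fraction of reference unigrams it covers.
from collections import Counter

def rouge1_recall(summary: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the summary."""
    summary_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())
    overlap = sum(min(count, summary_counts[token])
                  for token, count in reference_counts.items())
    total = sum(reference_counts.values())
    return overlap / total if total else 0.0

print(rouge1_recall("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67
```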

Finding Replicable Human Evaluations via Stable Ranking Probability

no code implementations • 1 Apr 2024 • Parker Riley, Daniel Deutsch, George Foster, Viresh Ratnakar, Ali Dabirmoghaddam, Markus Freitag

Reliable human evaluation is critical to the development of successful natural language generation models, but achieving it is notoriously difficult.

Machine Translation • Text Generation
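
One way to make the idea of "replicable" rankings concrete is to ask how often a resample of the human ratings reproduces the full-data system ranking. The sketch below illustrates that general idea only; the paper's "stable ranking probability" may be defined differently, and the scores here are invented.

```python
# Hedged sketch: estimate ranking stability by bootstrap-resampling ratings
# per system and checking how often the full-data ranking is reproduced.
import random

def ranking(scores_by_system):
    """Order system names by their mean rating, best first."""
    means = {s: sum(r) / len(r) for s, r in scores_by_system.items()}
    return tuple(sorted(means, key=means.get, reverse=True))

def ranking_stability(scores_by_system, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples whose ranking matches the original."""
    rng = random.Random(seed)
    reference = ranking(scores_by_system)
    hits = 0
    for _ in range(n_resamples):
        resampled = {s: rng.choices(r, k=len(r)) for s, r in scores_by_system.items()}
        hits += ranking(resampled) == reference
    return hits / n_resamples

scores = {"sysA": [4, 5, 4, 5, 3], "sysB": [3, 4, 3, 4, 4], "sysC": [2, 3, 2, 3, 3]}
print(ranking_stability(scores))
```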

There's no Data Like Better Data: Using QE Metrics for MT Data Filtering

no code implementations • 9 Nov 2023 • Jan-Thorsten Peter, David Vilar, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Markus Freitag

Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has seen large improvements in recent years with the use of neural metrics.

Machine Translation • NMT +2
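
The core filtering step can be sketched very simply: keep only the sentence pairs whose QE score clears a threshold. The QE scores and the threshold below are hypothetical placeholders, not the paper's actual model or cutoff.

```python
# Minimal sketch of QE-based data filtering for MT training data.
def filter_by_qe(pairs, qe_scores, threshold=0.8):
    """Return the (source, target) pairs whose QE score is at least `threshold`."""
    return [pair for pair, score in zip(pairs, qe_scores) if score >= threshold]

pairs = [("Hallo Welt", "Hello world"), ("Guten Morgen", "Good night")]
qe_scores = [0.95, 0.40]  # hypothetical scores from some external QE metric
print(filter_by_qe(pairs, qe_scores))  # keeps only the first pair
```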

The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics

1 code implementation • 30 Oct 2023 • Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao, Rotem Dror, Steffen Eger

Specifically, we propose a novel competition setting in which we select a list of allowed LLMs and disallow fine-tuning to ensure a focus on prompting.

Machine Translation • Text Generation

Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph Level

no code implementations • 25 Aug 2023 • Daniel Deutsch, Juraj Juraska, Mara Finkelstein, Markus Freitag

As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations.

Machine Translation • Sentence

Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration

1 code implementation • 23 May 2023 • Daniel Deutsch, George Foster, Markus Freitag

Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations.

Machine Translation
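
For intuition, a basic pairwise-accuracy meta-evaluation compares the sign of the metric's score difference with the sign of the human score difference over all pairs of translations, counting a pair as correct when both call it a tie. The paper's tie-calibrated variant is more involved; this sketch only shows the underlying idea with made-up scores.

```python
# Sketch of pairwise agreement between a metric and human scores, with ties
# counted as agreement when both sides call the pair tied.
from itertools import combinations

def sign(x: float) -> int:
    return (x > 0) - (x < 0)

def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of item pairs where metric and human score differences agree in sign."""
    pairs = list(combinations(range(len(human_scores)), 2))
    correct = sum(
        sign(metric_scores[i] - metric_scores[j]) == sign(human_scores[i] - human_scores[j])
        for i, j in pairs
    )
    return correct / len(pairs)

print(pairwise_accuracy([0.7, 0.5, 0.5], [90, 80, 80]))  # 1.0: agrees on the order and the tie
```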

Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization

1 code implementation • 20 Dec 2022 • Lining Zhang, Simon Mille, Yufang Hou, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Saad Mahamood, Sebastian Gehrmann, Miruna Clinciu, Khyathi Chandu, João Sedoc

To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization.

On the Limitations of Reference-Free Evaluations of Generated Text

no code implementations • 22 Oct 2022 • Daniel Deutsch, Rotem Dror, Dan Roth

There is significant interest in developing evaluation metrics that accurately estimate the quality of generated text without the aid of a human-written reference text, which can be time-consuming and expensive to collect or entirely unavailable in online applications.

Machine Translation

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

no code implementations • 22 Jun 2022 • Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Laura Perez-Beltrachini, Leonardo F. R. Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qi Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella, Shailza, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Ying Xu, Yisi Sang, Yixin Liu, Yufang Hou

This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims.

Benchmarking • Text Generation

Repro: An Open-Source Library for Improving the Reproducibility and Usability of Publicly Available Research Code

1 code implementation • 29 Apr 2022 • Daniel Deutsch, Dan Roth

We introduce Repro, an open-source library that aims to improve the reproducibility and usability of research code.

Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics

no code implementations • Findings (ACL) 2022 • Daniel Deutsch, Dan Roth

Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not, a task known as answer verification.

Attribute • Benchmarking +1
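
One common answer-verification method is token-level F1 between the QA model's prediction and the expected answer, with a correctness threshold. The paper benchmarks several verification methods; the sketch below shows only this one, and the threshold is illustrative.

```python
# Sketch of token-F1 answer verification for QA-based summarization metrics.
from collections import Counter

def token_f1(prediction: str, answer: str) -> float:
    """Token-level F1 between a predicted answer and the expected answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = answer.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def is_correct(prediction: str, answer: str, threshold: float = 0.5) -> bool:
    return token_f1(prediction, answer) >= threshold

print(is_correct("the prime minister of Canada", "prime minister"))  # True
```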

Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

no code implementations • NAACL 2022 • Daniel Deutsch, Rotem Dror, Dan Roth

How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
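
Concretely, a system-level correlation averages each system's metric and human scores over its outputs and then correlates the two lists of system-level averages. The minimal sketch below uses Pearson's r and invented numbers; the paper's analysis is considerably more detailed.

```python
# Minimal sketch of a system-level correlation between a metric and humans.
from statistics import correlation, mean  # statistics.correlation needs Python 3.10+

def system_level_pearson(metric_scores, human_scores):
    """metric_scores / human_scores: dict of system -> list of per-summary scores."""
    systems = sorted(metric_scores)
    metric_means = [mean(metric_scores[s]) for s in systems]
    human_means = [mean(human_scores[s]) for s in systems]
    return correlation(metric_means, human_means)

metric = {"sysA": [0.40, 0.45], "sysB": [0.30, 0.35], "sysC": [0.20, 0.25]}
human = {"sysA": [4.0, 4.5], "sysB": [3.5, 3.0], "sysC": [2.0, 2.5]}
print(system_level_pearson(metric, human))
```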

A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods

1 code implementation • 31 Mar 2021 • Daniel Deutsch, Rotem Dror, Dan Roth

After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations.
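
As an illustration of the kind of resampling involved, the sketch below computes a percentile-bootstrap confidence interval for a metric-human correlation by resampling the evaluated items. The paper studies several resampling methods; this shows only one basic variant, with invented scores.

```python
# Hedged sketch: percentile bootstrap CI for a metric-human correlation.
import random
from statistics import correlation  # needs Python 3.10+

def bootstrap_ci(metric_scores, human_scores, n_resamples=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(metric_scores)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(correlation([metric_scores[i] for i in idx],
                                     [human_scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(alpha / 2 * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

metric = [0.41, 0.35, 0.52, 0.48, 0.30, 0.44, 0.39, 0.55, 0.47, 0.33]
human = [3.8, 3.1, 4.6, 4.2, 2.9, 4.0, 3.5, 4.8, 4.1, 3.0]
print(bootstrap_ci(metric, human))
```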

Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries

1 code implementation • 23 Oct 2020 • Daniel Deutsch, Dan Roth

Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference.

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

2 code implementations • 1 Oct 2020 • Daniel Deutsch, Tania Bedrax-Weiss, Dan Roth

A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information that summary has in common with a reference.

Question Answering
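
At a high level, a QA-based content metric generates questions from the reference, answers them against the summary, and reports the fraction answered correctly. The sketch below only captures that pipeline shape; `generate_questions` and `answer_question` are hypothetical stand-ins for real question-generation and QA models, not the paper's actual components.

```python
# High-level sketch of a QA-based content metric. The two callables are
# hypothetical placeholders for real question-generation and QA models.
from typing import Callable, List, Tuple

def qa_content_score(
    summary: str,
    reference: str,
    generate_questions: Callable[[str], List[Tuple[str, str]]],  # reference -> (question, answer) pairs
    answer_question: Callable[[str, str], str],                  # (question, summary) -> predicted answer
) -> float:
    qa_pairs = generate_questions(reference)
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_question(question, summary).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)

# Toy demonstration with trivial stand-ins for the models:
demo_questions = lambda ref: [("Who won the race?", "Alice")]
demo_qa = lambda question, summary: "Alice" if "Alice" in summary else "unknown"
print(qa_content_score("Alice won the race on Sunday.", "some reference text", demo_questions, demo_qa))  # 1.0
```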

SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics

1 code implementation • EMNLP (NLPOSS) 2020 • Daniel Deutsch, Dan Roth

We present SacreROUGE, an open-source library for using and developing summarization evaluation metrics.

Summary Cloze: A New Task for Content Selection in Topic-Focused Summarization

no code implementations • IJCNLP 2019 • Daniel Deutsch, Dan Roth

A key challenge in topic-focused summarization is determining what information should be included in the summary, a problem known as content selection.

Sentence

A Distributional and Orthographic Aggregation Model for English Derivational Morphology

1 code implementation • ACL 2018 • Daniel Deutsch, John Hewitt, Dan Roth

Modeling derivational morphology to generate words with particular semantics is useful in many text generation tasks, such as machine translation or abstractive question answering.

Abstractive Question Answering • Machine Translation +3
