Search Results for author: Elizabeth Clark

Found 18 papers, 3 papers with code

Don't Take This Out of Context! On the Need for Contextual Models and Evaluations for Stylistic Rewriting

no code implementations • 24 May 2023 • Akhila Yerukola, Xuhui Zhou, Elizabeth Clark, Maarten Sap

Most existing stylistic text rewriting methods and evaluation metrics operate on a sentence level, but ignoring the broader context of the text can lead to preferring generic, ambiguous, and incoherent rewrites.

Sentence

Paper
Add Code

SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation

no code implementations • 22 May 2023 • Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur P. Parikh

In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.

Paper
Add Code

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

no code implementations • 2 May 2023 • Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees Van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi Yang

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible.

Paper
Add Code

mFACE: Multilingual Summarization with Factual Consistency Evaluation

no code implementations • 20 Dec 2022 • Roee Aharoni, Shashi Narayan, Joshua Maynez, Jonathan Herzig, Elizabeth Clark, Mirella Lapata

Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.

Abstractive Text Summarization

Paper
Add Code

Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization

1 code implementation • 20 Dec 2022 • Lining Zhang, Simon Mille, Yufang Hou, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Saad Mahamood, Sebastian Gehrmann, Miruna Clinciu, Khyathi Chandu, João Sedoc

To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization.

Paper
Code

Dialect-robust Evaluation of Generated Text

no code implementations • 2 Nov 2022 • Jiao Sun, Thibault Sellam, Elizabeth Clark, Tu Vu, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, Sebastian Gehrmann

Evaluation metrics that are not robust to dialect variation make it impossible to tell how well systems perform for many groups of users, and can even penalize systems for producing text in lower-resource dialects.

nlg evaluation

Paper
Add Code

GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

no code implementations • 22 Jun 2022 • Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, Genta Indra Winata, Hendrik Strobelt, Hiroaki Hayashi, Jekaterina Novikova, Jenna Kanerva, Jenny Chim, Jiawei Zhou, Jordan Clive, Joshua Maynez, João Sedoc, Juraj Juraska, Kaustubh Dhole, Khyathi Raghavi Chandu, Laura Perez-Beltrachini, Leonardo F. R. Ribeiro, Lewis Tunstall, Li Zhang, Mahima Pushkarna, Mathias Creutz, Michael White, Mihir Sanjay Kale, Moussa Kamal Eddine, Nico Daheim, Nishant Subramani, Ondrej Dusek, Paul Pu Liang, Pawan Sasanka Ammanamanchi, Qi Zhu, Ratish Puduppully, Reno Kriz, Rifat Shahriyar, Ronald Cardenas, Saad Mahamood, Salomey Osei, Samuel Cahyawijaya, Sanja Štajner, Sebastien Montella, Shailza, Shailza Jolly, Simon Mille, Tahmid Hasan, Tianhao Shen, Tosin Adewumi, Vikas Raunak, Vipul Raheja, Vitaly Nikolaev, Vivian Tsai, Yacine Jernite, Ying Xu, Yisi Sang, Yixin Liu, Yufang Hou

This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims.

Benchmarking Text Generation

Paper
Add Code

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

no code implementations • 14 Feb 2022 • Sebastian Gehrmann, Elizabeth Clark, Thibault Sellam

We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations.

nlg evaluation Text Generation

Paper
Add Code

All That's `Human' Is Not Gold: Evaluating Human Evaluation of Generated Text

no code implementations • ACL 2021 • Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith

Human evaluations are typically considered the gold standard in natural language generation, but as models{'} fluency improves, how well can evaluators detect and judge machine-generated text?

nlg evaluation Text Generation

Paper
Add Code

All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text

no code implementations • 30 Jun 2021 • Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, Noah A. Smith

Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text?

nlg evaluation Text Generation

Paper
Add Code

Choose Your Own Adventure: Paired Suggestions in Collaborative Writing for Evaluating Story Generation Models

no code implementations • NAACL 2021 • Elizabeth Clark, Noah A. Smith

Story generation is an open-ended and subjective task, which poses a challenge for evaluating story generation models.

Story Generation

Paper
Add Code

Exploring the Effect of Author and Reader Identity in Online Story Writing: the STORIESINTHEWILD Corpus.

no code implementations • WS 2020 • Tal August, Maarten Sap, Elizabeth Clark, Katharina Reinecke, Noah A. Smith

We analyze the effect of author and reader characteristics and story writing setup on the quality of stories in a short storytelling task.

Paper
Add Code

Evaluation of Text Generation: A Survey

no code implementations • 26 Jun 2020 • Asli Celikyilmaz, Elizabeth Clark, Jianfeng Gao

The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years.

nlg evaluation Text Generation +1

Paper
Add Code

TuringAdvice: A Generative and Dynamic Evaluation of Language Use

1 code implementation • NAACL 2021 • Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi

We propose TuringAdvice, a new challenge task and dataset for language understanding models.

Paper
Code

Counterfactual Story Reasoning and Generation

1 code implementation • IJCNLP 2019 • Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, Yejin Choi

Counterfactual reasoning requires predicting how alternative events, contrary to what actually happened, might have resulted in different outcomes.

counterfactual Counterfactual Reasoning +1

Paper
Code

Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts

no code implementations • ACL 2019 • Elizabeth Clark, Asli Celikyilmaz, Noah A. Smith

For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming.