1 code implementation • HumEval (ACL) 2022 • Anastasia Shimorina, Anya Belz
This paper presents the Human Evaluation Datasheet (HEDS), a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP), and reports on first experience of researchers using HEDS sheets in practice.
1 code implementation • MSR (COLING) 2020 • Simon Mille, Anya Belz, Bernd Bohnet, Thiago Castro Ferreira, Yvette Graham, Leo Wanner
As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track where additionally, functional words and morphological information were removed.
no code implementations • INLG (ACL) 2020 • David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, Verena Rieser
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility.
no code implementations • INLG (ACL) 2020 • Anya Belz, Simon Mille, David M. Howcroft
Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs.
no code implementations • INLG (ACL) 2020 • Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter
Across NLP, a growing body of work is looking at the issue of reproducibility.
no code implementations • INLG (ACL) 2021 • Simon Mille, Thiago Castro Ferreira, Anya Belz, Brian Davis
Clarity had a higher degree of reproducibility than Fluency, as measured by the coefficient of variation.
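As a rough illustration of how the coefficient of variation quantifies this kind of difference, here is a minimal Python sketch with made-up scores; it is not the paper's exact procedure (related QRA work applies further corrections), and the numbers are purely illustrative.

```python
# Minimal sketch (not the paper's exact procedure): comparing the
# reproducibility of two quality criteria via the coefficient of
# variation (CV = 100 * sample std / mean) over repeated evaluation runs.
from statistics import mean, stdev

def coefficient_of_variation(scores):
    """CV as a percentage: 100 * sample standard deviation / mean."""
    return 100 * stdev(scores) / mean(scores)

# Hypothetical mean system scores from an original run and reproductions.
clarity_runs = [4.1, 4.0, 4.2]   # low spread -> low CV -> more reproducible
fluency_runs = [3.2, 3.9, 3.5]   # higher spread -> higher CV

print(f"Clarity CV: {coefficient_of_variation(clarity_runs):.2f}%")
print(f"Fluency CV: {coefficient_of_variation(fluency_runs):.2f}%")
```

A lower CV across the original and reproduction runs indicates scores that vary less relative to their mean, i.e. a more reproducible criterion.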
no code implementations • INLG (ACL) 2021 • Anya Belz, Anastasia Shimorina, Shubham Agarwal, Ehud Reiter
The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation.
no code implementations • INLG (ACL) 2021 • Maja Popović, Anya Belz
In this paper we report our reproduction study of the Croatian part of an annotation-based human evaluation of machine-translated user reviews (Popović, 2020).
no code implementations • 10 Dec 2024 • Anya Belz, Craig Thomson
This paper presents version 3.0 of the Human Evaluation Datasheet (HEDS).
no code implementations • 13 May 2024 • Michela Lorandi, Anya Belz
Rerunning a metric-based evaluation should be more straightforward, and results should be closer to the original, than for a human-based evaluation, especially where code and model checkpoints are made available by the original authors.
1 code implementation • 19 Feb 2024 • Michela Lorandi, Anya Belz
The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well-resourced languages.
no code implementations • 25 Jan 2024 • Mohammed Sabry, Anya Belz
We compare the performance of ported modules with that of equivalent modules trained (i) from scratch, and (ii) from parameters sampled from the same distribution as the ported module.
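To make baseline (ii) concrete, the sketch below shows one way of initialising a control module whose weights are drawn from a distribution fitted to a ported module's weight statistics. The helper name, the normal-distribution assumption, and the toy adapter are illustrative only and are not taken from the paper.

```python
# Illustrative sketch only: build a "control" module whose weights are drawn
# from a normal distribution matched to the ported module's empirical mean
# and std (one simple reading of "sampled from the same distribution").
import copy
import torch
import torch.nn as nn

def sample_matched_module(ported: nn.Module) -> nn.Module:
    control = copy.deepcopy(ported)
    with torch.no_grad():
        for p_src, p_new in zip(ported.parameters(), control.parameters()):
            p_new.normal_(mean=p_src.mean().item(), std=p_src.std().item())
    return control

# Hypothetical ported adapter and its distribution-matched control.
ported_adapter = nn.Sequential(nn.Linear(768, 64), nn.ReLU(), nn.Linear(64, 768))
control_adapter = sample_matched_module(ported_adapter)
```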
1 code implementation • 19 Aug 2023 • Michela Lorandi, Anya Belz
LLMs like GPT perform well on tasks involving English, which dominates their training data.
no code implementations • 2 May 2023 • Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees Van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi Yang
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible.
no code implementations • 24 Apr 2023 • Mohammed Sabry, Anya Belz
Recent parameter-efficient finetuning (PEFT) techniques aim to reduce the considerable cost of fully finetuning large pretrained language models (PLMs).
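For background, the sketch below shows one widely used PEFT technique, a LoRA-style low-rank adapter, in plain PyTorch. It is a generic illustration of the idea (freeze the pretrained weights, train only a small low-rank update), not the reference architecture proposed in the paper.

```python
# Generic LoRA-style adapter sketch (illustrative, not the paper's method):
# the frozen pretrained projection is augmented with a trainable low-rank
# update B @ A, so only r * (d_in + d_out) parameters are finetuned.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank update.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))              # only A and B receive gradients
```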
no code implementations • 17 Nov 2022 • Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Anya Belz, Ehud Reiter
Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality.
no code implementations • NAACL 2022 • Tom Knoll, Francesco Moramarco, Alex Papadopoulos Korfiatis, Rachel Young, Claudia Ruffini, Mark Perera, Christian Perstl, Ehud Reiter, Anya Belz, Aleksandar Savkov
A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations.
no code implementations • ACL 2022 • Anya Belz, Maja Popović, Simon Mille
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology.
no code implementations • ACL 2022 • Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov
In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet there is little work on how to properly evaluate the generated notes and understand the impact they may have on both the clinician using them and the patient's clinical safety.
no code implementations • 2 Sep 2021 • Anya Belz
Reproducibility has become an intensely debated topic in NLP and ML over recent years, but no commonly accepted way of assessing reproducibility, let alone quantifying it, has so far emerged.
no code implementations • 17 Mar 2021 • Anastasia Shimorina, Anya Belz
This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP).
1 code implementation • EACL 2021 • Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter
Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results.