Search Results for author: Anya Belz

Found 14 papers, 3 papers with code

The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP

1 code implementation HumEval (ACL) 2022 Anastasia Shimorina, Anya Belz

This paper presents the Human Evaluation Datasheet (HEDS), a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP), and reports on first experience of researchers using HEDS sheets in practice.

Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing

no code implementations INLG (ACL) 2020 Anya Belz, Simon Mille, David M. Howcroft

Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs.

Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

no code implementations INLG (ACL) 2020 David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, Verena Rieser

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility.


A Reproduction Study of an Annotation-based Human Evaluation of MT Outputs

no code implementations INLG (ACL) 2021 Maja Popović, Anya Belz

In this paper we report our reproduction study of the Croatian part of an annotation-based human evaluation of machine-translated user reviews (Popović, 2020).


The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results

no code implementations INLG (ACL) 2021 Anya Belz, Anastasia Shimorina, Shubham Agarwal, Ehud Reiter

The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation.

The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results

1 code implementation MSR (COLING) 2020 Simon Mille, Anya Belz, Bernd Bohnet, Thiago Castro Ferreira, Yvette Graham, Leo Wanner

As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track where additionally, functional words and morphological information were removed.

User-Driven Research of Medical Note Generation Software

no code implementations 5 May 2022 Tom Knoll, Francesco Moramarco, Alex Papadopoulos Korfiatis, Rachel Young, Claudia Ruffini, Mark Perera, Christian Perstl, Ehud Reiter, Anya Belz, Aleksandar Savkov

A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations.

Quantified Reproducibility Assessment of NLP Results

no code implementations ACL 2022 Anya Belz, Maja Popović, Simon Mille

This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology.
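The metrology-derived notion of precision behind QRA is commonly operationalised as a coefficient of variation over repeated measurements of the same score. The sketch below is illustrative only, assuming a small-sample-corrected coefficient of variation (CV with the (1 + 1/(4n)) correction factor); it is not the paper's exact implementation, and the function name and example scores are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(scores):
    """Small-sample-corrected coefficient of variation (in percent)
    over repeated measurements of the same result, e.g. a metric
    score from an original run and its reproductions.
    Smaller values indicate higher reproducibility."""
    m = mean(scores)
    if m == 0:
        raise ValueError("mean is zero; CV is undefined")
    n = len(scores)
    cv = 100 * stdev(scores) / abs(m)
    # small-sample correction factor (1 + 1/(4n)), assumed here
    return cv * (1 + 1 / (4 * n))

# Hypothetical example: a score from one original run and two reproductions
print(coefficient_of_variation([27.3, 26.9, 27.8]))
```

Identical scores across runs yield a CV of 0, i.e. perfect reproducibility under this measure.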

Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

no code implementations ACL 2022 Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety.

Quantifying Reproducibility in NLP and ML

no code implementations 2 Sep 2021 Anya Belz

Reproducibility has become an intensely debated topic in NLP and ML over recent years, but no commonly accepted way of assessing reproducibility, let alone quantifying it, has so far emerged.

The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP

no code implementations 17 Mar 2021 Anastasia Shimorina, Anya Belz

This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP).

A Systematic Review of Reproducibility Research in Natural Language Processing

1 code implementation EACL 2021 Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter

Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results.
