Search Results for author: Anya Belz

Found 22 papers, 5 papers with code

The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP

1 code implementation HumEval (ACL) 2022 Anastasia Shimorina, Anya Belz

This paper presents the Human Evaluation Datasheet (HEDS), a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP), and reports on first experience of researchers using HEDS sheets in practice.

The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results

1 code implementation MSR (COLING) 2020 Simon Mille, Anya Belz, Bernd Bohnet, Thiago Castro Ferreira, Yvette Graham, Leo Wanner

As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track, where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track, where, additionally, functional words and morphological information were removed.

Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

no code implementations INLG (ACL) 2020 David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, Verena Rieser

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility.

Diversity, Experimental Design

Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing

no code implementations INLG (ACL) 2020 Anya Belz, Simon Mille, David M. Howcroft

Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs.

The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results

no code implementations INLG (ACL) 2021 Anya Belz, Anastasia Shimorina, Shubham Agarwal, Ehud Reiter

The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation.

A Reproduction Study of an Annotation-based Human Evaluation of MT Outputs

no code implementations INLG (ACL) 2021 Maja Popović, Anya Belz

In this paper we report our reproduction study of the Croatian part of an annotation-based human evaluation of machine-translated user reviews (Popović, 2020).

Experimental Design

HEDS 3.0: The Human Evaluation Data Sheet Version 3.0

no code implementations 10 Dec 2024 Anya Belz, Craig Thomson

This paper presents version 3.0 of the Human Evaluation Datasheet (HEDS).

Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques

no code implementations 13 May 2024 Michela Lorandi, Anya Belz

Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors.

Attribute, Text Generation

High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models

1 code implementation 19 Feb 2024 Michela Lorandi, Anya Belz

The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well-resourced languages.

Data-to-Text Generation

Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods

no code implementations 25 Jan 2024 Mohammed Sabry, Anya Belz

We compare the performance of ported modules with that of equivalent modules trained (i) from scratch, and (ii) from parameters sampled from the same distribution as the ported module.

Sentiment Analysis, Transfer Learning

PEFT-Ref: A Modular Reference Architecture and Typology for Parameter-Efficient Finetuning Techniques

no code implementations 24 Apr 2023 Mohammed Sabry, Anya Belz

Recent parameter-efficient finetuning (PEFT) techniques aim to improve over the considerable cost of fully finetuning large pretrained language models (PLMs).

Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation

no code implementations 17 Nov 2022 Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Anya Belz, Ehud Reiter

Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality.

User-Driven Research of Medical Note Generation Software

no code implementations NAACL 2022 Tom Knoll, Francesco Moramarco, Alex Papadopoulos Korfiatis, Rachel Young, Claudia Ruffini, Mark Perera, Christian Perstl, Ehud Reiter, Anya Belz, Aleksandar Savkov

A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations.

Quantified Reproducibility Assessment of NLP Results

no code implementations ACL 2022 Anya Belz, Maja Popović, Simon Mille

This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology.

Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

no code implementations ACL 2022 Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety.

Quantifying Reproducibility in NLP and ML

no code implementations 2 Sep 2021 Anya Belz

Reproducibility has become an intensely debated topic in NLP and ML over recent years, but no commonly accepted way of assessing reproducibility, let alone quantifying it, has so far emerged.

The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP

no code implementations 17 Mar 2021 Anastasia Shimorina, Anya Belz

This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP).

A Systematic Review of Reproducibility Research in Natural Language Processing

1 code implementation EACL 2021 Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter

Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results.

Diversity
