Search Results for author: Anya Belz

Found 21 papers, 5 papers with code

ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG

no code implementations • INLG (ACL) 2020 • Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter

Across NLP, a growing body of work is looking at the issue of reproducibility.

The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results

1 code implementation • MSR (COLING) 2020 • Simon Mille, Anya Belz, Bernd Bohnet, Thiago Castro Ferreira, Yvette Graham, Leo Wanner

As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track where additionally, functional words and morphological information were removed.

The ReproGen Shared Task on Reproducibility of Human Evaluations in NLG: Overview and Results

no code implementations • INLG (ACL) 2021 • Anya Belz, Anastasia Shimorina, Shubham Agarwal, Ehud Reiter

The NLP field has recently seen a substantial increase in work related to reproducibility of results, and more generally in recognition of the importance of having shared definitions and practices relating to evaluation.

Another PASS: A Reproduction Study of the Human Evaluation of a Football Report Generation System

no code implementations • INLG (ACL) 2021 • Simon Mille, Thiago Castro Ferreira, Anya Belz, Brian Davis

Clarity had a higher degree of reproducibility than Fluency, as measured by the coefficient of variation.
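
As a rough illustration of the measure named above, the coefficient of variation expresses the spread of a set of scores as a percentage of their mean, with a lower value indicating closer agreement between an original study and its reproductions. The sketch below uses the plain sample version, which is not necessarily the exact variant used in the paper; the function name and the scores are made up for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Return the sample standard deviation as a percentage of the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    m = mean(scores)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return 100.0 * stdev(scores) / abs(m)


# Hypothetical scores for illustration only: (original study, reproduction).
clarity = [4.0, 4.2]
fluency = [3.0, 4.0]

print(coefficient_of_variation(clarity))
print(coefficient_of_variation(fluency))
```

With these invented numbers, the criterion whose two scores lie closer together gets the smaller coefficient of variation, which is the sense in which one criterion can be called more reproducible than another.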

A Reproduction Study of an Annotation-based Human Evaluation of MT Outputs

no code implementations • INLG (ACL) 2021 • Maja Popović, Anya Belz

In this paper we report our reproduction study of the Croatian part of an annotation-based human evaluation of machine-translated user reviews (Popović, 2020).

Experimental Design

Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

no code implementations • INLG (ACL) 2020 • David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, Verena Rieser

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility.

Experimental Design

Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing

no code implementations • INLG (ACL) 2020 • Anya Belz, Simon Mille, David M. Howcroft

Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs.

The Human Evaluation Datasheet: A Template for Recording Details of Human Evaluation Experiments in NLP

1 code implementation • HumEval (ACL) 2022 • Anastasia Shimorina, Anya Belz

This paper presents the Human Evaluation Datasheet (HEDS), a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP), and reports on first experience of researchers using HEDS sheets in practice.

Reproducing the Metric-Based Evaluation of a Set of Controllable Text Generation Techniques

no code implementations • 13 May 2024 • Michela Lorandi, Anya Belz

Rerunning a metric-based evaluation should be more straightforward, and results should be closer, than in a human-based evaluation, especially where code and model checkpoints are made available by the original authors.

Attribute Text Generation

High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models

1 code implementation • 19 Feb 2024 • Michela Lorandi, Anya Belz

The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well resourced languages.

Data-to-Text Generation

Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods

no code implementations • 25 Jan 2024 • Mohammed Sabry, Anya Belz

We compare the performance of ported modules with that of equivalent modules trained (i) from scratch, and (ii) from parameters sampled from the same distribution as the ported module.

Sentiment Analysis, Transfer Learning

Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate

1 code implementation • 19 Aug 2023 • Michela Lorandi, Anya Belz

LLMs like GPT are great at tasks involving English, which dominates in their training data.

Data-to-Text Generation, Prompt Engineering +1

Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP

no code implementations • 2 May 2023 • Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi Yang

We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible.

PEFT-Ref: A Modular Reference Architecture and Typology for Parameter-Efficient Finetuning Techniques

no code implementations • 24 Apr 2023 • Mohammed Sabry, Anya Belz

Recent parameter-efficient finetuning (PEFT) techniques aim to improve over the considerable cost of fully finetuning large pretrained language models (PLM).

Consultation Checklists: Standardising the Human Evaluation of Medical Note Generation

no code implementations • 17 Nov 2022 • Aleksandar Savkov, Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Anya Belz, Ehud Reiter

Evaluating automatically generated text is generally hard due to the inherently subjective nature of many aspects of the output quality.

User-Driven Research of Medical Note Generation Software

no code implementations • NAACL 2022 • Tom Knoll, Francesco Moramarco, Alex Papadopoulos Korfiatis, Rachel Young, Claudia Ruffini, Mark Perera, Christian Perstl, Ehud Reiter, Anya Belz, Aleksandar Savkov

A growing body of work uses Natural Language Processing (NLP) methods to automatically generate medical notes from audio recordings of doctor-patient consultations.

Quantified Reproducibility Assessment of NLP Results

no code implementations • ACL 2022 • Anya Belz, Maja Popović, Simon Mille

This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) that is based on concepts and definitions from metrology.

Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

no code implementations • ACL 2022 • Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety.

Quantifying Reproducibility in NLP and ML

no code implementations • 2 Sep 2021 • Anya Belz

Reproducibility has become an intensely debated topic in NLP and ML over recent years, but no commonly accepted way of assessing reproducibility, let alone quantifying it, has so far emerged.

The Human Evaluation Datasheet 1.0: A Template for Recording Details of Human Evaluation Experiments in NLP

no code implementations • 17 Mar 2021 • Anastasia Shimorina, Anya Belz

This paper introduces the Human Evaluation Datasheet, a template for recording the details of individual human evaluation experiments in Natural Language Processing (NLP).

A Systematic Review of Reproducibility Research in Natural Language Processing

1 code implementation • EACL 2021 • Anya Belz, Shubham Agarwal, Anastasia Shimorina, Ehud Reiter

Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results.
