no code implementations • INLG (ACL) 2020 • Emiel van Miltenburg, Wei-Ting Lu, Emiel Krahmer, Albert Gatt, Guanyi Chen, Lin Li, Kees Van Deemter
Because our manipulated descriptions form minimal pairs with the reference descriptions, we are able to assess the impact of different kinds of errors on the perceived quality of the descriptions.
no code implementations • INLG (ACL) 2020 • David M. Howcroft, Anya Belz, Miruna-Adriana Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, Verena Rieser
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility.
1 code implementation • GWC 2016 • Marten Postma, Emiel van Miltenburg, Roxane Segers, Anneleen Schoen, Piek Vossen
We describe Open Dutch WordNet, which has been derived from the Cornetto database, the Princeton WordNet and open source resources.
no code implementations • GWC 2016 • Emiel van Miltenburg
Le and Fokkens (2015) recently showed that taxonomy-based approaches are more reliable than corpus-based approaches in estimating human similarity ratings.
no code implementations • ACL (EvalNLGEval, INLG) 2020 • Emiel van Miltenburg, Chris van der Lee, Thiago Castro-Ferreira, Emiel Krahmer
NLG researchers often use uncontrolled corpora to train and evaluate their systems, using textual similarity metrics, such as BLEU.
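As an illustration only (not the evaluation code used in the paper), BLEU-style textual similarity boils down to modified n-gram precision plus a brevity penalty; a minimal pure-Python sketch:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions (orders 1..max_n) times a brevity penalty.
    Real BLEU uses max_n=4, multiple references, and smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A toy check: an exact copy of the reference scores 1.0, while a truncated candidate is penalized even when every n-gram it contains is correct — one reason surface-similarity metrics can mislead when the training corpus is uncontrolled.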
1 code implementation • LANTERN (COLING) 2020 • Emiel van Miltenburg
While useful, these evaluations do not tell us anything about the kinds of image descriptions that systems are able to produce.
no code implementations • 31 May 2024 • Emiel van Miltenburg
This short position paper provides a manually curated list of non-English image captioning datasets (as of May 2024).
no code implementations • 21 Dec 2023 • Anouck Braggaar, Christine Liebrecht, Emiel van Miltenburg, Emiel Krahmer
This review gives an extensive overview of evaluation methods for task-oriented dialogue systems, paying special attention to practical applications of dialogue systems, for example for customer service.
no code implementations • 2 May 2023 • Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth Clark, Kees Van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi Yang
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible.
no code implementations • 29 Mar 2023 • Emiel van Miltenburg
This year the International Conference on Natural Language Generation (INLG) will feature an award for the paper with the best evaluation.
no code implementations • 8 Dec 2022 • Hien Huynh, Tomas O. Lentz, Emiel van Miltenburg
This case study investigates the extent to which a language model (GPT-2) is able to capture native speakers' intuitions about implicit causality in a sentence completion task.
no code implementations • INLG (ACL) 2021 • Emiel van Miltenburg, Miruna-Adriana Clinciu, Ondřej Dušek, Dimitra Gkatzia, Stephanie Inglis, Leo Leppänen, Saad Mahamood, Emma Manning, Stephanie Schoch, Craig Thomson, Luou Wen
We observe a severe under-reporting of the different kinds of errors that Natural Language Generation systems make.
no code implementations • 16 Jun 2021 • Simon Mille, Kaustubh D. Dhole, Saad Mahamood, Laura Perez-Beltrachini, Varun Gangal, Mihir Kale, Emiel van Miltenburg, Sebastian Gehrmann
By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables, and shed light on the limits of current generation models.
no code implementations • NAACL 2021 • Emiel van Miltenburg, Chris van der Lee, Emiel Krahmer
Preregistration refers to the practice of specifying what you are going to do, and what you expect to find in your study, before carrying out the study.
no code implementations • ACL (GEM) 2021 • Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Rubungo Andre Niyongabo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou
We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics.
Ranked #1 on Extreme Summarization on GEM-XSum
no code implementations • 15 Jun 2020 • Emiel van Miltenburg
Automatic image description systems are commonly trained and evaluated using crowdsourced, human-generated image descriptions.
no code implementations • WS 2019 • Emiel van Miltenburg, Merel van de Kerkhof, Ruud Koolen, Martijn Goudbeek, Emiel Krahmer
Task effects in NLG corpus elicitation recently started to receive more attention, but are usually not modeled statistically.
no code implementations • WS 2019 • Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, Emiel Krahmer
Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated.
1 code implementation • IJCNLP 2019 • Thiago Castro Ferreira, Chris van der Lee, Emiel van Miltenburg, Emiel Krahmer
In contrast, recent neural models for data-to-text generation have been proposed as end-to-end approaches, where the non-linguistic input is rendered in natural language with far fewer explicit intermediate representations in between.
Ranked #8 on Data-to-Text Generation on WebNLG Full
1 code implementation • WS 2018 • Emiel van Miltenburg, Desmond Elliott, Piek Vossen
This taxonomy serves as a reference point to think about how other people should be described, and can be used to classify and compute statistics about labels applied to people.
1 code implementation • COLING 2018 • Emiel van Miltenburg, Desmond Elliott, Piek Vossen
Automatic image description systems typically produce generic sentences that only make use of a small subset of the vocabulary available to them.
no code implementations • COLING 2018 • Emiel van Miltenburg, Ákos Kádár, Ruud Koolen, Emiel Krahmer
We present a corpus of spoken Dutch image descriptions, paired with two sets of eye-tracking data: Free viewing, where participants look at images without any particular purpose, and Description viewing, where we track eye movements while participants produce spoken descriptions of the images they are viewing.
1 code implementation • COLING 2018 • Emiel van Miltenburg, Ruud Koolen, Emiel Krahmer
Automatic image description systems are commonly trained and evaluated on written image descriptions.
1 code implementation • WS 2017 • Emiel van Miltenburg, Desmond Elliott, Piek Vossen
Automatic image description systems are commonly trained and evaluated on large image description datasets.
1 code implementation • 13 Apr 2017 • Emiel van Miltenburg, Desmond Elliott
In recent years we have seen rapid and significant progress in automatic image description, but what are the open problems in this area?
no code implementations • EACL 2017 • Emiel van Miltenburg
This research proposal discusses pragmatic factors in image description, arguing that current automatic image description systems do not take these factors into account.
1 code implementation • WS 2016 • Chantal van Son, Emiel van Miltenburg, Roser Morante
This paper discusses the need for a dictionary of affixal negations and regular antonyms to facilitate their automatic detection in text.
1 code implementation • WS 2016 • Emiel van Miltenburg, Roser Morante, Desmond Elliott
We provide a qualitative analysis of the descriptions containing negations (no, not, n't, nobody, etc) in the Flickr30K corpus, and a categorization of negation uses.
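Before any qualitative categorization, descriptions containing negations first have to be found; a minimal sketch, assuming a cue list based on the examples in the abstract (the paper's actual cue inventory may be larger):

```python
import re

# Hypothetical cue list: the tokens named in the abstract plus a few
# obvious relatives. Clitic "n't" is matched separately because it has
# no word boundary on its left (e.g. "isn't").
NEGATION_CUES = re.compile(
    r"\b(?:no|not|never|none|nobody|nothing)\b|n't\b"
)

def find_negations(description):
    """Return the negation cues occurring in one image description,
    in order of appearance (lowercased)."""
    return NEGATION_CUES.findall(description.lower())

def descriptions_with_negations(corpus):
    """Filter a corpus down to the descriptions containing a cue."""
    return [d for d in corpus if find_negations(d)]
```

Matched descriptions could then be hand-labeled with the categories of negation use the paper proposes.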
2 code implementations • 19 May 2016 • Emiel van Miltenburg
An untested assumption behind the crowdsourced descriptions of the images in the Flickr30K dataset (Young et al., 2014) is that they "focus only on the information that can be obtained from the image alone" (Hodosh et al., 2013, p. 859).
no code implementations • LREC 2016 • Emiel van Miltenburg, Benjamin Timmermans, Lora Aroyo
The main goal of this study is to find out (i) whether it is feasible to collect keywords for a large collection of sounds through crowdsourcing, and (ii) how people talk about sounds, and what information they can infer from hearing a sound in isolation.
no code implementations • 30 Apr 2015 • Emiel van Miltenburg
This paper presents a pattern-based method that can be used to infer adjectival scales, such as <lukewarm, warm, hot>, from a corpus.
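To illustrate the general idea only (the paper's actual patterns and ranking method may differ), pattern-based scale inference searches a corpus for constructions like "warm, but not hot", which suggest the second adjective is stronger on the same scale:

```python
import re
from collections import Counter

# Hypothetical lexical patterns; each match yields a (weaker, stronger) pair.
SCALE_PATTERNS = [
    re.compile(r"\b(\w+), (?:but|if) not (\w+)\b"),
    re.compile(r"\b(\w+) or even (\w+)\b"),
]

def extract_scale_pairs(corpus_sentences):
    """Count (weaker, stronger) adjective pairs matched by the patterns.
    Aggregated over a large corpus, chains of such pairs can be linked
    into scales like <lukewarm, warm, hot>."""
    pairs = Counter()
    for sentence in corpus_sentences:
        lowered = sentence.lower()
        for pattern in SCALE_PATTERNS:
            for weak, strong in pattern.findall(lowered):
                pairs[(weak, strong)] += 1
    return pairs
```

In practice the extracted pairs are noisy, so frequency thresholds or pattern weighting would be needed before assembling full scales.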