Search Results for author: Desmond Elliott

Found 54 papers, 32 papers with code

Sequential Compositional Generalization in Multimodal Models

no code implementations 18 Apr 2024 Semih Yagcioglu, Osman Batur İnce, Aykut Erdem, Erkut Erdem, Desmond Elliott, Deniz Yuret

The rise of large-scale multimodal models has paved the way for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications across a variety of complex tasks.

Text Rendering Strategies for Pixel Language Models

no code implementations 1 Nov 2023 Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, Desmond Elliott

Pixel-based language models process text rendered as images, which allows them to handle any script, making them a promising approach to open vocabulary language modelling.

Language Modelling Sentence
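
The entry above describes the core idea behind pixel-based language models: text is rendered as an image and encoded patch by patch instead of being tokenised. Below is a minimal, hypothetical sketch of that rendering step using PIL and NumPy; the patch size, strip width, and default bitmap font are illustrative assumptions, not the renderer used in these papers.

```python
# Sketch: render a string to a grayscale strip and cut it into fixed-size
# patches that a vision-style encoder could consume. Font, patch size, and
# strip width are illustrative assumptions.
from PIL import Image, ImageDraw
import numpy as np

def render_to_patches(text, patch_size=16, width=512):
    img = Image.new("L", (width, patch_size), color=255)   # white strip
    ImageDraw.Draw(img).text((0, 2), text, fill=0)          # default bitmap font
    pixels = np.asarray(img, dtype=np.float32) / 255.0
    # split the strip into square patches, left to right
    patches = [pixels[:, i:i + patch_size] for i in range(0, width, patch_size)]
    return np.stack(patches)                                # (num_patches, 16, 16)

patches = render_to_patches("Language modelling with pixels handles any script.")
print(patches.shape)  # (32, 16, 16)
```

In practice a proper font stack is needed to cover non-Latin scripts, which is exactly the property these models exploit.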

Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models

1 code implementation 26 Oct 2023 Laura Cabello, Emanuele Bugliarello, Stephanie Brandl, Desmond Elliott

We quantify bias amplification in pretraining and after fine-tuning on three families of vision-and-language models.

Fairness Retrieval

PHD: Pixel-Based Language Modeling of Historical Documents

1 code implementation 22 Oct 2023 Nadav Borenstein, Phillip Rust, Desmond Elliott, Isabelle Augenstein

We then pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.

Language Modelling Optical Character Recognition (OCR)

LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting

1 code implementation 31 May 2023 Rita Ramos, Bruno Martins, Desmond Elliott

Multilingual image captioning has recently been tackled by training with large-scale machine translated data, which is an expensive, noisy, and time-consuming process.

Image Captioning Language Modelling +1
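
As the title indicates, LMCap prompts a language model with captions retrieved for the image rather than training on machine-translated data. The sketch below shows only the prompt-assembly step under assumed wording; `build_caption_prompt` and its template are hypothetical, and the retrieval and generation steps are omitted.

```python
# Hypothetical sketch of retrieval-augmented prompting for captioning: format
# captions retrieved for an image into a prompt that a multilingual language
# model completes in the target language. The template is an assumption, not
# the paper's exact prompt.
def build_caption_prompt(retrieved_captions, target_language="German"):
    lines = ["Similar images were described as follows:"]
    lines += [f"- {c}" for c in retrieved_captions]
    lines.append(f"A {target_language} caption for this image:")
    return "\n".join(lines)

prompt = build_caption_prompt(
    ["a dog runs across a grassy field", "a brown dog playing outside"],
    target_language="German",
)
print(prompt)  # fed to a multilingual LM to generate the caption
```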

The Role of Data Curation in Image Captioning

1 code implementation 5 May 2023 Wenyan Li, Jonas F. Lotz, Chen Qiu, Desmond Elliott

Image captioning models are typically trained by treating all samples equally, neglecting to account for mismatched or otherwise difficult data points.

Few-Shot Learning Image Captioning +2

Retrieval-augmented Image Captioning

1 code implementation 16 Feb 2023 Rita Ramos, Desmond Elliott, Bruno Martins

The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions.

Image Captioning Retrieval +1
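
To illustrate the architecture sketched in the snippet above, here is a minimal stand-in built from standard PyTorch modules: random tensors play the role of the pretrained V&L BERT encoder outputs over the image and the retrieved captions, and a transformer decoder cross-attends to the concatenated memory. Dimensions, layer counts, and the plain concatenation are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of "decoder attends to multimodal encoder representations"
# using PyTorch building blocks; random tensors stand in for V&L BERT outputs.
import torch
import torch.nn as nn

d_model = 768
image_repr = torch.randn(1, 36, d_model)       # e.g. 36 image region features
retrieved_repr = torch.randn(1, 60, d_model)   # encoded retrieved captions
memory = torch.cat([image_repr, retrieved_repr], dim=1)  # joint encoder output

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=12, batch_first=True),
    num_layers=2,
)
caption_so_far = torch.randn(1, 5, d_model)    # embeddings of generated prefix
out = decoder(tgt=caption_so_far, memory=memory)
print(out.shape)  # torch.Size([1, 5, 768]) -> project to vocab for next token
```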

An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification

no code implementations 11 Oct 2022 Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, Desmond Elliott

Non-hierarchical sparse attention Transformer-based models, such as Longformer and Big Bird, are popular approaches to working with long documents.

Document Classification

SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation

1 code implementation CVPR 2023 Rita Ramos, Bruno Martins, Desmond Elliott, Yova Kementchedjhieva

Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and finetuning.

Image Captioning Retrieval

Language Modelling with Pixels

1 code implementation 14 Jul 2022 Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott

We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts.

Language Modelling Named Entity Recognition (NER)

Revisiting Transformer-based Models for Long Document Classification

1 code implementation 14 Apr 2022 Xiang Dai, Ilias Chalkidis, Sune Darkner, Desmond Elliott

The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs).

Document Classification text-classification

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

3 code implementations 27 Jan 2022 Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, Ivan Vulić

Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

Cross-Modal Retrieval Few-Shot Learning +5

Visually Grounded Reasoning across Languages and Cultures

3 code implementations EMNLP 2021 Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, Desmond Elliott

The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet.

Visual Reasoning Zero-Shot Learning

MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model

1 code implementation Findings (EMNLP) 2021 Rasmus Kær Jørgensen, Mareike Hartmann, Xiang Dai, Desmond Elliott

Domain adaptive pretraining, i.e., the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain.

Language Modelling named-entity-recognition +4
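
As a rough illustration of domain adaptive pretraining as defined above, the hedged sketch below continues masked-language-model pretraining of a multilingual checkpoint on an in-domain text file using Hugging Face Transformers. The model name, the `domain_corpus.txt` path, and all hyperparameters are placeholder assumptions rather than the paper's setup.

```python
# Sketch: continued MLM pretraining on domain-specific text (placeholder
# model, data path, and hyperparameters).
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# assume domain_corpus.txt holds one domain-specific sentence per line
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mdapt-checkpoint", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # continued pretraining on in-domain text
```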

Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

4 code implementations EMNLP 2021 Stella Frank, Emanuele Bugliarello, Desmond Elliott

Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality.

Language Modelling
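
One way to probe the expectation stated above is a cross-modal input ablation: score the same text with and without its paired visual input and look at the gap. The sketch below is a generic illustration of that idea with an assumed `score_fn` interface and a toy stand-in model; the paper's actual ablation protocol is more involved.

```python
# Sketch: compare a model's loss with paired image features vs. ablated
# (zeroed) features. `score_fn` is an assumed interface returning a loss.
import numpy as np

def ablation_gap(score_fn, text_batch, image_feats):
    paired_loss = score_fn(text_batch, image_feats)
    ablated_loss = score_fn(text_batch, np.zeros_like(image_feats))
    return ablated_loss - paired_loss   # larger gap -> stronger reliance on vision

# toy stand-in for a multimodal model's loss
demo_score = lambda text, feats: 1.0 + float(np.abs(feats).mean() < 1e-6)
print(ablation_gap(demo_score, ["a dog on a beach"], np.random.rand(1, 36, 2048)))
```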

The Role of Syntactic Planning in Compositional Image Captioning

1 code implementation EACL 2021 Emanuele Bugliarello, Desmond Elliott

Image captioning has focused on generalizing to images drawn from the same distribution as the training set, and not to the more challenging problem of generalizing to different distributions of images.

Image Captioning

Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

3 code implementations 30 Nov 2020 Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, Desmond Elliott

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing.

Multimodal Speech Recognition with Unstructured Audio Masking

no code implementations EMNLP (nlpbt) 2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

Our experiments on the Flickr8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting.

8k Automatic Speech Recognition +2

Fine-Grained Grounding for Multimodal Speech Recognition

1 code implementation Findings (EMNLP) 2020 Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott

In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's ability to localize the correct proposals.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

On Forgetting to Cite Older Papers: An Analysis of the ACL Anthology

1 code implementation ACL 2020 Marcel Bollmann, Desmond Elliott

The field of natural language processing is experiencing a period of unprecedented growth, and with it a surge of published papers.

CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning

no code implementations ACL 2020 Alessandro Suglia, Ioannis Konstas, Andrea Vanzo, Emanuele Bastianelli, Desmond Elliott, Stella Frank, Oliver Lemon

To remedy this, we present GROLLA, an evaluation framework for Grounded Language Learning with Attributes, with three sub-tasks: 1) Goal-oriented evaluation; 2) Object attribute prediction evaluation; and 3) Zero-shot evaluation.

Attribute Grounded language learning

The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

2 code implementations ACL 2020 Mostafa Abdou, Vinit Ravishankar, Maria Barrett, Yonatan Belinkov, Desmond Elliott, Anders Søgaard

Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of common sense reasoning ability.

Common Sense Reasoning

Multimodal Machine Translation through Visuals and Speech

no code implementations 28 Nov 2019 Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, Jörg Tiedemann

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data.

Image Captioning Multimodal Machine Translation +4

Bootstrapping Disjoint Datasets for Multilingual Multimodal Representation Learning

no code implementations 9 Nov 2019 Ákos Kádár, Grzegorz Chrupała, Afra Alishahi, Desmond Elliott

However, we do find that using an external machine translation model to generate the synthetic data sets results in better performance.

Machine Translation Representation Learning +4

Adversarial Removal of Demographic Attributes Revisited

no code implementations IJCNLP 2019 Maria Barrett, Yova Kementchedjhieva, Yanai Elazar, Desmond Elliott, Anders Søgaard

Elazar and Goldberg (2018) showed that protected attributes can be extracted from the representations of a debiased neural network for mention detection at above-chance levels, by evaluating a diagnostic classifier on a held-out subsample of the data it was trained on.

Understanding the Effect of Textual Adversaries in Multimodal Machine Translation

no code implementations WS 2019 Koel Dutta Chowdhury, Desmond Elliott

It is assumed that multimodal machine translation systems are better than text-only systems at translating phrases that have a direct correspondence in the image.

Multimodal Machine Translation Sentence +1

Compositional Generalization in Image Captioning

1 code implementation CoNLL 2019 Mitja Nikolaus, Mostafa Abdou, Matthew Lamm, Rahul Aralikatte, Desmond Elliott

Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts.

Caption Generation Image Captioning +1

Cross-lingual Visual Verb Sense Disambiguation

1 code implementation NAACL 2019 Spandana Gella, Desmond Elliott, Frank Keller

We extend this line of work to the more challenging task of cross-lingual verb sense disambiguation, introducing the MultiSense dataset of 9,504 images annotated with English, German, and Spanish verbs.

Machine Translation Translation

Talking about other people: an endless range of possibilities

1 code implementation WS 2018 Emiel van Miltenburg, Desmond Elliott, Piek Vossen

This taxonomy serves as a reference point to think about how other people should be described, and can be used to classify and compute statistics about labels applied to people.

Text Generation

How2: A Large-scale Dataset for Multimodal Language Understanding

2 code implementations 1 Nov 2018 Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Adversarial Evaluation of Multimodal Machine Translation

no code implementations EMNLP 2018 Desmond Elliott

The promise of combining language and vision in multimodal machine translation is that systems will produce better translations by leveraging the image data.

Multimodal Machine Translation text similarity +1

Findings of the Third Shared Task on Multimodal Machine Translation

1 code implementation WS 2018 Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, Stella Frank

In this task, a source sentence in English is supplemented by an image, and participating systems are required to generate a translation of the sentence into German, French or Czech.

Multimodal Machine Translation Sentence +1

Measuring the Diversity of Automatic Image Descriptions

1 code implementation COLING 2018 Emiel van Miltenburg, Desmond Elliott, Piek Vossen

Automatic image description systems typically produce generic sentences that only make use of a small subset of the vocabulary available to them.

Text Generation
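
To make the notion of "a small subset of the vocabulary" concrete, the snippet below computes two generic diversity measures over generated descriptions: a type-token ratio and coverage of a reference vocabulary. These are illustrative measures only, not the specific metric suite reported in the paper.

```python
# Sketch: simple diversity statistics over generated image descriptions.
def diversity_stats(generated_captions, reference_vocab):
    tokens = [t.lower() for c in generated_captions for t in c.split()]
    types = set(tokens)
    return {
        "type_token_ratio": len(types) / max(len(tokens), 1),
        "vocab_coverage": len(types & reference_vocab) / max(len(reference_vocab), 1),
    }

vocab = {"a", "dog", "cat", "runs", "sleeps", "beach", "sofa", "on", "the"}
captions = ["a dog runs on the beach", "a dog on the beach"]
print(diversity_stats(captions, vocab))
```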

Cross-linguistic differences and similarities in image descriptions

1 code implementation WS 2017 Emiel van Miltenburg, Desmond Elliott, Piek Vossen

Automatic image description systems are commonly trained and evaluated on large image description datasets.

Specificity

Imagination improves Multimodal Translation

no code implementations IJCNLP 2017 Desmond Elliott, Ákos Kádár

We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations.

Translation
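
A rough sketch of the decomposition described above: a shared source-sentence encoder feeds both a translation head and an "imagination" head that predicts the paired image's feature vector, so the grounding sub-task can be trained alongside translation. The module below uses assumed dimensions, a mean-pooled GRU encoder, and omits the training losses; it illustrates the idea rather than reproducing the paper's model.

```python
# Sketch: shared encoder with a translation head and an image-prediction head.
import torch
import torch.nn as nn

class SharedEncoderMMT(nn.Module):
    def __init__(self, vocab=10000, d=256, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.translate_head = nn.Linear(d, vocab)   # per-step target-word logits
        self.imagine_head = nn.Linear(d, img_dim)   # predicts image features

    def forward(self, src_tokens):
        states, _ = self.encoder(self.embed(src_tokens))
        sent = states.mean(dim=1)                   # pooled sentence vector
        return self.translate_head(states), self.imagine_head(sent)

model = SharedEncoderMMT()
logits, imagined = model(torch.randint(0, 10000, (2, 7)))
print(logits.shape, imagined.shape)  # translation logits + predicted image vector
```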

Room for improvement in automatic image description: an error analysis

1 code implementation13 Apr 2017 Emiel van Miltenburg, Desmond Elliott

In recent years we have seen rapid and significant progress in automatic image description, but what are the open problems in this area?

Pragmatic factors in image description: the case of negations

1 code implementation WS 2016 Emiel van Miltenburg, Roser Morante, Desmond Elliott

We provide a qualitative analysis of the descriptions containing negations (no, not, n't, nobody, etc.) in the Flickr30K corpus, and a categorization of negation uses.

Negation
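
A tiny illustrative helper in the spirit of the snippet above: flag which descriptions contain a negation cue. The cue list expands the examples given there with a few assumed additions; the paper's categorization of negation uses is manual and far finer-grained.

```python
# Sketch: detect negation cues in image descriptions (cue list partly assumed).
import re

NEGATION_CUES = re.compile(r"\b(no|not|nobody|never|none|nothing)\b|n't", re.IGNORECASE)

def has_negation(description):
    return bool(NEGATION_CUES.search(description))

print(has_negation("There is no dog in this picture."))   # True
print(has_negation("A man isn't wearing a shirt."))       # True
print(has_negation("A man rides a bicycle."))             # False
```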

1 Million Captioned Dutch Newspaper Images

no code implementations LREC 2016 Desmond Elliott, Martijn Kleppe

Images naturally appear alongside text in a wide variety of media, such as books, magazines, newspapers, and in online articles.

Data-to-Text Generation Image Captioning +3

A Corpus of Images and Text in Online News

no code implementations LREC 2016 Laura Hollink, Adriatik Bedjeti, Martin van Harmelen, Desmond Elliott

The corpus consists of JSON-LD files with the following data about each article: the original URL of the article on the news publisher's website, the date of publication, the headline of the article, the URL of the image displayed with the article (if any), and the caption of that image.
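
For concreteness, here is a hypothetical example of one JSON-LD article record with the fields listed above. The property names and schema.org context are assumptions; only the set of fields (article URL, publication date, headline, image URL, image caption) comes from the corpus description.

```python
# Sketch: a single article record with the fields named in the description.
import json

record = {
    "@context": "http://schema.org",          # assumed context
    "@type": "NewsArticle",
    "url": "https://example-publisher.com/news/2015/some-article",
    "datePublished": "2015-06-01",
    "headline": "Example headline of the news article",
    "image": {
        "@type": "ImageObject",
        "contentUrl": "https://example-publisher.com/images/12345.jpg",
        "caption": "Example caption of the image shown with the article",
    },
}
print(json.dumps(record, indent=2))
```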

Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures

no code implementations 15 Jan 2016 Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, Barbara Plank

Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities.

Retrieval

Multilingual Image Description with Neural Sequence Models

1 code implementation 15 Oct 2015 Desmond Elliott, Stella Frank, Eva Hasler

In this paper we present an approach to multi-language image description bringing together insights from neural machine translation and neural image description.

Image Captioning Translation
