Our model outperforms the state-of-the-art NMN model on CLEVR-Ref+ dataset with +8. 1% improvement in accuracy on the single-referent test set and +4. 3% on the full test set.
We define three broad classes of task descriptions for these tasks: statement, question, and completion, with numerous lexical variants within each class.
We demonstrate the benefit of our Conv-FEVER dataset by showing that the models trained on this data perform reasonably well to detect factually inconsistent responses with respect to the provided knowledge through evaluation on our human annotated data.
Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes.
Recent work has shown that pre-trained language models such as BERT improve robustness to spurious correlations in the dataset.
To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't.
User generated text on social media often suffers from a lot of undesired characteristics including hatespeech, abusive language, insults etc.
In this paper, we study abstractive summarization for open-domain videos.
Ranked #1 on Text Summarization on How2
We extend this line of work to the more challenging task of cross-lingual verb sense disambiguation, introducing the MultiSense dataset of 9, 504 images annotated with English, German, and Spanish verbs.
In this paper we propose a model to learn multimodal multilingual representations for matching images and sentences in different languages, with the aim of advancing multilingual versions of image search and image understanding.
A large amount of recent research has focused on tasks that combine language and vision, resulting in a proliferation of datasets and methods.
We introduce a new task, visual sense disambiguation for verbs: given an image and a verb, assign the correct sense of the verb, i. e., the one that describes the action depicted in the image.
The experimental results on a large email collection from a contact center in the tele- com domain show that the proposed ap- proach is effective in predicting the best topic of the agent's next sentence.