To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset, and the UniRES model will be publicly available at https://github.com/Rubics-Xuan/MRES
Owing to the combination of a unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training.
We show that language-paired, two-modality data alone is sufficient to connect all modalities.
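One plausible way to realize this is to train each extra modality only against its paired text with a contrastive objective, so that language acts as the shared pivot space; the sketch below is an illustrative assumption, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def align_to_language(modality_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss aligning one modality to its paired text.

    Training every modality against text alone places all of them in a
    shared space, so unpaired modalities become comparable through language.
    Shapes: (B, D) embeddings; the i-th items of the two batches are paired.
    """
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(m.shape[0])            # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```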
In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.
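The generate-then-filter idea can be sketched as a small loop over image-text pairs; the `generator` and `scorer` callables below stand in for a pretrained multimodal model's interfaces and are assumptions made for illustration.

```python
def generate_and_filter(pairs, generator, scorer, threshold=0.7):
    """Generate QA / dense-caption annotations from image-text pairs, then
    keep only the samples the scoring model is confident about.

    `generator(image, caption)` yields candidate annotations and
    `scorer(image, annotation)` returns a confidence score; both interfaces
    are hypothetical stand-ins for a pretrained multimodal model.
    """
    dataset = []
    for image, caption in pairs:
        for annotation in generator(image, caption):   # candidate QA or caption
            if scorer(image, annotation) >= threshold:  # drop low-confidence generations
                dataset.append((image, annotation))
    return dataset
```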
Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner.
Ranked #1 on Video Captioning on VATEX (using extra training data)
Our method performs joint masking on the image-text input and integrates both implicit and explicit targets for recovering the masked signals.
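A minimal sketch of such joint masking, assuming patch embeddings for the image and token ids for the text; the masking ratios and the zero/[MASK] replacement strategy are illustrative, not the paper's exact settings.

```python
import torch

def joint_mask(image_patches, text_tokens, mask_token_id,
               image_mask_ratio=0.5, text_mask_ratio=0.3):
    """Randomly mask image patches and text tokens in one pass.

    image_patches: (B, N, D) patch embeddings
    text_tokens:   (B, L)    token ids
    Returns masked inputs plus boolean masks marking the positions to recover.
    """
    B, N, _ = image_patches.shape
    L = text_tokens.shape[1]

    # Illustrative assumption: independent Bernoulli masks per position.
    img_mask = torch.rand(B, N) < image_mask_ratio
    txt_mask = torch.rand(B, L) < text_mask_ratio

    masked_patches = image_patches.clone()
    masked_patches[img_mask] = 0.0            # zero out masked patches
    masked_tokens = text_tokens.clone()
    masked_tokens[txt_mask] = mask_token_id   # replace masked words with [MASK]

    return masked_patches, masked_tokens, img_mask, txt_mask
```

The masked positions can then be supervised with an implicit target (for example, regressing features produced by a momentum encoder) and an explicit target (reconstructing raw pixels or predicting the original token ids), typically combined as a weighted sum.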
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, which jointly models visual, textual, and audio resources.
Ranked #1 on Image Retrieval on Localized Narratives
In addition, thanks to the full Transformer architecture, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder.
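Extracting such a map typically amounts to reshaping one row of the decoder's cross-attention weights back onto the patch grid; the input format assumed below (head-averaged weights of shape words x patches) is an illustration, not the paper's code.

```python
import torch

def words_to_patches_map(cross_attn_weights, word_idx, grid_size):
    """Reshape decoder cross-attention for one word into an image-shaped map.

    cross_attn_weights: (num_words, num_patches) attention of decoder words
    over encoder patches, averaged over heads (an assumed input format).
    """
    attn = cross_attn_weights[word_idx]          # (num_patches,)
    attn = attn / (attn.sum() + 1e-8)            # normalize for display
    return attn.reshape(grid_size, grid_size)    # e.g. a 14x14 patch grid
```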
These models are autoregressive in that they generate each word by conditioning on previously generated words, which leads to high latency during inference.
The whole process involves a visual understanding module and a language generation module, which makes designing deep neural networks for this task more challenging than for other tasks.
In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
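To make the latency contrast concrete, here is a minimal sketch of sequential greedy decoding versus a single parallel pass; the `step_decoder` and `parallel_decoder` interfaces are hypothetical, and CMAL's counterfactuals-critical multi-agent training is not shown.

```python
import torch

def autoregressive_decode(step_decoder, start_id, end_id, max_len=20):
    """One forward pass per word: step t must wait for step t-1."""
    tokens = [start_id]
    for _ in range(max_len):
        prefix = torch.tensor(tokens).unsqueeze(0)    # (1, t) words generated so far
        logits = step_decoder(prefix)                 # hypothetical decoder call, (1, t, vocab)
        next_id = logits[0, -1].argmax().item()       # condition on the full prefix
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens

def non_autoregressive_decode(parallel_decoder, visual_feats, max_len=20):
    """All caption positions are predicted in a single forward pass."""
    logits = parallel_decoder(visual_feats, max_len)  # hypothetical call, (1, max_len, vocab)
    return logits.argmax(dim=-1).squeeze(0).tolist()
```

A caption of T words thus costs T sequential decoder calls in the autoregressive case, but only one in the non-autoregressive case.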
First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA.
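One plausible reading of "normalization inside SA" is to normalize the queries before the scaled dot product; the sketch below follows that reading and should not be taken as the paper's exact reparameterization.

```python
import torch
import torch.nn.functional as F

def normalized_self_attention(x, w_q, w_k, w_v, eps=1e-6):
    """Self-attention with normalization applied to the queries inside SA.

    x: (B, N, D) input features; w_q / w_k / w_v: (D, D) projection matrices.
    Normalizing the queries is an illustrative assumption about how
    normalization can be moved inside the attention computation.
    """
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    # Normalize queries over the token dimension (zero mean, unit variance),
    # so the attention logits are invariant to query scale and shift.
    q = (q - q.mean(dim=1, keepdim=True)) / (q.std(dim=1, keepdim=True) + eps)

    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v
```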
This report describes our solution for the VATEX Captioning Challenge 2020, which requires generating descriptions for the videos in both English and Chinese.
Image captioning attempts to generate a sentence of words describing the objects, attributes, and interactions in an image, which we denote as visual semantic units in this paper.
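One way to make these visual semantic units concrete is a small record over detected objects, their attributes, and pairwise interactions; the field names below are illustrative, not the paper's data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VisualSemanticUnits:
    """Objects, attributes, and interactions extracted from one image."""
    objects: List[str]                                                      # e.g. ["dog", "frisbee"]
    attributes: List[Tuple[int, str]] = field(default_factory=list)        # (object index, attribute)
    interactions: List[Tuple[int, str, int]] = field(default_factory=list) # (subject, relation, object)

units = VisualSemanticUnits(
    objects=["dog", "frisbee"],
    attributes=[(0, "brown")],
    interactions=[(0, "catching", 1)],
)
```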
The discriminator and the generator are trained in an adversarial manner to produce more natural and human-like captions.
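A minimal sketch of one adversarial update for caption generation, assuming a `generator` that samples captions with their log-probabilities and a `discriminator` that scores image-caption pairs; because captions are discrete, the generator is updated here with a REINFORCE-style reward, a common workaround that is not necessarily this paper's exact method.

```python
import torch

def adversarial_step(generator, discriminator, g_opt, d_opt, images, real_captions):
    """One adversarial update for caption generation (illustrative sketch).

    discriminator(images, captions) -> probability that captions are human-written.
    generator.sample(images) -> (sampled caption ids, per-caption log-probabilities).
    Both interfaces are assumptions made for this sketch.
    """
    # Discriminator step: separate human captions from sampled (fake) ones.
    with torch.no_grad():
        fake_captions, _ = generator.sample(images)
    d_real = discriminator(images, real_captions)
    d_fake = discriminator(images, fake_captions)
    d_loss = -(torch.log(d_real + 1e-8).mean() + torch.log(1 - d_fake + 1e-8).mean())
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: use the discriminator's score as a reward and update
    # via policy gradient, since sampled word ids are not differentiable.
    fake_captions, log_probs = generator.sample(images)
    reward = discriminator(images, fake_captions).detach()
    g_loss = -(reward * log_probs).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```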