In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of vision-language pre-training (VLP) for image captioning.
In this paper, we propose UNICORN, a vision-language (VL) model that unifies text generation and bounding box prediction into a single architecture.
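A common way to realize this kind of unification is to discretize box coordinates into location tokens so that boxes and text share a single output vocabulary and decoder. The sketch below illustrates that idea; the bin count and token format are illustrative assumptions, not UNICORN's exact specification.

```python
# Minimal sketch: representing a bounding box as discrete tokens so one
# text decoder can emit either words or coordinates. Bin count and token
# naming are illustrative assumptions.

NUM_BINS = 1000  # assumed quantization granularity

def box_to_tokens(box, img_w, img_h, num_bins=NUM_BINS):
    """Quantize (x1, y1, x2, y2) pixel coordinates into location tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    return [f"<loc_{min(int(v * num_bins), num_bins - 1)}>" for v in norm]

def tokens_to_box(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization (up to binning error)."""
    bins = [int(t[len("<loc_"):-1]) for t in tokens]
    scale = [img_w, img_h, img_w, img_h]
    return [(b + 0.5) / num_bins * s for b, s in zip(bins, scale)]

print(box_to_tokens((48, 96, 320, 400), img_w=640, img_h=480))
# ['<loc_75>', '<loc_200>', '<loc_500>', '<loc_833>']
```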
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning.
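The "one transformer for unimodal or multimodal inputs" idea can be sketched as a single shared encoder that receives image tokens, text tokens, or their concatenation. The dimensions and module layout below are illustrative assumptions, not UFO's exact architecture.

```python
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    """One shared transformer applied to image tokens, text tokens,
    or their concatenation. Sizes are illustrative assumptions."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, image_tokens=None, text_tokens=None):
        parts = [t for t in (image_tokens, text_tokens) if t is not None]
        x = torch.cat(parts, dim=1)  # unimodal: one part; multimodal: both
        return self.encoder(x)

enc = UnifiedEncoder()
img = torch.randn(2, 196, 768)  # e.g. 14x14 patch embeddings
txt = torch.randn(2, 20, 768)   # e.g. question word embeddings
h_img = enc(image_tokens=img)                    # unimodal input
h_vl = enc(image_tokens=img, text_tokens=txt)    # multimodal input
```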
To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.
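PICa's key move is to verbalize the image as a caption and let GPT-3 answer from the caption plus a few in-context QA examples. Below is a rough sketch of such a prompt; the template wording and the `gpt3_complete` call are placeholders, not PICa's released code.

```python
# Rough sketch of caption-based prompting for knowledge-based VQA.
# The template and gpt3_complete() are illustrative placeholders.

def build_prompt(caption, question, examples):
    """examples: list of (caption, question, answer) in-context shots."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQ: {q}\nA: {a}\n\n" for c, q, a in examples
    )
    return header + shots + f"Context: {caption}\nQ: {question}\nA:"

examples = [
    ("A man riding a surfboard on a wave.", "What sport is this?", "surfing"),
]
prompt = build_prompt(
    caption="A red double-decker bus driving down a London street.",
    question="In which country is this vehicle common?",
    examples=examples,
)
# answer = gpt3_complete(prompt)  # hypothetical call to a GPT-3 endpoint
print(prompt)
```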
3D visual grounding aims at grounding a natural language description of a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region.
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.
Due to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4% compared with a non-TAP baseline.
We then evaluate the framework on the proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms.
Particularly, we represent the input image with global and regional visual features, and we introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities.
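The two-branch, two-granularity layout can be sketched as below. Note that the real DCCN is built on dynamic routing between capsules; the attention module here is only a simplified stand-in to show how a global branch and a regional branch produce context vectors that are fused, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextBranch(nn.Module):
    """Simplified stand-in for one DCCN branch: attends over visual
    features (global or regional) guided by the decoder state. The real
    DCCN uses capsule dynamic routing rather than this attention."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, dec_state, vis_feats):
        ctx, _ = self.attn(dec_state, vis_feats, vis_feats)
        return ctx

dim = 512
global_branch, regional_branch = ContextBranch(dim), ContextBranch(dim)
fuse = nn.Linear(2 * dim, dim)

dec_state = torch.randn(4, 1, dim)      # current decoder hidden state
global_feats = torch.randn(4, 49, dim)  # e.g. 7x7 CNN grid features
region_feats = torch.randn(4, 36, dim)  # e.g. 36 detected regions

ctx = fuse(torch.cat([global_branch(dec_state, global_feats),
                      regional_branch(dec_state, region_feats)], dim=-1))
```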
We improve one-stage visual grounding by addressing current limitations in grounding long and complex queries.
Multi-modal neural machine translation (NMT) aims to translate source sentences paired with images into a target language.
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
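One standard way to build an image-sentence score from region-phrase scores is to take, for each phrase, its best-matching region and average over phrases; the sketch below uses cosine similarity and this max-then-mean aggregation as illustrative choices, which may differ from the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def region_phrase_scores(regions, phrases):
    """Cosine similarity between every region and phrase embedding.
    regions: (R, d), phrases: (P, d) -> (P, R) score matrix."""
    return F.normalize(phrases, dim=-1) @ F.normalize(regions, dim=-1).T

def image_sentence_score(regions, phrases):
    """Aggregate: best region per phrase, averaged over phrases
    (an illustrative aggregation, not necessarily the paper's)."""
    s = region_phrase_scores(regions, phrases)  # (P, R)
    return s.max(dim=1).values.mean()

regions = torch.randn(36, 256)  # 36 region embeddings
phrases = torch.randn(5, 256)   # 5 phrase embeddings
print(image_sentence_score(regions, phrases))
```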
We propose a simple, fast, and accurate one-stage approach to visual grounding.
The core idea is to first convert sparse weak labels such as keypoints into an initial estimate of body part masks, and then to iteratively refine the part mask predictions.
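A minimal sketch of that two-step recipe, assuming disk-shaped initial masks around keypoints and treating the learned refiner as a black box (the disk radius and the dummy smoothing refiner are illustrative placeholders):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def keypoints_to_initial_masks(keypoints, shape, radius=12):
    """Convert sparse keypoints into a crude initial part-mask estimate
    by drawing a disk around each keypoint (radius is an assumption)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    masks = [((xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2)
             for (x, y) in keypoints]
    return np.stack(masks).astype(np.float32)  # (num_parts, H, W)

def iterative_refine(image, masks, refine_fn, steps=3):
    """Repeatedly feed the current mask estimate back to a refinement
    model; refine_fn stands in for the learned refiner."""
    for _ in range(steps):
        masks = refine_fn(image, masks)
    return masks

# Usage with a dummy refiner that just smooths each mask:
def dummy_refine(image, masks):
    return np.stack([uniform_filter(m, size=5) for m in masks])

image = np.zeros((240, 320, 3), dtype=np.float32)
init = keypoints_to_initial_masks([(100, 80), (180, 150)], (240, 320))
refined = iterative_refine(image, init, dummy_refine)
```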
Scene graph generation refers to the task of automatically mapping an image into a semantic structural graph, which requires correctly labeling each extracted object and the interaction relationships between them.
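Concretely, a scene graph can be stored as labeled object nodes plus subject-predicate-object relation triples; the minimal encoding below is an illustrative representation, not any particular paper's data format.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    id: int
    label: str    # e.g. "person"
    box: tuple    # (x1, y1, x2, y2)

@dataclass
class Relation:
    subject: int    # SceneObject.id
    predicate: str  # e.g. "riding"
    object: int     # SceneObject.id

objects = [SceneObject(0, "person", (10, 20, 120, 300)),
           SceneObject(1, "horse", (60, 120, 400, 380))]
relations = [Relation(subject=0, predicate="riding", object=1)]
```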
The attention mechanism is important for skeleton-based action recognition because actions contain key spatio-temporal stages, while the predicted joint positions can be inaccurate.
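A simplified sketch of this idea: learn per-joint (spatial) and per-frame (temporal) attention weights and reweight the features, so key stages and reliable joints contribute more. Real skeleton-attention models are typically more elaborate; module names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Reweights skeleton features with per-joint and per-frame
    attention (a simplified illustration of the idea)."""
    def __init__(self, dim):
        super().__init__()
        self.joint_score = nn.Linear(dim, 1)  # scores each joint
        self.frame_score = nn.Linear(dim, 1)  # scores each frame

    def forward(self, x):  # x: (batch, frames, joints, dim)
        a_joint = torch.softmax(self.joint_score(x), dim=2)           # over joints
        a_frame = torch.softmax(self.frame_score(x.mean(dim=2)), dim=1)  # over frames
        return x * a_joint * a_frame.unsqueeze(2)

x = torch.randn(8, 64, 25, 128)  # batch, 64 frames, 25 joints, 128-d features
out = SpatioTemporalAttention(128)(x)
```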
In this work, we propose a multi-task learning framework to predict the steering angle and speed control simultaneously in an end-to-end manner.
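The standard pattern for such a multi-task setup is a shared backbone with two regression heads trained under a weighted joint loss; the layer sizes, input resolution, and loss weighting below are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DrivingNet(nn.Module):
    """Shared backbone with two heads predicting steering angle and
    speed jointly (sizes are illustrative assumptions)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 48, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(48, 128), nn.ReLU(),
        )
        self.steer_head = nn.Linear(128, 1)
        self.speed_head = nn.Linear(128, 1)

    def forward(self, img):
        h = self.backbone(img)
        return self.steer_head(h), self.speed_head(h)

model = DrivingNet()
img = torch.randn(4, 3, 66, 200)
steer_pred, speed_pred = model(img)
steer_gt, speed_gt = torch.randn(4, 1), torch.randn(4, 1)
# Weighted sum of per-task losses; 0.5 is an illustrative weight.
loss = nn.functional.mse_loss(steer_pred, steer_gt) \
     + 0.5 * nn.functional.mse_loss(speed_pred, speed_gt)
```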