But the most encouraging finding is that with much less training overhead and parameters, SimREC can still achieve better performance than a set of large-scale pre-trained models, e. g., UNITER and VILLA, portraying the special role of REC in existing V&L research.
Despite the exciting performance, Transformer is criticized for its excessive parameters and computation cost.
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e. g., phrase localization, referring expression comprehension (REC) and segmentation (RES).
Based on the LaConv module, we further build the first fully language-driven convolution network, termed as LaConvNet, which can unify the visual recognition and multi-modal reasoning in one forward structure.
The latter contains a Global Adaptive Controller that can adaptively fuse the global information into the decoder to guide the caption generation.
In addition, we address a key challenge in this multi-task setup, i. e., the prediction conflict, with two innovative designs namely, Consistency Energy Maximization (CEM) and Adaptive Soft Non-Located Suppression (ASNLS).
Referring Expression Comprehension (REC) is an emerging research spot in computer vision, which refers to detecting the target region in an image given an text description.