Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.
Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.
Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text.
However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans.
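To make the mismatch concrete, here is a minimal sketch (not the paper's code) that scores the same text query against the whole image and against cropped regions using OpenAI's CLIP package; the image path, query, and region boxes are illustrative placeholders. Whole-image matching mirrors CLIP's pretraining setup, while cropped regions fall outside that distribution, which is the domain shift described above.

```python
# Illustrative sketch only: compare whole-image vs. region-level CLIP scores.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = Image.open("example.jpg")                    # hypothetical input image
boxes = [(10, 20, 200, 180), (150, 40, 400, 300)]    # hypothetical (x1, y1, x2, y2) regions
text = clip.tokenize(["a dog catching a frisbee"]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Whole-image score: matches the image-to-caption setup CLIP was trained on.
    img_feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    whole_score = (img_feat @ text_feat.T).item()

    # Region scores: cropped boxes are out of CLIP's whole-image training
    # distribution, which is why region-level recognition degrades.
    region_scores = []
    for box in boxes:
        feat = model.encode_image(preprocess(image.crop(box)).unsqueeze(0).to(device))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        region_scores.append((feat @ text_feat.T).item())

print(whole_score, region_scores)
```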
This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often computationally intensive if trained from scratch; 2) the discriminative clues, i.e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations.
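A rough sketch of the first observation, using toy encoders rather than the paper's actual model: the spatial blocks of a video transformer are initialized from an image transformer, and the 2D positional embedding is repeated along the time axis so the spatial prior carries over.

```python
# Minimal sketch (assumed, not the paper's code): initialize the spatial part of a
# video transformer from an image transformer so it inherits spatial priors.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):          # stand-in for a pretrained image ViT
    def __init__(self, dim=256, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.pos = nn.Parameter(torch.zeros(1, 196, dim))          # 14x14 patches

class VideoEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, frames=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)  # spatial blocks
        self.pos = nn.Parameter(torch.zeros(1, frames, 196, dim))
        self.temporal = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

image_model = ImageEncoder()            # pretend this holds image-pretrained weights
video_model = VideoEncoder()

# Copy the spatial blocks verbatim; they carry the spatial prior.
video_model.blocks.load_state_dict(image_model.blocks.state_dict())
# Repeat the 2D positional embedding across the time axis.
video_model.pos.data.copy_(image_model.pos.data.unsqueeze(1).expand_as(video_model.pos))
```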
Ranked #3 on Action Recognition on Diving-48
1 code implementation • 22 Nov 2021 • Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.
Ranked #1 on Action Recognition In Videos on Kinetics-600
Large-scale multimodal contrastive pretraining has demonstrated great utility to support high performance in a range of downstream tasks by mapping multiple modalities into a shared embedding space.
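For reference, the core of such contrastive pretraining is a symmetric InfoNCE objective over in-batch image-text pairs; the sketch below is generic (names, dimensions, and the temperature value are illustrative), not any specific model's implementation.

```python
# Sketch of the symmetric contrastive (InfoNCE) loss that maps two modalities
# into a shared embedding space.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched image-text pairs sit on the diagonal; pull them together and push
    # apart all other in-batch pairs, in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```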
1 code implementation • 8 Jun 2021 • Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.
This work concerns video-language pre-training and representation learning.
Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language.
Experiments on text-to-video retrieval and video question answering across six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos. This suggests that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
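The sparse-sampling idea can be illustrated with a toy sketch: score a handful of short, evenly spaced clips with a clip-level model and average the predictions, instead of encoding the full video offline. The scorer below is a placeholder, not ClipBERT itself.

```python
# Toy sketch of sparse clip sampling with late fusion over clip-level predictions.
import torch
import torch.nn as nn

def sample_clips(video, num_clips=4, clip_len=2):
    # video: (T, C, H, W); pick num_clips evenly spaced starting frames.
    T = video.size(0)
    starts = torch.linspace(0, T - clip_len, num_clips).long()
    return torch.stack([video[s:s + clip_len] for s in starts])  # (num_clips, clip_len, C, H, W)

class ClipScorer(nn.Module):                 # stand-in for a clip-level V+L model
    def __init__(self, num_classes=10):
        super().__init__()
        self.head = nn.Linear(3 * 2 * 32 * 32, num_classes)
    def forward(self, clip):                 # clip: (clip_len, C, H, W)
        return self.head(clip.flatten())

video = torch.randn(64, 3, 32, 32)           # toy 64-frame video
scorer = ClipScorer()
clips = sample_clips(video)
# Per-clip predictions are averaged at inference time.
logits = torch.stack([scorer(c) for c in clips]).mean(dim=0)
```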
Ranked #4 on Visual Question Answering on MSRVTT-QA (using extra training data)
Additionally, we collect the first dataset, Surgical Hands, that provides multi-instance articulated hand pose annotations for in-vivo videos.
Transformer has become ubiquitous in the deep learning field.
Ranked #1 on Question Answering on Quasar-T
The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models.
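A simplified sketch of that unified design, under the assumption of a single stacked self-attention encoder: the same weights serve both modes, and only the attention mask changes, bidirectional for understanding and causal over the text span for generation.

```python
# Sketch (assumed, simplified): one shared transformer, two self-attention masks.
import torch
import torch.nn as nn

dim, seq_len, img_len = 256, 20, 8     # 8 visual tokens followed by 12 text tokens
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
shared = nn.TransformerEncoder(layer, num_layers=4)   # same weights for both modes

tokens = torch.randn(2, seq_len, dim)                 # [visual tokens ; text tokens]

# Understanding mode (e.g., VQA): every token may attend to every other token.
bidir_mask = torch.zeros(seq_len, seq_len)
understanding_out = shared(tokens, mask=bidir_mask)

# Generation mode (e.g., captioning): all tokens see the visual tokens,
# while text tokens attend only to earlier text tokens.
gen_mask = torch.full((seq_len, seq_len), float("-inf"))
gen_mask[:, :img_len] = 0.0                           # image tokens fully visible
causal = torch.triu(torch.full((seq_len - img_len,) * 2, float("-inf")), diagonal=1)
gen_mask[img_len:, img_len:] = causal                 # causal mask over the text span
generation_out = shared(tokens, mask=gen_mask)
```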
Ranked #1 on Image Captioning on Flickr30k Captions test
Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase.
Video action recognition, a critical problem in video understanding, has been gaining increasing attention.
A naive extension of this approach to the video domain is to treat the entire segment as a bag of spatial object proposals.
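As a toy illustration of that baseline (shapes and features are made up), region features from every frame of the segment are flattened into one bag and attended over jointly, discarding which frame each proposal came from.

```python
# Toy sketch of the "bag of spatial proposals" baseline for a video segment.
import torch
import torch.nn.functional as F

frames, proposals_per_frame, feat_dim = 10, 20, 256
region_feats = torch.randn(frames, proposals_per_frame, feat_dim)

bag = region_feats.reshape(-1, feat_dim)        # (frames * proposals, feat_dim)

query = torch.randn(feat_dim)                   # e.g., a phrase embedding to ground
attn = F.softmax(bag @ query / feat_dim ** 0.5, dim=0)
grounded_feat = attn @ bag                      # attention-weighted summary of the bag
```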
To address this problem, we propose an end-to-end transformer model for dense video captioning.
Ranked #4 on Video Captioning on YouCook2
To answer this question, we introduce the problem of procedure segmentation: segmenting a video procedure into category-independent procedure segments.
Attention mechanisms have attracted considerable interest in image captioning due to their strong performance.
In this paper, a novel algorithm named negotiation-based MARL with sparse interactions (NegoSI) is presented.