The perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models.
Ranked #29 on Audio Classification on AudioSet
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how the different factors involved in computing this interaction affect performance.
Ranked #9 on Video Retrieval on MSR-VTT-1kA (using extra training data)
Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way (see the wavelet sketch below).
Ranked #212 on Image Classification on ImageNet
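To make the idea of invertible down-sampling concrete, here is a minimal sketch of the textbook 2D Haar wavelet transform on a single-channel array: the forward transform halves the spatial resolution while splitting the signal into four sub-bands, and the inverse reconstructs the input exactly. This is only an illustrative NumPy sketch, not Wave-ViT's actual module; the function names are assumptions.

```python
import numpy as np

def haar_dwt2d(x):
    # Forward 2D Haar transform: x has shape (H, W) with H, W even.
    # Returns four sub-bands (LL, LH, HL, HH), each of shape (H/2, W/2).
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a + b - c - d) / 2.0
    hl = (a - b + c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_idwt2d(ll, lh, hl, hh):
    # Exact inverse: rebuilds the original (H, W) array from the sub-bands.
    H, W = ll.shape
    x = np.empty((2 * H, 2 * W), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

# img = np.random.rand(8, 8)
# assert np.allclose(img, haar_idwt2d(*haar_dwt2d(img)))  # lossless round trip
```

Because the transform is invertible, no information is discarded at the down-sampling step, which is the property the abstract highlights.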
Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.
Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.
Ranked #28 on Action Classification on Kinetics-600 (using extra training data)
Deep residual nets are the foundation of our submissions to the ILSVRC & COCO 2015 competitions, where we also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation (a minimal residual-block sketch follows below).
Ranked #1 on Image Classification on CIFAR-100
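For context, below is a minimal PyTorch sketch of the identity-shortcut residual block that deep residual nets are built from. It is a generic illustration of the residual idea, not the authors' exact ImageNet/COCO architecture; the class name and same-channel, stride-1 setting are assumptions.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Identity-shortcut residual block: output = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The shortcut lets gradients flow directly, making very deep nets trainable.
        return torch.relu(out + residual)
```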
Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.
Ranked #3 on Open Vocabulary Attribute Detection on OVAD-Box benchmark (using extra training data)
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.
Ranked #1 on Zero-Shot Learning on COCO-MLT (using extra training data)
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
Ranked #1 on Text to Video Retrieval on MSR-VTT
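As a rough illustration of transferring CLIP to video-text retrieval in the spirit of CLIP4Clip, the sketch below encodes sampled frames with CLIP's image encoder, mean-pools them into a single video embedding, and scores it against the CLIP text embedding of a query. It uses the public `clip` package (`clip.load`, `encode_image`, `encode_text`); the frame sampling, normalization choices, and helper names are assumptions, and the actual model is fine-tuned end-to-end with several similarity calculators rather than used zero-shot as here.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_embedding(frames):
    """Mean-pool CLIP image features over a list of PIL frames (assumed pre-sampled)."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)              # (num_frames, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # normalize per frame
    return feats.mean(dim=0)                           # parameter-free pooling

def query_similarity(query, frames):
    """Cosine similarity between a text query and the pooled video embedding."""
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text = model.encode_text(tokens)[0]
    text = text / text.norm()
    video = video_embedding(frames)
    video = video / video.norm()
    return (text @ video).item()
```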
We present a new state of the art on the text-to-video retrieval task on the MSR-VTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin.
Ranked #23 on Video Retrieval on LSMDC (using extra training data)