Search Results

Perceiver: General Perception with Iterative Attention

towhee-io/towhee 4 Mar 2021

Perception models used in deep learning, on the other hand, are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models.

Tasks: 3D Point Cloud Classification, Audio Classification (+1 more)
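The Perceiver sidesteps modality-specific assumptions by repeatedly cross-attending a small learned latent array to a large, flat input array, so per-layer cost stays linear in the input size. A minimal numpy sketch of that iterative cross-attention, assuming a single head and omitting the learned query/key/value projections for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs):
    # latents: (N, D) small latent array; inputs: (M, D) flat input array, M >> N.
    # Single-head attention without learned projections -- a simplification.
    scores = latents @ inputs.T / np.sqrt(latents.shape[-1])  # (N, M)
    return softmax(scores, axis=-1) @ inputs                  # (N, D)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))    # N=8 latents
inputs = rng.normal(size=(1024, 16))  # M=1024 input elements (pixels, audio samples, points)
for _ in range(3):                    # iterative attention over the same inputs
    latents = cross_attend(latents, inputs)
print(latents.shape)  # (8, 16)
```

Because the latent size N is fixed, each cross-attention costs O(N·M) rather than the O(M²) of full self-attention over the raw input, which is what lets one architecture handle images, audio, and point clouds.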

Disentangled Representation Learning for Text-Video Retrieval

towhee-io/towhee 14 Mar 2022

Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how the factors used to compute this interaction affect performance.

Ranked #9 on Video Retrieval on MSR-VTT-1kA (using extra training data)

Tasks: Representation Learning, Retrieval (+1 more)

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

towhee-io/towhee 11 Jul 2022

Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way.

Tasks: Image Classification, Instance Segmentation (+4 more)
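The "invertible down-sampling" in Wave-ViT refers to a wavelet transform that halves spatial resolution without discarding information: the 2D Haar transform maps an (H, W) feature map to four (H/2, W/2) subbands and can be exactly inverted. A numpy sketch of that roundtrip (illustrative of the principle, not the paper's implementation):

```python
import numpy as np

def haar_dwt2(x):
    # 2x2 Haar transform: split an (H, W) map into four (H/2, W/2) subbands.
    a = x[0::2, 0::2]; b = x[0::2, 1::2]; c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2  # low-frequency average
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    # Exact inverse: re-interleave the four subbands into the original map.
    h2, w2 = ll.shape
    x = np.empty((2 * h2, 2 * w2))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
bands = haar_dwt2(img)
print(np.allclose(haar_idwt2(*bands), img))  # True
```

Unlike strided pooling, nothing is lost in the down-sampling itself; self-attention can then operate on the lower-resolution subbands at reduced cost.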

Omnivore: A Single Model for Many Visual Modalities

towhee-io/towhee CVPR 2022

Prior work has studied different visual modalities in isolation and developed separate architectures for recognition of images, videos, and 3D data.

Ranked #1 on Scene Recognition on SUN-RGBD (using extra training data)

Tasks: Action Classification, Action Recognition (+3 more)

Video Swin Transformer

towhee-io/towhee CVPR 2022

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.

Ranked #28 on Action Classification on Kinetics-600 (using extra training data)

Tasks: Action Classification, Action Recognition (+5 more)

Deep Residual Learning for Image Recognition

towhee-io/towhee CVPR 2016

Deep residual nets are the foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Tasks: Domain Generalization (+11 more)
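The core idea of residual learning is that each block outputs F(x) + x: layers learn a residual F relative to the identity shortcut rather than a full mapping. A minimal numpy sketch using fully connected layers as a stand-in for the paper's convolutional blocks:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x) = W2 @ relu(W1 @ x); the block outputs F(x) + x (shortcut connection).
    return w2 @ relu(w1 @ x) + x

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(d,))
w1 = rng.normal(scale=0.1, size=(d, d))
w2 = np.zeros((d, d))  # zero residual branch => the block is exactly the identity
y = residual_block(x, w1, w2)
print(np.allclose(y, x))  # True
```

With the residual branch near zero, stacking many blocks cannot degrade the signal, which is why nets over a hundred layers deep remain trainable where plain stacks do not.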

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

towhee-io/towhee 28 Jan 2022

Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.

Ranked #3 on Open Vocabulary Attribute Detection on OVAD-Box benchmark (using extra training data)

Tasks: Image Captioning, Image-text matching (+5 more)

Learning Transferable Visual Models From Natural Language Supervision

towhee-io/towhee 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.

Ranked #1 on Zero-Shot Learning on COCO-MLT (using extra training data)

Tasks: Benchmarking, Few-Shot Image Classification (+13 more)

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

towhee-io/towhee 18 Apr 2021

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

Tasks: Retrieval, Text Retrieval (+4 more)
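The simplest similarity calculator CLIP4Clip studies is parameter-free: encode sampled frames with the CLIP image encoder, mean-pool them into a single video embedding, and score against the CLIP text embedding by cosine similarity. A numpy sketch of that scoring step, with random vectors standing in for actual CLIP embeddings:

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def video_text_similarity(frame_embs, text_emb):
    # frame_embs: (T, D) per-frame image embeddings; text_emb: (D,) text embedding.
    # Mean-pool frames into one video embedding, then cosine similarity --
    # the parameter-free aggregation; other variants in the paper use a
    # Transformer or LSTM over the frame sequence instead.
    video_emb = l2norm(frame_embs.mean(axis=0))
    return float(video_emb @ l2norm(text_emb))

rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 512))  # 12 sampled frames, 512-d embeddings
text = rng.normal(size=(512,))
sim = video_text_similarity(frames, text)
print(-1.0 <= sim <= 1.0)  # True
```

Ranking all candidate videos by this score gives text-to-video retrieval end-to-end, with no new parameters beyond CLIP itself.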

MDMMT: Multidomain Multimodal Transformer for Video Retrieval

towhee-io/towhee 19 Mar 2021

We present a new state-of-the-art on the text to video retrieval task on MSRVTT and LSMDC benchmarks where our model outperforms all previous solutions by a large margin.

Ranked #23 on Video Retrieval on LSMDC (using extra training data)

Tasks: Retrieval, Text to Video Retrieval (+1 more)