Search Results

Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching

towhee-io/towhee 21 Jun 2022

With the increased accessibility of the web and online encyclopedias, the amount of data to manage is constantly increasing.

HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

towhee-io/towhee 28 Jul 2022

In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework.

Image Classification Object Detection +2

A ConvNet for the 2020s

towhee-io/towhee CVPR 2022

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model.

Ranked #3 on Domain Generalization on ImageNet-Sketch (using extra training data)

Domain Generalization Image Classification +3

Contrastive Learning with Large Memory Bank and Negative Embedding Subtraction for Accurate Copy Detection

towhee-io/towhee 8 Dec 2021

Copy detection, which is a task to determine whether an image is a modified copy of any image in a database, is an unsolved problem.

Contrastive Learning Copy Detection +1

RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality

towhee-io/towhee CVPR 2022

Our results reveal that 1) Locality Injection is a general methodology for MLP models; 2) RepMLPNet has a favorable accuracy-efficiency trade-off compared to the other MLPs; 3) RepMLPNet is the first MLP that seamlessly transfers to Cityscapes semantic segmentation.

Image Classification Semantic Segmentation

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

towhee-io/towhee 11 Jul 2022

Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates the invertible down-sampling with wavelet transforms and self-attention learning in a unified way.

Image Classification Instance Segmentation +4

Learning Transferable Visual Models From Natural Language Supervision

towhee-io/towhee 26 Feb 2021

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories.

Action Recognition Few-Shot Image Classification +7

Disentangled Representation Learning for Text-Video Retrieval

towhee-io/towhee 14 Mar 2022

Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how different influencing factors for computing interaction affect performance.

Ranked #4 on Video Retrieval on MSR-VTT-1kA (using extra training data)

Representation Learning Video Retrieval

MaxViT: Multi-Axis Vision Transformer

towhee-io/towhee 4 Apr 2022

We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module.

Ranked #15 on Image Classification on ImageNet (using extra training data)

Image Classification Object Detection +1

Swin Transformer V2: Scaling Up Capacity and Resolution

towhee-io/towhee CVPR 2022

Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images.

Ranked #2 on Instance Segmentation on COCO minival (using extra training data)

Action Classification Image Classification +4