TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

1 code implementation 23 May 2023 Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.

Representation Learning

What Makes for Good Visual Tokenizers for Large Language Models?

1 code implementation 20 May 2023 Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, Ying Shan

In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO), and observe that: i) Fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset.

Image Captioning Object Counting +2

$π$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

no code implementations 27 Apr 2023 Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo

Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.

Multi-Task Learning

Attack is Good Augmentation: Towards Skeleton-Contrastive Representation Learning

no code implementations 8 Apr 2023 Binqian Xu, Xiangbo Shu, Rui Yan, Guo-Sen Xie, Yixiao Ge, Mike Zheng Shou

In particular, we propose a novel Attack-Augmentation Mixing-Contrastive learning (A$^2$MC) to contrast hard positive features and hard negative features for learning more robust skeleton representations.

Action Recognition Contrastive Learning +4

TagGPT: Large Language Models are Zero-shot Multimodal Taggers

1 code implementation 6 Apr 2023 Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan

Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching with the tag set, and another selective method with early matching in prompts.

Optical Character Recognition (OCR) Prompt Engineering +4

Accelerating Vision-Language Pretraining with Free Language Modeling

1 code implementation CVPR 2023 Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo

FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted.

Language Modelling Masked Language Modeling

BoxSnake: Polygonal Instance Segmentation with Box Supervision

1 code implementation 21 Mar 2023 Rui Yang, Lin Song, Yixiao Ge, Xiu Li

Box-supervised instance segmentation has gained much attention as it requires only simple box annotations instead of costly mask or polygon annotations.

Box-supervised Instance Segmentation Semantic Segmentation

Binary Embedding-based Retrieval at Tencent

1 code implementation 17 Feb 2023 Yukang Gan, Yixiao Ge, Chang Zhou, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen, Yexin Wang, Ying Shan

To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension.

Binarization Retrieval

RILS: Masked Visual Reconstruction in Language Semantic Space

1 code implementation CVPR 2023 Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, XiaoHu Qie, Xinggang Wang

Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training.

Modeling Uncertain Feature Representation for Domain Generalization

1 code implementation 16 Jan 2023 Xiaotong Li, Zixuan Hu, Jun Liu, Yixiao Ge, Yongxing Dai, Ling-Yu Duan

In this paper, we improve the network generalization ability by modeling domain shifts with uncertainty (DSU), i.e., characterizing the feature statistics as uncertain distributions during training.

Domain Generalization Image Classification +3

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

2 code implementations 22 Dec 2022 Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, YuChao Gu, Yufei Shi, Wynne Hsu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator.

Style Transfer Text-to-Video Generation +1

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

no code implementations 6 Dec 2022 YuChao Gu, Xintao Wang, Yixiao Ge, Ying Shan, XiaoHu Qie, Mike Zheng Shou

Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers.

Conditional Image Generation

Darwinian Model Upgrades: Model Evolving with Selective Compatibility

no code implementations 13 Oct 2022 Binjie Zhang, Shupeng Su, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Mike Zheng Shou, Ying Shan

The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed as "backfilling"), which is quite expensive and time-consuming considering billions of instances in industrial applications.

Face Recognition Retrieval

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

1 code implementation CVPR 2023 Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years.

Representation Learning Video Understanding

Equivariant Filter Design for Discrete-time systems

no code implementations 12 Sep 2022 Yixiao Ge, Pieter van Goor, Robert Mahony

The kinematics of many nonlinear control systems, especially in the robotics field, admit a transitive Lie-group symmetry, which is useful in high performance observer design.

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space

1 code implementation 7 Jul 2022 Wenqi Shao, Xun Zhao, Yixiao Ge, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo

It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive.

Masked Image Modeling with Denoising Contrast

1 code implementation 19 May 2022 Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, XiaoHu Qie

As self-supervised visual representation learning has developed from contrastive learning to masked image modeling (MIM), its essence has not changed significantly: how to design proper pretext tasks for vision dictionary look-up.

Contrastive Learning Denoising +6

Privacy-Preserving Model Upgrades with Bidirectional Compatible Training in Image Retrieval

1 code implementation 29 Apr 2022 Shupeng Su, Binjie Zhang, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Ying Shan

The task of privacy-preserving model upgrades in image retrieval aims to reap the benefits of rapidly evolving new models without accessing the raw gallery images.

Image Retrieval Privacy Preserving +1

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

1 code implementation 26 Apr 2022 Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations but ignore detailed local semantics.

Action Recognition Retrieval +5

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

2 code implementations 6 Apr 2022 Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% $\sim$ 50% of the input embeddings.

Instance Segmentation object-detection +1

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

1 code implementation 29 Mar 2022 Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan

Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning.

Instance Segmentation object-detection +4

Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

2 code implementations 15 Mar 2022 Guanyu Cai, Yixiao Ge, Binjie Zhang, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, XiaoHu Qie, Jianping Wu, Mike Zheng Shou

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.

Question Answering Retrieval +4

All in One: Exploring Unified Video-Language Pre-training

1 code implementation CVPR 2023 Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we for the first time introduce an end-to-end video-language model, namely all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.

Language Modelling Multiple-choice +10

Towards Universal Backward-Compatible Representation Learning

1 code implementation 3 Mar 2022 Binjie Zhang, Yixiao Ge, Yantao Shen, Shupeng Su, Fanzi Wu, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

The task of backward-compatible representation learning is therefore introduced to support backfill-free model upgrades, where the new query features are interoperable with the old gallery features.

Face Recognition Representation Learning

Uncertainty Modeling for Out-of-Distribution Generalization

1 code implementation ICLR 2022 Xiaotong Li, Yongxing Dai, Yixiao Ge, Jun Liu, Ying Shan, Ling-Yu Duan

In this paper, we improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.

Image Classification Out-of-Distribution Generalization +2

Hot-Refresh Model Upgrades with Regression-Alleviating Compatible Training in Image Retrieval

1 code implementation 24 Jan 2022 Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.

Image Retrieval regression +1

Bridging Video-text Retrieval with Multiple Choice Questions

2 code implementations CVPR 2022 Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo

As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.

Action Recognition Multiple-choice +8

Video-Text Pre-training with Learned Regions

1 code implementation 2 Dec 2021 Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang

Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.

Representation Learning Retrieval +2

Object-aware Video-language Pre-training for Retrieval

1 code implementation CVPR 2022 Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we present Object-aware Transformers, an object-centric approach that extends the video-language transformer to incorporate object representations.

Retrieval Text Matching

Mutual CRF-GNN for Few-Shot Learning

no code implementations CVPR 2021 Shixiang Tang, Dapeng Chen, Lei Bai, Kaijian Liu, Yixiao Ge, Wanli Ouyang

In this MCGN, the labels and features of support data are used by the CRF for inferring GNN affinities in a principled and probabilistic way.

Few-Shot Learning

Refining Pseudo Labels with Clustering Consensus over Generations for Unsupervised Object Re-identification

1 code implementation CVPR 2021 Xiao Zhang, Yixiao Ge, Yu Qiao, Hongsheng Li

Unsupervised object re-identification aims to learn discriminative representations for object retrieval without any annotations.

Pseudo Label Retrieval

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network

1 code implementation CVPR 2021 Rui Liu, Yixiao Ge, Ching Lam Choi, Xiaogang Wang, Hongsheng Li

Conditional generative adversarial networks (cGANs) aim to synthesize diverse images given the input conditions and latent codes, but unfortunately, they usually suffer from mode collapse.

Contrastive Learning Image Generation

Progressive Correspondence Pruning by Consensus Learning

no code implementations ICCV 2021 Chen Zhao, Yixiao Ge, Feng Zhu, Rui Zhao, Hongsheng Li, Mathieu Salzmann

Correspondence selection aims to correctly select the consistent matches (inliers) from an initial set of putative correspondences.

Denoising Pose Estimation +1

Improved Mutual Mean-Teaching for Unsupervised Domain Adaptive Re-ID

2 code implementations 24 Aug 2020 Yixiao Ge, Shijie Yu, Dapeng Chen

SDA, a domain-translation-based framework, focuses on carefully translating the source-domain images to the target domain.

Domain Adaptation Pseudo Label +1

Self-supervising Fine-grained Region Similarities for Large-scale Image Localization

3 code implementations ECCV 2020 Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, Hongsheng Li

The task of large-scale retrieval-based image localization is to estimate the geographical location of a query image by recognizing its nearest reference images from a city-scale dataset.

Image Retrieval Retrieval

Structured Domain Adaptation with Online Relation Regularization for Unsupervised Person Re-ID

4 code implementations 14 Mar 2020 Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, Xiaogang Wang, Hongsheng Li

To tackle the challenges, we propose an end-to-end structured domain adaptation framework with an online relation-consistency regularization term.

Pseudo Label Translation +2

Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification

2 code implementations ICLR 2020 Yixiao Ge, Dapeng Chen, Hongsheng Li

To mitigate the effects of noisy pseudo labels, we propose to softly refine the pseudo labels in the target domain with an unsupervised framework, Mutual Mean-Teaching (MMT), which learns better features from the target domain via off-line refined hard pseudo labels and on-line refined soft pseudo labels in an alternating training manner.

Pseudo Label Unsupervised Domain Adaptation +1

FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification

2 code implementations NeurIPS 2018 Yixiao Ge, Zhuowan Li, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, Hongsheng Li

Our proposed FD-GAN achieves state-of-the-art performance on three person reID datasets, which demonstrates the effectiveness and robust feature distilling capability of the proposed FD-GAN.

Person Re-Identification

