2 code implementations • NeurIPS 2018 • Yixiao Ge, Zhuowan Li, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, Hongsheng Li
Our proposed FD-GAN achieves state-of-the-art performance on three person reID datasets, which demonstrates the effectiveness and robust feature-distilling capability of the proposed FD-GAN.
Ranked #3 on Person Re-Identification on CUHK03
2 code implementations • ICLR 2020 • Yixiao Ge, Dapeng Chen, Hongsheng Li
To mitigate the effects of noisy pseudo labels, we propose to softly refine the pseudo labels in the target domain with an unsupervised framework, Mutual Mean-Teaching (MMT), which learns better features from the target domain via off-line refined hard pseudo labels and on-line refined soft pseudo labels in an alternating training manner.
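The excerpt above does not spell out how the soft pseudo labels are produced. As a rough illustration only, the following sketch assumes a generic mean-teacher setup: a temporally averaged copy of the weights supplies soft labels, and a student is trained against them with a soft cross-entropy. The function names and the `alpha` momentum value are hypothetical and are not taken from the MMT code.

```python
import math


def ema_update(mean_weights, current_weights, alpha=0.999):
    """Temporal average: mean <- alpha * mean + (1 - alpha) * current."""
    return [alpha * m + (1.0 - alpha) * w
            for m, w in zip(mean_weights, current_weights)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student's prediction against the teacher's soft label."""
    soft_label = softmax(teacher_logits)
    student_prob = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(soft_label, student_prob))


if __name__ == "__main__":
    # Hypothetical toy values: one weight vector and one pair of logits.
    mean_w = ema_update([1.0, -1.0], [0.5, 0.5], alpha=0.9)
    loss = soft_cross_entropy([1.5, 0.2, -0.3], [2.0, 0.1, -1.0])
    print(mean_w, loss)
```

The soft loss is smallest when the student's distribution matches the teacher's, which is what makes a soft label gentler than a hard one-hot label when the pseudo label may be wrong.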
4 code implementations • 14 Mar 2020 • Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, Xiaogang Wang, Hongsheng Li
To tackle the challenges, we propose an end-to-end structured domain adaptation framework with an online relation-consistency regularization term.
Ranked #4 on Unsupervised Domain Adaptation on Market to MSMT
3 code implementations • NeurIPS 2020 • Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, Hongsheng Li
To solve these problems, we propose a novel self-paced contrastive learning framework with hybrid memory.
Ranked #3 on Unsupervised Domain Adaptation on Market to MSMT
3 code implementations • ECCV 2020 • Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, Hongsheng Li
The task of large-scale retrieval-based image localization is to estimate the geographical location of a query image by recognizing its nearest reference images from a city-scale dataset.
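The task definition above can be sketched as a nearest-neighbour lookup. This is a minimal illustration, assuming each reference image is already represented by a descriptor vector and a latitude/longitude pair; the dictionary layout and the `localize` helper are hypothetical, and a city-scale system would use a learned descriptor and an approximate index rather than a full sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def localize(query_descriptor, references, top_k=1):
    """Estimate the query location as the mean position of its top_k nearest references."""
    ranked = sorted(
        references,
        key=lambda ref: cosine(query_descriptor, ref["descriptor"]),
        reverse=True,
    )
    nearest = ranked[:top_k]
    lat = sum(ref["lat"] for ref in nearest) / len(nearest)
    lon = sum(ref["lon"] for ref in nearest) / len(nearest)
    return lat, lon


if __name__ == "__main__":
    # Hypothetical two-image reference set with made-up coordinates.
    refs = [
        {"descriptor": [1.0, 0.0], "lat": 1.0, "lon": 2.0},
        {"descriptor": [0.0, 1.0], "lat": 3.0, "lon": 4.0},
    ]
    print(localize([0.9, 0.1], refs))
```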
2 code implementations • 24 Aug 2020 • Yixiao Ge, Shijie Yu, Dapeng Chen
SDA, a domain-translation-based framework, focuses on carefully translating the source-domain images to the target domain.
no code implementations • ICCV 2021 • Yi Zheng, Shixiang Tang, Guolong Teng, Yixiao Ge, Kaijian Liu, Jing Qin, Donglian Qi, Dapeng Chen
To tackle the problem, we propose online pseudo-label generation by hierarchical cluster dynamics for adaptive ReID.
1 code implementation • ICCV 2021 • Chen Zhao, Yixiao Ge, Feng Zhu, Rui Zhao, Hongsheng Li, Mathieu Salzmann
Correspondence selection aims to correctly select the consistent matches (inliers) from an initial set of putative correspondences.
1 code implementation • CVPR 2021 • Rui Liu, Yixiao Ge, Ching Lam Choi, Xiaogang Wang, Hongsheng Li
Conditional generative adversarial networks (cGANs) aim to synthesize diverse images given the input conditions and latent codes, but unfortunately, they usually suffer from the issue of mode collapse.
no code implementations • 27 Apr 2021 • Yixiao Ge, Xiao Zhang, Ching Lam Choi, Ka Chun Cheung, Peipei Zhao, Feng Zhu, Xiaogang Wang, Rui Zhao, Hongsheng Li
In this way, our BAKE framework achieves online knowledge ensembling across multiple samples with only a single network.
1 code implementation • CVPR 2021 • Xiao Zhang, Yixiao Ge, Yu Qiao, Hongsheng Li
Unsupervised object re-identification aims to learn discriminative representations for object retrieval without any annotations.
no code implementations • CVPR 2021 • Shixiang Tang, Dapeng Chen, Lei Bai, Kaijian Liu, Yixiao Ge, Wanli Ouyang
In this MCGN, the labels and features of support data are used by the CRF for inferring GNN affinities in a principled and probabilistic way.
no code implementations • ICLR 2022 • Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan
In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.
1 code implementation • CVPR 2022 • Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, XiaoHu Qie, Mike Zheng Shou
In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object representations.
Ranked #20 on Zero-Shot Video Retrieval on DiDeMo
1 code implementation • 2 Dec 2021 • Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.
1 code implementation • ICLR 2022 • Wenqi Shao, Yixiao Ge, Zhaoyang Zhang, Xuyuan Xu, Xiaogang Wang, Ying Shan, Ping Luo
It is difficult for Transformers to capture inductive bias such as the positional context in an image with LN.
2 code implementations • CVPR 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo
As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
Ranked #8 on Zero-Shot Video Retrieval on MSVD
1 code implementation • 24 Jan 2022 • Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan
In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.
1 code implementation • ICLR 2022 • Xiaotong Li, Yongxing Dai, Yixiao Ge, Jun Liu, Ying Shan, Ling-Yu Duan
In this paper, we improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.
2 code implementations • 3 Mar 2022 • Binjie Zhang, Yixiao Ge, Yantao Shen, Shupeng Su, Fanzi Wu, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan
The task of backward-compatible representation learning is therefore introduced to support backfill-free model upgrades, where the new query features are interoperable with the old gallery features.
1 code implementation • CVPR 2023 • Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou
In this work, we introduce, for the first time, an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.
Ranked #6 on TGIF-Transition on TGIF-QA (using extra training data)
2 code implementations • 15 Mar 2022 • Guanyu Cai, Yixiao Ge, Binjie Zhang, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, XiaoHu Qie, Jianping Wu, Mike Zheng Shou
Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.
1 code implementation • 29 Mar 2022 • Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan
Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning.
2 code implementations • ICCV 2023 • Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang
We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% to 50% of the input embeddings.
1 code implementation • 26 Apr 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo
Dominant pre-training works for video-text retrieval mainly adopt "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but ignore detailed local semantics.
Ranked #7 on Zero-Shot Video Retrieval on MSVD
1 code implementation • 29 Apr 2022 • Shupeng Su, Binjie Zhang, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Ying Shan
The task of privacy-preserving model upgrades in image retrieval aims to reap the benefits of rapidly evolving new models without accessing the raw gallery images.
1 code implementation • 19 May 2022 • Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, XiaoHu Qie
As self-supervised visual representation learning has developed from contrastive learning to masked image modeling (MIM), its essence has not changed significantly: how to design proper pretext tasks for vision dictionary look-up.
1 code implementation • 7 Jul 2022 • Wenqi Shao, Xun Zhao, Yixiao Ge, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo
It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive.
Ranked #2 on Transferability on classification benchmark
no code implementations • 12 Sep 2022 • Yixiao Ge, Pieter van Goor, Robert Mahony
The kinematics of many nonlinear control systems, especially in the robotics field, admit a transitive Lie-group symmetry, which is useful in high performance observer design.
1 code implementation • CVPR 2023 • Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years.
no code implementations • 13 Oct 2022 • Binjie Zhang, Shupeng Su, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Mike Zheng Shou, Ying Shan
The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed "backfilling"), which is quite expensive and time-consuming considering the billions of instances in industrial applications.
no code implementations • 6 Dec 2022 • YuChao Gu, Xintao Wang, Yixiao Ge, Ying Shan, XiaoHu Qie, Mike Zheng Shou
Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers.
3 code implementations • ICCV 2023 • Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, YuChao Gu, Yufei Shi, Wynne Hsu, Ying Shan, XiaoHu Qie, Mike Zheng Shou
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator.
1 code implementation • 16 Jan 2023 • Xiaotong Li, Zixuan Hu, Jun Liu, Yixiao Ge, Yongxing Dai, Ling-Yu Duan
In this paper, we improve the network generalization ability by modeling domain shifts with uncertainty (DSU), i.e., characterizing the feature statistics as uncertain distributions during training.
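One way to read "feature statistics as uncertain distributions" is sketched below. It is an illustration under stated assumptions, not the authors' implementation: each channel's mean and standard deviation are treated as Gaussian random variables whose spread is estimated across the mini-batch, new statistics are sampled, and the features are re-normalized with them. The nested-list layout `batch[b][c]` (a flat list of spatial activations) and the `noise_scale` parameter are hypothetical.

```python
import math
import random


def channel_stats(values, eps=1e-6):
    """Mean and standard deviation of one channel's spatial activations."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def spread(samples, eps=1e-6):
    """Standard deviation of a statistic across the mini-batch."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return math.sqrt(var + eps)


def perturb_statistics(batch, rng, noise_scale=1.0):
    """Resample per-channel mean/std from Gaussians and re-normalize the features."""
    n_samples, n_channels = len(batch), len(batch[0])
    stats = [[channel_stats(channel) for channel in sample] for sample in batch]
    # Uncertainty of each statistic, estimated per channel across the batch.
    mean_spread = [spread([stats[b][c][0] for b in range(n_samples)])
                   for c in range(n_channels)]
    std_spread = [spread([stats[b][c][1] for b in range(n_samples)])
                  for c in range(n_channels)]
    output = []
    for b, sample in enumerate(batch):
        new_sample = []
        for c, channel in enumerate(sample):
            mean, std = stats[b][c]
            new_mean = mean + noise_scale * rng.gauss(0.0, 1.0) * mean_spread[c]
            new_std = std + noise_scale * rng.gauss(0.0, 1.0) * std_spread[c]
            # Standardize with the original statistics, then apply the sampled ones.
            new_sample.append([(v - mean) / std * new_std + new_mean for v in channel])
        output.append(new_sample)
    return output


if __name__ == "__main__":
    # Hypothetical toy batch: 2 samples, 2 channels, 3 spatial positions each.
    toy = [[[1.0, 2.0, 3.0], [0.0, 4.0, 8.0]],
           [[2.0, 2.5, 3.0], [1.0, 1.0, 5.0]]]
    print(perturb_statistics(toy, random.Random(0)))
```

With `noise_scale=0.0` the function returns the input unchanged (up to floating-point error), which corresponds to switching the perturbation off at test time.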
1 code implementation • CVPR 2023 • Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, XiaoHu Qie, Xinggang Wang
Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training.
1 code implementation • 17 Feb 2023 • Yukang Gan, Yixiao Ge, Chang Zhou, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen, Yexin Wang, Ying Shan
To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension.
1 code implementation • ICCV 2023 • Rui Yang, Lin Song, Yixiao Ge, Xiu Li
Box-supervised instance segmentation has gained much attention as it requires only simple box annotations instead of costly mask or polygon annotations.
1 code implementation • CVPR 2023 • Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo
FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted.
1 code implementation • 6 Apr 2023 • Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan
Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching with the tag set, and another selective method with early matching in prompts.
no code implementations • 8 Apr 2023 • Binqian Xu, Xiangbo Shu, Rui Yan, Guo-Sen Xie, Yixiao Ge, Mike Zheng Shou
In particular, we propose a novel Attack-Augmentation Mixing-Contrastive learning (A$^2$MC) to contrast hard positive features and hard negative features for learning more robust skeleton representations.
1 code implementation • 27 Apr 2023 • Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo
Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.
1 code implementation • 20 May 2023 • Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, Ying Shan
In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO), and observe that: i) Fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset.
1 code implementation • 23 May 2023 • Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan
We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.
2 code implementations • NeurIPS 2023 • YuChao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou
Public large-scale text-to-image diffusion models, such as Stable Diffusion, have gained significant attention from the community.
1 code implementation • NeurIPS 2023 • Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan
This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools.
no code implementations • 12 Jun 2023 • Sijie Zhao, Yixiao Ge, Zhongang Qi, Lin Song, Xiaohan Ding, Zehua Xie, Ying Shan
Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset.
no code implementations • 22 Jun 2023 • Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou
In situations involving system upgrades that require updating the upstream foundation model, it becomes essential to re-train all downstream modules to adapt to the new foundation model, which is inflexible and inefficient.
1 code implementation • 26 Jun 2023 • Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, Ying Shan
Art forms such as movies and television (TV) dramas are reflections of the real world, which have attracted much attention from the multimodal learning community recently.
1 code implementation • 29 Jun 2023 • Yunpeng Bai, Xintao Wang, Yan-Pei Cao, Yixiao Ge, Chun Yuan, Ying Shan
This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text.
1 code implementation • 16 Jul 2023 • Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.).
2 code implementations • 30 Jul 2023 • Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan
Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.
1 code implementation • 20 Aug 2023 • Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou
A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities.
Ranked #2 on Zero-Shot Transfer 3D Point Cloud Classification on ModelNet40 (using extra training data)
1 code implementation • ICCV 2023 • Xiaotong Li, Zixuan Hu, Yixiao Ge, Ying Shan, Ling-Yu Duan
The experimental results on 10 downstream tasks and 12 self-supervised models demonstrate that our approach can seamlessly integrate into existing ranking techniques and enhance their performances, revealing its effectiveness for the model selection task and its potential for understanding the mechanism in transfer learning.
no code implementations • 12 Sep 2023 • Yixiao Ge, Pieter van Goor, Robert Mahony
With this structure, we show that a naive coordinate implementation of the EKF fails to account for geometry of the manifold in the update step and in the reset step.
1 code implementation • 27 Sep 2023 • Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li
Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours.
Ranked #5 on Zero-Shot Video Retrieval on LSMDC
1 code implementation • 2 Oct 2023 • Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan
We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.
1 code implementation • NeurIPS 2023 • Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan
Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed.
1 code implementation • 14 Nov 2023 • Chen Li, Yixiao Ge, Dian Li, Ying Shan
Instruction tuning is a crucial supervised training phase in Large Language Models (LLMs), aiming to enhance the LLM's ability to generalize instruction execution and adapt to user preferences.
1 code implementation • 27 Nov 2023 • Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou
In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.
2 code implementations • 27 Nov 2023 • Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan
1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels: they can see wide without going deep.
Ranked #1 on Object Detection on COCO 2017 (mAP metric)
1 code implementation • 28 Nov 2023 • Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan
Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3).
1 code implementation • 11 Dec 2023 • Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan
Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.
1 code implementation • 11 Dec 2023 • Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu
Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner is expected to predict feasible actions, which is a feat inherently achievable by Multimodal Large Language Models (MLLMs).
1 code implementation • 14 Dec 2023 • Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model.
1 code implementation • 20 Dec 2023 • Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo
This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens.
1 code implementation • 4 Jan 2024 • Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan
Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA.
no code implementations • 15 Jan 2024 • Jay Zhangjie Wu, Guian Fang, HaoNing Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, YuChao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.
1 code implementation • 18 Jan 2024 • Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan
Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years.
1 code implementation • 25 Jan 2024 • Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue
We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets.
1 code implementation • 30 Jan 2024 • Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools.
no code implementations • 25 Mar 2024 • Yixiao Ge, Pieter van Goor, Robert Mahony
Stochastic inference on Lie groups plays a key role in state estimation problems, such as inertial navigation, visual inertial odometry, pose estimation in virtual reality, etc.
1 code implementation • 30 Mar 2024 • Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li
In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?