Search Results for author: Yixiao Ge

Found 72 papers, 59 papers with code

FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification

2 code implementations • NeurIPS 2018 • Yixiao Ge, Zhuowan Li, Haiyu Zhao, Guojun Yin, Shuai Yi, Xiaogang Wang, Hongsheng Li

Our proposed FD-GAN achieves state-of-the-art performance on three person reID datasets, which demonstrates the effectiveness and robust feature distilling capability of the proposed FD-GAN.

Ranked #3 on Person Re-Identification on CUHK03

Generative Adversarial Network Person Re-Identification

1,267

Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification

2 code implementations • ICLR 2020 • Yixiao Ge, Dapeng Chen, Hongsheng Li

To mitigate the effects of noisy pseudo labels, we propose to softly refine the pseudo labels in the target domain with an unsupervised framework, Mutual Mean-Teaching (MMT), which learns better features from the target domain via off-line refined hard pseudo labels and on-line refined soft pseudo labels in an alternating training manner.

Ranked #1 on Unsupervised Person Re-Identification on Market-1501->MSMT17

Clustering Pseudo Label +2

3,126

Structured Domain Adaptation with Online Relation Regularization for Unsupervised Person Re-ID

4 code implementations • 14 Mar 2020 • Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, Xiaogang Wang, Hongsheng Li

To tackle the challenges, we propose an end-to-end structured domain adaptation framework with an online relation-consistency regularization term.

Ranked #4 on Unsupervised Domain Adaptation on Market to MSMT

Pseudo Label Relation +3

Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID

3 code implementations • NeurIPS 2020 • Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, Hongsheng Li

To solve these problems, we propose a novel self-paced contrastive learning framework with hybrid memory.

Ranked #3 on Unsupervised Domain Adaptation on Market to MSMT

Clustering Contrastive Learning +4

389

Self-supervising Fine-grained Region Similarities for Large-scale Image Localization

3 code implementations • ECCV 2020 • Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, Hongsheng Li

The task of large-scale retrieval-based image localization is to estimate the geographical location of a query image by recognizing its nearest reference images from a city-scale dataset.

Image Retrieval Retrieval

264

Improved Mutual Mean-Teaching for Unsupervised Domain Adaptive Re-ID

2 code implementations • 24 Aug 2020 • Yixiao Ge, Shijie Yu, Dapeng Chen

SDA, a domain-translation-based framework, focuses on carefully translating the source-domain images to the target domain.

Domain Adaptation Pseudo Label +1

Online Pseudo Label Generation by Hierarchical Cluster Dynamics for Adaptive Person Re-Identification

no code implementations • ICCV 2021 • Yi Zheng, Shixiang Tang, Guolong Teng, Yixiao Ge, Kaijian Liu, Jing Qin, Donglian Qi, Dapeng Chen

To tackle the problem, we propose an online pseudo label generation by hierarchical cluster dynamics for adaptive ReID.

Model Optimization Person Re-Identification +1

Progressive Correspondence Pruning by Consensus Learning

1 code implementation • ICCV 2021 • Chen Zhao, Yixiao Ge, Feng Zhu, Rui Zhao, Hongsheng Li, Mathieu Salzmann

Correspondence selection aims to correctly select the consistent matches (inliers) from an initial set of putative correspondences.

Denoising Pose Estimation +1

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network

1 code implementation • CVPR 2021 • Rui Liu, Yixiao Ge, Ching Lam Choi, Xiaogang Wang, Hongsheng Li

Conditional generative adversarial networks (cGANs) aim to synthesize diverse images given the input conditions and latent codes, but unfortunately, they usually suffer from the issue of mode collapse.

Contrastive Learning Generative Adversarial Network +1

Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification

no code implementations • 27 Apr 2021 • Yixiao Ge, Xiao Zhang, Ching Lam Choi, Ka Chun Cheung, Peipei Zhao, Feng Zhu, Xiaogang Wang, Rui Zhao, Hongsheng Li

In this way, our BAKE framework achieves online knowledge ensembling across multiple samples with only a single network.

Classification General Classification +1

Refining Pseudo Labels with Clustering Consensus over Generations for Unsupervised Object Re-identification

1 code implementation • CVPR 2021 • Xiao Zhang, Yixiao Ge, Yu Qiao, Hongsheng Li

Unsupervised object re-identification aims to learn discriminative representations for object retrieval without any annotations.

Clustering Pseudo Label +1

Mutual CRF-GNN for Few-Shot Learning

no code implementations • CVPR 2021 • Shixiang Tang, Dapeng Chen, Lei Bai, Kaijian Liu, Yixiao Ge, Wanli Ouyang

In this MCGN, the labels and features of support data are used by the CRF for inferring GNN affinities in a principled and probabilistic way.

Few-Shot Learning

Hot-Refresh Model Upgrades with Regression-Free Compatible Training in Image Retrieval

no code implementations • ICLR 2022 • Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.

Image Retrieval regression +1

Object-aware Video-language Pre-training for Retrieval

1 code implementation • CVPR 2022 • Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object representations.

Ranked #20 on Zero-Shot Video Retrieval on DiDeMo

Object Retrieval +2

Video-Text Pre-training with Learned Regions

1 code implementation • 2 Dec 2021 • Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang

Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.

Representation Learning Retrieval +2

Dynamic Token Normalization Improves Vision Transformers

1 code implementation • ICLR 2022 • Wenqi Shao, Yixiao Ge, Zhaoyang Zhang, Xuyuan Xu, Xiaogang Wang, Ying Shan, Ping Luo

It is difficult for Transformers to capture inductive bias such as the positional context in an image with LN.

Inductive Bias ListOps +2

Bridging Video-text Retrieval with Multiple Choice Questions

2 code implementations • CVPR 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo

As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.

Ranked #8 on Zero-Shot Video Retrieval on MSVD

Action Recognition Multiple-choice +8

2,972

Hot-Refresh Model Upgrades with Regression-Alleviating Compatible Training in Image Retrieval

1 code implementation • 24 Jan 2022 • Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.

Image Retrieval regression +1

Uncertainty Modeling for Out-of-Distribution Generalization

1 code implementation • ICLR 2022 • Xiaotong Li, Yongxing Dai, Yixiao Ge, Jun Liu, Ying Shan, Ling-Yu Duan

In this paper, we improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.

Image Classification Out-of-Distribution Generalization +2

137

Towards Universal Backward-Compatible Representation Learning

2 code implementations • 3 Mar 2022 • Binjie Zhang, Yixiao Ge, Yantao Shen, Shupeng Su, Fanzi Wu, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

The task of backward-compatible representation learning is therefore introduced to support backfill-free model upgrades, where the new query features are interoperable with the old gallery features.

Face Recognition Representation Learning

All in One: Exploring Unified Video-Language Pre-training

1 code implementation • CVPR 2023 • Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we introduce, for the first time, an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.

Ranked #6 on TGIF-Transition on TGIF-QA (using extra training data)

Language Modelling Multiple-choice +10

272

Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

2 code implementations • 15 Mar 2022 • Guanyu Cai, Yixiao Ge, Binjie Zhang, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, XiaoHu Qie, Jianping Wu, Mike Zheng Shou

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.

Question Answering Retrieval +4

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

1 code implementation • 29 Mar 2022 • Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan

Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning.

Ranked #35 on Self-Supervised Image Classification on ImageNet (finetuned)

Instance Segmentation object-detection +5

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

2 code implementations • ICCV 2023 • Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% to 50% of the input embeddings.

Instance Segmentation Object +2

1,951

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

1 code implementation • 26 Apr 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo

Dominant pre-training works for video-text retrieval mainly adopt "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but ignore detailed local semantics.

Ranked #7 on Zero-Shot Video Retrieval on MSVD

Action Recognition Retrieval +6

130

Privacy-Preserving Model Upgrades with Bidirectional Compatible Training in Image Retrieval

1 code implementation • 29 Apr 2022 • Shupeng Su, Binjie Zhang, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Ying Shan

The task of privacy-preserving model upgrades in image retrieval desires to reap the benefits of rapidly evolving new models without accessing the raw gallery images.

Image Retrieval Privacy Preserving +1

Masked Image Modeling with Denoising Contrast

1 code implementation • 19 May 2022 • Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, XiaoHu Qie

Since the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), there is no significant difference in essence, that is, how to design proper pretext tasks for vision dictionary look-up.

Contrastive Learning Denoising +6

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space

1 code implementation • 7 Jul 2022 • Wenqi Shao, Xun Zhao, Yixiao Ge, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo

It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive.

Ranked #2 on Transferability on classification benchmark

Transferability

Equivariant Filter Design for Discrete-time systems

no code implementations • 12 Sep 2022 • Yixiao Ge, Pieter van Goor, Robert Mahony

The kinematics of many nonlinear control systems, especially in the robotics field, admit a transitive Lie-group symmetry, which is useful in high performance observer design.

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

1 code implementation • CVPR 2023 • Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years.

Descriptive Representation Learning +1

Darwinian Model Upgrades: Model Evolving with Selective Compatibility

no code implementations • 13 Oct 2022 • Binjie Zhang, Shupeng Su, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Mike Zheng Shou, Ying Shan

The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed as "backfilling"), which is quite expensive and time-consuming considering billions of instances in industrial applications.

Face Recognition Retrieval

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

no code implementations • 6 Dec 2022 • YuChao Gu, Xintao Wang, Yixiao Ge, Ying Shan, XiaoHu Qie, Mike Zheng Shou

Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers.

Conditional Image Generation

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

3 code implementations • ICCV 2023 • Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, YuChao Gu, Yufei Shi, Wynne Hsu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator.

Style Transfer Text-to-Video Generation +1

4,074

Modeling Uncertain Feature Representation for Domain Generalization

1 code implementation • 16 Jan 2023 • Xiaotong Li, Zixuan Hu, Jun Liu, Yixiao Ge, Yongxing Dai, Ling-Yu Duan

In this paper, we improve the network generalization ability by modeling domain shifts with uncertainty (DSU), i.e., characterizing the feature statistics as uncertain distributions during training.

Domain Generalization Image Classification +3

137

RILS: Masked Visual Reconstruction in Language Semantic Space

1 code implementation • CVPR 2023 • Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, XiaoHu Qie, Xinggang Wang

Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training.

Sentence

Binary Embedding-based Retrieval at Tencent

1 code implementation • 17 Feb 2023 • Yukang Gan, Yixiao Ge, Chang Zhou, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen, Yexin Wang, Ying Shan

To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension.

Binarization Retrieval

BoxSnake: Polygonal Instance Segmentation with Box Supervision

1 code implementation • ICCV 2023 • Rui Yang, Lin Song, Yixiao Ge, Xiu Li

Box-supervised instance segmentation has gained much attention as it requires only simple box annotations instead of costly mask or polygon annotations.

Box-supervised Instance Segmentation Segmentation +1

Accelerating Vision-Language Pretraining with Free Language Modeling

1 code implementation • CVPR 2023 • Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo

FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted.

Language Modelling Masked Language Modeling

TagGPT: Large Language Models are Zero-shot Multimodal Taggers

1 code implementation • 6 Apr 2023 • Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan

Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching with the tag set, and another selective method with early matching in prompts.

Optical Character Recognition (OCR) Prompt Engineering +5

Attack is Good Augmentation: Towards Skeleton-Contrastive Representation Learning

no code implementations • 8 Apr 2023 • Binqian Xu, Xiangbo Shu, Rui Yan, Guo-Sen Xie, Yixiao Ge, Mike Zheng Shou

In particular, we propose a novel Attack-Augmentation Mixing-Contrastive learning (A$^2$MC) to contrast hard positive features and hard negative features for learning more robust skeleton representations.

Action Recognition Contrastive Learning +4

$π$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

1 code implementation • 27 Apr 2023 • Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo

Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.

Multi-Task Learning

What Makes for Good Visual Tokenizers for Large Language Models?

1 code implementation • 20 May 2023 • Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, Ying Shan

In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO), and observe that: i) Fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset.

Image Captioning Object Counting +2

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

1 code implementation • 23 May 2023 • Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.

Representation Learning

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

2 code implementations • NeurIPS 2023 • YuChao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou

Public large-scale text-to-image diffusion models, such as Stable Diffusion, have gained significant attention from the community.

Attribute

363

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

1 code implementation • NeurIPS 2023 • Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan

This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools.

Image Generation Instruction Following +3

724

Sticker820K: Empowering Interactive Retrieval with Stickers

no code implementations • 12 Jun 2023 • Sijie Zhao, Yixiao Ge, Zhongang Qi, Lin Song, Xiaohan Ding, Zehua Xie, Ying Shan

Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset.

Image Retrieval Retrieval

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter

no code implementations • 22 Jun 2023 • Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou

In situations involving system upgrades that require updating the upstream foundation model, it becomes essential to re-train all downstream modules to adapt to the new foundation model, which is inflexible and inefficient.

Question Answering Retrieval +5

PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas

1 code implementation • 26 Jun 2023 • Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, Ying Shan

Art forms such as movies and television (TV) dramas are reflections of the real world, which have attracted much attention from the multimodal learning community recently.

Genre classification Retrieval +1

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

1 code implementation • 29 Jun 2023 • Yunpeng Bai, Xintao Wang, Yan-Pei Cao, Yixiao Ge, Chun Yuan, Ying Shan

This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text.

EEG Image Generation

388

Planting a SEED of Vision in Large Language Model

1 code implementation • 16 Jul 2023 • Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan

Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.).

Language Modelling Large Language Model +1

461

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

2 code implementations • 30 Jul 2023 • Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan

Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.

Benchmarking Multiple-choice

351

ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

1 code implementation • 20 Aug 2023 • Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities.

Ranked #2 on Zero-Shot Transfer 3D Point Cloud Classification on ModelNet40 (using extra training data)

3D Classification Question Answering +4

127

Exploring Model Transferability through the Lens of Potential Energy

1 code implementation • ICCV 2023 • Xiaotong Li, Zixuan Hu, Yixiao Ge, Ying Shan, Ling-Yu Duan

The experimental results on 10 downstream tasks and 12 self-supervised models demonstrate that our approach can seamlessly integrate into existing ranking techniques and enhance their performances, revealing its effectiveness for the model selection task and its potential for understanding the mechanism in transfer learning.

Model Selection Transfer Learning

A Note on the Extended Kalman Filter on a Manifold

no code implementations • 12 Sep 2023 • Yixiao Ge, Pieter van Goor, Robert Mahony

With this structure, we show that a naive coordinate implementation of the EKF fails to account for geometry of the manifold in the update step and in the reset step.

One For All: Video Conversation is Feasible Without Video Instruction Tuning

1 code implementation • 27 Sep 2023 • Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li

Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours.

Ranked #5 on Zero-Shot Video Retrieval on LSMDC

Video-based Generative Performance Benchmarking (Consistency) Video-based Generative Performance Benchmarking (Contextual Understanding) +6

Making LLaMA SEE and Draw with SEED Tokenizer

1 code implementation • 2 Oct 2023 • Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan

We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.

multimodal generation

461

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

1 code implementation • NeurIPS 2023 • Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan

Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed.

Few-Shot Learning Image Classification +3

Vision-Language Instruction Tuning: A Review and Analysis

1 code implementation • 14 Nov 2023 • Chen Li, Yixiao Ge, Dian Li, Ying Shan

Instruction tuning is a crucial supervised training phase in Large Language Models (LLMs), aiming to enhance the LLM's ability to generalize instruction execution and adapt to user preferences.

ViT-Lens: Towards Omni-modal Representations

1 code implementation • 27 Nov 2023 • Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.

EEG Image Generation +2

127

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

2 code implementations • 27 Nov 2023 • Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep.

Ranked #1 on Object Detection on COCO 2017 (mAP metric)

Image Classification Object Detection +3

802

SEED-Bench-2: Benchmarking Multimodal Large Language Models

1 code implementation • 28 Nov 2023 • Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan

Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3).

Benchmarking Image Generation +1

229

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

1 code implementation • 11 Dec 2023 • Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan

Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.

116

EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models

1 code implementation • 11 Dec 2023 • Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu

Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner is expected to predict feasible actions, which is a feat inherently achievable by Multimodal Large Language Models (MLLMs).

Benchmarking Human-Object Interaction Detection

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

1 code implementation • 14 Dec 2023 • Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model.

Image Captioning In-Context Learning +4

Cached Transformers: Improving Transformers with Differentiable Memory Cache

1 code implementation • 20 Dec 2023 • Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo

This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens.

Image Classification Instance Segmentation +6

LLaMA Pro: Progressive LLaMA with Block Expansion

1 code implementation • 4 Jan 2024 • Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan

Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA.

Instruction Following Math

383

Towards A Better Metric for Text-to-Video Generation

no code implementations • 15 Jan 2024 • Jay Zhangjie Wu, Guian Fang, HaoNing Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, YuChao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore on offering a better metric for text-to-video generation.

Text-to-Video Generation Video Alignment +1

Supervised Fine-tuning in turn Improves Visual Foundation Models

1 code implementation • 18 Jan 2024 • Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years.

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

1 code implementation • 25 Jan 2024 • Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets.

YOLO-World: Real-Time Open-Vocabulary Object Detection

1 code implementation • 30 Jan 2024 • Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools.

Instance Segmentation Language Modelling +4

3,296

A Geometric Perspective on Fusing Gaussian Distributions on Lie Groups

no code implementations • 25 Mar 2024 • Yixiao Ge, Pieter van Goor, Robert Mahony

Stochastic inference on Lie groups plays a key role in state estimation problems, such as inertial navigation, visual inertial odometry, pose estimation in virtual reality, etc.

Pose Estimation

ST-LLM: Large Language Models Are Effective Temporal Learners

1 code implementation • 30 Mar 2024 • Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?

Ranked #1 on Video-based Generative Performance Benchmarking (Correctness of Information) on VideoInstruct

Reading Comprehension Video Understanding
